Interpretable data mining for legume crude protein prediction


ÇATAL M. İ.

INDUSTRIAL CROPS AND PRODUCTS, vol. 240, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 240
  • Publication Date: 2026
  • DOI: 10.1016/j.indcrop.2026.122664
  • Journal Name: INDUSTRIAL CROPS AND PRODUCTS
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, Geobase, INSPEC, Directory of Open Access Journals
  • Affiliated with Recep Tayyip Erdoğan University: Yes

Abstract

Despite the critical need for rapid nutritional assessment in diverse legume species, a significant knowledge gap remains regarding the efficacy and interpretability of advanced data mining algorithms for the non-destructive prediction of Crude Protein (CP) content. Conventional methods (e.g., Kjeldahl) are destructive, expensive, and time-intensive, which prevents the rapid, high-throughput screening needed for modern crop breeding programs and industrial quality control. This study investigates the efficacy of four data mining algorithms, namely Multivariate Adaptive Regression Splines (MARS), Support Vector Regression (SVR), k-Nearest Neighbors (KNN), and Artificial Neural Networks (ANN), for non-destructively predicting CP across 14 legume species. Our objective was to identify a robust and interpretable computational approach that can enhance precision agriculture and industrial legume production. MARS demonstrated superior predictive performance, yielding the lowest Root Mean Squared Error (RMSE) of 0.62 and a Relative Root Mean Squared Error (RRMSE) of 3.38. It also achieved the highest coefficient of determination (R²) of 0.93 and an Akaike Information Criterion (AIC) of -5.51, indicating a strong fit and robustness in modeling the complex non-linear relationships and variable interactions inherent in agricultural datasets. SVR showed moderate accuracy with an RMSE of 1.17, exhibiting efficiency with smaller datasets. Conversely, ANN displayed a Mean Absolute Percentage Error (MAPE) of 96.78%, indicating poor generalization, likely due to overfitting or incompatibility with the dataset size. KNN offered interpretable results but with lower accuracy, reporting an RMSE of 1.92. This research establishes MARS as a precise and interpretable tool for non-destructive CP prediction in legumes, offering a significant improvement over traditional methods for precision agriculture and industrial crop development.
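The abstract reports RMSE, RRMSE, MAPE, R² and AIC but does not state how they were computed. The sketch below uses common textbook definitions as an illustration only: RRMSE is taken as RMSE divided by the mean observed value (in percent), and AIC is taken in its least-squares form n·ln(SSE/n) + 2k. Both choices are assumptions and may differ from the study's formulas, and all function names and sample values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def rrmse(y_true, y_pred):
    """RMSE relative to the mean observed value, in percent (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    return 100.0 * rmse(y_true, y_pred) / mean_true


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; observed values must be non-zero."""
    n = len(y_true)
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def aic_least_squares(y_true, y_pred, n_params):
    """Least-squares AIC: n * ln(SSE / n) + 2k (assumed form, not confirmed by the source)."""
    n = len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return n * math.log(sse / n) + 2 * n_params


# Hypothetical CP values (% dry matter), for illustration only.
observed = [20.0, 18.0, 22.0, 16.0]
predicted = [19.0, 19.0, 21.0, 17.0]

print("RMSE :", rmse(observed, predicted))
print("RRMSE:", rrmse(observed, predicted))
print("MAPE :", mape(observed, predicted))
print("R2   :", r_squared(observed, predicted))
print("AIC  :", aic_least_squares(observed, predicted, n_params=2))
```

Under these definitions, lower RMSE, RRMSE, MAPE and AIC and higher R² indicate a better fit, which is the sense in which the abstract ranks the four algorithms.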
The core scientific contribution is the development and validation of a robust, multi-species, non-linear MARS prediction framework that translates complex nutrient interactions (e.g., the Ca/P and K/(Ca+Mg) ratios) into actionable, biologically relevant hinge functions. Integrating this framework into crop improvement programs could accelerate the development of high-protein cultivars through rapid, non-destructive screening. These findings contribute to the application of machine learning in agriculture, providing actionable insights for improving crop trait selection and industrial legume utilization.
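To make the idea of hinge functions concrete, the following minimal sketch shows how a MARS-style model combines terms of the form max(0, x - c) or max(0, c - x) over nutrient ratios such as Ca/P and K/(Ca+Mg). The intercept, coefficients, knots and function names are hypothetical placeholders chosen for illustration; they are not the fitted model from the study.

```python
def hinge(x, knot, direction=1):
    """MARS hinge basis: max(0, x - knot) if direction is +1, max(0, knot - x) if -1."""
    return max(0.0, direction * (x - knot))


def nutrient_ratios(ca, p, k, mg):
    """Derive the two ratios named in the abstract from mineral concentrations.

    All inputs are assumed to share the same unit (e.g. % dry matter).
    """
    return {"ca_p": ca / p, "k_ca_mg": k / (ca + mg)}


# Hypothetical model: intercept plus (coefficient, feature, knot, direction) terms.
# None of these numbers come from the study.
INTERCEPT = 18.0
TERMS = [
    (1.5, "ca_p", 3.0, 1),      # contributes only when Ca/P exceeds 3.0
    (-2.0, "k_ca_mg", 0.5, 1),  # penalises K/(Ca+Mg) above 0.5
]


def predict_cp(ca, p, k, mg):
    """Predict crude protein as the intercept plus a weighted sum of hinge terms."""
    ratios = nutrient_ratios(ca, p, k, mg)
    return INTERCEPT + sum(
        coef * hinge(ratios[name], knot, direction)
        for coef, name, knot, direction in TERMS
    )


# Hypothetical sample: Ca/P = 4.0 and K/(Ca+Mg) = 1.0.
print(predict_cp(ca=2.0, p=0.5, k=3.0, mg=1.0))
```

Each hinge term is zero on one side of its knot and linear on the other, so every term can be read as a threshold rule on a single ratio. That piecewise-linear structure is what makes a MARS model easier to interpret than a neural network.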