Authors |
Yuanchao Li, Hongwei Zeng, Miao Zhang, Bingfang Wu, Yan Zhao, Xia Yao, Tao Cheng, Xingli Qin, Fangming Wu |
Intro |
Yield prediction is essential for food security, food trade, and field management. However, because yield
formation involves complex mechanisms, accurate and timely yield prediction remains challenging in remote
sensing-based crop monitoring. In this study, a framework for soybean yield prediction integrating
extreme gradient boosting (XGBoost) and multidimensional feature engineering was developed at the county
level in the United States using publicly available datasets. The approach achieved high accuracy for over 959
counties in 12 states throughout the midwestern U.S., with a test coefficient of determination (R²) of 0.82 and a
root-mean-square error (RMSE) of 0.246 t/ha. Following a “train–validate–test” assessment
strategy, our study shows that XGBoost outperforms other county-level soybean yield prediction models with
identical inputs, including linear regression (LR), random forest (RF), k-nearest neighbor (KNN), artificial neural
network (ANN), support vector regression (SVR), long short-term memory (LSTM), and deep neural network
(DNN). The results show that accurate soybean yield predictions can be obtained as early as the pod-setting
stage. We implemented the feature importance and Shapley additive explanations (SHAP) algorithms
to quantify the impact of input features on the XGBoost model in the training and prediction stages, respectively.
The enhanced vegetation index (EVI) at the pod-setting period is the most crucial factor, but the yield prediction
is not dependent on only a few key features. Detrending yields with longer-term historical yield data increased R²
from 0.58 to 0.82 and decreased RMSE from 0.374 t/ha to 0.246 t/ha. We employed multidimensional
feature engineering to generate phenology-based features, which improved R² from 0.79 to 0.82 and
decreased RMSE from 0.268 t/ha to 0.246 t/ha. The framework can be easily implemented and
extended in the future in combination with early crop identification. |
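
The yield detrending and the two accuracy metrics reported above (R² and RMSE) can be illustrated with a minimal sketch. It assumes a simple linear trend over years for a single county, which is only one possible choice; the abstract does not state which detrending method was used. The function names and the sample numbers are hypothetical and are not taken from the study.

```python
import math


def fit_linear_trend(years, yields):
    """Ordinary least-squares fit of yield = intercept + slope * year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(yields) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, yields))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def detrend(years, yields):
    """Return yield anomalies: observed yield minus the fitted linear trend.

    Assumption: a linear trend stands in for the long-term yield increase.
    """
    slope, intercept = fit_linear_trend(years, yields)
    return [y - (intercept + slope * x) for x, y in zip(years, yields)]


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root-mean-square error, in the same unit as the inputs (e.g. t/ha)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up illustrative values for one county, not data from the study.
years = [2010, 2011, 2012, 2013, 2014]
yields_t_ha = [2.9, 3.0, 2.7, 3.2, 3.3]
anomalies = detrend(years, yields_t_ha)
```

In such a setup, a model would be trained on the anomalies, and the fitted trend would be added back to its predictions before computing R² and RMSE against the observed yields.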