Using the leave-two-out method to determine the optimal statistical crop model

The use of statistical models to study the impact of weather on crop yield continues to grow. Unfortunately, this type of application is characterised by datasets with a very limited number of samples (typically one sample per year). In general, statistical inference uses three datasets: the training dataset to optimise the model parameters, the validation dataset to select the best model, and the testing dataset to evaluate the model generalisation ability. Splitting the overall database into three datasets is impossible in crop yield modelling. The leave-one-out cross-validation method, or simply leave-one-out (LOO), has been introduced to facilitate statistical modelling when the database is limited. However, the model choice is then made using the testing dataset, which can be misleading by favouring unnecessarily complex models. The nested cross-validation approach was introduced in machine learning to avoid this problem by truly utilising three datasets, especially for problems with limited databases. In this study, we propose one particular implementation of nested cross-validation, called the leave-two-out method (LTO), to choose the best model with an optimal model complexity (using the validation dataset) and to estimate the true model quality (using the testing dataset). Two applications are considered: Robusta coffee in Cu M'gar (Dak Lak, Vietnam) and grain maize over 96 French departments. In both cases, LOO is misleading by choosing overly complex models; LTO indicates that simpler models actually perform better when a reliable generalisation test is considered. The simple models obtained using the LTO approach have reasonable yield anomaly forecasting skill for both crops studied. The LTO approach can also be used in seasonal forecasting applications. We suggest that the LTO method should become a standard procedure for statistical crop modelling.

The long-term trend represents the slow evolution of the crop yield; it often describes changes in management such as fertilisation or irrigation. Thus, suppressing this trend from the yield time series removes the influence of non-weather-related factors. For Robusta coffee, a simple linear function is used to define the yield trend: ŷ(t) = y_0 + α · t, where y_0 is the yield in 2000 and α is the constant annual rate of improvement. Once the yield trend is defined, the coffee yield anomalies are calculated by removing this trend from the raw yield data. The Robusta coffee yield for year t is noted y(t), the long-term trend value ŷ(t), and the coffee yield anomaly a(t) (in %) is calculated as: a(t) = 100 · (y(t) − ŷ(t)) / ŷ(t). If a(t) > 0, then the yield in year t is higher than in a regular year, and vice versa. For instance, an anomaly of a(t) = −16 means that the yield for year t is 16 % lower than the trend value.
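As an illustration, the detrending step above can be sketched as follows (a minimal Python sketch, not the authors' Matlab implementation; the helper name `yield_anomalies` and the synthetic data are assumptions for illustration):

```python
# Minimal sketch of the detrending described above: fit a linear trend
# to the raw yields, then express anomalies in % relative to that trend.
import numpy as np

def yield_anomalies(years, yields):
    """Relative yield anomalies (%) with respect to a fitted linear trend."""
    years = np.asarray(years, dtype=float)
    yields = np.asarray(yields, dtype=float)
    alpha, c = np.polyfit(years, yields, deg=1)   # slope and intercept
    trend = alpha * years + c                     # long-term trend value
    return 100.0 * (yields - trend) / trend       # anomaly a(t)

# A perfectly linear yield series has zero anomalies by construction.
years = np.arange(2000, 2019)
yields = 2.0 + 0.05 * (years - 2000)
print(np.allclose(yield_anomalies(years, yields), 0.0))  # True
```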

Grain maize yield database
The French crop data (area, production, and yield) at the regional level (i.e., the department, an administrative unit in France) were collected from the Agreste website (https://agreste.agriculture.gouv.fr; "Statistique agricole annuelle") for a period of 22 years (from 1989 to 2010). The data are available for several crops such as soft wheat, durum wheat, maize, and oats.
This study considers an application for grain maize over 96 French departments (Fig. 1); some specific tests are presented in Sect. 5. The monthly weather data are projected from their original 0.1° × 0.1° regular grid onto the crop administrative units to match the yield data.

In this study, 2 × n monthly weather anomaly variables (representing P and T for n months) are considered. The number of months n varies for each crop:
- For Robusta coffee: the growing period runs from May of year (t − 1) to November of year t, covering the season's peak (Dinh et al., 2021). Thus, n = 19 and 2 × 19 monthly weather variables (P and T) are used as potential explanatory variables for the Robusta coffee yield anomalies.

- For grain maize: the six months of the growing period (from sowing to harvest) are studied. Thus, n = 6 results in 2 × 6 weather variables: P and T from April to September.
Weather anomalies could be computed in the same way as the crop yield anomalies. However, the climate trend over 10 to 20 years is small compared to the inter-annual variations. Thus, the long-term trend can be neglected, and the relative anomalies are estimated with respect to the long-term average. This average value is computed for each of the n months before the harvest time.

In addition, a 3-month moving average centred on each month is applied (instead of the raw monthly data) to reduce the variability at the monthly scale, since this variability would introduce instabilities in our analysis given the short time span of the database. (This is actually a regularisation technique.)
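This smoothing step can be sketched as follows (illustrative Python; the edge handling, which keeps the original values at the series ends, is an assumption since the paper does not specify it):

```python
import numpy as np

def centred_moving_average(x, window=3):
    """Centred moving average; the edge values are kept unchanged."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    h = window // 2
    for i in range(h, len(x) - h):           # interior points only
        out[i] = x[i - h:i + h + 1].mean()
    return out

# Each interior value becomes the mean of itself and its two neighbours.
print(centred_moving_average([1, 5, 1, 5, 1]))
```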

Statistical yield models
The statistical models intend to measure the impact of weather on the crop yield anomalies, which can be noted as: a(t) = f_w(X), (1) where a(t) is the crop yield anomaly for year t, f_w the parametric statistical model, w the model parameters, and X the set of weather inputs. The function f_w can be based on multiple statistical methods depending on the complexity of the application, for instance, linear regression (Prasad et al., 2006; Kern et al., 2018; Lecerf et al., 2019), partial least-squares regression (Ceglar et al., 2016), random forest (Beillouin et al., 2020), neural networks (Mathieu and Aires, 2016, 2018), or mixed-effects models (Mathieu and Aires, 2016). In this study, two statistical models are considered:
- Linear regression (LIN) is the simplest and most frequently used model. The relationship between the crop yield anomalies a(t) and the weather variables X_ti (i = 1, 2, ..., n_input, where n_input is the number of input variables) is formulated as: a(t) = α_0 + Σ_{i=1}^{n_input} α_i · X_ti, (2) where α_i are the regression coefficients. A detailed description of the LIN model can be found, for instance, in Dinh et al. (2021).
- Neural Network (NN) is a non-linear statistical model. The simplest type of NN is the feedforward model (Bishop, 1995; Schmidhuber, 2015), in which information flows in only one direction, from the input nodes, through the hidden nodes, to the output nodes. Only one hidden layer with n_neuron neurons is considered in the architecture here. The output crop yield anomaly a is modelled by the following equation: a = Σ_{j=1}^{n_neuron} w_j^(2) · σ(Σ_i w_ij^(1) · x_i + b_j) + b_0, (3) where σ is the activation function of the hidden neurons, w are the weights, x_i are the weather variables, and b are the NN biases. A detailed description of the NN model (applied to impact models) can be found, for instance, in Mathieu and Aires (2016).
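The two model classes can be sketched as follows (a hedged Python illustration on synthetic data, not the authors' code; the tanh activation of the NN is an assumption, as the paper does not state the activation function):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(19, 3))          # 19 yearly samples, 3 weather inputs
a = 2.0 * X[:, 0] - 1.0 * X[:, 2]     # synthetic yield anomalies

# LIN: a(t) = alpha_0 + sum_i alpha_i * X_ti, fitted by least squares.
design = np.column_stack([np.ones(len(X)), X])
alpha, *_ = np.linalg.lstsq(design, a, rcond=None)
a_lin = design @ alpha

# NN: one-hidden-layer feedforward model; forward pass only (training would
# likewise minimise a least-squares criterion).
def nn_forward(x, W1, b1, w2, b2):
    return np.tanh(x @ W1 + b1) @ w2 + b2

print(np.allclose(a_lin, a))          # the LIN fit recovers the linear relation
```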
The least-squares criterion, which measures the discrepancies between the target and estimated crop yield anomalies, is used to optimise both the LIN and NN models during the calibration process. For instance, it is used to obtain the coefficients α_i in Eq. (2) and the NN parameters w in Eq. (3).
Two diagnostics are considered to measure the quality of the yield anomaly estimations. One is the correlation COR (unitless) between the estimated a_est and observed a_obs yield anomalies. The other is the Root Mean Square Error, defined as: RMSE = sqrt( (1/n_samp) Σ_{i=1}^{n_samp} (a_est(i) − a_obs(i))² ). It includes both the systematic and random errors of the model. The unit of RMSE is the same as that of a(t); RMSE = 40 represents a 40 % error.
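The two diagnostics can be written directly (a small Python sketch of the definitions above; the function names are assumptions):

```python
import numpy as np

def rmse(a_est, a_obs):
    """Root Mean Square Error between estimated and observed anomalies (%)."""
    a_est, a_obs = np.asarray(a_est), np.asarray(a_obs)
    return float(np.sqrt(np.mean((a_est - a_obs) ** 2)))

def cor(a_est, a_obs):
    """Pearson correlation (unitless) between estimations and observations."""
    return float(np.corrcoef(a_est, a_obs)[0, 1])

print(rmse([0.0, 3.0, -3.0], [0.0, 0.0, 0.0]))  # sqrt of the mean squared error
```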

Model complexity and overfitting
Various factors control the complexity of a statistical model: the model architecture (the number of potential predictors from which the inputs are chosen, the number of inputs, or simply the model type) or the training process (e.g., the number of epochs for a NN). In principle, it is possible to increase the model quality by increasing its complexity because a more complex model can better fit a database. However, this is not always the case: the model complexity can be too high compared to the limited information contained in the training database. This leads to the overfitting (or overtraining) problem, i.e., the model fits the training dataset artificially well, but it cannot predict data not present in the training dataset, meaning that the model is not reliable and should not be used.
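A toy example (not from the paper) illustrates the mechanism: on a small noisy sample, increasing the complexity of a polynomial model keeps improving the training fit, while the held-out error does not follow:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(scale=0.2, size=20)   # linear truth + noise
xtr, ytr = x[::2], y[::2]                      # 10 "training" years
xte, yte = x[1::2], y[1::2]                    # 10 held-out years

def fit_rmse(degree):
    """Training and held-out RMSE of a polynomial fit of a given degree."""
    coeffs = np.polyfit(xtr, ytr, degree)
    tr = np.sqrt(np.mean((np.polyval(coeffs, xtr) - ytr) ** 2))
    te = np.sqrt(np.mean((np.polyval(coeffs, xte) - yte) ** 2))
    return tr, te

tr1, te1 = fit_rmse(1)   # simple model
tr9, te9 = fit_rmse(9)   # interpolates the 10 training points exactly
print(tr9 < tr1, te9 > tr9)
```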
In the following, by studying the sensitivity of the model quality to different complexity levels, we want to determine the optimal statistical crop model that truly estimates the yield anomalies as well as possible given the limited database, while avoiding overfitting.

Training, validation and testing datasets
One of the main challenges in statistical inference is that the model is set up using a database of samples, but it must perform well on new, previously unseen samples. This is a complex task, and methods have been designed to perform a good training, choose the suitable model, and measure the model generalisation ability realistically. For that purpose, the overall database B is divided into three datasets: B = B_Train + B_Val + B_Test. Each of these three datasets undertakes a specific task (Ripley, 1996):
- The training dataset B_Train is used to calibrate the model parameters.
- The validation dataset B_Val is a sample of data held back from the training dataset, which is used to find the best model. For instance, it helps tune the model hyper-parameters and choose the most adequate inputs (i.e., feature selection).
- The testing dataset B_Test is held back from the training and the validation datasets to estimate the true model ability to generalise.
The process of partitioning B will be called in the following the "folding" process. For instance, the folds can be chosen using 50 %, 25 %, and 25 % of the whole database B for B_Train, B_Val, and B_Test, respectively.
The need for the validation dataset is not always understood. The training dataset is used to fit the parameters; the testing dataset is often used to estimate the model quality but also to choose the best model (as in the LOO approach). However, using only this testing dataset brings a risk of choosing the model that is best suited to this particular testing dataset. This issue is a special kind of overfitting, not on the model calibration but on the model choice. If the database is big, the testing dataset contains enough samples to be representative; thus, choosing the best model based on it is acceptable. If the database is small, as in crop modelling tasks, the model selection can be too specific to the particular samples of the testing dataset; thus, an overfitting problem can appear. It will be seen in the following that the absence of a validation dataset can be misleading for yield modelling studies. We avoid this difficulty by having one dataset to calibrate the model (training) and another one to choose the best model (validation). The truly independent testing dataset is then used to measure the model ability to generalise well to unseen data.
3 Measuring the quality of statistical yield models

In practice, statistical yield models often face the problem of "data scarcity" (i.e., having a limited number of crop yield data).
One sample corresponds to one year of data; a 19-year database thus provides only 19 samples. This scarcity of data introduces several problems. In particular, it is necessary to divide the database into training, validation, and testing datasets (Sect. 2.4). With a limited number of samples, the training process may need every possible data point to determine the model parameters adequately (Kuhn and Johnson, 2013), and it might be impossible to choose the best model or assess its generalisation ability with the remaining samples. It is challenging to keep a significant percentage of the database for the validation and testing datasets.
To choose a model with an adequate complexity level and avoid overfitting, a robust way to measure the generalisation ability is necessary, using as few samples as possible. Cross-validation (Allen, 1974; Stone, 1974) was developed as an effective method for both model selection and model assessment when only a small number of samples is available.

Leave-One-Out
The LOO method is a common type of cross-validation, in which the model is trained, chosen, and evaluated using only two datasets. The main idea of LOO is that, given n_samp available samples in B, the model is calibrated n_samp times using (n_samp − 1) samples in the training dataset B_Train (leaving one sample out). The resulting model is then tested on the left-out sample (B_Test). This yields n_samp testing score estimations, one for each sample. In this case, B = B_Train + B_Test and B_Val is empty. The average of these n_samp testing scores is expected to be a robust assessment of the model ability to generalise to new samples. However, since no validation dataset is used to select the best model, the choice of the best model is made using the testing dataset; thus, it may be biased towards this testing dataset (Cawley and Talbot, 2010). The chosen model is not independent of the testing dataset, and therefore the obtained testing score is not reliable.
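A minimal sketch of the LOO loop (illustrative Python with an assumed linear model; the helper names and synthetic data are not from the paper):

```python
import numpy as np

def fit_lin(X, a):
    """Least-squares fit of a linear model with intercept."""
    D = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(D, a, rcond=None)[0]

def predict_lin(X, alpha):
    return np.column_stack([np.ones(len(X)), X]) @ alpha

def loo_rmse(X, a):
    """LOO testing RMSE: each sample is left out once and predicted."""
    errors = []
    for i in range(len(a)):
        train = np.arange(len(a)) != i            # B_Train: n_samp - 1 samples
        alpha = fit_lin(X[train], a[train])
        errors.append((predict_lin(X[i:i + 1], alpha)[0] - a[i]) ** 2)
    return float(np.sqrt(np.mean(errors)))        # averaged testing score

rng = np.random.default_rng(1)
X = rng.normal(size=(19, 2))
a = X[:, 0] - X[:, 1]                             # noiseless linear case
print(loo_rmse(X, a) < 1e-8)                      # True
```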

Leave-Two-Out
LOO is very useful in many cases (Kogan et al., 2013; Dinh et al., 2021), but as described in Sect. 2.4, the overall database needs to be divided into three datasets: B = B_Train + B_Val + B_Test. A LTO approach, adapted from nested cross-validation (Stone, 1974), is therefore proposed in the following.

Folding scheme
A folding process is used to generate the training, validation, and testing scores. Each fold divides the database B into a training dataset B_Train of (n_samp − 2) samples, and a validation dataset B_Val and a testing dataset B_Test of one sample each. Two samples are left out of the training dataset, instead of one as in the LOO procedure. This folding process is presented in Fig. 2, with the number of folds n_fold = n_samp × (n_samp − 1). This is why this approach is also called k*l-fold cross-validation with l = k − 1. For instance, for the LIN3 model (i.e., the LIN model with three inputs) with 12 potential predictors, we obtain n_mod = C(12,3) = 220 candidate models. These models are used to perform the yield anomaly estimations. On the vertical axis, for each of the n_samp choices of the testing dataset id_test ∈ {1, 2, ..., n_samp}, there are (n_samp − 1) possible validation datasets, and thus training datasets. These (n_samp − 1) training datasets each correspond to a training of the models on the horizontal axis (i.e., to fit the model parameters). So (n_samp − 1) validation and (n_samp − 1) testing estimations are obtained for each of the n_mod models. The averaged validation score is used to choose the best model bm_i for i = 1, 2, ..., n_samp; this is the role of the validation dataset.
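The folding scheme above can be sketched as follows (a hedged Python illustration with linear candidate models on synthetic data; the function name and data layout are assumptions, not the authors' Matlab implementation):

```python
import numpy as np
from itertools import combinations

def lto_test_rmse(X, a, candidate_inputs, n_inputs=1):
    """LTO sketch: for each testing sample, select the best model on the
    (n_samp - 1) validation folds, then record that model's testing error."""
    n = len(a)
    models = [list(c) for c in combinations(candidate_inputs, n_inputs)]  # n_mod models
    test_scores = []
    for i_test in range(n):                                   # choice of B_Test
        val_err = np.zeros(len(models))
        test_err = np.zeros(len(models))
        for i_val in (j for j in range(n) if j != i_test):    # choice of B_Val
            train = [j for j in range(n) if j not in (i_test, i_val)]
            for m, cols in enumerate(models):
                D = np.column_stack([np.ones(len(train)), X[np.ix_(train, cols)]])
                alpha = np.linalg.lstsq(D, a[train], rcond=None)[0]
                val_err[m] += (np.r_[1.0, X[i_val, cols]] @ alpha - a[i_val]) ** 2
                test_err[m] += (np.r_[1.0, X[i_test, cols]] @ alpha - a[i_test]) ** 2
        bm = int(np.argmin(val_err))          # best model bm_i (validation role)
        test_scores.append(test_err[bm] / (n - 1))
    return float(np.sqrt(np.mean(test_scores)))               # averaged testing score

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 3))
a = X[:, 0].copy()                            # only input 0 matters
print(lto_test_rmse(X, a, candidate_inputs=[0, 1, 2]) < 1e-6)
```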

Validation and testing scores
Each choice of the testing dataset (each id_test) corresponds to a selected best model bm_i and two distributions (i.e., probability density functions, PDFs) of the (n_samp − 1) validation errors and (n_samp − 1) testing errors, shown in Fig. 3(a2). These two distributions result in a validation score (blue dot) and a testing score (red dot).
Finally, the n_samp testing choices give n_samp validation and n_samp testing scores that form a validation PDF (blue line) and a testing PDF (red line), and thus the two scores V_λ and T_λ in Fig. 3(b).
A pseudo-code is provided in Appendix A to facilitate the implementation of the LTO procedure in any language.

The process represented in Fig. 3 is used to obtain the validation (V_λ) and testing (T_λ) scores from the LTO approach for a given model complexity λ. A different complexity level (a different λ) results in different V_λ and T_λ values. The evolution curves of V_λ and T_λ obtained for the validation and testing RMSE values of the yield anomalies, for increasing model complexity, are presented in Fig. 4. For simplicity, only the validation and testing scores will be discussed, since the training error consistently decreases when the model complexity increases. When increasing the complexity level (λ′ > λ), the validation error becomes smaller (V_λ′ < V_λ) but the testing error becomes bigger (T_λ′ > T_λ). This is known as overfitting: the complexity level is too high; the model can fit the validation dataset very well but does not generalise well. In the following applications (Sect. 4 and 5), we will study these evolution curves for different models with various complexity levels in order to identify the appropriate yield models for Robusta coffee and grain maize.

The first application concerns the statistical yield modelling of Robusta coffee in Cu M'gar (Dak Lak, Vietnam). The goal is to define a model that can predict the yield anomalies and then estimate its true applicability, measured by a reliable generalisation score. A model is defined by several factors, including the number of potential predictors from which the inputs are chosen, the actual number of inputs, and the model type. We first assess several models with varying complexities to find the appropriate model using the LOO (Sect. 3.1) and LTO (Sect. 3.2) approaches.

Yield model selection
Two methods of estimating the model quality (LOO and LTO) are considered to choose the appropriate model complexity. Figure 5 shows the RMSE values of the predicted Robusta coffee yield anomalies for the LIN3 and LIN5 models, which are the linear regression models with three and five inputs, respectively. These values are computed using the LOO and LTO procedures for the training, validation, and testing datasets. The horizontal axis shows 13 models with a number of potential predictors ranging from 3 to 38.
The LOO procedure suggests that the more complex the model is, the better the results. Both the training and testing RMSE values decrease gradually (Fig. 5(a1)) as the number of potential predictors of the LIN3 models increases (although the training error shows fluctuations). This is even more obvious for the LIN5 models: the testing RMSE value decreases by 5 % when the model complexity increases from five to 38 potential predictors (Fig. 5(a2)). Thus, models with more inputs and more potential predictors would appear adequate when using the LOO procedure.
The LTO procedure is considered in Fig. 5(b1) and (b2): the training and validation RMSE values decrease with the model complexity, in a similar way to the training and testing errors of the LOO procedure. This similarity arises because the LTO validation dataset has the same role as the LOO testing dataset: to find the best model! However, the testing errors do increase with the number of potential predictors (Fig. 5(b1)). This behaviour is even stronger for the LIN5 model in Fig. 5(b2). Because the potential for overtraining is higher with a more complex model, the difference between the testing errors and the validation/training errors is more significant in this case (Fig. 5(b2)) than for the LIN3 model (Fig. 5(b1)). The LTO procedure clearly indicates that a simpler model (i.e., a lower number of potential predictors) is more suitable. This conclusion makes sense because it is inappropriate to use a very complex model (as in the LOO model choice) when having only 19 samples.
The LOO procedure is actually misleading because it suffers from overfitting: it chooses the best model and assesses the generalisation ability on the same testing dataset. This overfitting issue is suppressed in the LTO procedure since the model is chosen on the validation dataset and its generalisation score is assessed on an independent testing dataset.
Another example using the NN models (NN3, with three inputs and seven neurons in the hidden layer) in Fig. 6(a) shows the same behaviour: the more complex the model is, the higher the testing error becomes due to overtraining (the analysis is stopped at six potential predictors due to the computational cost; more NN examples are discussed in Sect. 5). For the same number of potential predictors, the testing errors of the NN3 models (Fig. 6(a)) are much higher than those of the LIN3 models (Fig. 5(b1)). The significant difference between the training errors and the validation/testing errors of the NN3 models (compared to the LIN3 models) is related to the overfitting problem. Using a NN model that is too complex for a limited database is highly dangerous. We also tested the LTO procedure with an increased complexity level by keeping the same number of potential predictors (n_pre = 30) but increasing the number of inputs from two to seven (horizontal axis in Fig. 6(b)). In short, considering the limited information in the available database, which is used to train the model, select it, and evaluate its quality, it is not possible to use more than a very simple and limited model. Therefore, for this 19-sample coffee yield modelling case, using a simple LIN model is better than a complex one (a NN model, for instance), and it would be illusory to think that complex plant relations can be modelled with such a limited number of samples.

Yield anomaly estimation
As shown in the previous section, the LTO procedure allows us to choose a reasonable model, simple enough, with fewer potential predictors and inputs. Thus, the crop yield estimations of the LTO method are assessed here to see how good the selected model (the LIN3 model with three potential predictors) is. Figure 7 presents the observed and estimated coffee yield anomaly time series: the model reproduces the main features of the series (e.g., 2005-2009, 2010, 2011) as well as the decreasing trend from 2011 to 2015. Also, the correlation score means that the model can explain more than 30 % (0.57²) of the variation in coffee yield anomalies, which is in agreement, for instance, with Dinh et al. (2021). This value is reasonable as the weather is only one among several factors (e.g., agricultural practices, diseases, topography) that affect the coffee yield. It is possible to apply such a statistical crop yield model to future climate simulations and then study the impact of climate change on coffee (Bunn et al., 2015; Craparo et al., 2015a; Läderach et al., 2017). This would be the subject of a forthcoming study.

Grain maize over France
This application considers several aspects of grain maize over France. First, the sensitivity of the forecasting quality to the model complexity is studied using the LOO and LTO approaches over the Bas-Rhin department, one of the major grain maize-producing departments in France. Then, the forecasting scores are investigated over the major grain maize-producing departments.

Yield model selection - Focus on Bas-Rhin
This section aims to define an appropriate statistical model for grain maize using 22 years of yield data. This test is done over Bas-Rhin (i.e., one of the major grain maize-producing departments in France). As shown in Sect. 4.1, the LOO approach is misleading by choosing too complex models; we focus here on the LTO results for different models with various complexity levels. Figure 8 describes the RMSE values of the predicted grain maize yield anomalies for the three datasets (training, validation, and testing) of the LTO procedure, for LIN3 models with various complexity levels and several architectures of NN3 models.
https://doi.org/10.5194/gmd-2021-218, Preprint. Discussion started: 6 August 2021. © Author(s) 2021. CC BY 4.0 License.
The results of the LIN3 models are presented in Fig. 8(a), and those of the NN3 models (with seven neurons in the hidden layer) in Fig. 8(b), with the number of potential predictors ranging from three to 12 on the horizontal axis. In both cases, the LTO procedure shows a similar behaviour to the Robusta coffee application of the previous section: the validation/training errors decrease with the number of potential predictors, while the testing errors show the opposite trend. These overtraining results suggest that a simple model (e.g., LIN3 with three potential predictors) is more adequate: its testing RMSE value is small and close to the RMSE values over the two other datasets.

More complex models are tested in Fig. 8(c), in which NN3 models with four potential predictors are considered. The model complexity here corresponds to the number of neurons in the hidden layer (from two to 20 neurons) on the horizontal axis. The impact of overfitting (Sect. 2.3) is noticeable when the model is too complex. For instance, in Fig. 8(c), the training errors get smaller with more neurons in the hidden layer, as expected. However, the testing and validation errors show large fluctuations when the number of neurons increases. The overfitting problem appears from the first step with two neurons in the hidden layer, as shown by the high testing error in Fig. 8(c). The same results (not shown) are obtained for NN3 models with n potential predictors, where n = 3, 7, 12. Thus, the NN models are unreliable in this case due to the limited number of samples available to train a non-linear model.

Model reliability assessment
In this section, a statistical yield model is first applied over the 96 French departments to assess the true model quality. Then, we focus on ten major departments to assess how the selected models perform for yield anomaly predictions. Figure 9 shows the true testing RMSE maps of the predicted grain maize yield anomalies in France. Here, the testing errors derived from the LTO procedure are used on the models chosen by the LOO and the LTO approaches. In other words, both methods (LOO and LTO) can be used to identify optimal crop models, but only the LTO method is used (as a reliable tool) to estimate the model generalisation ability. For example, when considering only LIN3 models, LOO chooses models with 12 potential predictors, while LTO chooses those with three potential predictors. From these choices, the true model generalisation scores (i.e., testing errors) are estimated using the LTO approach, shown in the RMSE maps of Fig. 9(a1) and (b1). Another example focuses on LIN5 models (presented in Fig. 9(a2) and (b2)). The true errors obtained from the LOO choice are clearly higher than those from the LTO choice for the LIN3 models. For instance, the testing RMSE values range from 10 % to 18 % in many departments in Fig. 9(a1), while these values are often lower than 10 % in Fig. 9(b1). This difference shows that the LOO approach under-estimates these true errors, as seen in Fig. 9(a1). Thus, the model choice of the LOO approach is misleading. For more complex models like the LIN5 models, which are preferred by the LOO choice, in the second row of Fig. 9, higher errors are observed, especially for the LOO model errors in many northern departments, with up to 22 % RMSE (Fig. 9(a2)). This grain maize application confirms the benefit of LTO to select statistical yield models and assess their true quality.

Figure 10. Boxplots of residuals (the difference between the observed and estimated yield anomalies) for ten major grain maize-producing departments: red horizontal bars are medians, boxes show the 25th-75th percentiles, error bars depict the minimum and maximum values, and red + signs are suspected outliers.
Although there are some extreme values (Lot-et-Garonne) and some outliers, the interquartile range, from about −8 % to 8 %, shows only slight differences between the observations and estimations over the study departments.

Seasonal yield forecasting
The LTO approach is helpful for selecting an adequate model with better forecasting skill. In the following, the model chosen by the LTO procedure is tested for seasonal forecasting, from the sowing time (April) to the forecasting month (from June to September). In this scenario, for example, for the June model, all weather variables (P and T) from April to June can be selected for the forecasting. Table 1 presents the correlations between the observed and estimated yield anomalies of the forecasts from June to September. The quality of the seasonal forecasting models gradually increases when approaching the harvest because more information is provided. With the weather information from the beginning of the season (April, May, and June), the June forecasting model obtains an average correlation of 0.35 between the observations and estimations. This score is significantly improved when adding the information of July (correlation of 0.51). This improvement means that the weather in July strongly influences grain maize yields. The improvement from July to August (0.01 on average) is much smaller than from June to July (0.16). No information is added in the September forecasting model since it coincides with the harvest time.
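The growing set of predictors available to each forecasting model can be sketched as follows (illustrative Python; the variable naming `P_Apr`, `T_Apr`, etc. is an assumption, not the paper's):

```python
MONTHS = ["Apr", "May", "Jun", "Jul", "Aug", "Sep"]

def forecast_inputs(forecast_month):
    """Weather variables available to the model issued in a given month:
    all P and T values from sowing (April) up to that month."""
    available = MONTHS[: MONTHS.index(forecast_month) + 1]
    return [f"{var}_{m}" for m in available for var in ("P", "T")]

print(forecast_inputs("Jun"))   # the June model sees April-June P and T
```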
In addition, Fig. 11 shows the observed and estimated yield anomaly time series in Landes, where the June forecast already obtains a good correlation (0.63). This score slightly increases when approaching the harvest. It also indicates that the weather can explain more than 40 % (0.67² = 44.89 %) of the variations in grain maize yield anomalies in this region, which is in line with other crop studies (Ray et al., 2015; Ceglar et al., 2017). However, the forecasting models cannot predict all the extremes (e.g., the negative yield anomaly in 1990), which are probably influenced by climate extremes (Hawkins et al., 2013; Ceglar et al., 2016). The statistical models could be improved by adding indices that focus on extreme weather events.

Table 1. The correlation between the observed and estimated yield anomalies for different forecasting months (from June to September), over ten major grain maize-producing departments.

Figure 11. The observed (a_obs) and estimated grain maize yield anomaly time series, for different forecasting months from June to September (e.g., a_Jun means the June forecast), in Landes (France).

Conclusions and perspectives
Crop yield modelling is very useful in agriculture as it can help increase the yield, improve the production quality, and minimise the impact of adverse conditions. Statistical models are among the most used approaches, with many advantages (Lobell and Burke, 2010; Iizumi et al., 2013; Mathieu and Aires, 2018; Gaudio et al., 2019). The main difficulty in this context is the limited size of the available crop databases used to calibrate such statistical models. Applications typically rely on only two or three decades of data (Prasad et al., 2006; Ceglar et al., 2016; Kern et al., 2018). This small sample size directly limits the complexity level that can be used in the statistical model: a model that is too complex cannot be fitted with such limited data, and assessing the true model quality is also challenging. In practice, statistical inference requires three datasets: one for calibrating the model, a second one for choosing the right model (or tuning the model hyper-parameters), and another for assessing the true model generalisation skills (Ripley, 1996). Dividing a very small database into three such datasets is very difficult.
The LOO method has been used as a cross-validation tool to calibrate, select, and assess the model (Kogan et al., 2013; Zhao et al., 2018; Dinh et al., 2021). It was shown in this paper that LOO can be misleading because it uses only one dataset to choose the best model and estimate its generalisation skills simultaneously. This is a real problem, as LOO is one of the main statistical tools used to obtain good crop yield models. This study proposes a nested cross-validation approach in the form of what we call the LTO method. This method uses a complex folding scheme to estimate independent training, validation, and testing scores. Results show that LOO is truly misleading and can artificially favour complex models that overfit the problem. In contrast, LTO shows that only very simple models can be used when the database is limited in size. The LTO implementation proposed here is very general and can be applied to any statistical crop modelling application.
Two applications have been considered here. The first one concerns coffee yield modelling over a district in Vietnam's major Robusta coffee-producing region. It was shown that monthly mean precipitation and temperature could explain more than 30 % of the coffee yield anomaly variability. The remaining 70 % of the variability is due to non-climatic factors (agricultural practices, diseases, or the political and social context). Explaining a third of the coffee yield variability is in line with the literature (Ray et al., 2015; Craparo et al., 2015b; Dinh et al., 2021). LTO was able to identify the suitable complexity of the statistical model that can be trained on the historical record, and to estimate the true model ability to predict yields for independent years.

The second application is related to grain maize over France. The LTO was used here to choose between simple linear models and more complex neural network models. Our findings also showed that LOO was misleading, overestimating the testing scores. LTO indicated that a simple linear model should be used and estimated the model generalisation ability correctly. This approach can also be helpful in seasonal forecasting applications (during the growing season and the beginning of the harvest season).
In this application, the weather can explain more than 40 % of the yield anomaly variability, which is a reasonable score (Ray et al., 2015; Ceglar et al., 2017). This score can vary depending on the study regions, e.g., some regions are more sensitive to the climate than others.

Appendix A: Pseudo-code of the LTO procedure

n_mod = number of models;
n_fold = number of folds of the dataset;
Score(2, n_fold, n_mod); % representing RMSE or COR; 2 for [Test, Val]
for isamp = 1 : n_samp
    ibm(isamp) = best model in {1, ..., n_mod}; % chosen on the validation scores
    Score_Test(isamp) = mean(Score(1, n_fold{isamp}, ibm(isamp))); % Test score
    Score_Val(isamp)  = mean(Score(2, n_fold{isamp}, ibm(isamp))); % Val score
end
FinalScore_Test = mean(Score_Test)
FinalScore_Val  = mean(Score_Val)

Author contributions. All authors conceptualized the research and formulated the model. LADT implemented the model in Matlab and analyzed the output with FA. All authors contributed to writing the paper.
Competing interests. The authors declare that they have no conflict of interest.