Statistical models are increasingly used to study the impact of weather on crop yield. Unfortunately, this type of application is characterized by datasets with a very limited number of samples (typically one sample per year). In general, statistical inference uses three datasets: the training dataset to optimize the model parameters, the validation dataset to select the best model, and the testing dataset to evaluate the model generalization ability. Splitting the overall database into three datasets is often impossible in crop yield modelling due to the limited number of samples. The leave-one-out cross-validation method, or simply leave one out (LOO), is often used to assess model performance or to select among competing models when the sample size is small. However, the model choice is then typically made using only the testing dataset, which can be misleading by favouring unnecessarily complex models. The nested cross-validation approach was introduced in machine learning to avoid this problem by truly utilizing three datasets, even with limited databases. In this study, we propose a particular implementation of nested cross-validation, the nested leave-two-out cross-validation method, or simply leave two out (LTO), to choose the best model with an optimal model selection (using the validation dataset) and to estimate the true model quality (using the testing dataset). Two applications are considered: robusta coffee in Cu M'gar (Dak Lak, Vietnam) and grain maize over 96 French departments. In both cases, LOO is misleading: it favours models that are too complex, whereas LTO shows that simpler models actually perform better when a reliable generalization test is applied. The simple models obtained with the LTO approach have improved yield anomaly forecasting skills for both crops. The LTO approach can also be used in seasonal forecasting applications.
We suggest that the LTO method should become a standard procedure for statistical crop modelling.

Many approaches are available to study the impact of climate and weather variables on crop yield. Statistical modelling, which aims to find relations between a set of explanatory variables and crop yield, is a widely used approach (see, for example,

Unfortunately, crop modelling is often characterized by datasets with a very limited number of samples. For instance,

Cross-validation

To solve the issues of LOO, another more complex approach has been introduced: nested cross-validation

Robusta (

A coffee tree is a perennial that remains highly productive for about 30 years

The robusta coffee yield data were obtained from the General Statistics Office of Vietnam for the 2000–2018 period (

The long-term trend represents the slow evolution of the crop yield; it often reflects changes in management such as fertilization or irrigation. Removing this trend from the yield time series therefore suppresses the influence of non-weather-related factors, and this is common practice

Grain maize database:

Grain maize (

The French grain maize data (area, production, and yield) on the regional level (i.e. department, which is an administrative unit in France) were collected from the Agreste website (

Similar to robusta coffee, the grain maize anomalies are calculated by removing the long-term yield trend. Here, a 10-year moving average window is used because the trend is slightly more complex than the one we found for robusta coffee.
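As a minimal illustration of this detrending step (the authors' exact Matlab implementation may differ, and the series below is synthetic), the anomaly is the residual around a centred moving-average trend of about 10 years:

```python
import numpy as np

def yield_anomalies(yields, window=10):
    """Remove a long-term trend estimated by a moving average.

    The trend at each year is the mean over a centred window of about
    `window` years (truncated near the series edges); the anomaly is
    the residual around that trend.
    """
    yields = np.asarray(yields, dtype=float)
    n = len(yields)
    half = window // 2
    trend = np.array([yields[max(0, i - half):min(n, i + half + 1)].mean()
                      for i in range(n)])
    return yields - trend

# Synthetic 22-year series: linear trend plus inter-annual noise
rng = np.random.default_rng(0)
years = np.arange(22)
series = 5.0 + 0.1 * years + rng.normal(0, 0.3, 22)
anoms = yield_anomalies(series, window=10)
```

Detrending this way leaves a series with much smaller variance than the raw deviations from the global mean, since the slow management-driven evolution has been absorbed into the trend.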

The monthly-mean total precipitation (

This study considers the

for robusta coffee, we evaluated

for grain maize, 6 months of growing period (from sowing to harvest) will be studied (Sect.

Weather anomalies could be computed in the same way as the crop yield anomalies. However, the climate trend over 10 to 20 years is small compared to the inter-annual variations. Thus, the long-term trend can be neglected, and the relative anomalies are estimated with respect to the long-term average. This average value is computed for each of the
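A sketch of this computation (assuming the long-term average is taken per calendar month, with synthetic precipitation data) could look like:

```python
import numpy as np

def relative_anomalies(monthly, n_years):
    """Relative anomalies of a monthly weather variable.

    `monthly` has shape (n_years, 12). The long-term average is taken
    per calendar month (no detrending, following the assumption that
    the climate trend is negligible over the period), and the relative
    anomaly is the departure from that average, scaled by it.
    """
    monthly = np.asarray(monthly, dtype=float).reshape(n_years, 12)
    clim = monthly.mean(axis=0)   # long-term mean, one value per month
    return (monthly - clim) / clim

rng = np.random.default_rng(1)
precip = rng.uniform(50, 150, size=(20, 12))  # synthetic monthly precipitation
anom = relative_anomalies(precip, 20)
```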

We also considered adding other explanatory variables (not shown), e.g. maximum and minimum temperature and solar radiation. However, we chose not to include these variables for several reasons: (1) these variables show relatively low correlations to the crop yield anomalies; (2) they are highly correlated to

The statistical models measure the impact of weather on crop yield anomalies, which can be denoted as

In this study, two statistical models are considered:

Linear regression (LIN) is the simplest and most frequently used model. The relationship between the crop yield anomalies

Neural network (NN) is a non-linear statistical model. The simplest type of NN is the feedforward model

The least-squares criterion, which minimizes the discrepancies between the model predictions and observed values, is used to optimize the model during the calibration process for both LIN and NN models. It is used to obtain the coefficients
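For the LIN case, the least-squares calibration can be sketched as follows (synthetic predictors and coefficients; the intercept is handled by appending a column of ones):

```python
import numpy as np

# Least-squares fit of a linear (LIN) model: anomalies ~ X @ beta + b.
# X holds the weather predictors (one row per year); data are synthetic.
rng = np.random.default_rng(2)
n_years, n_inputs = 19, 3
X = rng.normal(size=(n_years, n_inputs))
beta_true = np.array([0.5, -0.3, 0.2])
y = X @ beta_true + rng.normal(0, 0.1, n_years)

# Append a column of ones for the intercept and solve min ||A w - y||^2
A = np.hstack([X, np.ones((n_years, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
coefs, intercept = w[:-1], w[-1]
```

With 19 samples and low noise, the estimated coefficients land close to the generating values; with real yield data the fit is of course far noisier.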

Two diagnostics are considered here to measure the quality of the yield anomaly estimations: (1) the Pearson's correlation COR (unitless) between the estimated
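Both diagnostics are standard and can be sketched in a few lines (the observed/estimated values below are made up for illustration):

```python
import numpy as np

def diagnostics(y_obs, y_est):
    """Pearson correlation COR (unitless) and RMSE (in the units of the
    yield anomalies) between observed and estimated values."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_est = np.asarray(y_est, dtype=float)
    cor = np.corrcoef(y_obs, y_est)[0, 1]
    rmse = np.sqrt(np.mean((y_obs - y_est) ** 2))
    return cor, rmse

obs = np.array([0.10, -0.20, 0.30, 0.00, -0.10])
est = np.array([0.08, -0.25, 0.28, 0.05, -0.12])
cor, rmse = diagnostics(obs, est)
```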

Model selection is the process of selecting one model – among many candidate models – that generalizes best

When performing model selection, it is possible to fit the training dataset artificially well. For example, increasing the model complexity can improve the apparent model quality, because a more complex model can better fit the training data. However, such reasoning is dangerous: the model complexity can be too high compared to the limited information contained in the training database. This leads to the overfitting (or overtraining) problem: the model fits the training dataset artificially well but cannot predict data absent from it. An overfitted model therefore makes poor predictions and is not reliable. There is no general rule relating the admissible model complexity to the number of samples, so an empirical tool is needed to check the adequacy of the model. In the following, by studying the sensitivity of the model quality to different complexity levels, we determine the optimal statistical crop model, i.e. the one that truly estimates the yield anomalies as well as possible.
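The trap described above is easy to demonstrate on synthetic data (unrelated to the crop series): with an exact least-squares fit, the training error can only decrease as complexity grows, which is precisely why it cannot guide model choice on its own:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 15)   # 15 noisy samples

train_rmse = []
for degree in range(1, 9):
    coefs = np.polyfit(x, y, degree)        # least-squares polynomial fit
    resid = y - np.polyval(coefs, x)
    train_rmse.append(np.sqrt(np.mean(resid ** 2)))
# The training RMSE decreases monotonically with the polynomial degree,
# yet the high-degree fits oscillate wildly between the sample points
# and would predict unseen data poorly.
```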

One of the main challenges in statistical inference is that the model is set up using a database of samples, but it must perform well on new – previously unseen – samples. For that purpose, the overall database

The training dataset

The validation dataset

The testing dataset

The process of partitioning

The need for the validation dataset is often misrepresented in the crop modelling literature. The training dataset is used to fit the parameters; the testing dataset is often used both to estimate the model quality and to choose the best model (as in the LOO approach).
However, using only this testing dataset, without a validation dataset, brings a risk of choosing the model that best suits this particular testing dataset. This is a special kind of overfitting, affecting not the model calibration but the model choice.
If the database is big, the testing dataset contains enough samples to be representative, and choosing the best model based on it is acceptable. If the database is small (as is often the case in crop modelling), the model selection can become too specific to the particular samples of the testing dataset, and an overfitting problem can appear (Sect.

With a limited number of samples, the training process may need every possible data point to determine model parameters

The LOO method is one common type of cross-validation in which only two datasets are used: one to train the model and another both to choose the model and to evaluate the result.
The main idea of LOO is that given
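A minimal Python sketch of LOO (generic `fit`/`predict` callables; the linear helpers and synthetic data below are assumptions for illustration, not the paper's code) is:

```python
import numpy as np

def loo_rmse(X, y, fit, predict):
    """Leave-one-out: each sample is held out once, the model is
    trained on the remaining n-1 samples, and the held-out sample is
    predicted; the RMSE pools all held-out prediction errors."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        params = fit(X[keep], y[keep])
        errors[i] = y[i] - predict(X[i:i + 1], params)[0]
    return np.sqrt(np.mean(errors ** 2))

# Linear model helpers (intercept via a column of ones)
def fit_lin(X, y):
    A = np.hstack([X, np.ones((len(y), 1))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_lin(X, w):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

rng = np.random.default_rng(4)
X = rng.normal(size=(19, 2))
y = X @ np.array([0.4, -0.2]) + rng.normal(0, 0.1, 19)
score = loo_rmse(X, y, fit_lin, predict_lin)
```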

LOO is very useful in many cases

Folding strategy for the LTO procedure with

For LTO, we divide the database into three datasets: training, validation, and testing. Each split, or partition, of the database is referred to as a “fold”. Each fold divides the database

Illustration of the LTO procedure to estimate a model quality for a fixed complexity level

Figure

Each choice of the testing value (each

Finally, the

A pseudo-code is provided in Appendix
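One plausible Python sketch of the LTO double loop for a fixed complexity level (the authors' exact folding and Matlab implementation may differ; the linear helpers and data are illustrative assumptions) is:

```python
import numpy as np

def lto_rmse(X, y, fit, predict):
    """Nested leave-two-out for one fixed model complexity.

    Outer loop: sample i is held out for testing. Inner loop: a second
    sample j (j != i) is held out for validation and the model is
    trained on the remaining n-2 samples. Pooling the inner-loop errors
    gives a validation RMSE; retraining on all samples except i gives
    the test prediction for i, and pooling those gives a testing RMSE.
    """
    n = len(y)
    val_errs, test_errs = [], []
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            keep = np.ones(n, dtype=bool)
            keep[[i, j]] = False
            params = fit(X[keep], y[keep])
            val_errs.append(y[j] - predict(X[j:j + 1], params)[0])
        keep_i = np.arange(n) != i
        params = fit(X[keep_i], y[keep_i])
        test_errs.append(y[i] - predict(X[i:i + 1], params)[0])
    return (np.sqrt(np.mean(np.square(val_errs))),
            np.sqrt(np.mean(np.square(test_errs))))

# Linear model helpers (intercept via a column of ones) and toy data
def fit_lin(X, y):
    A = np.hstack([X, np.ones((len(y), 1))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_lin(X, w):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

rng = np.random.default_rng(6)
X = rng.normal(size=(19, 2))
y = X @ np.array([0.4, -0.2]) + rng.normal(0, 0.1, 19)
val_rmse, test_rmse = lto_rmse(X, y, fit_lin, predict_lin)
```

In practice the validation RMSE would be computed for each candidate complexity, the complexity with the lowest validation RMSE retained, and the testing RMSE of that choice reported as the generalization score.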

Schematic illustration of validation and testing RMSE values of predicted yield anomalies for an increasing number of inputs, obtained from the LTO procedure. For a fixed complexity level, i.e.

The process represented in Fig.

The first application concerns the statistical yield modelling of robusta coffee in Cu M'gar (Dak Lak, Vietnam). The goal is to find a model that makes the most robust predictions of crop yield anomaly as a function of weather variables. We first assess several models (with a varying number of inputs or number of potential predictors) to find the appropriate model choices using both LOO (Sect.

We first investigated the model choice by varying the number of inputs. In this example, the number of potential predictors is fixed at 18 (

The training, validation, and/or testing RMSE values of the predicted coffee yield anomalies, using different LIN models (with 18 potential predictors) by increasing the number of inputs, in Cu M'gar (Dak Lak, Vietnam):

Figure

The training, validation, and/or testing RMSE values of the predicted coffee yield anomalies, using LIN5 models by increasing the number of potential predictors, in Cu M'gar (Dak Lak, Vietnam):

The LOO procedure suggests that the more potential predictors the models have, the better the results are. Both training and testing RMSE values decrease gradually (Fig.

The LOO procedure is actually misleading because it can encourage us to choose a model that overfits the data: the same testing dataset is used both to choose the best model and to assess the generalization ability. Modellers who select the best model based on the LTO procedure are less likely to choose an overfitted model, since the model is chosen on the validation dataset and its generalization score is assessed on an independent testing dataset.

In short, considering the limited information in the available database – that is used to train, select the model, and evaluate its quality – it is not possible to use more than a very simple and limited model. Therefore, for this 19-sample coffee yield modelling case, using a simple LIN model is better than a complex one.

The previous section showed that the LTO procedure allows us to choose a reasonable model, simple enough, with fewer inputs and potential predictors. Thus, the crop yield estimations of the LTO method will be assessed here to see how good the selected model (LIN3 model with three predictors) is.
The final model includes {

The observed (solid line) and LTO estimated (dashed line) coffee yield anomalies time series in Cu M'gar (Dak Lak, Vietnam).

Figure

This application considers several aspects of grain maize over France. First, the sensitivity of the forecasting quality to the model selection is studied, using the LOO and LTO approaches, over Bas-Rhin and Landes – two major grain-maize-producing departments – and over all 96 departments in France. Then, the forecasting scores are investigated over 10 major grain-maize-producing departments.

In this section, we describe how we selected the most appropriate statistical model for grain maize using 22 years of yield data. This test is done over Bas-Rhin and Landes (i.e. two major grain-maize-producing departments in France). As shown in Sect.

The training, validation, and testing RMSE values of the predicted grain maize yield anomalies, using different LIN models (with 12 potential predictors) by increasing the number of inputs, in

Similar to the robusta coffee case (Sect.

The training, validation, and testing RMSE values of the predicted grain maize yield anomalies, using different NN models with various choices, in Bas-Rhin (France):

More complex models were tested in Fig.

The training, validation, and testing RMSE values of the predicted grain maize yield anomalies, using different models by increasing the number of potential predictors, in Bas-Rhin (France):

We also tested other examples with LIN3 and NN3 models (Fig.

In this section, a statistical yield model is applied first over 96 French departments to assess the true model quality. Then, we will focus on 10 major departments to assess how the selected models perform for yield anomaly predictions.

The true testing RMSE maps of predicted grain maize yield anomalies in France for LOO

Figure

Boxplots of residuals (the difference between the observed and estimated yield anomalies) for 10 major grain-maize-producing departments: red horizontal bars are medians, boxes show the 25th–75th percentiles, error bars depict the minimum and maximum values, and red

We now analyse how good the LTO testing estimations are compared to the observations over 10 major grain-maize-producing departments (as shown in Fig.

The LTO approach is helpful for selecting an appropriate model with better forecasting. Here, the model chosen by the LTO procedure is tested for seasonal forecasting, from the sowing time (April) to the forecasting months (i.e. from June to September): all weather variables (including
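The mechanics of such a seasonal forecast can be sketched as follows (the month range and column layout are assumptions for illustration): a forecast issued in month m only uses the predictor columns observed up to m.

```python
import numpy as np

# Monthly predictors assumed to run from April (sowing) to September,
# one column per month.
months = ["Apr", "May", "Jun", "Jul", "Aug", "Sep"]

def predictors_up_to(X_monthly, forecast_month):
    """Keep only the predictor columns available by `forecast_month`."""
    m = months.index(forecast_month) + 1
    return X_monthly[:, :m]

rng = np.random.default_rng(5)
X_monthly = rng.normal(size=(22, 6))   # 22 years, one column per month
X_june = predictors_up_to(X_monthly, "Jun")   # only Apr-Jun available
```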

The correlation between the observed and estimated yield anomalies for different forecasting months (from June to September), over 10 major grain-maize-producing departments.

The observed (

In addition, Fig.

Crop yield modelling is very useful in agriculture as it can help increase yields, improve production quality, and minimize the impact of adverse conditions. Statistical models are among the most widely used approaches and have many advantages. The main difficulty in this context is the limited size of the available crop databases used to calibrate such statistical models: applications typically rely on only 2 or 3 decades of data. This small sample size directly constrains the complexity level that can be used in the statistical model: an overly complex model cannot be fitted with such limited data, and assessing the true model quality is also challenging. In practice, statistical inference requires three datasets: one for calibrating the model, a second for choosing the right model (or tuning its hyperparameters), and a third for assessing the true model generalization skills. Dividing a very small database into three such datasets is very difficult.

The LOO method has been used as a cross-validation tool to calibrate, select, and assess competing models

Two applications have been considered. The first one concerns coffee yield modelling over a major robusta-producing district in Vietnam. It was shown that, given the available historical yield record, the best statistical model can explain about 30 % of the coffee yield anomaly variability. The remaining variability is rather large and may be explained by non-climatic factors (e.g. prices, sociotechnical factors, managerial decisions, or the political and social context). It could also come from climate, but the model would then require more detailed variables (e.g. at a daily scale) or more samples to capture the climate–crop-yield relationship in greater detail.
In addition, explaining a third of the coffee yield variability is in line with the literature

The second application is related to grain maize yield in France. The LTO was used here to choose between simple linear models and more complex neural network models. Our findings also show that LOO was misleading in overestimating the testing scores. LTO indicated that a simple linear model is preferable because it has a lower testing error. This approach can also be helpful in seasonal forecasting applications (during the growing and the beginning of harvest seasons). In this application, the weather can explain more than 40 % of the yield anomaly variability, which is similar to that reported in the literature

In the future, mixed-effects models could be considered instead of a straightforward statistical model. This approach – which uses samples from several regions (e.g. by gathering samples into groups) to compensate for the lack of historical data – could help us obtain more complex crop models

The Matlab code used to run an example of the leave-two-out method is available at the following Zenodo link:

The coffee data were provided by Vietnam's General Statistics Office (GSO) for the 2000–2018 period. These data are available from GSO on reasonable request. For any inquiries, please send an email to banbientap@gso.gov.vn.
The series of annual agricultural statistics 1989–2010 (i.e. grain maize and other French crops) are available on request from: fatima.bouhaddi@agriculture.gouv.fr or angelique.toulon@agriculture.gouv.fr.
In addition, the weather data, i.e. ERA5-Land data, can be downloaded from

All authors conceptualized the research and formulated the model. TLAD implemented the model in Matlab and analysed the output with FA. All authors contributed to writing the paper.

The contact author has declared that neither they nor their co-author has any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We are grateful to Matthew Heberger for his useful suggestions for improving this paper. We thank the editor and the reviewers for their valuable comments and suggestions on our paper.

This work is part of Thi Lan Anh Dinh's PhD, which benefited from the French state aid managed by the ANR under the “Investissements d'avenir” programme with the reference ANR-16-CONV-0003 and the Australian Centre for International Agricultural Research, conducted as part of activities for the project “Enhancing smallholder livelihoods in the Central Highlands of Vietnam through improving the sustainability of coffee and black pepper farming systems and value chains” (AGB/2018/175).

This research has been supported by the Agence Nationale de la Recherche (grant no. ANR-16-CONV-0003) and the Australian Centre for International Agricultural Research (grant no. AGB/2018/175).

This paper was edited by Christoph Müller and reviewed by Christoph Gornott and three anonymous referees.