Reduction of predictive uncertainty in estimating irrigation water requirement through multi-model ensembles and ensemble averaging

. Irrigation agriculture plays an increasingly important role in food supply. Many evapotranspiration models are used today to estimate the water demand for irrigation. They consider different stages of crop growth by empirical crop coefﬁcients to adapt evapotranspiration throughout the vegetation period. We investigate the importance of the model structural versus model parametric uncertainty for irrigation simulations by considering six evapotranspiration models and ﬁve crop coefﬁcient sets to estimate irrigation water requirements for growing wheat in the Murray–Darling Basin, Australia. The study is carried out using the spatial decision support system SPARE:WATER. We ﬁnd that structural model uncertainty among reference ET is far more important than model parametric uncertainty introduced by crop coefﬁcients. These crop coefﬁcients are used to estimate irrigation water requirement following the single crop coefﬁcient approach. Using the reliability ensemble averaging (REA) technique, we are able to reduce the overall predictive model uncertainty by more than 10 %. The exceedance probability curve of irrigation water requirements shows that a certain threshold, e.g. an irrigation water limit due to water right of 400 mm, would be less frequently exceeded in case of the REA ensemble average (45 %) in comparison to the equally weighted ensemble average (66 %). We conclude that multi-model ensemble predictions and sophisticated model averaging techniques are helpful in predicting irrigation demand and provide relevant information for decision making.


Predicting crop water needs
Globally, the proportion of freshwater consumption by agriculture from rainfall as well as surface and groundwater resources is large (9087 km 3 yr −1 ) . It is projected that water demand will increase in the future, in particular by irrigation agriculture, in order to support the increasing world population with food (Foley et al., 2011;De Fraiture and Wichelns, 2010;Hanjra and Qureshi, 2010;Wada and Bierkens, 2014). Therefore, strategies based on improved irrigation methods and local adaptions of management practices are likely to be implemented to anticipate this trend. Such strategies are often developed using decision support systems that are informed by mathematical models. For example, irrigation management has been optimized by modelling and measurements for crops grown in Central Asia (Pereira et al., 2009) and for irrigated cotton in the High Plains region of Texas (Howell et al., 2004). Others have investigated water use efficiency (Wang et al., 2001) or crop water productivity (Liu et al., 2007) by modelling experiments for irrigated crops grown in China.
All these models depend on the calculation of evapotranspiration (ET), which represents the evaporation from a surface and transpiration from plants. In the case of agricultural crops, ET is equal to the crop water needed for crop growth and yield production. Globally, evapotranspiration represents about two-thirds of the total rainfall on land, while evapotranspiration from crops amounts to about 8 % (Oki and Kanae, 2006) and is therefore the most important term of the water balance. The basic concept for deriving crop water needs of irrigated crops has been initially reported by Jensen (1968) and is proposed by Allen et al. (1998) as the single crop coefficient concept. The crop-specific evapotranspiration (ET c ) is derived from reference evapotranspiration (ET o ) and a cropspecific coefficient (K c ): with ET o given in millimetres and dimensionless K c . ET o can be calculated by standardize potential evapotranspiration (PET) to a short (grass) or tall (alfalfa) reference crop. In the case of the Penman-Monteith equation (Monteith, 1965;Penman, 1948) standardized fixed values for albedo (0.23), plant height (0.12 cm) and surface resistance (70 m s −1 ) are assumed (Allen et al., 1998;Jensen et al., 1990). K c is commonly calculated on the basis of field experiments (e.g. Ko et al., 2009;da Silva et al., 2013) and varies with the crop development.
Such an approach is part of many irrigation management models, including Cropwat (Smith, 1992), ISAREG (Pereira et al., 2009), ISM (George et al., 2000), and global crop water models (Siebert and Döll, 2010). Moreover, the single crop coefficient concept is the basis for the simulation of crop water needs in many studies. For example, Lathuillière et al. (2012) have derived water use by terrestrial ecosystems and have shown that actual ET declines over a 10-year period by about 25 % in response to deforestation and replacement by agriculture in Brazil. They showed that irrigation water requirement (IRR) is relevant for terrestrial water fluxes, and a reliable estimation is crucial for the closure of the water cycle. In another study future climate impacts on groundwater in agriculture areas have been investigated (Toews and Allen, 2009). They showed that larger return flows to the groundwater can be related to increased IRR under warmer temperatures and longer vegetation periods. Moreover, the crop coefficient concept is also the basis for the water footprint (volume of water consumed or polluted to produce one unit of biomass) assessment of crops (Mekonnen and Hoekstra, 2011) and has been used to determine water requirements and the water footprint of the agriculture sector in Saudi Arabia (Multsch et al., 2013). In almost all studies, researchers use only one ET o method with a single, often spatially independent, K c set. As a result, some scientists ask to use locally adapted K c sets at least (Ko et al., 2009;da Silva et al., 2013). For this reason, the investigation of predictive uncertainty of IRR is needed, in particular in the frame of large-scale assessments.

Sources of predictive uncertainty
Major sources of uncertainties should be considered in the study design, quantified throughout the modelling process (Refsgaard et al., 2007) and communicated as part of the results to the end users. Uncertainties related to large-scale estimations of the IRR have only rarely been analysed. For example, Siebert and Döll (2010) have studied the uncertainty in predicting green (rainfall consumed by crops) and blue (consumed surface and groundwater by crops in terms of irrigation) water consumption by using different ET o equations on a global scale. They observed a significant difference of blue water consumption, i.e. required irrigation, and only a small change in green water consumption between model runs while using two classical ET o equations. More recently, Sheffield et al. (2012) pointed out that using a more up-todate parameterization of PET to calculate drought indices led to different conclusions on drought occurrence globally.
Generally, model predictive uncertainty can be traced back to four sources: input uncertainty, output uncertainty, structural uncertainty and parametric uncertainty (Renard et al., 2010). The last two, structural and parametric uncertainty, are addressed in this study with a focus on the prediction of IRR. As part of the parametric uncertainty, the parameterization of equations to quantify natural or anthropogenic processes has received considerable interest, particularly in conceptual rainfall-runoff modelling (Beven, 2006;Vrugt et al., 2009). In case of modelling crop water needs according to Eq. (1), K c is an important model parameter. K c values for a large number of crops are provided by the FAO56 irrigation guidelines (Allen et al., 1998), which are commonly used for irrigation planning. However, it has been highlighted that an adjustment to the global K c is needed if the simulations are used for irrigation planning on a local to regional scale (Ko et al., 2009;da Silva et al., 2013). Nevertheless, it is still unclear whether a local adaption of K c leads to a better model performance. For this reason, we quantify the parametric uncertainty of model parameterization with different K c sets.
The model structure also introduces uncertainties, as any model remains a simplification of the real world. In the context of modelling water resources, all hydrological and crop growth models rely on the estimation of ET. According to Eq. (1), ET o is required to estimate crop-specific evapotranspiration. ET o equations are often divided into categories according to the input data (Bormann, 2011;Tabari et al., 2013): temperature-based equations such as the Hargreaves-Samani (HS) equation (Hargreaves and Samani, 1985), radiation-based equations such as Priestley-Taylor (PT) (Priestley and Taylor, 1972) or combined equations such as the FAO56 Penman-Monteith (PM56) equation (Allen et al., 1998), which further takes wind speed into account. Nevertheless, in many cases it was shown that the variability among ET o methods is large (Fisher et al., 2011;Kite and Droogers, 2000). Because most water resources models rely on some calculation of ET o , we see it as a crucial source of structural uncertainty that is rarely considered.

Reduction of predictive uncertainty by ensemble modelling
Ensembles of model predictions can be developed by different sets of model parameterization (single-model ensemble) and model structures (multi-model ensemble). The weighting of model ensembles according to their fit to observational data has become of interest to reduce the uncertainty and to derive more robust predictions and projections. Giorgi and Mearns (2002) have introduced the reliability ensemble averaging (REA) technique in climate research. Basically, different models are weighted according to their performance in representing measured data and according to the distance of individual models to the ensemble average prediction to quantify the convergence of different models. This approach has been applied more recently for predicting catchment nitrogen fluxes (Exbrayat et al., 2013) and calculating water balances and land use interaction . In a first step, we analyse the relative contributions of the structural and parametric model uncertainty in hindcasts of IRR of wheat across the Murray-Darling Basin (MDB), Australia. Simulations are calculated using the spatial decision support system SPARE:WATER (Multsch et al., 2013). In a second step, we apply the REA methodology to reduce the predictive uncertainty of IRR. The general procedure is as follows: -The applicability of six different ET o methods is evaluated by using available measured class-A-pan evaporation measurements of 34 stations in the MDB over a 21-year time period.
-30 different model realizations are set up in a multimodel ensemble by combining various ET o equations (n = 6) and crop coefficient data sets (n = 5).
-IRR is calculated by forcing the multi-model ensemble with climate time series of 21 years (monthly data) for 3969 sites (each 1 km 2 × 1 km 2 ) in the MDB where irrigated wheat has been grown according to the land use allocation in 2000.
-The 30 model realizations are weighted according to their performance in representing measured data and their distance to the ensemble average.
By doing so, we quantify structural (ET o method) and parametric (K c set) uncertainty and apply REA to provide a robust estimate of IRR and the confidence interval around it. The underlying research question is how can we derive better predictions by using an ensemble of well-known ET o methods as well as K c sets and which are the likely causes of predictive uncertainty in IRR estimations. Finally, we show a procedure to reduce predictive uncertainty of IRR.

Study site and data
The MDB covers about 1 million km 2 of south-east Australia ( Fig. 1). Irrigation agriculture in the MDB sums up to 17 600 km 2 , which is equal to 65 % of the total irrigation agriculture in Australia. Total water withdrawal for irrigation in 2006 amounted to 7.36 km 3 yr −1 (ABS, 2006). Wheat is the second most important crop grown in MDB after grazing pastures, covering 3969 km 2 in 2006, and was therefore selected for this case study for which IRR and its underlying uncertainty was calculated. The cropping areas have been taken from a land use map from 2006 (ABARES, 2010) with a spatial resolution of 0.01 • × 0.01 • (∼ 1 km × 1 km). We assume a fixed land use distribution over time in our model study to clearly target the uncertainty in ET o method and crop coefficients. Climate data for 1986-2006 were taken from the SILO Data Drill of the Queensland Department of Natural resources and Water (https://longpaddock.qld.gov.au/silo/; Jeffrey et al., 2001) with a spatial resolution of 0.05 • × 0.05 • (∼ 5 km × 5 km). We used the same weather data set over all 3969 1 × 1 km land grid cells overlapped by a 5 × 5 km grid cell in the weather data. The model was forced with monthly data. For validation, we compared simulated ET o to measured class-A pan data from 34 stations throughout the MDB. The class-A pan data were obtained from Patched Point Dataset of the Queensland Department of Science, Information Technology, Innovation and the Arts (http://www. longpaddock.qld.gov.au/silo/ppd/). Measured data have been adjusted with monthly pan coefficients according to McMahon et al. (2013) to represent evaporation from open surface water. For stations where no pan coefficient was available, we used the one from the nearest station.

Simulation of irrigation requirement with SPARE:WATER
SPARE:WATER (Multsch et al., 2013) is a spatial decision support system for the calculation of crop-specific water requirements and water footprints from local to regional scale. Input parameters for the simulation are climate data, irrigation management (irrigation water quality, irrigation efficiency, irrigation method), a digital elevation model and crop characteristics such as maximum crop height and length of growing season as well as sowing and planting date. In a first step, the water requirement of growing a crop is simulated for each grid cell according to the spatial resolution of the input data. In a second step, the water footprint for spatial entities such as administrative boundaries or catchments is calculated considering statistical data on crop yield and harvest area. Water footprints for geographic entities are given as volume of water consumed per year (e.g. km 3 yr −1 ) and water footprints for specific crops as volumes of water consumed per biomass (m 3 t −1 ). In this study the calculation of the IRR is calculated as the difference between ET c and effective rainfall (P eff ). The latter one is estimated from the difference of surface runoff (RO) and precipitation (P ). RO is derived as a fixed fraction of 20 % of total P . The fixed fraction of runoff is adapted from the default setting of the FAO Cropwat model (Smith, 1992). On this basis, IRR is calculated according to Eq. (2): with IRR, ET c and P eff given in millimetres. ET c is calculated based on the single crop coefficient approach initially proposed by Jensen (1968) and recommended by Allen et al. (1998) according to Eq. (1). The input parameters for this method are the length of four individual stages (initial season, growth season, mid-season and late season) during the growing season and three related crop coefficients (K c ). These define the ratio between ET o and ET c for each part of the growing season. We have considered five different K c data sets (Table 1). The most common data set has been proposed from the FAO56 Irrigation and Drainage Guidelines (Allen et al., 1998). This approach has been applied for calculating crop water footprints (Mekonnen and Hoekstra, 2011) and is part of the widely used Cropwat model (Smith, 1992). It has been discussed that locally adapted K c sets are superior in simulating site-specific crop water requirement to global ones (Ko et al., 2009;da Silva et al., 2013). Thus, further data sets have been collected from various sources which represent site-specific relationships between ET o and ET c for areas in the MDB. ET o has been calculated with six different methods (Table 2). Two of them are classified as combined meth-ods (PM56, PPET), three are radiation-based methods (PT, TURC, APET) and one is a temperature-based method (HS). All of them are commonly applied functions. For example, PM56 and HS are included in Cropwat (Smith, 1992) and Aquacrop (Steduto et al., 2009), two models to quantify crop water and IRR, widely used and promoted by the FAO. The cropping system model EPIC (Williams, 1989) additionally allows the use of the PT equation, while the global vegetation model LPJmL (Fader et al., 2010) and the global water model WaterGap (Döll et al., 2003) are restricted to PT. APET and PPET have been particularly tested for the utilization under Australian weather conditions in several studies (Chiew et al., 2002;Chiew and Leahy, 2003;Donohue et al., 2010).

Reliability ensemble averaging
We used two types of ensemble averaging techniques that differ in the weighing technique. We calculated an equally weighted average of all 30 model realizations (6 ET o methods × 5 K c data sets) for every grid cell, which sum up to 3969 cells (1 × 1 km) in the MDB, where irrigated wheat is grown according to the land use allocation in 2006. However, this method neither considers the capability of its ensemble members to predict a target value nor does it value the agreement of model predictions amongst each other. Therefore, we apply the REA technique that was initially proposed by Giorgi and Mearns (2002) to reduce uncertainties in climate change projections. Moreover, it was used in impact studies targeting land use change impacts on hydrology ) and water quality scenario projections (Exbrayat et al., 2013).
The strength of the REA method is that it considers both the quality of a model prediction (performance) and its position within an ensemble of prediction (convergence). The aim is to provide a best estimate of predictions and a robust assessment of the confidence interval around it. The REA weighting scheme estimates two factors: model performance (R B ) and model convergence (R D ). R B represents the capability of each ensemble member to represent real-world data by its bias B. R D is a measure of the distance D of a single model to the equally weighted ensemble average. Both are limited by the natural background variability (ε). The combined effect known as reliability factor (R) is derived as . ( In this study, ε is calculated from measured class-A pan evaporation for 34 climate stations in the study region for the time period from 1986 to 2006. The class-A pan data have been adjusted with monthly pan coefficients for climate stations in Australia (McMahon et al., 2013). We calculated the annual mean evaporation in millimetres for each year and each station and used the 50 % confidence interval (difference between the 25 and 75 % percentile) of 224 mm to  (Meyer, 1999) Griffith, MDB 0.4 1.05 0.5 Hughes (Hughes, 1999) Murray and Murrumbidgee valleys 0.3 1.0 0.6 Table 2. The six equations applied for the calculation of reference evapotranspiration.

Method Abbreviation Equation
FAO-56 Penman-Monteith (Allen et al., 1998) PM56 Priestley-Taylor (Priestley and Taylor, 1972) PT Hargreaves-Samani (Hargreaves and Samani, 1985) HS Turc (Allen, 2003;Turc, 1961) TURC define ε. The consideration of the difference between upper and lower percentiles has been recommended by Giorgi and Mearns (2002). Model performance is measured by the RMSE between measured (class-A pan) and predicted ET o for each model (i). The convergence criterion R D is calculated in an iterative procedure. The difference between the average IRR of each ensemble member i and the ensemble average is calculated. Under the consideration of the natural background variability ε, a first guess of R D (for each ensemble member) is predicted as well as a first guess of the REA average. This procedure is repeated by considering the newly derived REA average until the ensemble convergence so that the difference between ensemble members and the REA average cannot be reduced by additional iterations (see Giorgi and Mearns, 2002, for a complete methodological description). The error of the equally weighted ensemble average is described by the RMSE between IRR i predicted by model i (with n = 30 models) and the equally weighted ensemble average irrigation water requirement (IRR). The error of the reliability ensemble average (RMSE REA ) is derived from the reliability factor of each model (R i ), the irrigation water requirement predicted by model i (IRR i ) and the REAweighted ensemble average (IRR REA ). The RMSE represents an approximate 60-70 % confidence interval under the assumption that the amount of irrigation is distributed somewhere between normal and uniform.

Validation of ET o methods
We applied six ET o equations to 34 sites in the MDB for which measured class-A pan evaporation data were available from 1986 to 2006 (Fig. 2). Class-A pan data represent the evaporation from an open water surface and integrate all climate factors driving evaporation such as radiation, wind speed, humidity and temperature. Pan evaporation differs from evaporation from a cropped surface through a different albedo, heat storage and humidity above the surface. For this The median daily ET o for APET is 3.6 mm d −1 , PM56 3.9 mm d −1 , HS 3.8 mm d −1 , PPET 5.2 mm d −1 , PT 6.4 mm d −1 and TURC 3.4 mm d −1 . According to the root mean square error (RMSE) PM56 gave the most reliable results. The medians of ET o for APET, PM56, and HS are close to the median of the measured evaporation rate of 3.7 mm d −1 . Apart from PT and PPET, the other methods underestimate ET o , especially where class-A pan data are larger than 6 mm d −1 . The relationship between measured and simulated ET o is linear as shown by the coefficients of determination r 2 ranging from 77 % (PT) to 88 % (PPET).
The simulated ET o is normally distributed if a single station and 1 year is tested (Shapiro test for normality: α > 0.1 for each year and station). The difference between the 34 stations is up to 2 times larger than the inter-annual difference in the 21-year period. Thus, spatial variability is larger than temporal variability in the MDB. The intra-annual variability shows a different picture. The median ET o in the summer months is up to 4 times larger than the ET o during winter months for all ET o methods, except PPET and PT with a 6 times larger ET o in summer than in winter months.
Four of the six methods simulate the measured data with a high r 2 and a low RMSE. The difference between the methods itself is large, in particular through the high ET o estimates by PT and PPET. Thus, the structural uncertainty through the ET o method is substantial and needs to be considered for the prediction of IRR, which is addressed in the next sections.

Irrigation water requirement and its variability
The IRR of wheat has been simulated using an ensemble of 30 model realizations for each of the 3969 1 km × 1 km irrigated cells in the MDB for 21 years. Average values of IRR for all model realizations are shown in Table 3. In most cases, the largest estimates are given by the combinations of the K c set Hughes with the ET o method PT. These are almost 2.5 times higher than the lowest average IRR calculated by the combination of TURC with the K c set Harris. It is obvious that changing ET o method results in a larger variation of calculated IRR than using a different K c set. Hence, the average IRRs give a first idea about variability due to model structures and parameters.
Over a large watershed such as the MDB, local differences in IRR may be large while catchment-wide water management plans define thresholds for water withdrawal, for example due to water rights or water resources protection measures. A given threshold may require heterogeneous local adaptations of irrigation management and a change in cropping patterns. Figure 3 shows the probability that a certain amount of IRR is exceeded in the MDB on average over the 21-year period. It illustrates the range of IRR predicted by the ensemble of all 30 model realizations for each grid cell. Two groups can be identified that are separated by ET o methods. The first group is composed of PPET and PT calculations. In this case, IRR is up to twice as high as compared to predictions by other models. The second group is formed by APET, HS, PM56 and TURC with substantially lower calculations of less than 500 mm in most cases. We note that the parametric uncertainty is almost negligible compared to the uncertainty introduced by the various ET o methods.

Ensemble averaging, uncertainty and weighting
Ensemble predictions have become an important tool to account for different model structures and parameters (Exbrayat et al., 2013;Huisman et al., 2009;Wada et al., 2013). The consideration of ensembles is especially helpful to increase our confidence in simulations when no validation data are at hand, such as projections of Earth's future climate under specified emission scenarios. Here we apply the concept of ensemble prediction to simulations of IRR. Two different ensemble averages, expressed as the exceedance probability of the IRR of wheat, are shown in Fig. 4. The first one represents the equally weighted average of irrigation (IRR, black line). The second one represents a weighted average using the reliability ensemble averaging (IRR REA , red line, see methods description) that weights predictions based on their performance and agreement with other ensemble members. This prevents dismissing some model structure, a process that can be rather subjective. Also, even an overall poorly performing model can contribute to the optimal information extracted from the ensembles (Viney et al., 2009) or may outperform better performing models once boundary conditions are changed (Exbrayat et al., 2013).
We use the inverse of the cumulative daily RMSE (Fig. 2) of the ET o methods during the growing season to calculate the criterion R B (RMSE 154 mm for APET, 123 mm for PM56, 142 mm HS, 232 mm PPET, 373 mm PT, 166 mm TURC). The convergence criterion R D was calculated based on the difference of the predicted irrigation given by a single ensemble member and the equally weighted ensemble average (see Sect. 2.3). Overall, the PT model combinations have the lowest reliability factors of between 0.51 and 0.6 followed by PPET with 0.96, a result driven by the poorer performance of these methods to simulate pan evaporation (Fig. 2), and the outlying positions of simulations using PT and PPET (Fig. 3). All other models are weighted similarly, a result in accordance with the similar performance and simulated values exhibited by these methods (see Table 4 for details).
The application of the reliability factor leads to a decrease of the calculated total IRR in each grid cell as well as to a decrease of its overall uncertainty (Fig. 4). The uncer-    HS  381  381  372  349  336 364  PT  661  671  654  618  580 637  PPET  577  577  565  534  514 551  PM56  365  362  355  344  324 350  APET  357  354  347  329  315 340  TURC  315  316  308  289  279  301  IRR  443  443  433  410  391  424 tainty range is given by the ensemble average plus/minus the RMSE in each grid cell, assuming that modelling errors are normally distributed. Exceedance probability curves might support defining thresholds in irrigation planning with consequences for decision makers through, for example, the adaptation of improved irrigation practice (e.g. from full to deficit irrigation, installation of advanced irrigation techniques) or the purchase of additional water rights. For example, a limit of available irrigation water of 400 mm per growing season will be exceeded less frequently in the MDB if the REA average IRR Table 4. Performance (R B ) and convergence (R D ) and reliability (R) coefficient of the ensemble members. is considered (45 %) in comparison to the equally weighted average (66 %). The spatial distributions of the equally weighted and the REA-weighted ensemble averages are shown in Fig. 5a and b. The equally weighted average of IRR ranges between 124 and 691 mm with an average across the MDB of 424 mm (Fig. 5a). Thus, spatial variability is large and western and northern areas require 5-6 times more irrigation than in the south-east. The REA-derived average IRR ranges between 104 and 663 mm across the river basin ( Fig. 5b) with an average of 405 mm. Depending on the location this value is up to 18 % lower as compared to simulations based on the equally weighted average (Fig. 5c). Also, the uncertainty range decreases as a consequence of the REA method by about 10 % across the MDB with maximum values of around 26 % when comparing equally weighted and REA-weighted RMSE (Fig. 5d-f). The largest change in uncertainty can be found in the south-east of the MDB and also in areas towards the east (Fig. 5f). Thus, REA leads not only to a decrease of predicted IRR but also to a reduction of its uncertainty. The uncertainty is reduced because the REA is drawn toward the group of the better ET o methods that also agree well between themselves.

Discussion and conclusions
The simulation of IRR strongly varies amongst ET o methods. Bormann (2011) recommended that the selection of the ET o method should be based on the validation of ET o with real-world observations rather than only on the availability of climate input data. This is due to the general large variabil-ity among ET o methods, which was also revealed in a study where PT was set as a benchmark model and the RMSE between ET o methods was analysed (McMahon et al., 2013). Likewise, the influence of a single ET o method on the prediction of crop yields was also reported for an agriculture site in Europe (Balkovič et al., 2013) where ET o estimates by PT were 40 % higher and those by Penman-Monteith 10 % lower in comparison to HS. We also found a large variability among ET o methods in our study. However, similar ranges across Australia for ET o have been reported by others (Chiew et al., 2002) for APET, PPET and PM56 as well as lower values for PT. Lascano et al. (2010) as well as Lascano and Van Bavel (2007) have shown that methods to calculate ET o based on combination methods, i.e. Penman-Monteith, tend to underestimate ET by as much as 25 %, especially in dry climates. Bormann (2011) further recommended that the reliability of ET o equations should be tested in a spatial context, especially if applied on large scale. For various regions across Australia, a large range of mean annual ET o between 1700 mm (PT) and 3670 mm (PPET) was reported (Donohue et al., 2010). To investigate the spatial heterogeneity within the MDB we analysed results of the 34 class-A pan stations. Overall, the performance of four of the ET o methods was good with RMSEs around 1 mm day −1 , except for three stations in the north. PPET performed less well with RMSE increasing to 2 mm day −1 while the PT value ranged up to 4 mm day −1 . However, we found no consistent spatial pattern. We are aware that the utilization of class-A pan data comes along with uncertainties. We did not assume that the data are error-free, but for the application of REA, a comparison of model simulations and observations is needed to calculate the model performance criterion. We could have treated PM56 as being an "observation" in terms of a benchmark model. However, we think that a more independent test is more appropriate in the sense of REA and therefore decided to use those observations that are at hand: class-A pan observations. To account for the difference of class-A pan evaporation and reference crop ET, we used a commonly applied correction factor (pan coefficients according to McMahon et al., 2013) to derive crop ET from class-A pan measurements. Most often, ET estimates are not compared to any measurements at all, leaving modellers with no information on how good their model application is. We therefore think that a comparison to class-A pan is for sure not perfect but better than not testing at all.
ET o estimates using the PM56 method revealed the best performance criteria in our study. PM56 considers the most meteorological input parameters -thereby possibly best representing the altering dry and wet conditions across the MDB over the year. The better performance of physically based equations in comparison to more empirical approaches for the simulation of ET o has also been reported by others (Donohue et al., 2010). PT performed least well in our study and resulted in up to 2 times larger estimates than other ET o Figure 5. Average equally weighted (a) and REA-weighted (b) irrigation water requirement during the growing season of wheat . Dots indicate irrigated cropping areas (n = 3969; cell size = 1 × 1 km) (note: a buffer has been used to increase the visibility of the single grid cells). (c) illustrates the difference between both IRR calculations (b-a). (d and e) show the root mean square error between the 30 realizations and the equally weighted (d) and REA-weighted (e) averages as well as the difference (f) between both calculations, respectively. methods. This is somewhat contrasting with other studies (Chiew et al., 2002;Donohue et al., 2010) where PT gave lower ET o values in comparison to methods such as APET and PPET. One reason is that Donohue et al. (2010) have considered the actual albedo from remotely sensed vegetation cover (Donohue et al., 2008) for the estimation of the net incoming solar radiation. In our calculations, an albedo of a reference crop 0.23 (short crop, i.e. grass) has been considered according to the guidelines for ET o from Allen et al. (1998). Another likely reason for this observation is that the PT equation is based on the Penman-Monteith equation in which the aerodynamic term is replaced by a constant (α) which is commonly set to 1.26 under Australian climatic conditions (Chiew and Leahy, 2003) and which we also applied. The consideration of region-specific α for the MDB could have increased the performance of PT in our study. The HS equation is commonly applied in situations where meteorological data are scarce, because the equation depends on more readily available temperature and extraterrestrial radiation derived from latitude and day of the year. A reason for its good performance in our study could be that the semi-arid climate in most of the MDB is favourable for the HS equation, which is supported by Tabari (2010), who concludes that HS is a good candidate model for warm humid and semiarid sites but fails under cold humid climates. However, the poor response of HS to changing climatic boundary conditions has also been criticized in a study on global drought simulations (Sheffield et al., 2012).
We combined the six ET o methods with five K c sets to address stochastic parametric uncertainty for irrigated wheat in the MDB. We show that the ET o method uncertainty range exceeded the uncertainty range of K c sets. Thus, the K c sets have a minor influence on predicted IRR. At first sight, this seems to be contrasting to others who have stated that adapted, regional K c sets are required to estimate reliable IRR rates. For instance, da Silva et al. (2013) reported that K c sets from FAO56 lead to errors in plot-scale irrigation planning under tropical conditions. Similar observations were reported for semi-arid conditions in the Texas High Plains region (Ko et al., 2009), highlighting the importance of regionally based K c sets. While regional adaptation of K c might be important at smaller scales, e.g. on the farm level, we conclude that large-scale applications do not necessarily need to focus on this potential contribution of uncertainty. Rather, effort should be put into finding appropriate ET o methods oreven better -utilize ensemble predictions to cover a more realistic range of predictions. Our study confirms this latter recommendation, as we could not identify a single best ET o method for the MDB. Especially in cases where no data for a direct evaluation of model results are available, the application of model ensembles gives insight to the predictive uncertainty, e.g. being helpful in the development of best management practices (Exbrayat et al., 2013), study of land use , and climate change (Exbrayat et al., 2014).
Besides the uncertainty introduced by local to global K c values, the utilization of the single crop coefficient concept itself comes along with errors, which are not addressed in this study. For example, Lascano (2000) shows how K c varies as a function of time (50 days) and how it changes when using a daily, 3-day, and 8-day moving average. Moreover, the temporal resolution of ET o calculation (i.e. hourly vs. daily) is an important component and errors associated with the method of irrigation (surface, drip, sprinkler) cannot be neglected, but they are beyond the uncertainty calculation of this study. We acknowledge that we do not consider uncertainties in boundary conditions (e.g. land-use management options, climatic variability) although these may be non-negligible. For example, Bocchiola et al. (2013) reported that changes in future precipitation regimes will have the greatest impact on the calculated water footprint (reflecting high ET rates) of maize in Italy and that changes in CO 2 and warming were less important. Conversely, water use was more driven by agricultural management than by regional climatic variation in a water footprint analysed for an irrigation district in China (Sun et al., 2013). Statistical correction of model forcing data (such as bias correction of precipitation) has also been reported to alter ET estimates as shown by Ye et al. (2012) for the Upper Yellow River in China with changes of up to 29 % of ET. Beyond that, the forcing data themselves introduce additional uncertainties. However, this is not part of this study and it would clearly go beyond the scope of our work presented here. Nevertheless, on the long term more research needs to be done on the investigation of the global predictive uncertainty of models, where all sources of uncertainty are evaluated, i.e. spatial input data uncertainty (e.g. soil and land use information), model forcing data uncertainty (e.g. climate data), parameter uncertainty, and model structure uncertainty. Thus, an even more complete picture of global model uncertainty can only be shown by considering all sorts of predictive uncertainty, including model input data, validation data, and spatial input data in addition to the impact of model structural and parametric uncertainty as presented here.
However, we argue that future management practices or the impact of climate change cannot be reliably evaluated due to the large uncertainty that exists in the ET o method, the basis of water resources modelling. We partially cope with this problem by applying the REA technique to extract the most relevant information from our simulations. The advantage of REA in decision making has already been shown for other fields of research, such as the development of N reduction scenarios to improve surface water quality (Exbrayat et al., 2013) and estimation of the effect of land use change on water budgets and hydrological fluxes . Despite the growing importance of IRR for today's agriculture (Siebert and Döll, 2010) and the effect on surface  and groundwater (Wada et al., 2010) resources, few studies have dealt with the predictive uncertainty of this requirement (e.g. Wada et al., 2013) and how to reduce it.