Using multi-model averaging to improve the reliability of catchment scale nitrogen predictions

Introduction Conclusions References


Introduction
Nowadays, mathematical models are often used to assess the impact of changes in boundary conditions on a natural system.More precisely, in the hydro-biogeochemical context, they are used to study the effect of changes in management practices (e.g.fertilisation rate), climate and land-use cover (e.g.clear-cutting, reforestation) on the water and nutrient balances (e.g.Arheimer et al., 2005;Breuer and Huisman, 2009;Zammit et al., 2005).Most of the time, the adopted methodology is to use a single model calibrated to match well with current conditions.Then, some modifications mimicking real world changes are imposed on the relevant boundary conditions resulting in a set of scenarios.The actual scenario prediction is produced by re-running the model with these updated drivers.Impacts can be estimated as the difference between the original model outcomes and the altered ones in either a relative or an absolute way.Optimally, these predictions should be compared to the actual post-change observations to assess their reliability but, in the case of land-use or climate change, this is seldom done as such data are typically not available.Nevertheless, some major concerns arise from this straightforward methodology in catchment scale hydrobiogeochemical model predictions.
First, natural processes involved in the water and nutrient balances (e.g.infiltration, denitrification) are described with a set of equations: the model structure.This primarily represents a translation of our understanding of natural mechanisms and regulating factors into mathematics.Because of the differences in the hydro-climatic and nutrient contexts between catchments, processes represented in a model can be adjusted by some conceptual parameters that are usually difficult to measure like the inorganic nitrogen retention rate in HBV-N (Arheimer and Brandt, 1998).The corresponding calibration procedure aims at finding the parameter values for which the agreement between observations Published by Copernicus Publications on behalf of the European Geosciences Union.and simulations is acceptable, based on some goodness-of-fit criteria (e.g.Legates and McCabe Jr., 1999).Although a lot of effort has been put into developing ever more efficient optimisation algorithms for the last two decades (Duan et al., 1992;Vrugt et al., 2003), the ability of these models to adequately simulate the impact of changed boundary conditions is of concern (Huisman et al., 2009), especially since predictions are almost never validated against post-change observations (Whitehead et al., 2002).
Second, it is now widely acknowledged that several parameter sets may perform equally well (Beven, 2006) and that the outcome of a successful calibration procedure may indeed not be the actual best result.Therefore, an option to address the uncertainty in predictions, especially in the case of scenario predictions, is to use ensembles of multi-model predictions gathering the information content of several simulations.Single-model ensembles regroup predictions obtained with the same model structure whilst altering parameter values and boundary conditions in a Monte-Carlo procedure like the GLUE methodology for example (Beven and Freer, 2001).Nevertheless, part of the predictive uncertainty is also linked to sometimes huge differences between model structures developed to address the same issues.As stated by Breuer et al. (2008) this is especially true in the context of hydro-biogeochemical predictions.In order to cope with structural uncertainties, it has become state-of-the-art to consider more than one model of the same system.These ensembles of predictions have been used in the fields of climate, weather, flood forecasting, rainfall-runoff and sub-surface flow, and a first multi-model comparison approach targeting agricultural fluxes of nitrogen was published by Diekkrüger et al. (1995).But to our knowledge, the ensemble methodology has received little interest in the nutrient fluxes context to date, in spite of the demonstrated improvement in prediction reliability.Furthermore, the few available studies, including previous publications by our working group, have only been based on re-prediction (hindcasting) efforts rather than scenario analyses (Exbrayat et al., 2010(Exbrayat et al., , 2011;;Kronvang et al., 2009).Therefore, we present here an example of the potential advantage of using multi-model predictions to assess the impact of a simple management change on the nutrient balance of a well-monitored mesoscale catchment in south-west Western Australia.

The Ellen Brook catchment
The Ellen Brook catchment (570 km 2 ) is located in coastal SW Western Australia and contributes significantly to the water (6 %) and N loads (10 %) entering the Swan-Canning estuary that drains the city of Perth (Viney and Sivapalan, 2001).Most of the catchment has been cleared for agricultural purposes (Swan River Trust, 2007).Hydro-climatic conditions are typical of a Mediterranean influence with a mean annual rainfall ranging from 510 to 830 mm yr −1 (1989)(1990)(1991)(1992)(1993)(1994)(1995)(1996)(1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006), derived from inverse distance weighted interpolation from the 4 Australian Bureau of Meteorology rain gauges located in the catchment.Intra-annual precipitation is distributed in cool and wet winters and warm and dry summers corresponding to high flow (May-June to September-October) and low to no flow periods (October-November to April-May), respectively (Fig. 1).Pan evaporation is high (∼2000 mm yr −1 ) and because of the sandy nature of the soils, runoff is mostly generated as a quick and peaky response to rainfall events which explains a fivefold difference between minimum and maximum annual discharge over the study period.Soil texture does not allow the adsorption of large quantities of dissolved organic matter (Petrone et al., 2009).Furthermore, dissolved organic nitrogen that accumulates in the groundwater slowly discharges in high concentration into the surface water during the driest months (Donohue et al., 2001).
As shown in Fig. 1, about 10 % of the TN flowing out of the Ellen Brook catchment is in the form of dissolved inorganic N (NO 3 -N and NH 4 -N) derived from animal wastes and fertilisers used for agriculture and private gardens (Swan River Trust, 2007).Dominant organic N forms are either present in dissolved forms of degrading matter or particles composed of plant and animal debris.Concentrations of all N forms rise up during autumn and winter (May-September) because they are flushed with surface runoff.They fall in early spring (September-November) as rainfall, hence runoff, decreases in intensity (Fig. 1).Slight increases in concentrations in December (Fig. 1) may be attributed to either evapotranspiration induced concentration phenomena or animals entering the stream more frequently during these hot periods (Swan River Trust, 2007).
Eutrophication-driven algal blooms have become frequent in the Swan-Canning estuary as a result of nutrient losses from upstream catchments cleared for agricultural purposes such as the Ellen Brook (Swan River Trust, 2009).This has led local authorities to set a target of nutrient loss reduction from upstream catchments of 50 % via different management options: stream bank fencing to reduce animal wastes and erosion, re-vegetation to stabilise river banks, increased community awareness to encourage reductions in fertiliser use, nutrient traps, improved monitoring of hot spots.Meanwhile, a large monitoring effort has been undertaken and more than 900 daily samples of total nitrogen (TN) concentrations are available at the Ellen Brook outlet out of a total of 3870 days with runoff between 1989 and 2006.Over this period, mean TN concentration was 2.1 mg N L −1 with values ranging from 0.3 to 7.4 mg N L −1 with no significant long-term temporal trend.This rich dataset allows a reliable application of our model ensemble.

Model cohort
The more independent the predictions within an ensemble are, the more errors tend to cancel each other (Abramowitz and Gupta, 2008).Therefore, in a scenario analysis context multi-model ensembles (MMEs) are preferred to multiple realisations of the same model structure in order to avoid results biased by an eventually inadequate model structure.There are however not many freely available nutrient mobilisation and transport models developed for mesoscale catchments (100-10 000 km 2 ).A recent review by Breuer et al. (2008) listed a total of 8 model approaches that are used to simulate the N cycle in catchments.Among these 8 model structures, several are actually modifications of the same common ancestor (i.e.SWAT); hence they share parts of their parameterisations.
In this study, we set up four conceptual model structures to describe the water and nitrogen balances of the Ellen Brook catchment at a daily time step.The ensemble includes LAS-CAM (Sivapalan et al., 1996a, b;Viney et al., 2000), CHIMP (Exbrayat et al., 2010), SWAT (Arnold et al., 1998) and HBV-N-D (Lindgren et al., 2007).Table 1 summarises the main features of each model and a short description follows.Our ensemble seems in fact to cover a large part of the available modelling philosophies reported by Breuer et al. (2008) in terms of simulated N-species, turnover processes as well as spatial distribution.
The simplest model, LASCAM, only splits the basin into lumped subcatchments over which the land-use cover is considered homogeneous.At each time step, the water balance is solved for each subcatchment and surface runoff, subsurface flow and baseflow discharge into the corresponding stream.Since it has been developed for semi-arid and hot regions where temperature is not a limiting factor, LASCAM does not require temperature input.Therefore, only substrate availability governs the represented soil N turnover processes that affect the three considered N-species (NO 3 -N, NH 4 -N, and TN): residue decay, plant harvest, mineralisation, volatilisation, plant uptake, nitrification, denitrification and fixation (Viney et al., 2000).Nutrients discharging from land into the stream are routed to the catchment outlet.
CHIMP is a more complex semi-distributed model which further divides the sub-catchments into land-use classes (Exbrayat et al., 2010).Water and nutrient balances are calculated for each of them before their outcome is weighted by the respective relative area over the sub-catchment.Since the recent implementation of an organic N store (Exbrayat et al., 2011), the same N-species as in LASCAM are considered, but temperature has a positive effect on the soil N turnover processes of plant uptake, nitrification, denitrification, fixation, mineralisation and immobilisation.Unlike in LASCAM, in-stream denitrification and nitrification processes can also occur.
The well-known SWAT model adopts a more detailed spatial distribution scheme by considering each single combination of land-use and soil type as an independent hydrological response unit (HRU).Water balance and different moistureand temperature-controlled N turnover processes are simulated for each HRU: plant uptake, residue decay, mineralisation, nitrification, volatilisation, denitrification, fixation and leaching.Re-infiltration from the stream is also allowed along with algal respiration and uptake.Amongst our four models, SWAT requires the most data and the multiple input files were directly generated from GIS data (Olivera et al., 2006).
Whereas the three previously described models are semidistributed with nested subcatchments discharging into www.geosci-model-dev.net/6/117/2013/Geosci.Model Dev., 6, 117-125, 2013 another only via stream flow, the fully distributed HBV-N-D (Lindgren et al., 2007) simulates the water and nutrient balances for each 100 × 100 m grid cell across the Ellen Brook catchment.Each pixel has its own land-use class with corresponding parameters and each grid cell flows into the adjacent downstream one following a single-flow-direction algorithm.HBV-N-D only considers TN and a single retention process assumed to represent the net effect of denitrification, uptake and sedimentation as a function of temperature and substrate availability.
Because of this difference in the spatial representation of the catchment within HBV-N-D, there is a massive difference of up to 3 orders of magnitude in the number of spatial units required to cover the Ellen Brook catchment (Table 1).The required boundary conditions and spatial disaggregation schemes within each model are summarised in Table 1, along with our catchment-specific setup properties.Discrepancies in considered nutrient species, and relevant turnover processes, represent a sample of the large structural differences that exist in hydro-biogeochemical models (Breuer et al., 2008).
The setup process of the different models to simulate the behaviour of the Ellen Brook catchment is similar (but not identical) to the one previously used by Exbrayat et al. (2011) and is only briefly described hereafter.First, the hydrological component of each model was calibrated with the SCE-UA (Duan et al., 1992) by reducing the root mean square error (RMSE) of observed vs. predicted daily runoff between 1989 and 1997.Then, by keeping the calibrated hydrological parameters fixed, parameters governing the different N mobilisation and transport processes were also optimised with the SCE-UA algorithm set to minimise the RMSE of daily TN loads.Years 1998 to 2006 were used for validation and scenario purposes.Applying genetic calibration algorithms such as SCE-UA neglects parameter uncertainties by aiming at finding the global optimum parameter set.We are well aware of this stochastic component of model uncertainty which we have dealt with in previous work (Exbrayat et al., 2010).Considering model realisations with different parameter sets results in single model ensembles that still follow the same model structure.In the present work we are focusing on different model structures rather than on the uncertainty inherent to each model.We do this in order to test whether a consideration of completely different model philosophies results in a more reliable scenario forecasting.
One of the ways to fulfil the requirements of the Swan Canning Water Quality Improvement Plan is to reduce the diffuse source of total nitrogen (TN) that comes from fertiliser application (Swan River Trust, 2007).Here, in order to illustrate the reliability ensemble averaging (REA) philosophy with a simple example, we apply some very straightforward scenarios of changing agricultural management practices (i.e.fertiliser reduction) over the catchment for the period 1998 to 2006.For each new simulation, the current fertiliser application rate of 30 kg N ha −1 yr −1 in the form of ammonium (Zammit et al., 2005) is stepwise decreased by 10 % of its original value and the models are re-run for the validation period.Then, we apply the REA weighting scheme described hereafter to all single predictions.

Reliability ensemble averaging
Previous studies on multi-model averaging techniques set in a variety of environmental modelling contexts have demonstrated that the simple mean of a MME usually outperforms its members taken separately in terms of goodness-of-fit metrics (Georgakakos et al., 2004;Shamseldin et al., 1997;Viney et al., 2009).However, it has also been shown that giving more weight to the already better performing members tends to provide an overall more reliable prediction (Exbrayat et al., 2010;Krishnamurti et al., 1999;Viney et al., 2009).In this case, a "performance" coefficient R B weights each single prediction according to either a goodness-of-fit metric (e.g.RMSE), multiple-linear regression methods or more sophisticated techniques like Bayesian Model Averaging (Raftery et al., 2005).
Following this, Giorgi and Mearns (2002) proposed to also consider the level of agreement between the models in response to the same changes in boundary conditions in the weighting scheme.The underlying philosophy is that the influence of a very well-calibrated model on the final prediction should be dampened if it provides a completely different response than the other models to the same changes.In that sense, outlying predictions are penalised by the introduction of a "convergence" coefficient R D favouring more central predictions in the weighting scheme.Although primarily designed for climate studies, the so called Reliability ensemble averaging (REA) method has been recently adapted to scenario analyses of the impact of land cover change on runoff (Huisman et al., 2009).Put in a mathematical way, the final weight R i assigned to each member of the MME can be summarised as where B i and D i are measures of the performance and convergence for model i, respectively.The term ε corresponds to a measure of the variability in TN export, expressed as the difference between the highest and smallest observed values.Following Huisman et al. (2009), B i corresponds to the model bias in simulating present-day TN export, i.e. the relative difference between simulated and observed TN export on days with measurements.The term D i is a measure of the distance between the change predicted by a model i, and the REA average change such as where TN i is the relative change of TN export predicted by model i, and N the number of models in the ensemble.The REA average change is not known beforehand and it is obtained iteratively following Giorgi and Mearns (2002).One of the key points of the REA method is that R B,i or R D,i are set to 1 whenever B i or D i are smaller than ε, respectively.Assuming that the probability density function of the change is somewhere between uniform and Gaussian, a 60-70 % confidence interval is represented by the REA average change plus and minus the weighted root mean square difference (RMSD) such as
Results for TN shown in Fig. 2 for both the calibration and validation period are first of all dominated by SWAT.Despite being a model explicitly set up for water quality simulations SWAT overestimates the annual TN exports for almost all years and a visual inspection of Fig. 2 attributes it to peaks of TN losses up to an order of magnitude higher than observations.This is confirmed by high RMSE values in Table 3.However, the average TN export simulated by SWAT on sampled days compares well with observations, especially during the calibration period (Table 3).This might be explained by a well-constrained calibration of SWAT for days with observations.Regarding TN, the LASCAM model performs the best for both periods in terms of RMSE whilst CHIMP gives the closest daily average TN export predictions as compared to the observation data.For CHIMP and HBV-N-D, prediction quality increases between calibration and validation, whilst the opposite is observed for LAS-CAM and SWAT.The load calculations depend in part on the correctness of the simulated hydrological fluxes and accordingly LASCAM and CHIMP provide the two best simulations in terms of runoff and TN losses.However, this is not always true as SWAT always outperforms HBV-N-D for Table 3. Model calibration (1989Model calibration ( -1997Model calibration ( ) and validation (1998Model calibration ( -2006) )  runoff predictions but has the worst RMSE for TN predictions, especially during the validation period.Generally, the models simulated less TN export during validation than during calibration.The highest TN export is simulated by SWAT with ∼131 and ∼118 t N yr −1 during calibration and validation, respectively.This corresponds to almost 4 times more export than HBV-N-D predictions (∼34 and ∼31 t N yr −1 ).According to Fig. 3 which represents the exceedance probability of daily TN losses simulated by the models, it seems that this difference is due to some rare events of intensive TN export predicted by SWAT.Meanwhile, LASCAM and CHIMP are in a better agreement with each other over the whole period.This is especially true for the simulated export rates of ∼83 and ∼85 t N yr −1 for the calibration period by LASCAM and CHIMP, respectively.Corresponding values of ∼60 and ∼69 t N yr −1 for the validation period differ a bit more but are still the most similar amongst all the models.As illustrated in Fig. 3, LASCAM simulated more frequent daily TN exports greater than 1 t N d −1 than CHIMP, whereas CHIMP's higher probability of lower N losses and less frequent no flow occurrence explains its higher average yearly TN export.
The validation period also corresponds to the control scenario.We therefore present corresponding results for a simple average of the predictions and the REA average in Table 2. Here, the REA average is only calculated with the reliability criterion as no perturbations have yet been made to our system.The simple average performs with a RMSE equal to 8.6 g N ha −1 d −1 which is worse than LASCAM but better than the other three models.However, the corresponding average export on sampled days is, at 0.42 t N d −1 , closer to the observed 0.41 t N d −1 than any of the single models.Meanwhile, the REA average outperforms all the ensemble members with a value of 6.5 g N ha −1 d −1 .This represents an improvement of about 10 % compared to LASCAM, the best performing single model.The simulated mean export on sampled days equals the observed mean.diminution of the TN export after a reduction in fertiliser application.The responses of LASCAM, SWAT and HBV-N-D to the changes in management practices are comparable to each other, with total reductions of less than 10 % when no fertiliser is applied.Conversely, CHIMP presents a totally different behaviour with a reduction of up to 80 % of its initially simulated TN export.The simple mean provides intermediary predictions towards a ∼25 % TN export reduction with no fertiliser.Nevertheless, the REA average change is well in agreement with HBV-N-D, LASCAM and SWAT with a reduction in TN export below 10 %.The shaded area in Fig. 4 represents ∼60-70 % of the uncertainty (REA average ± RMSD) of the change and always includes these 3 models but not CHIMP.The simple averaging scheme is not in the uncertainty bounds of the REA for reductions of more than 30 % in fertilisation, and moves further away from it when the reduction increases.

Discussion
Consistently with previous work in hydro-biogeochemical modelling by Breuer et al. (2008), Exbrayat et al. (2010) or Kronvang et al. (2009), discrepancies between model structures (Table 1) driven by a homogeneous dataset of boundary conditions are a source of large predictive uncertainty.Interestingly, the more lumped models LASCAM and CHIMP seem to perform better in estimating the nutrient losses than the more distributed ones.This may be due to conceptualisations of both hydrological and N cycles more adapted to the Ellen Brook conditions.In addition, LASCAM has originally been developed to predict the water, salt and nutrient balances in SW Western Australian catchments including the Ellen Brook (Viney et al., 2000;Zammit et al., 2005).Simulation of the water balance greatly impacts nutrient losses and RMSE is more sensitive to the correct timing of peak events.As shown in Table 2, SWAT and HBV-N-D runoff predictions are of quality comparable to CHIMP during validation.However, Figs. 2 and 3 as well as Table 3 clearly suggest that SWAT globally overestimates and HBV-N-D underestimates the TN losses, i.e.SWAT good matching of peak events is accompanied by a constant high discharge while HBV-N-D simulates lower flows.Nonetheless, since our aim is to quantify a relative change in total TN export in response to reductions in fertilisation rates, we do not reject any of the models for our application.Interestingly, the REA average outperforms any of the other simulations in the control case for which we have data to compare with, therefore giving more credit to the approach.The most striking feature in Fig. 4 is the behaviour of the CHIMP model during the scenario analysis.In spite of its good calibration and validation results, CHIMP simulates a reduction of up to 80 % in TN export while all the other models seem to be more in agreement with a total reduction not higher than 10 % of the current TN export.Therefore, we could attribute the acceptable calibration results of CHIMP as the outcome of a successful curve-fitting exercise in which the apparently plausible parameter values are, in fact, incorrect (Wade et al., 2008).Further, because of the outlying position of CHIMP, the simple mean provides a final prediction equivalent to an almost 25 % reduction in nitrogen losses when no fertilisation occurs.However, the trust we can put in this projection is questionable since it is not really in agreement with any of the single projections and its intermediary position is merely a result of very different but equally weighted projections.
When the agreement between models is introduced into the REA weighting scheme, the converging responses of the LASCAM, SWAT and HBV-N-D models to changed conditions provide a significantly different final prediction than the former simple averaging scheme.Similarly to some of the well-calibrated models in Huisman et al. (2009), the outlying position of CHIMP decreases its reliability in the final weighing scheme.Conversely, and in spite of their relatively poor ability to match current conditions, SWAT and HBV-N-D "attract" the final averaged prediction by being consistent with each other, and LASCAM, in their relative response to the management scenarios.This results in a final REA average prediction that looks more consistent with most of the single models.
Of course, one could argue that the ensemble approach is not entirely justified in our case because LASCAM is a well-calibrated model that also presents the expected behaviour during scenario analyses.However, contrary to the other models, LASCAM was primarily developed and tested to simulate water and nutrient fluxes in this particular catchment (Viney et al., 2000).In another application case, it is not sure that the chosen model structure would have been developed over several years to predict the hydro-biogeochemical fluxes of the catchment of interest, nor that there would be enough monitoring data to support model quality assessment.Similarly, although we agree that CHIMP's source code needs a thorough inspection in the near future, detection of probable quirks in its structure would not have been possible without comparing its predictions with other models in these scenario analyses.
Nonetheless, the results obtained with the adopted averaging method are a good demonstration of its ability to extract the most reliable content of information from each ensemble member (Giorgi and Mearns, 2002).This is achieved in spite of the relatively small size of our ensemble when compared to studies published in other fields such as hydrology (e.g.Breuer et al., 2009;Georgakakos et al., 2004) or climatology (e.g.Krishnamurti et al., 1999).We do however argue that ensemble studies focusing on water quality and nutrient losses are still rare in the literature and this contribution is a www.geosci-model-dev.net/6/117/2013/Geosci.Model Dev., 6, 117-125, 2013 further step in the innovative direction adopted by our working group as documented in previous contributions (Exbrayat et al., 2010(Exbrayat et al., , 2011)).Therefore, we consider our results to be very valuable in the frame of hydro-biogeochemical predictions (Breuer et al., 2008) and this method could surely be helpful for application cases in which the absence of monitoring would make it hard to identify the most appropriate structure (Huisman et al., 2009), such as land management scenarios or prediction in ungauged basins (Sivapalan, 2003).This is especially true since we usually rely on models developed and calibrated for stationary and not changing conditions (Milly et al., 2008;Sivapalan et al., 2011).

Conclusions
Through our straightforward example of fertilisation rate reduction we demonstrated the potential advantage of using a multi-model ensemble to lower the risk of relying on a single, maybe subjectively chosen, model structure.This is a real advantage in our application case since the actual effects of different changes are not yet known, making the evaluation of model quality impossible.So far, REA and similar averaging schemes have been primarily applied in climate and hydrological sciences and more work is still required in this direction to address their effect on predictions.We therefore see some potential in the ensemble approach in other fields of environmental modelling where the structural uncertainty of models used for predictions is large and rarely addressed.

Fig. 1 .
Fig. 1.Seasonal cycle and relative contribution of different species to TN (pie chart) in the Ellen Brook (1989-2006).Missing values in March and April correspond to no flow periods.

Fig. 2 .
Fig. 2. Observed and predicted daily discharge and daily total N export during calibration and validation periods for the model cohort.

Fig. 3 .
Fig. 3. Exceedance probability of daily TN exports as predicted by the models during the validation period (1998-2006).

Table 2
, as reflected in the time series, LAS-CAM performs clearly better in predicting discharge during the calibration while HBV-N-D has the largest RMSE of 2.31 m 3 s −1 , more than twice LASCAM's, as a result of the mismatch from 1989 to 1991.CHIMP and SWAT present intermediary values during the calibration.The range of RMSE in the ensemble narrows during the validation period.This is due to both LASCAM's RMSE increasing to 1.11 m 3 s −1 results.