Nutrient data from catchments discharging to receiving waters are monitored for catchment management. However, nutrient data are often sparse in time and space and have non-linear responses to environmental factors, making it difficult to systematically analyse long- and short-term trends and undertake nutrient budgets. To address these challenges, we developed a hybrid machine learning (ML) framework that first separated baseflow and quickflow from total flow, generated data for missing nutrient species, and then utilised the pre-generated nutrient data as additional variables in a final simulation of tributary water quality. Hybrid random forest (RF) and gradient boosting machine (GBM) models were employed and their performance compared with a linear model, a multivariate weighted regression model, and stand-alone RF and GBM models that did not pre-generate nutrient data. The six models were used to predict six different nutrients discharged from two study sites in Western Australia: Ellen Brook (small and ephemeral) and the Murray River (large and perennial). Our results showed that the hybrid RF and GBM models had significantly higher accuracy and lower prediction uncertainty for almost all nutrient species across the two sites. The pre-generated nutrient and hydrological data were highlighted as the most important components of the hybrid model. The model results also indicated different hydrological transport pathways for total nitrogen (TN) export from two tributary catchments. We demonstrated that the hybrid model provides a flexible method to combine data of varied resolution and quality and is accurate for the prediction of responses of surface water nutrient concentrations to hydrologic variability.
Surface water nutrient concentrations have been significantly increased by human activities (Forio et al., 2015) due to urbanisation, waste discharges and agricultural intensification (Liu et al., 2012; Kaiser et al., 2013; Li et al., 2013). Increased nutrient concentrations and loads in streams alter the biogeochemical functioning and biological community structure in receiving estuaries (Jickells et al., 2014; Staehr et al., 2017), leading to an increased incidence of harmful algal blooms (Domingues et al., 2011), anoxia and hypoxia (Li et al., 2016; Testa et al., 2017) and reduced water availability (Heathwaite, 2010). Analysis of tributary water quality data over time is therefore essential to compute incoming nutrient loads, support policy and plan remediation measures.
Water quality data, however, often have constraints that make it challenging to analyse long- and short-term trends. Firstly, water quality data often have non-linear responses to environmental factors and show high-order interaction effects between different environmental variables. Moreover, nutrients can derive from different sources (point or non-point) in the landscape and are transported to receiving waters through different water pathways subject to varied catchment hydrological conditions and human intervention (Hirsch et al., 2010; Lloyd et al., 2014). Additionally, tributary nutrient datasets often are sparse in both space and time, due to the high cost of fieldwork and chemical analysis (Lamsal et al., 2006; Forio et al., 2015). Historical and current water quality monitoring programmes often use low-frequency sampling regimes on a weekly to monthly basis (Halliday et al., 2012). When monthly averaged concentrations are used, calculated nutrient loads to receiving environments such as lakes or estuaries may be poorly estimated (Cozzi and Giani, 2011), with high variability in the estimated loads (Jordan and Cassidy, 2011). It is also common to have patchy availability of nutrient species data across a study area, and combining datasets from different projects and analytical laboratories makes the analysis of long-term trends fraught with uncertainty. For instance, total nitrogen (TN) and total phosphorus (TP) concentrations within catchment outflows may have been monitored for decades, while dissolved organic nitrogen (DON) and dissolved organic carbon (DOC) concentrations may have only been monitored recently, with the increasing recognition of their ecological importance (Górniak et al., 2002; Petrone et al., 2009; Erlandsson et al., 2011). 
Given the hydrochemical correlation between different nutrient species and high analytical cost, there are benefits in extracting maximum information from all available nutrient data, particularly relating to changes in water quality over time (Hirsch et al., 2010). In summary, while high-quality nutrient data from tributaries are typically required as input to water quality modelling of receiving waters, the reliability and accuracy of the trend analysis of tributary data are frequently restricted by data non-linearity, limited sample size and variable nutrient availability.
Various models for constructing tributary water quality data have been developed. For example, linear models (LMs) and generalised linear models (GLMs) that use correlations between concentration (C) and flow (Q) have long played a central role in stream water quality analysis (Cohn et al., 1989; Chanat et al., 2002). Some multivariate regression models have been applied to analyse the long-term trend (Li et al., 2007; Tao et al., 2010; Greening et al., 2014) and seasonal patterns (Giblin et al., 2010; Chen et al., 2012) of surface water nutrients. For example, a weighted regression on time, discharge and season (WRTDS) was introduced by Hirsch et al. (2010) and has been applied to a number of different water quality studies (Green et al., 2014; Zhang et al., 2016a, b, c).
Meanwhile, data-driven machine learning (ML) methods are increasingly being applied to quantify relationships between soil, water and environmental landscape attributes (Lintern et al., 2018; Wang et al., 2018; Guo et al., 2019). For instance, random forest (RF), a widely used ML method, was used to model the spatial and seasonal variability of nitrate concentrations in streams (Álvarez-Cabria et al., 2016). Gradient boosting machines (GBMs) were used to quantify relationships between land-use gradients and the structure and function of stream ecology (Clapcott et al., 2012). In contrast to process-based conceptual models, ML methods simulate relationships purely from the data (Maier et al., 2014) and have the ability to incorporate different types of variables (e.g. numerical or categorised variables); this is particularly suitable for systems with complex variable interactions and non-linear response functions (Povak et al., 2014).
While both process-based and ML models can manage non-linear interactions and be used to explore
long-term trends, they both have difficulty in fully extracting important hydrochemical information
embedded in nutrient data. Hybrid methods have been proposed for flow forecasting, to enhance the
performance of ML models by first using intermediate models to generate additional variables, which
are then used for subsequent modelling. For instance, a neural network model is first applied to
reconstruct surface ocean partial pressure of carbon dioxide (
Stream flow integrates water from multiple pathways resulting in a distribution of residence times. Stream nutrients are the product of overlapping historical inputs and reaction rates, which are spatially distributed and temporally weighted within the catchment (Abbott et al., 2016). Therefore, it is beneficial to understand nutrient transport pathways from the source to receiving waters, to analyse the long- and short-term trends of stream nutrient data; this knowledge will improve management strategies to reduce nutrient transport (Tesoriero et al., 2009; Mellander et al., 2012). In the analysis of the streamflow hydrograph, separating baseflow (the long-term delayed flow from storage) and quickflow (the short-term response to a rainfall event) from total flow is a well-established strategy to better understand transport pathways (Tesoriero et al., 2009). To utilise all available nutrient data and assess the impact of different transport pathways on stream nutrient concentrations, we developed a hybrid machine learning framework for surface water nutrient concentrations (ML-SWAN) that first separated baseflow and quickflow from total flow and then built intermediate models to generate missing nutrient species within the total nutrient pool, using relationships with baseflow, quickflow, rainfall and seasonal components. The generated nutrient data were included as additional variables for a final ML prediction. RF and GBM were employed and their performance compared in stand-alone mode and as a hybrid method.
This study aimed to compare model performance for nutrient concentration
prediction, to generate accurate daily nutrient data, to assess the impacts
of different water transport pathways on surface water nutrient
concentrations and to present a feasible framework for the application of
the hybrid method for surface water nutrient prediction. It was hypothesised
that the hybrid RF and hybrid GBM, which used pre-generated daily nutrient
concentrations and the separated baseflow and quickflow as additional
auxiliary inputs, would take advantage of the complementary strengths of
hydrochemical and hydrological relationships to provide the most accurate
and reliable nutrient predictions. To test this hypothesis, the hybrid RF
and hybrid GBM were compared to a linear model, a multivariate weighted
regression model (WRTDS), and stand-alone RF and GBM models, for the
prediction of TN, TP,
Our modelling goal in this study was to minimise the sum of the overall loss
function between the predicted nutrient concentrations and measured nutrient
concentrations.
LMs are the most commonly used tool to describe concentration–discharge (
RF and GBMs are ensemble models that combine multiple base learners inside the model to improve the prediction performance (Ishwaran and Kogalur, 2010; Singh et al., 2014). The ensemble methods are the main difference between RF and GBM. In RF, bootstrap aggregating is used to resample the original dataset with replacement. Hence, datasets with partial data are generated and then used to build individual base learners. Unlike bootstrap aggregating, GBM iteratively generates a sequence of base learners, where each successive base learner is built for the residual prediction of the preceding base learner (Friedman, 2001, 2002). The probability with which data points are selected for the next training set is not constant and equal for all data points. The selection probability increases for data points that have been misestimated in the previous iteration; data points that are difficult to classify would receive higher selection probabilities than easily classified data points (Yang et al., 2010; Erdal and Karakurt, 2013).
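The contrast between the two ensemble strategies can be illustrated with a minimal sketch. It is not the implementation used in this study: the base learner is a one-split regression stump on a single input, the random feature selection of RF is omitted, boosting is shown only in its residual-fitting form for squared-error loss, and all function names and default values (`n_trees`, `learning_rate`, `seed`) are hypothetical.

```python
import random

def fit_stump(x, y):
    """Fit a one-split regression stump on 1-D inputs by minimising squared error."""
    best = None
    for t in sorted(set(x))[:-1]:  # drop the largest value so the right leaf is never empty
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]  # (threshold, left leaf mean, right leaf mean)

def predict_stump(stump, xi):
    t, lm, rm = stump
    return lm if xi <= t else rm

def fit_bagging(x, y, n_trees=25, seed=0):
    """Bootstrap aggregating: each stump is trained on a resample drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(xs)) < 2:  # a resample with a single distinct x cannot be split
            continue
        stumps.append(fit_stump(xs, ys))
    return stumps

def predict_bagging(stumps, xi):
    """Average the predictions of the independently trained stumps."""
    return sum(predict_stump(s, xi) for s in stumps) / len(stumps)

def fit_boosting(x, y, n_trees=25, learning_rate=0.5):
    """Boosting: each stump is fitted to the residuals left by the preceding stumps."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        s = fit_stump(x, resid)
        stumps.append(s)
        pred = [pi + learning_rate * predict_stump(s, xi) for xi, pi in zip(x, pred)]
    return base, learning_rate, stumps

def predict_boosting(model, xi):
    base, lr, stumps = model
    return base + lr * sum(predict_stump(s, xi) for s in stumps)
```

In the bagging sketch the stumps are independent and could be trained in parallel, whereas in the boosting sketch each stump depends on the predictions accumulated so far.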
For RF and GBM, the most commonly used base learner is a classification and regression tree
(CART). A CART model is built to split the dataset into different nodes (Breiman et al., 1984):
Compared to LM and WRTDS models, one drawback of RF and GBM, as well as many ML methods in general, is that there is no specific equation in GBM or RF to directly demonstrate model structures. However, GBM and RF do provide the relative importance of each variable, which is based on the empirical improvement in the loss function due to the split on the specific variable in a tree (Povak et al., 2014; Puissant et al., 2014). The improvement of a certain variable was averaged over all trees and used as the relative importance of that variable for the final model. This relative importance serves as the key index to understand the model structure of RF and GBM (Makler-Pick et al., 2011).
Total flow is commonly conceptualised as including baseflow and quickflow
components (Meshgi et al., 2015). Baseflow separation
techniques use the time-series record of streamflow to extract the baseflow
and quickflow signatures from the total flow. This can be done by using
graphical methods to identify the intersection between baseflow and the
rising and falling limbs of the quickflow response (Szilagyi
and Parlange, 1998) or by filtered methods which process the entire stream
hydrograph to derive a baseflow hydrograph (Furey and
Gupta, 2001). In this study, the three-pass filtered method was applied
for baseflow separation; the quickflow was first estimated as described
below (Lyne and Hollick, 1979; Nathan and McMahon,
1990), and then baseflow was calculated:
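A recursive digital filter of the Lyne and Hollick type can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this study: the filter parameter `alpha = 0.925` is a commonly used value and is not stated in the text, quickflow is initialised at zero and clipped to lie between zero and the flow being filtered, and the function names are hypothetical. The EcoHydRology implementation may differ in initialisation and edge handling.

```python
def one_pass(flow, alpha):
    """Apply the quickflow filter once, in the order given, and return the baseflow."""
    quick = [0.0] * len(flow)  # assumption: no quickflow at the first time step
    for k in range(1, len(flow)):
        f = alpha * quick[k - 1] + 0.5 * (1 + alpha) * (flow[k] - flow[k - 1])
        quick[k] = min(max(f, 0.0), flow[k])  # keep 0 <= quickflow <= flow
    return [q - f for q, f in zip(flow, quick)]

def separate_baseflow(total_flow, alpha=0.925, passes=3):
    """Three-pass separation: forward, backward, forward over the previous baseflow estimate."""
    base = list(total_flow)
    for p in range(passes):
        if p % 2 == 1:
            base = one_pass(base[::-1], alpha)[::-1]  # backward pass
        else:
            base = one_pass(base, alpha)              # forward pass
    quick = [q - b for q, b in zip(total_flow, base)]
    return base, quick
```

Each pass smooths the baseflow estimate from the previous pass, so more passes generally give a lower and smoother baseflow hydrograph.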
In this study, the root mean squared error (RMSE) and the Nash–Sutcliffe
model efficiency coefficient (MEF) were used to compare model performance.
The RMSE is a measure of overall error between the predicted and measured
data and returns an error value with the same units as the data, which is
given by the following equation:
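Both metrics follow their standard definitions, which can be written as a short sketch (the function names are hypothetical). RMSE is the square root of the mean squared difference between measured and predicted values, and the Nash–Sutcliffe efficiency is one minus the ratio of the residual sum of squares to the sum of squares of the measured data about their mean.

```python
import math

def rmse(measured, predicted):
    """Root mean squared error, in the same units as the data."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)

def nash_sutcliffe(measured, predicted):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches predicting the measured mean."""
    mean_m = sum(measured) / len(measured)
    residual = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    variance = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - residual / variance
```

A lower RMSE and an efficiency closer to 1 both indicate better agreement with the measured data; negative efficiencies mean the model performs worse than the measured mean.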
The main aims of this research are to test the hybrid model, rebuild the historical nutrient data,
and explore the short- and long-term nutrient changes. The first step is verifying the model
performance. In this case, the data were randomly divided into
Variable list and descriptions.
The overall processes of ML-SWAN can be divided into three stages (Fig. 1). The first stage was
baseflow separation using the EcoHydRology package (Fuka et al., 2018). The generated baseflow,
quickflow, total flow and rainfall were further transformed into lagged data (the averaged values
over the previous 3, 7 and 15 days) to capture any short-term impacts of different water pathways
and rainfall on stream nutrients. JD,
Overall modelling processes of ML-SWAN.
The second stage of ML-SWAN was to build intermediate RF and GBM models that generated daily nutrient concentrations. For the intermediate RF and GBM models, only lagged hydrological data (including total flow, baseflow and quickflow), lagged rainfall and seasonal components on the training dataset were used. Nutrients were not used as predictors in the intermediate models. Note that, in this study, TP, TN, DOC and DON were selected to be generated in the second stage. If one nutrient was considered as the final target, the other three nutrients were used to generate daily data. For instance, daily TP, DOC and DON were generated as additional variables to predict TN. In that case, the missing TP, DOC and DON were generated by the intermediate model for the training dataset and the testing dataset. Daily TN, TP, DOC and DON data were generated and used for the final predictions. These nutrients were selected since they may be generated from similar sources or are important components of the total nutrient load. For instance, DOC and DON may both be generated from dissolved organic matter (DOM) (Seitzinger et al., 2002; Bernal et al., 2005; Filep and Rékási, 2011). In the catchments studied here, DON can be a dominant component of TN (Nice et al., 2009; Petrone, 2010; Bourke et al., 2015). The selection of DOC and DON for pre-generation may not necessarily be appropriate for other catchments. The selection of nutrients for pre-generation depends on data availability in the dataset. The use of different species of the same nutrient (N or P) can generally improve model performance.
The third stage of ML-SWAN built an additional hybrid model using the training data, which comprised the nutrient data generated by the intermediate models, lagged hydrological data, lagged rainfall data and seasonal components. Note that at this stage, the only difference between the stand-alone ML and hybrid ML methods was that the stand-alone ML did not use pre-generated daily nutrient data.
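The three stages can be sketched in a few lines. This is a structural illustration only, under several assumptions: `KNNRegressor` is a hypothetical stand-in for the RF and GBM learners used in this study, the lagged average excludes the current day and falls back to the current value on the first day, and the training and testing split is omitted for brevity. All names are hypothetical.

```python
def lagged_mean(series, window):
    """Average of the `window` values preceding each day (assumption: current day excluded)."""
    out = []
    for t in range(len(series)):
        prev = series[max(0, t - window):t]
        out.append(sum(prev) / len(prev) if prev else series[t])
    return out

class KNNRegressor:
    """Stand-in learner (mean of the k nearest training targets), not RF or GBM."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        out = []
        for row in X:
            dist = lambda i: sum((a - b) ** 2 for a, b in zip(row, self.X[i]))
            nearest = sorted(range(len(self.X)), key=dist)[:self.k]
            out.append(sum(self.y[i] for i in nearest) / len(nearest))
        return out

def hybrid_predict(hydro, nutrients, target, make_model=KNNRegressor):
    """hydro: daily feature rows; nutrients: name -> daily values, None on unsampled days."""
    augmented = [list(row) for row in hydro]
    # Stage 2: one intermediate model per auxiliary nutrient, using hydrological features only.
    for name, values in nutrients.items():
        if name == target:
            continue
        sampled = [i for i, v in enumerate(values) if v is not None]
        model = make_model().fit([hydro[i] for i in sampled], [values[i] for i in sampled])
        for row, generated in zip(augmented, model.predict(hydro)):
            row.append(generated)  # generated daily concentration becomes an extra input
    # Stage 3: final model on hydrological features plus the generated nutrients.
    target_values = nutrients[target]
    sampled = [i for i, v in enumerate(target_values) if v is not None]
    final = make_model().fit([augmented[i] for i in sampled],
                             [target_values[i] for i in sampled])
    return final.predict(augmented)
```

Dropping the stage 2 loop from `hybrid_predict` gives the stand-alone configuration, which makes the single difference between the two approaches explicit.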
Hydrological characteristics of the two tributaries.
To test the generalisability of the hybrid framework, two sites in Western Australia (Ellen Brook
and Murray River) were selected as study areas. Ellen Brook and Murray River are key tributaries for
the Swan–Canning Estuary and Peel–Harvey Estuary (Fig. 2), respectively, and have different hydrological
conditions. The Swan–Canning Estuary is located adjacent to the Perth metropolitan area, with an
area of approximately 40
The location of Ellen Brook and Murray River.
The Peel–Harvey Estuary is located approximately 75
Nutrient sampling time and sample size in Ellen Brook and Murray River.
Both Swan–Canning Estuary and Peel–Harvey Estuary experience a Mediterranean climate with cool, wet
winters (June–August) and hot, dry summers (December–March). The long-term average annual rainfall
varies from 1300
Overall, the scaled RMSE reduced from LM, WRTDS, stand-alone ML and hybrid ML for all nutrients
except
Model performance across six nutrients and the two sites:
Stand-alone ML achieved results that placed it between WRTDS and hybrid ML. Stand-alone GBM
achieved the highest accuracy for
In summary, the hybrid ML had the best performance amongst the six methods,
followed by stand-alone RF and GBM. WRTDS was better than the linear model
but could only achieve results similar to stand-alone RF and GBM for
Model performance for six nutrients was compared in the last section. To make this section more concise, these six models were then compared in their ability to generate daily TN in Ellen Brook from 1 January 1989 to 16 July 2018 (Fig. 4). The daily TN in Murray River and daily TP in both sites were also generated (see results in the Supplement). TN was selected because TN is the most important and most frequently measured nutrient in many places. This hybrid method can also be used for other nutrients. Note that all data points (not just the 80 % training dataset) were used to generate daily TN.
Daily TN generated by the six models for Ellen Brook.
The LM performed very poorly for TN prediction; low-concentration samples
(
Apart from the better performance for high-concentration data, another difference between stand-alone ML and hybrid ML was that the long-term trend in TN was consistent in stand-alone ML, but this trend fluctuated in hybrid ML. For instance, hybrid GBM results fluctuated from 1989 to 1999 and then showed an increasing long-term trend from 2005 to 2018, in addition to the seasonal fluctuation. The pre-generated nutrient data are the only difference between the stand-alone and hybrid models. If there are long-term trends in nutrient concentrations (e.g. TN), similar trends should also exist in the components of TN (either DON or dissolved inorganic nitrogen). The pre-generated nutrients emphasise this impact on the hybrid model. This suggests that the generated nutrient data could provide additional information that allowed the hybrid ML to capture long-term trends; this information was not included in the seasonal components but existed in the generated nutrient data.
The distribution of the daily TN generated by the six models and that of the measured TN data in Ellen Brook.
The distribution of the TN data generated by the six models was compared to the distribution of the measured TN data (Fig. 5). Similar to the results shown in Fig. 4, hybrid GBM had the most similar distribution to the measured TN data. Only a few low- and high-concentration data points were incorrectly predicted by the hybrid GBM. Hybrid RF also achieved a distribution similar to the measured data, but more extreme-value data were underestimated compared to the hybrid GBM. Stand-alone GBM and RF showed a similar distribution to the hybrid GBM and RF with less accuracy in the extreme data. Overall, GBM (either stand-alone or hybrid) reproduced the measured distribution more closely than RF. WRTDS generated some extremely high data and underestimated many low-concentration data, which is also seen in Fig. 4b. The linear model incorrectly predicted most of the TN data. The results in both Figs. 4 and 5 showed that hybrid GBM achieved the best simulated daily TN data, followed by hybrid RF, stand-alone GBM and RF. WRTDS and LM generated large biases in TN prediction.
The hybrid ML models predicted most of the extreme concentrations (Figs. 4 and 5), and only a few points were under-predicted. Under-prediction can arise from the limited number of extreme data points and from a model structure that balances overall trend prediction against extreme data prediction. Higher weights could be assigned to extreme data during model training to force the model to over-predict extreme concentrations, but this may reduce the accuracy of the overall trend prediction. In this study, our target was to understand the long-term nutrient trend; therefore, we did not use this technique during model training.
The daily data generated by the hybrid GBM showed a lower RMSE and better distribution than stand-alone ML, WRTDS and LM (Figs. 4 and 5). Compared to LM, WRTDS and simple CART models, one drawback of RF and GBM, as well as many ML methods in general, is that there is no specific equation in GBM or RF to directly demonstrate model structures. However, GBM and RF do provide the relative importance of each variable, which is based on the empirical improvement in the loss function due to the split on the specific variable in a tree (Povak et al., 2014; Puissant et al., 2014). The improvement of a certain variable was averaged over all trees as the relative importance for the final model. This relative importance serves as the key index to understanding the model structure of RF and GBM (Makler-Pick et al., 2011).
Variable importance in the hybrid GBM for TN prediction in
The variable importance for TN prediction by hybrid GBM in Ellen Brook and Murray River is presented in Fig. 6. The variable importance in the intermediate models is also included, and the length of coloured sections represents the importance of those variables in the hybrid GBM or intermediate GBM. The importance was scaled according to the most important variable. The generated DON and TP ranked as the first two critical variables in Ellen Brook, while all three generated nutrients were listed as the most important variables in Murray River. This suggests that the generated nutrients do provide critical information to the model and improve model performance. The quickflow was most important for the generated DON and TP, as well as the TN itself in Ellen Brook. The impacts of quickflow decreased, and baseflow, seasonal components and rainfall data became more important for TN prediction in Murray River. This difference in variable importance reflects different catchment characteristics across the two sites and therefore different hydrological and hydrochemical processes controlling TN concentrations. The total flow was not of high importance at either site, which suggests that baseflow or quickflow had more impact on surface water TN. Moreover, TN concentrations were affected by more variables in Murray River than in Ellen Brook.
Hydrological conditions, specific sub-catchment characteristics and the chemical properties of nutrients can all impact surface water nutrient concentrations (Barron et al., 2009; Moatar et al., 2016), nutrient partitioning (Ruibal-Conti et al., 2013) and nutrient transport (Burt and Pinay, 2005; Tesoriero et al., 2009). TN prediction in Murray River was impacted by more variables than in Ellen Brook (Fig. 6), suggesting more complex relationships in Murray River.
Quickflow is composed of runoff, interflow and direct precipitation (Brodie and Hostetler, 2005) and was shown to be important for TN prediction in Ellen Brook. Direct precipitation, however, did not have a large impact on TN (the green bars in Fig. 6); this suggests that runoff and interflow were important for TN concentrations. Baseflow can account for (on average) 53 % of annual stream discharge in Ellen Brook, but baseflow was not of high importance for TN prediction in this study. This may occur due to low TN concentrations in the baseflow (Barron et al., 2009), large areas of low nutrient-retaining sandy soils in the Ellen Brook catchment, and high nutrient transport efficiency in quickflow and first flush. Mellander et al. (2012) quantified nutrient transport pathways in agricultural catchments and found that quickflow was only 2 %–8 % of total flow, but it can transport up to 50 % of TP. Gunaratne et al. (2017) found that the seasonal first flush was only 30 % of runoff volume but contained 40 %–70 % of the nutrient load.
Note that the median TN in Ellen Brook (2.1
Baseflow is derived from groundwater discharge to streams and the slow drainage of water stored in local wetlands (Kelsey et al., 2010). Baseflow is highlighted as an important variable for TN prediction in Murray River. The Murray River catchment has large areas with high nutrient-retaining soils (high PRI) (Kelsey et al., 2011) and relatively low TN concentrations, and it is likely that groundwater makes significant contributions to TN in Murray River. Ruibal-Conti et al. (2013) previously found that variability in TN is strongly associated with variability in flows in Murray River. Our results extend this finding, in that both baseflow and quickflow likely impact TN in the river.
It is noted that seasonal components including
Generated daily
Six models were compared for nutrient predictions and the hybrid GBM model achieved the highest
accuracy (Figs. 3 and 5). The long-term changes in TN have been discussed in previous sections. To
understand the long-term changes in other nitrogen species across the year, the hybrid GBM was then
applied to generate daily DON,
The generated nutrient data provided additional information to enhance the hybrid model performance (Figs. 3 and 5). To assess the individual impact of a generated nutrient, we did a simple test that sequentially added generated TP, DOC and DON data to the base GBM (only seasonal components and lagged hydrological data) and evaluated RMSE and MEF for TN prediction. This process was repeated 30 times and the results are presented in Fig. 8.
Model performance for TN prediction across different input variables for Ellen Brook.
The RMSE significantly decreased when generated TP was added as an additional variable. Only 297 and 129 samples were available for DOC and DON, respectively, and these were only measured in recent years, while TP has more than 1000 samples and has been measured since 1990 (Table 3). However, DOC and DON could still improve model performance (Fig. 8), and the generated DON was ranked as the most important variable across both sites (Fig. 6). The median RMSE slightly decreased when both generated DOC and DON were added. Moreover, the generated DOC and DON also reduced the model uncertainty, such that the IQRs became narrower than model results without the generated nutrients.
Our results suggest that the recent DON and DOC data improved understanding of historical TN. It is not uncommon to have a similar data structure when several datasets are combined or new measurements are added to a project. While there were no DON data prior to 2006 in Ellen Brook, daily DON can be generated back to 1990 with the help of generated TN, DOC and TP data; DON had the highest MEF among the six nutrients (Fig. 3). This hybrid method provides a feasible process to fully utilise all available nutrient data to accurately fill gaps in either historical or recent nutrient datasets.
Monitoring, modelling and forecasting water quality inputs are essential to support the management
of the quality of receiving waters while responding to current anthropogenic stressors
(Holguin-Gonzalez et al., 2013; Schnoor, 2014). The performances of six models were comprehensively
compared, in an exploration of historical and contemporary nutrient data across two study sites. LM
had the highest error while stand-alone RF and GBM had similar error. This agrees with previous
findings by Erdal and Karakurt (2013) that RF and GBM models achieved similar correlation
coefficients (
The performance of WRTDS, as well as many conceptual models, is often reliant on a prescribed set of input information, which can account for variance in nutrient concentrations but may miss some important processes for certain rivers (e.g. baseflow in this study). This can compromise the performance of WRTDS for nutrient prediction. Moreover, hydrological and chemical processes within the systems are typically ignored by many conceptual models, which may exclude important hydrochemical information. By contrast, some complex conceptual models may include these hydrochemical processes but are often constrained by insufficient nutrient data to calibrate and validate the models. Some simplifications may be made to account for lack of data, but the simplifications may often weaken model performance. The hybrid framework presented in this study has overcome the challenge caused by data paucity by building intermediate models to generate missing nutrient data and then using this additional hydrochemical information to improve final model performance.
The hybrid models developed in this study were able to take advantage of the complementary strengths of both hydrochemical (additionally generated nutrient data) and hydrological (lagged data) information. This was particularly the case for the prediction of high nutrient concentrations, where the hybrid models were shown to outperform the stand-alone RF and GBM, in terms of accuracy, reliability and value distribution. Improved accuracy in the hybrid model was achieved by using intermediate models, although these intermediate models may also have a relatively high error (similar to stand-alone RF and GBM). However, provided that the gain in model performance outweighs the error introduced by the intermediate models, the hybrid approach remains worthwhile. Similar results were reported by Hunter et al. (2018), who compared a hybrid process-driven and ANN model with the stand-alone ANN model and the process-driven model. In their study, the hybrid also achieved the best performance followed by stand-alone ANN. The process-driven benchmark model had a significantly lower accuracy than the other two models.
A limitation of the hybrid modelling approach, however, is that it requires the time and expertise to develop intermediate models for generating additional nutrient data. Prior knowledge also plays an important role in identifying the variables for pre-generation. Some statistical methods (e.g. the correlation test, simple linear model) can be helpful to identify these variables if there is no clear theoretical or conceptual understanding on which to base the selection of the important variables.
In this study, we tested the generalised performance of the hybrid model across six nutrient species
and two tributaries. We also note that nutrients may not always be the critical variables targeted
for pre-generation; the pre-generated DOC was ranked as having low importance for Ellen Brook and
produced only a slight improvement in the performance of the hybrid model for
There were constraints in the nutrient datasets in this research, and similar constraints commonly exist in other study areas. Many nutrient datasets contain important information, but sometimes it can be challenging to directly combine or utilise them. ML methods provide a feasible approach to preprocess these datasets or combine them. In this study, the concentrations of missing nutrient species were first predicted by the intermediate ML method and then used as inputs for another ML method for final predictions. The pre-generation of missing data and pre-modelling hydrological analysis were critical components of the hybrid model and allowed the identification of the impact of different hydrological transport pathways for TN export from the two tributary catchments. The hybrid ML methods were further applied to generate nutrient data for eight tributaries, and the generated data have since been used as inputs to an estuary prediction model, which simulates and forecasts nutrient concentrations in the previous and next 5 d in the Swan–Canning Estuary (Huang et al., 2019). The modelling methods and strategies developed in the work presented here can be easily applied to other study areas. Overall, ML methods provide a flexible and feasible solution to explore the underlying relationships, reconstruct spatial and temporal datasets, and combine different models.
A hybrid machine learning model was developed, and its performance tested on six nutrients and two estuary tributaries and compared with alternative modelling approaches. The hybrid ML model exhibited higher prediction accuracy and lower prediction uncertainty than stand-alone ML, WRTDS and LM for almost all nutrients. The pre-generation of missing data and pre-modelling hydrological analysis were critical components of the hybrid model and allowed the identification of the impact of different hydrological transport pathways for TN export from the two tributary catchments. The results of this study demonstrate the advantages of using hybrid models for high temporal resolution nutrient prediction; the results also demonstrate the use of the hybrid model for re-analysis of historical data in the light of contemporary data. Modelling strategies for different modelling targets and dataset structures have also been discussed. The modelling framework presented here can aid others to fully use all available nutrient data to generate accurate nutrient predictions.
The data and the data sources used in this study are cited and explained in
the text. The current version of the model is available from the project
website:
The supplement related to this article is available online at:
BW, MRH and CO contributed to the development of the methodology and designed the experiments, and BW carried them out. BW developed the model code and performed the simulations. BW prepared the paper with contributions from all coauthors.
The authors declare that they have no conflict of interest.
The authors acknowledge Peisheng Huang and Brendan Busch for providing the historical nutrient data.
Benya Wang was supported by a postgraduate scholarship provided by the CRC for Water Sensitive Cities. Matthew R. Hipsey received funding support from the Australian Research Council (project LP150100451).
This paper was edited by Thomas Poulet and reviewed by Thu Huong Thi Hoang and one anonymous referee.