Recalibration of a three-dimensional water quality model with a newly developed autocalibration toolkit (EFDC-ACT v1.0.0): how much improvement will be achieved with a wider hydrological variability?

Autocalibration techniques have the potential to enhance the efficiency and accuracy of intricate process-based hydrodynamic and water quality models. In this study, we developed a new R-based autocalibration toolkit for the Environmental Fluid Dynamics Code (EFDC) and applied it to the recalibration of the Yuqiao Reservoir Water Quality Model (YRWQM) with long-term observations from 2006 to 2015, including dry, normal, and wet years. The autocalibration toolkit facilitated recalibration and helped explore whether a model recalibrated with long-term observations performs more accurately and robustly. Previously, the original YRWQM was calibrated and validated with observations from the dry years 2006 and 2007, respectively. Compared to the original YRWQM, the recalibrated YRWQM performed equally well for water surface elevation, with a Kling-Gupta Efficiency (KGE) of 0.99, and for water temperature, with a KGE of 0.91, and performed better for total phosphorus (TP), chlorophyll a (Chl a), and dissolved oxygen (DO), with KGEs of 0.10, 0.30, and 0.74, respectively. Furthermore, the KGEs improved by 43% to 202% in modeling the TP-Chl a-DO process when compared to the models calibrated with only dry, normal, or wet years. The model calibrated in dry years overestimated DO concentrations, probably because its algal growth rate parameter was 84% higher. The model calibrated in wet years performed poorly for Chl a due to a 50% reduction in the carbon-to-chlorophyll ratio, probably triggered by changes in the composition of the algal population. Our study suggests that calibrating


Introduction
Lakes and reservoirs fulfill the role of "sentinels" of climate change due to both their capacity to buffer synoptic-scale hydroclimatic extremes and their susceptibility to hydrological variability (Adrian et al., 2009; Williamson et al., 2009; Mooij et al., 2019). In recent decades, dramatic hydrological variability has been widely detected and has remarkably influenced biogeochemical processes in lakes and reservoirs (Sinha et al., 2017; Grant et al., 2021; Kong et al., 2022; Salk et al., 2022).
To investigate these variations, process-based hydrodynamic and water quality models have become increasingly popular tools since they can disentangle numerous intricate causal relations between exogenous drivers and environmental impacts within water bodies (Arhonditsis and Brett, 2004; Mooij et al., 2010; Fu et al., 2019). However, the accuracy and robustness of these models in the face of such intense hydrological variability have become a key issue.
Driven by the aim of better understanding physical, chemical, and biological processes, the complexity of process-based hydrodynamic and water quality models has continued to grow unabated over recent years (Robson, 2014). However, the increased complexity is a mixed blessing. It does help us examine biogeochemical processes in lakes and reservoirs, but when the complexity of the model exceeds a certain level, both accuracy and identifiability are diminished (McDonald and Urban, 2010). Higher dimensions, more state variables, and more specific details introduce more and more parameters into models through drastic simplification of reality, and these parameters subsequently become a massive source of model uncertainty. Model calibration is one of the essential procedures in model setup to reduce the uncertainty from parameter estimation and to obtain a satisfactory parameter set that matches simulated results with observed data (Jørgensen and Fath, 2011). Although calibration has been used extensively in process-based hydrodynamic and water quality models, two notable problems remain.
The first problem is that the manual calibration method (trial and error) commonly used in process-based hydrodynamic and water quality models is inefficient and does not guarantee optimal results. Firstly, steps such as adjustment of inputs, tuning of parameters, evaluation of model performance, and visualization of outputs subject modelers to time-consuming and tedious tasks. Secondly, the parameter set selected by this method may still suffer from uncertainty and the interference of subjective factors. With the development of computer technology and its subsequent application in numerical simulation methods, the automatic calibration method is burgeoning (Shimoda and Arhonditsis, 2016). Numerous modeling studies in recent decades have employed automatic calibration procedures in 2-D or lower-dimensional process-based hydrodynamic and water quality models of lakes or reservoirs (Rigosi et al., 2011; Huang, 2014; Luo et al., 2018). However, due to their high complexity and time-consuming calculation, there are few applications of automatic calibration procedures in 3-D hydrodynamic and water quality models. For example, the automated Parameter ESTimation software (PEST) was applied in the Environmental Fluid Dynamics Code (EFDC) (Arifin et al., 2016), and optimization algorithms were applied in tools to EFDC input and output files is still cumbersome. For such a sprawling model system as the EFDC, a specific automatic calibration tool can eliminate much of the repetitive and unnecessary work. On the other hand, specific automatic calibration tools are needed to support multi-objective evaluation methods for three-dimensional model calibration, including evaluation at different locations on the horizontal plane, evaluation at different depths on the same grid, or a mixture of these objectives.
The second problem is that many models were calibrated with only short-term observations covering a narrow hydrological variability and were then employed in water ecosystem management decisions or future predictions. There are hidden perils in the presumption that these calibrated models can adapt to a wider hydrological variability. For example, in a previous study of the Spokane River and Lake Spokane Model, the use of a low-flow period for calibration may have resulted in an overestimation of in-lake total phosphorus (TP) and chlorophyll a (Chl a), and an underestimation of minimum dissolved oxygen (DO) (Zhang et al., 2018a). This issue has also arisen in studies of other models (Vaze et al., 2010; Nielsen et al., 2014; Basijokaite and Kelleher, 2021). A large number of parameters can hardly be constrained by a narrow hydrological variability (Janssen and Heuberger, 1995; Franks, 2009), which triggers the equifinality problem, where several distinct parameter sets produce the same model outputs, also described as "good results for the wrong reasons" (Arhonditsis et al., 2007; Paudel, 2012). Even if the final parameter set chosen by the modeler matches model results to observations under the current hydrological variability, there is no assurance that the model will be accurate as a robust prognostic tool under a wider hydrological variability (Arhonditsis et al., 2007). Therefore, calibrating with a longer period of observations containing a wider range of hydrological years may be an important way to improve the identifiability of parameters in process-based hydrodynamic and water quality models. Benefiting from the continued accumulation of historical observations, numerous published models have been recalibrated using longer data records (James, 2016; Schnedler-Meyer et al., 2022).
Cerco and Noel (2005) recalibrated the Chesapeake Bay model with a decade of observations, resulting in a clear improvement in modeling primary production and light attenuation. With the support of long-term data, exploring how hydrological variability in the calibration period affects model calibration will help establish more accurate and robust models.
Long-term modeling is required to test the model's ability to reproduce ecosystems under different hydrological years.
Based on the established Yuqiao Reservoir Water Quality Model (YRWQM) (Zhang et al., 2013, 2015, 2019), this study aims to explore how long-term observations under a wider hydrological variability affect model calibration with the application of automatic calibration techniques. We hypothesized that models calibrated using observations with a longer period and wider hydrological variability perform more accurately and robustly. We first developed a new R-based autocalibration toolkit for the EFDC model. Second, we recalibrated the YRWQM with the toolkit using decadal-scale observations covering three hydrological situations: dry, normal, and wet periods. Finally, we compared the model performance, parameters, and kinetic processes represented by parameters across the model calibration scenarios that used different split-sample approaches. These discrepancies will highlight the importance of the hydrological variability represented in the observed data for model calibration and deepen our understanding of biogeochemical processes in shallow lakes and reservoirs under wide hydrological variability.

EFDC-Automatic Calibration Toolkit
The EFDC-Automatic Calibration Toolkit (EFDC-ACT) was developed in this study to automate the calibration of a 3-D hydrodynamic and water quality model, EFDC, which has more than two hundred parameters. EFDC-ACT is a multi-parameter and multi-variable autocalibration toolkit based on R. Documentation and source code are publicly available at https://doi.org/10.5281/zenodo.7438143. The conceptual overview of EFDC-ACT is shown in Fig. 1. There are three main steps in EFDC-ACT: initialization, autocalibration, and post-analysis.
Before using EFDC-ACT, the user should prepare the necessary files, including the EFDC-ACT master file, the Comma-Separated Values (CSV) file containing the parameter and variable information, the input file for the EFDC, and the CSV file containing the observations. In the initialization step, EFDC-ACT checks and loads the master file, the parameter list, and the variable list. Then EFDC-ACT generates a matrix of parameter value ranges, sets the model evaluation statistics as the objective functions, and launches the autocalibration process. More details are given in the EFDC-ACT User's Guide (S1).
To maintain the diversity of the parameter sets while accelerating convergence, EFDC-ACT introduced the caRamel package (Monteil et al., 2020) in the autocalibration step. The caRamel package is a genetic algorithm-based multi-objective optimizer, incorporating the Multiobjective Evolutionary Annealing Simplex method (MEAS) (Efstratiadis and Koutsoyiannis, 2008) and Nondominated Sorting Genetic Algorithm II (ε-NSGA-II) (Reed and Devireddy, 2004). It is suitable for highly complex, time-consuming hydrodynamic and water quality models like EFDC. EFDC-ACT controls the caRamel optimization according to the master file, feeding the parameter value range matrix into caRamel. caRamel generates the parameter set, then EFDC-ACT passes it into EFDC and starts the calculation. At the end of the model run, EFDC-ACT calculates the statistics based on the modeled and observed values and passes the result back to caRamel as the objective function value. As the parameter set is adjusted, the autocalibration process is repeated until the termination criterion is reached, such as the maximum number of runs or the expected statistic results.
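The propose-run-evaluate loop described above can be illustrated with a minimal Python sketch. This is not the EFDC-ACT implementation and not the caRamel algorithm: a plain random search stands in for the optimizer, `toy_model` is a hypothetical stand-in for an EFDC run, and the two objectives, the bounds, and all function names are assumptions chosen only to show how a parameter range matrix, multiple objectives, and a run-count termination criterion fit together.

```python
import math
import random

def toy_model(params, n=24):
    """Hypothetical stand-in for one EFDC run: a seasonal curve set by two parameters."""
    amp, offset = params
    return [offset + amp * math.sin(2 * math.pi * t / 12) for t in range(n)]

def objectives(sim, obs):
    """Two error measures to minimize: RMSE and absolute bias of the mean."""
    n = len(obs)
    rmse = math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / n)
    bias = abs(sum(sim) / n - sum(obs) / n)
    return (rmse, bias)

def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def calibrate(bounds, obs, max_runs=200, seed=1):
    """Random-search stand-in for the optimizer; returns the non-dominated sets found."""
    rng = random.Random(seed)
    front = []  # (params, objective vector) pairs, mutually non-dominated
    for _ in range(max_runs):  # termination criterion: maximum number of runs
        params = [rng.uniform(lo, hi) for lo, hi in bounds]  # propose a parameter set
        score = objectives(toy_model(params), obs)           # run the model, evaluate
        if any(dominates(s, score) for _, s in front):
            continue  # an existing member is strictly better, so discard this set
        front = [(p, s) for p, s in front if not dominates(score, s)]
        front.append((params, score))
    return front

# Parameter range matrix: one (lower, upper) row per parameter (illustrative values).
bounds = [(0.0, 10.0), (0.0, 20.0)]
obs = toy_model([5.0, 10.0])  # synthetic "observations" from known parameters
front = calibrate(bounds, obs)
```

In a real setting each call to the model would be a full simulation, which is why the number of runs, rather than convergence, often ends up as the practical stopping rule.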
Automatic model evaluation is a potent tool for making models more transparent and credible (Alexandrov et al., 2011; Soares and Calijuri, 2021). To this end, EFDC-ACT provides model result extraction, statistical model evaluation, and graphical model evaluation in the post-analysis step. During the autocalibration process, the user can open the CSV files to view each iteration's parameter set and model evaluation results. After each iteration, EFDC-ACT plots the time series using modeled and observed values, thus supplying the users with a visual comparison that statistics cannot afford. EFDC-ACT will also output the final optimization results in a CSV file when all iterations are complete.
According to the model evaluation guidelines proposed by Moriasi et al. (2007), the statistics used to evaluate the model performance include three categories: standard regression (R²), dimensionless (NSE, KGE), and error index (MAE, RMSE, PBIAS, RSR). Kling-Gupta Efficiency (KGE) is included as an alternative to NSE in this study. KGE gives equal weight to bias, linear correlation, and variability, avoiding the systematic underestimating of the variability (Gupta et al., 2009).
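As a concrete reading of the statistics named above, the following Python sketch computes KGE, NSE, PBIAS, and RMSE from paired simulated and observed series. It uses the standard textbook definitions (KGE as in Gupta et al., 2009, with population standard deviations) and is not taken from the EFDC-ACT source; the function names are illustrative, and the PBIAS sign is chosen so that positive values indicate underestimation.

```python
import math

def _mean(x):
    return sum(x) / len(x)

def _std(x):
    """Population standard deviation."""
    m = _mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

def kge(sim, obs):
    """Kling-Gupta Efficiency: equal weight on correlation, variability, and bias.
    Assumes neither series is constant and the observed mean is non-zero."""
    ms, mo = _mean(sim), _mean(obs)
    ss, so = _std(sim), _std(obs)
    cov = sum((s - ms) * (o - mo) for s, o in zip(sim, obs)) / len(obs)
    r = cov / (ss * so)   # linear correlation
    alpha = ss / so       # variability ratio
    beta = ms / mo        # bias ratio
    return 1 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def nse(sim, obs):
    """Nash-Sutcliffe Efficiency."""
    mo = _mean(obs)
    num = sum((o - s) ** 2 for s, o in zip(sim, obs))
    den = sum((o - mo) ** 2 for o in obs)
    return 1 - num / den

def pbias(sim, obs):
    """Percent bias; positive means the model underestimates on average."""
    return 100 * sum(o - s for s, o in zip(sim, obs)) / sum(obs)

def rmse(sim, obs):
    """Root mean square error."""
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(obs))
```

A quick sanity check is that a perfect simulation gives KGE and NSE of 1 and PBIAS and RMSE of 0, while halving every simulated value gives a PBIAS of 50%.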

Chronicle of YRWQM
The Yuqiao Reservoir (40°00'-40°04'N, 117°26'-117°37'E) is situated in Jixian County, Tianjin, China (Fig. 2). The shallow reservoir has a length of 66 km from east to west and a width of 50 km from north to south, an average water depth of 4.74 m, a maximum water depth of 12.74 m, a total surface area of 86.6 km², and a storage capacity of 1.559 billion m³. It is situated within a basin area that covers 2060 km² (Fig. S1). The Yuqiao Reservoir is the primary source of drinking, agricultural, and industrial water for approximately 129 villages in the surrounding area (Zhang et al., 2019; Yu and Zhang, 2021). Previous studies have shown that Yuqiao Reservoir is a typical mesotrophic, phosphorus-limited environment (Chen et al., 2012; Zhang et al., 2020). The YRWQM (Zhang et al., 2013) is a regional hydrodynamic and water quality model developed under the framework of EFDC (Hamrick, 1992; Ji et al., 2001) to improve the understanding and management of the Yuqiao Reservoir. Since its inception, the model has undergone five phases of development and refinement (Fig. 2).
The original YRWQM was constructed to investigate how agricultural pollution carried by flood flows affects the water quality in the Yuqiao Reservoir. The model was calibrated, validated, and employed to predict the variations of water quality resulting from agricultural pollution (Zhang et al., 2013). Subsequently, the YRWQM was coupled with a modified submerged aquatic vegetation model (M-SAVM) to study the effect of the development of submerged macrophytes (Zhang et al., 2015, 2016) and epiphyton (Špoljar et al., 2017; Zhang et al., 2018b) on the water quality indicators of the reservoir. An integrated climate-hydrological-water quality (RCM-SWAT-YRWQM) framework was also proposed to elucidate the effects of a changing climate on the trophic state (Zhang et al., 2019).
The YRWQM has become a powerful tool for research and management of the Yuqiao Reservoir water ecosystem through the above phases in the past decade. However, there may still be a risk of insufficient accuracy since the original YRWQM calibration does not take into account long-term hydrological variability. With the availability of the decadal-scale observations covering dry, normal, and wet periods and the design of new model calibration methods, it is time to examine whether a longer period of calibration can improve the accuracy and robustness of the model.

YRWQM recalibration with EFDC-ACT
The datasets required for YRWQM recalibration included meteorological data, discharge, precipitation, evaporation, water surface elevation, water temperature, and water quality data. Meteorological data were obtained from the China Meteorological Data Service Centre. Discharge, precipitation, evaporation, and water surface elevation data used in the model were obtained from the Yuqiao Reservoir Administrative Bureau. The data above were collected at a frequency of once a day from 2006 to 2015, except for 2012, when water surface elevation data were collected once a month (Zhang et al., 2019). Six monitoring stations were employed by the original YRWQM for calibration and validation (Zhang et al., 2013). To balance cost with accuracy in calibration, the recalibrated model was evaluated with water temperature and water quality data collected from monitoring station S2, which represented the water column at the center of the Yuqiao Reservoir. The water quality state variables included TP, Chl a, and DO concentrations (Zhang et al., 2015). All water quality samples were collected, preserved, and analyzed monthly or semi-monthly from 2006 to 2015 according to the Standard Method for the Examination of Water and Wastewater Editorial Board.
Due to the lower stability of the hydrodynamic model compared to the water quality model, the recalibration of the YRWQM was divided into two parts: the hydrodynamic model recalibration and the water quality model recalibration. Both parts of the YRWQM were recalibrated with EFDC-ACT. The parameter ranges listed in Table S1 were referenced from the original YRWQM and other literature (Wu and Xu, 2011; Zhang et al., 2013; Yi et al., 2016; Jiang et al., 2018; Zhao et al., 2020; Kim et al., 2021). KGE and PBIAS were used to evaluate the recalibrated model. Model performance is considered satisfactory in this study when the KGE is greater than -0.41, meaning that the model improves upon the mean value benchmark (Knoben et al., 2019). The PBIAS describes the average tendency for simulated values to be greater or less than observed values, with positive values indicating a model bias toward underestimation and negative values indicating a model bias toward overestimation (Gupta et al., 1999).
During the recalibration period (2006-2015), the bottom roughness height and the wind-drag multiplier were automatically calibrated. The hydrodynamics of the recalibrated YRWQM demonstrated good performance for water surface elevation (WSE) and water temperature (TEM) at station S2 (Fig. 3). The recalibrated YRWQM closely reproduced the decadal variation of WSE in Yuqiao Reservoir with a KGE of 0.99, and it reproduced the seasonal cycle of water temperature with a KGE of 0.91. The temperature extremes were also captured: the highest observed value of 31 °C corresponded to a simulated value of 28 °C, and the lowest observed value of 0 °C was matched by the simulated value. The modeled WSE and TEM both indicated that the hydrodynamic model in the recalibrated YRWQM is reliable and can be used for water quality modeling in the Yuqiao Reservoir during the recalibration period.
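The KGE threshold of -0.41 and the PBIAS sign convention used in this section can be traced back to the standard definitions. The block below is a short worked derivation, assuming the usual KGE form of Gupta et al. (2009) and the convention of Knoben et al. (2019) that a constant simulation equal to the observed mean is assigned r = 0; the symbols S (simulated) and O (observed) are notation introduced here for illustration.

```latex
% KGE definition (r: linear correlation, alpha: variability ratio, beta: bias ratio)
\mathrm{KGE} = 1 - \sqrt{(r-1)^2 + (\alpha-1)^2 + (\beta-1)^2},
\qquad \alpha = \frac{\sigma_S}{\sigma_O}, \qquad \beta = \frac{\mu_S}{\mu_O}

% Mean value benchmark: S_i = \mu_O for all i, hence r = 0, \alpha = 0, \beta = 1
\mathrm{KGE}_{\mathrm{mean}} = 1 - \sqrt{(0-1)^2 + (0-1)^2 + (1-1)^2}
                             = 1 - \sqrt{2} \approx -0.41

% Percent bias: positive when the model underestimates on average
\mathrm{PBIAS} = 100 \times \frac{\sum_i (O_i - S_i)}{\sum_i O_i}
```

Under these definitions, any KGE above about -0.41 indicates a model that does better than simply predicting the observed mean, which is a far weaker requirement than the KGE = 0 threshold sometimes borrowed by analogy from NSE.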

EFDC-ACT efficiency and model recalibration
The water quality component of the recalibrated YRWQM performed satisfactorily for the modeled TP concentration at station S2, with a KGE of 0.10. Most of the observations were evenly distributed, with little variance, on either side of the modeled values, and only a few observations lay far from them (Fig. 3). The modeled TP concentration peaked at the end of 2010 and 2011 and was beyond the range of observations. Nevertheless, the inter- and intra-annual variability in TP concentrations was still well captured and the model showed acceptable performance overall, with a PBIAS of 40%. The model represented the variation of Chl a concentration over the decade with a KGE of 0.30. Chl a concentration showed a clear double-peaked or multi-peaked pattern of intra-annual variation, with peaks occurring mostly in spring and autumn (Fig. 3). The modeled DO concentrations likewise showed good performance, with a KGE of 0.74. DO concentration exhibited a pronounced seasonal cycle, with lower concentrations in summer and higher concentrations in winter (Fig. 3).
Among these parameters, six were found to be sensitive (Table 1). These parameters were the carbon-to-chlorophyll ratio for algae (CChl), the maximum growth rate for algae (PM), the basal metabolism rate for algae (BMR), the predation rate on algae (PRR), the minimum mineralization rate of dissolved organic phosphorus (KDP), and the reaeration multiplier (REAC). These primary parameters were selected during our calibration process based on the biogeochemical characteristics of Yuqiao Reservoir and the model performance, and they significantly influenced the model results. More detailed model equations, parameter interpretations, and calibration results are listed in Sect. S3.
dry years 2006, 2007, 2010, and 2015). The state variables being compared included WSE, TEM, TP, Chl a, and DO. The parameters being compared included eight governing algal kinetics (CChl, PM, Keb, TM1, TM2,

Performance comparison between YRWQM recalibrated in decade and models calibrated in different hydrological years
There were obvious discrepancies in the performance of the recalibrated YRWQM in different hydrological years (Fig. 4).
The recalibrated YRWQM reproduced TP concentrations best over the decade, with the highest KGE values. The recalibrated model evaluation for TP concentrations reflected satisfactory performance in dry and wet years with
In comparison to the YRWQM recalibrated with the decade, the other three models calibrated in different hydrological years were distinctly inferior (Fig. 4). The model calibrated in dry years performed relatively poorly in modeling DO, with a KGE of 0.24 over the decade and a maximum KGE of 0.36 in dry years. The model calibrated in normal years failed to obtain good evaluations in modeling TP, with the lowest KGE values over the decade and in all three hydrological situations. The model calibrated in wet years performed relatively worse in modeling Chl a, with all KGEs less than 0.2 and the lowest KGE of -0.02 occurring in dry years. The results indicated that the YRWQM recalibrated with the decade outperformed the other three models, each calibrated with a single type of hydrological year, and showed the best robustness in modeling TP, Chl a, and DO concentrations across the wide hydrological variability of the decade.

Parameters and kinetic processes comparison between recalibrated YRWQM and models calibrated within different hydrological years
As with model performance, models employing different calibration strategies also yielded different parameter values (Fig. 5). Most parameter values of the recalibrated YRWQM were within the parameter ranges of the other three calibrated models, except for PRR, which had the lowest value of 0.12 among the four models. PRR represents the rate of predation on algae by zooplankton or other aquatic organisms, and algal predation is one of the main causes of algal reduction. Compared to models using other calibration strategies, the model calibrated in dry years had the highest PM of 5.1, the highest PRR of 0.28, and the lowest REAC of 1.1. PM and PRR govern the growth and predation of algae, respectively. REAC represents the reaeration multiplier for the turbulence-induced and wind-induced surface reaeration coefficient; a lower REAC value means less reaeration at the air-water interface. The model calibrated in normal years had a significantly lower BMR and a higher KDP, two parameters that represent the algal basal metabolism and the mineralization of dissolved organic phosphorus into inorganic phosphorus, respectively. Most parameters of the model calibrated in wet years were similar to those of the model recalibrated in the decade, except for a significantly lower value of CChl, which governs the conversion between modeled and measured algal biomass.

Recalibrated YRWQM vs. Original YRWQM
Before discussing the discrepancies between the recalibrated YRWQM and the original YRWQM, it is important to note that the model evaluations used a single station (station S2, the center of the Yuqiao Reservoir) and three state variables (TP, Chl a, DO), constrained by the complexity and computational cost of decadal-scale modeling. Nevertheless, these indicators were considered capable of representing the main biogeochemical processes in Yuqiao Reservoir, as previous statistical analyses and numerical models have indicated that Yuqiao Reservoir is a phosphorus-limited mesotrophic reservoir (Chen et al., 2012; Zhang et al., 2013; Xu et al., 2015). The model evaluations demonstrated that the recalibrated YRWQM performed as well as the original YRWQM in terms of hydrodynamics, while the recalibrated YRWQM outperformed it in water quality. We suppose that the better performance probably stemmed from the recalibrated parameter values, especially those of the sensitive parameters (Table 1). As described by Cerco and Cole (1994) for the three-dimensional eutrophication model of Chesapeake Bay, the growth rate of algae in YRWQM is expressed as the product of the maximum growth rate (PM) and a series of limiting factors, while algal reduction is caused mainly by basal metabolism (BMR) and predation (PRR). The parameter values in the original YRWQM gave the algae overly favorable growth conditions and produced inaccurate algal outbreaks during the decade, with a PBIAS of -357% (Table 1). These algal outbreaks may also be a potential reason for the overestimations of TP and DO concentrations, with PBIASs of -78% and -53%, respectively (Table 1). With the excessive algal outbreaks in the original YRWQM, the continuous enrichment of phosphorus in algae and the oxygen production by excessive net photosynthesis led to the final overestimations (Ji, 2017).
As algal kinetics were accurately parameterized over the decade in the recalibrated YRWQM, with a PBIAS of 36%, satisfactory results were also obtained for TP and DO concentrations, with PBIASs of 36% and -2%, respectively (Table 1).
It should be noted that the modeled TP concentration peaks in late 2010 and 2011 were not recorded in the observations (Fig. 3). The peaks may have been caused by the year-end water transfer, with inflow TP concentrations reaching 460 µg/L and 960 µg/L in December 2010 and 2011, respectively. This may also demonstrate that the recalibrated YRWQM can provide a higher temporal resolution than the observations and has potential as a hindcast model for reservoir management. Overall, the accuracy and robustness of the YRWQM have taken a solid step forward through a meticulous, long-term recalibration with EFDC-ACT.

Why does the recalibrated YRWQM have better-performing parameters? Impact of the hydrological variability on calibration results
Having discussed above how updated parameter values improved the accuracy and robustness of YRWQM, we now turn to how recalibration achieved this parameter updating. We suppose that observations with a wide hydrological variability may have contributed to the better-performing parameters, as the original YRWQM was calibrated and validated only under dry conditions while the recalibrated YRWQM used decadal observations with a wide range of hydrological variability. Hydrological variability is one of the main causes of varying biogeochemical processes (Delpla et al., 2009; Li et al., 2020), and the changes in parameter values reflect the variability of these processes (Robson et al., 2018). James (2016) recalibrated the Lake Okeechobee Water Quality Model using 30 years of observations including a series of extreme hydro-meteorological events, thereby improving the quality of the parameters and the ability to model nitrogen and phytoplankton. Many studies have also shown that the improvement in model parameters may be triggered by calibration using long-term observations with greater hydrological variability (Cerco et al., 2004; Lung and Nice, 2007).
Among the four models, the parameters of the recalibrated YRWQM showed a proper trade-off, with values falling almost entirely within the range determined by the other models calibrated in specific hydrological years (Fig. 5). The model calibrated in dry years performed as well as the recalibrated YRWQM for Chl a but failed to reproduce DO, with a PBIAS of -21% (Fig. 4). This may be due to the highest algal growth rate (PM) causing excessive net photosynthesis (Fig. 5). The drastic water level fluctuations of the Yuqiao Reservoir in dry years (Fig. 3) probably caused the decline of submerged macrophytes and the increase of phytoplankton, as in other shallow water bodies (Furey et al., 2004; Krolová et al., 2013; Lu et al., 2018). However, it is necessary to analyze more observations of submerged macrophytes and to couple the recalibrated YRWQM with M-SAVM to reach a definite conclusion (Zhang et al., 2015). In the case of the model calibrated in wet years, the model performed poorly in modeling Chl a (Fig. 4) and the carbon-to-chlorophyll ratio (CChl) was the lowest (Fig. 5).
Unlike the fixed value used in the model, the actual value of CChl is variable and depends on the makeup of the algal population, typically ranging from 0.015 to 0.1 (Bowie et al., 1985). Ren et al. (2019) also noted the differences in microbial composition between the dry and wet periods in Poyang Lake. A multi-species phytoplankton module that enables variable CChl may contribute to more robust algal modeling. The above discussion points out the risks inherent in employing a model calibrated with a single hydrological year for climate change studies or management decisions. From the standpoint of model accuracy and robustness, the use of long-term observations with sufficient hydrological variability to calibrate hydrodynamic and water quality models is probably the best option.

Highly efficient calibration with EFDC-ACT
The newly developed autocalibration toolkit, EFDC-ACT, eliminated a lot of hindrances to the recalibration of YRWQM.
Compared to the conventional manual calibration method, it not only reduces a great deal of uncertainty from the subjective choice of parameters but also accelerates the convergence of the optimization process. As a generic autocalibration toolkit developed for models based on the EFDC framework, EFDC-ACT supports the autocalibration of any combination of more than 200 parameters in the EFDC model. EFDC-ACT also incorporates automatic model evaluation and advanced visualization of simulations and observations. Some process patterns can only be seen in time series plots and 2D plots; statistics alone cannot reveal them (Bennett et al., 2013; Hipsey et al., 2020). The automated time series plots therefore make the model results more visual and transparent. The output files generated after each optimization iteration are overwritten, and only the model parameters and evaluation results of each iteration are retained. This design ensures reproducibility while avoiding heavy use of hard disk space (Luo et al., 2018). The entire automatic calibration framework proposed with EFDC-ACT can also serve as a reference for developing other automatic calibration tools for hydrodynamic and water quality models.
The caRamel algorithm adopted in EFDC-ACT has been demonstrated through case studies to obtain similar optimization results while speeding up convergence (Monteil et al., 2020). However, hundreds of parameters and the high spatial and temporal complexity of EFDC make computation time-consuming, so that it is difficult to reach the number of iterations recommended for the caRamel algorithm. Furthermore, even with the support of optimization algorithms, how to obtain better calibration results faster is still a critical issue for the autocalibration of high-complexity models like EFDC. Even with the aid of autocalibration, modelers should spend time learning and understanding the model system and the meaning of the parameters to avoid obtaining good model error statistics with the wrong parameters. Autocalibration should be viewed as an efficient way to refine a calibration after the modeler has learned the model system through manual calibration.

Challenging high-complexity model autocalibration problems: a possible hierarchical autocalibration strategy introducing expert knowledge
To enable faster convergence of the model parameter optimization process, we propose a hierarchical autocalibration strategy based on EFDC-ACT. This strategy requires the modelers to automatically calibrate the model three times in sequence, each time for a different purpose. First, modelers formulate a large range of parameters based on the literature or the meaning of the parameters, then run EFDC-ACT and perform a sensitivity analysis to find both the sensitive parameters and the sensitive state variables. Although EFDC-ACT does not provide functions for sensitivity analysis, there are a few R packages for this purpose, such as the sensitivity package. A Bayesian framework integrating sensitivity, uncertainty, and identifiability analysis was also proposed for EFDC (Jia et al., 2018). The modelers then analyze the interactions between these sensitive variables according to expert knowledge, and variables with controlling effects become the primary target for the second level of autocalibration.
Next, the modelers target the sensitive variables and parameters identified in the first round and perform the autocalibration again until the model performs satisfactorily. Finally, the modelers hold the identified variables and parameters constant and autocalibrate the model a third time to determine the remaining insensitive state variables and parameters. With this hierarchical autocalibration strategy, EFDC-ACT can handle the parameter estimation of EFDC more competently. This strategy is a possible framework for future work, suitable not only for EFDC-ACT but also for other automatic calibration tools that cannot complete a sufficient number of iterations.
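The three levels described above can be illustrated with a minimal sketch. It is written in Python for brevity and does not use EFDC-ACT, EFDC, or caRamel: `model_error` is a hypothetical stand-in for an expensive model run, `random_search` is a placeholder optimizer, `oat_sensitivity` is a simple one-at-a-time screening rather than any particular sensitivity method, and the parameter names, bounds, and threshold are assumptions chosen only for illustration.

```python
import random

# Hypothetical "true" values for a toy objective; not EFDC parameters.
TRUE = {"a": 2.0, "b": -1.0, "c": 0.5}


def model_error(p):
    """Stand-in for one model run: returns an error score (lower is better).
    Parameters "a" and "b" dominate the error; "c" barely matters."""
    return (10.0 * (p["a"] - TRUE["a"]) ** 2
            + 5.0 * (p["b"] - TRUE["b"]) ** 2
            + 0.01 * (p["c"] - TRUE["c"]) ** 2)


def random_search(bounds, fixed, n_iter, rng, start=None):
    """Placeholder optimizer (not caRamel): sample the free parameters
    uniformly within their bounds and keep the best candidate found."""
    if start is None:
        best_p, best_e = None, float("inf")
    else:
        best_p, best_e = dict(start), model_error(start)
    for _ in range(n_iter):
        p = dict(fixed)
        for name, (lo, hi) in bounds.items():
            p[name] = rng.uniform(lo, hi)
        e = model_error(p)
        if e < best_e:
            best_p, best_e = p, e
    return best_p, best_e


def oat_sensitivity(bounds, base, steps=10):
    """One-at-a-time screening: spread of the error when a single
    parameter sweeps its range and the others stay at `base`."""
    scores = {}
    for name, (lo, hi) in bounds.items():
        errs = [model_error({**base, name: lo + (hi - lo) * k / steps})
                for k in range(steps + 1)]
        scores[name] = max(errs) - min(errs)
    return scores


def hierarchical_calibration(bounds, n_iter=200, threshold=1.0, seed=0):
    rng = random.Random(seed)
    # Level 1: broad calibration over wide ranges, then sensitivity screening.
    p1, e1 = random_search(bounds, {}, n_iter, rng)
    scores = oat_sensitivity(bounds, p1)
    sensitive = [n for n in bounds if scores[n] >= threshold]
    # Level 2: recalibrate only the sensitive parameters.
    free2 = {n: bounds[n] for n in sensitive}
    fixed2 = {n: v for n, v in p1.items() if n not in sensitive}
    p2, e2 = random_search(free2, fixed2, n_iter, rng, start=p1)
    # Level 3: hold the sensitive parameters constant, calibrate the rest.
    free3 = {n: b for n, b in bounds.items() if n not in sensitive}
    fixed3 = {n: p2[n] for n in sensitive}
    p3, e3 = random_search(free3, fixed3, n_iter, rng, start=p2)
    return {"sensitive": sensitive, "params": p3, "errors": [e1, e2, e3]}
```

Calling `hierarchical_calibration({"a": (-5.0, 5.0), "b": (-5.0, 5.0), "c": (-5.0, 5.0)})` flags `a` and `b` as sensitive in this toy setting. In the sketch the sensitive set is chosen by a numerical threshold; in the strategy proposed here, that selection is where expert knowledge about controlling variables would enter.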
Even in the context of rapid advances in computer technology, expert knowledge is still indispensable to calibrating highly complex models (Wood et al., 1990; Ostfeld and Salomons, 2005). With the emergence of automatic calibration tools, how to combine expert knowledge with these tools has become a new issue (Krueger et al., 2012; Xia and Shoemaker, 2022). The selection of key state variables in the hierarchical autocalibration strategy above is an example of applying expert knowledge within an autocalibration tool. With the evolution of computer technology, the development of autocalibration tools, and the accumulation of observations, the hierarchical autocalibration strategy proposed above offers a possible workaround for the very large autocalibration problems posed by high-complexity models.

Conclusion
We developed a new automatic calibration toolkit, EFDC-ACT, and applied it to the recalibration of the YRWQM with ten years (2006–2015) of observations covering a wide range of hydrological variability. In comparison with the original YRWQM, the hydrodynamics of the recalibrated YRWQM performed as well over the decade, while the recalibrated model performed significantly better in modeling TP, Chl a, and DO concentrations. When compared to the models calibrated with only dry, normal, or wet years, the KGEs improved by a maximum of 196%, 134%, and 202% in modeling TP, Chl a, and DO, respectively. Our analysis indicates that the improvement in the accuracy and robustness of the recalibrated YRWQM derives from the constraining effect of observations with a wider hydrological variability. Such information will help to unravel how hydrological variability in the calibration periods affects process-based hydrodynamic and water quality models, including their parameters, kinetic processes, performance, and long-term robustness. Moreover, the general autocalibration toolkit developed in this study, EFDC-ACT, is substantially less time-consuming and more efficient for modelers than the conventional manual calibration method. The framework of EFDC-ACT and a possible hierarchical autocalibration strategy can also serve as a reference for future calibration of complex hydrodynamic and water quality models. Finally, with our convenient autocalibration toolkit, it will be possible to explore the impact of hydrological variability on more complex process-based hydrodynamic and water quality models.

Code and data availability. The source code of the automatic calibration toolkit, EFDC-ACT, is freely available from https://doi.org/10.5281/zenodo.7438143 (Zhang and Fu, 2022) on Zenodo under the Creative Commons Attribution 4.0 International licence. The observed hydrodynamic and meteorological datasets are freely available from https://doi.org/10.5281/zenodo.8083303 (Zhang and Fu, 2023) on Zenodo. The Yuqiao Reservoir is an important source of drinking water, and the public may be sensitive to its water quality conditions. Therefore, for public health reasons, we cannot make the water quality datasets publicly available. The water quality datasets are available upon request to the corresponding author for reviewers and readers who would like to reproduce the results.

Supplement. The supplement related to this article is available online at: …

Author contributions. CZ designed the work, led the study, acquired the financial support, provided study resources, and conducted the research process; CZ and TF designed the methodology, developed the software, and wrote the initial manuscript draft; TF validated the reproducibility of results and prepared visualization; CZ reviewed and edited the manuscript.
Competing interests. The contact author has declared that none of the authors has any competing interests.