A parallel workflow implementation for PEST version 13.6 in high-performance computing for WRF-Hydro version 5.0: a case study over the midwestern United States

The Weather Research and Forecasting Hydrological (WRF-Hydro) system is a state-of-the-art numerical model that simulates the entire hydrological cycle based on physical principles. As with other hydrological models, WRF-Hydro parameterizes many physical processes. Hence, WRF-Hydro needs to be calibrated to optimize its output with respect to observations for the application region. When applied to a relatively large domain, both WRF-Hydro simulations and calibrations require intensive computing resources and are best performed on multinode, multicore high-performance computing (HPC) systems. Typically, each physics-based model requires a calibration process that works specifically with that model and is not transferrable to a different process or model. The parameter estimation tool (PEST) is a flexible and generic calibration tool that can in principle be used to calibrate any of these models. In its existing configuration, however, PEST is not designed to work on the current generation of massively parallel HPC clusters. To address this issue, we ported the parallel PEST to HPCs and adapted it to work with WRF-Hydro. The porting involved writing scripts to modify the workflow for different workload managers and job schedulers, as well as developing code to connect parallel PEST to WRF-Hydro. To test the operational feasibility and the computational benefits of this first-of-its-kind HPC-enabled parallel PEST, we developed a case study using a flood in the midwestern United States in 2013. Results on a problem involving calibration of 22 parameters show that, on the same computing resource used for parallel WRF-Hydro, the HPC-enabled parallel PEST can speed up the calibration process by a factor of up to 15 compared with the commonly used sequential mode.


Introduction
Physically based hydrological models contain detailed physical mechanisms to model the hydrological cycle, but many complex physical processes in these models are parameterized. For example, the state-of-the-art Weather Research and Forecasting Hydrological (WRF-Hydro) modeling system (Gochis et al., 2018) has dozens of parameters that can be land- and river-type dependent and are typically specified in lookup tables. Therefore, these hydrological models need to be calibrated before they can be applied to research over different regions. In this context, calibration refers to adjusting the values of the model parameters so that the model can closely match the behavior of the real system it represents. In some cases, the appropriate value for a model parameter can be determined through direct measurements conducted on the real system. In many situations, however, the model parameters are conceptual representations of abstract watershed characteristics and must be determined through calibration. In fact, model calibration is the most time-consuming step, not only for hydrological models but also for Earth system models.

WRF-Hydro includes the representation of surface flow, subsurface flow, channel routing, and baseflow, as well as a simple lake/reservoir routing scheme. As a physics-based model, WRF-Hydro includes many complicated physical processes that are nonlinear and must be parameterized. The default parameters given by WRF-Hydro may be valid for one region but not for another. Hence, calibration of related model parameters is often required in order to use the model in a new domain. In particular, for a large spatial domain such as the entire contiguous United States, the calibration must be conducted on high-performance computing (HPC) systems in parallel, instead of in the traditional sequential mode, in order to develop optimal parameter sets in a reasonable amount of time.

To date, no such calibration tool can efficiently calibrate WRF-Hydro on HPC resources. Typically, each physics-based model needs a calibration code that is custom designed to work with that particular numerical model and its set of physics parameterizations, software architecture, and solvers. These custom-designed calibration codes are challenging to develop and do not offer flexibility. Therefore, a more flexible and generic calibration tool is needed that can calibrate any code that uses the Message Passing Interface or Open Multi-Processing (MPI/OpenMP) for parallelization on HPC systems.

One widely used generic and independent calibration tool is the parameter estimation tool (PEST). PEST (Doherty, 2016) conducts calibration automatically based on mathematical methods and thus is applicable for optimizing nonlinear parameters. Compared with manual calibration, automatic calibration is more efficient and effective because it avoids interference from human factors (Madsen, 2000; Getirana, 2010). The uniqueness of PEST is that it operates independently of models: there is no need to develop additional programs for a particular model beyond preparing the files required by PEST (as described in Sec. 3.2). PEST has four modes of operation.
One of the modes is regularization mode, which supports the use of Tikhonov regularization and is well suited to environmental models because, if implemented properly, it supports model predictions of minimum error variance, is numerically stable, and embraces rather than eschews the heterogeneity of natural systems. Singular value decomposition (SVD) can be used as a regularization device to guarantee numerical stability of the calibration problem. Parallel PEST is able to distribute many runs across many computing nodes using master-worker parallel programming. To the best of our knowledge, however, no approach is available that allows users to submit jobs using PEST parallelization to a typical supercomputing facility that uses job scheduling and workload management software such as the Simple Linux Utility for Resource Management (SLURM), the Portable Batch System (PBS), or Cobalt. A previous study (Senatore et al., 2015)

The study domain covers part of the midwestern United States, including Missouri, at a spatial resolution of 3 km (Fig. 1). Compared with the Noah LSM, the Noah-MP land surface model shows obvious improvements in reproducing surface fluxes, skin temperature over dry periods, snow water equivalent, snow depth, and runoff (Niu et al., 2011). Noah-MP is configured at a grid spacing of 3 km, and the aggregation factor is 15; that is, starting from the 3 km LSM resolution in the domain shown in Fig. 1, hydrological routing is performed at a grid resolution of 200 m, with 3285 south-north × 3735 west-east grid cells. We use a time step of 10 seconds for the routing grid in order to maintain model stability and prevent numerical dispersion of overland flood waves. WRF-Hydro is configured in offline (uncoupled) mode: there is no online interaction between the WRF-Hydro hydrological model and the WRF atmospheric model. Overland flow, saturated subsurface flow, gridded channel routing, and a conceptual baseflow are active in this study. The gridded channel network uses an explicit, one-dimensional, variable time-stepping diffusive wave formulation. The time step of 10 seconds also meets the Courant condition criteria for diffusive wave routing on a 200 m resolution grid. A direct output-equals-input "pass-through" relationship is adopted to estimate the baseflow. Although the baseflow module is not physically explicit, it is important because the water flow in the channel routing is contributed by both the overland flow and the baseflow. If overland flow is active, as it is in this study, it passes water directly to the channel model; in this case, soil drainage is the only water source flowing into the baseflow buckets. However, if overland flow is deactivated but channel routing is still active, then WRF-Hydro collects excess surface infiltration water from the land model and passes this water into the baseflow bucket, which then contributes the water from both overland flow and soil drainage to the channel flow. Therefore, the baseflow must be active if the overland flow is switched off. This study does not consider lakes and reservoirs.

We use the geographic information system (GIS) tool (Sampson and Gochis, 2018) developed by the WRF-Hydro team to delineate the stream channel network, open water (i.e., lake, reservoir, and ocean) grid cells, and groundwater/baseflow basins. Meteorological input for the WRF-Hydro model system includes hourly precipitation; near-surface air temperature, humidity, and wind speed; incoming shortwave and longwave radiation; and surface pressure.
In this study, the hourly precipitation is from the National Centers for Environmental Prediction (NCEP) Stage IV analysis at a spatial resolution of 4 km. The Stage IV data are based on combined radar and gauge data (Lin and Mitchell, 2005; Prat and Nelson, 2015) and have been shown to be temporally well correlated with high-quality measurements from individual gauges (see, e.g., Sapiano and Arkin, 2009; Prat and Nelson, 2015). The other hourly meteorological inputs are from phase 2 of the multi-institution North American Land Data Assimilation System project (NLDAS-2) (Xia et al., 2012a,b), at a spatial resolution of 12 km. NLDAS-2 is an offline data assimilation system featuring uncoupled LSMs driven by observation-based atmospheric forcing.

During the 15-day period of the studied case, light to moderate rain occurred on April 8 through 11, 2013, followed by a relatively dry period from April 12 to 15. A heavy rain event then began on April 16 and peaked on April 18. The heaviest rain band moved east of the study area on April 19, and the rain event ended over the study area on April 20 (see Fig. S1 in the Supporting Information). We start the WRF-Hydro simulation on October 1, 2012, and run the model for six months to reach equilibrium. This 6-month period is considered spin-up time and is excluded from model calibration and evaluation. We calibrate the river discharge calculated by the WRF-Hydro model from 00 UTC April 9 to 00 UTC April 12, 2013, considering this period long enough to achieve our objective.

The interface we have built between parallel PEST and the management software is, in general, used for (1) setting the number of workers and the nodes on which each worker conducts a model run (WRF-Hydro here); (2) setting up the working directory for each worker; (3) finding the nodes that are available; (4) identifying the nodes that belong to each worker; (5) passing the global files (the same for all working directories) to all the workers, including the lookup table files that are not to be calibrated, the namelist files for both the LSM and the hydrological component, and restart files generated by previous simulations or by the spin-up period; and (6) submitting the job for the entire calibration process, including parallel PEST and parallel WRF-Hydro. This job can be submitted as a cold-start run or as a restart. The main difference between management software systems is that each has its own way to identify available nodes and to submit jobs. These differences require minor changes in the scripts we developed, which involve finding and identifying available nodes for workers and submitting jobs through the specific management software. See the detailed comments in the published code and scripts.
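For illustration, the sketch below (Python; the directory layout and even node split are simplified assumptions rather than the published scripts) shows the scheduler-specific part of steps (3) and (4): discovering the nodes granted to a job under SLURM, PBS, or Cobalt and dividing them into equal blocks, one per PEST worker.

```python
import os
import subprocess

def allocated_nodes():
    """Return the list of node names granted to this job under SLURM, PBS, or Cobalt."""
    if "SLURM_JOB_NODELIST" in os.environ:
        # SLURM stores a compressed list (e.g., nid[0001-0004]); expand it with scontrol.
        out = subprocess.run(
            ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
            capture_output=True, text=True, check=True)
        return out.stdout.split()
    for var in ("PBS_NODEFILE", "COBALT_NODEFILE"):
        if var in os.environ:
            # PBS and Cobalt write one node name per line (PBS may repeat per core).
            with open(os.environ[var]) as f:
                return sorted(set(line.strip() for line in f if line.strip()))
    raise RuntimeError("No supported workload manager detected")

def split_among_workers(nodes, n_workers, nodes_per_worker):
    """Assign a disjoint block of nodes to each PEST worker."""
    needed = n_workers * nodes_per_worker
    if len(nodes) < needed:
        raise RuntimeError(f"Job has {len(nodes)} nodes but {needed} are required")
    return [nodes[i * nodes_per_worker:(i + 1) * nodes_per_worker]
            for i in range(n_workers)]

if __name__ == "__main__":
    # Example: 6 workers with 2 nodes each, matching Test 3 in this study.
    for rank, group in enumerate(split_among_workers(allocated_nodes(), 6, 2), start=1):
        print(f"worker{rank:02d}: {','.join(group)}")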

PEST files and settings
PEST requires three file types in both sequential and parallel modes: template files to define the parameters to be calibrated, an instruction file to define the format of the model-generated output files, and a control file to supply PEST with the size of the problem and the settings for the calibration method. Parallel PEST uses a "master-worker" paradigm in which model runs are started simultaneously by different workers (in different folders). The master of parallel PEST communicates with each of its workers many times during a calibration. To run PEST in parallel mode, one also needs a run management file to inform PEST where the working folder is for each worker and what the names and paths are for each model input file that PEST must write (i.e., lookup tables produced from template files) and each model output file that PEST must read (such as frxst_pts_out.txt). The management file also sets the maximum run time for each worker. For workers that take longer than the maximum run time, PEST will stop the model run by that particular worker and assign that model run to another worker if one is available with nothing else to do.
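As a minimal sketch of this bookkeeping, the following Python snippet stages one working folder per worker with the shared (global) files; the file names are illustrative of a typical WRF-Hydro v5.0 setup, and the PEST run management file itself must still be written in the syntax described in the PEST documentation.

```python
import shutil
from pathlib import Path

# Files every worker needs unchanged (our assumption of a typical configuration):
# the two namelists, uncalibrated lookup tables, and restart files from the spin-up.
GLOBAL_FILES = [
    "namelist.hrldas",
    "hydro.namelist",
    "RESTART.2013040900_DOMAIN1",
    "HYDRO_RST.2013-04-09_00:00_DOMAIN1",
]

def prepare_worker_dirs(master_dir, n_workers):
    """Create one working folder per PEST worker and stage the shared files into it.
    The calibrated lookup tables are written later by PEST from its template files,
    and PEST reads each worker's frxst_pts_out.txt after every model run."""
    master = Path(master_dir)
    worker_dirs = []
    for i in range(1, n_workers + 1):
        wdir = master / f"worker{i:02d}"
        wdir.mkdir(parents=True, exist_ok=True)
        for name in GLOBAL_FILES:
            src = master / name
            if src.exists():                 # tolerate optional files quietly
                shutil.copy2(src, wdir / name)
        worker_dirs.append(wdir)
    return worker_dirs

if __name__ == "__main__":
    print([str(d) for d in prepare_worker_dirs(".", 6)])
```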

To the best of our knowledge, however, parallel PEST is not designed to run on HPCs directly. We developed scripts and an interface to enable parallel PEST to run on HPCs using the SLURM, PBS, or Cobalt workload managers and job schedulers. The development involved writing scripts to modify the workflow for different workload managers and job schedulers, as well as developing code to connect parallel PEST to WRF-Hydro. These developments enable parallel PEST to run many workers at the same time; each worker runs a parallel code (here WRF-Hydro) that uses more than one node, which can significantly reduce the wall-clock time of model calibrations. Although this master-worker parallelism may not be as efficient as a fully MPI-based approach, it is sufficient for model calibration and requires the least effort to get the current parallel PEST running on HPC systems.
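The sketch below illustrates one way the per-worker model launches could look under SLURM; it is a simplified assumption (executable name, 36 MPI ranks per node, and direct launching of the model instead of a PEST worker process) rather than the published implementation.

```python
import subprocess
from pathlib import Path

def launch_worker_models(worker_dirs, node_groups, ranks_per_node=36):
    """Launch one parallel WRF-Hydro run per worker directory, each pinned to its
    own block of nodes, so all runs proceed simultaneously. In the full workflow
    the command started in each directory would be the PEST worker process, which
    in turn invokes a model command like this one for every run it is assigned."""
    procs = []
    for wdir, nodes in zip(worker_dirs, node_groups):
        cmd = [
            "srun",
            "--nodes", str(len(nodes)),
            "--ntasks", str(len(nodes) * ranks_per_node),
            "--nodelist", ",".join(nodes),
            "./wrf_hydro.exe",               # executable name is illustrative
        ]
        procs.append(subprocess.Popen(cmd, cwd=wdir))
    return procs

if __name__ == "__main__":
    dirs = sorted(Path(".").glob("worker*"))
    groups = [["nid00001", "nid00002"], ["nid00003", "nid00004"]]  # hypothetical names
    for p in launch_worker_models(dirs[:len(groups)], groups):
        p.wait()
```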

This study presents calibration results from PEST using SVD-based regularization in regularization mode to ensure numerical stability (Tonkin and Doherty, 2005). We focus on calibrating 22 parameters (see Table 1).

The primary objective of this study is to build a bridge linking parallel PEST and WRF-Hydro on HPC clusters and to explore the computational benefits of this bridge. We do not attempt to extensively assess each individual tool or address questions in each individual domain, such as optimizing the objective functions in PEST or calibrating WRF-Hydro over a long time period with all the relevant parameters to achieve an optimal parameter set. The calibration period thus is limited to only three days, which we believe is long enough to achieve our objective and to understand WRF-Hydro's sensitivity to the calibrated parameters. We calibrated WRF-Hydro using four USGS sites (referred to as Station 1, Station 2, Station 3, and Station 4 hereafter), as shown in Fig. 1. (More USGS sites could be included if one manually reallocated the stations that were not properly assigned to the desired location on the channel network by the GIS tool.) As shown by the lower left index map in Figure 1, the study area (the red box) covers only the lower part of the Upper Mississippi River Basin (UMRB) and a portion of the Missouri River Basin (MORB). In order to prepare observation datasets of streamflow contributed only from the drainage area within the model domain, we identified inflows entering the model domain at three different sites, namely, sites 05411500, 06807000, and 06887500, as indicated by the black solid triangles in the index map of Figure 1. The outflows of the combined UMRB and MORB can be found at the three outlets, namely, sites 07010000, 07020500, and 07022000 (named Stations 2, 3, and 4, respectively, as shown by the black solid circles in Figure 1). These outlets are located sequentially along the main stem of the Mississippi River downstream of the confluence of the Mississippi and Missouri Rivers.

We do not consider the spatial variability of other parameters or their scaling factors. We acknowledge that some studies calibrate a single scaling factor (without considering its spatial variability, however) of the overland roughness coefficients (OVROUGHRTFAC) rather than the actual value for each land type in the lookup table (e.g., Kerandi et al., 2018). Although this approach reduces the number of calibrated parameters, it has less flexibility because changing one factor will change all the parameters in the same proportion.

For the calibration exercises we conduct here, the retention depth factor (RETDEPRTFAC) is fixed at 0.001. This value is reasonable because the modeled discharge of our particular configuration (Sec. 2.2) using default parameters is lower than the observed discharge. Reducing this factor from 1 to 0.001 keeps less water in ponds and more water on the surface so it can contribute to river discharge. First, we calibrate 48 parameters based on a 3-day simulation from April 9 to April 11, 2013 (Table S1 in Supporting Information). This calibration uses the estimation mode in the PEST tool and considers equal weight for all four USGS stations. We calibrate Manning's roughness coefficients for both channels and land-use types, the deep drainage (SLOPE), the infiltration-scaling parameter (REFKDT), and the saturated soil lateral conductivity (REFDK).
Manning's roughness coefficients control the hydrograph shape and the timing of the peaks; the SLOPE, REFKDT, and REFDK control the total water volume. Second, based on the knowledge we learn from the 48-parameter calibration (see details in Sec. 4.1), for the same 3-day period we reduce the number of calibrated parameters from 48 to 22 according to the sensitivity of the WRF-Hydro model to the adjustable parameters. For example, during the calibration we find that Manning's roughness coefficients for several land types barely change because these land types (e.g., tundra, snow/ice) are not present in the study area. We also learn that even though the calibrated WRF-Hydro parameters can generate discharge results that closely resemble observations, the physical meanings of several parameters are not appropriate because of the wide ranges of those parameters that we set in the PEST control file. For example, Manning's roughness coefficient for stream order 1 (0.199) is calibrated smaller than that for stream order 2 (0.218), and the overland roughness coefficients for evergreen needleleaf forest (0.043) and mixed forest (0.023) are calibrated smaller than for cropland/woodland (0.046). Neither of these is true in the real world. We therefore adjust the ranges of many parameters according to the literature (Soong et al., 2012) to maintain their physical meanings (Table 1). We find that by using the same absolute weight for all four stations, the calibration helps the three stations with large water volumes (Stations 2, 3, and 4) generate more reasonable results than do the default parameters; however, the results for Station 1, which has a relatively small volume of water, are not always better than the discharge modeled using default parameters. Thus, we assign a larger weight of 9.0 to Station 1 than to the other three stations. We evaluate the modeled discharge against observations using the Nash-Sutcliffe efficiency (NSE),

NSE = 1 - \frac{\sum_{t=1}^{n}\left(Q_{s}^{t}-Q_{o}^{t}\right)^{2}}{\sum_{t=1}^{n}\left(Q_{o}^{t}-\bar{Q}_{o}\right)^{2}},

where Q_{o}^{t} is the tth observed value of river discharge from the USGS sites, Q_{s}^{t} is the tth simulated value from the WRF-Hydro output, \bar{Q}_{o} is the temporal average of the USGS observed discharge, and n is the total number of observation time points. An efficiency of 1 (NSE = 1) corresponds to a perfect match between modeled discharge and observed data. An efficiency of 0 (NSE = 0) indicates that the model predictions are as accurate as the mean of the observed data. An efficiency below zero (NSE < 0) occurs when the model is worse than the observed mean. Essentially, the closer the NSE is to 1, the more accurate the model is.
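For reference, the NSE defined above can be computed directly from paired observed and simulated series; the short function below is a plain transcription of the formula and is not part of PEST or WRF-Hydro.

```python
import numpy as np

def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect match, 0 matches the observed mean,
    and negative values mean the model is worse than the observed mean."""
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Example with made-up discharge values (m3/s):
print(nash_sutcliffe([100, 250, 400, 300], [120, 230, 380, 310]))
```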

Based on the knowledge we gained from the 48-parameter 3-day calibration, we adjust the ranges of critical parameters in the PEST control file to maintain their physical meanings. For example, we set Manning's roughness coefficient larger for stream order 1 than for stream order 2. We also adjust the parameter ranges of the overland roughness coefficient for multiple land covers, such as forests. We exclude the parameters that are not sensitive to WRF-Hydro streamflow in this study, in order to constrain the problem size considering the availability of computational resources. However, if the study area were much larger, with more land types than the study area here, there would be more parameters to calibrate. Hundreds of constant parameters in the Noah-MP model could affect the WRF-Hydro results (Cuntz et al., 2016) and can be calibrated as well. Both situations would increase the burden of WRF-Hydro calibration. We perform the same 3-day calibration from April 9 to April 11, 2013. Figure 2 shows the 3-day modeled discharge (in cubic meters per second) using default and calibrated parameters after five iterations, as well as the observed discharge. The four stations are calibrated with different weights. While the model performance for Station 1 using default and calibrated parameters is similar, the calibration significantly improves the model performance over the drainage areas represented by Stations 2, 3, and 4. The modeled discharge using the default parameters underestimates the streamflow by 24-33%. PEST detects this underestimation, immediately adjusts the parameters, and increases the modeled discharge during the first iteration. After the third iteration, the difference in calibrated results between iterations is relatively small. We allow PEST to conduct five iterations and use the parameters obtained from the fifth iteration as our optimum parameters. As shown in Table 2, the NSEs for all four stations increase toward 1, the RMSEs decrease significantly, and the correlation coefficients between the observed and modeled discharge increase from 0.8, 0.7, 0.19, and 0.65 to 0.9, 0.81, 0.78, and 0.75. Compared with the results of calibration using the estimation mode (no regularization) in PEST (not illustrated), the SVD-based regularization generates a slightly better hydrograph shape, with discharge peaks one day later that are closer to the observations. However, a problem remains with the hydrograph shapes of the modeled discharge, especially with the modeled peak discharge. For Station 1, WRF-Hydro almost captures the timing of the peak discharge but still underestimates the discharge by ~25%. One reason may be that this study uses a direct pass-through baseflow module, which does not account for slow discharge and long-term storage of the baseflow. Therefore, the largest contribution to river discharge is from precipitation, and groundwater does not contribute much discharge to the channels over the long term, as is also true for the other three large river stations. As a result, the contribution from the baseflow to the river discharge in the model simulations does not persist as long as in real situations. In the observations, the river discharge decreases from the peak at a rate of ~500 m3/sec per day, while the modeled river discharge decreases from the peak at a rate of ~1667 m3/sec per day.
Using an exponential storage-discharge function for the baseflow may improve this situation. Other reasons include that the parameter ranges we set in the PEST control file are perhaps not wide enough, as can be seen from Table 1.

In this study we test the computational performance of the HPC-enabled parallel PEST using different numbers of workers (6, 12, and 23) for the 22-parameter calibration. As shown in Table 3, we conducted six experiments: Test 1 uses 23 workers, Test 2 uses 12 workers, and Test 3 uses 6 workers. All three tests use two nodes for each worker to run WRF-Hydro in parallel. The maximum number of lambda-testing runs undertaken per iteration is set to 15, 10, and 5 for Tests 1, 2, and 3, respectively, to ensure that only one cycle of WRF-Hydro runs (using 15, 10, and 5 workers, respectively) is devoted to testing Marquardt lambdas. Note that the maximum number of lambda-testing runs should be set equal to or less than the number of workers available; otherwise, another cycle of WRF-Hydro runs must be conducted. In fact, testing more Marquardt lambdas does not always guarantee that the best Marquardt lambda is found; on the contrary, it may slow the convergence of PEST or even lead to failure of the calibration.

In order to test the trade-offs between the computing nodes used for running parallel WRF-Hydro and the workers used for running parallel PEST, Tests 4, 5, and 6 use the same number of workers (six) as Test 3 but different numbers of nodes for each worker to run WRF-Hydro in parallel. Explicitly, Test 4 uses four nodes per worker, Test 5 uses six nodes per worker, and Test 6 uses eight nodes per worker. The maximum number of lambda-testing runs undertaken per iteration is set to five for Tests 4, 5, and 6. Note that the time costs in Table 3 are limited to only one iteration. Conducting more iterations will increase the wall-clock time and computing resources consumed but will not change the conclusions about the scale-up capability and computational benefits of the HPC-enabled parallel PEST linked to WRF-Hydro.

As a result, the total time cost for Test 2 is ~1.5 times that of Test 1, and the total time cost for Test 3 is ~1.5 times that of Test 2 (Fig. 4b). By extrapolating the speedup curves shown in Figs. 4a and 4b, we expect the total time cost to be ~1516 minutes when using only one worker (sequential mode), which is about 15 times slower than running PEST in parallel mode using 23 workers. For this particular study with 22 adjustable parameters, we expect the time cost most likely to stay the same even if the number of workers is increased beyond 23, because PEST runs WRF-Hydro only 23 or 22 times for each iteration; assigning more workers is therefore not an efficient use of computing resources. PEST may run WRF-Hydro more than 22 times (e.g., 44 times) if higher-order finite differences are employed. In this case, assigning more workers (e.g., 45 workers) may further speed up the calibration process. On the other hand, for the same case study and using the same number of nodes for running parallel WRF-Hydro, we can estimate the computing speedup by assuming an increase in the number of calibrated parameters to 50. This would be the case, for example, when evaluating model sensitivity to the physics in Noah-MP or the spatial variability of certain parameters.
We would then expect to use 51 workers to calculate the Jacobian matrix in only one cycle, which would be 28-30 times faster than running PEST using one worker (in sequential mode). Similarly, if 100 parameters were used for the calibration of the same case study, a speedup of up to a factor of 60 in the calibration process would be achieved by running the HPC-enabled parallel PEST.
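The run-count arithmetic behind these estimates can be written out explicitly. The sketch below assumes, as in this study, one model run per adjustable parameter for the Jacobian (forward differences) plus a set of lambda-testing runs per iteration, dispatched in cycles of at most one run per worker; the 30-minute per-run time is a hypothetical placeholder, not a measured value.

```python
import math

def runs_per_iteration(n_params, n_lambda, higher_order=False):
    """Model runs needed in one PEST iteration: Jacobian runs (one per parameter for
    forward differences, two per parameter for higher-order differences) plus
    lambda-testing runs."""
    jacobian_runs = 2 * n_params if higher_order else n_params
    return jacobian_runs + n_lambda

def iteration_minutes(n_params, n_lambda, n_workers, minutes_per_run):
    """Wall-clock estimate for one iteration: runs are dispatched in cycles,
    with at most one run per worker in each cycle."""
    cycles = math.ceil(runs_per_iteration(n_params, n_lambda) / n_workers)
    return cycles * minutes_per_run

if __name__ == "__main__":
    # 22 adjustable parameters; a hypothetical 30-minute WRF-Hydro run on 2 nodes.
    for workers, lambdas in [(1, 5), (6, 5), (12, 10), (23, 15)]:
        t = iteration_minutes(22, lambdas, workers, 30.0)
        print(f"{workers:2d} workers: ~{t:4.0f} min per iteration")
```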

In addition, by increasing the number of nodes each worker uses to run WRF-Hydro (Tests 3, 4, 5, and 6), the time cost for the entire calibration process is significantly reduced (Figs. 4c and 4d). Specifically, WRF-Hydro scales up well when using four, six, or eight nodes per worker compared with two nodes per worker. Both the time spent calculating the Jacobian matrix and the time spent testing the parameter upgrades decrease by 49%, 67%, and 77%, respectively, when using four, six, and eight nodes. Therefore, the total time spent also decreases when using more nodes for each worker (see Table 3). Moreover, for a larger study area such as the entire contiguous United States, we expect WRF-Hydro to have an even better scale-up capability (e.g., on dozens of nodes) than in this study.

While the numbers in Table 3 and Figure 4 are helpful to demonstrate the scale-up capability of each component (PEST and WRF-Hydro), they do not answer questions such as: given a certain number of nodes, how many workers and how many nodes per worker should be used to achieve the highest efficiency of the WRF-Hydro calibration using HPC-enabled PEST? On the other hand, one may have unlimited computational resources but want to complete the calibration in a short time period. We present a scalability analysis below to answer these questions. First, we generate more scenarios using different numbers of workers and nodes per worker by extrapolating the existing time and computing costs from the experiments already conducted. These scenarios use 23 or 12 workers and 4, 6, or 8 nodes per worker, respectively. Since we have conducted simulations using the same numbers of nodes per worker, the costs for these scenarios are easily predicted.

As shown in Figure 5, compared with Test 3 (which requires the least computing resource, 12 nodes in total), using more workers with the same number of nodes per worker (e.g., Tests 1 and 2) takes more time than the ideal curve. The ideal curve assumes a linear speedup based on the time cost of Test 3. However, using the same number of workers and increasing the number of nodes for each worker (e.g., Tests 4, 5, and 6) can achieve the ideal speedup. Even when using 12 workers, increasing the number of nodes for each worker can still achieve a speedup close to the ideal curve, whereas using 23 workers will not achieve the ideal speedup. Therefore, if only a certain number of nodes is available, we recommend using a relatively small number of workers but a large number of nodes for each worker. For example, if one has 48 nodes, three options can be considered: 23 workers and 2 nodes per worker; 12 workers and 4 nodes per worker; and 6 workers and 8 nodes per worker. Other partitions (16×3 or 8×6) between the number of workers and nodes per worker are not as efficient as these. The three options take 103, 72, and 60 minutes, respectively, to finish one iteration. Thus, using 6 workers and 8 nodes per worker is the most efficient way to use the limited computing resource. On the other hand, if one would like to conduct the calibration in a short time period without any limit on computing resources, then using 23 workers and 8 nodes per worker (perhaps even more nodes, depending on the size of the model domain and the scale-up capability of WRF-Hydro) will finish one iteration in ~24 minutes.
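The same arithmetic can be used to screen how a fixed node budget should be split between workers and nodes per worker. In the sketch below the per-run times for different node counts are hypothetical placeholders standing in for measured WRF-Hydro timings such as those behind Table 3; with real timings substituted, the ranking reproduces the kind of comparison made above for 48 nodes.

```python
import math

# Hypothetical per-run WRF-Hydro wall time (minutes) versus nodes per worker;
# in practice these numbers come from timing experiments such as Tests 3-6.
RUN_MINUTES = {2: 30.0, 4: 16.0, 6: 11.0, 8: 8.5}

def iteration_minutes(n_params, n_workers, nodes_per_worker, n_lambda):
    """One-iteration estimate: Jacobian runs plus lambda-testing runs, dispatched
    in cycles of at most one run per worker."""
    runs = n_params + n_lambda
    cycles = math.ceil(runs / n_workers)
    return cycles * RUN_MINUTES[nodes_per_worker]

def rank_partitions(node_budget, n_params=22, max_lambda=15):
    """Enumerate ways to split a fixed node budget into workers x nodes per worker,
    ranked by estimated time per iteration."""
    options = []
    for nodes_per_worker in RUN_MINUTES:
        n_workers = node_budget // nodes_per_worker
        if n_workers < 1:
            continue
        n_lambda = min(n_workers, max_lambda)   # keep lambda testing within one cycle
        t = iteration_minutes(n_params, n_workers, nodes_per_worker, n_lambda)
        options.append((t, n_workers, nodes_per_worker))
    return sorted(options)

if __name__ == "__main__":
    for minutes, workers, nodes in rank_partitions(48):
        print(f"{workers:2d} workers x {nodes} nodes/worker: ~{minutes:.0f} min per iteration")
```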
Evaluation of spatial transferability of the calibrated parameters

To assess the transferability of the calibrated parameters, we apply the optimum parameters obtained from the calibration at the four stations (black circles) in Fig. 1 to another set of four stations (crosses in Fig. 1) in the study area. All four sites are located on relatively small rivers, so the lag time between the precipitation peak and the discharge peak is much shorter than for the stations on the lower part of the MRB (e.g., Stations 2, 3, and 4). The assessment compares the observed discharge with the closest grid cells from the discharge output of WRF-Hydro. Figure 6 shows the observed and modeled discharge using the default and the optimum parameters. Overall, by using the parameters calibrated at other sites over the area, the model increases the discharge and shifts the hydrograph shape so they are much closer to the observations than the model results using default parameters. The absolute error of simulated discharge decreases by 13.1%, 38.3%, and 71.6%, respectively, over Stations 6 through 8 (Station 5 shows a 6% increase in absolute error), compared with the default simulated discharge. We also find that the SVD-based regularization in the PEST calibration captures the timing of the discharge peak better than the estimation mode, which reaches the discharge peak one day earlier than the observations.

Summary and discussion

WRF-Hydro is a new, and perhaps the first practical, computer code that can run on HPC systems and model the entire hydrological cycle using physics-based submodels and high-resolution input datasets (e.g., radar). The hydrological community has desired this capability for decades. WRF-Hydro also offers, for example, an exponential storage-discharge function for a groundwater model or reach-based channel routing. Our preliminary testing shows that when the exponential storage-discharge function is used with the default parameters provided by WRF-Hydro, the modeled discharge is larger than the observations; thus, the calibration would need to adjust the parameters to reduce the discharge. Our study finds that for calibrating 22 parameters, using the same computing resource as for running WRF-Hydro, the HPC-enabled PEST calibration tool can speed up WRF-Hydro calibration by a factor of 15 compared with running PEST in sequential mode. The speedup factor can be larger when the number of parameters needing calibration is higher (e.g., 50 or 100).

The following are several key points that we would like to mention to inform future studies:

1. In this study, we consider using the prior or regularization information only for the parameters that we calibrate. As is the case with solving inverse problems, prior information is added to improve the smoothness of the solutions.
In order to build a more comprehensive calibration, an important extension would be to enrich the prior with the available historical data. For example, in this particular case, one can use historical observation data (e.g., April and May from the past few years) to enrich the prior information for the parameters. The regularization objective function in PEST would then comprise not only the discrepancies between parameters and their "current estimates" but also the discrepancies between WRF-Hydro simulations and preferred values (i.e., the observed time series of historical discharge). Additionally, one can use the pilot points technique described by Doherty (2005) in conjunction with parameter estimation to add more flexibility to the calibration process, which is potentially beneficial for improving the predictions.

2. To focus on our main goal, we calibrate only the parameters in lookup tables. However, we acknowledge that using a single value to represent a physical process over a large domain could be problematic, especially since we expect the HPC-enabled parallel PEST to execute with WRF-Hydro for large domains. This situation often calls for parameter regionalization. For example, WRF-Hydro version 5.0 has many spatially distributed parameters available, such as OVROUGHRTFAC (the overland flow roughness scaling factor), RETDEPRTFAC (the maximum retention depth factor), and the soil-related parameters (when compiled with SPATIAL_SOIL=1). Calibrating these spatial parameters at the grid or catchment scale rather than as a single value will give the model more flexibility and thus a better fit to the observations (Hundecha and Bardossy, 2004; Wagener and Wheater, 2006). In practice, for example, one can include regional OVROUGHRTFACs (e.g., one value per catchment). However, the selection of the locations and sizes of catchments may introduce significant uncertainties into the calibration results, which requires systematic and comprehensive investigation and understanding of the study area.

3. This study is limited to calibrating the observed streamflow only, based on the format of one of the WRF-Hydro model outputs for individual stations or points (frxst_pts_out.txt). It is feasible, however, to calibrate other variables as long as the observation data are available. For example, one can either find the closest point in the gridded dataset to the observation location and then compare that model grid cell with observations, or one can change the WRF-Hydro input/output code to write other variables into the frxst_pts_out.txt file, so that the same interface we developed here can still be used to calibrate other variables in addition to the discharge (a reading sketch is given after this list).

4. The optimal parameter set obtained from this study is from the 5th iteration of parallel PEST (Table 1), identified by the 3-day calibration over four stations in the study area (indicated by crosses in Fig. 1).
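As an illustration of reading the station output named in point 3, the snippet below parses frxst_pts_out.txt into per-station discharge series; the column layout assumed in the comments should be checked against the output of your own WRF-Hydro configuration, since it can differ between versions.

```python
import csv
from collections import defaultdict

def read_frxst(path):
    """Parse frxst_pts_out.txt into {station_id: [(timestamp, discharge_cms), ...]}.
    Assumed comma-separated columns: seconds from start, timestamp, station ID,
    longitude, latitude, discharge (m3/s), discharge (cfs), stage (m)."""
    series = defaultdict(list)
    with open(path) as fh:
        for row in csv.reader(fh):
            if len(row) < 6:
                continue                     # skip empty or malformed lines
            timestamp = row[1].strip()
            station = row[2].strip()
            discharge_cms = float(row[5])
            series[station].append((timestamp, discharge_cms))
    return series

if __name__ == "__main__":
    for station, values in read_frxst("frxst_pts_out.txt").items():
        print(station, "records:", len(values))
```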