GP-SWAT (v1.0): a two-layer graph-based parallel simulation framework for the SWAT model

High-fidelity and large-scale hydrological models are increasingly used to investigate the impacts of human activities and climate change on water availability and quality. However, the detailed representations of real-world systems and processes contained in these models inevitably lead to prohibitively high execution times, ranging from minutes to days. This becomes computationally prohibitive or even infeasible when large iterative model simulations are involved. In this study, we propose a generic two-layer model parallelization scheme to reduce the run time of computationally expensive model applications through a combination of model spatial decomposition and the graph-parallel Pregel algorithm. Taking the Soil and Water Assessment Tool (SWAT) as an example, we implemented a generic tool named GP-SWAT, enabling model-level and subbasin-level model parallelization on a Spark computing cluster. We then evaluated GP-SWAT in two sets of experiments to demonstrate its potential to accelerate single and iterative model simulations and to run in different environments. In each test set, GP-SWAT was applied for the parallel simulation of eight synthetic hydrological models with different input/output (I/O) burdens and river network characteristics. The experimental results indicate that GP-SWAT can effectively address the high computational demands of the SWAT model. In addition, as a scalable and flexible tool, it can be run in diverse environments, from a commodity computer running the Microsoft Windows operating system to a Spark cluster consisting of a large number of computational nodes. Moreover, it is possible to apply this generic scheme to other subbasin-based hydrological models or even acyclic models in other domains to alleviate I/O demands and optimize model computational performance.

https://doi.org/10.5194/gmd-2020-429 Preprint. Discussion started: 27 January 2021. © Author(s) 2021. CC BY 4.0 License.


Introduction
With the enhanced availability of high-resolution remote sensing data and long periods of hydrometeorological data, hydrologists are increasingly building high-fidelity hydrological models to investigate water availability (Liang et al., 2020), water quality (Fang et al., 2020), climate change (Cai et al., 2016), and watershed management options (Jayakody et al., 2014; Qi and Altinakar, 2011; Lee et al., 2010). However, these hydrological models, which contain detailed representations of real-world systems and processes, can demand large computational budgets and require prohibitively high execution times, ranging from minutes to days (Razavi et al., 2010). Because modeling practices such as model calibration and uncertainty analysis usually involve thousands of model evaluations or more, they may sometimes become computationally prohibitive or even infeasible (Razavi and Tolson, 2013). Thus, the effective use of computationally expensive simulations remains a challenge for many applications involving a large number of model simulations.
In general, there are four broad research methods for alleviating the computational burden associated with computationally expensive model applications: (1) utilizing metamodeling approaches (Chandra et al., 2020; Sun et al., 2015), (2) developing computationally efficient algorithms (Humphrey et al., 2012; Joseph and Guillaume, 2013), (3) opportunistically avoiding model evaluations (Razavi et al., 2010), and (4) utilizing parallel computing technologies and infrastructures (Yang et al., 2020; Huang et al., 2019; Wu et al., 2013; Wu et al., 2014; Zamani et al., 2020). The first three methods share the same goal of reducing the computational demand by using lightweight surrogate models, decreasing the number of model simulations, and terminating model execution early when the simulation result is poorer than expected. The fourth method adopts a different strategy of boosting model application performance by optimizing the efficiency of computational resource utilization. These four strategies are largely complementary, and in practice, a given modeling approach may employ a combination of two or more of these strategies (e.g., coupling parallelization with a computationally efficient optimization algorithm). In this study, we propose a scheme in which the last strategy is adopted to optimize the efficiency of generic modeling activities. The remainder of this section will briefly review model parallelization methods and establish the context for the research presented in this paper.
To some extent, the efficiency of model parallelization is determined by the parallelization scale and/or granularity. From the model structure perspective, a model can be parallelized at both the model and subunit levels. At the former scale, a complete model is used as a parallel unit, whereas at the latter scale, the subunits of the model are parallelized (referred to as submodel parallelization). From the spatial domain perspective, a watershed can be partitioned into subbasins, and it is possible to simulate sibling subbasins simultaneously (referred to as subbasin parallelization). Both of these latter methods of decomposition and parallelization can be applied in modeling applications involving only a single model simulation (e.g., emergency decision-making or flood forecasting support based on a single model run) and can be combined with model-level parallelization to maximize the performance in modeling applications involving a large number of model simulations.
However, submodel parallelization is relatively complex and usually requires model reconstruction because modelers must consider the communication among components, failover, task management, etc. As a result, a steep learning curve is expected for modelers who are unfamiliar with the model source codes and parallel computation frameworks. On the other hand, subbasin parallelization usually requires no model reconstruction, as a subbasin can be simply treated as a watershed consisting of only one subbasin, and the upstream inputs can be treated as the boundary conditions. Examples of subbasin parallelization include the works of Wang et al. (2013), Yalew et al. (2013) and Liu et al. (2014).

In addition, the choice of parallel computing techniques and resources has a great influence on model parallelization performance. Recent studies have demonstrated the feasibility of combining parallel computation techniques with shared-memory or distributed-memory systems to enhance the efficiency of model parallelization. Examples of model parallelization with shared-memory systems include the works of Rouholahnejad et al. (2012), Wu and Liu (2012), and Joseph and Guillaume (2013). However, the performance gains of these tools or methods are hindered by the poor scalability of shared-memory systems. On the other hand, distributed-memory model parallelization systems offer better scalability and, thus, better performance. A distributed-memory model parallelization system in the form of grid, cluster or cloud computing allows computational nodes to be dynamically added to boost model parallelization performance.
For example, Whittaker (2004) and Confesor Jr and Whittaker (2007) presented a parallel method combining cluster-based parallel computing with the nondominated sorting genetic algorithm version II on a Beowulf cluster consisting of a server and 12 computation nodes to facilitate the analysis of uncertainty in the Soil and Water Assessment Tool (SWAT). Zhang et al. (2013) established a Python-based parallel computing package called PP-SWAT by combining Python, the Message Passing Interface (MPI) standard for Python and the open-source MPI implementation Open MPI for the multiple-objective calibration of SWAT. Gorgan et al. (2012) implemented the gSWAT application by leveraging grid computing technologies and infrastructures for the calibration of extensive hydrological models. However, these methods have two major limitations: they usually require complex computational facilities that may not be readily available, and modelers must use a parallel framework (e.g., Open MPI) to cope with model communications, failover, task management, etc.
Recently, the growth of cloud computing systems and big data techniques has enabled modelers to avoid the aforementioned limitations. For example, with cloud computing, users can easily build their own private clouds or simply use third-party facilities (e.g., Azure HDInsight, Amazon Web Services EMR, or Google Dataproc). In addition, parallel frameworks for processing big data, such as Hadoop, Spark and Flink, are now available to alleviate the burden placed on modelers for addressing low-level programming tasks such as communications, failover, and task management. These advantages have sparked research on efficient solutions to complex high-dimensional computational problems (e.g., Humphrey et al., 2012).

In this study, we propose a two-layer model parallelization scheme based on a combination of the graph-parallel Pregel algorithm and model spatial domain decomposition. In accordance with this scheme, a graph-parallel simulation tool for SWAT (named GP-SWAT) has been developed using Spark, an open-source general-purpose distributed cluster computing framework. GP-SWAT has been assessed in two sets of experiments to demonstrate its potential to accelerate single and iterative model simulations at different parallelization granularities when implemented on a computer running the Windows operating system (OS) and on a Spark cluster consisting of five computational nodes. Experiment set one was conducted to illustrate that GP-SWAT can be used to perform subbasin-level model parallelization using a multicore computer running the Windows OS, while in experiment set two, GP-SWAT was assessed for iterative model runs. For each experiment in the latter set, subbasin- and watershed-level parallelization schemes were employed to execute 1000 model simulations with one to five parallel tasks implemented on each computational node.
In each of the test cases, GP-SWAT was evaluated based on eight synthetic hydrological models representing different input/output (I/O) burdens and different levels of river network complexity.

Parallelization methods
Model parallelization can be achieved at the model and/or submodel level. Model-level parallelization is applicable for modeling routines such as model calibration, sensitivity and uncertainty analysis, and identifying beneficial management practices, which involve a large number of model simulations; however, it cannot be used to reduce the run time in circumstances involving only one model simulation (e.g., flood forecasting). By contrast, submodel-level parallelization is effective in use cases involving only a single model simulation, and it can also be applied in combination with model-level parallelization to maximize the performance of modeling routines involving a large number of model simulations. However, submodel-level parallelization usually requires model reconstruction to enable the parallel simulation of model components and to achieve the communication among the components that is necessary for integrating the model results. Although there are some parallel computation frameworks available that can facilitate model parallelization at the submodel level, such as Open MPI and the Open Multi-Processing (OpenMP) application programming interface (API), it is still a very tedious and time-consuming process.
For spatially explicit acyclic models, it is possible to decompose a large-scale model into multiple smaller models and properly orchestrate the simulation processes of these smaller models to generate integrated results for the original model. This spatial decomposition and merging strategy has all of the advantages of submodel-level parallelization but requires no model reconstruction. Taking the SWAT model as an example, a large-scale watershed model involving multiple subbasins can be split into multiple smaller models, each of which consists of only one subbasin (hereafter referred to as subbasin models). The stream flow and chemical loadings from upstream subbasins can be treated as boundary conditions, which can be incorporated as point sources. Through proper organization of the simulation of these models, a result identical to that of the original model can be achieved. In this strategy, upstream subbasin models must be simulated before downstream ones; however, sibling subbasin models can be simulated in parallel to optimize the model performance. For detailed information about the implementation of subbasin-level parallelization for the SWAT model, readers are referred to Yalew et al. (2013).
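This orchestration constraint amounts to a level-by-level schedule over the drainage graph: a subbasin model can run only after all of its upstream neighbors have run, and all subbasins within one level (siblings) can run concurrently. A minimal Python sketch of such a scheduler (illustrative only, not part of GP-SWAT; the edge list encodes a hypothetical five-subbasin watershed in which subbasins 1 and 2 drain into 3, and 3 and 4 drain into 5):

```python
from collections import defaultdict

def parallel_schedule(edges, subbasins):
    """Group subbasins of an acyclic drainage network into waves.

    Every subbasin in a wave has all of its upstream neighbors
    scheduled in an earlier wave, so subbasins within one wave
    (siblings) can be simulated in parallel.
    edges: list of (upstream, downstream) subbasin pairs.
    """
    upstream = defaultdict(set)
    for src, dst in edges:
        upstream[dst].add(src)
    done, waves = set(), []
    while len(done) < len(subbasins):      # terminates because the graph is acyclic
        wave = [s for s in subbasins
                if s not in done and upstream[s] <= done]
        waves.append(wave)
        done.update(wave)
    return waves

# Hypothetical watershed: 1 and 2 drain into 3; 3 and 4 drain into 5.
waves = parallel_schedule([(1, 3), (2, 3), (3, 5), (4, 5)], [1, 2, 3, 4, 5])
# Three waves: headwater subbasins first, the outlet last.
```

With this topology the schedule is [[1, 2, 4], [3], [5]]: three headwater subbasins run in parallel, then subbasin 3, then the outlet.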

Graph representation of hydrological models
A property graph is a directed graph with properties attached to each vertex and edge. Each vertex is keyed by a unique identifier. Similarly, each edge has corresponding source and destination vertex identifiers. In many domains of the natural and social sciences, relationships of interest can be described by property graphs, such as the relationships in traffic and social networks. Accordingly, many graph algorithms have been devised to simplify and improve the performance on related analytical tasks, including methods for the parallel processing of graph-based data. From this perspective, the relationships between the components of a distributed hydrological model can be described by a dendriform property graph. Figure 1a demonstrates how a simple watershed model can be represented with a property graph. Each subbasin can be represented as a vertex with an identifier and two properties denoting how many subbasins exist directly upstream of the current subbasin (referred to as subNo hereafter) and the number of these directly upstream subbasins for which the simulation process has been completed in the current computation step (referred to as finSubNo hereafter). Each edge denotes an upstream-downstream flow drainage relationship. Modeling routines involving iterative simulations can also be represented in the form of property graphs. In this case, each simulation is represented by a subgraph of the integrated graph, and each outlet vertex of these subgraphs is connected to a virtual vertex to form the integrated graph (figure 1b). To uniquely identify a vertex in the integrated graph, the vertex identifier consists of both the subbasin number and the simulation number. In addition, the virtual vertex is identified by a special number (i.e., -1 in our case).
Such an integrated property graph representing a modeling routine involving iterative simulation can also be interpreted as a property graph representing a large-scale virtual hydrological model consisting of many identical landscapes (represented by the same model with different parameters) that are connected to a virtual outlet. Moreover, in this case, each vertex can be considered to represent a subbasin model or watershed model, and thus, it is possible to achieve model parallelization at the watershed and subbasin levels (figure 1c).
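The encoding described above can be sketched as follows (illustrative only; GP-SWAT itself builds the graph with the Spark graph API in Java/Scala). The sketch assumes vertices keyed by a (simulation number, subbasin number) pair, vertex properties [subNo, finSubNo], and a single virtual outlet vertex identified by -1, all consistent with the description in the text; the function and variable names are hypothetical:

```python
def build_graph(edges, n_subbasins, n_sims=1, virtual_id=-1):
    """Build an integrated property graph for n_sims iterative runs.

    Vertices are keyed by (sim, subbasin) and carry the property list
    [subNo, finSubNo]: the number of directly upstream subbasins and
    the number of those already simulated (initially 0). Each
    simulation's outlet is wired to one shared virtual vertex so that
    all runs form a single graph.
    edges: list of (upstream, downstream) subbasin pairs.
    """
    vertices, graph_edges = {}, []
    for sim in range(n_sims):
        indeg = {s: 0 for s in range(1, n_subbasins + 1)}
        outlets = set(indeg)                 # subbasins with no downstream edge
        for src, dst in edges:
            indeg[dst] += 1
            outlets.discard(src)
            graph_edges.append(((sim, src), (sim, dst)))
        for s, d in indeg.items():
            vertices[(sim, s)] = [d, 0]      # [subNo, finSubNo]
        for o in outlets:
            graph_edges.append(((sim, o), (0, virtual_id)))
    # The virtual outlet waits for one message per simulation.
    vertices[(0, virtual_id)] = [n_sims, 0]
    return vertices, graph_edges
```

For the five-subbasin example used later in the text, `build_graph([(1, 3), (2, 3), (3, 5), (4, 5)], 5, n_sims=2)` yields two identical subgraphs whose outlet vertices both point to the virtual vertex.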

Design and implementation of GP-SWAT
Apache Spark is an open-source general-purpose distributed cluster computing framework that provides distributed task dispatching, scheduling, and graph functionalities through an API available in the Java, Python, Scala, and R languages.
Apache Spark can run in diverse environments, ranging from a single computer (running the Microsoft Windows or Linux OS) to a computer cluster consisting of a large number of computational nodes hosted in an in-house or cloud environment.
GP-SWAT is designed to work with the Spark graph API for model-level and subbasin-level parallelization. The driver program constructs a property graph from the route information generated by the preprocessing program and defines how the models are to be simulated in parallel; its output is then replicated to the executors to actually carry out model parallelization. The Pregel operator is parameterized by three anonymous functions (figure 3). In the first anonymous function, finSubNo (the second property of a vertex, denoting the number of directly upstream subbasin models for which simulation has been completed) is updated by adding the merged message (denoting the number of directly upstream subbasin models for which simulation had been completed in the previous superstep), and finSubNo is then compared with subNo (denoting the number of subbasins directly upstream of the current subbasin). If finSubNo is equal to subNo (meaning that simulation is complete for all directly upstream subbasins), then the subbasin model represented by the current vertex is executed through an external function that will be discussed later. In the second anonymous function, used to compute the messages for the next superstep, a value of "1" is sent to the downstream neighboring vertex if the current subbasin model has been simulated in the current superstep; otherwise, a value of "0" is sent, indicating that no subbasin model was executed. The third anonymous function simply adds two messages from upstream vertices and returns the result of the addition operation (this function is invoked n-1 times, where n is the number of inbound messages).
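The three anonymous functions can be rendered in plain Python as follows (a sketch of the logic only; GP-SWAT passes the corresponding functions to Spark's Pregel operator in Scala, and `run_subbasin_model` is a hypothetical stand-in for the external model-invocation function discussed later):

```python
def receive_msg(vertex_id, props, msg, run_subbasin_model):
    """First anonymous function (vertex program).

    Fold the merged message (upstream models finished in the previous
    superstep) into finSubNo; run the subbasin model once all directly
    upstream subbasins are finished. Returns the updated properties and
    a flag indicating whether the model ran in this superstep.
    """
    sub_no, fin = props          # props = (subNo, finSubNo)
    fin += msg
    ran = False
    if fin == sub_no:            # all upstream subbasin models are done
        run_subbasin_model(vertex_id)
        ran = True
    return (sub_no, fin), ran

def send_msg(ran_this_superstep):
    """Second anonymous function: message to the downstream neighbor,
    "1" if this vertex's model ran in the current superstep, else "0"."""
    return 1 if ran_this_superstep else 0

def merge_msg(a, b):
    """Third anonymous function: combine two inbound messages by
    addition (invoked n-1 times for n inbound messages)."""
    return a + b
```

For example, a vertex with subNo = 2 that receives the merged message 1 + 1 = 2 updates finSubNo to 2 and triggers its subbasin model.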

In figure 4, we demonstrate the model parallelization of a simple hydrological model consisting of five subbasins (figure 1a) using the Pregel algorithm. In the initial superstep, the receiveMsg program is invoked on all vertices and is passed the default message "0". As seen from figure 4, finSubNo is equal to subNo at vertices 1, 2 and 4; thus, these three subbasin models are executed. Next, the sendMsg program is invoked on all vertices that have inbound messages and downstream neighboring vertices. For vertices 1, 2 and 4, at which subbasin model simulation has been performed, "1" is sent to their downstream neighboring vertices. Because vertex 3 did not execute simulation of its subbasin model, "0" is sent. Finally, at the end of each superstep, the mergeMsg program is invoked on vertices with at least two inbound messages in the next superstep. In the second superstep, the receiveMsg program is invoked on vertices 3 and 5, which have inbound messages.
Because finSubNo is equal to subNo only at vertex 3, only the subbasin model associated with this vertex is executed, and sendMsg sends "1" to its downstream neighboring vertex 5. In superstep 3, the receiveMsg program is invoked on vertex 5 and triggers the simulation of its associated subbasin model, as its finSubNo value has reached the number of directly upstream subbasins.
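The superstep sequence described above can be reproduced with a toy message-passing loop (a sketch, not the Spark implementation; the edge list and initial vertex properties encode the five-subbasin example of figure 1a, with vertices 1 and 2 draining into 3, and 3 and 4 draining into 5):

```python
def run_supersteps(edges, props):
    """Toy Pregel-style loop for a dendriform drainage graph.

    edges: list of (upstream, downstream) pairs (each vertex has at
    most one downstream neighbor). props maps vertex -> [subNo,
    finSubNo]. Returns the sorted sets of vertices whose subbasin
    models ran in each superstep.
    """
    downstream = dict(edges)
    inbox = {v: 0 for v in props}          # initial default message "0" to all
    history = []
    while inbox:                            # only messaged vertices stay active
        ran = set()
        for v, msg in inbox.items():        # receiveMsg
            props[v][1] += msg
            if props[v][1] == props[v][0]:
                ran.add(v)                  # simulate subbasin model v
        next_inbox = {}
        for v in inbox:                     # sendMsg, combined via addition (mergeMsg)
            d = downstream.get(v)
            if d is not None:
                next_inbox[d] = next_inbox.get(d, 0) + (1 if v in ran else 0)
        history.append(sorted(ran))
        inbox = next_inbox
    return history
```

Running this on the example graph gives [[1, 2, 4], [3], [5]], matching the three supersteps walked through above.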

Figure 4 (caption fragment): the upper right number indicates the number of directly upstream subbasins for which simulation has been completed, and a red vertex denotes that the corresponding subbasin model is simulated in that superstep.
The function for invoking a model (Model.call in figure 3) is implemented in Java to reuse components that have been implemented in previous studies. Because the models are designed to be executed in parallel within a computational node and among nodes, we must ensure that a model can be executed only once at a given time. Thus, the first step of model simulation is to find a model that is not occupied; this task is achieved through a combination of the synchronization mechanism and static indicators of Java. When a free model is identified, the watershed configuration file is replaced, and the upstream point-source files stored in the NFS shared storage are copied to the model directory. Next, the subbasin model inputs are edited in accordance with the parameter set of the current simulation. Finally, the model is executed, and the model output is copied to the NFS shared storage for later use.
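The "find a free model, run, release" step can be sketched as follows (illustrative only; GP-SWAT implements this in Java with the synchronized mechanism and static indicators, and the pool class, directory names and placeholder steps below are hypothetical, with Python's threading.Lock standing in for Java synchronization):

```python
import threading

class ModelPool:
    """Guard a fixed set of on-disk model copies so that each copy is
    used by at most one concurrent simulation at a time."""

    def __init__(self, model_dirs):
        self._free = list(model_dirs)
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:                  # analogous to a Java synchronized block
            return self._free.pop() if self._free else None

    def release(self, model_dir):
        with self._lock:
            self._free.append(model_dir)

def simulate(pool, params):
    """Run one simulation in an unoccupied model copy."""
    model_dir = pool.acquire()
    if model_dir is None:
        raise RuntimeError("no free model copy")
    try:
        # In GP-SWAT the following steps would happen here:
        # 1. replace the watershed configuration file;
        # 2. copy upstream point-source files from the NFS shared storage;
        # 3. edit the subbasin model inputs with the current parameter set;
        # 4. execute the model and copy its output back to the NFS storage.
        return f"simulated {params} in {model_dir}"   # placeholder result
    finally:
        pool.release(model_dir)           # free the copy even on failure
```

The try/finally guarantees that a model copy is returned to the pool even if a simulation step fails, which mirrors the requirement that a model be occupied by only one task at a given time.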

Case study
Eight synthetic hydrological models representing different I/O burdens and different levels of river network complexity were used to evaluate the efficiency of GP-SWAT. These synthetic models were built based on the Harp Lake and Jinjiang hydrological models (Fu et al., 2014; Zhang et al., 2020; Zhang et al., 2015). The Harp Lake and Jinjiang hydrological models include 38 and 99 subbasins, respectively. For the various synthetic hydrological models, the number of subbasins, the simulation period and other configuration parameters are the same, but different numbers of hydrological response units are included to produce different I/O burdens.

The actual speedup is calculated as follows:

Speedup_act = (n × ST) / JET

where n is the total number of simulations for a given job, JET is the total execution time of the job when run in the test environment, ST is the average execution time of one model simulation, and Speedup_act measures how much faster a program runs in parallel compared to being run sequentially on a single computer. For subbasin-level parallelization, we also calculated the theoretical speedup, which considers the speedup that a model can achieve under different test configurations. The theoretical speedup is calculated as follows:

Speedup_theo = sub_num / Σ_{i=1}^{n} Ceil(Count_i / PT_num)

where sub_num is the number of subbasins of the hydrological model under test, n is the total number of supersteps, i denotes the i-th superstep, Count_i is the number of subbasin models simulated in the i-th superstep, PT_num is the number of parallel tasks performed at an executor, and Ceil is a function that returns the smallest integer value that is greater than or equal to a predetermined parameter value.

Experiment set one was carried out on a workstation with 24 processors operating at a frequency of 2.67 GHz, 24 GB of RAM, and 1 TB of disk storage, running the Windows 2012 Server OS. Spark was configured with one executor, and a maximum of 24 cores were allowed in this executor.
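The two speedup metrics can be checked numerically with a short sketch (the superstep counts in the usage example are illustrative, taken from the five-subbasin example rather than from the experiments):

```python
import math

def actual_speedup(n, st, jet):
    """Speedup_act = (n * ST) / JET: sequential time of n simulations
    (n times the average single-run time ST) over the measured parallel
    job execution time JET."""
    return n * st / jet

def theoretical_speedup(counts, pt_num):
    """Speedup_theo = sub_num / sum_i Ceil(Count_i / PT_num).

    counts[i] is the number of subbasin models simulated in superstep
    i; since every subbasin runs exactly once, sub_num = sum(counts).
    pt_num is the number of parallel tasks per executor.
    """
    sub_num = sum(counts)
    return sub_num / sum(math.ceil(c / pt_num) for c in counts)

# Five-subbasin example, supersteps running 3, 1 and 1 models,
# with 3 parallel tasks: 5 / (1 + 1 + 1) = 1.67.
print(theoretical_speedup([3, 1, 1], 3))
```

The theoretical speedup makes the diminishing returns visible: with pt_num at least as large as the widest superstep, the run time is bounded below by the number of supersteps, i.e., by the depth of the river network.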
The eight synthetic hydrological models were used to evaluate subbasin-level parallelization with the number of parallel executor tasks ranging from one to 24. Experiment set two was carried out on a Spark cluster consisting of 5 computational nodes. Each computational node had a quad-core CPU, 8 GB of RAM and 50 GB of disk storage. The CPU frequency was 2.2 GHz, and each node was running 64-bit Linux as the OS. In this Spark cluster, each node could have 1-5 executors, but each executor was allowed to have only one core. Model-level and subbasin-level parallelization of the eight synthetic hydrological models was conducted on this Spark cluster with 5, 10, 15, 20 and 25 executors.

Performance analysis
Spark applications can run either in local mode (nondistributed deployment mode with a single Java Virtual Machine (JVM)) or in cluster mode. When run in local mode, Spark can be deployed on a computer with a Microsoft Windows, Mac, or Linux OS. Experiment set one was designed to verify that GP-SWAT can be used to perform subbasin-level model parallelization using a multicore computer running the Windows OS. Each synthetic model was simulated 10 times with 1-24 cores, and the actual speedup values were calculated using the average execution time. Figure 5 shows the actual and theoretical speedups versus the number of parallel tasks performed in local mode. In general, the actual speedup values increase with increasing model complexity for both the HP- and JJ-based synthetic models, indicating that subbasin-level parallelization works especially well for complex models. For less complex models, such as HP1 and JJ1, the best speedup values are 1.8 and 2.3, respectively. However, for more complex models, a much better speedup can be achieved. For example, the maximum speedup values achieved for HP4 and JJ4 are 4.8 and 5.8, respectively.

In experiment set two, we evaluated the performance of GP-SWAT for iterative model runs. For each experiment in this set, subbasin- and watershed-level parallelization (hereafter referred to as the decomposed mode and the integrated mode, respectively) were employed to execute 1000 model simulations in the test environment with one to five parallel tasks implemented at each computational node. The speedup values achieved in the integrated mode (red) and in the decomposed mode (blue) are plotted against the number of parallel tasks in figure 6. For simple synthetic models (i.e., HP1 and JJ1), the speedups in the integrated mode surpass those in the decomposed mode, while for more complex models (models other than HP1 and JJ1), the speedups in the decomposed mode are better than those in the integrated mode. This is a result of the conflict between the system overheads (including job initiation, task orchestration, and model result transfer) and performance gains associated with subbasin-level parallelization. In general, the speedup gradually increases with up to 20 parallel tasks and then slightly decreases as the number of parallel tasks continues to increase. Similar to the case of single model runs, we also observe that the speedup increases with increasing model complexity for both the decomposed and integrated modes. As discussed before, this phenomenon is caused by the additional performance gains resulting from model decomposition. For example, the maximum speedups for the least complex models HP1 and JJ1 are 6.62 and 8.34, respectively, while the speedups for HP4 and JJ4 are 22.75 and 27.03, respectively. Compared with the parallelization of iterative model simulations, the parallelization of a single model is of less interest because most model applications involve a large number of simulations.
However, we have found that it is essential to accelerate single model simulation when using hydrological models to support applications such as emergency decision-making or flood forecasting in a production environment. Because of its ability to accelerate a single model, we believe that GP-SWAT can be integrated into decision-making and flood forecasting systems to offer quasi-real-time support for decision-making and flood forecasting.
GP-SWAT can be run in either the decomposed mode or the integrated mode. As indicated by the experimental results, the operation mode can exert a great influence on the performance of GP-SWAT. Compared with the integrated mode, the decomposed mode requires more computation time to perform task management and model result transfer; thus, it may not be suitable for lightweight models. On the other hand, the decomposed mode can greatly alleviate the I/O bottleneck for complex models, thereby reducing the computation time. In such a case, the extra time required for task management and model result transfer is negligible. Therefore, we suggest running GP-SWAT in the integrated mode for lightweight models and running it in the decomposed mode otherwise. There is one exception to this suggestion. When conducting iterative simulations of lightweight models, GP-SWAT should run in the decomposed mode if the changes are restricted to a small portion of the downstream subbasins. In addition, the performance of GP-SWAT can be affected by the characteristics of the watershed route network. In general, the decomposed mode is less suitable for watersheds that are thin and long than for watersheds with short paths and many branches.

Advantages and limitations
GP-SWAT represents the first attempt to accelerate the simulation of a computationally intensive SWAT model through two-layer parallelization using the graph-parallel Pregel algorithm. As a two-layer model parallelization tool, it can be used to accelerate single or iterative model simulations, endowing it with a range of possible applications and great flexibility in maximizing performance. In addition, the applicability of a model parallelization tool is also influenced by its running environment. As an open-source tool implemented in Java and Scala, GP-SWAT can be run not only on a commodity computer with a Linux, Unix or Windows OS but also on a Spark cluster with a large number of computational nodes.
Moreover, the Spark framework is so widely used in the IT industry that many cloud providers currently offer convenient on-demand managed Spark clusters (e.g., Google Dataproc, Amazon Web Services EMR, and Azure HDInsight) with out-of-the-box support for Spark-based applications. Thus, users can easily adapt GP-SWAT to run in these environments, thereby avoiding the technical and financial burdens encountered when building an in-house Spark cluster.
Moreover, the Spark-based implementation of GP-SWAT lends it many advantages that other distributed computation frameworks (e.g., MPI) may not have. First, Spark provides many high-level functionalities, such as distributed task dispatching, scheduling, failover management and load balancing, which can greatly reduce the burden on programmers. For example, by means of the failover management functionality, Spark-based programs can gracefully handle partial failures (e.g., partial breakdown of the computational nodes), as the Spark framework can automatically detect failed tasks and then reschedule these tasks on healthy computational nodes. In MPI-based applications, it may be necessary to explicitly manage these issues. Second, the Spark framework is a higher-level parallel computing framework. The SWAT model and Spark need to be coupled only loosely to achieve model parallelization. In contrast, to achieve model parallelization with MPI, model reconstruction is usually required.
GP-SWAT uses the NFS protocol to exchange data among different computational nodes. While NFS is easy to implement, some problems may be encountered in the case of a large-scale computational cluster. Because NFS employs a master-slave architecture, performance bottlenecks may arise when a large number of nodes are simultaneously attempting to read from or write to the master. Another drawback of GP-SWAT is that it requires additional I/O operations when working in the decomposed mode (i.e., subbasin-level parallelization). For example, the basin-level inputs are replicated to the subbasin models, and point-source files (which contain redundancies with the standard SWAT outputs) are generated and read by the subbasin models. However, this issue can be partially addressed by optimizing the I/O module of SWAT to reduce the redundant output.

Summary and future work
In this study, we developed a two-layer (i.e., model-level and subbasin-level) parallel simulation framework for the SWAT model, called GP-SWAT, based on the graph-parallel Pregel algorithm. The efficiency of GP-SWAT was evaluated through two sets of experiments. Experiment set one was conducted to illustrate that GP-SWAT can be used to perform subbasin-level model parallelization using a multicore computer running the Windows OS. The results show that GP-SWAT can accelerate the single model simulation process, especially for complex models. In experiment set two, GP-SWAT was assessed for iterative model runs. For each experiment in this set, subbasin- and watershed-level parallelization schemes were employed to execute 1000 model simulations in the test environment with one to five parallel tasks implemented at each computational node. Our experimental results show that GP-SWAT can achieve a remarkable performance improvement over traditional SWAT models by leveraging the computational power of a Spark cluster. In addition to performance improvement, GP-SWAT also has some other notable features.
(1) It can be employed to perform both individual and iterative model parallelization, endowing it with a range of possible applications and great flexibility in maximizing performance through the selection of a suitable parallelization mode (at the 360 model level or the subbasin level).
(2) It is a flexible and scalable tool that can run in diverse environments, ranging from a commodity computer with a Microsoft Windows, Mac, or Linux OS to a Spark cluster consisting of a large number of computational nodes, either deployed in-house or provided by a third-party cloud service provider.
Further work should also be conducted to improve the performance and extend the usability of our proposed method. In this study, data exchange was achieved through the NFS protocol, which can present an I/O bottleneck when a large number of computational nodes are involved. To scale GP-SWAT to a larger computational cluster, a distributed storage system, such as the Hadoop Distributed File System (HDFS) or Redis, should be used to address this potential issue. In addition, the application of the proposed scheme to other environmental models, such as the Hydrologic Simulation Program-FORTRAN (HSPF) model, will be necessary to demonstrate the flexibility and universality of our proposed method.

Code availability
The model code (GP-SWAT v1.0) is released under the MIT license. The source code, manual and synthetic models used in this study are available on the Zenodo repository (https://doi.org/10.5281/zenodo.4447969, Zhang et al., 2020b).