DiRong1.0: A distributed implementation for improving routing network generation in model coupling

1 Ministry of Education Key Laboratory for Earth System Modeling, Department of Earth System Science, Tsinghua University, Beijing, China
2 Hydro-Meteorological Center of Navy China, Beijing, China
3 Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), China
4 State Key Laboratory of Numerical Modeling for Atmospheric Sciences and Geophysical Fluid Dynamics (LASG), Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing, China


Introduction
Coupled Earth system models and numerical weather forecasting models highly depend on existing couplers (Hill et al., 2004; Craig et al., 2005; Larson et al., 2005; Balaji et al., 2006; Redler et al., 2010; Craig et al., 2012; Valcke, 2013; Liu et al., 2014). In some existing coupling software such as MCT and C-Coupler, the global information of a parallel decomposition is originally distributed among all processes of a component model, because each process only records its local parallel decomposition, i.e., the grid cells assigned to it. Therefore, these couplers generally use the following four steps to generate a routing network between the parallel decompositions of a source (src) and a destination (dst) component model.

1) Gathering the global parallel decomposition: the src/dst root process gathers the global information of the src/dst parallel decomposition from all src/dst processes.
2) Exchanging the global parallel decompositions: the src/dst root process first exchanges the src/dst global parallel decomposition with the dst/src root process, and then broadcasts the dst/src global parallel decomposition to all src/dst processes.

3) Detecting common grid cells: each src/dst process detects its common grid cells with each dst/src process based on its local parallel decomposition and the dst/src global parallel decomposition.

4) Generating the routing network: each src/dst process generates its local routing network according to the information about common grid cells.
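The common-cell detection in step 3 can be illustrated with a short sketch. This is not the couplers' actual code; the function name `detect_common_cells` and the dictionary-based layout are hypothetical, chosen only to show how a process intersects its local decomposition with the peer model's reconstructed global decomposition.

```python
def detect_common_cells(local_cells, peer_global_decomp):
    """local_cells: set of global cell indices owned by this process.
    peer_global_decomp: dict mapping peer process ID -> set of global cell
    indices it owns (the gathered and broadcast global information).
    Returns, per peer process, the sorted list of shared global indices."""
    common = {}
    for peer_id, peer_cells in peer_global_decomp.items():
        shared = local_cells & peer_cells
        if shared:
            common[peer_id] = sorted(shared)
    return common

# A src process owning cells {1, 4, 7} against a dst model on 2 processes:
local = {1, 4, 7}
dst_decomp = {0: {1, 2, 3, 4}, 1: {5, 6, 7, 8}}
print(detect_common_cells(local, dst_decomp))  # {0: [1, 4], 1: [7]}
```

Note that every process must hold the *entire* peer decomposition for this to work, which is exactly the memory and communication cost the global implementation pays.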

Given that each of the src and dst component models uses K processes on a grid of size N (i.e., the grid has N cells), the first and second steps when using C-Coupler correspond to gather/broadcast communications whose time and memory costs grow with both N and K.

Design and implementation

Overall design
The design and implementation of DiRong1.0 benefits significantly from the general idea of distributed directories (Pinar and Hendrickson, 2001), which has already been used in coupler development (Theurich et al., 2008; Hanke et al., 2016). DiRong1.0 defines and uses a different kind of specific distributed directory. Each cell of a grid can be numbered with a unique index from 1 to N, called the "global" cell index, while the grid cells assigned to the same process can each be numbered with a unique "local" cell index. Thus, the information of a given parallel decomposition can be recorded as a Cell Local-Global Mapping Table (CLGMT), each entry of which is a triple of global cell index, process ID, and local cell index. For example, Tables 1 and 2 are the CLGMTs corresponding to the parallel decompositions in Fig. 1a and 1b, respectively.
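For illustration only, a CLGMT can be sketched as a list of such triples; the `CLGMTEntry` type and the `build_local_clgmt` helper below are hypothetical names, not part of C-Coupler:

```python
from collections import namedtuple

# A CLGMT entry is a triple of (global cell index, process ID, local cell index).
CLGMTEntry = namedtuple("CLGMTEntry", ["global_index", "process_id", "local_index"])

def build_local_clgmt(process_id, owned_global_indices):
    """Each process numbers its own cells 1..n as local indices, in the order
    the model stores them (hypothetical helper mirroring Tables 1-2)."""
    return [CLGMTEntry(g, process_id, i + 1)
            for i, g in enumerate(owned_global_indices)]

# Process 0 owning global cells 3, 1, 8 (in its local storage order):
entries = build_local_clgmt(0, [3, 1, 8])
print(entries[0])  # CLGMTEntry(global_index=3, process_id=0, local_index=1)
```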
Generally, the CLGMT entries of a parallel decomposition are distributed among the processes of a component model, i.e., each process only stores part of the CLGMT. The key idea of existing global implementations is to reconstruct the global CLGMT of the peer parallel decomposition in each process for routing network generation. They depend on global CLGMTs because the distribution of the CLGMT entries is determined by the model, and thus a coupler generally has to treat any distribution as random. To be an efficient solution, DiRong1.0 should instead be fully based on a distributed CLGMT, without reconstructing any global CLGMT.
Motivated by the above analysis, the key challenge in DiRong1.0 is to efficiently rearrange the original distribution of the CLGMT entries of a given parallel decomposition into a regular intermediate distribution, and to efficiently generate the routing network based on that intermediate distribution. Specifically, we employ a regular intermediate distribution that evenly distributes the CLGMT entries among processes with the global cell indices placed in ascending order. Such an intermediate distribution is not only simple, but also enables a straightforward rearrangement of the CLGMT entries via a procedure similar to a distributed sort. With that, DiRong1.0 takes the following major steps to generate a routing network between the src and dst component models.
1) The src/dst component model rearranges the original distribution of the CLGMT entries of the src/dst parallel decomposition into the regular intermediate distribution.
2) The src and dst component models exchange the CLGMT entries in the intermediate distributions.
3) Based on the src and dst CLGMT entries in the intermediate distributions, each src/dst process generates table entries of the sharing relationship, which describes how each grid cell is shared between the processes of the src and dst component models.

4) The src/dst component model rearranges the intermediate distribution of the entries in the sharing relationship table (SRT) into the original distribution of the CLGMT entries of the src/dst parallel decomposition.

5) Each src/dst process generates its local routing network based on its local SRT entries.

The remainder of this section details the implementation of each major step, except the last one, which is similar to the last major step of the global implementation.

Rearranging CLGMT entries within a component model
The rearrangement of CLGMT entries within a component model is achieved via a divide-and-conquer sorting procedure similar to a merge sort, using the global cell index as the sort key. This procedure first sorts the CLGMT entries locally in each process, and then conducts a distributed sort via a main loop of logK iterations, where K is the number of processes of the src/dst component model. In each iteration, processes are divided into distinct pairs, and the two processes in each pair swap CLGMT entries via a point-to-point communication. Figure 3 shows an example of the distributed sort corresponding to the CLGMT entries in Table 1, and Table 3 shows the distributed CLGMT after rearranging the CLGMT entries in Table 2. As shown in Fig. 3, the distributed sort employed in DiRong1.0 uses a butterfly communication pattern similar to the optimized MPI implementations of various collective communication operations (Brooks, 1986; Thakur et al., 2005).
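The butterfly pairing over logK rounds can be simulated as follows. This is a simplified stand-in, not the paper's exact merge sort: because the even intermediate distribution fixes the target owner of every entry from its global cell index, the sketch routes entries directly to their target rank over the same pairing (rank pairs at distance d = K/2, K/4, ..., 1). All names are hypothetical, processes are simulated as list slots, and K and the grid size are assumed to be powers of two.

```python
def target_rank(g, n_cells, k):
    # Owner of global cell index g (1-based) in the even intermediate
    # distribution; assumes n_cells is divisible by k for simplicity.
    return (g - 1) * k // n_cells

def butterfly_redistribute(clgmt_per_proc, n_cells):
    """Route every CLGMT entry to its target rank via log2(K) rounds of
    pairwise swaps over a butterfly pattern, then sort locally."""
    k = len(clgmt_per_proc)               # assumed to be a power of two
    buf = [sorted(p) for p in clgmt_per_proc]   # local sort first
    d = k // 2
    while d >= 1:
        for p in range(k):
            partner = p ^ d               # butterfly pairing at distance d
            if partner < p:
                continue
            keep, send = [], []
            for e in buf[p] + buf[partner]:
                # partner keeps entries whose target rank has bit d set
                (send if target_rank(e[0], n_cells, k) & d else keep).append(e)
            buf[p], buf[partner] = keep, send
        d //= 2
    return [sorted(p) for p in buf]

# 4 processes, 8 grid cells, CLGMT entries initially scattered arbitrarily:
before = [[(5, 0, 1), (2, 0, 2)], [(7, 1, 1), (1, 1, 2)],
          [(3, 2, 1), (8, 2, 2)], [(6, 3, 1), (4, 3, 2)]]
after = butterfly_redistribute(before, 8)
print(after)
# [[(1, 1, 2), (2, 0, 2)], [(3, 2, 1), (4, 3, 2)],
#  [(5, 0, 1), (6, 3, 1)], [(7, 1, 1), (8, 2, 2)]]
```

After the log2(4) = 2 rounds, each simulated process holds a contiguous, sorted, non-overlapping range of global cell indices, which is exactly the property the next major step relies on.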

Exchanging CLGMT entries between component models
After the rearrangement of the CLGMT in a component model, the CLGMT entries are sorted in ascending order of global cell index and are evenly distributed among processes. The CLGMT entries held by each process therefore cover a determinate and non-overlapping range of global cell indices, and such a range can be easily calculated from the grid size, the total number of processes, and the process ID. Thus, it is straightforward to calculate the overlap of the global cell index ranges between a src process and a dst process. As CLGMT entries only need to be exchanged between a pair of src and dst processes with overlapping ranges, point-to-point communications suffice to handle the exchange of the CLGMT entries.
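The range calculation and overlap test can be sketched as follows (hypothetical helper names; an even split with the grid size divisible by the process count is assumed for brevity):

```python
def index_range(rank, k, n_cells):
    """Global cell index range [lo, hi] held by `rank` in the intermediate
    distribution (1-based; assumes n_cells divisible by k)."""
    per = n_cells // k
    return (rank * per + 1, (rank + 1) * per)

def overlapping_peers(rank, k_self, k_peer, n_cells):
    """Peer ranks whose intermediate index range overlaps ours; only these
    pairs need a point-to-point exchange of CLGMT entries."""
    lo, hi = index_range(rank, k_self, n_cells)
    peers = []
    for r in range(k_peer):
        plo, phi = index_range(r, k_peer, n_cells)
        if lo <= phi and plo <= hi:   # standard interval-overlap test
            peers.append(r)
    return peers

# 4 src processes vs 2 dst processes on a 16-cell grid:
print(overlapping_peers(0, 4, 2, 16))  # [0]
print(overlapping_peers(2, 4, 2, 16))  # [1]
```

Because the ranges are contiguous and ordered, each process overlaps only a small, contiguous set of peer ranks, so the number of messages per process stays low.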
Generating SRT entries

Following the previous major step, each process holds two sequences of CLGMT entries, corresponding to the src and dst parallel decompositions. Given that the two sequences contain n1 and n2 entries, respectively, the time complexity of detecting the sharing relationship is O(n1+n2): the entries in each sequence are already ordered by ascending global cell index, so a procedure similar to the kernel of a merge sort, which merges two ordered data sequences, can handle the detection. To record the sharing relationship, an SRT entry is designed as a quintuple of global cell index, src process ID, src local cell index, dst process ID, and dst local cell index. Given a quintuple <q1,q2,q3,q4,q5>, the data on global cell q1 in the src component model, corresponding to local cell q3 in process q2, is transferred to local cell q5 in process q4 of the dst component model. Table 4 shows the SRT in the src component model calculated from the rearranged, distributed CLGMT entries in Fig. 3 and Table 3.
It is possible that multiple src CLGMT entries correspond to the same global cell index. In such a case, any one of these src CLGMT entries can be used for generating the corresponding SRT entries, because the src component model guarantees that the data copies on the same grid cell are identical. Given a dst CLGMT entry, if there is no src CLGMT entry with the same global cell index, no SRT entry is generated. When multiple dst CLGMT entries correspond to the same global cell index and there is at least one src CLGMT entry with that global cell index, an SRT entry is generated for every such dst CLGMT entry.
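The merge-style detection, including the duplicate-handling rules above, can be sketched as a single-process illustration (the function name is hypothetical; both inputs are sorted by global cell index, as guaranteed by the previous major steps):

```python
def generate_srt(src_entries, dst_entries):
    """Merge two sorted CLGMT sequences into SRT quintuples in O(n1+n2).
    Any one src copy of a cell is used; one quintuple is emitted per matching
    dst entry; dst cells with no src counterpart yield nothing."""
    srt = []
    i = 0
    for g, dpid, dlid in dst_entries:
        # advance the src cursor until its global index reaches g
        while i < len(src_entries) and src_entries[i][0] < g:
            i += 1
        if i < len(src_entries) and src_entries[i][0] == g:
            _, spid, slid = src_entries[i]   # first src copy is as good as any
            srt.append((g, spid, slid, dpid, dlid))
    return srt

src = [(1, 0, 1), (2, 0, 2), (2, 1, 1)]   # cell 2 replicated on the src side
dst = [(2, 5, 1), (2, 6, 1), (3, 5, 2)]   # cell 2 replicated on dst; cell 3 unmatched
print(generate_srt(src, dst))  # [(2, 0, 2, 5, 1), (2, 0, 2, 6, 1)]
```

Note how cell 2 produces one SRT entry per dst copy while cell 3, absent on the src side, produces none, matching the rules stated above.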

Rearranging SRT entries within a component model
After the previous major step, the SRT entries are distributed among the processes of a component model according to the intermediate distribution. Because a process can only use the SRT entries corresponding to its local cells for the last major step of local routing network generation, the SRT entries need to be rearranged among the processes of a component model. Such a rearrangement can be achieved via a sorting procedure similar to a distributed sort using the src/dst process ID as the sort key, or even via the sorting procedure implemented in the first major step. Tables 5 and 6 show the SRT entries distributed in the src and dst component models, respectively, after the rearrangement.

To facilitate the implementation of the sorting procedure, we force the number of processes in the first to fourth major steps to be the largest power of two (2^n) no larger than the total number of processes of the src/dst component model. For a process whose ID I is not smaller than 2^n, its CLGMT entries are merged into the process with ID I-2^n before the first major step, and its SRT entries are obtained from the process with ID I-2^n after the fourth major step. This strategy does not change the aforementioned time and memory complexities of DiRong1.0, as 2^n is larger than half of the total number of processes.
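The power-of-two folding can be sketched in a few lines (hypothetical function name; `k` is the total number of processes of a component model):

```python
def fold_to_power_of_two(k):
    """Largest 2**n <= k, plus the mapping from each excess rank I (>= 2**n)
    to the rank I - 2**n that absorbs its CLGMT entries before the first
    major step and returns its SRT entries after the fourth."""
    n = k.bit_length() - 1
    p = 1 << n                       # largest power of two not exceeding k
    return p, {i: i - p for i in range(p, k)}

print(fold_to_power_of_two(6))   # (4, {4: 0, 5: 1})
print(fold_to_power_of_two(8))   # (8, {})
```

Since 2^n > k/2 always holds, each absorbing rank takes on at most one extra rank's entries, which is why the complexities are unchanged.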

Evaluation
To evaluate DiRong1.0, we implement it in C-Coupler2, which enables us to compare it with the original global routing network generation. For the evaluation we develop a toy coupled model consisting of two toy component models and C-Coupler2, which allows us to flexibly change the model settings in terms of grid size and number of processor cores (processes). The toy coupled model is run on a supercomputer, where each computing node includes two Intel Xeon E5-2678 v3 CPUs (24 processor cores in total), and all computing nodes are connected via an InfiniBand network. The codes are compiled with the Intel Fortran and C++ compilers at optimization level O2, using the Intel MPI library (2018 Update 2). A maximum of 6400 cores is used for running the toy coupled model, and all test results are averages over multiple runs.

In Tables 7 to 10, we evaluate the effect of varying the number of processes; the two component models use the same number of processor cores. For a grid size of 500,000 (Table 7), the message size per point-to-point communication is small (about 160 KB with 60 cores and about 6 KB with 1600 cores for each toy component model), but the execution time of point-to-point communication does not vary linearly with message size and may be unstable when the message size is small. In contrast to DiRong1.0, the execution time of the global implementation increases rapidly with an increasing number of cores. As a result, DiRong1.0 outperforms the global implementation more significantly when using more cores, and this advantage persists as the grid size increases (e.g., from 4,000,000 in Table 8 to 32,000,000 in Table 10).

Considering that a model can use more processor cores for acceleration when its resolution becomes finer, we further evaluate the weak scalability of DiRong1.0 by concurrently increasing the grid size and the number of cores so as to keep the number of grid points per process roughly constant. As shown in Table 11, the execution time of DiRong1.0 increases slowly, whereas the execution time of the global implementation increases rapidly with larger grid sizes and increasing numbers of cores. This demonstrates that DiRong1.0 achieves much better weak scalability than the global implementation.

Conclusion and discussion
This paper proposes a new distributed implementation, DiRong1.0, for routing network generation. It is much more efficient than the global implementation, as it does not introduce any gather/broadcast communication and achieves much lower complexity in terms of time, memory, and communication. This conclusion is demonstrated by our evaluation results.
DiRong1.0 has already been implemented in C-Coupler2. Its code is publicly available in a C-Coupler2 version and will be further used in future C-Coupler versions. We believe that some existing couplers such as MCT, OASIS3-MCT, and CPL6/CPL7 can also benefit from DiRong1.0, as it accelerates the routing network generation and thus the coupler initialization.
We did not evaluate the impact of DiRong1.0 on the total time of a model simulation, because this impact depends on the simulation length. The overhead of routing network generation, and of coupler initialization in general, is trivial for a long simulation (e.g., hundreds of model days or even hundreds of model years), but may be significant for a short simulation (e.g., several model days or even several model hours in weather forecasting (Palmer et al., 2008; Hoskins, 2013)). Data assimilation for weather forecasting may require a model to run for only several model hours or even less. In an operational model, there is generally a time limit for producing forecast results (for example, finishing a five-day forecast within two hours), and thus developers always have to carefully optimize the various software modules, especially when the model resolution becomes finer. In fact, one of the primary motivations for the development of DiRong1.0 was to accelerate the initialization of C-Coupler2 for an operational coupled model used in China.
Another main reason for developing DiRong1.0 is that routing network generation will become more important in later versions of C-Coupler. Recently, a new framework for weakly coupled ensemble data assimilation (EDA) based on C-Coupler2, named DAFCC1 (Sun et al., 2020), was developed. DAFCC1 will be an important part of C-Coupler3, the next version of C-Coupler. For users who want the atmosphere component of a coupled system to perform EDA, DAFCC1 will automatically generate an ensemble component corresponding to all ensemble members of the atmosphere component for calling the DA algorithm, and will automatically conduct routing network generation for the data transfers between the ensemble component and each ensemble member. Thus, routing network generation will be used more frequently in EDA with DAFCC1. For example, given 50 ensemble members, the routing network generation involving the ensemble component will be conducted at least 50 times.

We note that the current sequential read of a remapping weight file is another drawback of C-Coupler2. Similar to Hanke et al. (2016), we will design a specific distributed directory for reading in the remapping weights in parallel, which will allow the remapping weights to be efficiently redistributed among processes based on DiRong1.0. Currently, C-Coupler2 employs a simple global representation for horizontal grids, which means that each process keeps all points of a horizontal grid in memory. The global representation will become a bottleneck in at least two aspects. First, it will consume too much memory when running a model simulation. For example, given a horizontal grid with 16,000,000 points, the memory required to keep it in each process is about 1.3 GB, provided that each point has four vertices and the data type is double precision.
Second, the initialization of the data interpolation functionality requires model grids to be exchanged between different component models, which introduces global communications (e.g., broadcasts) of the global grid representations. To address this bottleneck, we will design and develop a distributed grid representation that can be viewed as a specific distributed directory, enabling an efficient redistribution of horizontal grid points among processes based on DiRong1.0.

Author contributions. HY was responsible for code development, software testing, and experimental evaluation of DiRong1.0, and co-led paper writing. LL initiated this research, was responsible for the motivation and design of DiRong1.0, supervised HY, and co-led paper writing. CS, RL, XY, and CZ contributed to code development and software testing. ZZ and BW contributed to the motivation and software testing. All authors contributed to the improvement of ideas and paper writing.