A new distributed algorithm for routing network generation in model coupling and its evaluation based on C-Coupler2

It is a fundamental functionality of a coupler for Earth system modeling to efficiently handle data transfer between component models. Routing network generation is a major step in initializing the data transfer functionality. Most existing couplers employ an inefficient and unscalable global implementation for routing network generation that relies on collective communications. This is a main reason why the initialization cost of a coupler increases rapidly when using more processor cores. In this paper, we propose a new Distributed algorithm for Routing network generation (DaRong), which does not introduce any collective communication and achieves much lower complexities than the global implementation. DaRong is therefore much more efficient and scalable than the global implementation, which has been further demonstrated via empirical evaluations. DaRong has already been implemented in C-Coupler2. We believe that existing and future couplers can also benefit from it.

3) Detecting common grid cells: each src/dst process detects its common grid cells with each dst/src process based on its local parallel decomposition and the dst/src global parallel decomposition.
4) Generating the routing network: each src/dst process generates its local routing network according to the information about common grid cells.
Given that each of the src and dst component models uses K processes and the corresponding grid size is N (the grid has N ...
In the following context, existing implementations of routing network generation are called global routing network generation.

Overall design
Each cell of a grid can be numbered with a unique index from 1 to N (called the global cell index), while each grid cell assigned to the same process can also be numbered with a unique local cell index. Thus, the information of a given parallel decomposition can be recorded as a Cell Local-Global Mapping Table (CLGMT), each element of which is a triple of global cell index, process ID, and local cell index. For example, Tables 1 and 2 are the CLGMTs corresponding to the parallel decompositions in Fig. 1a and Fig. 1b, respectively. Generally, the CLGMT entries of a parallel decomposition are distributed among the processes of a component model, which means that a process stores only a part of the CLGMT. The key idea of the existing global implementation can be summarized as reconstructing the global CLGMT of the peer parallel decomposition in each process for routing network generation. To be a scalable solution, DaRong should be fully based on the distributed CLGMT without reconstructing any global CLGMT. Existing implementations have to depend on global CLGMTs because the distribution of the CLGMT entries is determined by a model, and thus a coupler generally has to view any distribution as random.
https://doi.org/10.5194/gmd-2020-91 Preprint. Discussion started: 21 April 2020 © Author(s) 2020. CC BY 4.0 License.
... into the original distribution of the CLGMT entries of the src/dst parallel decomposition.
5) Each src/dst process generates its local routing network based on the local SRT entries.
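As an illustration of the CLGMT described above, the following minimal Python sketch records the locally stored part of a CLGMT as a list of triples. The names `CLGMTEntry` and `build_local_clgmt` are hypothetical, local cell indexes are assumed to start from 1, and the 6-cell decomposition is made up for illustration (it is not the one in Fig. 1).

```python
from collections import namedtuple

# Hypothetical record type: one CLGMT entry is a triple of
# global cell index, process ID, and local cell index.
CLGMTEntry = namedtuple("CLGMTEntry", ["global_index", "process_id", "local_index"])


def build_local_clgmt(process_id, local_global_indexes):
    """Build the CLGMT entries that one process stores for its own cells.

    local_global_indexes[i] is the global cell index of local cell i + 1
    (1-based local numbering is an assumption of this sketch).
    """
    return [
        CLGMTEntry(global_index, process_id, local_index)
        for local_index, global_index in enumerate(local_global_indexes, start=1)
    ]


# Made-up decomposition of a 6-cell grid over 2 processes.
clgmt_p0 = build_local_clgmt(0, [5, 1, 3])
clgmt_p1 = build_local_clgmt(1, [2, 6, 4])
```

Each process holds only its own list, so no process in this sketch ever sees the full table.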
In the remainder of this section, we detail the implementation of each major step except the last one, because it is similar to the last major step in the global implementation.

Rearranging CLGMT entries within a component model
Such rearrangement is achieved via a divide-and-conquer sorting procedure that is similar to a merge sort keyed on the global cell index. This procedure first sorts the CLGMT entries locally in each process, and then iteratively conducts a distributed sort through a main loop of log K iterations (K is the number of processes of the src/dst component model). In each iteration, processes are divided into distinct pairs and the two processes in each pair swap CLGMT entries via point-to-point communication. Figure 2 shows an example of the distributed sort corresponding to the CLGMT entries in Table 1, and Table 3 shows the distributed CLGMT after rearranging the CLGMT entries in Table 2.
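The pairwise structure of this rearrangement can be simulated serially. The sketch below is one possible scheme, not necessarily the one implemented in C-Coupler2: the text does not specify how pairs are chosen, so the sketch assumes that in each of the log K iterations process p is paired with the process whose ID differs from p in one bit, and that the target owner of a global cell index follows an even, ordered split of the grid. The names `owner` and `distributed_sort` are hypothetical.

```python
import heapq


def owner(global_index, grid_size, num_procs):
    """Process that holds a 1-based global cell index after an even, ordered split."""
    return (global_index - 1) * num_procs // grid_size


def distributed_sort(per_process_entries, grid_size):
    """Serially simulate the rearrangement of CLGMT entries.

    per_process_entries[p] is the list of (global_index, process_id, local_index)
    triples initially held by process p. The number of processes must be a power of 2.
    """
    num_procs = len(per_process_entries)
    assert num_procs & (num_procs - 1) == 0, "process count must be a power of 2"

    # Step 1: each process sorts its own entries by global cell index.
    held = [sorted(entries) for entries in per_process_entries]

    # Step 2: log2(K) iterations; in each one, every process exchanges
    # entries with exactly one partner (a point-to-point swap in a real run).
    bit = num_procs >> 1
    while bit:
        new_held = []
        for p in range(num_procs):
            partner = p ^ bit
            # Keep the entries whose target owner agrees with p on this bit ...
            kept = [e for e in held[p]
                    if (owner(e[0], grid_size, num_procs) & bit) == (p & bit)]
            # ... and receive the entries the partner cannot keep.
            received = [e for e in held[partner]
                        if (owner(e[0], grid_size, num_procs) & bit) == (p & bit)]
            # Both lists are already sorted, so a linear merge is enough.
            new_held.append(list(heapq.merge(kept, received)))
        held = new_held
        bit >>= 1
    return held


# Made-up example: an 8-cell grid scattered over 4 processes.
initial = [
    [(7, 0, 1), (2, 0, 2)],
    [(5, 1, 1), (1, 1, 2)],
    [(8, 2, 1), (3, 2, 2)],
    [(4, 3, 1), (6, 3, 2)],
]
rearranged = distributed_sort(initial, grid_size=8)
```

After the loop, process p holds a contiguous, ascending range of global cell indexes, which is the property the next major step relies on.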

Exchanging CLGMT entries between component models
After the rearrangement of the CLGMT in a component model, the CLGMT entries are sorted in ascending order of global cell index and evenly distributed among processes. The CLGMT entries held by each process therefore have a determinate and non-overlapping range of global cell indexes, and such a range can be easily calculated from the grid size, the total number of processes, and the process ID. Thus, it is easy to calculate the overlap of global cell index ranges between a src process and a dst process. As it is only necessary to exchange CLGMT entries between a pair of src and dst processes with overlapping ranges, point-to-point communications alone are enough for handling the exchange of the CLGMT entries.
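The range calculation and the overlap test can be sketched as follows. The exact formula for the range is not given in the text, so the even split used here (ceiling-based boundaries, 0-based process IDs) is an assumption, and `index_range` and `overlapping_peers` are hypothetical names.

```python
def ceil_div(a, b):
    """Ceiling of a / b for non-negative integers."""
    return -(-a // b)


def index_range(process_id, grid_size, num_procs):
    """Inclusive (first, last) range of 1-based global cell indexes held by a process."""
    first = ceil_div(process_id * grid_size, num_procs) + 1
    last = ceil_div((process_id + 1) * grid_size, num_procs)
    return first, last


def overlapping_peers(process_id, grid_size, num_procs, peer_num_procs):
    """IDs of the peer processes whose index ranges overlap this process's range."""
    first, last = index_range(process_id, grid_size, num_procs)
    peers = []
    for q in range(peer_num_procs):
        q_first, q_last = index_range(q, grid_size, peer_num_procs)
        # Two inclusive ranges overlap when their intersection is non-empty.
        if max(first, q_first) <= min(last, q_last):
            peers.append(q)
    return peers


# Made-up example: a 12-cell grid, 4 src processes and 3 dst processes.
src_range = index_range(2, 12, 4)
peers_of_src_1 = overlapping_peers(1, 12, 4, 3)
```

Because every process can evaluate these functions locally, each src process can identify its dst partners without any collective communication.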

Generation of SRT
After the previous major step, each process holds two sequences of CLGMT entries corresponding to the src and dst parallel decompositions, respectively. Given that the two sequences contain n1 and n2 entries respectively, the time complexity of detecting the sharing relationship is O(n1+n2), because the entries in each sequence have already been ordered by ascending global cell index, and a procedure similar to the kernel of merge sort that merges two ordered data sequences can handle such detection.
To record the sharing relationship, an SRT entry is designed as a quintuple of global cell index, src process ID, src local cell index, dst process ID, and dst local cell index. A quintuple <q1,q2,q3,q4,q5> means that local cell q3 in process q2 of the src component model is global cell q1, and the data on it will be transferred to local cell q5 in process q4 of the dst component model. Table 4 shows the SRT in the src component model, calculated from the rearranged distributed CLGMT entries in Fig. 2 and Table 3.

It is possible that multiple src CLGMT entries correspond to the same global cell index. In such a case, any of these src CLGMT entries can be used for generating the corresponding SRT entries, because the src component model should guarantee that the data copies on the same grid cell are exactly the same. Given a dst CLGMT entry, if there is no src CLGMT entry with the same global cell index, no SRT entry will be generated. Given that multiple dst CLGMT entries correspond to the same global cell index and there is at least one src CLGMT entry with the same global cell index, an SRT entry will be generated for each dst CLGMT entry.
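The merge-style detection and the rules for repeated global cell indexes can be sketched in a few lines. This is an illustrative sketch rather than the C-Coupler2 code: the function name `generate_srt` is hypothetical, entries are plain tuples, and when several src entries share a global cell index the sketch simply uses the first one.

```python
def generate_srt(src_entries, dst_entries):
    """Merge two sequences sorted by global cell index into SRT quintuples.

    Each input entry is (global_index, process_id, local_index). Each output
    entry is (global_index, src_proc, src_local, dst_proc, dst_local).
    Runs in O(n1 + n2) because each index only moves forward.
    """
    srt = []
    i = j = 0
    while i < len(src_entries) and j < len(dst_entries):
        src_global, src_proc, src_local = src_entries[i]
        dst_global, dst_proc, dst_local = dst_entries[j]
        if src_global < dst_global:
            i += 1  # no remaining dst entry needs this src entry
        elif src_global > dst_global:
            j += 1  # dst cell without a src counterpart: no SRT entry
        else:
            srt.append((src_global, src_proc, src_local, dst_proc, dst_local))
            # Advance only the dst side, so that every dst entry sharing this
            # global cell index gets its own SRT entry from the same src entry.
            j += 1
    return srt


# Made-up example: global cell 2 is duplicated on both sides,
# global cell 3 exists only on the dst side.
src = [(1, 0, 1), (2, 0, 2), (2, 1, 1), (4, 1, 2)]
dst = [(2, 3, 1), (2, 4, 1), (3, 4, 2), (4, 3, 2)]
srt = generate_srt(src, dst)
```

In this example the two dst entries for global cell 2 both receive data from local cell 2 of src process 0, and global cell 3 produces no SRT entry.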

Rearranging SRT entries within a component model
After the previous major step, the SRT entries are distributed among the processes of a component model according to the intermediate distribution. As a process can use only the SRT entries corresponding to its local cells for the last major step of local routing network generation, the SRT entries should be rearranged among the processes of a component model. Such rearrangement can also be achieved via a sorting procedure similar to the distributed sort, keyed on the src/dst process ID; the sorting procedure implemented for the first major step can even be reused. Tables 5 and 6 show the SRT entries distributed in the src and dst component models, respectively, after the rearrangement.
To facilitate the implementation of the sorting procedure, we force the number of processes involved in the 1st to 4th major steps to be the maximum power of 2 (2^n) no larger than the total process number of the src/dst component model. For a process whose ID I is not smaller than 2^n, its CLGMT entries will be merged into the process with the ID of I-2^n before the first major step, and the SRT entries corresponding to it will be obtained from the process with the ID of I-2^n after the fourth major step. This strategy will not change the above time complexity and memory complexity of DaRong, as 2^n is larger than half of the total process number.
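The power-of-2 restriction amounts to a small amount of index arithmetic. The sketch below assumes 0-based process IDs (which the condition "ID not smaller than 2^n" suggests); the names `largest_power_of_two` and `proxy_process` are hypothetical.

```python
def largest_power_of_two(total_procs):
    """Largest power of 2 that is no larger than total_procs (total_procs >= 1)."""
    return 1 << (total_procs.bit_length() - 1)


def proxy_process(process_id, total_procs):
    """ID of the process that handles major steps 1 to 4 on behalf of process_id.

    Processes with an ID below 2^n act for themselves; a process with ID I >= 2^n
    hands its CLGMT entries to process I - 2^n and later fetches its SRT entries
    from that same process.
    """
    active = largest_power_of_two(total_procs)
    return process_id - active if process_id >= active else process_id


# Made-up example: 6 processes, so 2^n = 4 and processes 4 and 5 are folded
# onto processes 0 and 1.
proxies = [proxy_process(p, 6) for p in range(6)]
```

Since 2^n is larger than half of the total process number, I - 2^n is always smaller than 2^n, so each active process acts for at most one extra process.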

Evaluation
To evaluate DaRong, we implemented it in C-Coupler2, which enables us to compare it with the original global routing network generation in C-Coupler2. We developed a toy coupled model consisting of two toy component models and C-Coupler2 for the evaluation, which enables us to flexibly change the model settings in terms of grid size and number of processor cores (processes). The toy coupled model was run on a supercomputer, where each computing node includes two Intel Xeon E5-2678 v3 CPUs (24 processor cores in total) and all computing nodes are connected with an InfiniBand network. The codes were compiled by an Intel Fortran and C++ compiler at the optimization level O2, using an Intel MPI library (3.2.2). A maximum of 3200 cores was used for running the toy coupled model.
We evaluated DaRong under varying process numbers (Fig. 3; the two component models use the same number of processor cores). For the grid size of 500,000 (Fig. 3a), DaRong outperforms the global implementation more significantly when using more cores. When the grid size gets larger (e.g., 4,000,000 in Fig. 3b and 16,000,000 in Fig. 3c), DaRong still significantly outperforms the global implementation and shows better scalability.

Considering that a model can use more processor cores for acceleration when its resolution gets finer, we further evaluated the weak scalability of DaRong, where we concurrently increased the grid size and core number to achieve similar numbers of grid points per process. As shown in Table 7, the execution time of DaRong increases slowly, while the execution time of the global implementation increases rapidly with the increase of grid size and core number. This demonstrates that DaRong achieves much better weak scalability than the global implementation.

Conclusion
In this paper, we propose a new distributed algorithm, DaRong, for routing network generation. As it does not introduce any collective communication and achieves much lower complexity in terms of time, memory and communication than the global implementation that is widely used in existing couplers, it is much more efficient and scalable than the global implementation. The evaluation results further support this conclusion.