Development and performance optimization of a parallel computing infrastructure for an unstructured-mesh modelling framework

Abstract. This paper describes the development and performance optimization of a parallel computing infrastructure for an unstructured-mesh global model (GRIST; Global-to-Regional Integrated forecast SysTem). The focus is on three major aspects that facilitate rapid iterative development: parallel computing, index optimization and an efficient group I/O strategy. For parallel computing, the METIS tool is used for the partition of the global mesh, which is flexible and convenient for both the quasi-uniform and variable-resolution simulations. The scaling tests show that the partition method is efficient. To improve the cache efficiency, several mesh index reordering strategies are investigated to optimize the performance of the indirect addressing scheme used in the stencil calculations. The numerical results show that the indexing strategies are able to speed up the calculations, especially when running with a small number of processes. To overcome the bottleneck of poor I/O efficiency for high-resolution or massively parallel simulations, a group parallel I/O method is implemented and proven to be of high efficiency in the numerical experiments. Altogether, these three aspects of the parallel computing toolkits are encapsulated in a few interfaces, which can be used for general parallel modelling on unstructured meshes.

considered in this study) is one of the major choices for these newly developed global models (e.g., Ullrich et al. (2017)), mainly owing to their ability to allow general computational patterns and their flexibility to switch between uniform-mesh and variable-resolution (VR) modelling.

Despite certain advantages of the unstructured meshes, several obstacles have to be overcome to achieve practical computational efficiency. First, to support both quasi-uniform and VR simulations, the parallel-partition strategy should be general enough and possess good load balance. The conventional method of dividing an icosahedral grid into 10 identical rhombi and partitioning each rhombus into blocks (e.g., MacDonald et al. (2011)) is typically not applicable. Second, the neighbours of a grid point on an unstructured mesh cannot be obtained by simple index shifting; thus, an indirect addressing scheme (MacDonald et al. (2011)) is typically used to perform the stencil calculations. This results in discontinuous memory access during model integration, which reduces the efficiency of compiler optimization and cache reuse. Although the directly addressed vertical index can be placed on the innermost dimension, the computational performance in our numerical experiments is still unsatisfactory, which differs slightly from the conclusion of MacDonald et al. (2011), where no appreciable performance penalty for the indirect addressing scheme was observed. Third, because the mesh points distributed to each process cannot form a regular rectangular area as supported by a structured grid, the I/O operations between memory and the parallel file system are also discontinuous, posing a bottleneck for high-resolution and massively parallel computing. In short, to make scientific computing on an unstructured mesh practical, a unified and efficient approach to handle parallel communication, computation and data I/O is an important task.
with which the dual cell is associated), and the edge point (the intersecting point of a pair of edges that belong to a primal and dual cell, respectively). Several model variables are located at each of the three types of mesh points. For example, the potential temperature is located at the node point, the vorticity is located at the triangle point, and the normal and tangent velocities are located at the edge point (see Figure 1c in Zhang et al. (2019)). The node points can be optimized or directly generated by the Centroidal Voronoi Tessellation (CVT) technique (e.g., Du et al. (2003); Ringler et al. (2011)), which ensures that the generating points (node points) are the centroids of the corresponding Voronoi cells (in the limit of the constraint). During the model development process, two grid generators have been developed to generate the required mesh information: one is a serial code that adopts the STRIPACK library (Renka (1997)) to generate the Delaunay triangulations in the iterations for optimizing the node points, and the other is a parallel code based on the MPI-SCVT package (Jacobsen et al. (2013)). In this section, we describe the parallelization methods, including the mesh partition method and some techniques for the inter-process communications.

Mesh partition

The partition of the entire global mesh can be obtained by partitioning the node points. In practice, the METIS library is used to provide a general approach to the partition. METIS is a graph partitioning tool, which uses the input node points, the information of their neighbours and the number of partitioned groups to perform the partition. A node point and one of its neighbours constitute the two vertices of an edge in the graph. By default, the principle of METIS is to minimize the number of edges being cut under the constraint that the number of points assigned to each group is roughly the same (cut-edges refer to the edges whose two vertices belong to different groups). A smaller number of cut-edges implies less communication between groups, and the constraint of a roughly equal number of points in each group ensures a good load balance. Figure 1 illustrates a global mesh partitioned by METIS. In this case, both the quasi-uniform and VR Voronoi cells are partitioned into ten groups. Cells of the same colour fall in one group and will be assigned to the same process. As a result of the partitioning principle, the processes are distributed roughly equally over the quasi-uniform mesh, while more processes are assigned to the refinement regions for the VR mesh.

Because the update of data on a mesh point usually requires information from its adjacent mesh points during the model integration, each process needs data belonging to other processes when updating the data on its boundary mesh points (the mesh points adjacent to mesh points of other processes). To facilitate the calculations, three types of data areas are defined:
(i) Inner area: an area composed of mesh points whose data update does not require data from other processes;
(ii) Boundary area: an area composed of mesh points whose data update requires data from other processes;
(iii) Halo area: an area composed of extended mesh points in other processes for the update of the boundary data of this process.
The number of layers of the halo area can be flexibly configured.
Figure 2 presents an example that uses three halo layers, while in most cases, two layers are required (as a default). The calculation procedure for the mesh partition operates as follows. First, we use METIS to partition the global node points and determine the three types of areas mentioned above based on the partition and the neighbourhood information of the node points. Second, we determine the corresponding partitions of the edge and triangle points based on the partition of the node points. Third, we establish the mappings between the global and the local indices of the node, edge, and triangle points. This completes the mesh partition.
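The partition step itself reduces to a single graph-partitioning call once the node-neighbour graph is available. The following minimal Python sketch illustrates the idea using the pymetis bindings to METIS; the routine names, the tiny ring graph and the helper for marking boundary points are illustrative assumptions and not part of the model's Fortran code, which calls the METIS library directly.

```python
# Illustrative sketch (not GRIST code): partition the node-point graph with METIS
# via the pymetis bindings, then mark boundary points from the cut edges.
import pymetis

def partition_nodes(neighbors, n_groups):
    """neighbors[i] lists the node indices adjacent to node i."""
    # METIS minimizes the number of cut edges while keeping group sizes balanced.
    n_cuts, membership = pymetis.part_graph(n_groups, adjacency=neighbors)
    return membership  # membership[i] = group (process) that owns node i

def boundary_points(neighbors, membership, my_group):
    """Nodes of my_group with at least one neighbour owned by another group."""
    return [i for i, grp in enumerate(membership)
            if grp == my_group
            and any(membership[nb] != my_group for nb in neighbors[i])]

# Example on a tiny ring of six nodes split into two groups:
ring = [[5, 1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 0]]
owner = partition_nodes(ring, 2)
print(owner, boundary_points(ring, owner, my_group=0))
```

Given the ownership list, the inner, boundary and halo areas of each process follow directly from the neighbourhood information, as described above.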

Communication
Communicating with neighbouring processes is required when one process updates its data in the halo area. To facilitate the communications, we initialize three pairs of arrays, 'send_sites_(v/e/t)' and 'recv_sites_(v/e/t)', for data defined on the node (v), edge (e) and triangle (t) points, respectively. These arrays are initialized for each neighbouring process and record the global indices of the data to be sent to this neighbour as well as the data to be received from this neighbour. Then, the global indices are converted to local indices for the ease of data preparation and assignment.
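As an illustration of how such index lists can drive the exchange described in the steps below, the following mpi4py-style Python sketch posts non-blocking sends and receives for one field. The routine and the dictionary layout of 'send_sites'/'recv_sites' are hypothetical and only mimic the arrays named in the text; the model's actual Fortran implementation is not shown.

```python
# Hypothetical halo-exchange sketch (mpi4py), not the GRIST Fortran code.
import numpy as np
from mpi4py import MPI

def exchange_halo(comm, field, send_sites, recv_sites):
    """field: local 1D array of a node-point variable.
    send_sites[p] / recv_sites[p]: local indices to send to / receive from rank p."""
    requests, send_bufs, recv_bufs = [], {}, {}
    for p, idx in send_sites.items():
        send_bufs[p] = np.ascontiguousarray(field[idx])   # data preparation
        requests.append(comm.Isend(send_bufs[p], dest=p))
    for p, idx in recv_sites.items():
        recv_bufs[p] = np.empty(len(idx), dtype=field.dtype)
        requests.append(comm.Irecv(recv_bufs[p], source=p))
    MPI.Request.Waitall(requests)                          # complete all messages
    for p, idx in recv_sites.items():                      # assignment into the halo
        field[idx] = recv_bufs[p]
```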
The inter-process communications are performed in three consecutive steps:
(i) Data preparation. Each process puts the variable data to be sent into temporary sending arrays according to the local indices stored in 'send_sites'.

We report the scaling test results to show the efficiency of the partition method and the communication techniques. All the tests in this paper are carried out on a Sugon HPC platform. Each computation node contains 64 CPU cores with 256 GB of memory. The Sugon Parastor300 parallel file system is used as the storage system. We run 60 MPI processes on each node to ensure enough available memory for the tests. The ideal run time, assuming 100% parallel efficiency, starts from 1 and 1/64 for the G10 grid and the G8 grid, respectively. We may observe that the actual run-time lines are very close to the ideal run-time lines, suggesting that the model scales well. It should be noted that all the actual run times of the G10 grid are shorter than the corresponding ideal run times, that is, super-linear speedup is achieved for the G10 grid. This abnormal phenomenon indicates that there is still room for improving the computational efficiency when running with smaller numbers of processes. For models on unstructured meshes, improving the cache hit rate is an effective way to improve the computational efficiency. We apply the mesh index reordering strategies for this purpose.

As is known, the cache is designed to improve the memory-access efficiency of a CPU. The cache works by improving data reuse, through which accesses to main memory are replaced by accesses to the cache. Because the CPU accesses the cache much faster than the main memory, the computational efficiency can be improved. Under general caching mechanisms, improving data locality is an efficient way to enhance cache reuse. For computing on an unstructured mesh, the stencil calculations are among the most computationally intensive tasks. Performing a stencil calculation for a mesh point requires data on its neighbouring points, which is supported by the indirect addressing scheme. Since the neighbours of a mesh point lie nearby on the two-dimensional (2D) sphere, it is important to find an indexing strategy that assigns nearby memory locations to these 2D spatially nearby mesh points.

We apply three index reordering strategies to optimize the locality of the mesh points: the breadth-first-search (BFS) strategy, the Hilbert curve strategy, and the Morton curve (a.k.a. Z-order curve) strategy. These indexing strategies help to generate a distribution of points that has better locality in memory, leading to a higher cache hit rate and computational efficiency. Before introducing each of them, Figure 4a first shows the mesh index order without reordering. The index order of the node points is completely chaotic, as the node points are generated by the recursive bisection of the icosahedral grid with small modifications.
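To make concrete why the index order matters for the indirectly addressed stencil calculations, consider the toy Python example below (the arrays and the averaging operator are hypothetical, not GRIST data structures). If the entries of neighbors[i] are numerically close to i, the values of field they touch are likely to share cache lines, which is precisely what the reordering strategies aim for.

```python
# Illustrative only: an indirectly addressed stencil on an unstructured mesh.
import numpy as np

def neighbour_average(field, neighbors):
    """Average of each cell's neighbours; neighbour values are looked up
    indirectly, so the memory locations touched depend on the index order."""
    out = np.empty_like(field)
    for i, nbs in enumerate(neighbors):
        out[i] = sum(field[j] for j in nbs) / len(nbs)   # indirect addressing
    return out
```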

The BFS indexing strategy
The BFS strategy is a graph search algorithm commonly used to solve the shortest-path problem of unweighted graphs. It can be implemented by the following three steps:
(i) Initialize an empty queue, and select a node point as the first node of the queue;
(ii) Take out the first node of the queue and then add all its child nodes (neighbouring points) into the queue (if a child node is already in the queue or has been in the queue before, it is not added);
(iii) If the queue is empty, the procedure ends; otherwise, go to step (ii).
Since the neighbours of each node point are arranged counter-clockwise in the grid data, the index order of the BFS strategy takes the form shown in Figure 4b.
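A compact Python sketch of this reordering is given below; it is illustrative rather than the model's Fortran implementation, and it assumes that neighbors[i] lists the neighbours of node i counter-clockwise in the original index order.

```python
# Illustrative BFS reordering of the node points.
from collections import deque

def bfs_reorder(neighbors, start=0):
    """Return the new index order produced by breadth-first search."""
    n = len(neighbors)
    visited = [False] * n
    order = []
    queue = deque([start])
    visited[start] = True
    while queue:
        node = queue.popleft()          # step (ii): take out the first node
        order.append(node)
        for nb in neighbors[node]:      # add unvisited neighbours to the queue
            if not visited[nb]:
                visited[nb] = True
                queue.append(nb)
    return order  # order[k] = original index of the node placed at new position k
```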

The Hilbert curve indexing strategy
The Hilbert curve is a kind of fractal curve, which maps 2D or higher-dimensional data into one-dimensional data while preserving the spatial locality well. Because the original Hilbert curve indexing strategy is designed for regularly arranged points, we need to convert the unstructured node points into a regular pattern; that is, 2D integer coordinates need to be determined for each node point. This can be accomplished by establishing an oblique coordinate system, as shown in Figure 5. First, we determine the origin of the system: we choose the first node point with six neighbours (a hexagon point) in the inner area as the origin, whose coordinates are (0, 0). After that, the six neighbours of the origin are sequentially initialized with coordinates shifted by +1 or -1 in the x or y direction; that is, (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1), (1, 0) are assigned as the coordinates of the six neighbours in a counter-clockwise manner. Then, this procedure is repeated for the neighbours' neighbours until all the node points in the inner area are covered. It should be pointed out that since the non-hexagon points cannot be arranged in the same manner as the hexagon points, special treatment is required when encountering them: the coordinates of the neighbours of the non-hexagon points are not initialized and are set to the default (0, 0). Since there are only a few non-hexagon points, this has little impact on the performance.

After the 2D coordinates are initialized, the minimum x and y coordinate values over all the node points are subtracted from the x and y coordinates, respectively, which ensures that all coordinate values are non-negative. Since the number of points in the x and y directions should be 2^n (n is a non-negative integer) for the standard Hilbert curve indexing strategy, we choose the smallest 2^n that can cover the largest x and y coordinate values as the total number of points. Finally, using the x and y coordinate values of each node point, as well as 2^n, as the inputs, the standard xy2d function (cf. https://en.wikipedia.org/wiki/Hilbert_curve) is called to obtain the converted 1D value. Then, the node points are sorted according to the 1D values, which finishes the application of the Hilbert index reordering strategy. Figure 4c shows the Hilbert indexing order in a practical simulation.

The Morton curve indexing strategy

The Morton curve strategy encodes the longitude and latitude of each node point as binary digits through successive bisection of the coordinate ranges and then merges (interleaves) the two binary sequences into a single number. We use the 32 characters (Base32) 0-9 and b-z (removing a, i, l, o) to encode the merged numbers: take five consecutive binary digits of a merged number as a group, which ranges from decimal 0 to 31, and convert it to the corresponding character in Base32. For example, one merged number is converted to "wtw37q". After the encoding, we sort the node points according to the character strings to complete the implementation of the Morton curve indexing strategy. Figure 4d shows the index order of the Morton curve strategy.
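The bisection-and-interleave encoding amounts to a geohash-like key. The Python sketch below is illustrative, not the model's Fortran code: it assumes longitude and latitude in degrees and that the interleaving starts with a longitude bit, following the common geohash convention, which may not match GRIST's exact choice.

```python
# Illustrative Morton (Z-order / geohash-like) key of length L for one point.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # 0-9 and b-z without a, i, l, o

def morton_key(lon, lat, length=5):
    """Bisect lon/lat ranges, interleave the bits, and encode them in Base32."""
    lon_lo, lon_hi = -180.0, 180.0
    lat_lo, lat_hi = -90.0, 90.0
    bits = []
    use_lon = True                          # start with a longitude bit
    while len(bits) < 5 * length:
        if use_lon:
            mid = 0.5 * (lon_lo + lon_hi)
            bits.append(1 if lon >= mid else 0)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:
            mid = 0.5 * (lat_lo + lat_hi)
            bits.append(1 if lat >= mid else 0)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        use_lon = not use_lon
    # Every 5 bits (decimal 0-31) map to one Base32 character.
    return "".join(BASE32[int("".join(map(str, bits[i:i+5])), 2)]
                   for i in range(0, len(bits), 5))

# Sorting the node points by such keys yields the Morton index order, e.g.
# nodes_sorted = sorted(nodes, key=lambda p: morton_key(p.lon, p.lat))
```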
Finally, we provide a remark about the relationship between the mesh resolution and the length of the converted strings. Assume that the length of the converted string is L; then, the total number of binary digits of the longitude and latitude is 5L. If L is even, the number of binary bisections for both the longitude and the latitude is 5L/2; if L is odd, the number of longitude bisections is [5L/2]+1 and the number of latitude bisections is [5L/2]. More specifically, the relationship between L and the resolution is shown in Table 1. Since the target resolution of the densest mesh we currently use is ~3.5 km, setting L=5 is enough to meet our requirements.

We test the index reordering strategies on the G10 grid, the quasi-uniform G8 grid, and the variable-resolution G8 grid (a G8X4 grid, which means the fine-mesh and coarse-mesh resolutions vary roughly by a ratio of 4; the timestep is set to 20 seconds). The speedups of the index reordering strategies relative to the original-ordering case with different numbers of processes and different grids are shown in Figure 6.

For the G10 grid, compared with the unoptimized case, the run times of all the index reordering strategies are reduced, with a speedup ranging from 1.04x to 1.42x. As the number of processes increases, the optimization effect of the index reordering strategies becomes less significant. The reason is that as the number of processes increases, the number of mesh points on each process decreases, implying that the percentage of data that fits into the cache increases; therefore, the effect of cache optimization by the index reordering strategies becomes less obvious.

For the G8 grids, when using the same numbers of processes as for the G10 grid (see the lower left part of Figure 6), the three index reordering strategies can speed up the calculations in some test cases, but with a smaller speedup factor, while for the other test cases, acceleration is relatively hard to achieve. This is because the number of mesh points distributed to each process is much smaller than that of the G10 grid. As we decrease the number of processes, as shown in the lower right part of Figure 6, the speedups of the three index reordering strategies become conspicuous again. When running on 60 processes, a 1.12x speedup and a 1.22x speedup are obtained for the quasi-uniform G8 grid and the variable-resolution G8 grid, respectively. These results suggest that the index reordering strategies can indeed speed up the calculations, especially when running with a small number of processes. Based on the tests of the three indexing strategies, the BFS strategy typically performs best and can be used as the default indexing strategy.

The I/O efficiency is another well-known bottleneck for high-resolution modelling. This issue becomes especially challenging for unstructured-mesh models because of the discontinuous accesses. As shown in Figure 7a, originally, we call the PnetCDF (Li et al. (2003)) interface to perform the I/O operations, and each process directly interacts with the parallel file system. To give a more specific example, we use the data input procedure for illustration. When reading data in parallel, the global indices of the data to be read by each process are discontinuous (that is, the positions of the data to be read in the input file are discontinuous, owing to the use of the unstructured mesh), while the PnetCDF reading interface requires that the data read in each call be located contiguously in the input file. Therefore, the PnetCDF reading interface has to be called multiple times. To reduce the number of interface calls, we initialize two arrays, 'var_start' and 'var_count', to record the starting positions and lengths of the contiguous data segments to be read by each process, respectively.
That is, 'var_start(i)' is the starting position in the input file for the i-th call to the PnetCDF reading interface, and 'var_count(i)' is the length of the data for the i-th call. The sizes of these two arrays equal the number of times that the PnetCDF reading interface is called. With these two arrays, we call the PnetCDF nonblocking reading interface 'nfmpi_iget_vara' multiple times to read the data. It is worth noting that the data are not actually imported when calling 'nfmpi_iget_vara'; only the reading requests are recorded. The reading is actually carried out at the wait interface 'nfmpi_wait_all'.
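The construction of these two arrays is essentially a run-length encoding of the sorted global indices owned by a process. The Python sketch below is illustrative only; the model builds the equivalent Fortran arrays during the mesh partition, and the PnetCDF calls are indicated only in a comment.

```python
# Illustrative sketch: merge discontinuous global indices into contiguous
# (start, count) runs so that fewer PnetCDF read requests are needed.
def build_runs(global_indices):
    """global_indices: sorted 0-based global positions owned by this process."""
    var_start, var_count = [], []
    for idx in global_indices:
        if var_start and idx == var_start[-1] + var_count[-1]:
            var_count[-1] += 1          # extend the current contiguous run
        else:
            var_start.append(idx)       # start a new run
            var_count.append(1)
    return var_start, var_count

starts, counts = build_runs([3, 4, 5, 9, 10, 42])
print(starts, counts)                   # [3, 9, 42] [3, 2, 1]
# For each run, one nonblocking request would then be posted (nfmpi_iget_vara in
# the Fortran code), followed by a single nfmpi_wait_all that executes them all.
```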

The 'var_start' and 'var_count' arrays are initialized in the mesh partition procedure, and knowledge of the implementation details is not required for scientific model developers. After that, these two arrays can be used as the inputs to call the 'wrap_read_par' function to read the grid data or the variable data. The data output follows the same approach as the data input, except for one special treatment: the edge and triangle points are partitioned following the partition of the node points, while each edge or triangle has two or three node points; thus, each edge or triangle point may belong to two or three processes. To avoid conflicts during the data output, we choose the process with the smallest rank to perform the output of the data defined on the edge or triangle points that belong to more than one process. The users also do not have to know the details of initializing the 'var_start' and 'var_count' arrays for the data output. In addition, similar to the inter-process communications, we have designed a linked list to collect the variables that need to be output. An interface called 'wrap_add_field' can be used to add variables to the list. When the collection is finished, an interface called 'wrap_output' is used to write all the collected model variables in the list to the parallel file system.

Although the method mentioned above can combine multiple reading requests, PnetCDF shows a significant performance degradation when the number of processes scales to several hundreds or thousands. Therefore, we consider improving the I/O efficiency further with a group I/O method. The first step in applying the group I/O method is to determine the I/O processes. We use a user-specified parameter 'group_size' to determine the size of the process groups, i.e., how many processes are in one group. Then, the processes with …

For the data input, more than 90x speedup is observed when the total number of processes is 4200. For the G8 and G8X4 grids, the best number of I/O processes is between 30 and 70, and more than 122x speedup and 108x speedup can be achieved for the quasi-uniform G8 grid and the G8X4 grid, respectively, when the total number of processes reaches 4200.
For the data output, the group I/O can reduce the writing time for both the G8 and G8X4 grids with almost all the 'group_size' values larger than 1, while it is only effective for the G10 grid with part of the 'group_size' values. For the G10 grid, the best number of I/O processes is between 120 and 200, and more than a 3x speedup can be achieved for all the process numbers. The reason why more speedup is achieved for the data input than for the data output may be that the second dimensions of the input data (smaller than 7, mainly grid data currently) are much smaller than those of the output data (the number of vertical layers of the variables, 30 in this study). This means that the input data are 'more discontinuous' than the output data, so the optimization effect of the group I/O method is more significant for data input than for data output. For the G8 and G8X4 grids, the best number of I/O processes is between 50 and 80, and more than 80x speedup and 84x speedup can be obtained for the quasi-uniform G8 grid and the G8X4 grid, respectively, when the total number of processes reaches 4200. These results confirm the efficiency of the group I/O method.

The three aspects of the parallel computing toolkits mentioned above are encapsulated in only a few interfaces that can be used by scientific model developers.