Advanced parallel implementation of the coupled ocean-ice model FEMAO with load balancing

In this paper, we present a parallel version of the finite-element model of the Arctic Ocean (FEMAO) configured for the White Sea and based on MPI technology. The model consists of two main parts: an ocean dynamics model and a surface ice dynamics model. These parts differ greatly in the amount of computations, because the complexity of the ocean component depends on the bottom depth, while that of the sea-ice component does not. As a first step, we locate both submodels on the same CPU cores with a common horizontal partition of the computational domain. The model domain is divided into small blocks, which are distributed over the CPU cores using Hilbert-curve balancing. The partition of the model domain is static (i.e., computed during the initialization stage). There are three baseline options: a single block per core, balancing of 2D computations, and balancing of 3D computations. After showing parallel acceleration for particular ocean and ice procedures, we construct a common partition that minimizes the joint imbalance of both submodels. Our novelty is the use of arrays shared by all blocks that belong to a CPU core, instead of allocating separate arrays for each block, as is usually done. Computations on a CPU core are restricted by the masks of not-land grid nodes and block-core correspondence. This approach allows us to implement parallel computations in the model as simply as with the usual decomposition into squares, while gaining the advantages of load balancing. We demonstrate parallel acceleration for up to 996 cores for the model with resolution 500 × 500 × 39 in the ocean component and 43 sea-ice scalars, and we analyse in detail the influence of different partitions on the model runtime.


Introduction
The increasing performance and availability of multiprocessor computing devices make it possible to simulate complex natural systems with high resolution, while taking into account important phenomena and coupling comprehensive models of various subsystems. In particular, a more precise, accurate, and complete numerical description of processes in seas and oceans has become possible. There are now models of seas that can simulate currents, the dynamics of thermohaline fields, sea ice, pelagic ecology, benthic processes, and so on; see, for example, the review by Fox-Kemper et al. (2019). The finite-element model of the Arctic Ocean (FEMAO) (Iakovlev, 1996, 2012) has been developed since the 1990s, and it has been adjusted to the White Sea (Chernov, 2013; Chernov et al., 2018). The model domain is a part of the cylinder over the sphere (i.e., the Cartesian product of a region on the Earth's surface and a vertical segment). The coordinates are orthogonal, with the axes directed to the East, to the South, and downwards. The horizontal grid is structured and rectangular, because the finite elements are defined on triangles composing rectangles, see Iakovlev (1996). Points that correspond to the land are excluded from the computations using a mask of "wet" points. The z-coordinate is used as the vertical axis (i.e., not a sigma coordinate or the like). Therefore, for each 2D grid node there is a number of actually used vertical layers. Where the depth varies significantly, this "integer depth" also varies, see figure 1. In contrast, sea-ice and sea-surface computations are depth-independent.
The presence of both 2D and 3D calculations complicates the balancing of the full model. The original code was written in Fortran-90/95 and did not allow parallel computation. Our goal is to develop a parallel version of the model based on MPI technology without significant changes in the program code (i.e., preserving the loop structure and the mask of wet points, while benefiting from load balancing). Consequently, we developed a library that performs a partition of the 2D computational domain and organizes communication between the CPU cores. In numerical ocean models, the baseline strategy is to decompose the domain into squares (Madec et al., 2015) or into small blocks, with subsequent distribution over the processor cores (Dennis, 2003, 2007; Chaplygin et al., 2019). Both approaches preserve the original structure of the loops and utilize direct referencing of neighbouring grid nodes on rectangular grids. Decomposition into small blocks is more attractive from the viewpoint of load balancing, especially for z-coordinate models. Blocks can be distributed using the METIS software (Karypis, 1998) or simpler algorithms, such as Hilbert curves (Dennis, 2007). We give preference to partitioning into blocks, which are distributed using Hilbert curves, to keep the code library-independent. Note that some modern ocean models benefit from using unstructured meshes, where there is no need for a mask of wet points; see, for example, Koldunov et al. (2019). In addition, some ocean models omit the masking of wet points entirely, see Madec et al. (2015). This increases the number of computations but reduces control-flow interruptions, which gives rise to better automatic vectorization of loops. In the following sections, we describe our parallel implementation, which relies on the mask of wet points to balance the computations, and we outline its peculiar properties.

The White Sea
The White Sea is a relatively small (about 500 km × 500 km) and shallow (the mean depth is 67 m and the maximal depth is not more than 340 m) semi-closed sea in the Arctic Ocean basin, located in the North-Western part of Russia and included in its territorial waters. Its area is 90 000 km². The White Sea plays an important role in the economy of the neighbouring regions (Filatov et al., 2007).

The White Sea consists of several parts, including four bays and a narrow shallow strait called Gorlo that separates one part of the sea from the other. The coastline of the sea is quite complex, so the rectangle (almost a square) of the Earth's surface that contains the sea is only about one-third water.
The White Sea is a convenient model region for testing the numerical algorithms, software, and mathematical models that are intended for the Arctic Ocean. First, the small spatial step and relatively high maximum velocities demand, due to the Courant stability condition, a rather small temporal step. This makes it difficult to develop efficient algorithms and stable numerical schemes and to ensure performance on the available computers. Second, the model is less dependent on the initial distributions, which makes test simulations easier, because boundary data need to be set at only one liquid boundary. Finally, the White Sea's relatively small inertia enables quite short simulations that are able to reveal any problems and demonstrate any important features.

The model and the software

A time step of the FEMAO model consists of several procedures, see algorithm 1. The model uses the physical-process splitting approach, so that the geophysical fields are changed by each procedure that simulates one of the geophysical processes.
Algorithm 1 Time step algorithm for FEMAO
1: Forcing (i.e., preparation of river runoff, atmospheric data, shortwave radiation, boundary values, etc.);
2: Dynamics of the sea ice, including melting and freezing, interaction of sea-ice floes, and evaluation of the two-dimensional ice-drift velocity;
3: Sea-ice advection by this drift velocity;
4: Advection of 3D scalars, such as temperature and salinity;
5: Vertical diffusion of the scalars with sources due to heating, ice melting/freezing, and so on;
6: Dynamics of the 3D horizontal current velocity;
7: Solving the SLAE for the sea level;
8: Evaluating the vertical velocity.
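The splitting approach amounts to a driver loop of the following form. This is a minimal sketch: except for iceadvect and advect3D, which are the procedure names used later in the text, the subroutine names are hypothetical.

   ! Sketch of the physical-process splitting time loop (algorithm 1)
   do istep = 1, nsteps
      call prepare_forcing()      ! 1: river runoff, atmospheric data, radiation
      call ice_dynamics()         ! 2: melting/freezing, floes, drift velocity
      call iceadvect()            ! 3: sea-ice advection by the drift velocity
      call advect3D()             ! 4: advection of temperature, salinity, ...
      call vertical_diffusion()   ! 5: vertical diffusion of scalars with sources
      call momentum_dynamics()    ! 6: dynamics of the 3D horizontal velocity
      call solve_sea_level()      ! 7: SLAE for the sea level
      call vertical_velocity()    ! 8: diagnostic vertical velocity
   end do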
The matrix of the system of linear algebraic equations (SLAE) is sparse and contains 19 non-zero diagonals that correspond to adjacent mesh nodes within a finite element. The matrix does not vary in time, and it is precomputed before the time-step loop. The most time-consuming steps of the sequential code version were the 3D advection of scalars, the 2D advection of sea-ice fields, and solving the SLAE for the sea level.
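A 19-diagonal matrix of this kind is conveniently kept in diagonal storage. The following sketch of the matrix-vector product assumes storage conventions of our own (a linear numbering of wet points and the arrays val and off), not necessarily those of the model.

   ! y = A*x for a 19-diagonal sparse matrix stored as val(1:nwet, 1:19)
   ! with column offsets off(1:19) in the linear wet-point numbering
   subroutine matvec19(nwet, val, off, x, y)
     implicit none
     integer, intent(in)  :: nwet, off(19)
     real(8), intent(in)  :: val(nwet, 19), x(nwet)
     real(8), intent(out) :: y(nwet)
     integer :: n, d, m
     y = 0d0
     do d = 1, 19
        do n = 1, nwet
           m = n + off(d)               ! column hit by diagonal d in row n
           if (m >= 1 .and. m <= nwet) y(n) = y(n) + val(n, d) * x(m)
        end do
     end do
   end subroutine matvec19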
Sea ice is considered to be an ensemble of multiple floes with some thickness distribution. This distribution is approximated by a discrete one with 15 fixed thickness values (gradations), including zero thickness (open water). Sea ice is described by the distribution of its compactness (concentration) for each gradation and by the ice volume for each gradation (excluding open water). In addition, the snow-on-ice volume for each gradation is evaluated. Therefore, there are 43 2D sea-ice scalars: ice and snow volumes for 14 gradations each and sea-ice compactness for 15 gradations. Since there are 39 vertical layers in the ocean component, the full set of sea-ice data is comparable in size to a single 3D scalar.
The tested version of the model has a spatial resolution of 0.036° E, 0.011° N, which is between 1.0 and 1.3 km along parallels and 1.2 km along a meridian. The number of 2D grid nodes is 500 × 500, and only 33% of them (84542) are "wet".

The time step is 100 s. The vertical step is 5 m down to the depth of 150 m and then 10 m down to 300 m. In fact, in the bathymetry data (ETOPO; Amante and Eakins, 2009), the deepest point of the sea is shallower than it really is, which reduces the actual maximum depth to 240 m.

Organization of the calculations
Computations in the ocean and sea-ice components are performed using three-dimensional arrays, such as a(i, j, k) or b(i, j, m), where i, j are the horizontal grid indices, k is the depth layer, and m is the ice gradation. The differential operators are local: only neighbouring grid nodes, that is, a(i ± 1, j ± 1, k), are used.
Typical differential operators in the ocean component are organized as shown in algorithm 2, where $N_x = 500$, $N_y = 500$, and $K(i, j)$ is the number of depth layers. For land points, $K(i, j) = 0$; for the remaining "wet" points, $K(i, j) \in [3, 39]$ with an approximate mean value of 12, see figure 1.
Algorithm 2 Typical 3D calculation loop
1: for j = 1, Ny do
2:   for i = 1, Nx do
3:     if mask(i, j) then
4:       for k = 1, K(i, j) do
5:         a(i, j, k) = ...

Note that arrays in Fortran are arranged in column-major order, so the first index i is linear in memory. The presented arrangement of indices is common for ocean models; see, for example, NEMO (Madec et al., 2015). The loop arrangement is retained from the original code. Although another arrangement might be more efficient, this does not affect the parallelization approach described later on. Although the inner loop does not have stride-1 access, we can speculate that it allows possible automatic vectorization over the innermost (k or m) index and corresponds to minimal control-flow interruptions due to false mask(i, j) values.

Figure 2. Partitions of the model domain into 128 × 128 blocks distributed over CPU cores; colours denote cores and can repeat. The black rectangle corresponds to a "shared" array allocated for the blocks belonging to a given CPU core.

Modifications of the non-parallel code
The domain of $N_x \times N_y$ points is divided into blocks of size approximately $n_b \times n_b$, where $(i_b, j_b)$ are the horizontal indices of the blocks; when $N_x$ or $N_y$ is not divisible by $n_b$, the remaining points are distributed over the first blocks. The set of point indices corresponding to a block is denoted by $\Omega(i_b, j_b)$.

To formulate a balancing problem, we must assign a weight of computational work to each block and then distribute the blocks among the $N_p$ available CPU cores in such a way that all cores get the same amount of work, or as close to this as possible, provided that the "quality" of the partition is kept. Connectivity of subdomains or a minimum length of the boundary can be chosen as possible criteria for the quality of a partition. The weight of a block is the sum of the weights of the wet grid points in the range $\Omega(i_b, j_b)$. The following weights are chosen for 2D and 3D computations, respectively:

$w_{2d}(i, j) = 1, \quad (1)$

$w_{3d}(i, j) = K(i, j)\,/\,\mathrm{mean}(K), \quad (2)$

where the "mean" operation is applied over wet points.
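A minimal sketch of how such block weights can be accumulated from the wet mask and the integer depth; the routine and array names are ours, and the assignment of remainder points to the first blocks is simplified here to uniform block ranges.

   ! Accumulate per-block weights (1)-(2) from the wet mask and K(i,j)
   subroutine block_weights(Nx, Ny, nb, mask, K, w2d, w3d)
     implicit none
     integer, intent(in)  :: Nx, Ny, nb      ! grid size; blocks per side
     logical, intent(in)  :: mask(Nx, Ny)    ! wet points
     integer, intent(in)  :: K(Nx, Ny)       ! integer depth
     real(8), intent(out) :: w2d(nb, nb), w3d(nb, nb)
     integer :: i, j, ib, jb
     real(8) :: meanK
     meanK = real(sum(K, mask=mask), 8) / max(1, count(mask))
     w2d = 0d0; w3d = 0d0
     do j = 1, Ny
        do i = 1, Nx
           if (mask(i, j)) then
              ib = min(nb, (i - 1) * nb / Nx + 1)   ! block indices of (i,j)
              jb = min(nb, (j - 1) * nb / Ny + 1)
              w2d(ib, jb) = w2d(ib, jb) + 1d0
              w3d(ib, jb) = w3d(ib, jb) + K(i, j) / meanK
           end if
        end do
     end do
   end subroutine block_weights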

Trivial 1block partition
For a fixed $n_b$, one can find the number of "wet" blocks (i.e., blocks with at least one not-land point). In this partition, the number of cores $N_p$ is equal to the number of wet blocks, and each CPU core gets exactly one block, see figure 2. By varying $n_b$, the possible values of $N_p$ can be found.
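Given the block weights from the previous sketch, the admissible values of $N_p$ follow directly (again with assumed names):

   ! Np for the 1block partition: blocks containing at least one wet point
   integer function count_wet_blocks(nb, w2d) result(np)
     implicit none
     integer, intent(in) :: nb
     real(8), intent(in) :: w2d(nb, nb)
     np = count(w2d > 0d0)
   end function count_wet_blocks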

Hilbert curve partition
For $n_b$ being a power of 2, a Hilbert curve connecting all the blocks can be constructed (Bader, 2012). This gives a one-dimensional sequence of weights, which is balanced using the simplest algorithm: the curve is cut into $N_p$ contiguous chunks of near-equal weight. The sum of the weights of the blocks on core $p$ is denoted by $W_p$, and the load imbalance (LI) of a partition is defined as

$\mathrm{LI} = \left(\frac{\max_p W_p}{\mathrm{mean}_p W_p} - 1\right) \cdot 100\%. \quad (3)$

Although the Hilbert curve possesses the locality property (i.e., close indices on the curve correspond to close blocks in space), the resulting subdomains may be disconnected or unbalanced, so the partition is improved iteratively by two functions; a sketch of the curve index and the one-dimensional balancing is given after this paragraph. Function remove_not_connected_subdomains() finds the connected subdomain with the maximum work for each CPU core and sends the other blocks to neighbouring cores. Function balance_all_ranks() tries to send the bordering blocks of each core to neighbouring cores so as to minimize the maximum work $\max(W_p, W_q)$, where $W_p$ and $W_q$ are the work on the current CPU core and on a neighbouring core, respectively. The number of iterations is user-defined; we choose $N_{iter} = 15$, which is usually enough to reach convergence. Note that the optimization is not guaranteed to find a global optimum. The need for partitioning into connected subdomains comes from the intention to increase the percentage of wet points on CPU cores, given the data structure used; see the following section for the definition of the "shared" array. After optimization, the load imbalance is not more than 10% in most cases. This is acceptable accuracy, because partitioning itself is not the main objective of this article.
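A minimal sketch of the two basic ingredients, under our naming: the standard Hilbert-curve index (cf. Bader, 2012) and the simplest one-dimensional balancing of the curve-ordered weights. The iterative repair functions described above are omitted.

   ! Map block coordinates (x0, y0) in [0, n-1]^2 to a position on the
   ! Hilbert curve; n is the number of blocks per side (a power of two)
   integer function hilbert_index(n, x0, y0) result(d)
     implicit none
     integer, intent(in) :: n, x0, y0
     integer :: x, y, rx, ry, s, t
     x = x0; y = y0; d = 0
     s = n / 2
     do while (s > 0)
        rx = merge(1, 0, iand(x, s) > 0)
        ry = merge(1, 0, iand(y, s) > 0)
        d = d + s * s * ieor(3 * rx, ry)
        if (ry == 0) then              ! rotate quadrant to keep the curve continuous
           if (rx == 1) then
              x = n - 1 - x
              y = n - 1 - y
           end if
           t = x; x = y; y = t
        end if
        s = s / 2
     end do
   end function hilbert_index

   ! Cut nbl curve-ordered block weights w(1:nbl) into Np contiguous
   ! chunks of near-equal weight; owner(b) is the core assigned to block b
   subroutine balance_1d(nbl, np, w, owner)
     implicit none
     integer, intent(in)  :: nbl, np
     real(8), intent(in)  :: w(nbl)
     integer, intent(out) :: owner(nbl)
     real(8) :: ideal, acc
     integer :: b, p
     ideal = sum(w) / np
     acc = 0d0; p = 1
     do b = 1, nbl
        owner(b) = p
        acc = acc + w(b)
        if (acc >= ideal * p .and. p < np) p = p + 1
     end do
   end subroutine balance_1d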
Let us introduce two baseline partitions: hilbert2d (with weights $w_{2d}$) and hilbert3d (with weights $w_{3d}$), see figure 2. As one can see, hilbert2d divides the computational domain into quasi-uniform subdomains, while hilbert3d places many CPU cores in high-depth regions and few cores in shallow water. The minimum and maximum numbers of blocks on a core can be found in tables 2 and 3. When one of these partitions is applied to the whole coupled ocean-ice model, it balances one submodel and unbalances the other. Table 2 shows that balancing of 2D computations ("LI 2D" → min) leads to imbalance in 3D computations ("LI 3D" ≈ 200%), and table 3 shows the opposite behaviour, with "LI 2D" ≈ 300%. These values are close to the estimates given in appendix A and are defined by the ratios between the minimum, maximum, and mean integer depth. The presented LI values imply a slowdown of one of the submodels by three to four times, because LI increases the runtime $T$ compared to the optimal one $T_{opt}$ in the following way:

$T = T_{opt} \cdot (1 + \mathrm{LI}/100\%). \quad (4)$

A compromise for both submodels can be found by considering a combination of weights:

$w_{2d3d} = w_{2d} + \gamma_0\, w_{3d}, \quad (5)$

where $\gamma_0 \approx 3$ is the ratio of the run times of the ocean and ice submodels on one CPU core. A partition of this type is denoted by hilbert2d3d. While this weight is optimal for "overlapping" computations of two code sections with different complexity, it is also the optimal weight for "non-overlapping" code sections (i.e., sections separated by blocking MPI exchanges). We show this in appendix B, with the corresponding estimates of LI for 2D (130%) and 3D (34%) computations.

Data structure and MPI exchanges
After partitioning has been performed, we get a set of blocks $I_p$ for each CPU core $p$. Instead of allocating separate arrays for each block, we allocate one "shared" array per model field for all blocks belonging to a CPU core, with the following range of indices (excluding halo):

$i^p_s = \min(i_s(I_p)), \quad i^p_e = \max(i_e(I_p)), \quad j^p_s = \min(j_s(I_p)), \quad j^p_e = \max(j_e(I_p)),$

$a(i^p_s : i^p_e,\; j^p_s : j^p_e,\; :),$

where $i_s, i_e, j_s, j_e$ denote the index bounds of an individual block. The shared-array extent is shown by the rectangle in figure 2 for a particular CPU core. We introduce a mask of grid points belonging to a CPU core, $mask_p(i, j)$. The correspondence between the blocks, the shared array, and the mask is clarified in figure 3. Introducing this mask does not increase the complexity of the algorithms, because the mask of wet points already exists in the original code and is simply modified. Finally, only minor modifications of the original loops are necessary, as sketched below.
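Only the loop bounds and the mask change relative to algorithm 2. In the following minimal sketch, the stencil body is an arbitrary illustration, a halo of width 1 is assumed to be filled by the exchange library, and the names are ours.

   ! Algorithm 2 restricted to the "shared" array of core p
   subroutine smooth(a, b, mask_p, K, is_p, ie_p, js_p, je_p, nz)
     implicit none
     integer, intent(in)  :: is_p, ie_p, js_p, je_p, nz
     logical, intent(in)  :: mask_p(is_p-1:ie_p+1, js_p-1:je_p+1)
     integer, intent(in)  :: K(is_p-1:ie_p+1, js_p-1:je_p+1)
     real(8), intent(in)  :: b(is_p-1:ie_p+1, js_p-1:je_p+1, nz)
     real(8), intent(out) :: a(is_p-1:ie_p+1, js_p-1:je_p+1, nz)
     integer :: i, j, k
     do j = js_p, je_p
        do i = is_p, ie_p
           if (mask_p(i, j)) then        ! wet point owned by core p
              do k = 1, K(i, j)          ! only the actually used layers
                 a(i, j, k) = 0.25d0 * (b(i+1, j, k) + b(i-1, j, k) &
                                      + b(i, j+1, k) + b(i, j-1, k))
              end do
           end if
        end do
     end do
   end subroutine smooth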

Usually (see Dennis, 2003, 2007), arrays are allocated for each block separately. This has the following advantages:
- More efficient cache usage;
- If the number of blocks is large enough to get proper balancing, there is no need for $mask_p(i, j)$, which improves vectorization and so on.
It also introduces some drawbacks:
- Overheads for copying block boundaries (small blocks like 4 × 4 are prohibited);
- Many modifications of the original code are necessary, especially in service routines, I/O, and so on.

Consequently, the main strength of our approach is the ability to incorporate balancing while keeping the original program code as simple as for the trivial 1block partition. We expect that a "shared" array may be suboptimal for near-land CPU cores with ∼20% of wet points, because of inefficient cache usage. An example of such a core is shown by the rectangle in figure 2. An experimental study of the dependence of runtime on the percentage of wet points is carried out below.

Table 1. The model with 1block partition; "$n_b \times n_b$" is the grid of blocks; "LI 2D" and "LI 3D" are the load imbalances (3) for weights $w_{2d}$ and $w_{3d}$, respectively; "LI iceadvect" and "LI advect3D" are LIs computed from the runtime of the corresponding functions without exchanges; "days / 24 hours" is the number of computed days per one astronomical day.

The borders of blocks neighbouring other CPU cores are sent using MPI. The following optimizations are applied to reduce the exchange time:

- All blocks' boundaries adjacent to a given CPU core are copied to a single buffer array, which is sent in one MPI_Send call (see the sketch after this list).
- Where possible, the diagonal halo exchange is included in the cross exchanges with extra width.
- There is an option to send the borders of two or more model fields in one MPI_Send call. These three optimizations reduce the latency cost when many cores are used.
- Borders in the ocean component are sent only up to the depth K(i, j) (i.e., only the actually used layers are transmitted). This reduces bandwidth limitations.
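A minimal sketch of the first optimization in the list above (the buffer layout and all names are ours): the boundaries of all blocks adjacent to a given neighbour are packed into one buffer, so the message latency is paid once per neighbour rather than once per block.

   ! Pack the boundaries of all blocks adjacent to neighbour q and send
   ! them in a single message
   subroutine pack_and_send(q, nblk, blen, bstart, nbord, border, comm)
     use mpi
     implicit none
     integer, intent(in) :: q, nblk, nbord, comm
     integer, intent(in) :: blen(nblk), bstart(nblk)  ! segment lengths/offsets
     real(8), intent(in) :: border(nbord)             ! per-block boundary values
     real(8), allocatable :: buf(:)
     integer :: b, pos, ierr
     allocate(buf(sum(blen)))
     pos = 0
     do b = 1, nblk                                   ! gather segments into buf
        buf(pos+1 : pos+blen(b)) = border(bstart(b) : bstart(b)+blen(b)-1)
        pos = pos + blen(b)
     end do
     call MPI_Send(buf, pos, MPI_DOUBLE_PRECISION, q, 0, comm, ierr)
     deallocate(buf)
   end subroutine pack_and_send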

Parallel solver of the SLAE
As we have already mentioned, the time-implicit equation for the free surface is reduced to a SLAE with a sparse 19-diagonal matrix. It is solved by a parallel implementation of the BiCGStab algorithm preconditioned by block-ILU(0) with overlapping blocks, see Saad (2003). The ILU(0) preconditioner preserves the 19-diagonal matrix structure; the matrix blocks are defined for each CPU core and correspond to its wet points plus a band of border points of width 2. Because the blocks are defined by the partition, the convergence rate depends on the number of CPU cores. Nevertheless, we have found that, in the range from 1 to 996 CPU cores, it is sufficient to perform 6 to 10 iterations to reach the relative residual $\|Ax - b\| / \|b\| \le 10^{-6}$.
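For reference, a compact sketch of the textbook preconditioned BiCGStab iteration (Saad, 2003), reusing the matvec19 sketch given earlier. The block-ILU(0) solve is replaced by the identity here for brevity, and in the parallel code every dot product additionally requires an MPI_Allreduce.

   subroutine bicgstab19(nwet, val, off, b, x, tol, maxit)
     implicit none
     integer, intent(in)    :: nwet, off(19), maxit
     real(8), intent(in)    :: val(nwet, 19), b(nwet), tol
     real(8), intent(inout) :: x(nwet)
     real(8), dimension(nwet) :: r, rhat, p, v, s, t, phat, shat
     real(8) :: rho, rho_old, alpha, omega, beta, bnorm
     integer :: it
     call matvec19(nwet, val, off, x, r)
     r = b - r                            ! initial residual
     rhat = r
     rho_old = 1d0; alpha = 1d0; omega = 1d0
     p = 0d0; v = 0d0
     bnorm = sqrt(dot_product(b, b))
     do it = 1, maxit
        rho  = dot_product(rhat, r)
        beta = (rho / rho_old) * (alpha / omega)
        p    = r + beta * (p - omega * v)
        phat = p                          ! preconditioner: phat = M^{-1} p
        call matvec19(nwet, val, off, phat, v)
        alpha = rho / dot_product(rhat, v)
        s    = r - alpha * v
        shat = s                          ! preconditioner: shat = M^{-1} s
        call matvec19(nwet, val, off, shat, t)
        omega = dot_product(t, s) / dot_product(t, t)
        x = x + alpha * phat + omega * shat
        r = s - omega * t
        if (sqrt(dot_product(r, r)) <= tol * bnorm) return
        rho_old = rho
     end do
   end subroutine bicgstab19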
Numerical experiments

The number of CPU cores used in our experiments is determined by the 1block partition method, which is highly restricted in the allowable number of cores. We first show how the most time-consuming functions of the ocean and ice submodels accelerate for three partitions: 1block, hilbert2d and hilbert3d (see figure 2). We then study the overall performance of the model using four partitions, including hilbert2d3d with the combined weights (5).

The maximum grid of blocks for our model is $n_b \times n_b = 128 \times 128$, because the MPI exchange width is limited by the block size, while the SLAE solver requires exchanges of width 2. Note that, owing to the data structure we use, the performance of Hilbert-type partitions is almost insensitive to $n_b$ at a moderate number of CPU cores. Nevertheless, $n_b$ may be tuned by hand to decrease the complexity of the partition optimization procedure or to increase the percentage of wet points on a core. In runs with many CPU cores, we use the maximum available number of blocks to get better balancing. The parameters used in the experiments are given in tables 1, 2, 3, 4.

Speedup of scalar and ice advection
Advection of scalars (advect3D, depth-dependent) and of ice (iceadvect, depth-independent) are the most time-consuming procedures in the ocean and ice submodels, respectively. In the following, we show that the hilbert3d partition is appropriate for the former, while hilbert2d and 1block are appropriate for the latter. In total, as we expected, the hilbert3d partition is suitable for the advect3D function, and its acceleration is two to three times greater than with the 1block/hilbert2d partitions. Conversely, the hilbert2d/1block partitions give a two to four times faster iceadvect function compared to the hilbert3d partition. The different accelerations are strongly connected to the balancing of computations. To check this, compare the partition-based ("LI 3D" and "LI 2D") and runtime-based ("LI advect3D" and "LI iceadvect", correspondingly) load imbalances for the 3D and 2D computations in tables 1, 2, 3. Note that the theoretical and practical LI are moderately close to each other, which confirms our choice of weights (1), (2) for these functions. Note also that the data structure and the organization of the calculations are appropriate for load balancing.
Further analysis reveals that the runtime-based LI can be 4-25% higher than the partition-based one, see tables 2, 3. This may be connected to overheads introduced by inefficient memory organization. We allocate a shared array for all blocks belonging to a CPU core, and near-land cores may have only 20% of wet points (see tables 2, 3), which can increase the number of cache misses. Figure 5 shows a scatter plot of the percentage of wet points vs. advect3D runtime without exchanges (each point corresponds to a CPU core). One can clearly see that, on a moderate number of cores (149), the computations are limited by the core with the smallest percentage of wet points, which has the maximal runtime. However, there is no drastic dependence of runtime on the percentage of wet points. This may mean that the number of operations per array element in this model is large enough, so that the data structure plays only a moderate role. In particular, the data structure does not limit the model efficiency on 993 cores (see figure 5), where the advect3D runtime suffers instead from imperfect balancing: cores with a small number of blocks usually have 100% wet points and a wide range of run times, approximately from 1 to 5 seconds. Figure 6 additionally shows that on 993 cores the balancing is limited for processors where the number of blocks per core is small.
Proper balancing of 3D computations by 2D partitioning implies that the number of blocks per core should span a wide range, and the minimum number should not be close to 1 (see the left-hand panel in figure 6). The stagnation of the balancing procedure is evident from the fact that we have only 84542 surface wet points, which corresponds to patches of size 9 × 9 (with 12 vertical levels) on 993 cores, on average.

Figure 7. Speedup compared to one core for the full model; different partitions are shown in colour (1block, hilbert2d, hilbert3d, hilbert2d3d).

Figure 8. Relative contribution of different code sections to the runtime; "ocean" is all procedures of the ocean submodel, including advect3D; "ice" is the ice submodel, including iceadvect; "processing" is the computation of statistics; "I/O" is input/output with scatter-gather functions; "SLAE" is the matrix inversion and RHS preparation; "other" is simple service procedures.

Speedup of the full model
The coupled ocean-ice model is launched on the same CPU cores for both submodels with the common horizontal partition.
Input/output functions are sequential and utilize gather-scatter operations, which are given by our library of parallel exchanges.

Speedup for the full model compared to one CPU core is given in figure 7. The relative contribution of the code sections to the runtime changes with the partition (see also tables 4, 3 for the load imbalance values). In addition, the "processing" section reduces its runtime, because it contains many not fully optimized service functions that are sensitive to the stretching of the horizontal area covered by a CPU core; such stretching is produced by the hilbert3d partitioner.
The number of simulated years per wall-clock day (SYPD) for the best configuration (hilbert2d3d, 993 cores) is 3463/365 ≈ 9.5, see table 4. A direct comparison with other coupled ocean-ice models cannot be made, because our configuration is uncommon. However, we can rescale the time-step performance of the global models to our configuration (rSYPD) in the following way:

$\mathrm{rSYPD} = \mathrm{SYPD} \cdot \frac{N_{mesh}}{N^{FEMAO}_{mesh}} \cdot \frac{N^{FEMAO}_p}{N_p} \cdot \frac{\Delta t^{FEMAO}}{\Delta t},$

where we take into consideration the different numbers of horizontal mesh wet points ($N_{mesh}$), CPU cores ($N_p$), and time steps ($\Delta t$), but we neglect the different numbers of vertical levels and the differences in the formulation of the ice dynamics. As follows from table 5, rSYPD is of the order of 10 for all of the ocean-ice models that we have considered, including FEMAO. While this characteristic cannot rank the models by their efficiency, we argue that our parallel configuration is comparable to existing parallel ocean-ice models.
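As a consistency check of the rescaling above: applying it to FEMAO itself gives $N_{mesh} = N^{FEMAO}_{mesh}$, $N_p = N^{FEMAO}_p$ and $\Delta t = \Delta t^{FEMAO}$, so all three factors equal one and rSYPD = SYPD ≈ 9.5. For any other model, rSYPD estimates the SYPD it would attain in the FEMAO configuration under ideal (linear) scaling.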

Conclusions
In this paper, we present a relatively simple approach to accelerating the FEMAO ocean-ice model, which is based on a rectangular structured grid, while gaining the advantages of load balancing. The modifications that had to be introduced into the program code are identical to those required by the simplest decomposition into squares. The only demand on a model to be accelerated by this technique is that the computational points be marked by a logical mask. As a first step, we utilize a common partition for the ocean and ice submodels. For a relatively "small" model configuration, 500 × 500 horizontal points, we reach a parallel efficiency of 60% for particular functions (3D scalar advection using the 3D-balancing approach and 2D ice advection using the 2D-balancing approach) and 43% for the full model (using the combined-weight approach) on 993 CPU cores. We show that balancing the 3D computations leads to unbalanced 2D computations, and vice versa. Consequently, further acceleration may be achieved by performing the 2D and 3D computations on distinct groups of CPU cores with different partitions. Nevertheless, the high parallel efficiency of the 3D scalar advection itself is a great advance for future applications of the model, especially for the version with a pelagic ecology submodel (Chernov et al., 2018), where more than 50 3D scalars (biogeochemical concentrations) are added to the thermohaline fields.

Note that while the parallel approach presented here can be implemented in a model in a relatively simple way, the code of the library of parallel exchanges can be rather complex (see Supplements).

Appendix A: Load imbalance estimates

In this appendix we estimate how balancing of 2D computations affects the imbalance of 3D computations, and vice versa. Let $S$ and $V$ be the sets of surface and ocean points, correspondingly; let $S_p$ and $V_p$ be the sets of these points belonging to a CPU core $p$; and let $|\cdot|$ denote the number of points in a set. The number of 3D points can be expressed via the 2D ones:

$|V| = \sum_{\{i,j\}\in S} K(i,j) = |S| \cdot \mathrm{mean}_{\{i,j\}\in S}\, K(i,j).$
When balancing of 2D computations is used (hilbert2d), the surface points are distributed among the processors in roughly equal numbers ($|S_p| = |S|/N_p$). Then, for the 3D computations, the ratio of the maximum work to the mean work among the cores is

$\frac{W_{max}}{W_{mean}} = \frac{\max_p |V_p|}{\mathrm{mean}_p |V_p|} = \frac{\max_p \left(\mathrm{mean}_{\{i,j\}\in S_p} K(i,j)\right)}{\mathrm{mean}_{\{i,j\}\in S}\, K(i,j)} \approx \rho_{max}, \quad (A1)$

and the corresponding load imbalance is $\mathrm{LI} = (\rho_{max} - 1) \cdot 100\%$.

When balancing of 3D computations is used (hilbert3d), the ocean points are distributed among the processors in roughly equal numbers ($|V_p| = |V|/N_p$). Then, for the 2D computations, the ratio of the maximum work to the mean work is

$\frac{W_{max}}{W_{mean}} = \frac{\max_p |S_p|}{\mathrm{mean}_p |S_p|} = \frac{\max_p \left(|V_p| / \mathrm{mean}_{\{i,j\}\in S_p} K(i,j)\right)}{\mathrm{mean}_p |S_p|} = \frac{|V|/|S|}{\min_p \left(\mathrm{mean}_{\{i,j\}\in S_p} K(i,j)\right)} \approx \rho_{min}, \quad (A2)$

and the corresponding load imbalance is $\mathrm{LI} = (\rho_{min} - 1) \cdot 100\%$. For illustration, with the depth statistics of our configuration (the integer depth ranges from 3 to 39 with a mean value of about 12), the extreme estimates are $\rho_{max} \approx 39/12 \approx 3.3$ and $\rho_{min} \approx 12/3 = 4$, consistent with the values "LI 3D" ≈ 200% and "LI 2D" ≈ 300% quoted in the main text.

Appendix B: Finding the optimal weight for non-overlapping 2D and 3D calculations

Let $W$ be the full computational work, distributed between 3D ($W_{3d}$) and 2D ($W_{2d}$) computations with ratio $\gamma_0$: $W = W_{2d} + W_{3d} \sim (1+\gamma_0)W_{2d}$. Our goal is to find the weight function $w(i,j)$ that corresponds to the minimal joint (2D and 3D) load imbalance. We use the notation of the previous appendix and define the "number of computational points corresponding to the weight": $|V^w| = \sum_{\{i,j\}\in S} w(i,j)$.
Assuming equipartition with respect to this weight ($|V^w_p| = |V^w|/N_p$), we can derive the LI for the 2D calculations:

$\frac{W^{2d}_{max}}{W^{2d}_{mean}} = \frac{\max_p |S_p|}{\mathrm{mean}_p |S_p|} = \frac{\max_p \left[1 / \mathrm{mean}_{\{i,j\}\in S_p} w(i,j)\right]}{\mathrm{mean}_p \left[1 / \mathrm{mean}_{\{i,j\}\in S_p} w(i,j)\right]} \approx \rho_{max}(1/w), \quad (B1)$

and for the 3D calculations:

$\frac{W^{3d}_{max}}{W^{3d}_{mean}} = \frac{\max_p |V_p|}{\mathrm{mean}_p |V_p|} = \frac{\max_p \left(|S_p| \cdot \mathrm{mean}_{\{i,j\}\in S_p} K(i,j)\right)}{\mathrm{mean}_p \left(|S_p| \cdot \mathrm{mean}_{\{i,j\}\in S_p} K(i,j)\right)} = \frac{\max_p \left[\mathrm{mean}_{\{i,j\}\in S_p} K(i,j) \,/\, \mathrm{mean}_{\{i,j\}\in S_p} w(i,j)\right]}{\mathrm{mean}_p \left[\mathrm{mean}_{\{i,j\}\in S_p} K(i,j) \,/\, \mathrm{mean}_{\{i,j\}\in S_p} w(i,j)\right]} \approx \rho_{max}(K/w), \quad (B2)$

where $\rho_{max}(f)$ denotes the ratio of the maximum to the mean, over cores, of $\mathrm{mean}_{\{i,j\}\in S_p} f(i,j)$. Finally, assuming that the 2D and 3D computations are non-overlapping (i.e., the maximum work appears under the summation), the "load imbalance" for the full model is

$L(w) = \left(\frac{\rho_{max}(1/w) + \gamma_0\, \rho_{max}(K/w)}{1 + \gamma_0} - 1\right) \cdot 100\%. \quad (B3)$

For the given bathymetry $K(i,j)$, the ratio $\gamma_0 = 3$, and the special type of weight function $w(\gamma) = w_{2d} + \gamma w_{3d}$, $L(w(\gamma))$ can be plotted numerically for different values of $\gamma$, see figure A1. The minimum of this function corresponds to the choice $\gamma = \gamma_0 = 3$, and the LIs for the 2D and 3D computations in this case are 130% and 34%, respectively.

Figure A1. Load imbalance for the full model, $L(w)$, as a function of $w(\gamma) = w_{2d} + \gamma w_{3d}$; $\gamma_0 = 3$.

Acknowledgements. We are grateful to G.S. Goyman for helpful remarks on the final draft of the manuscript and for assistance in code design. Computing resources of the Joint Supercomputer Center of the Russian Academy of Sciences (JSCC RAS) were used.