In this paper, we present a parallel version of the finite-element model of the Arctic Ocean (FEMAO) configured for the White Sea and based on MPI technology. This model consists of two main parts: an ocean dynamics model and a surface ice dynamics model. These parts are very different in terms of the number of computations because the complexity of the ocean part depends on the bottom depth, while that of the sea-ice component does not. In the first step, we decided to locate both submodels on the same CPU cores with a common horizontal partition of the computational domain. The model domain is divided into small blocks, which are distributed over the CPU cores using Hilbert-curve balancing. Partitioning of the model domain is static (i.e., computed during the initialization stage). There are three baseline options: a single block per core, balancing of 2D computations, and balancing of 3D computations. After showing parallel acceleration for particular ocean and ice procedures, we construct the common partition, which minimizes joint imbalance in both submodels. Our novelty is using arrays shared by all blocks that belong to a CPU core instead of allocating separate arrays for each block, as is usually done. Computations on a CPU core are restricted by the masks of non-land grid nodes and block–core correspondence. This approach allows us to implement parallel computations into the model that are as simple as when the usual decomposition to squares is used, though with advances in load balancing. We provide parallel acceleration of up to 996 cores for the model with a resolution of

The increasing performance and availability of multiprocessor computing devices make it possible to simulate complex natural systems with high resolution, while taking into account important phenomena and coupling comprehensive models of various subsystems. In particular, more precise, accurate, and full numerical descriptions of processes in seas and oceans have become possible. There are now models of seas that can simulate currents, dynamics of thermohaline fields, sea ice, pelagic ecology, benthic processes, and so on; see, for example, the review by

The finite-element model of the Arctic Ocean

The number of depth layers in the White Sea model; the vertical grid step is 5 m up to 150

The original code was written in Fortran 90/95, and it did not allow for computation in parallel. Our goal is to develop a parallel version of the model based on MPI technology without the need to make significant changes in the program code (i.e., to preserve loop structure and the mask of wet points but benefit from load balancing). Consequently, we developed a library that performs a partition of the 2D computational domain and organizes communication between the CPU cores.

In numerical ocean models, the baseline strategy is to decompose the domain into squares

In Sects. 2–4 we provide the model configuration and organization of the calculations in the non-parallel code on a structured rectangular grid. In Sect. 5 we describe the parallelization approach, which preserves the original structure of the loops. Domain decomposition is carried out in two steps: first, the model domain is divided into small blocks and then these blocks are distributed between CPU cores. For all blocks belonging to a given core a “shared” array is introduced, and a mask of computational points restricts calculations. Partitions could be of an arbitrary shape, but blocks allow us to reach the following benefits: a simple balancing algorithm (Hilbert curves) can be applied as the number of blocks along a given direction is chosen to be a power of 2, and boundary exchanges can be easily constructed for an arbitrary halo width smaller than the block size. In Sect. 6 we report parallel acceleration on different partitions for particular 2D and 3D subroutines and the whole model.

The White Sea is a relatively small (about

The White Sea consists of several parts, including four bays and a narrow shallow strait called Gorlo that separates one part of the sea from the other. The coastline of the sea is quite complex, which means that the rectangle (almost a square) of the Earth's surface that contains the sea has only about one-third of the water area.

The White Sea is a convenient model region to test the numerical algorithms, software, and mathematical models that are intended to be used for the Arctic Ocean. First, a low spatial step and relatively high maximum velocities demand, due to the Courant stability condition, a rather small temporal step. This makes it difficult to develop efficient algorithms, implement stable numerical schemes, and ensure performance using the available computers. Second, because this model is less dependent on the initial data, it makes the test simulations easier because only the liquid boundary is needed to set the initial and boundary data. Finally, the White Sea's relatively small inertia enables us to check the correctness of the code with rather short simulations, which are able to demonstrate important features of the currents.

A time step in the FEMAO model consists of several procedures; see Algorithm 1. The model uses the physical process splitting approach so that geophysical fields are changed by each procedure that simulates one of the geophysical processes.

The matrix of the system of linear algebraic equations (SLAE) is sparse, and it contains 19 nonzero diagonals that correspond to adjacent mesh nodes within a finite element. The matrix does not vary in time, and it is precomputed before the time step loop. The most time-consuming steps for the sequential code version were 3D advection of scalars, 2D advection of sea-ice fields, and solving the SLAE for the sea level. The simple characteristic Galerkin scheme

The local 1D sea-ice thermodynamics are based on the 0-layer model

The tested version of the model has a spatial resolution of 0.036

The model with the 1block partition:

Computations in the ocean and sea-ice components are performed using three-dimensional arrays, such as

Typical differential operators in the ocean component are organized as shown in Algorithm 2, where

Differential operators in the ice component are shown in Algorithm 3, where

Note that arrays in Fortran are arranged in column major order, so the first index

In this section we describe the partitioning algorithm of the model domain into subdomains, each corresponding to a CPU core, and subsequent modifications of the single-core calculations, which require only minor changes to Algorithms 2 and 3. Grid partition is performed in two steps: the model domain is decomposed into small blocks, and then these blocks are distributed over CPU cores in such a way that computational load imbalance is minimized. We utilize a common grid partition for both sea-ice and ocean submodels and provide theoretical estimates of the load imbalances resulting from the application of different weight functions in the balancing problem. The partition is calculated during the model initialization stage, as our balancing algorithm (Hilbert curves) is computationally inexpensive. Also, we guarantee that the partition is the same each time the model is run if the parameters of the partitioner were not modified.

Computational domain

To formulate a balancing problem, we must assign weights of computational work to each block and then distribute them among

For a fixed

Three types of partition. Different processor cores are separated by a black line. Hilbert partitions are based on a grid of

For

Function

The described algorithm performs partitioning into connected subdomains with load imbalance, which is

We have implemented two baseline partitions: hilbert2d (with weights

Same as Table

Same as Table

A compromise for both submodels can be found by considering a combination of weights:

After partitioning has been performed, we get a set of blocks for each CPU core

Blocks belonging to a CPU core are in color, and borders of the allocated array are indicated by a thick line;

An example of the shared array size is shown by the rectangle in Fig.

Same as Table

Speedup compared to one core for two functions:

Scatter plot: percentage of wet points on a core –

Scatter plot: number of computational points for 3D calculations on a core –

Usually (see

There is more efficient cache usage.

If the number of blocks is large enough to get proper balancing, then there is no need for

There are overheads for copying block boundaries (small blocks like

Many modifications of the original code are necessary, especially in service routines, I/O, and so on.

Borders of blocks neighboring other CPU cores are sent using MPI. The following optimizations are applied to reduce the exchange time.

All block boundaries adjacent to a given CPU core are copied to a single buffer array, which is sent in one

If possible, a diagonal halo exchange is included in cross-exchanges with extra width.

There is an option to send borders of two or more model fields in one

Borders in the sea component are sent up to

As we have already mentioned, the time-implicit equation for the free surface is reduced to an SLAE with a sparse 19-diagonal matrix. This is solved by a parallel implementation of the BICGSTAB algorithm preconditioned by block ILU(0) with overlapping blocks; see

Our experiments were performed on the cluster at the Joint Supercomputer Center of the Russian Academy of Sciences (

The maximum grid size of blocks for our model is

Advections of scalars (

Speedup for the mentioned procedures is given in Fig.

Speedup including MPI exchanges is shown in Fig.

Speedup compared to one core for the full model. Different partitions are shown in color (1block, hilbert2d, hilbert3d, hilbert2d3d).

Relative contribution of different code sections to runtime; ocean – all procedures corresponding to ocean submodel including

Further analysis reveals that the runtime-based LI could be 4 %–25 % more than the partition-based one; see Tables

The coupled ocean–ice model is launched on the same CPU cores for both submodels with a common horizontal partition. Input/output functions are sequential and utilize gather–scatter operations, which are given by our library of parallel exchanges. Speedup for the full model compared to one CPU core is given in Fig.

The relative contribution of different code sections to runtime is given in Fig.

Absolute contribution of different code sections to runtime on 993 CPU cores.

Efficiency of the time step loop for the FEMAO model compared to global ocean–ice models. Rescaled SYPD (rSYPD, Eq.

Absolute values of code section runtimes for 1 model day are shown in Fig.

Simulated years per wall-clock day (SYPD) for the best configuration (hilbert2d3d, 993 cores) are calculated as

In this paper, we present a relatively simple approach to accelerate the FEMAO ocean–ice model based on a rectangular structured grid with advances in load balancing. The modifications that had to be introduced into the program code are identical to those that were required by the simplest decomposition on squares. The only demand on the model to be accelerated by this technique is marking computational points by a logical mask. In the first step, we utilize the common partition for ocean and ice submodels. For a relatively “small” model configuration of

Note that while the parallel approach that we have presented here can be implemented into the model in a relatively simple way, the code of the library of parallel exchanges can be rather complex (see the “Code availability” section).

Let us introduce two functions of bathymetry (defined by integer depth

When balancing of 2D computations is used (hilbert2d), surface points are distributed among processors in roughly equal amounts (

When balancing of 3D computations is used (hilbert3d), ocean points are distributed among processors in roughly equal amounts (

Let

Assuming equipartition with respect to this weight (

For a given bathymetry

Load imbalance for the full model LI

The version of the FEMAO model used to carry out simulations reported here can be accessed from

NI is the developer of the FEMAO model and conceived the research. PP developed the parallel exchange library and implemented it in the most computationally expensive parts of the model. IC completed the implementation and prepared the test configuration with high resolution. PP performed the experiments and wrote the initial draft of the paper. All of the authors contributed to the final draft of the paper.

The authors declare that they have no conflict of interest.

Implementation of the parallel exchange library into the FEMAO model was carried out with the financial support of the Russian Foundation for Basic Research (projects 18-05-60184, 19-35-90023). The development of the parallel exchange library was carried out with the financial support of the Moscow Center for Fundamental and Applied Mathematics (agreement no. 075-15-2019-1624 with the Ministry of Education and Science of the Russian Federation).

We are grateful to Gordey Goyman (INM RAS) for helpful remarks on the final draft of the paper and assistance in code design. Computing resources of the Joint Supercomputer Center of the Russian Academy of Sciences (JSCC RAS) were used.

This research has been supported by the Russian Foundation for Basic Research (grant nos. 18-05-60184 and 19-35-90023) and the Ministry of Education and Science of the Russian Federation (agreement no. 075-15-2019-1624).

This paper was edited by Richard Mills and reviewed by Nikolay V. Koldunov and one anonymous referee.