This paper presents the message passing interface (MPI)-based parallelization of the three-dimensional hydrodynamic model SHYFEM (System of HydrodYnamic Finite Element Modules). The original sequential version of the code was parallelized in order to reduce the execution time of high-resolution configurations using state-of-the-art high-performance computing (HPC) systems. A distributed memory approach based on MPI was used. Optimized numerical libraries were used to partition the unstructured grid (with a focus on load balancing) and to solve the sparse linear system of equations in parallel in the case of semi-to-fully implicit time stepping. The parallel implementation of the model was validated by comparing the outputs with those obtained from the sequential version. The performance assessment demonstrates a good level of scalability with a realistic configuration used as benchmark.

Ocean sciences are significantly supported by numerical modeling, which helps to understand physical phenomena and provide predictions, both in the short term and from a climate perspective. The reliability of ocean prediction is strictly linked to the ability of numerical models to capture the relevant physical processes.

The physical processes taking place in the oceans span a wide range of spatial scales and timescales. Ocean circulation is highly complex: energy introduced at large scales is transferred to smaller scales, resulting in mesoscale and sub-mesoscale structures, or eddies

The coastal scale is also rich in features driven by the interaction between the regional-scale dynamics and the complex morphology typical of shelf areas, tidal flats, estuaries and straits.

In both large-scale and coastal modeling, the spatial resolution is a key factor.

Large-scale applications require a finer horizontal resolution than the Rossby radius (in the order of 100

Ocean circulation models mostly use structured meshes and have a long history of development

In the last few decades, however, the finite-volume

Representing several spatial scales in the same application renders the unstructured grid appealing in simulations aimed at bridging the gap between
the large-scale flow and the coastal dynamics

The computational cost of a numerical simulation depends on the order of accuracy of the numerical scheme and the grid scale

Both large-scale and coastal applications may involve significant computational resources because of the high density of mesh descriptor elements required to resolve dominant physical processes. The computational cost is also determined by upper limits on the time step, making meaningful simulations prohibitive for conventional machines. Access to HPC resources is essential for performing state-of-the-art simulations.

Several successful modeling studies have involved the SHYFEM (System of HydrodYnamic Finite Element
Modules) unstructured grid model

The range of applications of SHYFEM was recently extended in a multi-model study to assess the hazards related to climate scenarios

SHYFEM was also applied to produce seamless three-dimensional hydrodynamic short-term forecasts on a daily basis

We implemented a version of the SHYFEM code that can be executed on parallel architectures, addressing the problems of load balancing (which is strictly related to the grid partitioning), parallel scalability and inter-node communication overhead. Our aim was to make all these applications (process studies at different scales, long-term and climatic implementations, forecasting and relocatable systems) practical, also in the future, as the computational cost constantly increases with the complexity of the simulations. We adopted a distributed memory approach, with two key advantages: (i) a reduction in runtime, with the upper limit determined by the user's choice of resources, and (ii) memory scalability, allowing for highly memory-demanding simulations.

The distributed memory approach, based on the message passing interface (MPI)

The MPI developments carried out in this work consist of additional routines that wrap the native MPI directives, without undermining the code readability. Some aspects of the parallel development, such as the domain decomposition and the solution of free surface equations, were achieved using external libraries.

Section

SHYFEM solves the ocean primitive equations, assuming incompressibility in the continuity equation, and the advection–diffusion equation for active tracers, using a finite-element discretization based on triangular elements

Notations adopted in this work.

n/a: not applicable.

SHYFEM equations are discretized in time with a forward time stepping scheme with terms that are evaluated at time level

The treatment of external pressure gradient and divergence in continuity is consistent with the method described in

SHYFEM applies a semi-implicit scheme to the Coriolis force and vertical viscosity with weights of time level

The implementation of

SHYFEM considers the semi-implicit treatment of the Coriolis term, since an explicit treatment can lead to instability when the friction is too low. The weights assigned to time level

The vertical viscosity terms in Eq. (

The resulting linear system is sparse, with a penta-diagonal structure and dimension

The solution of Eq. (

The elliptic equation of the prediction of free surface

The advancement of momentum equations is finalized with the correction step, using

Example of variables on the SHYFEM horizontal and vertical grids.

Vertical velocities at time level

The advection–diffusion equation for a generic tracer

The horizontal diffusivity follows Smagorinsky's formulation.

The density is updated by means of the equation of state (EOS) under the hydrostatic assumption:
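The paper does not reproduce the EOS here. As a hedged illustration of what a density update looks like, the sketch below uses a simplified linear EOS; the coefficients `RHO0`, `ALPHA` and `BETA` are illustrative values, not those used in SHYFEM:

```python
# Illustrative (hypothetical) linear equation of state.
# All constants below are example values, not SHYFEM's.
RHO0 = 1025.0    # reference density (kg m^-3)
T0, S0 = 10.0, 35.0
ALPHA = 1.7e-4   # thermal expansion coefficient (K^-1)
BETA = 7.6e-4    # haline contraction coefficient ((g/kg)^-1)

def density(T, S):
    """Linearized density as a function of temperature and salinity:
    density decreases with warming and increases with salinity."""
    return RHO0 * (1.0 - ALPHA * (T - T0) + BETA * (S - S0))
```

A full nonlinear EOS (e.g., TEOS-10) would replace this function without changing how the update step is organized.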

The sub-steps of the SHYFEM solution method are in Algorithm

SHYFEM time loop

AdvanceMomentum {Eq. (

SolveBarotropicEquation {Eq. (

FinaliseMomentum {Eq. (

CalcVerticalVelocity {Eq. (

SolveTracerAdvection {Eq. (

UpdateDensity {Eq. (

Figure

Variables are also staggered in the vertical grid, as shown in Fig.

The turbulent and molecular stresses and the vertical velocity are computed at the bottom interface of each layer (black dots in
Fig.

Scalar variables (red) are staggered with respect to vertical velocity (black), referenced in the middle and at layer interfaces, respectively. The
sea surface elevation is a 2D field defined only in the

The grid cells on the top layer can change their volume as a result of the oscillation of the free surface. The number of active cells along the vertical direction depends on the sea depth.

The spatial discretization of the governing equations in the finite-element method (FEM) framework is based on the assumption that the approximate solution is a linear combination of shape functions defined in the 2D space. In this section we describe the practical application of the FEM to the relevant terms of the SHYFEM equations.

Notation for connectivity and gradient of shape functions.

Table

Considering a scalar field, such as the surface elevation

The resulting gradient is referenced to the element

Considering a vector field, such as

The horizontal viscosity stress in

Equation (

The terms

The dependency between adjacent nodes of

Scientific and engineering numerical simulations involve an ever-growing demand for computing resources due to increasing model resolution and complexity. Computer architectures satisfy these requirements through a variety of computing hardware, often combined into heterogeneous architectures. There are key benefits from the design (or re-design) of a parallel application

Identifying data dependencies is key for the design of the parallel algorithm, since inter-process communications need to be introduced to satisfy these dependencies.

In the case of a structured grid, each grid point usually holds information related to the cell discretized in space, and data dependencies are represented by a stencil containing the relations between each cell and its neighbors. For example, a five-point stencil represents the dependencies of the current cell on its four neighbors to the north, south, east and west, while a nine-point stencil also includes the cells along the diagonals.
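The five- and nine-point cases can be made concrete with a short sketch (a generic illustration, not SHYFEM code):

```python
def five_point_stencil(i, j):
    """Data dependencies of cell (i, j) in a five-point stencil:
    the cell itself plus its four neighbors to the north, south,
    east and west."""
    return [(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]

def nine_point_stencil(i, j):
    """Five-point stencil extended with the four diagonal neighbors."""
    return five_point_stencil(i, j) + [
        (i - 1, j - 1), (i - 1, j + 1), (i + 1, j - 1), (i + 1, j + 1)]
```

In a parallel code, any stencil entry that falls in another process's subdomain becomes a communication.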

On the other hand, unstructured grid models can be characterized by dependencies among nodes (the vertexes), elements (triangles), or nodes and elements. These kinds of dependencies need to be taken into account when the partitioning strategy is defined.

Element-based partitioning and node-based partitioning.

The choice between element-based or node-based partitioning (see Fig.

The best partitioning strategy cannot be defined in absolute terms; it usually depends on the code architecture and its implementation.

Possible data dependencies on a staggered grid.

Analysis of the SHYFEM code shows that an element-based domain decomposition minimizes the number of communications among the parallel processes. Four
types of data dependencies (see Fig.

The element-based partitioning needs data exchange for dependencies A, B and D, while node-based partitioning needs data exchange for dependencies C and D. Dependency A arises only when momentum is exchanged to compute the viscosity operator. Dependency D arises in two cases, when matrix–vector products are computed for the implicit solution of the free surface equation (FSE), which is solved using the external PETSc (Portable, Extensible Toolkit for Scientific Computation) library. Finally, dependency C is the most common in the code, more frequent than dependency B.

We can thus summarize that, after analyzing the SHYFEM code, element-based partitioning reduces the data dependencies that need to be resolved through data exchanges between neighboring processes. Clearly, the computation on nodes shared among different processes is replicated.

The second step, after selecting the partitioning strategy, is to define the partitioning algorithm. This represents a way of distributing the
workload among the processes for an efficient parallel computation. The standard approach
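Whatever partitioning algorithm is used, a simple diagnostic of the resulting workload distribution is the load imbalance ratio; the sketch below is a generic metric for illustration, not taken from the SHYFEM code:

```python
def load_imbalance(part_sizes):
    """Ratio of the largest partition to the mean partition size.
    A value of 1.0 means perfect balance; the parallel runtime is
    dictated by the most loaded process."""
    mean = sum(part_sizes) / len(part_sizes)
    return max(part_sizes) / mean

# hypothetical element counts per process for a 4-way partition
balanced = load_imbalance([100, 100, 100, 100])    # 1.0
imbalanced = load_imbalance([120, 100, 100, 80])   # 1.2
```

Graph partitioners typically minimize the number of cut edges (i.e., communication) subject to a constraint on this imbalance.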

The SHYFEM code has a modular structure: users can customize the execution through parameters defined in a configuration file (i.e., a namelist) to set up the simulation and to activate the modules that solve hydrodynamics, thermodynamics and turbulence.

This section details the changes made to the original code, introducing the additional data structures needed to handle the domain partitioning, the
MPI point-to-point and collective communications, the solution of the FSE using the external PETSc library

The domain decomposition over several MPI processes entails mapping local entities to global entities. The entities to be mapped are the elements and the nodes. As a consequence of the partitioning procedure, each process holds two mapping tables, one for the elements and one for the nodes. The mapping table of the elements stores the correspondence between the global identifier of each element (which is globally unique) and a local identifier of the same element (which is locally unique); the same holds for the mapping table of the nodes. Mapping information is stored within two local data structures containing the global identification numbers (GIDs) of elements and nodes. The order of the GIDs in the local structures is natural, namely ascending. The local–global mapping is given by the position of each GID in the local structure, called the local identification number (LID).
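As a sketch of this mapping (the names `elem_gids`, `gid_to_lid` and `lid_to_gid` are illustrative, not the SHYFEM data structures):

```python
# Hypothetical mapping table held by one process: the GIDs of its local
# elements, stored in ascending order. The position of a GID in this
# list is the element's LID.
elem_gids = [3, 7, 12, 15, 21]

def gid_to_lid(gids, gid):
    """Local ID of a global ID (since GIDs are stored in ascending
    order, a binary search could be used instead of a linear scan)."""
    return gids.index(gid)

def lid_to_gid(gids, lid):
    """Global ID of a local ID: a direct array lookup."""
    return gids[lid]
```

The GID-to-LID direction is the expensive one; storing GIDs in ascending order keeps it cheap.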

The GIDs of the nodes are stored in a different order: the GIDs of the nodes that belong to the boundary of the local domain are stored at the end of the mapping table. This provides some computational benefits: it is easy to identify all of the nodes on the border; in most cases it is preferable to first execute the computation over all of the nodes in the inner domain and only afterwards over the nodes at the boundary; and during the data exchange it is easy to identify which nodes need to be sent and which need to be updated with data from the neighboring processes.

Data exchanges, implemented with MPI point-to-point communications, are executed when element-to-node or element-to-element dependencies occur. In the first case, each process receives, from the processes that own them, the contributions of the elements sharing the target node, and it keeps track of the shared nodes in terms of numbers and LIDs. Each process computes its local contribution and sends it to the interested neighboring processes. The received information is stored in a temporary 3D data structure indexed by node, vertical level and process, and a reduction operation is performed on it.
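The reduction step of this exchange can be sketched as follows (a generic illustration; in the real code the contributions arrive as MPI messages and are accumulated from the temporary 3D buffer):

```python
def reduce_shared_node(local_contribution, received_contributions):
    """Final value on a node shared between subdomains: the locally
    computed contribution summed with the contributions received from
    the neighboring processes that also touch this node."""
    return local_contribution + sum(received_contributions)

# e.g. a node shared by three processes: one local contribution plus
# two received ones (hypothetical values)
nodal_value = reduce_shared_node(1.5, [0.5, 2.0])
```

Because addition order differs between decompositions, this step is one source of the round-off differences discussed later in the paper.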

The element-to-element dependency happens only once in the time loop required to compute the viscosity operator. In this case, each process sends its
contribution to the neighbors in terms of momentum values. Each process keeps track of the elements to be sent/received in terms of the numbers and
LIDs, using two different data structures. In this case, the data structure extends the local domain in order to include an overlap used to store the
data received from the neighbors as shown in Fig.

Communication pattern for the calculation of horizontal viscosity.

Finally, collective communications were introduced to compute properties related to the whole domain, for instance to calculate the minimum or maximum temperature of the basin or to calculate the total water volume.
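A pure-Python stand-in for these collectives (in the real code they are MPI reductions such as MPI_Allreduce; the values below are hypothetical per-process quantities):

```python
def allreduce(per_process_values, op):
    """Reduce one value per process to a single global value that every
    process would receive (stand-in for MPI_Allreduce)."""
    return op(per_process_values)

local_max_temp = [18.2, 19.1, 17.6]    # per-process temperature maxima
local_volume = [1.0e9, 1.2e9, 0.9e9]   # per-process water volumes (m^3)

global_max_temp = allreduce(local_max_temp, max)
global_volume = allreduce(local_volume, sum)
```

Unlike the point-to-point halo exchanges, these operations involve every process and therefore scale with the total process count.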
Algorithm

SHYFEM-MPI time loop

SendHalo(

RecvHalo(

SetExplicitTerms

AdvanceMomentum {Eq. (

GlobalExchange(RHS)

SolveBarotropicEquation {Eq. (

FinaliseMomentum {Eq. (

SendHalo(

CalcVerticalVelocity {Eq. (

SolveTracerAdvection {Eq. (

UpdateDensity {Eq. (

Semi-implicit schemes are common in computational fluid dynamics (CFD) mainly due to the numerical stability of the solution. In the case of ocean numerical modeling, external
gravity waves are the fastest process and propagate at a speed of up to 200

The semi-implicit treatment of barotropic pressure gradient, described in Sect.

Iterative methods are the most convenient way to solve a large sparse system with

Algorithms based on KSP search for an approximate solution in the space generated by the matrix

In a parallel application, each of the

The last two criteria are met when the method diverges.

The calculation of the norm involves global communication so that all the processes have the same norm value. Hence both point-to-point and global communications burden each iteration, leading to a loss of efficiency for the parallel application if the number of necessary iterations is high. The number of iterations depends on the physical problem and on its size. An estimate of the problem complexity is given by the condition number

In the case of complex systems it is convenient to modify the original linear system defined in Eq. (
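To illustrate why modifying the system (preconditioning) helps, the sketch below implements a generic preconditioned conjugate gradient with a Jacobi (diagonal) preconditioner. This is an illustration for a symmetric positive-definite system, not the BiCGStab-type solver used for the SHYFEM free surface equation:

```python
def matvec(A, x):
    """Dense matrix-vector product (a sparse format would be used in practice)."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(A, b, M_inv_diag, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient with preconditioner
    M = diag(A); returns (solution, iterations)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                                   # residual r = b - A x0
    z = [mi * ri for mi, ri in zip(M_inv_diag, r)]  # z = M^-1 r
    p = list(z)
    rz = dot(r, z)
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:                # norm: a global reduction in MPI
            return x, k
        z = [mi * ri for mi, ri in zip(M_inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

# example: a small symmetric positive-definite system (made-up values)
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 5.0]]
b = [1.0, 2.0, 3.0]
x, iters = pcg(A, b, M_inv_diag=[1.0 / A[i][i] for i in range(3)])
```

The residual-norm check in each iteration is exactly the global communication discussed above; a good preconditioner lowers the iteration count and hence the communication burden.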

We used PETSc rather than implementing an internal parallel solver. PETSc was developed specifically to solve problems arising from partial differential equations on parallel architectures, and it provides a wide variety of solvers and preconditioners that can be switched through a namelist. In addition, the interface to PETSc is independent of the library version, and its implementation is highly portable across heterogeneous architectures.

Example of domain decomposition between two MPI processes for SHYFEM-MPI and the PETSc library. Numbers represent the global (local) ID. Panels

Data management related to the forcing files and restart files.

The PETSc interface creates counterparts of

To solve the free surface equation in SHYFEM-MPI, we used the flexible biconjugate gradient stabilized method (FBCGSR) with incomplete lower–upper (ILU) factorization as a preconditioner. We set the absolute and relative tolerances to

The PETSc library uses a parallel algorithm to solve the linear equations. The decomposition used inside PETSc is different from the domain
decomposition used in SHYFEM-MPI (see Fig.

Domain of the SANIFS configuration. The model mesh is superimposed over bathymetry.

I/O management usually represents a bottleneck in a parallel application. To avoid this, input and output files should be concurrently accessed by the
parallel processes, and each process should load its own data. However, loading the whole file on each process would hamper memory scalability: the total allocated memory should be independent of the number of parallel processes in order to ensure the memory scalability of the code. The two
issues can be addressed by distributing the I/O operations among the parallel processes. During the initialization phase, SHYFEM needs to read two
files: the basin geometry and the namelist. All the MPI processes perform the same operation and store common information. This phase is not scalable
because each process browses the files. However, this operation is only performed once and has a limited impact on the total execution time. As a
second step, initial conditions and forcing (both lateral and surface) are accessed by all the parallel processes, but each one reads its own portion
of data, as shown in Fig.

We ran our experiments to assess the correctness of MPI implementation on the Southern Adriatic Northern Ionian coastal Forecasting System (SANIFS)
configuration, which has a horizontal resolution of 500

Runs are initialized with the motionless velocity field and with temperature and salinity fields from CMEMS NRT products

The sea level is imposed with a Dirichlet condition, while relaxation is applied to the parent model total velocities with a relaxation time of
1

The boundary conditions for the upper surface follow the MFS bulk formulation

We select an upwind scheme for both horizontal and vertical tracer advection. The formulation of the bottom stress is quadratic. The time stepping for the hydrodynamics is semi-implicit with

The cold start implies strong baroclinic gradients. To prevent instabilities, we select a relatively small time step, set to

The round-off error, given by the finite representation of floating point numbers, affects all scientific applications based on numerical solutions, including general circulation models (GCMs). This computational error is present also in the sequential version of all numerical models. Round-off is the main reason why, when the same sequential code is executed on different computational architectures, with different compiler options or with different compilers, the outputs are not bit-to-bit identical. Moreover, merely changing the order in which the elements of the domain are evaluated in the sequential version of SHYFEM yields outputs that are not bit-to-bit identical. The parallel implementation through MPI inherently changes the order in which elements are evaluated and hence the order of the floating point operations. Although we can force the MPI version to execute the floating point operations in the same order as the sequential version, so that the parallel results match the serial ones, we cannot guarantee that the serial results are free of uncertainty, because the serial model also contains round-off error.

The parallel implementation of the SHYFEM model was validated to assess the reproducibility of the results when varying the size of the domain
decomposition and the number of parallel cores used for a simulation. Our baseline was the results from a sequential run, and we compared the results
with those obtained with parallel simulations on 36, 72, 108 and 216 cores. The parallel architecture used for the tests is named Zeus and is available
at the CMCC Supercomputing Center. Zeus is a parallel machine equipped with 348 parallel nodes interconnected with an Infiniband EDR
(100

The results were compared over a 1 d simulation, saving the outputs every hour (the model executes 5760 time steps) and referring to the data on the native grid. Only the most significant fields were taken into account for comparison: temperature, salinity, sea surface height and zonal velocity. In order to evaluate the differences between the parallel executions and the results obtained with a sequential run, we used the root mean square error as a metric:
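Written out in code, the metric is the standard RMSE between the parallel and sequential fields, flattened over all grid points:

```python
import math

def rmse(parallel_field, sequential_field):
    """Root mean square error between two fields given as flat
    sequences of grid-point values of equal length."""
    n = len(parallel_field)
    return math.sqrt(
        sum((p - s) ** 2 for p, s in zip(parallel_field, sequential_field)) / n)
```

A value near machine precision indicates that the parallel run reproduces the sequential solution up to round-off.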

RMSE time series of SHYFEM-MPI outputs compared with the sequential run for all the prognostic fields and with different numbers of cores: 36, 72, 108 and 216.

As a result, we computed the RMSE for each domain decomposition, for each aforementioned field and for each time step saved in the output files
(hourly), as shown in Fig.

The time series of all the MPI decompositions overlap, which means that the program reproduces the sequential result close to machine precision. The time series of the sea surface height (SSH) are noisy, but the RMSE remains steady and of the order of

To further assess the impact of the round-off error induced by the use of PETSc solver, we ran the same configuration five times with the same number
of cores. Again we used the RMSE as a metric to quantify the differences between four simulations with respect to the first one, which was taken as
reference. Figure

RMSE evaluated for code reproducibility with 72 MPI processes.

Although the SHYFEM-MPI implementation is not bit-to-bit reproducible, the RMSE time series show that, for each of the model variables, the deviations of the runs from the reference run remain close to machine precision, with no effect on the reproducibility of the solution of the physical problem. We also implemented a perfectly reproducible version of the model by including a halo over the elements shared between neighboring processes. This version can be obtained by compiling the code with the DEBUGON key, which ensures that the order of the floating point operations is preserved and forces the PETSc solver to run sequentially. However, this version is intended only for debugging and is beyond the scope of this work. Moreover, we tested the restartability of the code by comparing the results obtained from a single run (the "long run") simulating 1 d, but writing the restart files after 12

Execution time in a log–log plot of different solvers available in the PETSc library.

SANIFS execution time in a log–log plot of three settings of the code: efficient MPI version of the code including I/O; efficient MPI version of the code without I/O; debug version of the code which reproduces bit-to-bit identical output of the sequential version.

The parallel scalability of the SHYFEM-MPI model was evaluated on Zeus parallel architecture. The SANIFS configuration was simulated for 7 d and the number of cores varied up to 288, which corresponds to eight nodes of the Zeus supercomputer (each node is equipped with 36 cores).

The SHYFEM-MPI implementation relies on the PETSc numerical library, which provides a wide range of numerical solvers. As a preliminary evaluation, we compared the computational performance of the following solvers: the generalized minimal residual (GMRES) method, the improved biconjugate gradient stabilized (IBCGS) method, the flexible biconjugate gradient stabilized (FBCGSR) method and the biconjugate gradient stabilized (BCGS) method. The pipelined solvers available in the PETSc library aim at hiding the network latencies and synchronizations that can become computational bottlenecks in Krylov methods. Among the available pipelined solvers we tested the pipelined variant of the biconjugate gradient stabilized method (PIPEBCGS) and the pipelined variant of the generalized minimal residual method (PGMRES). The former diverges after a few iterations; it was necessary to increase the tolerances by 4 orders of magnitude in order to use it, and it did not lead to improvements. The latter had worse scalability than the BCGS method used initially. Finally, we experimented with the matrix-free approach, also available in the PETSc library, which does not require explicit storage of the matrix for the numerical solution of partial differential equations. The results reported in Fig.

Execution time when scaling the number of computational cores of the efficient MPI version of the model with and without I/O and of the debug version of the code. Time is expressed in seconds.

To assess the decoupled effect of the I/O and of the computational performance of the model, we ran the model with a configuration that does not write any output, so that the MPI implementation can be assessed in isolation. Since I/O performance depends on how it is implemented in the model and on the architecture of the underlying system (number of I/O nodes and their interconnection, number of MDS and OSS servers, RAID configuration, and the parallel file system used), it would be hard to assess the effect of the I/O in the benchmark results, and it could add extra uncertainty to the performance measurements.

We hence evaluated the model with the SANIFS configuration including the effect of I/O, to provide insight into the model performance when a realistic configuration is used. Finally, we measured the speedup of the debug version of the model, which produces outputs bit-to-bit identical to those of the sequential version. Figure

In Table

SANIFS detailed execution time for different processing phases.

Deeper insight into the performance is provided in Fig.

SANIFS processing ratio among the computing phases.

Figure

Finally, we measured the memory footprint of the model varying the number of processors. Figure

SANIFS memory usage with different numbers of computing nodes. The labels on the points refer to the memory per node, expressed in MB.

To conclude, the free surface equation part of the code needs to be investigated further for a more efficient parallelization. Among these investigations, we have started to evaluate the use of high-level numerical libraries such as Trilinos and Hypre and the exploitation of the DMPlex module available in the PETSc package to handle unstructured grids efficiently. The execution time reported for the free surface solver includes the assembly of the linear system, the communication time of the internal PETSc routines for each solver iteration and the communication needed to redistribute the solution onto the model grid. The effect of a non-optimal mesh partitioning on the solver efficiency has not yet been assessed. Moreover, a more efficient partitioning algorithm should be adopted to reduce the idle time and to improve the load balancing.

The hydrodynamical core of SHYFEM is parallelized with a distributed memory strategy, allowing for both computation and memory scalability. The implementation of the parallel version includes external libraries for the domain partitioning and for the solution of the free surface equation. The parallel code was validated using a realistic configuration as a benchmark. The optimized version of the parallel model does not reproduce the output of the sequential code bit to bit, but it reproduces the physics of the problem without significant differences with respect to the sequential run. The source of these differences is the different order of floating point operations induced by each domain decomposition. Forcing the code to exactly reproduce the order of the operations of the sequential code was found to lead to a dramatic loss of efficiency and was therefore not considered in this work.

Our assessment reveals that the limit of scalability of the parallel code is reached at 288 MPI cores, where the parallel efficiency drops below 40 %. The analysis of the parallel performance indicates that, with a high number of MPI processes, the burden of communication and the cost of solving the free surface equation take up a large proportion of a single model time step. The workload balance needs to be improved with a more suitable domain partitioning. The parallel code, however, accomplishes one of the main tasks of this work, namely obtaining the results of the simulation in a reasonable time, significantly faster than the sequential case. The benchmark demonstrated that the execution time is reduced from nearly 8 h for the sequential run to less than 4 min with 288 MPI cores.
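Using the round figures quoted above (nearly 8 h sequentially, under 4 min on 288 cores), speedup and parallel efficiency can be checked directly; note that these rounded numbers give an efficiency of about 0.42, while the more precise benchmark timings place it just below 40 %:

```python
def speedup(t_serial, t_parallel):
    """Ratio of sequential to parallel execution time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_cores):
    """Speedup normalized by the number of cores (1.0 = ideal scaling)."""
    return speedup(t_serial, t_parallel) / n_cores

# round figures from the text: ~8 h sequential, ~4 min on 288 cores
s = speedup(8 * 3600.0, 4 * 60.0)           # 120x
e = efficiency(8 * 3600.0, 4 * 60.0, 288)   # ~0.42 with these round figures
```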

A continuous function

Shape function of node

Considering an element

The shape functions satisfy the following relations:

The shape functions are calculated by inverting the system (Eq.

Shape functions overlapping in the element

The integration of vertical viscosity term

The stresses are discretized with centered differences:

Grouping the velocities by layer and using the identity

Distribution of variables in three generic layers:

The code of the parallel version of SHYFEM is available at

The input dataset used for the SANIFS configuration is available at

NP, GA, PS and GC formulated the research goals; SM, IE, GM and IB designed the parallel algorithm; GM and SM developed software for the parallel algorithm; IB, IF and NP validated the parallel model and conducted the formal analysis; GM and IB performed the computational experiments; GA, PS and NP provided the computational resources; GM, SM and IB wrote the first draft of the paper; IB and IF provided the data visualization; GM, SM and IE assessed the computational performance; IF, GV and IB contributed to the code tracking of the model equations; NP, GA, PS and GC supervised the research; all the authors contributed to writing and revising the paper.

The contact author has declared that neither they nor their co-authors have any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was part of the Strategic project no. 4 “A multihazard prediction and analysis test bed for the global coastal ocean” of CMCC.

This paper was edited by Simone Marras and reviewed by Ufuk Utku Turuncoglu and one anonymous referee.