Scalability and some optimization of the Finite-volumE Sea ice–Ocean Model, Version 2.0 (FESOM2)

. A study of the scalability of the Finite-volumE Sea ice–Ocean circulation Model, Version 2.0 (FESOM2), the ﬁrst mature global model of its kind formulated on unstructured meshes, is presented. This study includes an analysis of the main computational kernels with a special focus on bottlenecks in parallel scalability. Several model enhancements improving this scalability for large numbers of processes are described and tested. Model grids at different resolutions are used on four high-performance computing (HPC) systems with differing computational and communication hardware to demonstrate the model’s scalability and throughput. Furthermore, strategies for improvements in parallel performance are presented and assessed. We show that, in terms of throughput, FESOM2 is on a par with state-of-the-art structured ocean models and, in a realistic eddy-resolving conﬁg-uration (1/10 ◦ resolution), can achieve about 16 years per day on 14 000 cores. This suggests that unstructured-mesh models are becoming very competitive tools in high-resolution climate modeling. We show that the main bottlenecks of FE-SOM2 parallel scalability are the two-dimensional components of the model, namely the computations of the external (barotropic) mode and the sea-ice model. It is argued that these bottlenecks are shared with other general ocean circulation models.


Introduction
Mesoscale eddies play a critical role in the general circulation of the ocean, strongly affect biogeochemical processes, and thus impact marine life.Since these eddies are not resolved on meshes coarser that the local internal Rossby radius, their effect must be parameterized in terms of eddydriven bolus transport (Gent and McWilliams, 1990) or mixing.While such coarse-mesh models have been a staple of climate research, there are numerous indications summarized in Hewitt et al. (2017) that eddy-resolving meshes can enhance the realism of climate models by improving the simulated positions of major frontal systems and reducing surface and deep biases in climate simulations (Rackow et al., 2019).With increasing computational resources becoming available to the community, running eddy-permitting (nominal resolution around 1/4 • ) and eddy-resolving (resolution of 1/10 • or better) simulations becomes more and more feasible in climate research.However, existing models are still rather slow and require many months of wall-clock time for centuryscale climate simulations making these configurations prohibitively costly for many important applications.
The grid resolution requirements become even more severe in the modeling of many physical processes in the ocean.Since the first internal Rossby radius is decreasing in high latitudes down to several kilometers or even less on the ocean shelves, very fine meshes are needed to simulate the emerging eddy dynamics.At midlatitudes, simulating the submesoscale eddy dynamics related to mixed layer or ageostrophic instabilities, which are particularly pronounced in the vicinity of fronts, creates comparable challenges with respect to model resolution that, ideally, should be on the kilometer scale.Furthermore, submesoscale eddies are believed to affect mixed layer restratification and vertical heat transport (e.g., see Sasaki et al., 2014;Su et al., 2018) and may contribute to shaping the circulation in major current systems (Chassignet and Xu, 2017).Further challenges may emerge in designing setups capable of resolving simultaneously large-scale and coastal dynamics (Holt et al., 2017).Studying their role in ocean puts very high demands on the efficiency of ocean circulation models.
Research questions such as how the ocean will impact climate in a warming world are the main drivers behind the ongoing work towards increasing numerical efficiency, and thus throughput, of ocean circulation models.However, even if ample computational resources were available, one important factor hampering the throughput of these models is their limited parallel scalability, that is, existing models struggle to make full use of the new generation of massively parallel high-performance computing (HPC) systems.Scalability bottlenecks often arise from the saturation of the parallel communication after mesh partitions become smaller than some number of surface mesh vertices per compute core depending on the model and on the hardware employed.The question of scalability is widely discussed in the ocean modeling community, yet the number of publications that document the present status is modest (e.g., see Huang et al., 2016;Dennis et al., 2012;Prims et al., 2018;Ward and Zhang, 2015;Kiss et al., 2019).Furthermore, the information propagates largely via personal exchanges between researchers rather than through peer-reviewed publications.In fact, it can be argued that the opportunities and challenges that come with extreme-scale computing still miss the majority of researchers involved in code development.In order to establish the status quo and to raise awareness of the most pressing problems, a first necessary step is to determine precisely the limits of scalability of existing systems, analyze their common causes, and to understand to what extent the existing ocean circulation models are different.
The main components of ocean circulation models limiting their scalability have been identified in the literature as the solver for the external (barotropic) mode (e.g., Huang et al., 2016;Prims et al., 2018) and the sea-ice model (e.g., Prims et al., 2018).They represent two-dimensional (2-D) stiff parts of the solution algorithm and require either linear solvers (usually iterative) or explicit pseudo-time-stepping with very small time steps (see the split-explicit method for the barotropic dynamic in Shchepetkin and McWilliams (2005) or elastic-viscous-plastic method for the sea ice in Hunke and Dukowicz (1997)).Both approaches are not particularly computationally expensive; however, they introduce numerous exchanges of 2-D halos per time step of the ocean model.Thus the extent to which the 2-D parts of the code control the scalability and the measures potentially capable of at least partly alleviating these limitations are worth a detailed analysis.Typical numbers from the structured-mesh modeling community indicate that a 1/4 • ocean mesh with about 1 million wet surface vertices starts to saturate beyond 500 cores (e.g., see Prims et al., 2018) providing a throughput of 5-10 simulated years per day (SYPD).The throughput of the 1/10 • Parallel Ocean Program (POP) model starts to saturate after 4000 cores at ca. 5 SYPD (∼ 5.8 million wet surface vertices) increasing to 10.5 SYPD at 16 875 cores with a new solver for the external mode (Huang et al., 2016).Higher throughputs may be still possible but at the cost of a very inefficient use of computational resources.
The present article considers the computational performance and parallel scalability of the unstructured-mesh Finite-volumE Sea ice-Ocean Model (FESOM2; Danilov et al., 2017).FESOM2 is based on FESOM1.4(Wang et al., 2014) -the first mature general circulation model with unstructured mesh developed for climate research applications -but using a faster dynamical core.Recently developed large-scale ocean circulation models formulated on unstructured meshes such as FESOM (Wang et al., 2014;Danilov et al., 2017;Scholz et al., 2019), MPAS (Ringler et al., 2013;Petersen et al., 2019), or ICON (Korn, 2017) make it possible to utilize more flexible meshes with variable resolution.However, compared to structured-mesh models, the unstructured-mesh models pose two specific issues that need to be addressed.
-First, mesh partitioning in unstructured-mesh models is based on the analysis of the connectivity pattern between surface mesh cells or vertices; this method is agnostic of mesh geometry and allows one to work with only wet cells and vertices.FESOM2 uses the METIS software package (Karypis and Kumar, 1998) for this purpose.The number of neighboring parallel partitions is not defined a priori by quadrilateral geometry of the mesh as in many structured-mesh models but may vary in some limits -due to the partitioning process -especially when mesh fragments become small.Hence, the performance of this technology on large global ocean meshes with complex boundaries and its implications for parallel communication are an interesting topic to explore.
-Second, one expects that an unstructured-mesh code is more expensive per degree of freedom than its structured-mesh counterparts due to indirect addressing of neighboring cells.However, for ocean meshes, this issue hardly causes any difficulties.In fact, due to the vertical alignment of 3-D prisms within a column, the neighborhood information is propagated along the column and can be efficiently accessed.On the other hand though, high-order advection stencils on an unstructured mesh tend to be more complex and expensive to evaluate.Additionally, a triangular (or dual hexago-nal) surface mesh contains more edges than a quadrilateral mesh with the same number of vertices, making the computation of fluxes more expensive.This relative slowness can be compensated by better scalability leading to an equal or even higher throughput.Indeed, the number of halo exchanges and information to be exchanged should be about the same for meshes of the same size run on the same number of cores independently of their structure.Their relative expense measured as the communication-to-computation ratio will be lower, permitting scaling down to smaller partitions.
It is therefore expected that unstructured-mesh codes can offer a very similar throughput to that of structuredmesh models, and we see the substantiation of this statement as one of the main goals of this paper.
In the framework of the present study, we complement the analysis of scalability and throughput of FESOM2 as a whole by illustrating the performance of its main two-and threedimensional computational kernels.The main scalability bottlenecks, such as the sea ice and the solver for the external mode, are subject to an in-depth parallel performance analysis; the prospective and realized strategies to improve their performance are presented.The hierarchic mesh partitioning is an additional technology, which promises an improvement in the modularity and the flexibility of the main FESOM2 computational kernels and in their mapping to the current and prospective HPC systems.We introduce and discuss its possible uses in various contexts.
The paper is organized as follows.We start with the general description of the model and its solution algorithm, briefly introduce the meshes for test problems, and give some relevant information on the HPC systems employed in our study.In Sect.3, the parallel performance results on our test systems are demonstrated separately for the whole model and for its main computational kernels -similarly to the approach used in Reuter et al. (2015).Section 4 includes some sample model analysis plots produced using Intel Trace Analyzer and Collector and deals with performance enhancements implemented in FESOM2.A number of additional tests illustrating various aspects of the model's performance constitute Sect. 5. A discussion of results and strategies for future improvements of the parallel scaling of FESOM2 is presented in Sect.6, followed by a brief conclusions section.
2 Description of the model and test setups 2.1 Governing equations and solution procedure FESOM2 (Danilov et al., 2017;Scholz et al., 2019) is a general ocean circulation model solving the primitive equations in the Boussinesq, hydrostatic, and traditional approximations.It is formulated on unstructured triangular meshes with scalar degrees of freedom placed at vertices and horizontal velocities at triangles.Because of the dominance of hydro-static balance, the vertices are aligned in the vertical with the surface vertices making the treatment of the vertical direction similar to that in structured-mesh models.The vertical coordinate is realized using arbitrary Lagrangian-Eulerian (ALE) methodology supporting different choices of model layers.The Finite Element Sea Ice Model (FESIM; Danilov et al., 2015) is included in FESOM2 as a set of subroutines.It solves the modified elastic-viscous-plastic (mEVP) dynamical equations allowing one to reduce the number of subcycling steps without compromising numerical stability (Kimmritz et al., 2017;Koldunov et al., 2019).The surface mesh partitioning is carried out using the METIS software package.The sea-ice model is run on the same partitioning as the ocean model and is called each step of the ocean model.Although this may lead to some unnecessary overhead in seaice-free regions -especially in global setups -it makes the exchange between the sea-ice and ocean components trivial.
Next, we describe the time step structure to the extent necessary for the discussion of scalability.
The real procedure is more complicated and includes several additional pieces intertwined with those mentioned.Computations of pressure anomaly due to change in density are performed simultaneously with computations of the Brunt-Väisälä frequency and are taken out of the velocity predictor step.Computations of the thermal expansion, saline contraction, isoneutral slope, the Gent-McWilliams bolus velocity (Gent and McWilliams, 1990), and vertical mixing (K-profile parametrization; Large et al., 1994) appear in different places before they are first needed and are not timed separately in our study.
The description above does not include model input and output (I/O), which can be time consuming if done too frequently and represents a major scalability bottleneck of its own.
From the standpoint of the model's parallel scalability, the procedures mentioned above can be split into two classes.The first class includes the essentially two-dimensional computational parts such as the sea surface height (SSH) solver (Appendices A1 and B) and the sea-ice dynamics (Appendix A2).The mEVP sea-ice solver in FESOM2 (Danilov et al., 2015) commonly carries out about 100 shorter time steps within a time step of the ocean model, each followed by the halo exchange for two-dimensional sea-ice velocities.The total amount of information being exchanged is comparable to just a single three-dimensional halo exchange yet there are many such exchanges.The SSH solution is obtained by using an iterative solver utilizing pARMS (https: //www-users.cs.umn.edu/~saad/software/pARMS/, last access: 4 September 2019), which also involves multiple twodimensional halo exchanges and -even worse -global communications within the iterative procedure.
The other class includes the remaining routines.Most of them are three-dimensional and involve either no or only one halo exchange per ocean time step.The tracer advection presents the exception: it may need additional communications depending on the order of the approximation and on whether or not the flux-corrected transport (FCT) procedure is applied.However, in such cases, the amount of numerical work also increases accordingly, thus the code scalability should not be influenced in a major way.One more subroutine containing additional halo exchanges is the horizontal viscosity, which also needs some smoothing if the Leith parametrization is selected.
In the configuration used here to study the scalability, the third-or fourth-order transport algorithm with FCT is used.It involves four 3-D halo exchanges per time step.The Leith parametrization is used as well, thus the number of 3-D halo exchanges is at its maximum.
FESOM2 uses the vertical as a tighter index than the horizontal to reuse the information on horizontal neighborhood along the vertical column.This is also favorable for vertical operations and generally does not lead to cache misses in the horizontal, especially if the 2-D mesh is sorted along a space-filling curve.The tracers are stored in a single array tracer(num levels , num 2Dvertices , num tracers ), and advection and diffusion computations are applied to each tracer separately in a loop over the tracers.In this case, each tracer is contiguous in memory.Auxiliary arrays needed for FCT and high-order advection are allocated for one tracer only.

Test cases and HPC systems used in the study
Simulations are performed on three meshes (Fig. 1, Table 1) using 47 non-equidistant vertical z layers.The first is a lowresolution CORE2 mesh specially constructed to better represent global circulation in a low-resolution setup.It consist of ∼ 0.13 million wet surface vertices.Its horizontal resolution is ∼ 25 km north of 50 • N as well as around the coast of the Antarctic, ∼ 65 km in the Southern Ocean, and up to All meshes have 47 unevenly spaced vertical layers.The K-profile parameterization (Large et al., 1994) is used for the vertical mixing, and isoneutral diffusion (Redi, 1982) and the Gent-McWilliams (GM) parameterization (Gent and McWilliams, 1990) are utilized.Note that the GM coefficient is set to 0 when the horizontal grid spacing goes below 25 km.The horizontal advection scheme for tracers uses a combination of third-and fourth-order fluxes with flux corrected transport (FCT); for horizontal momentum advection, a second-order flux form is used.The Leith viscosity (Leith, 1968(Leith, , 1996) ) is used together with the modified Leith viscosity in combination with weak biharmonic viscosity.The vertical advection scheme for tracers and momentum combines third-and fourth-order fluxes.Sea-ice dynamics uses the mEVP option (Kimmritz et al., 2017;Koldunov et al., 2019) with 100 subcycles.To ensure model stability for "cold start" experiments with high-resolution meshes (fArc and STORM), we use smaller time steps (4 and 2 min, respectively) compared to the time steps used in production runs (15 and 10 min, respectively).The time step of 45 min is used in all CORR2 experiments.
Our scaling tests utilize four HPC systems summarized in Table 2; the specific versions of FORTRAN or C compilers and Message Passing Interface (MPI) libraries employed in the runs are listed in Table 2.For each HPC system used in our runs, the maximum number of (physical) cores per node is used without any hyperthreading (except for the comparisons in Sect.5.3).

Scaling tests
Each simulation in this section starts from a state of rest by reading the mesh, partitioning, initial conditions, and forcing; after that, 1800 time steps are performed; no output is written out during the run.The linear free surface option and the new version of the BiCGStab method for the SSH solver (see Sect. 4.2 for details) are used, and no setup or I/O operations are included in any timings.Whereas the majority of the used configuration settings are either standard for the scalability studies or for FESOM2 production runs, a word of caution with respect to this testing protocol has to be given: since the model is working with the ocean fields that are not yet fully developed, the time step for high-resolution meshes has to be reduced.Our motivation and consequences of using this protocol for scaling experiments are discussed in Appendix B1.Here, we only mention that differences in absolute runtimes of "cold start" with smaller time step vs. "warm start" with larger time step do not affect our general conclusions.Therefore a simple testing protocol can be used, although absolute values for runtimes of individual model components should be considered to be just an approximation.
For the fArc mesh (0.6 million vertices) on Mistral (Fig. 2, top left), we start with a moderate number of cores (144), which is then increased in increments up to 6912.The model shows strong linear scaling in the total runtime up until 1152 cores corresponding to about 500 vertices per core and then starts to slightly deviate from the linear scaling.The two components that clearly account for the bulk of scaling deterioration are the sea ice and the SSH solver.Note that the rest of the model components continue to scale linearly up to 6912 cores corresponding to ca. 100 vertices per core.The behavior for JUWELS (Fig. 2, top right) is similar although JUWELS is in general somewhat faster.
The scaling of the SSH solver on both machines practically stagnates at some point (after 1152 cores on JUWELS and 1728 cores on Mistral).The sea-ice-model scaling is not linear, but the absolute values of the wall-clock time still improve almost until the end.The fArc mesh is focused on the Arctic Ocean, therefore the sea-ice computations initially (on 144 cores) occupy about 16 % of the total time, while the SSH computations require only 6 %.At the greatest number of cores (6912), they occupy already 33 % and 28 %, respectively.The scaling for the entire model starts to considerably deviate from the linear behavior after SSH and sea-ice calculations become more expensive than the 3-D part of the code.
The scaling results for the STORM mesh (5.6 million wet vertices) are shown in Fig. 3. On Mistral, our scaling experiments could utilize up to 50 688 cores (by means of a special reservation), while on JUWELS a maximum of 12 288 cores was available, and all tests were performed using the general queue.The scaling behavior of the STORM mesh is similar to that of fArc: the 3-D parts scale practically linearly until about 100 vertices per core, whereas the SSH calculations and the sea-ice model present scalability bottlenecks.After 11 520 cores on Mistral, the SSH runtime stays around 13-14 s but, surprisingly, does not get worse.On JUWELS, the SSH stops to scale much earlier, already at 5760 cores, and oscillates around the runtime value of 15 s.The sea ice on Mistral continues to scale until about 41 472 cores, but the scaling is clearly suboptimal; on JUWELS, the runtime for the sea-ice kernel continues to improve until the maximum number of cores is reached.In the runtimes for both 2-D code parts (sea ice and SSH), we also clearly notice much stronger oscillations that indicate sensitivity to the state of the communication hardware and software of the corresponding HPC system.The STORM mesh experiments on Mistral were conducted in two separate series: up to 18 432 cores in the general queue without any special job scheduler configuration; larger jobs were run using a special reservation for 1500 nodes.This difference in the experiment setting might be responsible for the sudden oscillations in the timings for the SSH solver and for some other, perfectly scalable otherwise, routines such as the ocean mixing and computation of the pressure gradient (not shown).After these oscillations, the total time continues to improve until the maximum number of cores -even though it does not go back to the linear scaling path anymore.
We use the smaller CORE2 mesh to demonstrate the model behavior when partitionings with less than 100 vertices per core are used (Fig. 4).The cumulative effect of increasing runtimes of the 2-D model parts and decreasing runtimes of the 3-D parts is an almost constant total runtime.This good scaling of the 3-D parts of the FESOM2 code suggests that the 3-D model parts may be efficiently computed on hardware architectures with low memory per computational core (e.g., GPUs).
Runtime values detailing performance of individual model components are presented in Fig. 5.Although obtained in simulations using the CORE2 mesh on 288 cores of Ollie/AWI and simulating 11 680 time steps (1 model year with 45 min time step), their relative values are representative for other meshes.
The standard configurations of FESOM2 have 47 unevenly spaced vertical z levels, which is nearly sufficient to resolve the first baroclinic mode.For setups with high horizontal resolution, it is beneficial to increase the number of vertical levels to better resolve vertical structure of horizontal motions in the ocean (e.g., Stewart et al., 2017).In order to better understand the effect of increasing the number of vertical layers on model's scaling, we set the number of vertical levels to 71 following recommendations described in Stewart et al. (2017).This compares well with typical numbers of layers used in high-resolution model setups.Experiments were performed on Mistral with fArc mesh, and all settings were the same as in the scaling runs in Sect.3. Just as expected, with an increased number of vertical levels the scaling improves while the model becomes slower (Fig. 6).Deviations from the linear scaling are still observed, but the decline is smaller for 71 vertical layers than for 47.Increases in the number of vertical levels only affect the 3-D part of the model by giving each compute core more local work, while the poorly scalable 2-D parts are not affected by the change in the vertical layering.The number of communication calls does not change, although there is an increase in the volume transmitted.
To sum it up, total runtimes for FESOM2 scale linearly until about 400-300 vertices per core and then start to deviate from the linear behavior.The computationally significant 3-D parts of the model scale almost linearly to much lower numbers of vertices per core (at least 100), but the main 2-D components (SSH and sea ice) represent the scaling bottlenecks and, by occupying an ever growing fraction of the total runtime, eventually lead to a stagnation and later on to a deterioration in the runtimes with increasing number of compute cores.In practice, for large meshes such as STORM, the limits of scalability are hardly ever reached using the HPC resources currently available for production runs (e.g., 256 compute nodes on JUWELS and 512 compute nodes on Mistral).
In the following sections, we describe how the above runtimes translate into operational throughputs, discuss several algorithmic and technical enhancements used to reach the current level of scalability, and propose some measures that have the potential to further improve model performance.
4 Parallel scaling analysis and techniques to improve parallel scalability

Parallel code analysis using Intel Trace Analyzer and Collector
Scaling tests carried out in Sect. 3 show that the major 3-D computational parts scale very well, while the scaling of the main 2-D kernels, namely of the sea ice and even more of the SSH solver, is substantially worse.To analyze and explain these scaling properties, we visualize the communication patterns of FESOM2 with the Intel Trace Analyzer and Collector (ITAC) for a full FESOM2 time step on the CORE2 mesh with 36 (Fig. 7, top) and 72 (Fig. 7, bottom) MPI tasks on Ollie (AWI Bremerhaven).The different types of computation can be easily distinguished in the ITAC graph:  -First, the sea-ice step with 100 iterations on the 2-D mesh performs one halo exchange in every iteration (lines 5, 6 in Algorithm 1).As the CORE2 mesh has no focus in polar regions, most compute cores idle in the MPI_Wait of the halo exchange (light blue in Fig. 7).
The zoom-in view in Fig. 7 reveals this heavy load imbalance and also shows the small 2-D halo exchanges that are implemented with nonblocking MPI_Irecv and MPI_Isend calls (green) overlapped with a small part of independent calculations (blue) and finalized with MPI_Waitall (light blue).
The zoom-in view of the sea-ice step (bottom right of Fig. 7, above for 36 cores, below on 72 cores) shows that the computational load scales linearly, but the halo exchange gets more and more expensive.At the end of the sea-ice step, the global fresh water balance (line 7 in Algorithm 1) is enforced using MPI_Allreduce (red strip in Fig. 7).
-The third part is the 2-D solver for the SSH (lines 12-14 in Algorithm 1) dominated by frequent small 2-D halo exchanges and global sums (dark vertical strip in Fig. 7).On a small number of cores, the time spent in the solver is low, but it dominates the massively parallel runs.We will go into more detail concerning the SSH solver in Sect.4.2.
The results shown in Fig. 7, in particular negligible amounts of waiting time, illustrate our claims about highly efficient (in terms of parallel scaling) 3-D parts of the FE-SOM2 code but also explain suboptimal scaling of both main 2-D computational kernels, the sea ice, and the SSH.Even for rather low numbers of cores, the explicit iterative method for the sea ice described in Appendix A2 hardly contains enough arithmetic operations to amortize even nonblocking MPI communications involved in 2-D halo exchanges.The situation only gets worse when the number of cores is increased and the communication-to-computation ratio falls even lower.The analysis of the problems with the SSH scaling and our methods to deal with them are the subject of the next section.

Overlapping computation with communication in
the BiCGStab solver for the SSH For the SSH computation, FESOM2 employs the solver package pARMS augmented by additional Krylov-type solvers and parallel preconditioners (see Fuchs, 2013).For the SSH Eq. (A1) in both FESOM1.4 and FESOM2, the following settings proved -after extensive testing -to deliver robust results with fast convergence and good parallel efficiency for up to ca. 1000 cores (a typical setup for large FE-SOM runs around 2013): -BiCGStab Krylov solver (Algorithm 2), parallel preconditioner restrictive additive Schwarz (RAS) with one vertex overlap, local preconditioner incomplete lower-upper (LU) factorization ILU(k) with fixed fill level k = 3, solution tolerance 10 −8 .
Our recent scaling tests revealed that the SSH solver with the above settings represents a major bottleneck for the next level of scaling: up to 10 000 compute cores and beyond.Since the iterative SSH solver is very similar in its parallel communication patterns to the sea-ice step with frequent 2-D halo exchanges and little computational work in between, a substantially worse scaling of the SSH step turns out to be caused by the additional global communications.Figure 9 (top) shows one BiCGStab iteration with three MPI_Allreduce calls (see also Appendix B): the first two calls performed after the matrix-vector product and the preconditioner step cause increased waiting times.In addition to the blocking behavior, the MPI_Allreduce itself becomes expensive on higher core counts.Another limiting factor is the convergence of the RAS preconditioner, whose efficiency depends on the sizes of parallel partitions (see Smith et al., 2004), and, as a consequence, the iteration count increases slowly with the number of compute cores.
After searching for alternatives, we implemented the pipelined BiCGStab algorithm (Algorithm 3) proposed in Cools and Vanroose (2017) that resulted in an improved scaling behavior (see Fig. 8) especially for high core counts.As opposed to the classical BiCGStab algorithm, the pipelined version replaces the three blocking MPI_Allreduce calls by two nonblocking communications overlapped with the computations of the matrix-vector product and the preconditioner (see Fig. 9, bottom).Since the pipelined version involves more arithmetic operations for vector updates, it may be inferior to the original BiCGStab method on lower core counts.Also differences in the HPC hardware play a substantial role in the comparative performance of both algorithms as clearly illustrated in Fig. 8.

Speeding up the sea-ice model
The equations of sea-ice dynamics with traditional viscousplastic rheology and elliptic yield curve Eqs.(A2)-(A4) are very stiff and would require time steps on the level of a fraction of a second if computed explicitly (e.g., see Kimmritz et al., 2017, for a brief summary).For this reason, they are solved either with an iterative solver or explicitly using pseudo-elasticity to reduce the limitations on short time step as proposed by Hunke and Dukowicz (1997).The latter approach is called elastic-viscous-plastic (EVP) and requires N EVP ≥ 100 substeps within the external step of the sea-ice model.Other components of the sea-ice model such as the advection or the computation of thermodynamic sources and sinks are advanced at the time step of the ocean model and are much less demanding.Since the EVP approach is explicit, it has to satisfy stability limitations -as discussed by many authors beginning with the original paper by Hunke and Dukowicz (1997).It turns out that as the mesh resolution is refined, N EVP must be increased to maintain stability and becomes prohibitively large.The violation of stability leads to noise in the derivatives for the sea-ice velocity.Although this noise may stay unnoticed by users, it affects dynamics.
An essential measure taken in FESOM2 to improve scalability is to use the modified EVP approach (mEVP) as described in Danilov et al. (2015).As opposed to the traditional EVP approach, the mEVP approach, based on a suggestion by Bouillon et al. (2013), splits the issues of numerical stability and convergence.This solver is always stable, and the number of the substeps, N EVP , determines its convergence to the viscous-plastic rheology.In practice, it turns out that mEVP produces practically acceptable solutions in the regime with relatively small N EVP that only corresponds to an initial error reduction.N EVP is found experimentally by starting simulations from the values greater than the stability parameters α and β in Eqs.(A7)-(A10) and reducing them to minimum values that lead to results practically indistinwww.geosci-model-dev.net/12/3991/2019/Geosci.Model Dev., 12, 3991-4012, 2019  guishable from simulations performed with a large N EVPas explained in Koldunov et al. (2019).We use N EVP = 100 for both fArc and STORM meshes.These values are already much lower than the values needed for the standard EVP, and, in this sense, the sea-ice model is already optimal.The seaice model uses only local communications, but their number is proportional to N EVP per external time step.Estimates obtained for 1-year runs are calculated without accounting for the initial model setup (reading mesh, restart files, etc.) needed only once per run.
The SYPD estimates based on 1-year experiments are substantially lower than those obtained from the scaling results; we attribute this slowdown of the model in an operational regime to the I/O overhead and the SSH solver.For fArc at 1152 cores, I/O operations (mainly writing the recurring output files and the restart at the end of the year) use ca. 5 % of the time, while at 6912 cores it becomes 17 %.Even worse is the situation for STORM with I/O at already 25 % of computing time at 4608 cores and growing to 65 % at 13 824 cores.Note that the amount of the I/O in our 1-year experiment is rather moderate (limited number of monthly fields).In highresolution (eddying) regimes, diagnostics based on the mean effects of eddies require much higher output frequencies.The I/O in FESOM2 is still organized in a serial way with fields from all computational cores collected on one core and written to disk.While sufficient on small meshes, this methodology is obviously a bottleneck for simulations using large meshes on large parallel partitions.It will be changed in the nearest future in favor of parallel I/O solutions.
Also the fraction of the total runtime (excluding I/O) spent in the SSH solver is greater for 1-year simulations than for scaling experiments due to larger time step.The increase is due to poorer conditioning of the SSH matrix and consequent increase in the number of SSH solver iterations.More details on the consequences of this issue for our scaling experiments are presented in Appendix B1.
The real model performance depends on many factors, including I/O, state of the machine, number of vertical levels, time step, and other details of the model configuration.Therefore any timings produced in idealized settings must be taken with a grain of salt when model throughputs lie in the focus of a discussion.It is even harder to compare throughput of different models to each other, especially when not all details are available.Nonetheless, an attempt to make such a comparison is presented in Table 3, which should be treated as a qualitative estimate.
Nevertheless an attempt to account for differences in model configurations is made by introducing c SYPD constants.The relationship between resulting SYPD, basic model characteristics, and computer resources can be expressed as so the SYPD is directly proportional to model time step ( t) and number of computational cores (N core ), while inversely proportional to the number of degrees of freedom (N 3-D ).
The larger the proportionality constant c SYPD the more computationally efficient the model is.In Table 3, we list c SYPD for 3-D (vertices times number of layers) and 2-D degrees of freedom (vertices) since 2-D model parts take up a significant fraction of computational time.This comparison shows that once the differences in the number of model levels, computational resources, time steps, and mesh size are accounted for, the FESOM2 throughput is easily on a par with that of well-established structured models confirming our claim that unstructured-mesh models can be about as fast as conventional structured-mesh models and thus represent an efficient tool for ocean and climate studies.Considering overall model scalability, there is still room for improvement, especially for greater numbers of computational cores, by transitioning to a parallel I/O.

Hierarchic mesh partitioning
The standard way of performing mesh partitioning in FE-SOM2 relies on the METIS (Karypis and Kumar, 1998) graph partitioning package (currently version 5.1.0)and utilizes the dual-weighted load-balancing criterion based on the number of 2-D and 3-D mesh vertices.This approach works well for moderate numbers of parallel partitions but tends to produce some undesirable artifacts for large numbers of partitions: isolated vertices, partitions containing vertex groups widely separated in the computational domain, noncontiguous partitions, etc.
As a simple but elegant way to remedy some of these deficiencies, a backward-compatible wrapper for METIS has been implemented that allows one to perform the mesh partitioning in a hierarchic fashion (see Fig. 11,left).The procedure starts from the coarsest level, e.g., producing a partition per networking switch.Then METIS is called recursively for each partition on the current level until the lowest level (usually that of a single core) has been reached.Since the performance of METIS for small numbers of partitions is usually excellent, this approach guarantees that each coarse partition only contains contiguous vertices thus potentially improving the mapping of the computational mesh onto the topology of the compute cluster.This methodology is certainly not entirely new (e.g., see the method called "nested partitioning" in Sundar and Ghattas, 2015); however, neither implementations of this idea as our simple METIS wrapper nor applications of this technique to ocean modeling could be found in the literature.
The simulation runs with the CORE2 mesh on Emmy (Fig. 11, right) appear to show a small edge for the hierarchic partitioning in cases if the amount of work per partition becomes small; however, runs on HPC systems with higherend network hardware such as Ollie or JUWELS show no discernible differences to nonhierarchic partitioning.We attribute this discrepancy to the ability of new communication hardware with its superior efficiency and lower latency to hide deficiencies in the mesh partitioning, whereas older net-Table 3. Throughput of different models at selected configurations comparable to FESOM2 STORM, fArc and CORE2 meshes.Vertices are 2-D degrees of freedom only in the ocean.It is not possible to express the resolution of FESOM2 configurations in a single number, therefore we list resolutions that regular Mercator grid would have with a similar number of ocean degrees of freedom.Data for FESOM2 configurations are taken from 1-year simulations with I/O.work hardware on Emmy is more sensitive to the placement of grid vertices on parallel partitions.A similar situation -although on a much higher level -may also arise in the future if the communication efficiency in the future HPC systems does not keep pace with the computational performance increases (e.g., increases in the number of compute cores) of compute nodes.
In addition to improving the mapping of the partitioning topology to the topology of the HPC system, the hierarchic partitioning capability is intended to provide a simple interface for placing poorly scalable computational kernels (e.g., I/O or sea ice) at a certain mesh hierarchy level with a natural way to supply clearly defined relationships between partitions on different levels.By exploiting this feature, we hope to be able to improve computational efficiency of kernels currently bound by the I/O or communication bandwidth and to also provide interfaces to hardware accelerators such as GPUs or field programmable gate arrays (FPGAs).

Improving the network performance by underpopulating compute nodes
A simple but often effective way to improve the network performance on the current-generation network hardware such as Intel Omni-Path is to leave one core per node off the regular compute job assignment.Indeed, our experiments on Ollie and Mistral (not shown) demonstrated little difference between fully assigned and underpopulated nodes (or sometimes even slightly better SSH timings for underpopulated nodes).In this way, in addition to networking duties, other loads (e.g., I/O) currently distributed over all compute cores can be offloaded to these idle nodes. in model scalability

SSH strategy
Iterative solvers based on preconditioned conjugate gradient method or other Krylov subspace methods need global communications, which is the reason behind many present-day models opting for the split-explicit subcycling to explicitly resolve surface gravity waves.The advantage of using a linear solver for the SSH is its algorithmic simplicity and, in the case of FESOM, enhanced stability on arbitrary meshes (less control on mesh quality and smoothness is needed).Many possible subcycling algorithms have been discussed in the literature (e.g., see an analysis in Shchepetkin and McWilliams, 2005).We plan to include the subcycling option in FESOM2 in the future but, judged from the scalability of purely explicit sea-ice solver, only partial improvements can be expected.Indeed, the scalability of an explicit sea ice vs. an implicit SSH solver is machine dependent (e.g., see STORM results for the sea ice and the SSH on Mistral and JUWELS in Fig. 3) and is also never optimal.However, even small improvements in scaling, at least on some machines, might be sufficient to make practicable partitions as small as 200 surface vertices per core.We will report on our results in due time.
Alternatively, one may follow the suggestion of Huang et al. (2016) and switch to a solver based on a Chebyshev method.The method of Huang et al. (2016) is based on an algorithm without global sums and allows one to better overlap computation with communication.A caveat here is that unstructured meshes used by FESOM2 may cause the SSH system matrix to be more poorly conditioned than in Huang et al. (2016), since mesh elements can differ by a factor of 100 or greater.It remains to be seen whether this method is able to compete with the split-explicit approach.Another simple optimization avenue presents itself via renumbering of mesh vertices: in the solver package pARMS, the arrays are permuted to have the inner vertices first, followed by the vertices that have to be send to the neighboring partitions, and, last, the vertices to be received; this allows one to exchange the halo and meanwhile perform computations on the inner vertices.The same strategy can be easily extended to the FE-SOM2 sea-ice model and to all other halo exchanges.The optimal data structure for unstructured meshes is still under discussion (MacDonald et al., 2011).

Sea-ice model strategy
The sea-ice model is linked less directly to the ocean model than the internal mode, and the simplest choice to reduce its cost would be to call it every second or every third time step of the ocean model.Eddy-resolving codes are usually run with a time step smaller than 10 min, and the sea-ice state does not change much in this time interval.However, in-creasing an external sea-ice step has implications for stability.Theoretically, running the mEVP with a doubled external time step would require a √ 2 increase in stability parameter with corresponding changes in N EVP .Thus, in the end, one would expect only √ 2 reduction in the time spent on sea-ice computations, yet this has to be carefully tested.One more caveat is the floating (as opposed to levitating) sea-ice option, which might require further adjustments.
A more dramatic change would be to run the sea ice synchronously with the ocean on a separate set of cores using the same ocean mesh partitioning or partitions obtained by hierarchically combining several ocean partitions to simplify the ocean-sea-ice communication.On global meshes, where the domain covered with sea ice occupies only a fraction of the total area, the sea-ice model would require only a small number of additional cores.In this approach, one would choose the number of cores for the sea ice tuned to achieve the runtimes not exceeding those of the ocean model.Clearly, the scalability of the sea-ice model still remains a bottleneck; however, the overall scalability would be substantially improved by running the components with poor scalability in parallel instead of one after another.In addition, this technique can optimize the utilization of computational resources since the cores currently idle due to the communicationbound sea-ice model can be used for the well-scalable ocean model.The drawback of this approach is a more complicated code design and the need for a smarter parallel loadbalancing algorithm.It can be simplified by using couplers, but this may come with its own issues regarding scalability.

Need for more efficient 2-D computations
We saw above that, independent of the solution method (iterative solver or subcycling), the two-dimensional parts of the ocean code present a challenge on the road to a better parallel scalability.While the measures briefly discussed above can be helpful, we do not expect that they can deliver optimal scalability down to less than 100 surface vertices per core seen for the 3-D part of the code.The open question is whether new hardware architectures can offer better options.The problem with the external mode and sea-ice solvers lies in too frequent but short communications.On present architectures, a possible strategy to reduce the number of communications in the SSH and sea-ice parts of the code is to make halos wider.
In this regard the concept of dwarfs -also known as Berkeley dwarfs (see Asanović et al., 2006), further developed and customized for weather and climate applications in Lawrence et al. (2018) -might be useful.The main idea behind this concept is to capture some essential characteristics of a model part or a parametrization in a stand-alone piece of software that can be run, analyzed, ported, and optimized on its own.In the HPC-speak, a dwarf is a computational kernel enhanced by the ability to run without the remainder of the model and whose performance is critical to that of the whole model.By utilizing dwarfs extracted from the problematic 2-D kernels, a number of numerical, algorithmic, and computational methodologies can be tested and optimized in the search for better solutions.In the framework of the currently funded EU project ESCAPE, dwarfs for several weather and climate models are being actively developed (http://www.hpc-escape2.eu/,last access: 19 June 2019).

Conclusions
We systematically explored the scalability of FESOM2 for a set of meshes (0.13 million, 0.64 million, and 5.6 million surface vertices, 47 layers) on several HPC systems (up to 50 000 compute cores) and found a nearly linear scalability of FESOM2 code to the limit of 300-400 surface mesh vertices per core.Our numerical experiments indicate that the 3-D parts of the code scale almost optimally down to 100 surface mesh vertices per core.Similar to other large-scale ocean circulation models, the factor limiting the scalability of FESOM2 is its two-dimensional part, represented by the solver for the sea surface height and the sea-ice model.Since the sea-ice model uses explicit time stepping and local communications, it generally demonstrates better scalability than the solver part; however, this behavior is machine dependent and never reaches the optimal levels of scaling.
Our results also allow us to conclude that the technology of mesh partitioning relying on the connectivity pattern and based on METIS software package in the case of FESOM2 works well on large meshes and down to very small partitions.While the hierarchic mesh partitioning proposed here does not generally improve the scalability, it lays the base for future model improvements, especially concerning the seaice model, I/O, and the SSH computation.This methodology also can become useful for unstructured model data parallel postprocessing using approaches presented in Rocklin (2015) and Hoyer and Hamman (2017).
The analysis of scalability was complemented by the analysis of model throughput.We show that the FESOM2 not only scales well but also reaches throughputs that are very comparable to those of structured-mesh models given the same computational resources and similar meshes.They can be even higher because of a better scalability for high-resolution configurations.Since new unstructured-mesh large-scale ocean model developments (FESOM2, MPAS, ICON) are all based on the finite-volume methodology, similar behavior can be expected in all these cases.In summary, therefore, we see the following statement as the main message to the oceanographic community: the present-day unstructured-mesh global ocean circulation models have reached a level of maturity, where their computational speed is no longer a limiting factor.
While the result that the external mode and the sea-ice model are currently limiting the parallel scalability (with further practical complications arising from the implementation of the I/O) did not come as a surprise, the lesson learned from the analysis presented in this study is the extent of the problem.Given that computational resources available for most current long-term simulations are in many cases still limited to 5000-10 000 or less cores, the parallel scalability is only beginning to emerge as a major issue on large (1/10 • or finer) meshes.The current CPU architectures appear to be well suited for nearly all 3-D computational parts of FESOM2, thus the potential for improvement seems to lie in the direction of improved memory bandwidth, lower communication latency, and more efficient file systems -this can be then used as an indication when choosing "optimal" hardware.Suboptimal scaling of the sea ice combined with a sequential arrangement of sea-ice and ocean steps results in an inefficient utilization of computational resources and indicates clear directions for improvement.This, together with a better, scalable, parallel I/O, is the direction for future model code development to enable high-resolution climate simulations with reasonable throughputs.
In order to better prepare the ocean modeling community to challenges that come with extreme-scale computing, more efforts should be dedicated to in-depth scalability analysis of computational cores in different ocean models and careful intercomparison of ocean model throughputs.This only can be done in a coordinated manner similar, for example, to Balaji et al. (2017).
B1 Dependency of scaling results on time step, forcing, and ocean state Most of the scaling experiments in this paper were performed in accordance with the simple protocol in which the model is initialized from a state of rest and run for a small number of time steps (1800 in our case).This design (hereafter called "cold start") allows us to easily conduct scaling experiments and permits quick testing of changes in the model code; no reading of the restart files is required that can become time consuming for large setups (making experiments very expensive in terms of CPU hours).Moreover, a similar strategy is used by many ocean and atmosphere modeling groups therefore greatly simplifying the comparison of the results.The downsides of this protocol are a smaller time step needed to keep high-resolution setups stable and the ocean dynamics that is not fully developed yet.In this section, we explore consequences of these deficiencies for the results described in our article.
High-resolution models have a several times larger time step in operational setting compared to the spin-up phase (that usually lasts for about a year).A larger time step makes the conditioning of the SSH matrix worse, hence there is an increase in the number of iterations in the SSH solver.To illustrate how this effect will influence the scaling behavior, we conducted two series of experiments for fArc mesh on Mistral/DKRZ: the cold start ones with shorter time step (4 min) and "warm start" ones that start from the restart files (after 1year spin-up) and use a larger time step (15 min), same as in fArc production experiments (Fig. B1).The shape of the SSH solver curve stays practically the same but is shifted upwards.The increase in the total mean time is almost entirely due to the computation time of the SSH solver (Fig. B2).Although the higher cost of the SSH solver matters, it only slightly changes the scaling for the total model mean.Note that deviations from linear scaling for total mean become visible only after 250 surface vertices per core -a number beyond our claimed break in linear scaling after around 300 vertices per core.
To see how the number of barotropic solver iterations changes with time, we perform experiments with the fArc mesh on Mistral/DKRZ using 5760 cores.The model is run from a state of rest (year 1948) with the Polar Science Center Hydrographic Climatology (PHC) (Steele et al., 2001) initial conditions for four consecutive years, and the number of solver iterations is recorded at every time step.Figure B3 shows running mean values (window 500) of iteration counts.It is clear that changing ocean state and forcing has an effect on the number of SSH solver iterations and a full picture of the solver behavior does not become apparent in short experiments as statistics over a longer period of time have to be collected.On the other hand, approximate values can be estimated from the short experiments, and the cold start experiments can probably give a lower bound estimate of the solver performance.
As a final note, one has to keep in mind that simple scaling experiments described in the article can explore mostly the scalability of model components, while clearly absolute numbers can vary in a wide range.For example, the sea-ice model will show different runtimes depending on the number of EVP subcycles (e.g., see Koldunov et al., 2019).The runtimes for tracers can be reduced nearly 2-fold if the flux corrected transport (FCT) option is off (it is on now) or can be increased if higher-order schemes are used.Scaling will be improved if the number of layers is increased, and so on.The scaling of the total mean will be affected by these details, thus any discussion can only be conducted in a certain approximate sense.

Figure 1 .
Figure 1.Resolution maps for meshes used for our experiments: CORE2 (a), fArc (b), and STORM (c).Total number of surface vertices for each mesh is displayed at the top.

Figure 2 .
Figure 2. Scaling results for the fArc mesh on Mistral (DKRZ, Hamburg; a), JUWELS (JSC, Jülich; b), Emmy (RRZE, Erlangen; c), and Ollie (AWI, Bremerhaven; d) compute clusters.The black line indicates linear scaling and the orange line (and the number labels) gives the mean total computing time over the parallel partitions.

Figure 3 .
Figure 3. Scaling results for the STORM mesh on Mistral (DKRZ Hamburg, a) and JUWELS (JSC Jülich, b) compute clusters.The black line indicates linear scaling and the orange line (and the number labels) gives the mean total computing time over the parallel partitions.

Figure 4 .
Figure 4. Scaling results for the CORE2 mesh on Emmy (RRZE Erlangen) and Ollie (AWI Bremerhaven) compute clusters.The black line indicates linear scaling and the orange line (and the number labels) gives the mean total computing time over the parallel partitions.

Figure 5 .
Figure 5. Mean wall-clock runtimes for different model components.The experiment is run for 1 year (11 680 time steps of 45 min) of CORE2 mesh on 288 cores on Ollie/AWI."Pressure" includes computations of density and the Brunt-Väisälä frequency, and "Slope" includes computations of isoneutral slope.

Figure 7 .
Figure7.One FESOM2 time step on the CORE2 mesh with 36 MPI tasks (above) and 72 tasks (below) visualized in Intel Trace Analyzer and Collector.Zoom-in views of iterations of the sea-ice model (bottom right) show the load imbalance between processes with differing amounts of sea-ice per process and illustrate reasons for a suboptimal parallel scaling: with twice the number of cores, the computation time (blue) for parallel partitions fully covered by sea ice halves; however, the halo exchange time (green) increases.

Figure 9 .
Figure 9. Above: one iteration of the classical BiCGStab scheme (Algorithm 2) in the Intel Trace Analyzer and Collector.Observe the blocking due to computing global sums in scalar products via MPI_Allreduce.In addition, communication is needed for halo exchanges that often overlap with independent computations.Below: one iteration of the pipelined BiCGStab scheme (Algorithm 3).The global sums are now initiated by a nonblocking call to MPI_Iallreduce and finished by MPI_Wait.In between, major computations such as a matrix-vector product and a preconditioning step are performed also including halo exchanges.

Figure 10 .
Figure 10.Model throughput in simulated years per day (SYPD) estimated from scaling experiments and measured in 1-year experiment with I/O for fArc (a) and STORM (b).

Figure 11 .
Figure 11.Schematics of an HPC cluster for purposes of hierarchic mesh partitioning (a) and a comparison of the total, SSH, and sea-ice scaling for hierarchic (continuous lines) and nonhierarchic (dashed lines) partitionings for the CORE2 mesh on Emmy (RRZE, Erlangen, b).

Figure B1 .
Figure B1.Scaling results for the fArc mesh on Mistral (DKRZ Hamburg) for cold start (a) and warm start with increased time step (b) experiments.The black line indicates linear scaling, the orange line (and the number labels) gives the mean total computing time over the parallel partitions.

Figure B2 .
Figure B2.Difference between runtimes of the experiment with warm start and larger time step and the experiment with cold start and smaller time step.

Figure B3 .
Figure B3.Number of barotropic solver iterations at each time step for four consecutive years.The year 1948 began with the cold start.Running-mean smoothing with window size of 1000 was applied.Ocean model time step was −4 min.

Table 1 .
Characteristics of the meshes.Rectangular analogs refer to Mercator-type grids with a similar number of wet points.35 km in the equatorial belt.There is also a slight increase in resolution near coastal regions, and most of the ocean interior is resolved by elements sized around 1 • .The second mesh, referred to as fArc (FESOM Arctic), aims to better represent circulation in the Arctic Ocean while maintaining a relatively low resolution in the rest of the ocean.It is similar to CORE2 over most of the ocean but is refined down to 4.5 km in the Arctic.It contains about 0.6 million surface ver- Storch et al., 2012)ss than the number of wet vertices in a typical Mercator 1/4 • mesh.The third mesh is the 1/10 • mesh based on the one used in the STORM project (e.g., see vonStorch et al., 2012)with surface quads split into two triangles and land quads removed.It contains ca.5.6 million wet surface vertices.

Table 2 .
Overview of the systems used in the scaling experiments.