Interactive comment on "Scalability and some optimization of the Finite-volumE Sea ice-Ocean Model, Version 2.0 (FESOM 2)"

A study of the scalability of the Finite-volumE Sea ice-Ocean circulation Model, Version 2.0 (FESOM2), the first mature global model of its kind formulated on unstructured meshes, is presented. This study includes an analysis of the main computational kernels with a special focus on bottlenecks in parallel scalability. Several model enhancements improving this scalability for large numbers of processes are described and tested. Model grids at different resolutions are used on four HPC systems with differing computation and communication hardware to demonstrate the model's scalability and throughput. Furthermore, strategies for improvements in parallel performance are presented and assessed. We show that, in terms of throughput, FESOM2.0 is on par with state-of-the-art structured ocean models and, in a realistic eddy-resolving configuration (1/10° resolution), can produce about 16 simulated years per day on 14 000 cores. This suggests that unstructured-mesh models are becoming extremely competitive tools in high-resolution climate modelling. It is shown that the main bottlenecks of FESOM parallel scalability are the two-dimensional components of the model, namely the computations of the external (barotropic) mode and the sea-ice model. It is argued that these bottlenecks are shared with other general ocean circulation models.


Introduction
Mesoscale eddies play a critical role in the general circulation of the ocean, and they impact marine life, including biogeochemical processes. Since these eddies are not resolved on meshes coarser than the local internal Rossby radius, their effect needs to be parameterized in terms of eddy-driven bolus transport (Gent and McWilliams (1990)) or mixing. While such coarse-mesh models have been a staple of climate research, there are numerous indications, summarized in Hewitt et al. (2017), that eddy-resolving meshes can enhance the realism of climate models by improving the simulated positions of major frontal systems and reducing surface and deep biases in climate simulations. With increasing computational resources becoming available to the community, running eddy-permitting (nominal resolution around 1/4°) and eddy-resolving ocean models becomes more and more feasible in climate research. However, existing models are still rather slow and require many months of wallclock time for century-scale climate simulations, making it prohibitive to use these configurations in many important applications. This […] cores with a new solver for the external mode (Huang et al. (2016)). Throughputs higher than these can still be reached, but would imply a very inefficient use of computational power.
The present manuscript considers the computational performance and parallel scalability of the unstructured-mesh Finite-volumE Sea ice-Ocean circulation Model (FESOM2.0, Danilov et al. (2017)). FESOM2 is based on FESOM1.4 (Wang et al. (2014)), the first mature general-circulation model on unstructured meshes developed for climate research applications, but uses a faster dynamical core. Recently developed large-scale ocean circulation models formulated on unstructured meshes, such as FESOM (Wang et al. (2014), Danilov et al. (2017)), MPAS (Ringler et al. (2013)), or ICON (Korn (2012)), make it possible to utilize more flexible meshes with variable resolution. However, compared to structured-mesh models, unstructured-mesh models pose two specific issues that need to be addressed.
- First, mesh partitioning in unstructured-mesh models is based on the analysis of the connectivity pattern between surface mesh cells or vertices; this method is agnostic of mesh geometry and allows one to work with only wet cells and vertices. FESOM uses the METIS software package (Karypis and Kumar (1998)) for this purpose. The number of neighboring parallel partitions is not defined a priori by a quadrilateral mesh geometry, as in many structured-mesh models, but may vary within certain limits during partitioning, especially when mesh fragments become small. Hence, the performance of this technology on large global ocean meshes with complex boundaries, and its implications for parallel communication, are an interesting topic to explore (a minimal partitioning sketch is given after this list).
- Second, one expects an unstructured-mesh code to be more expensive per degree of freedom than its structured-mesh counterparts due to the indirect addressing of degrees of freedom. However, for ocean meshes, this issue hardly causes any difficulties. In fact, due to the vertical alignment of 3D prisms within a column, the neighborhood information is propagated along the column and can be accessed efficiently. On the other hand, high-order advection stencils on an unstructured mesh tend to be more complex and expensive to evaluate. Additionally, a triangular (or dual hexagonal) surface mesh contains more edges than a quadrilateral mesh with the same number of vertices, making the computation of fluxes more expensive. This relative slowness can be compensated by better scalability, leading to an equal or even higher throughput. Indeed, the number of halo exchanges and the amount of information to be exchanged should be about the same for meshes of the same size run on the same number of cores, independently of their structure. Their relative expense, measured as the communication-to-computation ratio, will therefore be lower, permitting scaling down to smaller partitions. It is therefore expected that unstructured-mesh codes can offer a throughput very similar to that of structured-mesh models, and we see the substantiation of this statement as one of the main goals of this paper.
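For illustration, the following sketch shows how a connectivity-based partitioning of a surface mesh can be obtained. It uses the pymetis Python bindings to METIS rather than FESOM's own Fortran interface, and the toy adjacency list is made up for the example; it is a minimal sketch of the idea, not FESOM code.

```python
# Minimal sketch of connectivity-based mesh partitioning (not FESOM code).
# Assumes the pymetis bindings to METIS are installed; the adjacency list
# below is a toy example standing in for the surface-mesh connectivity.
import pymetis

# Each entry lists the neighbouring wet surface vertices of one wet vertex.
adjacency = [
    [1, 2],        # vertex 0 touches vertices 1 and 2
    [0, 2, 3],
    [0, 1, 3],
    [1, 2, 4],
    [3],
]

n_parts = 2
cut_edges, membership = pymetis.part_graph(n_parts, adjacency=adjacency)

# 'membership[v]' is the partition (MPI rank) owning surface vertex v;
# all 3D prisms below vertex v inherit this assignment in a column-wise model.
print(cut_edges, membership)
```

Note that only the connectivity graph enters the call; no mesh coordinates are needed, which is exactly why the method is agnostic of mesh geometry and works with wet points only.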
In the framework of the present study, we complement the analysis of scalability and throughput of FESOM2 as a whole by data illustrating the performance of the main two- and three-dimensional computational kernels of the model. The main scalability bottlenecks, such as the sea-ice and the solver for the external mode, are subject to an in-depth parallel performance analysis; the realized and prospective strategies to improve their performance are presented. As an additional technology promising an improvement in the modularity and flexibility of the main FESOM2 computational kernels, and in their mapping to current and prospective HPC systems, we introduce hierarchic mesh partitioning and discuss its possible uses in various contexts.
The paper is organized as follows. We start with a general description of the model and of its solution algorithm, briefly introduce the meshes for the test problems, and give some relevant information on the HPC systems employed in our study. In Sec. 3, the parallel performance results on our test systems are presented separately for the whole model and for its main computational kernels, similar to the approach used in Reuter et al. (2015). Sec. 4 includes some sample model analysis plots produced using Intel Trace Analyzer and deals with performance enhancements implemented in FESOM2.0. A number of additional tests illustrating various aspects of the model's performance constitute Sec. 5. A discussion of results and strategies for future improvements of the parallel scaling of FESOM2 is presented in Sec. 6, followed by a brief conclusions section.
2 Description of the model and test setups

2.1 Governing equations and solution procedure

FESOM (Danilov et al. (2017), Scholz et al. (2018)) is a general ocean circulation model solving the primitive equations in the Boussinesq, hydrostatic, and traditional approximations. It is formulated on unstructured triangular meshes with scalar degrees of freedom placed at vertices and horizontal velocities at triangles. Because of the dominance of hydrostatic balance, the vertices are aligned in the vertical with the surface vertices, making the treatment of the vertical direction similar to that in structured-mesh models. The vertical coordinate is realized using the arbitrary Lagrangian-Eulerian (ALE) methodology, supporting different choices of model layers. The Finite Element Sea-Ice Model (FESIM, Danilov et al. (2015)) is included in FESOM as a set of subroutines. It solves the modified elastic-viscous-plastic (mEVP) dynamical equations, allowing to reduce the number of subcycling steps without compromising numerical stability (Kimmritz et al. (2017); Koldunov et al. (2019)). The surface mesh partitioning is carried out using the METIS software package. The sea-ice model is run on the same partitioning as the ocean model and is called at each step of the ocean model. Although this may lead to some unnecessary overhead in sea-ice free regions, especially in global setups, it makes the exchange between the sea-ice and ocean components trivial.
Next, we describe the time step structure to the extent necessary for the discussion of scalability. […] The description above does not include model input and output (I/O), which can be time consuming if done too frequently and represents a major scalability bottleneck of its own.
From the standpoint of the model's parallel scalability, the procedures mentioned above can be split into two classes. The first class includes the essentially two-dimensional computational parts, such as the SSH solver (Appendix A1 and Appendix B) and the sea-ice dynamics (Appendix A2). The mEVP sea-ice solver in FESOM (Danilov et al. (2015)) commonly carries out about one hundred shorter time steps within a time step of the ocean model, each followed by a halo exchange for the two-dimensional sea-ice velocities. The total amount of information being exchanged is comparable to just a single three-dimensional halo exchange, yet there are many such exchanges. The SSH solution is obtained by using an iterative solver utilizing pARMS. The other class includes the remaining routines. Most of them are three-dimensional and involve either no or only one halo exchange per ocean time step. The tracer advection presents the exception: it may need additional communications depending on the order of the approximation and on whether or not the flux-corrected transport (FCT) procedure is applied. However, in such cases, the amount of numerical work also increases accordingly, so the code scalability should not be influenced in a major way. One more subroutine containing additional halo exchanges is the horizontal viscosity, which also needs some smoothing if the Leith parametrization is selected.
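To make this communication pattern concrete, the sketch below mimics the structure of the sea-ice substepping loop with a small non-blocking 2D halo exchange after every substep. It is written with mpi4py purely for illustration; the names update_velocity and neighbors are placeholders and not FESOM routines.

```python
# Sketch of the mEVP substepping communication pattern (illustrative only;
# update_velocity and neighbors are placeholders, not FESOM code).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

n_evp = 100                      # number of mEVP substeps per ocean time step
halo_size = 64                   # made-up halo width for the 2D velocity
neighbors = []                   # ranks of adjacent partitions (from METIS)

u = np.zeros(1000)               # local 2D sea-ice velocity degrees of freedom

def update_velocity(u):
    # stand-in for one mEVP velocity update on the local partition
    return u

for substep in range(n_evp):
    u = update_velocity(u)
    # one small 2D halo exchange per substep: little data, many messages
    reqs = []
    for nb in neighbors:
        reqs.append(comm.Isend(u[:halo_size], dest=nb, tag=substep))
        recv_buf = np.empty(halo_size)   # a real code copies this into the halo
        reqs.append(comm.Irecv(recv_buf, source=nb, tag=substep))
    MPI.Request.Waitall(reqs)    # latency is paid n_evp times per ocean step
```

The total data volume summed over all substeps is comparable to a single 3D halo exchange, but the per-message latency is paid roughly n_evp times per ocean step, which is why this kernel stops scaling first.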
In the configuration used here to study the scalability, the third/fourth-order transport algorithm with FCT is used. It involves four 3D halo exchanges per time step. The Leith parametrization is used as well, so the number of 3D halo exchanges is at its maximum. The scaling of the SSH solver on both machines practically stagnates at some point (after 1152 cores on JUWELS and 1728 cores on Mistral). The sea-ice model scaling is not linear, but the absolute values of the wallclock time still improve almost until the end. The fArc mesh is focused on the Arctic Ocean; therefore the sea-ice computations initially (on 144 cores) occupy about 16% of the total time, while the SSH computations require only 6%. At the largest number of cores (6912), they already occupy 33% and 28%, respectively. The scaling of the entire model starts to deviate considerably from the linear behavior once the SSH and sea-ice calculations become more expensive than the 3D part of the code.

Test cases and HPC systems used in the study
The scaling results for the STORM mesh (5.6M vertices) are shown in Fig. 3. On Mistral, our scaling experiments could utilize up to 50,688 cores (by means of a special reservation), while on JUWELS at most 12,288 cores were available, and all tests were performed using the general queue. The scaling behavior of the STORM mesh is similar to that of fArc: the 3D parts scale practically linearly down to about 100 vertices per core, whereas the SSH calculations and the sea-ice model present scalability bottlenecks. After 11,520 cores on Mistral, the SSH runtime stays around 13-14 seconds but, surprisingly, does not get worse. On JUWELS, the SSH stops scaling much earlier, already at 5,760 cores, and oscillates around a runtime value of 15 seconds. The sea-ice on Mistral continues to scale until about 41,472 cores, but the scaling is clearly suboptimal; on JUWELS, the runtime for the sea-ice kernel continues to improve until the maximum number of cores is reached. In the runtimes for both 2D code parts (sea-ice and SSH), we also clearly notice much stronger oscillations, which indicate sensitivity to the state of the communication hardware and software of the corresponding HPC system.
The STORM mesh experiments on Mistral were conducted in two separate series: up to 18,432 cores in the general queue without any special job scheduler configuration; larger jobs were run using a special reservation of 1,500 nodes. This difference in the experiment setting might be responsible for the sudden oscillations in the timings for the SSH solver and for some other, otherwise perfectly scalable, routines such as the ocean mixing and the computation of the pressure gradient (not shown). After these oscillations, the total time continues to improve until the maximum number of cores, even though it does not return to the linear scaling path anymore.
We use the smaller CORE2 mesh to demonstrate the model behavior when fewer than 100 vertices per core are used (Fig. 4).
The combination of an increase in the runtime of the 2D part of the model and a continued decrease of the 3D part results in almost constant numbers for the total runtime. The good scaling of the FESOM2.0 3D part leads us to expect a relatively easy transfer of the FESOM2.0 dynamical core to systems with low memory per computational core (e.g., GPUs).
The standard configurations of FESOM2 have 47 unevenly spaced vertical z-levels, which is nearly sufficient to resolve the first baroclinic mode. For setups with high horizontal resolution, it is beneficial to increase the number of vertical levels to better resolve the vertical structure of horizontal motions in the ocean (e.g., Stewart et al. (2017)). In order to better understand the effect of an increased number of vertical layers on the model's scaling, we set the number of vertical levels to 71, following the recommendations described in Stewart et al. (2017). This compares well with typical numbers of layers used in high-resolution model setups. Experiments were performed on Mistral with the fArc mesh, and all settings were the same as in the scaling runs in Sec. 3. Just as expected, with an increased number of vertical levels, the scaling improves, while the model becomes slower (Fig. 5). Deviations from the linear scaling are still observed, but the decline is smaller for 71 vertical layers than for 47.
Increases in the number of vertical levels only affect the 3D part of the model by giving each compute core more local work, while the poorly scalable 2D parts are not affected by the change in the vertical layering. The number of communication calls does not change, although there is an increase in the volume transmitted. To sum it up: total runtimes for FESOM2.0 scale linearly until about 300-400 vertices per core and then start to deviate from the linear behavior. The computationally significant 3D parts of the model scale almost linearly to much lower numbers of vertices per core (at least 100), but the main 2D components (SSH and sea-ice) represent the scaling bottlenecks and, by occupying an ever growing fraction of the total runtime, eventually lead to a stagnation and later on to a deterioration in the runtimes with increasing numbers of compute cores. In practice, for large meshes such as STORM, the limits of scalability are hardly ever reached with the HPC resources currently available for production runs (e.g., 256 compute nodes on JUWELS and 512 compute nodes on Mistral).
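As a quick back-of-the-envelope illustration of what these thresholds mean in practice, the following snippet translates the surface vertex counts quoted in this paper and the empirical limits above into approximate core counts; it is only illustrative arithmetic, not a measured result.

```python
# Translate vertices-per-core limits into rough core counts,
# using the surface vertex counts quoted in the text.
meshes = {"CORE2": 0.13e6, "fArc": 0.64e6, "STORM": 5.6e6}

for name, n_vertices in meshes.items():
    cores_total_limit = n_vertices / 350   # ~300-400 vertices/core: whole model
    cores_3d_limit = n_vertices / 100      # ~100 vertices/core: 3D kernels only
    print(f"{name}: whole model ~{cores_total_limit:,.0f} cores, "
          f"3D part ~{cores_3d_limit:,.0f} cores")
```

For STORM this gives roughly 14,000-19,000 cores before the whole model leaves the linear regime and about 56,000 cores for the 3D part alone, consistent with the experiments described above.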
In the following sections, we describe how the above runtimes translate into operational throughputs, discuss several algorithmic and technical enhancements used to reach the current level of scalability, and propose some measures that have the potential to further improve model performance.

- First, the sea-ice step with 100 iterations on the 2D mesh performs one halo exchange in every iteration (lines 5, 6 in Algorithm 1). As the CORE2 mesh has no focus on polar regions, most compute cores idle in the MPI_Wait of the halo exchange (light blue in Fig. 6). The zoom in Fig. 6 reveals this heavy load imbalance and also shows the small 2D halo exchanges that are implemented with non-blocking MPI_Irecv and MPI_Isend calls (green), overlapped with a small part of independent calculations (blue), and finalized with MPI_Waitall (light blue).
The zoom into the sea-ice step (bottom right of Fig. 6; above for 36 cores, below for 72 cores) shows that the computational load scales linearly, but the halo exchange gets more and more expensive.
At the end of the sea-ice step, the global fresh water balance (line 7 in Algorithm 1) is enforced using MPI_Allreduce (red strip in Fig. 6).
- The third part is the 2D solver for the SSH (lines 12-14 in Algorithm 1), dominated by frequent small 2D halo exchanges and global sums (dark vertical strip in Fig. 6). On a small number of cores, the time spent in the solver is low, but it dominates the massively parallel runs. We will go into more detail concerning the SSH solver in Sec. 4.2.
- Finally, the pattern of the compute-intensive 3D part is visible again for the horizontal velocity corrector, the ALE and vertical velocity step, as well as for the tracer advection and diffusion (lines 15-17 in Algorithm 1).
The results shown in Fig. 6, in particular the negligible amounts of waiting time, illustrate our claims about the highly efficient (in terms of parallel scaling) 3D parts of the FESOM2 code, but also explain the suboptimal scaling of both main 2D computational kernels, the sea-ice and the SSH. Even for rather low numbers of cores, the explicit iterative method for the sea-ice described in Appendix A2 hardly contains enough arithmetic operations to amortize even the non-blocking MPI communications involved in 2D halo exchanges. The situation only gets worse when the number of cores is increased and the communication-to-computation ratio falls even lower. The analysis of the problems with the SSH scaling and our methods to deal with them are the subject of the next section.

In the classical BiCGStab algorithm (Algorithm 2), each iteration contains three blocking MPI_Allreduce calls computing the global sums for scalar products (see Fig. 8 (top)). The first two calls, performed after the matrix-vector product and the preconditioner step, cause increased waiting times. In addition to the blocking behavior, the MPI_Allreduce itself becomes expensive at higher core counts. Another limiting factor is the convergence of the RAS preconditioner, whose efficiency depends on the sizes of the parallel partitions; as a consequence, the iteration count increases slowly with the number of compute cores (Fig. 10).
After searching for alternatives, we implemented the pipelined BiCGStab algorithm (Algorithm 3) proposed in Cools and Vanroose (2017), which resulted in an improved scaling behavior (see Fig. 7), especially at high core counts. As opposed to the classical BiCGStab algorithm, the pipelined version replaces the three blocking MPI_Allreduce calls by two non-blocking communications, overlapped with the computations of the matrix-vector product and the preconditioner (see Fig. 8 (bottom)). Since the pipelined version involves more arithmetic operations for vector updates, it may be inferior to the original BiCGStab method at lower core counts. Also, differences in the HPC hardware play a substantial role in the comparative performance of both algorithms, as clearly illustrated in Fig. 7.
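The essential difference between the two variants is how the global sums are scheduled. The following sketch contrasts a blocking reduction with a non-blocking one overlapped with a local matrix-vector product, using mpi4py; it is only a schematic of the communication pattern, not the actual Algorithms 2 and 3 of this paper, and matvec is a placeholder.

```python
# Schematic contrast of blocking vs. overlapped global sums (not Algorithms 2/3).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
x = np.random.rand(1000)          # local part of a distributed vector
y = np.random.rand(1000)

def matvec(v):
    # placeholder for the local part of the SSH matrix-vector product
    return 2.0 * v

# Classical pattern: every scalar product is a blocking Allreduce,
# so all ranks wait for the global sum before the next computation starts.
local_dot = np.array([np.dot(x, y)])
global_dot = np.zeros(1)
comm.Allreduce(local_dot, global_dot, op=MPI.SUM)
z = matvec(x)

# Pipelined pattern: start the reduction, do the matrix-vector product
# while it is in flight, and only then wait for the global result.
local_dot = np.array([np.dot(x, y)])
global_dot = np.zeros(1)
req = comm.Iallreduce(local_dot, global_dot, op=MPI.SUM)
z = matvec(x)                     # useful work hides the reduction latency
req.Wait()
```

The price of this overlap in the real pipelined BiCGStab is a reorganized recurrence with extra vector updates, which is why the classical variant can still win at low core counts where the reductions are cheap.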

Speeding up the sea-ice model
The equations of sea-ice dynamics with the traditional viscous-plastic rheology and elliptic yield curve (A2)-(A4) are very stiff and would require time steps on the level of a fraction of a second if computed explicitly (see, e.g., Kimmritz et al. (2017) for a brief summary). For this reason, they are solved either with an iterative solver or explicitly, using pseudo-elasticity to relax the limitation on a short time step, as proposed by Hunke and Dukowicz (1997). The latter approach is called elastic-viscous-plastic (EVP) and requires N_EVP ≥ 100 substeps within the external step of the sea-ice model. Other components of the sea-ice model, such as the advection or the computation of thermodynamic sources and sinks, are advanced at the time step of the ocean model and are much less demanding. Since the EVP approach is explicit, it has to satisfy stability limitations, as discussed by many authors beginning with the original paper by Hunke and Dukowicz (1997). It turns out that, as the mesh resolution is refined, N_EVP must be increased to maintain stability and becomes prohibitively large. The violation of stability leads to noise in the derivatives of the sea-ice velocity. Although this noise may stay unnoticed by users, it affects the dynamics. An essential measure taken in FESOM to improve scalability is to use the modified EVP approach (mEVP), as described in Danilov et al. (2015). As opposed to the traditional EVP approach, the mEVP approach, based on a suggestion by Bouillon et al. (2013), splits the issues of numerical stability and convergence. This solver is always stable, and the number of substeps N_EVP determines its convergence to the viscous-plastic rheology. In practice, it turns out that mEVP produces practically acceptable solutions in the regime with relatively small N_EVP that corresponds only to an initial error reduction. N_EVP is found experimentally by starting simulations from values greater than the stability parameters α and β in (A7)-(A10) and reducing them to the minimum values that lead to results practically indistinguishable from simulations performed with a large N_EVP, as explained in Koldunov et al. (2019). We use N_EVP = 100 for both the fArc and STORM meshes. These values are already much lower than the values needed for the standard EVP, and, in this sense, the sea-ice model is already optimal. The sea-ice model uses only local communications, but their number is proportional to N_EVP per external time step.

Estimates obtained for one-year runs are calculated without accounting for the initial model setup (reading mesh, restart files, etc.) needed only once per run. Also, the fraction of the total runtime (excluding I/O) spent in the SSH solver is greater for one-year simulations than for scaling experiments: for STORM on 13824 cores, it is 15% in scaling runs vs. 59% of the (corrected for I/O) time in one-year experiments. For fArc on 6912 cores, it is 31% and 52%, respectively. The reason is an increase in the iteration number when the model is running in a more realistic setting compared to scaling experiments started from a cold start with a relatively smooth SSH field (Fig. 10). A more developed SSH field, especially in the high-resolution parts of the model, requires more solver iterations for convergence of the SSH solver. The increase in the number of iterations between cold-start and restart simulations is more pronounced for the fArc mesh, since this mesh has extended areas with high (4.5 km) resolution. The real model performance depends on many factors, including I/O, the state of the machine, the number of vertical levels, the time step, and other details of the model configuration. Therefore, any timings produced in idealized settings must be taken with a grain of salt when model throughputs are the focus of a discussion. Even harder is to compare the throughput of different models to each other, especially when not all details are available. Nonetheless, an attempt to make such a comparison is shown in Table 3; it should be treated more as a qualitative estimate. Numbers for the FESOM2 configurations are taken from one-year simulations. This comparison shows that, once the differences in the number of model levels, computational resources, time steps, and mesh size are accounted for, the FESOM2 throughput is easily on par with that of well-established structured models, confirming our claim that unstructured-mesh models can be about as fast as conventional structured-mesh models and thus represent an efficient tool for ocean and climate studies. Considering overall model scalability, there is still room for improvement, especially for greater numbers of computational cores, by transitioning to parallel I/O.

Table 3. Throughput of different models at selected configurations comparable to the FESOM2 fArc and STORM meshes.

Hierarchic mesh partitioning
The standard way of performing mesh partitioning in FESOM2 relies on the METIS graph partitioning package (Karypis and Kumar (1998), currently version 5.1.0) and utilizes a dual-weighted load balancing criterion based on the numbers of 2D and 3D mesh vertices. This approach works well for moderate numbers of parallel partitions but tends to produce some undesirable artifacts for large numbers of partitions: isolated vertices, partitions containing vertex groups widely separated in the computational domain, non-contiguous partitions, etc.
As a simple but elegant way to remedy some of these deficiencies, a backward compatible wrapper for METIS has been implemented that allows one to perform the mesh partitioning in a hierarchic fashion (see Fig. 11 (left)). The procedure starts from the coarsest level, e.g., producing one partition per networking switch. Then METIS is called recursively for each partition on the current level until the lowest level (usually that of a single core) has been reached. Since the performance of METIS for small numbers of partitions is usually excellent, this approach guarantees that each coarse partition only contains contiguous vertices, thus potentially improving the mapping of the computational mesh onto the topology of the compute cluster. This methodology is certainly not entirely new (see, e.g., the method called nested partitioning in Sundar and Ghattas (2015)); however, neither implementations of this idea as a simple METIS wrapper nor applications of this technique to ocean modeling could be found in the literature.
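The recursion itself is straightforward; the sketch below outlines the idea of calling a partitioner level by level. The function partition_graph stands in for a METIS call (e.g., via pymetis as above), and the level sizes are made-up examples, so this is a conceptual outline rather than the actual FESOM wrapper.

```python
# Conceptual sketch of hierarchic partitioning: partition coarsely first
# (e.g., one part per switch), then re-partition each part for nodes and cores.
# 'partition_graph' stands in for a METIS call; the levels are made-up examples.

def partition_graph(vertices, adjacency, n_parts):
    # placeholder: returns a part index for each vertex in 'vertices'
    return [i % n_parts for i, _ in enumerate(vertices)]

def hierarchic_partition(vertices, adjacency, parts_per_level):
    """Recursively split 'vertices' according to parts_per_level,
    e.g. [n_switches, nodes_per_switch, cores_per_node]."""
    if not parts_per_level:
        return [vertices]                      # reached the single-core level
    membership = partition_graph(vertices, adjacency, parts_per_level[0])
    leaves = []
    for part in range(parts_per_level[0]):
        sub = [v for v, m in zip(vertices, membership) if m == part]
        sub_set = set(sub)
        # restrict adjacency to the sub-partition and recurse one level down
        sub_adj = {v: [n for n in adjacency[v] if n in sub_set] for v in sub}
        leaves.extend(hierarchic_partition(sub, sub_adj, parts_per_level[1:]))
    return leaves

# Example: 2 switches, 3 nodes per switch, 4 cores per node -> 24 core partitions
surface_vertices = list(range(240))
adjacency = {v: [(v - 1) % 240, (v + 1) % 240] for v in surface_vertices}
core_partitions = hierarchic_partition(surface_vertices, adjacency, [2, 3, 4])
```

Because the partitioner only ever sees small part counts at each level, each coarse partition stays contiguous, which is exactly the property exploited for mapping the mesh onto the cluster topology.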
The simulation runs with the Core2 mesh on Emmy (Fig. 11 (right)) appear to show a small edge for the hierarchic partitioning in cases where the amount of work per partition becomes small; however, runs on HPC systems with higher-end network hardware such as Ollie or JUWELS show no discernible differences to the non-hierarchic partitioning. We attribute this discrepancy to the ability of new communication hardware, with its superior efficiency and lower latency, to hide deficiencies in the mesh partitioning, whereas the older network hardware on Emmy is more sensitive to the placement of grid vertices on parallel partitions. A similar situation, however on a much higher level, may also arise in the future if the communication efficiency of future HPC systems does not keep pace with the increases in computational performance (e.g., rises in the number of compute cores) of compute nodes.
In addition to improving the mapping of the partitioning topology to the topology of the HPC system, the hierarchic partitioning capability is intended to provide a simple interface for placing poorly scalable computational kernels (e.g., I/O or sea-ice) at a certain mesh hierarchy level, with a natural way to supply clearly defined relationships between partitions on different levels. By exploiting this feature, we hope to be able to improve the computational efficiency of kernels currently bound by the I/O or communication bandwidth and also to provide interfaces to hardware accelerators such as GPUs or FPGAs.

Improving the network performance by underpopulating compute nodes
A simple but often effective way to improve the network performance on current-generation network hardware such as Intel Omni-Path is to leave one core per node off the regular compute job assignment. Indeed, our experiments on Ollie and Mistral (not shown) demonstrated little difference between fully assigned and underpopulated nodes (and sometimes even slightly better SSH timings for underpopulated nodes). In this way, in addition to networking duties, other loads (e.g., I/O) currently distributed over all compute cores can be offloaded to these idle cores.

Iterative solvers based on the preconditioned conjugate gradient method or other Krylov subspace methods need global communications, which is the reason why many present-day models opt for split-explicit subcycling to resolve surface gravity waves explicitly. The advantage of using a linear solver for the SSH is its algorithmic simplicity and, in the case of FESOM, enhanced stability on arbitrary meshes (less control of mesh quality and smoothness is needed). Many possible subcycling algorithms have been discussed in the literature (see, e.g., an analysis in Shchepetkin and McWilliams (2005)). We plan to include the subcycling option in FESOM in the future, but, judging from the scalability of the purely explicit sea-ice solver, only partial improvements can be expected. Indeed, the scalability of an explicit sea-ice vs. an implicit SSH solver is machine dependent (see, e.g., the STORM results for the sea-ice and the SSH on Mistral and JUWELS in Fig. 3) and is also never optimal.
However, even small improvements in scaling, at least on some machines, might be sufficient to make partitions as small as 200 surface vertices per core practicable. We will report on our results in due time.
Alternatively, one may follow the suggestion of Huang et al. (2016) and switch to a solver based on a Chebyshev method.
The method of Huang et al. (2016) is based on an algorithm without global sums and allows one to better overlap computation with communication. A caveat here is that the unstructured meshes used by FESOM may cause the SSH system matrix to be more poorly conditioned than in Huang et al. (2016), since mesh elements can differ in size by a factor of 100 or greater. It remains to be seen whether this method is able to compete with the split-explicit approach. Another simple optimization avenue presents itself via a renumbering of mesh vertices: in the solver package pARMS, the arrays are permuted to have the inner vertices first, followed by the vertices that have to be sent to the neighbouring partitions, and, last, the vertices to be received; this allows one to exchange the halo while performing computations on the inner vertices. The same strategy can easily be extended to the FESOM2 sea-ice model and to all other halo exchanges.
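The following sketch illustrates this computation-communication overlap for a vertex ordering with inner vertices first. It uses mpi4py for illustration only; the update kernel and the partition sizes are placeholders, not FESOM or pARMS routines.

```python
# Sketch of overlapping a halo exchange with work on inner vertices
# (placeholder kernel and sizes; ordering: inner, then send, then receive).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

n_inner, n_send, n_recv = 900, 60, 40      # made-up partition layout
field = np.zeros(n_inner + n_send + n_recv)
neighbors = []                             # adjacent ranks from the partitioning

def update(values):
    return values + 1.0                    # stand-in for local computations

# start the non-blocking halo exchange for the boundary strip
# (a real code gives each neighbour its own send/receive slice)
reqs = []
for nb in neighbors:
    reqs.append(comm.Isend(field[n_inner:n_inner + n_send], dest=nb))
    reqs.append(comm.Irecv(field[n_inner + n_send:], source=nb))

# compute on inner vertices while the messages are in flight
field[:n_inner] = update(field[:n_inner])

# wait for the halo, then finish the vertices that depend on remote data
MPI.Request.Waitall(reqs)
field[n_inner:n_inner + n_send] = update(field[n_inner:n_inner + n_send])
```

The benefit depends on how much inner work is available to hide the messages; for the very thin 2D sea-ice and SSH kernels this is precisely what is in short supply.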

Sea-ice model strategy
The sea-ice model is linked less directly to the ocean model than the internal mode, and the simplest choice to reduce its cost would be to call it only every second or every third time step of the ocean model. Eddy-resolving codes are usually run with a time step smaller than 10 min, and the sea-ice state does not change much over this time interval. However, increasing the external sea-ice step has implications for stability. Theoretically, running the mEVP with a doubled external time step would require a √2 increase in the stability parameters, with corresponding changes in N_EVP. Thus, in the end, one would expect only a √2 reduction in the time spent on sea-ice computations, yet this has to be carefully tested. One more caveat is the floating (as opposed to levitating) sea-ice option, which might require further adjustments.
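A compact way to see where the √2 comes from is sketched below; it assumes that the stability parameters grow with the square root of the sea-ice time step, as implied by the stability argument above, and that N_EVP scales linearly with them, which is an assumption made only for the purpose of this estimate.

```latex
% Cost estimate for calling the sea-ice model every second ocean time step.
% Assumptions: \alpha,\beta \propto \sqrt{\Delta t_{\mathrm{ice}}} and
% N_{\mathrm{EVP}} \propto \alpha.
\Delta t_{\mathrm{ice}} \to 2\,\Delta t_{\mathrm{ice}}
\;\Rightarrow\; \alpha,\,\beta \to \sqrt{2}\,\alpha,\;\sqrt{2}\,\beta
\;\Rightarrow\; N_{\mathrm{EVP}} \to \sqrt{2}\,N_{\mathrm{EVP}},
\qquad
\mathrm{cost} \;\propto\;
\underbrace{\tfrac{1}{2}}_{\text{half as many calls}}
\cdot \underbrace{\sqrt{2}\,N_{\mathrm{EVP}}}_{\text{more substeps per call}}
\;=\; \tfrac{1}{\sqrt{2}}\,N_{\mathrm{EVP}}.
```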
A more dramatic change would be to run the sea-ice synchronously with the ocean on a separate set of cores, using the same ocean mesh partitioning or partitions obtained by hierarchically combining several ocean partitions to simplify the ocean-sea-ice communication. On global meshes, where the domain covered with sea-ice occupies only a fraction of the total area, the sea-ice model would require only a small number of additional cores. In this approach, one would tune the number of cores for the sea-ice to achieve run times not exceeding those of the ocean model. Clearly, the scalability of the sea-ice model would still remain a bottleneck; however, the overall scalability would be substantially improved by running the components with poor scalability in parallel instead of one after another. In addition, this technique can optimize the utilization of computational resources, since the cores currently idling due to the communication-bound sea-ice model can be used for the well-scalable ocean model. The drawback of this approach is a more complicated code design and the need for a smarter parallel load-balancing algorithm. It can be simplified by using couplers, but this may come with its own issues regarding scalability.

Need for more efficient 2D computations
We saw above that, independent of the solution method (iterative solver or subcycling), the two-dimensional parts of the ocean code present a challenge on the road to better parallel scalability. While the measures briefly discussed above can be helpful, we do not expect them to deliver the optimal scalability down to fewer than 100 surface vertices per core seen for the 3D part of the code. The open question is whether new hardware architectures can offer better options. The problem with the external mode and sea-ice solvers lies in too frequent but short communications.
In this regard, the concept of dwarfs, also known as Berkeley dwarfs (see Asanović et al. (2006)) and further developed and customized for weather and climate applications in Lawrence et al. (2018), might be useful. The main idea behind this concept is to capture the essential characteristics of a model part or a parametrization in a stand-alone piece of software that can be run, analyzed, ported, and optimized on its own. In HPC-speak, a dwarf is a computational kernel enhanced by the ability to run without the remainder of the model and whose performance is critical to that of the whole model. By utilizing dwarfs extracted from the problematic 2D kernels, a number of numerical, algorithmic, and computational methodologies can be tested and optimized in the search for better solutions.

Conclusions
We systematically explored the scalability of FESOM2 for a set of meshes (0.13M, 0.64M, and 5.6M surface vertices, 47 layers) on several HPC systems (up to 50 000 compute cores) and found a nearly linear scalability of the FESOM2 code down to the limit of 300-400 surface mesh vertices per core. Our numerical experiments indicate that the 3D parts of the code scale almost optimally down to 100 surface mesh vertices per core. As for other large-scale ocean circulation models, the factor limiting the scalability of FESOM2 is its two-dimensional part, represented by the solver for the sea surface height and the sea-ice model. Since the sea-ice model uses explicit time stepping and local communications, it generally demonstrates better scalability than the solver part; however, this behavior is machine dependent and never reaches the optimal level of scaling.
Our results also allow us to conclude that the technology of mesh partitioning relying on the connectivity pattern and based on the METIS software package works well in the case of FESOM2 on large meshes and down to very small partitions. While the hierarchic mesh partitioning proposed here does not generally improve the scalability, it lays the basis for future model improvements, especially concerning the sea-ice model, I/O, and the SSH computation. This methodology can also become useful for the parallel post-processing of unstructured-model data using approaches presented in Rocklin (2015) and Hoyer and Hamman (2017).
The analysis of scalability was complemented by an analysis of model throughput. We show that FESOM not only scales well but also reaches throughputs that are very comparable to those of structured-mesh models given the same computational resources and similar meshes. They can be even higher because of the better scalability for high-resolution configurations. Since the new unstructured-mesh large-scale ocean model developments (FESOM, MPAS, ICON) are all based on the finite-volume methodology, similar behavior can be expected in all these cases. In summary, therefore, we see the following statement as the main message to the oceanographic community: present-day unstructured-mesh global ocean circulation models have reached a level of maturity where their computational speed is no longer a limiting factor.
While the result that the external mode and the sea-ice model currently limit the parallel scalability (with further practical complications arising from the implementation of the I/O) did not come as a surprise, the lesson learned from the analysis presented in this study is the extent of the problem. Given that the computational resources available for most current long-term simulations are in many cases still limited to 5000-10000 cores or less, the parallel scalability is only beginning to emerge as a major issue on large (1/10° or finer) meshes. However, the sensitivity of the iterative SSH solver to the current state of the simulated ocean dynamics is a serious warning, especially given the focus on locally refined meshes intended to simulate eddy-rich dynamics. The suboptimal scaling of the sea-ice, combined with the sequential arrangement of the sea-ice and ocean steps, results in an inefficient utilization of computational resources and indicates clear directions for improvement. This, together with better scalable parallel I/O, marks the directions for future model code development to enable high-resolution climate simulations with reasonable throughputs.
In order to better prepare the ocean modelling community for the challenges that come with extreme-scale computing, more effort should be dedicated to an in-depth scalability analysis of computational cores in different ocean models and to a careful intercomparison of ocean model throughputs. This can only be done in a coordinated manner, similar, for example, to Balaji et al. (2017).

Figure 1. Resolution maps for the meshes used in our experiments: CORE2 (left), fArc (middle), and STORM (right). The total number of surface vertices for each mesh is displayed at the top.

Figure 2. Scaling results for the fArc mesh on Mistral (DKRZ Hamburg, top left), JUWELS (JSC Jülich, top right), Emmy (RRZE Erlangen, bottom left), and Ollie (AWI Bremerhaven, bottom right) compute clusters. The black line indicates linear scaling; the orange line (and the number labels) give the mean total computing time over the parallel partitions.

Figure 4. Scaling results for the CORE2 mesh on Emmy (RRZE Erlangen) and Ollie (AWI Bremerhaven) compute clusters. The black line indicates linear scaling; the orange line (and the number labels) give the mean total computing time over the parallel partitions.

4 Parallel scaling analysis and techniques to improve parallel scalability

4.1 Parallel code analysis using Intel Trace Analyzer

Scaling tests carried out in Sec. 3 show that the major 3D computational parts scale very well, while the scaling of the main 2D kernels, namely of the sea-ice and even more of the SSH solver, is substantially worse. To analyze and explain these scaling properties, we visualize the communication patterns of FESOM2 with the Intel Trace Analyzer (ITAC) for a full FESOM2 time step on the CORE2 mesh with 36 (Fig. 6 (top)) and 72 (Fig. 6 (bottom)) MPI tasks on Ollie (AWI Bremerhaven). The different types of computation can easily be distinguished in the ITAC graph.

Figure 6. One FESOM2 time step on the CORE2 mesh with 36 MPI tasks (above) and 72 tasks (below), visualized in Intel Trace Analyzer. Zoom-ins into iterations of the sea-ice model (bottom right) show the load imbalance between processes with differing amounts of sea-ice per process and illustrate the reasons for a suboptimal parallel scaling: with twice the number of cores, the computation time (blue) for parallel partitions fully covered by sea-ice halves; however, the halo exchange time (green) increases.

Figure 8. Above: one iteration of the classical BiCGStab scheme (Algorithm 2) in the Intel Trace Analyzer. Observe the blocking due to computing global sums in scalar products via MPI_Allreduce. In addition, communication is needed for halo exchanges, which often overlap with independent computations. Below: one iteration of the pipelined BiCGStab scheme (Algorithm 3). The global sums are now initiated by a non-blocking call to MPI_Iallreduce and finished by MPI_Wait. In between, major computations such as a matrix-vector product and a preconditioning step are performed, also including halo exchanges.

Throughput in an operational configuration

The scaling tests are by design performed in a somewhat idealized setting that allows minimizing the effect of factors not related to pure computation. However, in an operational setting, aspects such as the I/O frequency, the dependence of the SSH solver on the model state, and model diagnostics might require additional computations. In Fig. 9, the throughput estimates in simulated years per day (SYPD) are plotted for selected scaling results from Sec. 3 against estimates based on one-year computations with I/O on Mistral conducted in an operational configuration. The one-year experiments were started using the restart files produced by one year of model spin-up. After one year of simulation time, the model dynamics is usually well developed, and the time step sizes as well as the velocities have values typical for production runs. The restart files were written out once per year, and the standard model output was also switched on, namely monthly 3D fields of temperature, salinity, and the three velocity components, and monthly 2D fields of sea surface height, sea surface temperature, and sea surface salinity, as well as the sea-ice fields: concentration, thickness, and two components of velocity. Estimates obtained from scaling experiments are calculated with the assumption of using the same time step as in the one-year runs (15 minutes for fArc and 10 minutes for STORM).

Figure 9. Model throughput in simulated years per day (SYPD) estimated from scaling experiments and measured in one-year experiments with I/O for fArc (left) and STORM (right).

Figure 10. Number of SSH solver iterations for cold-start and restart simulations performed from the restart obtained after one year of simulations for the fArc and STORM meshes on Mistral (DKRZ Hamburg).

Figure 11. Schematics of an HPC cluster for the purposes of hierarchic mesh partitioning (left) and a comparison of the total, SSH, and sea-ice scaling for hierarchic (continuous lines) and non-hierarchic (dashed lines) partitionings for the Core2 mesh on Emmy (RRZE Erlangen, right).
[…] on mesh resolution and are determined experimentally. Different from the traditional EVP, mEVP is always stable provided α and β are sufficiently large (e.g., α = β = 500 were used in the simulations reported in this paper). The number of substeps determines the closeness to a VP solution. Indeed, in the limit p → ∞, when the difference between the p and p+1 estimates tends to zero, the equations above become the time-discretized (A3) with (A4). The number N_EVP needed for simulations is selected experimentally, starting from N_EVP ≈ α and reducing it to as low values as possible without a noticeable effect on the solution. For the meshes used in the evaluation here, with the finest resolution of about 4.5 km in the Arctic (fArc), N_EVP = 100 works well. We note that such small values of N_EVP are only possible because of the use of mEVP. Since a halo exchange for the sea-ice velocity is needed in every iteration, a low N_EVP improves parallel scalability.

Table 1. Characteristics of the meshes. Rectangular analogues refer to Mercator-type grids with a similar number of wet points.

Table 2. Overview of the systems used in the scaling experiments.