Accurately modelling the contribution of Greenland and Antarctica to sea level rise requires solving partial differential equations at a high spatial resolution. In this paper, we discuss the scaling of the Ice-sheet and Sea-level System Model (ISSM) applied to the Greenland Ice Sheet with horizontal grid resolutions varying between 10 and 0.25 km. The model setup used as a benchmark problem comprises a variety of modules with different levels of complexity and computational demands. The core is the so-called stress balance module, which computes the ice flow using the higher-order (or Blatter–Pattyn) approximation of the Stokes equations on a mesh of linear prismatic finite elements; the setup also includes free-surface and ice-front evolution as well as thermodynamics in the form of an enthalpy balance.

We develop a detailed user-oriented, yet low-overhead, performance instrumentation tailored to the requirements of Earth system models and run scaling tests up to 6144 Message Passing Interface (MPI) processes. The results show that the computation of the Greenland model scales well overall up to 3072 MPI processes but is eventually slowed down by matrix assembly, output handling and lower-dimensional problems that involve fewer unknowns per MPI process. We also discuss improvements of the scaling and identify further improvements needed for climate research. The instrumented version of ISSM thus not only identifies potential performance bottlenecks that were not present at lower core counts but also provides the capability to continually monitor the performance of the ISSM code base. This is of long-term significance as the overall performance of ISSM depends on the subtle interplay between algorithms, their implementation, underlying libraries, compilers, runtime systems and hardware characteristics, all of which are in a constant state of flux.

We believe that future large-scale high-performance computing (HPC) systems will continue to employ the MPI-based programming paradigm on the road to exascale. Our scaling study pertains to a particular modelling setup available within ISSM and does not address accelerator techniques such as the use of vector units or GPUs. However, with 6144 MPI processes, we identified issues that need to be addressed in order to improve the ability of the ISSM code base to take advantage of upcoming systems that will require scaling to even higher numbers of MPI processes.

Projections of future sea level rise are a major societal demand. The future mass loss of ice sheets and glaciers is one of the primary sources of sea level rise

Since complex bed topographies, rugged coastlines of ice sheets and small scale features form an irregular geometry (e.g. narrow confined fjords in Greenland or small pinning points in Antarctic ice shelves), unstructured meshes are best suited. This motivated the development of codes based on finite element and finite volume discretizations with triangular or Voronoi meshes

To study the performance of ISSM, we select a real-life system as a test case: the Greenland Ice Sheet (GrIS) simulated at different horizontal resolutions – covering the range from what is today's standard for long-term simulations

Several metrics have been proposed to quantify “scalability”, that is, the performance response of a code when additional hardware resources are made available (see, e.g.

Numerical models such as ISSM are generally based on a discretization of the underlying system of partial differential equations (PDEs), leading to a fully discrete system of nonlinear or linear algebraic equations to be solved. In the context of finite elements, the computational domain is partitioned using a computational mesh consisting of elements (e.g. triangles or quadrilaterals in 2-D, tetrahedra or prisms in 3-D), and the discrete unknowns are specified per node, per element or per face of the mesh, depending on the specific finite element scheme and the approximation order used. The discrete unknowns are called degrees of freedom (DOF), and their total number specifies the size of the discrete linear or nonlinear system to be solved. With increasing mesh resolution as well as higher order of shape functions, the number of DOF increases for a particular PDE boundary value problem

The employed model setup in our study comprises a variety of modules with different levels of complexity. Sophisticated performance analyses that identify the impact of such a multi-physics problem for overall code performance do exist for ocean models

formerly Albany/FELIX

A first scaling analysis of ISSM has been conducted in

In addition, to pinpoint the effect of different models on the overall performance, we develop an instrumentation scheme that provides detailed performance information closely related to an Earth system scientist's view of the code. The challenge here is to develop a setup that does not introduce much overhead, as otherwise the results of instrumented runs would not be representative of the original code base. With low overhead, such an analysis can become part of the standard production environment, providing insight into the code's performance as algorithms or the code's environment (e.g. the underlying hardware, middleware or libraries) change. To that end, a setup is developed that limits the instrumentation overhead through careful filtering but leaves the code base untouched.

The paper is structured as follows: we start by introducing the underlying physical model (the details of the mathematical models are given in Appendix

For this study, we focus on a selected subset of the capabilities of ISSM; e.g. we employ only the HO approximation of the stress balance. This approximation is currently used in ice sheet projections and, in terms of the ice sheet code run as a part of a fully coupled ESM, it is the most comprehensive level of physics that we expect to be practical in ESMs. We also do not incorporate other sophisticated modules, such as subglacial hydrology, or advanced approaches for computation of surface mass balance (SMB).
The mathematical model is given in detail in the Appendix

The mathematical models for the different modules (Sect.

The results presented in this study are based on a setup of the GrIS that has previously been used for future projections

The main difference with

Each horizontal mesh is generated with a higher resolution denoted by

Overview of problem setups used in our study. Note that (a) the vertices and elements of the 3-D mesh describe the full mesh (i.e. ice-covered and non-ice-covered elements), (b) the minimum DOF is the DOF from the 2-D mass transport module and (c) the maximum DOF is the DOF from the 3-D stress balance horizontal module.

In order to measure the performance, we run 30 time steps in each run but measure only time steps 11 to 30. Because each module is allowed to reach its convergence criterion in every time step, we intentionally exclude the timings of the cold start, which is based on a poor initial guess.
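The measurement window described above can be sketched as follows. This is a minimal illustration, not the instrumentation used in the study: the function name `measured_runtime` and the per-step durations are hypothetical, and only the window (steps 11 to 30 of 30) comes from the text.

```python
def measured_runtime(step_durations, first_measured=11, last_measured=30):
    """Sum the durations of the measured time steps (1-based, inclusive).

    Steps before `first_measured` are treated as warm-up and discarded.
    """
    if len(step_durations) < last_measured:
        raise ValueError("run is shorter than the measurement window")
    return sum(step_durations[first_measured - 1:last_measured])


# Hypothetical durations in seconds: a slow cold start followed by
# steady-state steps. These are NOT measurements from the paper.
durations = [5.0] * 10 + [2.0] * 20

print("all 30 steps :", sum(durations), "s")
print("steps 11..30 :", measured_runtime(durations), "s")
```

With these made-up numbers, more than half of the total wall time falls into the discarded warm-up, which illustrates why including the cold start would distort the per-step cost.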

The convergence criteria (see the Appendix) for the linear iteration of the stress balance is the Euclidean norm

All experiments are conducted on dedicated compute nodes of the Lichtenberg high-performance computing (HPC) system with two 48-core Intel Xeon Platinum 9242 per compute node and 384 GB of main memory each, connected with an InfiniBand HDR100 network providing point-to-point connections between nodes.
For all runs, we employ 48 MPI processes on each node pinned to NUMA nodes.
We use only half of the available hardware cores because, at G250 resolution, each MPI process requires 7.4 GB of memory when the mesh is distributed over only 48 processes.
The memory demand per process shrinks as the number of processes rises (e.g. 5.9 GB per process on 96 processes), but not enough to overcome this limitation of our setup when only a few nodes are used.
Each experiment runs three repetitions, and results fall within a standard deviation of 10 %.
The basis for our instrumentation is the latest ISSM public release 4.18, which is compiled with GCC 10.2 (optimization level “-O2”), Open MPI 4.0.5
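The memory constraint behind the choice of 48 processes per node can be checked with simple arithmetic. The sketch below uses only the figures quoted above (384 GB per node, 7.4 GB per process on 48 processes, 5.9 GB per process on 96 processes). The function names are illustrative, operating system and library overheads are ignored, and reading the 96-process figure as "all 96 processes on one node" is our assumption.

```python
NODE_MEMORY_GB = 384.0  # main memory per compute node, as stated in the text


def node_memory_demand_gb(procs_per_node, gb_per_proc):
    """Total memory requested on one node by its MPI processes."""
    return procs_per_node * gb_per_proc


def fits_on_node(procs_per_node, gb_per_proc, node_memory_gb=NODE_MEMORY_GB):
    """True if the processes placed on one node fit into its main memory."""
    return node_memory_demand_gb(procs_per_node, gb_per_proc) <= node_memory_gb


# 48 processes per node at 7.4 GB each: roughly 355 GB, below 384 GB.
print(round(node_memory_demand_gb(48, 7.4), 1), fits_on_node(48, 7.4))

# Assumed scenario: 96 processes on one node at 5.9 GB each: roughly
# 566 GB, which exceeds the 384 GB available.
print(round(node_memory_demand_gb(96, 5.9), 1), fits_on_node(96, 5.9))
```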

We compiled both ISSM and PETSc with “-O2” as well as with “-O3 -march=cascadelake -mtune=cascadelake”. On a 480-core configuration, the entire calculation (without loading the model) took 1955 s compiled with “-O2” and 1930 s when compiled with higher-level optimization. With 1536 cores, we observed execution times of 763 and 748 s, respectively. So the compiler optimization level has an impact of less than 2 % in both cases. As the impact of the more aggressive compiler optimizations is low, we stick to “-O2”, as it avoids some potential numerical issues that can arise with more aggressive compiler options. Optimization level “-O2” is also employed for Open MPI and PETSc in the module tree provided for the Lichtenberg HPC system by the computing centre of TU Darmstadt.
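The timings quoted above can be turned into relative numbers with a few lines of arithmetic. The sketch below reproduces the "less than 2 %" statement and, as an additional rough estimate that is not part of the original analysis, derives a strong-scaling efficiency between the two "-O2" runs. The function names are ours.

```python
def relative_gain(t_base, t_opt):
    """Fraction of runtime saved by the optimized build."""
    return (t_base - t_opt) / t_base


def parallel_efficiency(p_ref, t_ref, p, t):
    """Measured speedup divided by the ideal speedup p / p_ref."""
    return (t_ref / t) / (p / p_ref)


# cores -> (runtime with -O2, runtime with -O3 and native flags), in seconds
runs = {480: (1955.0, 1930.0), 1536: (763.0, 748.0)}

for cores, (t_o2, t_o3) in runs.items():
    # prints 1.28 % for 480 cores and 1.97 % for 1536 cores
    print(f"{cores} cores: {100 * relative_gain(t_o2, t_o3):.2f} % faster")

# Strong-scaling efficiency from 480 to 1536 cores using the -O2 timings:
# speedup 1955 / 763 is about 2.56 against an ideal 3.2, i.e. about 0.80.
eff = parallel_efficiency(480, runs[480][0], 1536, runs[1536][0])
print(f"efficiency 480 -> 1536 cores: {eff:.2f}")
```

Note that these two timings cover the entire calculation without loading the model, so the efficiency estimate mixes all modules and should only be read as an order-of-magnitude indication.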

Nevertheless, vector instructions, which are generated when compiling with native compiler flags, are in general important for the overall performance of codes. The fact that ISSM does not benefit from vector instructions suggests that the performance of ISSM is limited by memory bandwidth. Node-level performance optimizations are therefore a future need that pure MPI scaling cannot overcome.

The sequence of modules in a transient time step in ISSM. Small grey circles indicate the dimension of the equations of the module (3-D, 2-D). Larger grey circles with PDEs denote whether and how many partial differential equations are solved. Diamonds with “par” indicate that only an algebraic equation is evaluated. On the right side, we list the DOF for particular mesh resolutions and modules.

Sketch of a solution sequence of ISSM.

ISSM is implemented in multiple modules, which run in a predefined sequence illustrated in Fig.

Although these modules solve different mathematical problems, the modular software design enables a similar structure for the implementation of the linear and nonlinear equations.
This generalized form of the solution sequence of ISSM is shown in Fig.

The first step of each sequence step that involves the solution of a PDE is identical and consists of constructing the equation system. Within this step, memory is allocated, the entries of the system matrix are filled in on each mesh element, and the global matrix is assembled. The main difference between the modules lies in how the equation matrix is filled. Here, the code iterates over all elements of the mesh and computes an element matrix, whose entries depend on the PDE being discretized. The element entries are then assembled into a global matrix. Next, the equation system is solved. If a module contains a nonlinear iteration – this is the case for the horizontal stress balance and the thermal module – in each step of the nonlinear solver material properties or basal constraints are updated, and the global linear equation system is solved in the same fashion as for linear PDEs. The nonlinear iteration is repeated until the convergence criterion is reached. Subsequently, the results are post-processed, and the geometry of the mesh is updated as needed. Finally, the requested file output is selected.
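The generic sequence described above (element loop, global assembly, linear solve, nonlinear update, convergence check) can be illustrated with a deliberately small, serial toy problem. The sketch below is not ISSM code: it solves a 1-D diffusion problem with a solution-dependent coefficient by fixed-point iteration, and every name in it (`solve_dense`, `element_matrix`, `assemble`, `solution_sequence`) as well as the choice of test problem is our own assumption.

```python
def solve_dense(a, b):
    """Gaussian elimination with partial pivoting (stand-in for a parallel solver)."""
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (b[r] - s) / a[r][r]
    return x


def element_matrix(u_left, u_right, h, coefficient):
    """2x2 element matrix; the material property depends on the current solution."""
    k = coefficient(0.5 * (u_left + u_right))
    return [[k / h, -k / h], [-k / h, k / h]]


def assemble(u, h, coefficient):
    """Loop over all elements and add each element matrix to the global matrix."""
    n = len(u)
    K = [[0.0] * n for _ in range(n)]
    for e in range(n - 1):
        ke = element_matrix(u[e], u[e + 1], h, coefficient)
        for i in range(2):
            for j in range(2):
                K[e + i][e + j] += ke[i][j]
    return K


def solution_sequence(n_nodes, coefficient, tol=1e-10, max_iter=50):
    """Construct, solve and update until the nonlinear iteration converges."""
    h = 1.0 / (n_nodes - 1)
    u = [i * h for i in range(n_nodes)]            # initial guess
    for iteration in range(1, max_iter + 1):
        K = assemble(u, h, coefficient)            # construct equation system
        rhs = [0.0] * n_nodes
        for node, value in ((0, 0.0), (n_nodes - 1, 1.0)):
            K[node] = [0.0] * n_nodes              # boundary values u(0)=0, u(1)=1
            K[node][node] = 1.0
            rhs[node] = value
        u_new = solve_dense(K, rhs)                # linear solve
        change = max(abs(a - b) for a, b in zip(u_new, u))
        u = u_new                                  # update the nonlinear iterate
        if change < tol:                           # convergence criterion
            return u, iteration
    raise RuntimeError("nonlinear iteration did not converge")


u, n_iter = solution_sequence(21, lambda v: 1.0 + v * v)
print(f"converged after {n_iter} nonlinear iterations, u(0.5) = {u[10]:.4f}")
```

For a linear module the coefficient does not depend on the solution, and the loop terminates after a single pass, which mirrors the statement that nonlinear modules solve the global linear system in the same fashion as linear PDEs, only repeatedly.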

While running multiple modules in parallel is not possible due to data dependencies, ISSM parallelizes the solution sequence of individual modules. For this it uses an even distribution of the elements, which is constant over time and independent of the modules, and each module handles the parallelization in the same way. During the construction of the equation system, memory is allocated locally, the element matrices are computed, and the equation system is filled, without any MPI communication. Thereafter, entries which are assigned to other MPI processes are communicated in the assembly of the equation system, leading to many MPI calls. The parallel linear solver then solves the equation system and the solution is distributed to all MPI processes in the post-processing. In the case of a nonlinear system, a convergence criterion has to be computed, and the nonlinear iteration has to be updated, potentially involving additional MPI calls. In the final selection of the requested output, data are stored in vectors or reduced to scalars. Both operations again lead to MPI communication.
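The even, time-independent distribution of elements can be pictured with a simple block partition. The sketch below is an assumption made for illustration only: the text does not specify how ISSM partitions the mesh, and a real unstructured mesh would typically be split by a graph partitioner rather than by contiguous index ranges. The function name `element_range` is hypothetical.

```python
def element_range(n_elements, n_procs, rank):
    """Half-open range [start, stop) of element indices owned by `rank`.

    Element counts differ by at most one between any two ranks.
    """
    base, extra = divmod(n_elements, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Tiny example: 10 elements on 4 processes.
for rank in range(4):
    start, stop = element_range(10, 4, rank)
    print(f"rank {rank}: elements {start}..{stop - 1} ({stop - start} elements)")

# Element matrices are computed locally on each rank. Nodes shared by
# elements on different ranks receive contributions from both, and it is
# these off-process entries that have to be communicated during assembly.
```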

Large code bases such as ISSM are developed and used over decades. The environment in which they are executed, on the other hand, i.e. the hardware, the operating systems, underlying libraries and compilers, is in a constant state of flux. As a consequence, code development needs not only to address the representation of physics but also to account for these changing operating environments, in particular, the increase in the number of compute cores.
Modernizing a code or porting it from one operating environment to the other is likely to affect overall performance. In particular, modules that do not play a significant role with respect to compute time with a low core count may end up taking a significantly larger chunk of compute time on a larger parallel system and thereby significantly affect the overall performance and scalability.
In addition, a code like ISSM is never in a final state.
The development of new modules, the implementation of new algorithms in existing modules, code optimizations to exploit GPU accelerators or vector units, or an update of used libraries can have a substantial (positive or negative) performance impact.
Consequently, continuous performance monitoring of ISSM is essential for assessing performance on the shifting computational ground the code lives on. For this reason, the code version of ISSM that we started out with had a basic timing setup to monitor the performance of the eight modules accounted for in Fig.

As will be shown in the next section, we need to dig deeper to develop a sufficient understanding of the performance behaviour of ISSM. To gather this information, we developed a sustainable performance measurement environment which provides performance information that correlates with the algorithmic view of domain scientists. Sustainability here includes three main factors: (1) the instrumented code needs to be easy to build and use, (2) the instrumentation results need to refer to identifiable modules in the code, and (3) measurements must not lead to a significant computational overhead, as this would distort results.

Profiling information for a code can be created in two different ways: sampling and/or instrumentation. With sampling-based tools, the execution is interrupted at regular intervals and that state recorded (e.g. with HPC toolkit, cf.

In our work, we use the Score-P instrumentation tool

In this work, we go beyond the instrumentation of the top-level modules and develop an instrumentation that enables an in-depth analysis of ISSM behaviour which is closely tied to the algorithmic view of domain scientists through making judicious use of the features provided by Score-P. Score-P generates profiles and traces based on compiler instrumentation supporting filtering and manually defined user regions. Additionally, Score-P hooks into the PMPI interface (the MPI profiling interface) of the MPI library and is therefore able to track each MPI call with little overhead. This is important as calls to MPI functions are likely causes of synchronization overhead that we might encounter. In addition, Score-P is able to generate similarly instrumented interfaces for user-defined libraries, which we employ to wrap calls to the PETSc library. Our timing profile thus includes every PETSc call. This is beneficial since the PETSc calls provide much more context information than MPI calls by themselves. So the profile contains the information whether an MPI call belongs to an assembly, the solver or some other PETSc algorithm, respectively.

In order to develop a low-overhead instrumentation, we start out with a full instrumentation (which is generated automatically without effort on our part) and then analyse which modules account for a significant chunk of the runtime. Repeating this several times and taking the modular structure of ISSM (see Sect.

These 58 functions cover the so-called “hot paths” of the main parts of the code, i.e. the modules and call paths where most of the computing time is spent. We mention that this process of finding the hot paths of a code has since been (partially) automated with the Performance Instrumentation Refinement Automation framework (PIRA) tool

Because we took care to include all functions and methods on the call path to each instrumented function, so as to obtain the context of each measured region, our function whitelist includes the entry point of each physics module (which are also measured by ISSM's internal timings), the calls of the individual solution sequences and the top-level calls of the logical steps of the algorithms, e.g. allocation of memory, computation of element matrices, assembly of matrices and vectors, and the linear solver.

Table

Even with this high number of calls to MPI functions, the instrumentation overhead remains low: the profiling of the 15 trillion MPI calls results in a runtime overhead of about 2.5 %. When, in addition, the 84 billion PETSc functions are instrumented, we do not detect any noteworthy additional overhead. The same holds for the addition of the 57 million calls related to ISSM. On the other hand, a fully automatic instrumentation of ISSM results in an overhead of over 13 000 % according to our measurements on a coarser grid. Fully instrumented binaries are not executable in reasonable time, and any performance evaluations made on their basis may have little relevance with respect to the original code.

Details of our filter, the final profile and the overhead of the measurement environment for 30 time steps of G250 with 3072 MPI processes.

The bottom line is that our instrumentation scheme keeps overhead low, even for large-scale runs, thus ensuring that the measured code is representative of the original source. As a result, it is quite feasible to periodically run an instrumented version of the code as part of the regular work of domain scientists, as a safeguard against surprises arising, for example, from a changed MPI library.

Through the performance analysis of ISSM we recognized, for example, that the matrix of the equation system in the stress balance horizontal module is being reallocated in each iteration of the nonlinear iteration scheme. Since the structure of the matrix does not change during these iterations, we modified the code to pre-allocate the matrices. Reusing them saved substantial time in the allocation and assembly of the equation system of the stress balance horizontal module. For example, running with 3072 cores, the time for matrix allocation was reduced by 91 %, the time for matrix assembly was reduced by 85 %, and the overall runtime of stress balance horizontal module decreased by 31 %. The performance numbers shown in the following section are based on this optimized version.

Pre-allocation is not an option for the thermal module due to dynamic changes in boundary condition type at the base (see Eqs.

In this section, we present the results of the measurements for the entire transient time step (Fig.

Runtime of a transient time step of ISSM without output handling in resolution G250.

Figure

However, the stress balance module scales linearly up to 3072 MPI processes and reasonably well above that. Although it no longer scales linearly beyond this point, its runtime still decreases monotonically with the number of MPI processes. In contrast, the runtime of the thermal, mass transport and moving front modules starts to increase earlier (i.e. at a lower number of MPI processes), and the increase is more prominent. Since these modules scale worse than the transient solution as a whole, they become more relevant with increasing numbers of processes. The worst scalability is found for the moving front module, which contributes even more than the stress balance module from 4608 MPI processes on. The main reason for these discrepancies in the scaling behaviour is that the individual modules solve equations with differing numbers of total DOF and different computing costs per element. In the following, we discuss the scaling behaviour and the algorithmic parts which cause it. Modules that do not solve any PDEs (SMB and grounding line modules) scale linearly and are not further investigated.

Runtime of the stress balance computation of ISSM Greenland model G250.

Since the stress balance module is the most time-consuming module of ISSM, it is the most important module with respect to performance optimization and is thus discussed first. Exploiting the natural anisotropy of the problem, the horizontal and vertical components are solved in an uncoupled fashion, and the structure of the solution procedure varies between these two main components. Therefore, we present them here separately.
The runtime of both modules is displayed in Fig.

We observe that the stress balance horizontal module is far more expensive to solve than the vertical one, which is expected because the former has more DOF. The scaling behaviour of the two modules differs significantly, with the stress balance vertical module scaling worse. For the horizontal stress balance module, runtime is still decreasing monotonically at 6144 MPI processes but starts to deviate from linear scaling. The stress balance vertical module exhibits a minimum runtime at 2304 MPI processes and slows down by about a factor of 4 with 6144 MPI processes. In the horizontal case, linear scaling breaks down when the DOF per MPI process fall below 10 000, while the vertical case never reaches 10 000 DOF per MPI process.

The execution time of the stress balance horizontal module and the stress balance vertical module at low core counts is dominated by the cost of computing the entries of the matrix, which scales linearly with the number of cores for the setups considered in this work. Most notably, the costs for the matrix assembly rise in both cases from 1152 cores onwards, despite the large difference in the size of the problem. The linear solver does not represent a problem in either the stress balance horizontal module or the stress balance vertical module: it does not need a significant amount of time, and its scaling is sufficient. While the stress balance vertical module requires the solution of a single linear equation system, the nonlinear equation system of the stress balance horizontal module has to be solved iteratively and needs approximately 12 iterations per time step for this particular application.

Runtime of the thermal module for G250.

The runtime for the thermal module is presented in Fig.

The thermal module contains a nonlinear iteration scheme in which the basal boundary conditions are updated. This update is expensive, and the execution time does not change much as the number of cores increases.
Furthermore, Fig.

Runtime of the moving front module for G250.

The moving front module consists of three individual modules: (1) a level-set module that is executed first, followed by (2) a module evaluating the slope of the level-set function and lastly (3) the extrapolation module.
As shown in Fig.

The runtime of each step is displayed in Fig.

The amount of time required for allocation of memory for the level-set and level-set slope modules is similar to that of stress balance and thermal modules. In contrast, the allocation of memory is more time consuming for the extrapolation module. This is likely due to repeatedly solving a diffusion equation that accumulates larger costs.

The linear solver takes a significant amount of time only in the extrapolation module, where it scales linearly. The execution time of the linear solver in the level-set and level-set slope modules is negligible. Other routines are summarized in the dashed grey line, but because they scale almost linearly, they do not play an important role.

Runtime of the mass transport module for G250.

This module is characterized by a particularly low number of DOF, which is 30 times smaller than in the horizontal stress balance module and 13 times smaller than in the thermal module.
The overall module scales up to 1536 MPI processes (Fig.

Of high interest for planning simulations is the number of simulated years per day (SYPD), as it reflects the total time needed to conduct a given simulation. As the typical applications of ice sheet models vary strongly in simulated time period and resolution, we ran simulations at a coarse resolution (G4000), which is still higher than that used in current paleo-simulations, as well as at the highest resolution we could afford (G250). We estimated SYPD from our simulations of 20 time steps and scaled them up to 1 year, taking into account the differing time step sizes for each resolution. As displayed in Fig.

SYPD for various grid resolutions estimated from a runtime of 20 time steps.
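The extrapolation from a short run to simulated years per day amounts to dividing the simulated time span by the wall-clock time expressed in days. The sketch below shows this calculation under the assumption that the cost per time step stays constant over a full year. The function name `sypd`, the runtime and the time step size in the example are hypothetical and are not values from this study.

```python
SECONDS_PER_DAY = 86400.0


def sypd(wallclock_seconds, n_steps, dt_years):
    """Simulated years per wall-clock day, extrapolated from a short run.

    wallclock_seconds: measured runtime of the n_steps time steps
    dt_years:          time step size in years (differs per resolution)
    """
    simulated_years = n_steps * dt_years
    wallclock_days = wallclock_seconds / SECONDS_PER_DAY
    return simulated_years / wallclock_days


# Hypothetical example: 20 time steps of 0.01 years take 100 s of wall time.
print(f"SYPD = {sypd(100.0, 20, 0.01):.1f}")
```

Because the time step size enters linearly, a finer mesh is penalized twice: each step is more expensive, and stability restrictions force smaller steps, so fewer simulated years are covered per step.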

As our ability to assess the scalability of the code might be limited by the size of the problem we solve, this last part compares simulations of Greenland at different resolutions. Figure

Runtime for various grid resolutions.

Execution time for matrix assembly for various modules. Symbols represent the modules, while the colour denotes the number of MPI processes.

Percentage of computation time spent in matrix assembly versus the remaining computations in mass transport module.

The performance analysis for the components of individual modules reveals that the major scaling issue is the assembly of the equation system matrix as shown in Figs.

Matrix assembly consists almost entirely of MPI communication, and the cost of communication increases from a certain point on as the number of MPI processes grows. The allocation of memory and the computation of entries of the equation system, on the other hand, are both core-local computations and hence scale linearly with the number of cores. In particular, this part of the computation will scale further, because elements are distributed evenly on MPI processes, no matter how many or few elements are computed on each core. In all these modules, the runtime contribution of vector assembly is insignificant and the linear solver is either insignificant or scales well.
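The interplay between linearly scaling local work and growing communication cost can be illustrated with a toy cost model. This is purely an assumption for illustration and is not fitted to any measurement in this study: local work is modelled as `work / p`, communication as `comm * p`, and both coefficients below are made up. The process counts are a subset of those used in the scaling tests.

```python
import math


def runtime(p, work, comm):
    """Toy model: perfectly scaling local work plus communication growing with p."""
    return work / p + comm * p


def optimal_processes(work, comm):
    """Continuous minimizer of the toy model, obtained from d/dp (work/p + comm*p) = 0."""
    return math.sqrt(work / comm)


def best_candidate(candidates, work, comm):
    """Process count from `candidates` with the smallest modelled runtime."""
    return min(candidates, key=lambda p: runtime(p, work, comm))


PROCESS_COUNTS = [48, 96, 480, 1152, 1536, 2304, 3072, 4608, 6144]

# Hypothetical coefficients in arbitrary time units.
small_problem = dict(work=2.0e5, comm=0.05)
large_problem = dict(work=8.0e5, comm=0.05)

for name, params in (("small", small_problem), ("large", large_problem)):
    p_star = optimal_processes(**params)
    p_best = best_candidate(PROCESS_COUNTS, **params)
    print(f"{name}: continuous optimum {p_star:.0f}, best tested count {p_best}")
```

In this model the turning point moves to higher process counts when the local work per module grows, which is qualitatively consistent with modules of fewer DOF or cheaper element computations ceasing to scale earlier.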

Our measurements reveal that scalability breaks down in all modules because of the poor performance of the matrix assembly.
The tipping point mainly depends on the costs of the allocation of memory and the time for computing the element entries of an individual PDE, the number of elements per MPI process and the data locality of the assembly, which impacts the amount of inter-core communication required for the assembly. So, for a fixed core count, increasing the number of DOF improves parallel performance, but with an increasing number of MPI processes, the communication overhead of the assembly starts dominating at some point. Similar behaviour was also found in other studies

While not the focus of the current study, we also noticed that the output routine does not scale well either. The main reason for this is an indexing scheme tailored towards post-processing in the MATLAB/Python user interface, which does not exhibit good data locality and leads to a large number of MPI calls. Here, we suggest using the more data-local indexing scheme already employed in the creation of the equation system. Compatibility with the user interface could then be achieved by reordering the results in a post-processing step, which can be done trivially in parallel by distributing different output vectors and different time steps among the available cores.

The performance analysis of an ice sheet code needs to keep the challenging numerical underpinnings of the problem in mind: the code solves in a sequential fashion a number of different modules, each of which solves an equation system of a very different type and with a different number of DOF. Therefore, this type of code is inherently prone to the situation that a particular domain decomposition may be optimal for the module with the largest number of DOF and scale well with increasing core counts, while the performance of the module with a smaller number of DOF may not experience any improvement – or even worsen – with increasing core counts.

Other components of ESMs also face the issue that modules with fewer DOF limit scalability. For the finite volume sea ice–ocean model FESOM2 (Finite-volumE Sea ice–Ocean Model, version 2.0), similar scalability issues were found for 2-D computations

Our analysis reveals that the migration of lateral margins is becoming costly with increasing number of cores. This module includes the extrapolation of some solution fields performed via solving a diffusion equation over the ice-free regions constrained by the values calculated in the ice-covered region. This approach generates the smallest stiffness matrices in the overall sequence because ISSM treats Dirichlet boundary conditions with a lifting method so that only the unconstrained degrees of freedom are included in the stiffness matrix. Since all vertices located on ice are constrained, the unconstrained degrees of freedom are only those that correspond to the vertices that are outside of ice, and one ends up with a very small number of DOF per MPI process. One approach to address this problem would be to use an alternative approach for treating Dirichlet boundary conditions such as including entries for all nodes in the stiffness matrix, setting rows of constrained nodes to 0 except along the diagonal and changing the right-hand side to the value of the constraint.
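The two ways of imposing Dirichlet constraints discussed above can be compared on a toy 1-D diffusion problem. The sketch below is not ISSM code and all function names are our own. It treats nodes 0 and 1 as constrained ("on ice") and nodes 2 to 4 as free ("ice-free"), with zero flux at the right end, so that the free nodes should simply inherit the value of the last constrained node.

```python
def solve_dense(a, b):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (b[r] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x


def diffusion_matrix_1d(n_nodes):
    """Stiffness matrix of linear elements with unit length and unit diffusivity."""
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    for e in range(n_nodes - 1):
        K[e][e] += 1.0
        K[e][e + 1] -= 1.0
        K[e + 1][e] -= 1.0
        K[e + 1][e + 1] += 1.0
    return K


def solve_lifting(K, constraints):
    """Keep only free DOF; constrained values move to the right-hand side."""
    n = len(K)
    free = [i for i in range(n) if i not in constraints]
    A = [[K[i][j] for j in free] for i in free]
    rhs = [-sum(K[i][j] * v for j, v in constraints.items()) for i in free]
    u = [constraints.get(i, 0.0) for i in range(n)]
    for i, value in zip(free, solve_dense(A, rhs)):
        u[i] = value
    return u, len(free)


def solve_row_replacement(K, constraints):
    """Keep all DOF; constrained rows become identity rows with the value as RHS."""
    n = len(K)
    A = [row[:] for row in K]
    rhs = [0.0] * n
    for node, value in constraints.items():
        A[node] = [0.0] * n
        A[node][node] = 1.0
        rhs[node] = value
    return solve_dense(A, rhs), n


K = diffusion_matrix_1d(5)
constraints = {0: 1.0, 1: 3.0}
print("lifting        :", solve_lifting(K, constraints))
print("row replacement:", solve_row_replacement(K, constraints))
```

Both variants return the same field, but the row-replacement system keeps all five unknowns while the lifted system has only three. In a distributed setting this larger, mesh-aligned system is what would keep the number of DOF per MPI process from collapsing, at the price of solving for values that are already known.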

In addition, wherever a low number of DOF per MPI process is limiting scalability with increasing core counts, the number of DOF could be increased by switching to P2 (quadratic) elements. In this context, one must find a reasonable balance between increasing the size of the problem for the sake of making the node computation more expensive while keeping communication constant, and increasing total computational costs disproportionally with respect to the additional knowledge gain. At this point, it should also be mentioned that the bedrock topography is only insufficiently known, and resolutions finer than 150 m are to date limited by the lack of input data for such simulations.

In order to increase the throughput, new modelling strategies are worth investigating. The nonlinear iterations contribute significantly to the overall costs of a time step. Depending on the particular application, the number of nonlinear iterations may be reduced. Thus far, nothing is known about the effect of such a reduction, and care must be taken not to miss abrupt changes in the system. A simulation study comparing the resulting evolution of the temperature field with and without iterative update of the basal constraints can assess this effect. It is also worth considering error indicators to steer the number of nonlinear iterations. Similar to error indicators used for adaptive mesh refinement

Future simulation strategies may comprise more on-the-fly analyses than is currently standard in ice sheet codes. So far, only a few scalars are computed, while an in-depth analysis is conducted in the post-processing. Some analyses may be conducted on the level of processors, while others need to be run globally. In particular, sensitivity studies for tuning model parameters would benefit from such an on-the-fly analysis, with simulations producing unrealistic results being terminated quickly.

From the perspective of ice sheet codes running coupled in ESMs, we recommend considering higher resolutions even if long timescales are anticipated, as this leads to better scaling than coarse resolution. If the anticipated SYPDs cannot be met, the ice sheet code should be run at its peak performance rather than on the maximum number of cores available. To this end, investigations of optimal sharing of resources between codes of ESMs should be conducted.
Similar to ESMs
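The recommendation to run at peak performance rather than on the maximum number of cores can be made concrete with a short sketch. Throughput in simulated years per day (SYPD) is the simulated time divided by the wall-clock time expressed in days. The function names and all timings below are hypothetical placeholders, not measurements from this study.

```python
from typing import Dict, Tuple

SECONDS_PER_DAY = 86_400.0


def sypd(simulated_years: float, wallclock_seconds: float) -> float:
    """Simulated years per wall-clock day."""
    return simulated_years * SECONDS_PER_DAY / wallclock_seconds


def peak_throughput(runtimes: Dict[int, float],
                    simulated_years: float) -> Tuple[int, float]:
    """Pick the process count with the highest SYPD.

    runtimes maps the number of MPI processes to the measured wall-clock
    time in seconds for simulating `simulated_years`.
    """
    best_procs = max(runtimes, key=lambda p: sypd(simulated_years, runtimes[p]))
    return best_procs, sypd(simulated_years, runtimes[best_procs])


# Hypothetical wall-clock times (seconds) for one simulated year.
example_runtimes = {576: 400.0, 1152: 240.0, 2304: 200.0, 4608: 260.0}
best_procs, best_sypd = peak_throughput(example_runtimes, simulated_years=1.0)
```

With these placeholder numbers the largest allocation is slower than the second largest, so the cores beyond the peak would be better assigned to other components of the coupled system.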

By means of the performance analysis, we also identified some avenues for future improvement of the code. The reuse of the equation system matrix is also applicable to the multiple executions of the extrapolation step, and a change of indexing in the requested outputs module can be used to improve the memory locality. Furthermore, our instrumentation is suitable for efficient tracing and load imbalance detection. The major load imbalances occur in the computation of matrix entries and lead to load-imbalanced matrix assembly. The matrix assembly is clearly a key computation that warrants further investigation. We also want to emphasize that we investigated the performance of an HO application only and that other issues may arise for other momentum balance choices.
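As an illustration of how per-process timings can be condensed into a load imbalance measure, the following sketch uses the common ratio of maximum to mean time. The metric choice, the function name `load_imbalance` and the example timings are assumptions for illustration; they are not taken from the measurements of this study.

```python
from typing import Dict, Sequence


def load_imbalance(times_per_rank: Sequence[float]) -> Dict[str, float]:
    """Summarize the imbalance of one code region across MPI ranks.

    imbalance_factor: max / mean time (1.0 means perfectly balanced).
    idle_fraction:    share of the region's duration that an average rank
                      spends waiting for the slowest rank.
    slowest_rank:     index of the rank with the largest time.
    """
    if not times_per_rank:
        raise ValueError("need at least one timing")
    slowest = max(times_per_rank)
    mean = sum(times_per_rank) / len(times_per_rank)
    return {
        "imbalance_factor": slowest / mean,
        "idle_fraction": 1.0 - mean / slowest,
        "slowest_rank": float(times_per_rank.index(slowest)),
    }


# Hypothetical seconds spent computing matrix entries on four ranks.
summary = load_imbalance([1.0, 1.0, 1.0, 2.0])
```

Because all ranks have to wait at the next synchronization point, the slowest rank determines the duration of the region, which is why an imbalance in computing matrix entries carries over to the assembly as a whole.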

Although our performance instrumentation leads to a very modest overhead, this can be further diminished. About 99 % of the MPI calls belong to three functions, MPI_Iprobe, MPI_Test and MPI_Testall, and about 90 % of the PETSc calls recorded by Score-P refer to setvalue() routines used in filling buffers for matrix assembly. If the MPI calls are not of interest, they can be disabled in groups via the Score-P interface, or individual functions can be excluded by implementing and preloading a functionless PMPI interface. Currently, the library wrapping interface of Score-P does not allow for the easy exclusion of certain functions in a library (i.e. a whitelist/blacklist functionality for library functions). However, the overhead of our instrumentation would clearly be reduced further if instrumentation of these low-level PETSc calls could be avoided.

To analyse the practical throughput of a typical application of ISSM, we conducted transient HO simulations for the Greenland Ice Sheet at five different horizontal resolutions. We present runtime measurements for individual code modules based on an instrumentation of the ice sheet code with Score-P. We conclude that ISSM scales up to 3072 MPI processes at the highest resolution that we tested (G250). While it was expected that the stress balance module would dominate the runtime, we found that simulating the motion of the lateral margins becomes the main cost factor from 4608 MPI processes onwards. We find major scaling challenges in HO due to the assembly of the system matrix, in particular when the number of DOF per MPI process falls below 10 000. The maximum throughput for all horizontal resolutions in HO was reached at 1152–2304 MPI processes and is particularly small for the highest resolution due to severe time step restrictions.
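The threshold of 10 000 DOF per MPI process can be turned into a quick estimate of how far a given mesh is likely to scale. The sketch below assumes an extruded mesh of linear prisms whose node count is the number of horizontal vertices times the number of vertical node layers, and two velocity unknowns per node for the HO stress balance. The mesh size in the example is hypothetical and does not correspond to any of the meshes used in this study.

```python
def dofs_per_process(n_vertices_2d: int, n_layers: int, n_procs: int,
                     dofs_per_node: int = 2) -> float:
    """Average number of unknowns owned by each MPI process."""
    if n_procs < 1:
        raise ValueError("n_procs must be positive")
    return n_vertices_2d * n_layers * dofs_per_node / n_procs


def max_procs_above_threshold(n_vertices_2d: int, n_layers: int,
                              threshold: int = 10_000,
                              dofs_per_node: int = 2) -> int:
    """Largest process count that keeps the average at or above threshold."""
    total_dofs = n_vertices_2d * n_layers * dofs_per_node
    return total_dofs // threshold


# Hypothetical mesh: one million horizontal vertices, 15 vertical layers.
limit = max_procs_above_threshold(1_000_000, 15)
at_limit = dofs_per_process(1_000_000, 15, limit)
beyond_limit = dofs_per_process(1_000_000, 15, 3072)
```

Such an estimate only bounds the process count from the assembly point of view; lower-dimensional modules operate on far fewer unknowns per process and reach their scaling limit earlier.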

This study also showed that meaningful in-depth performance analysis of ISSM can be performed at little cost and with minimal code changes, which could be eliminated completely in the future by a very limited refactoring of the code. An instrumented, user-oriented, low-overhead profiling version of ISSM can then be built from the unmodified main source of ISSM. Thus, scientists using ISSM can monitor performance of their code as their computational environment evolves without having to worry about carrying instrumentation code into their new code branch. Future advances in instrumentation with respect to the filtering of library routines could further decrease instrumentation overhead to the level where it is negligible, thereby allowing continuous performance monitoring, which would provide valuable information for the creation of execution models of ISSM and its modules.

Let

The momentum balance used in this study is the Blatter–Pattyn higher-order approximation

The boundary condition of the momentum balance is at all boundaries

Finally, at the (vertical) calving front

The viscous rheology of ice is treated with a regularized Glen flow law (Eq.

The ice is treated as an incompressible material and hence the mass balance reduces to

We solve the enthalpy balance equation to resolve cold-, temperate- or polythermal-ice states

The temperature field

At the ice surface

The jump condition on

The ice thickness evolution equation reads

For HO the grounding line position is obtained from hydrostatic equilibrium: let the thickness of flotation be given by

The terminus evolution (for both marine-terminating glaciers and ice shelves) is given by the kinematic calving front condition using a level-set method. The level-set function,

For filling the required physical variables at elements that are activated due to expansion of the ice sheet area, an extrapolation is required. This is done by solving a 2-D diffusion equation for each variable

While in general the derivation of SMB may require an energy balance model to compute the surface skin temperature and SMB from precipitation, air temperature and radiation, we restrict ourselves in this study to a simple approach in which surface melting is parameterized by a positive degree day (PDD) method

ISSM employs three different convergence criteria

Physical parameters used for ISSM.

Horizontal mesh and simulated surface velocities,

ISSM version 4.18

CB, YF, MR, AH and VA designed the study. YF conducted the performance measurements and instrumented the code. MR contributed the ISSM Greenland setup. All authors discussed the results and text. YF and CB wrote Sects. 3 and 4. AH and YF wrote Sects. 5 and 6. AH wrote major parts of Sect. 1 and the Appendix. MM contributed ISSM design philosophy and insights into implementations. VA contributed the comparison to other ESM codes and strategies in terms of numerics.

The contact author has declared that neither they nor their co-authors have any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Simulations for this study were conducted on the Lichtenberg high-performance computer of Technical University of Darmstadt. The authors would like to thank the Hessian Competence Center for High Performance Computing – funded by the Hessian State Ministry of Higher Education, Research and the Arts – for helpful advice. Angelika Humbert thanks Simone Bnà for discussions. Angelika Humbert and Martin Rückamp acknowledge support from the German Federal Ministry for Education and Research (BMBF) within the GROCE-2 project (grant no. 03F0855A), to which this work contributed optimal HPC setting options for coupled ice–ocean simulations. We thank Thomas Zwinger and the anonymous reviewer for very useful comments on the manuscript.

The work of Christian Bischof was partially funded by the German Science Foundation (DFG) (project no. 265191195) within SFB 1194. The work of Yannic Fischler was partially funded by the Hessian LOEWE initiative within the Software-Factory 4.0 project.

This paper was edited by Steven Phipps and reviewed by Thomas Zwinger and one anonymous referee.