Near-global climate simulation at 1 km resolution : establishing a performance baseline on 4 , 888 GPUs with COSMO 5 . 0

The best hope for reducing long-standing global climate model biases is by increasing resolution to the kilometer scale. Here we present results from an ultra-high resolution non-hydrostatic climate model for a near-global setup running on the full Piz Daint supercomputer on 4,888 GPUs. The dynamical core of the model has been completely rewritten using a domain-specific language (DSL) for performance portability across different hardware architectures. Physical parameterizations and diagnostics have been ported using compiler directives. To our knowledge this represents the first complete atmo5 spheric model being run entirely on accelerators at this scale. At a grid spacing of 930 m (1.9 km), we achieve a simulation throughput of 0.043 (0.23) simulated years per day and an energy consumption of 596 MWh per simulated year. Furthermore, we propose a new memory usage efficiency (MUE) metric that considers how efficiently the memory bandwidth – the dominant bottleneck of climate codes – is being used.


Introduction
Should global warming occur at the upper end of the range of current projections, the local impacts of unmitigated climate change would be dramatic.Particular concerns relate to the projected sea-level rise, increases in the incidence of extreme events such as heat waves and floods, and changes in the availability of water resources and the occurrence of droughts (Pachauri and Meyer, 2014).
Current climate projections are mostly based on global climate models (GCMs).These models represent the coupled atmosphere-ocean-land system and integrate the governing equations, for instance, for a set of prescribed emissions scenarios.Despite significant progress during the last decades, uncertainties are still large.For example, current estimates of the equilibrium global mean surface warming for doubled greenhouse gas concentrations range between 1.5 and 4.5 • C (Pachauri and Meyer, 2014).On regional scales and in terms of the hydrological cycle, the uncertainties are even larger.Reducing the uncertainties of climate change projections, in order to make optimal mitigation and adaptation decisions, is thus urgent and has a tremendous economic value (Hope, 2015).
How can the uncertainties of climate projections be reduced?There is overwhelming evidence from the literature that the leading cause of uncertainty is the representation of clouds, largely due to their influence upon the reflection of incoming solar radiation (Boucher et al., 2013;Bony et al., 2015;Schneider et al., 2017).Horizontal resolutions of current global climate models are typically in the range of 50-200 km.At this resolution, clouds must be parametrized, based on theoretical and semiempirical considerations.Refining the resolution to the kilometer scale would allow the explicit representation of deep convective clouds (thunder-storms and rain showers; e.g., Fig. 1).Studies using regional climate models demonstrate that at this resolution, the representation of precipitation is dramatically improved (Kendon et al., 2014;Ban et al., 2015).The representation of shallow cumulus cloud layers, which are common over significant fractions of the tropical oceans, requires even higher resolution.The United States National academy of sciences has thus recommended (Natinal Research Council, 2012) developing "high-end global models that execute efficiently ... , enabling cloud-resolving atmospheric resolutions (2-4 km) and eddy-resolving ocean resolutions (5 km)" in the near future.
While the scientific prospects of such an undertaking are highly promising, the computational implications are significant.Increasing the horizontal resolution from 50 to 2 km increases the computational effort by at least a factor of 25 3  = 15 000.Such simulations will only be possible on future extreme-scale high-performance computers.Furthermore, power constraints have been driving the widespread adoption of many-core accelerators in leading edge supercomputers and the weather and climate community is struggling to migrate the large existing codes to these architectures and use them efficiently.
But what does efficient mean?While concerns of the total cost of ownership of a high-performance computing (HPC) system have shifted the focus from peak floating point performance towards improving power efficiency, it is not clear what the right efficiency metric is for a fully fledged climate model.Today, floating point operations are around 100× cheaper than data movement in terms of time and 1000× cheaper in terms of energy, depending on where the data come from (Borkar and Chien, 2011;Shalf et al., 2011).Thus, while focusing on floating point operations was very relevant 25 years ago, it has lost most of this relevance today.Instead, domain-specific metrics may be much more applicable to evaluate and compare application performance.A metric often used for climate models is the throughput achieved by the simulation measured in simulated years per wall clock day (SYPD; see Balaji et al. (2017) for a detailed discussion on metrics).For global atmospheric models, a suitable nearterm target is to conduct decade-long simulations and to participate in the Atmospheric Model Intercomparison Project (AMIP) effort. 1 Such simulations require a 36-year long simulation for the period 1979-2014, driven by observed atmospheric greenhouse gas concentrations and sea-surface temperatures.Within the context of current climate modeling centers, such a simulation would be feasible for an SYPD greater than or equal to 0.2-0.3.At such a rate the simulation would take up to several months.However, domain-specific metrics such as SYPD are very dependent on the specific problem and approximations in the code under consideration and are often hard to compare.Ideally comparisons would be performed for production-quality global atmospheric models that have been extensively validated for climate simulations and cover the full (non-hydrostatic and compressible) dynamics and the entire suite of model parameterizations.
With the SYPD metric alone, it is hard to assess how efficiently a particular computing platform is used.Efficiency of use is particularly important because, on the typical scale of climate simulations, computing resources are very costly and energy intensive.Thus, running high-resolution climate simulations also faces a significant computer science problem when it comes to computational efficiency.As mentioned before, floating point efficiency is often not relevant for state-of-the-art climate codes.Not only does counting floating point operations per second (flop s −1 ) not reflect the actual (energy) costs well, but the typical climate code has very low arithmetic intensity (the ratio of floating point operations to consumed memory bandwidth).Attempts to increase the arithmetic intensity may increase the floating point rate, but it is not clear if it improves any of the significant metrics (e.g., SYPD).However, solely focusing on memory bandwidth can also be misleading.Thus, we propose memory usage efficiency (MUE), a new metric that considers the efficiency of the code's implementation with respect to input/output (I/O) complexity bounds as well as the achieved system memory bandwidth.
In summary, the next grand challenge of climate modeling is refining the grid spacing of the production model codes to the kilometer scale, as it will allow addressing longstanding open questions and uncertainties on the impact of anthropogenic effects on the future of our planet.Here, we address this great challenge and demonstrate the first simulation of a production-level atmospheric model, delivering 0.23 (0.043) SYPD at a grid spacing of 1.9 km (930 m), sufficient for AMIP-type simulations.Further, we evaluate the efficiency of these simulations using a new memory usage efficiency metric.

Current state of the art
Performing global kilometer-scale climate simulations is an ambitious goal (Palmer, 2014), but a few kilometer-scale landmark simulations have already been performed.While arguably not the most relevant metric, many of the studies have reported sustained floating point performance.In 2007, Miura et al. (2007) performed a week-long simulation with a horizontal grid spacing of 3.5 km with the Nonhydrostatic Icosahedral Atmospheric Model (NICAM) on the Earth Simulator, andin 2013 Miyamoto et al. (2013) performed a 12 h long simulation at a grid spacing of 870 m on the K computer, achieving 230 Tflop s −1 double-precision performance. 2In 2014, Skamarock et al. (2014) performed a 20-day long simulation with a horizontal grid spacing of 3 km with the Model 2 1 Tflop s −1 = 10 12 flop s −1 .for Prediction Across Scales (MPAS) and later, in 2015, participated in the Next Generation Global Prediction System (NGGPS) model intercomparison project (Michalakes et al., 2015) at the same resolution and achieved 0.16 SYPD on the full National Energy Research Scientific Computing Center (NERSC) Edison system.In 2015, Bretherton and Khairoutdinov (2015) simulated several months of an extended aquaplanet channel at a grid spacing of 4 km using the System for Atmospheric Modeling (SAM).Yashiro et al. (2016) were the first to deploy a weather code on the TSUBAME system accelerated using graphics processing units (GPUs).The fully rewritten NICAM model sustained a double-precision performance of 60 Tflop s −1 on 2560 GPUs of the TSUB-AME 2.5 supercomputer.In 2016, Yang et al. (2016a) implemented a fully implicit dynamical core at 488 m grid spacing in a β-plane channel achieving 7.95 Pflop s −1 on the Taihu-Light supercomputer. 3  The optimal numerical approach for high-resolution climate models may depend on the details of the target hardware architecture.For a more thorough analysis, the physical propagation of information in the atmosphere has to be considered.While many limited-area atmospheric models use a filtered set of the governing equations that suppresses sound propagation, these approaches are not precise enough for global applications (Davies et al., 2003).Thus, the largest physical group velocity to face in global atmospheric models is the speed of sound.The speed of sound in the atmosphere amounts to between 280 and 360 m s −1 .Thus in a time span of an hour, the minimum distance across which information needs to be exchanged amounts to about 1500 km, corresponding to a tiny fraction of 1.4 % of the earth's sur-3 1 Pflop s −1 = 10 15 flop s −1 .face.However, many numerical schemes exchange information at much larger rates.For instance, the popular pseudospectral methodology (e.g., ECMWF, 2016) requires Legendre and Fourier transforms between the physical grid and the spherical harmonics and thus couples them globally at each time step.Similarly, semi-Lagrangian semi-implicit timeintegration methods require the solution of a Helmholtztype elliptical equation (Davies et al., 2005), which implies global communication at each time step.Both methods use long time steps, which may partially mitigate the additional communication overhead.While these methods have enabled fast and accurate solutions at intermediate resolution in the past, they are likely not suited for ultrahigh-resolution models, as the rate of communication typically increases proportionally to the horizontal mesh size.Other approaches use time-integration methods with only locally implicit solvers (e.g., Giraldo et al., 2013), where they try to retain the advantages of fully implicit methods but only require nearestneighbor communication.
The main advantage of implicit and semi-implicit approaches is that they allow large acoustic Courant numbers α c = c t/ x, where c denotes the speed of sound and t and x the time step and the grid spacing, respectively.For instance, Yang et al. (2016a) use an acoustic Courant number up to 177; i.e., their time step is 177 times larger than in a standard explicit integration (this estimate is based on the x = 488 m simulation with t = 240 s).In their case, such a large time step may be chosen, as the sound propagation is not relevant for weather phenomena.
However, although implicit methods are unconditionally stable (stable irrespectively of the time step used), there are other limits to choosing the time step.In order to appropriately represent advective processes with typical veloc- ities up to 100 m s −1 and associated phase changes (e.g., condensation and fallout of precipitation), numerical principles dictate an upper limit to the advective Courant number α u = |u| t/ x, where |u| denotes the largest advective wind speed, e.g., Ewing and Wang (2001).The specific limit for α u depends on the numerical implementation and timestepping procedures.For instance, semi-Lagrangian schemes may produce accurate results for values of α u up to 4 or even larger.For most standard implementations, however, there are much more stringent limits, often requiring that α u ≤ 1.For the recent study of Yang et al. (2016a), who used a fully implicit scheme with a time step of 240 s, the advective Courant number reaches values of up to α u = 4.2 and 17.2 for the x = 2 km and 488 m simulation, respectively.Depending upon the numerical approximation, such a large Courant number will imply significant phase errors (Durran, 2010) or even a reduction in effective model resolution (Ricard et al., 2013).In order to produce accurate results, the scheme would require a significantly smaller time step and would require reducing the time step with decreasing grid spacing.For the NGGPS intercomparison the hydrostatic Integrated Forecasting System (IFS) model used a time step of 120 s at 3.125 km (Michalakes et al., 2015), the regional semi-implicit, semi-Lagrangian, fully non-hydrostatic model MC2 used a time step of 30 s at 3.0 km (Benoit et al., 2002), and Météo France in their semi-implicit Application of Research to Operations at Mesoscale (AROME) model use a time step of 45 and 60 s for their 1.3 and 2.5 km implementations, respectively.Since the IFS model is not a non-hydrostatic model, we conclude that even for fully implicit, global, convection-resolving climate simulations at ∼ 1-2 km grid spacing, a time step larger than 40-60 s cannot be considered a viable option.
In the current study we use the split-explicit time-stepping scheme with an underlying Runge-Kutta time step (Wicker and Skamarock, 2002) of the Consortium for Small-Scale Modeling (COSMO) model (see Sect. 3.1).This scheme uses sub-time stepping for the fast (acoustic) modes with a small time step τ and explicit time stepping for all other modes with a large time step t = n τ .Most of the computations are required on the large time step, with α u ≤ 2, depending on the combination of time-integration and advection scheme.In contrast to semi-implicit, semi-Lagrangian, and implicit schemes, the approach does not require solving a global equation and all computations are local (i.e., vertical columns exchange information merely with their neighbors).The main advantage of this approach is that it exhibits -at least in theory -perfect weak scaling. 4This also applies to the communication load per sub-domain, when applying horizontal domain decomposition.

Model description
For the simulations presented in this paper, we use a refactored version 5.0 of the regional weather and climate code developed by COSMO (COSMO, 2017;Doms and Schättler, 1999;Steppeler et al., 2002) and -for the climate mode -the Climate Limited-area Modelling (CLM) Community (CLM-Community, 2017).At kilometer-scale resolution, COSMO is used for numerical weather prediction (Richard et al., 2007;Baldauf et al., 2011) and has been thoroughly evaluated for climate simulations in Europe (Ban et al., 2015;Leutwyler et al., 2017).The COSMO model is based on the thermo-hydrodynamical equations describing non-hydrostatic, fully compressible flow in a moist atmosphere.It solves the fully compressible Euler equations using finite difference discretization in space (Doms and Schättler, 1999;Steppeler et al., 2002).For time stepping, it uses a split-explicit three-stage second-order Runge-Kutta discretization to integrate the prognostic variables forward in time (Wicker and Skamarock, 2002).For horizontal advection, a fifth-order upwind scheme is used for the dynamic variables and a Bott scheme (Bott, 1989) is used for the moisture variables.The model includes a full set of physical parametrizations required for real-case simulations.For this study, we use a single-moment bulk cloud microphysics scheme that uses five species (cloud water, cloud ice, rain, snow, and graupel) described in Reinhardt and Seifert (2006).For the full physics simulations, additionally a radiation scheme (Ritter and Geleyn, 1992), a soil model (Heise et al., 2006), and a sub-grid-scale turbulence scheme (Raschendorfer, 2001) are switched on.
The COSMO model is a regional model and physical space is discretized in a rotated latitude-longitude-height coordinate system and projected onto a regular, structured, three-dimensional grid (IJK).In the vertical, a terrainfollowing coordinate supports an arbitrary topography.The spatial discretization applied to solve the governing equations generates so called stencil computations (operations that require data from neighboring grid points).Due to the strong anisotropy of the atmosphere, implicit methods are employed in the vertical direction, as opposed to the explicit methods applied to the horizontal operators.The numerical discretization yields a large number of mixed compact horizontal stencils and vertical implicit solvers, strongly connected via the data dependencies on the prognostic variables.Figure 2 shows the data dependency graph of the computational kernels of the dynamical core of COSMO used in this setup, where each computational kernel corresponds to a complex set of fused stencil operations in order to maximize the data locality of the algorithm.Each computational kernel typically has multiple input and output fields and thus data dependencies as indicated with the edges of the Computational Directed Acyclic Graph (CDAG) shown in Fig. 2.
Figure 2. Illustration of the computational complexity of the COSMO dynamical core, using a CDAG.The nodes of the graph represent computational kernels (blue ellipses) that can have multiple input and output variables, halo updates (green rectangles), and boundary condition operations (orange rectangles).The edges of the graph represent data dependencies.Since the dynamical core of COSMO has been written using a DSL, the CDAG can be produced automatically using an analysis back end of the DSL compiler.The lengthy serial section in the middle of the figure corresponds to the sound waves sub-stepping in the fast-wave solver.The parallel section in the upper left corresponds to the advection of the seven tracer variables.CDAGs can automatically be produced from C++ code.
Maximizing the data locality of these stencil computations is crucial to optimize the time to solution of the application.
To enable the running of COSMO on hybrid highperformance computing systems with GPU-accelerated compute nodes, we rewrote the dynamical core of the model, which implements the solution to the non-hydrostatic Euler equations, from Fortran to C++ (Fuhrer et al., 2014).This enabled us to introduce a new C++ template librarybased domain-specific language (DSL) we call Stencil Loop Language (STELLA) (Gysi et al., 2015a) to provide a performance-portable implementation for the stencil algorithmic motifs by abstracting hardware-dependent optimization.Specialized back ends of the library produce efficient code for the target computing architecture.Additionally, the DSL supports an analysis back end that records the access patterns and data dependencies of the kernels shown in Fig. 2.This information is then used to determine the amount of memory accesses and to assess the memory utilization efficiency.For GPUs, the STELLA back end is written in CUDA, and other parts of the refactored COSMO implementation use OpenACC directives (Lapillonne and Fuhrer, 2014).
Thanks to this refactored implementation of the model and the STELLA DSL, COSMO is the first fully capable weather and climate model to go operational on GPU-accelerated supercomputers (Lapillonne et al., 2016).In the simulations we analyze here, the model was scaled to nearly 5000 GPUaccelerated nodes of the Piz Daint supercomputer at the Swiss National Supercomputing Centre. 5To our knowledge, COSMO is still the only production-level weather and climate model capable of running on GPU-accelerated hardware architectures.

Hardware description
The experiments were performed on the hybrid partition of the Piz Daint supercomputer, located at the Swiss National Supercomputing Centre (CSCS) in Lugano.At the time when our simulation was performed, this supercomputer consisted of a multi-core partition, which was not used in this study, as well as a hybrid partition of 4'936 Cray XC50 nodes.These hybrid nodes are equipped with an Intel E5-2690 v3 CPU (code name Haswell) and a PCIe version of the NVIDIA Tesla P100 GPU (code name Pascal) with 16 GBytes secondgeneration high-bandwidth memory (HBM2). 6The nodes of both partitions are interconnected in one fabric (based on Aries technology) in a Dragonfly topology (Alverson et al., 2012).

Energy measurements
We measure the energy to solution of our production-level runs on Piz Daint using the methodology established and de-O.Fuhrer et al.: Near-global climate simulation at 1 km resolution scribed in detail by Fourestey et al. (2014).The resource utilization report provided on Cray systems for a job provides the total energy (E n ) consumed by each application run on N compute nodes.The total energy (which includes the interconnect) is then computed using where τ is the wall time for the application, the N/4 × 100 W × τ term accounts for the 100 W per blade contribution from the Aries interconnect, and the 0.95 on the denominator adjusts for AC/DC conversion.

Simulation setup and verification
When pushing ahead the development of global highresolution climate models, there are two complementary pathways.First, one can refine the resolution of existing global climate models (Miura et al., 2007).Second, one may alternatively try to expand the computational domain of high-resolution limited-area models towards the global scale (Bretherton and Khairoutdinov, 2015).Here we choose the latter and develop a near-global model from the limitedarea high-resolution model COSMO.
We perform near-global simulations for a computational domain that extends to a latitude band from 80 • S to 80 • N, which covers 98.4 % of the surface area of planet Earth.The simulation is inspired by the test case used by the winner of the 2016 Gordon Bell Prize (Yang et al., 2016a).
The simulations are based on an idealized baroclinic wave test (Jablonowski and Williamson, 2006), which can be considered a standard benchmark for dynamical cores of atmospheric models.The test describes the growth of initial disturbances in a dynamically unstable westerly jet stream into finite-amplitude low-and high-pressure systems.The development includes a rapid transition into a nonlinear regime, accompanied by the formation of sharp meteorological fronts, which in turn trigger the formation of complex cloud and precipitation systems.
The setup uses a two-dimensional (latitude-height) analytical description of a hydrostatically balanced atmospheric base state with westerly jet streams below the tropopause, in both hemispheres.A large-scale local Gaussian perturbation is then applied to this balanced initial state which triggers the formation of a growing baroclinic wave in the Northern Hemisphere, evolving over the course of several days (Fig. 3).To allow moist processes, the dry initial state is extended with a moisture profile (Park et al., 2013) and the parametrization of cloud-microphysical processes is activated.
The numerical problem is discretized on a latitudelongitude grid with up to 36 000 × 16 001 horizontal grid points for the 930 m simulation.In the zonal direction the domain is periodic and at 80 • north/south confined by boundary conditions, relaxing the evolving solution against the initial conditions in a 500 km wide zone.The vertical direction is discretized using 60 stretched model levels, spanning from the surface to the model top at 40 km.The respective layer thickness widens from 20 m at the surface to 1.5 km near the domain top.
For the verification against previous dry simulations, a simulation at 47 km grid spacing is used.The evolution of the baroclinic wave (Fig. 3) very closely follows the solution originally found by Jablonowski and Williamson (2006, see their Fig. 5).At day 8 of the simulation, three low-pressure systems with frontal systems have formed, and at day 10 wave breaking is evident.At this time, the surface temperature field shows cutoff warm-core cyclonic vortices.
The evolution in the moist and the dry simulation at very high resolution is shown in Fig. 4. The high-resolution simulations reveal the onset of a secondary (likely barotropic) instability along the front and the formation of small-scale warm-core vortices with a spacing of up to 200-300 km.The basic structure of these vortices is already present in the dry simulation, but they exhibit considerable intensification and precipitation in the moist case.The formation of a large series of secondary vortices is sometimes observed in maritime cases of cyclogenesis (Fu et al., 2004;Ralph, 1996) but appears to be a rather rare phenomena.However, it appears that cases with one or a few secondary vortices are not uncommon and they may even be associated with severe weather (Ludwig et al., 2015).
The resulting cloud pattern is dominated by the commashaped precipitating cloud that forms along the cold and occluded fronts of the parent system (Fig. 1).The precipitation pattern is associated with stratiform precipitation in the head of the cloud and small patches with precipitation rates exceeding 5 mm h −1 in their tail, stemming from small embedded convective cells.Looking closely at the precipitation field (Fig. 4e), it can be seen that the secondary vortices are colocated with small patches of enhanced precipitation.

Efficiency metric
In the past, the prevalent metric to measure the performance of an application was the number of floating point operations executed per second (flop s −1 ).It used to pay off to minimize the number of required floating point operations in an algorithm and to implement the computing system in such a way that a code would execute many such operations per second.Thus, it made sense to assess application performance with the flop s −1 metric.However, the world of supercomputing has changed.Floating point operations are now relatively cheap.They cost about a factor of 1000 less if measured in terms of energy needed to move operands between memory and registers, and they execute several hundred times faster compared to the latency of memory operations.Thus, algorithmic optimization today has to focus on minimizing data movement, and it may even pay off to re-  compute certain quantities if this avoids data movements.In fact, a significant part of the improvements in time to solution in the refactored COSMO code are due to the recomputation of variables that were previously stored in memory -the original code was written for vector supercomputers in the 1990s.On today's architectures, it may even pay off to replace a floating point minimal sparse algorithm with a block-sparse algorithm, into which many trivial zero operations have been introduced (Hutter et al., 2014).Using the flop s −1 metric to characterize application performance could be very misleading in these cases.
A popular method to find performance bottlenecks of both compute-and memory-bound applications is the roofline model (Williams et al., 2009).However, assessing performance of an application simply in terms of sustained memory bandwidth, which is measured in bytes per second, would be equally deceptive.For example, by storing many variables in memory, the original implementation of COSMO introduces an abundance of memory movements that boost the sustained memory bandwidth but are inefficient on modern processor architectures, since these movements cost much more than the recomputation of these variables.
Furthermore, full-bandwidth utilization on modern accelerator architectures, such as the GPUs used here, requires very specific conditions.Optimizing for bandwidth utilization can lead the application developer down the wrong path because unaligned, strided, or random accesses can be intrinsic to an underlying algorithm and severely impact the bandwidth (Lee et al., 2010).Optimizations to improve alignment or reduce the randomness of memory accesses may introduce unnecessary memory operations that could be detrimental to time to solution or energy efficiency.
Thus, in order to properly assess the quality of our optimizations, one needs to directly consider data movement.
Here we propose a method how this can be done in practice for the COSMO dynamical core and present the resulting MUE metric.
Our proposal is motivated by the full-scale COSMO runs we perform here, where 74 % of the total time is spent in local stencil computations.The dependencies between stencils in complex stencil programs can be optimized with various inlining and unrolling techniques (Gysi et al., 2015a).Thus, to be efficient, one needs to achieve maximum spatial and temporal data reuse to minimize the number of data movement operations and perform them at the highest bandwidth.
In order to assess the efficiency in memory usage of an implementation on a particular machine, one needs to com-pare the actual number of data transfers executed, 7 which we denote with D, with the necessary data transfers Q of the algorithm.Q is the theoretical lower bound (Hong and Kung, 1981) of the number of memory accesses required to implement the numerical algorithm.
The MUE can be intuitively interpreted as how well the code is optimized both for data locality and bandwidth utilization; i.e., if MUE = 1, the implementation reaches the memory movement lower bound of the algorithm and performs all memory transfers with maximum bandwidth.Formally, where B and B represent the bandwidth achieved by an implementation and maximum achievable bandwidth, respectively.
In order to compute the MUE for COSMO, we developed a performance model that combines the theoretical model from Hong and Kung (1981), hypergraph properties, and graph partitioning techniques to estimate the necessary data transfer, Q, from the CDAG information of COSMO.By partitioning the CDAG into subcomputations that satisfy certain conditions imposed by the architecture, this model determines the theoretical minimum amount of memory transfers by maximizing the data locality inside the partitions.To approximate the number of actual transfers D, we use the same technique to evaluate the quality of current COSMO partitioning.The values B and B were measured empiricallythe former by profiling our application and the latter by a set of micro-benchmarks.
The details of how to determine the MUE for COSMO are given in Appendix A. The MUE metric cannot be used to compare different algorithms but is a measure of the efficiency of an implementation of a particular algorithm on a particular machine, i.e., how much data locality is preserved and what the achieved bandwidth is.It thus complements other metrics such as SYPD and may complement popular metrics such as flop s −1 or memory bandwidth.As compared to the frequently applied roofline model, the MUE metric also includes the schedule of operations, not simply the efficient use of the memory subsystem.It is thus a stronger but also more complex metric.

Performance results
To establish a performance baseline for global kilometerscale simulations, we here present a summary of the key performance metrics for the simulations at 930 m, 1.9 km, and 47 km grid spacing, as well as a study of weak and strong scalability.
7 Here, we define data transfers as load/stores from system memory.

Weak scalability
Until this study the GPU version of COSMO had only been scaled up to 1000 nodes of the Piz Daint supercomputer (Fuhrer et al., 2014), while the full machine -at the time of the experiment -provides 4932 nodes.The scaling experiments (weak and strong) were performed with the dry simulation setup.Including a cloud microphysics parametrization, as used in the moist simulation (Fig. 4), increases time to solution by about 10 %.Since microphysics does not contain any inter-node communication and is purely local to a single column of grid points, we do not expect an adverse impact on either weak or strong scalability.
Figure 5 shows weak scaling for three per-node domain sizes ranging from 128 × 128 to 256 × 256 grid points in the horizontal, while keeping the size in the vertical direction fixed at 60 grid points.In comparison, on 4888 nodes, the high-resolution simulations at 930 m and 1.9 km horizontal grid spacing correspond to a domain size per node of about 346 × 340 and 173 × 170. 8The model shows excellent weak scalability properties up to the full machine, which can at least partially be explained by the nearest-neighbor haloexchange pattern.This property reduces the complexity of COSMO's scalability on Piz Daint to the strong scaling behavior of the code.Essentially, the number of grid points per node determines the achievable time to solution of a given problem.

Strong scalability
From earlier scaling experiments of the STELLA library (Gysi et al., 2015b) and the GPU version of COSMO (Fuhrer et al., 2014;Leutwyler et al., 2016), it is known that, for experiments in double precision on Tesla K20x, linear scaling is achieved as long as the number of grid points per node exceeds about 64 × 64 to 128 × 128 grid points per horizontal plane.In comparison, single-precision measurements on Tesla P100 already start to saturate at a horizontal domain size of about 200 × 200 grid points per node (Fig. 6), 9 corresponding to about 32 nodes for a 19 km setup and about 1000 nodes for a 3.7 km setup.Since with the 930 m and 1.9 km setup we are already in 930 m or close to 1.9 km, the linear scaling regime on the full machine, we here chose a coarser horizontal grid spacing of 3.7 and 19 km.
The lower limit on the number of nodes is given by the GPU memory of 16 GB.In addition to the GPU benchmarks (filled symbols), we measured the performance with the CPU version of COSMO (empty symbols), using 12 MPI (Message Passing Interface) ranks per CPU, i.e., one MPI rank per Haswell core. 10 Exceeding 1000 nodes (38 × 38 grid points per node) execution on CPUs yields a shorter time to solution than on GPUs.

Time to solution
On 4888 nodes, a 10-day long moist simulation at 930 m grid spacing required a wall time of 15.3 h at a rate of 0.043 SYPD (Table 1), including a disk I/O load of five 3-D fields and seven 2-D fields, 11 written periodically every 12 h of the sim-9 COSMO supports two floating point formats to store numbersdouble and single precision -which can be chosen at compile time. 10The CPU measurements were performed on the hybrid partition of Piz Daint as well, since the multi-core partition is much smaller.
11 The standard I/O routines of COSMO require global fields on a single node, which typically do not provide enough storage to hold a global field.To circumvent this limitation, a binary I/O mode, allowing each node to write its output to the file system, was implemented.
Table 1.Time compression (SYPD) and energy cost (MWh SY −1 ) for three moist simulations: at 930 m grid spacing obtained with a full 10-day simulation, at 1.9 km from 1000 steps, and at 47 km from 100 steps.We have also conducted a 1.9 km simulation with a full set of physical parametrizations switched on, and this increases the time to solution by 27 % relative to the moist simulations presented here.
In conclusion, the results show that AMIP-type simulations at horizontal grid spacings of 1.9 km using a fully fledged atmospheric model are already feasible on Europe's highest-ranking supercomputer.In order to reach resolutions of 1 km, a further reduction of time to solution of at least a factor of 5 is required.
In the remainder of this section, we attempt a comparison of our results at 1.9 km against the performance achieved by Yang et al. (2016a) in their 2 km simulation (as their information for some of the other simulations is incomplete).They argue that for an implicit solver the time step can be kept at 240 s independent of the grid spacing and report (see their Figs.7 and 9) values of 0.57 SYPD for a grid spacing of 2 km, on the full TaihuLight system.As explained in Sect.2, such large time steps are not feasible for global climate simulations resolving convective clouds (even when using implicit solvers), and a maximum time step of 40-80 s would very likely be needed; this would decrease their SYPD by a factor of 3 to 6. Furthermore, their simulation covers only 32 % of the Earth's surface (18 • N to 72 • ) but uses twice as many levels; this would further reduce their SYPD by a factor of 1.5.Thus we estimate that the simulation of Yang et al. (2016a) at 2 km would yield 0.124 to 0.064 SYPD when accounting for these differences.In comparison, our simulation at 1.9 km yields 0.23 SYPD; i.e., it is faster by at least a factor of 2.5.Note that this estimate does not account for additional simplifications in their study (neglect of microphysical processes, spherical shape of the planet, and topography).In summary, while a direct comparison with their results is difficult, we ar-  gue that our results can be used to set a realistic baseline for production-level GCM performance results and represent an improvement by at least a factor of 2 with respect to previous results.

Energy to solution
Based on the power measurement (see Sect. 3.3) we now provide the energy cost of full-scale simulations (Table 1) using the energy cost unit MWh per simulation year (MWh/SY).
The 10-day long simulation at 930 m grid spacing running on 4888 nodes requires 596 MWh SY −1 , while the cost of the simulation at 1.9 km on 4888 nodes is 97.8 MWh SY −1 .For comparison, the coarse resolution at 47 km simulation on a reduced number of nodes ( 18) requires only 0.01 MWh SY −1 .A 30-year AMIP-type simulation (with full physics) at a horizontal grid spacing of 1 km on the Piz Daint system would take 900 days to complete, resulting in an energy cost of approximately 22 GWh -which approximately corresponds to the consumption of 6500 households during 1 year.
Again, we attempt a comparison with the simulations performed by Yang et al. (2016a).The Piz Daint system reports a peak power draw of 2052 kW when running the highperformance LINPACK (HPL) benchmark.The sustained power draw when running the 930 m simulation amounted to 1059.7 kW, thus 52 % of the HPL value.The TaihuLight system reports a sustained power draw for the HPL benchmark of 15 371 kW (TOP500, 2017).While Yang et al. (2016a) do not report power consumption of their simulations, we expect the simulations on Piz Daint to be at least 5 times more power efficient, even when assuming similar achieved SYPD (see above).12

Simulation efficiency
In view of the energy consumption, it is paramount to consider the efficiency of the simulations executed.We do this here by considering the memory usage efficiency metric introduced in Sect. 4.
To estimate a solution for the optimization problem described in Appendix A, Eq. (A2), we use the METIS li- brary (Karypis and Kumar, 2009).The results are presented in Table 2.The METIS Q column is the approximation of the lower bound Q obtained from the performance model using the METIS library.The COSMO D column is the evaluation of current COSMO partitioning in our model.The no-opt D column shows the amount of data transfers of COSMO if no data locality optimization techniques are applied, like in the original Fortran version of the code.Since the original CDAG is too large for the minimization problem of the performance model, three different simplified versions of the CDAG, which focus on the accesses to three different layers of the memory hierarchy of the GPU, are studied: registers, shared memory, and L2 cache.
The model shows how efficient COSMO is in terms of data transfers -it generates only 14, 66, and 13 % more data transfers than the lower bound in the register, shared memory, and L2 cache layers, respectively.The sophisticated data locality optimization techniques of the STELLA implementation of the dynamical core of COSMO result in very good data reuse.On the P100, all memory accesses to/from DRAM go through the L2 unit.Therefore, we focus on the efficiency of this unit such that where D L2 and Q L2 stand for estimated number of main memory operations and its lower bound, respectively.The model also can estimate the efficiency of our optimizations.Assuming that we can reach the peak achievable bandwidth B if we perform no data locality optimizations ( D), then It can be seen that MUE MUE no_opt = 0.67 0.44 = 1.52. (5) This result shows the importance of data locality optimizations -the optimized implementation is more than 50 % faster than a potential version that achieves peak bandwidth while using no data locality techniques.
To validate the model results, we have conducted singlenode runs with and without data locality optimizations.
Bandwidth measurements of the non-optimized version are very close to maximum achievable bandwidth (not shown).The fact that the two ratios in the third column of Table 3 agree is testimony to the high precision of the performance model.

Conclusions
The work presented here sets a new baseline for fully fledged kilometer-scale climate simulations on a global scale.Our implementation of the COSMO model that is used for production-level numerical weather predictions at Me-teoSwiss has been scaled to the full system on 4888 nodes of Piz Daint, a GPU-accelerated Cray XC50 supercomputer at the Swiss National Supercomputing Centre (CSCS).The dynamical core has been fully rewritten in C++ using a DSL that abstracts the hardware architecture for the stencil algorithmic motifs and enables sophisticated tuning of data movements.Optimized back ends are available for multi-core processors with OpenMP, GPU accelerators with CUDA, and for performance analysis.
The code shows excellent strong scaling up to the full machine size when running at a grid spacing of 4 km and below, on both the P100 GPU accelerators and the Haswell CPU.For smaller problems, e.g., at a coarser grid spacing of 47 km, the GPUs run out of parallelism and strong scalability is limited to about 100 nodes, while the same problem continues to scale on multi-core processors to 1000 nodes.Weak scalability is optimal for the full size of the machine.Overall, performance is significantly better on GPUs as compared to CPUs for the high-resolution simulations.
The simulations performed here are based on the idealized baroclinic wave test (Jablonowski and Williamson, 2006), which is part of the standard procedure to test global atmospheric models.Our results are validated against the original solution published in this paper.
We measured time to solution in terms of simulated years per wall clock day (SYPD) for a near-global simulation on a latitude band from 80 • S to 80 • N that covers 98.4 % of planet Earth's surface.Running on 4888 P100 GPUs of Piz Daint, currently Europe's largest supercomputer, we measured 0.23 SYPD at a 1.9 km grid spacing.This performance is adequate for numerical weather predictions and 10-year scale climate studies.Simulations at this resolution had an energy cost of 97.8 MWh SY −1 .
In a moist simulation with 930 m horizontal grid spacing, we observed the formation of frontal precipitating systems, containing embedded explicitly resolved convective motions, and additional cutoff warm-core cyclonic vortices.The explicit representation of embedded moist convection and the representation of the resolved instabilities exhibits physically different behavior from coarser-resolution simulations.This is testimony to the usefulness of the high-resolution approach, as a much expanded range of scales and processes is simulated.Indeed it appears that for the current test case, the small vortices have not previously been noted, as the test case appears to converge for resolutions down to 25 km (Ullrich et al., 2015), but they clearly emerge at kilometer-scale resolution.
These results serve as a baseline benchmark for global climate model applications.For the 930 m experiment we achieved 0.043 SYPD on 4888 GPU-accelerated nodes, which is approximately one-seventh of the 0.2-0.3SYPD required to conduct AMIP-type simulations.The energy cost for simulations at this horizontal grid spacing was 596 MWh SY −1 .
Scalable global models do not use a regularly structured grid such as COSMO, due to the "pole problem".Therefore, scalable global models tend to use quasi-uniform grids such as cubed-sphere or icosahedral grids.We believe that our results apply directly to global weather and climate models employing structured grids and explicit, split-explicit, or horizontally explicit, vertically implicit (HEVI) time discretizations (e.g., FV3, NICAM).Global models employing implicit or spectral solvers may have a different scaling behavior.
Our work was inspired by the dynamical core solver that won the 2016 Gordon Bell award at the Supercomputing Conference (Yang et al., 2016a).The goal of this award is to reward high performance achieved in the context of a realistic computation.A direct comparison with the results reported there is difficult, since Yang et al. (2016a) were running a research version of a dynamical core solver and COSMO is a fully fledged regional weather and climate model.Nevertheless, our analysis indicates that our benchmarks represent an improvement both in terms of time to solution and energy to solution.As far as we know, our results represent the fastest (in terms of application throughput measured in SYPD at 1 km grid spacing) simulation of a production-ready, nonhydrostatic climate model on a near-global computational domain.
In order to reach the 3-5 SYPD performance necessary for long climate runs, simulations would be needed that run 100 times faster than the baseline we set here.Given that Piz Daint with NVIDIA P100 GPUs is a multi-petaflop system, the 1 km scale climate runs at 3-5 SYPD performance represent a formidable challenge for exascale systems.As such simulations are of interest to scientists around the globe, we propose that this challenge be defined as a goal post to be reached by the exascale systems that will be deployed in the next decade.
Finally, we propose a new approach to measure the efficiency of memory bound application codes, like many weather and climate models, running on modern supercomputing systems.Since data movement is the expensive commodity on modern processors, we advocate that the code's performance on a given machine be characterized in terms of data movement efficiency.We note that both detailed mathematical analysis and the general applicability of our MUE metric is beyond the scope of this paper and requires additional publication.In this work we show a use case of the MUE metric and demonstrate its precision and usefulness in assessing memory subsystem utilization of a machine, using the I/O complexity lower bound as the necessary data transfers, Q, of the algorithm.With the time to solution and the maximum system bandwidth B, it is possible to determine the memory usage efficiency that captures how well the code is optimized both for data locality and bandwidth utilization.It will be interesting to investigate the MUE metric in future performance evaluations for weather and climate codes on high-performance computing systems.
Code and data availability.The particular version of the COSMO model used in this study is based on the official version 5.0 with many additions to enable GPU capability and available under license (http://www.cosmo-model.org/content/consortium/licencing.htm for more information).These developments are currently in the process of being reintegrated into the mainline COSMO version.COSMO may be used for operational and for research applications by the members of the COSMO consortium.Moreover, within a license agreement, the COSMO model may be used for operational and research applications by other national (hydro-)meteorological services, universities, and research institutes.The model output data will be archived for a limited amount of time and are available on request.able data dependency information is lost, and one cannot argue about the optimality and possible rearrangement of the operations fused within a kernel.
We now describe how we determine coarsening strategies of the COSMO CDAG for three levels of the memory hierarchy of our target system, registers, shared memory, and L2 cache.
1. Registers (65'536): the COSMO GPU implementation assigns all computations accessing variables with the same IJ coordinate to the same GPU thread and use registers to reuse values in the K direction.To model this memory hierarchy, it is only necessary to keep the stencil accesses in the K direction.Thus, all accesses in the IJ plane are represented as a single vertex in the CDAG, which is then simplified to 781 variables and their dependencies among all 60 levels in K.
2. Shared memory (64 kB): the shared memory of the GPU is used to communicate values between the different compute threads.In order to model this layer, all different accesses in the K direction of a variable are represented as a single vertex in the CDAG, while all accesses in the IJ plane are kept.
3. L2 cache (4 MB): this last cache level before DRAM is used to store whole arrays (fields).In this layer all accesses to a variable in any direction are represented as a vertex in the CDAG.It keeps only the data dependencies among variables, irrespective of the offset and direction of the access.
Table A1 lists details about the CDAGs at each of our three levels.Memory capacity of the GPU for each of the three layers is then used as a constraint to derive the parameter S (see values in Table A1).Values of the estimation of Q, obtained from the performance model for the three memory levels are shown in Table 2.
To generate the whole CDAG, we used the STELLA analysis back end (Gysi et al., 2015b) to trace all local memory accesses to all fields.Based on the information from the access offsets and order of operations, we reconstruct the readwrite, write-read, and write-write dependencies.

A3 Maximum achievable bandwidth B
We now describe how we measure the memory usage efficiency in practice.We start by describing how to determine the maximum achievable bandwidth for COSMO stencils.Even though the Tesla P100 has a theoretical peak memory bandwidth of 720 GB s −1 (NVIDIA, 2016), we argue that this may not be achievable for real applications.A well-established method to measure the maximum achievable bandwidth of GPUs is the GPU STREAM benchmark (Deakin et al., 2016).Our tests show that the maximum achievable bandwidth for COPY is 557 GB s −1 if at least 30 MB double-precision numbers are copied (Fig. A1).However, stencil codes on multidimensional domains like COSMO require more complex memory access patterns that, even when highly tuned, cannot achieve the same bandwidth as STREAM due to architectural limitations.We identified the most common patterns and designed and tuned a set of micro-benchmarks that only mimic the memory access patterns without the computations to investigate the machine capability of handling memory accesses for stencils.They include aligned, unaligned, and strided patterns in all dimensions.All benchmarks operate on a 3-D domain of parametric size, on either single-or doubleprecision numbers.The results of a representative set of four chosen micro-benchmarks are shown in Fig. A1, together with the GPU STREAM, which operates on 1-D domain.The fastest stencil kernel (double-precision aligned COPY) reaches 510 GB s −1 .The slowdown compared to GPU STREAM is due to the more complex access pattern in the 3-D domain.Furthermore, using single-precision numbers further deteriorates the bandwidth on P100 (COPY (float) reaches 475 GB s −1 ).Our COSMO 930 m run uses predominantly single-precision numbers on a 346×340×60 domain, which results in 28.2 MB of data per field.Our measurements show that the maximum achievable bandwidth for this setup is 362 GB s −1 (in the best case of the simple COPY benchmark).We will use this upper-bound number as the maximum system bandwidth.The average measured memory bandwidth across all COSMO real-world stencils is 276 GB s −1 , which gives B B = 0.76.

Figure 1 .
Figure 1.Visualization of a baroclinic wave at day 10 of a simulation with 930 m grid spacing.White shading: volume rendering of cloud ice, cloud water, and graupel ≥ 10 −3 g kg −1 .Blue shading: isosurface of rain and snow hydrometeors ≥ 4 × 10 −2 g kg −1 .The white contours denote surface pressure.

Figure 3 .
Figure 3. Evolution of a baroclinic wave in a dry simulation with 47 km grid spacing on day 8 and 10: (a) surface pressure and (b) temperature on the 850 hPa pressure level (roughly 1.5 km above sea level).

Figure 4 .
Figure 4. Output of the baroclinic wave (day 10): (a, b) dry simulation with 1.9 km grid spacing, (c, d, e) moist simulation with 0.93 km grid spacing, (a, c) surface pressure, (b, d) temperature on the 850 hPa pressure level, and (e) precipitation.

Figure 5 .
Figure 5. Weak scalability on the hybrid P100 Piz Daint nodes, per COSMO time step of the dry simulation.

Figure A1 .
Figure A1.Bandwidth of the representative stencil benchmarks and GPU STREAM on Tesla P100.All kernels (except for GPU STREAM) operate on a 3-D domain.
While for the 47 km setup scalability already starts to saturate, even with the 18 nodes reported (Table1), the high-resolution simulations, with an approximate per-node domain size of 346 × 340 × 60 and 173 × 170 × 60, are still in a regime of good scalability.

Table 2 .
Data transfer cost estimations of the dynamical core for a theoretical lower bound (METIS), dynamical core implementation using the STELLA library (COSMO), and non-optimized dynamical core implementation (no-merging).

Table 3 .
Performance model verification results.Measured dynamical core time step execution time for the STELLA implementation optimized for data locality and the non-optimized implementation compared against the corresponding MUE metric.