Introduction
The weather- and climate-modelling community is currently seeing a
paradigm shift from limited-area models towards novel approaches involving global,
complex and irregular meshes. Yet regional models remain commonly used in
numerical weather prediction and to study past, current and future climate at
high spatial and temporal resolution over areas of specific interest.
A wealth of such models currently exists; they differ in their discretisation
of the computational grid, their implementation of the numerical solvers,
their parameterisation of physical processes and, most notably, in their
simulation results. Despite these
differences, these models share the common principle of nested modelling: regional
climate information is generated by supplying a set of initial conditions as
well as time-varying lateral boundary conditions (LBCs; large-scale
atmospheric fields such as wind, temperature, geopotential height and
hydrometeors) and lower boundary conditions (sea surface temperature, sea
ice) to the regional model. The idea behind this approach is that the LBCs
keep the regional climate model (RCM) solution consistent with the forcing
atmospheric circulation, while small-scale patterns are generated with higher
accuracy due to the increase in temporal and spatial resolution. Sub-grid
scale processes in the RCM are included through parameterisations, which can
be entirely different from those of the forcing global circulation model
(GCM).
Supplying lateral boundary conditions to nested models can cause
severe problems, up to the point where the RCM solution becomes
inconsistent with the forcing data. Starting off as an
initial-value problem, the RCM solution gradually becomes a boundary-value
problem, which from a mathematical point of view is fundamentally
ill-posed. This is less of an issue in the
context of numerical weather prediction (NWP), where typical model
run times are 3 to 15 days, than in seasonal forecasting (weeks to
months) and in regional climate modelling (years to centuries).
A common problem for both short- and long-term forecasting is that the
solution of the RCM seems to vary with the size of the computational
domain, as well as with its location and the season. Several authors have shown that
nudging techniques (grid or spectral nudging) towards the large-scale
features of the forcing model can reduce these adverse effects, but
that they can also hide model biases.
Further, the technique of grid nesting introduces discontinuities in
spatial resolution between the regional model and the coarser-grid
driving model, as well as between the nests within the regional model
itself. For a typical refinement ratio of 3, two-thirds of the spatial
wave-number spectrum present in the fine mesh are absent in the coarse
mesh. This implies that (a) these features must be
spun-up for inflows into the higher-resolution domain, and that (b) these wave numbers are reflected at the domain boundary for outflows
from the high-resolution domain to the coarser domain. To address the
latter issue, filters that are efficient over a large range of wave
numbers are required. The temporal interpolation required by nesting
can introduce further numerical artefacts, in particular when
interpolating forcing LBCs, usually available at 3–6 h time steps, to
the model integration time step of the high-resolution domain
(typically 6 s per 1 km grid size).
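To illustrate the issue, the short sketch below (a minimal, hypothetical example, not code from any of the models discussed here) linearly interpolates 6-hourly boundary values to a 6 s integration time step; between two forcing times the boundary then evolves along a straight line, regardless of any faster variability the driving model actually produced.

```python
import numpy as np

# Hypothetical 6-hourly lateral boundary values, e.g. temperature in K.
lbc_times = np.arange(0, 24 * 3600 + 1, 6 * 3600)           # seconds
lbc_values = np.array([285.0, 287.5, 291.0, 288.0, 286.0])

# Model integration time steps of 6 s (1 km grid size).
model_times = np.arange(0, 24 * 3600 + 1, 6)
lbc_on_steps = np.interp(model_times, lbc_times, lbc_values)

# The interpolated forcing is piecewise linear: fast-evolving features
# between the 6-hourly snapshots are lost at the domain boundary.
print(lbc_on_steps[:5])
```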
One obvious solution is to avoid using LBCs and nesting by running
a global model at the resolution required for the area of
interest. This, however, is prohibitively expensive or simply not
feasible, even on the latest generations of supercomputers. An
intermediate approach therefore is to run a global model at a moderate
resolution and use a smooth mesh transition on a variable-resolution
grid, where filters are efficient at the local scale of the
corresponding grid cell. Besides the Model for Prediction Across Scales
(MPAS) model discussed here, a few other recent developments, such as the
icosahedral non-hydrostatic model ICON, adopt this
strategy. Applying such models for mid- and long-term regional climate
simulations has only recently become possible and requires substantial
computational power.
MPAS is a recent numerical
modelling framework that includes an atmospheric model (MPAS-A), an ocean model (MPAS-O) and
a land-ice model (MPAS-LI). The MPAS model is a collaborative project, led by
the National Center for Atmospheric Research (NCAR) and the Los Alamos National Laboratory (LANL). The three components of MPAS in principle form
a so-called Earth system model (ESM), but a coupler between them is not yet
available. A common feature of the three MPAS constituents is the use of unstructured,
centroidal Voronoi meshes (spherical centroidal Voronoi tessellations,
SCVTs), which allow for the generation of global, irregular,
variable-resolution meshes with smooth transitions between areas of different
refinement.
Upper left: sketch of a variable-resolution Voronoi mesh used for the horizontal grid;
upper right: C-grid staggering of state variables adapted from;
lower left: a block of owned cells (blue) assigned to a Message Passing Interface (MPI) task, along with two layers of halo cells (red, orange);
lower right: partitioning of the regular 120 km mesh with 40 962 grid cells for 64 tasks.
The atmospheric model MPAS-A is a global, fully compressible
non-hydrostatic model using finite-volume numerics. Based on the
Voronoi mesh, the model uses a C-grid staggering for the state
variables (i.e. wind components are modelled at the faces of every
cell, and the prognosed component of the wind is orthogonal to the
cell face), as illustrated in Fig. (upper panels). The governing equations can then be cast in a way
such that energy, momentum and water vapour content are conserved more precisely
(in the form of potential temperature, mass and scalar mass; see Sect. 2). The
MPAS-A model builds on existing, well-established techniques of the
Advanced Research Weather Research and Forecasting model
(WRF-ARW), for example the split-explicit time
integration scheme for the treatment of gravity waves and horizontally
propagating acoustic waves. It
also contains a subset of WRF's physics parameterisations that are
suitable for climate modelling purposes. While WRF uses
a terrain-following hydrostatic pressure coordinate for the vertical
discretisation, MPAS employs a height-based terrain-following vertical
coordinate. The latter discretisation reduces artificial circulations
caused by inaccuracies in the horizontal pressure gradient
term. It should be noted here that because MPAS was developed
out of a regional climate modelling context, the typical model tops in MPAS-A
are around 30 to 42 km, while current Earth system models work
with model tops of up to 100 km. At the time of writing, work is underway to
increase the model tops in MPAS-A for future applications as an Earth system
modelling component.
Until recently, advances in computational power following Moore's law were
mainly driven by transistor speed and energy scaling, as well as by
micro-architecture advances. Physical limitations and practical energy
concerns will create new challenges for continued performance scaling in the
coming decades. In consequence, large-scale parallelism and the use of
accelerators will be required to achieve performance and energy
efficiency. Hence, a key aspect of modern numerical codes is their
ability to scale on massively parallel systems. The quasi-uniform centroidal
Voronoi meshes used by MPAS are similar to icosahedral (hexagonal) meshes and
can provide nearly uniform resolution over the globe, as opposed to
latitude–longitude grids that require polar filtering to overcome the issue
of converging grid lines at the poles. Grids requiring polar filtering or
spherical transform methods do not scale very well on massively parallel
systems. With MPAS, an efficient parallelisation can be
achieved by aligning all grid cells in a 1-D array, with the vertical
coordinate stacked on top as the second dimension. Good
scaling has been achieved in early weak and strong scaling tests. However,
a thorough investigation of the scalability of MPAS on parallel and massively
parallel systems has not yet been conducted.
In this study, we investigate the performance of MPAS for different
problem sizes on four high-performance computing (HPC) facilities in Europe. For each problem,
strong scaling tests are conducted on all four platforms, which cover
a range of different architectures to reflect the large variety of
computational systems available for research. Additionally, we conduct
extreme scaling tests using a 3 km global mesh to study the
scalability of MPAS up to nearly half a million tasks and to
demonstrate that global, convection-resolving simulations are becoming
possible. We explore the limits of the MPAS model when its parallel
efficiency breaks down and identify opportunities for improvement. We
further derive estimates on the feasibility to conduct longer runs at
convection-resolving resolution on current HPC facilities.
We also assess the quality of the MPAS model output in terms of its accuracy
for climate modelling. We have chosen to study the particular problem of
reproducing the characteristics of the West African monsoon (WAM). The WAM is
the most prominent feature of the West African climate and accounts for the
majority of the annual precipitation. Differential heating of the ocean and
the land surface causes a seasonal change of the large-scale wind systems
during boreal summer, which results in the migration of the inter-tropical
convergence zone (ITCZ) and the associated rain band northwards over the West
African continent. The WAM is driven by a complex and not yet fully
understood interplay of various dynamical features.
GCMs often fail to reproduce this annual movement of the ITCZ due to
their limited temporal and spatial resolution.
Despite the deficiencies discussed
above, regional climate models can improve the representation of
precipitation in comparison to their forcing
data set. Variable-resolution meshes allow for resolving
the region of interest (greater West Africa in this case) at high resolution,
while keeping the model aligned with large-scale features outside of this
area. It is hoped that this will lead to an improved representation
of the WAM. It also opens up the possibility of studying processes such as the
teleconnection between the Indian monsoon and the WAM, or the impact of
land use changes on weather and climate, in a consistent approach.
Specifications of the four HPC facilities used for the
medium-size scaling tests in Sect. .

                TGCC Curie (a)        FZJ Juqueen          SCC ForHLR1 (a)       FZJ Juropatest
System          B510 Bullx            IBM Blue Gene/Q      Megware MiriQuid      T-Platform V210s
Processor       Intel Sandy Bridge    IBM PowerPC A2       Intel Ivy Bridge      Intel Haswell
                E5-2680, 2.7 GHz      1.6 GHz              E5-2670v2, 2.4 GHz    E5-2695v3, 2.3 GHz
Cores/node      2×8                   1×16                 2×10                  2×14
Mem./node       64 GB                 16 GB                64 GB                 128 GB
Nodes total     5040                  28 672               512                   60
MPI network     Infiniband QDR        5-D Torus            Infiniband FDR        Infiniband FDR
                5 GB s-1              40 GB s-1            7 GB s-1              7 GB s-1
Filesystem      LUSTRE                GPFS                 LUSTRE                GPFS
                100 GB s-1            200 GB s-1           16 GB s-1             30 GB s-1
I/O network     same as MPI           10 Gb Ethernet (b)   same as MPI           10 Gb Ethernet
MPI library     Bull MPI              IBM MPI              Intel MPI             Intel MPI
Compiler        Intel icc/ifort       IBM bgxlc/bgxlf95    Intel icc/ifort       Intel icc/ifort
Peak perf.      2 Petaflops           5.9 Petaflops        216 Teraflops         72 Teraflops

(a) Thin nodes only, ignoring fat nodes (larger memory) and hybrid nodes (GPU accelerators).
(b) Using dedicated I/O nodes, typically 128 I/O nodes per 1024 nodes (1 rack).
The paper is organised as follows: in Sect. , we
introduce the HPC facilities used for this study, provide details
about the MPAS-A code and present the scaling experiments with
moderate problem sizes. We continue in Sect. with
an analysis of the physical accuracy of the MPAS model in comparison
to observational data and data from our own regional climate modelling
experiments. Section is devoted to the
extreme scaling tests, and Sect. summarises our
findings and gives an outlook on future modelling activities. Lastly,
Sect. provides information on how to access the
model code and the test cases presented in Sects. and .
Scaling experiments for moderate problem sizes
We perform strong scaling experiments for three different meshes and
on four HPC facilities in Europe. The problem sizes range from
a regular 120 km mesh with 40 962 cells as the smallest problem to
a large, variable-resolution 60–12 km mesh with 535 554 grid
cells. An intermediate test case with a variable-resolution
100–25 km mesh with 163 842 grid cells is studied as well.
HPC facilities
Two of the four HPC systems, the Très Grand Centre de Calcul (TGCC)
Curie and the Forschungszentrum Jülich (FZJ) Juqueen,
are among the largest machines in Europe and are part of the PRACE Tier-0
pool.
The technical specifications for each of the systems are summarised in
Table ; a brief summary is given in the following.
TGCC Curie:
the TGCC Curie went into service in 2012 and consists of 360 fat nodes, 16 hybrid nodes
(neither used in this study) and 5040 thin nodes with two eight-core Intel Sandy Bridge CPUs each. An InfiniBand QDR network is used for
both the compute network and the I/O (input and output) to the global LUSTRE file system. With three different node types, Curie addresses a wide range of scientific challenges.
FZJ Juqueen:
the FZJ Juqueen is an IBM Blue Gene/Q system and was installed in
2012/13. It hosts 28 racks with 1024 nodes per rack and 16 cores per
node, which totals to 458 752 physical cores. Simultaneous
multi-threading is supported by the hardware, but not
used in this study due to the lack of threading in
MPAS-A (see below). A 5-D Torus interconnect is used as compute network,
while I/O is redirected to dedicated I/O nodes using a 10 Gb
Ethernet to connect to the General Parallel File System (GPFS) file system. With a large number of relatively slow CPUs
and a small memory per core, Juqueen most resembles the future
massively parallel systems described above. As such, porting and
scaling experiments of numerical codes onto this architecture are as
challenging as instructive for future applications.
SCC ForHLR1:
the ForHLR1 is the most recent addition to the SCC's HPC systems, installed in September 2014. It hosts different types of nodes to cater for
the various needs of the modelling community. The workhorse for
parallel applications, and the nodes used in this study, are its 512 thin nodes with two 10-core
Intel Ivy Bridge CPUs each, connected via an Infiniband FDR (fourteen data rate) interconnect.
A central LUSTRE filesystem is attached to the nodes, using the same Infiniband interconnect for I/O
as for the compute network.
FZJ Juropatest:
the FZJ Juropatest cluster is a small but cutting-edge prototype system and
consists of 70 T-Platform V210s blades with two 14-core Intel
Haswell processors each. The Message Passing Interface (MPI) communication is realised over Infiniband FDR,
and the I/O to the central GPFS file system is routed via
10 Gb Ethernet. While optimising the MPAS model for the Haswell features
and instruction sets is beyond the scope of this study, it will become
an inevitable step in future model development and tuning.
On Juropatest, we conduct two sets of runs for each of the test cases:
for the first set (Jtest-half in the following), we use only one of the two
available 14-core CPUs in each node, which implies a similar
number of cores per node for Curie and Juropatest or, in other words,
a similar number of nodes for the same total number of tasks. In this
configuration, each task is bound to one core on the node. For the
second set (Jtest-full in the following), we use both CPUs,
i.e. 28 cores on each node, to exploit the full capabilities of the
Juropatest system and to expose possible memory bandwidth limitations of MPAS-A.
MPAS-A code
For the strong-scaling studies in this paper, we use MPAS-A v3.1,
released on 24 November 2014. This release of the model employs
a horizontal domain decomposition for parallel execution, and
parallelisation is implemented using MPI only; in this version of the
code, no threading is used. The MPAS code is written almost entirely
in Fortran 2003, with a few minor parts written in C. The MPAS build
system uses only the make utility, with settings for
different compilers and architectures described as different targets
in the top-level Makefile; see Appendix
for the compiler flags used in this study.
The optimal parallelisation and distribution of the cells of the Voronoi mesh
for a given number of tasks is treated as a graph partitioning problem. The
dual mesh of a Voronoi tessellation is a Delaunay triangulation, which
immediately provides the connectivity graph for the primal (i.e. Voronoi)
cells in the mesh. In MPAS, the graph partitioning is computed as a separate
pre-processing step for which the METIS software is
used. An optimal
partitioning distributes equal work (by proxy, the number of cells) to each
task while minimising the edge cut (assumed to model the communication
between tasks). METIS uses a multilevel k-way partitioning scheme, which
produces partitions of comparable quality to traditional multilevel bisection
algorithms and is about 2 orders of magnitude faster.
The resulting graph partitioning can be critical for the model performance
due to, for example, a large overhead of communication and computational
imbalances between the individual partitions.
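As an illustration of this pre-processing step, the sketch below drives the stand-alone METIS partitioner gpmetis from Python and summarises the resulting distribution of owned cells over tasks. The file name is an assumption for illustration; the gpmetis invocation and the one-part-id-per-line output format follow the METIS documentation.

```python
import subprocess
from collections import Counter

ntasks = 64
graph = "x1.40962.graph.info"  # assumed name of the mesh connectivity file

# gpmetis writes the partition to "<graph>.part.<ntasks>",
# with one part id (0 .. ntasks-1) per cell and line.
subprocess.run(["gpmetis", graph, str(ntasks)], check=True)

with open(f"{graph}.part.{ntasks}") as f:
    parts = [int(line) for line in f]

owned = Counter(parts)  # number of owned cells per MPI task
print(f"cells per task: min={min(owned.values())}, "
      f"max={max(owned.values())}, mean={len(parts) / ntasks:.1f}")
```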
Parallel efficiency (%) for the 120 km regular mesh
with 40 962 grid cells, the variable 100–25 km mesh with 163 842 grid cells and the variable 60–12 km mesh with 535 554 grid cells.
At start-up, the MPAS-A model reads a file that assigns Voronoi cells
to each of the MPI tasks according to a partitioning produced by
METIS. The set of cells assigned to an MPI task is referred to as a
block, and the cells in this assignment are referred to as the
owned cells. The dynamical solver in MPAS-A requires stencils of
cells in order to apply various operators, and as part of the model
start-up, referred to internally as the bootstrapping process,
a pre-determined number of layers of halo cells (sometimes referred to
as ghost cells in other modelling systems) are added around each
block. Although the number of halo layers can vary between the different
MPAS models, MPAS-A adds two layers of halo cells around each block of cells,
as illustrated in Fig. (lower left panel). A lower bound
for the number of halo cells Nh for a given number of owned cells No
can be estimated as

Nh = π (√(No/π) + 2)^2 − No.
At points in the
MPAS-A dynamical solver where current values of fields in halo cells
are required, values are communicated between tasks, from owned cells
to ghost cells, with point-to-point MPI communications.
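As a quick numerical check of this lower bound, the sketch below evaluates the formula for the owned-cell counts that occur in the breakdown analysis later in this paper (2048, 4096 and 8192 tasks on the 60–12 km mesh); it reproduces the halo-to-owned ratios quoted there.

```python
import math

def halo_lower_bound(n_owned: int) -> int:
    """Lower bound on the number of halo cells for a roughly circular
    block of owned cells: Nh = pi * (sqrt(No/pi) + 2)**2 - No."""
    return round(math.pi * (math.sqrt(n_owned / math.pi) + 2) ** 2 - n_owned)

# Owned cells per task on the 60-12 km mesh (535 554 cells)
# for 2048, 4096 and 8192 tasks.
for n_owned in (261, 131, 65):
    n_halo = halo_lower_bound(n_owned)
    print(f"owned={n_owned:4d}  halo>={n_halo:4d}  ratio={n_owned / n_halo:.2f}")
```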
An important aspect and common bottleneck in numerical weather prediction and
climate modelling is disk I/O, since large 4-D fields such as temperature,
geopotential height or wind components need to be written to the disk
frequently. In MPAS v3.1, I/O is facilitated by the parallel I/O library
(PIO)
v1.7.1, a wrapper with an easy-to-use application programming interface (API) that encapsulates the complexity of
parallel I/O for a number of supported formats: binary, serial
NetCDF,
Parallel-NetCDF
and recently (since v.1.9.14) parallel NetCDF-4 through
PHDF5 . PIO is
compiled without further optimisation (standard settings) on all four
machines. The HDF5, NetCDF and Parallel-NetCDF libraries are provided as
modules on all four systems.
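For readers unfamiliar with this style of output, the sketch below illustrates the general idea behind PIO-style collective writes, using the parallel mode of the netCDF4-python bindings as a stand-in (PIO itself has no Python interface); it assumes an MPI-enabled NetCDF/HDF5 stack and is purely conceptual.

```python
from mpi4py import MPI
import numpy as np
from netCDF4 import Dataset

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

ncells, nlev = 40962, 41  # e.g. the regular 120 km mesh
counts = [ncells // size + (r < ncells % size) for r in range(size)]
start = sum(counts[:rank])  # this task's slice of the 1-D cell array

# All tasks open the file collectively; the library coordinates the writes.
nc = Dataset("diag.nc", "w", parallel=True, comm=comm, info=MPI.Info())
nc.createDimension("nCells", ncells)
nc.createDimension("nVertLevels", nlev)
theta = nc.createVariable("theta", "f8", ("nCells", "nVertLevels"))

# Each task writes only the rows corresponding to its owned cells.
theta[start:start + counts[rank], :] = np.full((counts[rank], nlev), 300.0)
nc.close()
```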
Unless stated otherwise, all experiments are conducted with
double-precision floating-point arithmetic, 41 atmospheric levels, 4 soil
levels, a full suite of physics and dynamics (see
Appendix for details), and standard disk I/O. Each
experiment is run for 24 h model time, during which an initial
conditions file is read (init.nc), diagnostic output files
are written every 3 h (diag.nc), and a final restart file
and a comprehensive output file are written at the end
(restart.nc, output.nc). The model integration time
step depends on the grid resolution and is mentioned in the individual
sections below. Note that for variable-resolution meshes, the global
time step is determined by the smallest grid size. By default, all
tasks participate in the parallel I/O. Each experiment is repeated
once or twice, depending on how close the measured run times are, to account
for fluctuations of single experiments.
Regular 120 km grid
The first and smallest test case consists of a global, regular 120 km
mesh with 40 962 grid cells, which is roughly comparable in
resolution to a 284×142 (latitude–longitude) grid. It is thus
in the range of current Earth system models. A model integration time
step of 150 s is adopted. For a resolution of 120 km, this is an
extremely conservative setting (1.25 s per km grid size) for MPAS-A,
which implies 576 integration time steps for a 24 h test run,
compared to 120–144 integration time steps for typical values of
5–6 s per km grid size. This increases the time spent for the actual
parallel integration relative to that for model initialisation and
disk I/O. The decomposition of the 120 km grid for 64 tasks
is illustrated as an example in Fig. (lower right panel),
while Fig. (left panel) shows
the scaling plots on the four HPC facilities described above. For an
easier comparison of the scalability of the different test cases, the
scaling is displayed as a parallel efficiency (i.e. the ratio of real
scaling and ideal scaling) vs. the number of tasks (bottom
horizontal axis) and number of cells owned per task (top horizontal
axis). Table provides further details about the
scaling, whereas Table summarises the size of the
files to be read and written during one model run.
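For reference, the parallel efficiency used throughout this section can be computed from measured run times as in the following minimal sketch (the timings are made up for illustration; the reference point is the smallest run of each series).

```python
def parallel_efficiency(tasks, runtimes):
    """E(n) = (T_ref * n_ref) / (T(n) * n), relative to the smallest run."""
    n_ref, t_ref = tasks[0], runtimes[0]
    return [100.0 * (t_ref * n_ref) / (t * n) for n, t in zip(tasks, runtimes)]

# Hypothetical 24 h run times (s) for increasing task counts.
tasks = [16, 32, 64, 128, 256]
runtimes = [1800.0, 930.0, 490.0, 280.0, 190.0]
for n, eff in zip(tasks, parallel_efficiency(tasks, runtimes)):
    print(f"{n:4d} tasks: {eff:5.1f} %")
```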
Previous scaling tests on the NCAR-Wyoming Supercomputing Center's (NWSC)
Yellowstone machine
suggest that for regular meshes, the
parallel efficiency of MPAS-A is correlated with the number of cells
owned per task. Considering the time required for the solver only,
i.e. neglecting the set-up costs and the disk I/O, a parallel
efficiency of close to 70 % is obtained for more than 160 cells
per task. Here, we include the set-up costs (bootstrapping and reading
of initial conditions file) and the output to the disk in the scaling to
emphasise the importance of all aspects of the system – from
filesystem performance to compute performance to the speed of the
interconnect – and to estimate the necessary resource
requirements. It should be noted that this can have a negative and
noticeable influence on the parallel efficiency, depending on the
performance of the parallel I/O operations and the ratio of the time
spent for the set-up of the model and the actual time
integration. Hence, the threshold of 160 cells owned per task for the
breakdown of the parallel efficiency should be considered as a lower
limit. In Sect. , we provide a detailed analysis of
the costs of the individual steps for the Jtest-full system.
Communication volume as function of number of tasks for the three
test cases (left: 120 km; middle: 100–25 km; right: 60–12 km).
The dashed lines correspond to a power-law relation with an index of 0.52. For large
numbers of tasks, runaway growth can be seen for the 120 km and, to some extent, also for the 100–25 km mesh.
On the three Linux-cluster type systems, test runs are conducted on
single nodes up to 32 (Curie), 20 (ForHLR1), 15 (Jtest-full) and 30
(Jtest-half) nodes. In the following, we refer to good scaling as
a parallel efficiency of ≈70 % or more. Good scaling is
achieved up to 6 nodes on Curie (96 tasks, or 427 grid cells per
task), 8 nodes on ForHLR1 (160 tasks, or 256 grid cells per task), 8
nodes on Jtest-full (224 tasks, or 183 grid cells per task) and 15
nodes on Jtest-half (210 tasks, or 195 grid cells per task). Comparing
Curie and Jtest-half, it is evident that (a) a single Haswell CPU with 14
cores outperforms two Sandy Bridge CPUs with 2×8 cores on the
same board, and that (b) the parallel efficiency decreases faster with
the number of nodes on Curie. This is probably related to the
interconnect: while Juropatest (as well as Yellowstone) uses
Infiniband FDR Full Fat Tree technology (theoretical effective,
aggregated throughput 56 Gb s-1 = 7 GB s-1)
for the inter-process communication (MPI) and a separate 10 Gb
Ethernet connection for I/O operations, Curie uses QDR Full Fat Tree
technology (quad data rate; theoretical effective, aggregated throughput
32 Gb s-1 = 4 GB s-1) for both the inter-process
communication and the I/O operations. The transition zone for
the breakdown of the parallel efficiency, between 600 and 150 owned cells
per task, is indicated as a shaded blue area in Fig. . A comparison of the absolute
run times on Jtest-half and Jtest-full shows that runs with 28 cores
per node are 5–15 % slower than runs with 14 cores per node,
which is presumably due to memory bandwidth bottlenecks. An exception
here is the 420-task run (30 nodes on Jtest-half, 15 nodes on Jtest-full),
for which the increase in inter-node communication is the limiting factor
(see also the discussion in Sect. ).
The minimum job (and block) size on Juqueen is 32 nodes or 512 tasks,
which corresponds to only around 80 cells owned by each task. The
parallel efficiency drops rapidly with an increasing number of nodes,
since this problem size is simply too small for application on
Juqueen. Figure (left panel)
displays the communication volume as a function of the number of tasks, which
follows a power law with an index of 0.52 up to about 1000 tasks (approx. 40 owned cells per task). A runaway growth can be seen for larger
numbers of tasks. At the same time, the
graph partitions become increasingly non-contiguous (not displayed).
It should be noted that the communication volume and number of
non-contiguous partitions are computed by METIS and as
such are a function of the partitioning of the cells only, which
essentially assumes a single layer of ghost cells. Since
MPAS-A exchanges two layers of ghost cells at maximum, the actual
number of ghost cells, edges and vertices can be slightly
different. A detailed study of the impact of these graph properties
will be given in the following section.
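The power-law index quoted above can be recovered from the METIS output by a straight-line fit in log-log space, as in the sketch below (the communication volumes are placeholders; in practice one would use the values reported by METIS for each partitioning).

```python
import numpy as np

# Placeholder (task count, communication volume) pairs.
tasks = np.array([16, 32, 64, 128, 256, 512, 1024])
volume = np.array([4.1e3, 5.9e3, 8.5e3, 1.22e4, 1.75e4, 2.51e4, 3.60e4])

# V ~ n^k  =>  log V = k * log n + c; the slope k is the power-law index.
k, c = np.polyfit(np.log(tasks), np.log(volume), 1)
print(f"power-law index: {k:.2f}")  # ~0.52 for these numbers
```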
(Left) approximate mesh cell sizes and (right) distribution of mesh cell sizes for
(top) the variable 100–25 km mesh with 163 842 grid cells and
(bottom) the variable 60–12 km mesh with 535 554 grid cells.
Table lists the cheapest model runs
in terms of CPUh spent per 24 h model integration and the fastest
runs in terms of real time per 24 h model integration for the four HPC
sites. It is important to remember that while the Jtest-half runs use
only 50 % of the available cores on each node, the computational
costs for the full node (28 CPUh per node per hour real time) are
charged for the model run, since the node is not available for other
users or jobs. Also, a one-to-one relation of CPUh between Linux
cluster-type machines and an IBM Blue Gene is not meaningful. By
comparing typical calls for proposals for the different HPC systems,
a conversion factor of 1:16 seems to be reasonable, i.e. to
consider one entire node with 16 cores on Juqueen as equivalent to one
core on the other systems. However, since applications for computing
resources usually demand estimates for the required amount of CPUh, we
list the actual CPUh here.
Variable 100–25 km grid
The second test case employs an irregular mesh with a variable
resolution ranging from 100 km for most parts of the globe to
25 km
for a circular area spanning about 60∘, and centred on West
Africa (lat = 12.5∘ N, long = 0∘ E). An
integration time step of 120 s is used. The mesh as well as the
frequency distribution of cell sizes are displayed in
Fig. (top panels), the scaling is illustrated in
Fig. (middle panel) and summarised in
Table .
Test runs are conducted on single nodes up to 192 nodes on Curie
(3072 tasks, 53 owned cells per task), 60 nodes on ForHLR1 (1200
tasks, 137 owned cells per task), 50 nodes on Jtest-full (1400 tasks,
117 owned cells per task) and 55 nodes on Jtest-half (770 tasks, 213
owned cells per task). Good scaling is achieved up to 32 nodes on
Curie (512 tasks, 320 owned cells per task), 20 nodes on ForHLR1 (400
tasks, 410 owned cells per task), 25 nodes on Jtest-full (700 tasks,
234 owned cells per task) and 35 nodes on Jtest-half (490 tasks, 334
owned cells per task). The parallel efficiency on average drops from
about 90 to 40 % within the transition zone (shaded area in
Fig. , 600 to 150 cells owned per task),
similar to the first test case on the regular 120 km mesh.
Notably different to the previous test case are the measured run times
for Jtest-full and Jtest-half: for small numbers of tasks, the
Jtest-full runs show a worse performance due to the aforementioned
memory bandwidth limitations. For large numbers of tasks (nodes), the
increase in inter-node MPI communication, which impacts the Jtest-half
runs more than the Jtest-full runs, becomes the limiting factor and
decreases the performance of the Jtest-half runs below that of the
Jtest-full runs. With respect to the remaining HPC systems, a clear
separation of the parallel efficiency by interconnect technology, as seen
for the 120 km test case, cannot be made here, for the following
reasons.
First, the disk I/O demand scales with the number of grid cells and
is larger by a factor of 4 for this mesh (see
Table ). As we will see in the
following sections, in particular for the extreme scaling experiments
in Sect. , the disk I/O becomes increasingly
important for larger problem sizes and can consume a significant part
of the total run time. The I/O is routed differently at the four HPC
sites, the central storage systems have different bandwidths and block
sizes, and the parallel I/O libraries might perform differently,
depending on the compilers and compilation flags. Additionally, in
this test case we adopt an integration time step of 120 s (4.8 s per
km grid size), which implies a smaller fraction of the total time
spent for the actual time integration relative to the disk I/O.
Ratio of (left) total number of cells, (middle) number of owned
cells and (right) number of halo cells per task between the non-contiguous
and the contiguous graph partition of the variable 100–25 km mesh for 300 tasks.
Second, the graph partitioning adds another layer of complexity and
variability to the performance diagnostics. Figure (middle panel) shows that the above-mentioned power-law relationship
between the communication volume and the number of tasks also holds for
the variable 100–25 km mesh up to about 40 owned cells per task
(4000 tasks). Contrary to the 120 km regular mesh, METIS tends
to create more non-contiguous partitions for complex mesh structures
(not displayed here). This variability introduced by the graph partitioning is unpredictable
and may have a significant impact on the model performance. If not
instructed otherwise, the graph partitioning tool METIS aims at
minimising the edge cut (communication volume), which potentially
comes at the cost of having non-contiguous partitions. As discussed
earlier in Sect. , halo cells are added
around each patch of the individual tasks, for which communication
with the neighbouring tasks is required. The additional amount of
communication caused by halo cells around non-contiguous partitions
can be substantial, in particular if the number of cells owned per
task is small (i.e. the ratio of halo cells to owned cells is
large).
To investigate the effect of non-contiguous partitions on the parallel
efficiency, we analyse one partition with 300 tasks for the
100–25 km mesh on ForHLR1 (546 owned cells per task), for which
METIS by default produces a non-contiguous partition. We create an
additional, contiguous partition using the command line arguments
-contig -minconn for METIS, which results in an increase of
the edge cut from 58 031 to 58 870
(1.4 %). Figure displays the
total number of cells (nCellsByTask), the number of owned cells
(nCellsSolveByTask) and the number of halo cells per task
(nHaloCellsByTask) as a ratio between the non-contiguous and the
contiguous partition. Since the communication volume increases for the
contiguous partition, the total number of cells per task on average is
larger, too. The average number of owned cells is identical, since the
number of cells of the graph does not change. Notably, task 200 has
a 1.3 times larger number of halo cells for the non-contiguous
partition, since its partition consists of two separate patches, which
implies a larger number of neighbouring tasks and surrounding halo
cells.
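The per-task quantities compared here can be derived directly from the connectivity graph and a partition file, as in the sketch below; the assumed file formats follow the METIS conventions (a header line with vertex and edge counts, then one adjacency list per cell; one part id per cell in the partition file). For simplicity, only the first halo layer is counted, whereas MPAS-A exchanges up to two layers.

```python
from collections import defaultdict

def read_graph(path):
    """Read a METIS graph file into a list of 1-based adjacency lists."""
    with open(path) as f:
        n, _ = map(int, f.readline().split())
        return [list(map(int, f.readline().split())) for _ in range(n)]

def read_partition(path):
    with open(path) as f:
        return [int(line) for line in f]

def per_task_stats(adjacency, parts):
    owned = defaultdict(int)
    halo = defaultdict(set)
    for cell, part in enumerate(parts):
        owned[part] += 1
        for nb in adjacency[cell]:       # nb is a 1-based cell index
            if parts[nb - 1] != part:
                halo[part].add(nb - 1)   # first halo layer of this task
    return owned, {p: len(c) for p, c in halo.items()}

# Assumed file names for the 100-25 km mesh partitioned for 300 tasks.
adjacency = read_graph("x5.163842.graph.info")
parts = read_partition("x5.163842.graph.info.part.300")
owned, halo = per_task_stats(adjacency, parts)
worst = max(halo, key=halo.get)
print(f"task {worst}: {owned[worst]} owned cells, {halo[worst]} halo cells")
```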
To eliminate the influence of the disk I/O on the run times for
the two partitions in this test, we switch off the output to the disk. We find that the
measured run times for the model integration are practically identical
for the two runs (251 s non-contiguous vs. 255 s contiguous). For
300 tasks, the average ratio of halo cells to owned cells is
1 : 2.8, which might be too small to see the effect of the
additional halo cells in the non-contiguous partition. We therefore
repeat the test for non-contiguous and contiguous partitions for 2520
tasks (65 owned cells per task), with a corresponding ratio of
1.2 : 1 halo cells to owned cells. Even in this case, the measured
run times for the model integration are nearly identical (45.2 s
contiguous vs. 45.6 s non-contiguous). We conclude therefore that
the impact of non-contiguous partitions on the run time is negligible
for any reasonable number of tasks for a given mesh.
Although the number of grid cells is 4 times larger for this test case
than for the regular 120 km mesh, the problem size is still too small
for application on Juqueen. The two smallest possible parallel runs
with 512 and 1024 tasks correspond to 320 and 160 cells owned per
task, for which the decrease in parallel efficiency is 20 %. Runs
with a larger number of tasks all have parallel efficiencies of less
than 60 %. Table lists the cheapest and
fastest model runs for the four HPC sites.
Variable 60–12 km grid
The third moderately sized scaling test consists of a variable-resolution mesh with maximum grid spacing of 60 km and minimum grid
spacing of 12 km. The refinement area is an approximate ellipse,
illustrated in Fig. (lower panels), and encompasses the whole
of north and central Africa, extending as far as India in the east and
covering a large part of the Atlantic Ocean in the west. This particular
mesh is useful for studying the teleconnection between the Indian and
Atlantic oceans and the monsoon systems in East and West Africa. The
total number of grid cells is 535 554, which corresponds to 1034×517 grid points on a regular latitude–longitude grid and thus
is in the range of current re-analyses. A time step of 72 s (6 s per
km grid size) is adopted.
Figure (right panel) and Table summarise the scaling of this
mesh on the four systems. As for the variable 100–25 km mesh,
a separation of the parallel performance by interconnect cannot be
detected, due to the increasing variability introduced by the disk I/O
(see Table ). While the number of tasks is sufficiently large to obtain
a constant power-law relationship between the communication volume and the number
of tasks up to 16 384 tasks on Juqueen (Fig. , right panel),
the complexity of the mesh leads to more non-contiguous partitions (not displayed).
Test runs are conducted on single nodes up to 384 nodes on Curie
(6144 tasks, 87 owned cells per task), 120 nodes on ForHLR1 (2400
tasks, 223 owned cells per task), 55 nodes on Jtest-full (1540 tasks,
348 owned cells per task), and 55 nodes on Jtest-half (770 tasks, 696
owned cells per task). Good scaling is achieved up to 48 nodes on
Curie (768 tasks, 697 owned cells per task), 45 nodes on ForHLR1 (900
tasks, 595 owned cells per task) and up to the maximum number of 55
nodes on Jtest-full and Jtest-half. In the transition zone between 600
and 150 owned cells per task (shaded in blue), the parallel efficiency
on Curie and ForHLR1 drops off quickly from about 80 % to as low
as 30 %.
Juropatest shows the best (although most irregular) scaling of the three
Linux-cluster systems. With a maximum of 55 available nodes
on the system, the parallel performance is better than 70 % for
both Jtest-half and Jtest-full. The parallel efficiencies for ForHLR1
and Curie follow a more regular trend, although the number of
non-contiguous partitions is highly variable for ForHLR1, but not for
Curie. This adds further support to our conclusion that the impact of
non-contiguous partitions on the run time is negligible for any
reasonable number of tasks for a given mesh. With 535 554 grid cells,
this test case also shows good scaling on Juqueen up to 128 nodes
(2048 tasks, 262 owned cells per task). For larger numbers of tasks,
the parallel efficiency drops rapidly and the number of
non-contiguous partitions becomes highly variable.
Table lists the cheapest model runs
in terms of CPUh spent per 24 h model integration and the fastest
runs in terms of real time per 24 h model integration for the four HPC
sites.
Total run times (s) for the 120 km regular mesh with 40 962 grid cells,
the variable 100–25 km mesh with 163 842 grid cells and the variable 60–12 km
mesh with 535 554 grid cells. Indicated are time ratios between the Linux-cluster systems
Curie, ForHLR1 and Juropatest, and the Blue Gene system Juqueen. Fits to the
total run times are obtained separately for the Linux-cluster systems and the
Blue Gene system, using a modified version of Amdahl's law (see Eq. and Table ), and are displayed as dashed black lines.
Comparison of HPC systems
We further address the question of how the total run times compare on the
various HPC systems. Figure displays the total
run times for the 24 h model runs on the four systems.
For all test cases, the three Linux-type systems line up quite well,
which means that the absolute run times are very close for similar parallel
decompositions and that differences in the processor architectures do not
translate into model speed-up. Due to the large differences in tasks
between the Linux-cluster systems and the Blue Gene system, the problem
size must be large enough for a fair comparison. Among the three test
cases, only the 60–12 km mesh test case satisfies this requirement.
Fit coefficients of the modified Amdahl's law (Eq. ) for the three test cases and the two classes of architectures, with a common value for the exponent X = 2.85.
For the Blue Gene system, only the fit for the 60–12 km mesh should be used to estimate the required computing times.

Mesh         Architecture     T(1) (s)      A           B
120 km       Linux-cluster    18 652        1.5×10^-3   3.3×10^-4
             Blue Gene        413 440       1.8×10^-4   3.3×10^-5
100–25 km    Linux-cluster    86 523        8.5×10^-4   5.2×10^-5
             Blue Gene        1 274 880     2.4×10^-4   5.5×10^-6
60–12 km     Linux-cluster    422 424       4.9×10^-4   1.4×10^-5
             Blue Gene        6 579 200     8.4×10^-5   6.1×10^-6
A common feature across the three test cases is a minimum of the
total run times on the three Linux-type systems around 150 owned cells per task.
The reason for the increase in total run times for smaller numbers of owned
cells per task will be discussed in the following section. Here, we
attempt to fit a modified version of Amdahl's law, which
also accounts for the observed increase in run times when the parallel performance
breaks down:

T(n) = T(1) [ A + (1 − A − (nB)^X)/n + (nB)^X ].
Here, n denotes the number of parallel tasks and T(n) the total run time in seconds.
The first term on the right-hand side of Eq. () represents the serial
part of the code, the second term the part that scales perfectly and the third term
the part that takes longer when more tasks are used. This allows one to quickly estimate
the required resources for the test cases presented here for Linux-type systems (and
Blue Gene/Q systems) with similar specifications.
The coefficients A and B are allowed to vary across the three test cases and two
different classes of architectures. The power-law index X, on the other hand, is derived
as the best fit for the three test cases on the Linux-type systems and the 60–12 km on
the Blue Gene system. This results in a value of X=2.85 with a good fit to all four curves,
which indicates that the increase in total run times below a certain number of owned cells per task
occurs for the same reason (see also the discussion in Sect. ).
Required time for a 24 h integration (s) split into time
integration, model set-up and disk I/O for the three test cases on Jtest-full. Also
displayed are the total time and a combination of time integration and disk
I/O to reflect the parallel performance of MPAS-A for longer model runs, for which the initial set-up costs are amortised.
Table summarises the coefficients obtained from fitting
Eq. ()
to each of the six curves. Note that the fits to the 120 km mesh and the 100–25 km mesh on the
Blue Gene system are listed here as well, but should be used with caution. The minimum of T(n) is obtained by setting
∂T(n)/∂n = 0 at n = nmin and solving
for the number of owned cells per task No,min = Ncells/nmin.
This leads to minimum values of No,min={143,146,182} owned cells per task
and Tmin={116,177,402}s for the three Linux-type systems on the
120 km
mesh, the 100–25 km mesh and the 60–12 km mesh, respectively. For the Blue Gene
system, only the fit for the 60–12 km mesh test case can be considered as realistic and leads to
No,min=97 owned cells per task and a minimum run time of Tmin=2155s.
From the discussion in this section, we infer that a lower limit of about 150 owned cells per task
is a good and quick estimate for the Linux-type systems below which the parallelisation becomes highly
inefficient. The shift of the minimum value to larger numbers of tasks (lower numbers of owned cells
per task) on the Blue Gene system can be attributed to the performance of the faster 5-D Torus
interconnect relative to that of the slower CPU cores. The resulting fitting curves are also displayed
in Fig. .
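As a worked example, the sketch below evaluates Eq. () with the Linux-cluster coefficients for the 120 km mesh from the table above and locates the minimum numerically; it closely reproduces the quoted No,min ≈ 143 owned cells per task and Tmin ≈ 116 s.

```python
from scipy.optimize import minimize_scalar

def runtime(n, t1, a, b, x=2.85):
    """Modified Amdahl's law: T(n) = T(1) * [A + (1 - A - (nB)^X)/n + (nB)^X]."""
    return t1 * (a + (1.0 - a - (n * b) ** x) / n + (n * b) ** x)

# Linux-cluster fit for the 120 km mesh (40 962 cells), from the table above.
t1, a, b, ncells = 18652.0, 1.5e-3, 3.3e-4, 40962

res = minimize_scalar(lambda n: runtime(n, t1, a, b),
                      bounds=(1, 5000), method="bounded")
print(f"n_min ~ {res.x:.0f} tasks, "
      f"No,min ~ {ncells / res.x:.0f} owned cells per task, "
      f"T_min ~ {res.fun:.0f} s")
```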
Breakdown of parallel performance
In the following, we investigate the reasons for the breakdown of the
parallel performance. In the previous scaling tests, we include the
model set-up – in principle, bootstrapping and reading of the initial
conditions file – as well as the parallel output to the disk in the
measurements. Here, we split up the total computational costs into
three different steps: time integration,
model set-up and disk I/O. Over longer runs, the model
initialisation costs are amortised and the parallel performance is
determined by the numerical solver (time integration) and the parallel
I/O (disk output, reading of boundary updates). Accordingly, Fig.
displays the total computational costs, the costs for the three steps
and a combination of time integration and disk I/O for the
Jtest-full system and all test cases. Each time, the parallel
efficiency of the combined time integration and disk I/O, indicated with
orange stars, starts to decrease for less than 600 owned cells per task.
The time integration alone shows nearly ideal scaling down
to 150 owned cells per task, while the model set-up and disk I/O
are only weakly dependent on the number of tasks.
According to Figs. and ,
the absolute run times for all three test cases start to increase for
less than 150 owned cells per task on the three Linux-cluster systems. For the
Blue Gene system, this increase in absolute run times takes place at smaller numbers
of owned cells per task, but the picture is less clear due to the small problem sizes.
However, given the faster interconnect on the Blue Gene system and the fact that the
model initialisation and disk I/O depend only weakly on the number of tasks, this
suggests a breakdown of the parallel efficiency of the solver (i.e. the time integration) itself.
To explore the limiting factors of the solver for a large number of tasks,
we use the parallel debugging and
profiling tools Scalasca and
Score-P. These tools
are available as modules on Juqueen; hence, we focus on three particular runs
on the 60–12 km mesh on this system. We analyse in detail the 2048-task run
(261 owned cells per task), the 4096-task run (130 owned cells per task) and the
8192-task run (65 owned cells per task), for which the latter takes longer
than the former two runs (Fig. ).
Halo cells are added around each patch of the
individual tasks, for which communication with the neighbouring tasks
is required. The larger the number of tasks, the smaller is the number of
owned cells per tasks, and the larger the ratio between halo cells and
owned cells. A lower bound for the number of halo cells Nh for a given
number of owned cells No is given by Eq. ()
and corresponds to a ratio of No:Nh=261:127=2.1 for the 2048-task run,
131:94=1.4 for the 4096-task run and 65:70=0.94 for the 8192-task run.
Table summarises the percentages of time spent
during the time integration step for these runs. Leaving aside the model initialisation
(initial bootstrapping, reading of initial conditions), the total time spent
for communication increases from 35 % for 2048 tasks to 70 % for 8192 tasks.
A detailed analysis of the Scalasca report reveals that most of
this time is wasted in MPI_WAIT during the exchange of 2-D and 3-D
halo fields. About 16 % of the time is spent for parallel I/O, namely for building
and writing output streams and for reading boundary updates (sea surface
temperature, sea ice fraction). It should be noted here that the selection
of fields written to the disk is customisable at run time using simple ASCII
files. Table reports on the number of 3-D and 4-D fields written
to the disk in our tests.
The Scalasca reports also reveal that
computational imbalances in the parallelisation can have serious
impacts. For the example of the 2048-task run on the 60–12 km mesh,
the average time required to update the boundary information
(sea surface temperature, sea ice fraction) is about 11.8 s per
task. However, one single task takes 19.3 s for the same action and
blocks all remaining tasks in their execution. Since the model synchronisation takes
place when exchanging halo cell data, the different computing times
of the model physics appear in the time spent for communication.
These results highlight the importance of an efficient parallelisation
and fast interconnects between the compute nodes and to the central
storage, in particular for future applications on massively parallel
systems and for the extreme scaling tests in Sect. .
Reproducing the dynamics of the West African monsoon
The second important aspect of a numerical weather prediction and
climate model is its accuracy in reproducing observed meteorological
conditions on global, regional and local levels. As described in the
introduction, the WAM has turned out to be
a notoriously difficult problem in climate modelling, since it is
a complex interplay of various dynamical, microphysical and
surface-related processes across scales. The common understanding is
that the intensity of the monsoon and the associated precipitation on
local scales are regulated by the driving, large-scale atmospheric
circulation. This picture is challenged and complicated by a recent
study, which found that processes on a local scale,
such as mesoscale convection and precipitation events, can have
a noticeable influence through feedback effects on the entire monsoon
system. Examples thereof are enhanced moisture transport and
circulation, and a strengthening of westward travelling disturbances
(African easterly waves).
In this study, we attempt a first and brief evaluation of the ability
of MPAS-A to reproduce the dynamics of the WAM. Due
to computational limitations for this short comparison, we are
restricted to 11-month long model runs. We focus on the onset of the
monsoon season in June/July 1982 for two of the meshes presented
above, namely the regular 120 km mesh and the variable
60–12 km
mesh. Both models are initialised in September 1981 using NCEP Climate
Forecast System Reanalysis (CFSR) data at
0.5∘×0.5∘ resolution as initial
conditions. Daily updates of the sea surface temperature are taken
from the NOAA Optimum Interpolation Sea Surface Temperature Analysis
(NOAA OI SST) at 0.25∘×0.25∘ resolution. The period from September 1981 to the beginning
of 1982 is considered as spin-up time for the model, in particular for
the soil conditions. An analysis of the soil properties over
the 11-month runs is undertaken to investigate whether such a short
spin-up time is sufficient. The model output is compared to different sets of
observational data to account for the large uncertainty in the gridded
observational products in the data-sparse region of West Africa.
Model topography (terrain height in metres) for the WRF reference and the
two MPAS model runs for the West African WRF domain. The left panel also indicates
five distinct agro-climatical zones, following a gradient of decreasing annual precipitation from south to north.
For near-surface air temperature and precipitation, we refer to (1) the
Climate Research Unit (CRU) high-resolution gridded time-series
data set v. 3.22 , and (2) the University of
Delaware (UDEL) air temperature and precipitation long-term monthly
means V3.01 at 0.5∘×0.5∘.
Additional observational data for near-surface air temperature are
obtained from the Global Historical Climate Network (GHCN) gridded
2 m temperature data set V2 at
0.5∘×0.5∘. For precipitation, we also use the
Global Precipitation Climatology Centre (GPCC) full data re-analysis
version 6 monthly means, also at
0.5∘×0.5∘.
We also compare the MPAS-A model output to reference data obtained from a set
of novel regional climate simulations over West Africa within the WASCAL
program. These data are produced using
the regional climate model WRF at 12 km resolution with initial and lateral
boundary conditions provided by the ERA-Interim re-analysis
(approx. 80 km resolution). The WRF model uses a set-up that is optimised for the
region of West Africa, following a detailed analysis of the monsoon dynamics
for different WRF model configurations. An extensive
documentation and analysis of this reference data set will be given in
a forthcoming paper. The WRF model run covers a region from 25∘ W to
25∘ E and 5∘ S to 25∘ N and is initialised in
January 1979. Spectral nudging is applied to the WRF limited area model to
keep it on track with the large-scale features of the driving ERA-Interim
re-analysis. The sea surface temperature is updated every 6 h from the
ERA-Interim SST data.
The WRF model domain lies entirely within the refinement zone of the variable
60–12 km mesh. Hence, the resolution of the MPAS-A runs is 120 km for the
regular mesh (MPAS-120r hereafter) and 12 km for the variable mesh (MPAS-12v
hereafter). Figure shows the model topography of the WRF
reference model at 12 km resolution (WRF-12r hereafter) and the
MPAS-A models. Also indicated is a classification of the land area into five
agro-climatical zones, which will be used in the further analysis. Naturally,
the topography is nearly identical for WRF-12r and MPAS-12v. Minor
differences can be seen along the coastline and inland water bodies, which
are caused by the different grids and by the need to re-grid the MPAS model
output onto a regular lat–long grid. This re-gridding is performed with an
NCL script
mpas_to_latlon.py (L. Fowler, personal communication, 2014), which
adds another step to the post-processing tasks and is computationally quite
demanding; for example, a minimum of 1.5 GB of memory and 10 s run time is
required to re-grid one 3-D variable of MPAS-12v for one time step. For
MPAS-120r, the terrain is smoothed out and the coastline (often referred to
as the landmask in the models) is ill represented. This has negative effects when
verifying the model output over regions such as Guinea, as we will see below.
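To give an idea of what such a re-gridding step involves, the sketch below performs a crude nearest-neighbour mapping from unstructured cell centres to a regular latitude-longitude grid (this is an illustration only, not the mpas_to_latlon.py utility, which performs a more careful interpolation).

```python
import numpy as np
from scipy.spatial import cKDTree

def to_unit_vectors(lat, lon):
    """Convert latitudes/longitudes in radians to 3-D unit vectors."""
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

def regrid_nearest(cell_lat, cell_lon, field, res_deg=0.25):
    """Sample an unstructured cell-centre field onto a regular lat-lon grid
    by nearest neighbour, using chordal distances on the unit sphere."""
    tree = cKDTree(to_unit_vectors(cell_lat, cell_lon))
    lats = np.deg2rad(np.arange(-90.0, 90.0, res_deg))
    lons = np.deg2rad(np.arange(-180.0, 180.0, res_deg))
    glon, glat = np.meshgrid(lons, lats)
    _, idx = tree.query(to_unit_vectors(glat.ravel(), glon.ravel()))
    return field[idx].reshape(glat.shape)

# Usage with synthetic cell centres standing in for the 60-12 km mesh.
rng = np.random.default_rng(0)
cell_lat = np.arcsin(rng.uniform(-1.0, 1.0, 535554))
cell_lon = rng.uniform(-np.pi, np.pi, 535554)
regular = regrid_nearest(cell_lat, cell_lon, np.sin(cell_lat))
```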
(Top) mean near-surface temperature (over land) in ∘C for July 1982
for the three observational data sets CRU, UDEL, GHCN, the WRF reference and the two
MPAS model runs; (bottom) annual cycle of mean near-surface temperature over
the entire land area and the five agro-climatical zones depicted in Fig. .
(Top) monthly precipitation (over land) in millimetres for July 1982 for the three
observational data sets CRU, UDEL, GPCC, the WRF reference and the two MPAS model
runs; (bottom) annual cycle of monthly precipitation over the entire land
area and the five agro-climatical zones depicted in Fig. .
In Fig. , we analyse the spatial distribution of the
near-surface temperature for July 1982 (top panel) and its annual
cycle between September 1981 and July 1982 (bottom panel). The
observations provide data over land only and show noticeable spatial
differences in the position and intensity of the Saharan heat low
(SHL), in particular between CRU/UDEL and GHCN, and minor differences
over Ghana and along the coastline. Regarding the model runs, WRF-12r
matches the position and temperature of the SHL best and reproduces
the observed temperatures. Both MPAS models show a slightly colder
surface temperature distribution for July 1982, and the cold bias in
MPAS-120r is larger on average. The sea surface temperature
distribution is similar for the two MPAS runs, since they are using
the same SST data for their daily updates. However, MPAS-120r shows
strong artefacts along the coastline of the Gulf of Guinea, which is
due to its inaccurate landmask. The WRF-12r SST, which is updated
every 6 h from ERA-Interim data, is colder over the Gulf of Guinea,
but otherwise shows the same patterns. With respect to the temporal
evolution, the annual cycle is reproduced well for both MPAS models;
however, a significant cold bias is detected over most of the land area
from the time of model initialisation in September 1981 up to
May 1982. From June to July 1982, this bias seems to nearly vanish
with the exception of the coarser MPAS-120r over Guinea due to the
limited resolution of the coastline. The observational data sets
displayed in the lower panel agree in general, but show small
differences for the Saharan region (only CRU and GHCN are displayed
for clarity; UDEL is very close to CRU).
Figure likewise displays the spatial distribution of
the monthly precipitation for July 1982 (top panel) and its annual
cycle (bottom panel) for the three observational data sets and the
three models. At the onset of the monsoon season, the rain band is
centred over 10∘ N for all three observational data
sets. Areas of high precipitation are found over Guinea, Sierra Leone
and the Senegal in the west, and over Nigeria, Cameroon and the
Central African Republic in the east. While the spatial distribution
hardly shows any difference between the observational data sets, the
temporal evolution deviates for the Saharan and Guinea regions. For
the Saharan region, small absolute values and a very small number of
actual observations lead to large relative uncertainties. For Guinea,
the spatial interpolation along the coastline leads to differences in
the timing of the maximum precipitation (April 1982 for CRU, May 1982
for GPCC; as for temperature, UDEL follows CRU closely).
The three models successfully reproduce the location of the rain band,
a fact that should not be taken for granted. In the case of WRF,
it has been demonstrated that the representation of the monsoon
dynamics is largely determined by the microphysics and the planetary
boundary layer schemes, while the cumulus scheme seems to play more of
a role on daily timescales and for the actual amount of precipitation
triggered by mesoscale convection. The WRF model configuration chosen
here uses the WSM5 microphysics scheme, the ACM2 planetary boundary
layer parameterisation and the Grell–Freitas cumulus
scheme. It is highly optimised for the region and thus
not only matches the location of the rain band but also reproduces
the observed precipitation patterns over the Soudano, the Sahel and
the Sahelo regions. The two MPAS model runs, on the other hand, use an
out-of-the-box set-up, which consists of the WSM6 microphysics
scheme (similar to WSM5 with graupel as additional hydrometeor), the
YSU (Yonsei University) planetary boundary layer parameterisation and the Kain–Fritsch
cumulus scheme. This particular combination produces excessive
precipitation during the peak of the monsoon in WRF due to
a non-linear response of convective precipitation events to the
dynamics, and it seems to exhibit the same behaviour in the two MPAS
models.
(Top) mean sea level pressure, (middle) mean soil temperature and
(bottom) mean relative soil moisture over land for July 1982 for the WRF reference and the MPAS model runs.
Annual cycle of (left) mean sea level pressure in hPa, (middle)
mean soil temperature in ∘C and (right) mean relative soil moisture in % over land for the WRF reference and the MPAS model runs.
With respect to the annual cycle of precipitation, the WRF model shows
excessive precipitation for all months and an early onset of the
monsoon season for the Soudano, Sahel and Sahelo regions. The excess
in rainfall over all land area is mainly caused by an overestimation
over Guinea, which receives the most rainfall over the year (more than
3000 mm year⁻¹, compared to less than 500 mm year⁻¹ in the Sahelo
region). The out-of-the-box MPAS models match the observations
better, in particular the timing of the onset of the rainy season and
the precipitation over Guinea. The coarser MPAS-120r is closer to the
observed precipitation over the inland areas than the
higher-resolution MPAS-12v during the onset of the monsoon in
June/July 1982. This, however, is not a signal of a higher accuracy of
the coarser model: at 120 km resolution, the Kain–Fritsch cumulus
scheme is less active than at 12 km resolution, and consequently its
wet bias is reduced.
To support this further, Fig. a displays the
mean sea level pressure map for July 1982. WRF-12r and MPAS-12v both
show a distinct area of low pressure, the SHL,
alongside the south–north pressure gradient that is causing the
movement of the ITCZ and the associated monsoon rain band. This SHL is
much less pronounced in MPAS-120r. In both MPAS models, the region of
low pressure is displaced by about 10° to the east compared to
WRF. A climatological mean position of the SHL of
3° W, 23° N has been derived from re-analysis data between 1979 and
2001, which coincides well with the WRF position.
However, the SHL is not only a region of low surface
pressure but also of high surface temperatures. Comparing
Figs. and a,
it is apparent that the areas of low pressure and high temperature are
co-located for WRF-12r. The high-temperature regions
of MPAS-120r and MPAS-12v match the WRF SHL position, albeit with a cold
bias of about 2 °C, while the low-pressure region is shifted
to the east. This shift is caused by the overall cold bias in the two
MPAS models: the SHL, like any other non-frontal thermal low,
is formed by rapid solar heating of the
land surface and the near-surface layers, which leads to the rising of
hot, less dense air and to the formation of a stationary low-pressure
area. Thus, the cold bias in the MPAS models implies a less intense
deepening of the pressure system at the expected location of the
SHL. As discussed above, the general cold bias over all land areas
persists from the onset of the MPAS model runs in September 1981 and
only starts to vanish from June to July 1982. This is reflected in
the temporal evolution of the mean sea level pressure
(Fig. , left panel), integrated
over all land areas, which is highest for MPAS-120r, lowest for
WRF-12r and in between for MPAS-12v, with the three curves converging
from June 1982 onwards.
A final investigation of the soil properties in
Figs. b and c
and (middle and right panel)
reveals the root cause of the cold temperature bias and the associated
displacement of the low-pressure system. Starting from the MPAS model
initialisation in September 1981, the soil temperature is about
2 °C lower than for the WRF model run, which is initialised
in January 1979 and thus has more than 2 years to spin up the
land surface model (community Noah land surface model for both WRF and MPAS). Over the course
of the simulated 11 months, the MPAS soil temperatures start to
converge towards the WRF soil temperature: in July 1982, the soil
temperature distributions in Fig. b are
already quite similar. However, the relative soil moisture maintains
a constant bias for most of the time and still shows spatial
differences in July 1982. This is due to the fact that between the
MPAS model initialisation in September 1981 and the onset of the
monsoon season in 1982, virtually no precipitation occurs over large
fractions of the land area and that the low initial soil moisture
therefore has no opportunity to adjust.
We therefore conclude from this section that the concept of a global, variable-resolution mesh and its implementation in MPAS-A have the potential
to reproduce the dynamics of the WAM and its
associated precipitation. In this single, 11-month experiment, the
out-of-the-box set-up of MPAS can compete with the optimised WRF
set-up with respect to the timing of the onset of the rainy season.
The high-resolution MPAS-12v model shows a better performance than
the coarse-resolution MPAS-120r model in general, but in particular
for the coastal regions. Deficiencies in the location of the SHL and a cold bias in the surface temperature are apparent in
both MPAS model runs, which are caused to a large extent by an
insufficient spin-up time of the models. Detailed investigations
of the WRF reference run, not discussed here, suggest that a spin-up
time of at least 1 year is required in order to adjust the soil moisture.
The MPAS runs also tend to overestimate the monsoon precipitation,
which we attribute to the combination of physical parameterisations
of the out-of-the-box set-up.
Extreme scaling experiment at very high resolution
In this section, we evaluate the scalability and limitations of the MPAS-A model for a very large mesh on a massively parallel system. This experiment is motivated by the following two aspects:
Future HPC environments will most likely be massively parallel
systems, with the number of cores per node and the number of nodes
increasing much faster than the speed of the individual
cores. Models such as MPAS-A have to be able to scale on these
systems in order to be used successfully in the future.
The typical model resolution of global and regional NWP and
climate models has increased continuously over the past few
decades. Currently, the European Centre for Medium-Range Weather
Forecasts (ECMWF) is leading the field with an operational global
NWP model at 14 km resolution, and limited area
models have been taken down to sub-kilometre resolution. With this
experiment, we want to demonstrate that convection-resolving
(below about 4 km grid size) global atmospheric simulations are
possible on current HPC environments.
To create a large-enough problem for these tests at a convection-resolving
resolution, we use a regular, global 3 km mesh with more than 65 million
horizontal grid cells and 41 vertical levels. Up to now, only a few real-data
simulations on this mesh have been conducted on NWSC's Yellowstone
supercomputer using a maximum of 16 384 MPI tasks (approx. 4000 owned cells
per task); a set of scaling benchmarks based on an idealised case has also
been run on up to 131 072 MPI tasks on Edison, a Cray XC30 at the National
Energy Research Scientific Computing
Center.
Among the four HPC sites presented earlier, only the FZJ Juqueen
offers a large enough number of cores to conduct this extreme scaling
experiment. With a maximum of 458 752 cores, these tests require
scaling out to the full system, which is not possible during normal
operations. However, following the 3rd Juqueen Tuning and Porting
Workshop in February 2015, a few selected applicants were invited to
conduct extreme scaling tests on the full machine during a period of
24 h. The results presented in the following were obtained during
this event, which is summarised in a technical report.
Model configuration and experiment preparation
Several modifications of the MPAS-A code are required to conduct this
experiment. First, it is no longer possible to use the standard
NetCDF large-file format (CDF-2), since the maximum number of elements
for some of the 4-D variables exceeds the internal limit of
2 billion values for any particular record. One possible solution is
to use the CDF-5 extension of CDF-2, which is supported by the
Parallel-NetCDF library. However, only very few applications
understand this format, and initial testing on Juqueen revealed
problems reading correct data in massively parallel read
operations. Another solution, which is adopted here, is to use the
newer NetCDF-4 format, which supports parallel I/O through PHDF5. This
requires upgrading the parallel I/O library PIO from v1.7.1 to 1.9.15,
and modifying the I/O framework of the MPAS-A model code. Motivated by
this study, MPAS-A v4.0, released at the time of writing, supports
NetCDF-4 I/O without any need to modify the software framework.
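To illustrate the format constraint, the following minimal Python sketch checks whether a single record of a 4-D field on the 3 km mesh exceeds the per-record element limit of CDF-2; the exact cell count used here is an assumption for illustration:

    # Minimal sketch: one record of a 4-D field (time, cells, levels)
    # on the 3 km mesh versus the CDF-2 per-record element limit.
    N_CELLS = 65_536_002   # horizontal cells of the global 3 km mesh (assumed exact count)
    N_LEVELS = 41          # vertical levels
    CDF2_LIMIT = 2**31     # roughly 2 billion elements per record

    elements_per_record = N_CELLS * N_LEVELS
    print(f"elements per record: {elements_per_record:,}")             # 2,686,976,082
    print(f"exceeds CDF-2 limit: {elements_per_record > CDF2_LIMIT}")  # True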
Further, the generation of the model input data becomes a large
computational problem, which cannot be handled on a single machine due to
time constraints and memory limitations. In MPAS release v3.1, the
pre-processing of the data is partly a serial process running on one
CPU core only. Hence, a parallelisation of the pre-processing is
required in addition to the above changes to the I/O routines. These
changes are applied to the basic MPAS-A v3.1 code, and this modified
version is used in the following scaling experiments.
Another difference from the default model configuration presented in
Sect. is that each test is run for 1 h
model time only. This implies that the update of the surface data
(sea surface temperature, sea ice fraction), which occurs every 24 h
in the above moderate scaling tests, is dropped. Also, with a global
resolution of 3 km, it is generally agreed that no convection scheme
is required, since the microphysics scheme is able to generate the
convective precipitation systems on the grid scale. These
modifications are reflected in the model namelist in
Appendix .
The pre-processing consists of two steps. First, a static data set is
produced (static.nc), which maps invariants such as terrain
height, landmask and land use classification onto the 3 km
mesh. These invariant fields are required as input for the subsequent
generation of the initial conditions (init.nc), in part
because the terrain field is necessary for the generation of the
height-based vertical grid. The parallel pre-processing steps require
special mesh partitions, different from the partitions generated by
METIS and used for the model runs. Here, both steps are conducted with
576 cores, spread across 80 nodes on the ForHLR1 to provide sufficient
memory for each task. Each step takes about 1 h 15 min real time and
requires around 4.6 TB of the available 5.1 TB of memory. The
resulting initial conditions file has a size of 1.2 TB and is
transferred to Juqueen over the 10 Gb s⁻¹ internet connection between the
two HPC sites. Since both pre-processing steps are only required once,
no further investigation or optimisation of the run times is attempted.
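As a side note, the quoted file size and link speed yield a lower bound on the transfer time; a minimal sketch, idealised in that it ignores protocol overhead and contention on the shared link:

    # Idealised lower bound for transferring the 1.2 TB initial
    # conditions file over the 10 Gb/s link between the two HPC sites.
    file_size_bytes = 1.2e12   # 1.2 TB
    link_rate_bps = 10e9       # 10 Gb/s

    transfer_s = file_size_bytes * 8 / link_rate_bps
    print(f"ideal transfer time: {transfer_s:.0f} s (~{transfer_s / 60:.0f} min)")  # ~16 min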
MPAS-A 3 km global simulation experiment (strong scaling tests).

Nodes     Tasks (MPI only)    Bootstrapping (s)    Initial read (s)
4096      65 536              1260                 1260
8192      131 072             1370                 590
16 384    262 144             1560                 1020
24 576    393 216             1680                 1080
28 672    458 752             1740                 1140
Nodes     Integration time for     Parallel efficiency     Integration in
          1 h model time (s)       (integration only)      24 h wall time^a
4096      1760                     100.0 %                 29 h
8192      960                      91.2 %                  53 h
16 384    490                      90.1 %                  104 h
24 576    335                      87.7 %                  152 h
28 672    360                      69.5 %                  141 h

^a Estimated time extrapolated from the 1 h integration, including initial bootstrapping/reading, output to the disk and a 12 s time step.
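The parallel efficiencies can be approximately recomputed from the integration times; the following sketch uses the rounded timings from the table, so the results deviate slightly from the quoted (unrounded) percentages:

    # Strong-scaling efficiency relative to the 4096-node baseline:
    # eff(n) = t(4096) * 4096 / (t(n) * n), using the rounded
    # integration times per 1 h model time from the table.
    timings_s = {4096: 1760, 8192: 960, 16384: 490, 24576: 335, 28672: 360}

    base_nodes = 4096
    base_time = timings_s[base_nodes]
    for nodes, t in sorted(timings_s.items()):
        eff = base_time * base_nodes / (t * nodes)
        print(f"{nodes:6d} nodes: {100 * eff:5.1f} %")  # 100.0, 91.7, 89.8, 87.6, 69.8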
First attempts and optimisations
Initial test runs at 3 km resolution revealed previously unknown
problems on the system. As described in
Sect. , a bootstrapping step is required
during the model initialisation to set up the grid and instruct
individual tasks with whom to share information about neighbouring
grid cells. In the MPAS code, this is implemented using hash
tables. In order to complete the bootstrapping in a reasonable time,
the hash table size (parameter TABLESIZE) is increased from
the default value 27 183 to 6 000 000. After this adjustment, the
bootstrapping step takes between 18 and 29 min on Juqueen, depending
on the number of MPI tasks (see
Table ). A second bottleneck is the
reading of the initial conditions file. Performance improvements for
this step are achieved by setting two run time environment variables
that were presented during the tuning and porting workshop
(BGLOCKLESSMPIO_F_TYPE, ROMIO_HINTS), and by
optimising the number of I/O tasks. While in the previous scaling
tests all tasks participate in the I/O, we find improvements when
using only 128 I/O tasks per 1024 nodes (1 rack, 16 384 tasks),
with an average read performance of 1.2 GB s⁻¹
and an average write performance of 0.6 GB s⁻¹. Due to the large number
of parallel tasks in this extreme scaling experiment, we use the
unit nodes instead of tasks throughout this section,
where 1024 nodes correspond to 1 rack or 16 384 tasks.
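For illustration, the resulting I/O task layout can be written down as a small sketch, assuming (as stated above) 16 tasks per node and 128 I/O tasks per rack of 1024 nodes:

    # I/O task layout found to perform well on Juqueen: 128 I/O tasks
    # per rack (1024 nodes, 16 384 MPI tasks), i.e. every 128th task.
    def io_layout(nodes, tasks_per_node=16, nodes_per_rack=1024, io_per_rack=128):
        tasks = nodes * tasks_per_node
        io_tasks = nodes // nodes_per_rack * io_per_rack
        return io_tasks, tasks // io_tasks   # I/O task count and stride

    for nodes in (4096, 8192, 16384, 24576, 28672):
        io_tasks, stride = io_layout(nodes)
        print(f"{nodes:6d} nodes: {io_tasks:5d} I/O tasks (stride {stride})")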
Required times for individual steps of the 3 km test runs on Juqueen (18 s time step).
With these optimisations, the parallel reading of the initial
conditions file improves slightly with the number of tasks, since they
are located in different racks. Conversely, the bootstrapping step
takes longer for larger numbers of tasks. Hence, the overall model
initialisation takes approximately 45 min for the 3 km mesh and
varies only slightly with the number of nodes. One notable
exception here is the run on 8192 nodes, for which the
initial read takes only about half as long as for the other runs. The
exact reasons for this behaviour remain to be investigated, but we
suspect that this combination of file size and I/O tasks constitutes
a sweet spot on Juqueen.
Execution of extreme scaling tests
The substantial memory requirements for the 3 km mesh do not allow one to run
the model on 1024 or 2048 nodes. The baseline for our scaling experiment is
therefore the run on 4096 nodes (4 racks, 65 536 MPI tasks, 512 I/O tasks, 65 TB
memory). Contrary to the model initialisation, the time integration step
scales very well up to the entire machine, with a parallel efficiency of
87 % for 24 576 nodes (393 216 tasks, 167 cells per task) compared to the baseline (see
Table ). The test run on the full system with 28 672 nodes (458 752 tasks, 143 cells per task)
shows a lower performance than the run on 24 racks and a parallel efficiency of nearly 70 %,
which confirms the above-mentioned limit of approximately 150 owned cells per task, below which the
parallelisation becomes inefficient.
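The quoted cells-per-task values follow directly from the mesh size; a quick sketch, again assuming an exact cell count of 65 536 002 for the 3 km mesh:

    # Average owned cells per MPI task on the 3 km mesh, used to judge
    # the ~150-cells-per-task limit of the parallelisation.
    N_CELLS = 65_536_002   # assumed cell count of the global 3 km mesh

    for tasks in (16_384, 65_536, 131_072, 262_144, 393_216, 458_752):
        print(f"{tasks:7,d} tasks: {N_CELLS / tasks:7.0f} owned cells per task")
    # 4000, 1000, 500, 250, 167, 143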
All scaling tests are conducted with an 18 s model integration time step for a 1 h model time.
However, we find that in order to keep the model stable when starting the
3 km mesh from initial conditions derived from a 48 km re-analysis data set
(CFSR), a more conservative time step is required. MPAS currently lacks
a dynamical initialisation system (e.g. digital filters, adaptive
time-stepping), which could avoid this issue. Model instabilities leading to
NaNs can affect the performance in different ways: (1) the performance might
increase in cases where if-NaN tests cause the code to return early from
computationally intensive physics routines, or (2) the performance might
decrease due to the continuous generation of floating-point exceptions.
We note that on Blue Gene systems, MPAS does not by default include the -qflttrap compiler
flag to trap floating-point exceptions, and, consequently, the model
will continue to run even when the simulation generates NaNs.
We therefore repeat the runs for 4096, 8192 and 16 384 nodes with a 12 s time step to
obtain a stable model run. The measured real times for the three 12 s runs
are very close to 1.5 times the real times for the corresponding 18 s runs,
which gives us confidence that we can scale the results for
24 576 and 28 672 nodes with an 18 s time step to
a 12 s time step, despite the fact that the runs with an 18 s time step produced NaNs.
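The scaling factor of 1.5 follows simply from the number of integration steps per model hour; as a trivial check:

    # A 12 s time step requires 18/12 = 1.5 times as many steps per
    # model hour as an 18 s time step, hence ~1.5x the real time.
    factor = 18 / 12
    print(f"expected slowdown: {factor}x")  # 1.5x
    for nodes, t18 in ((24576, 335), (28672, 360)):
        print(f"{nodes} nodes: ~{t18 * factor:.0f} s per model hour at 12 s steps")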
Table and
Fig. summarise the required times of the
individual steps of the 3 km runs with an 18 s integration time
step. Due to wall-time constraints, we only conduct runs without
writing output to the disk. The last column in
Table estimates how many hours the
3 km model can be advanced within 24 h wall time, and is calculated
as follows: a 12 s model integration time step is assumed, and the
real time required is scaled from the 18 s runs by a factor of 1.5 for
24 576 and 28 672 nodes, for which no 12 s runs are conducted. For a typical
production run, diagnostic output files of 13 GB size are written
every 3 h model time, while comprehensive output files of
approximately 250 GB size are written every 24 h model
time. A restart file of 2.1 TB size is written at the end of the
model run. Based on a parallel write performance of
0.6 GB s⁻¹, we make a conservative estimate that roughly
2 h of the 24 h wall time will be used up by writing these files
to the disk. Tables and list the file sizes and the
cheapest and fastest model runs for the 3 km mesh on Juqueen.
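The 2 h estimate can be reconstructed from the stated file sizes and the measured write rate; a sketch for the three longest runs from the table:

    # Rough reconstruction of the output-writing estimate: 13 GB of
    # diagnostics every 3 h model time, ~250 GB of comprehensive output
    # every 24 h model time, one 2.1 TB restart file, at ~0.6 GB/s.
    WRITE_RATE_GBS = 0.6   # measured parallel write performance

    def write_time_h(model_hours):
        volume_gb = (model_hours // 3) * 13 + (model_hours // 24) * 250 + 2100
        return volume_gb / WRITE_RATE_GBS / 3600

    for model_hours in (104, 141, 152):
        print(f"{model_hours:3d} model hours: ~{write_time_h(model_hours):.1f} h of writing")
    # ~1.6 h, ~1.8 h, ~2.0 h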
We conclude from this extreme scaling test that the dynamical solver
of MPAS scales on massively parallel systems out to hundreds of
thousands of cores. Our results confirm that the model behaves similarly
for the 3 km mesh as for the significantly smaller problem sizes
and that the parallel efficiency is limited by the same factors,
namely the increasing number of halo cells and the amount of communication
for a large number of tasks. This occurs
around 150 owned cells per task, which corresponds to roughly
27 300 nodes for the 3 km mesh. However, we find that the model
initialisation and the disk I/O become increasingly important and at
the same time difficult to improve for extremely large test
cases. Compared to the model integration, the time required for the
model initialisation and for reading and writing data is largely
independent of the number of tasks. For a maximum wall time of 24 h on
Juqueen, these steps consume up to 3 h or 10–15 % of the total
job time – in other words, hundreds of thousands of CPU h. The
cheapest run on Juqueen utilises 4096 nodes (4 racks, 65 536 tasks),
consumes 1.3 million CPU h for a 24 h model integration and gives a speed-up of
1.2 × real time. For 24 576 nodes (24 racks, 393 216 tasks),
a speed-up of 6.3 × real time is achieved at the slightly higher
expense of 1.5 million CPU h.
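These cost figures follow from the table above; a minimal reconstruction, assuming 16 cores per Juqueen node:

    # Speed-up = model hours advanced per 24 h wall time; the CPU h cost
    # of 24 h model time then follows from the core count (16 per node).
    CORES_PER_NODE = 16

    for nodes, model_hours in ((4096, 29), (24576, 152)):
        speedup = model_hours / 24
        cpu_h = nodes * CORES_PER_NODE * 24 / speedup
        print(f"{nodes:6d} nodes: speed-up {speedup:.1f}x, "
              f"{cpu_h / 1e6:.1f} million CPU h per 24 h model time")
    # 1.2x / 1.3 million CPU h and 6.3x / 1.5 million CPU h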
Conclusions
In this study, we analyse the atmospheric model MPAS-A in detail for
its numerical performance and for its physical accuracy. We conduct
scaling tests for three medium-size problems using regular and
variable meshes of different complexity on four different HPC
facilities. We confirm an overall good scaling (≳ 70 %
parallel efficiency) of MPAS across all systems and find that
a value of 150 cells owned by each task provides a good and quick
estimate for the breakdown of the parallel performance
when set-up and I/O costs are included in the scaling. This limit
depends weakly on the interconnect of the system (absolute but also
relative to the core speed), with faster interconnects corresponding
to lower values, but also on the I/O performance. Accordingly, it is
slightly lower for the Blue Gene system (between 100 owned cells
per task for the 60–12 km mesh and 150 owned cells per task
for the 3 km mesh). Taking into account that the set-up costs are
amortised over longer runs, MPAS-A maintains a parallel efficiency of
80 % or better for more than 150 owned cells (100 owned cells on Juqueen)
per task. Based on these findings, we provide numbers on the typical file sizes and
optimal model configurations for conducting research and operational
runs on the different HPC systems.
An in-depth analysis of the properties of different graph partitions
for one of the meshes shows that the impact of non-contiguous graph
partitions, in the form of changes in the number of halo cells, is
negligible for any reasonable number of tasks for a given problem
size. We further employ the parallel profiling tools Scalasca and
Score-P to identify the bottlenecks in the MPAS-A code when the
parallel performance breaks down. Our findings confirm that most of
the time in such cases is spent waiting during the communication with
neighbouring tasks, but also demonstrate the negative impact that
computational imbalances can have on the model performance.
We also study the accuracy of MPAS-A for one common and challenging
problem in climate research, namely the capability to reproduce the
dynamics of the WAM and its associated
precipitation. We conduct 11-month simulations for two meshes,
a regular 120 km mesh and a variable 60–12 km mesh, and compare the
model output to a number of observation data sets, selected
re-analyses and a reference model run. The reference model run is
chosen from a novel set of regional climate simulations over West
Africa within the framework of the WASCAL programme and employs the
regional climate model WRF, from which MPAS inherits several aspects
of the dynamical solver and all of its physical parameterisation
schemes.
We find that MPAS-A is able to model the monsoon dynamics and the
northwards movement of the monsoon rain band in this pilot study.
Despite using an out-of-the-box configuration of the model, both runs reproduce the
timing of the onset of the monsoon season better than the optimised
WRF reference run. We find that the precipitation in the
early monsoon season is overestimated, which we attribute to the
choice of physics parameterisations. The MPAS model runs also show
a cold bias in the near-surface temperature and consequently fail to
place the SHL at the correct location, which we believe
stems from a too short spin-up time of the model.
We would like to stress that our pilot study aims primarily at
investigating the feasibility of conducting climate modelling research
with MPAS for the West African region. In order to be able to draw
general conclusions on the ability of MPAS to reproduce observed
climatological properties, an ensemble of model runs over longer
time periods with sufficiently long spin-up times is required.
In the last part of this study, we conduct extreme scaling tests on
a global 3 km mesh with more than 65 million grid cells on up to
458 752 cores on Juqueen, the IBM Blue Gene/Q at the Forschungszentrum
Jülich. We describe the issues that arise when attempting such an
experiment for the first time – up to now, MPAS-A has been run for
real-data cases, which include a full physics suite, on a maximum
number of 16 384 tasks on Yellowstone – and provide solutions that
allow one to conduct the scaling test in the first place and improve the
model performance. We find that the model scales very well up to the
entire machine with a parallel performance of nearly 70 % for
458 752 tasks. We confirm that the limitations and rules to estimate
the scaling, derived for moderate problem sizes, are also valid for
the extremely large test case. This gives confidence for planning
model experiments and estimating required run times, storage and
computational resources.
Furthermore, we identify additional aspects in the model that become
increasingly relevant for larger problem sizes: the model
initialisation and the disk I/O. We describe strategies to improve the
performance of the model, which are partly machine dependent. We
further give estimates on required run times and resources for
conducting scientific experiments with the 3 km mesh on Juqueen.
Our next steps will be to conduct a number of longer simulation
experiments on regular and variable-resolution meshes with a moderate
number of grid cells. Specifically, we plan to pursue the study on the
dynamics of the WAM using a variable-resolution grid
such as the 60–12 km mesh and a regular grid with a similarly fine
resolution. This will allow us to compare the accuracy of the model
after a full spin-up of the soil conditions and to assess the impact
of the variable mesh on the model results. It will also allow us to
study physical processes such as the teleconnection between the oceans
and the African monsoon systems, and investigate the impact of climate
and land use changes in a consistent approach.
In conclusion, the MPAS-A model is a novel atmospheric model that
scales well on a range of architectures for small up to extremely
large numbers of tasks. Based on an unstructured Voronoi mesh, it
allows global simulations to be conducted with local refinement regions and
smooth transitions in between them. This makes it possible to study
local-scale processes in regions of interest with a full coupling to
the large-scale motions and a physical consistency within the
model. This is shown here in a pilot study of the WAM. We also demonstrate that it is possible to conduct global,
convection-resolving atmospheric simulations with MPAS on current and
future massively parallel systems. However, it is evident that
the application of models such as MPAS for extremely large problem
sizes and numbers of tasks requires substantial efforts to optimise the
model to the problem and to the machine it is run on. In
order to do so, interdisciplinary approaches and more intensive
training of scientists in hardware, software and programming techniques
are paramount.