We discuss two parallelization schemes for MagIC, an open-source, high-performance, pseudo-spectral code for the numerical solution of the magnetohydrodynamics equations in a rotating spherical shell. MagIC calculates the non-linear terms on a numerical grid in spherical coordinates, while the time step updates are performed on radial grid points with a spherical harmonic representation of the lateral directions. Several transforms are required to switch between the different representations. The established hybrid parallelization of MagIC uses message-passing interface (MPI) distribution in radius and relies on existing fast spherical transforms using OpenMP. Our new two-dimensional MPI decomposition additionally distributes the latitudes or the azimuthal wavenumbers across the available MPI tasks and compute cores. We discuss several non-trivial algorithmic optimizations and the different data distribution layouts employed by our scheme. In particular, the two-dimensional distribution data layout yields a code that shows strong scaling well beyond the limit of the current one-dimensional distribution. We also show that the two-dimensional distribution implementation, although not yet fully optimized, can already be faster than the existing finely optimized hybrid parallelization when using many thousands of CPU cores. Our analysis indicates that the two-dimensional distribution variant can be further optimized to also surpass the performance of the one-dimensional distribution for a few thousand cores.

The dynamics in many astrophysical objects like stars, planets, or moons
are aptly modelled by the fluid flow and magnetic field generation
in a rotating sphere or spherical shell. Since the pioneering work by

MagIC still mostly follows the original algorithm laid down by

Over the last 20 years several aspects of the original algorithm by

For the last 20 years, MagIC simulations have resulted in more than 120 peer-reviewed publications (see

Achieving efficient use of the available computer resources with a given numerical implementation often proves challenging, in particular when moving to petascale architectures. There are two main reasons for this. First, the large number of physical compute cores requires a workload large enough to be fairly partitioned and distributed across these processing units. This typically puts an upper bound on the number of cores that can usefully be employed and often results in poor strong scaling. A second complication arises from the different layers of memory access and of communication between the physical cores (i.e. non-uniform memory access – NUMA – domains, sockets, nodes). Ideally, one would attempt to keep all data “local” for quick access. However, optimizing a code to properly distribute the workload while keeping reasonable data locality can be difficult and often requires compromising one aspect in favour of another.

Until recently, MagIC only offered a one-dimensional distribution of the data implemented using MPI+OpenMP
(hereafter referred to as “1d-hybrid” implementation). For calculating the non-linear terms in
grid space, this code uses MPI to distribute the spherical shells between the available NUMA domains and the inherent OpenMP
scheme of the open-source spherical harmonics transform library
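The one-dimensional radial distribution can be sketched in a few lines (illustrative Python; function and variable names are ours, not MagIC's). Each MPI rank owns a contiguous block of radial levels, with the remainder spread over the leading ranks:

```python
def block_distribution(n_r, n_ranks):
    """Contiguous block distribution of n_r radial grid points over
    n_ranks MPI ranks.  The first (n_r % n_ranks) ranks receive one
    extra point, so the imbalance is at most a single radial level."""
    base, extra = divmod(n_r, n_ranks)
    bounds, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))  # half-open interval [start, stop)
        start += size
    return bounds
```

For instance, 10 radial levels on 4 ranks yield blocks of sizes 3, 3, 2, 2.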

The first two-dimensional MPI domain decomposition in pseudo-spectral codes in spherical geometry was considered in the ASH code by

Motivated by the aforementioned points, we propose in this work a two-dimensional data distribution layout for MagIC with communication-avoiding features. This required a major rethinking of data structures and communication algorithms, as well as a re-implementation and thorough optimization of a large portion of the existing 1d-hybrid code. Due to the high complexity of the required refactoring tasks, OpenMP parallelism was dropped from the current implementation for the time being. Since the two-dimensional distribution implementation presented here relies on pure MPI communication, we refer to it as the “2d-MPI” implementation or version. Implications of and incentives for eventually re-introducing OpenMP into the new version will be discussed along the way.

This paper is organized as follows. In Sect.

In this section we introduce aspects of the numerical formulation which are relevant for the understanding of this work. Since the implementation of the magnetohydrodynamic (MHD) equations in MagIC still closely follows the original work by

We consider a spherical shell of inner radius

To ensure the solenoidal nature of

Reordering the terms in Eq. (

The formulation discussed in Eqs. (

The quadrature shown in Eq. (
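The quadrature underlying the Legendre transform in pseudo-spectral codes of this kind is typically Gauss–Legendre quadrature, which evaluates the projection integrals exactly for band-limited functions. A minimal sketch of this idea (assuming NumPy; names are ours):

```python
import numpy as np

def legendre_analysis(f_coeffs, n_quad):
    """Recover the Legendre coefficients of a band-limited function via
    Gauss-Legendre quadrature: c_l = (2l+1)/2 * sum_k w_k P_l(x_k) f(x_k).
    The quadrature is exact as long as 2*n_quad - 1 >= deg(P_l * f)."""
    x, w = np.polynomial.legendre.leggauss(n_quad)   # nodes and weights on [-1, 1]
    f = np.polynomial.legendre.legval(x, f_coeffs)   # synthesize f at the nodes
    out = []
    for l in range(len(f_coeffs)):
        basis = np.zeros(l + 1)
        basis[l] = 1.0                               # coefficient vector of P_l
        P_l = np.polynomial.legendre.legval(x, basis)
        out.append((2 * l + 1) / 2.0 * np.sum(w * P_l * f))
    return np.array(out)
```

Starting from known coefficients, synthesizing the function on the quadrature nodes and re-analysing it returns the coefficients to machine precision.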

During the initialization stage, MagIC allows the choice between finite differences and a spectral expansion using Chebyshev polynomials for the radial
discretization. Finite-difference methods allow the use of faster point-to-point communications, but they also require a larger number of nodal
points to ensure proper convergence of the solution. In this work, we explicitly chose to focus solely on the spectral approach to the radial discretization,
but we encourage the interested reader to consult

Each spectral coefficient

With the spatial discretization fully specified, we can proceed with the time discretization. To ease the exposition, we will
derive the main steps using the equation for the time evolution of the magnetic poloidal potential

The equation for

Equation (

In order to mitigate the time step constraints associated with an explicit treatment of the diffusion terms, MagIC adopts an implicit–explicit
(IMEX) time-stepping approach. Non-linear and Coriolis terms are handled using the explicit part of the time integrator, while the remaining
linear terms are treated implicitly. Currently, IMEX multisteps
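As an illustration of such an IMEX splitting (a generic second-order Crank–Nicolson/Adams–Bashforth multistep, not necessarily the exact scheme used in MagIC), consider an evolution equation $\partial_t u = \mathcal{L}u + \mathcal{N}(u)$ with the linear part $\mathcal{L}$ treated implicitly and the non-linear part $\mathcal{N}$ treated explicitly:

```latex
\frac{u^{n+1} - u^{n}}{\Delta t}
  = \frac{1}{2}\,\mathcal{L}\!\left(u^{n+1} + u^{n}\right)
  + \frac{3}{2}\,\mathcal{N}\!\left(u^{n}\right)
  - \frac{1}{2}\,\mathcal{N}\!\left(u^{n-1}\right).
```

Each step then requires solving a linear system $(I - \tfrac{\Delta t}{2}\mathcal{L})\,u^{n+1} = \text{RHS}$, which, in a spherical harmonic representation, couples only the radial points for each $(\ell, m)$ pair.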

Let

MagIC is a highly optimized hybrid MPI+OpenMP code under active development. The 1d-hybrid version uses MPI to
distribute the

The purpose of this section is to first familiarize the reader with the established 1d-hybrid implementation and then introduce the new 2d-MPI implementation. By adding MPI parallelism in a second direction, the extension also allows distributing the computations within a shell over the NUMA domains, sockets, or even nodes of a computer cluster.

Due to the high complexity of this re-implementation, our 2d-MPI version lacks any use of OpenMP. The main purpose of this work is to provide a thorough assessment of the prospects and merits of a two-dimensional data distribution, to pinpoint shortcomings, and to discuss the overall viability of a possible fully optimized and fine-tuned two-dimensional distribution implemented using OpenMP+MPI.

In this section we first present the pseudocode for the sequential algorithm for MagIC in
Sect.

For the sake of simplicity, we discuss only the second-order time-stepping scheme described
in Sect.

In this section we discuss how the simulation is distributed across MPI ranks for the 1d-hybrid and the 2d-MPI
implementations. In both cases, an MPI Cartesian grid topology is used.
The
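The 2-D Cartesian rank layout can be mimicked with a toy mapping (illustrative Python; in the code the layout comes from MPI_Cart_create and MPI_Cart_sub):

```python
def cart_layout(n_r_ranks, n_theta_ranks):
    """Row-major 2-D Cartesian process grid: maps each flat MPI rank to
    (radial, latitudinal) coordinates and lists the rank groups that the
    radial and latitudinal sub-communicators (cf. MPI_Cart_sub) would hold."""
    coords = {rank: divmod(rank, n_theta_ranks)
              for rank in range(n_r_ranks * n_theta_ranks)}
    rows = [[r * n_theta_ranks + t for t in range(n_theta_ranks)]
            for r in range(n_r_ranks)]       # ranks sharing a radial coordinate
    cols = [[r * n_theta_ranks + t for r in range(n_r_ranks)]
            for t in range(n_theta_ranks)]   # ranks sharing a latitudinal coordinate
    return coords, rows, cols
```

For a 2 × 3 grid, rank 5 sits at coordinates (1, 2), and the transposes discussed later communicate only within a single row or column group.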

Let

This changes for the execution of the

Illustration of the “snake ordering” of the
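The snake ordering can be reproduced with a few lines (illustrative Python; names are ours). Orders are dealt out forwards and then backwards across the ranks, so that cheap high-$m$ orders are paired with expensive low-$m$ ones:

```python
def snake_distribution(m_max, n_ranks):
    """Assign spherical harmonic orders m = 0..m_max to ranks in
    boustrophedon ("snake") order: 0, 1, ..., P-1, P-1, ..., 1, 0, 0, 1, ...
    Since the number of (l, m) modes per order, l_max - m + 1, shrinks
    with m, pairing low and high orders balances the per-rank workload."""
    owner, rank, step = {}, 0, 1
    for m in range(m_max + 1):
        owner[m] = rank
        if not 0 <= rank + step < n_ranks:
            step = -step            # bounce off the ends of the rank range
        else:
            rank += step
    return owner
```

For example, with m_max = l_max = 7 on 4 ranks, every rank ends up with exactly 9 of the 36 (l, m) modes.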

Illustration of the 2d-MPI distribution of the points with

The distribution of the

Just like in the 1d-hybrid implementation, an

We opted for a compromise that exploits a special regularity of the current data structure and leaves the

The figure also illustrates a drawback of this distribution: the

The spherical harmonics transform (SHT) takes a substantial portion of the runtime. MagIC relies heavily on SHTns

Next we describe how both the 1d-hybrid and 2d-MPI version of the code use SHTns for computing the SHT.

The so-called

The

Two parameters need to be determined, namely which MPI algorithm to use and the length of the queue.
The choice of the MPI algorithm depends on the MPI implementation, the
hardware, and the number of resources. We compare the performance of each variant in Sect.

As for the length of the

As discussed in Sect.

Most of the wall time of the

Next we describe the

In both the 1d-hybrid and 2d-MPI implementations the

In this section we compare the performance of the 1d-hybrid and 2d-MPI implementations in a practical, realistic setting. We also profile some crucial sections of the code in order to highlight the advantages and shortcomings of the different implementations.

Following

All tests were performed on the Cobra cluster at the Max Planck Computing and Data Facility (MPCDF). Each node of this
machine possesses two Intel Xeon Gold 6148 processors and 192 GB of main memory. Each processor has a single NUMA domain with
20 cores. The nodes are interconnected with a 100

The runtime of each crucial section was measured for each MPI rank independently using perflib, a lightweight profiling library developed in-house. Our figures show the average over all ranks as a solid line. A coloured background shows the spread between the fastest and slowest ranks and visualizes the imbalance of the times measured.

The performance of this code section is of special interest. This is because the

For the tests we fixed the size of a scalar field in grid space to 240 KiB per rank per radial level. This
can be achieved by using the formula

We expect the performance of the

We measure the performance of the

Effective usage of

Runtime for the

Figure

Since the A2AW implementation for the

In this subsection we continue the comparison between intranuma, internuma, and internode regimes but now bring the whole spherical harmonics transform into perspective. Since this constitutes one of the largest portions of the computation, it is particularly interesting to determine the behaviour in the different communication regimes.

Runtime for the different components of the SHT in the intranuma

Figure

Even for the intranuma case shown in Fig.

In this subsection we compare the performance of the four different

The problem we chose for this and subsequent subsections is the dynamo benchmark with

For the 2d-MPI scheme, we fix the

Strong scalability of

In Fig.

The 1d-hybrid version shows a good strong scalability, which, however, starts to degrade after 1000 cores (25 nodes). At this core count the 1d-hybrid version transitions from 10 to 20 threads. OpenMP efficiency is typically difficult to maintain for large thread counts, and this is likely the reason for the performance degradation.

The 2d-MPI implementation, especially for the A2AW variant, scales remarkably well up to 6000 cores (150 nodes), at which point the performance starts to degrade. However, the losses remain acceptable up to the largest number of cores we could test: 24 000. The A2AV variant performs similarly to A2AW, and beyond 12 000 cores all variants show nearly identical timings. This test favours A2AW, which we therefore keep in the following. However, since the behaviour may differ on other architectures and with other MPI libraries, an auto-tuning routine that selects the fastest option during the initialization phase of a simulation seems a worthwhile future addition.
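The auto-tuning idea mentioned above could look as follows (a hypothetical sketch; in practice MagIC would time its actual MPI transposition variants, e.g. A2AV vs. A2AW, on a representative field during initialization):

```python
import time

def autotune(variants, payload, n_trials=3):
    """Pick the fastest of several interchangeable routines by timing
    each a few times on a representative payload and keeping the
    best-of-n wall time."""
    best_name, best_time = None, float("inf")
    for name, fn in variants.items():
        elapsed = float("inf")
        for _ in range(n_trials):
            t0 = time.perf_counter()
            fn(payload)
            elapsed = min(elapsed, time.perf_counter() - t0)
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

Taking the best of several trials rather than the mean makes the selection more robust against transient system noise.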

In this subsection we take a deeper look into the different parts of the

Strong scalability for the

In Fig.

In both the 1d-hybrid and 2d-MPI implementations, the performance is dominated by the transposition time.
Here the 2d-MPI clearly outperforms the 1d-hybrid implementation because of the reduced
communication, as discussed in Sect.

The timing results in Fig.

For a fixed

However, the performance gain due to the communication-avoiding distribution of the 2d-MPI implementation far outweighs these
shortcomings. We would like to highlight the fact that it is only possible to reduce the communication volume due to the particular
distribution pattern shown in Fig.

Finally we discuss the strong scaling of the MagIC time step, i.e. of the main application without
initialization, finalization, and diagnostics. As usual, we define the parallel efficiency as
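A minimal helper, assuming the conventional strong-scaling definition of efficiency relative to a reference run:

```python
def parallel_efficiency(t_ref, n_ref, t, n):
    """Strong-scaling parallel efficiency of a run on n cores taking
    time t, relative to a reference run on n_ref cores taking t_ref:
    E = (t_ref * n_ref) / (t * n).  E = 1 means ideal speedup; the
    cross-efficiency between two code versions is obtained by taking
    the reference run from the other version."""
    return (t_ref * n_ref) / (t * n)
```

Doubling the cores while halving the runtime gives E = 1; any smaller runtime reduction lowers E proportionally.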

Following

Strong scalability for the 1d-hybrid and 2d-MPI implementations of MagIC for 100 time steps of the dynamo
benchmark using

Parallel efficiency and cross-efficiency of the 1d-hybrid and 2d-MPI implementations, detailing

In Fig.

The radial loop for the 1d-hybrid code scales well up to 6000 cores, but the inferior scalability of the

The 2d-MPI implementation starts suffering from a small loss of scalability of the

The parallel cross-efficiency (see Table

The benefits offered by the 2d-MPI implementation can be illustrated with a simple example. Instead of performing a simulation with the 1d-hybrid implementation on 6000 cores, a scientist could opt for the 2d-MPI implementation on 12 000 cores and obtain the solution in only 56 % of the runtime. In other words, by investing only 12.5 % more CPU hours, the 2d-MPI implementation arrives at the same solution in roughly half the time. Furthermore, the 2d-MPI implementation allows the user to allocate more computing nodes and thus grants access to more distributed memory.
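The arithmetic behind this trade-off is straightforward (the runtime fraction 0.5625 below is our back-computed unrounded value consistent with the quoted 56 % and 12.5 %):

```python
def cpu_hour_overhead(cores_a, runtime_a, cores_b, runtime_b):
    """Relative extra CPU-hours of run B over run A:
    cores_b * runtime_b / (cores_a * runtime_a) - 1."""
    return cores_b * runtime_b / (cores_a * runtime_a) - 1.0
```

Doubling the cores from 6000 to 12 000 while the runtime drops to 0.5625 of the original costs 12.5 % more CPU-hours.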

The parallel cross-efficiency also gives an idea of how “far” the parallel efficiency of the 2d-MPI
is from the 1d-hybrid implementation. There is a significant gap for 120 to 1200 cores,
but the gap closes at about 2000 cores, e.g.

Finally, the main characteristics measured on our benchmark platform are expected to carry over to other HPC platforms based on contemporary x86_64 CPUs (e.g. AMD EPYC) and high-performance interconnects (e.g. Mellanox InfiniBand), provided that well-optimized builds of SHTns and the MPI library are available for the platform. In general, for good performance of the 1d-hybrid version it is mandatory to establish the optimal number of threads per MPI rank and to respect the placement of the threads within the NUMA domains. The new 2d-MPI version has the advantage of being less affected by the actual node topology, but its performance is sensitive to the MPI library's optimization of routines such as

The parallel cross-efficiency in Table

For the 1d-hybrid code, let

Table

The next natural question is how the computation times of the radial loop of both implementations compare with each other.
For that we use the ratio

The column

Radial loop performance comparison for the dynamo benchmark.

We now attempt to determine how much improvement in the 2d-MPI implementation is required for
its main application runtime (including all transposition times) to match the runtime of the 1d-hybrid code.
In practice, several portions of the 2d-MPI code could benefit from cache-friendly memory access strategies, fine-tuned
vectorization, and computation–communication overlap, amongst others. We will now discuss a much simpler
scenario in which

Table

The last three entries in Table

We described a new parallelization scheme based on a two-dimensional, MPI-based data decomposition for MagIC, an open-source code for
three-dimensional fluid dynamics simulations in spherical geometry, with high scientific impact in fields
ranging from fundamental fluid dynamics and modelling of planetary dynamos to stellar dynamics.
MagIC uses spherical surface harmonics of degree

Thanks to a number of new concepts, the 2d-MPI version presented here can already compete with the hybrid parallelization in terms of runtime, and in addition it offers the possibility to use a significantly larger number of cores. This opens the door to employing tens of thousands of CPU cores on modern HPC clusters and paves the way to the next-generation CPU architectures.

Decisive factors for its success include a communication-avoiding data distribution of the

Our results showed that a mere 10 % performance gain in the computation time of the radial loop
of the 2d-MPI implementation would bridge the gap between the two code versions for 2000 cores or more,
in addition to providing an extended strong scalability regime.
Other sets of equations solved by MagIC, such as the anelastic equations

Our implementation paves the way towards a future unified hybrid variant of MagIC which combines a two-dimensional data distribution with the proven benefits of an additional OpenMP layer in order to further improve the computational performance across an even wider range of simulation scenarios. This future 2d-hybrid version could retain the benefits of our communication-avoiding data distribution and its improved strong scaling behaviour, while still benefiting from the performance of the finely optimized multithreaded libraries used within the radial loop. We expect that the performance of the radial loop of such a 2d-hybrid implementation will be closer to the performance of the 1d-hybrid version, effectively eliminating the main bottleneck of the 2d-MPI version.

Looking further into the future, the hybrid MPI-OpenMP parallelization of MagIC would be a natural starting point for developing a GPU-accelerated version of the code. Such a porting effort would probably start out from the 1d-hybrid version of MagIC, mapping an entire GPU to a radial zone (or MPI rank), and relying on the performance of the GPU variant of SHTns on top of the very high memory bandwidth and floating-point performance provided by an individual GPU of the current or next generation. A 2d-hybrid version of MagIC could serve as the basis for an even more flexible, accelerated version of MagIC e.g. for applications in which an even higher radial resolution is desired or when the individual accelerators (possibly of a kind different than GPU) are comparatively less powerful.

MagIC's source code is under active development and available for download from

No data sets were used in this article.

TG and JW developed and maintained the main branch of MagIC. TG also managed the use of external libraries (FFTW, LAPACK, SHTns, etc.) and fine optimization of OpenMP. TD conceptualized, implemented, and tested the algorithms for the 1d-hybrid implementation. RL conceptualized, implemented, and tested the algorithms for the 2d-MPI implementation. The numerical experiments were designed and conducted by RL with the supervision of TD and TG. TD and MR supervised HPC aspects of the project. JW and TG were responsible for supervising and interpreting the geophysical aspects of the project. The original draft was written by RL. The reviewing and editing of the current draft was conducted equally by all authors.

The contact author has declared that neither they nor their co-authors have any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The article processing charges for this open-access publication were covered by the Max Planck Society.

This paper was edited by Josef Koller and reviewed by two anonymous referees.

Intel® oneAPI Math Kernel Library – C, available at: