MagIC v5.10: a two-dimensional MPI distribution for pseudo-spectral magnetohydrodynamics simulations in spherical geometry

We discuss two parallelization schemes for MagIC, an open-source, high-performance, pseudo-spectral code for the numerical solution of the magnetohydrodynamics equations in a rotating spherical shell. MagIC calculates the non-linear terms on a numerical grid in spherical coordinates, while the time step updates are performed on radial grid points with a spherical harmonic representation of the lateral directions. Several transforms are required to switch between the two representations. The established hybrid implementation of MagIC uses MPI parallelization in radius and relies on existing fast spherical harmonic transforms using OpenMP. Our new two-dimensional MPI decomposition additionally distributes the latitudes or the azimuthal wavenumbers across the available MPI tasks and compute cores. We discuss several non-trivial algorithmic optimizations and the different data distribution layouts employed by our scheme. In particular, the two-dimensional data distribution yields a code that strongly scales well beyond the limit of the current one-dimensional distribution. We also show that the two-dimensional implementation, although not yet fully optimized, can already be faster than the existing finely optimized hybrid implementation when using many thousands of CPU cores. Our analysis indicates that the two-dimensional variant can be further optimized to also surpass the performance of the one-dimensional distribution for a few thousand cores.


Introduction
The dynamics in many astrophysical objects like stars, planets, or moons is aptly modelled by the fluid flow and magnetic field generation in a rotating sphere or spherical shell. Since the pioneering work by Glatzmaier (1984), several numerical codes have been developed over the years to model the problem. Typically, they all solve for convection and magnetic field induction in a co-rotating reference frame. The solutions are formulated as disturbances about a hydrostatic, well mixed, and adiabatic reference state. The so-called Boussinesq approximation assumes a homogeneous background state and yields a particularly simple formulation, while the anelastic approximation (e.g. Lantz and Fan, 1999; Jones et al., 2011) allows incorporating radial variations of the background reference state and the transport properties.

Since MagIC does not support multithreading between NUMA domains, the number of NUMA domains that can be used is limited by the maximum number of radial grid points.
This limitation can be quite severe. HPC clusters have typically offered two NUMA domains per compute node, but the current trend in hardware development points towards a further subdivision into multiple physical units (e.g. AMD EPYC and Intel Cascade Lake-AP) or logical NUMA domains (e.g. sub-NUMA clustering in Intel Xeon). Effectively, the computing power per NUMA domain has stagnated and may even decrease in the future. This would prevent MagIC from taking advantage of performance gains in the foreseeable evolution of processor technology. In addition, the numerical experiments by Matsui et al. (2016) suggest that two- or three-dimensional parallelization might be necessary for dynamo codes to achieve petascale capabilities on current hardware architectures.
The first two-dimensional MPI domain decomposition for pseudo-spectral codes in spherical geometry was introduced in the ASH code by Clune et al. (1999). The layout was nevertheless not completely two-dimensional in the sense that the maximum number of MPI ranks was still bounded by the maximum spherical harmonic degree. This limitation was lifted in more recent implementations by Marti and Jackson (2016) in the context of modelling full spheres, by Matsui et al. (2014) in the geodynamo code Calypso, and in the Rayleigh code. The performance benchmark by Matsui et al. (2016) however revealed an important overhead for these codes compared to pseudo-spectral codes which avoid global communications, such as the open-source code XSHELLS (e.g. Schaeffer et al., 2017).

Motivated by the aforementioned points, we propose in this work a two-dimensional data distribution layout for MagIC with communication-avoiding features. This required a major rethinking of data structures and communication algorithms and demanded a re-implementation and thorough optimization of a large portion of the existing 1d-hybrid code. Due to the high complexity of the required refactoring tasks, OpenMP parallelism was dropped from the current implementation for the time being. Since the two-dimensional distribution presented here relies on pure MPI communication, we refer to it as the "2d-MPI" implementation or version. Implications and incentives to eventually re-introduce OpenMP into the new version will be discussed along the way.

This paper is organized as follows. In section 2 we provide an overview of the mathematical formulation and the numerical techniques used in MagIC. In section 3 we detail the pseudo-codes for the sequential algorithm and for the 1d-hybrid and 2d-MPI implementations. We also describe the underlying data structures and the key differences between each approach.
In section 4 we benchmark both implementations and compare their performance.

Magnetohydrodynamics equations and numerical formulation

In this section we introduce the aspects of the numerical formulation which are relevant for the understanding of this work. Since the implementation of the magnetohydrodynamic (or MHD for short) equations in MagIC still closely follows the original work by Glatzmaier (1984), we refer to this publication and to more recent reviews (e.g. Tilgner, 1999; Hollerbach, 2000; Christensen and Wicht, 2015) for a more comprehensive description of the algorithm.

Dynamo model
We consider a spherical shell of inner radius r_i and outer radius r_o rotating about the z-axis with a constant frequency Ω.

where p is the non-hydrostatic pressure, e_r and e_z are the unit vectors along the radial and axial directions respectively, and g = r/r_o is the dimensionless gravity. This set of equations is controlled by four dimensionless parameters: the Rayleigh number Ra, the Ekman number E, the magnetic Prandtl number Pm, and the Prandtl number Pr, where α is the thermal expansivity, g_o is the gravity at the outer boundary, and κ is the thermal diffusivity. Equations (1)-(4) need to be complemented by appropriate boundary conditions on temperature, velocity, and magnetic field, as well as an initial state.
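The explicit definitions of these parameters did not survive in this excerpt; with the conventional Boussinesq choices (assuming the shell gap d = r_o - r_i as length scale, ν the kinematic viscosity, λ the magnetic diffusivity, and ΔT the imposed temperature contrast; cf. Christensen and Wicht, 2015), they would read:

```latex
Ra = \frac{\alpha g_o \Delta T \, d^3}{\nu \kappa}, \qquad
E  = \frac{\nu}{\Omega d^2}, \qquad
Pr = \frac{\nu}{\kappa}, \qquad
Pm = \frac{\nu}{\lambda}.
```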

Spatial discretization
To ensure the solenoidal nature of u and B, Eq. (2)-(4) are solved in the spherical coordinates (r, θ, φ) by expanding the velocity and the magnetic field into poloidal and toroidal potentials. This formulation yields the six time-dependent scalar quantities W, Z, G, H, T, and p. These six variables are expanded in spherical harmonic functions up to degree and order ℓ_max in the angular directions (θ, φ). Since the spherical harmonics are the set of eigenfunctions of the Laplace operator on the unit sphere, they form an especially attractive basis for representing functions in spherical coordinates. They are defined by Y_ℓ^m(θ, φ) = P_ℓ^m(cos θ) e^(imφ), where P_ℓ^m are the normalized associated Legendre functions (see Abramowitz and Stegun, 1964). Any function f(r, θ, φ, t) at radius r and instant t can be expanded in a truncated spherical harmonic decomposition. Letting an overbar denote the complex conjugate, it holds that f_(ℓ,-m)(r, t) is the complex conjugate of f_(ℓm)(r, t) for all real functions f. This effectively halves the cost of computing and storing the spherical harmonic coefficients.
Reordering the terms in Eq. (7) and using the definition of the spherical harmonics from Eq. (6), we obtain what is known as the inverse spherical harmonic transform. By projecting f onto Y_ℓ^m over the sphere using the Gauss-Legendre quadrature, we obtain the forward transform, where θ_j and w_j are respectively the Gauss nodes and weights for the N_θ collocation points in the latitudinal direction and φ_k = 2kπ/N_φ are the N_φ regularly spaced grid points in the azimuthal direction.
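As a minimal, self-contained sketch of this quadrature (our illustration, not MagIC's implementation, which delegates to SHTns and an FFT library), the forward transform combines a uniform sum over φ with a Gauss-Legendre sum over cos θ; SciPy's `sph_harm` supplies Y_ℓ^m (note SciPy's argument order: order m, degree ℓ, azimuth, colatitude):

```python
import numpy as np
from scipy.special import sph_harm

l_max = 8
n_theta = 16                      # Gauss-Legendre nodes; quadrature exact for n_theta > l_max
n_phi = 2 * n_theta               # regularly spaced azimuthal points

x, w = np.polynomial.legendre.leggauss(n_theta)   # nodes/weights in x = cos(theta)
theta = np.arccos(x)
phi = 2.0 * np.pi * np.arange(n_phi) / n_phi
T, P = np.meshgrid(theta, phi, indexing="ij")     # (n_theta, n_phi) grids

def forward(f, l, m):
    """Quadrature for f_lm: uniform sum in phi, Gauss-Legendre sum in cos(theta)."""
    ylm = sph_harm(m, l, P, T)                    # Y_l^m sampled on the grid
    return (2.0 * np.pi / n_phi) * np.sum(w[:, None] * f * np.conj(ylm))

# sampling a pure mode recovers its coefficient exactly (up to round-off),
# since the Y_l^m are orthonormal on the unit sphere
f = sph_harm(3, 5, P, T)
coef = forward(f, 5, 3)      # ~ 1
leak = forward(f, 4, 3)      # ~ 0
```

This also illustrates the exactness statement quoted later for Eq. (11): with n_theta > l_max, the θ-integrand is a polynomial in cos θ of degree at most 2 l_max, which the Gauss-Legendre rule integrates exactly.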

The formulation discussed in Eq. (8)-(11) provides a two-step algorithm for computing the spherical harmonic transforms (hereafter SHT), corresponding to a Legendre transform and a Fourier transform. The latter can be handled by finely optimized implementations of the fast Fourier transform (FFT for short) such as the FFTW library (Frigo and Johnson, 2005), while the Legendre transform is delegated to the SHTns library (Schaeffer, 2013), which relies on the recurrence relation by Ishioka (2018). Additionally, in the context of a practical numerical implementation, a transposition is required between the two transforms; this will be covered further in section 3.
The quadrature shown in Eq. (11) is exact for N_θ > ℓ_max (Schaeffer, 2013), but in practice we set ℓ_max = 2N_θ/3 in order to prevent aliasing errors (e.g. Orszag, 1971; Boyd, 2001). In fact, we also set N_φ = 2N_θ for the number of longitudinal points, which guarantees an isotropic resolution in the equatorial regions (Glatzmaier, 2013).

During the initialization stage, MagIC allows the choice between finite differences and a spectral expansion using Chebyshev polynomials to handle the radial discretization. The finite-difference method allows the use of faster point-to-point communications, but it also requires a larger number of nodal points to ensure a proper convergence of the solution. In this work, we explicitly chose to focus solely on the spectral approach for the radial discretization, but we encourage the interested reader to consult Matsui et al. (2016) for an extensive comparison between several different numerical implementations.

Each spectral coefficient f_ℓm(r, t) is expanded in a truncated Chebyshev series up to degree N_c - 1, where c = 2/(N_r - 1) is a normalization factor and the double quotes mean that the first and last terms of the sum are multiplied by one half. In the above expression, C_n(x) is the nth-order first-kind Chebyshev polynomial, and x_k = cos(πk/(N_r - 1)) is the kth nodal point of a Gauss-Lobatto grid with N_r collocation points. The discrete radius r_k, defined between r_i and r_o, is mapped onto x_k using an affine mapping. Conversely, the Chebyshev spectral coefficients of the functions f_ℓm(r, t) follow from the inverse relation. With this particular choice of Gauss-Lobatto collocation points, Eq. (12) and Eq. (13) can be efficiently computed using fast discrete cosine transforms (DCTs) of the first kind (Press et al., 2007, § 12.4.2).
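With the Gauss-Lobatto sampling, the forward and inverse Chebyshev transforms reduce to DCTs of the first kind. The following sketch (our illustration with hypothetical shell radii, not MagIC's Fortran implementation) uses SciPy's `dct`; the 2/(N_r - 1) normalization and the half-weighted first and last terms correspond to the double-quoted sums in the text:

```python
import numpy as np
from scipy.fft import dct

n_r = 17
k = np.arange(n_r)
x = np.cos(np.pi * k / (n_r - 1))      # Gauss-Lobatto points on [-1, 1]
ri, ro = 7.0 / 13.0, 20.0 / 13.0       # hypothetical shell radii
r = 0.5 * ((ro - ri) * x + (ro + ri))  # affine map onto [ri, ro]

f = np.exp(r)                          # a smooth field sampled on the radial grid

# forward transform: DCT-I with the c = 2/(N_r - 1) normalization;
# DCT-I weights the two end points with one half implicitly
a = dct(f, type=1) / (n_r - 1)

# inverse transform: DCT-I is its own inverse up to a factor 2(N_r - 1)
f_back = dct(a, type=1) / 2.0

# Chebyshev coefficients of smooth functions decay spectrally fast
assert abs(a[-1]) < 1e-12 * abs(a[0])
```

The round trip reproduces the grid values to machine precision, which is exactly the property the time stepper relies on when switching between nodal and spectral radial representations.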

Time discretization
With the spatial discretization fully specified, we can proceed with the time discretization. For an easier understanding, we will derive the main steps using the equation for the time evolution of the magnetic poloidal potential G as an illustrative example.
We refer the interested reader to (e.g. Christensen and Wicht, 2015) or the online documentation of MagIC for the derivation of the other equations.
The equation for G is obtained by considering the radial component of the induction equation (see Eq. (3)). Using the spherical harmonic decomposition and the Chebyshev collocation method described above, we obtain Eq. (14), where C''_n is the second radial derivative of the nth Chebyshev polynomial and dS = sin θ dθ dφ is the spherical surface element.
Equation (14) can be rewritten in the following matrix form

where the matrices A and L ∈ R^(N_r×N_r) contain primarily the coefficients of C_n(x_k) and C''_n(x_k), are dense, and depend on ℓ but not on m. Equation (15) forms a set of ordinary differential equations that depend on time and contain a non-linear term, namely N_ℓm(t) ∈ C^(N_r), and a stiff linear diffusion operator on the right-hand side.
In order to mitigate the time step constraints associated with an explicit treatment of the diffusion terms, MagIC adopts an implicit-explicit (IMEX) time stepping approach. Non-linear and Coriolis terms are handled using the explicit part of the time 190 integrator, while the remaining linear terms are treated implicitly. Currently, IMEX multisteps (e.g. Ascher et al., 1995) or IMEX diagonally-implicit Runge-Kutta (DIRK, e.g. Ascher et al., 1997) are available.
Let δt denote the time step size. A general k-step IMEX multistep method applied to Eq. (15) reads as Eq. (16), where the exponent notation corresponds to the time discretization t_i = t_0 + iδt, and a_j, b_j^E, and b_j^I correspond to the weights of the IMEX multistep scheme. In practice, multistep schemes present stability domains that decrease with their order of convergence and require the knowledge of past states to continue the time integration. As such, they are not self-starting and need to be initiated with a lower-order scheme. MagIC implements DIRK schemes to overcome these limitations, but here we discuss only a simple two-step IMEX scheme. Setting the IMEX weights to a = (1, 0), b^E = (3/2, -1/2) and b^I = (1/2, 1/2, 0), Eq. (16) reduces to the popular IMEX scheme assembled from a Crank-Nicolson and a second-order Adams-Bashforth scheme (Glatzmaier, 1984) and can simply be written as Eq. (17), with M_δt ∈ R^(N_r×N_r) and G_ℓm^(i+1), B_ℓm^i ∈ C^(N_r). Assuming suitable initial and boundary conditions, all terms needed to compute B_ℓm^i are known from the previous iterations. The matrices M_δt contain the Chebyshev coefficients and are dense, but real-valued and of moderate size (realistic values of N_r range from 33 to 1025). As such, the linear system in Eq. (17) typically requires O(N_r^2) operations to be solved (Boyd, 2001) using an LU decomposition and backward substitution from standard libraries such as LAPACK (Anderson et al., 1999). From here on, we abuse the notation and use Eq. (17) when discussing the time stepping for any field, i.e. not only the magnetic poloidal potential G.
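The resulting Crank-Nicolson/Adams-Bashforth update can be sketched as follows (a toy, dense stand-in for M_δt and the non-linear term, not MagIC's actual operators); the key point is that M_δt is factorized once and only backward substitutions are needed while δt stays fixed:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

n = 8
L = -np.diag(np.arange(1.0, n + 1))   # hypothetical stiff diffusion operator
dt = 1.0e-2
I = np.eye(n)

M = I - 0.5 * dt * L                  # Crank-Nicolson (implicit) matrix
lu, piv = lu_factor(M)                # LU factors, reused while dt is unchanged

def nonlinear(g):
    return 0.1 * g**2                 # hypothetical explicit (non-linear) term

def cnab2_step(g, n_now, n_prev):
    """One step: Crank-Nicolson for L, 2nd-order Adams-Bashforth for N."""
    rhs = (I + 0.5 * dt * L) @ g + dt * (1.5 * n_now - 0.5 * n_prev)
    return lu_solve((lu, piv), rhs)

g = np.ones(n)
n_prev = nonlinear(g)                 # first-order bootstrap for the first step
for _ in range(100):
    n_now = nonlinear(g)
    g = cnab2_step(g, n_now, n_prev)
    n_prev = n_now
```

Treating the diffusion term implicitly keeps the step stable even though L has eigenvalues of very different magnitudes, which is precisely why MagIC adopts the IMEX splitting.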

Implementation
MagIC is a highly optimized hybrid MPI+OpenMP code under active development. The 1d-hybrid version uses MPI to distribute the N_r spherical shells amongst different computing nodes, while employing OpenMP to fully use the local cores within an individual NUMA domain. Although MagIC is finely optimized, multithreading across NUMA domains is not implemented. Therefore, simulations require at least one spherical shell per NUMA domain. This restricts the maximum number of NUMA domains to N_r - 1, which severely limits the computational resources that can be employed for a given radial resolution. This disadvantage could grow in the future, since the current trend in HPC architecture is to further subdivide the computing units into physical or even logical NUMA domains.
The purpose of this section is to first familiarize the reader with the established 1d-hybrid implementation and then introduce the new 2d-MPI implementation. By adding MPI-parallelism in a second direction, the extension also allows distributing the computations within a shell over the NUMA domains, sockets or even nodes of a computer cluster.
Due to the high complexity of this re-implementation, our 2d-MPI version does not yet use OpenMP. The main purpose of this work is to provide a thorough assessment of the prospects and merits of a two-dimensional data distribution, to pinpoint shortcomings, and to discuss the overall viability of a fully optimized and fine-tuned two-dimensional distribution implemented using MPI+OpenMP.
In this section we first present the pseudocode for the sequential algorithm of MagIC in subsection 3.1 and then discuss the distribution of the data for both the 1d-hybrid and 2d-MPI versions in subsection 3.2. The two main parts of the code are described in more detail in subsection 3.3 and subsection 3.4.

Sequential Algorithm
For the sake of simplicity, we discuss only the second-order time stepping scheme described in subsection 2.3, Eq. (17). The resulting pseudocode is shown in Algorithm 1. For each time step, the code can be divided into two stages, the "radial loop" and the "ℓ-loop".

The radial loop (lines 2-10): it computes the non-linear terms in grid space (θ, φ, r) and thus requires forward and inverse SHTs. The radial levels are completely decoupled in this stage and can therefore be efficiently distributed without any need for communication between them. The directions (ℓ, m) and (θ, φ), however, are coupled. Notice that the FFTs on lines 6 and 9 take place on the second, slower dimension. This requires the Fourier transform to "stride" through the data in memory. This is not ideal, but modern FFT libraries are able to handle strided transforms efficiently.

The ℓ-loop (lines 13-18): it performs the actual time step. Most of the computational effort goes into solving the linear systems from Eq. (17) and into updating the right-hand sides B_ℓm^i. After updating each B_ℓm^i, the factors of M_δt are computed using LAPACK's real-valued LU decomposition routine dgetrf for dense matrices. Next, the solution of each linear system from Eq. (17) needs to be computed. In practice, all right-hand sides are "packed" into a real matrix B^i ∈ R^(N_r×(2ℓ+2)):

where ℜ and ℑ respectively denote the real and imaginary parts of a vector. The system is then solved for all right-hand sides with a single call to LAPACK's dgetrs. The matrix B^i is computed and stored as (r, m, ℓ) for faster memory access during the LAPACK call. Therefore, the intermediate solution X is also obtained as (r, m, ℓ). The final solution G_ℓm^(i+1) is reconstructed from X by reordering the data back into the (ℓ, m, r) format, while combining the real and the imaginary parts back together.
Notice also that the decomposed factors of M_δt are kept for all subsequent time steps and are only recomputed when δt changes.
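The packing scheme can be illustrated with NumPy and SciPy standing in for the LAPACK calls (`lu_factor` ~ dgetrf, `lu_solve` ~ dgetrs); the matrix and right-hand sides below are hypothetical stand-ins. Since M_δt is real, solving for the real and imaginary columns independently is equivalent to the complex solve:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

n_r, ell = 16, 5
rng = np.random.default_rng(1)
M = rng.standard_normal((n_r, n_r)) + n_r * np.eye(n_r)  # dense, real, well-conditioned
lu, piv = lu_factor(M)                                   # ~ LAPACK dgetrf

# complex right-hand sides for the ell + 1 orders m = 0, ..., ell
b = rng.standard_normal((n_r, ell + 1)) + 1j * rng.standard_normal((n_r, ell + 1))

# pack real and imaginary parts into one real matrix with 2*(ell + 1) columns
B = np.empty((n_r, 2 * (ell + 1)))
B[:, 0::2] = b.real
B[:, 1::2] = b.imag

X = lu_solve((lu, piv), B)        # ~ one dgetrs call for all right-hand sides

# unpack the intermediate solution back into complex form
x = X[:, 0::2] + 1j * X[:, 1::2]
```

Batching all right-hand sides into one call amortizes the per-call overhead and lets LAPACK use cache-friendly blocked triangular solves, which is the motivation for the packing described above.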

Data Distribution
In this section we discuss how the simulation is distributed across MPI ranks for the 1d-hybrid and the 2d-MPI implementations.

In all cases, an MPI Cartesian grid topology is used. The φ direction is numerically periodic, but this property is never explicitly used by the underlying algorithms. The remaining directions are non-periodic. We denote by Π_r the number of MPI ranks used in the radial direction. Similarly, Π_θ represents the number of ranks in the θ or ℓ direction.
Let N_ℓ = ℓ_max + 1 denote the total number of ℓ-modes, N_m the number of m-modes, and N_ℓm the number of (ℓ, m) tuples.
We simply use the − superscript to denote the number of local points in a given MPI rank. For instance, N_r^− denotes the number of local radial points stored in a given rank.
1d-hybrid distribution: in the radial loop, the data distribution of the 1d-hybrid approach follows intuitively from Algorithm 1. The radial shells are distributed evenly among the MPI ranks, ideally N_r^− = N_r/Π_r for each rank. During the computation of the non-linear terms, each rank stores N_φ × N_θ × N_r^− points. During the computation of the miscellaneous terms, each rank stores N_ℓm × N_r^− points.

This changes for the execution of the ℓ-loop. To efficiently compute line 17 of Algorithm 1, all N_r radial points must be local. This is achieved by the so-called ℓ-transposition, which changes the data distribution to N_r × N_ℓm^− local points. Special care is needed when choosing the distribution of the N_ℓm^− points. The computational effort per ℓ-mode is given by the LU decomposition of the matrix and the solve step for the ℓ + 1 right-hand-side vectors of this matrix. To balance the computation among the Π_r processes, we distribute the ℓ-modes in a "snake-like" fashion. Starting with the largest ℓ, which has the highest computational effort, we distribute the modes over the processors until Π_r is reached. The next ℓ is also placed on this last processor, and then we reverse the order until reaching the first processor again. The procedure is repeated until all modes have been distributed. See Fig. 1 for an illustration with N_ℓ = 21 and Π_r = 6. This snake-like distribution guarantees that the largest possible imbalance between two MPI ranks is Π_r − 1 tuples. Furthermore, each matrix B^i can be fully stored locally, which is not the case for the 2d-MPI code, as we discuss below. At the end of the ℓ-loop, another ℓ-transposition takes place, converting the data layout back to N_ℓm × N_r^− local points.

2d-MPI distribution: like in the 1d-hybrid implementation, the radial points are split evenly between the Π_r ranks in the radial loop. Additionally, the N_θ points are distributed evenly and contiguously between the Π_θ ranks (see Fig. 2a). At line 7, each rank is responsible for the computation of N_θ^− × N_φ × N_r^− points. After the FFT at line 9 of Algorithm 1, N_θ^− × N_m × N_r^− points are stored locally, but the computation of the Legendre transform at line 10 requires that all θ angles for a given spherical harmonic order m are available locally.
This is guaranteed by the so-called θ-transposition, which gathers the values for all N_θ θ angles while distributing the m-modes, resulting in N_θ × N_m^− × N_r^− local points. The distribution of the m-modes is more involved than the distribution of the θ angles because there are N_ℓ − m different spherical harmonic degrees for each order m. Once again, we employ the "snake-like" distribution depicted in Fig. 1, which guarantees that the largest imbalance between any two MPI ranks is Π_θ − 1 tuples. See Fig. 2c-d for a graphical representation of this step.
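The snake-like assignment described above can be sketched in a few lines (our reconstruction from the description and Fig. 1; the function name is hypothetical):

```python
def snake_distribution(n_modes, n_ranks):
    """Assign modes (largest, i.e. costliest, first) to ranks in a snake pattern:
    ranks 0, 1, ..., n_ranks-1, then n_ranks-1, ..., 1, 0, then 0, 1, ... so that
    boundary ranks receive two consecutive modes at each turn."""
    owner = {}
    rank, step = 0, 1
    for mode in range(n_modes - 1, -1, -1):    # largest mode first
        owner[mode] = rank
        if not 0 <= rank + step < n_ranks:     # hit a boundary: reverse direction,
            step = -step                       # repeating the boundary rank once
        else:
            rank += step
    return owner

# the Fig. 1 configuration: N_ell = 21 modes over Pi_r = 6 ranks
owner = snake_distribution(21, 6)
counts = [sum(1 for r in owner.values() if r == k) for k in range(6)]
```

Pairing a large mode with a small one on the boundary ranks is what keeps the summed per-rank work nearly uniform, while the per-rank tuple count differs by at most one in this sketch.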

Radial Loop Implementation
The spherical harmonic transform (SHT) takes a substantial portion of the runtime. MagIC relies heavily on SHTns (Schaeffer, 2013), a highly optimized library dedicated to computing the spherical harmonic transform using OpenMP, advanced SIMD vectorization, and cache-efficient strategies (Ishioka, 2018). SHTns has been designed to compute the steps summarized in lines 8-10 and lines 4-6 independently of the radial level r. It can not only handle scalar fields but is also optimized for general vector fields and vector fields with a poloidal-toroidal decomposition. SHTns is written in C but has a Fortran interface with a dedicated data layout for MagIC. Although SHTns is flexible and modular, it does not have an MPI-distributed interface.
However, it offers an interface for computing the Legendre transform (lines 5 and 10 of Algorithm 1) for an individual (m, r) pair, which is the use case in our 2d-MPI implementation. The transposition and FFT computations are then handled separately by MagIC.

Next we describe how both the 1d-hybrid and 2d-MPI versions of the code use SHTns for computing the SHT.
1d-hybrid radial loop: the spherical harmonic transforms are delegated directly to SHTns (see the pseudocode in Algorithm 2). For instance, for the temperature scalar field on the numerical grid, T(θ, φ, r_k), a single call to spat_to_SH returns the fully transformed T_ℓm(r_k) for a given radial level r_k. This is only possible because all (θ, φ) points are local in the 1d-hybrid version of the code. No explicit optimization or multithreading is required, as SHTns handles this internally.

2d-MPI radial loop: the input fields h_i and output fields f_i are ordered in a queue. Upon calling finish_queue, each h_i is effectively transposed. After receiving the data, a complex-to-real FFT (e.g. using Intel MKL) is performed and saved into the output fields f_i. The queue has a limited size. Once the limit is reached, finish_queue is called, immediately triggering the data transfer and the FFT computation.
Additionally, upon exiting the loop, finish_queue is called one last time to treat any remaining field in the queue.
The θ-transposition is implemented using the following algorithms:

- P2P: each θ-rank performs one call to mpi_isend and mpi_irecv for every other θ-rank (i.e. Π_θ − 1 calls) per scalar field, followed by an mpi_waitall call. MPI types are used to stride the data when needed. The advantage of this algorithm is that it does not require any particular packing and/or reordering of the data.
- A2AW: a single call to mpi_alltoallw per scalar field, using the same MPI types as the P2P algorithm.
- A2AV: a single call to mpi_alltoallv per scalar field. No MPI types are needed, but the data needs to be reordered into a buffer prior to the call.
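The data movement performed by these variants can be modelled in a few lines of NumPy (a toy model of the θ-transposition, not the MPI code itself): before the transposition each rank owns a contiguous slab of θ rows for all m-modes; afterwards it owns all θ rows for its slab of m-modes. The explicit per-destination slicing mimics the buffer reordering that A2AV requires:

```python
import numpy as np

pi_theta, n_theta, n_m, n_r = 4, 8, 12, 3
full = np.arange(n_theta * n_m * n_r, dtype=float).reshape(n_theta, n_m, n_r)

theta_slabs = np.array_split(np.arange(n_theta), pi_theta)   # contiguous theta rows
m_slabs = np.array_split(np.arange(n_m), pi_theta)           # m-modes per rank

# before: rank i holds its theta slab for all m (and its local radii)
before = [full[rows] for rows in theta_slabs]

# the transposition: rank i sends before[i][:, m_slabs[j]] to rank j; receiving
# ranks concatenate the pieces in rank order to restore the full theta axis
after = []
for j in range(pi_theta):
    pieces = [before[i][:, m_slabs[j]] for i in range(pi_theta)]
    after.append(np.concatenate(pieces, axis=0))

# after: rank j holds every theta angle for its m slab
ok = all(np.array_equal(after[j], full[:, m_slabs[j]]) for j in range(pi_theta))
```

The non-contiguous column slices are precisely what P2P and A2AW describe with strided MPI types, and what A2AV instead copies into a contiguous send buffer.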
Two parameters need to be determined, namely, which MPI algorithm to use and the length of the queue. The choice of the MPI algorithm depends on the MPI implementation, the hardware and the number of resources. We compare the performance of each variant in subsection 4.1.
As for the length of the θ-transposition queue, several factors must be taken into account. Longer queues allow more scalar fields to be exchanged within a single transposition, reducing the number of communication calls.

ℓ-loop Implementation
As discussed in subsection 3.2, the data needs to be redistributed with the so-called ℓ-transposition prior to the execution of this part of the code; the distribution of the ℓ-modes and their respective m-modes over the MPI processes has likewise been described in that subsection.
Most of the walltime of the ℓ-loop is spent in the ℓ-transposition and in the backward substitution for solving the linear systems from Eq. (17). Depending on how often δt changes, the cost of computing the LU decomposition of each dense matrix M_δt might become a bottleneck, but in the scenario investigated during our benchmarks this was not the case. Nevertheless, a frequent change of δt indicates that the algorithm should be run with a generally smaller time step.
Next we describe the ℓ-loop implementations of both MagIC versions. In both the 1d-hybrid and 2d-MPI implementations, the ℓ-transposition may be performed using the P2P, A2AV, and A2AW algorithms discussed previously. Additionally, the following algorithm has been implemented:

- A2AP: a single call to mpi_alltoall per scalar field, with padding to accommodate the largest message. No MPI types are needed, but the data needs to be reordered into a buffer. Due to the intrinsic imbalance in the distribution of the data, considerably larger buffers are required. Some MPI libraries implement faster algorithms for mpi_alltoall, which can in some cases be advantageous.
In the 1d-hybrid implementation of MagIC, an auto-tuning routine selects the optimal transposition strategy during the initialization stage of the code. For the 2d-MPI version, the algorithm must still be selected manually via the parameter file. We discuss the performance of the different strategies in subsection 4.4.

Benchmark
In this section we compare the performance of the 1d-hybrid and 2d-MPI implementations in a practical, realistic setting. We also profile some crucial sections of the code in order to highlight the advantages and shortcomings of the different implementations. The runtime of each crucial section was measured for each MPI rank independently using perflib, a lightweight profiling library developed in-house. Our figures show the average over all ranks as a solid line. A coloured background shows the spread between the fastest and slowest ranks and visualizes the imbalance of the measured times.

θ-transpose benchmark
The performance of this code section is of special interest, because the θ-transposition is required only for the 2d-MPI implementation and thus adds an unavoidable extra cost for handling the two-dimensional distribution of the data.
For the tests we fixed the size of a scalar field in grid space to 240 KiB per rank per radial level. This can be achieved by using the formula N_θ = 48√(10 Π_θ) to determine the number of latitudinal points.
We expect the performance of the θ-transposition to strongly depend on the distribution of the θ-ranks across the computing nodes. To explore this behaviour, we defined three communication regimes with the following parameters:

- intranuma: θ-communication is constrained within a single NUMA domain: Π_θ = 10, Π_r = 4, N_θ = 480 and N_r = 128.
We measure the performance of the θ-transposition in these three different regimes and determine which algorithm (A2AV, P2P or A2AW, discussed in section 3) and which queue length (for Algorithm 3) perform best. Notice that larger queue lengths do not guarantee full usage of the larger buffers. As an example, suppose that q_max is the queue length and that q fields are queued for transposition. If three new fields are added to the queue (i.e. a vector field) but q + 3 > q_max, the algorithm transposes the first q fields, clears the queue, and then adds the three fields to a new queue. This could be improved to enforce full buffer usage in future versions, but our benchmarks demonstrate that the benefits would be marginal. Table 1 shows the effective usage of the transposition buffers for queue lengths from 3 to 13, as well as the "queue-free" version, denoted by queue length one. Figure 3 shows the times for 50 time steps in the three different communication regimes. The inferior performance of the A2AV implementation is attributed to the time spent reordering the send and receive buffers. Both A2AW and P2P use the same MPI types to avoid this reordering and therefore perform very similarly. Concerning the queue length, the intranuma and intranode regimes perform best with smaller queues, while the opposite is true for the internodes regime.

Since the A2AW implementation of the θ-transposition performs best, we stick to this scheme for the following test, in which we address the impact of the queue length.

2d-MPI SHT Benchmark
In this subsection we continue the comparison between the intranuma, internuma and internode regimes, but now bring the whole spherical harmonic transform into perspective. Since this constitutes one of the largest portions of the computation, it is particularly interesting to determine its behaviour in the different communication regimes. Figure 4 shows the θ-transposition time as well as the computation time spent inside the FFT and Legendre transform calls.
Although the computation time itself should not depend on the queue length, Figure 4 shows a small but consistent influence.
We assume that this is linked to the change in data access pattern caused by queuing and buffering data for the θ-transposition.
Even for the intranuma case shown in Fig. 4a, where a queue length of three provides a small performance boost for the θ-transposition, the performance of the SHT is optimal for a queue length of one. In the following tests we therefore fix the queue length to one for the intranuma and intranode regimes and to 13 for the internode regime.

ℓ-transpose strong scaling benchmark
In this subsection we compare the performance of the four different ℓ-transposition implementations for the 2d-MPI version.
We would like to recall that the 1d-hybrid implementation uses an auto-tuning routine which automatically selects the best algorithm during the initialization. Moreover, the main difference between the 1d-hybrid and 2d-MPI implementations in the ℓ-loop is the distribution of the (ℓ, m) tuples, as already discussed in subsection 3.2. The communication pattern therefore differs significantly between the two implementations.
The problem we chose for this and the subsequent subsections is the dynamo benchmark with N_r = 301 and N_θ = 960. We compute 100 time steps for several values of Π_r and Π_θ, and for core counts ranging from 120 to 24,000 (3 to 600 nodes, respectively). For the 1d-hybrid implementation, we use 10 OpenMP threads per MPI task for tests with up to 1,000 cores (25 nodes) and 20 threads for larger runs (this configuration resulted in the best total time). The 1d-hybrid version of the code can use at most 6,000 cores (150 nodes), with one radial point per socket (Π_r = 300 and 20 threads per MPI task).
For the 2d-MPI scheme, we fix the θ-transposition implementation to A2AW (as discussed in subsection 4.1) and the queue length to one for Π_θ = 10, 20, or 40 and to 13 otherwise. This new version of MagIC admits more combinations of Π_θ and Π_r and allows the use of 24,000 cores (600 nodes). For some core counts, multiple values of Π_θ and Π_r were possible. For example, for 6,000 cores one may choose Π_θ = 20 with Π_r = 300 or Π_θ = 40 with Π_r = 150. We tested both and kept the configuration with the best total time. For the reader's convenience, all plots show the value of Π_θ for the 2d-MPI version on the top horizontal axis. We also set up the experiments such that N_r − 1 is divisible by Π_r, and assign the last radial point to the last radial MPI rank. This is currently a limitation of MagIC. In principle, this should cause imbalance in some cases: e.g. with Π_r = 300, the last radial MPI rank stores and solves twice the number of radial levels of the other ranks. However, the last radial MPI rank receives fewer (ℓ, m) tuples, which diminishes the impact of the imbalance. Since both codes suffer from the same imbalance, we opt for simplifying the analysis to the average runtime only.
In Fig. 5 we can see that, except for the A2AP variant, the 2d-MPI implementation is always significantly faster than the 1d-hybrid implementation. This is a direct consequence of the communication-avoiding pattern of the 2d-MPI distribution.

The 1d-hybrid version shows good strong scalability, which, however, starts to degrade beyond 1,000 cores (25 nodes). At this core count the 1d-hybrid version transitions from 10 to 20 threads; OpenMP efficiency is typically difficult to maintain for large thread counts, and this is likely the reason for the performance degradation. The 2d-MPI implementation, especially the A2AW variant, scales remarkably well up to 6,000 cores (150 nodes), where the performance starts to degrade. However, the losses remain acceptable up to the largest number of 24,000 cores we could test. The A2AV variant performs similarly to A2AW, and beyond 12,000 cores all variants show nearly identical timings.
This test suggests preferring A2AW, which we keep in the following. However, as the behaviour may differ on other architectures and with other MPI libraries, an auto-tuning routine that selects the fastest option in the initialization phase of a simulation seems a good idea for the future.
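The auto-tuning idea suggested above can be sketched as follows. This is a hedged illustration only: the variant names mirror the A2AV/A2AW/A2AP labels used in the text, but the callables are toy stand-ins, not MagIC's actual transposition routines.

```python
import time

def autotune(variants, n_trials=3):
    """Time each candidate a few times during initialization and return
    the name of the fastest one. `variants` maps name -> zero-argument
    callable (here: placeholder workloads, not real transpositions)."""
    best_name, best_time = None, float("inf")
    for name, fn in variants.items():
        start = time.perf_counter()
        for _ in range(n_trials):
            fn()
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

# Toy stand-ins for the three transposition variants:
variants = {"A2AV": lambda: sum(range(2000)),
            "A2AW": lambda: sum(range(1000)),
            "A2AP": lambda: sum(range(4000))}
print(autotune(variants))
```

A production version would time the real MPI transpositions on the actual message sizes of the run, since the fastest variant depends on both the network and the decomposition.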

ℓ-loop strong scaling
In this subsection we take a deeper look into the different parts of the ℓ-loop. We are especially interested in the impact of the 2d-MPI data distribution on the LU decomposition and on the backward substitution. For these tests we use the same setting as in the previous subsection (100 time steps, N_r = 301 and N_θ = 960) and stick to the A2AW implementation of the transpositions. In Fig. 6 we compare the transposition time, the LU decomposition time, and the backward substitution time (referring to Eq. (19)) for the 1d-hybrid and the 2d-MPI implementations. In addition, the total times for the ℓ-loop are shown, which also include the updating of time derivatives and related operations.
In both the 1d-hybrid and 2d-MPI implementations, the performance is dominated by the transposition time. Here the 2d-MPI clearly outperforms the 1d-hybrid implementation because of the reduced communication, as discussed in subsection 3.2.
The timing results in Fig. 6 demonstrate that the decomposition and the backward substitution for the linear systems do not scale well with Π_θ (upper horizontal axis). As explained in subsection 3.2, the data are distributed in the direction of the m-modes, but the time-stepping matrices, which need to be LU-decomposed, are independent of m.
For a fixed Π_r, the decomposition time is thus expected to remain independent of Π_θ. Likewise, the backward substitution time should only be mildly impacted by larger Π_θ. This can be clearly seen in Fig. 6: 120 and 240 cores both have Π_r = 12, with Π_θ = 10 and 20, respectively. Their decomposition times are roughly the same, with a slightly faster backward substitution for 240 cores. The same can be observed for 200 and 400 cores as well as for 3,000 and 6,000 cores. There is also a spike in the decomposition time of the 1d-hybrid implementation at 1,200 cores.
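The m-independence of the time-stepping matrices can be made concrete with a toy linear solve. This is not MagIC's radial scheme: the matrix below is a random well-conditioned stand-in, and `n_m_local` is a hypothetical per-rank mode count. The point is only that the factorization happens once per matrix, while the back substitution is applied to all locally held m-modes at once, so increasing Π_θ shrinks only the number of right-hand sides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_r = 16          # radial size of the toy system (stand-in for N_r)
n_m_local = 4     # m-modes held by one hypothetical theta-rank

# Toy time-stepping matrix for one degree l: independent of m by construction.
A_l = rng.standard_normal((n_r, n_r)) + n_r * np.eye(n_r)
rhs = rng.standard_normal((n_r, n_m_local))   # one column per local m-mode

# numpy factorizes A_l once internally and back-substitutes all columns;
# a larger Pi_theta reduces n_m_local but leaves the factorization cost fixed.
x = np.linalg.solve(A_l, rhs)
assert np.allclose(A_l @ x, rhs)
```

This is exactly the behaviour visible in Fig. 6: the decomposition time depends on Π_r only, while the substitution benefits mildly from fewer local columns.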
However, the performance gain due to the communication-avoiding distribution of the 2d-MPI implementation far outweighs these shortcomings. We would like to highlight that reducing the communication volume is only possible due to the particular distribution pattern shown in Figure 2, i.e. each θ-rank communicates with Π_r ranks in the radial direction. The communication-avoiding scheme discussed here is not possible for the 1d-hybrid implementation, since one of these directions is missing. This specific feature of the 2d-MPI version is essential to enable a better overall performance for large core counts, as we shall discuss in the next subsection.

Overall performance and strong scaling comparison
Finally, we discuss the strong scaling of the MagIC time step, i.e. of the main application without initialization, finalization, and diagnostics. As usual, we define the parallel efficiency as

η(n) = (n_ref · t(n_ref)) / (n · t(n)),

where t is the main application time in seconds and n is the number of cores used. We use the test with n_ref = 120 as a reference.
Following Matsui et al. (2016), we consider η > 0.6 a sufficiently good efficiency. Additionally, we define the "parallel cross-efficiency" η*, which measures the timings of one implementation against the reference run of the other.

In Fig. 7 it is interesting to notice that the performance of the 1d-hybrid implementation is dominated by the ℓ-loop performance, whereas the performance of the 2d-MPI implementation is dominated by the performance of the radial loop. To some degree this is expected, because the radial loop of the 2d-MPI implementation requires the θ-transposition in addition to the SHT.
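The efficiency metric defined above can be written as a one-line helper; the reference values match the setup in the text (n_ref = 120), while the timings passed in are placeholders.

```python
def parallel_efficiency(t_n, n, t_ref, n_ref=120):
    """eta(n) = (n_ref * t(n_ref)) / (n * t(n)).
    eta = 1 means perfect strong scaling; eta > 0.6 is considered
    sufficiently good here, following Matsui et al. (2016)."""
    return (n_ref * t_ref) / (n * t_n)

# Perfect strong scaling: doubling the cores halves the runtime.
print(parallel_efficiency(t_n=50.0, n=240, t_ref=100.0))   # 1.0
# No speed-up at all when doubling the cores gives eta = 0.5.
print(parallel_efficiency(t_n=100.0, n=240, t_ref=100.0))  # 0.5
```

The cross-efficiency η* is the same expression with t(n_ref) taken from the other implementation's reference run.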
The radial loop of the 1d-hybrid code scales well up to 6,000 cores, but the inferior scalability of the ℓ-loop already leads to a noticeable degradation beyond 1,200 cores in Fig. 7. This is the cause of the decrease in parallel efficiency apparent in Table 2 for 2,000 and more cores.
The 2d-MPI implementation starts suffering from a small loss of scalability of the ℓ-loop at 9,000 cores but remains at an acceptable level up to 24,000 cores. The radial loop scales remarkably well up to 12,000 cores but shows degraded performance at 18,000 and 24,000 cores. This can be explained by the problem size being too small for such numbers of cores, reflected by only one radial level and roughly 22 (ℓ, m) tuples per rank already at 9,000 cores. Both the ℓ-loop and the radial loop contribute to the loss of overall performance at the highest core counts apparent in Fig. 7. The parallel cross-efficiency (see Table 2 and Fig. 7) shows that the new 2d-MPI version is about 20% slower than the 1d-hybrid implementation for core counts of n < 2,000. However, the 2d-MPI variant scales better for n ≥ 2,000, with η = 0.7 up to n = 12,000, whereas the 1d-hybrid implementation reaches η = 0.63 for 6,000 cores. The 2d-MPI version is as fast as the 1d-hybrid implementation at n = 3,000 and already 10% faster at n = 6,000.
The benefits offered by the 2d-MPI implementation can be illustrated with a simple example. Instead of performing a simulation with the 1d-hybrid implementation on 6,000 cores, a scientist could opt for using the 2d-MPI implementation on 12,000 cores and obtain the solution in only 56% of the runtime. In other words, by investing only 12.5% more CPU-hours, the 2d-MPI implementation arrives at the same solution in half the time.
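The CPU-hours arithmetic of this example is worth writing out. Note the small rounding effect: a runtime ratio of exactly 0.56 at twice the cores gives 12% extra CPU-hours; the 12.5% quoted above corresponds to an unrounded runtime ratio near 0.5625.

```python
# Worked check of the example: 1d-hybrid on 6,000 cores vs. 2d-MPI on
# 12,000 cores at 56% of the runtime.
cores_hybrid, cores_2d = 6_000, 12_000
runtime_ratio = 0.56   # 2d-MPI runtime / 1d-hybrid runtime, as quoted

cpu_hours_ratio = (cores_2d / cores_hybrid) * runtime_ratio
print(f"extra CPU-hours: {cpu_hours_ratio - 1:.1%}")
```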

The parallel cross-efficiency also gives an idea of how "far" the parallel efficiency of the 2d-MPI implementation is from the 1d-hybrid one. There is a significant gap for 120 to 1,200 cores, but the gap closes at about 2,000 cores, e.g. η = 0.86 for the 1d-hybrid and η* = 0.82. This trend continues until 6,000 cores, when the parallel cross-efficiency overtakes the parallel efficiency of the 1d-hybrid implementation.

The parallel cross-efficiency in Table 2 shows that the performance of both implementations is rather close for core counts larger than 2,000. This rightfully raises the question of whether this small gap could be bridged. Furthermore, it has already been established in previous sections that the main bottleneck of the 2d-MPI implementation is the radial loop, but the θ-transposition and the computation time (i.e. mostly the SHT) of the radial loop have not yet been thoroughly compared. The goal of this subsection is to provide this comparison, to discuss which portions of the code could be optimized, and to estimate what impact this could have on the 2d-MPI implementation.
For the 1d-hybrid code, let t_t and t_r denote the time of the full main application and of the radial loop only, respectively, both given in seconds. Analogously, for the 2d-MPI code, let T_t denote the time of the full main application and T_r + T_θ the time of the radial loop only, where T_r is the computation time and T_θ the θ-transposition time (including communication and necessary buffer copies, if any). Both t_t and T_t are shown in Table 2. Table 3 shows t_r, T_r, and T_θ for the same experiment, from 120 to 6,000 cores. The next natural question is how the computation times of the radial loop of both implementations compare with each other.
For that we use the ratio ρ = T_r / t_r. Ideally, ρ would always be close to one, in which case the computational efficiency of the radial loop would be identical in both implementations. In practice, distributing the data between more MPI ranks means that each rank has less local data, which makes caching, vectorization, and other fine-grained optimization techniques less efficient. In other words, even a finely optimized implementation of the 2d-MPI data distribution algorithm is likely to have ρ > 1.
The column ρ in Table 3 shows that this ratio is far from one in most cases and decreases with the number of cores. This means that bridging the large performance gap at lower core counts might be unfeasible, but at larger core counts it is attainable.

We now attempt to determine how much improvement in the 2d-MPI implementation is required for its main application runtime (including all transposition times) to match the runtime of the 1d-hybrid code. In practice, several portions of the 2d-MPI code could benefit from caching strategies, fine-tuned vectorization, computation-communication overlapping, amongst others. We discuss here a much simpler scenario in which only the computation time of the radial loop is improved, from T_r to κT_r. Subtracting T_r from the total time T_t and adding the "new" runtime κT_r, we require

T_t − T_r + κT_r = t_t,   i.e.   κ = (t_t − T_t) / T_r + 1,
which gives the needed improvement factor κ; the values are shown in Table 3. For low core counts, improvements in T_r alone cannot bridge the performance gap between the 2d-MPI and the 1d-hybrid implementations: from 120 to 1,000 cores we have t_r > κT_r, meaning that the 2d-MPI implementation would have to compute the radial loop faster than the already highly optimized 1d-hybrid code for both codes to perform the same.
The last three entries in Table 3 show a much more realistic improvement factor. An improvement of 10% (resp. 7%) in T_r would suffice to bridge the gap in main application time for 2,000 cores (resp. 3,000 cores). For 6,000 cores, Table 3 shows that T_r could even be 31% slower and both codes would still be equivalent. This is due to the poor scalability of the transposition in the 1d-hybrid code, which gives the 2d-MPI implementation some "room" for performance loss in other portions of the code.
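The two comparison metrics used above can be summarized in a few lines; the numbers in the usage example are hypothetical, for illustration only, and are not taken from Table 3.

```python
def rho(T_r, t_r):
    """Ratio of the 2d-MPI radial-loop computation time to the 1d-hybrid
    one; 1.0 would mean identical radial-loop efficiency."""
    return T_r / t_r

def kappa(t_t, T_t, T_r):
    """Improvement factor on T_r needed so that the 2d-MPI total time
    matches the 1d-hybrid total, from T_t - T_r + kappa*T_r = t_t."""
    return (t_t - T_t) / T_r + 1.0

# Hypothetical illustration: if the 2d-MPI code is 3 s slower in total and
# its radial loop takes 10 s, a ~30% faster radial loop (kappa ≈ 0.7)
# would close the gap.
print(kappa(t_t=20.0, T_t=23.0, T_r=10.0))
```

Values κ < 1 mean the radial loop must be sped up; κ > 1, as for the 6,000-core case above, means the 2d-MPI radial loop could even afford to be slower.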

Conclusions and Future Work
We described a new parallelization scheme based on a two-dimensional, MPI-based data decomposition for MagIC, an open-source code for three-dimensional fluid dynamics simulations in spherical geometry with high scientific impact in a broad range of fields, from fundamental fluid dynamics and the modeling of planetary dynamos to stellar dynamics.

MagIC uses spherical surface harmonics of degree ℓ and order m for the spectral representation in longitude and latitude, and Chebyshev polynomials in radius. Our newly implemented 2d-MPI scheme is compared to the previously established 1d-hybrid code version that uses MPI and OpenMP and has been highly optimized over the years.
Thanks to a number of new concepts, the 2d-MPI version presented here can already compete with the hybrid implementation in terms of runtime and, in addition, offers the possibility of using a significantly larger number of cores. It opens the possibility of employing tens of thousands of CPU cores on modern HPC clusters and paves the way to using next-generation CPU architectures.
Decisive for its success is a communication-avoiding data distribution of the ℓ- and m-modes for the computation of the actual time step in MagIC, together with the optimization of the associated communication strategies. Compared to the existing one-dimensional data decomposition, the new two-dimensional data distribution scheme requires an additional, costly transposition in the azimuthal direction. This leads to a performance penalty at low core counts which might be hard to overcome. However, for the dynamo benchmark and the setups considered here, and starting at moderate core counts of two thousand, the performance of the new version is mostly limited by the not yet fully optimized main radial loop, which can be as much as 60% slower than its counterpart in the one-dimensional data distribution.
Our results showed that a mere 10% performance gain in the computation time of the radial loop of the 2d-MPI implementation would bridge the gap between both code versions for two thousand cores or more, in addition to providing an extended strong-scalability regime. Other sets of equations solved by MagIC, such as the anelastic equations (Jones et al., 2011), may present a different cost associated with the radial loop and would require a separate investigation.
Our implementation paves the way towards a future, unified hybrid variant of MagIC which combines a two-dimensional data distribution with the proven benefits of an additional OpenMP layer in order to further improve the computational performance across an even wider range of simulation scenarios. This future "2d-hybrid" version could retain the benefits of our communication-avoiding data distribution and its improved strong scaling behaviour, while still benefiting from the performance of the finely optimized multithreaded libraries used within the radial loop. We expect that the performance of the radial