Development and technical paper | 07 Dec 2021
MagIC v5.10: a two-dimensional message-passing interface (MPI) distribution for pseudo-spectral magnetohydrodynamics simulations in spherical geometry
Rafael Lago et al.
- Final revised paper (published on 07 Dec 2021)
- Preprint (discussion started on 23 Aug 2021)
Interactive discussion
Status: closed
RC1: 'Comment on gmd-2021-216', Anonymous Referee #1, 21 Sep 2021
The present paper is an investigation of the parallel performance of numerical dynamo simulations in a rotating spherical shell on massively parallel computers using the open-source program MagIC. MagIC has been parallelized with MPI in one direction, with MPI communication used to switch between the radial domain decomposition and the decomposition over spherical harmonic modes, and additionally with OpenMP for SMP parallelization. In the present paper, the authors implement a two-dimensional MPI parallelization to improve the scalability and compare its parallel performance with the original 1D-MPI with OpenMP version. As described in the introduction, 2-D parallelization is not a new technique, and the present parallelization is almost the same as the parallelization for Calypso and Rayleigh in Matsui et al. (2016). However, there is no paper describing the technical details of the 2-D MPI parallelization for a spherical dynamo code using Calypso or Rayleigh. Consequently, I recommend accepting the present paper with minor revisions. I also have a few questions for the authors.
First, I would like to point out that the present scaling tests are not in the practical range for production runs. Dynamo simulations in a rotating spherical shell are run for a few million time steps, or at least several tens of thousands of steps. The minimum elapsed time is still more than one second in Figure 7, which suggests that the present model would need approximately 12 days for one million steps. I guess that the practical problem size for production runs would be half of the horizontal resolution. I recommend that the authors explain how they chose the spatial resolution and the target elapsed time for production runs.
Another question concerns the fact that the authors perform data communications for each radial layer and each scalar component in the 2D parallelization. Calypso and Rayleigh perform these communications with a single MPI_ISEND/IRECV or MPI_ALLREDUCEV, respectively. Can the authors discuss the advantage of the present communication scheme over Calypso's or Rayleigh's approach?
In addition, SHTns is used in the present study. I wonder whether the authors calculate the Legendre polynomials at the Gauss-Legendre points during initialization, or during each Legendre transform. I remember that SHTns has both features, so it would be helpful to state which approach is chosen and why.
Lastly, the authors define Tt. However, I cannot find any information on Tt in Table 3. Can I find the data in another table? Without it, I could not follow the subsequent discussion using Tt on page 26. Please explain how Tt is obtained.
Finally, some minor suggestions:
- In line 2, can 'magnetohydrodynamics' be one word?
- In line 5, I think "parallelization" would be more explicit than "implementation".
- In line 27, "mag" should be capitalized as "MAG" or "Mag".
- In line 65, the full name should be given for "Non-Uniform Memory Access (NUMA)" at first use.
- In line 117, it would be better to add dimensionless "self" gravity.
- In equations (14) and (15), the diffusion term appears on the left-hand side and the right-hand side, respectively. I would like this term to appear on the same side in both equations.
- In line 202, I would like the equation using Mlδt and Blmt to be shown below equation (17). I looked for the definition of Mlδt and Blmt for a while.
- In lines 329 and 330, I would prefer to say "component" instead of "field", if the scalar "field" includes the poloidal and toroidal components of vectors.
AC1: 'Reply on RC1', Rafael Lago, 11 Oct 2021
We would like to thank the referee for her/his very thorough and constructive review and for the suggestions and corrections proposed. In the following we individually address all points raised by Referee #1:
> First, I would like to point out that the present scaling tests are not
> in the practical range for the productive runs. Dynamo simulations in
> a rotating spherical shell are performed a few million time steps or
> even tens thousand steps. The minimum elapsed time is still more than
> one second in Figure 7, so it suggests that the present model needs
> approximately 12 days for one million steps. I guess that the practical
> problem size would be the half of the horizontal resolution for the
> productive runs. I recommend that the author describe the reason how
> they choose the spatial resolution and target elapsed time for the
> productive runs.
The value of δt (and thus the total number of time steps) depends not only on the grid resolution, but also on the nature of the physical problem and its control parameters (Ekman, Prandtl and Rayleigh numbers). As an example, convection in thin spherical shells requires a much larger spatial resolution but converges in a significantly smaller number of iterations than, say, geodynamo models geared to model reversals of the Earth's magnetic field. As such, the total number of iterations is not a really meaningful measure. However, as explained in lines 436 and 467 of the revised text, the times are measured for 100 time steps. In the largest run (24,000 cores), the main application time was 6.4 seconds, thus one million time steps would require about 18 hours on average. For the largest "recommended" run (12,000 cores, where the parallel efficiency remains within acceptable levels), one million time steps are estimated to require about 20 hours on average.
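For the 24,000-core run, the 18-hour figure follows directly from the measured 6.4 s per 100 time steps:
$$
\frac{6.4\,\mathrm{s}}{100\ \text{steps}} \times 10^{6}\ \text{steps} \;=\; 6.4\times10^{4}\,\mathrm{s} \;\approx\; 17.8\,\mathrm{h} \;\approx\; 18\,\mathrm{h}.
$$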
Finally, the development of the 2d-MPI version is justified mostly for large grids. The chosen resolution is comparable to current state-of-the-art geodynamo models (e.g. Schaeffer et al. 2017). For smaller grids, the 1d-hybrid version of the code should deliver the optimal performance.
> Another question is that the authors perform data communications for
> each radial layer and each scalar component in the 2D parallelization.
> Calypso and Rayleigh perform these communications with single
> MPI_ISEND/IRECV or MPI_ALLREDUCEV, respectively. Can authors discuss
> the advantage of the present communications from Calypso or Rayleigh's
> approach?
As noted, a single communication involving all fields and all radii could be performed. In the early stages of the project we performed tests using such a strategy, which has the following consequences:
(1) it allows much larger message sizes;
(2) fewer synchronization points are needed;
(3) the memory requirement grows with the number of radial points (since the fields and buffers for all radial points must be stored simultaneously for a single communication).
Our Figure 3 shows that "packing" more messages past 1,750 KiB (queue of length 9) does not provide any visible performance benefit on the hardware we used in our tests, voiding (1). Concerning (2), some of the early experiments showed an equivalent or slightly superior performance of the queue algorithm when many radial points were used. Moreover, for simulations involving many radial points, (3) is exacerbated.
We agree that combining all messages in a single communication could be useful in several scenarios, and we may revisit this part of MagIC in future releases, enabling the user to choose between strategies.
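To make the two strategies concrete, the following is a deliberately simplified mpi4py sketch; it is not MagIC's communication code (MagIC itself is written in Fortran), and the field count, chunk size and data layout are arbitrary assumptions chosen for readability. Strategy A corresponds to a queue of non-blocking per-field exchanges, while Strategy B packs all fields into one collective, whose buffers grow with the number of fields kept in flight.

```python
# Illustrative sketch only (not MagIC's implementation): a per-field "queue"
# of non-blocking exchanges versus one packed collective for all fields.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_fields = 9     # assumed queue length (cf. the queue of length 9 in Fig. 3)
chunk = 1024     # assumed number of doubles exchanged with each rank, per field

# Strategy A: queue of non-blocking exchanges, one message per field and peer.
send = [np.random.rand(size * chunk) for _ in range(n_fields)]
recv = [np.empty(size * chunk) for _ in range(n_fields)]
requests = []
for f in range(n_fields):
    for p in range(size):
        sl = slice(p * chunk, (p + 1) * chunk)
        requests.append(comm.Isend(send[f][sl], dest=p, tag=f))
        requests.append(comm.Irecv(recv[f][sl], source=p, tag=f))
    # fields already queued may overlap communication with local work here
MPI.Request.Waitall(requests)

# Strategy B: pack all fields into one buffer and use a single collective.
# The packed buffer must be ordered by destination rank, and its size grows
# with the number of fields held simultaneously (point (3) above).
packed_send = np.random.rand(size * n_fields * chunk)
packed_recv = np.empty_like(packed_send)
counts = [n_fields * chunk] * size
displs = [p * n_fields * chunk for p in range(size)]
comm.Alltoallv([packed_send, counts, displs, MPI.DOUBLE],
               [packed_recv, counts, displs, MPI.DOUBLE])
```

Since the queue length is user-controllable (see Section 3.3 of the manuscript), the first pattern can aggregate messages to a tunable degree without ever requiring all fields to be buffered at once.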
> And, SHTns is used in the present study. I wonder if the authors calculate
> the Legendre polynomials at Gauss-Legendre points in the initialization, or
> calculate during each Legendre transforms. I remember that SHTns has both
> feature, so it would be helpful which approach is chosen and why the authors
> choose one.
The "on-the-fly" strategy is used. This is the recommendation of SHTns’ author for $\ell_{max}>32$ (see (Schaeffer, 2013) for more details).> Lastly, the authors defined Tt. However, I can't find any information for Tt
> Lastly, the authors defined Tt. However, I can't find any information for Tt
> in Table 3. Can I find the data from the other table? So, I lost a direction
> to figure out the following discussion using Tt in page 26. Please provide
> how to figure out Tt.
T_t and t_t are given in Table 2 under the column "time". In the revised text, we added these columns to Table 3 as well.
> And, these are some minor suggestions:
> In line 2, can 'magnetohydrodynamics' be one word?
We modified all occurrences throughout the text to "magnetohydrodynamics".
> In line 5, I think "parallelization" would be more explicit than "implementation".
We changed all occurrences of "hybrid implementation" to "hybrid parallelization" throughout the text, but we kept the term "1d-hybrid implementation", since it refers to a "version" of the code. We hope that this change meets the expectations.
> In line 27, "mag" should be capitalized as "MAG" or "Mag".
This has been corrected as requested.
> In line 65, the full name should be given for "Non-Uniform Memory Access (NUMA)" at first use.
This has been modified, but on line 59, where NUMA appears first.
> In line 117, it would be better to add dimensionless "self" gravity.
This has been modified as requested.
> In equations (14) and (15), the diffusion term appears on the left-hand side and the right-hand side, respectively. I would like this term to appear on the same side in both equations.
This has been modified as requested.
> In line 202, I would like the equation using Mlδt and Blmt to be shown below equation (17). I looked for the definition of Mlδt and Blmt for a while.
We added a sentence in the text to clarify the definition of Mlδt and Blmt.
> In lines 329 and 330, I would prefer to say "component" instead of "field", if the scalar "field" includes the poloidal and toroidal components of vectors.
We fear that using the word "components" may erroneously lead the reader to believe that we are solving for vector components. To avoid this confusion we prefer to refer to them as "scalar fields".
AC4: 'Reply on AC1', Rafael Lago, 20 Oct 2021
It seems that there was a typo in our reply concerning the line numbers of the revised manuscript.
- Instead of "lines 436 and 467" the correct reference in the revised manuscript is "lines 435--436".
We apologize for the confusion. This has been corrected in the "Author's response" PDF as well.
RC2: 'Comment on gmd-2021-216', Anonymous Referee #2, 24 Sep 2021
This paper presents the implementation of a 2D data parallelization strategy for the MagIC pseudo-spectral code. With this strategy, the three-dimensional field data are distributed, using MPI, along two dimensions. As a consequence, two MPI communication stages are required during the spectral transform from physical to spectral space and vice versa. This strategy has been presented and discussed in other publications; in this regard, the manuscript does not provide a novel parallelization approach, but rather a discussion of the technical aspects of its implementation in MagIC. A detailed comparison is made with the hybrid MPI and OpenMP parallelization available in MagIC, which involves a single MPI communication stage. Performance benchmarks are used to illustrate which strategy is best depending on the computational resources. In this light, and as a development and technical paper, I recommend accepting this paper with minor changes. The questions I would like to be addressed are given below.
Considering that a large portion of today's supercomputers include accelerators, possibly making up most of their raw computational power, does the 2D strategy bring any advantages with regard to using accelerators?
The discussion of the transposition in section 3.3 is confusing. The paragraph starting at line 335 discusses the importance of the size of the queue, but the MPI algorithms (l. 325) specify a MPI communication "per scalar field". The discussion in Section 4.1 also seems to imply a single communication call per "queue". Can you clarify this?
The performance benchmarks provide insight into the behaviour of the 2D parallelization implementation but are executed on a single cluster. A more general discussion on what kind of performance to expect on another HPC cluster depending on its technical characteristics would be helpful for the wider community.
The strong scaling experiment with the 1D hybrid strategy increases the number of threads for runs with a higher number of cores. This is counterintuitive, as it leads to an even smaller computational load per thread, which I would expect to affect the scaling negatively. Can you explain this behaviour?
At high resolution, the memory footprint of a pseudo-spectral code can become important. Is there a benefit, at the memory level, to use a 2D distribution in MagIC?
Corrections:
- l. 132: Plm should be the "associated Legendre polynomials" and not the "Legendre polynomials".
- Table 3: Adding a κ Tr column would make it easier to follow the discussion at the end of Section 4.6
AC2: 'Reply on RC2', Rafael Lago, 11 Oct 2021
We would like to thank the referee for her/his very thorough and constructive review, especially concerning the question about other architectures. In the following we individually address all points raised by Referee #2:
> Considering that a large portion of todays supercomputers include
> accelerators, possibly making up most of their raw computational power, does
> the 2D strategy bring any advantages with regards to using accelerators?
This is a very valid point. Unfortunately, porting a large production code like MagIC to GPUs (or, more generally speaking, to a discrete "accelerator") is a significant challenge which is well beyond the scope of this work. Currently, and at least for the next couple of years, the community using MagIC has access to large-scale CPU (x86_64 CPUs from Intel or AMD) resources, e.g. in Germany or France, justifying the development efforts for "modernizing" the CPU-only version of MagIC. That said, we fully agree with the referee that porting to GPUs might eventually become unavoidable in light of the current hardware trends and technological developments in HPC. Realistically, a GPU port of MagIC would start out from the OpenMP parallelization which already exists for the 1D-MPI parallelization (and which is envisaged also for the 2D-MPI parallelization). In that sense, the newly developed 2D-MPI strategy does not bring immediate advantages with regard to using accelerators, but it would add flexibility to a future GPU version of MagIC in the same way as it now does for the CPU-only version. We have added a short paragraph at the end of Section 5, Conclusions and Future Work (lines 601-607 of the revised manuscript), which addresses this point and sketches a conceivable GPU-porting strategy for MagIC.
> The discussion of the transposition in section 3.3 is confusing. The paragraph
> starting at line 335 discusses the importance of the size of the queue, but
> the MPI algorithms (l. 325) specify a MPI communication "per scalar field".
> The discussion in Section 4.1 also seems to imply a single communication call
> per "queue". Can you clarify this?
Indeed, the description in line 325 was wrong. The algorithm performs one communication for all queued fields. The text has been updated accordingly.
> The performance benchmarks provide insight into the behaviour of the 2D
> parallelization implementation but are executed on a single cluster. A more
> general discussion on what kind of performance to expect on another HPC
> cluster depending on its technical characteristics would be helpful for the
> wider community.
Indeed, such a discussion was missing from the main manuscript. We added a paragraph in Section 4.5 of the manuscript (lines 523-529), elaborating on how these performance results should translate to other clusters.
> The strong scaling experiment with the 1D hybrid strategy increases the
> number of threads for runs with higher number of cores. This is counter
> intuitive as it leads to even smaller computational load per thread which
> I would expect to affect the scaling negatively. Can you explain this
> behaviour?
We performed tests with the 1d-hybrid code with 10 and 20 threads per rank, but always with the same total number of threads. For instance, the test with 1,000 cores was executed using either 100 MPI ranks and 10 threads per rank, or 50 MPI ranks and 20 threads per rank. In both scenarios, the workload per thread is the same; what changes is the amount of data being handled by each MPI rank. The transition from 10 threads to 20 threads shows that, at some point, it is better for MPI to handle larger chunks of data, which is expected.
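Purely as a numeric illustration of this point (the grid size used below is an arbitrary assumption, not the resolution used in the paper):

```python
# Hypothetical numbers: at a fixed core count the per-thread workload is
# unchanged, while the data chunk handled by each MPI rank doubles when
# going from 10 to 20 threads per rank.
total_cores = 1000
grid_points = 1024**3          # assumed global grid size, illustration only

for threads_per_rank in (10, 20):
    ranks = total_cores // threads_per_rank
    per_rank = grid_points / ranks            # data handled per MPI rank
    per_thread = grid_points / total_cores    # work per thread (constant)
    print(f"{ranks:3d} ranks x {threads_per_rank:2d} threads: "
          f"{per_rank:.2e} points/rank, {per_thread:.2e} points/thread")
```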
> At high resolution, the memory footprint of a pseudo-spectral code can become
> important. Is there a benefit, at the memory level, to use a 2D distribution
> in MagIC?
This is indeed another benefit of the 2d-MPI implementation. Since it allows the use of more resources, the user may have access to more (distributed) memory as well. We added a comment in the revised version of the text, in Section 4.5, lines 577 and 578. It may be noted that the proposed 2D-MPI distribution requires additional memory for storing the queued fields during the θ-transposition. However, in Section 3.3 we clarify that the queue size (and thus the extra memory requirement) can be controlled by the user. The benefit of having more distributed memory available easily outweighs the disadvantage of having to store a controllable extra number of scalar fields.
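As a rough sketch (the symbols below are generic placeholders, not the paper's notation), the extra buffer memory per rank during the θ-transposition scales as
$$
M_\mathrm{extra} \;\approx\; q \times N_\mathrm{loc} \times 8\ \text{bytes (per double-precision value)},
$$
where $q$ is the user-chosen queue length and $N_\mathrm{loc}$ is the number of grid values a single rank holds for one scalar field. Since $N_\mathrm{loc}$ shrinks as more ranks are used while the aggregate distributed memory grows, this overhead remains controllable.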
> l. 132: Plm should be the "associated Legendre polynomials" and not the
> "Legendre polynomials".
This has been corrected as requested.
> Table 3: Adding a κ Tr column would make it easier to follow the discussion
> at the end of Section 4.6
This has been updated as suggested.
AC3: 'Reply on AC2', Rafael Lago, 20 Oct 2021
It seems that there was a typo in our reply concerning the line numbers of the revised manuscript.
- Instead of "lines 601-607" the correct reference in the revised manuscript is "lines 600-606".
- Instead of "lines 523-529" the correct reference in the revised manuscript is "lines 522-528".
- Instead of "lines 577 and 578" the correct reference in the revised manuscript is "lines 516-517".
We apologize for the confusion. This has been corrected in the "Author's response" PDF as well.