Massive-Parallel Trajectory Calculations version 2.2 (MPTRAC-2.2): Lagrangian transport simulations on Graphics Processing Units (GPUs)
- 1Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany
- 2Institut für Energie- und Klimaforschung (IEK-7), Forschungszentrum Jülich, Jülich, Germany
- 3School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, China
- 4Key Laboratory of Middle Atmosphere and Global Environment Observation, Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing, China
- 5University of Chinese Academy of Sciences, Beijing, China
Abstract. Lagrangian models are fundamental tools for studying atmospheric transport processes and for practical applications such as dispersion modeling for anthropogenic and natural emission sources. However, conducting large-scale Lagrangian transport simulations with millions of air parcels or more can become numerically rather costly. In this study, we assessed the potential of exploiting graphics processing units (GPUs) to accelerate Lagrangian transport simulations. We ported the Massive-Parallel Trajectory Calculations (MPTRAC) model to GPUs using the open accelerator (OpenACC) programming model. The trajectory calculations conducted within the MPTRAC model were fully ported to GPUs, i.e., except for feeding in the meteorological input data and for extracting the particle output data, the code operates entirely on the GPU devices without frequent data transfers between CPU and GPU memory. Model verification, performance analyses, and scaling tests of the MPI/OpenMP/OpenACC hybrid parallelization of MPTRAC were conducted on the JUWELS Booster supercomputer operated by the Jülich Supercomputing Centre, Germany. The JUWELS Booster comprises 3744 NVIDIA A100 Tensor Core GPUs, providing a peak performance of 71.0 PFlop/s. As of June 2021, it is the most powerful supercomputer in Europe and listed among the most energy-efficient systems internationally. For large-scale simulations comprising 10⁸ particles driven by the European Centre for Medium-Range Weather Forecasts' ERA5 reanalysis, the performance evaluation showed a maximum speedup of a factor of 16 due to the utilization of GPUs compared to CPU-only runs on the JUWELS Booster. In the large-scale GPU run, about 67 % of the runtime is spent on the physics calculations conducted on the GPUs. Another 15 % of the runtime is required for file I/O, mostly to read the large ERA5 data set from disk. Meteorological data preprocessing on the CPUs also requires about 15 % of the runtime.
Although this study identified potential for further improvements of the GPU code, we consider the MPTRAC model ready for production runs on the JUWELS Booster in its present form. The GPU code provides a much faster time to solution than the CPU code, which is particularly relevant for near-real-time applications of a Lagrangian transport model.
-
Notice on discussion status
The requested preprint has a corresponding peer-reviewed final revised paper. You are encouraged to refer to the final revised version.
Journal article(s) based on this preprint
Lars Hoffmann et al.
Interactive discussion
Status: closed
-
RC1: 'Comment on gmd-2021-382', Anonymous Referee #1, 05 Feb 2022
The manuscript contains the matter of two articles. The first is a thorough description of the new version of the MPTRAC trajectory code, and the second is a description of the parallelization of this code using GPUs, which is an excellent example of the application of modern programming methods that is more general than MPTRAC. I accept the authors' choice to group these two works, but it would have made sense to make two separate articles to reach a wider audience, at least for the second one.
The manuscript contains very useful material, in particular in the second part, where it demonstrates how a complex simulation code can be moved to a GPU system using the high-level OpenACC programming model with relatively small effort (compared to the full rewriting required by direct use of low-level CUDA libraries). This is an important and inspiring contribution.
I only have a few comments to be accounted for in the revised version.
Compared with the sophistication of the rest of the code, the treatment of the vicinity of the poles appears very crude and inaccurate. This has possibly been of little concern in the applications of MPTRAC so far, but it is a point that should be corrected in the next version.
The convective parameterization is based on the assumption of CAPE relaxation, and an important parameter is the CAPE threshold, which deserves some discussion. The manuscript says that a global value is used, but CAPE accumulates much more over the continents than over the oceans, leading to much more intense storms in continental regions. Therefore, a single threshold value will probably produce excessive mixing over the continents and too little mixing over the oceans. More generally, it is recognized by all experts in convective parameterization that CAPE alone is a bad predictor of convective onset. As the ERA5 archives the upward and downward convective fluxes resulting from its state-of-the-art parameterization of convection, why not use these data instead of a very crude representation of convection? Again, this might be considered in the next version.
The manuscript fails to cite the work "Optimization of atmospheric transport models on HPC platforms" (de la Cruz et al., Computers & Geosciences, 2016, doi: 10.1016/j.cageo.2016.08.019), which addresses very similar issues.
Figure 10, made from screen copies, is not readable, either in print or on screen.
Other minor comments
- 185: pressure is not the best choice of vertical coordinate for Lagrangian transport in the stratosphere, where many models instead use potential temperature and heating rates rather than pressure tendencies.
- 205: I guess the authors meant linear in log pressure.
- 770: The results of the OpenMP parallelization may vary a lot according to the scheduling strategy. This should be mentioned.
- 896: I do not see any fluctuations but a regular increase in Fig. 11.
-
RC2: 'Comment on gmd-2021-382', Anonymous Referee #2, 08 Feb 2022
The manuscript by Hoffmann et al. presents an impressive piece of work. I can only congratulate the authors on the development of MPTRAC and its parallelization on GPUs, which is the main topic of the study. The manuscript is well written and structured, and the methods and results sections are easy to follow.
I thus only have a few minor comments, suggestions, and corrections that the authors should consider before publication:
The introduction is quite MPTRAC-centric. Since the focus is on code parallelization, it would be good to include references on parallelization approaches in other Lagrangian dispersion models, e.g. Brioude et al. (2013) for FLEXPART-WRF, Jones et al. (2007) for NAME, and Pisso et al. (2019) for other versions of FLEXPART (with MPI or OpenMP parallelization and asynchronous I/O in the case of FLEXPART-COSMO). There is actually also a GPU version of (parts of) FLEXPART developed many years ago (https://db.cger.nies.go.jp/metex/flexcpp.html), but unfortunately it was never published in the peer-reviewed literature to my knowledge.
The introduction should also explain more clearly what the main areas of application of MPTRAC are. It seems to be designed primarily to study large-scale atmospheric transport in the free troposphere and stratosphere, but not transport and mixing in the atmospheric boundary layer (ABL). This is important to mention, because Lagrangian models are increasingly being used for inverse emission estimation, for which e.g. a proper representation of turbulent mixing in the ABL is critical.
The manuscript convinced me that MPTRAC is a technically carefully designed, flexible and computationally efficient model. However, I was less convinced that it is also doing a good job in terms of accurately representing atmospheric transport. A key criterion for Lagrangian particle dispersion models, for example, is the well-mixed condition of Thomson (1987): A tracer well-mixed in the atmosphere should not un-mix due to the simulated transport. This is challenging to achieve but is critical for simulating mixing in the ABL or inversely estimating emissions, for example. Simple mixing schemes (e.g. without density correction term) as implemented in the model lead to un-mixing. It would be good to know the magnitude of un-mixing generated by the model in long simulations (un-mixing likely saturates at some point). This could be studied in a simulation similar to those presented in Section 3.3, but where particles with uniform mass are initialized proportional to air density. Particle densities should ideally remain proportional to air density throughout the simulation.
The synthetic tracer simulations presented in Section 3.3. are suitable to study differences between the CPU and GPU versions, but they are not sufficient to demonstrate that transport is generally well represented in the model. A much more challenging diagnostic for stratospheric transport, for example, would be age of air, which is known to be underestimated by many transport models.
I thus strongly encourage the authors to focus on such critical aspects in future studies to provide a thorough scientific benchmark for future applications of the model. This is more a comment than a suggestion for modifying the current publication.
Small points:
Page 7, line 184: What exactly do you mean by "pushed back"? The standard approach in Lagrangian models is that particles are reflected. "Pushing back" likely leads to accumulation of air parcels at the surface or upper boundary of the model.
Page 9, line 250: Shouldn't it be | phi | > phi_max? Same issue on the next line on page 10.
Page 15: Convection is parameterized in an overly simplified way, since e.g. deep convection does not at all lead to uniform vertical mixing. It would be good to mention (and to consider) more advanced approaches such as Forster et al. (2007, https://doi.org/10.1175/JAM2470.1).
Page 18: Also dry deposition is described in a highly simplified way. Dry deposition does not only depend on particle or gas properties but also on the state of the atmosphere (in addition to surface properties). Also here it should be mentioned that more advanced approaches for Lagrangian models exist, e.g. Webster and Thomson (2012, https://doi.org/10.1504/IJEP.2011.047322).
Page 30: Which number of compute cores of the GPU is the most relevant number for MPTRAC? Is it the number of FP32 or FP64 cores? Later it becomes clear that it is the latter. Is double precision really needed? Did you test MPTRAC with single precision?
Figure 7: The differences between GPU and CPU simulations presented in panels b), d), and f) are likely due to statistical noise. This could be shown by performing multiple CPU simulations with different random seeds and evaluating the differences in the same way as the differences between CPU and GPU.
Section 3.7: I didn't quite understand this scaling test. Why does the runtime shown in Fig. 11 not decrease with the number of MPI tasks? What is the difference between a weak and a strong scaling test?
Small corrections and typos:
Page 11, Line 272: Change to "The following choices are made .."
Page 23, Line 500: shouldn't it be "interpreting" rather than "interpolating"?
Page 30, line 678: "MPTRAC was build" -> "MPTRAC was built"
Page 40, line 830: "33% if the overall runtime" -> "33% of the overall runtime"
Page 40, line 857: It should be Figs. 10a and b rather than 9a and b.
-
AC1: 'Comment on gmd-2021-382', Lars Hoffmann, 08 Mar 2022
The comment was uploaded in the form of a supplement: https://gmd.copernicus.org/preprints/gmd-2021-382/gmd-2021-382-AC1-supplement.pdf
Peer review completion