the Creative Commons Attribution 4.0 License.
Exploring a high-level programming model for the NWP domain using ECMWF microphysics schemes
Abstract. We explore the domain-specific Python library GT4Py (GridTools for Python) for implementing a representative physical parametrization scheme and the related tangent-linear and adjoint algorithms from the Integrated Forecasting System (IFS) of ECMWF. GT4Py encodes stencil operators in an abstract and hardware-agnostic fashion, thus enabling more concise, readable and maintainable scientific applications. The library achieves high performance by translating the application into targeted low-level coding implementations. Here, the main goal is to study the correctness and performance-portability of the Python rewrites with GT4Py against the reference Fortran code and a number of automatically and manually ported variants created by ECMWF. The present work is part of a larger cross-institutional effort to port weather and climate models to Python with GT4Py. The focus of the current work is the IFS prognostic cloud microphysics scheme, a core physical parametrization represented by a comprehensive code that takes a significant share of the total forecast model execution time. In order to verify GT4Py for Numerical Weather Prediction (NWP) systems, we put additional emphasis on the implementation and validation of the tangent-linear and adjoint model versions which are employed in data assimilation. We benchmark all prototype codes on three European supercomputers characterized by diverse GPU and CPU hardware, node designs, software stacks and compiler suites. Once the application is ported to Python with GT4Py, we find excellent portability, competitive performance, and robust execution in all tested scenarios including with reduced precision.
Status: closed
RC1: 'Comment on gmd-2024-92', Anonymous Referee #1, 25 Jun 2024
As a developer who is using GT4Py to port parameterized physics, I am encouraged by these performance results as well as the portability across multiple GPU architectures. Overall, I think this is an excellent paper that highlights the potential of DSLs as a forward-looking development platform. I have several questions and comments.
- Line 191 : This line mentions that " can be differentiated for the vertical boundaries using the interval context manager". As a GT4Py user, it's clear what is being written, but given that "differentiated" has mathematical meanings, it may be better to reword this to avoid confusion.
- List 1 and 2 : I realized later that the "Code and data availability" section lists the repositories that contain the codes in List 1 and 2. Originally, I had mistakenly searched the ECMWF-IFS GitHub site for the CLOUDSC and CLOUDSC2 dwarf codes and was wondering why I couldn't find the codes from the list. One suggestion is to mention that the repos for the codes are given later in the "Code and data availability" section.
- Line 296 : Can NPROMA be explained further?
- Line 307 : To clarify, is the symmetry test timing the sum of the CLOUDSC2TL and CLOUDSC2AD timings?
- Line 336 : I'm a bit confused on the virtual GPU explanation. Does this mean that when 1 MPI process is mapped to an MI250X, only half the GPU is executed?
- Question: The GridTools backend was mentioned as a GT4Py backend (and I think it enables GPU compute), but its results were not presented. Was it because it was slower than the DaCe backend?
Citation: https://doi.org/10.5194/gmd-2024-92-RC1
- AC1: 'Reply on Referee Comments', Stefano Ubbiali, 26 Aug 2024
RC2: 'Comment on gmd-2024-92', Anonymous Referee #2, 26 Jun 2024
The comment was uploaded in the form of a supplement: https://gmd.copernicus.org/preprints/gmd-2024-92/gmd-2024-92-RC2-supplement.pdf
- AC1: 'Reply on Referee Comments', Stefano Ubbiali, 26 Aug 2024
RC3: 'Comment on gmd-2024-92', Anonymous Referee #3, 24 Jul 2024
This is an excellent paper expanding the use of domain-specific languages (DSLs), and GT4Py specifically, for performance and productivity in numerical weather prediction. To my knowledge this is the first published work on a tangent-linear or adjoint model in GT4Py, and the results are very encouraging. The authors describe their methodology and development process well, which will aid others looking to reproduce this work and apply it to their own models. That said, I do have some small questions and comments I would like to raise before publication:
Primary points/questions:
- I don’t think it is necessary to define the tangent linear or adjoint operators explicitly, and I’m also not certain that you need to explicitly define the Taylor test either.
- Line 247: I would like to see more description of the infrastructure code around the stencil. What does compile_stencil look like? Presumably the parent DiagnosticComponent class specifies the __call__ method, which wraps array_call, but that would be nice to see explicitly instead of assuming from what is in the paper.
- Line 250: Similarly, the stencil collection decorator is IFS-specific, and I would appreciate more detail about what it does and how.
- Line 265: Why use a GT4Py backend for CPU but a DaCe backend for GPU?
- Line 345: Is the goal of the GT4Py or ECMWF teams to achieve the same performance as native Fortran and CUDA models, or is it to attain most of their performance alongside the benefits of portability and productivity?
- Figures 3-5: I’m not convinced by the layout of these figures. Because there are fewer implementations of CLOUDSC2 (and none in 32-bit aside from GT4Py) it may be more natural to report these performance results in a table, or to remove the space for the missing data, especially panels e and f which look disconcertingly sparse. On the other hand this is a very striking way to draw attention to the fact that GT4Py gives you 64- and 32-bit versions of the model in one go, but if you want to emphasize that I would like to see it more explicitly highlighted in the text.
Minor:
- In your introduction is it worthwhile to discuss efforts to use tools like Numba or Cython to accelerate numerical models written in Python across various fields of science, such as Augier et al. (doi:10.1038/s41550-021-01342-y) or others?
- Line 18: the authors describe Fortran’s “functional programming style” which is slightly imprecise; while Fortran uses functions and subroutines, functional programming refers to a style of programming using only pure functions, so no values are updated in-place, which is not how Fortran operates.
- Line 177: It would be useful to acknowledge contributions from groups beyond the Allen Institute, since they have ceased their work on GT4Py.
- Line 190: “GTScript abstracts spatial for-loops away” would be more accurate than stating it abstracts for-loops entirely
- Line 237: “Not only it builds upon Sympl, but is also extends it” should be “Not only does it build upon Sympl, but also extends it”
- Figure 6: Because the relevant information is contained within the top ~10% of the plot it may be useful to change the y-axis to instead range from 0.8 to 1.0
- Listing 1: Should foealfcu be “foealpha”?
Other comments:
- Line 323: The fact that the GT4Py implementations of the tangent-linear and adjoint formulations of CLOUDSC2 are the first to enable GPU execution at any precision is very cool and could be emphasized more heavily throughout the paper, in my opinion.
- Line 361: It might be worth mentioning that the Python overhead would still account for around 1% of CPU runtime even if the GT4Py CPU performance was on par with Fortran
Citation: https://doi.org/10.5194/gmd-2024-92-RC3
- AC1: 'Reply on Referee Comments', Stefano Ubbiali, 26 Aug 2024
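For readers unfamiliar with the Taylor test and the symmetry test that the referee comments above refer to, the following is a minimal sketch in plain Python. It does not reproduce the CLOUDSC2 code or any GT4Py API: the toy function `forward`, its hand-derived `tangent_linear` and `adjoint`, and all input values are hypothetical and chosen only for illustration. The Taylor test checks that the ratio between the nonlinear increment and the tangent-linear increment approaches 1 as the perturbation size shrinks; the symmetry test checks that the tangent-linear and adjoint operators satisfy the inner-product identity <F'dx, y> = <dx, F'^T y>.

```python
import math

# Hypothetical toy model F(x) = [x0 * x1, sin(x0)]; not the CLOUDSC2 scheme.
def forward(x):
    return [x[0] * x[1], math.sin(x[0])]

# Tangent-linear model: Jacobian of F at x applied to a perturbation dx.
def tangent_linear(x, dx):
    return [x[1] * dx[0] + x[0] * dx[1], math.cos(x[0]) * dx[0]]

# Adjoint model: transpose of the Jacobian of F at x applied to y.
def adjoint(x, y):
    return [x[1] * y[0] + math.cos(x[0]) * y[1], x[0] * y[0]]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def taylor_ratio(x, dx, h):
    """Return ||F(x + h dx) - F(x)|| / ||h F'(x) dx||, which tends to 1 as h -> 0."""
    perturbed = forward([xi + h * dxi for xi, dxi in zip(x, dx)])
    base = forward(x)
    nonlinear_increment = [p - b for p, b in zip(perturbed, base)]
    linear_increment = [h * t for t in tangent_linear(x, dx)]
    return norm(nonlinear_increment) / norm(linear_increment)

def symmetry_gap(x, dx, y):
    """Return |<F'(x) dx, y> - <dx, F'(x)^T y>|, which is zero up to round-off."""
    return abs(dot(tangent_linear(x, dx), y) - dot(dx, adjoint(x, y)))

# Arbitrary illustrative inputs.
x = [0.7, -1.3]
dx = [0.2, 0.5]
y = [1.1, -0.4]

# Taylor test over decreasing perturbation sizes h = 1e-1 ... 1e-5.
ratios = [taylor_ratio(x, dx, 10.0 ** -k) for k in range(1, 6)]
gap = symmetry_gap(x, dx, y)

for k, r in zip(range(1, 6), ratios):
    print(f"h = 1e-{k}: ratio = {r:.10f}")
print(f"symmetry gap = {gap:.3e}")
```

In this sketch the two checks evaluate the tangent-linear and adjoint operators separately, so a timing of the symmetry test would naturally include one call to each; whether the paper's reported timing is defined this way is the question raised by Referee #1.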
Viewed
- HTML: 483
- PDF: 143
- XML: 30
- Total: 656
- BibTeX: 21
- EndNote: 19