the Creative Commons Attribution 4.0 License.
The Real Challenges for Climate and Weather Modelling on its Way to Sustained Exascale Performance: A Case Study using ICON (v2.6.6)
Abstract. The weather and climate model ICON (ICOsahedral Nonhydrostatic) is being used in high resolution climate simulations, in order to resolve small-scale physical processes. The envisaged performance for this task is 1 simulated year per day for a coupled atmosphere-ocean setup at global 1.2 km resolution. The necessary computing power for such simulations can only be found on exascale supercomputing systems. The main question we try to answer in this article is where to find sustained exascale performance, i. e. which hardware (processor type) is best suited for the weather and climate model ICON and consequently how this performance can be exploited by the model, i. e. what changes are required in ICON’s software design so as to utilize exascale platforms efficiently. To this end, we present an overview of the available hardware technologies and a quantitative analysis of the key performance indicators of the ICON model on several architectures. It becomes clear that domain decomposition-based parallelization has reached the scaling limits, leading us to conclude that the performance of a single node is crucial to achieve both better performance and better energy efficiency. Furthermore, based on the computational intensity of the examined kernels of the model it is shown that architectures with higher memory throughput are better suited than those with high computational peak performance. From a software engineering perspective, a redesign of ICON from a monolithic to a modular approach is required to address the complexity caused by hardware heterogeneity and new programming models to make ICON suitable for running on such machines.
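The abstract's point about computational intensity can be read through the Roofline model, in which attainable performance is bounded by the smaller of the compute peak and the product of computational intensity and memory bandwidth. The following is a minimal sketch under that assumption; the two processors and the kernel intensity are hypothetical figures chosen for illustration, not measurements from the paper.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Roofline bound: min(compute peak, intensity * memory bandwidth)."""
    return min(peak_gflops, intensity_flop_per_byte * bandwidth_gbs)


# Hypothetical processors (illustrative numbers only, not from the paper).
high_peak = {"peak_gflops": 10000.0, "bandwidth_gbs": 400.0}
high_bandwidth = {"peak_gflops": 3000.0, "bandwidth_gbs": 1500.0}

# Assumed low-intensity kernel: 0.25 floating point operations per byte moved.
intensity = 0.25

bound_high_peak = attainable_gflops(
    intensity_flop_per_byte=intensity, **high_peak
)
bound_high_bandwidth = attainable_gflops(
    intensity_flop_per_byte=intensity, **high_bandwidth
)

print("high-peak processor bound:     ", bound_high_peak, "GFLOP/s")
print("high-bandwidth processor bound:", bound_high_bandwidth, "GFLOP/s")
```

With these assumed numbers the kernel is memory bound on both processors, so the one with higher bandwidth has the higher bound despite its lower compute peak.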
Status: closed
RC1: 'Comment on gmd-2024-54', Anonymous Referee #1, 28 May 2024
Adamidis et al. present benchmarking results of the ICON model on different hardware: CPU, GPU, and Vector. This is an interesting comparison, especially given the direction of travel for HPC hardware and the use of accelerators.
It is mostly well written, and I recommend its publication in GMD given some minor revisions, mainly around the figures, which is where the majority of my issues with the manuscript are.
Figure 1 is a nice-looking figure, but I don't believe it adds anything to the discussion and is in fact misleading as it isn't really saying anything of substance at all. It doesn't reflect the point made in the text: "the appropriate programming model becomes a difficult task as not all of them support all types of accelerators" (lines 67-68), but the diagram just shows a wall with each brick representing a different programming model and arrows going indiscriminately to different hardware. It would instead be useful to show which programming models work on which hardware, either through a different diagram or a table.
Figure 2 likewise looks very nice but doesn't really contain any substance about how the ICON code is currently structured and how ICON-C will change the code structure and apply the different programming models. I would rather the authors show code snippets or pseudocode to illustrate how the modules could be ported to the new architectures using the different programming models. How would this be done, e.g. manually or using a code writer? Other models, such as the LFRic weather and climate model (https://www.metoffice.gov.uk/research/approach/modelling-systems/lfric), are using tools such as PSyclone (https://psyclone.readthedocs.io) to try to achieve performance portability of the Fortran source code on different architectures.
Figure 3 is quite confusing. The ideal strong scaling curves are very difficult to see as they are a light grey colour (the same as the gridlines behind). The point type used (left- and right-facing triangles) are difficult to differentiate and the fact that the colours are slightly different for the global and nested domains, but not different enough, makes it hard to unpick this. The different line styles are also not discussed. The authors could consider making the plot much larger, using very different point styles, mentioning the different line styles, and perhaps separating out the global and nested results into different sub-figures.
Figure 4 is very interesting given the results in Figure 3 and the core numbers in Table 2. I would suggest highlighting the number of cores in the discussion around GPUs for Figure 3. It might also be helpful to highlight on Figure 4 the number of cores for each hardware type configuration, so it can be seen when the hardware becomes underutilised.
For Figure 5, I'm a bit unsure whether the timing for B1 as given in the example extends to the end of the K41 kernel, i.e. beyond the end of B1.
The plots in Figure 6 are very hard to see as the points are large and the lines are close together and overlapping. The points are also all clustered around a small section of the graph, but the scale is much larger in X and especially Y, mainly to include the legend. I would suggest plotting each point on its own graph, making a large multi-figure plot, and zooming in as much as possible onto (0.1:100, 1:50,000) ranges to allow the relevant areas to be seen as clearly as possible. I would likewise do the same for Figure 7 by combining Figures 6 and 7 in a single multi-panel plot.
Figure 8 is also really interesting. Do you know the energy usage from the CPU runs?
Minor comments:
Line 17: I don't believe “System” should be capitalised here.
Line 26: The term GPU is used without being defined (the definition only occurs on line 40). Also, the discussion here is around x86 hardware being combined with GPUs, Vector, or ARM systems. Note that superchip hardware, such as NVIDIA's Grace-Hopper (https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-superchip), is an ARM-GPU system, i.e. there is no x86 CPU; the CPU itself is ARM-based.
Line 345-6: I think this sentence needs to be rephrased: "Not only the ICON community is currently on the way to using current and upcoming Exascale systems for high-resolution simulations." Do you mean something like "ICON is not the only model that is currently on the way to using current and upcoming Exascale systems for high-resolution simulations"?
Line 356-7: Perhaps I'm missing something, but wouldn't halving the horizontal resolution result in a 4-fold increase in necessary resources, rather than 8-fold? I'm assuming that the level structure remains unchanged.
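One possible reading of the 8-fold figure, offered here as an assumption rather than as the authors' stated reasoning: halving the horizontal grid spacing quadruples the number of columns, and if the time step must also be halved to keep the Courant number fixed, the number of time steps doubles. A minimal sketch with the vertical level structure held constant (the function name and cost model are illustrative):

```python
def cost_factor(refinement, refine_timestep=True):
    """Relative cost after dividing the horizontal grid spacing by `refinement`.

    Assumes cost ~ (number of columns) * (number of time steps), with the
    vertical level structure unchanged.
    """
    columns = refinement ** 2  # two horizontal dimensions
    steps = refinement if refine_timestep else 1  # CFL-limited time step
    return columns * steps


print(cost_factor(2))                         # columns and time step refined
print(cost_factor(2, refine_timestep=False))  # columns only
```

Under this simple model the 4-fold figure corresponds to counting grid columns only, and the 8-fold figure to also shortening the time step.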
Citation: https://doi.org/10.5194/gmd-2024-54-RC1
AC1: 'Reply on RC1', Panagiotis Adamidis, 21 Jun 2024
RC2: 'Comment on gmd-2024-54', Anonymous Referee #2, 13 Aug 2024
This paper presents some strong scaling results and empirical Roofline models for the ICON weather code on NEC Vector Engines, NVIDIA A100 GPUs, and AMD CPUs.
The only novelty of the work lies, I think, in the Roofline analysis, although I am surprised this has not been done as part of the optimisation and porting studies, which are explicitly not claimed as contributions.
There are some points of clarification around the methodology here, as detailed below.
The paper also attempts to justify the need to modularise the code, but this effort is already underway in the WarmWorld project, and it is unclear how the evidence in terms of performance results presented in this paper directly motivates this, given that the model development and optimisation on the different architectures preceded this study, in particular by the reference Giorgetta et al., 2022.
Further, framing the performance comparisons of ICON to HPCG and HPL in Table 6 and related text is also not new, and is replicated from the study previously mentioned.
Most of the other comments below refer to the characterisation, comparisons, and discussion of processors, programming models and the high performance computing landscape at large, which was often rather clumsy.
Overall, the novel contribution here is small in my view, but I defer to the editor to judge whether this is sufficient for the aims of the journal.
## Abstract and Section 1
The start of the paper presents the work as being rather time limited. For example, including the version of the software in the title implies, to me, that the work is only relevant to this specific version, and that it may not be relevant for future versions, which I do not believe is the intended message.
Line 9 states that domain-decomposition parallelism reaches scaling limits; however, given the rest of the paper, this should be revised to more clearly state the distributed memory decomposition of the spatial domain.
On line 9 and the following, it is unclear why the statement that domain decomposition reaches scaling limits implies the need for single node performance.
## Section 2
In general, there are many unsubstantiated and poorly argued comments around programming models and processor technologies throughout Section 2.1.
On line 41, the authors state that GPUs have dominated the Top500 list since 2015. The authors should clarify this to refer to share of compute performance, or to the highest ranked systems, as even in the most recent Top500 list there are more individual systems without GPUs than with, and they therefore do not dominate in terms of number of systems. In short, be specific in how GPUs are dominating the list.
On line 42, it would be good to include examples of current vector processors in use other than the NEC SX-Aurora. It has probably been 20+ years since the last true vector machines (e.g., Cray). If the authors are implying use of SIMD vectorization in CPUs, then this should be made more specific.
On lines 48-49, the authors justify that the NEC Aurora cards are relevant because they offer high memory bandwidth. But GPUs, and now CPUs (including the A64FX previously mentioned), also offer this memory technology. Indeed, in Table 1, high bandwidth is listed for both GPUs and the NEC cards (twice in fact for NEC!). There is some debate to be had around the "Specialities" in this table, for example OpenMP (with or without target) can run well on CPUs and GPUs as well as the NEC, along with several other parallel programming models, so the benefit the authors imply over GPUs for programming languages is spurious today.
On line 46 the authors do not mention that GPUs have been used effectively for scientific computing. The way the argument is presented in this section dismisses this area of work, which again is not an authentic argument. Please rephrase this to be specific and accurate about the pros and cons of the processor categories outlined.
On line 65, the authors claim that vendor-specific models always give the best performance, citing an NVIDIA technical marketing document. Importantly, this document does not in fact claim that the vendor-specific model is better than other programming models, and indeed does not mention any programming model at all. As a reviewer, I also do not agree with the claim in general, as there have been several studies exploring programming models which show that the same performance is attainable in many programming models on a given hardware platform.
Figure 1 is used to show that not all models support all target architectures, but in my view, the figure implies that all models support all architectures. Please consider re-drawing this figure to illustrate the point. I'd encourage due diligence though, as several of the models do support all the architectures listed.
Line 70 is also unsubstantiated, to the paper's own detriment this time. I do not believe the claim that higher levels of abstraction imply higher performance portability, and the authors do not justify this claim. In reality, there are a number of studies that show OpenMP gives the highest level of performance portability. The authors should cite these here to better argue that the directive-based approach is a valid choice when striving for performance portability.
On line 87, the authors give a 2022 reference to justify that OpenACC was the only choice for Fortran GPU acceleration. However, OpenMP target was available in 2013, almost a decade before the 2022 study. Please provide a contemporaneous reference for this statement, or revise the statement.
On line 108, the sentence ordering humorously implies that ICON requires the code be bloated and difficult to adapt! I suggest the authors switch the order of the final two sentences in this paragraph.
## Section 3
On line 161, and previously in the abstract, the authors mention that the MPI parallelism is based on domain decomposition. It is likely obvious, but worth stating given that ICON has parallelism in other dimensions, that the domain here is longitude and latitude, and not some other dimension.
The colouring of the AMD and NEC VE10AE lines in Figure 3 makes them almost indistinguishable. I'd encourage the authors to explore using easily differentiable colours that are also appropriate for readers with colour vision deficiency.
Table 2 shows that one A100 has comparable memory bandwidth to one NEC card. However, the results in Figure 3 show that the GPU performance is below that of the NEC. The text around line 190 explains that both are starved for work in the strong scaling regime, but this question is about the results where they are not, e.g., 8 GPUs. Later, in Section 4, the authors show that the performance per GPU is very comparable in terms of the memory bandwidth bound code HPCG, and the later Roofline plots of ICON show this code also falls in this regime. Please explain the performance discrepancy. As I understand this presented work, there are no contributions to the model in terms of development, and so instead presenting a sufficient summary of the existing work on GPU and NEC optimisation becomes necessary to explain the performance the authors have measured and the Roofline analysis.
On line 204, the authors state that the strong/weak scaling limits are overcome by improving single node performance, but I don't think this is true, as reducing the time of the computation will not resolve the scaling issues, thanks to Amdahl's Law. I think the sentiment the authors want is that, to improve the overall performance of the code at all scales, one needs to focus on single node performance.
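The reference to Amdahl's Law can be made concrete: for a serial fraction s, the speedup on n processors is 1 / (s + (1 - s) / n), which can never exceed 1 / s. The sketch below uses a hypothetical serial fraction of 5%, not a value measured for ICON.

```python
def amdahl_speedup(serial_fraction, n):
    """Speedup on n processors for a given non-parallelisable fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


s = 0.05  # hypothetical serial fraction, for illustration only

for n in (8, 64, 512, 4096):
    print(f"n = {n:5d}  speedup = {amdahl_speedup(s, n):6.2f}")

print(f"upper bound 1/s = {1.0 / s:.1f}")
```

In this model the speedup saturates near 1/s however many processors are added, which is the scaling limit the comment refers to.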
## Section 4
On line 255, please justify why ifort 2021 was used over a current release of this compiler. Additionally, the authors use AMD CPUs, so they should also consider use of the AMD, NVHPC, or GNU compilers, and use performance measurements to justify their choice of the Intel compilers.
On line 290, the authors discuss how the NEC vector units have to execute both paths of the IF/ELSE branch, and so inflate the measure of floating point operations. The authors should present a discussion here on how the measurements of the operations for the same problem using the three different methods (LIKWID, NSight, and ftrace) are reconciled. Figure 6 shows they are similar in some cases, but not others, e.g., `nwp_radiation` on AMD EPYC does a lot more operations than the other platforms. The later analysis shows that MPI communication is sometimes factored into the kernels within each timed portion. How do the authors deal with this on the NVIDIA GPU system, where communication is initiated by the host CPU (as explained previously in the paper around GPU-aware MPI)?
## Section 5
This section is a high level discussion on computing architectures again, repeating many of the sentiments from Section 2. The way the argument is presented does not justify the need for the modularisation of the code, although obviously I'm supportive of modularising the code as described in Figure 2.
The energy comparison in Figure 8 is interesting here. It would be useful to add the system power draws on Table 2.
The comments around line 370 are not clear. It is also not clear how remarkable the curve is given the scaling results presented previously, where the compute resource (i.e., energy use) is doubled but yields less than double the performance.
## Section 6
The authors again state that the monolithic design is a factor in the performance portability. However, the Roofline models show that the codes do roughly equally well on the architectures they are executed on, so I'm not convinced the evidence presented leads to this conclusion specifically.
Citation: https://doi.org/10.5194/gmd-2024-54-RC2
AC2: 'Reply on RC2', Panagiotis Adamidis, 21 Aug 2024