the Creative Commons Attribution 4.0 License.
GPU-HADVPPM4HIP V1.0: higher model accuracy on China's domestically GPU-like accelerator using heterogeneous compute interface for portability (HIP) technology to accelerate the piecewise parabolic method (PPM) in an air quality model (CAMx V6.10)
Abstract. Graphics processing units (GPUs) are becoming a compelling acceleration strategy for geoscience numerical models owing to their powerful computing performance. In this study, AMD's heterogeneous compute interface for portability (HIP) was used to port the GPU-accelerated Piecewise Parabolic Method (PPM) solver (GPU-HADVPPM) from NVIDIA GPUs to China's domestic GPU-like accelerators as GPU-HADVPPM4HIP, and a multi-level hybrid parallelism scheme was further introduced to improve the overall computational performance of the HIP version of the CAMx model (CAMx-HIP) on a Chinese domestic heterogeneous cluster. The experimental results show that the acceleration achieved by GPU-HADVPPM on the different GPU accelerators becomes more pronounced as the computing scale grows, with a maximum speedup of 28.9x on the domestic GPU-like accelerator. Hybrid parallelism combining the message passing interface (MPI) with HIP achieves up to a 17.2x speedup when 32 CPU cores and GPU-like accelerators are configured on the domestic heterogeneous cluster, and OpenMP is introduced to further reduce the computation time of the CAMx-HIP model by a factor of 1.9. More importantly, a comparison of GPU-HADVPPM simulation results on NVIDIA GPUs and on domestic GPU-like accelerators shows that the domestic GPU-like accelerators produce smaller differences than the NVIDIA GPUs, which may be related to the NVIDIA GPU sacrificing some accuracy for improved computing performance. Overall, the domestic GPU-like accelerators are more accurate for scientific computing in geoscience numerical modelling.
Furthermore, we show that the data transfer efficiency between CPU and GPU has an important impact on heterogeneous computing, and we point out that optimizing this CPU-GPU data transfer is one of the key directions for improving the computing efficiency of geoscience numerical models on heterogeneous clusters in the future.
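For readers unfamiliar with the advection scheme being accelerated, the following is a minimal, unlimited 1-D PPM advection step in Python (a sketch of the algorithm only; it is not the CAMx HADVPPM code, which includes monotonicity limiters and operates on full 3-D chemical fields):

```python
import numpy as np

def ppm_advect_step(a, sigma):
    """One 1-D PPM advection step for constant velocity u > 0 on a
    periodic grid; sigma = u*dt/dx is the CFL number (0 < sigma <= 1)."""
    # Fourth-order interface values a_{i+1/2}
    a_half = (7.0/12.0)*(a + np.roll(a, -1)) \
             - (1.0/12.0)*(np.roll(a, 1) + np.roll(a, -2))
    aL = np.roll(a_half, 1)   # left interface of cell i,  a_{i-1/2}
    aR = a_half               # right interface of cell i, a_{i+1/2}
    da = aR - aL
    a6 = 6.0*(a - 0.5*(aL + aR))  # parabola curvature coefficient
    # Upwind flux through the right face: average of the parabola over
    # the rightmost fraction sigma of each cell (u > 0)
    F = aR - 0.5*sigma*(da - a6*(1.0 - 2.0*sigma/3.0))
    # Conservative update
    return a - sigma*(F - np.roll(F, 1))

# Advect a smooth profile once around a periodic domain; the scheme is
# conservative, and the error stays small for smooth data.
n = 64
a = 1.0 + np.sin(2.0*np.pi*np.arange(n)/n)
for _ in range(128):          # 128 steps at sigma = 0.5 -> one period
    a = ppm_advect_step(a, 0.5)
```

The per-cell independence of the reconstruction and flux computation is what makes this scheme amenable to GPU offloading in the first place.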
Status: final response (author comments only)
RC1: 'Comment on gmd-2023-222', Anonymous Referee #1, 30 Jan 2024
General Comments
The manuscript gmd-2023-222 by Kai Cao et al. describes the porting of the Piecewise Parabolic Method numerical solver as part of the CAMx air quality model on heterogeneous computing architectures, and in particular on the new Chinese domestic accelerators, in comparison to NVIDIA GPUs. The resulting speedup and scaling behaviour are investigated. The study is relevant for the audience of the GMD journal and is timely in the context of the advent of GPU technologies in geoscientific model development.
The use of English is poor and has to be improved throughout the manuscript. The text has to be revised and edited for syntax and grammar before it is reconsidered for publication.
Specific Comments:
The title is too long and may be confusing to the reader. In particular, it is not clear what "higher" model accuracy is with respect to. Please consider revising to "GPU-HADVPPM4HIP V1.0: using the heterogeneous interface for portability (HIP) to speed up the piecewise parabolic method in the CAMx (v6.10) air quality model on China's domestic GPU-like accelerator"
Lines 53-60: the details of specific supercomputers are not needed and may be superfluous for the reader. I propose removing these lines.
L143-154: Is there a reference to the architecture of the Chinese heterogeneous cluster? In what ways do generations A and B differ? How do they compare to the NVIDIA clusters? These details are necessary for the reader to understand the potential and achieved performance comparison.
Table 1 is unnecessary and need not be reproduced here. It's enough to refer to the HIP API.
Figure 1: It is not clear whether the transcoding happened directly from CUDA C to HIP C ("hipify" in the diagram) or whether the standard C code was changed (HIP arrow). The section does not document whether any changes in the memory size/number of threads/blocks/kernels offloaded were necessary to support the GPU-like accelerator.
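For context on the "hipify" path, hipify-perl performs largely textual CUDA-to-HIP renames of runtime API calls. A toy sketch of that substitution idea in Python (an illustrative subset only, not the actual tool, which covers the full CUDA runtime and driver APIs):

```python
import re

# A small subset of the CUDA-to-HIP renames applied by hipify-perl.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
}

def toy_hipify(source: str) -> str:
    """Replace whole identifiers only, so that e.g. 'cudaMemcpy' does
    not partially rewrite 'cudaMemcpyHostToDevice'."""
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        source = re.sub(r"\b" + cuda_name + r"\b", hip_name, source)
    return source

src = "cudaMalloc(&d_a, n); cudaMemcpy(d_a, a, n, cudaMemcpyHostToDevice);"
print(toy_hipify(src))
# prints: hipMalloc(&d_a, n); hipMemcpy(d_a, a, n, hipMemcpyHostToDevice);
```

Because the mapping is almost one-to-one at the API level, the more interesting porting questions are exactly the ones raised above: launch configuration, memory sizing, and any kernel restructuring needed for the target accelerator.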
In Fig 2, the colour scale for the last two columns is generally too coarse and unsuitable to show any differences, other than a few points. Please consider revising. Are the concentrations shown for a specific time (for instance at the end of the run, when any errors have presumably accumulated)? Please give more detail.
L. 297-290: What does it mean that the NVIDIA GPU sacrifices part of the accuracy for improved computing performance? Is this for floating point representation and/or arithmetic? Is there no user option (e.g. optimisation levels at compile time) to control the accuracy? Please elaborate.
Sec. 4.3.1: Given the large time spent transferring memory to and from the accelerator, wouldn't a more relevant comparison for any real-world application be the total time required to complete the air quality simulation rather than the compute time of the numerical kernel? Was there any consideration of how to limit the required memory size/bandwidth?
Sec. 4.3.2: Are these results for the total model, or just the accelerated portion? How many timesteps is 1 hour of simulation? It would be generally interesting to see the average speedup per timestep. It is hard to judge the overall impact of the accelerator, given that the number of CPU cores/processes is also increasing. It would be good to conduct and add a scaling test purely on the CPU cores (i.e. without acceleration) to isolate the speedup due to acceleration. Finally, to what is the speedup saturation (e.g. for the BJ case) attributed for large core/card counts?
Minor Comments:
L70: Kinesthetic PreProcessor: Accelerated (KPPA) -> It should read "Kinetic PreProcessor"
L. 115: CUDA is not only supported on Tesla, but on all NVIDIA GPU architectures. Please rephrase.
L155-160: Most of the technical information is repeated in Table 2 and can be omitted from the text.
Please move all links to websites to the references section
Citation: https://doi.org/10.5194/gmd-2023-222-RC1
- AC1: 'Reply on RC1', Qizhong Wu, 23 Feb 2024
RC2: 'Comment on gmd-2023-222', Anonymous Referee #2, 05 Apr 2024
The paper deals with a comparison of non-accelerated and accelerated air quality simulations, using Intel CPUs, Nvidia GPUs, and unspecified Chinese hardware. The topic is well suited and is overall interesting and relevant. Nonetheless, there are in my opinion major issues with the manuscript which limit its broader interest. I summarise my overall impression and critique here, and also attach an annotated pdf.
A first major issue in the paper is that the Chinese hardware remains unknown to the reader throughout the text. Very little information about this hardware is disclosed, which makes it very difficult to assess and understand potentially relevant differences. The reader might infer or guess some of this (e.g., that CUDA cannot be used to program the Chinese accelerators), but these properties of the hardware are not mentioned. The reason for the lack of information is also not disclosed. In comparison, one can of course find full technical specifications for the Intel and Nvidia hardware used in the study. It is not explained why these Chinese accelerators are only GPU-like and not quite GPUs. Very little information is provided on the Chinese CPUs either. This creates, in my opinion, considerable opacity.
Moreover, the authors make bold claims about the comparative accuracy of Nvidia GPUs relative to that obtained with the Chinese accelerators (larger errors for Nvidia hardware, which the authors attribute to Nvidia favouring performance over accuracy!). To sustain such claims, not only is more evidence required, but also significant information on the hardware and the software stack on which the application is built.

My second point is that the manuscript is a bit confusing in terms of which processes are mapped to which hardware. This may simply be a matter of clarification, in particular for the in-node heterogeneous processes ("other modules", the authors state, are solved using OpenMP on the CPUs, whereas the advection module, when solved on CPUs, runs on a single core?). I would suggest a sketch to explain this.
In terms of the results, the text needs to be much more specific about how the runs are performed and how the speedup is computed. One can infer that the speedup is computed against runs on the Chinese CPU, but it is not fully clear whether that baseline is a parallel run on all cores or a serial job.
The authors identify host-device transfers as the key bottleneck, mainly by computing the share of time spent on kernels and on data transfers. They seem to conclude that this can only be improved with better bandwidth between host and device. However, nothing is said about potential implementation issues (e.g., poor handling of memory allocations) or about bottlenecks that could be alleviated with better-suited algorithms. Indeed, the fraction of time the solver actually spends running GPU kernels is very low (below 24%), which is very inefficient.

Going back to the accuracy aspect, the authors do not really provide an in-depth discussion of why there are differences between the different computations. The magnitude of the errors is well above the arithmetic accuracy of the hardware involved, and the claim that Nvidia hardware favours performance over accuracy is not well supported by the exercise carried out in this manuscript; it would require much better defined benchmarks. Overall, the differences between the simulations could be due to a wide variety of reasons, anywhere from the arithmetic accuracy of the hardware and the hardware-specific arithmetic kernels all the way up to errors in the different implementations. The investigation is simply not deep enough to support the claims, nor to provide robust evidence of the root causes of the differences. There is no discussion of why, for example, the different species show different variability in the errors, although this suggests that, whatever the root cause of the error, it propagates differently across different processes/state variables.
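The transfer-bottleneck point can be made quantitative with a back-of-the-envelope Amdahl's-law estimate. A minimal sketch, using the sub-24% kernel share mentioned above (the function name and framing are illustrative, not from the manuscript):

```python
def device_side_speedup(kernel_frac, kernel_speedup):
    """Effective speedup of the offloaded section when only
    `kernel_frac` of device-side time is spent in kernels and the
    remainder (host-device transfers) is not sped up at all."""
    return 1.0 / ((1.0 - kernel_frac) + kernel_frac / kernel_speedup)

# With kernels at 24% of device-side time, even infinitely faster
# kernels cap the further gain at roughly 1/(1 - 0.24):
print(device_side_speedup(0.24, 1e9))  # -> ~1.32
```

In other words, once transfers dominate, further kernel optimisation yields diminishing returns, which is why reducing or overlapping the transfers themselves (or restructuring the algorithm to move less data) matters more than faster kernels.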
- AC2: 'Reply on RC2', Qizhong Wu, 22 Apr 2024