Porting the WAVEWATCH III (v6.07) Wave Action Source Terms to GPU
- 1Earth and Atmospheric Sciences, Georgia Institute of Technology, Atlanta, GA, USA
- 2Program in Ocean Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- 3Fluid Dynamics and Solid Mechanics (T-3), Los Alamos National Laboratory, Los Alamos, NM, USA
- 4Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL, USA
Abstract. Surface gravity waves play a critical role in several processes, including mixing, coastal inundation, and surface fluxes. Despite the growing literature on the importance of ocean surface waves, wind-wave processes have traditionally been excluded from Earth system models (ESMs) due to the high computational costs of running spectral wave models. The Next Generation Ocean Model Development effort in the Department of Energy's (DOE) Energy Exascale Earth System Model (E3SM) project partly focuses on the inclusion of a wave model, WAVEWATCH III (WW3), in E3SM. WW3, which was originally developed for operational wave forecasting, needs to become computationally less expensive before it can be integrated into ESMs. To accomplish this, we take advantage of heterogeneous architectures at DOE leadership computing facilities and the increasing computing power of general-purpose graphics processing units (GPUs). This paper identifies the wave action source terms as the most computationally intensive module in WW3 and then accelerates them on GPUs. Using one GPU, our experiments on two computing platforms, Kodiak (P100 GPU and Intel Xeon E5-2695 v4 CPU) and Summit (V100 GPU and IBM POWER9), show speedups of up to 2.4x and 6.6x, respectively, over one MPI task on a CPU. Using different combinations of multiple CPUs and GPUs, we obtained average speedups of 2x on Kodiak and 4x on Summit. We also discuss how the trade-off between occupancy, register usage, and latency affects the GPU performance of WW3.
Olawale James Ikuyajolu et al.
Status: final response (author comments only)
RC1: 'Comment on gmd-2022-141', Anonymous Referee #1, 10 Jul 2022
This manuscript presents results from a modified WAVEWATCH III code, which off-loads the spectral source terms to GPUs. The authors investigate the scaling on two different platforms and validate the results against a CPU-only run.
This is a very timely, important and well-written manuscript. I can recommend that it is accepted for publication after some minor changes. I also want to mention that the way the problem is broken down and presented step by step made the manuscript easy to follow. Please find my comments and questions below:
lines 24-25: "despite the growing literature on their in the simulation of weather and climate."
Seems to be missing a word
line 71: I'm not sure what this is a reference to, but usually Komen et al. 1994 is used as a WAM reference?
line 86-88: These sentences are slightly confusing, since first we are talking about modules that calculate source terms (right-hand side of Eq. 1), and then we talk about discretizing (left-hand side).
line 94: Can the relative computational intensiveness change if we are using different propagation schemes? Can extremely small time steps alter/tip this balance? (Probably not, I guess.)
line 188-189: I'm a bit surprised that Sin takes half of the source term computational time, since I would have expected the non-linear interactions to be the heaviest. The cumulative breaking term in ST4 is supposed to be quite resource consuming, while possibly not having a large effect on the end results. Is that turned on or off here? If it's turned on, then that would perhaps be an obvious candidate to try to speed up the model (not directly related to GPU porting).
line 196-198: Is this a hard requirement, or does the lack of communication just mean that the source terms are trivially parallelizable? The word "suitable" suggests that any communication here would make GPU porting a non-option, but I'm not sure that is the case (although it probably becomes a lot more complex).
line 241-266: very long paragraph. Even though the paper is generally very well written, this aspect could be checked.
line 275: It seems like the order the figures are presented might be wrong, since Fig. 8 has already been referenced?
line 295: Would it be possible to increase occupancy by reorganizing the loops? Now we loop over all grid points, and then loop over one spectrum, but perhaps it would be more efficient to define an array that has both spatial and spectral dimensions (and perhaps slice that up into some blocks, if needed)? Can you comment on this?
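To make the suggestion concrete, a minimal sketch of such a collapsed loop in Fortran/OpenACC follows; the names NSEAL, NSPEC, spec, and snet are placeholders for illustration, not the actual WW3 arrays or code.

```fortran
! Minimal sketch, not the WW3 code: collapsing the spatial (NSEAL) and
! spectral (NSPEC) loops exposes NSEAL*NSPEC independent iterations to the
! GPU, which may raise occupancy compared with parallelizing over grid
! points alone.
subroutine apply_source_collapsed(nseal, nspec, dt, snet, spec)
  implicit none
  integer, intent(in)    :: nseal, nspec
  real,    intent(in)    :: dt
  real,    intent(in)    :: snet(nspec, nseal)  ! net source term per bin
  real,    intent(inout) :: spec(nspec, nseal)  ! wave action spectrum
  integer :: jsea, isp
  !$acc parallel loop collapse(2) copyin(snet) copy(spec)
  do jsea = 1, nseal
     do isp = 1, nspec
        spec(isp, jsea) = spec(isp, jsea) + dt * snet(isp, jsea)
     end do
  end do
end subroutine apply_source_collapsed
```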
The paper is missing a discussion section. Although it might not strictly be needed in this kind of more technical paper, it would perhaps be interesting for the reader to know what kind of impact these speed-ups might have in practical terms. Several days of wall time were mentioned, but is this a "game changer" that allows for including wave models in ESMs, or do we still need to optimize? I'm also wondering how well the exact non-linear solution might scale (if the authors can comment), since this might have consequences for very basic research into, e.g., wave growth that might be affected by the crude approximations of DIA. Finally, would it ever be viable to port any other parts of the wave model, such as the propagation, to GPUs, or is the communication needed between the grid points a complete deal breaker?
- AC1: 'Reply on RC1', Olawale Ikuyajolu, 01 Dec 2022
RC2: 'Comment on gmd-2022-141', Anonymous Referee #2, 12 Aug 2022
General comments
• The paper is well formulated in that the problem of interest is well described, the approach to GPU porting is well described, and some key aspects of performance are explained well in detail. I am requesting major revisions to the manuscript due to the importance of using appropriate baselines for GPU performance reporting. If the GPU performance is much slower than most CPUs, as seems to be the case, it is a very important data point for the reader to understand clearly.
• Comparing GPU runtime against a single CPU core is inappropriate because in production simulations, the entire CPU would have been used. I request that the authors change this comparison to use all available CPU cores in the CPU baselines when reporting speed-up numbers.
• On the same note, I request that the authors include a "roofline" plot of their ported kernel. This can be obtained directly using Nvidia's ncu-ui tool (or several other tools if desired). Absent this, the authors need to at least provide the floating point operations per second (flop/s) achieved by the kernel as well as the maximum expected flop/s for the observed DRAM-oriented arithmetic intensity in their kernel. This is a more objective performance metric because the baseline is fixed for a given kernel and GPU hardware choice. The documents below should help in doing this:
https://www.nersc.gov/assets/Uploads/Talk-NERSC-NVIDIA-FaceToFace-2019.pdf
https://arxiv.org/pdf/2009.02449.pdf
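For reference, the roofline bound referred to here is the standard one: attainable performance is capped by either the peak compute throughput or the DRAM bandwidth times the kernel's arithmetic intensity.

```latex
% Roofline model: attainable flop/s for a kernel with arithmetic intensity I
% (flop per byte moved to/from DRAM) on hardware with peak compute P_peak and
% DRAM bandwidth B_DRAM.
P_{\mathrm{attainable}} = \min\bigl(P_{\mathrm{peak}},\; I \cdot B_{\mathrm{DRAM}}\bigr),
\qquad
I = \frac{\text{floating point operations}}{\text{bytes moved to/from DRAM}}
```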
Specific comments
• Line 52: The phrase "totally different" is not an accurate statement. The "breadth-first" SIMT dispatch of code over threads on GPUs differs from the more "depth-first" execution of code on CPUs. GPUs need a larger degree of parallelism exposed at one time than CPUs. Beyond this, much of the programming and optimization approach does remain the same. The order of execution as dispatched on the hardware should still match the order of memory accesses in arrays. Floating point and integer divisions and floating point transcendental operators should be minimized. Data movement to and from DRAM should be minimized.
• Line 127: It might be more accurate to say that OpenACC contains the most mature implementation using the Nvidia compiler suite on Nvidia GPUs.
• Line 129: These lists do not correspond to one another in a respective sense, and I believe, as written, this will lead to confusion for the reader. An unspoken yet commonly used mapping from OpenACC levels of parallelism to CUDA levels of parallelism is as follows: gang == blockIdx.x (i.e., "grid"-level parallelism); worker == threadIdx.y (i.e., "block"-level parallelism); and vector == threadIdx.x (i.e., finer "block"-level parallelism). Perhaps a better, more general statement would be something like "Gang, worker, and vector parallelism expose increasingly fine granularities of parallelism to distribute work over grid, block, warp, and thread-level parallelism on Nvidia GPUs." That should be true in all cases for the OpenACC spec itself and Nvidia hardware.
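As an illustration (not the WW3 code), the three OpenACC levels typically appear in a Fortran loop nest as follows; on Nvidia hardware the compiler commonly maps gangs to thread blocks and vector lanes to threads within a block.

```fortran
! Illustrative sketch only: gang parallelism over the outer loop and vector
! parallelism over the inner loop. A middle '!$acc loop worker' level could
! be applied to a third nested loop if one were present.
subroutine saxpy_levels(n, m, a, x, y)
  implicit none
  integer, intent(in)    :: n, m
  real,    intent(in)    :: a, x(n, m)
  real,    intent(inout) :: y(n, m)
  integer :: i, j
  !$acc parallel loop gang copyin(x) copy(y)
  do j = 1, m
     !$acc loop vector
     do i = 1, n
        y(i, j) = y(i, j) + a * x(i, j)
     end do
  end do
end subroutine saxpy_levels
```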
• Line 130: Gangs must operate independently without synchronization.
• Line 132: SIMD is not an accurate description of the parallel dispatch strategy on Nvidia GPUs. Please describe it as “SIMT” (single instruction, multiple thread).
• Line 135: Please specify that declare create is needed specifically due to using module-level variables directly in device code instead of passing them by parameter.
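A minimal sketch of the pattern in question (the module and variable names below are hypothetical, not taken from WW3): declare create gives a module-level variable a device copy, so that device code referencing it directly, rather than through a dummy argument, can resolve it.

```fortran
! Hypothetical module, for illustration only. After the host initializes
! growth_fac, an '!$acc update device(growth_fac)' copies the values into
! the device copy created by 'declare create'.
module phys_constants
  implicit none
  real :: growth_fac(25)
  !$acc declare create(growth_fac)
end module phys_constants
```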
• Line 138: Please add that the expectation is that “W3SRCEMD” will then further dispatch parallel threads in the worker and / or vector levels.
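A rough sketch of this calling pattern (routine names and "physics" are placeholders; W3SRCEMD itself is far larger): a routine marked with an OpenACC routine directive is called from a gang-level loop and opens vector-level parallelism internally.

```fortran
! Illustration only, not the WW3 routines.
module source_demo
  implicit none
contains
  subroutine point_source(nspec, spec, snet)
    !$acc routine vector
    integer, intent(in)  :: nspec
    real,    intent(in)  :: spec(nspec)
    real,    intent(out) :: snet(nspec)
    integer :: isp
    !$acc loop vector
    do isp = 1, nspec
       snet(isp) = 0.1 * spec(isp)   ! placeholder physics
    end do
  end subroutine point_source

  subroutine all_points(nseal, nspec, spec, snet)
    integer, intent(in)  :: nseal, nspec
    real,    intent(in)  :: spec(nspec, nseal)
    real,    intent(out) :: snet(nspec, nseal)
    integer :: jsea
    ! Gang-level loop over sea points; each gang calls the vector routine.
    !$acc parallel loop gang copyin(spec) copyout(snet)
    do jsea = 1, nseal
       call point_source(nspec, spec(:, jsea), snet(:, jsea))
    end do
  end subroutine all_points
end module source_demo
```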
• Line 167: Can you give the reasoning for reducing the optimization on Summit? If bugs were encountered, this can be useful information for the reader to understand that these issues can crop up sometimes.
• Line 207: Occupancy is largely affected by register and shared memory usage. Using large local, thread-private arrays (i.e., on the stack) can affect register usage, but I believe it is incorrect to say that occupancy is affected by the size of the module-level arrays created outside the kernel. Can the authors explain in more detail what is meant here?
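For reference, the register-driven occupancy limit (ignoring allocation granularity and shared-memory limits) is roughly:

```latex
% Occupancy is the ratio of active to maximum resident warps per SM; the
% number of active warps is bounded by the SM register file divided by the
% registers consumed per warp (32 threads per warp).
\mathrm{occupancy} = \frac{W_{\mathrm{active}}}{W_{\max}},
\qquad
W_{\mathrm{active}} \le
\min\!\left(W_{\max},\;
\left\lfloor \frac{R_{\mathrm{SM}}}{32\, R_{\mathrm{thread}}} \right\rfloor\right)
```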
• Line 222: I may be misunderstanding this, but it’s not clear why the impact of data transfers could not be assessed. Nvidia’s nvvp and nsight tools should be able to show the cost of all transfers.
• Line 242: Using only 32 threads per gang seems like it would lead to problems hiding memory fetching latency from DRAM via thread switching within an SM. Have you tried increasing this to 64 or 128, or is 32 something required by the algorithm itself?
• Line 252: I think it’s important to note at this point that there is another option that has, so far, not been discussed. The NSEAL loop could be pushed down the callstack into the innermost routines. While this would be a significant refactoring effort, it would allow developers to fission the one large kernel into multiple smaller kernels. This will increase the number of kernel launches (potentially increasing an important kernel launch latency cost), but each individual kernel would no longer suffer register spillage, which is likely the number one performance problem in the approach used in this paper. I believe this approach should at least be mentioned in the manuscript so that the reader understands there are multiple potential approaches.
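A rough sketch of that alternative, with placeholder names and arithmetic rather than the actual WW3 source terms: each physics term becomes its own small kernel over all points and spectral bins, trading extra kernel launches for a much smaller per-kernel register footprint.

```fortran
! Illustration only: fissioning one large per-point kernel into several
! small kernels that each loop over NSEAL*NSPEC.
subroutine source_terms_fissioned(nseal, nspec, spec, snet)
  implicit none
  integer, intent(in)  :: nseal, nspec
  real,    intent(in)  :: spec(nspec, nseal)
  real,    intent(out) :: snet(nspec, nseal)
  integer :: jsea, isp

  ! Kernel 1: "wind input" (placeholder arithmetic)
  !$acc parallel loop collapse(2) copyin(spec) copyout(snet)
  do jsea = 1, nseal
     do isp = 1, nspec
        snet(isp, jsea) = 0.05 * spec(isp, jsea)
     end do
  end do

  ! Kernel 2: "dissipation" (placeholder arithmetic)
  !$acc parallel loop collapse(2) copyin(spec) copy(snet)
  do jsea = 1, nseal
     do isp = 1, nspec
        snet(isp, jsea) = snet(isp, jsea) - 0.02 * spec(isp, jsea)
     end do
  end do
end subroutine source_terms_fissioned
```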
• Line 257: There is not only a maximum number of registers per thread, but I believe there is a minimum number supported by the hardware as well, which may explain this behavior.
• Figure 8: The plot seems confusing because a blue line is represented for "time" in the legend, which makes it seem like the red and green portions are potentially no longer representing "time". Perhaps add two entries to the legend so that the labels are "Occupancy", "Time (latency regime)", "Time (occupancy regime)", and "Time (neutral regime)".
• Line 274: I do not consider this to be an appropriate performance comparison. A GPU should not be compared to a single CPU core. It should be compared to an entire CPU with reasonable optimization efforts performed on both the CPU and GPU. For instance, if the GPU code is 6-7x faster than a single P9 core, then when using the P9 as it would typically be used (21-42 cores), the V100 performance is actually 3-6x slower than a single P9.
• Line 300: I greatly appreciate considerations of correctness in this paper. This is often overlooked in GPU refactoring manuscripts.
• Line 307: “exponentially” is likely an inaccurate term. Perhaps use “significantly” or a similar word instead.
• Line 318: Please mention the baseline here. Line 320 seems to indicate that the comparison is against an entire CPU whereas the manuscript seems to indicate that it is against only one core of the CPU.
• Line 323: The authors should mention briefly here the other potential approach described in the comment for line 252 above.
• Line 324: Are the authors certain the register usage is due largely to constants? What is the evidence for this claim?
- AC2: 'Reply on RC2', Olawale Ikuyajolu, 01 Dec 2022
RC3: 'Comment on gmd-2022-141', Anonymous Referee #3, 13 Sep 2022
General Comments:
The authors present a GPU acceleration of the source term part within the parallel context of the WW3 framework. The work is innovative and important. However, there are some flaws in the theoretical approach and in the tests performed to evaluate the new parallelization option.
In Eq. (1), the wave action equation, some terms are local (e.g. the source terms and spectral advection) and one is global, namely the geographic advection. The geographic advection needs some parallel exchange under either the CD or the DD approach.
Now, when expanding this asymptotically using Amdahl's law for the given problem, it can easily be seen that ultimately, for an infinite number of computational cores, the only cost that remains is the parallel exchange, since all other workload tends to zero.
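For reference, with parallel fraction p and N cores, Amdahl's law gives:

```latex
% Amdahl's law: as N grows, the speedup saturates at 1/(1-p), i.e. the
% non-parallelizable part (here, the parallel exchange) dominates the cost.
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```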
Introducing the GPU communication now, this likewise remains an overhead and adds to the global exchange of the advection part itself.
Now the most important question is how the scheme scales for various configurations. Since the GPU layer was introduced, the scalability analysis becomes two-dimensional in terms of the number of GPUs and CPUs. This question remains open in this paper, even though the authors have sufficient access to the needed computational resources. In the sense of the above, the quantification of the computational cost of the source terms is also tied to the given test-case configuration investigated in this paper.
In conclusion, I think that the work is interesting and the implementation is an important topic for GMD, but the work lacks an in-depth scalability analysis and therefore no final statement can be made in terms of efficiency, especially in the context that only 8 cores have been used out of a possible NSPEC cores within the CD approach.
In conclusion, I think that much more work must be done to evaluate the performance of this approach before the given conclusions can be drawn and the general contribution of the work can be assessed, which is currently not the case.
Specific Comments:
- 94: "is the most computationally intensive part of WW3" Can this be quantified? Moreover, I cannot agree with this, since it is rather linked to the given configuration and the schemes used. For implicit schemes and various other configurations with high resolution in geographical space this need not be true. I think that too many general statements have been derived from the given configuration and test case. I could imagine that with high spatial resolution and a lot of computational cores, which have unfortunately not been used here, the communication itself would take more time than the computation of the source terms, as explained in the general comments above.
The authors are using CD for their parallelization strategy, but I do not see why one could not use domain decomposition in combination with GPUs. How would the scheme perform with the DD approach? Why so few computational cores? Why so much development and so little evaluation of the performance?
The authors have limited their simulations to just 8 processors; however, I do not see why it could not be used on, say, 1000 processors if each of those processors has access to a GPU. This cannot be repeated often enough. It remains the major concern of this work. The Kodiak and Summit supercomputers have several thousand processors. Why the limit of 8?
The interaction between explicit and implicit computations is not considered. I would think that the code can be used for explicit and implicit and the scalability should be evaluated for both.
Further code that could be moved to the GPU is the frequency shifting and refraction in explicit mode; has this been considered as well?
Technical issues:
- There is a typo in Equation (1), the “+” is missing
- The literature referencing in the paper is not done properly. Here are two examples:
- 19: Either refer to the paper with "e.g." or put the original reference, which is not Hasselmann 1991. It was Gelci et al. 1957 if I remember right.
- 146: Again, Chawla et al. was referenced, but those authors did not derive the scaling of 1.1. The original publication, Hasselmann & Hasselmann, 1981, should be cited, or "e.g." can be added. However, I do not find the latter very useful since we want to honor and cite the original work. This should be cleaned up throughout the paper.
Missing reference:
- 104: Brus et al., 2021 reference is missing.
- AC3: 'Reply on RC3', Olawale Ikuyajolu, 01 Dec 2022