Graphics processing unit accelerated ice flow solver for unstructured meshes using the Shallow Shelf Approximation (FastIceFlo v1.0)
Anjali Sandip
Ludovic Räss
Mathieu Morlighem
Abstract. Ice-sheet flow models capable of accurately projecting the future mass balance of ice sheets are tools to improve flood risk assessment and assist mitigation of sea-level rise associated with enhanced ice discharge. Some processes that need to be captured, such as grounding line migration, require high spatial resolution (1 km or better). Conventional ice flow models may need significant computational resources because they mainly execute on Central Processing Units (CPUs), which lack massive parallelism and feature limited peak memory bandwidth. At the other end of the spectrum, Graphics Processing Units (GPUs) are ideally suited for high spatial resolution, as the calculations at every grid point can be performed concurrently by thousands of threads or parallel workers. In this study, we combine GPUs with the pseudo-transient (PT) method, an accelerated iterative and matrix-free solution approach, and investigate its performance for finite elements and unstructured meshes applied to two-dimensional (2-D) models of real glaciers at a regional scale. For both the Jakobshavn and Pine Island glacier models, the number of nonlinear PT iterations to converge for a given number of vertices N scales as O(N^1.2) or better. We compared the performance of the PT CUDA C implementation with a standard finite-element CPU-based implementation using price-to-performance and power-consumption-to-performance metrics. A single Tesla V100 GPU costs about 1.5 times as much as the two Intel Xeon Gold 6140 CPUs, and the power consumption of the PT CUDA C implementation was approximately one-seventh of that of the standard CPU implementation for the test cases chosen in this study. Given the price difference, a speed-up greater than 1.5 is needed to justify the Tesla V100 GPU on a price-to-performance basis; we report speed-ups exceeding 1.5 on a single Tesla V100 across the glacier configurations and degrees of freedom (DoFs) tested. This study is a first step toward leveraging GPU processing power for accurate polar ice discharge predictions. The insights gained will benefit efforts to reduce spatial resolution constraints at increased computing speed. The increased computing speed will allow running ensembles of ice-sheet flow simulations at the continental scale and at high resolution, previously not possible, enabling quantification of model sensitivity to changes in future climate forcings. These findings will benefit process-oriented and sea-level-projection studies over the coming decades.
Status: final response (author comments only)
- RC1: 'Comment on gmd-2023-32', Anonymous Referee #1, 24 Jun 2023
General comments:
In this paper, the authors are interested in studying and improving the performance of high-resolution, continental-scale ice-sheet modeling by leveraging pseudo-transient continuation and GPU hardware. An unstructured finite-element code is developed in CUDA C to solve the momentum balance with the Shallow Shelf Approximation (named FastIceFlo v1.0). The code is tested on two regional-scale glaciers. The scalability and wall time of the solver are reported and analyzed using a fixed resource set of 1 GPU and increasing problem size. This is compared to ISSM's CG iterative solver on 36 CPU cores. The paper is well-written, and the methodology is unique enough with respect to ice-sheet modeling to warrant publication after a minor revision. The paper shows that the FastIceFlo solver is faster than ISSM's CG solver in most cases for the specified case study, showing that it is possible to use pseudo-transient continuation and GPUs to improve performance. I'm not satisfied with the price and power consumption comparisons, and I think these sections should be strengthened or omitted. I also think that a profiler tool should be used to verify the low memory throughputs reported, as this would help clarify future directions and better inform readers about the potential of the methods.
Specific comments:
- Line 37: The authors state, "...the traditional way of solving the governing equations of ice-sheet flow, such as the finite-element analysis, is not adapted to GPUs as they cannot handle large sparse matrices and linear solvers." GPUs can handle large sparse linear solvers. There are vendor-specific libraries such as cuSolverSP; for multi-GPU iterative solvers, see PETSc, Trilinos and Hypre. (A minimal single-GPU sparse-solve sketch follows these specific comments.)
- Lines 213-215: The authors state, "...the power consumption of the PT GPU implementation was approximately one-seventh of the traditional CPU implementation..." This comparison is misleading. A V100 GPU also requires a CPU which consumes power. One would need to include the power of the CPU for a fair comparison. A variation of this statement is also in the abstract (lines 13-14) and conclusion (line 271).
- Line 230: Where are L1TEX and L2 cache reported from?
- Lines 235-236: Where are the peak memory throughputs reported from? Was this an additional study performed?
- Lines 238-240: 3-4% of measured peak memory throughput is very low and would indicate a latency-bound kernel. These numbers should be verified with a profiler such as nvprof or Nsight.
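To illustrate the point raised in the Line 37 comment above, here is a minimal sketch of a single-GPU sparse direct solve using cuSolverSP. The helper name, tolerance, and reordering choice are illustrative assumptions, and error checking is omitted for brevity:

/* Sketch: solve A x = b for a CSR matrix on one GPU via cuSolverSP's sparse QR.
 * The library calls are real; the surrounding names are assumptions. */
#include <cuda_runtime.h>
#include <cusparse.h>
#include <cusolverSp.h>

int solve_csr_qr(int n, int nnz,
                 const double *h_val, const int *h_rowptr, const int *h_colind,
                 const double *h_b, double *h_x)
{
    cusolverSpHandle_t handle;
    cusparseMatDescr_t descr;
    cusolverSpCreate(&handle);
    cusparseCreateMatDescr(&descr);

    double *d_val, *d_b, *d_x;
    int *d_rowptr, *d_colind, singularity;
    cudaMalloc(&d_val,    nnz     * sizeof(double));
    cudaMalloc(&d_rowptr, (n + 1) * sizeof(int));
    cudaMalloc(&d_colind, nnz     * sizeof(int));
    cudaMalloc(&d_b,      n       * sizeof(double));
    cudaMalloc(&d_x,      n       * sizeof(double));
    cudaMemcpy(d_val,    h_val,    nnz     * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_rowptr, h_rowptr, (n + 1) * sizeof(int),    cudaMemcpyHostToDevice);
    cudaMemcpy(d_colind, h_colind, nnz     * sizeof(int),    cudaMemcpyHostToDevice);
    cudaMemcpy(d_b,      h_b,      n       * sizeof(double), cudaMemcpyHostToDevice);

    /* Sparse QR factorize-and-solve entirely on the device. */
    cusolverSpDcsrlsvqr(handle, n, nnz, descr, d_val, d_rowptr, d_colind,
                        d_b, 1e-12, 0, d_x, &singularity);

    cudaMemcpy(h_x, d_x, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_val); cudaFree(d_rowptr); cudaFree(d_colind);
    cudaFree(d_b);   cudaFree(d_x);
    cusparseDestroyMatDescr(descr);
    cusolverSpDestroy(handle);
    return singularity; /* -1 means A was invertible at the given tolerance */
}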
Technical comments:
- Equation 2: epsilons should be defined
- Equation 14: not on the same line
- Equation 14: epsilon_w should be defined. Should this be Delta w?
- Figure 2: It would be good to also show the meshes used in the study to better understand the mesh quality.
- Line 187: "performance of PT" should be "performance of the PT"
- Line 200: The values for the damping parameter, nonlinear viscosity relaxation scalar and transient pseudo time step are not provided for the study.
- Line 207: References are required for the price values of the Tesla V100 GPU and Xeon Gold 6140 CPU used in this study. A comparison between the prices is also written in the abstract (lines 12-13) and conclusion (line 270). References should be added there as well.
- Line 209: It would be good to have a reference for the NVIDIA System Management Interface used in this study and a version number.
- Line 211: A reference is also required for the hardware specification sheet used to acquire the thermal design power of the Intel Xeon Gold 6140 processor.
- Lines 219-222: A table or graph of speedups would help highlight the values obtained in the study.
Citation: https://doi.org/10.5194/gmd-2023-32-RC1
- AC1: 'Reply on RC1', Anjali Sandip, 30 Jun 2023
We thank the referee for the comments. We will make the proposed modifications to the revised version once the interactive discussion is over. With reference to comments on:
- Line 230: Where are L1TEX and L2 cache reported from? We reported the L1TEX and L2 cache figures from reports generated by NVIDIA Nsight Compute for each ice flow simulation.
- Lines 235-236: Where are the peak memory throughputs reported from? Was this an additional study performed? We posted the code to determine the peak memory throughputs in the associated GitHub repository – https://github.com/AnjaliSandip/FastIceFlo/blob/master/scripts/memcopy.cu (a schematic version of such a benchmark is sketched after this list).
- Line 200: The values for the damping parameter, nonlinear viscosity relaxation scalar and transient pseudo time step are not provided for the study. We listed the range within which the damping parameter and nonlinear viscosity relaxation scalar were varied to maintain linear scaling and solution stability on the lines following equations 12 and 8, respectively. We listed the formula to determine the pseudo-transient time step in equation 7.
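For context, a memory-copy benchmark of this kind typically times a kernel that only streams data and divides the bytes moved by the elapsed time. A self-contained sketch follows; the repository's memcopy.cu may differ in its details:

/* Sketch of a GPU memory-throughput benchmark: one read and one write per
 * element, so bytes moved = 2 * n * sizeof(double) per kernel launch. */
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void copy(double *dst, const double *src, size_t n){
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

int main(void){
    size_t n = (size_t)1 << 27;          /* 2^27 doubles = 1 GiB per array */
    double *a, *b;
    cudaMalloc(&a, n * sizeof(double));
    cudaMalloc(&b, n * sizeof(double));
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    const int iters = 100;
    cudaEventRecord(t0);
    for (int k = 0; k < iters; ++k)
        copy<<<(unsigned)((n + 255) / 256), 256>>>(b, a, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    /* GB/s = total bytes / (elapsed s * 1e9) = bytes / (ms * 1e6) */
    printf("effective memory throughput: %.1f GB/s\n",
           2.0 * n * sizeof(double) * iters / (ms * 1e6));
    return 0;
}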
Citation: https://doi.org/10.5194/gmd-2023-32-AC1
- RC2: 'Comment on gmd-2023-32', Daniel Martin, 04 Jul 2023
- AC2: 'Reply on RC2', Anjali Sandip, 16 Aug 2023
We thank Dr. Dan Martin for the comments. We will make the proposed modifications to the revised version. With reference to comments on:
- Figure 3: Is it possible to include (likely in an additional figure) some sort of plot of norm(residual) (i.e. du/dt) vs. iteration to illustrate how this method performs? Is it a linear convergence, or something better? (in the end, you’re comparing against a more-standard iterative method where one would plot residual vs. iteration).
Yes, we can add a plot of error residual vs. iteration for each of the glacier model configurations. For both the Jakobshavn and Pine Island glacier models, the number of nonlinear pseudo-transient iterations to converge for a given number of vertices N scales as ≈ O(N^1.2) or better. (A schematic of the damped pseudo-transient update is sketched after this list.)
- line 192: What do you mean by ”arithmetic precision”? 32 vs 64? (what number are you using for np?)
np represents the size of the data type in bytes (double: 8 bytes).
- line 200: Can you include a table with the optimal parameters here?
The table with the optimal parameters has been included in the associated GitHub repository https://github.com/AnjaliSandip/FastIceFlo/blob/master/README.md
- line 202: Can you describe what you mean by ”optimal solver parameters are unidentifiable”? Are you completely unable to solve the problem? or is it simply that you can’t identify optimal parameters (in which case you could still have a result)?
We were unable to identify solver parameters that result in convergence (or meet the chosen stopping criterion) at ∼ 3e7 degrees of freedom (DoFs) for the Pine Island glacier model. We will investigate this in the next steps.
- Eqn 3: Do you add any regularization to address the singularity when the strain rate is 0?
Yes, we regularize our viscosity formulation in the GPU implementation by capping it at 1e5 times the reference viscosity eta_0 to address the singularity when the strain rate tends towards zero.
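// Hedged annotation (ours, not from the repository): eta_it is the viscosity from the
// current strain rate, rele the viscosity relaxation scalar, and eta_0 a reference
// viscosity; the update blends old and new viscosity in log space, and min()
// enforces the 1e5*eta_0 cap.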
etan[ix] = min(__expf(rele*__logf(eta_it) + (1.0-rele)*__logf(etan[ix])),eta_0*1e5);
We will include a statement indicating the same in the revised manuscript.
- line 89: Since your pseudo time-step is spatially variable, do you have a sense of how uneven the convergence is?
The spatially variable pseudo-time step speeds up the convergence by allowing slow-moving regions to reach steady state in a limited number of pseudo-time steps compared to what would be required if we were using the smallest time step over the entire domain. (An illustrative kernel for such a local pseudo-time step is sketched below.)
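As referenced in the Figure 3 item above, a schematic of a damped pseudo-transient velocity update in CUDA C. The variable names and the exact form are illustrative (a second-order Richardson-type iteration), not FastIceFlo's actual kernel:

/* Schematic damped PT update; names are illustrative, not FastIceFlo's code.
 * R: momentum-balance residual, dvdtau: damped rate of change,
 * dtau: local pseudo-time step, damp in (0,1): damping parameter. */
__global__ void pt_update(double *v, double *dvdtau, const double *R,
                          const double *dtau, double damp, int nvertices){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nvertices){
        dvdtau[i] = R[i] + damp * dvdtau[i]; /* accumulate damped pseudo-velocity rate */
        v[i]     += dtau[i] * dvdtau[i];     /* advance the velocity in pseudo-time */
    }
}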
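And for the spatially variable pseudo-time step discussed in the last item, an illustrative kernel. The h^2/eta scaling below is a hypothetical CFL-like choice for this sketch only; the actual expression is given by equation 7 of the paper:

/* Illustrative local pseudo-time step: each vertex gets its own dtau. The
 * h^2/eta proportionality is an assumption for this sketch; see equation 7
 * of the paper for the actual formula. */
__global__ void local_dtau(double *dtau, const double *h, const double *eta,
                           double cfl, int nvertices){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nvertices)
        dtau[i] = cfl * h[i] * h[i] / eta[i]; /* more viscous regions take smaller steps */
}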
Citation: https://doi.org/10.5194/gmd-2023-32-AC2