Articles | Volume 19, issue 12
https://doi.org/10.5194/gmd-19-5623-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
CaMa-Flood-GPU: a GPU-based hydrodynamic model implementation for scalable global simulations
Download
- Final revised paper (published on 29 Jun 2026)
- Preprint (discussion started on 05 Feb 2026)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-6500', Anonymous Referee #1, 18 Mar 2026
  - AC1: 'Reply on RC1', Jiabo Yin, 01 May 2026
- RC2: 'Comment on egusphere-2025-6500', Anonymous Referee #2, 22 Mar 2026
  - AC2: 'Reply on RC2', Jiabo Yin, 01 May 2026
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
AR by Jiabo Yin on behalf of the Authors (06 May 2026)
Author's response
Author's tracked changes
Manuscript
ED: Publish subject to minor revisions (review by editor) (07 Jun 2026) by Thomas B. Wild
AR by Jiabo Yin on behalf of the Authors (10 Jun 2026)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (21 Jun 2026) by Thomas B. Wild
AR by Jiabo Yin on behalf of the Authors (21 Jun 2026)
Manuscript
This manuscript presents and evaluates a new implementation of the CaMa-Flood global river model. The original model, written in Fortran and run on a multi-core CPU architecture, is rewritten for a GPU-based architecture with adapted libraries and kernels, with the objective of speeding up global-scale, high-resolution simulations without degrading model performance. As a first step, the authors carefully analyze the main challenges behind this transposition, including the irregularity of the network topology, the interpolation of runoff inputs, the nonlinear relationship between water depth and river storage, and the handling of memory and communications between GPUs. Methods adapted to massive parallelism are proposed at each step. The new model, called CaMa-Flood-GPU, is then compared to the original CPU-based CaMa-Flood in terms of computation time and reproducibility. Results show a significant gain in computation time (more than 3 times faster for a simulation at 1 arcmin resolution) with negligible differences in the outputs (river discharge and depth, flood outflow). The manuscript is well written and organized, and the figures are of good quality, although some could be improved (see comments below). I have a few remarks that could further improve the manuscript and that should be easy to address.
Main remarks:
1. Ordering catchments and assigning them to dedicated GPUs is particularly important for efficient parallelism in terms of memory and communications, but it is not clear how this first step is carried out. More detail could be provided, for example in section 2.1.1. This could also include how catchments are assigned to one GPU or another in a multi-GPU configuration (L172).
2. How are communications between neighboring catchments handled to account for backwater effects (the impact of downstream water level on the surface profile and flow dynamics)? In other words, are there any tricks in the arrangement of catchments in memory to limit communication time (see also the previous comment)?
3. Can floods represented in 2D introduce water exchanges between neighboring catchments that are not directly connected through the river network? What would be the implications for memory exchanges?
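To make main remark 1 concrete, the sketch below shows one possible assignment scheme. It is an assumption for illustration only and is not taken from the manuscript: whole basins are distributed over GPUs with a greedy "largest first" load balance, so that upstream-downstream links within a basin stay on a single device. The function names `assign_basins` and `gpu_loads` and the basin sizes are hypothetical.

```python
import heapq


def assign_basins(basin_sizes, n_gpus):
    """Greedily assign whole basins to GPUs, largest basin first.

    basin_sizes maps a basin label to its number of catchments
    (hypothetical values). Each basin goes to the GPU that currently
    holds the fewest catchments.
    """
    heap = [(0, gpu) for gpu in range(n_gpus)]  # (current load, gpu id)
    heapq.heapify(heap)
    assignment = {}
    # Sort by decreasing size; break ties by label for determinism.
    for basin, size in sorted(basin_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, gpu = heapq.heappop(heap)
        assignment[basin] = gpu
        heapq.heappush(heap, (load + size, gpu))
    return assignment


def gpu_loads(basin_sizes, assignment, n_gpus):
    """Return the total number of catchments held by each GPU."""
    loads = [0] * n_gpus
    for basin, gpu in assignment.items():
        loads[gpu] += basin_sizes[basin]
    return loads


# Hypothetical catchment counts for five basins, split over two GPUs.
basin_sizes = {"A": 8, "B": 7, "C": 6, "D": 5, "E": 4}
assignment = assign_basins(basin_sizes, 2)
loads = gpu_loads(basin_sizes, assignment, 2)
print(assignment, loads)
```

A scheme of this kind balances memory use but ignores links that cross basin boundaries (for example bifurcation channels), which is precisely where the communication questions in remarks 2 and 3 arise.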
Minor remarks:
L136. Could you briefly describe what the scatter_add and atomic_add operations do?
L146. It is not clear how the global-scale state array is constructed (see main remark 1).
L161. By “land grid cells”, do you mean “grid cells from the Land Surface Model that produces runoff”?
L185. I guess the shard_forcing interface could easily integrate a specific method to couple CaMa-Flood with a Land Surface Model, right? It might be worth mentioning it.
Fig. 5. The figure is not clear and could be improved. For instance, what do the columns (3 each in batched runoff, fluxes and errors) represent? Catchments? And the rows? Computation (sub-)time steps? Where does the synchronization between GPUs occur, and is it what allows the model to advance to the next time step? What are dataloader 0 and 1?
L233. Isn’t the input broadcast also a collective communication? This would give three collective communications at each time step.
L234. How is the flexible time step implemented/parallelized? I understand that the same sub-step is chosen for all the catchments of the globe, is that right? Since each GPU works asynchronously, would it be possible to choose a different sub-step for each GPU?
Fig. 6. The figure could be improved and enlarged: gauge dots are not clearly visible except at very high zoom, and star symbols are not visible at all. Also, the figure caption states that the catchment outlines are shown; I understand that they are represented by the shaded colors, but in the text the term catchment corresponds to the base unit, while in the figure it more likely refers to the entire basin. Is that right? Finally, why are some catchments/basins so large, encompassing several basins (like orange in South America, pink in Asia, green in Africa or brown in North America)?
Table 1. It seems from Table 3 that the CPU configuration was not used for the first three machines (4070 Ti, V100 and A100). Why then fill in the CPU and CPU Cores columns for these machines? Also, would it be possible to add the available memory for each machine and node?
L282. I understand the idea of using a coarse-resolution forcing (1° runoff) to focus more on computational performance. But running a global-scale simulation at very high resolution (e.g. 1 arcmin, which is typically not achievable with the current CPU version) would also require high-resolution forcing. Maybe an additional experiment would help quantify the added simulation time due to the reading and broadcasting of high-resolution forcing.
L288. Could you explain what the block size is? Is this related to the catchment assignment (see main remark 1)?
Table 3. The amount of memory is also a very important aspect in global-scale, high-resolution simulations. Could you explain why some configurations ran out of memory and not others?
Fig. 7. In the third column, it could be preferable to show the relative difference. In that case, values below 1e-6 could be attributed to numerical errors only (floating-point precision).
L325. What is the period of the simulation?
Fig. 8. What is the added value of showing both simulations, with and without the activation of the bifurcation module?
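As an indication of the level of description requested in the remark on L136, the sketch below emulates in plain Python what a scatter-add does when upstream outflows are accumulated into downstream catchments, and why the individual additions need to be atomic on a GPU. The five-catchment network, the variable names and the helper functions are hypothetical and are not taken from the manuscript or from any particular library.

```python
def scatter_add(values, index, size):
    """Accumulate values[i] into slot index[i] of a zero-initialised array.

    On a GPU, each iteration would run in its own thread, and the
    update below would be an atomic add so that no contribution is lost.
    """
    out = [0.0] * size
    for value, slot in zip(values, index):
        out[slot] += value
    return out


def non_atomic_write(values, index, size):
    """Emulate concurrent read-modify-write without atomics.

    Every 'thread' reads the initial value of its slot, adds its own
    contribution and writes the result back, so the last writer wins
    and earlier contributions to the same slot are overwritten.
    """
    out = [0.0] * size
    initial = list(out)
    for value, slot in zip(values, index):
        out[slot] = initial[slot] + value
    return out


# Hypothetical network: catchments 0 and 1 drain into 2; 2 and 3 drain into 4.
downstream = [2, 2, 4, 4]
outflow = [1.0, 2.0, 4.0, 8.0]

inflow = scatter_add(outflow, downstream, 5)
lossy = non_atomic_write(outflow, downstream, 5)
print(inflow)  # [0.0, 0.0, 3.0, 0.0, 12.0]
print(lossy)   # [0.0, 0.0, 2.0, 0.0, 8.0]
```

The second result shows the contribution that is lost at each confluence when two updates target the same slot without atomicity; a one-sentence statement of this kind in the manuscript would be sufficient.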