CaMa-Flood-GPU: a GPU-based hydrodynamic model implementation for scalable global simulations

Kang, Shengyu; Yin, Jiabo; Yamazaki, Dai

doi:10.5194/gmd-19-5623-2026

Articles | Volume 19, issue 12

https://doi.org/10.5194/gmd-19-5623-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

Model description paper

29 Jun 2026

Final revised paper (published on 29 Jun 2026)
Preprint (discussion started on 05 Feb 2026)

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-6500', Anonymous Referee #1, 18 Mar 2026

This manuscript presents and evaluates a new implementation of the CaMa-Flood global river model. In this implementation, the original model, written in Fortran and run on a multi-core CPU architecture, is rewritten for a GPU-based architecture with adapted libraries and kernels, with the objective of speeding up global-scale, high-resolution simulations without degrading model performance. As a first step, the authors carefully analyze the main challenges behind this transposition, including the irregularity of the network topology, the interpolation of runoff inputs, the nonlinear relationship between water depth and river storage, and the handling of memory and communications between GPUs. Methods adapted to massive parallelism are proposed at each step. The new model, called CaMa-Flood-GPU, is then compared to the original CPU-based CaMa-Flood in terms of computation time and reproducibility. Results show a significant gain in computation time (more than 3 times faster for a simulation at 1 arcmin resolution) with negligible differences in the outputs (river discharge and depth, flood outflow). The manuscript is well written and organized, and the figures are of good quality, although some could be improved (see comments below). I have a few remarks that could further improve the manuscript and that should be easy to address.

Main remarks:
1. Ordering catchments and assigning them to dedicated GPUs is particularly important for efficient parallelism in terms of memory and communications, but it is not clear how this first step is carried out. More details could be provided, for example in section 2.1.1. This could also include how catchments are assigned to one GPU or another in a multi-GPU configuration (L172).
2. How are communications between neighboring catchments handled to account for backwater effects (impact of downstream water level on the surface profile and flow dynamics)? In other words, are there some tricks with the arrangement of catchments in memory to limit communication time (see also previous comment)?
3. Can floods represented in 2D introduce water exchanges between neighboring catchments that are not directly connected through the river network? What would be the implications for memory exchanges?
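
To make the question in remark 1 concrete, the following is a minimal sketch of one generic way that independent basins could be distributed over several GPUs: a greedy, largest-first assignment that balances the number of catchments per device. It is an illustrative assumption only and is not claimed to be the scheme used in CaMa-Flood-GPU. The function name `assign_basins`, the basin labels, and the basin sizes are all hypothetical.

```python
import heapq

def assign_basins(basin_sizes, n_gpus):
    """Greedy largest-first assignment of basins to GPUs (illustrative only).

    basin_sizes: dict mapping a hypothetical basin id to its catchment count.
    Returns a list holding one list of basin ids per GPU.
    """
    # Min-heap of (current load, gpu index): the least-loaded GPU pops first.
    heap = [(0, gpu) for gpu in range(n_gpus)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_gpus)]
    # Place the largest basins first so the small ones can even out the loads.
    for basin, size in sorted(basin_sizes.items(), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(basin)
        heapq.heappush(heap, (load + size, gpu))
    return assignment

# Hypothetical basin sizes (number of catchments per basin).
basin_sizes = {"A": 900, "B": 500, "C": 400, "D": 300, "E": 100}
assignment = assign_basins(basin_sizes, n_gpus=2)
loads = [sum(basin_sizes[b] for b in gpu) for gpu in assignment]
```

Keeping each basin whole on a single device means that no river link crosses a device boundary in this toy setting. Whether that still holds once bifurcation channels connect basins is exactly the kind of detail the remark asks the authors to document.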

Minor remarks:
L136. Could you briefly describe what the scatter_add and atomic_add operations do?
L146. It is not clear how the global scale state array is constructed (see major comment 1).
L161. By “land grid cells”, do you mean “grid cells from the Land Surface Model that produces runoff”?
L185. I guess the shard_forcing interface could easily integrate a specific method to couple CaMa-Flood with a Land Surface Model, right? It might be worth mentioning it.
Fig. 5. The figure is not clear and could be improved. For instance, what do the columns (3 in batched runoff, fluxes and errors) represent? Catchments? And the rows? Computation (sub-)time steps? Where is the synchronization between GPUs, and does it allow the model to advance to the next time step? What are dataloader 0 and 1?
L233. Isn’t the input broadcast also a collective communication? This would give three collective communications at each time step.
L234. How is the flexible time step implemented/parallelized? I understand that the same sub-step is chosen for all the catchments of the globe, is that right? Since each GPU works asynchronously, could it be possible to choose a different sub-step for each GPU?
Fig. 6. The figure could be improved and enlarged: gauge dots are not clearly visible except with a very high zoom, and star symbols are not visible at all. Also, the figure caption states that the catchment outlines are shown; I understand that they are represented by the shaded colors, but in the text the term catchment corresponds to the base unit, while in the figure it more likely refers to the entire basin. Is that right? Finally, why are some catchments/basins so large, encompassing several basins (like orange in South America, pink in Asia, green in Africa or brown in North America)?
Table 1. It seems from Table 3 that the CPU configuration was not used for the first three machines (4070 Ti, V100 and A100). Why then fill in the CPU and CPU Cores columns for these machines? Also, would it be possible to add the available memory for each machine and node?
L282. I understand the idea of using a coarse-resolution forcing (1° runoff) to focus more on computational performance. But running a global-scale simulation at very high resolution (e.g. 1 arcmin, which is typically not achievable with the current CPU version) would also require high-resolution forcing. Maybe an additional experiment would help quantify the added simulation time due to the reading and broadcasting of high-resolution forcing.
L288. Could you explain what the block size is? Is this related to the catchment assignment (see major remark 1)?
Table 3. The amount of memory is also a very important aspect in global-scale and high-resolution simulations. Could you explain why some configurations ran out of memory and not others?
Fig. 7. In the third column, it could be preferable to show the relative difference. In that case, values below 1e-6 could be attributed to numerical errors only (floating-point precision).
L325. What is the period of the simulation?
Fig. 8. What is the added value of showing both simulations, with and without the activation of the bifurcation module?
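
For readers unfamiliar with the operations raised in the remark on L136, the following is a minimal NumPy sketch of a scatter-add on a toy river network. The array names (`downstream`, `outflow`, `inflow`), the helper `scatter_add_inflow`, and the five-catchment topology are hypothetical and are not taken from the manuscript. Here `np.add.at` stands in for a GPU scatter-add, which would typically rely on atomic additions so that several upstream catchments can accumulate into the same downstream entry without losing updates.

```python
import numpy as np

# Hypothetical toy network: downstream[i] is the index of the catchment that
# receives the outflow of catchment i; -1 marks a river mouth.
downstream = np.array([2, 2, 4, 4, -1])
outflow = np.array([1.0, 2.0, 3.5, 0.5, 7.0])  # outflow of each catchment

def scatter_add_inflow(downstream, outflow):
    """Accumulate each catchment's outflow into its downstream neighbour."""
    inflow = np.zeros_like(outflow)
    has_downstream = downstream >= 0
    # np.add.at is an unbuffered add, so repeated target indices (confluences)
    # are all accumulated. A plain `inflow[idx] += values` with repeated
    # indices would keep only one contribution per target.
    np.add.at(inflow, downstream[has_downstream], outflow[has_downstream])
    return inflow

inflow = scatter_add_inflow(downstream, outflow)
```

In this toy case, catchments 0 and 1 both drain into catchment 2, and catchments 2 and 3 both drain into catchment 4, so the two confluences are the entries where colliding updates must be summed rather than overwritten.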

Citation: https://doi.org/10.5194/egusphere-2025-6500-RC1
- AC1: 'Reply on RC1', Jiabo Yin, 01 May 2026
  
  Please see the attached PDF containing our point-by-point responses.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6500-AC1
RC2:
'Comment on egusphere-2025-6500', Anonymous Referee #2, 22 Mar 2026

I enjoyed reading manuscript egusphere-2025-6500, which describes the GPU implementation of CaMa-Flood, a popular river routing model typically used in large-scale studies, often in conjunction with global hydrologic models. Overall the manuscript is well organized and written, although I believe there are opportunities for improving both the quality of the presentation and the experimental setup.
Beginning with the Introduction, I think it would be important to provide more context on the implementation of hydrodynamic models on GPUs, something that is now limited to just a few lines. In other words, what is the state of the art in the field? A second point I suggest strengthening is the background information on CaMa-Flood; I found it hard to follow the first part of the Introduction, as it assumes the reader is familiar with the model.
The “Performance comparison” (Section 3.1) seems strong. In my opinion, it should be complemented by a section / sub-section on the experimental setup, where the authors explain how the runoff data were generated and how CaMa-Flood was set up.
A similar comment applies to “Numerical stability” (Section 3.3), which is rather short. Here, there are multiple opportunities for deepening the analysis and demonstrating that the model is indeed stable. For example, you could consider the option of working with runoff data at multiple spatial resolutions (why was a resolution of 0.25 degrees adopted?) and using a variety of gauging stations. The current analysis focuses only on three major rivers; how do the two model implementations perform on smaller rivers? How about bifurcation points?
Finally, I suggest expanding the Conclusions, which read more like an extended abstract. Specifically, the discussion is now limited to Lines 353-356 and could be extended. For example, how can the "model modularity" support the integration of reservoir operations and sediment transport? This should ideally relate to existing / past efforts by the CaMa-Flood community, since model extensions integrating reservoirs already exist.
Detailed comments
- Lines 22-23: I would provide more details on “certain terms”, as not all GMD readers may be familiar with hydrodynamic modeling.
- Lines 23-24: Same comment as above.
- Line 25: Can you provide evidence of the “widespread adoption and balanced fidelity and efficiency” of CaMa-Flood?
- Lines 66-76: Can you add a few references to support these statements?
- Table 3: Is there a specific reason for choosing the Year 2000?

Citation: https://doi.org/10.5194/egusphere-2025-6500-RC2
- AC2: 'Reply on RC2', Jiabo Yin, 01 May 2026
  
  Please see the attached PDF containing our point-by-point responses.
  
  Citation: https://doi.org/10.5194/egusphere-2025-6500-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Jiabo Yin on behalf of the Authors (06 May 2026) Author's response Author's tracked changes Manuscript

ED: Publish subject to minor revisions (review by editor) (07 Jun 2026) by Thomas B. Wild

Dear Authors,

Thank you for submitting the revised manuscript and the nicely detailed response to the reviewers’ comments. Overall, I find that the revision has addressed the reviewers’ main concerns. The manuscript now contains a clearer description of the multi-GPU domain decomposition and communication strategy, a more transparent experimental setup/design, expanded numerical-stability comparisons, improved figures and captions, and broader context for both CaMa-Flood and GPU-based modeling. I therefore do not think an additional round of external review is necessary.

Before final acceptance, I request that you consider a small number of minor revisions to improve clarity for readers:

1) Clarify the terminology around Fig. 6. I think improving this could help reduce confusion. Please make explicit the distinction among unit-catchments, primitive basins, retained basins/basin groups, and bifurcation-formed basin groups. Please also briefly explain why some retained regions in the figure appear spatially large.

2) Clarify the distinction between the performance benchmark and the high-resolution forcing/numerical-stability experiment. The revised manuscript explains that the wall-clock benchmarks use the 1° binary sample runoff, while the numerical-stability comparison includes 0.1° ERA5-Land NetCDF forcing. Please state clearly whether the 0.1° ERA5-Land experiment was used to quantify wall-clock performance overhead from high-resolution forcing input. If it was not, please clarify that this experiment supports numerical/input consistency rather than a full benchmark of high-resolution forcing I/O overhead.

3) Reconcile the GPU memory-footprint description. Please check the manuscript, tables, captions, and response documents to ensure that the reported GPU memory requirements are internally consistent, including the stated memory footprint for the 1-arcmin global state. Based on your response document, I am unsure whether the total memory footprint is 20 or 30 GB.

4) Clarify how bifurcation behavior is evaluated in the numerical-stability comparison. I am confused on this point. The revised manuscript explains that the bifurcation-enabled configuration exercises both the base routing operations and the bifurcation-specific accumulation pathway. Please make this point clear where the numerical-stability results are discussed, or otherwise clarify the extent to which behavior at bifurcation-routing pathways or points is directly evaluated.

Again, these revisions should be limited in scope, and I can check them editorially without sending the manuscript out for re-review.

Thanks,

Tom

AR by Jiabo Yin on behalf of the Authors (10 Jun 2026) Author's response Author's tracked changes Manuscript

ED: Publish as is (21 Jun 2026) by Thomas B. Wild

AR by Jiabo Yin on behalf of the Authors (21 Jun 2026) Manuscript

Short summary

Global floods pose serious risks, but existing models are too slow for large-scale prediction. We redesigned the Catchment-based Macro-scale Floodplain (CaMa-Flood) model for graphics processing units (GPUs), reformulating irregular river networks, flux updates, and floodplain dynamics into highly parallel algorithms. CaMa-Flood-GPU runs global simulations in hours instead of days with the same accuracy, enabling larger ensembles, better flood-risk analysis, and improved preparedness worldwide.