Reduced floating-point precision in regional climate simulations: an ensemble-based statistical verification

Banderier, Hugo; Zeman, Christian; Leutwyler, David; Rüdisühli, Stefan; Schär, Christoph

doi:https://doi.org/10.5194/gmd-17-5573-2024

Articles | Volume 17, issue 14

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Methods for assessment of models

24 Jul 2024

Final revised paper (published on 24 Jul 2024)
Preprint (discussion started on 08 Nov 2023)

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2023-2263', Anonymous Referee #1, 11 Dec 2023
General comments
The manuscript is well written, concise, and represents an important contribution to the efforts of using reduced float precision in climate modelling.
The approach of ensemble based verification is interesting and thorough. The success of single precision is encouraging.
Additionally, I found the introduction to be a useful review of relevant literature.
The manuscript is in a good state for publication subject to some minor questions below.
Specific comments
This is beyond the main scope of the paper, but it would be interesting to include some discussion on implications for 16-bit, stochastic rounding etc., especially since this is mentioned in the introduction.

Similarly, are there any plans for extensions to the work in 16-bit? I appreciate the hardware is not readily available; the papers you cite in the introduction use a half-precision emulator…

Some discussion/justification on the choice of KS over other metrics such as Wasserstein would be useful, especially since KS causes problems w.r.t the steep distribution functions of the bounded variables.

The choice of the 95th percentile for the rejection of the global null hypothesis is reasonable, but I wonder how robust the results are to different percentile choices? The shape of the empirical distribution is surely relevant here, which is somewhat lost by considering only the mean.
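The robustness question above can be probed with a small sensitivity check. The sketch below is not the manuscript's procedure; it assumes that the global decision compares the mean rejection rate of the test ensemble with a chosen percentile of reference rejection rates. All numbers are made up, and the helper names `percentile_nearest_rank` and `global_decision` are hypothetical.

```python
from statistics import mean

def percentile_nearest_rank(values, q):
    """Nearest-rank percentile for an integer q in 1..100 (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceiling of q*n/100
    return ordered[rank - 1]

def global_decision(test_rates, reference_rates, q=95):
    """Reject the global null hypothesis if the mean test rejection rate
    exceeds the q-th percentile of the reference rejection rates."""
    return mean(test_rates) > percentile_nearest_rank(reference_rates, q)

# Hypothetical rejection rates (fractions of rejected grid points).
reference_rates = [0.02, 0.03, 0.03, 0.04, 0.04, 0.04, 0.05, 0.05, 0.05, 0.05,
                   0.06, 0.06, 0.06, 0.07, 0.07, 0.08, 0.08, 0.09, 0.10, 0.12]
test_rates = [0.08, 0.10, 0.11]

# Repeat the decision for several percentile choices.
decisions = {q: global_decision(test_rates, reference_rates, q) for q in (90, 95, 99)}
print(decisions)
```

With these invented values the decision flips between the 90th and 95th percentile, which is the kind of sensitivity the comment asks about.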

Last sentence of Appendix C: some more details on the “more comprehensive experiment” would be useful for reproducibility, even if the results are not shared.

Figure 2: why are some variables labelled in grey, some in black? If it is the same reason as given in the caption of Figure 3 I would move the explanation earlier.
Citation: https://doi.org/10.5194/egusphere-2023-2263-RC1
RC2:
'Comment on egusphere-2023-2263', Milan Klöwer, 19 Apr 2024
Summary
The authors present a statistical verification technique to test whether two datasets are drawn from the same distribution. They apply this technique to an evaluation of single precision arithmetic used in COSMO (most model components) for 10-year climate simulations. The methodology is nicely illustrated (appreciation for Fig 1) and generally the paper is written for others to reproduce it, making it a useful manuscript for future projects. The method described is a statistical verification between (1) a baseline model, (2) some change to it and (3) an “anti-control” ensemble used to quantify the expected deviation from the baseline due to uncertainties in parameterizations and tuning parameters. The authors therefore conclude that there is little reason not to use single precision in COSMO also for climate simulations, except for one purely diagnostic variable which the method identifies to suffer significantly from the use of single precision.
I generally recommend this paper to be published with minor corrections, see below for a list of points I created while reading the manuscript. Most of them are (strong) recommendations that you can object to though if you have a good reason for it (please explain). However, I also want to raise three “major” points that don’t require major work, but that should be discussed in a paragraph or two as I feel this is currently missing from the text.
I enjoyed reading the manuscript, many thanks. It is generally well written and concise, illustrating a method and discussing some results in three figures without being overly complicated.
Major points
1. Conditional distributions
Your method generally uses *unconditional* probability distributions. That means two adjacent grid points could co-vary in double precision but be independent with single precision even though both grid points follow an unchanged unconditional probability distribution. In an information theoretic sense, the mutual information between grid points could change even though the (unconditional) entropy is unchanged. As I see it, this could be a significant impact of precision (or any other change of the model) but your method would not detect it. A real world example might be precipitation which could occur in large patches (high mutual information) in the control but in smaller patches (lower mutual information) in the test ensemble. Clearly, analysing conditional probability distributions would explode the dimensionality of the problem, which is with 10TB already relatively large. Could you elaborate on this aspect in 2.1?
2. Rejection is binary
After reading the manuscript it is unclear to me what the effect is of the binary rejection, instead of having some error metric that would penalise larger deviations more. For example, if cloud cover has a rejection rate of n% because cloud cover is too high, then, as far as I understand it, it doesn’t really matter whether in these situations cloud cover is 100% or 200%. But the latter is obviously not what you would tolerate a single precision simulation to output. I see the argument that you shouldn’t penalise large deviations because maybe they don’t matter more if they don’t have kick-off effects causing other variables or grid points or timesteps to be rejected more frequently.
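The saturation described in this point can be made concrete with a toy example. The sketch below is not the manuscript's code: it assumes equal-sized samples, uses made-up values, and contrasts the two-sample KS statistic with the 1-D Wasserstein distance that Referee #1 mentions as an alternative. The helper names `ks_statistic` and `wasserstein_1d` are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

def wasserstein_1d(a, b):
    """1-D Wasserstein distance for equal-sized samples:
    mean absolute difference between the sorted values."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Made-up control values and two shifted "test" ensembles.
control = [0.0, 0.25, 0.5, 0.75, 1.0]
shift_small = [x + 2.0 for x in control]
shift_large = [x + 4.0 for x in control]

ks_small, ks_large = ks_statistic(control, shift_small), ks_statistic(control, shift_large)
w_small, w_large = wasserstein_1d(control, shift_small), wasserstein_1d(control, shift_large)
print(ks_small, ks_large)  # both saturate once the samples no longer overlap
print(w_small, w_large)    # grows with the size of the shift
```

Once the two samples no longer overlap, the KS statistic cannot distinguish a moderate deviation from a large one, whereas a distance-based metric keeps growing. Whether that extra sensitivity is desirable is exactly the trade-off raised above.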
3. Comparison to other methodologies
Reading Appendix A I’m left with the feeling that both methodologies have their outliers (for different reasons) and that an even better method would be to take the minimum rejection rate between both. Because then (if I see this right from Fig. A1) only HPBL would stick out as an anomaly, which you also identify in the results. I see you have your arguments for your method over the Benjamini-Hochberg method, but the discussion in Appendix C also shows that neither method is actually robust to geophysical data distributions. I generally think the manuscript would gain a lot of strength if you incorporated the ideas from Appendix C directly into the main text, so as not to leave the reader with a figure where you identify most variables that are outliers as being an artefact of the methodology. I mention in the minor comments that converting the data to ranks might help, or maybe you can think of another way to deal with not-so-normal distributions. What you suggest, to just round all data to 4 decimal points, is I think one possibility (although I’d round in binary, not in decimal) because your probability distributions should be well resolved by the numerical precision of your data. So whether you have 23 mantissa bits of precision (in the data, not the compute) or 20 shouldn’t have an impact on the rejection rate. I can see a method where you accept a rejection rate as robust when rounding the data to n-1, n-2 or n-3 mantissa bits does not have an impact. But note that different variables have a different bitwise real information content. E.g. temperature or CO2 have much more information in the significand/mantissa than a variable that varies over orders of magnitude in the atmosphere, say, specific humidity (see Klöwer et al. 2021, https://doi.org/10.1038/s43588-021-00156-2). I’d love to see a version of the manuscript that does not require Appendix C to explain some artefacts.
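Rounding "in binary" to a reduced number of mantissa bits can be sketched as follows. This is an illustration under stated assumptions, not code from the manuscript or from any particular library: the function name `round_mantissa` is hypothetical, inputs are assumed to be finite float32-representable values, and ties are rounded away from zero rather than to even.

```python
import struct

def round_mantissa(x: float, keepbits: int) -> float:
    """Round x, taken as a float32, to `keepbits` explicit mantissa bits (0..23).

    Assumptions of this sketch: x is finite and not close to the float32
    maximum; ties are rounded away from zero.
    """
    if not 0 <= keepbits <= 23:
        raise ValueError("float32 has 23 explicit mantissa bits")
    drop = 23 - keepbits
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if drop > 0:
        half = 1 << (drop - 1)                   # half of the last kept bit
        mask = ~((1 << drop) - 1) & 0xFFFFFFFF   # zeroes the dropped bits
        bits = (bits + half) & mask              # a carry may bump the exponent
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Two neighbouring float32 values collapse onto the same value at 12 bits.
a = round_mantissa(296.45676, 12)
b = round_mantissa(296.4568, 12)
print(a, b)
```

A robustness check in the spirit of the comment would recompute the rejection rate after applying such a rounding with `keepbits` set to n-1, n-2 and n-3 to every ensemble, and accept the rate only if it is unchanged.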
Minor points
Abstract
L11: “rejection rate” would benefit from a bit more explanation, rejected based on what? You of course elaborate on it in the text, but maybe just name the verification you use given this is the abstract? Maybe just “rejection rate, highlighting little statistical difference between the …”
L12: “negligible as masked by model uncertainty” maybe? To explain the meaning of your anti-control?
Intro
L24: Maybe add memory and/or data requirements?
L25: Or length of integration? Number of variables? Physical accuracy (e.g. more accurate parameterizations are often more costly), in the narrative of more accuracy with less precision?
L26: For most applications only float64 -> float32 is straightforward. 16-bit arithmetic often requires adjusted algorithms, and getting performance out is also not necessarily straightforward. You say this around L64 but maybe adjust the usage of “straightforward” here, or only refer to float64->float32.
L27: Remove “typically” and just state the important points of the IEEE-754 standard here.
L28: Note that this is the normal range, subnormals are smaller, please add to be precise.
L31: Note that around the number you mention the following float64 are representable
296.45678912345664

296.4567891234567

296.45678912345676

Rounded to float32 the representable floats (round to nearest in the middle) are
296.45676f0

296.4568f0

296.45682f0

While your point holds please choose a float64,float32 pair that’s actually representable not “something like”.
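The neighbouring representable values listed in this comment can be checked with a few lines of standard-library Python. This is only a sketch: the helpers `to_float32` and `float32_neighbours` are hypothetical names, and the bit manipulation assumes a positive, finite input.

```python
import math
import struct

def to_float32(x: float) -> float:
    """Round a float64 to the nearest float32 (returned as a Python float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def float32_neighbours(x: float):
    """float32 values just below and above float32(x); positive finite x only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    below = struct.unpack("<f", struct.pack("<I", bits - 1))[0]
    above = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return below, above

x64 = 296.4567891234567
x64_below = math.nextafter(x64, -math.inf)  # previous representable float64
x64_above = math.nextafter(x64, math.inf)   # next representable float64

x32 = to_float32(x64)
x32_below, x32_above = float32_neighbours(x64)

print(repr(x64_below), repr(x64), repr(x64_above))
print(repr(x32_below), repr(x32), repr(x32_above))
print("float64 spacing:", x64_above - x64)  # 2**-44 for values in [256, 512)
print("float32 spacing:", x32_above - x32)  # 2**-15 for values in [256, 512)
```

Note that the float32 values are printed here after conversion back to float64, so they show more digits than the shortest float32 representation quoted above.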
L33: you mention discretization, model and boundary condition error, there’s also initial condition error, maybe add?
L35: Maybe use “arithmetic intensity” to distinguish this concept from the use of “operational(ly)” in terms of operational, i.e. regularly scheduled numerical weather predictions?
L44: you switch between single precision and SP, I don’t see the need to abbreviate but in any case be consistent?
L45: I would expect computing architecture or compiler settings to play a role too? Have you tested that too or is that less relevant given where and how COSMO is run? I’m not saying you should test that performance, but maybe just outline to the reader what could impact performance improvements.
L97 and L98: the significanD or significanT bits.
L99: when first mentioning stochastic rounding, I’d provide a reference like https://doi.org/10.1098/rsos.211631
L128: -to
L150: I like these sentences summarising the implications of your rejection procedure. But it might be helpful to the reader to discuss, say, two cases, one where single precision causes a tiny bias globally, would this be rejected? As I see, not if that bias is masked by the variance of fC-R. And two, a case where single precision changes the climate in a small country but has no impacts in other regions of the world. Could these be added for clarification?
Figure 1: This is great!
L162: Call it multiplicative noise? There’s an analogy here to the stochastically perturbed parameterization tendencies (SPPT) where the perturbations also take this form but you apply it to the actual variables, I assume R has some autocorrelation in space and time? Maybe irrelevant for your study though.
L168: Maybe add a small table for the 7 ensembles?
L173: Can you not reduce diffusion because of numerical instability? If yes, maybe state why you’re changing the coefficients only in one direction. Also while I see a change in diffusion as a reasonable control to test against, you could have also changed a physical parameterization (e.g. make convection stronger/weaker). Maybe elaborate more on your decision why you created the control as you did?
Figure 2: Maybe state the precision on each panel? It’s double everywhere where it’s not single I guess, just for clarity.
L187: Could you elaborate where rejections in the ID test come from? As I see it you only perturb the initial conditions so rejections are solely due to internal variability which however is small because of the identical boundary conditions leaving little room for the weather over Europe to evolve onto an independent trajectory? So this could be a storm away from the boundaries that is strong in some ensemble members but not in the other ensemble?
L193: Differences … in the initial condition perturbations?
L197: Given that some variables suffer from single precision as you outline, do you think this can have an impact on others being on average those 2-5% off? E.g. if cloud cover has a systematic bias with single precision this could introduce a bias on surface temperature (and consequently other variables) that’s not large but enough to systematically cause those 2-5% higher rejection rate? Maybe a discussion of cross-variable impacts could be added?
L205: Use a rank-based test instead?
L207: This sounds like also output precision (assuming you always output single?) is of relevance here. I think it makes sense to round all the data to something slightly less than single precision anyway. That way any clustering of data to identical values is at least the same across all simulations regardless of the precision used for computations.
L211: I find it difficult to think of every sensitivity to precision as a bug. There are algorithms that are stable only at high precision without them being coded up incorrectly. E.g. stagnation in large sums of small numbers due to insufficient precision can be overcome with a compensated summation but that comes at additional computational cost. In other situations you might be able to solve precision issues by computing the sum in reverse (possibly an easy fix that could be considered a “bug”). Maybe write “due to rounding errors in algorithms whether easy to fix (e.g. a bug) or not.”
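The stagnation and the two remedies mentioned in this comment can be reproduced with a small float32 emulation. The sketch below is illustrative only and unrelated to COSMO's code: float32 arithmetic is emulated by rounding every float64 result through `struct`, the helper names are hypothetical, and the input values are chosen to trigger stagnation.

```python
import struct

def f32(x: float) -> float:
    """Round a float64 to the nearest float32 (returned as a Python float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def naive_sum32(values):
    """Left-to-right sum with every intermediate result rounded to float32."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total

def kahan_sum32(values):
    """Compensated (Kahan) summation, emulated in float32."""
    total = 0.0
    comp = 0.0  # running estimate of the rounding error lost so far
    for v in values:
        y = f32(f32(v) - comp)          # apply the stored correction
        t = f32(total + y)              # low-order bits of y may be lost here
        comp = f32(f32(t - total) - y)  # recover what was lost
        total = t
    return total

small = 2.0 ** -25                  # below half the float32 spacing at 1.0
values = [1.0] + [small] * 1024
exact = 1.0 + 1024 * small          # 1 + 2**-15, representable in float32

naive = naive_sum32(values)             # stagnates at 1.0
reverse = naive_sum32(reversed(values)) # small terms accumulate first
kahan = kahan_sum32(values)             # close to the exact value
print(naive, reverse, kahan, exact)
```

Reversing the order costs nothing extra, while the compensated sum needs several additional operations per term, which is the cost trade-off the comment refers to.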
L212: Could you mark those variables in Fig 3 somehow? For anyone repeating your analysis I find this an important concept to highlight that rounding errors from some variables cannot propagate to others which certainly helps in finding in which calculation precision is lost.
L215: noting -> reiterating given you already said this?
Fig 3: You write “Height of the boundary layer” but abbreviate it as HPBL, add the “planetary” or call it HBL for consistency? Also decision rate vs rejection rate?
L222: after -> during ?
L223: Could you not present a version of Fig 2 and 3 where these technical artefacts are somehow circumvented / the methodology adjusted? Most not-so-careful readers would probably look at those figures and conclude “single precision is bad for cloud or soil modelling so we shouldn’t do this”.
L240ff: Just want to appreciate this list of recommendations that you give to readers, very helpful I believe!
L295: Temporal resolution usually comes with more constraints on compute and data storage. But would you recommend using time averages or time snapshots if both were available?
Fig C1: I find the grey background to highlight rejection a bit of an overkill, and it doesn’t make the purple lines particularly readable, given all are rejected, just write this in the caption and make the background white again?
Citation: https://doi.org/10.5194/egusphere-2023-2263-RC2
AC1: 'Comment on egusphere-2023-2263', Hugo Banderier, 24 May 2024

We thank both of the reviewers for their insightful comments. In the file "Refereeresponse.pdf", we aim to answer in detail all the points raised by them.

Citation: https://doi.org/10.5194/egusphere-2023-2263-AC1

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Hugo Banderier on behalf of the Authors (24 May 2024)

ED: Publish subject to technical corrections (13 Jun 2024) by James Kelly

AR by Hugo Banderier on behalf of the Authors (14 Jun 2024)

Short summary

We investigate the effects of reduced-precision arithmetic in a state-of-the-art regional climate model by studying the results of 10-year-long simulations. After this time, the results of the reduced precision and the standard implementation are hardly different. This should encourage the use of reduced precision in climate models to exploit the speedup and memory savings it brings. The methodology used in this work can help researchers verify reduced-precision implementations of their model.