Replicability of the EC-Earth3 Earth System Model under a change in computing environment

Most Earth System Models (ESMs) run under different high-performance computing (HPC) environments. This has several advantages, from allowing different groups to work with the same tool in parallel to sharing the burden of ensemble climate simulations, but also offering alternative solutions in case of shutdown (expected or not) of any of the environments. However, for obvious scientific reasons, it is critical to ensure that ESMs provide identical results under changes in computing environment. While strict bit-for-bit reproducibility is not always guaranteed with ESMs, it is desirable that results obtained under one computing environment are at least statistically indistinguishable from those obtained under another environment, which we term a "replicability" condition following the metrology nomenclature. Here, we develop a protocol to assess the replicability of the EC-Earth ESM. Using two versions of EC-Earth, we present one case of non-replicability and one case of replicability. The non-replicable case occurs with the older version of the model and likely finds its origin in the treatment of river runoffs along Antarctic coasts. By contrast, the more recent version of the model provides replicable results. The methodology presented here has been adopted as a standard test by the EC-Earth consortium (27 institutions in Europe) to evaluate the replicability of any new model version across platforms, including for CMIP6 experiments. More generally, it can be used to assess whether other ESMs can safely be ported from one HPC environment to another for studying climate-related questions. Our results and experience with this work suggest that the default assumption should be that ESMs are not replicable under changes in the HPC environment, until proven otherwise.

Action. We have added a new row in Table 2 with the output size of each of the experiments and a reference to it when saying "large amount of output". In addition, we now briefly mention why there is a difference in the output size of experiments e011 and m06e.

R2.7. P13, l1: Missing "t".

Action. Thanks, this is fixed now.

"[...] includes up to six land-surface tiles (bare ground, low and high vegetation, intercepted water, and shaded and exposed snow) which can co-exist under the same atmospheric grid box. The vertical discretization consists of a four-layer soil that can be covered by a single layer of snow. Vegetation growth and decay vary climatologically, and there is no interactive biology".
R3.6. Section 3.2 should include an explanation of why a 20-year period is necessary to detect code errors that may arise later [e.g., more than 1 year as in Baker et al. (2015)] in the coupled climate model simulations, and whether 20 years is also sufficient for testing different configurations (e.g., active biogeochemistry vs none) or grid resolutions. In particular, how are the 20 years reconciled with the Servonnat et al. finding cited at the bottom of page 4, that about 70 years are needed to account for low-frequency ocean variability?
Response. The choice of 20 years is obviously partly subjective, and as pointed out by the referee, it is necessary to provide more information about this choice in the manuscript. In climate sciences, it is common to consider a minimum of 20 or 30 years to estimate the mean state of the climate system at the global scale. This is due to the chaotic nature of the climate system, which can reach different states from similar initial conditions (including small perturbations in model experiments) only because of the internal variability that acts at different timescales, from daily to multi-decadal. The atmosphere acts at higher frequencies than the ocean, possibly explaining why Baker et al. (2015) considered only 1-yr experiments to detect differences in atmosphere-only simulations run on different platforms. We decided to choose a lower boundary of 20 years following Hawkins et al. (2016), who investigated the chance of getting a negative trend of the global temperature under a 1% per year increase in CO2. This study suggests that such a negative trend is likely to occur with a probability of 7.8%, 1.2%, and 0.1% when considering 10, 15 and 20 years of data, respectively. Similarly, we consider that ensemble experiments starting from slightly perturbed initial conditions could be different after 10 years and are very likely to be similar after 20 years. In our protocol, the use of ensemble experiments allows an estimation of the climate internal variability that can be compared to any suspected difference. The choice of 20 years is a tradeoff between the boundary suggested by Hawkins et al. (2016) and the (expensive) computational cost of running 5-member experiments over 20 years (which amounts to 100 years, more than the 70 years of Servonnat et al. 2016). Obviously, a longer period would be safer to avoid any difference that would not be detected. But we consider that the use of a large set of variables, similarly to what has been done by Servonnat et al. (2016), is helpful to detect the main (ocean, sea ice and atmosphere) issues of replicability.
Action. We have added in the text at the end of Section 3.2.1: "As mentioned above, the integrations are 20 years long. Such a length is considered as a minimum, following Hawkins et al. (2016), who investigated the chance of getting a negative trend in annual mean global mean temperature under a 1%/yr increase in CO2 concentration. This study suggests that a negative trend is likely to occur with a probability of 7.8%, 1.2%, and 0.1% when considering respectively 10, 15 and 20 years. Similarly, we consider that ensemble experiments starting from slightly perturbed initial conditions could be different after 10 years and are very likely to be similar after 20 years."
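To make the reasoning behind this choice concrete, the following minimal sketch (our own illustration, not part of the manuscript or its supplementary code; the trend and noise values are hypothetical) estimates by Monte Carlo how the probability of obtaining a negative trend in a forced warming series shrinks as the window is lengthened from 10 to 20 years:

    # Monte Carlo sketch: probability of a negative linear trend in a series
    # made of a prescribed warming trend plus white noise, as a function of
    # the window length. Trend (0.02 K/yr) and interannual noise (0.15 K) are
    # hypothetical values chosen only to illustrate the reasoning.
    import numpy as np

    rng = np.random.default_rng(0)
    trend = 0.02          # hypothetical warming trend (K/yr)
    sigma = 0.15          # hypothetical interannual standard deviation (K)
    n_runs = 10000

    for years in (10, 15, 20):
        t = np.arange(years)
        negative = 0
        for _ in range(n_runs):
            series = trend * t + rng.normal(0.0, sigma, size=years)
            slope = np.polyfit(t, series, 1)[0]
            negative += slope < 0.0
        print(f"{years:2d} years: P(negative trend) ~ {negative / n_runs:.3f}")

The exact probabilities depend on the assumed trend-to-noise ratio; the point is only that they decrease quickly with the window length, consistent with the Hawkins et al. (2016) argument.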

Response. Agreed.
Action. We now inter-compare the cases of non-replicability and replicability explicitly in the Section "Results and Discussion".

R3.9. Page 12: readers may wonder why this bug wasn't fixed. Add a sentence similar to the 3rd sentence in Section 5 that emphasizes that the testing framework is a diagnostic tool that alerts the user to potential issues in model code, but does not identify or fix specific problems.
Action. Sentence added.

R3.10. The null hypothesis that is stated on page 13 line 1 should be introduced in section 3 (methods)
Response. Indeed, it was missing in the "Methods" section.

Action.
A sentence has been added to the "Methods" section.

R3.11. Page 16:
The authors should highlight the importance of adopting software development best practices generally, such as compiling and running climate models without optimizations and with debugging flags (e.g., -fpe0), in the second bullet point. In the reviewer's experience, the typical goal of climate model end users is to simply get the model to build and run, due to time constraints and/or lack of knowledge about model software and build systems. Users will run simulations with full optimizations only, and may not be aware of issues with code until they examine the output.
Response. We fully endorse the reviewer's point of view, and agree it should be publicized to the readership of GMD. The fact that there is a world of "users" quite disconnected from the world of "developers" is certainly harmful to both parties, because the tools of the latter are often misused by the former.
Action. We now recognize the above point in the conclusion and encourage model users to be more proactive about the computational aspects of models in order to avoid uninformed use of these models.
R3.12. In the conclusions, reiterate that this paper only demonstrates that EC-Earth 3.1 is non-reproducible, not that EC-Earth 3.2 is.
Action. Done, a bullet point has been added.

R3.13.
The data and code availability section should state clearly that the model codes themselves are not publicly available, and therefore a reviewer or reader cannot independently verify these results. At best they could independently test in a different model to which they may have access.
Response. This is indeed a critical point that was also raised by the editor.

Action.
A sentence has been added to the data and code availability section.

R3.15. page 2, line 8: "accuracy" is probably the wrong word, consider changing to "precision or stability"
Reply. Thank you for the recommendation. We are trying to follow the consensus in the state of the art, where "precision" refers to the number of bits that a variable has (for example, single (32 bits) or double (64 bits) precision) and "accuracy" is used to evaluate how well the climate is simulated, which is closer to the meaning that we are using here. We have included a brief definition to ensure that readers understand our meaning.

Action.
We have added a definition at the beginning of the text.
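As a small illustration of this distinction (added here for clarity; it is not part of the manuscript), precision concerns only the width of the floating-point representation, independently of how accurately the climate is simulated:

    # "Precision" as the number of bits of the representation: the same value
    # stored in single (32-bit) and double (64-bit) floating point.
    import numpy as np

    x64 = np.float64(0.1)                  # double precision (64 bits)
    x32 = np.float32(0.1)                  # single precision (32 bits)
    print(f"float64: {x64:.17f}")
    print(f"float32: {float(x32):.17f}")
    print("representation difference:", abs(float(x32) - float(x64)))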

R3.16.
Move parenthetical explanation of floating-point math from page 4, lines 8-9 to page 3, line 15, where floating-point math is first mentioned.

R3.18. Page 10, line 5: "nail down" is used incorrectly; consider changing to "narrow down."
Action. The wording has been changed.

Action. Done
Anonymous Reviewer #4

R4.1. This paper focuses on the EC-Earth model and considers how to demonstrate it is replicable across different HPC platforms. It adopts a statistical approach and demonstrates the use of metrics with a 'convenient' example of a potential bug in their user code.

R4.2.
The paper does not go into great detail on the causes of differences in the model results on different platforms. In practice, this investigation can be a long and tedious process to isolate the cause, either in the user code or sometimes in vendor-supplied code.

R4.3.
The technique may be applicable to other models, though it is difficult to tell, given the complexity of the coupled EC-Earth ESM, what changes would be needed to the authors' methods to achieve reliable results. This paper presents an interesting example on an important issue and publication is recommended following minor revisions.
Reply. We thank the reviewer for a constructive input on our manuscript.

R4.4.
Page 3, Line 15 (and Section 2.1), sentence beginning 'On the other hand, ... representation of numbers / operations can ..'. I find this sentence vague. The IEEE standard followed by vendors specifies how numbers are represented and rounded during arithmetic operations; unless a compiler is allowed to go beyond the standard, operations should not be expected to differ. Likewise, why would a different implementation of the MPI library (MPICH vs. OpenMPI) be expected to produce different results if MPI is simply used to move data between tasks? This paragraph appears to loosely suggest differences can arise simply from making these changes. There is no mention of the standards in place that compilers are expected to follow. Compiler options that violate standards are a user choice and often used, as the authors state, to achieve better performance.
Reply. We agree that the sentence is vague and more details should be included to avoid confusion: details about the standards, how the compilation process for these models works, and which replicability issues can happen with some MPI functions. It is completely true that the different libraries for distributed memory paradigms, such as MPICH or OpenMPI, follow a standard, and that possible differences in optimizations and reproducibility issues come from user decisions. However, Earth System Models like EC-Earth are compiled using an external layer and the compilation is transparent to the scientist, who sets up a configuration file per platform, including different keys that decide how aggressive the optimizations will be, respecting the standard or not. This means that the compilation options, the libraries used or the optimization aggressiveness for different libraries can change between platforms, simply because the configuration file chosen by the user contains different set-ups or because static libraries have been compiled using different options, with the user only linking against them. Additionally, notice that MPI is not only used to move data among subdomains in Earth System Models. Typical MPI arithmetic operations, like reductions, involve computing global values from all processes.
If a specific order is not used for these operations, round-off differences among runs will happen. As correctly pointed out by the reviewer, libraries follow a standard to maintain an order (a binary tree sum order is typical), but more aggressive optimizations set up by the user (through the configuration file or by hand) could remove the restrictive order, and the more aggressive algorithm could differ between libraries.
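To illustrate why the combination order matters (a minimal sketch of our own, not EC-Earth or MPI library code), the result of a global sum such as an MPI reduction changes at the round-off level when the partial contributions are added in a different order, because floating-point addition is not associative:

    # Sketch: a "global sum" over partial contributions (one per process)
    # gives slightly different results depending on the order in which the
    # contributions are combined, because floating-point addition is not
    # associative.
    import numpy as np

    rng = np.random.default_rng(42)
    # hypothetical per-process partial sums spanning several orders of magnitude
    partials = rng.normal(0.0, 1.0, 1024) * np.logspace(-8, 8, 1024)

    ordered = np.sum(partials)                      # fixed, prescribed order
    shuffled = np.sum(rng.permutation(partials))    # order of arrival differs
    print("ordered  sum:", repr(ordered))
    print("shuffled sum:", repr(shuffled))
    print("difference  :", ordered - shuffled)      # typically non-zero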
Action. All this information has been included in the paper. In particular, we have added a bullet point in the conclusions noting that non-replicability is often a consequence of users violating vendor standards (consciously or not) to achieve better performance.

R4.5. Page 3, Lines 16 & 29. Issues in replicability/reproducibility can also arise from the operating system libraries in use, separate from the compiler. For example, optimized vendor-supplied versions of the BLAS/LAPACK libraries, often used in ESMs, can give rise to differences compared to other implementations of these libraries. I suggest the authors reword to say 'compiler environment' rather than simply 'compiler' or 'compiler setup' wherever used.

Action. The wording has been changed.

R4.6. Page 4, lines 2-3. Again, this is rather vague. The user/developer has a great deal of control over what the compiler is allowed to do in terms of optimizing arithmetic operations. To say '(or simply, the translation to assembly code...)' is not correct. It is the code reorganisation performed by the compiler optimizations, then translated to assembler, which can be incorrect, either because of user code errors and/or inappropriate compiler options.
Reply. We agree.

Action.
We have extended the sentence to include the reviewer's suggestions and improve the clarity of this part.

R4.7. Section 3.1.2. The IFS model also supports OpenMP parallelization, can the authors clarify if OpenMP was used?
Reply. Although IFS is able to use OpenMP, NEMO is not prepared for it yet. The good practice in the EC-Earth community is to use a homogeneous execution for the complete MPMD application, using only MPI. Our configurations follow this decision. A brief explanation has been added to the relevant section.

R4.8. Section 3.1.4. Is the version of H-TESSEL used part of the IFS CY36, or a different version? Has it been modified from the version supplied with IFS?
Reply. We clarify this in the text.
Action. We have added the sentence: "The land-surface model H-TESSEL is an integral component of the IFS cycle cy36r4. We have not modified it in EC-Earth 3.1 and EC-Earth 3.2." Note that a more in-depth presentation of H-TESSEL has been added, following Anonymous Referee #3's request (see comment R3.5).
R4.9. Section 3.2. Why 20 years? Does this not depend on the choice of parameters studied?
Response. The choice of 20 years is obviously partly subjective, and as also pointed out by Anonymous Referee #3, it is necessary to provide more information about this choice in the manuscript. In climate sciences, it is common to consider a minimum of 20 or 30 years to estimate the mean state of the climate system at the global scale. This is due to the chaotic nature of the climate system, which can reach different states from similar initial conditions (including small perturbations in model experiments) only because of the internal variability that acts at different timescales, from daily to multi-decadal. The atmosphere acts at higher frequencies than the ocean, possibly explaining why Baker et al. (2015) considered only 1-yr experiments to detect differences in atmosphere-only simulations run on different platforms. We decided to choose a lower boundary of 20 years following Hawkins et al. (2016), who investigated the chance of getting a negative trend of the global temperature under a 1% per year increase in CO2. This study suggests that such a negative trend is likely to occur with a probability of 7.8%, 1.2%, and 0.1% when considering 10, 15 and 20 years of data, respectively. Similarly, we consider that ensemble experiments starting from slightly perturbed initial conditions could be different after 10 years and are very likely to be similar after 20 years. In our protocol, the use of ensemble experiments allows an estimation of the climate internal variability that can be compared to any suspected difference. The choice of 20 years is a tradeoff between the boundary suggested by Hawkins et al. (2016) and the (expensive) computational cost of running 5-member experiments over 20 years (which amounts to 100 years, more than the 70 years of Servonnat et al. 2016). Obviously, a longer period would be safer to avoid any difference that would not be detected. But we consider that the use of a large set of variables, similarly to what has been done by Servonnat et al. (2016), is helpful to detect the main (ocean, sea ice and atmosphere) issues of replicability.
Action. We have added in the text at the end of Section 3.2.1: "As mentioned above, the integrations are 20 years long. Such a length is considered as a minimum, following Hawkins et al. (2016), who investigated the chance of getting a negative trend in annual mean global mean temperature under a 1%/yr increase in CO2 concentration. This study suggests that a negative trend is likely to occur with a probability of 7.8%, 1.2%, and 0.1% when considering respectively 10, 15 and 20 years. Similarly, we consider that ensemble experiments starting from slightly perturbed initial conditions could be different after 10 years and are very likely to be similar after 20 years."

R4.10. Section 3.2.2. The IFS model normally outputs GRIB format files, which are a lossy compressed format. Can the authors clarify if they are using output at the precision of the model's arithmetic or some reduced-precision format? This is important if looking for small differences in their results.
Reply. In IFS, we output using the reduced precision of GRIB1, which uses 8 or 16 bits per value, depending on the field. In addition, in NEMO we output netCDF data using 32-bit precision. Our tests prove that we can find differences in the simulated climate using our methodology. However, we think that a different study would be necessary to determine the minimum output precision needed to capture small differences in the simulated data; this is out of the scope of this work.
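For context (a rough sketch of our own, not the actual GRIB1 encoder used by IFS), linear packing of a field into 16 bits maps the values onto 2^16 discrete levels spanning the field range, which bounds the representable detail of the archived output:

    # Sketch of the precision loss implied by simple linear packing of a field
    # into 16 bits (illustration only, not the GRIB1 encoder itself).
    import numpy as np

    rng = np.random.default_rng(1)
    field = 273.15 + 30.0 * rng.random(100000)   # hypothetical SST-like field (K)

    nbits = 16
    vmin, vmax = field.min(), field.max()
    scale = (vmax - vmin) / (2**nbits - 1)       # size of one quantization step
    packed = np.round((field - vmin) / scale).astype(np.uint16)
    unpacked = vmin + packed * scale
    print("max packing error (K):", np.abs(unpacked - field).max())   # about scale / 2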

R4.11. Page 10, Table 2. Several comments: (i) I note that version 3.2 was compiled with the -fp-model strict option, which was not used on version 3.1. I would need to check myself, but it seems likely to me that this would potentially limit the optimizations the compiler is allowed to perform at -O2. I am curious if the authors think this might be significant for their apparent bug in the river runoff code? (ii) The MareNostrum experiment of 3.2 uses -fp-model precise rather than -fp-model strict. Is this a typo or was it different to the CCA experiment? If so, why? (iii) Can the authors confirm these compiler options were applied to all the code? It is not uncommon to see different compiler flags on selected routines, or compiler directives in the code itself.
Reply. The reviewer is right. Compilation using fp:strict limits some of the O2 optimizations and is more limiting (from the computational performance point of view) than fp:precise. We suspected that fp:precise or fp:strict could help to avoid the bug problem with EC-Earth 3.1. However, the results (along with other tests) proved that floating-point controls did not solve the problem; these tests are not included in the paper to avoid extending the document. On the other hand, the experiments for EC-Earth 3.2 use fp:precise and fp:strict, respectively, to prove that the new methodology can be used (as an example) to evaluate the same application for two different compilation flag sets and platforms. Please check the answer to R1.4 for more details, where we show that we can obtain a similar climate across two platforms and compilation flag options, and that the fp:strict option is 4% slower in the final execution time, proving that (for this configuration) fp:strict is unneeded compared to fp:precise. All these details have been included in the paper, along with a brief explanation about fp:precise and fp:strict. We will extend these results in the future (a future work section has been included for this purpose). Regarding the last comment, the reviewer points out something very important and usually forgotten during the compilation of complex models. In our case, we ensured during our compilation process that all external libraries used the same compilation flags; we have added this information to the model description part.
Action. We have added the above information to the paper, also in line with the answer to R1.4 above.
Page 12, line 5. It is disappointing that the authors have submitted this paper without completing their investigation into the cause of the discrepancy. If, in the time taken for the paper reviews, the authors are certain the Fortran array referred to is the problem, this text should be amended. However, if there has not been any further investigation, I would prefer not to see (educated) guesswork on the cause of the problem in the published paper, and suggest removing the sentence, as it may turn out to be incorrect. The authors note that version 3.2 did not show the same behaviour.

Does this mean that the code they suspect was different between the two model versions? Can the authors clarify in the text whether the offending code was different between the versions?
Reply. We concur with the reviewer that the statement should be supported by more diagnostics; otherwise it might sound loose and unjustified. To this end, we plotted (see figure below) the sea surface salinity (SSS) differences between the experiments m06e and e011, i.e. the two experiments that produce statistically distinguishable results with EC-Earth 3.1. The figure allows following the evolution of SSS differences 1, 5, 10 and 15 days after the initialization from identical ocean and sea ice files. Interestingly, the Southern Ocean is already the scene of differences not seen elsewhere at day 1. The m06e experiment simulates surface waters that are fresher in elongated bands (suspiciously aligned with the domain's grid) along the eastern Weddell coast and along the Ross ice shelf, but saltier in the rest of the Southern Ocean. This pattern persists through day 15. The lower sea ice areas in m06e than in e011 (Fig. 4 of the original manuscript, also reproduced below for ease of interpretation) are physically consistent with the SSS differences: the additional surface salt in m06e contributes to a weaker vertical oceanic stratification, which could eventually trigger deep convection and lead to systematic differences in winter sea ice coverage. At this stage, it is not clear why m06e has larger SSS already at day 1 away from the coast. Indeed, a salinity anomaly generated at the coast could not propagate equatorward at this speed.

Figure: Daily mean differences in sea surface salinity (SSS) between ensemble means of experiments m06e and e011 (each has 5 members) on days 1 (January 1st, 1850), 5, 10 and 15 after initialization. Note the non-uniform color scale.
Figure: Reproduction of Fig. 4 of the submitted manuscript. Time series of ensemble mean Antarctic September sea ice extent in the two experiments e011 (orange) and m06e (green). The dashed lines indicate the ensemble minimum and maximum (i.e., the range). Sea ice extent is calculated as the sum of areas of ocean pixels containing more than 15% of sea ice.

Introduction
Numerical models of the climate system are essential tools in climate research. These models are the primary source of information for understanding the functioning of the Earth's climate, for attributing observed changes to specific drivers, and for the development of mitigation and adaptation policies, among others (IPCC, 2013). Over the years, these models have become more and more complex. Today's Earth System Models (ESMs) consist of several components of the climate system (ocean, atmosphere, cryosphere, biosphere) coupled together, and often feature several million lines of code. Due to their high computational requirements, ESMs usually run on high-performance computing (HPC) facilities, or supercomputers. As such, climate science now fully entails an important computational component. Climate scientists (developers, users, and now computer scientists) typically face three questions regarding the computational aspects of ESMs:
1. How can better performance be achieved for a given model configuration?
2. How accurate is the model solution?
3. Are the results reproducible under changes in hardware or software?
These three aspects (performance, accuracy and reproducibility) usually conflict and cannot all be achieved at the same time (Corden and Kreitzer, 2015). For example, better performance (1) can be achieved by means of compiler optimization, by allowing the compiler to reorder floating-point operations or even eliminate exceptions (overflow, division by zero). However, this may be a source of non-reproducibility (3) and cause less accuracy in the solution (2). Likewise, achieving bit-for-bit reproducibility (3) is possible, but it implies keeping full control of the flow of operations, which dramatically slows down the execution time of the code (1).
From the three questions listed above, we are primarily interested in the third one because it has received relatively little attention so far from the climate community. While the reproducibility of results should be a natural requirement in science (Berg, 2018), it proves particularly challenging in the context of Earth System Modeling. Users of ESMs generally have good (or even full) control over which model source code they use, but they do not always have such a level of control on several constraints external to the model code itself, such as the type of compiler, the versions of software used, compiler options, operating system libraries, the maximum number of processors available and the architecture of the cluster itself (among others). It is suspected, however, that such factors might also influence the results produced by the model (see Section 2). Therefore, the meaning of reproducibility for climate sciences, how to test it and whether ESMs provide replicable results are all legitimate questions to ask.
Before we progress with further considerations, it is important to make the distinction between three concepts that are often used interchangeably but in fact represent different notions: repeatability, replicability and reproducibility. Following the Association for Computing Machinery (ACM), we qualify a result as repeatable if it can be obtained twice within stated precision by the same experimenter and in the same experimental conditions; we qualify a result as replicable if it can be obtained twice within stated precision by different experimenters but in the same experimental conditions; finally, we qualify a result as reproducible if it can be obtained twice, within stated precision, by different experimenters in different experimental conditions (https://www.acm.org/publications/policies/artifact-review-badging; Plesser, 2018; McArthur, 2019).
The present study is concerned with the question of replicability of an Earth System Model. That is, it seeks to answer the following question: for the same model, forcings and initial conditions (experimental setup), how do results depend on the hardware/software constraints, which vary from one experimenter to the next? In particular, we wish to reach three goals with this study:
- We aim at establishing a protocol for detecting the possible non-replicability of an ESM under changes in the HPC environment (compiler environment, distribution of processors, compilation options, flags, ...);
- We aim at reporting examples of non-replicability that highlight the need to run ESMs in full awareness of the possible existence of bugs in their code;
- We aim at alerting the climate community about the underestimated role of hardware or software errors in the final model solution.
This article first reviews the existing literature on the topic. Then, we introduce EC-Earth, the ESM used in this work, as well as the protocol for checking its replicability across multiple environments. Finally, we report instances of replicability and non-replicability in EC-Earth before formulating recommendations for potential users of climate models.
We now review in more detail the reasons behind the non-replicability of ESMs and the literature published on the topic so far.
Another source of non-replicability is the non-deterministic nature of parallel applications. When global collective communications are used, all the resources working in parallel have to send and receive some data. These data, which are collected by a master process, may arrive in random order (due to delays in message passing between processes) if the user sets up a more aggressive approach, consciously or not, through a configuration file. When data are processed following the order of arrival, the results can end up being non-deterministic because of round-off errors produced by the different order in which the final result is collected. (We recall that the commutative and associative properties of mathematics do not hold in finite-precision arithmetic.) There are several techniques offered as standards to avoid round-off errors during parallel computation, but this implies, in some way, degrading the computational performance of the execution. This happens, for example, when requiring the collective communications to be sequenced in a prescribed order. Other techniques can be used to reduce the impact of maintaining a particular order of operations, such as a binary tree process to calculate the collective communications, avoiding a single sequential order but still depending on the load balance of the parallel execution to achieve peak performance.
Some of these options can be implemented into the code by the developer; others are inserted directly by the compiler or by the library used for the parallel execution. Again, the configuration depends on the compilation setup chosen and can differ from one HPC environment to another.
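As an illustration of the fixed-order techniques mentioned above (a sketch under our own assumptions, not the implementation used by any particular MPI library), a binary-tree reduction always pairs partial results in the same prescribed way, so the global sum no longer depends on the order in which contributions arrive:

    # Deterministic binary-tree reduction: partial results are always paired
    # in the same prescribed way, so the global sum is reproducible
    # regardless of the order in which contributions are processed.
    def tree_sum(values):
        """Sum a list of floats with a fixed binary-tree combination order."""
        vals = list(values)
        while len(vals) > 1:
            paired = []
            for i in range(0, len(vals) - 1, 2):
                paired.append(vals[i] + vals[i + 1])   # always combine neighbours
            if len(vals) % 2:                          # carry an odd element over
                paired.append(vals[-1])
            vals = paired
        return vals[0]

    # Same data, same pairing, hence the same result on every call; a plain
    # left-to-right sum of the same data may give a different value.
    data = [1e16, 1.0, -1e16, 1.0]
    print(tree_sum(data))   # fixed-order result
    print(sum(data))        # left-to-right result; differs for this data

The price of such determinism is, as the paragraph above notes, some loss of computational performance, since contributions can no longer simply be combined in whatever order they arrive.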

State of the art
Rosinski and Williamson (1997) were the first to raise the concern of replicability in a global atmospheric model, and formulated several criteria to validate the porting of such models from one computing environment to another. Recognizing already that bit-for-bit replicability would be impossible to meet, they proposed that the long-term statistics of the model solution in the new computing environment should match those in a trusted environment. Subsequent studies tested the sensitivity of results to domain decomposition, changes in compiler environment, or usage of different libraries. They all came to the same conclusion that changes in behavior induced by hardware or software differences were not negligible compared to other sources of error, such as uncertainty in model parameters or initial conditions. These conclusions were found to hold from weather (Thomas et al., 2002) to seasonal (Hong et al., 2013) and even climate change (Knight et al., 2007) time scales.
Arguably, the most comprehensive and complete study on the topic is that from Baker et al. (2015). Recognizing that the atmosphere exhibits coherency across variables and across space, they proposed a protocol to identify possible non-replicability in standard atmospheric fields, accounting for the strong covariance that may exist between these fields. While useful, the Baker et al. (2015) study addresses only short (1-yr) time scales and is only concerned with atmospheric variables. It is important to acknowledge that variations in hardware or software can potentially impact slower components of the climate system, that the time of emergence of the differences may exceed one year, and that long runs might be needed to disentangle internal climate variability from a true signal. As an example, Servonnat et al. (2013) investigated the replicability of the IPSL-CM5A-LR climate model across several HPC environments. They found that for dynamical variables like surface pressure or precipitation, at least ∼70 years would be needed to ensure that one given signal lies within the bounds of the reference signal.

Earth System Model
EC-Earth is a state-of-the-art ESM developed by the EC-Earth consortium, counting close to 20 European institutions (Hazeleger et al., 2011). EC-Earth is a community model developed in a collaborative and decentralized way. EC-Earth consists of coupled component models for atmosphere, ocean, land and sea ice, as described hereunder.
In this study, two versions of the EC-Earth ESM are used. The first one, denoted EC-Earth 3.1 hereafter, is the "interim" version that was developed between the fifth and sixth stages of the Coupled Model Intercomparison Project (CMIP5 and CMIP6). The second one, denoted EC-Earth 3.2, is the "near-CMIP6" version that was used during the two years preceding the official release of EC-Earth for CMIP6.

Code information and revisions used
The EC-Earth source codes used for this study were managed through the Subversion (SVN) version control system. For EC-Earth 3.1, revision r1722 of the code (the EC-Earth 3.1 official release) was used. For EC-Earth 3.2, revision r3906 of the code was used.

Atmosphere component
The atmosphere component of EC-Earth is the Integrated Forecasting System (IFS) of the European Centre for Medium-Range Weather Forecasts (ECMWF), cycle 36r4. With respect to the model version used for CMIP5 (Hazeleger et al., 2011), the main updates and improvements in EC-Earth 3.1 include an improved radiation scheme (Morcrette et al., 2008) and a new cloud microphysics scheme (Forbes et al., 2011) in the atmosphere.

Ocean and sea ice components
The ocean component of EC-Earth 3.1 is version 3.3.1 of NEMO (Gurvan et al., 2017). NEMO uses the so-called ORCA1 configuration, which consists of a tripolar grid with poles over northern North America, Siberia and Antarctica, at a resolution of about 1°. Higher resolution, by roughly a factor 3, is applied close to the equator in order to better resolve tropical instability waves. 46 vertical levels are used, and the vertical grid thickness ranges between 6 m and 250 m. The effects of the subgrid-scale processes (mainly the mesoscale eddies) are represented by an isopycnal mixing/advection parameterization as proposed by Gent and McWilliams (1990), while the vertical mixing is parameterized according to a local turbulent kinetic energy (TKE) closure scheme (Blanke and Delecluse, 1993). A bottom boundary layer scheme, similar to that of Beckmann and Döscher (1997), is used to improve the representation of dense water spreading. The ocean component is coupled to the Louvain-la-Neuve sea Ice Model, version 3 (LIM3; Vancoppenolle et al., 2009), which is a dynamic-thermodynamic model explicitly accounting for subgrid-scale variations in ice thickness. However, in EC-Earth 3.1, only one ice thickness category was used due to numerical instabilities when the default configuration was used with five thickness categories.
EC-Earth 3.2 uses the version 3.6 of the NEMO model and an updated version of the LIM3 model, which this time runs with five ice thickness categories.The ocean grid is identical except that it has 75 vertical levels.
NEMO is adapted to HPC using the distributed memory paradigm with the MPI standard. Similar to IFS, it uses domain decomposition to distribute the workload among MPI processes.

Coupling
The atmosphere and ocean-sea ice components of EC-Earth are coupled with the Ocean Atmosphere Sea Ice Soil coupler version 3 (OASIS3; Valcke, 2012). OASIS allows exchanging different fields among components (such as IFS or NEMO) during the execution of EC-Earth. The coupling process involves the transformation of the fields from the source grid to the target grid (including interpolation and conservative operations when needed) and the explicit communication among components using MPI communications. OASIS is able to work using MPI to exchange fields between the source and target grids. For EC-Earth 3.1, OASIS3 is used as an independent application, while with EC-Earth 3.2, OASIS3-MCT is called using library functions (thus not requiring dedicated processors).

Generation of simulations
The five members of the integrations conducted on environments A and B always start from unique and identical atmospheric and sea ice restarts. These restarts are obtained from a long equilibrium simulation conducted at the Italian National Research Council (CNR) (Paolo Davini, personal communication, http://sansone.to.isac.cnr.it/ecearth/init/year1850_tome/15010101/).
An ocean restart was also obtained from this equilibrium simulation, and five random but deterministic perturbations were added to the sea surface temperature of this restart at all grid points (Gaussian perturbation, standard deviation: 10⁻⁴ K). The introduction of these tiny perturbations allows ensemble spread to develop in integrations A and B and to eventually sample the model's own internal climate variability. Note that, by the deterministic nature of the perturbations, pairs of members always start from the same triplet of atmospheric, oceanic and sea ice restarts: the first member of integration A and the first member of integration B start from identical initial conditions, and so for the second member, the third one, etc.
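A minimal sketch of how such deterministic perturbations can be generated (our own illustration with a hypothetical grid shape and seed convention; this is not the actual EC-Earth pre-processing script) is given below; using a fixed seed per member ensures that the perturbed restarts can be regenerated identically on any platform:

    # Deterministic Gaussian perturbations of an SST restart field, one seed
    # per ensemble member (illustration only).
    import numpy as np

    nlat, nlon = 292, 362                   # hypothetical ORCA1-like grid size
    sst = np.full((nlat, nlon), 285.0)      # placeholder for the SST restart (K)

    def perturb_sst(sst, member, std=1e-4):
        """Add a reproducible Gaussian perturbation (std in K) for one member."""
        rng = np.random.default_rng(member)  # deterministic: seed = member index
        return sst + rng.normal(0.0, std, sst.shape)

    members = [perturb_sst(sst, m) for m in range(1, 6)]   # five perturbed restarts
    print(members[0][0, :3] - sst[0, :3])                  # perturbations of order 1e-4 K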

Calculation of standard indices
Due to the large amount of output produced by each simulation, it is impossible in practice to compare exhaustively all aspects of integrations A and B to one another. Therefore, the outputs from integrations A and B are first post-processed in an identical way. The code used to post-process the outputs is available (see "Code and data availability") and is based on the list of standard metrics proposed by Reichler and Kim (2008). We record for each integration standard ocean, atmosphere and sea ice output variables: 3-D air temperature, humidity and components of the wind; 2-D total precipitation, mean sea-level pressure, air surface temperature, wind stress and surface thermal radiation; 2-D sea surface temperature and salinity, and sea ice concentration. These fields are averaged monthly (240 time steps over 20 years) and the grand time mean is also saved (1 time step over 20 years).
A sensible option would then be to compare spatial averages of the aforementioned variables from the integrations conducted on A and B. However, by definition, spatial averages hide regional differences, and one simulation could be deemed replicable with respect to another despite non-replicable differences at the regional scale. To address this point, we rather choose to first compare the fields from each integration at the grid-point level to common reference datasets (those used in Reichler and Kim (2008)), and then to compare the resulting metrics from A and B together in order to possibly detect an incompatibility between the two integrations. For each field, a grid-cell area weighted average of the model departure from the corresponding reference is evaluated and then normalized by the variance of that field in the reference data set. Thus, for each field, one metric (a positive scalar number) is retained that describes the mismatch between that field in the integration and the reference field. Five such metrics are available for each integration for each field, since each integration uses five ensemble members.
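The following sketch paraphrases this per-field metric with synthetic fields (it is our own illustration, not the post-processing code distributed with the paper, and it assumes a squared departure as in Reichler and Kim (2008)):

    # One scalar per field: area-weighted mean of the squared model departure
    # from a reference field, normalized by the (area-weighted) variance of
    # that field in the reference data set. Fields below are synthetic.
    import numpy as np

    def field_metric(model, reference, lat):
        """Positive scalar summarizing the model-reference mismatch for one field."""
        weights = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(model)  # area proxy
        weighted_error = np.average((model - reference) ** 2, weights=weights)
        ref_mean = np.average(reference, weights=weights)
        ref_variance = np.average((reference - ref_mean) ** 2, weights=weights)
        return weighted_error / ref_variance

    # Synthetic example: a reference field and a model field with a uniform bias.
    lat = np.linspace(-89.5, 89.5, 180)
    reference = 288.0 + 30.0 * np.cos(np.deg2rad(lat))[:, None] * np.ones((180, 360))
    model = reference + 0.5                    # hypothetical uniform 0.5 K bias
    print(field_metric(model, reference, lat))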
We stress that the goal of this approach is not to evaluate the quality of the model, but rather to come up with a set of scalars characterizing the distance between the model output and a reference. Therefore, the intrinsic quality of the reference data sets does not matter for our question. As a matter of fact, the datasets used in Reichler and Kim (2008) and in our protocol correspond to observations affected by historical climate forcings, whereas the model output is generated assuming constant pre-industrial forcing. That is, the metrics resulting from the comparison cannot be used in a meaningful way to characterize model quality whatsoever.

Results and Discussion
We first ran two integrations of EC-Earth 3.1 under the ECMWF-CCA and MareNostrum3 computing environments, respectively. Results revealed that for four of the variables considered (out of 13, i.e. about 30%), an incompatibility was detected (Fig. 3). Since the probability of making a Type-I error is set to 5%, the incompatibility might not be explainable by chance only, though we should recognize that the variables considered display covariances that make the thirteen tests not fully independent. Differences in metrics for sea ice concentration and sea surface temperature appear very large, hinting that more investigation should be devoted to the model's behavior at high latitudes.
The analysis was then repeated with the newer version of the model, EC-Earth 3.2 (experiments a0gi and a0go on ECMWF-CCA and MareNostrum3, respectively). In that case, we found no instance of incompatibility for any of the 13 variables considered (Fig. 7). A spatial analysis (Fig. 8) suggests that only 1% of the grid points display an incompatibility for 2-m air temperature. We recall that under the null hypothesis of no difference, significant differences are expected to occur 5% of the time. The magnitude of the regional differences in Fig. 8 illustrates the amplitude of internal climate variability. [...] On the contrary, fp:strict increases the execution time of the experiments by 3.6% on ECMWF-CCA and by 3.9% on MareNostrum3. These figures also inherently convey the information that such an assessment is made difficult by the background noise of internal climate variability.

What can explain these different outcomes? Our protocol, like others, cannot inform on the source of non-replicability, but can inform on whether there may be one (Baker et al., 2015), so in-depth analyses that go beyond the scope of this study would be necessary to trace the origin of the non-replicability with version 3.1. However, we suspect that the presence of a bug, present in EC-Earth 3.1 but no longer in EC-Earth 3.2, could explain this result. In fact, we were never able to run the EC-Earth 3.1 model with the "-fpe0" flag activated during compilation (but could well run the model if this flag was disabled). This flag allows stopping the execution when floating-point exceptions, such as division by zero, are encountered. Our guess is that by disabling this flag, the model still encounters bugs (probably linked to the array initialization mentioned in Sec. 4), but these bugs give different outcomes depending on the computing environment. This worrying error, obtained by porting a climate model from one HPC environment to another, highlights the necessity of choosing adequate compilation options when developing a model, without giving way to the temptation of excluding safe compilation options to bypass a compilation error on a new HPC. Such a result also highlights the fact that porting a code from one HPC to another might be an opportunity to detect errors in model codes.
We finally formulate a set of practical recommendations, gained during the realization of this work:
- The default assumption should be that ESMs are not replicable under changes in computing environments. Climate scientists often assume that a model code will give identical climates regardless of where this code is executed. Our experience indicates that the picture is more complicated, and that codes (especially when they are bugged, as they often inevitably are) interact with computing environments in sometimes unpredictable ways. Thus, it is safer to always assume that a model code will give different results from one computing environment to another, and to try to prove the opposite, i.e., that the model executed in the two computing environments gives results that cannot be deemed incompatible. Our protocol fulfills this goal.
Smith RS, Gregory JM, Stainforth DA. Irreducible uncertainty in near-term climate projections. Clim Dyn. 2016;46(11):3807-19.

R3.7. The details about the Monte Carlo simulation in the 2nd sentence of the Figure 1 caption should be moved to paragraph 2 in Section 3.2.3.

Action. Done.

R3.8. The authors do not explicitly compare Figs. 2 and 5 in the text. The authors should likewise explicitly compare Figs. 3 and 6 in the text.

Figure 2. Power of the statistical test used. Probability that a two-sample Kolmogorov-Smirnov test (KS test) returns a p-value below a prescribed significance level of 5%, for two 5-member normal samples with equal standard deviations and means separated by 0, 0.1, 0.2, ..., 4.0 standard deviations (x-axis). The probabilities were estimated using 1000 Monte Carlo runs for each effect size (see text for details).
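The power curve in Fig. 2 can be estimated with a short Monte Carlo procedure; the sketch below is our own illustration of the computation described in the caption (it uses scipy's two-sample KS test and is not the analysis code of the paper):

    # Power of the two-sample KS test for 5-member samples: for each effect
    # size, draw two Gaussian samples whose means differ by that many standard
    # deviations, and count how often the test rejects at the 5% level.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    n_members, n_runs, alpha = 5, 1000, 0.05

    for effect in (0.0, 1.0, 2.0, 3.0, 4.0):   # mean separation in standard deviations
        rejections = 0
        for _ in range(n_runs):
            a = rng.normal(0.0, 1.0, n_members)
            b = rng.normal(effect, 1.0, n_members)
            if ks_2samp(a, b).pvalue < alpha:
                rejections += 1
        print(f"effect size {effect:.1f} sd: power ~ {rejections / n_runs:.2f}")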

Figure 6. Time series of ensemble mean Antarctic September sea ice extent in the two experiments e011 (orange) and m06e (green). The dashed lines indicate the ensemble minimum and maximum (i.e., the range). Sea ice extent is calculated as the sum of areas of ocean pixels containing more than 15% of sea ice.

Figure 7. Same as Fig. 3 but for the pair of experiments carried out with EC-Earth 3.2, namely a0gi and a0go.

Figure 8. Same as Fig. 5 but for the pair of experiments carried out with EC-Earth 3.2, namely a0gi and a0go.

Table 2. The four experiments considered in this study.