Machine dependence and reproducibility for coupled climate simulations: The HadGEM3-GC3.1 CMIP Preindustrial simulation

When the same weather or climate simulation is run on different High Performance Computing (HPC) platforms, model outputs may not be identical for a given initial condition. While the role of HPC platforms in delivering better climate projections is discussed to some extent in the literature, attention is mainly focused on scalability and performance rather than on the impact of machine-dependent processes on the numerical solution. Here we investigate the behaviour of the Preindustrial (PI) simulation prepared by the UK Met Office for the forthcoming CMIP6 under different computing environments. Discrepancies between the means of key climate variables were analysed at different timescales, from decadal to centennial. We found that, for the two simulations to be statistically indistinguishable, a 200-year averaging period must be used for the analysis of the results. Thus, constant-forcing climate simulations using the HadGEM3-GC3.1 model are reproducible on different HPC platforms provided that a long-enough simulation is used. In regions where ENSO teleconnection patterns were detected, we found large sea surface temperature and sea ice concentration differences on centennial timescales. This indicates that a 100-year constant-forcing simulation may not be long enough to adequately capture the internal variability of the HadGEM3-GC3.1 model, despite this being the minimum simulation length recommended by CMIP6 protocols.

The context in which the term "machine dependence" is used in the paper is explained in section 2, where we explain possible reasons that can cause the solution in the two cases to evolve differently. We conclude section 2 (p4, line 1) by clearly stating that differences are triggered by machine-dependent processes (compiler, optimization, etc.) but ultimately exist because of the chaotic nature of the system (internal variability). Thus, the differences between the two runs analysed throughout the paper are not a measure of machine dependence over and above the internal variability (the type of machine dependence Ref. n1 refers to), but a measure of how the internal variability responds to the different computing environments in the two cases (see also response to Ref. n2, point 1).
The purpose of our work is to find out for how long we should run a simulation (or analyse its results) on the ARCHER HPC platform to capture/sample the same climate variability exhibited by the reference simulation on the Met Office supercomputer (as we know the two runs do not bit-compare).
The Introduction has been substantially changed and the paragraph the reviewer refers to has been removed. Please refer to the revised manuscript.

5) "Page 6, lines 13-20: I got really confused by this paragraph. You say "using a chronological order in the strictest sense is meaningless because every 10 years segment is equally representative of the pre-industrial climate variability". And the last two sentences of the paragraph (lines 19-20) contradict this statement."
We agree with the reviewer that this was written in a confusing way; to address this, lines 19-20 have been removed.
6) "Section 3.2: the SNR measure you use to estimate if the mean of PIMO is different from the mean of PIAR should come with a test to determine more objectively when the difference becomes significant.For instance, if you compute your SNR on n years, you could sample 100 couples of n random years (bootstrap) in the same simulation to check the distribution of the SNR when computed between two (random) periods of the same simulation.This would give you an estimate of the influence of internal variability on your SNR (but would also need a longer simulation for this)… An alternative would be the use of the Student t-test on the difference of mean, and the Fisher F-test on the difference of variance." Alongside computing SNR, PI_MO -PI_AR differences were tested using a 2-tailed Welch's T-test (where our H0 is mean_MO = mean_AR).
In the paper, only results based on SNR are shown. That is because, while most of the time the t-test and the SNR-based analysis gave us the same answer, in a few instances we had discordant results (see Figure 1). Overall, the behaviour detected when comparing the two methods was: the t-test indicating significant differences despite SNR < 1. Bearing in mind that what a t-test really signifies is that, if H0 were true, the probability of obtaining the observed mean_MO - mean_AR value in repeated experiments would be less than 0.05, we decided to keep using SNR as, in our view, this quantity has greater physical meaning and can be interpreted more readily. Additionally, by doing so we chose the more conservative method of the two (i.e. the one that points to significant differences less frequently).
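For concreteness, the following minimal Python sketch (using synthetic stand-in series rather than the actual model output) shows how the two measures are computed side by side; scipy's ttest_ind with equal_var=False implements Welch's test.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
pi_mo = rng.normal(14.0, 0.1, size=50)  # stand-in 50-year series (MO)
pi_ar = rng.normal(14.0, 0.1, size=50)  # stand-in 50-year series (ARCHER)

# Two-tailed Welch's t-test: H0 is mean_MO = mean_AR.
t_stat, p_value = ttest_ind(pi_mo, pi_ar, equal_var=False)

# SNR as in manuscript eq. (2): mean difference over (std of differences / sqrt(2)).
diff = pi_mo - pi_ar
snr = abs(diff.mean()) / (diff.std(ddof=1) / np.sqrt(2))

# The two criteria can disagree: p < 0.05 while SNR < 1.
print(f"p-value = {p_value:.3f}, SNR = {snr:.2f}")
```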
Figure 1: a) SNR computed as in manuscript eq. (2) for (PI_MO - PI_AR) SST differences (over a 50-year period): the maximum SNR is 0.9, indicating that there are differences but that SNR is still below the threshold value of 1. b) Two-tailed Welch's t-test on the same data: where SNR ~ 0.7-0.9, the p-value is less than alpha. According to this, the differences should be considered significant.
7) "… as long as those differences are lower than one we should not care about them because they do not show any significant difference (according to your definition of the SNR).And then you talk about SNR values of 0.9, 0.7, that are supposed to be close to 1. How close from one are those results?" We address this comment by modifying the manuscript so that we analyse results in regions only where SNR ≥ 1. See for example p6 line 24, p9 line6, p9 line 29.

Responses to Referee n2
General Comments

1) "Part of the problem I think is conceptual… Subsequent time-steps will simply repeat the exercise (i.e. perturbing the initial conditions for the next time step by machine precision). I do not see how this will produce anything fundamentally different from a standard initial condition ensemble."

A rather detailed explanation of how the machine may affect the numerical solution is given in section 2. However, it is clear that the previous version did not fully communicate which questions our analysis and manuscript address (see below).
Internal variability is usually assessed in climate simulations via two methods: ensemble members for varying-forcing simulations, and long centennial runs for constant-forcing simulations. In the first case, the question is "how many" ensemble members are needed to correctly sample the climate variability; in the second case, the question is "how long". While the obvious answers are "the more the better" and "the longer the better", one may ask what is the minimum number of ensemble members, or the minimum simulation length, that would guarantee an acceptable result.
Our work is motivated by this context: we want to know for how long we should run a simulation (or analyse its results) on the ARCHER HPC platform to capture the same magnitude of climate variability exhibited by the reference simulation on the Met Office supercomputer (see also responses to Ref. n1, point 1), as we know that the two runs do not bit-compare.
We have now made this clear in the manuscript. Page 2, lines 8-17: "In this paper, we investigate the behaviour of the UK CMIP6 Preindustrial (PI) control simulation with the HadGEM3-GC3.1 model on two different High Performance Computing (HPC) platforms. We first study whether the two versions of the PI simulation show significant differences in their long-term statistics. This answers our first question of whether the HadGEM3-GC3.1 model gives different results on different HPC platforms.
Machine-dependent processes can influence the model internal variability by causing it to be sampled differently on the two platforms (i.e. similarly to what happens to ensemble members initiated from different initial conditions). Therefore, our second objective is to quantify discrepancies between the two simulations at different time-scales (from decadal to centennial) in order to identify an averaging period/simulation length for which the two simulations return the same internal variability.
Note that the PI control simulation is a constant-forcing simulation. Therefore, no ensemble members are required for such an experiment because, provided that the simulation is long enough, it will return a picture of the natural variability." and page 8, lines 19-24: "The large differences observed on time-scales shorter than 200 years are a direct consequence of machine-dependent processes (compiler, machine architecture, etc., see sections 2 and 3.1 for details), but ultimately exist because of the chaotic nature of the system. The two simulations behave similarly to ensemble members initiated from different initial conditions. Therefore, they exhibit different phases of the same internal variability, but over longer time-scales the differences converge to zero." Please see also responses to Ref. n1, point 2 for details about the model porting and the tests performed on IC ensembles, and responses to Ref. n1, point 6 for details on differences on a single machine.
2) "Therefore the question to be asked of the two simulations discussed here is whether the simulations are distinguishable from an IC ensemble on a single machine, not whether they diverge at all".We added a section to the manuscript; see page 9, starting at line 15 (and Figure 11e and 11f).
In this section, we now show the signal-to-noise ratio computed by taking the differences between two 100-year periods of each simulation (note that each 100-year period can be considered as an ensemble member run on the same machine). This shows that differences between the two machines are comparable to differences among ensemble members run on a single machine. This confirms that the differences we observe between PI_MO and PI_AR, although triggered by the different computing environments, are largely dominated by the internal variability. This also highlights that 100 years is too short a length for constant-forcing simulations, whether on the same or on a different machine.
3) "However, the results presented here demonstrate that the climatology of the two simulations is the same -given a long enough averaging period the simulations are indistinguishable.This is a good result, however, it is not the conclusion that the authors come to." The first result presented in the paper (in the revised manuscript page 7, line 5-7) shows that on multicentennial time-scales the differences are not significant and the long-term statistics of the two runs are similar.We have now given more strength to this result.Page 8, line 15-19: "In summary, although large differences can be observed at smaller time-scales (see next section for further discussion), the climate of PIMO and PIAR is indistinguishable on the 200-year time-scale.We thus conclude that simulations using the HadGEM3-GC3.1 model are reproducible on different HPC platforms, provided that a long-enough simulation length is used.Our results also provide a strong indication that HadGEM3-GC3.1 does not suffer from code/compiler bugs that would make the model behave differently on different machines " 4) "Given this, the analysis in sections 3 and 4 are of little interest." Section 4 answers our question of how long we should run a copy-simulation on the ARCHER platform.
In this section, we show that only when using a 200-year averaging period do we capture the same internal variability in both simulations. This is the main result of our paper, which implies not only that running a constant-forcing simulation for less than 100 years may potentially lead to different outcomes, but also that running it for longer may not be necessary (this applies, however, only to those variables considered in the paper).
The additional analysis of the ENSO signal is meant as a demonstration of which physical process can cause such large differences. However, we agree with the reviewers that it is not surprising that a low-frequency process like ENSO is the one still showing differences on a 100-year timescale. This remark was added at page 9, lines 13-14.
As both reviewers agree that this section is of lesser interest, we have shortened section 4.2 in the revised manuscript.
Finally, we believe that the reference to the CMIP project is appropriate. CMIP is much more than the Historical and Scenario simulations (for which ensembles are requested). Many individual MIPs run constant-forcing simulations with varying lengths depending on time availability, computational resources and length requirements. The CMIP6 minimum run-length requirement for many of the Model Intercomparison Projects (MIPs) is 100 years. Our results suggest that 100 years may not be long enough to correctly capture the internal variability of the HadGEM3-GC3.1 model.

 Specific Comments
Please refer to the revised manuscript, as Introduction and Conclusions have substantially changed since the previous version.
The impact that a different computing environment can have on otherwise identical numerical simulations appears to be little known among climate model users and model data analysts. In fact, the subject is rarely addressed in a way that helps the community understand the magnitude of the problem, or develop practical guidelines that take account of the issue.
To the extent of our knowledge, only a few authors have discussed the existence of machine-dependence uncertainty and highlighted the importance of bit-for-bit numerical reproducibility in the context of climate model simulations. Song et al. (2012) and Hong et al. (2013) investigated the uncertainty due to round-off error in climate simulations. Liu et al. (2015b) and Liu et al. (2015a) discussed the importance of bitwise-identical reproducibility in climate models.

In this paper, we investigate the behaviour of the UK CMIP6 Preindustrial (PI) control simulation with the HadGEM3-GC3.1 model on two different High Performance Computing (HPC) platforms. We first study whether the two versions of the PI simulation show significant differences in their long-term statistics. This answers our first question of whether the HadGEM3-GC3.1 model gives different results on different HPC platforms.

Machine-dependent processes can influence the model internal variability by causing it to be sampled differently on the two platforms (i.e. similarly to what happens to ensemble members initiated from different initial conditions). Therefore, our second objective is to quantify discrepancies between the two simulations at different time-scales (from decadal to centennial) in order to identify an averaging period/simulation length for which the two simulations return the same internal variability.
Note that the PI control simulation is a constant-forcing simulation. Therefore, no ensemble members are required for such an experiment because, provided that the simulation is long enough, it will return a picture of the natural variability.
The remainder of the paper is organized as follows. In section 2, mechanisms by which the computing environment can influence the numerical solution of chaotic dynamical systems are reviewed and discussed. In section 3, the numerical simulations are presented and the methodology used for the data analysis is described. In section 4, the simulation results are presented and discussed. In section 5, the main conclusions of the present study are summarized.
2 The impact of machine dependence on the numerical solution

In this section, possible known ways in which machine-dependent processes can influence the numerical solution of chaotic dynamical systems are reviewed and discussed.
Different compiling options, degrees of code optimization and basic library functions all have the potential to affect the reproducibility of model results across different HPC platforms, and on the same platform under different computing environments. Here we provide a few examples of machine-dependent numerical solutions using the 3D Lorenz model (Lorenz, 1963), which is a simplified model for convection in deterministic flows. The Lorenz model consists of the following three differential equations:

$$\frac{dx}{dt} = \alpha (y - x), \qquad \frac{dy}{dt} = x(\gamma - z) - y, \qquad \frac{dz}{dt} = xy - \beta z, \tag{1}$$

where the parameters α = 10, γ = 28 and β = 8/3 were chosen to allow the generation of flow instabilities and obtain chaotic solutions (Lorenz, 1963). The model was initialized with (x_0, y_0, z_0) ≡ (1, 1, 1) and numerically integrated with a 4th-order Runge-Kutta scheme using a time step of 0.01. The Lorenz model was run on two HPC platforms, namely the UK Met Office Supercomputer (hereinafter simply "MO") and ARCHER.
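To make the setup concrete, here is a minimal Python sketch (not the FORTRAN code used in the actual experiments) that integrates (1) with the same scheme, time step and initial condition, and mimics a machine-dependent difference with an explicit round-off-sized perturbation:

```python
import numpy as np

def lorenz(s, alpha=10.0, gamma=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system (1)."""
    x, y, z = s
    return np.array([alpha * (y - x), x * (gamma - z) - y, x * y - beta * z])

def rk4_step(f, s, dt):
    """One 4th-order Runge-Kutta step."""
    k1 = f(s)
    k2 = f(s + 0.5 * dt * k1)
    k3 = f(s + 0.5 * dt * k2)
    k4 = f(s + dt * k3)
    return s + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

dt, n_steps = 0.01, 5000
a = np.array([1.0, 1.0, 1.0])        # (x0, y0, z0)
b = a + np.array([1e-15, 0.0, 0.0])  # round-off-sized perturbation
for _ in range(n_steps):
    a = rk4_step(lorenz, a, dt)
    b = rk4_step(lorenz, b, dt)
print(a - b)  # after ~50 time units the trajectories differ at O(1)
```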
To demonstrate first the implications of switching between different computing environments, the Lorenz model was run on the ARCHER platform using: (i) two different FORTRAN compilers (cce8.5.8 and intel17.0), see Figures 1a and 1b; (ii) the same FORTRAN compiler (cce8.5.8) but different degrees of floating-point optimization (-hfp0 and -hfp3), see Figures 1c and 1d; (iii) the same FORTRAN compiler and compiling options, but with the x-component in (1) perturbed by adding a noise term obtained using the random_number and random_seed intrinsic FORTRAN functions. In particular, the seed of the random number generator was set to 1 and 3 in two separate experiments, see Figures 1e and 1f.
Finally, to illustrate the role of using different HPC platforms, the Lorenz model was run on the ARCHER and MO platforms using the same compiler (intel17.0) and identical compiling options (i.e. level of code optimization, floating-point precision, vectorization) (Figures 1g and 1h).
The divergence of the solutions in Figures 1a and 1b can likely be explained by the different 'computation order' of the two compilers (i.e. the order in which the same arithmetic expression is evaluated). In Figures 1c and 1d, solutions differ because of the round-off error introduced by the different precision of the floating-point computation. In Figures 1e and 1f, the different seed used to generate random numbers caused the system to be perturbed differently in the two cases. While this conclusion is straightforward, it is worth mentioning that the use of random numbers is widespread in weather and climate modelling. Random number generators are widely used in physics parametrizations for initialization and perturbation purposes (e.g. clouds, radiation and turbulence parametrizations) and, of course, in stochastic parametrizations. The processes by which initial seeds are selected within the model code are thus crucial to ensure numerical reproducibility. Furthermore, different compilers may have different default seeds.
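A two-line illustration of the 'computation order' effect (shown here in Python, though the phenomenon is language-independent): floating-point addition is not associative, so a compiler that reorders an expression can change the result at round-off level.

```python
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0: b + c rounds to -1e16, so the 1.0 is lost
```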
As for Figures 1g and 1h, this is probably the most relevant result for the present paper. It highlights the influence of the HPC platform (and of its hardware specifications) on the final numerical solution. In Figures 1g and 1h the two solutions diverge in time similarly to Figures 1a-1d; however, identifying the reasons for the observed differences is not straightforward. While we speculate that the reasons may be down to machine architecture and/or chip-set, further investigations were not pursued as this would be beyond the scope of this study.
The three mechanisms discussed above were selected because they are illustrative of the problem and easily testable via a simple model such as the Lorenz model. However, there are a number of additional software and hardware specifications that can influence numerical reproducibility, and that only emerge when more complex codes, like weather and climate models, are run.
These include the number of processors and the processor decomposition, the communications software (i.e. MPI libraries) and threading (i.e. OpenMP libraries).

Of the possible mechanisms discussed in section 2, the ARCHER and MO simulations were likely affected by differences in compiler, processor type, number of processors and processor decomposition (alongside the different machine).
Note that the porting of the HadGEM3-GC3.1 model from the Met Office computing platform to the ARCHER platform was tested by running 50 ensemble members (each 24 hours long) on both platforms (this was done by the UK Met Office and NCAS-CMS teams). Each ensemble member was created by adding a random bit-level perturbation to a set of selected variables (x- and y-components of the wind, air potential temperature, specific humidity, long-wave radiation, etc.). Variables from each set of ensembles were then tested using a Kolmogorov-Smirnov test to determine whether they can be assumed to be drawn from the same distribution. These tests did not reveal any significant problem with the porting of the HadGEM3-GC3.1 model (personal communication). However, this method cannot detect code bugs, which may cause the same model to behave differently on different machines. The centennial simulations presented in this paper will help determine whether or not such code bugs exist for the HadGEM3-GC3.1 model.
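As an illustration of this verification procedure, the sketch below (Python, with synthetic stand-in data; the actual tests were performed by the UK Met Office and NCAS-CMS on real ensemble output) applies a two-sample Kolmogorov-Smirnov test to one diagnostic from two 50-member ensembles:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Stand-in arrays: one diagnostic (e.g. area-mean air temperature after
# 24 hours) for each of the 50 perturbed members on each platform.
members_mo = rng.normal(285.0, 0.02, size=50)
members_ar = rng.normal(285.0, 0.02, size=50)

# Can the two sets of members be assumed to come from the same distribution?
stat, p_value = ks_2samp(members_mo, members_ar)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Distributions differ: possible problem with the porting.")
else:
    print("No evidence of a porting problem for this variable.")
```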

3.2 Data post-processing and analysis
During the analysis of the results, the following climate variables were considered: sea surface temperature (SST), sea ice area/concentration (SIA/SIC), 1.5 m air temperature (SAT), the outgoing long-wave and short-wave radiation fluxes at the top of the atmosphere (LW TOA and SW TOA), and the precipitation flux (P). These variables were selected as representative of the ocean and atmosphere domains and because they are commonly used to evaluate the status of the climate system.
Discrepancies between the means of the selected variables were analysed at different timescales, from decadal to centennial.
To compute 10-, 30-, 50- and 100-year means, the (PI_MO - PI_AR) 200-year time-series were divided into 20, 6, 4 and 2 segments respectively. Spatial maps were created by simply averaging each segment over time. Additionally, to create the scatter plots presented in section 4.1, the time average was combined with an area-weighted spatial average. Except for SIC, all the variables were averaged globally. Additionally, SIC, SST and SAT were regionally averaged over the Northern and Southern Hemispheres, while SW TOA, LW TOA and P were regionally averaged over the tropics, Northern extra-tropics and Southern extra-tropics, according to the underlying physical processes.
Note that, when calculating (PI_MO - PI_AR) differences, PI_MO and PI_AR segments are subtracted in chronological order. Thus, for example, the first 10 years of PI_AR are subtracted from the first 10 years of PI_MO, and so on. In fact, because the PI control simulation is run with a constant climate forcing, using a 'chronological order' in the strictest sense is meaningless, as every 10-year segment is equally representative of the pre-industrial decadal variability. We acknowledge that an alternative, equally valid approach would be to subtract PI_AR and PI_MO segments without a prescribed order.
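The segmentation can be summarized with a short sketch (Python, operating on a stand-in difference series; segment counts as in the text, so for 30-year windows only the first 180 years divide evenly):

```python
import numpy as np

# Stand-in (PI_MO - PI_AR) annual-mean difference series, 200 years long.
diff = np.random.default_rng(1).normal(size=200)

for window, n_segments in [(10, 20), (30, 6), (50, 4), (100, 2)]:
    usable = window * n_segments  # chronological, non-overlapping segments
    segment_means = diff[:usable].reshape(n_segments, window).mean(axis=1)
    print(f"{window}-year means:", np.round(segment_means, 2))
```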
Discrepancies in the results between the two runs were quantified by computing the Signal-to-Noise Ratio (SNR) for each considered variable at each timescale. The signal is represented by the mean of the differences between PI_MO and PI_AR (μ_MO-AR) and the noise is represented by the standard deviation of (PI_MO - PI_AR) (σ_MO-AR) divided by √2 (see below for details). Thus, SNR is defined as:

$$\mathrm{SNR} = \frac{\left|\mu_{MO-AR}\right|}{\sigma_{MO-AR}/\sqrt{2}} \tag{2}$$

When SNR < 1, (PI_MO - PI_AR) differences can be interpreted as fluctuations of the system not necessarily linked to the different computing environment (i.e. PI_MO and PI_AR do not differ more than PI_MO (or PI_AR) evaluated at two different points in time). When SNR > 1, the observed differences are outside of the expected range of variability and PI_MO and PI_AR are considered to be different. Note that (2) makes use of two mathematical assumptions: PI_MO and PI_AR are uncorrelated (i.e. their covariance is zero) and have the same variance. Under these assumptions, the noise can be represented as the standard deviation of the differences between PI_MO and PI_AR divided by √2.

For the final step of the analysis, the El Niño-Southern Oscillation (ENSO) signal was computed for the ARCHER and MO simulations. We used the NINO3.4 index, with a 3-month running mean, defined as follows:

$$\mathrm{NINO3.4} = \left\langle SST_{mnth} - SST_{30yr} \right\rangle_{\text{Niño 3.4}} \tag{3}$$

where SST_mnth is the monthly sea surface temperature, SST_30yr is the climatological mean of the first 30 years of simulation used to compute the anomalies, and ⟨·⟩ denotes the area average over the Niño 3.4 region (5°N-5°S, 170°W-120°W).
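For completeness, the √2 factor follows directly from the two stated assumptions (zero covariance and a common variance σ²):

$$\mathrm{Var}(PI_{MO} - PI_{AR}) = \mathrm{Var}(PI_{MO}) + \mathrm{Var}(PI_{AR}) - 2\,\mathrm{Cov}(PI_{MO}, PI_{AR}) = 2\sigma^2,$$

so the variability of a single run is recovered as σ = σ_MO-AR/√2. Below is a minimal Python sketch of eq. (2), assuming the absolute value of the mean difference is taken:

```python
import numpy as np

def snr(pi_mo, pi_ar):
    """Signal-to-noise ratio as in eq. (2), for two aligned series or maps."""
    diff = np.asarray(pi_mo) - np.asarray(pi_ar)
    signal = abs(diff.mean())              # mu_{MO-AR}
    noise = diff.std(ddof=1) / np.sqrt(2)  # sigma_{MO-AR} / sqrt(2)
    return signal / noise
```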

4.1 Multiple Timescales
The long-term means of the selected variables, and the associated SNR, are shown in Figures 2 and 3. All the variables exhibit SNR < 1, indicating that on multi-centennial timescales the differences observed between the two simulations fall within the expected range of variability of the PI control run.
When maps like the ones in Figures 2 and 3 are computed using 10-, 30-, 50- and 100-year averaging periods (not shown), the magnitude of the anomalies increases and (PI_MO - PI_AR) differences become significant (SNR » 1). This behaviour is discussed below. For all the considered variables, PI_MO and PI_AR start diverging quickly after the first few time-steps, once the system has lost memory of the initial conditions. See section 2 (Figure 1) for further discussion on how machine-dependent processes can influence the temporal evolution of the system.
SST, SAT, SW TOA and LW TOA differ the most in the Northern Hemisphere (particularly on decadal timescales) (yellow diamonds in Figures 4d, 6d, 7d and 8d), while SIA anomalies are particularly high in the Southern Hemisphere (red crosses in Figure 5d) and P anomalies in the tropics (green circles in Figure 9d). Overall, discrepancies are largest at decadal timescales, where the spread between the two simulations can reach |0.2| °C in global mean air temperature (Figure 6d), |1.2| million km² in Southern Hemisphere sea ice area (Figure 5d), or |1| W/m² in global TOA outgoing LW flux (Figure 8d).
As the timescale increases, (PI_MO - PI_AR) differences get smaller and approach zero when a 200-year timescale is considered. This happens because 200 years is a long enough averaging period for the positive and negative extremes in the time-series of Figures 4-9 to average out. On shorter time-intervals, strong increasing/decreasing trends in one simulation may not be compensated by trends of opposite sign in the other simulation and may result in SNR » 1; see for example the first 10 years of the NH SIA time-series in Figure 5a. Additionally, the 200-year mean of the SIA seasonal cycle shown in Figure 5c is almost identical for ARCHER and MO, confirming that on a 200-year timescale the two runs are comparable. This suggests that the overall physical behaviour of the model has not been affected by the porting.
Our results suggest that 100 years may not be enough to allow HadGEM3-GC3.1 to sample the same climate variability on different HPC platforms. This is particularly evident when we look at the spatial patterns of (PI_MO - PI_AR) differences and the associated SNR.
In Figure 11, (PI_MO - PI_AR) differences materialize into spatial patterns that are signatures of physical processes. SST (Figure 11a, b) and SIC (Figure 11c, d) anomalies are largest in West Antarctica, where ENSO teleconnection patterns are expected; they correspond to regions where SNR becomes equal to or larger than one. This suggests that (PI_MO - PI_AR) differences are driven by two different ENSO regimes (the connection between SIC (and SST) anomalies in the Southern Hemisphere and ENSO has been widely documented in the literature, e.g. Kwok and Comiso (2002), Liu et al. (2002), Turner (2004), Welhouse et al. (2016), Pope et al. (2017)).
This hypothesis is confirmed by the ENSO signal in Figure 12. A few times, a strong El Niño (La Niña) event in PI_MO corresponds to a strong La Niña (El Niño) event in PI_AR. This opposite behaviour enlarges SIC (and SST) differences between the two runs and strengthens the μ_MO-AR signal, resulting in a strong SNR.
As ENSO provides a medium-frequency modulation of the climate system, it is not surprising that it takes longer than 100 years for its variability to be fully represented.

Finally, we want to know whether the two ENSO regimes in PI_MO and PI_AR are a reflection of the different computing environment or solely the result of natural variability (i.e. whether a similar behaviour can be detected for simulations run on the same machine). This can be done by splitting the 200-year simulations into two segments and assuming that each 100-year period of PI_MO and PI_AR is a member of an ensemble of size two. Therefore, the ARCHER ensemble comprises PI_AR 1st and PI_AR 2nd, and the MO ensemble comprises PI_MO 1st and PI_MO 2nd.

Figures 11e and 11f show the signal-to-noise ratio corresponding to SST differences between PI_AR 1st and PI_AR 2nd, and between PI_MO 1st and PI_MO 2nd. In Figure 11e, the SNR pattern exhibited by the ARCHER ensemble members resembles the one shown by the (PI_MO - PI_AR) differences in Figure 11b. Thus, we conclude that differences between ARCHER and MO are comparable to differences between ensemble members run on a single machine.
As for PI_MO, in Figure 11f large differences (and SNR > 1) between the two ensemble members are found in East Antarctica. While this suggests that in this case a climate process other than ENSO is at work, the large SNR confirms that 100 years is too short a length for constant-forcing HadGEM3-GC3.1 simulations even on the same machine.
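A sketch of this ensemble-of-size-two check, again in Python with stand-in series (the real analysis uses SST maps):

```python
import numpy as np

def snr(a, b):
    """SNR as in eq. (2)."""
    d = np.asarray(a) - np.asarray(b)
    return abs(d.mean()) / (d.std(ddof=1) / np.sqrt(2))

rng = np.random.default_rng(2)
sst_mo = rng.normal(14.0, 0.1, 200)  # stand-in 200-year PI_MO series
sst_ar = rng.normal(14.0, 0.1, 200)  # stand-in 200-year PI_AR series

print(snr(sst_mo[:100], sst_mo[100:]))  # within-machine: MO ensemble
print(snr(sst_ar[:100], sst_ar[100:]))  # within-machine: ARCHER ensemble
print(snr(sst_mo[:100], sst_ar[:100]))  # across machines
```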
In summary, the analysis above confirms that (PI_MO - PI_AR) differences, while triggered by the computing environment, are largely dominated by the internal variability, as they persist among ensemble members on the same machine (in Figure 11, SNR > 1 always).

5 Discussion and Conclusions

In this paper, the effects of different computing environments on the reproducibility of coupled climate model simulations are discussed. Two versions of the UK CMIP6 PI control simulation, one run on the UK Met Office supercomputer (MO) (PI_MO) and the other run on the ARCHER HPC platform (PI_AR), were used to investigate the impact of machine-dependent processes on the N96ORCA1 HadGEM3-GC3.1 model.
Discrepancies between the means of key climate variables (SST, SIA/SIC, SAT, SW TOA, LW TOA and P) were analysed at different timescales, from decadal to centennial (see section 3.2 for details on methodology).
Although the two versions of the same PI control simulation do not bit-compare, we found that the long-term statistics of the two runs are similar and that, on multi-centennial timescales, the considered variables show a signal-to-noise ratio (SNR) of less than one. We conclude that, in order for PI_MO and PI_AR to be statistically indistinguishable, a 200-year averaging period must be used for the analysis of the results. This indicates that simulations using the HadGEM3-GC3.1 model are reproducible on different HPC platforms, provided that a long-enough simulation length is used.
Additionally, the relationship between global mean differences and timescale exhibits a 2/3 power-law behaviour, regardless of the physical quantity considered, that approaches a plateau near the 200-year timescale. This suggests a consistent time-dependent scaling of (PI_MO - PI_AR) differences across the whole climate simulation.
Larger inconsistencies between the two runs were found at shorter timescales (where SNR ≥ 1), being largest at decadal timescales. For example, when a 10-year averaging period is used, discrepancies between the runs can reach |0.2| °C in global mean air temperature anomalies, or |1.2| million km² in Southern Hemisphere sea ice area anomalies. The observed differences are a direct consequence of the different sampling of the internal variability when the same climate simulation is run on different machines. They become approximately zero when a 200-year averaging period is used, confirming that the overall physical behaviour of the model was not affected by the different computing environment.
On a 100-year timescale, large SST and SIC differences (with SNR ≥ 1) were found where ENSO teleconnection patterns are expected. Medium-frequency climate processes like ENSO need longer than 100 years to be fully represented. Thus, a 100-year constant-forcing simulation may not be long enough to correctly capture the internal variability of the HadGEM3-GC3.1 model (on the same, or on a different, machine). While this result is not per se unexpected, it is relevant to CMIP6 experiments, as CMIP6 protocols recommend a minimum simulation length of 100 years (or less) for many of the MIP experiments.
This result has immediate implications for those members of the UK CMIP6 community who will run individual MIP experiments on the ARCHER HPC platform and compare results against the reference PI simulation run on the MO platform by the UK Met Office. The magnitude of the (PI_MO - PI_AR) differences presented in this paper should be regarded as threshold values below which differences between ARCHER and MO simulations must be interpreted with caution.
In the light of our results, our recommendation to the UK MIPs studying the climate response to different forcings is to run HadGEM3-GC3.1 for at least 200 years, even when the CMIP6 minimum requirement is 100 years (see for example the PMIP protocols).
Finally, although the quantitative analysis presented in this paper applies strictly to HadGEM3-GC3.1 constant-forcing climate simulations only, this study has the broader purpose of raising the climate modelling community's awareness of the machine dependence of climate simulations.



Figure 1. Attractor (left-hand side) and time-series of the x-component (right-hand side) of the 3D Lorenz model for simulations run on ARCHER using: the cce8.3.4 and intel17.0 compilers (a, b); the same compiler but different levels of floating-point optimization (c, d); the same compiler and compiling options but a different seed for the random number generator (e, f). g and h show the Lorenz attractor and the x-component time-series for the Lorenz model run on MO and ARCHER using the same compiler and compiling options.

Figure 4. Annual-mean time-series of global SST (a), Northern Hemisphere SST (b) and Southern Hemisphere SST (c) for PI_MO (grey line) and PI_AR (dashed line). d shows how SST differences vary as a function of the timescale.

Figure 5. Annual-mean time-series of Northern Hemisphere SIA (a) and Southern Hemisphere SIA (b) for PI_MO (grey line) and PI_AR (dashed line). The 200-year mean of the NH and SH SIA seasonal cycle is shown in c. d shows how SIA differences vary as a function of the timescale.

Figure 6. As in Figure 4, but for SAT.

Figure 7. Annual-mean time-series of SW TOA in the tropics (a), the Northern extra-tropics (b) and the Southern extra-tropics (c) for PI_MO (grey line) and PI_AR (dashed line). d shows how SW TOA differences vary as a function of the timescale.

Figure 8. As in Figure 4, but for LW TOA.

Figure 9. As in Figure 4, but for P.

Table 1. Hardware and software specifications of the ARCHER and MO HPC platforms as used to run the HadGEM3-GC3.1 model.

Table 2. 200-year global mean and standard deviation for SST, SIA, SAT, SW TOA, LW TOA and P.

Figures 4 to 9 show annual-mean time-series of spatially averaged SST, SIA, SAT, SW TOA, LW TOA and P, respectively. Figures 4d to 9d show (PI_MO - PI_AR) differences as a function of the averaging timescale for each variable (see section 3.2 for details on the computation of the means). The 200-year global mean and standard deviation of each variable are shown in Table 2.