A test procedure is proposed for identifying numerically significant solution changes in evolution equations used in atmospheric models. The test issues a “fail” signal when any code modifications or computing environment changes lead to solution differences that exceed the known time step sensitivity of the reference model. Using the Community Atmosphere Model (CAM) version 5.3, we provide initial evidence that the proposed procedure can distinguish rounding-level solution changes from the impacts of compiler optimization or parameter perturbation, which are known to cause substantial differences in the simulated climate. The test is not exhaustive, since it does not detect issues associated with diagnostic calculations that do not feed back to the model state variables. Nevertheless, it provides a practical and objective way to assess the significance of solution changes. The short simulation length implies low computational cost. The independence between ensemble members allows for parallel execution of all simulations, thus facilitating fast turnaround. The new method is simple to implement since it does not require any code modifications. We expect that the same methodology can be used for any geophysical model to which the concept of time step convergence is applicable.

The Community Atmosphere Model

Since the CAM is a climate model, one possibility could be to require that the long-term statistics
of the atmospheric motions be representative of the climate simulated
by the old code in the old environment

Given that the purpose of regression testing is to ensure that
the model results stay the same,
rather than to provide a descriptive characterization of the simulated physical phenomena,
it would be useful to have additional test methods that can give early
warnings of unexpected model behavior using computationally inexpensive simulations.
The perturbation growth test (hereafter PERGRO)
based on the work of

The PERGRO test involved comparing one test simulation and two trusted
simulations over the course of 2 model days.
Solution differences were quantified by the spatial root mean square differences (RMSDs)
in the temperature field at each time step.
The differences between the two trusted simulations were triggered by
random temperature perturbations of the order of

Condition 1: during the first few time steps, differences between the original and ported code solutions should be within 1 to 2 orders of magnitude of machine rounding.

Condition 2: during the first few days, growth of the difference between the original and ported code solutions should not exceed the growth of the initial perturbation.

Condition 0: during the first few time steps, rounding-level initial perturbations introduced to the original code in the original environment should not trigger solution differences larger than 1 to 2 orders of magnitude of machine rounding.
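The conditions above can be expressed as simple checks on RMSD time series. The following Python sketch is only an illustration under stated assumptions: the function names, the `field_scale` and `tolerance` parameters, and the use of double-precision machine epsilon as the rounding level are hypothetical choices made here, not part of the PERGRO procedure as it was implemented for CAM.

```python
import sys


def within_rounding_level(rmsd_series, n_steps, field_scale=1.0, orders=2):
    """Conditions 0 and 1 (sketch): during the first n_steps time steps,
    the RMSD stays within `orders` orders of magnitude of machine rounding.

    Assumption: the rounding level is double-precision machine epsilon
    multiplied by a typical magnitude of the field (`field_scale`).
    """
    limit = (10.0 ** orders) * sys.float_info.epsilon * field_scale
    return all(value <= limit for value in rmsd_series[:n_steps])


def growth_not_exceeding(ported_series, perturbed_series, tolerance=1.0):
    """Condition 2 (sketch): at every sampled time, the difference between
    the original and ported code does not exceed the difference caused by
    the initial perturbation (optionally scaled by `tolerance`).
    """
    if len(ported_series) != len(perturbed_series):
        raise ValueError("time series must have the same length")
    return all(p <= tolerance * q
               for p, q in zip(ported_series, perturbed_series))


# Hypothetical RMSD values, for illustration only.
perturbed = [2.0e-14, 5.0e-13, 1.0e-11]   # trusted code, perturbed initial state
ported = [1.0e-14, 3.0e-13, 8.0e-12]      # ported code vs. original code
print(within_rounding_level(ported, n_steps=1))
print(growth_not_exceeding(ported, perturbed))
```

In this form, condition 0 is the same check as condition 1 applied to the perturbed trusted simulation instead of the ported code.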

When the PERGRO test was originally developed, the physical parameterizations
were quite simple, the code was able to satisfy condition 0, and the test
method was robust. As the model became more comprehensive and complex, more
rapid growth of rounding-level initial perturbation was observed. Compromises
were made to preserve some utility for the test. For example, in CAM4, the
test needed to be performed in an aqua-planet configuration, i.e., without
the land surface parameterizations, and with a few (small) pieces of code in
the atmospheric physics parameterizations switched off or revised, because
those pieces were known to be very sensitive to small perturbations. If those
pieces of code were not switched off or revised, perturbations on the
trusted machine would grow so rapidly that the RMSD would reach

Examples of the evolution of root mean square (rms) temperature difference
(unit:

First, the default time step of 1800 s in CAM5 is sizable compared to the
characteristic timescales of many physical processes represented by the
model; therefore, the increments in the model state during one time step (i.e., the
process tendencies times the model time step) are significant, and the
differences between a pair of simulations with slightly different initial
conditions can also be perceptible. The red and purple curves in
Fig.

The second reason for rapid perturbation growth is related to the fact that
the radiation parameterization in CAM5 uses a pseudo-random number generator,
and the seeds for the generator are chosen from the less significant digits
of the pressure field. This effectively introduces state-dependent noise into
the numerical solution. The green curve in Fig.

The third reason for rapid perturbation growth has to do with
particular pieces of code. Two types of examples
were discussed in

The examples shown in Fig.

In this section, we start with a clarification of the purpose and scope of
the new test method (Sect.

As stated earlier, the topic of this paper is regression testing
under circumstances when results from an atmospheric GCM are
no longer BFB reproducible. In other words, the testing discussed here
aims at substantiating whether results from an atmospheric GCM stay the same
after supposedly minor code modifications or computing-environment changes.
By “minor code modifications” we mean code refactoring,
optimization of the computational efficiency, or any other
code changes that might alter the sequence of
computation but still solve the same set of equations using
the same mathematical algorithms.
Computing environment changes refer to any changes in the
hardware or software configuration in which the model code is compiled and executed.
Two factors need to be considered when designing a method for regression testing:
(i) the physical quantities that represent the outcome of a simulation, and
(ii) a criterion for declaring two simulations as “the same”.
In the present paper,
we consider the outcome of a simulation unchanged if the numerical solution
is found to have the same time stepping error relative to
a reference solution obtained with a previously verified code and computing environment.
The details are explained later in Sect.

From the perspective that a GCM is a suite of algorithms solving a large set of differential, integral, and algebraic equations, the physical quantities (model variables) calculated by the code can be sorted into three categories:

Type I: prognostic and diagnostic variables, whose equations are coupled to one another
such that any change in variable

Type II: prognostic variables that are influenced by type I variables but do not feed back to
them. An example could be passive tracers carried by the model to investigate
atmospheric transport characteristics

Type III: diagnostic quantities that are calculated to facilitate the evaluation of a simulation
but do not feed back to type I or type II variables. Examples include the daily maximum 2 m temperature,
the total ice–liquid conversion rate in the cloud microphysics parameterization (which is
calculated merely for output in CAM5),
and any variable specific to the COSP simulator package

We take the standpoint that the essential characteristics of the simulated atmospheric phenomena are determined and represented by type I variables. If instantaneous grid-point values are monitored, any significant solution change should be detectable through a single type I variable, by definition of that variable type, as long as the simulations are long enough for the impact to propagate and evolve into a discernible signal in the monitored variable. On the other hand, since we are taking a deterministic perspective here, the simulations need to be sufficiently short that the solution differences are not yet dominated by chaotic growth.

Based on the reasoning above, the test diagnostics of our new method
are calculated from a small set of prognostic variables of type I.
The use of multiple variables
is meant to help increase the sensitivity of the test
(i.e., decrease the chance of failing to detect a significant solution change),
since bugs or issues associated with a specific piece of code
might take longer to cause discernible solution differences in
one variable than in another.
In Sects.

Given the continuously growing complexity of modern atmospheric GCMs
and the need for large groups of model developers and users to perform
regression testing routinely (e.g., on a daily basis),
it is desirable to have test procedures with the following
features:

objective,

easy to perform and automate,

requiring no or minimal code modifications,

exercising the entire model in its “operational” configuration,

also applicable to a subset of the code and thus useful for debugging,

capable of detecting changes in both global and regional features of the simulations,

insensitive to round-off differences associated with changes in the order of accumulations or associative operations,

computationally efficient.
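The round-off differences mentioned in the list above arise because floating-point addition is not associative, so changing the order of an accumulation can change the result even though the mathematics is unchanged. The following Python sketch illustrates this with double-precision numbers; the function name and the sample values are chosen here purely for illustration and are not taken from the model code.

```python
def sum_in_order(values):
    """Accumulate the values left to right in double precision."""
    total = 0.0
    for value in values:
        total += value
    return total


# The same four numbers, accumulated in two different orders.
original_order = [1.0e16, 1.0, -1.0e16, 1.0]
reordered = [1.0e16, -1.0e16, 1.0, 1.0]

# In the first order, the small term added to 1.0e16 is lost to rounding;
# in the second order, the large terms cancel before the small ones are added.
print(sum_in_order(original_order), sum_in_order(reordered))
```

A regression test that satisfies this feature should assign a “pass” to differences of this kind, which is why a bit-for-bit comparison is not a suitable criterion once the order of operations changes.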

The CAM-ECT of

The PERGRO test of

The new test proposed in this paper aims at satisfying all eight
features listed above. It keeps the deterministic spirit
of PERGRO to achieve an early detection of
solution differences, thus saving computational time.
Ensemble simulations are conducted to take into account
the internal variability of the atmospheric motions.
The test design was inspired by the results of

In Fig.

Convergence diagram showing the root mean square (rms) solution differences
calculated using the instantaneous 3-D temperature field after 1

To demonstrate this point, Fig.

Figure

In the study of

We also note that in the earlier study of

In this section we first give a brief overview of the CAM5 model
(Sect.

The global climate model used in this paper is CAM5.3

In the present paper, we use the FC5 component set of the model, meaning that the model is configured to run with interactive atmosphere and land, prescribed climatological sea surface temperature and sea ice cover, and with the anthropogenic aerosol and precursor emissions specified using values representative of the year 2000.

The basic idea of the TSC test is to perform control and test simulations with a 2 s time step, calculate their RMSDs with respect to reference simulations conducted with the control model with a 1 s time step, then determine whether the RMSDs of the control and test simulations are substantially different.
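As a rough illustration of the comparison described above, the following Python sketch computes the RMSD of a control and a test solution with respect to a reference solution and returns their difference. The function names (`rmsd`, `delta_rmsd`), the flattened-list data layout, and the optional `weights` argument are assumptions made for this illustration; they are not taken from the CAM code or from the test implementation used in this paper. The criterion for deciding whether the two RMSDs are substantially different is deliberately left out.

```python
import math


def rmsd(field, reference, weights=None):
    """Weighted root mean square difference between two flattened fields.

    Assumption: `weights` stands in for whatever grid-cell weighting the
    model grid requires; equal weights are used when it is omitted.
    """
    if len(field) != len(reference):
        raise ValueError("fields must have the same number of grid points")
    if weights is None:
        weights = [1.0] * len(field)
    weighted_sq = sum(w * (a - b) ** 2
                      for w, a, b in zip(weights, field, reference))
    return math.sqrt(weighted_sq / sum(weights))


def delta_rmsd(control, test, reference, weights=None):
    """RMSD of the test solution minus RMSD of the control solution,
    both measured against the same reference solution."""
    return rmsd(test, reference, weights) - rmsd(control, reference, weights)


# Hypothetical temperature values (K) at three grid points.
reference = [280.0, 285.0, 290.0]   # control model, 1 s time step
control = [280.1, 285.0, 289.9]     # control model, 2 s time step
test = [280.1, 285.2, 289.9]        # modified model, 2 s time step
print(delta_rmsd(control, test, reference))
```

A value near zero indicates that the test solution is about as close to the reference as the control solution is, while a clearly positive value indicates that the test solution has moved away from the reference by more than the time stepping error.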

For a generic prognostic variable

Time step size affects the numerical solution at every time step and every grid point,
while certain atmospheric processes might occur in isolated regions, thus impacting
only a limited number of grid points during very short simulations.
Consequently, subtle but systematic solution changes can be masked by the model's
time stepping error and can be difficult to detect.
To help address this challenge, we calculate RMSDs
for

As for the physical quantities, the results shown in the present paper
include RMSD of

The test procedure includes three steps. Steps 1 and 2 are needed every time a new baseline model with different solution characteristics is established. Between such baseline releases, only step 3 is needed for the testing of a new code version or computing environment.

Create an

Obtain an

Repeat Step 2 with a modified code or in a different computing environment.
Compute the RMSDs with respect to the reference solutions created in Step 1, and
denote the results at model time

In the present paper we use a one-sided

CAM5 simulations conducted to evaluate the effectiveness of the TSC
method. Simulations in group ENV used the same code but different computers,
compiler versions, or optimization levels. Group MOD includes code
modifications following

In case the test and control
simulations only contain insignificant differences,

For all the simulations presented in this paper, the initial conditions were sampled from the first year (after 6 months of spin-up) of a previously conducted 5-year simulation. The decision to use the first year was arbitrary. In our experience, climate simulations of 1–5 years are frequently carried out during model development or evaluation, making such initial conditions easy to obtain. The two features we had in mind when choosing the initial conditions were that (i) they contain reasonably spun-up values for the model state variables (e.g., not all zeros or spatially constant values for the hydrometeors or aerosol concentrations), and (ii) they represent synoptic weather patterns in different seasons. The initial conditions do not need to represent well-balanced states in the quasi-equilibrium phase of a multi-year climate simulation. In fact, the default model time step of 1800 s was used when creating the initial conditions for this study, while the control and test simulations in TSC used a 1 or 2 s time step; therefore, the model state was certainly not well-balanced during those TSC simulations. Also note that while model states from different seasons were used for initialization, all ensemble members started on 1 January 00:00 UTC to simplify the simulation and post-processing workflow, which also led to initial imbalances. Such imbalances are considered harmless since the purpose of the numerical integration is regression testing rather than faithfully simulating the atmospheric motions in the real world. We expect that the same set of initial conditions can be reused after answer-changing code baselines are established, until the list of prognostic variables in the model becomes substantially different. At that point it would be useful to regenerate the initial conditions and to reconsider which variables should be included in the test diagnostics.

Numerical simulations were carried out under a number of scenarios (test
cases) to help characterize

The group ENV used the same code
as in the reference ensemble but with different computers,
compilers,
or optimization levels:

PGI compiler version 15.3.0 with -O2 on Titan (“Titan-PGI”);

Intel compiler version 15.0.0 with -O2 on Yellowstone (ark:/85065/d7wd3xhc) at the Computational and Information Systems Laboratory of the National Center for Atmospheric Research (“YS-Intel15-O2”);

Intel compiler version 15.0.0 with -O3 on Yellowstone (“YS-Intel15-O3”).

The group MOD consists of two code modification cases from

In the division-to-multiplication (“DM”) case,
division by a time-invariant array was replaced by multiplication
by the inverse at one place in the dynamical core

In the precision (“

In group PAR, we repeated all the parameter perturbation experiments
presented by

To understand the initial evolution of

Ensemble-mean

The dashed gray lines in Fig.

We now take a closer look at the test diagnostics at a single time instance.
In Fig.

The test case with a modified dust emission factor (DUST) was
expected to be challenging for the TSC method. On any model day,
emission occurs over only a very small fraction
of the dust source areas.
Dust particles emitted from the surface can only be transported
over a short distance during the few-minute simulation time,
and the impact on meteorological conditions through
the absorption and/or scattering of radiation is also limited.
Hence, it is unlikely that the solution differences can be seen in
the global temperature RMSD. This was the reason that motivated us
to use multiple prognostic variables and to separate land and ocean
when defining the test diagnostics.
The results shown in Fig.

As in Fig.

The CONV-LND case is challenging for similar reasons.
Here the coefficient that controls the conversion of cloud condensate to precipitation
was modified for deep convection over land. With a smaller value
for zmconv_c0_lnd, we expect to have more cloud condensate detrained by deep convection,
which can lead to changes in the mass and number concentrations
of ice crystals in stratiform clouds. Failing results are indeed seen
in these two variables (Fig.

As mentioned earlier, CAM-ECT assigned a “pass” to the NU case
but we expect the TSC result to be a “fail”.
The respective time series in Fig.

Based on the results shown above, we propose a version 1.0 implementation of
the TSC test that uses 12-member 10

The TSC method also allows for very fast test turnaround since the ensemble
simulations can be conducted in parallel. On Titan we used 512 MPI processes
for each simulation and often submitted 12 simulations to the Portable Batch
System (PBS)
in three 128-node batch jobs. The wall clock time for finishing
a single 10

In this paper we have presented evidence that the concept of time step convergence can be used to assess the magnitude of solution differences in CAM. It will be useful for future work to explore the following topics.

The TSC test procedure described in this paper has multiple parameters that can be modified: (1) ensemble size, (2) initialization strategy (e.g., simulation start time), (3) time step sizes, (4) integration length, (5) prognostic variables and model sub-domains included in the calculation of test diagnostics, and (6) the pass/fail criterion. Results presented in the previous section indicate that given (1)–(3), the choices for (4)–(6) can have strong impacts on the outcome of the TSC test.

In the DUST case, for example, systematically positive

Results in Fig.

As mentioned in the introduction, the radiation parameterization in CAM5 uses
a random number generator that leads to state-dependent noise in the model results.
All the simulations presented in Sect.

The development of the TSC test was motivated by the loss of utility of the PERGRO method and the relatively high computational cost of CAM-ECT. Since all three are regression testing methods, it is worth clarifying some linkages and distinctions among them.

CAM-ECT compares the model climate, and considers two sets of results “the same” when ensembles of 1-year simulations show consistent statistical distributions of global annual averages. PERGRO and TSC view CAM as a deterministic model, and consider two sets of model results “the same” when the observed solution differences with respect to trusted solutions appear to be consistent with the expected evolution of initial perturbation or time stepping error. In PERGRO and TSC, one-to-one solution comparisons are conducted using instantaneous grid-point values, and the solution differences are evaluated well within the deterministic limit of the flow evolution.

From the perspective that climate is essentially the statistical
characterization of deterministic-scale atmospheric conditions, and the fact
that the same set of differential-integral equations control the short-term
and long-term behaviors of the atmospheric motion in a numerical model, one
can expect the different regression testing methods to provide the same
“pass” or “fail” results when the solution differences are either very
small (e.g., at round-off level) or very large (e.g., due to a major bug
in the code). The general consistency between the TSC results shown in this
paper and the corresponding test results from

For practical model testing, it is highly desirable to find methods capable
of detecting early signs of climate-changing results at low computational
cost and with fast test turnaround. However, it is worth noting that the term
“climate-changing” is ambiguous until a quantitative criterion is
specified. For example, two simulations representing indistinguishable
climate characteristics according to SIEVE (see Sect.

In this study, we designed and evaluated a test procedure for determining whether the solutions of a numerical model remain the same within the limit of the time integration accuracy when bit-for-bit reproducibility is lost due to code modifications or computing environment changes. A “fail” signal is issued when the numerical solutions no longer converge to the reference solutions of the original model. The test method is deterministic by nature, but involves an ensemble of simulations to account for possible flow dependencies of the numerical error.

Using the CAM5 model, we provided initial evidence that the test procedure
based on 10

The new test is built on the generic concept of time step convergence, and the implementation does not require any code modifications. We plan to explore the utility of the method in other components of our Earth system model (e.g., ocean, sea ice, and land ice), and expect that the same concept is applicable to a wide range of geophysical models such as global and regional weather and climate models, cloud-resolving models, large-eddy simulations, and even direct numerical simulations.

The source code of CAM5 can be obtained as part of the Community
Earth System Model (CESM) from the public release website

The authors declare that they have no conflict of interest.

The authors thank W. Sacks (NCAR) and the two anonymous reviewers for their valuable comments and suggestions. This research was supported as part of the Accelerated Climate Modeling for Energy (ACME) program, funded by the US Department of Energy, Office of Science, Office of Biological and Environmental Research (BER). The basis of the work, the time step convergence study, was previously supported by BER as part of the Scientific Discovery through Advanced Computing (SciDAC) Program, and by the Linus Pauling Distinguished Postdoctoral Fellowship of the Pacific Northwest National Laboratory (PNNL). This research used high-performance computing resources from the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, supported by the Office of Science of DOE under contract no. DE-AC05-00OR22725, and the National Center for Atmospheric Research (NCAR) Computational and Information Systems Laboratory, sponsored by the National Science Foundation. PNNL is operated by Battelle Memorial Institute for DOE under contract DE-AC05-76RL01830. NCAR is sponsored by the National Science Foundation.

Edited by: P. Ullrich
Reviewed by: two anonymous referees