<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing with OASIS Tables v3.0 20080202//EN" "journalpub-oasis3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://docs.oasis-open.org/ns/oasis-exchange/table" xml:lang="en" dtd-version="3.0" article-type="research-article">
  <front>
    <journal-meta><journal-id journal-id-type="publisher">GMD</journal-id><journal-title-group>
    <journal-title>Geoscientific Model Development</journal-title>
    <abbrev-journal-title abbrev-type="publisher">GMD</abbrev-journal-title><abbrev-journal-title abbrev-type="nlm-ta">Geosci. Model Dev.</abbrev-journal-title>
  </journal-title-group><issn pub-type="epub">1991-9603</issn><publisher>
    <publisher-name>Copernicus Publications</publisher-name>
    <publisher-loc>Göttingen, Germany</publisher-loc>
  </publisher></journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5194/gmd-15-3183-2022</article-id><title-group><article-title>An ensemble-based statistical methodology to detect differences<?xmltex \hack{\newline}?> in weather and climate model executables</article-title><alt-title>An ensemble-based statistical methodology</alt-title>
      </title-group><?xmltex \runningtitle{An ensemble-based statistical methodology}?><?xmltex \runningauthor{C. Zeman and C. Schär}?>
      <contrib-group>
        <contrib contrib-type="author" corresp="yes">
          <name><surname>Zeman</surname><given-names>Christian</given-names></name>
          <email>christian.zeman@env.ethz.ch</email>
        <ext-link ext-link-type="uri" xlink:href="https://orcid.org/0000-0003-4248-4018">https://orcid.org/0000-0003-4248-4018</ext-link></contrib>
        <contrib contrib-type="author" corresp="no">
          <name><surname>Schär</surname><given-names>Christoph</given-names></name>
          
        <ext-link ext-link-type="uri" xlink:href="https://orcid.org/0000-0002-4171-1613">https://orcid.org/0000-0002-4171-1613</ext-link></contrib>
        <aff id="aff1"><institution>Institute for Atmospheric and Climate Science, ETH Zurich, Switzerland</institution>
        </aff>
      </contrib-group>
      <author-notes><corresp id="corr1">Christian Zeman (christian.zeman@env.ethz.ch)</corresp></author-notes><pub-date><day>19</day><month>April</month><year>2022</year></pub-date>
      
      <volume>15</volume>
      <issue>8</issue>
      <fpage>3183</fpage><lpage>3203</lpage>
      <history>
        <date date-type="received"><day>18</day><month>July</month><year>2021</year></date>
           <date date-type="rev-request"><day>6</day><month>September</month><year>2021</year></date>
           <date date-type="rev-recd"><day>8</day><month>February</month><year>2022</year></date>
           <date date-type="accepted"><day>7</day><month>March</month><year>2022</year></date>
      </history>
      <permissions>
        <copyright-statement>Copyright: © 2022 Christian Zeman</copyright-statement>
        <copyright-year>2022</copyright-year>
      <license license-type="open-access"><license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p></license></permissions><self-uri xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022.html">This article is available from https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022.html</self-uri><self-uri xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022.pdf">The full text article is available as a PDF file from https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022.pdf</self-uri>
      <abstract><title>Abstract</title>

      <p id="d1e89">Since their first operational application in the 1950s, atmospheric numerical models have become essential tools in weather prediction and climate research. As such, they are subject to continuous changes, thanks to advances in computer systems, numerical methods, more and better observations, and the ever-increasing knowledge about Earth's atmosphere. Many of the changes in today's models relate to seemingly innocuous modifications associated with minor code rearrangements, changes in hardware infrastructure, or software updates. Such changes are meant to preserve the model formulation, yet the verification of such changes is challenged by the chaotic nature of our atmosphere – any small change, even a rounding error, can have a significant impact on individual simulations. Overall, this represents a serious challenge to a consistent framework for model development and maintenance.</p>

      <p id="d1e92">Here we propose a new methodology for quantifying and verifying the impacts of minor changes in the atmospheric model or its underlying hardware/software system by using ensemble simulations in combination with a statistical hypothesis test for instantaneous or hourly values of output variables at the grid-cell level. The methodology can assess the effects of model changes on almost any output variable over time and can be used with different underlying statistical hypothesis tests.</p>

      <p id="d1e95">We present the first applications of the methodology with the regional weather and climate model COSMO. While providing very robust results, the methodology shows great sensitivity even to very small changes. Specific changes considered include applying a tiny amount of explicit diffusion, the switch from double to single precision, and a major system update of the underlying supercomputer. Results show that changes are often only detectable during the first hours, suggesting that short-term ensemble simulations (days to months) are best suited for the methodology, even when addressing long-term climate simulations. Furthermore, we show that spatial averaging – as opposed to testing at all grid points – reduces the test's sensitivity to small-scale features such as diffusion. We also show that the choice of the underlying statistical hypothesis test is not essential and that the methodology already works well at coarse resolutions, making it computationally inexpensive and therefore an ideal candidate for automated testing.</p>
  </abstract>
    </article-meta>
  </front>
<body>
      

<sec id="Ch1.S1" sec-type="intro">
  <label>1</label><title>Introduction</title>
      <p id="d1e107">Today's weather and climate predictions rely heavily on data produced by atmospheric numerical models. Ever since their first operational application in the 1950s, these models have been improved thanks to advances in computer systems, numerical methods, observational data, and the understanding of Earth's atmosphere. While individual changes are often small and incremental, in accumulation they have a large effect, which has manifested itself in a significant increase in the skill of weather and climate predictions over the past 40 years <xref ref-type="bibr" rid="bib1.bibx5" id="paren.1"/>.</p>
      <p id="d1e113">While some of the model changes are intended to extend and improve the model, others are not meant to affect the model results but merely its computational performance and versatility. In software engineering, one often distinguishes between “upgrades” and “updates” in such cases. For weather and climate models, an upgrade would, for example, be the introduction of a new and improved soil model, whereas a new version of underlying software or a binary that has been built with a newer compiler version would represent only an update. Updates are often employed due to the necessity of keeping the software up to date without making any perceivable improvements in functionality. For a weather and climate model, the model results are not supposed to be significantly affected by such an update. This also applies to other changes, such as moving to a different hardware architecture or changing the domain decomposition for distributed computing. Robust behavior of the model with regard to such changes is crucial for a consistent interpretation of the results and the credibility of the derived predictions and findings.</p>
      <p id="d1e116">Weather and climate model results are generally not bit identical when they are, for example, run on different hardware architectures or have been compiled with different compilers. This is because associativity does not hold for floating-point operations (i.e., <inline-formula><mml:math id="M1" display="inline"><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo><mml:mo>+</mml:mo><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:mo>(</mml:mo><mml:mi>y</mml:mi><mml:mo>+</mml:mo><mml:mi>z</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is not generally given) and because the order of arithmetic operations depends on the compiler and the targeted hardware architecture. <xref ref-type="bibr" rid="bib1.bibx42" id="text.2"/> have achieved bit reproducibility between a CPU and a GPU version of the regional weather and climate model COSMO by limiting instruction rearrangements from the compiler and using a preprocessor that automatically adds parentheses to every mathematical expression of the model. However, this also came with a performance penalty, where the CPU and GPU bit-reproducible versions were slower by 37 % and 13 %, respectively, than their non-bit-reproducible counterparts. Due to this performance penalty and the effort involved in making a model bit reproducible, bit reproducibility is generally not enforced. Note that not producing bit-identical results across different architectures or compilers is common to most computer applications and not a problem per se. However, for weather and climate models, it represents a serious challenge due to the chaotic nature of the underlying nonlinear dynamics, where small changes can have a big effect <xref ref-type="bibr" rid="bib1.bibx21" id="paren.3"/>. 
For example, a tiny difference in the initial conditions of a weather forecast can potentially lead to a very different prediction. Consequently, rounding errors can also affect the model results in a major way. In order to mitigate this effect and to provide probabilistic predictions, forecasts often use ensemble prediction systems (EPSs), where a model is run several times for the same time frame with slightly perturbed initial conditions or stochastic perturbations of the model simulations <xref ref-type="bibr" rid="bib1.bibx18" id="paren.4"><named-content content-type="pre">see</named-content><named-content content-type="post">for an overview</named-content></xref>. The use of an EPS accounts for the uncertainty in initial conditions and the internal variability of the model results.</p>
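The non-associativity of floating-point addition mentioned above is easy to demonstrate. The following minimal Python sketch is our own illustration (not part of the original article); it shows how a change in evaluation order alone changes the result:

```python
# Floating-point addition is not associative: the result depends on the
# evaluation order, which compilers and hardware are free to rearrange.
a = 1e16
b = -1e16
c = 1.0

left = (a + b) + c   # large terms cancel first, then 1.0 is added
right = a + (b + c)  # 1.0 is absorbed by the large term and lost

print(left)           # 1.0
print(right)          # 0.0
print(left == right)  # False
```

Because two builds of the same model may evaluate such sums in a different order, they can diverge at the level of rounding errors, which the chaotic dynamics then amplify.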
      <p id="d1e169">So, in order to verify that the properties of a weather and climate model executable are not significantly affected by an update or a change to a different platform, we have to resort to ensemble simulations. Without ensemble simulations, we would only be able to answer something we already know a priori: any change in the model or its underlying software and hardware will make the model slightly different and, therefore, might significantly affect the output due to the chaotic nature of the underlying dynamics. With ensemble simulations, however, we can answer a much more important question: how do the changes in model results compare to the internal variability of the underlying nonlinear dynamical system? If the effect of the change is significantly smaller than that of internal variability, a statistical test will not be able to reject the hypothesis that the results of the new and the old model come from the same distribution.</p>
      <p id="d1e173">In this paper, the detection of such changes will be referred to as “verification”. In the atmospheric and climate science community, the terms “validation” and “verification” are not always used in a clearly defined way, and are sometimes even used interchangeably. An extensive discussion about different definitions of verification and validation can be found in <xref ref-type="bibr" rid="bib1.bibx29" id="text.5"/>. <xref ref-type="bibr" rid="bib1.bibx41" id="text.6"/> defines verification as “ensuring that the computer program of the computerized model and its implementation are correct”. In contrast, validation is defined as “substantiation that a model within its domain of applicability possesses a satisfactory range of accuracy consistent with the intended application of the model”. According to <xref ref-type="bibr" rid="bib1.bibx8" id="text.7"/>, validation refers to “the processes and techniques that the model developer, model customer and decision makers jointly use to assure that the model represents the real system (or proposed real system) to a sufficient level of accuracy”, while verification refers to “the processes and techniques that the model developer uses to assure that his or her model is correct and matches any agreed-upon specifications and assumptions”. <xref ref-type="bibr" rid="bib1.bibx9" id="text.8"/> define validation as “comparison with observations” and verification as “comparison with analytic test cases and computational products”. <xref ref-type="bibr" rid="bib1.bibx52" id="text.9"/> state that “whenever a model or model component is compared with reality, validation is performed”, whereas they define verification as “substantiating that a simulation model is translated from one form into another, during its development life cycle, with sufficient accuracy”. 
<xref ref-type="bibr" rid="bib1.bibx31" id="text.10"/> and <xref ref-type="bibr" rid="bib1.bibx30" id="text.11"/> recommend not to use the terms verification and validation at all for models of complex natural systems. They argue that both terms imply an either–or situation for something that is not possible (i.e., a model will never be able to accurately represent the actual processes occurring in a real system) or is only possible to evaluate for simplified and limited test cases (i.e., comparing with analytical solutions for simple problems). Nevertheless, both terms are commonly used in atmospheric sciences. Note that in this paper, we follow the terminology of <xref ref-type="bibr" rid="bib1.bibx52" id="text.12"/>. As our methodology's goal is to ensure that there are no significant differences between two model executables, we use the term verification for the methodology.</p>
      <p id="d1e201">Using the definition from <xref ref-type="bibr" rid="bib1.bibx52" id="text.13"/>, verification is a form of system testing in the area of software engineering. This means that a complete integrated system is tested; in this case, a weather and climate model consisting of many different components that interact with each other. System tests are an integral part of software engineering. An objective system test that can be performed automatically is also an excellent asset for the practice of continuous integration and continuous deployment (CI/CD). CI/CD enforces automation in the building, testing, and deployment of applications and should also be considered good practice in developing and operating weather and climate models.</p>
</sec>
<sec id="Ch1.S2">
  <label>2</label><title>Background</title>
<sec id="Ch1.S2.SS1">
  <label>2.1</label><title>Current state of the art</title>
      <p id="d1e222">Despite its importance for the consistency and trustworthiness of model results, verification has received relatively little attention in the weather and climate community. However, the awareness seems to have increased, as some recent studies tackle this issue more systematically.</p>
      <p id="d1e225"><xref ref-type="bibr" rid="bib1.bibx39" id="text.14"/> were among the first to propose a strategy for verifying atmospheric models after they had been ported to a new architecture. They set the conditions that the differences should be within one or two orders of magnitude of machine rounding during the first few time steps and that the growth of differences should not exceed the growth of initial perturbations at machine precision during the first few days. The methodology of <xref ref-type="bibr" rid="bib1.bibx39" id="text.15"/> was developed and used for the NCAR Community Climate Model (CCM2). However, the approach is no longer applicable for its current successor, the Community Atmosphere Model (CAM), because the parameterizations are ill conditioned, which makes small perturbations grow very quickly and exceed the tolerances of rounding error growth within the first few time steps <xref ref-type="bibr" rid="bib1.bibx1" id="paren.16"/>. <xref ref-type="bibr" rid="bib1.bibx48" id="text.17"/> performed 42 h simulations with the Mesoscale Compressible Community (MC2) model to determine the importance of processor configuration (domain decomposition), floating-point precision, and mathematics libraries for the model results. By analyzing the spread of runs with different settings, they concluded that processor configuration is the main contributor among these categories to differences in the results of its dynamical core. <xref ref-type="bibr" rid="bib1.bibx17" id="text.18"/> analyzed an ensemble of over 57 000 climate runs from the climateprediction.net project (<uri>https://www.climateprediction.net</uri>, last access: 31 January 2022). The climate runs were performed with varying parameter settings and initial conditions on different hardware and software architectures. 
Using regression tree analysis, they demonstrated that the effect of hardware and software changes is small relative to the effect of parameter variations and, over the wide range of systems tested, may be treated as equivalent to that caused by changes in initial conditions. <xref ref-type="bibr" rid="bib1.bibx16" id="text.19"/> performed seasonal simulations with the global model program (GMP) of the Global/Regional Integrated Model system (GRIMs) on 10 different software system platforms with different compilers, parallel libraries, and optimization levels. The results showed that the ensemble spread caused by differences in the software system is comparable to that caused by differences in initial conditions.</p>
      <p id="d1e249">One of the most comprehensive recent studies on verification is from <xref ref-type="bibr" rid="bib1.bibx1" id="text.20"/>, where they proposed the use of principal component analysis (PCA) for consistency testing of climate models. Instead of testing all model output variables, many of which were highly correlated, they only looked at the first few principal components of the model output and used <inline-formula><mml:math id="M2" display="inline"><mml:mi>z</mml:mi></mml:math></inline-formula>-scores to test if the value from a test configuration is within a certain number of standard deviations from the control ensemble. If the test failed for too many PCs, they rejected the new configuration. They confirmed their methodology using 1-year-long simulations of the Community Earth System Model (CESM) with different parameter settings, hardware architectures, and compiler options. While the methodology showed high sensitivity and promising results, it had some difficulties detecting changes caused by additional diffusion due to its focus on annual global mean values. <xref ref-type="bibr" rid="bib1.bibx2" id="text.21"/> also used <inline-formula><mml:math id="M3" display="inline"><mml:mi>z</mml:mi></mml:math></inline-formula>-scores for consistency testing of the Parallel Ocean Program (POP), the ocean model component of the CESM. However, instead of evaluating principal components on spatial averages, as in <xref ref-type="bibr" rid="bib1.bibx1" id="text.22"/>, they applied the methodology at each grid point for individual variables and stipulated that this local test had to be passed for at least 90 % of the grid points to achieve a global test pass. 
<xref ref-type="bibr" rid="bib1.bibx28" id="text.23"/> extended the consistency test of <xref ref-type="bibr" rid="bib1.bibx1" id="text.24"/> by performing the test on spatial means for the first nine time steps of the Community Atmospheric Model (CAM) on a global <inline-formula><mml:math id="M4" display="inline"><mml:mrow><mml:msup><mml:mn mathvariant="normal">1</mml:mn><mml:mo>∘</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> grid with a time step of 1800 s. With this method, they were able to produce the same results for the same test cases as <xref ref-type="bibr" rid="bib1.bibx1" id="text.25"/>. Additionally, they were also able to detect small changes in diffusion that were not detected in <xref ref-type="bibr" rid="bib1.bibx1" id="text.26"/>.</p>
      <p id="d1e299"><xref ref-type="bibr" rid="bib1.bibx51" id="text.27"/> used time-step convergence as a criterion for model verification, based on the idea that a significantly different model executable will no longer converge towards a reference solution produced with the old executable. Their test methodology produced similar results to the one from <xref ref-type="bibr" rid="bib1.bibx1" id="text.28"/> and is relatively inexpensive due to the short integration times. However, due to the nature of the test, it cannot detect issues associated with diagnostic calculations that do not feed back to the model state variables.</p>
      <p id="d1e308"><xref ref-type="bibr" rid="bib1.bibx24" id="text.29"/> used an ensemble-based approach where they applied the Kolmogorov–Smirnov (K-S) test on annual and spatial means of 1-year simulations for testing the equality of distributions of different model simulations. Furthermore, they used generalized extreme value (GEV) theory to represent the annual maxima of daily average surface temperature and precipitation rate. They then applied Student's <inline-formula><mml:math id="M5" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test on the estimated GEV parameters at each grid point to test the occurrence of climate extremes. They showed that the climate extremes test based on GEV theory was considerably less sensitive to changes in optimization strategies than the K-S test on mean values.  <xref ref-type="bibr" rid="bib1.bibx25" id="text.30"/> applied two multivariate two-sample equality of distribution tests, the energy test and the kernel test, on year-long ensemble simulations following <xref ref-type="bibr" rid="bib1.bibx1" id="text.31"/> and <xref ref-type="bibr" rid="bib1.bibx24" id="text.32"/>. However, both these tests generally showed a lower power than the K-S test from <xref ref-type="bibr" rid="bib1.bibx24" id="text.33"/>, which means that more ensemble members were needed to reject the null hypothesis confidently. <xref ref-type="bibr" rid="bib1.bibx23" id="text.34"/> used the K-S test as well as the Cucconi test for annual mean values at each grid point for the verification of the ocean model component of the US Department of Energy's Energy Exascale Earth System Model (E3SM). Furthermore, they used the false discovery rate (FDR) method from <xref ref-type="bibr" rid="bib1.bibx7" id="text.35"/> for controlling the false positive rate. Both tests were able to detect very small changes of a tuning parameter, with the K-S test showing a slightly higher power than the Cucconi test for the smallest changes.</p>
      <p id="d1e339"><xref ref-type="bibr" rid="bib1.bibx27" id="text.36"/> recently proposed an ensemble-based methodology based on monthly averages (and an average over the whole simulation time) and then compared these averages on a grid-cell level against standard indices used in <xref ref-type="bibr" rid="bib1.bibx35" id="text.37"/>. Finally, spatially averaging results in one scalar number per field, month, and ensemble member. These scalars were then used for the K-S test to detect statistically significant differences. Performing this test for climate runs with the earth system model version EC-Earth 3.1 in different computing environments revealed significant differences for 4 out of 13 variables. However, the same test for the newer EC-Earth 3.2 version showed no significant differences. <xref ref-type="bibr" rid="bib1.bibx27" id="text.38"/> suspect the presence of a bug in EC-Earth 3.1 that was subsequently fixed in version 3.2 as the reason for this disparity.</p>
</sec>
<sec id="Ch1.S2.SS2">
  <label>2.2</label><title>Determining field significance</title>
      <p id="d1e358">A challenging question in the area of model verification is the role of statistical significance at the grid-point versus the field level. A statistical hypothesis test's significance level <inline-formula><mml:math id="M6" display="inline"><mml:mi mathvariant="italic">α</mml:mi></mml:math></inline-formula> is defined as the probability of rejecting the null hypothesis even though the null hypothesis is true (commonly known as false-positive or type I error). So, if we compare two ensembles and perform the test at every grid point, the test may locally reject the null hypothesis even if the two ensembles stem from the same model. When assuming spatial independence, the probability of having <inline-formula><mml:math id="M7" display="inline"><mml:mi>x</mml:mi></mml:math></inline-formula> rejected local null hypotheses out of <inline-formula><mml:math id="M8" display="inline"><mml:mi>N</mml:mi></mml:math></inline-formula> tests follows from the binomial distribution:
            <disp-formula id="Ch1.E1" content-type="numbered"><label>1</label><mml:math id="M9" display="block"><mml:mrow><mml:mi>P</mml:mi><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:mi>N</mml:mi><mml:mi mathvariant="normal">!</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mi mathvariant="normal">!</mml:mi><mml:mo>(</mml:mo><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo><mml:mi mathvariant="normal">!</mml:mi></mml:mrow></mml:mfrac></mml:mstyle><mml:msup><mml:mi mathvariant="italic">α</mml:mi><mml:mi>x</mml:mi></mml:msup><mml:mo>(</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>-</mml:mo><mml:mi mathvariant="italic">α</mml:mi><mml:msup><mml:mo>)</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mi>x</mml:mi></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula>
          On average, we can expect <inline-formula><mml:math id="M10" display="inline"><mml:mrow><mml:mi mathvariant="italic">α</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:math></inline-formula> local rejections over the whole grid when two ensembles come from the same model. However, for <inline-formula><mml:math id="M11" display="inline"><mml:mrow><mml:mi>N</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M12" display="inline"><mml:mrow><mml:mi mathvariant="italic">α</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.05</mml:mn></mml:mrow></mml:math></inline-formula>, the probability of having nine or more erroneous rejections is still 6.3 %, which means that 10 or more local rejections are required (probability 2.8 %) to reject the global null hypothesis at field level with a 95 % confidence level. So, in this case, 10 % of the local hypothesis tests would have to reject the local null hypothesis to get a significant global rejection. For a larger grid with <inline-formula><mml:math id="M13" display="inline"><mml:mrow><mml:mi>N</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">10</mml:mn><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mn mathvariant="normal">000</mml:mn></mml:mrow></mml:math></inline-formula>, we would require 537 (5.37 %) or more local rejections (probability 4.8 %) to reject the global null hypothesis with a 95 % confidence level <xref ref-type="bibr" rid="bib1.bibx20" id="paren.39"><named-content content-type="pre">see Fig. 3 in</named-content><named-content content-type="post">for a visualization of this function</named-content></xref>.</p>
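The tail probabilities quoted above follow directly from Eq. (1). As a sketch (our own illustration using only the Python standard library; the helper name <monospace>binom_sf</monospace> is ours), the thresholds for N = 100 and α = 0.05 can be reproduced as follows:

```python
from math import comb

def binom_sf(k, n, p):
    """P(X > k) for X ~ Binomial(n, p): the probability of more than k
    false local rejections when all N local null hypotheses are true
    and the local tests are assumed independent (Eq. 1)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1, n + 1))

N, alpha = 100, 0.05
print(round(binom_sf(8, N, alpha), 3))  # P(X >= 9)  -> 0.063
print(round(binom_sf(9, N, alpha), 3))  # P(X >= 10) -> 0.028
```

Since P(X ≥ 9) ≈ 6.3 % exceeds 5 % while P(X ≥ 10) ≈ 2.8 % does not, at least 10 local rejections are needed for a field-level rejection at the 95 % confidence level.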
      <p id="d1e505">However, local tests cannot be assumed to be statistically independent due to spatial correlation. Therefore, Eq. (<xref ref-type="disp-formula" rid="Ch1.E1"/>) is not valid in our case. While two identical models will still have <inline-formula><mml:math id="M14" display="inline"><mml:mrow><mml:mi mathvariant="italic">α</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:math></inline-formula> false rejections on average, a higher or lower rejection rate is more likely. Unfortunately, the exact distribution of rejection rates is unknown in such a case <xref ref-type="bibr" rid="bib1.bibx45" id="paren.40"/>. <xref ref-type="bibr" rid="bib1.bibx20" id="text.41"/> argued that spatial correlation reduces <inline-formula><mml:math id="M15" display="inline"><mml:mi>N</mml:mi></mml:math></inline-formula>, the number of independent tests, due to a clustering effect of grid points and therefore also increases the percentage of local rejections needed to reject the global null hypothesis. To account for that, they estimated the effective number of independent tests <inline-formula><mml:math id="M16" display="inline"><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi mathvariant="normal">eff</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> with the use of Monte Carlo methods, which allowed them to use Eq. (<xref ref-type="disp-formula" rid="Ch1.E1"/>) for calculating the number of rejected local tests that are required to reject the global null hypothesis.</p>
      <p id="d1e547"><xref ref-type="bibr" rid="bib1.bibx55" id="text.42"/> recommended the use of the FDR method of <xref ref-type="bibr" rid="bib1.bibx7" id="text.43"/>. This method derives a threshold level <inline-formula><mml:math id="M17" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi mathvariant="normal">FDR</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> from the sorted <inline-formula><mml:math id="M18" display="inline"><mml:mi>p</mml:mi></mml:math></inline-formula>-values. The threshold is defined as
            <disp-formula id="Ch1.E2" content-type="numbered"><label>2</label><mml:math id="M19" display="block"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi mathvariant="normal">FDR</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="false">max⁡</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mi>N</mml:mi></mml:mrow></mml:munder><mml:mfenced open="[" close="]"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub><mml:mo>:</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub><mml:mo>≤</mml:mo><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>/</mml:mo><mml:mi>N</mml:mi><mml:mo>)</mml:mo><mml:msub><mml:mi mathvariant="italic">α</mml:mi><mml:mi mathvariant="normal">FDR</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
          where <inline-formula><mml:math id="M20" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> are the sorted <inline-formula><mml:math id="M21" display="inline"><mml:mi>p</mml:mi></mml:math></inline-formula>-values with <inline-formula><mml:math id="M22" display="inline"><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mi>N</mml:mi></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M23" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">α</mml:mi><mml:mi mathvariant="normal">FDR</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the chosen control level for the FDR (note that <inline-formula><mml:math id="M24" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">α</mml:mi><mml:mi mathvariant="normal">FDR</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> need not be the same as <inline-formula><mml:math id="M25" display="inline"><mml:mi mathvariant="italic">α</mml:mi></mml:math></inline-formula> for the local test). The FDR method only rejects local null hypotheses if the respective <inline-formula><mml:math id="M26" display="inline"><mml:mi>p</mml:mi></mml:math></inline-formula>-value is no larger than <inline-formula><mml:math id="M27" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi mathvariant="normal">FDR</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>. This condition essentially ensures that the fraction of false rejections out of all rejections is at most <inline-formula><mml:math id="M28" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">α</mml:mi><mml:mi mathvariant="normal">FDR</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> on average. 
While the FDR method of <xref ref-type="bibr" rid="bib1.bibx7" id="text.44"/> is also theoretically based on the assumption that the individual tests are statistically independent, it has been shown to effectively control the proportion of falsely rejected null hypotheses for spatially correlated data as well <xref ref-type="bibr" rid="bib1.bibx50 bib1.bibx23" id="paren.45"/>. An assessment of the FDR method in the context of our verification methodology will be presented in Sect. <xref ref-type="sec" rid="Ch1.S4.SS11"/>.</p>
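As an illustration, the selection rule in Eq. (2) corresponds to the Benjamini–Hochberg step-up procedure and can be sketched in a few lines of Python. This is not code from the study, and the p-values below are invented for demonstration:

```python
import numpy as np

def fdr_threshold(p_values, alpha_fdr=0.1):
    # Eq. (2): p_FDR is the largest sorted p-value p_(i) with p_(i) <= (i / N) * alpha_FDR
    p_sorted = np.sort(np.ravel(p_values))
    n = p_sorted.size
    below = p_sorted <= np.arange(1, n + 1) / n * alpha_fdr
    if not below.any():
        return 0.0  # no local null hypothesis can be rejected
    return p_sorted[below.nonzero()[0].max()]

# invented local p-values for six grid cells
p = np.array([0.001, 0.008, 0.039, 0.041, 0.27, 0.6])
p_fdr = fdr_threshold(p, alpha_fdr=0.1)  # -> 0.041
rejected = p <= p_fdr                    # first four hypotheses rejected
```

The maximum over i in Eq. (2) makes this a step-up rule: every p-value at or below p_FDR is rejected, even if it individually exceeds its own (i/N) α_FDR bound.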
</sec>
</sec>
<sec id="Ch1.S3">
  <label>3</label><title>Methods and data</title>
<sec id="Ch1.S3.SS1">
  <label>3.1</label><title>Verification methodology</title>
      <p id="d1e774">We consider ensemble simulations of two model versions, which for brevity will be referred to as “old” and “new”, respectively. We start by stating our global null hypothesis:
<list list-type="bullet"><list-item>
      <p id="d1e779"><inline-formula><mml:math id="M29" display="inline"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mn mathvariant="normal">0</mml:mn><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mo>(</mml:mo><mml:mi mathvariant="normal">global</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>: the ensemble results from the old and the new model are drawn from the same distribution.</p></list-item></list>
We then consider the changes in the model to be insignificant if we are not able to reject the global null hypothesis. This global test is based on a statistical hypothesis test applied at the grid-cell level with the local null hypothesis <inline-formula><mml:math id="M30" display="inline"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mn mathvariant="normal">0</mml:mn><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>. The specific definition of <inline-formula><mml:math id="M31" display="inline"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mn mathvariant="normal">0</mml:mn><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> will be given later, as it somewhat depends upon the chosen statistical hypothesis test; see Sect. <xref ref-type="sec" rid="Ch1.S3.SS3"/>. It is also important to state that we will generally not evaluate the whole model output but compare a limited number of two-dimensional fields, such as the 500 hPa geopotential height or the 850 hPa temperature fields. For each selected field, the two model ensembles will be tested at grid scale against each other, using an appropriate statistical test. 
The probability of rejecting <inline-formula><mml:math id="M32" display="inline"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mn mathvariant="normal">0</mml:mn><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> for two ensembles produced by an identical model is given by the significance level <inline-formula><mml:math id="M33" display="inline"><mml:mi mathvariant="italic">α</mml:mi></mml:math></inline-formula> (here, <inline-formula><mml:math id="M34" display="inline"><mml:mrow><mml:mi mathvariant="italic">α</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.05</mml:mn></mml:mrow></mml:math></inline-formula>). As discussed in Sect. <xref ref-type="sec" rid="Ch1.S2.SS2"/>, the main difficulty of using statistical hypothesis tests at the grid-cell level is the spatial correlation, which means that the respective tests are not statistically independent and thus prohibits the use of the binomial distribution for calculating the probabilities of false positives. We chose to deal with this in a conceptually simple but effective way. The methodology follows <xref ref-type="bibr" rid="bib1.bibx19" id="text.46"/> and combines Monte Carlo methods and subsampling to produce a null distribution of rejection rates, which can be used to get the probability of having <inline-formula><mml:math id="M35" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">rej</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> rejections for two ensembles coming from the same model. An alternative to generating the null distribution from a control ensemble is the use of Monte Carlo permutation testing, where one pools two ensembles (for which one does not yet know whether they come from the same distribution) and then applies the test to randomly drawn subsets from the pooled ensemble. 
This approach allows the creation of a control ensemble to be bypassed, thereby saving computation time. Strictly speaking, the reference value for the number of rejections then comes from a distribution produced not by one but by two models. Depending on the difference between the two models, this might lead to slightly different results compared to a case where the reference distribution comes from two identical models. However, <xref ref-type="bibr" rid="bib1.bibx24 bib1.bibx25" id="text.47"/> used both approaches and found only minor differences between permutation testing and subsampling from a control ensemble to generate the null distribution. Nevertheless, we still opted for the approach with a control ensemble since the additional computation time required is relatively small for short simulations (see Sect. <xref ref-type="sec" rid="Ch1.S3.SS5"/>).</p>
      <p id="d1e914">Figure <xref ref-type="fig" rid="Ch1.F1"/> shows a schematic example of the procedure. The control and reference ensembles come from an identical model (old model), whereas the evaluation ensemble comes from a model where we are unsure whether it produces statistically indistinguishable results (new model). Each ensemble consists of <inline-formula><mml:math id="M36" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> members, and we use <inline-formula><mml:math id="M37" display="inline"><mml:mi>m</mml:mi></mml:math></inline-formula> subsamples consisting of <inline-formula><mml:math id="M38" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> random members (<inline-formula><mml:math id="M39" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>&lt;</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>) drawn from each ensemble without replacement. We then test for field significance by comparing the mean rejection rate from the evaluation ensemble to the 0.95 quantile from the control ensemble, rejecting the null hypothesis if the mean rejection rate of the evaluation ensemble is equal to or above the 0.95 quantile of the control ensemble rejection rate.</p>
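The procedure sketched in Fig. 1 can be illustrated as follows. This is a schematic stand-in, not the study's code: the fields are synthetic random data, scipy's Mann–Whitney U test serves as the local test, and the subsample size and count (n_S = 50, m = 100) are illustrative choices.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

def rejection_rate(sample_a, sample_b, alpha=0.05):
    # fraction of grid cells (i, j) where the local test rejects H0(i, j)
    _, p = mannwhitneyu(sample_a, sample_b, axis=0)
    return np.mean(p <= alpha)

def rate_distribution(ens, ref, n_s=50, m=100):
    # m subsamples of n_s members drawn without replacement from each ensemble
    rates = np.empty(m)
    for k in range(m):
        sub = ens[rng.choice(ens.shape[0], n_s, replace=False)]
        sub_ref = ref[rng.choice(ref.shape[0], n_s, replace=False)]
        rates[k] = rejection_rate(sub, sub_ref)
    return rates

# synthetic stand-ins for ensembles of 2D fields, shape (n_E, i, j)
control = rng.normal(size=(100, 20, 20))
reference = rng.normal(size=(100, 20, 20))
evaluation = rng.normal(size=(100, 20, 20))

r_control = rate_distribution(control, reference)
r_eval = rate_distribution(evaluation, reference)
# global H0 is rejected if the evaluation mean reaches the 0.95 quantile of the control
reject_global = r_eval.mean() >= np.quantile(r_control, 0.95)
```

Here all three ensembles are drawn from the same distribution, so the global null hypothesis should typically not be rejected.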

      <?xmltex \floatpos{t}?><fig id="Ch1.F1" specific-use="star"><?xmltex \currentcnt{1}?><?xmltex \def\figurename{Figure}?><label>Figure 1</label><caption><p id="d1e968">Schematic sketch of the verification methodology. The control and the reference ensembles come from the same “old” model, whereas the evaluation ensemble comes from a “new” model, and we do not know whether this new model is indistinguishable from the model that created the control and reference ensembles. We draw many random subsamples from all three ensembles, perform the local statistical hypothesis tests of the control and evaluation subsamples against the reference subsamples, and then calculate the rejection rate for each subsample. This results in distributions of rejection rates for the control and evaluation ensembles that can be compared to each other in order to decide whether the evaluation ensemble is different. In this work, we reject the global null hypothesis if the mean of the evaluation ensemble rejection rate distribution is equal to or above the 0.95 quantile of the rejection rate distribution for the control ensemble.</p></caption>
          <?xmltex \igopts{width=455.244094pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022-f01.png"/>

        </fig>

      <p id="d1e978">As well as accounting for spatial correlation, having a rejection rate distribution from a control ensemble also offers more flexibility in evaluating different variables. In atmospheric models, some variables, such as precipitation, inherently have a high probability of being zero at many grid points. Therefore, a statistical test will often not reject the local null hypothesis even when the two ensembles come from two very different models. This can lead to a mean rejection rate of well below <inline-formula><mml:math id="M40" display="inline"><mml:mi mathvariant="italic">α</mml:mi></mml:math></inline-formula> for two different ensembles, and by just looking at <inline-formula><mml:math id="M41" display="inline"><mml:mi mathvariant="italic">α</mml:mi></mml:math></inline-formula>, we would conclude that the two ensembles are indistinguishable. However, here we derive the expected rejection rate from the control ensemble, which yields an objective threshold that accounts for such behavior.</p>
      <p id="d1e995">It is important to mention that the choice of <inline-formula><mml:math id="M42" display="inline"><mml:mrow><mml:mi mathvariant="italic">α</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.05</mml:mn></mml:mrow></mml:math></inline-formula> for the local statistical hypothesis test is arbitrary and does not determine the confidence interval for field significance. Furthermore, comparing the mean rejection rate from the evaluation ensemble with the 0.95 quantile from the control might also convey a misleading impression of a confidence interval for field significance. If we assume that the evaluation ensemble comes from an identical model and only take one subsample from the evaluation ensemble, the probability of it having a rejection rate equal to or higher than the 0.95 quantile from the control rejection rate distribution is, in fact, 5 %. However, the probability of the mean rejection rate of 100 subsamples from the evaluation ensemble being higher than the 0.95 quantile of the control is significantly lower than 5 %, but it is not easy to determine by how much. Using the binomial distribution in Eq. (<xref ref-type="disp-formula" rid="Ch1.E1"/>) to calculate the number of rejected subsamples required to reject the overall null hypothesis is not valid, because the subsamples are not statistically independent of each other. Based on our experience and the results shown in this work, we consider the comparison of the mean to the 0.95 quantile a reasonable choice, even though it is not really based on a confidence interval (unlike, for example, the FDR approach discussed in Sect. <xref ref-type="sec" rid="Ch1.S2.SS2"/>). However, the sensitivity of the methodology could of course be adapted by changing this field significance criterion.</p>
      <p id="d1e1014">The verification methodology used in this work shares some similarities with verification methodologies presented in previous studies, most notably <xref ref-type="bibr" rid="bib1.bibx1 bib1.bibx2 bib1.bibx28 bib1.bibx24 bib1.bibx25 bib1.bibx23 bib1.bibx27" id="text.48"/>. However, most of these studies focus on mean values in space and time. Among the previously mentioned studies, only <xref ref-type="bibr" rid="bib1.bibx2" id="text.49"/>, <xref ref-type="bibr" rid="bib1.bibx24" id="text.50"/>, and <xref ref-type="bibr" rid="bib1.bibx23" id="text.51"/> used a similar methodology at the grid-cell level for either monthly or yearly averages of variables from an ocean model component <xref ref-type="bibr" rid="bib1.bibx2 bib1.bibx23" id="paren.52"/> or for the identification of differences in annual extreme values <xref ref-type="bibr" rid="bib1.bibx24" id="paren.53"/>. Moreover, except for <xref ref-type="bibr" rid="bib1.bibx28" id="text.54"/>, all other studies focused on longer simulations (1 year or more) and average values in time. We will focus on shorter simulations (days to months), with the idea that many small changes are often easier to identify at the beginning of the simulation. We apply the methodology directly to instantaneous or, in the case of precipitation, hourly output variables from an atmospheric model on a 3-hourly or 6-hourly basis. The rejection threshold is computed as a function of time and may transiently increase or decrease in response to changes in predictability. In essence, the rejection rate distribution from a control ensemble allows us to use an objective criterion for field significance. Another difference from most existing verification methodologies is that this methodology calculates the mean rejection rate from the evaluation ensemble and the 0.95 quantile from the control ensemble using subsampling. It thus essentially performs multiple global tests to arrive at a pass or fail decision. 
Most existing methodologies use only one test with all ensemble members for the pass or fail decision. However, many of them use subsampling to estimate the false positive rate.</p>
</sec>
<sec id="Ch1.S3.SS2">
  <label>3.2</label><title>Ensemble generation</title>
      <p id="d1e1047">The ensemble is created through a perturbation of the initial conditions of the prognostic variables (in our case, the horizontal and vertical wind components, pressure perturbation, temperature, specific humidity, and cloud water content). The perturbed variable <inline-formula><mml:math id="M43" display="inline"><mml:mover accent="true"><mml:mi mathvariant="italic">φ</mml:mi><mml:mo mathvariant="normal" stretchy="false">^</mml:mo></mml:mover></mml:math></inline-formula> is defined as
            <disp-formula id="Ch1.E3" content-type="numbered"><label>3</label><mml:math id="M44" display="block"><mml:mrow><mml:mover accent="true"><mml:mi mathvariant="italic">φ</mml:mi><mml:mo mathvariant="normal" stretchy="false">^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo>(</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>+</mml:mo><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mi>R</mml:mi><mml:mo>)</mml:mo><mml:mi mathvariant="italic">φ</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
          where <inline-formula><mml:math id="M45" display="inline"><mml:mi mathvariant="italic">φ</mml:mi></mml:math></inline-formula> is the unperturbed prognostic variable, <inline-formula><mml:math id="M46" display="inline"><mml:mi>R</mml:mi></mml:math></inline-formula> is a random number with a uniform distribution between <inline-formula><mml:math id="M47" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M48" display="inline"><mml:mn mathvariant="normal">1</mml:mn></mml:math></inline-formula>, and <inline-formula><mml:math id="M49" display="inline"><mml:mi mathvariant="italic">ϵ</mml:mi></mml:math></inline-formula> is the specified magnitude of the perturbation. In this study, we have used <inline-formula><mml:math id="M50" display="inline"><mml:mrow><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">4</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> for all experiments. Aside from already providing a good ensemble spread during the first few hours, the relatively strong perturbation also works well with single-precision floating-point representation. 
Furthermore, the effect on internal variability with <inline-formula><mml:math id="M51" display="inline"><mml:mrow><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">4</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is very similar to the one from much weaker perturbations (e.g., <inline-formula><mml:math id="M52" display="inline"><mml:mrow><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">16</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>) after a few hours, as shown in Appendix <xref ref-type="sec" rid="App1.Ch1.S1"/>.</p>
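For illustration, the perturbation in Eq. (3) can be applied to a prognostic field in a few lines; the field values and random seed below are invented:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def perturb(phi, eps=1e-4):
    # Eq. (3): phi_hat = (1 + eps * R) * phi, with R uniform between -1 and 1
    r = rng.uniform(-1.0, 1.0, size=phi.shape)
    return (1.0 + eps * r) * phi

# hypothetical temperature field (K) on a 10 x 10 grid
temperature = 280.0 + rng.normal(size=(10, 10))
member = perturb(temperature)
# the relative change at every grid point is bounded by eps
assert np.all(np.abs(member / temperature - 1.0) <= 1e-4)
```

Each ensemble member would use an independent random draw of R.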
</sec>
<sec id="Ch1.S3.SS3">
  <label>3.3</label><title>Statistical hypothesis tests</title>
      <p id="d1e1192">In this study, we have applied three different statistical tests to test the local null hypothesis <inline-formula><mml:math id="M53" display="inline"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mn mathvariant="normal">0</mml:mn><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>: Student's <inline-formula><mml:math id="M54" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test, the Mann–Whitney <inline-formula><mml:math id="M55" display="inline"><mml:mi>U</mml:mi></mml:math></inline-formula> (MWU) test, and the two-sample Kolmogorov–Smirnov (K-S) test. This allows us to see whether some statistical tests might be better suited for some variables than others and how sensitive the methodology is with regard to the underlying test statistics. Unless mentioned otherwise, the MWU test was used as the default test for the results shown in this study.</p>
<sec id="Ch1.S3.SS3.SSS1">
  <label>3.3.1</label><?xmltex \opttitle{Student's $t$ test}?><title>Student's <inline-formula><mml:math id="M56" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test</title>
      <p id="d1e1247">Student's <inline-formula><mml:math id="M57" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test was introduced by William S. Gosset under the pseudonym “Student” <xref ref-type="bibr" rid="bib1.bibx46" id="paren.55"/>, and was originally used to determine the quality of raw material used to make stout for the Guinness Brewery. The independent two-sample <inline-formula><mml:math id="M58" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test has the null hypothesis that the means of two populations <inline-formula><mml:math id="M59" display="inline"><mml:mi>X</mml:mi></mml:math></inline-formula> and <inline-formula><mml:math id="M60" display="inline"><mml:mi>Y</mml:mi></mml:math></inline-formula> are equal. As we use it for the local statistical test, we therefore have the following local null hypothesis:
<list list-type="bullet"><list-item>
      <p id="d1e1284"><inline-formula><mml:math id="M61" display="inline"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mn mathvariant="normal">0</mml:mn><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>: the means <inline-formula><mml:math id="M62" display="inline"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">old</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:math></inline-formula> and <inline-formula><mml:math id="M63" display="inline"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">new</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:math></inline-formula> of the two ensembles are equal.</p></list-item></list>
Here, <inline-formula><mml:math id="M64" display="inline"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">old</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:math></inline-formula> is the sample mean of the variable <inline-formula><mml:math id="M65" display="inline"><mml:mi mathvariant="italic">φ</mml:mi></mml:math></inline-formula> at grid cell <inline-formula><mml:math id="M66" display="inline"><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> from the old model, and <inline-formula><mml:math id="M67" display="inline"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">new</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:math></inline-formula> is the respective sample mean from the new model. The <inline-formula><mml:math id="M68" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> statistic is calculated as
              <disp-formula id="Ch1.E4" content-type="numbered"><label>4</label><mml:math id="M69" display="block"><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo mathvariant="normal">‾</mml:mo></mml:mover><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mi>Y</mml:mi><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi mathvariant="normal">p</mml:mi></mml:msub><mml:msqrt><mml:mstyle displaystyle="false"><mml:mfrac style="text"><mml:mn mathvariant="normal">2</mml:mn><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mstyle></mml:msqrt></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
            with <inline-formula><mml:math id="M70" display="inline"><mml:mover accent="true"><mml:mi>X</mml:mi><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:math></inline-formula> and <inline-formula><mml:math id="M71" display="inline"><mml:mover accent="true"><mml:mi>Y</mml:mi><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:math></inline-formula> being the respective sample means, and assuming equal sample sizes <inline-formula><mml:math id="M72" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>Y</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>. The pooled standard deviation is given as
              <disp-formula id="Ch1.E5" content-type="numbered"><label>5</label><mml:math id="M73" display="block"><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi mathvariant="normal">p</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msqrt><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:msubsup><mml:mi>s</mml:mi><mml:mi>X</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>s</mml:mi><mml:mi>Y</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msubsup></mml:mrow><mml:mn mathvariant="normal">2</mml:mn></mml:mfrac></mml:mstyle></mml:msqrt><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
            where <inline-formula><mml:math id="M74" display="inline"><mml:mrow><mml:msubsup><mml:mi>s</mml:mi><mml:mi>X</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msubsup></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M75" display="inline"><mml:mrow><mml:msubsup><mml:mi>s</mml:mi><mml:mi>Y</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msubsup></mml:mrow></mml:math></inline-formula> are the unbiased estimators of the variances of the two samples. The <inline-formula><mml:math id="M76" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> statistic is then compared against a critical value for a certain significance level <inline-formula><mml:math id="M77" display="inline"><mml:mi mathvariant="italic">α</mml:mi></mml:math></inline-formula> from the Student <inline-formula><mml:math id="M78" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> distribution. For a two-sided test, we reject the local null hypothesis if the absolute value of the <inline-formula><mml:math id="M79" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> statistic is greater than this critical value. Student's <inline-formula><mml:math id="M80" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test requires that the means of the two populations follow a normal distribution and assumes equal variances. However, Student's <inline-formula><mml:math id="M81" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test has been shown to be quite robust to violations of both the normality assumption and, provided the sample sizes are equal, the assumption of equal variance <xref ref-type="bibr" rid="bib1.bibx4 bib1.bibx33" id="paren.56"/>. 
<xref ref-type="bibr" rid="bib1.bibx47" id="text.57"/> showed that Student's <inline-formula><mml:math id="M82" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test even provided meaningful results in the presence of floor effects of the distribution (i.e., where a value can be a minimum of zero).</p>
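Eqs. (4) and (5) can be reproduced directly. The sketch below uses synthetic data and scipy's t distribution for the critical value; the choice of 2 n_S − 2 degrees of freedom is our assumption for the equal-size pooled test, and the data and seed are invented:

```python
import numpy as np
from scipy import stats

def two_sample_t(x, y, alpha=0.05):
    n = x.size
    # pooled standard deviation, Eq. (5)
    s_p = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2.0)
    # t statistic for equal sample sizes, Eq. (4)
    t = (x.mean() - y.mean()) / (s_p * np.sqrt(2.0 / n))
    # two-sided test: reject if |t| exceeds the critical value
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=2 * n - 2)
    return t, abs(t) > t_crit

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 50)
y = rng.normal(0.0, 1.0, 50)
t, reject = two_sample_t(x, y)
# cross-check against the library implementation
t_ref, _ = stats.ttest_ind(x, y, equal_var=True)
assert np.isclose(t, t_ref)
```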
</sec>
<sec id="Ch1.S3.SS3.SSS2">
  <label>3.3.2</label><?xmltex \opttitle{Mann--Whitney $U$ test}?><title>Mann–Whitney <inline-formula><mml:math id="M83" display="inline"><mml:mi>U</mml:mi></mml:math></inline-formula> test</title>
      <p id="d1e1666">The Mann–Whitney <inline-formula><mml:math id="M84" display="inline"><mml:mi>U</mml:mi></mml:math></inline-formula> (MWU) test (also known as the Wilcoxon rank-sum test) was introduced by <xref ref-type="bibr" rid="bib1.bibx26" id="text.58"/> and is a nonparametric test in the sense that no assumption is made concerning the distribution of the variables. The null hypothesis is that, for randomly selected values <inline-formula><mml:math id="M85" display="inline"><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M86" display="inline"><mml:mrow><mml:msub><mml:mi>Y</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> from two populations, the probability of <inline-formula><mml:math id="M87" display="inline"><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> being greater than <inline-formula><mml:math id="M88" display="inline"><mml:mrow><mml:msub><mml:mi>Y</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is equal to the probability of <inline-formula><mml:math id="M89" display="inline"><mml:mrow><mml:msub><mml:mi>Y</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> being greater than <inline-formula><mml:math id="M90" display="inline"><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>. It therefore does not test exactly the same property as Student's <inline-formula><mml:math id="M91" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test (the means of the two populations are equal), even though it is often compared to it. In our case, the local null hypothesis test for the MWU test is the following:
<list list-type="bullet"><list-item>
      <p id="d1e1755"><inline-formula><mml:math id="M92" display="inline"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mn mathvariant="normal">0</mml:mn><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>: the probability of <inline-formula><mml:math id="M93" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">old</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mo>&gt;</mml:mo><mml:msubsup><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">new</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>l</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> is equal to the probability of <inline-formula><mml:math id="M94" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">old</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mo>&lt;</mml:mo><mml:msubsup><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">new</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>l</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>.</p></list-item></list>
Here, <inline-formula><mml:math id="M95" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">old</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>k</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M96" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">new</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mi>l</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> are the values of the variable <inline-formula><mml:math id="M97" display="inline"><mml:mi mathvariant="italic">φ</mml:mi></mml:math></inline-formula> at location <inline-formula><mml:math id="M98" display="inline"><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> from randomly selected members <inline-formula><mml:math id="M99" display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> and <inline-formula><mml:math id="M100" display="inline"><mml:mi>l</mml:mi></mml:math></inline-formula> of the samples from the old and new models, respectively. The MWU test ranks all the observations (from both samples combined into one set) and then sums the ranks of the observations from the respective samples, resulting in <inline-formula><mml:math id="M101" display="inline"><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mi>X</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M102" display="inline"><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mi>Y</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>. 
<inline-formula><mml:math id="M103" display="inline"><mml:mrow><mml:msub><mml:mi>U</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is calculated as
              <disp-formula id="Ch1.E6" content-type="numbered"><label>6</label><mml:math id="M104" display="block"><mml:mrow><mml:msub><mml:mi>U</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo movablelimits="false">min⁡</mml:mo><mml:mfenced open="(" close=")"><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mi>X</mml:mi></mml:msub><mml:mo>-</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mn mathvariant="normal">2</mml:mn></mml:mfrac></mml:mstyle><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:msub><mml:mi>R</mml:mi><mml:mi>Y</mml:mi></mml:msub><mml:mo>-</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>Y</mml:mi></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>Y</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>)</mml:mo></mml:mrow><mml:mn mathvariant="normal">2</mml:mn></mml:mfrac></mml:mstyle></mml:mrow></mml:mfenced><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
            where <inline-formula><mml:math id="M105" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M106" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>Y</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> are the respective sample sizes, which are equal in our case (<inline-formula><mml:math id="M107" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>Y</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>). This value is then compared with a critical value <inline-formula><mml:math id="M108" display="inline"><mml:mrow><mml:msub><mml:mi>U</mml:mi><mml:mi mathvariant="normal">crit</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> from a table for a given significance level <inline-formula><mml:math id="M109" display="inline"><mml:mi mathvariant="italic">α</mml:mi></mml:math></inline-formula>. For larger samples (<inline-formula><mml:math id="M110" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>&gt;</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula>), <inline-formula><mml:math id="M111" display="inline"><mml:mrow><mml:msub><mml:mi>U</mml:mi><mml:mi mathvariant="normal">crit</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is derived from a normal approximation of the distribution of the test statistic. If <inline-formula><mml:math id="M112" display="inline"><mml:mrow><mml:msub><mml:mi>U</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub><mml:mo>≤</mml:mo><mml:msub><mml:mi>U</mml:mi><mml:mi mathvariant="normal">crit</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, the null hypothesis is rejected. 
As a nonparametric test, the MWU test makes no strong distributional assumptions and only requires the responses to be ordinal (i.e., comparable with <inline-formula><mml:math id="M113" display="inline"><mml:mo>&lt;</mml:mo></mml:math></inline-formula>, <inline-formula><mml:math id="M114" display="inline"><mml:mo>=</mml:mo></mml:math></inline-formula>, <inline-formula><mml:math id="M115" display="inline"><mml:mo>&gt;</mml:mo></mml:math></inline-formula>). <xref ref-type="bibr" rid="bib1.bibx61" id="text.59"/> showed that, given equal sample sizes, the MWU test is slightly less powerful than Student's <inline-formula><mml:math id="M116" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test, even if the variances are not equal. This means that the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true is expected to be slightly lower. Nevertheless, when comparing these tests, it is important to remember that they are based on different null hypotheses and thus do not test the same properties.</p>
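As an illustration of how such a local test can be carried out in practice, the following minimal Python sketch (not the verification code used in this study; the sample values are synthetic) applies SciPy's MWU implementation, which computes a p value directly instead of comparing <inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi>U</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> with a tabulated critical value:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic subsamples: n_S = 20 members of one variable at a single
# grid cell (i, j) from the old and new model ensembles.
n_s = 20
phi_old = rng.normal(loc=285.0, scale=1.0, size=n_s)  # e.g., 850 hPa temperature [K]
phi_new = rng.normal(loc=285.0, scale=1.0, size=n_s)

# Two-sided test of the local null hypothesis H0(i, j); SciPy returns
# the U statistic of the first sample and the associated p value.
alpha = 0.05
u_stat, p_value = mannwhitneyu(phi_old, phi_new, alternative="two-sided")
reject = p_value < alpha
```

Repeating such a test for each grid cell and each randomly drawn subsample pair yields the per-cell rejection rates used later in this paper.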
</sec>
<sec id="Ch1.S3.SS3.SSS3">
  <label>3.3.3</label><title>Two-sample Kolmogorov–Smirnov test</title>
      <p id="d1e2221">The two-sample Kolmogorov–Smirnov (K-S) test is a nonparametric test with the null hypothesis that the samples are drawn from the same distribution. Our local null hypothesis is therefore the following:
<list list-type="bullet"><list-item>
      <p id="d1e2226"><inline-formula><mml:math id="M117" display="inline"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mn mathvariant="normal">0</mml:mn><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>: <inline-formula><mml:math id="M118" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">old</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M119" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">new</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> are drawn from the same distribution.</p></list-item></list>
Here, <inline-formula><mml:math id="M120" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">old</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M121" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">φ</mml:mi><mml:mrow><mml:mi mathvariant="normal">new</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> are the samples of the variable <inline-formula><mml:math id="M122" display="inline"><mml:mi mathvariant="italic">φ</mml:mi></mml:math></inline-formula> at location <inline-formula><mml:math id="M123" display="inline"><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> from the old and new models, respectively. The K-S test statistic is given as
              <disp-formula id="Ch1.E7" content-type="numbered"><label>7</label><mml:math id="M124" display="block"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>Y</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="false">sup⁡</mml:mo><mml:mi>x</mml:mi></mml:munder><mml:mo>|</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>Y</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>Y</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo><mml:mo>|</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula>
            Here, <inline-formula><mml:math id="M125" display="inline"><mml:mo>sup⁡</mml:mo></mml:math></inline-formula> is the supremum function and <inline-formula><mml:math id="M126" display="inline"><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M127" display="inline"><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>Y</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>Y</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> are the empirical distribution functions of the two samples <inline-formula><mml:math id="M128" display="inline"><mml:mi>X</mml:mi></mml:math></inline-formula> and <inline-formula><mml:math id="M129" display="inline"><mml:mi>Y</mml:mi></mml:math></inline-formula>, where
              <disp-formula id="Ch1.E8" content-type="numbered"><label>8</label><mml:math id="M130" display="block"><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mn mathvariant="normal">1</mml:mn><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mstyle><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub></mml:mrow></mml:munderover><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mo>-</mml:mo><mml:mi mathvariant="normal">∞</mml:mi><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
            with the indicator function <inline-formula><mml:math id="M131" display="inline"><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mo>-</mml:mo><mml:mi mathvariant="normal">∞</mml:mi><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo>]</mml:mo></mml:mrow></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, which is equal to 1 if <inline-formula><mml:math id="M132" display="inline"><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>≤</mml:mo><mml:mi>x</mml:mi></mml:mrow></mml:math></inline-formula> and zero otherwise. The null hypothesis is rejected if
              <disp-formula id="Ch1.E9" content-type="numbered"><label>9</label><mml:math id="M133" display="block"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>Y</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>&gt;</mml:mo><mml:mi>c</mml:mi><mml:mo>(</mml:mo><mml:mi mathvariant="italic">α</mml:mi><mml:mo>)</mml:mo><mml:msqrt><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>Y</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>X</mml:mi></mml:msub><mml:mo>⋅</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mi>Y</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mstyle></mml:msqrt><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
            where <inline-formula><mml:math id="M134" display="inline"><mml:mrow><mml:mi>c</mml:mi><mml:mo>(</mml:mo><mml:mi mathvariant="italic">α</mml:mi><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:mo>-</mml:mo><mml:mi>ln⁡</mml:mi><mml:mo>(</mml:mo><mml:mstyle displaystyle="false"><mml:mfrac style="text"><mml:mi mathvariant="italic">α</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:mfrac></mml:mstyle><mml:mo>)</mml:mo><mml:mo>⋅</mml:mo><mml:mstyle displaystyle="false"><mml:mfrac style="text"><mml:mn mathvariant="normal">1</mml:mn><mml:mn mathvariant="normal">2</mml:mn></mml:mfrac></mml:mstyle></mml:mrow></mml:msqrt></mml:mrow></mml:math></inline-formula> for a given significance level <inline-formula><mml:math id="M135" display="inline"><mml:mi mathvariant="italic">α</mml:mi></mml:math></inline-formula>. The K-S test is generally considered less powerful than, for example, Student's <inline-formula><mml:math id="M136" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test for comparing means and measures of location in general <xref ref-type="bibr" rid="bib1.bibx54" id="paren.60"/>. However, due to its different null hypothesis, it may be better suited to testing a distribution's shape or spread.</p>
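Equations (7)–(9) translate almost directly into code. The following sketch (synthetic samples; not the code used in this study) evaluates both empirical distribution functions at the pooled sample points, where the supremum in Eq. (7) is attained:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Synthetic samples of one variable at a single grid cell (i, j)
# from the old and new model ensembles.
x = rng.normal(size=50)
y = rng.normal(size=50)

# Eq. (7): D is the largest distance between the two empirical CDFs;
# since both are step functions, the supremum occurs at a sample point.
pooled = np.sort(np.concatenate([x, y]))
f_x = np.searchsorted(np.sort(x), pooled, side="right") / x.size
f_y = np.searchsorted(np.sort(y), pooled, side="right") / y.size
d = np.max(np.abs(f_x - f_y))

# Sanity check against SciPy's reference implementation.
assert np.isclose(d, ks_2samp(x, y).statistic)

# Eq. (9): reject the local null hypothesis if D exceeds the critical
# value, with c(alpha) = sqrt(-ln(alpha / 2) / 2).
alpha = 0.05
c_alpha = np.sqrt(-np.log(alpha / 2.0) / 2.0)
reject = d > c_alpha * np.sqrt((x.size + y.size) / (x.size * y.size))
```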
</sec>
</sec>
<sec id="Ch1.S3.SS4">
  <label>3.4</label><title>Model description and hardware</title>
      <p id="d1e2755">The Consortium for Small-scale Modelling (COSMO) model <xref ref-type="bibr" rid="bib1.bibx3" id="paren.61"/> is a regional model that operates on a grid with rotated latitude–longitude coordinates. It was originally developed for numerical weather prediction but has been extended to also run in climate mode <xref ref-type="bibr" rid="bib1.bibx38" id="paren.62"/>. COSMO uses a split explicit third-order Runge–Kutta discretization <xref ref-type="bibr" rid="bib1.bibx53" id="paren.63"/> in combination with a fifth-order upwind scheme for horizontal advection and an implicit Crank–Nicolson scheme for vertical advection. Parameterizations include a radiation scheme based on the <inline-formula><mml:math id="M137" display="inline"><mml:mi mathvariant="italic">δ</mml:mi></mml:math></inline-formula>-two-stream approach <xref ref-type="bibr" rid="bib1.bibx37" id="paren.64"/>, a single-moment cloud microphysics scheme <xref ref-type="bibr" rid="bib1.bibx36" id="paren.65"/>, a turbulent-kinetic-energy-based parameterization for the planetary boundary layer <xref ref-type="bibr" rid="bib1.bibx34" id="paren.66"/>, an adapted version of the convection scheme by <xref ref-type="bibr" rid="bib1.bibx49" id="text.67"/>, a subgrid-scale orography (SSO) scheme by <xref ref-type="bibr" rid="bib1.bibx22" id="text.68"/>, and a multilayer soil model with a representation of groundwater <xref ref-type="bibr" rid="bib1.bibx44" id="paren.69"/>. Explicit horizontal diffusion is applied by using a monotonic fourth-order linear scheme acting on model levels for wind, temperature, pressure, specific humidity, and cloud water content <xref ref-type="bibr" rid="bib1.bibx12" id="paren.70"/> with an orographic limiter that helps to avoid excessive vertical mixing around mountains. For the standard experiments in this paper, the explicit diffusion from the monotonic fourth-order linear scheme is set to zero.</p>
      <p id="d1e2796">Most experiments in this work have been carried out with version 5.09. While COSMO was originally designed to run on CPU architectures, this version is also able to run on hybrid GPU-CPU architectures thanks to an implementation described in <xref ref-type="bibr" rid="bib1.bibx15" id="text.71"/>, which was a joint effort from MeteoSwiss, the ETH-based Center for Climate Systems Modeling (C2SM), and the Swiss National Supercomputing Center (CSCS). The implementation uses the domain-specific language GridTools for the dynamical core and OpenACC compiler directives for the parameterization package. The simulations were carried out on the Piz Daint supercomputer at CSCS, using Cray XC50 compute nodes consisting of an Intel Xeon E5-2690 v3 CPU and an NVIDIA Tesla P100 GPU. Except for one ensemble that was created with a COSMO binary that exclusively uses CPUs, all simulations in this paper were run in hybrid GPU-CPU mode, where the GPUs perform the main load of the work.</p>
</sec>
<sec id="Ch1.S3.SS5">
  <label>3.5</label><title>Domain and setup</title>
      <p id="d1e2811">The domains that have been used for the simulation and verification include most of Europe and some parts of northern Africa (see Fig. <xref ref-type="fig" rid="Ch1.F2"/>). The simulated periods all start on 28 May 2018 at 00:00 UTC and range from several days to 3 months in length. The initial and the 6-hourly boundary conditions come from the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA-Interim reanalysis <xref ref-type="bibr" rid="bib1.bibx11" id="paren.72"/>. For this work, we have chosen a <inline-formula><mml:math id="M138" display="inline"><mml:mrow><mml:mn mathvariant="normal">132</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">129</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">40</mml:mn></mml:mrow></mml:math></inline-formula> grid with 50 km horizontal grid spacing and 40 nonequidistant vertical levels reaching up to a height of 22.7 km. In order to reduce the effect of the lateral boundary conditions, we excluded 15 grid points at each of the lateral boundaries from the verification, resulting in <inline-formula><mml:math id="M139" display="inline"><mml:mrow><mml:mn mathvariant="normal">102</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">99</mml:mn></mml:mrow></mml:math></inline-formula> grid points for one vertical layer. As the verification methodology is supposed to be used as a part of an automated testing environment, we have chosen this relatively coarse resolution in order to keep the computational and storage costs low. Running such a simulation for 10 d requires about 4 min on one Cray XC50 compute node when using the GPU-accelerated version of COSMO in double precision. This means that an ensemble of 50 members requires 3–4 node hours. However, as the runs can be executed in parallel, the generation of the ensemble requires only a matter of minutes.</p>
</sec>
<sec id="Ch1.S3.SS6">
  <label>3.6</label><title>Experiments</title>
      <p id="d1e2855">In order to test and demonstrate the methodology, we have performed a series of experiments. Many of these experiments are for cases where we deliberately changed the model. However, we also have one real-world case where we verified the effect of a major update of Piz Daint, the supercomputer on which we have been running our model.</p>
<sec id="Ch1.S3.SS6.SSS1">
  <label>3.6.1</label><title>Diffusion experiment</title>
      <p id="d1e2865">COSMO offers the possibility of applying explicit diffusion with a monotonic fourth-order linear scheme with an orographic limiter acting on model levels for wind, temperature, pressure, specific humidity, and cloud water content. Diffusion is applied by introducing an additional operator on the right-hand side of the prognostic equation, similar to
              <disp-formula id="Ch1.E10" content-type="numbered"><label>10</label><mml:math id="M140" display="block"><mml:mrow><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:mo>∂</mml:mo><mml:mi mathvariant="italic">ψ</mml:mi></mml:mrow><mml:mrow><mml:mo>∂</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mo>(</mml:mo><mml:mi mathvariant="italic">ψ</mml:mi><mml:mo>)</mml:mo><mml:mo>+</mml:mo><mml:mi>D</mml:mi><mml:mo>⋅</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi mathvariant="normal">d</mml:mi></mml:msub><mml:mo>⋅</mml:mo><mml:msup><mml:mi mathvariant="normal">∇</mml:mi><mml:mn mathvariant="normal">4</mml:mn></mml:msup><mml:mi mathvariant="italic">ψ</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
            where <inline-formula><mml:math id="M141" display="inline"><mml:mi mathvariant="italic">ψ</mml:mi></mml:math></inline-formula> is the prognostic variable, <inline-formula><mml:math id="M142" display="inline"><mml:mi>S</mml:mi></mml:math></inline-formula> represents all physical and dynamical source terms for <inline-formula><mml:math id="M143" display="inline"><mml:mi mathvariant="italic">ψ</mml:mi></mml:math></inline-formula>, <inline-formula><mml:math id="M144" display="inline"><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mi mathvariant="normal">d</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the default diffusion coefficient in the model, and <inline-formula><mml:math id="M145" display="inline"><mml:mi>D</mml:mi></mml:math></inline-formula> is the factor that can be set in order to change the strength of the computational mixing <xref ref-type="bibr" rid="bib1.bibx12" id="paren.73"><named-content content-type="pre">please refer to Sect. 5.2 in</named-content><named-content content-type="post">for the exact equations including the limiter</named-content></xref>. By default, we have set <inline-formula><mml:math id="M146" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0</mml:mn></mml:mrow></mml:math></inline-formula>, which means that no explicit fourth-order linear diffusion is applied. However, for some experiments we have used <inline-formula><mml:math id="M147" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>∈</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:mn mathvariant="normal">0.01</mml:mn><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mn mathvariant="normal">0.005</mml:mn><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mn mathvariant="normal">0.001</mml:mn><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula>. 
Such small values should neither affect the model results visibly nor be easily quantifiable without statistical testing. A value of <inline-formula><mml:math id="M148" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1.0</mml:mn></mml:mrow></mml:math></inline-formula> reduces the amplitude of <inline-formula><mml:math id="M149" display="inline"><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mi mathvariant="normal">Δ</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:math></inline-formula> waves by about a factor of <inline-formula><mml:math id="M150" display="inline"><mml:mrow><mml:mn mathvariant="normal">1</mml:mn><mml:mo>/</mml:mo><mml:mn mathvariant="normal">3</mml:mn></mml:mrow></mml:math></inline-formula> per time step. For such a high value, the model results change visibly <xref ref-type="bibr" rid="bib1.bibx60" id="paren.74"/>.</p>
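To illustrate the damping term in Eq. (10), the following is a minimal 1-D sketch of an explicit fourth-order diffusion step (nondimensional, with an illustrative coefficient; the actual COSMO scheme is three-dimensional, acts on several variables, and includes the orographic limiter):

```python
import numpy as np

def fourth_order_diffusion_step(psi, coeff):
    """One explicit step psi <- psi - coeff * del^4(psi) on a periodic
    1-D grid; a sketch of the damping operator in Eq. (10) only."""
    lap = np.roll(psi, -1) - 2.0 * psi + np.roll(psi, 1)  # discrete del^2
    return psi - coeff * (np.roll(lap, -1) - 2.0 * lap + np.roll(lap, 1))

# A 2*dx wave is the most strongly damped mode: for it, the discrete
# del^4 operator has eigenvalue 16, so one step multiplies the amplitude
# by (1 - 16 * coeff); coeff = 1/48 leaves 2/3 of the amplitude.
n = 64
wave_2dx = (-1.0) ** np.arange(n)  # +1, -1, +1, ... pattern
damped = fourth_order_diffusion_step(wave_2dx, coeff=1.0 / 48.0)
```

Smoother (longer-wavelength) modes have much smaller del^4 eigenvalues and are left nearly untouched, which is why such diffusion selectively removes grid-scale noise.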
</sec>
<sec id="Ch1.S3.SS6.SSS2">
  <label>3.6.2</label><title>Architecture: CPU vs. GPU</title>
      <p id="d1e3052">By default, the simulations shown in this work have been performed with a COSMO binary that makes use of the NVIDIA Tesla P100 GPU on the Cray XC50 nodes (see Sect. <xref ref-type="sec" rid="Ch1.S3.SS4"/> for details). For this experiment, we have produced an ensemble from an identical source and with identical settings but compiled it to run exclusively on the Intel Xeon E5-2690 v3 CPUs in order to see whether there is a noticeable difference between the CPU version and the GPU version of COSMO.</p>
</sec>
<sec id="Ch1.S3.SS6.SSS3">
  <label>3.6.3</label><title>Floating-point precision</title>
      <p id="d1e3065">In this work, COSMO has used the double-precision (DP) floating-point format by default, in which the representation of a floating-point number requires 64 bits. However, COSMO can also be run with a 32-bit single-precision (SP) floating-point representation. The SP version was developed by MeteoSwiss and is currently used for their operational forecasts. MeteoSwiss decided to use the SP version after carefully evaluating its performance against the DP version, an evaluation that indicated only very small differences. Nevertheless, a reduction in precision leads to greater round-off errors and could thus lead to a noticeable change in model behavior. In order to see whether our methodology is able to detect differences, we have applied it to a case where the evaluation ensemble was produced by the SP version of COSMO and the control and reference ensembles were produced by the DP version. It should be noted that in the SP version of COSMO, the soil model and parts of the radiation model still use double precision, as some discrepancies were detected there during the development of the SP version.</p>
      <p id="d1e3068">Running COSMO on one node in single precision, where a floating-point number requires only 32 bits, gives a speedup of around 1.1 for our simulations, most likely due to the increased operational intensity (the ratio of floating-point operations to bytes transferred between cache and memory). When running on more than one node, it is often possible to reduce the total number of nodes for the same setup when switching to single precision, thanks to a drastic reduction in required memory. For example, a model domain and resolution that usually requires four nodes in double precision (e.g., the same domain as in this paper, but with 12 km instead of 50 km grid spacing) often requires only two nodes in single precision. This results in a coarser domain decomposition and thus fewer overlapping grid cells whose values have to be exchanged between the nodes. Combined with the reduced size of the floating-point values that have to be exchanged, a significant reduction in data transfer via the interconnect can be achieved, increasing the system's efficiency. While running in SP on only two nodes might be slower than running the same simulation in DP on four nodes, it requires fewer node hours. In this particular case (four nodes for DP vs. two nodes for SP), the speedup in node hours was around 1.4, which makes the use of single precision an attractive option.</p>
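The memory halving that enables this coarser domain decomposition can be illustrated for a single prognostic field on the grid dimensions used in this paper (a sketch only; the actual COSMO memory layout is more involved):

```python
import numpy as np

# One prognostic field on the 132 x 129 x 40 grid used in this paper.
shape = (132, 129, 40)
field_dp = np.zeros(shape, dtype=np.float64)  # double precision, 8 bytes/value
field_sp = np.zeros(shape, dtype=np.float32)  # single precision, 4 bytes/value

# Switching to SP halves both the memory footprint of each field and
# the number of bytes exchanged per halo grid cell.
assert field_sp.nbytes * 2 == field_dp.nbytes
```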
</sec>
<sec id="Ch1.S3.SS6.SSS4">
  <label>3.6.4</label><title>Vertical heat diffusion coefficient and soil effects</title>
      <p id="d1e3080">In order to test the methodology for slow processes related to the hydrological cycle, we have set up an experiment where we induce a relatively small but still notable change. One parameter that has been deemed important to the COSMO model calibration by <xref ref-type="bibr" rid="bib1.bibx6" id="text.75"/> is the minimal diffusion coefficient for vertical scalar heat transport <inline-formula><mml:math id="M151" display="inline"><mml:mrow><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:msub><mml:mi>h</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>. It basically sets a lower bound for the respective coefficient used in the 1D turbulent kinetic energy (TKE)-based subgrid-scale turbulence scheme <xref ref-type="bibr" rid="bib1.bibx13" id="paren.76"/>. By default, we have used a value of <inline-formula><mml:math id="M152" display="inline"><mml:mrow><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:msub><mml:mi>h</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.35</mml:mn></mml:mrow></mml:math></inline-formula> for our simulations, but for this evaluation ensemble we have changed it to <inline-formula><mml:math id="M153" display="inline"><mml:mrow><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:msub><mml:mi>h</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.3</mml:mn></mml:mrow></mml:math></inline-formula>. 
This is not a drastic change: for example, the default value in COSMO is <inline-formula><mml:math id="M154" display="inline"><mml:mrow><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:msub><mml:mi>h</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1.0</mml:mn></mml:mrow></mml:math></inline-formula>, whereas the German Weather Service (DWD, Deutscher Wetterdienst) uses <inline-formula><mml:math id="M155" display="inline"><mml:mrow><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:msub><mml:mi>h</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.4</mml:mn></mml:mrow></mml:math></inline-formula> for its operational model with 2.8 km grid spacing <xref ref-type="bibr" rid="bib1.bibx43" id="paren.77"/>. The goal of this experiment is to see whether such a change becomes detectable in the slowly changing soil moisture variable and, if it does, how long the signal takes to propagate through the different soil layers.</p>
</sec>
<sec id="Ch1.S3.SS6.SSS5">
  <label>3.6.5</label><title>No subgrid-scale orography parameterization</title>
      <p id="d1e3192">So far, the experiments have been set up for cases with only slight model changes. In order to see whether the methodology is able to confidently reject results from significantly different models, we have applied it to an evaluation ensemble for which the subgrid-scale orography (SSO) parameterization by <xref ref-type="bibr" rid="bib1.bibx22" id="text.78"/> was switched off. At a grid spacing of 50 km, orography cannot be realistically represented in a model, which is why the parameterization should be switched on in order to account for orographic form drag and gravity wave drag effects. <xref ref-type="bibr" rid="bib1.bibx56" id="text.79"/> and <xref ref-type="bibr" rid="bib1.bibx40" id="text.80"/> both showed improvements in short- and medium-range forecasts with an SSO parameterization based on the formulation by <xref ref-type="bibr" rid="bib1.bibx22" id="text.81"/> for the Canadian Global Environmental Multiscale (GEM) model and the ECMWF Integrated Forecast System (IFS), respectively. <xref ref-type="bibr" rid="bib1.bibx32" id="text.82"/> showed that the parameterization was able to significantly reduce biases in large-scale pressure gradients and zonal wind speeds in climate runs with the general circulation model ECHAM6. We therefore expect the test to clearly reject the global null hypothesis not only within the first few days but also over a longer period, which is why we use model runs of 90 d for this experiment.</p>
</sec>
<sec id="Ch1.S3.SS6.SSS6">
  <label>3.6.6</label><title>Piz Daint update</title>
      <p id="d1e3218">The supercomputer Piz Daint at the Swiss National Supercomputing Center (CSCS) recently received two major updates (on 9 September 2020 and 16 March 2021). The major changes affecting COSMO were new versions of the Cray Programming Toolkit (CDT), which changed its compilation environment. The new version is CDT 20.08, whereas the version before the first update in September 2020 was CDT 19.08. Both changes were associated with the loss of bit-identical execution. Using containers, CSCS created a testing environment that replicated the environment before the first update on 9 September 2020 with CDT 19.08. With this environment, we could reproduce the results from runs before the update in a bit-identical way. Thus, by comparing the output of this containerized version to the output of the executable compiled in the updated environment with CDT 20.08, we were able to apply our methodology to a realistic scenario with typical changes in a model development context. Indeed, the system upgrade of the Piz Daint software environment was the motivation for the current study.</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F2"><?xmltex \currentcnt{2}?><?xmltex \def\figurename{Figure}?><label>Figure 2</label><caption><p id="d1e3223">Panels <bold>(a, b, c)</bold> show the ensemble-mean 850 hPa temperature (color shading) and 500 hPa geopotential height (white contours) for the control ensemble <bold>(a, d)</bold> and diffusion ensemble with <inline-formula><mml:math id="M156" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.01</mml:mn></mml:mrow></mml:math></inline-formula> <bold>(b, e)</bold> after 24 h, using <inline-formula><mml:math id="M157" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">50</mml:mn></mml:mrow></mml:math></inline-formula> members per ensemble. The difference in mean temperature is shown in <bold>(c)</bold>. Panels <bold>(d, e, f)</bold> show the mean rejection rate for the 850 hPa temperature (calculated with the MWU test for <inline-formula><mml:math id="M158" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula> subsamples with <inline-formula><mml:math id="M159" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula> members per subsample) for each grid cell for these two ensembles, as well as the difference in rejection rate between them. The substantial differences in mean rejection rate indicate clearly that the two ensembles come from different models.</p></caption>
            <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022-f02.png"/>

          </fig>

<?xmltex \hack{\newpage}?>
</sec>
</sec>
</sec>
<sec id="Ch1.S4">
  <label>4</label><title>Results</title>
<sec id="Ch1.S4.SS1">
  <label>4.1</label><title>Diffusion experiment</title>
      <p id="d1e3321">Here, we discuss the results from the diffusion experiment described in Sect. <xref ref-type="sec" rid="Ch1.S3.SS6.SSS1"/>. Figure <xref ref-type="fig" rid="Ch1.F2"/> shows why such a statistical approach to verification is important. By just looking at the mean values of the ensembles and their differences (in the 850 hPa temperature in this case), it is impossible to say whether the two ensembles come from the same distribution. There are some small differences, but these could also be a product of internal variability, and the tiny amount of additional explicit diffusion in the diffusion ensemble (<inline-formula><mml:math id="M160" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.01</mml:mn></mml:mrow></mml:math></inline-formula>) is not visible to the eye. However, the mean rejection rates calculated with the methodology are clearly higher for the diffusion ensemble in some places in comparison to the control, indicating that the ensembles do not come from the same model. The sensitivity of the methodology becomes even more apparent when we compare the mean rejection rate for the 500 hPa geopotential of the diffusion ensemble with <inline-formula><mml:math id="M161" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.005</mml:mn></mml:mrow></mml:math></inline-formula> to the 0.95 quantile of the control at the bottom of Fig. <xref ref-type="fig" rid="Ch1.F3"/>. The methodology can reject the global null hypothesis for the first 60 h. After that, it is no longer able to do so, which indicates that from this point on, the effect of internal variability is greater than that of the additional explicit diffusion.</p>
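The decision rule for the global null hypothesis can be sketched as follows (the beta-distributed rejection rates are synthetic stand-ins for actual test results; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic per-subsample rejection rates: the fraction of grid cells at
# which the local null hypothesis is rejected, for m = 100 subsamples.
m = 100
control_rates = rng.beta(2, 18, size=m)  # control ensemble vs. reference
eval_rates = rng.beta(2, 18, size=m)     # evaluation ensemble vs. reference

# Reject H0(global) if the evaluation ensemble's mean rejection rate
# exceeds the 0.95 quantile of the control's rejection rate distribution.
threshold = np.quantile(control_rates, 0.95)
reject_global = eval_rates.mean() > threshold
```

With both sets of rates drawn from the same distribution, as here, the rule should usually not reject, mirroring the behavior expected for two ensembles produced by the same model.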

      <?xmltex \floatpos{t}?><fig id="Ch1.F3"><?xmltex \currentcnt{3}?><?xmltex \def\figurename{Figure}?><label>Figure 3</label><caption><p id="d1e3356">Rejection rates and decisions for <inline-formula><mml:math id="M162" display="inline"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mrow><mml:mn mathvariant="normal">0</mml:mn><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mo>(</mml:mo><mml:mi mathvariant="normal">global</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> for the 500 hPa geopotential using the MWU test as an underlying statistical hypothesis test with an ensemble size of <inline-formula><mml:math id="M163" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M164" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula> randomly drawn subsamples with a subsample size of <inline-formula><mml:math id="M165" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">50</mml:mn></mml:mrow></mml:math></inline-formula>. The reference and control ensembles were produced by COSMO running on <bold>(a)</bold> GPUs in double precision, whereas the evaluation ensembles were produced by COSMO running on <bold>(b)</bold> CPUs in double precision, <bold>(c)</bold> GPUs in single precision, and <bold>(d)</bold> GPUs in double precision with additional explicit diffusion (<inline-formula><mml:math id="M166" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.005</mml:mn></mml:mrow></mml:math></inline-formula>). 
We reject the null hypothesis if the mean rejection rate is above the 95th percentile of the rejection rate distribution from the control ensemble (dotted red line). The test detects no differences for the CPU version in DP but detects differences for the other two ensembles during the first few hours or days. The rejection of the SP ensemble at the initial conditions is most likely associated with differences in the diagnostic calculation of the geopotential due to the reduced precision.</p></caption>
          <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022-f03.png"/>

        </fig>

      <p id="d1e3451">In Fig. <xref ref-type="fig" rid="Ch1.F3"/> (top panel), we can also see that the mean rejection rate of the control is very close to the expected value of 5 %, which is the significance level <inline-formula><mml:math id="M167" display="inline"><mml:mi mathvariant="italic">α</mml:mi></mml:math></inline-formula> of the underlying MWU test. However, the rejection rates of some samples in the control deviate considerably from 5 %, even though the results come from an identical model. Generally, the spread of the rejection rates also becomes larger over time, which is likely related to changes in spatial correlation and/or decreasing predictability. While the initial perturbations are random and therefore not spatially correlated, statistical independence is already lost after the first time step, as the perturbation of a value in one grid cell naturally affects the corresponding values in neighboring grid cells. This increasing spread emphasizes the importance of such a control rejection rate for the decision on the evaluation ensemble.</p>
      <p id="d1e3464">The first two columns of Fig. <xref ref-type="fig" rid="Ch1.F4"/> show the global decisions for 16 output fields in the diffusion experiments with <inline-formula><mml:math id="M168" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.005</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M169" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.001</mml:mn></mml:mrow></mml:math></inline-formula>. We believe that such a set of variables offers a good representation of the most important processes in an atmospheric model (i.e., dynamics, radiation, microphysics, surface fluxes), and, considering the often high correlation between different variables, is therefore likely sufficient to detect all but the tiniest changes in a model. While all variables seem to be affected for the ensemble with the larger diffusion coefficient, the smaller diffusion coefficient leads to a smaller but still noticeable number of rejections for many variables.</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F4" specific-use="star"><?xmltex \currentcnt{4}?><?xmltex \def\figurename{Figure}?><label>Figure 4</label><caption><p id="d1e3495">Global decisions for several variables for two ensembles with additional explicit diffusion <inline-formula><mml:math id="M170" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.005</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M171" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.001</mml:mn></mml:mrow></mml:math></inline-formula>, respectively, for the single-precision ensemble and for an ensemble from the CPU version of COSMO. The decisions shown have been produced with the MWU test as
underlying statistical hypothesis test and <inline-formula><mml:math id="M172" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M173" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">50</mml:mn></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M174" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>. A smaller diffusion coefficient clearly leads to fewer rejections. The CPU ensemble shows no rejections for the tested variables, meaning that the GPU and CPU executables cannot be distinguished.</p></caption>
          <?xmltex \igopts{width=455.244094pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022-f04.png"/>

        </fig>

<?xmltex \hack{\newpage}?>
</sec>
<sec id="Ch1.S4.SS2">
  <label>4.2</label><title>Architecture: CPU vs. GPU</title>
      <p id="d1e3580">The COSMO executable running on CPUs does not lead to any global rejections when compared against the executable running mainly on GPUs, as exemplified for the 500 hPa geopotential in Fig. <xref ref-type="fig" rid="Ch1.F3"/> and for all 16 tested variables in the fourth column of Fig. <xref ref-type="fig" rid="Ch1.F4"/>. Thus, while the results are not bit-identical, we consider the difference between these two executables to be negligible. This confirms that the GPU implementation of the COSMO model is of very high quality, as its output cannot be statistically distinguished from that of the original CPU implementation. This is an impressive achievement, given that the whole code (dynamical core and parameterization package) had to be refactored.</p>
</sec>
<sec id="Ch1.S4.SS3">
  <label>4.3</label><title>Floating-point precision</title>
      <p id="d1e3595">The results of the verification of the single-precision (SP) version of COSMO against the corresponding double-precision (DP) version can be seen in Fig. <xref ref-type="fig" rid="Ch1.F3"/> for the 500 hPa geopotential and in Fig. <xref ref-type="fig" rid="Ch1.F4"/> for all 16 tested variables. Before discussing the results, we remind the reader that some of the variables, notably in the soil model and the radiation codes, are retained in double precision, as some discrepancies were detected during the development of the SP version. When looking at Fig. <xref ref-type="fig" rid="Ch1.F3"/> (third panel), it should be noted that the geopotential is a purely diagnostic field in the COSMO model, so it is not perturbed initially but diagnosed at output time from the prognostic variables. However, as the geopotential is vertically integrated, it encompasses information from many levels and variables and can thus be considered a well-suited field for testing. One of the most striking features in Fig. <xref ref-type="fig" rid="Ch1.F3"/> is that the methodology rejects the SP version at the initial state of the model. At this stage, the perturbation has already been applied according to Eq. (<xref ref-type="disp-formula" rid="Ch1.E3"/>), but the model has performed only one time step, which COSMO requires before the initial output in order to compute the diagnostic quantities. Typically, one time step is not enough time for small differences to manifest themselves, as can be seen from the lack of rejections at hour zero for the diffusion ensemble in Figs. <xref ref-type="fig" rid="Ch1.F3"/> and <xref ref-type="fig" rid="Ch1.F4"/>. It is not entirely clear why the 500 hPa geopotential rejection rate is so high after one time step for the SP ensemble, but we presume a small difference in its calculation due to increased round-off errors in the vertical integration. 
Considering that the small perturbations have not had much time to grow, there is no real internal variability yet that could “hide” this difference. After 3 h, the mean rejection rate of the SP ensemble is substantially lower but still higher than the 0.95 quantile from the control. Afterward, the rejection rate increases again and follows a trajectory similar to that of the diffusion ensemble, but at a higher magnitude. In order to rule out differences in perturbation strength due to rounding errors (see also Sect. <xref ref-type="sec" rid="Ch1.S3.SS2"/>), we performed the same experiment with a modified double-precision version of COSMO, in which the fields to be perturbed are cast to single precision, the perturbation is applied in single precision, and the fields are then cast back to double precision. However, this had no effect on the results, and the SP ensemble was still rejected with the same magnitude for the initial conditions.</p>
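The presumed round-off mechanism can be illustrated with a toy vertical integration (a hypothetical example, not taken from COSMO): summing many small layer contributions in single precision accumulates a larger round-off error than the same sum in double precision.

```python
# Toy illustration: accumulate many small "layer" contributions, as in a
# vertically integrated diagnostic, in single vs. double precision.
import numpy as np

rng = np.random.default_rng(2)
layers = rng.uniform(0.0, 1.0, size=10_000)  # hypothetical layer terms

dp = float(np.sum(layers, dtype=np.float64))                      # double
sp = float(np.sum(layers.astype(np.float32), dtype=np.float32))   # single

# The two diagnostics differ slightly; after only one time step such a
# difference is not yet hidden by internal variability.
print(abs(sp - dp))
```

The absolute difference is tiny relative to the sum, but it is systematic across all ensemble members and therefore detectable by the test.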
      <p id="d1e3615">The third column of Fig. <xref ref-type="fig" rid="Ch1.F4"/> shows the global decisions for 16 output variables of the single-precision ensemble during the first 100 h. Overall, the number of rejections is similar to that for the diffusion ensemble with <inline-formula><mml:math id="M175" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.005</mml:mn></mml:mrow></mml:math></inline-formula> (first column). However, while most variables show a similar rejection pattern for the diffusion ensemble, the switch to single precision does not affect all variables to the same extent. Besides the 500 hPa geopotential, the test also rejects other variables after only one time step. The rejections of the diagnostic surface pressure, total cloud cover, and average top-of-atmosphere (TOA) outgoing longwave radiation are probably also caused by differences in the diagnostic calculations due to the reduced precision. The precipitation variable represents the sum of precipitation during the last hour. After the first time step, the model has produced very little precipitation; the maximum precipitation amount per grid point is below 0.09 mm h<inline-formula><mml:math id="M176" display="inline"><mml:msup><mml:mi/><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> in all members of the DP and SP ensembles. Therefore, the increased round-off error in the single-precision representation of such very small numbers may explain the rejection of precipitation at hour zero.</p>

      <fig id="Ch1.F5" specific-use="star"><?xmltex \currentcnt{5}?><?xmltex \def\figurename{Figure}?><label>Figure 5</label><caption><p id="d1e3645">Rejection rates and decisions similar to Fig. <xref ref-type="fig" rid="Ch1.F3"/> for different variables and with the use of different underlying statistical hypothesis tests for the diffusion ensemble with <inline-formula><mml:math id="M177" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.01</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M178" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">50</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M179" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M180" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>. While the rejection rates show some differences, the global decisions are very similar throughout all tests for the corresponding variables. The rejection rates with the K-S test are usually lower than those for the other two tests, but this does not affect the global decisions, as the respective 0.95 quantiles from the control ensemble are also lower. Student's <inline-formula><mml:math id="M181" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test shows very similar rejection rates to the nonparametric MWU test, even for precipitation, which is clearly not normally distributed.</p></caption>
          <?xmltex \igopts{width=455.244094pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022-f05.png"/>

        </fig>

</sec>
<sec id="Ch1.S4.SS4">
  <label>4.4</label><title>Statistical hypothesis tests</title>
      <p id="d1e3726">We have tested our methodology with the different statistical hypothesis tests described in Sect. <xref ref-type="sec" rid="Ch1.S3.SS3"/> for the test case with additional explicit diffusion (see above). Figure <xref ref-type="fig" rid="Ch1.F5"/> shows the respective rejection rates and decisions for several variables. The rejection rates from Student's <inline-formula><mml:math id="M182" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test and the MWU test are almost identical for all variables shown here. This confirms the robust behavior of Student's <inline-formula><mml:math id="M183" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test despite violations of the normality assumption. This is especially evident for precipitation, whose distribution is clearly non-normal and bounded below by zero (no negative precipitation). Like the MWU test, the K-S test is nonparametric and therefore does not rely on assumptions about the distribution of the variables. However, its rejection rate is generally lower than that of the MWU test and Student's <inline-formula><mml:math id="M184" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test. The same effect can be seen in the 0.95 quantile of the control rejection rate, which is generally lower than that of the other two hypothesis tests. The lower rejection rate is most likely associated with the lower power of the K-S test (see Sect. <xref ref-type="sec" rid="Ch1.S3.SS3.SSS3"/>). However, the decision (reject or not reject) is the same for all tests in this case. This indicates that any of these tests is suitable as an underlying statistical hypothesis test and that the choice of the statistical test is not very critical for our methodology. 
Nevertheless, we have decided to use the MWU test for most of the subsequent experiments, as it offers a slightly higher rejection rate than the K-S test and, being nonparametric, its use is easier to justify than that of Student's <inline-formula><mml:math id="M185" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test, even though these two tests produce almost identical results.</p>
</sec>
<sec id="Ch1.S4.SS5">
  <label>4.5</label><title>Vertical heat diffusion and soil effects</title>
      <p id="d1e3773">Figure <xref ref-type="fig" rid="Ch1.F6"/> shows the rejection rates and global decisions for the 2 m temperature and soil moisture at different depths for the model setting with a modified minimal diffusion coefficient for vertical scalar heat transport (<inline-formula><mml:math id="M186" display="inline"><mml:mrow><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:msub><mml:mi>h</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.3</mml:mn></mml:mrow></mml:math></inline-formula> instead of 0.35). Note that this change only affects a subset of the grid points, as <inline-formula><mml:math id="M187" display="inline"><mml:mrow><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:msub><mml:mi>h</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> represents a limiter. The rejection rate is quite high for the 2 m temperature during the first few days. For the soil moisture, the magnitude of the rejection rate decreases with depth. Furthermore, the initial perturbation and the subsequent internal variability of the atmosphere need some time to travel down to the lower layers, which is most obvious in the layer at 2.86 m depth. In this layer, the rejection rate remains close to zero for the first few days because almost no difference is visible between the individual ensemble members. As a consequence of the time taken for the perturbation to arrive, the global decision for this layer should be interpreted with caution during these first few days. However, while the magnitude and variability of the rejection rate decrease for the lower soil layers, the effect remains visible for longer, which is most probably related to the slower processes in the soil. For the 2 m temperature, there are still some rejections after 50–60 d. 
However, the test is usually not able to reject the global null hypothesis for 2 m temperature after 25 d, which indicates that from this point on, the effect of the change in <inline-formula><mml:math id="M188" display="inline"><mml:mrow><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:msub><mml:mi>h</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is overshadowed by internal variability, or that the test might no longer be sensitive enough to detect the difference with such a small ensemble and subsample size (<inline-formula><mml:math id="M189" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">50</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M190" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M191" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>).</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F6"><?xmltex \currentcnt{6}?><?xmltex \def\figurename{Figure}?><label>Figure 6</label><caption><p id="d1e3872">Rejection rates and decisions similar to Fig. <xref ref-type="fig" rid="Ch1.F3"/> for the 2 m temperature and soil moisture at different depths for an ensemble where the minimal diffusion coefficient for vertical scalar heat transport has been slightly changed (<inline-formula><mml:math id="M192" display="inline"><mml:mrow><mml:mi>t</mml:mi><mml:mi>k</mml:mi><mml:msub><mml:mi>h</mml:mi><mml:mi mathvariant="normal">min</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.3</mml:mn></mml:mrow></mml:math></inline-formula> instead of 0.35) and with <inline-formula><mml:math id="M193" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">50</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M194" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M195" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>. The initial random perturbation of the atmosphere needs some time to travel to the deeper soil layers. While the magnitude of the rejection rate is significantly lower for the deeper soil layers, the difference is noticeable for a longer period of time.</p></caption>
          <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022-f06.png"/>

        </fig>

</sec>
<sec id="Ch1.S4.SS6">
  <label>4.6</label><title>No subgrid-scale orography parameterization</title>
      <p id="d1e3952">Disabling the SSO parameterization is a substantial change, and our methodology detects it throughout the whole 3-month simulation. Despite the relatively small ensemble size of <inline-formula><mml:math id="M196" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">50</mml:mn></mml:mrow></mml:math></inline-formula> and subsample size of <inline-formula><mml:math id="M197" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula>, the mean rejection rate for the three variables shown in Fig. <xref ref-type="fig" rid="Ch1.F7"/> is very high and seems to remain at a relatively constant level after the first month. This indicates that the difference would also be detectable after a longer simulation time, even though the variability at the grid-cell level must be very high.</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F7"><?xmltex \currentcnt{7}?><?xmltex \def\figurename{Figure}?><label>Figure 7</label><caption><p id="d1e3989">Rejection rates and decisions similar to Fig. <xref ref-type="fig" rid="Ch1.F3"/> but for the 500 hPa geopotential, 850 hPa temperature, and 850 hPa water vapor amount and for an evaluation ensemble where the subgrid-scale orography (SSO) parameterization was switched off (<inline-formula><mml:math id="M198" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">50</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M199" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M200" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>). The methodology rejects the null hypothesis throughout all 90 d, except in three instances for the 500 hPa geopotential. The difference between the mean rejection rate of the evaluation ensemble and the 0.95 quantile of the control is quite large and persistent (also considering the relatively small ensemble and subsample sizes), which indicates that such a big change in the model is detectable for an even longer time.</p></caption>
          <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022-f07.png"/>

        </fig>

<?xmltex \hack{\newpage}?>
</sec>
<sec id="Ch1.S4.SS7">
  <label>4.7</label><title>Piz Daint update</title>
      <p id="d1e4052">Figure <xref ref-type="fig" rid="Ch1.F8"/> shows that we did not detect any differences after the update of the supercomputer Piz Daint. This test was one of the first cases where the methodology was used, and it was performed with a relatively small number of ensemble and subsample members (<inline-formula><mml:math id="M201" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">50</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M202" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula>). However, considering how closely the 0.95 quantile from the control ensemble follows the 0.95 quantile from the evaluation ensemble and how close the mean rejection rate from the evaluation ensemble is to 0.05, we believe that a test with a higher number of ensemble and subsample members would also either show no rejections or, for much larger ensemble and subsample sizes, a number of rejections comparable to the expected number of false positives (see Sect. <xref ref-type="sec" rid="Ch1.S4.SS10"/>).</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F8"><?xmltex \currentcnt{8}?><?xmltex \def\figurename{Figure}?><label>Figure 8</label><caption><p id="d1e4091">Rejection rates and decisions similar to Fig. <xref ref-type="fig" rid="Ch1.F3"/> for the 500 hPa geopotential, 850 hPa temperature, and surface pressure from the verification of a major system update of the underlying supercomputer Piz Daint. The methodology cannot reject the null hypothesis (at least not for the used ensemble size of <inline-formula><mml:math id="M203" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">50</mml:mn></mml:mrow></mml:math></inline-formula>, subsample size of <inline-formula><mml:math id="M204" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M205" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula> subsamples), which suggests that the update did not significantly affect the model behavior.</p></caption>
          <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022-f08.png"/>

        </fig>

</sec>
<sec id="Ch1.S4.SS8">
  <label>4.8</label><title>Sensitivity to ensemble and subsample sizes</title>
      <p id="d1e4152">In order to test the sensitivity of the methodology to the number of ensemble members <inline-formula><mml:math id="M206" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, the number of subsample members <inline-formula><mml:math id="M207" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, and the number of subsamples <inline-formula><mml:math id="M208" display="inline"><mml:mi>m</mml:mi></mml:math></inline-formula>, we have performed the test for the diffusion experiment with <inline-formula><mml:math id="M209" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.005</mml:mn></mml:mrow></mml:math></inline-formula> for a combination of different values of <inline-formula><mml:math id="M210" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M211" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M212" display="inline"><mml:mi>m</mml:mi></mml:math></inline-formula>. Figure <xref ref-type="fig" rid="Ch1.F9"/> shows the effects of different ensemble and subsample sizes on the evaluation of the 500 hPa geopotential. 
Adding more ensemble and subsample members increases the test's sensitivity, whereas using a higher number of subsamples (<inline-formula><mml:math id="M213" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">500</mml:mn></mml:mrow></mml:math></inline-formula> instead of <inline-formula><mml:math id="M214" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>) has a negligible effect (not shown in the figure), which indicates that 100 subsamples are sufficient for this methodology.</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F9"><?xmltex \currentcnt{9}?><?xmltex \def\figurename{Figure}?><label>Figure 9</label><caption><p id="d1e4254">Rejection rates and decisions for the 500 hPa geopotential, as in Fig. <xref ref-type="fig" rid="Ch1.F3"/>, for the diffusion ensemble (<inline-formula><mml:math id="M215" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.005</mml:mn></mml:mrow></mml:math></inline-formula>) with different numbers of ensemble members <inline-formula><mml:math id="M216" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, subsample members <inline-formula><mml:math id="M217" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M218" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula> subsamples. Larger values for <inline-formula><mml:math id="M219" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M220" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> increase the sensitivity of the methodology.</p></caption>
          <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022-f09.png"/>

        </fig>

</sec>
<sec id="Ch1.S4.SS9">
  <label>4.9</label><title>Influence of spatial averaging</title>
      <p id="d1e4342">Most existing verification methodologies for weather and climate models involve some form of spatial averaging of output variables (see Sect. <xref ref-type="sec" rid="Ch1.S2.SS1"/>). Our methodology evaluates the atmospheric fields at every grid point at a given vertical level. The idea behind this more fine-grained approach is that it should allow us to identify differences in small-scale features that may not affect spatial averages. In order to evaluate this, the model output from some of the previous experiments is spatially averaged into tiles consisting of an increasing number of grid cells (<inline-formula><mml:math id="M221" display="inline"><mml:mrow><mml:mn mathvariant="normal">1</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M222" display="inline"><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M223" display="inline"><mml:mrow><mml:mn mathvariant="normal">4</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">4</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M224" display="inline"><mml:mrow><mml:mn mathvariant="normal">8</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">8</mml:mn></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M225" display="inline"><mml:mrow><mml:mn mathvariant="normal">16</mml:mn><mml:mo>×</mml:mo><mml:mn mathvariant="normal">16</mml:mn></mml:mrow></mml:math></inline-formula> grid cells per tile).</p>
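The block averaging used here can be implemented with a simple reshape (a hypothetical helper, not the paper's code; the grid dimensions are assumed to be divisible by the tile size n):

```python
# Average an (ny, nx) model field over non-overlapping n x n tiles,
# e.g., 2 x 2, 4 x 4, ..., 16 x 16 grid cells per tile.
import numpy as np

def tile_average(field, n):
    """Block-average a 2-D field; ny and nx must be divisible by n."""
    ny, nx = field.shape
    return field.reshape(ny // n, n, nx // n, n).mean(axis=(1, 3))

field = np.arange(16.0).reshape(4, 4)
print(tile_average(field, 2))  # [[2.5, 4.5], [10.5, 12.5]]
```

A 1 × 1 tile reproduces the original grid-point evaluation, so larger tiles directly quantify how much spatial averaging masks the grid-scale signal.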

      <?xmltex \floatpos{t}?><fig id="Ch1.F10"><?xmltex \currentcnt{10}?><?xmltex \def\figurename{Figure}?><label>Figure 10</label><caption><p id="d1e4410">Global rejection rates of the 16 variables during the first 100 h, as in Fig. <xref ref-type="fig" rid="Ch1.F4"/>, for the diffusion ensemble with <inline-formula><mml:math id="M226" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.005</mml:mn></mml:mrow></mml:math></inline-formula>. A rate of 1.0 means that all global decisions show a rejection (i.e., only red in Fig. <xref ref-type="fig" rid="Ch1.F4"/>). The rates have been calculated for different ensemble and subsample sizes with <inline-formula><mml:math id="M227" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula> randomly drawn subsamples. They are grouped by tile size, where each tile represents the spatial average of <inline-formula><mml:math id="M228" display="inline"><mml:mrow><mml:mi>n</mml:mi><mml:mo>×</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:math></inline-formula> grid cells. Spatial averaging clearly reduces the sensitivity of the test for all ensemble sizes. The red lines indicate thresholds that could be used for an automated testing framework. 
For example, based on the false positive rate for <inline-formula><mml:math id="M229" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">200</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M230" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">150</mml:mn></mml:mrow></mml:math></inline-formula>, one could define a rejection rate of 0.1 as a threshold for this combination of ensemble and subsample sizes (i.e., the model has significantly changed if the rejection rate is greater than 0.1). The threshold should be lower for smaller ensemble and subsample sizes (e.g., 0.02 for <inline-formula><mml:math id="M231" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">200</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M232" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>).</p></caption>
          <?xmltex \igopts{width=241.848425pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022-f10.png"/>

        </fig>

      <p id="d1e4520">Figure <xref ref-type="fig" rid="Ch1.F10"/> shows the rejection rates for two diffusion ensembles (<inline-formula><mml:math id="M233" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.005</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M234" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.001</mml:mn></mml:mrow></mml:math></inline-formula>), the CPU ensemble, and an ensemble obtained from an identical model in the same way as the control ensemble. The rates represent the fraction of global rejections from the 16 variables during the first 100 h (i.e., the fraction that is red in Fig. <xref ref-type="fig" rid="Ch1.F4"/>), and they have been calculated for different tile sizes and numbers of ensemble and subsample members. For the diffusion ensembles, spatial averaging reduces the test's sensitivity for all ensemble and subsample sizes. These results indicate that a test at the grid-cell level can detect differences that would go undetected by methods that compare domain mean values or apply some other form of spatial averaging.</p>
      <p id="d1e4552">For the CPU ensemble, we see a rejection rate significantly above zero only for the largest subsample size in Fig. <xref ref-type="fig" rid="Ch1.F10"/>. However, since this rejection rate is similar to the corresponding false positive rate, one cannot reject the null hypothesis. It is also noteworthy that spatial averaging does not affect the rejection rate of either the CPU ensemble or the ensemble that has been used to calculate the number of false positives.</p><?xmltex \hack{\newpage}?>
</sec>
<sec id="Ch1.S4.SS10">
  <label>4.10</label><title>False positives and determining a threshold for automated testing</title>
      <p id="d1e4566">Looking at the rejection rates of the ensemble with no change in Fig. <xref ref-type="fig" rid="Ch1.F10"/> (bottom right of the figure), we see almost no false positives except for <inline-formula><mml:math id="M235" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">200</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M236" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">150</mml:mn></mml:mrow></mml:math></inline-formula>. The reason is likely a combination of the lower variability of the result for larger subsample sizes (i.e., the test becomes more accurate) and the fact that, with <inline-formula><mml:math id="M237" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">150</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M238" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">200</mml:mn></mml:mrow></mml:math></inline-formula>, many subsamples consist of very similar sets of ensemble members, which also reduces the variability of the result. This effect can also be seen in Fig. 
<xref ref-type="fig" rid="Ch1.F9"/>, where the 0.95 quantile is quite close to the mean rejection rate for <inline-formula><mml:math id="M239" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">200</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M240" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">150</mml:mn></mml:mrow></mml:math></inline-formula>. This “narrow” distribution of rejection rates likely increases the probability of the mean rejection rate of the false positive ensemble being higher than the 0.95 quantile of the rejection rate of the control ensemble.</p>
      <p id="d1e4664">While the false positive rate for the smaller ensemble and subsample sizes is very close to zero with our methodology, we still have to expect a certain number of false positives. An automated testing framework requires a clear pass/fail decision and, ideally, the test should not fail because of false positives. The false positive rate depends on the ensemble and subsample sizes, the evaluated variables, and the evaluation period. In order to determine a reasonable rejection rate threshold for the given parameters, the test should first be performed on an ensemble from a model that is identical to the one used for the reference and control ensembles. Based on the results in Fig. <xref ref-type="fig" rid="Ch1.F10"/> for the output without spatial averaging, we would, for example, set the threshold to 0.1 (dashed red line) for <inline-formula><mml:math id="M241" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">200</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M242" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">150</mml:mn></mml:mrow></mml:math></inline-formula>. For <inline-formula><mml:math id="M243" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">200</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M244" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>, a threshold of 0.02 would be reasonable (dotted red line), and one could go even lower for smaller ensemble and subsample sizes.</p>
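As a hedged sketch of how such a subsampling-based pass/fail decision could be automated (surrogate one-dimensional data and a fixed critical value stand in for the paper's per-grid-point testing and control-ensemble calibration; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def t_statistic(a, b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(sp2 * (1.0 / na + 1.0 / nb))

def rejection_rate(ref, ens, n_s, m=100, t_crit=1.98):
    """Fraction of m random size-n_s subsamples of `ens` for which the
    t test rejects equality of means with `ref` (|t| > t_crit,
    roughly alpha = 0.05 for ~100 degrees of freedom)."""
    count = 0
    for _ in range(m):
        sub = rng.choice(ens, size=n_s, replace=False)
        if abs(t_statistic(ref, sub)) > t_crit:
            count += 1
    return count / m

# surrogate 1-D data standing in for a single grid point (hypothetical)
ref = rng.normal(0.0, 1.0, size=100)      # reference ensemble
new = rng.normal(0.0, 1.0, size=200)      # ensemble from the modified model
rate = rejection_rate(ref, new, n_s=50)
# pass/fail decision: flag the model as changed if the rate exceeds a
# threshold calibrated beforehand on an identical-model ensemble
changed = rate > 0.02
```

In practice, the threshold would be chosen from the rejection-rate distribution of an identical-model ensemble, as described above.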

      <?xmltex \floatpos{t}?><fig id="Ch1.F11" specific-use="star"><?xmltex \currentcnt{11}?><?xmltex \def\figurename{Figure}?><label>Figure 11</label><caption><p id="d1e4731">Comparison between our methodology, which uses subsampling (<inline-formula><mml:math id="M245" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M246" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">50</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M247" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>) and a control ensemble, and an approach that uses only one comparison between all members of the two ensembles with <inline-formula><mml:math id="M248" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula> in combination with the FDR correction and <inline-formula><mml:math id="M249" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">α</mml:mi><mml:mi mathvariant="normal">FDR</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.05</mml:mn></mml:mrow></mml:math></inline-formula>. The Student <inline-formula><mml:math id="M250" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test was used for the local hypothesis testing in both cases. 
Panels <bold>(a)</bold> and <bold>(b)</bold> show the global rejections for the diffusion ensemble (<inline-formula><mml:math id="M251" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.005</mml:mn></mml:mrow></mml:math></inline-formula>), whereas <bold>(c)</bold> and <bold>(d)</bold> show the respective rejections for an ensemble from an identical model (no change) to compare the false positive rates. Both methods show similar rejections, with a slightly higher number of false positives seen for the FDR approach.</p></caption>
          <?xmltex \igopts{width=455.244094pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022-f11.png"/>

        </fig>

</sec>
<sec id="Ch1.S4.SS11">
  <label>4.11</label><title>Comparison with the FDR method</title>
      <p id="d1e4853">The approach used in our methodology, which is based on subsampling and a control ensemble, is an effective way to determine field significance while accounting for spatial correlation and reducing the effect of false positives. As already discussed in Sect. <xref ref-type="sec" rid="Ch1.S2.SS2"/>, the FDR approach by <xref ref-type="bibr" rid="bib1.bibx7" id="text.83"/> serves a similar purpose by limiting the fraction of false rejections out of all rejections. The big advantage of the FDR approach is that we only need two ensembles (no control ensemble) and no subsampling, which reduces the computational costs. Figure <xref ref-type="fig" rid="Ch1.F11"/> shows the global rejections of our methodology (with <inline-formula><mml:math id="M252" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M253" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">S</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">50</mml:mn></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M254" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula>) and the FDR approach, which only performs one comparison with <inline-formula><mml:math id="M255" display="inline"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi mathvariant="normal">E</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></inline-formula> (no subsampling) and <inline-formula><mml:math id="M256" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">α</mml:mi><mml:mi mathvariant="normal">FDR</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.05</mml:mn></mml:mrow></mml:math></inline-formula>. 
We use Student's <inline-formula><mml:math id="M257" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> test as the local null hypothesis test for both methods. For the diffusion ensemble with <inline-formula><mml:math id="M258" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0.005</mml:mn></mml:mrow></mml:math></inline-formula>, the FDR approach yields a result similar to that of our approach with a control ensemble and subsampling. With the FDR approach, the number of false positives is larger by a factor of 3–4, but one could account for this by using a slightly higher threshold for the global rejection rate (see previous section). This would slightly reduce the test's sensitivity, but given the FDR approach's lower computational cost, it is an attractive alternative to our approach, especially for frequent automated testing.</p><?xmltex \hack{\newpage}?>
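The Benjamini–Hochberg procedure underlying the FDR approach can be sketched as follows (an illustrative implementation, not the code used in this study):

```python
import numpy as np

def fdr_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg procedure: with N sorted p values
    p_(1) <= ... <= p_(N), find the largest k with
    p_(k) <= alpha * k / N and reject the k hypotheses with the
    smallest p values.

    Returns a boolean mask (True = rejected), e.g., one entry per
    grid point of the tested field.
    """
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, n + 1) / n
    below = p[order] <= thresh
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()    # largest qualifying rank (0-based)
        reject[order[:k + 1]] = True
    return reject

# toy p values, e.g., from per-grid-point t tests (hypothetical numbers)
mask = fdr_reject([0.001, 0.02, 0.04, 0.8], alpha=0.05)
```

The mask can then be aggregated into a global decision, e.g., by rejecting globally if any local hypothesis is rejected.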
</sec>
</sec>
<sec id="Ch1.S5">
  <label>5</label><title>Discussion</title>
      <p id="d1e4966">In contrast to most existing verification methodologies described in Sect. <xref ref-type="sec" rid="Ch1.S2"/>, our methodology does not rely on any averaging in either space or time. This approach offers several advantages. The verification at the grid-cell level allows us to identify differences in small-scale and short-lived features that may not affect spatial or temporal averages. Furthermore, it provides fine-grained information in space and time, which helps in tracing the source of a difference. A good example of this is the initial rejection of some diagnostic fields, such as the 500 hPa geopotential, for the single-precision experiment. The test rejects the null hypothesis after just one time step, which indicates that there are already detectable differences in the diagnostic calculation of the respective field (see Sect. <xref ref-type="sec" rid="Ch1.S4.SS3"/> for further detail). Focusing on instantaneous values, or on averages over a short time frame, also accounts for internal variability. Minor differences can often only be detected during the first few hours or days, before increasing internal variability outweighs the effect of the change. Therefore, we think short simulations of a few days should generally be preferred over longer, computationally more expensive simulations.</p>
      <p id="d1e4973">It is not entirely clear how sensitive such a methodology is in detecting differences in long climate simulations. For the verification of very slow processes, longer simulations with either spatial or temporal averaging might appear to be the better choice. However, the current methodology using short integrations can also detect changes in slower variables such as soil moisture within the first few days, which indicates that it might also be suited for climate simulations. Moreover, given that differences arising from frequent changes (e.g., compiler upgrades, library updates, and minor code rearrangements) typically manifest themselves early in the simulation <xref ref-type="bibr" rid="bib1.bibx28" id="paren.84"><named-content content-type="pre">see</named-content></xref>, we think that this is a reasonable approach with low computational costs. Nevertheless, it is worth rethinking our methodology in the case of a global coupled climate model that may comprise very fast (e.g., the atmospheric model) and very slow (e.g., an ice sheet model) components. In such a case, it might be advantageous to test the different model components in standalone mode, possibly using different integration periods, before evaluating the fully coupled system and focusing on the variables heavily affected by the coupling (e.g., the near-surface temperature for ocean–atmosphere coupling). However, further studies on this topic would be needed.</p>
      <p id="d1e4981">The methodology clearly shows some sensitivity to the ensemble and subsample sizes. Using a larger number of ensemble and subsample members generally increases the test's sensitivity but also leads to higher computational costs. Similarly, the choice of the tested variables has to be considered. Testing all possible model variables at all vertical levels would guarantee the highest degree of reliability. However, this is infeasible due to the high computational cost it would demand. Moreover, since the atmosphere is such a complex and interconnected system, many variables are highly correlated. Therefore, and based on our results, we think that testing a few standard output variables at selected vertical levels (as in Fig. <xref ref-type="fig" rid="Ch1.F4"/>) is sufficient for all but the tiniest changes.</p>
</sec>
<sec id="Ch1.S6" sec-type="conclusions">
  <label>6</label><title>Conclusions and outlook</title>
      <p id="d1e4994">We have presented an ensemble-based verification methodology based on statistical hypothesis testing to detect model changes objectively. The methodology operates at the grid-cell level and works for instantaneous and accumulated/averaged variables. We showed that spatial averaging lowers the chance of detecting small-scale changes such as diffusion. Furthermore, the study suggests that short-term ensemble simulations (days to months) are best suited, as the smallest changes are often only detectable during the first few hours of the simulation. Combined with the fact that the methodology already works well for coarse resolutions (50 km grid spacing here), the methodology is a good candidate for a relatively inexpensive automated system test. We showed that the choice of the underlying statistical hypothesis test is secondary as long as the rejection rate is compared to a rejection rate distribution from a control ensemble that has been generated with an identical statistical hypothesis test.</p>
      <p id="d1e4997">While the methodology could in principle be applied to all model output variables at all vertical levels and thus be exhaustive, we consider this unnecessary. Based on our results obtained using a limited-area climate model and the high correlations between many atmospheric variables, we think that a set of key variables reflecting the most important processes in an atmospheric model might already be sufficient to cover most of the atmospheric and land-surface processes. However, for a fully coupled global climate model, further considerations will be needed.</p>
      <p id="d1e5000">The verification methodology detected several configuration changes, ranging from very small changes, such as tiny increases in horizontal diffusion or changes in the minimum vertical heat diffusion coefficient, to more substantial changes, such as disabling the subgrid-scale orography (SSO) parameterization. The test was not able to detect any differences between the regional weather and climate model COSMO running on GPUs and running on CPUs on the same supercomputer (Piz Daint, CSCS, Switzerland). However, the test detected differences between single- and double-precision versions of the model for almost all tested variables. In the single- versus double-precision comparison, rejections occur after just one time step for some diagnostic variables, suggesting precision-sensitive operations in the diagnostic calculations. Furthermore, the methodology has already been successfully applied for the verification of the regional weather and climate model COSMO after a major system update of the underlying supercomputer (Piz Daint).</p>
      <p id="d1e5003">Nonetheless, the results of such a test have to be interpreted with caution and might give a false sense of security. On the one hand, there are potential issues with any statistical hypothesis test, as the inability to reject the null hypothesis does not automatically mean that it is true. On the other hand, even though verification is termed a “system test”, it is hardly possible to test the whole model. There are countless configurations for such models, and testing all these configurations (i.e., different physical parameterizations, resolutions, and numerical methods) is almost impossible and would require a substantial computational effort. The methodology also has some potential limitations if a certain part of the code is only very rarely activated (as is potentially the case with threshold-triggered processes). First results also show that the FDR approach seems to be a suitable and computationally less expensive alternative to using a control ensemble and subsampling to determine the field significance of spatially correlated output data. However, the FDR approach has a somewhat higher rate of false rejections, and thus a somewhat lower sensitivity.</p>
      <p id="d1e5007">For future work, we intend to apply the methodology to more test cases, such as the compilation of the model with different optimization levels or running the model on different supercomputers. It would also be interesting to directly compare our verification methodology to other preexisting methodologies to better understand the differences in sensitivity and applicability.</p>
</sec>

      
      </body>
    <back><app-group>

<app id="App1.Ch1.S1">
  <?xmltex \currentcnt{A}?><label>Appendix A</label><title>Influence of perturbation strength</title>
      <p id="d1e5021">As described in Sect. <xref ref-type="sec" rid="Ch1.S3.SS2"/>, we have chosen a relatively strong initial perturbation with a magnitude on the order of <inline-formula><mml:math id="M259" display="inline"><mml:mrow><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">4</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> for ensemble generation. Most other existing verification frameworks use a weaker perturbation with a magnitude on the order of <inline-formula><mml:math id="M260" display="inline"><mml:mrow><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">14</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> <xref ref-type="bibr" rid="bib1.bibx1 bib1.bibx24 bib1.bibx28" id="paren.85"><named-content content-type="pre">e.g.,</named-content></xref>. For us, the chosen perturbation magnitude proved to be a good compromise between not disturbing the initial conditions too much and still providing a sufficient ensemble spread for the statistical verification during the first few hours. Furthermore, choosing such a relatively strong perturbation also allows us to examine the effects of single- versus double-precision floating-point representation, as this choice minimizes the risk of undesirable rounding artifacts in the perturbation itself.</p>
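As a hypothetical sketch of such a multiplicative perturbation (the actual form is defined by Eq. 3; the field values, function name, and exact perturbation form here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb(field, eps):
    """Apply a multiplicative random perturbation of relative
    magnitude eps to an initial-condition field (hypothetical form;
    the perturbation actually used is defined by Eq. 3)."""
    return field * (1.0 + eps * rng.uniform(-1.0, 1.0, size=field.shape))

# idealized 850 hPa temperature field [K] (illustrative values)
temperature = np.full((10, 10), 280.0)
members = [perturb(temperature, 1e-4) for _ in range(5)]   # small ensemble
```

Each member differs from the unperturbed field by at most the relative magnitude eps, so the initial state is essentially preserved.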

      <?xmltex \floatpos{h!}?><fig id="App1.Ch1.S1.F12"><?xmltex \currentcnt{A1}?><?xmltex \def\figurename{Figure}?><label>Figure A1</label><caption><p id="d1e5061">Mean coefficient of variation averaged over all grid points of 850 hPa temperature from ensembles (50 members per ensemble) with different initial perturbation magnitudes according to Eq. (<xref ref-type="disp-formula" rid="Ch1.E3"/>). The relatively strong perturbation used in this work (<inline-formula><mml:math id="M261" display="inline"><mml:mrow><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">4</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>) leads to only a slightly higher variance during the first few days than a perturbation at machine precision (<inline-formula><mml:math id="M262" display="inline"><mml:mrow><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">16</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>).</p></caption>
        <?xmltex \igopts{width=207.705118pt}?><graphic xlink:href="https://gmd.copernicus.org/articles/15/3183/2022/gmd-15-3183-2022-f12.png"/>

      </fig>

      <p id="d1e5108">Figure <xref ref-type="fig" rid="App1.Ch1.S1.F12"/> shows that the mean coefficient of variation averaged over all grid points of 850 hPa temperature, which is one of the directly perturbed variables, is not substantially higher with <inline-formula><mml:math id="M263" display="inline"><mml:mrow><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">4</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> than with <inline-formula><mml:math id="M264" display="inline"><mml:mrow><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">16</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> during the first few days. After around 300 h, the influence of the perturbation strength seems to be negligible.</p>
</app>
  </app-group><notes notes-type="codedataavailability"><title>Code and data availability</title>

      <p id="d1e5153">The source code that has been used to calculate the rejection rates shown in this paper is available at <ext-link xlink:href="https://doi.org/10.5281/zenodo.6355694" ext-link-type="DOI">10.5281/zenodo.6355694</ext-link> <xref ref-type="bibr" rid="bib1.bibx59" id="paren.86"/>. The corresponding model output data from the shorter ensemble simulations (5 d) are available at <ext-link xlink:href="https://doi.org/10.5281/zenodo.6354200" ext-link-type="DOI">10.5281/zenodo.6354200</ext-link> <xref ref-type="bibr" rid="bib1.bibx57" id="paren.87"/> and <ext-link xlink:href="https://doi.org/10.5281/zenodo.6355647" ext-link-type="DOI">10.5281/zenodo.6355647</ext-link> <xref ref-type="bibr" rid="bib1.bibx58" id="paren.88"/>. The COSMO model that has been used in this study is available under license (see <uri>http://www.cosmo-model.org/content/consortium/licencing.htm</uri>, <xref ref-type="bibr" rid="bib1.bibx10" id="altparen.89"/>). COSMO may be used for operational and research applications by the members of the COSMO consortium. Moreover, within a license agreement, the COSMO model may be used for operational and research applications by other national (hydro)meteorological services, universities, and research institutes. ERA-Interim reanalysis data, which were used for initial and lateral boundary conditions, are available at <uri>https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim</uri> (<xref ref-type="bibr" rid="bib1.bibx14" id="altparen.90"/>).</p>
  </notes><notes notes-type="authorcontribution"><title>Author contributions</title>

      <p id="d1e5190">CZ and CS conceptualized the verification methodology and designed the study. CZ performed the COSMO model ensemble simulations and developed the code for the verification of the model results. CZ wrote the paper with contributions from CS.</p>
  </notes><notes notes-type="competinginterests"><title>Competing interests</title>

      <p id="d1e5196">The contact author has declared that neither they nor their co-authors have any competing interests.</p>
  </notes><notes notes-type="disclaimer"><title>Disclaimer</title>

      <p id="d1e5202">Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
  </notes><ack><title>Acknowledgements</title><p id="d1e5208">We would like to thank the two anonymous reviewers for their valuable comments. We acknowledge PRACE for awarding computational resources for the COSMO simulations on Piz Daint at the Swiss National Supercomputing Centre (CSCS). We also acknowledge the Federal Office for Meteorology and Climatology MeteoSwiss, CSCS, and ETH Zurich for their contributions to the development of the GPU-accelerated version of COSMO. In the discussion leading to this paper, we benefited from useful comments of several ETH, MeteoSwiss, and CSCS colleagues.</p></ack><notes notes-type="reviewstatement"><title>Review statement</title>

      <p id="d1e5213">This paper was edited by Christoph Knote and reviewed by two anonymous referees.</p>
  </notes><ref-list>
    <title>References</title>

      <ref id="bib1.bibx1"><?xmltex \def\ref@label{{Baker et~al.(2015)Baker, Hammerling, Levy, Xu, Dennis, Eaton,
Edwards, Hannay, Mickelson, Neale, Nychka, Shollenberger, Tribbia,
Vertenstein, and Williamson}}?><label>Baker et al.(2015)Baker, Hammerling, Levy, Xu, Dennis, Eaton,
Edwards, Hannay, Mickelson, Neale, Nychka, Shollenberger, Tribbia,
Vertenstein, and Williamson</label><?label Baker2015?><mixed-citation>Baker, A. H., Hammerling, D. M., Levy, M. N., Xu, H., Dennis, J. M., Eaton, B. E., Edwards, J., Hannay, C., Mickelson, S. A., Neale, R. B., Nychka, D., Shollenberger, J., Tribbia, J., Vertenstein, M., and Williamson, D.: A new ensemble-based consistency test for the Community Earth System Model (pyCECT v1.0), Geosci. Model Dev., 8, 2829–2840, <ext-link xlink:href="https://doi.org/10.5194/gmd-8-2829-2015" ext-link-type="DOI">10.5194/gmd-8-2829-2015</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx2"><?xmltex \def\ref@label{{Baker et~al.(2016)Baker, Hu, Hammerling, Tseng, Xu, Huang, Bryan, and
Yang}}?><label>Baker et al.(2016)Baker, Hu, Hammerling, Tseng, Xu, Huang, Bryan, and
Yang</label><?label Baker2016?><mixed-citation>Baker, A. H., Hu, Y., Hammerling, D. M., Tseng, Y.-H., Xu, H., Huang, X., Bryan, F. O., and Yang, G.: Evaluating statistical consistency in the ocean model component of the Community Earth System Model (pyCECT v2.0), Geosci. Model Dev., 9, 2391–2406, <ext-link xlink:href="https://doi.org/10.5194/gmd-9-2391-2016" ext-link-type="DOI">10.5194/gmd-9-2391-2016</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bibx3"><?xmltex \def\ref@label{{Baldauf et~al.(2011)Baldauf, Seifert, F{\"{o}}rstner, Majewski,
Raschendorfer, and Reinhardt}}?><label>Baldauf et al.(2011)Baldauf, Seifert, Förstner, Majewski,
Raschendorfer, and Reinhardt</label><?label Baldauf2011?><mixed-citation>Baldauf, M., Seifert, A., Förstner, J., Majewski, D., Raschendorfer, M.,
and Reinhardt, T.: Operational Convective-Scale Numerical Weather Prediction
with the COSMO Model: Description and Sensitivities,
Mon. Weather Rev.,
139, 3887–3905, <ext-link xlink:href="https://doi.org/10.1175/MWR-D-10-05013.1" ext-link-type="DOI">10.1175/MWR-D-10-05013.1</ext-link>, 2011.</mixed-citation></ref>
      <ref id="bib1.bibx4"><?xmltex \def\ref@label{{Bartlett(1935)}}?><label>Bartlett(1935)</label><?label Bartlett1935?><mixed-citation>Bartlett, M. S.: The Effect of Non-Normality on the t Distribution,
Math. Proc. Cambridge, 31,
223–231, <ext-link xlink:href="https://doi.org/10.1017/S0305004100013311" ext-link-type="DOI">10.1017/S0305004100013311</ext-link>, 1935.</mixed-citation></ref>
      <ref id="bib1.bibx5"><?xmltex \def\ref@label{{Bauer et~al.(2015)Bauer, Thorpe, and Brunet}}?><label>Bauer et al.(2015)Bauer, Thorpe, and Brunet</label><?label Bauer2015?><mixed-citation>Bauer, P., Thorpe, A., and Brunet, G.: The quiet revolution of numerical
weather prediction, Nature, 525, 47–55, <ext-link xlink:href="https://doi.org/10.1038/nature14956" ext-link-type="DOI">10.1038/nature14956</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx6"><?xmltex \def\ref@label{{Bellprat et~al.(2016)Bellprat, Kotlarski, L{\"{u}}thi, De~El{\'{i}}a,
Frigon, Laprise, and Sch{\"{a}}r}}?><label>Bellprat et al.(2016)Bellprat, Kotlarski, Lüthi, De Elía,
Frigon, Laprise, and Schär</label><?label Bellprat2016?><mixed-citation>Bellprat, O., Kotlarski, S., Lüthi, D., De Elía, R., Frigon, A.,
Laprise, R., and Schär, C.: Objective calibration of regional climate
models: Application over Europe and North America, J. Climate, 29,
819–838, <ext-link xlink:href="https://doi.org/10.1175/JCLI-D-15-0302.1" ext-link-type="DOI">10.1175/JCLI-D-15-0302.1</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bibx7"><?xmltex \def\ref@label{{Benjamini and Hochberg(1995)}}?><label>Benjamini and Hochberg(1995)</label><?label Benjamini1995?><mixed-citation>Benjamini, Y. and Hochberg, Y.: Controlling the False Discovery Rate: A
Practical and Powerful Approach to Multiple Testing,
J. Roy. Stat. Soc. B, 57, 289–300,
<ext-link xlink:href="https://doi.org/10.1111/j.2517-6161.1995.tb02031.x" ext-link-type="DOI">10.1111/j.2517-6161.1995.tb02031.x</ext-link>, 1995.</mixed-citation></ref>
      <ref id="bib1.bibx8"><?xmltex \def\ref@label{{Carson(2002)}}?><label>Carson(2002)</label><?label Carson2002?><mixed-citation>Carson, J. S.: Model verification and validation, in: Proceedings of the
Winter Simulation Conference, Winter Simulation Conference,  San Diego, CA, USA,   8–11 December 2002, 1, 52–58,
<ext-link xlink:href="https://doi.org/10.1109/WSC.2002.1172868" ext-link-type="DOI">10.1109/WSC.2002.1172868</ext-link>, 2002.</mixed-citation></ref>
      <ref id="bib1.bibx9"><?xmltex \def\ref@label{{Clune and Rood(2011)}}?><label>Clune and Rood(2011)</label><?label Clune2011?><mixed-citation>Clune, T. and Rood, R.: Software Testing and Verification in Climate Model
Development, IEEE Software, 28, 49–55, <ext-link xlink:href="https://doi.org/10.1109/MS.2011.117" ext-link-type="DOI">10.1109/MS.2011.117</ext-link>, 2011.</mixed-citation></ref>
      <ref id="bib1.bibx10"><?xmltex \def\ref@label{{COSMO Consortium(2022)}}?><label>COSMO Consortium(2022)</label><?label CosmoLicense?><mixed-citation>COSMO Consortium: COSMO Model License,
<uri>http://www.cosmo-model.org/content/consortium/licencing.htm</uri>,
last access: 12 April 2022.</mixed-citation></ref>
      <ref id="bib1.bibx11"><?xmltex \def\ref@label{{Dee et~al.(2011)Dee, Uppala, Simmons, Berrisford, Poli, Kobayashi,
Andrae, Balmaseda, Balsamo, Bauer, Bechtold, Beljaars, van~de Berg, Bidlot,
Bormann, Delsol, Dragani, Fuentes, Geer, Haimberger, Healy, Hersbach,
H{\'{o}}lm, Isaksen, K{\aa}llberg, K{\"{o}}hler, Matricardi, Mcnally,
Monge-Sanz, Morcrette, Park, Peubey, de~Rosnay, Tavolato, Th{\'{e}}paut, and
Vitart}}?><label>Dee et al.(2011)Dee, Uppala, Simmons, Berrisford, Poli, Kobayashi,
Andrae, Balmaseda, Balsamo, Bauer, Bechtold, Beljaars, van de Berg, Bidlot,
Bormann, Delsol, Dragani, Fuentes, Geer, Haimberger, Healy, Hersbach,
Hólm, Isaksen, Kållberg, Köhler, Matricardi, Mcnally,
Monge-Sanz, Morcrette, Park, Peubey, de Rosnay, Tavolato, Thépaut, and
Vitart</label><?label Dee2011?><mixed-citation>Dee, D. P., Uppala, S. M., Simmons, A. J., Berrisford, P., Poli, P., Kobayashi,
S., Andrae, U., Balmaseda, M. A., Balsamo, G., Bauer, P., Bechtold, P.,
Beljaars, A. C., van de Berg, L., Bidlot, J., Bormann, N., Delsol, C.,
Dragani, R., Fuentes, M., Geer, A. J., Haimberger, L., Healy, S. B.,
Hersbach, H., Hólm, E. V., Isaksen, L., Kållberg, P., Köhler,
M., Matricardi, M., Mcnally, A. P., Monge-Sanz, B. M., Morcrette, J. J.,
Park, B. K., Peubey, C., de Rosnay, P., Tavolato, C., Thépaut, J. N.,
and Vitart, F.: The ERA-Interim reanalysis: Configuration and performance of
the data assimilation system,
Q. J. Roy. Meteor. Soc., 137, 553–597, <ext-link xlink:href="https://doi.org/10.1002/qj.828" ext-link-type="DOI">10.1002/qj.828</ext-link>, 2011.</mixed-citation></ref>
      <ref id="bib1.bibx12"><?xmltex \def\ref@label{{Doms and Baldauf(2018)}}?><label>Doms and Baldauf(2018)</label><?label Doms2018?><mixed-citation>Doms, G. and Baldauf, M.: A Description of the Nonhydrostatic Regional
COSMO-Model Part I: Dynamics and Numerics, Deutscher Wetterdienst (DWD), Offenbach, Germany,
<ext-link xlink:href="https://doi.org/10.5676/DWD_pub/nwv/cosmo-doc_5.05_I" ext-link-type="DOI">10.5676/DWD_pub/nwv/cosmo-doc_5.05_I</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx13"><?xmltex \def\ref@label{{Doms et~al.(2018)Doms, F{\"{o}}rstner, Heise, Herzog, Mironov,
Raschendorfer, Reinhardt, Ritter, Schrodin, Schulz, and
Vogel}}?><label>Doms et al.(2018)Doms, Förstner, Heise, Herzog, Mironov,
Raschendorfer, Reinhardt, Ritter, Schrodin, Schulz, and
Vogel</label><?label Cosmo2018_Physical?><mixed-citation>Doms, G., Förstner, J., Heise, E., Herzog, H.-J., Mironov, D.,
Raschendorfer, M., Reinhardt, T., Ritter, B., Schrodin, R., Schulz, J.-P.,
and Vogel, G.: COSMO Documentation Part II: Physical Parameterization,
Deutscher Wetterdienst (DWD), Offenbach, Germany, <ext-link xlink:href="https://doi.org/10.5676/dwd_pub/nwv/cosmo-doc_5.05_ii" ext-link-type="DOI">10.5676/dwd_pub/nwv/cosmo-doc_5.05_ii</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx14"><?xmltex \def\ref@label{{ECMWF(2022)}}?><label>ECMWF(2022)</label><?label EraInterim?><mixed-citation>ECMWF: ERA-Interim reanalysis,
ECMWF [data set],
<uri>https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim</uri>,
last access: 12 April 2022.</mixed-citation></ref>
      <ref id="bib1.bibx15"><?xmltex \def\ref@label{{Fuhrer et~al.(2014)Fuhrer, Osuna, Lapillonne, Gysi, Bianco, Arteaga,
and Schulthess}}?><label>Fuhrer et al.(2014)Fuhrer, Osuna, Lapillonne, Gysi, Bianco, Arteaga,
and Schulthess</label><?label Fuhrer2014?><mixed-citation>Fuhrer, O., Osuna, C., Lapillonne, X., Gysi, T., Bianco, M., Arteaga, A., and
Schulthess, T. C.: Towards a performance portable, architecture agnostic
implementation strategy for weather and climate models,
Supercomputing Frontiers and Innovations, 1, 44–61, <ext-link xlink:href="https://doi.org/10.14529/jsfi140103" ext-link-type="DOI">10.14529/jsfi140103</ext-link>, 2014.</mixed-citation></ref>
      <ref id="bib1.bibx16"><?xmltex \def\ref@label{{Hong et~al.(2013)Hong, Koo, Jang, Kim, Park, Joh, Kang, and
Oh}}?><label>Hong et al.(2013)Hong, Koo, Jang, Kim, Park, Joh, Kang, and
Oh</label><?label Hong2013?><mixed-citation>Hong, S.-Y., Koo, M.-S., Jang, J., Kim, J.-E. E., Park, H., Joh, M.-S., Kang,
J.-H., and Oh, T.-J.: An Evaluation of the Software System Dependency of a
Global Atmospheric Model, Mon. Weather Rev., 141, 4165–4172,
<ext-link xlink:href="https://doi.org/10.1175/MWR-D-12-00352.1" ext-link-type="DOI">10.1175/MWR-D-12-00352.1</ext-link>, 2013.</mixed-citation></ref>
      <ref id="bib1.bibx17"><?xmltex \def\ref@label{{Knight et~al.(2007)Knight, Knight, Massey, Aina, Christensen, Frame,
Kettleborough, Martin, Pascoe, Sanderson, Stainforth, and Allen}}?><label>Knight et al.(2007)Knight, Knight, Massey, Aina, Christensen, Frame,
Kettleborough, Martin, Pascoe, Sanderson, Stainforth, and Allen</label><?label Knight2007?><mixed-citation>Knight, C. G., Knight, S. H. E., Massey, N., Aina, T., Christensen, C., Frame,
D. J., Kettleborough, J. A., Martin, A., Pascoe, S., Sanderson, B.,
Stainforth, D. A., and Allen, M. R.: Association of parameter, software, and
hardware variation with large-scale behavior across 57,000 climate models,
P. Natl. Acad. Sci. USA, 104, 12259–12264,
<ext-link xlink:href="https://doi.org/10.1073/pnas.0608144104" ext-link-type="DOI">10.1073/pnas.0608144104</ext-link>, 2007.</mixed-citation></ref>
      <ref id="bib1.bibx18"><?xmltex \def\ref@label{{Leutbecher and Palmer(2008)}}?><label>Leutbecher and Palmer(2008)</label><?label Leutbecher2008?><mixed-citation>Leutbecher, M. and Palmer, T. N.: Ensemble forecasting,
J. Comput. Phys., 227, 3515–3539,
<ext-link xlink:href="https://doi.org/10.1016/j.jcp.2007.02.014" ext-link-type="DOI">10.1016/j.jcp.2007.02.014</ext-link>, 2008.</mixed-citation></ref>
      <ref id="bib1.bibx19"><?xmltex \def\ref@label{{Livezey(1985)}}?><label>Livezey(1985)</label><?label Livezey1985?><mixed-citation>Livezey, R. E.: Statistical Analysis of General Circulation Model Climate
Simulation: Sensitivity and Prediction Experiments,
J. Atmos. Sci., 42, 1139–1150,
<ext-link xlink:href="https://doi.org/10.1175/1520-0469(1985)042&lt;1139:SAOGCM&gt;2.0.CO;2" ext-link-type="DOI">10.1175/1520-0469(1985)042&lt;1139:SAOGCM&gt;2.0.CO;2</ext-link>, 1985.</mixed-citation></ref>
      <ref id="bib1.bibx20"><?xmltex \def\ref@label{{Livezey and Chen(1983)}}?><label>Livezey and Chen(1983)</label><?label Livezey1983?><mixed-citation>Livezey, R. E. and Chen, W. Y.: Statistical Field Significance and its
Determination by Monte Carlo Techniques, Mon. Weather Rev., 111,
46–59, <ext-link xlink:href="https://doi.org/10.1175/1520-0493(1983)111&lt;0046:SFSAID&gt;2.0.CO;2" ext-link-type="DOI">10.1175/1520-0493(1983)111&lt;0046:SFSAID&gt;2.0.CO;2</ext-link>, 1983.</mixed-citation></ref>
      <ref id="bib1.bibx21"><?xmltex \def\ref@label{{Lorenz(1963)}}?><label>Lorenz(1963)</label><?label Lorenz1963?><mixed-citation>Lorenz, E. N.: Deterministic Nonperiodic Flow, J. Atmos. Sci., 20, 130–141, <ext-link xlink:href="https://doi.org/10.1175/1520-0469(1963)020&lt;0130:DNF&gt;2.0.CO;2" ext-link-type="DOI">10.1175/1520-0469(1963)020&lt;0130:DNF&gt;2.0.CO;2</ext-link>,
1963.</mixed-citation></ref>
      <ref id="bib1.bibx22"><?xmltex \def\ref@label{{Lott and Miller(1997)}}?><label>Lott and Miller(1997)</label><?label Lott1997?><mixed-citation>Lott, F. and Miller, M. J.: A new subgrid-scale orographic drag
parametrization: Its formulation and testing,
Q. J. Roy. Meteor. Soc., 123, 101–127, <ext-link xlink:href="https://doi.org/10.1256/smsqj.53703" ext-link-type="DOI">10.1256/smsqj.53703</ext-link>, 1997.</mixed-citation></ref>
      <ref id="bib1.bibx23"><?xmltex \def\ref@label{{Mahajan(2021)}}?><label>Mahajan(2021)</label><?label Mahajan2021?><mixed-citation>Mahajan, S.: Ensuring Statistical Reproducibility of Ocean Model Simulations
in the Age of Hybrid Computing, in: Proceedings of the Platform for Advanced
Scientific Computing Conference, PASC '21, Association for Computing
Machinery, New York, NY, USA, 5–9 July 2021, <ext-link xlink:href="https://doi.org/10.1145/3468267.3470572" ext-link-type="DOI">10.1145/3468267.3470572</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx24"><?xmltex \def\ref@label{{Mahajan et~al.(2017)Mahajan, Gaddis, Evans, and Norman}}?><label>Mahajan et al.(2017)Mahajan, Gaddis, Evans, and Norman</label><?label Mahajan2017?><mixed-citation>Mahajan, S., Gaddis, A. L., Evans, K. J., and Norman, M. R.: Exploring an
Ensemble-Based Approach to Atmospheric Climate Modeling and Testing at
Scale, Procedia Comput. Sci., 108, 735–744,
<ext-link xlink:href="https://doi.org/10.1016/j.procs.2017.05.259" ext-link-type="DOI">10.1016/j.procs.2017.05.259</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx25"><?xmltex \def\ref@label{{Mahajan et~al.(2019)Mahajan, Evans, Kennedy, Xu, and
Norman}}?><label>Mahajan et al.(2019)Mahajan, Evans, Kennedy, Xu, and
Norman</label><?label Mahajan2019?><mixed-citation>Mahajan, S., Evans, K. J., Kennedy, J. H., Xu, M., and Norman, M. R.: A
Multivariate Approach to Ensure Statistical Reproducibility of Climate Model
Simulations, in: Proceedings of the Platform for Advanced Scientific
Computing Conference, PASC '19, Association for Computing Machinery, New
York, NY, USA, 12–14 June 2019, <ext-link xlink:href="https://doi.org/10.1145/3324989.3325724" ext-link-type="DOI">10.1145/3324989.3325724</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx26"><?xmltex \def\ref@label{{Mann and Whitney(1947)}}?><label>Mann and Whitney(1947)</label><?label Mann1947?><mixed-citation>Mann, H. B. and Whitney, D. R.: On a Test of Whether one of Two Random
Variables is Stochastically Larger than the Other, Ann. Math. Stat., 18,
50–60, <ext-link xlink:href="https://doi.org/10.1214/aoms/1177730491" ext-link-type="DOI">10.1214/aoms/1177730491</ext-link>, 1947.</mixed-citation></ref>
      <ref id="bib1.bibx27"><?xmltex \def\ref@label{{Massonnet et~al.(2020)Massonnet, M{\'{e}}n{\'{e}}goz, Acosta,
Yepes-Arb{\'{o}}s, Exarchou, and Doblas-Reyes}}?><label>Massonnet et al.(2020)Massonnet, Ménégoz, Acosta,
Yepes-Arbós, Exarchou, and Doblas-Reyes</label><?label Massonet2020?><mixed-citation>Massonnet, F., Ménégoz, M., Acosta, M., Yepes-Arbós, X., Exarchou, E., and Doblas-Reyes, F. J.: Replicability of the EC-Earth3 Earth system model under a change in computing environment, Geosci. Model Dev., 13, 1165–1178, <ext-link xlink:href="https://doi.org/10.5194/gmd-13-1165-2020" ext-link-type="DOI">10.5194/gmd-13-1165-2020</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx28"><?xmltex \def\ref@label{{Milroy et~al.(2018)Milroy, Baker, Hammerling, and
Jessup}}?><label>Milroy et al.(2018)Milroy, Baker, Hammerling, and
Jessup</label><?label Milroy2018?><mixed-citation>Milroy, D. J., Baker, A. H., Hammerling, D. M., and Jessup, E. R.: Nine time steps: ultra-fast statistical consistency testing of the Community Earth System Model (pyCECT v3.0), Geosci. Model Dev., 11, 697–711, <ext-link xlink:href="https://doi.org/10.5194/gmd-11-697-2018" ext-link-type="DOI">10.5194/gmd-11-697-2018</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx29"><?xmltex \def\ref@label{{Oberkampf and Roy(2010)}}?><label>Oberkampf and Roy(2010)</label><?label Oberkampf2010?><mixed-citation>Oberkampf, W. L. and Roy, C. J.: Verification and Validation in Scientific
Computing, Cambridge University Press, <ext-link xlink:href="https://doi.org/10.1017/CBO9780511760396" ext-link-type="DOI">10.1017/CBO9780511760396</ext-link>, 2010.</mixed-citation></ref>
      <ref id="bib1.bibx30"><?xmltex \def\ref@label{{Oreskes(1998)}}?><label>Oreskes(1998)</label><?label Oreskes1998?><mixed-citation>Oreskes, N.: Evaluation (not validation) of quantitative models,
Environ. Health Persp., 106, 1453–1460,
<ext-link xlink:href="https://doi.org/10.1289/ehp.98106s61453" ext-link-type="DOI">10.1289/ehp.98106s61453</ext-link>, 1998.</mixed-citation></ref>
      <ref id="bib1.bibx31"><?xmltex \def\ref@label{{Oreskes et~al.(1994)Oreskes, Shrader-Frechette, and
Belitz}}?><label>Oreskes et al.(1994)Oreskes, Shrader-Frechette, and
Belitz</label><?label Oreskes1994?><mixed-citation>Oreskes, N., Shrader-Frechette, K., and Belitz, K.: Verification, Validation,
and Confirmation of Numerical Models in the Earth Sciences, Science, 263,
641–646, <ext-link xlink:href="https://doi.org/10.1126/science.263.5147.641" ext-link-type="DOI">10.1126/science.263.5147.641</ext-link>, 1994.</mixed-citation></ref>
      <ref id="bib1.bibx32"><?xmltex \def\ref@label{{Pithan et~al.(2015)Pithan, Angevine, and Mauritsen}}?><label>Pithan et al.(2015)Pithan, Angevine, and Mauritsen</label><?label Pithan2015?><mixed-citation>Pithan, F., Angevine, W., and Mauritsen, T.: Improving a global model from the
boundary layer: Total turbulent energy and the neutral limit Prandtl number,
J. Adv. Model. Earth Sy., 7, 2029–2043,
<ext-link xlink:href="https://doi.org/10.1002/2015MS000503" ext-link-type="DOI">10.1002/2015MS000503</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx33"><?xmltex \def\ref@label{{Posten(1984)}}?><label>Posten(1984)</label><?label Posten1984?><mixed-citation>Posten, H. O.: Robustness of the Two-Sample T-Test, in: Robustness of
Statistical Methods and Nonparametric Statistics, edited by: Rasch, D. and
Tiku, M. L.,  Springer, Netherlands, Dordrecht,
92–99,  <ext-link xlink:href="https://doi.org/10.1007/978-94-009-6528-7_23" ext-link-type="DOI">10.1007/978-94-009-6528-7_23</ext-link>, 1984.</mixed-citation></ref>
      <ref id="bib1.bibx34"><?xmltex \def\ref@label{{Raschendorfer(2001)}}?><label>Raschendorfer(2001)</label><?label Raschendorfer2001?><mixed-citation>Raschendorfer, M.: The new turbulence parameterization of LM, COSMO
Newsletter, 1, 89–97,
<uri>http://www.cosmo-model.org/content/model/documentation/newsLetters/newsLetter01/newsLetter_01.pdf</uri> (last access: 9 April 2022),
2001.</mixed-citation></ref>
      <ref id="bib1.bibx35"><?xmltex \def\ref@label{{Reichler and Kim(2008)}}?><label>Reichler and Kim(2008)</label><?label Reichler2008?><mixed-citation>Reichler, T. and Kim, J.: How Well Do Coupled Models Simulate Today's
Climate?, B. Am. Meteorol. Soc., 89, 303–312,
<ext-link xlink:href="https://doi.org/10.1175/BAMS-89-3-303" ext-link-type="DOI">10.1175/BAMS-89-3-303</ext-link>, 2008.</mixed-citation></ref>
      <ref id="bib1.bibx36"><?xmltex \def\ref@label{{Reinhardt and Seifert(2006)}}?><label>Reinhardt and Seifert(2006)</label><?label Reinhardt2006?><mixed-citation>Reinhardt, T. and Seifert, A.: A three-category ice scheme for LMK, COSMO
Newsletter, 6, 115–120,
<uri>http://www.cosmo-model.org/content/model/documentation/newsLetters/newsLetter06/cnl6_reinhardt.pdf</uri> (last access: 9 April 2022),
2006.</mixed-citation></ref>
      <ref id="bib1.bibx37"><?xmltex \def\ref@label{{Ritter and Geleyn(1992)}}?><label>Ritter and Geleyn(1992)</label><?label Ritter1992?><mixed-citation>Ritter, B. and Geleyn, J.-F.: A Comprehensive Radiation Scheme for Numerical
Weather Prediction Models with Potential Applications in Climate
Simulations, Mon. Weather Rev., 120, 303–325, <ext-link xlink:href="https://doi.org/10.1175/1520-0493(1992)120&lt;0303:ACRSFN&gt;2.0.CO;2" ext-link-type="DOI">10.1175/1520-0493(1992)120&lt;0303:ACRSFN&gt;2.0.CO;2</ext-link>, 1992.</mixed-citation></ref>
      <ref id="bib1.bibx38"><?xmltex \def\ref@label{{Rockel et~al.(2008)Rockel, Will, and Hense}}?><label>Rockel et al.(2008)Rockel, Will, and Hense</label><?label Rockel2008?><mixed-citation>Rockel, B., Will, A., and Hense, A.: The regional climate model COSMO-CLM
(CCLM), Meteorol. Z., 17, 347–348,
<ext-link xlink:href="https://doi.org/10.1127/0941-2948/2008/0309" ext-link-type="DOI">10.1127/0941-2948/2008/0309</ext-link>, 2008.</mixed-citation></ref>
      <ref id="bib1.bibx39"><?xmltex \def\ref@label{{Rosinski and Williamson(1997)}}?><label>Rosinski and Williamson(1997)</label><?label Rosinski1997?><mixed-citation>Rosinski, J. M. and Williamson, D. L.: The Accumulation of Rounding Errors and
Port Validation for Global Atmospheric Models,
SIAM J. Sci. Comput., 18, 552–564, <ext-link xlink:href="https://doi.org/10.1137/S1064827594275534" ext-link-type="DOI">10.1137/S1064827594275534</ext-link>, 1997.</mixed-citation></ref>
      <ref id="bib1.bibx40"><?xmltex \def\ref@label{{Sandu et~al.(2013)Sandu, Beljaars, Bechtold, Mauritsen, and
Balsamo}}?><label>Sandu et al.(2013)Sandu, Beljaars, Bechtold, Mauritsen, and
Balsamo</label><?label Sandu2013?><mixed-citation>Sandu, I., Beljaars, A., Bechtold, P., Mauritsen, T., and Balsamo, G.: Why is
it so difficult to represent stably stratified conditions in numerical
weather prediction (NWP) models?, J. Adv. Model. Earth Sy., 5, 117–133, <ext-link xlink:href="https://doi.org/10.1002/jame.20013" ext-link-type="DOI">10.1002/jame.20013</ext-link>, 2013.</mixed-citation></ref>
      <ref id="bib1.bibx41"><?xmltex \def\ref@label{{Sargent(2013)}}?><label>Sargent(2013)</label><?label Sargent2013?><mixed-citation>Sargent, R. G.: Verification and validation of simulation models,
J. Simul., 7, 12–24, <ext-link xlink:href="https://doi.org/10.1057/jos.2012.20" ext-link-type="DOI">10.1057/jos.2012.20</ext-link>, 2013.</mixed-citation></ref>
      <ref id="bib1.bibx42"><?xmltex \def\ref@label{{Sch{\"{a}}r et~al.(2020)Sch{\"{a}}r, Fuhrer, Arteaga, Ban,
Charpilloz, Girolamo, Hentgen, Hoefler, Lapillonne, Leutwyler, Osterried,
Panosetti, R{\"{u}}dis{\"{u}}hli, Schlemmer, Schulthess, Sprenger, Ubbiali,
and Wernli}}?><label>Schär et al.(2020)Schär, Fuhrer, Arteaga, Ban,
Charpilloz, Girolamo, Hentgen, Hoefler, Lapillonne, Leutwyler, Osterried,
Panosetti, Rüdisühli, Schlemmer, Schulthess, Sprenger, Ubbiali,
and Wernli</label><?label Schaer2020?><mixed-citation>Schär, C., Fuhrer, O., Arteaga, A., Ban, N., Charpilloz, C., Girolamo,
S. D., Hentgen, L., Hoefler, T., Lapillonne, X., Leutwyler, D., Osterried,
K., Panosetti, D., Rüdisühli, S., Schlemmer, L., Schulthess,
T. C., Sprenger, M., Ubbiali, S., and Wernli, H.: Kilometer-Scale Climate
Models, B. Am. Meteorol. Soc., 101, E567–E587,
<ext-link xlink:href="https://doi.org/10.1175/BAMS-D-18-0167.1" ext-link-type="DOI">10.1175/BAMS-D-18-0167.1</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx43"><?xmltex \def\ref@label{{Sch{\"{a}}ttler et~al.(2018)Sch{\"{a}}ttler, Doms, and
Baldauf}}?><label>Schättler et al.(2018)Schättler, Doms, and
Baldauf</label><?label Cosmo2018_UserGuide?><mixed-citation>Schättler, U., Doms, G., and Baldauf, M.: COSMO Documentation Part VII:
User's Guide, Deutscher Wetterdienst (DWD), Offenbach, Germany, <ext-link xlink:href="https://doi.org/10.5676/dwd_pub/nwv/cosmo-doc_5.05_vii" ext-link-type="DOI">10.5676/dwd_pub/nwv/cosmo-doc_5.05_vii</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx44"><?xmltex \def\ref@label{{Schlemmer et~al.(2018)Schlemmer, Sch{\"{a}}r, L{\"{u}}thi, and
Strebel}}?><label>Schlemmer et al.(2018)Schlemmer, Schär, Lüthi, and
Strebel</label><?label Schlemmer2018?><mixed-citation>Schlemmer, L., Schär, C., Lüthi, D., and Strebel, L.: A
Groundwater and Runoff Formulation for Weather and Climate Models, J. Adv. Model. Earth Sy., 10, 1809–1832,
<ext-link xlink:href="https://doi.org/10.1029/2017MS001260" ext-link-type="DOI">10.1029/2017MS001260</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx45"><?xmltex \def\ref@label{{Storch(1982)}}?><label>Storch(1982)</label><?label Storch1982?><mixed-citation>Storch, H. V.: A Remark on Chervin-Schneider's Algorithm to Test Significance
of Climate Experiments with GCM's, J. Atmos. Sci., 39,
187–189, <ext-link xlink:href="https://doi.org/10.1175/1520-0469(1982)039&lt;0187:AROCSA&gt;2.0.CO;2" ext-link-type="DOI">10.1175/1520-0469(1982)039&lt;0187:AROCSA&gt;2.0.CO;2</ext-link>, 1982.</mixed-citation></ref>
      <ref id="bib1.bibx46"><?xmltex \def\ref@label{{{Student}(1908)}}?><label>Student(1908)</label><?label Student1908?><mixed-citation>Student: The Probable Error of a Mean, Biometrika, 6, 1–25,
<ext-link xlink:href="https://doi.org/10.2307/2331554" ext-link-type="DOI">10.2307/2331554</ext-link>, 1908.</mixed-citation></ref>
      <ref id="bib1.bibx47"><?xmltex \def\ref@label{{Sullivan and D'Agostino(1992)}}?><label>Sullivan and D'Agostino(1992)</label><?label Sullivan1992?><mixed-citation>Sullivan, L. M. and D'Agostino, R. B.: Robustness of the t Test Applied to
Data Distorted from Normality by Floor Effects,
J. Dent. Res.,
71, 1938–1943, <ext-link xlink:href="https://doi.org/10.1177/00220345920710121601" ext-link-type="DOI">10.1177/00220345920710121601</ext-link>, 1992.</mixed-citation></ref>
      <ref id="bib1.bibx48"><?xmltex \def\ref@label{{Thomas et~al.(2002)Thomas, Hacker, Desgagn{\'{e}}, and Stull}}?><label>Thomas et al.(2002)Thomas, Hacker, Desgagné, and Stull</label><?label Thomas2002?><mixed-citation>Thomas, S. J., Hacker, J. P., Desgagné, M., and Stull, R. B.: An Ensemble
Analysis of Forecast Errors Related to Floating Point Performance,
Weather Forecast., 17, 898–906,
<ext-link xlink:href="https://doi.org/10.1175/1520-0434(2002)017&lt;0898:AEAOFE&gt;2.0.CO;2" ext-link-type="DOI">10.1175/1520-0434(2002)017&lt;0898:AEAOFE&gt;2.0.CO;2</ext-link>, 2002.</mixed-citation></ref>
      <ref id="bib1.bibx49"><?xmltex \def\ref@label{{Tiedtke(1989)}}?><label>Tiedtke(1989)</label><?label Tiedtke1989?><mixed-citation>Tiedtke, M.: A comprehensive mass flux scheme for cumulus parameterization in
large-scale models, Mon. Weather Rev., 117, 1779–1800,
<ext-link xlink:href="https://doi.org/10.1175/1520-0493(1989)117&lt;1779:ACMFSF&gt;2.0.CO;2" ext-link-type="DOI">10.1175/1520-0493(1989)117&lt;1779:ACMFSF&gt;2.0.CO;2</ext-link>, 1989.</mixed-citation></ref>
      <ref id="bib1.bibx50"><?xmltex \def\ref@label{{Ventura et~al.(2004)Ventura, Paciorek, and Risbey}}?><label>Ventura et al.(2004)Ventura, Paciorek, and Risbey</label><?label Ventura2004?><mixed-citation>Ventura, V., Paciorek, C. J., and Risbey, J. S.: Controlling the Proportion of
Falsely Rejected Hypotheses when Conducting Multiple Tests with
Climatological Data, J. Climate, 17, 4343–4356,
<ext-link xlink:href="https://doi.org/10.1175/3199.1" ext-link-type="DOI">10.1175/3199.1</ext-link>, 2004.</mixed-citation></ref>
      <ref id="bib1.bibx51"><?xmltex \def\ref@label{{Wan et~al.(2017)Wan, Zhang, Rasch, Singh, Chen, and
Edwards}}?><label>Wan et al.(2017)Wan, Zhang, Rasch, Singh, Chen, and
Edwards</label><?label Wan2017?><mixed-citation>Wan, H., Zhang, K., Rasch, P. J., Singh, B., Chen, X., and Edwards, J.: A new and inexpensive non-bit-for-bit solution reproducibility test based on time step convergence (TSC1.0), Geosci. Model Dev., 10, 537–552, <ext-link xlink:href="https://doi.org/10.5194/gmd-10-537-2017" ext-link-type="DOI">10.5194/gmd-10-537-2017</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx52"><?xmltex \def\ref@label{{Whitner and Balci(1989)}}?><label>Whitner and Balci(1989)</label><?label Whitner1989?><mixed-citation>Whitner, R. B. and Balci, O.: Guidelines for Selecting and Using Simulation
Model Verification Techniques, in: Proceedings of the 21st Conference on
Winter Simulation, WSC '89, 4–6 December 1989, Association for Computing
Machinery, New York, NY, USA, 559–568, <ext-link xlink:href="https://doi.org/10.1145/76738.76811" ext-link-type="DOI">10.1145/76738.76811</ext-link>, 1989.</mixed-citation></ref>
      <ref id="bib1.bibx53"><?xmltex \def\ref@label{{Wicker and Skamarock(2002)}}?><label>Wicker and Skamarock(2002)</label><?label Wicker2002?><mixed-citation>Wicker, L. J. and Skamarock, W. C.: Time-Splitting Methods for Elastic Models
Using Forward Time Schemes, Mon. Weather Rev., 130, 2088–2097,
<ext-link xlink:href="https://doi.org/10.1175/1520-0493(2002)130&lt;2088:TSMFEM&gt;2.0.CO;2" ext-link-type="DOI">10.1175/1520-0493(2002)130&lt;2088:TSMFEM&gt;2.0.CO;2</ext-link>, 2002.</mixed-citation></ref>
      <ref id="bib1.bibx54"><?xmltex \def\ref@label{{Wilcox(1997)}}?><label>Wilcox(1997)</label><?label Wilcox1997?><mixed-citation>Wilcox, R. R.: Some practical reasons for reconsidering the Kolmogorov-Smirnov
test, Brit. J. Math. Stat. Psy., 50, 9–20,
<ext-link xlink:href="https://doi.org/10.1111/j.2044-8317.1997.tb01098.x" ext-link-type="DOI">10.1111/j.2044-8317.1997.tb01098.x</ext-link>, 1997.</mixed-citation></ref>
      <ref id="bib1.bibx55"><?xmltex \def\ref@label{{Wilks(2016)}}?><label>Wilks(2016)</label><?label Wilks2016?><mixed-citation>Wilks, D. S.: “The Stippling Shows Statistically Significant Grid Points”:
How Research Results are Routinely Overstated and Overinterpreted, and What
to Do about It, B. Am. Meteorol. Soc., 97,
2263–2273, <ext-link xlink:href="https://doi.org/10.1175/BAMS-D-15-00267.1" ext-link-type="DOI">10.1175/BAMS-D-15-00267.1</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bibx56"><?xmltex \def\ref@label{{Zadra et~al.(2003)Zadra, Roch, Laroche, and Charron}}?><label>Zadra et al.(2003)Zadra, Roch, Laroche, and Charron</label><?label Zadra2003?><mixed-citation>Zadra, A., Roch, M., Laroche, S., and Charron, M.: The subgrid-scale
orographic blocking parametrization of the GEM Model, Atmos. Ocean,
41, 155–170, <ext-link xlink:href="https://doi.org/10.3137/ao.410204" ext-link-type="DOI">10.3137/ao.410204</ext-link>, 2003.
</mixed-citation></ref><?xmltex \hack{\newpage}?>
      <ref id="bib1.bibx57"><?xmltex \def\ref@label{{Zeman and Schär(2021)}}?><label>Zeman and Schär(2021)</label><?label ZemanS2021?><mixed-citation>Zeman, C. and  Schär, C.: Data for “An Ensemble-Based Statistical Methodology to Detect Differences in Weather and Climate Model Executables” Part 1/2, Zenodo [data set], <ext-link xlink:href="https://doi.org/10.5281/zenodo.6354200" ext-link-type="DOI">10.5281/zenodo.6354200</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx58"><?xmltex \def\ref@label{{Zeman and Schär(2022a)}}?><label>Zeman and Schär(2022a)</label><?label ZemanS2022a?><mixed-citation>Zeman, C. and  Schär, C.: Data for “An Ensemble-Based Statistical Methodology to Detect Differences in Weather and Climate Model Executables” Part 2/2, Zenodo [data set], <ext-link xlink:href="https://doi.org/10.5281/zenodo.6355647" ext-link-type="DOI">10.5281/zenodo.6355647</ext-link>, 2022a.</mixed-citation></ref>
      <ref id="bib1.bibx59"><?xmltex \def\ref@label{{Zeman and Schär(2022b)}}?><label>Zeman and Schär(2022b)</label><?label ZemanS2022b?><mixed-citation>Zeman, C. and  Schär, C.: Source Code for “An Ensemble-Based Statistical Methodology to Detect Differences in Weather and Climate Model Executables”, Zenodo [code], <ext-link xlink:href="https://doi.org/10.5281/zenodo.6355694" ext-link-type="DOI">10.5281/zenodo.6355694</ext-link>, 2022b.</mixed-citation></ref>
      <ref id="bib1.bibx60"><?xmltex \def\ref@label{{Zeman et~al.(2021)Zeman, Wedi, Dueben, Ban, and
Sch{\"{a}}r}}?><label>Zeman et al.(2021)Zeman, Wedi, Dueben, Ban, and
Schär</label><?label Zeman2021?><mixed-citation>Zeman, C., Wedi, N. P., Dueben, P. D., Ban, N., and Schär, C.: Model intercomparison of COSMO 5.0 and IFS 45r1 at kilometer-scale grid spacing, Geosci. Model Dev., 14, 4617–4639, <ext-link xlink:href="https://doi.org/10.5194/gmd-14-4617-2021" ext-link-type="DOI">10.5194/gmd-14-4617-2021</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx61"><?xmltex \def\ref@label{{Zimmerman(1987)}}?><label>Zimmerman(1987)</label><?label Zimmermann1987?><mixed-citation>Zimmerman, D. W.: Comparative Power of Student T Test and Mann-Whitney U Test
for Unequal Sample Sizes and Variances,
J. Exp. Educ., 55, 171–174, <ext-link xlink:href="https://doi.org/10.1080/00220973.1987.10806451" ext-link-type="DOI">10.1080/00220973.1987.10806451</ext-link>, 1987.</mixed-citation></ref>

  </ref-list></back>
</article>
