the Creative Commons Attribution 4.0 License.
The Ensemble Consistency Test: From CESM to MPAS and Beyond
Abstract. The Ensemble Consistency Test (ECT) and its Ultra-Fast variant (UF-ECT) have become powerful tools in the development community for the identification of unwanted changes in the Community Earth System Model (CESM). By characterizing the distribution of an accepted ensemble of perturbed ultra-fast model runs, the UF-ECT is able to identify changes exceeding internal variability in expensive chaotic numerical models with reasonable computational costs. However, up until now this approach has not seen adoption by other communities, in part because the process of adapting the UF-ECT procedure to other models was not clear. In this work we develop a generalized setup framework for applying the UF-ECT to different models and show how our specification of UF-ECT parameters allows us to balance important goals like test sensitivity and computational cost. Finally, we walk through the setup framework in detail and demonstrate the performance of the UF-ECT with our newly determined parameters for the Model for Prediction Across Scales-Atmosphere (MPAS-A) model, the substantially updated CESM atmospheric model, and realistic development scenarios.
Status: final response (author comments only)
RC1: 'Comment on gmd-2024-115', Anonymous Referee #1, 13 Sep 2024
I have had the opportunity to review the paper entitled "The Ensemble Consistency Test: From CESM to MPAS and beyond" by Price-Broncucia et al.
I think it is a very good paper which deserves publication in GMD. At first glance, the introduction seemed to me to be oriented mainly towards the technical aspects of model testing. However, as the paper described the methodology that the authors use to identify potential errors, the way they take into account the expected number of false alarms and so on, I realized that the paper clearly goes (in my humble opinion) beyond a "technical contribution for model testing" to the next level. The authors have successfully described the way an existing methodology can be extended to new models, and they make the tool available to other model developers.
As such, I support the paper being accepted, as I really feel it can be a valid contribution for other members of the community. I was struck by the results in Section 5.3: that the system automatically detects these errors with such a low number of model steps is a very good result, from my point of view.
However, there are some results, and some of the ways the authors solve the problem, that puzzle me. I think that the paper would be more self-contained if the authors discussed the following points. I guess that the authors have reasons for using the technique the way they do. But I have my own (possibly wrong) ideas, questions or, basically, constructive suggestions, which I am sure the authors can answer easily.
- I am surprised by the way the authors compute the PCs after averaging the variables over the whole domain. It is true that PCs are good for isolating the main axes of variability in a dataset. However, it is also clear that atmospheric fields have some solid spatial structures that are rooted in basic physics. I am thinking, for example, of the latitudinal structure of zonally averaged surface temperature, or of the vertical temperature profile of the middle troposphere or the stratosphere. These features are missed when the variables are averaged over the full domain before calculating the PCs. I am also aware that the way these features are represented might depend strongly on the horizontal/vertical resolution of the model being tested. However, computing the zonally averaged surface temperature in the midlatitudes ([30,60] degrees in each hemisphere) and in the tropics ([20S,20N]) before calculating the PCs would give three different series that represent (even if crudely) the meridional temperature gradient. In the same way, taking the temperature at 400 hPa minus the temperature at 700 hPa seems a crude but simple way of evaluating the vertical profile in the troposphere. A similar technique might be used to evaluate the lapse rate around 100 hPa, for instance, adding two different columns to the PCs (without removing the averages that the authors use). Using these diagnostics might improve the sensitivity of the system, can be extended to models of different resolutions and does not involve significant new computations. What do the authors think about this? Can they elaborate on this?
- Line 293. The authors state that a correlation coefficient of 0.75 is a “limit” that they use to diagnose whether a correlation matrix is or is not rank deficient. However, this would be strongly dependent on the number of degrees of freedom. Why do they not use the spectrum of singular values derived from the singular value decomposition (SVD) of the correlation matrix? If no singular value is negligible, they know for sure they do not have a problem. Since they apply this to a “small” matrix of averaged series, using the SVD should not be computationally very expensive.
- Line 332. The authors use the (often used) target of 95% of explained variance to identify the number of PCs that must be kept. I do not think this is critical, but I suggest that the authors (for future versions of the software) consider determining whether the corresponding EOFs are well determined by the sample, which might be safer from the point of view of the stability of the next steps. There are alternative methods in the literature for this, based either on the errors of the eigenvalues or on congruence coefficients.
- Cheng, X., G. Nitsche, and J. M. Wallace, 1995: Robustness of low-frequency circulation patterns derived from EOF and rotated EOF analyses. J. Climate, 8, 1709–1713.
- North, G. R., T. L. Bell, R. F. Cahalan, and F. J. Moeng, 1982: Sampling errors in the estimation of empirical orthogonal functions. Mon. Wea. Rev., 110, 699–706.
- I think the sentence in lines 174 and 175 must be better worded; I do not find it clear enough.
Besides those comments, there are still a few typos in the paper. My mother tongue is not English, but I think the authors should still proofread the text thoroughly.
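The singular-value check suggested in the Line 293 comment can be illustrated with a minimal sketch. It assumes the 0.75 limit is applied to pairwise correlations, which is one reading of the comment and not confirmed here; the synthetic data, the function names and the tolerance `rel_tol` are hypothetical and are not taken from the manuscript or its software.

```python
import numpy as np


def correlation_singular_values(series):
    """Singular values (descending) of the correlation matrix of `series`.

    `series` has shape (n_samples, n_variables).
    """
    corr = np.corrcoef(series, rowvar=False)
    return np.linalg.svd(corr, compute_uv=False)


def is_near_rank_deficient(series, rel_tol=1e-8):
    """True if the smallest singular value is negligible next to the largest.

    `rel_tol` is an illustrative choice, not a value from the manuscript.
    """
    s = correlation_singular_values(series)
    return bool(s[-1] < rel_tol * s[0])


def max_pairwise_correlation(series):
    """Largest absolute off-diagonal entry of the correlation matrix."""
    corr = np.corrcoef(series, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.max(np.abs(off_diag)))


# Synthetic stand-in for domain-averaged variables (assumption: 5 independent
# series, plus a 6th that is the exact sum of the first three).
rng = np.random.default_rng(0)
independent = rng.standard_normal((2000, 5))
redundant = np.column_stack([independent, independent[:, :3].sum(axis=1)])

print("independent set rank deficient:", is_near_rank_deficient(independent))
print("redundant set rank deficient:  ", is_near_rank_deficient(redundant))
print("largest pairwise |r| in redundant set:", max_pairwise_correlation(redundant))
```

In this constructed case the redundant variable is an exact linear combination of three others, so every pairwise correlation involving it stays well below 0.75 while the correlation matrix is still singular. That is the kind of situation a pairwise threshold may miss and a singular-value spectrum would reveal.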
- Line 174. “the the”
- Line 314. Is a “d” missing after “average”?
- Line 454. “with with”
Citation: https://doi.org/10.5194/gmd-2024-115-RC1
AC1: 'Comment on gmd-2024-115', Teo Price-Broncucia, 09 Dec 2024
The comment was uploaded in the form of a supplement: https://gmd.copernicus.org/preprints/gmd-2024-115/gmd-2024-115-AC1-supplement.pdf
RC2: 'Comment on gmd-2024-115', Anonymous Referee #2, 24 Oct 2024
The paper presents the application of an existing method of model validation in scenarios where major model updates are provided (such as modifying compilers, etc.). The work was interesting and well written, with the work seemingly easily repeatable, and such an approach does have potential for significant impact in the modeling community. I do have a few comments to consider as the authors work to refine the study further.
Major comments:
In the process of spatially averaging, you lose spatial autocorrelation that may yield important patterns. For example, you lose the ability to characterize whether your model is identifying features like the NAO appropriately. Averaging helps with the computational tractability of the problem, but is the inability to characterize the underlying differences in spatial variability between the two sets of model runs an issue you are able to overlook? I would argue that for meteorological time scales this may not be as important, but in a climate time-scale application, this is certainly a tradeoff and a potential limitation of the validation approach.
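The tradeoff described here can be made concrete with a small sketch on synthetic data. The grid, the temperature-like fields and the band limits below are assumptions chosen for illustration (the tropical band follows the [20S,20N] range mentioned in RC1); none of it comes from the manuscript. Two fields are built to share the same area-weighted global mean while differing in their equator-to-pole gradient, so only a regional mean tells them apart.

```python
import numpy as np


def area_weighted_mean(field, lat_deg, lat_min=-90.0, lat_max=90.0):
    """Cosine-latitude weighted mean of `field` (n_lat, n_lon) over a band."""
    mask = (lat_deg >= lat_min) & (lat_deg <= lat_max)
    weights = np.cos(np.deg2rad(lat_deg[mask]))
    zonal_mean = field[mask].mean(axis=1)
    return float(np.sum(weights * zonal_mean) / np.sum(weights))


# Hypothetical regular 2-degree grid.
lat = np.linspace(-89.0, 89.0, 90)
lon = np.linspace(0.0, 358.0, 180)

# Meridional pattern (warm tropics, cold poles) with its discrete
# area-weighted mean removed, so it does not change the global mean.
w = np.cos(np.deg2rad(lat))
pattern = 1.0 - 3.0 * np.sin(np.deg2rad(lat)) ** 2
pattern -= np.sum(w * pattern) / np.sum(w)

# Two synthetic fields: same global mean, different gradient strength.
field_a = 288.0 + 15.0 * pattern[:, None] * np.ones(lon.size)
field_b = 288.0 + 20.0 * pattern[:, None] * np.ones(lon.size)

global_a = area_weighted_mean(field_a, lat)
global_b = area_weighted_mean(field_b, lat)
tropics_a = area_weighted_mean(field_a, lat, -20.0, 20.0)
tropics_b = area_weighted_mean(field_b, lat, -20.0, 20.0)

print("global means: ", global_a, global_b)
print("tropical means:", tropics_a, tropics_b)
```

The global means agree to rounding error, while the tropical means differ by several kelvin. Whether such a difference matters for a consistency test at the short time scales considered is exactly the question raised in this comment.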
When selecting variables, the variables selected are natively generated by the model. If the end user of the model calculates additional derived fields from the model, would you suggest modifying your parameters specific to that problem? It may be good to add a few sentences discussing how your method may change if the user is deriving fields from the native model output, which is often the case with these models.
The authors state they retained 43 variables from MPAS, yet they suggest there are 55 vertical levels in the data. Do the spatial averages include vertical averaging as well? Reducing 40000+ values to a single spatial mean and then repeating that 55 times to get a single number from well over 2 million points seems like a lot. Are vertical levels of 3-D variables treated separately or together?
When using the Shapiro-Wilk test for normality with 43 variables, the probability of committing a Type I error will be quite high (roughly 89%) for each time slice considered. This is more problematic if you are considering vertical levels separately. Did you apply any type of correction to the Shapiro-Wilk tests to account for the multiplicity problem?
The selection of 95% variance explained for the N_PC cutoff seems to increase the risk that you are comparing PCs that are noise instead of signal. Did you experiment with more traditional methods for selecting PCs, such as a scree test, the method of congruence (Richman and Lamb 1985) or a North test (North et al. 1982)? Regardless, how do you account for the risk of noise versus signal when keeping so many PCs? This is even more egregious with CESM, where you are keeping 130 PCs. Almost certainly that number of PCs includes some noise that may not be useful.
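The "roughly 89%" figure is consistent with the familywise error rate for 43 tests at a per-test level of 0.05, under the assumption that the tests are independent (an assumption; the variables are likely correlated, in which case the true rate differs). The sketch below reproduces that arithmetic and shows one standard correction, the Holm step-down procedure, on hypothetical p-values. It is an illustration, not the authors' procedure.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one false rejection) for independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def holm_reject(p_values, alpha=0.05):
    """Holm step-down correction.

    Returns a list of booleans aligned with `p_values`; True means the
    corresponding null hypothesis (here, normality) is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the (rank+1)-th smallest p-value with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


fwer = familywise_error_rate(0.05, 43)
print(f"familywise error rate for 43 tests: {fwer:.4f}")
print(f"Bonferroni per-test level: {0.05 / 43:.5f}")

# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.04, 0.03]
print(holm_reject(example_p))
```

With a Bonferroni or Holm correction the familywise rate is held at or below 0.05, at the cost of lower power for each individual normality test.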
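For context, the rule of thumb of North et al. (1982) estimates the sampling error of an eigenvalue as roughly λ·sqrt(2/N) for N independent samples, and treats an eigenvalue as well separated when the gap to its nearest neighbor exceeds that error. The sketch below compares this with a cumulative-variance cutoff on a hypothetical eigenvalue spectrum. The spectrum, the sample size and the function names are assumptions for illustration and do not come from the manuscript.

```python
import numpy as np


def north_separated(eigenvalues, n_samples):
    """Flag eigenvalues whose nearest-neighbor gap exceeds the sampling error.

    Uses the rule of thumb delta_lambda ~ lambda * sqrt(2 / n_samples).
    Returns a boolean array ordered from largest to smallest eigenvalue.
    """
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    err = lam * np.sqrt(2.0 / n_samples)
    gaps = np.abs(np.diff(lam))
    # Gap to the closest neighbor on either side (infinite at the ends).
    nearest = np.minimum(np.r_[np.inf, gaps], np.r_[gaps, np.inf])
    return nearest > err


def n_pcs_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading PCs whose variance fraction reaches `target`."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    frac = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(frac, target) + 1)


# Hypothetical spectrum and ensemble size, for illustration only.
spectrum = [10.0, 4.0, 1.0, 0.9, 0.1]
n_members = 100

print("PCs kept by 95% variance rule:", n_pcs_for_variance(spectrum))
print("well separated (North):", north_separated(spectrum, n_members).tolist())
```

In this made-up spectrum the variance cutoff keeps four PCs, while the third and fourth eigenvalues are not separated from each other under the rule of thumb, so their individual EOFs would be poorly determined by the sample even though the subspace they span may still carry variance.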
I appreciated the model resolution experiments as this was something I was certainly interested in. I also noted the authors’ selection of the same # of PCs for both resolutions of CESM they tested. In their example they suggested this was okay since the differences were minimal in variance explained, but the authors had the luxury of knowing the # of PCs for the coarser resolution run already. In practice, is this a fair comparison, since in a real experiment the # of PCs you retained would be related to something else?
Minor comments:
When choosing variables to exclude, what criteria are used to determine if variables are “linearly correlated”? Is there a correlation threshold? If so, what threshold and why? Upon further reading I found this definition on line 293. I recommend moving it earlier so the reader has context when the idea of “linearly correlated” is first introduced in the text.
Something strange is happening with the text on line 197 with the Molinari citation. I think a comma, parentheses, or something else may be missing.
Line 219: the word “don’t” should be changed to “do not” to avoid the use of contractions in scientific writing. I see the same issue on line 328. It may appear elsewhere, so please check and clean those up.
While I realize the point is to show the application of the method to any objective model configuration, the reader may benefit from a table listing what the 43 variables are that you consider from the MPAS, at least to reveal the types of things you are considering in the PCA.
In the figure caption for Fig. 17 I assume you mean p < 0.05, not p < 0.5.
Citation: https://doi.org/10.5194/gmd-2024-115-RC2
AC1: 'Comment on gmd-2024-115', Teo Price-Broncucia, 09 Dec 2024
The comment was uploaded in the form of a supplement: https://gmd.copernicus.org/preprints/gmd-2024-115/gmd-2024-115-AC1-supplement.pdf