A Deep Learning-Based Consistency Test Approach for Earth System Models on Heterogeneous Many-Core Systems
Abstract. Physical and heat limits of semiconductor technology require the adoption of heterogeneous architectures in supercomputers to maintain a continuous increase in computing performance. The coexistence of general-purpose cores and accelerator cores, which usually employ different hardware architectures, can lead to bit-level differences, especially when we try to maximize the performance on both kinds of cores. Such differences further lead to unavoidable computational perturbations through temporal integration, which can blend with software or human errors. Software correctness verification in the form of quality assurance is a critically important step in the development and optimization of Earth system models (ESMs) on heterogeneous many-core systems, where perturbations from software changes and hardware updates are mixed. We have developed a deep learning-based consistency test approach for Earth system models, referred to as ESM-DCT. The ESM-DCT is based on an unsupervised bidirectional gated recurrent unit-autoencoder (BGRU-AE) model, which can detect the existence of software or human errors while taking hardware-related perturbations into account. We use the Community Earth System Model (CESM) on the new Sunway system as an example of a large-scale ESM to evaluate the ESM-DCT. The results show that, when facing the mixed perturbations caused by hardware designs and software changes in heterogeneous computing, the ESM-DCT can detect software or human errors when determining whether or not a model simulation is consistent with the original results obtained in homogeneous computing. Our ESM-DCT tool provides an efficient and objective approach for verifying the reliability of the development and optimization of scientific computing models on heterogeneous many-core systems.
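To make the model named in the abstract concrete, the following is a minimal PyTorch sketch of a bidirectional-GRU autoencoder. The layer sizes, latent dimension, and toy input are illustrative assumptions, not the configuration reported in the manuscript; only the overall BGRU-AE structure and the use of reconstruction error follow the text.

```python
import torch
import torch.nn as nn

class BGRUAutoencoder(nn.Module):
    """Bidirectional-GRU autoencoder sketch (hypothetical sizes)."""
    def __init__(self, n_features=1, hidden=32, latent=8):
        super().__init__()
        # Encoder: a bidirectional GRU reads the sequence of variables
        self.encoder = nn.GRU(n_features, hidden, batch_first=True,
                              bidirectional=True)
        self.to_latent = nn.Linear(2 * hidden, latent)
        # Decoder: expands the latent code back into a sequence
        self.from_latent = nn.Linear(latent, 2 * hidden)
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_features)

    def forward(self, x):                  # x: (batch, seq_len, n_features)
        seq_len = x.size(1)
        enc, _ = self.encoder(x)           # (batch, seq_len, 2*hidden)
        z = self.to_latent(enc[:, -1])     # last step summarizes the sequence
        dec_in = self.from_latent(z).unsqueeze(1).repeat(1, seq_len, 1)
        dec, _ = self.decoder(dec_in)
        return self.out(dec)               # reconstruction of x

# The reconstruction error (MSE) serves as the consistency test statistic
model = BGRUAutoencoder()
x = torch.randn(4, 97, 1)                  # e.g. 97 area-weighted variables
err = nn.functional.mse_loss(model(x), x)
```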
Withdrawal notice
This preprint has been withdrawn.
Interactive discussion
Status: closed
RC1: 'Comment on gmd-2024-10', Anonymous Referee #1, 05 Mar 2024
This manuscript presents an effort to address the need to determine the correctness (i.e., consistency with trusted references) of Earth system model (ESM) simulations conducted in new computing environments. The methodology is based on deep learning, and the primary focus is the impact of using heterogeneous many-core computer systems. I have very limited knowledge in deep learning and computer hardware, and hence will leave it to the other referee(s) and the Handling Editor to assess the manuscript in those respects. My comments below are from the perspective of an Earth system model developer who has encountered the challenge of correctness/consistency testing in their own work.
Overall, I think the study has made a worthwhile attempt to address an important need in ESM development, and I look forward to new, efficient, and objective methods that help fulfill the need. The manuscript, on the other hand, needs substantial improvements to make the proposed method more convincing and the presentation easier to follow. Below are my major comments.
- Some conceptual clarifications are needed regarding the scope and the underlying assumptions of the proposed method. My understanding is that the work by Baker et al. (2015) and the follow-up studies cited as references in this manuscript aimed at assessing the mean climate simulated by ESMs like the CESM. Therein, Earth system modeling was viewed as a boundary condition problem even though the action taken to obtain the climate statistics was to numerically integrate a set of time evolution equations. The work by Milroy et al. (2018) and other studies that developed “ultra-fast” tests using short simulations aimed at early detections of potential changes in the long-term averages of the model state. Supposedly, the method proposed in this manuscript has the same goal, namely, issuing a "fail" result for climate-changing modifications in the source code or the computing environment and issuing a "pass" result for non-climate-changing situations. However, the discussion on acceptable and unacceptable initial perturbations in Section 3.3 gives the impression that a CESM simulation is considered a solution of an initial condition problem. In my understanding, initial perturbations on the order of 10E-6 K are unlikely to cause a change in the long-term climate unless the perturbations have some special features and hit some special parts of the code that eventually push the model climate into a different equilibrium. Along the line of the “unacceptable initial conditions”, if one conducts a set of CESM simulations using the unmodified code and in a trusted HPC system but using initial conditions written out on different days or different years of a previously performed trusted simulation, would this new set of simulations be given a “fail” by the proposed test? If the answer is “yes”, then does this mean the test results provided by the proposed method can have a very high false positive rate?
- It will be very helpful to put the proposed method into the context of existing work and further explain what is new and advantageous.
The earlier papers by, e.g., Baker et al. (2015, 2016), Milroy et al. (2016, 2018), Mahajan et al. (2017, DOI: 10.1016/j.procs.2017.05.259), Mahajan et al. (2019, DOI: 10.1145/3324989.3325724) and Wan et al. (2017, DOI: 10.5194/gmd-10-537-2017) based their work on more traditional statistical methods and/or the theory of numerical analysis, while this work uses deep learning. What is the added value of using deep learning? How did the nonlinear and chaotic features of the Earth system affect the choice of deep learning methods? In Section 3.2, it is stated that bidirectional neural network models can better capture “context information”. What would be an example of “context information” when testing CESM simulations? Also, could there be a physics-based interpretation of the reconstruction error that is used as the key test metric in the proposed method?
It is stated in the introduction that the earlier studies by Baker et al. (2015, 2016) and Milroy et al. (2016, 2018) used homogeneous HPC systems for the CESM simulations while this work focuses on heterogeneous systems. Is it expected that the earlier methods are invalid for simulations performed on heterogeneous systems? In reverse, is it expected that the newly proposed method can work only for heterogeneous systems and not for homogeneous systems? If the answers to these two questions are "yes", then what is special/different about the heterogeneous systems that makes the earlier methods invalid? An explanation from the perspective of solving the equations underlying models like CESM would greatly help the readers understand the value of the proposed method. My impression from reading the manuscript, however, is that both the earlier methods and the method proposed in this manuscript should in principle be applicable to both homogeneous and heterogeneous HPC systems. If that's indeed the case, it will be useful to know whether the newly proposed method is particularly suitable for heterogeneous systems and why.
- More evidence is needed to demonstrate the trustworthiness of the proposed method. It is known that testing the consistency/correctness of ESM simulations is challenging, and that different methods can give the same or different answers to the pass-or-fail question; see, e.g., Milroy et al. (2018) for multiple examples and this link (https://e3sm.org/can-we-switch-computers-an-application-of-e3sm-climate-reproducibility-tests/) for a real-life application of a different set of methods. Assuming my understanding is correct that the proposed method is meant to be used for assessing the simulated long-term climate using very short simulations and the method is expected to be applicable to both homogeneous and heterogeneous multi-core HPC systems, I would suggest the following:
More evidence is needed to demonstrate that for cases which have been unambiguously identified in the literature to be climate-changing or non-climate-changing, the newly proposed method gives the same pass/fail results as reported in the literature. Admittedly, two examples in this category are presented in the manuscript, namely the value changes for the uncertain parameters c0_lnd and c0_ocean, but more examples would make the manuscript more convincing. (BTW, I suppose the “pass” listed in each row of the rightmost column of Table 5 should be “fail” instead.)
For the cases labeled with “unknown outcome”, the manuscript should demonstrate that the proposed method (which uses CESM output after “24 time steps”) gives the same pass/fail results as the conclusions drawn by an independent assessment either using a CESM-ECT method or using experts’ judgement based on multi-year or multi-decade simulations. The current manuscript presents 9 cases of unknown outcome (Table 6). It is unclear whether the purpose of showing this many cases in the “unknown outcome” category is to discuss certain features of the proposed method or to demonstrate some applications of the method.
If there are cases where the pass or fail determined by the proposed method is different from the conclusion in the literature or an experts' judgement, then an explanation for the discrepancy will provide very useful guidance for potential users of the proposed method.
- There are significant gaps in the description of the proposed method. The parameters listed in Table 3 are not defined in the text, and it is unclear how they enter the equations or algorithms of the deep learning model. The application of the method to the 5VCCM used a training ensemble of 151 members, a simulation length of 1000 time steps, a test ensemble size of 40, and a threshold passing rate of 90% for an overall "fail" (lines 140-146). How were these numbers determined? The CESM simulations used output at "24 time steps" (are these atmosphere model time steps of 30 minutes each?) with the coupling to the ocean occurring 8 times per day. The ensemble sizes for training, validation, and testing were 120, 40, and 40, respectively. The threshold pass rate for issuing a "fail" was 90%. How were these numbers selected? If a reader is interested in applying the proposed method to a different ESM, which of the above-mentioned details need to be revised?
- The scientific presentation can benefit from significant improvements. The sequence in which contents are currently presented is awkward at many places. Here only two examples are mentioned, but hopefully a systematic review and revision can be done by the authors: The proof-of-concept presented using the 5VCCM from line 116 to line 151 should be moved to somewhere in Section 3 after the details of the new testing method have been described. The CESM code version, compset, resolution etc. should be clarified before the first mention of model time step, coupling time step etc.
Again, I think the study is worthwhile, but a clearer and more compelling presentation is needed for publication in GMD.
Citation: https://doi.org/10.5194/gmd-2024-10-RC1
AC2: 'Reply on RC1', Yangyang Yu, 23 Apr 2024
This manuscript presents an effort to address the need to determine the correctness (i.e., consistency with trusted references) of Earth system model (ESM) simulations conducted in new computing environments. The methodology is based on deep learning, and the primary focus is the impact of using heterogeneous many-core computer systems. I have very limited knowledge in deep learning and computer hardware, and hence will leave it to the other referee(s) and the Handling Editor to assess the manuscript in those respects. My comments below are from the perspective of an Earth system model developer who has encountered the challenge of correctness/consistency testing in their own work.
Overall, I think the study has made a worthwhile attempt to address an important need in ESM development, and I look forward to new, efficient, and objective methods that help fulfill the need. The manuscript, on the other hand, needs substantial improvements to make the proposed method more convincing and the presentation easier to follow. Below are my major comments.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We all agree that the comments are very constructive for us to improve the presentation of the MS, and all the major comments and points have been fully addressed in the revision. Specifically, in the revision, we have added:
1) more detailed descriptions of the advantages of the deep learning method; 2) the experiments of modifications with statistically distinguishable outputs; 3) the description of hyperparameters; 4) the experiments of determining the ensemble size of datasets and model optimal parameters; 5) the description of CESM code version, compset, resolution, etc.
The point-by-point replies follow.
1. Some conceptual clarifications are needed regarding the scope and the underlying assumptions of the proposed method. My understanding is that the work by Baker et al. (2015) and the follow-up studies cited as references in this manuscript aimed at assessing the mean climate simulated by ESMs like the CESM. Therein, Earth system modeling was viewed as a boundary condition problem even though the action taken to obtain the climate statistics was to numerically integrate a set of time evolution equations. The work by Milroy et al. (2018) and other studies that developed “ultra-fast” tests using short simulations aimed at early detections of potential changes in the long-term averages of the model state. Supposedly, the method proposed in this manuscript has the same goal, namely, issuing a "fail" result for climate-changing modifications in the source code or the computing environment and issuing a "pass" result for non-climate-changing situations. However, the discussion on acceptable and unacceptable initial perturbations in Section 3.3 gives the impression that a CESM simulation is considered a solution of an initial condition problem. In my understanding, initial perturbations on the order of 10E-6 K are unlikely to cause a change in the long-term climate unless the perturbations have some special features and hit some special parts of the code that eventually push the model climate into a different equilibrium. Along the line of the “unacceptable initial conditions”, if one conducts a set of CESM simulations using the unmodified code and in a trusted HPC system but using initial conditions written out on different days or different years of a previously performed trusted simulation, would this new set of simulations be given a “fail” by the proposed test? If the answer is “yes”, then does this mean the test results provided by the proposed method can have a very high false positive rate?
RE: Thank you for your comment. The new set of simulations with the unacceptable initial conditions will be given a "fail" by the proposed test, but the test results provided by the proposed method do not have a very high false positive rate. The CESM-ECT (Baker et al., 2015) aims to assess the simulated mean climate, whereas our method aims at detecting smaller-scale modifications using short simulations to determine whether or not the test simulations are statistically distinguishable from the original results. As far as our method is concerned, the response to modifications known to produce statistically distinguishable outputs should be a fail, and the response to modifications not expected to produce statistically distinguishable outputs should be a pass. Therefore, we have changed "climate-changing" to "statistically distinguishable output". Please see Sections 4.3 and 4.4. Thanks.
2. It will be very helpful to put the proposed method into the context of existing work and further explain what is new and advantageous.
The earlier papers by, e.g., Baker et al. (2015, 2016), Milroy et al. (2016, 2018), Mahajan et al. (2017, DOI: 10.1016/j.procs.2017.05.259), Mahajan et al. (2019, DOI: 10.1145/3324989.3325724) and Wan et al. (2017, DOI: 10.5194/gmd-10-537-2017) based their work on more traditional statistical methods and/or the theory of numerical analysis, while this work uses deep learning. What is the added value of using deep learning? How did the nonlinear and chaotic features of the Earth system affect the choice of deep learning methods?
RE: Thank you for your comment. PCA is a linear transformation method. It assumes linear relationships among variables, which hampers the application of PCA if the relationships are nonlinear. Deep learning is a nonlinear transformation method that can handle high-dimensional, nonlinear, and complex data. The simulation results of Earth system models have non-linear relationships, which can be analyzed using deep learning models. We have added a description of deep learning for analyzing the nonlinear features of Earth system model simulation results. Please see lines 82-84. Then, we show the performance of deep learning models in mining non-linear features from coupled models. We compare the accuracy and computation time on the CESM between the local outlier factor (LOF) machine learning method and the ESM-DCT. The accuracy of ESM-DCT, 98.6%, is slightly better than the accuracy of LOF, 96.2%, when the accuracy on the testing datasets reaches its maximum value. Also, the ESM-DCT has higher computational efficiency than the LOF. Please see Section 4.5. Thanks.
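To make the comparison above concrete, here is a small sketch contrasting a linear PCA reconstruction error with the LOF anomaly score; the stand-in data are random, and the ensemble sizes merely echo those quoted in the reply.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
train = rng.normal(size=(120, 97))    # hypothetical training ensemble summaries
test = rng.normal(size=(40, 97))      # hypothetical test ensemble

# Linear baseline: PCA reconstruction error per test member
pca = PCA(n_components=10).fit(train)
recon = pca.inverse_transform(pca.transform(test))
pca_err = ((test - recon) ** 2).mean(axis=1)

# Machine-learning baseline mentioned in the reply: local outlier factor
lof = LocalOutlierFactor(novelty=True).fit(train)
lof_score = -lof.score_samples(test)  # larger means more anomalous
```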
In Section 3.2, it is stated that bidirectional neural network models can better capture “context information”. What would be an example of “context information” when testing CESM simulations?
RE: We examine 97 variables from the atmosphere component results and 4 variables from the ocean component results. The global area-weighted mean is calculated for each variable, and the means are assembled into a 101-dimensional vector. The context information refers to the data of the 100 variables surrounding a given variable. The bidirectional neural network models can capture the context information of variables in the sequence data. We have added the description of the context information and sequence data. Please see lines 162-163, 314-316. Thanks.
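As a concrete illustration of how each variable becomes one entry of that vector, here is a minimal sketch of the global area-weighted mean; the grid shape and random data are hypothetical stand-ins for CESM output.

```python
import numpy as np

def global_area_weighted_mean(field, cell_area):
    """Collapse a 2-D (lat, lon) field to one scalar, weighting
    each grid cell by its area."""
    return np.average(field, weights=cell_area)

rng = np.random.default_rng(0)
area = rng.random((192, 288))                  # hypothetical cell areas
fields = rng.random((101, 192, 288))           # 101 model variables
vector = np.array([global_area_weighted_mean(f, area) for f in fields])
# 'vector' is the 101-dimensional sequence fed to the BGRU-AE
```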
Also, could there be a physics-based interpretation of the reconstruction error that is used as the key test metric in the proposed method?
RE: The reconstruction error is a mathematical indicator, calculated as the MSE between the test data and the original data after feature analysis, which determines whether or not the test simulations are statistically distinguishable from the original results. A physics-based interpretation of the reconstruction error would be a very good improvement for increasing the interpretability of the deep learning model. We will study the physics-based interpretation of the reconstruction error in our future work. We have added the plan of developing a deep learning-based consistency test approach with physics-based interpretation. Please see lines 442-443. Thanks.
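As a sketch of how such a reconstruction-error statistic could drive the pass/fail decision, the snippet below uses the 0.05 error threshold and 90% passing rate quoted elsewhere in this discussion; the toy MSE values are invented.

```python
import numpy as np

def consistency_test(recon_errors, err_threshold=0.05, pass_rate_threshold=0.9):
    """A member passes if its reconstruction error stays below
    err_threshold; the ensemble passes overall if the fraction of
    passing members reaches pass_rate_threshold (sketch only)."""
    pass_rate = (recon_errors < err_threshold).mean()
    verdict = "pass" if pass_rate >= pass_rate_threshold else "fail"
    return verdict, pass_rate

errors = np.array([0.01, 0.03, 0.02, 0.40])    # toy MSE values, 4 members
print(consistency_test(errors))                # -> ('fail', 0.75)
```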
It is stated in the introduction that the earlier studies by Baker et al. (2015, 2016) and Milroy et al. (2016, 2018) used homogeneous HPC systems for the CESM simulations while this work focuses on heterogeneous systems. Is it expected that the earlier methods are invalid for simulations performed on heterogeneous systems? In reverse, is it expected that the newly proposed method can work only for heterogeneous systems and not for homogeneous systems? If the answers to these two questions are "yes", then what is special/different about the heterogeneous systems that makes the earlier methods invalid? An explanation from the perspective of solving the equations underlying models like CESM would greatly help the readers understand the value of the proposed method. My impression from reading the manuscript, however, is that both the earlier methods and the method proposed in this manuscript should in principle be applicable to both homogeneous and heterogeneous HPC systems. If that's indeed the case, it will be useful to know whether the newly proposed method is particularly suitable for heterogeneous systems and why.
RE: Thank you for your comment. We develop a deep learning-based consistency test approach for Earth system models and apply this approach on heterogeneous many-core systems. There are few consistency tests on heterogeneous many-core systems, although the computational perturbations caused by heterogeneous hardware designs will not produce statistically distinguishable outputs by default. Our experiments point out that the response of consistency tests to heterogeneous computing should be a pass, i.e., the test should accept the influence of the heterogeneous computational perturbations. Our method addresses the issues arising from the presence of heterogeneous hardware architectures. Please see lines 104-111. Thanks.
3. More evidence is needed to demonstrate the trustworthiness of the proposed method. It is known that testing the consistency/correctness of ESM simulations is challenging, and that different methods can give the same or different answers to the pass-or-fail question; see, e.g., Milroy et al. (2018) for multiple examples and this link (https://e3sm.org/can-we-switch-computers-an-application-of-e3sm-climate-reproducibility-tests/) for a real-life application of a different set of methods. Assuming my understanding is correct that the proposed method is meant to be used for assessing the simulated long-term climate using very short simulations and the method is expected to be applicable to both homogeneous and heterogeneous multi-core HPC systems, I would suggest the following:
More evidence is needed to demonstrate that for cases which have been unambiguously identified in the literature to be climate-changing or non-climate-changing, the newly proposed method gives the same pass/fail results as reported in the literature. Admittedly, two examples in this category are presented in the manuscript, namely the value changes for the uncertain parameters c0_lnd and c0_ocean, but more examples would make the manuscript more convincing. (BTW, I suppose the "pass" listed in each row of the rightmost column of Table 5 should be "fail" instead.)
RE: Thanks for the good suggestion. We have added the experiments of modifications with statistically distinguishable outputs. We evaluate the consistency of simulations on the heterogeneous many-core systems after changing the parameters sol_factb_interstitial, sol_factic_interstitial, and cldfrc_rhminl. The results are shown in Table 6. Then, we have corrected the table; the results should be "fail". Please see Table 6. Thanks.
For the cases labeled with "unknown outcome", the manuscript should demonstrate that the proposed method (which uses CESM output after "24 time steps") gives the same pass/fail results as the conclusions drawn by an independent assessment either using a CESM-ECT method or using experts' judgement based on multi-year or multi-decade simulations. The current manuscript presents 9 cases of unknown outcome (Table 6). It is unclear whether the purpose of showing this many cases in the "unknown outcome" category is to discuss certain features of the proposed method or to demonstrate some applications of the method.
If there are cases where the pass or fail determined by the proposed method is different from the conclusion in the literature or an experts' judgement, then an explanation for the discrepancy will provide very useful guidance for potential users of the proposed method.
RE: Thanks for the good suggestion. The experiments with modifications of unknown outcome are the prediction application on the new Sunway system. After completing model training and testing, we expect that the ESM-DCT can help to predict new data with modifications of unknown outcome on the heterogeneous systems. The result shows that the effect of the -O3 compiler optimization option is positive, which provides the confidence to choose the level-three optimizations on the new Sunway system. Then, our tool can serve as a rapid method for detecting correctness in mixed-precision programming to help ESMs benefit from a reduction of the precision of certain variables on the heterogeneous many-core HPC systems. We have revised the descriptions of the prediction datasets. Please see lines 392-393. Thanks.
4. There are significant gaps in the description of the proposed method. The parameters listed in Table 3 are not defined in the text, and it is unclear how they enter the equations or algorithms of the deep learning model.
RE: Thanks for the good suggestion. We have added the description of hyperparameters listed in the table. Please see lines 321-324. Thanks.
The application of the method to the 5VCCM used a training ensemble of 151 members, a simulation length of 1000 time steps, a test ensemble size of 40, and a threshold passing rate of 90% for an overall “fail” (lines 140-146). How were these numbers determined?
RE: Thanks for the good suggestion. We have added the description of the parameters of the ensemble experiments to the text. For the ensemble sizes of the datasets, we run two simulations of 2000 time steps each: one with no initial condition perturbation and one with a perturbation of O(10^-14) to the initial conditions. Figure 5 shows that choosing 2000 time steps can provide sufficient variability of atmosphere and ocean variables to determine statistical distinguishability. Therefore, we use the simulation results at the 2000th time step as the input data of the ESM-DCT. Please see lines 235-240. Then, the accuracy on the testing datasets is used to adjust the ensemble size of the training sets and the model optimal parameters in the ESM-DCT. The accuracy on the testing datasets with different ensemble sizes is shown in Table 1. Following Table 1, the ensemble size of the training datasets of the ESM-DCT is 120, at which the accuracy on the testing datasets reaches its maximum value. Please see lines 241-249. Thanks.
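For illustration, here is a sketch of how such a perturbed ensemble could be generated; the field shape and the multiplicative form of the perturbation are assumptions, and only the O(10^-14) magnitude comes from the reply.

```python
import numpy as np

def perturbed_initial_conditions(base_field, n_members, eps=1e-14, seed=0):
    """Generate ensemble members by adding O(1e-14) relative random
    perturbations to a base initial-condition field (sketch)."""
    rng = np.random.default_rng(seed)
    return [base_field * (1.0 + eps * rng.uniform(-1, 1, base_field.shape))
            for _ in range(n_members)]

base = np.full((64, 128), 288.0)     # toy initial field, e.g. temperature [K]
members = perturbed_initial_conditions(base, n_members=120)
```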
The CESM simulations used output at “24 time steps” (are these atmosphere model time steps of 30 minutes each?) with the coupling to the ocean occurring 8 times per day.
RE: Thanks for the good suggestion. The time integration step size is 30 min. We have added the description of the parameters of the time integration step size into the text. Please see line 269. Thanks.
The ensemble sizes for training, validation, and testing were 120, 40, and 40, respectively. The threshold pass rate for issuing a "fail" was 90%. How were these numbers selected? If a reader is interested in applying the proposed method to a different ESM, which of the above-mentioned details need to be revised?
RE: Thank you for your comment. We select the ensemble size of the training sets at which the accuracy on the testing datasets reaches its maximum value. The ratio of training datasets, validation datasets, and testing datasets in the ESM-DCT is 6:2:2. The accuracy of the ESM-DCT with different ensemble sizes of training datasets is shown in Table 3. Following Table 3, the ensemble sizes of the training sets, validation sets, and testing sets are 120, 40, and 40. At this size, the accuracy on the test datasets with modifications known to produce statistically distinguishable and indistinguishable outputs in the BGRU-AE model reaches its maximum value of 98.6%. We have added the description of the parameters of the ensemble experiments to the text. If the passing rate is less than 90%, the tool issues an overall "fail"; this setting yields an accuracy of 98.6% on the test datasets with modifications known to produce statistically distinguishable and indistinguishable outputs. Please see lines 326-330. Thanks.
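A small sketch of the 6:2:2 split described above; the member count is chosen so the resulting sizes match the 120/40/40 quoted in the reply.

```python
import numpy as np

def split_ensemble(n_members, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly partition member indices into train/validation/test
    subsets according to the given ratios (sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_members)
    n_train = int(ratios[0] * n_members)
    n_val = int(ratios[1] * n_members)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_ensemble(200)
print(len(train_idx), len(val_idx), len(test_idx))   # -> 120 40 40
```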
5. The scientific presentation can benefit from significant improvements. The sequence in which contents are currently presented is awkward at many places. Here only two examples are mentioned, but hopefully a systematic review and revision can be done by the authors: The proof-of-concept presented using the 5VCCM from line 116 to line 151 should be moved to somewhere in Section 3 after the details of the new testing method have been described. The CESM code version, compset, resolution etc. should be clarified before the first mention of model time step, coupling time step etc.
RE: Thanks for the good suggestion. We have moved the description of experiments about 5VCCM to Section 3.4. Please see lines 212-249. Thanks. Then, we have added the description of CESM code version, compset, resolution. Please see lines 259-264. Thanks.
Again, I think the study is worthwhile, but a clearer and more compelling presentation is needed for publication in GMD.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We all agree that the comments are very constructive for us to improve the presentation of the MS, and all the major comments and points have been fully addressed in the revision.
Citation: https://doi.org/10.5194/gmd-2024-10-AC2
AC4: 'Reply on RC1', Yangyang Yu, 24 May 2024
This manuscript presents an effort to address the need to determine the correctness (i.e., consistency with trusted references) of Earth system model (ESM) simulations conducted in new computing environments. The methodology is based on deep learning, and the primary focus is the impact of using heterogeneous many-core computer systems. I have very limited knowledge in deep learning and computer hardware, and hence will leave it to the other referee(s) and the Handling Editor to assess the manuscript in those respects. My comments below are from the perspective of an Earth system model developer who has encountered the challenge of correctness/consistency testing in their own work.
Overall, I think the study has made a worthwhile attempt to address an important need in ESM development, and I look forward to new, efficient, and objective methods that help fulfill the need. The manuscript, on the other hand, needs substantial improvements to make the proposed method more convincing and the presentation easier to follow. Below are my major comments.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We all agree that the comments are very constructive for us to improve the presentation of the MS, and all the major comments and points have been fully addressed in the revision. Specifically, in the revision, we have added:
1) more detailed descriptions of the advantages of the deep learning method; 2) the experiments of modifications with statistically distinguishable outputs; 3) the descriptions of hyperparameters; 4) the experiments of determining the ensemble size of datasets and model optimal parameters; 5) the descriptions of CESM code version, compset, resolution, etc.
The point-by-point replies follow.
1. Some conceptual clarifications are needed regarding the scope and the underlying assumptions of the proposed method. My understanding is that the work by Baker et al. (2015) and the follow-up studies cited as references in this manuscript aimed at assessing the mean climate simulated by ESMs like the CESM. Therein, Earth system modeling was viewed as a boundary condition problem even though the action taken to obtain the climate statistics was to numerically integrate a set of time evolution equations. The work by Milroy et al. (2018) and other studies that developed “ultra-fast” tests using short simulations aimed at early detections of potential changes in the long-term averages of the model state. Supposedly, the method proposed in this manuscript has the same goal, namely, issuing a "fail" result for climate-changing modifications in the source code or the computing environment and issuing a "pass" result for non-climate-changing situations. However, the discussion on acceptable and unacceptable initial perturbations in Section 3.3 gives the impression that a CESM simulation is considered a solution of an initial condition problem. In my understanding, initial perturbations on the order of 10^-6 K are unlikely to cause a change in the long-term climate unless the perturbations have some special features and hit some special parts of the code that eventually push the model climate into a different equilibrium. Along the line of the “unacceptable initial conditions”, if one conducts a set of CESM simulations using the unmodified code and in a trusted HPC system but using initial conditions written out on different days or different years of a previously performed trusted simulation, would this new set of simulations be given a “fail” by the proposed test? If the answer is “yes”, then does this mean the test results provided by the proposed method can have a very high false positive rate?
RE: Thank you for your good comment. We have removed the testing datasets of “unacceptable initial conditions” and focused on early detection of potential changes in the long-term averages of the model state. Please see Section 4.4. Thanks.
2. It will be very helpful to put the proposed method into the context of existing work and further explain what is new and advantageous.
The earlier papers by, e.g., Baker et al. (2015, 2016), Milroy et al. (2016, 2018), Mahajan et al. (2017, DOI: 10.1016/j.procs.2017.05.259), Mahajan et al. (2019, DOI: 10.1145/3324989.3325724) and Wan et al. (2017, DOI: 10.5194/gmd-10-537-2017) based their work on more traditional statistical methods and/or the theory of numerical analysis, while this work uses deep learning. What is the added value of using deep learning? How did the nonlinear and chaotic features of the Earth system affect the choice of deep learning methods?
RE: Thank you for your comment. Unsupervised deep learning models have become fresh and widely used data mining methods and have been explored for anomaly detection. Such deep learning methods can efficiently and objectively identify whether or not data differ from the original status. Model verification and simulation result analysis are excellent application scenarios for deep learning. We have added the descriptions of the advantages of the deep learning method. Please see lines 75-78. Then, we show the performance of the deep learning models in mining non-linear features from coupled models. We compare the accuracy and computation time on the CESM between the local outlier factor (LOF) machine learning method, the CESM-ECT, and the ESM-DCT. The accuracy of ESM-DCT, 98.4%, is slightly better than the accuracy of LOF, 96.2%, and the accuracy of CESM-ECT, 98.2%, when the accuracy on the testing datasets reaches its maximum value. Also, the ESM-DCT has higher computational efficiency than the LOF and CESM-ECT. Please see Section 4.5. Thanks.
In Section 3.2, it is stated that bidirectional neural network models can better capture “context information”. What would be an example of “context information” when testing CESM simulations?
RE: We examine 97 variables from the atmosphere component. The global area-weighted mean is calculated for each variable, and the means are assembled into a 97-dimensional vector. The context information refers to the data of the 96 variables surrounding a given variable. The bidirectional neural network models can capture the context information of variables in the sequence data. We have added the descriptions of the context information and sequence data. Please see lines 271-274. Thanks.
Also, could there be a physics-based interpretation of the reconstruction error that is used as the key test metric in the proposed method?
RE: The reconstruction error is a mathematical indicator, calculated as the MSE between the test data and the original data after feature analysis, which determines whether or not the test simulations are statistically distinguishable from the original results. A physics-based interpretation of the reconstruction error would be a very good improvement for increasing the interpretability of the deep learning model. We will study the physics-based interpretation of the reconstruction error in our future work. We have added the plan of developing a deep learning-based consistency test approach with physics-based interpretation. Please see lines 395-396. Thanks.
It is stated in the introduction that the earlier studies by Baker et al. (2015, 2016) and Milroy et al. (2016, 2018) used homogeneous HPC systems for the CESM simulations while this work focuses on heterogeneous systems. Is it expected that the earlier methods are invalid for simulations performed on heterogeneous systems? In reverse, is it expected that the newly proposed method can work only for heterogeneous systems and not for homogeneous systems? If the answers to these two questions are "yes", then what is special/different about the heterogeneous systems that makes the earlier methods invalid? An explanation from the perspective of solving the equations underlying models like CESM would greatly help the readers understand the value of the proposed method. My impression from reading the manuscript, however, is that both the earlier methods and the method proposed in this manuscript should in principle be applicable to both homogeneous and heterogeneous HPC systems. If that's indeed the case, it will be useful to know whether the newly proposed method is particularly suitable for heterogeneous systems and why.
RE: Thank you for your comment. In this study, we focus on developing a deep learning-based consistency test approach for Earth system models. The uncertainties caused by heterogeneous hardware designs should be accepted when evaluating consistency. We adopt heterogeneous computing as one type of testing dataset, which is regarded as a modification known to produce statistically indistinguishable outputs. The test datasets of the other types all use homogeneous computing. We have revised the descriptions of the role of heterogeneous computing in this study. Please see lines 97-102. We also have revised the descriptions of the datasets. Please see Table 2. Thanks.
3. More evidence is needed to demonstrate the trustworthiness of the proposed method. It is known that testing the consistency/correctness of ESM simulations is challenging, and that different methods can give the same or different answers to the pass-or-fail question; see, e.g., Milroy et al. (2018) for multiple examples and this link (https://e3sm.org/can-we-switch-computers-an-application-of-e3sm-climate-reproducibility-tests/) for a real-life application of a different set of methods. Assuming my understanding is correct that the proposed method is meant to be used for assessing the simulated long-term climate using very short simulations and the method is expected to be applicable to both homogeneous and heterogeneous multi-core HPC systems, I would suggest the following:
More evidence is needed to demonstrate that for cases which have been unambiguously identified in the literature to be climate-changing or non-climate-changing, the newly proposed method gives the same pass/fail results as reported in the literature. Admittedly, two examples in this category are presented in the manuscript, namely the value changes for the uncertain parameters c0_lnd and c0_ocean, but more examples would make the manuscript more convincing. (BTW, I suppose the "pass" listed in each row of the rightmost column of Table 5 should be "fail" instead.)
RE: Thanks for the good suggestion. We have added the experiments of modifications with statistically distinguishable outputs. We evaluate the consistency of simulations on the HPC systems after changing the parameters sol_factb_interstitial, sol_factic_interstitial, and cldfrc_rhminl. The results are shown in Table 6. Then, we have corrected the table; the results should be "fail". Please see Table 6. Thanks.
For the cases labeled with "unknown outcome", the manuscript should demonstrate that the proposed method (which uses CESM output after "24 time steps") gives the same pass/fail results as the conclusions drawn by an independent assessment either using a CESM-ECT method or using experts' judgement based on multi-year or multi-decade simulations. The current manuscript presents 9 cases of unknown outcome (Table 6). It is unclear whether the purpose of showing this many cases in the "unknown outcome" category is to discuss certain features of the proposed method or to demonstrate some applications of the method.
If there are cases where the pass or fail determined by the proposed method is different from the conclusion in the literature or an experts' judgement, then an explanation for the discrepancy will provide very useful guidance for potential users of the proposed method.
RE: Thanks for the good suggestion. After the ESM-DCT tool is constructed, we expect that it can provide guidelines for predicting the consistency results of new ESM configurations to better understand and improve the tool. The prediction datasets are used to show the applications of the ESM-DCT. For example, the result shows that the effect of the -O3 compiler optimization option is positive, which provides the confidence to choose the level-three optimizations on the new Sunway system. Then, ESMs in mixed-precision programming must assess the simulation outputs to ensure that the results do not change the climate. Our tool can serve as a rapid method for detecting correctness in mixed-precision programming to help ESMs benefit from a reduction of the precision of certain variables on the HPC systems. We have revised the descriptions of the prediction datasets. Please see Section 4.6. Thanks.
4. There are significant gaps in the description of the proposed method. The parameters listed in Table 3 are not defined in the text, and it is unclear how they enter the equations or algorithms of the deep learning model.
RE: Thanks for the good suggestion. We have added the description of hyperparameters listed in the table. Please see lines 279-282. Thanks.
The application of the method to the 5VCCM used a training ensemble of 151 members, a simulation length of 1000 time steps, a test ensemble size of 40, and a threshold passing rate of 90% for an overall “fail” (lines 140-146). How were these numbers determined?
RE: Thanks for the good suggestion. We have added the description of the parameters of the ensemble experiments to the text. For the ensemble sizes of the datasets, we run two simulations of 2000 time steps each: one with no initial condition perturbation and one with a perturbation of O(10^-14) to the initial conditions. Figure 5 shows that choosing 2000 time steps can provide sufficient variability of atmosphere and ocean variables to determine statistical distinguishability. Therefore, we use the simulation results at the 2000th time step as the input data of the ESM-DCT for the 5VCCM. Please see lines 205-210. Then, the accuracy on the testing datasets is used to adjust the ensemble size of the training sets and the model optimal parameters in the ESM-DCT. The accuracy on the testing datasets with different ensemble sizes is shown in Table 1. Following Table 1, the ensemble size of the training datasets of the ESM-DCT for the 5VCCM is 120, at which the accuracy on the testing datasets reaches its maximum value. The ratio of training, validation, and testing datasets is also 6:2:2. Please see lines 211-219. Thanks.
The CESM simulations used output at “24 time steps” (are these atmosphere model time steps of 30 minutes each?) with the coupling to the ocean occurring 8 times per day.
RE: Thanks for the good suggestion. The time integration step size is 30 min. We have added the description of the parameters of the time integration step size into the text. Please see line 230. Thanks.
The ensemble sizes for training, validation, and testing were 120, 40, and 40, respectively. The threshold pass rate for issuing a "fail" was 90%. How were these numbers selected? If a reader is interested in applying the proposed method to a different ESM, which of the above-mentioned details need to be revised?
RE: Thank you for your comment. We select the ensemble size of the training sets at which the accuracy on the testing datasets reaches its maximum value. The ratio of training datasets, validation datasets, and testing datasets in the ESM-DCT is 6:2:2. The accuracy of the ESM-DCT for CESM with different ensemble sizes of training datasets is shown in Table 3. Following Table 3, the ensemble sizes of the training sets, validation sets, and testing sets are 120, 40, and 40. At this size, the accuracy on the test datasets with modifications known to produce statistically distinguishable and indistinguishable outputs in the BGRU-AE model reaches its maximum value of 98.4%. We have added the description of the parameters of the ensemble experiments to the text. If the passing rate is equal to 0%, the tool issues an overall "fail", which indicates that the simulation results of the training and testing datasets are statistically distinguishable. Please see lines 285-288. Thanks.
5. The scientific presentation can benefit from significant improvements. The sequence in which contents are currently presented is awkward at many places. Here only two examples are mentioned, but hopefully a systematic review and revision can be done by the authors: The proof-of-concept presented using the 5VCCM from line 116 to line 151 should be moved to somewhere in Section 3 after the details of the new testing method have been described. The CESM code version, compset, resolution etc. should be clarified before the first mention of model time step, coupling time step etc.
RE: Thanks for the good suggestion. We have moved the description of experiments about 5VCCM to Section 3.4. Please see lines 182-219. Thanks. Then, we have added the description of CESM code version, compset, resolution. Please see lines 227-231. Thanks.
Again, I think the study is worthwhile, but a clearer and more compelling presentation is needed for publication in GMD.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We all agree that the comments are very constructive for us to improve the presentation of the MS, and all the major comments and points have been fully addressed in the revision.
Citation: https://doi.org/10.5194/gmd-2024-10-AC4
RC2: 'Comment on gmd-2024-10', Anonymous Referee #2, 11 Mar 2024
A Deep Learning-Based Consistency Test Approach for Earth System Models on Heterogeneous Many-Core Systems
Yangyang Yu, Shaoqing Zhang, Haohuan Fu, Dexun Chen, Yang Gao, Xiaopei Lin, Zhao Liu, Xiaojing
General comments:
This manuscript presents an approach for evaluating software correctness for Earth System Models using an ensemble and deep learning.
1. The general idea is promising, but a lot of the algorithm decisions are not well supported. In general, very substantial improvements are needed for GMD. The authors need to justify unsubstantiated claims and statements that are not correct or misleading.
2. The paper needs a good editing pass. There are many grammatical errors and awkward phrases (a few of which I mention below).
3. Some of the information about other authors’ work is inaccurate and other related work has been left out (listed later in this report). The Introductory section, in particular, needs revision.
Specific comments:
1. Line 58: “Evaluating the scientific consistency is a commonly used method for model verification in the form of quality assurance.”
I am not sure that I agree this is a “commonly-used” method (consistency) - I don’t think this phrasing is found prior to the [Baker 2015]. Also I don’t believe the authors here ever define what they mean by consistency.
2. Line 59: “For example, for detecting the influences of hardware environment changes, data from a model simulation of several hundred years (typically 400) on the new machine is analyzed and compared to data from the same simulation on a trusted machine by climate scientists (Baker et al., 2015). Then, the CESM ensemble-based consistency test (CESM-ECT) is used to compare the new simulations against the control ensemble from the trusted machine (Baker et al., 2015; Milroy et al., 2016; Baker et al., 2016; Milroy et al., 2018).”
This description is not accurate. [Baker 2015] explains that the 400-year test is the approach that was used prior to the CESM-ECT development. In fact, [Baker 2015] states that it was the motivation for a more objective approach. The CESM-ECT does not involve the 400 years and is a separate thing.
3. Line 64: “2018). However, all the methods mentioned above focus on homogeneous multi-core HPC systems.”
This statement is not accurate. As far as I can tell, there is nothing specific to homogeneous machines in these works. In fact, [Milroy 2018] specifically mentions heterogeneous computing environments as a motivation for the work.
4. Line 71: “However, the ultra-fast tests in the CESM-ECT are applied for evaluating the scientific consistency on the Community Atmosphere Model ” … “There is a lack of a method to analyze short-time simulation results of multi-components”
The [Milroy 2018] work also tests CLM (land model). It explains how land and atmosphere are tightly coupled and CLM modifications can be detected in CAM variables. And gives experimental results.
Consider also that ocean scales are much larger than atmospheric scales, hence the need for more time for changes to propagate through the ocean model from a single perturbed variable.
5. Line 74: “Besides, the principal component analysis (PCA) in the CESM-ECT is just applied to exploring linear patterns contained in the confusing datasets”
This statement does not make sense to me. Please clarify what is meant. PCA detects changes in relationships between variables. They don’t have to be linear relationships for change to be detected.
6. Line 75: ”Facing with the non-linear relationship generated by the combination of multi-component data, there is an urgent need of data analysis methods for non-linear transformation, such as deep learning models.”
This statement is just unclear. The atmosphere model by itself is nonlinear and chaotic - so I’m not sure what is meant by multi-component combinations - is the implication that this is more non-linear? Please clarify. Also “facing with the …” is awkward phrasing.
7. Line 80: The phrase “unavoidable computational perturbations” is used frequently (also line 100, line 353), and it’s a bit awkward. It seems that it should be clarified that these are “numerical differences” due to using finite-precision and changing the order of operations and precision due to the changes in architecture. This is not really explained.
8. Line 81: “The ESM-DCT is applied to evaluate whether or not a new CESM configuration in the scenario of mixed perturbations composed of the inevitable computational perturbations and software or human errors in the heterogeneous computing is consistent with the original “trusted” configuration in the homogeneous computing.”
This phrase is unclear. Aren’t you trying to figure out (i.e. differentiate) whether a difference in output is due to numerical round-off OR software/human error?
9. Line 103: “The key challenge is designing a tool to evaluate the scientific consistency, which can remove the influences of heterogeneous perturbations”
The authors have definitely implied that the CESM-ECT tools do not do this, which is untrue. This new method/approach is still a valid contribution, and there is no need to misrepresent other previous work.
10. Line 108: “However, the principal component analysis (PCA) in the CESM-ECT is just applied to exploring linear patterns contained in the confusing datasets (Liu et al., 2009). Facing with the non-linear relationship generated by the combination of multi-component data in the coupled numerical model, …”
This is the exact same phrase as in line 74 and it still does not make sense (see above comment).
11. Line 116: I am confused by the last bit of section 2, regarding the 5-variable conceptual coupled model and its results. It seems that the purpose is to show that the DCT approach is better than the ECT approach. This is quite hard to judge on the information given here. Most choices are not justified. Note that there is little point in a PCA-approach with only 5 variables. (Which is why the CESM-ECT for ocean does not use PCA. Though the authors cite that work, it is unclear that they are familiar with it. Also the ocean variables are not globally averaged.) Also why the choice of 151 ensemble members for 1000 timesteps? [Baker 2015] uses 151 timesteps for yearly averages. [Milroy 2018] shows that more are needed for shorter time scales. Why does the DCT look at 40? Are you aiming for some false positive rate? The choices here seem quite arbitrary, and it’s unclear why this “experiment” is in the background section. [Say more?]
12. Line 212: “Based on the ultra-fast tests”
Which ultra-fast tests are you referring to? Also I am skeptical about the robustness of bugs being detected in only a couple time steps for the ocean.
13. Line 216 - 217: I find it odd to simply increase the frequency of the ocn/atmosphere coupling. This change affects the nature of the model and model output and certainly needs justification as to why it is acceptable. Also how did you choose 24 time steps as an appropriate time slice to evaluate?
14. Line 236: “...clearly inconsistent…”. I don’t believe that there has been a definition provided as to what is meant by consistency for this new approach.
15. Lines 236-237: “the testing datasets with unacceptable CESM model parameter adjustments are with the O(10^-14) perturbations of initial atmospheric temperature”. I don’t understand this - why is O(10^-14) unacceptable?
16. Line 246: Why the decision to do a spatial mean for the ocean variables? It seems that this could be problematic given the much higher spatial variability in the ocean compared to the atmosphere.
17. Line 258 ”.. because the variables have vastly different units and magnitudes.”
Matches the phrase used in [Baker 2015]: “..because the CAM variables have vastly different units and magnitudes.”
18. Line 26: “Software correctness verification in the form of quality assurance”
Line 59: “verification in the form of quality assurance”
Line 104: “for software verification in the form of quality assurance”
This particular phrase comes from the abstract of [Baker 2015]: “software verification in the form of quality assurance”. Consider rephrasing or quoting.
19. Line 268: Why use CESM version 1.3? That is quite old and not an official release: https://www.cesm.ucar.edu/models/releases
20. Line 310: What are c0_lnd and c0_ocn? Why did you choose them? (There are a couple of DOE-authored studies with CAM5 that mention these for the ZM scheme, but none are cited here.) Also, I know these are later coarsely defined in Table 5, but more info is needed when they are mentioned. How did you pick the parameter values in Table 5?
21. Line 315: What is a “mixed perturbation”? I think you mean something like in line 320 that implies two types of perturbations are problematic: “The results show that the tool can detect the climate changing modifications when taking hardware-related perturbations into account.” I think this line of reasoning (where the authors focus on perturbations from 2 sources somehow being trickier) is flawed. Either the output of two different runs is consistent or it is not. The source of the difference is not easily quantified, particularly with compiler and hardware changes. If you run on a different machine with a different compiler then your trusted machine is that a mixed perturbation? I don’t see that that distinction matters (and such experiments were done in [Baker 2015].) I do agree that the way an ensemble spread is created affects the variability of the distribution (e.g., see [Milroy 16] ), and that's an interesting question, but the authors here have not delved into that at all.
22. Figure 9: why compare the output between the O(10^-6) tests and the climate-changing tests if the O(10^-6) is inconsistent? Don’t you want to compare to the accepted one to know how far off you are? I am clearly missing something.
23. Figures 7 -11: Why is the reconstruction error threshold .05? Did this somehow come from the second paragraph in 3.4?
24. Table 5: What does it mean that the passing rate is 0% and the test overall passes? That seems wrong.
25. Section 4.4: The way the beginning of this section reads, with “For example, the effect of -O3 compiler optimization option was not known, because the CESM code base is large and level-three optimizations can be quite aggressive (Baker et al., 2015). We input the simulation results of the heterogeneous version of CESM on the new Sunway system into the ESM-DCT with -O3 compiler optimization option.“ implies that we don't know if O3 will pass or not. I checked [Baker 2015] and there they show that O3 does pass, so I am not sure why it is in the unknown outcomes section.
26. Table 6: Was the O3 optimization only done for the ZM subroutine (as the table says)? If so, this needs to be noted in the paper text.
27. Line 329: “..which provides the references for the porting and optimization“ - what does this mean?
28. Line: 341: “Then, our tool can detect sensitivity of input parameters, which is excluded to the input parameter list provided by the climate scientist thought to affect the climate in a non-trivial manner“
The meaning of this sentence is unclear.
29. Table 6: For the “changes in model parameter” portion, line 345 says: “The result shows that ke, vdc_eq, and vdc_psim variables are not sensitive to the configuration in this study, while the value of the vdc1 variables should not be changed.”
This conclusion is a bit strong. I don’t think the results indicate that vdc1 should not be changed. The results just show that the one value tested fails. Maybe a smaller change would pass? The reverse is true for those that passed. Those variables may be sensitive to a larger change in value. I don’t know how/why these particular values were chosen.
30. Line 355: “..form a mixed perturbation environment...” Again, I don’t think there is anything specifically meaningful to this “mixed perturbation’ terminology. The whole earth system model is chaotic and it is affected by hardware/software stack, roundoff error, truncation error, etc.
31. Other related work on climate model correctness that is not cited in this paper:
Mahajan et al., “Ensuring statistical reproducibility of ocean model simulations in the age of hybrid computing”, 2021.
Massonnet et al., “Replicability of the EC-Earth3 Earth system model under a change in computing environment”, 2020.
Mahajan et al., “A multivariate approach to ensure statistical reproducibility of climate model simulations”, 2019.
Mahajan et al., “Exploring an Ensemble-Based Approach to Atmospheric Climate Modeling and Testing at Scale”, 2017.
Wan, H., et al., “A new and inexpensive non-bit-for-bit solution reproducibility test based on time step convergence (TSC1.0)”, 2017.
Technical corrections:
1. Line 32: “that facing with the” is awkward, please rephrase. (This also occurs in line 75, 108, …)
2. Line 98: MPEs are not defined.
3. Line 234: “probability density function (PDF) of the CESM” . I assume the intention was of a variable, not the CESM model
Citation: https://doi.org/10.5194/gmd-2024-10-RC2
-
AC3: 'Reply on RC2', Yangyang Yu, 23 Apr 2024
General comments:
This manuscript presents an approach for evaluating software correctness for Earth System Models using an ensemble and deep learning.
- The general idea is promising, but a lot of the algorithm decisions are not well supported. In general, very substantial improvements are needed for GMD. The authors need to justify unsubstantiated claims and statements that are not correct or misleading.
- The paper needs a good editing pass. There are many grammatical errors and awkward phrases (a few of which I mention below).
- Some of the information about other authors’ work is inaccurate and other related work has been left out (listed later in this report). The Introductory section, in particular, needs revision.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We all agree that the comments are very constructive for us to improve the presentation of the MS, and all the major comments and points have been fully addressed in the revision. Specifically, in the revision, we have revised the Introductory section and added: 1) the necessity of developing a consistency test approach on heterogeneous many-core systems; 2) the descriptions of heterogeneous computational perturbations; 3) the experiments of determining the ensemble size of training sets and model parameters; 4) the experiments of determining the model time step; 5) the description of the CESM version, etc.
Specific comments:
1. Line 58: “Evaluating the scientific consistency is a commonly used method for model verification in the form of quality assurance.”
I am not sure that I agree this is a “commonly-used” method (consistency) - I don’t think this phrasing is found prior to the [Baker 2015]. Also I don’t believe the authors here ever define what they mean by consistency.
RE: Thank you for your comments. We have revised the sentences. It should be “For detecting the influences of hardware environment changes, a historical method is that data from a model simulation of several hundred years (typically 400) on the new machine is analyzed and compared to data from the same simulation on a trusted machine by climate scientists (Baker et al., 2015). Then, many ensemble-based consistency evaluation methods are used to compare the new simulations against the control ensemble from the trusted machine.” Please see lines 64-68. Thanks.
2. Line 59: “For example, for detecting the influences of hardware environment changes, data from a model simulation of several hundred years (typically 400) on the new machine is analyzed and compared to data from the same simulation on a trusted machine by climate scientists (Baker et al., 2015). Then, the CESM ensemble-based consistency test (CESM-ECT) is used to compare the new simulations against the control ensemble from the trusted machine (Baker et al., 2015; Milroy et al., 2016; Baker et al., 2016; Milroy et al., 2018).
This description is not accurate. [Baker 2015] explains that the 400-year test is the approach that was used prior to the CESM-ECT development. In fact, [Baker 2015] states that it was the motivation for a more objective approach. The CESM-ECT does not involve the 400 years and is a separate thing.
RE: Thank you for your comments. For detecting the influences of hardware environment changes, a historical method is that data from a model simulation of several hundred years (typically 400) on the new machine is analyzed and compared to data from the same simulation on a trusted machine by climate scientists (Baker et al., 2015). Then, many ensemble-based consistency evaluation methods are used to compare the new simulations against the control ensemble from the trusted machine. Please see lines 64-68. Thanks.
3. Line 64: “2018). However, all the methods mentioned above focus on homogeneous multi-core HPC systems.”
This statement is not accurate. As far as I can tell, there is nothing specific to homogeneous machines in these works. In fact, [Milroy 2018] specifically mentions heterogeneous computing environments as a motivation for the work.
RE: [Milroy 2018] developed an ultra-fast statistical consistency test and demonstrated that adequate ensemble variability is achieved with instantaneous variable values at the ninth step, despite rapid perturbation growth and heterogeneous variable spread. However, there is a lack of experiments evaluating the consistency on heterogeneous many-core systems, although the CESM-ECT has the capability to determine consistency without bit-for-bit results. For the heterogeneous many-core systems, there are hardware differences between the general-purpose cores and accelerator cores. There can exist computational perturbations caused by the hardware designs, which should be accepted for further detection of software or human errors generated in optimizing and developing the ESMs. We have added the necessity of developing a consistency test approach on heterogeneous many-core systems. Please see lines 104-111. Thanks.
4. Line 71: “However, the ultra-fast tests in the CESM-ECT are applied for evaluating the scientific consistency on the Community Atmosphere Model ” … “There is a lack of a method to analyze short-time simulation results of multi-components”
The [Milroy 2018] work also tests CLM (land model). It explains how land and atmosphere are tightly coupled and CLM modifications can be detected in CAM variables. And gives experimental results.
Consider also that ocean scales are much larger than atmospheric scales, hence the need for more time for changes to propagate through the ocean model from a single perturbed variable.
RE: Thank you for your good comments. We have added the experiments of the effects on the atmosphere and ocean variables of initial atmosphere temperature perturbations over 24 time steps, as shown in Figure 6. Figure 6 demonstrates sensitive dependence on initial conditions in the atmosphere and ocean variables and suggests that choosing a small number of time steps may provide sufficient variability to determine statistical distinguishability resulting from significant changes. We have revised the descriptions of ultra-fast tests in the CESM-ECT. Please see lines 79-82. Thanks.
5. Line 74: “Besides, the principal component analysis (PCA) in the CESM-ECT is just applied to exploring linear patterns contained in the confusing datasets”
This statement does not make sense to me. Please clarify what is meant. PCA detects changes in relationships between variables. They don’t have to be linear relationships for change to be detected.
RE: PCA is a linear transformation method. It assumes linear relationships among the variables, which hampers the application of PCA when the relationships are nonlinear. We have added the descriptions of PCA. Please see lines 82-84. Thanks.
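To make the limitation concrete, here is a minimal Python sketch (ours, for illustration only, not code from the manuscript): a one-component PCA cannot reconstruct data lying on a curved manifold, whereas a nonlinear autoencoder with a single latent unit can in principle follow the curve.

```python
# Illustrative only: PCA is restricted to linear projections, so one principal
# component cannot capture an intrinsically 1-D but curved data manifold.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=1000)
X = np.column_stack([t, t**2])       # points on a parabola: intrinsically 1-D

pca = PCA(n_components=1).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))
print("PCA(1) reconstruction MSE:", np.mean((X - X_rec) ** 2))
# A nonlinear autoencoder with one latent unit (the idea behind BGRU-AE) can
# drive this error far lower, because its activation functions can bend the
# latent coordinate along the manifold.
```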
6. Line 75: ”Facing with the non-linear relationship generated by the combination of multi-component data, there is an urgent need of data analysis methods for non-linear transformation, such as deep learning models.”
This statement is just unclear. The atmosphere model by itself is nonlinear and chaotic - so I’m not sure what is meant by multi-component combinations - is the implication that this is more non-linear? Please clarify. Also “facing with the …” is awkward phrasing.
RE: Thank you for your comments. We have revised the sentence. It should be “For the non-linear relationship of ESM outputs, there is an urgent need of data analysis methods for non-linear transformation, such as deep learning models.” Please see lines 84-85. Thanks.
7. Line 80: The phrase “unavoidable computational perturbations” is used frequently (also line 100, line 353), and it’s a bit awkward. It seems that it should be clarified that these are “numerical differences” due to using finite-precision and changing the order of operations and precision due to the changes in architecture. This is not really explained.
RE: Thank you for your suggestion. We have revised the description of computational perturbations. It should be “the uncertainties caused by heterogeneous hardware designs”. We have added the descriptions of heterogeneous computational perturbations. Please see lines 129-131, 136-137, 256-257. Thanks.
8. Line 81: “The ESM-DCT is applied to evaluate whether or not a new CESM configuration in the scenario of mixed perturbations composed of the inevitable computational perturbations and software or human errors in the heterogeneous computing is consistent with the original “trusted” configuration in the homogeneous computing.”
This phrase is unclear. Aren’t you trying to figure out (i.e. differentiate) whether a difference in output is due to numerical round-off OR software/human error?
RE: There are the differences of hardware design between the general-purpose cores and accelerator cores in the heterogeneous many-core architectures. Compared with homogeneous computing using the general-purpose cores only, heterogeneous computing can cause nonidentical floating-point outputs whenever an accelerator core is involved. The uncertainties generated by heterogeneous hardware designs can blend with software or human errors, which can affect the accuracy of the model verification. The uncertainties generated by heterogeneous hardware designs should be accepted for further detection of software or human errors generated in optimizing and developing the ESMs. We have revised the sentence. Please see lines 55-61. Thanks.
9. Line 103: “The key challenge is designing a tool to evaluate the scientific consistency, which can remove the influences of heterogeneous perturbations”
The authors have definitely implied that the CESM-ECT tools do not do this, which is untrue. This new method/approach is still a valid contribution, and there is no need to misrepresent other previous work.
RE: Thank you for your comment. It has been changed to “We develop a consistency test approach which addresses the issues as the presence of heterogeneous architecture hardware”. Please see lines 110-111. Thanks.
10. Line 108: “However, the principal component analysis (PCA) in the CESM-ECT is just applied to exploring linear patterns contained in the confusing datasets (Liu et al., 2009). Facing with the non-linear relationship generated by the combination of multi-component data in the coupled numerical model, …”
This is the exact same phrase as in line 74 and it still does not make sense (see above comment).
RE: Thank you for your comment. We have revised the sentence. It should be “However, the PCA in the CESM-ECT is a linear transformation method. It assumed that there are linear relationships of variables, which hamper the application of PCA if the relationships are nonlinear (Liu et al., 2009). For the non-linear relationship of ESM outputs, there can use deep learning models to handle data by embedding multiple non-linear activation functions.” Please see lines 118-121.
11. Line 116: I am confused by the last bit of section 2, regarding the 5-variable conceptual coupled model and its results. It seems that the purpose is to show that the DCT approach is better than the ECT approach. This is quite hard to judge on the information given here. Most choices are not justified. Note that there is little point in a PCA-approach with only 5 variables. (Which is why the CESM-ECT for ocean does not use PCA. Though the authors cite that work, it is unclear that they are familiar with it. Also the ocean variables are not globally averaged.)
RE: Thank you for your comment. In this study, we start from the 5VCCM to develop the ESM-DCT. The 5VCCM is a simple nonlinear coupled model; the experimental result is an example and is only used as the first step in testing the ESM-DCT tool. We have added the descriptions of the function of the 5VCCM. Please see lines 212-249.
Also why the choice of 151 ensemble members for 1000 timesteps? [Baker 2015] uses 151 ensemble members for yearly averages. [Milroy 2018] shows that more are needed for shorter time scales. Why does the DCT look at 40? Are you aiming for some false positive rate? The choices here seem quite arbitrary, and it’s unclear why this “experiment” is in the background section. [Say more?]
RE: Thank you for your comment. We show the performance of deep learning models in mining non-linear features from coupled models. First, we run two simulations of 2000 time steps each: one with no initial condition perturbation and one with a perturbation of O(10−14) to the initial conditions. Figure 5 demonstrates the sensitive dependence of the 5VCCM variables on the initial conditions. Figure 5 shows that choosing 2000 time steps can provide sufficient variability of atmosphere and ocean variables to determine statistical distinguishability. Therefore, we use the simulation results at the 2000th time step as the input data. Then, we select the ensemble size of the training sets and the optimal model parameters as those for which the accuracy on the testing datasets reaches its maximum value. We have added the experiments of determining the ensemble size of training sets and model parameters. Please see lines 235-249.
12. Line 212: “Based on the ultra-fast tests”
Which ultra-fast tests are you referring to? Also I am skeptical about the robustness of bugs being detected in only a couple time steps for the ocean.
RE: Thank you for your comment. In this study, in order to quickly spread the perturbations, we modify the coupling frequency of the ocean component to 8 times a day. We run two simulations of 24 time steps each: one with no initial condition perturbation and one with a perturbation of O(10−14) to the initial atmosphere temperature. Figure 6 shows that choosing a small number of time steps and modifying the ocean coupling frequency setting can provide sufficient variability of atmosphere and ocean variables to determine statistical distinguishability. Therefore, we use the simulation results at the 24th time step as the input data of the BGRU-AE model, where the ocean component is coupled 4 times. We have added the description of the parameters of 24 time steps and coupling frequency of the ocean component. Please see lines 265-279. Thanks.
13. Line 216 - 217: I find it odd to simply increase the frequency of the ocn/atmosphere coupling. This change affects the nature of the model and model output and certainly needs justification as to why it is acceptable. Also how did you choose 24 time steps as an appropriate time slice to evaluate?
RE: Thank you for your comment. Our method aims at determining whether or not the test dataset simulation is statistically distinguishable from the original results. As far as our method is concerned, the response to modifications known to produce statistically distinguishable outputs should be a fail, and the response to modifications not expected to produce statistically distinguishable outputs should be a pass. Then, in order to quickly spread the perturbations, we modify the coupling frequency of the ocean component to 8 times a day. We run two simulations of 24 time steps each: one with no initial condition perturbation and one with a perturbation of O(10−14) to the initial atmosphere temperature. Figure 6 shows that choosing a small number of time steps and modifying the ocean coupling frequency setting can provide sufficient variability of atmosphere and ocean variables to determine statistical distinguishability. Therefore, we use the simulation results at the 24th time step as the input data of the BGRU-AE model, where the ocean component is coupled 4 times. We have added the description of the parameters of 24 time steps and coupling frequency of the ocean component. Please see lines 265-279. Thanks.
14. Line 236: “...clearly inconsistent…”. I don’t believe that there has been a definition provided as to what is meant by consistency for this new approach.
RE: Thank you for your comment. We expect to use the labeled datasets to detect the generalization performance of the BGRU-AE model. The testing datasets whose PDF is clearly distinguishable with that of the O(10-14) initial perturbations are tagged as unacceptable initial perturbations. We revised the sentences. It should be “Therefore, the testing datasets tagged as unacceptable initial perturbations are the ensembles with the O(10-6) perturbations of initial atmosphere temperature, whose PDF is clearly distinguishable with that of the O(10-14) initial perturbations.” Please see lines 295-297.
15. Lines 236-237: “the testing datasets with unacceptable CESM model parameter adjustments are with the O(10-14) perturbations of initial atmospheric temperature”. I don’t understand this - why is O(10-14) unacceptable?
RE: Thank you for your comment. We define the testing datasets with unacceptable CESM model parameters as those known to produce statistically distinguishable outputs from the training datasets, but we control these testing datasets so that they are the ensembles with the O(10-14) perturbations of initial atmosphere temperature. We have added a description of the testing datasets with unacceptable CESM model parameters. Please see lines 297-298.
16. Line 246: Why the decision to do a spatial mean for the ocean variables? It seems that this could be problematic given the much higher spatial variability in the ocean compared to the atmosphere.
RE: We expect to analyze short-time simulation results of atmosphere and ocean components, which can achieve an overall consistency evaluation of ESM rapidly. The spatial mean for the atmosphere and ocean variables is an approach to data preprocessing, which can align the dimensions of atmosphere and ocean data. This method does indeed lose spatial information but adds the oceanic information for the rapid consistency evaluation of the entire ESM. Then, to analyze the spatial features of the atmosphere and ocean data, we will focus on refining convolutional neural networks and attention mechanisms within the deep-learning model architecture in the follow-up studies. Please see lines 439-442. Thanks.
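As an illustration of this preprocessing step, here is a minimal sketch (field names and grid are hypothetical, not the authors' code) of the area-weighted spatial mean that collapses each 2-D field to one scalar so that atmosphere and ocean variables share the same feature dimension.

```python
# Hypothetical illustration of the spatial-mean preprocessing described above.
import numpy as np

def area_weighted_mean(field, lat):
    """field: (nlat, nlon) array; lat: (nlat,) latitudes in degrees."""
    w = np.cos(np.deg2rad(lat))              # grid-cell area ~ cos(latitude)
    return np.average(field.mean(axis=1), weights=w)

lat = np.linspace(-89.5, 89.5, 180)
sst = np.random.rand(180, 360)               # stand-in for an ocean field
t2m = np.random.rand(180, 360)               # stand-in for an atmosphere field
features = np.array([area_weighted_mean(sst, lat),
                     area_weighted_mean(t2m, lat)])   # one scalar per variable
```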
17. Line 258: “...because the variables have vastly different units and magnitudes.”
Matches the phrase used in [Baker 2015]: “...because the CAM variables have vastly different units and magnitudes.”
RE: Thank you for your suggestion. We have cited the paper of Baker et al., 2015. Please see line 314-316. Thanks.
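For concreteness, a hedged sketch of the standardization under discussion, assuming (as in Baker et al., 2015) that each variable is rescaled to zero mean and unit variance using statistics of the training ensemble only:

```python
# Assumed preprocessing sketch; the exact scaling used by the authors may differ.
import numpy as np

def standardize(train, test):
    """train, test: (n_members, n_variables) ensembles of global means."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    sigma[sigma == 0.0] = 1.0        # guard variables with no variance
    return (train - mu) / sigma, (test - mu) / sigma
```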
18. Line 26: “Software correctness verification in the form of quality assurance”
Line 59: “verification in the form of quality assurance”
Line 104: “for software verification in the form of quality assurance”
This particular phrase comes from the abstract of [Baker 2015]: “software verification in the form of quality assurance”. Consider rephrasing or quoting.
RE: Thank you for your suggestion. We have revised the sentences and cited the paper of Baker et al., 2015. Please see line 25-27, 112-113. Thanks.
19. Line 268: Why using CESM version 1.3? That is quite old and not an official release: https://www.cesm.ucar.edu/models/releases
RE: Thank you for your comment. The CESM version used in the present study is the CESM1.3-beta17_sehires38, which is applied to the Sunway system and described in Zhang et al. (2020). We added the description of the CESM version. Please see lines 259-260. Thanks.
20. Line 310: What are c0_lnd and c0_ocn? Why did you choose them? (There are a couple of DOE-authored studies with CAM5 that mention these for the ZM scheme, but none are cited here.) Also, I know these are later coarsely defined in Table 5, but more info is needed when they are mentioned. How did you pick the parameter values in Table 5?
RE: Thank you for your comment. Climate scientists provided a list of CAM input parameters thought to affect the climate in a non-trivial manner, which is used to detect changes to the simulation results that are known to produce statistically distinguishable outputs in the CESM-ECT (Baker et al., 2015), such as zm_c0_lnd, zm_c0_ocn, sol_factb_interstitial, sol_factic_interstitial, and cldfrc_rhminh. Our tool must successfully detect the inconsistency of the simulation results that are known to produce statistically distinguishable outputs. Therefore, we modify the values of input parameters in the atmosphere components and then test whether or not our tool can detect the inconsistency caused by the model parameter changes using the ESM-DCT in the heterogeneous computing. We have added the descriptions of experiments of modifications to produce statistically distinguishable outputs. Please see lines 356-365. Thanks.
21. Line 315: What is a “mixed perturbation”? I think you mean something like in line 320 that implies two types of perturbations are problematic: “The results show that the tool can detect the climate changing modifications when taking hardware-related perturbations into account.” I think this line of reasoning (where the authors focus on perturbations from 2 sources somehow being trickier) is flawed. Either the output of two different runs is consistent or it is not. The source of the difference is not easily quantified, particularly with compiler and hardware changes. If you run on a different machine with a different compiler than your trusted machine, is that a mixed perturbation? I don’t see that that distinction matters (and such experiments were done in [Baker 2015].) I do agree that the way an ensemble spread is created affects the variability of the distribution (e.g., see [Milroy 16]), and that's an interesting question, but the authors here have not delved into that at all.
RE: Thank you for your comment. The ESM-DCT is used for detecting the existence of software or human errors when taking hardware-related perturbations into account on the heterogeneous many-core systems. Therefore, the mixed perturbations refer to the additional uncertainty caused by heterogeneous hardware designs and software or human changes (such as compiler optimization option and model input parameter changes). The coexistence of general-purpose cores and accelerator cores, which usually employ different hardware architectures, can lead to bit-level differences, especially when we try to maximize the performance on both kinds of cores. Such differences further lead to computational perturbations through temporal integration, which can blend with software or human errors. We have removed the descriptions of the mixed perturbations. Please see lines 372-373. Thanks.
22. Figure 9: why compare the output between the O(10-6) tests and the climate-changing tests if the O(10-6) is inconsistent? Don't you want to compare to the accepted one to know how far off you are? I am clearly missing something.
RE: In this study, our tool must successfully detect modifications to the simulation results that are known to produce statistically distinguishable outputs from the training datasets. Besides the unacceptable CESM model parameter adjustments listed by climate scientists (Baker et al., 2015), we do the experiments of unacceptable initial perturbations to produce statistically distinguishable outputs. Following Figure 7, the testing datasets tagged as unacceptable initial perturbations are the ensembles with the O(10-6) perturbations of initial atmosphere temperature, whose PDF is clearly distinguishable from that of the O(10-14) initial perturbations. Then, we use the tagged testing datasets to detect the generalization performance of ESM-DCT and adjust the ensemble size of training datasets. We have added the descriptions of testing datasets. Please see lines 293-297. Thanks.
23. Figures 7-11: Why is the reconstruction error threshold 0.05? Did this somehow come from the second paragraph in 3.4?
RE: Thank you for your comment. We calculate the reconstruction errors after re-inputting the training datasets into the saved BGRU-AE model. The PDF of the reconstruction errors of the training datasets is shown in Figure 9. Following Figure 9, the PDF of the reconstruction errors of the training datasets is represented by the blue line. The threshold of the reconstruction errors is 0.05, represented by the red line. We have added the descriptions of the calculation method of the reconstruction errors. Please see lines 205-211.
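The thresholding logic described here can be sketched as follows (our reconstruction of the procedure, not the authors' released code; `model.predict` is a hypothetical autoencoder interface):

```python
# Assumed pass/fail logic: a run passes if its reconstruction error stays
# below a cutoff fixed from the trusted training ensemble (e.g. ~0.05).
import numpy as np

def reconstruction_errors(model, X):
    """model: trained autoencoder exposing predict(X) -> X_hat; X: (n, d)."""
    X_hat = model.predict(X)
    return np.mean((X - X_hat) ** 2, axis=1)     # per-member mean squared error

def consistency_test(model, X_train, x_new, quantile=0.99):
    errs = reconstruction_errors(model, X_train)
    threshold = np.quantile(errs, quantile)      # cutoff from training errors
    return reconstruction_errors(model, x_new[None, :])[0] <= threshold
```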
24. Table 5: What does it mean that the passing rate is 0% and the test overall passes? That seems wrong.
RE: Thank you for your comment. It should be “Failure”. We have revised the table. Please see Table 7.
25. Section 4.4: The way the beginning of this section reads, with “For example, the effect of -O3 compiler optimization option was not known, because the CESM code base is large and level-three optimizations can be quite aggressive (Baker et al., 2015). We input the simulation results of the heterogeneous version of CESM on the new Sunway system into the ESM-DCT with -O3 compiler optimization option.“ implies that we don't know if O3 will pass or not. I checked [Baker 2015] and there they show that O3 does pass, so I am not sure why it is in the unknown outcomes section.
RE: Thank you for your comment. The CESM code base is large and level-three optimizations can be quite aggressive. In the results of CESM-ECT [Baker 2015], INTEL13-O3 and INTEL14-O3 get the “pass”, but INTEL15-O3 gets the “failure”. Different compilers have different optimization algorithms and strategies for codes, so the outputs can be different. We expect that the ESM-DCT can help to predict new data with unknown outcome modifications on the heterogeneous systems. The prediction datasets are simulation results of the heterogeneous version of CESM which is with -O3 compiler optimization option for all codes on the new Sunway system. The computing environments for CESM have been changed, so the consistency needs to be reassessed. We have added the descriptions of -O3 compiler optimization options. Please see lines 393-395. Thanks.
26. Table 6: Was the O3 optimization only done for the ZM subroutine (as the table says)? If so, this needs to be noted in the paper text.
RE: Thank you for your suggestion. The prediction datasets are simulation results of the heterogeneous version of CESM which is with -O3 compiler optimization option for all codes on the new Sunway system. We have added the descriptions of prediction datasets with -O3 compiler optimization options. Please see lines 395-397. Thanks.
27. Line 329: “..which provides the references for the porting and optimization“ - what does this mean?
RE: Thank you for your comment. The result shows that the effect of -O3 compiler optimization option in the new Sunway system is positive, which provides the confidence to choose the level-three optimizations on the new Sunway system. We have revised the sentence. Please see lines 399-400. Thanks.
28. Line: 341: “Then, our tool can detect sensitivity of input parameters, which is excluded to the input parameter list provided by the climate scientist thought to affect the climate in a non-trivial manner“
The meaning of this sentence is unclear.
RE: Thank you for your comment. The input parameters used in Section 4.4 are not among the parameters listed by climate scientists in [Baker et al., 2015]. In this study, the experiments of unknown outcomes modifications are the prediction application on the new Sunway system. The experiments of unknown outcomes modifications about input parameters are needed for further research. Only some test cases are shown here. In the follow-up study, we will increase the number of experiments to detect sensitivity of the input parameters. To avoid misunderstandings, we have removed the experiments of unknown outcomes modifications about input parameters and focused on the prediction applications about -O3 compiler optimization option and mixed precision programming. Thanks.
29. Table 6: For the “changes in model parameter” portion, line 345 says: “The result shows that ke, vdc_eq, and vdc_psim variables are not sensitive to the configuration in this study, while the value of the vdc1 variables should not be changed.”
This conclusion is a bit strong. I don’t think the results indicate that vdc1 should not be changed, The results just show that the one value tested fails. Maybe a smaller change would pass? The reverse is true for those that passed. Those variables may be sensitive to a larger change in value. I don’t know how/why these particular values were chosen.
RE: Thank you for your excellent comment. The experiments of unknown outcomes modifications are the prediction application on the new Sunway system. The experiments of unknown outcomes modifications about input parameters are needed for further research. Only some test cases are shown here. In the follow-up study, we will increase the number of experiments to detect sensitivity of the input parameters. To avoid misunderstandings, we have removed the experiments of unknown outcomes modifications about input parameters and focused on the prediction applications about -O3 compiler optimization option and mixed precision programming. Thanks.
30. Line 355: “..form a mixed perturbation environment...” Again, I don’t think there is anything specifically meaningful to this “mixed perturbation’ terminology. The whole earth system model is chaotic and it is affected by hardware/software stack, roundoff error, truncation error, etc.
RE: Thank you for your comment. The ESM-DCT is used for detecting the existence of software or human errors when taking hardware-related perturbations into account on the heterogeneous many-core systems. Therefore, the mixed perturbations refer to the additional uncertainty caused by heterogeneous hardware designs and software or human changes (such as compiler optimization option and model input parameter changes). The coexistence of general-purpose cores and accelerator cores, which usually employ different hardware architectures, can lead to bit-level differences, especially when we try to maximize the performance on both kinds of cores. Such differences further lead to computational perturbations through temporal integration, which can blend with software or human errors. We have removed the descriptions of the mixed perturbations. Please see lines 372-373. Thanks.
31. Other related work on climate model correctness that is not cited in this paper:
Mahajan et al., “Ensuring statistical reproducibility of ocean model simulations in the age of hybrid computing”, 2021.
Massonnet et al., “Replicability of the EC-Earth3 Earth system model under a change in computing environment”, 2020.
Mahajan et al., “A multivariate approach to ensure statistical reproducibility of climate model simulations”, 2019.
Mahajan et al., “Exploring an Ensemble-Based Approach to Atmospheric Climate Modeling and Testing at Scale”, 2017.
Wan, H., et al., “A new and inexpensive non-bit-for-bit solution reproducibility test based on time step convergence (TSC1.0)”, 2017.
RE: Thank you for your suggestion. We have cited the related work on climate model correctness. Please see lines 68-75.
Technical corrections:
1. Line 32: “that facing with the” is awkward, please rephrase. (This also occurs in line 75, 108, …)
RE: Thank you for your suggestion. We have revised the sentence. Please see lines 84-85, 120-121. Thanks.
2. Line 98: MPEs are not defined.
RE: Thank you for your suggestion. We have revised the sentence. Please see lines 100-101. Thanks.
3. Line 234: “probability density function (PDF) of the CESM” . I assume the intention was of a variable, not the CESM model
RE: Thank you for your suggestion. We have revised the sentence. Please see lines 294-295. Thanks.
Citation: https://doi.org/10.5194/gmd-2024-10-AC3
-
AC5: 'Reply on RC2', Yangyang Yu, 24 May 2024
General comments:
This manuscript presents an approach for evaluating software correctness for Earth System Models using an ensemble and deep learning.
- The general idea is promising, but a lot of the algorithm decisions are not well supported. In general, very substantial improvements are needed for GMD. The authors need to justify unsubstantiated claims and statements that are not correct or misleading.
- The paper needs a good editing pass. There are many grammatical errors and awkward phrases (a few of which I mention below).
- Some of the information about other authors’ work is inaccurate and other related work has been left out (listed later in this report). The Introductory section, in particular, needs revision.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We all agree that the comments are very constructive for us to improve the presentation of the MS, and all the major comments and points have been fully addressed in the revision. Specifically, in the revision, we have revised the Introductory section and added: 1) the descriptions of heterogeneous computational perturbations; 2) the experiments of determining the ensemble size of training sets and model parameters; 3) the experiments of determining the model time step; 4) the description of the CESM version, etc.
The point-by-point replies are followed.
Specific comments:
1. Line 58: “Evaluating the scientific consistency is a commonly used method for model verification in the form of quality assurance.”
I am not sure that I agree this is a “commonly-used” method (consistency) - I don’t think this phrasing is found prior to the [Baker 2015]. Also I don’t believe the authors here ever define what they mean by consistency.
RE: Thank you for your comments. We have revised the sentences. It should be “Model verification during optimizing and developing ESMs is critical to establishing and maintaining the credibility of the ESMs (Carson II, 2002), which focuses on determining whether or not the implementation of a model is correct and matches the intended description and assumptions for the model. ” Please see lines 58-60. Thanks.
2. Line 59: “For example, for detecting the influences of hardware environment changes, data from a model simulation of several hundred years (typically 400) on the new machine is analyzed and compared to data from the same simulation on a trusted machine by climate scientists (Baker et al., 2015). Then, the CESM ensemble-based consistency test (CESM-ECT) is used to compare the new simulations against the control ensemble from the trusted machine (Baker et al., 2015; Milroy et al., 2016; Baker et al., 2016; Milroy et al., 2018).
This description is not accurate. [Baker 2015] explains that the 400-year test is the approach that was used prior to the CESM-ECT development. In fact, [Baker 2015] states that it was the motivation for a more objective approach. The CESM-ECT does not involve the 400 years and is a separate thing.
RE: Thank you for your comments. We have revised the sentence. It should be “For detecting the influences of hardware environment changes, a historical method is that data from a model simulation of several hundred years (typically 400) on the new machine is analyzed and compared to data from the same simulation on a trusted machine by climate scientists (Baker et al., 2015). Then, many ensemble-based consistency evaluation methods are used to compare the new simulations against the control ensembles.” Please see lines 60-64. Thanks.
3. Line 64: “2018). However, all the methods mentioned above focus on homogeneous multi-core HPC systems.”
This statement is not accurate. As far as I can tell, there is nothing specific to homogeneous machines in these works. In fact, [Milroy 2018] specifically mentions heterogeneous computing environments as a motivation for the work.
RE: Thank you for your comments. We have removed the sentence. Thanks.
4. Line 71: “However, the ultra-fast tests in the CESM-ECT are applied for evaluating the scientific consistency on the Community Atmosphere Model ” … “There is a lack of a method to analyze short-time simulation results of multi-components”
The [Milroy 2018] work also tests CLM (land model). It explains how land and atmosphere are tightly coupled and CLM modifications can be detected in CAM variables. And gives experimental results.
Consider also that ocean scales are much larger than atmospheric scales, hence the need for more time for changes to propagate through the ocean model from a single perturbed variable.
RE: Thank you for your good comments. We have removed the sentence and only focus on the consistency test of the atmosphere component for the CESM. Please see lines 232-244. Thanks.
5. Line 74: “Besides, the principal component analysis (PCA) in the CESM-ECT is just applied to exploring linear patterns contained in the confusing datasets”
This statement does not make sense to me. Please clarify what is meant. PCA detects changes in relationships between variables. They don’t have to be linear relationships for change to be detected.
RE: Thank you for your comment. We have removed the sentence and added the descriptions of the advantages of deep learning methods for evaluating the consistency for ESMs. Please see lines 75-80. Thanks.
6. Line 75: ”Facing with the non-linear relationship generated by the combination of multi-component data, there is an urgent need of data analysis methods for non-linear transformation, such as deep learning models.”
This statement is just unclear. The atmosphere model by itself is nonlinear and chaotic - so I’m not sure what is meant by multi-component combinations - is the implication that this is more non-linear? Please clarify. Also “facing with the …” is awkward phrasing.
RE: Thank you for your comments. We have removed the sentence. Thanks.
7. Line 80: The phrase “unavoidable computational perturbations” is used frequently (also line 100, line 353), and it’s a bit awkward. It seems that it should be clarified that these are “numerical differences” due to using finite-precision and changing the order of operations and precision due to the changes in architecture. This is not really explained.
RE: Thank you for your suggestion. We have added the descriptions of uncertainties caused by heterogeneous hardware designs. Please see lines 97-102. Thanks.
8. Line 81: “The ESM-DCT is applied to evaluate whether or not a new CESM configuration in the scenario of mixed perturbations composed of the inevitable computational perturbations and software or human errors in the heterogeneous computing is consistent with the original “trusted” configuration in the homogeneous computing.”
This phrase is unclear. Aren’t you trying to figure out (i.e. differentiate) whether a difference in output is due to numerical round-off OR software/human error?
RE: Thank you for your comment. We have revised the sentence. It should be “The ESM-DCT tool is based on the unsupervised bidirectional gate recurrent unit-autoencoder (BGRU-AE; Zhao et al., 2017) model, and is applied to evaluate whether or not a new ESM (CEMS in this case) configuration is consistent with the original “trusted” configuration. ” Please see lines 82-84. Thanks.
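For readers unfamiliar with the architecture, here is a minimal PyTorch sketch of a bidirectional GRU autoencoder of the kind named here; the layer sizes and sequence length are illustrative assumptions, not the authors' configuration:

```python
# Illustrative BGRU-AE: encode a multivariate time series with a bidirectional
# GRU, compress to a latent code, and decode back to a reconstruction.
import torch
import torch.nn as nn

class BGRUAE(nn.Module):
    def __init__(self, n_vars, hidden=32, latent=8):
        super().__init__()
        self.encoder = nn.GRU(n_vars, hidden, batch_first=True,
                              bidirectional=True)
        self.to_latent = nn.Linear(2 * hidden, latent)   # fuse both directions
        self.decoder = nn.GRU(latent, hidden, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_vars)

    def forward(self, x):              # x: (batch, time, n_vars)
        h, _ = self.encoder(x)
        z = self.to_latent(h)          # per-step latent code
        h_dec, _ = self.decoder(z)
        return self.out(h_dec)         # reconstruction of x

model = BGRUAE(n_vars=97)              # variable count reported later in this reply
x = torch.randn(4, 24, 97)             # (ensemble members, time steps, vars)
loss = nn.functional.mse_loss(model(x), x)   # training minimizes this
```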
9. Line 103: “The key challenge is designing a tool to evaluate the scientific consistency, which can remove the influences of heterogeneous perturbations”
The authors have definitely implied that the CESM-ECT tools do not do this, which is untrue. This new method/approach is still a valid contribution, and there is no need to misrepresent other previous work.
RE: Thank you for your comment. We have removed the sentence. Thanks.
10. Line 108: “However, the principal component analysis (PCA) in the CESM-ECT is just applied to exploring linear patterns contained in the confusing datasets (Liu et al., 2009). Facing with the non-linear relationship generated by the combination of multi-component data in the coupled numerical model, …”
This is the exact same phrase as in line 74 and it still does not make sense (see above comment).
RE: Thank you for your comment. We have removed the sentence. Thanks.
11. Line 116: I am confused by the last bit of section 2, regarding the 5-variable conceptual coupled model and its results. It seems that the purpose is to show that the DCT approach is better than the ECT approach. This is quite hard to judge on the information given here. Most choices are not justified. Note that there is little point in a PCA-approach with only 5 variables. (Which is why the CESM-ECT for ocean does not use PCA. Though the authors cite that work, it is unclear that they are familiar with it. Also the ocean variables are not globally averaged.)
RE: Thank you for your comment. In this study, we start from the 5VCCM to develop the ESM-DCT. The 5VCCM is a simple nonlinear coupled model; the experimental result is an example and is only used as the first step in testing the ESM-DCT tool. We have added the descriptions of the function of the 5VCCM. Please see lines 182-219.
Also why the choice of 151 ensemble members for 1000 timesteps? [Baker 2015] uses 151 ensemble members for yearly averages. [Milroy 2018] shows that more are needed for shorter time scales. Why does the DCT look at 40? Are you aiming for some false positive rate? The choices here seem quite arbitrary, and it’s unclear why this “experiment” is in the background section. [Say more?]
RE: Thank you for your comment. We show the performance of deep learning models in mining non-linear features from coupled models. First, we run two simulations of 2000 time steps each: one with no initial condition perturbation and one with a perturbation of O(10−14) to the initial conditions. Figure 5 demonstrates the sensitive dependence of the 5VCCM variables on the initial conditions. Figure 5 shows that choosing 2000 time steps can provide sufficient variability of atmosphere and ocean variables to determine statistical distinguishability. Therefore, we use the simulation results at the 2000th time step as the input data. Then, we select the ensemble size of the training sets and the optimal model parameters as those for which the accuracy on the testing datasets reaches its maximum value. We have added the experiments of determining the ensemble size of training sets and model parameters. Please see lines 205-219.
12. Line 212: “Based on the ultra-fast tests”
Which ultra-fast tests are you referring to? Also I am skeptical about the robustness of bugs being detected in only a couple time steps for the ocean.
RE: Thank you for your comment. Milroy et al. (2018) demonstrated that adequate ensemble variability is achieved with instantaneous variable values at a small number of time steps for CAM, despite rapid perturbation growth and heterogeneous variable spread. Therefore, we expect to analyze the short-term simulation results to achieve the consistency evaluation of CESM. Please see lines 232-234. Thanks.
13. Line 216 - 217: I find it odd to simply increase the frequency of the ocn/atmosphere coupling. This change affects the nature of the model and model output and certainly needs justification as to why it is acceptable. Also how did you choose 24 time steps as an appropriate time slice to evaluate?
RE: Thank you for your comment. We have removed the consistency test of the ocean component. Then, we have revised the descriptions of the experiments of obtaining the datasets. We obtain the results using the B compset (active atmosphere, land, ice, ocean) and expect to provide guidance for the consistency test of the fully active ESM components. Therefore, we examine 97 variables from the atmosphere component results after the ocean component transfers data to the CESM coupler, as redundant variables and those with no variance are excluded. We did not change the coupling frequency. Next, we run two simulations of 96 time steps each: one with no initial condition perturbation and one with a perturbation of O(10−14) to the initial atmosphere temperature. Figure 6 demonstrates the sensitive dependence of atmosphere variables on initial atmosphere temperature. The vertical axis labels are atmosphere variables, while the horizontal axis labels are the CESM time steps. The color of each step represents the number of significant figures in common between the perturbed and unperturbed simulations. Figure 6 shows that choosing a small number of time steps can provide sufficient variability of atmosphere variables to determine statistical distinguishability. Therefore, we use the simulation results at the 96th time step as the input data of the BGRU-AE model, where the ocean component is coupled 2 times. Please see lines 232-244. Thanks.
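The per-variable, per-time-step diagnostic described for Figure 6 can be sketched as follows; the relative-difference formula is our assumption and not necessarily the authors' exact definition:

```python
# Assumed diagnostic: significant figures shared by perturbed and unperturbed
# runs, estimated from the relative difference of matched values.
import numpy as np

def common_significant_figures(a, b, max_figs=16):
    denom = np.maximum(np.abs(a), np.abs(b))
    rel = np.where(denom > 0, np.abs(a - b) / denom, 0.0)
    with np.errstate(divide="ignore"):
        figs = np.where(rel > 0, -np.log10(rel), max_figs)
    return np.clip(figs, 0, max_figs)

unperturbed = np.array([288.15, 101325.0])     # e.g. temperature, pressure
perturbed = unperturbed * (1 + 1e-14)          # O(1e-14) initial perturbation
print(common_significant_figures(unperturbed, perturbed))   # ~[14., 14.]
```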
14. Line 236: “...clearly inconsistent…”. I don’t believe that there has been a definition provided as to what is meant by consistency for this new approach.
RE: Thank you for your comment. We have removed the sentence because we have removed the testing datasets of “unacceptable initial conditions”. Thanks.
15. Lines 236-237: “the testing datasets with unacceptable CESM model parameter adjustments are with the O(10-14) perturbations of initial atmospheric temperature”. I don’t understand this - why is O(10-14) unacceptable?
RE: Thank you for your comment. We define the testing datasets with unacceptable CESM model parameters as those known to produce statistically distinguishable outputs from the training datasets, but we control these testing datasets so that they are the ensembles with the O(10-14) perturbations of initial atmosphere temperature. We have added a description of the testing datasets with unacceptable CESM model parameters. Please see lines 256-258.
16. Line 246: Why the decision to do a spatial mean for the ocean variables? It seems that this could be problematic given the much higher spatial variability in the ocean compared to the atmosphere.
RE: We have removed the consistency test of the ocean component and only focus on the atmosphere component for the CESM. Thanks.
17. Line 258: “...because the variables have vastly different units and magnitudes.”
Matches the phrase used in [Baker 2015]: “...because the CAM variables have vastly different units and magnitudes.”
RE: Thank you for your suggestion. We have cited the paper of Baker et al., 2015. Please see line 271-274. Thanks.
18. Line 26: “Software correctness verification in the form of quality assurance”
Line 59: “verification in the form of quality assurance”
Line 104: “for software verification in the form of quality assurance”
This particular phrase comes from the abstract of [Baker 2015]: “software verification in the form of quality assurance”. Consider rephrasing or quoting.
RE: Thank you for your suggestion. We have revised the sentences and cited the paper. For example, “Model verification during optimizing and developing ESMs is critical to establishing and maintaining the credibility of the ESMs (Carson II, 2002)”. Please see lines 58-59. Thanks.
19. Line 268: Why using CESM version 1.3? That is quite old and not an official release: https://www.cesm.ucar.edu/models/releases
RE: Thank you for your comment. In this study, the CESM version used in the present study is CESM 2.1.1 (Danabasoglu et al., 2020). We added the description of the CESM version. Please see line 227. Thanks.
20. Line 310: What are c0_lnd and c0_ocn? Why did you choose them? (There are a couple of DOE-authored studies with CAM5 that mention these for the ZM scheme, but none are cited here.) Also, I know these are later coarsely defined in Table 5, but more info is needed when they are mentioned. How did you pick the parameter values in Table 5?
RE: Thank you for your comment. Climate scientists provided a list of CAM input parameters thought to affect the climate in a non-trivial manner, which is used to detect changes to the simulation results that are known to produce statistically distinguishable outputs in the CESM-ECT (Baker et al., 2015), such as zm_c0_lnd, zm_c0_ocn, sol_factb_interstitial, sol_factic_interstitial, and cldfrc_rhminh. Our tool must successfully detect the inconsistency of the simulation results that are known to produce statistically distinguishable outputs. Therefore, we modify the values of input parameters in the atmosphere components and then test whether or not our tool can detect the inconsistency caused by the model parameter changes using the ESM-DCT. We have added the descriptions of experiments of modifications to produce statistically distinguishable outputs. Please see lines 311-317. Thanks.
21. Line 315: What is a “mixed perturbation”? I think you mean something like in line 320 that implies two types of perturbations are problematic: “The results show that the tool can detect the climate changing modifications when taking hardware-related perturbations into account.” I think this line of reasoning (where the authors focus on perturbations from 2 sources somehow being trickier) is flawed. Either the output of two different runs is consistent or it is not. The source of the difference is not easily quantified, particularly with compiler and hardware changes. If you run on a different machine with a different compiler than your trusted machine, is that a mixed perturbation? I don’t see that that distinction matters (and such experiments were done in [Baker 2015].) I do agree that the way an ensemble spread is created affects the variability of the distribution (e.g., see [Milroy 16]), and that's an interesting question, but the authors here have not delved into that at all.
RE: Thank you for your comment. We have removed the descriptions of “mixed perturbation” and only focus on documenting the development of a deep learning-based consistency test approach for the ESMs. Thanks.
22. Figure 9: why compare the output between the O(10-6) tests and the climate-changing tests if the O(10-6) is inconsistent? Don't you want to compare to the accepted one to know how far off you are? I am clearly missing something.
RE: Thank you for your comment. We have removed the testing datasets of “unacceptable initial conditions”. Thanks.
23. Figures 7-11: Why is the reconstruction error threshold 0.05? Did this somehow come from the second paragraph in 3.4?
RE: Thank you for your comment. We calculate the reconstruction errors after re-inputting the training datasets into the saved BGRU-AE model. The PDF of the reconstruction errors of the training datasets is shown in Figure 8. Following Figure 8, the PDF of the reconstruction errors of the training datasets is represented by the blue line. The threshold of the reconstruction errors is 0.05, represented by the red line. We have added the descriptions of the calculation method of the reconstruction errors. Please see lines 283-288.
24. Table 5: What does it mean that the passing rate is 0% and the test overall passes? That seems wrong.
RE: Thank you for your comment. It should be “Failure”. We have revised the table. Please see Table 6.
25. Section 4.4: The way the beginning of this section reads, with “For example, the effect of -O3 compiler optimization option was not known, because the CESM code base is large and level-three optimizations can be quite aggressive (Baker et al., 2015). We input the simulation results of the heterogeneous version of CESM on the new Sunway system into the ESM-DCT with -O3 compiler optimization option.“ implies that we don't know if O3 will pass or not. I checked [Baker 2015] and there they show that O3 does pass, so I am not sure why it is in the unknown outcomes section.
RE: Thank you for your comment. The CESM code base is large and level-three optimizations can be quite aggressive. In the results of CESM-ECT [Baker 2015], INTEL13-O3 and INTEL14-O3 get the “pass”, but INTEL15-O3 gets the “failure”. Different compilers have different optimization algorithms and strategies for codes, so the outputs can be different. After the ESM-DCT tool is constructed, we expect that it can provide the guidelines for predicting the consistency results of ESM new configurations to better understand and improve the tool. The result shows that the effect of -O3 compiler optimization option is positive, which provides the confidence to choose the level-three optimizations on the new Sunway system. We have added the descriptions of -O3 compiler optimization options. Please see lines 351-359. Thanks.
26. Table 6: Was the O3 optimization only done for the ZM subroutine (as the table says)? If so, this needs to be noted in the paper text.
RE: Thank you for your suggestion. The prediction datasets are simulation results of the heterogeneous version of CESM which is with -O3 compiler optimization option for all codes on the new Sunway system. We have added the descriptions of prediction datasets with -O3 compiler optimization options. Please see lines 354-355. Thanks.
27. Line 329: “..which provides the references for the porting and optimization“ - what does this mean?
RE: Thank you for your comment. The result shows that the effect of -O3 compiler optimization option in the new Sunway system is positive, which provides the confidence to choose the level-three optimizations on the new Sunway system. We have revised the sentence. Please see lines 358-359. Thanks.
28. Line: 341: “Then, our tool can detect sensitivity of input parameters, which is excluded to the input parameter list provided by the climate scientist thought to affect the climate in a non-trivial manner“
The meaning of this sentence is unclear.
RE: Thank you for your comment. The input parameters used in Section 4.4 are not among the parameters listed by climate scientists in [Baker et al., 2015]. In this study, the experiments of unknown outcomes modifications are the prediction application on the new Sunway system. The experiments of unknown outcomes modifications about input parameters are needed for further research. Only some test cases are shown here. In the follow-up study, we will increase the number of experiments to detect sensitivity of the input parameters. To avoid misunderstandings, we have removed the experiments of unknown outcomes modifications about input parameters and focused on the prediction applications about -O3 compiler optimization option and mixed precision programming. Thanks.
29. Table 6: For the “changes in model parameter” portion, line 345 says: “The result shows that ke, vdc_eq, and vdc_psim variables are not sensitive to the configuration in this study, while the value of the vdc1 variables should not be changed.”
This conclusion is a bit strong. I don’t think the results indicate that vdc1 should not be changed. The results just show that the one value tested fails. Maybe a smaller change would pass? The reverse is true for those that passed. Those variables may be sensitive to a larger change in value. I don’t know how/why these particular values were chosen.
RE: Thank you for your excellent comment. The experiments with modifications of unknown outcome are prediction applications on the new Sunway system. Further research is needed on unknown-outcome experiments involving input parameters; only some test cases are shown here. In a follow-up study, we will increase the number of experiments to probe the sensitivity of the input parameters. To avoid misunderstandings, we have removed the unknown-outcome experiments involving input parameters and focused on the prediction applications concerning the -O3 compiler optimization option and mixed-precision programming. Thanks.
30. Line 355: “..form a mixed perturbation environment...” Again, I don’t think there is anything specifically meaningful to this “mixed perturbation” terminology. The whole earth system model is chaotic and it is affected by hardware/software stack, roundoff error, truncation error, etc.
RE: Thank you for your comment. We have removed the descriptions of “mixed perturbations”. Thanks.
31. Other related work on climate model correctness that is not cited in this paper:
Mahajan et al., “Ensuring statistical reproducibility of ocean model simulations in the age of hybrid computing”, 2021.
Massonnet et al., “Replicability of the EC-Earth3 Earth system model under a change in computing environment”, 2020.
Mahajan et al., “A multivariate approach to ensure statistical reproducibility of climate model simulations”, 2019.
Mahajan et al., “Exploring an Ensemble-Based Approach to Atmospheric Climate Modeling and Testing at Scale”, 2017.
Wan, H., et al., “A new and inexpensive non-bit-for-bit solution reproducibility test based on time step convergence (TSC1.0)”, 2017.
RE: Thank you for your suggestion. We have cited the related work on climate model correctness. Please see lines 64-70.
Technical corrections:
- Line 32: “that facing with the” is awkward, please rephrase. (This also occurs in lines 75, 108, …)
RE: Thank you for your suggestion. We have removed the sentence because we have removed the descriptions of “mixed perturbations”. Thanks.
- Line 98: MPEs are not defined.
RE: Thank you for your suggestion. We have revised the sentence. Please see lines 93-94. Thanks.
- Line 234: “probability density function (PDF) of the CESM” . I assume the intention was of a variable, not the CESM model
RE: Thank you for your suggestion. We have removed the sentence because we have removed the testing datasets of “unacceptable initial conditions”. Thanks.
Citation: https://doi.org/10.5194/gmd-2024-10-AC5
AC3: 'Reply on RC2', Yangyang Yu, 23 Apr 2024
RC3: 'Comment on gmd-2024-10', Anonymous Referee #3, 18 Mar 2024
This manuscript presents interesting and necessary work, and it seems that the authors have done a thorough job with their science. There are two main improvements I believe should be made. First, the writing needs to be edited to be clearer, and the proposed BGRU-AE model needs to be compared to a simpler method to set a baseline. The following is a list of questions/revisions.
- The English needs significant revision, perhaps by a professional service. The paper is filled with nonstandard terminology and unusual or wrong grammar, making comprehension difficult.
- The need is well communicated and important. Processors are becoming increasingly heterogeneous and it’s important to understand how that affects calculations.
- Lines 88-103: While the technical description of the SW26010P processor is sufficient, a little bit more context for the processor and the TaihuLight supercomputer would be appreciated. What was the supercomputer built to do? What are the stated advantages to using this processor, and were they ever meant to run climate models? Some of this is addressed in the summary and discussions but is worth mentioning here.
- Lines 104-114: I’m approaching this paper from a machine learning perspective, so a few more sentences in the background expanding on how CESM-ECT works would be helpful for me.
- Line 140: Why was the ensemble size 151?
- Line 198: The role of the FC layers needs to be explicitly stated. What exactly do they do? Does this change the inputs and outputs of the network during training?
- Section 3.2: A line or two should be added about what software was used to create the model. Pytorch, Tensorflow, etc.?
- Figure 7: Please put a legend for all three elements and axis labels. Alternatively, make the caption more detailed.
- Figure 8-11: Please label the y-axis. Also, it would be helpful to show the number of points in each category. If there are enough points that there is significant overlap, perhaps the plots should be converted to boxplots.
- While it’s very possible this is my fault, I do not understand the paragraph that spans lines 140-151, and subsequently, Table 1, perhaps some context or motivation should be given. Are the test results shown in Section 4 and the following tables a subset of the tests shown in Table 1? If the ECT is failing for 2/3 tests, why do we believe the DCT pass rate instead?
- The language in the table is confusing. For example, in Table 5, the passing rate is 0%, but the ESM-DCT result is “Pass”. I think this means that the ESM-DCT detected all of the anomalies, so the tool itself “passed” the test, but it would help if the language was clearer.
- I will allow the editor to decide if this is required or not, since it will require a lot of work, but I believe the DCT method needs to be compared to something for the tests performed. How does the ECT method do on the tests? Alternatively, how does a simpler machine-learning based method do on the tests? The authors’ implementation of the BGRU-AE model is impressive, but it could be difficult to implement and have heavy training requirements. If local outlier factor (LOF) or an isolation forest does almost as well or as well as the DCT method, it’s worth pointing out since those methods are simpler. Viewing the convergence curves shown in Figure 6, I suspect a simpler non-linear ML model will work also. This, along with English revisions, are what I believe to be the two major improvements that need to be made to the manuscript.
- I am happy to see the code is included with the manuscript. It still might be helpful to have an appendix with a summary of the technical details of the BGRU-AE model. This will help with reproducibility since people can try to implement the model with their preferred computational tools.
Citation: https://doi.org/10.5194/gmd-2024-10-RC3
AC1: 'Reply on RC3', Yangyang Yu, 23 Apr 2024
This manuscript presents interesting and necessary work, and it seems that the authors have done a thorough job with their science. There are two main improvements I believe should be made. First, the writing needs to be edited to be clearer, and the proposed BGRU-AE model needs to be compared to a simpler method to set a baseline. The following is a list of questions/revisions.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We agree that the comments are very constructive for improving the presentation of the MS, and all the major comments and points have been fully addressed in the revision. Specifically, in the revision, we have added:
1) more detailed descriptions of the ESM applications of the Sunway TaihuLight system and the new Sunway system; 2) more detailed descriptions of the PCA method; and 3) the experiments comparing ESM-DCT and LOF in evaluating the consistency of CESM.
Specific comments:
1. The English needs significant revision, perhaps by a professional service. The paper is filled with nonstandard terminology and unusual or wrong grammar, making comprehension difficult.
RE: Thank you for your valuable and thoughtful comments. We have carefully checked and improved the English writing in the revised manuscript.
2. The need is well communicated and important. Processors are becoming increasingly heterogeneous and it’s important to understand how that affects calculations.
RE: Thank you for your suggestion. In heterogeneous many-core architectures, the general-purpose cores are mainly used for control and task management, while the accelerator cores perform the computation. For example, the major computing power in heterogeneous many-core architectures is provided by many-core accelerators such as NVIDIA GPUs and the computing processing elements (CPEs) of Sunway processors. We have added descriptions of heterogeneous many-core architectures. Please see lines 39-43.
3. Lines 88-103: While the technical description of the SW26010P processor is sufficient, a little bit more context for the processor and the TaihuLight supercomputer would be appreciated. What was the supercomputer built to do? What are the stated advantages to using this processor, and were they ever meant to run climate models? Some of this is addressed in the summary and discussions but is worth mentioning here.
RE: Thank you for your suggestion. ESMs are important application scenarios for heterogeneous many-core high-performance computing (HPC) systems. For example, Zhang et al. (2020) enabled highly efficient simulations of the high-resolution (25-km atmosphere and 10-km ocean) CESM on the heterogeneous Sunway TaihuLight. Gu et al. (2022) established a non-hydrostatic global atmospheric modeling system at 3 km horizontal resolution with aerosol feedbacks on the heterogeneous new Sunway system. Zhang et al. (2023) developed a series of high-resolution CESM configurations, with resolutions of up to 5 km for the atmosphere and 3 km for the ocean, to capture major weather-climate extremes on the heterogeneous new Sunway system. We have added descriptions of these applications of Sunway TaihuLight and the new Sunway system. Please see lines 49-54. Thanks.
4. Lines 104-114: I’m approaching this paper from a machine learning perspective, so a few more sentences in the background expanding on how CESM-ECT works would be helpful for me.
RE: Thank you for your suggestion. CESM-ECT uses PCA to obtain linearly independent feature vectors from the control ensemble simulations; new simulations are converted to scores via these feature vectors, and an overall pass or fail result is issued based on the scores. We have added a description of PCA. Please see lines 114-120. Thanks.
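As an illustration, a minimal sketch of such PCA-based scoring (synthetic data, hypothetical sizes, and a simple range-based flagging rule; not the exact CESM-ECT implementation):

```python
# Minimal sketch of PCA-based ensemble consistency scoring in the spirit
# of CESM-ECT (synthetic data; hypothetical sizes and flagging rule).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
ensemble = rng.normal(size=(151, 97))   # control runs x global-mean variables
test_runs = rng.normal(size=(3, 97))    # new simulations to evaluate

# Standardize with control-ensemble statistics, then fit PCA.
mu, sigma = ensemble.mean(axis=0), ensemble.std(axis=0)
pca = PCA(n_components=50).fit((ensemble - mu) / sigma)

# Convert runs to PC scores via the feature vectors.
ctrl_scores = pca.transform((ensemble - mu) / sigma)
test_scores = pca.transform((test_runs - mu) / sigma)

# Flag PC scores that fall outside the control ensemble's score range.
lo, hi = ctrl_scores.min(axis=0), ctrl_scores.max(axis=0)
n_flagged = ((test_scores < lo) | (test_scores > hi)).sum(axis=1)
print("flagged PC scores per test run:", n_flagged)
```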
5. Line 140: Why was the ensemble size 151?
RE: Thank you for your comment. We select the training-set ensemble size and the optimal model parameters as those that maximize the accuracy on the testing datasets. We have added the experiments for determining the training-set ensemble size and the model parameters. Please see lines 241-249.
6. Line 198: The role of the FC layers needs to be explicitly stated. What exactly do they do? Does this change the inputs and outputs of the network during training?
RE: Thank you for your comment. For the BGRU-AE model, the input data are 101-dimensional vectors. The BGRU output has the shape [number of layers × number of directions, hidden size]. We use the FC layers to map the BGRU output back to the input dimension, so that it aligns with the input data and the reconstruction loss can be computed. We have added a description of the role of the FC layers. Please see lines 171-172. Thanks.
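To make the role of the FC layer concrete, here is a minimal PyTorch sketch of the BGRU-AE idea, with the FC layer mapping the bidirectional output back to the input dimension (the hidden size is hypothetical; the paper's exact architecture may differ):

```python
# Minimal sketch of a BGRU-AE with an FC layer aligning output and input
# dimensions (hypothetical hidden size; not the paper's exact architecture).
import torch
import torch.nn as nn

class BGRUAE(nn.Module):
    def __init__(self, n_vars=101, hidden=64):
        super().__init__()
        # Bidirectional GRU reads the variable sequence in both directions,
        # capturing context on either side of each variable.
        self.bgru = nn.GRU(input_size=1, hidden_size=hidden,
                           bidirectional=True, batch_first=True)
        # FC layer maps the concatenated forward/backward hidden states
        # back to one value per variable, aligning output with input.
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, x):
        # x: (batch, n_vars, 1) -- each variable is one step of the sequence
        out, _ = self.bgru(x)              # (batch, n_vars, 2 * hidden)
        return self.fc(out)                # (batch, n_vars, 1)

model = BGRUAE()
x = torch.randn(8, 101, 1)                 # 8 ensemble members, 101 variables
loss = nn.functional.mse_loss(model(x), x) # reconstruction error for training
```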
7. Section 3.2: A line or two should be added about what software was used to create the model. Pytorch, Tensorflow, etc.?
RE: Thank you for your good suggestion. We used PyTorch to implement the model. We have added a description of the software used. Please see lines 200-201. Thanks.
8. Figure 7: Please put a legend for all three elements and axis labels. Alternatively, make the caption more detailed.
RE: Thank you for your suggestion. The figure shows the PDF of the reconstruction errors of the training datasets, represented by the blue line. The threshold on the reconstruction error is 0.05, represented by the red line. We have made the figure caption more detailed. Thanks.
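To make the thresholding concrete, a hedged sketch of how training reconstruction errors relate to the fixed 0.05 threshold (the error values below are random stand-ins; the percentile rule is shown only as one common alternative, not the paper's method):

```python
# Sketch: training reconstruction errors vs. the fixed 0.05 threshold
# (random stand-in errors; real values come from the trained BGRU-AE).
import torch

errors = torch.rand(120) * 0.1          # per-member reconstruction MSEs
threshold = 0.05                        # fixed threshold used in the paper
# One common alternative: a high percentile of the training errors.
alt_threshold = torch.quantile(errors, 0.99).item()

frac_below = (errors < threshold).float().mean().item()
print(f"{frac_below:.1%} of training members fall below the threshold")
```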
9. Figure 8-11: Please label the y-axis. Also, it would be helpful to show the number of points in each category. If there are enough points that there is significant overlap, perhaps the plots should be converted to boxplots.
RE: Thank you for your suggestion. We have labeled the y-axis in Figures 10-12. Thanks.
10. While it’s very possible this is my fault, I do not understand the paragraph that spans lines 140-151, and subsequently, Table 1, perhaps some context or motivation should be given. Are the test results shown in Section 4 and the following tables a subset of the tests shown in Table 1? If the ECT is failing for 2/3 tests, why do we believe the DCT pass rate instead?
RE: Thank you for your comment. 5VCCM is a simple nonlinear coupled model; the experimental result is an example and serves only as the first step in testing the ESM-DCT tool. We have added descriptions of ESM-DCT for 5VCCM. Please see lines 212-249.
11. The language in the table is confusing. For example, in Table 5, the passing rate is 0%, but the ESM-DCT result is “Pass”. I think this means that the ESM-DCT detected all of the anomalies, so the tool itself “passed” the test, but it would help if the language was clearer.
RE: Thank you for your comment. It should be “Failure”. We have revised the table. Please see Table 6. Thanks.
12. I will allow the editor to decide if this is required or not, since it will require a lot of work, but I believe the DCT method needs to be compared to something for the tests performed. How does the ECT method do on the tests? Alternatively, how does a simpler machine-learning based method do on the tests? The authors’ implementation of the BGRU-AE model is impressive, but it could be difficult to implement and have heavy training requirements. If local outlier factor (LOF) or an isolation forest does almost as well or as well as the DCT method, it’s worth pointing out since those methods are simpler. Viewing the convergence curves shown in Figure 6, I suspect a simpler non-linear ML model will work also. This, along with English revisions, are what I believe to be the two major improvements that need to be made to the manuscript.
RE: Thank you for your good suggestion. We compared the accuracy and computation time on CESM of the local outlier factor (LOF) machine learning method and ESM-DCT. The accuracy of ESM-DCT, 98.6%, is slightly better than that of LOF, 96.2%, when the accuracy on the testing datasets reaches its maximum value. The ESM-DCT also has higher computational efficiency than LOF. Please see Section 4.5. Thanks.
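For context, a minimal sketch of how such an LOF baseline can be set up with scikit-learn (synthetic data; the shapes and neighbor count are hypothetical):

```python
# Sketch of an LOF baseline for consistency testing (synthetic data).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
train = rng.normal(size=(120, 101))   # trusted ensemble, global-mean variables
test = rng.normal(size=(40, 101))     # new simulations to evaluate

# novelty=True fits on trusted data and allows predictions on unseen runs.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(train)
labels = lof.predict(test)            # +1 = inlier (pass), -1 = outlier (fail)
print(f"passing rate: {(labels == 1).mean():.1%}")
```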
13. I am happy to see the code is included with the manuscript. It still might be helpful to have an appendix with a summary of the technical details of the BGRU-AE model. This will help with reproducibility since people can try to implement the model with their preferred computational tools.
RE: Thank you for your suggestion. We have added guidelines for the ESM-DCT software tool to the code repository. Please see https://doi.org/10.5281/zenodo.10972563. Thanks.
Citation: https://doi.org/10.5194/gmd-2024-10-AC1
AC6: 'Reply on RC3', Yangyang Yu, 24 May 2024
This manuscript presents interesting and necessary work, and it seems that the authors have done a thorough job with their science. There are two main improvements I believe should be made. First, the writing needs to be edited to be clearer, and the proposed BGRU-AE model needs to be compared to a simpler method to set a baseline. The following is a list of questions/revisions.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We agree that the comments are very constructive for improving the presentation of the MS, and all the major comments and points have been fully addressed in the revision. Specifically, in the revision, we have added:
1) more detailed descriptions of the ESM applications of the Sunway TaihuLight system and the new Sunway system; 2) more detailed descriptions of the PCA method; and 3) the experiments comparing ESM-DCT and LOF in evaluating the consistency of CESM.
Point-by-point replies follow.
Specific comments:
1. The English needs significant revision, perhaps by a professional service. The paper is filled with nonstandard terminology and unusual or wrong grammar, making comprehension difficult.
RE: Thank you for your valuable and thoughtful comments. We have carefully checked and improved the English writing in the revised manuscript.
2. The need is well communicated and important. Processors are becoming increasingly heterogeneous and it’s important to understand how that affects calculations.
RE: Thank you for your suggestion. In heterogeneous many-core architectures, the general-purpose cores are mainly used for control and task management, while the accelerator cores perform the computation. For example, the major computing power in heterogeneous many-core architectures is provided by many-core accelerators such as NVIDIA GPUs and the computing processing elements (CPEs) of Sunway processors. We have added descriptions of heterogeneous many-core architectures. Please see lines 35-39.
3. Lines 88-103: While the technical description of the SW26010P processor is sufficient, a little bit more context for the processor and the TaihuLight supercomputer would be appreciated. What was the supercomputer built to do? What are the stated advantages to using this processor, and were they ever meant to run climate models? Some of this is addressed in the summary and discussions but is worth mentioning here.
RE: Thank you for your suggestion. ESMs are important application scenarios for heterogeneous many-core high-performance computing (HPC) systems. For example, Zhang et al. (2020) enabled highly efficient simulations of the high-resolution (25-km atmosphere and 10-km ocean) CESM on the heterogeneous Sunway TaihuLight. Gu et al. (2022) established a non-hydrostatic global atmospheric modeling system at 3 km horizontal resolution with aerosol feedbacks on the heterogeneous new Sunway system. Zhang et al. (2023) developed a series of high-resolution CESM configurations, with resolutions of up to 5 km for the atmosphere and 3 km for the ocean, to capture major weather-climate extremes on the heterogeneous new Sunway system. We have added descriptions of these applications of Sunway TaihuLight and the new Sunway system. Please see lines 44-50. Thanks.
4. Lines 104-114: I’m approaching this paper from a machine learning perspective, so a few more sentences in the background expanding on how CESM-ECT works would be helpful for me.
RE: Thank you for your suggestion. For CESM-ECT, linearly independent feature vectors are obtained from the control ensemble using the PCA method. New simulations are converted to scores via these feature vectors, and an overall pass or fail result is issued based on the scores. CESM-ECT evaluates three simulations for each test scenario and issues an overall failure (meaning the results are statistically distinguishable) if more than two PC scores are problematic in at least two of the test runs. We have added a description of PCA. Please see lines 332-336. Thanks.
5. Line 140: Why was the ensemble size 151?
RE: Thank you for your comment. We select the training-set ensemble size and the optimal model parameters as those that maximize the accuracy on the testing datasets. The accuracy on the testing datasets for different ensemble sizes is shown in Table 1. Following Table 1, the training-set ensemble size of ESM-DCT for 5VCCM is 120, at which the accuracy on the testing datasets reaches its maximum value. We have added the experiments for determining the training-set ensemble size and the model parameters. Please see lines 211-219. Thanks.
6. Line 198: The role of the FC layers needs to be explicitly stated. What exactly do they do? Does this change the inputs and outputs of the network during training?
RE: Thank you for your comment. For the BGRU-AE model, the input data are 97-dimensional vectors. The BGRU output has the shape [number of layers × number of directions, hidden size]. We use the FC layers to map the BGRU output back to the input dimension, so that it aligns with the input data and the reconstruction loss can be computed. We have added a description of the role of the FC layers. Please see lines 142-143. Thanks.
7. Section 3.2: A line or two should be added about what software was used to create the model. Pytorch, Tensorflow, etc.?
RE: Thank you for your good suggestion. We used PyTorch to implement the model. We have added a description of the software used. Please see lines 170-171. Thanks.
8. Figure 7: Please put a legend for all three elements and axis labels. Alternatively, make the caption more detailed.
RE: Thank you for your suggestion. The figure shows the PDF of the reconstruction errors of the training datasets, represented by the blue line. The threshold on the reconstruction error is 0.05, represented by the red line. We have made the figure caption more detailed. Thanks.
9. Figure 8-11: Please label the y-axis. Also, it would be helpful to show the number of points in each category. If there are enough points that there is significant overlap, perhaps the plots should be converted to boxplots.
RE: Thank you for your suggestion. We have labeled the y-axis in Figures 9-11. Thanks.
10. While it’s very possible this is my fault, I do not understand the paragraph that spans lines 140-151, and subsequently, Table 1, perhaps some context or motivation should be given. Are the test results shown in Section 4 and the following tables a subset of the tests shown in Table 1? If the ECT is failing for 2/3 tests, why do we believe the DCT pass rate instead?
RE: Thank you for your comment. 5VCCM is a simple nonlinear coupled model; the experimental result is an example and serves only as the first step in testing the ESM-DCT tool. We have added descriptions of ESM-DCT for 5VCCM. Please see lines 182-219.
11. The language in the table is confusing. For example, in Table 5, the passing rate is 0%, but the ESM-DCT result is “Pass”. I think this means that the ESM-DCT detected all of the anomalies, so the tool itself “passed” the test, but it would help if the language was clearer.
RE: Thank you for your comment. It should be “Failure”. We have revised the table. Please see Table 6. Thanks.
12. I will allow the editor to decide if this is required or not, since it will require a lot of work, but I believe the DCT method needs to be compared to something for the tests performed. How does the ECT method do on the tests? Alternatively, how does a simpler machine-learning based method do on the tests? The authors’ implementation of the BGRU-AE model is impressive, but it could be difficult to implement and have heavy training requirements. If local outlier factor (LOF) or an isolation forest does almost as well or as well as the DCT method, it’s worth pointing out since those methods are simpler. Viewing the convergence curves shown in Figure 6, I suspect a simpler non-linear ML model will work also. This, along with English revisions, are what I believe to be the two major improvements that need to be made to the manuscript.
RE: Thank you for your good suggestion. We compared the accuracy and computation time on CESM of the local outlier factor (LOF) machine learning method, CESM-ECT, and ESM-DCT. The accuracy of ESM-DCT, 98.4%, is slightly better than that of LOF, 96.2%, and that of CESM-ECT, 98.2%, when the accuracy on the testing datasets reaches its maximum value. The ESM-DCT also has higher computational efficiency than LOF and CESM-ECT. Please see Section 4.5. Thanks.
13. I am happy to see the code is included with the manuscript. It still might be helpful to have an appendix with a summary of the technical details of the BGRU-AE model. This will help with reproducibility since people can try to implement the model with their preferred computational tools.
RE: Thank you for your suggestion. We have added guidelines for the ESM-DCT software tool to the code repository. Please see https://doi.org/10.5281/zenodo.10972563. Thanks.
Citation: https://doi.org/10.5194/gmd-2024-10-AC6
RC1: 'Comment on gmd-2024-10', Anonymous Referee #1, 05 Mar 2024
This manuscript presents an effort to address the need to determine the correctness (i.e., consistency with trusted references) of Earth system model (ESM) simulations conducted in new computing environments. The methodology is based on deep learning, and the primary focus is the impact of using heterogeneous many-core computer systems. I have very limited knowledge in deep learning and computer hardware, and hence will leave it to the other referee(s) and the Handling Editor to assess the manuscript in those respects. My comments below are from the perspective of an Earth system model developer who has encountered the challenge of correctness/consistency testing in their own work.
Overall, I think the study has made a worthwhile attempt to address an important need in ESM development, and I look forward to new, efficient, and objective methods that help fulfill the need. The manuscript, on the other hand, needs substantial improvements to make the proposed method more convincing and the presentation easier to follow. Below are my major comments.
- Some conceptual clarifications are needed regarding the scope and the underlying assumptions of the proposed method. My understanding is that the work by Baker et al. (2015) and the follow-up studies cited as references in this manuscript aimed at assessing the mean climate simulated by ESMs like the CESM. Therein, Earth system modeling was viewed as a boundary condition problem even though the action taken to obtain the climate statistics was to numerically integrate a set of time evolution equations. The work by Milroy et al. (2018) and other studies that developed “ultra-fast” tests using short simulations aimed at early detections of potential changes in the long-term averages of the model state. Supposedly, the method proposed in this manuscript has the same goal, namely, issuing a "fail" result for climate-changing modifications in the source code or the computing environment and issuing a "pass" result for non-climate-changing situations. However, the discussion on acceptable and unacceptable initial perturbations in Section 3.3 gives the impression that a CESM simulation is considered a solution of an initial condition problem. In my understanding, initial perturbations on the order of 10E-6 K are unlikely to cause a change in the long-term climate unless the perturbations have some special features and hit some special parts of the code that eventually push the model climate into a different equilibrium. Along the line of the “unacceptable initial conditions”, if one conducts a set of CESM simulations using the unmodified code and in a trusted HPC system but using initial conditions written out on different days or different years of a previously performed trusted simulation, would this new set of simulations be given a “fail” by the proposed test? If the answer is “yes”, then does this mean the test results provided by the proposed method can have a very high false positive rate?
- It will be very helpful to put the proposed method into the context of existing work and further explain what is new and advantageous
The earlier papers by, e.g., Baker et al. (2015, 2016), Milroy et al. (2016, 2018), Mahajan et al. (2017, DOI: 10.1016/j.procs.2017.05.259), Mahajan et al. (2019, DOI: 10.1145/3324989.3325724) and Wan et al. (2017, DOI: 10.5194/gmd-10-537-2017) based their work on more traditional statistical methods and/or the theory of numerical analysis, while this work uses deep learning. What is the added value of using deep learning? How did the nonlinear and chaotic features of the Earth system affect the choice of deep learning methods? In Section 3.2, it is stated that bidirectional neural network models can better capture “context information”. What would be an example of “context information” when testing CESM simulations? Also, could there be a physics-based interpretation of the reconstruction error that is used as the key test metric in the proposed method?
It is stated in the introduction that the earlier studies by Baker et al. (2015, 2016) and Milroy et al. (2016, 2018) used homogeneous HPC systems for the CESM simulations while this work focuses on heterogeneous systems. Is it expected that the earlier methods are invalid for simulations performed on heterogeneous systems? In reverse, is it expected that the newly proposed method can work only for heterogeneous systems and not for homogeneous systems? If the answers to these two questions are “yes”, then what is special/different about the heterogeneous systems that makes the earlier methods invalid? An explanation from the perspective of solving the equations underlying models like CESM would greatly help the readers understand the value of the proposed method. My impression from reading the manuscript, however, is that both the earlier methods and the method proposed in this manuscript should in principle be applicable to both homogeneous and heterogeneous HPC systems. If that’s indeed the case, it will be useful to know whether the newly proposed method is particularly suitable for heterogeneous systems and why.
- More evidence is needed to demonstrate the trustworthiness of the proposed method. It is known that testing the consistency/correctness of ESM simulations is challenging, and that different methods can give the same or different answers to the pass-or-fail question, see, e.g., Milroy et al. (2018) for multiple examples and this link (https://e3sm.org/can-we-switch-computers-an-application-of-e3sm-climate-reproducibility-tests/) for a real-life application of a different set of methods. Assuming my understanding is correct that the proposed method is meant to be used for assessing the simulated long-term climate using very short simulations and the method is expected to be applicable to both homogeneous and heterogeneous multi-core HPC systems, I would suggest the following:
More evidence is needed to demonstrate that for cases which have been unambiguously identified in the literature to be climate-changing or non-climate-changing, the newly proposed method gives the same pass/fail results as reported in the literature. Admittedly, two examples in this category are presented in the manuscript, namely the value changes for the uncertain parameters c0_lnd and c0_ocean, but more examples would make the manuscript more convincing. (BTW, I suppose the “pass” listed in each row of the rightmost column of Table 5 should be “fail” instead.)
For the cases labeled with “unknown outcome”, the manuscript should demonstrate that the proposed method (which uses CESM output after “24 time steps”) gives the same pass/fail results as the conclusions drawn by an independent assessment either using a CESM-ECT method or using experts’ judgement based on multi-year or multi-decade simulations. The current manuscript presents 9 cases of unknown outcome (Table 6). It is unclear whether the purpose of showing this many cases in the “unknown outcome” category is to discuss certain features of the proposed method or to demonstrate some applications of the method.
If there are cases where the pass or fail determined by the proposed method is different from the conclusion in the literature or an experts’ judgement, then an explanation for the discrepancy will provide very useful guidance for potential users of the proposed method. - There are significant gaps in the description of the proposed method. The parameters listed in Table 3 are not defined in the text, and it is unclear how they enter the equations or algorithms of the deep learning model. The application of the method to the 5VCCM used a training ensemble of 151 members, a simulation length of 1000 time steps, a test ensemble size of 40, and a threshold passing rate of 90% for an overall “fail” (lines 140-146). How were these numbers determined? The CESM simulations used output at “24 time steps” (are these atmosphere model time steps of 30 minutes each?) with the coupling to the ocean occurring 8 times per day. The ensemble sizes for training, validation, and testing were 120, 40, and 40, respectively. The threshold pass rate for issuing a “fail” was 90%. How where these numbers selected? If a reader is interested in applying the proposed method to a different ESM, which of the above-mentioned details need to be revised?
- The scientific presentation can benefit from significant improvements. The sequence in which contents are currently presented is awkward at many places. Here only two examples are mentioned, but hopefully a systematic review and revision can be done by the authors: The proof-of-concept presented using the 5VCCM from line 116 to line 151 should be moved to somewhere in Section 3 after the details of the new testing method has been described. The CESM code version, compset, resolution etc. should be clarified before the first mention of model time step, coupling time step etc.
Again, I think the study is worthwhile, but a clearer and more compelling presentation is needed for publication in GMD.
Citation: https://doi.org/10.5194/gmd-2024-10-RC1
AC2: 'Reply on RC1', Yangyang Yu, 23 Apr 2024
This manuscript presents an effort to address the need to determine the correctness (i.e., consistency with trusted references) of Earth system model (ESM) simulations conducted in new computing environments. The methodology is based on deep learning, and the primary focus is the impact of using heterogeneous many-core computer systems. I have very limited knowledge in deep learning and computer hardware, and hence will leave it to the other referee(s) and the Handling Editor to assess the manuscript in those respects. My comments below are from the perspective of an Earth system model developer who has encountered the challenge of correctness/consistency testing in their own work.
Overall, I think the study has made a worthwhile attempt to address an important need in ESM development, and I look forward to new, efficient, and objective methods that help fulfill the need. The manuscript, on the other hand, needs substantial improvements to make the proposed method more convincing and the presentation easier to follow. Below are my major comments.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We agree that the comments are very constructive for improving the presentation of the MS, and all the major comments and points have been fully addressed in the revision. Specifically, in the revision, we have added:
1) more detailed descriptions of the advantages of the deep learning method; 2) the experiments of modifications with statistically distinguishable outputs; 3) the description of hyperparameters; 4) the experiments of determining the ensemble size of datasets and model optimal parameters; 5) the description of CESM code version, compset, resolution, etc.
Point-by-point replies follow.
1. Some conceptual clarifications are needed regarding the scope and the underlying assumptions of the proposed method. My understanding is that the work by Baker et al. (2015) and the follow-up studies cited as references in this manuscript aimed at assessing the mean climate simulated by ESMs like the CESM. Therein, Earth system modeling was viewed as a boundary condition problem even though the action taken to obtain the climate statistics was to numerically integrate a set of time evolution equations. The work by Milroy et al. (2018) and other studies that developed “ultra-fast” tests using short simulations aimed at early detections of potential changes in the long-term averages of the model state. Supposedly, the method proposed in this manuscript has the same goal, namely, issuing a "fail" result for climate-changing modifications in the source code or the computing environment and issuing a "pass" result for non-climate-changing situations. However, the discussion on acceptable and unacceptable initial perturbations in Section 3.3 gives the impression that a CESM simulation is considered a solution of an initial condition problem. In my understanding, initial perturbations on the order of 10E-6 K are unlikely to cause a change in the long-term climate unless the perturbations have some special features and hit some special parts of the code that eventually push the model climate into a different equilibrium. Along the line of the “unacceptable initial conditions”, if one conducts a set of CESM simulations using the unmodified code and in a trusted HPC system but using initial conditions written out on different days or different years of a previously performed trusted simulation, would this new set of simulations be given a “fail” by the proposed test? If the answer is “yes”, then does this mean the test results provided by the proposed method can have a very high false positive rate?
RE: Thank you for your comment. A new set of simulations with the unacceptable initial conditions would be given a “fail” by the proposed test, but this does not mean the test results have a very high false positive rate. CESM-ECT (Baker et al., 2015) aims to assess the simulated mean climate, whereas our method aims at detecting smaller-scale modifications using short simulations, determining whether or not the test simulation is statistically distinguishable from the original results. For our method, the response to modifications known to produce statistically distinguishable outputs should be a fail, and the response to modifications not expected to produce statistically distinguishable outputs should be a pass. Therefore, we have changed “climate-changing” to “statistically distinguishable output”. Please see Sections 4.3 and 4.4. Thanks.
2. It will be very helpful to put the proposed method into the context of existing work and further explain what is new and advantageous
The earlier papers by, e.g., Baker et al. (2015, 2016), Milroy et al. (2016, 2018), Mahajan et al. (2017, DOI: 10.1016/j.procs.2017.05.259), Mahajan et al. (2019, DOI: 10.1145/3324989.3325724) and Wan et al. (2017, DOI: 10.5194/gmd-10-537-2017) based their work on more traditional statistical methods and/or the theory of numerical analysis, while this work uses deep learning. What is the added value of using deep learning? How did the nonlinear and chaotic features of the Earth system affect the choice of deep learning methods?
RE: Thank you for your comment. PCA is a linear transformation method. It assumes linear relationships between variables, which hampers its application when the relationships are nonlinear. Deep learning is a nonlinear transformation approach that can handle high-dimensional, nonlinear, and complex data. The simulation results of Earth system models have nonlinear relationships, which can be analyzed using deep learning models. We have added a description of using deep learning to analyze the nonlinear features of ESM simulation results. Please see lines 82-84. We then show the performance of deep learning models in mining nonlinear features from coupled models: we compared the accuracy and computation time on CESM of the local outlier factor (LOF) machine learning method and ESM-DCT. The accuracy of ESM-DCT, 98.6%, is slightly better than that of LOF, 96.2%, when the accuracy on the testing datasets reaches its maximum value. The ESM-DCT also has higher computational efficiency than LOF. Please see Section 4.5. Thanks.
In Section 3.2, it is stated that bidirectional neural network models can better capture “context information”. What would be an example of “context information” when testing CESM simulations?
RE: We examine 97 variables from the atmosphere component results and 4 variables from the ocean component results. The global area-weighted mean is calculated for each variable, yielding a 101-dimensional vector. For a given variable, the context information refers to the data of the 100 surrounding variables in this vector. Bidirectional neural network models can capture the context information of variables in such sequence data. We have added descriptions of the context information and the sequence data. Please see lines 162-163 and 314-316. Thanks.
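To illustrate how such an input vector could be assembled, a minimal sketch (synthetic fields on a hypothetical grid; weights proportional to the cosine of latitude):

```python
# Sketch: build a per-run input vector of global area-weighted means
# (synthetic fields; hypothetical grid size).
import numpy as np

rng = np.random.default_rng(0)
nvars, nlat, nlon = 101, 96, 144
fields = rng.normal(size=(nvars, nlat, nlon))   # one 2-D field per variable

lat = np.linspace(-90.0, 90.0, nlat)
w = np.cos(np.deg2rad(lat))                     # area weights ~ cos(latitude)
w2d = np.broadcast_to(w[:, None], (nlat, nlon))

# Area-weighted global mean of each variable -> one 101-dimensional vector.
vec = (fields * w2d).sum(axis=(1, 2)) / w2d.sum()
print(vec.shape)                                # (101,)
```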
Also, could there be a physics-based interpretation of the reconstruction error that is used as the key test metric in the proposed method?
RE: The reconstruction error is a mathematical indicator, calculated as the MSE between the test data and the original data after feature analysis; it determines whether or not the test simulation is statistically distinguishable from the original results. A physics-based interpretation of the reconstruction error is a very good strategy for improving the interpretability of the deep learning model, and we will study it in future work. We have added our plan to develop a deep learning-based consistency test approach with a physics-based interpretation. Please see lines 442-443. Thanks.
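A hedged sketch of this test metric, assuming a trained autoencoder such as the BGRU-AE (the function and threshold names here are hypothetical):

```python
# Sketch: per-member reconstruction MSE as the consistency test metric
# (hypothetical names; `model` is any trained autoencoder, e.g. a BGRU-AE).
import torch

def member_passes(model: torch.nn.Module, x: torch.Tensor,
                  threshold: float = 0.05) -> bool:
    """True if a member's reconstruction MSE falls below the threshold."""
    model.eval()
    with torch.no_grad():
        xb = x.unsqueeze(0)                    # add batch dimension
        mse = torch.mean((model(xb) - xb) ** 2).item()
    return mse < threshold
```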
It is stated in the introduction that the earlier studies by Baker et al. (2015, 2016) and Milroy et al. (2016, 2018) used homogeneous HPC systems for the CESM simulations while this work focuses on heterogeneous systems. Is it expected that the earlier methods are invalid for simulations performed on heterogeneous systems? In reverse, is it expected that the newly proposed method can work only for heterogeneous systems and not for homogeneous systems? If the answers to these two questions are “yes”, then what is special/different about the heterogeneous systems that makes the earlier methods invalid? An explanation from the perspective of solving the equations underlying models like CESM would greatly help the readers understand the value of the proposed method. My impression from reading the manuscript, however, is that both the earlier methods and the method proposed in this manuscript should in principle be applicable to both homogeneous and heterogeneous HPC systems. If that’s indeed the case, it will be useful to know whether the newly proposed method is particularly suitable for heterogeneous systems and why.
RE: Thank you for your comment. We developed a deep learning-based consistency test approach for Earth system models and applied it on heterogeneous many-core systems. There are few consistency tests for heterogeneous many-core systems, although the computational perturbations caused by heterogeneous hardware designs should not, by default, produce statistically distinguishable outputs. Our experiments point out that the response of a consistency test to heterogeneous computing should be a pass, i.e., the influence of the heterogeneous computational perturbations should be accepted. Our method addresses these issues in the presence of heterogeneous hardware architectures. Please see lines 104-111. Thanks.
3. More evidence is needed to demonstrate the trustworthiness of the proposed method. It is known that testing the consistency/correctness of ESM simulations is challenging, and that different methods can give the same or different answers to the pass-or-fail question, see, e.g., Milroy et al. (2018) for multiple examples and this link (https://e3sm.org/can-we-switch-computers-an-application-of-e3sm-climate-reproducibility-tests/) for a real-life application of a different set of methods. Assuming my understanding is correct that the proposed method is meant to be used for assessing the simulated long-term climate using very short simulations and the method is expected to be applicable to both homogeneous and heterogeneous multi-core HPC systems, I would suggest the following:
More evidence is needed to demonstrate that for cases which have been unambiguously identified in the literature to be climate-changing or non-climate-changing, the newly proposed method gives the same pass/fail results as reported in the literature. Admittedly, two examples in this category are presented in the manuscript, namely the value changes for the uncertain parameters c0_lnd and c0_ocean, but more examples would make the manuscript more convincing. (BTW, I suppose the “pass” listed in each row of the rightmost column of Table 5 should be “fail” instead.)
RE: Thanks for the good suggestion. We have added experiments with modifications that produce statistically distinguishable outputs: we evaluated the consistency of simulations on the heterogeneous many-core systems after changing the parameters sol_factb_interstitial, sol_factic_interstitial, and cldfrc_rhminl. The results are shown in Table 6. We have also corrected the table; the results should be “fail”. Please see Table 6. Thanks.
For the cases labeled with “unknown outcome”, the manuscript should demonstrate that the proposed method (which uses CESM output after “24 time steps”) gives the same pass/fail results as the conclusions drawn by an independent assessment either using a CESM-ECT method or using experts’ judgement based on multi-year or multi-decade simulations. The current manuscript presents 9 cases of unknown outcome (Table 6). It is unclear whether the purpose of showing this many cases in the “unknown outcome” category is to discuss certain features of the proposed method or to demonstrate some applications of the method. If there are cases where the pass or fail determined by the proposed method is different from the conclusion in the literature or an experts’ judgement, then an explanation for the discrepancy will provide very useful guidance for potential users of the proposed method.
RE: Thanks for the good suggestion. The experiments with modifications of unknown outcome are prediction applications on the new Sunway system. After completing model training and testing, we expect that ESM-DCT can help predict results for new data with unknown-outcome modifications on heterogeneous systems. The result shows that the -O3 compiler optimization option passes the test, which gives us confidence to enable level-three optimizations on the new Sunway system. Our tool can also serve as a rapid method for detecting correctness in mixed-precision programming, helping ESMs benefit from reducing the precision of certain variables on heterogeneous many-core HPC systems. We have revised the descriptions of the prediction datasets. Please see lines 392-393. Thanks.
4. There are significant gaps in the description of the proposed method. The parameters listed in Table 3 are not defined in the text, and it is unclear how they enter the equations or algorithms of the deep learning model.
RE: Thanks for the good suggestion. We have added a description of the hyperparameters listed in the table. Please see lines 321-324. Thanks.
The application of the method to the 5VCCM used a training ensemble of 151 members, a simulation length of 1000 time steps, a test ensemble size of 40, and a threshold passing rate of 90% for an overall “fail” (lines 140-146). How were these numbers determined?
RE: Thanks for the good suggestion. We have added descriptions of the ensemble experiment parameters to the text. For the ensemble sizes of the datasets, we ran two simulations of 2000 time steps each: one with no initial condition perturbation and one with an initial perturbation of O(10−14). Figure 5 shows that 2000 time steps provide sufficient variability of the atmosphere and ocean variables to determine statistical distinguishability, so we use the simulation results at time step 2000 as the input data of the ESM-DCT. Please see lines 235-240. The accuracy on the testing datasets is then used to select the training-set ensemble size and the optimal model parameters of the ESM-DCT. The accuracy on the testing datasets for different ensemble sizes is shown in Table 1; following Table 1, the training-set ensemble size of the ESM-DCT is 120, at which the accuracy on the testing datasets reaches its maximum value. Please see lines 241-249. Thanks.
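A minimal sketch of such a control/perturbed run pair (toy stand-in dynamics, not 5VCCM or CESM; only the O(10−14) perturbation magnitude and the 2000-step length come from the reply above):

```python
# Sketch: a control run vs. a run with an O(1e-14) initial perturbation
# (toy stand-in dynamics, not 5VCCM or CESM).
import numpy as np

def integrate(x0, nsteps, step):
    x = x0.copy()
    for _ in range(nsteps):
        x = step(x)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=5)                     # e.g., a 5-variable state vector
step = lambda x: x + 0.01 * np.tanh(x)      # stand-in nonlinear dynamics

control = integrate(x0, 2000, step)
perturbed = integrate(x0 + 1e-14 * rng.normal(size=5), 2000, step)
print(np.abs(control - perturbed).max())    # spread after 2000 time steps
```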
The CESM simulations used output at “24 time steps” (are these atmosphere model time steps of 30 minutes each?) with the coupling to the ocean occurring 8 times per day.
RE: Thanks for the good suggestion. The time integration step size is 30 min. We have added a description of the time integration step size to the text. Please see line 269. Thanks.
The ensemble sizes for training, validation, and testing were 120, 40, and 40, respectively. The threshold pass rate for issuing a “fail” was 90%. How were these numbers selected? If a reader is interested in applying the proposed method to a different ESM, which of the above-mentioned details need to be revised?
RE: Thank you for your comment. We select the training-set ensemble size as the one that maximizes the accuracy on the testing datasets. The ratio of training, validation, and testing datasets in the ESM-DCT is 6:2:2. The accuracy of the ESM-DCT for different training-set ensemble sizes is shown in Table 3. Following Table 3, the ensemble sizes of the training, validation, and testing sets are 120, 40, and 40. At this size, the accuracy on test datasets with modifications known to produce statistically distinguishable and indistinguishable outputs reaches its maximum value of 98.6% in the BGRU-AE model. We have added descriptions of the ensemble experiment parameters to the text. If the passing rate is less than 90%, the tool issues an overall “failure”; this setting yields the 98.6% accuracy on the test datasets. Please see lines 326-330. Thanks.
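To make the decision rule explicit, a minimal sketch (hypothetical names; the per-member errors are reconstruction MSEs as described above):

```python
# Sketch: ensemble-level verdict -- overall "failure" when fewer than 90%
# of test members pass individually (hypothetical names).
def overall_result(member_errors, threshold=0.05, min_pass_rate=0.9):
    passes = [e < threshold for e in member_errors]
    pass_rate = sum(passes) / len(passes)
    return "pass" if pass_rate >= min_pass_rate else "failure"

print(overall_result([0.01, 0.02, 0.04, 0.03]))   # pass
print(overall_result([0.01, 0.20, 0.30, 0.40]))   # failure
```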
5. The scientific presentation can benefit from significant improvements. The sequence in which contents are currently presented is awkward at many places. Here only two examples are mentioned, but hopefully a systematic review and revision can be done by the authors: The proof-of-concept presented using the 5VCCM from line 116 to line 151 should be moved to somewhere in Section 3 after the details of the new testing method has been described. The CESM code version, compset, resolution etc. should be clarified before the first mention of model time step, coupling time step etc.
RE: Thanks for the good suggestion. We have moved the description of the 5VCCM experiments to Section 3.4; please see lines 212-249. We have also added descriptions of the CESM code version, compset, and resolution; please see lines 259-264. Thanks.
Again, I think the study is worthwhile, but a clearer and more compelling presentation is needed for publication in GMD.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We agree that the comments are very constructive for improving the presentation of the MS, and all the major comments and points have been fully addressed in the revision.
Citation: https://doi.org/10.5194/gmd-2024-10-AC2
AC4: 'Reply on RC1', Yangyang Yu, 24 May 2024
This manuscript presents an effort to address the need to determine the correctness (i.e., consistency with trusted references) of Earth system model (ESM) simulations conducted in new computing environments. The methodology is based on deep learning, and the primary focus is the impact of using heterogeneous many-core computer systems. I have very limited knowledge in deep learning and computer hardware, and hence will leave it to the other referee(s) and the Handling Editor to assess the manuscript in those respects. My comments below are from the perspective of an Earth system model developer who has encountered the challenge of correctness/consistency testing in their own work.
Overall, I think the study has made a worthwhile attempt to address an important need in ESM development, and I look forward to new, efficient, and objective methods that help fulfill the need. The manuscript, on the other hand, needs substantial improvements to make the proposed method more convincing and the presentation easier to follow. Below are my major comments.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We agree that the comments are very constructive for improving the presentation of the MS, and all the major comments and points have been fully addressed in the revision. Specifically, in the revision, we have added:
1) more detailed descriptions of the advantages of the deep learning method; 2) the experiments of modifications with statistically distinguishable outputs; 3) the descriptions of hyperparameters; 4) the experiments of determining the ensemble size of datasets and model optimal parameters; 5) the descriptions of CESM code version, compset, resolution, etc.
Point-by-point replies follow.
1. Some conceptual clarifications are needed regarding the scope and the underlying assumptions of the proposed method. My understanding is that the work by Baker et al. (2015) and the follow-up studies cited as references in this manuscript aimed at assessing the mean climate simulated by ESMs like the CESM. Therein, Earth system modeling was viewed as a boundary condition problem even though the action taken to obtain the climate statistics was to numerically integrate a set of time evolution equations. The work by Milroy et al. (2018) and other studies that developed “ultra-fast” tests using short simulations aimed at early detections of potential changes in the long-term averages of the model state. Supposedly, the method proposed in this manuscript has the same goal, namely, issuing a "fail" result for climate-changing modifications in the source code or the computing environment and issuing a "pass" result for non-climate-changing situations. However, the discussion on acceptable and unacceptable initial perturbations in Section 3.3 gives the impression that a CESM simulation is considered a solution of an initial condition problem. In my understanding, initial perturbations on the order of 10E-6 K are unlikely to cause a change in the long-term climate unless the perturbations have some special features and hit some special parts of the code that eventually push the model climate into a different equilibrium. Along the line of the “unacceptable initial conditions”, if one conducts a set of CESM simulations using the unmodified code and in a trusted HPC system but using initial conditions written out on different days or different years of a previously performed trusted simulation, would this new set of simulations be given a “fail” by the proposed test? If the answer is “yes”, then does this mean the test results provided by the proposed method can have a very high false positive rate?
RE: Thank you for your good comment. We have removed the testing datasets of “unacceptable initial conditions” and focused on early detection of potential changes in the long-term averages of the model state. Please see Section 4.4. Thanks.
2. It will be very helpful to put the proposed method into the context of existing work and further explain what is new and advantageous
The earlier papers by, e.g., Baker et al. (2015, 2016), Milroy et al. (2016, 2018), Mahajan et al. (2017, DOI: 10.1016/j.procs.2017.05.259), Mahajan et al. (2019, DOI: 10.1145/3324989.3325724) and Wan et al. (2017, DOI: 10.5194/gmd-10-537-2017) based their work on more traditional statistical methods and/or the theory of numerical analysis, while this work uses deep learning. What is the added value of using deep learning? How did the nonlinear and chaotic features of the Earth system affect the choice of deep learning methods?
RE: Thank you for your comment. Unsupervised deep learning models have become widely used data mining methods and have been explored for anomaly detection. Such methods can efficiently and objectively identify whether or not data differ from the original status, and model verification and simulation result analysis is an excellent application scenario for deep learning. We have added descriptions of the advantages of the deep learning method. Please see lines 75-78. We then show the performance of deep learning models in mining nonlinear features from coupled models: we compared the accuracy and computation time on CESM of the local outlier factor (LOF) machine learning method, CESM-ECT, and ESM-DCT. The accuracy of ESM-DCT, 98.4%, is slightly better than that of LOF, 96.2%, and that of CESM-ECT, 98.2%, when the accuracy on the testing datasets reaches its maximum value. The ESM-DCT also has higher computational efficiency than LOF and CESM-ECT. Please see Section 4.5. Thanks.
In Section 3.2, it is stated that bidirectional neural network models can better capture “context information”. What would be an example of “context information” when testing CESM simulations?
RE: We examine 97 variables from the atmosphere component. The global area-weighted mean is calculated for each variable, and the means are assembled into a 97-dimensional vector. The context information refers to the data of the other 96 variables surrounding a given variable in this sequence. The bidirectional neural network models can capture such context information of variables in the sequence data. We have added the descriptions of the context information and sequence data. Please see lines 271-274. Thanks.
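To make the sequence framing concrete, here is a minimal sketch of a bidirectional GRU autoencoder of the kind described, written in PyTorch; the layer sizes, latent width, and overall layout are our illustrative assumptions rather than the authors' exact architecture.

```python
# Minimal BGRU-AE sketch (PyTorch): the 97 global-mean variables are
# treated as a sequence, and the bidirectional encoder lets each position
# see its "context", i.e. the variables before and after it.
import torch
import torch.nn as nn

class BGRUAE(nn.Module):
    def __init__(self, n_features=1, hidden=32, latent=8):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True,
                              bidirectional=True)
        self.to_latent = nn.Linear(2 * hidden, latent)
        self.decoder = nn.GRU(latent, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):              # x: (batch, 97, n_features)
        h, _ = self.encoder(x)         # (batch, 97, 2*hidden)
        z = self.to_latent(h)          # (batch, 97, latent)
        d, _ = self.decoder(z)         # (batch, 97, hidden)
        return self.out(d)             # reconstruction, same shape as x

model = BGRUAE()
x = torch.randn(4, 97, 1)              # 4 ensemble members, 97 variables
recon = model(x)
loss = nn.functional.mse_loss(recon, x)   # reconstruction error (MSE)
```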
Also, could there be a physics-based interpretation of the reconstruction error that is used as the key test metric in the proposed method?
RE: The reconstruction error is a mathematical indicator, calculated as the mean squared error (MSE) between the test data and their reconstructions after feature analysis; it determines whether or not the test dataset simulation is statistically distinguishable from the original results. A physics-based interpretation of the reconstruction error would be a valuable improvement for increasing the interpretability of the deep learning model, and we will study it in our future work. We have added the plan of developing the deep learning-based consistency test approach with physics-based interpretation. Please see lines 395-396. Thanks.
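As a concrete rendering of this definition, a minimal numpy sketch follows; the array shapes and function name are illustrative assumptions.

```python
# The key metric as described: per-member MSE between each test member
# and its autoencoder reconstruction (shapes and names illustrative).
import numpy as np

def reconstruction_error(x, x_recon):
    # x, x_recon: (n_members, n_variables); returns one error per member
    return np.mean((x - x_recon) ** 2, axis=1)
```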
It is stated in the introduction that the earlier studies by Baker et al. (2015, 2016) and Milroy et al. (2016, 2018) used homogeneous HPC systems for the CESM simulations while this work focuses on heterogeneous systems. Is it expected that the earlier methods are invalid for simulations performed on heterogeneous systems? In reverse, is it expected that the newly proposed method can work only for heterogeneous systems and not for homogeneous systems? If the answers to these two questions are “yes”, then what is special/different about the heterogeneous systems that makes the earlier methods invalid? An explanation from the perspective of solving the equations underlying models like CESM would greatly help the readers understand the value of the proposed method. My impression from reading the manuscript, however, is that both the earlier methods and the method proposed in this manuscript should in principle be applicable to both homogeneous and heterogeneous HPC systems. If that’s indeed the case, it will be useful to know whether the newly proposed method is particularly suitable for heterogeneous systems and why.
RE: Thank you for your comment. In this study, we focus on developing a deep learning-based consistency test approach for Earth system models. The uncertainties caused by heterogeneous hardware designs should be accepted in evaluating the consistency. We adopt heterogeneous computing as one type of testing dataset, regarded as a modification known to produce statistically indistinguishable outputs; the test datasets of the other types all use homogeneous computing. We have revised the descriptions of the role of heterogeneous computing in this study. Please see lines 97-102. We also have revised the descriptions of the datasets. Please see Table 2. Thanks.
3. More evidence is needed to demonstrate the trustworthiness of the proposed method. It is known that testing the consistency/correctness of ESM simulations is challenging, and that different methods can give the same or different answers to the pass-or-fail question; see, e.g., Milroy et al. (2018) for multiple examples and this link (https://e3sm.org/can-we-switch-computers-an-application-of-e3sm-climate-reproducibility-tests/) for a real-life application of a different set of methods. Assuming my understanding is correct that the proposed method is meant to be used for assessing the simulated long-term climate using very short simulations and the method is expected to be applicable to both homogeneous and heterogeneous multi-core HPC systems, I would suggest the following:
More evidence is needed to demonstrate that for cases which have been unambiguously identified in the literature to be climate-changing or non-climate-changing, the newly proposed method gives the same pass/fail results as reported in the literature. Admittedly, two examples in this category are presented in the manuscript, namely the value changes for the uncertain parameters c0_lnd and c0_ocean, but more examples would make the manuscript more convincing. (BTW, I suppose the “pass” listed in each row of the rightmost column of Table 5 should be “fail” instead.)
RE: Thanks for the good suggestion. We have added experiments with modifications known to produce statistically distinguishable outputs: we evaluate the consistency of simulations on the HPC systems after changing the parameters sol_factb_interstitial, sol_factic_interstitial, and cldfrc_rhminl. The results are shown in Table 6. We have also corrected the table; the results should be “fail”. Please see Table 6. Thanks.
For the cases labeled with “unknown outcome”, the manuscript should demonstrate that the proposed method (which uses CESM output after “24 time steps”) gives the same pass/fail results as the conclusions drawn by an independent assessment, either using a CESM-ECT method or using experts’ judgement based on multi-year or multi-decade simulations. The current manuscript presents 9 cases of unknown outcome (Table 6). It is unclear whether the purpose of showing this many cases in the “unknown outcome” category is to discuss certain features of the proposed method or to demonstrate some applications of the method. If there are cases where the pass or fail determined by the proposed method differs from the conclusion in the literature or an experts’ judgement, then an explanation for the discrepancy will provide very useful guidance for potential users of the proposed method.
RE: Thanks for the good suggestion. After the ESM-DCT tool is constructed, we expect it to provide guidance for predicting the consistency results of new ESM configurations, to better understand and improve the tool. The prediction datasets are used to show the applications of the ESM-DCT. For example, the result shows that the effect of the -O3 compiler optimization option is positive, which provides the confidence to choose level-three optimizations on the new Sunway system. ESMs using mixed-precision programming must also assess the simulation outputs to ensure that the results do not change the climate; our tool can serve as a rapid correctness-detection method for mixed-precision programming, helping ESMs benefit from a reduction of the precision of certain variables on the HPC systems. We have revised the descriptions of the prediction datasets. Please see Section 4.6. Thanks.
4. There are significant gaps in the description of the proposed method. The parameters listed in Table 3 are not defined in the text, and it is unclear how they enter the equations or algorithms of the deep learning model.
RE: Thanks for the good suggestion. We have added the description of hyperparameters listed in the table. Please see lines 279-282. Thanks.
The application of the method to the 5VCCM used a training ensemble of 151 members, a simulation length of 1000 time steps, a test ensemble size of 40, and a threshold passing rate of 90% for an overall “fail” (lines 140-146). How were these numbers determined?
RE: Thanks for the good suggestion. We have added the description of the parameters of the ensemble experiments to the text. For the ensemble sizes of the datasets, we run two simulations of 2000 time steps each: one with no initial condition perturbation and one with a perturbation of O(10^-14) to the initial conditions. Figure 5 shows that choosing 2000 time steps can provide sufficient variability of atmosphere and ocean variables to determine statistical distinguishability. Therefore, we use the simulation results at the 2000th time step as the input data of the ESM-DCT for 5VCCM. Please see lines 205-210. The accuracy on the testing datasets is then used to adjust the ensemble size of the training sets and the optimal model parameters of the ESM-DCT. The accuracy of the testing datasets with different ensemble sizes is shown in Table 1. Following Table 1, the ensemble size of the training datasets of the ESM-DCT for 5VCCM is 120, at which the accuracy of the testing datasets reaches its maximum value. The ratio of training, validation, and testing datasets is also 6:2:2. Please see lines 211-219. Thanks.
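As a rough illustration of how such a perturbed ensemble can be generated, here is a minimal numpy sketch; the field, perturbation form, and sizes are our assumptions, not the actual 5VCCM/CESM setup (CESM applies an analogous random perturbation to the initial temperature via its pertlim mechanism).

```python
# Hypothetical sketch: build an ensemble by applying O(1e-14) relative
# random perturbations to an initial temperature field. Placeholder data.
import numpy as np

rng = np.random.default_rng(42)
T0 = 280.0 + rng.standard_normal((96, 144))      # placeholder field [K]

def perturbed_member(T0, magnitude=1e-14):
    # Uniform relative perturbation in [-magnitude, +magnitude],
    # analogous in spirit to CAM's pertlim namelist option.
    return T0 * (1.0 + magnitude * rng.uniform(-1.0, 1.0, T0.shape))

ensemble = [perturbed_member(T0) for _ in range(120)]   # 120 members
```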
The CESM simulations used output at “24 time steps” (are these atmosphere model time steps of 30 minutes each?) with the coupling to the ocean occurring 8 times per day.
RE: Thanks for the good suggestion. The time integration step size is 30 min. We have added the description of the parameters of the time integration step size into the text. Please see line 230. Thanks.
The ensemble sizes for training, validation, and testing were 120, 40, and 40, respectively. The threshold pass rate for issuing a “fail” was 90%. How were these numbers selected? If a reader is interested in applying the proposed method to a different ESM, which of the above-mentioned details need to be revised?
RE: Thank you for your comment. We select the ensemble size of the training sets at which the accuracy on the testing datasets reaches its maximum value. The ratio of training, validation, and testing datasets of the ESM-DCT is 6:2:2. The accuracy of the ESM-DCT for CESM with different ensemble sizes of training datasets is shown in Table 3. Following Table 3, the ensemble sizes of the training, validation, and testing sets are 120, 40, and 40. At this size, the accuracy on test datasets with modifications known to produce statistically distinguishable and indistinguishable outputs in the BGRU-AE model reaches its maximum value of 98.4%. We have added the description of the parameters of the ensemble experiments to the text. If the passing rate is equal to 0%, the tool issues an overall “failure”, which indicates that the simulation results of the training and testing datasets are statistically distinguishable. Please see lines 285-288. Thanks.
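To make the pass/fail bookkeeping concrete, a minimal sketch of one plausible decision rule follows; the threshold (0.05) and required rate (90%) come from the text, while the exact mapping from per-member results to an overall verdict is our reading, and all names are illustrative.

```python
# Hypothetical ensemble verdict logic: a member passes if its
# reconstruction error is below the threshold; the overall verdict
# depends on the fraction of passing members.
import numpy as np

def overall_verdict(errors, threshold=0.05, required_rate=0.9):
    passing_rate = np.mean(np.asarray(errors) < threshold)
    verdict = "pass" if passing_rate >= required_rate else "fail"
    return verdict, passing_rate

verdict, rate = overall_verdict([0.01, 0.03, 0.02, 0.20])
print(verdict, f"{rate:.0%}")   # fail 75%
```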
5. The scientific presentation can benefit from significant improvements. The sequence in which contents are currently presented is awkward in many places. Here only two examples are mentioned, but hopefully a systematic review and revision can be done by the authors: The proof-of-concept presented using the 5VCCM from line 116 to line 151 should be moved to somewhere in Section 3 after the details of the new testing method have been described. The CESM code version, compset, resolution etc. should be clarified before the first mention of model time step, coupling time step etc.
RE: Thanks for the good suggestion. We have moved the description of experiments about 5VCCM to Section 3.4. Please see lines 182-219. Thanks. Then, we have added the description of CESM code version, compset, resolution. Please see lines 227-231. Thanks.
Again, I think the study is worthwhile, but a clearer and more compelling presentation is needed for publication in GMD.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We all agree that the comments are very constructive for us to improve the presentation of the MS, and all the major comments and points have been fully addressed in the revision.
Citation: https://doi.org/10.5194/gmd-2024-10-AC4
-
RC2: 'Comment on gmd-2024-10', Anonymous Referee #2, 11 Mar 2024
A Deep Learning-Based Consistency Test Approach for Earth System Models on Heterogeneous Many-Core Systems
Yangyang Yu, Shaoqing Zhang, Haohuan Fu, Dexun Chen, Yang Gao, Xiaopei Lin, Zhao Liu, Xiaojing
General comments:
This manuscript presents an approach for evaluating software correctness for Earth System Models using an ensemble and deep learning.
1. The general idea is promising, but a lot of the algorithm decisions are not well supported. In general, very substantial improvements are needed for GMD. The authors need to justify unsubstantiated claims and statements that are not correct or misleading.
2. The paper needs a good editing pass. There are many grammatical errors and awkward phrases (a few of which I mention below).
3. Some of the information about other authors’ work is inaccurate and other related work has been left out (listed later in this report). The Introductory section, in particular, needs revision.
Specific comments:
1. Line 58: “Evaluating the scientific consistency is a commonly used method for model verification in the form of quality assurance.”
I am not sure that I agree this is a “commonly-used” method (consistency) - I don’t think this phrasing is found prior to the [Baker 2015]. Also I don’t believe the authors here ever define what they mean by consistency.
2. Line 59: “For example, for detecting the influences of hardware environment changes, data from a model simulation of several hundred years (typically 400) on the new machine is analyzed and compared to data from the same simulation on a trusted machine by climate scientists (Baker et al., 2015). Then, the CESM ensemble-based consistency test (CESM-ECT) is used to compare the new simulations against the control ensemble from the trusted machine (Baker et al., 2015; Milroy et al., 2016; Baker et al., 2016; Milroy et al., 2018).”
This description is not accurate. [Baker 2015] explains that the 400-year test is the approach that was used prior to the CESM-ECT development. In fact, [Baker 2015] states that it was the motivation for a more objective approach. The CESM-ECT does not involve the 400 years and is a separate thing.
3. Line 64: “2018). However, all the methods mentioned above focus on homogeneous multi-core HPC systems.”
This statement is not accurate. As far as I can tell, there is nothing specific to homogeneous machines in these works. In fact, [Milroy 2018] specifically mentions heterogeneous computing environments as a motivation for the work.
4. Line 71: “However, the ultra-fast tests in the CESM-ECT are applied for evaluating the scientific consistency on the Community Atmosphere Model ” … “There is a lack of a method to analyze short-time simulation results of multi-components”
The [Milroy 2018] work also tests CLM (land model). It explains how land and atmosphere are tightly coupled and CLM modifications can be detected in CAM variables. And gives experimental results.
Consider also that ocean scales are much larger than atmospheric scales, hence the need for more time for changes to propagate through the ocean model from a single perturbed variable.
5. Line 74: “Besides, the principal component analysis (PCA) in the CESM-ECT is just applied to exploring linear patterns contained in the confusing datasets”
This statement does not make sense to me. Please clarify what is meant. PCA detects changes in relationships between variables. They don’t have to be linear relationships for change to be detected.
6. Line 75: ”Facing with the non-linear relationship generated by the combination of multi-component data, there is an urgent need of data analysis methods for non-linear transformation, such as deep learning models.”
This statement is just unclear. The atmosphere model by itself is nonlinear and chaotic - so I’m not sure what is meant by multi-component combinations - is the implication that this is more non-linear? Please clarify. Also “facing with the …” is awkward phrasing.
7. Line 80: The phrase “unavoidable computational perturbations” is used frequently (also line 100, line 353), and it’s a bit awkward. It seems that it should be clarified that these are “numerical differences” due to using finite-precision and changing the order of operations and precision due to the changes in architecture. This is not really explained.
8. Line 81: “The ESM-DCT is applied to evaluate whether or not a new CESM configuration in the scenario of mixed perturbations composed of the inevitable computational perturbations and software or human errors in the heterogeneous computing is consistent with the original “trusted” configuration in the homogeneous computing.”
This phrase is unclear. Aren’t you trying to figure out (i.e. differentiate) whether a difference in output is due to numerical round-off OR software/human error?
9. Line 103: “The key challenge is designing a tool to evaluate the scientific consistency, which can remove the influences of heterogeneous perturbations”
The authors have definitely implied that the CESM-ECT tools do not do this, which is untrue. This new method/approach is still a valid contribution, and there is no need to misrepresent other previous work.
10. Line 108: “However, the principal component analysis (PCA) in the CESM-ECT is just applied to exploring linear patterns contained in the confusing datasets (Liu et al., 2009). Facing with the non-linear relationship generated by the combination of multi-component data in the coupled numerical model, …”
This is the exact same phrase as in line 74 and it still does not make sense (see above comment).
11. Line 116: I am confused by the last bit of section 2, regarding the 5-variable conceptual coupled model and its results. It seems that the purpose is to show that the DCT approach is better than the ECT approach. This is quite hard to judge on the information given here. Most choices are not justified. Note that there is little point in a PCA-approach with only 5 variables. (Which is why the CESM-ECT for ocean does not use PCA. Though the authors cite that work, it is unclear that they are familiar with it. Also the ocean variables are not globally averaged.) Also why the choice of 151 ensemble members for 1000 timesteps? [Baker 2015] uses 151 timesteps for yearly averages. [Milroy 2018] shows that more are needed for shorter time scales. Why does the DCT look at 40? Are you aiming for some false positive rate? The choices here seem quite arbitrary, and it’s unclear why this “experiment” is in the background section. [Say more?]
12. Line 212: “Based on the ultra-fast tests”
Which ultra-fast tests are you referring to? Also I am skeptical about the robustness of bugs being detected in only a couple time steps for the ocean.
13. Line 216 - 217: I find it odd to simply increase the frequency of the ocn/atmosphere coupling. This change affects the nature of the model and model output and certainly needs justification as to why it is acceptable. Also how did you choose 24 time steps as an appropriate time slice to evaluate?
14. Line 236: “...clearly inconsistent…”. I don’t believe that there has been a definition provided as to what is meant by consistency for this new approach.
15. Lines 236-237: “the testing datasets with unacceptable CESM model parameter adjustments are with the O(10^-14) perturbations of initial atmospheric temperature”. I don’t understand this - why is O(10^-14) unacceptable?
16. Line 246: Why the decision to do a spatial mean for the ocean variables? It seems that this could be problematic given the much higher spatial variability in the ocean compared to the atmosphere.
17. Line 258 ”.. because the variables have vastly different units and magnitudes.”
Matches the phrase used in [Baker 2015]: “..because the CAM variables have vastly different units and magnitudes.”
18. Line 26: “Software correctness verification in the form of quality assurance”
Line 59: “verification in the form of quality assurance”
Line 104: “for software verification in the form of quality assurance”. This particular phrase comes from the abstract of [Baker 2015]: “software verification in the form of quality assurance”. Consider rephrasing or quoting.
19. Line 268: Why using CESM version 1.3? That is quite old and not an official release: https://www.cesm.ucar.edu/models/releases
20. Line 310: What are c0_lnd and c0_ocn? Why did you choose them? (There are a couple of DOE-authored studies with CAM5 that mention these for the ZM scheme, but none are cited here.) Also, I know these are later coarsely defined in Table 5, but more info is needed when they are mentioned. How did you pick the parameter values in Table 5?
21. Line 315: What is a “mixed perturbation”? I think you mean something like in line 320 that implies two types of perturbations are problematic: “The results show that the tool can detect the climate changing modifications when taking hardware-related perturbations into account.” I think this line of reasoning (where the authors focus on perturbations from 2 sources somehow being trickier) is flawed. Either the output of two different runs is consistent or it is not. The source of the difference is not easily quantified, particularly with compiler and hardware changes. If you run on a different machine with a different compiler than your trusted machine, is that a mixed perturbation? I don’t see that that distinction matters (and such experiments were done in [Baker 2015].) I do agree that the way an ensemble spread is created affects the variability of the distribution (e.g., see [Milroy 16] ), and that's an interesting question, but the authors here have not delved into that at all.
22. Figure 9: why compare the output between the O(10^-6) tests and the climate-changing tests if the O(10^-6) is inconsistent? Don't you want to compare to the accepted one to know how far off you are? I am clearly missing something.
23. Figures 7-11: Why is the reconstruction error threshold 0.05? Did this somehow come from the second paragraph in 3.4?
24. Table 5: What does it mean that the passing rate is 0% and the test overall passes? That seems wrong.
25. Section 4.4: The way the beginning of this section reads, with “For example, the effect of -O3 compiler optimization option was not known, because the CESM code base is large and level-three optimizations can be quite aggressive (Baker et al., 2015). We input the simulation results of the heterogeneous version of CESM on the new Sunway system into the ESM-DCT with -O3 compiler optimization option.“ implies that we don't know if O3 will pass or not. I checked [Baker 2015] and there they show that O3 does pass, so I am not sure why it is in the unknown outcomes section.
26. Table 6: Was the O3 optimization only done for the ZM subroutine (as the table says)? If so, this needs to be noted in the paper text.
27. Line 329: “..which provides the references for the porting and optimization“ - what does this mean?
28. Line: 341: “Then, our tool can detect sensitivity of input parameters, which is excluded to the input parameter list provided by the climate scientist thought to affect the climate in a non-trivial manner“
The meaning of this sentence is unclear.
29. Table 6: For the “changes in model parameter” portion, line 345 says: “The result shows that ke, vdc_eq, and vdc_psim variables are not sensitive to the configuration in this study, while the value of the vdc1 variables should not be changed.”
This conclusion is a bit strong. I don’t think the results indicate that vdc1 should not be changed; the results just show that the one value tested fails. Maybe a smaller change would pass? The reverse is true for those that passed. Those variables may be sensitive to a larger change in value. I don’t know how/why these particular values were chosen.
30. Line 355: “..form a mixed perturbation environment...” Again, I don’t think there is anything specifically meaningful to this “mixed perturbation’ terminology. The whole earth system model is chaotic and it is affected by hardware/software stack, roundoff error, truncation error, etc.
31. Other related work on climate model correctness that is not cited in this paper:
Mahajan et al., “Ensuring statistical reproducibility of ocean model simulations in the age of hybrid computing”, 2021.
Massonnet et al., “Replicability of the EC-Earth3 Earth system model under a change in computing environment”, 2020.
Mahajan et al., “A multivariate approach to ensure statistical reproducibility of climate model simulations”, 2019.
Mahajan et al., “Exploring an Ensemble-Based Approach to Atmospheric Climate Modeling and Testing at Scale”, 2017.
Wan, H., et al., “A new and inexpensive non-bit-for-bit solution reproducibility test based on time step convergence (TSC1.0)”, 2017.
Technical corrections:
1. Line 32: “that facing with the” is awkward, please rephrase. (This also occurs in line 75, 108, …)
2. Line 98: MPEs are not defined.
3. Line 234: “probability density function (PDF) of the CESM” . I assume the intention was of a variable, not the CESM model
Citation: https://doi.org/10.5194/gmd-2024-10-RC2
-
AC3: 'Reply on RC2', Yangyang Yu, 23 Apr 2024
A Deep Learning-Based Consistency Test Approach for Earth System Models on Heterogeneous Many-Core Systems
Yangyang Yu, Shaoqing Zhang, Haohuan Fu, Dexun Chen, Yang Gao, Xiaopei Lin, Zhao Liu, Xiaojing
General comments:
This manuscript presents an approach for evaluating software correctness for Earth System Models using an ensemble and deep learning.
- The general idea is promising, but a lot of the algorithm decisions are not well supported. In general, very substantial improvements are needed for GMD. The authors need to justify unsubstantiated claims and statements that are not correct or misleading.
- The paper needs a good editing pass. There are many grammatical errors and awkward phrases (a few of which I mention below).
- Some of the information about other authors’ work is inaccurate and other related work has been left out (listed later in this report). The Introductory section, in particular, needs revision.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We all agree that the comments are very constructive for us to improve the presentation of the MS, and all the major comments and points have been fully addressed in the revision. Specifically, in the revision, we have revised the Introductory section and added: 1) the necessity of developing a consistency test approach on heterogeneous many-core systems; 2) the descriptions of heterogeneous computational perturbations; 3) the experiments of determining the ensemble size of training sets and model parameters; 4) the experiments of determining the model time step; 5) the description of the CESM version, etc.
Specific comments:
1, Line 58: “Evaluating the scientific consistency is a commonly used method for model verification in the form of quality assurance.”.
I am not sure that I agree this is a “commonly-used” method (consistency) - I don’t think this phrasing is found prior to the [Baker 2015]. Also I don’t believe the authors here ever define what they mean by consistency.
RE: Thank you for your comments. We have revised the sentences. It should be “For detecting the influences of hardware environment changes, a historical method is that data from a model simulation of several hundred years (typically 400) on the new machine is analyzed and compared to data from the same simulation on a trusted machine by climate scientists (Baker et al., 2015). Then, many ensemble-based consistency evaluation methods are used to compare the new simulations against the control ensemble from the trusted machine.” Please see lines 64-68. Thanks.
2. Line 59: “For example, for detecting the influences of hardware environment changes, data from a model simulation of several hundred years (typically 400) on the new machine is analyzed and compared to data from the same simulation on a trusted machine by climate scientists (Baker et al., 2015). Then, the CESM ensemble-based consistency test (CESM-ECT) is used to compare the new simulations against the control ensemble from the trusted machine (Baker et al., 2015; Milroy et al., 2016; Baker et al., 2016; Milroy et al., 2018).”
This description is not accurate. [Baker 2015] explains that the 400-year test is the approach that was used prior to the CESM-ECT development. In fact, [Baker 2015] states that it was the motivation for a more objective approach. The CESM-ECT does not involve the 400 years and is a separate thing.
RE: Thank you for your comments. For detecting the influences of hardware environment changes, a historical method is that data from a model simulation of several hundred years (typically 400) on the new machine is analyzed and compared to data from the same simulation on a trusted machine by climate scientists (Baker et al., 2015). Then, many ensemble-based consistency evaluation methods are used to compare the new simulations against the control ensemble from the trusted machine. Please see lines 64-68. Thanks.
3. Line 64: “2018). However, all the methods mentioned above focus on homogeneous multi-core HPC systems.”
This statement is not accurate. As far as I can tell, there is nothing specific to homogeneous machines in these works. In fact, [Milroy 2018] specifically mentions heterogeneous computing environments as a motivation for the work.
RE: [Milroy 2018] developed an ultra-fast statistical consistency test and demonstrated that adequate ensemble variability is achieved with instantaneous variable values at the ninth step, despite rapid perturbation growth and heterogeneous variable spread. However, there is a lack of experiments evaluating consistency on heterogeneous many-core systems, although the CESM-ECT has the capability to determine consistency without bit-for-bit results. For heterogeneous many-core systems, there are hardware differences between the general-purpose cores and the accelerator cores, so computational perturbations caused by the hardware designs can exist; these should be accepted for further detection of software or human errors generated in optimizing and developing the ESMs. We have added the necessity of developing a consistency test approach on heterogeneous many-core systems. Please see lines 104-111. Thanks.
4. Line 71: “However, the ultra-fast tests in the CESM-ECT are applied for evaluating the scientific consistency on the Community Atmosphere Model ” … “There is a lack of a method to analyze short-time simulation results of multi-components”
The [Milroy 2018] work also tests CLM (land model). It explains how land and atmosphere are tightly coupled and CLM modifications can be detected in CAM variables. And gives experimental results.
Consider also that ocean scales are much larger than atmospheric scales, hence the need for more time for changes to propagate through the ocean model from a single perturbed variable.
RE: Thank you for your good comments. We have added experiments on the effects of initial atmosphere temperature perturbations on the atmosphere and ocean variables over 24 time steps, as shown in Figure 6. Figure 6 demonstrates sensitive dependence on initial conditions in the atmosphere and ocean variables and suggests that choosing a small number of time steps may provide sufficient variability to determine the statistical distinguishability resulting from significant changes. We have revised the descriptions of the ultra-fast tests in the CESM-ECT. Please see lines 79-82. Thanks.
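The rapid perturbation growth described here is generic to chaotic dynamics. As a self-contained toy illustration (ours, not a reproduction of the paper's Figure 6), the following sketch shows two logistic-map trajectories that initially differ by O(10^-14) diverging to O(1) within about a hundred steps.

```python
# Toy illustration of sensitive dependence on initial conditions: two
# chaotic logistic-map trajectories differing by 1e-14 initially reach
# O(1) separation within ~100 steps. This stands in for (does not
# reproduce) the CESM behavior shown in the paper's Figure 6.
import numpy as np

def logistic_traj(x0, r=3.9, n=100):
    xs = [x0]
    for _ in range(n):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return np.array(xs)

a = logistic_traj(0.4)
b = logistic_traj(0.4 + 1e-14)
print(np.abs(a - b)[::20])   # separation grows by orders of magnitude
```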
5. Line 74: “Besides, the principal component analysis (PCA) in the CESM-ECT is just applied to exploring linear patterns contained in the confusing datasets”
This statement does not make sense to me. Please clarify what is meant. PCA detects changes in relationships between variables. They don’t have to be linear relationships for change to be detected.
RE: PCA is a linear transformation method. It assumes linear relationships among variables, which hampers the application of PCA when the relationships are nonlinear. We have added the descriptions of PCA. Please see lines 82-84. Thanks.
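To illustrate the limitation being claimed, here is a small self-contained numpy sketch (our own example, not from the manuscript): data lying on a one-dimensional nonlinear curve cannot be captured by a single linear principal component.

```python
# Our illustrative example: points on the curve (t, t**2) form a 1-D
# nonlinear manifold in 2-D, yet a rank-1 (linear) PCA reconstruction
# leaves a clearly nonzero residual.
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, 500)
X = np.column_stack([t, t**2])
Xc = X - X.mean(axis=0)

# PCA via SVD; keep only the leading component.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_rank1 = np.outer(Xc @ Vt[0], Vt[0])

print("rank-1 reconstruction MSE:", np.mean((Xc - X_rank1) ** 2))
# Roughly 0.04-0.05 here: the linear component cannot absorb t**2.
```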
6. Line 75: ”Facing with the non-linear relationship generated by the combination of multi-component data, there is an urgent need of data analysis methods for non-linear transformation, such as deep learning models.”
This statement is just unclear. The atmosphere model by itself is nonlinear and chaotic - so I’m not sure what is meant by multi-component combinations - is the implication that this is more non-linear? Please clarify. Also “facing with the …” is awkward phrasing.
RE: Thank you for your comments. We have revised the sentence. It should be “For the non-linear relationship of ESM outputs, there is an urgent need of data analysis methods for non-linear transformation, such as deep learning models.” Please see lines 84-85. Thanks.
7. Line 80: The phrase “unavoidable computational perturbations” is used frequently (also line 100, line 353), and it’s a bit awkward. It seems that it should be clarified that these are “numerical differences” due to using finite-precision and changing the order of operations and precision due to the changes in architecture. This is not really explained.
RE: Thank you for your suggestion. We have revised the description of computational perturbations. It should be “the uncertainties caused by heterogeneous hardware designs”. We have added the descriptions of heterogeneous computational perturbations. Please see lines 129-131, 136-137, 256-257. Thanks.
8. Line 81: “The ESM-DCT is applied to evaluate whether or not a new CESM configuration in the scenario of mixed perturbations composed of the inevitable computational perturbations and software or human errors in the heterogeneous computing is consistent with the original “trusted” configuration in the homogeneous computing.”
This phrase is unclear. Aren’t you trying to figure out (i.e. differentiate) whether a difference in output is due to numerical round-off OR software/human error?
RE: There are hardware design differences between the general-purpose cores and the accelerator cores in heterogeneous many-core architectures. Compared with homogeneous computing using the general-purpose cores only, heterogeneous computing can produce nonidentical floating-point outputs whenever an accelerator core is involved. The uncertainties generated by heterogeneous hardware designs can blend with software or human errors, which can affect the accuracy of the model verification; these uncertainties should be accepted for further detection of software or human errors generated in optimizing and developing the ESMs. We have revised the sentence. Please see lines 55-61. Thanks.
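The floating-point effect described here is easy to reproduce. The following small numpy sketch (our own illustration) shows that merely changing the reduction order of a sum, as a different core type or parallel decomposition might, generally changes the bit-level result.

```python
# Our illustration of order-dependent floating-point arithmetic: the
# same million float32 values summed in two different orders typically
# do not agree bit for bit, mimicking the difference between
# general-purpose and accelerator reduction schedules.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6).astype(np.float32)

s_whole = np.sum(x)                          # one reduction order
s_chunks = np.float32(0.0)
for chunk in np.array_split(x, 64):          # another order: 64 partial sums
    s_chunks += np.sum(chunk)

print(s_whole == s_chunks, float(s_whole) - float(s_chunks))
# Typically prints: False <tiny nonzero difference>
```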
9. Line 103: “The key challenge is designing a tool to evaluate the scientific consistency, which can remove the influences of heterogeneous perturbations”
The authors have definitely implied that the CESM-ECT tools do not do this, which is untrue. This new method/approach is still a valid contribution, and there is no need to misrepresent other previous work.
RE: Thank you for your comment. It has been changed to “We develop a consistency test approach which addresses the issues arising from the presence of heterogeneous architecture hardware”. Please see lines 110-111. Thanks.
10. Line 108: “However, the principal component analysis (PCA) in the CESM-ECT is just applied to exploring linear patterns contained in the confusing datasets (Liu et al., 2009). Facing with the non-linear relationship generated by the combination of multi-component data in the coupled numerical model, …”
This is the exact same phrase as in line 74 and it still does not make sense (see above comment).
RE: Thank you for your comment. We have revised the sentence. It should be “However, the PCA in the CESM-ECT is a linear transformation method. It assumes linear relationships among variables, which hampers the application of PCA if the relationships are nonlinear (Liu et al., 2009). For the non-linear relationships of ESM outputs, deep learning models that embed multiple non-linear activation functions can be used.” Please see lines 118-121.
11. Line 116: I am confused by the last bit of section 2, regarding the 5-variable conceptual coupled model and its results. It seems that the purpose is to show that the DCT approach is better than the ECT approach. This is quite hard to judge on the information given here. Most choices are not justified. Note that there is little point in a PCA-approach with only 5 variables. (Which is why the CESM-ECT for ocean does not use PCA. Though the authors cite that work, it is unclear that they are familiar with it. Also the ocean variables are not globally averaged.)
RE: Thank you for your comment. In this study, we start from the 5VCCM to develop the ESM-DCT. The 5VCCM is a simple nonlinear coupled model; the experimental result is an example and serves only as the first step in testing the ESM-DCT tool. We have added the descriptions of the function of the 5VCCM. Please see lines 212-249.
Also why the choice of 151 ensemble members for 1000 timesteps? [Baker 2015] uses 151 timesteps for yearly averages. [Milroy 2018] shows that more are needed for shorter time scales. Why does the DCT look at 40? Are you aiming for some false positive rate? The choices here seem quite arbitrary, and it’s unclear why this “experiment” is in the background section. [Say more?]
RE: Thank you for your comment. We show the performance of deep learning models in mining non-linear features from coupled models. First, we run two simulations of 2000 time steps each: one with no initial condition perturbation and one with a perturbation of O(10^-14) to the initial conditions. Figure 5 demonstrates the sensitive dependence of the 5VCCM variables on the initial conditions and shows that choosing 2000 time steps can provide sufficient variability of atmosphere and ocean variables to determine statistical distinguishability. Therefore, we use the simulation results at the 2000th time step as the input data. Then, we select the ensemble size of the training sets and the optimal model parameters at the point where the accuracy on the testing datasets reaches its maximum value. We have added the experiments of determining the ensemble size of training sets and model parameters. Please see lines 235-249.
12. Line 212: “Based on the ultra-fast tests”
Which ultra-fast tests are you referring to? Also I am skeptical about the robustness of bugs being detected in only a couple time steps for the ocean.
RE: Thank you for your comment. In this study, in order to spread the perturbations quickly, we modify the coupling frequency of the ocean component to 8 times a day. We run two simulations of 24 time steps each: one with no initial condition perturbation and one with a perturbation of O(10^-14) to the initial atmosphere temperature. Figure 6 shows that choosing a small number of time steps and modifying the ocean coupling frequency setting can provide sufficient variability of atmosphere and ocean variables to determine statistical distinguishability. Therefore, we use the simulation results at the 24th time step as the input data of the BGRU-AE model, by which point the ocean component has been coupled 4 times. We have added the description of the parameters of the 24 time steps and the coupling frequency of the ocean component. Please see lines 265-279. Thanks.
13. Line 216 - 217: I find it odd to simply increase the frequency of the ocn/atmosphere coupling. This change affects the nature of the model and model output and certainly needs justification as to why it is acceptable. Also how did you choose 24 time steps as an appropriate time slice to evaluate?
RE: Thank you for your comment. Our method aims at determining whether or not the test dataset simulation is statistically distinguishable from the original results: the response to modifications known to produce statistically distinguishable outputs should be a fail, and the response to modifications not expected to produce statistically distinguishable outputs should be a pass. Then, in order to spread the perturbations quickly, we modify the coupling frequency of the ocean component to 8 times a day. We run two simulations of 24 time steps each: one with no initial condition perturbation and one with a perturbation of O(10^-14) to the initial atmosphere temperature. Figure 6 shows that choosing a small number of time steps and modifying the ocean coupling frequency setting can provide sufficient variability of atmosphere and ocean variables to determine statistical distinguishability. Therefore, we use the simulation results at the 24th time step as the input data of the BGRU-AE model, by which point the ocean component has been coupled 4 times. We have added the description of the parameters of the 24 time steps and the coupling frequency of the ocean component. Please see lines 265-279. Thanks.
14. Line 236: “...clearly inconsistent…”. I don’t believe that there has been a definition provided as to what is meant by consistency for this new approach.
RE: Thank you for your comment. We expect to use the labeled datasets to detect the generalization performance of the BGRU-AE model. The testing datasets whose PDF is clearly distinguishable from that of the O(10^-14) initial perturbations are tagged as unacceptable initial perturbations. We have revised the sentences. It should be “Therefore, the testing datasets tagged as unacceptable initial perturbations are the ensembles with the O(10^-6) perturbations of initial atmosphere temperature, whose PDF is clearly distinguishable from that of the O(10^-14) initial perturbations.” Please see lines 295-297.
15. Lines 236-237: “the testing datasets with unacceptable CESM model parameter adjustments are with the O(10^-14) perturbations of initial atmospheric temperature”. I don’t understand this - why is O(10^-14) unacceptable?
RE: Thank you for your comment. We define the testing datasets with unacceptable CESM model parameters as those known to produce statistically distinguishable outputs relative to the training datasets, but we control these testing datasets to be ensembles with the O(10^-14) perturbations of initial atmosphere temperature. We have added the description of the testing datasets with unacceptable CESM model parameters. Please see lines 297-298.
16. Line 246: Why the decision to do a spatial mean for the ocean variables? It seems that this could be problematic given the much higher spatial variability in the ocean compared to the atmosphere.
RE: We expect to analyze short-time simulation results of the atmosphere and ocean components, which can achieve an overall consistency evaluation of the ESM rapidly. The spatial mean of the atmosphere and ocean variables is a data-preprocessing approach that aligns the dimensions of the atmosphere and ocean data. This method does indeed lose spatial information, but it adds oceanic information to the rapid consistency evaluation of the entire ESM. Then, to analyze the spatial features of the atmosphere and ocean data, we will focus on refining convolutional neural networks and attention mechanisms within the deep-learning model architecture in follow-up studies. Please see lines 439-442. Thanks.
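For concreteness, a minimal numpy sketch of the kind of area-weighted spatial mean involved in this preprocessing is given below; the cosine-latitude weighting is the standard regular-grid approximation and is our illustration, not the authors' exact grid handling (ocean grids in particular would use true cell areas).

```python
# Illustrative area-weighted global mean on a regular lat-lon grid.
# Real CESM grids (especially the ocean's curvilinear grid) would use
# the model's actual cell areas instead of cos(latitude) weights.
import numpy as np

def global_area_weighted_mean(field, lat_deg):
    """field: (nlat, nlon) array; lat_deg: (nlat,) latitudes in degrees."""
    w = np.cos(np.deg2rad(lat_deg))               # per-row area proxy
    w2d = np.broadcast_to(w[:, None], field.shape)
    return float(np.sum(field * w2d) / np.sum(w2d))

lat = np.linspace(-90, 90, 96)
field = np.random.default_rng(0).standard_normal((96, 144))
print(global_area_weighted_mean(field, lat))
```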
17. Line 258 ”.. because the variables have vastly different units and magnitudes.”
Matches the phrase used in [Baker 2015]: “..because the CAM variables have vastly different units and magnitudes.”
RE: Thank you for your suggestion. We have cited the paper of Baker et al. (2015). Please see lines 314-316. Thanks.
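Since the quoted passage concerns putting variables with vastly different units and magnitudes on a common footing, here is a minimal z-score standardization sketch; that the test data reuse the training-set statistics is standard practice and our assumption about the pipeline, and the placeholder magnitudes are illustrative.

```python
# Illustrative z-score standardization: each of the 97 variables is
# scaled by training-set statistics so units/magnitudes are comparable.
import numpy as np

def standardize(train, test):
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    sigma = np.where(sigma == 0.0, 1.0, sigma)   # guard constant variables
    return (train - mu) / sigma, (test - mu) / sigma

rng = np.random.default_rng(0)
train = rng.standard_normal((120, 97)) * 100 + 250   # placeholder data
test = rng.standard_normal((40, 97)) * 100 + 250
train_z, test_z = standardize(train, test)
```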
18. Line 26: “Software correctness verification n the form of quality assurance”
Line 59: “verification in the form of quality assurance”
Line 104: “for software verification in the form of quality assurance”. This particular phrase comes from the abstract of [Baker 2015]: “software verification in the form of quality assurance”. Consider rephrasing or quoting.
RE: Thank you for your suggestion. We have revised the sentences and cited the paper of Baker et al., 2015. Please see line 25-27, 112-113. Thanks.
19. Line 268: Why using CESM version 1.3? That is quite old and not an official release: https://www.cesm.ucar.edu/models/releases
RE: Thank you for your comment. The CESM version used in the present study is the CESM1.3-beta17_sehires38, which is applied to the Sunway system and described in Zhang et al. (2020). We added the description of the CESM version. Please see lines 259-260. Thanks.
20. Line 310: What are c0_lnd and c0_ocn? Why did you choose them? (There are a couple of DOE-authored studies with CAM5 that mention these for the ZM scheme, but none are cited here.) Also, I know these are later coarsely defined in Table 5, but more info is needed when they are mentioned. How did you pick the parameter values in Table 5?
RE: Thank you for your comment. Climate scientists provided a list of CAM input parameters thought to affect the climate in a non-trivial manner, which is used to detect changes to the simulation results that are known to produce statistically distinguishable outputs in the CESM-ECT (Baker et al., 2015), such as zm_c0_lnd, zm_c0_ocn, sol_factb_interstitial, sol_factic_interstitial, and cldfrc_rhminh. Our tool must successfully detect the inconsistency of simulation results that are known to produce statistically distinguishable outputs. Therefore, we modify the values of these input parameters in the atmosphere component and then test whether or not the ESM-DCT can detect the inconsistency caused by the parameter changes in heterogeneous computing. We have added the descriptions of the experiments with modifications that produce statistically distinguishable outputs. Please see lines 356-365. Thanks.
21. Line 315: What is a “mixed perturbation”? I think you mean something like in line 320 that implies two types of perturbations are problematic: “The results show that the tool can detect the climate changing modifications when taking hardware-related perturbations into account.” I think this line of reasoning (where the authors focus on perturbations from 2 sources somehow being trickier) is flawed. Either the output of two different runs is consistent or it is not. The source of the difference is not easily quantified, particularly with compiler and hardware changes. If you run on a different machine with a different compiler then your trusted machine is that a mixed perturbation? I don’t see that that distinction matters (and such experiments were done in [Baker 2015].) I do agree that the way an ensemble spread is created affects the variability of the distribution (e.g., see [Milroy 16] ), and that's an interesting question, but the authors here have not delved into that at all.
RE: Thank you for your comment. The ESM-DCT is used for detecting the existence of software or human errors when taking hardware-related perturbations into account on the heterogeneous many-core systems. Therefore, the mixed perturbations refer to the additional uncertainty caused by heterogeneous hardware designs and software or human changes (such as compiler optimization option and model input parameter changes). The coexistence of general-purpose cores and accelerator cores, which usually employ different hardware architectures, can lead to bit-level differences, especially when we try to maximize the performance on both kinds of cores. Such differences further lead to computational perturbations through temporal integration, which can blend with software or human errors. We have removed the descriptions of the mixed perturbations. Please see lines 372-373. Thanks.
22. Figure 9: why compare the output between the O(10^-6) tests and the climate-changing tests if the O(10^-6) is inconsistent? Don't you want to compare to the accepted one to know how far off you are? I am clearly missing something.
RE: In this study, our tool must successfully detect modifications to the simulation results that are known to produce statistically distinguishable outputs with respect to the training datasets. Besides the unacceptable CESM model parameter adjustments listed by climate scientists (Baker et al., 2015), we run the experiments with unacceptable initial perturbations that produce statistically distinguishable outputs. Following Figure 7, the testing datasets tagged as unacceptable initial perturbations are the ensembles with the O(10^-6) perturbations of initial atmosphere temperature, whose PDF is clearly distinguishable from that of the O(10^-14) initial perturbations. Then, we use the tagged testing datasets to detect the generalization performance of the ESM-DCT and adjust the ensemble size of the training datasets. We have added the descriptions of the testing datasets. Please see lines 293-297. Thanks.
23. Figures 7 -11: Why is the reconstruction error threshold .05? Did this somehow come from the second paragraph in 3.4?
RE: Thank you for your comment. We calculate the reconstruction errors after re-inputting the training datasets into the saved BGRU-AE model. The PDF of the reconstruction errors of the training datasets is shown in Figure 9, represented by the blue line. The threshold of the reconstruction errors is 0.05, represented by the red line. We have added the descriptions of the calculation method of the reconstruction errors. Please see lines 205-211.
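One plausible way to turn such a training-error distribution into a red-line threshold is to take a high quantile of it; the sketch below shows this, with the quantile rule and placeholder error distribution being our assumptions (the paper simply reports the resulting value, 0.05).

```python
# Hypothetical threshold selection: set the red line at a high quantile
# of the training ensemble's reconstruction errors. The quantile rule is
# our assumption; the paper reports the value 0.05.
import numpy as np

def pick_threshold(train_errors, q=0.99):
    return float(np.quantile(np.asarray(train_errors), q))

rng = np.random.default_rng(0)
train_errors = rng.gamma(shape=2.0, scale=0.01, size=120)  # placeholder
print(pick_threshold(train_errors))
```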
24. Table 5: What does it mean that the passing rate is 0% and the test overall passes? That seems wrong.
RE: Thank you for your comment. It should be “Failure”. We have revised the table. Please see Table 7.
25. Section 4.4: The way the beginning of this section reads, with “For example, the effect of -O3 compiler optimization option was not known, because the CESM code base is large and level-three optimizations can be quite aggressive (Baker et al., 2015). We input the simulation results of the heterogeneous version of CESM on the new Sunway system into the ESM-DCT with -O3 compiler optimization option.“ implies that we don't know if O3 will pass or not. I checked [Baker 2015] and there they show that O3 does pass, so I am not sure why it is in the unknown outcomes section.
RE: Thank you for your comment. The CESM code base is large, and level-three optimizations can be quite aggressive. In the CESM-ECT results of [Baker 2015], INTEL13-O3 and INTEL14-O3 get a “pass”, but INTEL15-O3 gets a “failure”. Different compilers have different optimization algorithms and strategies, so the outputs can differ. We expect that the ESM-DCT can help to predict results for new data with unknown-outcome modifications on heterogeneous systems. The prediction datasets are simulation results of the heterogeneous version of CESM with the -O3 compiler optimization option applied to all codes on the new Sunway system. The computing environment for CESM has been changed, so the consistency needs to be reassessed. We have added the descriptions of the -O3 compiler optimization options. Please see lines 393-395. Thanks.
26. Table 6: Was the O3 optimization only done for the ZM subroutine (as the table says)? If so, this needs to be noted in the paper text.
RE: Thank you for your suggestion. The prediction datasets are simulation results of the heterogeneous version of CESM which is with -O3 compiler optimization option for all codes on the new Sunway system. We have added the descriptions of prediction datasets with -O3 compiler optimization options. Please see lines 395-397. Thanks.
27. Line 329: “..which provides the references for the porting and optimization“ - what does this mean?
RE: Thank you for your comment. The result shows that the effect of -O3 compiler optimization option in the new Sunway system is positive, which provides the confidence to choose the level-three optimizations on the new Sunway system. We have revised the sentence. Please see lines 399-400. Thanks.
28. Line: 341: “Then, our tool can detect sensitivity of input parameters, which is excluded to the input parameter list provided by the climate scientist thought to affect the climate in a non-trivial manner“
The meaning of this sentence is unclear.
RE: Thank you for your comment. The input parameters used in Section 4.4 are not among the parameters listed by climate scientists in [Baker et al., 2015]. In this study, the experiments with unknown-outcome modifications are the prediction application on the new Sunway system; the experiments with unknown-outcome modifications of input parameters need further research, and only some test cases are shown here. In the follow-up study, we will increase the number of experiments to detect the sensitivity of the input parameters. To avoid misunderstandings, we have removed the experiments with unknown-outcome modifications of input parameters and focused on the prediction applications of the -O3 compiler optimization option and mixed-precision programming. Thanks.
29. Table 6: For the “changes in model parameter” portion, line 345 says: “The result shows that ke, vdc_eq, and vdc_psim variables are not sensitive to the configuration in this study, while the value of the vdc1 variables should not be changed.”
This conclusion is a bit strong. I don’t think the results indicate that vdc1 should not be changed, The results just show that the one value tested fails. Maybe a smaller change would pass? The reverse is true for those that passed. Those variables may be sensitive to a larger change in value. I don’t know how/why these particular values were chosen.
RE: Thank you for your excellent comment. The experiments with unknown-outcome modifications are the prediction application on the new Sunway system; the experiments with unknown-outcome modifications of input parameters need further research, and only some test cases are shown here. In the follow-up study, we will increase the number of experiments to detect the sensitivity of the input parameters. To avoid misunderstandings, we have removed the experiments with unknown-outcome modifications of input parameters and focused on the prediction applications of the -O3 compiler optimization option and mixed-precision programming. Thanks.
30. Line 355: “..form a mixed perturbation environment...” Again, I don’t think there is anything specifically meaningful to this “mixed perturbation’ terminology. The whole earth system model is chaotic and it is affected by hardware/software stack, roundoff error, truncation error, etc.
RE: Thank you for your comment. The ESM-DCT is used for detecting the existence of software or human errors when taking hardware-related perturbations into account on the heterogeneous many-core systems. Therefore, the mixed perturbations refer to the additional uncertainty caused by heterogeneous hardware designs and software or human changes (such as compiler optimization option and model input parameter changes). The coexistence of general-purpose cores and accelerator cores, which usually employ different hardware architectures, can lead to bit-level differences, especially when we try to maximize the performance on both kinds of cores. Such differences further lead to computational perturbations through temporal integration, which can blend with software or human errors. We have removed the descriptions of the mixed perturbations. Please see lines 372-373. Thanks.
31. Other related work on climate model correctness that is not cited in this paper:
Mahajan et al., “Ensuring statistical reproducibility of ocean model simulations in the age of hybrid computing”, 2021.
Massonnet et al., “Replicability of the EC-Earth3 Earth system model under a change in computing environment”, 2020.
Mahajan et al., “A multivariate approach to ensure statistical reproducibility of climate model simulations”, 2019.
Mahajan et al., “Exploring an Ensemble-Based Approach to Atmospheric Climate Modeling and Testing at Scale”, 2017.
Wan, H., et al., “A new and inexpensive non-bit-for-bit solution reproducibility test based on time step convergence (TSC1.0)”, 2017.
RE: Thank you for your suggestion. We have cited the related work on climate model correctness. Please see lines 68-75.
Technical corrections:
1. Line 32: “that facing with the” is awkward, please rephrase. (This also occurs in line 75, 108, …)
RE: Thank you for your suggestion. We have revised the sentence. Please see lines 84-85 and 120-121. Thanks.
2. Line 98: MPEs are not defined.
RE: Thank you for your suggestion. We have revised the sentence. Please see lines 100-101. Thanks.
3. Line 234: “probability density function (PDF) of the CESM” . I assume the intention was of a variable, not the CESM model
RE: Thank you for your suggestion. We have revised the sentence. Please see lines 294-295. Thanks.
Citation: https://doi.org/10.5194/gmd-2024-10-AC3
-
AC5: 'Reply on RC2', Yangyang Yu, 24 May 2024
A Deep Learning-Based Consistency Test Approach for Earth System Models on Heterogeneous Many-Core Systems
Yangyang Yu, Shaoqing Zhang, Haohuan Fu, Dexun Chen, Yang Gao, Xiaopei Lin, Zhao Liu, Xiaojing
General comments:
This manuscript presents an approach for evaluating software correctness for Earth System Models using an ensemble and deep learning.
- The general idea is promising, but a lot of the algorithm decisions are not well supported. In general, very substantial improvements are needed for GMD. The authors need to justify unsubstantiated claims and statements that are not correct or misleading.
- The paper needs a good editing pass. There are many grammatical errors and awkward phrases (a few of which I mention below).
- Some of the information about other authors’ work is inaccurate and other related work has been left out (listed later in this report). The Introductory section, in particular, needs revision.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We all agree that the comments are very constructive for us to improve the presentation of the MS, and all the major comments and points have been fully addressed in the revision. Specifically, in the revision, we have revised the Introductory section and added: 1) the descriptions of heterogeneous computational perturbations; 2) the experiments of determining the ensemble size of training sets and model parameters; 3) the experiments of determining the model time step; 4) the description of the CESM version, etc.
The point-by-point replies follow.
Specific comments:
1. Line 58: “Evaluating the scientific consistency is a commonly used method for model verification in the form of quality assurance.”
I am not sure that I agree this is a “commonly-used” method (consistency) - I don’t think this phrasing is found prior to the [Baker 2015]. Also I don’t believe the authors here ever define what they mean by consistency.
RE: Thank you for your comments. We have revised the sentences. It should be “Model verification during optimizing and developing ESMs is critical to establishing and maintaining the credibility of the ESMs (Carson II, 2002), which focuses on determining whether or not the implementation of a model is correct and matches the intended description and assumptions for the model. ” Please see lines 58-60. Thanks.
2. Line 59: “For example, for detecting the influences of hardware environment changes, data from a model simulation of several hundred years (typically 400) on the new machine is analyzed and compared to data from the same simulation on a trusted machine by climate scientists (Baker et al., 2015). Then, the CESM ensemble-based consistency test (CESM-ECT) is used to compare the new simulations against the control ensemble from the trusted machine (Baker et al., 2015; Milroy et al., 2016; Baker et al., 2016; Milroy et al., 2018).”
This description is not accurate. [Baker 2015] explains that the 400-year test is the approach that was used prior to the CESM-ECT development. In fact, [Baker 2015] states that it was the motivation for a more objective approach. The CESM-ECT does not involve the 400 years and is a separate thing.
RE: Thank you for your comments. We have revised the sentence. It should be “For detecting the influences of hardware environment changes, a historical method is that data from a model simulation of several hundred years (typically 400) on the new machine is analyzed and compared to data from the same simulation on a trusted machine by climate scientists (Baker et al., 2015). Then, many ensemble-based consistency evaluation methods are used to compare the new simulations against the control ensembles.” Please see lines 60-64. Thanks.
3. Line 64: “2018). However, all the methods mentioned above focus on homogeneous multi-core HPC systems.”
This statement is not accurate. As far as I can tell, there is nothing specific to homogeneous machines in these works. In fact, [Milroy 2018] specifically mentions heterogeneous computing environments as a motivation for the work.
RE: Thank you for your comments. We have removed the sentence. Thanks.
4. Line 71: “However, the ultra-fast tests in the CESM-ECT are applied for evaluating the scientific consistency on the Community Atmosphere Model ” … “There is a lack of a method to analyze short-time simulation results of multi-components”
The [Milroy 2018] work also tests CLM (land model). It explains how land and atmosphere are tightly coupled and CLM modifications can be detected in CAM variables. And gives experimental results.
Consider also that ocean scales are much larger than atmospheric scales, hence the need for more time for changes to propagate through the ocean model from a single perturbed variable.
RE: Thank you for your good comments. We have removed the sentence and only focus on the consistency test of the atmosphere component for the CESM. Please see lines 232-244. Thanks.
5. Line 74: “Besides, the principal component analysis (PCA) in the CESM-ECT is just applied to exploring linear patterns contained in the confusing datasets”
This statement does not make sense to me. Please clarify what is meant. PCA detects changes in relationships between variables. They don’t have to be linear relationships for change to be detected.
RE: Thank you for your comment. We have removed the sentence and added the descriptions of the advantages of deep learning methods for evaluating the consistency for ESMs. Please see lines 75-80. Thanks.
6. Line 75: ”Facing with the non-linear relationship generated by the combination of multi-component data, there is an urgent need of data analysis methods for non-linear transformation, such as deep learning models.”
This statement is just unclear. The atmosphere model by itself is nonlinear and chaotic - so I’m not sure what is meant by multi-component combinations - is the implication that this is more non-linear? Please clarify. Also “facing with the …” is awkward phrasing.
RE: Thank you for your comments. We have removed the sentence. Thanks.
7. Line 80: The phrase “unavoidable computational perturbations” is used frequently (also line 100, line 353), and it’s a bit awkward. It seems that it should be clarified that these are “numerical differences” due to using finite-precision and changing the order of operations and precision due to the changes in architecture. This is not really explained.
RE: Thank you for your suggestion. We have added the descriptions of uncertainties caused by heterogeneous hardware designs. Please see lines 97-102. Thanks.
8. Line 81: “The ESM-DCT is applied to evaluate whether or not a new CESM configuration in the scenario of mixed perturbations composed of the inevitable computational perturbations and software or human errors in the heterogeneous computing is consistent with the original “trusted” configuration in the homogeneous computing.”
This phrase is unclear. Aren’t you trying to figure out (i.e. differentiate) whether a difference in output is due to numerical round-off OR software/human error?
RE: Thank you for your comment. We have revised the sentence. It should be “The ESM-DCT tool is based on the unsupervised bidirectional gate recurrent unit-autoencoder (BGRU-AE; Zhao et al., 2017) model, and is applied to evaluate whether or not a new ESM (CESM in this case) configuration is consistent with the original ‘trusted’ configuration.” Please see lines 82-84. Thanks.
9. Line 103: “The key challenge is designing a tool to evaluate the scientific consistency, which can remove the influences of heterogeneous perturbations”
The authors have definitely implied that the CESM-ECT tools do not do this, which is untrue. This new method/approach is still a valid contribution, and there is no need to misrepresent other previous work.
RE: Thank you for your comment. We have removed the sentence. Thanks.
10.Line 108: “However, the principal component analysis (PCA) in the CESM-ECT is just applied to exploring linear patterns contained in the confusing datasets (Liu et al., 2009). Facing with the non-linear relationship generated by the combination of multi-component data in the coupled numerical model, …”
This is the exact same phrase as in line 74 and it still does not make sense (see above comment).
RE: Thank you for your comment. We have removed the sentence. Thanks.
11. Line 116: I am confused by the last bit of section 2, regarding the 5-variable conceptual couples model and its results. It seems that the purpose is to show that the DCT approach is better than the ECT approach. This is quite hard to judge on the information given here. Most choices are not justified. Note that there is little point in a PCA-approach with only 5 variables. (Which is why the CESM-ECT for ocean does not use PCA. Though the authors cite that work, it is unclear that they are familiar with it. Also the ocean variables are not globally averaged.)
RE: Thank you for your comment. In this study, we start from the 5VCCM to develop the ESM-DCT. The 5VCCM is a simple nonlinear coupled model; the experimental result is an example and is only used as the first step in testing the ESM-DCT tool. We have added the descriptions of the function of the 5VCCM. Please see lines 182-219.
Also why the choice of 151 ensemble members for 1000 timesteps? [Baker 2015] uses 151 timesteps for yearly averages. [Milroy 2018] shows that more are needed for shorter time scales. Why does the DCT look at 40? Are you aiming for some false positive rate? The choices here seem quite arbitrary, and it’s unclear why this “experiment” is in the background section. [Say more?]
RE: Thank you for your comment. We show the performance of deep learning models in mining non-linear features from coupled models. First, we run two simulations of 2000 time steps each: one with no initial-condition perturbation and one with a perturbation of O(10−14) to the initial conditions. Figure 5 demonstrates the sensitive dependence of the 5VCCM variables on the initial conditions, and shows that 2000 time steps provide sufficient variability of the atmosphere and ocean variables to determine statistical distinguishability. Therefore, we use the simulation results at the 2000th time step as the input data. Then, we select the ensemble size of the training sets and the optimal model parameters as those for which the accuracy on the testing datasets reaches its maximum value. We have added the experiments of determining the ensemble size of the training sets and the model parameters. Please see lines 205-219.
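For illustration, a minimal sketch of the significant-digits comparison behind Figure 5, assuming two trajectories of the same variable from the unperturbed and perturbed runs (the function name and the sample values are ours, not from the paper):

```python
import numpy as np

def common_significant_digits(a, b, max_digits=16):
    # Estimate how many leading decimal digits two trajectories share
    # at each time step; larger values mean closer agreement.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diff = np.abs(a - b)
    scale = np.maximum(np.abs(a), np.abs(b))
    with np.errstate(divide="ignore", invalid="ignore"):
        digits = -np.log10(diff / scale)
    return np.clip(np.where(diff == 0, max_digits, digits), 0, max_digits)

# Control run vs. a run perturbed by O(1e-14) in the initial conditions:
control = np.array([288.150000, 288.160010, 288.203000])
perturbed = np.array([288.150000, 288.160011, 288.214000])
print(common_significant_digits(control, perturbed))  # agreement decays over time
```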
12. Line 212: “Based on the ultra-fast tests”
Which ultra-fast tests are you referring to? Also I am skeptical about the robustness of bugs being detected in only a couple time steps for the ocean.
RE: Thank you for your comment. Milroy et al. (2018) demonstrated that adequate ensemble variability is achieved with instantaneous variable values at a small number of time steps for CAM, despite rapid perturbation growth and heterogeneous variable spread. Therefore, we expect that analyzing short-term simulation results can achieve the consistency evaluation of the CESM. Please see lines 232-234. Thanks.
13. Line 216 - 217: I find it odd to simply increase the frequency of the ocn/atmosphere coupling. This change affects the nature of the model and model output and certainly needs justification as to why it is acceptable. Also how did you choose 24 time steps as an appropriate time slice to evaluate?
RE: Thank you for your comment. We have removed the consistency test of the ocean component and revised the descriptions of the experiments used to obtain the datasets. We obtain the results using the B compset (active atmosphere, land, ice, ocean) and expect to provide guidance for the consistency test of fully active ESM components. We therefore examine 97 variables from the atmosphere component results after the ocean component transfers data to the CESM coupler, with redundant variables and those with no variance excluded. We did not change the coupling frequency. Next, we run two simulations of 96 time steps: one with no initial-condition perturbation and one with a perturbation of O(10−14) to the initial atmosphere temperature. Figure 6 demonstrates the sensitive dependence of the atmosphere variables on the initial atmosphere temperature. The vertical axis labels are atmosphere variables, and the horizontal axis labels are the CESM time steps; the color at each step represents the number of significant figures in common between the perturbed and unperturbed simulations. Figure 6 shows that a small number of time steps provides sufficient variability of the atmosphere variables to determine statistical distinguishability. Therefore, we use the simulation results at the 96th time step, at which the ocean component has been coupled twice, as the input data of the BGRU-AE model. Please see lines 232-244. Thanks.
14. Line 236: “...clearly inconsistent…”. I don’t believe that there has been a definition provided as to what is meant by consistency for this new approach.
RE: Thank you for your comment. We have removed the sentence because we have removed the testing datasets of “unacceptable initial conditions”. Thanks.
15. Lines 236-237: “the testing datasets with unacceptable CESM model parameter adjustments are with the O(10-14) perturbations of initial atmospheric temperature” . I don’t understand this - why is O(10-14) unacceptable?
RE: Thank you for your comment. We define the testing datasets with unacceptable CESM model parameters as those known to produce outputs statistically distinguishable from the training datasets; these testing datasets are nevertheless generated as ensembles with O(10-14) perturbations of the initial atmosphere temperature. We have added a description of the testing datasets with unacceptable CESM model parameters. Please see lines 256-258.
16. Line 246: Why the decision to do a spatial mean for the ocean variables? It seems that this could be problematic given the much higher spatial variability in the ocean compared to the atmosphere.
RE: We have removed the consistency test of the ocean component and only focus on the atmosphere component for the CESM. Thanks.
17. Line 258 ”.. because the variables have vastly different units and magnitudes.”
Matches the phrase used in [Baker 2015]: “..because the CAM variables have vastly different units and magnitudes.”
RE: Thank you for your suggestion. We have cited the paper of Baker et al. (2015). Please see lines 271-274. Thanks.
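The reply above concerns standardizing variables with vastly different units and magnitudes before they enter the network. A minimal sketch of such a z-score step (our illustration; the shapes and sample values are made up):

```python
import numpy as np

# ens: ensemble of global means, shape (members, variables). Each variable
# is shifted to zero mean and scaled to unit variance so that fields with
# very different units (e.g., temperature in K, pressure in Pa) contribute
# comparably to the reconstruction loss.
rng = np.random.default_rng(0)
ens = rng.normal(loc=[250.0, 1.0e5], scale=[5.0, 500.0], size=(151, 2))
ens_std = (ens - ens.mean(axis=0)) / ens.std(axis=0)
print(ens_std.mean(axis=0), ens_std.std(axis=0))  # ~[0, 0] and [1, 1]
```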
18. Line 26: “Software correctness verification in the form of quality assurance”
Line 59: “verification in the form of quality assurance”
Line 104: “for software verification in the form of quality assurance”This particular phrase comes from the abstract of [Baker 2015]: “software verification in the form of quality assurance”. Consider rephrasing or quoting.
RE: Thank you for your suggestion. We have revised the sentences and cited the paper. For example, “Model verification during optimizing and developing ESMs is critical to establishing and maintaining the credibility of the ESMs (Carson II, 2002)”. Please see lines 58-59. Thanks.
19. Line 268: Why using CESM version 1.3? That is quite old and not an official release: https://www.cesm.ucar.edu/models/releases
RE: Thank you for your comment. The CESM version used in this study is CESM 2.1.1 (Danabasoglu et al., 2020). We have added the description of the CESM version. Please see line 227. Thanks.
20. Line 310: What are c0_lnd and c0_ocn? Why did you choose them? (There are a couple of DOE-authored studies with CAM5 that mention these for the ZM scheme, but none are cited here.) Also, I know these are later coarsely defined in Table 5, but more info is needed when they are mentioned. How did you pick the parameter values in Table 5?
RE: Thank you for your comment. Climate scientists provided a list of CAM input parameters thought to affect the climate in a non-trivial manner, which is used to detect changes to the simulation results that are known to produce statistically distinguishable outputs in the CESM-ECT (Baker et al., 2015), such as zm_c0_lnd, zm_c0_ocn, sol_factb_interstitial, sol_factic_interstitial, and cldfrc_rhminh. Our tool must successfully detect the inconsistency of the simulation results that are known to produce statistically distinguishable outputs. Therefore, we modify the values of input parameters in the atmosphere components and then test whether or not our tool can detect the inconsistency caused by the model parameter changes using the ESM-DCT. We have added the descriptions of experiments of modifications to produce statistically distinguishable outputs. Please see lines 311-317. Thanks.
21. Line 315: What is a “mixed perturbation”? I think you mean something like in line 320 that implies two types of perturbations are problematic: “The results show that the tool can detect the climate changing modifications when taking hardware-related perturbations into account.” I think this line of reasoning (where the authors focus on perturbations from 2 sources somehow being trickier) is flawed. Either the output of two different runs is consistent or it is not. The source of the difference is not easily quantified, particularly with compiler and hardware changes. If you run on a different machine with a different compiler then your trusted machine is that a mixed perturbation? I don’t see that that distinction matters (and such experiments were done in [Baker 2015].) I do agree that the way an ensemble spread is created affects the variability of the distribution (e.g., see [Milroy 16] ), and that's an interesting question, but the authors here have not delved into that at all.
RE: Thank you for your comment. We have removed the descriptions of “mixed perturbation” and only focus on documenting the development of a deep learning-based consistency test approach for the ESMs. Thanks.
22. Figure 9: why compare the output between the O(10-6) tests and the climate-changing tests if the O(10-6) is inconsistent? Don't you want to compare to the accepted one to know how far off you are? I am clearly missing something.
RE: Thank you for your comment. We have removed the testing datasets of “unacceptable initial conditions”. Thanks.
23. Figures 7 -11: Why is the reconstruction error threshold .05? Did this somehow come from the second paragraph in 3.4?
RE: Thank you for your comment. We calculate the reconstruction errors after re-inputting the training datasets into the saved BGRU-AE model. The PDF of the reconstruction errors of the training datasets is shown in Figure 8, represented by the blue line; the threshold of the reconstruction errors, 0.05, is represented by the red line. We have added the descriptions of the calculation method of the reconstruction errors. Please see lines 283-288.
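A minimal sketch of this thresholding step, assuming `model` is the trained BGRU-AE and `train_x` the standardized training ensemble (both names are ours, not the authors' code):

```python
import torch

def reconstruction_errors(model, train_x):
    # Re-input the training datasets into the saved model and record the
    # per-member mean squared reconstruction error.
    model.eval()
    with torch.no_grad():
        recon = model(train_x)
        dims = tuple(range(1, train_x.dim()))
        return ((recon - train_x) ** 2).mean(dim=dims)

# The empirical PDF of these errors (blue line in Figure 8) then gives the
# pass/fail cutoff, fixed at 0.05 in the paper; a new simulation whose
# reconstruction error exceeds the cutoff is flagged as inconsistent:
# errs_new = reconstruction_errors(model, new_x)
# consistent = errs_new <= 0.05
```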
24. Table 5: What does it mean that the passing rate is 0% and the test overall passes? That seems wrong.
RE: Thank you for your comment. It should be “Failure”. We have revised the table. Please see Table 6.
25. Section 4.4: The way the beginning of this section reads, with “For example, the effect of -O3 compiler optimization option was not known, because the CESM code base is large and level-three optimizations can be quite aggressive (Baker et al., 2015). We input the simulation results of the heterogeneous version of CESM on the new Sunway system into the ESM-DCT with -O3 compiler optimization option.“ implies that we don't know if O3 will pass or not. I checked [Baker 2015] and there they show that O3 does pass, so I am not sure why it is in the unknown outcomes section.
RE: Thank you for your comment. The CESM code base is large and level-three optimizations can be quite aggressive. In the CESM-ECT results [Baker 2015], INTEL13-O3 and INTEL14-O3 get a “pass”, but INTEL15-O3 gets a “failure”. Different compilers have different optimization algorithms and strategies, so the outputs can differ. After the ESM-DCT tool is constructed, we expect that it can provide guidelines for predicting the consistency results of new ESM configurations, to better understand and improve the tool. The result shows that the effect of the -O3 compiler optimization option is positive, which gives us the confidence to choose level-three optimizations on the new Sunway system. We have added the descriptions of the -O3 compiler optimization option. Please see lines 351-359. Thanks.
26. Table 6: Was the O3 optimization only done for the ZM subroutine (as the table says)? If so, this needs to be noted in the paper text.
RE: Thank you for your suggestion. The prediction datasets are simulation results of the heterogeneous version of the CESM compiled with the -O3 optimization option for all code on the new Sunway system. We have added the descriptions of the prediction datasets with the -O3 compiler optimization option. Please see lines 354-355. Thanks.
27. Line 329: “..which provides the references for the porting and optimization“ - what does this mean?
RE: Thank you for your comment. The result shows that the effect of the -O3 compiler optimization option on the new Sunway system is positive, which gives us the confidence to choose level-three optimizations there. We have revised the sentence. Please see lines 358-359. Thanks.
28. Line: 341: “Then, our tool can detect sensitivity of input parameters, which is excluded to the input parameter list provided by the climate scientist thought to affect the climate in a non-trivial manner“
The meaning of this sentence is unclear.
RE: Thank you for your comment. The input parameters used in Section 4.4 are outside the parameter list provided by climate scientists in Baker et al. (2015). In this study, the modification experiments with unknown outcomes are the prediction application on the new Sunway system. Experiments on input-parameter modifications with unknown outcomes need further research, and only a few test cases are shown here. In a follow-up study we will increase the number of experiments to detect the sensitivity of the input parameters. To avoid misunderstanding, we have removed the input-parameter experiments with unknown outcomes and focused on the prediction applications for the -O3 compiler optimization option and mixed-precision programming. Thanks.
29. Table 6: For the “changes in model parameter” portion, line 345 says: “The result shows that ke, vdc_eq, and vdc_psim variables are not sensitive to the configuration in this study, while the value of the vdc1 variables should not be changed.”
This conclusion is a bit strong. I don’t think the results indicate that vdc1 should not be changed; the results just show that the one value tested fails. Maybe a smaller change would pass? The reverse is true for those that passed: those variables may be sensitive to a larger change in value. I don’t know how/why these particular values were chosen.
RE: Thank you for your excellent comment. The modification experiments with unknown outcomes are the prediction application on the new Sunway system. Experiments on input-parameter modifications with unknown outcomes need further research, and only a few test cases are shown here. In a follow-up study we will increase the number of experiments to detect the sensitivity of the input parameters. To avoid misunderstanding, we have removed the input-parameter experiments with unknown outcomes and focused on the prediction applications for the -O3 compiler optimization option and mixed-precision programming. Thanks.
30. Line 355: “..form a mixed perturbation environment...” Again, I don’t think there is anything specifically meaningful to this “mixed perturbation’ terminology. The whole earth system model is chaotic and it is affected by hardware/software stack, roundoff error, truncation error, etc.
RE: Thank you for your comment. We have removed the descriptions of “mixed perturbations”. Thanks.
31. Other related work on climate model correctness that is not cited in this paper:
Mahajan et al., “Ensuring statistical reproducibility of ocean model simulations in the age of hybrid computing”, 2021.
Massonnet et al., “Replicability of the EC-Earth3 Earth system model under a change in computing environment”, 2020.
Mahajan et al., “A multivariate approach to ensure statistical reproducibility of climate model simulations”, 2019.
Mahajan et al., “Exploring an Ensemble-Based Approach to Atmospheric Climate Modeling and Testing at Scale”, 2017.
Wan, H., et al., “A new and inexpensive non-bit-for-bit solution reproducibility test based on time step convergence (TSC1.0)”, 2017.
RE: Thank you for your suggestion. We have cited the related work on climate model correctness. Please see lines 64-70.
Technical corrections:
- Line 32: “that facing with the” is awkward, please rephrase. (This also occurs in lines 75, 108, …)
RE: Thank you for your suggestion. We have removed the sentence because we have removed the descriptions of “mixed perturbations”. Thanks.
- Line 98: MPEs are not defined.
RE: Thank you for your suggestion. We have revised the sentence. Please see lines 93-94. Thanks.
- Line 234: “probability density function (PDF) of the CESM” . I assume the intention was of a variable, not the CESM model
RE: Thank you for your suggestion. We have removed the sentence because we have removed the testing datasets of “unacceptable initial conditions”. Thanks.
Citation: https://doi.org/10.5194/gmd-2024-10-AC5
RC3: 'Comment on gmd-2024-10', Anonymous Referee #3, 18 Mar 2024
This manuscript presents interesting and necessary work, and it seems that the authors have done a thorough job with their science. There are two main improvements I believe should be made. First, the writing needs to be edited to be clearer, and the proposed BGRU-AE model needs to be compared to a simpler method to set a baseline. The following is a list of questions/revisions.
- The English needs significant revision, perhaps by a professional service. The paper is filled with nonstandard terminology and unusual or wrong grammar, making comprehension difficult.
- The need is well communicated and important. Processors are becoming increasingly heterogeneous and it’s important to understand how that affects calculations.
- Lines 88-103: While the technical description of the SW26010P processor is sufficient, a little bit more context for the processor and the TaihuLight supercomputer would be appreciated. What was the supercomputer built to do? What are the stated advantages to using this processor, and were they ever meant to run climate models? Some of this is addressed in the summary and discussions but is worth mentioning here.
- Lines 104-114: I’m approaching this paper from a machine learning perspective, so a few more sentences in the background expanding on how CESM-ECT works would be helpful for me.
- Line 140: Why was the ensemble size 151?
- Line 198: The role of the FC layers needs to be explicitly stated. What exactly do they do? Does this change the inputs and outputs of the network during training?
- Section 3.2: A line or two should be added about what software was used to create the model. Pytorch, Tensorflow, etc.?
- Figure 7: Please put a legend for all three elements and axis labels. Alternatively, make the caption more detailed.
- Figure 8-11: Please label the y-axis. Also, it would be helpful to show the number of points in each category. If there are enough points that there is significant overlap, perhaps the plots should be converted to boxplots.
- While it’s very possible this is my fault, I do not understand the paragraph that spans lines 140-151, and subsequently, Table 1, perhaps some context or motivation should be given. Are the test results shown in Section 4 and the following tables a subset of the tests shown in Table 1? If the ECT is failing for 2/3 tests, why do we believe the DCT pass rate instead?
- The language in the table is confusing. For example, in Table 5, the passing rate is 0%, but the ESM-DCT result is “Pass”. I think this means that the ESM-DCT detected all of the anomalies, so the tool itself “passed” the test, but it would help if the language was clearer.
- I will allow the editor to decide if this is required or not, since it will require a lot of work, but I believe the DCT method needs to be compared to something for the tests performed. How does the ECT method do on the tests? Alternatively, how does a simpler machine-learning based method do on the tests? The authors’ implementation of the BGRU-AE model is impressive, but it could be difficult to implement and have heavy training requirements. If local outlier factor (LOF) or an isolation forest does almost as well or as well as the DCT method, it’s worth pointing out since those methods are simpler. Viewing the convergence curves shown in Figure 6, I suspect a simpler non-linear ML model will work also. This, along with English revisions, are what I believe to be the two major improvements that need to be made to the manuscript.
- I am happy to see the code is included with the manuscript. It still might be helpful to have an appendix with a summary of the technical details of the BGRU-AE model. This will help with reproducibility since people can try to implement the model with their preferred computational tools.
Citation: https://doi.org/10.5194/gmd-2024-10-RC3
AC1: 'Reply on RC3', Yangyang Yu, 23 Apr 2024
This manuscript presents interesting and necessary work, and it seems that the authors have done a thorough job with their science. There are two main improvements I believe should be made. First, the writing needs to be edited to be clearer, and the proposed BGRU-AE model needs to be compared to a simpler method to set a baseline. The following is a list of questions/revisions.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We all agree that the comments are very constructive for us to improve the presentation of the MS, and all the major comments and points have been fully addressed in the revision. Specifically, in the revision, we have added:
1) more detailed descriptions of the ESM applications of the Sunway Taihulight system and new Sunway system, 2) more detailed descriptions of the PCA method; 3) the experiments of ESM-DCT and LOF in evaluating the consistency for CESM, etc.
Specific comments:
1. The English needs significant revision, perhaps by a professional service. The paper is filled with nonstandard terminology and unusual or wrong grammar, making comprehension difficult.
RE: Thank you for your valuable and thoughtful comments. We have carefully checked and improved the English writing in the revised manuscript.
2. The need is well communicated and important. Processors are becoming increasingly heterogeneous and it’s important to understand how that affects calculations.
RE: Thank you for your suggestion. In heterogeneous many-core architectures, the general-purpose cores are mainly used for controlling and managing tasks, while the accelerator cores are used for computation. For example, the major computing power in heterogeneous many-core architectures is provided by many-core accelerators such as NVIDIA GPUs and the Sunway CPEs. We have added the descriptions of the heterogeneous many-core architectures. Please see lines 39-43.
3. Lines 88-103: While the technical description of the SW26010P processor is sufficient, a little bit more context for the processor and the TaihuLight supercomputer would be appreciated. What was the supercomputer built to do? What are the stated advantages to using this processor, and were they ever meant to run climate models? Some of this is addressed in the summary and discussions but is worth mentioning here.
RE: Thank you for your suggestion. ESMs are an important application scenario for heterogeneous many-core high-performance computing (HPC) systems. For example, Zhang et al. (2020) enabled highly efficient simulations of the high-resolution (25-km atmosphere and 10-km ocean) CESM on the heterogeneous Sunway TaihuLight. Gu et al. (2022) established a non-hydrostatic global atmospheric modeling system at 3 km horizontal resolution with aerosol feedbacks on the heterogeneous new Sunway system. Zhang et al. (2023) developed a series of high-resolution CESM configurations with up to 5-km atmosphere and 3-km ocean resolution to capture major weather-climate extremes on the heterogeneous new Sunway system. We have added the descriptions of the applications of the Sunway TaihuLight and the new Sunway system. Please see lines 49-54. Thanks.
4. Lines 104-114: I’m approaching this paper from a machine learning perspective, so a few more sentences in the background expanding on how CESM-ECT works would be helpful for me.
RE: Thank you for your suggestion. The CESM-ECT uses PCA to obtain linearly independent feature vectors from the control ensemble simulations; new simulations are converted to scores via these feature vectors, and an overall pass or fail result is issued from the scores. We have added the descriptions of PCA. Please see lines 114-120. Thanks.
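As a rough sketch of that procedure (our illustration with random data, not the CESM-ECT code; the component count and the 2-sigma cutoff are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
ensemble = rng.normal(size=(151, 97))   # trusted control ensemble (members x variables)
mu, sd = ensemble.mean(0), ensemble.std(0)

pca = PCA(n_components=50)
scores = pca.fit_transform((ensemble - mu) / sd)
score_sd = scores.std(0)                # per-PC spread of the control ensemble

new_run = rng.normal(size=(1, 97))      # simulation from the new environment
new_scores = pca.transform((new_run - mu) / sd)
flagged = np.abs(new_scores / score_sd) > 2.0  # PC scores outside the control spread
print(int(flagged.sum()), "problematic PC scores")
```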
5. Line 140: Why was the ensemble size 151?
RE: Thank you for your comment. We select the ensemble size of the training sets and the optimal model parameters as those for which the accuracy on the testing datasets reaches its maximum value. We have added the experiments of determining the ensemble size of the training sets and the model parameters. Please see lines 241-249.
6. Line 198: The role of the FC layers needs to be explicitly stated. What exactly do they do? Does this change the inputs and outputs of the network during training?
RE: Thank you for your comment. For the BGRU-AE model, the input data are 101-dimensional vectors. The BGRU output vectors have shape [number of layers × number of directions, hidden size]. We use the FC layers to convert the output vectors of the BGRU to align with the input data and compute the loss. We have added the role of the FC layers. Please see lines 171-172. Thanks.
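A minimal PyTorch sketch of this shape alignment (the layer sizes are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class BGRUAE(nn.Module):
    def __init__(self, n_features=101, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        # The bidirectional GRU emits 2*hidden features per step; the FC layer
        # projects them back to the 101-dimensional input so the reconstruction
        # loss can be computed against the original vectors.
        self.fc = nn.Linear(2 * hidden, n_features)

    def forward(self, x):            # x: (batch, seq_len, n_features)
        z, _ = self.encoder(x)       # (batch, seq_len, 2*hidden)
        y, _ = self.decoder(z)
        return self.fc(y)            # back to (batch, seq_len, n_features)

model = BGRUAE()
x = torch.randn(8, 1, 101)
loss = nn.functional.mse_loss(model(x), x)  # reconstruction objective
```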
7. Section 3.2: A line or two should be added about what software was used to create the model. Pytorch, Tensorflow, etc.?
RE: Thank you for your good suggestion. We use PyTorch to implement the model. We have added the description of the software we used. Please see lines 200-201. Thanks.
8. Figure 7: Please put a legend for all three elements and axis labels. Alternatively, make the caption more detailed.
RE: Thank you for your suggestion. The figure shows the PDF of the reconstruction errors of the training datasets, represented by the blue line; the threshold of the reconstruction errors, 0.05, is represented by the red line. We have made the figure caption more detailed. Thanks.
9. Figure 8-11: Please label the y-axis. Also, it would be helpful to show the number of points in each category. If there are enough points that there is significant overlap, perhaps the plots should be converted to boxplots.
RE: Thank you for your suggestion. We have labeled the y-axes in Figures 10-12. Thanks.
10. While it’s very possible this is my fault, I do not understand the paragraph that spans lines 140-151, and subsequently, Table 1, perhaps some context or motivation should be given. Are the test results shown in Section 4 and the following tables a subset of the tests shown in Table 1? If the ECT is failing for 2/3 tests, why do we believe the DCT pass rate instead?
RE: Thank you for your comment. The 5VCCM is a simple nonlinear coupled model; the experimental result is an example and is only used as the first step in testing the ESM-DCT tool. We have added the descriptions of the ESM-DCT for the 5VCCM. Please see lines 212-249.
11. The language in the table is confusing. For example, in Table 5, the passing rate is 0%, but the ESM-DCT result is “Pass”. I think this means that the ESM-DCT detected all of the anomalies, so the tool itself “passed” the test, but it would help if the language was clearer.
RE: Thank you for your comment. It should be “Failure”. We have revised the table. Please see Table 6. Thanks.
12. I will allow the editor to decide if this is required or not, since it will require a lot of work, but I believe the DCT method needs to be compared to something for the tests performed. How does the ECT method do on the tests? Alternatively, how does a simpler machine-learning based method do on the tests? The authors’ implementation of the BGRU-AE model is impressive, but it could be difficult to implement and have heavy training requirements. If local outlier factor (LOF) or an isolation forest does almost as well or as well as the DCT method, it’s worth pointing out since those methods are simpler. Viewing the convergence curves shown in Figure 6, I suspect a simpler non-linear ML model will work also. This, along with English revisions, are what I believe to be the two major improvements that need to be made to the manuscript.
RE: Thank you for your good suggestion. We compare the accuracy and computation time on the CESM between the local outlier factor (LOF) machine learning method and the ESM-DCT. The accuracy of the ESM-DCT, 98.6%, is slightly better than that of the LOF, 96.2%, when the accuracy on the testing datasets reaches its maximum value. The ESM-DCT also has higher computational efficiency than the LOF. Please see Section 4.5. Thanks.
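For reference, a baseline along these lines can be set up in a few lines of scikit-learn (our sketch with synthetic data; the n_neighbors value is an assumption):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
train = rng.normal(size=(151, 97))                     # trusted ensemble
consistent = rng.normal(size=(20, 97))                 # runs expected to pass
distinguishable = rng.normal(loc=3.0, size=(20, 97))   # runs expected to fail

# novelty=True enables predict() on unseen data:
# +1 = inlier ("pass"), -1 = outlier ("fail")
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(train)
print(lof.predict(consistent))
print(lof.predict(distinguishable))
```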
13. I am happy to see the code is included with the manuscript. It still might be helpful to have an appendix with a summary of the technical details of the BGRU-AE model. This will help with reproducibility since people can try to implement the model with their preferred computational tools.
RE: Thank you for your suggestion. We added the guidelines for the ESM-DCT software tool in the code. Please see the code in https://doi.org/10.5281/zenodo.10972563. Thanks.
Citation: https://doi.org/10.5194/gmd-2024-10-AC1
AC6: 'Reply on RC3', Yangyang Yu, 24 May 2024
This manuscript presents interesting and necessary work, and it seems that the authors have done a thorough job with their science. There are two main improvements I believe should be made. First, the writing needs to be edited to be clearer, and the proposed BGRU-AE model needs to be compared to a simpler method to set a baseline. The following is a list of questions/revisions.
RE: Thanks for the reviewer’s thorough examination of our manuscript (MS) and positive comments. We all agree that the comments are very constructive for us to improve the presentation of the MS, and all the major comments and points have been fully addressed in the revision. Specifically, in the revision, we have added:
1) more detailed descriptions of the ESM applications of the Sunway Taihulight system and new Sunway system, 2) more detailed descriptions of the PCA method; 3) the experiments of ESM-DCT and LOF in evaluating the consistency for CESM, etc.
The point-by-point replies follow.
Specific comments:
1. The English needs significant revision, perhaps by a professional service. The paper is filled with nonstandard terminology and unusual or wrong grammar, making comprehension difficult.
RE: Thank you for your valuable and thoughtful comments. We have carefully checked and improved the English writing in the revised manuscript.
2. The need is well communicated and important. Processors are becoming increasingly heterogeneous and it’s important to understand how that affects calculations.
RE: Thank you for your suggestion. In heterogeneous many-core architectures, the general-purpose cores are mainly used for controlling and managing tasks, while the accelerator cores are used for computation. For example, the major computing power in heterogeneous many-core architectures is provided by many-core accelerators such as NVIDIA GPUs and the Sunway CPEs. We have added the descriptions of the heterogeneous many-core architectures. Please see lines 35-39.
3. Lines 88-103: While the technical description of the SW26010P processor is sufficient, a little bit more context for the processor and the TaihuLight supercomputer would be appreciated. What was the supercomputer built to do? What are the stated advantages to using this processor, and were they ever meant to run climate models? Some of this is addressed in the summary and discussions but is worth mentioning here.
RE: Thank you for your suggestion. ESMs are an important application scenario for heterogeneous many-core high-performance computing (HPC) systems. For example, Zhang et al. (2020) enabled highly efficient simulations of the high-resolution (25-km atmosphere and 10-km ocean) CESM on the heterogeneous Sunway TaihuLight. Gu et al. (2022) established a non-hydrostatic global atmospheric modeling system at 3 km horizontal resolution with aerosol feedbacks on the heterogeneous new Sunway system. Zhang et al. (2023) developed a series of high-resolution CESM configurations with up to 5-km atmosphere and 3-km ocean resolution to capture major weather-climate extremes on the heterogeneous new Sunway system. We have added the descriptions of the applications of the Sunway TaihuLight and the new Sunway system. Please see lines 44-50. Thanks.
4. Lines 104-114: I’m approaching this paper from a machine learning perspective, so a few more sentences in the background expanding on how CESM-ECT works would be helpful for me.
RE: Thank you for your suggestion. For the CESM-ECT, independent feature vectors are obtained from the control ensemble using the PCA method; new simulations are converted to scores via these feature vectors, and an overall pass or fail result is issued from the scores. The CESM-ECT evaluates 3 simulations for each test scenario and issues an overall failure (meaning the results are statistically distinguishable) if more than two of the PC scores are problematic in at least two of the test runs. We have added the descriptions of PCA. Please see lines 332-336. Thanks.
5. Line 140: Why was the ensemble size 151?
RE: Thank you for your comment. We select the ensemble size of the training sets and the optimal model parameters as those for which the accuracy on the testing datasets reaches its maximum value. The accuracy of the testing datasets for different ensemble sizes is shown in Table 1; the ensemble size of the training datasets of the ESM-DCT for the 5VCCM is 120 when the accuracy of the testing datasets reaches its maximum value. We have added the experiments of determining the ensemble size of the training sets and the model parameters. Please see lines 211-219. Thanks.
6. Line 198: The role of the FC layers needs to be explicitly stated. What exactly do they do? Does this change the inputs and outputs of the network during training?
RE: Thank you for your comment. For the BGRU-AE model, the input data are 97-dimensional vectors. The BGRU output vectors have shape [number of layers × number of directions, hidden size]. We use the FC layers to convert the output vectors of the BGRU to align with the input data and compute the loss. We have added the role of the FC layers. Please see lines 142-143. Thanks.
7. Section 3.2: A line or two should be added about what software was used to create the model. Pytorch, Tensorflow, etc.?
RE: Thank you for your good suggestion. We use PyTorch to implement the model. We have added the description of the software we used. Please see lines 170-171. Thanks.
8. Figure 7: Please put a legend for all three elements and axis labels. Alternatively, make the caption more detailed.
RE: Thank you for your suggestion. The figure shows the PDF of the reconstruction errors of the training datasets, represented by the blue line; the threshold of the reconstruction errors, 0.05, is represented by the red line. We have made the figure caption more detailed. Thanks.
9. Figure 8-11: Please label the y-axis. Also, it would be helpful to show the number of points in each category. If there are enough points that there is significant overlap, perhaps the plots should be converted to boxplots.
RE: Thank you for your suggestion. We have labeled the y-axes in Figures 9-11. Thanks.
10. While it’s very possible this is my fault, I do not understand the paragraph that spans lines 140-151, and subsequently, Table 1, perhaps some context or motivation should be given. Are the test results shown in Section 4 and the following tables a subset of the tests shown in Table 1? If the ECT is failing for 2/3 tests, why do we believe the DCT pass rate instead?
RE: Thank you for your comment. 5VCCM is a simple nonlinear coupled model, the experimental result is an example and is only used as the first step in testing the ESM-DCT tool. We have added the descriptions of ESM-DCT for 5VCCM. Please see lines 182-219.
11. The language in the table is confusing. For example, in Table 5, the passing rate is 0%, but the ESM-DCT result is “Pass”. I think this means that the ESM-DCT detected all of the anomalies, so the tool itself “passed” the test, but it would help if the language was clearer.
RE: Thank you for your comment. It should be “Failure”. We have revised the table. Please see Table 6. Thanks.
12. I will allow the editor to decide if this is required or not, since it will require a lot of work, but I believe the DCT method needs to be compared to something for the tests performed. How does the ECT method do on the tests? Alternatively, how does a simpler machine-learning based method do on the tests? The authors’ implementation of the BGRU-AE model is impressive, but it could be difficult to implement and have heavy training requirements. If local outlier factor (LOF) or an isolation forest does almost as well or as well as the DCT method, it’s worth pointing out since those methods are simpler. Viewing the convergence curves shown in Figure 6, I suspect a simpler non-linear ML model will work also. This, along with English revisions, are what I believe to be the two major improvements that need to be made to the manuscript.
RE: Thank you for your good suggestion. We compare the accuracy and computation time on the CESM between the local outlier factor (LOF) machine learning method, the CESM-ECT, and the ESM-DCT. The accuracy of the ESM-DCT, 98.4%, is slightly better than that of the LOF, 96.2%, and of the CESM-ECT, 98.2%, when the accuracy on the testing datasets reaches its maximum value. The ESM-DCT also has higher computational efficiency than the LOF and the CESM-ECT. Please see Section 4.5. Thanks.
13. I am happy to see the code is included with the manuscript. It still might be helpful to have an appendix with a summary of the technical details of the BGRU-AE model. This will help with reproducibility since people can try to implement the model with their preferred computational tools.
RE: Thank you for your suggestion. We added the guidelines for the ESM-DCT software tool in the code. Please see the code in https://doi.org/10.5281/zenodo.10972563. Thanks.
Citation: https://doi.org/10.5194/gmd-2024-10-AC6