https://doi.org/10.5194/gmd-2024-10
Submitted as: methods for assessment of models | 29 Jan 2024
Status: a revised version of this preprint is currently under review for the journal GMD.

A Deep Learning-Based Consistency Test Approach for Earth System Models on Heterogeneous Many-Core Systems

Yangyang Yu, Shaoqing Zhang, Haohuan Fu, Dexun Chen, Yang Gao, Xiaopei Lin, Zhao Liu, and Xiaojing Lv

Abstract. Physical and thermal limits of semiconductor technology require the adoption of heterogeneous architectures in supercomputers to sustain continued growth in computing performance. The coexistence of general-purpose cores and accelerator cores, which usually employ different hardware architectures, can lead to bit-level differences, especially when we try to maximize performance on both kinds of cores. Through temporal integration, such differences lead to unavoidable computational perturbations, which can blend with software or human errors. Software correctness verification, as a form of quality assurance, is therefore a critically important step in the development and optimization of Earth system models (ESMs) on heterogeneous many-core systems, where perturbations from software changes and hardware updates are mixed. We have developed a deep learning-based consistency test approach for ESMs, referred to as ESM-DCT. The ESM-DCT is based on an unsupervised bidirectional gated recurrent unit-autoencoder (BGRU-AE) model, which can still detect software or human errors when hardware-related perturbations are taken into account. We use the Community Earth System Model (CESM) on the new Sunway system as an example of a large-scale ESM to evaluate the ESM-DCT. The results show that, when faced with the mixed perturbations caused by hardware designs and software changes in heterogeneous computing, the ESM-DCT can detect software or human errors when determining whether the model simulation is consistent with the original results from homogeneous computing. The ESM-DCT tool provides an efficient and objective approach for verifying the reliability of the development and optimization of scientific computing models on heterogeneous many-core systems.


Status: final response (author comments only)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on gmd-2024-10', Anonymous Referee #1, 05 Mar 2024
    • AC2: 'Reply on RC1', Yangyang Yu, 23 Apr 2024
  • RC2: 'Comment on gmd-2024-10', Anonymous Referee #2, 11 Mar 2024
    • AC3: 'Reply on RC2', Yangyang Yu, 23 Apr 2024
  • RC3: 'Comment on gmd-2024-10', Anonymous Referee #3, 18 Mar 2024
    • AC1: 'Reply on RC3', Yangyang Yu, 23 Apr 2024

Viewed

Total article views: 374 (including HTML, PDF, and XML)
  • HTML: 293
  • PDF: 54
  • XML: 27
  • Total: 374
  • BibTeX: 14
  • EndNote: 11
Views and downloads (calculated since 29 Jan 2024)

Viewed (geographical distribution)

Total article views: 368 (including HTML, PDF, and XML), of which 368 have geography defined and 0 are of unknown origin.
Latest update: 26 Apr 2024
Short summary
Hardware-related perturbations caused by heterogeneous many-core architectures can blend with software or human errors, which can affect the accuracy of model consistency verification. We develop a deep learning-based consistency test tool for ESMs on heterogeneous systems (ESM-DCT) and evaluate it with CESM on the new Sunway system. The ESM-DCT can detect software or human errors when hardware-related perturbations are taken into account.