Data-driven discovery and model reduction methods for the atmospheric effects of high altitude emissions

van 't Hoff, Jurriaan A.; van Cranenburgh, Tom S.; Fasel, Urban; Dedoussi, Irene C.

doi:10.5194/gmd-19-1867-2026

Articles | Volume 19, issue 5

https://doi.org/10.5194/gmd-19-1867-2026


Methods for assessment of models

04 Mar 2026


Abstract

Chemistry transport models play a crucial role in the evaluation of the effect of anthropogenic emissions on the atmosphere and climate, but they come with high computational costs and require specialized know-how. This renders them impractical for applications in multidisciplinary optimisation, or regulatory and operational decision-making processes where environmental effects are to be considered. Such applications require computationally efficient surrogate models of the complex chemistry transport models. Here we investigate the use of data-driven discovery and reduced-order modelling methods for this purpose. Specifically, we examine the dynamic mode decomposition (DMD) and proper orthogonal decomposition coupled with the sparse identification of non-linear dynamics (POD-SINDy). We evaluate their ability to reconstruct and forecast changes in the distribution of ozone in response to the introduction of supersonic aircraft as modelled by the GEOS-Chem chemistry transport model. Of the tested methods, we find that optimized DMD and bagging optimized DMD with constrained eigenvalues perform best. These methods can reconstruct and forecast full-atmospheric ozone responses for up to several years without losing stability, at smaller errors than estimates using the spatio-temporal mean of the data. On average, the constrained optimized DMD method reduces the reconstruction error by 63.5 % and that of forecasting by 25.8 % compared to the spatio-temporal mean. For the constrained bagging optimized DMD these reductions are 45.0 % and 23.1 %, respectively. The resulting change in global ozone column, calculated from the reconstructed atmospheres, has an error smaller than 10 %. This is achieved while reducing the computational and storage requirements by several orders of magnitude, which may be a worthwhile tradeoff for some applications.


Received: 11 Jun 2025 – Discussion started: 15 Oct 2025 – Revised: 07 Jan 2026 – Accepted: 16 Jan 2026 – Published: 04 Mar 2026

1 Introduction

Chemistry transport models (CTMs) and climate chemistry models (CCMs) are critical to our understanding of how anthropogenic emissions affect the environment and climate. These models simulate the chemistry, transport, and deposition of hundreds of chemical species in the atmosphere under evolving meteorological conditions, allowing them to capture complex chemical feedback mechanisms and responses. This level of complexity requires specialised expertise and comes with a considerable computational cost, which is why these models usually require access to dedicated high-performance computer systems. Despite this barrier, CTMs and CCMs are widely used to study a variety of problems in atmospheric composition and chemistry, although their use is often limited to the evaluation of a handful of case-studies.

In the consideration of new technologies, such as new aircraft concepts, multiple parties have an interest in evaluations performed with CTMs and CCMs. Engineers may want to integrate environmental assessment into multidisciplinary optimisation (MDO) approaches to minimise environmental effects in their design process. Regulatory bodies may be interested in evaluating the effectiveness of certain regulations, and operators may want to know if environmental effects can be reduced through operations. These parties often take iterative approaches that may require hundreds or even thousands of model evaluations. Considering that CTMs and CCMs often require days, if not weeks, to perform a single evaluation, they are impractical to integrate in these approaches, even if the technical know-how to use them is already present. Therefore, to support the integration of environmental evaluations into MDO approaches or regulatory and operational decision-making processes, we need easy-to-use alternatives that replicate or estimate the output of these models at a fraction of their computational cost.

Recent developments in the field of machine learning have led to new methods that may be suitable for low-cost surrogate models, but many of them (e.g. neural networks) require vast amounts of data, which is impractical to generate with CTMs and CCMs. Data-driven model discovery and reduced-order modelling methods, which require less data to operate and are more interpretable than “black-box” neural network models, can be suitable alternatives. Two methods of interest are Dynamic Mode Decomposition (DMD) and Proper Orthogonal Decomposition (POD) coupled with the Sparse Identification of Non-Linear Dynamics (SINDy). These methods extract low-dimensional features from data that can be used for analysis and to construct reduced-order models. This application is similar to linear inverse modelling methods, which have been used before to model coupled atmospheric-oceanic systems (Perkins and Hakim, 2020), atmospheric flows (Kwasniok, 2022) and surface temperatures (Newman, 2013), but DMD and POD-SINDy are more widely used in the field of fluid mechanics (e.g. Taira et al., 2017, 2020; Champion et al., 2019; Khodkar and Hassanzadeh, 2021; Callaham et al., 2022). Atmospheric processes share some similarities with fluid flows, and earlier work has shown that these methods may be transferable. For example, Yang et al. (2024) have shown that dimensionality reduction techniques can be combined with SINDy to reproduce atmospheric data from CTMs, and Velegar et al. (2024) have shown that DMD methods can be used to produce efficient reduced-order models to forecast the concentration of several tropospheric chemicals over the course of multiple months. These initial explorations show promising results, but the application of the methods is still limited to small subsets of CTM data.

We expand on this exploration by assessing the suitability of the DMD and POD-SINDy methods on large-scale and comprehensive datasets that describe chemistry and transport processes across the entire atmosphere. Where earlier work assessed the capability of these methods to reproduce parts of unperturbed atmospheres, we evaluate their use on data describing perturbations over the entire atmosphere. Specifically, we apply the methods to datasets that describe the change in the ozone (O₃) distribution in response to different supersonic aircraft scenarios (van 't Hoff et al., 2024 b, 2025 c). The aircraft-attributable ozone effect is identified as the difference between two independent CTM evaluations (one with supersonic aircraft emissions, and one without). If data-driven methods can reconstruct and predict these differences, they may be valuable to support the analysis of adoption scenarios and a more efficient exchange and embedding of the spatiotemporal data. If stable models can be found, this may also open up avenues for future research to directly use these methods to support MDO applications by combining multiple dynamical models. In this work, we aim to investigate the potential of these methods through a stepwise approach, starting from their interpretability for the analysis of atmospheric processes, followed by their reproductions of data in specific locations and the entire atmosphere. We also assess the possibility of calculating other metrics from the reproduced atmosphere, a capability that may be valuable for several applications.

2 Case studies

We use seven datasets that describe the changes in global ozone distribution in response to the adoption of supersonic aircraft. These datasets were generated on the Dutch national supercomputer Snellius, using an estimated 493 920 CPU-hours. From van 't Hoff et al. (2024 b) we use four datasets which describe the response to emissions equivalent to 8 Tg of annual fuel consumption by supersonic aircraft, originating from different flight corridors. These are the transatlantic flight corridor (TAC) and the corridor over the southern Arabian Sea (SAS). In both cases we use datasets with emission altitudes of 16.2 and 20.4 km. The datasets are therefore denoted as SAS162, TAC162, SAS204, and TAC204, where the suffix indicates the emission altitude. These datasets are generated with version 13.3.1 of the GEOS-Chem CTM and represent 10 years of daily changes in global ozone mass at a horizontal resolution of 4°×5° (latitude × longitude) with 72 vertical pressure levels.

We also use three other datasets from van 't Hoff et al. (2025 c). These also describe changes in the distribution of the ozone mass over 10 years, but in response to emissions from a global supersonic aircraft fleet rather than from a regional source. Due to the global distribution of these emissions we expect that these datasets may exhibit more challenging dynamical behaviour compared to the regional emission datasets. These datasets are generated with version 14.3.0 of the high-performance GEOS-Chem model, with a horizontal resolution of 2°×2.5° and 72 vertical pressure levels. Following the notation of van 't Hoff et al. (2025 c) we denote these datasets as S1, S2, and S3. These represent the change in ozone in response to a theoretical supersonic fleet operating at Mach 2.0 with cruise altitudes of 16.5 to 19.5 km, one with increased NOₓ emissions, and one with a lower cruise altitude and speed (Mach 1.6, 14–16.5 km), respectively. Detailed descriptions of the emission scenarios and chemistry transport model setups are provided in the related works (van 't Hoff et al., 2024 b, 2025 c).

3 Methodology

The methodology for this work is outlined in Fig. 1. Subsequent subsections will discuss the preprocessing of data and the model discovery methods.

https://gmd.copernicus.org/articles/19/1867/2026/gmd-19-1867-2026-f01

Figure 1. Methodology overview. In step 1 the effects of emissions are isolated by taking the difference between two GEOS-Chem atmosphere simulations, of which one is affected by the additional emissions. The resulting spatiotemporal datasets describe the daily average changes in ozone mass distribution for a period of 10 years. In step 2 the data is preprocessed and organised as a matrix of snapshots prior to applying data-driven methods (step 3). In step 4 the data-driven methods are used to forecast future behaviour of the ozone response, based on which the data-driven methods are assessed.

3.1 Data description and processing

The datasets that we use contain four-dimensional (time, longitude, altitude, latitude) descriptions of the daily-averaged changes in ozone mass over the course of 10 years. We take several pre-processing steps to reduce the complexity of our data. All datasets exhibit an initial transient response to the introduction of the supersonic aircraft emissions, followed by a stabilized response where the new atmospheric quasi-equilibrium is reached. We first isolate the stable response from the data by removing the transient part, which spans the first five years. Secondly, we reduce the dimensionality of the dataset by calculating the longitudinally-averaged ozone response. Longitudinal averaging is standard practice in studies of the environmental effects of supersonic aircraft, because in the multi-annual timespans considered, the atmosphere may be considered as well-mixed over the longitude (Zhang et al., 2021 a, 2023; Eastham et al., 2022). Thirdly, the GEOS-Chem model uses a non-uniform vertical grid that is denser near the surface. This is not uncommon for CTMs, but in this case study the majority of the ozone response occurs in the coarser stratospheric grid. To avoid over-representation of the near-surface conditions in the data, we therefore interpolate the data to a normalized vertical grid with a resolution of 1 km. Finally, we discard the grid cells above 51 km altitude, as this is the vertical limit of GEOS-Chem's extensive chemical solver for ozone (Eastham et al., 2014).
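The pre-processing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `preprocess_ozone_response`, the axis order (time, longitude, altitude, latitude), the use of layer mid-point altitudes, and the choice of plain linear interpolation are all hypothetical. The text does not state whether a mass-conserving regridding is used.

```python
import numpy as np

def preprocess_ozone_response(delta_o3, z_mid_km, n_transient,
                              z_top_km=51.0, dz_km=1.0):
    """Reduce a 4-D ozone response to a zonal mean on a uniform vertical grid.

    delta_o3    : array (time, longitude, altitude, latitude), assumed axis order
    z_mid_km    : array (altitude,), increasing layer mid-point altitudes in km
    n_transient : number of leading daily snapshots to discard (transient part)
    """
    # Step 1: keep only the stabilised response.
    stable = delta_o3[n_transient:]

    # Step 2: longitudinal average -> (time, altitude, latitude).
    zonal = stable.mean(axis=1)

    # Steps 3 and 4: uniform 1 km grid (cell centres 0.5, 1.5, ... km) that
    # stops below z_top_km, so cells above the limit are never created.
    z_uniform = np.arange(0.5 * dz_km, z_top_km, dz_km)

    # Linear interpolation of every vertical profile (assumed scheme).
    regridded = np.apply_along_axis(
        lambda profile: np.interp(z_uniform, z_mid_km, profile), 1, zonal
    )
    return regridded, z_uniform
```

For daily data with a five-year transient, `n_transient` would be roughly 5 × 365; the exact handling of leap days is left to the caller.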

The data may be considered as a combination of a dominant steady-state change in the ozone distribution and a dynamic component that is driven by seasonal and day-to-day changes in meteorology. Across all datasets, the steady-state response exhibits similar characteristics, with increases of ozone in the lower stratosphere and depletion of ozone in the upper stratosphere, yielding a net loss of the global ozone column. For detailed explanations of the mechanisms behind these responses, we refer to the related articles (van 't Hoff et al., 2024 b, 2025 c). Figure 2 shows the average steady state for one of the datasets (TAC204) alongside four seasonal averages, highlighting that the ozone response shifts seasonally. This shift is part of the dynamic component, which we isolate by subtracting the temporal mean of the data, similar to Velegar et al. (2024) and Yang et al. (2024). This mean is re-added after the dynamics are predicted to recompose the full signal. The data is considered as a matrix of snapshots X:

$$ X = \begin{bmatrix} | & | & & | \\ x(t_{1}) & x(t_{2}) & \dots & x(t_{m}) \\ | & | & & | \end{bmatrix} . \tag{1} $$

The matrix $X \in \mathbb{R}^{n \times m}$ represents the ozone response, with n representing the product of the latitude and altitude dimensions and m the number of daily snapshots. In this representation, the latitude and altitude dimensions of each snapshot are flattened into a single column $x(t_{k})$, $k = 1, \dots, m$. To evaluate the modelling of the entire atmosphere, we also calculate the mean global change in the ozone columns in Dobson Units (DU). This metric is also commonly used in studies of the environmental effects of supersonic aircraft (Zhang et al., 2021 a, b, 2023; Eastham et al., 2022; van 't Hoff et al., 2025 c).
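As an illustration of the snapshot matrix and the mean removal described above, the following sketch flattens a (time, altitude, latitude) array into X, subtracts the temporal mean, and recomposes the full field afterwards. The function names and the axis order are assumptions made for this example only.

```python
import numpy as np

def to_snapshot_matrix(field):
    """Flatten (time, altitude, latitude) data into a mean-removed matrix.

    Returns X with shape (n, m), where n = altitude * latitude and m = time,
    and the temporal mean with shape (n, 1).
    """
    m = field.shape[0]
    # Each snapshot becomes one column of the matrix.
    x_full = field.reshape(m, -1).T
    # The temporal mean approximates the steady-state response.
    x_mean = x_full.mean(axis=1, keepdims=True)
    return x_full - x_mean, x_mean

def from_snapshot_matrix(X, x_mean, grid_shape):
    """Re-add the temporal mean and restore (time, altitude, latitude)."""
    full = X + x_mean
    return full.T.reshape(-1, *grid_shape)
```

A reduced-order model would then be fitted to X alone, and `from_snapshot_matrix` applied to its predictions to recover the full signal.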

https://gmd.copernicus.org/articles/19/1867/2026/gmd-19-1867-2026-f02

Figure 2. Average ozone response of the TAC204 dataset over different timespans in tons of ozone. The leftmost plot shows the average response over all 5 years of data, and the other four show averages taken over 3-monthly snapshots from the entire dataset. For example, the rightmost figure (Average (OND)) shows the average ozone response for the months of October, November, and December across the 5 years of data.
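The mean global ozone column change in Dobson Units mentioned in Sect. 3.1 can be estimated from the total ozone mass change. The sketch below is one possible conversion and not necessarily the procedure used by the authors. It assumes a spherical Earth with a mean radius of 6371 km, mass given in kilograms (data in tons must be converted first), and a total taken over all grid cells (a longitudinal mean has to be multiplied by the number of longitudes). The function name is hypothetical.

```python
import numpy as np

M_O3 = 47.998e-3               # molar mass of ozone, kg per mol
N_A = 6.02214076e23            # Avogadro constant, molecules per mol
DU_MOLEC_PER_M2 = 2.6867e20    # molecules per m^2 in one Dobson Unit
EARTH_AREA_M2 = 4.0 * np.pi * 6.371e6 ** 2  # assumed spherical Earth

def global_mean_column_du(total_mass_kg, area_m2=EARTH_AREA_M2):
    """Convert a global ozone mass change (kg) to a mean column change (DU).

    The area-weighted mean of all columns equals the total number of
    molecules divided by the total surface area.
    """
    molecules = np.asarray(total_mass_kg) / M_O3 * N_A
    return molecules / area_m2 / DU_MOLEC_PER_M2
```

Passing an array with one total per daily snapshot returns a time series of the mean column change.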
