Towards variance-conserving reconstructions of climate indices with Gaussian process regression in an embedding space

Klockmann, Marlene; von Toussaint, Udo; Zorita, Eduardo

doi:https://doi.org/10.5194/gmd-17-1765-2024

Articles | Volume 17, issue 4

https://doi.org/10.5194/gmd-17-1765-2024

Special issue:

Benchmark datasets and machine learning algorithms for Earth...

https://doi.org/10.5194/gmd-17-1765-2024

Articles | Volume 17, issue 4

Development and technical paper

28 Feb 2024

Development and technical paper |

| 28 Feb 2024

Towards variance-conserving reconstructions of climate indices with Gaussian process regression in an embedding space

Marlene Klockmann, Udo von Toussaint, and Eduardo Zorita

Abstract

We present a new framework for the reconstruction of climate indices based on proxy data such as tree rings. The framework is based on the supervised learning method Gaussian Process Regression (GPR) and aims at preserving the amplitude of past climate variability. It can adequately handle noise-contaminated proxies and variable proxy availability over time. To this end, the GPR is formulated in a modified input space, termed here embedding space. We test the new framework for the reconstruction of the Atlantic multi-decadal variability (AMV) in a controlled environment with pseudo-proxies derived from coupled climate-model simulations. In this test environment, the GPR outperforms benchmark reconstructions based on multi-linear principal component regression. On AMV-relevant timescales, i.e. multi-decadal, the GPR is able to reconstruct the true amplitude of variability even if the proxies contain a realistic non-climatic noise signal and become sparser back in time. Thus, we conclude that the embedded GPR framework is a highly promising tool for climate-index reconstructions.

Download & links

Article (PDF, 11134 KB)

Supplement (1236 KB)

Download & links

Article (11134 KB)
Full-text XML
Supplement (1236 KB)
BibTeX
EndNote

How to cite.

Received: 04 Feb 2022 – Discussion started: 05 Apr 2022 – Revised: 31 Aug 2023 – Accepted: 17 Jan 2024 – Published: 28 Feb 2024

1 Introduction

Climate indices are important measures to describe the evolution of climate on regional, hemispheric or global scales in a condensed way. They reveal relevant timescales of climate variability and, in some cases, also subspaces that are important for predictability. Paramount examples are the El Niño–Southern Oscillation, the North Atlantic Oscillation and the Atlantic multi-decadal variability (AMV). To understand whether the typical timescales and magnitude of climate variability have been stationary over time or whether they have changed, e.g. with anthropogenic climate change, we need a long-term perspective on these climate indices. The index time series must not only cover the historical period of the past 150 years but also the period of interest, e.g. the past 1000–2000 years (Common Era). To obtain these long time series we need information from so-called climate proxies (e.g. tree rings and sediment cores) in combination with sophisticated statistical models to reconstruct the climate indices from the proxy data. We present a new machine learning framework for climate-index reconstructions and test its skill for reconstructing the AMV.

The AMV is an important index that describes the North Atlantic climate variability on decadal and longer timescales. Different definitions of the AMV have been developed over time, but the basic definition relies on the low-pass-filtered spatial average of sea surface temperature anomalies over the North Atlantic. Observations starting in about 1850 indicate that the AMV varies on typical timescales of 30–60 years. The state of the AMV plays a key role for many relevant climate phenomena such as Arctic sea-ice anomalies (Miles et al., 2014), North American and European summer climate, hurricane seasons and Sahel rainfall (Zhang and Delworth, 2006; Zhang et al., 2007). Both atmospheric as well as oceanic processes have been suggested as possible drivers of the AMV (e.g. Clement et al., 2015; Zhang et al., 2019; Yan et al., 2019; Garuba et al., 2018). It is not clear how much of the AMV is generated by internal climate variability and how much is generated by changes in external radiative forcing, i.e. volcanic and anthropogenic aerosols, solar insolation and greenhouse gas concentrations (Haustein et al., 2019; Mann et al., 2021).

The observational period of approximately 150 years is not sufficient to provide a long-term perspective on the AMV or in fact any climate index that describes variability on multi-decadal and longer timescales. Therefore, longer time series are needed. These time series are typically derived from climate reconstructions based on climate proxies such as tree rings, bivalves or coral skeletons (e.g. Gray et al., 2004; Mann et al., 2008; Svendsen et al., 2014; Wang et al., 2017; Singh et al., 2018). This kind of reconstruction is based on statistical models that link the target index with proxy time series, using the observational period to calibrate their parameters. The trained models then use the much longer proxy time series as input to provide an estimation of the target index in the past.

Existing AMV reconstructions disagree on the amplitude and timing of AMV variability, especially prior to the beginning of the 18th century (Wang et al., 2017). As a consequence, they also provide conflicting views on the AMV response to external forcing (Knudsen et al., 2014; Wang et al., 2017; Zhang et al., 2019; Mann et al., 2022). Possible reasons for this disagreement are numerous. In general, the reconstructed variability will depend on the predictor data, i.e. the number, quality and locations of the proxies. Previous AMV reconstructions differed in their employed proxy networks and types, using only terrestrial or also marine records. As an example, including marine records seems to yield better reconstructions of AMV variability (e.g. Saenger et al., 2009; Mette et al., 2021). Proxy data are only available at a limited number of locations on the globe (see e.g. PAGES2k, 2017), and their availability decreases further back in time. Proxies also contain varying amounts of non-climatic signals, i.e. noise.

Existing reconstruction methods range from very simple linear methods such as composite plus scaling (Jones and Mann, 2004) or principal component analysis (e.g. Gray et al., 2004), over more complex linear methods such as Bayesian hierarchical modelling (Barboza et al., 2014) to non-linear methods such as random forest (Michel et al., 2020), pairwise comparison (Hanhijärvi et al., 2013) or data assimilation (e.g. Singh et al., 2018). The presence of noise or mutually unrelated variability may result in biased estimations of parameters of the statistical models such as regression coefficients. Especially regression-based methods are known to underestimate the true magnitude of variability, especially on lower frequencies (Zorita et al., 2003; Esper et al., 2005; Von Storch et al., 2004; Christiansen et al., 2009). They also tend to “regress to the mean”, i.e. they have difficulties in reconstructing values that lie outside the range of the calibration data. This is further exacerbated by the presence of strong warming trends and shortness of the available calibration period (approximately 150 years).

Thus, robust reconstruction methods are needed in order to produce more reliable estimates of the amplitude of the past variability of the AMV in order to better quantify its response to external forcing. This is also a precondition for an unbiased detection of any “unusual” observed trends and for the subsequent attribution of those trends to a particular forcing, e.g. anthropogenic greenhouse gases. To this end, we need to design reconstruction methods which are more robust against noise and, importantly, do not strongly “regress to the mean” when the predictors become more noisy or scarce back in time. As in many disciplines, machine learning methods have successfully gained traction in the climate reconstruction community (e.g. Michel et al., 2020; Zhang et al., 2022; Wegmann and Jaume-Santero, 2023). Here, we explore the potential of the non-linear supervised learning method Gaussian process regression (GPR) for climate index reconstructions. GPR finds growing use in climate applications such as climate model emulators (Mansfield et al., 2020) or reconstructions of sea level fields (Kopp et al., 2016) and global mean surface temperature (Büntgen et al., 2021).

Unlike other machine learning methods, such as neural networks, GPR offers greater transparency and is less of a “black box”. The number of free parameters is usually much smaller and ideally the parameters have a more direct physical interpretation. A Gaussian process (GP) describes a distribution over functions with a given mean and covariance structure. The covariance structure is chosen such that the resulting functions best match a given set of observations. This setup appears as more intuitive and closer to the more familiar family of regression methods than convoluted deep learning structures, which in the end may need additional algorithms for their physical interpretation. GPR's non-parametric nature has the advantage that we do not need to make any assumptions about the (non-)linearity of the underlying reconstruction problem. As a Bayesian method, GPR comes with its own uncertainty estimates, which is a very important feature for paleoclimate applications.

We do not only test GPR as a climate index reconstruction tool but also propose a modified input space for the GPR-based reconstructions. To this end, we embed the entire available dataset (proxy data and the target index) in a virtual space. The location of the data time series in this space are based on the similarity between the time series. The resulting cloud of data points in this virtual space can be viewed as a temporal sequence of images with missing values. The covariance of the GP describes the cross-correlation between the proxy records and the target index across time and virtual space. We use the GPR to fill the missing values where we do not have observations of the target index. This approach is somewhat similar to kriging in geostatistics, where two-dimensional fields are reconstructed based on point measurements and a known covariance structure. In our case, the input space is not the geographical space but the virtual embedding space and the covariance structure is learned from the data. This setup has the additional advantage that it can easily accommodate variable proxy availability in time and that the proxy-related uncertainty can be directly accounted for by the parameters of the GP.

To fully judge the methodological performance and related uncertainties, reconstruction methods need to be tested in so-called pseudo-proxy experiments (Smerdon, 2012). Many methods have already been tested in such controlled environments, but the evaluation often lacks a thorough assessment of the method's capability to reconstruct the magnitude of the variability on different timescales. In particular, a reconstruction method must be able to capture extreme phases, again to ascertain whether the AMV is sensitive to sudden changes in the external forcing, e.g. after volcanic eruptions, but also to capture possible large internally generated variations, which could occur independent of external forcing. Here, we test our proposed framework of the embedded GPR in such a pseudo-proxy environment and place special emphasis on the method's skill of reconstructing extreme phases and the magnitude of variability of the AMV.

2 Methods and data

2.1 Pseudo-proxies and simulated AMV index

We generate the pseudo-proxies from a simulation of the Common Era (i.e. the past 2000 years) with the Max Planck Institute Earth System Model (MPI-ESM). The model version corresponds to the MPI-ESM-P LR setup used in the fifth phase of the Coupled Model Intercomparison Project (CMIP5, Giorgetta et al., 2013). A detailed description of the simulation can be found in Zhang et al. (2022). The target of the pseudo-reconstructions is the simulated AMV index (AMVI). We define the AMVI as the spatial mean of annually averaged sea surface temperature anomalies (SST) in the North Atlantic (0–70° N and 80° W to 0° E). The SST anomalies are calculated against the mean over the entire simulation period. We do not further detrend the AMVI because it is difficult to define a meaningful trend period in the paleocontext. In the case of real reconstructions, all proxies and the AMVI would be available for overlapping periods with different length, and it is not possible to define a meaningful common trend that could be subtracted from all records.

The pseudo-proxies are defined as time series of the simulated temperature at the model grid points closest to existing proxy sites in the PAGES2k database (PAGES2k, 2017). Over land, we use a 2 m annual mean air temperature and over ocean we use annual mean sea surface temperature. We do not use all available proxy sites from the PAGES2k database but only a subset thereof. We limit our selection of proxy sites to those within the North Atlantic domain (10–90° N and 100° W to 30° E) with annual resolution or finer. Out of these sites, we further select only those locations at which the pseudo-proxies have a correlation of 0.35 or higher with the AMVI during the last 150 simulation years. (In this case, both the AMVI and the pseudo-proxies are detrended before calculating the correlation.) The final proxy network consists of 23 pseudo-proxies (Fig. 1a).

We design three sets of pseudo-proxies to account for different sources of uncertainty: in the first test case (TCppp), we use perfect pseudo-proxies, i.e. the pseudo-proxies contain only the temperature signal. In the second test case (TCnpp), we use noisy pseudo-proxies, i.e. the pseudo-proxies contain additional non-climatic noise. The non-climatic noise is generated by adding white noise to the perfect pseudo-proxies. The amplitude of the white noise is defined such that the correlation between the noisy and the perfect pseudo-proxies is 0.5, i.e. the amplitude of the white noise corresponds to the standard deviation of the perfect pseudo-proxy times $\sqrt{3}$ . This is a reasonable choice, as the correlation for real proxies with observations ranges from 0.3 to 0.7. The amount of white noise applied here is also well within the range of other pseudo-proxy studies (e.g. Smerdon, 2012). To ensure that the performance with noisy data is independent of the specific noise realisation, we create an ensemble of 30 noise realisations. In both TCppp and TCnpp we assume that all records are available at every point in time, i.e. that the network size remains constant in time. In reality, different proxy records cover different periods and the network size is not constant (Fig. 1b). Therefore, we set up a third test case (TCp2k) with realistic temporal proxy availability from the PAGES2k database and both perfect and noisy pseudo-proxies. In all three test cases, the pseudo-proxy records have annual resolution. The reconstruction period corresponds to the last 500 simulation years for TCppp and TCnpp, and to the entire 2000 simulation years for TCp2k.

To test the sensitivity of the method to the underlying climate-model simulation, we repeat the test cases TCppp and TCnpp with an analogously derived set of 25 pseudo-proxies and AMVI from simulations with the Community Climate System Model (CCSM4; Gent et al., 2011). We combine the “past1000” simulation (Landrum et al., 2013; Otto-Bliesner, 2014) and one “historical” simulation (Gent et al., 2011; Meehl, 2014) from the CMIP5 suite and use the last 500 years of the combined dataset. From the historical simulations, we used the ensemble member r1i1p1. The results are displayed in Appendix B.

https://gmd.copernicus.org/articles/17/1765/2024/gmd-17-1765-2024-f01

Figure 1The selected pseudo-proxy records and resulting distance metrics based on the MPI-ESM simulation: (a) the locations of the records, colour-coded with the correlation between the records and the AMVI during the last 150 simulation years (after detrending); (b) the number of available proxy records at the selected locations within the PAGES2k dataset over time; (c) cross-correlation; (d) standard deviation ratio; and (e) the resulting embedding distances from the combination of both. Matrix indices 1–23 are the selected pseudo-proxy records as labelled in (a), and index 24 is the simulated AMVI. The diagonal entries in (e) are empty because zero cannot be displayed on the logarithmic colour scale.

2.2 Benchmark reconstruction

To have a benchmark for the GPR-based reconstruction in the cases of TCppp and TCnpp, we use pseudo-reconstructions with a multi-linear principal component regression (PCR). PCR is well established as a climate-index reconstruction method and has been used, for example, for reconstructions of the global mean surface temperature (PAGES2k, 2019) and the AMVI (Gray et al., 2004; Wang et al., 2017). The selected proxy time series are first decomposed into principal components (PCs); the latter are then used as predictors in a linear least-squares regression to obtain the AMVI for those time steps where proxies and AMVI overlap. In other words, the AMVI is expressed as a function of PCs of the original proxies (Eq. 1). We do not use all PCs but only retain those with a cumulative explained variance of 99.5 %. The trained model can then be used to reconstruct the AMVI for time steps where we have only proxies available:

\begin{matrix} (1) & AMVI (t) = f_{PCR} ({PC}_{1} (t), \dots, {PC}_{n} (t)) . \end{matrix}

2.3 Gaussian process regression

2.3.1 The concept

Gaussian process regression is a Bayesian, non-parametric, supervised learning method (Rasmussen and Williams, 2006). Just like a probability distribution describes random variables, a Gaussian process (GP) describes a distribution over functions with certain properties. A GP is determined by a mean function and a covariance function:

\begin{matrix} (2) & f (x) \sim GP (μ (x), k (x, x^{'})) . \end{matrix}

The mean function μ(x) describes the mean of all functions within the GP at location x. In the absence of other knowledge, it is typically assumed that the mean of all functions within the prior GP is zero everywhere. The covariance function $k (x, x^{'})$ describes the statistical dependence between the function values at two different points in the input space. The exact covariance structure is prescribed by a kernel function. Kernel functions range from very simple (e.g. linear, radial basis functions) to more complex (e.g. Matern functions, periodic). In principle, there is no limit to the kernel complexity and finding the right kernel can be considered an art in itself (e.g. Duvenaud et al., 2013). Once a general functional form of the kernel has been chosen (e.g. radial basis function), the specific form is determined by the kernel parameters. Since the underlying GP model itself is non-parametric, kernel parameters are often also referred to as hyperparameters (Rasmussen and Williams, 2006). These hyperparameters are either prescribed a priori, if they are known, or learned from the data through optimisation if they are unknown (e.g. through maximum likelihood estimation).

Without being constrained by data, the prior GP is a distribution of all functions with the given mean and covariance (Eq. 2). In order to use the GP for regression and prediction, the prior GP is combined with the additional information from the training data through the Bayes theorem (see Appendix A and Rasmussen and Williams (2006) for a more detailed description). Thus, the posterior GP is obtained, i.e. only those functions are selected that agree with the training data in a given uncertainty range. Predictions at previously unseen input points are then given by that posterior distribution of functions evaluated at those unseen input points.

2.3.2 Finding the right regression space

As described for the PCR, classical climate-index reconstruction methods formulate their underlying statistical model so that the climate index is assumed to be a function of temperature, the proxy values or, for example, the principal components thereof. In other words, the regression is performed in temperature/proxy/PC space; the proxies/PCs are the predictors and the climate index is the predictand. If we reconstruct the AMVI with GPR in this classical setup, the target AMVI becomes the posterior mean function and the covariance is estimated across the proxy space. With the trained GP model, the AMVI can be reconstructed by evaluating the GP at the proxy values that occurred during the reconstruction period. Figure 2a shows the regression in proxy space for an example where the AMVI is given as a function of two pseud-proxy records p₁ and p₂. In this example, the posterior mean AMVI function forms a surface in the space spanned by p₁ and p₂. (Note that in our pseudo-proxy experiments we use 23 pseudo-proxies (Fig. 1a), so the proxy space is actually 23-dimensional, which is impossible to visualise.)

In initial tests, the GPR reconstruction in proxy space did not perform well: the variability of the AMVI was strongly underestimated (not shown). A possible explanation is that GPs are very good interpolators but bad at extrapolating to regions of the proxy space that have not been sampled during training (e.g. upper left and lower right quadrants in the example of Fig. 2a) or where predictors become sparse (lower left quadrant in Fig. 2a). In those cases, the GPR estimation will fall back to the prior mean function (regression to the mean) and the predictive skill becomes very small. From a mathematical point of view, this setup of the regression in proxy space is also actually not suitable for GPR. Real-world proxies come with large uncertainties, and while GPs are designed to handle uncertain targets, they assume that the inputs are without uncertainty. Therefore, we approach the problem differently and set up the GPR in a way that leverages two GPR strengths: (1) being good interpolators and (2) handling uncertain targets.

https://gmd.copernicus.org/articles/17/1765/2024/gmd-17-1765-2024-f02

Figure 2Schematic visualisation of the regression spaces for an example with two proxy records p₁ and p₂ and the AMVI. Panel (a) shows the GPR in the proxy space: the independent variables are the temperature anomalies of the proxy records; the dependent variable is the AMVI (colour coded) Panel (b) shows the GPR in the embedding space: the independent variables are the locations in the embedding space and time, and the dependent variables are the temperature anomalies of the proxy records and the AMVI (colour coded). In this simplified example, the three time series are located such that the distance between them is equal for all respective pairs of records.

Towards variance-conserving reconstructions of climate indices with Gaussian process regression in an embedding space

2.1 Pseudo-proxies and simulated AMV index

2.2 Benchmark reconstruction

2.3 Gaussian process regression

2.3.1 The concept

2.3.2 Finding the right regression space

2.3.3 Defining the distance matrix

2.3.4 Kernel design and hyperparameters

2.3.5 GP scaling behaviour

2.3.6 Technical notes

2.4 Training and testing

3.1 TCppp: perfect pseudo-proxies

3.2 TCnpp: noisy pseudo-proxies

3.3 TCp2k: realistic PAGES2k proxy availability

3.3.1 Full emGPR

3.3.2 Sparse emGPR

4.1 Non-climatic noise

4.2 Hyperparameter estimation and overfitting

4.3 Embedding distance

4.4 Climate model dependence

4.5 Using real proxies and wider applications