Climate change is typically modeled using sophisticated mathematical models (climate models) of physical processes that span a range of temporal and spatial scales. Multi-model ensemble means of climate models show better correlation with the observations than any of the individual models. How climate models can be optimally combined into an ensemble mean remains an open research question. We present a novel stochastic approach based on Markov chains to estimate the model weights used to construct ensemble means. The method was compared to existing alternatives by measuring its performance on training and validation data, as well as in model-as-truth experiments. The Markov chain method showed improved performance over those methods when measured by the root mean squared error on validation data and comparable performance in model-as-truth experiments. The results of this comparative analysis should motivate further studies of Markov chain and other nonlinear methods that address the problem of finding optimal model weights for constructing ensemble means.

Climate change is often modeled using sophisticated mathematical models of physical processes taking place over a range of temporal and spatial scales. These models are inherently limited in their ability to represent all aspects of the modeled physical processes. Simple averages of multi-model ensembles of GCMs (global climate models) often show better correlations with the observations than any of the individual models separately

Most studies attempting to define an optimal ensemble weighting either employ linear optimization techniques

Hence, it is desirable that an ensemble weighting method be robust against the dependency issue and produce normalized, non-negative weights for interpretability. Finally, the method should work well across a range of different climate variables, such as temperature and precipitation.

In this paper, we propose a novel way to construct a weighted ensemble mean using Markov chains, which we call the Markov Chain Ensemble (MCE) method. Our purpose is to demonstrate that going beyond linear optimization on a vector space of climate models' outputs allows better-performing weighted ensembles to be built. We selected Markov chains as the basis for such nonlinear optimization because they are among the most straightforward nonlinear structures. The method naturally produces non-negative weights that sum to one and captures some of the nonlinear patterns in the ensemble (by nonlinear patterns we mean time-dependent selections of model components rather than a complete model output vector). It performs well on a range of datasets when compared to the standard simple mean and linear optimization weighting methods, as we demonstrate below. We also examine how the method responds to the introduction of interdependent models.

Although Markov chains have been used frequently in the literature for the prediction of future time series (e.g.,

We describe the datasets used in this study and the proposed MCE method in Sect. 2. We compare the proposed method (MCE) to the commonly used multi-model ensemble average (AVE) method

Here we first describe the datasets used in this study. We have chosen three publicly available datasets with differing numbers of models, historical period lengths and model interdependence levels to evaluate and compare the performance of the MCE method with alternative approaches.

The first dataset we use is the temperature anomaly (

The second dataset contains temperature output from the New South Wales (NSW) and Australian Capital Territory Regional Climate Modelling project

The third dataset contains yearly heatwave amplitudes (HWAs) for the Korean Peninsula from 29 CMIP5 climate models and observations between 1973 and 2005

These three datasets cover different scenarios, data structures, parameter distributions and scales (see Table

Summary of CMIP5, NARCliM and KMA data properties.

Generally, a homogeneous Markov chain is a sequence of random system states evolving through time, where each successive state depends only on its predecessor and predefined transition probabilities
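For illustration, a homogeneous Markov chain with a fixed transition matrix can be simulated as below. This is a generic Python sketch, not the MCE algorithm itself; the three-state transition matrix is hypothetical (the study's code is written in R).

```python
import numpy as np

def simulate_chain(P, s0, n_steps, rng):
    """Simulate a homogeneous Markov chain with transition matrix P,
    starting from state s0. Each next state depends only on its
    predecessor and the fixed transition probabilities."""
    states = [s0]
    for _ in range(n_steps):
        # Row P[s] is the distribution of the next state given state s.
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

# Hypothetical 3-state transition matrix (each row sums to one).
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
rng = np.random.default_rng(0)
path = simulate_chain(P, s0=0, n_steps=1000, rng=rng)
```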

This property allows us to construct a non-negative transition matrix

More precisely, we start by constructing a transition vector

The Markov chain ensemble (MCE) algorithm. Inputs:

- length of training period
- historical observations
- climate model output
- an initialized number of simulations,
- an initialized
- an initialized transition matrix
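One ingredient of the algorithm above, the non-negative row-stochastic transition matrix, can be sketched as follows. The selection rule used here (at each time step, pick the model whose output is nearest to the observation) is a simplifying assumption for illustration, and the toy data are synthetic; the paper's exact construction is given in the algorithm table.

```python
import numpy as np

def best_model_sequence(obs, models):
    """Index of the model output closest to the observation at each time step.
    models has shape (n_models, n_times); obs has shape (n_times,)."""
    return np.argmin(np.abs(models - obs), axis=0)

def transition_matrix(seq, n_states):
    """Row-stochastic transition matrix estimated from transition counts."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    # Normalize each row; rows with no visits fall back to uniform.
    rows = counts.sum(axis=1, keepdims=True)
    return np.where(rows > 0, counts / np.maximum(rows, 1), 1.0 / n_states)

rng = np.random.default_rng(1)
obs = rng.normal(size=50)
models = obs + rng.normal(scale=0.5, size=(4, 50))  # toy "model outputs"
P = transition_matrix(best_model_sequence(obs, models), n_states=4)
```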

We provide some details of the algorithm as described in Table

From Eq. (

As we select only one of the simulations, the MCE method is not sensitive to the number of simulations

Sensitivity of the ensemble properties to the value of

Though better RMSE results can be achieved with larger

While we do not claim that the proposed method explicitly addresses the issue of model dependence, it is implicitly addressed to some degree at Step 3 in Table

We demonstrate this property of the MCE method on modified NARCliM data by adding a copy of one of the models, perturbed with a small random error, and comparing the resulting weights as shown in Fig.

Change of MCE weights after adding a copy of Model 1, 3, 8 and 9 (clockwise from panel

As we can see from Fig. 2, adding a highly correlated ensemble member does not change the weight distribution significantly, and, more pleasingly, when a high-performing model is duplicated, the weights are shared between the two copies (see Model 3 and Model 9). Consequently, the performance of

Though the MCE method can be used on any climate dataset that contains the required inputs, its relative performance differs depending on the properties of the dataset. We will demonstrate that, for normally distributed data, its performance is competitive with simple averaging and other more sophisticated methods. In more challenging scenarios, when data are not normally distributed, MCE performs better than the common alternatives.

As the MCE method is based on a stochastic process, the results can vary between runs. To mitigate this effect and to obtain reproducible results, we set the seed of R's random number generator to a constant for all simulations. The MCE method in its current implementation does not provide an uncertainty quantification, and this limitation is a subject for future nonlinear ensemble weighting method development.

Finally, as the MCE method does not consider spatial information, the resulting weights have limited physical interpretability. Extending the MCE method to utilize such information is a subject for future research.

In order to evaluate the relative performance of the MCE method, we select two other popular approaches to constructing an ensemble weighted average. The first approach is the widely used average of individual climate model outputs

The second approach that has been selected for relative performance evaluation in this study is a convex optimization as proposed by

The purpose of this method is to find a linear combination of climate model outputs with

This method and its implementation are discussed in detail in
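As a generic sketch of the convex-optimization idea (not necessarily the cited implementation), the code below fits non-negative weights summing to one by least squares, using SciPy's SLSQP solver on synthetic toy data:

```python
import numpy as np
from scipy.optimize import minimize

def convex_weights(models, obs):
    """Least-squares fit of a convex combination of model outputs:
    minimize ||models.T @ w - obs||^2 subject to w >= 0 and sum(w) = 1."""
    n = models.shape[0]
    res = minimize(
        lambda w: np.sum((models.T @ w - obs) ** 2),
        x0=np.full(n, 1.0 / n),                          # start from equal weights
        bounds=[(0.0, 1.0)] * n,                         # non-negativity
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x

rng = np.random.default_rng(2)
obs = rng.normal(size=60)
models = obs + rng.normal(scale=0.3, size=(5, 60))  # toy "model outputs"
w = convex_weights(models, obs)
```

In practice such fits often drive many weights to zero, which relates to the loss of ensemble diversity discussed later for the COE method.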

The root mean squared error (RMSE, Eq.
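For reference, the RMSE of a weighted ensemble mean against observations can be computed as follows (a Python sketch with toy numbers; the study's own code is in R):

```python
import numpy as np

def weighted_rmse(models, weights, obs):
    """RMSE between the weighted ensemble mean and the observations.
    models: (n_models, n_times); weights sum to one."""
    ensemble_mean = weights @ models
    return np.sqrt(np.mean((ensemble_mean - obs) ** 2))

obs = np.array([1.0, 2.0, 3.0])
models = np.array([[1.0, 2.0, 3.0],
                   [2.0, 3.0, 4.0]])
print(weighted_rmse(models, np.array([0.5, 0.5]), obs))  # → 0.5
```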

The monthly trend bias is calculated as the difference between the inclination parameter
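Assuming the inclination parameter denotes the slope of a linear fit over time, the trend bias can be sketched as:

```python
import numpy as np

def trend_bias(ensemble, obs):
    """Difference between the linear-trend slopes of the weighted
    ensemble mean and the observations over the same time axis."""
    t = np.arange(len(obs))
    slope_ens = np.polyfit(t, ensemble, 1)[0]
    slope_obs = np.polyfit(t, obs, 1)[0]
    return slope_ens - slope_obs

t = np.arange(10.0)
bias = trend_bias(2.0 * t, 1.5 * t)  # slope difference of about 0.5
```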

The monthly bias is calculated as the difference between the mean of the weighted ensemble and the observation for each month on validation data. The total climatology monthly bias metric is calculated as the mean of the monthly biases.

Interannual variability for each month is calculated as the difference between the standard deviation of the detrended weighted ensemble and the standard deviation of the detrended observations on validation data. The total interannual variability metric is calculated as the mean of the interannual variability over all months.
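The interannual variability comparison can be sketched as below; linear detrending is an assumption here, as the text does not specify the detrending method:

```python
import numpy as np

def detrend(x):
    """Remove a fitted linear trend from a time series."""
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)
    return x - (slope * t + intercept)

def interannual_variability_diff(ensemble, obs):
    """Difference between the standard deviations of the detrended
    weighted ensemble mean and the detrended observations."""
    return np.std(detrend(ensemble)) - np.std(detrend(obs))

series = 3.0 * np.arange(12.0) + 1.0   # purely linear toy series
resid = detrend(series)                # residuals are (numerically) zero
```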

Climatological RMSE is calculated according to Eq. (

In this method, the dataset containing the observations is split into a training (or calibration) set and a validation (or testing) set. The goal of cross-validation is to examine the model's ability to predict new data that were not used in estimating the required parameters.

We partition our data into two sets, with 70 % of data used for training and 30 % for validation. This is a specific case of the
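A 70/30 holdout split along the time axis can be sketched as follows; taking the first 70 % of the record for training is an assumption here, as the text does not state how the split is ordered:

```python
import numpy as np

def holdout_split(n_times, train_frac=0.7):
    """Index arrays for a chronological train/validation split."""
    n_train = int(round(train_frac * n_times))
    return np.arange(n_train), np.arange(n_train, n_times)

train_idx, val_idx = holdout_split(100)  # 70 training, 30 validation steps
```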

To evaluate each method's performance on the future model projections, we use the model-as-truth approach and analyze the metrics described in Sect. 2.5. At each step of the model-as-truth performance assessment, one model is selected as the true model (pseudo-observations) and the remaining models are used to build a weighted ensemble mean that best estimates the true model over the historical period. This weighted ensemble mean is then tested against the future projections of the true model. For a given ensemble this is repeated as many times as there are ensemble members, with a different member being chosen as the true model each time. The median and spread of these results are reported.
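The model-as-truth loop described above can be sketched in Python. This is an illustrative sketch with synthetic data; an equal-weight fit stands in for the actual weighting methods, and RMSE is used as the score:

```python
import numpy as np

def model_as_truth_scores(models_hist, models_future, fit_weights, score):
    """For each ensemble member in turn: treat it as pseudo-observations,
    fit weights on the remaining members over the historical period, then
    score the weighted mean against that member's future projection."""
    n = models_hist.shape[0]
    results = []
    for k in range(n):
        rest = [i for i in range(n) if i != k]
        w = fit_weights(models_hist[rest], models_hist[k])
        pred = w @ models_future[rest]
        results.append(score(pred, models_future[k]))
    return np.median(results), results

rng = np.random.default_rng(3)
hist = rng.normal(size=(4, 40))    # toy historical outputs, 4 members
fut = rng.normal(size=(4, 20))     # toy future projections
equal_w = lambda M, o: np.full(M.shape[0], 1.0 / M.shape[0])
rmse = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
med, scores = model_as_truth_scores(hist, fut, equal_w, rmse)
```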

Though the selected monthly CMIP5 data contain annual variation, this variation is not dominant owing to the length and trend of the dataset, as shown in Fig.

Applying the MCE method on the selected CMIP5 data with

CMIP5 data properties.

Performance comparison of different methods on CMIP5 data; RMSE on training (RMSE

We can see that

The model-as-truth performance assessment is done on

CMIP5 model-as-truth performance assessment results. Median, 25th and 75th percentiles of

Model-as-truth performance comparison of different methods on CMIP5 data, median of trend bias (

All the methods perform similarly in model-as-truth assessment, with

The seasonal variation in NARCliM data is larger than in CMIP5 data as shown in Fig.

We apply the MCE method on the selected NARCliM data with

NARCliM data properties.

Performance comparison of different methods on NARCliM data; RMSE on training (RMSE

As in CMIP5 data analysis (Fig.

The model-as-truth performance assessment is done on

NARCliM model-as-truth performance assessment results. Median, 25th and 75th percentiles of the

Model-as-truth performance comparison of different methods on NARCliM data, median of trend bias (

As in the CMIP5 results (Fig.

The KMA data are non-negative with a non-normal distribution of model outputs and observations as shown in Fig.

Applying the MCE method on the selected data with

KMA data properties.

Performance comparison of different methods on KMA data, RMSE on training (RMSE

We can see that MCE has the lowest RMSE and maintains the ensemble's diversity, with only a few models receiving zero weights. The COE method gives non-zero weights to only a small subset of models, which results in worse performance than MCE over the validation period.

The obtained results indicate that Markov chains can be used to construct a better-performing weighted ensemble mean with lower RMSE on validation data than commonly used methods such as multi-model ensemble averaging and convex optimization (Tables

The MCE method also performs at the same level as the other methods in terms of climatological metrics and model-as-truth performance assessment, which gives us confidence in its suitability for estimating future climate variables.

However, as previous studies show (e.g.,

As the number of models increases, the MCE weights tend to approach the AVE weights (Fig.

The MCE method is computationally cheap and is limited only by the software's ability to handle extreme numerical values. One limitation of the MCE method is its current inability to quantify the uncertainty of the resulting weighted ensemble mean. However, we believe that, given the stochastic nature of the method, this limitation can be overcome in future implementations. MCE performance can be further improved by combining it with other types of optimization, e.g., linear optimization. In addition, other nonlinear optimization techniques, incorporating more complex structures than simple Markov chains, can be developed based on our demonstrated results.

Finally, the MCE method does not require some of the assumptions necessary for the multi-model ensemble average method (e.g., models being reasonably independent and equally plausible as discussed by

Geometrically, the restrictions

In this study, we presented a novel approach based on Markov chains to estimate model weights in constructing weighted climate model ensemble means. The complete MCE method was applied to selected climate datasets, and its performance was compared to two other common approaches (AVE and COE) using a cross-validation holdout method and model-as-truth performance assessment with RMSE, trend bias, climatology monthly bias, interannual variability and climatological monthly RMSE metrics. The MCE method was discussed in detail, and its step-wise implementation, including mathematical background, was presented (Table

The results of this study indicate that applying nonlinear ensemble weighting methods to climate datasets can improve the accuracy of future climate projections. Even a simple nonlinear structure such as a Markov chain shows good performance on different commonly used datasets compared to linear optimization approaches. These results are supported by standard performance metrics, cross-validation procedures and model-as-truth performance assessment. The developed MCE method is objective in terms of parameter selection, has a sound theoretical basis and has relatively few limitations. It maintains ensemble diversity, mitigates model interdependence and captures some of the nonlinear patterns in the data while optimizing ensemble weights. It is also shown to perform well on non-Gaussian datasets. Based on the above, we are confident in suggesting its application to other datasets and its use in the future development of new nonlinear optimization methods for weighting climate model ensembles.

Code and data for this study are available at

All co-authors contributed to method development, theoretical framework and design of experiments. MK, YF and JPE selected climate data for this study. MK developed the model code with contributions from SP and performed the simulations. RO prepared NARCliM and KMA data. MK and YF prepared the paper with contributions from all co-authors.

The authors declare that they have no conflict of interest.

Max Kulinich would like to acknowledge the support from the UNSW Scientia PhD Scholarship Scheme. Roman Olson would like to acknowledge the support from the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2018R1A5A1024958).

This paper was edited by Steven Phipps and reviewed by Ben Sanderson and one anonymous referee.