Near-term climate predictions such as multi-year to decadal forecasts are increasingly being used to guide adaptation measures and resilience building. To ensure the utility of multi-member probabilistic predictions, inherent systematic errors of the prediction system must be corrected or at least reduced. In this context, decadal climate predictions exhibit further characteristic features, such as the long-term horizon, lead-time-dependent systematic errors (drift) and errors in the representation of long-term changes and variability. These features are compounded by the small ensemble sizes available to describe forecast uncertainty and the relatively short period for which pairs of hindcasts and observations are typically available to estimate calibration parameters.
With DeFoReSt (Decadal Climate Forecast Recalibration Strategy),

Decadal climate predictions of initialized forecasts focus on describing the climate variability for the coming years.
Significant advances have been made by recent progress in model development, data assimilation for initialization and climate observation. A need for up-to-date and reliable near-term climate information and services for adaptation and planning accompanies this progress

Despite the progress being made in decadal climate forecasting, such forecasts still suffer from considerable systematic errors like unconditional and conditional biases and ensemble over- or underdispersion.
Those errors generally depend on forecast lead time, since models tend to drift from the initial state towards their own climatology

DeFoReSt uses third-order polynomials in lead time to capture conditional and unconditional biases, second-order polynomials for dispersion and a first-order polynomial to model initialization time dependency. Third-order polynomials for the drift have been suggested by
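As a minimal sketch, the polynomial structure described above (third order in lead time, first order in initialization time) can be written as a design matrix of product terms; the function and variable names below are illustrative and not taken from the original implementation.

```python
import numpy as np

def deforest_design(lead_years, start_years, order_lead=3, order_start=1):
    """Polynomial predictor terms for the (un)conditional bias:
    products tau**i * t**j, third order in lead time tau and first
    order in initialization time t (illustrative sketch)."""
    tau = np.asarray(lead_years, dtype=float)
    t = np.asarray(start_years, dtype=float)
    terms = [tau**i * t**j
             for i in range(order_lead + 1)
             for j in range(order_start + 1)]
    return np.column_stack(terms)

# 3 start years x 10 lead years -> 30 samples, (3+1)*(1+1) = 8 terms
X = deforest_design(np.tile(np.arange(1, 11), 3), np.repeat([0, 1, 2], 10))
```

The first column (tau⁰ · t⁰) is the constant term; the remaining columns carry the lead-time and start-year dependencies that the recalibration coefficients act on.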

Although DeFoReSt with third- and/or second-order polynomials turned out in past applications to be beneficial for both full-field initialized decadal predictions

For post-processing of probabilistic forecasts with non-homogeneous Gaussian regression,

Unlike other parameter estimation strategies based on iterative minimization of a cost function by simultaneously updating the full set of parameters, boosting updates only one parameter at a time: the one that leads to the largest decrease in the cost function. As all parameters are initialized to zero, those parameters corresponding to terms which do not lead to a considerable decrease in the cost function – and hence are not relevant – will not be updated and thus will not differ from zero; the associated term then has no influence on the predictor. Here, we extend the underlying non-homogeneous regression model of DeFoReSt to higher-order polynomials and use boosting for parameter estimation. Additionally, cross-validation identifies the optimal number of boosting iterations and thus serves as a means of model selection. The resulting boosted non-homogeneous regression model is hereafter named boosted recalibration.
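The update scheme described above can be illustrated with a small, self-contained sketch of componentwise non-homogeneous boosting for a Gaussian regression model. This is a simplified stand-in for the actual implementation (not the R package used in the paper); the data and names are illustrative.

```python
import numpy as np

def nll(y, mu, logsig):
    """Gaussian negative log likelihood with mean mu and log-sd logsig."""
    sig2 = np.exp(2.0 * logsig)
    return np.sum(0.5 * np.log(2.0 * np.pi * sig2) + (y - mu) ** 2 / (2.0 * sig2))

def boost(X, Z, y, n_iter=1000, nu=0.1):
    """Componentwise non-homogeneous boosting (illustrative sketch):
    mean mu = X @ beta, log-sd = Z @ gamma. Each iteration updates only
    the single coefficient that most reduces the negative log
    likelihood; all coefficients start at zero."""
    beta = np.zeros(X.shape[1])
    gamma = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        mu, logsig = X @ beta, Z @ gamma
        sig2 = np.exp(2.0 * logsig)
        g_mu = (y - mu) / sig2               # negative gradient w.r.t. mu
        g_ls = (y - mu) ** 2 / sig2 - 1.0    # negative gradient w.r.t. logsig
        best = None
        for j in range(X.shape[1]):          # candidate mean updates
            s = nu * (X[:, j] @ g_mu) / (X[:, j] @ X[:, j])
            cand = (nll(y, mu + s * X[:, j], logsig), "mean", j, s)
            best = min(best, cand) if best else cand
        for j in range(Z.shape[1]):          # candidate dispersion updates
            s = nu * (Z[:, j] @ g_ls) / (Z[:, j] @ Z[:, j])
            cand = (nll(y, mu, logsig + s * Z[:, j]), "disp", j, s)
            best = min(best, cand) if best else cand
        if best[1] == "mean":
            beta[best[2]] += best[3]
        else:
            gamma[best[2]] += best[3]
    return beta, gamma

# Synthetic data: intercept 2.0, slope 1.5; the third predictor is
# irrelevant, so its coefficient should stay near zero.
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = 2.0 + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
Z = np.ones((n, 1))
beta, gamma = boost(X, Z, y)
```

Because only the best-performing coefficient is touched per iteration, the irrelevant predictor receives essentially no updates and its coefficient remains near zero, which is exactly the implicit model selection exploited here.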

A toy model producing synthetic decadal forecast–observation pairs is used to study the effect of using higher-order polynomials and boosting on recalibration. Moreover, we compare boosted recalibration and DeFoReSt to recalibrate forecasts from the MiKlip decadal prediction system.

The paper is organized as follows: Sect.

The basis for this study are retrospective forecasts (hereafter called hindcasts) of surface temperature from the Max Planck Institute Earth System Model in a low-resolution configuration (MPI-ESM-LR).
The atmospheric component of the coupled model is ECHAM6 at a horizontal resolution of T63 with 47 vertical levels up to 0.01 hPa

The Met Office Hadley Centre and the Climatic Research Unit at the University of East Anglia produced HadCRUT4

To assess the performance of boosted recalibration with respect to DeFoReSt, we use the same metrics as in

Calibration or reliability refers to the statistical consistency between the forecast probability distributions and the verifying observations

A common tool to evaluate the reliability and therefore the effect of a recalibration is the rank histogram or Talagrand diagram which was separately proposed by
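A rank histogram can be computed in a few lines; the sketch below (illustrative, not the paper's code) counts, for each forecast, how many ensemble members fall below the verifying observation.

```python
import numpy as np

def rank_histogram(ens, obs):
    """For each forecast, the observation's rank is the number of
    ensemble members falling below it; a reliable ensemble yields a
    flat histogram over the m + 1 possible ranks."""
    ranks = (ens < obs[:, None]).sum(axis=1)
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

# Ensemble and observations drawn from the same distribution
# (i.e., a calibrated forecast) -> roughly flat histogram
rng = np.random.default_rng(1)
hist = rank_histogram(rng.normal(size=(5000, 9)), rng.normal(size=5000))
```

A U-shaped histogram would indicate underdispersion, a hump-shaped one overdispersion, and a sloped one a bias.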

Following

Sharpness, on the other hand, refers to the concentration or spread of a probabilistic forecast and is a property of the forecast only.
A forecast is sharp when it takes a risk, i.e., when it frequently differs from the climatology.
The smaller the forecast spread, the sharper the forecast. Sharpness is indicative of forecast performance for calibrated and thus reliable forecasts, as forecast uncertainty reduces with increasing sharpness (subject to calibration).
To assess sharpness, we use properties of the width of prediction intervals as in
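As an illustration of such a width-based sharpness measure, one can average the widths of central prediction intervals derived from ensemble quantiles; the 80 % coverage level below is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def interval_width(ens, coverage=0.8):
    """Mean width of the central prediction interval estimated from
    ensemble quantiles; a smaller width means a sharper forecast.
    The 80 % coverage level is an illustrative choice."""
    lo, hi = np.quantile(ens, [(1 - coverage) / 2, (1 + coverage) / 2], axis=1)
    return float(np.mean(hi - lo))

rng = np.random.default_rng(2)
sharp = interval_width(rng.normal(scale=0.5, size=(100, 20)))
blunt = interval_width(rng.normal(scale=2.0, size=(100, 20)))
```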

Scoring rules, like the continuous ranked probability score (CRPS), assign numerical scores to probabilistic forecasts and form attractive summary measures of predictive performance, since they address reliability and sharpness simultaneously
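For a Gaussian predictive distribution, as assumed by the recalibration model, the CRPS has a well-known closed form, which can serve as a compact illustration:

```python
import math

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS for a Gaussian predictive distribution
    N(mu, sigma^2) and observation y:
    sigma * ( z*(2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi) ), z = (y-mu)/sigma."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))
```

The score rewards reliability and sharpness jointly: for an observation close to the predictive mean, a sharper forecast (smaller sigma) receives a lower (better) CRPS, while large errors are penalized regardless of spread.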

Given that

Its skill score (CRPSS) relates the accuracy of the prediction system to the accuracy of a reference prediction (e.g., climatology).
Thus, with hindcast scores CRPS
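The skill score itself is a one-line computation; the numbers in the example are purely illustrative.

```python
def crpss(crps_forecast, crps_reference):
    """CRPSS = 1 - CRPS_fc / CRPS_ref: positive values indicate skill
    relative to the reference (e.g., a climatological forecast)."""
    return 1.0 - crps_forecast / crps_reference

# Purely illustrative numbers, of the order of skill discussed below:
skill = crpss(0.05, 0.25)  # 0.8
```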

We first review the decadal climate forecast recalibration strategy (DeFoReSt) proposed by

DeFoReSt assumes normality for the PDF

Schematic overview of the effect of DeFoReSt for an exemplary decadal toy model with ensemble mean (colored lines), ensemble minimum/maximum (colored dotted lines) and associated pseudo-observations (black line). Note that different colors indicate different initialization times. Before recalibration

The functional forms of

Initial guesses for parameters need to be carefully chosen to avoid convergence into local minima of the cost function.
Here, we obtain initial guesses for

Initial guesses for

An alternative to minimization of the CRPS is maximization of the likelihood.
Here, the CRPS grows linearly in the prediction error, in contrast to the negative log likelihood, which grows quadratically

We use cross-validation with a 10-year moving validation period as proposed by

In Eq. (

Schematic overview of the cross-validation setting for a decadal climate prediction, initialized in 1964 (dotted red line). All hindcasts which are initialized outside the prediction period are used as training data (dotted black lines). A hindcast which is initialized inside the prediction period is not used for training (dotted gray lines).
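The training/validation split shown in the schematic can be sketched as follows: for each initialization year, only hindcasts initialized outside the 10-year prediction period enter the training set. The function and names are illustrative.

```python
def cv_training_years(start_years, horizon=10):
    """For each initialization year y, training uses only hindcasts
    initialized outside the prediction period [y, y + horizon)
    (sketch of the scheme in the schematic; names are illustrative)."""
    return {y: [s for s in start_years if not (y <= s < y + horizon)]
            for y in start_years}

# Example: hindcasts initialized 1960-1979; for the 1964 hindcast,
# the years 1964-1973 are withheld from training.
folds = cv_training_years(range(1960, 1980))
```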

We apply boosting for non-homogeneous regression problems as
proposed by

This is realized with the R package

The above-mentioned effect of outliers and extremes on dispersivity
described by

In each iteration, the negative partial derivatives,

Schematic flow chart for boosting algorithm proposed by

If the chosen number of boosting iterations is small enough, a certain number of less relevant predictor terms retain coefficients equal to zero, which prevents the model from overfitting.
A cross-validation (CV) approach is used to identify the iteration
with the set of parameter estimates with maximum predictive
performance. Currently, CV is carried out after each boosting
iteration. The data are split into five parts, and each part consists of
approximately 10 years in order to reflect conditions of decadal prediction. For each part, a recalibrated prediction is computed,
with the model trained on the remaining four parts. Afterwards, these five recalibrated parts are used to calculate the full negative log likelihood. Here, the full negative
log likelihood results from summing Eq. (
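The stopping-iteration selection described above can be sketched as follows, assuming the out-of-sample negative log likelihood has been tracked per fold and per boosting iteration; the block splitting mirrors the five parts of roughly 10 years, and the synthetic loss curves are purely illustrative.

```python
import numpy as np

def year_blocks(years, n_folds=5):
    """Split the hindcast years into n_folds contiguous blocks of
    roughly 10 years, reflecting decadal prediction conditions."""
    return np.array_split(np.asarray(years), n_folds)

def best_iteration(fold_nll):
    """fold_nll[k, m]: out-of-sample negative log likelihood of fold k
    after boosting iteration m; the selected stopping iteration
    minimizes the summed ('full') negative log likelihood."""
    return int(np.argmin(fold_nll.sum(axis=0)))

blocks = year_blocks(range(1961, 2011))                 # 5 blocks of 10 years
it = np.arange(100.0)
fold_nll = np.stack([(it - 30.0) ** 2 + k for k in range(5)])  # synthetic curves
m_stop = best_iteration(fold_nll)                        # 30
```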

Overview of the different toy model setups and the corresponding polynomial lead time dependencies.

Analogously to the standard DeFoReSt, the previously described modeling
procedure (boosting and CV for iteration selection) is carried out in
a cross-validation setting (second level of CV) for model validation.
A 10-year moving validation period (see Sect.

To assess the model selection approach for DeFoReSt, we consider two toy model experiments with different potential predictabilities to generate pseudo-forecasts, as introduced by

the predictable signal is stronger than the unpredictable noise, and

the predictable signal is weaker than the unpredictable noise.

determines the ratio between the variance
of the predictable signal and the variance of the unpredictable noise
and thus controls potential predictability; see

specifies the unconditional bias added to the predictable signal.

analogously specifies the conditional bias.

specifies the conditional dispersion of the forecast ensemble.

analogously controls the
unconditional dispersion and has not been used in

For an assessment of the model selection approach, we use seven
different toy model setups per value of

As mentioned before, the functions

Analogously to the MiKlip experiment, the toy model uses 50 start years,
each with 10 lead years, and 15 ensemble members. The corresponding
pseudo-observations run over a period of 59 years in order to cover
lead year 10 of start year 50. The corresponding imposed systematic errors for the unconditional and conditional bias (related to
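A heavily simplified sketch of such a toy model setup generates 50 start years with 10 lead years, 15 ensemble members and 59 years of pseudo-observations; the specific bias and dispersion polynomials below are illustrative placeholders, not the coefficients used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(42)
n_start, n_lead, n_ens = 50, 10, 15
n_years = n_start + n_lead - 1            # 59 years of pseudo-observations

# Predictable signal shared by observations and forecasts, plus noise
signal = rng.normal(size=n_years)
obs = signal + 0.5 * rng.normal(size=n_years)

tau = np.arange(1, n_lead + 1)
bias = 0.2 * tau - 0.02 * tau**2          # illustrative drift polynomial
spread = 1.0 + 0.05 * tau                 # illustrative dispersion error

fcst = np.empty((n_start, n_lead, n_ens))
for i in range(n_start):
    verif = i + np.arange(n_lead)         # years verified by this hindcast
    fcst[i] = (signal[verif] + bias)[:, None] \
              + spread[:, None] * rng.normal(size=(n_lead, n_ens))
```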

For each toy model setup, we calculated the ESS, the MSE, time mean
intra-ensemble variance and the CRPSS of pseudo-forecasts
recalibrated with boosting. The reference forecasts for the skill
score are those recalibrated with DeFoReSt. All scores have been
calculated using cross-validation with an annually moving calibration
window with a width of 10 years (see

To ensure robust results, 1000 pseudo-forecasts are generated from the toy model and evaluated as described above. The scores presented are all mean values over these 1000 experiments. In particular, to assess whether boosted recalibration significantly improves on DeFoReSt with respect to the CRPSS, the 2.5 % and 97.5 % percentiles are also estimated from these 1000 experiments.

Mean squared error (MSE) of different toy model setups with high potential predictability (

Figure

Regarding the ESS, Fig.

The post-processing methods are further compared by calculating the
time mean intra-ensemble variance (see
Fig.

A joint measure for sharpness and reliability is the

CRPSS of different toy model setups with high potential
predictability (

MSE of different toy model setups with high potential predictability (

Figure

The ESS (see Fig.

Figure

In the low potential predictability setting (

Figure

While in Sect.

We discuss which predictors are identified by boosted recalibration as most relevant, and we compute the ESS, the MSE, the intra-ensemble variance and the CRPSS with respect to climatology for both recalibration approaches. The scores have been calculated for the period from 1960 to 2010. In this section, a 95 % confidence interval was additionally calculated for these metrics using a bootstrapping approach with 1000 replicates. For bootstrapping, we randomly draw a new forecast–observation pair of dummy time series with replacement from the original validation period and recalculate these scores. This procedure was repeated 1000 times. Note that we draw a new forecast–observation pair of dummy time series for each model, so that the metrics of these models are not calculated on the basis of the same sample. Furthermore, all scores have been calculated using cross-validation with a yearly moving calibration window with a 10-year validation period (see Sect.
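The bootstrapping procedure can be sketched as a generic percentile bootstrap over forecast–observation pairs; the score function and data below are illustrative.

```python
import numpy as np

def bootstrap_ci(score, fcst, obs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample forecast-observation pairs with
    replacement, recompute the score, and return the alpha/2 and
    1 - alpha/2 percentiles (generic sketch of the procedure above)."""
    rng = np.random.default_rng(seed)
    n = len(obs)
    scores = [score(fcst[idx], obs[idx])
              for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    return np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Illustrative example: 95 % CI of the ensemble-mean MSE
rng = np.random.default_rng(1)
fcst = rng.normal(size=(200, 10))         # 200 pairs, 10 members
obs = np.zeros(200)
mse = lambda f, o: float(np.mean((f.mean(axis=1) - o) ** 2))
lo, hi = bootstrap_ci(mse, fcst, obs)
```

Drawing a fresh resample per model, as done above for each call, avoids evaluating competing models on the identical bootstrap sample.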

Figure

Most relevant are the coefficients

The recalibration of ensemble dispersion is mostly influenced by a linear start year dependence in the unconditional term (

CRPSS of different toy model setups with low potential
predictability (

The performance of the ensemble mean of the raw forecast (black), of the forecast recalibrated with DeFoReSt (blue) and of boosted recalibration is measured with the MSE shown in Fig.

Figure

Figure

Compared to raw and DeFoReSt, the intra-ensemble variance of boosted recalibration is larger for lead year 1 and smaller for lead years 3 to 10. Boosted recalibration is sufficiently flexible to adjust the ensemble variance to a value close to the MSE. This consistent behavior is roughly constant over lead years.

Although boosted recalibration shows mostly a smaller ensemble
variance (lead years 3–10) than DeFoReSt, both recalibration
approaches are roughly equal when the performance is assessed with the CRPSS with climatological reference (Fig.

Here, the CRPSS of both models is around 0.8 for all lead years with respect to climatological forecast. In contrast, the raw forecast is inferior to the climatological forecast for most lead years, except lead years 3–6, where the raw forecast has positive skill, which could be attributed to the fact that temperature anomalies are considered. This implies that the observations and the raw forecast have the same mean value of 0. This mean value seems to be crossed by the raw forecast mainly between lead years 4 and 5.

Figure

Coefficient estimates for recalibrating global mean
2 m temperature of the MiKlip prototype system.
Colored boxes represent the interquartile range (IQR)
around the median (central, bold and black line) for
coefficient estimates from the cross-validation setup;
whiskers denote maximum 1.5 IQR. Coefficients are grouped according to
correcting unconditional bias (blue), conditional bias (red),
unconditional dispersion (orange) and conditional dispersion
(green). Values refer to coefficients

Figure

Regarding the reliability, both recalibrated
forecasts also show an ESS close to 1 for all lead years for
the North Atlantic surface temperature (Fig.

The aforementioned lower potential predictability for
the North Atlantic also manifests in an ensemble variance that is 10 times larger; see Fig.

Identified coefficients for recalibrating the mean 2 m temperature over the North Atlantic of the prototype. Here, the coefficients are grouped by correcting uncond. bias (blue bars), cond. bias (red bars), uncond. dispersion (orange bars) and cond. dispersion (green bars). The coefficients are standardized, i.e., with higher values implying a higher relevance. Values refer to coefficients

Common parameter estimation and model selection approaches such as stepwise regression and LASSO are designed for predictions of mean values. Non-homogeneous boosting jointly adjusts mean and variance and automatically selects the most relevant input terms for post-processing ensemble predictions with non-homogeneous (i.e., varying-variance) regression. Boosting iteratively seeks the minimum of a cost function (here, the negative log likelihood) and updates only the one coefficient with the largest improvement of the fit; if the iteration is stopped before a convergence criterion is fulfilled, the coefficients not selected until then are kept at zero. Thus, boosting is able to handle statistical models with a large number of variables.

We investigated boosted recalibration using toy model
simulations with high (

Irrespective of the complexity of systematic errors and the potential
predictability, both recalibration approaches lead to an improved
reliability with ESS close to 1. Sharpness and MSE can also be
improved with both recalibration approaches. Given a high potential predictability (

Regarding the probabilistic forecast skill (CRPSS), DeFoReSt
and boosted recalibration perform roughly equally well, implying that
the polynomial structure of DeFoReSt, chosen originally from
personal experience, turns out to be quite appropriate. Both
recalibration approaches are reliable and outperform the
climatological forecast with a CRPSS near

Also for the North Atlantic surface temperature, both post-processing approaches perform roughly equally; they are reliable and superior to climatology with respect to the CRPSS. However, the CRPSS for the North Atlantic case is generally smaller than for the global mean.

This study shows that boosted recalibration, i.e., recalibration model selection with non-homogeneous boosting, allows for a parametric decadal recalibration strategy with increased flexibility to account for lead-time-dependent systematic errors. However, while we increased the polynomial order to capture complex lead-time-dependent features, we still assumed a linear dependency on initialization time. As this model selection approach reduces the number of parameters by eliminating irrelevant terms, it opens up the possibility of increasing flexibility (polynomial orders) also in terms related to the start year.

Based on simulations from a toy model and the MiKlip decadal climate forecast system, we could demonstrate the benefit of model selection with boosting (boosted recalibration) for recalibrating decadal predictions, as it decreases the number of parameters to estimate without being inferior to the state-of-the-art recalibration approach (DeFoReSt).

The toy model proposed by

Both are based on an arbitrary but predictable signal

The pseudo-observation

In this toy model setup, the concrete form of this variability is
not considered and thus taken as random. A potential climate trend could
be superimposed as a time-varying mean

The pseudo-forecast with ensemble members

In contrast to the original toy model design, proposed by

According to Eq. (

Given this setup, a choice of

As mentioned in Sect.

Here,

As described in Sect.

For the current toy model experiment, we exemplarily specify values for

Coefficient estimates for recalibrating global mean
2 m temperature of the MiKlip prototype system with a third-order polynomial lead time dependency for the unconditional and conditional bias and dispersion. Here, non-homogeneous boosting is not applied and all polynomials are orthogonalized; i.e.,

Overview of the values for the coefficients

The HadCRUT4 global temperature dataset used in this study is freely accessible through the Climatic Research Unit at the University of East Anglia (

AP, JG, HWR and UU established the scientific scope of this study. AP, JG and HWR developed the algorithm of boosted recalibration and designed the toy model applied in this study. AP carried out the statistical analysis and evaluated the results. JG supported the analysis regarding post-processing of decadal climate predictions. HWR supported the statistical analysis. AP wrote the manuscript with contributions from all co-authors.

The authors declare that they have no conflict of interest.

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research has been supported by the German Federal Ministry of Education and Research (BMBF) project MiKlip (sub-project CALIBRATION, Förderkennzeichen FKZ 01LP1520A). We acknowledge support from the Open Access Publication Initiative of Freie Universität Berlin.

This paper was edited by Ignacio Pisso and reviewed by three anonymous referees.