Near-term climate predictions such as decadal climate forecasts are increasingly being used to guide adaptation measures. For near-term probabilistic predictions to be useful, systematic errors of the forecasting systems have to be corrected. While methods for the calibration of probabilistic forecasts are readily available, these have to be adapted to the specifics of decadal climate forecasts: the long forecast horizon, lead-time-dependent systematic errors (drift) and errors in the representation of long-term changes and variability. These features are compounded by small ensemble sizes for describing forecast uncertainty and by the relatively short period for which pairs of reforecasts and observations are typically available to estimate calibration parameters. We introduce the Decadal Climate Forecast Recalibration Strategy (DeFoReSt), a parametric approach to recalibrating decadal ensemble forecasts that takes the above specifics into account. DeFoReSt optimizes forecast quality as measured by the continuous ranked probability score (CRPS). Using a toy model to generate synthetic forecast-observation pairs, we demonstrate the positive effect on forecast quality in situations with both pronounced and limited predictability. Finally, we apply DeFoReSt to decadal surface temperature forecasts from the MiKlip prototype system and find consistent, and sometimes considerable, improvements in forecast quality compared with a simple calibration of the lead-time-dependent systematic errors.

Decadal climate predictions aim to characterize climatic conditions over the
coming years. Recent advances in model development, data assimilation and
climate-observing systems together with the need for up-to-date and reliable
information on near-term climate for adaptation planning have led to
considerable progress in decadal climate predictions. In this context,
international and national projects like the German initiative Mittelfristige
Klimaprognosen (MiKlip) have developed model systems to produce a skillful
decadal climate prediction

Despite the progress being made in decadal climate forecasting, such
forecasts still suffer from considerable systematic biases. In particular,
decadal climate forecasts are affected by lead-time-dependent biases (drift)
and exhibit long-term trends that differ from the observed changes. To
correct these biases in the expected mean climate, bias correction methods
tailored to the specifics of decadal climate forecasts have been developed

Given the inherent uncertainties due to imperfectly known initial conditions
and model errors, weather and climate predictions are framed
probabilistically

Statistical postprocessing

The most prominent recalibration methods proposed in the context of
medium-range weather forecasting are Bayesian model averaging

We expand on NGR and CCR by introducing a parametric dependence of the
forecast errors on forecast lead time and long-term time trends hereafter
named Decadal Climate Forecast Recalibration Strategy (DeFoReSt). To
better understand the properties of DeFoReSt, we conduct experiments
using a toy model to produce synthetic forecast observation pairs with known
properties. We compare the decadal recalibration with the drift correction
proposed by

The remainder of the paper is organized as follows. In Sect.

In this study, we use retrospective forecasts (hereafter called hindcasts) of
surface temperature performed with the Max Planck Institute Earth System
Model in a low-resolution configuration (MPI-ESM-LR). The atmospheric
component of the coupled model is ECHAM6 run at a horizontal resolution of
T63 with 47 vertical levels up to 0.1 hPa

We investigate one set of decadal hindcasts, namely from the MiKlip prototype
system, which consists of 41 hindcasts, each with 15 ensemble members,
initialized yearly on 1 January between 1961 and 2000 and then integrated for
10 years. The initialization of the atmospheric part was realized by
full-field initialization from fields of ERA-40

This study uses the 20th Century Reanalysis

Calibration or reliability refers to the statistical consistency between the
forecast PDFs and the verifying observations. Hence, it is a joint property
of the predictions and the observations. A forecast is reliable if forecast
probabilities correspond, on average, to observed frequencies. Alternatively,
a necessary condition for reliability is that the time-mean intra-ensemble
variance equals the mean squared error (MSE) between the ensemble mean and
the observations
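
This necessary condition can be checked with a simple spread-to-error ratio. A minimal sketch in Python (the function name and the use of the spread/MSE ratio as the ESS-like diagnostic are our assumptions):

```python
import numpy as np

def ensemble_spread_score(fcst, obs):
    """Ratio of time-mean intra-ensemble variance to the MSE between
    ensemble mean and observations; a necessary condition for
    reliability is a ratio of 1 (values < 1 indicate underdispersion).

    fcst: array of shape (n_times, n_members); obs: shape (n_times,).
    """
    ens_mean = fcst.mean(axis=1)
    mse = np.mean((ens_mean - obs) ** 2)
    spread = np.mean(fcst.var(axis=1, ddof=1))
    return spread / mse
```

For a statistically consistent ensemble the ratio is close to 1; an ensemble whose spread is too small relative to its error yields a ratio well below 1.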

A common tool to evaluate the reliability and therefore the effect of a
recalibration is the rank histogram or “Talagrand diagram”
which was separately proposed by

Following

Sharpness, on the other hand, refers to the concentration or spread of a
probabilistic forecast and is a property of the forecast alone. A forecast is
sharp when it takes a risk, i.e., when it frequently differs from
climatology; the smaller the forecast spread, the sharper the forecast. For
calibrated and thus reliable forecasts, sharpness is indicative of forecast
performance, since forecast uncertainty decreases with increasing sharpness.
To assess sharpness, we use properties of the width
of prediction intervals as in

Scoring rules, finally, assign numerical scores to probabilistic forecasts
and form attractive summary measures of predictive performance, since they
address reliability and sharpness simultaneously

Given that

The CRPS is negatively oriented. A lower CRPS indicates more accurate
forecasts; a CRPS of zero denotes a perfect (deterministic) forecast.
Moreover, the average score over

The continuous ranked probability skill score (CRPSS) is, as the name
implies, the corresponding skill score. A skill score relates the accuracy of
the prediction system to the accuracy of a reference prediction (e.g.,
climatology). Thus, with a given CRPS

In the following paragraphs, we discuss DeFoReSt and illustrate how forecast quality is used to estimate the parameters of the recalibration method.

We assume that the recalibrated predictive PDF

In the following, we motivate and develop linear parametric functions for

For bias and drift correction, we start with a parametric approach based on
the studies of

This motivates the following functional form for

In addition to adjusting the unconditional lead-year-dependent bias,
DeFoReSt aims at simultaneously adjusting conditional bias and
ensemble spread. As a first approach, we take the same functional form for

These assumptions on model complexity are supported only by our experience; however, they remain subjective. A more transparent order selection will be a topic of future work.

The coefficients

The initial guesses for optimization need to be carefully chosen to
avoid local minima. Here, we obtain the
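
Parameter estimation by CRPS minimization can be illustrated with a toy example: fitting a constant Gaussian mean and spread by minimizing the average closed-form Gaussian CRPS over a parameter grid. This is a deliberately simple, dependency-free stand-in for the actual optimization of the recalibration coefficients:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def crps_gauss(mu, sigma, y):
    """Closed-form CRPS of a Gaussian N(mu, sigma^2) forecast vs. y."""
    z = (y - mu) / sigma
    pdf = exp(-0.5 * z * z) / sqrt(2 * pi)
    cdf = 0.5 * (1 + erf(z / sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / sqrt(pi))

def fit_min_crps(y, mu_grid, sigma_grid):
    """Choose (mu, sigma) minimizing the average CRPS over a grid;
    a crude stand-in for the actual coefficient optimization."""
    best, best_mu, best_sigma = np.inf, None, None
    for mu in mu_grid:
        for sigma in sigma_grid:
            c = np.mean([crps_gauss(mu, sigma, yi) for yi in y])
            if c < best:
                best, best_mu, best_sigma = c, mu, sigma
    return best_mu, best_sigma
```

In practice a gradient-free optimizer replaces the grid search, which is where carefully chosen initial guesses matter.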

In this section, we apply DeFoReSt to a stochastic toy model, which
is motivated from

The toy model consists of two parts which are detailed in the following two
subsections: (a) pseudo-observations, the part generating a substitute

We construct a toy model setup simulating ensemble predictions for the
decadal timescale and associated pseudo-observations. Both are based on an
arbitrary but predictable signal

In this toy model setup, the specific form of the variability of

We now specify a model giving a potential ensemble forecast with ensemble
members

We further assume an ensemble dispersion related to the variability of the
unpredictable noise term

According to Eq. (

This toy model setup is controlled by four parameters. The first parameter
(

Unconditional bias (

The remaining three parameters are

The ensemble inflation factor

Given this setup, a choice of

Analogous to the MiKlip experiment, the toy model uses 50 start years
(
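
A deliberately simplified sketch of such a setup, with 50 start years, 10 lead years and 15 members (the signal, drift and inflation functions below are our own illustrative choices, not the paper's exact parameterization):

```python
import numpy as np

def toy_forecasts(n_start=50, n_lead=10, n_members=15, seed=0):
    """Illustrative pseudo-observations and drifting, over-/under-
    dispersive ensemble pseudo-forecasts in the spirit of the toy
    model setup."""
    rng = np.random.default_rng(seed)
    start = np.arange(n_start, dtype=float)
    lead = np.arange(1, n_lead + 1, dtype=float)
    verif = start[:, None] + lead[None, :]           # verification time
    signal = 0.1 * verif + np.sin(2 * np.pi * verif / 11.0)
    obs = signal + rng.normal(size=(n_start, n_lead))
    drift = 0.05 * lead**2                           # lead-dependent bias
    inflation = 0.5 + 0.1 * lead                     # under-/overdispersion
    eps = rng.normal(size=(n_start, n_lead, n_members))
    fcst = signal[:, :, None] + drift[None, :, None] \
        + inflation[None, :, None] * eps
    return obs, fcst
```

The drift term grows with lead year, and the inflation factor switches the ensemble from underdispersive at early lead years to overdispersive at later ones.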

Temporal evolution of the raw

To assess DeFoReSt, we consider two extreme toy model setups. The two setups
are designed such that the predictable signal is stronger than the
unpredictable noise for higher potential predictability (setup 1), and vice
versa (setup 2; see Sect.

In addition to the recalibrated pseudo-forecast, we compare

a “raw” pseudo-forecast (no correction of unconditional, conditional bias and spread),

a “drift-corrected” pseudo-forecast (no correction of conditional bias and spread) and

a “perfect” pseudo-forecast (Eq.

The CRPSS and reliability values of the perfect forecast can be interpreted
as the optimum performance within the associated toy model setup, since the
perfect forecast has neither bias nor erroneous ensemble dispersion. For
instance, the perfect model's CRPSS with respect to climatology would be 1
for a toy model setup with perfect potential predictability (

Reliability

Figure

Moreover, the pseudo-forecast is almost perfectly reliable after
recalibration (not underdispersive), as shown by the ESS
(Fig.

The recalibrated forecast outperforms the raw model output and the
drift-corrected forecast, whose ESS values are below 1 and thus indicate
underdispersion. The lower performance of the raw model and the drift
correction results from the toy model design, which combines a higher
ensemble mean variance with a decreased ensemble spread. The increased
variance of the ensemble mean also amplifies the influence of the
conditional bias. Neither the raw model forecast nor the drift correction
can account for this conditional bias, because neither corrects the ensemble
mean or the ensemble spread. The influence of the conditional bias therefore
also becomes noticeable in the reliability of the raw model and the
drift-corrected forecast; one can see that the minimum and maximum of the
conditional bias (see Fig.

Regarding the differences between the raw model and the drift-corrected forecast, the latter outperforms the former. This is because the drift correction accounts for the unconditional bias, which the raw model leaves uncorrected; the impact of the unconditional bias on the raw model is thus clearly visible. Nonetheless, the influence of the unconditional bias is rather small compared to that of the conditional bias.

The effect of unconditional and conditional biases is illustrated in
Fig.

The sharpness of the different forecasts is compared by calculating the time
mean intra-ensemble variance (see Fig.

Another notable aspect is that the raw and drift-corrected forecasts have a higher sharpness (i.e., lower ensemble variance) than the perfect model for lead years 1 to 4, and vice versa for lead years 5 to 10. This is because the toy model is constructed with underdispersion for the first lead years and overdispersion for later lead years. The sharpness of the perfect model can therefore be interpreted as the maximum sharpness attainable without becoming unreliable.

The sharpness of the recalibrated forecast is very similar to the sharpness of the perfect model for all lead years. The recalibration therefore performs well in correcting under- and overdispersion in the toy model forecasts.

A joint measure for sharpness and reliability is the CRPS and consequently
the CRPSS with respect to climatology, where the latter is shown in
Fig.

Reliability

In contrast, the recalibrated forecast approaches CRPSS values around 0.5 for all lead years and performs nearly identically to the perfect model. This illustrates that the unconditional bias, conditional bias and ensemble dispersion can all be corrected with this method.

Figure

After recalibration, the lead- and start-time-dependent biases are corrected, such that the recalibrated forecast mostly describes the trend of the pseudo-observations.

The recalibrated forecast is also reliable (Fig.

In contrast, one can see a general improvement of the raw and drift-corrected
forecasts' reliability compared to the model setup with high potential
predictability. The reason is that the low potential predictability

The minor effect of the conditional bias in the low potential predictability
setup is also represented by the MSE (Fig.

Figure

Nonetheless, the raw model and drift-corrected forecast still have a higher sharpness (i.e., lower ensemble variance) than the perfect model for lead years 1 to 4, and vice versa for lead years 5 to 10. The reason is again the construction of the toy model, with underdispersion for the first lead years and overdispersion for later lead years.

The recalibrated forecast also reproduces the perfect model's sharpness quite well in the low potential predictability setup.

Figure

While in Sect.

Analogous to the previous section, we compute the ESS, the MSE, the
intra-ensemble variance and the CRPSS with respect to climatology. The scores
have been calculated for the period from 1961 to 2005. In this section, a
95 % confidence interval was additionally estimated for these metrics
using a bootstrapping approach with 1000 replicates: we repeatedly draw a
pair of forecast-observation time series with replacement from the original
validation period and recalculate the scores. Furthermore, all scores have
been calculated using cross validation with a yearly moving calibration
window with a width of 10 years (see Appendix
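
The percentile bootstrap described above can be sketched as follows (resampling whole forecast-observation pairs with replacement; the function name and interface are our own):

```python
import numpy as np

def bootstrap_ci(score_fn, fcst, obs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a verification
    score: resample forecast-observation pairs (time indices) with
    replacement and recompute the score n_boot times."""
    rng = np.random.default_rng(seed)
    n = obs.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[b] = score_fn(fcst[idx], obs[idx])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```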

Figure

Temporal evolution of North Atlantic yearly mean surface temperature
from the MiKlip prototype

Reliability

Regarding the reliability, Fig.

Regarding the MSE, one can see that the recalibrated forecast outperforms the
drift-corrected forecast for lead years 1 and 2 and 8 to 10
(Fig.

Figure

Figure

Figure

Temporal evolution of global yearly mean surface temperature from
MiKlip prototype

The ESS for a global mean surface temperature is shown in
Fig.

Reliability

Figure

Figure

Regarding the CRPSS, Fig.

All in all, the better CRPSS performance of the DeFoReSt model can be
explained by its superior reliability for all lead years (see
Fig.

There are many studies describing recalibration methods for weather and
seasonal forecasts

In addition to unconditional biases, probabilistic forecasts can exhibit
lead- and start-year-dependent conditional biases and under- or
overdispersion. We therefore proposed the postprocessing method DeFoReSt,
which accounts for the three abovementioned issues. Following the suggestion
for the unconditional bias

We investigated DeFoReSt using toy model simulations with high
(

A recalibrated forecast shows (almost) perfect reliability (ESS of 1).
Sharpness can be improved due to the correction of conditional and
unconditional biases. Thus, given a high potential predictability
(

We also applied DeFoReSt to surface temperature data of the MiKlip
prototype decadal climate forecasts, spatially averaged over the North
Atlantic subpolar gyre region and a global mean. Pronounced predictability
for these cases has been identified by previous studies

DeFoReSt with third-/second-order polynomials is quite successful.
However, it is worthwhile investigating the use of order selection
strategies, such as LASSO

Based on simulations from a toy model and the MiKlip decadal climate forecast system, we have shown that DeFoReSt is a consistent recalibration strategy for decadal forecasts, yielding reliable forecasts with increased sharpness due to the simultaneous lead-time-dependent adjustment of conditional and unconditional biases and ensemble spread.

The NCEP 20CR reanalysis used in this study is freely
accessible through NCAR (National Center for Atmospheric Research) after a
simple registration process. The MiKlip prototype data used for this paper
are from the BMBF-funded project MiKlip and are available on request. The
postprocessing, toy model and cross-validation algorithms are implemented
using GNU-licensed free software from the R Project for Statistical Computing
(

For this toy model setup,

Overview of the values of the coefficients

This setting has the advantage that the perfect estimation of

Following the suggestion of

For the current toy model experiment, we exemplarily specify values for

We propose a cross-validation setting for decadal climate predictions to ensure fair conditions for assessing the benefit of a postprocessing method over a raw model without any postprocessing. All scores are calculated with a yearly moving validation period with a length of 10 years. This means that one start year, including its 10 lead years, was left out for validation. The remaining start years and the corresponding lead years were used for estimating the correction parameters for the prediction within the validation period; start years falling within the validation period were not taken into account. This procedure was repeated for a start-year-wise shifted validation period.
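
The selection of training start years in this scheme can be sketched as follows (the function name is our own):

```python
def cv_training_starts(all_starts, val_start, n_lead=10):
    """Training start years for validating the hindcast initialized at
    val_start: all start years except those whose initialization falls
    inside the validation period [val_start, val_start + n_lead - 1]."""
    lo, hi = val_start, val_start + n_lead - 1
    return [s for s in all_starts if s < lo or s > hi]
```

For the 1964 hindcast of a 1961-2000 set, for example, the start years 1964 to 1973 are withheld and the remaining 30 start years form the training data.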

This setting is illustrated in Fig.

Schematic overview of the applied cross-validation procedure for a decadal climate prediction, initialized in 1964 (red dotted line). All hindcasts which are initialized outside the prediction period are used as training data (black dotted lines). A hindcast which is initialized inside the prediction period is not used for training (gray dotted lines).

The authors declare that they have no conflict of interest.

This study was funded by the German Federal Ministry for Education and Research (BMBF) project MiKlip (subprojects CALIBRATION, Förderkennzeichen FKZ 01LP1520A, and FLEXFORDEC, Förderkennzeichen FKZ 01LP1519A). Mark Liniger and Jonas Bhend have received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 308291. Edited by: James Annan. Reviewed by: two anonymous referees.