Interactive comment on “Recalibrating Decadal Climate Predictions – What is an adequate model for the drift?”

Abstract. Near-term climate predictions such as multi-year to decadal forecasts are increasingly being used to guide adaptation measures and building of resilience. To ensure the utility of multi-member probabilistic predictions, inherent systematic errors of the prediction system must be corrected or at least reduced. In this context, decadal climate predictions have further characteristic features, such as the long-term horizon, the lead-time-dependent systematic errors (drift) and the errors in the representation of long-term changes and variability. These features are compounded by small ensemble sizes to describe forecast uncertainty and a relatively short period for which typical pairs of hindcasts and observations are available to estimate calibration parameters.
With DeFoReSt (Decadal Climate Forecast Recalibration Strategy), Pasternack et al. (2018) proposed a parametric post-processing approach to tackle these problems. The original approach of DeFoReSt assumes third-order polynomials in lead time to capture conditional and unconditional biases, second order for dispersion and first order for start time dependency. In this study, we propose not to restrict orders a priori but use a systematic model selection strategy to obtain model orders from the data based on non-homogeneous boosting.
The introduced boosted recalibration estimates the coefficients of the statistical model, while the most relevant predictors are selected automatically by keeping the coefficients of less important predictors at zero. Through toy model simulations with differently constructed systematic errors, we show the advantages of boosted recalibration over DeFoReSt. Finally, we apply boosted recalibration and DeFoReSt to decadal surface temperature forecasts from the German initiative Mittelfristige Klimaprognosen (MiKlip) prototype system. We show that boosted recalibration performs as well as DeFoReSt yet offers greater flexibility.
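The polynomial structure of the correction described above (third order in lead time for the conditional and unconditional bias, second order for dispersion, first order in start time) can be sketched in a few lines. This is an illustrative Python sketch only, not the authors' implementation: the function name `deforest_design` is ours, and raw monomials are used for brevity where the actual implementation relies on orthogonal polynomials.

```python
import numpy as np

def deforest_design(lead, start, order_mean=3, order_disp=2):
    """Illustrative predictor sets in the spirit of DeFoReSt:
    polynomials in lead time, each with an optional linear
    start-time dependency (q = 0 or 1)."""
    # bias terms: polynomials up to 3rd order in lead time
    X_mean = np.column_stack(
        [lead**p * start**q for p in range(order_mean + 1) for q in (0, 1)]
    )
    # dispersion terms: polynomials up to 2nd order in lead time
    X_disp = np.column_stack(
        [lead**p * start**q for p in range(order_disp + 1) for q in (0, 1)]
    )
    return X_mean, X_disp
```

Boosted recalibration starts from the same kind of predictor pool (possibly with higher orders) and lets the selection step decide which columns enter the model.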


of word as it plays into the idea that statistics might be objective. As such, the word "objective" should be omitted from the manuscript completely.

Answer: If the name "objective function" is misleading, we will change it to "cost function".

5. 87: "For the sake of completeness and readability these are presented in this section again." - Unnecessary sentence

Answer: Will be deleted.
6. 124: By introducing the normal distribution with a calligraphic N and then using Greek letters for the standard normal distribution, it gets quite confusing. As such, this part needs to be rewritten. I would suggest introducing N^S or similar for the standard normal distribution. As the authors work beforehand with capital letters for CDFs, I would recommend a consistent approach to the nomenclature. I am aware that the equation for the CRPS is often shown in this way in the statistical learning literature, but as GMD is not such a journal I strongly recommend intuitive naming of variables.
Answer: We will replace the symbols Φ and ϕ for the CDF and PDF of the standard normal distribution with N^S_C and N^S_P.

7. 138ff: I would strongly recommend a schematic on whose basis the authors explain the mechanism of DeFoReSt. Equations are fine, but as they become extremely lengthy and hard to understand for the general reader (like eq. 13), they need support and motivation.

Answer: We will add such a schematic to the manuscript.

9. 202ff: The problem at this point is that the boosting algorithm forms an essential part of the understanding of the manuscript. I would strongly recommend the design of a schematic to make clear what exactly is done in the boosting process (apart from the equation, the algorithmic strategy). This part of the manuscript needs effort to make it better understandable for the wider audience, especially as the authors do not publish here for a statistical, but a general model-related audience.
Answer: We will add a schematic flow chart describing the boosting algorithm analogously to Messner et al. 2017.
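As a complement to the planned flow chart, the algorithmic idea can also be sketched in code. The following is an illustrative Python translation, not the crch implementation: a coefficient-wise gradient boosting step for a Gaussian location/scale model in the spirit of Messner et al. (2017), where in each iteration only the single coefficient with the steepest-descent direction is updated, so unimportant predictors stay at zero.

```python
import numpy as np

def boost_gaussian(X, Z, y, nu=0.05, mstop=300):
    """Sketch of non-homogeneous boosting: mu = X @ b, log(sigma) = Z @ g.
    Per iteration, only the best coefficient (over both b and g) moves."""
    b = np.zeros(X.shape[1])
    g = np.zeros(Z.shape[1])
    for _ in range(mstop):
        mu = X @ b
        sigma = np.exp(Z @ g)
        # negative log-likelihood gradients w.r.t. mu and log(sigma)
        r_mu = (y - mu) / sigma**2
        r_lsig = (y - mu)**2 / sigma**2 - 1.0
        # steepest-descent direction projected onto each predictor
        grad_b = X.T @ r_mu / len(y)
        grad_g = Z.T @ r_lsig / len(y)
        j_b, j_g = np.argmax(np.abs(grad_b)), np.argmax(np.abs(grad_g))
        # update only the single most informative coefficient
        if np.abs(grad_b[j_b]) >= np.abs(grad_g[j_g]):
            b[j_b] += nu * grad_b[j_b]
        else:
            g[j_g] += nu * grad_g[j_g]
    return b, g
```

Stopping after mstop iterations is what performs the model selection: coefficients of predictors that were never selected remain exactly zero.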
10. 202: "R-function poly" please make it a proper reference Answer: Will be corrected.
11. 205: "R-package crch" please make it a proper reference Answer: Will be corrected.
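For readers unfamiliar with R's poly, its effect can be mimicked in a few lines. This is an illustrative Python sketch under the assumption that centring the monomials and orthogonalizing them via QR captures the essential idea; the point is that orthogonal polynomial predictors avoid the strong collinearity of raw powers, which matters when boosting selects among them.

```python
import numpy as np

def poly_basis(x, degree):
    """Orthonormal polynomial basis analogous in spirit to R's poly(x, degree):
    monomial columns are centred and then orthogonalized via QR."""
    V = np.vander(x, degree + 1, increasing=True)[:, 1:]  # drop constant column
    V = V - V.mean(axis=0)                                # centre each column
    Q, _ = np.linalg.qr(V)                                # orthonormal columns
    return Q
```

Each returned column is orthogonal to the others and to the intercept, so coefficient magnitudes across orders become directly comparable.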
13. 218: As written, the choice of nu requires a sensitivity test. So either the motivation for choosing nu = 0.05 needs to be rewritten, or its effect demonstrated and discussed.
Answer: We will add a better motivation to the manuscript.
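The usual motivation, which the added text could build on, is that in boosting the learning rate nu mainly trades off against the number of iterations mstop: a smaller nu needs proportionally more iterations to reach the same fit. A minimal, deliberately simplified sketch (a one-parameter least-squares toy problem, our own construction) makes the trade-off explicit:

```python
import numpy as np

def iterations_to_converge(nu, tol=1e-3):
    """Toy illustration: the boosting-style update b += nu * (b_true - b)
    shrinks the error by a factor (1 - nu) per step, so halving nu
    roughly doubles the number of iterations needed."""
    b_true, b = 1.0, 0.0
    for m in range(1, 100000):
        b += nu * (b_true - b)   # gradient step on the loss (b_true - b)**2 / 2
        if abs(b_true - b) < tol:
            return m
    return None
```

This is why a small fixed nu (such as 0.05) is typically chosen once and only mstop is tuned, e.g. by cross-validation.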
14. 226: The description of the cross-validation is not sufficient. A CV requires a statement on how the non-training data is afterwards evaluated (without taking the training data into account, otherwise it is not a CV but a jackknife). The authors point to equation 21, but it is just the basis for the validation (which is described in line 216 with the Pearson correlation). So it would be required to state exactly what process is used for validation, which data is used for this step and which exact metric is applied to make the statement on a validated result.
Answer: We will add a more detailed description.

16. 267ff: The authors show a very large figure with many elements in 4 main colours for the different parameters, but spend just three sentences without putting it in context or giving the plot any meaning (e.g. comparison, interpretation beyond first three coefficients vs. last three). As such, either the plot holds no more information, in which case it is doubtful whether it has any use for the manuscript, or the many different whisker plots are important and this is not represented in the text. Just showing them is not enough, especially as the figure is not referenced back to later when similar coefficient plots are made.

Answer: Fig. 2 is relevant for the toy model construction, since it supports the decision to use the same magnitude for the coefficients of the start- and lead-time-dependent systematic errors. However, since it is not used for any further evaluations, we will move it to the appendix.

Why do Figs. 3-10 show a U-shape over the lead years?

Answer: Regarding Figs. 3-10, particularly the ESS and the intra-ensemble variance exhibit a certain inverse U-shape. The reason might be that DeFoReSt tends to be more underdispersive for the first and last lead years, due to the missing additive correction term for the ensemble spread.
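The distinction the reviewer draws in comment 14, namely that parameters must be estimated on the training folds only and the skill metric computed exclusively on the held-out fold, can be made concrete in a short sketch. This is illustrative Python with function names of our own choosing, not the manuscript's code:

```python
import numpy as np

def crossval_score(x, y, fit, score, k=5):
    """Proper k-fold cross-validation: fit on the training folds only,
    evaluate the metric only on the held-out fold, then average."""
    idx = np.array_split(np.arange(len(y)), k)
    scores = []
    for test in idx:
        train = np.setdiff1d(np.arange(len(y)), test)
        params = fit(x[train], y[train])                 # training data only
        scores.append(score(params, x[test], y[test]))   # held-out data only
    return float(np.mean(scores))
```

Any fitting routine and any metric (e.g. Pearson correlation or a CRPS-based score) can be plugged in as the `fit` and `score` callables.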
20. 288: It is not explained why the uncertainties of the ESS are not visible (either small or not calculable).

Answer:
We decided not to show uncertainties for the ESS, since we only wanted to illustrate the general effect of boosted recalibration and DeFoReSt, and omitting them ensures better visibility.
Answer: Will be corrected.
22. 334 Why is there a bootstrapping in this section but not in the section above?
Answer: Unlike Sec. 4, in Sec. 5 we evaluate the CRPSS also w.r.t. a raw model forecast. Thus, we decided to apply a bootstrapping approach to avoid any undue advantage for the post-processed models.
24. 340ff: Why is there no comparison to the coefficients in Fig. 2?
Answer: The coefficients in Fig. 2 were used to derive the scale of the coefficients associated with the 4th- to 6th-order polynomials for the pseudo-forecasts. Here, unlike in Figs. 11 and 13, no model selection was applied, so a comparison is not very meaningful.
25. 348: "have also some impact." This should be analysed with a significance test and statements made accordingly.

Answer: We will change the statement "have also some impact" to "have also been identified by the boosting algorithm as relevant".
26. 376: Are there significant differences between global and NA 2m temperature? Why is the North Atlantic framed here as independent from the global analysis, and why is the comparison between the two kept so short? It currently reads as if one example would be sufficient. So why are the two not conclusively compared with each other in one section? Could there be a different story beyond just showing the statistical model applied to data?
Answer: DeFoReSt and boosted recalibration have been developed within the MiKlip project, in which the NA as well as the global 2m temperature are key variables. Moreover, these regions distinguish themselves by their potential predictability. Thus, analogously to the toy model experiments, we demonstrate the mechanisms of these recalibration approaches on MiKlip predictions with lower and higher potential predictability. Furthermore, regarding the different predictor variables identified for the NA and global 2m temperature (Figs. 11 and 13), one can see that different processes are relevant due to the different spatial scales of these examples.
27. Fig3-5 should be combined in one figure with 9 panels Answer: Will be corrected.
28. Fig7-9 should be combined in one figure with 9 panels Answer: Will be corrected.