Reconstructing climatic modes of variability from proxy records: sensitivity to the methodological approach

Modes of climate variability strongly impact our climate and thus human society. Nevertheless, their statistical properties remain poorly known owing to the short time frame of instrumental measurements. Reconstructing these modes further back in time, by applying statistical learning methods to proxy records, is a useful way to improve our understanding of their behaviour and meteorological impacts. Several statistical reconstruction methods exist for this purpose, among which Principal Component Regression is one of the most widely used. Additional predictive, and hence reconstructive, statistical methods have been developed recently, following the advent of big data. Here, we provide the climate community with a multi-method statistical toolbox, based on four statistical learning methods and cross-validation algorithms, that enables the systematic reconstruction of any climate mode of variability, as long as proxy records overlapping in time with the observed variations of the considered mode are available. The efficiency of the methods can vary, depending on the statistical properties of the mode and of the learning set, thereby allowing an assessment of the sensitivity related to the reconstruction techniques. The toolbox is modular in the sense that it accepts different inputs, such as the proxy database or the chosen variability mode. As an example, the toolbox is applied here to the reconstruction of the North Atlantic Oscillation (NAO) using the PAGES 2k database. In order to identify the most reliable reconstruction among those given by the different methods, we also investigate the sensitivity of the reconstruction to other aspects of the methodological setup, such as the number and nature of the proxy records used as predictors or the targeted reconstruction period. The best reconstruction of the NAO that we thus obtain shows significant correlation with former reconstructions, but exhibits better validation scores.

(3) Following this comment, we have updated our code, manuscript and data to use the 2017 version of the PAGES 2k database, as suggested by Reviewer 2. Then, using this new proxy database, and in order to address this comment, we decided to remove the proxy records that are not annually resolved. As for dating uncertainties, this is certainly something to be considered in the next version of the code; we have thus added a short paragraph on this aspect, concerning potential outlooks for future versions, in the discussion section.
(1) Section 2.2: Do the methods estimate uncertainty in the reconstruction, or just provide a single reconstruction? Are the ensembles of reconstructions discussed elsewhere a kind of uncertainty estimate of the mean reconstruction? These, or something like them, would be essential to use and display, because without reliable uncertainty estimates, paleoclimate reconstructions are not useful.
(2) This was actually a major omission in the former version of the paper, and we thank the reviewer for reporting it. The uncertainties we now provide are calculated as in Ortega et al. (2015), using the residuals calculated over the 50 training periods. These uncertainties are represented by the standard errors (s.e.) of the regression, calculated as the square root of the sum of the squared residuals over the training period divided by the degrees of freedom:

$$\mathrm{s.e.} = \sqrt{\frac{\sum_{t=1}^{n_{\mathrm{train}}} \left(Y_{\mathrm{train},t} - \hat{Y}_{\mathrm{train},t}\right)^{2}}{n_{\mathrm{train}}}},$$

where $n_{\mathrm{train}}$ is the size of the training sample, $Y_{\mathrm{train}}$ the true values of the NAO index over the training period, and $\hat{Y}_{\mathrm{train}}$ the NAO fitted by the regression model over the training period.
An uncertainty band of ±2 s.e. is calculated for each of the 50 individual reconstructions, and the envelope of these 2 s.e. bands is our estimate of the total uncertainty range of the final reconstruction.
(3) We have added the regression uncertainties in a table and in the figures where the reconstructions are shown. The code we deliver also provides the standard errors for each member of a given final reconstruction, as sketched below.
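For illustration, here is a minimal sketch of this computation in Python, under the assumption that the degrees of freedom equal the training sample size; variable names such as recon_members and se_members are ours, not those of the delivered code:

```python
import numpy as np

def regression_se(y_train, y_fit):
    """Standard error of the regression: square root of the sum of the
    squared residuals over the training period divided by the degrees
    of freedom (assumed here to be the training sample size)."""
    residuals = np.asarray(y_train) - np.asarray(y_fit)
    return float(np.sqrt(np.sum(residuals ** 2) / residuals.size))

# Placeholder ensemble: 50 individual reconstructions over 971 years
# (e.g. 1000-1970) and their 50 corresponding standard errors.
rng = np.random.default_rng(0)
recon_members = rng.normal(size=(50, 971))
se_members = rng.uniform(0.8, 1.2, size=50)

# Envelope of the 50 bands of +/- 2 s.e., taken as the total
# uncertainty range of the final reconstruction.
upper = (recon_members + 2.0 * se_members[:, None]).max(axis=0)
lower = (recon_members - 2.0 * se_members[:, None]).min(axis=0)
```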
(1) p.7, l.16-19: Using correlation as the only validation metric is problematic, especially when it comes to comparing reconstruction methodologies. You really must include additional metrics that account not just for the correlation, but for the variance and bias as well. If the approaches provide uncertainty estimates, then the skill metrics need to also account for those (using, for example, the continuous ranked probability score).
(2) This comment was also raised by the other reviewer, as well as in the short comment of Eduardo Zorita. We fully agree, and we have decided to add both the Root Mean Squared Error (RMSE) and the Nash-Sutcliffe Coefficient of Efficiency (NSCE) as additional metrics. The NSCE is one minus the ratio of the averaged quadratic distance between the reconstruction and the observations to the averaged quadratic distance between the observations and their mean. This metric, bounded between $-\infty$ and 1, indicates that the reconstruction is robust when NSCE > 0; lower values mean that using the mean of the testing series is more skilful than performing a reconstruction with the statistical model. We thus believe that these two metrics adequately account for the bias and variance of the reconstruction, which should improve the conservation of these properties in our reconstruction.
(3) The whole new manuscript now accounts for these two metrics and uses the NSCE as the main decision metric.
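For reference, a minimal sketch of the two added metrics (our own illustrative implementation, assuming NumPy arrays, not the toolbox code itself):

```python
import numpy as np

def rmse(obs, recon):
    """Root mean squared error between observations and reconstruction."""
    obs, recon = np.asarray(obs), np.asarray(recon)
    return float(np.sqrt(np.mean((obs - recon) ** 2)))

def nsce(obs, recon):
    """Nash-Sutcliffe Coefficient of Efficiency: 1 minus the ratio of the
    squared misfit of the reconstruction to the squared misfit of the
    observational mean. NSCE is bounded above by 1 (perfect fit); NSCE > 0
    means the reconstruction beats the mean of the testing series."""
    obs, recon = np.asarray(obs), np.asarray(recon)
    num = np.sum((obs - recon) ** 2)
    den = np.sum((obs - np.mean(obs)) ** 2)
    return float(1.0 - num / den)
```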
(1) p.16, l.19-20: This statement is incorrect. Previous reconstructions almost never overlook this issue; rather, proxy network selection is integral to the reconstruction process. It is very rare to have a reconstruction approach, especially one that is regression-based, that does not remove proxies because of insufficient correlation with the target climate variable.
(2) For climate index reconstructions we found at least two major studies that did not use proxy network selection to perform their reconstruction: Cook et al. (2002) (NAO reconstruction) and Wang et al. (2017) (AMV reconstruction).
(3) These studies are indeed particular cases, however, and we have modified this statement to clarify that we were referring mainly to these two studies.
(1) p.18, l.1-2: Or the "significant" correlation with the NAO could be spurious. Also note that non-stationarity violates one of the fundamental assumptions of these (and nearly all) reconstruction approaches.
(2) Indeed, we also asked ourselves whether the significant correlations we found could be spurious, but it is relatively difficult to determine whether they are or not. An indirect way to "verify" the significance of these correlations is to inspect the locations of the proxy records that are highly correlated with the NAO. A way to rule out spurious correlations is the use of pseudo-proxies, as in Ortega et al. (2015), but handling pseudo-proxies from different datasets was an arduous task for this multi-method paper. Nevertheless, the fact that most proxy records selected at the highest levels of correlation significance (i.e. Greenland, Arctic Canada, North America and Europe; see Fig. 6 in the last version of the manuscript) are located in the centers of action of the NAO, which has not been imposed a priori (e.g. Casado et al. 2013), is a good indication that most proxy records are not spurious NAO predictors. The second comment, about non-stationarity, indeed highlights a problem that questions not only our study, but all proxy-based reconstruction studies.
(3) In the new version of the manuscript we have removed the sentence concerning non-stationarity, since this type of caveat belongs in the discussion section. We also highlight that the locations of most of the selected proxy records show that our method seems to select reliable predictors adequately.
(1) p.19, l.12-15: I think this statement is too strongly worded, given that you've only validated the reconstructions using correlation and haven't validated the reconstruction uncertainties. How do the reconstructions compare given the uncertainties?
(2) (3) As mentioned above, in the revised version we use the coefficient of efficiency to validate our reconstructions, and we include and discuss the regression uncertainties in our main text and in dedicated figures.
Response to Anonymous Referee #2

1 Scientific Comments

(1) I'll start with what I like about the paper: it applies several methods to the same dataset, and the results are fairly consistent among methods and with another recent reconstruction in which one of the authors was involved (Ortega et al., 2015). That's about it.
(2) (3) We thank the reviewer for this positive comment. Nevertheless, as a general response to the main criticisms below, we would like to highlight that our study proposes regression methods that have, to our knowledge, not yet been applied to climate signal reconstructions. In addition, in previous studies cited in this manuscript (studies concerning the reconstruction of climate modes, not of climate fields), we found several issues in the classical methodological approaches. Our objective here is to help paleoclimate experts make the best of their proxy databases with valid and robust statistical assessments. More specifically, using a new metric that we discuss below, we show how to evaluate different reconstructions of the same climate index obtained with different methodological choices (regression method, proxy network, length of the period on which the regression model is built). The wide range covered by the scores shows that the selection of these inputs is an important step towards obtaining a reconstruction that is as robust as possible.
Furthermore, to make the production of such reconstructions more straightforward and to facilitate their use by potential users, we have developed a code that requires only a few parameters as input and provides a set of alternative reconstructions of a given climate index for a given proxy record database. In addition, the code provides an ensemble of scores that evaluate the different reconstructions, each produced with different methodological choices. The user of CliMoRec (see "Response to the short comment from Astrid Kerkweg: 'Executive Editor comment on gmd-2018-21'") can thus pick the one that has the best scores. This is why we did not submit this paper to Climate of the Past: we would like to make climate signal reconstructions more transparent, easily accessible and verifiable by the community. Furthermore, we believe that CliMoRec could be improved by including further refinements in follow-up versions, which constitutes an additional reason why we prefer to submit this paper to GMD. Last but not least, we believe that providing a sufficient level of detail concerning the mathematical rationale behind our methods is very useful; such information is relegated to the appendix in journals like Climate of the Past, which are more focused on the scientific results.
1.1 This is no "big data"
(1) Few things are more irritating than people pretending to do "big data" when they actually don't. The authors only end up using a few dozen proxies, and only reconstruct a single index. Nothing wrong with that, but it's not "big data" by any stretch of the imagination. In fact, except for the random forest method (which is only useful in the presence of hundreds or thousands of predictors, therefore not very useful here), all of the methods described are classic forms of linear regression. Anyone is free to call that "machine learning" (since most ML methods are regression in one form or another), but the larger problem is that this is a modeling journal, and I see very little in the way of statistical modeling here.
(2) (3) We entirely agree that what is done in this paper is not "big data", and we did not intend to claim otherwise. The term "big data" was mentioned twice in the submitted text, with the only aim of providing context: once in the abstract (line 6) and once in the introduction (page 4, line 8). We are actually claiming that the emergence of big data, following innovations in technology and data storage, led to the development of new regression methods in the 2000s, in particular elastic-net regression and Random Forests (Breiman 2001; Zou and Hastie 2005). Those methods were developed to address high-dimensional problems (p > n), which Principal Components Regression and Partial Least Squares deal with poorly. However, since the term "big data" can be misleading, we have decided to remove it in the revised version. Random Forests are indeed particularly useful for high-dimensional data with numerous predictors, as are gradient boosting or neural networks. However, in the new version of the code, using the Nash-Sutcliffe Coefficient of Efficiency, we have found significantly better results for the Random Forest and Elastic-net methods than for the PLS and PCR methods (this is illustrated in Fig. R1, which will replace Fig. 7 of the previous manuscript), which shows that adding these methods, even in a low-dimensional study such as ours, can be more efficient than using classical forms of linear regression. Additionally, the code we provide allows the user to choose the network of proxy records used for the reconstruction. As the number of available paleoclimate records is constantly growing (even if it does not reach hundreds of thousands yet), we claim that regression methods adapted to high-dimensional problems, such as Random Forests, will sooner or later become particularly useful for climate index reconstructions. We have added a few words on this subject in the discussion of the manuscript. The four regression families are sketched below.
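For concreteness, the four regression families could be instantiated as follows with scikit-learn; this is a hedged sketch with arbitrary hyperparameters, not the CliMoRec implementation:

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, ElasticNetCV
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor

methods = {
    # Principal Components Regression: OLS on leading principal components.
    "PCR": make_pipeline(PCA(n_components=5), LinearRegression()),
    # Partial Least Squares: components chosen for covariance with the target.
    "PLS": PLSRegression(n_components=5),
    # Elastic net: L1/L2-penalized regression, designed for p > n problems.
    "Elastic-net": ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5),
    # Random Forest: ensemble of regression trees (Breiman, 2001).
    "Random Forest": RandomForestRegressor(n_estimators=500, random_state=0),
}
# Each estimator exposes fit(X_train, y_train) / predict(X_test).
```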
1.2 Suboptimal Methods
(1) Furthermore, the chosen methods are unable to deal with missing data, forcing the authors to limit the calibration to a set of complete records, thereby jettisoning important information. Meanwhile, at least three methods have been proposed to estimate past climates using discontinuous records. All of these methods have code that is publicly archived, often in open-source languages like R. Restricting themselves to antiquated regression methods forces the authors to play a dubious game of optimization on the various training and verification sets, to offset the disadvantage of restricting the network to a gap-less training set. This is suboptimal on methodological and computational grounds.
(2) In this study, we focus on climate variability modes, which are only one part of the global climate, and we apply dedicated methods aiming at improving the reconstruction of these modes. Our techniques can certainly be further improved, but as they stand, we believe that they add new possibilities to the regression approaches currently in use. This paper actually clarifies and adds methodological guidance, and provides an accessible tool to help paleoclimatologists build more robust climate index reconstructions. Although our approach and the approaches mentioned by the reviewer all aim at reconstructing past climate, the question and focus of the paper is not to show whether one is better than another, but to further develop one of them. Concerning data assimilation methods, we certainly agree that these are very useful, but we do not believe that such methods, which are difficult to implement and thus not accessible to all paleoclimatologists, necessarily discard other, simpler statistical models. We believe that science can benefit from a variety of approaches, all together contributing to the identification of robust results.
(3) We therefore acknowledge the existence of the three methods described by the reviewer, and discuss them briefly in our manuscript, but we do not think there are decisive arguments showing that our approach is necessarily weaker, although proving this is beyond the scope of this paper at this stage.

1.3 How uncertain?
(1) An even more serious issue is that the authors do not provide any measure of uncertainty for their reconstructions. They could do so via any defensible method that has been applied in paleoclimate investigations, e.g. parametric or non-parametric bootstrap, jackknife, or maximum-entropy bootstrap (Vinod and de Lacalle, 2009).
(2) We thank the reviewer for pointing out this major omission (also mentioned by Anonymous Reviewer 1), namely the importance of assessing the reliability of our reconstruction. The uncertainties we now provide are calculated as in Ortega et al. (2015), using the residuals calculated over the 50 training periods. These regression uncertainties are represented by the standard errors (s.e.) of the regression, calculated as the square root of the sum of the squared residuals over the training period divided by the degrees of freedom:

$$\mathrm{s.e.} = \sqrt{\frac{\sum_{t=1}^{n_{\mathrm{train}}} \left(Y_{\mathrm{train},t} - \hat{Y}_{\mathrm{train},t}\right)^{2}}{n_{\mathrm{train}}}},$$

where $n_{\mathrm{train}}$ is the size of the training sample, $Y_{\mathrm{train}}$ the true values of the NAO index over the training period, and $\hat{Y}_{\mathrm{train}}$ the NAO fitted by the regression model over the training period.
An uncertainty band of ±2 s.e. is calculated for each of the 50 individual reconstructions, and the envelope of these 2 s.e. bands is our estimate of the total uncertainty range of the final reconstruction (the sum of the regression uncertainty and the parameter uncertainty).
(3) We have added the regression uncertainties in a table and in the figures where the reconstructions are shown (Fig. 11 of the last version of the manuscript). The code we deliver also provides the standard errors for each member of a given final reconstruction.
1.4 Statistical Models are Models too
(1) I feel compelled to point out that this is a journal about models, so it would be desirable to discuss the advantages of the methodological choices on modeling grounds: each of them models the data and uncertainties in various ways, and it would seem natural for such modeling assumptions and choices to be discussed here (more so than, say, in Climate of the Past, where the current manuscript would be a better fit in its present form). One implicit modeling assumption they make is that the NAO is a linear combination of the proxy data, whereas the correct etiological relationship is the other way around (proxies react to climate, not climate to proxies). This inevitably leads to important biases (Frost and Thompson, 2000). Again, some of the methods mentioned above can deal with that, and the authors should consider using them.
(2) We have explained above our motivation for submitting the paper to this journal rather than to Climate of the Past: the idea is to propose a statistical modelling tool, which will be available to the community and can be further developed in a transparent way, rather than only to propose a new NAO reconstruction. We have been encouraged in this by the editorial guidelines of GMD, which include 'statistical models'. Nevertheless, we leave it to the editor to decide whether our study is suited for GMD or not. Regarding the modelling assumption: stating that the NAO is a linear combination of the proxy data is something about which we have been unclear in the manuscript, but this is not what we meant literally. "The NAO index can be reconstructed from a linear combination of the proxy data" would be a better-suited sentence.
(3) We have revised the manuscript so as to avoid such shortcuts, following the proposition described above.

1.5 Perfunctory Validation
(1) Another major problem is that the authors carry out a very perfunctory validation using a metric (correlation) that is known to only reward phase coherence (Wang et al., 2014). At the very least, the authors should explore the Reduction of Error and Coefficient of Efficiency (Nash and Sutcliffe, 1970) statistics, which have been used for more than 25 years in the dendrochronological literature (Cook et al., 1994). Another useful measure for point forecasts is the Continuous Ranked Probability Score (Gneiting and Raftery, 2007).
(2) (3) We agree that the results may be sensitive to the choice of the calibration/validation metric. We have therefore calculated the Root Mean Squared Error as a new validation score. We also thank the reviewer for suggesting a more sophisticated metric, the Nash-Sutcliffe Coefficient of Efficiency (NSCE), which has been added and is used as the main metric in the manuscript, on top of the correlations and the RMSE. The NSCE scores indeed help us in many ways. They show that all the reconstructions made using the Vinther et al. (2003) NAO index are not reliable, since their NSCE scores are not significantly different from 0 (following a Student test on the scores obtained from the individual reconstructions). However, using the Jones et al. (1999) index (which is exactly the same as the Vinther et al. (2003) index over their common period), we obtain more robust validation scores (i.e. significantly higher than 0 at the 95% level).
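The significance test mentioned above can be sketched as follows; `nsce_members` is a hypothetical array holding the NSCE scores of the 50 individual reconstructions:

```python
import numpy as np
from scipy import stats

# Placeholder NSCE scores of the 50 individual reconstructions.
nsce_members = np.random.default_rng(1).normal(0.2, 0.1, size=50)

# One-sample Student test of the null hypothesis "mean NSCE = 0";
# the reconstruction is deemed robust if the mean is significantly
# above 0 at the 95% level.
t_stat, p_value = stats.ttest_1samp(nsce_members, popmean=0.0)
robust = (p_value < 0.05) and (nsce_members.mean() > 0.0)
```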
(1) If the authors were making interval forecasts, which they should, the sharpness of their prediction bands should be evaluated by an Interval Score (Gneiting and Raftery, 2007).
(2) (3) We thank the reviewer for this interesting comment. Nevertheless, we must confess that what the reviewer is requesting here is not entirely clear to us, even after carefully reading the reference mentioned. As a response, we can say that in the revised version of the manuscript we now properly compute uncertainties (cf. point 1.3), notably for the validation scores, which correspond to the "forecast" part of our methodology.
(1) Finally, an obligatory measure of any statistical forecasting is to inspect the quality of the residuals: since regression relies on residuals being Gaussian, independent and identically distributed, any statistics book (e.g. Wilks, 2011) says that the residuals should be tested for these features. This should at least be present in an Appendix.
(2) We agree that this is an important assumption to check.
(3) We have therefore checked this assumption for the best reconstruction of each method (presented in Fig. 11 of the previous manuscript), and we have added a figure showing the p-values of the Shapiro-Wilk tests obtained for the 50 individual reconstructions of each of them (those with the best NSCE scores on average). We have also updated the code to provide the p-values of this test as an additional output.
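A minimal sketch of this residual check, using SciPy's Shapiro-Wilk test on illustrative residual series (one per individual reconstruction; the variable names are ours):

```python
import numpy as np
from scipy import stats

# Placeholder residual series (y_train - y_fit), one per individual
# reconstruction; in practice these come from the 50 training periods.
rng = np.random.default_rng(2)
residual_members = [rng.normal(size=100) for _ in range(50)]

# Shapiro-Wilk p-values: small values reject Gaussian residuals.
shapiro_pvals = np.array(
    [stats.shapiro(res).pvalue for res in residual_members]
)
```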
1.6 Double dipping
(1) The authors pre-screen the proxy network for correlation with the NAO index. What isn't clear is whether that is done as part of the model training, or whether it is done over the entire instrumental era (or the parts of it that overlap with each proxy series). If the latter, this is an example of "double-dipping", whereby information from the test set is used as part of training, leading to overoptimistic results. I could not ascertain this from the paper, so a clarification is necessary.
(2) This comment is very useful, firstly because we had been unclear on this point, and secondly because it helped us realize that we were indeed double-dipping. As the proxy records were selected over the entire instrumental era, the model built over the training period used proxy records that are, at least partially, coherent with the NAO index over the testing period, which is supposed to be independent.
(3) To correct this issue, we decided that the sub-selection of proxy records based on the correlation test with the NAO must always be made on the training period, so that no a priori information about the coherence between the NAO index and the selected proxy records over the testing period is used. We have modified the code and all the results following this improvement in our approach. This does not change our results much in the end, but it is clearly an improvement in the coherence and rigor of our method, for which we thank the reviewer again. A sketch of the corrected screening is given below.
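A hedged sketch of the corrected screening; the names and interface are illustrative, not those of CliMoRec:

```python
import numpy as np
from scipy import stats

def screen_proxies(X_train, nao_train, alpha=0.05):
    """Return the column indices of proxies whose Pearson correlation
    with the NAO is significant on the *training* period only, so that
    the testing period never informs the selection (no double-dipping).
    X_train: (n_train, n_proxies) proxy matrix."""
    keep = []
    for j in range(X_train.shape[1]):
        _, p = stats.pearsonr(X_train[:, j], nao_train)
        if p < alpha:
            keep.append(j)
    return keep

# Usage inside each of the 50 train/test splits (schematic):
#   cols = screen_proxies(X[train_idx], nao[train_idx])
#   model.fit(X[train_idx][:, cols], nao[train_idx])
#   score = nsce(nao[test_idx], model.predict(X[test_idx][:, cols]))
```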

Fig. R1: Nash-Sutcliffe Coefficient of Efficiency (NSCE) scores obtained for each method for the reconstruction period 1000-1970 and for different significance levels of the correlation test performed on the training periods: 95%, 90%, 80% and 0%. Red boxplots give the NSCE scores for the Random Forest method, blue boxplots for the Elastic-net method, green boxplots for the Principal Components Regression method, and yellow boxplots for the Partial Least Squares method.