A priori selection and data-based skill assessment of reanalysis data as predictors for daily air temperature on a glaciated, tropical mountain range



Introduction
Ongoing developments in atmospheric modelling have made available a choice of long-term, temporally high-resolution atmospheric data sets for the entire globe. These data, however, are still restricted in terms of spatial resolution, such that their immediate application to the study of regional and local climate is not recommended. Especially over complex topography, such as glacier-covered mountains, atmospheric models often miss significant processes that characterize local weather and climate. So-called downscaling methods bridge this gap between the available data from global atmospheric models and the required local-scale information (for an overview see, e.g., Christensen et al., 2007). Generally two types of downscaling exist, namely dynamical downscaling (e.g. Hill, 1968; Giorgi and Bates, 1989; Mearns et al., 2003) and empirical-statistical downscaling (ESD; e.g. Klein et al., 1959; Wilby et al., 2004; Benestad et al., 2008). Since the early development of both downscaling classes, a variety of different models and approaches has emerged.

An important step in general ESD procedures is the selection and assessment of global atmospheric model data as input to the downscaling model (e.g. Von Storch, 1999; Wilby et al., 2004; Benestad et al., 2008). Given the increasing availability of atmospheric models and variables, the issue of predictor selection and assessment has become even more intricate over the last decades, concerning the choice of (i) the physical variable type, (ii) the model grid points or spatial area (i.e. the downscaling domain), and (iii) the data source (i.e. the type of global model). However, only a few studies (e.g. Winkler et al., 1997; Cavazos and Hewitson, 2005) have systematically assessed the relevance of different predictors (in terms of variable types, spatial area, or predictor model). In fact, there is little consensus on the most appropriate choice (e.g. Von Storch, 1999; Fowler et al., 2007), since it depends upon various factors (such as predictand variable, spatial and time scales, season, and geographical location). Wilby et al. (2002) propose a promising solution by providing regression-based, automated tools for predictor selection in ESD (see also Wilby and Dawson, 2007; Hessami et al., 2008). Yet these methods are suitable only if the observational database for model calibration is relatively large, i.e. daily time series for several decades. The problem of predictor and model selection is also well known beyond the atmospheric sciences (e.g. Zucchini, 2000; Hastie et al., 2001); for example, Bair et al. (2006) present an interesting avenue based on supervised principal components.

This study presents an ESD method for high-altitude, mountainous sites. The ESD method is designed (i) to be applicable when only short observational time series are available for model calibration (i.e. a few years), (ii) to appropriately consider autocorrelation in the high-resolution time series, and (iii) to provide a solid tool for model (or predictor) assessment and selection by avoiding subjective choices. The ESD method, as presented here applicable to Gaussian variables only, is comprehensible and of minimum complexity, so that it can easily be transferred to different sites, predictors, or predictands. We show an application of the ESD method to quantify the skill of reanalysis data as predictors for local-scale, daily air temperature measured at high-altitude automatic weather stations (AWSs) in the tropical Cordillera Blanca, where only a few years of high-resolution air temperature measurements are available.

Section 2 introduces the study site and observational data used in this study. Section 3 presents the reanalysis data that are used as the predictors. Section 4 gives a comprehensive description of the ESD model. The results of the ESD model application are discussed in Sect. 5 and summarized in Sect. 6.

Study site and observations: the predictands
The investigation site of the present study is the Cordillera Blanca, a glaciated mountain range located in the Northern Andes of Peru (Fig. 1) that harbors 25 % of all tropical glaciers (with respect to surface area; Kaser and Osmaston, 2002). Glaciers in the Cordillera Blanca have been shrinking since their last maximum extent in the late 19th century (e.g. Ames, 1998; Silverio and Jaquet, 2005; Georges, 2004) and have significantly shaped the socio-economic development in the region. During the 20th century, a series of history's most catastrophic glacier disasters (i.e. outburst floods and avalanches) occurred (e.g. Carey, 2005, 2010). But Cordillera Blanca glaciers also have important positive impacts on water availability for industry, agriculture and households, because they contribute to balancing the high runoff seasonality in the extensively populated Rio Santa valley (Juen, 2006; Juen et al., 2007; Mark and Seltzer, 2003; Kaser et al., 2003, 2010). In the Cordillera Blanca, located in the outer tropical climate zone, atmospheric seasonality is mainly characterized by precipitation variance, with the seasonal air temperature variance being small (Niedertscheider, 1990; Kaser and Osmaston, 2002; Georges, 2005; Juen, 2006). More than 50 % of the annual precipitation falls during the humid season (January-March), whereas during the dry season (June-August) less than 2 % of the annual precipitation falls (annual precipitation amounts to 770 mm in the Northern and 470 mm in the Southern Cordillera Blanca; Niedertscheider, 1990). A detailed description of the underlying mechanisms is given by Garreaud et al. (2003).

Since 1999, an observational network of several AWSs on and near glaciers in the Cordillera Blanca has been installed, primarily to provide high-resolution data for glacier mass balance and runoff modeling (Juen et al., 2007). Maintaining the AWSs to provide continuous and reliable atmospheric time series has been a logistical and technical challenge. Field work has been costly in terms of time and materials, since the AWSs are located at very high altitudes (between 4700 and 5100 m a.s.l.) in remote areas. Further problems include instrument theft and natural hazards (Juen, 2006). Thus ESD methods have been required that are able to provide reliable results even when only limited measurements are available.

Hofer et al. (2010) present a comprehensive ESD modeling procedure to investigate whether the short-term AWS time series (air temperature and specific humidity) can be extended into the past, using reanalysis data as predictors. They find that the ESD model skill varies largely as a function of season and time of day, and emphasize uncertainty in the exact choice of variables that constitute the mixed-field predictors, to which the model results are highly sensitive. Hofer et al. (2012) use a simpler methodology, based on single linear regression, in order to determine the best reanalysis product for daily air temperature predictands measured in the Cordillera Blanca. In the present study, we present an ESD methodology that is similar in complexity to the one used by Hofer et al. (2012), with more emphasis on important elements concerning the ESD model configuration and application (for example, exactly how many measured data, in terms of sample size, are needed for the assessment to be significant, and how the targeted temporal resolution affects the model skill).
The target variables here are daily air temperature time series measured at two AWSs located in the Northern Cordillera Blanca (hereafter referred to as AWS1 and AWS2) and one AWS located in the Southern Cordillera Blanca (hereafter AWS3). In technical terms, the measurements are carried out with an HMP45 sensor by Vaisala and a ventilated radiation shield, described by Georges (2002).
Figure 2 shows statistics of AWS1, AWS2, and AWS3 daily mean air temperature for each month of the year (daily means are calculated from hourly samples measured at the AWSs), over the period for which data are available at all three AWSs (July 2006 to December 2009, hereafter period 1). The air temperatures are approximately normally distributed (not shown). The seasonal cycles in the data are small (< 2 °C), showing multiple local minima and maxima throughout the year. The warmest months are November-January and April-May, and the coldest months are March and July. Note, however, that these statistics (in particular the occurrence of multiple maxima and minima) should not be over-interpreted as a climatology, because they are based on only four years of measurements. The lowest-elevation station, AWS2, systematically shows slightly higher air temperatures, with the warmest months above 2 °C, whereas at AWS1 and AWS3 monthly mean air temperatures are below 2 °C throughout the year.
At all AWSs, monthly mean air temperatures are above 0 °C in all months, and daily means are above 0 °C in 75 % of all cases. The interquartile ranges (blue bars in Fig. 2) show within-month daily mean air temperature variations of less than 2 °C in 50 % of all cases at all AWSs. The highest within-month variabilities occur from December to January, which points to El Niño Southern Oscillation (ENSO) variability playing an important role in the region at this time of the year (e.g. Vuille et al., 2008b), whereas variabilities are generally lower for the dry-season months June-July. Also shown in Fig. 2 are statistics of the reanalysis data predictor that will be referred to later.

Reanalysis data: the predictors
In this study, we assess reanalysis data as the predictors for daily mean air temperature measured at AWS1, AWS2 and AWS3. Reanalysis data are a combination of a general circulation model (GCM) "first guess" and quality-controlled observations, generated using a data assimilation system, similar to analysis data in numerical weather prediction (NWP). First proposed in the studies of Bengtsson and Shukla (1988) and Trenberth and Olson (1988), "re"-analyses have the advantage over NWP analyses that their production is based on a fixed modeling system for the entire assimilation period. Thus data discontinuities due to changes in the atmospheric model and assimilation techniques are avoided.

Today, global reanalysis data are available from four institutions worldwide (in cooperation with partner institutions not mentioned here for brevity): the National Centers for Environmental Prediction (NCEP; Kalnay et al., 1996), the European Centre for Medium-range Weather Forecasts (ECMWF; Uppala et al., 2005), the Japan Meteorological Agency (JMA; Kazutoshi et al., 2007), and the National Aeronautics and Space Administration (NASA; Bosilovich, 2008). First-generation NCEP reanalyses have been a frequent choice in climate studies of the Cordillera Blanca and the South American Andes (e.g. Garreaud et al., 2003; Vuille et al., 2008a, b; Hofer et al., 2010). Hofer et al. (2012), however, show for the Cordillera Blanca that the interim reanalyses by the ECMWF, the MERRA (the Modern Era Retrospective-Analysis for Research and Applications from NASA), the NCEP Climate Forecast System Reanalysis, CFSR (the latest reanalysis product by the NCEP), as well as ensembles thereof, show considerably higher skill than the first-generation NCEP reanalyses or the JMA reanalyses. In this study, we use as predictors ensembles constructed from the interim, the CFSR, and the MERRA reanalyses, as described later in Sect. 4.2, as well as the interim reanalyses alone (which showed the overall highest performance in Hofer et al., 2012).

If atmospheric time series are considerably shorter than thirty years and the climatological seasonal cycle is not known, the problem arises of how to strictly distinguish periodic, seasonal variations from aperiodic (or less periodic) day-to-day and inter-annual variability. Especially in statistical forecasting, periodicity must be accounted for so that the periodic, seasonal variations do not dominate the model fit. When sufficiently long data series are available, the problem is often avoided by subtracting the climatological seasonal cycle from the time series (e.g. Madden, 1976). This way seasonal periodicity is removed from the time series, but not necessarily from the model error.

In the present study we assume that seasonal atmospheric periodicity leads to changing relationships between large- and local-scale atmospheric variables throughout the year. Considering the atmospheric seasonal cycle in ESD models is important especially if the study site is located in the mountains. For example, local-scale atmospheric conditions can be affected by topographic shading that changes with the solar altitude throughout the year, but the topography is misrepresented in the large-scale model, so these effects cannot be captured by it. The same effects arise in relation to the diurnal cycle, so that for sub-daily data different models are required for the different times of day (e.g. Hofer et al., 2010). By consistently using separate statistical predictor-predictand transfer functions for the different months of the year, seasonal periodicity is eliminated not only in the time series, but also in the model error. In practice in this study, each predictor-predictand pair is divided into twelve separate time series, one for each calendar month, the number of observations in each time series consequently being approximately n = N/12, where N is the length of the complete data series. The modeling procedure is then repeated identically for each calendar month's time series.
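The month separation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `split_by_month` and the synthetic two-year record are assumptions made here for demonstration only.

```python
from collections import defaultdict
from datetime import date, timedelta

def split_by_month(dates, predictand, predictor):
    """Group paired daily values into month-specific series (keys 1..12)."""
    groups = defaultdict(lambda: {"y": [], "x": []})
    for d, y, x in zip(dates, predictand, predictor):
        groups[d.month]["y"].append(y)
        groups[d.month]["x"].append(x)
    return dict(groups)

# Synthetic two-year daily record (placeholder values, not AWS data).
start = date(2007, 1, 1)
dates = [start + timedelta(days=i) for i in range(730)]
y_all = [float(i % 7) for i in range(730)]
x_all = [float(i % 5) for i in range(730)]

monthly = split_by_month(dates, y_all, x_all)
# Each month-specific series holds roughly N/12 daily means.
print({m: len(s["y"]) for m, s in sorted(monthly.items())})
```

Each entry of `monthly` can then be passed independently through the calibration and cross-validation steps, so that no parameter is shared between calendar months.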


Predictor selection
In order to simplify the transfer of our approach to different cases (in terms of target variables, locations, or model), we distinguish two ways of predictor selection, namely (1) a priori predictor selection, and (2) data-based predictor selection. More precisely, (1) means predictor selection based on knowledge outside the data (e.g. about physical mechanisms) that is available without ("prior" to) looking at the data or data analysis. Most downscaling studies more or less systematically use a combination of (1) and (2), by first pre-selecting a subset of potential predictors from an available pool based on process knowledge, and then choosing the final predictors based on criteria derived from the data (i.e. data-based selection) (e.g. Klein and Glahn, 1974; Wilby et al., 2002). Yet it is difficult to generalize data-based findings to different cases (in terms of variables, sites or models), and it is desirable to find objective, a priori criteria that simplify predictor selection. Hofer et al. (2012), for example, use an a priori predictor for the assessment of different reanalysis models with regard to local, daily air temperature variations in the Cordillera Blanca.
What information from a large-scale atmospheric model would we use to represent local, daily air temperature if no observations were available? The most intuitive choice is to relate the same physical predictor and target variables; thus here, to use large-scale air temperature as the predictor for local-scale air temperature. Previous ESD studies focusing on air temperature have suggested air temperature predictors in combination with sea level pressure (e.g. Benestad et al., 2002), with geopotential height (e.g. Kidson and Thompson, 1998), or with zonal wind speed and specific humidity (e.g. Hofer et al., 2010). Similarly, Huth (2004) underlines the necessity of using a combination of both circulation-based and radiation-based predictors for air temperature. Von Storch (1999) recommends the use of air temperature predictors as a surrogate indicator of atmospheric radiative properties that are not captured by circulation predictors alone. Beyond these recommendations, though, the definite choice of concrete variables (as well as horizontal extents, vertical levels or model) in every single case requires data-based assessments. Even if, by selecting air temperature as the single a priori predictor here, we neglect any circulation-induced variations not included in the predictor air temperature, our choice is useful especially for inter-comparisons of different models, because it is reasonable to assume that the best model also shows the highest skill in representing the same variable.
As mentioned above, predictor selection includes not only the choice of a physical variable type, but also of the geographical allocation in terms of model grid points. Since model topographies are smoothed representations of the real topography, the surface height at a particular location in the model generally does not correspond to the real surface height at that location. Therefore the question arises whether (1) surface or near-surface predictors, which account for important surface processes, or (2) upper-air predictors that are located at the same elevation as the predictands are the more realistic choice for a predictand located at the surface. In the application example of this study, all three AWSs are located between 500 and 600 hPa, at 5050 (AWS1), 4825 (AWS2), and 4950 (AWS3) m a.s.l. For comparison, Table 1 shows coordinates, surface elevations (h), and geopotential heights (gph) for the 550 hPa level of the closest (relative to the study site) grid points in the interim, CFSR and MERRA models. The model surfaces at these grid points are all located between 3000 and 3500 m a.s.l., thus about 1500 m lower than the AWS sites in reality. There is only a small difference between the geopotential heights of the 550 hPa levels of the different reanalyses, all located at about 5100 m a.s.l.
In terms of the horizontal, or spatial, predictor domain, ESD studies suggest that the optimum downscaling domain is generally not limited to the closest grid points around the study site, but includes important synoptic patterns around and upstream of the study area (e.g. Benestad et al., 2008). For studies that use grid point predictors, it is generally recommended to use not single grid points but rather ensembles of grid points as predictors. Grid point averaging of atmospheric models is necessary in order to minimize numerical model errors apparent in single grid point data (e.g. Grotch and MacCracken, 1991). Because of such errors, the minimum scale of a model (i.e. the distance between two neighbouring grid points) cannot be regarded as the skillful scale of the same model (e.g. Von Storch et al., 1993; Zorita and Von Storch, 1997; Benestad et al., 2008; Hofer et al., 2012). For the Cordillera Blanca, Hofer et al. (2012) analyze the optimum spatial averaging scales of the interim, MERRA, CFSR, JCDAS, and the first NCEP/NCAR reanalyses. They find that the optimum spatial averaging domains of the different reanalyses vary largely (from eight grid points for the NCEP/NCAR reanalyses to 507 grid points for the modern CFSR reanalyses), with the minimum scales being only weak indicators of the optimum scales. Hofer et al. (2012) show further that the problem of determining the optimum scale of a model can be circumvented by using ensembles of grid points from different models, i.e. the mean of grid point data from different reanalyses. They find that it makes no or only a marginal difference whether the ensemble is constructed from the reanalyses at the closest grid points or at their optimum scales. As an a priori choice, it is thus most reasonable to use reanalysis ensembles as predictors, because this way no assessment is required in order to determine the optimum model for each case, or the optimum spatial domain of each model. Even though for our study site this choice has already been evaluated (Hofer et al., 2012), the usefulness of ensemble predictors is justifiable also in cases without data-based assessments. In our a priori selection, we do not consider remote grid point predictors, as recommended in several ESD studies for precipitation predictands (e.g. Wilby and Wigley, 2000; Brinkmann, 2002; Sauter and Venema, 2011).
Finally, synthesizing all of the above considerations concerning a priori predictor selection, we define the a priori predictor as air temperature averaged over the three most recently available reanalyses (interim, MERRA and CFSR) at the grid points located closest to the study site (the horizontally closest grid points in the reanalysis models, at the 550 hPa levels, shown in Table 1), hereafter abbreviated as rea-ens-air. Statistics of rea-ens-air are shown in Fig. 2 for each month of the year for the period in which data are available for all AWSs (i.e. the same period as shown for the AWSs in Fig. 2, period 1). rea-ens-air is proposed as the a priori choice because it is based on simple assumptions that apply equally to different sites, seasons and large-scale models, without preceding data analysis. Note, however, that we do not claim that this is necessarily the best choice in each individual case. More precisely, the coarser the large-scale model, the more complex the relationship between the simulated predictors and the local predictand might be, e.g. involving multiple large-scale variables and non-linear relationships (e.g. Hofer et al., 2010). However, such selections involve data-based inference.
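As a minimal sketch of how an ensemble predictor such as rea-ens-air could be assembled, the following Python snippet averages several daily series value by value. The function name `ensemble_mean` and the three short input series are hypothetical placeholders, not reanalysis data; extracting the closest grid point and the 550 hPa level is assumed to have been done beforehand.

```python
def ensemble_mean(*series):
    """Average several equally long daily series value by value."""
    lengths = {len(s) for s in series}
    if len(lengths) != 1:
        raise ValueError("all input series must have the same length")
    return [sum(vals) / len(vals) for vals in zip(*series)]

# Hypothetical 550 hPa air temperatures (deg C) at the closest grid points.
interim = [0.4, 1.1, 0.9]
merra = [0.1, 0.8, 1.2]
cfsr = [0.7, 1.4, 0.6]

rea_ens_air = ensemble_mean(interim, merra, cfsr)
print(rea_ens_air)
```

An unweighted mean is used here because the text defines the predictor simply as the average over the three reanalyses; any weighting scheme would already be a data-based choice.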
In practice, the relation between large- and local-scale variables can be investigated best with limited-area numerical atmospheric models (LAMs), as they include the most complete framework of linkages between the different scales, i.e. as expressed by the governing atmospheric equations (e.g. Mölg and Kaser, 2011). In contrast to ESD, however, LAMs are computationally expensive.

Downscaling process: linear model calibration and cross-validation
In this section the entire ESD modeling procedure is presented, including data preprocessing, ESD model calibration, and skill estimation based on leave-one-out cross-validation. Leave-one-out cross-validation is important especially in the case of short-term observational time series (as in the present study), because it allows each observation to be used in the model building process as well as in the model evaluation process (if time series are long, e.g. ten-fold cross-validation can be used instead; Hastie et al., 2001). The modification of leave-one-out cross-validation specifically presented here is appropriate for daily or sub-daily atmospheric time series, because it accounts for temporal autocorrelation (i.e. persistence; Madden, 1979).

First, the predictor and predictand time series, consisting of daily means, are separated into twelve different time series of daily means for the twelve months of the year. All steps described below are repeated separately and independently for each month's time series. The simplest way to relate the (a priori) predictor to a Gaussian target variable is a linear regression model. Note that the model is not appropriate for non-Gaussian target variables (e.g. precipitation). It applies

$$y_s(t) = \alpha \, x_s(t) + \epsilon(t), \qquad (1)$$

where $\epsilon(t)$ is the model error, assumed to follow a Gaussian distribution with zero mean, and $\alpha$ is the least-squares regression parameter. Note that least-squares regression does not account for the time ordering in data series, and is therefore not affected by the use of discontinuous (month-separated) time series. Because $y_s(t)$ and $x_s(t)$ are standardized predictand and predictor time series (it applies $\langle y_s \rangle_t = \langle x_s \rangle_t := 0$ and $\sigma_t(y_s) = \sigma_t(x_s) := 1$, with the standardization parameters $\langle \cdot \rangle_t$ being the temporal mean, and $\sigma_t(\cdot)$ the temporal standard deviation of a variable), it can be shown that $\alpha$ is equal to the correlation coefficient (Von Storch and Zwiers, 2001).
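The equality between the regression parameter and the correlation coefficient can be made explicit in a short derivation. It assumes a regression of the standardized predictand on the standardized predictor without intercept, and the same normalization (e.g. division by $n$) in the temporal mean, the standard deviation and the covariance; $r(x, y)$ denotes the correlation coefficient, a symbol introduced here for illustration.

```latex
\begin{align}
\alpha
  &= \frac{\sum_t x_s(t)\, y_s(t)}{\sum_t x_s(t)^2}
   = \frac{\langle x_s\, y_s \rangle_t}{\langle x_s^2 \rangle_t}
   && \text{(least-squares solution)} \\
  &= \langle x_s\, y_s \rangle_t
   && \text{(since } \langle x_s^2 \rangle_t = \sigma_t(x_s)^2 = 1\text{)} \\
  &= \frac{\big\langle (x - \langle x \rangle_t)\,(y - \langle y \rangle_t) \big\rangle_t}
          {\sigma_t(x)\, \sigma_t(y)}
   = r(x, y).
\end{align}
```

Because both standardized series have zero mean, the fitted intercept vanishes, which is why the derivation needs only the slope.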
With $\hat{y}(t) := y(t) - \epsilon(t)$, Eq. (1) can generally be rewritten including the untransformed predictand $y(t)$, predictor $x(t)$, and $\langle \cdot \rangle_t$ and $\sigma_t(\cdot)$:

$$\hat{y}(t) = \langle y \rangle_t + \alpha \, \frac{\sigma_t(y)}{\sigma_t(x)} \, \big( x(t) - \langle x \rangle_t \big). \qquad (2)$$

To estimate $\hat{y}(t)$ and $\epsilon(t)$, a modification of leave-one-out cross-validation (Michaelsen, 1987) is applied. Cross-validation is repeated $n_{cv}$ times. Here we use $n_{cv} = n$ ($n$ is the number of observations in each month-separated time series). Please note again (as defined above) that an observation in the month-separated time series is a daily mean value. In each cross-validation repetition, $n_{lo}$ observations (thus, daily means) are excluded from the model fit (the "left-out" observations), with

$$n_{lo} = 2\tau + n_{io}. \qquad (3)$$

$\tau$ is the temporal lag for which the autocorrelation function of $y$ is within the 95 % confidence interval of Gaussian white noise (the 95 % confidence interval is approximated by $\pm 2/n^{1/2}$). $n_{io}$ is the number of independent observations used to estimate the model error. In general $n_{io}$ can be chosen to be larger than one, in which case the leave-one-out cross-validation becomes moving-block cross-validation (Kunsch, 1989). In this study we choose $n_{io} = 1$.

In each cross-validation step ($cv$), $n_T = n - n_{lo}$ ($T$ for training) data pairs $\{y_T, x_T\} := \{y(t_T(cv)), x(t_T(cv))\}$ are used to estimate the parameters of the simple linear model. Thus in Eq. (2) all parameters are computed from the training data only, i.e. it applies

$$\alpha = r(y_T, x_T), \quad \langle \cdot \rangle_t = \langle \cdot \rangle_{t_T(cv)}, \quad \sigma_t(\cdot) = \sigma_{t_T(cv)}(\cdot), \qquad (4)$$

where $r$ denotes the correlation coefficient, and $\{y_V, \hat{y}_V\} := \{y(t_V(cv)), \hat{y}(t_V(cv))\}$ ($V$ for validation) are then used to estimate the model error (Eq. 1). $y_V$ is the central one of the withheld observations in each cross-validation step $cv$ and can be considered independent of the calibration process.
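The cross-validation procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the lag is taken as the smallest lag at which the sample autocorrelation falls within ±2/√n, neighbouring entries of the month-separated series are treated as consecutive, and the left-out window is simply clipped at the ends of the series. The demo data are synthetic.

```python
import math
import random
from statistics import fmean

def autocorr(y, lag):
    """Sample autocorrelation of y at a positive lag."""
    m = fmean(y)
    denom = sum((v - m) ** 2 for v in y)
    num = sum((y[i] - m) * (y[i + lag] - m) for i in range(len(y) - lag))
    return num / denom

def decorrelation_lag(y):
    """Smallest lag with |autocorrelation| <= 2 / sqrt(n) (assumed rule)."""
    bound = 2.0 / math.sqrt(len(y))
    for lag in range(1, len(y) - 1):
        if abs(autocorr(y, lag)) <= bound:
            return lag
    return len(y) - 2  # fallback: series never drops inside the band

def fit_line(x, y):
    """Least-squares slope and intercept of y regressed on x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def block_loo_cv(x, y, tau):
    """Withhold up to 2*tau + 1 values around each target (n_io = 1)."""
    n = len(y)
    y_val, y_hat, y_ref = [], [], []
    for c in range(n):
        train = [i for i in range(n) if abs(i - c) > tau]
        x_t = [x[i] for i in train]
        y_t = [y[i] for i in train]
        slope, intercept = fit_line(x_t, y_t)
        y_val.append(y[c])                      # central, independent value
        y_hat.append(intercept + slope * x[c])  # model value at that step
        y_ref.append(fmean(y_t))                # reference: training mean
    return y_val, y_hat, y_ref

# Synthetic, autocorrelated demo data (not AWS or reanalysis values).
random.seed(1)
x_demo, level = [], 0.0
for _ in range(120):
    level = 0.6 * level + random.gauss(0.0, 1.0)
    x_demo.append(level)
y_demo = [0.8 * v + random.gauss(0.0, 0.5) for v in x_demo]

tau_demo = decorrelation_lag(y_demo)
y_val, y_hat, y_ref = block_loo_cv(x_demo, y_demo, tau_demo)
print(tau_demo, len(y_hat))
```

Fitting slope and intercept on the untransformed training pairs is algebraically the same as standardizing with the training mean and standard deviation and then applying the correlation coefficient, so the sketch avoids an explicit standardization step.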
When the cross-validation process is completed, the skill score (SS) can be calculated (e.g. Wilks, 2006):

$$SS = 1 - \frac{mse}{mse_r}, \qquad (5)$$

and, with Eq. (1):

$$mse = \frac{1}{n_{cv}} \sum_{cv=1}^{n_{cv}} \epsilon^2(cv), \qquad (6)$$

where $\epsilon(cv) = y_V - \hat{y}_V$, and $cv = 1, \ldots, n_{cv}$. $mse_r$ is the mean of squared errors of the reference model, $\hat{y}_r$, as follows:

$$mse_r = \frac{1}{n_{cv}} \sum_{cv=1}^{n_{cv}} \big( y_V - \hat{y}_r \big)^2, \qquad \hat{y}_r = \langle y_T \rangle. \qquad (7)$$

SS (Murphy, 1988) is a more accurate skill measure than the squared correlation coefficient $r^2$ (Wilks, 2006). SS as specifically presented here can be considered as $r^2$ deflated by the conditional bias term (the unconditional bias is by construction constrained to be zero in least-squares regression; definitions of reliability and bias terms are given in the work of Murphy, 1988).
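A minimal sketch of the skill score computation follows, assuming that the cross-validated observations, model values and reference values (the training means) are already available as equally long lists. The function name `skill_score` and the four-value example are hypothetical.

```python
from statistics import fmean

def skill_score(y_val, y_hat, y_ref):
    """SS = 1 - mse / mse_r from cross-validated values."""
    mse = fmean((v - h) ** 2 for v, h in zip(y_val, y_hat))
    mse_r = fmean((v - r) ** 2 for v, r in zip(y_val, y_ref))
    return 1.0 - mse / mse_r

# Hypothetical cross-validation output for four repetitions.
y_val = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.0, 2.5, 4.0]
y_ref = [2.5, 2.5, 2.5, 2.5]

ss = skill_score(y_val, y_hat, y_ref)
print(round(ss, 3))
```

With these numbers mse is 0.125 and mse_r is 1.25, so the score is 0.9. A model no better than the training mean gives a score of zero, and a model that is worse gives a negative score.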
The ESD method and the skill assessment are appropriate also when only short measured time series are available for the model calibration and validation. Below, we show an application with measurement series of about three to seven years from the three AWSs introduced earlier. In order to quantify exactly whether the data series are long enough for the ESD model and skill assessment to be reliable (or useful), the significance of SS is determined by performing a left-tailed t-test of the null hypothesis that the mean of the squared model errors estimated by the cross-validation process, $\epsilon^2(cv)$ (cf. Eqs. 5 and 6), is equal to the mean of squared errors of the reference model, $mse_r$, against the alternative that the mean of $\epsilon^2(cv)$ is smaller than $mse_r$. SS is considered significant when the null hypothesis is rejected at the 5 % significance level. In performing the t-test, we consider the autocorrelation of $\epsilon^2(cv)$ by replacing the sample size $n_{cv} = n$ with the effective sample size $n_{eff}$ (Wilks, 2006), approximated as

$$n_{eff} \approx n \, \frac{1 - \tau_1}{1 + \tau_1}, \qquad (8)$$

with $\tau_1$ being the lag-1 autocorrelation of $\epsilon^2$ (not to be confused with the cross-validation lag $\tau$). Finally, let $\langle v \rangle_{cv}$ be the mean, and $\sigma_{cv}(v)$ the standard deviation, of a variable $v$ over all $cv$ repetitions; then the final model $\hat{y}_F$, with model uncertainty estimated by cross-validation (not to be mistaken for the model error $\epsilon(t)$), is

$$\hat{y}_F(t) = \langle \hat{y}(t) \rangle_{cv} \pm \sigma_{cv}(\hat{y}(t)). \qquad (9)$$

Equations (1)-(7) apply similarly in multiple regression, where $x$ has multiple columns (i.e. $x \in \mathbb{R}^{n \times p}$). Note that, even though not shown in this study, SS as defined above is a powerful goodness-of-fit estimate especially in the case of multiple predictors, because it detects over-fitting (then SS is zero or not significant).
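The significance analysis can be sketched as follows, under several assumptions that go beyond the text: the test is treated as a one-sample test of the squared errors against a fixed reference value, the effective sample size is capped at the actual sample size when the lag-1 autocorrelation is negative, and the critical value is taken from the standard normal distribution as a large-sample stand-in for the t distribution (the Python standard library offers no t quantile). All names and numbers are hypothetical.

```python
import math
from statistics import NormalDist, fmean, stdev

def lag1_autocorr(v):
    """Lag-1 sample autocorrelation of a series."""
    m = fmean(v)
    denom = sum((a - m) ** 2 for a in v)
    num = sum((v[i] - m) * (v[i + 1] - m) for i in range(len(v) - 1))
    return num / denom

def effective_sample_size(v):
    """n * (1 - r1) / (1 + r1), capped at n (the cap is an assumption)."""
    r1 = lag1_autocorr(v)
    return min(float(len(v)), len(v) * (1.0 - r1) / (1.0 + r1))

def left_tailed_test(sq_err, mse_r, level=0.05):
    """Test H0: mean(sq_err) == mse_r against H1: mean(sq_err) < mse_r."""
    n_eff = effective_sample_size(sq_err)
    t_stat = (fmean(sq_err) - mse_r) / (stdev(sq_err) / math.sqrt(n_eff))
    critical = NormalDist().inv_cdf(level)  # normal approximation
    return t_stat, n_eff, t_stat < critical

# Hypothetical squared cross-validation errors and reference error.
sq_err = [0.2, 0.3, 0.1, 0.4, 0.2, 0.3, 0.1, 0.4, 0.25, 0.15]
t_stat, n_eff, significant = left_tailed_test(sq_err, mse_r=1.0)
print(round(t_stat, 2), n_eff, significant)
```

For short month-separated series the normal approximation is somewhat too permissive; a t quantile with roughly n_eff minus one degrees of freedom, taken from a statistics library, would be the closer match to the test described in the text.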

Example
Figure 3 provides an illustrative example of the skill estimation procedure described above. The two plots show daily air temperature time series ($y$ in Eq. 2) of the months March (top) and July (bottom) at AWS1 (blue line). The red line is the linear model $\hat{y}$ based on rea-ens-air, as defined by Eq. (2). The example shows the model building and error estimation in an individual cross-validation repetition $cv$ (Eqs. 6-7). The grey bar indicates the $n_{lo}$ observations left out in the model calibration (note that in each cross-validation round $cv$, the grey bar is shifted one observation to the right). The number of observations left out is determined by the cross-validation parameter $\tau$ (the temporal lag for which the autocorrelation of the time series can be assumed to be zero). The values of $\tau$ are 9 in March and 2 in July. This means that, e.g., in the March time series an observation is considered independent of another observation only if there is a shift of at least 9 time steps (in this case, days) between the two observations (thus, the grey bar includes 9 · 2 + 1 = 19 left-out observations). The error $\epsilon(cv)$ for each repetition $cv$ (Eq. 6) is estimated as the difference between the central (independent) observation ($y_V$, the blue star in the grey bar in Fig. 3) and the model value at this time step ($\hat{y}_V$, the red star in the grey bar in Fig. 3). $\hat{y}_r$ in Eq. (7) is the black star in Fig. 3, calculated as the mean of $y_T$, the observations used in the model training.

Cross-validation is repeated until each observation has been used once as $y_V$ to determine the model error. This way independence between the observations used in the model training and the validation process is ensured, and at the same time all observations can be used to determine the final model (Eq. 9). Note that the results in Fig. 3 are discussed in Sect. 5.2.
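The size of the left-out window in this example can be reproduced with a few lines of Python. The helper `left_out_window`, the series length of 124 and the centre index are hypothetical choices for illustration, and clipping the window at the ends of the series is an assumption not stated in the text.

```python
def left_out_window(center, tau, n):
    """Indices withheld around `center` for n_io = 1, clipped to the series."""
    lo = max(0, center - tau)
    hi = min(n - 1, center + tau)
    return list(range(lo, hi + 1))

# March example: tau = 9 gives 2 * 9 + 1 = 19 withheld daily means.
march = left_out_window(center=40, tau=9, n=124)
# July example: tau = 2 gives 2 * 2 + 1 = 5 withheld daily means.
july = left_out_window(center=40, tau=2, n=124)
print(len(march), len(july))
```

Near the start or end of a series the clipped window is shorter than 2τ + 1, so slightly more observations remain available for training there.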

We define the standardization parameter $R_\sigma$ as the ratio between the standard deviation of the predictand $\sigma_t(y)$ and the standard deviation of the predictor $\sigma_t(x)$:

$$R_\sigma = \frac{\sigma_t(y)}{\sigma_t(x)}.$$

$R_\sigma$ is a parameter in the ESD Eq. (2). Figure 4 shows the regression parameter $\alpha$ (equal to the correlation coefficient, Eq. 1) and the standardization parameter $R_\sigma$ with uncertainties, estimated by cross-validation (Eqs. 4-9), for each calendar month and data from AWS1. Most remarkably, both $\alpha$ and $R_\sigma$ show a relatively high inter-monthly variability. Values of $\alpha$ are relatively high all year round, but highest for the months January-May (wet season in the Cordillera Blanca), with a second maximum in September, and slightly lower for the dry-season months June-August, and November-December. The largest coefficient uncertainties are evident for the months February-April. $R_\sigma$ varies throughout the year between approximately 0.9 and 1.4. In December, the difference between the (day-to-day or year-to-year) variabilities of the predictand and rea-ens-air is largest, whereas in May-August the variability of the predictand is about the same as (or even lower than) the variability of rea-ens-air. The rather high inter-monthly variations in the downscaling parameters clearly support the importance of using different models for the different calendar months.

Figure 5 (red circles) shows values of the cross-validation parameter $\tau$ of AWS1 daily air temperature means for each month of the year. As defined in Sect. 4.3, $\tau$ can be interpreted in terms of the temporal lag (in days) for which the time series values can be assumed independent. Values of $\tau$ for the different calendar months vary between 2 (small persistence) and 11 days (high persistence). Values of $\tau$ are 2 or 3 for all months except the wet-season months February and March and the transitional-season month April, where values of $\tau$ are considerably higher (7, 9 and 11, respectively).
The higher values of $\tau$ in these months are probably related to the prevailing intra-seasonal variability in the tropical Andes, i.e. rainy episodes in terms of sequences of wet days followed by sequences of dry days, with associated variances in air temperature (Garreaud et al., 2003). In the tropics, such synoptic episodes are known to typically range from 30 to 60 days (with the basic mechanism known as the Madden-Julian Oscillation, MJO; Madden and Julian, 1994); for the Bolivian Altiplano (located near the Cordillera Blanca), however, Garreaud et al. (2003) report shorter synoptic periods of approximately 15 days. Please note that by examining $\tau$ of the month-separated time series, it is not possible to identify the full length of MJO cycles. Small values of $\tau$, especially for the austral winter months, indicate not only small day-to-day (intra-seasonal) but also small inter-annual variability. Consequently there are no important differences amongst the different years for the respective austral winter months. In fact, ENSO, the most important source of inter-annual variability in the region, has its strongest and most widespread impacts during austral summer.
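To make the role of the two downscaling parameters concrete, the following Python sketch estimates the regression parameter (as a correlation coefficient) and the standard deviation ratio from one month-specific series and applies them in the untransformed linear model. The function names and the five-value series are hypothetical; population standard deviations are used, which is an assumption about the normalization.

```python
from statistics import fmean, pstdev

def esd_parameters(x, y):
    """Return alpha (correlation), R_sigma, and the two temporal means."""
    mx, my = fmean(x), fmean(y)
    sx, sy = pstdev(x), pstdev(y)
    cov = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy), sy / sx, mx, my

def esd_predict(x_new, alpha, r_sigma, mx, my):
    """Untransformed model: mean(y) + alpha * R_sigma * (x - mean(x))."""
    return my + alpha * r_sigma * (x_new - mx)

# Hypothetical predictor and predictand values for one calendar month.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

alpha, r_sigma, mx, my = esd_parameters(x, y)
y_new = esd_predict(6.0, alpha, r_sigma, mx, my)
print(round(alpha, 3), round(r_sigma, 3), round(y_new, 3))
```

In this perfectly linear toy case the correlation is 1 and the standard deviation ratio is 2, so a predictor value of 6 maps to a predictand value of 12. A ratio above one means that the local series varies more strongly than the predictor, as reported above for December.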

Skill assessment and significance analysis
Figure 6 shows values of SS of rea-ens-air for each month of the year, with AWS1 daily air temperature as the predictand. As mentioned earlier, data from AWS1 represent the longest high-resolution measurement series available at high altitude in the Cordillera Blanca (i.e. July 2006-July 2012, without data gaps, hereafter period 2; green circles in Fig. 6). Values of SS show a distinct seasonal pattern, with two maxima for April and for September (SS ≈ 0.6), and two minima for June and for November (SS ≈ 0.3). Overall, values of SS are lowest for the core dry-season months June and July, and highest for the wet-season months February-April. The high values of SS in the wet season imply that the largest portion of day-to-day variability of the AWS1 air temperature time series can be explained by rea-ens-air, indicating high spatial homogeneity of air temperature fluctuations for these months of the year. In fact, Garreaud et al. (2003) report spatially very coherent intra-seasonal weather patterns for the nearby Bolivian Altiplano, and the MJO is also known to act on large spatial scales (local wavelengths of 1.2-2 × 10³ km). By contrast, in the core dry season values of SS of rea-ens-air reach only half of the wet-season values. We suggest that variability in these months is governed by processes that act more locally (e.g. triggered by the strong radiation interacting with the complex topography), such that the generally weaker synoptic forcing in these months affects the local-scale variability in a more complex way that is only partially captured by the single linear predictor rea-ens-air.
In Fig. 3 (lower panel), the model chosen by least-squares regression for the dry-season month July shows only minor variance, as the co-variability between rea-ens-air and the predictand data series is small. Year-to-year as well as day-to-day variability in the observational time series is evidently smaller than for the wet-season month March (upper panel in Fig. 3). Nevertheless, the underestimation of observed variance is smaller for the wet-season month March, indicating a higher co-variability between the predictor and predictand time series (note that, by construction of least-squares regression, the difference in underestimation of observed variance is directly related to the co-variability between predictand and predictor time series).
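The link between variance underestimation and co-variability mentioned in the parenthesis can be written out explicitly. The following is the standard result for least-squares regression with a single predictor; the notation is our assumption for illustration (x is the predictor, y the predictand, ŷ the fitted values, r the correlation between x and y) and is not taken from the model equations of Sect. 4.3.

```latex
\hat{y}_i = a + b\,x_i, \qquad b = \frac{\operatorname{cov}(x,y)}{\operatorname{var}(x)}
\quad\Longrightarrow\quad
\operatorname{var}(\hat{y}) = b^{2}\operatorname{var}(x)
= \frac{\operatorname{cov}(x,y)^{2}}{\operatorname{var}(x)}
= r^{2}\,\operatorname{var}(y)
```

Under these assumptions the fitted series reproduces exactly the fraction r² of the observed variance, so a month with lower co-variability necessarily shows a stronger underestimation of observed variance.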
The grey bars in Fig. 6 show results of the skill assessment applied to the AWS1, AWS2, and AWS3 data over the period for which measurements are available for all three AWSs (i.e. from 2006/7 to 2009/12, period 1). Whereas for the same period differences in SS between the AWSs are rather small (all AWSs show a similar seasonal pattern, with minima and maxima similar to those of AWS1 for the entire data period, and an additional minimum in the core wet season), differences in the values of SS between period 1 and period 2 are quite evident. Overall, but mostly for the core wet season, values of SS for AWS1 based on the shorter period 1 are considerably lower than those based on period 2. This is a somewhat expected result, because the periods for the skill assessment are very limited. Thus, as the data base for model training almost doubles from period 1 to period 2, the model parameters are estimated more accurately, and the cross-validation results are more positive. Values of SS are expected to become less dependent on the length of the measuring period as the amount of available data increases.
The significance analysis as described in Sect. 4.3 reveals that the skill scores for all months and all AWSs shown in Fig. 6 are significant at the 5 % significance level. By systematically assessing the significance of SS for decreasing time periods, a minimum time period can be identified for which the skill assessment is significant (and thus useful). This threshold time period can be quantified in terms of a minimum number of observations required in each calendar month for the skill assessment to be significant. An AWS time series can be considered long enough for the skill assessment if, for all months, the number of available observations exceeds this minimum number. Figure 5 shows the minimum number of observations required, n_min, using the example of data from AWS1, for each month of the year. This minimum number of observations is a function of SS of each month, as well as of the autocorrelation of each month's time series (by considering the effective sample size of each time series, n_eff, Eq. 8, also shown in Fig. 5). The minimum number of observations varies widely, from about 40 for months with high values of SS (December-March, May, September) to 140 for months with low values of SS (July, November). Since the AWS1 time series includes more than 120-150 (period 1) and 210 (period 2) observations for each month (daily means), the minimum time length for the skill assessment to be significant is achieved for all months (see also Fig. 3), though only barely for July and November for period 1.
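To illustrate how autocorrelation reduces the number of independent observations, the following minimal sketch computes an effective sample size from the lag-1 autocorrelation. Eq. 8 is not reproduced in this section, so the sketch assumes the common AR(1) approximation n_eff = n (1 − ρ1) / (1 + ρ1), which may differ from the definition used in Sect. 4.3. The function names are hypothetical, and the sketch ignores the year boundaries inside a month-separated series.

```python
import numpy as np

def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.sum(d[:-1] * d[1:]) / np.sum(d * d))

def effective_sample_size(x):
    """AR(1)-based effective sample size (assumed form, not Eq. 8 itself)."""
    n = len(x)
    # Negative estimates are clipped so that n_eff never exceeds n.
    rho1 = max(lag1_autocorrelation(x), 0.0)
    return n * (1.0 - rho1) / (1.0 + rho1)

# Synthetic example: a strongly persistent AR(1) series of 210 "daily means".
rng = np.random.default_rng(0)
series = np.zeros(210)
noise = rng.standard_normal(210)
for t in range(1, 210):
    series[t] = 0.8 * series[t - 1] + noise[t]
print(effective_sample_size(series))
```

A month would pass the length criterion in this simplified picture when its effective sample size is large enough for the skill score to be significant; the actual threshold n_min depends on SS as described above.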

Values of SS and r² averaged over all months for the different AWSs are shown in Table 2. The values are shown for the entire period of available data at each AWS, and additionally for AWS1 and AWS2 for period 1, the period of common available data at all AWSs. On average, SS is highest at AWS2 (mean SS = 0.48) and lowest at AWS3 (mean SS = 0.4). As discussed above for AWS1, values of SS are considerably lower for the shorter time period (period 1) for AWS2 as well. Table 2 further shows that r² overestimates the skill compared to SS based on cross-validation, with SS being up to 0.08 lower (AWS3). The overestimation of skill by r² can be expected to increase strongly in the case of multiple predictors (multiple regression), due to overfitting.
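The difference between r² and a cross-validated skill score can be illustrated with a minimal sketch on synthetic data. It assumes a mean-squared-error skill score with the training-sample mean as reference forecast, and a leave-one-out scheme in which a buffer of neighbouring values is also withheld; the exact scheme of Sect. 4.3 (including the role of τ) may differ, and the function name and the `half_width` parameter are hypothetical.

```python
import numpy as np

def cross_validated_skill(x, y, half_width=3):
    """Return (SS, r2) for a single-predictor linear model.

    SS = 1 - MSE_model / MSE_reference, both evaluated on held-out values.
    For each held-out index i, all values within `half_width` steps of i
    are removed from the training sample (an assumed buffering scheme).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.arange(len(y))
    pred = np.empty(len(y))
    ref = np.empty(len(y))
    for i in idx:
        train = np.abs(idx - i) > half_width
        slope, intercept = np.polyfit(x[train], y[train], 1)
        pred[i] = intercept + slope * x[i]
        ref[i] = y[train].mean()  # reference forecast: training mean
    mse_model = np.mean((pred - y) ** 2)
    mse_ref = np.mean((ref - y) ** 2)
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    return 1.0 - mse_model / mse_ref, r2

# Synthetic predictor/predictand pair (not the AWS or reanalysis data).
rng = np.random.default_rng(1)
x = rng.standard_normal(200)
y = 0.8 * x + 0.5 * rng.standard_normal(200)
ss, r2 = cross_validated_skill(x, y)
print(f"SS = {ss:.3f}, r2 = {r2:.3f}")
```

With a single predictor the two measures stay close; adding further predictors to the fit would be expected to widen the gap, which is the overfitting effect described above.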

Results for different time scales
In this study the short length of the data represents a lower limit of possible time resolutions for the skill assessment, because for lower temporal resolutions the number of observations n available for the model set-up decreases. More specifically, the seven years of observations at AWS1 include a time series of 84 monthly means, or 2555 daily means. As we use separate models for each month, each time series includes only seven monthly means, but approximately 210 daily means. In this regard it makes sense to profit from the higher (daily) temporal resolution in order to have more observations for the model fit. Nevertheless it is worth investigating whether the reanalysis data show higher skill for temporal resolutions lower than the daily resolution.
In this section we repeat the modeling procedure as introduced in Sect. 4.3, but for different temporal resolutions. The number of observations n at each time scale is held constant, in order to allow for a comparison of SS across the different time scales. The lowest temporal resolution for which the AWS1 time series (period 1) still includes enough values for the skill assessment to be significant (as defined in Sect. 4.3) differs between the calendar months. Figure 7 shows values of SS for decreasing temporal resolution (from daily means to twelve-day means) for those calendar months of AWS1 data for which SS is found to be significant at the respective time resolution.

The values are weighted with the number of months with significant values of SS at each time resolution, such that their sum (total length of each colored bar in Fig. 7) is the average skill for each time scale. Note that the values for the daily means are lower than the values of SS found in Fig. 6, because they are based on considerably smaller data samples.
As one would expect, the average SS increases for longer averaging intervals. However, the increase is not monotonic but rather stepwise, with a considerable increase from the three- to four-day averages, rather constant values thereafter, and another considerable increase from the nine- to ten-day averages. This suggests that it makes sense to use one-, four- or ten-day averages, rather than three- or nine-day averages, in order to obtain the highest possible skill based on the highest possible number of observations (thus the highest skill with the greatest significance). To sum up, predictability based on large-scale predictors is found to increase with decreasing time resolution. Consequently, when longer time series are available, we suggest using longer averaging windows in order to obtain higher skill.
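The aggregation to longer averaging intervals with a constant number of observations can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the blocks are non-overlapping and taken from the start of the series, and blocks are allowed to straddle the year boundaries of a month-separated series, none of which is specified in the text.

```python
import numpy as np

def block_means(x, k):
    """Non-overlapping k-value means; an incomplete final block is dropped."""
    x = np.asarray(x, dtype=float)
    n_blocks = len(x) // k
    return x[: n_blocks * k].reshape(n_blocks, k).mean(axis=1)

def block_means_fixed_n(x, k, n):
    """First n k-value means, so that every time scale uses the same n."""
    means = block_means(x, k)
    if len(means) < n:
        raise ValueError(f"series too short for {n} means of length {k}")
    return means[:n]

# Synthetic stand-in for roughly 210 daily means of one calendar month.
daily = np.random.default_rng(2).standard_normal(210)
n_common = len(daily) // 12  # sample size available at the coarsest scale
for k in (1, 4, 10, 12):
    print(k, block_means_fixed_n(daily, k, n_common).shape)
```

Fixing n at the value available for the coarsest resolution is what makes the daily-mean skill in such a comparison lower than in an assessment that uses all daily values.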

Towards automated predictor selection
To this point we have presented the ESD model assessment with a predictor selected by arguments independent of the data (a priori predictor selection). In this section we compare the performance of the a priori selected predictor rea-ens-air to a list of other potential predictors, to show how the skill assessment presented here can be used for data-based predictor selection. Table 3 gives a list of all abbreviations of the nine variables presented hereafter: shm, gph, air, uwn, vwn, spr, vor, t2m, and wwn. For the demonstrative purpose here, all assessed variables are from the ECMWF interim reanalysis at its optimum scale (i.e. four by four grid points located around the study site, Hofer et al., 2012), which has shown the highest skill for the study site out of all global reanalyses (Hofer et al., 2012). We use the ECMWF interim reanalysis instead of the reanalysis ensembles because not all variables assessed here are available from all reanalyses as analysis variables, and because we observed inhomogeneities in the CFSR variables spr and wwn between data before and after December 2010 (the discrepancies are due to changes in the model configuration from CFSR, available until December 2010, to the subsequently operationally available CFSv2).
The results of the skill assessment reveal that only three of the assessed variables show significant skill for the locally measured daily air temperature predictand: air, t2m and gph. Values of SS of these variables for AWS1 data are shown in Fig. 8 for each month of the year. For all other variables listed in Table 3, values of SS are not significantly different from zero. The variable t2m clearly shows lower values of SS than air in all months. Thus the reanalysis variables that are less affected by the model surface actually emerge as the better predictors for the surface air temperature predictand at AWS1. This is consistent with findings in earlier studies (e.g. Murphy, 1999; Rummukainen, 1997) and supports our a priori assumption that unrealistic surface information negatively affects the data. We conclude from Fig. 8 that (1) only air temperature predictors, or gph (which is physically closely linked with air temperature), show significant skill regarding day-to-day and year-to-year variability based on our linear model set-up (presented in Sect. 4.3), (2) the a priori predictor choice is clearly supported by the data, and (3) even though values of SS show a distinct seasonal cycle, the same predictors show the highest skill throughout the year.

Accounting for the effects of diurnal periodicity
Finally we show a simple analysis based on AWS1 air temperature data to demonstrate the effects of periodicity (in this case diurnal) on regression analysis based solely on r² (an often applied criterion for predictor selection). Figure 9 shows values of r² between six-hourly (all-month) time series of AWS1 air temperature and the predictors assessed in Sect. 9: shm, gph, air, uwn, vwn, spr, vor, t2m, and wwn. The same analysis is shown at a daily time scale (thus r² between the same time series, but with the six-hourly data averaged to daily means). Results for the six-hourly data (left-hand side in Fig. 9) show that t2m emerges as the best predictor, with a relatively high covariance (r² > 0.7), whereas all other predictors show only minor covariance (r² < 0.1)
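The effect of a shared diurnal cycle on r² can be reproduced with purely synthetic data. In the following minimal sketch, two six-hourly series share the same diurnal cycle but have statistically independent day-to-day anomalies; the cycle amplitudes and noise levels are arbitrary assumptions and are not derived from the AWS1 or reanalysis data.

```python
import numpy as np

def r_squared(a, b):
    """Squared Pearson correlation between two series."""
    return float(np.corrcoef(a, b)[0, 1] ** 2)

def daily_means(six_hourly):
    """Average four consecutive six-hourly values into one daily mean."""
    return np.asarray(six_hourly, dtype=float).reshape(-1, 4).mean(axis=1)

rng = np.random.default_rng(0)
n_days = 365
# Shared diurnal cycle (arbitrary amplitudes that average to zero per day).
cycle = np.tile([-3.0, -1.0, 4.0, 0.0], n_days)
# Independent day-to-day anomalies for "observation" and "predictor".
obs = cycle + np.repeat(rng.standard_normal(n_days), 4)
pred = cycle + np.repeat(rng.standard_normal(n_days), 4)

r2_six_hourly = r_squared(obs, pred)
r2_daily = r_squared(daily_means(obs), daily_means(pred))
print(f"six-hourly r2 = {r2_six_hourly:.2f}, daily r2 = {r2_daily:.2f}")
```

In this construction the six-hourly r² is high solely because of the common periodic component, while the daily r² is close to zero, which is why r² computed on sub-daily data is a misleading criterion for day-to-day predictive skill.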

predictor rea-ens-air is assessed for the longest AWS air temperature series. High seasonality of statistical data properties (e.g. persistence) and ESD model parameters emphasizes the importance of using different models for different times of year. The ESD model skill shows high seasonality as well, with generally higher skill in the wet season and lower skill in the dry season. Whereas differences in skill between the AWSs are rather small, the skill tends to increase with the number of available observations. The skill assessment is shown to be significant at a 5 % test level for a minimum of 40-140 daily observations available for each calendar month (depending on calendar month). For the same number of observations at different temporal resolutions (i.e. from one-day to twelve-day averages), values of skill averaged over all months increase with increasing averaging time windows. Consequently we suggest switching to lower temporal resolutions when the ESD model skill is low, provided that long enough data series are available. The predictor air clearly shows higher skill than other potential predictors, such as t2m or gph. We suggest that large-scale surface variables are weak predictors if the model surface is not representative of the real surface, because unrealistic boundary layer variability masks the relevant synoptic forcing and variability. With skill assessments that do not account for periodicity, the remarkably lower performance of t2m compared to air cannot be identified. For six further assessed reanalysis predictors the skill assessment reveals no significant skill. The presented ESD model can be generalized to non-linear, multiple regression problems (not shown here). The validation process is especially useful in multiple predictor fitting because it detects over-fitting. The method is not restricted to reanalysis data and can be applied to any atmospheric model predictors.

which include the longest time series available from all installed AWSs to date. Yet the measuring periods are still relatively short, ranging from July 2006 to July 2012 (AWS1), to August 2011 (AWS2), and to December 2009 (AWS3), with three months of missing data at AWS2. AWS1, AWS2, and AWS3 are situated at 5050, 4825, and 4950 m a.s.l., respectively, on rocky terrain (glacial polish and moraines) in the vicinity of retreating glacier tongues. Whereas AWS1 and AWS2 are located very close to each other in the Paron valley (only about 2 km apart), AWS3 is located approximately 100 km further south in the Shallap valley. The sites are indicated in Fig. 1.
; Williamson and Laprise, 2000; Räisänen and Ylhäisi, 2011). Due to these numerical
of a contribution due to the correlation between the forecasts and observations, and two penalty terms relating to the reliability and bias of the forecast
In this section the ESD procedure presented above is applied to daily mean air temperature time series of the AWSs in the Cordillera Blanca (introduced in Sect. 2). As mentioned in Sect. 4.3, the linear models are calibrated and validated independently for each calendar month and for each of the AWSs' time series. Thus the time series finally input to the ESD procedure (e.g. January at AWS1) consist of seven months (consecutive Januaries) of daily data from the seven years of available observations (July 2006-July 2012), and consequently of approximately 200 observations per month and AWS (data gaps included) to calibrate the ESD model (e.g. the AWS1 March time series consists of 186 values, as does the AWS1 July time series, in Fig.

Fig. 1. The map shows the Rio Santa watershed with the Cordillera Blanca mountain range and the positions of AWS1, AWS2 and AWS3 (mentioned in the text). Also indicated is the 1990 glacier extent (grey shaded area, Georges, 2004).

Fig. 2. Statistics of daily air temperature time series at AWS1 (5050 m a.s.l.), AWS2 (4820 m a.s.l.), and AWS3 (4950 m a.s.l.), and of the a priori selected predictor rea-ens-air (as defined in the text), for each month of the year (abscissa: January-December). Shown are the means (blue solid line) and the medians (red dashes). The edges of the thick blue bars are the 25th and the 75th percentiles. The thin blue bars extend to the most extreme data not considered as outliers, and the red crosses are the outliers. All statistics are computed for period 1 (July 2006-December 2009).

Fig. 9. r² for six-hourly and daily (all-month) time series of AWS1 air temperature and all predictors assessed in Sect. .

Table 1. Specifications of the reanalyses' data grid points applied as predictors: coordinates, surface heights (h), and mean geopotential heights (gph), with standard deviations in brackets, during the investigation period (all values are in units of meters above sea level).

Table 3. Variables (and their abbreviations) assessed in Sect. 9. All predictors are from the interim reanalyses at their optimum spatial domain (as determined in the work of Hofer et al., 2012).