Convolutional conditional neural processes for local climate downscaling

A new model is presented for multisite statistical downscaling of temperature and precipitation using convolutional conditional neural processes (convCNPs). ConvCNPs are a recently developed class of models that allow deep learning techniques to be applied to off-the-grid spatio-temporal data. This model has a substantial advantage over existing downscaling methods in that the trained model can be used to generate multisite predictions at an arbitrary set of locations, regardless of the availability of training data. The convCNP model is shown to outperform an ensemble of existing downscaling techniques over Europe for both temperature and precipitation taken from the VALUE intercomparison project. The model also outperforms an approach that uses Gaussian processes to interpolate single-site downscaling models at unseen locations. Importantly, substantial improvement is seen in the representation of extreme precipitation events. These results indicate that the convCNP is a robust downscaling model suitable for generating localised projections for use in climate impact studies, and motivates further research into applications of deep learning techniques in statistical downscaling.

In contrast, in PP downscaling, the aim is to learn a transfer function f such that

ŷ = f(x, Z),    (1)

where ŷ is the downscaled prediction of a given climate variable whose true value is y at location x and Z is a set of predictors from the climate model (Maraun and Widmann, 2018). This is based on the assumption that while sub-grid-scale and parameterised processes are poorly represented in GCMs, the large-scale flow is generally better resolved.
Multiple different models have been trialled for parameterising f. Traditional statistical methods used for this purpose include multiple linear regression (Gutiérrez et al., 2013; Hertig and Jacobeit, 2013), generalised linear models (San-Martín et al., 2017) and analog techniques (Hatfield and Prueger, 2015; Ayar et al., 2016). More recently, there has been considerable interest in applying advances in machine learning to this problem, including relevance vector machines (Ghosh and Mujumdar, 2008), artificial neural networks (Sachindra et al., 2018), autoencoders (Vandal et al., 2019), recurrent neural networks (Bhardwaj et al., 2018) and convolutional neural networks (Baño-Medina et al., 2020; Höhlein et al., 2020). There has been debate as to whether deep learning methods provide improvement over traditional statistical techniques such as multiple linear regression. Vandal et al. (2019) found that machine learning approaches offered little improvement over traditional methods in downscaling precipitation. In contrast, Baño-Medina et al. (2020) compared the downscaling performance of convolutional neural networks (CNNs) and simple linear models, finding that CNNs improved predictions of precipitation but did not result in improvements for temperature.
Limitations remain in these models. In many climate applications it is desirable to make projections that are both (i) consistent over multiple locations and (ii) specific to an arbitrary locality. The problem of multi-site downscaling has been widely studied, with two classes of approaches emerging. Traditional methods take analogues or principal components of the coarse-resolution field as predictors. The spatial dependence is then explicitly modelled for a given set of sites, using observations at those locations to train the model (Maraun and Widmann, 2018; Cannon, 2008; Bevacqua et al., 2017; Mehrotra and Sharma, 2005). More recent work has sought to leverage advances in machine learning, for example CNNs, for feature extraction (Bhardwaj et al., 2018; Baño-Medina et al., 2020; Höhlein et al., 2020). Each of these classes of methods has its drawbacks.
Traditional methods offer limited feature selection but are able to train directly on true observations. In contrast, deep learning techniques such as CNNs allow for sophisticated feature extraction but can only be applied to gridded datasets. Gridding of observations naturally introduces error, especially in areas with complex topography and for highly stochastic variables such as precipitation (King et al., 2013). The second limitation common to existing downscaling models is that predictions can only be made at sites for which training data are available. A projection at an arbitrary location must instead be obtained by interpolating model predictions or taking the closest available site.
In this study, we address these challenges by developing a new statistical model for downscaling temperature and precipitation at an arbitrary set of sites. This is achieved using convolutional conditional neural processes, state-of-the-art probabilistic machine learning methods combining ideas from Gaussian processes (GPs) and deep neural networks. The model combines the advantages of the existing classes of multi-site approaches, with feature extraction using a convolutional neural network together with training on off-the-grid data. In contrast to existing methods, the trained model can be used to make coherent local projections at any site regardless of the availability of training observations.
The specific aims of this study are as follows: 1. Develop a new statistical model for downscaling GCM output capable of training on off-grid data, making predictions at unseen locations and utilising sub-grid-scale topographic information to refine local predictions.
2. Compare the performance of the statistical model to existing strong baselines.
3. Compare the performance of the statistical model at unseen locations to existing interpolation methods.
4. Quantify the impact of including sub-grid-scale topography on model predictions.
Section 2 outlines the development of the downscaling model, and presents the experimental setup used to address aims 2-4. Section 3 compares the performance of the statistical model to an ensemble of baselines. Sections 4 and 5 explore model performance at unseen locations and the impact of including local topographic data. Finally, Section 6 presents a discussion of these results and suggestions for further applications.

Datasets and methodology
We first outline the development of the statistical downscaling model, followed by a description of three validation experiments.

The downscaling model
Our aim is to approximate the function f in equation 1 to predict the value of a downscaled climate variable y at locations x given a set of coarse-scale predictors Z. In order to take the local topography into account, we assume that this function also depends on the local topography at each target point, denoted e, i.e.

ŷ = f(x, Z, e).

We take a probabilistic approach to specifying f, including a noise model, so that the model outputs a predictive distribution p(y | x, Z, e). Deterministic predictions are made from this by using, for example, the predictive mean

ŷ = ∫ y p(y | x, Z, e) dy.

In this model θ is parameterised as

θ(x) = ψ_MLP(Σ_{n,m} h_nm φ(x − x_nm), e),  with h = CNN(Z).

Figure 1. Schematic of the convCNP model for downscaling precipitation demonstrating the flow of data in predicting precipitation for a given day at target locations x. Gridded coarse-resolution data for each predictor are fed into the CNN, producing predictions of θ = (ρ, α, β) at each grid point. These gridded predictions are then transformed to a prediction at the target location using an exponentiated-quadratic kernel. Finally, these elevation-agnostic predictions are fed into a multi-layer perceptron together with topographic data e to produce a final prediction of the parameters.
Here θ is a vector of parameters of a distribution for the climate variable at prediction locations x. This is assumed to be Gaussian for maximum temperature and a Gamma-Bernoulli mixture for precipitation. e is a vector of sub-grid-scale topographic information at each of the prediction locations, ψ_MLP is a multi-layer perceptron, φ_c is a function parameterised as a neural network and CNN is a convolutional neural network. Each component is described below, with a schematic of the model shown in Figure 1.
1. Convolutional neural network. In the first step, gridded reanalysis predictor data Z are fed into the model. These grids are used as input to a convolutional neural network to extract relevant features. This is implemented as a 6-block ResNet architecture (He et al., 2016) with depthwise separable convolutions (Chollet, 2017). The output from this step is a prediction of the relevant parameters for each variable at each grid point in the predictor set, i.e.

h_nm = CNN(Z)_nm,

where h_nm is the vector-valued output at latitude m Δx_1 and longitude n Δx_2, with m, n ∈ Z⁺ and Δx_i indicating the grid spacing.
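The depthwise separable convolutions used in these ResNet blocks factorise a standard convolution into a per-channel spatial filter followed by a 1x1 channel mixer, reducing the parameter count. The following is a minimal NumPy sketch of the operation only, not of the full 6-block architecture; the shapes and 'valid' padding are illustrative assumptions:

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """Depthwise separable convolution (Chollet, 2017) on a single image.

    x:            (C, H, W) input feature map
    depthwise_k:  (C, kH, kW) one spatial kernel per input channel
    pointwise_w:  (C_out, C) 1x1 convolution mixing channels
    Returns (C_out, H-kH+1, W-kW+1) -- 'valid' padding for simplicity.
    """
    C, H, W = x.shape
    _, kH, kW = depthwise_k.shape
    oH, oW = H - kH + 1, W - kW + 1
    # Depthwise step: each channel is filtered with its own spatial kernel.
    dw = np.empty((C, oH, oW))
    for c in range(C):
        for i in range(oH):
            for j in range(oW):
                dw[c, i, j] = np.sum(x[c, i:i+kH, j:j+kW] * depthwise_k[c])
    # Pointwise step: a 1x1 convolution mixes information across channels.
    return np.einsum('oc,chw->ohw', pointwise_w, dw)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))          # e.g. 4 coarse predictor fields
out = depthwise_separable_conv(x, rng.normal(size=(4, 3, 3)),
                               rng.normal(size=(16, 4)))
print(out.shape)  # (16, 6, 6)
```

The factorisation uses C*kH*kW + C_out*C weights instead of C_out*C*kH*kW for a full convolution, which is one reason the architecture remains compact.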
2. Translation to off-the-grid predictions. These gridded predictions are translated to the off-the-grid target locations x using the outputs from step 1 as weights for an exponentiated-quadratic (EQ) kernel φ, i.e.

θ(x) = Σ_{n,m} h_nm φ(x − x_nm),

where x_nm is the location of grid point (n, m). This outputs predictions of the relevant distributional parameters θ at each target location.
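This translation step can be sketched as a normalised kernel smoothing of the gridded outputs. In the full model the kernel length-scale is a learned parameter; here it is a fixed illustrative value, and the normalisation is a simplifying assumption:

```python
import numpy as np

def eq_kernel_interp(h, grid_lat, grid_lon, targets, lengthscale=1.0):
    """Map gridded CNN outputs h to off-the-grid target locations with an
    exponentiated-quadratic (EQ) kernel.

    h:        (M, N, P) parameter predictions on an M x N grid
    grid_lat: (M,) latitudes; grid_lon: (N,) longitudes
    targets:  (T, 2) target (lat, lon) pairs
    Returns (T, P) parameters at the target locations.
    """
    lat_g, lon_g = np.meshgrid(grid_lat, grid_lon, indexing='ij')
    grid_pts = np.stack([lat_g.ravel(), lon_g.ravel()], axis=1)   # (M*N, 2)
    d2 = ((targets[:, None, :] - grid_pts[None, :, :]) ** 2).sum(-1)
    w = np.exp(-0.5 * d2 / lengthscale**2)                        # (T, M*N)
    w /= w.sum(axis=1, keepdims=True)                             # normalise
    return w @ h.reshape(-1, h.shape[-1])

h = np.ones((3, 4, 2))        # constant field: interpolation should return 1s
theta = eq_kernel_interp(h, np.arange(3.), np.arange(4.),
                         np.array([[1.2, 2.7]]))
print(theta)  # [[1. 1.]]
```

Because the weights decay smoothly with distance, nearby grid points dominate the prediction while the output remains defined at any continuous location.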

3. Inclusion of sub-grid-scale topography
By design, the predictions from the previous step only model variation on the scale of the context grid spacing. This elevation-agnostic output is post-processed using a multi-layer perceptron (MLP), which takes the parameter predictions from the EQ kernel as input together with a vector of topographic data e at each target location.
The vector e consists of three measurements at each target point: the true station elevation, the difference between the grid-scale elevation and the true elevation, and the multi-scale topographic position index (mTPI).

Approaches to parameterising the transfer function f range from deterministic methods, where f is parameterised by a neural network optimised by gradient descent, to Bayesian methods specifying a distribution over f. Conditional neural processes (CNPs; Garnelo et al., 2018) combine the advantages of these approaches, using neural networks to parameterise a stochastic process over the output variable, in this case either temperature or precipitation.
Various improvements to the original CNP model have been suggested, for example the inclusion of a global latent variable (Garnelo et al., 2018) and attention mechanisms to improve the flexibility of the model and prevent under-fitting (Kim et al., 2019). In the context of spatial data such as downscaling predictions, a desirable characteristic of predictions is that they are translation-equivariant, that is, the model makes identical predictions if the data are spatially translated. The convCNP model (Gordon et al., 2019) applied here builds this equivariance into the CNP model. Throughout this study, we refer to the model developed in this section as 'the convCNP model'.

Training
The convCNP models are trained by minimising the average negative log-likelihood. For temperature, this is given by

L = −(1/N) Σ_{i=1}^{N} log N(y_i; μ_i, σ_i²),

where y_i is the observed value and N(y_i; μ_i, σ_i²) denotes a Gaussian distribution over y with mean μ_i and variance σ_i². These parameters θ(x_i) = (μ_i, σ_i) are generated by the model at each location x_i using topography e_i, and N is the total number of target locations. For precipitation, the negative log-likelihood is given by

L = −(1/N) Σ_{i=1}^{N} [ (1 − r_i) log(1 − ρ_i) + r_i ( log ρ_i + log Γ(y_i; α_i, β_i) ) ],

where r_i is a Bernoulli random variable describing whether precipitation was observed at the i-th target location, y_i is the observed precipitation, ρ_i parameterises the predicted Bernoulli distribution and Γ(y_i; α_i, β_i) is a Gamma distribution with shape parameter α_i and scale parameter β_i. Here θ(x_i) = (ρ_i, α_i, β_i). Weights are optimised using Adam (Kingma and Ba, 2014), with the learning rate set to 5 × 10⁻⁴. Each model is trained for 100 epochs on 456 batches of 16 days each, using early stopping with a patience of 10 epochs.
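Under the distributional assumptions above, the two training objectives can be sketched as follows. The parameter values are hypothetical model outputs, and a wet-day mask keeps the Gamma term from contributing on dry days:

```python
import numpy as np
from scipy import stats

def gaussian_nll(y, mu, sigma):
    """Average negative log-likelihood for the temperature model."""
    return -np.mean(stats.norm.logpdf(y, loc=mu, scale=sigma))

def bernoulli_gamma_nll(y, r, rho, alpha, beta):
    """Average negative log-likelihood for the precipitation model:
    a Bernoulli occurrence term for every day plus a Gamma term for
    the accumulation on wet days (r_i = 1). beta is treated as a
    scale parameter, as in the text."""
    occ = r * np.log(rho) + (1 - r) * np.log(1 - rho)
    amount = np.zeros_like(y, dtype=float)
    wet = r.astype(bool)
    amount[wet] = stats.gamma.logpdf(y[wet], a=alpha[wet], scale=beta[wet])
    return -np.mean(occ + amount)

# Hypothetical predictions for four days at one station.
y = np.array([0.0, 3.2, 0.0, 10.5])       # observed precipitation (mm)
r = np.array([0, 1, 0, 1])                # wet-day indicator
rho = np.array([0.2, 0.8, 0.4, 0.9])      # predicted wet-day probability
alpha = np.full(4, 2.0)                   # predicted Gamma shape
beta = np.full(4, 3.0)                    # predicted Gamma scale
precip_loss = bernoulli_gamma_nll(y, r, rho, alpha, beta)
print(precip_loss)
```

In training, gradients of these losses with respect to the network weights would be taken by automatic differentiation; this sketch only shows the objective itself.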

Experiments and datasets
Having addressed the first aim in developing the convCNP model, we next evaluate model performance via three experiments.
The first experiment compares the convCNP model to an ensemble of existing downscaling methods following a standardised experimental protocol. In contrast to the convCNP model, these methods are unable to make predictions at locations where training data are not available. In the second experiment, we assess the performance of the convCNP model at these unseen locations compared to a baseline constructed by interpolating single-site models. Finally, ablation experiments are performed to quantify the impact of including sub-grid scale topographic information on performance.

Experiment 1 -baseline comparison
ConvCNP model performance is first compared to strong baseline methods taken from the VALUE experimental protocol.
VALUE (Maraun et al., 2015) provides a standardised suite of experiments to evaluate new downscaling methods, together with data benchmarking the performance of existing methods. In the VALUE 1a experiment, each downscaling method predicts the maximum temperature and daily precipitation at 86 stations across Europe (Figure 2), given gridded data from the ERA-Interim reanalysis (Dee et al., 2011). These stations are chosen as they offer continuous, high-fidelity data over the training and held-out test periods and represent multiple different climate regimes. Data are taken from 1979-2008, with five-fold cross-validation over six-year intervals used to produce a 30-year time-series.
The convCNPs are trained to predict maximum temperature and precipitation at these 86 VALUE stations given the ERA Interim grids over Europe. Station data are taken from the European Climate Assessment Dataset (Klein Tank et al., 2002).
These grids are restricted to points between 35 and 72 degrees latitude and -15 to 40 degrees longitude. The VALUE experimental protocol does not specify which predictors are used in the downscaling model (i.e. which gridded variables are included in Z).
Based on the predictors used by methods in the baseline ensemble, winds, humidity and temperature are included at multiple levels together with time, latitude, longitude and invariant fields. Predictors are summarised in Table 1.
For the sub-grid-scale information input into the final MLP, point measurements of three products are provided at each station. True station elevation is taken from the Global Multi-resolution Terrain Elevation Dataset (USGS, 2010). This is provided to the model together with the difference between the ERA-Interim grid-scale elevation and the true elevation, and the multi-scale topographic position index (mTPI).
Results of the convCNP model are compared to all available PP models in the VALUE ensemble, a total of 16 statistical models for precipitation and 23 for maximum temperature. These models comprise a range of techniques including analogs, multiple linear regression, generalised multiple linear regression and genetic programming. For a complete description of all models included in the comparison, see Appendix A.

Experiment 2 -performance at unseen locations
We next quantify model performance at unseen locations compared to an interpolation baseline. The convCNP models are retrained using station data from the European Climate Assessment Dataset (ECA&D), comprising 3010 stations for precipitation and 3047 stations for maximum temperature (Figure 2). The 86 VALUE stations are held out as the validation set, testing the model performance at both unseen times and locations.
As existing downscaling models are unable to handle unseen locations, it is necessary to construct a new baseline. A natural baseline for this problem is to construct individual models for each station using the training set, use these to make predictions at future times, and then interpolate to obtain predictions at the held-out locations. For the single-station models, predictors are taken from ERA-Interim data at the closest gridbox, similar to Gutiérrez et al. (2013). Multiple linear regression is used for maximum temperature. For precipitation, occurrence is modelled using logistic regression, and accumulation using a generalised linear model with a gamma error distribution, similar to San-Martín et al. (2017). These methods are chosen as they are amongst the best-performing methods of the VALUE ensemble for each variable.
Following techniques used to convert station observations to gridded datasets (Haylock et al., 2008), predictions at these known stations in the future time period are made by first interpolating monthly means (totals) for temperature (precipitation) using a thin-plate spline, then using a GP to interpolate the anomalies (fraction of the total value). All interpolation is three dimensional over longitude, latitude and elevation. Throughout the results section, this model is referred to as the GP-baseline.
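The two-step interpolation can be sketched as follows. The station coordinates and values are hypothetical; `RBFInterpolator` supplies the thin-plate spline, and a small hand-rolled EQ-kernel GP stands in for the anomaly interpolation (the length-scale and noise values are illustrative, not the tuned values used in the study):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def gp_interpolate(X_tr, y_tr, X_te, ls=1.0, noise=1e-4):
    """Minimal GP regression with an EQ kernel for the anomaly field."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ls**2)
    K = k(X_tr, X_tr) + noise * np.eye(len(X_tr))      # jitter for stability
    return k(X_te, X_tr) @ np.linalg.solve(K, y_tr)

# Hypothetical station coordinates: (lon, lat, elevation in km).
X = np.array([[0.0, 0.0, 0.1], [1.0, 0.0, 0.3], [0.0, 1.0, 0.2],
              [1.0, 1.0, 0.5], [0.5, 0.2, 0.15]])
monthly_mean = np.array([10.0, 12.0, 11.0, 9.0, 11.5])   # e.g. mean Tmax (C)
anomaly = np.array([0.5, -0.3, 0.2, 0.1, 0.0])           # daily anomaly (C)

X_new = np.array([[0.5, 0.5, 0.25]])                     # held-out location
# Step 1: thin-plate spline interpolation of the monthly mean field.
mean_interp = RBFInterpolator(X, monthly_mean, kernel='thin_plate_spline')
# Step 2: GP interpolation of the daily anomaly; the prediction is the sum.
pred = mean_interp(X_new) + gp_interpolate(X, anomaly, X_new)
print(pred.shape)  # (1,)
```

Interpolating the slowly varying monthly field and the daily anomaly separately mirrors the gridded-dataset construction of Haylock et al. (2008), with all interpolation three-dimensional over longitude, latitude and elevation.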

Experiment 3 -topography ablation
Finally, the impact of topography on predictions is quantified. Experiment 2 is repeated three times with different combinations of topographic data fed into the final MLP (step 3 in Figure 1): no topographic data; elevation and elevation difference only; and mTPI only.

Evaluation metrics
A selection of standard climate metrics is chosen to assess model performance over the evaluation period, quantifying the representation of mean properties and extreme events (Table 2). Comparison to these metrics requires generating a timeseries of values from the distributions predicted by the convCNP model. For temperature, this is generated by taking the mean of the predicted distribution for mean metrics, and by sampling for extreme metrics. For precipitation, a day is first classified as wet if ρ ≥ 0.5 or dry if ρ < 0.5. For wet days, accumulations are generated by taking the mean of the gamma distribution for mean metrics or by sampling for extreme metrics.
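The rule above for converting predicted Bernoulli-Gamma parameters into a daily precipitation series can be sketched as follows (all parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

def precip_timeseries(rho, alpha, beta, mode="mean"):
    """Turn predicted Bernoulli-Gamma parameters into a daily series.
    A day is wet when rho >= 0.5; wet-day amounts use the Gamma mean
    (for mean metrics) or a Gamma draw (for extreme metrics)."""
    wet = rho >= 0.5
    if mode == "mean":
        amounts = alpha * beta          # mean of Gamma(shape, scale)
    else:
        amounts = rng.gamma(alpha, beta)
    return np.where(wet, amounts, 0.0)

rho = np.array([0.1, 0.7, 0.9, 0.4])    # predicted wet-day probabilities
alpha = np.array([2.0, 2.0, 3.0, 1.0])  # predicted Gamma shapes
beta = np.array([1.5, 1.5, 2.0, 1.0])   # predicted Gamma scales
series = precip_timeseries(rho, alpha, beta)
print(series)  # [0. 3. 6. 0.]
```

Metrics such as R01 or SDII would then be computed on this derived series exactly as for an observed record.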
Results: baseline comparison (experiment 1)

The convCNP model outperforms all VALUE baselines on median mean absolute error (MAE) and Spearman correlation for both maximum temperature and precipitation. Comparisons of convCNP model performance at the 86 VALUE stations to each model in the VALUE baseline ensemble are shown in Figure 4. The low MAE and high Spearman correlation indicate that the model performs well at capturing day-to-day variability.
For maximum temperature, the mean bias is larger than for baseline models at many stations, with an interquartile range of -0.02C to 0.08C. This is a direct consequence of training a single global model, as opposed to fitting individual models to each station, which trivially corrects the mean (Maraun and Widmann, 2018). Though larger than for baseline models, this error is still small for a majority of stations. Similarly for precipitation, though mean biases are larger than for many of the VALUE models, the interquartile range is just -0.07mm to 0.12mm. The biases in convCNP relative wet-day frequency (R01) and mean wet-day precipitation (SDII) are comparable to the best models in the VALUE ensemble (not shown).
When downscaling GCM output for impact studies, it is of particular importance to accurately reproduce extreme events (Katz and Brown, 1992). In line with previous work comparing the VALUE baselines, an extreme event is defined to be a value greater than the 98th percentile of observations. Comparisons of biases in the 98th percentile of maximum temperature and precipitation are shown in Figure 5. The convCNP performs similarly to the best baselines, with a median bias of -0.02C for temperature and -2.04mm for precipitation across the VALUE stations. R10 biases are comparable to baselines, with a median bias of just -0.003mm.

A limitation of the analysis of the standard climate metrics in Table 2 is that these only assess certain aspects of the predicted distribution. To assess the calibration of the models, we next examine probability integral transform (PIT) values. The PIT value for a given prediction is defined as the CDF of the predicted distribution evaluated at the true observed value. These values can be used to determine whether the model is calibrated by evaluating the PIT for every model prediction and plotting the resulting distribution. If the model is properly calibrated, it is both necessary and sufficient for this distribution to be uniform (Gneiting et al., 2007). PIT distributions for maximum temperature and wet-day precipitation are shown in Figure 10. For temperature, the model is well calibrated overall, although the predicted distributions are often too narrow, as demonstrated by the peaks around zero and one indicating that the observed value falls outside the bulk of the predicted normal distribution. Calibration of the precipitation model is poorer overall. The peak in PIT mass around zero indicates that this model often over-predicts rainfall accumulation.
Performance varies between individual stations for both temperature and precipitation, with examples of PIT distributions for both well- and poorly-calibrated stations shown in Figure 10.
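For the Gaussian temperature model, the PIT value is simply the predictive normal CDF evaluated at each observation. The following sketch uses synthetic, perfectly calibrated predictions (all values hypothetical) to illustrate the diagnostic:

```python
import numpy as np
from scipy import stats

def pit_values(y_obs, mu, sigma):
    """PIT value = predictive CDF evaluated at the observation.
    For a calibrated model these are uniform on [0, 1]."""
    return stats.norm.cdf(y_obs, loc=mu, scale=sigma)

rng = np.random.default_rng(0)
mu = rng.normal(15.0, 5.0, size=5000)   # hypothetical predicted means (C)
sigma = np.full(5000, 2.0)              # hypothetical predicted spreads
y = rng.normal(mu, sigma)               # observations drawn from the model,
                                        # i.e. calibration holds by construction
pit = pit_values(y, mu, sigma)
# A histogram of pit should be roughly flat; peaks near 0 and 1 would
# indicate predictive distributions that are too narrow, while a peak
# near 0 alone indicates systematic over-prediction.
print(pit.min() >= 0 and pit.max() <= 1)  # True
```

For precipitation, the same diagnostic applies with the Gamma CDF on wet days.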

Results: topography ablation (experiment 3)
Results of the topography ablation experiment are shown in Figure 11 (mean metrics) and Figure 12 (extreme metrics). These figures compare the performance on each metric between the convCNP model with all topographic predictors, and convCNP models trained with no topography, elevation and elevation difference only and mTPI only.
For maximum temperature, inclusion of topographic information improves MAE, mean bias and spearman correlation.

Models including only mTPI or no topographic predictors have a number of stations with very large MAE, exceeding 10C at several stations. Unsurprisingly, these stations are located in areas of complex topography in the Alps (not shown). Including elevation both decreases the median MAE and corrects errors at these outliers, with further improvement observed when mTPI is added. A similar pattern is seen for mean bias. More modest improvements are seen for precipitation, though inclusion of topographic data does result in slightly improved performance.

Figure 8. As for Figure 6, but for P98 and R10 biases.
For maximum temperature, inclusion of topographic data results in reduced 98th percentile bias. This is primarily as a result of including elevation and elevation difference data, with limited benefit derived from the inclusion of mTPI. In contrast, for precipitation, models with topographic correction perform worse than the elevation agnostic model for both 98th percentile and R10 biases. This reduced performance for precipitation may result from overfitting.

Discussion and conclusion
This study demonstrated the successful application of convCNPs to statistical downscaling of temperature and precipitation.
The convCNP model performs well compared to strong baselines from the VALUE ensemble on both mean and extreme metrics. For both variables the convCNP model outperforms an interpolation based baseline. Inclusion of sub-grid-scale topographic information is shown to improve model performance for mean and extreme metrics for maximum temperature, and mean metrics for precipitation.
The convCNP model has several advantages over existing methods beyond performance on these metrics. The most important of these is that the trained model can be used to predict values at any target point, regardless of whether this point is included in the training set. In Sect. 4, we demonstrated that the convCNP model outperforms interpolation of single-site downscaling models. This has applications in developing local projections for climate impact studies. A second useful feature is that the convCNP takes raw grid data as input, without the need for feature engineering. This contrasts with many of the baseline methods, which rely on the identification of relevant circulation analogs. Although only temperature and precipitation are considered in this study, the model is easily applied to any climate variable with available station observations, for example windspeed.
It is instructive to consider these results more broadly in the context of the debate surrounding the application of deep learning models in downscaling. In this study we deploy deep learning approaches using domain-specific knowledge to shape the architecture, and then deploy this inside simple statistical observation models. This approach is demonstrated to improve on simple baselines on both mean and extreme metrics for maximum temperature and precipitation. This contrasts with previous work suggesting that little benefit is derived from applying neural network models to downscaling maximum temperature (Baño-Medina et al., 2020) and precipitation (Vandal et al., 2019), and motivates continued efforts to develop and benchmark such models for application in climate impact studies. Several areas remain for future work. Representation of certain metrics, notably precipitation extremes, requires further improvement, particularly in areas with complex topography. The topography ablation experiments demonstrate that the convCNP P98 bias increases with the inclusion of topographic information. Dynamically, this is likely due to local flow effects such as Föhn winds (Gaffin, 2007; Basist et al., 1994), which depend on the incident angle of the background flow. A possible explanation for the degraded performance is that the MLP is insufficient to model these effects. Further experimentation with adding a second CNN to capture the sub-grid-scale processes, and possibly conditioning predictions of this model on local flow, is left as a topic for future research.
Another avenue for improving model performance would be to change the distribution predicted by the convCNP. Model calibration results presented in Section 4 indicate that the temperature downscaling model could be improved using a distribution with heavier tails. Precipitation model calibration requires improvement, with the model frequently over-predicting wet-day accumulations. A possible explanation for this is that the left-hand tail of the gamma distribution decays rapidly. For cases where the mode of the predicted distribution is greater than zero, small observed accumulations are therefore heavily penalised. Previous work has acknowledged that the Bernoulli-Gamma distribution used in this study is not realistic for all sites (Vlček and Huth, 2009), and suggested that representation of precipitation extremes can be improved using a Bernoulli-Gamma-Generalised Pareto distribution (Ben Alaya et al., 2015; Volosciuk et al., 2017). Future work will explore improving the calibration of the downscaling models using mixture distributions and normalising flows (Rezende and Mohamed, 2015). A further possibility for extending the convCNP model would be to explicitly incorporate time by building recurrence into the model (Qin et al., 2019; Singh et al., 2019).
The final aspect to consider is extending these promising results from downscaling reanalysis data to future climate simulations from GCMs. An in-depth analysis of convCNP model performance on seasonal and annual metrics would be beneficial in informing application to impact scenarios. A limitation of all PP downscaling techniques is that applying a transfer function trained on reanalysis data to a GCM assumes that the predictors included in the context set are realistically simulated by the GCM (Maraun and Widmann, 2018). Future work will aim to address this issue by training a convCNP model directly on RCM or GCM hindcasts available through projects such as EURO-CORDEX (Jacob et al., 2014).

Code availability. Model code is available at https://github.com/anna-184702/convCNPClimate.

Table A1. Summary of models included in the VALUE ensemble.

MLR-RAN (Tmax, precip): Multiple linear regression, seasonal training, raw data, no variance correction.
MLR-WT (Tmax): As for MLR, but conditioned on weather types defined using k-means clustering (Gutiérrez et al., 2013).
WT-WG (Tmax, precip): Distributional fitting based on weather types selected using k-means clustering (Gutiérrez et al., 2013).