CLIMFILL v 0 . 9 : a framework for intelligently gap filling Earth observations

Remotely sensed Earth observations have many missing values. The abundance and often complex patterns of these missing values can be a barrier for combining different observational datasets and may cause biased estimates of derived statistics. To overcome this, missing values in geoscientific data are regularly infilled with estimates through univariate gap-filling techniques such as spatial or temporal interpolation or by upscaling approaches in which complete donor variables are used to infer missing values. However, these approaches typically do not account for information that may be present in other observed variables that also have missing values. Here we propose CLIMFILL (CLIMate data gapFILL), a multivariate gap-filling procedure that combines kriging interpolation with a statistical gap-filling method designed to account for the dependence across multiple gappy variables. In a first stage, an initial gap fill is constructed for each variable separately using state-of-the-art spatial interpolation. Subsequently, the initial gap fill for each variable is updated to recover the dependence across variables using an iterative procedure. Estimates for missing values are thus informed by knowledge of neighbouring observations, temporal processes, and dependent observations of other relevant variables. CLIMFILL is tested using gap-free ERA-5 reanalysis data of ground temperature, surface-layer soil moisture, precipitation, and terrestrial water storage to represent central interactions between soil moisture and climate. These variables were matched with corresponding remote sensing observations and masked where the observations have missing values. In this “perfect dataset approach” CLIMFILL can be evaluated against the original, usually not observed part of the data. We show that CLIMFILL successfully recovers the dependence structure among the variables across all land cover types and altitudes, thereby enabling subsequent mechanistic interpretations in the gap-filled dataset. Correlation between original ERA-5 data and gap-filled ERA-5 data is high in many regions, although it shows artefacts of the interpolation procedure in large gaps in high-latitude regions during winter. Bias and noise in gappy satellite-observable data is reduced in most regions. A case study of the European 2003 heatwave shows how CLIMFILL reduces biases in ground temperature and surface-layer soil moisture induced by the missing values. Furthermore, in idealized experiments we see the impact of fraction of missing values and the complexity of missing value patterns to the performance of CLIMFILL, showing that CLIMFILL for most variables operates at the upper limit of what is possible given the high fraction of missing values and the complexity of missingness patterns. Thus, the framework can be a tool for gap filling a large range of remote sensing observations commonly used in climate and environmental research.

Abstract. Remotely sensed Earth observations have many missing values. The abundance and often complex patterns of these missing values can be a barrier for combining different observational datasets and may cause biased estimates of derived statistics. To overcome this, missing values in geoscientific data are regularly infilled with estimates through univariate gap-filling techniques such as spatial or temporal interpolation or by upscaling approaches in which complete donor variables are used to infer missing values. However, these approaches typically do not account for information that may be present in other observed variables that also have missing values. Here we propose CLIMFILL (CLIMate data gap-FILL), a multivariate gap-filling procedure that combines kriging interpolation with a statistical gap-filling method designed to account for the dependence across multiple gappy variables. In a first stage, an initial gap fill is constructed for each variable separately using state-of-the-art spatial interpolation. Subsequently, the initial gap fill for each variable is updated to recover the dependence across variables using an iterative procedure. Estimates for missing values are thus informed by knowledge of neighbouring observations, temporal processes, and dependent observations of other relevant variables. CLIMFILL is tested using gap-free ERA-5 reanalysis data of ground temperature, surface-layer soil moisture, precipitation, and terrestrial water storage to represent central interactions between soil moisture and climate. These variables were matched with corresponding remote sensing observations and masked where the observations have missing values. In this "perfect dataset approach" CLIMFILL can be evaluated against the original, usually not observed part of the data. We show that CLIMFILL successfully recovers the dependence structure among the variables across all land cover types and altitudes, thereby enabling subsequent mech-anistic interpretations in the gap-filled dataset. Correlation between original ERA-5 data and gap-filled ERA-5 data is high in many regions, although it shows artefacts of the interpolation procedure in large gaps in high-latitude regions during winter. Bias and noise in gappy satellite-observable data is reduced in most regions. A case study of the European 2003 heatwave shows how CLIMFILL reduces biases in ground temperature and surface-layer soil moisture induced by the missing values. Furthermore, in idealized experiments we see the impact of fraction of missing values and the complexity of missing value patterns to the performance of CLIMFILL, showing that CLIMFILL for most variables operates at the upper limit of what is possible given the high fraction of missing values and the complexity of missingness patterns. Thus, the framework can be a tool for gap filling a large range of remote sensing observations commonly used in climate and environmental research.

Missing observations in Earth system science
Observing the Earth surface from space is an endeavour that has significantly contributed to advance our understanding of the Earth system and has played a vital role in the fields of data assimilation (Bauer et al., 2015), Earth surface modelling , global freshwater hydrology (Lettenmaier et al., 2015), global carbon cycle processes (Humphrey et al., 2018), and the study of climate extremes in the land-atmosphere system Teuling et al., 2010). A plethora of instruments observes variables relevant for determining the state of the Earth remotely at any given time. However, this observa-tional record is highly fragmented: remote sensing observations have a extensive spatial coverage but differ in their spatial and temporal resolution and their frequency and temporal extent or suffer from inhomogeneities and measurement limitations (Lettenmaier et al., 2015;Shen et al., 2015;Seneviratne et al., 2010;de Jeu et al., 2008).
Moreover, the observational record suffers from complex, large-scale, and unavoidable missing values that differ among variables. These missing values can hinder further analysis and can obscure physical dependencies between observations. Therefore, gap filling is common in the Earth system sciences. It is used to fill gaps originating from sensor failure or sensor limitations (Pastorello et al., 2020;Liu et al., 2018;Shen and Zhang, 2009), to extrapolate into under-sampled regions (Ghiggi et al., 2019;Gudmundsson and Seneviratne, 2015;Cowtan and Way, 2014;Jung et al., 2011Jung et al., , 2009 or to get estimates for regions obscured to the sensor by clouds, dense vegetation, flight geometry or other influences (Huffmann et al., 2019;Zeng et al., 2015;Brooks et al., 2012;Shen and Zhang, 2009).
In the geoscientific literature, among the most commonly used approaches for estimating unobserved points are spatial and temporal interpolation methods, including nearest neighbour regression, kriging, and derivatives thereof (Liu et al., 2018;Cowtan and Way, 2014;Haylock et al., 2008;Cressie et al., 2006; for an overview, see Cressie and Wikle, 2015;Chiles and Delfiner, 2012). Spectral methods are also used von Buttlar et al., 2014;Brooks et al., 2012). Shen et al. (2015) gives an overview over univariate spatial, temporal, spatiotemporal, and spectral methods often used for gap filling remote sensing observations. In recent years, machine-learning-based approaches have become more common to fill gaps in univariate, gappy satellite data or to upscale sparse station networks (Kadow et al., 2020;Gerber et al., 2018;Zeng et al., 2015;Shen and Zhang, 2009). These methods are by default univariate but can be extended into multivariate settings (Bhattacharjee and Chen, 2020;von Buttlar et al., 2014).
In the multivariate context, several data products exist that gap fill one or more observations to a spatially or temporally complete dataset using auxiliary variables (Huffmann et al., 2019;Brocca et al., 2014) or estimate variables that are only observed through sparse station networks via statistical upscaling (O. and Orth, 2021;Zhang et al., 2021;Ghiggi et al., 2019;Jung et al., 2019;Martens et al., 2017;Gudmundsson and Seneviratne, 2015;Jung et al., 2011Jung et al., , 2009. Those approaches rely on a gap-free "donor" dataset to infer values of incomplete variables, meaning that only one of the variables in the multivariate setting is allowed to have missing values. In multivariate cases where more than one variable has missing values, ad hoc gap fills are usually applied in the preprocessing (Pastorello et al., 2020;Jung et al., 2019;Martens et al., 2017;Tramontana et al., 2016). To our knowledge only a few notable exceptions (e.g. Mariethoz et al., 2012) to the common practice of focusing on single gappy variables exist in the geoscientific literature.
In summary, geoscientific approaches often centre around exploiting the spatial, temporal, or spectral neighbourhood of gaps to infer missing values. Furthermore, available methods mostly focus on estimating missing values in one single variable and typically cannot be applied in multivariate settings where missing values are observed in all considered datasets and a coherent, gap-free multivariate dataset is the aim. This implies that gap-filling estimates of different variables may not be physically consistent and that available information may not be used efficiently if there are observations from more than one variable with missing values.
Nevertheless, combining observations from several, possibly gappy variables into a coherent "view" of the state of the Earth system is crucial for many applications. These include, but are not limited to, the analysis of local and regional land surface dynamics (Humphrey et al., 2018;Vogel et al., 2017), tracing of compound extreme events (Ridder et al., 2020;Wehrli et al., 2019) or observational water and energy budget closures (Alemohammad et al., 2017;Martens et al., 2017). The necessity of creating a global, physically coherent observational dataset of the Earth's state is also highlighted through international initiatives such as the Digital Twin Earth Initiative from ESA (Bauer et al., 2021b).
Atmospheric reanalyses can be viewed as another class of gap-free reconstruction of the state of the Earth system. They typically assimilate a wide range of observations into global weather models and are often the default dataset for a range of applications (Hersbach et al., 2020;Gelaro et al., 2017;Dee et al., 2011). However, since reanalysis products are model driven by construction, they are subject to model biases (Bocquet et al., 2019), and issues with model independence can arise if reanalysis products are used for model validation. Moreover, the observational record of the Earths' surface is generally underutilized in state-of-the-art reanalysis products, and the large fraction of missing values is one of the major constraints . For example, in the state-of-the-art atmospheric reanalysis product ERA-5, the fragmented observational record of soil moisture is used only sparsely (Hersbach et al., 2020), although the added value of assimilating remote sensing soil moisture has been shown for weather forecast models (Zhan et al., 2016) and flood forecasting (Brocca et al., 2014;Sahoo et al., 2013).
Given the current status of research in this field, Balsamo et al. (2018) note the need for more multivariate Earth observation datasets apart from reanalysis. At the same time, Bauer et al. (2021b) mention an ongoing trend to reshape classical reanalysis such that physical modelling and fragmented observation can be harmonized into a combined product by the use of machine learning techniques wherever processes are unknown or difficult to parameterize. In the following, we present an approach to consolidate fragmented Earth observations into a coherent, multivariate, gapfree dataset by tackling the problem of missing values in mul-V. Bessenbacher et al.: CLIMFILL v0.9: a framework for intelligently gap filling Earth observations 4571 tivariate remotely sensed Earth observations. Distinguishing the approach from reanalysis, we do not aim to assimilate observations with a pre-defined physical model but to leverage the power of modern statistical techniques to produce dependable and physically consistent estimates of essential Earth system observations. The newly developed methodology is tested for variables relevant for the study of landatmosphere dynamics.

Statistical concepts for treating missing values
The methodological literature offers a overarching framework for the problem of missing values (Rubin, 1976). Typically, the simplest form of gap management is referred to as list-wise deletion, where data points are only considered if all variables are observed. However, this approach can lead to large data loss. Furthermore, statistics derived from incomplete data can be biased if the data are missing not at random (Rubin, 1976). Consequently, the pattern in which the data are missing (i.e. the "missingness") is one of the most important factors when estimating the impact of missing values (Little and Rubin, 2014). In particular, Rubin (1976) categorizes three ways in which data can be missing: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). In the following these categories of missingness are described in the context of Earth observations.
-If the probability that a data point is missing is not dependent on any process, the missingness is described as missing completely at random (MCAR, Fig. 1a). In the context of Earth observations this might be caused by random sensor failure, but it is rarely the dominant pattern of missingness.
-Satellite data are often missing because of satellite swaths. For example, orbiting satellites that are measuring soil moisture with a microwave sensor do not pass certain regions at certain times ( Fig. 1b). Here, the fact that we cannot measure the soil moisture at a certain space-time point is not dependent on the actual soil moisture at this point. In other words, the soil moisture is not significantly lower or higher in the locations where the satellite does not pass through. Therefore, the probability of a data point missing is not dependent on the value of the missing data point. Such patterns are referred to as missing at random (MAR).
-The most complex missingness pattern is missing not at random (MNAR). Here, the mechanism that obscures data points depends on the data that are missing. This mechanism can be a function of the observed variables, for example when values above or below a certain threshold are not observable (Fig. 1c). Moreover, missingness might be controlled by a different but related variable. In the case of a satellite measuring soil mois-ture via microwave retrievals, the measurement over dense vegetation represents the water content of the canopy rather than the one of the soil. Hence, the data at such points are masked during post-processing, leading to large patches of missing values especially in tropical forests. Here, we cannot safely assume that soil moisture below dense vegetation is not significantly different from observed soil moisture. Therefore, we cannot assume statistical independence between the fact that a point is missing and the unobserved value of the missing point.
Geoscientific data are in a large part missing not at random (MNAR), making statistical measures of the data biased (van Buuren, 2018;Rubin, 1976) and gap filling challenging (see for example Cowtan and Way, 2014). Ghahramani and Jordan (1994) show that statistically motivated gap filling of missing data is possible for MCAR and MAR cases in both a Bayesian and a maximum likelihood setting, but they note that MNAR data cannot be tackled with the same methods. However, gap filling can still be successful if a high degree of dependence between MNAR variables increases their mutual information. We argue that this is especially the case for geoscientific observations, since the variables are often directly linked through a number of processes.
In statistical literature, a wide range of algorithms exist that make use of cross-variable dependence to estimate missing values. These centre around low-rank matrix recovery, eigenvalue analysis, or regression for estimating missing values (Davenport and Romberg, 2016;Mazumder et al., 2010). Here, in contrast to common geoscientific approaches, missing values in all variables are allowed. In the following, we highlight two common approaches. On the one hand, Gaussian processes are a natural choice for gap-filling problems (Gelfand and Schliep, 2016) and are mathematically identical to kriging if the predictors are latitude and longitude. Gaussian processes, however, have limitations when moving to large data (Heaton et al., 2019), as is the case in Earth observation data. In recent years, some applications of Gaussian processes have been shown to work in settings with too much data to estimate the co-variance matrix between all data points precisely. They estimate the co-variance matrix via sophisticated sampling techniques (Wang and Chaib-draa, 2017;Das et al., 2018), pre-process the data via dimensionreduction methods (Banerjee et al., 2008), or apply the Gaussian process to local subsets of the data (Gramacy and Apley, 2015;Datta et al., 2016). On the other hand, iterative procedures like the MICE algorithm ("multiple imputation by chained equation", van Buuren, 2018) are well suited for multivariate imputation and scaling to large data but cannot account for neighbourhood relations. Regression-based multivariate gap-filling algorithms like MICE have, to the best of our knowledge, not yet been applied in the geoscientific context.

4572
V. Bessenbacher et al.: CLIMFILL v0.9: a framework for intelligently gap filling Earth observations Figure 1. Examples of the three patterns in which values can be missing: (a) missing completely at random (MCAR), (b) missing at random (MAR), and (c) missing not at random (MNAR). The MCAR missingness is created by setting randomly drawn grid points to be missing. For MAR missingness, a patch of the data was removed to mimic satellite swaths. In MNAR missingness, all values below a certain threshold are missing.
In Sect. 2, we propose the multivariate gap-filling framework CLIMFILL (CLIMate data gap-FILL) that aims to overcome the mentioned issues and combines the two approaches highlighted above and thus also takes advantage of univariate interpolation techniques (Cressie et al., 2006) and approaches for improving cross-variable coherence (Stekhoven and Bühlmann, 2012). In Sect. 3, we describe the data that have been used to evaluate the skill of the framework, and in Sect. 4 we show the results of evaluating and benchmarking the framework. Finally, Sect. 5 discusses the results and provides a conclusion and an outlook for possible future work.
2 CLIMFILL v0.9: a framework for infilling missing values in multivariate spatiotemporal geoscientific data We aim to develop a multivariate gap-filling framework that exploits the spatial, temporal, and cross-variable dependence structure of Earth system observations to produce estimates for missing values even if they are present in all variables. To achieve this goal we build upon geo-statistical interpolation and a multivariate gap-filling approach that has been popularized in other fields, namely the MissForest algorithm (van Buuren, 2018; Stekhoven and Bühlmann, 2012). In particular, we aim to utilize (1) spatial neighbourhood information, (2) temporal correlation, and (3) and statistical dependence across all considered variables. With these design requirements we aim to recover both the marginal distributions and the dependence among variables at any location with missing values. The CLIMFILL framework works mutually for all considered variables, meaning that information available in each of the variables is used for filling the gaps of all the other variables. With this design we implicitly assume that if one variable is not observed at a certain space-time point, a subset of the other variables might be observed and can reconstruct the missing value while conserving the dependence structure among all variables. The framework is divided into four steps (Fig. 2). In the first step, initial estimates for all missing values are produced by spatial interpolation of each variable independently, i.e. in a univariate setting. In the second step, the data are preprocessed to account for spatial and temporal dependence, which contributes to approximate physical links among different variables. In the third step, the data are divided into environmentally similar clusters. In the fourth and final step, the multivariate dependencies are taken into account. The initial estimates from the interpolation step are updated using an iterative procedure that aims to reconstruct the dependence structure between the variables with the aim of increasing the accuracy of the initial estimates.

Step 1: interpolation for integrating spatial context
The interpolation step creates initial estimates based on the spatial or spatiotemporal context of the gap using interpolation. Following the approach of Haylock et al. (2008), the data are first divided into monthly climatology maps and anomalies. The climatology maps are gap filled using thin plate spline interpolation to represent the spatial trends in the data. Subsequently, the daily anomalies from the monthly climatology are gap filled using kriging. In contrast to the E-OBS dataset created in Haylock et al. (2008) from in situ observations, satellite data has a much larger number of observed values, making a direct implementation of this approach computationally infeasible. For the interpolation of the monthly climatology maps we therefore restrict the thin plate spline interpolation to the 50 closest neighbours of each point. The interpolation of the daily anomalies follows Das et al. (2018), who suggest reducing complexity of kriging and Gaussian process regression by repeated interpolations 4573 Figure 2. Overview of the structure of the gap-filling framework. The framework is divided into four steps. In the first step (Sect. 2.1), any missing value is gap filled by an initial estimate from the spatiotemporal context. This step is called interpolation step. Here the spatiotemporal mean of observed values surrounding the missing value is used for each variable individually. In the second step (Sect. 2.2), embedded features are created to inform about time-dependent processes. In the third step, the data are divided into environmentally similar clusters (Sect. 2.3, Algorithm 1). In the fourth step (Sect. 2.4, Algorithm 1), the initial estimates from step 1 are updated while accounting for the dependence structure among all considered variables. This is achieved by first grouping available data points into environmentally similar clusters and then iteratively updating the initial estimates using a supervised learning algorithm.

4574
V. Bessenbacher et al.: CLIMFILL v0.9: a framework for intelligently gap filling Earth observations on random sub-samples of all available data points and averaging the resulting estimates. More specifically, the missing values in the anomalies are estimated by randomly selecting 1000 observed points per month over which the interpolation is calculated. This is repeated five times, and the mean of all interpolations for each missing point is taken as the gap fill estimate. As a consequence of these adaptations, the interpolation step becomes computationally feasible, but the uncertainty of the interpolation cannot be estimated. Finally, monthly maps and anomalies are summed up to form the initial gap fill estimate from step 1.

2.2
Step 2: feature engineering informed by process knowledge An important step in data-driven modelling is taking care that the data consist of informative variables that represent the mechanisms at work. This creation of variables or "features" guided by expert knowledge is called feature engineering. Earth observations often inform about time-dependent processes like seasonal effects, weather persistence, or soil moisture memory effects that act from daily to monthly or subseasonal timescales (Nicolai-Shaw et al., 2016). To account for such antecedent and subsequent effects, backwardand forward-looking running means of different window sizes and temporal lags are included. This is motivated by prior work on large-scale runoff estimation (Gudmundsson and Seneviratne, 2015). Given a variable v i,j,t at longitude i, latitude j , and time step t we define the window size s and time lag l over which a running mean of a variable v is computed: resulting in an embedded feature v * produced from variable v. We create embedded features of 7 d (s = 7, l = 0), 1month (s = 23, l = 7), and 6-month (s = 150, l = 30) backward and forward running means in such way that the windows are not overlapping (see Fig. 3). This way six additional features are created for each variable. Furthermore, gap-free time-independent maps describing properties of the land surface such as topography or land cover can be included. Maps of altitude, topographic complexity, land cover class, and land cover height from ERA-5, as well as latitude, longitude, and time, are added to the list of features and copied for each time step. The above procedure thus results in a set of 35 features: the four variables and the six embedded features of each of the four variables (totalling 24 embedded features); the four maps; and latitude, longitude, and time information. All data are standardized to have zero mean and a standard deviation of 1. We perform feature selection experiments (see Sect. 2.2) to find the most descriptive subset of these 34 features, which we then use for computing the results.

Step 3: grouping the data into environmentally similar clusters
Depending on the climate regime and the season, different processes might govern the local dependence among variables. Furthermore, geoscientific datasets are very large, and the computational costs of supervised learning methods do often not scale linearly with the number of samples. We therefore split the data into K environmentally similar clusters X (1) , . . ., X (K) (Algorithm 1, line 3) in which the multivariate gap filling happens (Algorithm 1, first loop, lines 4 to 16). This grouping is done in such way that grid points can be in different clusters at different time steps. For example, a grid point in the Mediterranean area can be in a different cluster in winter than in summer, accounting for seasonally varying climate phenomena such as changing soil moisture regimes . Here a k-means algorithm is used, and the data are partitioned into 30 clusters. This value is chosen such that the number of data points per cluster is large enough to ensure that the regression model can be calibrated efficiently but not so small that no individual clusters consist of missing values entirely.

2.4
Step 4: optimizing the initial estimates by accounting for the dependence between variables In the fourth step, the initial estimates from step 1 are updated by accounting for the dependence between variables. Within each of the clusters X k , the algorithm repeatedly iterates over the variables until convergence is reached. This procedure builds upon the MissForest algorithm by Stekhoven and Bühlmann (2012). For each variable v, a random forest model (Breiman, 2001) is fitted to the cluster to predict originally missing values in all variables based on the remaining features. Random forests have favourable properties for gap-filling applications: they can handle mixed types of data, they are scalable to large amounts of data, and they are nonparametric, i.e. adaptive to linear and non-linear relationships (Tang and Ishwaran, 2017). This core mechanism of CLIMFILL is detailed in the inner, third loop of Algorithm 1 (lines 6 to 14): the current variable is selected from the cluster as predictand y k v . All other columns of X k form the predictor table X k −v , where −v denotes the set of all variables and features except v. Subsequently, both y k v and X k −v are divided into two sets of data points. (1) All data points where y k v was originally observed are used to fit the supervised learning method y k (2) all data points where y k v was missing y k v,m are predicted from the fitted function to overwrite the former estimates, i.e.ŷ k v,m = f (X k −v,m ). Note that the training data most likely include originally missing values in the predictor variables. Here, the estimates from the interpolation step play the role of giving an initial estimate in the first iteration. Once the algorithm has iterated over all the variables, each missing value has been updated once. The algorithm is stopped (stopping criterion, Algorithm 1, second loop, lines 5 to 15) once the change in the estimates for the missing values is small between iterations (convergence) or a maximum number of iterations is reached (early stopping).
Note that the framework is set up such that each cluster learns different model parameters. With these choices the model is flexible to tailor its hyper-parameters individually to each variable and the regression parameters individually to each cluster. The hyper-parameters of the interpolation and the regression step are largely determined by computational limits of the available resources (for an overview, see Table A2). Where possible, we calibrated the remaining hyper-parameters by cutting out spatiotemporal cubes of observed data in year 2013 and compare values gap filled with CLIMFILL with the originally observed ones.

Data
To illustrate the impact of fragmented observational records, we focus here on the study of land-climate dynamics. At the land-atmosphere boundary, a complex interplay between soil moisture, temperature, and precipitation governs much of the water and energy balance at the surface . Thus, a combination of atmospheric and terrestrial processes influences local climate (Greve et al., 2014;Seneviratne et al., 2010), the development of hot and dry extreme events (Wehrli et al., 2019;Miralles et al., 2019;Mueller and Seneviratne, 2012), changes in freshwater availability , and the interaction of all these factors with climate change . These interactions are inherently multivariate and act on different timescales, making it necessary to observe the variables at a fine spatial and temporal resolution. Consequently, the study of land-climate dynamics requires observations spanning several components of the Earth system, including the land water and energy balances and the atmospheric state.
Since the original values that need to be gap filled are unobserved, we fall back on naturally gap-free atmospheric reanalysis data for benchmarking the framework. We use landonly global reanalysis data from ERA-5 at 0.25 • resolution for the years 2003-2020 (see Hersbach et al., 2020). ERA-5 is chosen as a gap-free dataset because of its advanced representation of land surface processes (Hersbach et al., 2020) and improved agreement of relevant surface variables with available observations (Martens et al., 2020;Tarek et al., Algorithm 1 Pseudo-code algorithm of the CLIMFILL clustering and learning step (steps 3 and 4), where K is the number of clusters, n v is the number of variables, and n f the number of features. X −v refers to the data table with all variables (columns) except v. Algorithm and pseudo-code are adapted from Stekhoven and Bühlmann (2012).
1: X is a matrix containing all variables and features as n v + n f columns and all data points as rows. 2: Create a mask of missing values M in the same shape as X, where M is true where X is missing and false where X is observed. Note that missing values are only present in variables, not in features. 3: Split X into K clusters X k using k-means algorithm. 4: for cluster k = 1, 2, . . ., K do 5: while stopping criterion not reached do 6: for variable v = 1, 2, . . ., n v do 7: Define current variable as predictand y k v and all other columns of X k as predictors X k −v . 8: Define y k v,o as all data points in y k v where M is false, and y k v,m as all data points where M is true. Create an updated estimate with the fitted random forestŷ k v,m = f (X k −v,m ).

12:
Replace y k v,m with the new updatedŷ k v,m in X k . 13: Update stopping criterion. 14: end for 15: end while 16: end for 17: Combine all X k back to X and save result. 2020; Albergel et al., 2018). The missingness patterns of satellite observations in the same period are extracted, regridded to ERA-5 resolution, and applied to the corresponding ERA-5 variable. In other words, only the part of the ERA-5 data that would have been observable by satellite are retained. In this "perfect dataset approach", the "true" values of the variables at the locations of the missing values are known and can be compared with the estimates of the gap-filling framework (see Fig. 4).
The hourly ERA-5 data are aggregated to daily resolution. The aggregation function for each variable is chosen to be consistent with the satellite products (e.g. daily sums for pre- Figure 4. Comparison of (a) the original naturally gap-free ERA-5 reanalysis, (b) the same data but where only satellite-observable values are shown, and (c) the gap-filled data created from CLIMFILL after starting with the gappy data in (b). The maps show an example snapshot of ERA-5 surface-layer soil moisture anomaly on 1 August 2020. CLIMFILL successfully reconstructs major anomalies in surface-layer soil moisture for this day. The anomalies are calculated by subtracting the monthly mean values. cipitation and daily average for soil moisture; see Table A1 in Appendix A). Since GRACE is only available in monthly resolution, we up-sample the data by linearly interpolating the monthly values to daily resolution. Permanently glaciated areas and deserts (defined as areas with less 50 mm average yearly precipitation in the years 2003-2012) are masked. We extract the missingness pattern from four satellite remote sensing datasets related to land-climate interactions and apply it to the ERA-5 dataset: ESA-CCI surface-layer soil moisture (Gruber et al., 2019;Dorigo et al., 2017;Gruber et al., 2017), MODIS ground temperature (Wan et al., 2015), GPM precipitation (Huffmann et al., 2019), and GRACE terrestrial water storage Landerer and Swenson, 2012;Swenson and Wahr, 2006). These variables represent central interactions between soil moisture and climate that drive land water and energy balance through the soil moisture-temperature and soil moisture-precipitation feedbacks . Selecting both microwave remote sensing measurements of surface-layer soil moisture and total water storage of the land surface is a compromise that aims to include as much possible information about root zone soil moisture as is available via remote sensing.
There are ubiquitous missing values in the selected satellite observations (Fig. 5). Since the missingness patterns only partially overlap, the selected set of variables is a good can-didate for mutual gap filling. Ground temperature is missing where there is cloud cover, with the maximum amount of missing values in the inner tropics and extratropical storm tracks and moving along latitudinal bands throughout the year. Almost half of the ground temperature values (46 %) are missing globally in the considered years. Surface-layer soil moisture is only observed in 39 % of all cases. It is missing where there is ice or snow cover or when vegetation is too dense. This is the most complicated missingness case because it exhibits the highest fraction of missing values and has considerable amount of land mass where high vegetation cover prevents retrieval at all times. For precipitation, around a quarter of the values are missing (27 %); this is only true at high latitudes during winter. In the GPM remote sensing precipitation dataset, values in the presence of surface snow or ice are masked because of poor sensor quality (Huffmann et al., 2019). In postprocessing, Huffmann et al. (2019) use a sophisticated Kalman smoother time interpolation to fill the gaps from the retrieval. From available metadata, we retrieved the originally missing maps to be able to quantify the added value of mutual gap filling for precipitation. Terrestrial water storage is missing if the global measurement is discarded due to instrument failure or during calibration missions (Landerer, 2021), leading to individual months being missing and 15 % of values being missing total.

Testing and benchmarking the CLIMFILL algorithm
The results section is structured as follows. Sect. 4.1 compares CLIMFILL with three different feature sets and the interpolation step (experiments A through D in Table 1). At the end of this section, we settle on a feature set of CLIMFILL for the rest of the results. In Sect. 4.2, the theoretical upper performance limits given the data described in Sect. 3 are examined using simpler, artificial missingness patterns and changing fractions of missing values (experiments D, E, and F in Table 1). Finally, in Sect. 4.3 CLIMFILL is run for the whole period (2003-2020) (experiment D) and its gap-filling results are compared to the original, gappy part of the data that is observable by satellite.

Benchmarking against univariate interpolation
The objective of the CLIMFILL framework is not only to reconstruct variables separately but also to recover multivariate dependencies. In this first part of the results, we illustrate the improvement of the multivariate gap-filling framework CLIMFILL compared to the univariate interpolation that takes place in the first step of the framework (experiment A). With this analysis, we can quantify the added value of incorporating information from the other variables present, which happens in step 4 of the framework. Furthermore, within this section we examine which subset of features created in step 2 is most descriptive for the problem at hand. This is done to ensure an informed decision about the set of features is made that reflects their usefulness in the gap-filling process. We try three different sets of features: (1) only the four variables (experiment B), (2) the variables plus their embedded features as described in Sect. 2.2 (experiment C), and (3) all of the features created in step 2, including the constant features describing land properties (experiment D). The benchmarking process therefore quantifies the merit of CLIMFILL compared to univariate interpolation and explores the possible feature space for combinations that improve the results.
To allow for a quantitative assessment of the similarity of the multivariate distributions of observed and simulated variables, we apply a scalar measure of multivariate similarity. In this study, we use the Jenson-Shannon distance (JS distance) (Lin, 1991). This measure compares the distance between two multivariate distributions, where a value of zero means that both distributions are identical and a value of one indicates that the distributions are not overlapping. We apply the JS distance on four-dimensional histograms computed of the four variables using 50 bins for each variable. Figure 6 shows the JS distance between the original ERA-5 data and the interpolation as well as the different feature sets. Overall, the JS-distance is lower for CLIMFILL than for interpolation globally (Fig. 6a) for experiment D. Including all variables shows the best results overall. For the rest of the paper we will therefore refer to experiment D when referring to CLIMFILL. Regionally, the largest improvement between CLIMFILL and the interpolation is in the tropical and subtropical regions (Fig. 6b, c), where the high fraction of missing values inhibits the performance of interpolation. Taking a closer look at the results by dividing the global map into types of vegetation and altitudes (Fig. 6d, e) shows that the JS distance improves from interpolation to CLIMFILL for all altitudes and all land cover types. This indicates an improvement regarding the multivariate features in CLIMFILL gap filling globally for a wide range of environmental conditions. Overall, CLIMFILL has a higher skill when reconstructing the multivariate dependence structure of the original ERA-5 data compared to univariate interpolation.
To illustrate the complex impacts of missing values and their alleviation in univariate and multivariate gap filling, Fig. 7 exemplary shows the bivariate distribution of surfacelayer soil moisture and ground temperature globally for the whole time period (all other possible combinations of variables are shown in Fig. A1). The part of the data that is observable from space (Fig. 7b) shows a collapsed distribution and clearly fails to recover the original bivariate distribution. Results after univariate interpolation recover parts of the distributions. CLIMFILL further improves this and recovers the shape of the original distribution. Thus, it generally provides an improved estimate of the bivariate distribution of surfacelayer soil moisture and ground temperature such that it is closest to the original ERA-5 data despite knowing only the satellite-observable points.

Data-constrained upper performance limits
Missing values in Earth observation data are often present in large proportions and in complex MNAR patterns. These characteristic properties of Earth observation data can hamper gap filling. We therefore are interested in exploring the envelope of data properties in which gap filling can be successful and seeing the deterioration of performance with increasing data sparsity and increasingly complex missing value patterns. In contrast to the last section, the goal is to show the upper limit of what is possible in gap filling with the complex missingness patterns exhibited by satellite observations. To this end, we rely on the four considered variables to test the impact of increasing fractions of missing data using idealized patterns. In particular, we delete (1) data according to a MCAR random missingness pattern (experiment E, Fig. 8a) and (2) by imitating satellite swaths, effectively creating MAR missingness patterns (experiment F, Fig. 8b). Both patterns are applied for fractions of missing values between 5 % and 95 % for each of the variables.
Multivariate JS distance (Fig. 9) and univariate statistical performance measures (Fig. 10) are used to compare original and gap-filled values for all performed experiments. With increasing fraction of missing values, the two artificial missingness cases increase in error, increase in their JS distance, and decrease in correlation. Once more than 80 % of the values  are missing, the gap filling breaks down because not enough observed values are available for the iterative procedure to converge to a meaningful result. Random and artificial swath missingness show similar deterioration with increasing fraction of missing values, but values missing completely at random tend to be easier to estimate at all fractions of missing values. Gap filling random missingness is the easiest case, since it is likely that neighbouring or environmentally similar points are observed. MAR missingness exposes large patches of missing values, making spatiotemporal interpolation less effective and hence decreasing the gap-filling performance as compared to MCAR. Since the real (MNAR) missingness case is the most complex missingness pattern, these addi-  In the swath-only missingness data we create long ellipses centred around the Equator to simulate characteristic satellite swath missingness patterns. Note that the two missingness patterns are not exactly the same for each day and variable to allow for cross-variable learning. tional experiments serve as an upper limit on performance in the real case. When moving from the artificial patterns of missingness to the real case (dots and circles in Figs. 9 and 10), the deterioration in performance is different for each of the vari-ables. However, in most cases the metrics for the real missingness case are close to the artificial missingness patterns, suggesting CLIMFILL operates at the upper limit of what is possible with the complex missingness pattern of real observations. For ground temperature, a spatially and temporally smooth variable, the interpolation is already quite a good first guess, and its correlation is only slightly improved in CLIMFILL. However, the RMSE drops quite dramatically, indicating smaller biases in ground temperature for gap-filled data. In this case study, we found the biggest improvement compared to interpolation for surface-layer soil moisture despite its large fraction of missing values. This high performance could be due to the fact that surface-layer soil moisture exposes missingness in areas where other variables are observed, for example in the tropical forests, such that learning in this area is easier. Additionally, variable selection is centred around soil moisture, and soil moisture is a key variable of land hydrological processes. The most difficult case is precipitation. Precipitation estimates are only slightly improved with CLIMFILL compared to initial interpolation. Precipitation is influenced by several processes that are not captured within the four selected variables. For example, frontal rain patterns are mostly not explained by land surface properties but are governed by large-scale circulation. This is a challenging case and could still be further improved, for example by adding wind patterns to capture more synoptic Figure 10. Median performance of gap filling with CLIMFILL and univariate interpolation on different missingness patterns and fractions of missingness expressed by two metrics: Pearson correlation and root-mean-square error (RMSE) per variable. Gap filling for random missingness and artificial swaths is executed for a range of fractions of missing values and is denoted as a line, while real missingness is only one case that is depicted as point. The metrics are calculated over each time step for all non-satellite-observable values of grid points on land, and the median of all land points is also plotted.
features. Terrestrial water storage contains only a small fraction of missing values (15 %). Since the interpolation is only applied spatially, it fails for full months of missing data and therefore the difference in correlation between interpolation and CLIMFILL is particularly high.

Recovery of regional and local land-climate dynamics
For any gap-filling framework to be useful for both scientific and practical applications it needs to be able to recover essential properties of the phenomena of interest. The coupling of energy and water between land and atmosphere at the land surface is a central, multivariate property of landclimate interactions that is currently underestimated in satellite data (Hirschi, 2014). By comparing CLIMFILL gap fill with the subset of data that can be observed from space, i.e. the gappy ERA-5 data (Fig. 4), we explore the role of missing values in this problem. This analysis is conducted over the whole available time period 2003-2020 (experiment D). More specifically, we show that leaving gaps in satellite data unfilled leads to biases and noise in estimates of regional and local climate feedbacks, and we also show how the CLIMFILL framework can contribute to overcoming this issue. Figure 11 showcases the Pearson correlation between the mean seasonal cycle in original ERA-5 data and CLIMFILL estimates, as well as spatial averages of the variables for selected IPCC reference regions (AR6 regions; see Iturbide et al., 2020). Overall, we find good agreement between them for the majority of regions in the world. The missing values in the satellite-observable ERA-5 data result in a noisy signal and biased values in regional estimates from the satellite-observable data. CLIMFILL alleviates the noise, reduces the bias, and has a high correlation to original ERA-5 data for a majority of regions (for all other regions, see Fig. A2 in Appendix A). However, surface-layer soil moisture and precipitation suffer from gap-filling artefacts in high-latitude regions.
In Fig. 12, the reconstruction of month-to-month variability by CLIMFILL is shown. Monthly, deseasonalized anomalies of the four variables for years 2003-2020 are plotted as spatial averages for selected IPCC reference regions (for all other regions, see Fig. A5 in Appendix A). CLIMFILL is able to reconstruct the overall variability throughout the years. Here the skill for surface-layer soil moisture and precipitation estimates is again region dependent. High-latitude regions show decreased correlation compared to other regions and unrealistic values in winter.
There are multiple reasons for the unrealistic performance of CLIMFILL in high-latitude regions in winter. Firstly, precipitation gap filling is a challenging case due to its nonnormal distribution. Secondly, soil moisture of frozen soil is hard to define. Thirdly, both variables suffer from unrealistic estimates of the interpolation step. These estimates that are created when the thin plate spline interpolation interpolates over areas where observed values are sparse and geographically far away, for example in the Russian Arctic during winter. These unrealistic values cannot be fully dampened by the iterative procedure in step 4 of CLIMFILL. Furthermore, precipitation is a challenging case due to its non-normal distribution. In future versions of the framework, a way to reduce this problem would be to not allow thin plate spline interpolation to estimate missing values that are too distant from observations, and instead rely on regional averages as an initial guess. In summary, for most variables in most re-  gions CLIMFILL shows high correlation with the original values and reduces the bias and noise of estimates compared to only satellite-observable data for both the seasonal cycle and interannual variability, with some difficulties arising from the missingness patterns of surface-layer soil moisture and precipitation in high-latitude regions.
Soil moisture-temperature coupling plays an important role in the development of heat extremes (Wehrli et al., 2019;Vogel et al., 2017;Seneviratne et al., 2010). As a last measure, we look at the case study of the European 2003 heatwave. Figure 13 shows the regionally averaged development of ground temperature and surface-layer soil moisture for the first 8 months of 2003, as well as anomaly maps of ground temperature for JJA 2003 and surface-layer soil moisture for MAM 2003 for the three cases. With satellite-observable data only, the ground temperature is overestimated because only clear-sky values are reported and ground temperature values below clouds, usually systematically lower during daytime in summer, are missing. CLIMFILL alleviates this bias and brings absolute temperatures and anomalies close to the original ERA-5 data. A strong dry soil moisture anomaly in spring was characteristic for the 2003 heat event, which is overestimated and noisy in the satellite-observable data. CLIMFILL is able to recover the spatial distribution of the event and reduce the bias. The 2003 heatwave case showcases how CLIMFILL can alleviate biases and noise and show a more realistic heat extreme development compared to gappy satellite data.

Discussion and conclusions
Gaps in remotely sensed Earth observations are ubiquitous and unavoidable and lead to a fragmented record of observational data. Ignoring these gaps creates noisy and biased derived statistics and regional averages. Spatial univariate interpolations with state-of-the-art methods typically only consider one variable at a time and can therefore not fully recover the multivariate dependence structure between the variables. To bridge this gap, a framework for gap filling multivariate gridded Earth observations, CLIMFILL, is proposed. CLIMFILL estimates missing values by considering not only spatial and temporal factors but also multivariate dependence across variables. In doing so, CLIMFILL mines the highly structured nature of geoscientific datasets and combines interpolation-centred approaches common to geosciences and multivariate gap-filling methods from statistical literature. In contrast to popular upscaling approaches, CLIMFILL does not need a gap-free gridded "donor" variable for estimating missing values. Thus, the algorithm can digest complex patterns of missingness in multivariate Earth observations. CLIMFILL fills gaps in fragmented Earth observations while recovering the physical dependence structure among the considered variables. To this end, the CLIMFILL framework contributes to decreasing the inherent fragmentation of Earth observations, enables usage of multiple gappy satellite observations simultaneously, and is a helpful tool when working towards a coherent digital representation of the Earth.
This study illustrates the need for gap-filling approaches and the merit of CLIMFILL with a set of variables relevant for the study of land-climate dynamics. CLIMFILL is benchmarked in an exemplary setting of reanalysis data with a focus on variables relevant for the study of land-climate dynamics. To this end, reanalysis data have been deleted to match missing values in satellite observations in a "perfect dataset approach". The multivariate JS distance shows that CLIMFILL recovers the dependence structure in the considered variables. Furthermore, univariate metrics show that CLIMFILL estimates have better agreement with original ERA-5 data compared to gappy data for many variables and regions. In summary, CLIMFILL is able to recover the dependence structure among several variables, in contrast to results obtained when missing values are not gap filled.
Although in this case study only four variables are considered, all of which are high in their respective fraction of missing values (up to more than two-thirds of the values missing) and complex in their pattern of missing values (always missing not at random), the multivariate gap filling with CLIMFILL improves estimates compared to univariate spatial interpolation. This is likely explained by the high correlation among the variables, which can to some degree counteract the complex missingness. This highlights that information from other available physically relevant variables can be beneficial for gap filling, indicating that the power of the framework might increase if even more dependent variables are included. Idealized experiments with simpler missingness patterns and different fractions of missing values within these four variables show that CLIMFILL improves upon univariate interpolation in all cases for all considered metrics, and the performance is close to easier cases with less complex missingness patterns for most variables.
In short, we have presented CLIMFILL, a multivariate gap-filling framework that exploits spatial, temporal, and multivariate information to create estimates for missing values in Earth observations. The fidelity of the framework has been successfully demonstrated in a case study centred around remote sensing observations relevant for the study of land-climate dynamics, which highlighted the merits of the approach compared to univariate interpolation. It is nevertheless important to stress that the "perfect dataset" approach employed here for benchmarking might not be fully representative for real observations. Therefore, we stress that the fidelity of the suggested algorithm has to be evaluated for real satellite observations and new applications. A natural next step could be to apply this gap-filling mechanism to a larger number of relevant observed variables and create a consistent, gap-free reconstruction of land hydrology.
Missing values in Earth observations are ubiquitous. Our efforts should centre around reducing these gaps in observations by, e.g. enhancing sensors, developing new measurement techniques, or closing gaps in observational networks. Looking at the problem from the other end, another approach could be to optimize the current observational capacities for information completeness, for example utilizing methods from information theory (Bauer et al., 2021a) and first tackle the gaps that are the largest or most severe for data analysis (both in natural and physical space). However, missing values will still remain unavoidable in many observations. Where they are present, it is imperative to develop dependable estimates that also consider links among variables. To this end, the CLIMFILL framework is developed to not only produce dependable estimates of individual variables but also to recover multivariate dependencies, eventually facilitating the creation of gap-free observational data products for environmental monitoring that also enable the study of Earth system processes, allow observation-only process analysis, or help to assimilate relevant but gappy observations into physical models.

4584
V. Bessenbacher et al.: CLIMFILL v0.9: a framework for intelligently gap filling Earth observations Appendix A Figure A1. Improvement of multivariate distribution with CLIMFILL gap filling: 2D histogram for all other combinations of variables (apart from the one already shown in Fig. 7) for the original ERA-5 data, satellite-observable ERA-5 data, the interpolation gap-filled data and CLIMFILL gap-filled data. GRACE terrestrial water storage volumetric soil water content of the first to fourth soil layer, snow depth sd, and lake cover cl multiplied with lake depth dl anomalies of daily sums compared to GRACE baseline (2004GRACE baseline ( -2009 cm (water-equivalent thickness) Table A2. Hyper-parameters of each step, their respective values, and how they were determined.

4591
Step Hyper-parameter Value Reason Step 1 Code and data availability. The current version of CLIMFILL is available from the project website (https://github.com/climachine/ climfill, last access: 7 June 2022) under the Apache 2.0 License. The exact version of the model used to produce the results used in this paper is archived on Zenodo (https://doi.org/10.5281/zenodo.4773663, , as are scripts to run the model and produce the plots for all the simulations presented in this paper. CLIMFILL was written in Python (Python Software Foundation, https://www.python.org/, last access: 7 June 2022) with core packages including xarray , NumPy , Matplotlib (Hunter, 2007), scikit-learn (Pedregosa et al., 2011), regionmask (Hauser, 2021), and scipy . The used ERA-5 data are publicly available at https://www.ecmwf.int/en/forecasts/datasets/ reanalysis-datasets/era5 (last access: 16 February 2021).
Author contributions. VB, LG, and SIS designed the study based on an initial idea from LG. VB and LG developed the framework and the evaluation. VB carried out the analysis and drafted the text. All authors contributed to reviewing and editing the article.
Competing interests. The contact author has declared that neither they nor their co-authors have any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.