Comment on gmd-2021-164 Anonymous Referee # 1 Referee comment on " CLIMFILL : A Framework for Intelligently Gap-filling Earth Observations

The method is well described, but involves a number of modeling choices (initial interpolation method, clustering approach, random forest estimation and averaging) that are not always justified, except by the experimental results showing that “it works”. The problem is that I cannot make sure that it works with the current benchmarking. Indeed, the proposed method is compared only against the interpolation of step 1 of the proposed method itself. It is not in my opinion a sufficient benchmark. While it shows that steps 2-4 do have some added value compared to the extremely simple interpolation of step 1, added value against other interpolation methods is not demonstrated. By construction, it is expected that steps 1-4 perform than step 1 alone. I suggest to demonstrate the performance of the proposed approach is to compare it against something slightly more sophisticated, and already known to work in such contexts, for example (co-)kriging, possibly with a separate variogram model in each of the clusters defined in step 3.


Missing observations in Earth system science
Observing the Earth surface from the ground or space is an endeavour that has significantly contributed to advance our understanding of the Earth system and has played a vital role in the fields of data assimilation (Bauer et al., 2015), global freshwater In Table 1, we show an example of the fragmented world of Earth observations that challenge investigations in land-climate dynamics. The table highlights two issues that are typically encountered when analysing the observational record of the Earth system: there is either none or more than one observation system available for each variable. For example, evaporation is a key variable linking the water and the energy cycle at the surface, but it cannot be observed from space and is only sparsely measured on ground based observatories (Martens et al., 2017). If there is more than one observation of relevant variables, those 65 are usually difficult to combine because of inherently different measurement procedures. An example for this is temperature.
Space-borne observations see the temperature of the Earth surface, while in situ stations typically measure temperature in the atmosphere at two meters height. Combining those products can lead to errors in the estimate of surface energy partitioning (Balsamo et al., 2018) or might lead to diverging results when attempting model evaluation. Soil moisture is affected by both issues: While soil moisture can be observed from space, the microwave signal only penetrates the few first centimeters of the 70 soil . Consequently, information on vegetation-available root zone water which is central to many landatmosphere coupling effects is only available from sparse in situ observations , whilst surface soil moisture is measured both from space and in situ. Terrestrial water storage is available globally from the GRACE satellite , but tracks all water on land, including soil moisture, ground water and lake water. Hence we have several datasets for soil moisture that are difficult to combine.

75
Coming from the realm of physical modeling, reanalysis can be viewed as another class of gap-free reconstructions of the state of the Earth system, and are often the default dataset for a range of applications (Hersbach et al., 2020;Dee et al., 2011;Gelaro et al., 2017). Atmospheric reanalysis typically assimilates a wide range of observations into global weather models.
However, reanalysis products are by construction model-driven. They are therefore subject to model biases (Bocquet et al.,80 2019) and issues with model independence can arise if classical reanalysis products are used for model validation. Moreover, the observational record of the Earths surface is generally underutilised in state-of-the-art reanalysis products. The large fraction of missing values is cited as one of the mentioned reasons for this shortcoming . For example, in state-of-the-art atmospheric reanalysis product ERA5 the already difficult observational record of soil moisture is used only sparsely (Hersbach et al., 2020), although the added value for example remote sensing soil moisture assimilation been shown 85 for weather forecast models (Zhan et al., 2016) and flood forecasting (Brocca et al., 2014;Sahoo et al., 2013). Incomplete observation assimilation can therefore lower forecast accuracy and for example have consequences on the prediction of extreme events. A gap-filling procedure that can combine different observations into a coherent gap-free dataset could be used as a possible pre-processing step in reanalysis to enable a more thorough usage of available land observations. 90 Consequently, Balsamo et al. (2018) note the need for more multivariate Earth observation datasets apart from reanalysis.
At the same time, Bauer et al. (2021) mention an ongoing trend to reshape classical reanalysis such that physical modeling and fragmented observation can be harmonised into a combined product by the use of machine learning techniques wherever processes are unknown or difficult to parameterise. In the following, we present an approach to consolidate fragmented Earth observations into a coherent, multivariate, gap-free dataset by tackling the problem of missing values in the multivariate Earth 95 observation record with the gap-filling framework CLIMFILL. Distinguishing the approach from reanalysis, we do not aim to assimilate observations with a pre-defined physical model, but to leverage the power of modern statistical techniques to produce dependable and physically consistent estimates of essential Earth system observations. The newly developed methodology is exemplarily tested for variables relevant in the study of land-atmosphere dynamics.
1.2 A brief review of gap-filling methods 100 1.2.1 Gap-filling in the methodological literature The methodological literature offers a theoretical framework for the problem of missing values in any kind of data (Rubin, 1976). Typically, the simplest form of gap management is referred to as list-wise deletion, where only data points are considered if all variables are observed. However, this approach can lead to very large data loss. Furthermore, statistics derived from incomplete data can be biased if the data are missing not at random (Rubin, 1976 data are missing (i.e., the "missingness") is one of the most important factors when estimating the impact of missing values (Little and Rubin, 2014). In particular, Rubin (1976) categorizes three ways in which data can be missing: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). All these three missingness patterns can be observed in Earth observation data: -If the probability that a data point is missing is not dependent of any process, the missingness is described as missing 110 completely at random (MCAR, Figure 1 a). This is rarely the case in Earth observations.
-Satellite data are often missing because of satellite swaths. For example orbiting satellites, e.g. measuring soil moisture with a microwave sensor, do not pass certain regions at certain times (Figure 1 b). Here, the fact that we can't measure the soil moisture at a certain space-time-point is not dependent on the actual soil moisture at this point. In other words, the soil moisture is not significantly lower or higher where the satellite does not pass through. Therefore, the probability 115 of a data point being missing is not dependent on the value of the missing data point. Rubin (1976) call this missingness pattern missing at random (MAR).
-The most complex missingness pattern is missing not at random (MNAR). Here, the mechanism that obscures data points depends on the data that are missing. This mechanism can either be a function of the observed variables, for example when values above or below a certain threshold are not observable. Or if missingness is controlled by a different, 120 but unobservable variable. In the case of an exemplary satellite measuring soil moisture via microwave retrievals, the measurement over dense vegetation represents more the water content of the canopy rather than the one of the soil.
Hence the data at such points are masked during post-processing, leading to large patches of missing values especially in tropical forests. Here, we cannot safely assume that the soil moisture below dense vegetation is not significantly different from the soil moisture that is not missing. Therefore, we cannot assume independence between the fact that a point is 125 missing and the unobserved value of the missing point. We observe MNAR missingness (Figure 1 c).
Geoscientific data are in a large part missing not at random (MNAR), making statistical measures of the data biased (van Buuren, 2018) and gap-filling challenging (see for example Cowtan and Way (2014)). Ghahramani and Jordan (1994) show that gap-filling with the help of statistical tools (called statistical imputation) of missing data is possible for MCAR and MAR in both a Bayesian and a Maximum Likelihood setting, but note that MNAR data cannot be tackled with the same methods.

130
However, imputation can still be successful if a high degree of dependence between MNAR variables increases their mutual information. We argue that this is especially the case for geoscientific observations, since the variables are often directly linked through a number of processes.
A wide range of algorithms that make use of cross-variable dependence to estimate missing values exist in statistical liter-135 ature. Gaussian Processes are a natural choice for gap-filling problems, but they have limitations when moving to large data (Heaton et al., 2019). Other approaches center around low-rank matrix recovery or eigenvalue analysis for estimating missing values (Davenport and Romberg, 2016;Mazumder et al., 2010). Iterative procedures like the MICE-Algorithm ("Multiple imputation by chained equation", van Buuren (2018)) are suited well for multivariate imputation and scale to large data, but cannot account for neighborhood relations. Regression-based multivariate gap-filling algorithms like these have, to the best of Table 2. Main CLIMFILL settings per step and method class. Each task can be performed using other method of the corresponding method class.
Step Task CLIMFILL-RF (this study) Examples of alternative methods Step 1: Interpolation Interpolation Mean of spatio-temporally neighboring, non-mising points Kriging, linear interpolation, nearest-neighbor interpolation kriging or more complex interpolation methods.
Step 4: Learning Regression Random Forest Multiple Linear Regression, Neural Nets, Gradient Boosting, Gaussian Models.
they are ill-equipped to incorporate the neighborhood relations with spatially extensive, gridded data. Spatial analogue searching algorithms such as the direct sampling approach by Mariethoz et al. (2012) and image inpainting (Kadow et al., 2020) explore multivariate spatial interpolation. Upscaling is a common, multivariate regression-based approach in Geosciences to gap-fill spatially incomplete observations but rely on at least one complete "donor-variable" or an additional, gap-free dataset 160 to infer values of incomplete variables (Brocca et al., 2014;Kadow et al., 2020;Zhang et al., 2018;Zeng et al., 2015;Brajard et al., 2019;Ghiggi et al., 2019).
In summary, there exists a rich body of geoscientific literature on tailored solutions for individual gap-filling needs. However, no unified and modular solution exists that can be applied for any gap-filling scenario that might arise when working with Earth 165 system observations. In the following, we introduce the multivariate gap-filling framework CLIMFILL that aims at overcoming the mentioned issues (Sect. 2). Section 3 describes a case study used for evaluating and benchmarking the framework. In Sect.
3.1 we describe the data that has been used to evaluate the skill of the framework. , , , , } } Figure 2. Overview on the structure of the gap-filling framework. The framework is divided into three steps. In the first step (Sect. 2.1), any missing value is gap-filled by an initial estimate from the spatio-temporal context. This step is called interpolation step. Here the spatiotemporal mean of observed values surrounding the missing value is used for each variable individually. In the second step (Sect. 2.2), embedded features are created to inform about time-dependent processes. In the third step, the data are divided into environmentally similar clusters (Sect. 2.3, Algorithm 1). In the forth step (Sect. 2.4, Algorithm 1), the inital estimates from step 1 are updated while accounting for the dependence structure among all considered variables. This is achieved by first grouping available data point into environmentally similar clusters and then iteratively updating the initial estimates using a supervised learning algorithm. different variables expressed through their statistical dependence. With this design we aim at recovering both the marginal distributions and the dependence among variables at any location with missing values. The framework CLIMFILL (CLIMate data gap-FILL) works mutually, i.e. information available in each of the variables is used for filling the gaps of all the other variables. With this design we implicitly assume that if one variable is not observed at a certain space-time point, a subset of 180 the other variables might be observed and can reconstruct the missing value while conserving the correlation structure among all variables.
The framework is divided in four steps ( Fig. 2): In a first step, initial estimates for all missing values are produced by spatiotemporal interpolation of each variable independently. In a second step, the data are pre-processed to enable the analysis of 185 spatial and temporal dependence, which ultimately allows to uncover physical links among different variables. In the third step, the data are divided into environmentally similar clusters. In the forth step (learning step) the multivariate gap-filling happens: the initial estimates from the interpolation step are updated by an iterative procedure that aims to both reconstruct the dependence structure between the variables and to increase the accuracy of the initial estimates.

190
In the following, the newly developed iterative framework for gap-filling is described. CLIMFILL allows for a wide range of different options, creating a different instance of the framework for any missing value problem. A summary of the the necessary steps for setting up the framework, possible tweaks and extensions is given in Table 2. These include the process for coming up with initial estimates in step 1, feature engineering in step 2, as well as the selection of a clustering method in step 3 and the regression method used in step 4.

Step 1: Interpolation for integrating spatio-temporal context
The interpolation step creates initial estimates based on the spatio-temporal context of the gap. Interpolation methods that are typically used in geosciences, such as linear, bilinear or nearest neighbor interpolation as well as kriging can be used here (for examples see Table 2).

2.2
Step 2: Feature engineering informed by process knowledge 200 An important step in data driven modelling is taking care that the data consist of informative variables that represent the mechanisms at work. This creation of informative variables or "features" guided by expert knowledge is called feature engineering. For example, gap-free constant maps of describing properties of the land surface such as topography or land cover can be included. Furthermore, Earth observations often inform about time dependent processes like seasonal effects, weather persistence or soil moisture memory. To account for such antecedent and subsequent effects, backwards and forwards looking 205 running means of different window size and temporal lags are considered. This is motivated by the Takens Theorem (Takens, 1981) and prior work on large-scale runoff estimation (Gudmundsson and Seneviratne, 2015). Given a variable v i,j,t at longitude i, latitude j and time step t we define the window size s and time lag l over which a running mean of a variable v is computed: resulting in an embedded feature v * l,s produced from variable v. The specific values for s and l can be informed by domain knowledge or identified through optimisation. For example, to account for the soil moisture memory effect, an embedded feature v * could be added that contains the average value of all soil moisture values at this point in a 3-month backwards window (s = 90 days) from the current date (l = 0 days), corresponding to previous work indicating the soil moisture memory 215 effect acts on "monthly to subseasonal time scales" (Nicolai-Shaw et al., 2016). For the application of data science methods, the data need to be rearranged in a table X build from all variables v 1 , ...v n and derived features v * 1 , ..., v * m as columns and space-time points as rows.

Step 3: Clustering the data into environmentally similar clusters
Depending on the climate regime and the season different physical processes might govern the local dependence among vari-220 ables. Furthermore, geoscientific datasets are very large and the computational costs of supervised learning methods does often not scale linearly with the number of samples. We therefore split the data into K environmentally similar clusters X (1) , ..., X (K) (Algorithm 1, line 5) in which the multivariate gap-filling happens (Algorithm 1, second loop, line 6+17). This grouping is done in such way that grid points can be in different clusters depending on the time step. For example, a grid point in the Mediterranean area can be in a different cluster in winter than in summer, accounting for seasonally varying climate phenomena such 225 as changing soil moisture regimes (Seneviratne et al., 2010). All data are transformed to have zero mean and standard deviation of one.
In each of the clusters, the initial estimate of the missing values is further refined using an iterative procedure. For stabilising the results and to reduce the risk of discontinuities at the cluster edges, the clustering procedure is repeated E times with 230 different numbers of terminal clusters on copies of the data X (1) , ..., X (E) . We call these E different clustering results "epochs".
In the end, the estimates from the E different clusterings are averaged for the final result (Algorithm 1, outer first loop, line 3-5,19-20).

Step 4: Optimising the initial estimates by accounting for the dependence between variables
In the fourth step, the initial estimates are updated by accounting for the dependence between variables. Within each of the 235 clusters in epoch e and cluster k, X (e,k) , the algorithm repeatedly iterates over the variables until convergence is reached.
This procedure builds upon the MissForest algorithm by Stekhoven and Bühlmann (2012). For each variable v, a supervised learning model is fitted to the cluster to predict originally missing values in all variables based on the remaining features.
This core mechanism of CLIMFILL is detailed in the inner, forth loop of Algorithm 1 (line 8 to 15): The current variable is selected from the cluster as predictand y −v,m ). Note that the training data most likely include originally missing values in the predictor variables.
cedure. Once the algorithm has iterated over all the variables, each missing value has been updated once (Algorithm 1, third loop, line 7+16). The algorithm is stopped (stopping criterion) once the change in the estimates for the missing values is small between iterations (convergence) or a maximum number of iterations is reached (early stopping).
Note that the framework is set up such that each cluster applies the same supervised learning method but learns different Split X (e) into K (e) clusters X (e,k) using an unsupervised classification method. −v,m as all data points where M is true.

13:
Fit the regression model where f denotes any supervised learning method. 14: Create an updated estimate with the fitted regression model −v,m ).

Data
Since the original values that need to be gap-filled are unobserved, we fall back on naturally gap-free atmospheric reanalysis data for benchmarking the framework. We use 10 years (2003-2012) of land-only global reanalysis data from ERA5 at 0.25 degree resolution (see Hersbach et al. 2020). ERA5 is chosen as a gap-free dataset for the "perfect dataset approach" because of its advanced representation of land surface processes (Hersbach et al., 2020) and improved agreement of relevant surface variables with available observations (Martens et al., 2020;Tarek et al., 2020;Albergel et al., 2018). The missingness patterns 260 of satellite observations in the same period are extracted, regridded to ERA5 resolution and applied to the corresponding ERA5 variable. In other words, only the part of the ERA5 data that would have been observable by satellite are retained. In this "perfect dataset approach", the "true" values of the variables at the locations of the missing values are known and can be compared with the estimates of the gap-filling framework (see Figure 3). This analysis is constrained to orbiting satellite remote sensing datasets and excludes in situ observations and gridded observations for the purpose of developing the framework. We note 265 however that the framework is naturally extendable to include more satellite observations, and in situ observations that can be treated as a very sparse gridded product.
The hourly ERA5 data are aggregated to daily resolution. The aggregation function for each variable is chosen to be consistent with the satellite products (e.g. daily sums for precipitation and daily average for soil moisture, see Supplementary  the retrieval. From available metadata, we retrieved the originally missing maps to be able to quantify the added value of mutual gap-filling for precipitation. Terrestrial water storage is only missing if the global measurement is discarded due to instrument failure or during calibration missions (Landerer, 2021), leading to individual time slabs missing, and only 7% missing values.
3.2 CLIMFILL-RF: Settings of the CLIMFILL framework used for benchmarking 295 The CLIMFILL framework allows for a wide range of individual settings to tailor it to the specific gap-filling use case. In each of the four steps, a method needs to be chosen to perform the specific task of this step. There is a large pool of methods  that can be used, for examples see Table 2. In the following, we describe the settings of the framework that are used for this benchmarking experiment and call this particular instance CLIMFILL-RF, denoting the Random Forest method used at the core of the algorithm.

300
For the first step (interpolation) initial estimates are generated through simple interpolation by applying a 3d running mean for each variable independently. If a data point of a variable, v i,j,t , is missing, the initial estimate is calculated by the mean of its non-missing surrounding points in space and time. Here we consider a 5-pixel side length, corresponding to a distance of In the third step (clustering step), the data are divided into clusters. Here a k-means algorithm is considered and the data are partitioned three times with different number of clusters, where the number of clusters is randomly drawn between 50 and 150.
These limits are chosen such that the number of data points per cluster is sufficiently large to ensure that the regression models 320 can be calibrated efficiently, but not too small such that no individual clusters consist of missing values entirely.
In the fourth step (learning step), we use a Random Forest regressor as supervised learning function. Random Forests have have favorable properties for gap-filling applications: they can handle mixed types of data, are scale-able to large amounts of data and non-parametric, i.e. adaptive to linear and non-linear relationships (Tang and Ishwaran, 2017). The hyper-parameters of the supervised learning functions are determined via leave-one-out cross-validation on clustered ERA5 data between 2015 325 and 2020 downscaled to 2.5 degrees resolution, where one fold is one year. The cross-validation optimises the number of trees, the minimum number of samples for a leaf node, the maximum number of features to be considered for each split and whether to use bootstrap samples for tree building.

Benchmarking against univariate interpolation
The objective of the CLIMFILL framework is to not only reconstruct variables separately, but also to recover multivariate de-330 pendencies. In the first part of the results, we illustrate the improvement of the multivariate gap-filling framework CLIMFILL-RF compared to the univariate, spatiotemporal interpolation that takes place in the first step of the framework.  1943) (B-distance). The B-distance is a general measure to quantify the distance of two multivariate distributions, taking into account both the similarity in mean and covariance of both distributions. For samples of two multivariate normal distribution with means µ 1 , µ 2 and covariances Σ 1 , Σ 2 , the Bhattacharyya distance is defined as where Σ is the mean of Σ 1 and Σ 2 . The first term is a measure of similarity of the mean between two samples, and the second term is a measure of similarity of their covariances. Although the data considered may not be normally distributed we rely here on the normal approximation of the B-distance to facilitate a quantitative comparison of the considered gap filling methods at a reasonable computational cost. Overall, the B-distance is lower for CLIMFILL-RF than for interpolation globally (Fig. 7). The largest improvement is in temperate and boreal regions, where a high fraction of missing values inhibits the performance of interpolation. In parts of the inner tropics the B-distance of the interpolation gap-fill is not defined (in Fig. 7 indicated with dark grey color). Here the gap-fill estimate from interpolation is the same in every time step, because the high vegetation cover causes the satellite to never 360 observe surface layer soil moisture in this area. Leads to an all-zero covariance and therefore theoretically infinite, practically undefined B-distance. These points have been removed in Fig. 7 (c) and (d) to improve readability. Taking a closer look at the results by dividing the global map into types of vegetation and altitudes shows that the B-distance improves from interpolation to CLIMFILL-RF for all altitudes and almost all land cover types. This indicates an improvement of multivariate features in CLIMFILL-RF gap-fill globally for a wide range of environmental conditions. Overall CLIMFILL-RF has a higher skill in 365 reconstructing the multivariate dependence structure of the original ERA5 data compared to univariate interpolation.

Data-constrained upper perfomance limits
Missing values in Earth observation data are often present in a large proportion and a complex missing-not-at-random pattern.
These characteristic properties of gappy Earth observation data can inhibit gap-filling. We therefore are interested in carving out the envelope of data properties in which gap-filling can be successful and see the deterioration of performance with in-370 creasing data sparsity and increasingly complex missing value patterns. Using the four considered ERA5 variables we test the framework in idealised, simpler missingness patterns. In these additional experiments, we delete data according to a (1) MCAR random missingness pattern and (2) imitating satellite swaths, effectively creating MAR missingness patterns (Fig. 8).
Both patterns are applied for fractions of missing values between 5% to 80% for each of the variables. We performed these experiment on a downscaled 2.5 degrees resolution ERA5 data because of computational constraints.

375
Multivariate B-distance ( Figure 9) and univariate statistical performance measures ( Figure 10) for all performed experiments comparing original and gap-filled values. With increasing fraction of missing values, the two artificial missingness cases increase in error, increase the B-distance and decrease in correlation. Once more than 80% of the values are missing, the gap-filling breaks down because not enough observed values are available for the iterative procedure to converge to a meaning-380 ful result. Random and artificial swath missingness show similar deterioration with increasing fraction of missing values, but values missing completely at random tend to be easier to estimate at all fractions of missing values. Gap-filling random missingness is the easiest case, since it is likely that neighboring or environmentally similar points are observed. MAR missingness exposes large patches of missing values, therefore making spatiotemporal interpolation less effective and therefore decreasing the gap-filling performance as compared to MCAR. Since the MNAR missingness case is the most complex missingness pat-385 tern, these additional experiments serve as upper limits of the performance in the real case.
When moving from the artificial patterns of missingness to the real case (dots and circles in Fig. 10), the deterioration in performance is different for each of the variables. For ground temperature, a spatially and temporally smooth variable, the interpolation is already quite a good first guess, which is only slightly improved in CLIMFILL-RF. In this case study, we found 390 the biggest improvement compared to interpolation for surface layer soil moisture despite its large fraction of missing values. This could be due to the fact that surface layer soil moisture exposes missingness in areas where other variables are observed, for example in the tropical forests, such that learning in this area is easier. Additionally, variable selection is centered around soil moisture, and soil moisture is a key variable of land hydrological processes. The most difficult case is precipitation. Despite the additional pre-processing step to account for its non-normality, the low precision precipitation estimates were only slightly 395 improved with CLIMFILL-RF and it is difficult to improve the result of the initial interpolation. Precipitation is influenced by a lot of processes that are not captured within the four selected variables. For example, frontal rain patterns are mostly not explained by land surface properties but are governed by large scale circulation. This is a challenging case and could still furthermore be improved, for example by adding wind patterns to capture more synoptic features. Terrestrial water storage contains only a small fraction of missing values (7%), but its gap-filling could be hampered by its monthly resolution that does 400 not co-vary enough with the other variables. Introducing an additional bias correction step could help alleviate these problems.

Recovery of regional and local land-climate dynamics
For any gap-filling framework to be useful for both scientific and practical applications it needs to be able to recover essential properties of the phenomena of interest. The coupling of energy and water between land and atmosphere at the land surface is a central, multivariate property of land climate interactions that is currently underestimated in satellite data (Hirschi, 2014).

405
By comparing CLIMFILL-RF gap-fill with the subset of data that are observable by space, i.e. the gappy ERA5 data (Fig. 3) we explore the role of missing values in this problem. In particular we show that leaving gaps in satellite data unfilled leads to biases in estimates of regional and local climate feedbacks and how the CLIMFILL framework contributes to overcoming this issue.
410 Figure 11 showcases the mean seasonal cycle of the variables for selected IPCC reference regions (AR6 regions, see Iturbide et al. 2020). Surface layer soil moisture, ground temperature and precipitation suffer from gaps in the winter months in mid to high latitude regions like Western & Central Europe and South-West North America. In tropical regions like Central Africa and South-East Asia, especially soil moisture estimates suffer from little data availability. The missing values result in a noisy signal and biased values in regional estimates from the satellite-observable data. CLIMFILL-RF alleviates the noise 415 and reduces the bias for surface layer soil moisture and ground temperature for these regions with low satellite coverage better than the interpolation estimates. The largest relative difference is in the surface layer soil moisture estimates. For surface layer soil moisture and ground temperature especially the amplitude of the signal is reconstructed, but also the bias is reduced in all regions (see Supplementary Fig. A2). Precipitation and terrestrial water storage estimates show little change.

420
Soil moisture-temperature coupling plays an important role for the development of heat extremes (Seneviratne et al., 2010;Vogel et al., 2017;Wehrli et al., 2019). This feedback can be described by the correlation between the soil moisture anomaly sm anom and the number of hot days (NHD). The correlation can expose "hot spots" of soil moisture-temperature coupling where hot extremes can be exacerbated (Mueller and Seneviratne, 2012;Hirschi, 2014) and is central for representing com- pound extreme events at the land surface, such as droughts and heat waves. We compute this correlation for original ERA5 425 data, satellite-observable ERA5-data and the CLIMFILL-RF gap-fill. Hirschi (2014) note that the coupling strength between remotely sensed soil moisture and NHD is qualitatively similar, but underestimated in satellite observations compared to a precipitation-based soil moisture estimate from interpolated weather station data (CRU dataset, Harris et al. (2020b)). A similar effect can be found in ERA5 data when only satellite-observable data points are taken into account. Comparing Fig. 12 shows that removing data from ERA5 that would not have been observable via space leads to a deterioration of soil moisture-430 temperature coupling strength, especially in the showcased regions that have sparse surface layer soil moisture observations such as tropical forests and high latitudes. In these areas, CLIMFILL-RF is able to alleviate the underestimated coupling ( Fig.   12) and successfully reconstructs the correlation between NHD and sm anom in these regions. This is highlighting that missing values in Earth observation can bias process analysis and multivariate gap-filling can help alleviating these biases and recover important dynamics and dependencies between variables which would have been dampened or lost in gappy satellite-data 435 alone.
Gaps in Earth observations are unavoidable and lead to a fragmented record of observational data. CLIMFILL is a framework for gap-filling multivariate gridded Earth observations that estimates missing values by taking into account the spatial, temporal and the multivariate context of a missing value. In doing that CLIMFILL mines the highly structured nature of geoscientific 440 datasets and bridges the gap between interpolation-centered approaches common to geosciences and multivariate gap-filling methods from statistical literature. In contrast to popular up-scaling approaches, CLIMFILL does not need a gap-free gridded "donor" variable for learning and can digest any gap structure in the provided data, including spatial gaps, temporal gaps and non-overlapping observations from different datasets. Furthermore, by clustering the global data into environmentally similar points, we tailor the multivariate gap-filling to the needs of datasets spanning global, highly diverse ecosystems and changing 445 land-atmosphere interactions. This approach also decreases computing time such that high resolution gap-filling is possible (not shown). The highly flexible nature of CLIMFILL does not imply a physical model, but allows important physical dependencies to be imprinted in the dataset before gap-filling through feature engineering. This way, CLIMFILL can be tailored to many geoscientific use cases. In summary, CLIMFILL can successfully fill gaps in fragmented Earth Observation datasets, while maintaining the physical dependence structure among the considered variables. To this end, the CLIMFILL framework 450 contributes to decreasing the inherent fragmentation of earth observations and enables usage of multiple gappy satellite observations simultaneously.
We have tested and bench-marked CLIMFILL in an exemplary setting of land hydrology reanalysis data. To this end this data have been deleted to match missing values in satellite observations in a "perfect dataset approach". This case study shows 455 that seeing only satellite-observable data without filling the gaps creates biased, noisy regional estimates and destroys the dependence structure in multivariate settings. CLIMFILL is able to recover this dependence structure in land-atmosphere coupling and hence enables process investigation in gappy, multivariate observations. Quantified with the multivariate B-distance we show that this recovery improves the dependence structure globally across almost all land covers and altitudes compared to interpolation. The largest improvements are in temperate and boreal regions, although these are areas with large patches of 460 low numbers of observed points. In summary, CLIMFILL is able to recover the dependence structure among several variables, contrasting results obtained when missing values are not gap-filled or treated without considering multivariate aspects. Thereby CLIMFILL enables a physically consistent interpolation of the resulting gap free dataset.
Interestingly, the case study showed that the benefit of CLIMFILL compared to interpolation is not equally large across 465 variables. The selected group of variables and their individual missing value patterns are central for the success of multivariate gap-filling. Learning from the other variables is highly beneficial in gap-filling surface layer soil moisture estimates, although it has the largest fraction of missing values. Since the framework is targeted at recovering the the physical dependence structure across variables, the improvement in univariate measures like correlation and bias tend to be improved at a smaller scale than the multivariate dependence structure. The case study also highlights that information from other available variables can indeed be beneficial for gap-filling if process knowledge is used when selecting a sub-set of variables and suggests the potential power of the framework if even more dependent and important variables are included in the multivariate gap-filling process.
Although the selected observations are small in number (only four variables considered), high in their respective fraction of missing values and complex in their pattern of missing values (always missing not at random), the multivariate gap-filling 475 with CLIMFILL successfully improves estimates compared to univariate interpolation. This is likely related to the high correlation among the variables, which can to some degree counteract the complex missingness. Idealised experiments with simpler missingness patterns and different fractions of missing values within these four variables show that CLIMFILL improves upon univariate interpolation in all cases for all considered metrics, but that multivariate gap-filling is easier with smaller fractions of missing values and less complex missingness patterns. The high correlation and low error scores for low fractions of miss- 480 ing values indicate that the four included variables represent important processes and are explanatory for each other, i.e. their mutual dependence is expressive enough to conduct meaningful gap-filling.
In conclusion, we have presented a multivariate gap-filling framework that uses spatial, temporal and multivariate information to create estimates for missing values. This framework has been successfully applied in a case study centered around land 485 hydrology remote sensing observations. The modularity and flexibility of the proposed gap-filling framework make it applicable to all kinds of Earth observation data once suitable settings are chosen by applying knowledge of the important physical processes represented in the data. CLIMFILL can be used for multivariate, observation-only process analysis or help including relevant but gappy observations into data assimilation or reanalysis. in situ data could possibly be included as well if treated as a very sparsely gridded data where the area of representation for the point measurement is accessed (see e.g. Nicolai-Shaw 490 et al. 2015). A natural next step could be to apply this gap-filling mechanism on a larger number of relevant observed variables and create a consistent, gap-free reconstruction of land hydrology.
Missing values in Earth observations will remain unavoidable. However, the intrinsic motivation should be to reduce gaps in observations. Enhancing sensors, developing new measurement techniques or closing gaps in observational networks are 495 three possible directions of innovation that could help reduce missing information. This endeavour however must start with an assessment of the information completeness of existing observations, for example by applying methods from information theory. It should aim at closing the largest gaps first, for example in terms of available variables, sampled ecosystems or in physical space. Reducing the complexity of missing information in Earth observations can be a large step towards better observational estimates of crucial Earth system processes.

500
Code and data availability. The current version of CLIMFILL is available from the project website: https://github.com/climachine/climfill under the Apache 2.0 License. The exact version of the model used to produce the results used in this paper is archived on Zenodo (http://doi.org/10.5281/zenodo.4773664), as are scripts to run the model and produce the plots for all the simulations presented in this paper.      Table A2. References for all observational datasets mentioned in Table 1 name available variables reference SYNOP stations many, most prominently 2-meter temperature e.g. Lawrimore et al. (2011) FLUXNET stations many, see Table 2  SEVIRI ground temperature Trigo et al. (2015) thank the ECMWF for creating and providing the ERA5 reanalysis product. Lastly, we would like to thank Roberto Villalobos, Jonas Jucker, Joel Zeder and Johannes Senn for feedback on the draft.