Review of “CLIMFILL: A Framework for Intelligently Gap-filling Earth Observations”

This study introduces a sophisticated procedure for gap-filling Earth observation time series that benefits from independently and concurrently observed related variables. The authors showcase the method with reanalysis data in which some parts are intentionally masked, and the reconstructed estimates are then compared with the original data. They consider ground temperature, terrestrial water storage, surface-layer soil moisture and precipitation, and discuss the results in terms of both the reconstructed individual time series and the interactions between reconstructed variables, compared with the respective estimates from the original data.

-------------------
Recommendation: I think the paper requires major revisions. This is a useful and timely contribution for the Earth science community, and interesting for the readership of Geoscientific Model Development. Benefitting from a growing suite of Earth observations, complex statistical tools and machine learning applications are increasingly employed in Earth science research. These analysis tools mostly require gap-free data, which are often derived through gap-filling procedures. In this context, improving the quality of the gap-filling by exploiting the relationships between independent Earth observations is a promising avenue. However, I have some concerns regarding the description of the method and the benchmarking of the results, as detailed below.
--------------------
General comments:
(1) Comparing the results of the plain interpolation with those obtained after all four steps of the gap-filling procedure is interesting for understanding the method and the relevance of the individual steps. However, it is not a suitable benchmarking exercise, because the results after four steps are expected to be closer to the original ERA5 data than the results after the first, relatively crude interpolation step. Instead, an established univariate gap-filling technique should be employed as a benchmark to illustrate under which circumstances the presented methodology offers benefits over previous approaches. This could also reveal to what extent the gap-filling can be improved by (i) a complete exploration of the univariate time series beyond neighbors, versus (ii) a multivariate approach.
(2) I think it would be useful for future CLIMFILL users to receive more guidance on which methods to use in each step of the algorithm. Table 2 offers many possible choices, but recommendations are needed on when to use which method and why. The selection of the employed variables is also important, as their inter-relations are a key source of information for the gap reconstructions, so some additional advice on this would be helpful as well.
(3) I think that the feature selection is somewhat arbitrary and dependent on expert knowledge. To partly address this issue, several features could be used by default, such as the 34 features used in the presented example and possibly additional time lags and windows. The random forest model could then be employed to rank the features by their importance (e.g. using SHAP value importance) to support a more informed decision on which features are useful. Finally, the gap-filling could be re-run retaining only the relevant features.
(4) The manuscript uses advanced statistical and data science language throughout, and I recommend clarifying it with additional information so that a broader geoscientific audience can follow the manuscript. Please see my respective suggestions in the specific comments below.
line 427: similar in "remotely sensed" data but underestimated in "satellite observations"; should these not be the same thing?

O, S. and R. Orth, Global soil moisture data derived through machine learning trained with in-situ measurements, Sci. Data 8, 170 (2021).
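To make general comment (1) more concrete, the following is a minimal sketch of the kind of benchmark I have in mind. It is not the authors' code. The synthetic series, the 30 % gap fraction and all function names are assumptions chosen for illustration. A known series is masked artificially, the gaps are filled with a univariate baseline (here simple linear interpolation in time), and the error is evaluated only at the masked points. The CLIMFILL result could be scored with the same `rmse_on_gaps` function and the same mask.

```python
import numpy as np

def mask_random(series, frac, rng):
    """Boolean mask (True = artificially removed) covering roughly `frac` of the series."""
    mask = rng.random(series.shape) < frac
    mask[0] = mask[-1] = False  # keep the end points so interpolation never extrapolates
    return mask

def fill_linear(series, mask):
    """Univariate baseline: linear interpolation in time across the masked points."""
    t = np.arange(series.size)
    filled = series.copy()
    filled[mask] = np.interp(t[mask], t[~mask], series[~mask])
    return filled

def fill_mean(series, mask):
    """Crude reference: replace every masked point by the mean of the observed points."""
    filled = series.copy()
    filled[mask] = series[~mask].mean()
    return filled

def rmse_on_gaps(truth, filled, mask):
    """Root-mean-square error evaluated only where data were artificially removed."""
    return float(np.sqrt(np.mean((truth[mask] - filled[mask]) ** 2)))

# Hypothetical "truth": two years of daily data with a seasonal cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(730)
truth = 10.0 * np.sin(2.0 * np.pi * t / 365.0) + rng.normal(0.0, 1.0, t.size)

mask = mask_random(truth, frac=0.3, rng=rng)
scores = {
    "mean": rmse_on_gaps(truth, fill_mean(truth, mask), mask),
    "linear": rmse_on_gaps(truth, fill_linear(truth, mask), mask),
}
print(scores)
```

Scoring every method on the identical mask keeps the comparison fair. More realistic gap patterns (e.g. swath- or cloud-like gaps) would be a natural extension of `mask_random`.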
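The feature ranking suggested in general comment (3) could look roughly like the sketch below. It is an illustration under stated assumptions, not the CLIMFILL implementation: the data are synthetic, the feature names are hypothetical, and permutation importance from scikit-learn is swapped in for SHAP values to avoid an additional dependency. The idea is the same: fit a random forest on all candidate features, rank them on held-out data, and refit with only the most relevant ones.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the candidate features (hypothetical names and sizes).
rng = np.random.default_rng(0)
n_samples, n_features = 400, 6
X = rng.normal(size=(n_samples, n_features))
# Only feature_0 and feature_2 carry information about the target.
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + 0.1 * rng.normal(size=n_samples)
names = [f"feature_{i}" for i in range(n_features)]

# Hold out data so that importances reflect predictive skill, not fit to the training set.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Permutation importance: drop in score when one feature column is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
order = np.argsort(result.importances_mean)[::-1]
ranked = [names[i] for i in order]
print(ranked)

# Re-run with the top-ranked features only (the cut-off of 2 is an arbitrary choice here).
keep = order[:2]
reduced = RandomForestRegressor(n_estimators=100, random_state=0)
reduced.fit(X_train[:, keep], y_train)
```

In practice the cut-off would be chosen from the importance values themselves, for example by discarding features whose importance is indistinguishable from zero across the repeats.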