Comment on gmd-2021-353 Anonymous Referee #1
Referee comment on "DINCAE 2.0: multivariate convolutional neural network with error estimates to reconstruct sea surface temperature satellite and altimetry observations"

The paper presents an update to the previous version of DINCAE, a convolutional autoencoder method for in-painting sparse satellite data. DINCAE 2.0 offers several improvements over the previous version, most notably in reconstruction accuracy (relative to DINCAE 1.0), in speed (presumably due to being rewritten in Julia instead of Python) and in an option to treat ungridded data such as satellite altimetry observations. It also introduces a refinement step that increases the depth of the network, and an intermediate loss term is included in the total loss to compensate for the vanishing gradients of the deeper network. When treating sparse data, the error variance estimate of DINCAE 2.0 is more reliable than that of the variational interpolation method DIVAnd. The results are solid, the paper is well written and the figures are clear. I recommend publication after minor revision. I do have some comments which might be worth discussing further.


Introduction
Ocean data are generally sparse and inhomogeneously distributed. The data coverage often contains large gaps in space and time. This is in particular the case for in situ observations. Satellite remote sensing only measures the surface of the ocean but generally has better spatial coverage than in situ observations. However, on average still about 75 % of the ocean surface is covered by clouds, which block sensors in the optical and infrared bands (Wylie et al., 2005). Given the sparsity of the data, it is natural to aim to combine data representing different parameters as, e.g., mesoscale flow structures are often visible in all ocean tracers.
Prior work using multivariate data in connection with satellite data uses, for example, empirical orthogonal functions (EOFs), which can be naturally extended to multivariate datasets as long as an appropriate norm is defined. For example, Alvera-Azcárate et al. (2007) use sea surface temperature, chlorophyll and wind satellite fields with data interpolating empirical orthogonal functions (DINEOF). Multivariate EOFs have also been used to project surface observations to deeper layers (Nardelli and Santoleri, 2005) or to derive nitrate maps in the Southern Ocean (Liang et al., 2018). In the latter case, EOFs linking salinity, potential temperature and nitrate concentrations are derived from model simulations.
Published by Copernicus Publications on behalf of the European Geosciences Union.

A. Barth et al.: Reliable error estimates for reconstructed missing data in satellite observations

As some observations can be measured at a much higher spatial resolution via remote sensing (in particular, the resolution of sea surface temperature products is much higher than the resolution of sea surface salinity products), "multifractal fusion techniques" are used to improve remotely sensed surface salinity estimates using sea surface temperature. Data fusion is implemented as a locally weighted linear regression (Olmedo et al., 2018, 2021). Han et al. (2020) also used an earlier version of the DINCAE code to estimate sea surface chlorophyll using additional sea surface temperature observations.
The structure of a neural network, and in particular its depth, is uncertain and to some degree dependent on the data set used. We also investigate the influence of the depth of the neural networks in this work. It is known that neural networks become increasingly difficult to train as their depth increases because of the well-known vanishing gradient problem (Hochreiter, 1998): the derivative of the loss function relative to the weights of the first layers of a neural network tends to either decrease or increase exponentially with an increasing number of layers. This prevents effective optimization (training) of these layers using gradient-based optimization methods.
Several methods have been proposed in the literature to mitigate such problems using alternative neural network architectures. In the context of the present manuscript, skip connections in the form of residual layers have been tested (similar to residual networks; He et al., 2016). The derivative of the loss function relative to the weights in such layers remains (at least initially) closer to one, so there is a more direct relationship between the loss function and the weights and biases to be optimized. Deeper residual networks include shallower networks as a special case and thus, as per their construction, should perform at least as well as shallower networks.
The gradient of a whole network is computed via backpropagation, which is essentially based on the repeated application of the chain rule of differentiation. The information from the observations is injected via the loss function and propagated backward in a way similar to the 4D-Var backward-in-time integration of the adjoint model. Another interesting neural network architecture has been proposed in the form of the Inception network (Szegedy et al., 2015), where the output of intermediate layers, here in the form of a preliminary reconstruction, is used in the loss function (in addition to the output of the final layer). As a result, the information from the observations is injected not only at the final layer but also at the intermediate layers, which also contributes to reducing the vanishing gradient problem.
While approaches based on empirical orthogonal functions and convolutional neural networks have been shown to be successful for gridded satellite data, it is difficult to apply similar concepts to non-gridded data, as these methods typically require a stationary grid. Another objective of this paper is to show how convolutional neural networks can be used on non-gridded data. This approach is illustrated with altimetry observations. The objective of this manuscript is to highlight the improvements of DINCAE relative to the previously published version (Barth et al., 2020). Section 2 presents the updated structure of the neural network. The gridded and non-gridded observations used here are presented in Sect. 3. Details of the implementation are also given (Sect. 4). The results and conclusions are presented in Sects. 5 and 6.

The neural network architecture
The DINCAE network (Barth et al., 2020) is a neural network composed of an encoder and decoder network. The encoder uses the original gappy satellite data (with additional metadata as explained later) and the decoder uses the output of the encoder to reconstruct the full data image (along with an error estimate). The encoder uses a series of convolutional layers followed by max pooling layers, reducing the resolution of the datasets. The decoder does essentially the reverse operation by using convolutional and interpolation layers. This is the general structure of a convolutional autoencoder and the classical U-Net networks (Ronneberger et al., 2015). In the following section we will discuss the main components of the DINCAE neural network used for the different test cases and emphasize the changes relative to the previous version.

Skip connections
In an autoencoder, the inputs are compressed by forcing the data flow through a bottleneck, which ensures that the neural network must efficiently compress and decompress the information. However, in U-Net (Ronneberger et al., 2015) and DINCAE, skip connections are implemented, allowing the information flow of the network to partially bypass the bottleneck to prevent the loss of small-scale details in the reconstruction. Skip connections can be realized either by concatenating tensors along the feature map dimension (as is done in U-Net) or by summing the tensors. The cat skip connection at step l in the autoencoder can be written as the following operation:

X^(l+1) = cat(f^(l)(X^(l)), X^(l)),

where cat concatenates two 3D arrays along the dimension representing the feature channels. The function f^(l) is a sequence of neural network layers applied to the array X^(l) (produced by previous network layers). Sum skip connections are implemented as

X^(l+1) = f^(l)(X^(l)) + X^(l).

Clearly, the output of a cat skip connection has a size twice as large as the output of a sum skip connection. These skip connections are followed by a convolutional layer, which ensures that the number of output features is the same for both types of skip connection. In fact, one can show that the sum skip connection (followed by a convolutional layer) is formally a special case of the cat skip connection. However, sum skip connections can be advantageous because the weights and biases of the convolutional layers are more directly related to the output of the neural network, which helps to reduce the "vanishing gradient problem" (He et al., 2016).
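The difference between the two types of skip connection can be sketched in a few lines of NumPy; the function names and the toy inner network below are illustrative only, not part of the DINCAE code:

```python
import numpy as np

def cat_skip(f, X):
    """Concatenate f(X) with X along the channel (last) axis, as in U-Net."""
    return np.concatenate([f(X), X], axis=-1)

def sum_skip(f, X):
    """Add f(X) to X element-wise (residual style); shapes must match."""
    return f(X) + X

# Toy stand-in for the inner network f^(l): it preserves the input shape.
f = lambda X: 0.5 * X

X = np.ones((4, 4, 8))        # height x width x feature channels
print(cat_skip(f, X).shape)   # (4, 4, 16): twice as many channels
print(sum_skip(f, X).shape)   # (4, 4, 8): same number of channels
```

The doubled channel count of the cat variant is why a subsequent convolutional layer is needed to bring both variants back to the same number of output features.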

Refinement step
The whole neural network can be described as two functions providing, for every grid cell, the reconstruction ŷ and its expected error variance σ̂². The loss function is derived from the negative log-likelihood of a Gaussian with mean ŷ and variance σ̂²:

J(ŷ, σ̂²) = 1/(2N) Σ_ij [ (y_ij − ŷ_ij)² / σ̂²_ij + log σ̂²_ij ] + const.,

where the sum with the spatial indices i and j runs over all grid points with valid (i.e., non-masked) values y_ij and N is the number of valid values. The first term of the right-hand side of the equation is the mean square error, but scaled by the estimated error standard deviation; the second term penalizes any overestimation of the error standard deviation; and the third term is constant and can be neglected as it does not influence the gradient of the cost function. For a convolutional auto-encoder with refinement, the intermediate outputs ŷ₁ and σ̂₁ are concatenated with the inputs and passed through another auto-encoder with the same structure (except for the number of filters of the input layer, which has to accommodate the two additional fields corresponding to ŷ₁ and σ̂₁). The weights of the first and second auto-encoder are not related. The final cost function with refinement J_r is given by

J_r = α₁ J(ŷ₁, σ̂₁²) + α₂ J(ŷ₂, σ̂₂²),

where ŷ₂ and σ̂₂² are the reconstruction and its expected error variance produced by the second auto-encoder. The weights α₁ and α₂ control how much importance is given to the intermediate output (relative to the final output). With a refinement step, the neural network becomes essentially twice as deep and the number of parameters (approximately) doubles. The increased depth would make it prone to the vanishing gradient problem. However, by including the intermediate results in the cost function, this problem is reduced. In fact, information from the observations is injected during back-propagation by the loss function.
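As a minimal sketch (not the DINCAE implementation itself), the Gaussian negative log-likelihood loss over the valid grid points can be written as:

```python
import numpy as np

def gaussian_nll_loss(y, yhat, sigma2, mask):
    """Negative log-likelihood of a Gaussian with mean yhat and variance
    sigma2, averaged over the N valid (non-masked) grid points; the
    constant term is dropped as it has zero gradient."""
    resid = (y - yhat)[mask]
    s2 = sigma2[mask]
    n = resid.size
    # scaled mean square error + penalty on overestimated error variance
    return ((resid ** 2 / s2).sum() + np.log(s2).sum()) / (2 * n)
```

With a perfect reconstruction and unit error variance the loss is zero; overestimating sigma2 increases the second term, while underestimating it inflates the first.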
Due to the refinement step and the loss function, which also depends on the intermediate result, the information from the observation is injected at the last layer and at the middle layer of the combined neural network (Szegedy et al., 2015). The relationship between the first layers of the neural network and the cost function is therefore more direct, which helps in the training of these first layers.
The refinement step has been used in image in-painting for computer vision applications (Liu et al., 2019) and has also been applied to oceanographic data, namely tide gauge data (Zhang et al., 2020). In the present work, only one refinement step is tested, but the code supports an arbitrary number of sequential refinement steps.

Multivariate reconstructions
Auxiliary satellite data (with potentially missing data) can be provided for the reconstruction. The handling of missing data in these auxiliary data is identical to the way missing data are treated for the primary variable. For every auxiliary satellite dataset, the average over time is first removed. The auxiliary data (divided by their corresponding error variance) and the inverse of the error variance are provided as input. Where data are missing, the corresponding input values are set to zero, representing an infinitely large error (as a consequence of the chosen scaling). Multiple time instances centered around a target time can be provided as input.
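The scaling of an input variable can be sketched as follows; the function name is illustrative and missing data are assumed to be marked with NaN:

```python
import numpy as np

def scale_inputs(data, error_variance):
    """Return the two input channels used by the network: the data divided
    by its error variance and the inverse of the error variance.  Missing
    values (NaN) become zero in both channels, which corresponds to an
    infinitely large error."""
    missing = np.isnan(data)
    inv_var = np.where(missing, 0.0, 1.0 / error_variance)
    scaled = np.nan_to_num(data) * inv_var
    return scaled, inv_var
```

This convention lets the convolutional layers treat gaps uniformly: a zero inverse error variance simply removes the corresponding data point from the weighted input.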

Non-gridded input data
Current satellite altimetry missions measure the sea surface height along the ground track of the satellite. Satellite altimetry can measure through clouds, but the data are only available along a collection of tracks. In order to better handle such data sets, we extended DINCAE to handle unstructured data as input.
The first layer in DINCAE is a convolutional layer, which typically requires a field discretized on a rectangular grid. The convolutional layer can be seen as the discretized version of the following integral:

g(x, y) = ∫∫_{Ω_w} f(x', y') w(x − x', y − y') dx' dy',

where f is the input field, w are the weights of the convolution (also called the convolution kernel), Ω_w is the support of the function w (i.e., the domain where w is different from zero) and g is the output of the convolutional layer. To discretize the integral, the continuous function f is replaced by a sum of Dirac functions using the values f_ij defined on a regular grid:

f(x, y) = Σ_ij f_ij δ(x − x_i) δ(y − y_j).

In this case, the continuous convolution becomes the standard discrete convolution as used in neural networks. The weights w(x, y) only need to be known at the discrete locations defined by the underlying grid.
For data points which are not defined on a regular grid, we use essentially the same approach. The function f is again written as a sum of Dirac functions:

f(x, y) = Σ_k f_k δ(x − x_k) δ(y − y_k),

where now f_k are the values at the locations (x_k, y_k), which can be arbitrary. In order to evaluate the integral (Eq. 5), it is necessary to know the weights at the locations (x_k, y_k). The weights are still discretized on a regular grid, but are interpolated bilinearly to the data locations to evaluate the integral. In fact, instead of interpolating the weights w, one can also apply the adjoint of the linear interpolation to f_k (which is mathematically equivalent). This has the benefit that the computation of the convolution can be implemented using the optimized functions of the neural network library.
For data defined on a regular grid, it has been verified numerically that this proposed approach and the traditional approach used to compute the convolution give the same results.
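The adjoint of bilinear interpolation spreads each scattered value over the four surrounding grid nodes with the bilinear weights; a standard discrete convolution can then be applied to the resulting gridded field. A minimal sketch with illustrative names, assuming unit grid spacing:

```python
import numpy as np

def adjoint_bilinear(fk, xk, yk, nx, ny):
    """Spread scattered values f_k onto a regular (ny, nx) grid using the
    adjoint (transpose) of bilinear interpolation.  Grid spacing is 1 and
    all points are assumed to lie inside the grid."""
    g = np.zeros((ny, nx))
    for f, x, y in zip(fk, xk, yk):
        i, j = int(np.floor(x)), int(np.floor(y))
        a, b = x - i, y - j          # fractional offsets within the cell
        g[j,     i    ] += f * (1 - a) * (1 - b)
        g[j,     i + 1] += f * a * (1 - b)
        g[j + 1, i    ] += f * (1 - a) * b
        g[j + 1, i + 1] += f * a * b
    return g

# A value located exactly on a grid node contributes only to that node,
# consistent with the on-grid verification described in the text.
g = adjoint_bilinear([2.0], [3.0], [1.0], nx=6, ny=4)
print(g[1, 3])   # 2.0
```

Because the total weight of the four contributions is one, the operation conserves the integral of the field, which is what makes it the exact adjoint of bilinear interpolation.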

Data
The improvements are examined in two test cases. For multivariate gridded data, the approach is tested with sea surface temperature, chlorophyll and winds in the Adriatic Sea; for non-gridded data, altimetry observations of the whole Mediterranean Sea were used. As the altimetry observations do not resolve as many small scales as sea surface temperature, a larger domain was chosen for the altimetry test case.

Gridded data (Adriatic Sea)
As the previous application (Barth et al., 2020) considered relatively low-resolution AVHRR data, we used more modern, higher-resolution satellite data for this application in the Adriatic Sea. The datasets used include the following. The data sets span the period from 1 January 2003 to 31 December 2016. They are all interpolated (using bilinear interpolation) on the common grid defined by the SST fields.
As ocean mixing reacts to the averaged effect of the wind speed (the norm of the wind vector), we also smoothed the wind speed with a Laplacian filter using a time period of 2.2 d and a lag of 4 d (wind speed preceding SST). The optimal lag and time period were obtained by maximizing the correlation between the smoothed wind field and the SST from the training data.
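The exact filter and search procedure are not spelled out here, so the following is only a plausible sketch under stated assumptions: the wind speed is smoothed with an exponential (Laplacian-like) kernel, lagged, and the kernel time scale and lag are chosen by grid search to maximize the correlation with SST. All names and the kernel shape are assumptions:

```python
import numpy as np

def smooth_and_lag(wind, period, lag):
    """Smooth a daily wind-speed series with an exponential kernel of time
    scale `period` (days) and shift it `lag` days forward, so that the
    wind precedes the SST."""
    half = 3 * int(np.ceil(period))
    t = np.arange(-half, half + 1)
    kernel = np.exp(-np.abs(t) / period)
    kernel /= kernel.sum()
    smoothed = np.convolve(wind, kernel, mode="same")
    return np.roll(smoothed, lag)

def best_period_and_lag(wind, sst, periods, lags):
    """Grid search maximizing the correlation with the SST series."""
    corr = lambda p, l: np.corrcoef(smooth_and_lag(wind, p, l), sst)[0, 1]
    return max(((p, l) for p in periods for l in lags),
               key=lambda pl: corr(*pl))
```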

Non-gridded data (Mediterranean Sea)
Altimetry data from 1 January 1993 to 13 May 2019, covering 7° W to 37° E and from 30 to 46° N, from 22 satellite missions operating during this time frame are used. This domain essentially contains the Mediterranean Sea but also a small part of the Bay of Biscay and the Black Sea. In preliminary studies we found that including data from adjacent seas can help neural networks generalize better and prevent overfitting (e.g., Gong et al., 2019), as the neural network is confronted with a more diverse set of conditions. The data (SEALEVEL_EUR_PHY_L3_MY_OBSERVATIONS_008_061, accessed on 13 October 2020) are made available by the Copernicus Marine Environment Monitoring Service (CMEMS).
To reduce the correlation between the different datasets, satellite tracks are not split and belong entirely to one of the three datasets (training, development or test).
Some experiments reconstructing the altimetry use gridded sea surface temperature satellite observations as an auxiliary dataset for multivariate reconstruction. We use the AVHRR_OI-NCEI-L4-GLOB-v2.0 dataset (Reynolds et al., 2007; National Centers for Environmental Information, 2016) because it is a single consistent dataset covering the full time period of the altimetry data and because it approximately matches the altimetry dataset in terms of resolved spatial scales.

Implementation
The Python code was first ported from TensorFlow 1.12 to 1.15, reducing the training time from 4.5 to 3.5 h using a GeForce GTX 1080 GPU and an Intel Core i7-7700 CPU. We also considered porting DINCAE to TensorFlow 2; however, the TensorFlow 2 programming interface is quite different from previous versions. As our group gained familiarity with the Julia programming language (Bezanson et al., 2017), we decided to rewrite DINCAE in Julia. Porting DINCAE to Julia with the package Knet (Yuret, 2016) cut the runtime down from 3.5 to 1.9 h (thanks to more efficient data transformation) using the AVHRR dataset described in Barth et al. (2020) and a concatenation skip connection in both cases.
For the Adriatic test case, the input is a 3D array with the dimensions corresponding to the longitude, latitude and the different parameters. The input parameters for a univariate reconstruction are three time instances of temperature scaled by the inverse of the error variance (previous, current and next day), the corresponding inverse of the error variance, the longitude and latitude of every grid cell, and the sine and cosine of the day of the year multiplied by 2π/365.25 as in Barth et al. (2020). The assembled array has a size of 168 × 144 × 10 elements (for a single training sample). The input is processed by the encoder, which is composed of five convolutional layers (with 16, 30, 58, 110 and 209 output filters) with a kernel size of 3 × 3 and a rectified linear unit (RELU) activation function, each followed by a max pooling layer with a size of 2 × 2. The RELU activation function is commonly used in neural networks and is defined by

RELU(x) = max(0, x).

The output of the encoder is transformed back to a full image by the decoder, which mirrors the structure of the encoder. The decoder is composed of five upsampling layers (nearest-neighbor or bilinear interpolation), each followed by a convolutional layer with the equivalent number of output filters from the encoder (except for the final layer, which has only two outputs related to the reconstruction and its error variance). The final layer produces a 3D array T^out_ijk (size 168 × 144 × 2) from which the reconstruction ŷ_ij and its error variance σ̂²_ij are computed by

σ̂²_ij = 1 / max(exp(min(T^out_ij2, γ)), μ),
ŷ_ij = T^out_ij1 σ̂²_ij,

where the parameters γ and μ are set to 10 and 10⁻³, respectively. The minimum and maximum functions (min and max) are introduced here to prevent division by a value close to zero or the exponentiation of a too-large value. This stabilizes the neural network during the initial phase of the training as the weights and biases are randomly initialized. In Barth et al. (2020), after the convolutional layers, the model included two fully connected layers (with drop-out). This is no longer the case, as such layers require that the input matrix for training has exactly the same size as the input matrix of the reconstruction (inference), which makes this architecture impractical for large input arrays (which would arise, for example, for global or basin-wide sea surface temperature fields). The benefit of replacing dense layers by convolutional layers is further discussed in Long et al. (2015).

The altimetry data were analyzed on a 0.25° resolution grid covering the area from 7° W to 37° E and 30 to 46° N. For this neural network implementation, the input size is 177 × 69. The resolution is progressively reduced to 89 × 35, 45 × 18 and 23 × 9 by convolutional layers followed by max pooling layers with 32, 64 and 96 convolutional filters, respectively. Skip connections are implemented from the second convolutional layer onwards.
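One consistent reading of the min/max clipping of the final-layer output described above can be sketched as follows (illustrative function name; γ = 10 and μ = 10⁻³ as in the text):

```python
import numpy as np

GAMMA, MU = 10.0, 1e-3

def split_output(T_out):
    """Convert the two channels of the final layer into the reconstruction
    yhat and its expected error variance sigma2.  min caps the exponent
    (avoiding overflow), max keeps the denominator away from zero."""
    T1, T2 = T_out[..., 0], T_out[..., 1]
    sigma2 = 1.0 / np.maximum(np.exp(np.minimum(T2, GAMMA)), MU)
    yhat = T1 * sigma2        # the network predicts yhat scaled by 1/sigma2
    return yhat, sigma2
```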
During training, Gaussian noise with a standard deviation σ_pos is added to the positions of the measurements, and every track of the current date has a certain probability p_drop of being withheld from the input of the neural network. The loss function is computed only on tracks from the current date. This helps the neural network learn how to spread the information spatially. The neural network is optimized during 1000 epochs and the intermediate results are saved every 25 epochs. The altimetry test case illustrates the results for a non-gridded dataset. Sea surface altimetry is usually gridded with a method like optimal interpolation or variational analysis; the latter can also be seen as a special case of optimal interpolation. For the autoencoder, the following fields are used as inputs: the longitude and latitude of the measurement; the sine and cosine of the day of the year of the measurement multiplied by 2π/T_y, where T_y is 365.25 (the length of a year in days); and all data within a centered time window of length t_win. For instance, if the time window t_win is 9 d, the data from the current day are used as well as the tracks from the 4 previous days and the 4 following days.
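The two data-augmentation steps (position noise and track withholding) can be sketched as follows; the track representation and function name are illustrative assumptions:

```python
import numpy as np

def augment_tracks(tracks, sigma_pos, p_drop, rng):
    """Perturb track positions with Gaussian noise of standard deviation
    sigma_pos and withhold each track with probability p_drop from the
    network input.  Each track is a dict of 'x', 'y' and 'sla' arrays."""
    kept = []
    for track in tracks:
        if rng.random() < p_drop:
            continue                  # whole track withheld from the input
        kept.append({
            "x": track["x"] + rng.normal(0.0, sigma_pos, track["x"].shape),
            "y": track["y"] + rng.normal(0.0, sigma_pos, track["y"].shape),
            "sla": track["sla"],      # the observed values are unchanged
        })
    return kept
```

Since the loss is still computed on all tracks of the current date, withheld tracks force the network to predict sea level at locations without input data.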
As in Barth et al. (2020), instead of using the observations directly, the observations are divided by their respective error variance, and the inverse of the error variance is used as input. Due to this scaling, missing data result in a zero input value, as it corresponds to a data point with an "infinitely" large error.
The training is done using mini-batches of size n_batch. The weights and biases in the neural network are updated using the gradient of the loss function evaluated at a batch of n_batch time instances (chosen at random). Evaluating the gradient using a random subset of the training data introduces some stochastic fluctuation, allowing the optimization procedure to escape local minima.
All numerical experiments used the Adam optimizer (Kingma and Ba, 2014) with the standard parameters for the exponential decay rate of the first moment, β₁ = 0.9, and of the second moment, β₂ = 0.999, and a regularization parameter ε = 10⁻⁸. The learning rate α_n is computed for every n-th epoch as follows:

α_n = α₀ 2^(−γ_decay n),

where α₀ is the initial learning rate and γ_decay controls the exponential decay of the learning rate: every 1/γ_decay epochs, the learning rate is halved. If γ_decay is zero, then the learning rate is kept constant.
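The schedule described above (halving every 1/γ_decay epochs) amounts to the following one-liner; the function name is illustrative:

```python
def learning_rate(n, alpha0, gamma_decay):
    """Learning rate at epoch n: halved every 1/gamma_decay epochs,
    constant when gamma_decay is zero."""
    return alpha0 * 2.0 ** (-gamma_decay * n)
```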
The batch size includes 32 time instances (all hyperparameters are determined via Bayesian optimization, as described further on). The learning rate for the Adam optimizer is 0.00058. The L2 regularization on the weights has been set to a β value of 10⁻⁴. The upsampling method can be either nearest-neighbor or bilinear interpolation; in our tests, nearest-neighbor interpolation provided the lowest RMS error relative to the development dataset. The absolute value of the gradients is clipped to 5 in order to stabilize the training. Satellite tracks from a time window of t_win = 27 d (centered around the target time) are used to derive the gridded altimetry data.
The hyper-parameters of the neural network mentioned previously have been determined by Bayesian optimization (Mockus et al., 1978; Jones et al., 1998; Snoek et al., 2012), minimizing the RMS error relative to the development dataset. The expected improvement acquisition function (Mockus et al., 1978) is used to determine the sequence of parameter values to be evaluated. The Bayesian optimization as implemented by the Python package scikit-optimize (Head et al., 2020) was used in these tests.

Gridded data
The new type of skip connection was first tested with the AVHRR test case from the Ligurian Sea (Barth et al., 2020). The previous best result (in terms of RMS error) was 0.3835 °C using the cat skip connection. With the new approach, this RMS error is reduced to 0.3604 °C. The new type of skip connection makes the neural network more similar to residual networks, which have been shown to be highly successful for image recognition tasks and allow deep networks to be trained more easily (He et al., 2016).
In Table 2 we present the test case for the Adriatic Sea with and without the skip connections and using the multivariate reconstruction. Besides the RMS error, this table also includes the 10 % and 90 % percentiles of the absolute value of the difference between the reconstructed data and the cross-validation data to provide a typical range of the error. In general, using chlorophyll a (or the wind fields) together with SST improved the results only marginally. The improvements were more consistent when using the sum skip connection instead of the cat skip connection (in particular in conjunction with the additional refinement step). The analysis in the following uses the result with the lowest RMS error, namely the reconstruction with sum skip connections and refinement and with all considered variables (chlorophyll a, the wind speed, the zonal wind component and the meridional wind component) as auxiliary parameters.
When reconstructing sea surface temperature time series, it is often the case that for some days only very few data points are available. Figure 3 illustrates such a case, where a quite clear image was almost entirely masked as missing. DINCAE essentially produces an average SST (still using previous and next time frames) for the time of the year and a realistic spatial distribution of the expected error. The few pixels that are available have a relatively low error as expected, but the overall error structure looks quite realistic as the expected error increases significantly in the coastal areas (where the variability is higher and where the original satellite data are expected to be noisier). The reconstructed image matches the original image large-scale patterns relatively well, but as expected some small-scale structures are not reconstructed by the neural network. For this particular image, the RMS error (between the reconstructed data and the withheld data for cross-validation) is 0.43 °C and the 10 % and 90 % percentiles of the absolute value of the error are 0.05 and 0.63 °C, respectively.
The detection of cloud pixels in the MODIS dataset is generally good, but Fig. 4 (for 17 September 2003) shows an example where some pixels were characterized as valid sea points while they are probably (at least partially) obscured by clouds, resulting in an unrealistically low sea surface temperature. For most analysis techniques derived from optimal interpolation, outliers like undetected clouds typically degrade the analyzed field in the vicinity of the spurious observations. The outlier also produced an artifact in the output of DINCAE, but it is interesting to note that in this case, the artifact did not spread spatially, and the associated expected error has some elevated values indicating a potential issue at this location. The RMS error for this time instance is 0.45 °C, similar to the previously shown image. The typical range of the absolute value of the error is 0.04–0.72 °C (10 % and 90 % percentiles).
A problem with techniques like optimal interpolation, variational analysis and to some degree also DINEOF is that the reconstruction smoothes out some small-scale features present in the initial data. For optimal interpolation and variational analysis, this smoothing is explicitly induced by using a specific correlation length. In EOF-based methods, it is related to the truncation of the EOF series. In DINCAE, the input data are also compressed by a series of convolution and max pooling layers, and some smoothing is also expected, as in Fig. 4.

Table 2. RMS errors (in °C) relative to the test dataset for different configurations (chlor_a: chlorophyll a, wind_speed: the wind speed, uwnd: zonal wind component, vwnd: meridional wind component). The two numbers in parentheses correspond to the 10 % and 90 % percentiles of the absolute value of the error.

Figure 5 shows an example where the initial data have almost no clouds and only a few clouds are added for validation. The reconstructed field retains the filament and other small-scale structures. For this image, the RMS error (between the reconstructed data and the withheld data for cross-validation) is 0.55 °C and its typical range (as defined earlier) is 0.05 to 0.77 °C. The degree of smoothing can be quantified by the RMS difference between the reconstructed data and the input data (which is not an independent validation metric). This RMS difference is 0.15 °C, which is relatively low given the typical variability of sea surface temperature in this region.
To assess the improvement spatially, the mean square skill score S is computed (Fig. 6) for every grid cell:

S = 1 − MSE / MSE_ref,
where MSE_ref is the mean square error of the monovariate reconstruction corresponding to DINCAE 1.0 relative to the validation dataset (per grid cell, averaging over time) and MSE is the mean square error of the multivariate case (considering all variables) with an additional refinement step. The improvement is spatially quite consistent. The mean square error is mostly reduced in the northern and central parts of the Adriatic; only in some isolated grid cells is a degradation observed. The skill score reflects the combined improvement due to the three changes implemented in this version: the updated skip connections, the refinement step and the multivariate reconstruction.
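Computed per grid cell over the time dimension, the skill score can be sketched as follows (illustrative names; NaNs mark missing validation data):

```python
import numpy as np

def skill_score(y_true, y_rec, y_ref):
    """Mean square skill score per grid cell (time on the first axis):
    1 for a perfect reconstruction, 0 when no better than the reference
    and negative when worse."""
    mse = np.nanmean((y_rec - y_true) ** 2, axis=0)
    mse_ref = np.nanmean((y_ref - y_true) ** 2, axis=0)
    return 1.0 - mse / mse_ref
```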

Non-gridded data
The altimetry data are first gridded using the tool DIVAnd. The main parameters here are the spatial correlation length (in km), the temporal correlation scale (in days), the error variance of the observations (normalized by the background error variance) and the duration of the time window t_win determining which observations are used to compute the reconstruction at the center of the time window. All parameters of DIVAnd are also optimized by Bayesian optimization with the expected improvement acquisition function, minimizing the RMS error relative to the development dataset.
The best DIVAnd result is obtained with a horizontal correlation length of 74.8 km, a temporal correlation length of 5.5 d, a time window of 13 d and a normalized error variance of the observations of 20.5. An example reconstruction for the date 7 June 2017 is illustrated in Fig. 7. The parameters are determined by Bayesian optimization, minimizing the error relative to the development dataset. The RMS error of the analysis for these parameters is 3.61 cm relative to the independent test dataset (Fig. 9).
The best performing neural network had an RMS error of 3.58 cm, which is only slightly better than the result of DIVAnd (3.60 cm). When using the Mediterranean sea surface temperature as a co-variable, we obtained an RMS error relative to the test dataset of 3.47 cm, resulting in a clearer advantage of the neural network approach. The left panels of Figs. 9 and 10 show on the x and y axes the observed altimetry (withheld during the analysis) and the corresponding reconstructed altimetry, respectively. The range of altimetry values from −20 to 30 cm was divided into 51 bins of 1 cm. The colors indicate the number of data points within each bin. For both reconstruction methods, the results scatter around the dashed line, which corresponds to the case where the reconstructed data match the observed altimetry exactly. The eddy field of the DINCAE dataset is also quite similar to the one obtained from DIVAnd, but the anomalies of the structures are more pronounced in the DINCAE reconstruction (Fig. 8).
DINCAE and DIVAnd both provide a field with the estimated expected error. For DIVAnd we used the "clever poor man's method" described in Beckers et al. (2014), and the background error variance is estimated by fitting an empirical covariance based on random pairs of points binned by their distance (Thiebaux, 1986; Troupin et al., 2012). The estimated error variance is later adjusted by a factor to account for uncertainties in estimating the background error variance.
We defined 10 categories of pixels based on the expected error standard deviation, with category bounds evenly spaced between the 10th and 90th percentiles of the expected error standard deviation. For every category, we computed the actual RMS error relative to the test dataset. Ideally, this should correspond to the estimated expected error of the reconstruction (including the observational error). A global adjustment factor is also applied so that the average RMS error matches the mean expected error standard deviation; the result is shown in the right panels of Figs. 9 and 10. This adjustment factor is also applied to Fig. 7b. The main advantage of DINCAE relative to DIVAnd is the improved estimate of the error variance of the results.
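The calibration check described above can be sketched in a few lines of Python (function and argument names are ours, chosen for illustration): pixels are binned by their expected error standard deviation, and the actual RMS of the residuals is computed per bin; a well-calibrated error estimate gives actual RMS values close to the mean expected standard deviation in every bin.

```python
import numpy as np

def error_calibration(residual, expected_std, ncat=10):
    """Group pixels into `ncat` categories of expected error standard
    deviation (category bounds evenly spaced between its 10th and 90th
    percentiles). Return, per category, the mean expected standard
    deviation and the actual RMS of the residuals (reconstruction minus
    withheld observation), plus a global factor scaling the expected error
    so that its mean matches the overall RMS error."""
    lo, hi = np.percentile(expected_std, [10, 90])
    edges = np.linspace(lo, hi, ncat + 1)
    # values outside the 10th-90th percentile range fall in the outer bins
    idx = np.clip(np.digitize(expected_std, edges) - 1, 0, ncat - 1)
    exp_mean = np.array([expected_std[idx == k].mean() for k in range(ncat)])
    actual_rms = np.array([np.sqrt(np.mean(residual[idx == k] ** 2))
                           for k in range(ncat)])
    factor = np.sqrt(np.mean(residual ** 2)) / expected_std.mean()
    return exp_mean, actual_rms, factor
```

Plotting `actual_rms` against `exp_mean` gives the reliability curves of the right panels of Figs. 9 and 10; points on the one-to-one line indicate a well-calibrated error estimate.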
In summary, the accuracy of the DINCAE reconstruction is slightly better than the accuracy of the DIVAnd analysis. However, the main improvement of the DINCAE approach here is that the expected error variance of the analysis is much more reliable than the expected error variance of DIVAnd. Figure 11 shows the standard deviation of the sea-level anomaly computed over the whole time period. From this figure, three areas in particular stand out, corresponding (from west to east) to the East Alboran Gyre, regions of the Algerian Current and the Ierapetra Anticyclone (annotated with black rectangles in Fig. 11). The maximum standard deviation (related to the surface transport variability) for these three areas is given in Table 3. The standard deviation is also computed from the satellite altimetry data, considering all satellite observations falling within a given grid cell (excluding coastal grid cells with fewer than 10 observations). The standard deviation of the DINCAE reconstruction is higher than that of DIVAnd in all three regions, despite the DINCAE reconstruction having a lower RMS error than DIVAnd. In addition, the standard deviation of DINCAE is in general closer to the observed standard deviation.
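The per-grid-cell standard deviation of the along-track observations, with sparsely observed cells excluded, can be computed as sketched below (illustrative Python; the function and argument names are ours, not from the paper's code).

```python
import numpy as np

def gridded_std(lon, lat, value, lon_edges, lat_edges, min_obs=10):
    """Standard deviation of along-track observations per grid cell.
    Cells with fewer than `min_obs` observations are set to NaN."""
    i = np.digitize(lon, lon_edges) - 1     # zero-based cell column index
    j = np.digitize(lat, lat_edges) - 1     # zero-based cell row index
    nlon, nlat = len(lon_edges) - 1, len(lat_edges) - 1
    out = np.full((nlat, nlon), np.nan)
    for jj in range(nlat):
        for ii in range(nlon):
            sel = value[(i == ii) & (j == jj)]
            if sel.size >= min_obs:
                out[jj, ii] = sel.std()
    return out
```

The resulting field plays the role of the observed standard deviation against which the DINCAE and DIVAnd reconstructions are compared in Table 3.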

Conclusions
In this paper, we discussed improvements to the previously described DINCAE method. The code has been extended to handle multivariate reconstructions, an approach also described in Han et al. (2020). We found that multivariate input can improve the reconstruction, but the largest improvement was obtained by changing the structure of the neural network, using a newly implemented type of skip connection and a refinement pass. Interestingly, this structure bears some similarities to the hierarchical multigrid method for solving partial differential equations.
The handling of different types of satellite data was also improved. While most ocean satellite observations are gridded data (like sea surface temperature, ocean color, sea surface salinity and sea ice concentration), some parameters can only be inferred by nadir-looking satellites along tracks. For such non-gridded datasets, the first input layer was extended to accept data at arbitrary locations. We were able to show that the DINCAE method applied to altimetry data produces better reconstructions, but its main advantage is the significantly improved estimate of the expected reconstruction error variance. In this case, the DINCAE method compares favorably to the DIVAnd method (which is similar to optimal interpolation) in terms of the reliability of the expected error variance, the accuracy of the reconstruction relative to the test dataset and the realism of the temporal standard deviation of the reconstruction assessed against the standard deviation of the observations.
Author contributions. AB designed and implemented the neural network. AB, AAA, CT and JMB contributed to the planning and discussions and to the writing of the manuscript.
Competing interests. The contact author has declared that neither they nor their co-authors have any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.