DINCAE 1.0: a convolutional neural network with error estimates to reconstruct sea surface temperature satellite observations


DINEOF (Data Interpolating Empirical Orthogonal Functions, Beckers and Rixen, 2003; Alvera-Azcárate et al., 2005) is an iterative method to reconstruct missing observations and reduce noise in satellite datasets using empirical orthogonal functions (EOFs). A truncated EOF decomposition using the leading EOFs is performed and the initially missing data are reconstructed using this EOF decomposition. The EOF decomposition and reconstruction are repeated until convergence. DINEOF has been applied to several oceanographic variables at different spatial resolutions (e.g. Alvera-Azcárate et al. (2005) for SST, Alvera-Azcárate et al. (2007) for ocean colour, Alvera-Azcárate et al. (2016) for sea surface salinity), providing accurate reconstructions. A truncated EOF decomposition focuses primarily on spatial structures with a "strong" signature (or, more formally, with a significant L2 norm compared to the total variance). Small-scale structures can be included in a truncated EOF decomposition as long as their related variance is large enough to be present in the retained EOF modes. But small-scale structures tend to be transient (short-lived) and are therefore often not retained in the dominant EOF modes. It should be noted that there is no explicit spatial filtering scale in DINEOF removing small scales (unlike other methods such as optimal interpolation, kriging or spline interpolation). In practice, however, a similar smoothing effect is observed because of the EOF truncation (which is necessary in the presence of clouds).
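To make the iteration concrete, the core DINEOF idea (fill the gaps, compute a truncated SVD, re-fill the gaps with the truncated reconstruction, repeat) can be sketched as follows. This is an illustrative NumPy sketch under simplifying assumptions, not the actual DINEOF code: the real method also removes the mean, selects the optimal number of modes by cross-validation and applies a temporal filter. All names are ours.

```python
import numpy as np

def dineof_like_fill(X, mask, nmodes=2, niter=50):
    """Fill missing values (mask == True) of a data matrix X
    (time x space) by iterating a truncated SVD reconstruction.
    Observed values are re-imposed at every iteration."""
    Xf = np.where(mask, 0.0, X)              # initial guess: zero anomaly
    for _ in range(niter):
        U, s, Vt = np.linalg.svd(Xf, full_matrices=False)
        Xrec = (U[:, :nmodes] * s[:nmodes]) @ Vt[:nmodes]
        Xf = np.where(mask, Xrec, X)         # keep observed values as-is
    return Xf
```

On a low-rank field with a modest cloud fraction this simple loop already recovers the masked values accurately, which illustrates why the truncation level (number of retained modes) is the key parameter of the method.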
Neural networks are mathematical models that can efficiently extract nonlinear relationships from a mapping problem (i.e. an input/output relationship that can be determined through a mathematical function). Neural networks are therefore especially well suited to learn the nonlinear, stochastic features measured at the sea surface by satellite sensors, and their use might prove efficient in retaining these structures when analysing satellite data, for example when reconstructing missing data.
Neural networks can be composed of a wide variety of building blocks, such as fully connected layers (Rosenblatt, 1958; Widrow and Hoff, 1962), recurrent networks (e.g. Long Short-Term Memory (Hochreiter and Schmidhuber, 1997), gated recurrent units (Cho et al., 2014)) and convolutional layers (LeCun et al., 1998; Krizhevsky et al., 2012). Recurrent networks typically work with a one-dimensional list of inputs of variable length (such as a text sentence). Fully connected layers and convolutional layers require a complete dataset without missing data, at least for the training phase. For a review of neural networks the reader is referred to Schmidhuber (2015) and references therein. As neural networks are typically applied to a large and complete dataset (i.e. with no or almost no gaps) as input, a solution needs to be found to handle a large number of missing data.
The use of neural networks in the frame of Earth observation has been increasing recently. Garcia-Gorriz and Garcia-Sanchez (2007), for example, used meteorological variables like wind and air temperature (among others) to infer SST, with the aim of reproducing the annual and interannual variability of SST during the pre-satellite era. Patil and Deo (2017) used a wavelet neural network to predict SST at various locations in the Indian Ocean, which allowed them to focus on daily variations of SST. Pisoni et al. (2008) resorted to past instances and averaging to overcome gaps in SST, which results in smooth reconstructions. Krasnopolsky et al. (2016) used neural networks to infer ocean colour in the complete absence of these data (i.e. emulating a sensor failure). The neural network by Krasnopolsky et al. (2016) uses as input satellite sea surface elevation, sea surface salinity, sea surface temperature and in situ Argo salinity and temperature vertical profiles with some auxiliary information (like longitude, latitude and time) to estimate the chlorophyll-a concentration. The network does not use the measured chlorophyll-a concentration at a given location as input during inference (the reconstruction phase), nor the information from nearby grid points; the network is exposed to the chlorophyll-a measurements only during the training phase. Jo et al. (2018) infer ocean colour from related data (SST and wind among others), taking advantage of the close relation between different ocean variables, but at a lower spatial resolution. Renosh et al. (2017) produced a suspended particulate matter dataset from model and in situ data using Self-Organizing Maps, which was compared to satellite data. Chapman and Charantonis (2017) used surface satellite data to infer subsurface ocean currents, also using Self-Organizing Maps. Also using Self-Organizing Maps, Jouini et al. (2013) reconstructed missing data in chlorophyll-a datasets using the relation between this variable and ocean currents (proxied by SST and sea surface height).

The objective of this manuscript is to present a neural network in the form of a convolutional auto-encoder which can be trained on gappy satellite observations, in order to reconstruct missing observations and also to provide an error estimate of the reconstruction. This neural network is referred to in the following as DINCAE (Data-Interpolating Convolutional Auto-Encoder). An auto-encoder is a particular type of network which can compress and decompress the information in an input dataset (Hinton and Salakhutdinov, 2006), effectively reducing the dimensionality of the input data. Projecting the input data onto a low-dimensional subspace is also the central idea of DINEOF, where it is achieved by an EOF decomposition.
In section 2, the SST dataset used in this study is presented. This dataset is the input of the neural network described in section 3. This section includes the general structure of the network, the activation functions, the skip connections, the cost function and its optimization. The SST dataset is also reconstructed with DINEOF (section 4). The results are validated by cross-validation and by a comparison to the World Ocean Database 2018 in section 5. Finally, the conclusions are presented in section 6.

Data availability
For this study we used the longest available time series from the Advanced Very High Resolution Radiometer (AVHRR) dataset (Kilpatrick et al., 2001). Only SST data with quality flags of 4 or higher are retained (Evans et al., 2009). A single image is composed of 112 x 112 grid points. If a given pixel has measurements for less than 5% of the time, it is not reconstructed and is considered a land point in the following. In total, 27% of the grid points correspond to land. Images with at least 20% of valid sea points are retained for the reconstruction, which corresponds to a total of 5266 time instances.
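The two retention criteria above (a pixel is kept as sea if it is observed at least 5% of the time; an image is kept if at least 20% of its sea pixels are valid) can be expressed as a short sketch. This is an illustrative helper, not code from the paper; the function and parameter names are ours, and clouds/land are assumed to be encoded as NaN.

```python
import numpy as np

def select_pixels_and_images(data, min_pixel_frac=0.05, min_image_frac=0.20):
    """data: array (time, lat, lon) with NaN for clouds/land.
    Returns a sea mask (pixels observed often enough) and a boolean
    vector of images with enough valid sea pixels."""
    present = np.isfinite(data)
    sea = present.mean(axis=0) >= min_pixel_frac       # per-pixel coverage
    valid_frac = present[:, sea].mean(axis=1)          # per-image coverage over sea
    keep = valid_frac >= min_image_frac
    return sea, keep
```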
To assess the accuracy of the reconstruction method, cross-validation is used (e.g. Wilks, 1995). For cross-validation, a subset of the data is withheld from the analysis and the final reconstruction is compared to the withheld dataset to assess its accuracy. Since clouds have a spatial extent, we wanted to withhold data with a similar spatial structure. In the last 50 images we therefore removed data according to the cloud masks of the first 50 images of the SST time series. The last 50 images represent the data from 2009-09-25 to 2009-12-27 (since some scenes with too few data have been dropped as mentioned before). These data are not used at all during either the training or the reconstruction phases, and can therefore be considered independent. In total, 106 816 measurements (i.e. individual pixels) have been withheld this way.
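The cross-validation setup described above (recycling the cloud masks of the first 50 images to withhold additional data in the last 50 images) can be sketched as follows. This is an illustrative helper under our own naming, assuming clouds are encoded as NaN.

```python
import numpy as np

def add_crossval_clouds(data, n=50):
    """Apply the cloud masks of the first n images to the last n images.
    Returns a copy with the withheld points set to NaN, plus the
    boolean mask of the withheld (cross-validation) points."""
    withheld = np.isfinite(data[-n:]) & ~np.isfinite(data[:n])
    masked = data.copy()
    masked[-n:][withheld] = np.nan
    return masked, withheld
```

The `withheld` mask identifies points that were actually observed in the last images but are hidden by the recycled clouds; only these points can later be used for validation.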

Initially, the average cloud coverage of the dataset is 46% (over all 25 years). The cloud coverage of the last 50 scenes increases to 77% when the cross-validation points are excluded. A significant part of the scene is obscured after marking the data for cross-validation, but in the Mediterranean Sea the cloud coverage is relatively low compared to the globally averaged cloud coverage of 75% (Wylie et al., 2005). Removing some data for cross-validation thus brings the cloud coverage closer to the global average.

Neural network with missing data as input
Convolutional and other deep neural networks are extensively used in computer vision and find an increasing number of applications in Earth sciences (Rasp et al., 2018; Bolton and Zanna, 2019; Zhou et al., 2016; Geng et al., 2015) where full datasets are available, at least for training a network. However, when using satellite data, the number of images without any clouds is very small and it is difficult to provide enough training data when only clear images are used. Therefore the aim is to train the network directly on incomplete satellite images. The handling of missing data is done in analogy to data assimilation in numerical ocean models. The standard optimal interpolation equations (e.g. Bretherton et al., 1976; Buongiorno Nardelli, 2012) can be written as follows:

x^a = x^f + P^f H^T (H P^f H^T + R)^{-1} (y^o - H x^f)
P^a = P^f - P^f H^T (H P^f H^T + R)^{-1} H P^f

where x^f is the model forecast with error covariance P^f, y^o are the observations with error covariance R, and H is the observation operator extracting the observed part of the state vector x^f. The analysis x^a is the combined estimate with the error covariance matrix P^a. We use these equations as an analogy to propose an approach to handle missing data (or data with errors varying in space and/or time). The main input datasets of the CAE are (i) the SST divided by its error variance (corresponding to R^{-1} y^o) and (ii) the inverse of the error variance (corresponding to the diagonal elements of R^{-1}, assuming spatially uncorrelated errors). If a data point is missing, the corresponding error variance is considered infinitely large and the value at this point is zero for both input datasets. The main difference is that in optimal interpolation the observation vector y^o is multiplied by the inverse of the error covariance (possibly including non-diagonal elements), while in the present case we use only the error variance. The structure of the neural network is used to spatially propagate the information from the observations.
The time average has been removed from the SST dataset (computed over all years but excluding the cross-validation dataset). The neural network thus works with anomalies relative to this mean SST. To obtain reasonable results, the network uses more inputs than merely the SST divided by its error variance and the inverse of the error variance. The complete list of input parameters is the following:

- SST anomalies scaled by the inverse of the error variance (the scaled anomaly is zero if the data is missing)
- Inverse of the error variance (zero if the data is missing)
- Scaled SST anomalies and inverse of the error variance of the previous day
- Scaled SST anomalies and inverse of the error variance of the next day
- Longitude (scaled linearly between -1 and 1)
- Latitude (scaled linearly between -1 and 1)
- Cosine of the day of the year divided by 365.25
- Sine of the day of the year divided by 365.25

The complete dataset is thus represented by an array of size 10 x 112 x 112 x 5266 (number of inputs, number of grid points in the zonal direction, number of grid points in the meridional direction, number of time instances). The inverse of the error variance is either zero (for missing data) or a constant. The precise value of this constant is not important because it is multiplied by a weight matrix that is optimized when training the network. In future studies, it would be interesting to use the sensor-specific error statistics provided with GHRSST products, i.e. spatially and temporally varying error estimates. Using the previous and next day as inputs and the information on the season (the last two inputs) allows for temporal coherency of the results. It should be noted that DINEOF does not use the day of the year of each satellite image, but it uses a temporal filter which increases the temporal coherence of the reconstruction (Alvera-Azcárate et al., 2009).
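As an illustrative sketch, the ten input channels for one date could be assembled as below. This is not code from the paper: the function and variable names are ours, and the inclusion of the 2π factor in the seasonal (cosine/sine) channels is our assumption about the intended annual periodicity.

```python
import numpy as np

def make_inputs(anom, invvar, lon, lat, dayofyear, t):
    """Build the 10 input channels for time index t.
    anom, invvar: (time, lat, lon) arrays; missing points already
    satisfy anom*invvar == 0 and invvar == 0. lon/lat: 1-D coordinates."""
    scale = lambda v: 2 * (v - v.min()) / (v.max() - v.min()) - 1
    LON, LAT = np.meshgrid(scale(lon), scale(lat))
    angle = 2 * np.pi * dayofyear[t] / 365.25   # assumed seasonal encoding
    channels = [
        anom[t] * invvar[t],     invvar[t],       # current day
        anom[t-1] * invvar[t-1], invvar[t-1],     # previous day
        anom[t+1] * invvar[t+1], invvar[t+1],     # next day
        LON, LAT,
        np.full_like(LON, np.cos(angle)),
        np.full_like(LON, np.sin(angle)),
    ]
    return np.stack(channels)
```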
The final layer of the neural network produces the following output:

- SST scaled by the inverse of the expected error variance
- Logarithm of the inverse of the expected error variance

The overall structure of the neural network (Table 1) is a convolutional auto-encoder (CAE; Hinton and Salakhutdinov, 2006; Ronneberger et al., 2015). Its main building blocks are convolutional layers (LeCun et al., 1998; Krizhevsky et al., 2012). DINCAE uses 5 encoding and 5 decoding layers with a different number of filters. Beside the input and output layers, the numbers of filters are 16, 24, 36 and 54 (the number of filters increases by 50% from one encoding convolutional layer to the next). All convolutional layers have a receptive field of 3 x 3 grid points (Simonyan and Zisserman, 2015). Between the convolutional layers there are max pooling or average pooling layers (Scherer et al., 2010) to progressively reduce the spatial resolution by retaining only either the maximum or the average value of a region of 2 x 2 grid points. After the last encoding convolutional layer, there are two fully connected layers (the so-called bottleneck). The number of neurons in the bottleneck is a fifth of the number of outputs of the last pooling layer of the encoder (rounded to the nearest integer). Drop-out is used in the fully connected layers to avoid overfitting. The decoding layers are composed of convolutional layers and interpolation layers (nearest neighbour) to upsample the results. We also added skip connections between the output of the pooling layers and the upsampling layers (Ronneberger et al., 2015). These skip connections correspond to layers 16, 19, 22 and 25 of Table 1. The motivation for this choice is that the large-scale information of the SST is captured by the neurons in the bottleneck, while small-scale structures unrelated to the overall structure of the SST are handled by these skip connections. In the absence of the skip connections, the small-scale structures would be removed from the dataset.
A rectified linear unit (ReLU) activation function is commonly used in neural networks; it is defined as:

relu(x) = max(x, 0)

However, in our case it leads quickly (within 10 epochs) to a zero gradient and thus to no improvement during training. This problem is solved by choosing a leaky ReLU (Maas et al., 2013) for the convolutional layers and the standard ReLU for the fully connected layers:

leakyrelu(x) = x if x > 0, and αx otherwise,

where we use α = 0.2. The output of the network, i.e. the 26th layer of Table 1, is an array T_ijk with 112 x 112 x 2 elements. The first slice (k = 1) is essentially interpreted as the logarithm of the inverse of the expected error variance, and the second slice as the temperature anomaly divided by the error variance. The reconstructed temperature anomaly ŷ_ij and the corresponding error variance σ̂²_ij (for every single grid point i, j) are computed as:

σ̂²_ij = 1 / max(exp(min(T_ij1, γ)), δ)
ŷ_ij = σ̂²_ij T_ij2

where γ = 10 and δ = 10⁻³ °C⁻². The min and max functions in these equations are introduced to avoid a division by a value close to zero or a floating point overflow. The effective range of the error standard deviation is thus from exp(-γ/2) ≈ 0.0067 °C to δ^{-1/2} ≈ 31.6 °C.
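A minimal NumPy rendering of the leaky ReLU and of the output transformation described above may help fix the conventions; this is an illustrative sketch (the array layout with the two slices in the last dimension is our choice), not the training code itself.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    """Leaky rectified linear unit with slope alpha for negative inputs."""
    return np.where(x > 0, x, alpha * x)

def decode_output(T, gamma=10.0, delta=1e-3):
    """Convert the two network output slices into the reconstructed
    anomaly and its error variance, with the clipping described above.
    T: array (..., 2); slice 0 = log of the inverse error variance,
    slice 1 = anomaly divided by the error variance."""
    sigma2 = 1.0 / np.maximum(np.exp(np.minimum(T[..., 0], gamma)), delta)
    yhat = sigma2 * T[..., 1]
    return yhat, sigma2
```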

Training of the neural network
The input dataset is randomly shuffled (over the time dimension) and partitioned into so-called minibatches of 50 images, i.e. arrays of size 10 x 112 x 112 x 50. The complete time series is split into 105 minibatches with 50 images each and one last minibatch with only 16 images (representing a total of 5266 images as mentioned before). The splitting of the dataset into minibatches is necessary because the graphical processing unit (GPU) has only a limited amount of memory. Computing the gradient over randomly chosen subsets also introduces some stochasticity which prevents the minimization algorithm from being trapped in a local minimum. An optimization cycle using all 106 minibatches is called an epoch.
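The shuffling and partitioning step can be sketched as follows (an illustrative helper with our own names; the real implementation would slice the full 10 x 112 x 112 x 5266 array with these indices):

```python
import numpy as np

def minibatches(n_times, batch=50, rng=None):
    """Shuffle time indices and split them into minibatches of `batch`
    images; the last minibatch may be smaller (5266 -> 105 x 50 + 16)."""
    idx = np.arange(n_times)
    if rng is not None:
        rng.shuffle(idx)
    return [idx[i:i + batch] for i in range(0, n_times, batch)]
```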
For every input image, additional data points were masked (beyond the cross-validation points) by applying a randomly chosen cloud mask during training. The cloud mask of a training image is thus the union of the cloud mask of the input dataset and a randomly chosen cloud mask. This allows us to assess the capability of the network to recover missing data under clouds.
Without the additional clouds, the neural network would simply learn to reproduce the SST values that are already received as input. At every epoch a different mask is applied to a given image to mitigate overfitting and aid generalization.
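This cloud-mask augmentation can be sketched as follows. An illustrative helper (names are ours): in the paper the extra masks are the dataset's own cloud masks; here they are passed in as a pool of boolean arrays, and masking a point sets both its scaled anomaly and its inverse error variance to zero, consistent with the input encoding described earlier.

```python
import numpy as np

def augment_with_clouds(anom_i, invvar_i, cloud_masks, rng):
    """Apply a randomly chosen extra cloud mask to one training sample.
    Masked points get zero anomaly and zero inverse error variance."""
    extra = cloud_masks[rng.integers(len(cloud_masks))]
    return np.where(extra, 0.0, anom_i), np.where(extra, 0.0, invvar_i)
```

Drawing a different mask at every epoch is what makes the target (recovering the hidden values) change continually, which mitigates overfitting.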
The aim of DINCAE is to provide a good SST reconstruction but also an assessment of the accuracy of the reconstruction.
The output of the neural network is assumed to be a Gaussian probability density function (pdf) characterized by a mean ŷ_ij and a standard deviation σ̂_ij. Given this pdf one can compute the likelihood p(y_ij | ŷ_ij, σ̂_ij) of the observed values y_ij. The weights and biases of the neural network are adjusted to maximize the likelihood of all observations. Maximizing the likelihood is equivalent to minimizing the negative log-likelihood:

J(ŷ_ij, σ̂_ij) = -(1/N) Σ_ij log p(y_ij | ŷ_ij, σ̂_ij)

where N is the number of measurements in y_ij (excluding therefore land points and cross-validation points). Including the number of measurements N is important as it can change from one minibatch to another. The likelihood of the observations p(y_ij | ŷ_ij, σ̂_ij) is given by a Gaussian distribution:

p(y_ij | ŷ_ij, σ̂_ij) = 1/(√(2π) σ̂_ij) exp( -(y_ij - ŷ_ij)² / (2 σ̂²_ij) )

The cost function finally has the following form:

J(ŷ_ij, σ̂_ij) = (1/2N) Σ_ij [ ((y_ij - ŷ_ij)/σ̂_ij)² + log σ̂²_ij + 2 log √(2π) ]

The loss function per individual scalar sample is the term in brackets of the previous equation. The first term is directly related to the mean square error, but scaled by the estimated error standard deviation. The second term penalizes any overestimation of the error standard deviation. The third term is a constant which can be neglected in the following as it does not influence the gradient. The sum runs over all grid points where a measurement is available, excluding the measurements withheld for cross-validation as the latter are never used during training.
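The Gaussian negative log-likelihood cost described above can be written compactly as follows. This is an illustrative NumPy sketch (in practice the loss would be implemented in the automatic-differentiation framework used for training); the mask argument restricts the sum to the available, non-withheld measurements.

```python
import numpy as np

def nll_loss(y, yhat, sigma2, obs_mask):
    """Mean Gaussian negative log-likelihood over the observed points:
    (1/2N) * sum[ (y - yhat)^2 / sigma2 + log(sigma2) ] + 0.5*log(2*pi)."""
    r = (y - yhat) ** 2 / sigma2 + np.log(sigma2)
    N = obs_mask.sum()
    return 0.5 * (r[obs_mask].sum() / N + np.log(2 * np.pi))
```

For a perfect reconstruction with unit error variance the loss reduces to the constant term, and inflating the predicted variance while the fit stays perfect increases the loss, which is the penalization of overestimated errors mentioned above.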
We used the Adam optimizer (Kingma and Ba, 2014) with the standard parameters: learning rate α = 0.001, exponential decay rate for the first-moment estimates β1 = 0.9, exponential decay rate for the second-moment estimates β2 = 0.999, and regularization parameter ε = 10⁻⁸. During the development of the neural network, it became clear that it tended to overfit the provided observations, leading to degraded results when comparing the network output to the cross-validation data. Commonly used strategies were therefore applied to avoid overfitting, namely a drop-out layer between the fully connected layers of the network. The drop-out layer randomly sets, with a probability of 0.3, the output of these intermediate layers to zero during the training of the network. We also added Gaussian-distributed noise with zero mean and a standard deviation of 0.05 °C to the input of the network.
It is useful to compare the proposed approach to the traditional auto-encoder to highlight the different choices that have been adopted. The essential steps to implement and validate an auto-encoder are the following:

- Some data are marked for validation and never used during training.
- The network is given some data as input and produces an output which should be as close as possible to the input; all training data are thus given to the network at all epochs.
- The network is validated using the validation data set aside.
In essence, the traditional auto-encoder optimises how well the provided input data can be recovered after dimensionality reduction. In the present approach, there are two steps where data are intentionally hidden from the network:

1. The validation data that were set aside and never used during the training, similar to the traditional auto-encoder.
2. Some additional data in every minibatch were set aside to compute the reconstruction error and its gradient (unlike the traditional auto-encoder). This additional subset is chosen at random.

This is done because the main purpose of the network is to reconstruct the missing data using the available data, and the cost function should measure exactly this ability. The proposed method does not withhold less data than the traditional auto-encoder. The downside of the approach is that the cost function fluctuates more because it is computed over a relatively smaller set of data. For us this is acceptable (and controlled by averaging the output of the network over several epochs, as explained later) because the cost function reflects more closely the objective: reconstructing missing data from the available data, instead of reproducing the input data as in the traditional auto-encoder.
The traditional auto-encoder approach trained using only clear images was not considered because only 13 images out of 5266 have a cloud coverage of less than 5%. The ability to handle missing data was therefore a requirement for us from the start.

DINEOF reconstruction
The results of the DINCAE method are compared to the reconstruction obtained with the DINEOF method (Alvera-Azcárate et al., 2005), which uses an EOF basis to infer the missing data. As a first step, the spatial and temporal mean is removed from the data, and the missing data are set to zero. The leading EOF modes are then computed and the missing data are reconstructed using these EOFs (Alvera-Azcárate et al., 2005). A temporal low-pass filter with a cut-off period of 1.08 days is applied to improve the temporal coherency of the results, following Alvera-Azcárate et al. (2009). This filter effectively propagates the information in time so that for a given date the satellite data from the previous and next days are used in the reconstruction.
The optimal number of EOFs retained in this reconstruction is 13 modes, which explain 99.4% of the variability of the initial data.

The classical DINEOF technique reconstructs the cross-validation data points withheld in the last 50 images with an error of 0.4629 °C and a slight negative bias of -0.0922 °C (Table 2). As only 13 modes are retained by DINEOF for the reconstruction, some small-scale structures are smoothed out, which is a well-known property of a truncated EOF decomposition (Wilks, 1995).
This smoothing effect results in an RMS (root mean square) error of 0.3864 °C when comparing the reconstructed dataset to all the initially present SST data (i.e. the data used for the reconstruction). A somewhat surprising result is that when using less data with DINEOF (only the last two years, i.e. 2008 to 2009), 19 EOF modes are retained, leading to a reconstruction with richer structures. The RMS error compared to all the initially present SST data provided to DINEOF (but excluding the cross-validation data) is given in Table 2.

Table 2. Comparison with the independent cross-validation data and the dependent data used for training (in °C). CRMS is the centered root mean square error.

Figure 2 shows the cost function for every minibatch. Large fluctuations are quite apparent in this figure, but the cost function is expected to fluctuate with any minibatch-based optimization method (unless the learning rate is explicitly decreased to zero, which is not the case here) because the cost function is evaluated using a different minibatch at every iteration. Consequently, the gradient of the cost function also includes some stochastic variability. Even if the dataset is small enough that the gradient could be computed over the entire dataset at once, using minibatches is still advisable because these fluctuations allow the optimization to escape local minima (Ge et al., 2015; Masters and Luschi, 2018). While the minibatch selection effectively computes the gradient over a temporal subset, the additional data marked as missing within a minibatch form a spatial subset, which enhances these fluctuations but allows us to define the cost function more closely to our objective (i.e. inferring the missing data from observations, as explained above). The neural network is updated using the gradient for every minibatch during training, and after every 10 epochs the current state of the neural network is used to infer the missing data over the whole time series, in particular to reconstruct the missing data in the cross-validation dataset. Importantly, the network is not updated using the cross-validation data. Figure 3 shows the RMS error relative to the cross-validation dataset computed every ten epochs (during this reconstruction phase drop-out is disabled) using DINCAE. There is an initial sharp decrease of the cross-validation error, and after 200 epochs the RMS error has mostly stabilized but still presents some fluctuations. These fluctuations are due to the fact that the gradient at every optimization step is computed over a subset of the data, and this subset varies at every step.
As mentioned before, in every minibatch a random subset of the data (in the form of clouds) is marked as missing and the gradient is computed over this randomly changing subset, which leads to some fluctuations in the gradient and thus in the parameters of the neural network. In order to obtain a better estimate of the reconstruction, we average the output of the neural network between epoch 200 and epoch 1000 (saved at every 10th epoch), which leads to a better reconstruction than any individual intermediate result. The expected error of the reconstruction is averaged in the same way. Ideally, one would take the correlation of the errors between the different reconstructions into account; ignoring these error correlations could result in overestimating the expected error of the reconstruction. Alternatively, one could average the output of an ensemble of neural networks initialized with different weights (and possibly using different structures), but this would significantly increase the computing resources required by the technique (Krizhevsky et al., 2012). Such an ensemble averaging approach could nevertheless be beneficial to improve the representation of the expected error and the accuracy of the reconstruction.
Instead of using the average, the median reconstruction was also tested, as the median is more robust to outliers. The results were very similar, and slightly better with the average than with the median SST. In the following, only the average estimate is used. To obtain a clearer idea of the reliability of the expected error, we computed the difference between the cross-validation SST and the reconstructed SST, divided by the expected error standard deviation. A histogram of these scaled differences is shown in Figure 5. The scaled error follows the theoretical distribution relatively well. When a Gaussian pdf is fitted to the histogram of the scaled error, one obtains a mean of -0.02 and a standard deviation of 0.85 (both dimensionless), so that generally speaking DINCAE overestimates the expected error by 15%.
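The calibration check described above amounts to computing standardized differences at the cross-validation points: if the expected error is well calibrated, their standard deviation should be close to 1. An illustrative sketch (names are ours):

```python
import numpy as np

def scaled_errors(y_cv, yhat, sigma, cv_mask):
    """Standardized differences (y - yhat)/sigma at the
    cross-validation points; returns their mean and std."""
    z = (y_cv - yhat) / sigma
    z = z[cv_mask]
    return z.mean(), z.std()
```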

An interpolation technique commonly used in an operational context is optimal interpolation. This technique is able to provide an expected error variance of the interpolated fields based on a series of assumptions, in particular that the errors are Gaussian distributed with a known covariance and zero mean. Even given these assumptions, the error variance of the optimal interpolation algorithm was found to be only weakly related to the observed RMS error in a study by Pisano et al. (2016) using satellite sea surface temperature in the Mediterranean Sea. The averaged results of DINCAE overestimate the actual error by 15%, which in this context can be seen as an improvement.
Different variants of the neural network were tested in order to optimize its structure. A DINCAE neural network with an increased or decreased number of layers (5 or 3 convolutional layers) did not improve the results. However, it is possible that the optimal depth of the neural network depends on the available training dataset and that, for a more extensive dataset, increasing the number of layers could have a positive effect.
Max pooling layers are commonly used in image classification problems (e.g. Simonyan and Zisserman, 2015; Krizhevsky et al., 2012), where the strongest detected feature is passed from one layer to the next. However, the purpose of the present network is rather different, as we intend to recover missing data, which requires spreading the information spatially. Therefore we also tried the network with average pooling instead of max pooling, which further reduced the reconstruction error to 0.3835 °C. The better performance of average pooling can be related to the fact that SST images generally do not have gradients as abrupt as those in typical images used for classification. In all cases the biases are relatively small, and the present discussion also essentially holds when considering the centered RMS error (i.e. the RMS difference when the bias is removed). In the following, we only use DINCAE with all skip connections, 4 convolutional layers with 16, 24, 36 and 54 filters, and average pooling. The cold water in the western part of the domain is better defined in the DINCAE reconstruction, and the general position of the 21 °C isotherm agrees better with the SST observations in the DINCAE reconstruction than in the DINEOF results. In some cases the DINCAE reconstruction also introduces some artefacts, such as zonal and meridional gradients near the open boundaries (Figure 7). This is probably due to the fact that zero padding is applied in the convolutional layers so that the convolution operation does not change the size of the arrays. As this issue is relatively localized at the border, it is recommended to choose a slightly larger domain than the primary domain of interest for the reconstruction.
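The difference between the two pooling variants can be illustrated with a minimal 2x2 pooling helper (ours, not the network code): max pooling keeps only the strongest value of each 2x2 block, while average pooling spreads the information of all four values into the coarser grid.

```python
import numpy as np

def pool2x2(x, mode="avg"):
    """2x2 pooling over the trailing two dimensions of x."""
    h, w = x.shape[-2] // 2, x.shape[-1] // 2
    b = x[..., :2 * h, :2 * w].reshape(*x.shape[:-2], h, 2, w, 2)
    return b.mean(axis=(-3, -1)) if mode == "avg" else b.max(axis=(-3, -1))
```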
To further quantify how well the reconstruction methods can recover data under cloud cover, we use in situ temperature from the World Ocean Database 2018 (Boyer et al., 2018). For every in situ observation, the SST image with the same time stamp (ignoring hours, minutes and seconds) is interpolated to the grid cell nearest to the location of the in situ observation.
Only in situ observations corresponding to a cloudy SST pixel are used in the following. In total, there are 774 surface in situ observations. The depth of the in situ observations must be between 0.5 m and 1 m; if there are multiple data points within this depth range, the one closest to the surface is used. As expected, biases play a more important role when comparing in situ observations with reconstructed satellite data (Table 3). DINCAE represents a small improvement relative to the DINEOF reconstruction, confirming the results from the cross-validation comparison.
In Figure 8 the variability of the reconstructed SST datasets is assessed. These panels represent the standard deviation relative to a yearly average climatology, computed for the original SST and for the reconstructions from DINCAE and DINEOF.
For the original SST, the climatological mean SST and the standard deviation were computed only using the available data.
The standard deviation derived from DINCAE matches well the standard deviation of the original data, in particular in the interior of the domain, but it is too large along the southern coast of France and around Corsica. The DINEOF standard deviation matches the original SST standard deviation better in those areas but generally underestimates the SST standard deviation. Given that DINCAE tends to retain more variability in the reconstruction, it is remarkable that it still features a lower RMS error despite the so-called double penalty effect (Gilleland et al., 2009; Ebert et al., 2013), i.e. the tendency of RMS-based error measures to be lower for smoother fields with lower variability.

Conclusions
This paper presents a consistent way to handle missing data in satellite images for neural networks. Essentially, the neural network uses the measured data divided by its expected error variance. Missing data are thus treated as data with an infinitely large error variance. The cost function of the neural network is chosen such that the network provides not only the reconstruction but also the confidence in the reconstruction (quantified by the expected error variance). An over- or underestimation of the expected error variance is penalized by maximizing the likelihood and assuming Gaussian-distributed errors. This approach can easily be generalized to other parametric probability distributions, in particular to log-normal distributions for concentrations such as remotely sensed chlorophyll-a or suspended sediment concentration.
The presented reconstruction method DINCAE compares favourably to the widely used DINEOF reconstruction method, which is based on a truncated EOF analysis. Formally there are similarities between an auto-encoder (composed of just two fully connected layers) and an EOF projection followed by an EOF reconstruction (Chicco et al., 2014). However, neural networks can represent nonlinear relationships, which is not possible with an EOF approach. Both methods were compared by cross-validation, and the DINCAE method resulted in an RMS error reduction from 0.46 °C to 0.38 °C.
The expected error for the reconstruction reflects well the areas covered by the satellite measurements as well as the areas with more intrinsic variability (like meanders of the Northern Current). The expected error predicted by the neural network provides a good indication of the accuracy of the reconstruction.

The accuracy of the reconstructed data under clouds was also assessed by comparing the results to in situ observations from the World Ocean Database 2018. Also in this comparison, the RMS error of the DINCAE reconstruction is lower than the corresponding result from DINEOF.

It is quite common that data analysis methods for reconstructing missing data tend to smooth the available observations in order to fill the areas of missing observations. Therefore, the temporal variability (relative to the seasonal cycle) of the reconstructed sea surface temperature was computed from the original data and from the reconstructions with DINCAE and DINEOF.
The variability of the SST reconstructed with DINEOF generally underestimated the variability of the original dataset, while the variability of the DINCAE reconstruction matched the variability of the original data relatively well. The tests conducted in this paper show that DINCAE is able to provide a good reconstruction of missing data in satellite SST observations while retaining more variability than the DINEOF method. In addition, the expected error variance of the reconstruction is estimated while avoiding several assumptions (difficult to justify in practice) of other methods such as optimal interpolation.