Strengths and weaknesses of three Machine Learning methods for pCO2 interpolation

Using the Large Ensemble Testbed, a collection of 100 members from four independent Earth system models, we test three general-purpose Machine Learning (ML) approaches to understand their strengths and weaknesses in statistically reconstructing full-coverage surface ocean pCO2 from sparse in situ data. To apply the Testbed, we sample the full-field model pCO2 as real-world pCO2 was collected from 1982-2016 for each ensemble member. We then use the ML approaches to reconstruct the full field and compare with the original model full-field pCO2 to assess reconstruction skill. We use feed-forward neural network (NN), XGBoost (XGB), and random forest (RF) approaches to perform the reconstructions. Our baseline is the NN, since this approach has previously been shown to be a successful method for pCO2 reconstruction. The XGB and RF allow us to test tree-based approaches. We perform comparisons to a test set, which consists of 20% of the real-world sampled data that are withheld from training. Statistical comparisons with this test set are equivalent to those that could be derived using real-world data. Unique to the Testbed is that it allows for comparison to all the "unseen" points to which the ML algorithms extrapolate. When compared to the test set, XGB and RF both perform better than NN based on a suite of regression metrics. However, when compared to the unseen data, degradation of performance is large with XGB and even larger with RF. Degradation is comparatively small with NN, indicating a greater ability to generalize. Despite its larger degradation, in the final comparison to unseen data, XGB slightly outperforms NN and greatly outperforms RF, with the lowest mean bias and more consistent performance across Testbed members. All three approaches perform best in the open ocean and for seasonal variability, but performance drops off at longer timescales.

For example, consider two features that could be related to pCO2: latitude and longitude. A decision tree would segment the world into contiguous, distinct regions based on which latitudes and longitudes have similar pCO2. Because of such physical relationships, one of the benefits of XGB is intelligibility: it can easily be determined which features are driving a prediction. However, this benefit of the XGB approach also drives its biggest drawback. Because it applies one estimate of pCO2 in each region, it can have trouble extrapolating to new areas for which it has not "seen" data.
3. Random forest (RF): The RF (Breiman, 2001) is an older tree-based approach in which N decision trees are independently built by bootstrapping the data. There is no reference to the error of the previous trees, although the trees will be correlated. The final RF prediction is an average across the predictions from all N trees and is able to capture complex non-linear decision boundaries. Though its popularity for commercial applications has been eclipsed by XGB, it remains a standard component of the ML toolkit. We include it as a point of comparison to the other approaches and as an alternative to XGB. A minimal sketch contrasting the two tree-based approaches follows this list.
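To make the contrast concrete, here is a sketch of how the two tree-based regressors might be instantiated with the Python libraries referenced later in this paper; the hyperparameter values shown are placeholders, not the tuned settings.

```python
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor

# XGB grows trees in series: each new tree is fit to the residual errors of
# the trees built so far, so later trees correct earlier ones.
xgb = XGBRegressor(n_estimators=500, max_depth=6, objective="reg:squarederror")

# RF grows N trees independently on bootstrap resamples of the training data
# and averages their predictions; no tree sees the errors of another.
rf = RandomForestRegressor(n_estimators=500, bootstrap=True)

# Both expose the same interface, e.g.:
#   xgb.fit(X_train, y_train); y_hat = xgb.predict(X_unseen)
```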
It is important to state the key assumptions made across all these approaches. The first assumption is to treat each data point independently. Instead of using ML algorithms that take into account the spatial and temporal structure of the data, these approaches treat each data point separately, regardless of whether points are adjacent in space or time; we instead use time and location as additional features in our ML algorithms. The second assumption, supported by previous studies, is that we can estimate ocean pCO2 statistically by association with physical and biological variables. Our third assumption is that the real-world sampled data cover sufficient state space of the whole ocean such that extrapolation is possible. In actuality, we have a correlated and biased sample of data because of the real-world data collection process, in which observations are concentrated along major shipping routes. A commercial ship traversing the ocean collects data points along a trajectory between major shipping ports, leading to along-track correlations (Jones et al., 2012). Data are dense in the Northern Hemisphere, while Southern Hemisphere data are very sparse. Given these concerns, a benefit of the LET is that we can compare the ML model output to the ground truth to see how well it reconstructs, despite these assumptions that are not entirely valid in the real world.

Data inputs
The 100-member Large Ensemble Testbed (LET) consists of 25 randomly selected members for 1982-2016 from each of four independent initial-condition ensembles of Earth system models:
- CESM: Community Earth System Model Large Ensemble
- CanESM2: Canadian Earth System Model version 2 Large Ensemble
- GFDL-ESM2M: Geophysical Fluid Dynamics Laboratory ESM2M Large Ensemble
- MPI-GE: Max Planck Institute for Meteorology Grand Ensemble (RCP8.5) (Maher et al., 2019)

Each individual Earth system model attempts to represent the actual Earth system. By using many members from multiple Large Ensembles, we can test the capabilities of the reconstruction methods across different model structures and states of internal variability. In other words, we test the reconstructions across a large set of plausible ocean states. Each ensemble member uses the same external forcing of historical atmospheric CO2 before 2005 and Representative Concentration Pathway 8.5 (RCP8.5) afterwards. Internal variation among the ensemble members is caused by perturbing the initial state of the Earth system at the point where that ensemble member is started. This causes each ensemble member to follow a unique trajectory of the ocean-atmosphere state over time. Each Testbed member consists of monthly averaged sea surface temperature (SST), sea surface salinity (SSS), Chlorophyll-a (Chl-a), mixed layer depth (MLD), CO2 forcing (xCO2), and pCO2, each with a spatial resolution of 1°x1°. Furthermore, each member of the LET assumes perfect coverage over space and time; that is, each pCO2 estimate for a location is the true average for that location over that time period. This contrasts with SOCAT data, which are monthly, meaning that pCO2 at a location could come from a single observation within that month. See Gloege et al. (2020) for additional details on the LET.

Data preparation
From each LET member, we extract monthly output for each of the six features listed above to be used in the ML models. Our analysis domain is restricted to the open ocean, defined here as having a depth greater than 100 m.
In order to mimic real-world observations, pCO2 for each Testbed member is sampled in space and time to match the SOCATv5 data product (Sabine et al., 2013). As would be the case in the real world, where actual SOCAT data would be the only pCO2 data, the task for the ML approaches is to learn a representation of the system based on these limited observations and then to apply this representation to full-field driver data in order to upscale to the global ocean. The key advantage of the LET is that the true values for pCO2 are known, and thus we can compare the reconstruction result to these "unseen" data.
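A minimal sketch of this sampling step, assuming hypothetical file and variable names (the LET fields and the SOCATv5 sampling mask are not actually distributed in this form):

```python
import xarray as xr

# Hypothetical file and variable names, for illustration only.
member = xr.open_dataset("LET_member_001.nc")           # SST, SSS, Chl-a, MLD, xCO2, pCO2; monthly, 1x1 degree
member["socat"] = xr.open_dataarray("socatv5_mask.nc")  # 1 where SOCATv5 sampled that month/cell, else 0

# Flatten the gridded fields into a table with one row per (time, lat, lon)
# point; each row is treated as an independent sample (see the assumptions
# above). Open-ocean masking (depth > 100 m) is assumed already applied.
df = member.to_dataframe().dropna().reset_index()

# "Seen" rows mimic real-world sampling; the remaining rows form the unseen
# evaluation set, for which the Testbed still provides the model truth.
seen = df[df["socat"] == 1]
unseen = df[df["socat"] == 0]
```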
The SOCAT-sampled points for each member are randomly split into training (60%), validation (20%), and test (20%) sets (Appendix A2). Each member has a different random split so that we can understand how the approaches generalize from a wide array of training data points. The training set is used to fit each ML algorithm, the validation set is used to select tunable parameters and avoid over-fitting, and the test set is used for performance evaluation. Using the LET allows the non-SOCAT-sampled data to be leveraged for a final evaluation as well. We consider the unseen data to determine reconstruction ability, while the SOCAT-sampled test group allows us to understand what we would have seen with only real-world data.
To eliminate outliers that could unduly influence the reconstruction, we remove all pCO2 values above 816 µatm, the absolute maximum value in SOCATv5. We exclude these points as they are not representative of the real ocean; this exclusion primarily impacts the CanESM2 model (∼11.3K data points per member, or <0.1% of the data).
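A sketch of the outlier filter and the per-member random split described above; the pCO2 column name and random seed are illustrative assumptions, and `seen` is the table of SOCAT-sampled rows from the previous sketch.

```python
from sklearn.model_selection import train_test_split

seen = seen[seen["pco2"] <= 816.0]  # drop values above the SOCATv5 maximum

# 60% train, then split the remaining 40% evenly into validation and test.
train, holdout = train_test_split(seen, test_size=0.4, random_state=0)
valid, test = train_test_split(holdout, test_size=0.5, random_state=0)
```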

Machine learning approaches
We use three machine learning approaches to reconstruct surface ocean pCO2 (the target variable) using features derived from SST, SSS, Chl-a, MLD, xCO2, latitude, longitude, and time. The key characteristics and our development process are described here, with further details provided in Appendix A. For all approaches, we train the algorithms using mean squared error (L2 loss). The algorithms are implemented using Python library packages, as described below.

Feed-forward neural network (NN)
The NN is implemented using Keras with a TensorFlow 2.0 backend (Chollet et al., 2015). The main parameters considered for the NN are the number of hidden layers and the number of units per layer, which determine the size of the network. Additional parameters considered are learning rate, activation function, batch size, epochs, loss function regularization, and dropout. These parameters must be set by the user; they cannot be learned from the data. Learning rate controls the size of parameter updates at each step and can be adapted during training. The activation function determines the non-linear transformation applied in the training process. Batch size represents how many data points are used at a time to update parameters and is bounded by available memory. The number of epochs determines how long the algorithm is trained; running for too many epochs can lead to overfitting. Loss function regularization and dropout both attempt to prevent overfitting. Regularization forces the NN to have smaller parameter values, and is commonly used in many ML algorithms. Dropout randomly deactivates part of the network during each training update to force the NN to learn multiple pathways for predicting the output.

Initially, we tested NN parameters on several members from each model and evaluated performance on the validation set to preserve the test set. Parameters such as learning rate and activation function did not noticeably affect performance, so we used standard values for these: the Adam optimizer with a learning rate of 0.01, and the rectified linear unit (ReLU) activation function. In terms of regularization, dropout did not improve performance, but a small L2 parameter regularization was used.
Regularization led to a better fit and helped produce more consistent results. For the size of the network, once it was large enough (approximately two layers with 500 units each), there was no noticeable difference in performance. We did not use a network with many layers, as used in deep learning, for this task. Deep learning is more applicable when there are enormous amounts of raw data and the depth of layers is required to extract intermediate representations of the data. In this case, we have more limited data with a clear set of feature variables.
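A minimal Keras sketch of the configuration described above (two hidden layers of 500 ReLU units, Adam with a 0.01 learning rate, small L2 regularization, and MSE loss); the L2 coefficient is a placeholder, since its tuned value is not given here.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_final_nn(n_features: int) -> keras.Model:
    """Two hidden layers of 500 ReLU units with small L2 regularization."""
    model = keras.Sequential([
        layers.Dense(500, activation="relu", input_shape=(n_features,),
                     kernel_regularizer=regularizers.l2(1e-4)),  # L2 coefficient is a placeholder
        layers.Dense(500, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dense(1),  # single pCO2 output
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="mse")
    return model
```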
For the NN, large variation in output occurred across runs with the same architecture; the only difference between these runs was the random initialization. Because of this instability, we chose a single architecture and ran it five times for each LET ensemble member. From these five, the top two in terms of root mean squared error (RMSE) are evaluated, and of these two, the one with the lower bias is identified as the best run. This best run is selected as the final NN output for that ensemble member.
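A sketch of this run-selection rule; the scores are illustrative, and we assume "lower bias" means lower absolute bias on the validation set.

```python
# Validation scores for five runs of the same architecture, as (RMSE, bias)
# pairs; the values here are illustrative only.
runs = [(15.2, -0.8), (14.9, 1.5), (16.1, 0.2), (15.0, -0.1), (15.7, 0.9)]

top_two = sorted(range(len(runs)), key=lambda i: runs[i][0])[:2]  # lowest RMSE
best = min(top_two, key=lambda i: abs(runs[i][1]))                # then lowest |bias|
print(best)  # index of the run kept as the final NN output (here: 3)
```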
XGBoost (XGB)

The XGB approach (Chen and Guestrin, 2016) is implemented using the Scikit-Learn wrapper for XGBoost. The primary hyperparameters tuned were the maximum depth of each tree and the number of estimators, or trees, in the series. Maximum depth controls the complexity of each individual tree, while the number of estimators balances a good fit against overfitting. We used cross validation over a grid of values to select these parameters. In preliminary tests on several members from each model, the same optimal parameters were selected during the grid search. Because of this, instead of evaluating this grid for each member, we selected one member for each model and performed the grid search on this member. The optimal parameters were then applied to the other members of that model. Ultimately, the same parameters were selected in the grid search for all four models.
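A sketch of the cross-validated grid search, assuming hypothetical feature-column names and candidate values (the actual grid is not listed here); `train` comes from the split sketch above.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Hypothetical feature-column names, for illustration only.
FEATURES = ["sst", "sss", "chl", "mld", "xco2", "lat", "lon", "time"]

# Candidate values are placeholders, not the grid actually used.
grid = {"max_depth": [4, 6, 8], "n_estimators": [500, 1000, 2000]}
search = GridSearchCV(XGBRegressor(objective="reg:squarederror"),
                      param_grid=grid, scoring="neg_mean_squared_error", cv=3)
search.fit(train[FEATURES], train["pco2"])  # one member per Earth system model
best_params = search.best_params_           # reused for that model's other members
```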

Random forest (RF)
The RF (Breiman, 2001) is implemented using the Scikit-Learn API. This approach followed a procedure similar to that used for XGB: we tuned the maximum depth and number of estimators hyperparameters using the same grid search method. One model selected different hyperparameter settings in the grid search than the other three.

Evaluation of reconstructions
We evaluate the ML approaches in three parts. First, we look at a set of goodness-of-fit metrics to understand performance on both the test and "unseen" data. Second, we use the full-field input data to create complete pCO2 reconstructions in space and time, with which we can finally evaluate each approach's capability with respect to estimating all points.
After developing a general understanding of the fidelity of the ML approaches, we consider the degree to which the spatial and temporal patterns of pCO2 can be reconstructed, following the approach of Gloege et al. (2020). We consider the spatial pattern of long-term mean bias. We decompose the pCO2 from each ensemble member and the corresponding reconstruction at every point into multiple timescales to assess the fidelity of the phasing and amplitude of the reconstruction.

The temporal decomposition consists of breaking down the time series at each grid point into seasonal, sub-decadal, and decadal components. We perform this decomposition on the original modeled pCO2 for each ensemble member and compare it to the ML model reconstruction for the same member using standard statistical metrics (Section 2.5).
Temporal decomposition is performed as follows. First, a linear trend is removed at each 1°x1° grid cell to account for the pseudo-linear increase in atmospheric CO2 over 1982-2016. Second, a repeating annual seasonal cycle is calculated from the detrended time series. After subtracting this seasonal cycle at each location, the decadal signal is determined by applying a locally weighted regression (loess) smoother with a 10-year window. Finally, the residual signal is termed the sub-decadal component.
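A sketch of this decomposition for a single grid cell's monthly time series, using numpy and the statsmodels lowess smoother; it assumes the series spans whole years.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def decompose(pco2, months_per_year=12, window_years=10):
    """Split one grid cell's monthly series into seasonal, decadal, and
    sub-decadal components; assumes the series spans whole years."""
    t = np.arange(pco2.size)

    # 1. Remove the linear trend (pseudo-linear atmospheric CO2 increase).
    slope, intercept = np.polyfit(t, pco2, 1)
    detrended = pco2 - (slope * t + intercept)

    # 2. Repeating seasonal cycle: mean of each calendar month.
    clim = np.array([detrended[m::months_per_year].mean()
                     for m in range(months_per_year)])
    seasonal = np.tile(clim, pco2.size // months_per_year)

    # 3. Decadal signal: loess smoother with a 10-year window.
    anomaly = detrended - seasonal
    frac = (window_years * months_per_year) / pco2.size
    decadal = lowess(anomaly, t, frac=frac, return_sorted=False)

    # 4. Residual is the sub-decadal component.
    return seasonal, decadal, anomaly - decadal
```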

Statistical metrics
Reconstruction skill for both test and unseen data is evaluated using a series of summary metrics: root mean square error (RMSE), bias, mean absolute error, and maximum error.
RMSE

RMSE is a measure of how well the predictions fit the data, with low values indicating a better fit. Because errors are squared before averaging, larger errors are penalized more severely. For data points $y_i$ and predictions $\hat{y}_i$ corresponding to $i = 1, \ldots, N$, it is calculated as

$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$

Bias

Bias is calculated as the average error of the reconstruction,

$\mathrm{Bias} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)$

and is a measure of the systematic discrepancy between the reconstruction and the model over the long term. It is important to note that values near zero may be misleading, as positive and negative discrepancies can cancel each other out.

Mean absolute error
This metric, $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$, provides an estimate of the average magnitude of the error, regardless of direction. While it does not show the systematic discrepancy, it shows how far off a prediction will be on average.
Maximum error

Maximum error identifies the largest absolute error among the predictions, $\max_{i=1,\ldots,N} |y_i - \hat{y}_i|$. While the previous metrics focus on average performance, this gives a sense of worst-case performance.
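For reference, the four summary metrics as simple numpy functions; the sign convention for bias (reconstruction minus model) follows the definition above.

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def bias(y, y_hat):
    return np.mean(y_hat - y)  # positive values indicate overestimation

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def max_error(y, y_hat):
    return np.max(np.abs(y - y_hat))
```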
We evaluate these metrics for each individual ensemble member and then average across all ensemble members. The variance across the 100 members is also presented, to determine the degree of consistency among them. To consider the spatial patterns of reconstruction skill, the same metrics are calculated at each geographical grid point and then averaged across members.

For the temporal decomposition, we focus on correlation and percent error of the standard deviation. These metrics are used to assess the degree to which the different ML approaches are able to reconstruct the temporal phasing and amplitude of variability in the original ensemble.
The Pearson correlation coefficient, $r$, is the covariance between the reconstruction $R$ and the model $M$ divided by the product of their standard deviations, $r = \frac{\mathrm{cov}(R, M)}{\sigma_R \sigma_M}$. Correlation quantifies the synchrony between the reconstruction and the model truth on each timescale.

Results

On the test data, XGB and RF fit more closely than the NN, but the tree-based methods lose more of this skill on the unseen data. The NN's RMSE on the unseen data remains moderate (RMSE = 14.57 µatm), only degrading by 73%. In summary, XGB most closely fits the test data, but degrades substantially when extrapolated to unseen points. In contrast, while NN is less capable of fitting the test data, its performance degradation is much lower when extrapolated to unseen data (Figure 1).
While XGB has the lowest average maximum error across members on the test data, all approaches have a similar maximum error on the unseen data. For this metric, degradation from test to unseen data is 203% in the NN model, 266% in the XGB model, and 183% in the RF model. RF has the least relative degradation because it has the worst performance on the test data to begin with.

Mean spatial performance
Though the global mean bias is near zero for all three approaches (Table 1), clear regional biases remain (Figure 2). Across the 100 members for each approach, the spread of the bias indicates how consistently the ML method is able to accurately reconstruct pCO2 (Table 1). Figure 3 illustrates this by showing the variance of the bias at each point across the 100 members. The locations where mean bias is small (Figure 2) are also where variance is low. This indicates that the ability of the reconstruction to provide a low-bias result is robust across the ensemble members at these locations.
Regions of high bias are consistent across the ML methods (Figure 2). Coastal regions are generally poorly reconstructed, since the reconstructions are based primarily on open-ocean data. The southeast Pacific is a region of notable bias and variance for all three methods, and the Southern Ocean has more bias and variance of bias (Figure 3) than the Northern extratropics.

Temporal decomposition
At each point, the reconstruction is divided into three temporal components: seasonal, sub-decadal, and decadal. For each component, we evaluate performance using correlation (Figure 4) and percent error of the amplitude (Figure 5).
On the seasonal timescale, all of the approaches capture the phasing well, with > 0.9 correlation across most of the ocean (Figure 4, top row). The globally averaged correlation is > 0.9 for all methods (Table A1).
For seasonal amplitude (Figure 5, top row), NN and XGB generally provide high-fidelity reconstructions, but RF systematically underestimates seasonal amplitudes. All approaches have a tendency to overestimate the decadal amplitude except in the western equatorial Pacific (Figure 5, bottom row). Each approach tends to overestimate amplitude overall, but underestimates it for members with large decadal amplitude. Globally averaged, NN has a 13% overestimate, while XGB overestimates by 28%, and RF by 60% (Table A1).
For the sub-decadal timescale, NN underestimates amplitude almost everywhere (global average -10%, Table A1). XGB and RF tend to overestimate sub-decadal amplitude in the subtropics, but underestimate it elsewhere (Figure 5, middle row). Table A1 provides global and regional average results.
Figure 7 relates the number of months of sampled data available at each location to the time-average bias at that location, averaged across the 100 members of the LET. When only sparse data are available for a given location, predictions can be poor in all methods. However, as more months of data become available, there is a sharp decrease in mean bias. XGB reduces bias more quickly than the other two approaches. In other words, XGB performs best when only moderate amounts of data are available, as shown also in Table 1 and Figure 1.

Discussion
Overall, XGB best reconstructs the pCO2 field. It has low bias on the test set and the lowest RMSE on both the test and unseen data sets. It is also the most consistent across the 100 members in terms of RMSE, i.e., it has a low standard deviation of RMSE across the members (Table 1) and a low standard deviation of bias (Figure 3). NN provides comparable average performance; however, this comes with higher variance in its performance (Table 1, Figure 3). Though RF does almost as well as XGB on the test data, it generalizes poorly. Gregor et al. (2019) also found that the random forest method performs poorly in pCO2 reconstruction.
It is important to note that the NN model does not consistently perform well, and steps must be taken in implementation to limit the spread of its performance. For this implementation, for each of the 100 ensemble members, we trained this algorithm five times and selected the best run, as in Denvil-Sommer et al. (2019). This reduces the resulting variance, but the fidelity of the NN result still has substantial spread (Table 1, Figure 1). The inherent instability of the NN approach is thus a concern for any real-world application. We find variation across NN runs for all sizes of networks considered. Running the model as many times as possible can help to address this problem, but this iterative approach does add substantially to the already high computational cost of developing a NN algorithm.
Between the tree-based models, the superior performance of XGB on the unseen data compared to RF is attributable to the XGB algorithm focusing on previous errors when constructing the series of trees. RF trees are correlated, meaning that there is less ability to drive down the error of the fit. The greater fidelity of XGB here is consistent with its rapid adoption for many commercial applications in recent years (Chen and Guestrin, 2016).

A significant benefit of the LET is the ability to simulate the model training process with real-world sampled data and then to directly examine performance on unseen data. The degradation in the various statistical metrics can provide useful quantification of the suspected degradation that occurs with these ML methods when they are applied to real-world data.
Depending on the approach, the RMSE from test to unseen data can increase by 75%-185% (Figure 1). Recent related research (Landschützer et al., 2013; Gregor et al., 2019) studying the degradation between SOCAT test data and independent pCO2 data from the Lamont-Doherty Earth Observatory (LDEO) database (Takahashi et al., 2009) found similar levels of RMSE degradation with SOM-FFN (115%) and the ensemble CSIR-ML6 (56%). These results indicate that the comparison to the independent LDEO data is an accurate indicator of the aggregate level of degradation with extrapolation.
In our tests, the NN suffered the least degradation in performance from test to unseen data, indicating that it captures the pCO2 field effectively without overfitting too much on the training data. Still, it is important to note that the greater degradation in XGB as opposed to NN does not mean NN performs better overall. For our implementation, XGB performance is high enough on the test data that even a larger drop-off puts it in a better position on the unseen data than its competitors. Gregor et al. (2019) also found that the NN had the smallest difference in RMSE performance when compared to independent data, while the ERT (similar to our RF) suffered the most degradation.
SOCAT pCO2 data are collected along ship tracks and have substantial spatial along-track correlation (Jones et al., 2012), so the data are not entirely random and thus their value to ML method training is compromised. Gregor et al. (2019) ran an experiment comparing random sampling vs. sampling by year to separate training and test data. Sampling by year reduces correlations between the training and test data because it segregates entire ship tracks to either the training or the test set. When data are randomly sorted, training and test data will be from the same ship tracks and thus have greater correlation. Gregor et al. (2019) found that sampling by year gives an estimate closer to the true error, and they recommend this sampling method in order to prevent overfitting.
As discussed by Gregor et al. (2019) and Gloege et al. (2020), there are many regions of the global ocean pCO2 state space that are inadequately sampled. Gregor et al. (2019) propose that the convergence across recent implementations of ML methods and other approaches to pCO2 extrapolation indicates that the available real-world sampling is essentially a "wall" that prevents further improvement of pCO2 reconstruction skill. That is, reconstruction approaches may not be able to improve significantly on the current best-in-class because of prohibitive data sparsity.
To test the value of additional data using the LET, Gloege et al. (2020) add a modest amount of additional sampling in the Southern Ocean to SOM-FFN. They show that this could substantially improve the ability of this neural network to reconstruct decadal timescale variability, both in the Southern Ocean and across the global ocean. As seen here in Figure 7, the number of months of data at a given location is a significant driver of performance for our ML methods. XGB performance with moderately sparse data is the most impressive: as soon as the XGB model has a small amount of data at a location, the range of bias is much reduced compared to the other ML methods. This indicates that additional sampling in low-coverage regions combined with optimized ML algorithms could lead to improved pCO2 reconstructions going forward. It would be valuable to repeat the test of Gloege et al. (2020) with the XGB model to determine how much a modest increase in Southern Ocean sampling could improve Southern Ocean and global performance. All of the approaches struggle most in the coastal regions and parts of the Southern Ocean. The very limited sampling for these regions is consistent with the state space for pCO2 being inadequately represented in the data available for algorithm training. For similar reasons, algorithms built specifically for the coastal regions also appear to be needed (Roobaert et al., 2019; Laruelle et al., 2017). Specialized approaches could focus on localized patterns or on incorporating additional information into the training process to achieve a better representation of these areas.
Another consideration is the timescale of interest. We find very good performance seasonally, but deteriorating performance on the sub-decadal and decadal timescales. Seasonal variations are large in most locations and are also the best-sampled given our dataset from 1982-2016; thus, these are the easiest signals for all three ML approaches to identify. However, the NN shows stronger performance in the decadal time series, the one place where it outshines XGB.

Why does the NN perform best for the phasing and amplitude of pCO2 variability on decadal timescales? In Figure 1, we show that despite not fitting the test data as well, NN performs best in generalization to unseen data; XGB and RF have more trouble extrapolating. Relative to seasonal variation, the decadal timescale is very poorly sampled, with the entire dataset extending only 35 years. It is likely that the fundamental structures of these models cause this difference. The NN creates a non-linear mapping of input to output, without creating distinct regions in the feature space. Conversely, the tree-based models do create distinct regions, effectively chopping the training data into distinct bins. Given the limited sampling of the geophysical state space relevant to pCO2 variations on decadal timescales, the tree-based methods perform more poorly in this extrapolation. The ability of NN to generalize beyond the training set, as indicated by Figure 1, appears to be the reason for its ability to better reconstruct decadal timescale variations.
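A toy illustration of this structural difference (not from the paper): a tree ensemble predicts an approximately constant value outside the range of its training data, because every query lands in an existing leaf, whereas a network's fitted function can continue to vary. Scikit-Learn's MLPRegressor stands in for the Keras NN here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 1))  # training inputs confined to [0, 1]
y = 3.0 * X.ravel() + rng.normal(0, 0.05, 500)

rf = RandomForestRegressor(n_estimators=100).fit(X, y)
nn = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000).fit(X, y)

# Query far outside the training range:
print(rf.predict([[2.0]]))  # ~3.0: clamped at the edge of the seen data
print(nn.predict([[2.0]]))  # typically continues the learned trend (~6.0)
```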
Though NN generalizes better, we find that XGB performs so well against training data that its overall performance is slightly better than NN (Table 1 and Figure 1). This is likely due to how this algorithm constructs its estimators. The initial trees in the series are very weak, but they still capture the rough pCO 2 trends across the globe. As more trees are added, the algorithm focuses on reducing errors, particularly in locations that are well sampled. This leads to an excellent fit in areas with sufficient sampling while still capturing the average trends elsewhere. It is also worth noting that XGB is more computationally efficient than the NN, a considerable benefit for use with datasets of this size.
Gregor et al. (2019) propose the use of an ensemble of ML methods for reconstruction of surface ocean pCO2, and included NN and XGB methods in their final ensemble. Our results suggest that in their CSIR-ML6 ensemble, XGB helps to strengthen overall performance while the NN enhances the ability to generalize to features and timescales that are poorly represented in the training data.

Conclusions
Figure A1. Training vs. test set RMSE for all members in the LET for each ML approach. Colors represent which ensemble model the member is from.

A2 ML model training
For each ensemble member, model points corresponding to SOCATv5 open ocean sampling are randomly split into three parts: 60% training, 20% validation, 20% test. This allows us to simulate how we would train a ML model with real-world data. All other points, or "unseen" data, are used to understand how the ML method extrapolates. We expect deterioration in performance as compared to the SOCATv5-sampled data set due to correlations in the data along ship tracks (Jones et al., 2012). Previous work has provided ample evidence that the full state space for pCO2 in the open ocean across the last several decades is not sufficiently sampled (Gregor et al., 2019; Gloege et al., 2020).
An alternative to random sampling of the dataset would be to randomly sample ship tracks or years. This could better adjust for the correlation in the data and close the error-estimation gap from test to unseen data. Future work with the LET should explore this approach.
Figure A1 compares the training and test RMSE for each approach. The training RMSE for the NN is very similar to the test RMSE, indicating that the NN is learning the patterns in the data efficiently without overfitting the training data. The tree-based models achieve a lower training RMSE than test RMSE. This is unsurprising because they are constructed to learn the average values of the training data in the regions of the state space that they create. In fact, XGB is designed to eventually achieve a training RMSE of 0 if run for long enough. Though training RMSE is lower than test RMSE, this does not mean we are overfitting and sacrificing test set performance; test-set error is still lower for XGB and RF than for the NN, as also shown in Table 1 and Figure 1. Our use of cross validation against a validation set when the number of estimators is selected helps to guard against overfitting.
When training the ML models, we use L2 loss (mean squared error) to fit the models. In the tree-based models, the validation set is used for one member from each of the four Earth System Models in the LET to tune hyper-parameters (e.g., number of estimators). Then, all members for that LET model are trained using these hyper-parameters. The idea behind this is that members of the same Earth system model share the same underlying structure, so hyper-parameters tuned on one member carry over to the others. For the NN, random initialization leads to variation across runs, and our parameter choices serve to reduce this variance somewhat. To further reduce variance, for each LET member, we run five neural networks with the same settings and use the validation set to select the best one, applying the criteria described in the main text. Applying dropout between the layers did not reduce the variation.

A3 Regionally and globally averaged metrics
In Table A1, we present regional and global averages of long-term bias (Figure A3), correlation (Figure 4), and percent amplitude error for each timescale (Figure 5).

A4 Ensemble variation
While this paper focuses on the differences between ML approaches, there is also variation across the ensemble models. Figure A2 provides an overview of the RMSE for each approach, broken down by model, for the test and "unseen" data sets. For each model, the overall conclusions discussed in the paper are the same, the most significant finding being that across all three ML approaches, XGB has the highest overall performance. NN is also very competitive. Notably, however, the ML algorithms consistently yield larger errors in reconstructing the CanESM2 and MPI model fields, compared to the other two models.
Indeed, there are underlying features of these two models that give rise to high error in the reconstructions. The MPI model has larger variance of climate variables compared to the others, and the CanESM2 model generally has high pCO2 values in some locations, even after filtering out extreme values (pCO2 > 816 µatm). Interestingly, reconstruction performance, as reflected by RMSE, degraded least in the MPI model, followed by CanESM2, for all ML approaches.
Figure A3 shows the average bias for each approach, broken down by each ensemble model. As we would expect from Figure A2, the ML approaches perform better on the GFDL and CESM models. Looking across the approaches for each model reinforces that XGB performs best at capturing the pCO2 field with low bias, particularly in the open ocean. While the magnitude of the bias differs across the ensembles, the locations where the methods struggle are similar. The differences across these models highlight some of the challenges for reconstruction methods. By nature, all ML algorithms will struggle to produce a successful reconstruction when there is high uncertainty in the data, as is clearly exhibited by the poor performance of each approach in coastal regions, segments of the Southern Ocean, and other areas of particular data sparsity and high variability. To accommodate these outliers while attempting to retain data integrity, we used minimal data cleaning (i.e., only removing extreme outliers). One potential future exploration with the LET would be to assess more fully the impact of data validation procedures on algorithm performance.