Sensitivity analysis of a data-driven model of ocean temperature

There has been much recent interest in developing data-driven models for weather and climate predictions. These have shown reasonable success in modelling atmospheric dynamics over short time scales; however, there are open questions regarding the sensitivity and robustness of these models. Using model interpretation techniques to better understand how data-driven models are making predictions is critical to developing trust in these alternative prediction systems. We develop and interpret a simple regression model of ocean temperature evolution, to improve understanding of whether data-driven models are capable of learning the complex underlying dynamics of the systems being modelled. We investigate model sensitivity in a variety of ways and find that the simple regression model analysed here behaves in ways which are, for the most part, in line with our knowledge of the ocean system being modelled. Specifically, we see that the regressor heavily bases its forecasts on, and is dependent on, variables which we know are key to the physical dynamics inherent in the system, such as the currents and density. By contrast, inputs to the regressor which have limited direct dynamic impact, such as location, are not heavily used by the regressor. We also find that the regression model requires non-linear interactions between inputs in order to show any meaningful predictive skill, in line with our knowledge of the highly non-linear dynamics of the ocean. Further sensitivity analysis is carried out to interpret the ways in which certain variables are used by the regression model. Results here are again mostly in line with our physical knowledge of the system; for example, we see that information about the vertical profile of the water column reduces errors in areas associated with convective activity, and information about the currents is used by the regressor to reduce errors in regions dominated by advective processes.
Our results show that even a simple regression model is capable of ‘learning’ much of the physical dynamics inherent in the ocean system being modelled, which gives promise for the sensitivity and generalisability of data-driven models more generally.

Further details of the model dynamics, in particular assessment of the contribution of different processes to temperature change in the model can be found in Appendix A.

Training and validation datasets
Input and output data for training and validating the regressor comes from a 70 year run of this simulator. The first 50 years of the run are discarded, as this includes a period where the model is dynamically adjusting to its initial conditions, which may be physically inconsistent. During this period the evolution of the simulator is driven by this adjustment, rather than the more realistic ocean dynamics which we are interested in, hence we exclude this data. This leaves 20 years of data which is used for training and validating the model. As the GCM sees constant wind forcing and a consistent restoring of surface temperature and salinity, if left to run for long enough (thousands of years) the system would reach a quasi-steady state; however, the 20 year period used here is prior to the model reaching this quasi-steady state. As the data is highly auto-correlated we sub-sample in time to remove some of the co-dependent nature of the training data. There are also computational constraints limiting the total size of our dataset. This leads us to choose a subsampling rate of every 200th day. This dataset is then split into training, validation and test data with a split of 70-20-10. The data is systematically split temporally, so the first 70% of samples are used as training data etc., meaning each dataset contains data from different temporal sections of the run, maximising independence across the datasets. For every 200th day, we take all grid points from throughout the model interior (i.e. we exclude points next to the land, and points at the surface and seabed). We do not subsample in space, as the domain is reasonably small and the dynamics vary considerably across it, meaning subsampling in space can lead to some dynamic regimes being entirely missing from the dataset. This gives us approximately 650 000 training samples, 200 000 validation samples, and 100 000 test samples.
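As an illustration, the temporal sub-sampling and the 70-20-10 temporal split described above can be sketched as follows. This is a minimal sketch: the 360-day model year, the function name and the array sizes are illustrative assumptions, not taken from the actual data pipeline.

```python
import numpy as np

def temporal_split(samples, train_frac=0.7, val_frac=0.2):
    """Split samples temporally: first 70% train, next 20% validation, rest test."""
    n = len(samples)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return samples[:n_train], samples[n_train:n_train + n_val], samples[n_train + n_val:]

# Illustrative: sub-sample every 200th day from a hypothetical 20-year record
days = np.arange(20 * 360)      # assuming 360-day model years for concreteness
sampled_days = days[::200]      # every 200th day, reducing auto-correlation
train, val, test = temporal_split(sampled_days)
```

Because the split is sequential in time rather than random, each subset covers a distinct temporal section of the run, which is what maximises independence between the datasets.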
We develop a linear regression model to predict the daily mean change in ocean temperature for any single grid cell, based on variables at surrounding locations at the current time step. The regressor is defined such that it outputs temperature change at a single grid cell rather than predicting for the entire domain, but the cell being predicted can be any of the cells in the domain interior; the regressor is not limited to predicting for a specific location.

Regression model
Equation 1 shows the formulation of the regressor: the predicted temperature change is a linear combination of the input features, with coefficients β_i, plus second-order multiplicative terms combining pairs of input features, with coefficients γ_i,j. We include these multiplicative terms as we are interested in representing the interaction between different features through them. This gives 26 016 input terms in total.
The model design means that all physical ocean variables at surrounding points are included in the prediction, as these are likely to impact the change in temperature at the central point. Geographic inhomogeneity is accounted for through inclusion of the location information. Further, the combination of this geographic inhomogeneity with physical ocean variables is included to a limited extent through some of the multiplicative terms in Eq. 1 (those terms which are a combination of latitude, longitude or depth with a physical variable input). Lastly, the non-linear interaction between physical ocean variables is also included to a limited extent through the remainder of the multiplicative terms. All input variables are normalised prior to fitting the model by subtracting their mean and dividing by their standard deviation.
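The construction of the normalised inputs and the multiplicative interaction terms can be sketched as below. This is a hedged illustration only: the array sizes are arbitrary stand-ins for the real inputs, and the sketch builds only pairwise cross terms, which is one plausible reading of the multiplicative terms in Eq. 1.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(100, 10))  # stand-in for real inputs

# Normalise each input: subtract its mean and divide by its standard deviation.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Append pairwise multiplicative interaction terms x_i * x_j (i < j),
# giving second-order polynomial inputs of the kind used in Eq. 1.
interactions = np.column_stack([X[:, i] * X[:, j]
                                for i, j in combinations(range(X.shape[1]), 2)])
X_poly = np.hstack([X, interactions])
```

With 10 raw features this gives 10 linear columns plus 45 cross terms; the full regressor, with all stencil inputs and location information, reaches the 26 016 terms quoted above.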

Limitations of the model
It should be noted that the model is a simple regressor, to allow for easy analysis of sensitivity. This however limits how accurately the model can fit the data, and how well it can represent the underlying system. In particular, we know the ocean to be highly non-linear, but allow only second order polynomial terms in the regressor, restricting the level to which it can capture the true dynamics.
The regressor here takes input data from only immediately surrounding grid cells, meaning it has no information about what is happening in the wider domain. This potentially prevents the regressor from making predictions far ahead, when the wider ocean state has more influence, but for the short time steps being forecast here (one day), this local information is expected to be sufficient. Indeed, here we are making predictions at time steps only double that used in the GCM, where the change at each cell is based predominantly on the state of only immediately surrounding cells.
We note that many existing papers looking at data-driven forecast systems focus on developing methods which can be applied iteratively to provide an alternative forecast system able to predict an arbitrary number of time steps ahead. We emphasise that the model described here would not, in its current form, be usable to produce an iterative forecast in this same way. There are two main reasons for this. Firstly, this model is unable to forecast near the boundary as it requires a full set of neighbouring input points. Secondly, this model requires a wider set of inputs than the outputs it produces. This means that if using this iteratively as an alternative forecast system to a GCM, even if a full set of initial conditions were given, we would still require some means of generating variables other than temperature, in order to provide the full set of inputs to the regressor for future time steps. Our interest here is not in deriving a data-driven analogue of the MITgcm simulation which might one day be used in place of the original simulator, but simply in assessing the sensitivity of this data-driven model to different variables. The set-up described here best allows us to assess the dependencies and sensitivities of a simple data-driven model to various inputs, and thus provides insight into the general ability of data-driven methods to learn the underlying dynamics of weather systems.

Training the regressor
The model is trained by minimising least squares errors with ridge regularisation (Hoerl and Kennard, 1970) between the regression model predictions and the actual outputs taken from the GCM over the training dataset. In any application of a regression model it is expected that the model will be used on data other than that used in the training of the model. To ensure the model performs well on unseen data, we want to ensure that the model learns the general pattern of the data, rather than specifically fitting to every point in the training data. This is particularly important where datasets are known to contain noise, as here fitting the data exactly would mean 'learning' the noisy representation of the system that the data portrays, rather than learning the underlying system itself. Regularisation techniques are applied to avoid the problem of overfitting (of matching the training data exactly) and work to limit the level at which the model can fit the data, ensuring the model can generalise well, i.e. it still performs well on new unseen data which shares the same underlying dynamics. Ridge regression is one such regularisation method, which works by minimising the size of the coefficients as well as the model errors. When using ridge regression an additional term is added to the loss function, so the training now focuses on minimising a combination of the square errors and the sum of the squared magnitudes of the β_i and γ_i,j values, with α acting as a tuning parameter determining the balance between these two terms.
We use a very small value of α = 0.001. This was found through cross validation with values of α ranging from 0.001 to 30. With larger values the regressor performed poorly, particularly when predicting larger temperature changes. Given the dataset comes from simulator output, we know that in this case noise or measurement error is not an issue, so the need for regularisation is limited. Similarly, whilst we have a large number of weights in our equation, the size of our training set is very large compared to this, which already acts to limit overfitting. Because of this we find only very small values are necessary.
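The ridge loss and the role of α can be sketched with a closed-form fit via the normal equations. This is a generic implementation on synthetic data, not the authors' code; a fit at the scale of the real dataset would typically use an iterative or library solver.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Minimise ||y - X b||^2 + alpha * ||b||^2 via the normal equations."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic sanity check: recover known weights from noiseless data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
true_b = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_b

b_small = fit_ridge(X, y, alpha=0.001)   # weak regularisation, near-exact fit
b_large = fit_ridge(X, y, alpha=1000.0)  # strong regularisation shrinks weights
```

With noiseless data and many more samples than weights, a tiny α recovers the true coefficients almost exactly, while a large α shrinks them towards zero, mirroring the poor performance observed with larger α values above.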

Sensitivity studies
We wish to investigate the sensitivity of the regressor to its inputs, in order to understand the ways in which the regressor is making its predictions. We do this in three ways. Firstly, we directly assess the coefficients (weights) used in the resulting regressor; this indicates which features are being most heavily used in the predictions. Secondly, we run a series of withholding experiments; this indicates which inputs are most necessary for accurate forecasts. Lastly, for the inputs which the withholding experiments identified as being most critical to forecasts, we assess the impact these have on errors, giving insight into how these inputs affect the forecasts.
We assess the coefficients simply through plotting a heat map of coefficients (Fig. 4). We then run a series of withholding experiments, in which the regressor is retrained with individual input variables withheld; this corresponds to running the first pass of a Backward Sequential Search interpretability analysis. We also run two further withholding experiments. In the first we assess the importance of providing information in the vertical neighbourhood of points: instead of the 3-d stencil originally used, we take a 2-d neighbourhood of points, 3x3, in only the horizontal direction, thus giving 9 inputs for each of temperature, salinity, etc. In the second we run without multiplicative terms, i.e. the model consists of only the first term in Eq. 1, giving a purely linear equation, enabling us to assess the impact of non-linearity on predictions.
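The difference between the 3-d and 2-d input stencils can be illustrated as follows. This is a sketch on a hypothetical small grid; the function name, grid shape and axis ordering are illustrative assumptions.

```python
import numpy as np

def stencil_inputs(field, i, j, k, mode="3d"):
    """Gather neighbouring values of one variable around interior cell (i, j, k).

    mode="3d": the 3x3x3 cube of 27 neighbours (including the cell itself);
    mode="2d": the horizontal 3x3 slice of 9 neighbours at the same depth.
    """
    if mode == "3d":
        block = field[i-1:i+2, j-1:j+2, k-1:k+2]   # 27 values
    else:
        block = field[i-1:i+2, j-1:j+2, k:k+1]     # 9 values
    return block.ravel()

# Hypothetical temperature field on a small (lat, lon, depth) grid
temp = np.arange(5 * 5 * 5, dtype=float).reshape(5, 5, 5)
x3d = stencil_inputs(temp, 2, 2, 2, mode="3d")
x2d = stencil_inputs(temp, 2, 2, 2, mode="2d")
```

The 2-d stencil is a strict subset of the 3-d one: it discards exactly the values vertically above and below the prediction point, which is the information the corresponding withholding experiment removes.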

The new regressors are trained in exactly the same way, using the same training and validation samples; the only difference is the number of input features used. Comparing results from these withholding experiments to the control run shows the importance of the withheld variable: if error increases significantly then the variable is necessary for accurate predictions. However, if the impact on error is small, the regressor is able to make predictions of similar accuracy with or without that variable, indicating it is not needed for good predictions.
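The withholding procedure can be sketched generically as below: retrain from scratch with one variable's input columns removed, then compare the error against the control. The feature groups, weights and plain least-squares fit here are illustrative stand-ins, not the actual model or variables.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def fit_ols(X, y):
    # plain least-squares fit (regularisation omitted for brevity)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Hypothetical feature groups: which columns of X belong to each input variable
feature_groups = {"temperature": [0, 1], "salinity": [2, 3], "currents": [4, 5]}

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))
y = X @ np.array([2.0, -1.0, 0.3, 0.1, 1.5, -2.5]) + 0.05 * rng.normal(size=400)

control = rmse(y, X @ fit_ols(X, y))       # run with all inputs
for name, cols in feature_groups.items():
    keep = [c for c in range(X.shape[1]) if c not in cols]
    Xw = X[:, keep]                        # withhold this variable's columns
    err = rmse(y, Xw @ fit_ols(Xw, y))     # retrain from scratch without it
    # err - control measures how necessary the withheld variable is
```

In this synthetic example, withholding a strongly weighted group inflates the error well above the control, while withholding a weakly weighted group barely changes it, which is exactly the diagnostic used above.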

Whilst these two methods (coefficient analysis and withholding experiments) both help to indicate the feature importance in the model, it should be noted they highlight different aspects of the importance of the input features. Looking at the coefficients of the trained regressor helps to identify which inputs are being most heavily used for the predictions from that particular regressor. By contrast, the withholding experiments indicate which variables are necessary to get predictions with the level of accuracy shown in the control. There may, for example, be scenarios where certain variables are heavily weighted and flagged as important through the coefficient analysis, but when these same variables are withheld, the regressor re-weights other variables during the training step and maintains a similar level of accuracy due to correlations and the strong co-dependency of ocean dynamics on multiple variables. Coefficient analysis helps us to understand how a particular instance of a regressor is working, whereas the withholding experiments help us to understand the impact and importance of each variable in creating skilful regression models more generally.

Lastly we analyse the resultant models from the three worst performing withholding experiments. We look at scatter plots of truth against prediction, and spatial plots of averaged absolute error to see how these models perform. We compare the average error plots to average errors in the control run (a run with all inputs) to see where errors are increased. We then compare this with the dominant processes driving temperature change in those regions (Fig. A1 and A2) and our expectations based on prior knowledge of ocean dynamics to assess if the regressors respond in the ways we expect.

Despite the relatively rare nature of these larger temperature changes, we feel that capturing these alongside the smaller changes is critical to building a robust model. The underlying dynamics of the system, which we hope the regression model is able to learn, drives the full range of temperature changes seen. As such, if we build a regressor which is unable to capture the extreme levels of change, this would indicate the model is not fully learning the physical dynamics as was intended. Capturing these extremes is also critical to obtaining a model which could (with further development) lead to a feasible alternative forecast system. While these extremes are limited in their occurrence, their impact on the ocean system is notable, particularly if we were interested in longer prediction timescales. A model unable to capture these fails to provide a useful starting point for development of an ocean forecast system. Although we note that the development of a data-driven forecast system is not the focus of this work, the ability of the model developed here to capture extremes is to some extent relevant from that perspective. Table 1 shows RMS errors and correlation coefficients for this run (top row) in comparison with a persistence forecast (bottom row). A persistence forecast is a forecast of no change; in this case, a forecast of zero temperature difference.

Persistence forecasts are commonly used as a benchmark in forecasting, and already provide a statistically good predictor here due to the limited temperature change across most of the simulator domain. However, we can see the regressor performs significantly better than persistence. As expected we can see from Table 1 and Fig. 2 the regressor performs less well over the validation dataset, however it still outperforms the persistence forecast.
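The comparison against the persistence baseline can be sketched as follows. The data here are synthetic stand-ins chosen to mimic a domain where most temperature changes are small; none of the numbers are the paper's actual statistics.

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Hypothetical true one-day temperature changes: mostly near zero,
# with occasional larger changes, mimicking the simulator domain.
rng = np.random.default_rng(3)
delta_true = rng.normal(scale=0.01, size=1000)
delta_true[:50] += rng.normal(scale=0.5, size=50)  # rare large changes

# Persistence forecast: no change, i.e. predicted temperature difference is zero.
persistence = np.zeros_like(delta_true)

# Stand-in for a skilful regressor's forecast (truth plus a small error)
regressor = delta_true + rng.normal(scale=0.005, size=1000)

rmse_persistence = rmse(delta_true, persistence)
rmse_regressor = rmse(delta_true, regressor)
```

Because most true changes are near zero, persistence already achieves a low RMSE, which is why it is a meaningful benchmark; a model only counts as skilful if it beats this baseline.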

Spatial patterns of errors 280
We calculate temporally averaged absolute errors to give us an indication of how the regression model performs spatially.
These averages were created by taking the MITgcm state at 500 different times from the 20 year dataset and using these fields as inputs to the regressor to forecast a single time step ahead. The set of forecasts created from these 500 input states are compared to the truth from the GCM run, and the absolute errors between the truth and predictions are then temporally averaged. To emphasise, this is an average of 500 single time-step predictions, and not an average from an iterative run.
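Computing a temporally averaged absolute error field from a stack of independent single-step forecasts can be sketched as below; the grid dimensions and error magnitudes are illustrative.

```python
import numpy as np

# Hypothetical stack of single-step predictions: 500 forecast times,
# each a (lat, lon, depth) field of predicted temperature change.
rng = np.random.default_rng(4)
truth = rng.normal(size=(500, 8, 8, 4))
preds = truth + rng.normal(scale=0.1, size=truth.shape)

# Temporally averaged absolute error at each grid cell: average the
# absolute error over the 500 independent single-time-step forecasts.
mean_abs_err = np.abs(preds - truth).mean(axis=0)   # shape (8, 8, 4)
```

Averaging over the time axis only, rather than over space, is what preserves the spatial structure of the errors shown in the maps.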

The set of input states spans the full 20 year MITgcm dataset, but with sub-sampling to take every 14th day (as opposed to every 200th day as was used in creating the training and validation sets). This results in a far larger set of input states than present in the training and validation data. The results here are therefore not specific to either the training or validation set, but instead show performance over a larger dataset which shares occasional samples with both.
These averaged errors are shown in Fig. 3. Note the regressor is only applied away from boundary and land points (in its current form it cannot deal with spatial locations which are not surrounded on all sides by ocean points), hence points close to land are not included in these plots.

We see that density is very heavily weighted, and is therefore providing a large part of the predictive skill of this model; this is in line with our physical understanding that density changes are driving convective temperature change. The interactions between temperature at the point we are predicting for and temperature at surrounding points are also very highly weighted. This is in line with our physical knowledge of advection and diffusion driving temperature change.
First we assess the sensitivity of the trained regressor by direct coefficient analysis. Figure 4 plots the magnitude of the coefficients in Eq. 1. Figure 4a shows coefficients averaged over all input locations for each variable type (i.e. for most variables there are 27 inputs, corresponding to the 27 neighbouring cells; we average over these to give a single value for each variable (temperature, salinity, etc), and for each polynomial combination of variables). Figure 4b shows the coefficients related to polynomial interactions of temperature with temperature; these are the raw coefficients, without any averaging applied.
High-weighted inputs (those with a large magnitude coefficient) are variables which are heavily used in the predictions and therefore considered important for the predictive skill, whereas inputs with low magnitude coefficients can be considered less important. Again, we emphasise that the coefficients highlight which features are being predominantly used in this model, but this is not necessarily what is needed to create a skilful model; for that we need to look at the withholding experiments.
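Averaging coefficient magnitudes over the stencil locations of each variable, as in the per-variable summary of Fig. 4a, can be sketched generically; the variable names and number of groups here are illustrative, not the model's actual input list.

```python
import numpy as np

# Hypothetical fitted coefficients: 27 stencil locations for each of
# 4 variables (temperature, salinity, u-current, v-current).
rng = np.random.default_rng(5)
coefs = rng.normal(size=4 * 27)
groups = {name: slice(g * 27, (g + 1) * 27)
          for g, name in enumerate(["temperature", "salinity", "u", "v"])}

# Average the coefficient magnitudes over the 27 input locations of each
# variable, giving one importance value per variable.
importance = {name: float(np.abs(coefs[sl]).mean()) for name, sl in groups.items()}
```

Taking the mean of absolute values, rather than of the signed coefficients, prevents positive and negative weights at different stencil locations from cancelling in the summary.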

320
From Fig. 4a, we see that density (as a linear term, not in combination with other variables) is by far the most highly weighted variable in this model. The regressor is using density information as a very large component of its predictions. This is promising, as from our physical understanding of the system we know density is key to ocean dynamics. Unstable density profiles contribute to the large temperature changes seen in the south and very north of the domain, and for geostrophic currents the flow follows the density stratification.

325
More generally we see that the location information is low weighted, particularly when interacting with other variables.
This indicates the regressor is not basing its predictions predominantly on the location of points, but on the physical variables themselves.
From Fig. 4b we see that the multiplicative interactions between temperatures at different input locations are very highly weighted for certain combinations of locations. Specifically, it is the interaction between temperature at the grid point we are predicting for and temperature at all surrounding points which gives the bright banding. This fits well with our physical expectation of the system, as diffusive and advective fluxes of temperature are dominated by local gradients in temperature.

Using a 2-d stencil means the regressor has no information about the ocean vertically above and below the location being predicted, and cannot use the vertical structure of the ocean in its prediction. We know this information to be important in the dynamics of the simulator, particularly in the south of the domain and the very north where vertical processes driven by the MOC affect temperature, and so it is reassuring that withholding it has such a large impact on error.

Withholding experiments
By restricting the regressor to purely linear terms (withholding polynomial interactions) we see the largest increase in error over the training set. That this purely linear version of the regressor performs poorly is also expected given our physical understanding of the problem being modelled. The ocean is known to be a complex, highly non-linear system, and we would not expect a purely linear regressor to be able to accurately replicate the detail and variability of these complex interactions.

Summary of withholding experiments
These withholding experiments emphasise the need for a regression model to have information on currents and vertical structure, as well as enough complexity to capture some of the non-linearity of the system, in order to provide even a basic level of skill in forecasting temperature change in the ocean. The feature importance displayed here by the regressor is consistent with the importance of these inputs in the dynamical system we are modelling, implying the model is dependent upon the variables we would expect. Therefore, we are confident that the regressor is, to some extent, learning physical constraints rather than purely statistical patterns that might lack causality.

Further analysis of withholding experiments
We further investigate the results of the three worst performing models from the withholding experiments: withholding information on the currents, providing only 2-d inputs, and a purely linear model. We look closely at the model predictions and compare these with the control run to infer how the variables are impacting predictions.

Figure 5 shows the performance of the purely linear model, that is, the model trained without any multiplicative terms. We see that without multiplicative terms the model can capture the mean behaviour of the system (zero change in temperature) but is unable to capture any of the variability. This mean behaviour alone does not provide useful forecasts, as can be seen from the statistics, particularly the correlation coefficient for this experiment. Comparing Fig. 5 with Fig. 2 we see the importance of the non-linear terms in predicting temperature change, especially for samples where temperature change is non-zero. Non-linearity is shown to be critical to modelling the variability of temperature change.