Using Neural Network Ensembles to Separate Biogeochemical and Physical Components in Earth System Models

Earth system models (ESMs) are useful tools for predicting and understanding past and future aspects of the climate system. However, the biological and physical parameters used in ESMs can have wide variations in their estimates. Even small changes in these parameters can yield unexpected results without a clear explanation of how a particular outcome was reached. The standard method for estimating ESM sensitivity is to compare spatiotemporal 10 distributions of variables from different runs of a single ESM. However, a potential pitfall of this method is that ESM output could match observational patterns because of compensating errors. For example, if a model predicts overly weak upwelling and low nutrient concentrations, it may compensate for this by allowing phytoplankton to have a high sensitivity to nutrients. Recently, it has been demonstrated that neural network ensembles (NNEs) are capable of extracting relationships between predictor and target variables within ocean biogeochemical models. Being able to 15 view the relationships between variables, along with spatiotemporal distributions, allows for a more mechanistically based examination of ESM outputs. Here, we investigated whether we could apply NNEs to help us determine why different ESMs produce different results. We tested this using three cases. The first and second case use different runs of the same ESM, except the physical circulations differ between them in the first case while the biological equations differ between them in the second. Our results indicate that the NNEs were capable of extracting the relationships 20 between variables, allowing us to distinguish between differences due to changes in circulation (which do not change relationships) from changes in biogeochemical formulation (which do change relationships).


70
The objective of this paper is to investigate whether the application of NNEs and sensitivity analyses can provide useful information for determining differences in ESM outputs. In general, there are three primary drivers that lead to differences in the output of ESMs: physical forcings, phytoplankton physiology, or combinations of these two. Before applying this method to outputs of multiple ESMs, we chose to investigate whether the method worked well on different runs of a single ESM in which physical parameters were changed to produce different circulations. It was 75 uncertain whether the NNEs would be able to pick out the same apparent relationships of the same ESM when there were differences between runs in the physical forcings and intrinsic biological equations (phytoplankton physiology).
If different versions of an ESM, which have different circulations, still yield the same apparent relationships between light/nutrients and biomass, it would suggest that circulation changes do not produce new patterns of co-limitation.
Furthermore, it would suggest that differences in the apparent relationships of different ESMs could be partitioned 80 between those due to different physical circulations and those with different representations of biology. For example, if one uses the apparent relationships from model A to predict the biomass from model B given the environmental parameters from model B, any differences should be due to differences in the biological formulation.
To investigate the extent to which NNEs could characterize differences across ESMs, we explored three cases: 85 to which physical circulation contributes to these apparent relationships, and to identify challenges in comparing the apparent relationships between ESMs.

Earth System Models -Biogeochemical Codes
In general, biogeochemical components of ESMs predict the evolution of phytoplankton biomass using equations that 110 have the general form + ⃗ ⃗ * = ( , , ) * − ( , … ) + * ⃗⃗⃗ * (1) where ⃗ is the three-dimensional velocity field, is the phytoplankton growth rate which is a function of nutrients N, light and temperature T, ( , … ) represents the grazing loss rate, which may be a function of phytoplankton biomass and/or other variables such as temperature or zooplankton concentration, and ⃗ ⃗ is the three-dimensional mixing tensor.
Changes in physical parameters (for example changing the values in ⃗ ⃗ ) would produce changes in transport of 115 biomass. But the associated changes in circulation would also produce changes in other fields, such as N, Light, and where is a grazing rate, * is a biomass scaling for grazing, and is a grazing exponent. The grazing rate includes all losses due to grazing, viral lysis, temperature-dependent loss, and others. For the small phytoplankton size class α = 1 and for the large phytoplankton size class α= 1/3. This means the large phytoplankton biomass is more sensitive to environmental conditions that then small phytoplankton biomass. The growth rate (μ) in Eq. (2) goes as where is the growth rate, is the temperature with constant = 0.063°C -1 following Eppley (1972), governing phytoplankton physiology that determine the growth rate.

Tracers of Phytoplankton with Allometric Zooplankton (TOPAZ)
TOPAZ is a prognostic biogeochemical model included in the Geophysical Fluid Dynamics Laboratory (GFDL) ESM2M ; Fig. 2). It includes a total of 30 tracers to model cycles such as nitrogen, phosphorus, iron, oxygen, carbon, and others (Nutrients (N) in Fig. 2). TOPAZ models three phytoplankton groups (small, large, 150 and diazotrophic; Biomass (B) in Fig. 2) with light limitation based on the equations of Geider et al. (1997).
Additionally, phytoplankton loss/grazing and particle export are modelled using the same size-dependent formulation as those used in Eq. (2), though without imposing the quasi-equilibrium assumption that leads to Eq. (4). TOPAZ differs from BLING in its number of tracers (and associated limitations) and the allowance for advection/diffusion of   Fig. 2). This means that the loss rate of phytoplankton in TOPAZ is effectively a 155 function of circulation as well the temperature and biomass-dependent grazing rate, ( * ) . This will produce different biomasses in locations that have the same growth rates. Additionally, a key difference between BLING and TOPAZ is that the latter includes denitrification and nitrogen fixation. This then means (as suggested by Tyrrell (1999)) that the nitrogen is the proximate limiting nutrient, while phosphorus is the ultimate limiting nutrient; a distinction that is not made in BLING. 160

Case 1 -Same ESM: Identical Biological Equations, Different Physical Circulations
The aim of Case 1 was to quantify the extent to which differences in physical circulations between model runs of the same ESM with identical intrinsic biological relationships would affect the apparent relationships found by NNEs. As stated in Section 2.1, we chose to compare versions of GFDL ESM2Mc in which BLING is configured identically so 165 we can be certain the differences are solely due to circulation changing the environmental conditions, and not the phytoplankton loss rates.
We chose to use three configurations of GFDL ESM2Mc. The three model runs consisted of: a standard historical preindustrial model spin-up (BLING -PI Control), a similar case to the first but where the carbon dioxide concentration 170 was four times higher (BLING -4x CO2), and a case similar to the historical spin-up except that the horizontal mixing parameter was three times higher (BLING -3x Mixing). These model runs are described in greater detail in Gnanadesikan et al. (2013), Pradal and Gnanadesikan (2014), and Bahl et al. (2020). With the standard historical model essentially serving as a form of a "control," the two remaining cases allowed us to examine if changes in the physical circulation would result in changes to the apparent relationships. 175 The predictor variables for each model run were macronutrient (ex. phosphate), micronutrient (ex. dissolved iron), irradiance, and temperature. The target variables were small phytoplankton biomass and large phytoplankton biomass.
One NNE was trained for each target variable of each model run for a total of six NNEs in Case 1 (three model runs and two target variables in each run). Details of the NNE training can be found in Section 2.3. 180

Case 2 -Same ESM: Different Diagnostic Biological Equations, Near-Identical Physical Circulations
The purpose of Case 2 was to quantify the differences found by NNEs between the apparent relationships of model runs from the same ESM when the biological equations differ between runs, but the physical circulations are nearly identical.

185
As in Case 1, we again chose to use different configurations of ESM2Mc, but this time we keep the physical parameterizations constant but change constants within the BLING BC. We used two model runs: the standard https://doi.org/10.5194/gmd-2021-167 Preprint. Discussion started: 19 August 2021 c Author(s) 2021. CC BY 4.0 License.
historical pre-industrial model spin-up used in Case 1 (BLING -PI Control) and one with similar distributions to PI Control but different half-saturation coefficients (KFe and KPO4 in Eq. (3)) for small and large phytoplankton (BLING -LgSm). Changing the half-saturation coefficients, which directly affects phytoplankton growth, is analogous to 190 changing the biological equations. Relative to the PI Control, the half-saturation coefficients in LgSm were decreased by √3 for small phytoplankton and increased by √3 for large phytoplankton. While these changes produce small differences in circulation and SST (R 2 = 0.9949 for SST between the two model runs) via changing the absorption of shortwave radiation, these differences are small. The primary impact of these changes affects the distribution of nutrients. 195 The predictor variables for the model runs of Case 2 were the same as those in Case 1 (macronutrient, micronutrient, irradiance, and temperature). Likewise, the target variables were also the same as those in Case 1 (small and large phytoplankton biomass). A total of four NNEs were trained for Case 2 (two model runs and two target variables).

Case 3 -Different ESMs: Prognostic vs. Diagnostic Biological Equations, Identical Physical Circulations 200
For Case 3, the goal was to examine whether the results from a diagnostic BC from Cases 1 and 2 still held when a prognostic BC was used. Our goal was to examine any differences in apparent relationships, along with identifying challenges when comparing apparent relationships across more realistic ESMs. In this experiment, the BCs were governed by different biological equations, but were run within the same physical model so that the temperatures and light seen by the two BC codes were identical. 205 One of our model simulations uses a version of BLING as the BC, while the other uses TOPAZ. For the BLING model run, the iron concentrations were fixed at their climatological values since this formulation was previously used to develop a model for very high-resolution studies (miniBLING). We chose this pair of simulations as the miniBLING code was run in an identical physical circulation to the TOPAZ model run and so the light and temperature experienced 210 by the two model ecosystems are identical. As described in Galbraith et al. (2015), the output is from the ocean component of ESM2M forced with historical atmospheric forcing which we denote as ESM2Mo.
The predictor variables for Case 3 were limited to variables that were present in both ESMs: macronutrient, micronutrient, and irradiance. The target variable was total biomass. The biomass was not split into small and large 215 phytoplankton biomass because the miniBLING output only contained total biomass. For consistency, the small and large phytoplankton biomass values in TOPAZ were combined to give total biomass. Two NNEs were trained for Case 3 (two ESM runs and one target variable).

Neural Network Ensembles (NNEs)
Neural network ensembles (NNEs) are an ensemble machine learning (ML) method. NNEs are comprised of a 220 collection of individual NNs where the predictions of each NN are averaged into a single prediction. This ensemble approach has been shown to outperform individual NNs and reduce the generalization error within a dataset (Hansen  and Salamon, 1990) by turning individual "weak learners" into a single "strong learner." Individual neural networks (NNs) can fit a non-linear function to a dataset without assuming any prior knowledge of the system. For a more thorough discussion of NNs, please refer to Schmidhuber (2015). The basic structure of the NN approach that we use 225 here is described in Appendix 1 of Scardi (1996).
We chose to use NNEs for several reasons: The ensemble approach of NNEs allows us to view the uncertainty in any given prediction based on the individual predictions of each NN. 230

2.
NNEs possess some capability of extrapolating outside the range of the data on which they are trained.

3.
As recently shown in Holder and Gnanadesikan (2021a), NNEs were able to more closely reproduce the underlying intrinsic relationships compared to RFs, mainly because of their ability to extrapolate.
The structure of the individual NNs was consistent between the three cases with each NN containing 25 hidden nodes 235 in the hidden layer with a hyperbolic tangent sigmoid activation function and 1 node in the output layer with a linear activation function. The only difference between each case was in the number of input nodes: Cases 1 and 2 each contained four input nodes (one for each predictor) and Case 3 contained three input nodes. For example, one NN of the NNE for the small phytoplankton biomass target variable in Case 1 would have the following structure: 1.
The four predictor variables for Case 1 (first column of Table 1) correspond to the four nodes in the input layer 240 of the NN.

2.
Each of the four input nodes is connected by weights to each of the 25 nodes in the hidden layer. Additionally, a bias term is connected to each of the hidden nodes.

3.
Each of the nodes in the hidden layer is connected by weights to the single node in the output layer, which for this instance would correspond to the target variable of small phytoplankton biomass. As with the hidden layer, 245 a bias term is connected to the single output node.
The training of each NN was carried out using the "feedforwardnet" function in MATLAB 2019b (MATLAB, 2019).
The training was stopped when the error between the predictions and observations increased for six consecutive epochs. in the hidden layer. The settings specified here allowed for reasonable training times while maintaining high performance metrics. For more detailed information, see Appendix B2 in Holder and Gnanadesikan (2021a). 260 Each variable was also scaled between -1 and 1 relative to each variable's maximum and minimum Where V is the value of a variable being scaled, S (subscript) is the scaled value, and U (subscript) is the unscaled value. This scaling puts the predictor values in the same range, so more weight is not given to variables with larger ranges. Additionally, this step decreases the training time of the NNs so that no values are too close to the limits of 265 the hyperbolic tangent sigmoid activation function. The variables and predictions were then scaled back to their original values for analysis and presentation of the results (Eq. (6)). The letter representations in Eq. (6) are the same as those in Eq. (5).
When using ML, it is possible to produce overly complex relationships that "overfit" the data. This provides a good 270 match for the data on which an ML model is trained but leads to poor predictions when new data is presented to the model. This can be avoided by splitting a dataset into training and testing subsets. For this manuscript, this means each NNE was trained using only the observations in the training subset and tested on the observations from the testing subset. The data from each model run was randomly split into training and testing subsets with 60% of the observations from a dataset going to the training subset and the other 40% going to the testing subset. The observations set aside in 275 the testing subset were ones that the NNEs never saw during their training phase. This provides a way to evaluate each trained NNE and its generalizability. If performance metrics of a trained NNE are similar between the training and testing subsets, it suggests that the variance of the dataset is well captured in the training phase and the NNE is generalizable to the entire dataset.

280
To assess the performance of each NNE, we calculated the R 2 values and root mean squared error (RMSE) by comparing the predictions from each NNE to the actual values within the respective training and testing subsets. The NNEs in each case and matching size class were also asked to make predictions on the testing subsets of the other model runs. For example, in Case 1 the NNE trained on the small phytoplankton of PI Control was asked to make 285 predictions for small phytoplankton of 4xCO2 using the values of the predictors from the testing subset of the 4xCO2 model run. These results were then compared to the actual values of the target variable to calculate the RMSE. This RMSE was then used to calculate the percent increase/decrease in error when compared against the RMSE calculated from a point-by-point comparison of each model run against the others. The purpose of this was to provide another metric for testing if the NNEs had found common apparent relationships across model runs. If an NNE trained on one 290 model run was able to accurately predict the outcomes of the other model runs leading to a reduction in the RMSE, it would suggest that the NNE had found similar apparent relationships between the model runs. On the other hand, if it showed an increase in RMSE, it would suggest that the apparent relationships between the model runs were different in biologically important ways.

295
To view the apparent relationships found by the NNEs, we conducted sensitivity analyses in which we presented each NNE with a unique set of values for the predictors. Compared to spatiotemporal distributions and time series, sensitivity analyses allow for the visualization of relationships between predictor and target variables. In each sensitivity analysis, one predictor was varied across its minimum and maximum range, while the other variables were held at a specified value, such as each variable's 25 th percentile. This was repeated for the 50 th and 75 th percentile 300 values of each variable as well. This allowed us to visualize how the biomass predictions changed across one variable's range when the other variables were held at a specified value. An example of this would be varying the macronutrient concentration while holding the micronutrient, irradiance, and temperature variables at their 25 th or 75 th percentile  values. This allowed us to see how the macronutrient concentration affected biomass when other nutrients were low or high, respectively. 305

Case 1 -Same ESM: Identical Biological Equations, Different Physical Circulations
In Case 1, our objective was to quantify the extent to which differences in physical circulation might affect the apparent relationships found by NNEs when the intrinsic biological relationships remained the same between the model runs and the physical circulation parameters differed. It was uncertain whether changing the circulation would push the 310 biology into fundamentally new states (i.e., different apparent relationships) or whether the physical circulation would simply act to change the location of where combinations of light and nutrients were found (ie. same apparent relationships).
Our results support the latter case, in that the locations of particular environments were simply being shuffled around. 315 The sensitivity analysis showed that each NNE found similar apparent relationships between biomass and each of the predictors for the respective size classes, insofar as each line fell within the standard deviation of the others ( Fig. 3 and 4). For example, the standard deviation (gray region) around the predicted apparent relationships for the large phytoplankton (dashed lines) all overlap one another (Fig. 3). The same can be seen for the predicted apparent relationships for the small phytoplankton (Fig. 4). Additionally, we were confident in the apparent relationships since 320 each NNE acquired high performance metrics in both the training and testing subsets (highest RMSE = 3.11x10 -9 mol kg -1 ; Table 2) relative to the mean value of the total biomass (1.24x10 -8 mol kg -1 ).
This result can be better understood by considering the conceptual diagram of how the diagnostic BC BLING works within an ESM (Fig. 1). For each time step, nutrients are calculated as a function of three terms: the initial nutrients, 325 the change in nutrients from biology, and the change in nutrients from physical circulation. In contrast, the biomass is only a function of two terms: the initial biomass values and the change in biomass due to biological cycling. Thus, biomass is not directly affected by changes in the physical circulation. Additionally, this means that when given information on the biological predictors, but not the physical parameters, the NNEs were able to back out the apparent relationships quite well. That similar apparent relationships were found between the model runs was further supported when we tasked each trained NNE with making predictions on the testing subsets of the other model runs for the same size class. For example, the NNE trained on the PI Control for small phytoplankton was tasked with making predictions for the small phytoplankton biomass of 4xCO2 and 3xMixing using the predictor values from their testing subsets. This test allowed 335 for the evaluation of the robustness of the relationships that each NNE found. If the NNEs were finding different relationships between the model runs, the NNE from one model run would likely perform poorly when predicting on the other model runs. Our results show that the NNEs performed well when applied to the other model runs (highest RMSE = 3.74x10 -9 mol kg -1 ; Table 3) relative to the average value of total biomass (1.24x10 -8 mol kg -1    Additionally, using the NNEs to predict the other runs led to decreases in error relative to the error from comparing each run against the others. For example, the initial point-by-point comparison of 4xCO2 and PI Control for small phytoplankton (Fig. 5 d) showed an RMSE of 3.06x10 -9 mol kg -1 , while using the NNEs from each model run to 345 predict the other saw the RMSE go down with a reduction in error of about 76% (Table 3). This reduction of error was consistent across the other model runs and size classes with error reductions varying from 54-79% (     That the NNEs from one model run were able to reproduce the results from the other model runs was not simply due 350 to the models producing similar spatiotemporal patterns. To ensure that distinct differences between the model runs were present, we compared each model run against the others (Fig. 5 and 6). Differences in the biomass values between the three model runs were evident ( Fig. 5 and 6). First, we compared each model run against the others in a point-bypoint analysis and observed that different biomasses were occurring at the same spatiotemporal locations (Fig. 5   of having higher biomass values than 4xCO2 across most locations (Fig. 5 d). Additionally, we looked at the contour plots and log relative ratios using the yearly averaged biomass for each case (Fig. 5 and 6 a-c, e, f, i). Specific large differences that we noted were higher biomass in the Pacific and Northern Atlantic in PI Control and 3xMixing relative to 4xCO2 (Fig. 5 and 6 b, f) and the highest biomass in occurring in 3xMixing in the subtropical regions of the Pacific ( Fig. 5 and 6 c). Similar patterns were observed in the large phytoplankton, as well (Fig. 6). These differences between 360  the model runs are relatively large (exceeding a factor of three in some locations) and allow us to dismiss the possibility that the similar apparent relationships were only due to strong similarities between the model runs.
Although the sensitivity analysis allowed us to see that the apparent relationships were the same for each size class, it also allows us to see how the two size classes react differently to the same conditions. Most notably, the large 365 phytoplankton seem to be very sensitive to the micronutrient compared to the small phytoplankton ( Fig. 3; closer view of small phytoplankton in Fig. 4). When the other predictors are held at their 75 th percentile values (high macronutrient, high irradiance, and warm temperature), the large phytoplankton are able to reach biomass values almost an order of magnitude higher than the small phytoplankton ( Fig. 3 and 4 j). This is what would be expected given the cubic relationship of large phytoplankton with growth rate. Another interesting relationship is the stark asymptotes in both 370 size classes of the macronutrient plots, suggesting limitations by other nutrients, likely the micronutrient (Fig. 3 a, e, i). One unexpected relationship was the decreasing biomass with increasing temperature in both size classes (Fig. 3 d, h, l). This could be a result of warmer regions having less available nutrients or because of the temperature dependent Chl:C ratios which would lead to phytoplankton needing more light in warmer waters.

375
Relative to our main objective in Case 1 to quantify the extent to which differences in physical circulation affect the apparent relationships, our results indicated that the different physical circulations did not produce differences in the apparent relationships found by NNEs. When the biological equations remained the same, changing the physical parameters simply changed where combinations of nutrients and light occurred. In contrast to changes in nutrients, changes in biomass in the BLING ESM were not a function of the physical circulation. 380

Case 2 -Same ESM: Different Diagnostic Biological Equations, Near-Identical Physical Circulations
In Case 1, it was clear from our results that when the biological cycling was identical between model runs, the NNEs found the same apparent relationships because the biomass was not a function of the physical circulation. Since the biomass is clearly a function of the biological equations, it would be reasonable to assume that the apparent relationships would be different between model runs that are governed by different biological equations. So, for Case 385 2, the objective was to quantify the extent to which NNEs could detect differences in the apparent relationships when the intrinsic biological relationships between model runs were different, while maintaining similar physical circulations and still using a diagnostic model which guarantees that identical nutrient, light, and temperature at two different points will produce identical biomass.

390
Our results show that NNEs can differentiate the apparent relationships between model runs when the biological equations differ. The sensitivity analysis for Case 2 shows that different apparent relationships were found between model runs and within the same size classes, relative to the non-overlapping gray standard deviation regions around each line ( Fig. 7 and 8). Additionally, we can be fairly confident in these predictions given the high-performance metrics in both the training and testing subsets (highest RMSE = 3.11x10 -9 mol kg -1 [  Fig. 1). While it might seem obvious that changing the biological 400 equations will change the biomass values, it remained unclear whether NNEs would be able to pick out these differences in the apparent relationships.    We wanted to ensure there were noticeable differences between the model runs ( Fig. 9 and 10). We did this in Case 1 to ensure that the similar apparent relationships found by the NNEs were not simply because of similarities in the 405 model output. In Case 2, the difference in model outputs serves to reinforce the different apparent relationships found by the NNEs. In the point-by-point comparison, the large phytoplankton showed more agreement between model runs ( Fig. 10 c) than the small phytoplankton (Fig. 9 c). However, when we examined the contour and log relative ratios Although the gray regions overlap toward the higher concentrations of each predictor, this is likely due to the lack of observations in the training data meeting that criteria, without which the NNEs could not be constrained. For example, 415 in Fig. 7 (j), the apparent relationships of the large phytoplankton overlap past about 5 x 10 -10 mol kg -1 of the micronutrient, because there were no observations in the training data that were greater than 5 x 10 -10 mol kg -1 of the micronutrient while simultaneously being at the 75 th percentile level of the macronutrient, irradiance, and temperature (data not shown). Without observations to constrain them, the NNEs were unable to be constrained and, therefore, less certain about the relationships in those regions which lead to higher uncertainty and overlapping standard deviations. 420 As in Case 1, our result was supported by the additional test in which the NNEs trained on one model run were tasked with making predictions on the other. Had the NNEs found similar apparent relationships, the reductions in error would have been of similar magnitude as those in Case 1 ( Table 3 vs Table 4). For this second case, we saw that there were only modest decreases in RMSE for the small phytoplankton and increases in RMSE for large phytoplankton (Table 4). For example, relative to the RMSE of the point-by-point comparison, the RMSE decreased about 21% when 425 LgSm made predictions on PIControl for the small phytoplankton (Table 4). Additionally, it was observed that even though the RMSE increased in the large phytoplankton, the R 2 values improved in the cross-model comparison compared to the point-by-point comparison (0.92-0.93 vs 0.85; Table 4). This suggests that the NNEs improved the simulation in terms of the overall pattern, but not the magnitude. These results make sense since the apparent relationships of the small phytoplankton showed greater similarities than the apparent relationships of the large 430 phytoplankton (Fig. 7).
With respect to the apparent relationships that the NNEs uncovered, the large phytoplankton once again appeared to be more sensitive to the micronutrient concentrations compared to the small phytoplankton ( Fig. 7 b, f, j). Both size classes showed asymptotes around the same concentrations for the macronutrient, albeit at different biomass values 435 ( Fig. 7 a, e, i). As with Case 1, the decreasing biomass with increasing temperature was an unexpected relationship ( Fig. 7 d, h, l), which might be explained by the temperature dependent Chl:C ratios causing phytoplankton in warmer regions to need more light.
Our main objective with Case 2 was to quantify the extent to which NNEs could detect differences in the apparent 440 relationships when the physical conditions between model runs were identical and the biological relationships differed.
With the biomass being a function of changes in biomass from biology (ie. the equations governing how nutrients affect biomass), different biological equations produced differences in biomass. What was unclear was whether NNEs would be able to highlight these differences in the apparent relationships. Our results indicate that NNEs could find noticeable differences in the apparent relationships, insofar as the standard deviation regions did not often overlap in 445 the sensitivity analysis curves.

Case 3 -Different ESMs: Prognostic vs. Diagnostic Biological Equations, Identical Physical Circulations
From Cases 1 and 2, we learned from our results that NNEs were capable of discerning differences in apparent relationships between model runs of the same ESM. For Case 3, we wanted to apply these principles to different ESMs to quantify the differences in the apparent relationships and highlight challenges that arise in comparing relationships 450 between ESMs. The model runs of Cases 1 and 2 using BLING as a BC afforded us the opportunity to test a "best-   Biomass (mol Our results indicate that the phytoplankton in the two ESMs react differently to the same conditions. It should be noted total phytoplankton biomass was used for Case 3, rather than splitting the biomass into large and small because phytoplankton output by the miniBLING BC is not differentiated into size classes. The sensitivity analysis shows that the miniBLING simulation produces higher biomass concentrations than the TOPAZ simulation under the same  This is not entirely unexpected since the biomass values in the miniBLING simulation were generally much higher than those in the TOPAZ simulation, as can be seen in the point-by-point comparison (Fig. 12 c). However, not all of the biomass values in the miniBLING simulation were larger than those in the TOPAZ simulation. The subtropical Atlantic regions and northern subtropical Pacific had higher yearly averaged biomass values in the TOPAZ simulation compared to the miniBLING simulation (Fig. 12 a, b, d). As with Case 2, the additional test of asking the NNEs trained 470 on the output of one ESM to predict the the output from the other ESM reinforced the result that different apparent relationships were found from an increase in error for both ESMs ( Table 5).
The challenge of comparing the results of different ESMs was evident in Case 3. For example, the performance metrics for the model runs in Cases 1 and 2 were relatively high in both the training and testing subsets, but the performance 475 metrics for the TOPAZ simulation in Case 3 were much lower (R 2 > 0.97 vs ~0.58, respectively; Table 2). It was unclear whether this drop in performance was because we were unable to characterize the TOPAZ simulation with NNEs using predictors common to both runs or whether we simply did not include enough relevant variables. To understand this, we performed a brief analysis in which we trained NNEs on specific variables and measured their performance with ESM output from CMIP5 ESM2M, which has TOPAZ as its BC (  using only variables that directly affected the phytoplankton growth rate (biology), one was trained using only variables that did not directly affect the growth rate (physics), and another was trained with both sets of variables (all).
Our results indicated that we were able to characterize ESM2M (and, by extension, results produced by using TOPAZ as a BC) with NNEs with the inclusion of more relevant variables, such as nitrate, ammonium, and silicate (RMSE ~ 5.90x10 -5 mol N m -3 [ Table 6] vs. the average biomass value of 3.14 x10 -4 mol N m -3 ). Without the inclusion of all the 485 relevant variables as predictors, the performance of the NNE trained on output from the TOPAZ simulation suffered compared to the NNE trained on the miniBLING simulation.
An additional challenge with comparing different ESMs is that certain variables may not be present in all ESMs. For example, one ESM may have phosphate included as a variable and another ESM may not. This presents a problem 490 when using the sensitivity analyses, because each NNE needs to be presented with the same conditions for direct comparability. One potential method for mitigating this could be to use proxy-variables, such that variables not common to both ESMs could be modified to represent the missing variables. For example, if one ESM had phosphate as a variable, another ESM did not, it might be possible to modify a variable that would be equivalent to phosphate, such as nitrate. Using the Redfield ratio of 16:1 for the N:P ratio, the nitrate variable could be divided by 16 and thus 495 be considered a proxy variable for phosphate. This proxy phosphate variable could then be used in training the NNE particular to applicable ESM, so all NNEs would be trained using the same predictors.   Our results indicate that when all the relevant variables are included as predictors, the NNEs are a parsimonious representation of the ESMs and we can be relatively confident in their predictions. This confidence then allows us to query these NNEs using sensitivity analyses to find the apparent relationships, which provide information on the relationships between the predictor and target variables. 520 With the first case, the similar performance metrics in the within-and cross-model comparison, along with the overlapping apparent relationships demonstrated that the NNEs were able to attribute differences between the model runs to physics. Likewise, in the second case, where the biological relationships differed, the NNEs were capable of attributing differences between the model runs to biological factors and were able to identify the elements of that 525 formulation that were different.
With the third case, we were able to show that it is possible to compare the apparent relationships between two different ESMs and that key differences can be found. However, this case also highlighted some of the challenges when comparing output from multiple ESMs. In order to adequately capture the variability and achieve high performance 530 metrics, all relevant variables for predicting an outcome must be included as predictors for each NNE. However, this presents a problem when one ESM may have a variable and another ESM does not. One possible solution is to use proxy variables, such that one variable can be modified to be representative of another.
The results of our study suggest that oceanographers and climate scientists could use the methods we have 535 demonstrated to compare apparent relationships between ESMs, in addition to using spatiotemporal distributions and time series. This is not to say that spatiotemporal information is not important; rather, the relationships and spatiotemporal information can be used to inform one another. For example, in a side-by-side comparison of contour plots between biomass and nitrate concentrations, one might expect to see high biomass in high nitrate regions. However, if low biomass is observed in a high nitrate region, this would suggest that another factor (such as phosphate) 540 is limiting phytoplankton growth. By visualizing the apparent relationships, one would be able to observe that phosphate has a strong limitation factor on the phytoplankton. This could then be verified with the spatial contour plot of phosphate against the original biomass and nitrate contour plots.
In addition to comparing relationships between ESMs, the methods presented here will allow for the comparison of 545 relationships found in observational datasets to the relationships in ESMs. This will allow for better tuning of the