A derivative-free optimisation method for global ocean biogeochemical models

. The skill of global ocean biogeochemical models, and the Earth System Models in which they are embedded, can be improved by systematic calibration of the parameter values against observations. However, such tuning is seldom undertaken as these models are computationally very expensive. Here we investigate the performance of DFO-LS, a local, derivative-free optimisation algorithm which has been designed for computationally expensive models with irregular model-data misﬁt landscapes typical of biogeochemical models. We use DFO-LS to calibrate six parameters of a relatively complex global ocean 5 biogeochemical model (MOPS) against synthetic dissolved oxygen, phosphate and nitrate “observations” from a reference run of the same model with a known parameter conﬁguration. The performance of DFO-LS is compared with that of CMA-ES, another derivative-free algorithm that was applied in a previous study to the same model in one of the ﬁrst successful attempts at calibrating a global model of this complexity. We ﬁnd that DFO-LS successfully recovers 5 of the 6 parameters in approximately 40 evaluations of the misﬁt function (each one requiring a 3000 year run of MOPS to equilibrium), while 10 CMA-ES needs over 1200 evaluations. Moreover, DFO-LS reached a “baseline” misﬁt, deﬁned by observational noise, in just 11–14 evaluations, whereas CMA-ES required approximately 340 evaluations. We also ﬁnd that the performance of DFO-LS is not signiﬁcantly affected by observational sparsity, however fewer parameters were successfully optimised in the presence of observational uncertainty. The results presented here suggest that DFO-LS is sufﬁciently inexpensive and robust to apply to the calibration of complex, global ocean biogeochemical models

Abstract. The skill of global ocean biogeochemical models, and the earth system models in which they are embedded, can be improved by systematic calibration of the parameter values against observations. However, such tuning is seldom undertaken as these models are computationally very expensive. Here we investigate the performance of DFO-LS, a local, derivative-free optimisation algorithm which has been designed for computationally expensive models with irregular model-data misfit landscapes typical of biogeochemical models. We use DFO-LS to calibrate six parameters of a relatively complex global ocean biogeochemical model (MOPS) against synthetic dissolved oxygen, phosphate and nitrate "observations" from a reference run of the same model with a known parameter configuration. The performance of DFO-LS is compared with that of CMA-ES, another derivativefree algorithm that was applied in a previous study to the same model in one of the first successful attempts at calibrating a global model of this complexity. We find that DFO-LS successfully recovers five of the six parameters in approximately 40 evaluations of the misfit function (each one requiring a 3000-year run of MOPS to equilibrium), while CMA-ES needs over 1200 evaluations. Moreover, DFO-LS reached a "baseline" misfit, defined by observational noise, in just 11-14 evaluations, whereas CMA-ES required approximately 340 evaluations. We also find that the performance of DFO-LS is not significantly affected by observational sparsity, however fewer parameters were successfully optimised in the presence of observational uncertainty. The results presented here suggest that DFO-LS is sufficiently in-expensive and robust to apply to the calibration of complex, global ocean biogeochemical models.

Introduction
Ocean biogeochemical models are a key tool in understanding the cycling of nutrients and carbon in the ocean. They are used to quantify the uptake of greenhouse gases such as CO 2 emitted by human activity, of which the ocean has absorbed roughly a third since the start of the industrial revolution (Khatiwala et al., 2009;DeVries, 2014), as well as assess the impact of increasing concentrations of greenhouse gases on ocean ecosystems. Such models are also an important component of the earth system models (ESMs) used to project future climate change. In global ocean biogeochemical models the complex interactions between biota, nutrients, oxygen and carbon are typically heavily parameterised. The skill of such models can be improved by either subjective manual or systematic tuning of the parameter values against observations. The latter uses numerical optimisation algorithms which seek to find the minima of a "misfit function" -often defined as the root mean squared difference between the model and observations -within the parameter space. However, biogeochemical models are seldom subjected to such tuning because of their large computational expense and the long spin-up time required for chemical and biological tracers to reach equilibrium (Wunsch and Heimbach, 2008;Khatiwala et al., 2012). Moreover, optimisation algorithms must be able to navigate a generally irregular mis-fit landscape. Efficient and robust optimisation methods are thus of considerable interest to the ocean biogeochemical and broader climate modelling community.
Optimisation methods can generally be split into two broad categories. Derivative-based algorithms such as Gauss-Newton (Hartley, 1961) use derivatives within the parameter space to locate minima. The calculation of derivatives, which can be undertaken using finite differences or automatic differentiation and adjoints (Griewank and Walther, 2008), can be prohibitively expensive in some cases, such as when evaluating the misfit function is computationally costly or noisy (chapters 8 and 9 of Nocedal and Wright, 2006). By contrast, derivative-free algorithms (Conn et al., 2009) may require fewer evaluations per iteration and are typically better adapted to handle noisy misfit functions. An example of the latter is "covariance matrix adaptation evolution strategy" (CMA-ES; Hansen, 2016). CMA-ES was applied by Kriest et al. (2017) to optimise six parameters within the Model of Oceanic Pelagic Stoichiometry (MOPS; Kriest and Oschlies, 2015), by minimising a globally averaged misfit incorporating annual mean dissolved phosphate, nitrogen and oxygen. This constituted one of the first successful attempts at systematic tuning of a relatively complex global biogeochemical model. CMA-ES was subsequently used by Sauerland et al. (2019) for multiobjective calibration of MOPS by including oxygen minimum zones as a misfit metric, and by Kriest et al. (2020) who compared the influence of different general circulation models on parameter optimisation.
While the development and application of CMA-ES is an important step forward, its evaluation cost per iteration, as well as overall computational cost, is prohibitively expensive for routine use. In Kriest et al. (2017), for example, the misfit function had to be evaluated at least 950 times to achieve a sufficiently low misfit. As each evaluation requires running the biogeochemical model to equilibrium (3000 years in that study), this would be prohibitively expensive for the more complex models run at resolutions typical of the current generation of ESMs. Here, we explore the application of another, computationally less expensive algorithm, "derivativefree optimisation by least squares" (DFO-LS; Cartis et al., 2019), to the same problem set-up as in Kriest et al. (2017). We first compare the performance of CMA-ES and DFO-LS to optimise six biogeochemical parameters against the output of a reference run of MOPS where the parameters are known. We examine in this "twin" experiment the ability of the algorithms to recover the true parameters, and the computational cost incurred. True oceanic observations contain observational uncertainty; therefore we also investigate the impact of optimising in the presence of observational uncertainty by adding noise to the synthetic observations. Lastly, we evaluate the performance of DFO-LS when given sparse data. Sparse scattered oceanic observations are commonly mapped onto a regular grid using methods such as objective interpolation, introducing significant error, especially in regions such as the Southern Ocean with poor data coverage. The structure of the paper is as follows: Section 2 describes the methodology, Sect. 3 the results, Sect. 4 the discussion and Sect. 5 the conclusions.

Ocean biogeochemical model
The Model of Oceanic Pelagic Stoichiometry (MOPS-2.0) is a global ocean biogeochemical model, which simulates the cycling of nine biogeochemical tracers, namely dissolved inorganic and organic phosphate, dissolved inorganic nitrate, dissolved oxygen, phytoplankton, zooplankton, and detritus (Kriest andOschlies, 2013, 2015), with the possibility to include the carbon cycle. MOPS is coupled to the transport matrix method (TMM; Khatiwala et al., 2005;Khatiwala, 2007Khatiwala, , 2018, an efficient numerical method for "offline" simulation of biogeochemical tracers. In this study we use monthly mean transport matrices and other physical forcing fields (including temperature, salinity, sea ice and winds) derived from a relatively coarse resolution (2.8 • ×2.8 • × 15 levels) configuration of MITgcm (Marshall et al., 1997) driven by climatological momentum, heat and freshwater fluxes (Dutkiewicz et al., 2005).

Biogeochemical model parameters
The behaviour of MOPS is controlled by several parameters, of which we have chosen the same six parameters to consider for calibration as chosen in the previous optimisation study by Kriest et al. (2017). The detailed definitions and possible ranges of these parameters are described in that paper. Briefly, four of these parameters are mainly restricted to the epipelagic and mesopelagic zones of the ocean, as they involve phytoplankton and zooplankton. I C and K PHY are the phytoplankton half-saturation constants for light absorption and phosphate uptake, respectively. µ ZOO is the zooplankton maximum grazing rate and k ZOO the zooplankton quadratic mortality rate. The remaining two parameters influence the remineralisation and sinking of particulate organic matter (POM). R O 2 :P is the ratio of oxygen consumption to phosphate release during remineralisation when oxygen is available, and b * is the exponent of the "Martin curve", a power law function that describes the attenuation of POM flux with depth (Martin et al., 1987).

Parameter sensitivities
Local sensitivities for each of the six parameters were calculated at one location within parameter space, whereby only the parameter whose sensitivity was to be calculated was perturbed from its target value by 10 % of its range while the other parameter values remained constant.

Optimisation algorithms
Optimisation algorithms iteratively evaluate the misfit between model and observations, then vary the model parameter inputs with the aim of finding a lower misfit. Here, every evaluation of the misfit requires running the biogeochemical model to equilibrium (3000 years), then calculating the misfit between the model outputs and real (or synthetic) observations of dissolved oxygen, phosphate and nitrate. In general the misfit "landscapes" of biogeochemical models tend to be nonlinear, as found by Kriest et al. (2017) for example, who converged to multiple local minima. "Twin" experiments are used to determine if an optimiser can find the global minimum within the misfit function landscape, whereby the misfit is calculated between the model outputs and synthetic observations. The synthetic observations are created by the model with a known parameter configuration; therefore the global minimum (is zero) and optimal parameter values are known. We compare the performance of two different optimisation algorithms, by using twin experiments.

CMA-ES
The Covariance Matrix Adaptation Evolution Strategy (CMA-ES; Hansen, 2016) is a widely used stochastic evolutionary algorithm, for use on a "black box" misfit function. By design, CMA-ES is an unconstrained solver, that is, parameters are not restricted to be within specified bounds. To ensure that parameters lie within reasonable bounds, a penalty score is added to the misfit when any parameter value goes outside of their specified range, as also done by Kriest et al. (2017). During each iteration, a population size of λ biogeochemical parameter vectors are sampled from a multivariate normal distribution, which is fully described by a mean and a positive definite matrix of covariances. CMA-ES then requires the misfit function to be evaluated at these λ locations in the parameter space. The results of these are used to update the mean and covariance of the multi-variate normal distribution, before another λ biogeochemical parameter vectors are sampled for the next iteration. With each iteration the population should be guided towards areas of the parameter landscape which provide lower expected misfit values, aiming to converge on the parameter configurations which provide the best misfits. This process has been well illustrated by Kriest et al. (2017, see their Fig. 2), who previously used CMA-ES to optimise MOPS. CMA-ES carries out a global search of the parameter space; therefore it seeks to find the minimum over the parameter space. In order to achieve this, CMA-ES can require thousands of function evaluations (e.g. 950-3460 required by Kriest et al., 2017). The CMA-ES code used in this study, which is based on the (µ/µ W ,λ)-CMA-ES algorithm of Hansen (2016), is summarised in Appendix A. The optimisation code was sourced from the Supplement by Kriest et al. (2017), with some editing to make it compatible with our chosen optimisation framework (see Code and data availability section). As in the previous Kriest et al. (2017) study, we use a population size λ of 10; i.e. in each iteration of CMA-ES, the misfit function is evaluated 10 times.

DFO-LS
Derivative-free optimisation using least squares (DFO-LS) is an iterative algorithm for minimising a function f (x) (Cartis et al., 2019), where x is the n-dimensional vector of parameters, each of which is constrained within specified bounds. DFO-LS can take into account individual terms of the misfit function and use their structure to improve convergence. Mathematically, DFO-LS solves the nonlinear least-squares problem: where D is a bounded domain of R n , and r i (x) denotes individual terms in the misfit function. DFO-LS starts at an initial location within the parameter space and then moves through the space to provably find a local minimum (see Appendix A.2 in Cartis et al., 2019, for convergence and complexity rates). The algorithm is illustrated in Fig. 1 and summarised in Appendix B. DFO-LS must be given a starting location within the parameter space from which to initialise. In the initial iteration of DFO-LS, the misfit function is evaluated at the starting location and at n additional locations nearby (where n is the number of parameters to be optimised), with their proximity determined by DFO-LS settings (see Table B1). In subsequent iterations, typically only one function evaluation is needed, and often only a handful are needed to achieve significant misfit reduction.
Using this set of evaluated points, DFO-LS creates a quadratic approximation to the underlying true (unknown) misfit function (Cartis et al., 2019) and calculates the minimum of this function within a "trust region" centred around the starting point. The true misfit function is then evaluated at this location. If it is found to be worse than the misfit at the existing n + 1 points, it is rejected. The trust region is shrunk (2) Two mini-local regressions are built (green lines) using these two r i points (black diamonds). (3) These linearised misfits are then squared and summed over to give a quadratic approximation (blue line) to the true misfit function. (4) Within the trust region (shaded in yellow) the minimum of the approximation is found, at which the true misfit function is evaluated. (5) If the new point is accepted, this new information is used to update the mini-local regressions, otherwise it is rejected and the trust region is shrunk. Steps 2-5 are then repeated until the specified termination criterion or maximum evaluations is reached. and the procedure is repeated. On the other hand, if it is found to be lower than the best of the n+1 points then it is accepted, and the point corresponding to the highest misfit amongst the previous n + 1 points is discarded. A new quadratic approximation is calculated for this n + 1 set of points, and the procedure is repeated. Thus, at any iteration DFO-LS keeps track of n + 1 points in parameter space and the point with the lowest misfit is considered as that iteration's best set of parameters. The algorithm is terminated based on three specified criteria: (1) the maximum allowed number of function evaluations is exceeded, (2) the trust region radius is shrunk below a specified size, and (3) misfit reduction progress is identified as being too slow.
Unlike CMA-ES, DFO-LS is more of a "local" optimisation method. However, there is strong numerical evidence from the derivative-free optimiser Py-BOBYQA, upon which DFO-LS is based , that it is able to find global minima. To increase the likelihood of finding the global minimum, DFO-LS can either be manually reinitiated from different starting locations in the parameter space, or automatically "restarted" once it determines that the reduction in the misfit is progressing too slowly. During a restart the trust region expands, allowing DFO-LS to search for points potentially outside the local minimum it may be trapped in, and move towards a lower minimum elsewhere. This can be done by either a "hard" restart, whereby the (expensive) misfit function is re-evaluated at n + 1 new locations within the expanded trust region, or by a "soft" restart, whereby DFO-LS only "shifts" some of the current n+1 points in parameter space to geometry-improving points (Cartis et al., 2019). The former is more computationally expensive; therefore we do not use it here, although soft restarts are allowed. To increase confidence that DFO-LS has found the global minimum, we also initiate from multiple starting points.

Misfit functions
Every evaluation of our misfit requires running the biogeochemical model for 3000 years before calculating the misfit between the model outputs and synthetic observations. While both CMA-ES and DFO-LS minimise a single misfit function, DFO-LS can exploit the structure of the misfit function. Thus, if the misfit is defined as per Eq. (1), we only provide CMA-ES with f (x) whereas the individual r i (x) are supplied to DFO-LS. There is no maximum suggested value for d, the number of r i terms; therefore the misfit at every grid point within the model domain could be provided to DFO-LS. However, many of the individual r i misfits would be physically close to each other in the model and therefore will respond similarly to perturbations in the biogeochemical parameters being optimised, which will result in a heavier weighting to this location of the ocean model. To avoid this, we define r i to take into account the spatial structure of the misfit by partitioning the ocean into previously established biome regions of similar ocean biogeochemical properties (Henson et al., 2010;Weber et al., 2016, provided by Raffaele Bernardello, Barcelona Supercomputing Centre, personal communication, 2018, several of which were further split by depth at 1000 m (see Fig. 2) for a total of 19 regions. For every region j , we further calculate a misfit for each of the three tracers q (phosphate, nitrate, oxygen) used in the optimisation. The objective f (x) is thus composed of 19 × 3 = 57 terms of the following form: where m qi (x) is the model solution with parameters x at grid point i for tracer q and o qi the corresponding observation (the synthetic observations provided by a reference run of MOPS). The misfit is normalised by the volume-weighted mean tracer concentration for that region and weighted, first, by individual grid point volumes V i relative to the volume V j of region j and, second, by the region's total volume relative to the global ocean volume V global . Real oceanic observations have a degree of uncertainty associated with them due to spatio-temporal oceanic processes, e.g. from small-scale processes such as unresolved eddies. To account for this we add a noise term iq , which is the added noise due to uncertainty associated with tracer q for every grid box i in the model. The total global misfit f T (x) is then defined as The total misfit function is broadly similar to Kriest et al. (2017), with the main difference being the incorporation of the 19 biome regions. The non-noisy equivalent of these misfit terms are r qj and f T as in Eqs. (2) and (3) with = 0. We also define "baseline" misfits r base qj and f base T , which are the misfits due to the noise alone in the special case where the model outputs equal give an indication of the termination criteria when optimising the model to real noisy oceanic observations, as optimising below this threshold would serve no useful purpose.
To specify a realistic noise field we take the standard deviation variable provided in the World Ocean Atlas database (WOA18 Garcia et al., 2018a, b). Since our misfit is defined with respect to annual mean data we require an annual mean standard deviation without the variability of the seasonal cycle. To do so we take the numerical mean (weighted by number of observations) of the monthly standard deviations reported in the WOA18 dataset for the upper 800 m (phosphate and nitrate) or 1500 m (oxygen), and the annual standard deviation below those depths. These standard deviations fields were linearly interpolated onto the model grid and then multiplied by three different Gaussian noise fields to create three separate noise ( ) realisations. The baseline misfit terms r base qj and f base T were calculated as an average over these realisations.
As mentioned above, the "observations" in this study are from a reference simulation of MOPS, hereafter referred to as MOPS-ref, run with the following parameter values: R O 2 :P = 170 mmol O 2 : mmol P, Table C1).

Optimisation experimental design and solver settings
In this study we seek to (1) compare the performance of CMA-ES and DFO-LS on noise-free observations, (2) investigate DFO-LS's performance on noisy observations, and (3) investigate the impact of sparse observations on the ability of DFO-LS to recover the true parameters. In order to do so we carried out the following series of experiments (see Table 1 for the corresponding experiment labels): Noise-free experiments. In the noise-free experiments we attempted to recover all six parameters. For this a single run of CMA-ES was performed (labelled C_SMOOTH). For DFO-LS two experiments were carried out (D_SMOOTH 1 and D_SMOOTH 2 ), starting from two different locations in parameter space that were chosen to be relatively far from the target parameters. The parameter values for these and all other experiments are listed in Table C1.
Both CMA-ES and DFO-LS are controlled by various solver settings. For CMA-ES the main ones are the number of sequential generations and the population size. As per Kriest et al. (2017) we set these to 200 and 10, respectively. The solver settings used by DFO-LS are summarised in Table B1. Of the noise-free experiments D_SMOOTH 1 and D_SMOOTH 2 , the former had DFO-LS settings regarding trust region management (tr_radius) which are more suitable for a noisy misfit function, while the latter is more suitable for a smooth misfit function. Therefore, D_SMOOTH 1 and D_SMOOTH 2 vary both in starting values and trust region management. As D_SMOOTH 1 was slightly more successful, the trust region management settings were set to be more suitable for a noisy misfit function in all subsequent experiments.
DFO-LS experiments with observational uncertainty. To understand optimisation performance in the presence of observational uncertainty, noise was added to the reference observations (see Sect. 2.5). Three such optimisation runs, each with a different noise realisation, were carried out with DFO-LS (D_NOISY 1 , D_NOISY 2 , D_NOISY 3 ) starting from the same location in parameter space, to minimise the noisy misfit function f T . The goal was to see if DFO-LS could recover all six of the MOPS-ref target parameter values.
DFO-LS experiments with sparse observations. There are large areas of the ocean which have not been sampled adequately or at all (e.g. Fig. 2). While it is possible to fill in the gaps in the data using objective interpolation methods, this might not always work well in the presence of large gradients. In a last set of experiments we therefore compared how DFO-LS performs in the presence of data sparsity (D_SPARSE 1 and D_SPARSE 2 ), by only using observations at model grid points for which the corresponding locations in WOA18 contain data, with its corresponding performance in the absence of data sparsity (D_SMOOTH 1 and D_SMOOTH 2 ).

Parameter sensitivities
To provide insight into why some parameters were tuned better or worse in the following optimisation experiments, the sensitivity of the misfit function to an individual perturbation in each parameter has been calculated and shown in Fig. 3. The greatest change in the misfit was caused by perturbing b * by 10 % of its range, followed by I C . The parameter with the lowest influence on the misfit at this local point in parameter space was K PHY , in which a 10 % perturbation caused a misfit change of only 4.1 × 10 −6 .

Noise-free experiments
The results for all twin optimisation experiments are summarised in Figs. 4 and 5, and in Appendix C (Tables C1 and  C2), which show the starting and optimised parameter values, and parameter recovery information. In the subsequent sections we then plot both the global misfit and parameter values for every function evaluation throughout each individual optimisation experiment (Figs. 6-13). During one CMA-ES iteration we evaluate the misfit function 10 times (the population size). Therefore for CMA-ES we plot the minimum (best) and maximum misfits, and we plot the parameter values corresponding to the best misfit, and the minimum and maximum parameter values of each population. Table 1. Names of each experiment tuning to noise-free, noisy and sparse twin observations. C_SMOOTH is the non-noisy CMA-ES experiment. D_SMOOTH 1 and D_SMOOTH 2 are the non-noisy DFO-LS experiments starting from two different locations in parameter space, run specifically to be compared to C_SMOOTH. D_noise_randi are the noisy DFO-LS experiments, with three different noise realisations, run specifically to be compared to the non-noisy equivalent experiment D_SMOOTH 1 . D_SPARSE 1 and D_SPARSE 2 are the non-noisy DFO-LS experiments calibrating to sparse observations, starting from two different locations in parameter space, run specifically to be compared to the non-sparse equivalent experiments D_SMOOTH 1 and D_SMOOTH 2 , respectively.  To both reiterate how DFO-LS works and fully explain the DFO-LS figures, we briefly describe the optimisation process in terms of expected misfit reduction and parameter trajectories. First, DFO-LS evaluates the misfit function n + 1 times near to the chosen starting point; therefore in the first seven evaluations we do not expect a misfit reduction. After these initial evaluations, DFO-LS attempts to minimise the misfit function and there will be both successful evaluations (the resulting misfit is lower than previously found in the optimisation) and unsuccessful ones (the misfit is not lower). There may also be restarts, directly after which unsuccessful evaluations are common as DFO-LS perturbs the parameter values to get out of a possible local minimum. Therefore on every DFO-LS figure we have plotted the misfit or parameter values for every evaluation (both successful and unsuccessful) as scattered points, and successful ones in a solid line.

CMA-ES DFO-LS
On every figure of total global misfits the expected baseline misfit (f base T ; see Sect. 2.5) is also plotted, below which any misfit reduction is within observational noise levels. On every parameter trajectory plot the "recovery zone" is also shown. This indicates the range of parameter values within ±5 % of the target value, normalised by the total range (upper bound minus lower bound) for that parameter. We consider a parameter to have been "recovered" by a certain number of evaluations, when all subsequent parameter values corresponding to successful evaluations remain within this recovery zone.
CMA-ES experiment. Figure 6 shows that during the optimisation C_SMOOTH by CMA-ES the global misfit decreases significantly from 10 −1 to ≈ 10 −4 within the first 500 function evaluations. Subsequently progress slows down as the misfit is reduced by only 1 more order of magnitude over the next 1000 evaluations. Progress then significantly improves, with the misfit decreasing from ≈ 10 −5 to 10 −9 within the final 500 evaluations. Note that the spikes in the maximum global misfit near the 1650th evaluation was due to the added penalty factor when one of the parameter values in this population had a value just outside of its allowed range. While this experiment did not include noise, we note that CMA-ES required 309 evaluations to reach the baseline misfit, beyond which any misfit reduction would have been within observational noise levels. Figure 7 shows how the six parameters were optimised towards the MOPS-ref target parameter values by CMA-ES. The targets were found relatively quickly within the initial 500 evaluations for the parameters R O 2 :P , I C and b * , corresponding to the initial fast misfit reduction previously shown in Fig. 6. The µ ZOO and k ZOO targets were found next after approximately 1000 evaluations, after which the optimiser began tuning K PHY towards its target until it located after 1700 evaluations. If C_SMOOTH had been terminated once the observational noise level was reached after 309 evalua- For C_SMOOTH multiple starting locations (small orange markers) are shown, as unlike DFO-LS, CMAES selects 10 unconstrained randomised starting points. As CMA-ES is unconstrained, not all starting locations plot within the parameter bounds, to which the y-axis limits are fixed. Also plotted are the MOPS-ref target parameter values and ±5 % recovery zone (horizontal black dashed lines). For further information see Table C1. tions, R O 2 :P , I C and b * would have been optimised to their MOPS-ref values relatively well, while K PHY , µ ZOO and k ZOO would still be far from their target values.
DFO-LS experiments. To compare the performance of DFO-LS with CMA-ES we carried out two optimisation experiments with DFO-LS (D_SMOOTH 1 and D_SMOOTH 2 ) starting from two different locations in parameter space, with differing parameters controlling the DFO-LS trust region shrinking speed. Optimisation D_SMOOTH 1 had slower trust region shrinking settings to allow it to better handle an irregular misfit function. Figure 8 shows the comparison between both experiments' reduction of the global misfit. In both cases there was rapid initial misfit decrease from near 10 −1 to 10 −3 within 30 model evaluations. Optimisation D_SMOOTH 2 showed slightly slower misfit reduction, needing 35 evaluations to reach the baseline misfit, while D_SMOOTH 1 only required 20 to reach the baseline, beyond which any misfit reduction is within observational noise levels. In both D_SMOOTH 1 and D_SMOOTH 2 DFO-LS managed to reduce the misfit to below 10 −5 within 45-49 evaluations and then initiated restarts to reduce it further. As D_SMOOTH 1 performed slightly better than D_SMOOTH 2 , especially between evaluations 15-45, D_SMOOTH 1 trust region shrinking settings were used as defaults for all subsequent experiments. Figure 9 shows how the six parameters were optimised towards the MOPS-ref target parameter values by D_SMOOTH 1 and D_SMOOTH 2 . In both experiments within the first 30 evaluations R O 2 :P , I C , k ZOO and b * were optimised to relatively close to their targets, and µ ZOO within the first 45 evaluations. The parameter the misfit function was least sensitive to, K PHY , was successfully optimised by D_SMOOTH 1 ; however, was not successfully optimised at all by D_SMOOTH 2 . If D_SMOOTH 1 and D_SMOOTH 2 had been terminated once reaching the noise baseline, after 20 and 35 evaluations respectively, R O 2 :P , I C , b * and k ZOO would have been optimised to their MOPS-ref values relatively well.

DFO-LS experiments with observational uncertainty
To assess the impact of observational uncertainty we carried out three experiments in which DFO-LS was initialised from the same starting location in parameter space, but with three different realisations of random noise added to the observations (see Sect. 2.6). As seen in Fig. 10 DFO-LS managed to reduce the misfit to very close to the average baseline misfit Black arrows indicate (a) the baseline misfit was not reached, or (bg) the parameter was not recovered in that optimisation experiment. Note that the number of evaluations required by CMA-ES in experiment C_SMOOTH always plotted above the break in the y axis. For further information see Table C2. within 30 evaluations with a reduction in misfit from ∼ 10 −1 to ∼ 10 −3 . Closer to the end of the optimisation runs, restarts were initiated to encourage further misfit reduction, hence the large variations in misfit. Figure 11 shows that the initial misfit reduction corresponds to improved values for the pa-rameters R O 2 :P , I C , K PHY and b * . DFO-LS seems to compensate for the noise by increasing the values for the parameters µ ZOO and k ZOO .

DFO-LS experiments with sparse observations
In a final set of experiments we examine whether DFO-LS is able to successfully optimise MOPS given a sparse set of observations (see Sect. 2.6). The experiments D_SPARSE 1 and D_SPARSE 2 were initialised from the same location in parameter space as D_SMOOTH 1 and D_SMOOTH 2 , respectively, but the former were optimised using observations sub-sampled at grid points corresponding to locations in the un-interpolated WOA18 database. In these experiments no noise was added to the observations. Figure 12 shows that the two optimisations using full observations (D_SMOOTH 1 and D_SMOOTH 2 ) converged to slightly lower misfits than when using sparse observations. Figure 13 shows that D_SMOOTH 1 successfully recovered all six parameters within 42 evaluations, while D_SPARSE 1 only successfully recovered R O 2 :P , I C and b * throughout the optimisation. The above results would indicate a poorer optimisation when using sparse observations; however, when starting from a different location in parameter space, D_SMOOTH 2 successfully recovered five parameters within 44 evaluations, while D_SPARSE 2 recovered the same five after only 26  . Also plotted is the baseline misfit (horizontal black dashed line). Vertical arrows indicate a soft restart, coloured and marked according to each experiment. Note that every MOPS evaluation has been plotted with a small marker; however, only MOPS evaluations which resulted in a lower misfit than previously seen in each optimisation experiment have been plotted with a solid line.
evaluations. This suggests that even with sparse observations it is possible to successfully optimise a global ocean biogeochemical model such as MOPS.

CMA-ES vs. DFO-LS optimisation performance
Our comparison of the two optimisation algorithms shows that DFO-LS could recover all six target parameter values within ∼ 40 evaluations of MOPS, while CMA-ES achieved the same goal within ∼ 1700 evaluations. By "recover" we mean optimised to within ±5 % (normalised by the parameter range) of the target value. DFO-LS reduced the misfit to below the observational uncertainty threshold within 20-35 evaluations, while CMA-ES required 309 evaluations. DFO-LS is thus significantly more efficient for this particular problem and may, in general, be more practical for optimising more than a small handful of parameters. However, we note that the multiple evaluations CMA-ES requires can be run in parallel. In contrast, DFO-LS, except for the initial n + 1 evaluations, runs sequentially. CMA-ES is a single-objective optimiser, while DFO-LS can use information from multiple misfit values instead of just one. Therefore it can exploit more information to allow for a faster reduction in the misfit. Neither algorithm can  completely guarantee a global optimum solution, although CMA-ES carries out a more global search than DFO-LS. There is significant evidence DFO-LS can find the global optimum , but to increase confidence in the final solution it can be combined with a globalising method such as starting from different points in parameter space or using the DFO-LS restart functionality.
Both methods struggled with one of the parameters K PHY , due to the misfit function's low sensitivity to this parameter (as found by perturbing the parameter values in each direction and computing the gradient). DFO-LS had not begun to tune this parameter for one of the experiments (D_SMOOTH 2 ) before we terminated it at a maximum of 70 evaluations, although it did find K PHY when initiated from a different starting point (D_SMOOTH 1 ). CMA-ES also had difficulty in tuning K PHY and only started optimising this parameter after all the other parameters were recovered at ∼ 1200 evaluations. The maximum number of DFO-LS evaluations was set to 70 as it is a sequential algorithm; therefore it was impractical to allow too many more evaluations. Had it been allowed to run longer the expectation is it would begin tuning K PHY once the other five were sufficiently tuned, as was the case with CMA-ES. However, the computational expense of continuing DFO-LS to recover K PHY was deemed too costly, particularly due to the fact that this parameter has so little influence on the misfit function and therefore did not impact the successful misfit reduction achieved by DFO-LS. This low sensitivity to K PHY was also seen by Kriest et al. (2017), who determined the surface observations contribute too little to the total misfit, rendering the misfit function insensitive to perturbations in parameters that mainly influence Figure 11. As in Fig. 9, but for the experiments D_NOISY 1 (black line with crosses), D_NOISY 2 (blue line with circles) and D_NOISY 3 (red line with squares). Fig. 8, but for experiments D_SMOOTH 1 (black line with crosses), D_SPARSE 1 (blue line with circles), D_SMOOTH 2 (red line with squares) and D_SPARSE 2 (magenta line with diamonds). Vertical arrows indicate a soft restart, coloured and marked according to each experiment. Note that the baseline misfit (horizontal black dashed line) was calculated using the full grid noisy observations. the surface ocean (e.g. K PHY ). To help overcome this in future work, one could put more weighting on surface data when formulating the misfit.

Calibrating to uncertain observations
Real oceanic observations come with associated uncertainty due to measurement error, temporal variations such as sea-sonal and diurnal cycles, and meso-scale variability due to factors such as eddies and the movement of fronts. Here we have studied how this uncertainty raises the base of the misfit function, below which any optimisation of the biogeochemical model would be within the uncertainty level. We determined this baseline for the misfit function (or termination threshold) using the standard deviations of the observational data; however, others have defined it as the global optimum of a surrogate formulation of the biogeochemical model (Sauerland et al., 2017). In the present case and with the chosen set of oceanic observations, the model was significantly optimised before reaching levels of observational uncertainty, particularly due to optimisation of the parameters which the model is most sensitive to, as was determined by perturbing each parameter while holding the others fixed and calculating the misfit. In this case it was I C (the phytoplankton half-saturation for light) and b * (the increase in particle sinking speed with depth). Somewhat surprisingly, a parameter the model is less sensitive to, R O 2 :P (the ratio of oxygen consumption to phosphate release during remineralisation), was also well optimised before reaching the baseline. Despite the low sensitivity, possibly caused by narrow parameter bounds, the high optimisation potential by this parameter may be due to the fact that the misfit function includes both oxygen and phosphate. It could also be due to the fact that this parameter has a non-local effect, as it influences the flux of oxygen and phosphate to the deeper ocean, hence to ocean basins further along the "conveyor belt". This is also the case for b * (Kwon and Primeau, 2006;Kriest et al., 2012) but even Figure 13. As in Fig. 9, but for the experiments D_SMOOTH 1 (black line with crosses), D_SPARSE 1 (blue line with circles), D_SMOOTH 2 (red line with squares) and D_SPARSE 2 (magenta line with diamonds). more so as it also influences the vertical flux of all three of the tracers. In order to optimise less sensitive parameters before reaching noise levels one could introduce more metrics into the misfit calculation to help constrain these parameters, for example phytoplankton and zooplankton data, or additional oxygen constraints, such as the location of oxygen minimum zones as done by Niemeyer et al. (2019).

Calibrating to sparse observations
We also investigated the ability of DFO-LS to optimise MOPS in the presence of sparsity in the observational data. The results shown here suggest that there is no significant difference in the performance of DFO-LS when tuning to data at every grid point versus a subset of grid points, in line with earlier findings by Kriest et al. (2010). Interpolating can introduce large errors, on the order of 20 % (Garcia et al., 2018a, b), particularly in poorly sampled regions such as the Southern Ocean. However, our experiments suggest that it is possible to use un-interpolated observations, but it is important to start the optimiser from multiple locations in parameter space, or to generously allow restarts, in the presence of a complex misfit function with many local minima. These multiple runs clearly can be run in parallel.

Summary
This study compared the efficiency and performance of two derivative-free optimisation algorithms, CMA-ES and DFO-LS, applied to MOPS, a global ocean biogeochemical model with seven prognostic tracers. The two methods were used to tune six of the parameters that control the behaviour of MOPS. We found that DFO-LS has a significantly lower computational cost when compared to CMA-ES, between one and two orders of magnitude, which is important considering that global ocean biogeochemical models are computationally expensive, as they must be integrated for several thousand years to reach equilibrium. DFO-LS exploits more information when minimising the misfit function and therefore has more scope for reducing the misfit faster than CMA-ES. However, as DFO-LS is more of a local optimiser than CMA-ES, it should be paired with a globalising method such as starting from different initial points in parameter space, which can easily be run in parallel.
Future work will involve applying DFO-LS to tune the MEDUSA biogeochemical model (Yool et al., 2011(Yool et al., , 2013 to real observations. MEDUSA is more typical of the biogeochemical models which are embedded within earth system models (in the case of MEDUSA, UKESM) that are used to project climate change. Below is a simplified description of the (µ/µ W ,λ)-CMA-ES algorithm (Hansen, 2016).
Algorithm A1 (µ/µ W ,λ)-CMA-ES. 0: INPUT: Set initial parameters as in Table 1 of Kriest et al. (2017), population size λ = 10, µ = λ/2, evolution paths, covariance matrix C = I, distribution mean, step size and maximum generation number. 1: while maximum generation is not reached and fitness distribution is not flat do 2: Sample population of new probability distribution 3: for k = 0, 1, 2, . . .λ do 4: Sample search point for this k 5: end for 6: Update probability distribution 7: Update the mean of the search distribution according to a weighted average of the best half of the previously sampled population 8: Update the overall standard deviation ("step size") 9: Update evolution paths 10: Update covariance matrix 11: end while Appendix B: DFO-LS algorithm description Below is a simplified description of the DFO-LS algorithm in the context of how it has been used in this study. Not all technical details are included, such as safeguarding steps to improve the geometry of points and the quality of the model; therefore see the full description in Cartis et al. (2019). Ratio to decrease the lowest bound (ρ k ) for the trust region radius 0.9 0.1 tr_radius.alpha2 Ratio of ρ k to decrease k by when ρ k is reduced 0.95 0.5 Algorithm B1 DFO-LS.
Require: Number of parameters n, starting point x 0 ∈ R n , minimum trust region radius (p end ), if hard or soft restarts are allowed (see Sect. 2.4.2), and maximum number of true misfit function evaluations. 1: Evaluate the true misfit function at n+1 points within the initial trust region to build the initial interpolation set {Y 0 } (this can be done in parallel). 2: for k = 0, 1, 2, . . . do 3: if we have exceeded the maximum number of true misfit function evaluations then 4: terminate. 5: end if 6: Construct a quadratic approximation of the true misfit function. 7: Approximately solve the trust region subproblem to locate the minimum of the approximation within the trust region and get step s k to this point. 8: Evaluate the true misfit function at x k + s k . 9: if the misfit is significantly decreased then 10: Accept Experiments C_SMOOTH Start n/a n/a n/a n/a n/a n/a 4.  Table C2. Number of evaluations required to recover each parameter for all twin experiments. Columns 2-7: number of misfit function evaluations required to successfully recover that parameter ("-": never recovered). All evaluations required to recover a parameter which were fewer than 40 are typed in bold font. Column 8: the maximum number of evaluations completed. Column 9: the evaluation which provided the lowest or "best" global misfit. Column 10: the number of evaluations needed for the global misfit to be reduced below noise levels ("-": never reached the baseline). Code and data availability. The base TMM and MOPS code used for the ocean biogeochemical simulations are available to download from https://doi.org/10.5281/zenodo.1246300 (Khatiwala, 2018). Transport matrices and forcing fields required to perform the simulations can be downloaded from https://doi.org/10.5281/zenodo. 5517238 (Khatiwala, 2021). Modifications to the MOPS code for the specific experiments described in this paper, along with model output and scripts to recreate the figures shown here, are available from https://doi.org/10.5281/zenodo.5517626 . The OptClim optimisation framework used in this study to couple any climate model to any optimiser is available at https://doi.org/ 10.5281/zenodo.5517610 . This includes the CMA-ES optimisation code taken from the Supplement of Kriest et al. (2017) and adapted to work with OptClim.
Author contributions. SK and CC conceived the project and together with SO designed the experiments. SO carried out the experiments and analysis and wrote the paper with contributions from all co-authors.
Competing interests. The contact author has declared that neither they nor their co-authors have any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.