Submitted as: methods for assessment of models 30 Sep 2021
A derivative-free optimisation method for global ocean biogeochemical models
 ^{1}Department of Earth Sciences, University of Oxford, South Parks Road, Oxford OX1 3AN, UK
 ^{2}Mathematical Institute, University of Oxford, Radcliffe Observatory Quarter, Woodstock Road, Oxford, OX2 6GG, UK
 ^{3}GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel, Düsternbrooker Weg 20, 24105 Kiel, Germany
 ^{4}School of GeoSciences, University of Edinburgh, Edinburgh, UK
Abstract. The performance of global ocean biogeochemical models, and of the Earth System Models in which they are embedded, can be improved by systematic calibration of their parameter values against observations. However, such tuning is seldom undertaken because these models are computationally very expensive. Here we investigate the performance of DFO-LS, a local, derivative-free optimisation algorithm designed for computationally expensive models with irregular model-data misfit landscapes typical of biogeochemical models. We use DFO-LS to calibrate six parameters of a relatively complex global ocean biogeochemical model (MOPS) against synthetic dissolved oxygen, inorganic phosphate and inorganic nitrate observations from a reference run of the same model with a known parameter configuration. The performance of DFO-LS is compared with that of CMA-ES, another derivative-free algorithm that was applied in a previous study to the same model in one of the first successful attempts at calibrating a global model of this complexity. We find that DFO-LS successfully recovers 5 of the 6 parameters in approximately 40 evaluations of the misfit function (each requiring a 3000-year run of MOPS to equilibrium), while CMA-ES needs over 1200 evaluations. Moreover, DFO-LS reached a baseline misfit, defined by observational noise, in just 11–14 evaluations, whereas CMA-ES required approximately 340 evaluations. We also find that the performance of DFO-LS is not significantly affected by observational sparsity; however, fewer parameters were successfully optimised in the presence of observational uncertainty. The results presented here suggest that DFO-LS is sufficiently inexpensive and robust to apply to the calibration of complex, global ocean biogeochemical models.
Sophy Elizabeth Oliver et al.
Status: final response (author comments only)

RC1: 'Comment on gmd-2021-175', Anonymous Referee #1, 15 Oct 2021
General Comments
The article examines the ability and performance of the mathematical optimisation software DFO-LS (Cartis et al., 2019) to calibrate global biogeochemical ocean models, an important task in the field of Earth system modelling.
The study is exemplified with the BGC model MOPS (Kriest et al., 2015), coupled to a 2.8° configuration of the MIT general circulation model (Marshall et al., 1997), using the transport matrix method (TMM; Khatiwala, 2007) as a "physical linearisation" in order to enable more efficient model evaluations.
The objective function to be optimised is a weighted sum of squared differences between observed tracer values and their modelled counterparts (a measure frequently applied to global ocean models). Since DFO-LS is a derivative-free method for least-squares problems, it appears to be a promising candidate for the model calibration task. Consequently, the paper is tightly focused on the calibration performance of DFO-LS and its ability to deal with noise and sparse observational data. It is compared to a "general purpose" derivative-free stochastic search algorithm, namely N. Hansen's implementation of a popular variant (Hansen, 2016) of the covariance matrix adaptation evolution strategy (CMA-ES). CMA-ES was applied to the same TMM-coupled BGC model by Kriest et al. (2017) in one of the first calibration attempts for globally coupled BGC models.
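The weighted, partitioned misfit described above can be sketched in code. The following is an illustrative sketch only (the function and variable names are ours, not the paper's, and the paper's exact weighting may differ), assuming one volume-weighted term per (region, tracer) pair:

```python
import numpy as np

def misfit_vector(model, obs, volumes, region_of, tracers):
    """One volume-weighted misfit term per (region, tracer) pair.

    model, obs : dicts mapping tracer name -> 1-D array over grid boxes
    volumes    : 1-D array of grid-box volumes (the weights)
    region_of  : 1-D integer array assigning each grid box to a region
    tracers    : list of tracer names, e.g. ["PO4", "NO3", "O2"]
    """
    terms = []
    for tracer in tracers:
        sq_diff = (model[tracer] - obs[tracer]) ** 2
        for r in np.unique(region_of):
            mask = region_of == r
            # normalise the volume weights within the region
            weights = volumes[mask] / volumes[mask].sum()
            # root of the volume-weighted mean squared difference
            terms.append(np.sqrt(np.sum(weights * sq_diff[mask])))
    return np.array(terms)
```

The scalar objective is then simply the sum of squares of this vector; a least-squares solver such as DFO-LS consumes the vector of terms itself.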
The article is well laid out. It gives a quick overview of both optimisation tools and presents a tight experimental design with TWIN experiments that yield informative results and discussions.
The performance results of DFO-LS are quite promising. At least for the given model setup with TWIN experiments, there is a clear decrease in the number of function evaluations (i.e., of expensive model simulations) required to recover the selected set of BGC parameters (and the model-data misfit value) within sufficient accuracy. Even the number of DFO-LS iterations is lower than with the applied CMA-ES version in the setting of Kriest et al. (there, a CMA-ES iteration means 10 model simulations in parallel, while a DFO-LS iteration evaluates the model for a single parameter vector, which is derived by constructing a quadratic "surrogate model" and minimising it within a trust region). Further, a good convergence property of DFO-LS with respect to both noise and sparse data is empirically confirmed.
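For readers unfamiliar with model-based derivative-free methods, the step described in parentheses above can be caricatured as follows. This is a minimal sketch of the idea only, not the actual DFO-LS internals (which maintain and update an interpolation set and adapt the trust-region radius): build a linear model of the residual vector from n+1 evaluations, then take a Gauss-Newton step clipped to the trust region.

```python
import numpy as np

def trust_region_step(xs, fs, delta):
    """One model-based, derivative-free step (illustrative only).

    xs    : (n+1, n) interpolation points; xs[0] is the current iterate
    fs    : (n+1, m) residual vectors already evaluated at those points
    delta : trust-region radius

    Builds a linear model J of the residuals from the n+1 evaluations,
    then minimises the Gauss-Newton model ||f0 + J s||^2 subject to
    ||s|| <= delta (here crudely, by clipping the unconstrained step).
    """
    x0, f0 = xs[0], fs[0]
    dX = xs[1:] - x0                 # (n, n) parameter displacements
    dF = fs[1:] - f0                 # (n, m) residual changes
    J = np.linalg.solve(dX, dF).T    # J satisfies J @ dX[i] ~ dF[i]
    s, *_ = np.linalg.lstsq(J, -f0, rcond=None)  # Gauss-Newton step
    norm = np.linalg.norm(s)
    if norm > delta:                 # clip the step to the trust region
        s *= delta / norm
    return x0 + s
```

On a residual that happens to be linear, a single such step lands on the least-squares solution; on a real misfit landscape the model is only locally valid, which is what the trust region guards against.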
It remains to be investigated whether the observed good performance is sustained with real-world observations (which I would expect) and for other globally coupled BGC models (as designated by the authors).
Specific Comment
I am curious about the impact of the cost function definition, which is explained in quite some detail in Section 2.5. Did you initially work without a partition into 27 biomes and 3 tracers? Does the required number of function evaluations (e.g. to reach "baseline optimality") significantly increase if the objective function is provided as a single sum of squared differences only?
Technical Corrections
- The CMA-ES subsubsection and the DFO-LS subsection have different depth levels (2.3.1 and 2.4). Using 2.3.1 and 2.3.2 (or 2.4 and 2.5) would be more consistent.
- The sentence beginning in line 157 seems incomplete. Is there a word missing?
- I would remove the first three words "Results table of" from the captions of Tables 3 and 4.
- In Figure 4 the parameter boundary lines are not "red dotted" as stated in the caption but thin black lines.
- Line 320: "verses" => "versus"

AC1: 'Reply on RC1', Sophy Oliver, 21 Oct 2021
Hello,
We thank the reviewer very much for their comments and corrections.
Regarding the five technical corrections, we thank the reviewer; we agree with them all and will edit accordingly.
- “The CMA-ES subsubsection and the DFO-LS subsection have different depth levels (2.3.1 and 2.4). Using 2.3.1 and 2.3.2 (or 2.4 and 2.5) would be more consistent.”
We will change them to 2.4 and 2.5.
- “The sentence beginning in line 157 seems incomplete. Is there a word missing?”
We agree; some words were missing. The sentence beginning at line 157 will now read "Real oceanic observations have a degree of uncertainty associated with them due to spatiotemporal oceanic processes, e.g., from small-scale processes such as eddies."
- “I would remove the first three words "Results table of" from the captions of Tables 3 and 4.”
We agree and will remove “Results table of”.
- “In Figure 4 the parameter boundary lines are not "red dotted" as stated in the caption but black thin lines.”
We thank the reviewer for highlighting this error, and will correct the caption to “thin black lines”.
- “Line 320: "verses" => "versus"”
We shall do this.
Regarding the two questions, we have replied to each below.
- “Did you initially work without a partition into 27 biomes and 3 tracers?”
That is an interesting question. Because of how DFO-LS works, no: we did not start without partitioning the misfit into regions and tracers. DFO-LS requires an input vector of misfit terms. When DFO-LS uses these terms to create the quadratic approximation, it solves a system of linear equations, which would become underdetermined if the misfit vector were shorter than the number of parameters being optimised. This is not in itself a problem for DFO-LS, which then applies a Tikhonov-like regularisation to the ensuing inverse problem. Still, since we are performing parameter tuning, it is better, if possible, to provide DFO-LS with at least n+1 misfit terms (n = number of parameters), as this gives the code more problem information to exploit. At the upper end, there is no real maximum preferred length for the misfit vector. For example, one could provide a misfit for each grid point of the ocean biogeochemical model being optimised, giving a misfit vector many thousands of entries long. The problem with this is that many misfits physically close to each other in the model will respond very similarly to perturbations in the biogeochemical parameters being optimised, which effectively gives a heavier weighting to that part of the ocean model. The misfit terms should therefore be chosen so that they respond to parameter perturbations independently of each other, for instance by providing misfits for different observational types (e.g. nitrate, oxygen, phosphate, silicate), for different spatial regions (e.g. North Atlantic, Southern Ocean) and for different depths in the water column (e.g. 0–1000 m, 2000–4000 m). This is why we initially chose 19 regions and 3 tracers, and did not deviate from that choice.
- “Does the required number of function evaluations (e.g. to reach "baseline optimality") significantly increase if the objective function is provided as a single sum of squared differences, only?”
Indeed, we have tried this. We cannot use DFO-LS for such an experiment, but we can use a related code called BOBYQA (also derivative-free; it builds a quadratic approximation and minimises it within a trust region). BOBYQA is meant for general function minimisation, so when one applies it to our problem one supplies only (calls to) the scalar sum of squared differences, not the individual misfit terms. We gave DFO-LS the 19x3 misfit terms, gave BOBYQA the sum of squares of the same 19x3 terms, and let both run for about 70 iterations. Their success was very similar, though parameter recovery by BOBYQA was slower than for DFO-LS by about 10 iterations (which is why we chose to continue with DFO-LS). The main paper for DFO-LS ^{1} and references therein compared DFO-LS to BOBYQA on standard data-fitting test problems and found DFO-LS to be significantly more efficient. Thus, in our experience, there are computational gains in supplying the misfit terms individually if a code such as DFO-LS is able to exploit them.
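The regularisation point made in the reply above can be illustrated with a short sketch (ours, not the DFO-LS source code): when fewer misfit terms than parameters are available, the interpolation conditions underdetermine the model gradient, and a Tikhonov term keeps the solve well posed.

```python
import numpy as np

def regularised_gradient(dX, df, lam=1e-8):
    """Estimate a model gradient g from interpolation conditions dX @ g = df.

    dX  : (k, n) parameter displacements of the k sampled points
    df  : (k,) corresponding changes in one misfit term
    lam : Tikhonov parameter

    When k < n the conditions underdetermine g; minimising
    ||dX g - df||^2 + lam ||g||^2 (ridge regression) remains well
    posed and returns a minimum-norm-like solution.
    """
    n = dX.shape[1]
    return np.linalg.solve(dX.T @ dX + lam * np.eye(n), dX.T @ df)
```

With only two sampled directions in a three-parameter space, the unconstrained component of the gradient is simply driven to zero rather than left undefined.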
We hope these explanations are clear but please do not hesitate to let us know if further questions arise.
^{1} Cartis, C., Fiala, J., Marteau, B., and Roberts, L.: Improving the flexibility and robustness of model-based derivative-free optimization solvers, ACM Transactions on Mathematical Software, 45, 1–35, https://doi.org/10.1145/3338517, 2019.

RC2: 'Reply on AC1', Anonymous Referee #1, 25 Oct 2021
Many thanks for your reply.
Indeed, I was also wondering about the case you address in the second half of item 1: considering each grid point of the ocean biogeochemical model as a component of the misfit vector. It might be worth briefly including those thoughts in Section 2.5 of the paper, too.

RC3: 'Comment on gmd-2021-175', Benoit Pasquier, 14 Dec 2021
This study presents a novel, efficient, and robust optimization algorithm, named DFO-LS, for improving the skill of global biogeochemical models. Despite the potential of the proposed methodology, I believe major revisions are required before this manuscript can be published.
Given the complexity of biogeochemistry in the real ocean, parameterization of its unresolved processes is imperative for models of global marine biogeochemical cycles. In order to improve the ability of these models to fit the growing observational data, modellers have traditionally relied on common physical sense, marine biogeochemical expertise, or simply manual tuning and experimentation. Indeed, automated (programmatic) parameter fitting is generally problematic because of the prohibitive computational costs associated with global marine biogeochemical cycles, the most advanced of which can require months of computation time on large highperformance computing clusters. Oliver et al. thus present us with an inexpensive alternative that could potentially allow for the efficient optimization of parameters even in the case of sparse or noisy observations to fit. Hence on the face of it, the methodology presented offers a welcome tool on the way to improve the accuracy of wide swathes of oceanographic and climate models.
In order to illustrate the performance of DFO-LS, the authors set up a benchmark of optimization runs. A control simulation of the Model of Oceanic Pelagic Stoichiometry (MOPS-2.0) with known parameter values serves as a target reference for the benchmarks. Then, starting from slightly altered values for 6 parameters, different combinations of algorithm and objective functions are put to the test to try to recover the target parameter values. Each optimization requires many evaluations of an objective function, each of which requires a 3000-year MOPS-2.0 simulation in order to compare the resulting simulated tracers to the "synthetic" observations (taken from the control MOPS-2.0 simulation). A previously used optimization algorithm, CMA-ES, is used as a baseline for the benchmarks. DFO-LS shows promising performance, with one run recovering the 6 parameters in 40 MOPS-2.0 simulations compared to 1200 for CMA-ES.
However, as documented in Table 4, only in 1/6 cases does DFO-LS "recover" all 6 parameters. In other words, a pessimistic take from the study is that DFO-LS "failed" 5 out of 6 times. The authors even state: "DFO-LS had not begun to tune this parameter for one of the experiments (exp_d2) before we terminated it at a maximum of 70 evaluations, although it did find K_PHY when initiated from a different starting point (exp_d1). CMA-ES also had difficulty in tuning K_PHY and only started optimising this parameter after all the other parameters were recovered at ~1200 evaluations. The maximum number of DFO-LS evaluations was set to 70 as it is a sequential algorithm, therefore it was impractical to allow too many more evaluations. Had it been allowed to run longer the expectation is it would begin tuning K_PHY once the other 5 were sufficiently tuned, as was the case with CMA-ES." That is, DFO-LS originally failed 5 times out of 5, until a favourable starting state was used in one experiment that eventually succeeded. Unless I am confused, it appears that the robustness of the proposed DFO-LS algorithm for tackling real-world problems hinges on the expectation that it could be just as efficient as CMA-ES and, rarely, much better. In my opinion, this leaves a gaping hole in the arguments supporting the claim that DFO-LS is "inexpensive and robust to apply to the calibration of complex, global ocean biogeochemical models". As it stands, I would recommend accepting the manuscript only after thoroughly addressing the robustness of the approach by running many more simulations from different starting points (and also after consideration of the other points made below).
- Does the paper address relevant scientific modelling questions within the scope of GMD? Does the paper present a model, advances in modelling science, or a modelling protocol that is suitable for addressing relevant scientific questions within the scope of EGU? Yes.
- Does the paper present novel concepts, ideas, tools, or data? Yes.
- Does the paper represent a sufficiently substantial advance in modelling science? Yes.
- Are the methods and assumptions valid and clearly outlined? Yes.
- Are the results sufficient to support the interpretations and conclusions? No.
- Is the description sufficiently complete and precise to allow their reproduction by fellow scientists (traceability of results)? In the case of model description papers, it should in theory be possible for an independent scientist to construct a model that, while not necessarily numerically identical, will produce scientifically equivalent results. Model development papers should be similarly reproducible. For MIP and benchmarking papers, it should be possible for the protocol to be precisely reproduced for an independent model. Descriptions of numerical advances should be precisely reproducible. Did not try.
- Do the authors give proper credit to related work and clearly indicate their own new/original contribution? Yes (apart from some oversights).
- Does the title clearly reflect the contents of the paper? The model name and number should be included in papers that deal with only one model. Yes.
- Does the abstract provide a concise and complete summary? Yes.
- Is the overall presentation well structured and clear? Yes, apart from the table that should be presented as figures instead.
- Is the language fluent and precise? Yes.
- Are mathematical formulae, symbols, abbreviations, and units correctly defined and used? Yes.
- Should any parts of the paper (text, formulae, figures, tables) be clarified, reduced, combined, or eliminated? Yes, see the points below.
- Are the number and quality of references appropriate? Yes.
- Is the amount and quality of supplementary material appropriate? For model description papers, authors are strongly encouraged to submit supplementary material containing the model code and a user manual. For development, technical, and benchmarking papers, the submission of code to perform calculations described in the text is strongly encouraged. Yes.
Below is a list of line-by-line items, suggestions, and comments.
- l. 1: "performance" of biogeochemical models could be confused with computational cost (the way "performance" is used when talking about DFO-LS). What about "skill"?
- l. 34: I am unsure the Li and Primeau (2008) citation is an application of the Transport Matrix Method. (Maybe clarify or remove it?)
- l. 37: "use finite differences or adjoint" This is an inexact distinction of cases. Derivatives can sometimes be derived symbolically (manually or computationally) and evaluated directly (see, e.g., Dickinson and Gelinas, 1976; doi:10.1016/0021-9991(76)90007-3). Most often symbolic derivations are impractical, so one falls back on numerical techniques, such as finite differences. But there are other more efficient and accurate methods (see, e.g., Griewank and Walther, 2008; doi:10.1137/1.9780898717761).
- l. 37–38: "to calculate derivatives" This is technically incorrect. Gauss-Newton and other derivative-based algorithms use derivatives but do not calculate them.
- l. 38–39: "They can be both computationally expensive and generally less robust on or even unsuitable for noisy problems" Is there a reference for this? While I am convinced that the authors are correct with regards to the pitfalls of a derivative-based algorithm in the case of noisy problems, I am failing to see a strong argument for computational efficiency. While it is true that evaluating a derivative is not free, it provides information that can drastically improve the convergence rate of the optimization algorithm. This needs to be discussed in more detail in my opinion.
- l. 58: "interpolated" I think one could argue that the authors here mean extrapolated.
- l. 60: "Section 4" and "Section 5" should be spelled out for consistency.
- l. 73: Are the 6 parameters the same as those optimized by Kriest et al. (2017)? If so, this could be made clearer.
- l. 86: "In general" Is there a review to reference here?
- l. 86: "quite" is unnecessary.
- l. 87–88: I find the "To determine (...) synthetic observations." sentence hard to parse. Maybe it can be reworded for clarity?
- l. 89–90: Suggestion: "the global minimum [is zero] and optimal parameter values are known"
- l. 90: Suggestion: "We compare the performance (...)"
- l. 95: Why not use a differentiable bijection mapping the range to the real line?
- l. 96: Can "various" be replaced with a more descriptive term? Maybe "randomized"?
- l. 96: It might be useful to briefly describe the covariance matrix there.
- l. 96–98: "It returns only the parameter configurations which provide the best misfits to a multivariate normal distribution of parameters, then in the next iteration it randomly draws several more parameter configurations, and repeats" is unclear. What is "it"? Which "multivariate normal distribution"? How many is "several"? This sentence sounds like the description of a brute-force random search, which paints an unfavourable picture of what CMA-ES actually does.
- l. 98: What is a "population"?
- l. 99: "and therefore aim" reverses the logic. The "aim" is to converge towards the best estimation from the onset. "Therefore", the algorithm employs the strategy to "move" towards lower misfit values.
- l. 102–103: This "In order (...) in practice" sentence could be rearranged. Also, an indication of how many function evaluations would be useful. (And "quite" is unnecessary.) Suggestion: "In practice, CMA-ES requires thousands of function evaluations (...)"
- l. 104: What does "was sourced" mean? Is that the exact code? Is it archived and publicly accessible?
- l. 109: If x is bounded, then starting with x ∈ R^{n} is misleading. What about: "x ∈ D a bounded domain of R^{n}"?
- l. 109–111: Suggestion: "DFO-LS can take into account individual terms of the misfit function and use their structure to improve convergence"
- l. 114: "provably": If there is a convergence proof, then it should be cited. Unless this was supposed to be "probably"?
- l. 117: What is a typical n?
- l. 117: "for a total of n + 1 function evaluations". I think this can safely be removed.
- l. 117: What does "nearby" mean?
- l. 118: Suggestion: "only one misfit function evaluation is needed"
- Fig. 1 Caption:
  - How is the information "combined"? Are the squared misfits simply summed over? If that's the case, it should be stated as such. Otherwise, maybe some clarification of what goes on would be useful.
  - How can the misfit function be "evaluated if accepted"? It seems this is the other way around.
- l. 119–121: This is only important if the initial sampling constitutes the bulk of the computation. Is that the case in general?
- l. 133: "minima ." (space before dot)
- l. 139: "much" is not needed.
- l. 139: "This is much more computationally expensive than a soft restart" Why is that the case?
- l. 148–149: Are these personally communicated regions available in a public archive? Reproducibility hinges on such availability.
- Eq. (2): Does V_i^{i ∈ j} = 0 when i ∉ j? If so, I would suggest just having V_i instead, and summing over only i ∈ j (instead of summing, potentially redundantly, over all i).
- l. 158: "eddies" can be large. Maybe "unresolved eddies" is clearer?
- Fig. 2 Caption: Which reference defines the 19 regions? Henson et al. (2010) or Weber et al. (2016)?
- Eq. (30) + many lines: "f_{global}". Usually non-variable subscripts are typeset upright. Also, "global" is misleading, since there are many tracers. On the other hand, for volumes, the authors use "T", for "total", I guess. Maybe no subscript for this "total" f? Or maybe swap the "T" and "global" subscripts throughout?
- l. 171: How is the interpolation done?
- l. 172: Why three noise realizations?
- l. 189 and throughout: I kept going back to read what differentiated each experiment from the others. Maybe the authors can find more expressive names for their experiments? E.g., `exp_d1` could be `D_noise` and `exp_d2` could be `D_smooth`? `exp_drngi` could be `D_randi`? And `exp_d1_sparse` could be `D_smooth_sparse`, and so on.
- l. 183–204: What about an experiment combining sparse and randomized observations?
- Table 2: It is unclear what all the settings do. Also, the caption experiment names do not match the main text.
- l. 208: the authors say they "plot" values but instead show a table. Indeed, Table 3 looks like it would deserve to be turned into a plot with 6 panels (6x1, one for each parameter), with each experiment on the x-axis and the optimized value on the y-axis, and with a colour code that conveys the groupings (smooth, sparse, noisy, and so on). The same goes for Table 4, which could be turned into a combination of bar plots (with a broken axis to cater for the large number of evaluations for CMA-ES, rather than a misleading log scale).
- Table 4: Some of the experiment names do not match the main text. Is "subsel" the same as "sparse"?
- l. 277: Maybe I read this incorrectly, but it seems the authors report that only 1/6 of the DFO-LS experiments recovered all 6 target parameters (Table 4). This is swept under the rug here. I think it would be useful to discuss the failures of convergence for some parameters here.
- l. 290–297: This seems like a significant caveat. The justification for capping the number of evaluations for DFO-LS at 70 is unsubstantiated. Combined with the above comment, it seems that the authors' (5?) original experiments with DFO-LS all failed to recover all 6 parameters, and they then added an extra experiment with a different starting point for K_PHY that is at least twice as close to its target value (Figure 6). In my opinion, this places the robustness of the approach under question, and therefore I would recommend additional experiments with randomized starting values.
- l. 315: While technically feasible, I would consider using "oxygen concentrations" as a constraint rather than the "location of oxygen minimum zones", which is subjectively defined by an arbitrary threshold.
- Figures 7 and 9 captions should repeat what the vertical arrows mean (soft restarts).
- One of the main citations of the manuscript (Cartis et al., arXiv, 2018) should be replaced by the more recent and, importantly, peer-reviewed version (Cartis et al., Optimization, 2021, doi: 10.1080/02331934.2021.1883015).