Automated model optimisation using the Cylc workflow engine (Cyclops v1.0)

Most geophysical models include many parameters that are not fully determined by theory, and can be 'tuned' to improve the model's agreement with available data. We might attempt to automate this tuning process in an objective way by employing an optimisation algorithm to find the set of parameters that minimises a cost function derived from comparing model outputs with measurements. A number of algorithms are available for solving optimisation problems, in various programming languages, but interfacing such software to a complex geophysical model simulation presents certain challenges. To tackle this problem, we have developed an optimisation suite ("Cyclops") based on the Cylc workflow engine that implements a wide selection of optimisation algorithms from the NLopt Python toolbox (Johnson, 2014). The Cyclops optimisation suite can be used to calibrate any modelling system that has itself been implemented as a (separate) Cylc model suite, provided it includes computation and output of the desired scalar cost function. A growing number of institutions are using Cylc to orchestrate complex distributed suites of interdependent cycling tasks within their operational forecast systems, and in such cases application of the optimisation suite is particularly straightforward. As a test case, we applied Cyclops to calibrate a global implementation of the WAVEWATCH III (v4.18) third-generation spectral wave model, forced by ERA-Interim input fields. This was calibrated over a one-year period (1997), before applying the calibrated model to a full (1979-2016) wave hindcast. The chosen error metric was the spatial average of the root-mean-square error of hindcast significant wave height compared with collocated altimeter records. We describe the results of a calibration in which up to 19 parameters were optimised.

Examples of this approach can be found in hydrological and climate modelling applications. For example, Seong et al. (2015) developed a calibration tool (using R software) to apply the Shuffled Complex Evolution optimisation algorithm to calibrate the Hydrologic Simulation Program-Fortran (HSPF) model. In climate modelling, Severijns and Hazeleger (2005) used the downhill simplex method to optimise the parameter values of the subgrid parameterizations of an atmospheric general circulation model. More recently, Tett et al. (2013) applied a Gauss-Newton line search optimisation algorithm to climate simulations with the Hadley Centre Atmosphere Model version 3 (HadAM3) forced with observed sea surface temperature and sea ice, optimising an objective function derived from reflected shortwave radiation and outgoing longwave radiation comparisons. The Tett et al. (2013) method was subsequently applied to optimise the sea ice component of the global coupled HadCM3 climate model (Roach et al., 2017; Tett et al., 2017).
Such custom applications of one particular optimisation algorithm to a specific model, however, can require significant effort to switch to alternative optimisation algorithms, or to be applied to new models. Modern coupled climate models, or operational forecast systems for weather and related processes, encompass a diverse set of software tools, often running on multiple platforms. Ideally, we would like to be able to optimise performance of the modelling system (not just a single model code) without major reconfiguration of software between the calibration and operational/production versions of the system.
The Cylc workflow engine is now applied in several operational centres to manage the scheduling of tasks within such systems, so it seems natural to consider the possibility of developing a framework within Cylc for the optimisation of the modelling systems under its control.

Methods
In very general terms, a derivative-free optimisation algorithm will explore parameter space, selecting values of the parameter vector x in some sequence. As each is selected, it calls the (user-supplied) subroutine to evaluate the objective function f(x). In our case, this would amount to running a complete model simulation with the corresponding parameter settings, comparing outputs to measurements, from which a defined error metric is computed to provide the return value of f. This can involve a lengthy simulation, needing a run time Tmodel perhaps of order hours or days to reproduce months or years of measurements.
A self-contained optimisation program, with an explicitly-coded function-evaluation subroutine, will run much faster, with a run time per iteration Titer typically being some small fraction of a second, and will run in many orders of magnitude less time than a typical geophysical model, even if a number of iterations N of order 1000 is required. That might be the case for "deliberately difficult" test problems; for a well-tested geophysical model with reasonable defaults, however, we might expect that many new implementations will start within a relatively simple "basin of attraction", so that O(10) iterations may suffice to get very close to the optimum.
If the optimisation procedure calls for a full model run to evaluate the objective function, and N iterations are required for convergence, the total run time would be

T = To + N (Titer + Tmodel),

including an overhead To for initial and final tasks.
As Tmodel is orders of magnitude larger than To and Titer, the geophysical modelling system totally dominates run time, and we can comfortably afford not to be concerned with reducing the efficiency of the optimisation routine, even by a few orders of magnitude.

So let us consider a simple measure we might introduce to allow us to recover from an interruption part way through a long optimisation process. Normally, the optimisation code will retain in memory each parameter vector x and its objective function f(x) that has already been evaluated, to use in selecting further points to be evaluated. If we write these values to file each time the function evaluation is called, we can build up a lookup table to use in case we need to restart the process. In that case, we could have the function evaluation subroutine first search the lookup table for a match to x (within some acceptable tolerance), in which case it could return the tabulated error value. Only in the case where a tabulated value was not found would the full model simulation be required to compute the return value of f.

Now rather than actually perform that computation, the function evaluation subroutine could simply write the values of x (for the nth iteration, say) to file, and exit. We could then run the model in its usual way, outside the optimisation code, using those values as parameters, and add that result to our lookup table before restarting the whole process from scratch. This time, assuming the optimisation algorithm is deterministic, with no random process influencing the sequence of x values, the first n points would be exactly the same sequence that was selected previously, and could be quickly handled by table lookup, and the algorithm would either find that a convergence criterion had been satisfied, or select a new point x_{n+1} to be passed to the model for simulation.
In effect, we are simply employing the optimisation algorithm in a generic tool that, given the results of all previous iterations, either signals that convergence has been reached, or generates the next parameter set to be evaluated by the model.
In this scheme, assuming that we start with an empty lookup table, the first pass has one iteration of the optimisation code, the second has two, etc. So, allowing an additional overhead To' for the full process, the total run time to reach the termination condition(s) after N iterations should be

T' = To' + N Tmodel + [N (N + 1)/2] Titer.

As Tmodel is orders of magnitude larger than the other times, the ratio of the two run times is approximately

T'/T ≈ 1 + (N − 1) Titer / (2 Tmodel).

Given the expected relative magnitudes of the model and optimisation iteration times, and N of order 10s or 100s, the increase in run time through this approach is actually negligible.
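To see how small the penalty is, the run-time expressions above can be checked with illustrative numbers (the timings here are assumptions for the sake of the example, not figures from the paper):

```python
# Illustrative timings (assumed values, not from the paper):
T_model = 6 * 3600.0   # one model simulation: 6 hours, in seconds
T_iter = 0.1           # one optimiser iteration: 0.1 s
N = 50                 # number of function evaluations required

# Single continuous optimisation (overheads To, To' neglected):
T_direct = N * (T_iter + T_model)

# Restart scheme: pass p replays p-1 table lookups plus one new
# iteration, so the optimiser performs 1 + 2 + ... + N = N(N+1)/2
# iterations in total, while the model still runs exactly N times.
T_restart = N * T_model + N * (N + 1) / 2 * T_iter

ratio = T_restart / T_direct
print(f"run-time ratio = {ratio:.6f}")
```

With these numbers the ratio differs from unity by roughly one part in 10^4, consistent with the claim that the increase in run time is negligible.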
On the other hand, this scheme has several benefits. Apart from being simple to code, the optimisation algorithm, including the user-defined function evaluation subroutine, can be completely generic, and applied unmodified to different modelling systems. The only requirements on the modelling system are that, at the start of each simulation, it reads in the parameter values requested by the optimisation code and adapts them to its standard input formats, then at the end of the simulation computes and writes to file a single error metric value. The optimisation code and the model system could then remain separate, both controlled by some form of scripting scheme, for example. This means that having invested considerable time and resources in developing a complex modelling scheme, no major reconfiguration needs to be made to prepare it for optimisation in this manner, or subsequently to re-implement the optimised modelling system in operational or production mode.

Cylc
Cylc (http://cylc.github.io/cylc/) is an Open Source workflow engine that can manage ongoing distributed workflows of cycling (repeating) tasks. It was originally developed at NIWA to automate environmental forecasting systems, and has since been adopted by many other institutions, notably the UK Met Office and its international partners in climate, weather, and related sciences. Cylc can manage large production systems of great complexity, but it is also easy to use for individuals and groups with less demanding automation requirements. Cylc workflows (or suites) are defined with an efficient graph syntax that expresses dependence between tasks, and an efficient inheritance hierarchy for optimal sharing of all task runtime properties (exactly what each task should execute, and where and how to submit task jobs to run).
Cylc tasks are related by trigger expressions that combine to form a dependency graph. The trigger A:status => B says that task B depends on task A achieving the status status ("=>" represents an arrow). The default trigger status is succeeded (job execution completed successfully), and can be written simply as A => B; others include submitted, submit-failed, started, finished, failed, and custom task output messages. Tasks can depend on the wall clock and on external events, as well as on other tasks, and a task job can be submitted to run once all its dependencies are met. Cylc automatically wraps user-defined task content (environment, scripting, etc.) in code to trap errors and report job status back to the suite server program via authenticated HTTPS messages. Tasks can even trigger off tasks in other suites, so for coupled systems you can choose between a larger suite that controls all tasks, and multiple smaller suites that interact with each other.
In cycling systems tasks repeat on sequences that may represent forecast cycles, or separate chunks of a model simulation that is too long for a single run, or iterations in an optimisation scheme, or different datasets to be processed as they are generated, and so on. Cycling is specified with ISO 8601 date-time recurrence expressions (e.g. for environmental forecasting), or with integer recurrence expressions (e.g. for iterative processes). Both date-time and integer cycling are used in the application described in this paper. Dependence across cycles (consider a forecast model that is initialised with outputs from a previous cycle) creates ongoing, potentially never-ending, workflows. Uniquely, Cylc can manage these without imposing a global cycle loop: one cycle does not have to complete before the next can start. Instead, tasks from many cycles can run concurrently to the full extent allowed by individual task dependencies and external constraints such as compute resource and data availability. So, for example, on restarting after extended downtime, a suite that processes real-time data can clear its backlog and catch up again very quickly, by automatically interleaving cycles.
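As a concrete illustration (a minimal sketch, not taken from the paper or from Cyclops itself), a Cylc suite definition combining triggers with date-time cycling might look like:

```ini
[scheduling]
    initial cycle point = 20180101T00Z
    [[dependencies]]
        [[[PT6H]]]  # repeat every 6 hours
            # A runs once B from the previous cycle has succeeded,
            # then B runs in the current cycle.
            graph = "B[-PT6H] => A => B"
```

Here -PT6H denotes an offset of one six-hour cycle, giving the cross-cycle dependence that lets Cylc interleave cycles as described above.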

Implementation
We have developed a Cylc suite ("Cyclops", https://zenodo.org/badge/latestdoi/1836229) to perform optimisation of a modelling system that has itself been set up as a separate Cylc suite. In the example we describe below, the model suite controls a multi-year wave model hindcast, including the preprocessing of necessary model inputs (principally wind fields) and verification data (satellite altimeter data), running the wave model code, postprocessing of model outputs, and generation of error statistics from comparisons of predicted and observed significant wave height fields.

Typically, date-time cycling is used to run a model at successive forecast cycles, or to break a long simulation into a succession of shorter blocks. The optimisation suite, on the other hand, uses integer cycling, with each cycle corresponding to a single evaluation of the objective function.
There are several tasks controlled by the optimisation suite. One of these is responsible for running an optimisation algorithm to identify either an optimal parameter vector from previous model runs, or the next parameter vector to be evaluated. This main optimisation task within the suite is implemented with Python code calling the NLopt Python toolbox (Johnson, 2014).
NLopt includes a selection of optimisation algorithms: both "local" solvers, which aim to find the nearest local minimum to the starting point as efficiently as possible, and "global" solvers, which are designed to explore the full parameter space, giving high confidence in finding the optimal solution out of a possible multitude of local minima. NLopt includes algorithms capable of using derivative information where available, which is not the case in our application, and Cyclops is restricted to the derivative-free algorithms listed in Table 1.
We have assumed that the sequence of parameter vectors tested by an optimisation algorithm is deterministic; several of the algorithms available in NLopt, however, have an inherently stochastic component. It is possible to make these algorithms "repeatably stochastic" by enforcing a fixed seed for the random number generator.
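The idea can be illustrated generically with Python's standard random module (NLopt provides its own seeding mechanism for its stochastic algorithms):

```python
import random

def trial_sequence(n=5, seed=12345):
    # With a fixed seed, the sequence of pseudo-random trial values
    # (standing in for trial parameter vectors) is fully reproducible,
    # which is what "repeatably stochastic" means here.
    rng = random.Random(seed)
    return [rng.uniform(0.0, 1.0) for _ in range(n)]

# Two runs give the identical sequence:
assert trial_sequence() == trial_sequence()
```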
In NLopt, any combination of the following termination conditions can be set:

1. maximum number of iterations by each call of the optimisation algorithm
2. absolute change in the parameter values less than a prescribed minimum
3. relative change in the parameter values less than a prescribed minimum
4. absolute change in the function value less than a prescribed minimum
5. relative change in the function value less than a prescribed minimum
6. function value less than a prescribed minimum

In the second and third of these convergence criteria, the "change in parameter values" means the magnitude of the change in the parameter vector x. We have implemented Python code that uses NLopt calls to seek a minimum of an objective function f that represents a non-negative model error metric. As described above, the user-defined function evaluation has been implemented as a generic Python function f(x) that simply searches a lookup table (stored in a file). If x is found in the table it returns the corresponding f value, otherwise it saves the vector x to a file and returns an invalid f value. Any of the termination conditions listed above can be set by the user: the last of these can use a prescribed minimum f value as a convergence condition, while an invalid f value signals that the optimisation algorithm has stopped because a new parameter vector x needs to be evaluated externally by a model simulation. In this case a file is written containing parameter names and values in a format that can be parsed by the modelling system to generate the needed input files for a simulation. At present a generic namelist format is used as output from Cyclops for this purpose.

A "parameter definition" file is used to specify parameter names and their initial values, as used within the model. If a parameter is allowed to be adjusted by the optimisation suite, an allowable range is also set. This choice will generally require some experience with the particular model.
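A minimal sketch of such a function evaluation might look as follows (the file names, tolerance, and sentinel value are illustrative choices, not those of the actual Cyclops code):

```python
import json
import os

INVALID = -1.0   # sentinel: f is a non-negative error metric, so any
                 # negative value can signal "not yet evaluated"
TOL = 1e-12      # matching tolerance in normalised parameter space

def f(x, table_path="lookup_table.json", out_path="next_x.json"):
    """Return the tabulated f(x) if known; otherwise record x for
    external evaluation by the model suite and signal an invalid value."""
    table = []
    if os.path.exists(table_path):
        with open(table_path) as fp:
            table = json.load(fp)   # list of {"x": [...], "f": value}
    # Search the lookup table for a match within tolerance.
    for entry in table:
        if all(abs(a - b) <= TOL for a, b in zip(entry["x"], x)):
            return entry["f"]
    # Not found: save x so the model suite can run with these parameters,
    # then return the invalid value to make the optimiser stop here.
    with open(out_path, "w") as fp:
        json.dump({"x": list(x)}, fp)
    return INVALID
```

The optimiser driver treats an INVALID return as the signal to write out the namelist and hand control back to the workflow.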
Within the optimisation suite, these adjustable parameters are scaled linearly to normalised parameters that lie between 0 and 1. Fixed parameters can be included for convenience, so that their names and values will be written to the namelist file, but these are ignored by the optimisation suite.

The major tasks carried out by Cyclops on each cycle are:

0. (first cycle only) Init: write the initial normalised parameters to file
1. Optimise: run the optimisation code, starting from the initial parameters and evaluating every x in the sequence, until either a stopping criterion is met (in which case the task sends a "stop" message), or a parameter set is reached that is not in the lookup table so needs evaluating (signalled by a "next" message)
2. Namelist: convert x to non-normalised parameters in a namelist file
3. Model: create a new copy of the model suite, copy the namelist file to it, and run it in non-daemon mode (i.e. so the task will not complete until the model suite shuts down). A new copy of the suite is made so that files created in one cycle do not overwrite those created on other cycles.
4. Table: read the resulting error value from the model suite, and update the lookup table

Within one cycle, the dependencies of the optimisation suite are simply

    Optimise => Namelist => Model => Table

to make these tasks run sequentially when no stop condition is met. We also set a dependency on the previous cycle, Table[-P1] => Optimise (the notation -P1 denotes a negative displacement of one cycle period), to ensure that the lookup table is up to date with all previous results before starting the next optimisation cycle, and to prevent Cylc from running successive cycles concurrently.
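In Cylc's suite.rc syntax, this cycle structure might be sketched as follows (a schematic using the task and message names described above; the actual Cyclops suite definition may differ in detail):

```ini
[scheduling]
    cycling mode = integer
    initial cycle point = 1
    [[dependencies]]
        [[[R1]]]   # first cycle only
            graph = "Init => Optimise"
        [[[P1]]]   # every integer cycle
            graph = """
                Table[-P1] => Optimise
                Optimise:next => Namelist => Model => Table
            """
```

Here Optimise:next is a custom-output trigger, so the Namelist, Model, and Table tasks run only when the Optimise task reports a new parameter set to evaluate.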
The stopping condition is handled by a separate branch of the graph, triggered by the Optimise task's "stop" message, in which the Namelist_Final task produces the final version of the namelist file, and the Stop task does a final wrap-up of the completed optimisation before the suite shuts down. For the purposes of good housekeeping, we can also add a Model_delete task to delete each copy of the model suite once all its outputs have been used. Also, tasks which will not be needed (e.g. Namelist if Optimise gives a "stop" message) can be removed, along with any dependencies on those tasks, by so-called "suicide triggers". Figure 1 illustrates the workflow of the optimisation suite described above in graphical form.
The optimisation suite's "Model" task for each cycle is a proxy for a copy of the full model suite being run for the corresponding parameter set. The model suite is run in non-daemon (non-detaching) mode, so that the Model task does not finish until the suite that it represents runs to completion. Information passed between the suites consists of two simple files: a "namelist" file containing parameter names and values written by the optimisation suite for the model suite, and an "error" file containing the single value of the error metric returned by the model suite.
The model suite needs to include a task to process the namelist file into the particular modelling system's standard input formats. Because the formats are highly model-specific, this task needs to be tailored for the particular model suite. For example, in our wave hindcast application described below, this task consists of a shell script which simply includes the namelist file verbatim as part of an ASCII control file, which also has various timing parameters provided from environment variables. Namelists can include named groups of parameters, which may be helpful in this process in cases where these groups need to be treated differently (e.g. affecting different model input files for multiple coupled models and pre- and post-processing tasks within the model suite). However, if the namelist format proved inadequate to supply the needed information, this format could be changed within the optimisation suite to something more suitable. It should be stressed, though, that no change should be needed to the main model codes: they can run as standard release versions under a separate task within the model suite.

Concurrent simulations
For some derivative-free optimisation (DFO) algorithms, at least some parts of the sequence of vectors tested are predetermined, and independent of the function values found at those points. For example, BOBYQA (which we chose to use in the test application described below) sets up a quadratic approximation by sampling the initial point, plus a pair of points on either side of it in each dimension.
With N parameters, the first 2N+1 iterations are spent evaluating these 2N+1 fixed points, regardless of the function values obtained there. In such situations, the function values for each of these points could be evaluated simultaneously.

This can be done within Cylc by allowing tasks from multiple cycles to run simultaneously. In practice, this means that multiple copies of the model suite are running simultaneously, to the extent allowed by resource allocation on the host machine(s). This makes it imperative that a new copy of the model suite is made for each cycle.
If concurrent model simulations are allowed, this means that at any time there is a certain set of parameter vectors for which the function values are still being determined (we can call this the "active" set). We can add another parameter vector to that set if it will be selected by the optimisation algorithm regardless of the function values at the active parameter vectors.
We would clearly like to determine that without needing specific knowledge of how the particular optimisation algorithm works. Instead we use a simple empirical method. To this end, we supplement the lookup table (of vectors already computed, with the resulting f values) with a second table (the "active file") listing the active vectors. We have the function evaluation subroutine search for x first among the "completed" vectors, then among the "active" vectors. If it finds x among the active vectors (for which f is not yet known), it assigns f a random positive value (in this application we do not re-initialise the random number generator with a fixed seed). Otherwise it writes x to file and returns an "invalid" f value to force the optimisation algorithm to stop as usual.
The Python code controlling the optimisation algorithm has also been modified. When the active file is empty it acts as before, but if there are active vectors it runs a small set of repeated applications of the optimisation algorithm (Eqn 2), each of which uses a different set of randomised f values for the active vectors. That is, in the Optimise task for cycle n+1, we repeat the optimisation for trials j = 1, …, J, with independently randomised f values assigned to the active vectors in each trial. If these all result in the same choice of vector x_{n+1} to be evaluated, a "next" message is sent to trigger further tasks for this cycle as before, since this choice is independent of the results for the active parameter vectors. If not, we do not have a definite x_{n+1} to evaluate, and we must wait until at least one of the presently active simulations has finished before trying again, so a "wait" message is sent. But clearly this does not mean that the optimisation is complete.
These repeated randomised applications of the optimisation algorithm are run sequentially within one cycle of the Optimise task, simply to determine whether there is a unique parameter set with which further tasks for that cycle can be started, concurrently with Model tasks already running for other cycles. They do not themselves need to run in parallel.

We also need to consider how the Cylc suite dependency structure can accommodate concurrent simulations. This can be handled in two ways. In the first method we let the Optimise task fail when it determines a "wait" condition, and utilise Cylc's facility to retry failed tasks at specified intervals.
We also replace the Table[-P1] => Optimise dependency with one of the form

    Model[-P1]:started & Table[-PM] => Optimise

where M is a specified maximum number of concurrent simulations. This means that each cycle can first attempt to start a new model simulation as soon as the previous cycle's simulation has started and the Mth previous simulation has completed. The Optimise task will keep retrying at intervals until it is able to give either a "stop" or "next" signal. This method has a simple workflow structure, illustrated in Figure 2, that does not change as M increases.
A schematic illustration of how this might work is shown in Figure 3. Here we consider an application in which the optimisation algorithm uses predetermined values for the first five parameter vectors, after which each new parameter vector selected depends on all previous results (BOBYQA has this behaviour for a two-parameter optimisation). We also assume we have set M ≥ 5. Hence in cycle 2, the Optimise task's randomised test shows that the same parameter vector will be chosen regardless of the outcome of the cycle 1 Model task, so that further cycle 2 tasks can start immediately. Similarly, the cycle 3 Model task does not need to wait for the active cycle 1 and 2 Model tasks to complete, and so forth up to cycle 5. But the cycle 6 Optimise task will detect that its choice of a parameter vector will depend on the results of the active Model tasks, so it will fail and retry. Under our assumptions it will not succeed until no other Model tasks are active, and this will remain the case for all subsequent cycles.
The second method, described in Appendix A and used in the tests described here, uses more complex dependencies and additional Optimise tasks, instead of a single retrying Optimise task. It is somewhat more efficient in that there is no need to wait on a (short) retry interval before determining if a new cycle can start, but the workflow is more complicated.

It should be stressed that the optimisation code itself is simply run as a serial process in each case: it is simply required to produce the single set of parameters, if any, for the next model run given the known results of the completed simulations. As it checks that this parameter set is independent of the results of the presently active model runs without needing to know the actual results, no parallel processing is required within the optimisation code.
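The randomised consensus test described above can be sketched in a few lines of Python. Here choose_next is a deliberately trivial stand-in for one deterministic pass of the optimisation algorithm (the real suite calls NLopt); only the surrounding logic is the point of the example:

```python
import random

def choose_next(points):
    # Dummy deterministic "optimiser": step a fixed amount from the
    # point with the lowest f value.  Stands in for one pass of a
    # real derivative-free algorithm over (x, f) pairs.
    x_best = min(points, key=lambda p: p[1])[0]
    return tuple(a + 0.1 for a in x_best)

def consensus_next(completed, active, trials=5):
    """Re-run the optimiser `trials` times, assigning random positive f
    values to the still-active vectors.  Return the next parameter
    vector if every trial agrees (the "next" condition), or None if the
    choice depends on the active results (the "wait" condition)."""
    choices = set()
    for _ in range(trials):
        randomised = [(x, random.uniform(0.0, 10.0)) for x in active]
        choices.add(choose_next(completed + randomised))
    return choices.pop() if len(choices) == 1 else None
```

With no active vectors every trial trivially agrees; with active vectors that could plausibly become the best point, the trials disagree and a "wait" results.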

Application: a global wave hindcast based on ERA-Interim inputs
Here we describe a global wave simulation, using the WAVEWATCH III model (WW3), forced by inputs from the ERA-Interim Reanalysis, covering the period from January 1979 to December 2016. Such multi-year wave model simulations are a valuable means of obtaining wave climate information at spatial and temporal scales that are not generally available from direct measurements. It is rare for a particular location of interest to have a suitably long nearby in situ wave record, e.g. from a wave-recording buoy, to provide statistically reliable measures of climate variability on inter-annual time scales. And while satellite altimetry has provided near-global records of significant wave height that have been available for more than two decades, these have limited use for many climate applications for several reasons, including a return cycle that is too long to resolve typical weather cycles, limitations in providing nearshore measurements, and lack of directional information. Model simulations can in many cases overcome these limitations, but available measurements still play an essential role in calibrating and verifying the simulations.
In our case, one of the principal motivations for carrying out this hindcast is to investigate the role of wave-ice interactions in the interannual variability of Antarctic sea ice extent, which plays an important role in the global climate system. The ERA-Interim Reanalysis is a suitable basis for this work, providing a consistent long-term record, with careful control on any extraneous factors (e.g. changing data sources, or modelling methods) that might introduce artificial trends or biases into the records. While the ERA-Interim Reanalysis includes a coupled wave model, direct use of the wave outputs does not fully meet our requirements, which include the need for the wave hindcast to be independent of near-ice satellite wave measurements, which were assimilated into the ERA-Interim Reanalysis. Hence we chose to carry out our own wave simulation, forced with ERA-Interim wind fields, but with no assimilation of satellite wave measurements.

Comparison of model outputs with altimeter data
Rather than being assimilated in the hindcast, satellite altimetry measurements of significant wave height were used as an independent source of model calibration. These were obtained from the IFREMER database of multi-mission quality-controlled and buoy-calibrated swath records (Queffeulou, 2004).
Swath records of significant wave height were first collocated to the hourly model outputs on the 1°×1° model grid. For each calendar month simulated, collocations were then accumulated in 3°×3° blocks of 9 neighbouring cells to produce error statistics, including model mean, altimeter mean, bias and root-mean-square error (RMSE), and correlation coefficient R.
Spatial averages of these error statistics were taken over the full model domain between 65°S and 65°N (excluding polar regions with insufficient coverage).
The final error statistic used in the objective function was the spatially-averaged RMSE, normalised by the spatially-averaged altimeter mean, temporally averaged over the simulation period, excluding spinup.
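As an illustration, the construction of this scalar objective from collocated model/altimeter pairs might be sketched as follows (a plain unweighted average over blocks is used here; the actual suite also averages over months and excludes the spinup period):

```python
import math

def block_stats(model, obs):
    """RMSE and altimeter mean for one 3x3 block of collocated
    significant wave heights (model vs altimeter, in metres)."""
    n = len(model)
    rmse = math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / n)
    return rmse, sum(obs) / n

def objective(blocks):
    """Spatially averaged RMSE normalised by the spatially averaged
    altimeter mean: the scalar error metric returned to the optimiser."""
    stats = [block_stats(m, o) for m, o in blocks]
    mean_rmse = sum(r for r, _ in stats) / len(stats)
    mean_obs = sum(om for _, om in stats) / len(stats)
    return mean_rmse / mean_obs
```

A perfect hindcast gives an objective of zero, and a uniform 1 m error against 2 m observed seas gives 0.5, so the metric is dimensionless and non-negative as required by the optimiser.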

WW3 parameters
For this simulation we used version 4.18 of the WAVEWATCH III (WW3) third generation wave model (Tolman, 2014). The model represents the sea state by the two-dimensional ocean wave spectrum F(k, θ; x, t), which gives the energy density of the wave field as a function of wavenumber magnitude k and direction θ, at each position x in the model grid and time t of the simulation.
The spectrum evolves subject to a radiative transfer equation

∂N/∂t + ∇x · (ẋ N) + ∂(k̇ N)/∂k + ∂(θ̇ N)/∂θ = S/σ

for the wave action N(k, θ; x, t) = F(k, θ; x, t)/σ(k), where σ is the frequency associated with waves of wavenumber magnitude k through the linear dispersion relation, and θ is the propagation direction. The dots represent time derivatives. The terms on the left-hand side represent spatial advection, and the shifts in wavenumber magnitude and direction due to refraction by currents and varying water depth. The source term S on the right-hand side represents all other processes that transfer energy to and from wave spectral components, including contributions from wind forcing, energy dissipation and weakly-nonlinear four-wave interactions.
Adjustable parameters within WW3 that can influence a deep-water global simulation such as the one described here are principally concentrated in the wind input and dissipation source terms. It is generally necessary to treat these two terms together as a self-consistent 'package' of input and dissipation treatments designed to work together. In this study we undertook two separate calibration exercises, based on two 'packages' of input/dissipation source terms: firstly that of Tolman and Chalikov (1996) (the ST2 package), and secondly the ST4 package. In Appendix B we describe some of the details of these two packages. We also include some description of the WAM Cycle 4 (ST3) input source term formulation (Janssen, 1991), on which the ST4 input term is based, even though the ST3 package was not tested in this study.

In addition to the input and dissipation terms, the other main control on deep-water wave transformation is provided by weakly nonlinear four-wave interactions (Hasselmann, 1962). Unfortunately, acceptable run time requirements for multi-year simulations over extensive domains still preclude using a near-exact computation of these terms, such as the Webb-Resio-Tracy method (Webb, 1978; Tracy and Resio, 1982) that is available in spectral models including WW3 (van Vledder et al., 2000). Instead we use the much-simplified form of the Discrete Interaction Approximation (Hasselmann et al., 1985), treating its proportionality constant C as a tunable parameter.
Common to both optimisations, sea ice obstruction was turned on (FLAGTR=4) with non-default values for the two critical sea ice concentrations between which wave obstruction by ice varies from zero to total blocking: these were set to 0.25 and 0.75, respectively. All other available parameters beyond the input and dissipation terms were left with default settings, noting that shallow-water processes, while activated, are not expected to have more than a negligible and localised influence on model outputs in a global simulation at 1° resolution.
For initial testing, in which two sets (ST2 and ST4) of optimisation parameters were compared, we used a one-month (January 1997) spinup to a three-month calibration period (February–April 1997). The selection of the calibration period from the full extent of the satellite record was arbitrary.
Relevant parameters used in the two calibrations are listed in Table 2 and Table 3, respectively, which refer to the parameter names as defined (more completely than we do here) in the WW3 user manual (Tolman, 2014), and as specified in namelist inputs to the model. These tables include the initial values of the parameters, the range over which they were allowed to vary, and the final optimised values. Other parameters not listed were kept fixed. A particular example was the input wind vertical level (10 m in both ST2 and ST4), which is a property of the input data set, and hence not appropriate to adjust.
Others were left fixed after an initial test confirmed that they had negligible influence on the objective function, leaving 13 adjustable parameters for ST2 and 17 for ST4.
The selection of which parameters to tune, and the range over which they are allowed to vary, is an area where some (partly subjective) judgement is still required, based on familiarity with the relevant model parameterisations. In this case, parameter ranges were chosen to be physically realistic, and to cover the range of parameter choices used in previous studies reported in the literature.
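Internally, the optimiser works on parameter values normalised over their allowed ranges, so that tolerances and perturbations are comparable across parameters. The mapping can be sketched as follows (a minimal illustration; the parameter names and ranges shown here are hypothetical examples, not the actual WW3 calibration bounds):

```python
# Linear normalisation of physical parameters onto [0, 1] over their
# allowed ranges, for presentation to the optimisation algorithm.

# Hypothetical example ranges (NOT the bounds used in the calibration).
RANGES = {
    "SWELLF": (0.0, 1.5),
    "BETAMAX": (1.0, 2.0),
}

def normalise(params):
    """Map physical parameter values to optimiser variables in [0, 1]."""
    return {k: (v - RANGES[k][0]) / (RANGES[k][1] - RANGES[k][0])
            for k, v in params.items()}

def denormalise(unit_params):
    """Map optimiser variables in [0, 1] back to physical values."""
    return {k: RANGES[k][0] + u * (RANGES[k][1] - RANGES[k][0])
            for k, u in unit_params.items()}
```

With this convention, a stopping tolerance of 0.0001 on the normalised variables corresponds to 0.01% of each parameter's allowed range.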

Optimisation settings
We elected to use primarily the BOBYQA optimisation algorithm (Powell, 2009) for this study. Given that we expected WW3 to be already reasonably well tuned for a global simulation such as our test case, we wished to use a local optimisation algorithm that could reach a solution to a problem with 10–20 variables in as few iterations as possible. Of the algorithms available in NLopt that were included in the intercomparison study of Rios and Sahinidis (2012), BOBYQA was found to be the most suitable in that respect. In particular, it allows for concurrent model runs in the early stages of the optimisation process.
Both optimisations were stopped when either the absolute change in (normalised) parameter values was less than 0.0001, or the relative change in the objective function was less than 0.0001. Less stringent conditions were used initially, but the ability of the optimisation suite to be restarted with revised stopping criteria was invoked to extend the optimisation. These first two tests used a local optimisation method on the assumption that the respective default parameter sets are near-optimal, or at least within the "basin of attraction" of the optimal solution. To test this assumption, two further approaches can be considered. The first would be to use a truly global optimisation algorithm to explore the selected parameter space as thoroughly as possible. This approach may be expected to require a number of iterations in the thousands, which is rather challenging given typical model run times, especially as global methods do not generally allow for parallel iterations.
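The combined stopping rule can be expressed as a simple test applied after each iteration (our own sketch, assuming parameters normalised to their allowed ranges; the function and variable names are not from the Cyclops code):

```python
def should_stop(x_prev, x_new, f_prev, f_new, xtol_abs=1e-4, ftol_rel=1e-4):
    """Return True when either the largest absolute change in any
    (normalised) parameter, or the relative change in the objective
    function, falls below its tolerance."""
    dx = max(abs(a - b) for a, b in zip(x_prev, x_new))
    df = abs(f_new - f_prev) / max(abs(f_prev), 1e-30)
    return dx < xtol_abs or df < ftol_rel
```

Restarting with revised stopping criteria then amounts to re-running this test over the existing lookup table of evaluated points with tighter tolerances.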
A simpler approach is to still use a local algorithm, but initialise it at a range of different starting points. This was the approach we took for our next set of tests, restricted to the ST4 case, in which the initial value of each parameter was selected at random with uniform probability distribution over its allowed range. Five randomised tests were done, along with a control optimisation starting from the default parameter set used previously. For these tests we made some further simplifications in the interests of computational speed, running the hindcast for only one month (February 1997), and initialising all simulations from a common initial condition, spun up over one month with the default parameter set. Both simplifications detract from how applicable the resulting parameter sets would be for hindcast applications, but can be justified in allowing a more extensive examination of parameter space with a given computational resource. A slightly reduced set of ST4 parameters was optimised, omitting three parameters: the two associated with the alternative breaking formulation of Filipot and Ardhuin (2012), and the one controlling the directional spread of the dissipation spectrum. The initial and final values of these parameters from each of the tests are listed in Table 4 and Table 5, respectively. The allowed range of each of the adjustable parameters was the same as in the previous simulations, as listed in Table 3, while both stopping criteria were relaxed to a value of 0.005.
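The multistart strategy itself is straightforward, and can be sketched as follows (an illustration with a placeholder local optimiser; the actual tests each ran a full BOBYQA optimisation of the wave model):

```python
import random

def multistart(local_opt, bounds, n_starts=5, seed=0):
    """Run a local optimiser from several starting points drawn uniformly
    at random within the allowed parameter ranges; return all results,
    each a (final_parameters, final_objective) pair."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_starts):
        x0 = [rng.uniform(lo, hi) for lo, hi in bounds]
        results.append(local_opt(x0))
    return results
```

Comparing the objective values of the returned minima (together with a control run from the default parameters) then reveals whether multiple basins of attraction exist within the allowed ranges.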
Despite the expected high computational demands, we next attempted an optimisation using the global evolutionary algorithm ESCH of da Silva. This was initialised from the default parameter values, and used the same one-month hindcast, parameter ranges and stopping criteria as described above. Following these test simulations, the ST4 parameterisation was chosen for a final calibration, carried out over a 12-month period (January–December 1997) following a one-month spinup (December 1996). This calibration was finally terminated with both stopping criteria set to a value of 0.0001. This was a somewhat arbitrary choice made to observe the evolution of the solution. For practical applications the choice of stopping criteria should take into account the sensitivity of the objective function to measurement error in the calibration data, to avoid unnecessary 'over-tuning' of the model. The full hindcast, from January 1979 through December 2016, was then run using the optimised parameter set. Comparisons with altimeter data were made for each month from August 1991 onward.
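ESCH belongs to the class of evolutionary algorithms. Its general flavour can be conveyed with a much-simplified (1+λ) evolution strategy (this is our own illustrative sketch, not the ESCH method itself):

```python
import random

def simple_es(f, x0, bounds, n_gen=50, lam=4, sigma=0.1, seed=1):
    """Minimal (1+lambda) evolution strategy: each generation mutates the
    current best point with Gaussian noise, clips candidates to the
    allowed bounds, and keeps any offspring that improves the objective."""
    rng = random.Random(seed)
    best_x, best_f = list(x0), f(x0)
    for _ in range(n_gen):
        for _ in range(lam):
            cand = [min(max(x + rng.gauss(0.0, sigma), lo), hi)
                    for x, (lo, hi) in zip(best_x, bounds)]
            fc = f(cand)
            if fc < best_f:
                best_x, best_f = cand, fc
    return best_x, best_f
```

The stochastic mutation gives such methods the ability to escape local minima, at the cost of many more objective evaluations than a model-based local method like BOBYQA requires.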
Each WW3 simulation was run on 64 processors on a single node of either an IBM Power6 or a Cray XC50 machine. Other processing tasks within the suites were run on single processors. The resulting hindcast simulations required an average of approximately 25 minutes of wall clock time to complete each month of simulation.

Local optimisation of 3-month hindcasts with ST2 and ST4 source terms
The BOBYQA algorithm develops a quadratic model of the objective function. To do so, the first iteration evaluates the objective function at the initial point, then perturbs each component in turn by a positive increment, then by an equal negative increment (leaving all other components at their initial values). This can be seen for the ST2 optimisation in Figure 3, in which the bottom panel shows the sequence of (normalised) parameter values tested. With 13 adjustable parameters, this amounts to 27 iterations in this preliminary phase. As this sequence of parameter values is fixed, independent of the resulting objective function values, all of the first 27 iterations could have been run simultaneously as detailed above, if permitted by the queuing system. We, however, applied a limit of 7 parallel iterations in line with anticipated resource limitations.
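This fixed 2n+1-point initial evaluation pattern can be generated explicitly (our own sketch of the pattern, not Powell's implementation):

```python
def initial_design(x0, step):
    """First 2n+1 evaluation points of a BOBYQA-style quadratic model
    build: the starting point, followed by each coordinate perturbed by
    +step and then -step in turn, holding all others at their initial
    values."""
    pts = [list(x0)]
    for i in range(len(x0)):
        for s in (+step, -step):
            p = list(x0)
            p[i] += s
            pts.append(p)
    return pts
```

Because no point in this list depends on the objective value at any other, all 2n+1 model runs could in principle be dispatched concurrently.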
The 3-month ST2 optimisation required only a further 7 iterations after this initial phase to reach a stopping criterion. The ST2 default parameter settings used as the starting point for optimisation resulted in an objective function value of 0.1901, which was reduced to 0.1424 in the optimisation process. In the optimal configuration, none of the tunable parameters was at either limit of its imposed range, indicating that convergence to a true minimum (at least locally) had been reached. Most of the parameters were only slightly modified from their initial values: the largest changes were in two constants of the low-frequency dissipation term, one reduced from 0.0003 to 0.0002059 and the other from 0.47 to 0.2493.
The ST4 3-month optimisation was initialised with the default settings from the TEST451 case reported by Ardhuin et al. (2010), for which the objective function returned a value of 0.1427. Optimisation only managed to reduce this to 0.1419 (Figure 4), indicating that the default ST4 parameter set was already quite closely tuned for our case, having been selected by Ardhuin et al. (2010) largely from broadly similar studies, i.e. global simulations (at 0.5° resolution) compared with altimeter records.
Three of the parameters ended the optimisation at one end of their allowed range, in each case at the same value at which it was initialised. The 16th adjustable parameter controls the assumed directional spread of the dissipation spectrum, and the fact that it remained at its upper limit suggests that the optimisation might be improved by assuming the dissipation spectrum to have a narrower directional distribution than anticipated. The 14th and 15th parameters, on the other hand, are associated with an alternative breaking formulation proposed by Filipot and Ardhuin (2012), who chose values of 0.185 and 1.5 for them, respectively (and correspondingly turned off the default saturation-based dissipation term), whereas this alternative term is turned off in the ST4 default, hence both were initially set to zero. On the face of it, one might think that the optimisation algorithm would have been free to explore solutions with positive values of these parameters, resulting in an optimal 'hybrid' total dissipation term. In fact, the way the dissipation algorithm is coded, this form of the dissipation term is not computed at all while the 14th parameter is zero, which was the case when the BOBYQA algorithm explored sensitivity to the 15th parameter in its initial stages. Our choice of initial values may therefore have spuriously caused the BOBYQA algorithm to underestimate sensitivity to the 15th parameter, and may have missed a distinct second local minimum (approximately corresponding to the parameter settings of Filipot and Ardhuin (2012)).

Tests with local optimisation with randomised initial parameter sets, and global optimisation
The next set of five tests compared results of the local BOBYQA algorithm starting from different parameter sets chosen at random within the allowed ranges (Table 4). The resulting final parameter sets, listed in Table 5, show that each test located a different minimum. This indicates that there are multiple local minima for the error metric in our chosen parameter ranges, in addition to the local minimum derived from the default parameters. The corresponding values of the error metric were all slightly higher than the value (0.1454) obtained from the baseline optimisation starting from the default parameter set, although much reduced from their initial values (Table 4). Although none of the additional local minima found so far has replaced the baseline set as a candidate for a global optimum, this gives no guarantee that the same would hold after a more thorough search.
The attempted global optimisation (using the ESCH algorithm) of the same hindcast had not converged to within the chosen tolerances after 800 cycles. However, in the course of its operation it did identify over 30 parameter sets with slightly lower error metric than the minimum value (0.1450) obtained in the corresponding baseline local optimisation. The lowest value within 800 iterations was 0.1441, and the corresponding parameter values are included in Table 5. This supports our suspicion that a local optimisation algorithm cannot be relied upon to identify the global optimum for this hindcast problem.
On the other hand, the very small decrease in the error metric obtained from this wider search does not give strong justification for making a significant change in parameters from near their default values. We need to bear in mind that the optimisation problem we have addressed in this set of tests (i.e. minimising RMS errors in significant wave height from a one-month partial hindcast) is not quite the same as optimising this measure over a more representative period.

Local optimisation of 12 month hindcast with ST4 source terms
In the final 12-month ST4 optimisation, two additional parameters were allowed to vary that were fixed in the 3-month optimisation, bringing the number of adjustable parameters to 19. These were the two critical sea ice concentrations between which wave obstruction by ice varies from zero to total blocking: these had been fixed at 0.25 and 0.75, respectively, in the 3-month optimisations. Otherwise, the initial parameters (Table 4) again corresponded to the ST4 defaults, which in this case produced an error metric of 0.1436. At termination after 89 iterations (with the more stringent stopping criteria), this had decreased to 0.1431.
Most of the resulting optimised parameters were close to the values obtained from the 3-month optimisation (Table 3).
An exception was the 11th adjustable parameter, which scales the strength of the turbulent contribution to dissipation: this finished the 3-month optimisation at 0.41298, but at 0.0 (the lower bound) in the 12-month simulations. For this longer optimisation, we additionally computed a measure of the sensitivity of the objective function, using the initial phase of the BOBYQA iterations to estimate the change in each (un-normalised) parameter required to produce a 0.1% change in the objective function. This is listed as "Delta" in the seventh column of Table 3, and provides a measure, at least in relative terms, of the bounds within which each parameter value has been determined.
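This sensitivity measure can be estimated directly from the paired perturbations of the initial BOBYQA phase via a centred finite difference (our sketch; the function name and the numerical values in the usage are illustrative, not from the calibration):

```python
def delta_for_0p1_percent(f0, f_plus, f_minus, step):
    """Estimate the change in a single parameter needed to shift the
    objective function by 0.1% of its initial value f0, using the
    objective values f_plus and f_minus obtained when that parameter was
    perturbed by +step and -step in the initial BOBYQA phase."""
    grad = (f_plus - f_minus) / (2.0 * step)
    if grad == 0.0:
        # Objective locally insensitive to this parameter.
        return float("inf")
    return abs(0.001 * f0 / grad)
```

A small Delta indicates a tightly constrained parameter; a very large one flags a parameter to which the chosen error metric is effectively insensitive.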
The full hindcast, run from 1979 to 2016, could be compared with satellite data from August 1991 onward. The resulting bias in significant wave height, averaged over the August 1991–December 2016 comparison period, is shown in Figure 5. Positive biases are obtained in latitudes south of 45°S, particularly south of Australia and in the South Atlantic. Positive bias is also seen in the vicinity of some island groups (notably French Polynesia, Micronesia, the Maldives, the Aleutians, the Caribbean and the Azores), which may be indicative of insufficient sub-grid-scale obstruction. On the other hand, negative biases are seen near the western sides of major ocean basins, and in the "swell shadow" to the northeast of New Zealand. A similar pattern is seen in the results reported by Ardhuin et al. (2010) for their TEST441 case (their Figure 9).
Normalised root-mean-square error (i.e. RMS error divided by the observed mean) from the same comparison, again averaged over the period August 1991–December 2016, is shown in Figure 6. Note that the objective function for our optimisation used this measure, spatially averaged over ocean waters between 61°S and 61°N. For the majority of the ocean surface, this lies in the range 0.08–0.14, but with higher values near some island chains and the western boundaries of ocean basins. Again, similar results were reported by Ardhuin et al. (2010).
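The error metric can be sketched as follows (a minimal illustration, assuming collocated model/altimeter significant wave height pairs have already been grouped by grid cell; the function names are ours):

```python
import math

def normalised_rmse(pairs):
    """Normalised RMS error for one grid cell: the RMS of
    (model - observed) significant wave height over all collocated
    pairs, divided by the observed mean."""
    n = len(pairs)
    rmse = math.sqrt(sum((m - o) ** 2 for m, o in pairs) / n)
    obs_mean = sum(o for _, o in pairs) / n
    return rmse / obs_mean

def cost_function(cell_metrics):
    """Objective function: the spatial average of the per-cell
    normalised RMSE over the included ocean cells."""
    return sum(cell_metrics) / len(cell_metrics)
```

The spatial averaging step is where choices about which variability to retain (Sect. on Discussion) enter the design of the metric.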

Discussion
In their review of methods used to tune Numerical Weather Prediction and climate models, Hourdin et al. (2017) observe that, given the number and complexity of parameterisations to consider, the task of tuning these parameters was for a long time largely left to "expert judgement", and that objective methods have made a more recent appearance than in the statistical, engineering, and computing fields. The method we have presented here, along with the approaches of Severijns and Hazeleger (2005), Tett et al. (2013) and Roach et al. (2017) described in the introduction, performs model tuning through the relatively direct approach of defining and minimising a cost function. Our method has the advantage of employing a tool (Cylc) that is already commonly used to control complex workflows for weather forecasting and climate modelling systems, to optimise the parameters of such a system under its control, in a way that is simple to implement and flexible in the choice of optimisation algorithm. We have shown this to be a practical method for optimising 10–20 parameters in a model application of sufficient complexity to require several hours per simulation in a parallel processing computing environment. For applications that are yet more time-consuming, it is becoming increasingly common (Bellprat et al., 2012; Wang et al., 2014; Duan et al., 2017) to first build a surrogate model to provide a statistical emulator for the actual model, and then apply an optimisation algorithm to the surrogate model. Such multi-stage model optimisation frameworks are beyond the scope of this paper, but the flexibility of our approach could potentially bring benefits to them as well. For example, it may be worth considering a hybrid approach of using a surrogate model to quantify the role of the full set of model parameters and perform an initial global optimisation, before switching to a method such as ours for a final refinement using the original model directly.
In our study we have largely restricted our attention to one local optimisation algorithm (BOBYQA), but our initial results suggest the need in some circumstances to apply a more global method. This is not difficult to do in principle, with multiple algorithms, both global and local, implemented in Cyclops. However, the generally higher computational demands of a global algorithm put a limit on such applications. In this study we have only been able to undertake a preliminary exploration of the wider parameter space of our single chosen test case. This did, however, illustrate that the possibility of multiple alternative local minima must be considered.
As we have seen, there remains a need for care with the choices of which parameters to attempt to optimise, and what bounds to set on their values. Most optimisation algorithms are intended for continuously variable parameters, and may rely on the objective function having a continuous dependence on these parameters. In many cases it is clear which parameters fall into this category, as opposed to discrete-valued options. But in some cases, model code may make binary choices based on real parameters lying within discrete ranges, which may break this assumption. Hence the Cyclops optimisation suite is best employed in conjunction with a good understanding of the role each parameter plays in the model, and the interplay between them.
It is also important to be aware of the role played by the design of the error metric, which may make it sensitive to some parameters and insensitive to others. One should be wary of accepting a large change in these insensitive parameters to achieve a tiny improvement in the chosen error metric, when the resulting model could then perform poorly against other relevant criteria. In the particular wave modelling case we have investigated, our approach would not be sufficient on its own to identify suitable values of the large set of WW3 parameters without guidance from previous studies. Tett et al. (2017) point out that the inherently chaotic nature of the climate system means that a certain level of noise is introduced into evaluations of an atmospheric model simulation, which can cause problems in evaluating the termination criteria. They describe a procedure to rerun a simulation that had nominally satisfied the prescribed convergence criteria, with randomised perturbations, before determining whether or not to terminate. Unlike the atmosphere, ocean surface waves are an essentially dissipative system, and perturbations introduced in the initial conditions and forcing will tend to diminish, rather than grow, with time. As a result, noise in the objective function was not as relevant for our wave hindcast application as it would be for atmospheric models, but it may need to be addressed in applying the method to systems with an underlying chaotic nature, possibly by implementing measures similar to those of Tett et al. (2017) in Cyclops.
Similarly, the dissipative nature of ocean waves means that a cost function based on a spatial average of the (temporal) RMSE of model-data comparisons will not be subject to the level of chaotic variability seen in similar measures for atmospheric models. Small-scale variability in wave model output is therefore more likely to be genuinely sensitive to parameter variation. In that case it is worth capturing such variability in the cost function, whereas for a chaotic system it may be wiser to average out such variability before evaluating the cost function.

Conclusions
The Cyclops Cylc-based optimisation suite offers a flexible tool for tuning the parameters of any modelling system that has been implemented to run under the Cylc workflow engine. Minimal customisation of the modelling system is required, beyond providing tasks to input and apply model parameter values in a simple namelist format, and to output the value of the scalar error metric that is to be minimised. This then allows any of 16 optimisation algorithms (from the NLopt toolbox) to be applied to the optimisation. This optimisation suite is expected to be especially applicable to operational forecasting systems, where minimal re-configuration is required between "tuning" and "operational/production" versions of the forecast suite.
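The namelist interface is the only model-side customisation required; a minimal writer might look like the following (illustrative only — the group and parameter names shown are hypothetical, and in practice would be taken from the model's own namelist definitions):

```python
def write_namelist(group, params):
    """Format a dictionary of parameter values as a Fortran namelist
    group of the form '&GROUP ... /'."""
    lines = ["&" + group]
    for name, value in params.items():
        lines.append("  {} = {}".format(name, value))
    lines.append("/")
    return "\n".join(lines)
```

For example, `write_namelist("SIN4", {"BETAMAX": 1.52})` produces a three-line group beginning `&SIN4` and ending `/`, which the model's input task can read alongside its other configuration.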
Results of the initial test case we have investigated, a global hindcast using a spectral wave model forced by ERA-Interim input fields, illustrate that the method is applicable to a modelling system of moderate complexity, both in terms of the number of parameters to tune and the computational resources required, at least for the purposes of local optimisation to fine-tune a model that already has a more-or-less well-developed initial parameter set from previous studies. Investigations of systems that require a more global tuning approach, or are more computationally demanding, remain for future work. Cylc is available from GitHub (https://cylc.github.io/cylc/) and Zenodo (https://zenodo.org/badge/latestdoi/1836229) under the GPLv3 licence.

Appendix A: Handling concurrent simulations through dependencies
An alternative way to allow for concurrent simulations involves modifying the simple Cylc suite described above to have several versions of the "Optimise" task. Now "Opt_m" runs the optimisation algorithm when there are m active model simulations still running, with m ranging from 0 to a set maximum M-1, where M is the maximum number of concurrent cycles we choose to allow. A more complex set of dependencies is needed to ensure that this is the case. In particular, there is a condition

Table[-P(m+1)] => Opt_m
to ensure that the lookup table has been updated with the results of all completed (i.e. inactive) cycles. If that is the case, the optimisation code will be run to determine whether a new model simulation can be launched while those m tasks are active. If not, the suite will wait until one of the active model runs completes, and try again with "Opt_m-1", and so forth.
The dependency diagram for the case in which up to three concurrent simulations are allowed (i.e. M = 3) is illustrated in Figure 7. Assume, for example, that we are still well short of convergence, and that the optimisation algorithm is such that the next parameter set tested depends on all previous results. Then "Opt_2" and "Opt_1" will always give a "wait" message, and "Opt_0" will be needed on each cycle. This effectively produces the same behaviour as in Figure 1, with each cycle waiting for the immediately preceding cycle to complete before "Opt_0" can start, leading to a new model run. If, on the other hand, the algorithm never depends on the results of the previous two (active) calculations, "Opt_2" will always give a "next" message. This removes the "Opt_1" and "Opt_0" tasks (and any dependencies upon them), leading to the "Model" task being called for cycle N as soon as the cycle N-3 model run has completed and updated the lookup table, even if the cycle N-2 and N-1 "Model" tasks are still running.
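The decision made by each "Opt_m" task can be sketched as follows (our own Python rendering of the logic; in the actual suite, communication happens through the lookup table file and Cylc task messaging rather than function calls):

```python
def opt_task(lookup_table, proposer, m):
    """Decision logic of an 'Opt_m' task, run while m model simulations
    are still active.  'lookup_table' maps already-evaluated parameter
    tuples to objective values; 'proposer' asks the optimisation
    algorithm for its next trial point given those results, returning
    None if it needs one of the m pending results before it can choose.

    Returns ("next", x) to launch a new model run with parameters x,
    or ("wait", None) to retry later (as "Opt_m-1" once a run finishes).
    """
    x = proposer(lookup_table)
    if x is None:
        return ("wait", None)
    return ("next", x)
```

In the fully sequential regime described above, `proposer` returns None whenever any result is pending, so only "Opt_0" ever yields "next"; for algorithms with an independent initial design (such as BOBYQA's 2n+1 phase), higher-m tasks can yield "next" immediately, giving the concurrency of Figure 7.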

B.1 Tolman and Chalikov input + dissipation source term package
The input source term is defined as

S_in(k, θ) = σ β N(k, θ),

where β is a non-dimensional wind-wave interaction parameter, which has a parameterised dependence on wind speed and direction, through boundary-layer properties influenced by the wave spectrum. These dependencies are, however, fully determined, with no user-adjustable terms, so we omit the details here.
This input term was, however, adjusted by Tolman (2002) following a global test case, to ameliorate an excessive dissipation of swell in weak or opposing winds, for which β can be negative. This is done by applying, when β is negative, a swell-filtering scaling factor with a constant (tunable) value for frequencies below 0.6 f_p (where f_p is the peak frequency), increasing linearly to 1 at 0.8 f_p, with higher frequencies unmodified.
The same study also led to the introduction of a correction for the effects of atmospheric stability on wave growth identified by Kahma and Calkoen (1992). This replaces the wind speed with an effective wind speed, parameterised as a function of the bulk stability parameter

ST = g h (T_a − T_s) / (T_0 u_h²),

in terms of air, sea and reference temperatures T_a, T_s and T_0, respectively, the wind speed u_h at reference height h = 10 m, and the gravitational acceleration g. As air and sea surface temperature fields are available from the ERA-Interim dataset, it was possible to apply this parameterisation, treating its dimensionless coefficients, including an offset stability value, as adjustable parameters.
The dissipation term consists of a dominant low-frequency constituent, with an empirical frequency dependence parameterised by three constants, and a high-frequency term parameterised by a further set of constants; for the details we refer to the WW3 manual (Tolman, 2014) and the original references therein.

B.2 WAM Cycle 4 source term package
The input source term implemented in WAM Cycle 4 by Janssen (1991) was based on the wave growth theory of Miles (1957). The starting point is the assumption that the wind speed U has a logarithmic profile, so that if the wind fields input to the model are specified at elevation z_u, then

U(z_u) = (u_*/κ) ln(z_u/z_0),   (B5)

where u_* is the friction velocity, defined by the total wind stress τ = ρ_a u_*², κ is von Karman's constant, and z_0 is a roughness length modified by wave conditions:

z_0 = α_0 u_*² / (g √(1 − τ_w/τ)),   (B6)

in which τ_w is the magnitude of the wave-supported stress, and α_0 is a tunable dimensionless parameter.
The wave-supported stress can be equated to the rate of momentum transfer between wind and waves:

τ_w = ρ_w g ∫∫ (S_in(k, θ)/c) (cos θ, sin θ) dk dθ,   (B7)

where c is the wave phase velocity. The WAM Cycle 4 input source term is then given by

S_in(k, θ) = (ρ_a/ρ_w) (β_max/κ²) e^Z Z⁴ (u_*/c)² cos^{p_in}(θ − θ_u) σ N(k, θ).   (B8)

In these terms ρ_a and ρ_w are the densities of air and water, β_max is a dimensionless constant, z_α (entering through Z) is a wave age tuning parameter and p_in is a parameter controlling the directional dependence relative to the wind direction θ_u.
The inter-dependence of τ_w and S_in expressed in (B7) and (B8) creates an implicit functional dependence of u_* on U and τ_w/τ. In practice, this dependence can be tabulated, using the resolved model spectrum for the low-frequency part of (B7), above which a diagnostic spectral tail is assumed. An additional output term represents a linear damping of swells, in the form given by Bidlot (2012), with a switch parameter set to 1 (0) to turn the damping on (off).
Dissipation is represented in a form depending on the mean frequency, mean wavenumber and mean steepness of the spectrum, with two weighting constants whose respective WAM defaults are 0.5 and 1 (Bidlot, 2012); we again refer to the WW3 manual (Tolman, 2014) for the explicit form.

B.3 Ardhuin (2010) source term package
This package introduces a saturation-based dissipation term. In order to accommodate this, the WAM Cycle 4 input source function is modified by replacing u_* in (B8) with a frequency-dependent effective friction velocity, reduced by a sheltering coefficient (approximately 1) to account for the stress already supported by longer waves, allowing balance with the saturation-based dissipation term. Also, a limit can be placed on the roughness length z_0, replacing (B6) with a form capped at a maximum value. The swell dissipation parameterisation of Ardhuin et al. (2009) is used, in which viscous and turbulent regimes are distinguished by a boundary-layer Reynolds number, with a threshold value that depends on significant wave height. The turbulent contribution is parameterised to depend on wind speed and direction, based on the friction factor from the Grant and Madsen (1979) theory of oscillatory boundary-layer flow over a rough surface.
The breaking dissipation term is based on the saturation of the wave spectrum, and involves a constant fixed at 0.5 together with a tuning coefficient setting its overall strength; we refer to Ardhuin et al. (2010) for the explicit form.
An additional term represents dissipation of wave energy by ocean turbulence; its explicit form is given in the WW3 manual (Tolman, 2014).

Competing interests
The authors declare that they have no conflict of interest.

Author contribution
Richard Gorman developed the Cyclops optimisation suite, carried out all simulations, and prepared the manuscript. Hilary Oliver leads the development of Cylc, assisted with the design of the optimisation suite, and contributed to the content of the manuscript.

Table 6. As for Table 3, but for parameters used to calibrate the simulation using the source term package of Ardhuin et al. (2010), for Jan–Dec 1997. The "Delta" value in the seventh column is the estimated change in the (un-normalised) parameter required to produce a 0.1% change in the objective function.

[Displaced figure caption text:] ...and the Optimise task in the next cycle. Green arrows represent these dependencies on task success. Optimise tasks which fail to select a parameter vector independent of the result of active tasks retry (yellow arrows) at prescribed intervals until they succeed. The time axis is not to scale: Model tasks will typically have run times orders of magnitude longer than those of Optimise tasks. In this example, we suppose that the particular optimisation algorithm employed allows for up to five concurrent cycles during the initial stages.

[Displaced figure caption text:] This example shows four successive cycles, for the case in which up to three parallel simulations are allowed. Arrows represent dependency, which in some cases are combined by a logical OR (enclosed "+" symbol).