DATeS: A Highly-Extensible Data Assimilation Testing Suite

A flexible and highly-extensible data assimilation testing suite, named DATeS, is described in this paper. DATeS aims to offer a unified testing environment that allows researchers to compare different data assimilation methodologies and understand their performance in various settings. The core of DATeS is implemented in Python and takes advantage of its object-oriented capabilities. The main components of the package (the numerical models, the data assimilation algorithms, the linear algebra solvers, and the time discretization routines) are independent of each other, which offers great flexibility to configure data assimilation applications. DATeS can interface easily with large third-party numerical models written in Fortran or in C, and with a plethora of external solvers.


Introduction
Data Assimilation (DA) refers to the fusion of information from different sources, including priors, predictions of a numerical model, and snapshots of reality, in order to produce an accurate description of the state of a physical system of interest [15,28]. DA research is of increasing interest for a wide range of fields including geoscience, numerical weather forecasts, atmospheric composition predictions, oil reservoir simulations, and hydrology.
Two approaches have gained wide popularity for solving DA problems, namely ensemble and variational approaches. The ensemble approach is rooted in statistical estimation theory and uses an ensemble of states to represent the underlying probability distributions. The variational approach, rooted in control theory, involves solving an optimization problem to obtain a single "analysis" as an estimate of the true state of the system of concern. The variational approach does not provide an inherent description of the uncertainty associated with the obtained analysis; however, it is less sensitive to physical imbalances prevalent in the ensemble approach. Hybrid methodologies designed to harness the best of the two worlds are an on-going research topic.
Numerical experiments are an essential ingredient in the development of new DA algorithms. Implementation of numerical experiments for DA involves linear algebra routines, a numerical model along with time integration routines, and an assimilation algorithm. Currently available testing environments for DA applications are either very simplistic or very general, many are tied to specific models, and are usually completely written in a specific language. A researcher who wants to test a new algorithm with different numerical models written in different languages might have to re-implement his/her algorithm using the specific settings of each model. A unified testing environment for DA is important to enable researchers to explore different aspects of various filtering and smoothing algorithms with minimal coding effort.
The DA Research Section (DAReS) at the National Center for Atmospheric Research (NCAR) provides DART [3] as a community facility for ensemble filtering. The DART platform is currently the gold standard for ensemble-based Kalman filtering algorithm implementations. It is widely used in both research and operational settings, and interfaces to most important geophysical numerical models are available. DART employs a modular programming approach and adheres strictly to solid software engineering principles. DART has a long history, and is continuously well maintained; new ensemble-based Kalman filtering algorithms that appear in the literature are routinely added to its library. Moreover it gives access to practical, and well-established parallel algorithms. DART is, by design, very general in order to support operational settings with many types of geophysical models. Using DART requires a non-trivial learning overhead. The fact that DART is mainly written in Fortran makes it a very efficient testing platform, however this limits to some extent the ability to easily employ third party implementations of various components.
Matlab programs are often used to test new algorithmic ideas due to its ease of implementation. A popular set of Matlab tools for ensemble-based DA algorithms is provided by the Nansen Environmental and Remote Sensing Center (NERSC), with the code available from [19]. A Matlab toolbox for uncertainty quantification (UQ) is UQLab [32]. Matlab is generally a very useful environment for small-to-medium scale numerical experiments.
Python is a modern and popular scripting language that offers the power of reusing existing pieces of code via inheritance. Python is widely known as a powerful scripting tool for scientific applications that can be used to glue legacy codes. This can be achieved by writing wrappers that act as interfaces. Building wrappers around existing C and Fortran code is a common practice in scientific research. Several automatic wrapper generation tools, such as SWIG [11] and F2PY [34], are available to create proper interfaces between Python and low-level languages. Translating Matlab code to Python is a relatively easy task; moreover, one can call Matlab functions from Python using the Matlab Engine API. Python is available on virtually all Linux, MacOS, and Windows platforms, and therefore Python software has excellent portability. When using Python instead of Fortran or C, one generally trades some computational performance for programming productivity. The performance penalty in scientific calculations is minimized by delegating computationally intensive tasks to compiled languages such as Fortran. This approach is followed by the scientific computing Python modules Numpy and Scipy, and makes it possible to write scientific Python code that is computationally efficient.
This paper presents a highly-extensible Python-based DA testing suite. The package is named DATeS, and is intended to be an open-source, extendable package positioned between the simple typical research-grade implementations and the professional implementation of DART, but with the capability to utilize large physical models. Students can use it as an interactive learning tool, and researchers can use it as an experimental testing pad where they can focus on coding only their new ideas without worrying much about the other pieces of the DA process. The code developed by a researcher in the DATeS framework should fit with all other pieces in the package with minimal-to-no effort, as long as the programmer follows the "flexible" rules of DATeS. As an initial illustration of its capabilities, DATeS has been used to carry out the experiments presented in [6].
The paper is structured as follows. Section 2 reviews the DA problem and the most widely used approaches to solve it. Section 3 describes the architecture of the DATeS package. Section 4 takes a user-centric and example-based approach for explaining how to work with DATeS, and Section 5 demonstrates the main guidelines of contributing to DATeS. Conclusions and future development directions are discussed in Section 6.

Data Assimilation
Assume the true state of a physical system at a given time t_k is described by x^true(t_k). The time evolution of the physical system is approximated by the discretized forward model

x_{k+1} = M_{k,k+1}(x_k),  k = 0, 1, . . . ,  (1)

where x_k ∈ R^{N_state} is a discretized approximation of the true state at time instance t_k, and M_{k,k+1} is the model solution operator propagating the state from time t_k to t_{k+1}. The prior distribution P^b(x_k) encapsulates the knowledge about the model state at time instance t_k before additional information is incorporated. The likelihood function P(Y|x_k) quantifies the deviation of the prediction of model observations from the collected measurements. The posterior distribution is formulated by applying Bayes' theorem as follows:

P(x_k|Y) = P(Y|x_k) P^b(x_k) / P(Y) ∝ P(Y|x_k) P^b(x_k),  (2)

where Y refers to the data (observations) to be assimilated. In the sequential filtering context Y is a single observation, while in the smoothing context it generally stands for several observations {y_0, y_1, . . . , y_m} to be assimilated simultaneously.
Consider assimilating information available about the system state at time instance t_k. The computer model is used to provide a prior prediction (forecast) about the system state, denoted by x^b_k. In typical applications of DA, the error distributions are assumed to be Gaussian, resulting in the so-called "Gaussian framework". Consider a Gaussian framework, where the prior errors are assumed to be normally distributed with zero mean and a covariance matrix B_k ∈ R^{N_state × N_state}, that is (x^b_k − x^true_k) ∼ N(0, B_k). Consider an observation y_k ∈ R^{N_obs} given at time instance t_k. The observation errors are assumed to follow a Gaussian distribution with zero mean and covariance matrix R_k ∈ R^{N_obs × N_obs}, that is (y_k − y^true_k) ∼ N(0, R_k). Following a perfect-model assumption, the posterior distribution follows from (2) as:

P(x_k|y_k) ∝ exp( −(1/2) ||x_k − x^b_k||²_{B_k^{-1}} − (1/2) ||H_k(x_k) − y_k||²_{R_k^{-1}} ),  (3)

where the scaling factor P(y_k) is dropped, and H_k is the observation operator that maps the state space vector onto the observation space. In practical applications, the dimension of the observation space is much smaller than the state space dimension, that is N_obs ≪ N_state. Ensemble filtering methods such as the ensemble Kalman filter (EnKF) [13, 17, 18, 23-25, 36-38, 40, 43] and the maximum likelihood ensemble filter (MLEF) [42] use ensembles of states to represent the prior and the posterior distributions. A prior ensemble X_k = {x_k(e)}, e = 1, 2, . . . , N_ens, approximating the prior distribution, is obtained by propagating the analysis states from the previous assimilation cycle at time t_{k−1} through the forecast model (1). Most of the ensemble-based DA methodologies work by transforming the prior ensemble into an ensemble of states collected from the posterior distribution, namely an analysis ensemble. The transformation in the EnKF framework is applied following the update equations of the well-known Kalman filter.
A minimum variance estimate (MVE) of the true state of the system is obtained by averaging the analysis ensemble, while the posterior covariance is approximated by the covariance matrix of the analysis ensemble.
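As a concrete illustration of the ensemble update described above, the following NumPy sketch implements a stochastic (perturbed-observations) EnKF analysis step for a linear observation operator. This is an illustrative sketch, not DATeS' own implementation; the function name and argument layout are assumptions.

```python
import numpy as np

def enkf_analysis(X, y, H, R, rng):
    """Stochastic (perturbed-observations) EnKF analysis step.

    X : (n_state, n_ens) forecast ensemble
    y : (n_obs,) observation vector
    H : (n_obs, n_state) linear observation operator
    R : (n_obs, n_obs) observation error covariance
    """
    n_ens = X.shape[1]
    HX = H @ X
    A = X - X.mean(axis=1, keepdims=True)       # state anomalies
    HA = HX - HX.mean(axis=1, keepdims=True)    # observed anomalies
    Pxy = A @ HA.T / (n_ens - 1)                # state-observation covariance
    Pyy = HA @ HA.T / (n_ens - 1) + R           # innovation covariance
    K = np.linalg.solve(Pyy, Pxy.T).T           # Kalman gain
    # Perturb the observation for each member so that the analysis
    # ensemble carries the correct posterior spread.
    Y = y[:, None] + rng.multivariate_normal(np.zeros(y.size), R, size=n_ens).T
    return X + K @ (Y - HX)
```

Deterministic variants such as DEnKF avoid the observation perturbations and instead update the ensemble anomalies with a halved Kalman gain.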
The maximum a posteriori (MAP) estimate of the true state is the state that maximizes the posterior probability density function (PDF). Alternatively, the MAP estimate is the minimizer of the negative logarithm (negative-log) of the posterior PDF. The MAP estimate can be obtained by solving the following optimization problem:

min_{x_k} J(x_k) = (1/2) ||x_k − x^b_k||²_{B_k^{-1}} + (1/2) ||H_k(x_k) − y_k||²_{R_k^{-1}}.  (4)

This formulates the three-dimensional variational (3D-Var) DA problem. Derivative-based optimization algorithms used to solve (4) require the derivative of the negative-log of the posterior PDF:

∇_{x_k} J(x_k) = B_k^{-1} (x_k − x^b_k) + H_k^T R_k^{-1} (H_k(x_k) − y_k),  (5)

where H_k = ∂H_k/∂x_k is the sensitivity (e.g., the Jacobian) of the observation operator H_k evaluated at x_k. Unlike ensemble filtering algorithms, the optimal solution of (4) provides a single estimate of the true state, and does not provide a direct estimate of the associated uncertainty.
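For illustration, the 3D-Var problem (4) with a linear observation operator can be solved with an off-the-shelf derivative-based optimizer. The helper below is a hedged sketch using SciPy, not part of DATeS; its name and signature are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def threedvar_analysis(xb, B_inv, y, R_inv, H):
    """MAP estimate for a linear observation operator H (a sketch).

    Minimizes J(x) = 1/2 (x-xb)^T B^{-1} (x-xb) + 1/2 (Hx-y)^T R^{-1} (Hx-y).
    """
    def cost(x):
        dxb, dy = x - xb, H @ x - y
        return 0.5 * dxb @ (B_inv @ dxb) + 0.5 * dy @ (R_inv @ dy)

    def grad(x):  # gradient of the negative-log posterior, cf. (5)
        return B_inv @ (x - xb) + H.T @ (R_inv @ (H @ x - y))

    return minimize(cost, xb, jac=grad, method='L-BFGS-B').x
```

With identity background and observation covariances and H = I, the analysis is simply the average of the background and the observation, which makes the sketch easy to sanity-check.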
Assimilating several observations Y := {y_0, y_1, . . . , y_m} simultaneously requires adding time as a fourth dimension to the DA problem. Let P^b(x_0) be the prior distribution of the system state at the beginning of a time window [t_0, t_F] over which the observations are distributed. Assuming the observations' errors are temporally uncorrelated, the posterior distribution of the system state at the initial time of the assimilation window t_0 follows by applying Equation (2) as:

P(x_0|Y) ∝ P^b(x_0) ∏_{k=0}^{m} P(y_k|x_k),  x_k = M_{0,k}(x_0).  (6)

In the statistical approach, ensemble-based smoothers such as the ensemble Kalman smoother (EnKS) are used to approximate the posterior (6) based on an ensemble of states. Similar to the ensemble filters, the analysis ensemble generated by a smoothing algorithm can be used to provide an MVE of the posterior first-order moment. It also can be used to provide a flow-dependent ensemble covariance matrix to approximate the posterior true second-order moment. The MAP estimate of the true state at the initial time of the assimilation window can be obtained by solving the following optimization problem:

min_{x_0} J(x_0) = (1/2) ||x_0 − x^b_0||²_{B_0^{-1}} + (1/2) Σ_{k=0}^{m} ||H_k(x_k) − y_k||²_{R_k^{-1}},  x_k = M_{0,k}(x_0).  (7)

This is the standard formulation of the strong-constraint four-dimensional variational (4D-Var) DA problem. The solution of the 4D-Var problem is equivalent to the MAP of the smoothing posterior in the Gaussian framework. The gradient of (7) with respect to the model state at the initial time of the assimilation window reads

∇_{x_0} J(x_0) = B_0^{-1} (x_0 − x^b_0) + Σ_{k=0}^{m} M^T_{0,k} H_k^T R_k^{-1} (H_k(x_k) − y_k),  (8)

where M^T_{0,k} is the adjoint of the tangent linear model operator, and H^T_k is the adjoint of the observation operator sensitivity. Similar to the 3D-Var case, the solution of Equation (7) provides a single best estimate (the analysis) of the system state, without providing a consistent description of the uncertainty associated with this estimate.
In idealized settings, where the model is linear, the observation operator is linear, and the underlying probability distributions are Gaussian, the posterior is also Gaussian; however, this is rarely the case in real applications. Algorithms capable of accommodating non-Gaussianity are still limited, and have not been successfully tested in large-scale settings.
Particle filters (PF) [16,21,29,39] are an attractive family of nonlinear and non-Gaussian methods. This family of filters is known to suffer from filtering degeneracy, especially in large-scale systems. While PFs make no assumptions on the shape of the underlying probability distribution functions, they are not generally efficient without expensive tuning.
Recently, a family of fully non-Gaussian DA algorithms that work by sampling the posterior was developed in [4, 6-10]. This family follows a Hamiltonian Monte-Carlo (HMC) approach for sampling the posterior; however, the HMC sampling scheme can be easily replaced with other algorithms suitable for sampling complicated, and potentially multimodal, probability distributions in high-dimensional state spaces.
DATeS provides standard implementations of several flavors of the algorithms mentioned here. One can easily explore, test, or modify the provided implementations in DATeS, and add more methodologies. As discussed later, one can use existing components of DATeS, such as the implemented numerical models, or add new implementations to be used by other components of DATeS.

DATeS Implementation
DATeS seeks to capture in an abstract form the common elements shared by most DA applications and solution methodologies. For example, the majority of the ensemble filtering methodologies share nearly all the steps of the forecast phase, and a considerable portion of the analysis steps. Moreover, all the DA applications involve common essential components such as linear algebra routines, model discretization schemes, and analysis algorithms.
Existing DA solvers have been implemented in different languages. For example, low-level languages such as Fortran and C have been (and are still being) extensively used to develop numerically efficient model implementations, and linear algebra routines. Both Fortran and C allow for efficient parallelization because these two languages are supported by common libraries designed for distributed memory systems such as MPI, and shared memory libraries such as Pthreads and OpenMP.
To make use of these available resources and implementations, one has to either rewrite all the different pieces in the same programming language, or have proper interfaces between the different new and existing implementations.
The philosophy behind the design of DATeS is that "a unified DA testing suite has to be open-source, easy to learn, and able to reuse and extend available code with minimal effort". Such a suite should allow for easy interfacing with external third-party code written in various languages, e.g., linear algebra routines written in Fortran, analysis routines written in Matlab, or "forecast" models written in C. This should help the researchers to focus their energy on implementing and testing their own analysis algorithms.
The next section details several key aspects of the DATeS implementation.

DATeS architecture
The DATeS architecture abstracts the following generic components of any DA system: 1. linear algebra routines, 2. a "forecast" computer model that includes the discretization of the physical processes, 3. error models, e.g. observation and background error, 4. an analysis methodology, e.g., a filter or a smoother.
In DATeS, an independent set of modules is built for each of these components.
Linear algebra routines. The linear algebra routines are responsible for handling the data structures representing essential entities such as model state vectors, observation vectors, and covariance matrices. This includes manipulating an instance of the corresponding data.
For example, a model state vector should provide methods for accessing/slicing and updating entries of the state vector, a method for adding two state vector instances, and methods for applying specific scalar operations on all entries of the state vector such as evaluating the square root or the logarithm.
Forecast model. The forecast computer model simulates a physical phenomenon of interest such as the atmosphere, ocean dynamics, or volcanoes. This typically involves approximating the physical phenomenon using a gridded computer model. The implementation should provide methods for creating and manipulating state vectors, and state-size matrices. The computer model should also provide methods for creating and manipulating observation vectors and observation-size matrices. The observation operator responsible for mapping state-size vectors into observation-size vectors should be part of the model implementation as well.
Simulating the evolution of the computer model in time is carried out using numerical time integration schemes. The time integration scheme can be model-specific, and is usually written in a low-level language for efficiency.
Error models. It is common in DA applications to assume a perfect forecast model, a case where the model is deterministic rather than stochastic. However, the background and observation errors need to be treated explicitly as they are essential in the formulation of nearly all DA methodologies. We refer to the DATeS entity responsible for managing and creating random vectors, sampled from a specific probability distribution function, as the "error model". For example, a Gaussian error model would be completely set up by providing the first and second order moments of the probability distribution it represents.
Analysis algorithms. Analysis algorithms manipulate model states and observations by applying widely used mathematical operations to perform inference operations. The popular DA algorithms can be classified into filtering and smoothing categories.
An assimilation algorithm, a filter or a smoother, is implemented to carry out a single DA cycle. For example, in the filtering framework, an assimilation cycle refers to assimilating data at a single observation time by applying a forecast and an analysis step. On the other hand, in the smoothing context, several observations available at discrete time instances within an assimilation window are processed simultaneously in order to update the model state at a given time over that window; a smoother is designed to carry out the assimilation procedure over a single assimilation window.
Assimilation experiments. In typical numerical experiments a DA solver is applied for several consecutive cycles to assess its long-term performance. We refer to the procedure of applying the solver to several assimilation cycles as the "assimilation process". The assimilation process involves carrying out the forecast and analysis cycles repeatedly, creating synthetic observations or retrieving real observations, updating the reference solution when available, and saving experimental results between consecutive assimilation cycles.
DATeS layout. The design of DATeS takes into account the distinction between these components and separates them by design, following an Object-Oriented Programming (OOP) approach. A general description of the DATeS architecture is given in Figure 1.
The enumeration in Figure 1 (numbers from 1 to 4 in green circles) indicates the order in which essential DATeS objects should be created. Specifically, one starts with an instance of a model. Once a model object is created, an assimilation object is instantiated, and the model object is passed to it. An assimilation process object is then instantiated, with a reference to the assimilation object passed to it. The assimilation process object iterates the consecutive assimilation cycles and saves and/or outputs the results, which can optionally be analyzed later using visualization modules.
All DATeS components are independent so as to maximize flexibility in experimental design. However, each newly added component must comply with DATeS rules in order to guarantee interoperability with the other pieces in the package. DATeS provides base classes with definitions of the necessary methods. A new class added to DATeS, for example to implement a specific new model, has to inherit the appropriate model base class, and provide implementations of the inherited methods from that base class.
In order to maximize both flexibility and generalizability, we opted to handle the configuration, input, and output of DATeS objects using "configuration dictionaries". Parameters are passed to DATeS objects as key-value pairs in these dictionaries.
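To make the idea concrete, here is an illustrative configuration dictionary together with a small aggregation helper that fills missing keys from defaults. The keys and the helper name are assumptions for the sake of the example, not DATeS' exact API.

```python
# Illustrative DATeS-style configuration dictionary; the keys shown here
# are assumptions for the sake of the example, not the package's exact ones.
filter_configs = {
    'model': None,              # model object the filter operates on
    'ensemble_size': 20,
    'inflation_factor': 1.06,
    'localization_radius': 12,
}

def aggregate_configs(user_configs, default_configs):
    """Fill in missing keys from defaults (mimics configs aggregation)."""
    merged = dict(default_configs)
    merged.update(user_configs or {})
    return merged
```

Calling `aggregate_configs({'ensemble_size': 30}, filter_configs)` returns a dictionary where only the ensemble size differs from the defaults, leaving the defaults dictionary untouched.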

Linear algebra classes
The main linear algebra data structures essential for almost all DA aspects are a) model state-size and observation-size vectors (also named state and observation vectors, respectively), and b) state-size and observation-size matrices (also named state and observation matrices, respectively). A state matrix is a square matrix of order equal to the model state space dimension. Similarly, an observation matrix is a square matrix of order equal to the model observation space dimension. DATeS makes a distinction between state and observation linear algebra data structures.
Third-party linear algebra routines can have widely different interfaces and underlying data structures. For reusability, DATeS provides unified interfaces for accessing and manipulating these data structures using Python classes. The linear algebra classes are implemented in Python. The functionalities of the associated methods can be written either in Python, or in lower-level languages using proper wrappers.
A class for a linear algebra data structure enables updating, slicing, and manipulating an instance of the corresponding data structures. For example, a model state vector class provides methods that enable accessing/slicing and updating entries of the state vector, a method for adding two state vector instances, and methods for applying specific scalar operations on all entries of the state vector such as evaluating the square root or the logarithm. Once an instance of a linear algebra data structure is created, all its associated methods are accessible via the standard Python dot operator.
The following linear algebra base classes are provided in DATeS: (i) state_vector_base.StateVectorBase: a base class for state vector objects including all necessary methods, (ii) state_matrix_base.StateMatrixBase: a base class for state matrix objects with methods implementing necessary matrix operations, (iii) observation_vector_base.ObservationVectorBase: a base class for observation vector objects with related vector operations, (iv) observation_matrix_base.ObservationMatrixBase: a base class for observation matrix objects providing methods for related matrix operations.
Python special methods are provided in each linear algebra class to enable iterating over the entries of a linear algebra data structure. Examples of these methods include __getitem__(), __setitem__(), __getslice__(), __setslice__(), etc. These operators make it feasible to standardize working with linear algebra data structures implemented in different languages or saved in memory in different forms.
These classes provide templates for designing more sophisticated extensions of the linear algebra classes.
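A minimal NumPy-backed state vector along the lines described above might look as follows. The class and method names are illustrative assumptions, not DATeS' actual StateVectorBase API.

```python
import numpy as np

# A minimal NumPy-backed state vector in the spirit of DATeS'
# state_vector_base.StateVectorBase; names here are illustrative.
class NumpyStateVector:
    def __init__(self, data):
        self._raw = np.asarray(data, dtype=float)

    def __getitem__(self, idx):          # accessing/slicing entries
        return self._raw[idx]

    def __setitem__(self, idx, value):   # updating entries
        self._raw[idx] = value

    def add(self, other):                # add two state vector instances
        return NumpyStateVector(self._raw + other._raw)

    def sqrt(self):                      # elementwise scalar operation
        return NumpyStateVector(np.sqrt(self._raw))
```

A vector backed by, say, a Fortran array would expose the same methods, so assimilation code written against this interface need not change.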

Forecast model classes
Each numerical model needs an associated class providing methods to access its functionality. The unified forecast model class design in DATeS provides the essential tasks that can be carried out by the model implementation. Each model class in DATeS has to inherit the model base class models_base.ModelBase or a class derived from it.
A numerical model class is required to provide access to the underlying linear algebra data structures and time integration routines. For example, each model class has to provide the method state_vector() that creates an instance of a state vector class, and the method integrate_state() that takes a state vector instance and time integration settings, and returns a trajectory (list of states) evaluated at specified future times. The base class provided in DATeS contains definitions of all the methods that need to be supported by a numerical model class.
While some linear algebra and time integration routines are model-specific, DATeS also implements general-purpose linear algebra classes and time integration routines that can be reused by newly created models. For example, the general integration class FatODE_ERK_FWD is based on the FATODE [41] explicit Runge-Kutta (ERK) forward propagation schemes.
DATeS includes implementations of several popular test models for data assimilation, including: (i) lorenz_models.Lorenz3: a class implementing the three-variable Lorenz model [30].
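To illustrate the model interface described above, the following self-contained toy model exposes state_vector() and integrate_state() for the three-variable Lorenz system. It is a sketch of the interface under stated assumptions, not DATeS' actual lorenz_models.Lorenz3 class.

```python
import numpy as np

# Illustrative mini-model following the interface described in the text;
# method names mirror state_vector() and integrate_state().
class Lorenz63:
    def __init__(self, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
        self.sigma, self.rho, self.beta = sigma, rho, beta

    def state_vector(self):
        """Create a (zero) state vector of the model's dimension."""
        return np.zeros(3)

    def _rhs(self, x):
        return np.array([self.sigma * (x[1] - x[0]),
                         x[0] * (self.rho - x[2]) - x[1],
                         x[0] * x[1] - self.beta * x[2]])

    def integrate_state(self, x0, checkpoints):
        """Classic fourth-order Runge-Kutta; returns states at checkpoints."""
        traj = [np.array(x0, dtype=float)]
        for t0, t1 in zip(checkpoints[:-1], checkpoints[1:]):
            x = traj[-1].copy()
            n_sub = 100                      # fixed substeps per interval
            h = (t1 - t0) / n_sub
            for _ in range(n_sub):
                k1 = self._rhs(x)
                k2 = self._rhs(x + 0.5 * h * k1)
                k3 = self._rhs(x + 0.5 * h * k2)
                k4 = self._rhs(x + h * k3)
                x += (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
            traj.append(x)
        return traj
```

An assimilation object written against this interface only calls state_vector() and integrate_state(), so swapping in a larger model requires no changes to the analysis code.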

Error models classes
In many DA applications the errors are additive, and are modeled by random variables normally distributed with zero mean and a given or unknown covariance matrix. DATeS implements Numpy-based functionality for background, observation, and model errors in the following classes, respectively: (i) error_models_numpy.BackgroundErrorModelNumpy, (ii) error_models_numpy.ObservationErrorModelNumpy, (iii) error_models_numpy.ModelErrorModelNumpy.
These classes are derived from the base class ErrorsModelBase and provide methodologies to sample the underlying probability distribution, evaluate the value of the density function, and generate statistics of the error variables based on model trajectories and the settings of the error model. The Numpy-based implementations can be used as templates for user-defined extended error model classes.
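A minimal Gaussian error model with the responsibilities described above (sampling and density evaluation) can be sketched as follows. The class and method names are assumptions, not the actual Numpy-based DATeS classes.

```python
import numpy as np

# Minimal Gaussian error model sketch; names are illustrative.
class GaussianErrorModel:
    def __init__(self, mean, covariance, seed=None):
        self.mean = np.asarray(mean, dtype=float)
        self.cov = np.asarray(covariance, dtype=float)
        self._rng = np.random.default_rng(seed)

    def generate_noise_vec(self):
        """Sample one error vector from N(mean, cov)."""
        return self._rng.multivariate_normal(self.mean, self.cov)

    def log_density(self, e):
        """Log of the Gaussian PDF evaluated at the error vector e."""
        d = e - self.mean
        _, logdet = np.linalg.slogdet(self.cov)
        k = self.mean.size
        return -0.5 * (d @ np.linalg.solve(self.cov, d)
                       + logdet + k * np.log(2.0 * np.pi))
```

Specifying only the first and second order moments fully determines the model, as noted earlier for Gaussian error models.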

Assimilation classes
Assimilation classes are responsible for carrying out a single assimilation cycle (i.e., over one assimilation window) and optionally printing or writing the results to files. For example, an EnKF object should be designed to carry out one cycle consisting of the "forecast" and the "analysis" steps. The basic assimilation objects in DATeS are a filtering object, a smoothing object, and a hybrid object. DATeS provides the common functionalities for filtering objects in the base class filters_base.FiltersBase; all derived filtering classes should have it as a super class. Similarly, smoothing objects are to be derived from the base class smoothers_base.SmoothersBase. A hybrid object can inherit methods from both filtering and smoothing base classes.
A model object is passed to the assimilation object constructor via configuration dictionaries to give the assimilation object access to the model-based data structures and functionalities. The settings of the assimilation object, such as the observation time, the assimilation time, the observation vector, and the forecast state or ensemble, are also passed to the constructor upon instantiation, and can be updated during runtime.
Each of these filtering classes can be instantiated and run with any of the DATeS model objects.

Assimilation process classes
A common practice in sequential DA experimental settings is to repeat an assimilation cycle over a given timespan, with similar or different settings at each assimilation window. For example, one may repeat a DA cycle on several time intervals with different output settings, e.g., to save and print results only every fixed number of iterations. Alternatively, the DA process can be repeated over the same time interval with different assimilation settings to test and compare results. We refer to this procedure as an "assimilation process".
The base class from which assimilation process objects are derived is named assimilation_process_base.AssimilationProcess. When instantiating an assimilation process object, the assimilation object, the observations, and the assimilation time instances are passed to the constructor through configuration dictionaries. As a result, the assimilation process object has access to the model and its associated data structures and functionalities through the assimilation object.
The assimilation process object either retrieves real observations, or creates synthetic observations at the specified time instances of the experiment. Figure 2 summarizes DATeS assimilation process functionality.

Utility modules
Utility modules provide additional functionality. For example, the configs module adds functions for reading, writing, validating, and aggregating configuration dictionaries. As another example, in DATeS an ensemble, i.e., a collection of state or observation vectors, is represented by a list of either state or observation vector objects; the utility modules include functions responsible for iterating over ensembles to evaluate ensemble-related quantities of interest, such as the ensemble mean and variance/covariance, and for applying ensemble inflation.
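The ensemble-related utilities described above can be sketched as follows, with an ensemble represented as a list of NumPy state vectors. The function names are illustrative, not DATeS' actual utility functions.

```python
import numpy as np

# Ensemble represented as a list of state vectors, as described in the text.
def ensemble_mean(ensemble):
    return sum(ensemble) / len(ensemble)

def ensemble_covariance(ensemble):
    X = np.stack(ensemble, axis=1)                  # n_state x n_ens
    A = X - X.mean(axis=1, keepdims=True)           # anomalies
    return A @ A.T / (len(ensemble) - 1)

def inflate(ensemble, factor):
    """Multiply anomalies by `factor`, keeping the ensemble mean fixed."""
    m = ensemble_mean(ensemble)
    return [m + factor * (x - m) for x in ensemble]
```

Inflation by a factor r leaves the mean unchanged and scales the sample covariance by r², which is easy to verify on a tiny two-member ensemble.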
The module dates_utility adds functionality to different aspects of DATeS. It gives direct access to functions imported from several utility modules, including: (i) utility_configs: provides functions to handle configuration dictionaries, including aggregating, reading, and writing configuration dictionaries; (ii) utility_stat: provides functions to evaluate statistical quantities such as moments of an ensemble (e.g., a list of model state or observation objects); (iii) utility_machine_learning: provides functions to carry out machine learning algorithms such as fitting a Gaussian mixture model to an ensemble; (iv) utility_data_assimilation: provides functions to carry out general DA tasks such as ensemble inflation, evaluating root-mean-squared errors (RMSE), and evaluating spatial decorrelation coefficients given the distance between two grid points and a localization function.
The utility module also provides other general functions, such as functions for file downloading and for file I/O. For a list of all functions in the utility module, see the User's Manual [5].

Using DATeS
The sequence of steps needed to run a DA experiment in DATeS is summarized as follows: 1. initialize DATeS for a "run," 2. create a model object and the associated error models, 3. create an assimilation object, 4. create an assimilation process object, 5. run the experiment, and visualize the results.
This section is devoted to explaining these steps in the context of a working example that uses the QG-1.5 model [35] and carries out DA using the standard EnKF formulation.

Step 1: initialize DATeS
Initializing a DATeS run involves defining the root directory of DATeS as an environment variable, and adding the paths of DATeS source modules to the system path. This can be done by executing code Snippet 1 in the DATeS root directory:
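A minimal sketch of such an initialization step is given below; the environment-variable name, the src subdirectory, and the helper function are assumptions for illustration, not DATeS' actual Snippet 1.

```python
# Sketch of a DATeS-style initialization step (Snippet 1). The variable
# name DATES_ROOT_PATH, the 'src' layout, and the helper are assumptions.
import os
import sys

def initialize_dates(dates_root):
    """Register the DATeS root and put its source directories on sys.path."""
    os.environ['DATES_ROOT_PATH'] = dates_root
    src_dir = os.path.join(dates_root, 'src')
    if src_dir not in sys.path:
        sys.path.insert(0, src_dir)
```

After this call, DATeS modules placed under the source directory become importable from any working directory.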
Step 2: create a model object

Quasi-geostrophic model. This model is a numerical approximation of the equations

q_t = ψ_x − ε J(ψ, q) − A Δ³ψ + 2π sin(2πy),  q = Δψ − Fψ,  J(ψ, q) ≡ ψ_x q_y − ψ_y q_x,  (9)

where Δ := ∂²/∂x² + ∂²/∂y² and ψ is the surface elevation. The values of the model coefficients in (9) are set as suggested in [35]. To create a QG model object with these specifications, one executes code Snippet 2:

Snippet 2: Create a QG model object

Step 3: create an assimilation object

One now proceeds to create an assimilation object. We consider a deterministic implementation of EnKF (DEnKF), with an ensemble size of 20, and parameters tuned optimally as suggested in [35]. Covariance localization is applied via a Hadamard product [25]. The localization function is Gaspari-Cohn [20] with a localization radius of 12 grid cells. Ensemble inflation is applied to the analysis ensemble of anomalies at the end of each assimilation cycle of DEnKF, with an inflation factor of 1.06. Code Snippet 3 creates a DEnKF filtering object with these settings:

Snippet 3: Create a DEnKF object
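The inflation of the analysis ensemble of anomalies mentioned above amounts to scaling the ensemble spread about its mean by the inflation factor. A minimal NumPy sketch (not the DATeS implementation) is:

```python
import numpy as np

def inflate_ensemble(X, factor):
    """Inflate ensemble anomalies about the ensemble mean:
    x_i <- mean + factor * (x_i - mean).
    X has shape (state_size, ensemble_size)."""
    mean = X.mean(axis=1, keepdims=True)
    return mean + factor * (X - mean)

X = np.array([[1.0, 2.0, 3.0]])      # a 1-variable, 3-member ensemble
Xi = inflate_ensemble(X, 1.06)       # spread widened by 6%, mean unchanged
```

Inflation leaves the ensemble mean intact while widening the spread, which counteracts the variance underestimation typical of small ensembles.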
Most of the methods associated with the DEnKF object will raise exceptions if invoked at this point. This is because several keys in the filter configuration dictionary, such as the observation, the forecast time, the analysis time, and the assimilation time, are not yet assigned. DATeS allows creating assimilation objects without these options in order to maximize flexibility. A convenient approach is to create an assimilation process object that, among other tasks, properly updates the filter configurations between consecutive assimilation cycles.
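The idea of deferring parts of the filter configuration can be sketched with plain dictionaries. Here aggregate_configs is a simplified, hypothetical stand-in for the configuration-handling utilities, and the key names follow the ones mentioned above:

```python
def aggregate_configs(configs, default_configs):
    """Merge a (possibly incomplete) configuration dictionary with defaults;
    non-None entries in `configs` take precedence. A simplified stand-in for
    the configuration-aggregation utilities, not the DATeS API."""
    merged = dict(default_configs)
    merged.update({k: v for k, v in configs.items() if v is not None})
    return merged

# A filter object can be created with assimilation-cycle keys unset...
filter_configs = {"ensemble_size": 20, "inflation_factor": 1.06,
                  "observation": None, "forecast_time": None,
                  "analysis_time": None, "assimilation_time": None}

# ...and a process object fills them in before each cycle:
cycle_settings = {"observation": [0.1, -0.2], "forecast_time": 12.5,
                  "analysis_time": 12.5, "assimilation_time": 12.5}
filter_configs = aggregate_configs(cycle_settings, filter_configs)
```

Until the merge happens, any method that reads the unset keys would fail, which mirrors the behavior of the DEnKF object described above.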

Step 4: create an assimilation process
We now test DEnKF with the QG model by repeating the assimilation cycle over a timespan from 0 to 1250, with an offset of 12.5 time units between two consecutive observation/assimilation times. An initial ensemble is created by the numerical model object, and an experimental timespan is set for observations and assimilation. Here the assimilation time instances da_checkpoints coincide with the observation time instances obs_checkpoints, but in general they can differ, leading to either synchronous or asynchronous assimilation settings. This is implemented in code Snippet 4.

Snippet 4: Create a filter process object to carry out DEnKF filtering using the QG model.
The object experiment is responsible for creating synthetic observations at all time instances defined by obs_checkpoints (except the initial time). To create synthetic observations, the truth at the initial time (0 in this case) is obtained from the model and passed to the filtering process object experiment, which in turn propagates it forward in time to the assimilation time points.
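The generation of synthetic observations can be sketched as follows. The toy propagate function stands in for the QG model integrator, and the observation operator is taken to be the identity with additive Gaussian noise; all names here are illustrative, not the DATeS API:

```python
import numpy as np

rng = np.random.default_rng(0)

def propagate(x, t0, t1):
    """Toy stand-in for a model integrator: damped linear dynamics."""
    return x * np.exp(-0.01 * (t1 - t0))

obs_checkpoints = np.arange(0.0, 1250.001, 12.5)   # 0, 12.5, ..., 1250
obs_noise_std = 0.1

truth = rng.standard_normal(5)          # truth at the initial time t = 0
observations = []
for t_prev, t in zip(obs_checkpoints[:-1], obs_checkpoints[1:]):
    truth = propagate(truth, t_prev, t)             # advance the truth
    y = truth + obs_noise_std * rng.standard_normal(truth.size)
    observations.append(y)                          # synthetic observation at t
```

Note that no observation is generated at the initial checkpoint, matching the behavior described above.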

Experiment results
Finally, the assimilation experiment is executed by running code Snippet 5. The filtering results are printed to screen, and are saved to files at the end of each assimilation cycle, as instructed by the output configuration dictionary output_configs of the FilteringProcess object, i.e. experiment; the options in this dictionary also control the output directory structure. All results are saved in appropriate sub-directories under a main folder named Results in the root directory of DATeS; we will refer to this directory henceforth as the DATeS results directory.
The default behavior of a FilteringProcess object is to create a folder named Filtering Results in the DATeS results directory, and to instruct the filter object to save/output the results every file_output_iter iterations whenever the flag file_output is turned on. Specifically, the DEnKF object creates three directories, named Filter Statistics, Model States Repository, and Observations Repository, respectively. The root mean-squared (RMS) forecast and analysis errors are evaluated at each assimilation cycle, and are written to a file under the Filter Statistics directory. The output configuration of the filter object of the DEnKF class, i.e. denkf_filter, instructs the filter to save all ensemble members (both forecast and analysis) to files by setting the option file_output_moment_only to False. The true state, the analysis ensembles, and the forecast ensembles are all saved under the directory Model States Repository, while the observations are saved under the directory Observations Repository. We note that while here we illustrate the default behavior, the output directories are fully configurable.

Figure 3 shows the reference initial state of the QG model, an example of the observational grid used, and an initial forecast state. The initial forecast state in Figure 3(b) is the average of an initial ensemble collected from a long run of the QG model.
The true field, the forecast errors, and the DEnKF analyses errors at different time instances are shown in Figure 4.
Typical solution quality metrics in the ensemble-based DA literature include RMSE plots and rank (Talagrand) histograms [1,14]. The RMSE plot of the results of this experiment is shown in Figure 5(a). The histogram of the rank statistics of the truth with respect to the ensemble members is shown in Figure 5(b).
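The rank statistics underlying a Talagrand histogram can be computed by ranking the truth against the ensemble members, one state component at a time. A minimal sketch (not the DATeS implementation):

```python
import numpy as np

def rank_of_truth(ensemble, truth):
    """Rank of the truth within an ensemble, computed componentwise:
    for each state component, count the ensemble members below the truth.
    Returns integer ranks in {0, ..., ensemble_size}."""
    ensemble = np.asarray(ensemble)      # shape: (ensemble_size, ndim)
    truth = np.asarray(truth)            # shape: (ndim,)
    return np.sum(ensemble < truth, axis=0)

rng = np.random.default_rng(42)
ens = rng.standard_normal((20, 1000))    # 20 members, 1000 state components
tru = rng.standard_normal(1000)          # truth drawn from the same distribution
ranks = rank_of_truth(ens, tru)
counts = np.bincount(ranks, minlength=21)  # histogram over the 21 possible ranks
# a statistically consistent ensemble yields a roughly flat histogram
```

A U-shaped histogram indicates an under-dispersive ensemble, while a dome shape indicates over-dispersion; flatness is the target.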
Upon termination of a DATeS run, executable files can be cleaned up by calling the function clean_executable_files() from the utility module.

Extending DATeS
DATeS aims at being a collaborative environment, and is designed such that adding DA components to the package is as easy and flexible as possible. This section describes how new implementations of components such as numerical models and assimilation methodologies can be added to DATeS.
The most direct approach is to write the new implementation entirely in Python. This, however, may sacrifice efficiency, or may not be feasible when existing code in other languages needs to be reused. One of the main characteristics of DATeS is the possibility of incorporating code written in low-level languages. There are several strategies for interfacing existing C or Fortran code with DATeS; among the most popular tools are SWIG and F2Py, for interfacing Python code with existing implementations written in C and Fortran, respectively.
Whether the new contribution is written in Python, C, or Fortran, an appropriate Python class that inherits from the corresponding base class, or from a class derived from it, has to be created. The goal is to design new classes that conform to the existing structure of DATeS and can interact appropriately with both new and existing components.

Adding a numerical model class
A new model class has to be created as a subclass of ModelsBase, or a class derived from it. The base class ModelsBase, similar to all base classes in DATeS, contains headers of all the functions that need to be provided by a model class to guarantee that it interacts properly with other components in DATeS.
The first step is to grant the model object access to linear algebra data structures and to error models. The appropriate classes should be imported into a numerical model class:
• Linear algebra: state vector, state matrix, observation vector, and observation matrix;
• Error models: background, model, and observation error models.
This gives the model object access to model-based data structures, and to the error entities necessary for DA applications.

Figure 6: Illustration of a numerical model class named MyModel, and its relations to the linear algebra and error model classes. A dashed arrow refers to an "import" relation, and a solid arrow represents an "inherit" relation.
The next step is to create Python-based implementations of the model functionalities. As shown in Figure 6, the corresponding methods have descriptive names in order to ease the use of DATeS functionality. For example, the method state_vector() creates (or initializes) a state vector data structure. Details of each of the methods in Figure 6 are given in the DATeS User's Manual [5].
As an example, suppose we want to create a model class named MyModel that uses NumPy and SciPy (for sparse matrices) linear algebra data structures. Code Snippet 7 shows the implementation of such a class. Note that, in order to guarantee extensibility of the package, the naming of the methods associated with the linear algebra classes is fixed, and the Python-based linear algebra methods must be implemented even if only binary files are provided. If the model functionality is fully written in Python, the implementation of the methods associated with a model class is straightforward, as illustrated in Figure 6. On the other hand, if a low-level implementation of a numerical model is given, these methods wrap the corresponding low-level implementation.
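To make the structure concrete, the following self-contained sketch mimics such a model class. ModelsBase here is a minimal stand-in for the actual DATeS base class, the method names follow the descriptive naming convention discussed above, and the dynamics are a toy linear model rather than a real numerical model:

```python
import numpy as np

class ModelsBase:
    """Minimal stand-in for the DATeS models' base class: it only fixes
    the names of the methods a model class is expected to provide."""
    def state_vector(self):
        raise NotImplementedError
    def integrate_state(self, state, checkpoints):
        raise NotImplementedError

class MyModel(ModelsBase):
    """Toy model with linear dynamics x' = A x, using NumPy state vectors."""
    def __init__(self, dt=0.01):
        self.state_size = 3
        self.dt = dt
        # skew-symmetric A gives (nearly) energy-preserving rotation dynamics
        self._A = np.array([[0.0, 1.0, 0.0],
                            [-1.0, 0.0, 0.0],
                            [0.0, 0.0, 0.0]])

    def state_vector(self):
        """Create (initialize) a model state vector data structure."""
        return np.zeros(self.state_size)

    def integrate_state(self, state, checkpoints):
        """Propagate `state` over `checkpoints` with forward Euler steps;
        returns the trajectory as a list of states, one per checkpoint."""
        trajectory = [np.array(state, dtype=float)]
        for t0, t1 in zip(checkpoints[:-1], checkpoints[1:]):
            x = trajectory[-1].copy()
            n_steps = max(1, int(round((t1 - t0) / self.dt)))
            h = (t1 - t0) / n_steps
            for _ in range(n_steps):
                x = x + h * (self._A @ x)
            trajectory.append(x)
        return trajectory

model = MyModel()
trajectory = model.integrate_state(np.array([1.0, 0.0, 0.0]),
                                   [0.0, 0.5, 1.0])
```

A filtering object that receives this model can then call state_vector() and integrate_state() without knowing anything about the underlying dynamics, which is the decoupling the base-class interface is designed to enforce.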

Adding an assimilation class
The process of adding a new class for an assimilation methodology is similar to that of creating a class for a numerical model, but it is expected to require less effort. For example, a class implementation of a filtering algorithm uses components and tools provided by the passed model object, and by the encapsulated linear algebra data structures and methods. Moreover, filtering algorithms belonging to the same family, such as the different flavors of the well-known EnKF, are expected to share a considerable amount of infrastructure. Python inheritance enables the reuse of methods and variables from parent classes.
To create a new class for DA filtering one derives it from the base class FiltersBase, imports appropriate routines, and defines the necessary functionalities. Note that each assimilation object has access to a model object, and consequently to the proper linear algebra data structures and associated functionalities through that model.
Unlike the base class for numerical models (ModelsBase), the filtering base class FiltersBase includes actual implementations of several widely used solvers. For example, an implementation of the method FiltersBase.filtering_cycle() is provided to carry out a single filtering cycle by applying a forecast phase followed by an analysis phase (or vice versa, depending on the specified configurations).
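The analysis phase of such a cycle can be sketched with NumPy for the deterministic EnKF (DEnKF) update of [35]. Here denkf_analysis is an illustrative function, not the DATeS implementation, and localization and inflation are omitted for brevity:

```python
import numpy as np

def denkf_analysis(Xf, y, H, R):
    """One DEnKF analysis step (deterministic EnKF of [35]): the ensemble
    mean receives the full Kalman update, the anomalies a halved one.
    Xf: forecast ensemble (ndim, m); y: observation vector;
    H: linear observation operator; R: observation-error covariance."""
    m = Xf.shape[1]
    xf = Xf.mean(axis=1, keepdims=True)            # forecast mean
    A = Xf - xf                                     # forecast anomalies
    Pf = (A @ A.T) / (m - 1)                        # sample covariance
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)  # Kalman gain
    xa = xf + K @ (y.reshape(-1, 1) - H @ xf)       # full update of the mean
    Aa = A - 0.5 * K @ (H @ A)                      # halved update of anomalies
    return xa + Aa                                  # analysis ensemble

# a forecast phase (here: a random forecast ensemble) followed by analysis:
rng = np.random.default_rng(1)
Xf = 1.0 + 0.5 * rng.standard_normal((2, 20))       # 2 variables, 20 members
H = np.eye(2)
R = 0.01 * np.eye(2)
y = np.array([0.0, 0.0])
Xa = denkf_analysis(Xf, y, H, R)
# the analysis mean is pulled from the forecast mean toward the observation
```

The halved anomaly update is what makes DEnKF deterministic: it reduces the ensemble spread consistently with the Kalman filter without perturbing the observations.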

Conclusions
This work describes DATeS, a flexible and highly-extensible package for solving data assimilation problems. DATeS seeks to provide a unified testing suite for data assimilation applications that allows researchers to easily compare different methodologies in different settings with minimal coding effort. The core of DATeS is written in Python. The main functionalities, such as model propagation, filtering, and smoothing code, can however be written in low-level languages such as C or Fortran to attain high levels of computational efficiency. The authors plan to continue developing DATeS with the long-term goal of making it a complete data assimilation testing suite that includes support for variational methods, as well as interfaces with complex models such as quasi-geostrophic general circulation models.

Software Availability
The code is available from the web page http://people.cs.vt.edu/~asandu/Software/DATeS/index.html.