DATeS: a highly extensible data assimilation testing suite v1.0

Attia, Ahmed; Sandu, Adrian

doi:https://doi.org/10.5194/gmd-12-629-2019

Articles | Volume 12, issue 2

Development and technical paper

12 Feb 2019

Abstract

A flexible and highly extensible data assimilation testing suite, named DATeS, is described in this paper. DATeS aims to offer a unified testing environment that allows researchers to compare different data assimilation methodologies and understand their performance in various settings. The core of DATeS is implemented in Python and takes advantage of its object-oriented capabilities. The main components of the package (the numerical models, the data assimilation algorithms, the linear algebra solvers, and the time discretization routines) are independent of each other, which offers great flexibility to configure data assimilation applications. DATeS can interface easily with large third-party numerical models written in Fortran or in C, and with a plethora of external solvers.

Received: 06 Feb 2018 – Discussion started: 22 Mar 2018 – Revised: 29 Oct 2018 – Accepted: 07 Dec 2018 – Published: 12 Feb 2019

1 Introduction

Data assimilation (DA) refers to the fusion of information from different sources, including priors, predictions of a numerical model, and snapshots of reality, in order to produce an accurate description of the state of a physical system of interest (Daley, 1993; Kalnay, 2003). DA research is of increasing interest for a wide range of fields including geoscience, numerical weather forecasts, atmospheric composition predictions, oil reservoir simulations, and hydrology. Two approaches have gained wide popularity for solving DA problems, namely ensemble and variational approaches. The ensemble approach is rooted in statistical estimation theory and uses an ensemble of states to represent the underlying probability distributions. The variational approach, rooted in control theory, involves solving an optimization problem to obtain a single “analysis” as an estimate of the true state of the system of concern. The variational approach does not provide an inherent description of the uncertainty associated with the obtained analysis; however, it is less sensitive to physical imbalances prevalent in the ensemble approach. Hybrid methodologies designed to harness the best of the two worlds are an ongoing research topic.

Numerical experiments are an essential ingredient in the development of new DA algorithms. Implementation of numerical experiments for DA involves linear algebra routines, a numerical model along with time integration routines, and an assimilation algorithm. Currently available testing environments for DA applications are either very simplistic or very general; many are tied to specific models and are usually completely written in a specific language. A researcher who wants to test a new algorithm with different numerical models written in different languages might have to re-implement his/her algorithm using the specific settings of each model. A unified testing environment for DA is important to enable researchers to explore different aspects of various filtering and smoothing algorithms with minimal coding effort.

The DA Research Section (DAReS) at the National Center for Atmospheric Research (NCAR) provides the Data Assimilation Research Testbed (DART) (Anderson et al., 2009) as a community facility for ensemble filtering. The DART platform is currently the gold standard for ensemble-based Kalman filtering algorithm implementations. It is widely used in both research and operational settings, and interfaces to most important geophysical numerical models are available. DART employs a modular programming approach and adheres strictly to solid software engineering principles. DART has a long history and is continuously maintained; new ensemble-based Kalman filtering algorithms that appear in the literature are routinely added to its library. Moreover, it gives access to practical and well-established parallel algorithms. DART is, by design, very general in order to support operational settings with many types of geophysical models. Using DART therefore requires a non-trivial learning overhead. The fact that DART is mainly written in Fortran makes it a very efficient testing platform; however, this limits to some extent the ability to easily employ third-party implementations of various components.

Matlab programs are often used to test new algorithmic ideas because they are easy to implement. A popular set of Matlab tools for ensemble-based DA algorithms is provided by the Nansen Environmental and Remote Sensing Center (NERSC), with the code available from Evensen and Sakov (2009). A Matlab toolbox for uncertainty quantification (UQ) is UQLab (Marelli and Sudret, 2014). Also, for newcomers to the DA field, a concise set of Matlab codes is provided through the pedagogical applied mathematics reference (Law et al., 2015). Matlab is generally a very useful environment for small- to medium-scale numerical experiments.

Python is a modern high-level programming language that gives the power of reusing existing pieces of code via inheritance, and thus its code is highly extensible. Moreover, it is a powerful scripting tool for scientific applications that can be used to glue legacy codes. This can be achieved by writing wrappers that can act as interfaces. Building wrappers around existing C and Fortran code is a common practice in scientific research. Several automatic wrapper generation tools, such as SWIG (Beazley, 1996) and F2PY (Peterson, 2009), are available to create proper interfaces between Python and lower-level languages. Although translating Matlab code to Python is a relatively easy task, one can also call Matlab functions directly from Python using the Matlab Engine API. Moreover, unlike Matlab, Python is freely available on virtually all Linux, macOS, and Windows platforms, and therefore Python software is easily accessible and has excellent portability. When using Python, instead of Fortran or C, one generally trades some computational performance for programming productivity. The performance penalty in scientific calculations is minimized by delegating computationally intensive tasks to compiled languages such as Fortran. This approach is followed by the scientific computing Python modules NumPy and SciPy, which enable writing computationally efficient scientific Python code. Moreover, Python is one of the easiest programming languages to learn, even without background knowledge about programming.

This paper presents a highly extensible Python-based DA testing suite. The package is named DATeS and is intended to be an open-source, extendable package positioned between the simple typical research-grade implementations and the professional implementation of DART but with the capability to utilize large physical models. Researchers can use it as an experimental testing pad where they can focus on coding only their new ideas without worrying much about the other pieces of the DA process. Moreover, DATeS can be effectively used for educational purposes where students can use it as an interactive learning tool for DA applications. The code developed by a researcher in the DATeS framework should fit with all other pieces in the package with minimal to no effort, as long as the programmer follows the “flexible” rules of DATeS. As an initial illustration of its capabilities, DATeS has been used to implement and carry out the numerical experiments in Attia et al. (2018), Moosavi et al. (2018), and Attia and Constantinescu (2018).

The paper is structured as follows. Section 2 reviews the DA problem and the most widely used approaches to solve it. Section 3 describes the architecture of the DATeS package. Section 4 takes a user-centric and example-based approach for explaining how to work with DATeS, and Sect. 5 demonstrates the main guidelines of contributing to DATeS. Conclusions and future development directions are discussed in Sect. 6.

2 Data assimilation

This section gives a brief overview of the basic discrete-time formulations of both statistical and variational DA approaches. The formulation here is far from comprehensive and is intended only as a quick review. For detailed discussions on the various DA mathematical formulations and algorithms, see, e.g., Asch et al. (2016), Evensen (2009), and Law et al. (2015).

The main goal of a DA algorithm is to give an accurate representation of the “unknown” true state, x^true(t_k), of a physical system, at a specific time instant t_k. Assuming $x_{k} \in \mathbb{R}^{N_{\mathrm{state}}}$ is a discretized approximation of x^true(t_k), the time evolution of the physical system over the time interval $[t_{k}, t_{k + 1}]$ is approximated by the discretized forward model:

$$x_{k+1} = \mathcal{M}_{k,k+1}(x_{k}), \quad k = 0, 1, \dots, N - 1. \tag{1}$$

The model-based simulations, represented by the model states, are inaccurate and must be corrected given noisy measurements Y of the physical system. Since the model state and observations are both contaminated with errors, a probabilistic formulation is generally followed. The prior distribution 𝒫^b(x_k) encapsulates the knowledge about the model state at time instant t_k before additional information is incorporated. The likelihood function 𝒫(Y|x_k) quantifies the deviation of the prediction of model observations from the collected measurements. The corrected knowledge about the system is described by the posterior distribution formulated by applying Bayes' theorem:

$$\mathcal{P}^{a}(x_{k} | Y) = \frac{\mathcal{P}^{b}(x_{k})\, \mathcal{P}(Y | x_{k})}{\mathcal{P}(Y)} \propto \mathcal{P}^{b}(x_{k})\, \mathcal{P}(Y | x_{k}), \tag{2}$$

where Y refers to the data (observations) to be assimilated. In the sequential filtering context, Y is a single observation, while in the smoothing context, it generally stands for several observations $\{y_{1}, \dots, y_{m}\}$ to be assimilated simultaneously.

In the so-called “Gaussian framework”, the prior is assumed to be Gaussian, $\mathcal{N}(x_{k}^{b}, \mathbf{B}_{k})$, where $x_{k}^{b}$ is a prior state, e.g., a model-based forecast, and $\mathbf{B}_{k} \in \mathbb{R}^{N_{\mathrm{state}} \times N_{\mathrm{state}}}$ is the prior covariance matrix. Moreover, the observation errors are assumed to be Gaussian, $\mathcal{N}(0, \mathbf{R}_{k})$, with $\mathbf{R}_{k} \in \mathbb{R}^{N_{\mathrm{obs}} \times N_{\mathrm{obs}}}$ being the observation error covariance matrix at time instant t_k, and observation errors are assumed to be uncorrelated with background errors. In practical applications, the dimension of the observation space is typically much smaller than the state-space dimension, that is, N_obs ≪ N_state.

When the information available about the system state at time instant t_k is assimilated, the posterior distribution follows from Eq. (2) as

$$\begin{aligned} \mathcal{P}^{a}(x_{k} | y_{k}) &\propto \mathcal{P}^{b}(x_{k})\, \mathcal{P}(y_{k} | x_{k}) \propto \exp(-\mathcal{J}(x_{k})), \\ \mathcal{J}(x_{k}) &= \frac{1}{2} \lVert x_{k} - x_{k}^{b} \rVert_{\mathbf{B}_{k}^{-1}}^{2} + \frac{1}{2} \lVert y_{k} - \mathcal{H}_{k}(x_{k}) \rVert_{\mathbf{R}_{k}^{-1}}^{2}, \end{aligned} \tag{3}$$

where the scaling factor 𝒫(y_k) is dropped. Here, ℋ_k is an observation operator that maps a model state x_k into the observation space.
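As a worked scalar illustration of Eq. (3) (an added example, not part of the original derivation), assume a single state variable that is observed directly, ℋ(x) = x, with prior variance σ_b² and observation error variance σ_o². Completing the square in the exponent shows that the posterior is Gaussian with mean x^a and variance σ_a²:

```latex
\begin{aligned}
\mathcal{J}(x) &= \frac{(x - x^{b})^{2}}{2\sigma_b^{2}} + \frac{(y - x)^{2}}{2\sigma_o^{2}}, \\
x^{a} &= x^{b} + \frac{\sigma_b^{2}}{\sigma_b^{2} + \sigma_o^{2}}\,(y - x^{b}), \\
\sigma_a^{2} &= \frac{\sigma_b^{2}\,\sigma_o^{2}}{\sigma_b^{2} + \sigma_o^{2}} .
\end{aligned}
```

For instance, x^b = 0, y = 2, and σ_b² = σ_o² = 1 give x^a = 1 and σ_a² = 1/2: the analysis lies halfway between the forecast and the observation, and its variance is smaller than either input variance.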

Applying Eq. (2) or (3) directly in large-scale settings, even under the simplified Gaussian assumption, is not computationally feasible. In practice, a Monte Carlo approach is usually followed. Specifically, ensemble-based sequential filtering methods such as the ensemble Kalman filter (EnKF) (Tippett et al., 2003; Whitaker and Hamill, 2002; Burgers et al., 1998; Houtekamer and Mitchell, 1998; Zupanski et al., 2008; Sakov et al., 2012; Evensen, 2003; Hamill and Whitaker, 2001; Evensen, 1994; Houtekamer and Mitchell, 2001; Smith, 2007) and the maximum likelihood ensemble filter (MLEF) (Zupanski, 2005) use ensembles of states to represent the prior and the posterior distributions. A prior ensemble $X_{k} = \{x(e)\}_{e = 1, 2, \dots, N_{\mathrm{ens}}}$, approximating the prior distribution, is obtained by propagating analysis states from a previous assimilation cycle at time $t_{k-1}$ by applying Eq. (1). Most of the ensemble-based DA methodologies work by transforming the prior ensemble into an ensemble of states collected from the posterior distribution, namely the analysis ensemble. The transformation in the EnKF framework is applied following the update equations of the well-known Kalman filter (Kalman and Bucy, 1961; Kalman, 1960). An estimate of the true state of the system, i.e., the analysis, is obtained by averaging the analysis ensemble, while the posterior covariance is approximated by the covariance matrix of the analysis ensemble.
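The ensemble transformation described above can be sketched with NumPy. This is a minimal illustration of one EnKF flavor (the perturbed-observation variant with a linear observation operator), not the DATeS implementation; the function name `enkf_analysis`, the array layout, and the toy numbers are assumptions made for this example.

```python
import numpy as np


def enkf_analysis(ensemble, y, H, R, rng):
    """Perturbed-observation EnKF analysis step (illustrative sketch).

    ensemble : (n_state, n_ens) prior ensemble, one member per column
    y        : (n_obs,) observation vector
    H        : (n_obs, n_state) linear observation operator (assumed linear here)
    R        : (n_obs, n_obs) observation error covariance
    """
    n_ens = ensemble.shape[1]
    # Scaled anomalies A, so that the ensemble covariance is P = A @ A.T
    anomalies = (ensemble - ensemble.mean(axis=1, keepdims=True)) / np.sqrt(n_ens - 1)
    obs_anomalies = H @ anomalies
    # Innovation covariance S = H P H^T + R
    S = obs_anomalies @ obs_anomalies.T + R
    # Kalman gain K = P H^T S^{-1}, using a linear solve instead of an explicit inverse
    gain = np.linalg.solve(S, obs_anomalies @ anomalies.T).T
    # Each member assimilates its own perturbed copy of the observation
    noise = rng.multivariate_normal(np.zeros(y.size), R, size=n_ens).T
    return ensemble + gain @ (y[:, None] + noise - H @ ensemble)


rng = np.random.default_rng(0)
prior = rng.standard_normal((3, 200))      # 3 state variables, 200 members
H = np.array([[1.0, 0.0, 0.0]])            # only the first variable is observed
R = np.array([[0.01]])
y = np.array([1.0])

posterior = enkf_analysis(prior, y, H, R, rng)
analysis = posterior.mean(axis=1)          # analysis: mean of the analysis ensemble
analysis_cov = np.cov(posterior)           # ensemble approximation of the posterior covariance
```

The analysis and its covariance are read off the analysis ensemble exactly as the text describes: the ensemble mean is the state estimate and the ensemble covariance approximates the posterior covariance.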

The maximum a posteriori (MAP) estimate of the true state is the state that maximizes the posterior probability density function (PDF). Alternatively, the MAP estimate is the minimizer of the negative logarithm (negative log) of the posterior PDF. The MAP estimate can be obtained by solving the following optimization problem:

$$\min_{x_{k}} \mathcal{J}(x_{k}) = \frac{1}{2} \lVert x_{k} - x_{k}^{b} \rVert_{\mathbf{B}_{k}^{-1}}^{2} + \frac{1}{2} \lVert y_{k} - \mathcal{H}_{k}(x_{k}) \rVert_{\mathbf{R}_{k}^{-1}}^{2}. \tag{4}$$

This formulates the three-dimensional variational (3D-Var) DA problem. Derivative-based optimization algorithms used to solve Eq. (4) require the gradient of the negative log of the posterior PDF, i.e., of the objective in Eq. (4):

$$\nabla_{x_{k}} \mathcal{J}(x_{k}) = \mathbf{B}_{k}^{-1} (x_{k} - x_{k}^{b}) + \mathbf{H}_{k}^{T} \mathbf{R}_{k}^{-1} (\mathcal{H}_{k}(x_{k}) - y_{k}), \tag{5}$$

where $\mathbf{H}_{k} = \partial \mathcal{H}_{k} / \partial x_{k}$ is the sensitivity (i.e., the Jacobian) of the observation operator ℋ_k evaluated at x_k. Unlike ensemble filtering algorithms, the optimal solution of Eq. (4) provides a single estimate of the true state and does not provide a direct estimate of the associated uncertainty.
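A small NumPy sketch of the 3D-Var objective and its gradient may help make the formulas concrete. It assumes a linear observation operator, uses the factor 1/2 on both terms as in Eq. (3), and writes the observation term of the gradient with the residual ℋ(x) − y, which is the sign obtained by differentiating the objective and the one used in Eq. (8). All function names and numbers are illustrative assumptions, not DATeS code.

```python
import numpy as np


def cost(x, xb, B_inv, y, H, R_inv):
    """3D-Var objective: background term plus observation term."""
    dx = x - xb
    dy = y - H @ x
    return 0.5 * dx @ B_inv @ dx + 0.5 * dy @ R_inv @ dy


def gradient(x, xb, B_inv, y, H, R_inv):
    """Gradient of the objective with respect to x (linear H)."""
    return B_inv @ (x - xb) + H.T @ R_inv @ (H @ x - y)


def analysis(xb, B_inv, y, H, R_inv):
    """Minimizer in the linear case, found by setting the gradient to zero:
    (B^-1 + H^T R^-1 H) x = B^-1 xb + H^T R^-1 y."""
    lhs = B_inv + H.T @ R_inv @ H
    rhs = B_inv @ xb + H.T @ R_inv @ y
    return np.linalg.solve(lhs, rhs)


xb = np.array([0.0, 0.0])        # background state
B_inv = np.eye(2)                # inverse background covariance
H = np.array([[1.0, 0.0]])       # observe the first component only
R_inv = np.array([[1.0]])        # inverse observation error covariance
y = np.array([2.0])

xa = analysis(xb, B_inv, y, H, R_inv)
```

With equal background and observation variances, the observed component of `xa` lands halfway between the background (0) and the observation (2), while the unobserved, uncorrelated component stays at its background value. In nonlinear settings the closed-form solve is not available and `cost` and `gradient` would instead be handed to an iterative optimizer.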

Assimilating several observations $Y = \{y_{0}, y_{1}, \dots, y_{m}\}$ simultaneously requires adding time as a fourth dimension to the DA problem. Let 𝒫^b(x₀) be the prior distribution of the system state at the beginning of a time window [t₀, t_F] over which the observations are distributed. Assuming the observations' errors are temporally uncorrelated, the posterior distribution of the system state at the initial time of the assimilation window t₀ follows by applying Eq. (2) as

$$\begin{aligned} \mathcal{P}^{a}(x_{0}) &\propto \mathcal{P}^{b}(x_{0})\, \mathcal{P}(y_{0}, y_{1}, \dots, y_{m} | x_{0}) \propto \exp(-\mathcal{J}(x_{0})), \\ \mathcal{J}(x_{0}) &= \frac{1}{2} \lVert x_{0} - x_{0}^{b} \rVert_{\mathbf{B}_{0}^{-1}}^{2} + \frac{1}{2} \sum_{k = 0}^{m} \lVert y_{k} - \mathcal{H}_{k}(x_{k}) \rVert_{\mathbf{R}_{k}^{-1}}^{2}. \end{aligned} \tag{6}$$

In the statistical approach, ensemble-based smoothers such as the ensemble Kalman smoother (EnKS) are used to approximate the posterior (Eq. 6) based on an ensemble of states. Similar to the ensemble filters, the analysis ensemble generated by a smoothing algorithm can be used to provide an estimate of the posterior first-order moment. It can also be used to provide a flow-dependent ensemble covariance matrix that approximates the true posterior second-order moment.

The MAP estimate of the true state at the initial time of the assimilation window can be obtained by solving the following optimization problem:

$$\min_{x_{0}} \mathcal{J}(x_{0}) = \frac{1}{2} \lVert x_{0} - x_{0}^{b} \rVert_{\mathbf{B}_{0}^{-1}}^{2} + \frac{1}{2} \sum_{k = 0}^{m} \lVert y_{k} - \mathcal{H}_{k}(x_{k}) \rVert_{\mathbf{R}_{k}^{-1}}^{2}. \tag{7}$$

This is the standard formulation of the four-dimensional variational (4D-Var) DA problem. The solution of the 4D-Var problem is equivalent to the MAP of the smoothing posterior in the Gaussian framework. The gradient of Eq. (7) with respect to the model state at the initial time of the assimilation window reads

$$\begin{aligned} \nabla_{x_{0}} \mathcal{J}(x_{0}) = {} & \mathbf{B}_{0}^{-1} (x_{0} - x_{0}^{b}) \\ & + \sum_{k = 0}^{m} \mathbf{M}_{0,k}^{T} \mathbf{H}_{k}^{T} \mathbf{R}_{k}^{-1} (\mathcal{H}_{k}(x_{k}) - y_{k}), \end{aligned} \tag{8}$$

where $\mathbf{M}_{0,k}^{T}$ is the adjoint of the tangent linear model operator, and $\mathbf{H}_{k}^{T}$ is the adjoint of the observation operator sensitivity. Similar to the 3D-Var case (Eq. 4), the solution of Eq. (7) provides a single best estimate (the analysis) of the system state without providing a consistent description of the uncertainty associated with this estimate. The variational problem (Eq. 7) is referred to as the strong-constraint formulation, where a perfect-model approach is considered. In the presence of model errors, an additional term characterizing state deviations is added to the variational objectives (Eqs. 4, 7), resulting in a weak-constraint formulation. A general practice is to assume that the model errors follow a Gaussian distribution $\mathcal{N}(0, \mathbf{Q}_{k})$, with $\mathbf{Q}_{k} \in \mathbb{R}^{N_{\mathrm{state}} \times N_{\mathrm{state}}}$ being the model error covariance matrix at time instant t_k. The model error term depends on the approach taken to solve the weak-constraint problem, and usually involves the model error probability distribution.
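The role of the adjoint in Eq. (8) can be illustrated with a toy strong-constraint 4D-Var problem. The sketch below assumes a linear, time-invariant model x_{k+1} = M x_k and a linear observation operator, so that the tangent linear model is M itself and its adjoint is the transpose; the function name and all numbers are assumptions for illustration only. The gradient from the backward (adjoint) sweep is compared against central finite differences of the cost.

```python
import numpy as np


def fourdvar_cost_and_gradient(x0, xb, B_inv, M, H, R_inv, observations):
    """Cost (Eq. 7) and gradient (Eq. 8) for a linear time-invariant model."""
    # Forward sweep: x_k = M^k x_0 for k = 0, ..., m
    states = [x0]
    for _ in range(len(observations) - 1):
        states.append(M @ states[-1])

    dx = x0 - xb
    total = 0.5 * dx @ B_inv @ dx
    residuals = []
    for xk, yk in zip(states, observations):
        r = H @ xk - yk
        total += 0.5 * r @ R_inv @ r
        residuals.append(r)

    # Backward (adjoint) sweep: accumulates sum_k (M^T)^k H^T R^-1 r_k
    adjoint = np.zeros_like(x0)
    for r in reversed(residuals):
        adjoint = M.T @ adjoint + H.T @ R_inv @ r
    return total, B_inv @ dx + adjoint


theta = 0.3
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])      # toy rotation model
H = np.array([[1.0, 0.0]])
B_inv = np.eye(2)
R_inv = np.array([[4.0]])
xb = np.array([1.0, 0.0])
observations = [np.array([0.9]), np.array([1.1]), np.array([0.7])]
x0 = np.array([0.5, -0.2])

cost0, grad = fourdvar_cost_and_gradient(x0, xb, B_inv, M, H, R_inv, observations)

# Central finite differences as an independent check of the adjoint gradient
eps = 1e-6
fd_grad = np.zeros_like(x0)
for i in range(x0.size):
    step = np.zeros_like(x0)
    step[i] = eps
    plus, _ = fourdvar_cost_and_gradient(x0 + step, xb, B_inv, M, H, R_inv, observations)
    minus, _ = fourdvar_cost_and_gradient(x0 - step, xb, B_inv, M, H, R_inv, observations)
    fd_grad[i] = (plus - minus) / (2.0 * eps)
```

The backward sweep costs one adjoint model application per time step regardless of the state dimension, whereas the finite-difference check needs two forward runs per state variable; this is why adjoint models matter for large-scale 4D-Var.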

In idealized settings, where the model is linear, the observation operator is linear, and the underlying probability distributions are Gaussian, the posterior is also Gaussian; however, this is rarely the case in real applications. In nonlinear or non-Gaussian settings, the ultimate objective of a DA algorithm is to sample all probability modes of the posterior distribution, rather than just producing a single estimate of the true state. Algorithms capable of accommodating non-Gaussianity remain limited and have not yet been successfully tested in large-scale settings.

Particle filters (PFs) (Doucet et al., 2001; Gordon et al., 1993; Kitagawa, 1996; Van Leeuwen, 2009) are an attractive family of nonlinear and non-Gaussian methods. This family of filters is known to suffer from filtering degeneracy, especially in large-scale systems. Despite the fact that PFs do not impose restrictive assumptions on the shape of the underlying probability distribution functions, they are not generally considered to be efficient without expensive tuning. While particle filtering algorithms have not yet been used operationally, their potential applicability for high-dimensional problems is illustrated, for example, by Rebeschini and Van Handel (2015), Poterjoy (2016), Llopis et al. (2018), Beskos et al. (2017), Potthast et al. (2018), Ades and van Leeuwen (2015), and Vetra-Carvalho et al. (2018). Another approach for non-Gaussian DA is to employ a Markov chain Monte Carlo (MCMC) algorithm to directly sample the probability modes of the posterior distribution. This, however, requires an accurate representation of the prior distribution, which is generally intractable in this context. Moreover, following a relaxed, e.g., Gaussian, prior assumption in nonlinear settings might be restrictive when a DA procedure is applied sequentially over more than one assimilation window. This is mainly due to the fact that the prior distribution is a nonlinear transformation of the posterior of a previous assimilation cycle. Recently, an MCMC family of fully non-Gaussian DA algorithms that works by sampling the posterior was developed in Attia and Sandu (2015), Attia et al. (2015, 2017a, b, 2018), and Attia (2016). This family follows a Hamiltonian Monte Carlo (HMC) approach for sampling the posterior; however, the HMC sampling scheme can be easily replaced with other algorithms suitable for sampling complicated, and potentially multimodal, probability distributions in high-dimensional state spaces. Relaxing the Gaussian prior assumption is addressed in Attia et al. (2018), where an accurate representation of the prior is constructed by fitting a Gaussian mixture model (GMM) to the forecast ensemble.
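The filtering degeneracy mentioned above can be made visible with a very small bootstrap particle filter update. The sketch assumes a scalar state that is observed directly with Gaussian observation noise; the function name `bootstrap_update` and the numbers are illustrative assumptions. The effective sample size (ESS) summarizes how many particles carry non-negligible weight.

```python
import numpy as np


def bootstrap_update(particles, y, obs_std, rng):
    """Weight particles by the Gaussian likelihood of y, then resample."""
    # Log-likelihood of the observation given each particle (up to a constant)
    log_w = -0.5 * ((y - particles) / obs_std) ** 2
    # Subtract the maximum before exponentiating for numerical stability
    weights = np.exp(log_w - log_w.max())
    weights /= weights.sum()
    # Effective sample size: N for uniform weights, 1 for full degeneracy
    ess = 1.0 / np.sum(weights ** 2)
    # Multinomial resampling proportional to the weights
    idx = rng.choice(particles.size, size=particles.size, p=weights)
    return particles[idx], weights, ess


rng = np.random.default_rng(1)
particles = rng.normal(0.0, 1.0, size=500)      # samples from the prior

resampled, weights, ess = bootstrap_update(particles, y=1.0, obs_std=0.5, rng=rng)
# A much more accurate observation concentrates the weight on few particles
_, _, ess_sharp = bootstrap_update(particles, y=1.0, obs_std=0.05, rng=rng)
```

Comparing `ess` with `ess_sharp` shows the mechanism behind degeneracy: the sharper the likelihood relative to the prior spread, the fewer particles survive resampling, and the effect worsens quickly as the number of independent observations grows.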

DATeS provides standard implementations of several flavors of the algorithms mentioned here. One can easily explore, test, or modify the provided implementations in DATeS, and add more methodologies. As discussed later, one can use existing components of DATeS, such as the implemented numerical models, or add new implementations to be used by other components of DATeS. However, it is worth mentioning that the initial version of DATeS (v1.0) is not meant to provide implementations of all state-of-the-art DA algorithms; see, e.g., Vetra-Carvalho et al. (2018). DATeS, however, provides an initial seed with example implementations, which can be discussed and enhanced by the ever-growing community of DA researchers and experts. In the next section, we provide a brief technical summary of the main components of DATeS v1.0.

3 DATeS implementation

DATeS seeks to capture, in an abstract form, the common elements shared by most DA applications and solution methodologies. For example, the majority of the ensemble filtering methodologies share nearly all the steps of the forecast phase, and a considerable portion of the analysis step. Moreover, all the DA applications involve common essential components such as linear algebra routines, model discretization schemes, and analysis algorithms.

Existing DA solvers have been implemented in different languages. For example, high-performance languages such as Fortran and C have been (and are still being) extensively used to develop numerically efficient model implementations and linear algebra routines. Both Fortran and C allow for efficient parallelization because these two languages are supported by common libraries designed for distributed memory systems such as MPI and shared memory libraries such as Pthreads and OpenMP. To make use of these available resources and implementations, one has to either rewrite all the different pieces in the same programming language or have proper interfaces between the different new and existing implementations.

The philosophy behind the design of DATeS is that “a unified DA testing suite has to be open-source, easy to learn, and able to reuse and extend available code with minimal effort”. Such a suite should allow for easy interfacing with external third-party code written in various languages, e.g., linear algebra routines written in Fortran, analysis routines written in Matlab, or “forecast” models written in C. This should help the researchers to focus their energy on implementing and testing their own analysis algorithms. The next section details several key aspects of the DATeS implementation.

3.1 DATeS architecture

The DATeS architecture abstracts, and provides a set of modules of, the four generic components of any DA system. These components are the linear algebra routines, a forecast computer model that includes the discretization of the physical processes, error models, and analysis methodologies. In what follows, we discuss each of these building blocks in more detail, in the context of DATeS. We start with an abstract discussion of each of these components, followed by technical descriptions.

3.1.1 Linear algebra routines

The linear algebra routines are responsible for handling the data structures representing essential entities such as model state vectors, observation vectors, and covariance matrices. This includes manipulating an instance of the corresponding data. For example, a model state vector should provide methods for accessing/slicing and updating entries of the state vector, a method for adding two state vector instances, and methods for applying specific scalar operations on all entries of the state vector such as evaluating the square root or the logarithm.
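To make the list of responsibilities concrete, the following is a minimal sketch of what such a state vector abstraction could look like when backed by NumPy. The class name `SimpleStateVector` and its method names are hypothetical and chosen for this illustration; they are not the DATeS linear algebra API.

```python
import numpy as np


class SimpleStateVector:
    """Hypothetical NumPy-backed state vector (illustration only)."""

    def __init__(self, values):
        # np.array copies its input, so the vector owns its data
        self._data = np.array(values, dtype=float)

    def __len__(self):
        return self._data.size

    def __getitem__(self, index):
        # Supports both single entries and slices
        return self._data[index]

    def __setitem__(self, index, value):
        self._data[index] = value

    def copy(self):
        return SimpleStateVector(self._data)

    def add(self, other, in_place=False):
        """Entry-wise sum with another state vector."""
        target = self if in_place else self.copy()
        target._data += other._data
        return target

    def sqrt(self, in_place=False):
        """Entry-wise square root."""
        target = self if in_place else self.copy()
        np.sqrt(target._data, out=target._data)
        return target


x = SimpleStateVector([1.0, 4.0, 9.0])
root = x.sqrt()            # new vector; x is left unchanged
total = x.add(root)        # entry-wise sum of x and root
```

Hiding the underlying array behind a small interface like this is what lets the assimilation code stay the same when the storage changes, for example to a wrapper around a Fortran or distributed data structure.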

3.1.2 Forecast model

The forecast computer model simulates a physical phenomenon of interest such as the atmosphere, ocean dynamics, or volcanoes. This typically involves approximating the physical phenomenon using a gridded computer model. The implementation should provide methods for creating and manipulating state vectors and state-size matrices. The computer model should also provide methods for creating and manipulating observation vectors and observation-size matrices. The observation operator responsible for mapping state-size vectors into observation-size vectors should be part of the model implementation as well. Moreover, simulating the evolution of the computer model in time is carried out using numerical time integration schemes. The time integration scheme can be model-specific and is usually written in a high-performance language for efficiency.

3.1.3 Error models

It is common in DA applications to assume a perfect forecast model, a case where the model is deterministic rather than stochastic. However, the background and observation errors need to be treated explicitly, as they are essential in the formulation of nearly all DA methodologies. We refer to the DATeS entity responsible for managing and creating random vectors, sampled from a specific probability distribution function, as the “error model”. For example, a Gaussian error model would be completely set up by providing the first- and second-order moments of the probability distribution it represents.
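A Gaussian error model in this sense needs only a mean vector and a covariance matrix. The sketch below shows one way such an object could be written with NumPy; the class name `GaussianErrorModel`, its constructor arguments, and the `sample` method are assumptions for illustration and do not reproduce the DATeS error model classes.

```python
import numpy as np


class GaussianErrorModel:
    """Hypothetical error model defined by its first two moments."""

    def __init__(self, mean, covariance, seed=None):
        self.mean = np.asarray(mean, dtype=float)
        self.covariance = np.asarray(covariance, dtype=float)
        # Cholesky factor L with covariance = L @ L.T (requires positive definiteness)
        self._chol = np.linalg.cholesky(self.covariance)
        self._rng = np.random.default_rng(seed)

    def sample(self, size=1):
        """Draw `size` random vectors, one per row."""
        z = self._rng.standard_normal((size, self.mean.size))
        # Rows of z @ L.T have covariance L @ L.T
        return self.mean + z @ self._chol.T


obs_cov = np.array([[1.0, 0.5],
                    [0.5, 2.0]])
observation_errors = GaussianErrorModel(mean=np.zeros(2), covariance=obs_cov, seed=0)

noise = observation_errors.sample(size=20000)
sample_cov = np.cov(noise, rowvar=False)     # should be close to obs_cov
```

The same object can serve as a background error model (to perturb an initial ensemble) or as an observation error model (to create synthetic observations), which is why the sampling logic is kept separate from both the model and the filter.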

3.1.4 Analysis algorithms

Analysis algorithms manipulate model states and observations by applying widely used mathematical operations to perform inference operations. The popular DA algorithms can be classified into filtering and smoothing categories. An assimilation algorithm, a filter or a smoother, is implemented to carry out a single DA cycle. For example, in the filtering framework, an assimilation cycle refers to assimilating data at a single observation time by applying a forecast and an analysis step. On the other hand, in the smoothing context, several observations available at discrete time instants within an assimilation window are processed simultaneously in order to update the model state at a given time over that window; a smoother is designed to carry out the assimilation procedure over a single assimilation window. For example, EnKF and 3D-Var fall in the former category, while EnKS and 4D-Var fall in the latter.

3.1.5 Assimilation experiments

In typical numerical experiments, a DA solver is applied for several consecutive cycles to assess its long-term performance. We refer to the procedure of applying the solver to several assimilation cycles as the “assimilation process”. The assimilation process involves carrying out the forecast and analysis cycles repeatedly, creating synthetic observations or retrieving real observations, updating the reference solution when available, and saving experimental results between consecutive assimilation cycles.
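The steps of an assimilation process can be sketched in plain Python for the simplest possible setting: a scalar linear model with a scalar Kalman filter as the DA solver and synthetic observations generated from a reference solution. The function `run_assimilation_process` and its parameters are hypothetical names for this illustration, not part of DATeS.

```python
import random


def run_assimilation_process(n_cycles, a=0.98, obs_var=0.25, seed=0):
    """Repeat forecast/analysis cycles and collect per-cycle results."""
    rng = random.Random(seed)
    truth = 1.0                   # reference solution
    state, variance = 0.0, 1.0    # initial analysis mean and variance
    results = []
    for cycle in range(n_cycles):
        # Forecast step: propagate the reference and the estimate (perfect model)
        truth = a * truth
        forecast = a * state
        forecast_var = a * a * variance
        # Synthetic observation: reference solution plus Gaussian noise
        y = truth + rng.gauss(0.0, obs_var ** 0.5)
        # Analysis step: scalar Kalman filter update
        gain = forecast_var / (forecast_var + obs_var)
        state = forecast + gain * (y - forecast)
        variance = (1.0 - gain) * forecast_var
        # Save the experimental results of this cycle
        results.append({
            "cycle": cycle,
            "forecast_var": forecast_var,
            "analysis_var": variance,
            "analysis_error": abs(state - truth),
        })
    return results


results = run_assimilation_process(n_cycles=20)
```

Keeping this loop separate from the solver means the same process driver can be reused unchanged when the scalar Kalman update is replaced by any other filter or smoother.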

3.1.6 DATeS layout

The design of DATeS takes into account the distinction between these components and separates them in design following an object-oriented programming (OOP) approach. A general description of DATeS architecture is given in Fig. 1.

The enumeration in Fig. 1 (numbers from 1 to 4 in circles) indicates the order in which essential DATeS objects should be created. Specifically, one starts with an instance of a model. Once a model object is created, an assimilation object is instantiated, and the model object is passed to it. An assimilation process object is then instantiated, with a reference to the assimilation object passed to it. The assimilation process object iterates the consecutive assimilation cycles and saves and/or outputs the results which can be optionally analyzed later using visualization modules.

All DATeS components are independent so as to maximize the flexibility in experimental design. However, each newly added component must comply with DATeS rules in order to guarantee interoperability with the other pieces in the package. DATeS provides base classes with definitions of the necessary methods. A new class added to DATeS, for example, to implement a specific new model, has to inherit the appropriate model base class and provide implementations of the inherited methods from that base class.
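The inheritance rule and the creation order (model, then assimilation object, then assimilation process) can be sketched with Python's `abc` module. Every class and method name below (`ForecastModelBase`, `DecayModel`, `NudgingFilter`, `AssimilationProcess`, `integrate_state`, `observe`) is hypothetical and chosen for this illustration; the actual DATeS base classes and their required methods are described in the package itself. The toy "filter" simply nudges observed entries toward the data and stands in for a real DA algorithm.

```python
import math
from abc import ABC, abstractmethod


class ForecastModelBase(ABC):
    """Hypothetical base class: lists what every model must provide."""

    @abstractmethod
    def integrate_state(self, state, t0, t1):
        """Propagate a state from time t0 to time t1."""

    @abstractmethod
    def observe(self, state):
        """Map a state into observation space."""


class DecayModel(ForecastModelBase):
    """Toy model: every entry decays exponentially at the given rate."""

    def __init__(self, rate):
        self.rate = rate

    def integrate_state(self, state, t0, t1):
        factor = math.exp(-self.rate * (t1 - t0))
        return [x * factor for x in state]

    def observe(self, state):
        return state[:1]          # only the first entry is observed


class NudgingFilter:
    """Toy assimilation object: relaxes observed entries toward the data."""

    def __init__(self, model, weight=0.5):
        self.model = model
        self.weight = weight

    def cycle(self, state, observation, t0, t1):
        forecast = self.model.integrate_state(state, t0, t1)
        predicted = self.model.observe(forecast)
        # Observed entries are the leading entries of the state in this toy setup
        for i, (y, hx) in enumerate(zip(observation, predicted)):
            forecast[i] += self.weight * (y - hx)
        return forecast


class AssimilationProcess:
    """Iterates the assimilation object over consecutive cycles."""

    def __init__(self, assimilator, checkpoints, observations):
        self.assimilator = assimilator
        self.checkpoints = checkpoints
        self.observations = observations

    def run(self, initial_state):
        state, trajectory = list(initial_state), []
        windows = zip(self.checkpoints[:-1], self.checkpoints[1:], self.observations)
        for t0, t1, y in windows:
            state = self.assimilator.cycle(state, y, t0, t1)
            trajectory.append(list(state))
        return trajectory


# Creation order: 1) model, 2) assimilation object, 3) assimilation process
model = DecayModel(rate=math.log(2.0))                 # halves the state per unit time
assimilator = NudgingFilter(model, weight=0.5)
process = AssimilationProcess(assimilator, [0.0, 1.0, 2.0], [[2.0], [1.0]])
trajectory = process.run([2.0, 4.0])
```

Because `ForecastModelBase` declares its methods as abstract, a subclass that forgets to implement one of them cannot be instantiated, which is one simple way to enforce the kind of interoperability rules described above.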

Figure 1. Diagram of the DATeS architecture.
