The specification of state background error statistics is a key component of
data assimilation since it affects the impact observations will have on the
analysis. In the variational data assimilation approach, applied in
geophysical sciences, the dimensions of the background error covariance
matrix (B) are too large for it to be computed or stored explicitly, so B must be modeled.

We present the advantages of this new design for the data assimilation
community by performing benchmarks of different models of B.

Since the best estimate of the background error covariance matrix (B)

The opportunity has been taken to redesign the GEN_BE code by extending its capabilities to investigate and estimate new error covariances. Originally, the GEN_BE code was developed by Barker et al. (2004) as a component of a three-dimensional (3-D) variational data assimilation (3DVAR) method to estimate the background error of the fifth-generation Penn State/NCAR mesoscale model (MM5, Grell et al., 1994) for a limited-area system. Since this initial version, various branches of the code have been developed at the National Center for Atmospheric Research (NCAR) and at the UK Met Office to address specific needs using different models, such as the Weather Research and Forecasting model (WRF, Skamarock et al., 2008) and the Unified Model (UM, Davies et al., 2005), on different data assimilation platforms, such as the WRF Data Assimilation (WRFDA, Barker et al., 2012) system and the Gridpoint Statistical Interpolation (GSI, Kleist et al., 2009) system. Different choices of control variables and their correlated errors used to mimic general physical balances (geostrophic, hydrostatic, etc.) in the atmosphere have been investigated extensively by different operational centers and are referenced in Bannister (2008b). Since then, such multivariate relationship approaches have been studied to characterize heterogeneous background errors in precipitating and nonprecipitating areas for regional applications (Caron and Fillion, 2010; Montmerle and Berre, 2010). Special emphasis is placed in Michel et al. (2011) on including hydrometeors in the background error statistics, as their direct analysis increment can come from data assimilation of radar reflectivity and satellite radiances. The framework of the GEN_BE code version 2.0 has been developed to merge these different efforts using linear regression to model the balance between variables, empirical orthogonal function (EOF) decomposition techniques, and the diagnostic of length scales to apply recursive filters (RFs).
It can read input from different models and provide output for different data assimilation platforms. This new flexibility, together with the possibility of defining a set of control variables and their error covariances as an input, should reduce further code development efforts and benefit the broader community of geophysical science in general.

This document describes the methods included in the GEN_BE code version 2.0
to investigate the modeling of B.

The solution of 3-D variational data assimilation (3DVAR) is sought as the
minimum of the following cost function (Courtier et al., 1994):
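In the standard notation, with background state $\mathbf{x}_b$, observation vector $\mathbf{y}$, observation operator $H$, and background and observation error covariance matrices $\mathbf{B}$ and $\mathbf{R}$, this cost function takes the form (the paper's exact symbols may differ slightly):

```latex
J(\mathbf{x}) = \frac{1}{2}\,(\mathbf{x}-\mathbf{x}_b)^{\mathrm{T}}\,\mathbf{B}^{-1}\,(\mathbf{x}-\mathbf{x}_b)
              + \frac{1}{2}\,\bigl(\mathbf{y}-H(\mathbf{x})\bigr)^{\mathrm{T}}\,\mathbf{R}^{-1}\,\bigl(\mathbf{y}-H(\mathbf{x})\bigr)
```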

By definition, exact values of

The cost function as defined in Eq. (1) is usually minimized after applying
the change of a variable:
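This change of variable takes the standard square-root form: writing $\mathbf{B} = \mathbf{U}\mathbf{U}^{\mathrm{T}}$ and $\delta\mathbf{x} = \mathbf{U}\mathbf{v}$, the background term of the cost function reduces to the identity:

```latex
J(\mathbf{v}) = \frac{1}{2}\,\mathbf{v}^{\mathrm{T}}\mathbf{v}
              + \frac{1}{2}\,\bigl(\mathbf{d}-\mathbf{H}\mathbf{U}\mathbf{v}\bigr)^{\mathrm{T}}\,\mathbf{R}^{-1}\,\bigl(\mathbf{d}-\mathbf{H}\mathbf{U}\mathbf{v}\bigr),
\qquad \mathbf{d} = \mathbf{y} - H(\mathbf{x}_b),
```

where $\mathbf{H}$ is the observation operator linearized about the background.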

The square root of the

the

the

The

The

The general structure of the GEN_BE code version 2.0 has been designed to
split the input, output, and algorithms into independent stages. The five
steps, from stages 0 to 4, that model a background error covariance matrix
become independent of the choice of control variables and model input, which
allows for more flexibility (Fig. 1). Stage 0 estimates the perturbations of
the control variables based on variables coming from a numerical weather
prediction (NWP) model forecast. Stage 1 removes the mean of these
perturbations and defines the applied binning. Stage 2 defines the balance
operator (

General structure of the code to generate a background error
covariance matrix. The input and output are represented by the orange boxes,
and the five main stages lead to the model of B.

Here, we present results obtained from a numerical experiment with the Advanced Research WRF (WRF-ARW, called WRF hereafter) model involving an ensemble of 50 members (D-ensemble) over the CONtiguous United States (CONUS) domain at 15 km resolution (res. 15 km, Fig. 2). Figure 3 shows the pressure (hPa) against vertical model levels. Each member is a 6 h forecast valid at 12:00 UTC on 3 June 2012. The ensemble adjustment Kalman filter (EAKF), from the community Data Assimilation Research Testbed (DART, Anderson et al., 2009), was used by Romine et al. (2014) to generate the analysis ensemble. Table 2, shown in Sect. 4, contains detailed setup information for this data assimilation experiment.

Since the background error covariance matrix is a statistical entity, samples
of model forecasts are required to estimate the associated variances and
correlations. Traditionally, two distinct techniques are used and are
available in stage 0 to compute the perturbations.

Differences between two forecasts valid at the same time but initialized at different times (time-lagged forecasts, e.g., 24 h minus 12 h forecasts) can be used to represent a sample of model background errors. This ad hoc technique, called the NMC method (named for the National Meteorological Center; Parrish and Derber, 1992), has been widely used in operational centers where large databases of historical forecasts are available.

Background error statistics can be evaluated from an ensemble of perturbations valid at the same time (Fisher, 2003; Pereira and Berre, 2006). This method tends to be more accurate because it better represents the background error of the day, rather than a climatological error, as with the NMC method. However, more computational resources are required to run an ensemble simulation, and it may not automatically provide the optimal B for a particular system (Fisher, 2003).
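As an illustration of the ensemble approach (a pure-Python sketch, not the GEN_BE FORTRAN; function and variable names are ours), the perturbations are the departures of each member from the ensemble mean, and their sample variance estimates the background error variance at each grid point:

```python
# Sketch of stage-0-style perturbations from an ensemble of forecasts valid
# at the same time.  Each member's perturbation is its departure from the
# ensemble mean; the unbiased sample variance of the perturbations then
# estimates the background error variance at each grid point.
def ensemble_perturbations(members):
    """members: list of equally sized lists of grid-point values."""
    n = len(members)
    npts = len(members[0])
    mean = [sum(m[i] for m in members) / n for i in range(npts)]
    return [[m[i] - mean[i] for i in range(npts)] for m in members]

def sample_variance(perts):
    """Unbiased (n - 1) sample variance of the perturbations, per point."""
    n = len(perts)
    npts = len(perts[0])
    return [sum(p[i] ** 2 for p in perts) / (n - 1) for i in range(npts)]

# Toy ensemble: 3 members, 2 grid points.
members = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]
perts = ensemble_perturbations(members)   # departures from the mean
var = sample_variance(perts)              # background error variance estimate
```

In the real system the "grid points" are 3-D model fields and the members are the 50 D-ensemble forecasts, but the arithmetic is the same.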

WRF domain over the CONUS area at a resolution of 15 km. Based on this configuration, the 50 members coming from a 6 h forecast (DART experiment) are used to generate background error statistics.

Since the number of perturbation samples can be limited, a strategy is used to
model a static error covariance over an entire domain and to filter the sampling
noise. The statistics are spatially averaged by gathering grid points
with similar characteristics. The different options available for this
technique, referred to as binning, are described in Table B2 and can be set up
in the namelist input file (Table B3). The simplest way to compute statistics
for a domain is to average by vertical level.

For this reason, the GEN_BE code has been modified to facilitate the
introduction of new binning options for specific applications (see
Appendix B). Stage 1 removes the mean of the perturbations and defines the
binning, which is an important component in the model of B.

An analysis increment for one variable may impact another if they have
correlated errors. The simplest way to model these multivariate error
cross covariances is to use linear regressions that mimic physical balances
between variables. First, the regression coefficient between variables can be
estimated by solving Eq. (

In practice, the regression coefficient can be calculated directly as the
ratio of the covariance to the variance, or by performing a
Cholesky decomposition (see Appendix B for more details). Then, linear
regressions are performed to derive uncorrelated (i.e., unbalanced)
perturbations by removing the balanced part from other perturbation
variables. Equation (
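As a sketch of this step (pure Python, not the GEN_BE FORTRAN; the variable names are illustrative), the regression coefficient is the covariance-to-variance ratio, and the unbalanced part of a variable is the residual left after removing the regressed (balanced) part:

```python
# Balance modeling by linear regression: coefficient = cov(x, y) / var(x);
# the "unbalanced" part of y is y minus its part regressed on x, and is by
# construction uncorrelated with x.
def regression_coefficient(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    var = sum((xi - mx) ** 2 for xi in x) / n
    return cov / var

def unbalanced(x, y):
    c = regression_coefficient(x, y)
    return [yi - c * xi for xi, yi in zip(x, y)]

# e.g. temperature perturbations partly balanced with stream function ones
psi = [1.0, 2.0, 3.0, 4.0]
t = [2.1, 4.0, 6.2, 7.9]        # roughly 2 * psi, plus a residual
t_u = unbalanced(psi, t)         # unbalanced temperature, uncorrelated with psi
```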

Note that, in variational data assimilation processes, the balance operator

Description of the control variables available for the meteorology.

Plot of pressure (hPa) against vertical model levels (WRF, resolution 15 km).

Furthermore, Bannister (2008b) described the

Finally, results of an experiment that includes hydrometeors and their correlated errors with humidity (CV9) are presented in Sect. 5.1 and defined by the namelist input file in Table B5.

After calculating the vertical auto-covariance matrix (VACM), two techniques
are currently available in stage 3 to compute the parameters useful for
modeling the mean vertical auto-correlation transform
(

Approximating Eq. (8a) with finite differences for the second-order derivative
of

If the correlation is approximated at the origin by a Gaussian function as
follows,

Pannekoucke et al. (2008) studied the sensitivity of these formulae to sampling errors and showed that the Gaussian and parabolic approximations give similar results. Furthermore, the vertical length scale can be computed uniformly by vertical model level or by bin. Table B6 in Appendix B contains a description of the namelist options used to define the vertical length scale in stage 3 and the horizontal length scale in stage 4.
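For concreteness, a small Python sketch (ours, not the paper's exact Eq. (9)) of the Gaussian-based diagnostic: if the correlation between two points separated by a distance delta is fitted near the origin by rho(delta) = exp(-delta**2 / (2 L**2)), solving for L gives L = delta / sqrt(-2 ln rho):

```python
import math

# Gaussian length scale diagnostic: invert the near-origin Gaussian fit
# rho(delta) = exp(-delta**2 / (2 * L**2)) for the length scale L.
def gaussian_length_scale(rho, delta):
    """rho: correlation at separation delta (0 < rho < 1)."""
    if not 0.0 < rho < 1.0:
        raise ValueError("correlation must lie strictly between 0 and 1")
    return delta / math.sqrt(-2.0 * math.log(rho))

# One grid spacing (15 km) and an illustrative neighboring correlation of 0.8:
L = gaussian_length_scale(0.8, 15.0)   # length scale in km
```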

Horizontal autocorrelation performed at the center of each square
grid over vertical model level 5, around 950 hPa, for control variables

Horizontal auto-correlations can be computed for each control variable at
each grid point. Figure 5 shows a diagnostic of correlation for a few
selected points of the WRF computational domain around 500 m above the
ground (model level 5). The stream function (5a) and velocity potential
control variables have larger and more isotropic spatial correlations, while
the temperature (5b) and the humidity (5c) control variables show smaller and
anisotropic correlations at different locations. The radius of the area where
the correlation exceeds 0.9 is within a range of 100 to 400 km for the stream
function, while this radius reaches its maximum around 100 km for
temperature and humidity. Hydrometeor mixing ratios show even more local
structures due to their sparse distribution in the horizontal and the vertical
(5d).
In stage 4, we estimate horizontal length scales averaged by vertical level
or EOF mode for a field analyzed in a 2-D plane. The length scale represents the radius of
influence, calculated in grid point space, around the position of an
observation, and is an input parameter for recursive filters to spread the increment
horizontally.

The first method (ls_method

If a second-order autoregressive (SOAR) correlation function is used, the
length scale

However, as this procedure is both computationally expensive and prone to
sampling errors, a second option (ls_method

The horizontal length scale can be uniformly calculated over a vertical model level or can be statistically binned. Homogeneous recursive filters handle a single length scale defined per vertical model level or EOF mode. Inhomogeneous recursive filters (Purser et al., 2003b), as implemented in GSI, are able to handle heterogeneous length scales; in this case, the increment is spread out with a length scale according to the bin class of each grid point. Moreover, spatial filtering to smooth the length scale may be required because of recursive filter normalization issues (Michel and Auligné, 2010).
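To make the recursive filter idea concrete, here is a minimal one-dimensional, first-order sketch (a strong simplification of the Purser et al. filters, which are higher order and carefully normalized; the code and names are ours): a forward and a backward sweep with smoothing coefficient alpha spread a single-point increment over its neighbors, with a width that grows with alpha:

```python
# One-dimensional first-order recursive filter: a forward sweep followed by
# a backward sweep, each blending every point with its already filtered
# neighbor.  The pair of sweeps approximates a quasi-Gaussian smoothing.
def recursive_filter(field, alpha):
    out = list(field)
    for i in range(1, len(out)):            # forward sweep
        out[i] = alpha * out[i - 1] + (1.0 - alpha) * out[i]
    for i in range(len(out) - 2, -1, -1):   # backward sweep
        out[i] = alpha * out[i + 1] + (1.0 - alpha) * out[i]
    return out

spike = [0.0] * 5 + [1.0] + [0.0] * 5      # single-point "increment"
spread = recursive_filter(spike, 0.5)      # increment spread to neighbors
```

In the assimilation system the filter is applied along each horizontal (or vertical) direction with alpha chosen from the diagnosed length scale, and a normalization step restores unit amplitude at the observation point.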

We present a benchmark of different models of B.

Description of the setup of the background error matrix modeling
diagnosed over the CONUS domain.

Representation of the first five eigenvectors resulting from the EOF
decomposition of the vertical autocovariance matrix, eigenvectors of

If the EOF decomposition is used, the eigenvectors model the vertical
transform (

Eigenvalues computed by EOF mode for

The horizontal length scales, estimated by Eq. (11), are presented in Fig. 8. The stream function and the velocity potential have the largest length scale values, reaching 600 km (39 grid points) for the first EOF mode, while the unbalanced temperature length scale varies strongly over the first three EOFs, passing approximately from 135 to 30 km (nine to two grid points), and from there decreases slightly from 30 to 15 km (two to one grid points) by the last EOF mode. The relative humidity length scale remains small, decreasing from approximately 30 to 15 km as a function of the EOF mode. The unbalanced temperature and the relative humidity have relatively small length scales, which means that they have more local features represented by a small radius of influence. Thus, the analysis increment from these variables will remain closer to the observation. As the horizontal length scale is associated with an EOF mode and is not directly related to a vertical model level, associating a length scale with a physical event may be difficult.

Length scales defined in grid point through EOF mode for CV5. The
analysis control variables representing the dynamical variables, psi and
chi

Horizontal length scales for CV5.

The horizontal correlation is modeled by the application of recursive filters
based on the estimation of the horizontal length scale solving Eq. (11),
applied at every vertical model level for each variable, as shown in Fig. 9.
The horizontal length scales diagnosed for each control variable by vertical
level (Fig. 9) or by EOF mode (Fig. 8) have the same range of values. The
length scales of the stream function and the velocity potential control
variables have the largest values above 150 km (10 grid points) for all the
vertical model levels, while the length scales of temperature and relative
humidity remain in a range of 30 to 60 km (two to four grid points) below the
200 hPa level. Temperature and humidity, which have more local structures,
are modeled with smaller length scales. Overall, the horizontal length
scales of the different variables increase from the bottom to the top of the
model, as they represent larger-scale events. Direct comparison of these
statistics with the

The vertical correlation is modeled by the application of recursive filters based on the estimation of the vertical length scale coming from Eq. (8b). The stream function and the velocity potential (Fig. 10), which represent large-scale horizontal flow, have larger vertical length scales than temperature and humidity. The vertical gradients of temperature and humidity can vary strongly locally, decreasing the vertical correlation.

Vertical length scale for CV5 (WRF, res. 15 km, D-ensemble, RFs).

The single pseudo observation test is a powerful way to provide a benchmark, as it
allows visualization of the increment of an isolated observation and its
impact on other variables. Thus, the following are pseudo observation tests
of temperature with an innovation of 1 K and an observation error of 1 K
using different models of B.

Pseudo observation test of temperature (innovation of

Pseudo observation test of temperature (innovation of

Pseudo observation test of temperature (innovation of

As expected, the horizontal cross section at the 500 hPa level for
temperature shows an isotropic response to the innovation of 1 K. The maxima
of intensity simulated depend on the standard deviation (diagonal matrix

On the one hand, the operator (

For the vertical cross section (

Finally, the multivariate approach, defined by CV5, induces increments in the
wind components. The horizontal cross section

These ensemble-based background error

Code modifications have been done in the WRFDA code to add a multivariate
balance operator for the hydrometeor variables: cloud liquid water mixing
ratio (

The

The statistics coming from the GEN_BE v2.0 code, i.e., the regression
coefficients and the unbalanced part of each variable, can be estimated simply
by modifying the namelist input file. In this case, the covar5 line of Table
B5 that describes the covariances between the fifth control variable
(relative humidity) and the third control variable

A similar balance is applied to

The vertical and horizontal transforms retained are recursive filters, which make the interpretation of the length scale parameter easier, as it is directly associated with a vertical model level. The four main hydrometeors have been added in this study, as they could be useful for the assimilation of remote sensing data such as satellite cloud radiances and radar reflectivity.

Horizontal length scale for the hydrometeors using

The horizontal length scale values of the different hydrometeors shown in
Fig. 15a are smaller in comparison to other control variables (less than
30 km, two grid points). Significant length scale values that exceed
15 km (one grid point) are related to the presence of hydrometeors: they
occur below the 150 hPa pressure level for

Vertical length scale for the hydrometeors using

The vertical correlation maxima of the precipitating hydrometeors are higher
than those of the cloud water or cloud ice hydrometeors, as precipitating
particles can fall freely through multiple levels (Fig. 16a). The vertical length scale of

To verify that our analysis is multivariate, we conducted a series of tests
in which pseudo observations of hydrometeors were assimilated into WRFDA, and
the corresponding analysis increment was plotted. Figure 17 shows the
analysis response for the

The intensity of the increment can be weighted by the 1-D variance or by the
3-D variance (

The covariance between the mixing ratio of cloud water condensate and
relative humidity, described in Sect. 5.1.1, can reinforce the ability to add
clouds in dry areas or remove clouds in cloudy areas. The univariate
version of the balance operator for hydrometeors may be beneficial at the
analysis time, as hydrometeors can be directly assimilated. The multivariate
balance is present to help to propagate the

The determination of the balance of humidity and hydrometeors is a difficult task, as it involves the microphysical processes of meteorological NWP models and various local phenomena. The use of local covariances coming from the D-ensemble may help to balance these highly sensitive variables. Furthermore, operational centers such as Météo-France with the Application of Research to Operations at Mesoscale system (AROME, Seity et al., 2011) and the Met Office with the Met Office Global and Regional Ensemble Prediction System (MOGREPS, Bowler et al., 2008; Migliorini et al., 2011) already use ensemble forecasts at high resolution to more accurately characterize specific meteorological events, such as precipitation and convection. At present, their ensemble sizes remain small (often fewer than 10 members) because the CPU (central processing unit) cost is still high. Studies have been dedicated to evaluating the sampling errors in the ensemble method and in the parameters, such as correlation length scales, that usually model the background errors (Pannekoucke et al., 2008; Ménétrier et al., 2014). When the ensemble size is small, methods that combine general statistics of the background errors and local balance are found to perform better (Hamill and Snyder, 2000). Figures 15a, b and 16a, b, which display the horizontal and vertical length scale parameters, respectively, for the hydrometeors as a function of the number of members, show stable results.

As a proof of concept, this last section shows the direct applicability of the GEN_BE v2.0 code as a diagnostic tool for topics other than meteorology. In recent decades, a large number of studies investigating chemical data assimilation have been conducted. Some of the first studies on stratospheric and tropospheric chemistry data assimilation were performed roughly two decades ago (e.g., Austin, 1992; Fisher and Lary, 1995; Elbern et al., 1997). Since then, efforts have been made to improve atmospheric chemical modeling and the performance of data assimilation schemes.

Characterization of the background error covariance matrix

Statistics were analyzed in detail to ensure that

The statistics are estimated using 20 members over the CONUS domain. Each member comes from a 12 h forecast of WRF-Chem (the WRF model coupled with chemistry, Grell et al., 2005), valid at 12:00 UTC on 14 June 2008, at 36 km horizontal resolution with 33 vertical levels. The lateral boundary conditions coming from MOZART (Model for OZone And Related chemical Tracers, Emmons et al., 2010) and the emission factors coming from MEGAN (Model of Emissions of Gases and Aerosols from Nature, Guenther et al., 2006) are perturbed using pseudo-normal random noise. To avoid unphysical or negative values of concentrations and emissions, and to keep the ensemble mean boundary conditions close to the original values, we perturb the emissions and boundary conditions using a standard deviation (sigma) of 25 % of the original value, and we limit the perturbation to no more than 3 sigma (i.e., 75 %).
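The clipping strategy just described can be sketched as follows (Python, with function and parameter names of our choosing; the actual experiment perturbs WRF-Chem boundary condition and emission fields): draw pseudo-normal noise with sigma equal to 25 % of the original value and clip it at plus or minus 3 sigma, which bounds the perturbation to 75 % of the value and keeps positive quantities positive:

```python
import random

# Perturb a positive quantity with pseudo-normal noise, sigma = 25 % of the
# value, clipped at +/- 3 sigma (i.e. +/- 75 % of the value).
def perturb(value, rel_sigma=0.25, nsigma=3.0, rng=random):
    sigma = rel_sigma * value
    noise = rng.gauss(0.0, sigma)
    noise = max(-nsigma * sigma, min(nsigma * sigma, noise))
    return value + noise

rng = random.Random(0)                     # seeded for reproducibility
emission = 10.0                            # illustrative emission value
samples = [perturb(emission, rng=rng) for _ in range(1000)]
# Every sample stays within 75 % of the original value, hence positive,
# and the ensemble mean stays close to the unperturbed value.
```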

Vertical standard deviation in ppmv of

Figure 19 presents the standard deviations for the chemical species of
interest. The standard deviation of the background error is directly related
to the species concentrations. Most of the ozone variability takes place in
the middle atmosphere (stratosphere) in the ozone layer around 100 hPa
(Fig. 19a). Figure 19b and c highlight NO

Horizontal length scale of O

Figure 20 displays the calculated horizontal chemical length scales. For ozone,
the horizontal length scales are around 100 km in the troposphere and
around 125 km in the stratosphere. Pagowski et al. (2010) used an NMC method
and found that ozone horizontal length scales are around 100 km (150 km) in
the troposphere (in the stratosphere). Concerning NO

Vertical length scale of O

Concerning the vertical correlations (Fig. 21), all four diagnosed species present a maximum close to the surface, where they are emitted (or, for ozone, secondarily produced). The correlation length scales decrease sharply between 1000 and 850 hPa. Two main reasons explain this: (1) reactions with other short-lived species emitted near the surface create strong correlations in the lowest model levels; and (2) the increase in model layer thickness from the surface upward creates stronger correlations in grid point space. This strong decrease in correlation length scales is not fully understood and needs further investigation. Above the surface peak, the vertical correlation also decreases around 800 hPa due to weaker vertical mixing above the planetary boundary layer. In the free troposphere, where the vertical mixing is less significant, the vertical length scale decreases slowly from approximately 70 to 40 km; the vertical diffusion of possible data assimilation increments will therefore be less significant than in the boundary layer. Compared to Pagowski et al. (2010), the ozone vertical length scale profile presents the same behavior: strong vertical correlation close to the surface, followed by a strong decrease at the levels directly above, resulting in lower values in the upper levels of the boundary layer.

Here we have shown that the GEN_BE v2.0 code is able to model a

While variational methods have been successfully used in operational centers for a long time, the estimation of background errors needs to be continuously improved to assimilate new observations and to provide more accurate statistics. The GEN_BE v2.0 code has been developed to investigate and model univariate or multivariate covariance errors from control variables defined by the user as an input. By extending its former capabilities, it gathers methods and options that can be easily applied to different model inputs and used on different data assimilation platforms. The flexibility of the GEN_BE v2.0 framework should facilitate the diagnosis of correlated errors and the implementation of new background error models.

This document describes first the different stages and transforms that lead
to the modeling of the background error covariance matrix

Second, the GEN_BE v2.0 code has been validated through multivariate single
observation tests of temperature using three different models of B.

Third, the GEN_BE code has been used to perform the statistics over an
extended set of control variables that include the mixing ratio of
hydrometeors (CV9) for multivariate cloud data assimilation purposes. As
clouds have an intermittent presence, the 3-D variance coming from an
ensemble of the day gives a spatial envelope useful for weighting the
analysis relative to the observation and the background confidence. The
hydrometeors of cloud water and ice condensate are also balanced with
humidity so as to potentially create or remove misplaced clouds. The
calculated regression coefficients can be kept for the next analysis cycle,
as they are averaged by bins, or recalculated, as they are not expensive
in CPU time. In this paper, a pseudo observation test of cloud
mixing ratio was performed using WRFDA, and the next step is to test cloudy
radiance data assimilation. Finally, background error statistics are estimated
for chemical species such as carbon monoxide (CO), nitrogen oxides
(NO

In these previous examples, GEN_BE code version 2.0 can handle input data
sets coming from WRF, a model defined on an Arakawa C-grid, and the
background error statistic outputs are computed on an unstaggered Arakawa
A-grid. With minor modifications, the code would be able to handle other
horizontal grids. Also, statistics could easily be computed for models with
different vertical grid definitions. If we consider performing the background
error statistics on an unstructured grid, the structure of the code can
remain the same, but a few mathematical operators, such as differential and
Laplacian, and estimation of the distance between two grid points, would need
to be re-defined according to the grid. In fact, the

The current trend is to model a more complex background error, expanding the control variables and correlated errors and using techniques to achieve more heterogeneity and anisotropy. The geographical binning and the 3-D variance available in the GEN_BE v2.0 code can be utilized with new data assimilation algorithms. For example, hybrid data assimilation that combines variational and ensemble methods may be helpful especially by adding flow dependence in the estimation of the background error while keeping a reduced ensemble size due to CPU time constraints (Hamill and Snyder, 2000). Wang et al. (2008a, b) performed a study using a hybrid 3DVAR-ETKF (ensemble transform Kalman filter) technique that combines static (modeled) and ensemble background error covariances. Better results were obtained over North America at a coarse resolution (200 km), especially in data-sparse areas compared to those performed solely with 3DVAR. The extended control variable technique (Lorenc, 2003) allows blending of flow-dependent errors with static covariance errors. Bannister et al. (2011) investigated the benefit of a convection-permitting prediction system ensemble (24 members) on a finer scale (i.e., 1.5 km of resolution) for nowcasting purposes based on MOGREPS (Migliorini et al., 2011). Even though the authors show how general balances that drive synoptic flow, in particular geostrophic balance, can diminish in convective situations on small scales, they highlight the necessity for a data assimilation system to better represent both the large-scale and mesoscale components of the flow. In addition, Ménétrier et al. (2014) studied heterogeneous flow-dependent background error covariances on a convective scale and showed that a small ensemble (six members from AROME) contains relevant information together with sampling noise, which can be reduced through filtering. 
Finally, the GEN_BE v2.0 code may be a tool to diagnose inhomogeneous 3-D localization parameters in ensemble methods. The code has been tested in atmospheric science, but the flexibility of the code may be useful in other geophysical applications.

FORTRAN code description of the GEN_BE v2.0 framework.

Input and output of the different components of the GEN_BE v2.0 code.

Content of the final output file be.nc (NetCDF format) of the GEN_BE v2.0 code.

New FORTRAN modules have been developed to generalize the calculation of the error covariance matrix from different input models and for new control variables. Table A1 contains a complete list of these modules and their contents. All the algorithms from stage 1 to stage 4 are now independent of the choice of control variables and are driven by a unique namelist file, called namelist.input, read by the FORTRAN module configure.f90. Flexibility has been added for future experiments: only a few modifications are needed in stage 0 to add new control variables. The FORTRAN module io_input_models.f90 converts the standard variables from a given model to the analysis variables. The interface is already implemented for the WRF model; only the FORTRAN module io_input_model.f90 needs to be updated to implement a new model input and run the different stages. The NetCDF format has been chosen to improve robustness and flexibility in the input and output of the different stages, as shown in Table A2. The final NetCDF output file (be.nc) contains all the information needed by a variational data assimilation system, as shown in Table A3. Several converters from NetCDF format to binary have been developed to ensure backward compatibility with other data assimilation systems. A be.dat binary file can be generated for the WRFDA application using the gen_be_diags.f90 program, and a be_gsi.dat binary file can be created for GSI using the gen_be_nc2gsi.f90 converter.

General information defining the experiment in the namelist input file (&gen_be_info part).

Description of the binning options.

Parameters defining the binning options of the namelist input file (&gen_be_bin part).

Information related to the control variables and their covariance errors in the namelist input file (&gen_be_cv part, example CV5). At present, the parameter covar can take three values, 0, 1, and 2, meaning “no regression”, “full regression” and “diagonal only”.

Information related to the control variables and their covariance errors in the namelist input file (&gen_be_cv part, example CV9, definition of multivariate humidity and hydrometeor error covariance matrix).

Description of the options available in the namelist input file (&gen_be_lenscale part) to diagnose the length scale parameter.

The “namelist.input” file that drives the different stages 0 to 4 contains four different sections.

The “

Table B2 presents eight available binning options and Table B3 explains how
to set up namelist section “

The

Table B6 contains namelist section “

The GEN_BE code version 2.0 is a stand-alone package that can be installed
on different UNIX/LINUX systems. It has been tested with the Intel FORTRAN
compiler, the Portland Group FORTRAN compiler, and the GNU FORTRAN compiler.
It requires compilation of NetCDF libraries. First, a configuration file
needs to be created using the command

Korn-shell scripts available in the scripts directory allow one to set up the
experiment. The wrapper script, named gen_be_wrapper.ksh,
sets up some global variables and launches the main
script (gen_be.ksh). The user needs to set up most of the other options that
determine how B is modeled.

Funding for this work was provided by the US Air Force Weather Agency. The authors benefited from numerous discussions with Yann Michel. Glen Romine is thanked for providing the ensemble over the CONUS domain. Syed Rizvi is thanked for discussions concerning the previous version of the code.

Edited by: A. Archibald