This paper describes the assimilation of trace gas observations into the
chemistry transport model SILAM (System for Integrated modeLling of
Atmospheric coMposition) using the 3D-Var method. Assimilation results for
the year 2012 are presented for the prominent photochemical pollutants ozone
(O

Attention was paid to the background and observation error covariance
matrices, which were obtained primarily by the iterative application of a
posteriori diagnostics. The diagnostics were computed separately for 2
months representing summer and winter conditions, and further disaggregated
by time of day. This enabled the derivation of background and observation error
covariance definitions, which included both seasonal and diurnal variation.
The consistency of the obtained covariance matrices was verified using

The analysis scores were computed for a control set of observation stations
withheld from assimilation. Compared to a free-running model simulation, the
correlation coefficient for daily maximum values was improved from 0.8 to 0.9
for O

During the past 10–15 years, assimilating observations into atmospheric chemistry transport models has been studied with a range of computational methods and observational data sets. The interest has been driven by the success of advanced data assimilation methods in numerical weather prediction (Rabier, 2005), as well as by the development of operational forecast systems for regional air quality (Kukkonen et al., 2012). Furthermore, the availability of remote sensing data on atmospheric composition has enabled construction of global analysis and forecasting systems, such as those described by Benedetti et al. (2009) and Zhang et al. (2008). Assimilation of satellite observations into stratospheric chemistry models has been demonstrated, e.g. by Errera et al. (2008).

Data assimilation is defined (e.g. Kalnay, 2003) as the numerical process of using model fields and observations to produce a physically and statistically consistent representation of the atmospheric state, often in order to initialise the subsequent forecast. The main techniques used in atmospheric models include optimal interpolation (OI; Gandin, 1963), variational methods (3D-Var and 4D-Var; Le Dimet and Talagrand, 1986; Lorenc, 1986), and stochastic methods based on the ensemble Kalman filter (EnKF; Evensen, 1994, 2003). Each of the methods has been applied in air quality modelling. Statistical interpolation methods were used by Blond and Vautard (2004) for surface ozone analyses and by Tombette et al. (2009) for particulate matter. The EnKF method has been utilised by several authors (Constantinescu et al., 2007; Curier et al., 2012; Gaubert et al., 2014), especially for ozone modelling. The 3D-Var method has been applied in regional air quality models by Jaumouillé et al. (2012) and Schwartz et al. (2012), while the computationally more demanding 4D-Var method has been demonstrated by Elbern and Schmidt (2001) and Chai et al. (2007). Ozone has been the most commonly assimilated chemical component, partly due to its significance for health effects.

The performance of most data assimilation methods depends on correctly prescribed background error covariance matrices (BECM). This is particularly important for 3D-Var, where the BECM is prescribed and fixed throughout the whole procedure, in contrast to the EnKF based assimilation methods, where the BECM is described by the ensemble of states, and to the 4D-Var method, where the BECM is prescribed but evolves implicitly within the assimilation window.

A range of methods of varying complexity have been employed to estimate the
BECM in previous studies on chemical data assimilation. The “National
Meteorological Centre” (NMC) method introduced by Parrish
and Derber (1992) is based on using differences between forecasts with
differing lead times as a proxy for the background error.
Kahnert (2008), as well as Schwartz
et al. (2012), applied the NMC method for estimating the BECM for
assimilation of aerosol observations. Chai et al. (2007) based the BECM on a combination of the NMC method and the observational
method of Hollingsworth and Lönnberg (1986). This
observational method was also used by Kumar et al. (2012) in the assimilation of NO

The BECM can also be estimated using ensemble modelling; this approach was taken by Massart et al. (2012) for global and by Jaumouillé et al. (2012) for regional ozone analyses. Finally, Desroziers et al. (2005) presented a set of diagnostics, which can be used to adjust the background and observation error covariances. This method has been previously applied in chemical data assimilation for example by Schwinger and Elbern (2010) and Gaubert et al. (2014).

In contrast to short- and medium-range weather prediction, the influence of
the initial condition on an air quality forecast has been found to diminish as
the forecast length increases. For ozone,
Blond and Vautard (2004) and Wu et al. (2008) found that the effect of the adjusted initial condition extended for
up to 24 h. Among other reactive gases, NO

An approach for improving the effectiveness of data assimilation for short-lived species is to extend the adjusted state vector with model parameters. Among the possible choices are emission and deposition rates (Bocquet, 2012; Curier et al., 2012; Elbern et al., 2007; Vira and Sofiev, 2012).

The aim of the current paper is to describe and evaluate a regional air
quality analysis system based on assimilating hourly near-surface
observations of NO

Section 2 presents the model setup
and briefly reviews the 3D-Var assimilation method. The procedure for
estimating the background and observation error covariance matrices is
discussed in Sect. 3. The assimilation results
for O

This section presents the SILAM dispersion model and the observation data sets used, and describes the assimilation procedure.

This study employed the SILAM chemistry transport model (CTM) version 5.3. The model utilises the semi-Lagrangian advection scheme of Galperin (2000) combined with the vertical discretisation described by Sofiev (2002) and the boundary layer scheme of Sofiev et al. (2010). Wet and dry deposition were parameterised as described in Sofiev et al. (2006).

The chemistry of ozone and related reactive pollutants was simulated using the
carbon bond 4 chemical mechanism (CB4, Gery et
al., 1989). However, the NO

In this study, the model was configured for a European domain covering the
area between 35.2 and 70.0

The emissions of anthropogenic pollutants were provided by the MACC-II European emission inventory (Kuenen et al., 2014) for the reference year 2009. The biogenic isoprene emissions, required by the CB4 run, were simulated by the emission model of Poupkou et al. (2010).

Three sets of SILAM simulations were carried out in this study. First, the background and observation error covariance matrices were calibrated using 1-month simulations for June and December 2011. The calibration results were used in reanalysis simulations covering the year 2012. Finally, a set of 72 h hindcasts was generated for the period between 16 July and 5 August 2012, to evaluate the forecast impact of assimilation. The hindcasts were initialised from the 00:00 UTC analysis fields. The timespan included an ozone episode affecting parts of southern and western Europe (EEA, 2013). The reanalysis and hindcasts use identical meteorological and boundary input data, and hence, the hindcasts only assess the effect of chemical data assimilation.

The analysis and forecast runs were performed at a horizontal resolution of
0.2

This study uses the hourly observations of NO

Two sets of stations were withheld for evaluation. The first set, referred to here as the MACC set, had been used in the regional air quality assessments within the MACC and MACC-II projects (Rouïl, 2013; see also Curier et al., 2012). The second set consisted of the stations reported as EMEP stations in the database. The MACC validation stations included about a third of the available background stations for each species, and were chosen with the requirement to cover the same area as the assimilation stations. The EMEP network is sparser and has no particular relation to the assimilation stations. Note that the EMEP stations included in AirBase do not comprise the full EMEP monitoring network.

The in situ data are used for assimilation and evaluation under the
assumption that they represent the pollutant levels at the spatial scales
resolved by the model. We expect this assumption to be violated, especially
at many urban and suburban stations due to local variations in emission
fluxes. For this reason, only rural stations were used for evaluation of the
2012 reanalysis. The NO

The station networks used for assimilation and validation for O

The statistical indicators used for model evaluation were correlation, mean bias and root mean squared error (RMSE). Since air quality models are frequently used to evaluate daily maximum concentrations, the indicators were evaluated separately for the daily maximum values.
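As a minimal sketch of how such indicators could be computed, the Python fragment below evaluates correlation, mean bias and RMSE for hourly values and for daily maxima. The function names, the synthetic data, the sign convention for bias (model minus observation) and the pairing of daily maxima irrespective of the hour at which they occur are assumptions made for illustration. They are not taken from the SILAM evaluation code.

```python
import numpy as np

def scores(model, obs):
    """Return (correlation, mean bias, RMSE) for paired 1-D series.

    Pairs where either value is NaN are dropped. Bias is defined here
    as model minus observation (an assumed convention).
    """
    model = np.asarray(model, dtype=float)
    obs = np.asarray(obs, dtype=float)
    valid = ~(np.isnan(model) | np.isnan(obs))
    m, o = model[valid], obs[valid]
    corr = np.corrcoef(m, o)[0, 1]
    bias = np.mean(m - o)
    rmse = np.sqrt(np.mean((m - o) ** 2))
    return corr, bias, rmse

def daily_max(hourly):
    """Reduce an hourly series covering whole days to daily maxima.

    The length must be a multiple of 24. A day containing a NaN yields
    NaN, which scores() then discards.
    """
    hourly = np.asarray(hourly, dtype=float)
    return hourly.reshape(-1, 24).max(axis=1)

# Synthetic 30-day example (hypothetical numbers, not real observations).
rng = np.random.default_rng(0)
hours = np.arange(24 * 30)
obs = 60 + 20 * np.sin(2 * np.pi * (hours - 9) / 24) + rng.normal(0, 5, hours.size)
mod = obs - 3 + rng.normal(0, 8, hours.size)

print("hourly    :", scores(mod, obs))
print("daily max :", scores(daily_max(mod), daily_max(obs)))
```

Computing the daily maximum separately for the model and the observations means that a timing error in the peak is not penalised, which is one possible reading of how the scores are evaluated "separately for the daily maximum values".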

In the 3D-Var method, the analysis

For the surface measurements, the operator

In the current study, only a single chemical component was assimilated in each
run. Since O

Correlation length scales L (km) diagnosed from the NMC data set.

The numerical formulation of the BECM in the current work follows the
assumptions made by Vira and Sofiev (2012). We assume
that the background error correlation is homogeneous in space, and its
horizontal component is described by a Gaussian function of distance between
the grid points. Furthermore, we assume that the background error standard
deviation

For estimation of the parameters for the covariance matrices

In the NMC method, the difference between two forecasts valid at a given time is taken as a proxy of the forecast error. In this work, the proxy data set was extracted from 24 and 48 h regional air quality forecasts for the year 2010. The forecasts were generated with the SILAM model in a configuration similar to the one used in this study. Since no chemical data assimilation was used in the forecasts, the differences were due to changes in forecast meteorology and boundary conditions only. The lead times were chosen to allow sufficient spread to develop between the forecasts. The forecast data were segregated by hour, resulting in separate sets for hours 00:00, 06:00, 12:00 and 18:00 UTC, and the correlations were interpolated to all other times of day.
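The construction of the NMC proxy can be sketched as follows. The fragment takes two forecast arrays valid at the same times, forms their difference, groups the samples by valid hour, and computes a sample correlation between model levels. The array layout, the function names, the synthetic forecasts and the linear, cyclic interpolation between the four hours are assumptions made for illustration. The text does not specify how the interpolation was done.

```python
import numpy as np

def nmc_differences(fc24, fc48):
    """Difference of two forecasts valid at the same times (error proxy)."""
    return np.asarray(fc48, dtype=float) - np.asarray(fc24, dtype=float)

def vertical_correlation(diff):
    """Sample correlation between model levels.

    diff has shape (n_times, n_levels, n_y, n_x); every column at every
    time is treated as one sample.
    """
    n_levels = diff.shape[1]
    samples = np.moveaxis(diff, 1, 0).reshape(n_levels, -1)
    return np.corrcoef(samples)  # rows are variables (levels)

def correlation_by_hour(diff, valid_hours, hours=(0, 6, 12, 18)):
    """One correlation matrix per valid hour (UTC)."""
    valid_hours = np.asarray(valid_hours)
    return {h: vertical_correlation(diff[valid_hours == h]) for h in hours}

def interpolate_in_time(corr_by_hour, hour):
    """Linear, cyclic interpolation of the matrices to any hour of day."""
    hours = sorted(corr_by_hour)
    for i, h0 in enumerate(hours):
        h1 = hours[(i + 1) % len(hours)]
        span = (h1 - h0) % 24 or 24
        offset = (hour - h0) % 24
        if offset < span:
            w = offset / span
            return (1 - w) * corr_by_hour[h0] + w * corr_by_hour[h1]
    raise ValueError("hour could not be bracketed")

# Synthetic forecasts (hypothetical): the 48 h run differs from the 24 h
# run by a perturbation shared across levels plus independent noise.
rng = np.random.default_rng(1)
n_times, n_levels, ny, nx = 40, 4, 5, 6
common = rng.normal(size=(n_times, 1, ny, nx))
fc24 = rng.normal(size=(n_times, n_levels, ny, nx))
fc48 = fc24 + common + 0.5 * rng.normal(size=(n_times, n_levels, ny, nx))
valid_hours = np.tile([0, 6, 12, 18], n_times // 4)

diff = nmc_differences(fc24, fc48)
corr = correlation_by_hour(diff, valid_hours)
corr_03 = interpolate_in_time(corr, 3)  # halfway between 00:00 and 06:00
print(np.round(corr_03, 2))
```

A weighted average of two correlation matrices with non-negative weights summing to one is again a valid correlation matrix, which makes element-wise linear interpolation a convenient, if simple, choice.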

The horizontal and vertical components of the correlation matrix

The vertical correlation function was obtained directly as the sample
correlation across all vertical columns for each time of day. As an example,
the correlation matrix obtained for NO

Vertical correlation function for NO

The

Since the NMC data set includes only meteorological perturbations, it is expected to underestimate the total uncertainty of the CTM simulations. Hence, the standard deviations were not diagnosed from the NMC data set, but instead, an approach based on a posteriori diagnostics was taken. The approach, devised by Desroziers et al. (2005), is based on a set of identities, which relate the BECM and OECM to expressions that can be estimated statistically from a set of analysis and corresponding background fields.

First, the standard deviation

The background error covariance matrix cannot be uniquely expressed in
observation space. However, assuming that each observation only depends
(linearly) on a single model grid cell (i.e. horizontal interpolation is
neglected), then:

The identities (3) and (4) hold for an ideally defined analysis system, provided that the background and observation errors are normally distributed and assuming the observation operator is not strongly nonlinear.
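The behaviour of such diagnostics can be checked on a toy problem. The sketch below assumes the commonly quoted observation-space form of the Desroziers et al. (2005) diagnostics: the mean product of analysis-minus-background and observation-minus-background residuals estimates the background error variance, and the mean product of observation-minus-analysis and observation-minus-background residuals estimates the observation error variance. It uses a scalar setting in which each observation sees one grid cell directly (H = 1). The function name and the synthetic numbers are hypothetical, and the code is not the procedure used in this study.

```python
import numpy as np

def desroziers_scalar(y, xb, sigma_b, sigma_o):
    """Diagnose (sigma_b, sigma_o) from observation-space residuals.

    y, xb   : paired observations and background values (H = 1 assumed)
    sigma_b : background error standard deviation used in the analysis
    sigma_o : observation error standard deviation used in the analysis
    """
    gain = sigma_b**2 / (sigma_b**2 + sigma_o**2)
    d_ob = y - xb              # observation minus background
    xa = xb + gain * d_ob      # scalar analysis
    d_ab = xa - xb             # analysis minus background
    d_oa = y - xa              # observation minus analysis
    var_b = np.mean(d_ab * d_ob)
    var_o = np.mean(d_oa * d_ob)
    return np.sqrt(var_b), np.sqrt(var_o)

# Synthetic truth of zero with known, unbiased Gaussian errors.
rng = np.random.default_rng(0)
n = 200_000
true_sb, true_so = 3.0, 2.0
xb = rng.normal(0.0, true_sb, n)
y = rng.normal(0.0, true_so, n)

# Analysis built with the true statistics: the diagnostics return them.
sb_ok, so_ok = desroziers_scalar(y, xb, true_sb, true_so)
print("consistent setup :", sb_ok, so_ok)

# Analysis built with a wrong first guess: only the sum of the variances
# is recovered, the split between the two stays at the assumed ratio.
sb_fg, so_fg = desroziers_scalar(y, xb, 1.0, 1.0)
print("first guess      :", sb_fg, so_fg, sb_fg**2 + so_fg**2)
```

In this scalar toy, both diagnostics rescale the same innovation variance, so iterating cannot change the assumed ratio of background to observation error. Separating the two requires additional structure, such as spatially correlated background errors combined with uncorrelated observation errors.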

Furthermore, Eqs. (3) and (4) can be used to tune the parameters

In this work, the observation error covariance matrix

The choice of calibration periods representing both winter and summer
conditions was motivated by the strong seasonal variations in both O

Finally, the overall consistency could be evaluated by checking the identity
(Ménard et al., 2000)

The SILAM model was run for the year 2012 with and without assimilation. Since the 3D-Var analyses require no additional model integrations in the form of iterations or ensemble simulations, the hourly analyses increased the simulation runtime by only 10–15 %.

The effect of assimilation on the yearly-mean concentrations at the lowest
model level is shown in Fig. 3. On average, the
ozone concentrations are increased by the assimilation especially around the
Mediterranean Sea, which indicates corresponding low bias in the free model
run. The main changes in NO

Yearly mean concentration (

Refining the background and observation standard deviations iteratively both
improves the consistency of the assimilation setup as measured by the

The diagnosed observation and background error standard deviations for O

Especially for summertime night conditions, the values are higher than the values adopted in most of the earlier studies (Chai et al., 2007; Curier et al., 2012; Jaumouillé et al., 2012). However, the errors are comparable to the observation errors diagnosed using the CHIMERE model by Gaubert et al. (2014). The main error component is likely to be due to lack of representativeness: using the AIRNOW observation network, Chai et al. (2007) found standard deviations between 5 and 13 ppb for observations inside a grid cell with 60 km resolution. The maximum values occurred during nighttime.

The diagnosed observation and background error parameters are subject to uncertainty, since they are not uniquely determined (Schwinger and Elbern, 2010). Also, the parameters depend on the assumptions made regarding the correlation function. Nevertheless, the relative magnitude of observation errors during nighttime is interesting for interpreting the model-to-measurement comparisons.

The diagnosed background errors for ozone are between 5 and 9

The observation error standard deviation for NO

The BECM and OECM were adjusted to optimise self-consistency for 2 months
in 2011. To assess the robustness of the obtained formulations, the

Diagnosed background (dashed) and observation error (solid lines) standard
deviations (

As seen in Fig. 5,
the analyses using the adjusted BECM and OECM generally satisfy the
consistency relation better throughout the year, when compared to the
first guess values. The yearly-mean values for

Overall, the assimilation system is based on rather simplistic assumptions
regarding the background and observation error statistics. In addition to
computational efficiency, this approach benefits from having few tuning
parameters, and the remaining parameters (

Comparison of performance indicators for ozone in the 2012 reanalysis. The
scores are given for station sets “MACC” and “EMEP” as defined in
Sect. 2.2. For the analysis runs, scores are
shown for the different background error covariance matrices discussed in
Sect. 3. The unit of bias and RMSE is

Comparison of performance indicators for NO

Tables 3 and 4 present the analysis skill scores for runs with both first guess and final BECM and OECM, and for the free-running model with no assimilation.

In terms of correlation and RMSE, both analysis and free model runs show
better performance for predicting the daily maximum than hourly values. This
applies to both O

The comparison reveals a number of contrasts between the “MACC” and
“EMEP” validation stations. First, the free-running model shows better
performance for NO

The differences largely originate from the different representativeness and coverage of the MACC and EMEP station sets. As seen in Fig. 1, the EMEP network covers the computational domain more evenly than the MACC validation stations, which are concentrated in central Europe. Since the coverage of assimilation and MACC validation stations is similar, the average impact of assimilation is stronger on the MACC stations than on the EMEP stations.

For the free-running simulations, the better performance for O

For ozone, the assimilation had a variable effect on the model bias. While
the correlation and RMSE were always improved by assimilation, the analyses
have slightly larger negative mean bias (

The

Diurnal variation of model bias (

For NO

The analysis scheme assumes an unbiased model, and hence, the negative bias present in the free-running simulations is reduced but not removed in the analysis fields. The assimilation setup including tuned OECM and BECM produces more biased analyses compared to the first guess setup, as seen in Fig. 6. This is a consequence of the differences between the diagnosed and first guess background and observation error standard deviations. Contrary to the tuned setup, the first guess attributes most of the model-observation discrepancy to the background error, which results in stronger increments towards the observed values. Consequently, the analysis bias is smaller. However, the tuned assimilation setup has consistently better RMSE and correlation than the first guess assimilation setup.
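The mechanism can be illustrated with a scalar sketch. The setting is an assumption made here for illustration and is much simpler than the multivariate system used in this study: a single unbiased observation y of one model variable, with true value x_t and a background x_b whose error has mean \beta.

```latex
\begin{align}
  x_a &= x_b + k\,(y - x_b), \qquad
  k = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_o^2}, \\
  \mathrm{E}[x_a - x_t]
      &= (1 - k)\,\mathrm{E}[x_b - x_t] + k\,\mathrm{E}[y - x_t]
       = (1 - k)\,\beta .
\end{align}
```

A larger assumed \sigma_b relative to \sigma_o moves k towards 1 and shrinks the analysis bias, whereas a smaller ratio leaves more of the background bias in the analysis.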

Since the analysis bias is mainly a consequence of a bias in the forecast model, the bias should be addressed primarily by improving the model. As shown by Dee (2005), model biases can, in principle, also be addressed by the assimilation system. However, a bias correction scheme should be implemented with care, since observational biases could also arise due to representativeness errors.

In addition to computing the regular statistical indicators for daily
maxima, we evaluated the hit rates (the number of correctly predicted
exceedances divided by the number of observed exceedances) for the 180

For NO

The interference can lead to overestimation of NO

The O

In order to quantify the usefulness of data assimilation in forecast applications, a set of simulations without data assimilation was generated using the analysis fields at 00:00 UTC as initial conditions. The forecast experiment covered the time period between 16 July and 5 August 2012.

The effect of chemical data assimilation on forecast performance was
assessed as a function of the forecast lead time. Figures 7 and 8 present
the correlation and bias for the O

The model bias (

As Fig. 7, but for NO

For ozone, the forecast improvements due to data assimilation were largely limited to the first 24 h of forecast. Also, the forecast initialised at 00:00 UTC from the analysis shows a larger negative bias for the daytime than the free model run. This is a result of the corresponding nighttime positive bias of the free model run. The bias is effectively removed in the 00:00 UTC analysis; however, the subsequent forecast is unable to recover the level observed during daytime. The correlation coefficient during daytime is nevertheless improved slightly (from 0.75 to 0.78) by initialising from the analysis. While the forecast shows somewhat reduced positive bias for the hours between 18 and 30 (i.e. the following night), the subsequent daytime scores are already almost unchanged by assimilation. The results in Fig. 7 are computed for the MACC station network; a similar impact is observed at the EMEP stations.

Due to the shorter chemical lifetime, the effect of initial condition on
forecasts of NO

In the forecast experiments performed within this study, the effect of
assimilation on NO

The forecast for short-lived pollutants like NO

An assimilation system coupled to the SILAM chemistry transport model has
been described along with its application in reanalysis of ozone and NO

The assimilation consistently improves the model-measurement comparison for
stations not included in the assimilation. For daily maximum values, the
correlation coefficient is improved over the free-running model from 0.8 to
0.9 for O

During a 3-week forecast experiment, initialising the forecasts from the
analysis fields provided an improvement in ozone forecast skill for a
maximum of 24 h. For NO

The diagnosed observation error standard deviations for ozone have a strong
diurnal variation, and reach up to about 21

The 3D-Var based assimilation has a low computational overhead. This makes
it especially suitable for reanalyses on yearly or longer time scales, as
well as for high-resolution forecasting under operational time constraints.
Future studies will include more accurate characterisation of station
representativeness, as well as further investigation of model biases for O

The source code for SILAM v5.3, including the data assimilation component, is available on request from the authors (julius.vira@fmi.fi, mikhail.sofiev@fmi.fi).

This work has been supported by the FP7 projects MACC and MACC-II and the NordForsk project EmblA. The authors thank Marje Prank for constructive comments on the manuscript.

Edited by: A. Colette