In this study, we developed a data assimilation (DA)
system for chemical transport model (CTM) simulations using an ensemble
Kalman filter (EnKF) technique. This DA technique is easy to implement in an
existing system without seriously modifying the original CTM and can
provide flow-dependent corrections based on error covariance by short-term
ensemble propagations. First, the PM
Among many air pollutants, particular attention has been paid to the issue
of atmospheric aerosols in East Asia and South Korea, where large
anthropogenic emissions from growing economic activities cause frequent
episodes of high air pollution. Several environmental and epidemiological studies
have suggested that continual exposure to particulate matter with
aerodynamic diameter smaller than 2.5
To improve the accuracy of short-term predictions via CTM simulations, chemical data assimilation (DA) has been proposed as an effective method to reduce the uncertainties in the CTM parameters (e.g., Sandu and Chai, 2011; Zhang et al., 2012a, b; Bocquet et al., 2015; Menut and Bessagnet, 2019). Chemical DA is a technique for integrating information provided by noisy observations and imperfect background estimations from CTM simulations. In theory, integrating these two sources of information better represents the true state of the chemical atmosphere. DA techniques such as optimal interpolation (OI: Lorenc, 1981), the three-dimensional variational method (3D-Var: Lorenc, 1986; Parrish and Derber, 1992; Rabier et al., 1998), the four-dimensional variational method (4D-Var: Talagrand and Courtier, 1987; Courtier et al., 1994; Rabier et al., 2000), and the ensemble Kalman filter (EnKF: Evensen, 2003) have been predominantly applied in numerical weather prediction (NWP) (Kalnay, 2002). Although their use in air quality prediction has been limited, these techniques have more recently begun to be applied in this field as well. To date, several DA methods have been applied to reduce the uncertainties in model input parameters, including ICs (e.g., Elbern and Schmidt, 2001; Park et al., 2016), BCs (e.g., Roustan and Bocquet, 2006), and emission fluxes (e.g., Elbern et al., 2007).
For the past 2 decades, various DA algorithms have been applied, especially to aerosol prediction studies. Several studies have focused on assimilating aerosol observations via OI (Lee et al., 2013; Park et al., 2011, 2014; Tang et al., 2015, 2017; Chai et al., 2017; Lee et al., 2020), 3D-Var (Pagowski et al., 2010; Liu et al., 2011; Schwartz et al., 2012; Saide et al., 2013; Jiang et al., 2013; Li et al., 2013; Pang et al., 2018; Ha et al., 2020), and 4D-Var (Benedetti et al., 2019; Morcrette et al., 2009). All the previous studies mentioned above have reported that the OI, 3D-Var, and 4D-Var assimilations using satellite-retrieved or ground-based observations led to improved aerosol predictability.
Even so, each of these DA methods has its own limitations. OI and 3D-Var
usually employ isotropic corrections due to a static (i.e., time-invariant)
background error covariance (BEC) based on model climatological profiles.
Although 4D-Var has been reported to show better performance than OI
and 3D-Var, it requires constant development and maintenance of a tangent
linear and adjoint model, which may be a time-consuming and labor-intensive
task (Skachko et al., 2014). On the other hand, EnKF is
relatively easy to implement without requiring a tangent linear or adjoint
model and can easily compute flow-dependent BEC from short-term ensemble
predictions. This flow dependence of the BEC is one of the main reasons
behind the possible success of the EnKF method compared to other DA
methods. Several studies (Tang et al., 2011; Pagowski and Grell, 2012;
Yumimoto and Takemura, 2015; Rubin et al., 2016; Yumimoto et al., 2016; Peng
et al., 2017, 2018; Lopez-Restrepo et al., 2020) applied the
EnKF DA approach to improve the accuracy of air quality prediction via
assimilating surface and/or satellite observations. For example,
Yumimoto et al. (2016) applied the EnKF method with
satellite-retrieved aerosol observations to evaluate the effectiveness of
the DA on dust forecasts and found improved agreement between the
predictions and observations. More recently, Peng et al. (2017) reported significant improvements in PM
To optimize the ICs, two studies (Lin et al., 2008; Candiani et al., 2013) assimilated ground-based aerosol observations with different variants of EnKF DA algorithms. However, few studies have applied the EnKF method to examine the importance of BCs, even though BCs carry important information when long-range transport is significant. For example, Constantinescu et al. (2007a) extended the EnKF method to consider lateral BCs and correct emission flux factors in the assimilation process by solving the state parameter estimation problem. Other than that study, no prior study has applied the EnKF method to this type of research, particularly with the Community Multiscale Air Quality (CMAQ) model.
This work is a new endeavor to develop an EnKF DA system for the CMAQ model.
The period of the KORUS–AQ campaign 2016 (1 May to 11 June 2016) was chosen
to be the target period to test the developed EnKF DA system, since this
period includes well-defined and various types of air pollution episodes,
e.g., yellow dust events, stagnant high-PM episodes, long-range transport
events, and rainy days (Peterson et al., 2019; Jordan et al., 2020). To
improve the predictability of PM
We believe that this study is distinguishable from other EnKF studies in
three aspects: (i) the EnKF chemical DA system was first developed to
assimilate PM
Simulation domains with nested modeling. D1 and D2 represent the mother and daughter domain, respectively. The locations of ground stations in China (D1) and South Korea (D2) are marked on the maps with green dots. In D1 (left), the northeastern China (NEC), northern China (NC), and eastern China (EC) regions that frequently influence air quality in South Korea are grouped with olive, violet, and coral colors, respectively. The red star symbol indicates Baekryeong-do observatory, where the evaluation of boundary inflow was made. Jeju Island in D2 (right) is an ideal location to see the flow-dependent correction by the EnKF DA. The total number of available stations used in EnKF data assimilation is also shown in both domains.
The remainder of this paper is organized as follows. Section 2 describes the methodology of this study, including the DA algorithm, CTM, observations, and experimental settings. Section 3.1 discusses the effects of assimilation of ground-based observations and then compares the results with those from 3D-Var based on the reanalysis results. Section 3.2 provides the results of improved BCs in 1 d prediction simulations. Section 3.3 quantifies the contributions of updating ICs and BCs with statistical analysis. Finally, Sect. 4 concludes the paper.
The EnKF is a DA technique first introduced by Evensen (1994), which was an approximate version of the Kalman filter (KF) (Kalman, 1960). The basic principle of the KF is to estimate a true state, while minimizing the variances of the state with a linear combination of the best estimates of the model and the observations. The optimal state estimated from the KF shows less uncertainty than the model predictions and observations. This optimal state is called the “analysis”. To apply the KF to a nonlinear model, a tangent linear model needs to be constructed, as does its adjoint. However, the EnKF requires neither a tangent linear model nor its adjoint, since it employs Monte Carlo approximation that can estimate the model error covariances using finite ensemble simulations (Evensen, 1994). In particular, the model error covariances used in the EnKF technique are flow-dependent, which is one of the major differences from other DA methods.
The theoretical foundation of the EnKF method proposed by Evensen
(2003) is briefly presented below.
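As a rough illustration of the perturbed-observation EnKF analysis step, the following minimal sketch updates a toy ensemble with a single observation. It assumes a linear observation operator and uses generic names (`X_b`, `H`, `R`, `enkf_analysis`) and toy numbers chosen only for this example; it is not the code used in this study, and localization and inflation are omitted.

```python
import numpy as np

def enkf_analysis(X_b, y, H, R, rng):
    """Perturbed-observation EnKF analysis (illustrative sketch).

    X_b : (n_state, n_ens) background ensemble
    y   : (n_obs,) observation vector
    H   : (n_obs, n_state) linear observation operator (assumption)
    R   : (n_obs, n_obs) observation error covariance
    """
    n_ens = X_b.shape[1]
    # Ensemble anomalies about the ensemble mean
    A = X_b - X_b.mean(axis=1, keepdims=True)
    HA = H @ A
    # Sample covariances estimated from the ensemble
    PHt = A @ HA.T / (n_ens - 1)     # P H^T
    HPHt = HA @ HA.T / (n_ens - 1)   # H P H^T
    # Kalman gain K = P H^T (H P H^T + R)^{-1}, via a linear solve
    K = np.linalg.solve((HPHt + R).T, PHt.T).T
    # Each member assimilates an independently perturbed observation
    noise = rng.multivariate_normal(np.zeros(len(y)), R, size=n_ens).T
    Y = y[:, None] + noise
    return X_b + K @ (Y - H @ X_b)

# Toy example: three state variables, 40 members, first variable observed
rng = np.random.default_rng(0)
X_b = 50.0 + 10.0 * rng.standard_normal((3, 40))
y = np.array([30.0])
H = np.array([[1.0, 0.0, 0.0]])
R = np.array([[4.0]])
X_a = enkf_analysis(X_b, y, H, R, rng)
```

In this toy case the analysis mean of the observed variable moves toward the observation and its ensemble spread shrinks, which is the qualitative behavior expected of the filter.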
The practical approaches to implement Eqs. (1)–(6) are described as
follows. First, through multiple preliminary sensitivity tests that
considered both model performance and computational cost, the total
number of ensembles (
Second, the diagonal components in the observation error covariance matrix,
Because almost no observation locations exactly match the uniform model grid
points, an observation operator,
Third, the method to generate the ensemble spread for the model
(
The initial ensembles were created by perturbing the background values of
state vector,
For perturbing BCs and emission rates, we used time-correlated noise
(colored noise) to maintain the temporal evolution of those parameters and
to avoid rapid fluctuations of the perturbations. The method of adding
colored noise is the same as that described in Tang et al. (2011).
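One common way to generate time-correlated noise is a first-order autoregressive, AR(1), process. The sketch below is only an illustration under that assumption; the function name `colored_noise` and the parameters `tau` (correlation timescale), `dt` (time step), and `sigma` (stationary standard deviation) are hypothetical and are not taken from this study or from Tang et al. (2011).

```python
import numpy as np

def colored_noise(n_steps, tau, dt, sigma, rng):
    """AR(1) time-correlated noise with stationary std `sigma` (sketch)."""
    # Lag-one correlation implied by the correlation timescale tau
    alpha = np.exp(-dt / tau)
    eps = np.empty(n_steps)
    eps[0] = sigma * rng.standard_normal()
    for k in range(1, n_steps):
        # The sqrt(1 - alpha^2) factor keeps the variance constant in time
        white = rng.standard_normal()
        eps[k] = alpha * eps[k - 1] + sigma * np.sqrt(1.0 - alpha**2) * white
    return eps

# Hypothetical settings: hourly steps with a 6 h correlation timescale
rng = np.random.default_rng(0)
eps = colored_noise(n_steps=20000, tau=6.0, dt=1.0, sigma=0.3, rng=rng)
```

Perturbations of this kind vary smoothly from one hour to the next, unlike white noise, which would change sign and magnitude independently at every step.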
In theory, an infinitely large ensemble of model states would provide the most realistic estimate of the model error. However, because of limited computational resources, ensembles of finite size are used to approximate the error covariance matrix, and a limited ensemble size causes sampling errors. A small ensemble may underestimate the prediction error covariances, which leads to “filter divergence” (Houtekamer and Mitchell, 1998), and may produce spurious corrections in regions remote from the observation locations, which is called “spurious correlation” (Constantinescu et al., 2007b). To avoid such filter divergence and spurious correlation, we applied covariance inflation and localization. The Gaspari–Cohn piecewise polynomial (Gaspari and Cohn, 1999) with a horizontal width of 100 km and a vertical width of 2 km was used to prevent spurious correlation by localizing the model error covariances. We determined these horizontal and vertical limits through a sensitivity test; they were small enough to remove the spurious correlation but large enough to encompass the spatial error correlation estimated by ensemble predictions (Eqs. 5 and 6). In addition, the relaxation-to-prior-spread (RTPS) inflation method (Whitaker and Hamill, 2012) was applied against filter divergence by inflating the ensemble spread before and after the DA. An inflation factor of 1.0 was chosen through experimentation, whereas Pagowski and Grell (2012) applied an inflation factor of 1.2 for both meteorological and aerosol variables, and Schwartz et al. (2014) used inflation factors of 1.12 and 1.2 for meteorological variables and aerosol species, respectively. Because we did not perturb any meteorological variables, in order to retain the dynamic balances (i.e., assuming no uncertainty in the meteorological model), the spreads in the predicted (or propagated) ensemble were occasionally less than the observation spread. Therefore, we inflated the ensemble spreads before and after the DA rather than using an inflation factor larger than 1.0. To inflate the predicted ensemble (before DA), we used the spread at the previous analysis time (e.g., 6 h before propagation).
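The localization and inflation steps can be sketched as follows. This is an illustration only: it uses the standard fifth-order Gaspari–Cohn function and the standard RTPS formula, and it assumes that the localization length `c` is the half-width at which the taper changes branch (the text does not say whether the 100 km width is this half-width or the full cutoff). The function names are hypothetical.

```python
import numpy as np

def gaspari_cohn(dist, c):
    """Gaspari-Cohn fifth-order taper; equals zero beyond a distance of 2c."""
    r = np.abs(np.atleast_1d(np.asarray(dist, dtype=float))) / c
    w = np.zeros_like(r)
    inner = r <= 1.0
    outer = (r > 1.0) & (r < 2.0)
    ri, ro = r[inner], r[outer]
    w[inner] = (-0.25 * ri**5 + 0.5 * ri**4 + 0.625 * ri**3
                - (5.0 / 3.0) * ri**2 + 1.0)
    w[outer] = (ro**5 / 12.0 - 0.5 * ro**4 + 0.625 * ro**3
                + (5.0 / 3.0) * ro**2 - 5.0 * ro + 4.0 - 2.0 / (3.0 * ro))
    return w

def rtps_inflate(X_a, X_b, alpha=1.0):
    """Relaxation-to-prior-spread inflation of the analysis ensemble.

    X_a, X_b : (n_state, n_ens) analysis and background ensembles
    alpha    : relaxation factor; alpha = 1 restores the prior spread
    """
    mean_a = X_a.mean(axis=1, keepdims=True)
    sig_b = X_b.std(axis=1, ddof=1, keepdims=True)
    sig_a = X_a.std(axis=1, ddof=1, keepdims=True)
    # Guard against division by zero for a collapsed ensemble
    sig_a = np.maximum(sig_a, 1e-12)
    factor = 1.0 + alpha * (sig_b - sig_a) / sig_a
    return mean_a + factor * (X_a - mean_a)
```

In a typical use, the taper weights computed from the distance between each grid point and each observation multiply the ensemble-estimated covariances element by element (a Schur product), so that corrections vanish far from the observation, and `rtps_inflate` is then applied to the analysis ensemble.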
An analysis state generated by 3D-Var is obtained by minimization of the cost
function as follows:
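For reference, the 3D-Var cost function in its standard textbook form is sketched below. The symbols are generic and may differ from the notation used in this paper: $\mathbf{x}^{b}$ is the background state, $\mathbf{B}$ the background error covariance, $\mathbf{y}$ the observation vector, $H$ the observation operator, and $\mathbf{R}$ the observation error covariance.

```latex
J(\mathbf{x}) =
  \frac{1}{2}\left(\mathbf{x}-\mathbf{x}^{b}\right)^{\mathrm{T}}
  \mathbf{B}^{-1}\left(\mathbf{x}-\mathbf{x}^{b}\right)
  + \frac{1}{2}\left(\mathbf{y}-H(\mathbf{x})\right)^{\mathrm{T}}
  \mathbf{R}^{-1}\left(\mathbf{y}-H(\mathbf{x})\right)
```

The first term penalizes departures from the background and the second penalizes departures from the observations; the analysis is the state that minimizes their sum.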
In this study, the EnKF DA algorithm was developed for the Weather Research and Forecasting (WRF)–CMAQ modeling system. The WRF–CMAQ system was run in offline mode, which means that the CMAQ model runs were performed sequentially after the meteorological fields were generated by the WRF model. This section briefly describes the two numerical models, input fields (e.g., emission and meteorology), simulation domains, and observation data used for the DA.
The WRF version 3.8.1 (Skamarock et al., 2008) with the
Advanced Research WRF (ARW) dynamical core was used to produce
meteorological fields for the CMAQ model simulations. The ARW dynamical core
employs fully compressible and nonhydrostatic Euler equations, together
with Arakawa C-grid staggering. In the WRF simulations, the final (FNL)
operational global analysis data produced by the NCEP (Saha et al., 2010)
were used for the ICs and BCs. Temporal and spatial resolutions of the FNL
data are 6 h and 0.25
WRF model configurations selected in this study.
The CMAQ model v5.1 (Byun and Ching, 1999; Byun and Schere, 2006) was used in this study to simulate the atmospheric photochemistry, aerosol dynamics, aerosol thermodynamics, and transport of atmospheric species. The CMAQ runs used two domains, in accordance with our experimental purposes. The horizontal resolutions of the mother domain (D1) and daughter domain (D2) are 27 and 9 km, respectively, with 15 vertical layers and a model top at 20 km. Table 2 lists the CMAQ model configuration.
CMAQ model configurations selected in this study.
The mother domain (D1) for the CMAQ model simulations covers northeastern Asia
including China, the Korean Peninsula, and Japan, and the daughter domain
(D2), nested in D1, targets South Korea (refer to Fig. 1). With this nesting
configuration, we intended to examine how the BCs provided by D1 affect
the PM
Domain descriptions for WRF and CMAQ models.
Emission data are another important input to the CMAQ model simulations. KORUS v2.0 emission fields (Jang et al., 2020) were employed for anthropogenic emissions in the two domains. This emission inventory also supported official CTM simulations for the KORUS–AQ field campaign in 2016. To prepare biogenic emissions, the Model of Emissions of Gases and Aerosols from Nature (MEGAN v2.1; Guenther et al., 2006, 2012) was run with MODIS land cover data (Friedl et al., 2010), together with MODIS-derived leaf area index (LAI) (Myneni et al., 2002; Yuan et al., 2011). For the MEGAN runs, the same meteorological fields generated from the WRF model simulations were used. For fire emissions, the Fire Inventory from NCAR (FINN) was used (Wiedinmyer et al., 2006, 2011).
The observation data used in the EnKF DA experiments were PM
For the control run (CTR) without DA, hourly CMAQ predictions were first conducted in D1 to generate the BCs for D2. Using these BCs, we then ran 24 h CMAQ predictions over D2 each day from 25 April to 12 June 2016, with the first 5 d used for spin-up and the sixth day used as an adaptation period for the EnKF DA. To provide the meteorological inputs for the CMAQ model runs over D2, the WRF model simulations were initialized each day 12 h before the CMAQ initialization, and the first 12 h of each simulation were regarded as the spin-up time of the meteorological model. Each 24 h CMAQ prediction was initialized with the last-hour output of the previous 24 h prediction.
The initial ensemble of 40 runs was made based on the CTR output obtained
at 00:00 UTC on 30 April by perturbing ICs, as described in Sect. 2.1. The
ensemble propagations of the CMAQ model simulations started at 00:00 UTC on 30 April. The DA interval for reanalysis purposes was determined to be 6 h. At
the end of the first 6 h prediction (or propagation) of this initial
ensemble, the first EnKF DA of PM
Schematic flowchart for the experiments performed in this study.
To evaluate PM
In addition to the CTR run, the two experiments labeled DA_ic (Fig. 2a) and DA_icbc (Fig. 2b) were also made over South
Korea (D2). In both the DA_ic and DA_icbc
runs, ground-level PM
In the DA_ic experiment, we updated only ICs, while in the
DA_icbc experiment, we updated both the ICs and the BCs. The
goal of this experimental setup is to evaluate the degree to which
the EnKF DA technique could enhance the PM
Figure 3 shows the daily variations of surface PM
Daily variations of surface PM
Figure 4a and b present the horizontal distributions of surface
PM
Snapshots of the horizontal distributions of PM
Figure 5 presents the average diurnal variations generated by aggregating
the PM
Average diurnal variations of PM
Statistical metrics for the experiments of DA_ic and DA_icbc. Experiments were evaluated for the 6-hourly assimilated analysis run (ANL) and for the 1 d prediction run (PRD). The ANL run using 3D-Var in the DA_ic experiment is included for comparison.
In the previous section, we examined the effects of the initial fields (the DA_ic experiment) in South Korea. The influences of the updated ICs tend to disappear quickly over the relatively small domain (D2), particularly when atmospheric flows are fast. In this section, we additionally assimilated ground observations from China in D1, alongside the ground observations from South Korea (the DA_icbc experiment). The DA_ic and DA_icbc experimental results were again compared in South Korea, which is our main domain of interest. Although the prediction strategy (refer to Fig. S1 of the Supplement) was the same in the DA_icbc experiment, only PRD runs are shown in this section for simplicity.
Figure 6 shows the averaged PM
Averaged PM
The middle and bottom panels of Fig. 6 show that at all the boundaries, the
DA_icbc experiment exhibited higher PM
Baekryeong-do, South Korea (shown with a star symbol in Fig. 1), is a ground
station where the influence of the BCs can be checked. This is because
Baekryeong-do is located at the western end of D2 (near the western
boundary) and is minimally affected by local inland emissions (i.e., there
are no major industries and only a small population lives on the island).
Figure 7a shows the averaged diurnal variations of
PM
Averaged diurnal variations of PM
Figure 8 presents the daily variations of PM
Daily averaged variations of PM
To evaluate the PM
Averaged diurnal variations of PM
Table 4 summarizes the statistical performance metrics that were calculated
to evaluate the model performance. Table S1 of the Supplement provides the
mathematical definitions for the performance metrics. Because the
evaluations were conducted using hourly data, including the prediction
hours, we did not use spatially independent observations. Moreover, it
is difficult to randomly select sparse observation sites (D2 in Fig. 1).
Therefore, the statistical metrics were calculated including the same
observation data as those used in DA and were compared under the same
conditions for all the experiments. The average PM
To investigate the quantitative contributions of the ICs and BCs to the
model performance, we calculated the “rate of improvement (ROI)” with
respect to the PRD results (see Table 5). The ROIs are defined by the ratios
of enhanced (
Rate of improvement (ROI) by EnKF data assimilation in 1 d
predictions. The ROI is the ratio of the enhanced (
To improve PM
This study also highlighted the importance of updating BCs to further
enhance the PM
Recently, the EnKF has also been used to assimilate satellite-retrieved
aerosol observations (e.g., Sekiyama et al., 2010; Yin et al., 2016). Other groups have also used the EnKF method for
the joint optimization of ICs and emission scaling factors (e.g., Tang et
al., 2011; Peng et al., 2017, 2018). As we have shown that the
consideration of transboundary air pollution is of significance in the
PM
Throughout this study, the DA method of “perturbed observation EnKF” (first proposed by Evensen, 2003) was employed. However, there are some popular variants of the EnKF method that obviate the need to perturb observations, such as the ensemble square root filter (EnSRF; Whitaker and Hamill, 2002), ensemble adjustment Kalman filter (EAKF; Anderson, 2001), and local ensemble transform Kalman filter (LETKF; Hunt et al., 2007). Two of these EnKF variants are being tested to alleviate the sampling errors in the observation ensemble, and the results will be reported in the near future in the context of further development of the ensemble data assimilation and the Korean air quality prediction system.
The WRF model v3.8.1 (DOI:
The supplement related to this article is available online at:
SYP and CHS designed this study and experiments. SYP, UKD, KY, and IU developed the EnKF code and discussed the results. SYP and UKD carried out the simulations, produced the figures, and prepared the initial paper draft. JY performed the 3D-Var experiments and provided all the input data for the CMAQ model. CHS contributed to the final writing with comments from all co-authors.
The contact author has declared that neither they nor their co-authors have any competing interests.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This research was supported by the FRIEND (Fine Particle Research Initiative in East Asia Considering National Differences) project (2020M3G1A1114617) and Basic Science Research Program (2021R1A2C1006660) of the National Research Foundation of Korea (NRF) with a grant funded by the Ministry of Science and ICT (MSIT). We also appreciate the comments of the reviewers that helped us to improve this article.
This research has been supported by the National Research Foundation of Korea (grant nos. 2020M3G1A1114617 and 2021R1A2C1006660).
This paper was edited by Havala Pye and reviewed by two anonymous referees.