<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing with OASIS Tables v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpub-oasis3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://docs.oasis-open.org/ns/oasis-exchange/table" xml:lang="en" dtd-version="3.0" article-type="research-article">
  <front>
    <journal-meta><journal-id journal-id-type="publisher">GMD</journal-id><journal-title-group>
    <journal-title>Geoscientific Model Development</journal-title>
    <abbrev-journal-title abbrev-type="publisher">GMD</abbrev-journal-title><abbrev-journal-title abbrev-type="nlm-ta">Geosci. Model Dev.</abbrev-journal-title>
  </journal-title-group><issn pub-type="epub">1991-9603</issn><publisher>
    <publisher-name>Copernicus Publications</publisher-name>
    <publisher-loc>Göttingen, Germany</publisher-loc>
  </publisher></journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5194/gmd-19-4835-2026</article-id><title-group><article-title>OIRF-LEnKF v1.0: a novel data assimilation system by integrating incremental machine learning with a localized EnKF for enhanced PM<sub>2.5</sub> chemical component simulation and reanalysis</article-title><alt-title>OIRF-LEnKF v1.0: a novel data assimilation system by integrating incremental machine learning</alt-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Li</surname><given-names>Hongyi</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="yes" rid="aff1">
          <name><surname>Yang</surname><given-names>Ting</given-names></name>
          <email>tingyang@mail.iap.ac.cn</email>
        <ext-link>https://orcid.org/0000-0001-5605-0654</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Kong</surname><given-names>Lei</given-names></name>
          
        <ext-link>https://orcid.org/0000-0003-1162-2158</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff2">
          <name><surname>Zhang</surname><given-names>Di</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff2">
          <name><surname>Tang</surname><given-names>Guigang</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1 aff3">
          <name><surname>Tang</surname><given-names>Xiao</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1 aff3">
          <name><surname>Wang</surname><given-names>Zifa</given-names></name>
          
        </contrib>
        <aff id="aff1"><label>1</label><institution>State Key Laboratory of Atmospheric Environment and Extreme Meteorology, Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing 100029, China</institution>
        </aff>
        <aff id="aff2"><label>2</label><institution>China National Environmental Monitoring Centre, Beijing, China</institution>
        </aff>
        <aff id="aff3"><label>3</label><institution>College of Earth and Planetary Sciences, University of Chinese Academy of Sciences, Beijing 100049, China</institution>
        </aff>
      </contrib-group>
      <author-notes><corresp id="corr1">Ting Yang (tingyang@mail.iap.ac.cn)</corresp></author-notes><pub-date><day>10</day><month>June</month><year>2026</year></pub-date>
      
      <volume>19</volume>
      <issue>11</issue>
      <fpage>4835</fpage><lpage>4856</lpage>
      <history>
        <date date-type="received"><day>13</day><month>August</month><year>2025</year></date>
           <date date-type="rev-request"><day>11</day><month>September</month><year>2025</year></date>
           <date date-type="rev-recd"><day>24</day><month>December</month><year>2025</year></date>
           <date date-type="accepted"><day>21</day><month>February</month><year>2026</year></date>
      </history>
      <permissions>
        <copyright-statement>Copyright: © 2026 Hongyi Li et al.</copyright-statement>
        <copyright-year>2026</copyright-year>
      <license license-type="open-access"><license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p></license></permissions><self-uri xlink:href="https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026.html">This article is available from https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026.html</self-uri><self-uri xlink:href="https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026.pdf">The full text article is available as a PDF file from https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026.pdf</self-uri>
      <abstract><title>Abstract</title>

      <p id="d2e160">Assimilating observational data into numerical simulation is crucial for accurately estimating the spatiotemporal distribution of PM<sub>2.5</sub> chemical components (NH<inline-formula><mml:math id="M3" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M4" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, SO<inline-formula><mml:math id="M5" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, and BC), which is beneficial to quantifying the impact of aerosols on the environment, climate change and human health. However, chemical transport model (CTM)-based data assimilation (DA) is computationally inefficient for large ensemble sizes and offers limited improvements in simulation skill, as it solely provides optimal initial conditions. This paper introduces an incrementally updatable machine learning-based data assimilation system (Optimized Incremental Random Forest coupled with Localized Ensemble Kalman Filter, OIRF-LEnKF v1.0) that achieves high efficiency and high quality in generating background and analysis fields for chemical components. Computational efficiency tests indicate that the total time consumed by OIRF-LEnKF v1.0 constitutes only 11.41 %–16.60 % of that of CTM-based DA, primarily because the simulation process requires only 0.13 %–0.20 % of the CTM computation time. 
Sensitivity tests demonstrate that the incremental learning during the simulation process enhances the percentage change of the Pearson correlation coefficient relative to its minimum value (<inline-formula><mml:math id="M6" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>CORR) by 2.43 %–11.75 % and reduces the percentage change of the RMSE relative to its maximum value (<inline-formula><mml:math id="M7" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>RMSE) by 32.55 %–40.36 %, compared to the stationary training mechanism. A 2-month DA experiment reveals that the RMSE values of chemical components after DA are less than 7.80 and 2.36 <inline-formula><mml:math id="M8" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> during the simulation and analysis processes, respectively, indicating reductions of at least 26.38 % and 68.99 % compared to values without DA. Notably, the RMSE values of our system during the simulation process exhibit a significant reduction of 33.16 %–90.10 % compared to those of the CTM-based DA, highlighting the superior simulation capability of our system. Furthermore, the spatial overestimation and underestimation of chemical components have been significantly mitigated following DA. Compared to multiple reanalysis datasets of inorganic salt aerosols (CORR: 0.56–0.89, RMSE: 2.55–8.52 <inline-formula><mml:math id="M10" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>), the dataset generated by OIRF-LEnKF v1.0 (CORR: 0.97, RMSE: 1.12 <inline-formula><mml:math id="M12" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) demonstrates higher data quality.</p>
  </abstract>
    
<funding-group>
<award-group id="gs1">
<funding-source>National Natural Science Foundation of China</funding-source>
<award-id>42422506</award-id>
<award-id>42275122</award-id>
</award-group>
</funding-group>
</article-meta>
  </front>
<body>
      

<sec id="Ch1.S1" sec-type="intro">
  <label>1</label><title>Introduction</title>
      <p id="d2e296">Sulfate (SO<inline-formula><mml:math id="M14" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>), nitrate (NO<inline-formula><mml:math id="M15" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>), ammonium (NH<inline-formula><mml:math id="M16" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>), organic carbon (OC), and black carbon (BC) are critical chemical components of fine particulate matter (PM<sub>2.5</sub>) (Huang et al., 2014). The physicochemical processes of these chemical components within the atmospheric boundary layer, including chemical conversion, transboundary transport and deposition, directly influence air quality associated with PM<sub>2.5</sub> (Yang et al., 2024). Observational studies reveal that the contribution of transboundary transport increased from 4 %–8 % to 66 %–80 % during severe PM<sub>2.5</sub> pollution episodes (Sun et al., 2016). Furthermore, these components with varying physicochemical properties exert varying impacts on human health (Li et al., 2022) and climate change (Stier et al., 2024; Zhao et al., 2024). Therefore, characterizing the spatiotemporal distribution and evolution of PM<sub>2.5</sub> chemical components provides a scientific basis for identifying the causes of air pollution, assessing health and climate impacts, and developing effective climate change mitigation strategies and emission pathways.</p>
      <p id="d2e375">Observation techniques, machine learning (ML) methods, and chemical transport models (CTMs) are the primary approaches for acquiring mass concentrations of PM<sub>2.5</sub> chemical components. Observation techniques achieve high-precision measurements through field sampling and instrument analysis (Wang et al., 2016; Lei et al., 2021). However, the sparse distribution of observation points, limited observation pathways, inconsistencies in observation platforms, and measurement errors hinder the acquisition of continuous measurements with high spatiotemporal coverage. ML methods utilize historical observations to establish mapping relationships between features of non-chemical and chemical components, thereby reconstructing the mass concentrations of chemical components continuously without the need for traditional instrument measurements (Li et al., 2025; Wei et al., 2023; Liu et al., 2022). However, ML methods are limited by the lack of physicochemical constraints and insufficient spatiotemporal representativeness of historical observations, which results in inadequate generalization capabilities and interpretability. CTMs can characterize the spatiotemporal distribution and evolution of chemical components by solving equations that describe physicochemical mechanisms rather than relying on observations (Weagle et al., 2018). However, the uncertainties in physicochemical mechanisms, emission inventories, meteorological fields, as well as initial and boundary conditions result in significant simulation bias (Miao et al., 2020; Xie et al., 2022; Luo et al., 2023).</p>
      <p id="d2e387">Data assimilation (DA) can integrate observations from sparse sites with CTM simulations to estimate an optimal initial state with spatial continuity and high accuracy based on the model background field (Geer, 2021). DA has been widely used to generate reanalysis datasets of PM<sub>2.5</sub> chemical components at global and national scales, such as the Copernicus Atmosphere Monitoring Service ReAnalysis (CAMSRA) (Inness et al., 2019), the Modern-Era Retrospective Analysis for Research and Applications Version 2 (MERRA-2) (Randles et al., 2017), and the Air Quality ReAnalysis in China dataset (CAQRA-aerosol) (Kong et al., 2025). However, these datasets assimilate only aerosol optical depth and surface-level conventional atmospheric pollutants, and therefore improve the simulation of chemical components only indirectly. Consequently, the correlation between observations and these datasets is limited (<inline-formula><mml:math id="M23" display="inline"><mml:mi>R</mml:mi></mml:math></inline-formula>: 0.21 to 0.7) (Kong et al., 2025).</p>
      <p id="d2e406">Our previous work developed a novel hybrid nonlinear ensemble data assimilation system (NAQPMS-PDAF v2.0, NP2) for directly assimilating observations of chemical components (Li et al., 2024a). However, the CTM-based NP2 requires a reduction in ensemble size to maintain computational efficiency during simulation and assimilation processes within high-dimensional state spaces, resulting in insufficient ensemble spread (Chattopadhyay et al., 2023). Consequently, the correlation (<inline-formula><mml:math id="M24" display="inline"><mml:mi>R</mml:mi></mml:math></inline-formula>: 0.12–0.72) between observations and analysis fields at independent validation sites showed only minor improvement compared to the datasets mentioned above. Furthermore, the low sensitivity of background fields in NP2 to assimilation frequency suggests that improvements in initial conditions do little to improve the simulation of PM<sub>2.5</sub> chemical components, owing to the uncertainties in physicochemical mechanisms and input conditions within CTMs (Cha et al., 2025).</p>
      <p id="d2e426">In recent years, the combination of ML and DA has emerged as a pivotal strategy for addressing challenges associated with computational inefficiency and insufficient improvements in generating background and analysis fields. The first pathway employs ML outputs as external constraints for DA, such as forecasting addition (Lin et al., 2019; Jin et al., 2019), bias correction (Arcucci et al., 2021; Farchi et al., 2021; He et al., 2023), parameter estimation (Legler and Janjić, 2022), and observation operator improvement (Lee et al., 2022). This pathway enhances forecasting and DA processes without perturbing the physical properties of the numerical models but fails to improve computational efficiency. The second pathway utilizes ML as an alternative to DA for generating analysis fields directly from high-density observations (Howard et al., 2024). This pathway mitigates the limitations of traditional DA algorithms in handling high-resolution observations while diminishing the physical dependence of observation propagation within model state space. The third pathway replaces traditional numerical models with ML models to provide the background fields for DA (Dong et al., 2022, 2023; Yang and Grooms, 2021) and utilizes the analysis fields to update ML model parameters, thereby enhancing forecasting performance (Brajard et al., 2020; Gottwald and Reich, 2021). This pathway has been reported to improve computational efficiency by 78.3 % while maintaining high DA accuracy (Dong et al., 2022) and to mitigate the adverse impact of low-quality data on ML forecasting (Buizza et al., 2022). However, to the best of our knowledge, this pathway has not yet been utilized in atmospheric chemical DA.</p>
      <p id="d2e429">The Random Forest (RF) model (Gohari et al., 2025; Lin et al., 2022; Lv et al., 2021; Meng et al., 2018) and Deep Neural Networks (DNNs) (Li et al., 2025; Liu et al., 2023) have been widely used for simulating and predicting PM<sub>2.5</sub> chemical component concentrations, with DNNs achieving marginally superior predictive accuracy. However, an RF model is computationally more efficient than a single DNN during both training and inference (Debjyoti and Utpal, 2025; Jalali et al., 2025; Xi, 2022). Within an ensemble DA framework, periodically creating and running an ensemble of DNNs imposes a significant computational burden, whereas the RF model inherently provides an ensemble. Consequently, the RF model offers a favorable trade-off between predictive performance and computational demand, making it a practical and efficient choice for coupling with ensemble DA.</p>
      <p id="d2e441">This study proposes an optimized incremental Random Forest (OIRF) model to address the computational inefficiency and the limited improvement of background and analysis fields in traditional CTM-based DA. The OIRF model provides a large number of background ensemble members at a reduced computational cost, which helps mitigate the underestimation of background error covariance. Additionally, it can be updated dynamically by integrating new training data, allowing it to adapt to the evolving dynamics of PM<sub>2.5</sub> chemical components and thereby enhancing its generalization capability for simulation. The OIRF model is then coupled online with the localized ensemble Kalman filter (LEnKF) algorithm to develop a novel data assimilation system (OIRF-LEnKF v1.0), which achieves rapid iteration of high-quality simulation, assimilation, and incremental learning. Section 2 details the development of OIRF-LEnKF v1.0, the data used in this study, and the experimental settings. Section 3 presents the DA results, including an evaluation of computational efficiency, a discussion of sensitivity tests, and a validation of DA performance. Section 4 summarizes the conclusions.</p>
</sec>
<sec id="Ch1.S2">
  <label>2</label><title>Method and data</title>
<sec id="Ch1.S2.SS1">
  <label>2.1</label><title>OIRF-LEnKF v1.0</title>
<sec id="Ch1.S2.SS1.SSS1">
  <label>2.1.1</label><title>Structure of OIRF-LEnKF v1.0</title>
      <p id="d2e475">OIRF-LEnKF v1.0 performs a continuous loop of simulation and assimilation for five PM<sub>2.5</sub> chemical components (SO<inline-formula><mml:math id="M29" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M30" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NH<inline-formula><mml:math id="M31" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, and BC) by coupling an optimized incremental Random Forest (OIRF) ensemble model online with the localized ensemble Kalman filter (LEnKF) algorithm (Fig. 1). The ML-based OIRF ensemble model offers an effective alternative to conventional CTMs: it rapidly supplies background ensemble members of PM<sub>2.5</sub> chemical components to the LEnKF algorithm and iteratively updates its model parameters based on the analysis fields derived from the LEnKF algorithm. The LEnKF algorithm assimilates chemical observations into the background fields and uses localization schemes to minimize interference from spurious correlations, thereby generating high-accuracy analysis fields for incremental learning of the OIRF model. The online coupling of the OIRF model with the LEnKF algorithm allows ensemble simulation, assimilation, and incremental learning to be executed iteratively at each time step. Consequently, OIRF-LEnKF v1.0 generates high-quality background and analysis fields while simultaneously undergoing incremental learning.</p>

      <fig id="F1" specific-use="star"><label>Figure 1</label><caption><p id="d2e537">The framework of OIRF-LEnKF v1.0.</p></caption>
            <graphic xlink:href="https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026-f01.png"/>

          </fig>

      <p id="d2e546">As shown in Fig. 1, the fundamental workflow of OIRF-LEnKF v1.0 is as follows: <list list-type="bullet"><list-item>
      <p id="d2e551"><italic>Step 1</italic>: initial training of the OIRF model. The training data at the first time step are used to construct the initial OIRF model. The input features comprise meteorological parameters (temperature, relative humidity, <inline-formula><mml:math display="inline"><mml:mi>U</mml:mi></mml:math></inline-formula>-component wind, <inline-formula><mml:math id="M33" display="inline"><mml:mi>V</mml:mi></mml:math></inline-formula>-component wind, and geopotential) and anthropogenic atmospheric pollutants (PM<sub>2.5</sub>, PM<sub>10</sub>, SO<sub>2</sub>, NO<sub>2</sub>, CO, and O<sub>3</sub>). The output features are SO<inline-formula><mml:math id="M39" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M40" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NH<inline-formula><mml:math id="M41" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, and BC.</p></list-item><list-item>
      <p id="d2e649"><italic>Step 2</italic>: incremental learning of the OIRF model at time steps <inline-formula><mml:math id="M42" display="inline"><mml:mo>&gt;</mml:mo></mml:math></inline-formula> 1. High-quality analysis fields from the previous time step, along with the corresponding meteorological and anthropogenic input data, are used to train a new ensemble of decision trees. Old decision trees that exhibit poor simulation performance are then replaced with the new decision trees to enhance the simulation accuracy and generalization ability of the OIRF model.</p></list-item><list-item>
      <p id="d2e662"><italic>Step 3</italic>: generating a background ensemble of PM<sub>2.5</sub> chemical component concentrations at the current time step using the OIRF model, along with the current meteorological and anthropogenic input data.</p></list-item><list-item>
      <p id="d2e677"><italic>Step 4</italic>: generating the analysis fields of PM<sub>2.5</sub> chemical component concentrations at the current time step by assimilating chemical observations into the background fields using the LEnKF algorithm.</p></list-item><list-item>
      <p id="d2e692"><italic>Step 5</italic>: scoring the simulation performance of ensemble decision trees in the OIRF model using mean absolute error (MAE) and screening out the decision trees with poor simulation performance based on a predefined threshold. Repeat steps 2–5 until the end of the loop.</p></list-item></list></p>
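      <p>As a rough illustration of how Steps 1–5 fit together, the following Python sketch runs the loop on a one-dimensional toy problem. It is not the OIRF-LEnKF v1.0 code. The names <monospace>fit_stump</monospace>, <monospace>toy_analysis</monospace>, <monospace>mae</monospace> and <monospace>run_cycle</monospace> are hypothetical, each decision tree is reduced to a single-split stump on one scalar feature, the LEnKF update is replaced by a simple nudging of the ensemble mean toward the observations, and the percentile <monospace>p = 80</monospace> is an arbitrary choice.</p>
      <preformat preformat-type="code">
```python
import random
import statistics


def fit_stump(xs, ys, rng):
    """Toy stand-in for one decision tree: a single split on a bootstrap sample."""
    idx = [rng.randrange(len(xs)) for _ in xs]        # bootstrap indices
    split = xs[rng.choice(idx)]                       # random split point
    left = [ys[i] for i in idx if not xs[i] > split]  # never empty: holds the split point
    right = [ys[i] for i in idx if xs[i] > split]
    lo = statistics.fmean(left)
    hi = statistics.fmean(right) if right else lo
    return lambda x: hi if x > split else lo


def toy_analysis(members, obs, weight=0.8):
    """Placeholder for the LEnKF update: nudge the ensemble mean toward obs."""
    mean = [statistics.fmean(col) for col in zip(*members)]
    return [b + weight * (o - b) for b, o in zip(mean, obs)]


def mae(pred, truth):
    """Mean absolute error between two fields on the same grid."""
    return statistics.fmean(abs(p - t) for p, t in zip(pred, truth))


def run_cycle(steps, first_targets, n_trees=20, p=80, seed=0):
    """steps: list of (inputs, observations), one value per grid point."""
    rng = random.Random(seed)
    # Step 1: initial training on the first time step.
    trees = [fit_stump(steps[0][0], first_targets, rng) for _ in range(n_trees)]
    weak, prev, analyses = [], None, []
    for inputs, obs in steps:
        # Step 2: retrain the flagged trees on the previous analysis.
        if prev is not None:
            for n in weak:
                trees[n] = fit_stump(prev[0], prev[1], rng)
        # Step 3: one background member per tree.
        members = [[tree(x) for x in inputs] for tree in trees]
        # Step 4: analysis (placeholder update, not a real LEnKF).
        analysis = toy_analysis(members, obs)
        # Step 5: score each tree by MAE and flag those above the p-th percentile.
        scores = [mae(m, analysis) for m in members]
        tau = statistics.quantiles(scores, n=100, method="inclusive")[p - 1]
        weak = [n for n, s in enumerate(scores) if s > tau]
        prev = (inputs, analysis)
        analyses.append(analysis)
    return analyses, trees
```
</preformat>
      <p>Calling <monospace>run_cycle(steps, first_targets)</monospace> returns one analysis field per time step together with the final set of trees. Only the control flow is meant to mirror the workflow above; the tree model and the assimilation step are deliberately simplified.</p>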
</sec>
<sec id="Ch1.S2.SS1.SSS2">
  <label>2.1.2</label><title>Optimized Incremental Random Forest (OIRF)</title>
      <p id="d2e705">The OIRF model utilizes the Random Forest (RF) algorithm to map anthropogenic atmospheric pollutants (PM<sub>2.5</sub>, PM<sub>10</sub>, SO<sub>2</sub>, NO<sub>2</sub>, CO, and O<sub>3</sub>) and meteorological conditions (temperature, relative humidity, <inline-formula><mml:math id="M50" display="inline"><mml:mi>U</mml:mi></mml:math></inline-formula>-component wind, <inline-formula><mml:math id="M51" display="inline"><mml:mi>V</mml:mi></mml:math></inline-formula>-component wind, and geopotential) to the five PM<sub>2.5</sub> chemical components (SO<inline-formula><mml:math id="M53" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M54" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NH<inline-formula><mml:math display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, and BC). The RF model consists of <inline-formula><mml:math id="M56" display="inline"><mml:mi>N</mml:mi></mml:math></inline-formula> decision trees (DTs), each using an independently and identically distributed random vector (<inline-formula><mml:math id="M57" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>) to control random feature selection and sample bootstrapping. This approach enhances the diversity among DTs while maintaining the predictive capability of each DT (Breiman, 2001). Unlike conventional ensemble simulations that rely on multiple CTMs, RF can swiftly generate the ensemble of background fields required for DA from its DTs without requiring external ensemble perturbation. The final simulation of the RF model is the average of all DT outputs (Eq. 1).

              <disp-formula id="Ch1.E1" content-type="numbered"><label>1</label><mml:math id="M58" display="block"><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mi mathvariant="normal">RF</mml:mi></mml:msup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mn mathvariant="normal">1</mml:mn><mml:mi>N</mml:mi></mml:mfrac></mml:mstyle><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:msup><mml:mi>f</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msup><mml:mfenced close=")" open="("><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>

            where <inline-formula><mml:math id="M59" display="inline"><mml:mi mathvariant="bold-italic">x</mml:mi></mml:math></inline-formula> represents the input features (anthropogenic atmospheric pollutants and meteorological conditions), <inline-formula><mml:math id="M60" display="inline"><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mi mathvariant="normal">RF</mml:mi></mml:msup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the simulated PM<sub>2.5</sub> chemical component concentrations, <inline-formula><mml:math id="M62" display="inline"><mml:mi>N</mml:mi></mml:math></inline-formula> is the total number of DTs, <inline-formula><mml:math id="M63" display="inline"><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the output of the <inline-formula><mml:math id="M64" display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula>th DT, and <inline-formula><mml:math id="M65" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the independently and identically distributed random vector that controls random feature selection and sample bootstrapping. During the training of an individual DT, the split at each node is chosen, among all candidate splits, as the one that maximizes the reduction in mean squared error (MSE).</p>
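      <p>The split criterion and the averaging in Eq. (1) can be made concrete with a short Python sketch. It assumes the standard CART-style form of the criterion, in which the MSE of the two child nodes is weighted by their sample counts; whether the OIRF implementation uses exactly this weighting is an assumption here. The function names <monospace>mse</monospace>, <monospace>best_split</monospace> and <monospace>forest_mean</monospace> are illustrative, and a single scalar feature is used for brevity.</p>
      <preformat preformat-type="code">
```python
import statistics


def mse(values):
    """Mean squared error of values around their own mean."""
    mean = statistics.fmean(values)
    return statistics.fmean((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, reduction) with the largest weighted MSE reduction."""
    parent = mse(ys)
    best = (None, 0.0)
    for threshold in sorted(set(xs))[:-1]:  # the largest value cannot split
        left = [y for x, y in zip(xs, ys) if not x > threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        child = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if parent - child > best[1]:
            best = (threshold, parent - child)
    return best


def forest_mean(tree_outputs):
    """Eq. (1): average the N per-tree outputs at each grid point."""
    return [statistics.fmean(col) for col in zip(*tree_outputs)]
```
</preformat>
      <p>For example, <monospace>best_split([1, 2, 3, 4], [1, 1, 5, 5])</monospace> selects the threshold 2, which separates the two target levels and removes all of the parent node's MSE. The rows passed to <monospace>forest_mean</monospace> are the individual DT outputs, which are the same fields that serve as background ensemble members for DA.</p>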
      <p id="d2e970">Inspired by the idea of dynamically updating DTs with weak performance (Xie et al., 2016), the OIRF model incorporates a novel incremental learning mechanism into the RF model, enabling it to update itself from newly available training data within a simulation-assimilation cycle. In this mechanism, the OIRF model scores the simulation performance of each DT using the mean absolute error (MAE), as shown in Eq. (2). The MAE is computed between the DT outputs and the high-accuracy analysis fields at the same time step. A leakage-aware evaluation indicates that using the analysis field as the scoring target does not cause substantial information leakage; using independent high-quality observations as the scoring target is also recommended (Sect. S1 in the Supplement). 

              <disp-formula id="Ch1.E2" content-type="numbered"><label>2</label><mml:math id="M66" display="block"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>n</mml:mi><mml:mi mathvariant="normal">score</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mn mathvariant="normal">1</mml:mn><mml:mi>K</mml:mi></mml:mfrac></mml:mstyle><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>K</mml:mi></mml:munderover><mml:mfenced open="|" close="|"><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mi mathvariant="normal">ana</mml:mi></mml:msubsup><mml:mo>-</mml:mo><mml:msup><mml:mi>f</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msup><mml:mfenced close=")" open="("><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mn mathvariant="normal">2</mml:mn><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mi>N</mml:mi></mml:mrow></mml:math></disp-formula>

            Here, <inline-formula><mml:math id="M67" display="inline"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>n</mml:mi><mml:mi mathvariant="normal">score</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> is the MAE value of the <inline-formula><mml:math id="M68" display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula>th DT. <inline-formula><mml:math id="M69" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula> is the total number of grid points of PM<sub>2.5</sub> chemical component concentrations. <inline-formula><mml:math id="M71" display="inline"><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mi mathvariant="normal">ana</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> is the analysis value of concentrations at the <inline-formula><mml:math id="M72" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula>th grid point after DA. <inline-formula><mml:math id="M73" display="inline"><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msup><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the simulation value of the <inline-formula><mml:math id="M74" display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula>th DT at the <inline-formula><mml:math id="M75" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula>th grid point. 
Notably, <inline-formula><mml:math id="M76" display="inline"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> used in machine learning denotes the input features, while <inline-formula><mml:math id="M77" display="inline"><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mi mathvariant="normal">ana</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> used in data assimilation denotes the analysis states.</p>
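      <p>As a minimal sketch of the scoring step in Eq. (2), the following Python snippet computes one MAE score per DT against the analysis values. The function name <monospace>tree_mae_scores</monospace> and the toy numbers are illustrative assumptions and are not taken from the OIRF-LEnKF code.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def tree_mae_scores(analysis, tree_predictions):
    """Eq. (2): mean absolute error of each tree against the analysis values.

    analysis         -- list of K analysis values (x_i^ana), one per grid point
    tree_predictions -- list of N lists, each holding the K outputs of one tree
    """
    k = len(analysis)
    scores = []
    for preds in tree_predictions:
        if len(preds) != k:
            raise ValueError("each tree needs one prediction per grid point")
        scores.append(sum(abs(a - p) for a, p in zip(analysis, preds)) / k)
    return scores


# Illustrative values only (hypothetical concentrations at K = 4 grid points).
analysis = [10.0, 12.0, 8.0, 15.0]
tree_predictions = [
    [10.0, 12.0, 8.0, 15.0],   # tree that reproduces the analysis exactly
    [11.0, 11.0, 9.0, 14.0],   # tree that is off by 1 at every grid point
    [14.0, 12.0, 8.0, 15.0],   # tree with a single large miss
]
scores = tree_mae_scores(analysis, tree_predictions)
```
]]></preformat>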
      <p id="d2e1187">The incremental learning mechanism introduces a threshold (<inline-formula><mml:math id="M78" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">τ</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>) to screen out the DTs with poor simulation performance. The threshold is defined as the <inline-formula><mml:math id="M79" display="inline"><mml:mi>p</mml:mi></mml:math></inline-formula>th percentile value of <inline-formula><mml:math id="M80" display="inline"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>n</mml:mi><mml:mi mathvariant="normal">score</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>. The percentile-based threshold ensures that a stable and controllable number of DTs is updated, which is critical for maintaining the smoothness and stability of the background error covariance estimate within the ensemble data assimilation framework and for preventing the model from overfitting to the new information. As shown in Eq. (3), the old DTs with scores not higher than <inline-formula><mml:math id="M81" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">τ</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> are retained, while the old DTs with scores higher than <inline-formula><mml:math id="M82" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">τ</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> are replaced by new DTs obtained from the incremental learning process.

              <disp-formula id="Ch1.E3" content-type="numbered"><label>3</label><mml:math id="M83" display="block"><mml:mrow><mml:mtable class="split" rowspacing="0.2ex" displaystyle="true" columnalign="right left"><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mo>=</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:mfenced open="{" close=""><mml:mtable class="array" columnalign="left left"><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msup><mml:mfenced open="(" close=")"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mi mathvariant="normal">|</mml:mi><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mi mathvariant="normal">Δ</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mi mathvariant="normal">ana</mml:mi></mml:msubsup></mml:mrow></mml:mfenced><mml:mo>,</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mi>n</mml:mi><mml:mi mathvariant="normal">score</mml:mi></mml:msubsup><mml:mo>≤</mml:mo><mml:msub><mml:mi mathvariant="italic">τ</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mn mathvariant="normal">2</mml:mn><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:msub><mml:mi>N</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msup><mml:mfenced close=")" open="("><mml:mrow><mml:mi 
mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mi mathvariant="normal">|</mml:mi><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">ana</mml:mi></mml:msubsup></mml:mrow></mml:mfenced><mml:mo>,</mml:mo><mml:msubsup><mml:mi>f</mml:mi><mml:mi>n</mml:mi><mml:mi mathvariant="normal">score</mml:mi></mml:msubsup><mml:mo>&gt;</mml:mo><mml:msub><mml:mi mathvariant="italic">τ</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:msub><mml:mi>N</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mn mathvariant="normal">2</mml:mn><mml:mo>,</mml:mo><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mi>N</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mfenced></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula>

            Here, <inline-formula><mml:math id="M84" display="inline"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> represents the final output of the updated DTs following incremental learning at time <inline-formula><mml:math id="M85" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula>. <inline-formula><mml:math id="M86" display="inline"><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mi mathvariant="normal">|</mml:mi><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mi mathvariant="normal">Δ</mml:mi><mml:mi>t</mml:mi></mml:mrow><mml:mi mathvariant="normal">ana</mml:mi></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the output of the retained old DTs while <inline-formula><mml:math id="M87" display="inline"><mml:mrow><mml:msup><mml:mi>f</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mi mathvariant="normal">|</mml:mi><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">ana</mml:mi></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> refers to the output of the new DTs. <inline-formula><mml:math id="M88" display="inline"><mml:mrow><mml:mi mathvariant="normal">Δ</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:math></inline-formula> represents the time interval of incremental learning. 
<inline-formula><mml:math id="M89" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">τ</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> indicates the <inline-formula><mml:math id="M90" display="inline"><mml:mi>p</mml:mi></mml:math></inline-formula>th percentile value of <inline-formula><mml:math id="M91" display="inline"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>n</mml:mi><mml:mi mathvariant="normal">score</mml:mi></mml:msubsup><mml:mo>(</mml:mo><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mn mathvariant="normal">2</mml:mn><mml:mo>,</mml:mo><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mi>N</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M92" display="inline"><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> signifies the number of retained old DTs that achieve a score not exceeding <inline-formula><mml:math id="M93" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">τ</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>. Here, <inline-formula><mml:math id="M94" display="inline"><mml:mi>p</mml:mi></mml:math></inline-formula> is set to 80, so that only the DTs scoring above the 80th percentile (roughly the worst-performing 20 %) are replaced at each update. This prevents excessive updating of DTs, which may introduce instability and artificially optimistic performance into the ensemble simulation of the OIRF model.</p>
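      <p>The selection rule in Eq. (3) can be sketched as follows. This is an illustration under stated assumptions: the function name <monospace>split_trees_by_percentile</monospace> is hypothetical, and the percentile is computed with the inclusive interpolation of Python's <monospace>statistics.quantiles</monospace>, which may differ from the percentile convention used in the actual system.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import statistics


def split_trees_by_percentile(scores, p=80):
    """Return (tau_p, retained, replaced) for the rule in Eq. (3).

    scores -- list of N MAE scores, one per decision tree
    p      -- integer percentile in 1..99 (the text uses p = 80)
    """
    if not 1 <= p <= 99:
        raise ValueError("p must be an integer between 1 and 99")
    # quantiles(n=100) returns the 99 percentile cut points; entry p - 1
    # is the p-th percentile. The interpolation method is an assumption.
    cut_points = statistics.quantiles(scores, n=100, method="inclusive")
    tau_p = cut_points[p - 1]
    retained = [n for n, s in enumerate(scores) if s <= tau_p]  # old DTs kept
    replaced = [n for n, s in enumerate(scores) if s > tau_p]   # DTs retrained
    return tau_p, retained, replaced


# Ten hypothetical trees with MAE scores 1.0, 2.0, ..., 10.0.
scores = [float(s) for s in range(1, 11)]
tau_p, retained, replaced = split_trees_by_percentile(scores, p=80)
```
]]></preformat>
      <p>With these ten toy scores and p = 80, the two trees with the largest MAE fall above the threshold and would be replaced, while the remaining eight are retained.</p>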
      <p id="d2e1626">The final simulation (<inline-formula><mml:math id="M95" display="inline"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">OIRF</mml:mi></mml:msubsup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>) of the OIRF model at time <inline-formula><mml:math id="M96" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> is obtained by averaging the outputs of the updated DTs (Eq. 4).

              <disp-formula id="Ch1.E4" content-type="numbered"><label>4</label><mml:math id="M97" display="block"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">OIRF</mml:mi></mml:msubsup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mn mathvariant="normal">1</mml:mn><mml:mi>N</mml:mi></mml:mfrac></mml:mstyle><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced open="(" close=")"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:math></disp-formula>

            Notably, the incremental learning mechanism generates new DTs within a Bayesian optimization framework, so that the updated RF model acquires new knowledge while keeping its hyperparameters optimized over time. Consequently, the incremental learning mechanism enhances the capacity of the OIRF model to incorporate newly available training data and to replace underperforming DTs with better-performing ones, thereby dynamically improving its generalization ability in simulating PM<sub>2.5</sub> chemical component concentrations.</p>
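      <p>The replacement and averaging steps of Eqs. (3) and (4) can be illustrated together. In this sketch the tree outputs are plain lists of numbers, and <monospace>merge_and_average</monospace> is a hypothetical helper. How the new DTs are actually trained (with Bayesian optimization) is outside the scope of the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def merge_and_average(old_outputs, new_outputs, replaced):
    """Swap in new tree outputs at the replaced indices, then apply Eq. (4).

    old_outputs -- N lists of K values from the forest before the update
    new_outputs -- one list of K values per replaced tree, in the same order
    replaced    -- indices of the trees whose score exceeded tau_p
    """
    if len(new_outputs) != len(replaced):
        raise ValueError("need exactly one new tree per replaced tree")
    updated = [list(tree) for tree in old_outputs]   # copy, keep input intact
    for index, tree in zip(replaced, new_outputs):
        updated[index] = list(tree)
    n_trees = len(updated)
    # Eq. (4): equal-weight mean over all N trees at every grid point.
    mean = [sum(column) / n_trees for column in zip(*updated)]
    return updated, mean


old_outputs = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]   # third tree performs poorly
new_outputs = [[2.0, 3.0]]                            # hypothetical replacement
updated, forest_mean = merge_and_average(old_outputs, new_outputs, replaced=[2])
```
]]></preformat>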
      <p id="d2e1720">The hyperparameters in the OIRF model, such as the minimum number of leaf node observations, the maximal number of decision splits, and the number of predictors to select at random for each split, control the model structure and randomness level (Probst et al., 2019). The OIRF model integrates the RF model with the Bayesian optimization algorithm to optimize these hyperparameters statistically. The Bayesian optimization algorithm treats the hyperparameters as decision variables of an objective function, thereby casting hyperparameter tuning as the optimization of that objective function (Wu et al., 2019). The objective function is defined by Eq. (5). This algorithm is capable of identifying the global optimal solution in relatively few iterations, thereby reducing the computational costs associated with evaluating the loss function and enhancing the performance of the ML model (Shahriari et al., 2016). A probabilistic surrogate model and an acquisition function are the two essential components of the Bayesian optimization algorithm. The former is employed to approximate the complex objective function, thereby minimizing computational costs. The latter is used to identify potential optimal decision variables and update the surrogate model during iterative optimization. In this study, the surrogate model and acquisition function are implemented using a non-parametric Gaussian process regression model (Rasmussen, 2004) and the Expected Improvement per Second Plus (EIps<inline-formula><mml:math id="M99" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula>) function (Gelbart et al., 2014), respectively. The detailed implementation of the Bayesian optimization algorithm in machine learning models is described in our previous work (Li et al., 2025).

              <disp-formula id="Ch1.E5" content-type="numbered"><label>5</label><mml:math id="M100" display="block"><mml:mrow><mml:mi>J</mml:mi><mml:mo>(</mml:mo><mml:mi mathvariant="italic">θ</mml:mi><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:mi>ln⁡</mml:mi><mml:mfenced open="(" close=")"><mml:mrow><mml:mn mathvariant="normal">1</mml:mn><mml:mo>+</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mn mathvariant="normal">1</mml:mn><mml:mi>N</mml:mi></mml:mfrac></mml:mstyle><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:msup><mml:mfenced close=")" open="("><mml:mrow><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mi mathvariant="normal">pred</mml:mi></mml:msubsup><mml:mo>(</mml:mo><mml:mi mathvariant="italic">θ</mml:mi><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mi mathvariant="normal">o</mml:mi></mml:msubsup></mml:mrow></mml:mfenced><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow></mml:mfenced></mml:mrow></mml:math></disp-formula>

            Here, <inline-formula><mml:math id="M101" display="inline"><mml:mrow><mml:mi>J</mml:mi><mml:mo>(</mml:mo><mml:mi mathvariant="italic">θ</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> represents the objective value, <inline-formula><mml:math id="M102" display="inline"><mml:mi mathvariant="italic">θ</mml:mi></mml:math></inline-formula> represents the set of hyperparameters under optimization, and <inline-formula><mml:math id="M103" display="inline"><mml:mi>N</mml:mi></mml:math></inline-formula> is the total number of samples in the training dataset (in Eq. 5 only; elsewhere it denotes the number of DTs). <inline-formula><mml:math id="M104" display="inline"><mml:mrow><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mi mathvariant="normal">pred</mml:mi></mml:msubsup><mml:mo>(</mml:mo><mml:mi mathvariant="italic">θ</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the predicted value for the <inline-formula><mml:math id="M105" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula>th sample, and <inline-formula><mml:math id="M106" display="inline"><mml:mrow><mml:msubsup><mml:mi>y</mml:mi><mml:mi>i</mml:mi><mml:mi mathvariant="normal">o</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> is the observed value for the <inline-formula><mml:math id="M107" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula>th sample.</p>
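      <p>For concreteness, the objective in Eq. (5) is the natural logarithm of one plus the mean squared error. A minimal Python sketch is given below. The function name <monospace>objective</monospace> is an illustrative choice, and the predictions are passed in directly rather than produced by a model trained with hyperparameters θ.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def objective(predicted, observed):
    """Eq. (5): J = ln(1 + mean squared error) over the training samples."""
    n = len(observed)
    if n == 0 or len(predicted) != n:
        raise ValueError("predicted and observed must be non-empty and equal in length")
    mse = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n
    # log1p(x) evaluates ln(1 + x) accurately, also for very small x.
    return math.log1p(mse)


perfect = objective([1.0, 2.0], [1.0, 2.0])     # MSE = 0, so J = 0
unit_error = objective([2.0, 4.0], [1.0, 3.0])  # MSE = 1, so J = ln 2
```
]]></preformat>
      <p>The logarithm compresses large squared errors, so J is zero for a perfect fit and grows slowly as the error increases.</p>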
</sec>
<sec id="Ch1.S2.SS1.SSS3">
  <label>2.1.3</label><title>Localized Ensemble Kalman Filter (LEnKF)</title>
      <p id="d2e1882">LEnKF is an Ensemble Kalman Filter (EnKF) algorithm with localization schemes that mitigate filter divergence induced by sampling errors of the estimated error covariance matrix (Nerger et al., 2012), thereby generating high-precision analysis fields of PM<sub>2.5</sub> chemical component concentrations. The EnKF is an extension of the Kalman filter, specifically designed for atmospheric and oceanic DA with nonlinear and high-dimensional model state spaces (Houtekamer and Zhang, 2016). The EnKF utilizes the Monte Carlo method to estimate a flow-dependent background error covariance matrix from an ensemble of model states at each time step. This algorithm mitigates the high computational costs associated with the explicit operations of high-dimensional matrices (Evensen, 1994, 2003). In this study, the OIRF model replaces conventional CTMs and provides an ensemble of DT-simulated background fields for estimating the background error covariance (Eq. 6). The ensemble size in DA therefore equals the total number of DTs in the OIRF model.

              <disp-formula id="Ch1.E6" content-type="numbered"><label>6</label><mml:math id="M109" display="block"><mml:mtable rowspacing="0.2ex" class="split" displaystyle="true" columnalign="right left"><mml:mtr><mml:mtd><mml:mrow><mml:msubsup><mml:mi mathvariant="bold">P</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:msubsup><mml:mo>=</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mn mathvariant="normal">1</mml:mn><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mfenced close=")" open="("><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced close=")" open="("><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced open="(" close=")"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi mathvariant="bold-italic">n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:mrow></mml:mfenced></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:msup><mml:mfenced open="(" close=")"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced open="(" close=")"><mml:mrow><mml:mi 
mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced close=")" open="("><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:mrow></mml:mfenced><mml:mi>T</mml:mi></mml:msup></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>

            Here, <inline-formula><mml:math id="M110" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="bold">P</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> is the flow-dependent background error covariance matrix of PM<sub>2.5</sub> chemical component concentrations at time <inline-formula><mml:math id="M112" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula>, <inline-formula><mml:math id="M113" display="inline"><mml:mover accent="true"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:math></inline-formula> refers to the ensemble mean across decision trees in the random forest at time <inline-formula><mml:math id="M114" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula>.</p>
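      <p>A small worked example may help to make Eq. (6) concrete. The sketch below treats each DT output as one ensemble member and forms the sample covariance with the N − 1 normalization. The helper name <monospace>ensemble_covariance</monospace> and the three-member toy ensemble are assumptions for illustration only. A real system would not build the full matrix explicitly for a large grid.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def ensemble_covariance(members):
    """Eq. (6): sample covariance of N ensemble members (N - 1 normalization).

    members -- list of N lists, each holding the K grid values of one tree
    Returns a K x K matrix as a list of lists.
    """
    n = len(members)
    if n < 2:
        raise ValueError("at least two ensemble members are required")
    k = len(members[0])
    mean = [sum(m[i] for m in members) / n for i in range(k)]
    # Anomalies: deviation of every member from the ensemble mean.
    anomalies = [[m[i] - mean[i] for i in range(k)] for m in members]
    return [
        [sum(a[i] * a[j] for a in anomalies) / (n - 1) for j in range(k)]
        for i in range(k)
    ]


# Three hypothetical tree outputs on a two-point grid.
members = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
p_f = ensemble_covariance(members)
```
]]></preformat>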

      <fig id="F2" specific-use="star"><label>Figure 2</label><caption><p id="d2e2100">The scheme for domain localization and parallelization.</p></caption>
            <graphic xlink:href="https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026-f02.png"/>

          </fig>

      <p id="d2e2109">The Kalman gain matrix (<inline-formula><mml:math id="M115" display="inline"><mml:mrow><mml:mi mathvariant="bold">K</mml:mi></mml:mrow></mml:math></inline-formula>) can be calculated by Eqs. (7)–(9).

                  <disp-formula specific-use="gather" content-type="numbered"><mml:math id="M116" display="block"><mml:mtable displaystyle="true"><mml:mlabeledtr id="Ch1.E7"><mml:mtd><mml:mtext>7</mml:mtext></mml:mtd><mml:mtd><mml:mrow><mml:mstyle displaystyle="true" class="stylechange"/><mml:mrow><mml:mi mathvariant="bold">K</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold">P</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:msubsup><mml:msubsup><mml:mi mathvariant="bold">H</mml:mi><mml:mi>t</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msup><mml:mfenced close=")" open="("><mml:mrow><mml:msub><mml:mi mathvariant="bold">H</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:msubsup><mml:mi mathvariant="bold">P</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:msubsup><mml:msubsup><mml:mi mathvariant="bold">H</mml:mi><mml:mi>t</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold">R</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mlabeledtr><mml:mlabeledtr id="Ch1.E8"><mml:mtd><mml:mtext>8</mml:mtext></mml:mtd><mml:mtd><mml:mrow><mml:mstyle displaystyle="true" class="stylechange"/><mml:mtable rowspacing="0.2ex" class="split" displaystyle="true" columnalign="right left"><mml:mtr><mml:mtd><mml:mrow><mml:msubsup><mml:mi mathvariant="bold">P</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:msubsup><mml:msubsup><mml:mi mathvariant="bold">H</mml:mi><mml:mi>t</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mo>=</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mn mathvariant="normal">1</mml:mn><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:munderover><mml:mo 
movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mfenced close=")" open="("><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced open="(" close=")"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced open="(" close=")"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:mrow></mml:mfenced></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:msup><mml:mfenced close=")" open="("><mml:mrow><mml:mi>H</mml:mi><mml:mfenced open="(" close=")"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced open="(" close=")"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>H</mml:mi><mml:mfenced open="(" close=")"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced close=")" open="("><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced></mml:mrow><mml:mo 
mathvariant="normal">‾</mml:mo></mml:mover></mml:mrow></mml:mfenced><mml:mi>T</mml:mi></mml:msup><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mtd></mml:mlabeledtr><mml:mlabeledtr id="Ch1.E9"><mml:mtd><mml:mtext>9</mml:mtext></mml:mtd><mml:mtd><mml:mrow><mml:mstyle class="stylechange" displaystyle="true"/><mml:mtable rowspacing="0.2ex" class="split" displaystyle="true" columnalign="right left"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi mathvariant="bold">H</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:msubsup><mml:mi mathvariant="bold">P</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:msubsup><mml:msubsup><mml:mi mathvariant="bold">H</mml:mi><mml:mi>t</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mo>=</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mn mathvariant="normal">1</mml:mn><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:mfrac></mml:mstyle><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mfenced close=")" open="("><mml:mrow><mml:mi>H</mml:mi><mml:mfenced close=")" open="("><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced close=")" open="("><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>H</mml:mi><mml:mfenced close=")" open="("><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced close=")" open="("><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi 
mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:mrow></mml:mfenced></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:msup><mml:mfenced open="(" close=")"><mml:mrow><mml:mi>H</mml:mi><mml:mfenced open="(" close=")"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced close=")" open="("><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>H</mml:mi><mml:mfenced open="(" close=")"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced open="(" close=")"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced></mml:mrow><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:mrow></mml:mfenced><mml:mi>T</mml:mi></mml:msup><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mtd></mml:mlabeledtr></mml:mtable></mml:math></disp-formula>

            Here, <inline-formula><mml:math id="M117" display="inline"><mml:mi mathvariant="bold">K</mml:mi></mml:math></inline-formula> is the Kalman gain matrix. <inline-formula><mml:math id="M118" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold">H</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the observation operator at time <inline-formula><mml:math id="M119" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula>. <inline-formula><mml:math id="M120" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold">R</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the observation error covariance matrix at time <inline-formula><mml:math id="M121" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula>, which is diagonal. <inline-formula><mml:math id="M122" display="inline"><mml:mi>H</mml:mi></mml:math></inline-formula> is the linear observation operator applied to each ensemble member in Eqs. (8) and (9). In this study, the observation operator performs only spatial mapping between the observations and the background fields, because the two are already consistent in the variable and temporal dimensions. The spatial mapping between observations from sparse sites and gridded background fields is carried out with the <inline-formula><mml:math id="M123" display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula>-nearest neighbor search (Friedman et al., 1977).</p>
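      <p>Equations (7)–(9) can be illustrated with a deliberately simplified case: a single observation site that is mapped to its nearest grid point (that is, k = 1 in the nearest neighbor search). With one observation, the term inverted in Eq. (7) is a scalar, so no matrix inversion is needed. The helper names, the coordinates, and the error variance below are hypothetical, and localization is not included.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def nearest_grid_index(grid_coords, site):
    """Index of the grid point closest to an observation site (k = 1)."""
    return min(
        range(len(grid_coords)),
        key=lambda i: (grid_coords[i][0] - site[0]) ** 2
        + (grid_coords[i][1] - site[1]) ** 2,
    )


def kalman_gain_single_obs(members, obs_index, obs_error_var):
    """Eqs. (7)-(9) for one observation located at grid index obs_index."""
    n = len(members)
    k = len(members[0])
    mean = [sum(m[i] for m in members) / n for i in range(k)]
    # Eq. (8): covariance between every grid point and the observed point.
    pht = [
        sum((m[i] - mean[i]) * (m[obs_index] - mean[obs_index]) for m in members)
        / (n - 1)
        for i in range(k)
    ]
    # Eq. (9): ensemble variance in observation space (a scalar here).
    hpht = pht[obs_index]
    # Eq. (7): K = P H^T (H P H^T + R)^-1, with scalar R = obs_error_var.
    return [value / (hpht + obs_error_var) for value in pht]


grid_coords = [(0.0, 0.0), (1.0, 0.0)]            # hypothetical grid points
obs_index = nearest_grid_index(grid_coords, (0.2, 0.1))
members = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]   # three hypothetical DT outputs
gain = kalman_gain_single_obs(members, obs_index, obs_error_var=4.0)
```
]]></preformat>
      <p>In this toy case the ensemble variance at the observed point equals the observation error variance, so the gain at that point is 0.5: the background and the observation are weighted equally.</p>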
      <p id="d2e2578">The final analysis fields (<inline-formula><mml:math id="M124" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mi mathvariant="normal">ana</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>) are obtained by combining the background fields (<inline-formula><mml:math id="M125" display="inline"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>) with the observations (<inline-formula><mml:math id="M126" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">o</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>):

              <disp-formula id="Ch1.E10" content-type="numbered"><label>10</label><mml:math id="M127" display="block"><mml:mrow><mml:mtable rowspacing="0.2ex" class="split" displaystyle="true" columnalign="right left"><mml:mtr><mml:mtd><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mi mathvariant="normal">ana</mml:mi></mml:msubsup><mml:mo>=</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced close=")" open="("><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>+</mml:mo><mml:mi mathvariant="bold">K</mml:mi><mml:mfenced open="(" close=")"><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">o</mml:mi></mml:msubsup><mml:mo>+</mml:mo><mml:mi>y</mml:mi><mml:msubsup><mml:msup><mml:mi/><mml:mo>′</mml:mo></mml:msup><mml:mrow><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mi mathvariant="normal">o</mml:mi></mml:msubsup><mml:mo>-</mml:mo><mml:mi>H</mml:mi><mml:mfenced open="(" close=")"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced open="(" close=")"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mn mathvariant="normal">2</mml:mn><mml:mo>,</mml:mo><mml:mspace 
width="0.125em" linebreak="nobreak"/><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mi>N</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula>

            Here, <inline-formula><mml:math id="M128" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mi mathvariant="normal">ana</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> is the analysis field of the <inline-formula><mml:math id="M129" display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula>th ensemble member at <inline-formula><mml:math id="M130" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula>. <inline-formula><mml:math id="M131" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">o</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> is the observation of PM<sub>2.5</sub> chemical components at <inline-formula><mml:math id="M133" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> and <inline-formula><mml:math id="M134" display="inline"><mml:mrow><mml:mi>y</mml:mi><mml:msubsup><mml:msup><mml:mi/><mml:mo>′</mml:mo></mml:msup><mml:mrow><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mi mathvariant="normal">o</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> is the observation perturbation of the <inline-formula><mml:math id="M135" display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula>th ensemble member at <inline-formula><mml:math id="M136" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula>, characterized by a normal distribution with a mean of 0 and a standard deviation equal to the observation error.</p>
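      <p>The perturbed-observation update in Eq. (10) can be illustrated with a minimal NumPy sketch. It is not the OIRF-LEnKF v1.0 implementation: the function name <monospace>enkf_analysis</monospace>, the array layout (one ensemble member per column), and the use of the standard gain formula with the sample covariance of the background ensemble are assumptions made for illustration only.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import numpy as np


def enkf_analysis(Xb, y_obs, H, obs_std, rng):
    """Perturbed-observation EnKF update for all ensemble members (Eq. 10).

    Xb      : (n_state, N) background ensemble, one column per member
    y_obs   : (n_obs,) observations
    H       : (n_obs, n_state) linear observation operator
    obs_std : (n_obs,) observation error standard deviations
    rng     : numpy.random.Generator used to draw the perturbations
    """
    n_obs, N = H.shape[0], Xb.shape[1]
    # Sample background covariance from the ensemble anomalies.
    A = Xb - Xb.mean(axis=1, keepdims=True)
    Pf = A @ A.T / (N - 1)
    R = np.diag(obs_std ** 2)
    # Assumed standard gain: K = Pf H^T (H Pf H^T + R)^-1.
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)
    # Zero-mean Gaussian perturbation per member and observation (y' in Eq. 10).
    Y_pert = rng.normal(0.0, obs_std[:, None], size=(n_obs, N))
    innovation = y_obs[:, None] + Y_pert - H @ Xb
    return Xb + K @ innovation
```
]]></preformat>
      <p>Each column of the returned array is the analysis of one ensemble member; the standard deviation of the perturbations equals the observation error, as described above.</p>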
      <p id="d2e2866">The LEnKF integrates domain localization and observation localization into the EnKF algorithm to diminish the interference of non-physical teleconnections within a high-dimensional model state space, especially for small ensemble sizes (Nerger et al., 2012). Domain localization segments the global state space into several disjoint local state spaces, each of which assimilates observations independently within a defined localization radius, thereby effectively increasing the rank of the background covariance matrix and eliminating the interference of long-distance spurious correlations (Houtekamer and Mitchell, 1998). The independence of the analysis process within each local state space facilitates parallel computation (Janjić et al., 2011). However, it may also produce discontinuities at the boundaries of adjacent local state spaces. To avoid these discontinuities, domain localization in our system performs the assimilation separately for each analysis grid point, using only the background fields and observations within a specified localization radius (Fig. 2) and the same update form as the global EnKF (Eq. 10). The local update is given in Eq. (11).

              <disp-formula id="Ch1.E11" content-type="numbered"><label>11</label><mml:math id="M137" display="block"><mml:mrow><mml:mtable rowspacing="0.2ex" class="split" displaystyle="true" columnalign="right left"><mml:mtr><mml:mtd><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="italic">δ</mml:mi></mml:mrow><mml:mi mathvariant="normal">ana</mml:mi></mml:msubsup><mml:mo>=</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi mathvariant="italic">δ</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced close=")" open="("><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>+</mml:mo><mml:msub><mml:mi mathvariant="bold">K</mml:mi><mml:mi mathvariant="italic">δ</mml:mi></mml:msub><mml:mfenced open="(" close=")"><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="italic">δ</mml:mi><mml:mi mathvariant="normal">o</mml:mi></mml:msubsup><mml:mo>+</mml:mo><mml:mi>y</mml:mi><mml:msubsup><mml:msup><mml:mi/><mml:mo>′</mml:mo></mml:msup><mml:mrow><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="italic">δ</mml:mi></mml:mrow><mml:mi mathvariant="normal">o</mml:mi></mml:msubsup><mml:mo>-</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi mathvariant="italic">δ</mml:mi></mml:msub><mml:mfenced close=")" open="("><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi mathvariant="italic">δ</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mfenced close=")" open="("><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi 
mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mn mathvariant="normal">2</mml:mn><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mi>N</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula>

            Here, <inline-formula><mml:math id="M138" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="italic">δ</mml:mi></mml:mrow><mml:mi mathvariant="normal">ana</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> is the analysis value within the localization domain <inline-formula><mml:math id="M139" display="inline"><mml:mi mathvariant="italic">δ</mml:mi></mml:math></inline-formula> of the <inline-formula><mml:math id="M140" display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula>th ensemble member. <inline-formula><mml:math id="M141" display="inline"><mml:mrow><mml:msubsup><mml:mi>f</mml:mi><mml:mi mathvariant="italic">δ</mml:mi><mml:mi mathvariant="normal">DT</mml:mi></mml:msubsup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">θ</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the background value within the localization domain <inline-formula><mml:math id="M142" display="inline"><mml:mi mathvariant="italic">δ</mml:mi></mml:math></inline-formula> of the <inline-formula><mml:math id="M143" display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula>th ensemble member. <inline-formula><mml:math id="M144" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold">K</mml:mi><mml:mi mathvariant="italic">δ</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the local Kalman gain matrix computed from the ensemble covariance within the localization domain <inline-formula><mml:math id="M145" display="inline"><mml:mi mathvariant="italic">δ</mml:mi></mml:math></inline-formula>. 
<inline-formula><mml:math id="M146" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mi mathvariant="italic">δ</mml:mi><mml:mi mathvariant="normal">o</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> is the observation of PM<sub>2.5</sub> chemical components within the localization domain <inline-formula><mml:math id="M148" display="inline"><mml:mi mathvariant="italic">δ</mml:mi></mml:math></inline-formula> and <inline-formula><mml:math id="M149" display="inline"><mml:mrow><mml:mi>y</mml:mi><mml:msubsup><mml:msup><mml:mi/><mml:mo>′</mml:mo></mml:msup><mml:mrow><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="italic">δ</mml:mi></mml:mrow><mml:mi mathvariant="normal">o</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> is the observation perturbation of the <inline-formula><mml:math id="M150" display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula>th ensemble member within the localization domain <inline-formula><mml:math id="M151" display="inline"><mml:mi mathvariant="italic">δ</mml:mi></mml:math></inline-formula>. <inline-formula><mml:math id="M152" display="inline"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mi mathvariant="italic">δ</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the linear observation operator within the localization domain <inline-formula><mml:math id="M153" display="inline"><mml:mi mathvariant="italic">δ</mml:mi></mml:math></inline-formula>.</p>
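      <p>The grid-point-wise update in Eq. (11) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the system's code: the function name <monospace>local_point_update</monospace> is hypothetical, coordinates are assumed to be in a projected plane measured in kilometres, and the background ensemble is assumed to have already been mapped to the observation sites (the rows of <monospace>Hxb</monospace>).</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import numpy as np


def local_point_update(xb_point, Hxb, y_obs, obs_std, obs_xy, grid_xy,
                       radius_km, rng):
    """Update one analysis grid point with observations inside the radius.

    xb_point : (N,) background ensemble at the analysis grid point
    Hxb      : (n_obs, N) background ensemble mapped to all observation sites
    y_obs    : (n_obs,) observations
    obs_std  : (n_obs,) observation error standard deviations
    obs_xy   : (n_obs, 2) observation coordinates in km (assumed projection)
    grid_xy  : (x, y) coordinates of the analysis grid point in km
    """
    N = xb_point.size
    # Domain localization: keep only observations within the radius.
    d = np.hypot(obs_xy[:, 0] - grid_xy[0], obs_xy[:, 1] - grid_xy[1])
    idx = np.flatnonzero(d <= radius_km)
    if idx.size == 0:
        return xb_point.copy()  # no local observations: keep the background
    # Local ensemble covariances between the grid point and the local sites.
    S = Hxb[idx] - Hxb[idx].mean(axis=1, keepdims=True)
    a = xb_point - xb_point.mean()
    PHt = S @ a / (N - 1)
    HPHt = S @ S.T / (N - 1)
    R = np.diag(obs_std[idx] ** 2)
    # Local gain (a row vector); HPHt + R is symmetric, so solve() suffices.
    K = np.linalg.solve(HPHt + R, PHt)
    pert = rng.normal(0.0, obs_std[idx][:, None], size=(idx.size, N))
    innovation = y_obs[idx][:, None] + pert - Hxb[idx]
    return xb_point + K @ innovation
```
]]></preformat>
      <p>Because each call reads only the background ensemble and the local observations, calls for different grid points are independent of one another and can be distributed across processors.</p>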
      <p id="d2e3179">The overlap of observations across analysis grid points smooths the boundaries of adjacent local state spaces. However, grid-by-grid assimilation at a fine spatial resolution incurs high computational costs. To mitigate this issue, OIRF-LEnKF v1.0 incorporates a two-level parallel computational framework that allows various chemical species and multiple analysis grid points to be assimilated simultaneously (Fig. 2). At the first level, computational tasks for different chemical species are allocated to independent computational nodes, which prevents interference from spurious correlations among chemical species and eliminates the need for inter-node communication. At the second level, the grid points of each chemical component are assigned to multiple CPUs within these independent computational nodes.</p>
      <p id="d2e3182">Observation localization is combined with domain localization to enhance the physical authenticity of observation propagation within state spaces (Nerger et al., 2012). This scheme conducts observation localization by applying the Schur product between the observation error covariance matrix (<inline-formula><mml:math id="M154" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold">R</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>) and a distance-based weight matrix (<inline-formula><mml:math id="M155" display="inline"><mml:mi mathvariant="bold">W</mml:mi></mml:math></inline-formula>) as shown in Eq. (12). 

              <disp-formula id="Ch1.E12" content-type="numbered"><label>12</label><mml:math id="M156" display="block"><mml:mrow><mml:msup><mml:mi mathvariant="bold">K</mml:mi><mml:mi mathvariant="normal">L</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:msubsup><mml:mi mathvariant="bold">P</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:msubsup><mml:msubsup><mml:mi mathvariant="bold">H</mml:mi><mml:mi>t</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msup><mml:mfenced close=")" open="("><mml:mrow><mml:msub><mml:mi mathvariant="bold">H</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:msubsup><mml:mi mathvariant="bold">P</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">f</mml:mi></mml:msubsup><mml:msubsup><mml:mi mathvariant="bold">H</mml:mi><mml:mi>t</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mo>+</mml:mo><mml:mi mathvariant="bold">W</mml:mi><mml:mo>⋅</mml:mo><mml:msub><mml:mi mathvariant="bold">R</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></disp-formula>

            Here, <inline-formula><mml:math id="M157" display="inline"><mml:mrow><mml:msup><mml:mi mathvariant="bold">K</mml:mi><mml:mi mathvariant="normal">L</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is the Kalman gain matrix with observation localization applied, and <inline-formula><mml:math id="M158" display="inline"><mml:mi mathvariant="bold">W</mml:mi></mml:math></inline-formula> is the diagonal distance-based weight matrix.</p>
      <p id="d2e3290">The distance-based weight matrix (<inline-formula><mml:math id="M159" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold">W</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>) for the <inline-formula><mml:math id="M160" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula>th localization domain is obtained using a Gaussian function:

              <disp-formula id="Ch1.E13" content-type="numbered"><label>13</label><mml:math id="M161" display="block"><mml:mrow><mml:msub><mml:mi mathvariant="bold">W</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="normal">diag</mml:mi><mml:mfenced close=")" open="("><mml:mrow><mml:mi>exp⁡</mml:mi><mml:mfenced close=")" open="("><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:mo>-</mml:mo><mml:mi>d</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:msup><mml:mi>L</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow></mml:mfrac></mml:mstyle></mml:mfenced></mml:mrow></mml:mfenced><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mspace width="0.125em" linebreak="nobreak"/><mml:mn mathvariant="normal">2</mml:mn><mml:mo>,</mml:mo><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mspace linebreak="nobreak" width="0.125em"/><mml:msub><mml:mi>N</mml:mi><mml:mi mathvariant="normal">obs</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula>

            Here, <inline-formula><mml:math id="M162" display="inline"><mml:mrow><mml:mi>d</mml:mi><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the Euclidean distance between the center grid point of the <inline-formula><mml:math id="M163" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula>th localization domain and observation point <inline-formula><mml:math id="M164" display="inline"><mml:mi>j</mml:mi></mml:math></inline-formula>. <inline-formula><mml:math id="M165" display="inline"><mml:mi>L</mml:mi></mml:math></inline-formula> is the decorrelation length. <inline-formula><mml:math id="M166" display="inline"><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi mathvariant="normal">obs</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the total number of effective observations within the <inline-formula><mml:math id="M167" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula>th localization domain. <inline-formula><mml:math id="M168" display="inline"><mml:mi mathvariant="bold">W</mml:mi></mml:math></inline-formula> is constructed as a diagonal matrix (<inline-formula><mml:math id="M169" display="inline"><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi mathvariant="normal">obs</mml:mi></mml:msub><mml:mo>×</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi mathvariant="normal">obs</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>) that applies a distance-dependent weighting directly to the diagonal elements of the observation error covariance matrix <inline-formula><mml:math id="M170" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold">R</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>.</p>
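      <p>Equations (12) and (13) can be transcribed into a short NumPy sketch. The function names <monospace>gaussian_weight_matrix</monospace> and <monospace>localized_gain</monospace> are hypothetical, coordinates are assumed to be in kilometres on a projected plane, and the gain is written exactly as printed in Eq. (12), with the weight matrix entering as an element-wise (Schur) product with the observation error covariance matrix.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import numpy as np


def gaussian_weight_matrix(center_xy, obs_xy, L_km):
    """Diagonal weight matrix of Eq. (13) for one localization domain.

    center_xy : (x, y) of the domain's center grid point in km
    obs_xy    : (n_obs, 2) coordinates of the local observations in km
    L_km      : decorrelation length in km
    """
    d = np.hypot(obs_xy[:, 0] - center_xy[0], obs_xy[:, 1] - center_xy[1])
    return np.diag(np.exp(-d ** 2 / (2.0 * L_km ** 2)))


def localized_gain(Pf, H, R, W):
    """Kalman gain of Eq. (12); W * R is the element-wise (Schur) product."""
    return Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + W * R)
```
]]></preformat>
      <p>With the settings in Table 1 (decorrelation length of 80 km, localization radius of 200 km), the Gaussian weight is 1 at the center grid point, exp(−0.5) ≈ 0.61 at a distance of 80 km, and exp(−3.125) ≈ 0.044 at the edge of the localization domain.</p>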
</sec>
<sec id="Ch1.S2.SS1.SSS4">
  <label>2.1.4</label><title>Configurations</title>
      <p id="d2e3489">Table 1 presents the fundamental configuration parameters in OIRF-LEnKF v1.0. The state variables consist of five key PM<sub>2.5</sub> chemical components (SO<inline-formula><mml:math id="M172" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M173" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NH<inline-formula><mml:math id="M174" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC and BC). The modeling domain encompasses North China, with a spatial range of 32.38–44.90° N and 108.07–127.01° E. The spatial and temporal resolutions are set to 5 km <inline-formula><mml:math id="M175" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 km and 1 h, respectively. The input features used to train the OIRF model, outlined in Sect. 2.2.1, comprise <inline-formula><mml:math id="M176" display="inline"><mml:mi>U</mml:mi></mml:math></inline-formula>-component wind, <inline-formula><mml:math id="M177" display="inline"><mml:mi>V</mml:mi></mml:math></inline-formula>-component wind, temperature, relative humidity, geopotential, and the mass concentrations of PM<sub>2.5</sub>, PM<sub>10</sub>, SO<sub>2</sub>, NO<sub>2</sub>, CO, and O<sub>3</sub>. The ensemble sizes employed in the assimilation experiments are 2, 5, 10, 15, 20, 30, 40, 50, 100, and 200. The update frequencies tested for incremental learning are 0 (no update) and updates at 18, 12, 6, and 1 h intervals. The experimental design is detailed in Sect. 2.3. 
Hyperparameters in the OIRF model, such as the minimum number of leaf node observations, the maximum number of decision splits, and the number of predictors to select at random for each split, are tuned using Bayesian optimization over 30 iterations. The training data are randomly re-partitioned at each optimization iteration to enhance the robustness of the OIRF model. Regarding the DA-related parameters, the localization radius and decorrelation length are set to 200 and 80 km, respectively, based on the spatial range and resolution requirements. The assimilation frequency matches the temporal resolution of 1 h.</p>

<table-wrap id="T1" specific-use="star"><label>Table 1</label><caption><p id="d2e3611">Fundamental configuration parameters in OIRF-LEnKF v1.0.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="3">
     <oasis:colspec colnum="1" colname="col1" align="justify" colwidth="2cm"/>
     <oasis:colspec colnum="2" colname="col2" align="justify" colwidth="4cm"/>
     <oasis:colspec colnum="3" colname="col3" align="justify" colwidth="10cm"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">Category</oasis:entry>
         <oasis:entry colname="col2" align="left">Parameter</oasis:entry>
         <oasis:entry colname="col3" align="left">Setting</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1" align="left">Ensemble simulation</oasis:entry>
         <oasis:entry rowsep="1" colname="col2" align="left">State variable</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">SO<inline-formula><mml:math id="M183" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M184" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NH<inline-formula><mml:math id="M185" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC and BC</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Model domain</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">North China (32.38–44.90° N, 108.07–127.01° E)</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Spatial resolution</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">5 km <inline-formula><mml:math id="M186" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 km</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Temporal resolution</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">1 h</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Meteorological input feature</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left"><inline-formula><mml:math id="M187" display="inline"><mml:mi>U</mml:mi></mml:math></inline-formula>-component wind, <inline-formula><mml:math id="M188" display="inline"><mml:mi>V</mml:mi></mml:math></inline-formula>-component wind, temperature, relative humidity and geopotential</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Anthropogenic input feature</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">PM<sub>2.5</sub>, PM<sub>10</sub>, SO<sub>2</sub>, NO<sub>2</sub>, CO and O<sub>3</sub></oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Ensemble size</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">2, 5, 10, 15, 20, 30, 40, 50, 100, 200</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Update frequency</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">0 (no update), 18 h interval, 12 h interval, 6 h interval, 1 h interval</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Hyperparameter for tuning</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">Minimum number of leaf node observations, maximum number of decision splits, and number of predictors to select at random for each split</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Optimization iteration</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">30</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry colname="col2" align="left">Data partition</oasis:entry>
         <oasis:entry colname="col3" align="left">Re-partition at every iteration</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left">Data assimilation</oasis:entry>
         <oasis:entry rowsep="1" colname="col2" align="left">State dimension</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">5, including SO<inline-formula><mml:math id="M194" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M195" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NH<inline-formula><mml:math id="M196" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC and BC</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry colname="col2" align="left">Latitudinal dimension</oasis:entry>
         <oasis:entry colname="col3" align="left">249 grid points</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Longitudinal dimension</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">300 grid points</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Algorithm</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">LEnKF</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Localization radius</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">200 km</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry rowsep="1" colname="col2" align="left">Decorrelation length</oasis:entry>
         <oasis:entry rowsep="1" colname="col3" align="left">80 km</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry colname="col2" align="left">Assimilation frequency</oasis:entry>
         <oasis:entry colname="col3" align="left">1 h</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
<sec id="Ch1.S2.SS1.SSS5">
  <label>2.1.5</label><title>Data</title>
</sec>
<sec id="Ch1.S2.SS1.SSS6">
  <label>2.1.6</label><title>Features</title>
      <p id="d2e4005">The input features used to train the OIRF model comprise six anthropogenic air pollutants and five meteorological parameters (Table 1). Hourly gridded data for the anthropogenic air pollutants were obtained from the Chinese Air Quality ReAnalysis (CAQRA, <ext-link xlink:href="https://doi.org/10.11922/sciencedb.00053" ext-link-type="DOI">10.11922/sciencedb.00053</ext-link>, Tang et al., 2020). CAQRA is generated by assimilating surface observations of hourly concentrations of conventional air pollutants into the Nested Air Quality Prediction Modeling System (NAQPMS); it has a spatial resolution of 15 km <inline-formula><mml:math id="M197" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 15 km and a 5-fold cross-validation <inline-formula><mml:math id="M198" display="inline"><mml:mrow><mml:msup><mml:mi>R</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> of 0.52–0.81 (Kong et al., 2021). Hourly gridded data for the meteorological parameters were obtained from the 5th Generation ECMWF ReAnalysis (ERA5, <ext-link xlink:href="https://doi.org/10.24381/cds.bd0915c6" ext-link-type="DOI">10.24381/cds.bd0915c6</ext-link>, Hersbach et al., 2023), which has a horizontal resolution of <inline-formula><mml:math id="M199" display="inline"><mml:mrow><mml:mn mathvariant="normal">0.25</mml:mn><mml:mi mathvariant="italic">°</mml:mi><mml:mo>×</mml:mo><mml:mn mathvariant="normal">0.25</mml:mn><mml:mi mathvariant="italic">°</mml:mi></mml:mrow></mml:math></inline-formula>. 
The output features include five PM<sub>2.5</sub> chemical components (NH<inline-formula><mml:math id="M201" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M202" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, SO<inline-formula><mml:math id="M203" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC and BC). Hourly gridded data for these components were obtained from the PM<sub>2.5</sub> chemical composition dataset (CAQRA-aerosol, <ext-link xlink:href="https://doi.org/10.1007/s00376-024-4046-5" ext-link-type="DOI">10.1007/s00376-024-4046-5</ext-link>, Kong et al., 2025). CAQRA-aerosol was developed using a CTM-based simulation method with an improved inorganic aerosol module and a constrained emission inventory; it has a spatial resolution of 15 km <inline-formula><mml:math id="M205" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 15 km and a mean bias of less than 1.1 <inline-formula><mml:math id="M206" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> (Kong et al., 2025). Given the distribution of available ground-based observational sites for PM<sub>2.5</sub> chemical components, the gridded feature data covering China were interpolated onto a new grid over North China with a spatial resolution of 5 km <inline-formula><mml:math id="M209" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 km, using a triangulation-based linear interpolation method (Amidror, 2002).</p>
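      <p>One possible way to perform a triangulation-based linear interpolation is SciPy's <monospace>griddata</monospace> with <monospace>method="linear"</monospace>, which triangulates the source points and interpolates linearly within each triangle. The sketch below is an illustration only: the helper name <monospace>regrid_linear</monospace> is hypothetical, interpolating directly in longitude and latitude is an assumption, and the source does not state which software was used for this step.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import numpy as np
from scipy.interpolate import griddata


def regrid_linear(src_lon, src_lat, src_values, dst_lon, dst_lat):
    """Interpolate a gridded field onto a finer regular grid.

    src_lon, src_lat : 2-D arrays with the source grid coordinates
    src_values       : 2-D array of the field on the source grid
    dst_lon, dst_lat : 1-D arrays defining the target grid axes
    Returns a (len(dst_lat), len(dst_lon)) array; target points outside
    the convex hull of the source points are returned as NaN.
    """
    points = np.column_stack([src_lon.ravel(), src_lat.ravel()])
    lon2d, lat2d = np.meshgrid(dst_lon, dst_lat)
    return griddata(points, src_values.ravel(), (lon2d, lat2d),
                    method="linear")
```
]]></preformat>
      <p>Linear interpolation of this kind reproduces any field that varies linearly in space exactly, and it does not extrapolate beyond the source grid.</p>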
</sec>
<sec id="Ch1.S2.SS1.SSS7">
  <label>2.1.7</label><title>Observations</title>
      <p id="d2e4161">Observations of hourly mass concentrations of five PM<sub>2.5</sub> chemical components (NH<inline-formula><mml:math id="M211" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M212" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, SO<inline-formula><mml:math id="M213" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, and BC) were collected over a two-month period (February to March 2022) from 33 ground-based sites in North China and its surrounding areas. Of these 33 sites, 24 (designated as DA sites) were used for DA and internal validation, while the remaining 9 (designated as VE sites) were used for independent verification to evaluate the influence of DA sites on neighboring areas. The site distribution and the method used to divide the sites into DA and VE sites are detailed in our previous work (Li et al., 2024a).</p>
</sec>
<sec id="Ch1.S2.SS1.SSS8">
  <label>2.1.8</label><title>Reanalysis dataset for comparison</title>
      <p id="d2e4220">Multi-source reanalysis datasets of PM<sub>2.5</sub> chemical components were collected to assess the relative quality of the reanalysis dataset generated by OIRF-LEnKF v1.0. They comprise CAQRA-aerosol, Tracking Air Pollution in China (TAP, <uri>http://tapdata.org.cn/</uri>, last access: 2 June 2025), the Copernicus Atmosphere Monitoring Service ReAnalysis (CAMSRA, <ext-link xlink:href="https://doi.org/10.24381/d58bbf47" ext-link-type="DOI">10.24381/d58bbf47</ext-link>, Copernicus Atmosphere Monitoring Service, 2020), the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2, <uri>https://disc.gsfc.nasa.gov/datasets?project=MERRA-2</uri>, last access: 2 June 2025) and the reanalysis dataset generated by NAQPMS-PDAF v2.0 (NP2, <ext-link xlink:href="https://doi.org/10.5281/zenodo.10886914" ext-link-type="DOI">10.5281/zenodo.10886914</ext-link>, Li et al., 2024b). The High-resolution and High-quality Air Pollutants dataset for China (CHAP, <ext-link xlink:href="https://doi.org/10.5281/zenodo.10011898" ext-link-type="DOI">10.5281/zenodo.10011898</ext-link>, Wei et al., 2022) was not considered in this study because it does not cover the observation period. The properties of the multi-source reanalysis datasets are presented in Table 2.</p>

<table-wrap id="T2" specific-use="star"><label>Table 2</label><caption><p id="d2e4251">Properties of the multi-source reanalysis datasets for PM<sub>2.5</sub> chemical components.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="7">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:colspec colnum="4" colname="col4" align="left"/>
     <oasis:colspec colnum="5" colname="col5" align="left"/>
     <oasis:colspec colnum="6" colname="col6" align="left"/>
     <oasis:colspec colnum="7" colname="col7" align="left"/>
     <oasis:thead>
       <oasis:row>
         <oasis:entry colname="col1">Dataset</oasis:entry>
         <oasis:entry colname="col2">Chemical species</oasis:entry>
         <oasis:entry colname="col3">Period</oasis:entry>
         <oasis:entry colname="col4">Temporal</oasis:entry>
         <oasis:entry colname="col5">Vertical</oasis:entry>
         <oasis:entry colname="col6">Spatial</oasis:entry>
         <oasis:entry colname="col7">Spatial</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4">resolution</oasis:entry>
         <oasis:entry colname="col5">resolution</oasis:entry>
         <oasis:entry colname="col6">coverage</oasis:entry>
         <oasis:entry colname="col7">resolution</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">CAQRA-aerosol</oasis:entry>
         <oasis:entry colname="col2">SO<inline-formula><mml:math id="M216" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, NH<inline-formula><mml:math id="M217" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M218" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, BC</oasis:entry>
         <oasis:entry colname="col3">2013–2022</oasis:entry>
         <oasis:entry colname="col4">1-hourly</oasis:entry>
         <oasis:entry colname="col5">Surface level</oasis:entry>
         <oasis:entry colname="col6">China</oasis:entry>
         <oasis:entry colname="col7">15 km <inline-formula><mml:math id="M219" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 15 km</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">TAP</oasis:entry>
         <oasis:entry colname="col2">SO<inline-formula><mml:math id="M220" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, NH<inline-formula><mml:math id="M221" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M222" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, OM, BC</oasis:entry>
         <oasis:entry colname="col3">2000–present</oasis:entry>
         <oasis:entry colname="col4">Daily</oasis:entry>
         <oasis:entry colname="col5">Surface level</oasis:entry>
         <oasis:entry colname="col6">China</oasis:entry>
         <oasis:entry colname="col7">10 km <inline-formula><mml:math id="M223" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 10 km</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">NP2</oasis:entry>
         <oasis:entry colname="col2">SO<inline-formula><mml:math id="M224" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, NH<inline-formula><mml:math id="M225" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M226" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, BC</oasis:entry>
         <oasis:entry colname="col3">February 2022</oasis:entry>
         <oasis:entry colname="col4">1-hourly</oasis:entry>
         <oasis:entry colname="col5">Surface level</oasis:entry>
         <oasis:entry colname="col6">North China</oasis:entry>
         <oasis:entry colname="col7">5 km <inline-formula><mml:math id="M227" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 km</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">CAMSRA</oasis:entry>
         <oasis:entry colname="col2">NO<inline-formula><mml:math id="M228" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NH<inline-formula><mml:math id="M229" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col3">2003–2024</oasis:entry>
         <oasis:entry colname="col4">3-hourly</oasis:entry>
         <oasis:entry colname="col5">Pressure level</oasis:entry>
         <oasis:entry colname="col6">Global</oasis:entry>
         <oasis:entry colname="col7"><inline-formula><mml:math id="M230" display="inline"><mml:mrow><mml:mn mathvariant="normal">0.75</mml:mn><mml:mi mathvariant="italic">°</mml:mi><mml:mo>×</mml:mo><mml:mn mathvariant="normal">0.75</mml:mn><mml:mi mathvariant="italic">°</mml:mi></mml:mrow></mml:math></inline-formula></oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">MERRA-2</oasis:entry>
         <oasis:entry colname="col2">SO<inline-formula><mml:math id="M231" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, OM, BC</oasis:entry>
         <oasis:entry colname="col3">1980–present</oasis:entry>
         <oasis:entry colname="col4">1-hourly</oasis:entry>
         <oasis:entry colname="col5">Surface level</oasis:entry>
         <oasis:entry colname="col6">Global</oasis:entry>
         <oasis:entry colname="col7"><inline-formula><mml:math id="M232" display="inline"><mml:mrow><mml:mn mathvariant="normal">0.5</mml:mn><mml:mi mathvariant="italic">°</mml:mi><mml:mo>×</mml:mo><mml:mn mathvariant="normal">0.625</mml:mn><mml:mi mathvariant="italic">°</mml:mi></mml:mrow></mml:math></inline-formula></oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
</sec>
<sec id="Ch1.S2.SS2">
  <label>2.2</label><title>Experimental setting</title>
      <p id="d2e4674">We designed four experiments to evaluate the performance of OIRF-LEnKF v1.0 on background and analysis fields of the concentrations of SO<inline-formula><mml:math id="M233" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M234" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NH<inline-formula><mml:math id="M235" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, and BC. In the first experiment, we conducted model training, simulation, and assimilation at the first time step using 10 distinct ensemble sizes (2, 5, 10, 15, 20, 30, 40, 50, 100, and 200) to assess the dependence of computational efficiency on ensemble size. In the second experiment, we performed 24-timestep simulation and assimilation across 30 different scenarios, which comprised all possible combinations of 6 ensemble sizes (20, 30, 40, 50, 100, and 200) and 5 varied update frequencies for incremental learning (no update, 18 h interval, 12 h interval, 6 h interval, and 1 h interval). This design aimed to evaluate the sensitivity of simulation and assimilation performance to ensemble size and update frequency. In the third experiment, we conducted a 2-month simulation-assimilation loop using ground-level observations at 24 DA sites to comprehensively assess the capabilities of OIRF-LEnKF v1.0 in interpreting the spatiotemporal distribution of PM<sub>2.5</sub> chemical component concentrations. 
In the fourth experiment, we simultaneously assimilated all ground-level observations at 33 sites to generate a 1-month reanalysis dataset of PM<sub>2.5</sub> chemical component concentrations in North China and compared it with multiple reanalysis datasets. The observation errors in the four experiments were set at 0.5 <inline-formula><mml:math id="M238" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> (NH<inline-formula><mml:math id="M240" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>), 0.5 <inline-formula><mml:math id="M241" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> (NO<inline-formula><mml:math id="M243" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>), 1.0 <inline-formula><mml:math id="M244" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> (SO<inline-formula><mml:math id="M246" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>), 3.0 <inline-formula><mml:math id="M247" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> (OC), and 0.5 <inline-formula><mml:math id="M249" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> (BC), with the assumption that the observation errors were spatially isotropic in state space to reduce computational complexity.</p>
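      <p>The scenario design of the second experiment and the observation error settings can be sketched in a few lines of Python. This is only an illustrative sketch, not the configuration code of OIRF-LEnKF v1.0: the names <monospace>build_scenarios</monospace> and <monospace>obs_error_variances</monospace> are hypothetical, and the snippet assumes that the stated observation errors are standard deviations applied identically at every site, which is one reading of the isotropy assumption.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import product

# Values taken from the experimental description in the text.
ENSEMBLE_SIZES = [20, 30, 40, 50, 100, 200]
# None stands for "no update"; the other entries are update intervals in hours.
UPDATE_INTERVALS_H = [None, 18, 12, 6, 1]
# Observation errors in ug m-3, assumed here to be standard deviations.
OBS_ERROR = {"NH4": 0.5, "NO3": 0.5, "SO4": 1.0, "OC": 3.0, "BC": 0.5}


def build_scenarios():
    """Return every (ensemble size, update interval) combination."""
    return [
        {"ensemble_size": n, "update_interval_h": h}
        for n, h in product(ENSEMBLE_SIZES, UPDATE_INTERVALS_H)
    ]


def obs_error_variances(component, n_sites):
    """Diagonal of the observation error covariance for one component.

    The same variance is used at every site (assumed isotropy).
    """
    sigma = OBS_ERROR[component]
    return [sigma ** 2] * n_sites


scenarios = build_scenarios()
print(len(scenarios))  # 30
```
]]></preformat>
      <p>With 6 ensemble sizes and 5 update settings, the enumeration yields the 30 scenarios described for the second experiment.</p>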
</sec>
</sec>
<sec id="Ch1.S3">
  <label>3</label><title>Results and discussion</title>
<sec id="Ch1.S3.SS1">
  <label>3.1</label><title>Computational efficiency</title>
      <p id="d2e4892">Figure 3 summarizes the computational efficiency of hyperparameter tuning, simulation, and assimilation. Previous studies have indicated that the Bayesian optimization algorithm is both efficient and stable for hyperparameter tuning in various ML models (Lai, 2024). Here, we verify its stability and computational cost when applied to the OIRF model. Figure 3a demonstrates that both the estimated and observed minimum objective values initially decrease rapidly and then converge within 10 iterations across all ensemble sizes, indicating that hyperparameter optimization for the OIRF model converges stably and efficiently. In addition, the consistency in both magnitude and variation between the estimated and observed minimum objective values suggests that the surrogate model employed in Bayesian optimization fits the objective function accurately. Although the time consumed by each iteration increases with ensemble size, the number of iterations required to find the optimal hyperparameters remains relatively insensitive to ensemble size. As illustrated in Fig. 3b, the minimum value of the total observed objectives decreases significantly as the ensemble size increases from 2 to 20, indicating that a larger ensemble size enhances the optimization accuracy of the OIRF model. Notably, when the ensemble size exceeds 20, the rate of improvement in optimization accuracy diminishes. The total time consumed by the optimization process increases gradually for ensemble sizes from 2 to 50 but rises sharply beyond an ensemble size of 50. Therefore, an ensemble size of 50 is considered optimal for the OIRF model, balancing optimization accuracy and efficiency.</p>

      <fig id="F3" specific-use="star"><label>Figure 3</label><caption><p id="d2e4897">Computational efficiency of OIRF-LEnKF v1.0. <bold>(a)</bold> Variation in the minimum objective value throughout the Bayesian optimization process and time consumed by each iteration, determined by Eq. (5). <bold>(b)</bold> Minimum value of total observed minimum objectives and total time consumed during the Bayesian optimization process for different ensemble sizes. <bold>(c)</bold> Time consumed by model simulation and data assimilation at each timestep for OIRF-LEnKF and NAQPMS-PDAF v2.0 (NP2), and the ratio of total time consumed between OIRF-LEnKF and NP2. <bold>(d)</bold> Ratio of time consumed by model simulation and data assimilation between OIRF-LEnKF and NP2. SIM represents the simulation phase, and DA represents the data assimilation phase. The elapsed time of the OIRF-LEnKF simulation process in <bold>(c)</bold> has been magnified by a factor of 10 for clarity.</p></caption>
          <graphic xlink:href="https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026-f03.png"/>

        </fig>

      <p id="d2e4921">The computational costs of OIRF-LEnKF v1.0 in the simulation and assimilation processes were compared with those of a CTM-based DA system (NP2). To keep the computational expenses of OIRF-LEnKF v1.0 and NP2 comparable, the numbers of CPUs allocated for each grid calculation were deliberately set to similar values, 35 and 50, respectively. As illustrated in Fig. 3c, the total time consumed by simulation and assimilation for OIRF-LEnKF v1.0 amounts to only 11.41 % to 16.60 % of that for NP2. The saving is most pronounced in the simulation process, for which OIRF-LEnKF v1.0 requires merely 0.13 % to 0.20 % of the time consumed by NP2 (Fig. 3d). This marked improvement in simulation efficiency is comparable to that of the deep neural network model (Adie et al., 2024). The enhancement is primarily attributed to the fact that ML-based simulation does not require a detailed understanding of the complex physicochemical mechanisms of the atmosphere (Fang et al., 2022), whereas CTM-based simulation involves intricate computations of a large number of chemical species and reaction processes (Zaveri and Peters, 1999; Stockwell et al., 1990). The computational efficiency of OIRF-LEnKF v1.0 during the DA stage is lower than that of NP2: its DA time is 1.76 to 3.02 times that of NP2 (Fig. 3d), primarily due to minor differences in the DA algorithm and the number of CPUs allocated.</p>
      <p id="d2e4925">As the ensemble size increases from 2 to 50, the total time consumed for OIRF-LEnKF v1.0 and NP2 increases by 17.91 and 39.53 s, respectively. Specifically, the time consumed by simulation increases by 0.22 and 39.53 s, respectively, while the time consumed by assimilation increases by 17.69 and 0 s, respectively. Although the time consumed by assimilation for OIRF-LEnKF v1.0 is sensitive to ensemble size, the total time consumed remains relatively low (less than 50 s) at an ensemble size of 50. Given that the ensemble spread typically correlates positively with ensemble size (Lei and Whitaker, 2017), configuring an ensemble size of 50 in OIRF-LEnKF v1.0 offers an optimal balance among optimization accuracy, optimization efficiency, time consumed by simulation and assimilation, and ensemble spread.</p>
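      <p>The timing figures quoted above can be cross-checked with a short calculation. The sketch below only re-adds the numbers given in the text; the dictionary layout and the function names <monospace>total_increase</monospace> and <monospace>phase_share</monospace> are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Increase in elapsed time (s) as the ensemble size grows from 2 to 50,
# copied from the text for the simulation (SIM) and assimilation (DA) phases.
INCREASE_S = {
    "OIRF-LEnKF": {"SIM": 0.22, "DA": 17.69},
    "NP2": {"SIM": 39.53, "DA": 0.0},
}


def total_increase(system):
    """Sum of the SIM and DA increases, rounded to 0.01 s."""
    return round(sum(INCREASE_S[system].values()), 2)


def phase_share(system, phase):
    """Fraction of the total increase contributed by one phase."""
    return INCREASE_S[system][phase] / sum(INCREASE_S[system].values())


for name in INCREASE_S:
    print(name, total_increase(name))
```
]]></preformat>
      <p>The phase increases add up to the quoted totals of 17.91 s and 39.53 s. Almost all of the increase for OIRF-LEnKF v1.0 comes from the DA phase, whereas all of the increase for NP2 comes from the simulation phase.</p>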
</sec>
<sec id="Ch1.S3.SS2">
  <label>3.2</label><title>Sensitivity to parameterization scheme</title>
      <p id="d2e4936">The ensemble size and update frequency for incremental learning are critical parameters that influence the simulation and reanalysis capabilities of OIRF-LEnKF v1.0. Specifically, the ensemble size affects the estimation of the background error covariance matrix (Valler et al., 2019), which determines the observation propagation at the analysis step and the uncertainty range of the ensemble simulation at the simulation step. The update frequency for incremental learning drives the adaptability of the ML model to non-stationary data distributions (Shaheen et al., 2022), thereby influencing the generalization ability at the simulation step and indirectly affecting the background error information at the analysis step.</p>
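      <p>The link between ensemble size and the quality of the covariance estimate can be illustrated with a toy Monte Carlo experiment. The sketch below is not part of OIRF-LEnKF v1.0: it draws ensembles of two correlated Gaussian variables with an assumed true covariance of 0.6 and measures how strongly the sample covariance scatters between repeated draws. The function names, the Gaussian assumption, and all numerical settings are illustrative choices.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
import statistics


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def covariance_spread(ensemble_size, true_cov=0.6, trials=300, seed=0):
    """Standard deviation of the sample covariance over repeated ensembles.

    Both variables have unit variance, so true_cov is also their correlation.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        xs, ys = [], []
        for _ in range(ensemble_size):
            a = rng.gauss(0.0, 1.0)
            b = rng.gauss(0.0, 1.0)
            xs.append(a)
            # Mix a and b so that cov(x, y) equals true_cov and var(y) equals 1.
            ys.append(true_cov * a + (1.0 - true_cov ** 2) ** 0.5 * b)
        estimates.append(sample_covariance(xs, ys))
    return statistics.pstdev(estimates)


for size in (20, 50, 200):
    print(size, round(covariance_spread(size), 3))
```
]]></preformat>
      <p>In this toy setting, the scatter of the estimated covariance shrinks as the ensemble grows, which is the sampling effect that motivates testing several ensemble sizes.</p>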

      <fig id="F4" specific-use="star"><label>Figure 4</label><caption><p id="d2e4941"><bold>(a)</bold> Percentage change of Pearson correlation coefficient (CORR) relative to the minimum CORR (0.5) (<inline-formula><mml:math id="M251" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>CORR, %) for sensitivity test with six ensemble sizes (20, 30, 40, 50, 100, 200) and five update frequencies (no update, 18 h interval, 12 h interval, 6 h interval and 1 h interval) at the simulation step. <bold>(b)</bold> Same as <bold>(a)</bold> but for percentage change of root mean square error (RMSE) relative to the maximum RMSE (3.46 <inline-formula><mml:math id="M252" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) (<inline-formula><mml:math id="M254" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>RMSE, %) at the simulation step. <bold>(c)</bold> Same as <bold>(a)</bold> but for percentage change of CORR relative to the minimum CORR (0.7) at the analysis step. <bold>(d)</bold> Same as <bold>(a)</bold> but for percentage change of RMSE relative to the maximum RMSE (1.65 <inline-formula><mml:math id="M255" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) at the analysis step.</p></caption>
          <graphic xlink:href="https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026-f04.png"/>

        </fig>

      <p id="d2e5026">During the ML simulation process, the statistical indicators that compare the background fields and observations for OIRF-LEnKF v1.0 exhibit a pronounced sensitivity to update frequency but are less sensitive to ensemble size. With a fixed ensemble size, the correlation coefficient (CORR) increases as the update frequency rises (Fig. 4a). At the same time, the root mean square error (RMSE) decreases significantly with a higher update frequency (Fig. 4b). Specifically, when a 1 h update frequency is compared with the scenario without incremental learning, the percentage change of CORR relative to the minimum CORR (<inline-formula><mml:math id="M257" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>CORR) rises by 2.43 % (ensemble size of 200) to 11.75 % (ensemble size of 30), and the percentage change of RMSE relative to the maximum RMSE (<inline-formula><mml:math id="M258" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>RMSE) decreases by 32.55 % (ensemble size of 20) to 40.36 % (ensemble size of 100). This indicates that high-frequency incremental learning effectively enhances the adaptability of the statically trained ML model to non-stationary data distributions, improving its generalization capability and simulation accuracy for rapidly changing chemical components. Notably, an increase in ensemble size can amplify the effect of incremental learning on simulation errors. For the same comparison, the reduction in <inline-formula><mml:math id="M259" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>RMSE at an ensemble size of 100 is approximately 8 percentage points greater than at an ensemble size of 20 (Fig. 4b). We attribute this to the fact that, as the ensemble size increases, the probability density distribution becomes more accurate, leading to improved ensemble simulation skill (Chen, 2024).</p>
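      <p>The indicators used in this sensitivity test can be written down compactly. The following Python sketch assumes the standard definitions of the Pearson correlation coefficient and RMSE, and assumes that the percentage change is computed as (value − reference) / reference × 100, with the reference being the minimum CORR or the maximum RMSE given in the caption of Fig. 4. The function names are hypothetical, and the exact formula used to produce Fig. 4 is not stated in this section.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def pearson_corr(obs, sim):
    """Pearson correlation coefficient between observations and model output."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_s = sum(sim) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_s = sum((s - mean_s) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def rmse(obs, sim):
    """Root mean square error between observations and model output."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))


def pct_change(value, reference):
    """Percentage change of value relative to a reference (assumed definition)."""
    return (value - reference) / reference * 100.0


# Illustrative numbers only: a CORR of 0.55 against the reference 0.5,
# and an RMSE of 2.25 ug m-3 against the reference 3.46 ug m-3.
print(round(pct_change(0.55, 0.5), 2))
print(round(pct_change(2.25, 3.46), 2))
```
]]></preformat>
      <p>Under this definition, a positive change in CORR and a negative change in RMSE both indicate an improvement relative to the worst-performing scenario.</p>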
      <p id="d2e5051">During the DA analysis phase, the statistical indicators that compare the analysis fields and observations for OIRF-LEnKF v1.0 depend significantly on the ensemble size rather than on the update frequency. With a fixed update frequency other than 1 h, the CORR increases considerably with a larger ensemble size (Fig. 4c), while the RMSE decreases markedly (Fig. 4d). Specifically, when an ensemble size of 200 is compared with that of 20, the <inline-formula><mml:math id="M260" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>CORR increases by 9.75 % (6 h update frequency) to 19.04 % (18 h update frequency), and the <inline-formula><mml:math id="M261" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>RMSE decreases by 16.70 % (6 h update frequency) to 30.48 % (18 h update frequency). This improvement is attributed to the more accurate estimation of the background error covariance matrix with a larger ensemble size, which enables the effective propagation of observations within the model state space (Valler et al., 2019). However, the 1 h update frequency diminishes the dependence of the analysis fields on the ensemble size. This interference may arise because high-frequency incremental learning causes the new DTs in the OIRF model to diverge from the existing DTs, so that the background error covariance structure deviates from the true state. Consequently, although the 1 h update frequency can significantly enhance the simulation performance, we configured an ensemble size of 50 with a 6 h update frequency in OIRF-LEnKF v1.0 to balance computational efficiency, ML simulation accuracy, and DA analysis performance.</p>
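      <p>How the ensemble covariance propagates an observation to an unobserved location can be shown with a toy two-variable example. The sketch below implements a generic perturbed-observation EnKF update with a single localization weight on the cross-covariance. It is an assumption-laden illustration and not the localized EnKF implemented in OIRF-LEnKF v1.0; the function name <monospace>enkf_update</monospace>, the two-variable state, and all numbers are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def enkf_update(ensemble, obs, obs_var, loc_weight, rng):
    """Perturbed-observation EnKF update for a two-variable state.

    ensemble   : list of (x_site, x_neighbor) members; only x_site is observed
    obs        : observed value at the site
    obs_var    : observation error variance
    loc_weight : localization factor in [0, 1] applied to the cross-covariance
    rng        : random.Random instance used to perturb the observation
    """
    n = len(ensemble)
    mean_site = sum(m[0] for m in ensemble) / n
    mean_nbr = sum(m[1] for m in ensemble) / n
    var_site = sum((m[0] - mean_site) ** 2 for m in ensemble) / (n - 1)
    cov = sum((m[0] - mean_site) * (m[1] - mean_nbr) for m in ensemble) / (n - 1)

    # Kalman gains: the neighbor is updated only through the localized covariance.
    gain_site = var_site / (var_site + obs_var)
    gain_nbr = loc_weight * cov / (var_site + obs_var)

    updated = []
    for x_site, x_nbr in ensemble:
        perturbed_obs = obs + rng.gauss(0.0, math.sqrt(obs_var))
        innovation = perturbed_obs - x_site
        updated.append((x_site + gain_site * innovation,
                        x_nbr + gain_nbr * innovation))
    return updated


rng = random.Random(1)
prior = []
for _ in range(50):
    a = rng.gauss(5.0, 1.0)
    prior.append((a, 0.8 * a + rng.gauss(0.0, 0.3)))

posterior = enkf_update(prior, obs=8.0, obs_var=0.25, loc_weight=0.5, rng=rng)
print(sum(m[0] for m in posterior) / len(posterior))
```
]]></preformat>
      <p>Because the neighbor is corrected only through the sampled cross-covariance, errors in that covariance translate directly into errors in the analysis away from the observation sites. Setting the localization weight to zero leaves the neighbor unchanged.</p>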
</sec>
<sec id="Ch1.S3.SS3">
  <label>3.3</label><title>Evaluation of DA results</title>
      <p id="d2e5076">This section assesses the performance of the free-run field without DA and incremental learning (FR), the ML-simulated background field with incremental learning (SIM) and the analysis field with DA (ANA) in interpreting the spatiotemporal distribution of PM<sub>2.5</sub> chemical components.</p>
<sec id="Ch1.S3.SS3.SSS1">
  <label>3.3.1</label><title>Assessment of temporal variation in chemical components</title>
      <p id="d2e5096">Figure 5 presents the time series of errors (observations minus OIRF-LEnKF v1.0 outputs) and statistical indicators comparing observations with FR, SIM, and ANA across 33 ground-level sites. As illustrated in Fig. 5a1–a3, the errors of FR for NH<inline-formula><mml:math id="M263" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M264" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, and SO<inline-formula><mml:math id="M265" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> ranged from <inline-formula><mml:math id="M266" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">2.30</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">1.97</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M267" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> to <inline-formula><mml:math id="M269" display="inline"><mml:mrow><mml:mn mathvariant="normal">8.84</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">5.04</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M270" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, <inline-formula><mml:math id="M272" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">7.60</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">5.29</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M273" display="inline"><mml:mrow 
class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> to <inline-formula><mml:math id="M275" display="inline"><mml:mrow><mml:mn mathvariant="normal">14.64</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">17.20</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M276" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, and <inline-formula><mml:math id="M278" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">4.31</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">3.81</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M279" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> to <inline-formula><mml:math id="M281" display="inline"><mml:mrow><mml:mn mathvariant="normal">9.61</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">6.00</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M282" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, respectively. 
The overall errors of FR for NH<inline-formula><mml:math id="M284" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M285" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, and SO<inline-formula><mml:math id="M286" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> are positive and relatively dispersed, suggesting a general underestimation of inorganic salt concentrations. In contrast, the errors of SIM are confined to narrower ranges of <inline-formula><mml:math id="M287" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">2.66</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">4.18</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M288" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> to <inline-formula><mml:math id="M290" display="inline"><mml:mrow><mml:mn mathvariant="normal">5.18</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">4.87</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M291" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> (NH<inline-formula><mml:math id="M293" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>), <inline-formula><mml:math id="M294" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">7.17</mml:mn><mml:mo>±</mml:mo><mml:mn 
mathvariant="normal">10.75</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M295" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> to <inline-formula><mml:math id="M297" display="inline"><mml:mrow><mml:mn mathvariant="normal">10.07</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">7.48</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M298" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> (NO<inline-formula><mml:math id="M300" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>), and <inline-formula><mml:math id="M301" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1.37</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">1.98</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M302" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> to <inline-formula><mml:math id="M304" display="inline"><mml:mrow><mml:mn mathvariant="normal">6.50</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">4.81</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M305" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> (SO<inline-formula><mml:math id="M307" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>), indicating that incremental learning enhances the ability to capture the temporal features of inorganic salt concentrations. 
Compared with FR and SIM, the errors of ANA are predominantly concentrated around zero over time, signifying that DA significantly enhances the capacity to interpret the temporal variation of inorganic salt concentrations. In contrast to those for the inorganic salt aerosols, the errors of FR for OC and BC are negative on average, ranging from <inline-formula><mml:math id="M308" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">12.18</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">4.09</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M309" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> to <inline-formula><mml:math id="M311" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1.11</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">2.78</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M312" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> and <inline-formula><mml:math id="M314" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">5.41</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">1.39</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M315" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> to <inline-formula><mml:math id="M317" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">0.87</mml:mn><mml:mo>±</mml:mo><mml:mn mathvariant="normal">0.57</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M318" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, respectively, which indicates a general overestimation of carbonaceous aerosol concentrations (Fig. 5a4 and a5). 
The errors of SIM and ANA are relatively similar, both concentrating around zero over time due to the effects of incremental learning and DA.</p>

      <fig id="F5" specific-use="star"><label>Figure 5</label><caption><p id="d2e5759">Smoothed variation in the error between observation and model output – including the free-run field (FR), the ML-simulated background field (SIM) and the analysis field (ANA) – for <bold>(a1)</bold> NH<inline-formula><mml:math id="M320" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, <bold>(a2)</bold> NO<inline-formula><mml:math id="M321" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, <bold>(a3)</bold> SO<inline-formula><mml:math id="M322" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, <bold>(a4)</bold> OC and <bold>(a5)</bold> BC at all sites during February and March of 2022. The lines and shaded areas represent the mean and standard deviation of the errors, respectively. <bold>(b)</bold> Correlation coefficient (CORR) between observation and model output for five PM<sub>2.5</sub> chemical components at DA sites. <bold>(c)</bold> Same as <bold>(b)</bold> but for root mean square errors (RMSE). <bold>(d)</bold> Same as <bold>(b)</bold> but for VE sites. <bold>(e)</bold> Same as <bold>(b)</bold> but for RMSE at VE sites.</p></caption>
            <graphic xlink:href="https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026-f05.png"/>

          </fig>

      <p id="d2e5854">Figure 5b–e presents the CORR and RMSE for the time series of five PM<sub>2.5</sub> chemical components across 24 DA sites and 9 VE sites. For the DA sites, the CORR values of FR for NH<inline-formula><mml:math id="M325" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M326" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, SO<inline-formula><mml:math id="M327" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, and BC ranged from 0.24 to 0.76, 0.25 to 0.76, 0.11 to 0.64, 0.33 to 0.77, and 0.12 to 0.62, respectively (Fig. 5b). The RMSE values varied from 2.64 to 9.15 <inline-formula><mml:math id="M328" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, 4.73 to 16.24 <inline-formula><mml:math id="M330" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, 2.31 to 10.24 <inline-formula><mml:math id="M332" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, 4.57 to 10.41 <inline-formula><mml:math id="M334" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, and 1.36 to 3.42 <inline-formula><mml:math id="M336" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, respectively (Fig. 5c). 
Following incremental learning, the CORR and RMSE values of SIM showed a more concentrated distribution than those of FR, with average CORR (0.42 to 0.83) increasing by 5.61 % to 114.28 % and average RMSE (0.99 to 7.80 <inline-formula><mml:math id="M338" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) decreasing by 26.38 % to 61.75 %. Additionally, compared to the SIM of a CTM-based DA system, the SIM of OIRF-LEnKF v1.0 improved CORR by 19.14 % to 73.19 % and RMSE by 33.16 % to 90.10 % (Table 3). This finding indicates that the incremental learning mechanism is more effective than the optimal estimation of initial conditions in enhancing PM<sub>2.5</sub> chemical component simulations. We attribute this to the fact that incremental learning improves the ML-based simulation globally, whereas the CTM-based simulation remains constrained by uncertainties in emission inventories and physicochemical mechanisms in addition to initial conditions (Mallet and Sportisse, 2006; Luo et al., 2023). After DA, the CORR and RMSE values of ANA for NH<inline-formula><mml:math id="M341" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M342" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, SO<inline-formula><mml:math id="M343" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, and BC showed a more concentrated distribution than those of FR and SIM. 
The average CORR (0.58 to 1.00) and RMSE (0.80 to 2.36 <inline-formula><mml:math id="M344" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) values improved by 35.27 % to 187.15 % and 68.99 % to 91.31 %, respectively, relative to FR, and by 18.85 % to 38.73 % and 19.71 % to 88.20 %, respectively, relative to SIM.</p>
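      <p>As a minimal sketch of how the site-level statistics in Fig. 5b–e could be reproduced, the following Python snippet computes CORR and RMSE for one observed and one modelled time series. It assumes that CORR is the Pearson correlation coefficient; the function names and the sample values are hypothetical and are not taken from this study.</p>
<preformat>
```python
import math


def corr(obs, model):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_m = sum(model) / n
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(obs, model))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_m = sum((m - mean_m) ** 2 for m in model)
    return cov / math.sqrt(var_o * var_m)


def rmse(obs, model):
    """Root mean square error of the model series against the observations."""
    n = len(obs)
    return math.sqrt(sum((m - o) ** 2 for o, m in zip(obs, model)) / n)


# Hypothetical hourly concentrations at a single site (illustrative only).
obs = [4.0, 6.0, 8.0, 10.0]
sim = [3.0, 5.0, 7.0, 9.0]
print(round(corr(obs, sim), 2), round(rmse(obs, sim), 2))
```
</preformat>
      <p>Applying both functions to each site and each field (FR, SIM and ANA) would give one CORR and one RMSE value per site, which is the kind of per-site distribution summarized above.</p>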

<table-wrap id="T3" specific-use="star"><label>Table 3</label><caption><p id="d2e6100">The correlation coefficient (CORR) and root mean square error (RMSE, <inline-formula><mml:math id="M346" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) of OIRF-LEnKF v1.0 (this study) and NAQPMS-PDAF v2.0 (NP2) at DA sites and VE sites for the simulations of NH<inline-formula><mml:math id="M348" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M349" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, SO<inline-formula><mml:math id="M350" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC and BC, as well as the improvement (%) of this study relative to NP2.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="15">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="left"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:colspec colnum="6" colname="col6" align="right"/>
     <oasis:colspec colnum="7" colname="col7" align="left"/>
     <oasis:colspec colnum="8" colname="col8" align="right"/>
     <oasis:colspec colnum="9" colname="col9" align="right"/>
     <oasis:colspec colnum="10" colname="col10" align="left"/>
     <oasis:colspec colnum="11" colname="col11" align="right"/>
     <oasis:colspec colnum="12" colname="col12" align="right"/>
     <oasis:colspec colnum="13" colname="col13" align="left"/>
     <oasis:colspec colnum="14" colname="col14" align="right"/>
     <oasis:colspec colnum="15" colname="col15" align="right"/>
     <oasis:thead>
       <oasis:row>
         <oasis:entry colname="col1"/>
         <oasis:entry rowsep="1" namest="col2" nameend="col3" align="center">NH<inline-formula><mml:math id="M351" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry rowsep="1" namest="col5" nameend="col6" align="center">NO<inline-formula><mml:math id="M352" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry rowsep="1" namest="col8" nameend="col9" align="center">SO<inline-formula><mml:math id="M353" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col10"/>
         <oasis:entry rowsep="1" namest="col11" nameend="col12" align="center">OC </oasis:entry>
         <oasis:entry colname="col13"/>
         <oasis:entry rowsep="1" namest="col14" nameend="col15" align="center">BC </oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2">DA</oasis:entry>
         <oasis:entry colname="col3">VE</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">DA</oasis:entry>
         <oasis:entry colname="col6">VE</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">DA</oasis:entry>
         <oasis:entry colname="col9">VE</oasis:entry>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11">DA</oasis:entry>
         <oasis:entry colname="col12">VE</oasis:entry>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14">DA</oasis:entry>
         <oasis:entry colname="col15">VE</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row rowsep="1">
         <oasis:entry namest="col1" nameend="col15">CORR </oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">This study</oasis:entry>
         <oasis:entry colname="col2">0.85</oasis:entry>
         <oasis:entry colname="col3">0.82</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">0.86</oasis:entry>
         <oasis:entry colname="col6">0.85</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">0.66</oasis:entry>
         <oasis:entry colname="col9">0.63</oasis:entry>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11">0.54</oasis:entry>
         <oasis:entry colname="col12">0.53</oasis:entry>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14">0.31</oasis:entry>
         <oasis:entry colname="col15">0.37</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">NP2</oasis:entry>
         <oasis:entry colname="col2">0.60</oasis:entry>
         <oasis:entry colname="col3">0.53</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">0.50</oasis:entry>
         <oasis:entry colname="col6">0.40</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">0.53</oasis:entry>
         <oasis:entry colname="col9">0.52</oasis:entry>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11">0.44</oasis:entry>
         <oasis:entry colname="col12">0.38</oasis:entry>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14">0.26</oasis:entry>
         <oasis:entry colname="col15">0.23</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Improvement (%)</oasis:entry>
         <oasis:entry colname="col2">41.59</oasis:entry>
         <oasis:entry colname="col3">53.69</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">73.19</oasis:entry>
         <oasis:entry colname="col6">110.49</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">23.59</oasis:entry>
         <oasis:entry colname="col9">21.92</oasis:entry>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11">23.91</oasis:entry>
         <oasis:entry colname="col12">41.60</oasis:entry>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14">19.14</oasis:entry>
         <oasis:entry colname="col15">64.16</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry namest="col1" nameend="col15">RMSE (<inline-formula><mml:math id="M354" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) </oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">This study</oasis:entry>
         <oasis:entry colname="col2">3.35</oasis:entry>
         <oasis:entry colname="col3">3.07</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">6.70</oasis:entry>
         <oasis:entry colname="col6">5.94</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">3.80</oasis:entry>
         <oasis:entry colname="col9">3.71</oasis:entry>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11">3.47</oasis:entry>
         <oasis:entry colname="col12">3.19</oasis:entry>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14">1.17</oasis:entry>
         <oasis:entry colname="col15">1.12</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">NP2</oasis:entry>
         <oasis:entry colname="col2">5.01</oasis:entry>
         <oasis:entry colname="col3">4.88</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">11.13</oasis:entry>
         <oasis:entry colname="col6">10.73</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">6.86</oasis:entry>
         <oasis:entry colname="col9">7.23</oasis:entry>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11">18.71</oasis:entry>
         <oasis:entry colname="col12">20.69</oasis:entry>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14">11.78</oasis:entry>
         <oasis:entry colname="col15">13.30</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Improvement (%)</oasis:entry>
         <oasis:entry colname="col2">33.16</oasis:entry>
         <oasis:entry colname="col3">37.10</oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5">39.77</oasis:entry>
         <oasis:entry colname="col6">44.62</oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8">44.59</oasis:entry>
         <oasis:entry colname="col9">48.73</oasis:entry>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11">81.48</oasis:entry>
         <oasis:entry colname="col12">84.58</oasis:entry>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14">90.10</oasis:entry>
         <oasis:entry colname="col15">91.55</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>
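      <p>The improvement rows of Table 3 can be illustrated with a short Python sketch. It assumes that improvement is the relative change with respect to NP2 (an increase for CORR, a decrease for RMSE); this definition is inferred from the tabulated values rather than stated in the table, and the function names are hypothetical.</p>
<preformat>
```python
def corr_improvement(new, ref):
    """Relative increase in CORR (%), assuming (new - ref) / ref * 100."""
    return (new - ref) / ref * 100.0


def rmse_improvement(new, ref):
    """Relative decrease in RMSE (%), assuming (ref - new) / ref * 100."""
    return (ref - new) / ref * 100.0


# Rounded DA-site values for ammonium from Table 3 (this study vs. NP2).
corr_gain = corr_improvement(0.85, 0.60)
rmse_gain = rmse_improvement(3.35, 5.01)
print(round(corr_gain, 2), round(rmse_gain, 2))
```
</preformat>
      <p>With the rounded ammonium values at DA sites, the sketch gives about 41.67 % and 33.13 %, close to the tabulated 41.59 % and 33.16 %. The small differences would be consistent with the published percentages having been computed from unrounded statistics, although that is an assumption.</p>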

      <p id="d2e6625">For the VE sites without DA, the CORR values of FR for NH<inline-formula><mml:math id="M356" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M357" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, SO<inline-formula><mml:math id="M358" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, and BC ranged from 0.20 to 0.66, 0.25 to 0.71, <inline-formula><mml:math id="M359" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">0.20</mml:mn></mml:mrow></mml:math></inline-formula> to 0.50, 0.13 to 0.66, and 0.15 to 0.47, respectively (Fig. 5d). The RMSE values varied from 3.39 to 8.25 <inline-formula><mml:math id="M360" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, 8.04 to 14.18 <inline-formula><mml:math id="M362" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, 3.94 to 7.04 <inline-formula><mml:math id="M364" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, 6.23 to 10.05 <inline-formula><mml:math id="M366" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, and 2.33 to 3.30 <inline-formula><mml:math id="M368" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, respectively (Fig. 5e). 
After incremental learning, the CORR and RMSE values of SIM showed a more concentrated distribution than those of FR, with average CORR (0.39 to 0.81) increasing by 12.00 % to 124.69 % and average RMSE (0.93 to 7.76 <inline-formula><mml:math id="M370" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) decreasing by 28.37 % to 68.00 %. Furthermore, compared to the SIM of a CTM-based DA system, the SIM of OIRF-LEnKF v1.0 improved CORR by 21.92 % to 110.49 % and RMSE by 37.10 % to 91.55 % (Table 3), with greater improvements at VE sites than at DA sites, further demonstrating the advantages of the incremental learning mechanism for improving ML-based simulations on a global scale. After DA, the CORR and RMSE values of ANA for NH<inline-formula><mml:math id="M372" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M373" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, SO<inline-formula><mml:math id="M374" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, and BC ranged from 0.38 to 0.80 and 0.90 to 7.76 <inline-formula><mml:math id="M375" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, respectively, showing a more concentrated distribution than those of FR and SIM. 
The average CORR and RMSE values increased by 14.14 % to 116.65 % and decreased by 23.46 % to 68.75 %, respectively, compared to FR, indicating that the EnKF algorithm with localization schemes effectively propagates observations within the model state space.</p>

      <fig id="F6" specific-use="star"><label>Figure 6</label><caption><p id="d2e6861">Spatial distribution of observation (OBS), free-run field (FR), ML-simulated background field (SIM) and analysis field (ANA) for NH<inline-formula><mml:math id="M377" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> <bold>(a1–a4)</bold>, NO<inline-formula><mml:math id="M378" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> <bold>(b1–b4)</bold>, SO<inline-formula><mml:math id="M379" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> <bold>(c1–c4)</bold>, OC <bold>(d1–d4)</bold> and BC <bold>(e1–e4)</bold>.</p></caption>
            <graphic xlink:href="https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026-f06.png"/>

          </fig>

</sec>
<sec id="Ch1.S3.SS3.SSS2">
  <label>3.3.2</label><title>Assessment of spatial distribution in chemical components</title>
      <p id="d2e6933">Figure 6 presents the spatial distributions of observations from sparse sites (OBS), FR, SIM and ANA for the average concentrations of NH<inline-formula><mml:math id="M380" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M381" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, SO<inline-formula><mml:math id="M382" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC, and BC over a two-month period from February to March 2022. The OBS of NH<inline-formula><mml:math id="M383" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> reveals that the concentrations at southern sites in North China are significantly higher than those at northern sites, particularly in northern Henan Province, with a maximum concentration of 12.20 <inline-formula><mml:math id="M384" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> (Fig. 6a1). However, FR fails to accurately capture the spatial patterns of NH<inline-formula><mml:math id="M386" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> concentration (Fig. 
6a2), exhibiting underestimations at 100 % of DA sites and 89 % of VE sites, with average underestimations of 2.71 and 3.07 <inline-formula><mml:math id="M387" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, respectively (Fig. 7a1). This finding is attributed to underestimation in the original training samples (Kong et al., 2025). Compared to FR, SIM mitigates the underestimation (Fig. 6a3), with underestimations of 1.56 <inline-formula><mml:math id="M389" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> at 96 % of DA sites and 1.88 <inline-formula><mml:math id="M391" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> at 78 % of VE sites (Fig. 7a2). After DA, ANA accurately depicts the spatial distribution of NH<inline-formula><mml:math id="M393" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> concentrations (Fig. 6a4), with underestimations of 0.74 <inline-formula><mml:math id="M394" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> at 92 % of DA sites and 2.34 <inline-formula><mml:math id="M396" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> at 44 % of VE sites (Fig. 7a3). The increment field (INC) between ANA and SIM exhibits substantial positive increments in southern North China (Fig. 
7a4), indicating that the observations from 24 DA sites were effectively propagated within the model state space, thereby addressing the underestimation of NH<inline-formula><mml:math id="M398" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> concentrations in the whole domain.</p>
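      <p>The site percentages quoted above are consistent with simple fractions of the 24 DA sites and 9 VE sites. The following Python sketch illustrates this; the per-field counts of underestimating sites are inferred from the reported percentages and are assumptions, not values given in the text.</p>
<preformat>
```python
def percent_of_sites(count, total):
    """Share of sites (in %) rounded to the nearest integer."""
    return round(100.0 * count / total)


N_DA, N_VE = 24, 9  # site counts stated in the text

# Hypothetical counts of sites where ammonium is underestimated,
# inferred from the reported percentages (DA count, VE count).
inferred = {"FR": (24, 8), "SIM": (23, 7), "ANA": (22, 4)}

for field, (da, ve) in inferred.items():
    print(field, percent_of_sites(da, N_DA), percent_of_sites(ve, N_VE))
```
</preformat>
      <p>Under these assumed counts the sketch reproduces the reported 100 %, 96 % and 92 % for DA sites and 89 %, 78 % and 44 % for VE sites, which suggests that each percentage corresponds to a whole number of sites.</p>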

      <fig id="F7" specific-use="star"><label>Figure 7</label><caption><p id="d2e7148">Spatial distribution of observation minus free-run field (OmF), observation minus ML-simulated background field (OmS), observation minus analysis field (OmA) and analysis field minus background field (INC) for NH<inline-formula><mml:math id="M399" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> <bold>(a1–a4)</bold>, NO<inline-formula><mml:math id="M400" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> <bold>(b1–b4)</bold>, SO<inline-formula><mml:math id="M401" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> <bold>(c1–c4)</bold>, OC <bold>(d1–d4)</bold> and BC <bold>(e1–e4)</bold>. The circle indicates the DA sites with data assimilation, and the upward-pointing triangle indicates the VE sites without data assimilation.</p></caption>
            <graphic xlink:href="https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026-f07.png"/>

          </fig>

      <p id="d2e7212">The observed spatial distributions of NO<inline-formula><mml:math id="M402" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> and SO<inline-formula><mml:math id="M403" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> are consistent with those of NH<inline-formula><mml:math id="M404" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, revealing significantly higher concentrations at southern sites in the North China region than at northern sites, particularly in the Hebei-Henan-Shandong junction areas (Fig. 6b1 and c1). Although FR can capture the spatial patterns of NO<inline-formula><mml:math id="M405" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> and SO<inline-formula><mml:math id="M406" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, it significantly underestimates their concentrations (Fig. 6b2 and c2). Specifically, 63 %–79 % of DA sites and 89 % of VE sites underestimate by 1.87–3.76 and 1.57–3.44 <inline-formula><mml:math id="M407" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, respectively (Fig. 7b1 and c1). Compared to FR, SIM mitigates the underestimations in the Hebei-Henan-Shandong junction areas and overestimations in the Beijing-Tianjin-Hebei eastern areas (Fig. 
6b3 and c3), with improvements at most DA and VE sites (Fig. 7b2 and c2). After DA, ANA accurately characterizes the spatial distribution of NO<inline-formula><mml:math id="M409" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> and SO<inline-formula><mml:math id="M410" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> concentrations (Fig. 6b4 and c4), with underestimations of only 0.77–1.31 and 1.85–2.73 <inline-formula><mml:math id="M411" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> at 88 %–100 % of DA sites and 56 %–67 % of VE sites, respectively (Fig. 7b3 and c3). Furthermore, similar to the INC of NH<inline-formula><mml:math id="M413" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, the INCs of NO<inline-formula><mml:math id="M414" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> and SO<inline-formula><mml:math id="M415" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> exhibit widespread positive increments across the North China region (Fig. 7b4 and c4).</p>
      <p id="d2e7390">In contrast to the spatial distributions of NH<inline-formula><mml:math id="M416" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M417" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> and SO<inline-formula><mml:math id="M418" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, the observed spatial distributions of OC and BC reveal that concentrations in the North China region demonstrate spatial homogeneity (Fig. 6d1 and e1). However, FR significantly overestimated the concentrations of OC and BC in the North China region (Figs. 6d2, e2, and 7d1, e1), with an average overestimation of 6.12 <inline-formula><mml:math id="M419" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for OC and 1.99 <inline-formula><mml:math id="M421" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for BC at all DA sites, and 6.88 <inline-formula><mml:math id="M423" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for OC and 2.29 <inline-formula><mml:math id="M425" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for BC at all VE sites. Following incremental learning, SIM significantly reduced the overestimations (Figs. 
6d3, e3, and 7d2, e2), resulting in an average overestimation of 1.46 <inline-formula><mml:math id="M427" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for OC and 0.53 <inline-formula><mml:math id="M429" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for BC at 71 %–79 % of DA sites, and 1.56 <inline-formula><mml:math id="M431" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for OC and 0.65 <inline-formula><mml:math id="M433" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for BC at 89 % of VE sites. The number of sites exhibiting overestimation and the degree of overestimation are markedly lower than those of FR. After DA, ANA further mitigates the overestimation in SIM, accurately interpreting the spatial distributions of OC and BC concentrations (Fig. 6d4 and e4), with the gaps between the observations and analysis fields for both DA and VE sites approaching 0 (Fig. 7d3 and e3). Assimilating the observations from 24 DA sites effectively mitigates the overestimation in the southern North China region (Fig. 7d4 and e4).</p>
</sec>
<sec id="Ch1.S3.SS3.SSS3">
  <label>3.3.3</label><title>Comparison with multiple reanalysis datasets</title>
      <p id="d2e7604">In this section, we utilized OIRF-LEnKF v1.0 to generate an hourly reanalysis dataset of PM<sub>2.5</sub> key chemical components (SO<inline-formula><mml:math id="M436" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M437" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NH<inline-formula><mml:math id="M438" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC and BC) for the North China region in February 2022. We compared it with multiple related reanalysis datasets, including CAQRA-aerosol, TAP, Global-RA (CAMS and MERRA-2), and the dataset generated by NP2. The temporal and spatial resolutions of CAQRA-aerosol, TAP, and Global-RA on both global and national scales are lower than those of OIRF-LEnKF v1.0 and NP2 on the regional scale (Table 2). It is important to note that the spatial range and resolution of OIRF-LEnKF v1.0 are contingent upon those of the available training data. Consequently, OIRF-LEnKF v1.0 has significant potential for elucidating the spatiotemporal distribution of PM<sub>2.5</sub> chemical components on a global and national scale.</p>
      <p id="d2e7664">Figure 8 illustrates the average values of observation minus analysis (OmA) over 1 month. For NH<inline-formula><mml:math id="M440" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> (Fig. 8a1–a5), the mean absolute OmA of OIRF-LEnKF v1.0 at a total of 33 sites (0.25 <inline-formula><mml:math id="M441" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) is significantly lower than that of NP2 (0.81 <inline-formula><mml:math id="M443" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>), CAQRA (1.18 <inline-formula><mml:math id="M445" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>), TAP (0.92 <inline-formula><mml:math id="M447" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>), and Global-RA (2.92 <inline-formula><mml:math id="M449" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>). Furthermore, the OmA of OIRF-LEnKF v1.0 is within <inline-formula><mml:math id="M451" display="inline"><mml:mrow><mml:mo>±</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M452" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> at 97 % of the sites, whereas NP2, CAQRA, TAP, and Global-RA had only 9 %–70 % of the sites within this range. 
Most of the sites exhibit slight underestimations in NP2 and TAP, overestimations in CAQRA, and significant underestimations in Global-RA, while the disparity between OIRF-LEnKF v1.0 and the observations is minimal. The findings for NO<inline-formula><mml:math id="M454" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> are comparable to those for NH<inline-formula><mml:math id="M455" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> (Fig. 8b1–b5): the mean absolute OmA of OIRF-LEnKF v1.0 at a total of 33 sites (0.19 <inline-formula><mml:math id="M456" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) is significantly lower than that of NP2 (0.93 <inline-formula><mml:math id="M458" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>), CAQRA (8.42 <inline-formula><mml:math id="M460" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>), TAP (2.24 <inline-formula><mml:math id="M462" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<inline-formula><mml:math id="M463" display="inline"><mml:mrow><mml:msup><mml:mi/><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">3</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>), and Global-RA (2.27 <inline-formula><mml:math id="M464" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>). 
Furthermore, the OmA of OIRF-LEnKF v1.0 is within <inline-formula><mml:math id="M466" display="inline"><mml:mrow><mml:mo>±</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:mrow></mml:math></inline-formula> <inline-formula><mml:math id="M467" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> at all sites, whereas NP2, CAQRA, TAP, and Global-RA had only 3 %–94 % of the sites within this range. The similar spatial patterns of OmA for NH<inline-formula><mml:math id="M469" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> and NO<inline-formula><mml:math id="M470" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> are related to thermodynamic equilibrium (Nenes et al., 1998) and consistency between NH<inline-formula><mml:math id="M471" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> and NO<inline-formula><mml:math id="M472" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> has also been observed in previous works (Sun, 2018; Shi et al., 2021; Wu et al., 2022).</p>
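      <p>The OmA statistics quoted above can be reproduced from site-level means with a few lines of Python. The sketch below is illustrative only: it assumes that OmA is first averaged over the month at each site and that the absolute value is then averaged over sites, and the function name <monospace>oma_summary</monospace> and the sample values are hypothetical rather than taken from this study. A positive OmA means that the analysis lies below the observation.</p>
      <preformat>
```python
def oma_summary(obs, ana, tol):
    """Summarise observation-minus-analysis (OmA) differences over sites.

    obs, ana: time-averaged concentrations per site (same order, same units).
    tol: half-width of the acceptance band, e.g. 1.0 for a +/-1 band.
    """
    if len(obs) != len(ana) or not obs:
        raise ValueError("obs and ana must be non-empty and of equal length")
    # Signed difference per site: positive means the analysis underestimates.
    oma = [o - a for o, a in zip(obs, ana)]
    # Mean absolute OmA over all sites.
    mean_abs = sum(abs(d) for d in oma) / len(oma)
    # Fraction of sites whose OmA lies inside the +/-tol band.
    within = sum(1 for d in oma if tol >= abs(d)) / len(oma)
    return {"oma": oma, "mean_abs_oma": mean_abs, "frac_within": within}


# Hypothetical monthly site means (ug m-3); not values from this study.
obs = [4.0, 6.0, 5.0, 8.0]
ana = [3.5, 6.5, 5.0, 6.0]
summary = oma_summary(obs, ana, tol=1.0)
print(summary["mean_abs_oma"], summary["frac_within"])  # 0.75 0.75
```
      </preformat>
      <p>In this toy case three of the four sites fall inside the band, so the fraction within tolerance is 0.75, which is the quantity reported as a percentage of sites in the text.</p>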

      <fig id="F8" specific-use="star"><label>Figure 8</label><caption><p id="d2e8021">Difference between observations at a total of 33 sites and five reanalysis datasets for NH<inline-formula><mml:math id="M473" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> <bold>(a1–a5)</bold>, NO<inline-formula><mml:math id="M474" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> <bold>(b1–b5)</bold>, SO<inline-formula><mml:math id="M475" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> <bold>(c1–c5)</bold>, OC <bold>(d1–d5)</bold> and BC <bold>(e1–e5)</bold>. Global-RA is the combination of CAMSRA and MERRA-2.</p></caption>
            <graphic xlink:href="https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026-f08.png"/>

          </fig>

      <p id="d2e8086">For SO<inline-formula><mml:math id="M476" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> (Fig. 8c1–c5), the average absolute OmA of OIRF-LEnKF v1.0 (0.54 <inline-formula><mml:math id="M477" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) is slightly lower than that of NP2 (0.86 <inline-formula><mml:math id="M479" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) but significantly lower than that of CAQRA (1.26 <inline-formula><mml:math id="M481" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>), TAP (1.72 <inline-formula><mml:math id="M483" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>), and Global-RA (7.19 <inline-formula><mml:math id="M485" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>). In contrast to NO<inline-formula><mml:math id="M487" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, most of the sites exhibit underestimation in CAQRA, overestimation in TAP, and significant overestimation in Global-RA for SO<inline-formula><mml:math id="M488" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>. 
This discrepancy between NO<inline-formula><mml:math id="M489" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> and SO<inline-formula><mml:math id="M490" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> arises from their competition for available NH<sub>3</sub>. Thus, the underestimation of SO<inline-formula><mml:math id="M492" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> is considered a contributing factor in the overestimation of NO<inline-formula><mml:math id="M493" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> (Xie et al., 2022). Unlike the four CTM-based reanalysis datasets, OIRF-LEnKF v1.0 implements independent simulation and DA processes for each chemical component, thereby reducing the constraints imposed by correlations among variables.</p>
      <p id="d2e8297">The OmA of OC (Fig. 8d1–d5) and BC (Fig. 8e1–e5) exhibit similar spatial patterns. Specifically, the average absolute OmA of OIRF-LEnKF v1.0 (0.66 <inline-formula><mml:math id="M494" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for OC and 0.40 <inline-formula><mml:math id="M496" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for BC) is slightly higher than that of NP2 (0.23 <inline-formula><mml:math id="M498" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for OC and 0.03 <inline-formula><mml:math id="M500" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for BC) but significantly lower than those of CAQRA (2.90 <inline-formula><mml:math id="M502" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for OC and 1.32 <inline-formula><mml:math id="M504" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for BC), TAP (1.04 <inline-formula><mml:math id="M506" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for OC and 0.65 <inline-formula><mml:math id="M508" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for BC), and Global-RA (1.62 <inline-formula><mml:math id="M510" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for OC and 5.85 <inline-formula><mml:math id="M512" 
display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> for BC). The significant overestimation of carbonaceous aerosols observed in CTM-based CAQRA and Global-RA is likely attributable to the hygroscopic growth schemes of carbonaceous aerosols, the poorly constrained semi-volatile species that escape from primary organic aerosols, and aging mechanisms (Soni et al., 2021; Huang et al., 2013). Overall, the reanalysis dataset generated by OIRF-LEnKF v1.0 shows lower errors in the concentrations of the five PM<sub>2.5</sub> chemical components in the North China region than the four CTM-based datasets, with the exception of NP2 for OC and BC.</p>

      <fig id="F9" specific-use="star"><label>Figure 9</label><caption><p id="d2e8514">Pearson correlation coefficient (CORR) and root mean square error (RMSE, <inline-formula><mml:math id="M515" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) quantified by the five reanalysis datasets and observations at a total of 33 sites for NH<inline-formula><mml:math id="M517" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> <bold>(a)</bold>, NO<inline-formula><mml:math id="M518" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> <bold>(b)</bold>, SO<inline-formula><mml:math id="M519" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> <bold>(c)</bold>, OC <bold>(d)</bold> and BC <bold>(e)</bold>. The averages of CORR <bold>(f)</bold> and RMSE <bold>(g)</bold> across all observational sites for the five reanalysis datasets for the five PM<sub>2.5</sub> chemical components. Global-RA is the combination of CAMSRA and MERRA-2.</p></caption>
            <graphic xlink:href="https://gmd.copernicus.org/articles/19/4835/2026/gmd-19-4835-2026-f09.png"/>

          </fig>

      <p id="d2e8614">We further compared RMSE and CORR among the five reanalysis datasets. As illustrated in Fig. 9a–c, the CORR values of OIRF-LEnKF v1.0 for NH<inline-formula><mml:math id="M521" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M522" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, and SO<inline-formula><mml:math id="M523" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> (mean CORR: 0.97, Fig. 9f) are significantly higher than those of the other datasets (mean CORR: 0.56–0.89, Fig. 9f), while the RMSE values (mean RMSE: 1.12 <inline-formula><mml:math id="M524" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, Fig. 9g) are significantly lower than those of the other datasets (mean RMSE: 2.55–8.52 <inline-formula><mml:math id="M526" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, Fig. 9g). 
Furthermore, the RMSE values of OIRF-LEnKF v1.0 are relatively concentrated across all sites, indicating a marked improvement in simulation of NH<inline-formula><mml:math id="M528" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M529" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, and SO<inline-formula><mml:math id="M530" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> across a broad spatial range. From Fig. 9d and e, the CORR and RMSE values of OIRF-LEnKF v1.0 for carbonaceous aerosols (OC and BC) (mean CORR: 0.68, Fig. 9f; mean RMSE: 1.49 <inline-formula><mml:math id="M531" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, Fig. 9g) are slightly worse than those of NP2 (mean CORR: 0.97, Fig. 9f; mean RMSE: 1.66 <inline-formula><mml:math id="M533" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, Fig. 9g) and are comparable to those of TAP (mean CORR: 0.66, Fig. 9f; mean RMSE: 1.49 <inline-formula><mml:math id="M535" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, Fig. 9g), while demonstrating superiority over the other datasets (mean CORR: 0.28–0.44, Fig. 9f; mean RMSE: 4.49–11.70 <inline-formula><mml:math id="M537" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>, Fig. 9g). 
Overall, OIRF-LEnKF v1.0 exhibits a notable advantage in accurately interpreting the concentrations of PM<sub>2.5</sub> chemical components on a regional scale. Further improvements in the performance of OIRF-LEnKF v1.0 in interpreting carbonaceous aerosols are expected by modifying the structure of the OIRF model and the frequency of incremental learning, as well as by adopting hybrid nonlinear DA algorithms.</p>
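      <p>For readers who wish to recompute the two scores compared in Fig. 9, the following Python sketch evaluates RMSE and the Pearson correlation coefficient for one site from paired observed and estimated series. It is a minimal illustration under the standard textbook definitions; the function names and the sample series are hypothetical and are not taken from the OIRF-LEnKF v1.0 code or data.</p>
      <preformat>
```python
import math


def rmse(obs, est):
    """Root mean square error between paired observations and estimates."""
    if len(obs) != len(est) or not obs:
        raise ValueError("series must be non-empty and of equal length")
    return math.sqrt(sum((e - o) ** 2 for o, e in zip(obs, est)) / len(obs))


def pearson_corr(obs, est):
    """Pearson correlation coefficient between paired series."""
    if len(obs) != len(est) or not obs:
        raise ValueError("series must be non-empty and of equal length")
    mean_o = sum(obs) / len(obs)
    mean_e = sum(est) / len(est)
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(obs, est))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_e = sum((e - mean_e) ** 2 for e in est)
    if var_o * var_e == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_o * var_e)


# Hypothetical hourly series at one site (ug m-3); not values from this study.
obs = [1.0, 2.0, 3.0, 4.0]
est = [2.0, 4.0, 6.0, 8.0]
corr = pearson_corr(obs, est)
err = rmse(obs, est)
print(corr, err)
```
      </preformat>
      <p>In this toy series the estimate is exactly twice the observation, so the correlation is 1 while the RMSE is clearly non-zero. The two scores therefore carry complementary information, which is why both are reported for each dataset.</p>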
</sec>
</sec>
<sec id="Ch1.S3.SS4">
  <label>3.4</label><title>Limitations</title>
      <p id="d2e8836">Although the OIRF model serves as an efficient surrogate for the CTM in generating simulation or forecast ensembles for data assimilation, it inherits the limited extrapolation capability of tree-based models. Specifically, the OIRF model tends to saturate at learned extremes when extrapolating beyond its training data distribution, which directly limits its generalizability in diverse and complex atmospheric scenarios, such as pollution extremes in seasons outside the training period. The poor performance of tree-based models on testing sets was reported in our previous study (Li et al., 2025). Our incremental learning mechanism is designed to mitigate this extrapolation limitation by dynamically updating the RF model with newly available data. However, its effectiveness is contingent upon the availability of high-quality analysis fields. When a lack of observations prevents the generation of analysis fields, the OIRF model is exposed to its inherent extrapolation limitations, and simulation accuracy is compromised.</p>
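      <p>The saturation behaviour described above can be illustrated with a toy example. The Python sketch below fits a single, very small one-dimensional regression tree whose leaves store the mean training target; it is not the OIRF model or a random forest, and all names (<monospace>fit_tree</monospace>, <monospace>predict</monospace>) and data are hypothetical. Because every prediction is a mean of training targets, the output cannot exceed the largest target seen during training.</p>
      <preformat>
```python
def _sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(xs, ys, depth):
    """Fit a tiny 1-D regression tree whose leaves store the mean target."""
    if depth == 0 or len(set(xs)) == 1:
        return sum(ys) / len(ys)
    pairs = list(zip(xs, ys))
    best_sse, best_t = None, None
    # Candidate thresholds exclude the largest x so both sides stay non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in pairs if t >= x]
        right = [y for x, y in pairs if x > t]
        sse = _sse(left) + _sse(right)
        if best_sse is None or best_sse > sse:
            best_sse, best_t = sse, t
    left = [(x, y) for x, y in pairs if best_t >= x]
    right = [(x, y) for x, y in pairs if x > best_t]
    return (
        best_t,
        fit_tree([x for x, _ in left], [y for _, y in left], depth - 1),
        fit_tree([x for x, _ in right], [y for _, y in right], depth - 1),
    )


def predict(tree, x):
    """Walk from the root to a leaf and return the stored mean."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if threshold >= x else right
    return tree


# Hypothetical training data following y = 2x on the interval 0..9.
train_x = [float(i) for i in range(10)]
train_y = [2.0 * x for x in train_x]
tree = fit_tree(train_x, train_y, depth=4)

inside = predict(tree, 9.0)     # largest input seen in training
outside = predict(tree, 100.0)  # far outside the training range
print(inside, outside, max(train_y))
```
      </preformat>
      <p>Here the prediction at an input of 100 equals the prediction at the largest training input, although the underlying relation would give 200. Averaging many such trees, as a random forest does, leaves this bound in place.</p>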
      <p id="d2e8839">Replacing the RF model with an ensemble of deep neural networks (DNNs) holds promise for superior nonlinear mapping and extrapolation. However, the considerably higher computational cost of both training and inference for DNNs (Debjyoti and Utpal, 2025; Xi, 2022) creates an operational bottleneck: updating and running an ensemble of DNNs can be slower than traditional CTM-based ensemble simulations, which could offset the accuracy advantages. Therefore, balancing the inherent predictive performance of a machine learning model against its computational cost remains a central challenge for the practical online coupling of machine learning with data assimilation.</p>
</sec>
</sec>
<sec id="Ch1.S4" sec-type="conclusions">
  <label>4</label><title>Conclusions</title>
      <p id="d2e8851">In this paper, we coupled the OIRF model online with the LEnKF algorithm to develop a novel DA system (OIRF-LEnKF v1.0) that mitigates two limitations of conventional CTM-based DA in generating background and analysis fields of PM<sub>2.5</sub> chemical components (NH<inline-formula><mml:math id="M541" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, SO<inline-formula><mml:math id="M542" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M543" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, OC and BC): high computational cost and insufficient improvement. The OIRF model introduces an incremental learning mechanism that enhances the generalization ability of ML by iteratively absorbing newly available training data to update the model structure dynamically. Domain localization and observation localization schemes are incorporated into the EnKF algorithm within a second-level parallel computation framework, which reduces the interference of spurious spatial and inter-variable correlations and improves computational efficiency. The main findings are as follows.</p>
      <p id="d2e8902">OIRF-LEnKF v1.0 converges stably and efficiently, reaching convergence within 10 iterations across ensemble sizes ranging from 2 to 200. Computational tests reveal that the total time consumed by OIRF-LEnKF v1.0 constitutes only 11.41 %–16.60 % of that of CTM-based DA, primarily because the simulation process requires only 0.13 %–0.20 % of the CTM computation time, demonstrating its superior computational efficiency.</p>
      <p id="d2e8905">Sensitivity tests reveal that the background fields in OIRF-LEnKF v1.0 are more sensitive to the update frequency of the incremental learning mechanism, whereas the analysis fields are markedly sensitive to ensemble size. Specifically, during the simulation phase, the <inline-formula><mml:math id="M544" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>CORR rises by 2.43 %–11.75 % and the <inline-formula><mml:math id="M545" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>RMSE decreases by 32.55 %–40.36 % with a 1 h update frequency relative to the scenario without incremental learning. During the DA analysis phase, the <inline-formula><mml:math id="M546" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>CORR increases by 9.75 %–19.04 % and the <inline-formula><mml:math id="M547" display="inline"><mml:mi mathvariant="normal">Δ</mml:mi></mml:math></inline-formula>RMSE decreases by 16.70 %–30.48 % with an ensemble size of 200 relative to an ensemble size of 20. However, the 1 h update frequency diminishes the dependence of the analysis fields on ensemble size. Thus, an ensemble size of 50 with a 6 h update frequency is adopted to balance computational efficiency, ML simulation accuracy, and DA analysis performance.</p>
      <p id="d2e8936">A 2-month DA experiment demonstrates that the RMSE values for PM<sub>2.5</sub> chemical components at DA sites range from 0.99 to 7.80 <inline-formula><mml:math id="M549" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> after incremental learning and 0.80 to 2.36 <inline-formula><mml:math id="M551" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> after DA analysis, exhibiting reductions of 26.38 %–61.75 % and 68.99 %–91.31 %, respectively, compared to values obtained without incremental learning and DA analysis. For VE sites, the RMSE values range from 0.93 to 7.76 <inline-formula><mml:math id="M553" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> after incremental learning and 0.90 to 7.76 <inline-formula><mml:math id="M555" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup> after DA analysis, exhibiting reductions of 28.37 %–68.00 % and 23.46 %–68.75 %, respectively, relative to values obtained without incremental learning and DA analysis. Notably, the RMSE values of our system during the simulation process show a significant reduction of 33.16 %–90.10 % at DA sites and 37.10 %–91.55 % at VE sites compared to those of CTM-based DA, highlighting the superior simulation capability of ML-based DA. Additionally, the spatial patterns of the background and analysis fields for chemical components more accurately reflect those of the observations when employing incremental learning and DA.</p>
      <p id="d2e9030">In comparison to the datasets provided by NP2, CAQRA, TAP, CAMSRA, and MERRA-2, the dataset generated by OIRF-LEnKF v1.0 exhibits superior data quality. Notably, for NH<inline-formula><mml:math id="M557" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mo>+</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>, NO<inline-formula><mml:math id="M558" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">3</mml:mn><mml:mo>-</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> and SO<inline-formula><mml:math id="M559" display="inline"><mml:mrow><mml:msubsup><mml:mi/><mml:mn mathvariant="normal">4</mml:mn><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>, the CORR values of OIRF-LEnKF v1.0 (0.97) are significantly higher than those of the aforementioned datasets (0.56–0.89). Additionally, the RMSE values of OIRF-LEnKF v1.0 (1.12 <inline-formula><mml:math id="M560" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>) are markedly lower than those of the four reanalysis datasets (2.55–8.52 <inline-formula><mml:math id="M562" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">µ</mml:mi></mml:mrow></mml:math></inline-formula>g m<sup>−3</sup>). Future work should focus on generating reanalysis datasets that utilize configurations with larger domains and higher spatial resolutions, as well as improving data quality through the application of deep learning techniques and hybrid nonlinear DA algorithms.</p>
</sec>

      
      </body>
    <back><notes notes-type="codedataavailability"><title>Code and data availability</title>

      <p id="d2e9118">The source code and related data for this work, including observation data, modelling domain data, sensitivity test data and OIRF-LEnKF v1.0 output data, are openly accessible at <ext-link xlink:href="https://doi.org/10.5281/zenodo.17346786" ext-link-type="DOI">10.5281/zenodo.17346786</ext-link> (Li and Yang, 2025). The open-access reanalysis datasets of CAQRA (<ext-link xlink:href="https://doi.org/10.11922/sciencedb.00053" ext-link-type="DOI">10.11922/sciencedb.00053</ext-link>, Tang et al., 2020; Kong et al., 2021), CAQRA-aerosol (<ext-link xlink:href="https://doi.org/10.1007/s00376-024-4046-5" ext-link-type="DOI">10.1007/s00376-024-4046-5</ext-link>, Kong et al., 2025) and ERA5 (<ext-link xlink:href="https://doi.org/10.24381/cds.bd0915c6" ext-link-type="DOI">10.24381/cds.bd0915c6</ext-link>, Hersbach et al., 2023) from February to March 2022 were downloaded for the development and implementation of the OIRF-LEnKF v1.0 system and have been packaged and uploaded to a repository (<ext-link xlink:href="https://doi.org/10.5281/zenodo.17359290" ext-link-type="DOI">10.5281/zenodo.17359290</ext-link>, Li, 2025) for easier access. 
The open-access reanalysis datasets of TAP (<uri>http://tapdata.org.cn</uri>, last access: 2 June 2025, Liu et al., 2022), NP2 (<ext-link xlink:href="https://doi.org/10.5281/zenodo.10886914" ext-link-type="DOI">10.5281/zenodo.10886914</ext-link>, Li et al., 2024b), CAMS (<ext-link xlink:href="https://doi.org/10.24381/d58bbf47" ext-link-type="DOI">10.24381/d58bbf47</ext-link>, Copernicus Atmosphere Monitoring Service, 2020; Inness et al., 2019) and MERRA-2 (<uri>https://disc.gsfc.nasa.gov/datasets?project=MERRA-2</uri>, last access: 2 June 2025, Randles et al., 2017) for February 2022 were downloaded for the evaluation of the OIRF-LEnKF v1.0 system and have been packaged and uploaded to the same repository (<ext-link xlink:href="https://doi.org/10.5281/zenodo.17359290" ext-link-type="DOI">10.5281/zenodo.17359290</ext-link>, Li, 2025) for easier access.</p>
  </notes><app-group>
        <supplementary-material position="anchor"><p id="d2e9152">The supplement related to this article is available online at <inline-supplementary-material xlink:href="https://doi.org/10.5194/gmd-19-4835-2026-supplement" xlink:title="pdf">https://doi.org/10.5194/gmd-19-4835-2026-supplement</inline-supplementary-material>.</p></supplementary-material>
        </app-group><notes notes-type="authorcontribution"><title>Author contributions</title>

      <p id="d2e9161">HL implemented the data assimilation system, performed the numerical experiments, conducted the analysis, and wrote the paper. TY conceived and designed the overall research framework, provided scientific guidance, wrote the paper, and devised the strategy for the responses and manuscript revision. LK and XT assisted with the system code and the CAQRA reanalysis dataset. DZ and GT provided PM<sub>2.5</sub> chemical component data. ZW provided overall supervision. All authors reviewed and revised this paper.</p>
  </notes><notes notes-type="competinginterests"><title>Competing interests</title>

      <p id="d2e9176">The contact author has declared that none of the authors has any competing interests.</p>
  </notes><notes notes-type="disclaimer"><title>Disclaimer</title>

      <p id="d2e9182">Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. The authors bear the ultimate responsibility for providing appropriate place names. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.</p>
  </notes><ack><title>Acknowledgements</title><p id="d2e9188">We acknowledge the technical support of the National Large Scientific and Technological Infrastructure “Earth System Numerical Simulation Facility” (<uri>https://cstr.cn/31134.02.EL</uri>, last access: 20 December 2025) and the data support of the China National Environmental Monitoring Center. Ting Yang is grateful to the Program of the Youth Innovation Promotion Association (CAS).</p></ack><notes notes-type="financialsupport"><title>Financial support</title>

      <p id="d2e9196">This research has been supported by the National Natural Science Foundation of China (grant nos. 42422506 and 42275122).</p>
  </notes><notes notes-type="reviewstatement"><title>Review statement</title>

      <p id="d2e9202">This paper was edited by Klaus Klingmüller and reviewed by two anonymous referees.</p>
  </notes><ref-list>
    <title>References</title>

      <ref id="bib1.bib1"><label>1</label><mixed-citation>Adie, J., Chin, C. S., Li, J., and See, S.: GAIA-Chem: A Framework for Global AI-Accelerated Atmospheric Chemistry Modelling, in: Proceedings of the Platform for Advanced Scientific Computing Conference, Zurich, Switzerland, 13, 1–5, <ext-link xlink:href="https://doi.org/10.1145/3659914.3659927" ext-link-type="DOI">10.1145/3659914.3659927</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bib2"><label>2</label><mixed-citation>Amidror, I.: Scattered data interpolation methods for electronic imaging systems: a survey, J. Electron. Imag., 11, <ext-link xlink:href="https://doi.org/10.1117/1.1455013" ext-link-type="DOI">10.1117/1.1455013</ext-link>, 2002.</mixed-citation></ref>
      <ref id="bib1.bib3"><label>3</label><mixed-citation>Arcucci, R., Zhu, J., Hu, S., and Guo, Y.-K.: Deep Data Assimilation: Integrating Deep Learning with Data Assimilation, Appl. Sci., 11, 1114, <ext-link xlink:href="https://doi.org/10.3390/app11031114" ext-link-type="DOI">10.3390/app11031114</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bib4"><label>4</label><mixed-citation>Brajard, J., Carrassi, A., Bocquet, M., and Bertino, L.: Combining data assimilation and machine learning to emulate a dynamical model from sparse and noisy observations: A case study with the Lorenz 96 model, J. Comput. Sci., 44, 101171, <ext-link xlink:href="https://doi.org/10.1016/j.jocs.2020.101171" ext-link-type="DOI">10.1016/j.jocs.2020.101171</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bib5"><label>5</label><mixed-citation>Breiman, L.: Random Forests, Mach. Learn., 45, 5–32, <ext-link xlink:href="https://doi.org/10.1023/A:1010933404324" ext-link-type="DOI">10.1023/A:1010933404324</ext-link>, 2001.</mixed-citation></ref>
      <ref id="bib1.bib6"><label>6</label><mixed-citation>Buizza, C., Quilodrán Casas, C., Nadler, P., Mack, J., Marrone, S., Titus, Z., Le Cornec, C., Heylen, E., Dur, T., Baca Ruiz, L., Heaney, C., Díaz Lopez, J. A., Kumar, K. S. S., and Arcucci, R.: Data Learning: Integrating Data Assimilation and Machine Learning, J. Comput. Sci., 58, 101525, <ext-link xlink:href="https://doi.org/10.1016/j.jocs.2021.101525" ext-link-type="DOI">10.1016/j.jocs.2021.101525</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib7"><label>7</label><mixed-citation>Cha, Y., Lee, J.-J., Song, C. H., Kim, S., Park, R. J., Lee, M.-I., Woo, J.-H., Choi, J.-H., Bae, K., Yu, J., Kim, E., Kim, H., Lee, S.-H., Kim, J., Chang, L.-S., Jeon, K.-h., and Song, C.-K.: Investigating uncertainties in air quality models used in GMAP/SIJAQ 2021 field campaign: General performance of different models and ensemble results, Atmos. Environ., 340, 120896, <ext-link xlink:href="https://doi.org/10.1016/j.atmosenv.2024.120896" ext-link-type="DOI">10.1016/j.atmosenv.2024.120896</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bib8"><label>8</label><mixed-citation>Chattopadhyay, A., Nabizadeh, E., Bach, E., and Hassanzadeh, P.: Deep learning-enhanced ensemble-based data assimilation for high-dimensional nonlinear dynamical systems, J. Comput. Phys., 477, 111918, <ext-link xlink:href="https://doi.org/10.1016/j.jcp.2023.111918" ext-link-type="DOI">10.1016/j.jcp.2023.111918</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bib9"><label>9</label><mixed-citation>Chen, L.: A review of the applications of ensemble forecasting in fields other than meteorology, Weather, 79, 285–290, <ext-link xlink:href="https://doi.org/10.1002/wea.4584" ext-link-type="DOI">10.1002/wea.4584</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bib10"><label>10</label><mixed-citation>Copernicus Atmosphere Monitoring Service: CAMS global reanalysis (EAC4), Copernicus Atmosphere Monitoring Service (CAMS) Atmosphere Data Store [data set], <ext-link xlink:href="https://doi.org/10.24381/d58bbf47" ext-link-type="DOI">10.24381/d58bbf47</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bib11"><label>11</label><mixed-citation>Debjyoti, G. and Utpal, R.: Comprehensive Benchmark Study of Machine Learning and Deep Learning Approaches for Human Activity Recognition using the UCI HAR Dataset, Int. J. Comput. Appl., 187, 66–69, <ext-link xlink:href="https://doi.org/10.5120/ijca2025925797" ext-link-type="DOI">10.5120/ijca2025925797</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bib12"><label>12</label><mixed-citation>Dong, R., Leng, H., Zhao, J., Song, J., and Liang, S.: A Framework for Four-Dimensional Variational Data Assimilation Based on Machine Learning, Entropy, 24, 264, <ext-link xlink:href="https://doi.org/10.3390/e24020264" ext-link-type="DOI">10.3390/e24020264</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib13"><label>13</label><mixed-citation>Dong, R., Leng, H., Zhao, C., Song, J., Zhao, J., and Cao, X.: A hybrid data assimilation system based on machine learning, Front. Earth Sci., 10, <ext-link xlink:href="https://doi.org/10.3389/feart.2022.1012165" ext-link-type="DOI">10.3389/feart.2022.1012165</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bib14"><label>14</label><mixed-citation>Evensen, G.: Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics, J. Geophys. Res.-Oceans, 99, <ext-link xlink:href="https://doi.org/10.1029/94jc00572" ext-link-type="DOI">10.1029/94jc00572</ext-link>, 1994.</mixed-citation></ref>
      <ref id="bib1.bib15"><label>15</label><mixed-citation>Evensen, G.: The Ensemble Kalman Filter: Theoretical formulation and practical implementation, Ocean Dynam., 53, 343–367, <ext-link xlink:href="https://doi.org/10.1007/s10236-003-0036-9" ext-link-type="DOI">10.1007/s10236-003-0036-9</ext-link>, 2003.</mixed-citation></ref>
      <ref id="bib1.bib16"><label>16</label><mixed-citation>Fang, L., Jin, J., Segers, A., Lin, H. X., Pang, M., Xiao, C., Deng, T., and Liao, H.: Development of a regional feature selection-based machine learning system (RFSML v1.0) for air pollution forecasting over China, Geosci. Model Dev., 15, 7791–7807, <ext-link xlink:href="https://doi.org/10.5194/gmd-15-7791-2022" ext-link-type="DOI">10.5194/gmd-15-7791-2022</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib17"><label>17</label><mixed-citation>Farchi, A., Bocquet, M., Laloyaux, P., Bonavita, M., and Malartic, Q.: A comparison of combined data assimilation and machine learning methods for offline and online model error correction, J. Comput. Sci., 55, 101468, <ext-link xlink:href="https://doi.org/10.1016/j.jocs.2021.101468" ext-link-type="DOI">10.1016/j.jocs.2021.101468</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bib18"><label>18</label><mixed-citation>Friedman, J. H., Bentley, J. L., and Finkel, R. A.: An algorithm for finding best matches in logarithmic expected time, ACM T. Math. Softw., 3, 209–226, <ext-link xlink:href="https://doi.org/10.1145/355744.355745" ext-link-type="DOI">10.1145/355744.355745</ext-link>, 1977.</mixed-citation></ref>
      <ref id="bib1.bib19"><label>19</label><mixed-citation>Geer, A. J.: Learning earth system models from observations: machine learning or data assimilation?, Philos. T. Roy. Soc. A, 379, 20200089, <ext-link xlink:href="https://doi.org/10.1098/rsta.2020.0089" ext-link-type="DOI">10.1098/rsta.2020.0089</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bib20"><label>20</label><mixed-citation>Gelbart, M. A., Snoek, J., and Adams, R. P.: Bayesian optimization with unknown constraints, in: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI'14), AUAI Press, Arlington, Virginia, USA, 250–259, <ext-link xlink:href="https://doi.org/10.5555/3020751.3020778" ext-link-type="DOI">10.5555/3020751.3020778</ext-link>, 2014.</mixed-citation></ref>
      <ref id="bib1.bib21"><label>21</label><mixed-citation>Gohari, K., Sheidaei, A., Yitshak-Sade, M., Colicino, E., and Kloog, I.: Exploring multivariate machine learning frameworks to parallelize PM<sub>2.5</sub> simultaneous estimations across the continental United States, Environ. Pollut., 374, 126161, <ext-link xlink:href="https://doi.org/10.1016/j.envpol.2025.126161" ext-link-type="DOI">10.1016/j.envpol.2025.126161</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bib22"><label>22</label><mixed-citation>Gottwald, G. A. and Reich, S.: Supervised learning from noisy observations: Combining machine-learning techniques with data assimilation, Physica D, 423, 132911, <ext-link xlink:href="https://doi.org/10.1016/j.physd.2021.132911" ext-link-type="DOI">10.1016/j.physd.2021.132911</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bib23"><label>23</label><mixed-citation>He, X., Li, Y., Liu, S., Xu, T., Chen, F., Li, Z., Zhang, Z., Liu, R., Song, L., Xu, Z., Peng, Z., and Zheng, C.: Improving regional climate simulations based on a hybrid data assimilation and machine learning method, Hydrol. Earth Syst. Sci., 27, 1583–1606, <ext-link xlink:href="https://doi.org/10.5194/hess-27-1583-2023" ext-link-type="DOI">10.5194/hess-27-1583-2023</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bib24"><label>24</label><mixed-citation>Hersbach, H., Bell, B., Berrisford, P., Biavati, G., Horányi, A., Muñoz Sabater, J., Nicolas, J., Peubey, C., Radu, R., Rozum, I., Schepers, D., Simmons, A., Soci, C., Dee, D., and Thépaut, J.-N.: ERA5 hourly data on pressure levels from 1940 to present, Copernicus Climate Change Service (C3S) Climate Data Store (CDS) [data set], <ext-link xlink:href="https://doi.org/10.24381/cds.bd0915c6" ext-link-type="DOI">10.24381/cds.bd0915c6</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bib25"><label>25</label><mixed-citation>Houtekamer, P. L. and Mitchell, H. L.: Data Assimilation Using an Ensemble Kalman Filter Technique, Mon. Weather Rev., 126, 796–811, <ext-link xlink:href="https://doi.org/10.1175/1520-0493(1998)126&lt;0796:DAUAEK&gt;2.0.CO;2" ext-link-type="DOI">10.1175/1520-0493(1998)126&lt;0796:DAUAEK&gt;2.0.CO;2</ext-link>, 1998.</mixed-citation></ref>
      <ref id="bib1.bib26"><label>26</label><mixed-citation>Houtekamer, P. L. and Zhang, F.: Review of the Ensemble Kalman Filter for Atmospheric Data Assimilation, Mon. Weather Rev., 144, 4489–4532, <ext-link xlink:href="https://doi.org/10.1175/MWR-D-15-0440.1" ext-link-type="DOI">10.1175/MWR-D-15-0440.1</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bib27"><label>27</label><mixed-citation>Howard, L. J., Subramanian, A., and Hoteit, I.: A Machine Learning Augmented Data Assimilation Method for High-Resolution Observations, J. Adv. Model Earth Syst., 16, e2023MS003774, <ext-link xlink:href="https://doi.org/10.1029/2023MS003774" ext-link-type="DOI">10.1029/2023MS003774</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bib28"><label>28</label><mixed-citation>Huang, R. J., Zhang, Y. L., Bozzetti, C., Ho, K. F., Cao, J. J., Han, Y. M., Daellenbach, K. R., Slowik, J. G., Platt, S. M., Canonaco, F., Zotter, P., Wolf, R., Pieber, S. M., Bruns, E. A., Crippa, M., Ciarelli, G., Piazzalunga, A., Schwikowski, M., Abbaszade, G., Schnelle-Kreis, J., Zimmermann, R., An, Z. S., Szidat, S., Baltensperger, U., El Haddad, I., and Prévôt, A. S. H.: High secondary aerosol contribution to particulate pollution during haze events in China, Nature, 514, 218–222, <ext-link xlink:href="https://doi.org/10.1038/nature13774" ext-link-type="DOI">10.1038/nature13774</ext-link>, 2014.</mixed-citation></ref>
      <ref id="bib1.bib29"><label>29</label><mixed-citation>Huang, Y., Wu, S., Dubey, M. K., and French, N. H. F.: Impact of aging mechanism on model simulated carbonaceous aerosols, Atmos. Chem. Phys., 13, 6329–6343, <ext-link xlink:href="https://doi.org/10.5194/acp-13-6329-2013" ext-link-type="DOI">10.5194/acp-13-6329-2013</ext-link>, 2013.</mixed-citation></ref>
      <ref id="bib1.bib30"><label>30</label><mixed-citation>Inness, A., Ades, M., Agustí-Panareda, A., Barré, J., Benedictow, A., Blechschmidt, A. M., Dominguez, J. J., Engelen, R., Eskes, H., Flemming, J., Huijnen, V., Jones, L., Kipling, Z., Massart, S., Parrington, M., Peuch, V. H., Razinger, M., Remy, S., Schulz, M., and Suttie, M.: The CAMS reanalysis of atmospheric composition, Atmos. Chem. Phys., 19, 3515–3556, <ext-link xlink:href="https://doi.org/10.5194/acp-19-3515-2019" ext-link-type="DOI">10.5194/acp-19-3515-2019</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bib31"><label>31</label><mixed-citation>Jalali, M. W., Saidi, B., Farahmand, H., Panah, M. A. R., and Saruhan, E. N.: Scalable AI-driven air quality forecasting and classification for public health applications, Discov. Atmos., 3, 25, <ext-link xlink:href="https://doi.org/10.1007/s44292-025-00052-8" ext-link-type="DOI">10.1007/s44292-025-00052-8</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bib32"><label>32</label><mixed-citation>Janjić, T., Nerger, L., Albertella, A., Schröter, J., and Skachko, S.: On Domain Localization in Ensemble-Based Kalman Filter Algorithms, Mon. Weather Rev., 139, 2046–2060, <ext-link xlink:href="https://doi.org/10.1175/2011MWR3552.1" ext-link-type="DOI">10.1175/2011MWR3552.1</ext-link>, 2011.</mixed-citation></ref>
      <ref id="bib1.bib33"><label>33</label><mixed-citation>Jin, J., Lin, H. X., Segers, A., Xie, Y., and Heemink, A.: Machine learning for observation bias correction with application to dust storm data assimilation, Atmos. Chem. Phys., 19, 10009–10026, <ext-link xlink:href="https://doi.org/10.5194/acp-19-10009-2019" ext-link-type="DOI">10.5194/acp-19-10009-2019</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bib34"><label>34</label><mixed-citation>Kong, L., Tang, X., Zhu, J., Wang, Z., Li, J., Wu, H., Wu, Q., Chen, H., Zhu, L., Wang, W., Liu, B., Wang, Q., Chen, D., Pan, Y., Song, T., Li, F., Zheng, H., Jia, G., Lu, M., Wu, L., and Carmichael, G. R.: A 6-year-long (2013–2018) high-resolution air quality reanalysis dataset in China based on the assimilation of surface observations from CNEMC, Earth Syst. Sci. Data, 13, 529–570, <ext-link xlink:href="https://doi.org/10.5194/essd-13-529-2021" ext-link-type="DOI">10.5194/essd-13-529-2021</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bib35"><label>35</label><mixed-citation>Kong, L., Tang, X., Zhu, J., Wang, Z., Liu, B., Zhu, Y., Zhu, L., Chen, D., Hu, K., Wu, H., Wu, Q., Shen, J., Sun, Y., Liu, Z., Xin, J., Ji, D., and Zheng, M.: High-resolution Simulation Dataset of Hourly PM<sub>2.5</sub> Chemical Composition in China (CAQRA-aerosol) from 2013 to 2020, Adv. Atmos. Sci., 42, 697–712, <ext-link xlink:href="https://doi.org/10.1007/s00376-024-4046-5" ext-link-type="DOI">10.1007/s00376-024-4046-5</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bib36"><label>36</label><mixed-citation>Lai, Y.: Application and Effectiveness Evaluation of Bayesian Optimization Algorithm in Hyperparameter Tuning of Machine Learning Models, in: 2024 International Conference on Power, Electrical Engineering, Electronics and Control (PEEEC), 14–16 August 2024, Athens, Greece, 351–355, <ext-link xlink:href="https://doi.org/10.1109/PEEEC63877.2024.00070" ext-link-type="DOI">10.1109/PEEEC63877.2024.00070</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bib37"><label>37</label><mixed-citation>Lee, S., Park, S., Lee, M.-I., Kim, G., Im, J., and Song, C.-K.: Air Quality Forecasts Improved by Combining Data Assimilation and Machine Learning With Satellite AOD, Geophys. Res. Lett., 49, e2021GL096066, <ext-link xlink:href="https://doi.org/10.1029/2021GL096066" ext-link-type="DOI">10.1029/2021GL096066</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib38"><label>38</label><mixed-citation>Legler, S. and Janjić, T.: Combining data assimilation and machine learning to estimate parameters of a convective-scale model, Q. J. Roy. Meteorol. Soc., 148, 860–874, <ext-link xlink:href="https://doi.org/10.1002/qj.4235" ext-link-type="DOI">10.1002/qj.4235</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib39"><label>39</label><mixed-citation>Lei, L. and Whitaker, J. S.: Evaluating the trade-offs between ensemble size and ensemble resolution in an ensemble-variational data assimilation system, J. Adv. Model Earth Syst., 9, 781–789, <ext-link xlink:href="https://doi.org/10.1002/2016MS000864" ext-link-type="DOI">10.1002/2016MS000864</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bib40"><label>40</label><mixed-citation>Lei, L., Sun, Y., Ouyang, B., Qiu, Y., Xie, C., Tang, G., Zhou, W., He, Y., Wang, Q., Cheng, X., Fu, P., and Wang, Z.: Vertical Distributions of Primary and Secondary Aerosols in Urban Boundary Layer: Insights into Sources, Chemistry, and Interaction with Meteorology, Environ. Sci. Technol., 55, 4542–4552, <ext-link xlink:href="https://doi.org/10.1021/acs.est.1c00479" ext-link-type="DOI">10.1021/acs.est.1c00479</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bib41"><label>41</label><mixed-citation>Li, H.: OIRF-LEnKF v1.0 related open-access datasets, Zenodo [data set], <ext-link xlink:href="https://doi.org/10.5281/zenodo.17359290" ext-link-type="DOI">10.5281/zenodo.17359290</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bib42"><label>42</label><mixed-citation>Li, H. and Yang, T.: OIRF-LEnKF v1.0, Zenodo [code and data set], <ext-link xlink:href="https://doi.org/10.5281/zenodo.17346786" ext-link-type="DOI">10.5281/zenodo.17346786</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bib43"><label>43</label><mixed-citation>Li, H., Yang, T., Nerger, L., Zhang, D., Zhang, D., Tang, G., Wang, H., Sun, Y., Fu, P., Su, H., and Wang, Z.: NAQPMS-PDAF v2.0: a novel hybrid nonlinear data assimilation system for improved simulation of PM<sub>2.5</sub> chemical components, Geosci. Model Dev., 17, 8495–8519, <ext-link xlink:href="https://doi.org/10.5194/gmd-17-8495-2024" ext-link-type="DOI">10.5194/gmd-17-8495-2024</ext-link>, 2024a.</mixed-citation></ref>
      <ref id="bib1.bib44"><label>44</label><mixed-citation>Li, H., Yang, T., and Wang, H.: NAQPMS-PDAF v2.0 (Version 2.0), Zenodo [data set], <ext-link xlink:href="https://doi.org/10.5281/zenodo.10886914" ext-link-type="DOI">10.5281/zenodo.10886914</ext-link>, 2024b.</mixed-citation></ref>
      <ref id="bib1.bib45"><label>45</label><mixed-citation>Li, H., Yang, T., Du, Y., Tan, Y., and Wang, Z.: Interpreting hourly mass concentrations of PM<sub>2.5</sub> chemical components with an optimal deep-learning model, J. Environ. Sci., 151, 125–139, <ext-link xlink:href="https://doi.org/10.1016/j.jes.2024.03.037" ext-link-type="DOI">10.1016/j.jes.2024.03.037</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bib46"><label>46</label><mixed-citation>Li, J., Wang, Y., Steenland, K., Liu, P., van Donkelaar, A., Martin, R. V., Chang, H. H., Caudle, W. M., Schwartz, J., Koutrakis, P., and Shi, L.: Long-term effects of PM<sub>2.5</sub> components on incident dementia in the northeastern United States, Innovation, 3, 100208, <ext-link xlink:href="https://doi.org/10.1016/j.xinn.2022.100208" ext-link-type="DOI">10.1016/j.xinn.2022.100208</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib47"><label>47</label><mixed-citation>Lin, G. Y., Chen, H. W., Chen, B. J., Chen, S. C., et al.: A machine learning model for predicting PM<sub>2.5</sub> and nitrate concentrations based on long-term water-soluble inorganic salts datasets at a road site station, Chemosphere, 289, <ext-link xlink:href="https://doi.org/10.1016/j.chemosphere.2021.133123" ext-link-type="DOI">10.1016/j.chemosphere.2021.133123</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib48"><label>48</label><mixed-citation>Lin, H., Jin, J., and van den Herik, J.: Air Quality Forecast through Integrated Data Assimilation and Machine Learning, in: Proceedings of the 11th International Conference on Agents and Artificial Intelligence, Prague, Czech Republic, 787–793, <ext-link xlink:href="https://doi.org/10.5220/0007555207870793" ext-link-type="DOI">10.5220/0007555207870793</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bib49"><label>49</label><mixed-citation>Liu, K., Zhang, Y., He, H., Xiao, H., Wang, S., Zhang, Y., Li, H., and Qian, X.: Time series prediction of the chemical components of PM<sub>2.5</sub> based on a deep learning model, Chemosphere, 342, 140153, <ext-link xlink:href="https://doi.org/10.1016/j.chemosphere.2023.140153" ext-link-type="DOI">10.1016/j.chemosphere.2023.140153</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bib50"><label>50</label><mixed-citation>Liu, S., Geng, G., Xiao, Q., Zheng, Y., Liu, X., Cheng, J., and Zhang, Q.: Tracking Daily Concentrations of PM<sub>2.5</sub> Chemical Composition in China since 2000, Environ. Sci. Technol., 56, 16517–16527, <ext-link xlink:href="https://doi.org/10.1021/acs.est.2c06510" ext-link-type="DOI">10.1021/acs.est.2c06510</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib51"><label>51</label><mixed-citation>Luo, Z., Han, Y., Hua, K., Zhang, Y., Wu, J., Bi, X., Dai, Q., Liu, B., Chen, Y., Long, X., and Feng, Y.: The effect of emission source chemical profiles on simulated PM<sub>2.5</sub> components: sensitivity analysis with the Community Multiscale Air Quality (CMAQ) modeling system version 5.0.2, Geosci. Model Dev., 16, 6757–6771, <ext-link xlink:href="https://doi.org/10.5194/gmd-16-6757-2023" ext-link-type="DOI">10.5194/gmd-16-6757-2023</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bib52"><label>52</label><mixed-citation>Lv, L., Wei, P., Li, J., and Hu, J.: Application of machine learning algorithms to improve numerical simulation prediction of PM<sub>2.5</sub> and chemical components, Atmos. Pollut. Res., 12, 101211, <ext-link xlink:href="https://doi.org/10.1016/j.apr.2021.101211" ext-link-type="DOI">10.1016/j.apr.2021.101211</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bib53"><label>53</label><mixed-citation>Mallet, V. and Sportisse, B.: Uncertainty in a chemistry-transport model due to physical parameterizations and numerical approximations: An ensemble approach applied to ozone modeling, J. Geophys. Res.-Atmos., 111, <ext-link xlink:href="https://doi.org/10.1029/2005jd006149" ext-link-type="DOI">10.1029/2005jd006149</ext-link>, 2006.</mixed-citation></ref>
      <ref id="bib1.bib54"><label>54</label><mixed-citation>Meng, X., Hand, J. L., Schichtel, B. A., and Liu, Y.: Space-time trends of PM<sub>2.5</sub> constituents in the conterminous United States estimated by a machine learning approach, 2005–2015, Environ. Int., 121, 1137–1147, <ext-link xlink:href="https://doi.org/10.1016/j.envint.2018.10.029" ext-link-type="DOI">10.1016/j.envint.2018.10.029</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bib55"><label>55</label><mixed-citation>Miao, R., Chen, Q., Zheng, Y., Cheng, X., Sun, Y., Palmer, P. I., Shrivastava, M., Guo, J., Zhang, Q., Liu, Y., Tan, Z., Ma, X., Chen, S., Zeng, L., Lu, K., and Zhang, Y.: Model bias in simulating major chemical components of PM<sub>2.5</sub> in China, Atmos. Chem. Phys., 20, 12265–12284, <ext-link xlink:href="https://doi.org/10.5194/acp-20-12265-2020" ext-link-type="DOI">10.5194/acp-20-12265-2020</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bib56"><label>56</label><mixed-citation>Nenes, A., Pandis, S. N., and Pilinis, C.: ISORROPIA: A new thermodynamic equilibrium model for multiphase multicomponent inorganic aerosols, Aquat. Geochem., 4, 123–152, <ext-link xlink:href="https://doi.org/10.1023/A:1009604003981" ext-link-type="DOI">10.1023/A:1009604003981</ext-link>, 1998.</mixed-citation></ref>
      <ref id="bib1.bib57"><label>57</label><mixed-citation>Nerger, L., Janjić, T., Schröter, J., and Hiller, W.: A regulated localization scheme for ensemble-based Kalman filters, Q. J. Roy. Meteorol. Soc., 138, 802–812, <ext-link xlink:href="https://doi.org/10.1002/qj.945" ext-link-type="DOI">10.1002/qj.945</ext-link>, 2012.</mixed-citation></ref>
      <ref id="bib1.bib58"><label>58</label><mixed-citation>Probst, P., Wright, M. N., and Boulesteix, A.-L.: Hyperparameters and tuning strategies for random forest, WIREs Data Min. Knowl. Discov., 9, e1301, <ext-link xlink:href="https://doi.org/10.1002/widm.1301" ext-link-type="DOI">10.1002/widm.1301</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bib59"><label>59</label><mixed-citation>Randles, C. A., da Silva, A. M., Buchard, V., Colarco, P. R., Darmenov, A., Govindaraju, R., Smirnov, A., Holben, B., Ferrare, R., Hair, J., Shinozuka, Y., and Flynn, C. J.: The MERRA-2 aerosol reanalysis, 1980 onward. Part I: System description and data assimilation evaluation, J. Climate, 30, 6823–6850, <ext-link xlink:href="https://doi.org/10.1175/JCLI-D-16-0609.1" ext-link-type="DOI">10.1175/JCLI-D-16-0609.1</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bib60"><label>60</label><mixed-citation>Rasmussen, C. E.: Gaussian processes in machine learning, in: Advanced Lectures on Machine Learning, edited by: Bousquet, O., von Luxburg, U., and Rätsch, G., Springer, Berlin, Heidelberg, 63–71, <ext-link xlink:href="https://doi.org/10.1007/978-3-540-28650-9_4" ext-link-type="DOI">10.1007/978-3-540-28650-9_4</ext-link>, 2004.</mixed-citation></ref>
      <ref id="bib1.bib61"><label>61</label><mixed-citation>Shaheen, K., Hanif, M. A., Hasan, O., and Shafique, M.: Continual Learning for Real-World Autonomous Systems: Algorithms, Challenges and Frameworks, J. Intel. Robot. Syst., 105, 9, <ext-link xlink:href="https://doi.org/10.1007/s10846-022-01603-6" ext-link-type="DOI">10.1007/s10846-022-01603-6</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib62"><label>62</label><mixed-citation>Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and Freitas, N. D.: Taking the Human Out of the Loop: A Review of Bayesian Optimization, Proc. IEEE, 104, 148–175, <ext-link xlink:href="https://doi.org/10.1109/JPROC.2015.2494218" ext-link-type="DOI">10.1109/JPROC.2015.2494218</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bib63"><label>63</label><mixed-citation>Shi, Y., Liu, L., Hu, F., Fan, G., and Huo, J.: Nocturnal Boundary Layer Evolution and Its Impacts on the Vertical Distributions of Pollutant Particulate Matter, Atmosphere, 12, 610, <ext-link xlink:href="https://doi.org/10.3390/atmos12050610" ext-link-type="DOI">10.3390/atmos12050610</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bib64"><label>64</label><mixed-citation>Soni, A., Mandariya, A. K., Rajeev, P., Izhar, S., Singh, G. K., Choudhary, V., Qadri, A. M., Gupta, A. D., Singh, A. K., and Gupta, T.: Multiple site ground-based evaluation of carbonaceous aerosol mass concentrations retrieved from CAMS and MERRA-2 over the Indo-Gangetic Plain, Environ. Sci.: Atmos., 1, 577–590, <ext-link xlink:href="https://doi.org/10.1039/d1ea00067e" ext-link-type="DOI">10.1039/d1ea00067e</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bib65"><label>65</label><mixed-citation>Stier, P., van den Heever, S. C., Christensen, M. W., Gryspeerdt, E., Dagan, G., Saleeby, S. M., Bollasina, M., Donner, L., Emanuel, K., Ekman, A. M. L., Feingold, G., Field, P., Forster, P., Haywood, J., Kahn, R., Koren, I., Kummerow, C., L'Ecuyer, T., Lohmann, U., Ming, Y., Myhre, G., Quaas, J., Rosenfeld, D., Samset, B., Seifert, A., Stephens, G., and Tao, W.-K.: Multifaceted aerosol effects on precipitation, Nat. Geosci., 17, 719–732, <ext-link xlink:href="https://doi.org/10.1038/s41561-024-01482-6" ext-link-type="DOI">10.1038/s41561-024-01482-6</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bib66"><label>66</label><mixed-citation>Stockwell, W. R., Middleton, P., Chang, J. S., and Tang, X.: The second generation regional acid deposition model chemical mechanism for regional air quality modeling, J. Geophys. Res.-Atmos., 95, 16343–16367, <ext-link xlink:href="https://doi.org/10.1029/JD095iD10p16343" ext-link-type="DOI">10.1029/JD095iD10p16343</ext-link>, 1990.</mixed-citation></ref>
      <ref id="bib1.bib67"><label>67</label><mixed-citation>Sun, Y.: Vertical structures of physical and chemical properties of urban boundary layer and formation mechanisms of atmospheric pollution, Chinese Sci. Bull., 63, 1374–1389, <ext-link xlink:href="https://doi.org/10.1360/n972018-00258" ext-link-type="DOI">10.1360/n972018-00258</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bib68"><label>68</label><mixed-citation>Sun, Y. L., Wang, Z. F., Wild, O., Xu, W. Q., Chen, C., Fu, P. Q., Du, W., Zhou, L. B., Zhang, Q., and Han, T. T.: “APEC Blue”: Secondary Aerosol Reductions from Emission Controls in Beijing, Sci. Rep., 6, 20668, <ext-link xlink:href="https://doi.org/10.1038/srep20668" ext-link-type="DOI">10.1038/srep20668</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bib69"><label>69</label><mixed-citation>Tang, X., Kong, L., Zhu, J., Wang, Z., Li, J., Wu, H., Wu, Q., Chen, H., Zhu, L., Wang, W., Liu, B., Wang, Q., Chen, D., Pan, Y., Song, T., Li, F., Zheng, H., Jia, G., Lu, M., Wu, L., and Carmichael, G. R.: High-resolution Air Quality Reanalysis Dataset over China (CAQRA), Science Data Bank [data set], <ext-link xlink:href="https://doi.org/10.11922/sciencedb.00053" ext-link-type="DOI">10.11922/sciencedb.00053</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bib70"><label>70</label><mixed-citation>Valler, V., Franke, J., and Brönnimann, S.: Impact of different estimations of the background-error covariance matrix on climate reconstructions based on data assimilation, Clim. Past, 15, 1427–1441, <ext-link xlink:href="https://doi.org/10.5194/cp-15-1427-2019" ext-link-type="DOI">10.5194/cp-15-1427-2019</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bib71"><label>71</label><mixed-citation>Wang, H. L., Qiao, L. P., Lou, S. R., Zhou, M., Ding, A. J., Huang, H. Y., Chen, J. M., Wang, Q., Tao, S. K., Chen, C. H., Li, L., and Huang, C.: Chemical composition of PM<sub>2.5</sub> and meteorological impact among three years in urban Shanghai, China, J. Clean. Product., 112, 1302–1311, <ext-link xlink:href="https://doi.org/10.1016/j.jclepro.2015.04.099" ext-link-type="DOI">10.1016/j.jclepro.2015.04.099</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bib72"><label>72</label><mixed-citation>Weagle, C. L., Snider, G., Li, C., van Donkelaar, A., Philip, S., Bissonnette, P., Burke, J., Jackson, J., Latimer, R., Stone, E., Abboud, I., Akoshile, C., Anh, N. X., Brook, J. R., Cohen, A., Dong, J., Gibson, M. D., Griffith, D., He, K. B., Holben, B. N., Kahn, R., Keller, C. A., Kim, J. S., Lagrosas, N., Lestari, P., Khian, Y. L., Liu, Y., Marais, E. A., Martins, J. V., Misra, A., Muliane, U., Pratiwi, R., Quel, E. J., Salam, A., Segev, L., Tripathi, S. N., Wang, C., Zhang, Q., Brauer, M., Rudich, Y., and Martin, R. V.: Global Sources of Fine Particulate Matter: Interpretation of PM<sub>2.5</sub> Chemical Composition Observed by SPARTAN using a Global Chemical Transport Model, Environ. Sci. Technol., 52, 11670–11681, <ext-link xlink:href="https://doi.org/10.1021/acs.est.8b01658" ext-link-type="DOI">10.1021/acs.est.8b01658</ext-link>, 2018. </mixed-citation></ref>
      <ref id="bib1.bib73"><label>73</label><mixed-citation>Wei, J., Li, Z., and Chen, X.: ChinaHighPMC: Daily Seamless 1 km Ground-Level PM<sub>2.5</sub> Composition Dataset for China (2000–Present) [Data set]. In Environmental Science &amp; Technology (Version 1, Vol. 57, Issue 46, pp. 18282–18295), Zenodo [data set], <ext-link xlink:href="https://doi.org/10.5281/zenodo.10011898" ext-link-type="DOI">10.5281/zenodo.10011898</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib74"><label>74</label><mixed-citation>Wei, J., Li, Z., Chen, X., Li, C., Sun, Y., Wang, J., Lyapustin, A., Brasseur, G. P., Jiang, M., Sun, L., Wang, T., Jung, C. H., Qiu, B., Fang, C., Liu, X., Hao, J., Wang, Y., Zhan, M., Song, X., and Liu, Y.: Separating Daily 1 km PM<sub>2.5</sub> Inorganic Chemical Composition in China since 2000 via Deep Learning Integrating Ground, Satellite, and Model Data, Environ. Sci. Technol., 57, 18282–18295, <ext-link xlink:href="https://doi.org/10.1021/acs.est.3c00272" ext-link-type="DOI">10.1021/acs.est.3c00272</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bib75"><label>75</label><mixed-citation>Wu, C., Cao, C., Li, J., Lv, S., Li, J., Liu, X., Zhang, S., Liu, S., Zhang, F., Meng, J., and Wang, G.: Different physicochemical behaviors of nitrate and ammonium during transport: a case study on Mt. Hua, China, Atmos. Chem. Phys., 22, 15621–15635, <ext-link xlink:href="https://doi.org/10.5194/acp-22-15621-2022" ext-link-type="DOI">10.5194/acp-22-15621-2022</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib76"><label>76</label><mixed-citation>Wu, J., Chen, X.-Y., Zhang, H., Xiong, L.-D., Lei, H., and Deng, S.-H.: Hyperparameter Optimization for Machine Learning Models Based on Bayesian Optimization, J. Electron. Sci. Technol., 17, 26–40, 2019.</mixed-citation></ref>
      <ref id="bib1.bib77"><label>77</label><mixed-citation>Xi, E.: Image Classification and Recognition Based on Deep Learning and Random Forest Algorithm, Wirel. Commun. Mob. Com., 2013181, <ext-link xlink:href="https://doi.org/10.1155/2022/2013181" ext-link-type="DOI">10.1155/2022/2013181</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib78"><label>78</label><mixed-citation>Xie, T., Wang, C., and Peng, Y.: hi-RF: Incremental Learning Random Forest for Large-Scale Multi-class Data Classification, 2016/11, Atlantis Press, 312–321, <ext-link xlink:href="https://doi.org/10.2991/aiie-16.2016.72" ext-link-type="DOI">10.2991/aiie-16.2016.72</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bib79"><label>79</label><mixed-citation>Xie, X., Hu, J., Qin, M., Guo, S., Hu, M., Wang, H., Lou, S., Li, J., Sun, J., Li, X., Sheng, L., Zhu, J., Chen, G., Yin, J., Fu, W., Huang, C., and Zhang, Y.: Modeling particulate nitrate in China: Current findings and future directions, Environ. Int., 166, 107369, <ext-link xlink:href="https://doi.org/10.1016/j.envint.2022.107369" ext-link-type="DOI">10.1016/j.envint.2022.107369</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bib80"><label>80</label><mixed-citation>Yang, L. M. and Grooms, I.: Machine learning techniques to construct patched analog ensembles for data assimilation, J. Comput. Phys., 443, 110532, <ext-link xlink:href="https://doi.org/10.1016/j.jcp.2021.110532" ext-link-type="DOI">10.1016/j.jcp.2021.110532</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bib81"><label>81</label><mixed-citation>Yang, T., Li, H., Xu, W., Song, Y., Xu, L., Wang, H., Wang, F., Sun, Y., Wang, Z., and Fu, P.: Strong Impacts of Regional Atmospheric Transport on the Vertical Distribution of Aerosol Ammonium over Beijing, Environ. Sci. Technol. Lett., 11, 29–34, <ext-link xlink:href="https://doi.org/10.1021/acs.estlett.3c00791" ext-link-type="DOI">10.1021/acs.estlett.3c00791</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bib82"><label>82</label><mixed-citation>Zaveri, R. A. and Peters, L. K.: A new lumped structure photochemical mechanism for large-scale applications, J. Geophys. Res.-Atmos., 104, 30387–30415, <ext-link xlink:href="https://doi.org/10.1029/1999JD900876" ext-link-type="DOI">10.1029/1999JD900876</ext-link>, 1999.</mixed-citation></ref>
      <ref id="bib1.bib83"><label>83</label><mixed-citation>Zhao, C., Sun, Y., Yang, J., Li, J., Zhou, Y., Yang, Y., Fan, H., and Zhao, X.: Observational evidence and mechanisms of aerosol effects on precipitation, Sci. Bull., 69, 1569–1580, <ext-link xlink:href="https://doi.org/10.1016/j.scib.2024.03.014" ext-link-type="DOI">10.1016/j.scib.2024.03.014</ext-link>, 2024.</mixed-citation></ref>

  </ref-list></back>
    <!--<article-title-html>OIRF-LEnKF v1.0: a novel data assimilation system by integrating incremental machine learning with a localized EnKF for enhanced PM<sub>2.5</sub> chemical component simulation and reanalysis</article-title-html>
<abstract-html/>
<ref-html id="bib1.bib1"><label>1</label><mixed-citation>
      
Adie, J., Chin, C. S., Li, J., and See, S.: GAIA-Chem: A Framework for Global AI-Accelerated Atmospheric Chemistry Modelling, in: Proceedings of the Platform for Advanced Scientific Computing Conference, Zurich, Switzerland, 13, 1–5, <a href="https://doi.org/10.1145/3659914.3659927" target="_blank">https://doi.org/10.1145/3659914.3659927</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib2"><label>2</label><mixed-citation>
      
Amidror, I.: Scattered data interpolation methods for electronic imaging
systems: a survey, J. Electron. Imag., 11, <a href="https://doi.org/10.1117/1.1455013" target="_blank">https://doi.org/10.1117/1.1455013</a>, 2002.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib3"><label>3</label><mixed-citation>
      
Arcucci, R., Zhu, J., Hu, S., and Guo, Y.-K.: Deep Data Assimilation: Integrating Deep Learning with Data Assimilation, Appl. Sci., 11, 1114,
<a href="https://doi.org/10.3390/app11031114" target="_blank">https://doi.org/10.3390/app11031114</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib4"><label>4</label><mixed-citation>
      
Brajard, J., Carrassi, A., Bocquet, M., and Bertino, L.: Combining data
assimilation and machine learning to emulate a dynamical model from sparse
and noisy observations: A case study with the Lorenz 96 model, J. Comput.
Sci., 44, 101171, <a href="https://doi.org/10.1016/j.jocs.2020.101171" target="_blank">https://doi.org/10.1016/j.jocs.2020.101171</a>, 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib5"><label>5</label><mixed-citation>
      
Breiman, L.: Random Forests, Mach. Learn., 45, 5–32,
<a href="https://doi.org/10.1023/A:1010933404324" target="_blank">https://doi.org/10.1023/A:1010933404324</a>, 2001.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib6"><label>6</label><mixed-citation>
      
Buizza, C., Quilodrán Casas, C., Nadler, P., Mack, J., Marrone, S., Titus, Z., Le Cornec, C., Heylen, E., Dur, T., Baca Ruiz, L., Heaney, C.,
Díaz Lopez, J. A., Kumar, K. S. S., and Arcucci, R.: Data Learning:
Integrating Data Assimilation and Machine Learning, J. Comput. Sci., 58,
101525, <a href="https://doi.org/10.1016/j.jocs.2021.101525" target="_blank">https://doi.org/10.1016/j.jocs.2021.101525</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib7"><label>7</label><mixed-citation>
      
Cha, Y., Lee, J.-J., Song, C. H., Kim, S., Park, R. J., Lee, M.-I., Woo, J.-H., Choi, J.-H., Bae, K., Yu, J., Kim, E., Kim, H., Lee, S.-H., Kim, J.,
Chang, L.-S., Jeon, K.-h., and Song, C.-K.: Investigating uncertainties in
air quality models used in GMAP/SIJAQ 2021 field campaign: General performance of different models and ensemble results, Atmos. Environ., 340,
120896, <a href="https://doi.org/10.1016/j.atmosenv.2024.120896" target="_blank">https://doi.org/10.1016/j.atmosenv.2024.120896</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib8"><label>8</label><mixed-citation>
      
Chattopadhyay, A., Nabizadeh, E., Bach, E., and Hassanzadeh, P.: Deep
learning-enhanced ensemble-based data assimilation for high-dimensional
nonlinear dynamical systems, J. Comput. Phys., 477, 111918,
<a href="https://doi.org/10.1016/j.jcp.2023.111918" target="_blank">https://doi.org/10.1016/j.jcp.2023.111918</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib9"><label>9</label><mixed-citation>
      
Chen, L.: A review of the applications of ensemble forecasting in fields
other than meteorology, Weather, 79, 285–290, <a href="https://doi.org/10.1002/wea.4584" target="_blank">https://doi.org/10.1002/wea.4584</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib10"><label>10</label><mixed-citation>
      
Copernicus Atmosphere Monitoring Service: CAMS global reanalysis (EAC4), Copernicus Atmosphere Monitoring Service (CAMS) Atmosphere Data Store [data set], <a href="https://doi.org/10.24381/d58bbf47" target="_blank">https://doi.org/10.24381/d58bbf47</a>, 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib11"><label>11</label><mixed-citation>
      
Debjyoti, G. and Utpal, R.: Comprehensive Benchmark Study of Machine Learning and Deep Learning Approaches for Human Activity Recognition using the UCI HAR Dataset, Int. J. Comput. Appl., 187, 66–69, <a href="https://doi.org/10.5120/ijca2025925797" target="_blank">https://doi.org/10.5120/ijca2025925797</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib12"><label>12</label><mixed-citation>
      
Dong, R., Leng, H., Zhao, J., Song, J., and Liang, S.: A Framework for Four-Dimensional Variational Data Assimilation Based on Machine Learning,
Entropy, 24, 264, <a href="https://doi.org/10.3390/e24020264" target="_blank">https://doi.org/10.3390/e24020264</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib13"><label>13</label><mixed-citation>
      
Dong, R., Leng, H., Zhao, C., Song, J., Zhao, J., and Cao, X.: A hybrid data
assimilation system based on machine learning, Front. Earth Sci., 10,
<a href="https://doi.org/10.3389/feart.2022.1012165" target="_blank">https://doi.org/10.3389/feart.2022.1012165</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib14"><label>14</label><mixed-citation>
      
Evensen, G.: Sequential data assimilation with a nonlinear quasi-geostrophic
model using Monte Carlo methods to forecast error statistics, J. Geophys.
Res.-Oceans, 99, <a href="https://doi.org/10.1029/94jc00572" target="_blank">https://doi.org/10.1029/94jc00572</a>, 1994.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib15"><label>15</label><mixed-citation>
      
Evensen, G.: The Ensemble Kalman Filter: Theoretical formulation and practical implementation, Ocean Dynam., 53, 343–367,
<a href="https://doi.org/10.1007/s10236-003-0036-9" target="_blank">https://doi.org/10.1007/s10236-003-0036-9</a>, 2003.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib16"><label>16</label><mixed-citation>
      
Fang, L., Jin, J., Segers, A., Lin, H. X., Pang, M., Xiao, C., Deng, T., and
Liao, H.: Development of a regional feature selection-based machine learning
system (RFSML v1.0) for air pollution forecasting over China, Geosci. Model
Dev., 15, 7791–7807, <a href="https://doi.org/10.5194/gmd-15-7791-2022" target="_blank">https://doi.org/10.5194/gmd-15-7791-2022</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib17"><label>17</label><mixed-citation>
      
Farchi, A., Bocquet, M., Laloyaux, P., Bonavita, M., and Malartic, Q.: A
comparison of combined data assimilation and machine learning methods for offline and online model error correction, J. Comput. Sci., 55, 101468,
<a href="https://doi.org/10.1016/j.jocs.2021.101468" target="_blank">https://doi.org/10.1016/j.jocs.2021.101468</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib18"><label>18</label><mixed-citation>
      
Friedman, J. H., Bentley, J. L., and Finkel, R. A.: An algorithm for finding
best matches in logarithmic expected time, ACM T. Math. Softw., 3, 209–226, <a href="https://doi.org/10.1145/355744.355745" target="_blank">https://doi.org/10.1145/355744.355745</a>, 1977.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib19"><label>19</label><mixed-citation>
      
Geer, A. J.: Learning earth system models from observations: machine learning or data assimilation?, Philos. T. Roy. Soc. A, 379, 20200089, <a href="https://doi.org/10.1098/rsta.2020.0089" target="_blank">https://doi.org/10.1098/rsta.2020.0089</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib20"><label>20</label><mixed-citation>
      
Gelbart, M. A., Snoek, J., and Adams, R. P.: Bayesian optimization with unknown constraints, in: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI'14), AUAI Press, Arlington,
Virginia, USA, 250–259, <a href="https://doi.org/10.5555/3020751.3020778" target="_blank">https://doi.org/10.5555/3020751.3020778</a>, 2014.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib21"><label>21</label><mixed-citation>
      
Gohari, K., Sheidaei, A., Yitshak-Sade, M., Colicino, E., and Kloog, I.:
Exploring multivariate machine learning frameworks to parallelize PM<sub>2.5</sub>
simultaneous estimations across the continental United States, Environ. Pollut., 374, 126161, <a href="https://doi.org/10.1016/j.envpol.2025.126161" target="_blank">https://doi.org/10.1016/j.envpol.2025.126161</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib22"><label>22</label><mixed-citation>
      
Gottwald, G. A. and Reich, S.: Supervised learning from noisy observations:
Combining machine-learning techniques with data assimilation, Physica D, 423, 132911, <a href="https://doi.org/10.1016/j.physd.2021.132911" target="_blank">https://doi.org/10.1016/j.physd.2021.132911</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib23"><label>23</label><mixed-citation>
      
He, X., Li, Y., Liu, S., Xu, T., Chen, F., Li, Z., Zhang, Z., Liu, R., Song,
L., Xu, Z., Peng, Z., and Zheng, C.: Improving regional climate simulations
based on a hybrid data assimilation and machine learning method, Hydrol. Earth Syst. Sci., 27, 1583–1606, <a href="https://doi.org/10.5194/hess-27-1583-2023" target="_blank">https://doi.org/10.5194/hess-27-1583-2023</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib24"><label>24</label><mixed-citation>
      
Hersbach, H., Bell, B., Berrisford, P., Biavati, G., Horányi, A.,
Muñoz Sabater, J., Nicolas, J., Peubey, C., Radu, R., Rozum, I.,
Schepers, D., Simmons, A., Soci, C., Dee, D., and Thépaut, J.-N.: ERA5 hourly data on pressure levels from 1940 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS) [data set],
<a href="https://doi.org/10.24381/cds.bd0915c6" target="_blank">https://doi.org/10.24381/cds.bd0915c6</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib25"><label>25</label><mixed-citation>
      
Houtekamer, P. L. and Mitchell, H. L.: Data Assimilation Using an Ensemble
Kalman Filter Technique, Mon. Weather Rev., 126, 796–811,
<a href="https://doi.org/10.1175/1520-0493(1998)126&lt;0796:DAUAEK&gt;2.0.CO;2" target="_blank">https://doi.org/10.1175/1520-0493(1998)126&lt;0796:DAUAEK&gt;2.0.CO;2</a>, 1998.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib26"><label>26</label><mixed-citation>
      
Houtekamer, P. L. and Zhang, F.: Review of the Ensemble Kalman Filter for
Atmospheric Data Assimilation, Mon. Weather Rev., 144, 4489–4532,
<a href="https://doi.org/10.1175/MWR-D-15-0440.1" target="_blank">https://doi.org/10.1175/MWR-D-15-0440.1</a>, 2016.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib27"><label>27</label><mixed-citation>
      
Howard, L. J., Subramanian, A., and Hoteit, I.: A Machine Learning Augmented
Data Assimilation Method for High-Resolution Observations, J. Adv. Model Earth Syst., 16, e2023MS003774, <a href="https://doi.org/10.1029/2023MS003774" target="_blank">https://doi.org/10.1029/2023MS003774</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib28"><label>28</label><mixed-citation>
      
Huang, R. J., Zhang, Y. L., Bozzetti, C., Ho, K. F., Cao, J. J., Han, Y. M.,
Daellenbach, K. R., Slowik, J. G., Platt, S. M., Canonaco, F., Zotter, P.,
Wolf, R., Pieber, S. M., Bruns, E. A., Crippa, M., Ciarelli, G., Piazzalunga, A., Schwikowski, M., Abbaszade, G., Schnelle-Kreis, J., Zimmermann, R., An, Z. S., Szidat, S., Baltensperger, U., El Haddad, I., and Prévôt, A. S. H.: High secondary aerosol contribution to particulate pollution during haze events in China, Nature, 514, 218–222, <a href="https://doi.org/10.1038/nature13774" target="_blank">https://doi.org/10.1038/nature13774</a>, 2014.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib29"><label>29</label><mixed-citation>
      
Huang, Y., Wu, S., Dubey, M. K., and French, N. H. F.: Impact of aging mechanism on model simulated carbonaceous aerosols, Atmos. Chem. Phys., 13,
6329–6343, <a href="https://doi.org/10.5194/acp-13-6329-2013" target="_blank">https://doi.org/10.5194/acp-13-6329-2013</a>, 2013.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib30"><label>30</label><mixed-citation>
      
Inness, A., Ades, M., Agustí-Panareda, A., Barré, J., Benedictow, A., Blechschmidt, A. M., Dominguez, J. J., Engelen, R., Eskes, H., Flemming, J., Huijnen, V., Jones, L., Kipling, Z., Massart, S., Parrington, M., Peuch, V. H., Razinger, M., Remy, S., Schulz, M., and Suttie, M.: The CAMS reanalysis of atmospheric composition, Atmos. Chem. Phys., 19, 3515–3556,
<a href="https://doi.org/10.5194/acp-19-3515-2019" target="_blank">https://doi.org/10.5194/acp-19-3515-2019</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib31"><label>31</label><mixed-citation>
      
Jalali, M. W., Saidi, B., Farahmand, H., Panah, M. A. R., and Saruhan, E. N.: Scalable AI-driven air quality forecasting and classification for public health applications, Discov. Atmos., 3, 25, <a href="https://doi.org/10.1007/s44292-025-00052-8" target="_blank">https://doi.org/10.1007/s44292-025-00052-8</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib32"><label>32</label><mixed-citation>
      
Janjić, T., Nerger, L., Albertella, A., Schröter, J., and Skachko, S.: On Domain Localization in Ensemble-Based Kalman Filter Algorithms, Mon.
Weather Rev., 139, 2046–2060, <a href="https://doi.org/10.1175/2011MWR3552.1" target="_blank">https://doi.org/10.1175/2011MWR3552.1</a>, 2011.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib33"><label>33</label><mixed-citation>
      
Jin, J., Lin, H. X., Segers, A., Xie, Y., and Heemink, A.: Machine learning
for observation bias correction with application to dust storm data assimilation, Atmos. Chem. Phys., 19, 10009–10026, <a href="https://doi.org/10.5194/acp-19-10009-2019" target="_blank">https://doi.org/10.5194/acp-19-10009-2019</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib34"><label>34</label><mixed-citation>
      
Kong, L., Tang, X., Zhu, J., Wang, Z., Li, J., Wu, H., Wu, Q., Chen, H., Zhu, L., Wang, W., Liu, B., Wang, Q., Chen, D., Pan, Y., Song, T., Li, F., Zheng, H., Jia, G., Lu, M., Wu, L., and Carmichael, G. R.: A 6-year-long (2013–2018) high-resolution air quality reanalysis dataset in China based on the assimilation of surface observations from CNEMC, Earth Syst. Sci. Data, 13, 529–570, <a href="https://doi.org/10.5194/essd-13-529-2021" target="_blank">https://doi.org/10.5194/essd-13-529-2021</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib35"><label>35</label><mixed-citation>
      
Kong, L., Tang, X., Zhu, J., Wang, Z., Liu, B., Zhu, Y., Zhu, L., Chen, D.,
Hu, K., Wu, H., Wu, Q., Shen, J., Sun, Y., Liu, Z., Xin, J., Ji, D., and Zheng, M.: High-resolution Simulation Dataset of Hourly PM<sub>2.5</sub> Chemical
Composition in China (CAQRA-aerosol) from 2013 to 2020, Adv. Atmos. Sci., 42, 697–712, <a href="https://doi.org/10.1007/s00376-024-4046-5" target="_blank">https://doi.org/10.1007/s00376-024-4046-5</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib36"><label>36</label><mixed-citation>
      
Lai, Y.: Application and Effectiveness Evaluation of Bayesian Optimization
Algorithm in Hyperparameter Tuning of Machine Learning Models, in: 2024 International Conference on Power, Electrical Engineering, Electronics and Control (PEEEC), 14–16 August 2024, Athens, Greece, 351–355,
<a href="https://doi.org/10.1109/PEEEC63877.2024.00070" target="_blank">https://doi.org/10.1109/PEEEC63877.2024.00070</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib37"><label>37</label><mixed-citation>
      
Lee, S., Park, S., Lee, M.-I., Kim, G., Im, J., and Song, C.-K.: Air Quality
Forecasts Improved by Combining Data Assimilation and Machine Learning With
Satellite AOD, Geophys. Res. Lett., 49, e2021GL096066, <a href="https://doi.org/10.1029/2021GL096066" target="_blank">https://doi.org/10.1029/2021GL096066</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib38"><label>38</label><mixed-citation>
      
Legler, S. and Janjić, T.: Combining data assimilation and machine learning to estimate parameters of a convective-scale model, Q. J. Roy. Meteorol. Soc., 148, 860–874, <a href="https://doi.org/10.1002/qj.4235" target="_blank">https://doi.org/10.1002/qj.4235</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib39"><label>39</label><mixed-citation>
      
Lei, L. and Whitaker, J. S.: Evaluating the trade-offs between ensemble size
and ensemble resolution in an ensemble-variational data assimilation system,
J. Adv. Model Earth Syst., 9, 781–789, <a href="https://doi.org/10.1002/2016MS000864" target="_blank">https://doi.org/10.1002/2016MS000864</a>, 2017.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib40"><label>40</label><mixed-citation>
      
Lei, L., Sun, Y., Ouyang, B., Qiu, Y., Xie, C., Tang, G., Zhou, W., He, Y.,
Wang, Q., Cheng, X., Fu, P., and Wang, Z.: Vertical Distributions of Primary
and Secondary Aerosols in Urban Boundary Layer: Insights into Sources, Chemistry, and Interaction with Meteorology, Environ. Sci. Technol., 55, 4542–4552, <a href="https://doi.org/10.1021/acs.est.1c00479" target="_blank">https://doi.org/10.1021/acs.est.1c00479</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib41"><label>41</label><mixed-citation>
      
Li, H.: OIRF-LEnKF v1.0 related open-access datasets, Zenodo [data set],
<a href="https://doi.org/10.5281/zenodo.17359290" target="_blank">https://doi.org/10.5281/zenodo.17359290</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib42"><label>42</label><mixed-citation>
      
Li, H. and Yang, T.: OIRF-LEnKF v1.0, Zenodo [code and data set],
<a href="https://doi.org/10.5281/zenodo.17346786" target="_blank">https://doi.org/10.5281/zenodo.17346786</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib43"><label>43</label><mixed-citation>
      
Li, H., Yang, T., Nerger, L., Zhang, D., Zhang, D., Tang, G., Wang, H., Sun,
Y., Fu, P., Su, H., and Wang, Z.: NAQPMS-PDAF v2.0: a novel hybrid nonlinear
data assimilation system for improved simulation of PM<sub>2.5</sub> chemical
components, Geosci. Model Dev., 17, 8495–8519, <a href="https://doi.org/10.5194/gmd-17-8495-2024" target="_blank">https://doi.org/10.5194/gmd-17-8495-2024</a>, 2024a.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib44"><label>44</label><mixed-citation>
      
Li, H., Yang, T., and Wang, H.: NAQPMS-PDAF v2.0 (Version 2.0), Zenodo [data set], <a href="https://doi.org/10.5281/zenodo.10886914" target="_blank">https://doi.org/10.5281/zenodo.10886914</a>, 2024b.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib45"><label>45</label><mixed-citation>
      
Li, H., Yang, T., Du, Y., Tan, Y., and Wang, Z.: Interpreting hourly mass
concentrations of PM<sub>2.5</sub> chemical components with an optimal deep-learning model, J. Environ. Sci., 151, 125–139, <a href="https://doi.org/10.1016/j.jes.2024.03.037" target="_blank">https://doi.org/10.1016/j.jes.2024.03.037</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib46"><label>46</label><mixed-citation>
      
Li, J., Wang, Y., Steenland, K., Liu, P., van Donkelaar, A., Martin, R. V.,
Chang, H. H., Caudle, W. M., Schwartz, J., Koutrakis, P., and Shi, L.:
Long-term effects of PM<sub>2.5</sub> components on incident dementia in the northeastern United States, Innovation, 3, 100208, <a href="https://doi.org/10.1016/j.xinn.2022.100208" target="_blank">https://doi.org/10.1016/j.xinn.2022.100208</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib47"><label>47</label><mixed-citation>
      
Lin, G. Y., Chen, H. W., Chen, B. J., and Chen, S. C. et al.: A machine learning model for predicting PM<sub>2.5</sub> and nitrate concentrations based on
long-term water-soluble inorganic salts datasets at a road site station,
Chemosphere, 289, <a href="https://doi.org/10.1016/j.chemosphere.2021.133123" target="_blank">https://doi.org/10.1016/j.chemosphere.2021.133123</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib48"><label>48</label><mixed-citation>
      
Lin, H., Jin, J., and van den Herik, J.: Air Quality Forecast through
Integrated Data Assimilation and Machine Learning, in: Proceedings of the
11th International Conference on Agents and Artificial Intelligence, Prague, Czech Republic, 787–793, <a href="https://doi.org/10.5220/0007555207870793" target="_blank">https://doi.org/10.5220/0007555207870793</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib49"><label>49</label><mixed-citation>
      
Liu, K., Zhang, Y., He, H., Xiao, H., Wang, S., Zhang, Y., Li, H., and Qian, X.: Time series prediction of the chemical components of PM<sub>2.5</sub> based on a deep learning model, Chemosphere, 342, 140153, <a href="https://doi.org/10.1016/j.chemosphere.2023.140153" target="_blank">https://doi.org/10.1016/j.chemosphere.2023.140153</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib50"><label>50</label><mixed-citation>
      
Liu, S., Geng, G., Xiao, Q., Zheng, Y., Liu, X., Cheng, J., and Zhang, Q.:
Tracking Daily Concentrations of PM<sub>2.5</sub> Chemical Composition in China
since 2000, Environ. Sci. Technol., 56, 16517–16527,
<a href="https://doi.org/10.1021/acs.est.2c06510" target="_blank">https://doi.org/10.1021/acs.est.2c06510</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib51"><label>51</label><mixed-citation>
      
Luo, Z., Han, Y., Hua, K., Zhang, Y., Wu, J., Bi, X., Dai, Q., Liu, B., Chen, Y., Long, X., and Feng, Y.: The effect of emission source chemical profiles on simulated PM<sub>2.5</sub> components: sensitivity analysis with the Community Multiscale Air Quality (CMAQ) modeling system version 5.0.2, Geosci. Model Dev., 16, 6757–6771, <a href="https://doi.org/10.5194/gmd-16-6757-2023" target="_blank">https://doi.org/10.5194/gmd-16-6757-2023</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib52"><label>52</label><mixed-citation>
      
Lv, L., Wei, P., Li, J., and Hu, J.: Application of machine learning algorithms to improve numerical simulation prediction of PM<sub>2.5</sub> and
chemical components, Atmos. Pollut. Res., 12, 101211, <a href="https://doi.org/10.1016/j.apr.2021.101211" target="_blank">https://doi.org/10.1016/j.apr.2021.101211</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib53"><label>53</label><mixed-citation>
      
Mallet, V. and Sportisse, B.: Uncertainty in a chemistry-transport model due
to physical parameterizations and numerical approximations: An ensemble approach applied to ozone modeling, J. Geophys. Res.-Atmos., 111,
<a href="https://doi.org/10.1029/2005jd006149" target="_blank">https://doi.org/10.1029/2005jd006149</a>, 2006.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib54"><label>54</label><mixed-citation>
      
Meng, X., Hand, J. L., Schichtel, B. A., and Liu, Y.: Space-time trends of
PM<sub>2.5</sub> constituents in the conterminous United States estimated by a
machine learning approach, 2005–2015, Environ. Int., 121, 1137–1147,
<a href="https://doi.org/10.1016/j.envint.2018.10.029" target="_blank">https://doi.org/10.1016/j.envint.2018.10.029</a>, 2018.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib55"><label>55</label><mixed-citation>
      
Miao, R., Chen, Q., Zheng, Y., Cheng, X., Sun, Y., Palmer, P. I., Shrivastava, M., Guo, J., Zhang, Q., Liu, Y., Tan, Z., Ma, X., Chen, S., Zeng, L., Lu, K., and Zhang, Y.: Model bias in simulating major chemical
components of PM<sub>2.5</sub> in China, Atmos. Chem. Phys., 20, 12265–12284,
<a href="https://doi.org/10.5194/acp-20-12265-2020" target="_blank">https://doi.org/10.5194/acp-20-12265-2020</a>, 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib56"><label>56</label><mixed-citation>
      
Nenes, A., Pandis, S. N., and Pilinis, C.: ISORROPIA: A new thermodynamic
equilibrium model for multiphase multicomponent inorganic aerosols, Aquat.
Geochem., 4, 123–152, <a href="https://doi.org/10.1023/A:1009604003981" target="_blank">https://doi.org/10.1023/A:1009604003981</a>, 1998.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib57"><label>57</label><mixed-citation>
      
Nerger, L., Janjić, T., Schröter, J., and Hiller, W.: A regulated
localization scheme for ensemble-based Kalman filters, Q. J. Roy. Meteorol.
Soc., 138, 802–812, <a href="https://doi.org/10.1002/qj.945" target="_blank">https://doi.org/10.1002/qj.945</a>, 2012.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib58"><label>58</label><mixed-citation>
      
Probst, P., Wright, M. N., and Boulesteix, A.-L.: Hyperparameters and tuning
strategies for random forest, WIREs Data Min. Knowl. Discov., 9, e1301,
<a href="https://doi.org/10.1002/widm.1301" target="_blank">https://doi.org/10.1002/widm.1301</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib59"><label>59</label><mixed-citation>
      
Randles, C. A., da Silva, A. M., Buchard, V., Colarco, P. R., Darmenov, A.,
Govindaraju, R., Smirnov, A., Holben, B., Ferrare, R., Hair, J., Shinozuka, Y., and Flynn, C. J.: The MERRA-2 aerosol reanalysis, 1980 onward. Part I:
System description and data assimilation evaluation, J. Climate, 30, 6823–6850, <a href="https://doi.org/10.1175/JCLI-D-16-0609.1" target="_blank">https://doi.org/10.1175/JCLI-D-16-0609.1</a>, 2017.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib60"><label>60</label><mixed-citation>
      
Rasmussen, C. E.: Gaussian processes in machine learning, in: Advanced Lectures on Machine Learning, edited by: Bousquet, O., von Luxburg, U., and
Rätsch, G., Springer, Berlin, Heidelberg, 63–71,
<a href="https://doi.org/10.1007/978-3-540-28650-9_4" target="_blank">https://doi.org/10.1007/978-3-540-28650-9_4</a>, 2004.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib61"><label>61</label><mixed-citation>
      
Shaheen, K., Hanif, M. A., Hasan, O., and Shafique, M.: Continual Learning for Real-World Autonomous Systems: Algorithms, Challenges and Frameworks, J.
Intel. Robot. Syst., 105, 9, <a href="https://doi.org/10.1007/s10846-022-01603-6" target="_blank">https://doi.org/10.1007/s10846-022-01603-6</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib62"><label>62</label><mixed-citation>
      
Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N.: Taking the Human Out of the Loop: A Review of Bayesian Optimization, Proc. IEEE, 104, 148–175, <a href="https://doi.org/10.1109/JPROC.2015.2494218" target="_blank">https://doi.org/10.1109/JPROC.2015.2494218</a>, 2016.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib63"><label>63</label><mixed-citation>
      
Shi, Y., Liu, L., Hu, F., Fan, G., and Huo, J.: Nocturnal Boundary Layer Evolution and Its Impacts on the Vertical Distributions of Pollutant Particulate Matter, Atmosphere, 12, 610, <a href="https://doi.org/10.3390/atmos12050610" target="_blank">https://doi.org/10.3390/atmos12050610</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib64"><label>64</label><mixed-citation>
      
Soni, A., Mandariya, A. K., Rajeev, P., Izhar, S., Singh, G. K., Choudhary,
V., Qadri, A. M., Gupta, A. D., Singh, A. K., and Gupta, T.: Multiple site
ground-based evaluation of carbonaceous aerosol mass concentrations
retrieved from CAMS and MERRA-2 over the Indo-Gangetic Plain, Environ. Sci.:
Atmos., 1, 577–590, <a href="https://doi.org/10.1039/d1ea00067e" target="_blank">https://doi.org/10.1039/d1ea00067e</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib65"><label>65</label><mixed-citation>
      
Stier, P., van den Heever, S. C., Christensen, M. W., Gryspeerdt, E., Dagan,
G., Saleeby, S. M., Bollasina, M., Donner, L., Emanuel, K., Ekman, A. M. L.,
Feingold, G., Field, P., Forster, P., Haywood, J., Kahn, R., Koren, I.,
Kummerow, C., L'Ecuyer, T., Lohmann, U., Ming, Y., Myhre, G., Quaas, J., Rosenfeld, D., Samset, B., Seifert, A., Stephens, G., and Tao, W.-K.:
Multifaceted aerosol effects on precipitation, Nat. Geosci., 17, 719–732,
<a href="https://doi.org/10.1038/s41561-024-01482-6" target="_blank">https://doi.org/10.1038/s41561-024-01482-6</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib66"><label>66</label><mixed-citation>
      
Stockwell, W. R., Middleton, P., Chang, J. S., and Tang, X.: The second
generation regional acid deposition model chemical mechanism for regional air quality modeling, J. Geophys. Res.-Atmos., 95, 16343–16367,
<a href="https://doi.org/10.1029/JD095iD10p16343" target="_blank">https://doi.org/10.1029/JD095iD10p16343</a>, 1990.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib67"><label>67</label><mixed-citation>
      
Sun, Y.: Vertical structures of physical and chemical properties of urban
boundary layer and formation mechanisms of atmospheric pollution, Chinese Sci. Bull., 63, 1374–1389, <a href="https://doi.org/10.1360/n972018-00258" target="_blank">https://doi.org/10.1360/n972018-00258</a>, 2018.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib68"><label>68</label><mixed-citation>
      
Sun, Y. L., Wang, Z. F., Wild, O., Xu, W. Q., Chen, C., Fu, P. Q., Du, W., Zhou, L. B., Zhang, Q., and Han, T. T.: “APEC Blue”: Secondary Aerosol
Reductions from Emission Controls in Beijing, Sci. Rep., 6, 20668,
<a href="https://doi.org/10.1038/srep20668" target="_blank">https://doi.org/10.1038/srep20668</a>, 2016.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib69"><label>69</label><mixed-citation>
      
Tang, X., Kong, L., Zhu, J., Wang, Z., Li, J., Wu, H., Wu, Q., Chen, H., Zhu, L., Wang, W., Liu, B., Wang, Q., Chen, D., Pan, Y., Song, T., Li, F., Zheng, H., Jia, G., Lu, M., Wu, L., and Carmichael, G. R.: High-resolution Air Quality Reanalysis Dataset over China (CAQRA), Science Data Bank [data set], <a href="https://doi.org/10.11922/sciencedb.00053" target="_blank">https://doi.org/10.11922/sciencedb.00053</a>, 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib70"><label>70</label><mixed-citation>
      
Valler, V., Franke, J., and Brönnimann, S.: Impact of different estimations of the background-error covariance matrix on climate reconstructions based on data assimilation, Clim. Past, 15, 1427–1441,
<a href="https://doi.org/10.5194/cp-15-1427-2019" target="_blank">https://doi.org/10.5194/cp-15-1427-2019</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib71"><label>71</label><mixed-citation>
      
Wang, H. L., Qiao, L. P., Lou, S. R., Zhou, M., Ding, A. J., Huang, H. Y.,
Chen, J. M., Wang, Q., Tao, S. K., Chen, C. H., Li, L., and Huang, C.: Chemical composition of PM<sub>2.5</sub> and meteorological impact among three
years in urban Shanghai, China, J. Clean. Product., 112, 1302–1311,
<a href="https://doi.org/10.1016/j.jclepro.2015.04.099" target="_blank">https://doi.org/10.1016/j.jclepro.2015.04.099</a>, 2016.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib72"><label>72</label><mixed-citation>
      
Weagle, C. L., Snider, G., Li, C., van Donkelaar, A., Philip, S., Bissonnette, P., Burke, J., Jackson, J., Latimer, R., Stone, E., Abboud, I.,
Akoshile, C., Anh, N. X., Brook, J. R., Cohen, A., Dong, J., Gibson, M. D.,
Griffith, D., He, K. B., Holben, B. N., Kahn, R., Keller, C. A., Kim, J. S.,
Lagrosas, N., Lestari, P., Khian, Y. L., Liu, Y., Marais, E. A., Martins, J.
V., Misra, A., Muliane, U., Pratiwi, R., Quel, E. J., Salam, A., Segev, L.,
Tripathi, S. N., Wang, C., Zhang, Q., Brauer, M., Rudich, Y., and Martin, R.
V.: Global Sources of Fine Particulate Matter: Interpretation of PM<sub>2.5</sub>
Chemical Composition Observed by SPARTAN using a Global Chemical Transport
Model, Environ. Sci. Technol., 52, 11670–11681, <a href="https://doi.org/10.1021/acs.est.8b01658" target="_blank">https://doi.org/10.1021/acs.est.8b01658</a>, 2018.


    </mixed-citation></ref-html>
<ref-html id="bib1.bib73"><label>73</label><mixed-citation>
      
Wei, J., Li, Z., and Chen, X.: ChinaHighPMC: Daily Seamless 1&thinsp;km Ground-Level PM<sub>2.5</sub> Composition Dataset for China (2000–Present), Zenodo [data set], <a href="https://doi.org/10.5281/zenodo.10011898" target="_blank">https://doi.org/10.5281/zenodo.10011898</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib74"><label>74</label><mixed-citation>
      
Wei, J., Li, Z., Chen, X., Li, C., Sun, Y., Wang, J., Lyapustin, A., Brasseur, G. P., Jiang, M., Sun, L., Wang, T., Jung, C. H., Qiu, B., Fang, C., Liu, X., Hao, J., Wang, Y., Zhan, M., Song, X., and Liu, Y.: Separating
Daily 1&thinsp;km PM<sub>2.5</sub> Inorganic Chemical Composition in China since 2000 via
Deep Learning Integrating Ground, Satellite, and Model Data, Environ. Sci.
Technol., 57, 18282–18295, <a href="https://doi.org/10.1021/acs.est.3c00272" target="_blank">https://doi.org/10.1021/acs.est.3c00272</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib75"><label>75</label><mixed-citation>
      
Wu, C., Cao, C., Li, J., Lv, S., Li, J., Liu, X., Zhang, S., Liu, S., Zhang,
F., Meng, J., and Wang, G.: Different physicochemical behaviors of nitrate
and ammonium during transport: a case study on Mt. Hua, China, Atmos. Chem.
Phys., 22, 15621–15635, <a href="https://doi.org/10.5194/acp-22-15621-2022" target="_blank">https://doi.org/10.5194/acp-22-15621-2022</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib76"><label>76</label><mixed-citation>
      
Wu, J., Chen, X.-Y., Zhang, H., Xiong, L.-D., Lei, H., and Deng, S.-H.:
Hyperparameter Optimization for Machine Learning Models Based on Bayesian
Optimization, J. Electron. Sci. Technol., 17, 26–40, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib77"><label>77</label><mixed-citation>
      
Xi, E.: Image Classification and Recognition Based on Deep Learning and Random Forest Algorithm, Wirel. Commun. Mob. Com., 2013181,
<a href="https://doi.org/10.1155/2022/2013181" target="_blank">https://doi.org/10.1155/2022/2013181</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib78"><label>78</label><mixed-citation>
      
Xie, T., Wang, C., and Peng, Y.: hi-RF: Incremental Learning Random Forest
for Large-Scale Multi-class Data Classification, 2016/11, Atlantis Press, 312–321, <a href="https://doi.org/10.2991/aiie-16.2016.72" target="_blank">https://doi.org/10.2991/aiie-16.2016.72</a>, 2016.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib79"><label>79</label><mixed-citation>
      
Xie, X., Hu, J., Qin, M., Guo, S., Hu, M., Wang, H., Lou, S., Li, J., Sun, J., Li, X., Sheng, L., Zhu, J., Chen, G., Yin, J., Fu, W., Huang, C., and
Zhang, Y.: Modeling particulate nitrate in China: Current findings and
future directions, Environ. Int., 166, 107369, <a href="https://doi.org/10.1016/j.envint.2022.107369" target="_blank">https://doi.org/10.1016/j.envint.2022.107369</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib80"><label>80</label><mixed-citation>
      
Yang, L. M. and Grooms, I.: Machine learning techniques to construct patched
analog ensembles for data assimilation, J. Comput. Phys., 443, 110532,
<a href="https://doi.org/10.1016/j.jcp.2021.110532" target="_blank">https://doi.org/10.1016/j.jcp.2021.110532</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib81"><label>81</label><mixed-citation>
      
Yang, T., Li, H., Xu, W., Song, Y., Xu, L., Wang, H., Wang, F., Sun, Y., Wang, Z., and Fu, P.: Strong Impacts of Regional Atmospheric Transport on
the Vertical Distribution of Aerosol Ammonium over Beijing, Environ. Sci.
Technol. Lett., 11, 29–34, <a href="https://doi.org/10.1021/acs.estlett.3c00791" target="_blank">https://doi.org/10.1021/acs.estlett.3c00791</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib82"><label>82</label><mixed-citation>
      
Zaveri, R. A. and Peters, L. K.: A new lumped structure photochemical mechanism for large-scale applications, J. Geophys. Res.-Atmos., 104,
30387–30415, <a href="https://doi.org/10.1029/1999JD900876" target="_blank">https://doi.org/10.1029/1999JD900876</a>, 1999.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib83"><label>83</label><mixed-citation>
      
Zhao, C., Sun, Y., Yang, J., Li, J., Zhou, Y., Yang, Y., Fan, H., and Zhao, X.: Observational evidence and mechanisms of aerosol effects on precipitation, Sci. Bull., 69, 1569–1580, <a href="https://doi.org/10.1016/j.scib.2024.03.014" target="_blank">https://doi.org/10.1016/j.scib.2024.03.014</a>, 2024.

    </mixed-citation></ref-html>--></article>
