Evaluation of ECMWF IFS-AER (CAMS) operational forecasts during cycle 41r1-46r1 with calibrated ceilometer profiles over Germany

Aerosol forecasts by the European Center for Medium Range Weather Forecasts (ECMWF) Integrated Forecasting System IFS-AER for years 2016-2019 (cycle 41r1 46r1) are compared to vertical profiles of particle backscatter from the Deutscher Wetterdienst (DWD) ceilometer network. The system has been developed in the Copernicus Atmosphere Monitoring Service (CAMS) and its precursors. The focus of this article is to evaluate the realism of the vertical aerosol distribution 5 from 0.3 to 8 km above ground, coded in the shape, bias and temporal variation of the profiles. The common physical quantity, the attenuated backscatter β∗(z) , is directly measured and calculated from the model mass mixing ratios of the different aerosol types using the model’s inherent aerosol microphysical properties. Pearson correlation coefficients of daily average simulated and observed vertical profiles between 10 r=0.6-0.8 in summer and 0.7-0.95 in winter indicate that most of the vertical structure is captured. It is governed by larger β∗(z) in the mixing-layer and comparably well captured with the successive model versions. The aerosol load tends to be high-biased near the surface, be underestimated in the mixing layer and realistic at small background values in the undisturbed free troposphere. A seasonal cycle of the bias below 1 km height indicates that aerosol sources and/or lifetimes are overestimated 15 in summer and pollution episodes not fully resolved in winter. Long-range transport of Saharan dust or fire smoke is captured and timely, only the dispersion to smaller scales is not resolved in detail. Over Germany β∗(z) from Saharan dust and sea salt are considerably overestimated. Differences between model and ceilometer profiles are investigated using observed in-situ mass concentrations of organic, black carbon, SO4, NO3, NH4 and proxys for mineral dust and sea-salt near the surface. 20 Accordingly, SO4 and OM sources as well as gas-to-particle partitioning of the NO3-NH4-system 1 https://doi.org/10.5194/gmd-2020-308 Preprint. Discussion started: 7 October 2020 c © Author(s) 2020. CC BY 4.0 License.


Introduction
Aerosol particles play a key role in atmospheric processes depending on scales, heights and geographical regions. They affect climate and weather, directly by light scattering and absorption (Hansen et al., 1997;Ramanathan et al., 2007;WMO, 2013), indirectly by altering formation and droplet size of clouds (Lohmann et al., 2007), and by their impact on saturation and vertical ex-30 change (Ackerman et al., 2000). In the lower troposphere particle emission, transport and heterogeneous chemical processes degrade air-quality and health (Galanter et al., 2000;Karanasiou et al., 2012;Pèrez et al., 2012), but then particles mediate gas-to-particle conversion, scavenging and final removal of trace gases from the atmosphere (e.g. Birmili et al. (2003); Kolb and et al (2010)).
Their largely variable primary (direct) and secondary (precursor-initiated) sources, both natural and 35 anthropogenic, resulted in a bunch of measurement-and numerical modeling techniques to analyze and understand particle atmospheric abundance and effects. Actually, increasing emissions of anthropogenic gases and aerosols during the last century has made aerosols a key trigger of severe pollution episodes and one of the largest uncertainties in assessments of climate change (e.g. Gilge et al. (2010); WMO (2013)). 40 Largest primary natural sources of atmospheric particles are oceans (sea salt), arid regions (mineral dust), boreal forests (organic and biomass-burning) and volcanic eruptions. Except volcanoes, these depend on season and weather, and all are linked to specific, partly extended source regions.
Anthropogenic primary sources also exhibit regular spatial and temporal patterns, while secondary production is mainly driven by dispersion of precursors/condensables, transport and radiation.  ally, wet sea-salt is the most abundant aerosol, but referred to the dry state mineral dust comprises an even larger portion of the global aerosol load (Kinne and et al, 2006;Huneeus and et al, 2011) and about one third of the primarily emitted aerosol mass (Houghton et al., 2001;WMO, 2013). Europe is reached by Saharan dust oftentimes per year (Moulin et al., 1998;Gobbi et al., 2000;Ansmann and et al, 2003;Collaud-Coen et al., 2004;Papayannis et al., 2008;Pey et al., 2013;Flentje et al., 2015) 50 where, decreasing towards the north, it contributes between 5%-30% to the total dry particle mass (Putaud et al., 2010). Dust may carry bacteria and has been associated to dispersion of meningitis (Griffin, 2007). Open fire emissions are in ∼85% (globally) linked to land-use and agriculture in the tropics (Andreae and Merlet, 2001), but drought-induced wildfires and boreal burns (Damoah et al., 2004;Hyer et al., 2007;Stohl et al., 2002) are more effective in spring and summer for mid-latitudes 55 (Trickl et al., 2015). Owing to their highly variable source characteristics and dispersion (Labonne 2 https://doi.org/10.5194/gmd-2020-308 Preprint. Discussion started: 7 October 2020 c Author(s) 2020. CC BY 4.0 License. volved in atmospheric composition modeling (Ordonez et al., 2010) inclusive their meteorological driver (e.g. Flentje et al. (2005Flentje et al. ( , 2007) with aid of independent observations is essential . 95 Global aerosol model evaluations so far concentrate on aerosol optical depth (AOD) measurements e.g. from AERONET (Holben and et al, 2001;Cesnulyte et al., 2014), however limited to daytime (except few moon-radiometers) and without resolving the vertical distribution. Regional models mostly think and verify in terms of particulate matter mass concentration PM 10 or PM 2.5 , often without resolving composition and sizes of particles (Stidworthy et al., 2018;Akritidis et al., 2018). 100 Assessments of detailed particle properties typically suffer from low availability and representativeness of observations and are applied retrospectively (Flemming et al., 2017;Inness et al., 2019).
Discrepancies between model and observations are typically reported in evaluation of operational or specific aerosol transport of aerosols (Pèrez et al., 2006a;Morcrette et al., 2011;Basart et al., 2012). They result from displacements, time-shifts, misses or excess plumes/layers, since neces-105 sarily rough parametrisations of complex source, transformation and sink mechanisms sometimes cannot fully reproduce the highly variable dispersion of particles in the atmosphere. Simplified representations of soil conditions, mobilisation, emission, convective/turbulent lifting and diffusion affect the source strengths, shortcomings in dry and wet deposition as well as coagulation and chemical transformations propagate to sinks and range (Tegen and Schepanski, 2009). Only recently, 110 aerosol profiles have been considered, measured by lidars (GALION -WMO-GAW Report No. 178) and ceilometers (Benedetti et al., 2009;Wiegner and Geiß, 2012;Wiegner et al., 2014;Chan et al., 2018), whereby the former are operated spatially sparse and temporally intermittent, the latter have no independent capability to identify and quantify particles and both do at best capture part of the surface layer. Nonetheless, combined lidar and ceilometer networks are optimally suited for 115 localizing and tracking aerosol plumes throughout the troposphere and for assimilation of aerosol information into atmospheric models (Benedetti et al., 2009;Bocquet et al., 2015).
The vertical profile particularly holds information about long-range quasi horizontal-as well as vertical exchange, altitude dependent exposure to pollutants, the impact of optically active particles on stratification and radiation, cloud formation potential, dispersion of hazardous particles like volcanic 120 ash, and by all these often allows source attribution more precise than from surface network observation alone. Inherently, it reflects the pollution and mixing state as well as the height of the planetary boundary layer or mixing layer ML. The air-quality-and health-impact of aerosols concentrates in to scale the dispersion of reactive gases and aerosols (Monks et al., 2009) as well as greenhouse gas concentration budgets (Gerbig et al., 2008). As far at it reflects in temporally consistent aerosol gradients, it can be inferred from lidar/ceilometer profiles (Münkel et al., 2007;Haeffelin et al., 2012).
In this study, we primarily use ceilometer profile data from the German ceilometer network to 135 evaluate CAMS aerosol forecasts. Section 2 describes the data and the methods to compare particle information in the CAMS global model with respect to observations. The results are presented in section 3 together with some auxilliary data sets for interpretation and are discussed in the context of possible model improvement in chapter 4. Key findings are summarized with an outlook to upcoming activities in chapter 5.  Rémy et al. (2019). Further information as well as analyses, forecasts, evaluation re-145 sults and other products can be found on the web page https://atmosphere.copernicus.eu/. We use the operational runs with assimilation (ASM) from 01/2016 (cycle 41r1) to 12/2019 (cycle 46r1) and corresponding unconstrained control runs (CTR) as listed in Table 1 and in Table 3 in Rémy et al. (2019). The data were re-sampled from the reduced Gaussian grid at T255 to 1.0 • x 1.0 • resolution before 06/2016 and T511 to 0.5 • x 0.5 • thereafter. Conceptually, regional 150 models build on the global forecasts and refine these scales to few km but yet provide only aggregated aerosol quantities PM 2.5 or PM 10 rather than direct backscatter output nor the information necessary for conversion. The global aerosol model uses 14 prognostic variables: (3 size bins each of dust and sea-salt, hydrophilic/hydrophobic black and organic carbon, sulphate SO 4 , and as of 9 July 2019 (cycle 46r1) also nitrate NO 3 and ammonium NH 4 ). MODIS AOD and 155 since cycle 45r1 also the Polar-Multi-Angle Product (Popp, 2016) are assimilated, optionally by 4D-Var (Benedetti et al., 2009) or the 3-D fields from the previous forecast. Owing to an adverse effect on headline scores during tests with CALIOP-lidar backscatter-profiles (1D-Var), yet no aerosol profiles are assimilated (Benedetti et al., 2009). As described in detail by Granier and et al (2011); EDGAR (2013); Rémy et al. (2019) and documented at the ECMWF website zontal and vertical transport is based on the dynamics of the ECMWF model, complemented by vertical diffusion/convection, sedimentation and dry/wet deposition by large-scale and convective precipitation. Significant upgrades impacting our results are the increase of horizontal resolution from T255 to T511 after 06/2016, the coupling of organic matter emission to CO emissions (Spracklen et al., 2011) as of 02/2017, the increase of vertical resolution from 60 to 137 levels and addition of 170 NO 3 and NH 4 as of cycle46r1 in 07/2019 (cf. Table 3 in Rémy et al. (2019)). Based on the 00 UT analysis, 3-hourly profiles at time steps +3, +6, +9,... are extracted from 5-day forecast runs, such that noticeable adaptations by the analysis/assimilation are possible at 03 UT each. Ceilometer and model profiles/MLH are based on altitude above ground and model geopotential height, respectively.
The vertical displacement between the low-resolved model orography and real terrain height is only 175 relevant for steep stations sticking out far above the model surface level, while over flat terrain this is below 100 m.  Table 3 in Rémy et al. (2019). ASM is like CTR, but additionally uses 4D-Var assimilation. The model provides mass mixing ratios m p,i of 14 particle types as output, which for comparison 180 must be converted to a common physical quantity that is measured by the ceilometers. As such the attenuated backscatter β * (z) according to Eq. (1) is chosen rather than the backscatter coefficient β(z) because it is the primary measured variable without assumptions and the model contains all information to calculate it:

185
Here β(z) and σ(z) are the backscatter and extinction coefficients, respectively. The further procedure is described in detail by Chan et al. (2018) and outlined here for easier tracking of the most important steps. At first the mass mixing ratios of the particle types are converted to mass concen-6 https://doi.org/10.5194/gmd-2020-308 Preprint. Discussion started: 7 October 2020 c Author(s) 2020. CC BY 4.0 License.
trations c p,i by multiplication with the air density air as shown in Eq. (2).
190 Then the particle extinction coefficient σ p,i and the particle backscatter coefficient β p,i of each particle type i have been pre-calculated using appropriate particle size distributions dN (r)/dr and humidity dependent particle refractive indices n as applied in IFS-AER (Chan et al., 2018). For consistency with the current implementation of the aerosols in the IFS model Mie scattering theory has been applied for all particles. Model mass concentrations are then converted to extinction 195 coefficients by means of the specific (mass) extinction coefficient σ * e,i (unit: m 2 /g) Eq.
(3) is applied separately to each size and humidity bin of the humidity dependent and size segregated particle types. For convenience the lidar ratio S p,i is commonly used to calculate particle backscatter coefficients from extinction coefficients.
With this definition, the extinction and backscatter coefficients of each particle type are determined from Eqs. (5, 6).

Ceilometer network
The German Meteorological Agency (DWD) operates a network of about 160 Lufft-CHM15k ceilometers (∼60 in Jan 2016, Figure 1) which provide operational profiles of aerosol attenuated backscatter 220 Pz 2 (Flentje et al., 2010a,b) System stability and output power monitoring allows to track the lidar constant C L and transfer the calibrations to daytime profiles (Böckman et al., 2004;Heese et al., 2010;Wiegner and Geiß, 2012;Wiegner et al., 2014). Only stations with a sufficient density of successful calibrations are consid-235 ered. Attenuated backscatter β * (z) as a function of altitude z is then calculated from the background corrected ceilometer signals P(z) with the calibration constant C L β * (z) = P z 2 C L The C L values are first cleaned for outliers (<>1.5 x 25th-75th percentile of 30-day average), smoothed with a 30-day running mean and finally interpolated to hourly values to be used in Eq.

240
(1). The typical uncertainty of an individual calibration is 15-20 %, while the actual error is smaller due to the temporal smoothing. The accuracy of the retrieved backscatter linearly depends on the accuracy of C L . The monthly variation of C L is usually less than 5% and the annual variation is 10-15 %. Finally, cloud-free attenuated backscatter profiles are averaged within ± 1 h around the corresponding model times. Profiles with precipitation, low clouds or instrument operation flags number, which characterizes the degree of turbulence (Richardson et al., 2013). The vertical stability is estimated using the difference between each level and the lowest level. Several issues with this approach are described by e.g. Engeln and Teixeira (2013), related to the Richardson number being based on ratios of both dynamic and thermodynamic vertical gradients rather than of temperature and/or humidity as such, the use of dry variables in cloudy situations, and the fact that the Richard-260 son number as a measure of local turbulence is often unable to properly characterize the turbulent properties of convective boundary layers. Turbulent kinetic energy, which could better be used, how-9 https://doi.org/10.5194/gmd-2020-308 Preprint. Discussion started: 7 October 2020 c Author(s) 2020. CC BY 4.0 License. ever, is rarely used in global models and as such is not available (Engeln and Teixeira, 2013).
The MLH observations are based on two approaches: visual inspection of individual daily timeheight sections of β * (z) and data provided by the firmware implemented in the CHM15k ceilome-265 ters. The former is elaborate, subjective and requires an experienced analyst of 2-D backscatter sections. The latter is automated and precise but suffers from severe inaccuracies and ambiguities.
In principle, each MLH detection is a pattern recognition problem which is based on the assumption that the vertical distribution of aerosol can be used as a tracer for boundaries. This, however, is not always the case. The absolute value of the backscatter is typically not needed since the rel-270 evant information seems to be completely coded in the gradient (but possibly of different orders) of the backscatter profile (Teschke and Pönitz, 2010) and its temporal development. The CHM15k firmware calculates MLH from the backscatter intensities of the ceilometer range corrected signal (Pz 2 ) by means of a wavelet transform algorithm (Teschke and Pönitz, 2010). Up to 3 layers with quality flags are reported, which however lack specificity, temporal continuity and distinctness. In 275 this respect, Haeffelin et al. (2012) find in their analysis of limitations and capabilities of existing mixing height retrieval techniques "'...no evidence that the first derivative, wavelet transform, and two-dimensional derivative techniques result in different skills to detect one or multiple significant aerosol gradients."'. The inferred MLH are mostly unrealistic in cases with multiple-layers, low clouds/fog, small aerosol gradients, precipitation and long-range transport of e.g. dust, smoke etc.

280
Particularly Saharan dust days often must be excluded as the β * (z) gradient at the ML top disappears. MLH below 400 m a.g. typically cannot be detected by the CHM15k because limited accuracy of the overlap correction causes artifacts around that height. It turns out that the automatic data can hardly be made a reliable reference, even when only subsets of clear cases and comparatively robust metrics like maximum daily mixing layer heights MMLH are used. To this end visual inspection 285 of many individual cases illustrates why algorithms fail with ubiquitous complex scenes and at the same time provides reasonable estimates of MMLH. Also owing to the sometimes thick entrainment zone the uncertainty of visual MMLH detection ranges around ± 100 m for clearly developing ML.
Finally, it must be clearly stated, that the discussion of MMLH in this article is included as it is the most prominent feature in the vertical profile, but it is not intended as a rigorous evaluation. 2.4 In-situ measurements of particle composition and -sizes For comparison of modeled particle composition, in-situ particle measurements are used from the German Global Atmosphere Watch (GAW) global station Hohenpeißenberg HPB (47.8 • N, 11.0 • E, 980 m a.s.l.), . Hohenpeißenberg is a pre-Alpine hill, sticking out 300 m above the surrounding forest/grassland and represents rural central European conditions. Measurements spectively, see e.g. Wagner et al. (2015). The range of concentrations within these altitudes indicates the uncertainty.

Concept of evaluation
Given the complexity of spatio-temporal variations of 14 interacting aerosol types in the IFS-AER 310 model, it is important to breakdown the evaluation to a meaningful subset of metrics and scores and adapt to the information content of the observation data. In this study, the evaluation focuses on the vertical aerosol distribution and the altitude-dependence of the model-observation differences (bias) from about 0.3 to 6 km above ground. Below 0.3 km, the incomplete overlap cannot be corrected with sufficient accuracy. Above 6 km ceilometer data suffer increasingly from low signal/noise ratio 315 and cloud artifacts. To avoid sensitivity of our results to truncated profiles extending vertically over less than 0.6 km or containing clouds and possibly falling precipitation streaks, such profiles are excluded (cf. Section 4.2). In the vertical we distinguish the surface layer SL where the sources of most particles are, the planetary boundary-or mixing-layer ML, and the free troposphere FT, where long-range transport takes place. Model biases at certain altitudes can indicate specific deficiencies 320 in the model, but may also stem from uncertainties in the observation data, from inherent conversions necessary to refer to the same physical quantity (mass mixing ratios versus scattering) or arise from metrics and the methodology itself -this is discussed in Section 4.2.
While there are several options to discuss the agreement of forecast and observed backscatter profiles, we use the following metrics and scores to reveal different aspects of aerosol representation Either moving averages over certain altitude ranges (bias time series) or (e.g monthly) averages 11 https://doi.org/10.5194/gmd-2020-308 Preprint.   (Taylor, 2001). This means that correlation is calculated over altitude ranges rather than 350 periods of time. The ideal agreement or the reference point (observation) is thus located at polar coordinate [1,1]. Noteworthy, the distance from the reference in Taylor

Results
The consistency of IFS-AER forecasts with observations is evaluated using different related aspects: 365 the temporal development of β * (z) and its bias at selected levels, the vertical shapes of β * (z)profiles and their correlation, periodicity, specific particle events and the height of the ML. Then we sort, link and aggregate these criteria for selected height regimes to extract clear measures of 12 https://doi.org/10.5194/gmd-2020-308 Preprint. Discussion started: 7 October 2020 c Author(s) 2020. CC BY 4.0 License.
performance. Longer temporal averages like seasonal cycle are disturbed by five model upgrades within the 4 years period. By considering mean and median values, the skills with and without 370 (peaks of) events are distinguished, the latter representing more background conditions and less the inter-annual variability of (mostly dust) events. Negative and positive biases are denoted as 'lowbias' or 'high-bias', respectively, their absolute amount classified as large or small. The relative data coverage of 3-hourly profiles from all stations remaining for evaluation is 93%, 92%, 89%, 83%, 71%, 46%, 16% at 0.4 km, 1 km, 2 km, 3 km, 4 km, 5 km, and 6 km above ground, respectively.   Table 1.
Bias of β * (z) shows a clearly different behavior near the surface, in the ML and the FT, with 385 upward tendencies toward the surface, low bulges in the ML reaching up to ≈0.5-1 km in winter and 14 https://doi.org/10.5194/gmd-2020-308 Preprint. Discussion started: 7 October 2020 c Author(s) 2020. CC BY 4.0 License.
to ≈1-2 km in summer, and enhanced variation related to irregular long-range transport, mostly of dust, in the FT as shown in Figure 3. Estimated error bars overlayed to the CTR profiles indicate the significance of the biases. The low bias scatter above 6 km are artifacts caused by cloud boundaries not captured by the quality control. Owing to events, the mean bias is on average larger and scatters Over the four years monthly bias profiles have become more variable, the means more than medians and CTR more than ASM ( Figure 3). This may reflect changes to model source strengths (OM and SD 02/17, SO 4 10/17, sea-salt 06/18 -c.f. Table 3  . In-between they were smaller or negative as shown in Figure 2 and Table 3. A clear bias increase with cycle 46r1, observed at 0.4/1 km, corresponds to overestimated NO 3 , NH 4 and OM in the model as discussed with respect to GAW surface data in Section 3.3. summer minimum reaches up to 3 km (MNMB even to 4 km a.g.), while it is variable dominated by Saharan dust events at 5 and 6 km a.g. Table 2   The reported tremendous variability with season, altitude, events and model cycle is condensed to Tables 3 and 2 and summarized as follows: -    A more condensed way than Figure 5 to descriptively visualize performance changes between 530 model versions are Taylor polar plots as displayed in Figure 6 and explained in Section 2.5. Here, the average performance during the six IFS-AER configurations in Table 1 are summarized in terms of correlation, normalized standard deviation and the plotting-distance towards the reference, i.e. the root mean square error (Taylor, 2001). Accordingly, the model system has not systematically evolved towards improved representation of the profile shape, though mean values around r 1dly = 0.7 are 535 already quite good. However, after some variation, finally the overall variance of the profile became nearly realistic on average after the implementation of NO 3 and NH 4 and adaptions to SO 4 , organics and dust in cycle 46r1 in July 2019. The differences between ASM and CTR are small. It should be noted, that individual covariances of modeled and observed profiles vary quite strongly with time and location/station, meaning that many situations cannot be closely captured and even the 540 observations may partly not be representative due to undetected artifacts (clouds, overlap correction, misalignment etc, not removed by the quality control).

Particle composition and -size at surface level
To better understand the differences between modeled and observed backscatter β * (z) profiles, 545 near-surface mass concentrations MC of the prognostic aerosols in IFS-AER, namely PM 10 , sulfate SO 4 , nitrate NO 3 , ammonium NH 4 , black carbon BC, organics OM as well as qualitative proxys for 'sea-salt' SS and 'mineral dust' MD are compared to surface in-situ observations. All particle concentrations are modeled in IFS-AER and measured in situ in dry state without hygroscopic water uptake. PM 10 is calculated from the model mass mixing ratios mmr according to the formula used in    Table 1. Note the different y-ranges.
As shown in Figure 7, the dry surface mass concentration PM 10 for ASM (10.3 µg/m 3 ) and CTR (7.9 µg/m 3 ) roughly corresponds to HPB data (7.9 µg/m 3 ). The assimilation seems to bias surface 565 concentrations a bit high. Species are detailed in Table 4. PM 10 approaches HPB data after the increase of OM with cycle 43r1 (02/2017), though this was partly compensated by a parallel decrease of SO 4 , it is however overestimated as of cycle 46r1 after July 2019 due to introduction of NO 3 and NH 4 , which are simulated roughly 3 µg/m 3 (∼ 300%) and 0.3 µg/m 3 (∼ 60%) too high at HPB, respectively. Further changes with cycle 43r3 (10/2017) synchronize the phase but exaggerate SS has already been discussed in Chan et al. (2018). To this, the above mentioned approximation of SS via Cl has negligible impact. The observed-dust proxy contributes only 4-6% to the annual average mass at HPB . The seasonality is reproduced, but mean summer con-585 tributions around 10 µg/m 3 would require much more events than observed and simulated, which confirms that dust concentrations are overestimated not only near the surface but also in the higher ML and the FT as noted in Section 3.1. The assimilation correction to dust MC of few µg/m 3 is too small. These results are not affected by mass-to-backscatter conversion nor humidity and, due to averaging over the lowest 300 m a.g., are not sensitive to the model level selected to represent surface 590 concentrations at HPB. The regional representativeness is limited to rural central Europe (Putaud et al., 2010) where comparatively small concentrations prevail as discussed in Section 4.

Long-range transport
The DWD ceilometer network follows the 3-D dispersion of optically efficient particles like dust or smoke and is therefore particularly suitable to verify the timelyness of long-range aerosol transport

Mineral dust 605
The previous sections showed that Saharan dust loads over Germany are over-estimated at the surface and throughout the profile. The realistic seasonality (Figure 7) and the reasonable correlation while standard deviation (coding the amplitude of β * (z) ) is higher. The first indicates spatiotemporal or vertical shifts of layers/plumes, the latter reflects overestimation of dust concentrations but is not directly scaled to the SD bias due to the large influence of the ML on the profile. These

Cloud formation due to SD
Though Saharan dust transport is realistic in IFS-AER on spatio-temporal scales >100 km and > 1/2 635 day, the dust load is mostly overestimated. Occasionally however, β * (z) of dust plumes is apparently underestimated because dust particles rapidly grow by water uptake and observed β * (z) changes though the dust mass concentration itself remains constant. Though the ability of (coated) mineral dust to foster cloud formation is well known, its simulation still is a challenge (Sassen et al., 2003;Ansmann et al., 2005;Bangert et al., 2012). At the time of incipient cloud formation this layer still was clearly separated from the Saharan dust layer below and thus could not influence this process.

Fractions skill score
The penalizing of slightly vertically displaced aerosol layers yielding a low or even anti-correlation   Table 3. The larger overestimation of β * (z) associated with higher sea salt relative contributions is in CH18 partly (∼ 10% of total β * (z) ) attributed to the utilized hygroscopic growth scheme OPAC (Hess et al., 1998), and is not elaborated further in this study. Sea salt over continental Europe remains considerably overestimated (c.f Section 3.3) in all seasons as changes to the sea salt emission scheme, e.g. coming in with cycle45r1 710 (06/2018) still primarily aim at decreasing the global low-bias of sea salt abundance dominated by oceans. Concurrent substantial increases of sea-salt particle sizes and sinks (wet deposition) likely reduce sea salt mass concentrations further inland, apparent as step at HPB in Figure 7, but are either not efficient enough or still not the governing processes. As before, underestimated near-surface β * (z) are presumably linked to unresolved local (Chan et al., 2018) or regional scale (e.g. Jan/Feb Performance changes of IFS-AER with height arise from different governing processes. As most sources enter into the lowest model level they have the largest effect to the lowest part of the profile.

735
In the near-surface layer, the observed high-bias of NO 3 mass concentrations results from too efficient gas-to-particle partitioning, i.e. fine-mode NO 3 production from HNO 3 -neutralisation by NH 3 , followed by temperature dependent dissociation to NO 3 and NH 4 . Secondly, remaining HNO 3 may heterogeneously produce coarse NO 3 on SS or dust particles (Rémy et al., 2019), but this process is of minor relevance in central Europe where fine-mode nitrate has roughly 5 times larger mass con-740 centration than coarse-mode nitrate. NH 4 is simulated at comparable concentrations as observed at HPB while NO 3 is about four times as high. For fine-mode NO 3 the most efficient sink near the surface, probably underestimated, is dry deposition (Zhang et al., 2012), while sedimentation of small particles should be slow and is disabled in the model. Below-cloud wet deposition (washout) should affect the whole profile rather than only the surface where the high-bias tendency toward the ground concentrations in ASM. Emission inventories thus seem to capture the decrease of anthropogenic emissions during the last decade but as for SO 4 the assimilation seems to add too much mass and may disturb the realistic partitioning between anthropogenic and biogenic OM.

760
Mixing layer: Against overestimation of mass concentrations and β * (z) near the surface, the aerosol load in the ML tends to be low-biased. This may stem from delayed vertical transport from the surface. Several more reasons may contribute to the dominance of low-biases of β * (z) in the ML. The forward operator, including mass-to-volume conversion, presently uses particle densities of the pure materials, not taking into account possible porosity of dry atmospheric particles enclosing 765 air due to coagulation and variable internal mixing (Winkler et al., 1981). If the model assumed larger bulk densities than particles actually have, the equivalent volumes were calculated too small and optical properties would be underestimated, because they depend strongly on the particle size.
The density of accumulation mode particles, composed of hydrophilic and hydrophobic materials could be overestimated by up to a factor 1.5, which transfers to a factor 1.3 in the optically relevant 770 surface area. Secondly, the ML top is too smooth which means the capping transport barrier at the ML top seems less effective in the model, diluting higher ML concentrations with cleaner FT air.
Geometrically, however, the ML height on average seems reasonable (cf. Section 4.1).
The monthly mean bsc profiles suggest that aerosol mass, added to the whole column by the assimilation (runs 0001), results in overall higher aerosol load than in the control runs till 07/2019, but the 775 assimilation does not improve the step at the top of the PBL to lower values in the free troposphere (FT) as shown by the ceilometer profiles. Note that daily/monthly averaging considerably smooths the ML top due to daily/monthly variations of PBL height. Likewise, the FT background might be slightly high-biased due to adding an assimilated portion there. This aerosol mass would be missing in the ML, yielding a too low amplitude (coded in the standard deviation) of the model compared to 780 observations (reference) in the Taylor plots, too.
Free troposphere: Saharan dust, as an irregular constituent is (as in CH18) found to be overestimated over Germany by typically a factor 2 or more all the time (except when water condenses to the dust). CH18 calculated, that accounting for non-spherical particles, using conversion coeffi-785 cients based on T-Matrix calculations rather than Mie-Theory, would reduce β * (z) by 15-45%. This reduction arises from the modification of the phase function by non-sphericity, coded in the lidar ratio LR, not the specific extinction, and thus does not transfer to AOD. In order to reduce the dust high bias in the model, the dust source size distribution after cycle 43r1 was modified to distribute less mass into the fine (8 to 5%) and more mass into the super-coarse bin (61 to 83%) which has a 790 shorter lifetime due to faster sedimentation (Rémy et al., 2019). An according dust reduction can however not be seen over Germany in the mass concentrations of the dust-proxy in Figure 7, which is independent from uncertainties in the the mass-to-β * (z) conversion. That the observed inflation of the β * (z) signal on 17 Oct 2017 (Section 3.4.2) marks the onset of cloud formation, seems plausible. 37 https://doi.org/10.5194/gmd-2020-308 Preprint. Discussion started: 7 October 2020 c Author(s) 2020. CC BY 4.0 License.
On the one hand the temporary factor 10 increase of β * (z) (∼ 1 x 10 −6 → ∼ 1.2 x 10 −5 ) roughly 795 corresponds to a significant visual range reduction by an order of magnitude (to ∼1 km assuming a lidar ratio near 30 sr). On the other hand a mature water cloud would block the lidar beam and any signal from above, which may have been observed at some stations further north (not shown).
Alternatively the passage of a shallow plume with ten times higher concentration would have been diagnosed which would be rather untypical. Unfortunately satellite imagery provides only blurred 800 pictures due to optically dense smoke layers above. In all, the unambiguous identification of particle induced cloud formation is a challenge in the observations as well as it is for a model to simulate hygroscopic growth near saturation.
The  Figure 11). The dynamic reason for the tilted smoke curtain may be ignored for our purpose, but the comparison confirms the expected behavior found for previous fire cases that IFS-AER forecasts are capable to reproduce many details of smoke plumes qualitatively, but that the simulated shape and position become more uncertain when smaller scales develop (Kaiser et al.,815 2012). It shows that source strengths, injection heights and long-range transport in the model are quite realistic. But it must be noted, that β * (z) of the smoke plume is considerably underestimated.
This may be due to the model resolution, but also to emission strengths and heights that are inferred from fire radiative power measurements onboard satellites and converted into convective updraft.
The smoke-mass is calculated from the thermal energy via land-cover-dependent conversion factors, 820 which can represent local characteristics only to a certain degree (Kaiser et al., 2012).

Mixing layer height
The mixing layer height MLH characterizes the ML in many respects, as it can be closely related to important variables like water vapor, cloud cover, heat fluxes and vertical transport as well as PM 2.5 825 and contaminant dispersion (Engeln and Teixeira, 2013;Li et al., 2020). It is however challenging to infer operationally as described in Section 2.3 and by Haeffelin et al. (2012). The physical correspondence between observed aerosol gradients and turbulence (Ri > 0.25) requires meteorological conditions without vertical shear (generating vertical aerosol gradients), fog, clouds, precipitation or aerosol plumes (often dust), which after rigorous filtering leaves only about 5-10% of the days. SD 10.5 • E) the ML top may not follow the steep terrain and local MLG above ground will be too low.
Thus this short study shows that rigid checks of station characteristics, data quality and outliers are necessary before operational MLH data from ceilometers (or lidars) may be used to constrain and evaluate models.

Uncertainties and limitations
The overall uncertainty of our results is mainly limited by the conversion of model mass mixing ratios to β * (z) , the uncertainty of the ceilometer observations, and the re-sampling over different horizontal and vertical resolutions. The former includes estimates of particle shape, densities, mixing-and hygroscopic state as well as meteorological conversions as described in Chan et al. be kept in mind that the relatively coarser global fields of the CAMS system are intended to serve as boundary conditions for nested regional models which refine the aerosol distributions down to scales of few km.

875
As there was no β * (z) output available from IFS-AER before cycle45r1 (10/2017), we use for consistency over the whole period the same lidar forward operator to calculate β * (z) from the model mass mixing ratios as described in Chan et al. (2018). Minor modifications were necessary to integrate the higher resolution and additional species (NO 3 , NH 4 ) as of 07/2019. It uses their precalculated look-up table, slightly adapted to IFS-AER values and modified to additionally handle 880 NO 3 and NH 4 (cf. Table A1, A2, A3). Since 10/2017 lidar output is available from the IFS archive.
Results from both emulators compare well for dust, but for other components like sea salt somewhat different β * (z) profiles are calculated. Possible reasons may be the handling of hygroscopic growth near saturation, the disregard of (however small) absorption by trace gases at 1064 nm and a different effective resolution of the model fields resulting from both lidar emulators. A direct comparison 885 of the β * (z) (from ground) product retrieved from the IFS and that calculated from the model mass mixing ratios according to (Chan et al., 2018) reveals that the IFS β * (z) product is provided with a different effective resolution then the mmr fields used here. The gradients thus appearing at the boundaries of aerosol structures cause deviating results depending on the specific time and location of the comparison. For longer averages as mostly discussed in this article, these differences largely 890 cancel out.

Summary
The assessment of IFS-AER vertical aerosol distributions with calibrated ceilometer profiles over Germany (central Europe) generally confirms the realistic reproduction of the vertical aerosol vari-895 42 https://doi.org/10.5194/gmd-2020-308 Preprint. Discussion started: 7 October 2020 c Author(s) 2020. CC BY 4.0 License. ability in terms of attenuated backscatter β * (z) . The shape of the profile, dominated by the mixing layer ML and occasionally by long-range transport particles is largely captured, as indicated by high co-variance of daily average profiles with Pearson's r ≈ 0.6 − 0.95, however no clear impact of the assimilation is found. In summer the agreement of profile shapes is less due to small vertical shifts or untimely long-range transport to which r is quite sensible. A systematic higher/lower bias 900 regularity is found in the lower part of the profile, i.e. by trend higher bias near the ground to lower bias (underestimation) in the middle ML aerosol load. It is attributed to over-estimated sources at the surface, likely in combination with too slow vertical transport and a probably too weak transport barrier at the top of the ML, where the large aerosol gradient is not fully captured. The low aerosol background in the free troposphere FT is usually reproduced. Also captured are plumes and layers 905 from long-range transport of Saharan dust and fire smoke, although β * (z) of dust is overestimated over Germany by a factor 2 or more, and small scale structures evolving during the dispersion of these layers can not be resolved at the present model resolution.
Comparison to dry-state aerosol in-situ observations suggest that SO 4 and OM sources as well as gas-to-particle partitioning of the NO 3 -NH 4 -system are too strong, while black carbon load and 910 trend is realistic near the surface. With respect to the discussed metrics, no consistent development is evident due to the five model upgrades during the evaluated period. The vertically integrated β * (z) , which codes similar information like AOD, consistently with these previous findings shows a bias near zero for 43r1 (till 05/2016) and 46r1 (after 07/19) and slightly negative in-between. The modified normalized mean bias MNMB which is less dependent on absolute values reveals lower 915 values in the more relevant (for air-quality) surface-and mixing layer and a general increase toward higher levels. Over the whole period, the bias of β * (z) exhibits seasonal cycles at the lower levels due to overestimation of SO 4 and OM sources/lifetimes in summer and under-representation of severe pollution episodes in winter.
Finally, we demonstrated that ceilometer networks offer several options to check the realism of mix-920 ing layer heights in atmospheric numerical models. Though we confined to manual analysis of a representative region, we could provide confidence that the annual cycle and the maximum daily height of the ML can be reproduced within few 100 m vertically by the IFS-AER model.
In future, the regional extension of this assessment to larger parts of Europe and the combination of ceilometer networks' spatio-temporal coverage with the higher accuracy and particle identification 925 capability of sun-photometers (AERONET) and multi-wavelength-depolarisation (Raman-) lidars will significantly reduce the uncertainties remaining in this study. Corresponding CAMS activities, also for evaluation of the particle composition using European network data have already started.
A robust discussion of boundary layer heights will benefit more from further improvements to the algorithms than from improved profile data quality. Author contribution IM analysed the ceilometer data. SR and ZK are involved in IFS-AER development, provided model information and interpretations and aided in the retrieval of IFS-AER data.

940
WT organized the DWD ceilometer network and contributed to data transmission. HF performed the evaluation and prepared the manuscript with contributions from all co-authors.

Appendix C Monthly mean profiles
In order to illustrate the shapes of the actual vertical β * (z) profiles from the model (runs with, ASM, and without assimilation, CTR) and the ceilometers, the 48 individual monthly average profiles are given in Figures A2, A3, A4 and A5.