Interactive comment on “Meteorological and trace gas factors affecting the number concentration of atmospheric Aitken (Dp = 50 nm) particles in the continental boundary layer: parameterization using a multivariate mixed effects model”

Abstract. Measurements of aerosol size distribution and different gas and meteorological parameters, made at three polluted sites in Central and Southern Europe (Po Valley, Italy; Melpitz and Hohenpeissenberg, Germany), were analysed to examine which meteorological and trace gas variables affect the number concentration of Aitken (Dp = 50 nm) particles. The aim of our study was to predict the number concentration of 50 nm particles by a combination of in-situ meteorological and gas phase parameters. The statistical model needs to describe, amongst others, the factors affecting the growth of newly formed aerosol particles (below 10 nm) to 50 nm size, but also sources of direct particle emissions in that size range. As the analysis method we used a multivariate nonlinear mixed effects model. Hourly averages of gas and meteorological parameters measured at the stations were used as predictor variables; the best predictive model was attained with a combination of relative humidity, new particle formation event probability, temperature, condensation sink and concentrations of SO2, NO2 and ozone. The seasonal variation was also taken into account in the mixed model structure. Model simulations with the Global Model of Aerosol Processes (GLOMAP) indicate that the parameterization can be used as a part of a larger atmospheric model to predict the concentration of climatically active particles. As an additional benefit, the introduced model framework is, in theory, applicable to any kind of measured aerosol parameter.


Introduction
It is commonly known that atmospheric aerosols have a great effect on the radiation budget, formation of clouds, climate change and human health (IPCC, 2007; Kerminen et al., 2005; Pope and Dockery, 2006). Acquiring a quantitative understanding of the Cloud Condensation Nucleus (CCN) production from different natural and anthropogenic sources is one of the key topics in aerosol research (Wiedensohler et al., 2009). Theoretical frameworks have been derived to investigate the efficacy by which nucleated particles produce CCN in the atmosphere (e.g. Pierce and Adams, 2007; Kuang et al., 2008) and recent atmospheric measurements and modelling studies have shown that newly formed particles can affect concentrations of CCN (Ghan et al., 2001; Lihavainen et al., 2003; Kerminen et al., 2005). Laaksonen et al. (2005) suggested that new particle formation (NPF) might be an important source of CCN even in polluted environments with strong primary particle emissions; the growth from nucleation to CCN size may take only a few hours if enough condensable vapours are available.
Published by Copernicus Publications on behalf of the European Geosciences Union.
In order to obtain better predictions with global or regional climate models, an adequate description of aerosol dynamics is needed. In global scale atmospheric models, modelling aerosol processes is necessarily a compromise between the accurate description of microphysical processes on the one hand and computational efficiency, which requires simplified parameterisations of these processes, on the other.
Simplified parameterisations of the size distribution evolution as well as the chemical composition of aerosol populations could help to reduce computational time efficiently (Dusek et al., 2006; Kokkola et al., 2009). Typically the aerosol particle size distribution is modelled using either a modal approach (e.g. Whitby and McMurry, 1997) or a sectional approach employing relatively few size bins. These methods are not very well suited for predicting new particle formation and growth to CCN sizes (Korhonen et al., 2003), and methods that can directly predict particle concentrations at climate relevant sizes would be of great advantage. Aerosols have two principal impacts on climate: particles larger than 100 nm reflect and absorb radiation (direct effect) and particles of sizes starting from 50 nm can act as CCN (indirect effect). Particles smaller than 50 nm do not have a significant effect on climate; a detailed description of their dynamics in large scale climate models would be a waste of computational resources. In addition, there has been no uniform theoretical description of the atmospheric particle nucleation and growth process, so a parameterisation of the concentrations of 50 nm particles could circumvent some of the uncertainties associated with that lack of knowledge. A prime motivation for the current study was to derive a statistical relationship between meteorological, trace gas, and aerosol properties and 50 nm particle concentrations. The resulting parameterisation could then be used in large scale models so that no computing resources go into aerosol dynamics below 50 nm. We do not suggest that this population of 50 nm particles is itself climatically important: this depends on factors such as temperature and water vapour supersaturation as well as aerosol physico-chemical properties. These factors, combined with aerosol dynamical processes such as coagulation and evaporation/condensation, ultimately determine the climatic impact of this population and are expected to be calculated in the model. Thus, by incorporating our parameterization, modelers have an opportunity to lower the computational cost of their calculation without sacrificing accuracy; environmental conditions and aerosol chemical properties from the model will ultimately determine the direct and indirect impacts of these particles.
The use of advanced statistical tools has been rare in the analysis of the formation and growth of atmospheric aerosols. Kulmala et al. (2004) studied several physical and chemical properties affecting particle growth, and statistical properties of NPF events have been studied in Hyvönen et al. (2005) and in Mikkonen et al. (2006), but factors affecting the growth of newly formed particles have not undergone careful statistical studies, due in large part to the fact that the exact processes by which this growth occurs are poorly understood. Since simple yet comprehensive mechanisms are currently not available for these processes, statistical analyses such as ours can be used to develop effective representations of the causalities and interdependencies of the gases and particles in the atmosphere.
The problem in using general data analysis methods is that the measurement data is not normally distributed and typically contains different autocorrelation structures. In this manuscript, we describe an advanced statistical data analysis method that takes into account the structure of the data and uses it to find predictors and indicators for the number concentration of particles at a selected size. The study was carried out in an explorative manner, i.e., we did not make limiting preconceived assumptions about which variables should be included in the analysis but used all the variables measured at our three measurement sites. This ensures that no significant variables were left out of the analysis because of such assumptions. The method presented here is applicable to datasets from other sites.
The main objectives of the study are to find the factors affecting the growth to, and primary production of, particles that can be considered the minimum potential CCN size and to find a parameterization which can be used as a part of a larger atmospheric model to predict the concentration of climatically active particles. In this article we present a multiple-site, multi-annual statistical analysis that led to the identification of important parameters affecting the size distribution function at 50 nm, dN/d logDp|50 nm. Following this we present the optimized parameterization that incorporates the observations from all three focus sites. The results of the multivariate mixed model are compared to data from these sites to quantify the predictive ability of the model. In a final step, we investigate the usefulness of the statistical model by incorporating the developed statistical parameterization into the Global Model of Aerosol Processes (GLOMAP).

Atmospheric particle data
The present paper concentrates on the description of Aitken particles in the troposphere of Central Europe and Northern Italy. As a measure for the concentration of Aitken mode particles we use the particle number density of these distributions at Dp = 50 nm, dN/d logDp|50 nm, denoted N50 hereafter. Particle number size distributions were collected at the measurement sites of San Pietro Capofiume (SPC; Po Valley, Italy), Melpitz (East Germany), and Hohenpeissenberg (South Germany) using Differential Mobility Particle Sizer (DMPS) instruments. DMPS systems were set to measure the dry mobility diameter of the particles. The mentioned regions are characterised by a population density of 50-200 inhabitants km⁻², and feature substantial anthropogenic gaseous and particulate emissions from diffuse sources such as industry, domestic heating and traffic.
In the continental troposphere, Aitken particles often occur as a distinct particle mode, with mean diameters between 45 and 90 nm (Birmili et al., 2001). Aitken mode particles are, in the most general picture, a mixture between aged secondary particles originating from gas-to-particle conversion, or new particle formation (NPF), and "primary" particles emitted directly from anthropogenic sources. NPF, which is ultimately the result of photochemical processes, has been observed at all three sites under study (Jaatinen et al., 2009; Paasonen et al., 2009), and is expected to influence the concentrations of Aitken particles via particle growth induced by condensation and coagulation.
A detailed principal component analysis of particle number size distributions in the Leipzig/Melpitz area yielded at least four statistically independent sources of Aitken particles at the rural observation site Melpitz, including aged nucleation mode particles from regional-scale secondary formation, two types of particles originating from anthropogenic, mainly urban particle emissions, and a fourth (though less significant) type deriving from long-range transport (Costabile et al., 2009).
A one-year characterisation of the hygroscopic properties of atmospheric particles at Melpitz confirmed the presence of different particle types at Dp = 50 nm, exhibited by their different hygroscopic properties (Kinder, 2010). Three hygroscopicity classes were identified: hydrophobic particles, associated with fresh direct anthropogenic emissions, accounted for 7-35% of the particle number at 50 nm, depending on season and the large-scale weather situation. Less hygroscopic and more hygroscopic particles, which are broadly associated with secondary particles at different stages of the atmospheric ageing process, accounted for 12-54% and 26-68% of the particle number at 50 nm, respectively. In summary, it is necessary to consider tropospheric Aitken particle (Dp = 50 nm) concentrations as being influenced by a variety of source processes.
We classified NPF events into three classes using the visual methods described in Hamed et al. (2007). A day is considered an event day if the formation of new aerosol particles starts in the nucleation mode size range and the mode is observed over a period of several hours showing signs of growth. If no NPF is observed, the day is classified as a non-event day (NE). A large number of days did not fulfil the criteria to be classified as either a clear event or an NE day and they are considered unclassified days (UC). In each data set, there are periods of missing data, as well as periods of low quality data, i.e., one or more variables did not have a measured value. To avoid biasing our results, the number of observations used in the analysis was decreased so that slightly more than half of the observations had measured values for all of the variables used in the final statistical model.
SPC is a low-land observation site in the Po Valley (North Italy). The particle size distribution measurements were made between 24 March 2002 and 30 April 2005. The DMPS system was operational on 814 days during the time period, which included 293 NPF event days, 270 non-event days and 251 days that could not be classified. A description of the data set and the analyses performed is given in Hamed et al. (2007).
Melpitz is a low-land (86 m a.s.l.) research station in Eastern Germany, surrounded by flat terrain consisting mainly of agricultural soil, pastures, and forests. Atmospheric particle size distributions and NPF have been analysed at Melpitz since 1996 (Birmili and Wiedensohler, 2000; Engler et al., 2007). The particle size distribution measurements used in this paper lasted from 1 July 2003 to 30 June 2006. This particular time period included 270 event days, 414 non-event days and 130 unclassified days (Hamed et al., 2010).
Hohenpeissenberg is a mid-level mountain site (980 m a.s.l.) in Southern Germany, located about 30 km north of the Alpine mountain ridge. In Hohenpeissenberg the DMPS measurements lasted from 1 April 1998 to 3 August 2000, which included 85 event days, 220 non-event days and 40 unclassified days. For details of these measurements and a previous analysis of the connection between NPF and gaseous sulphuric acid, see Birmili et al. (2003).

Data selection and pre-processing
The aim of our study was to find the factors affecting the growth to, and primary production of, particles that can be considered the minimum potential CCN size (here chosen to be 50 nm in diameter). For this purpose, several different statistical models were tested in their ability to predict N50 as a function of various factors. Most of our statistical models included combinations of hourly averages of gas and meteorological parameters measured at all of the three stations, including temperature, relative humidity, radiation, O3, SO2, NOx, condensation sink, wind speed and direction, and many other parameters which were either not present in all datasets or did not have any relevance to the fine particle concentrations. We also used the probability that the day is a non-event day (PrNE), i.e., a day when NPF is not observed (Hamed et al., 2007). PrNE was calculated with discriminant analysis according to Mikkonen et al. (2006). The calculated non-event probability was used instead of the observed event classification because otherwise we would have had to exclude the unclassified days, which would have reduced the data drastically. In addition, the probability of a non-event day can also be estimated for those days where the visual event classification has not been made at all, which enables the use of the statistical model for predictive purposes.
www.geosci-model-dev.net/4/1/2011/ Geosci. Model Dev., 4, 1-13, 2011
Since the condensation sink (CS) is computed from the size distribution of the particles by the method described by Pirjola et al. (1998) and Kulmala et al. (2001), there is a risk of circular argumentation when using it in our statistical model. Even if the contribution of the smallest particles to the total value of CS is small, it may still bias the estimation if we first use the number of small particles to calculate the CS and then use the CS to predict the growth of the same particles. That is why we used only the number of particles larger than 50 nm in the calculation of the condensation sink, acknowledging that the contribution of sub-50 nm particles to CS amounts to a few percent.
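The restriction of the condensation sink to particles larger than 50 nm can be illustrated with a minimal sketch. It assumes the commonly used form CS = 2πD Σ β(d) d N with the Fuchs-Sutugin transition-regime correction; the diffusivity, mean free path, accommodation coefficient and the three-bin size distribution below are illustrative assumptions, not values from this study.

```python
import math

def fuchs_sutugin(diameter_m, mean_free_path_m=1.0e-7, accommodation=1.0):
    """Transition-regime correction factor for a particle of the given diameter.

    mean_free_path_m and accommodation are assumed, illustrative values.
    """
    kn = 2.0 * mean_free_path_m / diameter_m  # Knudsen number
    a = 4.0 / (3.0 * accommodation)
    return (1.0 + kn) / (1.0 + (a + 0.377) * kn + a * kn * kn)

def condensation_sink(diameters_m, number_conc_m3, min_diameter_m=0.0,
                      diffusivity_m2_s=1.0e-5):
    """CS = 2*pi*D * sum(beta_i * d_i * N_i) over bins with d_i > min_diameter_m.

    diffusivity_m2_s is an assumed vapour diffusion coefficient.
    """
    total = 0.0
    for d, n in zip(diameters_m, number_conc_m3):
        if d > min_diameter_m:
            total += fuchs_sutugin(d) * d * n
    return 2.0 * math.pi * diffusivity_m2_s * total

# Hypothetical size distribution: bin diameters (m) and concentrations (m^-3).
diameters = [20e-9, 80e-9, 200e-9]
numbers = [5.0e9, 2.0e9, 5.0e8]

cs_all = condensation_sink(diameters, numbers)
cs_above_50 = condensation_sink(diameters, numbers, min_diameter_m=50e-9)
```

Comparing `cs_above_50` with `cs_all` for a measured distribution would show how much of the sink is carried by the sub-50 nm bins that are left out to avoid circularity.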

Computing event probabilities with discriminant analysis
Probabilities for event and non-event days were computed with the discriminant analysis method described in Mikkonen et al. (2006). Discriminant analysis is a multivariate statistical analysis method which is commonly used to classify observations into different groups. When the distribution in each group cannot be assumed multivariate normal, which is the case with atmospheric aerosol measurements, non-parametric discriminant methods must be used. Non-parametric methods are based on group-specific probability densities and they are used to produce a classification criterion based on those probabilities. In addition, the non-parametric method is more robust to multicollinearity (i.e. when some variables measure partly the same effect), which might occur in the analysis of atmospheric data.
The best classification result in SPC was reached with the combination of daily averages of relative humidity, ozone concentration and global radiation (Mikkonen et al., 2006) and these variables were used in the computation of the event probability.A similar analysis was made also for the two other datasets and the predictors were slightly site dependent.
The best predictor models for each site are shown in Table 1 with prediction errors. Finally, we combined all the data from different sites to find the best discriminators for the NPF events for the full dataset. We found that the best predictor sets for the individual sites are subsets of the best predictors of the combined data, i.e., at specific sites it is possible to get an equally good classification with fewer parameters.
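A generic kernel-based classifier illustrates how a non-parametric discriminant method turns group-specific probability densities into a non-event probability. This is a minimal sketch and not necessarily the exact procedure of Mikkonen et al. (2006): the Gaussian product kernel, the bandwidth, the priors proportional to group size, and the standardised predictor values (daily RH, ozone, global radiation) are all assumptions made for illustration.

```python
import math

def kernel_density(point, samples, bandwidth):
    """Average of Gaussian product kernels centred on each training sample."""
    dim = len(point)
    norm = (bandwidth * math.sqrt(2.0 * math.pi)) ** dim
    total = 0.0
    for s in samples:
        sq_dist = sum((p - q) ** 2 for p, q in zip(point, s))
        total += math.exp(-0.5 * sq_dist / bandwidth ** 2)
    return total / (len(samples) * norm)

def non_event_probability(point, event_days, non_event_days, bandwidth=1.0):
    """Posterior probability of the non-event group via Bayes' rule.

    Priors are taken proportional to the group sizes (an assumption).
    """
    w_event = len(event_days) * kernel_density(point, event_days, bandwidth)
    w_non_event = len(non_event_days) * kernel_density(point, non_event_days, bandwidth)
    return w_non_event / (w_event + w_non_event)

# Hypothetical standardised daily means: (RH, ozone, global radiation).
event_days = [(-1.0, 0.5, 1.0), (-0.8, 0.3, 1.2), (-1.2, 0.8, 0.9)]
non_event_days = [(1.0, -0.5, -1.0), (0.9, -0.2, -0.8), (1.1, -0.6, -1.2)]

pr_ne_humid_day = non_event_probability((1.0, -0.4, -1.0), event_days, non_event_days)
pr_ne_sunny_day = non_event_probability((-1.0, 0.5, 1.0), event_days, non_event_days)
```

Because the output is a probability rather than a hard class label, it can also be computed for days that were never visually classified.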

Predicting the size-distribution
Because of the complexity of processes affecting the concentration of small particles in the atmosphere, we chose to use generalized linear models in the analysis. The basic form of a generalized linear model is given by

g(y) = Xβ + ε. (1)

On the left side of Eq. (1), y is the vector of measurements of the studied variable (in our case N50) and g(·) refers to the so-called link function, which relates the linear predictors (e.g., measured temperature or SO2 concentration) to the expected value of y (McCullagh and Nelder, 1989). In our analysis, the natural logarithm turned out to be a suitable link function, since the density function of the measured particle concentration (dN/d logDp) followed the Gamma distribution and the natural logarithm of such data is approximately normally distributed. Using a model based on the Gamma distribution was also an option but we chose a log-linear model since estimating and interpreting it is somewhat easier. On the right side of Eq. (1), Xβ is the fixed part of the model (as in the case of standard linear models) so that X denotes the (n × p) observation matrix (e.g., measured temperature or SO2 concentration) and β denotes the unknown (p × 1) vector of fixed intercept and slope effects of the model. The remaining term, ε, is the vector of the residuals of the model. However, the basic form of the generalized linear model, Eq. (1), is ill-suited for aerosol measurement data in which standard independence and homogeneity assumptions are not met. Therefore, we chose to use a mixed model structure in which a random component (denoted Zu) is added to Eq. (1). The main idea of a mixed model is to estimate not only the mean of the measured response variable y, but also the variance-covariance structure of the data. Modelling the (co)variances of the variables reduces the bias of the estimates and prevents autocorrelation of the residuals.
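The log-linear form of Eq. (1) can be illustrated with a single predictor. The sketch below fits ln(y) = b0 + b1·x by ordinary least squares; the data are synthetic and noise-free, and the function name is hypothetical. It covers only the fixed part Xβ and ignores the random effects of the mixed model.

```python
import math

def fit_log_linear(x, y):
    """Ordinary least squares of ln(y) on x; returns (intercept, slope)."""
    ln_y = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_ln_y = sum(ln_y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (li - mean_ln_y) for xi, li in zip(x, ln_y))
    slope = sxy / sxx
    return mean_ln_y - slope * mean_x, slope

# Synthetic example: concentrations generated from ln(y) = 1.0 + 0.5 * x.
x_values = [0.0, 1.0, 2.0, 3.0]
y_values = [math.exp(1.0 + 0.5 * xi) for xi in x_values]

intercept, slope = fit_log_linear(x_values, y_values)
predicted = [math.exp(intercept + slope * xi) for xi in x_values]
```

On the log scale a slope acts multiplicatively on the concentration: a unit increase in the predictor changes the predicted value by a factor of exp(slope).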
Using matrix notation, a linear mixed model can be written as follows (McCulloch and Searle, 2001):

y = Xβ + Zu + ε. (2)

Here Zu + ε is the random part of the model. u is a (q × 1) vector of random effects with a q-dimensional normal distribution with zero expectation and (q × q) covariance matrix denoted by G. Note that the structure of the covariance matrix G is not defined in advance. Z, in turn, is the (n × q) design matrix of the random effects vector u. With adequate choices of the matrix Z, different covariance structures Cov(u) and Cov(ε) can be defined and fitted. Successful modelling of variances and covariances of the observations provides valid statistical inference for the fixed effects β of the mixed model. In contrast to general linear models, the error terms ε can be correlated and the covariance matrix of the residuals is denoted by R. From this it follows that the distribution of observations can be postulated as a normal distribution with the expectation Xβ and covariance matrix V, which is given by V = ZGZ^T + R, where Z^T is the transpose of Z.
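A small numerical example shows how the random part of the model shapes the covariance of the observations through V = ZGZ^T + R. The sketch assumes four observations grouped into two months with a random intercept per month; the variance values are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def marginal_covariance(z, g, r):
    """V = Z G Z^T + R for design matrix Z, random-effect covariance G, residual covariance R."""
    zgzt = matmul(matmul(z, g), transpose(z))
    return [[zgzt[i][j] + r[i][j] for j in range(len(r))] for i in range(len(r))]

# Observations 1-2 belong to month A, observations 3-4 to month B.
Z = [[1, 0], [1, 0], [0, 1], [0, 1]]
G = [[0.5, 0.0], [0.0, 0.5]]          # assumed variance of the month intercepts
R = [[0.2 if i == j else 0.0 for j in range(4)] for i in range(4)]  # assumed residual variance

V = marginal_covariance(Z, G, R)
```

In the resulting V, observations from the same month share the covariance contributed by G, while observations from different months are uncorrelated; this is the kind of structure that an ordinary regression with independent errors cannot represent.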
One of the greatest advantages of a multivariate model is that when all parameters are in the same model, the interpretation of the estimates of single parameters is easy and the results are more valid than in single variable analysis. For example, bias caused by yearly variation is removed from the other variables in the model. We made the analysis in a stepwise manner, i.e., parameters were added to or removed from the model according to their statistical significance and the total increase in the explanatory capability of the model with all the other parameters included.
The final model (Eq. 3), used in all datasets, gives ln(dN/d logDp|50 nm) as a sum of fixed and random effects, where β0 is the fixed intercept term, βi are the fixed slopes, um are month-specific random intercepts and v1 to v4 are the random month-specific (m) or hour-specific (h) slopes. v5 is a location-specific random effect that accounts for the condensation sink in the estimation of the other parameters, but it is set to zero (or a constant) when the model is used in prediction. The other variables used in the model are: relative humidity (RH, %), concentrations of SO2 (µg m⁻³), NO2 (µg m⁻³) and ozone (µg m⁻³), the probability that the day is not an NPF event day (PrNE) and air temperature (K).

Fig. 1. Experimental particle concentrations at 50 nm (dN/d logDp) for different event classes. The box plots indicate the median (bar), 25th and 75th percentiles (box), and the minimum and maximum values (whiskers) that were not considered outliers. "1", "2", and "3" indicate NPF event classes of decreasing intensity (see text). NE: non-events, and UC: unclassified (i.e. ambiguous) days.
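Written out, a model with the terms described for the final model could take the following form. This is a sketch only: the numbering of the slopes, the assignment of month-specific random slopes to RH, SO2 and NO2 and of an hour-specific random slope to ozone (inferred from the seasonal and diurnal variation reported for these variables), and the residual term ε are assumptions, not the published equation.

```latex
\ln\left(\left.\frac{dN}{d\log D_p}\right|_{50\,\mathrm{nm}}\right)
  = (\beta_0 + u_m)
  + (\beta_1 + v_{1m})\,\mathrm{RH}
  + (\beta_2 + v_{2m})\,[\mathrm{SO_2}]
  + (\beta_3 + v_{3m})\,[\mathrm{NO_2}]
  + (\beta_4 + v_{4h})\,[\mathrm{O_3}]
  + \beta_5\,\mathrm{PrNE}
  + \beta_6\,T
  + v_5\,\mathrm{CS}
  + \varepsilon
```

When such a model is used for prediction, the CS term is dropped (v5 set to zero or a constant), and N50 follows by exponentiating the right-hand side.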

Effect of new particle formation event
As described in Sect. 2, all measurement days were classified as an NPF event, non-event or unclassified day. NPF events were further divided into three classes according to the intensity of the particle formation event. The experimental observations show that N50 was significantly higher on class 1 event days compared to the rest of the data (Fig. 1). On non-event days, in contrast, N50 was the lowest. This observation confirms that the number of Aitken particles (50 nm) is influenced by NPF events, which are characterised by the evolution of nucleation mode particles visible at an initial diameter around 3 nm. The impact can be twofold. The first option is that freshly nucleated particles grow to 50 nm on the very same day. This requires considerable particle growth and thus sufficient amounts of condensable vapours. At Melpitz, the nucleation mode has been shown to occasionally reach mode diameters between 50 and 80 nm on the very same day when NPF happened (Wehner et al., 2005). A second option is that particles from the previous day, on which NPF might have happened, grew into the Aitken particle size range. This second option bears some relevance, since NPF events tend to cluster in series of subsequent days that are characterised by synoptic-scale weather conditions (high solar radiation, intense vertical mixing) that are favourable to particle nucleation. Since all three stations studied are located in anthropogenically influenced areas and encounter pollution episodes, several of the non-event days also showed high particle concentrations. In SPC the concentrations in the weak event classes 2 and 3 do not differ significantly from each other or from the unclassified days. At the German stations enhanced concentrations also occurred on class 2 event days, more pronounced in Melpitz.

Estimated parameters of the statistical model
In order to find the best predictive statistical model for the number concentration of 50 nm particles we first performed tests for the data from each of the three measurement sites separately.Owing to a scarcity of observations, we needed to exclude the data for September in SPC and for February in Hohenpeissenberg.
Table 2 shows the signs and strengths of the parameters (βi + vi) in Eq. (3); the magnitudes of the parameters are listed in Appendix A. Since the parameters for the three datasets were very close to each other, we could merge the data and find the best predictors for the combined dataset. All months were also given individual intercept terms, which would give information about the monthly variation if the monthly slopes were constant, but as the slopes differ too, the intercepts cannot be fully interpreted. We found that RH and the concentrations of SO2 and NO2 had significant additional variance components for different times of the year at all sites, whereas other parameters showed no significant seasonal variation.
When the additional variance is taken into account, the statistical model suggests that the regression effect of RH is negative at all sites, i.e., when the RH is high, N50 is low. High atmospheric relative humidity has proved to be a factor disfavouring NPF (e.g. Birmili and Wiedensohler, 2000; Boy and Kulmala, 2002; Mikkonen et al., 2006; Hamed et al., 2007), so the observed inverse relationship between RH and N50 is in line with the inverse relationship between RH and NPF. The overall regression effect of RH was found to be negative but it varies between months (Fig. 2a). The adverse effect of RH was most intense in winter (December-January), when the measured relative humidity reaches its highest overall levels. This is not surprising because the highest RH values are also associated with more clouds and precipitation and, thus, wet particle deposition.
SO2 concentration had a significant positive regression effect in the summer at all sites but the effect was negative or zero in wintertime (Fig. 2b), especially in SPC and Melpitz. Sulphuric acid has been shown to be involved in nucleation and growth of newly formed particles (e.g. Kulmala et al., 2006; Laaksonen et al., 2008a). A product of SO2 and radiation divided by CS has been used as a proxy for H2SO4 formation (Hamed et al., 2010). The positive effect of SO2 concentration in summer months suggests that an increased amount of radiation together with a lower CS increases the concentration of H2SO4 and thus contributes significantly to particle formation and growth.
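The proxy mentioned above, the product of SO2 and radiation divided by CS, is simple enough to state as a one-line function. The sketch below is illustrative only: the function name, the input values and their units are hypothetical, and no scaling constant is applied.

```python
def h2so4_proxy(so2, radiation, condensation_sink):
    """Unscaled H2SO4 proxy: SO2 * radiation / CS (inputs in arbitrary but consistent units)."""
    if condensation_sink <= 0.0:
        raise ValueError("condensation sink must be positive")
    return so2 * radiation / condensation_sink

# Hypothetical conditions with the same SO2 level:
summer = h2so4_proxy(so2=2.0, radiation=600.0, condensation_sink=0.005)  # strong radiation, low CS
winter = h2so4_proxy(so2=2.0, radiation=100.0, condensation_sink=0.01)   # weak radiation, higher CS
```

With identical SO2, the summer-like case gives the larger proxy value, which matches the qualitative argument that radiation and a low CS make SO2 more effective in summer.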
NOx has been suggested to have a positive influence on particle nucleation (Laaksonen et al., 2008a) and thus affect the number of Aitken particles. NO2 was found to be a significant predictor in our model for Melpitz and Hohenpeissenberg and had a positive effect on N50. In SPC the effect of NO2 seems insignificant, but this may be due to the poor quality of the NO2 data. The effect of the NO2 concentration in the combined dataset is at its highest in summer and in early autumn (Fig. 2c). That NO2 is a significant predictor for N50 is interesting and we do not completely understand it, but a more thorough investigation is beyond the scope of the current study.
The aerosol condensation sink determines how rapidly molecules will condense onto pre-existing aerosols (Kulmala et al., 2005). The regression effect of condensation sink was found to be positive in all datasets, i.e., high CS was predicted to favour high particle concentrations at 50 nm. This was somewhat unexpected since previous studies have found low CS to favour new particle formation (Vehkamäki et al., 2004; Hamed et al., 2007; Jaatinen et al., 2009). It is possible that CS acts as an indicator of the growth of the smaller particles to the size of 50 nm. An alternative explanation is the contribution of direct anthropogenic particle emissions to both N50 and CS. The relevance of anthropogenic particle sources (traffic, industry), which are concentrated in urban areas but are also spread in a diffuse distribution in rural areas, for the observations at Melpitz was pointed out by Costabile et al. (2009). The effect of CS was used in the model as a random, site-specific effect, to account for the random variation in the 50 nm particle concentration caused by the larger particles existing in the air, but to avoid circularity problems it is not used in prediction or in the GLOMAP modelling study.
Temperature had a small negative effect in SPC and Melpitz but the effect was not significant in Hohenpeissenberg. The negative regression effect of temperature in SPC and in Melpitz is in line with the suggestions of Birmili et al. (2003) and Hamed et al. (2007) that high temperatures suppress new particle formation in polluted areas, i.e., there is less NPF on high temperature days, and thereby also affect the number concentration of small particles. In addition, high temperatures are generally associated with taller atmospheric mixed layer heights. In a taller mixed layer, primary particle emissions contributing to the Aitken mode will dilute in a bigger volume of air and thus lead to lower concentrations of such particles. The lack of a significant effect in Hohenpeissenberg may be caused by the fact that the measurement station is on top of a small mountain (980 m a.s.l.), where it is generally above the nocturnal boundary layer and in winter time also occasionally above the daytime boundary layer.

The effect of PrNE was significant and negative, which was expected as the number concentration of 50 nm particles typically increases on NPF days (i.e., when PrNE is zero or low). PrNE was calculated with the method described in Mikkonen et al. (2006), but there are other possible methods to compute the probability of NPF (or non-event); e.g., Hyvönen et al. (2005) introduced a method for boreal forest areas, and in some cases even the local proportion of non-event days could be a valid approximation, so the method for determining PrNE does not limit the use of the parameterization for N50.

Oxidation of organics is known to be a significant factor in the growth of the particles (Laaksonen et al., 2008b). Ozone concentration has been found to affect the oxidation of organic species and thus affect the particle formation and growth (e.g., Joutsensaari et al., 2005; Vaattovaara et al., 2006). Ozone concentration was the only variable that had significant diurnal variation in the estimate. Figure 3 shows that while the overall effect is negative, it has its lowest values just before sunrise and then rises until sunset, which indicates that it is probably acting as an (inverse) tracer of some pollutants which are sinks for small particles. For other variables, diurnally varying estimates did not increase the predictive ability of the model enough to be statistically significant or to compensate for the increased computational cost.

Statistical model predictions
Figure 4 illustrates how well the statistical model for the combined data predicts the observations at the three stations in randomly selected periods. The figure shows that the predicted values follow the observations fairly well at all stations. Overall the statistical model finds the peaks of the number concentration but slightly underestimates the highest peaks and the fastest fluctuations.
Figure 5 presents the scatter plots of the natural logarithms of observed and predicted N50 for event and non-event days. It shows that when the great number of data points is taken into account, the predicted values are quite well in line with the observations. Only a small number of the highest observations are underestimated and some of the lowest values are overestimated. Together with Fig. 4 it confirms that the prediction ability of the statistical model is adequate at all of the stations and for both event and non-event days. We computed coefficients of determination (R²) for each dataset separately and for the combined dataset. R² indicates how well the statistical model predicts the total variation of the dependent variable (here N50) and is commonly used to give information about the goodness of a fit. The best prediction ability was found in data from SPC where the R² was more than 0.6, which indicates that the statistical model explains more than 60% of the total variation of the particle concentration. At the German stations R² values were approximately 0.5. For the combined dataset the model could explain more than 50% of the total variation of the number concentration of 50 nm particles, which can be considered a fairly good result for this kind of data. The biggest single factor decreasing the R² values is the highest peaks in the number concentration, which the model does not capture very well. However, these peaks are often caused by local pollution events and thus are almost impossible to predict. The current statistical model does not have a specific description for local transport trajectories or air mass origin, but these effects are partly described by the gas phase data.
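The coefficient of determination used above can be computed as one minus the ratio of the residual to the total sum of squares. The sketch below assumes that R² is evaluated on the natural logarithms of the concentrations, as in the scatter plots of Fig. 5; that choice and the example values are assumptions for illustration.

```python
import math

def r_squared(observed, predicted):
    """R^2 = 1 - SS_res / SS_tot for paired observed and predicted values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical N50 values (cm^-3), compared on the log scale.
ln_obs = [math.log(v) for v in [1000.0, 2000.0, 4000.0, 8000.0]]
ln_mean = sum(ln_obs) / len(ln_obs)

r2_perfect = r_squared(ln_obs, ln_obs)                   # predictions equal observations
r2_mean_only = r_squared(ln_obs, [ln_mean] * len(ln_obs))  # constant prediction at the mean
```

A model that misses a few very high peaks accumulates large squared residuals at exactly those points, which is why unpredicted local pollution events pull R² down disproportionately.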

Model set-up
The derived parameterization for aerosol size distribution at 50 nm (Eq. 4) was tested in the global-scale aerosol model GLOMAP to predict the concentration of climatically active particles over Europe. The model is an extension to the TOMCAT 3-D chemical transport model (Chipperfield, 2006; Stockwell and Chipperfield, 1999), and its detailed description can be found in Spracklen et al. (2005). Here, we used a sectional moving-centre scheme with 20 size sections to cover the aerosol particle size range of 3 nm to 25 µm. We performed two simulations for April 2000: (1) a baseline simulation using the GLOMAP standard set-up, and (2) a test simulation using the statistical model developed in this study.
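To make the sectional size grid concrete, the sketch below builds 20 sections spanning 3 nm to 25 µm. It assumes geometrically (logarithmically) spaced section edges and treats the stated range as particle diameter; neither detail is given in the text, so the resulting edges are illustrative rather than the actual GLOMAP grid.

```python
import numpy as np

N_SECTIONS = 20   # number of size sections, from the text
D_MIN = 3e-9      # lower end of the size range, 3 nm in metres
D_MAX = 25e-6     # upper end of the size range, 25 um in metres

# Assumption: section edges are geometrically spaced between D_MIN and D_MAX.
edges = np.geomspace(D_MIN, D_MAX, N_SECTIONS + 1)

# Geometric mid-point of each section, used here as a nominal section centre.
centres = np.sqrt(edges[:-1] * edges[1:])

# Index of the section that contains 50 nm, the size targeted by the parameterization.
idx_50nm = int(np.searchsorted(edges, 50e-9, side="right") - 1)

for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
    marker = "  <- contains 50 nm" if i == idx_50nm else ""
    print(f"section {i:2d}: {lo * 1e9:10.2f} nm to {hi * 1e9:10.2f} nm{marker}")
```

In a moving-centre scheme the representative size within each section changes as particles grow, so the fixed centres computed here are only starting values.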
The standard set-up of GLOMAP explicitly simulates particle formation via binary H2SO4-H2O nucleation in the free troposphere (Kulmala et al., 1998), activation nucleation in the boundary layer (Kulmala et al., 2006), and primary particle emissions according to the AEROCOM emission database (http://nansen.ipsl.jussieu.fr/AEROCOM). Particles formed or emitted at sizes below 50 nm grow to this size by condensation of sulphuric acid and oxidation products of monoterpenes and, to a lesser extent, by coagulation. In the test run, boundary layer nucleation and primary emissions at particle sizes Dp < 50 nm were omitted and replaced with the parameterization developed in this study. The parameterization was used to calculate N50 at each aerosol model time step (15 min). Input for the statistical model was taken from the model-predicted SO2 field, offline NO2 and O3 fields predicted with a coupled chemistry-aerosol model, and European Centre for Medium-Range Weather Forecasts (ECMWF) temperature and relative humidity fields.

Results
Figure 7 compares the model-predicted potential CCN concentration (Dp > 50 nm) against measured April mean data at Melpitz, SPC, Hohenpeissenberg, and 15 other sites. For the three sites analysed in this study, we present averages of several years of measurement data (i.e., multi-annual averages), since the analysed measurement periods at the three sites did not overlap (see measurement periods in Sect. 2.1). The data for the other 15 sites are from the European Integrated project on Aerosol Cloud Climate and Air Quality Interactions (EUCAARI), measured during April 2008 and/or 2009, and were chosen because of their comprehensiveness. Note that since the model was run only for April 2000 (due to the computational expense of a global model), one can expect only a rough agreement between the model and measurements. However, we wanted to include the EUCAARI data in order to demonstrate that the statistical parameterization gives reasonable results also outside the geographical domain for which it was derived.
The parameterization (test run) brings the model results significantly closer to observations at Melpitz and Hohenpeissenberg: the baseline run predicts potential CCN concentrations 110% and 52% higher than observed, respectively. Using the statistical model, the overprediction is reduced to 34% and 22%, respectively. However, at SPC the agreement with the observations weakens in the test run compared to the baseline run (underpredictions of 42% and 18%, respectively). This may be because the modelled gas concentrations, which are used as input for the statistical model, are in poor agreement with the measurements. For example, the modelled O3 concentration at SPC is a factor of ~2 higher and the NOx concentration a factor of ~5 lower than the measured mean values for April. As a whole, SPC is a challenging site for large-scale aerosol models such as GLOMAP that have a coarse spatial resolution. The site is located in a valley where pollution frequently builds up or clears out. Since GLOMAP has a spatial resolution of 2.8° × 2.8°, it cannot resolve the local topographical features around the site.
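The over- and underpredictions quoted above can be read as relative biases of the modelled mean against the observed mean. The sketch below assumes the definition 100 × (modelled − observed) / observed, which the text does not state explicitly. The absolute concentrations are hypothetical and were chosen only so that the resulting percentages match those reported for Melpitz.

```python
def relative_bias_percent(modelled, observed):
    """Relative bias in percent; positive values mean overprediction (assumed definition)."""
    return 100.0 * (modelled - observed) / observed

# Hypothetical April mean potential CCN concentrations (cm^-3), for illustration only.
observed_mean = 1000.0
baseline_mean = 2100.0  # chosen to give the 110% overprediction reported for the baseline run
test_mean = 1340.0      # chosen to give the 34% overprediction reported for the test run

baseline_bias = relative_bias_percent(baseline_mean, observed_mean)
test_bias = relative_bias_percent(test_mean, observed_mean)

print(f"baseline run: {baseline_bias:+.0f}%")
print(f"test run:     {test_bias:+.0f}%")
```

With this sign convention, the underpredictions reported for SPC correspond to negative values.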
Of the 15 EUCAARI sites, using the parameterization instead of the baseline model set-up brings the predicted CCN much closer to observations at 7 locations and clearly deteriorates the agreement at 4 locations. At 3 sites (K-Puszta, Kosetice and Finokalia) there is a relatively small difference between the baseline and the parameterized runs. At Waldhof the observed CCN is approximately halfway between the predicted values from the baseline and parameterized runs.
www.geosci-model-dev.net/4/1/2011/ Geosci. Model Dev., 4, 1-13, 2011
While this preliminary test of the parameterization against observations is incomplete in that it does not simulate the exact years of the observations, it does indicate that the derived parameterization has the potential to describe CCN formation in very different environments, from Arctic to polluted rural. These results give confidence to apply the statistical framework also to measurements from other sites in order to further improve the derived parameterization.
As indicated in Fig. 7, the parameterization tends to predict lower potential CCN concentrations than the standard model version over Central Europe (Fig. 6). Reductions of the order of 30-60% are seen over northern parts of Central and Western Europe (Germany, Poland, Benelux, England) and Italy. Over the rest of Europe (apart from a small region in Northern Scotland) the baseline and test runs agree within 20%. Note, however, that since the parameterization was developed based on aerosol data from Central Europe, it is most likely not valid outside this region (e.g. over the oceans, where totally different processes determine the aerosol and CCN concentrations).
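The comparison between the two runs can be sketched as a grid-cell-wise relative change. The example below assumes the change is defined as 100 × (test − baseline) / baseline, which is not spelled out in the text, and uses a tiny hypothetical 2 × 2 field in place of the real model output.

```python
import numpy as np

def relative_change_percent(test, baseline):
    """Relative change of the test run with respect to the baseline run, in percent."""
    return 100.0 * (test - baseline) / baseline

# Hypothetical potential CCN fields (cm^-3) on a 2 x 2 grid, for illustration only.
baseline = np.array([[1000.0, 2000.0],
                     [1500.0, 800.0]])
test = np.array([[900.0, 1000.0],
                 [900.0, 880.0]])

change = relative_change_percent(test, baseline)

# Grid cells where the two runs agree within 20%, the threshold quoted in the text.
within_20 = np.abs(change) <= 20.0

print(change)
print(within_20)
```

Negative values mark cells where the parameterization lowers the predicted potential CCN concentration relative to the standard set-up.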

Conclusions
An advanced statistical model structure was introduced and found to be an adequate tool for analysing tropospheric Aitken particle (Dp = 50 nm) concentrations and for making predictions based on in-situ meteorological (temperature, RH) and gas phase parameters (SO2, NO2, O3). The statistical model can also be used for forecasting the particle concentration with the estimated regression coefficients. A key result was that some of the variables which control the occurrence of new particle formation events also influence the number concentration of 50 nm particles. This is explained by a significant transfer rate of newly formed particles into the bigger size ranges by condensation and/or coagulation. A notable exception was the condensation sink, which was found to be a factor disfavouring NPF but had a significant positive correlation with the number concentration of 50 nm particles.
The same statistical model framework could be used for any other particle sizes and in other locations. The same parameterization can be used at least in areas with similar concentrations of particles and pollutants, but extrapolation of the results to clean environments, like boreal forests, needs to be confirmed before use. Our preliminary tests with the global-scale aerosol model GLOMAP indicate that the parameterization can be used as a part of a larger atmospheric model to predict the concentration of climatically active particles. The statistical model for the prediction of 50 nm particles could be a significant step towards shorter computation times in global climate models; it seems to work adequately in the boundary layer, but it still does not solve the computational efficiency problems in the free troposphere. Equally, the use of the statistical model for N50 could bypass some of the current uncertainties in the theoretical description of the nucleation and growth process, particularly when predicting potential CCN concentration.
Fig. 1. Experimental particle concentrations at 50 nm (dN/dlogDp) for different event classes. The box plots indicate the median (bar), 25th and 75th percentiles (box), and the minimum and maximum values (whiskers) that were not considered outliers. "1", "2", and "3" indicate NPF event classes of decreasing intensity (see text). NE: non-events, and UC: unclassified (i.e. ambiguous) days.

Fig. 2. Seasonal estimates of (a) RH, (b) SO2, and (c) NO2 for the combined dataset. Note that the estimates of different parameters are not comparable, since the values of the variables are not standardised.

Fig. 4. Observed (green line) and predicted (red line) time series from illustrative example periods at all stations, when the combined data is used in the parameter estimation. Gaps in the red line are due to missing data points.

Fig. 6.Relative change in model predicted potential CCN concentrations when the standard aerosol model set-up is replaced with the test model set-up.

Fig. 7. Mean observed and predicted potential CCN (Dp > 50 nm) concentrations at the 3 measurement sites (measurement data from the time periods indicated in the text) and at 15 EUCAARI sites (measurement data for April 2008 and 2009). Note that the model is run for April 2000. Units are cm⁻³.

Table 1. Results from discriminant analysis.

Table 2. Signs and magnitudes of the parameters of the model (βi + vi). A minus sign denotes a negative regression effect on N50, a plus sign denotes a positive effect, and the number of signs describes the magnitude of the effect.

Table A4. Coefficients not used in prediction.