The 2.5 km convection-permitting (CP) ensemble AROME-EPS
(Applications of Research to Operations at Mesoscale – Ensemble Prediction
System) is evaluated by comparison with the regional 11 km ensemble
ALADIN-LAEF (Aire Limitée Adaptation dynamique Développement
InterNational – Limited Area Ensemble Forecasting) to show whether a benefit
is provided by a CP EPS. The evaluation focuses on the abilities of the
ensembles to quantitatively predict precipitation during a 3-month
convective summer period over areas consisting of mountains and lowlands.
The statistical verification uses surface observations and 1 km
The prediction of deep convection in mountainous terrain is known to be one of the greatest challenges in atmospheric modeling. The initiation and development of deep convection is dependent on small-scale orographic structures and related processes, which cannot be easily described by atmospheric models (Wulfmeyer et al., 2011; Barthlott et al., 2011; Weckwerth et al., 2014). Nevertheless, the estimation of the location, duration, and intensity of precipitation events is important, as Alpine areas are more exposed to natural hazards connected with heavy precipitation (landslides and flooding) than flat land (e.g., Rotach et al., 2009; Haiden et al., 2014).
Models with deep convection parameterization perform poorly in simulating heavy and highly localized precipitation, especially those with a grid spacing larger than 10 km (Weusthoff et al., 2010). One source of errors is that the applied convection schemes act independently in individual model grid columns. As a consequence, convectively generated cold pools that drive convective system propagation cannot be properly simulated, resulting in simulated system movement that is too slow. Under weak synoptic forcing, for example, organized mesoscale convective systems (MCSs) are particularly challenging for convection-parameterizing models (Clark et al., 2007; Liu et al., 2006). Another drawback is that the inadequate descriptions of buoyancy and updrafts in a convection-parameterizing model often cause convection to initiate too early. This premature initiation of convection often results in timing and location errors as well as difficulty in simulating the diurnal cycle of rainfall (Clark et al., 2007). A detailed discussion of convection initiation in convection-parameterizing models can be found in Davis et al. (2003) and Bukovsky et al. (2006).
A solution for this kind of forecasting problem is offered by a new generation of numerical weather prediction (NWP) models, which have been developed during the last decade. Convection-permitting models with horizontal grid spacings of approximately 2–3 km offer new possibilities for estimating local impacts. The term “convection permitting” as used in this article (CP hereafter) means that a deep convection parameterization is not used in the model. It is assumed that the horizontal resolution around 2–3 km is sufficient to depict the bulk properties of precipitating convective cells, but not to truly resolve the processes within precipitating convective cells such as turbulence and entrainment (Bryan et al., 2003). This is in accordance with Weisman et al. (1997), who suggested setting the upper limit for the range of convection-permitting resolutions at 4 km.
Despite the higher resolution and explicit simulation of deep convection, the exact prediction of location, intensity, and spatiotemporal extent of deep convection is still difficult. Recently, probabilistic approaches using convection-permitting ensembles have proven valuable, since they provide direct information on forecast uncertainty, which is often quite large for deep convection. An ensemble usually consists of a number of model runs, which differ in their initial and boundary conditions and/or model configurations. In order to produce a reliable probabilistic forecast, the individual ensemble member forecasts should be equally likely to occur and cover the range of future states. Following Clark et al. (2011), the ideal number of ensemble members is dependent on the point of diminishing returns, i.e., the ensemble size where no new information can be expected by additional members.
In recent years, several CP ensemble prediction systems (EPSs) have been developed and considerable experience has already been gained. To name but a few, there are the COSMO-DE-EPS (Consortium for Small-scale Modeling – EPS, Gebhardt et al., 2011; Peralta et al., 2012; Ben Bouallègue et al., 2013; Kühnlein et al., 2014) at the Deutscher Wetterdienst (DWD), the CP version of the UK Met Office's MOGREPS (Met Office Global and Regional Ensemble Prediction System, Bowler et al., 2008; Caron, 2013; Hanley et al., 2013; Tennant, 2015), a storm-scale ensemble forecast (SSEF) run by the Center for Analysis and Prediction of Storms (CAPS) at the University of Oklahoma (Xue et al., 2007, 2009; Clark et al., 2011; Schumacher et al., 2013; Schumacher and Clark, 2014), a WRF-based CP ensemble at NCAR (e.g., Schwartz et al., 2015), and AROME-EPS (e.g., Vié et al., 2012; Bouttier et al., 2012) developed at Météo-France. A common feature of all of these EPSs is that their horizontal mesh size is equal to or less than 4 km, but mostly between 2 and 3 km.
The EPSs mentioned above differ regarding their number of ensemble members and their perturbation strategies and post-processing. Some of them apply an ensemble data assimilation (EDA) approach for perturbing the initial conditions (ICs) (Vié et al., 2012; Caron, 2013; Schumacher and Clark, 2014; Schwartz et al., 2015). The applied model perturbation methods range from a multiparameter approach (Gebhardt et al., 2011) to a stochastic physics scheme (Bouttier et al., 2012; Romine et al., 2014) and to using different dynamical cores (Schumacher et al., 2013). In order to increase ensemble size and to improve the representation of the ensemble distribution, some systems also apply the neighborhood method and/or lagged ensemble concepts (Ben Bouallègue et al., 2013). While the neighborhood method is based on ensemble probabilities derived from grid points of a defined environment (Theis et al., 2005; Schwartz et al., 2010), the lagged ensemble approach uses forecasts of successive ensemble runs (Ben Bouallègue et al., 2013).
A number of evaluative studies concerned with these CP EPSs have been conducted. They mainly focus on the impact of CP ensemble configurations, for example, the generation of IC perturbations, representation of the model error, uncertainties from the lateral boundary conditions (LBCs), ensemble size, and spatial scale (Kong et al., 2006; Clark et al., 2009, 2011; Vié et al., 2012; Bouttier et al., 2012; Ben Bouallègue et al., 2013; Kühnlein et al., 2014; Schwartz et al., 2015; Schumacher and Clark, 2014; Romine et al., 2014; Tennant, 2015). However, there are few comprehensive studies evaluating CP EPSs, in particular in comparison with mesoscale regional EPSs. Clark et al. (2009) compared a 5-member 4 km grid spacing convection-permitting ensemble with a 15-member 20 km grid spacing regional ensemble. Their case studies revealed that the convection-permitting ensemble generally provided more accurate precipitation forecasts than the coarser-resolution regional EPS. Le Duc et al. (2013) examined the ability of two 11-member ensembles with 10 and 2 km horizontal resolution to predict precipitation, with the finer model obtained by direct downscaling of the coarser one. They showed that the 10 km ensemble was more reliable in predicting light rain, whereas the 2 km ensemble outperformed the coarser one in cases of heavier rain. Schwartz et al. (2009) combined subjective and objective verification approaches and found that a higher-resolution model with 4 km grid spacing produced better forecasts than a 12 km regional model. However, additional comparisons of control runs with 2 and 4 km resolution did not reveal further prognostic value for the higher-resolution (2 km) model.
In this paper, we will evaluate the performance of a 16-member 2.5 km
grid spacing convection-permitting EPS by comparing it with its driving
16-member and 11 km grid spacing mesoscale regional ensemble. Focus will be
on the capabilities of the CP ensemble to quantitatively predict
precipitation during a convective summer period over an area consisting of
mountains and lowlands. Of interest here is the Alpine region, since the
impacts of the mountainous terrain, such as windward/lee effects and the
differential heating of valleys and mountain slopes, can cause large
inaccuracies in forecasting convective precipitation and pose a challenge
for numerical models and their physical parameterizations (Richard et al.,
2007; Wulfmeyer et al., 2008, 2011; Bauer et al., 2011).
Therefore, an evaluation study is designed and conducted for a typical
convective season (3 months, May–August 2011), i.e., a period that is
long enough to make at least basic statements about the significance of
results. Naturally, a period of this length is not sufficient to enable
statistically reliable statements on real hazardous events, such as
landslides and flash floods. However, the investigations can be regarded as a
first step towards this aim. The CP ensemble, which is evaluated in this
paper, is a version of AROME-EPS, developed at the Central Institute for
Meteorology and Geodynamics in Austria (ZAMG). It is compared with its
coarser driving regional EPS ALADIN-LAEF (Wang et al., 2011). The following
questions are raised:
Can a convection-permitting EPS provide an advantage over its coarser, driving regional EPS in complex terrain?
Is there any difference in the performance for the compared EPSs between lowlands and mountainous areas?
How well can CP EPS and lower-resolution regional EPS simulate the diurnal cycle of precipitation? Is the onset and development of convective precipitation realistic?
Does a significant difference in performance for different weather regimes (i.e., days with weak and strong synoptic forcing) exist?
A verification study is designed and conducted to answer these questions and to establish whether AROME-EPS can outperform ALADIN-LAEF, a regional mesoscale ensemble with deep convection parameterization on a coarser grid. Wang et al. (2012) demonstrated the added value of ALADIN-LAEF as a regional mesoscale EPS to the global ECMWF-EPS (European Centre for Medium-Range Weather Forecasts). Hence, the present study extends this research by addressing the step between regional mesoscale and CP ensembles.
For the present paper, AROME-EPS is coupled to the 16 perturbed ALADIN-LAEF members. This is done to take advantage of the simulation of uncertainties used in ALADIN-LAEF. This uncertainty information is subsequently transferred to finer scales via the dynamical downscaling of the ALADIN-LAEF forecasts by AROME. This means that both IC perturbations and LBC perturbations are provided from the driving model and are thus consistent. No further IC perturbations and model perturbations are applied. Generally, the setup is kept as simple as possible to point out the pure effects of the downscaling: AROME-EPS is directly coupled to a daily ALADIN-LAEF run initiated at 00:00 UTC. There is no time lag between the ALADIN-LAEF and the AROME-EPS simulations, and the forecasts are evaluated for the first 30 h of the model runs, hence for a whole day and the subsequent night each.
The benefits of AROME-EPS compared to ALADIN-LAEF are assessed in the framework of a comparative verification study. Although the focus of the verification study is on the onset and development of precipitation, the performance of other surface weather parameters is also considered. The verification methods are selected in such a way that the overall performance, in both a deterministic and a probabilistic sense, and the abilities of the ensembles to reproduce spatial structures can be investigated. Hence, ensemble-related scores are combined with spatial verification methods. Unintentionally, the strategy of this paper shows parallels to the verification study conducted by Le Duc et al. (2013), especially concerning the two ensembles (10 and 2 km resolution) coupled by direct downscaling. Further similarities are the complex terrain in which the study is conducted (Japan) and the use of traditional and advanced verification metrics. As a consequence, parallels in the results are noted in the results section.
Detailed characteristics of the compared models are described in Sect. 2 along with the verification data. The methods chosen for the evaluation of the two ensembles are described in Sect. 3. Section 4 comprises the verification results and Sect. 5 the summary and concluding remarks.
ALADIN-LAEF is the operational regional ensemble system of ZAMG and runs at ECMWF (Wang et al., 2010, 2011). It is based on the hydrostatic spectral limited area model ALADIN (Wang et al., 2009). ALADIN-LAEF has 16 members and is coupled to ECMWF-EPS (Weidle et al., 2013) with a horizontal grid spacing of 11 km. In operational mode, it runs two times per day at 00:00 and 12:00 UTC and provides probabilistic forecasts on a forecast range up to 3 days ahead, i.e., 72 h. In this study, however, evaluation is confined to the run at 00:00 UTC and a forecast range of 30 h ahead only. This is done in order to investigate the onset and development of convection in its diurnal cycle.
The 16 members of ALADIN-LAEF are not sufficient to represent the atmospheric state probability density function (PDF). However, Schwartz et al. (2014) have shown that similar verification scores can be obtained from a 50-member ensemble and subsets of 20–30 members. Hence, we can expect, at least, reasonable results from verification based on a 16-member ensemble.
Geographic domains and topographies of
The ALADIN-LAEF domain (Fig. 1) covers the whole European continent, Iceland, the whole Mediterranean Sea, Black Sea, Caspian Sea, and adjacent countries. The eastern margins reach the Ural Mountains and parts of Siberia. To perturb the atmospheric initial conditions at the upper levels, ALADIN-LAEF applies a breeding–blending method. It uses large-scale perturbations from the driving global ECMWF-EPS combined with small-scale perturbations from the ALADIN breeding vectors (Toth and Kalnay, 1993). The blending method (Wang et al., 2014) ensures that inconsistencies between small- and large-scale perturbations are avoided. To this end, a digital filter is applied at a low spectral truncation to both the breeding vectors and the fields from the global model. Afterwards, the filtered breeding vectors are subtracted from the original breeding vectors at full spectral resolution, and the filtered global fields are added. The resulting initial perturbations are consistent with the regional EPS itself as well as with the driving global EPS.
To consider uncertainties arising from the initial surface conditions in ALADIN-LAEF, a surface data assimilation scheme based on optimum interpolation (CANARI – Code for the Analysis Necessary for Arpège for its Rejects and its Initialization, Taillefer, 2002) is implemented using randomly perturbed observations. To account for uncertainties in the model itself, a multi-physics approach is implemented in ALADIN-LAEF. The perturbed members use different model configurations with several combinations and tunings of schemes and parameterizations available in the ALADIN physics package. The main emphasis is put on the variation and tuning of the following schemes and parameterizations: the diagnostic convection scheme as described in Bougeault (1985); the prognostic deep convection scheme 3MT (modular multiscale microphysics and transport scheme; Gerard et al., 2009) and the connected microphysics scheme described in Geleyn et al. (2008) and Gerard et al. (2009); the radiation scheme based on Ritter and Geleyn (1992) or alternatively the scheme described in Mlawer (1997) and Morcrette (1991); and the pseudo-prognostic TKE (turbulent kinetic energy) scheme described in Vana et al. (2008). Further details can be found in Wang et al. (2010). The authors are aware that the forecasts of the individual members produced by the multi-physics approach cannot strictly be regarded as equally likely. Indeed, a previous evaluation (separate from this study) of the multi-physics in ALADIN-LAEF revealed that some of the members showed larger biases and errors than the other members. The configurations of these poorer members were changed accordingly. Hence, we can assume that the members now produce forecasts of comparable quality.
The model core of AROME-EPS is the non-hydrostatic, spectral limited area model AROME (Seity et al., 2011), which is especially designed to run at very high resolutions with a grid spacing of 2.5 km or finer. Deep convection is treated explicitly, while shallow convection is parameterized with a mass flux approach (Pergaud et al., 2009). The single-moment bulk microphysics scheme ICE3 for mixed-phase cloud parameterization (Pinty and Jabouille, 1998) handles mixing ratios of five prognostic hydrometeor classes (cloud water, cloud ice, rain, snow, and graupel) and also simulates complex interactions between them. AROME, by default, uses a three-layer soil model SURFEX (Surface Externalisée) with the effects of sea and urban areas parameterized using a tile approach (Masson, 2000).
At ZAMG, a deterministic version of AROME with 2.5 km grid spacing has been operational since January 2014 running every 3 h up to a lead time of 48 h. The domain for the model integration encompasses the Alpine region (Fig. 1). Table 1 summarizes the most important model characteristics of ALADIN-LAEF and AROME-EPS.
Main characteristics of the ALADIN-LAEF and AROME-EPS.
To run AROME-EPS, the same version of AROME with the same resolution is initialized by a dynamical downscaling of ALADIN-LAEF and coupled to the 16 members of ALADIN-LAEF. The ensemble runs with a forecast range of 30 h are initiated at 00:00 UTC each day, i.e., at the same time as ALADIN-LAEF. There is no time lag considered, as the pure impact of enhanced resolution and the convection-permitting configuration shall be investigated. Apart from the perturbations of initial conditions and lateral boundary conditions, no further perturbations (e.g., multi-physics parameterizations as in ALADIN-LAEF) are induced in the model integration. This comparatively simple configuration is used for several reasons: first, AROME-EPS has been set up quite recently at ZAMG and is still at an early stage of development. Secondly, the development of physics perturbations in AROME-EPS will rather go towards a stochastic physics scheme or a combined stochastic–multi-physics scheme than towards pure multi-physics as currently used in ALADIN-LAEF. Thirdly, the aim of this study is to test the possible advantage of a CP EPS compared to the operational system of ALADIN-LAEF.
Locations of meteorological surface observation stations within the evaluation domain.
Station observations are used for the evaluation of ALADIN-LAEF and AROME-EPS surface weather variables. Figure 2 shows the 517 surface stations in the AROME domain, providing observations at 6-hourly intervals for 2 m temperature, 2 m humidity, 10 m wind speed, and mean sea level pressure. The upper-level verification uses ECMWF analyses as reference data at four pressure levels (925, 850, 700, and 500 hPa), which are adapted to the model resolutions of both AROME-EPS and ALADIN-LAEF.
The evaluation of precipitation forecasts is performed using the very high-resolution precipitation analyses of the ZAMG nowcasting system INCA (Integrated Nowcasting through Comprehensive Analyses; Haiden et al., 2011). This is necessary as the average station distance of precipitation observations is too large to resolve the fine spatial structures of precipitation events. The advantage of the INCA analyses is that they use additional observations and are provided on a regular grid. Based on these gridded data, it is possible to apply enhanced verification methods on precipitation fields, which cannot be computed on a point-to-point basis.
The INCA system, developed at ZAMG, operates on a horizontal resolution of 1 km
Complementing the rain gauge–radar combination, the scheme includes elevation effects on precipitation using an intensity-dependent parameterization (Haiden and Pistotnik, 2009). An NWP model first guess is not required in the precipitation analysis; thus, such analyses are ideally suited as an independent reference to validate NWP models.
Forecast verification is performed at the observation locations for surface variables such as 2 m temperature and humidity, 10 m wind speed, and mean sea level pressure, and on the INCA grid for precipitation. The model forecasts are interpolated bilinearly to the station locations and INCA analysis grid points, respectively. Further, a height correction scheme based on atmospheric standard conditions is applied to the 2 m temperature values. As a result, the same number of forecast–observation pairs is available for the verification of each of the EPS models, which supports the comparability of the verification results.
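The interpolation and height correction steps can be illustrated with a minimal Python sketch. It is not the scheme used in this study: the function names are hypothetical, the grid is addressed by fractional index coordinates, and the lapse rate of 6.5 K per km is an assumed reading of "atmospheric standard conditions".

```python
import numpy as np

# Assumption: standard-atmosphere lapse rate in K per metre. The text only
# states "atmospheric standard conditions"; this value is illustrative.
STANDARD_LAPSE_RATE = 0.0065


def bilinear(field, x, y):
    """Bilinearly interpolate a 2-D field (indexed [y, x]) at fractional
    grid-index coordinates (x, y)."""
    ny, nx = field.shape
    # Lower-left corner of the enclosing grid cell, clipped to the grid.
    x0 = int(np.clip(np.floor(x), 0, nx - 2))
    y0 = int(np.clip(np.floor(y), 0, ny - 2))
    dx, dy = x - x0, y - y0
    lower = (1 - dx) * field[y0, x0] + dx * field[y0, x0 + 1]
    upper = (1 - dx) * field[y0 + 1, x0] + dx * field[y0 + 1, x0 + 1]
    return (1 - dy) * lower + dy * upper


def height_corrected_t2m(t_model, z_model, z_station,
                         lapse_rate=STANDARD_LAPSE_RATE):
    """Shift a model 2 m temperature from model orography height to the
    station height, assuming a constant lapse rate."""
    return t_model + lapse_rate * (z_model - z_station)
```

In this sketch a station lying below the model orography receives a warmer corrected temperature, and one lying above it a colder one.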
AROME-EPS and ALADIN-LAEF are evaluated over a 3-month summer period from 15 May–15 August 2011, which represents a typical convective summer season in central Europe.
Precipitation is one of the parameters for which the biggest improvement is expected from the convection-permitting models. Therefore, the evaluation of the ensembles focuses on the representation of the spatiotemporal structure of precipitation events in the forecasts. Nevertheless, the preconditions for the development and onset of precipitation are also considered. For this reason, other forecast parameters such as temperature, humidity, wind speed, air pressure, and geopotential height are also verified.
Precipitation forecasts are evaluated in both deterministic and
probabilistic ways. The deterministic approach is directed towards
predicting the correct precipitation amounts and the spatial distribution of
the data. Probabilistic evaluation tests the capability of the ensembles to
predict a predefined event with the probability which corresponds to its
relative frequency, i.e., to produce a reliable probability density function (PDF) for the occurrence of the
event. The events can be defined as, e.g., precipitation amounts exceeding a
certain threshold. In this study, thresholds of 0.1 mm (threshold for the
prediction of rain or no rain), 0.5, 1, 2, and 5 mm are chosen for 3-hourly accumulated precipitation amounts. These
thresholds appear low, especially when taking into account convective
precipitation events. However, the thresholds are selected according to the
frequency of occurrence of the precipitation values in the individual grid
cells of the 1 km
A number of traditional point-to-point verification scores (see, e.g., Wilks, 2006) are computed for all evaluated parameters. In addition, significance tests for these scores are performed. Confidence intervals of the verification scores are estimated at a confidence level of 90 % by a bootstrapping algorithm (Davison and Hinkley, 1997; Jolliffe, 2007; Ferro, 2007). The bootstrapping method uses 5000 random samples with a block length of 4 days (Hall et al., 1995).
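A block bootstrap of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm used in the study: it resamples overlapping (moving) blocks of a daily series of score differences and returns a percentile interval. The function name, the moving-block variant, and the rule of calling a difference significant when the interval excludes zero are all assumptions.

```python
import numpy as np


def block_bootstrap_ci(daily_values, n_samples=5000, block_len=4,
                       level=0.90, seed=0):
    """Percentile confidence interval for the mean of a daily series,
    resampling overlapping blocks of `block_len` consecutive days."""
    rng = np.random.default_rng(seed)
    x = np.asarray(daily_values, dtype=float)
    n = x.size
    n_blocks = int(np.ceil(n / block_len))
    last_start = n - block_len + 1          # exclusive upper bound for starts
    means = np.empty(n_samples)
    for i in range(n_samples):
        starts = rng.integers(0, last_start, size=n_blocks)
        # Concatenate the blocks and trim to the original series length.
        idx = (starts[:, None] + np.arange(block_len)[None, :]).ravel()[:n]
        means[i] = x[idx].mean()
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(means, [alpha, 1.0 - alpha])
    return lower, upper
```

Applied to the daily difference of a score between the two ensembles, an interval that does not contain zero would indicate a significant difference at the chosen level under these assumptions.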
In order to present the results concisely, three scores have been selected to describe the differences in forecast performance between AROME-EPS and ALADIN-LAEF: the ensemble mean bias (Eq. 1); the Brier score (BS) together with the components derived from its decomposition, i.e., reliability, resolution, and uncertainty (Brier, 1950; Murphy, 1973; Eqs. 2–5, respectively); and the continuous ranked probability score (CRPS; Hersbach, 2000; Gneiting and Raftery, 2007; Eq. 6).
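As an illustration of the Brier score and its decomposition, the following Python sketch derives event probabilities from the fraction of ensemble members exceeding a threshold and then computes BS with the reliability, resolution, and uncertainty terms in the standard form of Murphy (1973). The function names are hypothetical, and binning by the number of exceeding members is an assumption of this sketch, not necessarily the binning used in the study.

```python
import numpy as np


def ensemble_probability(members, threshold):
    """Fraction of members exceeding `threshold`; members has shape
    (n_members, n_cases)."""
    return np.mean(np.asarray(members) > threshold, axis=0)


def brier_decomposition(prob, obs, n_members=16):
    """Return (BS, reliability, resolution, uncertainty) for forecast
    probabilities `prob` and binary outcomes `obs` (0 or 1)."""
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n = prob.size
    base_rate = obs.mean()
    bs = np.mean((prob - obs) ** 2)
    # One bin per possible number of exceeding members (0 .. n_members).
    counts = np.rint(prob * n_members).astype(int)
    reliability = 0.0
    resolution = 0.0
    for c in np.unique(counts):
        in_bin = counts == c
        n_bin = in_bin.sum()
        p_bin = prob[in_bin].mean()      # forecast probability of the bin
        o_bin = obs[in_bin].mean()       # observed frequency in the bin
        reliability += n_bin * (p_bin - o_bin) ** 2
        resolution += n_bin * (o_bin - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return bs, reliability / n, resolution / n, uncertainty
```

With probabilities restricted to multiples of 1/16, the identity BS = reliability − resolution + uncertainty holds exactly in this sketch; lower reliability values and higher resolution values indicate better forecasts.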
The bias simply measures the mean deviation between the analyzed values
(
CRPS is related to BS insofar as it can be expressed as the integral of BS
for all possible thresholds of the meteorological parameter
The selected spatial verification methods are the so-called SAL method (structure–amplitude–location method; Wernli et al., 2008) and the fractions skill score (FSS; Roberts and Lean, 2008).
SAL determines the forecast performance of precipitation in terms of
structure (
The location score measures the agreement of the centers of mass in the
analyzed and predicted precipitation fields together with the averaged
distance between the center of mass and the individual objects. It is
actually the sum of two components (
As an identical mass center position does not necessarily mean that the
forecast is perfect, the second component
The structure score
The fractions skill score (FSS),
FSS is computed by assigning the grid points binary values 0 and 1 in each
of the neighborhoods with subscripts (
At each such defined scale
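The FSS computation can be sketched in Python following the general formulation of Roberts and Lean (2008): both fields are converted to binary fields with a threshold, neighborhood fractions are computed in square windows, and the mean squared difference of the fractions is normalized by its largest possible value. The function names, the zero padding at the domain edges, and the restriction to odd window widths are assumptions of this sketch.

```python
import numpy as np


def neighborhood_fractions(binary, n):
    """Fraction of ones in the n x n window centred on each grid point.
    `n` must be odd; points outside the domain count as zero."""
    half = n // 2
    padded = np.pad(binary.astype(float), half, mode="constant")
    # Summed-area table with a leading row and column of zeros.
    table = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    ny, nx = binary.shape
    window_sum = (table[n:n + ny, n:n + nx] - table[:ny, n:n + nx]
                  - table[n:n + ny, :nx] + table[:ny, :nx])
    return window_sum / (n * n)


def fss(forecast, analysis, threshold, n):
    """Fractions skill score for one threshold and one window width n."""
    pf = neighborhood_fractions(forecast >= threshold, n)
    po = neighborhood_fractions(analysis >= threshold, n)
    mse = np.mean((pf - po) ** 2)
    reference = np.mean(pf ** 2) + np.mean(po ** 2)
    # Undefined when neither field contains the event.
    return np.nan if reference == 0 else 1.0 - mse / reference
```

In this sketch a precipitation object displaced by two grid points scores zero at the grid scale (n = 1) but gains skill once the window is wide enough to overlap the analyzed object, which is the behavior that motivates evaluating FSS over a range of scales.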
INCA domain and topography with the subdomains which are used for the evaluation.
Verification is done for the whole domain of Austria. To account for the different topographic characteristics in the verification domain, two subdomains are chosen (Fig. 3). They comprise a mountainous area (hereafter region West) as well as a region with flat terrain (hereafter region Northeast). Due to the location of the Alps in Austria and the prevailing flow directions around the Alps, each of the subdomains has its own climatological properties which are also visible in the precipitation characteristics.
In order to investigate the influence of different weather regimes, the 92 days of the test period are classified into three bins according to the
synoptic situation: strong synoptic forcing, weak synoptic
forcing, and dry. Days are classified as dry (5 days) if the areal mean of the daily precipitation sum is below
0.05 mm. All other days, i.e., 87 days on which rain was reported, are
assigned to the bins of weak (23 days) or strong synoptic forcing (64 days). For the classification, a method described by
Done et al. (2006) and successfully applied by Kühnlein et al. (2014) is
used, which is based on the temporal variability of CAPE (convective
available potential energy) as a measure of
atmospheric instability. According to Done et al. (2006), the approach helps
to distinguish between days on which convection is predominantly at equilibrium or at non-equilibrium. This means that the
destabilization of the atmosphere by large-scale synoptic forcing is
balanced or unbalanced, respectively, by the stabilization through
convection. The idea is that this balance or imbalance is related
to the timescale in which CAPE is built up by large-scale processes and
consumed by convection. On days with weak synoptic forcing, the
consumption of CAPE is related to the diurnal cycle or to local triggering
rather than to prevalent large-scale processes. In these cases, the
convective timescale is long and CAPE is often not fully consumed by
convection. In situations where CAPE is realized much faster by large-scale
processes, i.e., in situations of strong synoptic forcing, convection is in equilibrium. In our study, the convective
adjustment timescale
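The three-way classification of days can be sketched as a simple decision rule. This is an illustrative sketch only: the function name is hypothetical, the convective adjustment timescale is assumed to be computed beforehand, and the timescale threshold separating the regimes is not reproduced here, so it is left as a required argument rather than given a default.

```python
from collections import Counter

DRY_LIMIT_MM = 0.05  # areal mean daily precipitation limit stated in the text


def classify_day(areal_mean_precip_mm, tau_c_hours, tau_threshold_hours):
    """Assign a day to 'dry', 'weak', or 'strong' synoptic forcing.

    tau_c_hours: convective adjustment timescale of the day (precomputed).
    tau_threshold_hours: regime threshold; a caller-supplied assumption.
    """
    if areal_mean_precip_mm < DRY_LIMIT_MM:
        return "dry"
    # Long timescale: CAPE is not consumed quickly (non-equilibrium),
    # which the text associates with weak synoptic forcing.
    if tau_c_hours > tau_threshold_hours:
        return "weak"
    return "strong"


def count_regimes(days, tau_threshold_hours):
    """Count regimes for an iterable of (precip_mm, tau_c_hours) pairs."""
    return Counter(classify_day(p, t, tau_threshold_hours) for p, t in days)
```

Counting the classes over all days of a test period, as `count_regimes` does, gives the bin sizes that correspond to the dry, weak, and strong forcing counts quoted above.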
In the following, we present the evaluation of AROME-EPS and ALADIN-LAEF over a 3-month summer period. The focus is on the performance of near-surface parameters, in particular the precipitation forecast, which is of most interest to the users of convection-permitting and regional EPSs.
The forecast performance of AROME-EPS and ALADIN-LAEF is verified for surface parameters (2 m temperature and humidity, 10 m wind speed, and mean sea level pressure, MSLP) and upper-level parameters (temperature, humidity, wind speed, and geopotential height). These results form the background for the evaluation of precipitation.
A large number of verification metrics have been calculated for those near-surface and upper-air parameters. In general, there is no clear advantage for either ALADIN-LAEF or AROME-EPS. The only exceptions are biases in the forecasts, which are found particularly at the surface level. They constitute the most prominent differences in the performance of the EPSs: if the bias is low, the models also perform well for other scores.
For the surface level, we also found more results that are significant at a high level (i.e., 90 %). The verification results for the upper levels are less significant than for the surface, and performance is more mixed. We used a large number of observations for both surface (station observations) and upper levels (ECMWF grid values). Hence, the lower significance of the results for the upper levels can be explained by the model setup rather than by the verification data. Near the surface and at lower levels, AROME-EPS can add more information to the model simulation, compared to ALADIN-LAEF, than at upper levels. This is due to the SURFEX soil scheme and the interaction between a refined representation of orography and the model physics schemes and dynamics. At the upper levels, however, the orography has less influence and the simulation more closely resembles the driving model. For this reason, surface results have been selected to highlight the main findings in the following.
Bias of the ensemble means (left panel) and CRPS (right panel) for 2 m relative humidity (top), 2 m temperature (middle), and 10 m wind speed (bottom) for the period of 15 May–15 August 2011 of AROME-EPS (dotted line) and ALADIN-LAEF (solid line), both verified over the AROME domain. Lead times marked with asterisks (*) indicate results with significant differences between the ensembles.
Figure 4 compares the ensemble mean bias and the continuous ranked probability score (CRPS; see Wilks, 2006 for details) for 2 m relative humidity, 2 m temperature, and 10 m wind speed. CRPS compares the forecast distribution based on all ensemble members with the observed value, which is treated as a step function between non-occurrence and occurrence. CRPS is sensitive to the difference between the forecast probabilities and observed values. The smaller the difference, the better the forecast is rated. Hence, the CRPS of a perfect forecast is zero. Due to the formulation of CRPS, variations of CRPS values are also reflected by many other scores, in particular those which are sensitive to deviations between the distributions of forecasts and observations. Thus, CRPS serves as a representative score for the results of this study. It also shows the impact of biased forecasts.
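For a finite ensemble, CRPS can be evaluated directly from the members. The following Python sketch uses the kernel representation given by Gneiting and Raftery (2007), which equals the integrated squared difference between the empirical ensemble distribution function and the observation step function. The function name is hypothetical, and the study may have used a different estimator (e.g., that of Hersbach, 2000).

```python
import numpy as np


def crps_ensemble(members, obs):
    """CRPS per case for an ensemble.

    members: array of shape (n_members, n_cases)
    obs:     array of shape (n_cases,)
    """
    x = np.asarray(members, dtype=float)
    y = np.asarray(obs, dtype=float)
    # Mean absolute error of the members against the observation.
    member_error = np.mean(np.abs(x - y[None, :]), axis=0)
    # Mean absolute difference between all pairs of members (spread term).
    member_spread = np.mean(np.abs(x[:, None, :] - x[None, :, :]),
                            axis=(0, 1))
    return member_error - 0.5 * member_spread
```

Averaging the returned values over all stations and days gives one CRPS value per lead time. For an ensemble whose members all coincide, the spread term vanishes and the score reduces to the absolute error, which is how a bias shows up directly in CRPS.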
Biases of 2 m relative humidity in Fig. 4a show noticeable diurnal variations. During the night and early morning, AROME-EPS is too dry, whereas ALADIN-LAEF is too moist during the day (12:00 and 18:00 UTC). The diurnal variations of the differences between AROME-EPS and ALADIN-LAEF are also reflected in CRPS in Fig. 4b. During the night, AROME-EPS and ALADIN-LAEF are at the same level, but for the daytime hours AROME-EPS shows better results. For 2 m relative humidity, most verification results are significant at a level of 90 %. This is also true for the differences in forecast performance during the daytime hours. Results for 2 m temperature in Fig. 4c and d show an improvement in bias and CRPS for AROME-EPS at a significance level of 90 %. This result is partially due to a large bias of ALADIN-LAEF temperatures. In contrast, the ensembles differ less for wind speed (Fig. 4e and f) and MSLP (not shown), and these results have only a low level of significance.
Precipitation is evaluated by 3-hourly INCA analyses on a regular 1 km
Time evolution of 3-hourly accumulated precipitation forecast for INCA (solid line), ALADIN-LAEF ensemble mean (dashed line), and AROME-EPS ensemble mean (dotted line) for regions Austria (top), West (middle), and Northeast (bottom). Left panels show results for the days with strong synoptic forcing, right panels for weak synoptic forcing. The shaded areas denote the range of individual ensemble member forecasts for ALADIN-LAEF (dark grey) and AROME-EPS (light grey), respectively.
Errors occur in terms of over- and underestimation of the maximum intensity and in terms of time shifts. The daily maximum of 3 h precipitation is overestimated by AROME-EPS by 20–50 % for regions West and Austria and for both types of synoptic forcing. In ALADIN-LAEF, the maximum of the ensemble mean in these regions is approximately at the same level as analyzed by INCA. Hence, the excessively moist near-surface conditions of ALADIN-LAEF in Fig. 4a are not directly reflected in the precipitation sums. For region Northeast, the ensemble mean of AROME-EPS simulates the maximum amount of precipitation quite well for strong synoptic forcing and only slightly overestimates it for weak synoptic forcing, whereas ALADIN-LAEF is too low for both types of forcing.
Considering the days with strong synoptic forcing in Fig. 5 (left panels),
the highest precipitation sums are detected around 18:00 UTC. AROME-EPS
describes the temporal maximum quite well, whereas the maximum in
ALADIN-LAEF occurs too early (
A further characteristic evident in Fig. 5 is that the precipitation amounts in AROME-EPS develop independently of those in the driving ALADIN-LAEF members, which is indicated by the ensemble spread. In ALADIN-LAEF, the ensemble spread is quite large for certain lead times, ranging from a large overestimation of the observed precipitation amounts to a large underestimation. This contrasts with AROME-EPS, which shows a much smaller range of precipitation amounts. This difference in the spread is very likely due to the large influence of the multi-physics configuration in ALADIN-LAEF, compared with the single physics configuration of AROME-EPS.
In summary, Fig. 5 shows that the ability of the models to forecast the daily precipitation cycle is influenced by both the topography and the type of synoptic forcing. Additionally, there is a general tendency of the finer model, AROME-EPS, to forecast higher precipitation amounts with a temporal maximum later in the day than ALADIN-LAEF. The latter, on the other hand, exhibits a larger variety of simulations, visible through the larger spread, especially over mountainous terrain. In the following, we discuss several scores (Brier score, SAL scores, and FSS) to demonstrate how the differences in the diurnal precipitation cycle influence forecast quality.
Time evolution of the Brier score components, reliability (top), resolution (center), and uncertainty (bottom), with confidence intervals (shades) for region Austria, AROME-EPS (dotted line), and ALADIN-LAEF (dashed line). The results are shown for a precipitation threshold of 0.1 mm/3 h. Left panels depict results for days with strong synoptic forcing, right panels results for days with weak synoptic forcing.
Figure 6 shows the differences in the three components of BS (reliability, resolution, and uncertainty) for strong and weak synoptic forcing with different precipitation thresholds for region Austria. BS measures the accuracy of probability forecasts and is the equivalent of the mean squared error (MSE) for deterministic forecasts. The value for perfect forecasts is zero. BS has its largest values for the lowest precipitation threshold of 0.1 mm/3 h and decreases for larger thresholds. This is also true for the differences of BS between AROME-EPS and ALADIN-LAEF. However, BS is dominated by the uncertainty component, which is independent of the forecast system and depends only on the observations. Therefore, the components are shown in Fig. 6, as they provide a more detailed insight into forecast performance than the overall quantity BS.
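The relation between BS and its three components can be sketched as follows, using the standard decomposition BS = reliability − resolution + uncertainty. The function name `brier_decomposition` and the sample data are hypothetical. The sketch assumes that forecast probabilities take a small set of discrete values (for instance, the fraction of ensemble members exceeding a threshold), so that forecasts can be grouped by their exact probability value.

```python
from collections import defaultdict


def brier_decomposition(probs, outcomes):
    """Return (bs, reliability, resolution, uncertainty) for binary events.

    probs    : forecast probabilities of the event (discrete values assumed)
    outcomes : 1 if the event was observed, 0 otherwise
    """
    n = len(probs)
    base_rate = sum(outcomes) / n
    # Group observed outcomes by the distinct forecast probability values.
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)
    # Reliability: squared gap between forecast probability and the
    # observed frequency in each group (lower is better).
    reliability = sum(
        len(obs) * (p - sum(obs) / len(obs)) ** 2 for p, obs in groups.items()
    ) / n
    # Resolution: squared gap between the group frequencies and the overall
    # base rate (higher is better).
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in groups.values()
    ) / n
    # Uncertainty depends on the observations only.
    uncertainty = base_rate * (1.0 - base_rate)
    bs = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / n
    return bs, reliability, resolution, uncertainty


# Hypothetical probabilities for "precipitation >= 0.1 mm/3 h" and outcomes.
bs, rel, res, unc = brier_decomposition(
    [0.0, 0.0, 0.5, 0.5, 1.0, 1.0], [0, 0, 0, 1, 1, 1]
)
print(bs, rel - res + unc)
```

Because the uncertainty term is computed from the observations alone, it is identical for any two forecast systems verified against the same sample, which is why differences between systems appear only in reliability and resolution.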
The unequal diurnal variations of uncertainty for days with strong synoptic
forcing and days with weak synoptic forcing are clearly visible in panels e
and f, respectively, in Fig. 6. The relatively constant values of
uncertainty for strong synoptic forcing and the differences between
afternoon (
Time evolution of SAL scores for AROME-EPS (left) and ALADIN-LAEF
(right) for different forecast ranges in region West. Upper panels
The results of the resolution component depicted in panels c and d show very similar daily variations compared to uncertainty. Generally, larger resolution values are preferable for any forecast system. However, low resolution does not necessarily mean that the forecasts are generally wrong, as during the morning hours of days with weak synoptic forcing (panel d in Fig. 6). Rather, it reveals that the models keep forecasting low values of precipitation probability regardless of whether no rain or a little rain is reported. However, if the observation sample itself consists mainly of no-rain values, results of resolution are less meaningful than for situations with a more balanced distribution of observations. The latter is the case between noon and early night hours for days with weak synoptic forcing and for the whole day for days with strong synoptic forcing. For these periods, we can observe mostly higher resolution for the forecasts of AROME-EPS than for ALADIN-LAEF, although the differences are not significant. The lower resolution values for ALADIN-LAEF are presumably due to the smoother precipitation fields compared to AROME-EPS. The smoothness leads to rather medium precipitation probabilities in large areas, which is a disadvantage with regard to resolution compared to sharper forecasts near 0 and 1 (i.e., very low and very high probabilities for rainfall).
The most obvious differences between ALADIN-LAEF and AROME-EPS can be
observed for the reliability component (Fig. 6a and b). They
can, for the most part, be explained by the time shift between forecast and
observation, i.e., by the fact that the precipitation generally starts too
early in ALADIN-LAEF forecasts (see Fig. 5a and b). Both models
show good (i.e., low values of) reliability during the nighttime and the
morning hours (
The variability of SAL scores with lead time gives insight into the performance of AROME-EPS and ALADIN-LAEF in terms of the structure, amplitude, and location of the predicted precipitation events. Figures 7 and 8 show the SAL scores for the mountainous region West and the lowland region Northeast, respectively. The distributions of SAL values are sampled for the individual ensemble members and classified into days with strong (panels a and b) and weak synoptic forcing (panels c and d). These values differ from those based on the ensemble mean and median forecasts as the averaging produces more smoothed precipitation events, and hence has an influence on the properties described by the SAL method.
Same as in Fig. 7, but for region Northeast.
In both geographic regions and for both types of synoptic forcing, the structure score is lower for AROME-EPS than for ALADIN-LAEF, which is, inter alia, a consequence of the model resolution (Wittmann et al., 2010). AROME-EPS produces precipitation events that are mostly too small and/or too peaked, whereas precipitation objects in ALADIN-LAEF are too large and flat. This is particularly true for days with strong synoptic forcing and for flat terrain. The structure score for ALADIN-LAEF further shows a pronounced diurnal variation for region West, where precipitation events are too large during the day (09:00–15:00 UTC), but more realistic during evening and nighttime. In region Northeast and in weak synoptic forcing, on the contrary, there is a rather damped diurnal variation. This is a sign that precipitation events emerge too early and grow too large over the mountains, whereas over flat land, they are too flat and too widespread during the whole day. AROME-EPS generally shows better agreement with the observed precipitation structures than ALADIN-LAEF around noon (12:00–15:00 UTC), while objects are much too small during the rest of the day. Only on days with strong synoptic forcing and over mountainous terrain does AROME-EPS mostly underestimate the dimension of precipitation events. Over flat land, structure scores are also variable for AROME-EPS, but do not show as clear a daily cycle as for the mountainous areas.
In most instances, the amplitude component reflects the findings shown in Fig. 5, being more apparent for days with weak than for days with strong synoptic forcing. For both EPS models, an overestimation occurs during noon over mountainous terrain (region West; Fig. 7), which is associated with the early onset of convection for ALADIN-LAEF and with the overestimation of precipitation amounts in AROME-EPS. In region Northeast (Fig. 8), the agreement seems to be much better for days with strong synoptic forcing than with weak synoptic forcing. However, the amplitude score measures the agreement in terms of the percentage share of precipitation amounts. Hence, if the amounts are on a much lower level, as in the case of weak synoptic forcing, amplitude scores appear worse. The large amplitude errors in Fig. 8c and d are, therefore, more dependent on the time shift between simulated and observed peaks of precipitation intensities than on the absolute amount of maximum precipitation intensities, which are fairly well captured.
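The dependence of the amplitude score on the overall precipitation level can be illustrated with a minimal sketch. It assumes the common definition of the SAL amplitude component as the difference of the domain-mean precipitation of forecast and observation, normalized by their average. The function names and the sample fields are hypothetical.

```python
def domain_mean(field):
    """Mean of a 2-D field given as a list of rows."""
    values = [v for row in field for v in row]
    return sum(values) / len(values)


def sal_amplitude(forecast, observed):
    """Amplitude component A, bounded by -2 and +2 (0 means no bias)."""
    mean_fc = domain_mean(forecast)
    mean_ob = domain_mean(observed)
    if mean_fc + mean_ob == 0.0:
        raise ValueError("amplitude is undefined when both fields are dry")
    return (mean_fc - mean_ob) / (0.5 * (mean_fc + mean_ob))


# Same absolute error of 0.5 mm, but at two different precipitation levels
# (hypothetical values in mm/3 h).
wet_obs, wet_fc = [[1.0, 1.0], [1.0, 1.0]], [[1.5, 1.5], [1.5, 1.5]]
dry_obs, dry_fc = [[0.1, 0.1], [0.1, 0.1]], [[0.6, 0.6], [0.6, 0.6]]
print(sal_amplitude(wet_fc, wet_obs))  # moderate relative overestimation
print(sal_amplitude(dry_fc, dry_obs))  # much larger relative overestimation
```

With identical absolute errors, the relative measure is considerably larger when the observed amounts are small, which is the effect described above for days with weak synoptic forcing.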
The location score provided by SAL shows less variability in both regions than the other two components. Nevertheless, an investigation of the distances between observed and forecast centers of mass of the precipitation events can provide useful information. Figure 9a and b show the mean distances for objects pertaining to precipitation thresholds of 0.1 mm/3 h and of 2 mm/3 h, respectively, for days with strong synoptic forcing. In general, the distances get shorter with increasing thresholds. This indicates that both ALADIN-LAEF and AROME-EPS are more successful for more intense precipitation events. On the other hand, precipitation objects with very low intensities can be either very small and randomly distributed, which is difficult to predict, or very large, which is easier to predict or detect.
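A distance between observed and forecast centers of mass can be computed along the following lines. This is a simplified sketch that treats all grid points at or above the threshold as one precipitation-weighted object and assumes a regular grid with spacing `grid_km`. The function names are hypothetical, and the object definition used in this study may differ.

```python
import math


def center_of_mass(field, threshold):
    """Precipitation-weighted center (row, col) of all points >= threshold."""
    total = weighted_row = weighted_col = 0.0
    for i, row in enumerate(field):
        for j, value in enumerate(row):
            if value >= threshold:
                total += value
                weighted_row += i * value
                weighted_col += j * value
    if total == 0.0:
        return None  # no precipitation at or above the threshold
    return weighted_row / total, weighted_col / total


def centroid_distance_km(forecast, observed, threshold, grid_km=1.0):
    """Distance (km) between forecast and observed centers of mass."""
    c_fc = center_of_mass(forecast, threshold)
    c_ob = center_of_mass(observed, threshold)
    if c_fc is None or c_ob is None:
        return None
    return grid_km * math.hypot(c_fc[0] - c_ob[0], c_fc[1] - c_ob[1])


# Hypothetical 3 x 6 fields (mm/3 h): the forecast shower is displaced
# by three grid points to the east.
observed = [[0, 0, 0, 0, 0, 0], [0, 3, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
forecast = [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 3, 0], [0, 0, 0, 0, 0, 0]]
print(centroid_distance_km(forecast, observed, threshold=2.0, grid_km=1.0))
```

Returning `None` when either field has no point above the threshold keeps such cases out of any average, rather than counting them as a distance of zero.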
Distances (km) between the centers of mass of observed and
forecast precipitation objects for AROME-EPS (dotted) and ALADIN-LAEF
(dashed) for thresholds of
For higher thresholds, Fig. 9b shows that the distances have more
variability with time. Although distances are short for earlier hours of the
forecast (and the first half of the day), they increase for later forecast
hours and reach a maximum at
The fractions skill score (FSS) indicates how well the ensemble systems
predict precipitation at different spatial scales. The grid box widths (1–21 km, corresponding to areas of 1–441 km
Figure 10a and b compare the FSSs for days with strong synoptic forcing
and days with weak forcing. FSS values are greater (
FSSs for
For all weather situations, ALADIN-LAEF shows better values for the lowest thresholds of 0.1 and 0.5 mm. The converse result is observed for higher thresholds above 2 mm. For 5 mm/3 h, ALADIN-LAEF has hardly any skill on the very fine scales for days with weak synoptic forcing. This means that small, scattered showers and thunderstorms, which typically occur on these days, cannot be simulated well by the model with coarser model resolution. In AROME-EPS, there is at least a certain skill for small intense precipitation events, although it is not on a level considered to be useful.
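The scale dependence of the FSS can be illustrated with a small sketch for a single deterministic field. It assumes the usual formulation based on neighborhood fractions of threshold exceedance, an odd box width in grid points, and points outside the domain counted as "no event". The function names and sample fields are hypothetical, and the treatment of ensemble members in this study is not reproduced here.

```python
def neighborhood_fractions(field, threshold, width):
    """Fraction of points >= threshold in a width x width box per grid point."""
    rows, cols = len(field), len(field[0])
    half = width // 2  # width is assumed to be odd
    fractions = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            hits = 0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii, jj = i + di, j + dj
                    # Points outside the domain count as "no event".
                    if 0 <= ii < rows and 0 <= jj < cols and field[ii][jj] >= threshold:
                        hits += 1
            out_row.append(hits / (width * width))
        fractions.append(out_row)
    return fractions


def fss(forecast, observed, threshold, width):
    """Fractions skill score: 1 is perfect, 0 means no skill at this scale."""
    p_fc = neighborhood_fractions(forecast, threshold, width)
    p_ob = neighborhood_fractions(observed, threshold, width)
    mse = sum((a - b) ** 2 for rf, ro in zip(p_fc, p_ob) for a, b in zip(rf, ro))
    ref = sum(a * a for r in p_fc for a in r) + sum(b * b for r in p_ob for b in r)
    return 1.0 - mse / ref if ref > 0 else float("nan")


# Hypothetical 5 x 5 fields: one intense shower, displaced by two grid points.
observed = [[0.0] * 5 for _ in range(5)]
forecast = [[0.0] * 5 for _ in range(5)]
observed[2][1] = 6.0
forecast[2][3] = 6.0
for width in (1, 3, 5):
    print(width, fss(forecast, observed, threshold=5.0, width=width))
```

At the grid scale the displaced shower has no skill, whereas skill grows once the box is wide enough to contain both the forecast and the observed event.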
These results are comparable to the main outcomes of Le Duc et al. (2013)
and Schwartz et al. (2009). Le Duc et al. (2013) also found that the coarser
10 km ensemble showed slightly better results for light rains than the finer
2 km one. Both models had lower skill in predicting heavy rain; however, for
the higher precipitation thresholds, the 2 km ensemble performed better than
the 10 km one. Schwartz et al. (2009) partially found the same behavior of
FSS for coarse 12 km and fine models (2 and 4 km resolution). The coarser
model clearly outperformed the finer ones for light rain, whereas the 4 km
model showed better skill at a high threshold of 5 mm h
In the previous sections, the discussion provided an overview of the whole 3-month period. In the following section, the evaluation focuses on a single selected day in order to show the forecast behavior of the ensembles in a concrete weather situation.
A typical convective day with weak synoptic forcing is selected to show the evolution of precipitation in AROME-EPS and ALADIN-LAEF in more detail. Here, more emphasis is put on the observation of the numbers, volumes, and distribution of the precipitation objects.
Figure 11 illustrates the precipitation at different times on 29 April 2014 from the INCA analyses and the ensemble means of AROME-EPS and ALADIN-LAEF. On this day, continuous light rain was reported in Austria's mountainous terrain near the main Alpine ridge during the morning hours, as shown in the first row of Fig. 11. At the same time, the lowlands in the east and north were dry. In the lowlands, precipitation activity in the form of small showers started from approximately 11:00 UTC, as shown in the second row of Fig. 11. Over the course of the day, the focus of precipitation shifted increasingly to the flat lands in the north, east, and southeast of Austria as well as to Slovenia and northern Italy. The peak rain intensity was around 15:00 UTC, shown at 14:00 UTC in the third row of Fig. 11. Rain in the inner Alpine areas had diminished. In contrast, the showers in the flat regions continued until the time of sunset. Then, their activity also weakened, which is visible in the bottom row of Fig. 11.
Observed (INCA, first column) and forecast (AROME-EPS and ALADIN-LAEF, second and third columns, respectively) development of precipitation on 29 April 2014 shown for selected times (rows). The panels show 1-hourly accumulated precipitation sums (mm).
Characteristics of the precipitation forecasts of ALADIN-LAEF and
AROME-EPS on 29 April 2014.
Figure 12 gives the characteristics of the precipitation forecasts of ALADIN-LAEF and AROME-EPS, such as the temporal evolution of the mean areal precipitation in Fig. 12a, the number of precipitation objects in Fig. 12b, and the temporal evolution of the SAL scores in Fig. 12c. For the selected day, precipitation amounts for the region Austria are slightly underestimated by both ensemble systems. Further, only a minor fraction of ensemble members reach the observed precipitation intensities at noon. Investigating the structures of the precipitation forecasts provides further insight into the behavior of the ensemble systems. The number and volume of precipitation objects describe how models perform in a spatial context. In this respect, AROME-EPS clearly shows more ability to replicate the real spatial structure of precipitation. Although the number of objects in the region Austria is too low during the first forecast hours, the further development as observed by the INCA analysis in Fig. 12b is described well. In the ALADIN-LAEF forecast, the number of precipitation objects is very low and mostly a product of the lower resolution. The volumes of the precipitation events are directly connected with their number (not shown). ALADIN-LAEF overestimates the volumes to the same degree as it underestimates their numbers. However, it shows a clear diurnal variation of the volumes with a maximum around noon, which is not indicated by AROME-EPS.
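Counting precipitation objects in a gridded field can be sketched as connected-component labeling of all grid points at or above a threshold. The sketch below assumes 4-connectivity and a fixed threshold. The function name `count_objects` and the sample fields are hypothetical, and the object identification used for Fig. 12b may follow different rules.

```python
def count_objects(field, threshold):
    """Count 4-connected regions of grid points with values >= threshold."""
    rows, cols = len(field), len(field[0])
    seen = [[False] * cols for _ in range(rows)]
    n_objects = 0
    for i in range(rows):
        for j in range(cols):
            if field[i][j] < threshold or seen[i][j]:
                continue
            # Found an unvisited wet point: start a new object and flood-fill.
            n_objects += 1
            seen[i][j] = True
            stack = [(i, j)]
            while stack:
                r, c = stack.pop()
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= rr < rows and 0 <= cc < cols
                            and not seen[rr][cc] and field[rr][cc] >= threshold):
                        seen[rr][cc] = True
                        stack.append((rr, cc))
    return n_objects


# Hypothetical fields (mm/h): scattered showers versus one smooth rain area.
scattered = [[2, 0, 0, 3], [0, 0, 0, 3], [1, 1, 0, 0]]
smooth = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
print(count_objects(scattered, threshold=1), count_objects(smooth, threshold=1))
```

A smooth field of light rain yields a single large object, whereas scattered showers yield several small ones, which mirrors the contrast between the two ensembles described above.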
The fact that ALADIN-LAEF tends to produce fewer but larger precipitation objects does not lead to worse verification statistics for ALADIN-LAEF. On the contrary, in most regions, the hit rate is higher for ALADIN-LAEF than for AROME-EPS and the number of missed events is lower. AROME-EPS, on the other hand, outperforms ALADIN-LAEF in terms of correct negatives and false alarms (not shown).
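The trade-off between hits and false alarms for broad versus peaked forecasts can be illustrated with a point-by-point contingency table. The following sketch uses hypothetical function names and invented sample values. It only demonstrates how the four categories are counted and is not a reproduction of the verification carried out here.

```python
def contingency_table(forecast, observed, threshold):
    """Count hits, misses, false alarms and correct negatives point by point."""
    counts = {"hits": 0, "misses": 0, "false_alarms": 0, "correct_negatives": 0}
    for fc, ob in zip(forecast, observed):
        fc_yes, ob_yes = fc >= threshold, ob >= threshold
        if fc_yes and ob_yes:
            counts["hits"] += 1
        elif ob_yes:
            counts["misses"] += 1
        elif fc_yes:
            counts["false_alarms"] += 1
        else:
            counts["correct_negatives"] += 1
    return counts


def hit_rate(counts):
    """Fraction of observed events that were also forecast."""
    return counts["hits"] / (counts["hits"] + counts["misses"])


# Hypothetical precipitation (mm) at eight grid points.
observed = [0, 0, 2, 0, 0, 3, 0, 0]
broad = [1, 1, 1, 1, 1, 1, 0, 0]   # few large, flat rain areas
peaked = [0, 0, 0, 2, 0, 3, 0, 0]  # small, intense, partly displaced showers
for name, fc in (("broad", broad), ("peaked", peaked)):
    table = contingency_table(fc, observed, threshold=0.5)
    print(name, table, hit_rate(table))
```

In this toy case the broad forecast captures every observed event at the cost of more false alarms, while the peaked forecast misses a displaced shower but produces more correct negatives.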
These results are also reflected in the temporal evolution of SAL scores in
Fig. 12c. As expected, the structure score
Interestingly, there is a late peak in the
In this paper, we investigate the forecast performance of the 2.5 km convection-permitting ensemble AROME-EPS by comparison with the regional 11 km ensemble ALADIN-LAEF to reveal the benefit provided by a CP EPS. The regional EPS, ALADIN-LAEF, involves several sources of forecast perturbations, such as initial condition perturbations by blending ECMWF-EPS with ALADIN-LAEF breeding vectors and assimilation of perturbed surface observations, and a multi-physics scheme. The high-resolution, convection-permitting AROME-EPS solely performs downscaling of the ALADIN-LAEF forecasts. The performance of the ensembles is evaluated for a 3-month period during the convective season of 2011 and for a typical convective day in April 2014 with a special focus on precipitation events in mountainous terrain and lowland regions. The aim is to show whether the convection-permitting ensemble provides benefits over the regional ensemble with deep convection parameterization. The evaluation is conducted using a combination of standard deterministic and probabilistic verification scores and selected spatial verification measures. The former are applied to several main forecast parameters for surface and upper levels, and the latter, according to their definition, only to precipitation.
The forecast quality for the main meteorological parameters (except precipitation) for the surface and selected upper levels is strongly dependent on the model bias and is rather balanced, except for diurnal variations near the surface. However, characteristic differences are revealed by the investigation of the precipitation forecasts. A known drawback of models using deep convection schemes proves true, which is the premature onset of precipitation in the daily cycle by ALADIN-LAEF (see, e.g., Wittmann et al., 2010; Weusthoff et al., 2010). On the other hand, an overestimation of precipitation intensities at the peak of convection activities by AROME-EPS is also confirmed, which has been assumed in previous validations. Both of these properties are found to be more pronounced in mountainous than in flat regions.
ALADIN-LAEF shows skill in the prediction of probabilities for low precipitation thresholds, i.e., in distinguishing between rain and no rain. This is also true for small scales, but it is again dependent on the time of day, as the early onset of precipitation has a negative influence on the verification scores. AROME-EPS, on the other hand, has a better ability to capture the diurnal cycle of convective precipitation, especially over mountainous terrain. At small spatial scales, it further demonstrates better performance for higher precipitation thresholds. The results of the evaluations in this study lead to the conclusion that the convection-permitting ensemble is more skillful in the precipitation forecast than its mesoscale counterpart, the regional ensemble. The positive impact is larger for the mountainous areas than for the lowlands. Nevertheless, it is important to know which precipitation situations can be better modeled by the convection-permitting ensemble. For many applications, e.g., for large-scale extreme events, such as the central Europe flooding event of 2013, the best solution will be a combination of both systems: the coarser ensembles with longer forecast range for (pre)warnings and the convection-permitting ensemble for the detailed specification of the expected event. Considering different time and length scales in that way could lead to the generation of seamless forecast products (e.g., Drobinski et al., 2014; Vitart et al., 2008).
This study is considered a starting point for further investigations and improvement of the convection-permitting ensemble AROME-EPS. The low spread of the current AROME-EPS version is a clear drawback compared to ALADIN-LAEF. Therefore, future enhancements of AROME-EPS will involve components which will presumably increase ensemble spread. Among those upgrades will be ensemble data assimilation and physics perturbations (multimodel and stochastic). The expectation with these components is that forecast errors will be reduced, and that a more realistic simulation of forecast uncertainties will be achieved.
The ALADIN-LAEF and AROME codes, including all related intellectual property rights, are owned by the members of the LACE consortium and ALADIN consortium. Access to the ALADIN-LAEF and AROME systems, or elements thereof, can be granted upon request and for research purposes only. INCA code and INCA data are only available subject to a licence agreement with ZAMG.
We gratefully acknowledge all the LACE/ALADIN/HIRLAM colleagues who have contributed to the development of AROME. ECMWF has provided the computer facilities and technical help implementing ALADIN-LAEF and AROME-EPS on the ECMWF HPCF.
Edited by: A. Kerkweg
Reviewed by: two anonymous referees