Model evaluation by a cloud classification based on multi-sensor observations

The detailed understanding of clouds and their macrophysical properties is crucial to reduce uncertainties of cloud feedbacks and related processes in current climate and weather prediction models. Comprehensive evaluation of cloud characteristics using observations is the first step towards any improvement. An advanced observational product was developed by the Cloudnet project. A multi-sensor synergy of active and passive remote-sensing instruments is used to generate a Target Classification providing detailed information about cloud phase and structure. Nevertheless, this valuable product is only available for observations and there is as yet no comparable surrogate for models. Therefore, a new cloud classification algorithm is presented to calculate a comparable classification for models by using the temperature, dew point and all hydrometeor profiles. The study explains the algorithm and shows possible evaluation methods making use of the new synthetic cloud classification. For example, the statistics of the vertical cloud distribution as well as the accuracy of cloud forecasts can be investigated regarding different cloud types. The algorithm and methods are tested, as an example, on two months of operational weather forecast data of the COSMO-DE model and compared to a Cloudnet supersite in Germany. Additionally, the cloud classification is applied to Large Eddy Simulations with a resolution similar to that of the observations, showing detailed cloud structures.


Introduction
Clouds and their related processes are still responsible for the highest uncertainties of current climate and weather prediction models (IPCC, 2007; Forster et al., 2007). They are of great importance for accurate weather predictions to various end users and applications like solar power forecasts (Huang and Thatcher, 2017; Antonanzas et al., 2016; Sperati et al., 2016), the aviation sector (Bolgiani et al., 2018; Gultepe et al., 2015) and many more. Nevertheless, the evaluation of the macrophysical cloud properties of current atmospheric models is still very challenging due to the complexity of the involved processes and the large variability of clouds.

Data, Methods and Cloud Classification Algorithm
The observations and derived Cloudnet products are obtained by the Leipzig Aerosol and Cloud Remote Observations System (LACROS) supersite of the HOPE campaign (Macke et al., 2017) for April and May 2013 (SAMD, 2018). This supersite was located at a sewage plant near Krauthausen, Germany (40 km west of Cologne). Various cloud types like low-level cumulus clouds, high cirrus clouds, large precipitating ice clouds and several frontal passages were captured during these two months, depicting a large variety of typical synoptic situations. The Cloudnet products have a time resolution of 30 seconds and a height resolution of 30 m. The data are available from roughly 200 m above ground up to 15 km height due to the remote-sensing measurement characteristics.
The operational COSMO-DE model of the German Meteorological Service (DWD) is evaluated (Baldauf et al., 2011). This cloud-resolving model runs at a horizontal resolution of 2.8 km and contains 51 height levels with terrain-following hybrid coordinates. The layer thickness increases with height up to the upper edge of the model at 22 km. The 1-hourly output

Microphysical processes are parameterised by a 1-moment bulk formulation (Baldauf et al., 2011)

The official Cloudnet algorithm differentiates between eleven categories, which are "Aerosol & insects", "Insects", "Aerosol", "Melting ice & cloud droplets", "Melting ice", "Ice & supercooled droplets", "Ice", "Drizzle/rain & cloud droplets", "Drizzle or rain", "Cloud droplets only" and "Clear sky". The aerosol and insect categories are not diagnosed by the presented cloud classification algorithm, because most atmospheric models like COSMO-DE do not provide any information about them.
Therefore, these categories are set to "Clear sky" in the observations and simulations. The remaining eight cloud classes are defined consistently with the Cloudnet algorithms where possible. For example, the temperature has to be above the freezing level for liquid phases such as "Drizzle or rain", and the dew point below 273.15 K for ice classes like "Ice & supercooled droplets". The snow and graupel hydrometeors are also classified as ice because of their ice phase and characteristics. Other distinctions, e.g. between "Drizzle or rain" and "Drizzle/rain & cloud droplets", are set in the cloud classification algorithm by the specific cloud water content QC. In contrast, the Cloudnet algorithm is based on remote-sensing observations and thus uses, e.g., thresholds of the cloud radar reflectivity and the LiDAR attenuated backscatter coefficient to differentiate between both classes. Nevertheless, a comparable diagnosis is developed to generate a synthetic cloud classification for the model that is consistent with Cloudnet.
The algorithm itself works on every grid box as follows. The category of "Ice" is, e.g., determined by a dew point below the freezing point and a dominant concentration of cloud ice, defined by an ice concentration QI larger than a certain threshold and a cloud water concentration QC below another fixed threshold. The "Melting ice" category is, e.g., defined by a dew point greater than the freezing point and QI greater than the critical hydrometeor concentration. All rules are compiled in the flowchart in Fig. 2. The order of the case selection statements is crucial to get physically consistent results. If, for example, a grid box is first checked for "Drizzle/rain & cloud droplets" with a temperature above the freezing level and QR and QC larger than a certain threshold, and afterwards examined for "Drizzle or rain", for which only the first two conditions have to be fulfilled, all "Drizzle/rain & cloud droplets" cases would be classified only as "Drizzle or rain". Similar situations exist for other case selection statements like "Melting ice" and "Melting ice & cloud droplets". The algorithm can easily be extended by additional case differentiations including, for example, information about the surrounding grid boxes or aerosols & dust, if they are provided by the model.

For this, the categories of "Drizzle/rain & cloud droplets" and "Cloud droplets" are merged to a new category named "Liquid clouds". The categories of "Ice", "Ice & supercooled droplets", "Melting ice" and "Melting ice & cloud droplets" are combined to "Ice clouds". The previously mentioned classes are also merged because "Ice clouds" mainly consist of "Ice" and "Liquid clouds" mainly of "Cloud droplets only". The categories of "Clear sky" and "Drizzle or rain" are not modified. The time-averaged frequency of occurrence profiles for all four cloud categories are depicted in Fig. 4.
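A minimal Python sketch of the case selection for a single grid box is given here to illustrate why the order of the checks matters. It is not the authors' implementation: the single threshold `Q_MIN` and its value, the summation of cloud ice, snow and graupel into one ice content, the handling of rain at temperatures below freezing, and the function name `classify_grid_box` are assumptions made for illustration. The authoritative rules are those of the flowchart in Fig. 2.

```python
FREEZING_K = 273.15  # freezing point in kelvin
Q_MIN = 1e-8         # kg/kg; hypothetical threshold, the source gives no values


def classify_grid_box(temp_k, dew_point_k, qc, qi, qr, qs=0.0, qg=0.0,
                      q_min=Q_MIN):
    """Return one of the eight cloud categories for a single grid box.

    Mixed classes are tested before the corresponding single-phase classes,
    so the first matching rule is always the most specific one.
    """
    # Snow and graupel are treated as ice (assumption: contents are summed).
    has_ice = (qi + qs + qg) > q_min
    has_cloud_water = qc > q_min
    has_rain = qr > q_min
    cold = dew_point_k < FREEZING_K
    warm = temp_k > FREEZING_K

    if has_ice and cold and has_cloud_water:
        return "Ice & supercooled droplets"
    if has_ice and cold:
        return "Ice"
    if has_ice and has_cloud_water:
        return "Melting ice & cloud droplets"
    if has_ice:
        return "Melting ice"
    if has_rain and warm and has_cloud_water:
        return "Drizzle/rain & cloud droplets"
    if has_rain and warm:
        return "Drizzle or rain"
    if has_cloud_water:
        return "Cloud droplets only"
    # Everything else, including cases the text does not specify
    # (e.g. rain without ice below freezing), falls back to clear sky here.
    return "Clear sky"
```

If the "Drizzle or rain" test were placed before the "Drizzle/rain & cloud droplets" test in this chain, the mixed class could never be reached, which is the ordering problem described in the text.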
The analysis points out model biases, seen for example in a continuous overestimation of "Ice clouds" by up to 30 % above 3 km and of "Liquid clouds" by up to 3 % between 1 and 3 km, which could not be determined from the cloud fraction statistics alone. The "Ice clouds" overestimation thus explains the differences found for the "Clear sky" category, which is underestimated by a similar magnitude of roughly 30 %. This also confirms the previously stated results of the qualitative comparison. Nevertheless, especially at high altitudes, the lower sensitivity of the remote-sensing instruments has to be considered, which could reduce the observed frequency of occurrence of "Ice clouds". The choice of merging the categories for "Ice clouds" does not affect the overall conclusions because of the small number of cases in other categories, like "Drizzle/rain & cloud droplets" with only 251 occurrences out of 47,576 samples.
The profile of "Drizzle or rain" fits the observations well, which was also seen in the qualitative comparison. Precipitation or at least drizzle was present for roughly 20 % of the time during the two months. The missing "Drizzle or rain" within the lowest layers of the observations is due to the remote-sensing instruments, which start measuring roughly 200 m above ground.
COSMO-DE captures this category with a frequency of occurrence of 20 % down to the ground. Overall, the generally good statistics of the COSMO-DE, except for the mismatches found for "Ice clouds", are shown by the similar shape of the distributions of the four distinct categories.

Point-to-Point Verification
Precise cloud predictions for the right time and place are of high importance for various applications like general weather forecasts or even for the radiative budget and all related quantities of the model itself. The cloud classification contains, for every height interval of each time step, information about the cloud phase, which can be directly compared between the observed and the modelled classification. This makes it possible to analyse whether the model predicts the correct cloud type at the same time and location as the measurements, even though an exact match is not expected due to the chaotic characteristics of the atmosphere.
The evaluation of the cloud composition is, e.g., crucial to investigate their radiative properties and derived quantities. Certain mismatches of the multiple cloud classes can furthermore indicate model shortcomings. For example, "Cloud droplets" of the

As an example, the COSMO-DE Cloud Classification is compared pointwise with the LACROS observations, and the results are discussed. In total, 47,576 points with 8 different cloud categories are analysed, which results in an 8 x 8 contingency table (Tab. 1). The table depicts how often the model and the observations contain the same category at the same time, place and height, and how often a different category is predicted than measured. Therefore, hits on the diagonal are the optimum, where the model matches the observations.
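The construction of such a contingency table can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the layout (rows are observed categories, columns are modelled categories) are assumptions, and the row-wise normalisation follows the statement that the numbers are normalised by the total number of observed events of each category.

```python
def contingency_table(observed, modelled, categories):
    """Count how often each (observed, modelled) category pair occurs.

    observed and modelled are equally long sequences of category names for
    matching times and heights. Rows are observed, columns are modelled.
    """
    index = {name: i for i, name in enumerate(categories)}
    n = len(categories)
    counts = [[0] * n for _ in range(n)]
    for obs, mod in zip(observed, modelled):
        counts[index[obs]][index[mod]] += 1
    return counts


def normalise_by_observed(counts):
    """Convert counts to percent of the observed events in each row."""
    table = []
    for row in counts:
        total = sum(row)
        # Categories that were never observed get a row of zeros.
        table.append([100.0 * v / total if total else 0.0 for v in row])
    return table
```

With this layout, the diagonal of the normalised table holds the hit rate of each category, and the off-diagonal entries of a row show which categories the model predicted instead.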
The high number of "Clear sky" cases and the simplicity of diagnosing this category with the algorithm result in the best agreement of 84.6 %, found for "Clear sky". The overestimation of "Ice" above 3 km is reflected by a hit rate of only 75.3 %.
In addition, 12 % of the modelled "Ice" points are thus observed as "Clear sky".
General issues of atmospheric models, like the correct representation of the melting layer, are identifiable by this analysis. For example, a lower melting layer through a warmer temperature profile leads to an earlier melting of ice hydrometeors, which is seen by the observed 52.5 % of "Melting ice" points already modelled as "Drizzle or rain". Difficulties at the right distinction

Fuzzy Verification
Small displacements or time lags of cloud and precipitation forecasts often induce large errors if the model output is compared pointwise with the observations. Depending on the specific interest, for example evaluating the statistics of the cloud forecasts, fuzzy verification methods are more appropriate, as they allow the model to be uncertain in time and/or space or in further dimensions like the cloud phase. In addition, the large variability of clouds, the multi-dimensional and categorical nature of the classification, as well as the different resolutions of the observations and the model cause problems when using standard point-to-point verification metrics like BIAS and RMSE. Therefore, fuzzy verification techniques are more appropriate for the analysis of the cloud classification, as they still assess a small shift in time or place of a cloud as a correct prediction.
For each of the fuzzy analyses, e.g. being fuzzy in space by one grid box, an 8 x 8 contingency table can be calculated, which would be difficult to interpret due to the large amount of numbers. For that reason, the focus at this point is on the hit rates of the same cloud phase between the model and the measurement, to evaluate the overall accuracy of the cloud forecasts including the cloud type. Further comparisons with a random forecast or a bootstrapping can be applied to the cloud classification to investigate the statistical significance of the results. The fuzzy verification of the cloud classification is tested, as an example, on the two-month COSMO-DE and LACROS observation dataset.
For the fuzzy verification in time, the hour before and after the observation is included, shown by the "Time" column in Tab. 2. For being fuzzy in space, first one grid box around the centre ("Space 1" column) and then three grid boxes surrounding the middle ("Space 3" column) as well as the whole extracted area of 18 x 17 grid points (50 x 48 km; "Space Full" column) are considered for the evaluation. Only the hit rates, normalised by the total number of observed events for each category separately, are regarded. As a benchmark, we calculated 10,000 times a hit rate resulting from randomly chosen observational and model grid points and computed the average. The statistical significance of the point-to-point comparison is tested by a bootstrapping with replacement applied to the cloud classification with 10,000 iterations. The bootstrapping assumes that the climatology is captured correctly, which is indicated by a good agreement of the frequencies of occurrence seen in section 3.1.
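The fuzzy hit rate can be illustrated with a short Python sketch. It rests on assumptions not stated in the text: the data layout (`observed[t][k]` holds one category, `modelled[t][k]` holds the list of categories from all model columns in the chosen spatial neighbourhood) and the function name `fuzzy_hit_rates` are hypothetical, and an observation is counted as a hit if any model value within the time window and neighbourhood at the same height has the observed category.

```python
from collections import defaultdict


def fuzzy_hit_rates(observed, modelled, time_window=0):
    """Hit rate in percent per observed category.

    observed[t][k]: observed category at time step t and height level k.
    modelled[t][k]: list of modelled categories at (t, k), one entry per
        model column in the spatial neighbourhood (one entry for a
        point-to-point comparison, nine for one surrounding grid box).
    time_window: number of time steps before and after t that may match.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    n_times = len(observed)
    for t, obs_profile in enumerate(observed):
        t_lo = max(0, t - time_window)
        t_hi = min(n_times - 1, t + time_window)
        for k, obs_class in enumerate(obs_profile):
            totals[obs_class] += 1
            candidates = set()
            for tt in range(t_lo, t_hi + 1):
                candidates.update(modelled[tt][k])
            if obs_class in candidates:
                hits[obs_class] += 1
    # Normalise by the number of observed events of each category.
    return {name: 100.0 * hits[name] / totals[name] for name in totals}
```

Setting `time_window=1` on hourly data corresponds to including the hour before and after the observation, while enlarging the lists in `modelled[t][k]` corresponds to the spatial neighbourhoods of increasing size.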
The standard deviation of the bootstrapping provides an uncertainty estimate of the analysis. All results are compiled in Tab. 2.
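Both the random benchmark and the bootstrap uncertainty can be sketched with the Python standard library. The text does not specify how the random pairs are drawn, so shuffling the modelled values against the observations is an assumption, as are the function names and the fixed seeds used to make the sketch reproducible.

```python
import math
import random
import statistics


def hit_rate(pairs, category):
    """Percent of observed `category` points where the model agrees."""
    matches = [mod == obs for obs, mod in pairs if obs == category]
    if not matches:
        return float("nan")  # category not present in this sample
    return 100.0 * sum(matches) / len(matches)


def random_forecast_hit_rate(pairs, category, n_iter=10_000, seed=0):
    """Mean hit rate when modelled values are paired with random observations."""
    rng = random.Random(seed)
    observed = [obs for obs, _ in pairs]
    modelled = [mod for _, mod in pairs]
    rates = []
    for _ in range(n_iter):
        rng.shuffle(modelled)  # destroys the pairing, keeps the climatology
        rates.append(hit_rate(list(zip(observed, modelled)), category))
    return statistics.mean(rates)


def bootstrap_std(pairs, category, n_iter=10_000, seed=0):
    """Standard deviation of the hit rate over resamples with replacement."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_iter):
        sample = rng.choices(pairs, k=len(pairs))
        rate = hit_rate(sample, category)
        if not math.isnan(rate):
            rates.append(rate)
    return statistics.stdev(rates)
```

Here `pairs` is a list of `(observed, modelled)` category tuples. A frequent category yields a high random hit rate by construction, so the skill of the forecast is better judged by the difference between `hit_rate` and `random_forecast_hit_rate` than by the hit rate alone.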
The easier prediction of frequent classes like "Clear sky" compared to rarer categories is evident from a hit rate of roughly 72.7 % for "Clear sky" already obtained by the comparison with a random forecast. Therefore, the added value for "Clear sky" is not as high as indicated by the large hit rate of the real forecast of 84.6 %. The improved forecast skill is especially visible for rare categories like "Drizzle or rain", where an increase of 44 % and a factor of eight is found compared to the random forecast. The reliability of the results, based on the bootstrapping, is higher for more frequently occurring categories like "Clear sky", with a small standard deviation of less than 0.2 % compared to 2.3 % for "Ice & supercooled droplets" with only 200 observed points.
Overall, standard deviations of at most 2.3 % support the statistical significance of the presented results.

As expected, by being fuzzy in time or space, the hit rates increase as more and more grid boxes or time steps are considered. Thus, including the whole region with 18 x 17 grid points ("Space Full" column) shows the highest hit rates. Larger variability within three hours of the model ("Time" column) is observable compared to the 9 grid boxes considered when being fuzzy by one additional grid box ("Space 1" column), seen by higher hit rates for being fuzzy in time than in space, except for "Ice & supercooled droplets". The opposite is found regarding three or more surrounding grid boxes, except for "Clear Sky". The

In general, a similar cloud structure and phase is found when comparing the modelled cloud classification to the observed one. Nevertheless, differences, for example between the categories of "Drizzle or rain" and "Drizzle/rain & cloud droplets", can be identified, which provide detailed insights into the model's microphysics. The example shows the usability of the developed cloud classification algorithm for other models as well, even at LES resolution. The other introduced cloud evaluation methods can thus also be applied, which is shown as an example by the frequency of occurrence for the four cloud classes (Fig. 6).
Rain events of the ICON model seem to be too rare and underestimated by roughly 15 %, as also found by Heinze et al. (2017). This results in an overestimation of the "Clear Sky" frequency of occurrence. The profile of "Ice clouds" is overall in good agreement with the observations. Nevertheless, an overestimation of "Ice clouds" by roughly 10-20 % above 7 km is found, which might arise from the lower sensitivity of the remote-sensing instruments and thus less observed ice at high altitudes. The low "Liquid cloud" layer during the afternoon, seen as "Drizzle or rain" by the observations, leads to an overestimation of this

Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2018-259. Manuscript under review for journal Geosci. Model Dev. Discussion started: 2 November 2018. © Author(s) 2018. CC BY 4.0 License.

frequency of occurrence. The accuracy of cloud forecasts for the right time and place is investigated by direct comparisons of the modelled and measured cloud classification, for which fuzzy verification methods are also applied. The potential of this new evaluation approach is shown by comparisons with the operational COnsortium for Small-scale MOdelling (COSMO) model for the German domain (COSMO-DE) as well as with first ICOsahedral Non-hydrostatic (ICON) Large Eddy Simulations (LES) (section 4). Finally, the results are summarised and discussed (section 5).

Figure 1. Schematic illustration of the cloud classification approach to generate a comparable Cloudnet target classification on the basis of an atmospheric model output. The original target classification is obtained by observations, using cloud radar, LiDAR and microwave radiometer measurements.
of 12 h forecasts, starting at 00 and 12 UTC, are analysed within this study. The COSMO-DE model gets its initial and hourly boundary conditions from the COSMO model setup with 7 km grid spacing covering Central Europe (COSMO-EU).
, which contains five different hydrometeor classes (specific cloud water content QC, specific cloud ice content QI, specific rain water content QR, specific snow content QS and specific graupel content QG). The closest grid point to the LACROS supersite is selected for the point-to-point comparisons, and a region of roughly 50 x 50 km² between Aachen and Düsseldorf is extracted from the model output for the fuzzy verification. The new cloud classification algorithm combines the temperature, dew point and all hydrometeor information of an atmospheric model to generate a consistent cloud classification corresponding to the Cloudnet Target Classification. The cloud classes are determined from the model output, based on physical principles, by consecutive case selections, as depicted in Fig. 2. Every grid box of the model is assessed independently by the algorithm.

Figure 2. Flowchart of the cloud classification algorithm using the output of an atmospheric model to generate a comparable Cloudnet target classification, differentiating eight different cloud categories.

Figure 3. Cloudnet Target Classification of the LACROS observations (a, c) and of the cloud classification algorithm applied to the COSMO-DE (b, d) for April (a, b) and May (c, d) 2013. The output is provided on the 51 height levels of the COSMO-DE with an hourly resolution. The observations are adapted to the common grid by using the most frequent category.
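Adapting a finely resolved classification to a coarser common grid by taking the most frequent category can be sketched as follows. The block layout is an assumption made for illustration: a fixed number of fine time steps per coarse time step, and coarse height levels given as index ranges into the fine levels (`level_edges`), which allows layers of increasing thickness. The tie-breaking rule is not stated in the text; this sketch keeps the category encountered first.

```python
from collections import Counter


def most_frequent(classes):
    """Return the most common category; ties go to the first one encountered."""
    return Counter(classes).most_common(1)[0][0]


def regrid_most_frequent(fine, steps_per_block, level_edges):
    """Coarsen fine[t][k] in time and height using the most frequent category.

    steps_per_block: number of fine time steps merged into one coarse step.
    level_edges: increasing fine-level indices; coarse level j covers the
        fine levels level_edges[j] up to level_edges[j + 1] - 1.
    """
    coarse = []
    for t0 in range(0, len(fine), steps_per_block):
        block = fine[t0:t0 + steps_per_block]
        row = []
        for lo, hi in zip(level_edges[:-1], level_edges[1:]):
            cell = [profile[k] for profile in block for k in range(lo, hi)]
            row.append(most_frequent(cell))
        coarse.append(row)
    return coarse
```

For 30 s observations and hourly model output, `steps_per_block` would be 120, and `level_edges` would be chosen so that each range of 30 m observation bins matches one model layer.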

3.1 Frequency of Occurrence of Cloud Classification
The frequencies of occurrence of the cloud classes are calculated for the observations and the model to quantify differences between both datasets and to investigate qualitative findings in more detail. In this way, the cloud statistics of the model can be evaluated considering the multiple cloud phases of the classification. The mean vertical cloud profile for the cloud classes highlights, e.g., certain biases or model shortcomings for specific cloud types like ice clouds and thus allows for an in-depth analysis of the cloud microphysics. The cloud type climatology of the model for specific locations can be assessed using a long time series. To illustrate the potential of this method, the frequency of occurrence is calculated for the COSMO-DE dataset. The eight cloud categories are merged to four because of the small number of occurrences of most categories as well as for reasons of clarity and comprehensibility. With respect to the model hydrometeors of cloud water QC, cloud ice QI, snow QS, graupel QG and rain QR, the cloud classification categories are merged to "Clear Sky", "Ice Clouds", "Liquid Clouds" and "Drizzle or rain".
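The merging of the eight categories and the resulting frequency of occurrence profile can be sketched in Python as follows. The mapping follows the merging described in the text; the data layout (`classification[t][k]` for time step t and height level k) and the function name are assumptions for illustration.

```python
from collections import Counter

# Mapping of the eight cloud categories onto the four merged classes.
MERGE = {
    "Clear sky": "Clear sky",
    "Drizzle or rain": "Drizzle or rain",
    "Cloud droplets only": "Liquid clouds",
    "Drizzle/rain & cloud droplets": "Liquid clouds",
    "Ice": "Ice clouds",
    "Ice & supercooled droplets": "Ice clouds",
    "Melting ice": "Ice clouds",
    "Melting ice & cloud droplets": "Ice clouds",
}


def frequency_of_occurrence(classification):
    """Return {merged class: [fraction of time steps per height level]}.

    classification[t][k] is the eight-class category at time t, level k.
    """
    n_times = len(classification)
    n_levels = len(classification[0])
    profiles = {name: [0.0] * n_levels for name in set(MERGE.values())}
    for k in range(n_levels):
        counts = Counter(MERGE[classification[t][k]] for t in range(n_times))
        for name in profiles:
            profiles[name][k] = counts[name] / n_times
    return profiles
```

Applying the same function to the observed and the modelled classification on the common grid gives two sets of profiles whose difference at each level is the bias in frequency of occurrence for that cloud class.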

Figure 4. Frequency of occurrence (0-12 km) for April and May 2013 of the Clear Sky (a), Drizzle or Rain (b), Liquid Clouds (c) and Ice Clouds (d) categories for the LACROS Cloudnet target classification (black solid line) and for the COSMO-DE cloud classification (black dashed line). Note the different axes.
which are classified as "Drizzle or rain" by the observations suggest, e.g., problems in the rain formation process.
of the eight different cloud categories are visible, e.g., from the 31 % of observed "Drizzle/rain & cloud droplets" points that are categorised as "Drizzle or rain" in the modelled classification. The underestimation of "Liquid clouds" (Sect. 3.1) is confirmed by 55.3 % of the modelled "Clear sky" points measured as "Cloud droplets". Only a few observations were made of rare categories like "Drizzle/rain & cloud droplets" with 251 points or even "Ice & supercooled droplets" with 200 points out of 47,576. Therefore, capturing those categories correctly is very difficult for the model because of the small sample size, and thus only low hit rates are obtained, like 17 % for "Ice & supercooled droplets".
eight different cloud categories of the observed LACROS Cloudnet target classification and of the COSMO Cloud Classification of April and May 2013.The hit rates are normalized by the total number of observed events (first column) of each category.The point to point comparison results are at the first numbers of the first rows of the second column.The numbers at the second rows of the second column are the mean hit rates of randomly chosen points from 10,000 iterations.The second numbers of the first rows at the second column are the standard deviation calculated by a bootstrapping with 10,000 iterations.The hit rates of the third column are determined by being fuzzy in time by one hour.The hit rates of being fuzzy in space are at the fourth to sixth column.For more details, see text.

biggest rise in the hit rates for the fuzzy verification is seen for rare categories like "Cloud droplets". For this category, an increase by more than a factor of four from 8.7 % to 40.5 % is visible when comparing the point-to-point verification with the fuzzy verification including all extracted points. The presented results indicate possible model improvements and are a good starting point for further in-depth analysis.

The applicability of the presented cloud classification algorithm to other atmospheric models is tested by a case study with the ICON LES model (Dipankar et al., 2015). Realistic Germany-wide LES simulations were performed within the High Definition Clouds and Precipitation for Advancing Climate Prediction (HD(CP)²) project with a horizontal resolution down to 156 m and an output frequency of 9 seconds for specific locations like the Cloudnet supersites (Heinze et al., 2017). The output is therefore averaged to the 30 s resolution of Cloudnet using the most frequent cloud class for each time interval. The ICON LES model consists of 151 terrain-following levels up to 22 km, with layer thickness increasing with altitude. The initial and boundary conditions are provided by the COSMO-DE analysis, and the boundary conditions are updated every hour. The simulations were done with the two-moment microphysics of Seifert and Beheng (2001), which has six hydrometeor categories (cloud water, cloud ice, rain water, snow, hail and graupel). According to Jerger (2014), hail could be assigned to ice as well, which allows applying the same cloud classification algorithm (Fig. 2) to the LES output as for the COSMO-DE simulations. The algorithm is applied in a case study to the grid box nearest to the LACROS supersite for 26 April 2013, showing a frontal passage around noon (Fig. 5).

Figure 6. Frequency of occurrence (0-12 km) for 26 April 2013 of the Clear Sky (a), Drizzle or Rain (b), Liquid Clouds (c) and Ice Clouds (d) categories for the LACROS Cloudnet target classification (black solid line) and for the ICON-LES cloud classification (black dashed line). Note the different axes.

category by the model, indicating issues in the rain formation process. Nevertheless, the high spatial and temporal resolution of the ICON output shows the modelled cloud structure and development in unprecedented detail, which demonstrates, among other things, the added value of these realistic large eddy simulations. Additionally, the cloud classification offers a very intuitive and comprehensive view of the modelled clouds to make big datasets of large LES accessible, which is crucial, e.g., to find model issues or to screen the simulations for interesting physical processes.

Conclusions and Discussion
The proposed cloud classification algorithm uses the temperature, dew point and hydrometeor profiles of a numerical atmospheric model to generate a cloud classification similar to the observation-based Cloudnet Target Classification. This observational product is a comprehensive tool for the detailed evaluation of cloud phase, composition and structure, but has so far only been used to derive model quantities like cloud fraction or ice water content from the observations. The modelled surrogate therefore makes a direct comparison of the extensive cloud classification product possible, allowing for an in-depth cloud evaluation. In addition, the modelled cloud classification provides an easily accessible first impression of the vertical cloud structure. The qualitative comparison with the observations indicates how well the clouds are represented by the model, giving a hint about the overall model performance and the underlying cloud microphysics. The comparably designed cloud classification can consequently serve ideally as a basis for various statistical analyses and further derived cloud properties. For example, the frequency of occurrence can be calculated to evaluate the mean vertical cloud distributions for the multiple cloud types. Model biases of certain cloud types or other shortcomings of the model can thus be identified. Furthermore, the prediction of the right cloud type at the same time and location can be investigated by comparing the cloud phase of every height level for each time step with the observations. Fuzzy verification techniques are more appropriate to assess cloud forecast statistics due to the different time and spatial resolutions of the observations and the model as well as due to the large variability of clouds. This is especially worthwhile because of the multi-dimensional and categorical dataset of the cloud classifications. The fuzzy verification allows the model to be uncertain in time and space, which prevents a misinterpretation of the evaluation due to, e.g., a simple time lag or displacement of the model.

The cloud classification algorithm and evaluation methods are tested, as an example, by comparing the COSMO-DE model data with two months, and the ICON LES model with one day, of observations for a mid-latitude Cloudnet supersite. The results show the value of the classification to point out, for example, certain model shortcomings. The calculated frequency of occurrence of "Ice clouds" shows a significant overestimation above 3 km of up to 30 % for the COSMO-DE. Thus, a frequency of occurrence for "Clear sky" conditions that is too low by up to 30 % is seen in the middle and upper atmosphere. The pointwise comparison of the cloud classification reveals, e.g., modelled "Drizzle or rain" cases that are observed as "Melting ice" points. The earlier phase transition to rain indicates possible issues with a too warm temperature profile or with the cloud microphysics. Allowing the model to be uncertain in time or space leads, as expected, to higher hit rates. For the COSMO-DE, the hit rates of the correct cloud category are higher for being uncertain by one hour than for being uncertain by one grid box surrounding the centre. Considering three or more cells, the opposite is the case. The application of the new cloud classification algorithm to the ICON LES provides very detailed information about the cloud structure and phase at a resolution similar to that of the observations.

Table 1. Pointwise comparison (contingency table) of the eight different cloud categories of the LACROS Cloudnet target classification and of the COSMO-DE cloud classification for April and May 2013. The COSMO cloud classification results are based on the 00/12 UTC analyses with the hourly forecasts for the hours in between. The absolute numbers of the contingency table are normalized by the total number of observed events of each category.

Table 2. Hit rate table of the