Object-based analysis of simulated thunderstorms in Switzerland: application and validation of automated thunderstorm tracking on simulation data

We present a feasibility study for an object-based method to characterise thunderstorm properties in simulation data from convection-permitting weather models. An existing thunderstorm tracker, the Thunderstorm Identification, Tracking, Analysis and Nowcasting (TITAN) algorithm, was applied to thunderstorms simulated by the Advanced Research Weather Research and Forecasting (AR-WRF) weather model at convection-permitting resolution for a domain centred on Switzerland. Three WRF microphysics parameterisations were tested. The results are compared to independent radar-based observations of 5 thunderstorms derived using the MeteoSwiss Thunderstorms Radar Tracking (TRT) algorithm. TRT was specifically designed to track thunderstorms over the complex Alpine topography of Switzerland. The object-based approach produces statistics on the simulated thunderstorms that can be compared to object-based observation data. The results indicate that the simulations underestimated the occurrence of severe and very large hail compared to the observations. Other properties, including the number of storm cells per day, geographical storm hotspots, thunderstorm diurnal cycles, and storm movement directions and 10 velocities, provide a reasonable match to the observations, which shows the feasibility of the technique for characterisation of simulated thunderstorms over complex terrain.


Introduction
Convection-permitting simulations will play a critical role in reducing the existing high uncertainty around the responses of thunderstorms (e.g. Diffenbaugh et al., 2013;Collins et al., 2013;Hartmann et al., 2013;Allen, 2018), and hailstorms (e.g. Historical cases include storms that inflicted significant damage (e.g. Schmid et al., 1997Schmid et al., , 2000Peyraud, 2013;Trefalt et al., 2018). In Switzerland, severe storms are monitored primarily by a dual-polarisation radar network operated by MeteoSwiss (Germann et al., 2015). Switzerland's climate is expected to be significantly affected by global warming (CH2018, 2018), but there remains high uncertainty on the likely future evolution of severe thunderstorms in Switzerland (CH2018, 2018Willemse, 1995). 70 Figures 1 and 2 show the geographical area of the study, with the radar coverage area overlaid. The Alps run across the centre of the simulation domain and split it into northern and southern regions. Figure 3 shows the sub-domains used in this study; these correspond to geographical features and are modified versions of the domains used by Nisi et al. (2016). Table 1 lists the coordinates of the boundaries of the study domain, which was chosen to be well covered by both the radar data and simulations. 75 The study period was May 2018. In Switzerland, the 2018 convective season was characterised by lower than average overall rainfall (MétéoSuisse, 2018c), but high levels of convective activity in late May and early June (MétéoSuisse, 2018b, a). In May, thunderstorms occurred in Switzerland on days 6-9 and 11-13 of the month, and then almost daily from the 15th until the end of the month (MétéoSuisse, 2018b). 22 May saw thunderstorms across the Central Plateau with a 30-year daily rain amount (73.2 mm) at Belp, and on 30 and 31 May there were extensive hailstorms over the Swiss Plateau that caused local 80 flooding (MétéoSuisse, 2018b). Hail was reported in Switzerland on,7,8,15,21,30,and 31 May (Sturmarchiv Schweiz).

Reference thunderstorm data set
The reference data for thunderstorms in Switzerland is a database of thunderstorm tracking results compiled by MeteoSwiss.
MeteoSwiss operates five C-band, dual-polarisation, Doppler weather radars in a network designed for high performance despite the challenges posed by the mountainous terrain of Switzerland (Germann et al., 2015). The resulting radar products 85 are at high spatial and temporal resolution, with 20 elevation sweeps conducted every five minutes (Germann et al., 2015). The  we use in this study are results from the TRT algorithm that were compiled into a database of storm cells and their associated properties (as in Nisi et al., 2018, but including data for 2018 and using all Swiss radars).
TRT was developed specifically to deal with the challenging topography of the Alpine region: it takes advantage of the high 90 spatial and temporal resolution of the Swiss radar network . TRT identifies thundestorms in a two-dimensional Cartesian multiple-radar "Max Echo" composite product, which is composed of the maximum radar reflectivity recorded in each vertical column (Nisi et al., 2014). TRT uses an adaptive thresholding scheme proposed by Crane (1979) that requires a fixed minimum detection threshold Z min [dBZ], a fixed minimum reflectivity "depth" Z depth [dBZ] and an adaptive threshold . On a two-dimensional map of "Max Echo" radar reflectivity, a cell is defined as a closed contour at Z thresh dBZ, 95 around a maximum reflectivity of Z peak [dBZ]. Z thresh is adapted for each cell to be the minimum value for which Z thresh ≥ Z min and for which the cell contains a single closed contour at Z peak − Z depth dBZ (Crane, 1979;Hering et al., 2004). In the case of TRT, Z min is 36 dBZ and Z depth is 6 dBZ, and a further constraint on cell area is applied: for a thunderstorm to be detected by TRT it must contain a connected area of sufficient size with radar reflectivity values at 36 dBZ or higher, and at least one pixel with a reflectivity of at least 42 dBZ (Hering et al., 2004). The area threshold used in these observations was 13 km 2 (Hering, 100 2020). TRT uses geographical overlapping of cells for matching between time steps (Hering et al., 2004(Hering et al., , 2008. Several cell properties are then computed by TRT from the 3D radar data, as well as satellite and lightning data, inside the detected footprint of each cell. A cell severity ranking product is included. TRT is well tested and established as a reference data set. It has been in operational use at MeteoSwiss since 2003 (Hering et al., 2008), and formed part of a successful forecast demonstration project in the Alpine region (Rotach et al., 2009). TRT Figure 3. Sub-domains used in this study (solid blue lines). Terrain elevation and national borders shown as in Figure 1. "N. Prealps" stands for Northern Prealps, "S. Prealps" stands for Southern Prealps, "Baden-Wurt." stands for Baden-Württemberg. The study domain is shown by the dashed blue line. Plot produced using NCL version 6.6.2.
was used to produce a 15-year, Lagranian-perspective hail climatology for Switzerland , as well as to study hailstorm initiation with cold fronts (Schemm et al., 2016). In this study we use TRT results for the study period as the reference data set.

The TITAN storm tracker
TITAN is a radar-based storm cell tracker that uses thresholds on 3D Cartesian fields of radar reflectivity to define contiguous 110 storm areas, for which statistical properties are calculated (Dixon and Wiener, 1993). Matching of storms between time steps is performed using an optimisation algorithm that expects matched storms to have similar volumes and prioritises small separation distance (Dixon and Wiener, 1993). TITAN has been used operationally (e.g. Bally, 2004) as well as in an object-based study of hailstorm properties (Foris et al., 2006).
TITAN (Dixon and Wiener, 1993; TITAN system within LROSE) was downloaded and compiled from the Lidar Radar Open 115 Source Software Environment (LROSE). TITAN uses specialised binary formats for both input and output. As input, TITAN requires data in MDV format with radar reflectivity fields in 3D Cartesian gridded coordinates (Dixon and Wiener, 1993). We used an adapted version of the TITAN tool NcGeneric2Mdv to convert input files to MDV format. The output of the tracking process are "storm" files, in which the tracking results are stored in binary format. To extract storm properties from the storm files we used an adapted version of the TITAN Storms2Xml2 tool. The TITAN processing flowchart for simulation data is 120 shown in Figure 4.
For this study we ran TITAN in dual thresholding mode with auto-restart disabled. In dual-thresholding mode, storms are identified in two steps. First, regions of reflectivity above a lower threshold are identified. Then, within these regions, areas with reflectivities greater than a sub-region reflectivity threshold are identified, tested for size, and "grown" out into the original lower-threshold region (Dixon and Seed, 2014). Threshold choice is discussed in Section 2.6.

WRF weather model
WRF is a weather model used for both research and operational NWP (Skamarock et al., 2019;Powers et al., 2017). When run at sufficiently high spatial resolution, it can explicitly resolve convection. What constitutes a sufficient resolution depends on the application: model grid spacings finer than 1 km are optimal for resolving all convective processes, while proper resolution of turbulent processes requires a grid spacing in the order of 100 m (e.g. Bryan et al., 2003;Bryan and Morrison, 2012).

130
However, grid spacings up to 4 km provide enough detail to explicitly resolve basic cumulous cloud structures (e.g. Weisman et al., 1997;Done et al., 2004;Kain et al., 2006;Chevuturi et al., 2015). In this work we ran WRF with 50 vertical levels, on a regional rotated grid, with average horizontal resolution of about 1.5×1.5 km 2 . A nested domain structure was used with a larger external domain at an average of about 4.6×4.6 km 2 resolution. The two domains are shown in Figures 1 and 2 respectively.

135
We used WRF version 4.0.1 . HAILCAST (Brimelow et al., 2002;Adams-Selin and Ziegler, 2016) was used to calculate maximum hail sizes. We tested three different WRF microphysics schemes: the Predicted Particle Property (P3) scheme (Morrison and Milbrandt, 2015), the Morrison scheme (Morrison et al., 2009) (Hong et al., 2006) Cumulus parameterisation None (explicit convection) Shortwave radiation scheme Dudhia (Dudhia, 1989) Longwave radiation scheme RRTM (Mlawer et al., 1997) Land surface scheme Noah (Chen and Dudhia, 2001) Surface layer model Revised wrf_user_vert_interp function to grid points from 1 km to 15 km above sea level at 0.5 km resolution. These heights were geopotential heights above sea level; the small differences between geopotential and geometric heights are ignored in this study. Interpolation of radar reflectivities was performed using dBZ values. The regridded WRF files were converted to MDV format for use with TITAN.

155
Before comparisons of tracking results were made, TRT and TITAN cell detections with centre points outside the study domain (see Figure 2) were discarded. Cells that were truncated by this operation had their durations shortened to the duration for which they stayed within the region of interest. Likewise, cells that were split into multiple parts by the spatial subsetting operation were updated so that their parts were counted as separate storm cells.
Thunderstorms often split into multiple parts or merge from multiple parts into single cells. TITAN and TRT handle the 160 labelling of these storms differently. TITAN data contain a "storm ID" that is maintained through splits and merges, and a "track ID" which refers to a unique length of storm track with no splits or merges. TRT data contain flags indicating when splits and merges have occurred, and the most intense storm part keeps the same identifier afterwards. Due to these labelling differences, in this paper we take a simplified approach and refer to a "cell" as a region of high radar reflectivity that exists for at least 30 minutes with no splitting or merging events. When a split occurs, the parent cell ends and multiple new (child) cells 165 are created, and when a merge occurs multiple cells end and a new (merged) cell is created. In this way we lose information on the overall length of one storm system, but we can compare cell properties easily and fairly. A "track" is the path over which a cell moves. A "cell detection" refers to a region of high reflectivity at one moment in time. Some storm properties (area, movement direction) are defined for each cell detection, while some (duration) are defined for each cell.
The TRT results are taken as the reference data set, and TITAN results were compared to the TRT database to analyse the 170 performance of the TITAN approach. The comparison measures used were defined as follows: for a given storm property P , let P i,TITAN be the ith value of the property given by the TITAN approach and let P i,TRT be the corresponding ith reference value of the property in the TRT database (i refers to an index shared by both datasets, such as simulation day). The difference between the two results is given by (1)

175
The bias of the TITAN approach is D , where the angular brackets signify the mean of all differences. The root mean squared error (RMSE) is D 2 . The relative error is given as a percentage by The mean relative bias (RB, R ), the median relative bias (MRE, median of R), and the interquartile range of relative bias (RE IQR, 75th percentile minus 25th percentile of R) to measure relative differences. The Pearson correlation coefficient (r 2 ) 180 is used to show the cofluctuation of P TITAN and P TRT . The relative error is only defined when P i,TRT is non-zero; accordingly RB, MRE and RE IQR include only data points for which P i,TRT = 0, whereas bias, RMSE, and r 2 include such points. Days on which no technique identified cells are not counted in the statistics.

Optimisation of TITAN thresholds
Radar reflectivities simulated in WRF at S-band are not expected to match the measured radar reflectivities at C-band that 185 were used by TRT, so we did not attempt to make TITAN use exactly the same thresholds as TRT. Furthermore, the TRT detection works on two-dimensional fields and thresholds on cell area, whereas TITAN uses three dimensional fields and thresholds on cell volume. Our simulation setups differed only in the microphysics scheme used, but since the calculation of radar reflectivities can be affected by the microphysics scheme, optimum thresholds were expected to differ between simulation sets.

190
We chose to optimise three TITAN thresholds by finding the values that provided the best match between TITAN+WRF TITAN was run on WRF output for the test days with all 243 tested combinations of the three thresholds. The results for each run were compared to TRT results for those days. The "best" parameter set was non-trivial to select and depended on 200 the performance statistics used. We chose an approach that emphasised low bias and cofluctuation in simulated and observed number of cells per hour, and a good match on cell area. To choose the "winning" parameter set we used absolute value of median relative bias as a score. This score was applied to comparisons of daily median cell area, and per time-step number of cells. We first subset based on number of cells per hour by taking all test runs with scores less than the 10th percentile of all scores. We then subset based on daily median cell area by again taking scores less than the 10th percentile of all such 205 scores. Of the few remaining tested combinations we chose the configuration with the best squared correlation coefficient value for simulated and observed per time-step number of cells. The resulting thresholds used for TITAN tracking in this study are shown in Table 3. Reports showing details of the threshold testing are archived (Raupach et al., 2021d).
Other parameters in the dual-thresholding scheme were held fixed for all model runs. These parameters were the minimum area required for each sub-part in the dual thresholding approach (min_area_each_part), which was set to 16 210 km 2 , the fraction of the lower-reflectivity storm region that must be covered by the sum of all higher-reflectivity sub-regions (min_fraction_all_parts), set to 0.10, and the minimum proportion of the large area that each sub-area must exceed (min_fraction_each_part), set to 0.005. These last two area thresholds are those listed in the default TITAN parameters as appropriate for strong convection and squall lines in South Africa 1 .

215
In this section, storm properties found using TITAN on WRF simulation output are compared to those found using TRT on radar data, to test whether TITAN applied to WRF simulations can produce representative statistics on thunderstorms in Switzerland. TITAN was run over the WRF simulation outputs, and TRT results were subset to the same period of time.
Both sets of results were subset to the study domain shown by the dashed line in Figures 1 and 2. During subsetting of the TITAN (TRT) results, including all tested microphysics scheme setups, subsetting caused splits in 0.78% (0.64%) of cells.

220
After subsetting, 37.8% (52.4%) of the recorded cells were discarded because their track duration was less than 30 minutes.
The resulting cell descriptions from TITAN sometimes contained spatial overlaps; 23% of cells were affected by overlaps, but the areas affected were small with only 3% of all cell points overlapping. Of the TRT cells remaining after subsetting, 30 (0.06%) were removed from this analysis because no cell velocity information was recorded.     Alpine range; in this regard, the P3 and Thompson schemes produce more realistic maps than the Morrison scheme. Overall, the approach of using TITAN on WRF output is able to broadly reproduce the observed locations of cell detection maxima. To investigate any systematic timing differences and to look at the diurnal cycle of the thunderstorms, we calculated the 255 percentage of cells that appeared in each hour of the day, for each simulation and for the observations. These results are shown by region in Figure 7. In all regions, the afternoon peak in thunderstorm activity is well reproduced by the simulations, although the exact timings differ from the obsevations in some regions. There is a tendancy for the Morrison and P3 simulations to produce more cells during the night time than are observed, and this continues into the morning for the Morrison scheme.

Spatial and temporal cell occurrences
For all data, the peak time for cell occurrence in the Thompson simulations matches the peak time in the observations, while radar-based climatologies , but here this effect is more extreme in the simulations than in the observations.
The north to south differences are possibly due to different handling of convective initiation mechanisms in the weather model.   There are known differences in storm initiation between northern regions of Switzerland and regions on the south of the main Alpine chain , and references within).

Cell movement properties 270
The use of object-based analysis means we can compare aggregate storm properties such as movement speed, direction, intensity, or cell lifetime. Figure 8 shows a comparison of the directions in which detected cells were moving at each observation point. Although there are some differences in the proportions between TRT and TITAN, it is notable that the simulations are able to reproduce the differences in advection direction observed between different regions. For example, the TRT observations show that storms moved mostly in a north and northwest direction in the Po Valley, and in a southwest direction on the Swiss 275 plateau. The simulations reproduce these differences. Again, the region of Allgau shows notable differences between observa- tions and simulations. Table 6 shows the mean direction of all cells by region and dataset. The simulation set that produced the best match with observations differed by region, but the P3 scheme produced the best match in more regions than the other simulation sets. and cell durations. We consider very high velocities (> 80 km h −1 ) to be unrealistic artefacts of the tracking algorithms; for both TRT and TITAN+WRF results less than 0.5% of cell detections had such velocities. We note again that these durations are the durations of cells as defined here, meaning that they are interrupted by storm splits and merges. The QQ plots map observed quantiles of these properties to simulated quantiles, over all detected cells. If the simulated distributions match the observed distributions, the lines follow the diagonal (solid black) line on the QQ plot. The plot shows that the simulated 285 distributions broadly agree with observed distributions for velocity in all simulations and for area and duration for the P3 and Thompson microphysics scheme setups. For the simulations run with Morrison microphysics, the plot shows that the detected cell areas were larger than the observed cells, and the simulated cells lasted for longer durations than the observed cells. Cell area and duration is most affected by the choice of thresholds used in the TITAN tracker, which means that these differences are unlikely to be caused by the microphysics scheme as such, but rather by the thresholds that result from the optimisation 290 process described in Section 2.6.

Hail properties
In this section we compare radar-based observations of hail properties to those estimated by the WRF model and HAILCAST.
The object-based technique we test here may be particularly useful for studying the effects of climate change on hail, on which there remains high uncertainty (e.g. Raupach et al., 2021b). In each dataset we compare the proportion of storm cell pixels that were estimated to contain severe (greater than 2.5 cm) and very large (greater than 4 cm) hail. In the observations from TRT, the maximum hail size was estimated using the radar-based maximum expected severe hail size (MESHS, implementation described in Nisi et al., 2016). In the WRF output, we used the HAILCAST variable HAILCAST_DIAM_MAX to calculate the proportions of TITAN-identified cell pixels with hail over 2.5 cm and 4 cm respectively. We note that the two techniques used to estimate maximum hail size are very different from each other and are therefore not strictly directly comparable; they are 300 used here as the available approximations of observed and simulated hail size. Table 7 shows the proportions of all cell detections that contained severe hail. In general, the observations contained more severe hail than the simulations. All WRF setups underestimated the proportion of cell detections containing severe hail. The WRF setup using the Thompson microphysics scheme produced the closest match to the TRT proportion of cell detections with hail over 2.5 cm. The relative errors on these proportions were smaller for 2.5 cm hail than for 4 cm hail, implying that 305 the WRF and HAILCAST simulations more severely underestimated the number of cells containing very large hail than severe hail. Figures 10 and 11 show quantile-to-quantile plots to compare the proportions of cell pixels, for cell detections for which the proportion was non-zero, that contained hail with maximum estimated size over 2.5 cm and over 4 cm respectively. The WRF results show an understimation of the cell area covered by severe and large hail, compared to the TRT observations.

Cell lifecycles 310
In this section we consider cell lifecycles -the evolution of the strength of storm cells over their durations. Since in this work splits and merges of storms interrupt storm durations, in this section we consider only the 43% of cells that contained no splits or merges, so that their durations are well defined. Figure 12 shows the number of such cells by cell duration. There are very few cells with duration over 100 minutes, meaning little emphasis should be placed on aggregate results for these long-duration cells. Figure 13 shows

Conclusions
In this study we tested and verified an approach for the object-based analysis of simulated thunderstorms in the topographically 325 complex Alpine region of Europe. Output from a high-resolution weather model (AR-WRF) was analysed using a radar storm tracking system (TITAN) to derive characteristics on each storm cell. The results were compared to a reliable and independently derived dataset of storm observations for Switzerland (TRT) for the month of May 2018. We tested WRF and TITAN using three different microphysics schemes.
The choice of radar reflectivity and cell volume thresholds to use in TITAN made a significant difference to the quality of the 330 results. We optimised the thresholds to find the best settings to use for each microphysics scheme, but this search was locationdependent and not exhaustive, the resulting thresholds depended on which performance criteria were emphasised, and the search-space over which thresholds are optimised could be further refined. The results of this study should thus not be seen as a comparison of the physical appropriateness of the microphysics schemes, but a comparison of three possible setups (comprising both scheme and chosen thresholds) for summarising thunderstorm properties in simulations over the Alpine region. TITAN 335 thresholds, including those not optimised here such as the dual-thresholding scheme settings, should be carefully considered in any work that uses this technique. We used a simplified approach in which splits and merges in storm cells were ignored. Future work could take splits and merges into account in order to properly characterise full storm lifecycles. Updates to TITAN have been suggested (e.g. Han et al., 2009;Muñoz et al., 2018), and could also be tested in future studies. We showed comparisons for simulated and radar-derived hail properties; in future, liquid precipitation could also be considered through the use of 340 disagregated precipitation fields (e.g. Barton et al., 2020).    in the TRT distributions. Since the distribtions are skewed, these plots are on logarithmic axes (zeros are plotted on the axes); the same plot with linear axes is shown for comparison in Figure A1.    Figure A1. As for Figure 9, but with quantiles plotted on linear scales.