Using radar observations to evaluate 3-D radar echo structure simulated by the Energy Exascale Earth System Model (E3SM) version 1

. The Energy Exascale Earth System Model (E3SM) developed by the Department of Energy has a goal of addressing challenges in understanding the global water cycle. Success depends on correct simulation of cloud and precipitation elements. However, lack of appropriate evaluation metrics has hindered the accurate representation of these elements in general circulation models. We derive metrics from the three-dimensional data of the ground-based Next-Generation Radar (NEXRAD) network over the US to evaluate both horizontal and vertical structures of precipitation elements. We coarsened the resolution of the radar observations to be consistent with the model resolution and improved the coupling of the Cloud Feedback Model Intercomparison Project Observation Simulator Package (COSP) and E3SM Atmospheric Model Version 1 (EAMv1) to obtain the best possible model output for comparison with the observations. Three warm seasons (2014–2016) of EAMv1 simulations of 3-D radar reﬂectivity features at an hourly scale are evaluated. A general agreement in domain-mean radar reﬂectiv-ity intensity is found between EAMv1 and NEXRAD below 4 km altitude; however, the model underestimates reﬂectivity over the central US, which suggests that the model does not capture the mesoscale convective systems that produce much of the precipitation in that region. The shape of the model-estimated histogram of subgrid-scale reﬂectivity is improved by correcting the microphysical assumptions in COSP. Different from previous studies that evaluated modeled cloud top height, we ﬁnd the model severely underestimates radar reﬂectivity at upper levels – the simulated echo top height is about 5 km lower than in observations – and this result is not changed by tuning any single physics parameter. For more accurate model evaluation, a higher-order consistency between the COSP and the host model is warranted in future studies

Abstract. The Energy Exascale Earth System Model (E3SM) developed by the Department of Energy has a goal of addressing challenges in understanding the global water cycle. Success depends on correct simulation of cloud and precipitation elements. However, lack of appropriate evaluation metrics has hindered the accurate representation of these elements in general circulation models. We derive metrics from the three-dimensional data of the ground-based Next-Generation Radar (NEXRAD) network over the US to evaluate both horizontal and vertical structures of precipitation elements. We coarsened the resolution of the radar observations to be consistent with the model resolution and improved the coupling of the Cloud Feedback Model Intercomparison Project Observation Simulator Package (COSP) and E3SM Atmospheric Model Version 1 (EAMv1) to obtain the best possible model output for comparison with the observations. Three warm seasons (2014)(2015)(2016) of EAMv1 simulations of 3-D radar reflectivity features at an hourly scale are evaluated. A general agreement in domain-mean radar reflectivity intensity is found between EAMv1 and NEXRAD below 4 km altitude; however, the model underestimates reflectivity over the central US, which suggests that the model does not capture the mesoscale convective systems that produce much of the precipitation in that region. The shape of the modelestimated histogram of subgrid-scale reflectivity is improved by correcting the microphysical assumptions in COSP. Different from previous studies that evaluated modeled cloud top height, we find the model severely underestimates radar reflectivity at upper levels -the simulated echo top height is about 5 km lower than in observations -and this result is not changed by tuning any single physics parameter. For more accurate model evaluation, a higher-order consistency between the COSP and the host model is warranted in future studies.

Introduction
Clouds and precipitation play a major role in Earth's budgets of energy, water, and momentum. However, the correct simulation of 3-D structures of clouds and precipitation has been challenging in general circulation models (GCMs) (Trenberth et al., 2007;Randall et al., 2007), partially because model grid spacings generally do not adequately resolve the cloud-structure details important to these budgets. In addition, the lack of appropriate evaluation metrics also hinders the evaluation of GCMs. Over the contiguous US (CONUS), the detailed 3-D radar reflectivity field (indicating the 3-D distribution of precipitation particles) is observed by the ground-based Next-Generation Radar (NEXRAD) network of S-band weather radars (3 GHz; Zhang et al., 2011Zhang et al., , 2016. In this study, we use the mosaic of NEXRAD observations called Gridded Radar Data (GridRad) developed by Homeyer and Bowman (2017), which have a horizontal resolution of 0.02 • (regridded to 4 km in this study), vertical resolution of 1 km (24 levels), and an update cycle of 1 h. In order to compare these data appropriately with output of the global model used here, we further coarsen the horizontal resolution, as described in Sect. 2.
Published by Copernicus Publications on behalf of the European Geosciences Union. See Corrigendum following references 720 J. Wang et al.: Using radar observations to evaluate 3-D radar echo simulation The Energy Exascale Earth System Model (E3SM) is an ongoing effort of the Department of Energy (DOE) to advance the next generation of climate modeling Version 1 of E3SM Atmosphere Model (EAMv1) is a descendent of the National Center for Atmospheric Research (NCAR) Community Atmosphere Model version 5.3 (CAM5.3; Neale et al., 2012). However, it has evolved substantially in coding, performance, resolution, physical processes, testing, and development procedures . Previous model evaluation has focused on the long-term climatological properties of certain cloud fields, surface precipitation, and water conservation on the global scale (e.g., Qian et al., 2018;Xie et al., 2018;Lin et al., 2019). Evaluations of the vertical structures of cloud and precipitation elements have used vertically pointing radar observations obtained during field campaigns (Y. Zhang et al., 2019). However, these tests lacked evaluation of fully 3-D cloud and precipitation structure over large regions of the globe and over long time periods.
For this study, we have built data processing techniques to evaluate EAMv1 simulation of the 3-D radar reflectivity field at its default setting of 1 • grid spacing and 72 vertical layers at an hourly timescale. Our goal is to provide a comprehensive evaluation of both horizontal pattern and vertical structure of cloud and precipitation. We use radar observations obtained from the NEXRAD over the CONUS for the 3 years 2014-2016. In order to directly compare the model results with NEXRAD, we have improved the Cloud Feedback Model Intercomparison Project (CFMIP) Observation Simulator Package (COSP) (Bodas-Salcedo, et al., 2011) and implemented it into EAMv1. We restrict the evaluation to the warm season (i.e., April to September). Over the CONUS, warm-season precipitation is dominated by convective processes, which are very different from the more widespread frontal cloud systems of cold-season precipitation. As discussed by Iguchi et al. (2018), precipitating ice particles have large variation in habits and scattering properties, and the effect of non-Rayleigh scattering and multiple scattering by large precipitating ice particles could introduce large uncertainty into simulating the radar reflectivity field. To reduce uncertainty due to these factors, we examine only the warm season of the 3 years from 2014 to 2016. This paper is organized as follows: Sect. 2 describes the model, the GridRad dataset, the COSP simulator, and the step-by-step methodology of data processing to account for differences between the modeled and observed datasets, specifically (1) horizontal and vertical resolutions of EAMv1 (1 • , 72 vertical levels) and NEXRAD (4 km horizontally, 1 km vertically) and (2) minimum detectable limits between the model and NEXRAD. Section 3 presents the model evaluation results and tests of the sensitivity to physics parameters. Section 4 provides synthesis and conclusions.

EAMv1 description and configuration
EAMv1's dynamics core and physics parameterizations are described in detail by . The continuous Galerkin spectral finite-element method solves the primitive equations on a cubed-sphere grid (Dennis et al., 2012;Taylor and Fournier, 2010). Tracer transport on the cubed sphere is handled using a variant of the semi-Lagrangian vertical coordinate system of Lin (2004). The method locally conserves air mass, trace constituent mass, and moist total energy (Taylor, 2011). Turbulence, shallow cumulus clouds, and cloud macrophysics are parameterized with the Cloud Layers Unified By Binormals (CLUBB) parameterization (Golaz et al., 2002;Larson, 2017). Deep convection is based upon the formulation originally described in Zhang and Mc-Farlane (1995, hereafter ZM), with modifications by Neale et al. (2008) and Richter and Rasch (2008). Stratiform clouds are represented with the "Morrison and Gettelman version 2" (MG2) two-moment bulk microphysics parameterization (Gettelman and Morrison, 2015). Aerosol microphysics and interactions with stratiform clouds are treated with an updated and improved version of the four-mode version of the Modal Aerosol Module (MAM4; Liu et al., 2016). Regarding the stratiform-convection partition, the MG2 stratiform cloud microphysics and CLUBB higher-order turbulence parameterization explicitly provide values for condensate mass and number, as well as an estimate of stratiform cloud fraction, whereas the convective cloud fraction is not parameterized in the mass-flux-based ZM scheme (assumed to be 1 for typical GCM resolutions such as at 1 • grid spacing or coarser) and is diagnosed from cloud mass flux for cloud radiation calculation, which is treated as a tunable parameter.
The EAMv1 used in this study has 30 spectral elements (ne30), which corresponds to approximately 1 • horizontal grid spacing, and the total number of grid columns is 48 602. Vertically, there are 72 layers using a traditional hybridized sigma pressure coordinate. The simulation is run for the time period from 1 January 2014 to 1 October 2016. We use a dynamic time step of 5 min and a cloud microphysics time step of 30 min. The large-scale circulation in the simulation is constrained using the nudging technique (Zhang et al., 2014;Ma et al., 2014;Lin et al., 2016), so that the model simulations can be constrained by realistic large-scale forcing. Specifically, horizontal winds (U , V components) are nudged towards the Modern-Era Retrospective analysis for Research and Applications, version 2 (MERRA2), reanalysis data (Gelaro, et al., 2017) with a relaxation timescale of 6 h. Nudging is applied to all grid boxes at each time step, with the nudging tendency calculated using the model state and the linearly interpolated MERRA2 data (Sun et al., 2019).
To facilitate the comparison with observations, model outputs are regridded to the geographic coordinate system with a horizontal grid spacing of 100 km, and the vertical coordinate is converted to the above mean surface level height in meters. By default, all the regridding processes in this study are based on the Earth System Modeling Framework Python Regridding Interface (https://earthsystemmodeling.org/esmpy/, last access: 10 April 2019) using bilinear interpolation.

COSP radar simulator
The retrieved spaceborne satellite and ground-based radar products such as cloud water content and effective particle size (e.g., Randel et al., 1996;Wang et al., 2015;Tian et al., 2016;Um et al., 2018) are often treated as the ground truth for model evaluation (e.g., Fan et al., 2017;Han et al., 2019). However, the retrieved products often have large uncertainty (Stephens and Kummerow, 2007). To allow the comparison of model results with direct measurements from 3-D scanning radars (ground-based or satelliteborne), the COSP was developed for use in GCMs (Bodas-Salcedo et al., 2011). Instead of using retrieved products to evaluate the model simulation, COSP converts model output into pseudo-observations using forward calculations (Bodas-Salcedo et al., 2011;Swales et al., 2018;Zhang et al., 2010).
The COSP consists of three steps, as detailed in Zhang et al. (2010). The first step is to generate a subgrid-scale distribution of cloud and precipitation, which is done by using the Subgrid Cloud Overlap Profile Sampler (SCOPS; Klein and Jakob, 1999;Webb et al., 2001) and SCOPS for precipitation (SCOPS_PREC), respectively. Each GCM grid box is divided into 50 subcolumns in this study. Detailed description of SCOPS and SCOPS_PREC can be found in Zhang et al. (2010). Then, the radar signals are calculated by the QuickBeam code (Haynes and Stephens, 2007) using the column distribution of cloud and precipitation. Thus, COSP calculates the reflectivity for the combined cloud properties using its own subgrid assumption, and it does not distinguish convective and stratiform cloud contributions to reflectivity. Finally, the grid box mean radar reflectivity is calculated through the method of linear averaging (i.e., the reflectivity values [in dBZ] are converted to the Z values [mm 6 m −3 ] to calculate the mean Z, and then mean Z is converted back into dBZ). In addition to averaging, all the processing of radar reflectivity data from model and NEXRAD in this study utilizes the linearized Z values, including horizontal averaging, vertical interpolation, calculation and comparison of mean values, etc.
The COSP version 1.4 used in this study has no scientific difference from version 2.0 (Song et al., 2018. Following the general usage of COSP, we modified the microphysics assumptions used for the radar reflectivity calculation regarding hydrometeor density, size distribution, etc., making those assumptions consistent with those used in the MG2 cloud microphysics scheme that is used in E3SM. The detailed documentation of those changes is in Table 1. Note that, although we tried to make the COSP use the same hydrometeor size distribution functions as MG2, the three parameters (slope, intercept, and shape parameters) are still separately defined in COSP. We use horizontally homogeneous cloud condensate distribution within the model grid element and the maximum-random overlapping scheme for cloud occurrence (Marchand et al., 2009;Hillman et al., 2018).

NEXRAD observations
The NEXRAD network consists of 159 S-band (3 GHz) Doppler radars, which form a dense observational network nearly covering the CONUS. We use the GridRad mosaic product of Homeyer and Bowman (2017), which combines all NEXRAD radar data covering the region 25-49 • N, 155-69 • W. To compare the GridRad data to the E3SM model fields, the radar frequency in the COSP was set to 13.6 GHz, consistent with the Global Precipitation Measurement (GPM) Ku-band radar, since we originally aimed at evaluating the E3SM simulation with GPM data. However, due to the high detectable threshold of 13 dBZ, low sampling frequency (four to seven overpasses over CONUS per day), and the narrow swath width (245 km) for each overpass, GPM data within the 3-year period (2014-2016) have a significant under-sampling issue. That is, the GPM sample sizes over 1 • model grid boxes are generally too small to robustly represent the grid element mean value. Therefore, we decided not to use GPM data in this study. As GPM operates over the whole earth and is anticipated to run for a long time period, it will likely be a very useful dataset for evaluating the coarse-resolution global model in the future.
The GPM radar frequency is higher than that of NEXRAD (13.6 vs. 3 GHz). Previous studies have shown conversions from Ku (13.6 GHz) to S band (3 GHz) are necessary when using GPM Ku-band radar to calibrate the ground-based radars (Warren et al., 2018). Based on our previous study that quantitatively evaluated the coincident observations from NEXRAD and GPM over the CONUS, we found the 3-D radar reflectivity fields obtained from the two independent platforms are highly consistent with each other after proper smoothing of GPM data in the vertical (Wang et al., 2019b). We performed a series of offline tests of COSP simulation using the frequency of 3 GHz (NEXRAD), 13.6 GHz (GPM Ku band), and 94 GHz (the cloud profiling radar on board the CloudSat satellite). Their corresponding reflectivities are compared in Fig. 1. As shown, the reflectivity values with 3 GHz are very similar to those with 13.6 GHz, indicating the Rayleigh scattering is satisfied for both frequencies in this application. To examine if the COSP can correctly handle the Mie scattering calculation, the frequency of 94 GHz used by the CloudSat is also tested, whose products have been widely used for the evaluation of coarse-resolution models (Zhang et al., 2010). As shown in Fig. 1, the reflectivities simulated with 94 GHz significantly deviate from those simulated with 3 and 13.6 GHz when reflectivities > 10 dBZ, which reveals that the COSP simulator is capable of handling both Rayleigh which is a shape parameter in gamma distribution describing the dispersion of the distribution. n/a -not applicable and Mie scattering calculations. However, there is no difference using Ku band or S band in the COSP simulator in this study, because the simulated condensates are not large enough to lead to non-Rayleigh scattering, which is typically observed at Z > 40 dBZ for the Ku band (Matrosov, 1992). An attenuation correction has been applied in case of existence of any large particles, although they are extremely unlikely to occur in this application. Since the COSP mimics the satellite view from space to the ground, the layer below 1 km altitude is most vulnerable to the possible attenuation caused by large precipitation particles, which has been excluded from the comparison. In this study, biases caused by the temporal mismatch are minimal at the horizontal resolution of 1 • (∼ 100 km); we nevertheless perform Gaussian smoothing of GridRad data to match the model time step (30 min) in the comparison.

Mapping the radar observations to the model grid
As shown in previous studies (e.g., Wang et al., 2015Wang et al., , 2016Wang et al., , 2018Feng et al., 2012Feng et al., , 2019, the minimum reflectivity of the 3-D mosaic NEXRAD dataset is 0 dBZ (Fig. 2a). However, the model grid-mean reflectivity can be as low as −100 dBZ. Because our focus is on significantly precipitating clouds, the minimum threshold of reflectivity at 1 • grid scale is set to be 8 dBZ (corresponding to rain rate ≥ 0.1 mm h −1 ). We also tested with a threshold of 0 dBZ and report later on how it only has minor effects on our conclusions. For our main results, after coarsening the 4 km GridRad data to a model grid element, only the grid elements with a mean value larger than 8 dBZ are taken into account in both observations (Fig. 2b) and in the simulation (Fig. 2c). In the vertical direction, the EAMv1-simulated radar reflectivity field (72 vertical levels, hybrid coordinate) is interpolated to the levels of GridRad (vertical resolution of 1 km). The simulation data are saved hourly, consistent with the hourly GridRad data.

Results
After the horizontal averaging, vertical interpolation, and truncation at the identified minimum threshold of 8 dBZ, the 3-D radar reflectivity fields obtained from GridRad and the model simulation become comparable. The EAMv1simulated reflectivity is evaluated from the perspectives of subgrid distribution, horizontal pattern, and vertical distribution.

Comparison of subgrid distribution of reflectivity
The horizontal resolution difference between GCMs (∼ 100 km) and NEXRAD observations (4 km) presents a challenge for testing the model-simulated radar reflectivity. To mimic the observations, COSP divides the grid-mean cloud and precipitation properties into subcolumns (Pincus et al., 2006) that statistically downscale the data in a way that should be consistent with observations. The way this is done in COSP is discussed by Zhang et al. (2010) and Hillman et al. (2018). In this section we examine whether the sub- grid reflectivity distribution generated by COSP is consistent with the observed subgrid reflectivity distribution shown by the NEXRAD observations.
In EAMv1, 50 subcolumns are used for calculating the mean radar reflectivity for a model grid box. There are 625 pixels inside each 1 • grid for NEXRAD data to provide a probability density function (PDF) of observed reflectivity within the box. After averaging the NEXRAD pixels at subgrid scale to 50 samples to match the COSP's subcolumns, Fig. 3 compares the simulated subgrid reflectivity PDF to the NEXRAD PDF based on all the GridRad samples combined for the 3-year period at each individual level, where the interval of reflectivity bins is 1 dBZ. The results for the default microphysics assumptions in COSP, which are for a singlemoment scheme, produce a bimodal distribution at and below 8 km altitudes (blue histograms in the left-hand column of Fig. 3). The bimodality is significantly different from the observed PDF, which forms a smooth gamma distribution. Song et al. (2018) also found bimodal distributions when the COSP was implemented in the CAM with the original microphysics assumptions, which are clearly unlike real observed radar reflectivity distributions.
Our modification of the microphysical assumptions in COSP (right-hand column of Fig. 3) greatly reduces the bimodality. In addition, the modified microphysical assumptions produce higher values of reflectivity, in better agreement with observations, and the grid-mean radar reflectivities increase by ∼ 4 dBZ (Fig. 4) mainly for values less than 25 dBZ. The improvement in the subgrid distribution and grid-mean reflectivity brought by the change of microphysics assumptions indicates the necessity of microphysical consistency between the COSP and the host model. It should be noted that the simulated radar reflectivity and its subgrid distribution are sensitive to the overlap assumption and the dis-tribution function of condensates that are set in COSP (Hillman et al., 2018). Our results are from the default setup of these aspects of COSP. It is not the purpose of this study to test those assumptions.
Although the simulated subgrid reflectivity distribution is improved by setting the microphysics assumptions used in COSP consistent with the MG2, the model is still significantly biased. In addition to the intrinsic model-observation differences in the number concentrations and mixing ratios of hydrometeors, there are other possible error sources related to the reflectivity calculation as mentioned in Sect. 2.2. For example, (1) the mixing ratios of hydrometeor types from different types of clouds are not directly passed from the host model to COSP but rather are lumped together and equally divided among all the precipitating subcolumns, (2) the spectral parameters for defining a gamma distribution are not consistent with those from MG2, and (3) the assumptions of subgrid distribution and hydrometeor vertical overlap are simple and not consistent with other parts of the host model. In addition, the subgrid distribution results from COSP are calculated based on the assumption about the distribution of cloud and precipitation among the 50 subcolumns, which is independent of what E3SM uses. Therefore, a higher-order consistency between the COSP and the host model is warranted in future studies.
In this following analysis, we focus on the evaluation of the simulated 3-D radar reflectivity field at the model's native grid, which is 1 • , since the subgrid information from COSP does not directly reflect how E3SM does it. Also, the convective cloud fraction is not parameterized in the mass-fluxbased ZM scheme and is diagnosed from cloud mass flux for cloud radiation calculation, which is treated as a tunable parameter, whose evaluation is not very meaningful unless it becomes an independent variable, for instance, for grey-zone resolutions.

Comparison of horizontal patterns
Now we compare the temporal mean reflectivity through the entire study period between the NEXRAD observation ( Fig. 5a, d, g, and j) and EAMv1 simulation (Fig. 5b, e, h, and k) with the consistent microphysical assumptions between COSP and the host model at the vertical levels of 2, 4, 8, and 11 km. The mean, standard deviation, 95th-percentile values, and valid sample numbers between the model and NEXRAD are compared in Table 2. At 2 km altitude, the EAMv1 estimates higher reflectivity than the NEXRAD observations ( Fig. 5a-b) except over the central US. The overall mean value is 28.7 dBZ for EAMv1 and 25.1 dBZ for NEXRAD. The negative bias for the model is in the region between the Rocky Mountains and Mississippi Basin (Fig. 5c), where precipitation is heavily contributed by mesoscale convective systems (MCSs). Those MCSs propagate eastward from their initiation over or just east of the Rocky Mountains, go through upscale growth, and finally dissipate in the eastern part of the Mississippi Basin (Yang et al., 2017;Feng et al., 2018Feng et al., , 2019. The standard deviations of the two individual datasets are quite similar, and EAMv1 generates a higher 95th-percentile value than the observation, indicating the model overestimates the extremely high values in the lower troposphere. In addition, those simulated extreme values are evenly distributed across the entire domain, which fail to mimic the spatial footprint of MCSs as depicted by the NEXRAD data. At 4 km altitude (Fig. 5d-e), the model's underestimation over the central US becomes larger compared to 2 km altitude, and the overestimation at the foothills of Rocky Mountains also becomes larger. The model also overestimates reflectivity in the east region of the domain. These  results indicate that the E3SM simulation fails to capture the observed spatial variability. The domain mean value between the model and observations is the same (24.0 dBZ) as a consequence of the offset between the negative and positive biases in different areas. The standard deviation and 95th-percentile values are comparable with the observations as well. At 8 km, underestimation of the reflectivity by the model occurs over almost the entire domain (Fig. 5i), with a domain mean of 15.0 dBZ, much lower than 19.2 dBZ in the NEXRAD data. Meanwhile, the modeled standard deviation and the extreme values are smaller, indicating the model has difficulty capturing the observed variability. At 11 km altitude, the EAMv1 severely underestimates the reflectivity values compared to NEXRAD (Fig. 5j-k), with a mean value of 9.8 dBZ for EAMv1 and 16.6 dBZ for NEXRAD. The negative bias is generally more than 7.5 dBZ in the central US (Fig. 5l), and the model severely underestimates the standard deviation and extreme reflectivity. Moreover, EAMv1's sample size is 50 times lower than that of NEXRAD, indicating the lower occurrence of reflectivity values ≥ 8 dBZ.
Clearly, above 4 km, the model's negative biases increase with height as shown in Fig. 5f, i, and l, manifested in the central US. There is no valid reflectivity value simulated by EAMv1 above 12 km altitude, where NEXRAD still shows reflectivity values up to 15.7 dBZ, indicating that the simulated deep convection in the warm season is not deep enough, a problem that is further examined in the following section.
In addition to the mean values, the histograms of observed and simulated radar reflectivities are compared for different altitudes, where the interval of reflectivity bins is 2 dBZ (Fig. 6). By comparing the occurrence of Z ≥ 8 dBZ between model and observations, the model apparently has a narrower distribution than the observations, and the modelobservation deviation in maximum values increases with height. At 8 km and below, the model generally overestimates the sample sizes of smaller reflectivity values but lacks extremely high reflectivity values. However, at 11 km altitude, the model greatly underestimates the sample sizes of the entire reflectivity spectrum compared to the observation, causing the severe underestimation in the mean value.

Comparison of vertical distribution of radar reflectivity
To quantitatively examine the simulated vertical distribution of radar reflectivity, contoured-frequency-by-altitude diagrams (CFADs; Yuter and Houze 1995) are generated from NEXRAD and EAMv1 and compared in Fig. 7. The CFADs represent the frequency of occurrence of reflectivity in a coordinate system having reflectivity bins (interval of 1 dBZ) on the x axis and altitude bins (interval of 1 km) on the y axis. The frequency within each bin box is calculated as the number of valid samples it contains divided by the total sample number of all reflectivity bins at all levels, meaning that the integrated value of all frequencies in each plot is 100 %. Figure 7 shows the CFADs for both NEXRAD observations (Fig. 7a, d, g, j, m, and p) and the EAMv1 simulation (Fig. 7b, e, h, k, n, and q) for each month from April to September combined over 2014-2016. The most distinct difference between the model and observations is the simulated echo top height. The echo top height in the simulation generally is at 11 km, at least 5 km lower than the 16 km top seen in the observations. At levels below 4 km, the NEXRAD data show a high-frequency zone (> 3.2 %) concentrated between 8-25 dBZ, whereas the simulated high-frequency zone is at 13-28 dBZ. For reflectivity > 35 dBZ, the simulation has a higher probability of occurrence than the NEXRAD observations.
Regarding the overall shape of CFADs, the model follows the well-known pattern where the reflectivity value range of the high-frequency zone (> 3.2 %) increases from cloud top to the freezing level and then slowly decreases or remains constant below the freezing level. The cores of maximum frequency (> 5 %) are located in the centres of the high- frequency zones. However, these characteristics are not presented in the observations, whose high-frequency zones are greatly skewed to the lower reflectivity values. The characteristics of NEXRAD's CFADs could be due to averaging from fine resolution (4 km) to coarse resolution (1 • ) , as well as averaging of convective and stratiform components because the two components produce significantly different reflectivity profiles and magnitudes.
The box-and-whisker plots (Fig. 7c, f, i, l, o, and r) represent the same results in a different way, where the normalization is conducted at each level rather than against the entire dataset at all levels. Below 4 km, the percentile values are consistent between the model and observations except for the 1 km altitude, where the model overestimates the reflectivity. The simulated 25-75th percentiles are located at the reflectivity values of 15-27 dBZ at 1 km altitude, which is higher than the NEXRAD observation (12-28 dBZ). As noted in the discussion of Fig. 5, the consistency at low levels (e.g., 2 km) between the model and observations is mainly due to the offset of negative and positive biases in different regions of the domain. Moreover, EAMv1 underestimates the frequency of echoes ≤ 15 dBZ and overestimates it for echoes between 15 and 30 dBZ, which causes the higher median values in the model. From 4 km upward, the model-observation differences become much larger than at low levels, consistent with the result shown in Fig. 5. The underestimation of 95thpercentile value increases from 10 dBZ at 7 km to more than 20 dBZ at 11 km. Above 11 km, the model fails to generate average reflectivity above 8 dBZ, and the typical reflectivity value is between 0 and 2 dBZ at 12 km.
From Fig. 7 it is clear that the model severely underestimates the echo top height by at least 5 km. To look at how this result is sensitive to the threshold reflectivity, we reprocessed the results with the 0 dBZ threshold. By lowering the threshold to 0 dBZ, an increment of ∼ 1 km in the vertical extension of the CFADs is found in the model, but the echo top height of the observations is not changed much. As a result, the choice of threshold does not change the conclusion of severe model underestimation in echo top height.
The CFADs of NEXRAD observations vary from month to month. For example, the echo top height is at 15 km in April, which increases to 16 km in May, then reaches 17 km in June and July, and finally decreases to 15 km in September. Similarly, the 0.6-0.8 % contour level in the observations stops at 9 km altitude in April but extends to 10 km in May and reaches 11 km in June. It increases to the highest level at 11.5 km in July and August, and then decreases to 11 km in September. This seasonality follows the seasonal variation of intensity of convection (Wang et al., 2019a), which is not captured in the EAMv1 simulation (Fig. 7b, e, h, k, n, and q).
The severe underestimation of the echo top height by EAMv1 has been reported for simulation of tropical convection with CAM5 in a recent study . Although EAMv1 is different from CAM5 in many aspects, such as vertical resolution and dynamical core, they share the same ZM cumulus parameterization (Zhang and McFarlane, 1995) for representing deep convection.  found the cloud top height of tropical convection is underestimated by more than 2 km, which can be alleviated by the adjustment of the ZM scheme. We have performed a series of sensitivity tests by changing physical parameters in ZM and cloud microphysics schemes to explore the possibility of model improvement in echo top height. These tests are detailed in Sect. 3.4.
As evaluated in Zheng et al. (2019), E3SM v1 failed to simulate the diurnal variation of precipitation over the central US, where the observed nocturnal peak is greatly underestimated. Xie et al. (2019) improved the diurnal cycle of convection in E3SM v1 recently by modifying the convective trigger function in the ZM scheme. It will be interesting to see if the 3-D radar reflectivity fields can be better simulated using the updated ZM scheme. Figure 7. Contoured-frequency-by-altitude diagrams (CFADs) normalized by the total number of samples at all altitude levels for NEXRAD (a, d, g, j, m, p) and EAMv1 simulation with the modified microphysics assumptions in COSP (b, e, h, k, n, q) for the months from April to September averaged over the 2014-2016 period. The box-and-whisker plots (c, f, i, l, o, r) for NEXRAD (red) and EAMv1(blue) are calculated using normalization at each individual level, where the center of the box represents the 50th-percentile value, and the 25th and 75th percentiles are represented by the left and right boundary of the box, respectively. Whiskers correspond to the 5 and 95 % values.

Sensitivity of simulated echo top height to tunable parameters of the global model
Differently from the model evaluation of cloud top height and high cloud fraction (e.g., Xie et al., 2018), where EAMv1 has shown good agreement with satellite observations over the CONUS, evaluation of radar echo top height indicates whether the processes internal to the cloud are producing precipitation correctly. To examine if any model parameters in the ZM cumulus parameterization scheme and/or MG2 microphysics parameterization scheme can significantly influence the echo top height, we conducted a series of sensitivity tests for the tunable parameters as listed in Table 3. In each test a single parameter is changed, and all other parameters retain their default values.  suggested that the restriction of neutral buoyancy level (NBL) from the dilute convective available potential energy (CAPE) calculation (Neale et al., 2008) can limit the depth of deep convection in ZM. When the convective plume reaches the NBL, all mass flux is detrained even if the updraft is still positively buoyant from the cloud model calculation (Zhang, 2009). To allow deep convection to grow deeper, we performed a sensitivity test following , where the NBL deter-  mined in the dilute CAPE calculation is removed, and the upper limit of the integrals of mass flux, moist static energy, and other cloud properties is set to be very high (70 hPa in this study). After the modification, the convective cloud top height increases as shown in ; how-ever there is no change in the radar echo top height, i.e., the maximum altitude at which precipitation-sized particles occur. A possible reason for the limited effect on echo top height is that the cloud ice content is too low in midlatitude continental convection without convective microphysics parameterization (Song et al., 2012), which cannot be improved by merely increasing the NBL.
Other parameters that we tested in the ZM cumulus parameterization with the dilute CAPE calculation include convective entrainment rate (zmconv_dmpdz), the convection adjustment timescale (zmconv_tau), the coefficient of autoconversion rate (zmconv_c0_lnd), ice particle size (clubb_ice_deep), convective fraction (cldfrc_dp), and number of layers allowed for negative CAPE (zm-conv_cape_cin). The overall conclusion is that separately tuning any of these parameters does not improve the simulation of echo top height. For the convective entrainment rate (zmconv_dmpdz), we decreased its value from −0.7 × 10 −3 to −1.0 × 10 −5 , which means that the entrainment in convection is almost turned off, similar to the undiluted CAPE assumption. Results show the simulated echo top height is increased by 500-800 m in the EAMv1-test simulation, and the reflectivity span in the lower troposphere is narrowed by 1-2 dBZ, which is closer to the observations (Fig. 8). This result is consistent with the previous studies that tested the undiluted CAPE assumption as well (Neale et al., 2008;Hannah and Maloney, 2014). However, that assumption is unrealistic given the fact that the undiluted CAPE-based closure strongly deviated from observations (Zhang, 2009). In summary, changing any of our selected parameters individually in the ZM scheme does not improve the simulation of echo top height.
The MG2 cloud microphysics parameterization in E3SM determines only large-scale cloud and precipitation (i.e., those resolved by the model). Changes in the MG2 cloud microphysics parameterization could affect the parameterized cumulus cloud and precipitation by changing the large-scale forcing which feeds into the cumulus cloud calculations. By decreasing the MG2 autoconversion rate (prc_coef1), ideally the depletion of moisture within the atmospheric column is slowed down and more water vapor can be supplied to cumulus convection. Results show, however, that the echo top height is not affected by changing the MG2 assumptions. Attempts at accelerating the Wegener-Bergeron-Findeisen process in MG2 to increase the conversion of liquid to snow or ice, as well as using a lower size threshold for the ice-tosnow conversion, have also proven to be unimportant to the simulation of echo top height.
Thus, echo top height proves to be insensitive to the available tunable parameters. Setting the value of the convective entrainment rate to be unrealistically low only gains a 500-800 m increment in echo top height. Given that the model underestimation is more than 5 km, the increment is insufficient to solve the discrepancy. Note that each individual tunable parameter was changed without retuning the model to keep the top-of-atmosphere radiative energy budget balanced and the model performance optimized. Thus, some expected improvement in echo top height can be subsequently offset by other untuned processes. Instead of providing quantification of how the model responds to the changes of parameters, we emphasize the trend of change in echo top height, in which the simulation of the echo top height cannot be significantly improved by tuning only one of those physical parameters. Further investigation of combinations of two and more parameters is a topic for a future study.

Conclusions and discussion
We have evaluated the model performance of E3SM EAMv1 in simulating the warm-season 3-D radar reflectivity at an hourly scale over the North American sector of the globe by comparing the model results to the 3-D distribution of radar reflectivity observed by NEXRAD radars over the CONUS during April-September of 2014-2016. The evaluation is achieved by improving the COSP radar simulator and employing special data processing techniques to ensure fair comparison between model and observations that are different in sampling frequency, horizontal-vertical resolutions, and minimum detection limit. Our findings are as follows: 1. With the default microphysics assumptions in COSP, the simulated subgrid reflectivity PDF is bimodal, in disagreement with radar observations which show that the subgrid reflectivity follows a gamma distribution. When the microphysics assumptions in COSP are changed to be consistent with the MG2 microphysics parameterization used in E3SM, the bimodality of the subgrid distribution is nearly eliminated. It is therefore important to maintain consistency of microphysics assumptions between the host model and radar echo simulator attached to the model as advocated by the COSP community . For more accurate model evaluation, a higher-order consistency between the COSP and the host model is warranted in future studies.
2. Below 4 km altitude, the simulated domain-mean reflectivities by EAMv1 agree with NEXRAD observations in magnitude, but the simulation fails to capture the spatial variability. The model underestimates the reflectivity in the central US between the Rocky Mountains and Mississippi River. This pattern suggests that the model is not adequately representing the mesoscale convective systems that dominate warm-season rainfall in that region. The model overestimates the reflectivity outside this region.
3. Above 4 km altitude, EAMv1 shows a severe underestimation of the domain-mean reflectivity, and the negative bias increases with altitude up to 11 km, above which the model fails to simulate any valid reflectivity at all, whereas NEXRAD observations show strong radar echoes up to 16 km.
4. EAMv1 is able to simulate the variability and extreme value of reflectivity at the lower troposphere but significantly underestimates them at high levels.
The NEXRAD observations used in this study reveal that EAMv1 fails to simulate the occurrence of large ice-phase particles at high levels in deep convective clouds. In addition, the conclusion that "simulated deep convection is not deep enough" also echoes the dry bias seen in GCMs as manifested in underestimations of total precipitation and individually large rain rates over the CONUS (e.g., Zheng et al., 2019). We have now shown that this model deficiency cannot be significantly improved by tuning a single value of the physical parameters in the ZM cumulus and MG2 cloud microphysics schemes. Note the large-scale circulation is nudged towards observations for the simulations in this study, so our results represent the best-case model performance. Compared to the nudged simulations, free running of EAMv1 has shown nonnegligible biases in the regional circulation (Sun et al., 2019). With the nudged simulations, the large biases in circulation can be excluded so that the performances of physics parameterizations in simulating convective systems can be more insightfully understood.
The data processing techniques and metrics we have developed in this study can be used globally for model evaluation when satellite-based radars provide global 3-D radar observations. The GPM radar observations will eventually be able to provide global radar echo coverage , whose data have been proven consistent with NEXRAD (Wang et al., 2019b). However, as discussed in Sect. 2, the sampling by GPM at 1 • model grid elements for only 3 years of GPM data is insufficient for obtaining robust grid-mean values to compare with the EAMv1 simulation. In addition to the restriction in the availability of observational data, the high computation cost with the incorporation of the COSP simulator in simulation and the demand of large data space (14 000 core hours and 1.2 TB of data per simulation month at hourly output frequency) have hindered the modeling for an extended period. When GPM has run for a much longer time period and more powerful computational resources become available, it will be a very useful study for evaluating the long-term model simulations at the global scale. In addition, the results of this study can provide metrics for evaluating the cumulus parameterizations or provide insights on how to further improve the cumulus parameterizations like Labbouz et al. (2018), which could be a follow-on work. Future studies can also focus on separately evaluating properties in convective and stratiform regions, since the thermodynamic and reflectivity profiles are fundamentally different between the two regions.
Author contributions. JW performed the simulations and conducted the analyses. JF and RAH developed the idea of this research. KZ helped in the model configuration, and P-LM implemented the radar simulator. GJZ provided feedback and helped shape the research. All authors discussed the results and contributed to the final manuscript.