Assessment of the data assimilation framework for the Rapid Refresh Forecast System v0.1 and impacts on forecasts of convective storms

The Rapid Refresh Forecast System (RRFS) is currently under development and aims to replace the National Centers for Environmental Prediction (NCEP) operational suite of regional and convective scale modeling systems in the next upgrade. In order to achieve skillful forecasts comparable to the current operational suite, each component of the RRFS needs to be configured through exhaustive testing and evaluation. The current data assimilation component uses the Gridpoint Statistical Interpolation (GSI) system. In this study, various data assimilation algorithms and configurations in GSI are assessed for 5 their impacts on RRFS analyses and forecasts of a squall line over Oklahoma on 4 May 2020. Results show that a baseline RRFS run without data assimilation is able to represent the observed convection, but with stronger cells and large location errors. With data assimilation, these errors are reduced, especially in the 4 and 6 h forecasts using 75 % of the ensemble background error covariance (BEC) and with the supersaturation removal function activated in GSI. Decreasing the vertical ensemble localization radius in the first 10 layers of the hybrid analysis results in overall less skillful forecasts. Convection and 10 precipitation are overforecast in most forecast hours when using planetary boundary layer pseudo-observations, but the root mean square error and bias of the 2 h forecast of 2 m dew point temperature are reduced by 1.6 K during the afternoon hours. Lighter hourly accumulated precipitation is predicted better when using 100 % ensemble BEC in the first 4 h forecast, but heavier hourly accumulated precipitation is better predicted with 75 % ensemble BEC. Our results provide insight into current capabilities of the RRFS data assimilation system and identify configurations that should be considered as candidates for the 15 first version of RRFS.


Introduction
The increase in computational resources over the last several decades has allowed a considerable increase in horizontal resolution in numerical weather prediction (NWP) (e.g., Bauer et al., 2015;Yano et al., 2018). Currently, many NWP centers have (Mlawer et al., 1997;Iacono et al., 2008), the surface layer scheme of Long (1986), the near-surface sea temperature scheme of Li et al. (2015), the NoahMP (Noah Multi-Physics) land surface model (Niu et al., 2011), and the Navy Research Laboratory ozone photochemistry (2015) and stratospheric water vapor schemes (McCormack et al., 2008).

Data Assimilation
GSI is a variational data assimilation system featuring 3DVar (e.g., Wu et al., 2002;Kleist et al., 2009), hybrid 3DEnVar (e.g., Wang, 2010;Wang et al., 2013;Wu et al., 2017), and hybrid 4DEnVar methods (e.g., Wang and Lei, 2014;Kleist and Ide, 160 2015a). It also includes an optional non-variational, complex cloud analysis capability that executes after the variational analysis as a method to specify cloud and hydrometeor variables (e.g., Hu et al., 2006a, b;Benjamin et al., 2021). GSI features the following standard control (analysis) variables: streamfunction, velocity potential, temperature, surface pressure, and normalized relative humidity following Holm et al. (2002). However the choice of control variable is flexible and one may extend or modify the standard set to include other fields, such as hydrometeors or radar reflectivity (e.g., Wang and Wang, 2017). In 165 3Dvar and the associated hybrid variants, the static background error covariance is approximated through the application of a recursive filter which models the autocorrelations (Purser et al., 2003) while cross-covariances are handled in the standard context through statistical balance relationships obtained via regression (e.g., Wu et al., 2002;Parrish and Derber, 1992). The analysis is obtained by minimizing the incremental form cost function through the preconditioned conjugate gradient method (Derber and Rosati, 1989;Bathmann, 2021). 170 The extension of GSI from traditional 3DVar to hybrid 3DEnvar and to hybrid 4DEnVar is accomplished through the extended control variable approach (e.g., Lorenc, 2003;Wang, 2010;Kleist and Ide, 2015b, c). In this configuration, one is able to incorporate flow-dependent covariance information obtained from a complementary suite of ensemble forecasts. Typically this ensemble is obtained from a companion ensemble-based data assimilation system, such as the EnKF. However one may use any suitably available ensemble. In fact, the regional operational data assimilation systems at NCEP have used the ensemble 175 members from the GFS Data Assimilation System directly in the hybrid 3DEnvar framework with considerable benefit to the resulting forecast (Wu et al., 2017). This study focuses on the 3DVar and hybrid 3DEnVar frameworks and uses the global ensembles as described in (Wu et al., 2017).
GSI is capable of assimilating a large suite of observations. This includes, but is not limited to, satellite radiances (e.g., Zhu et al., 2014), derived Global Navigation Satellite System Radio Occultation (GNSS-RO) observations, radar radial velocity and 180 reflectivity (e.g. Lippi et al., 2019;Chen et al., 2021), Geostationary Lighting Mapper (GLM) lightning flash rates, web-camera derived estimates of horizontal visibility (Carley et al., 2021), and conventional observations (Hu et al., 2018). After 2014 it became a community system, maintained and supported by the EMC and the DTC (Shao et al., 2016). Recently, it has been added as the analysis component to improve initial conditions for the RRFS (Hu et al., 2021).
Presently, GSI is the data assimilation system used at NCEP for all operational atmospheric data assimilation applications 185 (e.g., Wu et al., 2002;Kleist and Ide, 2015a;Hu et al., 2017). It was initially developed by the EMC (Wu et al., 2002) and implemented as the analysis component in the operational GFS in May 2007 (Kleist et al., 2009) and in the operational Rapid Refresh (RAP) in May 2012 (Benjamin et al., 2016).

Post-processing
UPP is used at NCEP in all operational models. A community version is currently supported and maintained by the DTC. 190 UPP takes native output from the model grid points/cells and creates post-processed outputs including numerous diagnostic quantities in the same model output grid and model-native or isobaric vertical coordinate (UPP, 2021). Post-processed outputs include diagnostics fields that are not part of the model computation that have been developed for different applications. These include, for example, precipitation type, composite reflectivity, simulated satellite brightness temperatures, updraft helicity, storm motion, ceiling or cloud-base height, vertically integrated liquid, and lightning, among several others. More details 195 on the diagnostic fields developed for hourly updated NOAA weather models such as RAP and HRRR, and how they are calculated, can be found in Benjamin et al. (2020). These products are critical for users in their forecast processes. UPP was selected as the unified post-processing system for UFS and modifications have been made to work with FV3-based models.
Currently, it can be used in the UFS medium range weather and SRW applications.

200
The workflow ties all RRFS components together and handles all system interdepencies. In essence, it manages the cycling configuration taking into account each task dependency and specification. It oversees that tasks to generate ICs and LBCs only start if all needed information was obtained from the previous step. It manages how the data assimilation cycle advances, i.e. by running the forecast to generate the first guess and running the analysis once the first guess is completed. It handles the model execution by supervising availability of ICs and LBCs for the specific hour, and controls that model outputs only 205 be postprocessed if they exist in the model run directory. It also manages crucial information on computational resource requirements to run each task.
A schematic diagram of major tasks and the general pipeline of the RRFS system is provided in Fig. 1. The task "Make Fixed Files" generates the model grid, orography, and climatological information needed for the model execution. The tasks "Make ICs" and "Make BCs" read data from external models (such as GFS and HRRR), perform the necessary calculation, 210 interpolation, conversion, and then generate appropriate ICs and LBCs for a FV3 LAM model run. The task "Run analysis" (the gray shaded area in Fig. 1) is to execute the data assimilation system for a FV3 LAM run. It ingests various types of observations and combines them with a first guess (or background) to generate a best possible atmospheric analysis for the initialization of the FV3 LAM model integration. The first guess can be either an IC from an external model (after the task "Make ICs") or a short term forecast (1 -6 h forecast, configurable to users) from a previous FV3 LAM model run. The 215 first scenario is referred to as a "cold start" (the blue box in Fig. 1) while the latter is called a "warm start" (the red box in Fig. 1). In practice, for a FV3 LAM "warm start," the first guess comes from "restart" forecast files generated by the FV3 LAM model. The task "Run model" is to run the FV3 LAM model with ICs and LBCs prepared from the previous steps. It is worth mentioning that besides the "cold start" and "warm start," a FV3 LAM model run can also start from an IC made directly from an external model without the data assimilation step. This is also referred as a "cold start." The task "Run post" is to 220 post-process the FV3 LAM forecasts and generate all target model fields for downstream plotting and/or examination.

Cycling configuration
The cycling configuration of the RRFS v0.1 is similar to the one used in RAP, i.e. cold starts are performed every 12 hours and warm starts are performed at all other cycles using the 1 h forecast from the previous cycle as background for the analysis.
RAP performs hourly-updated continuous cycles with cold starts at 09:00 UTC and 21:00 UTC using the 1 h forecast from 225 cycles initialized at 08:00 UTC and 20:00 UTC in 6 h parallel hourly spin-up cycles. The parallel spin-up cycles are cold started from GFS atmosphere analyses and RAP surface fields at 03:00 UTC and 15:00 UTC. Cold starts in RAP introduce the atmospheric conditions while RAP land surface fields are fully cycled in the continuous cycle (Benjamin et al., 2016;Hu et al., 2017). At the time of execution of this research not many RAP functionalities were available for use in the RRFS data assimilation framework, therefore a more simplified configuration with partial cycling is used. Figure 2 illustrates the RRFS 230 cycling configuration from cycles initialized between 00:00 UTC through 06:00 UTC. In each cycle, an 18 h free forecast is launched following the analysis, with hourly outputs. A cold start is performed at 00:00 UTC and warm starts between 01:00 UTC to 06:00 UTC using the FV3 LAM 1 h forecast from the previous cycle as background for the analysis.

Methods
In order to achieve skillful forecasts comparable to the current operational convection-allowing suite, each component of the

235
RRFS needs to be exhaustively tested to determine the best configuration. This study focuses on the initial configuration of the data assimilation framework. In this section, the case study, general setup of the experiments, description of the experiments conducted, and verification methodology are presented. (3694 J kg −1 ) and effective bulk shear (48 kt for the surface to 3 km layers and 36 kt for the surface to 6 km shear) were observed over northeastern Oklahoma. This environment provided favorable conditions for severe convective storms with po-245 tential for strong updrafts and development of supercells (e.g., Weisman and Klemp, 1982;McCaul and Weisman, 2001). At 20:00 UTC, the convective cells were first seen in the radar reflectivity observations over that region (

250
Clusters of severe storms developed across south-central Oklahoma along the intersection of the cold front with the dry line.
The convection associated with the squall line evolution resulted in several instances of large hail and high wind, mostly over northeastern and south-central Oklahoma, southeastern Kansas, southwestern Missouri, and northwestern Arkansas. The time window used is 1 hour, allowing for observations within 30 minutes before to 30 minutes after the analysis time to be assimilated.

Setup of experiments
Experiments are conducted testing the GSI 3DVar and 3DEnVar systems. For the hybrid 3DEnVar analysis, the Global Data

265
Assimilation System (GDAS) 80 member ensemble forecasts (9 h forecasts) are used to provide the ensemble BEC (e.g. Wu et al., 2017). These forecasts are available four times per day, therefore the same 9 h GDAS ensemble forecasts are used for the 2 hours before and 3 hours after its valid hour. For example, the 9 h GDAS ensemble forecasts initialized at 00:00 UTC (valid at 09:00 UTC) are used for the cycles from 07:00 UTC to 12:00 UTC. Similarly, the 9 h forecast GDAS ensemble initialized at 06:00 UTC (valid at 15:00 UTC) is used for the cycles from 13:00 UTC to 18:00 UTC. This follows the same strategy in the

270
RAP system (Hu et al., 2017). In all experiments with data assimilation, two outer loops with 50 inner loops each are performed to minimize the cost function and find each analysis. The spatial resolution of the analysis is 3 km, as in the forecast model.

Sensitivity experiments
GSI provides many functionalities and parameters, enabling users to make the best data assimilation configurations for different practices. A series of experiments were designed to examine the impact of different configurations on the analyses and forecasts.
275 Some RAP configurations were tested in the experiments following Hu et al. (2017). An experiment with no data assimilation is provided, acting as the baseline for all other experiments. This baseline experiment is called NoDA and uses the same cycling configuration as experiments with data assimilation but without the execution of GSI. EnVar analyses (e.g Campbell et al., 2010), as an effective way to mitigate sampling errors due to the relatively small ensemble size available for hybrid EnVar and ensemble analyses (Houtekamer and Mitchell, 2001;Hamill et al., 2001), especially at convective scales (e.g Gustafsson et al., 2018). At this stage, it is important to determine how large the localization radius needs to be for RRFS analyses. Therefore, experiment VLOC was designed to examine the vertical localization radius that yields more realistic forecasts in RRFS. The localization function in GSI is implemented as a single application of an isotropic 290 recursive filter (Purser et al., 2003) and the radius is specified as a Gaussian half-width, either in scale height (ln p) or in terms of number of vertical layers. In this study, the radius is specified in terms of layers. In VLOC, the vertical ensemble localization radius is changed from 3 vertical layers for the whole atmosphere (used in all other experiments) to a heightdependent localization setting: 1 vertical layer in the lowest 10 model layers and 3 layers for other model layers. A comparison experiment (not shown) was conducted reducing the vertical localization to 2 layers in the first 10 model layers, but results
The operational RAP system has developed a PBL pseudo-observation function in order to obtain a better representation of the PBL in the analysis. This function was initially developed to further leverage the information provided by METAR observations, extending their representativeness through the PBL depth in the Rapid Update Cycle (RUC) analyses. Improvements in the temperature, dew point, and CAPE forecasts were found when spreading the innovations from temperature, moisture, 300 and wind in the layers above surface and below the top of the PBL (Benjamin et al., 2004). Smith et al. (2007) also found a positive impact in the 3 h forecast of CAPE by using the PBL pseudo observations, and the impact was greatly increased when additionally assimilating GPS-IPW. Benjamin et al. (2010) found higher positive impact during the summer, when the PBL is deeper. This function has been used operationally since RAP version 3 (Benjamin et al., 2016), and therefore, it needs to be tested and tuned for its potential use in RRFS analyses. To test whether and how this function works for the RRFS v0.1, 305 experiment PSEUDO is designed and results are presented in Sect. 4.3. The study of Tong et al. (2020) showed that regardless of the method used, the storm coverage was overestimated and reflectivity values were much higher than the observed, which is likely to be linked to the physics suite used. However, it is also well known that nonphysical solutions (nonrealistic updraft/downdraft, negative humidity, supersaturation, etc.) can arise as a result from the data assimilation procedure (e.g., Janjić et al., 2014;Tong et al., 2016). In this study, experiment CLIPSAT and that RAP uses operationally 75 % of the ensemble BEC (Hu et al., 2017), this percentage was then used for these other experiments.

Forecast verification
MET version 9.0 (Jensen et al., 2020) is used for forecast verification. MET was developed at the DTC and has been widely used by the NWP community. Upper air and surface observations are used to verify the vertical profiles of temperature, 320 specific humidity, and wind, as well as 2 m temperature, 2 m dew point, and 10 m wind forecasts, respectively. For upper are further analyzed at 00:00 UTC and 12:00 UTC valid times.
Precipitation forecasts are verified against the hourly Stage IV precipitation product (Lin and Mitchell, 2005) in terms of the ETS and frequency bias (FBIAS) for different thresholds, but only >0.01 inches h −1 (0.254 mm h −1 ) for lighter precipitation and >0.25 inches h −1 (6.35 mm h −1 ) for heavier precipitation are presented here. The grid-to-grid approach in MET is used. is expected from a correctly executed data assimilation procedure. There is a noticeable jump in the RMS error values of the OmB from 00:00 UTC (12:00 UTC) to 01:00 UTC (13:00 UTC) on 4 May 2020. This is because 00:00 UTC and 12:00 UTC are cold started from HRRR analyses. On the contrary, at 01:00 UTC (13:00 UTC) on 4 May, the background used is from the FV3 LAM 1 h forecast. Therefore, forecasts used to initialize cycles at 01:00 UTC and 13:00 UTC undergo a spin-up process.   The hybrid EnVar data assimilation method is now widely used by NWP centers (e.g., Bannister, 2017;Gustafsson et al., 2018) and the research community. It combines the static and ensemble BEC, taking advantages from both the variational method and the EnKF filter method. It is robust, allows the use of flow-dependent BEC, avoids the development and maintenance of 365 a tangent linear and adjoint model, and thus has gained mainstream practice. In this hourly updated RRFS system, the hybrid 3DEnVar method is tested. One of the major concerns is to how to obtain the optimal weight for the ensemble BEC in the hybrid 3DEnVar analysis. A series of weighting sensitivity experiments were conducted in order to find the best option for this study. Figure 6 shows the specific humidity and temperature analysis increments for the 19:00 UTC cycle on 4 May 2020 for 370 experiments 100EnBEC, 75EnBEC and 3DVar. The analysis conducted at 19:00 UTC is during a cycling period using warm starts and is close in time to the initiation of convection in the afternoon hours. Forecasts initialized by this analysis cover the squall line evolution from its initiation to decay stages. Therefore, this cycle was selected to show the analysis increments and storm forecasts in the following sections. The analysis increments from experiment 3DVAR (Fig. 6c and f) are smoother as compared to those from 75EnBEC ( Fig. 6b and e), which exhibits some flow-dependent features. As it goes into pure ensemble 375 BEC ( Fig. 6c and d), more flow-dependent increments are obtained.  Fig. 7a, d, g, and j). The initial cells are represented and located more accurately in the experiments with data assimilation, especially 75EnBEC (Fig. 7d). In the 4 h forecast, the squall line enters its mature stage and a line of storms are ranged from southwest Missouri to central Oklahoma (Fig. 7b, e, h, and k). Every experiment predicts a squall line, but there is substantial 385 location and coverage error in the NoDA experiment. 3DVAR improves a little over NoDA. 75EnBEC does well to predict the squall line at the correct location, although the storm near the southwest tip of the observed squall line is still missing as in all other experiments (Fig. 7e). 100EnBEC overproduces the convection associated with the squall line, but still improves over 3DVar and NoDA at this forecast hour. In the 6 h forecast, the squall line moves eastward and covers from southern Missouri and northwestern Arkansas to southeastern Oklahoma. At this time, 3DVAR again performs better than NoDA, 390 and 75EnBEC still makes the best forecast among all experiments (Fig. 7c, f, i, and l). Overall, data assimilation introduces evident, positive impacts to the storm forecasts in terms of the squall line location, orientation and coverage, though different assimilation strategies yield different impacts. The improvement from 3DVAR is somewhat limited while hybrid 3DEnVar is seen to perform much better. Among the experiments, 75% ensemble BEC gives the best forecasts.

The impact of vertical ensemble localization radius 420
Ensemble-based systems need a large number of ensemble forecasts in order to estimate a full rank covariance matrix. However, this is computationally impractical for operational and research activities. The ensemble-based covariances can be very noisy when using a small ensemble size which results in inaccurate analyses (e.g., Hamill et al., 2001;Gustafsson et al., 2018).
The vertical and horizontal localization scales determine how the ensemble covariance varies with distance (Buehner, 2005). Gustafsson et al. (2018) pointed out that the localization needs to be large enough to not disrupt the large scale balance but 425 small enough to represent fluctuations at the convective scale. Thus, unlike at global scales, the operational RAP and HRRR systems use a horizontal localization radius of 110 km in combination with a vertical localization radius of 3 layers, which give optimal forecast skill in RAP applications (Hu et al., 2017). In addition, Hu et al. (2017) tested a vertical localization radius of 9 layers, but using this larger localization radius degraded the forecast when compared to 3 layers. Knowing the expected results for a relatively larger vertical localization value using an 80 member ensemble, this study looked at the impact of reducing the 430 vertical localization radius from 3 grid points to 1 in the lowest 10 vertical model levels (experiment VLOC). This reduction is adopted to capture finer vertical features of low atmosphere from the observations close to the surface and below the PBL.  Matched Pairs the late afternoon (valid hour 00:00 UTC) (Fig. 10a). At valid hour 12:00 UTC, VLOC gives a lower RMSE between 950 and 900 hPa and 350 and 300 hPa, and lower bias in the upper atmosphere between 450 and 250 hPa (Fig. 10d). For specific humidity, the RMSE and bias are improved at all levels above 650 hPa at 00:00 UTC with VLOC. However, a degradation is observed in the RMSE in the lower levels below 700 hPa. Degradation is also seen in the bias between 950 and 800 hPa ( Fig. 10b). At valid 12:00 UTC, not much improvement is shown in either the RMSE or bias from VLOC (Fig. 10e). Meanwhile, 440 a general positive impact is observed in the RMSE and bias for the winds above 650 hPa but negative impact in lower levels at 00:00 UTC valid hour. At 12:00 UTC valid hour, slight improvements are shown for VLOC in the RMSE between 650 and 500 hPa and at 400 hPa, and in the bias at 550 hPa and 350 hPa (Fig. 10f).
The change in vertical localization slightly improved the extent and intensity of convection over northeastern Oklahoma in the 2 h forecast, however an underforecast of the convection over central and eastern Oklahoma is observed in the 4 h 445 forecast and an overforecast over north-central Arkansas and south-central Missouri is observed in the 6 h forecast (Fig. A1).
While reducing the vertical localization scale did produce small improvement at some levels, degradation dominated the overall signature, indicating this variation of localization scale produces less skillful storm forecasts. It suggests a vertical ensemble localization radius of 3 layers is already a good choice if not the best.  vertical levels, from the surface to the level corresponding to 75 % of the PBL height and spaced every 20 hPa (Benjamin et al., 2016). This technique works as if additional PBL observations are available at those levels and thus more observation 455 innovations can be computed. Therefore, they are called "PBL pseudo-observations". This function is tested with the PBL pseudo-observation configuration used in the operational RAP system.

The impact of PBL pseudo-observations
The 2, 4, and 6 h composite reflectivity forecasts from experiments PSEUDO and 75EnBEC are presented in Fig. 11.
PSEUDO clearly predicted more convection than 75EnBEC (Fig. 11a -c). These forecasts overestimated convection in southern Missouri, northern Arkansas, and northeastern Texas for all forecast lengths, especially at 4 and 6 h ( Fig. 11b and c). Spu-460 rious convection also appeared over central Arkansas, southeastern Oklahoma, and far northeastern Texas in the 2 h PSEUDO forecast (Fig. 11a), and convective initiation was overproduced over northeastern Oklahoma and southeastern Kansas. The squall line is not well represented in the 6 h forecast after adding PBL pseudo-observations (Fig. 11c). Nevertheless, the 4 h forecast (Fig. 11b) in PSEUDO has a better coverage of the squall line.
The RMSE and bias vertical profiles for the 2 h forecast of temperature, specific humidity, and wind against sounding 465 observations at 00:00 UTC and 12:00 UTC valid hours are presented in Fig. 12. The use of PBL pseudo-observations gives clear negative impacts at both valid hours and most vertical levels for the RMSE and bias of temperature and specific humidity ( Fig. 12a, b, d, and e). For wind, the bias shows more promising results with a positive impact for almost the entire atmosphere   at late afternoon hours (valid hour 00:00 UTC) (Fig. 12c), but at 12:00 UTC valid hour (Fig. 12f) the wind RMSE clearly shows an advantage of 75EnBEC over VLOC at almost all vertical levels.

470
Unlike upper air verification, the surface RMSE and bias of 2 m dew point temperature for the 2 h forecast in PSEUDO show substantial improvements between cycles initialized at 12:00 UTC on 4 May 2020 and 00:00 UTC on 5 May 2020 ( Fig. 13b). A reduction of almost 1.6 K is shown for both statistics in the cycle initialized at 19:00 UTC. The RMSE and bias of 2 m temperature for the 2 h forecast also show improvements during the afternoon hours, but with less magnitude. A degradation in RMSE is observed between cycles initialized at 21:00 UTC and 00:00 UTC (Fig. 13a). Clearly, adding PBL 475 pseudo-observations mitigates near surface dry and warm bias during afternoon hours, but overforecasts the storm and makes upper air forecasts worse. More tuning and testing of this function are needed before applying this technique in the RRFS.

The impact of supersaturation removal
GSI has a function to remove supersaturation in the background by capping specific humidity to its saturation value in each outer loop during the minimization of the cost function, as calculated using the background fields (CIMSS, 2014). Figure 14   480 shows the difference in the specific humidity (g kg −1 ) analyses between the 75EnBEC analysis and the 75EnBEC analysis with the supersaturation clipping function activated (75EnBEC vs. CLIPSAT) for the 19:00 UTC cycle on 4 May 2020. Since more moisture is present in the lower atmosphere, model level 50 (located in the lower atmosphere) was selected to show this result. Positive (negative) differences in Fig. 14 indicate that more (less) specific humidity is found in the 75EnBEC analysis than in CLIPSAT. The figure suggests that supersaturation is removed in the CLIPSAT analysis mostly over southwestern and 485  Figure 13. As in Fig. 9, but for experiments 75EnBEC and PSEUDO.
northwestern Missouri, southeastern Kansas, northern Arkansas and Oklahoma. It is worth mentioning that the computational run time of the analyses in CLIPSAT was quite similar to 75EnBEC (not shown).
The 2, 4, and 6 h composite reflectivity forecasts are shown in Fig. 15 for experiments CLIPSAT and 75EnBEC. When the supersaturation removal function is activated in the analyses, a better evolution of the squall line is observed in the 4 and 6 h forecasts ( Fig. 15b and c). The displacement errors are reduced and less spurious convection is seen over southern Missouri 490 and northern Arkansas at these forecast hours (Fig. 15b, c, e, and f). As seen in Fig. 14, over these areas the CLIPSAT analysis showed less specific humidity content than in 75EnBEC. However, less spatial coverage of the convection is forecast over eastern Missouri, and the spurious convection is increased over southwestern Missouri at the 2 h forecast in CLIPSAT when compared to 75EnBEC ( Fig. 15a and d).

495
To further evaluate the experiments conducted, the FV3 LAM 1 h accumulated precipitation is also analyzed. Precipitation forecasts remain a challenge for NWP models at various spatial and temporal scales. Because of their complexity, precipitation forecasts are frequently used to evaluate model performance.
As mentioned in Sect. 3.4, precipitation forecasts are verified against Stage IV precipitation observations at various thresholds. Figure 16 shows the equitable threat score (ETS) and frequency bias (FBIAS) for 1 h accumulated precipitation greater 500 than 0.01 inches h −1 (Fig. 16a and c) and 0.25 inches h −1 (Fig. 16b and d) et al., 2020). ETS is based on the threat score or critical success index and is commonly used to examine the performance of Meanwhile, FBIAS indicates when an event is forecast more or less often than it is observed. FBIAS greater than 1 indicates 505 an event is overforecast, while less than 1 suggests an event is underforecast. An FBIAS equal to 1 indicates that the event is predicted as frequently as it is observed (e.g Wilks, 2006).
For this case study, ETS values decrease as the precipitation threshold increases in all of the experiments assessed ( Fig. 16a and b), indicating the difficulty in predicting heavier precipitation events. Most of the experiments with data assimilation have higher ETS scores for precipitation greater than 0.01 inches h −1 than NoDA during almost the entire 18 hour forecast 510 (Fig. 16a). This shows the positive impact of data assimilation in the analyses and subsequent lighter precipitation forecasts.
However, and unsurprisingly, experiment PSEUDO stands out with the lowest forecast performance between 5 h and 17 h forecast. Experiments 100EnBEC, CLIPSAT, and 75EnBEC show higher ETS values in the first 4 hours of the forecast. Between 4 h and 16 h forecast, experiment CLIPSAT shows the best performance among all experiments, followed by 75EnBEC and 100EnBEC. In terms of FBIAS, 100EnBEC shows better scores until the 11 h forecast lead (Fig.16c). Between 2 and 515 8 h forecast, Experiment VLOC shows the largest underforecast among all experiments. PSEUDO overforecasts the lighter precipitation in the first 5 h forecast and underforecasts it after the 13 h lead (Fig. 16c). The verification of 1 h accumulated precipitation greater than 0.25 inches consistently shows the poor performance of PSEUDO in both measures ( Fig. 16b and Figure 15. As in Fig. 7, but for experiments CLIPSAT (a, b, and c) and 75EnBEC (d, e, and f). d). PSEUDO has very low ETS scores for this threshold during almost all the forecast hours and overproduces the precipitation forecasts until the 11 h forecast lead. These results support the conclusion that further testing of this function in GSI is 520 needed to be used in RRFS. Using hybrid and pure ensemble BEC in data assimilation improves the precipitation forecasts by more than 0.25 inches h −1 in the first 13 hours forecast, with 75EnBEC outperforming 100EnBEC within the first four hours (Fig. 16b). After the 13 h forecast, experiment NoDA performs better, which shows data assimilation mainly improves the short term forecast and the major factor for a good long term forecast is the quality of the background from the outside model as well as the FV3 LAM model itself. In addition to PSEUDO, for the 0.25 inches threshold, precipitation is overforecast in 525 experiments 100EnBEC and NoDA in the first 3 hours, underforecast in experiments 3DVar, 75EnBEC, VLOC, and CLIPSAT.
All experiments underforecast accumulated precipitation greater than 0.25 inches after the 10 h forecast (Fig. 16d).

Summary and final remarks
The capability of a prototype RRFS, the RRFS v0.1, with data assimilation to simulate convection is investigated through a case study of a squall line that occurred over Oklahoma during the afternoon of 4 May 2020. Various data assimilation parameters 530 and algorithms were tested and evaluated in order to find the best configuration to produce more realistic convection forecasts.
This case study shows that the FV3 LAM with the RRFSv1a physics suite has good potential for storm forecasts. Overall, the configurations tested are able to capture the main characteristics of the major convective systems during the execution period.
However, the convection in the RRFS v0.1 tends to be overestimated in intensity and underestimated in its extent, as found in previous studies on FV3-based convection-allowing models (e.g, Tong et al., 2020;Gallo et al., 2021).

535
As expected, data assimilation makes the analyses fit the observations more closely in all cycles. However, the RMS errors of the OmB show distinguishable spikes in cycles where FV3 LAM 1 h forecasts are initialized from an external model as background for the analyses, which indicates the FV3 LAM is still under spin-up in this situation. Therefore, a cycling configuration including a spin-up period for cycles using external model forecasts may be considered. At present, work is underway at NOAA's Global Systems Laboratory (GSL) and EMC to determine the best cycling strategy for this system.

540
The data assimilation configurations tested show different impacts to the storm forecasts in terms of the squall line location, orientation and coverage, but experiments with data assimilation show an overall positive impact compared with the experiment without data assimilation. The data assimilation using pure ensemble BEC (100EnBEC) performs better at 2 h forecasts for the storms, but 75 % ensemble BEC (75EnBEC) produces better forecasts in all forecast lengths with a better positioning of the squall line evolution, especially at 4 h forecast. Lower RMSE and bias are also found in experiment 75EnBEC for the analyzed 545 surface variables and most vertical profiles.
Reducing the vertical localization from 3 layers to 1 layer in the lowest 10 layers of the analysis grid leads to, in general, a less skillful forecast. This suggests that the vertical localization configuration used in RAP is already a good choice and should be used in RRFS. Nevertheless, the RMSE and bias of the 2 h forecast of specific humidity are reduced above 600 hPa at