Reply on RC3

Please change to: "Decreasing the vertical ensemble localization radius [from X m to Y m] in the first 10 layers of the hybrid analysis results in overall less skillful forecasts."
The suggested change was adopted. Lines 12-13: "Decreasing the vertical ensemble localization radius from 3 layers to 1 layer in the first 10 layers of the hybrid analysis results in overall less skillful forecasts."

L 97-99: "Using hybrid 3DEnVar with 75 % of the ensemble background error covariance (BEC) showed storm structures in the 2 h forecast comparable to when using the ensemble Kalman filter (EnKF), although EnKF outperformed 3DEnVar in the first hour forecast."
(1) Does this mean that the BEC is weighted 75 % toward the dynamic ensemble-estimated BEC and 25 % toward the static climatological BEC?
(2) If the EnKF outperforms the 3DEnVar in the first hour, and then they are comparable in the second hour, then why not use the EnKF instead of the 3DEnVar?
(1) Yes, the text was improved to include the weight given to the static BEC. Line 100: "... and 25 % of the static BEC…"
(2) The reviewer would need to follow up with Tong et al. (2020), who conducted the work summarized in this portion of the literature review, to have this question addressed. For RRFS, development is underway to incorporate the EnKF into the hybrid data assimilation system for the first implementation.

L 99-100: "Both methods showed higher equitable threat scores (ETS) when compared to 3DVar and pure ensemble during the 4 h forecast analyzed." Does "both methods" refer to the 3DEnVar and the EnKF? What is a "pure ensemble"?

Yes, "both methods" refers to the 3DEnVar and the EnKF, while "pure ensemble" refers to pure En3DVar, in which the BEC is composed of 100 % of the ensemble BEC and 0 % of the static BEC. This was clarified in the text. Lines 101-103: "Both methods, hybrid En3DVar and EnKF, showed higher equitable threat scores (ETS) when compared to 3DVar and pure 3DEnVar during the 4 h forecast analyzed."

It would be useful to mention briefly how these types of local refinement differ.
The sentence has been updated for additional clarity.
Lines 129-132: "The FV3, originally a global model, features three types of local refinement capabilities: stretching of the global grid using the Schmidt refinement technique (Harris et al., 2016), one- and two-way nesting within the global grid (Harris and Lin, 2013), and recently a LAM capability (Black et al., 2021). The LAM capability eliminates the need to run a concurrent global model and instead relies upon lateral boundary conditions provided at pre-specified intervals from an external source."

L 143-144: "Hence, the CCPP contains a set of physical schemes and a common framework that facilitates the interaction between the physics and a numerical model (Bernardet et al., 2020)." Perhaps it would be more clear to say "between the physics parameterizations and the dynamical core".
As suggested, we modified line 148 in the new version of the manuscript.
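For readers less familiar with this separation of concerns, the following toy Python sketch illustrates the interface role such a framework plays. It is purely illustrative and is not the actual CCPP API (the CCPP is implemented in Fortran with metadata-driven, auto-generated interfaces); all names here are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for physics parameterizations: each scheme reads and
# updates a shared model state, while the framework mediates the calls between
# the dynamical core and the physics suite.
Scheme = Callable[[Dict[str, float], float], None]

def boundary_layer(state: Dict[str, float], dt: float) -> None:
    state["temperature_k"] += 1.0e-4 * dt              # toy tendency, not a real scheme

def microphysics(state: Dict[str, float], dt: float) -> None:
    state["specific_humidity"] *= 1.0 - 1.0e-5 * dt    # toy tendency

def run_physics_suite(suite: List[Scheme], state: Dict[str, float], dt: float) -> None:
    """The framework's job: call each scheme in suite order against a common
    state representation, so the dynamical core never calls physics directly."""
    for scheme in suite:
        scheme(state, dt)

# One physics step as the dynamical core would request it:
state = {"temperature_k": 288.0, "specific_humidity": 0.01}
run_physics_suite([boundary_layer, microphysics], state, dt=60.0)
```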
L 177-178: Some discussion should be given about what deficiencies this will have. For example, the lower-resolution global ensemble members (which are even lower resolution than the global deterministic forecast) may have significant biases, and will not resolve the error characteristics at the scale of the LAM. Clearly, the global ensemble statistics provide some useful information, but they are not ideal.
As suggested, some discussion was provided in lines 181-184 and lines 185-191.
Lines 181-184: "Although the use of lower-resolution global ensemble members may not be ideal for the representation of the error characteristics at finer scales, Wu et al. (2017) showed that considerable forecast improvement can be obtained even if the ensemble provided is from a different system, which is consistent with findings in other studies such as Hu et al. (2017)."

Lines 185-191: "Future work on RRFS involves the extension to a convective-scale ensemble in the EnKF, which will improve the representativeness associated with the forecast error covariance at finer scales. However, such a change is not a panacea. Aside from increased computational expense, the problem of rank deficiency of the ensemble-derived error covariance becomes more apparent with the expanded degrees of freedom associated with the finer spatial resolution. While localization helps somewhat, a computationally affordable ensemble is one that is often insufficiently sized. Therefore future work also includes efforts to introduce multiscale data assimilation capabilities, such as scale-dependent localization (e.g., Huang et al., 2021)."

L 223: "cold starts are performed every 12 hours and warm starts are performed at all other cycles using the 1 h forecast from the previous cycle as background for the analysis." Why is the DA continually reset with cold and warm starts? What prevents the standard self-contained forecast-analysis cycle? Is there a model drift when the DA is run continually in the RRFS?
The reviewer is right. Partial cycles are performed in order to avoid model drift when the data assimilation is run continually in the RRFS. Nevertheless, other methods are under consideration for future versions of RRFS, such as a fully cycled method with blending of long waves from the global model, based on Schwartz et al. (2022). Lines 241-243 were modified for clarity.
Lines 241-243: "Periodic updates of the large scale atmospheric conditions are needed in regional modeling systems in order to account for corrections made by global observations over land and ocean and to avoid model drift from those conditions (Benjamin et al., 2016)."

The diagram of Figure 2 was improved to show more cycles in order to clarify that cold starts are performed at 00:00 UTC and 12:00 UTC. The acronym FV3 LAM was added to indicate that the 18 h forecasts are obtained as part of the model execution after the data assimilation.

I suppose this implies that the analysis time is at the middle of the 1-hour forecast window, but I didn't notice this mentioned earlier.
Yes, the time window is centered at the analysis hour. Line 281 includes the word "central" to clarify this information.

The resolution of the 80-member ensemble forecast is approximately 25 km. As mentioned in a previous comment, the use of lower-resolution global ensemble members may not be ideal for the representation of the fine-scale motion seen at 3 km. However, doing so has been shown to be beneficial for the forecasts (Hu et al., 2017). It should be taken into account that this study uses a prototype RRFS, RRFS v0.1, and that future work already intends to address this issue. This was added in lines 181-184 and 288-289. Line 284 was also modified to include the resolution of the ensemble members, as suggested.
Lines 288-289: "As shown in Hu et al. (2017), using off-time global and fixed ensemble-based BEC still produces better results than just using the static BEC."

Line 284: "These forecasts have a horizontal resolution of approximately 25 km and…"

L 268-269: "For example, the 9 h GDAS ensemble forecasts initialized at 00:00 UTC (valid at 09:00 UTC) are used for the cycles from 07:00 UTC to 12:00 UTC." How is the 9-hour forecast initialized at 0 UTC used at 12 UTC? Is this using the BEC fixed in time at 9 UTC for the entire window?
Yes, since GDAS ensemble forecasts are available four times per day, the same ensemble-based BEC is used along the 6 h window. This is possible because GSI has the flexibility to use off-time ensemble forecasts, i.e., ensemble forecasts that do not match the analysis hour. As shown in Hu et al. (2017), using off-time global and fixed ensemble forecasts still produces better results than just using the static BEC. The following sentence was added to the manuscript. Lines 288-289: "As shown in Hu et al. (2017), using off-time global and fixed ensemble-based BEC still produces better results than just using the static BEC."

L 271-272: "In all experiments with data assimilation, two outer loops with 50 inner loops each are performed to minimize the cost function and find each analysis."

I'm not sure I understand what is done in the outer and inner loops. Are the inner loops referring to the PCG solver? If so, then what is done in the outer loop?
In this study, the minimization was performed in two outer loops with 50 iterations each, which were sufficient to reach the convergence condition. In each outer loop a re-linearization is performed after the 50 iterations. The increment is zero for the first outer loop, while for the second it is updated with the solution found after the 50 iterations of the first outer loop.
Lines 289-292: "In all experiments with data assimilation, two outer loops with 50 iterations each loop are performed to minimize the cost function and find each analysis. In each outer loop a re-linearization is performed (e.g., Kleist et al., 2009). The increment is zero for the first outer loop while for the second it is updated with the solution found after the 50 iterations of the first outer loop."
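To make the outer/inner-loop structure concrete, here is a minimal, self-contained Python sketch of incremental variational minimization with outer-loop re-linearization. It is illustrative only: a fixed-step gradient descent stands in for the inner-loop solver (GSI uses a preconditioned conjugate-gradient-type minimizer), and the matrices and observation operator are toy stand-ins.

```python
import numpy as np

def incremental_var(x_b, y, B, R, H, H_jac, n_outer=2, n_inner=50, step=0.1):
    """Toy incremental variational analysis: each outer loop re-linearizes the
    observation operator H about the current guess; each inner loop performs
    fixed-step gradient-descent iterations on the resulting quadratic cost
    (a stand-in for GSI's conjugate-gradient minimization)."""
    B_inv, R_inv = np.linalg.inv(B), np.linalg.inv(R)
    dx = np.zeros_like(x_b)              # increment is zero before the first outer loop
    for _ in range(n_outer):
        x_g = x_b + dx                   # current guess (the background on loop 1)
        d = y - H(x_g)                   # innovation at the re-linearization point
        Hm = H_jac(x_g)                  # Jacobian of H about the current guess
        dx_g = x_g - x_b                 # increment accumulated so far
        for _ in range(n_inner):         # 50 inner iterations per outer loop
            # grad J(dx) = B^{-1} dx + H^T R^{-1} (H (dx - dx_g) - d)
            grad = B_inv @ dx + Hm.T @ (R_inv @ (Hm @ (dx - dx_g) - d))
            dx = dx - step * grad
        # the second outer loop starts from the increment found by the first
    return x_b + dx                      # analysis

# Tiny usage example with a mildly nonlinear observation operator:
x_b = np.array([1.0, 2.0])
H = lambda x: np.array([x[0] + 0.1 * x[1] ** 2])
H_jac = lambda x: np.array([[1.0, 0.2 * x[1]]])
x_a = incremental_var(x_b, y=np.array([2.0]), B=np.eye(2), R=0.1 * np.eye(1),
                      H=H, H_jac=H_jac)
```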

L 274-275: "for different [applications]."
As suggested, "practices" was changed to "applications" in line 296 of the new manuscript.
L 277-279: "This baseline experiment is called NoDA and uses the same cycling configuration as experiments with data assimilation but without the execution of GSI." I understand that from your perspective this doesn't use the GSI, but for the entire procedure the GSI is used at the global scale. I think it would be helpful to explain this context in more detail.
The sentence was modified in line 299 to clarify that the cycling configuration of the NoDA experiment follows the same configuration as the experiments with data assimilation in terms of the cold and warm initial conditions. An explanation of what is performed in each initialization type is already provided in Sect. 2.7.

L 281: "experiments with different ensemble weights [are] conducted"
The verb tense was corrected in line 303 of the new manuscript.
L 335: "The analysis residuals (OmA) are also depicted in Fig. 4" The OmA's are less useful than the OmF's for assessing the DA performance. The analysis can be drawn arbitrarily close to the observations (e.g. using complete replacement). It is more valuable to see that the forecasts are being drawn closer to the future observations.
The reviewer is right. However, for a more complete assessment, results from both OmA and OmF are provided in Figure 5 for all cycles. Figure 4 is intended to show the spatial distribution of the assimilated observations and inform readers on the performance of the data assimilation system.
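Schematically, with observation vector y, background x_b, analysis x_a, and (possibly nonlinear) observation operator H, the two diagnostics discussed here are:

```latex
% Innovation (OmB/OmF) and analysis residual (OmA), schematically:
\begin{align*}
  \mathrm{OmB}\ (\text{or OmF}) &= y - H(x_b),\\
  \mathrm{OmA} &= y - H(x_a).
\end{align*}
```

A functioning assimilation should shrink OmA relative to OmB on average, but, as the reviewer notes, OmA can be driven arbitrarily close to zero, so the fit of forecasts to later observations remains the stronger diagnostic.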

L 343: "while some MESONET observations have large analysis residuals."
What is the expected cause of this larger discrepancy in these observations?
As pointed out in Morris et al. (2020), while some MESONET stations are well maintained, the majority do not meet siting standards and maintenance protocols. Therefore, larger residuals are expected from these observations when compared to other observation networks such as METAR, which is an officially maintained observation network. Owing to their uncertain quality, MESONET observations are assigned a higher observation error via a station blacklist, which also explains the larger residuals. Lines 373-376 were modified to include more details.
Lines 373-376: "As pointed out in Morris et al. (2020), while some MESONET stations are well maintained, the majority do not meet siting standards and maintenance protocols and therefore are assigned a higher observation error via a station blacklist. As expected, larger residuals are found from these observations when compared to other observation networks."

L 350: "This means the analyses fit closer to the observations, which is expected from a correctly executed data assimilation procedure." This is partially true, but it is also not correct to fit the observations perfectly (due to observational error), so care should be taken in such a statement.
We agree with the reviewer. The sentence was modified to the following. Lines 383-384: "This means the analyses fit the observations more closely, though owing to observation error not perfectly, which is expected from a correctly executed data assimilation procedure."

L 351-353: "There is a noticeable jump in the RMS error values of the OmB from 00:00 UTC (12:00 UTC) to 01:00 UTC (13:00 UTC) on 4 May 2020. This is because 00:00 UTC and 12:00 UTC are cold started from HRRR analyses." So why are you using the cold starts?
As explained before, the cold starts in this study are performed primarily to update the large-scale conditions and to correct the drift from the model. A more adequate cycling technique is being investigated, as stated in lines 601-602 of the manuscript.
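As an illustration of the partial cycling described above (cold starts at 00:00 and 12:00 UTC from external analyses, hourly warm starts from the previous cycle's 1 h forecast), a minimal sketch of the schedule follows; the function and variable names are hypothetical.

```python
from datetime import datetime, timedelta

def cycle_type(valid_time):
    """Partial cycling as described in the text: cold starts from external
    (HRRR) analyses at 00:00 and 12:00 UTC to refresh the large-scale
    conditions and limit model drift; warm starts at all other hours use the
    previous cycle's 1 h forecast as the analysis background."""
    return "cold" if valid_time.hour in (0, 12) else "warm"

start = datetime(2020, 5, 4, 0)   # first day of the case-study period
for h in range(24):
    t = start + timedelta(hours=h)
    print(t.strftime("%Y-%m-%d %H:%M UTC"), cycle_type(t))
```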

Figure 5: It would be useful to add a panel showing the difference between 3DVar and the 75EnVar.
In the discussion provided for Figure 5 it is highlighted that the background root mean square error increases from 14:00 UTC to 23:00 UTC, and that in Figure 5b this increase is larger than in Figure 5a (3DVar vs. 75EnBEC). Therefore, arrows were added to Figure 5 to draw attention to that marked increase, which is the focus of this result. The differences for other periods of time are subtle between these two panels. Accordingly, the caption of Figure 5 was updated. The MMI (F+O) values (Davis et al., 2009) derived from MODE results were then added to each subplot. This metric is appropriate for this kind of verification, providing more objectivity to the discussion, as suggested by the reviewer. Lines 353-359 were added to the document in which the method and metric used are described. Corresponding discussion of these values was added to the new version of the manuscript in figures 7, 11, 15, and A1.
Lines 353-359: "Hourly MRMS composite reflectivity mosaics (optimal method) observations (Zhang et al., 2016) are used to verify the composite reflectivity forecasts using the Method for Object-Based Diagnostic Evaluation (MODE) in MET. In order to quantitatively identify the experiment configuration that yielded better forecasts, the median of maximum interest (MMI (F+O)) (Davis et al., 2009) is analyzed. This metric results from the median between the maximum interest from each observed object with all predicted objects (MIF), and the maximum interest from each predicted object with all observed objects (MIO). It takes into account all attributes used in the total interest calculation, summarizing them into a single value. The forecast in greatest agreement with the observations will give MMI (F+O) values closer to one. Otherwise, the values will be closer to zero."

https://journals.ametsoc.org/view/journals/mwre/148/6/mwrD190128.xml

We agree with the reviewer that the static BEC matrix used may not be optimal for RRFS, and efforts are underway in order to obtain this matrix using RRFS's own forecasts. As mentioned above, the static B used in the experiments is the one currently used in RAP and HRRR.
Lines 450-451: "These results may indicate that the static BEC matrix used may not be optimal for RRFS v0.1 and efforts are underway in order to obtain a better BEC matrix."
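Returning briefly to the MMI (F+O) metric quoted in Lines 353-359 above, a minimal numpy sketch of the calculation may help; the interest matrix here is a hypothetical input that would normally come from MODE's total-interest output in MET.

```python
import numpy as np

def mmi_fo(interest):
    """Median of maximum interest, MMI (F+O) (Davis et al., 2009), from a MODE
    total-interest matrix with one row per observed object and one column per
    forecast object, following the description quoted in Lines 353-359."""
    mif = interest.max(axis=1)   # max interest of each observed object over all predicted objects
    mio = interest.max(axis=0)   # max interest of each predicted object over all observed objects
    return float(np.median(np.concatenate([mif, mio])))

# Two observed vs. three forecast objects; values closer to 1 mean better agreement:
print(mmi_fo(np.array([[0.9, 0.2, 0.4],
                       [0.1, 0.7, 0.3]])))
```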

Also, the online estimation of the hybrid weighting parameter has been explored by De Azevedo et al. 2020, which may be worth mentioning:
https://www.tandfonline.com/doi/full/10.1080/16000870.2020.1835310

As suggested, the work of Azevedo et al. (2020) was mentioned in lines 451-453.
Lines 451-453: "Moreover, an online estimation approach may be explored for the specification of the hybrid weighting parameter, such as the method proposed by Azevedo et al. (2020) in which a geographically varying weighting factor alpha is defined and the ensemble spread is used for the assignment of the weights."
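Schematically, the hybrid weighting discussed here, and in the earlier question about the 75 % ensemble BEC, can be written as a convex combination of the two covariances, with the extension of Azevedo et al. (2020) letting the ensemble weight vary in space according to the ensemble spread:

```latex
% Hybrid background error covariance with ensemble weight beta_e:
\[
  \mathbf{B}_{\mathrm{hyb}}
    = (1-\beta_{e})\,\mathbf{B}_{\mathrm{static}}
    + \beta_{e}\,\mathbf{B}_{\mathrm{ens}},
  \qquad \beta_{e}=0.75 \ \text{in 75EnVar},
  \qquad \beta_{e}\rightarrow\beta_{e}(\mathbf{x}) \ \text{in Azevedo et al. (2020)}.
\]
```

(In GSI the combination is realized through the extended control variable formulation rather than by forming these matrices explicitly, but the weighting shown is the schematic equivalent.)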

Figure 8:
The green RMSE is difficult to read. A slightly darker color would help. Also, perhaps you could order the RMSE in each plot from lowest to highest so that it is easier to see how each method performs in comparison to the others.
As suggested, Figure 8 was improved using a darker green for the 3DVar experiment results and the RMSE values were ordered from lowest to highest. Figures 9 and 16 were also modified to account for the color change in the 3DVar plot line.

L 430: "this study looked at"
Change to: "this study looks at"

The verb tense was corrected in line 476 of the new version of the manuscript.

Figures 8 and 10: I'm not sure that I see the value of the confidence interval shading. How are these confidence intervals computed? Can the authors justify that this statistic is meaningful for this application? (e.g., the confidence interval implies that this specific case can be extrapolated to all other relevant storm instances - perhaps more discussion of the computation and application of this method would be warranted.)

We agree with the reviewer that the interpretation of the confidence intervals was not added to the discussion, and it was therefore improved in the new version of the manuscript. The confidence intervals used in this study were derived using a bootstrap resampling technique of 1000 replications with replacement at each forecast lead hour in every cycle, and with bias-corrected percentiles, as reported by Wilks (2006) and Gilleland et al. (2018). The confidence intervals help to highlight where the differences between the experiments analyzed are significant or not. Adding the confidence interval shading shows the variability in the sample and informs readers how confident we can be that the mean values plotted are actually different. However, since we are analyzing multiple forecasts for this one single case study, we are not confident that the results obtained in this study can be extrapolated to all other relevant storm instances. Nevertheless, some of the results show statistically significant differences for the sample of this case study, which were highlighted in the discussions of figures 8, 10, and 12 in the new version of the manuscript. As suggested, several lines in the document (435-444, 487-491, 526-529) were modified for clarity and to address the point made by the reviewer.

The results (e.g. in Figure 12) seem to indicate the use of pseudo-observations may be a bad idea. Is there any reason why continued effort should be made to do this?
The experiment PSEUDO needed to be rerun because of a bug in the calculation of the planetary boundary layer (PBL) height, which affected the PBL pseudo-observations function. This bug only affected this experiment. Accordingly, figures 11, 12, and 13 were replaced with the new results, which indicate that this strategy improves the convection, in particular for the 4 h forecast. The new analysis of these results and the corresponding discussion are provided in Sect. 4.3 (lines 512-537) of the new version of the manuscript.

Figure 12, this seems to be over-fitting the data, which is then causing problems throughout the model column. It seems like this result would be better achieved via post-processing, so that it doesn't have negative ripple effects throughout the cycled DA.

The model level 50 is a hybrid level which corresponds to around 850 hPa at approximately 1450 m of altitude. As suggested, line 543 was modified for clarity and "Hybrid" was included in Figure 14.
Line 543: "...model hybrid level 50 (located in the lower atmosphere at around 850 hPa)"

L 559: "Supersaturation clipping in GSI can improve specific humidity fields in the analyses, allowing for more realistic storm and precipitation forecasts at longer forecast lengths. At shorter forecast lead hours, it produces more spurious convection." This sounds like the wrong modification is being made to correct a longer-term bias. It would be beneficial to track down the root cause that accumulates to cause the longer-term bias without degrading the short-term skill.
We agree; however, this also highlights a challenge with the partial cycling procedure, where we cannot fully examine the model bias owing to the fact that the atmospheric state is periodically refreshed with the global models. Future work involves adapting the approach employed by Wong et al. (2020).
Lines 556-560: "Results from CLIPSAT indicate the presence of longer-term bias that is being corrected to some extent in this experiment. However, because the atmospheric state is periodically refreshed with the large scale conditions as part of the partial cycling procedure, the model bias cannot be fully examined. Further investigation involves adapting the approach employed by Wong et al. (2020) in which forecast tendencies are used to investigate systematic model biases in a continuously cycled experiment."

L 572: "More extensive testing of RRFS, covering a wider variety of cases, larger domain, and longer period of time, is needed to demonstrate whether results found here are robust or may be case dependent." I agree, which is why I'm confused by the presentation of the confidence intervals. Could the authors either remove these from the plots above or describe in greater detail how they were computed and why they are relevant in this case.
As mentioned above, we agree with the reviewer that the interpretation of the confidence intervals was not included in the discussion, and it was therefore improved in the new version of the manuscript. It is important to highlight that the results presented are representative of the case study analyzed. However, we are not confident that the results obtained in this study can be extrapolated to all other relevant storm instances. Therefore, we conclude that more tests covering a wider variety of cases are needed in order to obtain more robust results.
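For concreteness, a minimal Python sketch of the percentile bootstrap described above follows (1000 replications with replacement). The bias-corrected percentile adjustment of Gilleland et al. (2018) is omitted for brevity, and the sample values are hypothetical.

```python
import numpy as np

def bootstrap_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean verification score:
    resample the scores with replacement n_boot times, take the mean of each
    resample, and return the (alpha/2, 1 - alpha/2) quantiles of those means."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    return np.quantile(means, [alpha / 2.0, 1.0 - alpha / 2.0])

# e.g., RMSE samples pooled over cycles at a single forecast lead hour:
lo, hi = bootstrap_ci([1.2, 1.4, 1.1, 1.6, 1.3, 1.5])
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```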