Impact of model improvements on 80-m wind speeds during the second Wind Forecast Improvement Project (WFIP2)

During the second Wind Forecast Improvement Project (WFIP2; Oct 2015 – Mar 2017, held in the Columbia River Gorge and Basin area of eastern Washington and Oregon states) several improvements to the parameterizations used in the 20 High Resolution Rapid Refresh (HRRR – 3 km horizontal grid spacing) and the High Resolution Rapid Refresh Nest (HRRRNEST – 750 m horizontal grid spacing) Numerical Weather Prediction (NWP) models were tested during four 6-week reforecast periods (one for each season). For these tests the models were run in control (CNT) and experimental (EXP) configurations, with the EXP configuration including all the improved parameterizations. The impacts of the experimental parameterizations on the forecast of 80-m wind speeds (wind turbine hub height) from the HRRR and HRRRNEST models 25 are assessed, using observations collected by 19 sodars and 3 profiling lidars for comparison. Improvements due to the experimental physics (EXP vs CNT runs) and those due to finer horizontal grid spacing (HRRRNEST vs HRRR), and the combination of the two are compared, using standard bulk statistics such as Mean Absolute Error (MAE) and Mean Bias Error (bias). On average, the HRRR 80-m wind speed MAE is reduced by 3-4% due to the experimental physics. The impact of the finer horizontal grid spacing in the CNT runs also shows a positive improvement of 5% on MAE, which is particularly large 30 at nighttime and during the morning transition. Lastly, the combined impact of the experimental physics and finer horizontal grid spacing produces larger improvements in the 80-m wind speed MAE, up to 7-8%. The improvements are evaluated as a function of the model’s initialization time, forecast horizon, time of the day, season of the year, site elevation, and meteorological phenomena. Causes of model weaknesses are identified. Finally, bias correction methods are applied to the 80-

m wind speed model outputs to measure their impact on the improvements due to the removal of the systematic component of the errors.

Introduction
The second Wind Forecast Improvement Project (WFIP2) took place in Oregon and Washington states from October 2015 through March 2018. This Department of Energy (DOE) and National Oceanic and Atmospheric Administration (NOAA) 5 funded project was aimed at improving the parameterizations within the High Resolution Rapid Refresh (HRRR -3 km horizontal grid spacing) and its nested version (HRRRNEST -750 m horizontal grid spacing), with the goal of increasing the forecast skill of wind turbine hub-height (80-m) wind speeds. The study area is a region of complex terrain that included a large amount of wind power generation, with more than 4.6 GW of installed capacity associated with the Bonneville Power Administration (BPA) balancing authority. 10 WFIP2 (Shaw et al., 2019;Wilczak et al., 2019a;Olson et al., 2019a), as well as the first WFIP (held in the U.S. Great Plains, in 2011-2012Wilczak et al., 2015), represent efforts to improve forecasts for the renewable energy sector. While the first WFIP was in an area with relatively flat terrain, WFIP2 took place in an area characterized by pronounced topographic features.
These include the Cascade Mountains and the Columbia River Basin to the east, with the Columbia River Gorge forming a gap in the mountain range resulting in complex flow patterns in the region. Important background information regarding the 15 project can be found in several publications: Shaw et al. (2019) presents a general overview of the project; Wilczak et al. (2019a) describes the instruments deployed for the 18-month long campaign and the meteorological forecast challenges of the region; and Olson et al. (2019a) discusses the parameterization improvements applied to the HRRR and HRRRNEST models resulting from a better understanding of local atmospheric processes achieved by the use of the observations. Toward the end of the campaign, a model freeze was imposed and some case studies with interesting meteorological conditions 20 were selected to focus model improvements around. Changes to the model physical parameterizations based on model known deficiencies and findings from this campaign were then tested over these case studies and those that showed improvements were selected to become a new experimental physics suite. Finally, four 6-week periods (one for each season: "spring 2016" -3/ 25-5/7/2016, "summer 2016" -6/24-8/7/2016, "fall 2016" -9/24-11/7/2016, and "winter 2017" -12/25/2016-2/7/2017) were chosen to re-run the models in control (CNT) and experimental (EXP) configurations. The EXP configuration included 25 all the modifications/improvements added to the models, while the CNT runs used the HRRR parameterization present in the NCEP operational version of the HRRR at the start of WFIP2. The four 6-week periods will be called "reforecast periods" throughout the rest of the manuscript, while the model re-runs (HRRR CNT, HRRR EXP, HRRRNEST CNT, and HRRRNEST EXP) will be called "reforecast runs".
Since the primary goal of WFIP2 is to advance the state of the art of wind energy forecasting in areas with complex terrain in 30 general, and in the BPA region in particular, in this paper we use hub-height wind speed observations from sodars and profiling lidars to assess the impacts of the experimental parameterizations and finer horizontal grid spacing on the performance of the models. These instruments were chosen because they accurately measure wind speed and direction from 20 m up to few hundred meters above ground level, which is the layer of the atmosphere most relevant for wind energy production. While in this paper improvements in bulk statistics (Mean Absolute Error -MAE, and bias) are evaluated, a companion research article (Djalalova et al., 2019) determines the improvements using the same set of measurements and the same model runs at forecasting wind power ramp events. 5 The paper is organized as follows: in Section 2 the observational and NWP model datasets are described; in Section 3 details of the bulk statistical results are presented for 80-m wind speed MAE and bias for individual models, in terms of time of the day, model initialization time, forecast horizon, season of the year, and site elevation; in Section 4 improvements in the statistical results are quantified due to the experimental physics, model finer horizontal grid spacing, and a combination of the two, again as a function of the time of the day, the season of the year, and the different meteorological phenomena predominant 10 in the area, both with and without bias correcting the model output. Section 5 presents a summary and conclusions.

Observational dataset
Various in-situ, scanning, and profiling instruments were deployed and maintained by WFIP2 team partners who later provided quality controlled versions of the data. All data are available to the public from the DOE Data Archive and Portal (DAP; 15 https://a2e.energy.gov/projects/wfip2). The list of instruments, deployed in nested arrays (with the outer scale of the order of 500 km and the inner scale of the order of 2x2 km, see Fig. 1a of Wilczak et al., 2019a), includes 3 449-MHz, 8 915-MHz radar wind profilers with radio acoustic sounding system temperature profiles, 19 sodars, 5 scanning lidars, 5 profiling lidars, 4 microwave radiometers, 10 microbarographs, a network of sonic anemometers, and many surface meteorological stations.
An overview of the instrumentation capability and how the instruments were used for atmospheric process understanding and 20 model validation is presented in Wilczak et al. (2019a) and Olson et al. (2019a). Also, Pichugina et al. (2019) compared a full year of wind profiles from Doppler lidars at three WFIP2 sites to the operational (at the time of their study) HRRR-NCEP runs, showing how model errors varied from site to site, and highlighting several aspects on where HRRR-NCEP needed improvement.
In the current study, data collected at 22 remote sensing sites (19 sodars and 3 lidars) spanning the WFIP2 region are used, 25 since their measurements cover the part of the atmosphere of most interest for wind energy. As measurements through the entire turbine-rotor layer were not always available, we decided to focus on the 80-m level when available, to avoid averaging the data over a variable depth layer of the atmosphere that could result, in some cases, to biasing the average toward values more representative of the lower part of the layer.
Some sites had a co-located sodar and lidar. In this situation the instrument with the highest data availability during the 30 campaign was chosen. This choice led to the selection of the 19 sodars and 3 lidars listed in Table 1, where the latitude, longitude, elevation of the site, terrain complexity, percentage of data availability over the four reforecast periods, and the institution in charge of the instrument are also presented. The terrain complexity was computed as the standard deviation (in meters) relative to the average slope in a 6 by 6 km area (81 points) around the site using the HRRRNEST model topography.
Although the focus of this study is on the 80-m wind speed statistics, we also examine the statistics of wind power generation, using a generic IEC Class 2 power curve to convert wind speed into power. Details for the conversion from wind speed into power are given in Wilczak et al. (2019b), while Wilczak et al. (2019a) and Djalalova et al. (2019) demonstrated that the 5 equivalent wind power generation computed from these 22 remote sensors using the above-mentioned curve is representative of the actual wind power generation over the entire BPA area. The geographical location of the 19 sodars and 3 lidars is provided in a map later in the manuscript, and a more comprehensive base map of all the instruments deployed for WFIP2 is presented in Wilczak et al. (2019a).

NWP Models 10
WFIP2 model development/improvement focused on improving forecasts in complex terrain for wind energy applications. Improvements in operational NWP models usually target extreme weather events and near-surface weather in general, with little focus on the improvement of the forecast of wind speed at hub-height. Wind energy generation is especially abundant in regions of complex terrain where there are many forecasting challenges due to the complexity of the terrain-modulated flows 15 and the feedback processes associated with them Thus, forecast errors in hub-height wind speeds can originate from various model components. For this reason, WFIP2 model development/improvement included a number of model components: the boundary-layer and surface-layer schemes, the representation of drag associated with subgrid scale topography and wind farms, and the cloud-radiation interaction. Moreover, because of the complex terrain, special care had to be devoted to scale adaptive physical parameterizations. 20 While the reader is referred to Olson et al. (2019a;2019b) for complete details on the improved model configurations, we provide a list with brief summaries of the set of model physical parameterizations and relevant numerical methods targeted for development in WFIP2:

Planetary boundary layer (PBL) local mixing: mixing length revision
The mixing length is the distance parcels are allowed to be displaced by turbulence processes, therefore 25 depending on the size of the turbulent eddies. In the new formulation, the mixing length is independent of the height above ground and turbulent eddies are forced to be smaller than the depth of the model layer in strong stratification, thus improving maintenance of cold pools and stable boundary layers in general.
2. PBL non-local mixing: mass-flux scheme A mass-flux scheme was added to the original MYNN PBL scheme, making it an eddy-diffusivity mass-flux 30 (EDMF) scheme and allowing for direct coupling of the sub cloud convective cores and the cloud layer above.
This resulted in improved coverage of shallow-cumulus and improved profiles of temperature and humidity, while a smaller impact was found on low-level winds during the day.
3. Subgrid-scale (SGS) clouds and coupling to radiation SGS clouds and coupling to radiation improves the downward shortwave forcing in shallow-cumulus and stratocumulus conditions. The primary impact is to improve the surface energy balance, which can then more accurately drive the turbulent mixing, while a small direct impact was found on low-level winds.

Drag due to SGS topography 5
The representation of drag due to SGS orography was added to the HRRR physics suite including surface drag due to gravity waves and form drag. While the SGS gravity wave drag acts in stable PBLs and the form drag acts for all stabilities, form drag has a smaller impact than the gravity wave drag at the high resolutions of the HRRR, and neither are active in the HRRRNEST. This addition improves the maintenance of cold pools by reducing the near-surface wind speeds (and wind speed bias), while also reducing the near surface vertical wind shear in stable 10 conditions.

Surface layer scheme
In the Monin-Obukhov theory the flat-terrain approximation implies that all fluxes (momentum, heat and moisture) happen in the vertical, but this approximation becomes unrealistic in complex terrain. For this reason, the new surface layer scalar flux algorithm now includes horizontal fluxes. 15 6. 3D turbulence scheme While typically horizontal turbulent mixing is calculated with no direct communication with the parameterized vertical mixing, the impact of horizontal fluxes can now be of similar magnitude as the vertical fluxes, improving the representation of fine-scale turbulence. The expected benefits are mostly found at sub-kilometric scales.

Horizontal finite differencing 20
Horizontal diffusion is now performed in Cartesian space instead of terrain-following sigma coordinates. This option is a replacement to mixing along sigma coordinates, which can produce artificial vertical mixing in steep terrain. This change improves the maintenance of cold pools by no longer mixing vertically when model vertical coordinates follow steep terrain.

Wind-farm parameterization 25
A representation of wind-farm drag was introduced by adopting the Weather Research and Forecasting (WRF) wind farm parameterization (Fitch et al. 2012(Fitch et al. , 2013a(Fitch et al. , and 2013b. The inclusion of this parameterization reduces a high wind speed bias within wind farms but can contribute to a slight negative wind speed bias near wind farms. The biggest improvements in the reforecasts were found from 1, 3, and 4, which improved the representation of turbulent mixing in stable boundary layers (Olson et al. 2019a;2019b). 30 Details of the simulations used in this analysis are as follows. For the four reforecast periods (spring, summer, and fall 2016, and winter 2017), 24-hour forecasts were made with the HRRR and HRRRNEST, initialized twice per day at 00 and 12 UTC, using initial conditions from the operational RAPid refresh model (RAP; Benjamin et al., 2016), with no additional data assimilation, and with output available every 15 minutes. For simplicity, we refer to the runs initialized at 0000 UTC as the Z00 runs, and the runs initialized at 1200 UTC as the Z12 runs. The reforecasts were run in both CTL and EXP configurations, with the EXP configuration including all the improved parameterizations. The 3-km HRRR is directly initialized off of the 13-km RAP grid, so there is a spin-up period associated with the model atmosphere adjusting to the higher resolution terrain, which typically has much higher mountain peaks and lower valleys in the HRRR relative to the RAP. This spin-up problem would be even more exaggerated if the HRRRNEST was directly initialized from the RAP model atmosphere, so to minimize 5 this problem, we chose to allow the HRRR model atmosphere to spin-up for 3 hrs before we initialized the HRRRNEST from the HRRR 3-hr forecast. Therefore, the HRRRNEST output runs were delayed by 3 hours to ameliorate these spin-up problems., so that a gap in the HRRRNEST model output exists from forecast horizon 0000 to forecast horizon 0200 (from 0000 UTC -0200 UTC for the Z00 initialized runs, and from 1200 UTC -1400 UTC for the Z12 initialized runs). For this reason, in order to show meaningful comparisons between the models, we utilize only the forecast horizons 03-24 for the 10 HRRR runs also.
For our analysis, in order to compare to the observations, the 80-m wind field is obtained from model output horizontally bilinearly interpolating to the 22 site locations using the 4 closest grid points, and linearly vertically interpolating the two closest heights (approximately 36 and 83 m).The HRRR has relatively coarse vertical resolution, with only five full model layers below 200 m, but the middle of the third layer is very close to 80-m AGL, so a linear interpolation does not have a significant 15 negative impact on the accuracy of the estimated 80-m wind speeds.
The observations were also averaged and interpolated in time over the 15-minute model output times (most of the observations were already at a 15 min interval, but some were at a 10 min interval or less), and linearly interpolated to the 80-m level.

Bulk statistical results of 80-m wind speed forecasts
In this section we examine the diurnal variation of 80-m wind speed MAE and bias at all sites and the seasonal variation of 20 MAE and biases from the four reforecast periods to identify the dependence of the statistics on the time of the day, model initialization time, forecast horizon, and season. The dependence on the elevation of the site is also investigated.

Statistical results as a function of the time of the day, model initialization time, forecast horizon, and season of the year
The 80-m wind speed MAEs, averaged over the 19 sodars and 3 lidars, show a clear diurnal pattern (Fig. 1). Each of the four 25 reforecast runs (HRRR CNT is in red, HRRR EXP in blue, HRRRNEST CNT in yellow, and HRRRNEST EXP in black) is and 2.4 m s -1 , with significantly smaller values during daytime (unstable atmospheric conditions), ranging between 1.6 and 1.8 m s -1 (panel a). For reference, the insert of panel a of Fig. 1 presents the diurnal cycle of the averaged observed 80-m wind speeds for the four reforecast periods, showing that 80-m wind speeds are higher at nighttime, particularly in summer and to a lesser extent in spring (contributing to MAE to be larger at nightime compared to daytime), but less so in fall and winter. In addition to the larger values of MAE found at nighttime, the reforecast runs also show larger differences between the models.
In contrast, during daytime not only are the MAEs smaller, but the differences between the four models reforecast runs are 5 also smaller. Figure 1 can be used to examine the dependence of MAE on initialization time and forecast horizon. In particular, the Z00 MAEs are smaller than the Z12 MAE values for times soon after the Z00 initialization (for the first part of the day O lines are below X lines). In contrast the Z12 MAEs tend to be smaller than Z00 values for times soon after the Z12 initialization (for the second part of the day X lines are below O lines, except for HRRRNEST EXP), meaning that the MAE increases with the forecast horizon. Certainly, for each of the model reforecast runs, the time of the day is more important at determining the 10 MAE values than the initialization time, as expected.
While on average the experimental physics and finer grid spacing lowers the MAEs over the four reforecast periods (Fig. 1, panel a: blue, yellow and black lines all show smaller MAEs compared to the red lines), the improvements are less consistent when looking at the four reforecast periods separately (panels b-e). In winter, the improvements are more robust, as explained in Olson et al. (2019a), due to better maintenance of cold pools which frequently happen in this area over the winter (Whiteman 15 et al., 2001;McCaffrey et al., 2019), and which are investigated in detail in Section 4.4.
The biases of the 80-m wind speed also exhibit a diurnal cycle (Fig. 2). Again, the upper panel shows averages of the four reforecast periods and the lower panels display the four reforecast periods separately. The diurnal trend of the bias in the HRRR CNT is evident in the red curves, with positive biases at nighttime (stable atmospheric conditions), averaging 0.7 m s -1 , and negative values during daytime (unstable atmospheric conditions), down to -0.4 m s -1 (panel a). The diurnal trend for the HRRR 20 CNT is also clear for the four reforecast periods separately (panels b-e). The HRRR EXP reforecast runs (blue curves) tend to eliminate the diurnal trend in all reforecast periods, because of the differences in the treatment of boundary-layer turbulence in unstable and stable conditions, but lowers the bias significantly, leading to a negative average value of ~-0.6 m s -1 (panel a).
A possible reason for such behaviour in the HRRR EXP runs can be found in the representation of drag due to SGS orography (Steeneveld et al., 2008;Tsiringakis et al., 2017) added to the HRRR physics suite. This new representation is only active in 25 the HRRR, but not in the HRRRNEST due to its finer grid spacing (Olson et al., 2019a). While the expected benefit of such improved representation of the drag is to decrease the high wind speed bias in stable conditions often found in the HRRR, the detriment in this case seems to be a too large decrease in wind speed. The addition of wind turbine drag from the wind farm parameterization also contributed to the low wind speed bias, but to a lesser degree. Due to the results found in this study and in other WFIP2 related studies, ways to revisit the treatment of the drag due to sub-grid scale orography are under 30 consideration. Finally, the diurnal trends in the MAE and biases are smaller in the winter than in other seasons. This result could also be due to differences in the treatment of boundary-layer turbulence in unstable and stable conditions. Similar results were found by Berg et al. (2019) in their study of the sensitivity of winds simulated using the Mellor-Yamada-Nakanishi-Niino planetary boundary-layer parameterization in the Weather Research and Forecasting model. While the HRRRNEST reforecast runs (CNT in yellow and EXP in black) reduce the bias compared to their respective HRRR simulations it is not clear yet if the HRRRNEST EXP is better than the HRRRNEST CTL or vice-versa. Similar to the MAEs, differences between the four reforecast runs are larger at nighttime and smaller at daytime (when the biases are consistently mostly negative).
MAEs of the 80-m wind speed, presented in the left panel of Fig. 3, show that the HRRR EXP (in blue) does better than the 5 HRRR CNT (in red) in fall and in winter, but not in spring or summer. MAEs of the HRRRNEST CNT (in yellow) are better than those of the HRRR CNT (in red), and the HRRRNEST EXP (in black) is now almost always better than the other models.
Biases, presented on the right panel of Fig. 3, show values in the HRRR EXP (in blue) becoming way too negative (caused by the additional orographic drag employed in the HRRR EXP) compared to the HRRR CNT (in red) in the spring, summer and fall. Future revisions of the orographic drag in the HRRR will address this issue. The HRRRNEST EXP (black) is better than 10 the HRRRNEST CNT (in yellow) only in the fall and winter, and again it is not clear that one of these two models has a demonstrably smaller overall bias.
The results of this section indicate that the time of the day is of primary importance in terms of MAEs and biases, while the model initialization time and the forecast horizon are of secondary importance. Consequently, the remaining statistical analysis is carried out averaging the Z00 and Z12 runs. 15

Statistical results as a function of the site elevation
As evident from Table 1 presented for the four reforecast periods. Sites are sorted from low to high elevation (from Rufus on the left to Prineville on the right) and biases are normalized by the averaged (observed) 80-m wind speed at each site. On the right axes of panels a, b, c, and d of Fig. 4, we show (as dotted black lines) the averaged 80-m wind speed at each site for each reforecast period. These averages show some dependence on site elevation in fall and winter, most likely caused by cold pool events with lower wind speeds confined to the sites at lower elevation. We also note that sites at higher elevation do not have higher 80-m wind speeds 25 than sites at lower elevation, neither in summer nor spring. The topography of the area with the location of the sites is in Fig.   4, panel e. The biases presented in Fig. 4 show that the diurnally and seasonally averaged biases are smaller (and often negative) at lower elevations, with a positive trend with increasing elevation. In particular, the HRRR CNT (red) has the largest positive bias at high elevations in winter which is likely due to the premature mix-out of cold pools occurring preferentially at higher elevations first, which can lead to longer periods of time with a positive wind speed bias. As in Fig. 2, HRRR EXP runs (in 30 blue) always show the lowest bias, almost always negative, particularly at the lowest elevation sites. When not normalized by the averaged wind speed at the site (not shown) the trend was consistent with that shown in Fig.4, but even more accentuated.
In contrast, a similar analysis but for MAE normalized by the averaged 80-m wind speed at each site (not shown) did show a mostly neutral dependence on site elevation (with a slight decrease with site elevation).
Although it is not clear at this point what is the physical reason for the models having a normalized bias dependent on site elevation (it may be due to the characteristics of the atmospheric phenomena predominant in this area, and challenging to forecast), it is important to know that in an area of complex terrain like that of WFIP2 this dependence exists. The dependence 5 of the bias on the elevation indicates that a post-processing bias correction of the model should be done at each site independently.
Terrain complexity is not as powerful of a predictor of model bias as site elevation. A similar analysis to that presented in Fig.   4 was performed but sorting the sites by the complexity of the surrounding terrain (see Table 1). In this analysis (not shown) the trend of 80-m wind speed MAE and bias was not clearly defined. 10

Improvements to the statistics due to the experimental physics and finer horizontal grid spacing
In this section we examine the statistical significance and percentage improvement in the model forecast of 80-m wind speed and power. The improvements are analyzed in terms of the new physics (EXP vs CNT runs) as well as horizontal grid spacing of the models (HRRRNEST vs HRRR runs), first separately and then combining the impact of the two (HRRRNEST EXP vs HRRR CNT). Finally, we evaluate the dependence of the improvements on the dominant meteorological phenomena of the 15 area (Shaw et al., 2019), including cold pools (Whiteman et al., 2001;Zhong et al., 2001;McCaffrey et al., 2019), gap flows (Sharp and Mass 2002;2004), easterly flows (Neiman et al., 2018), mountain waves (Durran 1990;2003), topographic wakes, and convective outflows (Mueller and Carbone, 1987).

Impact of experimental physics (CNT vs EXP runs)
The impact of the experimental physics in the HRRR runs (HRRR EXP vs HRRR CNT) is almost always positive for wind 20 speed and power. Percent improvement and statistical significance is shown in Fig. 5 for 80-m wind speed (left panels) and 80-m wind power (right panels). These results are obtained averaging all sites together, over the two model initialization times

Impact of model finer horizontal grid spacing (HRRRNEST vs HRRR)
Improvements due to finer horizontal grid spacing are larger than those due to the experimental physics. The impact of the finer horizontal grid spacing in the control runs (HRRRNEST CNT vs HRRR CNT) is shown in Fig. 6 for 80-m wind speed (left panels) and 80-m wind power (right panels). MAE values in the upper panels are in red for the HRRR CNT runs and in yellow for the HRRRNEST CNT. In the bottom panels of Fig. 6 we see a large percentage improvement in MAE due to finer 5 horizontal grid spacing, particularly at nighttime and during the morning transition (approximately between 0100 UTC and 1500 UTC). Improvements due to finer horizontal grid spacing are larger than those due to the experimental physics in Fig. 5, with values now up to 10% in 80-m wind speed MAE and up to 15% in 80-m wind power MAE. The percentage improvements are smaller during daytime, when the HRRR model with larger horizontal grid spacing had lower MAE compared to nighttime.
In Fig. 7 we compare the improvements in 80-m wind speed MAE due to the experimental physics (left panels) from the 10 HRRR (shown previously in Fig. 5) with those found in the HRRRNEST, and the improvements due to finer horizontal grid spacing (right panels) from the CNT simulations (shown previously in Fig. 6) with those found in the EXP simulations. The dark blue curve shows the impact of the experimental physics on the models with larger horizontal grid spacing (HRRR EXP vs HRRR CNT), while light blue shows the impact of the experimental physics on the models with finer horizontal grid spacing (HRRRNEST EXP vs HRRRNEST CNT). The red curve shows the impact of finer horizontal grid spacing on the CNT runs 15 (HRRRNEST CNT vs HRRR CNT), while the impact of finer horizontal grid spacing on the EXP runs (HRRRNEST EXP vs HRRR EXP) is shown in orange. When averaged over the four reforecast periods, the impact of the experimental physics (left upper panel) is quite similar between the higher and finer horizontal grid spacing models, however when considering the four reforecast periods separately (lower left smaller panels) the impact varies considerably. For example, in summer the impact of the experimental physics on the HRRRNEST is mostly neutral (light blue curve), while in the HRRR it is actually producing 20 a negative impact (dark blue curve). In contrast, while the impact of the experimental physics is positive for both horizontal grid spacings in winter, it is very positive for the HRRR (dark blue curve). This variation could be due to changes in the physics that are grid-spacing dependent, making the impact different for HRRR and HRRRNEST. Similar considerations can be made for the improvement due to finer horizontal grid spacing (right panels). When averaged over the four reforecast periods (right upper panel) the impact of the finer horizontal grid spacing is similar between the models with different physics. However, for 25 the winter reforecast period (lower right panel) the impact of the finer horizontal grid spacing on the EXP runs is mostly neutral (orange curve), while for the CNT runs it is clearly positive (red curve).

Impact to the statistics due to the experimental physics and finer horizontal grid spacing (HRRRNEST EXP vs HRRR CNT)
As a final step of the analysis, the combined impact on 80-m wind speed MAE of the experimental physics and finer horizontal 30 grid spacing, comparing the HRRRNEST EXP to HRRR CNT is shown in Fig. 8. Consistent with the results presented in the previous sections, we find that the combination of the experimental physics and finer horizontal grid spacing produces even larger improvements, always positive and up to a maximum of 14% in the 80-m wind speed MAE (lowest left panel) and up to a maximum of 18% in 80-m wind power MAE (lowest right panel). Again, larger improvements are found during nighttime and during the morning transition, with smaller improvement found during daytime when the models had lower MAEs.
To condense the results presented in this section, a summary plot with the percentage improvements on MAE due to the experimental physics, finer horizontal grid spacing, and the combination of the two, for the four reforecast periods separately and averaged together is presented in Fig. 9

Statistical results as a function of the different meteorological phenomena 20
The improvements due to the experimental physics and finer horizontal grid spacing (and to the combination of the two) as a function of the different meteorological phenomena common to this area are presented in Fig. 10. For this analysis we take advantage of the WFIP2 Event Log, which was created and updated regularly during WFIP2 by several meteorologists documenting the meteorological conditions of relevance in the area and is available on the DAP (Shaw et al., 2019). The

WFIP2 meteorologists based their classification of events on WFIP2 observations and other surface observations, real-time 25
and global model forecasts, satellite images, and local radio-soundings. In the Event Log document, days and characteristics of the different meteorological phenomena were recorded, with the possibility that on some days multiple phenomena could occur at the same time. Although the categorization of the days into different meteorological phenomena involves a certain level of subjectivity, the final classification process involved weekly meetings during the field study with meteorologists on the project team, many with operational forecasting experience in this geographic area, during which a consensus was reached 30 by the team, making us confident that other meteorologists would agree with the classifications we used. The Event Log is accessible to the public (available on the DAP, https://a2e.energy.gov/projects/wfip2). For the plot in Fig. 10 the results are averaged over all sites, between the two initialization times, over all reforecast horizons between 03 and 24 and over the four reforecast periods. The number of days over which each specific phenomenon takes place is in the parentheses on the x-axis label. On the far right are the improvements averaged (weighted by the number of cases) over all the different phenomena.
Since on some days multiple phenomena might occur at the same time, same days can be counted multiple times in the average, which consequently is not exactly the same as that in Fig. 9. From this analysis there is no improvement in the 80-m wind speed MAE due to the modifications in the physics of the HRRR (in dark blue) for mountain waves and topographic wakes, 5 while for the other meteorological phenomena the impact due to the experimental physics is positive. However, this figure does not tell the entire story.
As shown in Fig. 10, the number of days with gap flow events is very high (145), and if we plot the same figure separately for each of the four reforecast periods (Fig. 11) Consequently, the blue bar in spring and summer extending toward negative values, visible in Fig. 9 is not only due to the negative impact of mountain wave and topographic wake days, but also to gap flow days in spring and summer (upper right and lower left panels of Fig. 11). From Fig. 11 we also note that easterly flow is a category with a more consistent impact, always being improved by the experimental HRRR physics. Cold pool events are also consistently improved by the 20 experimental HRRR physics; this type of event happens mostly in fall and winter (only one event is found in spring, therefore its impact cannot be considered statistically significant).
To better understand the reasons for the lack of MAE improvement in the HRRR EXP vs HRRR CNT runs during diurnal gap flow days in summer, in Fig. 12  Although from Fig. 11 we see the experimental physics generally improves the HRRR during cold pool events, we next 30 examine details of the when and how this improvement occurs. Fig. 13 is similar to Fig. 12, but for part of the winter reforecast period. In the lower panel, days identified in the Event Log as experiencing cold pools are highlighted with the blue shaded areas. In the time series shown in the upper panel of Fig. 13, a period when the 80-m wind speed MAE of the HRRR EXP (blue line) is larger than the HRRR CNT (red line) is highlighted with the red oval, while at a later time (inside the blue oval) the opposite is true. Differences between these cold pool events were examined using the WFIP2 real-time model observation evaluation website (http://wfip.esrl.noaa.gov/psd/programs/wfip2/). This website was used through the duration of the WFIP2 field campaign for daily monitoring of model forecasts and instrument health (Wilczak et al., 2019a). Time-height cross sections (not shown, but available from the WFIP2 real-time model observation evaluation website) of microwave radiometer temperature, and winds from the radar wind profiler superimposed on radio acoustic sounding system 5 virtual temperature at Wasco, OR, for January 4, 2017, and January 19, 2017, revealed that the cold pool at the beginning of January is brought in by sustained easterly winds and has weaker stable stratification compared to the cold pool event in the second half of January, which is characterized by very low wind speeds close to the surface and more strongly stable stratification. Thus, although these periods are both listed as cold pool events, they have different atmospheric characteristics.
In the first case the experimental physics in the HRRR EXP run does not help the model to outperform the HRRR CNT, while 10 in the second case it does. A large wind speed deficit in the HRRR EXP forecast on January, 4, 2017 (visible in the red oval in the lower panel of Fig. 13) might occur because the HRRR EXP model has too much drag due to the SGS and/or because of the wind farm parameterization, with wind farms just upwind, east of Wasco. In contrast, in January 18, 2017 a large wind speed excess in the HRRR CNT forecast (visible in the blue oval in the lower panel of Fig. 13) occurs because of 1) not enough drag in the HRRRR CNT to reduce the strong winds immediately above the cold pool, 2) too much mixing at the top of the 15 cold pool, which may be due to too large mixing lengths, and 3) to "horizontal" mixing along sloped sigma coordinates, which contribute to vertical mixing. Given the very different wind and stability profiles characteristics of the two cold pool events, having routinely available observations of these profiles and assimilating them into the models would likely improve their short-term forecast skill. The need of a network of ground-based profiling instruments to improve numerical weather prediction and operational forecasting is also strongly advocated by the National Research Council (2009). 20

Bias correction impact on the improvements
Next, we evaluate whether the improvements measured in the previous sections are mainly due to reducing the biases of the models (the systematic component of the error) or if the model improvements also address the random component of error. To this aim the model 80-m wind speed output needs to be bias corrected before the bulk statistics and the relative improvements can be computed. Several methods have been investigated in the literature to remove the systematic component of the error 25 from model outputs. For this study, due to the nature of the 80-m wind speed biases presented in Fig. 2, two possible bias correction methods have been considered. The first one removes the mean bias from each model, at each site, and for each reforecast period separately ("mean bias"). The second method removes the mean bias from each model, at each site, for each of the reforecast periods and for each of the hour of the day separately ("diurnal bias"). Since, as is clear from Fig. 2, the nature of the bias differs among the models, we examined the impacts of both of these simple bias correction methods. In Fig. 14 we  30 present similar results to those presented in the left panel of Fig. 9, but after applying the "mean bias" correction (Fig. 14, upper panel) and the "diurnal bias" correction (Fig. 14, lower panel). In both cases, the methodology used to apply the bias correction was to split the dataset into two parts, determine the bias correction on the first half and evaluate it independently on the second half of the dataset.
The "mean bias" correction enhances the improvement due to the experimental physics in the HRRR and HRRRNEST models (blue and light blue bars, comparing Fig. 14 to Fig. 9). This improvement indicates that the experimental physics improves the random component of the model error, even if the experimental physics might degrade the systematic component: the right 5 panel of Fig. 4 shows that the bias of the HRRR EXP model is larger than the bias of the HRRR CTL model. In comparison, applying the "diurnal bias" correction also increases the improvement due to the experimental physics (dark blue and light blue bars) over all reforecast periods and for their average, while the improvements due to finer horizontal grid spacing in the models (red and orange bars) actually decrease.

Impact of model improvements on other key meteorological variables 10
Although the scope of the study presented in this manuscript is to measure the impact of the improved model parameterizations on the forecast of 80-m wind speeds, it is important to assess what improvements, if any, were brought to other key variables in the boundary layer. Olson et al. (2019a) considered this matter when comparing HRRR (CNT and EXP) model outputs to eight 915-MHz radar wind profilers in the WFIP2 region. The 915-MHz radar wind profilers observe through the planetary boundary layer, where the MAE wind speeds were found to be reduced over all four reforecast periods, especially at night and 15 in winter (stable atmospheric conditions), with MAE reduced by up to 0.5 m s -1 in the lower 300 m above ground level (agl), through most of the diurnal cycle. Some degradation was found in summer, for daytime, in agreement with our finding. The improvements on MAE of wind speed in the HRRNEST runs were mostly localized in the rotor layer over which the primary goal of the campaign was focused, being much smaller over the deeper layer of the atmosphere observed by the 915 MHz radar wind profilers. 20 Another important variable considered by Olson et al. (2019a) was temperature, comparing the model runs to Radio Acoustic Sounding System virtual temperature measurements. For this variable the largest improvements were found in winter, with MAE of temperatures reduced by more than 0.5 C up to 400 m agl for the HRRR, but half of that for the HRRNEST.
Other key meteorological variables over which model improvements were measured by Olson et al (2019a) were 2-m temperature and 10-m wind speed comparing the upgraded models to the previous version over the entire CONUS domain. 25 For these variables RMSE and biases were improved over both the eastern and western CONUS domains, proving that model improvements in one variable were found in other variables as well.

Summary and conclusions
Measurements collected by 19 sodars and 3 lidars during the second Wind Forecast Improvement Project (WFIP2), an 18month field campaign in the Columbia River Gorge and Basin area, were used to validate model runs by the High Resolution 30 Rapid Refresh (HRRR) model (3 km horizontal grid spacing) and its nested version (HRRRNEST, 750 m horizontal grid spacing).
The models were run for four 6-week reforecast periods (one for each season) in control (CNT) and experimental (EXP) configurations, where the EXP runs included new parameterizations to the HRRR and HRRRNEST physics suites (i.e. representation of wind farms and of drag associated with subgrid-scale (SGS) topography in the HRRR), improvements to 5 existing parameterizations (i.e. boundary-layer and surface-layer schemes, cloud-radiation interaction), and improvements to numerical methods (i.e. finite differencing of the horizontal diffusion). Results showed that: • 80-m wind speed MAE and bias vary significantly through the diurnal cycle, with time of day being more important at determining the 80-m wind speed MAE and bias values than either the initialization time or the forecast horizon.
• The HRRR EXP reforecast run reduces the diurnal trend in the bias, but results in a near constant negative bias, 10 possibly by exaggerating the drag due to sub-grid scale orography added to the HRRR physics suite (but not added to the HRRRNEST).
• The 80-m wind speed biases have lower values (often negative) at lower elevations, but increase with the site elevation. Differences in the sub-grid scale terrain inhomogeneity did not help explain any of the bias or MAE in the results. 15 • The experimental physics in the HRRR reduces 80-m wind speed MAE by 3-4 % and 80-m wind power MAE by 4-5 %.
• Finer model horizontal grid spacing improves 80-m wind speed MAE in the control runs, particularly at nighttime and during the morning transition. Smaller improvements occur during daytime, when the larger horizontal grid spacing model had lower MAE than at nighttime. The finer horizontal grid spacing of the HRRRNEST improves 80-20 m wind speed MAE values up to 5%, and 80-m wind power MAE up to 7-8%.
• The combined impact on 80-m wind speed MAE of the experimental physics and finer horizontal grid spacing produces an even larger reduction in MAE, averaging 7-8% for 80-m wind speed and 11-12% for 80-m wind power.
• Improvements in MAE and bias due to the experimental physics and finer horizontal grid spacing depend on season but almost always are positive. However, in spring and summer, the experimental physics in the HRRR runs increases 25 the 80-m wind speed MAE.
• The negative impact of the experimental physics on the HRRR MAE found in spring and summer results from degradation of the HRRR EXP on days experiencing gap flows, mountain waves and topographic wakes, and is probably due to the representation of drag in the HRRR EXP. In particular, for almost all of the summer gap flow days, the HRRR EXP predicts the down-ramps occurring at the end of the events too early. 30 • Although cold pool forecast skill improves due to the experimental physics in the models, different types of cold pools are predicted with varying skill. If routinely available observations of wind and stability profiles were assimilated into the models, short term forecast skill would likely improve.
• "Mean bias" and "diurnal bias" corrections of the 80-m wind speed model outputs demonstrated that the experimental physics improves both the systematic and the random component of the model errors. The impacts of the different bias corrections on the improvements due to finer horizontal grid spacing in the models are mixed.
The strength of WFIP2 came from many observational scientists and model developers working closely together, steering the 5 observational-based process understanding to guide model improvements which were later transitioned into operations. The current analysis quantifies the skill added by improvements made to the models within four months towards the end of WFIP2.
A model freeze was then imposed so that the models could be run in EXP and CNT configurations over the four chosen reforecast periods. Since the model code freeze, three research tasks related to better simulating the low-level wind speeds have been prioritized: first the inclusion of momentum transport in the new mass-flux component of the MYNN-EDMF, second 10 modifying the small scale gravity wave drag to only parameterize small-amplitude gravity waves associated with subgrid-scale terrain undulations < 100 m, and third investigating the addition of a vertically distributed form drag as opposed to represent form drag only through the surface roughness length, which is probably only valid for horizontal grid spacing < 1 km, where the terrain is better resolved. The impact of the first tends to increase the near-surface wind speed in the convective boundary layer, which helps to correct the low wind speed bias we measured in WFIP2. The second and the third tasks are simply meant 15 to revise the original representation of drag in the HRRR in order to make the parameterizations more physically meaningful. All of these model components need to be investigated at a variety of model resolutions to ensure the model parameterizations successfully adapt in behavior to only represent the physical processes that are truly not well-resolved within the model. Stoelinga, and David D. Turner contributed with observational data and with useful discussion to improve the manuscript.

Data and code availability
The operational HRRR model is not entirely open source (data assimilation/cycling scripts/etc), but updates to the model parameterizations used in the HRRR are deposited periodically to the official repository for the Advanced Research version of the Weather Research and Forecasting (WRF-ARW) model, maintained by the National Center for Atmospheric Research (NCAR), which is open source (https://github.com/wrf-model/WRF). A branch from this repository was created for WFIP2 5 testing, based on WRF-ARWv3.9. This branch is currently stored at https://zenodo.org/record/3369984#.XVb6KpJKjUI (doi:10.5281/zenodo.3369984). This branch is no longer under development and all improvements have been transferred to NCAR's official repository.
Details on the improvements applied to the HRRR and HRRRNEST parameterizations can be also found in Olson et al. (2019a). 10 All dataset used in this study are freely available to the public from the DOE Data Archive and Portal (DAP; https://a2e.energy.gov/projects/wfip2).
Please contact the corresponding author for additional details, if needed.

Acknowledgements
We thank all the people involved in WFIP2 for site selection, leases, instrument deployment and maintenance, data collection, 15 and data quality control. Funding for this work was provided by the DOE, Office of Energy Efficiency and Renewable Energy,                  Upper panel: results using a "mean bias" correction; lower panel: results using a "diurnal bias" correction.