Seasonal and diurnal performance of daily forecasts with WRF V3.8.1 over the United Arab Emirates

Effective numerical weather forecasting is vital in arid regions like the United Arab Emirates (UAE), where extreme events like heat waves, flash floods, and dust storms are severe. Hence, accurate forecasting of quantities like surface temperature and humidity is very important. To date, there have been few seasonal-to-annual scale verification studies with WRF at high spatial and temporal resolution. This study employs a convection-permitting scale (2.7 km grid scale) simulation with WRF with Noah-MP, in daily forecast mode, from 1 January to 30 November 2015. WRF was verified using measurements of 2 m air temperature (T2 m), 2 m dew point (TD2 m), and 10 m wind speed (UV10 m) from 48 UAE WMO-compliant surface weather stations. Seasonal and diurnal performance was analyzed within the desert, marine, and mountain regions of the UAE. Results show that WRF represents temperature (T2 m) quite adequately during the daytime, with biases ≤ +1 °C. There is, however, a nocturnal cold bias (−1 to −4 °C), which increases during hotter months in the desert and mountain regions. The marine region has the smallest T2 m biases (≤ −0.75 °C). WRF performs well regarding TD2 m, with mean biases mostly ≤ 1 °C. TD2 m over the marine region is overestimated, though (0.75–1 °C), and nocturnal mountain TD2 m is underestimated (∼ −2 °C). UV10 m performance on land still needs improvement, and biases can occasionally be large (1–2 m s−1). This performance tends to worsen during the hot months, particularly inland, with peak biases reaching ∼ 3 m s−1. UV10 m is better simulated in the marine region (bias ≤ 1 m s−1). There is an apparent relationship between T2 m bias and UV10 m bias, which may indicate issues in simulation of the daytime sea breeze. TD2 m biases tend to be more independent. Studies such as these are vital for accurate assessment of WRF nowcasting performance and to identify model deficiencies. By combining sensitivity tests, process, and observational studies with seasonal verification, we can further improve forecasting systems for the UAE.


Introduction
In a changing climate, effective numerical weather forecasting is vital in arid regions like the United Arab Emirates (UAE), to predict low-visibility events like fog and dust (e.g., Aldababseh and Temimi, 2017; Chaouch et al., 2017; Karagulian et al., 2019), and extreme events relating to storms and flash floods (Chowdhury et al., 2016; Wehbe et al., 2019), high temperatures, and droughts. These extreme events are expected to become more prevalent under a changing climate (Feng et al., 2014; Zhao et al., 2020). In fact, climate projections suggest that arid and semiarid regions are likely to expand in area along with rising temperatures (Huang et al., 2017; Lelieveld et al., 2016; Lu et al., 2007). Hence, it is vital that regional weather forecasting and climate simulations with regional climate models (RCMs) correctly simulate important quantities which characterize extreme events, especially surface temperatures, humidity, winds, and precipitation.
The model chain and configuration used in any simulation can heavily influence the results of such forecasts. Important factors include but are not limited to the RCM type (e.g., Coppola et al., 2020), general circulation model (GCM) dataset for boundary forcing (Gutowski et al., 2016; Jacob et al., 2020), horizontal and vertical grid resolutions (e.g., Schwitalla et al., 2017), physics and dynamics schemes (e.g., Chaouch et al., 2017; Schwitalla et al., 2020), and soil, land use, and terrain static data, as well as internal model parameter sets for important land surface processes (e.g., Weston et al., 2019).
The Weather Research and Forecasting (WRF) model (Powers et al., 2017; Skamarock et al., 2008) has been used in arid regions for various forecasting and verification purposes (e.g., Fonseca et al., 2020; Schwitalla et al., 2020; Valappil et al., 2020; Wehbe et al., 2019) and process studies (Becker et al., 2013; Karagulian et al., 2019; Nelli et al., 2020a; Wulfmeyer et al., 2014). To date, there have been few annual-scale verification studies employing the WRF model in an NWP daily forecasting mode at such high spatiotemporal resolution (e.g., dx < 2–3 km). Horizontal grid scale is significant because simulations employing convection-permitting (CP) grid spacing (dx < ∼ 4 km) are known to outperform those at coarser resolutions, particularly in terms of clouds and precipitation, not least because they do not require a convection parameterization (Bauer et al., 2015; Prein et al., 2015; Schwitalla et al., 2011, 2017; Sørland et al., 2018). Furthermore, it is known that land use, soil texture, and terrain interact with planetary boundary layer (PBL) processes in complex feedbacks (e.g., Anthes, 1984; Mahmood et al., 2014; Pielke and Avissar, 1990; Smith et al., 2014), with a strong level of land-atmosphere (LA) coupling thought to exist in this region (Koster et al., 2006). Representation of landscape structure and the associated LA feedbacks should therefore be significantly improved when using finer grid resolution. In terms of timescale, seasonal-to-annual simulations are costly but provide a sufficient time series for robust statistical comparison with observations over different seasons.
This study employs a configuration of WRF, coupled with the NOAH-MP "multi-physics" land surface model (LSM), with modular parameterization options (Niu et al., 2011). In contrast to typical climate mode simulations, WRF is run here in a numerical weather prediction (NWP), or daily forecasting, mode in order to keep conditions inside the domain closer to those of the forcing data (see Sect. 2.3.3 for further details). We also apply high-quality and high-resolution boundary forcing data; improved static data for land use, soils, and terrain; high-frequency aerosol optical depth; and sea surface temperature. This WRF configuration was employed and verified by Schwitalla et al. (2020) within a one-day case study of a physics ensemble.
Our main objective is to assess the seasonal and diurnal performance of WRF, both qualitatively and quantitatively, in reproducing surface air temperature, dew point, and wind data from 48 WMO-compliant surface weather stations distributed over the UAE.
Another objective is to assess the model performance in different areas of the UAE, which was split broadly into three environments: (1) northern coastline and islands, (2) inland lowland desert areas, and (3) the Al Hajar Mountains in the east. The aim is to investigate differences in performance due to expected differences in climate regimes within these zones and their respective surface and landscape characteristics, and how they are dealt with by WRF with Noah-MP. Factors include, amongst others, the influence of sea surface temperatures in the warm and shallow Arabian Gulf (Al Azhar et al., 2016), representation of albedo and roughness length parameters, and limitations in simulations over orography, particularly with respect to the wind field. The Al Hajar Mountains have a complex climate with regular coastal fog and convective events. Therefore, splitting verification into the above zones (in which the stations are quite evenly distributed, with 17, 15, and 16 stations, respectively) can yield further insights into model performance and climate characteristics in different environments.
Through ambitious simulations and robust verification, we can gain valuable insights into the regional climate and model performance and take a step towards more skilful weather forecasting with WRF with Noah-MP in the UAE.
The structure of this work is as follows: we start with our materials and methods (Sect. 2), showing maps of the study area and model domain (Sect. 2.1); a description of the regional climate (Sect. 2.2); the model chain, configuration, and simulation method (Sect. 2.3); verification dataset (Sect. 2.4); and verification methods (Sect. 2.5). Then follows a results and discussion section (Sect. 3) and finally a summary and outlook (Sect. 4).

Study area and model domain
The region under investigation is the United Arab Emirates (UAE), located between 22.61–26.43° N and 51.54–56.55° E in the far northeast of the Arabian Peninsula (see Fig. 1a), with the 48 surface verification stations spread out across the country. The model domain is shown in Fig. 1b and covers a much larger area, (a) to be sure of excluding the area with the strong effects of the boundary forcing (i.e., relaxation zone) from the analysis, and (b) to incorporate the large-scale synoptic weather situation. The model uses a regular latitude-longitude grid and has corner grid cells located at 14.775° N, 32.225° N, 43.275° E, and 65.725° E.
Weather in the wider region is generally controlled by four predominant patterns, including troughs originating from the Atlantic and Mediterranean Sea in winter, locally forced convective storms over the UAE and Oman Al Hajar Mountains in summer, and the southerly summer monsoon and cyclones from the Arabian Sea during June and October (Bruintjes and Yates, 2003; Steinhoff et al., 2018). These phenomena are represented in large-scale seasonal climatologies (from 1979 onward) in Figs. 2 and 3 (right-hand panels). To represent the climate, we have used geopotential height at 500 hPa, wind velocity at 850 hPa, and mean sea level pressure. Note that winter is represented exclusively by the months of January and February, because these are the months used for our winter analysis during 2015, for reasons of temporal continuity. In the climatology, we can clearly see a typical winter January-February (JF) low centered over Turkey and Iraq and a trough extending down toward the Arabian Peninsula. During summer, in June, July, and August (JJA), we observe much higher temperatures further south, with a heat low centered over Iran and the UAE. The other two seasons are transitional periods.

UAE climate
The UAE climate is generally characterized by scarce precipitation and high temperatures. However, annual cycles do exist, with maxima of precipitation and minima of temperature in winter and the converse in summer. Annual UAE precipitation ranges from 20 mm in the drier west to 130 mm in the higher Al Hajar Mountains of the east, and is mainly produced in the winter-spring period (Sherif et al., 2014). During summer, subtropical subsidence leads to a strong reduction of precipitation and higher temperatures, and consequently summer precipitation represents only around 20 % of the annual amounts. However, upper-level disturbances from the southern monsoon flows can still transport moisture towards the Arabian Peninsula and the UAE (Böer, 1997; Schwitalla et al., 2020), and convection is initiated sporadically over the mountains of Oman and the UAE in summertime (Branch et al., 2020a).
The neighboring Arabian Gulf to the north of the UAE also plays a strong role in regional weather conditions. The prevailing winds from the Arabian Gulf are westerly or northwesterly between January and May, but these change to northwesterly and then northerly directions from June to November. In the Arabian Gulf, which is relatively shallow (maximum depth ∼ 90 m), particularly close to the UAE coast, the sea surface can heat rapidly, with temperatures often exceeding 30 °C (Al Azhar et al., 2016). Prevailing winds are augmented by strong sea and land breezes, which develop due to land-sea temperature gradients. Daytime sea breezes can penetrate up to 50 km inland (Eager et al., 2008).

Model chain and physics
The model chain is based on the Weather Research and Forecasting model version 3.8.1 using the Advanced Research WRF (ARW) core, which solves the Euler equations on a discretized horizontal grid, with a terrain-following vertical coordinate system. The domain size and grid spacing match those of a previous simulation by Schwitalla et al. (2020), and the domain comprises a regular latitude-longitude grid with 900 by 700 cells horizontally (see Fig. 1b). In line with our previous statements on the CP scale, we selected a grid increment of 0.025° (dx ∼ 2779 m), with no parameterization of deep convection. It was important to extend the domain enough to incorporate influential synoptic conditions upstream to the north, east, and southeast. Hence, our grid covers a region of approximately 2500 km × 1945 km, extending up to Iraq in the north, down to the south of Yemen, and well into Pakistan in the east. Care was taken, for reasons of model stability, that domain boundaries did not bisect very large peaks, especially in the complex terrain of Iran. Vertically, 100 levels were used, adjusted so that at least 25 levels were present in the lower 2000 m, to maximize resolution of the strong moisture gradients in the boundary layer and lower troposphere.
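The grid figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, assuming a spherical Earth with a mean radius of 6371 km (a value not given in the text) and treating the 0.025° increment as a fixed arc length in both directions, as the quoted domain extent appears to do. The function name is illustrative only.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius; not stated in the text


def grid_spacing_m(delta_deg, radius_m=EARTH_RADIUS_M):
    """Arc length in metres spanned by delta_deg degrees on a sphere."""
    return math.radians(delta_deg) * radius_m


# Values taken from the text: 0.025 degree increment, 900 x 700 cells.
dx_m = grid_spacing_m(0.025)
extent_x_km = 900 * dx_m / 1000.0
extent_y_km = 700 * dx_m / 1000.0

print(f"dx ~ {dx_m:.0f} m, domain ~ {extent_x_km:.0f} km x {extent_y_km:.0f} km")
```

On a latitude-longitude grid the true east-west distance between cells shrinks with the cosine of latitude, so the nominal figures above describe the increment along a meridian rather than the physical cell width at every latitude.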
WRF was coupled with the NOAH-MP LSM (Niu et al., 2011) to simulate land-surface processes and land-atmosphere feedbacks. NOAH-MP provides a separate vegetation canopy defined by a canopy top and ground layer, including a modified energy balance closure approach. It offers a tile approach where the net longwave radiation and turbulent fluxes are calculated separately for bare soil and the canopy layer. The calculated fluxes over vegetated grid cells are then bulked as a weighted sum of bare soil and canopy fluxes. Furthermore, NOAH-MP is partially modular in structure, providing a suite of optional schemes for several processes, such as radiation budget calculation, stomatal resistance, snow albedo, and others. The same configuration as in Milovac et al. (2016) was used for all NOAH-MP options.
Other physics schemes included were the Rapid Radiative Transfer Model (RRTMG) for long- and shortwave radiation transfer (Iacono et al., 2008; Mlawer et al., 1997), the Thompson-Eidhammer microphysics scheme (Thompson and Eidhammer, 2014) (although without the aerosol-aware component activated), the Mellor-Yamada Nakanishi-Niino (MYNN) scheme for the atmospheric surface layer, and the MYNN level 2.5 TKE scheme for the boundary layer (Nakanishi and Niino, 2006) (see Table 1 for a synopsis of physics schemes and their associated references).

Initial and lateral boundary conditions
These were retrieved from the European Centre for Medium-Range Weather Forecasts (ECMWF) Integrated Forecasting System (IFS), in the form of 6-hourly operational analysis data on the 41r1 cycle, on model levels. The horizontal grid increment is 0.125° (∼ 12 km) with 137 vertical levels up to 0.01 hPa. Soil moisture and soil temperatures are also provided by this model, which assimilates satellite soil moisture data (Albergel et al., 2012) into its coupled Hydrology-Tiled ECMWF Scheme for Surface Exchange over Land (HTESSEL) model (Balsamo et al., 2009).

Sea surface temperatures (SSTs)
These data were retrieved from the OSTIA project (Donlon et al., 2012). The data have a 1/20° horizontal grid spacing at a 12-hourly frequency, at 00:00 and 12:00 UTC. These data are particularly important in coastal regions like the UAE.

Aerosol optical depth (AOD) data
These data, which interact with the shortwave radiation scheme to modify radiative transfer and diabatic heating, were retrieved from the ECMWF Monitoring Atmospheric Composition and Climate (MACC) reanalysis (Inness et al., 2013). The data have a ∼ 80 km horizontal grid spacing and a 6-hourly frequency starting from 00:00 UTC.

Soil texture data
These data are an update from the default Food and Agriculture Organization (FAO) dataset. The new data are based on the Harmonized World Soil Database (HWSD) v 1.2 at 30 arcsec grid spacing, where all the mapping units are reclassified into 12 soil and 4 non-soil types following the United States Department of Agriculture (USDA) soil classification system, as in the WRF model. For access to the data and more details see Milovac et al. (2018). The WRF default soil texture map based on the FAO data was used for the bottom soil layer.

Land use data
These data were provided as a combination of a high-resolution dataset for the Emirates of Abu Dhabi and Dubai, provided by the National Center for Meteorology (NCM), and the International Geosphere-Biosphere Programme (IGBP) Moderate Resolution Imaging Spectroradiometer (MODIS) 20-class land use dataset, included within the WRF package (Fig. 4). The Abu Dhabi dataset contained some classes which differed from MODIS IGBP, and these were first reclassified in a logical manner before overwriting the MODIS dataset within the UAE (see Schwitalla et al., 2020 for further details of this process).

Terrain data
Here, we used the Global Multi-resolution Terrain Elevation Data (GMTED) 2010 static dataset (Danielson and Gesch, 2011).

Simulation method
The objective of this study was to run a series of daily forecasts with WRF for the period 1 January to 30 November 2015, with a discarded 1-month spin-up run from 1 December 2014. Note that December 2014 was not used for verification (observation data were in any case not available at that time; see Sect. 2.4). It also makes sense not to analyze a winter season split over 2 years. The intention of carrying out such a long sequence was to produce a dataset with sufficient data points for robust statistical analysis. Forecasts were carried out in NWP mode, i.e., with daily cold starts, as opposed to a "climate" mode, which has a single cold start at the outset. In NWP mode, a cold start was initiated each day at 18:00 UTC (22:00 LT) and run for 30 h (6 h of spin-up plus a 24 h forecast day). The first 6 h of each forecast (18:00 to 00:00 UTC) were then discarded from the analysis. The 6 h allows time for the atmosphere to spin up after each cold start, in particular for the residual boundary layer to develop and dissipate before the convective boundary layer starts to develop after sunrise (∼ 06:00 LT), and for potential cloud development. Other UAE forecasting studies have also suggested that 5–6 h is an appropriate period for model convergence in the UAE region (Chaouch et al., 2017). After discarding the first 6 h, a forecast remains for analysis spanning the 24 h of each day between 00:00 and 23:00 UTC (04:00 to 03:00 LT). See Table 2 for a summary of the simulation method.
By reinitializing the 3D state within the domain itself (as opposed to simply inputting lateral boundary conditions), we ensure the atmospheric state is closer to the forecast provided by ECMWF than would be the case in typical climate mode simulations. In climate mode, which is driven only at the boundaries, the WRF simulations may diverge more strongly, particularly toward the center of the large domain where the study area lies, unless some form of interior nudging were implemented (e.g., Lo et al., 2008).
An exception to the daily reinitialization of state variables was made for the soil moisture field, whose state was intentionally carried over from one day to the next by overwriting the initial soil moisture of each new forecast, restarted at 18:00 UTC, with the WRF soil moisture state from the previous forecast at that time. The intention is to reduce physical inconsistencies between the soil moisture forecast in the driving model and that of WRF with Noah-MP. Intuitively, that may not seem a large issue given the aridity of the UAE. However, it becomes significant when convective precipitation occurs in WRF and soils are wetted. Such convective events and flash floods are common in the UAE and Oman, particularly from May to September in the mountains, including during 2015 (Schwitalla et al., 2020; Wehbe et al., 2019). Hence, this approach is a worthwhile way of improving physical consistency. To summarize the NWP configuration: the soil moisture is overwritten at 18:00 UTC from each consecutive day to the next, for the start of each new forecast. The lateral boundary conditions are as for a climate mode run, i.e., input every 6 h from the forcing data. The atmospheric state within the domain boundaries is reinitialized each day at 18:00 UTC.
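The timing of the daily forecast cycle described above can be laid out explicitly. The following is a minimal sketch of the schedule only (it does not run WRF), assuming that local time is UTC + 4 h, as implied by the statement that 18:00 UTC corresponds to 22:00 LT. The function and constant names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

SPINUP_H = 6      # hours discarded after each cold start (from the text)
FORECAST_H = 24   # hours retained for verification (from the text)
UAE_OFFSET = timedelta(hours=4)  # assumed: local time = UTC + 4 h


def forecast_cycle(valid_day):
    """Return (init time, run end, analysed hourly valid times) in UTC
    for the forecast that verifies on the calendar day valid_day."""
    day_start = datetime(valid_day.year, valid_day.month, valid_day.day,
                         tzinfo=timezone.utc)
    init = day_start - timedelta(hours=SPINUP_H)           # 18:00 UTC the day before
    run_end = init + timedelta(hours=SPINUP_H + FORECAST_H)  # 30 h after the cold start
    analysed = [day_start + timedelta(hours=h) for h in range(FORECAST_H)]
    return init, run_end, analysed


init, run_end, analysed = forecast_cycle(date(2015, 1, 1))
print(init.isoformat(), "->", run_end.isoformat(),
      f"({len(analysed)} hourly outputs kept)")
```

In this layout the forecast verifying on 1 January 2015 is initialized at 18:00 UTC on 31 December 2014, and only the outputs valid from 00:00 to 23:00 UTC are passed on to verification.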

Datasets for verification
Hourly verification data come from 48 surface weather stations throughout the UAE (Fig. 1a and Appendix Table A1) and are quality checked and made available by the National Center for Meteorology (NCM) in Abu Dhabi, UAE. Fields available include air temperature at 2 m (T2 m), dew point at 2 m (TD2 m) representing humidity, and wind speed at 10 m (UV10 m). Data cover the entire period of 1 January–30 November 2015. Unfortunately, quality checked observation data for December 2014 were not available, and so, in the interest of preserving contiguous seasons, the month of December 2015 was omitted from the winter statistics.

Verification method
An aim of the study is to assess WRF's performance on several timescales: annually (January-November), seasonally, daytime and nighttime periods, and hourly. Another aim is to assess performance within different regions of the UAE. The exclusive assessment of overall forecast means over the UAE may be valuable but could obscure variability within the different regions, such as the capturing of high daytime temperatures in the inland deserts, or cooler and windier coastal conditions. Accordingly, the dataset was split temporally and spatially, as follows.

Yearly analysis
Here, all time steps were analyzed from 1 January to 30 November (hourly interval).

Seasonal analysis
Here, we present the most extreme seasons in terms of air temperatures -the (coolest) winter period of 1 January-28 February 2015 and the (warmest) summer period of 1 June to 31 August 2015.

Daytime and nighttime periods
For daylight hours we used all hours between 02:00 and 13:00 UTC (06:00–17:00 LT), and for nighttime, 14:00 to 01:00 UTC (18:00–05:00 LT). These hours were selected based on the times of UAE sunrise and sunset, which range between ∼ 05:30 and 07:00 LT and between ∼ 17:00 and 18:50 LT, respectively. The intention of separating day and night hours in this way is to examine performance during the nocturnal stable and daytime convective boundary layers. Indeed, several simulations in arid regions have demonstrated nocturnal cold biases and an overestimation of daytime wind speeds (Schwitalla et al., 2020; Weston et al., 2019).
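The day and night split above amounts to a fixed partition of the 24 UTC hours. A minimal sketch, using only the hour ranges stated in the text (the function name is illustrative):

```python
# Daytime: 02:00-13:00 UTC (06:00-17:00 LT); all other hours count as night.
DAY_HOURS_UTC = frozenset(range(2, 14))


def period_of(hour_utc):
    """Classify an hourly time step (0-23, UTC) as 'day' or 'night'."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be in 0..23")
    return "day" if hour_utc in DAY_HOURS_UTC else "night"


counts = {"day": 0, "night": 0}
for hour in range(24):
    counts[period_of(hour)] += 1
print(counts)
```

Both periods contain 12 hourly time steps, so the day and night statistics rest on equal sample sizes apart from missing observations.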

Regional analysis
We split the 48 UAE weather stations into three regions (marine, mountain, and desert) based on surface geophysical characteristics and proximity to water bodies (see Fig. 1a). Accordingly, the following criteria were used for grouping the weather stations into regions: marine, located on islands or ≤ 5 km inland from the UAE coast (17 stations). The only exception made to this classification was for a single station located at 204 m near the sand dunes of Liwa, in the south of the Abu Dhabi emirate. Although the station is quite high, it is remote from the Al Hajar Range and was deemed more suitable for a desert classification. Details on altitude of the regional station groups can be found in Table 3, and a list of individual stations in the Appendix. The desert region is characterized by barren or sparsely vegetated soils (as is most of the UAE), high surface temperatures, and rapid nighttime cooling due to radiative losses associated with a dry atmosphere. The Al Hajar mountain region is arid and has generally rocky bare slopes and lower albedo (e.g., Moody et al., 2005), with gravel plains running along the west side (Sherif et al., 2014). One can assume some similarity between these regions, particularly when the synoptic situation is relatively homogeneous over scales larger than the study area. Nevertheless, given the large number of stations and length of time series, if regional differences do exist then they should be evident.

Verification and diagnostics
All comparisons were made using NCAR's Model Evaluation Tools V9.0 (MET) package (Brown et al., 2020), utilizing a nearest-grid cell approach at an hourly temporal resolution.
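The nearest-grid cell matching can be illustrated on a regular latitude-longitude grid. This is a minimal sketch of the idea, not the MET implementation: it assumes that the south-west corner coordinates quoted earlier (14.775° N, 43.275° E) refer to a cell centre, and the station coordinates in the usage line are a hypothetical point, not one of the 48 stations.

```python
def nearest_cell(lat, lon, lat0=14.775, lon0=43.275, delta=0.025,
                 ny=700, nx=900):
    """Return (j, i) indices of the grid cell whose centre is nearest to
    (lat, lon) on a regular lat-lon grid. lat0/lon0 are assumed to be the
    centre of the south-west corner cell; delta is the increment in degrees."""
    j = round((lat - lat0) / delta)
    i = round((lon - lon0) / delta)
    if not (0 <= j < ny and 0 <= i < nx):
        raise ValueError("point lies outside the model domain")
    return j, i


# Hypothetical station location, for illustration only.
j, i = nearest_cell(24.45, 54.38)
print(f"nearest cell: row {j}, column {i}")
```

Because the grid is regular in degrees, rounding each coordinate independently is enough; no search over cells is needed. Nearness is measured here in degrees rather than great-circle distance, which can differ marginally for points close to a cell edge.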
To obtain a visual overview of model performance, in terms of closeness of fit, spread of forecast errors, and distribution of residuals, scatterplots divided by region and day-night period are shown in Fig. 5. Included are a line of best fit for the data, a 1 : 1 line of perfect fit, and a 95 % confidence ellipse. Then, we plotted regional seasonal statistics of the mean observations (T2 m, TD2 m, and UV10 m) (Fig. 6).
To quantify the regional forecast-observation association, error magnitude, and sign during day and night, we show three standard statistical diagnostics: the Pearson correlation coefficient, the root mean square error (RMSE), and the bias.
The Pearson correlation coefficient "r" measures the strength of linear association between forecast (f) and observation (o), at all stations at each time step, given as follows:
$$ r = \frac{\sum_{i=1}^{ns} (f_i - \bar{f})(o_i - \bar{o})}{\sqrt{\sum_{i=1}^{ns} (f_i - \bar{f})^2}\,\sqrt{\sum_{i=1}^{ns} (o_i - \bar{o})^2}} $$
where $f_i$ and $o_i$ are the forecast and observation at each observation point $i$, $\bar{f}$ and $\bar{o}$ are the forecast and observation averages, $ns$ indicates the total number of observations at each time step (i.e., number of stations), and overbars indicate the mean. Occasionally $ns$ was reduced slightly whenever a missing value occurred. The RMSE is a scale-dependent diagnostic defined simply as the square root of the mean square error (MSE) of the forecast:
$$ \mathrm{RMSE} = \sqrt{\frac{1}{ns}\sum_{i=1}^{ns} (f_i - o_i)^2} $$
The bias is a measure of overall error, including sign, defined as follows:
$$ \mathrm{bias} = \frac{1}{ns}\sum_{i=1}^{ns} (f_i - o_i) $$
These diagnostics were generated for 2015 for each region and time period, and their temporal distribution is expressed in boxplots (Sect. 3; Fig. 7) showing mean, median, 25 %–75 % percentiles (box range), and 5 % and 95 % percentiles (whiskers).
Finally, a closer look at the diurnal evolution of the forecast is useful to investigate performance at specific times of day, such as local noon and at PBL transition periods, where models often have biases. Hence, we generated mean hourly cycles of the spatial mean and spatial standard deviations for both forecast and observations. The spatial mean at each hour is calculated as follows:
$$ \bar{f} = \frac{1}{ns}\sum_{i=1}^{ns} f_i, \qquad \bar{o} = \frac{1}{ns}\sum_{i=1}^{ns} o_i $$
The spatial standard deviation (σ) at each hour is the standard deviation of the station values about that spatial mean, computed in the same way for forecast and observations. For the diurnal analysis, we selected the two most extreme seasons in terms of temperature: the (coolest) winter period of January–February (Fig. 8) and the (warmest) summer period of June–August (Fig. 9) 2015. Again, these figures are divided by region.
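The three diagnostics described above are straightforward to compute for one time step. The following is a minimal plain-Python sketch, not the MET implementation; it assumes that the bias is taken as forecast minus observation (so that a cold model bias is negative, as reported in the results) and that station pairs with a missing value, marked here by None, are simply dropped, which mirrors the reduction of ns mentioned in the text. The function names and the sample values are illustrative only.

```python
import math


def paired(forecast, observed):
    """Keep only station pairs where both forecast and observation exist."""
    return [(f, o) for f, o in zip(forecast, observed)
            if f is not None and o is not None]


def bias(pairs):
    """Mean error, forecast minus observation (sign retained)."""
    return sum(f - o for f, o in pairs) / len(pairs)


def rmse(pairs):
    """Square root of the mean square error."""
    return math.sqrt(sum((f - o) ** 2 for f, o in pairs) / len(pairs))


def pearson_r(pairs):
    """Pearson correlation across stations; NaN if either series is constant."""
    n = len(pairs)
    f_mean = sum(f for f, _ in pairs) / n
    o_mean = sum(o for _, o in pairs) / n
    cov = sum((f - f_mean) * (o - o_mean) for f, o in pairs)
    f_var = sum((f - f_mean) ** 2 for f, _ in pairs)
    o_var = sum((o - o_mean) ** 2 for _, o in pairs)
    if f_var == 0.0 or o_var == 0.0:
        return float("nan")
    return cov / math.sqrt(f_var * o_var)


# Illustrative values for four stations at one time step (not real data);
# the fourth forecast is missing, so ns drops from 4 to 3.
pairs = paired([1.0, 2.0, 3.0, None], [2.0, 2.0, 4.0, 5.0])
print(f"ns={len(pairs)} bias={bias(pairs):.3f} "
      f"rmse={rmse(pairs):.3f} r={pearson_r(pairs):.3f}")
```

Applying these functions at every hourly time step yields the time series of r, RMSE, and bias whose distributions are then summarized in box plots.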

Results and discussion
In this section, we present a discussion of the results. Before examining the model performance, however, we first discuss the study period of 2015 in the context of the long-term climate and El Niño (Sect. 3.1) to assess the representativeness of the 2015 study period. We then discuss differences in regional climate and their significance to our verification (Sect. 3.2). Finally, we evaluate the regional model output of T2 m, TD2 m, and UV10 m fields across the seasons and time of day (Sect. 3.3).

2015 in context
Our study period is from 1 January to 30 November 2015 (during which time the full verification dataset was available). The year 2015 was considered one of the strongest El Niño periods since 1950 (L'Heureux et al., 2017), with an Oceanic Niño Index (ONI) of up to 2.6 towards the end of the year (see Table 4). A high positive ONI indicates a stronger El Niño event (a negative ONI indicates La Niña). Hence, a comparison was made between the long-term climatology and the year 2015, based on ECMWF ERA5 reanalysis data. In Figs. 2 and 3, from the geopotential height field, we can see that a positive 2015 winter temperature anomaly exists to the north of the UAE, extending from Turkey to the Caspian Sea (Fig. 2, top left). However, conditions over the UAE show less deviation in terms of the temperature, pressure, and wind fields. As the year progresses, and the ONI increases, the temperature anomaly becomes more pronounced further south, especially in JJA, when higher 2015 temperatures extend further south toward Oman and Yemen than is apparent in the climatology (Fig. 3a and b). Overall, though, synoptic conditions over the Arabian Peninsula do not appear to be markedly different. They are similar enough, in fact, to consider the 2015 regional climate as representative of the climate in general.

Regional and seasonal characteristics
An assessment of regional distributions reveals that clear differences in means and variability do exist (Fig. 6). As expected, the marine region is dominated by the Arabian Gulf characteristics, with more moderate temperature maxima and minima (Fig. 6a), greater humidity (Fig. 6b), and higher wind speeds (Fig. 6c) than the inland desert (Fig. 6). Hence marine temperatures are lower than at the desert stations in the summer months but remain higher in winter and autumn. In fact, the desert stations have the most extreme T2 m range in all seasons, reflecting the lower heat capacity of the surface and consequent strong daytime surface heating. Rapid nocturnal cooling also occurs due to radiative losses in a much drier inland environment. The mountain region is only a little cooler than the desert (∼ 1 °C) in summer and autumn, with the difference further reduced during spring and winter. The majority of mountain stations are located at fairly moderate altitudes (mean altitude 430 m; Table 3), with only one station located over 1000 m high (station ID 41229, 1485 m a.s.l.; see Table A1 in Appendix). Even so, one might have expected larger differences. However, there could be reasons other than the temperature lapse rate for this, such as differences in mountain and desert cloud cover (Yousef et al., 2019) or in albedo (e.g., Nelli et al., 2020b).
TD2 m, or dew point temperature, is a standard measure of humidity and is in most cases relatively independent of the ambient temperature. It is also a reliable measure of how humid the air feels in terms of human comfort (Wood, 1970). In a hot (and warming) climate like the UAE, forecasting TD2 m accurately is therefore important for society. Regionally, we observe considerable differences in TD2 m (Fig. 6b), which are more or less expected due to coastal-land gradients and variation in vertical transport and distribution of vapor in different environments. Table 5 shows the difference in observed T2 m and TD2 m means. The inland atmosphere tends to be humid in summer when temperatures are high but even closer to saturation in autumn and winter, as temperatures fall but humidity remains high. This seasonal range is particularly pronounced in the mountain regions, reflecting the predominance of annual rainfall occurring during winter in the mountains and gravel plains of the northeastern part of the UAE (Sherif et al., 2014; Wehbe et al., 2019). In all seasons, the marine region is closer to saturation than the other regions (T2 m minus TD2 m range is 8.3 to 11 °C); however, this contrast is reduced in the cooler seasons as the mountain and desert regions become more humid.
Figure 7. Box plots of T2 m, TD2 m, and UV10 m (respectively, panels a-c, d-f, and g-i) for all time steps over the period of January–November 2015. Statistics are divided by region (UAE, mountain, marine, desert) and then by nighttime and daytime hours (respectively, night 18:00–05:00 (grey boxes) and day 06:00–17:00 (red boxes) in local time). Statistics shown are Pearson correlation (a, d, g), bias (b, e, h), and RMSE (c, f, i). On the box plots the center line represents the mean, the white circle is the median, box ends represent 25 % and 75 % percentiles, and the whiskers are 5 % and 95 % percentiles. Also marked is a horizontal zero reference line for the Pearson and bias statistics.
There are significant regional differences in UV10 m, with marine UV10 m being 0.5–1 m s−1 higher than in other regions (Fig. 6c) and also more variable. This is not unexpected, due to low surface roughness, strong land-sea temperature gradients, and associated land-sea breezes. Desert UV10 m is the lowest all year round, and mountain UV10 m falls in between those of the desert and marine regions. In general, UV10 m is highest in spring and autumn. These regional differences justify the need for regional splitting of the dataset and are further addressed below, in conjunction with model performance.

Model evaluation
Although the simulation of T 2 m , TD 2 m , and UV 10 m and causes for any biases may be physically linked, we nevertheless first examine each field individually for clarity.
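The three statistics used throughout this evaluation (bias, RMSE, and Pearson correlation) can be written out in a few lines. The following is a minimal sketch of the standard definitions only, not the verification software used for the study; the function names and the sample values are hypothetical and chosen for illustration. Bias is taken as forecast minus observation, so that a positive value means overestimation by the model, matching the sign convention in the text.

```python
import math

def bias(forecast, observed):
    """Mean error (forecast minus observation); positive means overestimation."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(forecast)

def rmse(forecast, observed):
    """Root-mean-square error of the forecast against the observations."""
    squared = sum((f - o) ** 2 for f, o in zip(forecast, observed))
    return math.sqrt(squared / len(forecast))

def pearson(forecast, observed):
    """Pearson correlation coefficient between forecast and observations."""
    n = len(forecast)
    mean_f = sum(forecast) / n
    mean_o = sum(observed) / n
    cov = sum((f - mean_f) * (o - mean_o) for f, o in zip(forecast, observed))
    var_f = sum((f - mean_f) ** 2 for f in forecast)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_f * var_o)

# Hypothetical T2m pairs (degrees C) for one station; not data from this study.
wrf_t2m = [24.0, 27.5, 33.0, 36.5, 31.0]
obs_t2m = [26.0, 28.0, 32.0, 35.5, 32.5]

print(f"bias = {bias(wrf_t2m, obs_t2m):+.2f} C")
print(f"RMSE = {rmse(wrf_t2m, obs_t2m):.2f} C")
print(f"r    = {pearson(wrf_t2m, obs_t2m):.3f}")
```

In this toy sample the mean bias is small because warm and cold errors partly cancel, while the RMSE does not cancel in the same way. This is the same effect that makes a whole-UAE bias hide compensating regional biases.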

T 2 m
In the scatter plots (Fig. 5a-h) we observe that in the daytime, T 2 m appears to be well estimated for the UAE on the whole (Fig. 5a) (+0.44 °C), and errors are well distributed over the T 2 m range. However, this agreement obscures some compensating regional biases; namely, overestimation in the desert (+0.71 °C) and mountains (+1.06 °C), and underestimation in the marine region (−0.93 °C).
The warm bias may be attributable to a combination of factors. Firstly, a WRF overestimation of downwelling surface shortwave radiation has been observed before (Nelli et al., 2020b). This has been attributed to a lack of cloud cover but may also relate to the performance of the radiative transfer scheme and interaction with aerosols. Secondly, the soil representation, such as soil texture classification (and associated parameters like heat capacity, thermal diffusivity, and albedo), may require adjustment. Underestimations of albedo in WRF have recently been observed, particularly for bright desert soils where measurements show typical albedo values of 0.3 to 0.34 (Nelli et al., 2020b). The WRF albedo value in this study is around 0.23 for much of the UAE lowlands, which would likely result in an overly high net radiation and sensible heating, especially on dry soils. This is consistent with the reported positive daytime temperature biases in the inland desert. A third factor may be the prescribed aerodynamic roughness length parameters used by WRF. Nelli et al. (2020a) found that a new value for the parameter, derived from eddy covariance measurements, reduced the warm daytime bias in WRF simulations. These causes may account for some or all of the daytime temperature biases and therefore need to be considered for future simulations in this region.

Table 5. Seasonal and regional differences in observed T 2 m and TD 2 m means to show the closeness to saturation. Included are the number of time steps for each season (N T ). Note that this is not a mean of the differences between T 2 m and TD 2 m calculated at each time step, but an overall difference in means.

Nocturnally, we observe a cold bias over the UAE (Fig. 5e). This is quantified in Fig. 7b as a mean negative bias of just over −2 °C. One can also see that this nocturnal bias tends to worsen with an increase in daily T 2 m, which implies that the cold bias gets worse in the hotter months. This is confirmed in the seasonal diurnal cycles (Figs. 8a and 9a), where the mean nocturnal bias in winter is ∼ −2 °C but grows in magnitude to beyond −4 °C in summer. This nocturnal cold bias is reflected in all sub-regions, but not to the same degree. The best nocturnal performance is in the marine region (Fig. 5g) (bias of −0.75 °C), with an even error distribution across the temperature range.
The largest nocturnal cold bias is in the desert region (−3.1 °C) (Fig. 5h), with a steady increase in bias with temperature. The switch from warm to cold biases usually occurs around the twice-daily transition times of the boundary layer between stable and convective states. Such arid nocturnal biases have been noted before (Fekih and Mohamed, 2017; Weston et al., 2019). It may be that an overly dry lower atmosphere results in a lower downward flux of longwave radiation, as found by Fonseca et al. (2020) in a comparison of WRF with radiation measurements. All else being equal, this dryness would lead to a reduction of "buffering" at nighttime. They also found an overly high upward ground heat flux during the night, which could be associated with sub-optimal soil parameters or an overly strong soil-air temperature gradient. Overall, their net radiation losses at night were higher in WRF than from the radiation measurements.
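The albedo argument made for the daytime warm bias in this section can be made concrete with simple arithmetic. The sketch below uses the albedo values quoted in the text (0.23 in WRF, 0.30 to 0.34 measured) and the relation that absorbed shortwave equals (1 − albedo) times the downwelling flux. The downwelling value of 900 W m−2 is a hypothetical midday figure chosen only for illustration, and the function name is likewise an assumption.

```python
def absorbed_shortwave(downwelling_sw, albedo):
    """Net (absorbed) surface shortwave: (1 - albedo) * downwelling flux."""
    return (1.0 - albedo) * downwelling_sw

SW_DOWN = 900.0                 # W m-2; hypothetical midday value, illustration only
WRF_ALBEDO = 0.23               # value quoted in the text for the UAE lowlands
MEASURED_ALBEDO = (0.30, 0.34)  # measured range quoted in the text

wrf_absorbed = absorbed_shortwave(SW_DOWN, WRF_ALBEDO)

# Extra energy absorbed by the model surface relative to each measured albedo.
excess = [wrf_absorbed - absorbed_shortwave(SW_DOWN, a) for a in MEASURED_ALBEDO]

# The relative excess does not depend on the assumed downwelling flux.
relative_excess = [(1.0 - WRF_ALBEDO) / (1.0 - a) - 1.0 for a in MEASURED_ALBEDO]

for a, dq, rel in zip(MEASURED_ALBEDO, excess, relative_excess):
    print(f"albedo {a:.2f}: excess {dq:.0f} W m-2 ({100 * rel:.0f} % more absorbed)")
```

Under these assumptions the model surface absorbs roughly 10 % to 17 % more shortwave energy than a surface with the measured albedo. How much of that excess ends up as sensible heating and a 2 m temperature bias depends on the full surface energy balance, which this sketch does not attempt to represent.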

TD 2 m
TD 2 m is relatively well estimated in 2015 over the UAE as a whole, with correlations around 0.7 and biases of less than 1 °C (Fig. 7d and e, UAE sections). However, we can look at regional and seasonal differences for more detail. In the desert and marine regions, the biases are ≤ 1 °C during both day and night. Marine TD 2 m is slightly overestimated in general, indicating the model to be more humid over the gulf and coast than observed. Mountain nocturnal dew points are more of a problem with a negative bias of ∼ −2 °C, and a larger error spread than the other regions (Fig. 7e). There is also a corresponding T 2 m nocturnal bias of ∼ −2 °C which could indicate a deficiency in the longwave surface budget as just mentioned, but also a model deficiency in representing the intermittent shear-driven turbulence that appears in nighttime stable boundary layers. However, such biases in complex terrain have already been well documented (e.g., Zhang et al., 2013). One of the reasons cited is that the CP scale is not fine enough to resolve mountain slopes and therefore cannot capture certain processes in the same way that large-eddy scale models can, with grid spacings on the order of Δx = 100 m. However, while such fine resolutions may be appropriate in a research context, they may remain prohibitively expensive and inappropriate in the context of operational forecasting.
An additional problem in complex terrain is the validity of the traditional Monin-Obukhov similarity theory (MOST) (e.g., see Foken, 2006) that is typically used in atmospheric models, including WRF, for calculation of model diagnostics like T 2 m or TD 2 m. MOST assumes a homogeneous underlying land surface and stationary fluxes, and there is plenty of evidence that in complex and heterogeneous landscapes MOST needs significant improvements in scaling of turbulent kinetic energy profiles in the lowest part of the boundary layer (e.g., Figueroa-Espinoza et al., 2014; Wulfmeyer et al., 2018). The latter may affect representation of the heat, moisture, and momentum transport from the land surface to the atmosphere, and if misrepresented may lead to such high biases in the surface layer model diagnostics.
Seasonally, diurnal TD 2 m is quite well reproduced in both winter and summer (Figs. 8 and 9). The mountain nocturnal negative bias becomes more significant in summer (Fig. 9e). In the desert, a positive bias occurs over midday starting around 10:00 LT (Fig. 9k) showing an overestimation of water vapor in summer. This is likely to be too early in the day for a sea-breeze-driven anomaly but may relate to simulated soil moisture being higher than reality. This was observed in a study by Wehbe et al. (2019) that found a wet bias in dry soils and a dry bias in wetter soils in WRF over the UAE when not coupled with a more advanced hydrological model.
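The diurnal analysis described above amounts to grouping forecast errors by local hour and, more coarsely, by the day and night windows used for the box plots (day 06:00-17:00, night 18:00-05:00 local time). A minimal sketch of that grouping follows; the function names and sample values are hypothetical, and the code assumes time stamps have already been converted to local hour.

```python
from collections import defaultdict

DAY_HOURS = range(6, 18)  # 06:00-17:00 local time; all other hours count as night

def period(local_hour):
    """Classify a local hour (0-23) as 'day' or 'night'."""
    return "day" if local_hour in DAY_HOURS else "night"

def diurnal_bias(local_hours, forecast, observed):
    """Mean forecast-minus-observation error for each local hour present."""
    errors = defaultdict(list)
    for hour, f, o in zip(local_hours, forecast, observed):
        errors[hour].append(f - o)
    return {hour: sum(v) / len(v) for hour, v in sorted(errors.items())}

def day_night_bias(local_hours, forecast, observed):
    """Mean forecast-minus-observation error for the day and night windows."""
    errors = defaultdict(list)
    for hour, f, o in zip(local_hours, forecast, observed):
        errors[period(hour)].append(f - o)
    return {name: sum(v) / len(v) for name, v in errors.items()}

# Hypothetical TD2m samples (degrees C); not data from this study.
hours = [2, 2, 10, 10, 14, 22]
wrf_td2m = [10.0, 11.0, 15.0, 16.0, 14.0, 9.0]
obs_td2m = [12.0, 12.0, 14.0, 14.5, 13.5, 11.0]

print(diurnal_bias(hours, wrf_td2m, obs_td2m))
print(day_night_bias(hours, wrf_td2m, obs_td2m))
```

Keeping the hourly and the day/night groupings separate makes it easy to see when during the day a bias changes sign, which the coarser two-window split cannot show.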

UV 10 m
WRF overestimates UV 10 m during the day and night, in all regions and seasons. Positive biases of 1-2 m s −1 are typical over the whole year (seen in Fig. 7h). Mountain daytime biases are strongest at 2 m s −1, followed by daytime desert biases at 1.5 m s −1. Marine biases are lowest with mean biases of < 1 m s −1. Notably, there is a trend where positive biases increase with wind speed (Fig. 5p, q, s). There is a significant increase in bias during the daytime, and also in the summer, particularly in the mountain and desert regions (Fig. 9f and i). In fact, the strongest wind biases occur in the same situations when daytime T 2 m is overestimated, particularly in the mountain and desert regions (Figs. 7, 8, 9), hinting at a relationship between the two. Indeed, it is likely that an overly strong sea breeze may account for this. During summer, the desert-marine T 2 m daytime gradient is higher (∼ 5 °C; see Fig. 9g and j, red curves) than in winter (∼ 3 °C; see Fig. 8g and j), although the seasonal warmth biases are similar (∼ 1.5-2 °C). The higher gradient coincides with a greater UV 10 m bias in summer. Weston et al. (2019) improved the duration and direction of UAE sea breezes by tuning a thermal roughness length parameter in WRF. The PBL and surface layer parameterization schemes could also be a cause of the bias. Schwitalla et al. (2020) found an overestimation of UV 10 m in all members of a UAE physics ensemble, with magnitudes of around 1.5 m s −1. The bias was worse when using the MYNN 2.5 TKE PBL and MYNN surface layer schemes, when compared with the Yonsei University (YSU) scheme (Hong et al., 2006) paired with the MM5 Jiménez surface layer scheme (Jiménez et al., 2012).
Using a non-local PBL scheme like YSU tends to produce a deeper and drier PBL with a stronger vertical mixing, in comparison to local schemes like MYNN (see Milovac et al., 2016; Yang et al., 2017). This may lead to a reduction in wind speeds, heat, and moisture close to the surface. However, another study found that switching between seven different PBL schemes had little effect on positive UV bias (Shimada et al., 2011). One additional factor is that there are several parameters within the MYNN scheme itself, which may benefit from retuning for arid regions like the UAE (e.g., Yang et al., 2017). However, the total impact of the PBL scheme selection on reproduction of the T 2 m, TD 2 m, and UV 10 m diagnostics is not completely clear. This is because, depending on the land surface type, the calculations of transfer coefficients/fluxes are made in Noah-MP, the PBL scheme, or the surface layer scheme (SLS). In WRF, PBL schemes are generally coupled to the SLS, and typically all variables between the land surface and lowest model layer are diagnosed (e.g., T 2 m, U 10 m, V 10 m). These calculations in the SLS are based on Monin-Obukhov similarity theory and are represented in the model as hard-coded parameters and/or formulations of similarity functions. The latter are used to obtain dimensionless bulk transfer coefficients which are used for calculating momentum, heat, and moisture fluxes, and for diagnosing near-surface quantities like T 2 m. These coefficients re-enter the LSM and are used to calculate the surface fluxes, which then enter the PBL scheme as the lower boundary condition. Therefore, bias in near-surface variables is strongly related to the choice of LSM and SLS. In this WRF configuration, the communication link between the SLS and Noah-MP is broken, as Noah-MP itself calculates transfer coefficients and diagnostics over land surfaces, effectively bypassing the SLS (Nielson et al., 2013). The SLS only becomes active over water surfaces. This means that when Noah-MP is used, the LSM probably has a stronger impact on the bias of near-surface variables than the PBL and SLS (e.g., Milovac et al., 2016).
Incorrect aerodynamic roughness length parameters, as mentioned previously, may also play a large role in determining UV 10 m, as this parameter is used within the surface layer scheme. Nelli et al. (2020a) found positive wind speed biases over the same region when wind speeds were < 4 m s −1 and negative biases for wind speeds which were > 6 m s −1 within a WRF V3.8 simulation. We have a similar behavior at night in the marine and desert regions, as exhibited by the positive-to-negative distribution of errors increasing with wind speed. Nelli et al. (2020a) reduced these biases by retuning the roughness length parameter based on eddy covariance measurements.
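To illustrate why the aerodynamic roughness length matters for a diagnosed 10 m wind, the sketch below applies the textbook neutral logarithmic wind profile, u(z) = (u*/κ) ln(z/z0), to scale a wind speed from a reference height down to 10 m. This is not the WRF or Noah-MP formulation, which includes stability corrections; the reference height of 30 m, the reference wind of 6 m s−1, and both roughness lengths are hypothetical values chosen only for illustration.

```python
import math

def wind_at_height(u_ref, z_ref, z_target, z0):
    """Scale wind speed between heights with the neutral logarithmic profile.

    u(z) is proportional to ln(z / z0), so the friction velocity and the
    von Karman constant cancel in the ratio u(z_target) / u(z_ref).
    """
    if z0 <= 0 or z0 >= min(z_ref, z_target):
        raise ValueError("z0 must be positive and below both heights")
    return u_ref * math.log(z_target / z0) / math.log(z_ref / z0)

U_REF = 6.0   # m s-1 at the reference height; hypothetical
Z_REF = 30.0  # m; hypothetical height of the lowest model level

# Two hypothetical roughness lengths (m): a smoother and a rougher surface.
for z0 in (0.01, 0.1):
    u10 = wind_at_height(U_REF, Z_REF, 10.0, z0)
    print(f"z0 = {z0:4.2f} m -> diagnosed 10 m wind = {u10:.2f} m s-1")
```

For a fixed wind aloft, the rougher surface gives a lower diagnosed 10 m wind, so a roughness length that is too small would be expected to push the diagnostic toward a positive bias. In a full model a change in roughness also feeds back on the friction velocity and the flow above, which this one-line scaling ignores.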
Another possibility is the length of the forecast spin-up, the required length of which may still be uncertain. We have already mentioned that Chaouch et al. (2017) cited a 5 h spin-up as being sufficient, but Hahmann et al. (2015) posit that the necessary spin-up over land could be 12 h or even more (primarily for effective use of the PBL scheme). However, such long spin-ups are likely to be (i) prohibitively expensive and (ii) too time consuming for forecasting purposes.

Summary and outlook
The aim of this study was to (i) assess the skill of WRF with Noah-MP in reproducing surface quantities over the UAE; (ii) identify regional, seasonal, and diurnal differences in performance; and (iii) estimate potential sources of model deficiencies. We have demonstrated the value of splitting the model evaluation temporally and spatially. While assessment of diagnostics for the whole UAE region remains useful, it can obscure regional, diurnal, and seasonal differences, as well as compensating biases. These are all scientifically interesting factors. Importantly, they might reveal information on model performance with respect to specific processes and land surface types, and how they are simulated. An analysis of model predictions has revealed that WRF with Noah-MP represents the mean T 2 m field reasonably well during the daytime, although with a tendency for slight overestimation (≤ 1 °C). The nocturnal T 2 m is underestimated more strongly though (1-4 °C), and with larger biases during the hotter months, particularly in the desert and mountains, likely due to a combination of deficiencies. The marine region has the lowest T 2 m biases, which is encouraging, and highlights the value of ingesting quality SST data, especially in coastal regions. WRF shows a good performance regarding TD 2 m in general, with mean biases being ≤ 1 °C. Humidity over the marine region tends to be slightly overestimated though, whilst nocturnal mountain TD 2 m is underestimated (bias ∼ −2 °C). UV 10 m performance on land still needs to be improved, with biases of 1-2 m s −1. Furthermore, performance for UV 10 m tends to worsen during the hot months, particularly inland. UV 10 m in the marine region is generally much better simulated than in the other regions (bias ≤ 1 m s −1). There is an apparent relationship between T 2 m bias and UV 10 m bias, and this could be due to deficiencies in sea-land breeze simulation. TD 2 m biases appear to be more independent.
The only exception to this is during the night, when T 2 m and TD 2 m biases do appear linked. Ultimately, no model downscaling forecast (at scales economically viable for forecasting) can be expected to exhibit exceptional skill in all conditions. A general caveat when evaluating models is that one must factor in a certain level of error in station or gridded observational datasets themselves (e.g., as discussed by Prein and Gobiet, 2017). Nevertheless, assuming a high level of observational accuracy, we have discussed several avenues for improvement in this application of WRF. For instance, we should continue to devise and ingest new and improved datasets for land cover, terrain and soil texture, and albedo. In particular, within a vegetation-sparse region like the UAE, soil texture, moisture, and other parameters are likely to be of prime importance. Certainly, ingesting SST data appears to have been valuable, given the lower coastal biases in all variables.
We have mentioned several very useful experiments carried out on parameters like aerodynamic and thermal roughness lengths (Weston et al., 2019), as well as process-based observational studies related to the surface energy balance and verification studies (Nelli et al., 2020b). Further experiments should now be coordinated in order to improve model predictions further. In terms of parameterization schemes, ensemble experiments (in the manner of Chaouch et al., 2017; Milovac et al., 2016) are still required to identify optimal land surface-surface-layer-PBL-microphysics combinations for arid regions. Such studies can also address the tuneable parameters defined inside parameterization schemes, similarly to those conducted by Quan et al. (2016) and Yang et al. (2017). The most relevant ones can then be measured during dedicated field campaigns and subsequently ingested in the model.
Seasonal-scale studies such as these are vital for accurate assessment of WRF nowcasting performance and to identify model deficiencies and areas for improvement. By combining seasonal verification with sensitivity tests, and process and observational studies, we will move towards improved forecasting systems for the UAE and other arid regions.

Table A1 for details on individual weather stations.

Data availability. WRF output data are available on reasonable request as they are extremely large in size (many TB). They are archived on the German Climate Computing Center (Deutsches Klimarechenzentrum, DKRZ) and will be there for a minimum of 10 years. Verification data were uploaded to Zenodo in the form of open-access Excel files. Data are courtesy of NCM and UAE. Observation data can be found at https://zenodo.org/deposit/3894544 (Branch et al., 2020c). The verification statistics dataset can be found at https://doi.org/10.5281/zenodo.4004195 (Branch et al., 2020d).
Author contributions. OB is the first author who conceived the experiment, carried out the simulations and analysis, and wrote the publication. TS contributed greatly to scientific support and cowriting of the paper, provided much technical assistance, and formatted the observation data for use in the MET software. MT, RF, NN, MW, and VW provided specialist scientific support and assisted with the drafting and improvement of key aspects of the paper.
Competing interests. The authors declare that they have no conflict of interest.