Seasonal and diurnal performance of daily forecasts with WRF-NOAHMP V3.8.1 over the United Arab Emirates

Effective numerical weather forecasting is vital in arid regions like the United Arab Emirates (UAE) where extreme events like heat waves, flash floods and dust storms are severe. Hence, accurate forecasting of quantities like surface temperatures and humidity is very important. To date, there have been few seasonal-to-annual scale verification studies with WRF at high spatial and temporal resolution. This study employs a convection-permitting scale (2.7 km grid scale) simulation with WRF-NOAHMP, in daily forecast mode, from January 01 to November 3

Studies such as these are vital for accurate assessment of WRF nowcasting performance and to identify model deficiencies.
By combining sensitivity tests, process and observational studies with seasonal verification, we can further improve forecasting systems for the UAE.

Introduction
In a changing climate, effective numerical weather forecasting is vital in arid regions like the United Arab Emirates (UAE), to predict low-visibility events like fog and dust (e.g. Aldababseh and Temimi, 2017; Chaouch et al., 2017; Karagulian et al., 2019), and extreme events relating to storms and flash floods (Chowdhury et al., 2016; Wehbe et al., 2019), high temperatures and droughts. These extreme events are expected to become more prevalent under a changing climate (Feng et al., 2014; Zhao et al., 2020). In fact, climate projections suggest that arid and semi-arid regions are likely to expand in area along with rising temperatures (Huang et al., 2017; Lelieveld et al., 2016; Lu et al., 2007). Hence, it is vital that regional weather forecasting and climate simulations with regional climate models (RCMs) correctly simulate important quantities which characterize extreme events, especially surface temperatures, humidity, winds, and precipitation.
The model chain and configuration used in any simulation can heavily influence the results of such forecasts. Important factors include, but are not limited to, RCM type (e.g. Coppola et al., 2018), general circulation model (GCM) dataset for boundary forcing (Gutowski et al., 2016; Jacob et al., 2020), horizontal and vertical grid resolutions (e.g. Schwitalla et al., 2017b), physics and dynamics schemes (e.g. Chaouch et al., 2017; Schwitalla et al., 2020), soil/land use/terrain static data, as well as internal model parameter sets for important land surface processes (e.g. Weston et al., 2018).
The Weather Research and Forecasting (WRF) model (Skamarock et al., 2008) has been used in arid regions for various forecasting and verification studies (e.g. Branch et al., 2014; Fonseca et al., 2020; Schwitalla et al., 2020; Valappil et al., 2019; Wehbe et al., 2019), and process studies (Becker et al., 2013; Branch and Wulfmeyer, 2019; Karagulian et al., 2019; Nelli et al., 2020a; Wulfmeyer et al., 2014). To the best of our knowledge though, there have been few seasonal-to-annual scale verification studies with WRF in a daily forecasting mode at such a high spatial and temporal resolution (e.g. < 2-3 km). Horizontal grid scale in particular is significant because simulations employing convection-permitting (CP) grid spacing (< ~4 km) are known to outperform those at coarser resolutions, particularly in terms of clouds and precipitation, not least because they do not require a convection parameterization (Bauer et al., 2015; Prein et al., 2015; Schwitalla et al., 2011, 2017a; Sørland et al., 2018). Furthermore, it is known that land use, soil texture, and terrain interact with planetary boundary layer (PBL) processes in complex feedbacks (e.g. Anthes, 1984; Mahmood et al., 2014; Pielke and Avissar, 1990; Smith et al., 2014), with a strong level of land-atmosphere (LA) coupling thought to exist in this region (Koster et al., 2006). Representation of landscape structure and the associated LA feedbacks should therefore be significantly improved when using finer grid resolution. In terms of time scale, seasonal-to-annual simulations are costly, but provide a sufficient time series for robust statistical comparison with observations over different seasons. This study employs a verified configuration of WRF, coupled with the NOAH-MP 'multi physics' land surface model (LSM) with modular parameterization options (Niu et al., 2011).
In contrast to typical climate mode simulations, WRF is run here in a numerical weather prediction (NWP), or daily forecasting, mode in order to keep conditions inside the domain closer to those of the forcing data (see Section 2.3.3 for further details). We also apply high quality/resolution boundary forcing data, improved static data for land use/soils and terrain, and high frequency aerosol optical depth and sea surface temperature data. This WRF configuration was employed and verified by Schwitalla et al. (2020) within a one-day case study of a physics ensemble.
Our main objective is to assess the seasonal and diurnal performance of WRF, both qualitatively and quantitatively, in reproducing surface air temperature, dew point and wind data from 48 surface weather stations distributed over the UAE. Another objective is to assess the model performance in different areas of the UAE, which was split broadly into three environments: 1. northern coastline and islands, 2. inland lowland desert areas, and 3. the Al Hajar mountains in the east. The aim is to investigate differences in performance due to expected differences in climate regimes within these zones, their respective surface/landscape characteristics, and how they are dealt with by WRF-NOAH-MP. Factors include, amongst others, the influence of sea surface temperatures in the warm and shallow Arabian Gulf (Al Azhar et al., 2016), representation of albedo (Fonseca et al., 2020) and roughness length parameters (Weston et al., 2018a), and limitations in simulations over orography, particularly with respect to the wind field (e.g. Warrach-Sagi et al., 2013). The Al Hajar Mountains have a complex climate with regular coastal fog and convective events. Therefore, splitting verification into the above zones (in which the stations are quite evenly distributed, with 16, 14, and 18 stations, respectively) can yield further insights into model performance and climate characteristics in different environments. Through ambitious simulations and robust verification, we can gain valuable insights into the regional climate and model performance, and take a step towards more skilful weather forecasting with WRF-NOAH-MP in the UAE.
The structure of this work is as follows: We start with our Materials and Methods (Section 2), showing maps of the study area and model domain (2.1), a description of the regional climate (2.2), the model chain, configuration and simulation method (2.3), the verification data set (2.4), and verification methods (2.5). Then follows a results and discussion section (3), and finally a summary and outlook (4).

Study area and model domain
The region under investigation is the United Arab Emirates (UAE), located between 22.61-26.43˚ N and 51.54-56.55˚ E in the far north east of the Arabian Peninsula (see Figure 1a), with the 48 surface verification stations being spread out across the country. The model domain is shown in Figure 1b and covers a much larger area, a) to be sure of excluding the area with the strong effects of the boundary forcing (i.e. relaxation zone) from the analysis, and b) to incorporate the large scale synoptic weather situation. Model corner grid cells are located at 14.775˚ N, 32.225˚ N, 43.275˚ E, and 65.725˚ E.
https://doi.org/10.5194/gmd-2020-201 Preprint. Discussion started: 1 September 2020 © Author(s) 2020. CC BY 4.0 License.

Regional climate
Synoptic climate - Weather in the wider region is controlled generally by four weather systems, including troughs originating from the Atlantic and Mediterranean Sea in winter, locally forced convective storms over the UAE/Oman Al Hajar Mountains in summer, and the southerly summer monsoon and cyclones from the Arabian Sea during June and October (Bruintjes and Yates, 2003; Steinhoff et al., 2018). These phenomena are represented in large-scale seasonal climatologies (1979-2014) in Figures 2 and 3 (right-hand panels). To represent the climate, we have used geopotential height at 500 hPa, wind velocity at 850 hPa and mean sea level pressure. Note that winter is represented exclusively by the months of January and February, because these are the months used for our winter analysis during 2015, for reasons of temporal continuity. In the climatology for winter (January-February, JF), we can clearly see a typical heat-low centred over Turkey and Iraq and a trough extending down toward the Arabian Peninsula. During summer (June, July and August; JJA), we observe much higher temperatures further south, with a heat-low centred over Iran and the UAE. The other two seasons appear to be transitional periods.
UAE climate - The UAE climate is generally characterized by scarce precipitation and high temperatures. However, annual cycles do exist, with maxima of precipitation and minima of temperatures in winter and the converse in summer. Annual UAE precipitation ranges from 20 mm in the drier west to 130 mm in the higher Al Hajar Mountains of the east, mainly produced in the winter-spring time period (Sherif et al., 2014). During summer, subtropical subsidence leads to a strong reduction of precipitation and higher temperatures, and consequently summer precipitation represents only around 20% of the annual amounts.
However, upper level disturbances from the southern monsoon flows can still transport moisture towards the Arabian Peninsula and the UAE (Böer, 1997; Schwitalla et al., 2020), and convection is initiated sporadically over the mountains of Oman and the UAE in summertime.
The neighbouring Arabian Gulf to the north of the UAE also plays a strong role in regional weather conditions. The prevailing winds from the Arabian Gulf are westerly or north-westerly between January and May, but change to north-westerly and then northerly from June toward November. The sea surface of the Arabian Gulf, which is relatively shallow (maximum depth ~90 m) particularly close to the UAE coast, can heat rapidly, with temperatures often exceeding 30˚C (Al Azhar et al., 2016). Prevailing winds are augmented by strong sea/land breezes, which develop due to land/sea temperature gradients. Daytime sea breezes can penetrate up to 50 km inland (Eager et al., 2008).

Model chain and physics
The model chain is based on the Weather Research and Forecasting model (WRF, Skamarock et al., 2008) version 3.8.1 using the Advanced Research WRF (ARW) core, which solves the Euler equations on a discretized horizontal grid, with a terrain-following vertical coordinate system. The domain was selected following a twin experiment by Schwitalla et al. (2020) and comprises 900 by 700 grid cells horizontally (see Figure 1b). In line with our previous statements on CP scale, we selected a grid increment of 0.025˚ (~2779 m), with no parameterization of deep convection. It was important to extend the domain enough to incorporate influential synoptic conditions upstream to the north, east, and south east. Hence our grid covers a region of approximately 2500 × 1945 km, extending up to Iraq in the north, down to the south of Yemen, and well into Pakistan in the east. Care was taken, for reasons of model stability, that domain boundaries did not bisect very large peaks, especially in the complex terrain of Iran.
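The grid increment and domain extent quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, assuming a spherical Earth of radius 6370 km and ignoring the convergence of meridians with latitude; the function and variable names are illustrative and not part of the model chain.

```python
import math

# Assumption: spherical Earth with a radius of 6370 km. The paper only
# quotes the resulting increment of ~2779 m, not the radius used.
EARTH_RADIUS_M = 6_370_000.0


def grid_increment_m(increment_deg: float) -> float:
    """Arc length in metres of a grid increment given in degrees."""
    return math.radians(increment_deg) * EARTH_RADIUS_M


dx_m = grid_increment_m(0.025)        # grid increment stated in the text
extent_x_km = 900 * dx_m / 1000.0     # 900 grid cells in one direction
extent_y_km = 700 * dx_m / 1000.0     # 700 grid cells in the other

print(f"dx = {dx_m:.1f} m")
print(f"domain = {extent_x_km:.1f} km x {extent_y_km:.1f} km")
```

Under these assumptions the result is consistent with the stated increment of about 2779 m and a domain of roughly 2500 × 1945 km. On a regular latitude-longitude grid the true east-west distance shrinks with the cosine of latitude, so the figures are best read as nominal values.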
Vertically, 100 levels were used, adjusted so that at least 25 levels were present in the lower 2000 m -to maximise resolution of the strong moisture gradients in the boundary layer and lower troposphere.
WRF was coupled with the NOAH-MP LSM (Niu et al., 2011) to simulate land-surface processes and land-atmosphere feedbacks.
NOAH-MP provides a separate vegetation canopy defined by a canopy top and ground layer, including a modified energy balance closure approach. It offers a tile approach where the net longwave radiation and turbulent fluxes are calculated separately for bare soil and the canopy layer. The calculated fluxes over vegetated grid cells are then bulked as a weighted sum of bare soil and canopy fluxes. Furthermore, NOAH-MP is partially modular in structure, providing a suite of optional schemes for several processes, such as radiation budget calculation, stomatal resistance, snow albedo and others. The same configuration as Milovac et al. (2016) was used for all NOAH-MP options.
Other physics schemes included were RRTMG for long and shortwave radiation transfer (Iacono et al., 2008; Mlawer et al., 1997), the Thompson aerosol-aware scheme for microphysics (Thompson and Eidhammer, 2014), the MYNN scheme for the atmospheric surface layer, and the MYNN 2.5 level TKE scheme for the boundary layer (Nakanishi and Niino, 2006) (see Table 1 for a synopsis of physics schemes and their associated references). Notably, Schwitalla et al. (2020) have already demonstrated the good performance of the MYNN/Thompson aerosol-aware combination in a 5 member physics ensemble, verified with surface data.
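The tile weighting described for NOAH-MP can be illustrated with a short sketch. It assumes that the weights are the vegetated fraction of the grid cell and its complement, which the text does not state explicitly; the function name and the flux values are hypothetical and chosen only for illustration.

```python
def grid_cell_flux(flux_bare: float, flux_canopy: float, veg_fraction: float) -> float:
    """Bulk a bare-soil and a canopy flux into one grid-cell flux.

    Assumption: the weights are the vegetated fraction and its complement.
    """
    if not 0.0 <= veg_fraction <= 1.0:
        raise ValueError("veg_fraction must lie between 0 and 1")
    return veg_fraction * flux_canopy + (1.0 - veg_fraction) * flux_bare


# Hypothetical sensible heat fluxes (W m-2) for a sparsely vegetated cell.
bulk = grid_cell_flux(flux_bare=300.0, flux_canopy=180.0, veg_fraction=0.1)
print(f"bulk flux = {bulk:.1f} W m-2")
```

With a small vegetated fraction, as over most of the UAE, the bulk flux stays close to the bare-soil value, which is why the soil parameters matter so much in this region.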

Initialization and forcing data
Initial and lateral boundary conditions are retrieved from the European Centre for Medium-Range Weather Forecasts (ECMWF) Integrated Forecasting System (IFS), in the form of 6-hourly operational analysis data on the 41r1 cycle, on model levels. The horizontal grid increment is 0.125˚ (~12 km) with 137 vertical levels up to 0.01 hPa. Soil moisture and soil temperatures are also provided by this model, which assimilates satellite soil moisture data (Albergel et al., 2012) into its coupled Hydrology-Tiled ECMWF Scheme for Surface Exchange over Land model, HTESSEL (Balsamo et al., 2009).
Sea surface temperatures (SSTs) are obtained from the OSTIA project (Donlon et al., 2012). The data have a 1/20˚ horizontal resolution at a 12-hourly frequency, at 00:00 and 12:00 UTC. These data are particularly important in coastal regions like the UAE.
Aerosol optical depth (AOD) data are obtained from the ECMWF Monitoring Atmospheric Composition and Climate (MACC) reanalysis (Inness et al., 2013) and interact with the shortwave radiation scheme to modify radiative transfer and diabatic heating. The data have a ~80 km horizontal resolution and a 6-hourly frequency starting from 00:00 UTC.
Abu Dhabi dataset contained some classes which differed from MODIS IGBP, and these were first reclassified in a logical manner before overwriting the MODIS dataset within the UAE (see Schwitalla et al. (2020) for further details of this process).

Simulation method
The objective of this study was to run a series of daily forecasts with WRF for the period Jan 01 - Nov 30 2015. The intention of carrying out such a long sequence was to produce a dataset with sufficient data points for robust statistical analysis. Forecasts were carried out in an NWP mode, i.e. with daily cold starts, as opposed to a 'climate' mode, which has a single cold start at the outset. In NWP mode, a cold start was initiated each day at 18:00 UTC (22:00 LT) and run for 30 hours, i.e. 6 + 24 hours, ending at 00:00 UTC at the close of the forecast day. The first 6 hours of each forecast (18:00 UTC to 00:00 UTC) were then discarded from the analysis. The 6 hours allow time for the atmosphere to spin up after each cold start, in particular for the residual boundary layer to develop and dissipate before the convective boundary layer starts to develop after sunrise (~06:00 LT), and for potential cloud development. Other UAE forecasting studies have also suggested 5-6 hours as an appropriate period for model convergence in the UAE region (Chaouch et al., 2017; Weston et al., 2018). After discarding the first 6 hours, a forecast remains for analysis spanning the 24 hours of each day between 00:00 and 23:00 UTC (04:00 to 03:00 LT). See Table 2 for a summary of the simulation method.
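The daily cold-start cycle described above can be sketched as a simple schedule generator. This is an illustration only, not the scripts used for the simulations: the function names are hypothetical, local time is taken as UTC+4 as implied by the text, and it is assumed that the forecast for a given day is initialised at 18:00 UTC on the previous day.

```python
from datetime import datetime, timedelta, timezone

SPINUP_HOURS = 6                     # discarded spin-up after each cold start
FORECAST_HOURS = 24                  # hours retained for verification
LOCAL_OFFSET = timedelta(hours=4)    # assumption: UAE local time = UTC+4


def forecast_cycle(forecast_day):
    """Return (init_time, end_time, analysed_hours) for one forecast day."""
    day_start = forecast_day.replace(hour=0, minute=0, second=0, microsecond=0)
    init_time = day_start - timedelta(hours=SPINUP_HOURS)   # 18:00 UTC, day before
    end_time = init_time + timedelta(hours=SPINUP_HOURS + FORECAST_HOURS)
    analysed = [day_start + timedelta(hours=h) for h in range(FORECAST_HOURS)]
    return init_time, end_time, analysed


def all_forecast_days(first, last):
    """List every forecast day from first to last, inclusive."""
    n_days = (last - first).days + 1
    return [first + timedelta(days=d) for d in range(n_days)]


first_day = datetime(2015, 1, 1, tzinfo=timezone.utc)
last_day = datetime(2015, 11, 30, tzinfo=timezone.utc)
days = all_forecast_days(first_day, last_day)
init_time, end_time, analysed = forecast_cycle(days[0])

print(len(days), "forecast days")
print("first cold start:", init_time.isoformat())
print("analysed window :", analysed[0].isoformat(), "to", analysed[-1].isoformat())
```

Each cycle spans 30 hours, of which the final 24 hourly output times are retained for verification.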
By reinitialising the 3D state inside the domain, we keep the atmospheric simulation closer to the forecast provided by ECMWF than would be the case in climate mode. In climate mode, which is driven only at the boundaries, the WRF simulations may diverge more strongly, particularly toward the centre of the large domain where the study area lies, unless some form of interior nudging were implemented (e.g. Lo et al., 2008).
An exception to the daily reinitialization of state variables was made with the soil moisture field, whose state was intentionally maintained from one day to the next, rather than being overwritten. The intention is to reduce physical inconsistencies between the soil moisture forecast in the driving GCM and that of WRF-NOAH-MP. Intuitively that may not seem a large issue given the aridity of the UAE. However, it becomes significant when convective precipitation occurs in WRF, and soils are wetted. Such convective events and flash floods are common in the UAE and Oman, particularly from May to September in the mountains, including during 2015 (Schwitalla et al., 2020; Wehbe et al., 2019).
Hence, the NWP method is a worthwhile method of improving physical consistency.

Datasets for verification
Hourly verification data come from 48 surface weather stations throughout the UAE (Figure 1a and Appendix Table A1), quality checked and made available by the National Center for Meteorology (NCM) in Abu Dhabi, UAE. Fields available include air temperature at 2 m (T-2m), dewpoint at 2 m (TD-2m) representing humidity, and wind speed at 10 m (UV-10m).
The data cover the entire period of January 01 - November 30 2015. Unfortunately, quality checked data for December 2014 were not available and so, in the interest of preserving contiguous seasons, the month of December 2015 was omitted from the winter statistics.

Verification method
An aim of the study is to assess WRF's performance on several timescales: year (Jan-Nov), season, daytime and nighttime periods, and hourly. Another is to assess performance within different regions of the UAE. The exclusive assessment of overall forecast means over the UAE may be valuable, but could obscure variability within the different regions, such as the capturing of high daytime temperatures in the inland deserts, or cooler and windier coastal conditions. Accordingly, the dataset was split temporally and spatially, as follows.

Temporal analysis
Yearly analysis -all timesteps analysed from Jan 01 to Nov 30 (hourly interval).
Seasonal analysis - we present the most extreme seasons in terms of air temperatures: the (coolest) winter period of January 01-February 28 2015 and the (warmest) summer period of June 01-August 31 2015.
Daytime and nighttime periods - for daylight hours we used all hours between 02:00 and 13:00 UTC (06:00-17:00 LT), and for nighttime, 14:00 to 01:00 UTC (18:00-05:00 LT). These hours were selected based on UAE sunrise and sunset times, which range between ~05:30 and 07:00 LT, and ~17:00 and 18:50 LT, respectively. The intention of separating day and night hours in this way is to examine performance during the nocturnal stable and daytime convective boundary layers. Indeed, several simulations in arid regions have demonstrated nocturnal cold biases and an overestimation of daytime windspeeds (Branch et al., 2014; Schwitalla et al., 2020; Weston et al., 2018).
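The temporal split above can be expressed compactly in code. The sketch below uses the UTC hour ranges given in the text and assumes a fixed offset of four hours between UTC and local time; the names are illustrative.

```python
DAY_HOURS_UTC = set(range(2, 14))                # 02:00-13:00 UTC
NIGHT_HOURS_UTC = set(range(14, 24)) | {0, 1}    # 14:00-01:00 UTC


def period(hour_utc: int) -> str:
    """Classify an hour of day (UTC) as 'day' or 'night'."""
    if hour_utc in DAY_HOURS_UTC:
        return "day"
    if hour_utc in NIGHT_HOURS_UTC:
        return "night"
    raise ValueError("hour_utc must be an integer between 0 and 23")


def local_hour(hour_utc: int) -> int:
    """Convert a UTC hour to local time, assuming UTC+4."""
    return (hour_utc + 4) % 24


for h in (1, 2, 13, 14):
    print(f"{h:02d}:00 UTC = {local_hour(h):02d}:00 LT -> {period(h)}")
```

Both periods contain twelve hourly timesteps, so day and night statistics are based on equally sized samples apart from missing observations.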

Regional analysis
We split the 48 UAE weather stations into 3 regions (marine, mountain and desert) based on surface geophysical characteristics and proximity to water bodies (see Figure 1a). Accordingly, the following criteria were used for grouping the stations:
Marine - located on islands or ≤ 5 km inland from the UAE coast - 17 stations.
The only exception made to this classification was for a single station located at 204 m near the sand dunes of Liwa, in the south of the Abu Dhabi emirate. Although the station is quite high, it is remote from the Al Hajar Range and was deemed more suitable for a desert classification. Details on altitude of the regional station groups can be found in Table 3, and a list of individual stations in the Appendix. The desert region is characterised by barren or sparsely vegetated soils (as is most of the UAE), high surface temperatures and rapid nighttime cooling due to radiative losses associated with a dry atmosphere. The Al Hajar mountain region is arid and has generally rocky bare slopes with lower albedo (e.g. Moody et al., 2005), with gravel plains running along the west side (Sherif et al., 2014).
One can assume some similarity between these regions, particularly when the synoptic situation is relatively homogeneous over scales larger than the study area. Nevertheless, given the large number of stations and length of time series, if regional differences do exist then they should be evident.

Diagnostics
In order to get a visual overview of model performance, in terms of closeness of fit, spread of forecast errors, and distribution of residuals, scatterplots divided by region and day/night period are shown in Figure 5. Included are a line of best fit for the data, a 1:1 line of perfect fit, and a 95% confidence ellipse.
To quantify the regional forecast/observation association, error magnitude and sign during day/night, we show three standard statistical diagnostics (Pearson correlation coefficient, root mean square error (RMSE) and bias).
The Pearson correlation coefficient 'r' measures the strength of linear association between forecast (f) and observation (o), at all stations at each timestep, given as:
$$r = \frac{\sum_{i=1}^{n_s} (f_i - \bar{f})(o_i - \bar{o})}{\sqrt{\sum_{i=1}^{n_s} (f_i - \bar{f})^2}\,\sqrt{\sum_{i=1}^{n_s} (o_i - \bar{o})^2}}$$
where $f_i$ and $o_i$ are the forecast and observation at each observation point i, $\bar{f}$ and $\bar{o}$ are forecast and observation averages, $n_s$ indicates the total number of observations at each time step (i.e. number of stations), and overbars indicate the mean. Occasionally $n_s$ was reduced slightly whenever a missing value occurred.
The RMSE is a measure of error magnitude, defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n_s}\sum_{i=1}^{n_s} (f_i - o_i)^2}$$
The Bias is a measure of overall error, including sign, defined as:
$$\mathrm{Bias} = \frac{1}{n_s}\sum_{i=1}^{n_s} (f_i - o_i)$$
These diagnostics were generated for 2015 for each region and time period, and their temporal distribution expressed in boxplots (Section 3, Figure 7) showing mean, median, 25%-75% percentiles (box range) and 5% and 95% percentiles (whiskers).
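As an illustration of how the three diagnostics could be computed for a single timestep, the following sketch uses only the Python standard library. It is not the authors' code: the function names and sample values are hypothetical, bias is taken as forecast minus observation so that positive values indicate overestimation, and station pairs with a missing value are dropped before any statistic is computed.

```python
import math


def valid_pairs(forecast, observed):
    """Keep only station pairs where both values are present (not None or NaN)."""
    pairs = []
    for f, o in zip(forecast, observed):
        if f is None or o is None:
            continue
        if math.isnan(f) or math.isnan(o):
            continue
        pairs.append((f, o))
    return pairs


def diagnostics(forecast, observed):
    """Return Pearson r, RMSE, bias and the number of valid stations."""
    pairs = valid_pairs(forecast, observed)
    ns = len(pairs)
    if ns < 2:
        raise ValueError("at least two valid station pairs are required")
    f_mean = sum(f for f, _ in pairs) / ns
    o_mean = sum(o for _, o in pairs) / ns
    cov = sum((f - f_mean) * (o - o_mean) for f, o in pairs)
    var_f = sum((f - f_mean) ** 2 for f, _ in pairs)
    var_o = sum((o - o_mean) ** 2 for _, o in pairs)
    # r is undefined when either series is constant across stations.
    r = cov / math.sqrt(var_f * var_o) if var_f > 0 and var_o > 0 else float("nan")
    rmse = math.sqrt(sum((f - o) ** 2 for f, o in pairs) / ns)
    bias = sum(f - o for f, o in pairs) / ns
    return {"r": r, "rmse": rmse, "bias": bias, "ns": ns}


# Hypothetical T-2m values (deg C) at four stations; one forecast is missing.
stats = diagnostics([30.0, 32.0, 35.0, float("nan")], [29.0, 33.0, 34.0, 31.0])
print(stats)
```

Repeating this calculation for every timestep yields the time series of r, RMSE and bias whose distributions can then be summarised in boxplots.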
Finally, a closer look at the diurnal evolution of the forecast is useful to investigate performance at specific times of day, such as local noon and PBL transition periods, where models often have biases. Hence, we generated mean hourly cycles of the spatial mean and spatial standard deviations for both forecast and observations. The mean at each hour is calculated as the average over all $n_s$ stations, $\bar{x} = \frac{1}{n_s}\sum_{i=1}^{n_s} x_i$, where $x_i$ is the forecast or observed value at station i, and the spatial standard deviation (σ) at each hour is given by the spread of the station values about that mean. For the diurnal analysis, we selected the two most extreme seasons in terms of temperature: the (coolest) winter period of January-February (Figure 8) and the (warmest) summer period of June-August (Figure 9), 2015. Again, these figures are divided by region.
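The construction of a mean diurnal cycle can be sketched as follows. Two details are assumptions rather than statements from the text: the spatial mean and standard deviation are computed at every timestep first and then averaged over all days for each hour, and the population form of the standard deviation (division by the number of stations) is used. The function name and data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def diurnal_cycle(timesteps):
    """Mean hourly cycle of the spatial mean and spatial standard deviation.

    timesteps: iterable of (hour_utc, station_values); missing values are None.
    Returns {hour: (mean of spatial means, mean of spatial std devs)}.
    """
    means = defaultdict(list)
    spreads = defaultdict(list)
    for hour, values in timesteps:
        valid = [v for v in values if v is not None]
        if len(valid) < 2:
            continue  # too few stations for a meaningful spread
        means[hour].append(fmean(valid))
        spreads[hour].append(pstdev(valid))  # assumption: population form
    return {h: (fmean(means[h]), fmean(spreads[h])) for h in sorted(means)}


# Hypothetical T-2m values (deg C): two days at 00 UTC, one day at 12 UTC.
sample = [
    (0, [20.0, 22.0, None]),
    (0, [22.0, 24.0, 26.0]),
    (12, [30.0, 34.0]),
]
cycle = diurnal_cycle(sample)
for hour, (mean, spread) in cycle.items():
    print(f"{hour:02d} UTC: mean = {mean:.2f}, sigma = {spread:.2f}")
```

Applying the same function to forecast and observed values separately gives the paired curves needed for a diurnal comparison.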

Results and Discussion
In this section, we present a discussion of the results. Before examining the model performance, however, we first discuss the study period of 2015 in the context of the long term climate and El Niño (3.1) to assess the representativeness of the 2015 study period. We then discuss differences in regional climate and their significance to our verification (3.2). Finally, we evaluate the regional model output of T-2m, TD-2m and UV-10m fields across the seasons and time of day (3.3).

2015 in context
Our study period is 2015 from January 01 to November 30 (during which time the full verification dataset was available). 2015 was considered one of the strongest El Niño periods since 1950 (L'Heureux et al., 2017) with an Oceanic Niño Index (ONI) of up to 2.6 towards the end of the year (see Table 4 ... in fact to consider the 2015 regional climate as representative of the climate in general.

Regional and seasonal characteristics
An assessment of regional distributions reveals that clear differences in means and variability do exist (Figure 6). As expected, the marine region is dominated by the Arabian Gulf characteristics, with more moderate temperature maxima and minima (6a), greater humidity (6b), and higher windspeeds (6c) than the inland desert, for instance (Figure 6). Hence marine temperatures are lower than at the desert stations in the summer months but remain higher in winter and autumn. In fact, the desert stations have the most extreme T-2m range in all seasons, reflecting the lower heat capacity of the surface and the consequent strong daytime surface heating. Rapid nocturnal cooling also occurs due to radiative losses in a much drier inland environment. The mountain region is only a little cooler than the desert (~1˚C) in summer and autumn, with the difference further reduced during spring and winter. The majority of mountain stations are located at fairly moderate altitudes (mean altitude 430 m, Table 3) with only one station located over 1000 m high (station ID 41229, 1485 m a.s.l., see Table A1 in Appendix). Even so, one might have expected larger differences. However, there could be reasons other than the temperature lapse rate for this, such as differences in mountain and desert cloud cover (Yousef et al., 2019) or in albedo (e.g. Nelli et al., 2020b).
TD-2m, or dewpoint temperature, is a standard measure of humidity and is in most cases relatively independent of the ambient temperature.
It is also a reliable measure of how humid the air feels in terms of human comfort (Wood, 1970). In a hot (and warming) climate like the UAE, forecasting TD-2m accurately is therefore very important for society. Regionally, we observe considerable differences in TD-2m (Figure 6b), which is more or less expected due to coastal/land gradients and variation in vertical transport/distribution of vapor in different environments. Table 5 shows the difference in observed T-2m and TD-2m means. The inland atmosphere tends to be humid in summer when temperatures are high, but even closer to saturation in autumn and winter, as temperatures fall while humidity remains high. This seasonal range is particularly pronounced in the mountain regions, reflecting the predominance of annual rainfall occurring during winter in the mountains and gravel plains of the north-eastern part of the UAE (Sherif et al., 2014; Wehbe et al., 2019). In winter and spring, the marine region is closer to saturation than the other regions (T-2m minus TD-2m = -8 to -11˚C); however, a reversal of this relationship occurs in summer and autumn as the mountain and desert regions become more humid.
There are significant regional differences in UV-10m, with marine UV-10m being 0.5-1 m s-1 higher than in other regions (Figure 6c) and also more variable. This is not unexpected, due to low surface roughness, strong land-sea temperature gradients and the associated land-sea breezes. Desert UV-10m is the lowest all year round, and mountain UV-10m falls in between those of the desert and marine regions. In general, UV-10m is highest in spring and autumn.
These regional differences justify the need for regional splitting of the dataset and are further addressed in Section 3 in conjunction with model performance.

Model evaluation
Although the simulation of T-2m, TD-2m and UV-10m and the causes of any biases may be physically linked, we nevertheless first examine each field individually for clarity.
T-2m - In the scatter plots (Figure 5a-5h) we observe that in the daytime, T-2m appears well estimated for the UAE on the whole (Figure 5a) (+0.44˚C) and errors are well distributed over the T-2m range. However, this agreement obscures some compensating regional biases, namely overestimation in the desert (+0.71˚C) and mountains (+1.06˚C), and underestimation in the marine region (-0.93˚C).
The warm bias may be attributable to a combination of factors. Firstly, a WRF overestimation of downwelling surface shortwave radiation has been observed before (Fonseca et al., 2020; Nelli et al., 2020b). This has been attributed to a lack of cloud cover, but may also relate to the performance of the radiative transfer scheme and interaction with aerosols. Secondly, the soil representation, such as soil texture classification and the associated parameters like heat capacity, thermal diffusivity and albedo, may require adjustment. Underestimations of albedo in WRF have recently been observed, particularly for bright desert soils where measurements show typical albedo values of 0.3 to 0.34 (Nelli et al., 2020b). The WRF albedo value in this study is around 0.23 for much of the UAE lowlands, which would likely result in a too-high net radiation and sensible heating, especially on dry soils. This is consistent with the reported positive daytime temperature biases in the inland desert. A third factor may be the prescribed aerodynamic roughness length parameters used by WRF. Nelli et al. (2020a) found that a new value for the parameter, derived from eddy covariance measurements, reduced the warm daytime bias in WRF simulations (Nelli et al., 2020b). These causes may account for some or all of the daytime temperature biases and therefore need to be considered for future simulations in this region.
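The sensitivity of absorbed shortwave radiation to the albedo values mentioned above can be illustrated with a simple calculation. The sketch assumes that absorbed shortwave equals (1 - albedo) times the downwelling flux; the downwelling value of 1000 W m-2 is a hypothetical midday figure chosen only for illustration and is not taken from the simulations.

```python
def absorbed_shortwave(sw_down: float, albedo: float) -> float:
    """Shortwave radiation absorbed at the surface (W m-2)."""
    return (1.0 - albedo) * sw_down


SW_DOWN = 1000.0                    # hypothetical midday downwelling flux, W m-2
MODEL_ALBEDO = 0.23                 # WRF value quoted for the UAE lowlands
MEASURED_ALBEDO = (0.30, 0.34)      # range reported from measurements

model_abs = absorbed_shortwave(SW_DOWN, MODEL_ALBEDO)
measured_abs = [absorbed_shortwave(SW_DOWN, a) for a in MEASURED_ALBEDO]
excess = [model_abs - m for m in measured_abs]

print(f"model absorbs {model_abs:.0f} W m-2")
print(f"excess over measured albedo range: {excess[0]:.0f} to {excess[1]:.0f} W m-2")
```

Under these assumptions the lower model albedo leaves several tens of W m-2 of additional energy at the surface around midday, which is qualitatively consistent with a daytime warm bias over dry soils.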
Nocturnally, we observe a cold bias over the UAE (Figure 5e). This is quantified in Figure 7b as a mean negative bias of just over -2˚C. One can also see that this nocturnal bias tends to worsen with an increase in daily T-2m, which implies that the cold bias gets worse in the hotter months. This is confirmed in the seasonal diurnal cycles (Figure 8a and 9a), where the mean nocturnal bias in winter is ~ -2˚C, but strengthens to beyond -4˚C in summer. This nocturnal cold bias is reflected in all sub-regions, but not to the same degree. The best nocturnal performance is in the marine region (Figure 5g) (bias of -0.75˚C), with an even error distribution across the temperature range. The largest nocturnal cold bias is in the desert region (-3.1˚C) (Figure 5h), with a steady increase in bias with temperature. The switch from warm to cold biases usually occurs around the twice-daily transition times of the boundary layer between stable and convective states. Such arid nocturnal biases have been noted before (Branch et al., 2014; Fekih and Mohamed, 2017; Weston et al., 2018). It may be that a too-dry lower atmosphere results in a lower downward flux of longwave radiation, as found by Fonseca et al. (2020) in a comparison of WRF with radiation measurements. All else being equal, this dryness would lead to a reduction of 'buffering' at night time. They also found a too-high upward ground heat flux during the night, which could be associated with sub-optimal soil parameters or a too-strong soil-air temperature gradient. Overall, their net radiation losses at night were higher in WRF than from the radiation measurements.
TD-2m - is relatively well estimated in 2015 for the UAE as a whole, with correlations around 0.7 and biases of less than 1˚C.
However, we can look at regional/seasonal differences for more detail. In the desert and marine regions, the biases are ≤1˚C during both day and night. Marine TD-2m is slightly overestimated in general, indicating the model to be more humid over the Gulf and coast than observed. Mountain nocturnal dew points are more of a problem, with a negative bias of ~ -2˚C and a larger error spread than the other regions (Figure 7e). There is also a corresponding T-2m nocturnal bias of ~ -2˚C, which could indicate a deficiency in the longwave surface budget as just mentioned, but also a model deficiency in representing the intermittent shear-driven turbulence that appears in nighttime stable boundary layers. However, such biases in complex terrain have already been well documented (e.g. Warrach-Sagi et al., 2013; Zhang et al., 2013). One of the reasons cited is that the CP scale is not fine enough to resolve mountain slopes, and therefore cannot capture certain processes in the same way that large-eddy scale models can, with grid spacings on the order of ∆x ~ 100 m. While such fine resolutions may be appropriate in a research context, they may remain prohibitively expensive and inappropriate in the context of operational forecasting.
An additional problem in complex terrain is the validity of traditional Monin-Obukhov similarity theory (MOST) (e.g. see Foken, 2006), which is typically used in atmospheric models, including WRF, to calculate model diagnostics like T-2m or TD-2m. MOST assumes a homogeneous underlying land surface and stationary fluxes, and there is considerable evidence that in complex and heterogeneous landscapes MOST needs significant improvement in its scaling of turbulent kinetic energy profiles in the lowest part of the boundary layer (e.g. Figueroa-Espinoza et al., 2014; Wulfmeyer et al., 2018). The latter may affect the representation of heat, moisture and momentum transport from the land surface to the atmosphere and, if misrepresented, may lead to such high biases in the surface layer model diagnostics.
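For reference, the MOST relations underlying such 2 m diagnostics can be written in their common textbook form, as sketched below. The notation is introduced here for illustration and is not taken from the manuscript or from the WRF source code: $\kappa$ is the von Kármán constant, $u_*$ and $\theta_*$ are the friction velocity and temperature scale, $L$ is the Obukhov length, $z_{0h}$ is the thermal roughness length, $\overline{\theta}_s$ is the potential temperature at $z_{0h}$, and $\phi$ and $\psi$ are empirical stability functions and their integrated forms (with any turbulent Prandtl number absorbed into $\phi_h$).

```latex
\begin{align}
  \zeta &= \frac{z}{L}, \qquad
  L = -\frac{u_*^{3}\,\overline{\theta_v}}{\kappa\, g\, \overline{w'\theta_v'}} \\
  \frac{\kappa z}{u_*}\frac{\partial \overline{u}}{\partial z} &= \phi_m(\zeta), \qquad
  \frac{\kappa z}{\theta_*}\frac{\partial \overline{\theta}}{\partial z} = \phi_h(\zeta) \\
  \overline{\theta}(z) - \overline{\theta}_s &= \frac{\theta_*}{\kappa}
  \left[ \ln\frac{z}{z_{0h}} - \psi_h\!\left(\frac{z}{L}\right)
         + \psi_h\!\left(\frac{z_{0h}}{L}\right) \right]
\end{align}
```

Evaluating the last relation at z = 2 m gives a diagnostic of the T-2m type. The assumptions of horizontal homogeneity and stationarity enter through the universal functions $\phi_m$ and $\phi_h$, which is why their validity over complex terrain matters for the diagnosed values.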
Seasonally, TD-2m is quite well reproduced in both winter and summer (Figures 8 and 9). The mountain nocturnal negative bias becomes more significant in summer (Figure 9e). In the desert, a positive bias occurs over midday, starting around 10 am LT (Figure 9k), showing an overestimation of water vapor in summer. This is likely to be too early in the day for a sea-breeze-driven anomaly, but may relate to simulated soil moisture being higher than in reality. This was observed in a study by Wehbe et al. (2018), who found a wet bias in dry soils and a dry bias in wetter soils in WRF over the UAE when not coupled with a more advanced hydrological model.
UV-10m - WRF overestimates UV-10m during the day and night, in all regions and seasons. Positive biases of 1-2 m s⁻¹ are typical over the whole year (Figure 7h). Mountain daytime biases are strongest at 2 m s⁻¹, followed by daytime desert biases at 1.5 m s⁻¹. Marine biases are lowest, with mean biases of <1 m s⁻¹. Notably, there is a trend where positive biases increase with wind speed (Figure 5p, 5q, 5s). There is a significant increase in bias during the daytime, and also in the summer, particularly in the mountain and desert regions (Figure 9f and 9i). In fact, the strongest wind biases occur in the same situations in which daytime T-2m is overestimated, particularly in the mountain and desert regions (Figures 7, 8, 9), hinting at a relationship between the two. Indeed, a too-strong sea breeze may well account for this. During summer, the desert-marine T-2m daytime gradient is higher (~5 ˚C, see Figure 9g and 9j, red curves) than in winter (~3 ˚C, see Figure 8g and 8j), although the seasonal warm biases are similar (~1.5-2 ˚C). The higher gradient coincides with a greater UV-10m bias in summer. Weston et al. (2018) improved the duration and direction of UAE sea breezes by tuning a thermal roughness length parameter in WRF. The PBL and surface layer parameterization schemes could also be a cause of the bias. Schwitalla et al. (2020) found an overestimation of UV-10m in all members of a UAE physics ensemble, with magnitudes of around 1.5 m s⁻¹. The bias was worse when using the MYNN 2.5 TKE PBL and MYNN surface layer schemes than when using the Yonsei University (YSU) scheme (Hong et al., 2006) paired with the MM5 Jimenez surface layer scheme (Jiménez et al., 2012).

Using a non-local PBL scheme like YSU tends to produce a deeper and drier PBL with stronger vertical mixing, in comparison to local schemes like MYNN (see Milovac et al., 2016; Yang et al., 2017). This may lead to a reduction in wind speeds, heat, and moisture close to the surface. However, another study found that switching between seven different PBL schemes had little effect on the positive UV bias (Shimada et al., 2011). One additional factor is that there are several parameters within the MYNN scheme itself which may benefit from retuning for arid regions like the UAE (e.g. Yang et al., 2017). Nevertheless, the total impact of the PBL scheme on the reproduction of the T-2m, TD-2m and UV-10m diagnostics is not completely clear, since they are computed within the surface layer scheme, based on MOST, or by NOAH-MP itself when this LSM is selected.
Incorrect aerodynamic roughness length parameters, as mentioned previously, may also play a large role in determining UV-10m, since this parameter is used within the surface layer scheme. Nelli et al. (2020a) found positive wind speed biases of < 4 m s⁻¹ and negative biases for wind speeds > 6 m s⁻¹ within a WRF V3.8 simulation. We see similar behaviour at night in the marine and desert regions, as exhibited by the positive-to-negative distribution of errors with increasing wind speed. Nelli et al. (2020a) reduced these biases by retuning the roughness length parameter based on eddy covariance measurements (Nelli et al., 2020b).
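The sensitivity of a 10 m wind diagnostic to the aerodynamic roughness length can be illustrated with the neutral logarithmic wind profile. The following is a minimal sketch under stated assumptions, not the WRF surface layer code: stability corrections are ignored, and the function name, the 30 m reference height, the 6 m s⁻¹ reference wind and both roughness lengths are hypothetical values chosen only for illustration.

```python
import math


def neutral_wind_at_height(u_ref, z_ref, z_target, z0):
    """Scale a wind speed from z_ref to z_target using the neutral log law.

    Assumes neutral stratification: u(z) is proportional to ln(z / z0).
    All heights and the roughness length z0 are in metres.
    """
    if z0 <= 0 or z_ref <= z0 or z_target <= z0:
        raise ValueError("z0 must be positive and smaller than both heights")
    return u_ref * math.log(z_target / z0) / math.log(z_ref / z0)


# Hypothetical inputs: the same wind at a 30 m reference level,
# diagnosed at 10 m for a smoother and a rougher surface.
U_REF, Z_REF = 6.0, 30.0
u10_smooth = neutral_wind_at_height(U_REF, Z_REF, 10.0, z0=0.01)
u10_rough = neutral_wind_at_height(U_REF, Z_REF, 10.0, z0=0.10)

print(f"z0 = 0.01 m: U10 = {u10_smooth:.2f} m/s")
print(f"z0 = 0.10 m: U10 = {u10_rough:.2f} m/s")
```

For a fixed wind aloft, the larger roughness length yields the lower 10 m wind in this idealised setting, so a roughness length that is too small would by itself push the diagnosed 10 m wind upwards.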
Another possible factor is the forecast spinup, the required length of which remains uncertain. We have already mentioned that Chaouch et al. (2017) cited a 5-h spinup as being sufficient, but Hahmann et al. (2015) posit that the necessary spinup over land could be 12 hours or even more (primarily for effective use of the PBL scheme). A longer daily spinup may be preferable, but is likely to be very expensive and perhaps too time-consuming for forecasting purposes.
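The value of splitting verification statistics into day and night, as done throughout this discussion, can be sketched with a few lines of Python. This is an illustrative example only and not the MET-based workflow used for the study: the records are synthetic, the fixed 06-18 LT daytime window is an assumption, and all function and variable names are hypothetical.

```python
from math import sqrt
from statistics import mean


def mean_bias(pairs):
    """Mean error, defined as forecast minus observation."""
    return mean(f - o for f, o in pairs)


def rmse(pairs):
    """Root-mean-square error of forecast/observation pairs."""
    return sqrt(mean((f - o) ** 2 for f, o in pairs))


def is_daytime(hour_lt, sunrise=6, sunset=18):
    """Crude day/night flag from local hour (assumed fixed window)."""
    return sunrise <= hour_lt < sunset


def stratify(records, key):
    """Group (forecast, observation) pairs by the label returned by key."""
    groups = {}
    for rec in records:
        groups.setdefault(key(rec), []).append((rec["fcst"], rec["obs"]))
    return groups


# Synthetic T-2m observations (deg C) with an imposed warm bias by day
# and a cold bias by night; these are not values from the study.
obs_by_hour = {0: 24.0, 3: 22.0, 6: 23.0, 9: 30.0,
               12: 36.0, 15: 37.0, 18: 32.0, 21: 27.0}
records = []
for hour, obs in obs_by_hour.items():
    offset = 1.5 if is_daytime(hour) else -2.5
    records.append({"hour": hour, "obs": obs, "fcst": obs + offset})

all_pairs = [(r["fcst"], r["obs"]) for r in records]
groups = stratify(records, key=lambda r: "day" if is_daytime(r["hour"]) else "night")

overall_bias = mean_bias(all_pairs)
day_bias = mean_bias(groups["day"])
night_bias = mean_bias(groups["night"])
overall_rmse = rmse(all_pairs)

print(f"overall bias {overall_bias:+.2f}, RMSE {overall_rmse:.2f}")
print(f"day bias {day_bias:+.2f}, night bias {night_bias:+.2f}")
```

In this synthetic case the whole-day bias is much smaller in magnitude than either the daytime or the nocturnal bias, because the two partly cancel. The same grouping function could be reused with a region or season label as the key.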

Summary and Outlook
The aim of this study was to (i) assess the skill of WRF-NOAHMP in reproducing surface quantities over the UAE, (ii) identify regional, seasonal, and diurnal differences in performance and (iii) estimate potential sources of model deficiencies. We have demonstrated the value of splitting the model evaluation temporally and spatially. While assessment of diagnostics for the whole UAE region remains useful, it can obscure regional, diurnal and seasonal differences and also compensating biases, all of which are scientifically interesting and, importantly, may reveal information on model behaviour with respect to specific processes and land surface types, and how they are simulated.
An analysis of model predictions has revealed that WRF-NOAHMP represents the mean T-2m field reasonably well during the daytime, although with a tendency for slight overestimation (≤1 ˚C). The nocturnal T-2m is underestimated more strongly (1-4 ˚C), with larger biases during the hotter months, particularly in the desert and mountains, likely due to a combination of deficiencies. The marine region has the lowest T-2m biases, which is encouraging and highlights the value of ingesting quality SST data, especially in coastal regions. WRF shows good performance regarding TD-2m in general, with mean biases being ≤1 ˚C. Humidity over the marine region tends to be slightly overestimated though, whilst nocturnal mountain TD-2m is underestimated (bias ~ -2 ˚C). UV-10m performance on land still needs to be improved, with biases of 1-2 m s⁻¹. Furthermore, performance for UV-10m tends to worsen during the hot months, particularly inland. UV-10m in the marine region is generally much better simulated than in the other regions (bias ≤1 m s⁻¹). There is an apparent relationship between T-2m bias and UV-10m bias, and this could be due to deficiencies in sea-land breeze simulation. TD-2m biases appear to be more independent. The only exception to this is during the night, when T-2m and TD-2m biases do appear linked.
Ultimately, no model downscaling forecast (at scales economically viable for forecasting) can be expected to exhibit exceptional skill in all conditions, but we have discussed several avenues for improvement on this application of WRF. For instance, we should continue to devise and ingest new and improved datasets for land use, terrain, soil texture, and albedo. In particular, within a sparsely vegetated region like the UAE, soil texture, moisture and other parameters are likely to be of prime importance. Certainly, ingesting SST data appears to have been valuable, given the lower coastal biases in all variables. We have mentioned several very useful experiments carried out on parameters like aerodynamic and thermal roughness lengths (Nelli et al., 2020a; Weston et al., 2018), and also process-based observational studies related to the surface energy balance, and verification studies (Fonseca et al., 2020; Nelli et al., 2020b). Further experiments should now be coordinated in order to improve model predictions further. In terms of parameterization schemes, ensemble experiments (in the manner of Chaouch et al., 2017; Milovac et al., 2016; Schwitalla et al., 2020) are still required to identify optimal land surface/surface layer/PBL/microphysics combinations for arid regions. Such studies can also address the tunable parameters defined inside parameterization schemes, similarly to those conducted by Quan et al. (2016) and Yang et al. (2017). The most relevant ones can then be measured during dedicated field campaigns and subsequently ingested in the model.
Seasonal scale studies such as these are vital for accurate assessment of WRF nowcasting performance and to identify model deficiencies and areas for improvement. By combining seasonal verification with sensitivity tests, and process and observational studies, we will move towards improved forecasting systems for the UAE, and other arid regions.

Author contributions
O. Branch, the first author, conceived the experiment, carried out the simulations and analysis, and wrote the publication. T. Schwitalla contributed greatly with scientific support and co-writing of the manuscript, provided much technical assistance, and formatted the observation data for use in the MET software. Marouane Temimi, Ricardo Fonseca, Narendra Nelli, Michael Weston, and Volker Wulfmeyer provided specialist scientific support and assisted with the drafting and improvement of key aspects of the manuscript.

Conflicts of interest
The authors declare that they have no conflict of interest.

Acknowledgements.
This material is based on work supported by the UAE Research Program for Rain Enhancement Science, under the National Center of Meteorology, Abu Dhabi, UAE. Furthermore, we are grateful to the High Performance Computing Center Stuttgart (HLRS) for providing support and computing time on the XC40 system. We are also grateful to ECMWF for providing operational analysis data.

Bias (centre) and RMSE (right). On the box plots the centre line represents the mean, the white circle is the median, box ends represent the 25th and 75th percentiles and the whiskers are the 5th and 95th percentiles. Also marked is a horizontal zero reference line for the Pearson and Bias statistics.
https://doi.org/10.5194/gmd-2020-201 Preprint. Discussion started: 1 September 2020. © Author(s) 2020. CC BY 4.0 License.

Tables

Table 1:

Table 5: Seasonal and regional differences in observed T-2m and TD-2m means to show the closeness to saturation. Included are the number of data points. Note that this is not a mean of the T-2m/TD-2m differences calculated at each time step, but an overall difference in means.