Calculating the turbulent fluxes in the atmospheric surface layer with neural networks

Abstract. The turbulent fluxes of momentum, heat and water vapour link the Earth's surface with the atmosphere. Correct modelling of the flux interactions between these two systems with very different time scales is therefore vital for climate and Earth system models. Conventionally, these fluxes are modelled using Monin-Obukhov similarity theory (MOST) with stability functions derived from a small number of field experiments. This results in a range of formulations of these functions and thus also in the flux calculations; furthermore, the underlying equations are non-linear and have to be solved iteratively at each time step of the model. For these reasons, we tried a different approach here, namely using an artificial neural network (ANN) to calculate the fluxes, or rather the scaling quantities u* and θ*, thus avoiding explicit formulas for the stability functions. The network was trained and validated with multi-year datasets from seven grassland, forest and wetland sites worldwide using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton backpropagation algorithm and six-fold cross-validation. Extensive sensitivity tests showed that an ANN with six input variables and one hidden layer gave results comparable to (and in some cases even slightly better than) the standard method. Similarly satisfying results were obtained when the ANN routine was implemented in a one-dimensional stand-alone land surface model (LSM), opening the way to implementation in three-dimensional climate models. In the case of the one-dimensional LSM, no CPU time was saved when using the ANN version, since the small time step of the standard version required only one iteration in most cases. This could be different in models with longer time steps, e.g. global climate models.

involved in the training and validation process. Comrie (1997) finds that neural networks are "somewhat, but not overwhelmingly" better than regression models; best performance was obtained with an ANN incorporating lagged data. Gomez-Sanchis et al. (2006) use a multi-layer perceptron (MLP) to predict ozone concentrations near Valencia based on meteorological and traffic information. Different model architectures were tested and good agreement with observations was found; however, different years required different model architectures. Elkamel et al. (2001) used an ANN with one hidden layer and meteorological and precursor concentrations to predict ozone levels in Kuwait. They found that the ANN gave consistently better predictions than linear and nonlinear regression models. Kolehmainen et al. (2001) compare the ability of self-organising maps and MLPs to predict NO2 concentrations when combined with different methods to preprocess the data. They find that direct application of the MLP gave the best results. In all these studies, just one hidden layer was sufficient, and it was pointed out that careful selection of the input data was crucial. An interesting application of a neural network's ability to learn is Knutti et al. (2003): they taught a neural network to simulate certain output variables of a global climate model and used the result to establish probability density functions and to enlarge a global climate model ensemble considerably. Gentine et al. (2018) use an ANN to parameterise the effects of subgrid-scale convection in a global climate model. The ANN learns the combined effects of turbulence, radiation and cloud microphysics from a convection-resolving submodel. They find that the ANN can predict many of these processes skillfully and reduce inherent variance.
This paper is structured as follows: in Chapter 2, we give a short overview of Monin-Obukhov similarity theory and artificial neural networks, introduce cross-validation, present the data used (including important quality checks) and describe our strategy to find the best network. Thereafter, trained ANNs (which are in fact MLPs, but we will stick to the generic name ANN here) are validated and results are discussed (Chapter 3). In Chapter 4, the best-performing ANN is implemented in a one-dimensional land surface model (LSM) and results are compared with those of the standard version. A summary is given in Chapter 5.
2 Methods and data

2.1 Monin-Obukhov similarity theory (MOST)

The turbulent fluxes of momentum, heat, water vapour and trace gases between the Earth's (land and water) surface and the atmosphere are usually calculated on the basis of Monin-Obukhov similarity theory (MOST, Monin and Obukhov (1954)). We give here a very short survey of MOST, focussing on the momentum and heat fluxes; details can be found in Arya (2001). Assuming homogeneous terrain, quasi-stationary (i.e. fair weather) conditions and small terrain roughness, MOST postulates that turbulence in the surface (also called Prandtl or constant flux) layer depends only on four quantities: the height above ground or canopy z, a velocity scale u*, a temperature scale θ*, and a buoyancy term g/θ, where g is the gravitational acceleration and θ denotes the potential temperature. According to the Buckingham Pi theorem, these four quantities based on length, time and temperature can be combined to a single non-dimensional quantity ζ = z/L, where L = u*² θ / (κ g θ*) is the Obukhov length and κ ≈ 0.40 is the von Kármán constant; other dimensionless quantities like the dimensionless wind and temperature gradients can be expressed as functions of ζ. The Obukhov length L measures the stratification of the surface layer: large (positive or negative) values of L (i.e. ζ ≈ 0) indicate neutral stratification, positive values indicate stable stratification, and negative values indicate unstable stratification. Since the momentum flux is expressed as τ = ρ u*² and the heat flux as H = −ρ c_p u* θ* (ρ is the air density, c_p the heat capacity of air), our goal is to determine u* and θ* from known quantities, which are in our case modelled or observed wind and temperature gradients in the surface layer.
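As an illustration, the Obukhov length and the stability parameter ζ follow directly from these definitions. The following minimal Python sketch (function and variable names are ours, not from the paper) makes the sign convention explicit:

```python
import numpy as np

KAPPA = 0.40  # von Kármán constant
G = 9.81      # gravitational acceleration (m s^-2)

def obukhov_length(u_star, theta_star, theta):
    """Obukhov length L = u*^2 * theta / (kappa * g * theta*)."""
    return u_star**2 * theta / (KAPPA * G * theta_star)

def stability_parameter(z, u_star, theta_star, theta):
    """Dimensionless stability parameter zeta = z / L.
    Negative theta* (upward heat flux) gives zeta < 0, i.e. unstable."""
    return z / obukhov_length(u_star, theta_star, theta)
```

For example, u* = 0.3 m/s, θ* = −0.1 K and θ = 290 K at z = 10 m give L ≈ −66.5 m and ζ ≈ −0.15, i.e. moderately unstable stratification.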
The non-dimensional wind shear φ_m and the non-dimensional gradient of the potential temperature φ_h (also called stability functions) can be written as

φ_m(ζ) = (κ z / u*) ∂u/∂z ,   φ_h(ζ) = (κ z / θ*) ∂θ/∂z ,   (1)

respectively, where u is the mean wind speed at height z. The "universal" functions φ_m and φ_h can be obtained from simultaneously measured values of the wind and temperature gradients and the momentum and heat fluxes (providing u* and θ*). Data from field experiments have been used to derive these universal functions, notably the Kansas experiment in 1968 by Businger et al. (1971). Generally, the stability functions thus obtained have the form

φ_{m,h}(ζ) = a (1 + b ζ)^c ,

with the coefficients a, b and c depending on whether ζ > 0 or ζ ≤ 0. An overview of these functions can be found in Högström (1988); it is shown there that there is considerable scatter in the data (especially under very stable and very unstable conditions) and, as a result, also in the derived universal functions. In applications, differences rather than gradients are known. Integrating the functions (1) between a reference height z_r and z yields

u(z) − u(z_r) = (u*/κ) [ln(z/z_r) − ψ_m(z/L) + ψ_m(z_r/L)] ,
θ(z) − θ(z_r) = (θ*/κ) [ln(z/z_r) − ψ_h(z/L) + ψ_h(z_r/L)] ,   (2)

where

ψ_{m,h}(ζ) = ∫_0^ζ [1 − φ_{m,h}(ζ′)] / ζ′ dζ′ .

For the purpose of climate modelling, i.e. obtaining fluxes from simulated wind and temperature profiles, u* and θ* need to be derived from wind and temperature data at two heights and eqns. (1) or (2). Since ζ itself depends on u* and θ*, this amounts to solving a system of two nonlinear equations; we will call this traditional method the MOST method.
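The iterative character of the MOST method can be sketched as a fixed-point iteration: start from neutral conditions (ζ = 0), solve the integrated profile equations for u* and θ*, recompute L and ζ, and repeat until ζ converges. The sketch below uses Businger-Dyer-type ψ functions with illustrative coefficients (−5ζ for stable, a 16ζ term for unstable); these coefficients are our assumption for the example, not the specific fits used in the paper.

```python
import numpy as np

KAPPA, G = 0.40, 9.81

def psi_m(zeta):
    """Integrated stability correction for momentum (illustrative coefficients)."""
    if zeta >= 0:                               # stable
        return -5.0 * zeta
    x = (1.0 - 16.0 * zeta) ** 0.25             # unstable
    return (2.0 * np.log((1 + x) / 2) + np.log((1 + x**2) / 2)
            - 2.0 * np.arctan(x) + np.pi / 2)

def psi_h(zeta):
    """Integrated stability correction for heat (illustrative coefficients)."""
    if zeta >= 0:
        return -5.0 * zeta
    x = (1.0 - 16.0 * zeta) ** 0.25
    return 2.0 * np.log((1 + x**2) / 2)

def most_scales(du, dtheta, z, z_r, theta, n_iter=50, tol=1e-6):
    """Fixed-point iteration for u* and theta* from the wind and potential
    temperature differences du, dtheta between heights z_r and z."""
    log_ratio = np.log(z / z_r)
    zeta, zeta_r = 0.0, 0.0                     # start from neutral stratification
    for _ in range(n_iter):
        u_star = KAPPA * du / (log_ratio - psi_m(zeta) + psi_m(zeta_r))
        theta_star = KAPPA * dtheta / (log_ratio - psi_h(zeta) + psi_h(zeta_r))
        L = u_star**2 * theta / (KAPPA * G * theta_star)
        zeta_new, zeta_r = z / L, z_r / L
        if abs(zeta_new - zeta) < tol:          # zeta converged
            break
        zeta = zeta_new
    return u_star, theta_star
```

It is exactly this per-time-step iteration that a trained network is meant to replace by a single forward pass.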
By employing neural networks in the context of MOST, we pursue two goals: first, we want to avoid the somewhat inflexible and uncertain approach of obtaining the scaling quantities u* and θ* from profiles via stability functions; instead, we want a neural network to learn this relationship, based on larger and newer datasets. Second, once a neural network has learned the relationships, the iterative calculation of the scaling quantities during a simulation can be replaced by such a trained network, possibly saving computing time.

2.2 Neural networks
We describe here only those aspects of neural networks which are relevant to our study; for more information on neural networks, the reader is referred to the literature, e.g. Rojas (2013) and Kruse et al. (2016). Neural networks, or more precisely artificial neural networks (ANN), are a widely used technique to solve classification and regression problems as well as to analyse time series (Zhang, 2008). The building blocks of an ANN are the so-called neurons, which are connected with each other within a network by weights via input and output nodes. A neuron processes input data as follows:

o_j = f( Σ_{k=1}^{N} w_{i_k j} o_{i_k} ) ,

where o_j is the output of neuron j, N is the number of inputs, o_{i_k} is the i_k-th input and w_{i_k j} is the corresponding weight. A network becomes non-linear by using non-linear activation functions f. Each neuron belongs to a unique layer in a directed graph. Here, we use so-called multi-layer perceptrons (MLP), also known as feed-forward networks due to their unidirectional information flow. Each MLP consists of an input, an output and at least one hidden layer with an arbitrary number of neurons.
The input layer takes the (normalised) input data and the output layer returns the (also normalised) results of the MLP. Normalisation is essential for equal weighting of the inputs and for consistency with the activation functions. Input information is propagated from layer to layer while each neuron responds to the signal. Bias neurons are used to adjust the activation level.
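The weighted sum and activation of a single neuron, including a bias term, can be written in a few lines (a minimal sketch; the function name is ours):

```python
import numpy as np

def neuron_output(inputs, weights, bias, f=np.tanh):
    """Output o_j = f(sum_k w_{kj} * o_k + b_j) of a single neuron.
    'f' is the activation function; tanh is used in our hidden layers."""
    return f(np.dot(weights, inputs) + bias)
```

With zero inputs and zero bias this returns tanh(0) = 0; for a linear output neuron one would pass f=lambda s: s.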
All free parameters (i.e. weights) of an MLP need to be determined by a training process. In the case of supervised learning, the MLP knows its deviation from the target values at all times, and an error can be calculated from this deviation (Zhang, 2008).
The aim of the training is to minimise an error metric by adjusting the network's weights. Here we use the mean squared error

MSE = 1/(P · N_Ω) Σ_{p=1}^{P} Σ_{j=1}^{N_Ω} (t_{j,p} − o_{j,p})² ,

where P is the total number of data points, N_Ω is the number of neurons in the output layer, t_{j,p} is the target value of data point p and o_{j,p} is the output of the MLP for data point p. In the study described here, we use an MLP with the hyperbolic tangent as activation function in the hidden layer(s) (here one or two) and linear functions in the output layer, trained by the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton backpropagation algorithm (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970).
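The error metric above reduces to a mean over all data points and output neurons; a minimal NumPy sketch (not the actual training code) is:

```python
import numpy as np

def mse(outputs, targets):
    """Mean squared error averaged over all data points and output neurons:
    MSE = 1/(P * N_Omega) * sum_p sum_j (t_{j,p} - o_{j,p})^2."""
    outputs, targets = np.asarray(outputs), np.asarray(targets)
    return np.mean((targets - outputs) ** 2)
```

For a two-neuron output layer (u* and θ*), outputs and targets are arrays of shape (P, 2), and the "overall" MSE used for training is exactly this mean over both output components.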
Neural networks are very flexible in terms of the number of layers, number of nodes, error metrics, training method, activation function etc.; thus, a series of sensitivity runs was performed, which always consisted of a training and a testing (validation) phase. To find the best network architecture, we varied the following parameters:

- The number and type of input variables (see Chapter 2.5)

- The number of hidden layers (one or two)
- The total number of nodes in the hidden layer(s) (between 1 and 14)

To reduce the number of sensitivity runs, the parameters listed in Table 1 were kept fixed, based on recommendations in the literature (Zhang, 2008; Kruse et al., 2016).

2.3 Cross-validation
Trained networks were validated using k-fold cross-validation (Kohavi, 1995; Andersen and Martinez, 1999) to prevent overfitting (Domingos, 2012). Overfitting originates from the trade-off between minimising the error on given data and maximising the performance on new, unknown data (Chicco, 2017). To test the network's ability to generalise, the full data set is first divided into k = 6 subsets of approximately equal size by a random data split. Cyclically, one subset is kept for independent testing, while the remaining k − 1 subsets are used for training and validation. With this test, we can show that ANNs are able to learn from the data and to represent their characteristics. Later on, we will go one step further and check if the ANNs found are able to transfer their knowledge not only to unknown data but also to completely new stations not used previously. We decided to validate trained models with the station NL-Cab and to test the best ANNs finally on the station DE-Keh (see details on the stations in Chapter 2.4). For these stations, the traditional MOST method performed best; thus, they presented a strong challenge for the ANNs to achieve similar quality.

2.4 Data

As ANN input, we used temperatures and wind speeds at two measurement heights, and for validation the momentum and sensible heat fluxes to calculate the scaling quantities u* and θ*. If not available, the density was calculated from the ideal gas equation, using the virtual temperature in case humidity data were available and the temperature of dry air otherwise. For forests, all observations had to be above the canopy. The original temporal resolution of the data was either 10 minutes or 30 minutes; these were aggregated to 1-hour averages.
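The cyclic six-fold splitting described in Chapter 2.3 can be sketched as follows (a minimal NumPy version; the fixed random seed is our assumption for reproducibility):

```python
import numpy as np

def k_fold_indices(n_samples, k=6, seed=0):
    """Randomly split sample indices into k folds of approximately equal size.
    Cyclically, one fold is held out for testing while the remaining
    k - 1 folds are used for training and validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)      # random data split
    folds = np.array_split(perm, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

A station-wise split (discussed later) would instead assign all samples of one station to the same fold rather than permuting individual samples.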
An important step before using data as input for the ANN was to check whether the data were compatible with Monin-Obukhov theory, i.e. whether an (at least approximate) functional relationship between ζ and the right-hand sides of (1) could be seen and, if so, how well it was represented by the universal stability functions (1). It turned out that for some sites, no such relationship existed.
Reasons for this could be violations of the assumptions of Monin-Obukhov theory, such as inhomogeneous terrain around the site or a wind-direction dependence of the roughness length. Data from these sites were not used further. The remaining stations (see Table 2), with about 390000 data points in total, were used to train and validate the networks. For these, the agreement generally was better for temperature than for wind; also, the agreement was better for unstable than for stable stratification, which is often mentioned in the literature.
The data were preprocessed as follows before they were presented to the ANN. Input and output data were normalised according to their extrema to the interval [0, 1] (see Table 1). Furthermore, weak wind situations with wind speeds below 0.3 m s−1 were filtered out. Because of the large scatter of the wind and temperature gradients under atmospheric conditions with absolute heat fluxes below 10 W m−2 or small scaling wind speeds (u* < 0.1 m s−1), such data were excluded. Finally, the signs of the temperature scale θ* and of the potential temperature gradient had to be the same, thus excluding counter-gradient fluxes, which can be observed over forest (Denmead and Bradley, 1985) and ice (Sodemann and Foken, 2005) but violate the assumptions of MOST (Foken, 2017a, b).
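The screening and normalisation steps just described can be condensed into two small functions (a sketch with the thresholds as stated in the text; function names are ours):

```python
import numpy as np

def quality_mask(wind, u_star, H, theta_star, dtheta_dz):
    """Boolean mask implementing the data screening: drop weak winds
    (< 0.3 m/s), small fluxes (|H| < 10 W/m^2 or u* < 0.1 m/s), and
    counter-gradient cases where theta* and the potential temperature
    gradient have opposite signs."""
    ok = (wind >= 0.3) & (np.abs(H) >= 10.0) & (u_star >= 0.1)
    ok &= np.sign(theta_star) == np.sign(dtheta_dz)
    return ok

def normalise(x, x_min, x_max):
    """Normalise to the interval [0, 1] according to the extrema."""
    return (x - x_min) / (x_max - x_min)
```

Applying the mask before normalisation ensures that the extrema used for scaling are not dominated by the rejected low-flux and weak-wind samples.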

2.5 ANN setup and selection of the best ANN
We tested network architectures with six- and seven-element input vectors. The six-element input vector consisted of the wind velocity and potential temperature averages over the two heights, the vertical gradients of wind and potential temperature and their ratio, and a classifier to distinguish between low (c_veg = 0) and tall (c_veg = 1) vegetation. For the seven-element input vector, we replaced the temperature gradient by its absolute value and added an additional input node describing the sign of the potential temperature gradient. The target vector remained in both cases a two-element vector consisting of the wind scale u* and the temperature scale θ*.
We experimented with ANNs having one and two hidden layers. For the ANNs with one hidden layer, we varied the number of neurons in the hidden layer from one to twice the size of the input layer. For ANNs with two hidden layers, the number of neurons in each layer was increased up to the number of input neurons.
As mentioned earlier, all networks were trained to minimise the overall (sum of u* and θ*) MSE on normalised data; as quality metrics we use this MSE and Pearson's correlation coefficient

r_j = Σ_p (y_{j,p} − ȳ_j)(t_{j,p} − t̄_j) / sqrt( Σ_p (y_{j,p} − ȳ_j)² · Σ_p (t_{j,p} − t̄_j)² ) ,

where ȳ_j and t̄_j are the averages of the j-th net output and target value, with ȳ_j = (1/|P|) Σ_p y_{j,p} and t̄_j = (1/|P|) Σ_p t_{j,p}. When ANNs are to be used in climate models, one has to find a trade-off between two aspects: on the one hand, the model should perform well according to the quality metrics described above; on the other hand, a superior model in terms of small errors but with higher computational demands may not be the best choice for use in climate models, where saving computing time is a very high-priority criterion. For ANNs, computing time normally increases with the complexity of a network, i.e. with its size. We therefore also tested ANNs with smaller-than-optimal numbers of neurons in view of this trade-off. To find smaller networks requiring possibly less computing time, we looked at networks that meet the requirement that the size of each hidden layer n_hi is less than or equal to the size of the input layer n_I minus 1.
This condition was found after some experimenting and is somewhat arbitrary, but there is no hard rule defining the simplicity of a model. Here, we call ANNs that satisfy this condition simple networks.
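Both the per-output correlation metric and the simplicity condition are straightforward to state in code (a minimal sketch; function names are ours):

```python
import numpy as np

def pearson_r(outputs, targets):
    """Pearson correlation coefficient between the outputs of one output
    neuron and the corresponding target values."""
    o, t = np.asarray(outputs), np.asarray(targets)
    do, dt = o - o.mean(), t - t.mean()
    return np.sum(do * dt) / np.sqrt(np.sum(do**2) * np.sum(dt**2))

def is_simple(hidden_sizes, n_inputs):
    """A 'simple' network: every hidden layer has at most n_I - 1 neurons."""
    return all(n_h <= n_inputs - 1 for n_h in hidden_sizes)
```

Under this definition, a 6-3-2 network (hidden layer of 3, six inputs) is simple, whereas a 6-6-2 network is not.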

3 Results
As described in Chapter 2.3, ANNs are always trained on the training data set only and validated on a disjoint validation data set. If the MSE on the validation set rises continuously, training is stopped to prevent overfitting (early stopping). After this training and validation stage, the ability of the ANNs thus found to generalise is tested on data completely new to the ANNs. All in all, more than 100000 nets were trained and tested this way.

3.1 Effect of data splitting
The validation results from ANNs with six inputs and a single hidden layer, trained under six-fold cross-validation with random data splitting, are shown in the box-and-whiskers plot in Figure 3 as a function of the number of hidden neurons.
One can see that the validation MSE decreases with increasing number of hidden neurons and reaches an asymptotic value of about 0.008 already with 6 to 7 neurons. Furthermore, the scatter of MSE is quite small, meaning that results of ANNs trained on different sets vary only slightly.
If the training data are not split randomly but station-wise, a larger MSE and a considerably larger scatter of the MSE result. Comparing Figure 4 with Figure 3 shows that the MSE roughly doubles, whereas the scatter increases by about a factor of ten, almost independent of the network architecture. On the other hand, increasing the network size does not necessarily imply smaller error minima. Using two hidden layers slightly reduces the median and the error minimum, but also increases the MSE spread. Comparison of Figure 3 with Figure 4 also shows that the station-wise error minima are comparable to the ones obtained from the random data split. In both types of validation, ANNs with one and two hidden layers are not significantly different. These results show that using the station-wise split reduces the network performance substantially. This implies that using too few stations together with station-wise training impairs the portability of the learned relationships between inputs and target values. Among the reasons for this could be the tendency of the ANNs to overfit the training data by memorising relationships, and local effects contaminating the validity of MOST, such as non-ideal positioning of sites or non-ideal atmospheric conditions. These findings support the need for independent testing with data yet unknown to the ANN in order to estimate the ANN's real ability to transfer knowledge. This will be discussed in the next section.

3.2 Transferability to unknown data
Having shown that ANNs are able to extract u* and θ* from training data successfully, our next step is to assess how the ANNs found in the previous section handle input from stations which were used neither for training nor for validation, i.e. data completely unknown to them; this simulates the situation where ANNs are used in climate models (where grid points play the role of stations). To test this, we chose the station NL-Cab for validation and DE-Keh as the unknown station.
These two stations were selected because the standard MOST method performed best for them; they therefore pose a strong challenge for the ANNs to produce equivalent results. The results of the nets performing best on the validation set are summarised in Table 3, where the ANNs are compared according to the increasing complexity of their net architecture. For comparison, and in view of reducing computation time, we also show in this table the results of the best simple networks (as defined in Chapter 2.5). Table 3 shows that in terms of MSE and correlation coefficient r, all ANNs perform better than the traditional MOST method on the validation data set (NL-Cab). Applying these ANNs to the test data set DE-Keh results in an increased MSE and a lower correlation coefficient, whereas the traditional MOST method performs better on the test data set. The best test performance of the ANNs in terms of MSE is reached by the 6-5-3-2 ANN with 0.68 · 10−2, but the simpler 6-3-2 ANN is second best in terms of MSE; it is interesting to see that simple nets can be almost as good as larger nets. ANNs with two hidden layers perform slightly better on the test data than ANNs with a single hidden layer. The overall correlation between network outputs and target values is in all cases quite high (r ≥ 0.85).
A final comparison is done for the turbulent momentum and heat fluxes τ = ρ u*² and H = −ρ c_p u* θ*, which are the quantities ultimately needed in climate simulations. Results for the momentum and heat fluxes of three well-performing networks as well as for the standard MOST method are shown in Figures 5 and 6 and in Tables 4 and 5, respectively. Both the ANNs and the standard method tend to underestimate larger momentum fluxes, but the differences among the ANNs are quite small. The best agreement is achieved with the 6-5-3-2 ANN, which is almost as good as the standard method; the best ANNs and the traditional method are quite similar regarding the momentum flux.
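The conversion from the network outputs u* and θ* to the fluxes actually needed by the model is a one-liner (a sketch; the value of c_p is the standard one for dry air and is our assumption here):

```python
def fluxes(rho, u_star, theta_star, c_p=1005.0):
    """Momentum flux tau = rho * u*^2 and sensible heat flux
    H = -rho * c_p * u* * theta*.
    rho: air density (kg m^-3), c_p: specific heat of air (J kg^-1 K^-1)."""
    tau = rho * u_star**2
    H = -rho * c_p * u_star * theta_star
    return tau, H
```

Note the sign convention: a negative θ* (unstable stratification) yields a positive, i.e. upward, sensible heat flux.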
Also for the heat flux, the differences between the ANNs are relatively small. The 7-5-2-2 ANN, however, behaves differently than the other ANNs: it works more like a discrete classifier of stability than as a continuous regression: there are two states, one around −30 W m−2 and the other from 50 W m−2 to 200 W m−2. As a result, r is reduced, but the MAE is lowest for this 7-5-2-2 ANN. These results show again that smaller nets can be as good as or even better than larger ones.
A comparison of the computation time required by the different ANNs, relative to the 6-3-2 ANN, is shown in Table 6. The table shows that the increase in computational demand is approximately proportional to the number of weights (as could be expected), and therefore increases considerably when two-layer networks are used. As the discussion above shows, these costs are not reflected in a markedly higher quality of the results.
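The proportionality to the number of weights is easy to make concrete. The following sketch counts the weights of a fully connected MLP from its layer sizes (assuming one bias neuron per layer feeding the next, which adds one weight per receiving neuron):

```python
def n_weights(layers, bias=True):
    """Number of trainable weights of a fully connected MLP given its layer
    sizes, e.g. (6, 3, 2) for the 6-3-2 network; with bias neurons, each
    receiving neuron gets one additional weight."""
    extra = 1 if bias else 0
    return sum((n_in + extra) * n_out
               for n_in, n_out in zip(layers[:-1], layers[1:]))
```

Under this count, the 6-3-2 network has 29 weights while the two-hidden-layer 6-5-3-2 network has 61, roughly consistent with the observed doubling of computational cost.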

4 Implementation of an ANN in a land surface model
As already mentioned, our goal is to replace the standard MOST method for calculating the fluxes by an ANN in the land surface component of climate models, expecting more flexibility and accuracy and possibly a saving of CPU time. The results presented in the previous section indicate that from the accuracy as well as the computational efficiency point of view, the 6-3-2 ANN seems to be the most suitable for implementation into a land surface model (LSM). This ANN is shown in Figure 7.
The 6-3-2 ANN with the weights found in the previous sections was implemented in a stand-alone version of the one-dimensional LSM Veg3d (Braun and Schädler, 2005), where it replaced the implemented routine using the standard MOST method to calculate the scaling quantities u* and θ*; we will call this version the reference version. The input data for the ANN and the data normalisation were the same as described in Chapter 2.4, and the output was analogously de-normalised. Since the LSM requires, apart from the momentum and heat fluxes, also the moisture flux, the scaling specific humidity q* was calculated as proportional to θ*, following the standard procedure in boundary layer meteorology (Arya, 2001). Meteorological input for the LSM were 30-minute values of short- and long-wave radiation, wind speed, temperature, specific humidity and air pressure at two heights; additionally, soil type and land use were prescribed. For comparison with observations, time series of heat and moisture fluxes as well as of soil temperature and soil moisture in the upper soil layers were available. Due to the availability of soil temperature and soil moisture in the top soil layer, the effect of the ANN on the soil component could also be assessed.
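The routine that replaces the MOST iteration is just the forward pass of the 6-3-2 network followed by de-normalisation. The sketch below shows its structure; the weight matrices, biases and output extrema are placeholders standing in for the trained and fitted values, not the actual ones:

```python
import numpy as np

def ann_6_3_2(x, W1, b1, W2, b2, out_min, out_max):
    """Forward pass of the 6-3-2 MLP: tanh activation in the hidden layer,
    linear output layer, then de-normalisation of the two outputs
    (u* and theta*) from [0, 1] back to physical units.
    x: normalised 6-element input vector; W1: (3, 6); W2: (2, 3)."""
    h = np.tanh(W1 @ x + b1)                    # hidden layer, 3 neurons
    y = W2 @ h + b2                             # linear output layer, 2 neurons
    return out_min + y * (out_max - out_min)    # de-normalise
```

Unlike the MOST routine, this evaluation involves no iteration; its cost is fixed and independent of the stratification.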
The comparison was performed with data from the stations DE-Fal (grassland, year 1988) and DE-Tha (evergreen needleleaf forest, year 2011) for years which had been used neither for training nor for validation. We compared the RMSE and the correlation coefficient of the calculated values with respect to the observed ones for the reference version and the ANN version. Additionally, we compared the required CPU times. The results of the comparison are shown in Tables 7 and 8.
Especially for grassland, the results of the standard version are very good in terms of RMSE and correlation coefficients, and it is difficult for the ANN version to outperform them. However, the results show that the ANN version is able to produce results of similar quality to the standard version for the fluxes as well as for soil temperature and soil moisture. For tall vegetation, the RMSEs are larger and the correlations are lower, but the differences between the ANN version and the reference version are even smaller than for grassland, and for soil moisture the ANN version even outperforms the standard version. In terms of fluxes, the standard version is generally slightly better. Regarding CPU time, there are only minor differences, although we had expected the ANN version to be faster. However, due to the small prognostic time step used, the standard version, once initialised, does in most cases not need more than one iteration to find a solution to the nonlinear equation and to update the scaling quantities, so that the cost of the expensive iteration is reduced considerably. In summary, as a result of this first comparison, one can say that the ANN version works as well as the reference version.

5 Summary
We have used an ANN (more precisely, an MLP) to obtain the scaling quantities u* and θ* as defined in MOST; these are used in weather and climate models to calculate the turbulent fluxes of heat and momentum in the atmospheric surface layer. To train, validate and test the neural network, a large set of worldwide observations was used, representing tall vegetation (forests) and low vegetation (grassland, agricultural terrain). A quality assessment of the data sets showed that not all of them were compatible with MOST, so only 7 of the initial 20 data sets could be used.
Sensitivity studies were performed with different sets of input parameters, training methods and network architectures; validation was done with six-fold cross-validation. An important part of the overall network validation was to check the ability of the network to generalise, i.e. to produce acceptable output if the input comes from stations completely unknown to the network.
These studies showed that even a relatively small 6-3-2 network with six input parameters and one hidden layer yields satisfactory results in terms of RMSE and correlation coefficient. In the trade-off between quality of results and computational efficiency, this network performed best. We could show that the results of the ANN were equivalent to those of the standard method in all tests we performed. A final validation with the heat and momentum fluxes instead of the scaling quantities showed that the traditional MOST method and the ANN approach were also in this case almost equal in terms of quality, with the 6-3-2 ANN performing best. An implementation of the 6-3-2 ANN into an existing LSM showed that the ANN version gives results equivalent to the standard implementation, with sometimes even higher correlations. However, no saving of computation time was found.

In summary, it could be shown that even at this stage, an ANN gives results comparable in quality to those of the standard MOST method. Some obvious improvements will include more and better differentiated land use classes (e.g. water, urban areas) and the inclusion of more situations with strong stratification. Next steps will include more experiments with the input parameters (e.g. including a time lag) and some fine-tuning to improve the computational efficiency (e.g. using different activation functions).
We intend to implement and test the neural network routine in a three-dimensional regional climate model. This will require the ANN to learn some additional land use types like urban areas or water surfaces. If these tests are positive, this would open the possibility of replacing other "uncertain" components of climate models (e.g. cloud microphysics, sea ice) by neural network subroutines. The main hindrance to doing this is presently the lack of suitable training and validation data. An alternative to "real" data could be to use data from more detailed models like LES or urban climate models.