IntelliO3-ts v1.0: A neural network approach to predict near-surface ozone concentrations in Germany

The prediction of near-surface ozone concentrations is important to support regulatory procedures for the protection of humans from high exposure to air pollution. In this study, we introduce a data-driven forecasting model named 'IntelliO3-ts', which consists of multiple convolutional neural network (CNN) layers, grouped together as inception blocks. The model is trained with measured multi-year ozone and nitrogen oxide concentrations from more than 300 German measurement stations in rural environments, and six meteorological variables from the COSMO meteorological reanalysis. This is by far the most extensive dataset used for time series predictions based on neural networks so far. IntelliO3-ts predicts daily maximum 8-hour average (dma8eu) ozone concentrations for lead times of up to four days, and we show that the model outperforms standard reference models such as persistence. Moreover, we demonstrate that IntelliO3-ts outperforms climatological reference models for the first two days, while it does not add any genuine value for longer lead times. We attribute this to the limited deterministic information contained in the single-station time series used for training. We applied a bootstrapping technique to analyse the influence of different input variables and found that the previous-day ozone concentrations are of major importance, followed by 2 m temperature. As we did not use any geographic information to train IntelliO3-ts in its current version and included no relation between stations, the influence of the horizontal wind components on the model performance is minimal. We expect that the inclusion of advection-diffusion terms could improve results in future versions of our model.


Introduction
Exposure to ambient air pollutants such as ozone (O3) is harmful to living beings (WHO, 2013; Bell et al., 2014; Lefohn et al., 2017; Fleming et al., 2018) and certain crops (Avnery et al., 2011; Mills et al., 2018). Therefore, the prediction of ozone concentrations is of major importance to issue warnings for the public if high ozone concentrations are foreseeable. As tropospheric ozone is a secondary air pollutant, there is nearly no source of directly emitted ozone. Instead, it is formed in chemical reactions of several precursors such as nitrogen oxides (NOx) or volatile organic compounds (VOCs). Weather conditions (temperature, irradiation, humidity, and winds) have a major influence on the rates of ozone formation and destruction. Ozone has a "chemical lifetime" in the lower atmosphere of several days and can, therefore, be transported over distances of several hundred kilometres.
Ozone concentrations can be forecasted by various numerical methods. Chemical transport models (CTMs) solve chemical and physical equations explicitly (for example Collins et al., 1997; Wang et al., 1998a, b; Horowitz et al., 2003; von Kuhlmann et al., 2003; Grell et al., 2005; Donner et al., 2011). These numerical models predict concentrations for grid cells, which are assumed to be representative for a given area. Therefore, local fluctuations which are below model resolution cannot be simulated. Moreover, CTMs often have biases in concentrations, turnover rates, or meteorological properties which have a direct influence on chemical processes (Vautard, 2012; Brunner et al., 2015). This makes CTMs unsuited for regulatory purposes, which by law are bound to station measurements, except if so-called model output statistics are applied to the numerical modelling results (Fuentes and Raftery, 2005). As an alternative to CTMs, regression models are often used to generate point forecasts (cf. Olszyna et al., 1997; Thompson et al., 2001; Abdul-Wahab et al., 2005). Regression models are purely statistical models, which are based on empirical relations among different variables.
They usually describe a linear functional relationship between various factors (precursor concentrations, meteorological information, and site information) and the air pollutant in question.
Since the late nineties, machine learning techniques in the form of neural networks have also been applied as a regression technique to forecast ozone concentrations or threshold value exceedances (see Table 1). As computational power was limited in the early days of those approaches, many of these early studies focused on a small number of measurement stations and used a fully connected (FC) network architecture. More recent studies explored the use of more advanced network architectures like convolutional neural networks (CNN) or Long Short-Term Memory (LSTM) networks. These networks were also applied to a larger number of stations compared to the earlier studies, and some studies have tried to generalise, i.e. to train one neural network for all stations instead of training individual networks for individual stations (Table 1). Although the total number of studies focusing on air quality or explicitly on near-surface ozone is already quite substantial, there are only a few studies which applied advanced deep learning approaches to a larger number of stations or to longer time series. Eslami et al. (2019) applied a CNN to time series of 25 measurement stations in Seoul, South Korea to predict hourly ozone concentrations for the next 24 hours. Ma et al. (2020) trained a bidirectional LSTM on 19 measurement sites over a period of roughly 9 months, and afterwards used that model to retrain individually for 48 previously unused measurement stations (transfer learning). Sayeed et al. (2020) applied a deep CNN to data from 21 different measurement stations over a period of four years. They used three years (2014 to 2016) to train their model and evaluated the generalisation capability on the fourth year (2017).
This article is structured as follows: In Sect. 2 we explain our variable selection and present our preprocessing steps. In Sect. 3, we introduce our forecasting model IntelliO3-ts, version 1.0. Sect. 4 introduces the statistical tools and reference models which were used for verification. In Sect. 5 we present and discuss the results and analyse the influence of different input variables on the model performance. Finally, Sect. 7 provides conclusions.
2 Variable selection and data processing

Variable selection
Tropospheric ozone (O3) is a greenhouse gas formed in the atmosphere by chemical reactions of other, directly emitted pollutants (ozone precursors) and is therefore referred to as a secondary air pollutant.
The continuity equation of near-surface ozone in a specific volume of air can be written as (Jacobson, 2005, p. 74f):

∂N_q/∂t = −∇·(v N_q) + ∇·(K_h ∇N_q) + R_depg + R_chemg,    (1)

where ∂N_q/∂t is the partial derivative of the ozone number concentration with respect to time, v is the vector wind velocity, K_h is the eddy diffusion tensor for energy, while R_depg and R_chemg are the rates of dry deposition to the ground, and photo-chemical production or loss, respectively.
Tropospheric ozone is formed under sunlit conditions in gas-phase chemical reactions of peroxy radicals and nitrogen oxides (Seinfeld and Pandis, 2016). The peroxy radicals are themselves oxidation products of volatile organic compounds. The nitrogen oxides undergo a rapid catalytic cycle:

NO + O3 → NO2 + O2
NO2 + hν → NO + O
O + O2 + M → O3 + M*

where NO and O3 are converted to NO2 and back within minutes (M is an arbitrary molecule which is needed to take up excess energy, denoted by the asterisk). As a consequence, ozone concentrations in urban areas with high levels of NOx from combustive emissions are extremely variable. In this study, we therefore focus on background stations, which are less affected by the rapid chemical inter-conversion.
From a chemical perspective, the prediction of ozone concentrations would require concentration data of NO, NO2, speciated VOC, and O3 itself. However, since speciated VOC measurements are only available from very few measurement sites, the chemical input variables of our model are limited to NO, NO2, and O3.

Besides the trace gas concentrations, ozone levels also depend on meteorological variables. Due to the scarcity of reported meteorological measurements at the air quality monitoring sites, we extracted time series of meteorological variables from the 6 km resolution COSMO-Reanalysis (Bollmeyer et al., 2015, COSMO-REA6) and treat those as observations.
All data used in this study were retrieved from the Tropospheric Ozone Assessment Report (TOAR) database (Schultz et al., 2017) via the Representational State Transfer (REST) Application Programming Interface (API) at https://join.fz-juelich.de (last access: 12 Nov. 2020). The air quality measurements were provided by the German Umweltbundesamt, while the meteorological data were extracted from the COSMO-REA6 reanalysis as described above. These reanalysis data cover the period from 1995 to 2015 with some gaps due to incompleteness in the TOAR database. As discussed in the US EPA guidelines on air quality forecasting (Dye, 2003), ozone concentrations typically depend on temperature, irradiation, humidity, wind speed, and wind direction. The vertical structure of the lowest portion of the atmosphere (i.e. the planetary boundary layer) also plays an important role, because it determines the rate of mixing between fresh pollution and background air. Since irradiation data were not available from the REST interface, we used cloud cover together with temperature as proxy variables. Table 2 shows the list of input variables used in this study, and Table 3 describes the daily statistics that were applied to the hourly data of each variable. The choice of using the dma8eu metric for NO and NO2 was motivated by the idea to sample all chemical quantities during the same time periods and with similar averaging times. While the dma8eu metric is calculated based on data starting at 5 pm on the previous day, the daily mean/max values, for example, would be calculated based on data starting from the current day. To test the impact of using different metrics for the ozone precursors, we also trained the model from scratch with either mean or maximum concentrations of NO and NO2 as inputs. The results of these runs were hardly distinguishable from the results presented below.
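To make the dma8eu metric concrete, the following minimal numpy sketch computes a daily maximum 8-hour running mean from hourly values. The exact window bounds (24 candidate windows starting at 17:00 of the previous day) and the 75 % coverage requirement are our reading of the EU-style definition, not code from the study:

```python
import numpy as np

def dma8eu(hourly, day_start):
    """Simplified daily maximum 8-hour running mean (dma8eu-style).

    hourly: 1-D array of hourly concentrations.
    day_start: index of hour 00:00 of the day of interest.
    The 24 candidate 8-hour windows start at 17:00 of the previous
    day (day_start - 7) and run through 16:00 of the current day;
    a window counts as valid if at least 6 of its 8 hours (75 %)
    are non-NaN.
    """
    means = []
    for start in range(day_start - 7, day_start + 17):  # 24 windows
        window = hourly[start:start + 8]
        if np.sum(~np.isnan(window)) >= 6:  # coverage criterion
            means.append(np.nanmean(window))
    return max(means) if means else np.nan
```

For monotonically increasing hourly values, the metric picks the last window of the day, illustrating how the windows reach back into the previous evening.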
As described above, ozone concentrations are less variable at stations which are further away from primary pollutant emission sources. We therefore selected those stations from the German air quality monitoring network which are labelled as "background" stations according to the European Environment Agency (EEA) AirBase classification.

Data processing
We split the individual station time series into three non-overlapping time periods for training, validation, and testing, which we will refer to as sets from now on (see Fig. 2). Due to changes in the measurement network over time, the number of stations in the three sets differs: the training set comprises 312 stations, the validation set 212 stations, and the test set 204 stations. This is by far the largest air quality time series dataset that has been used in a machine learning study so far (see Table 1).

Supervised learning techniques require input data (X) and corresponding labels (y), which we create for each station of the three sets as depicted in Algorithm 1. The inputs X have the shape 7 × 1 × 9 (number of days, 1, number of variables), and the labels y contain the ozone (dma8eu) concentrations for the next four days (1d to 4d). Samples within the same data set (train, validation, test) can overlap, which means that one missing data point would appear up to seven times in the inputs X and up to four times in the labels y. Consequently, one missing value will cause the removal of eleven samples (Algorithm 1, line 7). As we want to keep the number of samples as high as possible, we decided to linearly interpolate (Algorithm 1, line 2) the time series if only one consecutive value is missing. Table 4 shows the number of stations per data set (train, val, test) and the corresponding number of samples (pairs of inputs X and labels y) per data set. Moreover, Table A1 summarises all samples per station individually. Figure 1 shows a map of all station locations.
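The sample construction described above can be sketched in a few lines of numpy. This is a simplified stand-in for Algorithm 1 (the single-gap interpolation step is omitted); the function name and argument defaults are our own:

```python
import numpy as np

def make_samples(series, n_hist=7, n_lead=4, target_col=0):
    """Create (X, y) pairs from a daily multivariate series
    (time x variables): X holds the previous n_hist days of all
    variables, y holds the target variable (here: dma8eu ozone)
    for the next n_lead days. Samples containing any NaN are
    dropped, mirroring the removal rule of Algorithm 1."""
    X, y = [], []
    for t in range(n_hist, len(series) - n_lead + 1):
        x_win = series[t - n_hist:t]               # days -6d..0d
        y_win = series[t:t + n_lead, target_col]   # days 1d..4d
        if not (np.isnan(x_win).any() or np.isnan(y_win).any()):
            X.append(x_win[:, None, :])            # shape (7, 1, n_vars)
            y.append(y_win)
    return np.array(X), np.array(y)
```

Because consecutive samples share input days, a single NaN would indeed invalidate up to eleven overlapping samples, which is why the single-gap interpolation is worthwhile.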
We trained the neural network (details on the network architecture are given in Sect. 3) with data of the training set and tuned hyperparameters exclusively on the validation set. For the final analysis and model evaluation, we use the independent test set, which was neither used for training the model's parameters nor for tuning the hyperparameters. Random sampling, as is often done in other machine learning applications, and occasionally even in other air quality or weather applications of machine learning, would lead to overly optimistic results due to the multi-day auto-correlation of air quality and meteorological time series. Horowitz and Barakat (1979) already pointed to this issue when dealing with statistical tests. Likewise, the alternative split of the dataset into spatially segregated data could lead to the undesired effect that two or more neighbouring stations with high correlation between several paired variables fall into different data sets. Again, this would result in overly optimistic model results.
By applying a temporal split, we ensure that the training data do not directly influence the validation and test data sets.
Therefore, the final results reflect the true generalisation capability of our forecasting model.
In accordance with other studies, our initial deep learning experiments with a subset of this data have shown that neural networks tend to converge towards the mean of the target distribution.

3 The IntelliO3-ts model

Our neural network, named IntelliO3-ts, version 1.0, primarily consists of two inception blocks (Szegedy et al., 2015), which combine multiple convolutions, execute them in parallel, and concatenate all outputs in the last layer of each block. Figure A2 depicts one inception block including the first input layer, while Fig. A2 and Fig. A3 together show the entire model architecture including the final flat and output layers. We treat each input variable (see Table 2) as an individual channel (like R, G, B in images) and use time as the first dimension (this would correspond to the width axis of an image). Inputs (X) consist of the variable values from 7 days (-6d to 0d). Outputs are ozone concentration forecasts (dma8eu) for lead times up to 4 days (1d to 4d). Therefore, we change the kernel sizes in the inception blocks from 1 × 1, 3 × 3, and 5 × 5, as originally proposed by Szegedy et al. (2015), to 1 × 1, 3 × 1, and 5 × 1. This allows the network to focus on different temporal relations. The 1 × 1 convolutions are also used for information compression (reduction of the number of filters) before larger convolutional kernels are applied (see Szegedy et al. (2015)). This decreases the computational costs for training and evaluating the network. In order to conserve the initial input shape of the first dimension (time), we apply symmetric padding to minimise the introduction of artefacts related to the borders.
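The effect of symmetric padding can be illustrated with a one-dimensional toy convolution: mirroring the border values lets a k × 1 kernel preserve the temporal length without zero-padding artefacts. This is an illustrative numpy sketch, not the network's actual implementation:

```python
import numpy as np

def conv1d_symmetric(x, kernel):
    """Apply a 1-D convolution along the time axis with symmetric
    padding, so the output keeps the input's temporal length.
    x: (time,) array; kernel: odd-length 1-D array."""
    half = len(kernel) // 2
    xp = np.pad(x, half, mode="symmetric")  # mirror values at the borders
    return np.convolve(xp, kernel, mode="valid")

x = np.array([1.0, 2.0, 3.0, 4.0])
y = conv1d_symmetric(x, np.ones(3) / 3)  # 3x1 moving average, length preserved
```

With zero padding, the border averages would be biased towards zero; the symmetric mirror avoids this, which is the motivation stated above.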

While the originally proposed concept of inception blocks has one max-pooling tower alongside the different convolution stacks, we added a second pooling tower, which calculates the average over a kernel size of 3 × 1. Thus, one inception block consists of three convolutional towers and two pooling towers (mean and max). A tower is defined as a collection or stack of successive layers. The outputs of these towers are concatenated in the last layer of an inception block (see Fig. A2). Between individual inception blocks, we added dropout layers (Srivastava et al., 2014) with a dropout rate of 0.35 to improve the network's generalisation capability and prevent overfitting.
Moreover, we use batch normalisation layers (Ioffe and Szegedy, 2015) between each main convolution and activation layer to accelerate the training process (Fig. A2). These normalisations ensure that for each batch the mean activation is zero with a standard deviation of one. As proposed in Szegedy et al. (2015), the network has an additional minor tail which helps to eliminate the vanishing gradient problem. Additionally, the minor tail helps to spread the internal representation of the data, as it strongly penalises large errors.
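The normalisation step can be illustrated in isolation. The sketch below shows only the standardisation part of batch normalisation; the learned scale and shift parameters of a real layer are omitted for brevity:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Per-batch standardisation as applied between convolution and
    activation layers: each feature is given zero mean and (nearly)
    unit standard deviation over the batch. The learnable gamma/beta
    parameters of a full batch-norm layer are left out."""
    mean = x.mean(axis=0)   # per-feature mean over the batch
    std = x.std(axis=0)     # per-feature standard deviation
    return (x - mean) / (std + eps)
```

The epsilon term guards against division by zero for constant features, exactly as in the original formulation by Ioffe and Szegedy (2015).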
The loss function for the main tail is the mean squared error,

L_main = (1/N) Σᵢ (mᵢ − oᵢ)²,

where N is the number of forecast-observation pairs, mᵢ the model forecast, and oᵢ the corresponding observation; the minor tail is trained with a loss of the same quadratic form applied to its own output. All activation functions are exponential linear units (ELU) (Clevert et al., 2016); only the final output activations (minor and main tail) are linear.
We use Adam (Kingma and Ba, 2014) as optimiser and apply an initial learning rate of 10⁻⁴ (see also Sect. A5).
We train the model for 300 epochs on the HPC system 'Jülich Wizard for European Leadership Science' (JUWELS, Jülich Supercomputing Centre (2019)), which is operated by the Jülich Supercomputing Centre (see also Sect. A4 for further details regarding the software and hardware configurations).

4 Evaluation metrics and reference models
In general, one can interpret a supervised machine learning approach as an attempt to find an unknown function φ which maps an input pattern (X) to the corresponding labels or the ground truth (y). The machine learning model is consequently an approximation φ̂ of this function, and the deviation of its predictions from the labels defines the loss function, which the network tries to minimise during training. As the network is only an estimator of the true function, the mapping generally differs: ŷ = φ̂(X) ≈ y. To evaluate the genuine added value of any meteorological or air quality forecasting model, it is essential to apply proper statistical metrics. The following section describes the verification tools which are used in this study. We provide additional information on joint distributions as introduced by Murphy and Winkler (1987) in Sect. A2.

Scores and Skill Scores
To quantify a model's informational content, scores like the mean squared error (Eq. (5)) are defined to provide an absolute performance measure, while skill scores provide a relative performance measure related to a reference forecast (Eq. (6)):

MSE(m, o) = (1/N) Σᵢ₌₁ᴺ (mᵢ − oᵢ)².    (5)

Here, N is the number of forecast-observation pairs, m is a vector containing all model forecasts, and o is a vector containing the observations (Murphy, 1988).
A skill score S may be interpreted as the percentage of improvement of A over the reference A_ref. Its general form is

S = (A − A_ref) / (A_perf − A_ref).    (6)

Here, A represents a general score, A_ref is the reference score, and A_perf the perfect score.
For A = A_ref, S becomes zero, indicating that the forecast of interest has no improvement over the reference forecast. A value of S > 0 indicates an improvement over the reference, while S < 0 indicates a deterioration. The informative value of a skill score strongly depends on the selected reference forecast. In the case of the mean squared error (Eq. (5)), the perfect score is equal to zero and Eq. (6) reduces to

S = 1 − MSE(m, o) / MSE(r, o),

where r is a vector containing the reference forecast.
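The MSE-based skill score can be written as a short helper function. This is an illustrative sketch of Eq. (6) with A_perf = 0, not code from the study:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two vectors (Eq. (5))."""
    return np.mean((np.asarray(a) - np.asarray(b)) ** 2)

def skill_score(model, reference, obs):
    """MSE-based skill score S = 1 - MSE(m, o) / MSE(r, o):
    S > 0 means the model improves on the reference forecast,
    S = 0 no improvement, S < 0 a deterioration."""
    return 1.0 - mse(model, obs) / mse(reference, obs)
```

A perfect forecast yields S = 1, and the score depends strongly on the chosen reference, as noted above.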

Reference models
We used three different reference models: persistence, climatology, and an ordinary least squares model (linear regression). For the climatological reference, we create four sub-reference models (see Sect. 4.2.2). In the following, we introduce our basic reference models.

Persistence Model
One of the most straightforward models to build, which in general has good forecast skill at short lead times, is a persistence model: today's observation of the ozone dma8eu concentration is also the prediction for the next four days. Obviously, the skill of persistence decreases with increasing lead time. The good performance at short lead times is mainly due to the facts that weather conditions influencing ozone concentrations generally do not change rapidly, and that the chemical lifetime of ozone is long enough.
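The persistence reference is trivial to implement; a minimal sketch (function name is ours):

```python
import numpy as np

def persistence_forecast(obs_today, n_lead=4):
    """Persistence reference: today's dma8eu observation is used
    as the forecast for every lead time (1d to 4d)."""
    return np.full(n_lead, obs_today)
```

Despite its simplicity, this model is a demanding baseline at short lead times for the reasons given above.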

Climatological reference models
We create four different climatological reference models (CASE I to CASE IV), which are based on the climatology of observations, following Murphy (1988) (also with respect to their terminology, which means that the reference score A_ref is calculated by using the reference forecast r).
The first reference forecast (A_ref: r = ō, CASE I) is the internal single-value climatology, i.e. the mean of all observations during the test period. This unique value is then applied as reference for all forecasts. As this forecast has only one constant value, which is the expectation value, this reference model is well calibrated but not refined at all. The second reference forecast (CASE II) is the internal monthly climatology, i.e. the monthly mean values calculated on the test set itself, while CASE III and CASE IV are the external counterparts of CASE I and CASE II, respectively, with the single value and the monthly means derived from data outside the test period.

OLS reference model
The third reference model is an ordinary least squares (OLS) model. We train the OLS model using the statsmodels package v0.10 (Seabold and Perktold, 2010). The OLS model is trained on the same data as the neural network (training set) and serves as a linear competitor.

5 Results
As described in Sect. 2, we split our data into three subsets (training, validation, and test sets). We only used the independent test set to evaluate the forecasting capabilities of the IntelliO3-ts network; during training and hyperparameter optimisation, only the training and validation sets were used, respectively. Therefore, the following results reflect the true generalisation capability of IntelliO3-ts. Before discussing the results in detail below, we would like to point out again that this is the first time that one neural network has been trained to forecast data from an entire national air quality monitoring station network.
Also, the network has been trained exclusively with time series data from air pollutant measurements and a meteorological reanalysis. No additional information, such as geographic coordinates, or other hints that could be used by the network to perform a station classification task, has been used. The impact of such extra information will be the subject of another study. Figure 3 shows that IntelliO3-ts captures the seasonal cycle of the observed dma8eu values (for example, the maxima in July/August or the minima in December/January/February).

Comparison with competitive models
The skill scores based on the mean squared error (MSE), evaluated over all stations in the test set, are summarised in Fig. 4. In the left and centre groups of boxes and whiskers, the IntelliO3-ts model (labelled "CNN") and the OLS model are compared against persistence as reference. The right group of boxes and whiskers shows the comparison between IntelliO3-ts and OLS.

The mean skill score for IntelliO3-ts against persistence is positive and increases with time. The OLS forecast shows similar behaviour in terms of its temporal evolution, but exhibits a slightly lower skill score throughout the 4-day forecasting period.
The increase in skill score in both cases is mainly due to the decreasing score of the persistence model (see also Sect. 4.2.1).
Consequently, IntelliO3-ts shows a positive skill score when the OLS model is used as a reference, indicating a small genuine added value over the OLS model.

In comparison with the climatological reference forecasts introduced in Sect. 4.1 and summarised in Table A2, the skill scores are high for the first lead time (1d) and decrease with increasing lead time (Fig. 5). Both cases with a single value as reference (internal CASE I, external CASE III) maintain a skill score above 0.4 over the four days. These high skill scores are a direct result of the fact that IntelliO3-ts captures the seasonal cycle as shown in Fig. 3, while the reference forecasts only report the overall mean as a single-value prediction.

If the reference includes the seasonal variation (CASE II and CASE IV), the IntelliO3-ts skill score is still better than 0.4 for the first day (1d), but then it decreases rapidly and even becomes negative on day 4 for CASE II. The skill scores for CASE II are lower than for CASE IV as the reference climatology (i.e. the monthly mean values) is calculated on the test set itself.
These results show that, for the vast majority of stations, our model performs much better than a seasonal climatology for a one-day forecast, and it is still substantially better than the climatology after two days. However, there are some stations which yield a negative skill score even on day 2 in the CASE II comparison. Longer-term forecasts with this model set-up do not add value compared to the computationally much cheaper monthly mean climatological forecast.

Analysis of joint distributions
The full joint distribution in terms of the calibration-refinement factorisation (Sect. A2) is shown in Fig. 6a to 6d. Quantiles in value regions with many data samples are more robust and therefore more credible than quantiles in data-sparse concentration regimes (Murphy et al., 1989). At the first lead time (1d, Fig. 6), the IntelliO3-ts network has a tendency to slightly over-predict concentrations below 30 ppb. On the other hand, the forecast underestimates concentrations above 70 ppb.
Both very high and very low forecasts are rare (note the logarithmic axis for the sample size). Therefore, the results in these regimes have to be treated with caution. Further detail is provided in Fig. A1, where the conditional biases are shown (terms BI, BII, BIV in Sect. A3); these biases decrease the maximal climatological potential skill score (term AI, see also Table A2).
With increasing lead time, the model loses its capability to predict concentrations close to zero as well as high concentrations. We also inspected the marginal distributions for each season (DJF, MAM, JJA, and SON); conditional quantile plots for individual seasons can be found in the supplementary material (A6). As mentioned above, the bimodal shape of the marginal distribution is mainly caused by the network's weakness in predicting very high and very low ozone concentrations. Moreover, the seasonal decomposition shows that the left mode is caused by the fall (SON) and winter (DJF) seasons (Fig. A7a to A7d and A4a to A4d). In both seasons, the most common values fall into the same concentration range, while the right tail of SON is much more pronounced than for DJF, with higher values occurring primarily in September. In the summer season (JJA, Fig. A6a to A6d), the most frequently predicted values correspond to the location of the right mode of Fig. 6a to 6d. During DJF, MAM, and JJA, the model has a stronger tendency towards under-forecasting with increasing lead time (the median line moves above the reference line).

Relevance of input variables
To analyse the impact of individual input variables on the forecast results, we apply a bootstrapping technique as follows: we take the original input of one station, keep eight of the nine variables unaltered, and randomly draw (with replacement) the remaining variable (20 times per variable per station). This destroys the temporal structure of this specific variable so that the network can no longer use this information for forecasting. Compared to alternative approaches, such as re-training the model with fewer input variables or setting all values of a variable to zero, this method has two main advantages: (i) the model does not need to be re-trained, and thus the evaluation occurs with the exact same weights that were learned from the full dataset, and (ii) the distribution of the input variable remains unchanged, so that adverse effects, for example due to correlated input variables, are excluded. However, we note that this method may underestimate the impact of a specific variable in the case of correlated input data, because in such cases the network will focus on the dominant feature (here: ozone). Also, this analysis only evaluates the behaviour of the deep learning model and does not evaluate the impact of these variables on actual ozone formation in the atmosphere.

After the randomisation of one variable, we apply the trained model to this modified input data and compare the new prediction with the original one. For comparison, we apply the skill score (Eq. (6)) based on the MSE, where we use the original forecast as reference. Consequently, the skill score will be negative if the bootstrapped variable has a significant impact on the model performance. Figure 7 shows the skill scores for all variables (x-axis) and lead times (dark (1d) to light blue (4d) boxplots). Ozone is the most crucial input variable, as it has by far the lowest skill score for all lead times. With increasing lead time, the skill score increases but stays lower than for any other variable. In contrast, the model does not derive much skill from the variables nitrogen oxide, nitrogen dioxide, and the planetary boundary layer height. In other words, the network does not perform worse when randomly drawn values replace one of those original time series. Relative humidity, temperature, and the wind's u-component have an impact on the model performance. With increasing lead time, these influences decrease.
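The bootstrapping procedure described above can be sketched generically. In this sketch, `model` is any callable mapping inputs to predictions, a hypothetical stand-in for the trained network; names and defaults are ours:

```python
import numpy as np

def bootstrap_variable_skill(model, X, y, var_idx, n_draws=20, seed=0):
    """Estimate one input variable's importance: redraw its values
    with replacement across all samples and time steps, destroying
    its temporal structure while keeping its marginal distribution,
    then score the perturbed forecast against the unperturbed one
    (skill score with the original forecast as reference)."""
    rng = np.random.default_rng(seed)
    base = model(X)                       # original forecast = reference
    mse_ref = np.mean((base - y) ** 2)
    scores = []
    for _ in range(n_draws):
        X_boot = X.copy()
        values = X_boot[..., var_idx].ravel()
        X_boot[..., var_idx] = rng.choice(values, size=values.size).reshape(
            X_boot[..., var_idx].shape)
        mse_boot = np.mean((model(X_boot) - y) ** 2)
        scores.append(1.0 - mse_boot / mse_ref)  # negative => variable matters
    return float(np.mean(scores))
```

If the model ignores the bootstrapped variable, the perturbed forecast equals the original one and the skill score is zero; the more the model relies on the variable, the more negative the score becomes.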
6 Discussion

Even though IntelliO3-ts v1.0 generalises well on an unseen test set (see Sect. 5), it still has some limitations related to the applied data split.
By splitting the data into three consecutive, non-overlapping sets, we ensure that the data sets are as independent as possible. On the other hand, this independence comes at the cost that changes of trends in the input variables may not be captured, especially as our input data are not de-trended. Indeed, at European non-urban measurement sites, several ozone metrics related to high concentrations (e.g. the 4th highest daily maximum 8-hour value (4MDA8) or the 95th percentile of hourly concentrations) show a significant decrease during our study period (1997 to 2015) (Fleming et al., 2018; Yan et al., 2018). Our data splitting method for evaluating the generalisation capability is conservative in the sense that we evaluate the model on the test set, which has the largest possible distance to the training set. If our research model should be transformed into an operational system, we suggest applying online learning and using the latest available data for subsequent training cycles (see for example Sayeed et al. (2020)).

7 Conclusions
In this study, we developed and evaluated IntelliO3-ts, a deep learning forecasting model for daily near-surface ozone concentrations (dma8eu) at arbitrary air quality monitoring stations in Germany. The model uses chemical (O3, NO, NO2) and meteorological time series of the previous six days to create forecasts for up to four days into the future. IntelliO3-ts is based on convolutional inception blocks, which allow calculating concurrent convolutions with different kernel sizes. The model has been trained on 10 years of data from 312 background stations in Germany. Hyperparameter tuning and model evaluation were done with independent data sets of 2 and 6 years length, respectively.
The model generalises well and generates good-quality forecasts for lead times of up to two days. These forecasts are superior to the reference models persistence, ordinary least squares, and annual and seasonal climatology. After two days, the forecast quality degrades, and the forecast adds no value compared to a monthly mean climatology of dma8eu ozone levels. We could primarily attribute this to the network's tendency to converge to the monthly mean value. The model does not have any spatial context information which could counteract this tendency. Near-surface ozone concentrations at background stations are highly influenced by air mass advection, but the IntelliO3-ts network does not yet have a way to take upwind information into account. We will investigate spatial context approaches in a forthcoming study.
We observed that the model loses refinement with increasing lead time, which results in unsatisfactory predictions at the tails of the observed ozone concentration distribution. We were able to attribute this weakness to the under-representation of extreme (either very low or very high) levels in the training data set. This is a general problem for machine learning applications and regression methods. The machine learning community is investigating possible solutions to lessen the impact of such data imbalances, but their adoption is beyond the scope of this paper, as the proposed techniques are not directly applicable to such auto-correlated time series.

370
Bootstrapping individual time series of the input data to analyse the importance of these variables for the predictive skill showed that the model mainly focuses on the previous ozone concentrations. Temperature and relative humidity only have a small effect on the model performance, while the time series of NO, NO2, and PBL have no impact.
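The principle behind such a bootstrap-based importance analysis can be sketched as follows: one input variable at a time is replaced by a resampled version of itself, which destroys its information content, and the resulting increase in forecast error is recorded. This is a generic sketch, not the exact procedure of the paper; `bootstrap_importance` and the toy model are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_importance(model, X, y, n_boot=20):
    """Estimate variable importance: resample one input variable at a time
    (with replacement) and measure the mean increase in MSE.
    `model` is any callable mapping X (samples, variables) to predictions."""
    base_mse = np.mean((model(X) - y) ** 2)
    importance = {}
    for j in range(X.shape[1]):
        mses = []
        for _ in range(n_boot):
            Xb = X.copy()
            Xb[:, j] = rng.choice(X[:, j], size=len(X), replace=True)
            mses.append(np.mean((model(Xb) - y) ** 2))
        importance[j] = np.mean(mses) - base_mse
    return importance

# toy example: the "model" uses only variable 0, so only its
# bootstrapping should degrade the forecast
X = rng.normal(size=(200, 3))
y = X[:, 0]
imp = bootstrap_importance(lambda X: X[:, 0], X, y)
```

A large importance value indicates that destroying the variable's temporal structure hurts the forecast, i.e. the model relies on it; a value near zero indicates the variable is ignored.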
The IntelliO3-ts network extends previous work by using a new network architecture and by training one model on a much larger set of measurement station data and longer time periods. Following Rasp and Lerch (2018), who used several neural networks to postprocess ensemble weather forecasts, we applied meteorological evaluation metrics to perform a point-by-point comparison, which is not common in the field of deep learning. We hope that the forecast quality of IntelliO3-ts can be further improved if we take spatial context information into account so that the advection of background ozone and ozone precursors can be learned by the model.
Code and data availability. The current version of IntelliO3-ts is available from the project website: https://gitlab.version.fz-juelich.de/toar/mlair/-/tree/IntelliO3-ts (last access: 12 Nov. 2020) under the MIT licence (http://opensource.org/licenses/mit-license.php). The exact version of the model and data used to produce the results in this paper is archived on b2share (Kleinert et al., 2020b). The initial version, which was used for the initial submission, is also archived on b2share (Kleinert et al., 2020a).
Appendix A

A1 Information on used stations

Table A1 lists all measurement stations which we used in this study. The table also shows the number of samples (X, y) for each of the three data sets (training, validation, test).

A2 Joint Distributions
Forecasts and observations are treated as random variables. Let p(m, o) represent the joint distribution of a model's forecast m and an observation o, which contains information on the forecast, the observation, and the relationship between the two (Murphy and Winkler, 1987). To access specific pieces of information, we factorise the joint distribution into a conditional and a marginal distribution in two ways. The first factorisation is called calibration-refinement and reads

p(m, o) = p(o|m) p(m),

where p(o|m) is the conditional distribution of observing o given the forecast m, and p(m) is the marginal distribution which indicates how often different forecast values are used (Murphy and Winkler, 1987; Wilks, 2006). Murphy and Winkler (1987) pointed out that a perfectly calibrated forecast is worth nothing if it lacks refinement.
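Such a factorisation of a discrete joint distribution can be verified numerically. In the sketch below, the probability values of the 2 x 2 joint distribution are hypothetical; the conditional-times-marginal product recovers p(m, o) for both factorisation directions.

```python
import numpy as np

# toy joint distribution p(m, o) over 2 forecast x 2 observation categories
p_mo = np.array([[0.4, 0.1],
                 [0.2, 0.3]])          # rows: forecast m, columns: observation o

p_m = p_mo.sum(axis=1)                 # marginal distribution of the forecast
p_o = p_mo.sum(axis=0)                 # sample climatological distribution

p_o_given_m = p_mo / p_m[:, None]      # calibration-refinement: p(o|m)
p_m_given_o = p_mo / p_o[None, :]      # likelihood-base rate:   p(m|o)

# both factorisations recover the joint distribution
print(np.allclose(p_o_given_m * p_m[:, None], p_mo))
print(np.allclose(p_m_given_o * p_o[None, :], p_mo))
```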
The second factorisation is called likelihood-base rate and is accordingly given by

p(m, o) = p(m|o) p(o),

where p(m|o) is the conditional distribution of the forecast m given that o was observed, and p(o) is the marginal distribution which only contains information about the underlying rate of occurrence of observed values and is therefore also called the sample climatological distribution (Wilks, 2006).

A3 Mean Squared Error Decomposition

This section provides additional information about the MSE decomposition introduced by Murphy (1988). The MSE decomposition is performed as

MSE = (m̄ − ō)² + σ_m² + σ_o² − 2σ_mo,

where σ_m² (σ_o²) is the sample variance of the forecasts (observations) and σ_mo is the sample covariance of the forecasts and observations, which is given by σ_mo = ρ_mo σ_m σ_o, with ρ_mo being the sample coefficient of correlation between forecast and observation. CASE I: The term AI is the square of the sample correlation coefficient and can be interpreted as the strength of the linear relationship between forecast and observation. This term ranges from zero (no correlation) to one (perfect correlation). Term BI contains the square of the difference between the sample correlation coefficient and the ratio of the standard deviations of forecast and observation. Therefore, BI is a measure of the conditional bias of the forecast, which is always positive due to the square and tends to decrease skill as it is a subtrahend. The last term, which is included in all cases I-IV, is CI; it contains a measure of the unconditional bias of the forecast and, again, tends to decrease skill as it is a subtrahend which is always greater than or equal to zero.
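The decomposition MSE = (m̄ − ō)² + σ_m² + σ_o² − 2σ_mo is an algebraic identity when variances and covariance are computed with the biased (1/n) estimators, which can be checked numerically on synthetic data (the forecast model below is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
o = rng.normal(size=500)                        # synthetic observations
m = 0.8 * o + rng.normal(scale=0.5, size=500)   # synthetic, imperfect forecasts

mse = np.mean((m - o) ** 2)

bias2 = (m.mean() - o.mean()) ** 2              # squared unconditional bias
var_m = np.var(m)                               # biased (1/n) sample variance
var_o = np.var(o)
cov_mo = np.mean((m - m.mean()) * (o - o.mean()))

# MSE = (mean bias)^2 + var(m) + var(o) - 2 cov(m, o)
print(np.isclose(mse, bias2 + var_m + var_o - 2 * cov_mo))
```

Note that NumPy's `np.var` defaults to the biased estimator (`ddof=0`); using the unbiased `ddof=1` estimators would break the identity by terms of order 1/n.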
In the case of a multi-value internal climatology (Case II, Eq. (A8)), two additional terms appear in the denominator as well as in the numerator, which tend to decrease skill in general and only vanish if 2ρ_µo = σ_µ/σ_o. In Case III, the additional term AIII appears, which includes the square of the difference between the mean external and the internal climatology divided by the variance of the observation. AIII leads to an increase of skill for any difference in the means of the external and internal climatologies.
Three additional terms (AIV, BIV, and CIV) appear if Eq. (7) is decomposed by using a multi-value external climatology as reference forecast (Case IV). These terms only vanish if 2ρ_µo = σ_µ/σ_o and µ̄ = ō. A summary of all four cases and all included terms is given in Table A2. Figure A1 also includes all individual terms as described above.

A4 Additional information on JUWELS
Each node on JUWELS (Jülich Supercomputing Centre, 2019) that belongs to the graphics processing unit (GPU) partition is equipped with four NVIDIA Volta V100 GPUs. The user guide for JUWELS is available from https://apps.fz-juelich.de/jsc/hps/juwels/index.html (last access: 12 Nov. 2020). Figures A2 and A3 show the full architecture of IntelliO3-ts, including all individual layers and tails. Table A3 lists the specific compile options per keyword of Keras' compile method. Table A4 summarises additional settings for the specific architecture.