The prediction of near-surface ozone concentrations is important for supporting regulatory procedures for the protection of humans from high exposure to air pollution. In this study, we introduce a data-driven forecasting model named “IntelliO3-ts”, which consists of multiple convolutional neural network (CNN) layers, grouped together as inception blocks. The model is trained with measured multi-year ozone and nitrogen oxide concentrations of more than 300 German measurement stations in rural environments and six meteorological variables from the meteorological COSMO reanalysis. This is by far the most extensive dataset used to date for time series predictions based on neural networks. IntelliO3-ts allows the prediction of daily maximum 8 h average (dma8eu) ozone concentrations for a lead time of up to 4 d, and we show that the model outperforms standard reference models like persistence models. Moreover, we demonstrate that IntelliO3-ts outperforms climatological reference models for the first 2 d, while it does not add any genuine value for longer lead times. We attribute this to the limited deterministic information that is contained in the single-station time series training data. We applied a bootstrapping technique to analyse the influence of different input variables and found that the previous-day ozone concentrations are of major importance, followed by 2 m temperature. As we did not use any geographic information to train IntelliO3-ts in its current version and included no relation between stations, the influence of the horizontal wind components on the model performance is minimal. We expect that the inclusion of advection–diffusion terms in the model could improve results in future versions of our model.

Exposure to ambient air pollutants such as ozone (

Ozone concentrations can be forecasted by various numerical methods. Chemical transport models (CTMs) solve chemical and physical equations explicitly (for example,

This makes CTMs unsuited for regulatory purposes, which by law are bound to station measurements, except if so-called model output statistics are applied to the numerical modelling results

Since the late 1990s, machine learning techniques in the form of neural networks have also been applied as a regression technique to forecast ozone concentrations or threshold value exceedances (see Table

Overview of the literature on ozone forecasts with neural networks. Machine learning (ML) types are abbreviated as FC for fully connected, CNN for convolutional neural network, RNN for recurrent neural network, and LSTM for long short-term memory. We use the following abbreviations for time periods: yr for years and m for months.

The current study extends these previous works and introduces a new deep learning model for the prediction of daily maximum 8 h average

This article is structured as follows: in Sect.

Tropospheric ozone (

The continuity equation of near-surface ozone in a specific volume of air can be written as

Tropospheric ozone is formed under sunlit conditions in gas-phase chemical reactions of peroxy radicals and nitrogen oxides

From a chemical perspective, the prediction of ozone concentrations would require concentration data of

Besides the trace gas concentrations, ozone levels also depend on meteorological variables. Due to the scarcity of reported meteorological measurements at the air quality monitoring sites, we extracted time series of meteorological variables from the 6 km resolution COSMO reanalysis

All data used in this study were retrieved from the Tropospheric Ozone Assessment Report (TOAR) database

Table

Input variables and applied daily statistics according to Table

Definitions of statistical metrics in TOAR analysis relevant for this study. Adapted from

As described above, ozone concentrations are less variable at stations that are further away from primary pollutant emission sources. We therefore selected those stations from the German air quality monitoring network that are labelled as “background” stations according to the European Environment Agency (EEA) AirBase classification.

Map of central Europe showing the location of German measurement sites used in this study. This figure was created with Cartopy

We split the individual station time series into three non-overlapping time periods for training, validation, and testing which we will refer to as “set” from now on (see Fig.

Data availability diagram combined for all variables and all stations. The training set is coloured in orange, the validation set in green, and the test set in blue. Gaps in 1999 and 2003 are caused by missing model data in the TOAR database.

Due to changes in the measurement network over time, the number of stations in the three datasets differs: the training data comprise 312 stations, the validation data 211 stations, and the test data 203 stations.
This is by far the largest air quality time series dataset that has been used in a machine learning
study so far (see Table

Supervised learning techniques require input data (

Samples within the same dataset (train, validation, and test) can overlap, which means that a single missing data point would appear up to seven times in the inputs
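For illustration, such overlapping samples can be constructed as follows (a minimal sketch; the window length of seven days is inferred from the overlap statement above, and the function name is ours):

```python
import numpy as np

def make_samples(series, n_history=7, n_lead=4):
    """Slice a daily time series into overlapping (input, label) pairs.

    Consecutive samples share all but one input day, so a single day
    (or a single missing value) can appear in up to n_history input
    windows.
    """
    x, y = [], []
    for t in range(n_history, len(series) - n_lead + 1):
        x.append(series[t - n_history:t])   # previous n_history days
        y.append(series[t:t + n_lead])      # next n_lead days (labels)
    return np.asarray(x), np.asarray(y)

# toy daily series: 20 consecutive days
x, y = make_samples(np.arange(20, dtype=float))
```

In practice, one such pair is built per station and per valid date, and samples containing gaps are discarded.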

Number of stations, total number of samples (pairs of

We trained the neural network (details on the network architecture are given in Sect.

By applying a temporal split, we ensure that the training data do not directly influence the validation and test datasets. Therefore, the final results reflect the true generalisation capability of our forecasting model.

In accordance with other studies, our initial deep learning experiments with a subset of this data have shown that neural networks, just as other classical regression techniques, have a tendency to focus on the mean of the distribution and perform poorly on the extremes.
However, high-concentration events in particular are crucial in the air quality context due to their strong impact on human health and their adverse effects on crops.
Extreme values occur relatively seldom in the dataset, and it is therefore difficult for the model to learn their associated patterns correctly.
To increase the total number of values on the tails of the distribution during training, we append all samples where the standardised label (i.e. the normalised ozone concentration) is
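A minimal sketch of this oversampling step (the threshold value here is purely illustrative and not the value used in training; the helper name is ours):

```python
import numpy as np

def oversample_extremes(x, y, threshold=1.0):
    """Append copies of all samples whose standardised label lies in
    the tails of the distribution (illustrative threshold)."""
    y_std = (y - y.mean()) / y.std()
    mask = np.abs(y_std) > threshold
    return np.concatenate([x, x[mask]]), np.concatenate([y, y[mask]])

x = np.arange(5.0)
y = np.array([0.0, 0.0, 0.0, 0.0, 10.0])  # one extreme label
x2, y2 = oversample_extremes(x, y)
```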

We selected a batch size of 512 samples (algorithm 1, line 10) because this size is a good compromise between minimising the loss function and optimising the computing time per trained epoch. Experiments with larger and smaller batch sizes did not yield significantly different results. Before creating the individual training batches, we permute the ordering of samples per station in the training set to ensure that the distribution of each batch is similar to that of the full training dataset (algorithm 1, line 9). Otherwise, individual batches would under-represent particular seasons, which would lead to undesired looping during training (e.g. no winter values in the first batch, no autumn values in the second batch).
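The permutation and batching step can be sketched as follows (helper name and seed are ours):

```python
import numpy as np

def make_batches(x, y, batch_size=512, seed=0):
    """Permute all samples jointly before slicing into batches, so each
    batch mixes stations and seasons roughly like the full training set."""
    order = np.random.default_rng(seed).permutation(len(x))
    x, y = x[order], y[order]
    return [(x[i:i + batch_size], y[i:i + batch_size])
            for i in range(0, len(x), batch_size)]

batches = make_batches(np.arange(1000.0), np.arange(1000.0))
```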

Our machine learning model is based on a convolutional neural network

Our neural network named IntelliO3-ts, version 1.0, primarily consists of two inception blocks

While the originally proposed concept of inception blocks has one max-pooling tower alongside the different convolution stacks, we added a second pooling tower, which calculates the average on a kernel size of
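The effect of running a max-pooling and an average-pooling tower side by side can be illustrated with a stride-1, same-padded 1-D pooling sketch (kernel size and helper are illustrative; the real model operates on learned feature maps):

```python
import numpy as np

def pool1d(x, kernel=3, mode="max"):
    """Stride-1, same-padded 1-D pooling (illustrative kernel size)."""
    pad = kernel // 2
    xp = np.pad(x, pad, mode="edge")
    windows = np.stack([xp[i:i + len(x)] for i in range(kernel)])
    return windows.max(axis=0) if mode == "max" else windows.mean(axis=0)

x = np.array([1.0, 4.0, 2.0, 8.0, 5.0])
# the two pooling towers are concatenated along the feature axis,
# just like the outputs of the convolution towers
towers = np.stack([pool1d(x, mode="max"), pool1d(x, mode="mean")])
```

The average-pooling tower preserves smooth, local-mean information that the max-pooling tower discards.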

Moreover, we use batch normalisation layers

The loss function for the main tail is the mean squared error:
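In its standard form (with $N$ samples, predictions $\hat{y}_i$, and observations $y_i$; symbol choice ours), the mean squared error reads

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2
```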

All activation functions are exponential linear units (ELUs)

The network is built with Keras 2.2.4

We train the model for 300 epochs on the Jülich Wizard for European Leadership Science (JUWELS;

In general, one can interpret a supervised machine learning approach as an attempt to find an unknown function

To evaluate the genuine added value of any meteorological or air quality forecasting model, it is essential to apply proper statistical metrics. The following section describes the verification tools, which are used in this study. We provide additional information on joint distributions as introduced by

To quantify a model's informational content,

A skill score

For
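For illustration, an MSE-based skill score of this form can be computed as follows (helper name ours):

```python
import numpy as np

def skill_score(forecast, reference, observation):
    """MSE-based skill score: 1 - MSE(forecast) / MSE(reference).
    Positive values indicate the forecast beats the reference;
    0 means no improvement over the reference; 1 is a perfect forecast."""
    o = np.asarray(observation, dtype=float)
    mse = lambda f: np.mean((np.asarray(f, dtype=float) - o) ** 2)
    return 1.0 - mse(forecast) / mse(reference)
```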

We use three different reference models: persistence, climatology, and an ordinary least-squares model (linear regression). For the climatological reference, we create four sub-reference models (see Sect.

One of the most straightforward models to build, which in general has good forecast skill at short lead times, is a persistence model: today's observed ozone dma8eu concentration is used as the prediction for each of the next 4 d. Obviously, the skill of persistence decreases with increasing lead time. The good performance at short lead times is mainly due to the fact that weather conditions influencing ozone concentrations generally do not change rapidly and that the chemical lifetime of ozone is long enough.
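A persistence forecast is trivial to implement, e.g. (sketch; function name ours):

```python
import numpy as np

def persistence_forecast(obs_today, n_lead=4):
    """Persistence reference: today's observed dma8eu concentration is
    reused as the forecast for every lead time."""
    return np.full(n_lead, float(obs_today))

forecast = persistence_forecast(42.0)
```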

We create four different climatological reference models (Case I to Case IV), which are based on the climatology of observations by following

The first reference forecast (

The second reference (

The third reference (

Finally, the fourth reference (
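The single-value and monthly climatological references underlying the four cases can be sketched jointly (toy data; helper name ours):

```python
import numpy as np

def climatology_references(values, months):
    """Single-value and monthly climatological references. Which
    dataset the statistics are computed on (internal vs. external)
    distinguishes Cases I/II from Cases III/IV."""
    single = values.mean()                            # Cases I and III
    monthly = {m: values[months == m].mean()          # Cases II and IV
               for m in np.unique(months)}
    return single, monthly

values = np.array([10.0, 20.0, 30.0, 40.0])   # toy dma8eu values
months = np.array([1, 1, 2, 2])               # month of each value
single, monthly = climatology_references(values, months)
```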

The third reference model is an ordinary least-square (OLS) model. We train the OLS model by using the statsmodels package v0.10
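For illustration, the core of such an OLS fit can be reproduced with numpy's least-squares solver instead of the statsmodels API used in the paper (toy data; not the predictors used in the actual model):

```python
import numpy as np

# Hypothetical toy data: a single predictor with perfectly linear labels.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

design = np.column_stack([np.ones_like(x), x])   # intercept + slope column
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```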

As described in Sect.

Figure

Monthly dma8eu ozone concentrations for all test stations as boxplots. Measurements are denoted by “obs” (green), while the forecasts are denoted by “1 d” (dark blue) to “4 d” (light blue). Whiskers have a maximal length of one interquartile range. The black triangles denote the arithmetic means.

The skill scores based on the mean squared error (MSE) evaluated over all stations in the test set are summarised in Fig.

Skill scores of IntelliO3-ts (cnn) versus the two reference models, persistence (persi) and ordinary least squares (ols), based on the mean squared error, separated by lead time (1 d (dark blue) to 4 d (light blue)). Positive values denote that the first-mentioned model performs better than the reference model (mentioned second). The triangles denote the arithmetic means.

In comparison with climatological reference forecasts as introduced in Sect.

Skill scores of IntelliO3-ts with respect to climatological reference forecasts: with internal single value reference (Case I), internal multi-value (monthly) reference (Case II), external single (Case III), and external multi-value (monthly) reference (Case IV) for all lead times from 1 d (dark blue) to 4 d (light blue). Triangles denote the arithmetic means.

If the reference includes the seasonal variation (Case II and Case IV), the IntelliO3-ts skill score is still better than 0.4 for the first day (1 d), but then it decreases rapidly and even becomes negative on day 4 for Case II. The skill scores for Case II are lower than for Case IV as the reference climatology (i.e. the monthly mean values) is calculated on the test set itself. These results show that, for the vast majority of stations, our model performs much better than a seasonal climatology for a 1 d forecast, and it is still substantially better than the climatology after 2 d. However, there are some stations which yield a negative skill score even on day 2 in the Case II comparison. Longer-term forecasts with this model setup do not add value compared to the computationally much cheaper monthly mean climatological forecast.

The full joint distribution in terms of calibration refinement factorisation (Sect.

Conditional quantile plot for all IntelliO3-ts predictions for a lead time of 1 d

Both very high and very low forecasts are rare (note the logarithmic axis for the sample size). Therefore, the results in these regimes have to be treated with caution. Further detail is provided in Fig.

With increasing lead time, the model loses its capability to predict concentrations close to zero and high concentrations above

To shed more light on the factors influencing the forecast quality, we analyse the network performance individually for each season (DJF, MAM, JJA, and SON).
Conditional quantile plots for individual seasons can be found in the Appendix (Sect.

To analyse the impact of individual input variables on the forecast results, we apply a bootstrapping technique as follows: we take the original input of one station, keep eight of the nine variables unaltered, and randomly draw (with replacement) the remaining variable (20 times per variable per station). This destroys the temporal structure of this specific variable so that the network will no longer be able to use this information for forecasting. Compared to alternative approaches, such as re-training the model with fewer input variables, setting all variable values to zero, etc., this method has two main advantages: (i) the model does not need to be re-trained, and thus the evaluation occurs with the exact same weights that were learned from the full dataset, and (ii) the distribution of the input variable remains unchanged so that adverse effects, for example, due to correlated input variables, are excluded. However, we note that this method may underestimate the impact of a specific variable in the case of correlated input data, because in such cases the network will focus on the dominant feature (here ozone). Also, this analysis only evaluates the behaviour of the deep learning model and does not evaluate the impact of these variables on actual ozone formation in the atmosphere.
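A minimal sketch of this bootstrapping step (helper name, seed, and toy dimensions are ours):

```python
import numpy as np

def bootstrap_variable(x, var_index, seed=0):
    """Redraw one input variable (column) with replacement along the
    time axis, destroying its temporal structure while leaving its
    distribution and all other variables untouched."""
    rng = np.random.default_rng(seed)
    x_boot = x.copy()
    n = len(x)
    x_boot[:, var_index] = x[rng.integers(0, n, size=n), var_index]
    return x_boot

x = np.arange(12.0).reshape(4, 3)   # 4 time steps, 3 variables
xb = bootstrap_variable(x, 1)       # shuffle only the second variable
```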

After the randomisation of one variable, we apply the trained model on this modified input data and compare the new prediction with the original one. For comparison, we apply the skill score (Eq.

Skill scores of bootstrapped model predictions having the original forecast as the reference model are shown as boxplots for all
lead times from 1 d (dark blue) to 4 d (light blue). The skill score for ozone is shown on the left

Even though IntelliO3-ts v1.0 generalises well on an unseen testing set (see Sect.

By splitting the data into three consecutive, non-overlapping sets, we ensure that the datasets are as independent as possible. On the other hand, this independence comes at the cost that changes of trends in the input variables may not be captured, especially as our input data are not de-trended. Indeed, at European non-urban measurement sites, several ozone metrics related to high concentrations (e.g. fourth highest daily maximum 8 h (4MDA8) or the 95th percentile of hourly concentrations) show a significant decrease during our study period (1997 to 2015)

In this study, we developed and evaluated IntelliO3-ts, a deep learning forecasting model for daily near-surface ozone concentrations (dma8eu) at arbitrary air quality monitoring stations in Germany. The model uses chemical (

The model generalises well and generates good-quality forecasts for lead times of up to 2 d. These forecasts are superior to the persistence, ordinary least squares, and annual and seasonal climatology reference models. After 2 d, the forecast quality degrades, and the forecast adds no value compared to a monthly mean climatology of dma8eu ozone levels. We primarily attribute this to the network's tendency to converge to the monthly mean value. The model does not have any spatial context information which could counteract this tendency. Near-surface ozone concentrations at background stations are highly influenced by air mass advection, but the IntelliO3-ts network has no way of taking upwind information into account yet. We will investigate spatial context approaches in a forthcoming study.

We observed that the model loses refinement with increasing lead time, which results in unsatisfactory predictions on the tails of the observed ozone concentration distribution. We were able to attribute this weakness to the under-representation of extreme (either very low or very high) levels in the training dataset. This is a general problem for machine learning applications and regression methods. The machine learning community is investigating possible solutions to lessen the impact of such data imbalances, but adapting them is beyond the scope of this paper, as the proposed techniques are not directly applicable to auto-correlated time series such as ours.

Bootstrapping individual time series of the input data to analyse the importance of those variables on the predictive skill showed that the model mainly focused on the previous ozone concentrations. Temperature and relative humidity only have a small effect on the model performance, while the time series of

The IntelliO3-ts network extends previous work by using a new network architecture, and training one model on a much larger set of measurement station data and longer time periods.
In light of

Table

Number of samples (input and output pairs) per station separated by training, validation (val), and test dataset. “–” denotes no samples in a set.


Forecasts and observations are treated as random variables.
Let

The second factorisation is called the likelihood-base rate and consequently is given by

This section provides additional information about the MSE decomposition introduced by

The term AI is the square of the sample correlation coefficient and might be interpreted as the strength of the linear relationship between the forecast and the observation. This term ranges from 0 (no correlation) to 1 (perfect correlation). The term BI includes the square of the difference between the sample correlation coefficient and the ratio of the standard deviations of the forecast and the observation. Therefore, BI is a measure of the conditional bias of the forecast, which is always positive due to the square, and tends to decrease the skill as it is a subtrahend. The last term, which is included in all cases (I–IV), is CI and contains the square of the difference between the mean forecast and the mean observation, divided by the variance of the observation. Therefore, CI is a measure of the unconditional bias in the forecast and, again, tends to decrease the skill as it is a subtrahend which is always greater than or equal to zero.
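These three terms can be computed numerically as follows (a sketch following the description above; helper name ours, single-value internal climatology reference):

```python
import numpy as np

def skill_decomposition(forecast, observation):
    """AI, BI, and CI terms of the skill-score decomposition
    SS = AI - BI - CI."""
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observation, dtype=float)
    r = np.corrcoef(f, o)[0, 1]
    ai = r ** 2                                    # potential skill
    bi = (r - f.std() / o.std()) ** 2              # conditional bias
    ci = ((f.mean() - o.mean()) / o.std()) ** 2    # unconditional bias
    return ai, bi, ci

# a perfect forecast has AI = 1 and no conditional or unconditional bias
ai, bi, ci = skill_decomposition([1.0, 2.0, 3.0, 4.0],
                                 [1.0, 2.0, 3.0, 4.0])
```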

In the case of multi-value internal climatology (Case II, Eq.

Three additional terms (AIV, BIV, and CIV) appear if Eq. (

Summarised skill scores

Specific compile options passed to Keras' compile method. Other keywords which are not listed in this table are left with default values.

Specific information and rates used to set up the model architecture.

Figure

Skill scores of IntelliO3-ts with respect to climatological reference forecast, with internal single value reference (Case I), internal multi-value (monthly) reference (Case II), external single (Case III), and external multi-value (monthly) reference (Case IV) for all lead times from 1 d (dark blue) to 4 d (light blue). All terms are described in Sect.

Each node on JUWELS

Figures

First part of the network showing the input, the first padding, convolution and activation, and the first inception block.
This figure was created with Netron

Second part of the network after the “concatenate” layer in Fig.

The following section contains all conditional quantile plots decomposed for all seasons (DJF: Fig.

Same as Fig.

Same as Fig.

Same as Fig.

Same as Fig.

The current version of IntelliO3-ts is available from the project website:

FK and MGS developed the concept of the study. All authors jointly developed the concept of the machine learning model. FK implemented the neural network and performed the experiment. FK had the lead in writing the manuscript with contributions from LHL and MGS. LHL had the technical lead in code development and workflow design. All authors revised the final manuscript and submitted it to

The authors declare that they have no conflict of interest.

We are grateful to all air quality data providers who made their data available in the TOAR database. Moreover, we thank the meteorological section of the Institute of Geosciences at the University of Bonn, which provided the COSMO reanalysis data. We thank Sabine Schröder for the help in accessing data through the JOIN interface and Jenia Jitsev for helpful discussions.

The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (

This research has been supported by the European Research Council (grant no. IntelliAQ (787576)). The article processing charges for this open-access publication were covered by a Research Centre of the Helmholtz Association.

This paper was edited by Juan Antonio Añel and reviewed by two anonymous referees.