Global, high-resolution mapping of tropospheric ozone – explainable machine learning and impact of uncertainties

Tropospheric ozone is a toxic greenhouse gas with a highly variable spatial distribution, which makes it challenging to map on a global scale. Here we present a data-driven ozone mapping workflow that generates a transparent and reliable product. We map the global distribution of tropospheric ozone from sparse, irregularly placed measurement stations to a high-resolution regular grid using machine learning methods. The produced map contains the average tropospheric ozone concentration of the years 2010-2014 with a resolution of 0.1° × 0.1°. The machine learning model is trained on AQ-Bench, a precompiled benchmark dataset consisting of multi-year ground-based ozone measurements combined with an abundance of high-resolution geospatial data. Going beyond standard mapping methods, this work focuses on two key aspects to increase the integrity of the produced map. Using explainable machine learning methods, we ensure that the trained machine learning model is consistent with commonly accepted knowledge about tropospheric ozone. To assess the impact of data and model uncertainties on our ozone map, we show that the machine learning model is robust against typical fluctuations in ozone values and geospatial data. By inspecting the input features, we ensure that the model is only applied in regions where it is reliable. We provide a rationale for the tools we use to conduct a thorough global analysis. The methods presented here can thus be easily transferred to other mapping applications to ensure the transparency and reliability of the maps produced.

emissions. It is typically formed downwind of precursor sources from traffic, industry, vegetation, and agriculture, under the influence of solar radiation. Ozone patterns are also influenced by the local topography, which causes specific flow patterns (Monks et al., 2015; Brasseur et al., 1999). Depending on the on-site conditions, ozone can be destroyed in a matter of minutes or have a lifetime of several weeks, with advection from source regions to remote areas (Wallace and Hobbs, 2006). The interrelation of these factors of ozone formation, destruction, and transport is not fully understood (Schultz et al., 2017). This makes ozone both difficult to quantify and to control. See Brasseur et al. (1999) and Monks et al. (2015) for more details on the ozone life cycle. Public authorities recognize ozone-related problems. To quantify ozone, they install air quality monitoring networks (Schultz et al., 2015, 2017). Furthermore, they enforce maximum exposure rules to mitigate ozone health and vegetation impacts (e.g. European Union, 2008).
Currently, there is increased use of machine learning methods in tropospheric ozone research. Such "intelligent" algorithms can learn nonlinear relationships of ozone processes and connect them to environmental conditions, even if their interrelations are not well understood through process-oriented research. Kleinert et al. (2021) and Sayeed et al. (2021) used convolutional neural networks to forecast ozone at several hundred measurement stations, based on meteorological and air quality data. Large training datasets allowed them to train deep neural networks, resulting in a significant improvement over the first machine learning attempts to forecast ozone (Comrie, 1997; Cobourn et al., 2000). Machine learning is also extensively used to calibrate low-cost ozone monitors that can then complement existing ozone monitoring networks (Schmitz et al., 2021; Wang et al., 2021). Furthermore, compute-intensive chemical reaction schemes for numerical ozone modeling in atmospheric models can be emulated using machine learning (Keller et al., 2017; Keller and Evans, 2019). Ozone and ozone precursor datasets that are used as training data for machine learning models are increasingly being made available as FAIR (Wilkinson et al., 2016) and open data. AQ-Bench ('air quality benchmark dataset', Betancourt et al., 2021b), for example, is a dataset for machine learning on global ozone metrics and serves as training data for this mapping study.
We refer to mapping as a data-driven method for spatial prediction of environmental target variables. For mapping, a model is fitted to observations of the target variable at a number of measurement sites, which may be sparse and irregularly placed. Environmental features, which serve as proxies for the target variable, are used to fit the model. The AQ-Bench dataset links ozone metrics at measurement stations to easy-access geospatial data. It contains aggregated ozone statistics of the years 2010-2014 at 5577 stations around the globe, compiled from the database of the Tropospheric Ozone Assessment Report (TOAR, Schultz et al., 2017). The AQ-Bench dataset considers ozone concentrations on a climatological time scale instead of day-to-day air quality data. The scope of this dataset is to discover purely spatial relations. Machine learning models trained on this dataset will output aggregated statistics over the years 2010-2014 and will not be able to capture temporal variances. This is beneficial if the required final data products are also aggregated statistics. The majority of the stations are located in North America, Europe, and East Asia. The dataset contains different kinds of ozone statistics such as percentiles or health-related metrics. This study focuses solely on the average ozone statistic as target (Fig. 1).
The features in the AQ-Bench dataset characterize the measurement site and are proxies for ozone formation, destruction, and transport processes. For example, the 'altitude' and 'relative altitude' of the station are important proxies for local flow patterns and ozone sinks. Other features are 'population density' in different radii around every station, which are proxies for human activity and thus ozone precursor emissions. 'Latitude' is a proxy for ozone formation through photochemistry, as radiation and heat generally increase towards the equator. The land cover variables are proxies for precursor emissions and deposition. The full list of features and their relation to ozone processes are documented by Betancourt et al. (2021b). The features used in this study are listed in Table 1. Features that are only available at station locations and not in gridded format are excluded because they cannot be used for mapping.
With respect to geographical coordinates, only 'latitude' is used, which is a proxy for ozone formation through photochemistry.
'Longitude' is not a proxy for ozone formation and is therefore not used. The original gridded data used here (Appendix A) have a resolution of 0.1° × 0.1° or finer. Since our target resolution is 0.1° × 0.1°, gridded data with a finer original resolution are aggregated to that resolution. The 'land cover', 'population', and 'light pollution' features of the AQ-Bench dataset are not point data at the station, but spatial aggregates within a certain radius around the station (see Table 1). To prepare gridded fields of these features, the area around each individual grid point is considered, and the required radius aggregate is written to that grid point.
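The radius aggregation described above can be sketched as follows; the grid values, the radius, and the choice of the mean as aggregate are illustrative stand-ins for the actual AQ-Bench preprocessing:

```python
import numpy as np

def radius_aggregate(grid, radius_cells):
    """Replace each grid cell with the mean of all cells within a given
    radius (in cells), mimicking the station-centred radius aggregation."""
    ny, nx = grid.shape
    out = np.empty_like(grid, dtype=float)
    yy, xx = np.ogrid[:ny, :nx]
    for i in range(ny):
        for j in range(nx):
            # boolean mask of all cells within the radius of cell (i, j)
            mask = (yy - i) ** 2 + (xx - j) ** 2 <= radius_cells ** 2
            out[i, j] = grid[mask].mean()
    return out

# illustrative 5x5 "population" field with one populated cell
field = np.zeros((5, 5))
field[2, 2] = 25.0
smoothed = radius_aggregate(field, radius_cells=1)
```

For real 0.1° grids, the radius in cells would be derived from the radius in kilometres and the latitude-dependent cell size.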

Explainable machine learning workflow
We apply a standard mapping workflow and extend it with explainable machine learning methods as described in this section. Together with the uncertainty assessment methods described in Sect. 2.3, they allow for a thorough analysis of our machine learning model. A random forest (Breiman, 2001) is fitted on the AQ-Bench dataset to predict average ozone at the corresponding measurement stations for given features. A random forest is an ensemble of regression trees that is created by bootstrapping the training dataset several times to increase generalizability. We choose random forest as a machine learning algorithm because tree-based models are the state of the art for structured data (Lundberg et al., 2020). Random forest was also shown to outperform linear regression and a shallow neural network in predicting average ozone on the AQ-Bench dataset (Betancourt et al., 2021b). In addition, this algorithm has proven to be well suited for mapping in several studies (Petermann et al., 2021; Nussbaum et al., 2018; Ren et al., 2020). We use the Python machine learning framework scikit-learn (Pedregosa et al., 2011) and the Python package Hyperactive (Blanke, 2021) for hyperparameter tuning.
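A minimal sketch of this training setup with scikit-learn, using synthetic stand-in data in place of AQ-Bench (the feature count and hyperparameters are illustrative, not those of the study):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for AQ-Bench: rows are stations, columns are
# geospatial features, target is average ozone in ppb.
X = rng.uniform(size=(500, 6))
y = 25 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 1, 500)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Random forest: an ensemble of regression trees, each fitted on a
# bootstrap sample of the training data.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
r2 = model.score(X_val, y_val)  # coefficient of determination R^2
```

In the actual workflow, the split would be the independent spatial split of AQ-Bench rather than a random one, and the hyperparameters would come from the tuning step.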
A proper validation strategy is crucial for spatial prediction models because both environmental conditions and target variables are often correlated in space. When tested on spatially correlated and thus statistically dependent samples, mapping results may be overconfident (Meyer et al., 2018; Ploton et al., 2020). We use the independent spatial data split provided with the AQ-Bench dataset to validate spatial generalizability. Details on our validation strategy are given in Sect. 2.2.1. After training and validation, the model is applied point-wise to the gridded data with a resolution of 0.1° × 0.1° to produce the final ozone map. As an extension of the standard mapping workflow described in Sect. 1, we perform experiments to increase interpretability, test robustness, and explain the model. The extended workflow is summarized in Table 2 and further justified in the following. The use of redundant features in mapping applications can favor spatial overfitting and even cause the machine learning model to learn properties of individual locations. We thus remove counterproductive features by forward feature selection as proposed by Meyer et al. (2018). Additionally, we apply basic feature engineering to increase the interpretability of the model.

Details on feature engineering and feature selection are described in Sect. 2.2.2.
In order to make our mapping model trustworthy, we need to verify its robustness and ability to generalize to previously unseen locations, and to explore the limits of its predictive capabilities. Noise in the AQ-Bench dataset might cause problems if the model is not robust. Additionally, the limited availability of ozone measurements in regions like Central and South East Asia, Central and South America, and Africa poses a problem, as it is unclear whether our model will generalize to these regions. Environmental factors and their interaction with ozone might be highly variable, especially over a large domain such as the entire globe. Because of that, our model can have high evaluation scores when tested on the world regions with many air quality measurement stations (Europe, North America, East Asia, see Fig. 1) but might not be as reliable in other regions. We address the issues of robustness and generalizability using the spatial cross validation strategy described in Sect. 2.2.3.
We also aim to explain how the model arrives at its predictions, and to check its consistency with common ozone process understanding, by using SHAP (SHapley Additive exPlanations, Lundberg and Lee, 2017), a post-hoc explainable machine learning method. It is a game-theoretic approach based on Shapley values (Shapley, 1953). In game theory, Shapley values provide a way of fairly distributing the outcome of a game among the 'players', i.e., the contributors to the game. For our random forest, SHAP identifies the importance of the individual features to a model prediction. We describe our SHAP implementation in Sect. 2.2.4.

Evaluation scores
We rely on the independent 60 % - 20 % - 20 % data split of AQ-Bench as provided by Betancourt et al. (2021b). Here, stations with a distance of more than 50 km are considered independent of each other. 60 % of the AQ-Bench dataset is used for training, and 20 % for validation. The remaining 20 % are only used for testing the final model that is used to generate the map. The evaluation score is the coefficient of determination

R^2 = 1 - \frac{\sum_{m=1}^{M} (y_m - \hat{y}_m)^2}{\sum_{m=1}^{M} (y_m - \bar{y})^2},

where m denotes a sample index, M the total number of samples, \hat{y}_m a predicted target value, y_m a reference target value, and \bar{y} the mean of the reference values.
R^2 measures the proportion of variance in the output values that the model predicts. Thus, a larger R^2 represents a better model, and the largest possible value is 1, which is equal to 100 %. We also evaluate the root mean square error (RMSE) in ppb, which directly indicates the expected error of the predicted ozone levels:

RMSE = \sqrt{\frac{1}{M} \sum_{m=1}^{M} (\hat{y}_m - y_m)^2}.
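Both scores can be computed directly from their definitions; the values below are toy numbers, not results from this study:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total variance."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error, in the units of the target (here ppb)."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# toy reference and predicted average ozone values in ppb
y_true = np.array([30.0, 25.0, 35.0, 28.0])
y_pred = np.array([29.0, 26.0, 34.0, 29.0])
```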

Feature engineering and feature selection
Basic feature engineering is performed to improve the interpretability of our model. Different types of savanna, shrublands, and forests are given individually in AQ-Bench (see Table 1). We merge them into 'savanna', 'forest', and 'shrubland' because a high number of features with similar properties would make the model interpretation more difficult.
Instead of 'latitude', we train on the 'absolute latitude', since radiation and temperature decrease when moving away from the equator, regardless of whether one moves south or north. The feature 'absolute latitude' thus has a direct meaning with regard to increased ozone formation favored by high radiation or temperature. Compared to experiments performed without feature engineering, we did not see any change in evaluation scores on the validation set (not shown).
We use the forward feature selection method for spatial prediction models proposed by Meyer et al. (2018). First, a model is trained on each feature individually, and the feature with the best evaluation score on the validation set is kept. The model is then trained on each remaining feature along with the already selected features. The additional feature with the best evaluation score is appended to the existing list of features. This iterative approach is continued until the R^2 value drops, which indicates that a feature leads to overfitting. The final selected features are presented in Sect. 3.1.1.
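A sketch of this greedy procedure, applied to a synthetic dataset in which only one feature carries signal (the column indices, hyperparameters, and data are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forward_feature_selection(X_train, y_train, X_val, y_val):
    """Greedy forward selection: add the feature that most improves
    the validation R^2, stop as soon as the score drops."""
    remaining = list(range(X_train.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        scores = {}
        for f in remaining:
            cols = selected + [f]
            model = RandomForestRegressor(n_estimators=50, random_state=0)
            model.fit(X_train[:, cols], y_train)
            scores[f] = model.score(X_val[:, cols], y_val)
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:
            break  # adding any further feature degrades the validation score
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = scores[f_best]
    return selected, best_score

rng = np.random.default_rng(1)
X = rng.uniform(size=(400, 5))
y = 20 + 8 * X[:, 2] + rng.normal(0, 0.5, 400)  # only feature 2 matters
sel, score = forward_feature_selection(X[:300], y[:300], X[300:], y[300:])
```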

Spatial cross validation
We apply cross validation to prove the robustness of our model. We reserve 20 % of the AQ-Bench dataset for testing the final model, relying on the independent split of Betancourt et al. (2021b).
We split the remaining 80 % into four independent cross validation folds of 20 % each. Like Betancourt et al. (2021b), we assume that air quality measurement stations with a distance of at least 50 km are independent of each other. We therefore produce the cross validation folds with a two-step approach. First, we cluster the data based on the spatial location of the measurement sites using the density-based clustering algorithm DBSCAN (Ester et al., 1996). The maximum distance is set to 50 km, so stations closer than that distance are assigned to the same cluster. The resulting small clusters are randomly assigned to the cross validation folds. In the second step, larger clusters (n > 50) are split again with KMeans clustering (Duda et al., 2001) to ensure the same statistical distribution in all cross validation folds. The resulting smaller clusters are again randomly assigned to the cross validation folds. Fig. 3 (a) shows this data split.
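The first clustering step could be sketched as follows; the station coordinates are hypothetical, and the cluster-to-fold assignment is simplified compared to the two-step procedure described above:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)

# Hypothetical station coordinates (lat, lon) in degrees: three dense
# measurement regions plus two isolated stations.
centers = np.array([[48.0, 10.0], [40.0, -100.0], [35.0, 115.0]])
coords = np.vstack([c + rng.normal(0, 0.1, size=(30, 2)) for c in centers])
coords = np.vstack([coords, [[-30.0, 140.0], [60.0, 25.0]]])

# DBSCAN with the haversine metric: stations closer than 50 km end up
# in the same cluster, so no fold contains near-duplicates of another.
eps = 50.0 / 6371.0  # 50 km expressed as an angle on the unit sphere
db = DBSCAN(eps=eps, min_samples=1, metric="haversine")
labels = db.fit_predict(np.radians(coords))

# Assign whole clusters (not single stations) to the four folds;
# here deterministically via modulo instead of randomly.
folds = {label: int(label) % 4 for label in np.unique(labels)}
station_fold = np.array([folds[l] for l in labels])
```

Because entire clusters move into a fold together, no two stations closer than 50 km end up in different folds.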

We extend our spatial cross validation experiment to evaluate the generalizability of our predictions to world regions with few measurements. Here we divide the data into the three world regions North America, Europe, and East Asia (Fig. 3 (b)). Most measurement sites are located in one of these three regions, and we only consider stations in these world regions for this experiment. A random forest is fitted and evaluated on two of the three regions and additionally evaluated on the third region for comparison. For example, it is fitted and evaluated on data of Europe and North America and additionally evaluated on East Asia. The difference in the resulting evaluation scores shows the spatial generalizability of the model. The results of the spatial cross validation experiments described in this section are presented in Sect. 3.1.2.

Shapley Additive Explanations (SHAP)
SHapley Additive exPlanations (SHAP, Lundberg and Lee, 2017) provide detailed explanations for individual predictions by quantifying how each feature contributes to the result. The contribution refers to the average model output (or base value) over the training dataset: a feature with the SHAP value x causes the model to predict x more than the base value. To calculate SHAP values for our data and model, we use the TreeSHAP module (Lundberg et al., 12 Feb 2018) of the Python package SHAP (Lundberg, 2021), which has been specially tailored to tree-based models and therefore provides an efficient and accurate approach for our random forest model.
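TreeSHAP itself is an optimized algorithm; the brute-force definition of a Shapley value, shown here for a toy two-feature "model", makes the fair-distribution idea concrete (the model function and values are invented for illustration, with absent features replaced by background averages as in SHAP):

```python
import itertools
import math

def shapley_values(f, x, background):
    """Brute-force Shapley values: the weighted average marginal
    contribution of each feature over all coalitions of the others."""
    n = len(x)

    def v(coalition):
        # features outside the coalition are set to their background value
        z = [x[i] if i in coalition else background[i] for i in range(n)]
        return f(z)

    phis = []
    for i in range(n):
        phi = 0.0
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in itertools.combinations(others, k):
                w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                phi += w * (v(set(S) | {i}) - v(set(S)))
        phis.append(phi)
    return phis

# Toy "model": high absolute latitude lowers ozone, altitude raises it.
f = lambda z: 30.0 - 0.2 * z[0] + 0.004 * z[1]
x = [50.0, 1500.0]   # the sample to explain (abs. latitude, altitude)
bg = [30.0, 300.0]   # training-set averages
phi = shapley_values(f, x, bg)
base = f(bg)         # the base value: prediction for average inputs
```

By construction, the base value plus the sum of the Shapley values equals the model prediction for the explained sample.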

Methods to assess the impact of uncertainties
Uncertainty assessment increases the trustworthiness of our machine learning approach and final ozone map. In general, the predictions of machine learning models have two kinds of uncertainties (Gawlikowski et al., 7 Jul 2021): first, model uncertainty, which results from the trained machine learning model itself, and second, data uncertainty, which stems from the uncertainty inherent in the data. It is common to treat these uncertainties separately. Developing an uncertainty assessment strategy for our mapping approach is challenging because different uncertainties arise at different stages of the mapping process. Every ozone measurement, every preprocessing step, and every model prediction is a potential source of error. It would be infeasible to investigate the impacts of each and every error. We therefore identify the most important error sources and analyze the uncertainty induced in our produced map only for these. The decision on which aspects to analyze specifically is based on expert knowledge and on the results of our machine learning experiments, i.e., the robustness analysis (Sect. 2.2.3) and SHAP values (Sect. 2.2.4). We develop a formalized approach which is summarized in Table 3 and further elaborated in the following. One source of model uncertainty is training instability, e.g., when models produce different results depending on the random seeds set before training (as for example in Petermann et al., 2021). To rule out this training instability, we re-trained our models several times with different random seeds and monitored the results. We found negligible variations and thus rule out this kind of uncertainty (not shown). Apart from uncertainty through training instability, the model uncertainty is usually also high for predictions in areas of the feature space where training data is sparse (Lee et al., 26 Nov 2017; Meyer and Pebesma, 2021). For example, a model that was not trained on data from very high mountains or deserts is not expected to produce reliable results in areas with these characteristics.
We apply the concept of 'area of applicability' by Meyer and Pebesma (2021) to limit our mapping to regions where our model is expected to produce reliable results. The details are described in Sect. 2.3.1.
The target variable 'average ozone' is the first choice for the assessment of data errors.
Fluctuations and random measurement errors introduce uncertainty into the ozone measurements. We evaluate the uncertainty introduced by these fluctuations with an error model (Sect. 2.3.2). Additional data uncertainty stems from the features. For example, geospatial data derived from satellite products are sensitive to retrieval errors. Based on the sources and documentation of our geospatial data (Appendix A), we expect such errors to have a small impact in this study. However, we inspect the subgrid features in the geospatial data and their effect on the model results. We limit ourselves to the 'altitude' because our SHAP analysis (Sect. 3.1.3) has shown that it is the most important feature besides 'latitude', which does not have critical subgrid variations. Subgrid variations of the altitude might influence our final map, especially if a feature like a cliff or a high mountain is present in the respective grid cell. We evaluate the influence of subgrid variations in altitude on the final map by propagating higher resolution altitudes through the final model as described in Sect. 2.3.3.
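Propagating subgrid altitudes through a trained model could look like the following sketch; the model, coefficients, and altitude samples are synthetic assumptions, not the study's data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Toy model trained on (altitude, latitude) -> average ozone in ppb.
X = np.column_stack([rng.uniform(0, 3000, 800), rng.uniform(0, 70, 800)])
y = 25 + 0.004 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.5, 800)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# One 0.1° grid cell: hypothetical subgrid altitude samples from a
# finer DEM. Propagate each through the model and take the spread of
# the predictions as the subgrid-induced uncertainty of that cell.
lat = 45.0
subgrid_altitudes = np.array([800.0, 950.0, 1100.0, 1400.0, 2100.0])
preds = model.predict(np.column_stack(
    [subgrid_altitudes, np.full(subgrid_altitudes.shape, lat)]))
subgrid_std = preds.std()
```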

Area of applicability method
We adopted the area of applicability method from Meyer and Pebesma (2021), and we refer to that study for a detailed derivation. The method considers the distance of a prediction sample to the training samples in the multidimensional feature space. Predictions for samples whose feature combinations are far from the training data suffer from high uncertainty. Consequently, we use the area of applicability method to flag data points with a great distance to the training data cluster as 'not predictable'.
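A simplified version of this feature-space distance check is sketched below; the normalization by a plain mean pairwise distance is a simplification of the weighted, thresholded index in Meyer and Pebesma (2021), and the data are synthetic:

```python
import numpy as np

def dissimilarity_index(X_train, X_new):
    """Distance of each new sample to its nearest training sample in
    standardized feature space, scaled by the mean pairwise distance
    within the training data (simplified area-of-applicability check)."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    Xt = (X_train - mu) / sd
    Xn = (X_new - mu) / sd
    # nearest-training-sample distance for every new sample
    d_new = np.min(np.linalg.norm(Xn[:, None, :] - Xt[None, :, :], axis=2), axis=1)
    # average distance between training samples (normalization)
    d_train = np.linalg.norm(Xt[:, None, :] - Xt[None, :, :], axis=2)
    d_bar = d_train[np.triu_indices_from(d_train, k=1)].mean()
    return d_new / d_bar

rng = np.random.default_rng(4)
X_train = rng.normal(0, 1, size=(200, 3))
X_inside = rng.normal(0, 1, size=(5, 3))   # similar to the training data
X_outside = np.full((1, 3), 8.0)           # e.g. an unseen desert/high mountain
di_in = dissimilarity_index(X_train, X_inside)
di_out = dissimilarity_index(X_train, X_outside)
```

Samples with an index above a threshold (derived from the training data in the original method) would be flagged as 'not predictable'.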

Modeling ozone fluctuations
Here we describe our error model for evaluating the uncertainty introduced by typical ozone biases in the produced maps.

Such biases may arise from measurement uncertainties, local geographic effects, or an "unusual" environment with respect to precursor emission sources. We consider all of these effects as ozone measurement uncertainties although it would be more precise to say that they are uncertainties in the determination of ozone concentrations at the scale of our grid boxes.
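The perturbation-and-retrain loop behind such an error model can be sketched as follows; the 2 ppb perturbation scale, the synthetic data, and the reduced number of realizations are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)

# Toy setup: station features X, "true" average ozone y, grid to map.
X = rng.uniform(size=(300, 4))
y = 25 + 10 * X[:, 0] + rng.normal(0, 0.5, 300)
X_grid = rng.uniform(size=(50, 4))

# Error model: perturb the training targets with a typical ozone
# uncertainty (here 2 ppb, an assumed value), retrain, and record the
# spread of the mapped predictions over many realizations.
realizations = []
for seed in range(20):  # fewer realizations than the full error model
    y_perturbed = y + rng.normal(0, 2.0, size=y.shape)
    m = RandomForestRegressor(n_estimators=50, random_state=seed)
    m.fit(X, y_perturbed)
    realizations.append(m.predict(X_grid))
map_std = np.std(realizations, axis=0)  # per-grid-cell uncertainty
```

The per-cell standard deviation of the realizations is the map-level uncertainty attributable to the assumed ozone fluctuations.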

Results
The results of our explainable machine learning mapping workflow (Sect. 2.2) are presented in the following. The evaluation scores of the spatial cross validation (Table 4) show that all models are useful despite the variance in evaluation scores. On average, the models explain 61 % of the variance in ozone values. The mean R^2 score is 0.61 and the mean RMSE is 3.97 ppb. To put this RMSE value into perspective, 5 ppb is a conservative estimate for the ozone measurement error (Schultz et al., 2017). It is also lower than the 6.40 ppb standard deviation of the true ozone values of the training dataset (Fig. 1). Although the evaluation scores of all folds are in an acceptable range, the standard deviation of 0.08 ppb in the RMSEs shows that evaluation scores depend to some extent on the data split.
Concerning the spatial cross validation on different world regions, the R^2 value drops between 0.13 and 0.49 when training and validating in different world regions (Table 5).

The high ozone station (Fig. 7 (a)) is located in a rural area in the US with many agricultural fields and a smaller city nearby. The average ozone at this location is predicted to be high because the model uses the absence of forests, the low 'night light in 5 km area' value, and the 'absolute latitude' as features leading to high ozone values. This is consistent with Fig. 6, where it can be seen that a lower 'absolute latitude' often increases the ozone value. The French station (Fig. 7 (b)) is an urban background station surrounded by fields. It is located further north than the US station, which leads to a strong decrease in the predicted ozone value. The low '(relative) altitude' further decreases the predicted ozone.

In our uncertainty assessment, we combine findings from both the area of applicability (for matching features, Sect. 3.2.1) and the spatial cross validation (for spatial proximity, Tables 4 and 5). To quantify spatial proximity, we analyze, analogously to the area of applicability approach, the distances between measurement stations in geographical space: we calculate the mean distance of a measurement station to its closest neighboring station in a different cross validation set. Disregarding stations that are too far away from the others, we identified a distance of ca. 182 km (upper whisker) within which we expect an RMSE comparable to Table 4. We assume a higher RMSE for locations that are more than 182 km away from their closest neighboring measurement station. Fig. 8 shows the area of applicability of our model including this spatial distinction. In this figure, locations with unrestricted applicability (matching features and spatial proximity of up to 182 km to training locations) are marked in bright turquoise. Here we expect an RMSE in the range of 4 ppb and an R^2 value of about 0.55. We mark locations with a spatial distance of more than 182 km from training locations in a darker shade of turquoise. Here the RMSE may rise to about 5 ppb. Bright grey areas in Fig. 8 denote areas that are closer than 182 km to a measurement station but do not have feature combinations found in AQ-Bench.

390
The majority of the regions with good coverage of measurement stations (North America, Europe, and parts of East Asia) are well predictable. In these regions, only some areas in the far north and high mountains are not predictable.
Conversely, large areas in South and Central America, Africa, far northern regions, and Oceania have feature combinations different from the training data and are therefore not predictable. There are some regions in the Baltic area, South America, Africa, and South Australia where feature combinations can be predicted by the model, but they are far away from the AQ-Bench stations. A broader discussion of the global applicability of our machine learning model follows in Sect. 4.3.

Uncertainty due to ozone fluctuations is within an acceptable range
The error model for ozone uncertainties is described in Sect. 2.3.2. The error model converged fully after 100 realizations; see Appendix D for details on the error model convergence. The R^2 values of the perturbed models varied between 0.50 and 0.58. Fig. 9 shows the resulting standard deviation in the mapped ozone. We find that the assumed ozone fluctuations may lead to less certain predictions in specific areas, such as areas with sparse training data. We conclude, however, that our error model does not tend to amplify the effects of perturbed training data. This means that the machine learning algorithm smoothes out noise during training, which is explained by the core functioning of the random forest with its bootstrapping during training.

Uncertainty through subgrid DEM variation is within an acceptable range
This method was described in Sect. 2.3.3. In most regions of the world, subgrid DEM variations around the mean altitude are below 50 m (Fig. 10 (a)), e.g., in the central and eastern United States and in Europe except for the Alps. There are regions with higher variances, such as the Rocky Mountains and their surroundings, the Alps, and large parts of Japan outside Tokyo.
Figure 10 (b) shows how these variations influence the predicted ozone values. In the flat regions, the variance is below 0.5 ppb, and even in the very high variance regions, the deviation is seldom above 2 ppb. This means the model is robust against these variances. A few exceptions are present at the border of the area of applicability (see Sect. 3.2.1), e.g., in the Alps. But even in these regions, the deviation is well below 5 ppb, which is a conservative estimate for ozone fluctuations (Schultz et al., 2017). A discussion of implications for general subgrid variances can be found in Sect. 4.1.

Production of the final map
All selected features listed in Sect. 3.1.1 are used to fit the final model. In contrast to the experiments in the previous sections, we now train the model on 80 % of the AQ-Bench dataset and test it on the remaining 20 %, the independent test set. Fig. 2 shows the predictions of this model on the test set vs. the true average ozone values. The R^2 value of this model is 0.55 and the RMSE is 4.4 ppb. There is a spread around the 1:1 line, and extremes are not captured as well as values closer to the mean: true values of less than 20 ppb or more than 40 ppb are predicted with high bias, which is expected since random forests tend to predict both low and high extremes less accurately than values closer to the mean.
Predictions in the area of applicability are in a range between 9.4 and 56.5 ppb. This is not the full range of measured ozone values (Fig. 1). There are some characteristics that are visible at first sight, e.g. the north-south gradient in Europe and generally

Robustness
Based on Hamon et al. (2020), we define robustness in the context of this work as follows: the model and map are considered robust if they do not change substantially under noise or perturbations that could realistically occur. We define a change of 5 ppb in the RMSE score or the predicted ozone values as significant, because 5 ppb is a conservative estimate for ozone measurement errors (Schultz et al., 2017).

Methods to assess the robustness are part of both the explainable machine learning workflow (Table 2) and the uncertainty assessments (Table 3) of this study. Regarding the robustness of the training process, the cross validation results in Table 4 show that the model performance depended on the data split, with RMSE values between 3.83 and 4.04 ppb.
This was already noted by Betancourt et al. (2021b) and is regarded as an inherent limitation of a relatively small, noisy dataset.
Apart from that, there were no robustness issues in the training process, e.g., evaluation scores did not vary with different random seeds (not shown). We also tested the robustness regarding typical variances in the ozone and geospatial data. The results from Sect. 3.2.2 and Sect. 3.2.3 show that the produced ozone map is robust against these fluctuations. The variances are never above the initial perturbations, and variances in the map do not exceed our limit of 5 ppb defined above. Limits in the robustness were only shown through variances above 3 ppb at the borders of the area of applicability of the model, and in regions with sparse training data (grey and dark turquoise areas in Fig. 9 and 10). This outcome is especially interesting because it shows that the issues of applicability (discussed in more detail in Sect. 4.3) and robustness are interconnected. In areas where the model is applicable, it is also more robust and uncertainties are lower.
In order to make the robustness assessment with respect to data feasible, we strongly reduced the dimensionality of our error model by using expert knowledge about the problem. We only conducted two experiments in which we modified training data and model inputs (described in Sect. 2.3.2 and 2.3.3). These experimental setups were chosen because they are expected to generalize well to other similar experiments. Firstly, spatial fluctuations produce similar perturbations to temporal fluctuations (not shown), and secondly, subgrid variances of one feature are expected to generalize to other features. The combined robustness experiments have shown that our produced maps are robust.

Scientific consistency
We discuss the scientific consistency of our model by assessing the results of the explainable machine learning workflow (Table 2). We interpret the selected features, their importance, and their influence on the model predictions. The features are proxies for ozone processes, which makes it challenging to interpret the underlying chemical processes. Nevertheless, we can discuss whether the learned feature-prediction relations are plausible and consistent with our understanding of ozone processes. This is a purely a posteriori approach, meaning we did not in any way enforce scientific consistency during the training process.
'NO x emissions'. Geographic features are proxies for flow patterns and heat, not for ozone chemistry, which ozone researchers would expect to be of greater importance. This contradiction is due to the fact that the model provides an as-is view of ozone concentrations and is not process-oriented in any way. Note that many features, such as 'nightlight' and 'population density', are correlated, so retraining the model might swap dependence in the SHAP values, as noted by Lundberg et al. (2020).
The beeswarm plot in Fig. 6 shows the physical consistency of our model. The effect of 'absolute latitude' on predictions is consistent with what is known about ozone formation processes: ozone production generally increases towards the equator, where more sunlight is available. This is also evident in the highly latitudinally stratified ozone overview plots in global measurement-based studies such as TOAR health and TOAR vegetation (Fleming et al., 2018; Mills et al., 2018). Ozone is affected by meteorology (temperature, radiation) and precursor emissions (Sect. 1). The fact that there is no continuous increase of ozone towards tropical latitudes shows that the mapping model at least qualitatively captures the influence of low precursor emissions in the tropics. The importance of 'absolute latitude' also indicates that the model could be improved by including temperature and radiation features from meteorological data. High 'relative altitude' and 'altitude' both increase the predicted ozone. This altitude-ozone relation is consistent with Chevalier et al. (2007). There are also a few relatively important chemistry-related features. Very high values of 'nightlight in 5 km area' reduce the predicted ozone. This is consistent with NO titration (Monks et al., 2015). Nightlights are a proxy for human activity, generally in the context of fossil fuel combustion, which usually leads to elevated NO x concentrations. NO reacts with ozone and thus destroys it; especially during nighttime this leads to ozone levels close to zero ppb. High 'forests in 25 km area' values lead to lower ozone predictions. This is plausible because there is little human activity in forested areas and thus no combustion-related precursor emissions occur.
Quantification of either influence is not possible because, for example, it is unclear to what extent the different forests emit volatile organic compounds, which are also ozone precursors. A city with 'nightlight in 5 km area' = 50 cannot be directly quantified in terms of precursor emissions either.

It is also not expected that the machine learning model learns the ozone-related processes described above, because it is not process-based. Instead, it learns the effects of processes if they are reflected in the training data. SHAP values also offer the possibility to quantify the influence of features on single predictions (Fig. 7). This is helpful in certain special cases, e.g. when only a single prediction needs to be explained. In a global application, however, it might become infeasible: not all pixels in our ozone map can be explained one by one.
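The idea behind such per-prediction attributions can be illustrated with an exact Shapley computation on a toy model. The study itself applies the SHAP library to the random forest; the three-feature linear model, instance, and background values below are purely hypothetical, and the brute-force enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial

def model(x):
    """Hypothetical stand-in for the trained regressor (3 features)."""
    lat, alt, night = x
    return 30.0 - 0.1 * lat + 0.1 * alt - 0.05 * night

def shapley_values(model, instance, background):
    """Exact Shapley values: average marginal contribution of each feature
    over all coalitions. Features outside the coalition are set to their
    background (dataset mean) value."""
    n = len(instance)
    idx = list(range(n))
    phi = [0.0] * n

    def value(coalition):
        x = [instance[i] if i in coalition else background[i] for i in idx]
        return model(x)

    for i in idx:
        others = [j for j in idx if j != i]
        for size in range(n):
            for S in combinations(others, size):
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi

instance = [10.0, 1500.0, 0.0]    # low latitude, high altitude, no nightlight
background = [40.0, 300.0, 20.0]  # invented dataset means
phi = shapley_values(model, instance, background)
# efficiency property: contributions sum to prediction minus baseline
total = sum(phi)
```

For a linear model each Shapley value reduces to the coefficient times the feature's deviation from the background, and the efficiency property guarantees the contributions sum to the prediction minus the baseline.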

The forward feature selection (Sect. 2.2.2 and 3.1.1) can also be discussed in terms of plausibility. Features selected by this method favor a generalizable model. Discarded features may have some connection to ozone and help to characterize the locations, but their addition to the training data did not lead to a more generalizable model. This can have different reasons. 'Urban and built-up in 25 km area' was presumably not selected because urban areas are often very localized. This feature is therefore not as meaningful as the variables 'nightlight' and 'population density', which are also proxies for human activity, but are available at higher resolution. Similarly, the feature 'cropland / natural vegetation mosaic in 25 km area' was discarded because ozone is affected differently by croplands and natural vegetation. Together with the large area considered, this feature becomes obsolete. We suspect the features 'snow and ice in 25 km area', 'barren or sparsely vegetated in 25 km area', and 'wheat production' did not contribute to the model generalizability because they are simply not represented well in the training data. A feature may be an important proxy for ozone, but if the relationship is not expressed in the training data, it cannot be learned by a machine learning model. Such a feature could become more important if other training locations were included. This shows that the placement of measurement locations is crucial.
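The greedy forward selection loop can be sketched as follows. This is a minimal illustration: a 1-nearest-neighbour regressor and tiny synthetic data stand in for the random forest and AQ-Bench, and all names and values are invented.

```python
def knn_predict(train_X, train_y, x, feats):
    """1-nearest-neighbour regressor restricted to the selected features."""
    best = min(range(len(train_X)),
               key=lambda i: sum((train_X[i][f] - x[f]) ** 2 for f in feats))
    return train_y[best]

def validation_rmse(train_X, train_y, val_X, val_y, feats):
    err = [(knn_predict(train_X, train_y, x, feats) - y) ** 2
           for x, y in zip(val_X, val_y)]
    return (sum(err) / len(err)) ** 0.5

def forward_selection(train_X, train_y, val_X, val_y, n_features):
    """Greedily add the feature that lowers validation RMSE the most;
    stop as soon as no remaining feature improves the score."""
    selected, best_rmse = [], float("inf")
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        scored = [(validation_rmse(train_X, train_y, val_X, val_y,
                                   selected + [f]), f) for f in candidates]
        rmse, f = min(scored)
        if rmse >= best_rmse:
            break  # remaining features do not improve generalization
        selected, best_rmse = selected + [f], rmse
    return selected, best_rmse

# synthetic data: the target depends only on feature 0, feature 1 is noise
train_X = [[0.0, 5.0], [1.0, 9.0], [2.0, 1.0], [3.0, 7.0]]
train_y = [0.0, 10.0, 20.0, 30.0]
val_X = [[0.4, 3.0], [2.6, 8.0]]
val_y = [4.0, 26.0]
selected, rmse = forward_selection(train_X, train_y, val_X, val_y, n_features=2)
# the uninformative feature 1 is discarded: adding it does not lower the RMSE
```

The stopping criterion mirrors the argument in the text: a feature is kept only if it makes the model generalize better on held-out data, regardless of whether it has some physical connection to the target.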

Mapping the global domain
The model has to generalize to unseen locations for global mapping. Two prerequisites are: 1) The model must have seen the feature combination during training. 2) The connection between the features and the target, ozone, must be the same. The two conditions are only fulfilled in a strictly constrained space, as shown in Fig. 8. Because the model uncertainty rises when training and testing in different world regions (Table 5), we also conducted cross validation in the spatial domain in Sect. 3.2.1. In addition, we conducted the spatial cross validation with a shallow neural network (as in the baseline experiments of Betancourt et al. (2021b)). The neural network had similar evaluation scores on the test set, but it did not generalize as well, showing even negative R² values when evaluated in other world regions (not shown).
One reason for this could be that the random forest is an ensemble model and thus generalizes better under noisy data. We therefore decided to discard the neural network architecture, because our main goal is global generalizability.
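The effect described above, good test scores but negative R² on held-out world regions, can be reproduced with a leave-one-region-out sketch. The stations, region names, and the deliberately naive mean predictor below are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination relative to the held-out data's own mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def spatial_cross_validation(stations, train_and_predict):
    """Leave-one-region-out: fit on all other regions, evaluate on the held-out one."""
    scores = {}
    for region in sorted({s["region"] for s in stations}):
        train = [s for s in stations if s["region"] != region]
        test = [s for s in stations if s["region"] == region]
        preds = train_and_predict(train, test)
        scores[region] = r_squared([s["ozone"] for s in test], preds)
    return scores

def mean_model(train, test):
    """Naive model: predict the mean ozone of the training stations."""
    mean = sum(s["ozone"] for s in train) / len(train)
    return [mean] * len(test)

# invented stations with a systematic regional offset in ozone (ppb)
stations = [
    {"region": "Europe", "ozone": 30.0},
    {"region": "Europe", "ozone": 32.0},
    {"region": "Europe", "ozone": 34.0},
    {"region": "East Asia", "ozone": 40.0},
    {"region": "East Asia", "ozone": 44.0},
    {"region": "East Asia", "ozone": 48.0},
]
scores = spatial_cross_validation(stations, mean_model)
# a model that ignores regional differences scores negative R² on held-out regions
```

A negative R² simply means the model performs worse on the held-out region than predicting that region's own mean, which is exactly the failure mode the spatial cross validation is designed to expose.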
We can confidently map Europe, large parts of the US, and East Asia, where the majority of the measurement stations are located. These are all industrialized regions in the northern hemisphere.

The cross validation results (Sect. 3.1.2), the area of applicability (Sect. 3.2.1), and expert knowledge all confirm that uncertainties increase when a model trained on the AQ-Bench dataset is applied to other world regions. However, the cross validation in connection with the area of applicability technique shows that the models can be used in other world regions with acceptable uncertainties. That is promising for future global mapping approaches. One idea to address the different connections between features and ozone in different world regions is to train localized models and apply them wherever possible. Localized models could not only yield more accurate predictions, but in connection with SHAP values (Sect. 2.2.4) they could also reveal the governing factors of ozone in the respective regions and be easier to interpret.
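The core of the area-of-applicability check can be sketched as follows. This is a strongly simplified stand-in for the actual dissimilarity-index approach: two invented features, Euclidean distance in standardized feature space, and a threshold taken from the training set's internal nearest-neighbour distances.

```python
def standardize(points, means, stds):
    return [[(x - m) / s for x, m, s in zip(p, means, stds)] for p in points]

def nearest_distance(point, references):
    """Euclidean distance to the closest reference point."""
    return min(sum((a - b) ** 2 for a, b in zip(point, r)) ** 0.5
               for r in references)

def area_of_applicability(train_X, grid_X):
    """Flag grid cells whose feature combination lies close enough to the
    training stations in standardized feature space."""
    n_feat = len(train_X[0])
    means = [sum(p[f] for p in train_X) / len(train_X) for f in range(n_feat)]
    stds = [(sum((p[f] - means[f]) ** 2 for p in train_X) / len(train_X)) ** 0.5
            for f in range(n_feat)]
    train_s = standardize(train_X, means, stds)
    grid_s = standardize(grid_X, means, stds)
    # threshold: largest distance of a training point to the rest of the training set
    threshold = max(
        nearest_distance(p, train_s[:i] + train_s[i + 1:])
        for i, p in enumerate(train_s)
    )
    return [nearest_distance(g, train_s) <= threshold for g in grid_s]

# invented two-feature station data and two grid cells, one inside, one far outside
train_X = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [1.0, -1.0]]
grid_X = [[1.0, 0.0], [10.0, 10.0]]
inside = area_of_applicability(train_X, grid_X)
```

Grid cells flagged as outside correspond to the grey areas in the maps: the model has never seen a comparable feature combination, so its predictions there carry higher uncertainty.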
With regard to the spatial domain, we can also discuss the resolution. The model was trained on point data of 'absolute latitude', 'altitude', and 'relative altitude', and technically one could produce more fine-grained maps if the gridded input data were available at higher resolution. The model is 'perfect' in this regard, because it was trained on infinite-resolution point data as provided by TOAR. However, one may need to reconsider some assumptions made here in terms of regional representativity of the measurements and the relation between geographic features and ozone on a different scale.

Prospects for ozone mapping
We mapped average tropospheric ozone from the stations in the AQ-Bench dataset to a global domain. For this, we fused different auxiliary geospatial datasets and gridded data with machine learning. We used features that are known proxies for ozone processes and that were already proven to enable a prediction of ozone concentrations (Betancourt et al., 2021b). Our choice of data and algorithms is well justified and transparent. Errors did not exceed 5 ppb, which is in the range of the measurement error and therefore an acceptable uncertainty. The R² value of the final model is 0.55, which is a good value for properly validated mapping. The maps produced show known patterns of ozone such as lower levels in metropolitan areas and higher levels in mediterranean or mountainous regions. However, extremes (Fig. 2) are not predicted well and show a higher bias. This can be considered a general problem of machine learning (Guth and Sapsis, 2019), but was also noted in other ozone modeling studies (Young et al., 2018).
For this first approach, we limited ourselves to the static mapping of aggregated mean ozone. An advantage of this approach is that the model result is directly the ozone metric of interest (in this case average ozone). Since the AQ-Bench dataset contains other ozone metrics, they could be mapped as well. For example, vegetation- or health-related ozone metrics can be mapped with the same workflow and training dataset as described here. Another advantage is that we used a multitude of inputs that could not be used in a traditional model because their connection to ozone is unknown. This means we exploit two benefits of machine learning: first, obtaining a bias-free estimate of the target directly, and second, using a multitude of inputs with unknown direct impact on the target.

Our model is only valid for the training data period (2010-2014), and it is not suitable to predict ozone values in other years. Our data product is a map that is aggregated in time. This can be a limitation, as sometimes the data product of interest is a seasonal aggregate or even a map of daily or hourly air pollutant concentrations. In that regard, the use of meteorological data as static or non-static inputs could be beneficial to further increase model performance and allow time-resolved mapping. We applied a completely data-driven approach, relying heavily on geospatial data. The other side of the spectrum is DeLang et al. (2021), who fused chemical transport model output with observations without exploiting the connection to any other features. A possible direction to go from here is described by Irrgang et al. (2021), who propose the fusion of models and machine learning to benefit from both methods.

Conclusions
In this study, we developed a completely data-driven, machine learning-based global mapping approach for tropospheric ozone. We mapped from the 5577 irregularly placed measurement stations of the AQ-Bench dataset (Betancourt et al., 2021b) to a regular 0.1° × 0.1° grid. We used a multitude of geospatial datasets as input features. To our knowledge, this is the first completely data-driven approach to global ozone mapping. We combined this mapping with an end-to-end approach for explainable machine learning and uncertainty estimation. This allowed us to assess the robustness, scientific consistency, and global applicability of the model. We linked interpretation tools with domain knowledge to obtain application-specific explanations, which is in line with Roscher et al. (2020). The methods are interconnected, e.g.
forward feature selection also made the model easier to interpret. Likewise, the area of applicability was shown to match the model's robustness. We justified the choice of tools and detailed how they provided us with the results needed for a comprehensive global analysis. The combination of explainable machine learning and uncertainty quantification makes the model and its outputs trustworthy. The map we produced therefore provides information on the global ozone distribution and is a transparent and reliable data product.
We explained the outcome and the model, which can lead to new scientific insights. Mapping studies like ours could also contribute to studies like Sofen et al. (2016), which propose locations for new air quality measurement sites to extend the observation network. Here, the inspection of the feature space helps to cover not only spatial world regions but also air quality regimes and areas with diverse geographic characteristics. New station locations can also be proposed based on their contribution to maximizing the area of applicability (Stadtler et al., 2022). The map as a data product can also be used to refine studies like TOAR (Fleming et al., 2018; Mills et al., 2018) because it enables analyses at locations with no measurement stations. It would be beneficial to add time-resolved input features to the training data to improve evaluation scores and increase the temporal resolution of the map. Adding training data from regions like East Asia, or from new data sources such as OpenAQ, would close the gaps in the global ozone map.
Code and data availability.

Appendix A: Technical details on the data

Table A1: Technical details on the data used in this work. For more information on the station location data, refer to Betancourt et al. (2021b). Please note that 'land use in 25 km area' comprises all the different land cover features.

… area in the US with many agricultural fields and a smaller city nearby. The average ozone at this location is predicted to be high because the model uses the absence of forests, the low 'nightlight in 5 km area' value, and the 'absolute latitude' as features leading to high ozone values. This is consistent with Fig. 6, where it can be seen that a lower 'absolute latitude' often increases the ozone value.
The French station (b) is an urban background station surrounded by fields. The location is further north than the US station, which leads to a strong decrease in the predicted ozone value. The low '(relative) altitude' further decreases the predicted ozone.