Application of random forest regression to the calculation of gas-phase chemistry within the GEOS-Chem chemistry model v10

. Atmospheric chemistry models are a central tool to study the impact of chemical constituents on the environment, vegetation and human health. These models are numerically intense, and previous attempts to reduce the numerical cost of chemistry solvers have not delivered transformative change. Weshow here the potential of a machine learning (in this case random forest regression) replacement for the gas-phase chemistry in atmospheric chemistry transport models. Our training data consist of 1 month (July 2013) of output of chemical conditions together with the model physical state, produced from the GEOS-Chem chemistry model v10. From this data set we train random forest regression models to predict the concentration of each transported species after the integrator, based on the physical and chemical conditions before the integrator. The choice of prediction type has a strong impact on the skill of the regression model. We ﬁnd best re-sults from predicting the change in concentration for long-lived species and the absolute concentration for short-lived species. We also ﬁnd improvements from a simple implementation of chemical families (NO x = NO + NO 2 )

Abstract. Atmospheric chemistry models are a central tool to study the impact of chemical constituents on the environment, vegetation and human health. These models are numerically intense, and previous attempts to reduce the numerical cost of chemistry solvers have not delivered transformative change.
We show here the potential of a machine learning (in this case random forest regression) replacement for the gas-phase chemistry in atmospheric chemistry transport models. Our training data consist of 1 month (July 2013) of output of chemical conditions together with the model physical state, produced from the GEOS-Chem chemistry model v10. From this data set we train random forest regression models to predict the concentration of each transported species after the integrator, based on the physical and chemical conditions before the integrator. The choice of prediction type has a strong impact on the skill of the regression model. We find best results from predicting the change in concentration for longlived species and the absolute concentration for short-lived species. We also find improvements from a simple implementation of chemical families (NO x = NO + NO 2 ).
We then implement the trained random forest predictors back into GEOS-Chem to replace the numerical integrator. The machine-learning-driven GEOS-Chem model compares well to the standard simulation. For ozone (O 3 ), errors from using the random forests (compared to the reference simulation) grow slowly and after 5 days the normalized mean bias (NMB), root mean square error (RMSE) and R 2 are 4.2 %, 35 % and 0.9, respectively; after 30 days the errors increase to 13 %, 67 % and 0.75, respectively. The biases become largest in remote areas such as the tropical Pacific where errors in the chemistry can accumulate with little balancing influence from emissions or deposition. Over polluted regions the model error is less than 10 % and has significant fidelity in following the time series of the full model. Modelled NO x shows similar features, with the most significant errors occurring in remote locations far from recent emissions. For other species such as inorganic bromine species and short-lived nitrogen species, errors become large, with NMB, RMSE and R 2 reaching > 2100 % > 400 % and < 0.1, respectively.
This proof-of-concept implementation takes 1.8 times more time than the direct integration of the differential equations, but optimization and software engineering should allow substantial increases in speed. We discuss potential improvements in the implementation, some of its advantages from both a software and hardware perspective, its limitations, and its applicability to operational air quality activities.

Introduction
Atmospheric chemistry is central to many environmental problems, including climate change, air quality degradation, stratospheric ozone loss and ecosystem damage. Atmospheric chemistry models are important tools to understand these issues and to formulate policy. These models solve the three-dimensional system of coupled continuity equations for an ensemble of m species concentrations c = (c 1 , . . ., c m ) T expressed as number density (molec. cm −3 ) via 1210 C. A. Keller and M. J. Evans: Machine learning for atmospheric chemistry operation splitting of transport and local processes: (1) U denotes the wind vector, (P i (c) − L i (c) c i ) are the local chemical production and loss, E i is the emission rate, and D i is the deposition rate of species i. We ignore here molecular diffusion as it is negligibly slow compared to advection. The first term of Eq. (1) is the transport operator and involves no coupling between the chemical species. The second term is the chemical operator, which connects the chemical species through a system of simultaneous ordinary differential equations (ODEs) that describe the chemical production and loss: The numerical solution of Eq. (2) is computationally expensive as the equations are numerically stiff and require implicit integration schemes such as Rosenbrock solvers to guarantee numerical stability (Sandu et al., 1997a, b). As a consequence, 50 %-90 % of the computational cost of an atmospheric chemistry model such as GEOS-Chem can be spent on the integration of the chemical kinetics (Long et al., 2015;Nielsen et al., 2017;Eastham et al., 2018;Hu et al., 2018). Previous efforts to increase the efficiency of the integration (with an associated reduction in accuracy) have involved dynamical reduction in the chemical mechanism (adaptive solvers) (Santillana et al., 2010;Cariolle et al., 2017), separation of slow and fast species (Young and Boris, 1977), quasisteady state approximation (Whitehouse et al., 2004a) or approximation of the chemical kinetics using polynomial functions (repro-modelling) (Turányi, 1994). Other approaches have attempted to simplify the chemistry leading to a reduction in the number of reactants and species (Whitehouse et al., 2004b;Jenkin et al., 2008). However, none of these approaches have been transformative in their reduction in time spent on chemistry.
We discuss here the potential of a machine learning algorithm (in this case random forest regression) as an alternative approach to explicitly solving Eq. (2) with a numerical solver in the chemistry model GEOS-Chem. Figure 1 illustrates the approach: during each model time step, GEOS-Chem sequentially solves a suite of operations relevant to the simulation of atmospheric chemistry. In the original model, solving the chemistry is the computationally most expensive step. Our aim is to replace it with a machine learning algorithm while keeping all other processes unchanged. Conceptually, this approach is comparable to previous efforts to speed up the solution of the chemical equations through more efficient integration.
Machine learning is becoming increasingly popular within the natural sciences (Mjolsness and DeCoste, 2001) and specifically within the Earth system sciences to either simulate processes that are poorly understood or to emulate Figure 1. Schematic overview of the use of a random forest regression algorithm as an alternative to the chemistry solver. The original numerical model (GEOS-Chem) sequentially solves the operations relevant to atmospheric chemistry, with the chemical integrator being the computationally most expensive step (left side). Using training data produced from the full model, we generate a machine learning emulator that can then be used instead of the chemical integrator (right side). All other model processes are the same as in the original model. computationally demanding physical processes (notably convection) (Krasnopolsky et al., 2005(Krasnopolsky et al., , 2010Krasnopolsky, 2007;Jiang et al., 2018;Gentine et al., 2018;Brenowitz and Bretherton, 2018). Machine learning has also been used to replace the chemical integrator for other chemical systems such as those found in combustion and been shown to be faster than solving the ODEs (Blasco et al., 1998;Porumbel et al., 2014). Recently, Kelp et al. (2018) found order-of-magnitude speed-ups for an atmospheric chemistry box model using a neural network emulator, although their solution suffers from rapid error propagation when applied over multiple time steps. Machine learning emulators have also been explored to directly predict air pollution concentration in future time steps (Mallet et al., 2009), as well as for chemistryclimate simulations focusing on model predictions of timeaveraged concentrations for selected species such as ozone (O 3 ) and the hydroxyl radical (OH) over timescales of days to months (Nicely et al., 2017;Nowack et al., 2018). In contrast, the algorithm presented here is optimized to capture the small-scale variability of the entire chemical space within a timescale of minutes, with only a small loss of accuracy when used repeatedly over multiple time steps. To do so, we use the numerical solution of the GEOS-Chem chemistry model to produce a training data set of output before and after the chemical integrator (Sect. 2.1 and 2.2), train a machine learning algorithm to emulate this integration (Sect. 2.3, 2.4 and 2.5), and then describe and assess the trained machine learning predictors (Sect. 2.6, 2.7, 2.8 and 2.9). Section 3 describes the results of using the machine learning predictors to replace the chemical integrator in GEOS-Chem. In Sect. 4 we discuss potential future directions for the uses of this methodology and in Sect. 5 we draw some conclusions.

Chemistry model description
All model simulations were performed using the NASA Goddard Earth Observing System Model, version 5 (GEOS-5) with version 10 of the GEOS-Chem chemistry embedded (Long et al., 2015;Hu et al., 2018 The model chemistry scheme includes detailed tropospheric chemistry of oxides of hydrogen, nitrogen, bromine, volatile organic compounds and ozone (HO x -NO x -BrO x -VOC-ozone), as originally described by Bey et al. (2001), with the addition of halogen chemistry by Parrella et al. (2012) plus updates to isoprene oxidation as described by Mao et al. (2013). Photolysis rates are computed online by GEOS-Chem using the Fast-JX code of Bian and Prather (2002) as implemented in GEOS-Chem by Mao et al. (2010) and Eastham et al. (2014). The gas-phase mechanism comprises 150 chemical species and 401 reactions and is solved using the kinetic pre-processor (KPP) Rosenbrock solver (Sandu and Sander, 2006). There are 99 (very) short-lived species which are not transported, and we seek to emulate the evolution of the other 51 transported species. While the GEOS model with GEOS-Chem chemistry can be run as a chemistry-climate model where the chemical constituents (notably ozone and aerosols) directly feed back to the meteorology, we disable this option here and use prescribed ozone and aerosol concentrations for the meteorology instead. This ensures that any differences between the reference model and the machine learning model can be attributed to imperfections in the emulator rather than changes in meteorology due to chemistry-climate feedbacks.

Training data
To produce our training data set we run the model for 1 month (July 2013). Each hour we output the threedimensional instantaneous concentrations of each transported species immediately before and after chemical integration, along with a suite of environmental variables that are known to impact chemistry: temperature, pressure, relative humidity, air density, cosine of the solar zenith angle, cloud liquid water and cloud ice water. In addition, we output all photolysis rates since those are an essential element for chemistry calculations. Alternatively, one could also envision directly embedding the (computationally demanding) photolysis computation into the machine learning model, such that the emulator takes as input variables additional environmental variables relevant to photolysis (e.g. cloud cover, overhead ozone and aerosol loadings) and then emulates photolysis computation along with chemistry.
Each grid cell 1 h output constitutes one training sample, consisting of 126 input "features": the 51 transported species concentrations, 68 photolysis rates and the 7 meteorological variables. We restrict our analysis to the troposphere (lowest 25 model levels) since this is the focus of this work. Each hour thus produces a total of 327 600 (long × lat × lev = 144 × 91 × 25) training samples, and so an overall data set of 2.4 × 10 8 (long × lat × lev × days × hours = 144 × 91 × 25 × 31 × 24) samples is produced over the full month. We withhold a randomly selected 10 % of the samples to act as validation data while the remaining samples act as training data.

Random forest regression
We use the random forest regression (RFR) algorithm (Breiman, 2001) to emulate the integration of atmospheric chemistry. Figure 2 shows a schematic of RFR. It is a commonly used, and conceptually simple, supervised learning algorithm that consists of an ensemble (or forest) of decision trees. Each tree contains a tree-like sequence of decision nodes, based on which the tree splits into its various branches until the end of the tree ("the leaf") is reached. This leaf is the prediction of the decision tree. Each decision node is based on whether one of the input features is above a certain value. An important aspect of the random forest is that each tree of the forest is trained on a subset of the full training data, thus providing a slightly different approximation of the model. A prediction is then made by averaging the predictions of the individual trees.
The RFR algorithm is less prone to over-fitting and produces predictions that are more stable than a single decision tree (Breiman, 2001). Random forests are widely used since they are relatively simple to apply, suitable for both classification and regression problems, do not require data transformation, and are less susceptible to irrelevant or highly correlated input features. In addition, random forests allow for Figure 2. Schematic of random forest algorithm. For each species c i , we use a random forest consisting of 30 individual decision trees, each up to 12 layers deep (only the first four layers are shown). All decision trees take the same inputs (e.g. species concentration vector c at given location, photolysis rates J, temperature T , humidity q) and each decision tree node uses one of the input features plus a threshold value to determine the tree path for the given set of input features. The final prediction is made by averaging the 30 individual tree predictions (c i,1 , c i,2 , . . ., c i,30 ). easy evaluation of the factors controlling the prediction, the decision structure and the relative importance of each input variable. Analysing these features can offer valuable insights into the control factors of the underlying mechanism, as discussed later. We discuss the potential for other algorithms in Sect. 4.

Implementation
For each of the 51 chemical species transported in the chemistry model, we generate a separate random forest predictor. This predictor can be applied to all model grid cells, i.e. it captures all chemical regimes encountered by the respective target species. Conceptually, one can imagine that each tree path represents a different chemical regime, so it is important to generate trees that are large enough to encompass the entire solution space. We find a good compromise between computational complexity and accuracy of the solutions for random forests consisting of 30 trees with a maximum of 10 000 leaves (prediction values) per tree. These hyper-parameter were determined by trial and error, and we find very little sensitivity of our results to changes (±50 %) to the number of trees and/or number of leaves. Each tree is trained on a different sub-sample of the training data by randomly selecting 10 % of the training sample. In order to balance the training samples across the full range of model values, the training samples are evenly drawn from each decile of the predictor variable. This prevents over-sampling of ocean grid cells, which are typically characterized by very uniform chemistry. Our results show very little sensitivity to the size of the training sample as long as it covers the full solution space.
The Python software package scikit-learn (http: //scikit-learn.org/stable/, last access: 18 March 2019) (Pedregosa et al., 2011) was used to build the forests. We distributed the training of the entire forest (30 trees for 51 species) onto 1530 CPUs, and each tree took 1 h to train. After training, all forest data (i.e. all tree node decisions and leaf values) were written into text files.
The forests were then embedded as a Fortran 90 subroutine into the GEOS-Chem chemistry module. Using an ad-hoc approach, the module first loads all tree nodes (archived after the training) into local memory and then evaluates each of the 1530 trees in series upon calling the random forest emulator. Each grid cell calls the same random forest emulator separately, passing to it all local information required to evaluate the trees (species concentrations, photolysis rates, environmental variables). No attempts were made to optimize the prediction algorithm beyond the existing Message Passing Interface grid-domain splitting.

Choice of predictor
We find that the quality of the RFR model (as implemented back into the GEOS-Chem model) depends critically on the choice of the predictor. Most simplistically, we could predict the concentration of a species after the integration step. However, many of the species in the model are log-normally distributed in which case predicting the logarithm of the concentration may provide a more accurate solution; we could also predict the change in the concentration after the integrator, the fractional change in the concentration, the logarithm of the fractional change, etc. After some trial and error, and based on chemical considerations, we choose two types of prediction: the change in concentration after going through the integrator, and the concentration after the integrator. We describe the first as the "tendency". This fits with the differential equation perspective for chemistry given in Eq. (2). However, if we incorporate only this approach we find that errors rapidly accrue. This is due to errors in the prediction of short-lived species such as NO, NO 3 and Br. For these compounds, concentrations can vary by many orders of mag-nitude over an hour, and even small errors in the tendencies build up quickly when they are included in the full model. For these short-lived compounds, we use a second type of prediction where the RFR predicts the concentration of the compound after the integrator. We describe this as a prediction of the "concentration". From a chemical perspective, this is similar to placing the species into steady state, where the concentration after the integrator does not depend on the initial concentration but is a function of the production (P ) and loss (L · c) such that c = P /L. We imitate this process by explicitly removing the predictor species from the input features, which we find improves performance.
The choice between predicting the tendency or the concentration is based on the standard deviation of the ratio of the concentration after chemistry to the concentration before chemistry (σ (c/c 0 )) in the training data. This ratio is relatively stable and close to 1.00 for long-lived species but highly variable for short-lived species. Based on trial and error, we use a standard deviation threshold of 0.1 to distinguish between long-lived species (σ < 0.1) and short-lived species (σ ≥ 0.1). Table 1 lists the prediction type used for each species. We discuss the treatment of NO and NO 2 species in Sect. 2.7.

Feature importance
The importance of different input variables (features) for making a prediction of O 3 tendency is shown in Fig. 3a. The importance metric is the fraction of decisions in the forest that are made using a particular feature, with the variability indicating the standard deviation of that value between the trees. Consistent with our understanding of atmospheric chemistry, features such as NO, formaldehyde (CH 2 O), the cosine of the solar zenith angle ("SUNCOS"), bromine species and nitrogen reservoirs all appear within the top 20. From a chemical perspective, these features make sense given the global sources and sinks of O 3 in the lower to middle troposphere.
For ozone prediction, 6 out of the 20 most important input features are related to photolysis. Most of the photolysis rates are highly correlated, and the individual decision trees use different photolysis rates for decision making. This results in very large standard deviations for the photolysis input features across the 30 decision trees, as indicated by the black bars in Fig. 3a.
Note that the concentration of O 3 is not among the 20 most important input features for the prediction of O 3 tendency. If, instead, the random forest model is trained to predict the concentration of O 3 , the initial O 3 concentration dominates the input feature importance, explaining more than 99 % of the prediction. However, when predicting the ozone tendency, the random forest algorithm is more sensitive to availability of NO x , VOCs, photolysis, etc., rather than the initial concentration of O 3 . For regions producing ozone (dominated by the NO + HO 2 → NO 2 + OH reaction) the O 3 concentra-tion is not the primary source of variability. Similarly, for regions loosing ozone the dominant source of variability is the variability in photolysis rates (multiple orders of magnitude) rather than the variability in O 3 concentration (less than an order of magnitude). Fig. 3b shows the performance of the O 3 tendency predictor against the validation data. The predictor is not perfect, with an R 2 of 0.95 and a normalized root mean square error (NRMSE) of 23 %, but it is essentially unbiased with a normalized mean bias (NMB) of −0.13 % (descriptions of the metrics can be found in Sect. 2.8). However, as shown in Fig. 3c, the model becomes almost perfect when the tendency is added to the initial concentration -which is the operation to be performed by the chemistry model.

Prediction of NO x
For NO and NO 2 we find that the random forest has difficulties predicting the species concentrations independently of each other. This can result in unrealistically large changes in total NO x (NO x ≡ NO+NO 2 ). Given the central role of NO x for tropospheric chemistry, a quick deterioration of model performance occurs (see Sect. 3.1). For these species we thus adopt a different methodology: instead of making predictions for the species individually, we predict the tendency for a family comprising their sum (NO + NO 2 ) and then predict the ratio of NO to NO x . NO 2 is then calculated by subtracting NO from NO x . Thus, the overall number of forests that needs to be calculated does not change. This has the advantage of treating NO x as a long-lived family "species" and includes a basic conservation law, but it allows the NO and NO 2 concentration to still vary rapidly. Figure 4 shows the feature importance and the comparison with the validation data for the prediction of the NO x family tendency. The features make chemical sense, with NO 2 and NO but also acetaldehyde (a tracer of PAN chemistry) and HNO 2 , a short-lived nitrogen species, playing important roles. The importance of SO 2 may reflect heterogeneous N 2 O 5 chemistry, with SO 2 being a proxy for available aerosol surface area (note that we do not provide any aerosol information to the RFR). As shown in Fig. 4b, the NO x predictor gives the "true" NO x tendencies from the validation data with an R 2 of 0.96, NRMSE of 21 % and NMB of 0.28 %. While the NRMSE is relatively high, we find that the ability of the model to produce an essentially unbiased prediction is more critical for the long-term stability of the model. As for O 3 , the NO x skill scores become almost perfect when adding the tendency perturbations to the concentration before integration (Fig. 4c). Figure 5 shows the feature importance and performance of the predictor for the ratio of NO to NO x . Again the features make chemical sense with the top three features (photolysis, temperature and O 3 ) being those necessary to calculate the NO-to-NO 2 ratio from the well known Leighton relationship Table 1. Overview of the performance of the RFR model with the NO x family treatment. Shown are the Pearson correlation R 2 , normalized root mean square error (NRMSE) and normalized mean bias (NMB). Comparison against the validation data set (10 % of training data withheld from training) are indicated with a "V". Comparisons between the RFR simulation and the full GEOS-Chem model for July 2014 at 00:00 UTC after the 1st, 5th and 30st simulation day are indicated with "D1", "D5" and "D30", respectively. Prediction type of each species (concentration, tendency, NO x family treatment) is given in the prediction column.    (Leighton, 1961). The performance of the NO-to-NO x ratio predictor is very good, and the prediction is also unbiased.

Evaluation metrics
We now move to a systematic evaluation of the performance of the RFR models, both against the validation data and when implemented back into the GEOS-Chem model. We use three standard statistical metrics for this comparison. For each species c, we compute the Pearson correlation coefficent (R 2 ), the root mean square error normalized by the standard deviation σ (NRMSE), whereĉ denotes the concentration predicted by the RFR model, c is the concentration calculated by GEOS-Chem, and N is the total number of grid cells.

Performance against the validation data
Ten percent of the training data was withheld to form a validation data set. Columns "V" in Table 1 provide an evaluation of each predictor against the validation data for the three metrics discussed in Sect. 2.8. For most species the RFR predictors do a good job of prediction: R 2 values are greater than 0.90 for 35 of the 51 species, NRMSEs are below 20 % for 21 species and NMBs are below 1 % for 29 species, respectively. Those species which do less well are typically those that are shorter lived, such as inorganic bromine species or some nitrogen species (NO 3 , N 2 O 5 ). The performance of NO and NO 2 after implementing the NO x family and ratio methodology is consistent with other key species. Although we do not have a perfect methodology for predicting some species, we believe that it does provide a useful approach to predicting the concentration of the transported species after the chemical integrator. We now test this methodology when the RFR predictors are implemented back into GEOS-Chem.

Long-term simulation using the random forest model
To test the practical prediction skill of the RFR models, we run four simulations of GEOS-5 with GEOS-Chem for the same month (July) but a different year (2014) than was used to train the RFR model. This simulation differs from the training simulation not only in meteorology but also in emissions, with local differences in NO x , CO and VOC emissions of up to 20 %. As such, this experiment also evaluates the ability of the RFR model to capture the sensitivity of chemistry to changes in emissions.
The first simulation is a standard simulation where we use the standard GEOS-Chem integrator; the second is a simulation where we replace the chemical integrator with the RFR predictors described earlier (with the family treatment of NO x ); the third uses the RFR predictors but directly predicts the NO and NO 2 concentrations instead of NO x ; the fourth has no tropospheric chemistry and the model just transports, emits and deposits species. In all simulations the stratospheric chemistry uses a linearized chemistry scheme (Murray et al., 2012). This buffers the impact of the RFR emulator over the long-term since all simulations use the same relaxation scheme in the stratosphere. For the time frame of 1 month considered here, we consider this impact to be negligible in the lowest 25 model levels.
We evaluate the performance of the second, third and fourth model configuration against the first. We first focus on the statistical evaluation of the best RFR model configuration (second model configuration) for all species and then turn our attention to the specific performance of surface O 3 and NO 2 , two critical air pollutants. Table 1 and summarizes the prediction skill of the random forest regression model (using the NO x family method) for all 51 species plus NO x . We sample the whole tropospheric domain at three time steps during the 2014 test simulation: after 1 simulation day ("D1"), after 5 simulation days ("D5") and after 30 simulation days ("D30"). For each time slice, we calculate a number of metrics (Sect. 2.8) for the RFR model performance.

Statistics
The model with the RFR predictors shows good skill (R 2 > 0.8, root mean square error (RMSE) < 50 %, NMB < 30 %) for key long-lived species such as O 3 , CO, NO x , SO 2 and SO 2− 4 and for most VOCs, even after 30 days of integration. The NRMSEs can build up to relatively large numbers over the period of the simulation, with O 3 getting up to 67 % after 30 days, but the mean bias remains relatively low at 13 %. For the stability of the simulation, it is more important to have an overall unbiased estimation, as this prevents systematic buildups or drawdowns in concentrations that can eventually render the model unstable. For 36 of the 52 species (including NO x ), the NMB remains below 30 % at all times. The model has more difficulties with shorterlived species such as inorganic bromine species (e.g. atomic bromine, bromine nitrate) and nitrogen species such as NO 3 and N 2 O 5 . These species show poor performance with R 2 values below 0.1 even after the first day.
The hourly evolution of the metrics for O 3 over a 30-day simulation is shown in Fig. 6. We show here the performance of the model with the family treatment of NO x (solid line), with separate NO and NO 2 (dashed line), and with no chemistry at all (dotted line). For all metrics, the random forest simulation predicting family treatment of NO x performs better than a simulation predicting NO and NO 2 independently and a simulation with no chemistry. We use the latter as a minimum threshold to compare the RFR methodology. The metrics of the RFR model decrease over the course of the first 15 simulation days (1440 integration steps) but stabilize with an R 2 of 0.8, an NRMSE of 65 % and an NMB of less than 15 %. The simulation with the chemistry switched off degrades rapidly, highlighting the comparative skill of the RFR model to predict ozone over the entire 30-day period. The simulation with NO and NO 2 predicted independently of each other closely follows the NO x family simulation during the first 2-3 days but quickly deteriorates afterwards, as the compounding effect of NO and NO 2 prediction errors leads to an accelerated degradation of model performance.
Although there are some obvious issues associated with the RFR simulation, it is evident that for many applications, the model has sufficient fidelity to be useful. We now focus on the model's ability to simulate surface O 3 and NO 2 , two important air pollutants. Figure 7 compares concentration maps of surface O 3 at 00:00 UTC calculated by the full-chemistry model (upper row), the RFR model (middle row) and their ratio (bottom row) after 1, 5, 10 and 30 days of simulation. After 1 day there are only small differences between the full model and the RFR model. However, these differences grow over the period of the simulation as errors accumulate. By the time the model has been run for 10 days, the model has become significantly biased over clean background regions, in particular over the Pacific Ocean. The differences between the reference model and the RFR simulation grow more slowly after 10 days (see also Fig. 6), resulting in the model differ-ences between day 10 and day 30 being small relative to the difference between day 1 and day 10. It appears that the RFR model finds a new "chemical equilibrium" for surface O 3 on the timescale of a few days. This new equilibrium overestimates O 3 in clean background regions such as the tropical Pacific and underestimates O 3 in the Arctic. Figure 8 similarly compares concentration maps of surface NO x . Reflecting the shorter lifetime of NO x , the errors here grow more quickly compared to O 3 but level off after 5 days as a new chemical equilibrium is reached. The RFR model shows large differences compared to the GEOS-Chem model in regions where NO x concentrations are low and remote from recent emission, with NO x being highly overestimated in the tropics and underestimated at the poles. This pattern is highly consistent with the ones seen for O 3 , suggesting that the relative change in NO x drives the change in O 3 , as would also be the case in a full-chemistry model. Figures 9 and 10 show time series of O 3 and NO x mixing ratios at four polluted locations (New York, Delhi, London and Beijing) as generated by the full-chemistry model (black line), the RFR model (red), and the model with no chemistry (blue). The RFR model closely follows the full model at these locations and captures the concentrations patterns with an accuracy of 10 %-20 %. Especially for NO x it is hard to distinguish the RFR model from the full model, whereas the simulation without any chemistry shows a distinctly different pattern. These differences are significantly less than one would expect from running two different chemistry models for the same period (e.g. Stevenson et al., 2006;Cooper et al., 2014;Young et al., 2018;Brasseur et al., 2018). Events such as that in Beijing on day 20 are well simulated by the RFR model, which is able to follow the full model, whereas the simulation without chemistry follows a distinctly different path that is solely determined by the net effects of emission, deposition and (vertical and horizontal) transport.

Surface concentrations of O 3 and NO x
Although our analysis has not provided a complete analysis of the RFR model performance, we have shown that it is capable of providing a simulation of many key facets of the atmospheric chemistry system (O 3 , NO x ) on the timescale of days to weeks. We now discuss future routes to improve the system and some applications.

Discussion
We have shown that a machine learning algorithm, here random forest regression, can simulate the general features of the chemical integrator used to represent the chemistry scheme in an atmospheric chemistry model. This represents the first stage in producing a fully practical methodology.
Here we discuss some of the issues we have found with our approach, potential solutions, some limitations and where we think a machine learning model could provide useful applications.  . Concentration maps of surface O 3 mixing ratio after 1 simulation day (column 1), 5 simulation days (column 2), 10 simulation days (column 3) and 30 simulation days (column 4), as calculated by the full GEOS-Chem model (row 1) and the standard random forest (RF) model with the NO x family treatment (row 2). Row 3 shows the percentage difference between the RF simulation and GEOS-Chem (GC).

Speed, algorithms and hardware
The current RFR implementation takes about twice as long to solve the chemistry as the currently implemented integrator approach. While the evaluation of a single tree is fast (average execution time is 1.7 × 10 −3 ms on the Discover computer system), calculating them all for every forest and for every transported species (30 × 51) in series results in a total average execution time of 2.6 ms, which is 85 % slower than the average execution time of 1.4 ms using the standard model integrator.
We emphasize that this implementation is a proof of concept. Unlike for the chemical integrator, little work has been undertaken to optimize the algorithm parameters (e.g. optimizing the number of trees or the number of leaves per tree) or the Fortran90 implementation of the forests. For example, random forest have relatively large memory footprints that scale linearly with number of forests and trees. Efficient access of these data through optimal co-location of related information (e.g. grouping memory by branches) could dramatically reduce CPU register loading costs, as could moving from double precision to single precision or even integer maths. In the current implementation, we load all tree data onto every CPU separately without attempts of memory sharing. Thus we believe that different software structures, Figure 8. Concentration maps of surface NO x (NO + NO 2 ) after 1 simulation day (column 1), 5 simulation days (column 2), 10 simulation days (column 3) and 30 simulation days (column 4), as calculated by the full GEOS-Chem model (row 1) and the standard random forest (RF) model with the NO x family treatment (row 2). Row 3 shows the percentage difference between the RF simulation and GEOS-Chem (GC). algorithms and memory management may allow significant increases in the speed achieved.
A fundamental attractiveness of the random forest algorithm is its almost perfect parallel nature, even among species within the same grid cell: the nodes of all trees (and across all forests) solely depend on the initial values of the input features and thus can be evaluated independently of one another (in contrast, the system of coupled ODEs solved by the chemical solver requires coupling between the species). This would readily allow for parallelization of the chemistry operator, which has up to this point not been possible. This may allow other hardware paradigms (e.g. graphical processing units) to be exploited in calculating the chemistry.
We have implemented the replacement for the chemical integrator using the random forest regression algorithm. Our choice here was based on the conceptual ease of the algorithm. However, other algorithms are capable of fulfilling the same function. Neural networks have been used extensively in many Earth system applications (e.g. Krasnopolsky et al., 2010;Brenowitz and Bretherton, 2018;Silva and Heald, 2018), and gradient boosting frameworks such as XG-Boost (Chen and Guestrin, 2016) are becoming increasingly popular. A number of different algorithms need to be tested and explored for both speed and accuracy before a best-case algorithm can been found.

Training data
We have trained the random forest regression models on a single month of data. For a more general system the models will need to be trained with a more temporally extensive data set. Models are, however, able to generate large volumes of data. A year's worth of training data over the full extent of the model's atmosphere would result in a potentially very large (2 × 10 10 ) training data set. Applying this methodology to spatial scales relevant to air quality applications (on the order of 10 km) will result in even larger data sets (10 13 ). However, not all items from the training data are of equal value. Much of the atmosphere is made up of chemically similar air masses (e.g. central Pacific, remote free troposphere) which are highly represented in the training data but are not very variable. Most of the interest from an air quality perspective lies in small regions of intense chemistry. If a way can be found to reduce the complete training data set such that the sub-sample represents a statistical description of the full data, the amount of training data can be significantly reduced and thus the time needed to train the system.
The features being used to train the predictors could also be reconsidered. The current selection reflects an initial estimate of the appropriate features. It is evident that different and potentially better choices could be made. For example, we have included all photolysis rates, but these correlate very strongly and so a greatly reduced number of photolysis inputs (potentially from a principal components analysis) could achieve the same results but with a reduced number of features. Including other parameters such as the concentrations of the aerosol tracers may also improve the simulation.

Conservation laws and error checking
One of the fundamental laws of chemistry is the conservation of atoms. One interpretation of that has been applied here to the prediction of the change in NO x together with predictions for NO : NO x . Since the concentration of NO x changes much more slowly than the change in concentration of either NO or NO 2 , this approach attempts to improve the prediction of these short-lived nitrogen species, which are difficult to predict. Our results show that this does indeed increase the stability of the system, and it represents a first step towards ensuring the conservation of atoms in machine-learning-based chemistry models. A larger nitrogen family (NO,NO 2 ,NO 3 , N 2 O 5 , HONO, HO 2 NO 2 , etc.) might increase stability further, as could other chemical families such as BrO x , which showed significant errors both compared to the validation data and the evaluation of the chemistry model.
The solution space of a chemistry model is constrained by mass-balance requirements, and chemical concentrations tend to mean-revert to the equilibrium concentration implied by the chemical boundary conditions (emissions, deposition rates, sunlight intensity, etc.). A successful machine learning method should have the same qualities in order to prevent runaway errors that can arise from systematic model errors, e.g. if the model constantly over-or under-predicts certain species or if it violates the conservation of mass balance. Because each model prediction feeds into the next one, small errors compound and quickly lead to systematic model errors. Possible solutions for this involve prediction across multiple time steps, which have shown to yield more stable solutions for physical systems (Brenowitz and Bretherton, 2018), or the use of additional constraints that measure the connectivity between chemical species, e.g. through the consideration of the stoichiometric coefficients of all involved reaction rates.

Possible implementations
The ability to represent the atmospheric chemistry as a set of individual machine learning models (one for each species) rather than as one simultaneous integration has numerous advantages. In locations where the impact of a (relatively shortlived) molecule is known to be insignificant (for example isoprene over the polar regions or dimethyl sulfide (DMS) over the deserts), the differential equation approach continues to solve the chemistry for all species. However, with this machine learning methodology, there would be no need to call the machine learning algorithm for a species with a concentration below a certain threshold or for certain chemical environments (e.g. nighttime): the chemistry could continue without updating the change in the concentration of these species. Thus it would be easy to implement a dynamical chemistry approach which uses a simple lookup table with predefined threshold rates to evaluate whether the concentration of a compound needs to be updated or not. If it did, the machine learning algorithm could be run; if it did not, the concentration would remain untouched and the evaluation of the random forest emulator is skipped (for this species). This approach could reduce the computational burden of atmospheric chemistry yet further.
The machine learning methodology could also be implemented to work seamlessly with the integrator. For example, the full numerical integrator can be used over regions of particular interest (populated areas for an air quality model or a research domain for a research model), while outside of these regions (over the ocean or in the free troposphere for an air quality model or outside of the research domain for a research model) the machine learning could be used. This would provide a "best of both worlds" approach which provides higher chemical accuracy where necessary and faster but lower-accuracy solutions where appropriate.
Our methodology uses the output from the atmospheric chemistry model to generate the training data set. Another approach would be to use a series of box model simulations using initial conditions covering the appropriate chemical concentration ranges to generate the training data. This could allow the chemical complexity that is known to exist (e.g. Aumont et al., 2005;Jenkin et al., 1997) to be encoded in a way which would make it suitable for use in an atmospheric chemistry model. Much of this chemical complexity occurs in relatively small volumes of the atmosphere, for example, urban environments or over forested areas. These are areas with large emissions of complex volatile organic compounds which have a complex degradation chemistry. It would be possible to develop a machine-learning-based chemistry, trained on a number of box model simulations of the complex chemistry, which would represent this chemical complexity in a more efficient form, and to use this machine learning chemistry in only those grid boxes that require the full complex chemistry.

Limitations
This is the first step in constructing a new methodology for the representation of chemistry in atmospheric models. There are a number of limitations that should be explored in future work. Firstly, the machine learning methodology can only be applied within the range of the data used for the training. Applying the algorithm outside of this range would likely lead to inaccurate results. For example, the model here has been trained for the present-day environment. Although the training data set has seen a range of atmospheric conditions, it has only seen a limited range of methane (CH 4 ) concentrations or temperatures. Thus applying the model to the pre-industrial period or the future, where the CH 4 concentration and temperature may be significantly different than in the present day, would likely result in errors. Similarly, exploring scenarios where the emissions into the atmosphere change significantly (for example large changes in NO x emissions vs. VOC emissions) again will likely ask the model to make predictions outside of the range of training data. A simple check would be to evaluate the (surface) NO x /VOC ratios observed in the new model and compare them against the ranges used in the training: if the ratios in the updated model are significantly different from the training data, the RFR model likely needs to be retrained.
The same limitations also apply to model resolution: due to the non-linear nature of chemistry, the numerical solution of chemical kinetics is resolution-dependent, and a machine learning algorithm may not capture this. Thus, care should be taken when applying these approaches outside of the range of the training data.

Potential uses
Despite the limitations discussed here, there are a number of potential, exciting applications for this kind of methodologies.
The meteorological community has successfully exploited ensembles of predictions to explore uncertainties in weather forecasting (e.g. Molteni et al., 1996). However, air quality forecasting has not been able to explore this tool due to the computational burden involved. Using a computationally cheap machine learning approach, air quality forecasts based on ensemble predictions could become affordable. Ideally, in such a system the primary ensemble member would include the fully integrated numerical solution of the differential equations, while secondary members use the machine learning emulator. Since air quality forecasts are much more sensitive to boundary conditions (e.g. emissions) than initial conditions, the different machine learning members would be used to capture the sensitivity of the air quality forecast to emission scenarios, changes in dry or wet deposition parameters, uncertainties in the chemical rate constants etc. Data assimilation can be applied to determine the initial state for all models, and then the ensembles could be used for probabilistic air quality forecasting. This application is also less sensitive to long-term numerical instability of the machine learning model as the emulator is only used to produce 5-10 day forecasts, while the initial conditions are anchored to the full-chemistry model for every new forecast.
The data assimilation methodology itself could benefit from a machine learning representation of atmospheric chemistry. Data assimilation is often computationally intense, requiring the calculation of the adjoint of the model or running large numbers of ensemble simulations (Carmichael et al., 2008;Sandu and Chai, 2011;Inness et al., 2015;Bocquet et al., 2015). The ability to run these calculations faster would offer significant advantages.
Another potential application area for machine-learningbased chemistry emulators are chemistry-climate simulations. Unlike air quality applications, which focus on smallscale variations in air pollutants over comparatively short periods of time of days to weeks, chemistry-climate studies require long simulation windows of the order of decades. Because of this, machine learning models used for these applications need to be optimized such that they accurately reproduce the (long-term) response of selected species -e.g. ozone and OH -to key drivers such as temperature, photolysis rates and NO x (Nicely et al., 2017;Nowack et al., 2018). The method presented here could be optimized for such an application by simplifying the problem set, with the model trained to reproduce daily or even monthly averaged species concentrations.

Conclusions
We have shown that a suitably trained machine-learningbased approach can replace the integration step within an atmospheric chemistry model run on the timescale of days to weeks. The application of some chemical intuition, by which we separate long-lived from short-lived species, and a basic application of the conservation of atoms to the NO x family leads to significant improvements in model performance. The machine learning implementation is slower than the current model, but very little optimization and software development has thus far been applied to the code.
Methodologies similar to this may offer the potential to accelerate the calculation of chemistry for some atmospheric chemistry applications such as ensembles of air quality forecasts and data assimilation. Future work on both the algorithm and the methodology is necessary to produce a useful solution, but this first step shows promise.