Conservation laws in a neural network architecture: Enforcing the atom balance of a Julia-based photochemical model (v0.2.0)

Abstract. Models of atmospheric phenomena provide insight into climate, air quality, and meteorology, and provide a mechanism for understanding the effect of future emissions scenarios. To accurately represent atmospheric phenomena, these models consume vast quantities of computational resources. Machine learning (ML) techniques such as neural networks have the potential to emulate compute-intensive components of these models to reduce their computational burden. However, such ML surrogate models may lead to nonphysical predictions that are difficult to uncover. Here we present a neural network architecture that enforces conservation laws to numerical precision. Instead of simply predicting properties of interest, a physically interpretable hidden layer within the network predicts fluxes between properties, which are subsequently related to the properties of interest. This approach is readily generalizable to physical processes where flux continuity is an essential governing equation. As an example application, we demonstrate our approach on a neural network surrogate model of photochemistry, trained to emulate a reference model that simulates formation and reaction of ozone. We design a physics-constrained neural network surrogate model of photochemistry using this approach and find that it conserves atoms as they flow between molecules, while outperforming two other neural network architectures in terms of accuracy, physical consistency, and non-negativity of concentrations.

Kelp et al. (2020) addressed error accumulation in recurrent, long-term predictions of gas-phase chemistry with a recurrent neural network architecture. Their recurrent neural network is orders of magnitude faster than the reference model MOSAIC/CBM-Z (Zaveri et al., 2008).
Point 3 is an active area of research and the focus of this work. Complex machine learning (ML) tools, including neural networks, can be criticized as being "black box" methods that have opaque inner workings: this criticism motivates the development of interpretable ML methods, or in the physical sciences, more physically interpretable ML (McGovern et al., 2019). Physics-informed neural networks exploiting automatic differentiation can reproduce numerical solutions to partial differential equations (Raissi et al., 2019). In the atmospheric sciences, physical information has been incorporated into machine learning models via balancing approaches after prediction (Krasnopolsky et al., 2010), a cost function penalizing nonphysical behavior (Beucler et al., 2021; Zhao et al., 2019), including additional physically relevant information as input (Silva et al., 2021b), or incorporating hard constraints on a subset of the output in the neural network architecture (Beucler et al., 2019; Beucler et al., 2021).
Incorporating fundamental knowledge into ML algorithms will ensure adherence to the physical and chemical laws underpinning these representations and likely improve the accuracy and stability of these algorithms. This work introduces a method to incorporate fundamental scientific laws in neural network surrogate models, in a way that ensures conservation of important quantities (for example mass, atoms, or energy) by imposing flux continuity constraints within the neural network architecture. Atom conservation is fundamental to atmospheric photochemistry, and photochemistry is a computationally intensive component of air quality and climate models, so as an example this work inherently conserves atoms in a neural network model of atmospheric photochemistry.

Recent efforts in machine learning methods for atmospheric chemistry have indicated physically informed ML as a future research direction (Keller and Evans, 2019; Kelp et al., 2020). Kelp et al. (2020) motivate exploring ML architectures that are customized with information about the systems they aim to model, and the potential for this to improve predictions of the large concentration changes that frequently occur at the start of atmospheric chemistry simulations. Keller and Evans (2019) point out that incorporating physical information in ML, such as conservation laws, can help ensure point 2, numerical stability of ML, by keeping predictions within the solution space of the reference model. Keller and Evans (2019) also provide the example of atom conservation and propose inclusion of stoichiometric information as a possible solution and a future direction to explore. In this work, we focus on this latter goal: conserving atoms, much in line with the suggestions outlined in Keller and Evans (2019), as well as the framework introduced by our prior work (Sturm and Wexler, 2020). More specifically, we utilize the weight matrix multiplication structure of a neural network (NN) to incorporate stoichiometric information in its architecture. The architecture of this physics-constrained model ensures conservation of atoms by including a constraint layer that has non-optimizable weights representing the stoichiometry of the reactions. The physics-constrained NN is trained to emulate a reference photochemical model simulating production and loss of ozone with 11 species and 10 reactions. A secondary benefit of the physics-constrained NN architecture is increased physical interpretability of the neural network: the output of the hidden layer before these constraints can be interpreted as the net flux of atoms between molecules, or in terms of chemical kinetics, the extent of reaction.

Physical constraints in the neural network architecture
Our prior work (Sturm and Wexler, 2020) introduced a framework that could be used with any machine learning algorithm to introduce conservation laws. In the case of atmospheric chemistry, most ML surrogate model approaches have estimated future concentrations C(t + Δt) from current concentrations C(t) and other parameters p, which can include meteorological conditions such as zenith angle, temperature, and humidity. Rather than estimate the future value for the concentration (or more generally, the property of interest), we proposed training a machine learning algorithm to estimate fluxes between the properties of interest: for the photochemistry example, this is atom fluxes between molecules in a stoichiometrically balanced way. These fluxes are also interpretable as rates of reaction, or when integrated over a certain timestep, extents of reaction. The fluxes are related to the tendencies, or changes of concentrations of species ΔC, in a way that is stoichiometrically balanced. The stoichiometric information is contained in a matrix A that relates fluxes S to changes in concentrations, such that ΔC = AS. This framework leads to prediction of these fluxes using an ML algorithm that emulates S = f(C(t), p), wherein S is a vector of the time-integrated flux of atoms between model species due to photochemistry. Future concentrations can then be calculated via C(t + Δt) = C(t) + AS. Typically, the reference model is used to generate training and test data sets to be used to develop the ML algorithm. However, S values are not always standard output of such models: in some cases, the reference model can be altered to calculate and output S values for the ML algorithm, for example calculating subgrid fluxes to train a physically consistent NN in a climate model (Yuval et al., 2021) or using explicit Euler integration in a simplified photochemical mechanism (Sturm and Wexler, 2020). This is not always possible, especially when more sophisticated integrators are used.
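The flux-tendency step is a single matrix product. The following NumPy sketch uses a hypothetical 3-species, 2-reaction toy system, not the reference mechanism:

```python
import numpy as np

# Toy system (illustrative only): R1 converts X to Y, R2 converts Y to Z.
# Rows of A are species (X, Y, Z); columns are reactions.
A = np.array([[-1.0,  0.0],
              [ 1.0, -1.0],
              [ 0.0,  1.0]])

S = np.array([2.0, 0.5])          # time-integrated fluxes (extents of reaction)
dC = A @ S                        # tendencies: dC = A S -> [-2, 1.5, 0.5]

C = np.array([10.0, 1.0, 0.0])    # current concentrations C(t)
C_next = C + dC                   # C(t + dt) = C(t) + A S

# Each column of A sums to zero here, so the total quantity is conserved:
print(dC.sum())                   # 0.0
```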
Our prior work focused on a way to invert A in order to calculate the target S values (Sturm and Wexler, 2020). For the example of a surrogate model of condensation/evaporation in a sectional aerosol model, A is overdetermined and a left pseudoinverse exists (see Appendix A1). However, where there are more reactions than species of interest, such as in a photochemical system, or more generally when there are many different phenomena contributing to fewer quantities of interest, A will be underdetermined. This looks like A ∈ ℝ^(n×m) where n < m.
For underdetermined systems, we applied a generalized inverse, restricted to lie in the space of all possible S ∈ ℝ^m, that would calculate S from ΔC ∈ ℝ^n. This approach does not guarantee that the S values are realistic: sometimes predicted extents of reaction were erroneously negative in a photochemistry application.
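A toy example of why the generalized-inverse route can misbehave: for an underdetermined A, the minimum-norm solution given by the Moore-Penrose pseudoinverse can contain negative fluxes even when a non-negative solution exists (all values below are illustrative):

```python
import numpy as np

# Underdetermined toy system: 1 species, 2 reactions (n = 1 < m = 2).
# R1 produces the species, R2 consumes it; only S >= 0 is physical.
A = np.array([[1.0, -1.0]])
dC = np.array([3.0])

S = np.linalg.pinv(A) @ dC          # minimum-norm solution
print(S)                            # [1.5, -1.5] -- second flux is negative

# A non-negative flux vector also reproduces dC, e.g. S = [3, 0]:
print(A @ np.array([3.0, 0.0]))     # [3.]
```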
This work explores the effects of implementing the ΔC = AS step directly in the last layer of the neural network, as shown in Figure 1. Each node in a layer computes an inner product between its weight vector w_i and its input vector. For the penultimate layer, the weight vector w_i of each node corresponds to a row of A. This can be thought of as the "constraint layer". The constraint layer has a zero bias vector and the linear activation function f(x) = x, such that the layer is simply a matrix operation equivalent to the product in equation 3. With this architecture, the inputs to this constraint layer are the time-integrated fluxes S, providing insight into the inner workings of the network as a side benefit. The activation function of the layer before should be chosen based on the application. A rectified linear unit activation, which only outputs non-negative terms, is appropriate for a photochemistry application, where integrated fluxes only have positive sign. Including the matrix A representing the chemical system in the last layer of a neural network captures the coupling and interdependence of the different chemical species with custom, non-optimizable weights. Our approach resembles the Beucler et al. (2021) approach in that hard constraints are built into a neural network, with several key differences.
1. Our entire output vector represents a coupled system where all elements are subject to the constraints. This formulation is more restrictive than the approach in Beucler et al. (2021), which constrains a chosen subset of the output and allows some output to be unconstrained.
3. Our approach does not require relating elements in the input to the output. Instead, the fluxes in the penultimate layer are related to the output such that tendencies are balanced.

Training the NN with A built into the last layer ultimately skips the compute-intensive and input-sensitive strategy of calculating the restricted inverse when the A matrix is underdetermined or rank deficient (Sturm and Wexler, 2020). This results in a neural network that conserves atoms in every prediction while also predicting the fluxes S in the penultimate layer.
This architecture adds physical interpretability to the last hidden layer of the neural network.
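A minimal Keras sketch of such a constraint layer, using toy sizes and stoichiometric values rather than the reference mechanism (the fixed kernel is the transpose of A because a Keras Dense layer computes x @ W):

```python
import numpy as np
import tensorflow as tf

m_reactions, n_species = 2, 3
# Toy stoichiometry (rows = species, columns = reactions); illustrative only.
A = np.array([[-1.0,  0.0],
              [ 1.0, -1.0],
              [ 0.0,  1.0]])

flux_in = tf.keras.Input(shape=(m_reactions,))
constraint = tf.keras.layers.Dense(
    n_species, activation="linear", use_bias=False, trainable=False,
    name="atom_balance")                     # zero bias, linear activation
dC = constraint(flux_in)
model = tf.keras.Model(flux_in, dC)
model.get_layer("atom_balance").set_weights([A.T])

S = np.array([[2.0, 0.5]])                   # time-integrated fluxes
print(model.predict(S, verbose=0))           # equals (A @ S.T).T = [[-2, 1.5, 0.5]]
```

Because the weights are fixed and the bias is zero, the layer is exactly the matrix product ΔC = AS for every prediction.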

Additional input to the neural network
Physical information can be given as input to machine learning tools to improve predictions, for example when estimating aerosol activation fraction (Silva et al., 2021b). For our application, the complexity of the chemical system arises from the coupling of species, which interact with each other through chemical reactions. Bimolecular reactions (or generally reactions that involve two species) are often represented with rate laws of the form R_ij = k_ij C_i C_j, where R_ij is the reaction rate for compounds i and j (the case i = j is allowed) and k_ij is an often empirically determined reaction rate constant. In addition to the concentrations themselves, the products C_i C_j can be calculated from the input concentrations and given as additional input to the neural network. Inclusion of this additional input, along with the methods described in section 2.1, leads to our physics-constrained neural network model shown in Figure 1. We avoid use of "physically informed input" to describe this approach, to prevent confusion with the physics-informed NN approach as introduced by Raissi et al. (2019). We also do not call this approach "reactivity-informed" input, to keep the generalizability of this approach in mind. This additional input is informed by knowledge of bimolecular rate laws; however, for other applications, additional input can take other forms. For the example of evaporation or condensation, the driving force of a concentration gradient could be supplied as additional input.
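The feature augmentation can be sketched in a few lines; the index pairs below are hypothetical placeholders for the mechanism's bimolecular reactant pairs:

```python
import numpy as np

def augment_with_products(C, pairs):
    """Append C_i * C_j products to the raw concentration features.

    C: (n_samples, n_species) array of concentrations.
    pairs: list of (i, j) reactant index pairs (i == j allowed).
    """
    products = np.stack([C[:, i] * C[:, j] for i, j in pairs], axis=1)
    return np.concatenate([C, products], axis=1)

# Toy data: 2 samples, 3 species; pairs chosen for illustration only.
C = np.array([[1.0, 2.0, 3.0],
              [0.5, 4.0, 2.0]])
X = augment_with_products(C, pairs=[(0, 1), (1, 2)])
print(X)   # row 0 -> [1, 2, 3, 2, 6]; row 1 -> [0.5, 4, 2, 2, 8]
```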

What follows is an assessment of the accuracy of the physics-constrained neural network compared to a neural network with a "naïve" structure: neither a constraint layer nor an additional input layer. To assess the relative contributions of each knowledge-guided adjustment to the neural network, we also construct an intermediate neural network, which contains the additional knowledge-guided input but not the hard constraints built into the penultimate layer. Each network is trained to emulate the behavior of a reference photochemical model of ozone production with 11 species and 10 reactions. All three are feedforward neural networks implemented in Python with the Keras library (Chollet et al., 2015) using a TensorFlow backend (Abadi et al., 2015).
Though the physics-constrained NN technically has two hidden layers, incorporating the flux-based balance in the second hidden layer as a set of fixed weights with zero biases and linear activation functions adds no trainable parameters. This means that the penultimate layer is mapped to the output by a purely linear matrix operation.

Figure 1. The naïve neural network (left) takes as input the concentrations of the 11 species, the cosine of the zenith angle, and the change in cosine of the zenith angle. This input layer is fed to a hidden layer H, comprised of 40 nodes, each with weights, biases, and a rectified linear unit (ReLU) activation function. This is fed to a final output layer with a linear activation function and target values ΔC for the 11 species tracked in the reference photochemical model. The physics-constrained neural network (right) includes additional input: 5 products of concentrations which resemble the rate law form of 5 bimolecular reactions. This is fed to a hidden layer the same size as that of the naïve NN: 40 nodes. The hidden layer is then fed to another layer S, which is chosen to have as many nodes as reactions in the chemical system (10), and ReLU activation functions to enforce non-negative output. This is subsequently fed via non-optimizable weights A and a linear activation function to the output vector ΔC. The intermediate neural network (not pictured) contains the additional input informed by the rate laws, but is otherwise identical to the naïve neural network.
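The layer sizes described above can be sketched in Keras as follows; the zero-filled A is a placeholder for the actual stoichiometric coefficients, and the parameter count confirms that the fixed constraint layer adds no trainable parameters:

```python
import numpy as np
import tensorflow as tf

n_species, n_reactions = 11, 10
# Placeholder stoichiometric coefficients; the real A holds the entries
# of the 11-by-10 matrix described in the text.
A = np.zeros((n_species, n_reactions))

# Input: 11 concentrations, cos(zenith), change in cos(zenith), 5 products.
inputs = tf.keras.Input(shape=(n_species + 2 + 5,))
h = tf.keras.layers.Dense(40, activation="relu")(inputs)       # hidden layer H
s = tf.keras.layers.Dense(n_reactions, activation="relu")(h)   # flux layer S (>= 0)
out = tf.keras.layers.Dense(n_species, activation="linear",
                            use_bias=False, trainable=False,
                            name="A_matrix")(s)                # fixed constraint layer
model = tf.keras.Model(inputs, out)
model.get_layer("A_matrix").set_weights([A.T])  # Dense computes x @ W

trainable = sum(int(np.prod(w.shape)) for w in model.trainable_weights)
nontrainable = sum(int(np.prod(w.shape)) for w in model.non_trainable_weights)
print(trainable, nontrainable)   # 1170 110
```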

Reference photochemical model
To demonstrate the methods developed above, we used a simplified model for production of ozone, used by Dr. Michael Kleeman at the University of California, Davis for the course ECI 241 Air Quality Modeling. Our reference model focuses solely on gas-phase chemistry, including 10 reactions and 11 species. Table 1 includes the full list of reactions. Table 2 includes the full list of species, including whether they are active (reactants that influence reaction rates), steady-state, or build-up species. This simplified model still represents important features of photochemical ozone production, including NOx chemistry, VOC chemistry, peroxy radical chemistry, and hydroxyl radical chemistry. The limitations of this simple model mean that the subsequent neural networks will at best maintain these limitations. However, this simple example demonstrates conservation properties readily generalizable to larger, more sophisticated models of gas-phase chemistry, such as CBM-Z (Zaveri and Peters, 1999), CBM-IV (Gery et al., 1989) and SAPRC (Carter, 1990; Carter and Heo, 2013). We ported the reference model from Fortran to Julia and adapted it for use in this work, including varying the cosine of the zenith angle and other parameters, discussed further in section 2.4. The source code is available on Zenodo: https://doi.org/10.5281/zenodo.5736487. Julia was designed for flexibility and ease of use comparable to dynamic programming languages like Python, while allowing for computational performance approaching that of compiled languages like C or Fortran (Julia Documentation: https://julia-doc.readthedocs.io/en/latest/manual/introduction).
These aspects of the Julia programming language have also motivated the recent development of JlBox, an atmospheric 0-D box model written fully in Julia with gas-phase chemistry and aerosol microphysics (Huang and Topping, 2021).

To fully represent the atom balance, the multitarget vector of tendencies ΔC for both the naïve NN and the physics-constrained NN includes species that are not defined as "active species" in the reference model, including quickly reacting species that are modeled as pseudo-steady state and species that are only produced, called build-up species. These are summarized in Table 2. Active species are defined in the original reference model as species that contribute to reaction rates, but have nonzero net rates of formation. Both NNs as depicted by Fig. 1 take concentrations of all 11 species as inputs, as well as cosine of the zenith angle and change in cosine of the zenith angle. The physics-constrained NN additionally takes 5 products of concentrations corresponding to the bimolecular reactions: R3 (C_O3 C_NO), R6 (C_HCHO C_OH), R7 (C_HO2 C_NO), R8 (C_NO2 C_OH), and R10 (C_HO2 C_HO2). Though R2 is also a bimolecular reaction, the concentration of diatomic oxygen is assumed constant in the reference model at a mixing ratio of 209,000 ppm, so the concentration product of the two reactants in R2 is proportional to the concentration of atomic oxygen. The assumption of diatomic oxygen as constant and a pseudo-infinite source/sink makes it a special case: for this reason, oxygen is not included in Table 2 or in the stoichiometric balance in the following matrix.
In both neural networks, the input layer is fed to a hidden layer of 40 nodes. While the naïve NN feeds this hidden layer to the output vector, the physics-constrained NN contains a subsequent layer of 10 nodes corresponding to the fluxes of the 10 reactions: this penultimate layer is then connected to the output layer with non-optimizable weights corresponding to the A matrix, to properly emulate the system of reactions. Within the framework outlined in section 2.1 and in Sturm and Wexler (2020), this system of reactions can be modeled by an 11 by 10 matrix A. This rectangular matrix is rank deficient, and obtaining extents of reaction from A and ΔC is a nontrivial inverse problem (Sturm and Wexler, 2020). A matrices of larger models, such as the version of CBM-Z implemented in the box model version of MOSAIC (Zaveri et al., 2008), are also rank deficient. Our simplified reference model shares this property with more sophisticated models, making it a good contender for a proof-of-concept implementation within a neural network.

Within modeling of chemical mechanisms, A is sometimes called the stoichiometry matrix. However, it can also be interpreted as the weighted, directed incidence matrix of the species-reaction graph of the chemical system. The species-reaction graph is a type of directed bipartite network that can give insight into a chemical system (Silva et al., 2021a). The species-reaction graph has two distinct sets of vertices corresponding to the reactions in Table 1 and species in Table 2: these vertices are connected by directed edges, corresponding to the values in the A matrix. Edges leaving a species vertex and going to a reaction vertex show that the species is a reactant and correspond to negative values in the A matrix. Similarly, edges leaving a reaction vertex and going to a species vertex show that the species is produced by that reaction: these edges correspond to positive values in the A matrix.
One metric of bipartite networks is the number of edges leaving nodes, called out-degree centrality. The out-degree centrality of a species vertex represents how many reactions the reactant participates in, and its value is the negated sum of the negative entries in the corresponding row of the A matrix. The two species vertices with the highest out-degree, 3, are formaldehyde (the sole reactive organic compound) and the hydroxyl radical. Silva et al. (2021a) found that the hydroxyl radical had the highest out-degree centrality in species-reaction graphs of 3 other chemical mechanisms. As in the other mechanisms, the out-degree centrality for the reactive nitrogen species, NO and NO2, is higher than for other species. This indicates that, though simple, the reference model is a relevant case study, and the methods developed in this work show potential to be extended to other more sophisticated models of atmospheric chemistry.
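The out-degree computation follows directly from the sign structure of A; the matrix below is a hypothetical toy, not the reference mechanism:

```python
import numpy as np

# Toy stoichiometry matrix (rows = species, columns = reactions).
A = np.array([[-1.0,  0.0, -1.0],
              [ 1.0, -1.0,  0.0],
              [ 0.0,  1.0,  1.0]])

# Out-degree of a species vertex: number of reactions it enters as a
# reactant, i.e. the count of negative entries in its row.
out_degree = (A < 0).sum(axis=1)
print(out_degree)    # [2 1 0]

# Weighted form from the text: negated row sum of the negative entries.
weighted = -np.where(A < 0, A, 0.0).sum(axis=1)
print(weighted)      # [2. 1. 0.]
```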

Training, validation, and test data
Often, box model chemistry is an operator within a larger 3D transport model, which includes other operators modeling processes such as advection, emissions, and deposition. A good surrogate model should be able to emulate the input-output relationship of the reference model. If the context of machine learning surrogate modeling is operator replacement in larger chemical transport models (CTMs) or earth system models (ESMs), accurate short-term predictions on the order of the operator splitting timestep are required. This context informs the strategy of emulating short-term behavior. We set up the reference model to write concentrations of the species every 6 minutes and train the neural network surrogate models to predict ΔC after this timestep. The timestep of 6 minutes is on the order of a common operator splitting timestep in a 3D chemical transport model: for example, the sectional aerosol model MOSAIC has a default timestep of 5 minutes (Zaveri et al., 2008). The operator splitting timestep in the 3D chemical transport model LOTOS-EUROS is chosen dynamically based on wind conditions to satisfy the Courant-Friedrichs-Lewy criterion, but ranges between 1 and 10 minutes (Manders et al., 2017).

We used the reference model to generate 5,000 independent days of output, with concentrations of the 11 species reported every hour: 0.12 million 11-dimensional samples. For each day, concentrations were randomly initialized for active species, documented in Table C1. The reference model was also adjusted to vary sunlight intensity, as measured by cosine of the zenith angle multiplied by a random factor associated with a full day simulation. This variable, as well as its change from the previous timestep, were chosen to be the additional parameters supplied to the neural network.

Of the 5,000 days, 4,800 were selected to be used as training and validation data for optimizing the neural network weights. A portion of this data (10%) was designated as validation data: rather than optimizing the neural network parameters on this data, the model was evaluated on the validation data during training, with early stopping if no improvement was measured on this set. As in previous work (Kelp et al., 2020), we remove all samples from the training and validation data where ozone concentration exceeds 200 ppb. We additionally remove all days from the test data where ozone exceeds 200 ppb at any point, resulting in 126 full days used to evaluate the accuracy of the neural networks.
For supervised machine learning, the inputs (C and p concatenated, as well as concentration products for the second neural network) are different from the targets ΔC. With a similar transformation, the inputs can be normalized on a scale from 0 to 1. This can be done by scaling each input feature by its corresponding maximum and minimum in the training data. This information can be put into a diagonal matrix D, whose elements are x_max − x_min for each input. Representing the input minimums for each element as a vector x_min, the normalized feature space in this case looks like x̃ = D⁻¹(x − x_min). This is implemented in Python using the scikit-learn preprocessing tool MinMaxScaler (Pedregosa et al., 2011).
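A sketch of this normalization with scikit-learn's MinMaxScaler, checked against the diagonal-matrix formulation on toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training inputs: 3 samples, 2 features.
X_train = np.array([[ 0.0, 10.0],
                    [ 5.0, 20.0],
                    [10.0, 40.0]])

scaler = MinMaxScaler()                     # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X_train)

# Equivalent diagonal-matrix form: x_tilde = D^{-1} (x - x_min),
# with D holding x_max - x_min for each feature.
D = np.diag(X_train.max(axis=0) - X_train.min(axis=0))
x_min = X_train.min(axis=0)
manual = (X_train - x_min) @ np.linalg.inv(D)

print(np.allclose(X_scaled, manual))        # True
```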

The scatter plots of tendencies for the other active species and the build-up species are shown in Fig. B1 and Fig. B2 in Appendix B. Both the naïve and physics-constrained architectures show poor accuracy (negative R² values) in predictions of ΔC for hydrogen peroxide. This can be attributed partially to the tendency range for hydrogen peroxide, which is 2 orders of magnitude smaller than that for other compounds. Error for species with smaller changes in concentration might be improved with choice of a different loss function than mean squared error (MSE) between predictions and targets, but a normalized MSE loss function heavily biased towards zero values for the tendency vector ΔC led the neural network to only predict zero values for all species. In the reference model, formaldehyde acts only as a reactant, where it either reacts with another species or undergoes photolysis. The physics-constrained NN restricts all predictions of formaldehyde tendency to be at most zero, which is in line with it being only a reactant. Similarly, some naïve and intermediate NN predictions of ΔC for the build-up species (species that are only products) are negative: this can be seen in Figure B2 in Appendix B. The reference model has no sinks for these build-up species, which are strictly products of reactions, hence the term "build-up". The physics-constrained NN restricts the elements of ΔC corresponding to build-up species to a positive half-space.
Evaluated on overall metrics using a test data set, the physics-constrained NN outperforms both other NN architectures. The maximum absolute errors of ΔC in the test data set for the naïve, intermediate, and physics-constrained NNs are 10.7 ppb, 11.6 ppb, and 10.6 ppb, respectively. For all species, the naïve NN predicted ΔC within 1.31 ppb, the intermediate NN predicted ΔC within 0.84 ppb, and the physics-constrained NN predicted ΔC within 0.76 ppb for 99% of cases. Error metrics evaluated with all active and build-up species for the 126 independent test days are given in Table 3. The physics-constrained NN is more accurate than both other NN architectures in these overall metrics, despite having a slightly smaller trainable parameter space than the intermediate NN.

Performance over varying concentration scales
At the beginning of simulations, steep changes in concentration occur when the chemical system is initialized in a state far from pseudo equilibrium. Chemical operators within larger models approach this pseudo equilibrium, but in each operator splitting time step other operators such as advection and emission perturb these concentrations back away from the pseudo equilibrium state. This informs the focus on short-term accuracy, and the target vector of tendencies ΔC after timesteps of 6 minutes.

Kelp et al. (2020) found that their most long-term stable models came at the price of diminished accuracy in predictions of the extreme ΔC at the beginning of simulations, and motivated further research of ML models with specialized architectures. Our approach is only used for short-term predictions. However, the two architectures with additional input informed by the reactivity show an ability to predict the ΔC at the beginning of each full-day simulation, while also remaining accurate relative to the naïve neural network in conditions that have smaller changes in concentration: those that occur after the initial transient return to the pseudo equilibrium condition.

The large ΔC values resulting from randomly initialized states far from equilibrium are well modeled by both the intermediate NN and the physics-constrained NN: this can be seen, for example, for ozone in Figure 2. Figure 3 shows scatter plots of ΔC as predicted by the 3 NN models, when only evaluated on 23-hour runs after the first hour of simulation that includes the transient return to pseudo equilibrium. Figure 3 and its reported R² metrics are analogous to Figure 2, with the difference being that Figure 3 repeats the analysis with the first hour of each day removed from the test data. With the first hour removed, the changes in ozone concentration shrink by a factor of ~4.
While the naïve NN shows a substantial drop in accuracy of ΔC for reactive nitrogen species, the physics-constrained NN shows a smaller change in its R² metric.

Figure 3. Predictions of ΔC by the three NN models, including the physics-constrained NN (green, bottom row), on 126 test days, for 23-hour runs excluding the first hour of simulation.

Silva et al. (2021b) found that a physically regularized NN emulator of aerosol activation fraction outperformed its naïve counterpart, especially in the edge case of prediction values falling within the lower 10% of the possible range. Similarly, we find that our NNs that have additional chemically relevant input (the intermediate and physics-constrained NNs) are much more accurate for ΔC of NO and NO2 than the naïve NN when only evaluated on cases falling within the lower ~25% of the range of test data: Fig. 3 illustrates this improvement. Though both the intermediate and physics-constrained NNs better predict tendencies of the species than the naïve NN, this improvement is magnified when disregarding large concentration changes that occur at the beginning of simulations.

Steady-state species
Under the pseudo-steady state assumption that certain species have near-zero net rates of change in concentration, their concentrations become algebraic expressions of the concentrations of other species in the system. In our reference model, these steady-state species are the hydroxyl radical OH and atomic oxygen O. Approximating the rates of change for each steady-state species to be zero and isolating their concentrations as expressions of other concentrations and rate constants, we obtain equations for C_OH and C_O (Eqs. 10 and 11), where the k_j correspond to rate constants for reactions j = 1, 2, ..., 10, and C_O2 is assumed constant at 2.09 × 10^5 ppm.
Information on hν is included in p(t).

The physics-constrained NN predicts ΔC, which is added to C(t) to calculate C(t + Δt). Then the steady-state concentrations at time t + Δt are determined by the concentrations of active species using Eqs. (10) and (11). Figure 4 shows the scatter plots of concentrations using predictions from the physics-constrained NN versus the reference model.
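The post-prediction steady-state step can be sketched as follows. This is a hedged illustration only: the assumed mechanism structure (atomic oxygen produced solely by NO2 photolysis and consumed solely by reaction with O2) and all rate-constant values are hypothetical stand-ins, not the reference model's actual parameters.

```python
def steady_state_O(C_NO2, j_NO2, k_O_O2, C_O2=2.09e5):
    """Pseudo-steady state for atomic oxygen (assumed mechanism structure):
    zero net rate  j*C_NO2 - k*C_O*C_O2 = 0  =>  C_O = j*C_NO2 / (k*C_O2).
    """
    return j_NO2 * C_NO2 / (k_O_O2 * C_O2)

# After the NN step: update an active species with its predicted tendency,
# then diagnose the steady-state species from the updated concentrations.
C_NO2_next = 40.0 + (-2.5)     # ppb; toy previous value plus toy predicted dC
C_O = steady_state_O(C_NO2_next, j_NO2=8.0e-3, k_O_O2=1.0e-5)  # toy constants
print(C_O)
```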

The concentration of atomic oxygen is a function of only one variable influenced by NN predictions, the NO2 concentration, and is nearly perfectly predicted. The concentration of the hydroxyl radical depends on the concentrations of 5 other species and is very sensitive to small errors in some of them: HO2, NO, and to some extent formaldehyde. The limitation of the physics-constrained NN in predicting OH indicates that additional physical information might need to be included in order to optimize the physics-constrained NN to predict OH accurately, e.g. including Eq. (10) in the objective function when optimizing NN parameters.

Atom conservation in the physics-constrained neural network
The balance imposed on species by the physics-constrained neural network results in conservation of total carbon and nitrogen. The atom balance for carbon and nitrogen can be demonstrated by summing the mixing ratios of the species these atoms occur in, multiplied by the number of atoms within each species. Figure 5 shows that there is a net zero change in total carbon and nitrogen in the system when using the physics-constrained NN. Balances of oxygen and hydrogen are not shown: oxygen is not conserved, because of the treatment within the reference model of diatomic oxygen as an infinite source and sink, and hydrogen is not conserved because H2O is not explicitly tracked.
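This conservation property can be checked algebraically: an atom is conserved by every reaction exactly when its per-species atom-count vector lies in the left null space of A, so any flux vector S leaves the atom total unchanged. A toy NumPy check with hypothetical species and atom counts:

```python
import numpy as np

# Toy system: species X, Y, Z carry 1, 1, and 2 atoms of some element N.
n_atoms = np.array([1.0, 1.0, 2.0])

# Reactions: R1: X -> Y, R2: X + Y -> Z. Each column conserves the atom.
A = np.array([[-1.0, -1.0],
              [ 1.0, -1.0],
              [ 0.0,  1.0]])

# n^T A == 0 means every reaction conserves the atom:
print(n_atoms @ A)                 # [0. 0.]

# Hence any tendency dC = A S leaves the atom total unchanged:
S = np.array([5.0, 2.0])
dC = A @ S
print(n_atoms @ dC)                # 0.0
```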

Preventing negative concentrations
Though the physics-constrained neural network inherently balances mass, there is no built-in constraint ensuring non-negative concentrations: predicted tendencies might exceed the magnitude of their corresponding concentrations in the previous timestep. However, the number of negative concentrations was reduced by a factor of more than 17 when using the physics-constrained NN compared to the naïve NN. The naïve NN predictions led to 44,017 negative concentrations and the physics-constrained NN predictions to 2,489 negative concentrations in the test dataset containing 272,160 values (including active and build-up species but excluding pseudo-steady-state species, whose concentrations are not calculated by adding their corresponding tendencies). Put another way, predictions by the naïve NN led to 13.2% of the values becoming negative, with the most negative concentration at -7.
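The accounting above amounts to counting negative entries in the predicted concentration array. A minimal sketch, using the counts reported in the text:

```python
import numpy as np

def count_negative(concentrations):
    """Number of negative values in a predicted concentration array."""
    return int(np.sum(np.asarray(concentrations) < 0.0))

# Counts reported for the 272,160-value test dataset:
naive_negatives = 44_017
constrained_negatives = 2_489
reduction_factor = naive_negatives / constrained_negatives  # > 17
```

Because the physics-constrained architecture only balances atoms, clipping or similar post-processing would still be needed to eliminate the remaining negative values entirely.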

Conclusions
Machine learning algorithms have the potential to efficiently emulate complex models of atmospheric processes, but purely data-driven methods may not respect important physical symmetries that are built into the classical models, such as conservation of mass or energy. Prior efforts (Sturm and Wexler, 2020) developed a framework for building conservation laws into machine learning algorithms: by using the relationship between fluxes and tendencies in systems, the fluxes can be posed as learning targets for the ML algorithms, and then tendencies can be predicted in a balanced manner. This work builds on that framework and proposes implementing the flux-tendency relationship directly into the architecture of a neural network, so that the neural network inherently respects the conservation laws, much like the reference model it emulates. As an example of how this framework can be implemented, we design a physics-constrained neural network surrogate model of photochemistry with input resembling bimolecular reaction rates and a penultimate hidden layer enforcing an atom balance. The weights of the penultimate layer are hard stoichiometric constraints and can be obtained via the approach in Sturm and Wexler (2020) relating tendencies of molecular species to atom fluxes between them.

Adding additional parameters based on physical information (in this case chemical reaction rates) improves predictions, as demonstrated by the intermediate NN and the physics-constrained NN. Like previous work (Silva et al., 2021b), these NNs more accurately predict edge cases than the naïve NN: in our case, lower-∆C conditions after the first hour of simulation, approaching pseudo-equilibrium. However, the improved accuracy of the intermediate NN does not correspond to adherence to physical laws.
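The hard-constraint idea can be sketched as a fixed (non-trainable) linear layer mapping predicted fluxes to tendencies. The species set [NO, NO2, O3] and the two reaction pathways below are illustrative, not the reference model's full mechanism; the point is that any atom-count vector in the left null space of the stoichiometric matrix is conserved for arbitrary network output.

```python
import numpy as np

# Fixed stoichiometric weights of the penultimate layer.
# Column 1: NO2 + hv (+ O2) -> NO + O3
# Column 2: NO + O3 -> NO2 (+ O2)
A = np.array([[ 1.0, -1.0],   # NO
              [-1.0,  1.0],   # NO2
              [ 1.0, -1.0]])  # O3

e_nitrogen = np.array([1.0, 1.0, 0.0])  # N atoms per molecule
assert np.allclose(e_nitrogen @ A, 0.0)  # each column conserves nitrogen

# Whatever fluxes the trainable part of the network emits, the resulting
# tendency is balanced, because the constraint lives in the weights:
fluxes = np.random.default_rng(0).uniform(0.0, 1.0, size=2)
delta_c = A @ fluxes
assert abs(e_nitrogen @ delta_c) < 1e-12
```

Since e @ (A @ f) = (e @ A) @ f = 0 for every flux vector f, conservation holds to numerical precision regardless of how well the network is trained.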
Both the naïve and intermediate NNs deliver solutions outside the solution space of the reference model, including negative tendencies for purely build-up species and positive formation of species that are purely reactants. Their predictions also lead to high numbers of negative concentrations, which are nonphysical. Most importantly, material is not conserved by either the naïve NN or the intermediate NN: only the physics-constrained NN obeys the stoichiometric atom balance that is a fundamental property of chemical reactions. The results of this study show promise for hybrid models that combine our knowledge of physical processes with data-driven machine learning approaches, and motivate future exploration of other physically interpretable machine learning techniques that can incorporate additional prior information such as pseudo-steady-state approximations.
The reference model used in this work shares important chemical properties with more sophisticated models, making this approach readily extendable to detailed chemical mechanisms. In extending to larger models, the effect of varying hyperparameters, including input and output dimensionality, network depth (number of layers), and layer width (number of nodes per layer), will have to be assessed. Such a study may be better suited to a more sophisticated reference model with higher dimensionality and more realistic inputs, such as varying temperature. This approach also has the potential to be integrated into work studying the speedup potential of neural networks relative to their reference models, likewise better suited to larger, more sophisticated, and computationally intensive reference models. The primary purpose of this work is to illustrate the atom balance enforced by the architecture of the physics-constrained neural network. We observe a secondary effect: by including built-in information about the chemical system, both the atom balance and the input proportional to the instantaneous reaction rates of the bimolecular reactions, the accuracy of the neural network is improved. This framework has been demonstrated for photochemistry but can be generalized to other applications, such as the change in concentrations of condensable species in a sectional aerosol model, for example MOSAIC (Zaveri et al., 2008). Recent work has been published on machine learning surrogate models for cloud microphysics with resolved size bins (Gettelman et al., 2021) and aerosol microphysics using a modal approach (Harder et al., 2021). Harder et al. (2021) have indicated mass conservation as a future research direction for ML surrogate models of aerosol microphysics, and have proposed regularization via a cost function or post-prediction mass fixers.
Training ML algorithms on target fluxes rather than tendencies ∆C would allow for mass conservation to numerical precision if the tendencies are related to the fluxes via a matrix as in Sturm and Wexler (2020). Studying the system of equations modeling evaporation and condensation in a sectional model, we see that a left pseudoinverse of the corresponding matrix can be used to obtain fluxes from concentrations (typical model output), unlike the rank-deficient matrix in the photochemical application focused on in this work and in Sturm and Wexler (2020).
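A minimal sketch of this flux recovery, assuming a full-column-rank matrix A relating fluxes to tendencies (the 3-by-2 matrix below is illustrative, not taken from either chemical system):

```python
import numpy as np

def left_inverse(A):
    """(A^T A)^-1 A^T: exists and is unique when A has full column rank."""
    return np.linalg.inv(A.T @ A) @ A.T

A = np.array([[-1.0, -1.0],
              [ 1.0,  0.0],
              [ 0.0,  1.0]])           # 3 tendencies, 2 unknown fluxes
A_left = left_inverse(A)
assert np.allclose(A_left @ A, np.eye(2))

true_fluxes = np.array([0.3, 0.7])
delta_c = A @ true_fluxes              # tendencies, as a reference model outputs
recovered = A_left @ delta_c           # fluxes recovered from tendencies
assert np.allclose(recovered, true_fluxes)
```

When A is rank-deficient, as in the photochemical application, no such left inverse exists and the fluxes cannot be uniquely recovered from the tendencies alone.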
Both mass transfer and thermodynamics play a role in the transport of material between the gas and aerosol phases (Wexler and Seinfeld, 1991). This idea can be represented by a system of equations taken directly from Eqs. (3) and (4) in Zaveri et al. (2008), relating change in concentration to flux between the gas and particle phases. The resulting overdetermined 9-by-8 matrix accounts for fluxes from the gas phase to each of the eight size bins. With more species, say 23 species, the system resembles a block matrix; with the assumption that all species have the same number of bins, the diagonal blocks are identical (A_1 = A_2 = ⋯). The full matrix is overdetermined and has full column rank, meaning that there are more equations than unknowns, which makes calculating a unique left inverse possible. The existence of a unique left inverse is useful because concentration values (and therefore ∆C) might be more easily obtainable from reference models than the integrated flux values. From ∆C, the left inverse can be used to obtain flux values, which a supervised machine learning algorithm can be trained to predict from concentration, temperature, and other parameters.

The 5000 independent days include randomly initialized values for active species concentrations at the beginning of each day of simulation. The cosine of the zenith angle is also multiplied by a random factor between 0 and 1 for the day, to vary the intensity of the photolysis reactions. Steady-state concentrations are a direct function of active species concentrations and are initialized accordingly. Build-up species concentrations are initialized at zero.
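The 9-by-8 gas-to-bin matrix described above (one condensable species, eight size bins, nine tracked quantities) can be sketched as follows; the construction is a plausible reading of the structure described, not the exact matrix of Zaveri et al. (2008).

```python
import numpy as np

# One condensable species with 8 size bins: each flux column moves
# material from the gas phase (row 0) into one particle bin.
n_bins = 8
A = np.vstack([-np.ones((1, n_bins)),   # gas phase loses material
               np.eye(n_bins)])         # each bin gains its own flux
assert A.shape == (9, 8)
assert np.linalg.matrix_rank(A) == 8    # full column rank

# Unique left inverse, since A has full column rank:
A_left = np.linalg.inv(A.T @ A) @ A.T

fluxes = np.linspace(0.1, 0.8, n_bins)  # illustrative gas-to-bin fluxes
delta_c = A @ fluxes                    # concentration changes (model output)
assert np.allclose(A_left @ delta_c, fluxes)
```

For multiple species with identical bin counts, the full matrix is block-diagonal with this block repeated, and the left inverse can likewise be applied block by block.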