Emulating aerosol optics with randomly generated neural networks

. Atmospheric aerosols have a substantial impact on climate and remain one of the largest sources of uncertainty in climate prediction. Accurate representation of their direct radiative effects is a crucial component of modern climate models. However, direct computation of the radiative properties of aerosol populations is far too computationally expensive to perform in a climate model, so optical properties are typically approximated using a parameterization. This work develops artiﬁcial neural networks (ANNs) capable of replacing the current aerosol optics parameterization used in the Energy Exascale Earth System Model (E3SM). A large training dataset is generated by using Mie code to directly compute the optical properties of a range of atmospheric aerosol populations given a large variety of particle sizes, wavelengths, and refractive indices. Optimal neural architectures for shortwave and longwave bands are identiﬁed by evaluating ANNs with randomly generated wirings. Randomly generated deep ANNs are able to outperform conventional multilayer-perceptron-style architectures with comparable parameter counts. Finally, the ANN-based parameterization produces signiﬁcantly


Introduction
Atmospheric aerosols have a profound impact on atmospheric radiation and, ultimately, on the entire Earth system, both through their direct radiative effects (Hansen et al., 2005;Johnson et al., 2018) and interaction with clouds (Twomey, 1977;Albrecht, 1989;Fan et al., 2016).They have long been known as one of largest sources of internal uncertainty in climate modeling, primarily due to cloud interactions, although with a significant contribution from direct effects as well (Bellouin et al., 2020).Difficulties arise in both accurately modeling aerosol populations (Liu et al., 2012) and in determining their subsequent impacts in these areas.While the underlying physics may be well understood in many cases, modeling complex small-scale processes is not computationally feasible within an Earth system model (ESM), and these key physical processes are instead represented by parameterization schemes.
Recently, there has been a flurry of research that has leveraged new advances in machine learning (ML) to enhance climate and weather modeling (Boukabara et al., 2021).Various strategies have been used, including emulation of an entire weather or climate model (or at least key fields) with deep learning (Scher, 2018;Weyn et al., 2020), nudging parameterization output (Watt-Meyer et al., 2021;Bretherton et al., 2022), enhancing model output (Wang et al., 2021;Geiss et al., 2022), replacing key model physics such as the radiative transfer scheme (Krasnopolsky et al., 2012;Lagerquist et al., 2021), and replacing the many parameterizations that approximate unresolvable sub-grid-scale processes (Krasnopolsky et al., 2013;Rasp et al., 2018;Brenowitz and Bretherton, 2018).While many of these ap-proaches have some overlap, most are not mutually exclusive strategies for improving climate forecasts: conventional climate models must be used to generate training data for purely data-driven ML models (e.g., Gettelman et al., 2021) and, in the future, those physics-based ESMs may be significantly enhanced by replacing key parameterization schemes with ML emulators, for instance.Ideally, future climate models will leverage continued research in model development in conjunction with multiple ML-based approaches to generate climate simulations with unprecedented accuracy.
This research focuses on developing an artificial neural network (ANN) emulator to replace the current aerosol optics parameterization developed by Ghan and Zaveri (2007) for internally mixed aerosols represented by the four-mode version of the Modal Aerosol Module (MAM4) (Liu et al., 2016) in the Energy Exascale Earth System Model's (E3SM) (Golaz et al., 2019) Atmosphere Model (EAM) (Rasch et al., 2019).We perform a thorough neural architecture search using randomly generated ANN wirings to identify ANN structures best suited to accurately representing aerosol optics with the fewest possible parameters (i.e., at the lowest computational cost).Finally, we show that the ML-based parameterization can significantly outperform the current parameterization in terms of accuracy, and it can even outperform very high-resolution aerosol optics lookup tables, which would be too large to use in EAM but can be seen as a high-resolution extension of the current parameterization.
Section 2 of this paper provides background information on the radiative effects of atmospheric aerosols and the aerosol optics parameterization currently used in E3SM.Section 3 discusses how training and testing datasets were generated and how the neural network input and output variables are handled.Section 4 describes the randomly generated ANN approach in detail as well as the network training procedure and evaluation of the neural architectures.Section 5 evaluates the accuracy of the final ML-based parameterization.Finally, Sect.6 provides a short summary of results and some concluding remarks.

Modeling radiative effects of atmospheric aerosols
Atmospheric aerosols influence Earth's radiative budget both through direct interactions with radiation and modification of clouds (Boucher et al., 2013).Both effects have long been major sources of uncertainty in climate simulations as chronicled by over 3 decades of assessment reports from the Intergovernmental Panel on Climate Change (see Bellouin et al., 2020, their Table 1).Accurate representation of atmospheric aerosols in climate simulations is hindered by many challenges, including complex aerosol-chemical and microphysical processes, aerosol-cloud-precipitation interactions, and aerosol-radiation interactions.Even though the under-lying physics have been studied in great detail and accurate physics-and theory-based models exist to represent the relevant processes, these models are far too computationally expensive to use in an ESM.Instead, such processes are represented with simplified physical models and parameterizations that usually make sweeping simplifications in their representation of aerosol processes and trade model accuracy for computational tractability.
One crucial component of an atmospheric model is a radiation scheme.Radiative transfer models are responsible for representing the radiative exchange of energy between space, the Earth's surface, and the many intervening layers of the atmosphere resolved by an ESM.The radiative flux divergence computed by radiation code is used to determine heating rates in the atmosphere which ultimately impact largescale atmospheric dynamics.E3SM uses the version of the Rapid Radiative Transfer Model (RRTM) (Mlawer et al., 1997;Mlawer and Clough, 1997) developed for use in general circulation models (RRTMG) (Iacono et al., 2008;Pincus and Stevens, 2013).RRTMG does not take information about aerosol populations as a direct input; instead, the bulk optical properties of the aerosol populations in each grid cell are first estimated using a parameterization scheme (Ghan and Zaveri, 2007), and these properties (bulk absorption, extinction, and asymmetry parameter) are passed to the radiative transfer scheme.
Estimation of the optical properties for aerosol populations in each model grid cell is, on its own, a computationally daunting task.Scattering of light by particles is generally separated into three regimes that are defined by the ratio between the radius of the particle (r) and the wavelength of light (λ): Rayleigh (r λ), Mie (r ≈ λ), and geometric (r λ).In both the Rayleigh and geometric scattering regimes the optical properties of an aerosol particle vary smoothly as a function of its size.In the Mie regime, however, absorption and scattering efficiencies can vary wildly as a function of changing particle diameter.Mathematically, these undulations arise as the solution to Maxwell's equations applied to the propagation of electromagnetic radiation over a spherical particle (van de Hulst, 1957).A significant portion of atmospheric aerosols have size parameters (x = 2π r/λ) within the Mie regime, particularly in the shortwave radiative bands used by EAM's radiative transfer code.There is no strict definition of the bounds of the Mie regime, but typically one would use Mie code to estimate optical properties for size parameters within about 2 orders of magnitude of unity, whereas one would use geometric or Rayleigh approximations for larger or smaller particles, respectively, depending on the accuracy required for the application (Bohren and Huffman, 1983).Here, we use a Rayleigh approximation for size parameters less than 0.05 and Mie code for everything larger.Mie scattering solutions can be found in the form of an infinite series, although these series are weakly converging, and sometimes require a large number of terms to accurately determine a particle's opti-cal properties (Hansen and Travis, 1974;Bohren and Huffman, 1983).This is a scenario where the underlying physics are understood and accurate numerical models to represent the physics have been developed (Wiscombe, 1979;Sumlin et al., 2018), but they are far too computationally expensive to use at a large scale, and parameterizations must be used to represent these physics in an ESM (Ghan and Zaveri, 2007;Pincus and Stevens, 2013).This parameterization must represent a high-dimensional manifold in a space defined by the parameters of the aerosol size distribution, the imaginary and real components of aerosol refractive indices (which depend on the aerosol species), and various wavelengths of light.The portion of this manifold that falls in the Mie regime is characterized by large fluctuations, particularly with respect to wavelength and particle size, and any function used to parameterize it will likely require a large number of parameters to adequately capture this variability.In this work, we focus on developing a parameterization of bulk aerosol radiative properties that is fast enough to use in an ESM and substantially more accurate than previous methods.

E3SM and the Modal Aerosol Module (MAM4)
This study focuses on updating the aerosol optics representation for E3SM, an ESM developed by the U.S. Department of Energy (Golaz et al., 2019).EAMv1 (Rasch et al., 2019) uses the four-mode version of the Modal Aerosol Module (MAM4) (Liu et al., 2012(Liu et al., , 2016) ) with improvements to represent aerosol processes (Wang et al., 2020), RRTMG for atmospheric radiative transfer (Iacono et al., 2008;Pincus and Stevens, 2013), and the Ghan and Zaveri (2007) parameterization for aerosol optics.This parameterization is also used in other ESMs, including the Community Earth System Model (CESM) v2.2 (Danabasoglu et al., 2020;NCAR, 2020), so the new parameterization developed in this study can be easily used in other ESMs.
MAM is a simplified model of aerosol populations that was developed to allow representation of key aerosol physics in climate simulations without being computationally prohibitive.Because of the complexity of the general dynamic equation for aerosols (Friedlander, 2000), several methods for representing aerosols in simulations of the atmosphere exist that have varying degrees of accuracy and computational complexity.These include bulk models (Lamarque et al., 2012), modal models (Liu et al., 2012), the sectional method (Gelbard et al., 1980), the quadrature method of moments (McGraw, 1997), and discrete models (Gelbard and Seinfeld, 1979).The key differences between these models are primarily their treatment of aerosol size distributions and mixing.Section 1 of Liu et al. (2012) and Table 1 of Zhang et al. (2020) provide overviews of different approaches to modeling aerosol populations.
The MAM approach breaks aerosols down into several modes based on species and approximate size.MAM4 includes Aitken, accumulation, coarse, and primary carbon modes.Each mode contains multiple aerosol species within a certain particle size range, and MAM assumes internal mixing within modes and external mixing between modes (aerosol properties are averaged within each mode).The modal model assumes that the size distributions of each mode are lognormal and prescribes the log-standard deviations based on past observational studies.Major uncertainty in the modal approach stems from the limited representation of internal vs. external mixing of aerosol species and the assumption of lognormal size distributions.It is reasonably accurate and very computationally efficient compared with other schemes, however, and this makes it a good choice for long-duration ESM simulations.

The Ghan and Zaveri (2007) aerosol optics parameterization
EAMv1 uses a parameterization to estimate the bulk optical properties of simulated aerosols.The parameterization is described in detail in Ghan and Zaveri (2007) with further relevant information found in Ghan et al. (2001) and Neale et al. (2012), but we will provide a brief overview of the method here because it will be useful for understanding subsequent sections of this paper.A diagram of the aerosol optics parameterization training/preparation and how it integrates with EAMv1 is provided in Fig. 1 and may be a helpful reference while reading this section.
The existing optics parameterization estimates optical properties based on five input parameters: aerosol mode (corresponding to MAM modes), wavelength band (λ), real refractive index (n), imaginary refractive index (κ), and mean surface mode radius (r s ).Optical properties are precomputed over a range of values in each of these five dimensions; when called by the model, the parameterization then estimates optical properties from these precomputed values using a combination of Chebyshev and linear interpolation.
The precomputed optical properties are generated as follows: for each wavelength band and aerosol mode, refractive index bounds are computed by taking the minimum and maximum refractive indices across all aerosols in that mode and water.The real refractive index range is spanned by 7 linearly spaced values, and the imaginary refractive index range is spanned by 10 logarithmically spaced values.A range of 200 plausible aerosol radii are then generated between 0.001 and 100 µm.The wavelength, refractive index, and radii data are fed to a Mie code (Wiscombe, 1979) to compute the optical properties for individual particles.Ultimately, the parameterization uses bulk optical properties integrated over a size distribution, so a range of 30 lognormal size distributions are assumed and the individual particle optical properties are integrated over these size distributions.The size distributions are generated for r s values between 0.01 and 25 µm and spaced according to Chebyshev nodes.The optical properties are then fit with a fifth-order Chebyshev polynomial along the r s dimension, and the five Chebyshev coefficients https://doi.org/10.5194/gmd-16-2355-2023 Geosci.Model Dev., 16,[2355][2356][2357][2358][2359][2360][2361][2362][2363][2364][2365][2366][2367][2368][2369][2370]2023 are saved rather than directly saving 30 optical property values (Vetterling et al., 1988).Ultimately a three-dimensional matrix (real refractive index, imaginary refractive index, and surface mode radius) of Chebyshev coefficients is stored for each wavelength and aerosol mode combination, and four of these must be produced representing the four required output variables: bulk shortwave absorption efficiency, bulk shortwave extinction efficiency, bulk shortwave asymmetry parameter, and bulk longwave absorption efficiency.Because of its high dimensionality, the amount of data stored by the parameterization grows rapidly as the resolution with which it resolves the input parameters is increased.This is a major motivation for replacing the current parameterization with a neural network, because increasing accuracy by increasing resolution of the input parameter space rapidly becomes intractable in the existing parameterization.
When the optics parameterization is called by EAM, it is passed values of r s , n, and κ for each aerosol modewavelength band combination.The parameterization applies bilinear interpolation along the refractive index dimensions of the table to estimate Chebyshev coefficients at an intermediate refractive index.Then, the fifth-order Chebyshev polynomial generated with these coefficients is used to estimate the optical properties as a function of r s .This approach is very similar to using a lookup table, in that the optical properties have been precomputed, with the major difference being that a combination of bilinear and Chebyshev interpolation is used to resolve three of the dimensions as continuous functions of the input variables.
Errors are introduced at nearly every step in this process, including averaging of within-mode refractive properties, a limited number of wavelength bands treated by the model, assumed aerosol size distributions, interpolation of refractive indices and particle size distributions, and others.This approximation of well understood but unresolvable physics is a frustrating but unavoidable facet of climate modeling.Here, we set out to replace the Chebyshev interpolation approach with a neural network emulator, which addresses the errors incurred by coarsely resolving n, κ, r s , and particle radius information (evaluated in more detail in Table 1 in Sect.5).

Mie code
Training a neural network to emulate Mie scattering first required the generation of large training, validation, and testing datasets using established Mie solvers.We chose to refactor the Fortran code used to generate the existing parameterization's precomputed optical properties into Python.The FORTRAN 77 "MIEV0" Mie scattering code (Wiscombe, 1979(Wiscombe, , 1980a)), which was originally used to perform Mie calculations to generate the current EAM parameterization, was replaced by PyMieScatt (Sumlin et al., 2018;Sumlin, 2017), a Python-based Mie code.The machine learning libraries used in this study are also written in Python, and this refactoring allowed for an end-to-end Python-based pipeline for creating the neural network emulator and will enable easier and more flexible editing if new training data need to be generated in the future.Furthermore, PyMieScatt has support for additional scattering models, such as core-shell optics, which we intend to integrate into the neural network emulator in the future.We have made all of the code written for this study available on the project's GitHub repository (see the "Code and data availability" statement).
To ensure that using PyMieScatt did not introduce any additional errors or discrepancy with the original parameterization, we performed a comparison to MIEV0.The optical properties of every refractive index, particle size, and wavelength combination used by the original parameterization were output and compared to the same optical properties computed using PyMieScatt.The maximum, 99.9thpercentile, and 99th-percentile absolute errors are shown in Table A1.Even the most extreme discrepancies between the two schemes are negligible compared with other sources of error in the parameterization.

Training and validation data
For ANN training, we generated a large table of bulk aerosol optical properties similar to what is described in Sect.2.3 but with significantly higher resolution in terms of its input variables.We used the same bounds for possible real and imaginary refractive index values, particle radii, and surface mode radius as in Ghan and Zaveri (2007) and also similarly used logarithmic vs. linear spacing depending on the variable (the same wavelength bands and aerosol modes were used).The resolution of each of these variables was increased to 2049 particle radii, 257 mode radii, 129 imaginary refractive indices, and 129 real refractive indices; this is in comparison to 200, 30, 10, and 7 respective values in the original parameterization.The resulting high-resolution table has about 20 000 times the number of entries, takes on the order of 1 d to compute using parallelized calls to PyMieScatt on a modern CPU, and occupies several gigabytes of RAM, making it inappropriate for direct use in an ESM.
When training a neural network, it is best practice to evaluate the ANN on a holdout set of validation data after it is trained as a check for overfitting to the training data.The validation data used here were drawn randomly from the high-resolution table using half of the data points for training and half for validation.In this application, the boundaries of the optical property tables were chosen by Ghan and Zaveri (2007) to encompass all possible input values the parameterization could receive from the ESM, so we are not concerned about poor performance when extrapolating outside of the optics table.However, there is potential for overfitting to cause unexpected behavior in the regions between points

Testing data
In addition to a validation set, when hyperparameter tuning is used or multiple models are tested, an additional set of "test" data should be held out to ensure that the validation set has not been overfit by the hyperparameter or model selection (Murphy, 2012).The test set used in this study was generated separately from the training data and is approximately the same size as the combined training and validation sets.
The training set was constructed by generating an additional table of optical properties where each of the input parameters bisects the grid of values used to generate the training and validation data.This ensures that it completely covers the range of possible inputs and does not contain values near any of the training and validation data points.This test set was used to ensure that the randomly wired ANN approach did not lead to an overfit of the validation set.

Benchmark datasets
In addition to the high-resolution optics data used for training and validation, three other tables of optical properties were generated at intermediate resolutions of 1025×129×65×65, 513 × 65 × 33 × 33, and 257 × 65 × 17 × 9.Where the table dimensions have been listed in the following order: particle radii × mode radii × imaginary refractive index × real refractive index.We have chosen to scale dimensions to a power of 2 plus 1 so that grid points in a table will be bisected by grid points in the next-highest-resolution table.These datasets have total parameter counts of approximately 10 8 , 10 7 , and 10 6 , respectively, once the multiple wavelengths, aerosol modes, and output parameters are ac-counted for.Note that the number of particle radii used to resolve the particle size distributions does not add to the size of the optics table and is only used when the dataset is generated, but it is important to the table's accuracy.The total parameter count, in the shortwave table, for example, is computed as follows: number of mode radii × number of imaginary refractive indices × number of real refractive indices × 14 shortwave bands × 4 aerosol modes × 3 optical properties.These additional optics tables were evaluated by linearly interpolating their entries to query points in the test set described above, and the resulting errors are shown in Table 1 in Sect. 5.They provide an indication of how the resolution of the training data might impact the accuracy of the trained neural network parameterization.

Neural network inputs and outputs
To compute the bulk optical properties of a population of homogeneous spheres with lognormally distributed radii, five values must be known: the real and imaginary components of the refractive index, the geometric mean radius and logstandard deviation that define the size distribution, and the wavelength of light.For the parameterization problem solved here, we assist the neural network by encoding this information in a format more conducive to training neural networks.
Neural networks tend to perform better when input and output data have certain well-behaved distributions and formats.Several pre-and post-processing steps were used alongside the ANN to help ensure optimal performance.Each ANN has nine inputs (in order): λ, n, κ, r s /λ, r s , and a "one-hot" encoding of the four aerosol modes (four values).The one-hot encoding is a common strategy for categorical inputs and usually leads to better performance than a single scalar input that encodes the category (Murphy, 2012, p. 35). https://doi.org/10.5194/gmd-16-2355-2023 Geosci.Model Dev., 16, 2355-2370, 2023 The existing parameterization prescribes a log-standard deviation for each aerosol mode, so the log-standard deviation was not included as a separate continuous input.We chose to include r s /λ as a constructed input despite the fact that both of these variables are used as individual inputs because the size parameter is a key input for Mie scattering calculations, and we found this to improve model skill in early experiments.All of the inputs other than the one-hot encoding are scalar and are each standardized by first taking the log (except for real refractive indices where a log is not used) and then subtracting the mean and dividing by the standard deviation (each rounded to a precision of 0.1).The means and standard deviations used are shown in Table A2 and are based on data from the training set.This yields dimensionless, zero-centered inputs with a standard deviation of 1 and without extreme skew or leptokurtosis.
The Ghan and Zaveri (2007) parameterization estimates specific extinction, absorption, and scattering efficiencies, which are bulk optical properties of the aerosol distribution per total wet aerosol mass, but these values can span several orders of magnitude and, thus, are not well suited for prediction with a neural network.Instead, we have the neural network estimate a key intermediate value used in the Ghan and Zaveri (2007) parameterization that encapsulates the computationally expensive components of estimating bulk aerosol optical properties: where σ is the log-standard deviation of the particle size distribution, r is wet particle radius, λ is wavelength, m is the complex refractive index, Q is either the extinction or absorption efficiency (see Ghan and Zaveri, 2007, their Eq. 20), and the overline indicates a bulk optical property.In MAM, the values of σ are prescribed for each mode: 1.6 for modes 2 and 4 and 1.8 for modes 1 and 3.While the values of Eq. ( 1) are constrained to a reasonable range, linear scaling of the outputs of the ANN is still used to ensure that they are bounded by zero and one.This allows the use of a sigmoid output function to constrain the ANN's outputs.The bulk absorption efficiency is linearly scaled by a factor of 2.2, whereas the bulk extinction efficiency is scaled by 4.6.These values were determined empirically from the training set; when the parameterization is used in an ESM, this scaling will need to be applied.The bulk asymmetry parameter (g) is naturally bounded by zero to one for the range of inputs in this study and is not scaled (Bohren and Huffman, 1983).The longwave and shortwave bands have significantly different ranges for some of their inputs, and the existing parameterization only computes bulk absorption in the longwave, so two neural networks were trained: one with three outputs to process the shortwave bands and one with a single output to process the longwave bands.Neural networks are powerful data-fitting tools, and simple ANN designs can easily generalize to a wide variety of problems.Even so, specialized ANN architectures that have been optimized for a task will usually perform best.Task-specific ANN design is difficult, however, because the space of reasonable ANN designs is usually far too large to explore exhaustively, and it is not usually obvious which will work best.Typically, researchers will rely on heuristics, past experience, or simply convenience and popularity to choose an appropriate ANN architecture.
Various algorithmic approaches to neural architecture search (NAS) (Elsken et al., 2019) and hyper-parameter optimization (HPO) (Feurer and Hutter, 2019) have become popular for addressing this problem.These algorithms usually involve training many different neural networks with a range of parameter and design choices and selecting the bestperforming models.Search methods range from simple random or grid search to sophisticated algorithms such as evolutionary optimization (Angeline et al., 1994), Bayesian optimization (Bergstra et al., 2013), or reinforcement learning (Baker et al., 2017).Much of the recent (past 10 years) research in neural architecture search has focused on developing new convolutional neural network architectures for image processing (e.g., Zoph et al., 2018).Elsken et al. (2019) and Yao (1999) provide reviews of this topic.
Most NAS strategies that test a variety of network wiring patterns are limited to exploring certain families of predefined network styles or break up the search space by randomizing individual network "cells" that are then wired together in sequence.However, Xie et al. ( 2019) demonstrated a NAS strategy in which new convolutional neural network architectures were discovered through random wiring of network layers.Motivated by early observations during our work that the inclusion of skip connections and more complex wirings contributed to performance for the aerosol optics problem, we chose to employ a similar approach here.Whereas Xie et al. (2019) focus on convolutional neural networks, we use ANNs constructed of fully connected layers.In general, skip connections and complex wirings are much more common in deep convolutional neural network architectures than ones constructed from fully connected layers, but there is some past evidence that including skip connections in deep fully connected networks can improve performance on certain non-linear problems (Lang and Witbrock, 1988), and this seems to be the case for the problem of emulating Mie scattering.Here, we designed an ANN generator that automatically produces ANNs with a random number of layers, random layer sizes, and random connections between layers.Ultimately the randomly generated wirings allow for the discovery of networks that substantially outperform simple multilayer perceptrons.

Random network motivation
The physical parameterization problem discussed in this paper is particularly well suited for an ANN.The bulk aerosol optical properties used by the parameterization can be thought of as smooth, bounded, manifolds in a highdimensional space, and representing this type of dataset is an area where neural networks often excel.It is also a particularly data-rich problem because the only limits to the size of our training dataset are the computational and storage resources that we would like to devote to generating training data (and ultimately an upper bound on training set resolution where neighboring data points become highly autocorrelated).In early experiments, we found that while simple feedforward multilayer-perceptron-style architectures with one to two hidden layers can provide much higher performance than the current EAMv1 parameterization discussed in Sect.2.3, more complex architectures that included many smaller layers with skip connections could achieve even higher accuracy without an increase in the number of model parameters.Ultimately, when used in a climate model, the ANN-based parameterization will be evaluated many times (every time the radiative transfer code is called for each model grid cell).This means that reducing the network size as much as possible without significantly reducing accuracy is a worthwhile endeavor and can save both computation time and memory when the climate model is run.Additionally, because of the relatively small size (500-100 000 parameters) of the ANNs used here, they are cost-effective to train.Together, these factors mean that this ML problem is ideal for NAS.

The random ANN generator
Our ANN generator randomizes the network layer size, layer count, merge operators, and wiring.First, it randomly chooses a number of layers between 2 and 12; it then randomly chooses the number of neurons per layer by choosing an integer between 7 and 45 and scaling it by a factor of 0.5N layers (the scaling prevents the generation of very deep and wide ANNs with extremely high parameter counts).To facilitate merging inbound tensors to a layer with elementwise addition, all layers in the network use the same number of neurons.Each hidden layer used in the network is a fully connected layer and applies a tanh activation to its outputs.
Once layer counts and size are selected, the ANN generator creates a random wiring between the layers by generating an adjacency matrix that represents layer connections.Several constraints must be enforced on the adjacency matrix to ensure that it represents a usable ANN architecture.Firstly, we require that the ANN is feed forward.If each row in the adjacency matrix represents a layer in the order in which they will be evaluated in the ANN, this can be accomplished by enforcing that the adjacency matrix is lower triangular.For an ANN with N hidden layers, this means there are 1 2 (N 2 + N ) valid layer connections.The number of active connections for an ANN is randomly chosen from a uniform distribution between 0 and 1 2 (N 2 +N ), and this many entries in the lower triangular portion of the adjacency matrix are then randomly turned on.Additionally, each layer must have at least one inbound and one outbound tensor.Because the number of layers in the ANN is determined before the adjacency matrix is constructed, this must be enforced by iterating through each row and column of the adjacency matrix and randomly turning on one valid inbound and/or outbound connection if the corresponding layer has none.
Lastly, the number of inputs to each ANN are static (nine inputs), but we would like the outputs from each network layer to be a fixed size, and any layer can be directly connected to the input layer.As a workaround, each ANN includes an additional fully connected layer with a number of neurons equal to the difference between the nine inputs and the randomly selected network layer size.The outputs from this layer are appended to the actual inputs as a learnable padding.
Initial experiments on a subset of the training data were run using a single shortwave band (because of reduced training time on the smaller dataset) with additional randomization including the following: variable layer sizes (ANNs that used different layer sizes internally exclusively used concatenation to merge tensors); randomly selected activation functions from linear, tanh, rectified linear unit (ReLU) (Glorot et al., 2011), exponential linear unit (Clevert et al., 2015), leaky ReLU, and parametric ReLU (He et al., 2015); and batch normalization (Ioffe and Szegedy, 2015), dropout (Srivastava et al., 2014), or no regularizer.These experiments showed that the tanh function provided slightly better performance than other activations and that including batch normalization or dropout substantially reduced performance.We hypothesize that the reduced performance with dropout is related to the fact that we are testing relatively small networks.Because dropout layers generally force the ANNs to learn redundant representations of the data and the small ANNs used here only have limited capacity to represent the complex training data, requiring them to learn redundant representations of the data only reduces their skill.Additionally, the complexity of the training data and small size of the networks means that we are not particularly concerned about overfitting and do not expect to gain much from using regularization techniques.These additional types of randomization were not included in final experiments.

Training and model selection
Each model was trained using the Adam optimizer with an initial learning rate of 0.001, β 1 = 0.9, and β 2 = 0.999 to optimize mean-squared error.We used a batch size of 64 samples and trained for 10 epochs.The learning rate was reduced manually by a factor of 10 on the 4th, 7th, and 10th epochs.A total of 500 randomly wired ANNs were trained, and each was evaluated on the validation set. Figure 2 shows scatterhttps://doi.org/10.5194/gmd-16-2355-2023 Geosci.Model Dev., 16, 2355-2370, 2023 plots of each random ANN's validation performance in terms of mean absolute error (MAE) on the standardized ANN outputs plotted against the number of trainable parameters in the network.Both panels in Fig. 2 show a similar pattern in terms of ANN performance vs. size: skill improves rapidly with increasing size until it plateaus somewhere between 1000 and 20 000 trainable parameters.Additional size increases yield only very small performance gains.The approximate location of the elbow in each of these performance vs. size plots is expanded in an inset in each figure panel.Based on these inset plots, we subjectively chose an ANN for both the longwave and shortwave regimes that appears to provide a good balance between network size and skill.The selected ANNs are denoted in Fig. 2 with red circles, and diagrams of the wirings for the selected networks are shown in Fig. 3.Note that later, in Sect.5, errors will be reported after rescaling the standardized network outputs for comparison to the Ghan and Zaveri (2007) scheme; however, here, we report the selected ANNs' MAEs on the test set computed directly on the ANN output as in Fig. 2: shortwave (SW): 8.96 × 10 −5 ; longwave (LW): 2.32 × 10 −5 .The comparable performance on the test set to the validation set indicates that the chosen ANNs did not overfit the training and validation data.These selected ANNs were ultimately retained for use as parameterizations and are evaluated in more detail on the test set in Sect. 5. We also trained several benchmark ANNs for comparison to the random ANNs.Each of the benchmark networks is composed of two to six hidden layers wired in sequence with tanh activation functions, and they represent the performance of conventional ANN architectures.One-layer ANNs performed almost an order of magnitude worse than the others and were not included.Benchmark ANNs with a total of 10 different sizes in terms of total trainable parameters were used.Five copies of each unique benchmark ANN layer count-parameter count combination were trained, and only the best-performing models were retained to ensure that poor performance at a particular ANN size was not simply due to an unlucky random initialization or training sample selection.This means that a total of 250 benchmark ANNs were trained for both the longwave and shortwave regimes.The performance of these benchmark ANNs is also indicated in Fig. 2 by solid lines.

Discussion of ANN architecture
The performance of the benchmark and random ANNs provides some insight into ANN design.Firstly, we note that one-layer ANNs were also tested, but they typically performed nearly an order of magnitude worse than other ANNs and are not shown in Fig. 2.This suggests that using almost any multilayer architecture, regardless of construction, can yield substantial performance gains.Secondly, the twoto six-layer sequential models are outperformed by the majority of randomly wired ANNs that have similar parame-ter counts.Also, the multilayer sequential models with more than three layers begin to perform worse than their shallower counterparts.It appears that the inclusion of skip connections has likely allowed the random networks to train successfully despite their depth (high layer count).In the context of this problem, the neural networks are attempting to fit a high-dimensional manifold that varies significantly with respect to several of the input parameters.Deeper networks are likely required to efficiently represent the non-linearities in the problem, but deep neural networks can struggle to train effectively due to vanishing gradients (Goodfellow et al., 2016).The ANNs that were ultimately chosen here tend to have more, but smaller, layers than the best serially connected ANNs, and they include multiple skip connections.
The universal approximation theorem implies that this problem is solvable with a wide, single-layer perceptron network (Hornik et al., 1989).In practice, however, multilayer networks are almost always more efficient, and this is the case here.Furthermore, any of the randomly wired networks used here could theoretically be represented by a serially connected multilayer network: one can imagine a serially connected network learning to apply the identity function to some of its inputs, thereby learning to generate skip connections on its own.Again, while it is technically possible, this is not the case in practice, and even learning the identity function is not necessarily a trivial task for neural networks.While the importance of skip connections has been thoroughly explored in the context of building very deep convolutional neural networks (He et al., 2016), it has only rarely been applied to ANNs with fully connected layers, although some early examples of this approach do exist (Lang and Witbrock, 1988).These results are informative for our application and similar use cases, where the ANN's memory and computational requirements at the inference time are of particular importance; moreover, by evaluating many ANN architectures, we have identified ANNs with significantly higher accuracy than conventional architectures with no increase in inference cost.Taken together, our results indicate that significant performance gains may be achieved in other applications of ANNs in the Earth sciences and Earth system modeling through in-depth exploration of task-optimized network architectures.

Evaluation
The ANNs were ultimately evaluated on the randomly generated holdout test set described in Sect.3.3.In addition to evaluating the accuracy of their outputs, we evaluate them on two additional optical properties derived from the ANN output: shortwave bulk scattering efficiency (Q Sca. ) and single-scattering albedo (SSA).These respective properties are computed as follows: (Bohren and Huffman, 1983).SSAs with Q Ext.< 0.01 were not included in the analysis because The mean absolute error is computed on normalized optical properties (directly on the outputs from the neural networks) and are dimensionless.In each case, there is a clear elbow, beyond which increasing the network size does not substantially improve performance.In both panels, the inset shows a magnified region around this elbow.Solid lines indicate the performance of traditional feed-forward multilayer perceptron ANNs with two to six hidden layers.The red dot indicates the network that was ultimately chosen for use.very small errors get amplified by the Q −1 Ext. in scenarios where scattering is negligible.The existing aerosol optics parameterization was also evaluated along with linear interpolation applied to several high-resolution tables of aerosol optical properties that were generated at a range of resolutions (described in Sect. 3.4).This includes the very high resolution table used for training and validation.The test set MAE for each of the output parameters and wavelength regimes are listed in Table 1.The ANN shows a substantial performance improvement over the existing parameterization, with MAEs about 3 orders of magnitude smaller.This is particularly no-table for the shortwave extinction efficiencies where the existing parameterization has an MAE of 0.2 but the ANN has an MAE of 3.6 × 10 −4 .Extinction efficiencies range from about 0 to 3.5, so an MAE of 0.2 is substantial.The performance of the additional interpolated optics tables behaves about as expected, with the MAE decreasing in proportion to table size.It can also be seen that a lookup table with approximately 10 9 parameters is required to achieve performance comparable to the ANN.This is far too large to be used in an ESM.Lastly, Table 1 indicates the test set performance of the best-performing conventional (serially connected) ANN on the test set, and again we see that it cannot match the performance of the randomly wired ANN, which consistently outperforms it by around 10 % to 30 % for the shortwave and 65 % for the longwave.
The very low MAE shown in Table 1 is encouraging, but ideally a parameterization should perform well over the full range of possible inputs and a low MAE could potentially still be achieved in the presence of outlier cases with high error that could cause problems when it is used in a climate simulation.Figure 4 shows logarithmically scaled histograms of the absolute error for all individual samples in the test set.Here, we see that, in addition to outperforming the benchmark optics tables and existing parameterization on average, the most extreme errors produced by the ANN are also far smaller than those produced by the existing parameterization.Furthermore, the ANN's histograms tend to have peaks at lower error values than the other methods.Note that, because of the log scaling, the peak represents a large number of samples and the size of the error distribution's tails is exaggerated.An interesting feature from Fig. 4   Lookup table 10 6 3.8 × 10 −3 6.6 × 10 −3 1.7 × 10 −3 9.0 × 10 −3 2.6 × 10 −3 2.5 × 10 −3 Lookup table 10 7 1.0 × 10 −3 1.9 × 10 −3 5.3 × 10 −4 2.5 × 10 −3 6.8 × 10 −4 6.7 × 10 −4 Lookup table 10 8   with very low error.These occur because some regions in the input space have little to no variability in the output spacefor instance, the large regions where extinction is near zero.The linear interpolation in the lookup tables can perfectly fit constant-valued functions, but the ANN and Chebyshev methods will still have a small amount of error.Ultimately, the key observation from Fig. 4 is that the ANN's errors do not have a large right tail, meaning that we still expect very accurate estimates of aerosol optical properties, even for the input queries where the ANN performs worst.Finally, Fig. 5 shows a joint histogram of bulk aerosol optical properties estimated by the existing parameterization and by direct computation with Mie code for all samples in the test set.Separate joint histograms are not included for the ANN outputs; instead, a red contour in each of the joint histograms denotes the boundary containing all samples.Notable patterns appear in the joint histograms of the shortwave extinction field and the fields derived from it (SW scattering and SSA) as well as, to a lesser degree, the other predicted fields.These arise in the Ghan and Zaveri (2007) parameterization from the Chebyshev polynomial fit used to approximate optical properties as a function of surface mode radius.The Chebyshev polynomials are smooth functions that do not perfectly fit the bulk extinction efficiency curve, for instance, and consistently over-or undershoot it for certain r s values.Because bulk extinction efficiency is very sensitive to the particle size distribution, this effect is obvious in Fig. 5.
Drawing the training set from a regular grid over the input space has ensured good coverage of possible input values, while generating a test set of equal size consisting of intermediate values that are not near points in the training or validation data helps demonstrate that the ANN will not perform unexpectedly when interpolating within the region defined by the training data.Together, Table 1 and Figs. 4 and 5 demonstrate that the ANN parameterization not only provides a dramatic performance improvement over the current approach but can also be expected to perform exceedingly well for the full range of possible input data, with no extreme cases of high error.Therefore, the ANN parameterization is an accurate and reliable replacement for the current bulk aerosol optics parameterization.

Conclusions
This work has demonstrated the effectiveness of machine learning for emulating the aerosol optical properties that are crucial to climate simulation.A neural network is capable of producing bulk optical property estimates that are substantially more accurate than those produced by the existing (Ghan and Zaveri, 2007) parameterization in E3SM and CESM and does so with an order of magnitude smaller memory requirement.The computational requirements for evaluating an ANN with 10 4 parameters is larger than the computational requirements of the current approach, but this parameterization is evaluated every time EAM calls radiation code, and evaluating the ANN requires negligible computation compared with the radiation code, so the impact on model runtime should be negligible.Additionally, the ANN outperforms lookup-table-based optics emulators that resolve aerosol optical properties at much higher resolution than the existing scheme.Testing over a wide range of possible input data showed that the neural network performs well over the possible input space and will not produce any outlier errors or unexpected results within this range.Representation of aerosol direct effects is a major source of uncertainty in climate simulation; while representation of aerosol optics is likely only a small component of this uncertainty, adequate representation of these physics is a key step forward towards accurately representing aerosols in general.This work, to some degree, should be seen as a first step or proof of concept and as a demonstration of the power of randomly wired networks for this problem.Our ultimate goal is to develop a neural-network-based parameterization that represents core-shell scattering -a physical model that is too computationally expensive to represent with existing parameterizations.While this work presents the machine learning technique and evaluates it directly against Mie code, we expect to follow it with a climate modeling study evaluating the impacts of this parameterization, and a future core-shell scattering model, on E3SM simulations.
In addition to developing a new parameterization, we applied a recently developed (Xie et al., 2019) neural architecture search strategy that randomizes wiring patterns in deep neural networks.Key findings were that deeper ANNs significantly outperformed a single-layer ANN of comparable size.Also, the majority of randomly constructed ANN architectures (which include skip connections) outperformed conventional multilayer perceptron networks.In the context of this study, the NAS allowed us to identify neural architectures that provide a substantial performance improvement with no increase in network size.
Our findings provide some insights into ANN design.The fact that the majority of randomly wired networks outperform multilayer networks with serially connected layers indicates that the inclusion of skip connections may be critical for this type of problem.In image processing, convolutional neural networks with a large number of layers and skip connections (He et al., 2016;Huang et al., 2017) were identified as superior to serially connected designs several years ago, and they have dominated deep learning research since.While using skip connections in networks constructed of fully connected layers is certainly not a new idea (Lang and Witbrock, 1988), it has received comparatively little attention in recent machine learning literature.This work indicates that the inclusion of skip connections could be an effective way to train smaller regressor and function-fitting neural networks to fit complicated data or surfaces.
To the best of our knowledge, this is the first use of randomly wired neural architecture search approaches in the atmospheric sciences.Their performance against conventional serially connected feed-forward ANNs in this task was striking.The majority of random wirings were better able to represent Mie optics than serial wirings by a substantial amount (about 10 %-30 % in the shortwave regime and 65 % in the longwave) with no increase in model complexity in terms of the number of trainable parameters.There has recently been significant push to leverage new advances in machine learning to replace the various existing parameterizations used by climate and weather models with more performant and/or accurate representations (e.g., Gettelman et al., 2021;Lagerquist et al., 2021).Many of these problems, like the Mie optics problem addressed here, are data-rich and well suited for neural architecture search, as training data can be produced by an accurate but computationally expensive nu-merical simulation.Our results indicate that, when using neural networks for this type of application, significant performance improvements can be achieved by taking care to design or select network architectures optimized for the target task.NAS algorithms and random wirings have, so far, received little attention in the Earth sciences, and random network wiring may be a fruitful strategy for developing neuralnetwork-based parameterizations and physics emulators in the future.
Appendix A Table A1.Errors between optical properties computed with PyMi-eScatt and MIEV0.Table A2.Constants used to standardize ANN inputs.For all variables except the real refractive index, standardization is done after taking the natural logarithm.A value of 1 × 10 −6 is added to the imaginary refractive index before taking the logarithm.

Figure 1 .
Figure1.A diagram of the aerosol optics parameterization and how it integrates with EAM.The "online" section shows how the parameterization is used during a simulation, and the "offline" portion shows the process of precomputing optical properties and preparing the parameterization.

Figure 2 .
Figure 2. Validation set performance of randomly wired neural networks plotted against the network size.Panels show results for different wavelength regimes.The mean absolute error is computed on normalized optical properties (directly on the outputs from the neural networks) and are dimensionless.In each case, there is a clear elbow, beyond which increasing the network size does not substantially improve performance.In both panels, the inset shows a magnified region around this elbow.Solid lines indicate the performance of traditional feed-forward multilayer perceptron ANNs with two to six hidden layers.The red dot indicates the network that was ultimately chosen for use.

Figure 3 .
Figure3.Wiring patterns of the two (longwave and shortwave) randomly generated neural networks that were selected for use in the optics emulator.Nodes represent "dense" (fully connected) layers."C" and "+" indicate whether inbound tensors are combined by concatenation or addition, respectively.All hidden layers have the same number of neurons within each network: SW has 54 and LW has 32 (the nine inputs are padded to reach the appropriate size, and the output layer has either 3 neurons for SW or 1 neuron for LW).

Figure 4 .
Figure 4. Error histograms for estimates of the bulk aerosol optics test dataset.These panels show the distribution of errors on a log-log histogram to make outlier cases with high error more apparent.The vertical grid shows the bin edges of the histogram.The blue and magenta lines represent the Chebyshev-polynomial-based parameterization and the neural network, respectively.The dashed gray lines represent the error from applying linear interpolation to precomputed optics datasets of varying resolution, with the highest-resolution tables appearing to the left and progressively coarser tables to the right.

Figure 5 .
Figure 5. Scatterplot-like joint histograms comparing optical properties from the Chebyshev-interpolation-based parameterization and Mie code.Gray shading indicates the density of data points.The red contour contains all outputs from the neural network, which all lie very close to the 1-to-1 line.

Table 1 .
is that the lookup tables tend to have longer left tails, representing cases https://doi.org/10.5194/gmd-16-2355-2023Geosci.Model Dev., 16, 2355-2370, 2023 Mean absolute error for bulk optical property estimates using different methods.Note that only bulk absorption efficiency is computed for the longwave bands and that shortwave single-scattering albedo (SSA) and bulk scattering efficiency are computed from shortwave absorption and extinction efficiencies.The overbars denote that these are bulk values integrated over lognormal size distributions (Eq.1).