Geostatistical inverse modeling with very large datasets: an example from the Orbiting Carbon Observatory 2 (OCO-2) satellite

Geostatistical inverse modeling (GIM) has become a common approach to estimating greenhouse gas fluxes at the Earth's surface using atmospheric observations. GIMs are unique relative to other commonly used approaches because they do not require a single emissions inventory or a bottom-up model to serve as an initial guess of the fluxes. Instead, a modeler can incorporate a wide range of environmental, economic, and/or land use data to estimate the fluxes. Traditionally, GIMs have been paired with in situ observations that number in the thousands or tens of thousands. However, the number of available atmospheric greenhouse gas observations has grown enormously as satellites, airborne measurement campaigns, and in situ monitoring stations have proliferated. This era of prolific greenhouse gas observations presents computational and statistical challenges for inverse modeling frameworks that have traditionally been paired with a limited number of in situ monitoring sites. In this article, we discuss the challenges of estimating greenhouse gas fluxes using large atmospheric datasets with a particular focus on GIMs. We subsequently discuss several strategies for estimating the fluxes and quantifying uncertainties, strategies that are adapted from hydrology, applied math, or other academic fields and are compatible with a wide variety of atmospheric models. We further evaluate the accuracy and computational burden of each strategy using a synthetic CO2 case study based upon NASA's Orbiting Carbon Observatory 2 (OCO-2) satellite. Specifically, we simultaneously estimate a full year of 3-hourly CO2 fluxes across North America in one case study: a total of 9.4 × 10^6 unknown fluxes using 9.9 × 10^4 synthetic observations. The strategies discussed here provide accurate estimates of CO2 fluxes that are comparable to fluxes calculated directly or analytically.
We are also able to approximate posterior uncertainties in the fluxes, but these approximations are, typically, an over- or underestimate depending upon the strategy employed and the degree of approximation required to make the calculations manageable.


S1 Efficient calculation of the GIM cost function and gradient
In the main manuscript, we describe an iterative approach to estimating the fluxes that requires calculating the cost function and gradient using the L-BFGS algorithm paired with a variable transformation (Sect. 4.1). These calculations require computing the product G* s* (dimensions m × 1), where

G* = I − Q^(−1/2) X (X^T Q^(−1) X)^(−1) X^T Q^(−1/2).

The matrix G* has dimensions m × m and is often prohibitively large to compute explicitly or store in memory. Instead, these calculations can be broken down into smaller components that are far more computationally efficient and require comparatively little memory:

G* s* = s* − A (X^T B)^(−1) (A^T s*), where A = Q^(−1/2) X and B = Q^(−1) X. (S1)

The calculations in Eq. (S1) require multiplying several matrices with p rows and/or columns, where p ≪ m. Furthermore, we can compute the matrices A and B efficiently, in a manner that does not require explicitly formulating or manipulating Q in its entirety. To do so, we exploit the Kronecker product structure Q = D ⊗ E (where D is the q × q temporal covariance, E is the r × r spatial covariance, and m = rq) and reformulate using the approach described in Yadav and Michalak (2013): the i-th block of A is

A_i = Σ_{j=1…q} D^(−1/2)(i, j) E^(−1/2) X_{j,*},

and B is computed analogously using D^(−1) and E^(−1). The matrix X_{j,*} denotes the rows of X corresponding to time period j, and D^(−1/2)(i, j) denotes the entry in the i-th row and j-th column of D^(−1/2). Each block in the equations above has dimensions r × p. There are q blocks, and collectively the matrices A and B have dimensions rq × p or, equivalently, m × p. Note that A and B need only be calculated once and can subsequently be reused in every iteration of the iterative solver.
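The blockwise Kronecker computation above can be illustrated with a small Python sketch. The dimensions, random covariance matrices, and variable names below are hypothetical stand-ins, not values from the study; the point is only that A = Q^(−1/2) X can be assembled block by block without ever forming Q or Q^(−1/2) at full size:

```python
import numpy as np

rng = np.random.default_rng(0)
q, r, p = 4, 6, 2          # toy sizes: q time blocks, r grid cells, p covariates
m = q * r                  # total number of fluxes

def random_spd(n):
    """Generate a random symmetric positive definite matrix (toy covariance)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

def inv_sqrt(M):
    """Matrix inverse square root via eigendecomposition (M must be SPD)."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

D, E = random_spd(q), random_spd(r)   # temporal (q x q) and spatial (r x r) parts
X = rng.standard_normal((m, p))

D_is, E_is = inv_sqrt(D), inv_sqrt(E)

# Blockwise: A_i = sum_j D^{-1/2}(i, j) * (E^{-1/2} X_j), each block r x p
EX = [E_is @ X[j * r:(j + 1) * r, :] for j in range(q)]
A_blocks = np.vstack([
    sum(D_is[i, j] * EX[j] for j in range(q))
    for i in range(q)
])

# Reference: form Q^{-1/2} = D^{-1/2} (kron) E^{-1/2} explicitly (only feasible
# at toy scale) and check the two computations agree
A_direct = np.kron(D_is, E_is) @ X
```

At realistic scale (m ≈ 9.4 × 10^6), only the blockwise path is tractable, since it touches matrices no larger than r × r, q × q, and m × p.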
One can use an analogous approach to calculate Q (H^T ξ) in the minimum residual approach to the solution (Eqs. 7–8 and Sect. 4.2) and Q^(1/2) s* in Eqs. 14, 16, and 17 (Sect. 4.1).

S2 Generation of conditional realizations using L-BFGS and a variable transformation
A conditional realization or simulation of the fluxes (denoted s_c, dimensions m × 1) is an estimate of the fluxes that randomly samples from the posterior distribution. A large number of conditional realizations can be used to represent the posterior uncertainties in the fluxes. Several existing studies detail how to create conditional realizations using the GIM system of linear equations (Eqs. 7–8; Kitanidis, 1996; Saibaba and Kitanidis, 2012). Furthermore, Kitanidis (1995) and Snodgrass and Kitanidis (1997) describe how to generate conditional realizations using the cost function and gradient without a variable transformation (e.g., using the L-BFGS algorithm, Sect. 4.1). Here, we detail how to create conditional realizations using L-BFGS with a variable transformation (Eqs. 11–16).
We modify the cost function and gradient slightly to obtain the equations required for generating conditional realizations (s_c). In this study, we estimate s*_c using the modified cost function and gradient paired with the L-BFGS minimum-finding algorithm, and we subsequently back-transform s*_c to obtain s_c. In these modified equations, c is a random n × 1 vector with c ~ N(0, R). If R is a diagonal matrix with diagonal elements σ_R^2, then c is a random vector with a mean of zero and standard deviation σ_R. The vector u has dimensions m × 1 and is randomly drawn from a standard normal distribution. Both c and u must be generated anew for each conditional realization.

WRF is a meteorological model, and the simulations used here have been specifically generated to be compatible with STILT (e.g., Nehrkorn et al., 2010). STILT, by contrast, is a particle trajectory model, described in detail by Lin et al. (2003). STILT leverages a meteorological model like WRF to estimate how CO2 fluxes in different locations and at different times would affect atmospheric CO2 levels at the measurement locations. Specifically, we use STILT to generate a set of gridded influence footprints with units of ppm per unit flux. Each STILT footprint subsequently becomes one of the n rows of H (e.g., Eq. 1), a key input of the inverse model. The WRF model outputs used here have a resolution of 10 km over the continental US and 30 km for other regions of North America. By contrast, we compute the STILT footprints at a spatial resolution of 1° by 1° and a 3-hourly temporal resolution. Note that we do not run WRF-STILT for every OCO-2 observation due to the computational cost. Instead, we generate STILT footprints corresponding to every 2 s along the OCO-2 flight track, yielding a total of 9.88 × 10^4 footprints for the one-year study period (Sept. 2014–Aug. 2015).
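The perturbation idea behind conditional realizations, i.e., draw c ~ N(0, R) and u ~ N(0, I), perturb the data and the prior, and re-minimize the objective, can be sketched at toy scale in a linear-Gaussian setting. All sizes and names below are hypothetical, and we solve the quadratic minimization directly with a linear solve rather than with L-BFGS and the variable transformation used in the study:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 15, 8                          # toy numbers of observations and fluxes
H = rng.standard_normal((n, m))       # stand-in for the footprint matrix
R = 0.5 * np.eye(n)                   # model-data mismatch covariance
Q = np.eye(m)                         # prior flux covariance (identity for simplicity)
mu = np.zeros(m)                      # prior mean (X @ beta in the GIM)
s_true = rng.standard_normal(m)
z = H @ s_true + rng.multivariate_normal(np.zeros(n), R)

Ri, Qi = np.linalg.inv(R), np.linalg.inv(Q)
lhs = H.T @ Ri @ H + Qi               # Hessian of the quadratic objective

def realization(c, u):
    """Minimize ||z + c - H s||^2_{R^-1} + ||s - (mu + Q^{1/2} u)||^2_{Q^-1}."""
    rhs = H.T @ Ri @ (z + c) + Qi @ (mu + np.linalg.cholesky(Q) @ u)
    return np.linalg.solve(lhs, rhs)

# With zero perturbations the minimizer is simply the posterior mean estimate
s_hat = realization(np.zeros(n), np.zeros(m))

# With fresh random c and u, each solve yields one conditional realization
c = rng.multivariate_normal(np.zeros(n), R)
u = rng.standard_normal(m)
s_c = realization(c, u)
```

Repeating the last three lines with new draws of c and u produces an ensemble of realizations whose spread approximates the posterior uncertainty.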

All of the experiments in this study are synthetic; we generate a set of OCO-2-like observations using a known CO2 flux estimate, and we use these synthetic observations in all of the inverse modeling experiments. This setup makes it possible to compare the CO2 fluxes estimated using the inverse model against a known set of true CO2 fluxes. We further add randomly generated errors to the synthetic observations, errors with a standard deviation of 2 ppm, a value similar to that estimated in a recent top-down study using OCO-2 observations (Miller et al., 2018).
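The synthetic-data setup amounts to applying the footprints to known "true" fluxes and adding 2 ppm noise. A minimal sketch, with toy dimensions and placeholder random inputs standing in for the real STILT footprints and CT2017-based fluxes:

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 500, 2000                    # toy stand-ins for ~9.9e4 obs and ~9.4e6 fluxes
H = rng.random((n, m)) * 1e-3       # stand-in footprints (ppm per unit flux)
s_true = rng.standard_normal(m)     # stand-in for the known "true" CO2 fluxes
sigma = 2.0                         # ppm, the observation error standard deviation

# Synthetic observations: modeled mole-fraction enhancements plus random error
z = H @ s_true + sigma * rng.standard_normal(n)
```

Because s_true is known, fluxes recovered by the inverse model from z can be scored directly against it.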

Note that we only consider biospheric CO2 fluxes in the synthetic data setup. Many inverse modeling studies of CO2 fix anthropogenic emissions at a specified value and/or pre-subtract the modeled contribution of anthropogenic emissions from the observations. Hence, we do not consider anthropogenic emissions in the synthetic case study here.
The covariance matrices also play a key role in the inverse modeling setup. The covariance matrix R describes errors in the inverse model that are unrelated to uncertain fluxes: errors due to the satellite retrieval, errors in modeled atmospheric transport, and errors due to the finite resolution of the inverse model. For the setup here, R is a diagonal matrix with a variance of (2 ppm)^2 on the diagonal. By contrast, the covariance matrix Q describes the spatial and temporal properties of the CO2 fluxes; this matrix helps guide the structure of the fluxes estimated by the inverse model. We use restricted maximum likelihood (RML) estimation to estimate the variance, decorrelation time, and decorrelation length in 3-hourly CT2017 fluxes, and we use these values to populate Q. RML is a statistical technique that can be used to estimate the spatial and temporal properties that are most likely given a dataset, in this case CO2 fluxes from CT2017 (e.g., Corbeil and Searle, 1976; Kitanidis, 1986; Mueller et al., 2008). In this application, RML requires minimizing a cost function that describes the likelihood of the data (i.e., the CT2017 fluxes) given some guess for the variance, decorrelation time, and decorrelation length. For the case studies here, we use unique values of the variance for each month, and we use a single estimate of the decorrelation length and time that has been averaged across all months. Furthermore, we set up Q such that fluxes from one three-hour time window covary with fluxes from the same three-hour window on adjacent days; however, fluxes from one three-hour time window will not covary with fluxes from other time windows. For example, fluxes from noon to 15:00 UTC on May 5 will covary with fluxes from noon to 15:00 UTC on May 4 and May 6 but will not covary with fluxes from 09:00 to noon or from 15:00 to 18:00 UTC on any day. The estimated covariance matrix parameters show a distinct seasonality.
For the six-week case study, we estimate a standard deviation in the fluxes at long spatial and temporal separation distances of 10.0 µmol m^−2 s^−1 (i.e., the square root of the diagonal elements of Q), a decorrelation length of 555 km (assuming a spherical covariance model), and a decorrelation time of 9.8 days (also assuming a spherical model). Figure S1 shows the standard deviation of the fluxes at long spatial and temporal distances by month for the one-year case study. This figure has a distinct sinusoidal shape, and the standard deviations are highest during the summer months. In that case study, we also use a decorrelation length of 586 km and a decorrelation time of 12.4 days; these values are the average of the values estimated for each month of the year.
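The temporal covariance structure described above, in which 3-hourly flux periods covary across days only within the same time-of-day window, with a spherical correlation in the day offset, can be sketched as follows. The sizes are toy values and the construction is our own illustration, not code from the study:

```python
import numpy as np

def spherical(h, L):
    """Spherical correlation for separation h and range (decorrelation scale) L."""
    h = np.asarray(h, dtype=float)
    c = 1.0 - 1.5 * (h / L) + 0.5 * (h / L) ** 3
    return np.where(h < L, c, 0.0)

n_days, per_day = 10, 8                            # toy: 10 days x 8 windows/day
t_day = np.repeat(np.arange(n_days), per_day)      # day index of each 3 h period
t_win = np.tile(np.arange(per_day), n_days)        # time-of-day window index

day_lag = np.abs(t_day[:, None] - t_day[None, :])  # separation in days
same_window = t_win[:, None] == t_win[None, :]     # mask: same time-of-day only
L_days = 9.8                                        # decorrelation time (days)

# Temporal correlation matrix D: spherical decay across days, zero between
# different time-of-day windows
D = spherical(day_lag, L_days) * same_window
```

A matrix like D would then be combined with a spatial covariance (e.g., a spherical model with a 555 km range) via a Kronecker product to build Q.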
The matrix X is also a key input to the GIM. In many studies, X includes predictor variables that may help describe the spatial and/or temporal distribution of the fluxes. For the setup here, X is non-informative and consists of columns of ones. As a result, the fluxes estimated by the GIM will only reflect the information in the atmospheric observations and will not reflect any prior information about the fluxes. In the 6-week case study for June and July 2015, X has dimensions m × 8; each column of X corresponds to a different 3-hour time period of the day (for a total of 8 columns). For the annual case, we include one column for each month. Both of these setups for X have been used in past GIM studies (e.g., Gourdji et al., 2008, 2012). The different columns of X account for the fact that the fluxes have different overall magnitudes at different times of day and/or during different months of the year. The setup here also avoids including too many columns in X: the coefficients (β) estimated by the GIM scale each column of X to fit the observations, and these coefficients are not guided or regularized by any prior estimate. Hence, we do not want to include too many columns in X, to avoid overfitting the atmospheric observations and/or obtaining unrealistic estimates of the coefficients (β).
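The non-informative X for the 6-week case can be sketched as indicator columns of ones, one per 3-hourly time-of-day window. The sizes and flux ordering below are hypothetical toy choices, not those of the actual inversion:

```python
import numpy as np

n_days, per_day, r = 5, 8, 10       # toy: 5 days, 8 windows/day, 10 grid cells
m = n_days * per_day * r            # one unknown flux per cell per 3 h period

# Time-of-day window index for each flux (assuming fluxes are ordered with
# grid cells varying fastest within each 3 h period)
window = np.tile(np.repeat(np.arange(per_day), r), n_days)

# X: m x 8 matrix of indicator columns, one column of ones per window
X = (window[:, None] == np.arange(per_day)[None, :]).astype(float)
```

Each row of X contains a single one, so each flux is assigned to exactly one time-of-day mean scaled by its coefficient β.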