Efﬁcient multi-scale Gaussian process regression for massive remote sensing data with satGP v0.1

. Satellite remote sensing provides a global view to processes on Earth that has unique beneﬁts compared to measurements made on the ground. The global coverage and the enormous amounts of data produced come, however, with the price of spatial and temporal gaps and less than perfect data quality. Meaningful statistical inference from such data requires overcoming these problems and that calls for developing efﬁcient computational tools. We design and implement a computationally efﬁcient multi-scale Gaussian process (GP) software package, satGP, geared 5 towards remote sensing applications. The software is designed to be able to handle problems of enormous sizes and is able to compute marginals and sample from a random process with at least over hundred million observations. The mean function of the Gaussian process is described by approximating marginals of a Markov random ﬁeld (MRF). For covariance functions, Matern, exponential, and periodic kernels are utilized in a multi-scale kernel setting to describe the spatial heterogeneity present in data. We further demonstrate how winds can be used to inform the covariance kernel formulation. The 10 covariance kernel parameters are learned by calculating an approximate marginal maximum likelihood estimate and this is utilized to verify the validity of the multi-scale approach in synthetic experiments. For demonstrating the techniques above, data from the Orbiting Carbon Observatory 2 (OCO-2) satellite is used. The satGP program is released as open source software.


Introduction
Climate change is one of the most important current global environmental challenges, to the point where it is drawing constant widespread attention even in mainstream media. The underlying reason is the anthropogenic carbon emissions: among the well-mixed greenhouse gases, carbon dioxide (CO2) has currently the strongest effect on warming the planet, with the radiative forcing of ca. 1.68 W m − 2 according to the latest IPCC report (IPCC, 2013). results from CarbonTracker and the Goddard Earth Observing System with atmospheric chemistry (GEOS-Chem). This study evaluated also the temporal trend of the XCO2. Similarly Tadić et al. (2017) describe a moving window block kriging algorithm to introduce time dependence into GOSAT-based XCO2 map construction process using a quasi-probabilistic screening method for subsampling observations, thinning the data for computational reasons. Other recent studies have also contained analyses of OCO-2 data. For example, Zammit-Mangion et al. (2018) present fixed rank kriging (FRK) results based on OCO-2 data 5 using a 16-day moving window. The results again appear very smooth.
An interesting approach is presented by Ma and Kang (2017), who describe a fused Gaussian process, combining a graphical model with a Gaussian process and applying that to sea surface temperature data. Another interesting approach for atmospheric trace gas inversion is presented by Zammit-Mangion et al. (2015), who simultaneously model both flux fields and concentrations using a bivariate spatiotemporal model, utilizing Hamiltonian Monte Carlo (Neal, 2011) for sampling the posterior. 10 However, due to computational challenges the footprint area is very small.
For overcoming the difficulties posed by large numbers of data, various methods have been proposed. Lindgren et al. (2011) provide an explicit link between some random fields arising as solutions to certain stochastic partial differential equations and Markov random fields. A recent review of Vecchia-type approximations (Vecchia, 1988) is given by (Katzfuss et al., 2018) and a comparison of the performance of several recently developed methods is given by Heaton et al. (2018), with applications 15 to MODIS data. The difficulty of ordering the observations for effective inference with Gaussian processes, especially as the dimension of the inputs grows, is underlined by Ambikasaran et al. (2016).
In this work we describe an approach to solve spatial statistics problems with hundreds of millions of data points. We do this by combining various ideas and techniques that come close to those applied in Vecchia-type and nearest neighbor Gaussian processes while utilizing random sampling and aggressive pre-filtering of uninformative data when possible. The presentation 20 of the general Gaussian process problem is based on the one given by Santner et al. (2003) and Rasmussen and Williams (2006).
A generic space and time dependent mean function of the Gaussian process is found by solving marginals of a Markov random field (MRF). For covariance modeling, a multi-scale covariance kernel formulation is given. The validity of the multiscale approach is established via a synthetic study. Approximate methods to learn the parameters of both the covariance kernel 25 and the mean function as implemented in satGP are outlined. Additionally, a non-stationary covariance kernel formulation for utilizing wind data for computation, partly inspired by (Nassar et al., 2017), is proposed.
The capabilities of this early version satGP are demonstrated in practice by computing global XCO2 concentrations for a duration of 1526 days at 0.5 • spatial and daily temporal resolution with XCO2 data from OCO-2 utilizing over 116 million observations. The number of computed marginals is over 350 million. An example of how these results look like is given by 30 Fig. 7.
The key advances of this work are the capability to compute Gaussian process predictions with enormous remote sensing data sets, a practical way of learning the multi-scale kernel parameters and mean function parameters from data, and introduction of the flexible open source software, of which this is a first released version. Describing these developments is approached from the perspective of how the various parts of computation are implemented in the current version of satGP. 35 The rest of the manuscript is organized in the following manner: Section 2 describes the methods both generally and as implemented in satGP. An overview of computation in satGP is given in Sect. 3, and Sect. 4 presents and discusses simulation results, including a multi-scale synthetic parameter identifiability study and two applications to the OCO-2 v9 data set. In the concluding Sect. 5 some possible future directions are briefly mentioned.

5
In geosciences, kriging (Cressie and Wikle, 2001;Chiles and Delfiner, 2012) is often used for performing spatial statistics tasks such as gap-filling or representing data in a grid. The semivariogram models used in kriging are closely related to the covariance models used in the Gaussian process formalism (Santner et al., 2003;Rasmussen and Williams, 2006;Gelman et al., 2013), where instead of learning the variogram model from the data, a form of a covariance function is prescribed and its parameters learned. 10 Intuitively, one would like to learn properties of a spatio-temporal surface from some observational data of some quantity of interest. To each point in space and time corresponds a Gaussian distribution of that quantity, whose mean and variance can be calculated by solving a local regression problem at each desired point. This can also be crudely thought about as optimally solving a spatio-temporal interpolation problem when the observations have Gaussian errors.
The underlying theory related to Bayesian statistics, Gaussian processes, and Markov random fields is well known and 15 therefore the novel aspects in this section have to do with the computational methods and modifications that are presented, such as observation selection schemes in Sect. 2.6 or approximate marginal maximum likelihood computation in Sect. 2.7.
These modifications trade precision for tractability, but in a way that the results still remain valid. Due to the size of the problem, some sacrifices need to be made in order to be able to obtain any solution.
This section goes through the Gaussian process formalism, and both generic and the satGP-specific forms of mean and 20 covariance functions are described. This is followed by discussion of how observation selection is carried out and how model parameters are learned.

Gaussian process regression
A Gaussian process is a stochastic process, which can be thought of as an infinite-dimensional Gaussian distribution in that the joint distributions at any finite set A of space-time points are multivariate normal. We denote the vector of these points by 25 x ∈ R q and underline that they contain both space and time components. In this work q = 3, even though this restriction can be overcome if needed, and satGP does have limited support for space-only problems.
The Gaussian process is denoted by where m : R q → R and k : R q 2 → R are respectively the mean and covariance functions of the process parameterized by hyperparameter vectors β ∈ R n β and θ ∈ R n θ . Note, that with these functions x and x refer to coordinates of a single location in the spatio-temporal domain, while below it may also refer to multiple locations, depending on context.
The function m above is called the drift in kriging literature, and the expected value of the process in areas with no data will tend to the value of the mean function in that area. It is chosen to reflect the deterministic patterns in the data, and these 5 choices also affect how the function k and parameters θ in Eq. (1)  In what follows the domain R q x is divided into two disjoint parts, one of which, X train ⊂ R q , contains the part where 15 observation data (training data) was measured, and another one, X test = R q \X train , where observations were not made. Any x ∈ X test is below called test input as is often done in the GP literature, and these points are generally denoted by x * .
In practice marginals of the random function Ψ in Eq. (1) or samples ψ from it are evaluated (computed) only at a finite set of points. Let ψ obs ∈ R n denote a vector of observations -synthetic or real -generated by the Gaussian process at locations x obs ∈ R n×q . Given a set of functions f i for constructing the mean function, the matrix with elements f i (x j ; δ(x s j )) 20 corresponding to locations x j with regression coefficients β(x s j ) is denoted by F (x). For a single input, instead of F (x) the notation f : R q → R n β is used, and with that, f (x * ) = [f 1 (x * ), . . . , f n β (x * )] T . The joint distribution of the field at observed locations is then given by where the covariance matrix K is defined by its elements K i,j = k(x obs i , x obs j ; θ). For the mean function, in this work a specific is used, where the superindexes s and t refer to the spatial and temporal parts of the generic coordinate x, respectively, and δ(x s ) are auxiliary parameters which are potentially space-dependent. The purpose of the functionf is purely illustrative, showing that given the parameters δ, the function f does not depend on the spatial part of x, and similarly that the β parameters do not Bayesian statistics is a standard paradigm for analyzing data and uncertainties, and it is also widely used in geosciences (Rodgers, 2000;Gelman et al., 2013). From the vantage point it provides, given the observed data Ψ obs = ψ obs at some finite set of points x obs , the object of interest of the inference problem in this work is the joint posterior distribution of the Gaussian process and the parameters, where p(ψ|β, δ, θ) is the Gaussian process prior and p(β, δ, θ) is a prior on the Gaussian process hyperparameters. This calculation is not generally tractable for a huge number of inputs x, but posterior estimates of the GP, p(ψ|ψ obs ,β,δ,θ), can be calculated by conditioning on parameter point estimatesθ,β, andδ. The first of these may be found by minimizing some loss function L, described below in Sect. 2.7, 10 and for the second a closed-form expression, given a point estimate of the parameters θ and δ, is given by The δ parameters can be found approximately by finding a point estimate of parameters β and δ before computing Eq. (6), and by re-calibrating δ alone after. In practice this produces stable results with the OCO-2 data, and for pathological data sets, 15 repeated alternating optimization of the parameters may be performed.
Even though a full posterior distribution of the parameters is not obtained this way, the solution of the Gaussian process itself is Bayesian in that the posterior marginals at each x are found by conditioning on the observations. In the satGP software, the space-dependent β and δ parameters are fitted first, and any learning of the covariance parameters is done only after that.
For prediction in the context of Gaussian random functions, the properties of multivariate normal distributions are exploited 20 for calculating marginals of the random field Ψ at any set of points x.
The posterior distribution p(ψ * |ψ obs ,θ,β) of the Gaussian process at a finite set of test inputs x * can, given point estimateŝ β andθ, be modeled according to Eq.
where Ψ and x have been divided into two parts -one for the test inputs x * , and the other one for the observations x obs . The 25 predictive distribution at x * can then be written as Ψ * |β,θ ∼ N (µ * , Σ * ), where its moments are given by and and where the covariance Σ * is the Schur complement of K(x * , x * ).

Overview and objectives of satGP
The satGP program is meant to be a general purpose Gaussian process toolbox with emphasis on applicability to large remote sensing datasets. It features a selection of covariance kernels and routines for learning space-dependent mean function parameters and covariance parameters from data. With a given set of parameters, it computes posterior marginals and uncertainties at the spatial resolution desired by the user, or generates samples from the process. Drawing samples from the prior is also sup-5 ported, and this can be utilized for devising synthetic data experiments to study the identifiability of the GP covariance kernel parameters. This section goes through these capabilities and relevant computational details. Since the softwere is applied in Sect. 4 to OCO-2 data, details pertaining to that particular case are included for illustration.

Mean functions in satGP
The most general mean function form available in satGP is given by Eq. (3). The functions f i above are user-defined and, for 10 ease of use, functionality for using a zero mean function, a spatially independent mean function, and an arbitrary gridded array of values are available. The specific forms of f i used for the OCO-2 experiments in Sect. 4 are given by where ∆ year is the duration of one year, and δ x s is a space-dependent phase shift. The function f 1 fits the summer-winter cycle, and f 2 fits the semiannual cycle. It is assumed that these can be modeled with the same δ x s parameters. The constant term is 15 given by f 3 , and f 4 gives the slow global trend. The fit to the global mean values of XCO2 from OCO-2 can be seen in Fig. 1.

Learning β(x s ) as a Markov random field
When not learning GP covariance parameters or generating synthetic training sets, the finite set of test inputs x * for GP calculation is taken in satGP to be a grid with predefined geographical and temporal extents and resolution. Solving the GP marginalization and sampling problems then amounts to solving Eq. (9) and (10) at each corresponding space-time point.

20
Since e.g. sources, sinks and timing of seasons are local, the mean function should be different from one spatial grid point to another. This is achieved by modeling the β(x s ) parameters as a Markov random field, which are often used in geophysics as a computational tool to solve large spatial statistics or inference problems. In practice what follows explains how the spatial dependence can be resolved using computational statistics. The MRF imposes the condition that neighboring grid cells should not be too different from each other. How different they are allowed to be is a modeling choice, see Appendix A. ability of the β-parameters of latitude i and longitude j is given by p Since the maximal cliques of this graph are the connected pairs of vertices, according to Hammersley and Clifford (1971) the full joint distribution of the graph p(V) factors as (ν,ν )∈E 1 Z φ(ν, ν ), where Z is called a partition function and φ are compatibility functions. One reasonably efficient way to solve marginals for each vertex in such a graph is to use the variable 5 elimination algorithm, which is an exact standard algorithm suitable for undirected graphs of moderate size. To make the computation faster, satGP currently uses a modified version to compute each diagonal in the graph in parallel from ν 0,0 to ν n lat ,n lon and back, conditioning each ν ij on the previously evaluated vertices in ∂ν ij without introducing the diagonal edges of the reconstituted graph, as would be normally done. The program also inversely weights the edges exponentially according to the distances between the (geographical) coordinates corresponding to the connected nodes. This rate of exponential decay 10 is user-configurable. The structure of the MRF and the approximate elimination order are shown in Fig. 2.
In the particular form used for OCO-2 data in Eq. (11), the phase-shift parameter δ cannot be estimated with regression like β in Eq. (9) and (10). For this reason, the nonlinear space-dependent δ-parameters are found with an optimization algorithm from the NLOpt package, by default the BFGS algorithm, before findingβ with Eq. (9) and (10), and after obtainingβ the δ parameter is re-optimized given theβ. For calibrating the δ parameters for vertex ν, the quantity Here the first sum runs over the training data selected by the observation selection method described in Sect. 2.6. This optimization problem is very simple since there are few β or δ parameters for the individual vertices. The complexity introduced by the interactions described by the edges is taken care of by the approximate elimination algorithm described above.

Covariance functions in satGP
The smoothness, amplitude, and length scale of the Gaussian process are determined by the covariance kernel used, and this choice much determines how the result of the computation looks like. The satGP program supports several different types of covariance function components for forming the full covariance function k in Eq. (1). The options available reflect the properties that can be expected in remote sensing data -varying smoothness and meridional and zonal length scales, potential 5 periodicity, and changing the orientation of the data-informed and uninformed axes according to wind speed and direction. This section lists the available covariance function formulations. For further intuition regarding the parameters, also see Appendix A.
For convenience, let 10 where γ > 0 is the exponent, I ⊆ {x s , x t } is a set of dimensions of the input, with x s referring to latitude and longitude and x t to time. The P I matrix projects x onto indices I, and Γ is a diagonal covariance matrix with elements γ c , and the notation r Γ stands for √ r T Γ −1 r. The space-only variables are denoted I S and spatial and temporal variables together are denoted The exponential family of covariance functions with parameters θ = (γ, l, τ ) is defined by the covariance function The exponent γ controls the smoothness of the samples from the Gaussian process, with γ = 2 yielding infinitely differentiable realizations.
The Matérn family of covariance functions, with θ = (ν, I , τ ) is given by the covariance where s = 2 √ νξ 1 I (x, x ) and ν controls the smoothness parameter usually denoted by α via α = ν + q 2 . The function K ν is the modified Bessel function of the second kind of order ν. With q = 1, the value ν = ∞ corresponds to the squared exponential kernel and ν = 0.5 to the exponential kernel with γ = 1. Despite this similarity between the Matérn and exponential kernels, the realizations of the random function from the processes with values 1 2 < ν < ∞ do not correspond to those with the kernel 10 k exp with any value of γ.
A periodic kernel with θ = (τ, per , θ exp ) is defined in satGP by and the term θ exp defines the parameters for the exponential functions ξ, while per controls the periodic (inter-period) covariance 15 length. While the periodic kernel is not utilized with the OCO-2 case studies below, it can be a useful tool in many other situations, such as with OCO-3, which due to not being on a Sun-synchronous orbit will make observations at varying local times.
An additional covariance function formulation available in satGP is one based on local wind information. The underlying rationale is that winds affect how quantities of interest such as gases in the atmosphere or algae blooms in the surface water 20 spread. Therefore, if wind data is available, it is natural to use it in the Gaussian process.
The wind-informed covariance has parameters θ = (τ, I , ρ, w * ) and is defined by where the difference between x W and x W is represented using transformed axes parallel and perpendicular to the wind direction at the test input x * . The spatial scaling parameters in Eq. (13) for k W , corresponding to the parallel to wind and 25 perpendicular to wind directions, are given by where w * is the wind velocity at the test input x * and ρ scales the effect of the wind. The parameter vector for the exponential kernel θ W = (τ, γ, , ⊥ , t , 2), where the last element denotes the exponent γ used by the exponential kernel. The resulting covariance ellipses are shown in Fig. 3 for several wind vectors and values of ρ. The covariance functions used in this work to model Ψ are sums of several kernels -sums of valid Gaussian process kernels remain valid kernels. The general form of this multi-scale kernel is given by where the first term, which in kriging is called the nugget, contains the observation error variances, and the parameter θ is 4 utilize kernels with one to three components. The kernel components of a multi-scale kernel are below called subkernels.

Covariance localization and observation allocation for the multi-scale kernel
Using a large number of observations makes solving the Gaussian process Eq. (9) and (10) untractable as the cost of inverting the covariance matrix scales as O(n 3 obs ). This creates a need for finding approximate solutions while introducing as little error 10 as possible. In satGP covariance localization is used to utilize only a subset of observations for computing Eq. (9) and (10). To do this, a maximum subkernel covariance matrix size κ and minimum covariance parameter σ 2 min are defined by the user. Assume that the multi-scale kernel defined by the user contains n ker subkernels. For each test input x * and for each subkernel 15 where the last condition prevents observations from being added by several subkernels. From these candidate observations, min(|A obs * ,l |, κ) are selected, either greedily selecting the κ observations with highest k(x i , x * ), or choosing the observations uniformly randomly sampling from those training data for which the minimum covariance threshold is exceeded, see Appendix A for additional details. When |A obs * ,l | < κ and l < n ker , the parameter κ will be grown for the next kernel to compensate for the deficit by setting κ ← κ + (κ − |A obs * ,l |). This is done to allow the full kernel size to grow to n ker κ when possible.
specify the subkernel with the largest parameters as the last one. After selecting all observations for all kernels, the covariance matrix K is constructed by evaluating the full covariance function k according to Eq. (18) for all pairs of selected observations.
For learning the locally varying parameters in the mean function with Eq. (6) - (7), the observation selection is performed by disregarding the time component, i.e. setting x t i ← x * t for all x i . Observation allocation could be done also by selecting observations based on values of k instead of each k l individually, or 5 by other approaches, such as the one presented by Schäfer et al. (2017). However, while the method of observation selection does have an effect on the inferred posterior marginals, the screening property of Gaussian processes ensures that this effect is not major as long as observational noise is small and the nearest observations are included in all directions.
Out of the two methods available in satGP, random selection avoids observation sorting and is therefore faster, especially if a huge number of data are near the test input x * . This comes at the cost of producing slightly noisier fields of marginal 10 posterior means. For covariance parameter estimation random selection works well. The current nearest-neighbor-in-covariance approach is only one possibility, but is justified by the parameter identifiability results in Sect. 4.1.

Learning the covariance parameters θ
From Eq. (4), the log marginal likelihood of observations ψ obs given a set of parameters θ, β and δ is given by

15
where the covariance function parameters θ are implicitly in K and the non-linear mean function parameters in F , for which the shorthand notation F = F (x obs ) is used in this section. The maximum (marginal) likelihood estimate (MLE)θ of θ can be found via minimizing as stated in context of Eq. (5).

20
In the presence of a huge number of observations, calculating the determinant of the full covariance |K| is not feasible, and the log likelihood is approximated with the block diagonal form, resulting in While this method is suitable for finding point estimates for the parameters θ, the log-likelihood has an unknown scaling factor resulting in an unknown multiplicative factor for the variance term in the exponent of the Gaussian distribution, and hence information about the true size of the posterior of the covariance parameters p(θ|ψ obs , β, δ) is lost.
The covariance parameter optimization can be performed by using optimization algorithms such as COBYLA or SBPLEX available in NLOpt (Johnson, 2014). An alternative is to explore the scaled posterior by using the Adaptive Metropolis (AM) Markov chain Monte Carlo (MCMC) algorithm (Haario et al., 2001), an implementation of which is included in the satGP source code. Using MCMC is feasible since the forward model is just sampling from a multivariate normal distribution which is very fast, and also due to that the parameter dimension is moderate, even with multiple subkernels.

3 Overview of Computation
The satGP code is written in C with visualization scripts written in Python and parallellization implemented with OpenMP directives. The program reads data from netCDF files and the configuration from a C header file. For linear algebra, the C interfaces of LAPACK and BLAS, LAPACKE and CBLAS, are utilized and for optimization tasks, algorithms in the NLOpt library are used. The computations are carried out in single precision both in order to save memory resources with the largest 10 data sets and also in anticipation of implementing the covariance function routines in a way that allows computation on graphics processing units.
The most important configuration variables are listed in Table 1. The user needs to define whether parameters are learned or prescribed and whether marginals or samples from the GP are to be computed. The mean and covariance kernel need to be defined by initializing corresponding structs with parameters and their limits if calibration is to be performed. For computing 15 GP marginals or drawing samples from the random process, the geographic and temporal extents need to be specified and the mean function and the covariance kernel used must be given. For more details than is described below, see Appendix A.
For computational efficiency, several parameters can be tweaked, including all of those in the second and last sections of Table 1. The first main bottleneck for computing a marginal at x * is sorting the observations for selecting the most informative ones to be used in the covariance matrices, see Sect. 2.6. This requires roughly O(r l log r l + κ log κ) operations, where r l ∝ 20 q i=1 l i is the number of grid locations (test inputs) x * i in the spatial grid such that for the l th subkernel, k l (x * i , x * ) < σ 2 min . Here the parameters l i are the corresponding length scale parameters over all the dimensions of the inputs x -this controls the size of the hypersphere inside which observations are considered for each x * . The second bottleneck is calculating the Cholesky decompositions of the covariance matrices K with cost O((n ker κ) 3 ). The cost of calculating the means and variances of the GP in a grid for a set of n times points on the time axis is therefore given by where A is the grid area in degrees squared and ω is the grid resolution. When the random observation selection method mentioned in Sect. 2.6 is used, the r l log r in Eq. (23) becomes just r l .
The execution of the program is presented in Fig. 4. The names of the subprograms here deviate from those in the code to improve readability.

30
The function AddToState() reads observations (asynchronously) into a state object that tracks the proximity of each observation to each grid point. Only part of data is added, and what part, is controlled on l. 6 by the parameter η i train , which  corresponds to the inclusion probability of each observation. This probability depends on ζ train in Table 1 via where d(x i , x iprev ) is the Euclidean distance to the previous added point and ∧ is the standard notation for minimum. Hence with ζ = 0, all observations will be added.
For computing the marginals, the spatial domain can be decomposed with Decompose(), line 23, into several spatial 5 subdomains (sd) so that arbitrary-size grids can be computed. This makes solving large problems with limited amount of memory possible, but only works with sampling = 2.
The state object is emptied by ReInitializeState() which also potentially sets new subdomain extents. Function SampleFromPrior() actually performs the computations on lines 30-37, but with the set of points x * i in a random pattern instead of in a grid as is the case in l. 27-38.
SelectObservations() method (l. 31) carries out selecting the best observations as described in Sect. 2.6. For constructing the set of potential observations, the grid is searched for locations that may have informative observations for the current test input stored in the state object. These locations are first ordered into categories with decreasing potential covariance and for the best locations, that together hold at least 2κ observations, the covariance function with the test input is evaluated. Out 5 of these, the κ best are chosen. The factor 2 can be increased for the wind-informed kernel and the value 8 is used in the demonstration of the wind-informed kernel in Sect. 4.7.
The function ComputeMarginal() constructs the covariance matrix K, inverts via the Cholesky decomposition, and solves Eq. (9) and (10) to find the marginal distribution at any test input x * . That function returns the negative log likelihood and is therefore directly used in learning the covariance parameters θ in FindCovfunCoeffs() on line 18.

10
The Gaussian process algorithm is an interpolation algorithm when observation noise is zero, and interpolation algorithms may misbehave when used for extrapolation. In a spatio-temporal large grid, when sampling = 2, i.e. when draws of the Gaussian process are generated in a regular spatio-temporal grid, computing conditionals based on the previous predictions would amount to extrapolation if done in order. For this reason, a deterministic sparse ordering is used, which ensures that test inputs corresponding to simultaneous predictions are far from each other so that their mutual covariance is negligible.

15
For this reason conditioning on already computed values is for the vast majority of GP evaluations interpolation instead of extrapolation.

Results and discussion
In this section, several simulation studies are presented. In the first experiment, parameter identifiability with the multi-scale kernel is examined with satGP-generated data. After that, the MRF of mean function β coefficients is trained with OCO-2 data 20 and those fields are then briefly analyzed.
Based on a locally varying mean function of the form in Eq. (3), the covariance parameters of the OCO-2 XCO2 spatiotemporal field are learned. Knowing both the mean and the covariance functions, the Gaussian process is then solved globally in a grid and snapshots of the mean and uncertainty fields are presented. The section is concluded by a demonstrating how the wind-informed kernel works. The covariance function parameters are learned from data.

Parameter identifiability with the multi-scale kernel
A synthetic study was performed to confirm the identifiability of the multi-scale covariance function parameters. For this, sampling with a random spatial pattern from the prior was carried out, adding 1% noise, and then estimating the parameters by computing the posterior mean estimates using Adaptive Metropolis.
The identifiability experiment was performed with various kernels, and the more complex the kernel, the more difficult 30 recovering the true parameters was. With a single Matern, exponential, or periodic kernel, the parameters could be recovered very easily. This was also true for a combination of exponential and Matern kernels with a relatively small κ parameter. Data: lelist containing les with observation data y i = (µ ψi , σ 2 ψi ) indexed by location x i , input variables from Table 1. Result: Optimized β parameters for mean function and θ parameters for covariance kernel, gridded Gaussian process marginal means and variances or a sample from the Gaussian process evaluated in a grid.  Figure 4. Overview of satGP. After initialization data is read for training m and k, after which possible MRF computation is carried out. This is followed by sampling the prior if a synthetic study is performed, and learning the θ parameters controlling k. Gaussian process marginals are then computed in a grid, potentially by decomposing the domain for large grids. Finally, samples from the GP may be drawn.
The covariance kernel parameters were still recoverable with a combination of three kernels -Matern with ν = 5 2 , exponential, and periodic, but for this, a larger κ was needed -the simulation shown used κ = 256. With small κ, some of the parameters had a tendency to end up at the lower boundary, possibly due to effects of the covariance cutoff on the determinant of the covariance matrix in Eq. (20). Optimization using minimization algorithms such as Nelder-Mead, COBYLA, or BOBYQA tended to often end up in local minima, and for this reason MCMC was used instead. The number of random 5 reference points in E ref in Eq. (22) was set to 12, which was enough to reliably recover parameters close to the true value.
The parameter limits, true values, and posterior means of the synthetic experiment with three kernels are given in Table 2.
In total 200,000 observations were created in the region between -10 and 10 latitude and -10 and 10 longitude over a period of four years according to the values in Table 2. A total of 10 million Metropolis-Hastings iterations were carried out to make sure that the posterior covariance stabilized. The posterior, with first 50% of the chain discarded as burn-in, is shown in Fig. 5 10

The OCO-2 v9 data
The simulations with real remote sensing data utilize the v9 data from the OCO-2 satellite. The OCO-2 satellite was launched in 2014, and it orbits the Earth on a Sun-synchronous orbit O'Dell et al., 2012). The footprint area of each measurement is roughly 1.29 by 2.25 kilometers, but the data is very sparse in time and in space. The satellite completes 14.57 revolutions around Earth overpasses in one day. In the presence of clouds, the satellite is not able to produce measurements, and this poses a challenge for areas with persistent cloud covers, such as Northern Europe in the winter. The present work utilizes the XCO2 data, its reported uncertainties, associated coordinate information, and zonal and meridional wind speeds that are contained in the data files. Only observations flagged good are used, and there are in total 116489342 such observations for the time period considered.

Solving the mean function for OCO-2 v9
Solving the mean function from OCO-2 v9 XCO2 data, as described in Sect. 2.4, produces best estimates for the coefficients of

Covariance parameters of the OCO-2 v9 data 10
The OCO-2 data has several natural length scales, both spatially and temporally. The distance between adjacent observations is only one to two kilometers in space and some hundredths of a second in time, but the distance between consecutive orbits is thousands of kilometers in space and several hours in time. On consecutive days the satellite passes close to the trajectory of the previous day at a distance of tens to three hundred kilometers depending on the latitude. The Earth has natural temporal diurnal and annual cycles, but since OCO-2 is Sun-synchronous, only the latter matters. Since the annual cycle is already fitted 15 in the particular form of the mean function used, Eq. (11), a periodic kernel component is not included, and the data is modeled with a kernel consisting of a larger-scale exponential and smaller scale Matern component.
The covariance parameters for the two-component kernel, which are the median values from sampling the posterior with MCMC, are given in Table 3. With learning the parameters from a data set with natural length scales, the posterior may appear multi-modal, with some of the modes only having relatively little mass. In such a case, the median provides a more robust estimate for the parameters than the mean. The t parameters of the posterior mean were slightly larger, which would result in slower computation. Selecting the median is further justified by the slight overestimation of some parameters in the synthetic study in Sect. 4.1.
Learning the covariance parameters from OCO-2 v9 data used the following configuration parameters for satGP: ζ train = 0, 5 κ = 256, and n ref = 12. A total of 1.1184 million MCMC iterations were completed, with the first 50% discarded as burn-in to produce statistics. The reference points were randomly picked from a rectangle with corners at (0 • S, 65 • E) and (60 • N, 145 • E). While using the whole globe would have been a principled choice, MCMC requires lots of iterations, and for any claim of global coverage n ref would have needed to be much larger. Table 3. Covariance function parameter values learned from OCO-2 data. First column shows the Matern kernel parameters, and the second column the exponential kernel parameters. The length scale along the parallels, lon is much larger than that along the meridians, The marginal posterior predictive distribution at test points x * , given by Eq. (9) and (10), were calculated globally in a halfdegree grid between 80 • S and 80 • N at a daily resolution. The first day of simulation was September 6 2014, and the last day was November 10 2018, spanning in total 1526 days. For each day, 230400 marginals were computed, resulting in a collective 351 million inverted covariance matrices. The satGP parameters used were ζ sample = 0 and κ = 256, and the covariance kernel used was the one learned in Sect. 4.4, with parameters given in Table 3. The simulation time was 25 days on a moderately fast 15 Intel i7-8700K CPU utilizing the available 12 CPU threads and 32 GiB memory.
Global fields of the mean values and marginal uncertainties are presented in Fig. 7      Optimally the wind-informed kernel should utilize winds that are not recomputed from the observations as was done for convenience, but directly from a weather or climate model. The satGP program contains configuration options for doing this.
The optimal covariance function parameter values are conditional on the wind data, so the values should be learned separately for each new application and wind data set. There are various aspects of satGP that could be improved in future versions. These include addition of routines for doing 10 model selection to select the components of the multi-scale kernel, improving the observation selection/thinning scheme for statistical optimality, and finding joint posterior predictive distributions. For the last one, a multi-grid version can be developed, and this could be potentially useful for flux inversion studies.
The satGP software utilizes various approximations for computational tractability, and the connection between parameters such as length scales , thinning parameter ζ, maximum kernel size κ, and prediction accuracy could be studied further, as well 15 as changing the grid resolution according to density of observations.
The methodology and code presented can be also used with other data sources. For instance, combining data from the various satellites that measure CO2, such as GOSAT, GOSAT-2, OCO-2, TANSAT, and the OCO-3, would be particularly interesting.
That more and more instruments are about to provide data from the orbit in the near future will lead to a need to understand the properties of even larger data sets.

20
Code and data availability. The satGP code is available from the contact author upon a reasonable request. The OCO-2 v9 data used is freely available directly from JPL/NASA.

Appendix A: Input parameters and variables in satGP
The satGP software by design allows for lots of flexibility for defining how to model the quantity of interest as a Gaussian random field. In this section the possibilities are discussed along with some recommendations. The parameters in Table 1 25 are described in more detail than earlier, along with some other configuration variables in the configuration file config.h.
Practical aspects of defining mean functions and covariance kernels are also included. Some of the details in this section may change for future versions of satGP.
Of the four sections in Table 1, the first is obvious, as those parameters control the main logic of satGP. It is recommended to first learn the mean function, then with that mean function learn the covariance function, and only after that calculate the means and variances of the Gaussian process with sampling = 1. The setting sampling = 2 can be used for illustration purposes, for understanding how the different realizations of the random function would look like.
The area parameter defines the longitude-latitude extents of the domain where satGP is wished to be used. The strings and the corresponding areas are defined in the beginning of the file gaussian_proc.h, and can be changed there as needed.

5
The parameter n days defines how many days are to be simulated after the starting day. Currently the starting day is hardcoded in the code base to the first day of OCO-2 data. However, if use_daylist = 0 in the configuration file, a list of days can be used. This list can quite easily generated by modifying a python script create_daylist.py, which is included with satGP.
The ω parameter determines how much spatial detail is resolved when sampling or computing marginals of the random field.
A small value like 0.1 will make computing very expensive, and using such values might be unnecessary when the smallest 10 covariance subkernel length scale parameters are large. These parameters are in the scale of distances on the unit ball, and therefore on the equator an parameter of 0.05 corresponds to a length scale of around 2.9 • , so the ω parameter should rarely be much less than half of that. On the other hand, if the observations are spatially very close to each other and describing local variation is aimed for, then the parameters need also to be small. Given computational constraints, larger values or different area parameters may need to be used.

15
In the third section of Table 1, the first parameter n ker denotes the number of subkernels. Even though the hard limit is set at 10, in practice this should be between one and three since the parameters of more than three subkernels are not necessarily identifiable. More kernels means also more computational cost, due to the κ parameter, which is the last one in the table and discussed later.
The parameters cfc and mf are not strictly input variables, but C struct pointers that are created based on input variables.

20
These variables are described in the configuration file, and they amount to choosing the covariance kernels from prescribed types (e.g. Matern, exponential, and periodic), and then defining the parameters for those kernels. The best parameters are those that are learned with learn_k = 2 when non-synthetic data is used.
The learning is best performed with MCMC, and the posterior mean and median have proven to be a useful values. For learning the covariance parameters, parameter limits need to be given. These should correspond to the expected length 30 scales in the data -e.g. long-range fluctuations with low amplitude, and short-scale variations due to local effects. It is in practice best if the parameter ranges do not overlap.
If the exponent of the exponential kernel needs to be changed, that needs to be done by changing the exponent variable in the covfun_dyn() function in the file covariance_functions.h. Similarly, if the order of the Matern kernel needs to be changed, that can be done by changing the variable n in functions covfun_matern52() and initialize_covfunconfig() in that same file.
For constructing the mean function, the configuration file contains the parameter mftype. The possible values are: 0) a zero mean function is used, 1) a mean function that changes only in time is used, 2) a (time-dependent) field is read in and used -this can be e.g. the mean value from a previous Gaussian process simulation, and 3) a space and time dependent mean 5 function is used. The function itself is given as a function pointer to variable mean_function in the configuration file, and this function needs to be defined somewhere -e.g. in the file mean_functions.h. For the mean function, another variable, mfcoeff, needs to be set. This is the total number of parameters (β and δ in Eq. (3)) if mftype ∈ {1, 3}. If the mean function parameters are learned, the parameter nnonbetas, the number of mean function non-linear δ parameters, needs to be set to the appropriate value in the function fit_beta_parameters_with_unc() in mean_functions.h. For 10 global mean function coefficients, the values of those coefficients are given in the configuration file. Additionally, parameter limits for learning the space-dependent mean function parameters are set in the configuration file. Finally, when learning the space-dependent mean function parameters, the smoothness of the field may be controlled by changing the dscale parameter in the configuration file, and to a lesser extent by modifying the dfmin and dfmax parameters in function fit_beta_parameters_with_unc() in file mean_functions.h.

15
In the last section, the ζ train parameter controls data thinning when learning covariance kernel parameters and the ζ sample parameter has the same effect for when sampling = 0. How the thinning takes place was explained in the context of Eq.
(24). While with few observations no thinning needs to be done at all, i.e. ζ · may be set to zero, with large data sets the representability of data may be improved when a coarse grid is used for computation, and also memory bottlenecks may be avoided. These parameters may be also increased if faster execution is required, e.g. for debugging purposes.

20
The σ 2 min parameter controls which observations are not considered at all when computing at a location x * , as described by Eq. (19). The higher this is, the more data is discarded. Setting σ 2 min to a very low value makes searching for candidate observations slow, while picking too high a value may make posterior fields look edgy. In practice values between 10 −7 and 10 −3 seem to work well. This parameter is not actually meant to be changed, and it is for that reason set in create_config() in the file gaussian_proc.h.

25
The variable n synthetic defines how many synthetic observations are generated when learn_k = 1. Very large values are once again expensive, and instead a smaller area should rather be used with more moderate values of n synthetic . Those values can be in practice up to 10 5 or more. With very low values, it may be that spatial patterns specified by the prescribed covariance kernel are not represented appropriately, and therefore values less than 10 4 should be avoided, except for maybe in settings with only a single subkernel. If σ 2 synthetic is high, parameter identifiability suffers. Varying this parameter could be used for understanding 30 how complex a multi-scale kernel can be useful with particular data sets. The values also depend on the maximum covariance parameters of the Gaussian process, given by the τ 2 parameters in the formulas of Sect. 2.5.
The last parameter in Table 1, κ, defines the maximum subkernel size. The larger this parameter is, the more data is included for constructing the covariance matrix K, whose Cholesky decomposition needs to be computed to solve the local regression problem inherent to Gaussian processes. In practice the full kernel size should be kept under 1000, and in order to compute GP 35 calculations fast, a full kernel size of less than 500 is recommended. However, with a very small number of marginals, values up to 10 4 may be experimented with. When n ker κ < 64, the speed-up due to solving the GP formulas faster decreases, since at that point computing Cholesky decompositions no longer takes up majority of computing time. This lower bound depends on the CPU architecture and the sizes of the various CPU caches.
Whether the observations for computing the local values are chosen at random or greedily is determined by the variable 5 select_closest in function pick_observations() in file covariance_functions.h. The value used should normally be non-zero, since with random selection adjacent grid points often do not utilize the best available observations closest by, leading to noisiness or graininess in the computed mean field.
In addition to the parameters and variables listed here, there are also other parameters in the configuration file and in the code, even though those should not need to be changed. Any variables that the user might want to tweak are generally accompanied 10 by at least some comments describing their effects.
In the current version, the satGP program is run with the script gproc.sh, whose comments describe the various options.
Compiling and running require a modern GCC version (such as version 8) and the meson build system, and additionally all the needed libraries listed in Sect. 3. The current low version number reflects the fact that as of now, installing and using the software will require a degree of technical knowledge, including some Python, C, and BASH programming skills.

15
Author contributions. JS, AS, HH, and YM designed the study. JS prepared this manuscript, wrote the satGP code, chose, tested and implemented the computational methods, and performed the simulations.
Competing interests. The authors declare that they have no conflict of interest. Acknowledgements.