Technical Note : Improving computational e ffi ciency in large linear inverse problems : an example from carbon dioxide flux estimation

Addressing a variety of questions within Earth science disciplines entails the inference of the spatiotemporal distribution of parameters of interest based on observations of related quantities. Such estimation problems often represent inverse problems that are formulated as linear optimization problems. Computational limitations arise when the number of observations and/or the size of the discretized state space becomes large, especially if the inverse problem is formulated in a probabilistic framework and therefore aims to assess the uncertainty associated with the estimates. This work proposes two approaches to lower the computational costs and memory requirements for large linear space–time inverse problems, taking the Bayesian approach for estimating carbon dioxide (CO 2) emissions and uptake (a.k.a. fluxes) as a prototypical example. The first algorithm can be used to efficiently multiply two matrices, as long as one can be expressed as a Kronecker product of two smaller matrices, a condition that is typical when multiplying a sensitivity matrix by a covariance matrix in the solution of inverse problems. The second algorithm can be used to compute a posteriori uncertainties directly at aggregated spatiotemporal scales, which are the scales of most interest in many inverse problems. Both algorithms have significantly lower memory requirements and computational complexity relative to direct computation of the same quantities (O( n2.5) vs. O(n3)). For an examined benchmark problem, the two algorithms yielded massive savings in floating point operations relative to direct computation of the same quantities. Sample computer codes are provided for assessing the computational and memory efficiency of the proposed algorithms for matrices of different dimensions.


Introduction
Addressing a variety of questions within Earth science disciplines including environmental science, hydrology, geology, geophysics, and biogeochemistry entails the inference of the spatiotemporal distribution of parameters of interest based on observations of related quantities.Such estimation problems often represent inverse problems, with examples including the estimation of hydraulic conductivity or other aspects of subsurface structure using geophysical (e.g., Aster et al., 2013) or hydrologic (e.g., Hyndman et al., 2007) observations, the identification of environmental contaminant sources using downstream concentrations (e.g., Atmadja and Bagtzoglou, 2001;Liu and Zhai, 2007;Michalak and Kitanidis, 2003;Zhang and Chen, 2007), the characterization of atmospheric and oceanic processes (Bennett, 2002), and the quantification of budgets of atmospheric trace gases using atmospheric observations (e.g., Ciais et al., 2011;Enting, 2002).Such inverse problems are often formulated as linear optimization problems.Even when the physics and/or chemistry relating the unobserved field to the measurements yields a nonlinear problem, the inverse problem is often solved through iterative application of a linear approximation (e.g., Kitanidis, 1995).Computational limitations arise when the number of observations n and/or the size of the discretized state space m becomes large, especially if the inverse problem is formulated in a probabilistic framework and therefore aims to assess the uncertainty associated with the estimates.
This work proposes approaches for addressing these computational limitations.We take the Bayesian approach for estimating carbon dioxide (CO 2 ) emissions and uptake (a.k.a.fluxes) as a prototypical example of a spatiotemporal Published by Copernicus Publications on behalf of the European Geosciences Union.
V. Yadav and A. M. Michalak: Improving computational efficiency in large linear inverse problems inverse problem, and use it to illustrate the proposed tools.We use the study of Gourdji et al. (2012) as a computational benchmark.Gourdji et al. (2012) used atmospheric concentration measurements of CO 2 to constrain CO 2 fluxes in North America at a 1 • longitude by 1 • latitude spatial resolution (m s = 2635 land regions for 50 • W to 170 • W and 10 • N to 70 • N) and a 3 hourly temporal frequency over the period of 24 December 2003 to 31 December 2004 (m τ = 2992 periods over 374 days).The implemented setup resulted in n = 8503 observations and m = m s ×m τ =7 883 920 parameters to be estimated.This high spatiotemporal resolution was primarily motivated by the desire to avoid "aggregation errors" (Engelen et al., 2002;Meirink et al., 2008, Thompson et al., 2011), i.e., biases in the estimated fluxes caused by prescribing spatial and temporal patterns within estimation regions and not allowing these patterns to be adjusted through the estimation.The resolutions that can be resolved by observations are often coarser, however, as are the scales that are of most scientific interest.In the case of Gourdji et al. (2012), the estimates were therefore aggregated a posteriori to monthly and annual temporal resolution for interpretation.The a priori spatiotemporal error covariance was assumed separable, with exponential decay in correlation in both space and time.As a result, the prior covariance matrix could be expressed as a Kronecker product of matrices describing the spatial and temporal covariances, respectively.

Bayesian framework for linear inverse problems
Stochastic linear inverse problems are often formulated in a Bayesian framework by requiring the minimization of an objective function that can be written as where z (n×1) is a known vector of measurements, H (n×m) is a matrix that describes the relationship between measurements and the unknown field s (m×1) , R (n×n) is the covariance matrix of the model-data mismatch errors, s p(m×1) is the prior estimate of s, and Q (m×m) is a (square and symmetric) prior error covariance matrix, describing deviations between the true field s and the prior s p .The first term in Eq. (1) penalizes differences between available observations and those that would result from an estimated underlying field, while the second is a regularization term that penalizes departures from the prior, or more generally any type of desired structure.
The solution to the Bayesian linear inverse problem, defined as the estimate of s that minimizes the objective function in Eq. ( 1), can be expressed as and the a posteriori uncertainty covariance of the estimated ŝ can be written as For small n and m, implementing Eqs. ( 2) and ( 3) is straightforward.As inverse problems are solved using increasingly more observations and are used to estimate parameters at increasingly high spatiotemporal resolutions, as in the prototypical Gourdji et al. (2012) example, the number of floating point operations required to implement these equations becomes prohibitive.
A closer look at Eqs. ( 2) and (3) shows that the first computational bottleneck occurs due to the cost of multiplying matrices H and Q.The second is the cost of computing and storing a dense V ŝ with dimensions m × m.Paradoxically, as noted previously, the scales of ultimate interest are often coarser than the native resolution of V ŝ, and these covariances are frequently aggregated a posteriori in space and/or time by summing or averaging the corresponding entries in V ŝ.
In this work, we propose a computational approach for evaluating HQ, and by extension HQH T for very large inverse problems, for the case where the covariance matrix Q can be expressed as a Kronecker product of two smaller matrices.This is typical of spatiotemporal inverse problems where the space-time covariance is assumed separable, or simpler problems that only consider covariance in space or in time, rather than both.While the fact that the covariance matrix Q can be expressed as a Kronecker product has been recognized in previous work (Meirink et al., 2008;Singh et al., 2011;Thompson et al., 2011), the novelty here is the use of this property in developing an algorithm for increasing the computational efficiency of the matrix product HQ.We further present an approach for directly calculating the a posteriori error covariance at aggregated scales, without the intermediary step of first computing the full V ŝ.Note that both of the presented algorithms rely on the availability of a pre-computed matrix H, which can itself entail considerable computational cost that is not addressed in the work presented here.We use the Gourdji et al. (2012) problem as a computational benchmark for evaluating the performance of the proposed approaches relative to a direct implementation of Eqs. ( 2) and (3).Code demonstrating the implementation of both methods for a toy example is available as supplementary material.

Efficient method for the multiplication of any matrix with a matrix expressed as a Kronecker product
Reducing the computational cost of matrix multiplication and determining the minimum number of operations required to perform this operation has been an area of active research for several decades.The computational efficiency of matrix multiplication was generally assumed to be cubic until 1969 when Strassen (1969) showed that by performing more additions and fewer multiplications this cost could be reduced to O(n 2.82 ) (see Chapt. 47, p. 3 in Bini, 2010).This lower bound has since been further reduced, with the last minimum reported bound of O(n 2.37 ) obtained by Coppersmith and Winograd (1990).The majority of the existing algorithms (see Chapt.47, p. 9 by Bini, 2010) rely on the property that matrix multiplication can be represented as multilinear maps (for details, see Bini and Lotti (1980) and Horn and Johnson (1990)), which helps in formulating a recursive noncommutative block decomposition based algorithms for performing matrix multiplication (Bini, 2010;Laderman, 1976).
However the stability of these algorithms remains questionable, and other than Strassen's algorithm the proposed algorithms have not been found to be practically feasible (Bini and Lotti, 1980;Eve, 2009;Li et al., 2011).The computational complexity of performing matrix multiplication of sparse and specially structured matrices such as Toeplitz, triangular Toeplitz, Hankel, Cauchy, Vandermonde, and Vandermonde transposed matrices (Mouilleron, 2011), circulant matrices (Nowak et al., 2003), banded matrices (Gohberg and Olshevsky, 1994), and hierarchical matrices (Saibaba and Kitanidis, 2012) with an arbitrary matrix can be made to be significantly lower (for comparison of computational cost, see page 15 in Mouilleron, 2011).As such, in the case of inverse problems, the structure of the covariance matrices can be used to reduce the cost of matrix multiplication.For the special case of a regular estimation grid and Toeplitz covariances, for example, the fast Fourier (FF) method of matrix multiplication gives an exact answer and has a computational complexity of O(n 2 log n) (for details, see Nowak et al., 2003); however the algorithm has high computational cost and memory requirements for sparse matrices.For an irregular estimation grid, an approximate method based on a hierarchical framework has been proposed by Saibaba and Kitanidis (2012).Like the FF method, this method also has a computational complexity of O(n 2 log n), and can only be used for very specific generalized covariance functions (for details, see Saibaba and Kitanidis, 2012).
Another type of structured covariance matrix is the one that can be expressed as the Kronecker product of two smaller matrices, such as in the case of Q.This property has been used in the past for decomposing square symmetric covariance matrices (Meirink et al., 2008;Thompson et al., 2011) and in some cases in increasing the efficiency of the matrix multiplication of separable covariance matrices with diagonal matrices (Singh et al., 2011).Here we build upon this work and propose an algorithm for matrix multiplication of two arbitrary matrices, one of which can be expressed as a Kronecker product, and show that its computational complexity is O(n 2.5 ).The algorithm has the same stability properties as those of cubic matrix multiplication algorithm.

Algorithm
Any matrix B (pr×qt) that can be expressed as a Kronecker product can be defined based on matrices D (p×q) and E (r×t) and denoted as D ⊗ E, where For a square covariance matrix Q, both D and E are also square.For the prototypical case examined here, Q is expressed as the Kronecker product of the temporal covariance and the spatial covariance, both of which decay exponentially with separation distance or lag: where σ 2 s is the variance in space and time, X s and X τ represent the separation distances/lags between estimation locations in space and time, respectively, and l s and l τ are the spatial and temporal correlation range parameters, respectively.In this case, p = q = m τ and r = t = m s .This defines a block matrix Q with m 2 τ blocks, each defined as a square matrix d (i,j ) E with m 2 s elements.As the Kronecker product is not commutative, the arrangement of the temporal and spatial covariance in Eq. ( 5) determines the design of Q.
Returning to the more generic case, the multiplication of any matrix A (n×pr) by B (pr×qt) proceeds as follows: 1. Divide A into p column blocks each with dimension (n×r) 4. Repeat steps 2 and 3 for the remaining q −1 columns of D and the corresponding blocks of AB.Overall, www.geosci-model-dev.net/6/583/2013/Geosci.Model Dev., 6, 583-590, 2013 This algorithm can also be used for the multiplication of matrices where the first matrix is a Kronecker product of two smaller matrices, through the cyclical permutation property of transposes.
For H (n×m)=(n×m τ m s ) and Q (m×m)=(m τ m s ×m τ m s ) , Eqs. ( 6) and ( 7) become The multiplication of H and Q where Q is a block diagonal (e.g., there is correlation in space but not in time) is a special case of the algorithm where D is an identity matrix.

Floating point operations
The number of floating point operations required for a direct multiplication of a matrix A by a matrix B can be expressed as a function of the dimensions of these matrices (for details, see Golub and Van Loan, 1996): Similarly, the cost of the "indirect" multiplication algorithm presented in the last section is Equation ( 11) is based on the fact that steps 2 and 3 are each repeated q times.The relative computational performance of the indirect method can therefore be expressed as For H and Q, this simplifies to Note that the number of observations n does not affect the relative performance of the algorithm.Asymptotically, Eq. ( 13) approaches zero with increasing m τ and m s .For the Gourdji et al. (2012) problem, this ratio is 7.14 × 10 −4 , a savings of over 99.9 % floating point operations relative to the direct approach.
For the more generic case of multiplying A and B, consider the special simplifying case where n = m and p = q = r = t = √ n; the floating point operations required by the direct and indirect methods become AB direct = 2n 3 − n 2 and ( 14) This results in an asymptotic complexity of O(n 3 ) for the direct method, and O(n 2.5 ) for the indirect method.The direct method is thus more economical only for n < 3, and the relative cost of the indirect method decreases exponentially thereafter.The overall computational cost could be reduced further if the matrix multiplications in step 3 of the algorithm were computed through the Strassen or Coppersmith and Winograd algorithm.For a special case where D is composed of zeros and ones, the computational cost can also be further reduced by avoiding scalar multiplications, as listed in step 2 of the algorithm.

Other computational aspects of the indirect approach
Beyond the economies in floating point operations, the indirect method also dramatically decreases the random access memory requirements for matrix multiplication, because the proposed approach eliminates the need to create or store the full matrix B (or Q).Thus, again taking the special case of n = m and p = q = r = t = √ n as an example, the memory requirement for storing D and E is O(n 2 ), whereas it is O(n 4 ) if B is explicitly stored in memory.
To maximize the computational gains from the algorithm, it is also necessary to take into account the method adopted by different computer languages for storing arrays in computer memory.For example, C and C++ are far more efficient at multiplying Q by H T than H by Q.This results from the fact that C and C++ store matrices in row-major order, whereas Fortran and MATLAB store them in columnmajor order.The Fortran and C++ codes submitted with this manuscript show how these operations are performed for matrices stored in column (Fortran and MATLAB) or row-major order (C and C++) for square symmetric Q, D and E matrices.In addition, whether the matrices are stored as singledimension vectors or as two-dimensional matrices also affects the computational performance of the proposed algorithm, especially in C and C++, which are not array-based programming languages.
In comparison to the direct method, the indirect approach for matrix multiplication is also fault tolerant and amenable to distributed parallel programming or "out of core" matrix multiplication, as each column block of AB (or HQ) can be obtained separately without any communication between processors populating the individual blocks.
In the case of the solution of an inverse problem, the multiplication of HQH T can also be completed block by block: where h i and h j represent column blocks of the H matrix as defined earlier.The computational efficiency of the matrix multiplication of H, Q and H T can be further improved if the symmetry of HQH T is taken into account (see details on quadratic forms (Harville, 2008)).However there are no "Basic Linear Algebra Subroutines" (Blackford et al., 2002;Dongarra et al., 1988;Lawson et al., 1979) that take this property into account, and additional work would be required to develop these methods for application in inverse problems and statistics.

Computation of aggregated a posteriori covariance in large linear space-time inverse problems
The a posteriori covariance matrix (V ŝ; Eq. 3) is typically dense, and calculating V ŝ is a computational bottleneck for large inverse problems.For example, computing V ŝ explicitly for the Gourdji et al. (2012) problem would require approximately 1.06 × 10 18 floating point operations, and over 56 TB of RAM.We propose an algorithm for computing the a posteriori covariance directly at aggregated scales, which are typically the scales of most interest as described in Sect. 1, without explicitly calculating ŝ.We use the estimation of a posteriori uncertainties at the native spatial scales (1 • × 1 • grid scale in the prototypical example) but for estimates averaged across larger temporal scales as an example.

Algorithm
The algorithm is presented for a V ŝ design consistent with Eqs. ( 4) and (5), i.e., where the diagonal blocks describe the spatial covariance, and the off-diagonal blocks describe the decay of this covariance with time.The particular design framework of V ŝ used in this study does not hinder the application of the proposed algorithm for obtaining a posteriori covariances and cross-covariances in inverse problems where V ŝ has a different design, or where the aggregation is to be conducted over a different desired dimension.
The calculation of the a posteriori covariance at the native spatial resolution aggregated temporally over a desired time period proceeds as follows: 1. Sum all blocks of the matrix corresponding to the k = t u − t l + 1 time periods between periods t l and t u over which the a posteriori uncertainties are to be aggregated, where 1 ≤ t l < m τ , t l ≤ t u < m τ .For Q expressed as a Kronecker product as given in Sect.2.1, this is the sum of all entries between t l and t u in D, mul-tiplied by E (spatial covariance): where Q sum represents the sum of all Q blocks between t l and t u .
2. Sum all column blocks of the HQ (see Eq. ( 9)): where (HQ) sum represents the sum of all HQ column blocks as shown in Eq. ( 9) between t l and t u .
3. Compute the aggregated grid-scale a posteriori covariance Vŝ for the estimates averaged over the desired time periods: where Vŝ is the covariance of ŝ temporally averaged for time periods t l to t u .

Floating point operations
The number of floating point operations required for the direct calculation of V ŝ (Eq. 3) and its aggregation over k time periods is compared to the calculation of the aggregated V ŝ using the algorithm described above.The floating point operations required for multiplying H by Q and HQ by H T , for adding R to HQH T , for taking the inverse of HQH T + R , and for dividing the aggregated covariance by k 2 are excluded in the floating point operation count, because the cost of these operations is the same for both approaches.Of course, computational savings can be achieved for both by following the algorithm outlined in Sect. 2 for the matrix multiplications.
The number of floating point operations required for obtaining grid-scale a posteriori covariance from the direct method can be divided into four sequential operations: (1) matrix multiplication of (HQ) T and (HQH T +R) −1 , (2) matrix multiplication of (HQ) T (HQH T + R) −1 and HQ, (3) subtraction of (HQ) T (HQH T + R) −1 HQ from Q, and (4) summation of all k 2 and m 2 s (spatial covariance) blocks of V ŝ.The floating point operations for these four calculations are www.geosci-model-dev.net/6/583/2013/Geosci.Model Dev., 6, 583-590, 2013 For the algorithm proposed in Sect.3.1, five operations are required to obtain aggregated a posteriori covariance for the desired time period: (1) summation of k and m 2 s (spatial covariance) blocks of Q, (2) summation of k and n × m s blocks of HQ, (3) multiplication of (HQ) T sum and HQH T + R −1 , (4) multiplication of (HQ) T sum HQH T + R −1 and (HQ) sum , and (5) subtraction of (HQ) T sum HQH T + R −1 (HQ) sum from Q sum .The last three of these are all part of step 3 of the algorithm.The floating point operations for these five calculations are

Asymptotically, Vŝ indirect
Vŝ direct approaches zero with increasing n, m τ and m s .For the Gourdji et al. (2012) problem, this ratio is 5.34 × 10 −7 when evaluating the a posteriori covariance aggregated over the full year (k = m τ ), a savings of over 99.9999 % in floating point operations relative to the direct approach.
For the special simplifying case where n = m (i.e., n = m τ m s ) and m τ = m s , the ratio of floating point operations required by the direct and the indirect methods becomes This results in an asymptotic complexity of O(n 3 ) for the direct method, and O(n 2.5 ) for the indirect method.The reduced memory requirements are arguably even more important, however, as the proposed algorithm makes it possible to compute a posteriori covariances at any temporal resolution without explicitly creating V ŝ.

Conclusions
We propose two algorithms to lower the computational cost of performing large linear space-time inverse problems.
The proposed matrix multiplication algorithm can be implemented with any matrices, as long as one of them can be expressed as a Kronecker product of smaller matrices, making it broadly applicable in other areas of statistics and signal processing, among others (Van Loan, 2000).Furthermore, although not explicitly discussed in this work, the Kronecker product representation of covariance matrices can be used for reducing the memory allocation complexity of storing error covariance matrices in both batch inverse problems and data assimilation (for application, see Singh et al., 2011).The presented a posteriori covariance computation algorithm can provide aggregated uncertainty covariances even for extremely large space-time inverse problems with dramatically decreased computational and memory requirements.The mounting availability of massive volumes of data (e.g., satellite observations) will further increase the computational cost associated with the solution of inverse problems in the Earth sciences, making algorithms such as the ones presented here, as well as further improvements to the computational efficiency associated with the solution of large inverse problems, increasingly important.

Appendix A Description of the submitted code
Two MATLAB code files demonstrating the application of the methods proposed in this manuscript are included as supplementary material.The MATLAB script file "HQ HQHt.m" allows users to experiment with different sizes of random covariance matrices in a Kronecker product form and computes HQ and HQH T using the direct method as well as the method presented in Sect.2.1.The second MATLAB script file "Uncertainty Computations.m"allows users to experiment with random matrices for computing a posteriori covariances aggregated either over all time periods or for specified time periods.A detailed description of the codes is also given at the beginning of the script files.Note that these codes are provided to illustrate the two algorithms proposed in this research, but these codes should not be used to assess the computational performance of these algorithms.This is because MATLAB uses (1) highly optimized multithreaded external libraries (Basic Algebra Subroutines) for performing matrix multiplication, and (2) automatic memory management (e.g., allocation and reallocation of memory).
To supplement the MATLAB routines, we also include completely serial Fortran (filename: HQ.f90) and C++ (filename: HQ.cpp) code for performing the matrix multiplication of H and Q matrix.Although the performance of these codes may vary depending on computer architecture, the performance will approximately reflect the computational savings described in the manuscript.For example, for the Fortran code compiled using the GFortran compiler on a Intel Xeon X5660 2.80 GHz machine with 96 GB RAM, a matrix multiplication (1) with n = 8503, r = t =10, and p = q = 100 took approximately 12 s using the direct method and approximately 2.2 s using the indirect method; (2) with n = 8503, r = t = 100, and p = q = 10, the direct method took approximately 9.4 s whereas the indirect method took approximately 1.0 s; and (3) with n = 15000, r = t = 150 and p = q = 150, the direct method took approximately 3 h whereas the indirect method took approximately 3.4 min.
each block of A by the elements of the first column of D and add these blocks ( p i=1 a i d (i,1) ).If an element of D is zero, then skip the multiplication; if it is one, then add the column block of A without performing scalar multiplication.3. Multiply the resulting n × r matrix by E (r×t) to obtain the first n × t column block of AB.