On the parallelization of atmospheric inversions of CO 2 surface fluxes within a variational framework

Introduction Conclusions References


Introduction
CO 2 mole fractions in the atmosphere are functions of the CO 2 surface fluxes, of the atmospheric transport fluxes and, to a smaller extent, of the photochemical CO 2 production in the atmosphere.Inferring (or inverting) the surface fluxes from mole fraction measurements is made challenging by the combination of these three inputs, but it is rigorously guided by the Bayesian paradigm.An additional complication lies in the diversity of the time-space scales that are involved.At one end, the exceptionally long mean life time of CO 2 links any mole fraction measurement to global CO 2 surface fluxes in the distant past.At the other end, point-wise surface measurements exhibit a strong sensitivity to the fluxes in the immediate vicinity of the detectors (Bocquet, 2005).This singularity is not much compensated (i.e.regularized) by the size of the prior errors, at least in the case of terrestrial vegetation fluxes (Chevallier et al., 2006).The mixture of transport scales hinders the implementation of Bayes' theorem at reasonable computational cost.In practice, compromises are made between the technical complexity of the inversion code, the level of error induced by neglecting some of the flux variability (e.g.inferring continental monthly fluxes rather than high-resolution ones; Kaminski et al., 2001;Thompson et al., 2011) and the length of the inversion window.Flux inversion has been traditionally performed based on a closedform solution to Bayes' theorem (i.e. an algebraic formula; e.g.Gurney et al., 2002) but the computational cost of this analytical solution restricts it to small-size problems (typically less than 10 4 variables to infer or less than 10 4 measurements to exploit).In the context of high-performance parallel computing, ensemble formulations of Bayes' theorem (i.e. using a statistical ensemble of simulations) have become increasingly popular (e.g.Peters et al., 2005;Zupanski et al., 2007;Feng et al., 2009), but the relatively small sizes that can be currently afforded for the statistical ensembles do not allow tackling problems of higher resolutions than the analytical approach.Imported from the data assimilation systems of F. Chevallier: Parallel and variational CO 2 atmospheric inversion numerical weather prediction (NWP), the variational formulation of Bayes' theorem was introduced in the mid-2000s to lift this restriction (Chevallier et al., 2005;Rödenbeck et al., 2005), but it requires tedious initial developments (i.e.coding the adjoint of the atmospheric transport) and usually exhibits limited scalable parallelism (Hamrud, 2010).
This paper focusses on the technical performance of the variational approach.Chevallier et al. (2010) performed a global 21 yr inversion at grid point (3.75 • longitude and 2.5 • latitude) and weekly scale with this approach, but they needed 6 weeks of sustained parallel computing -the parallelization being for the code of the transport model.The high cost of this 21 yr inversion implies prohibitive costs for extended inversion periods, higher resolution transport modelling or more sophisticated physical parameterization of the transport model.Given the parallel structure of current supercomputers, finding high levels of parallelism has become critical for the developers of the variational approach (see, e.g.Fisher et al., 2011;Desroziers and Berre, 2012, in the field of NWP).At present, this approach is intrinsically sequential because it consists in iteratively minimizing a cost function, each iteration step relying on the results of the previous one.Higher resolutions of the transport models allow parallelizing on more processors, but the gain offered by the latter never fully compensates for the cost of the former because of communication overheads.Moreover, transport models perform a lot of input operations (e.g. to read the transport mass fluxes) that are not well parallelizable.This problem is even bigger for their adjoint codes that read most of the information about the linearization point from the disk.
In this paper, we introduce a parallel structure of the transport computations within variational atmospheric inversion systems in order to massively reduce the computation wallclock time, while still running the inversion on a single physically and statistically consistent window.This structure defines an ensemble of parallel computations that does not carry any statistical meaning, in contrast to the abovementioned ensemble methods.We call it physical parallelization (PP) here to distinguish it from the statistical methods.We illustrate the efficiency of the PP for a 32 yr inversion window  for CO 2 with atmospheric measurements from 81 sites of the NOAA global cooperative air sampling network.Section 2 introduces the PP.The inversion system and the measurements used to illustrate it are described in Sects.3 and 4, respectively.The results are presented in Sect. 5. Section 6 concludes the paper.

Physical parallelization
An inversion system operates within a temporal window starting at a chosen time T (for instance, 1 January 1979 at 00:00 UTC in Sect.4).Within this window, the transport model H simulates the mole fraction measurements made at any time t from the surface fluxes existing between times T and t and from the 3-D field of mole fraction at the initial time T .In the case of regional inversions, H also uses the lateral boundary conditions of the mole fractions between times T and t.
Variational inversion systems minimize a Bayesian cost function and, to this end, exploit a linearized version of H , under the form of a tangent-linear code and of an adjoint code.
In the tangent-linear code of H , a perturbation in mole fraction caused by perturbations in surface fluxes, by perturbations in the lateral boundary conditions and by perturbations in initial mole fractions, is computed as with δc(t) the vector that contains the mole fraction increments at time t, δϕ(t ) the vector that contains the flux increments and the lateral boundary condition increments at time t , H ϕ t t the linear operator (Jacobian matrix) that links δϕ(t ) and δc(t), and H c t t the linear operator that links δc(T ) and δc(t).Note that in variational systems, the two Jacobian matrices are not explicitly computed: the multiplication by the input increment vector is obtained from the chain rule applied to the reference code, line by line.
The PP brings a small simplification to Eq. ( 1) by reducing the backward influence to a time τ which is later than T following with δb(τ ) the global-mean mole fraction increment at time τ .The scalar δb(τ ) corresponds to the perturbation of the global growth rate.In the case of regional inversions, δb(τ ) is set to zero because the global growth rate is already accounted for by the lateral boundary conditions.The advantage of Eq. ( 2) is that the transport model does not link successive temporal segments together any more when computing mole fraction increments.The sum t t =τ H ϕ t t • δϕ(t ) can be computed in parallel for various segments of a long period: the bias δb(τ ) is added a posteriori.For global inversions, the downside is that Eq. (2) introduces a bottleneck at time τ when all mole fraction increments are supposed to be uniformly mixed in the global atmosphere.For regional inversions, setting δb(τ ) to zero implies neglecting the direct influence of the 3-D increments of the compound within the regional domain at time τ .To minimize both effects, we overlap the various segments so that they include a spin-up period without any observation.For global inversions, the spin-up time allows atmospheric mixing to operate.For regional ones, it allows the initial molecules to be advected outside the domain.The length of the spin-up period should therefore be chosen according to the corresponding time scales (the time scale of interhemispheric mixing for global inversions and the time scale of advection within the domain for the regional ones).Once the spin-up time has been chosen, the PP forward algorithm can be described as follows: 1. Run the transport model over the full period without any simplification (i.e.serially).This run provides the linearization point for subsequent increment computations and the innovation vector (i.e. the vector of initial model-minus-observation departures) at the best precision available.
2. Divide the full period into a series of overlapping segments.
3. Run the tangent-linear model for each segment in parallel without any perturbations to the mole fraction field at the segment initial time.Save the mole fraction increment t t =τ H ϕ t t • δϕ(t ) at the time and location of each observation, except for the spin-up period when observations are ignored (since they are processed by the previous segment).The earliest segment (i.e. starting at T ) is a particular case where Eq. ( 1) is applied directly.
4. In the case of global inversions, for each segment, compute the contribution to the global mass increment δb(τ ) that corresponds to the initial state of the next segment and add it to the mole fraction increments (Eq.2).
The PP algorithm for the adjoint code simply stems from that of the tangent-linear code, reversing the order of the four operations and transposing them.
The wall-clock time needed for this algorithm is about that of a segment, with a parallelization structure on the number of segments.
At the highest level of the inversion system, the minimizer deals with the full inversion period at once with a single cost function (including correlated prior errors between segments if such correlations are assigned) and without any parallelization, therefore ensuring a consistent inversion from time T until the time of the latest observation available.

Inversion method
To illustrate the PP, we use the inversion scheme of Chevallier et al. (2005Chevallier et al. ( , 2010)).This system mainly comprises of a tracer transport model, its tangent-linear and adjoint codes, and software to minimize the Bayesian cost function.It computes the best linear unbiased estimate (BLUE) of the CO 2 surface fluxes given the observations, the prior fluxes and their associated uncertainty.The BLUE fluxes are called inverted fluxes hereafter.Their uncertainty can also be estimated by the inversion system by a robust Monte Carlo method, but this facility is not exploited here.
In the configuration used in the result section, the transport model is the global general circulation model of the Laboratoire de Météorologie Dynamique (Hourdin et al., 2006) nudged to winds analysed by ECMWF, and the minimizer is the Lanczos version of the conjugate gradient algorithm, as described by Fisher (1998).The inverted fluxes are estimated on the global grid of the transport model (a regular mesh of 3.75  (Takahashi et al., 2009), climatological monthly biomass burning emissions (taken as the 1997-2010 mean of the database of van der Werf et al., 2010) and climatological 3-hourly biosphere-atmosphere fluxes taken as the 1989-2010 mean of a simulation of the ORCHIDEE model (ORganizing Carbon and Hydrology In Dynamic EcosystEms, Krinner et al., 2005), version 1.9.5.2.The mass of carbon emitted annually during specific fire events is compensated here by the same annual flux of opposite sign representing the regrowth of burnt vegetation, which is distributed regularly throughout the year.These gridded prior fluxes exhibits 3-hourly variations but their inter-annual variations are caused by anthropogenic emissions only.
The uncertainty of the prior fluxes is described by a covariance matrix.Over land, we assume that the errors of the prior biosphere-atmosphere fluxes dominate the error budget and the covariances are constrained by an analysis of in situ flux measurements (Chevallier et al., 2012): temporal correlations on daily mean net carbon exchange (NEE) errors decay exponentially with a length of 1 month, but nighttime errors are assumed to be uncorrelated with daytime errors; spatial correlations decay exponentially with a length scale of 500 km; standard deviations are set to the climatological daily varying heterotrophic respiration flux simulated by ORCHIDEE with a ceiling of 4 g C m −2 day −1 .Over a full year, the total 1-sigma uncertainty for the prior land fluxes amounts to about 2.8 Gt C yr −1 .The error statistics for the open ocean correspond to a global air-sea flux uncertainty about 0.75 Gt C yr −1 and are defined as follows: temporal correlations decay exponentially with a length scale of one month; unlike land, daytime and night-time flux errors are fully correlated; spatial correlations follow an efolding length of 1000 km; and standard deviations are set to 0.15 g C m −2 day −1 .Land and ocean flux errors are not correlated.
Prior initial conditions (i.e. the 3-D field of CO 2 at the start of the inversion window) are taken from a previous simulation of the LMDZ model and are adjusted by the inversion (with 0.3 % prior uncertainty).1979 1982 1985 1988 1991 1994 1997 2000 2003 2006

Observations
In the present illustration, the assimilated observations are mole fractions of CO 2 in dry air, collected in flask air samples at various places in the world over land (from fixed sites) and over ocean (from commercial ships).We use 81 site records from the NOAA Earth System Research Laboratory archive (Conway et al., 2011) for the 1979-2010 period.The site location and the length of each record are shown in Fig. 1.The observed synoptic variability at each site is taken as a proxy of the observation errors that are assigned in the inversion system (that are driven by transport modelling errors) and is computed following Chevallier et al. (2010).

Results
In the present configuration of the PP inversion, we define computation segments of 15 months (from October of a given year until December of the following year) with a 3-month overlap from one segment to the next (from October until December).The 3-month overlap allows a large portion of the global mixing to operate (Bruhwiler et al., 2005;Peters et al., 2005) at least at the hemispheric scale.The first segment is a particular case: it gathers 12 months only, and Eq. ( 1) is directly applied (i.e.without any simplification).
We show the impact of the simplification of the tangentlinear model by running a 32 yr inversion  both with the full system and with the simplified one.Each minimization is stopped after 30 iterations, as in Chevallier et al. (2010), when increments become stable.The full system required 2 months of sustained computing, while the PP one took days only.The two inversions are compared by looking at the inverted fluxes at various time-space scales: from a 3 h time step to the year, from the model grid to the widely used 22 TransCom3 regions of Gurney et al. (2002).The annual total fluxes aggregated at the scale of TransCom3 region Tropical Asia are shown in Fig. 2. The prior interannual variations (in black) are caused by fossil-fuel emission increase only (see Sect. 3).The fit between the PP inversion (in red) and the one with the full system (in cyan) is surprisingly poor: the shape of the variations is similar but the absolute values differ by up to 600 Mt C yr −1 (in 1990).Looking at the annual global scale (not shown), the PP inversion actually agrees with the mean CO 2 growth rate inferred by  NOAA (ftp://ftp.cmdl.noaa.gov/ccg/co2/trends/co2gr gl.txt accessed 6 September 2012) better than the reference inversion: the root-mean-square difference (RMS) of the misfit over the 32 yr is 0.19 ppm yr −1 for the former, while it reaches 0.88 ppm yr −1 for the latter.To understand the better performance of the inversion that uses the simplified system, the reference inversion is repeated for a shorter period: 1979-1992.In this case, the RMS misfit of the reference inversion with the NOAA data goes down to 0.21 ppm yr −1 , while the PP misfit is slightly larger (0.22 ppm yr −1 ) for the same period.This second reference inversion is also shown in Fig. 2, in standard blue: the curve is well fitted by the PP points, with differences less than 50 Mt C yr −1 , even for the year 1990 when the longer reference inversion shows a large spike.From this test and from similar ones varying the length of the inversion window (not shown), we conclude that numerical instabilities spoil the reference system for very large inversion windows.The better behaviour of the PP inversion suggests that the small transport sensitivities for very long ranges cause these instabilities.
Figure 3 summarizes the differences between the PP inversion and the reference system (run with two inversion windows) at the scale of the year and of the 22 TransCom3 land regions.The RMS is very large (up to 570 Mt C yr −1 in region South American Tropical) when comparing with the reference system run on the longer window, while it is less than 80 Mt C yr −1 when comparing with the reference system run on the shorter window.The misfit with the shorter window system represents about 10 % of the inversion increments.
Figure 4 refines the comparison with the shorter-window reference system by showing the map of the RMS at the weekly grid-point scale.This time scale is chosen because it is the highest of the inversion system.The RMS is less than 0.08 g C m −2 day −1 everywhere.Its pattern follows the one of the prior errors (see Fig. 1 of Chevallier et al., 2010).The RMS represents less than 11 % of the flux inversion increments on any grid point, excluding those where the increments are negligible.
We have repeated the PP inversion with longer segments.In this test the segments cover 18 months, from July of a given year until December of the following year, with a 6month overlap from one segment to the next, from July until December.In this case, the 1979-1991 fit to the shorterwindow reference system is improved by a few percentage points, with the largest relative improvement seen in the ocean basins of the Southern Hemisphere.In terms of annual budgets in the TransCom3 regions, the improvement is less than 20 Mt C yr −1 , with most values being a few Mt C yr −1 .

Discussion and conclusions
More than 10 CO 2 monitoring sites have records that span more than 30 yr and tens of them span more than 20 yr.These archives offer the exceptional opportunity to yield consistent estimations of CO 2 surface fluxes over the last decades.However, the inversion problem involves relatively small spatial scales (in the transport properties and in the prior errors) that make such estimations technically difficult.Coarse resolution systems are usually chosen (e.g.Gurney and Eckels, 2011), which degrades the quality of the inversion (Kaminski et al., 2001).Regional inverse modelling for dense regional networks (e.g.Lauvaux et al., 2009;Broquet et al., 2011)  within variational inversion systems.This approach allows efficient "embarrassingly parallel" workload, i.e. tasks that do not need communicating, within each iteration step.In the case of global inversions, the transport information from one segment to the next one is carried out through a global bias term, while it is not needed for regional inversions.The initial simulation that provides the linearization point before the first iteration as well as the final simulation after the minimization are not parallelized.At the highest level, the inversion manages all segments at once with a single cost function.
For global inversions, the increment bottleneck that we introduce in each segment in the form of this bias term (on 1 October of each year in our illustration) reduces the accuracy of the tangent-linear model with an effect that depends on the assimilated observations: a dense and precise observation network counterbalances the trend towards homogeneously distributed inversion increments because it better distinguishes between errors in the recent fluxes and errors from the past.In this context, the test with a single network (the NOAA global cooperative air sampling network), while tens of other sites are available (available at: http://ds.data.jma.go.jp/gmd/wdcgg/), provides an upper bound on the method precision.The effect of the increment bottleneck is also mitigated by the overlap period from one segment to the next that plays the role of a spin-up time.In this study, we have tested our approach with a 3-month overlap.For a 14 yr inversion window, we have shown that the simplification degrades the inversion increments at various scales (weekly grid scale and sub-yearly continental) by less than 20 %.For a 32 yr inversion window, the simplification makes the inversion numerically stable, which is not the case for the full system that suffers from round-off errors.Lastly, the 32 yr inversion does not take more time to run than for a 15-month inversion (i.e.days rather than months), while still processing the three decades consistently.
Qualitatively, the effect of the transport truncation is balanced by the limited robustness of transport models for long ranges.At the global scale, this limitation was clearly seen in the validation of the interhemispheric exchange times simulated by a pool of models, even though the precision was shown to have improved compared to older models (Patra et al., 2011).We argue that performing long inversions is important despite transport weaknesses because long inversions ensure both the statistical consistency and the physical consistency of the inferred long-term CO 2 budgets.Indeed, a sequential filter (e.g.Peters et al., 2007) can either conserve mass throughout the inversion cycles or account for the uncertainty of the past cycles, but cannot do both.Furthermore, long inversions benefit from the long temporal structures of the prior errors (Chevallier et al., 2006), thereby propagating observation information well beyond measurement time in a way that is complementary to transport.This second property is particularly interesting for regional inversions when high-resolution flux estimates are sought while the network is sparse at the scale of interest.
For regional (i.e.domain-limited) inversions, the information about the global growth rate is carried by the atmospheric lateral boundary conditions around the inversion domain.In this case, the bias term disappears from Eq. ( 2): for a given segment, our method neglects the direct influence of increments of the initial conditions of this segment.The spinup period (i.e. the between-segment overlap period) damps this effect that becomes negligible when the spin-up period is long enough.An adequate spin-up period mainly depends on the domain latitude range and on the speed of the zonal flow in the domain.
To further improve the inversion speed, additional levels of parallelism can be found in two ways.First, parallel computing can be introduced within each segment, in addition to between-segment parallelism.This hybrid parallel strategy would fit high-performance computing systems that are usually made of clusters of shared memory nodes.Second, advantage can be taken from ensembles of perturbed inversions because they accelerate convergence when performed together (Desroziers and Berre, 2012), while yielding rigorous inversion uncertainty computation (Chevallier et al., 2007).The parallelization of such ensembles is straightforward because the ensemble members do not communicate.
Among Bayesian systems, the variational ones are the fittest to deal with large inversion problems.The opportunities for parallelism that we see in the case of global and regional CO 2 atmospheric inversions reinforce their attractiveness in the context of high-performance cluster computing.Our approach also applies when the atmospheric inversion infers parameters of flux process models, like in variational carbon cycle data assimilation systems (Rayner et al., 2005), rather than just the fluxes.We have focused here on CO 2 because our method exploits its passive transport in the global atmosphere.It can also be applied for reactive species like methane, nitrogen and chlorofluorocarbons within regional inversions.For global inversions of these species, the distant past cannot be summarized by a constant bias: in this case, Eq. ( 2) has to be adapted by assigning a lifetime to the global offset.

Fig. 1 .
Fig. 1.Location of the NOAA ESRL flask measurements used in this study (a) and measurement availability at each site (b).In (b), the sites are designated by their identifier in the NOAA database.

FluxFig. 2 .
Fig. 2. Time series of the annual total CO 2 fluxes (including fossil fuel) in region Tropical Asia.In the sign convention, positive fluxes correspond to a net carbon source into the atmosphere.The black line (marked PRIOR) corresponds to the prior fluxes.The two coloured lines show the inverted fluxes for the reference 1979-2010 inversion (in cyan, marked REFlong) and for the reference 1979-1992 inversion (in regular blue, marked REFshort).The 1979-2010 PP inversion appears with red disks (PP).

Fig. 3 .
Fig. 3. RMS differences for annual CO 2 fluxes between the PP inversion and its prior for the period 1979-2010 (black, PP-PRIOR), between the 1979-2010 PP inversion and the 1979-2010 reference inversion for the same period (blue, PP-REFlong) and between the 1979-2010 PP inversion and the 1979-1992 reference inversion for 1979-1991 (red, PP-REFshort).The world is split into the 22 TransCom3 regions.

Fig. 4 .
Fig. 4. RMS differences between the 1979-2010 PP inversion and the reference 1979-1992 inversion when computing the inverted weekly fluxes.The statistics are computed for the 1979-1991 period.
faces a similar issue when it accounts for flux and transport variability on scales of the order of 10 km or less.In this paper we propose to parallelize some of the transport computations along overlapping temporal segments www.geosci-model-dev.net/6