A comprehensive evaluation of the use of Lagrangian particle dispersion models for inverse modeling of greenhouse gas emissions

. Using the example of sulfur hexaﬂuoride (SF 6 ) we investigate the use of Lagrangian Particle Dispersion Models (LPDMs) for inverse modeling of greenhouse gas (GHG) emissions and explore the limitations of this approach. We put the main focus on the impacts of baseline methods and the LPDM backward simulation period on the a posteriori emissions determined by the inversion. We consider baseline methods that are based on a statistical selection of observations at individual measurement sites and a global distribution based (GDB) approach, where global mixing ratio ﬁelds are coupled to the LPDM 5 back-trajectories at their termination points. We show that purely statistical baseline methods cause large systematical errors, which lead to inversion results that are highly sensitive to the LPDM backward simulation period and can generate unrealistic global total a posteriori emissions. The GDB method produces a posteriori emissions that are far less sensitive to the backward simulation period and that are consistent with recognized global total emissions. Our results show that longer backward simulation periods, beyond the often used 5 to 10 days, reduce the mean squared error and increase the correlation between a 10 priori modeled and observed mixing ratios. Also, the inversion becomes less sensitive to biases in the a priori emissions and the global mixing ratio ﬁelds for longer backward simulation periods. Further, longer periods help to better constrain emissions in regions poorly covered by the global SF 6 monitoring network (e.g., Africa, South America). We ﬁnd that the inclusion of existing ﬂask measurements in the inversion helps to further close these gaps and suggest that a few additional and well placed ﬂask sampling sites would have great value for improving global a posteriori emission ﬁelds.

. Sites of surface in situ measurements used in the inversion and in the re-analysis. Site  and a number of independent organisations whose data were partly included in the World Data Centre for Greenhouse Gases (WDCGG, 2018). Measurement sites are listed in Table 1, together with acronyms and other station specific information.
At AGAGE stations, SF 6 mixing ratios are measured using Medusa Gas Chromatography followed by Mass Spectrometry 95 (GC/MS, Miller et al., 2008). At the stations HAT and COI the SF 6 measurement system is based on cryogenic preconcentration and capillary GC/MS (Yokouchi et al., 2006). At all other stations, Gas Chromatography followed by Electron Capture Detection (GC-ECD) is used to measure SF 6 mole fractions. Observations were calibrated with four different SF 6 scales: SIO-2005, WMO SF6 X2006, WMO SF6 X2014 and NIES-2008. We converted all observations to the SIO-2005 calibration scale, by dividing NIES-2008 calibrated data by the factor 1.013 (Saito, 2021) and WMO SF6 X2014 calibrated data by 1.002 For the re-analysis of SF 6 (see section 2.5) we used all the available 2011 and 2012 in situ measurements from the sites listed in Table 1. In addition, we included flask air samples from 44 surface observation stations (NOAA, Dlugokencky et al., 2020) and from 16 aircraft profiling stations (Sweeney et al., 2015;NOAA Carbon Cycle Group ObsPack Team, 2018). Surface flask 115 measurements were available at intervals ranging from a few days up to months. Sampling flights were conducted irregularly with intervals between 2 and 5 weeks at individual sites. Aircraft measurements from individual flights provide vertical SF 6 mixing ratio profiles up to 8.5 km above sea level, where air samples are usually taken within less than an hour. With one exception, all aircraft samples were collected over North America. Additional information about the flask measurements from surface sites and aircraft programs can be found in Table A1 and Table A2 (Appendix). All flask measurements were calibrated

Inversion method
In this study we use the Bayesian inversion framework FLEXINVERT+ described in detail by Thompson and Stohl (2014), which was further developed since then, to make the code more modular and to include iterative solution methods. However, our results should be valid for all inversion methods based on LPDM calculations and we thus only include a brief description of FLEXINVERT+. It is based on a linear forward operator H that represents the atmospheric transport, so that the forward 130 problem reads: where y is the vector of observed mixing ratios, x the emission state vector and ε the sum of observation and model error.
Since H is ill conditioned and has no unique inverse, a priori emission estimates can be added, in order to solve (1) for x. The inversion method applies Bayes' theorem to calculate a posteriori emissions, which on the one hand minimize the 135 difference between observed and modeled mixing ratios, and on the other hand stay close to the a priori emissions and inside of predefined uncertainty bounds. Assumed uncertainties are Gaussian distributed resulting in a minimization of the cost function (e.g. Tarantola, 2005).
where B is the a priori emission error covariance matrix, R the observation error covariance matrix, and x p the vector of the a 140 priori emissions. This study uses the following analytical solution to minimize J(x): We use a spatial emission grid ( Fig. A1) with a varying grid size ranging from 1 • ×1 • to 16 • ×16 • . We define the grid by using model information to aggregate grid cells with low emission contributions, as further described by Thompson and Stohl (2014). For this, the emission sensitivity is taken from the LPDM 50 day backward simulation and the resulting inversion grid 145 is used for all inversions. The output emission fields are saved at a spatial resolution of 1 • ×1 • .
SF 6 has no surface sinks and its surface fluxes can therefore only be larger or equal to zero. However, the inversion algorithm can produce negative a posteriori fluxes. To overcome this problem we follow Thompson et al. (2015) and apply an inequality constraint on the a posteriori emissions, using the truncated Gaussian approach by Thacker (2007). This approach, which applies inequality constraints as error-free observations, is described by the following equation: where P is a matrix operator selecting the fluxes violating the inequality constraint, and c a vector of the inequality constraint (zero in our case). x and A represent the a posteriori emissions and error covariance matrix precalculated in the inversion, respectively.
In contrast to many other studies (e.g., Henne et al., 2016;Stohl et al., 2009;Thompson and Stohl, 2014) we do not use the option to optimize the baseline mixing ratios in the inversion. This gives us the opportunity to better analyze the differences between investigated baseline methods and to study their impacts on the a posteriori emissions in detail.

Atmospheric transport
H is the so-called Source-Receptor-Relationship (SRR) in the context of atmospheric transport. The SRR is an emission sensitivity that relates emission changes in a given grid cell to changes in modeled mixing ratios at a given receptor; for further 160 details, see Seibert and Frank (2004). The SRR value in a specific grid cell (units of 1 sm 3 kg ) measures the simulated mixing ratio change at a receptor that a unit strength source (1 kg sm 3 ) in that grid cell would create (Stohl et al., 2009). In this study, we use the LPDM FLEXPART 10.4 (Pisso et al., 2019;Stohl et al., 1998Stohl et al., , 2005 to calculate the SRR. The model is run in backward mode as this is more efficient than forward calculations when the number of emission grid cells exceeds the number of observation sites. Available observations are averaged to three hourly means (see section 2.1). For each 165 of these means 50,000 virtual particles are released continuously over the averaging period and followed backward in time.
The SRR is calculated by determining the average time the particles spend in each grid cell of the 1 • ×1 • output grid within the lowest 100 meters above the ground, assuming that all emissions occur at or near the ground. FLEXPART is driven by the hourly reanalysis dataset ERA5 (Hersbach et al., 2018) from the European Centre for Medium-Range Weather Forecasts (ECMWF) at a resolution of 0.5 • ×0.5 • and with 137 vertical levels. Since SF 6 is an almost nonreactive gas, removal processes 170 are neglected in the calculation of the SRR.
In this study, five different backward calculation periods are investigated: 1, 5, 10, 20 and 50 days. At the end of these periods, particles are terminated and the back trajectories end. Figure 2 shows the 2012 annual average emission sensitivities for the backward calculation period of 5 ( Fig. 2a) and 50 ( Fig. 2b) days, respectively. On the 5 days time scale large land areas in the Southern Hemisphere (Northern Australia, South America, Southern Africa) and also parts of the Northern Hemisphere (e.g.

175
India, Iran) are sampled poorly or not at all. In these areas, emissions can therefore not be determined well by the inversion.
High sensitivity can only be found at land regions with many receptors, such as Europe. On the 50 days time scale, the SRR has higher values compared to the 5 days backward calculation. Large parts of the Northern Hemisphere are sampled quite well and the emission sensitivities provide some information, even at areas that are far away from the observation stations.
However, emission sensitivities are still low in the Tropics, especially over Africa, South America and Northern Australia.

180
When also using surface flask measurements (Fig. 2c) in addition to in situ measurements for the case of a 50 day backward simulation period, the emission sensitivity is substantially higher almost everywhere and more smoothly distributed over the globe. However, regions of low sensitivity remain in the Tropics and in the Southern Hemisphere.

The baseline definition
The transport model can only account for mixing ratio changes caused by emissions within the chosen backward calculation 185 period. Consequently, a baseline representing the influence of all the emission contributions prior to this time period has to be defined.    in various studies to determine a baseline for atmospheric inversions of several GHG species (e.g. Brunner et al., 2017;Henne et al., 2016;Schoenenberger et al., 2018;Simmonds et al., 2016;Vollmer et al., 2016). The REBS method defines observed mixing ratios y(t i ) at each time step t i as the sum of a baseline signal g(t i ), an enhancement due to polluted air masses m(t i ) and the observational error E i : The method assumes that most observations are baseline observations and therefore not influenced during pollution episodes (m(t i ) = 0). It also assumes that the baseline curve g is smooth -so that it can be linearly approximated around any given time. The method then applies a local linear regression model that fits the observation data, giving more weight to data points close to the considered time and iteratively excluding data points outside a certain range. An advantage of the REBS method is that it is simple to implement. The code is freely available and besides some parameters that need to be chosen, it only 200 depends on the observation data. This simplicity, however, also means that the method is unable to take the length of the LPDM backward calculation into account. As we shall see, this leads to systematic biases in the inversion results that depend on the length of the backward calculation. The method also assumes a smoothly varying baseline which limits its ability to account for meteorological variability. Another disadvantage is the dependence on certain parameter settings. The settings used in this study are provided in Table A3. Finally, the method can only be used at sites with frequent observations, not for flask 205 measurement sites, or moving measurement platforms.

Stohl's method
The method introduced by Stohl et al. (2009) is primarily based on the selection of observed mixing ratios at individual observation stations, but also uses the simulated SRR values and a priori emissions to determine the baseline. In the last few years it was used in several inversion studies (e.g. Brunner et al., 2017;Fang et al., 2014Fang et al., , 2015Fang et al., , 2019Stohl et al., 2010; 210 Thompson and Stohl, 2014). We apply the method and select the lowest 25 percent of observations from individual stations in By subtracting prior simulated mixing ratios the method takes the length of the LPDM backward calculation into account and aims to avoid an overestimation of the baseline. However, simulated mixing ratios are calculated using a priori emission estimates, making the method dependent on a priori information. Further, the subjective choice of the time window, as well as the equally subjective selection of the lowest quartile of observations and the lower half of prior simulated mixing ratios are 220 problematic. As the REBS method, Stohl's method assumes a smooth baseline curve, and thus it cannot account for sudden changes in the baseline due to meteorological variability. Also, the method can only be used at sites with frequent observations.

The GDB method
The idea of the GDB approach  is to determine the baseline directly from a 3D global field of mixing ratios, e.g., from a re-analysis of the atmospheric chemical composition. The end points of the back-trajectories that are 225 used by the LPDM to calculate the SRR are utilized to determine the sensitivity at the receptor to mixing ratios at the points in space and time where particles terminate (see Fig. 3 for a simplified illustration). This sensitivity (termed as "termination sensitivity", thereafter) in a particular grid cell is calculated in the LPDM by dividing the number of particles terminating in that cell by the total number released at the receptor, while also including a transmission function to account for loss processes (not relevant for SF 6 ) during the backward simulation period. The termination sensitivity fields are saved in a 3D 1 • ×1 • output 230 grid with 16 vertical layers with interface heights at 0.1, 0.5, 1, 2, 3,4,5,7,9,12,15,20,25,30,40, and 50 km above ground level. For global inversions, baseline mixing ratios are then calculated by multiplying the termination sensitivity with the mixing ratios of the 3D global field and integrating the product over all grid cells. The GDB method can also be used for regional inversions (not done in this study). In this case, the emission contributions from outside the regional domain need to be added to the baseline  but otherwise the inversion procedure is identical as described here.

235
The GDB method is independent of subjective data selection and choice of parameter settings. In contrast to the REBS and Stohl's method it does not assume a smooth baseline and has the potential to fully account for meteorological variability. As illustrated, it excludes emission contributions from within the backward simulation period and therefore provides a baseline that is fully consistent with the length of the backward simulation. Furthermore, contrary to the other two methods, it can also be used at measurement sites with infrequent observations or moving observation platforms. Its accuracy, however, is dependent 240 on the ability to eliminate errors, and especially any bias of the global 3D mixing ratio fields. We target this challenge by using the FLEXible PARTicle dispersion chemical transport model (FLEXPART CTM, Henne et al., 2018) to perform a re-analysis of SF 6 as described in the next section. million particles are randomly distributed over the globe, proportional to the air density. In addition to an air tracer, particles also carry the chemical species SF 6 . The initialisation is based on a latitudinal SF 6 profile based on surface observations. We run the simulation from 2011 to 2012, using 2011 as a spin up period. Particles are followed forward in time, and whenever a 250 particle resides below the diagnosed boundary layer height its mass is increased due to surface SF 6 emissions. The model is driven with the ECMWF ERA5 dataset and with emission fields calculated as described in section 2.6 . Mixing ratio fields are saved in a 3 • ×2 • output grid and extrapolated to the same grid as the termination sensitivity fields.

Re-analysis of SF
FLEXPART CTM uses a nudging routine to keep simulated SF 6 fields close to the observations of SF 6 . With this simple data assimilation method, modeled fields of mixing ratios are relaxed towards observations within so-called nudging kernels  Table A4.

A priori information
An a priori estimate of the spatial distribution of SF 6 emissions for the year 2012 is determined by collecting information on the emissions from individual countries. We use country emissions reported to the United Nations Framework Convention on 265 Climate Change (UNFCCC, 2021) and for East Asian countries emissions estimated by Fang et al. (2014). The sum of these individual country emissions is subtracted from the total global SF 6 emissions determined by Simmonds et al. (2020) and the remaining emissions are distributed to all other countries proportional to their electric power consumption (World Bank, 1·10 −13 kg m 2 h . Spatial and temporal correlation between uncertainties are considered by using an exponential decay model with a scale length of 250 km and 30 days. The error covariance matrix B is calculated as the Kronecker product of the spatial and temporal covariance matrices.

Baselines and length of backward simulation 275
The three investigated baseline methods are discussed for the example of two measurement sites, Gosan and Ragged Point, and for five backward simulation time periods. The Gosan observation station is located on the south-western tip of the Korean Island Jeju, monitors the outflows from the Asian continent, and is respresentative of stations which frequently measure pollution events. The Ragged Point observation station is situated on the eastern edge of Barbados with direct exposure to the Atlantic Ocean. Ragged Point is primarily influenced by easterly winds providing "clean" background air masses, uninfluenced by local 280 emissions, and is therefore representative of background stations. Baseline mixing ratios are plotted together with respective observations and a priori mixing ratios for different LPDM backward simulation periods ranging from 1 to 50 days. A priori mixing ratios are calculated as the sum of the baseline and the contribution originating from a priori emissions during the period of the backward simulation (termed "direct emissions contributions" thereafter). Ideally, the choice of the backward simulation period should have no systematic effect on the calculated a priori mixing ratios. By increasing the backward sim-285 ulation time, and therefore enlarging the temporal domain, more direct emission contributions are included. All these direct emission contributions should be removed from the baseline and as a result the baseline should become lower and smoother in order to leave a priori mixing ratios unchanged. Furthermore, one can assume that a correctly working baseline method leads to a proper agreement between a priori mixing ratios and observations. This agreement is investigated here for the three methods with time series plots ( Fig. 4-7), as well as statistical parameters (bias, mean squared error (MSE), and coefficient of 290 determination (r 2 )) summarized in Table 2. mixing ratios, as neither the smooth baselines nor the small direct emission contributions can reproduce the observed mixing ratios during pollution episodes. This agreement becomes much better with longer backward simulation periods, when direct 295 emission contributions get more impact (Fig. 4b/e). The REBS baseline stays completely unchanged for different backward simulation periods. Therefore, a priori mixing ratios grow with increasing simulation periods (Fig. 4b/c), as more direct emissions contribute to the calculated total mixing ratio. For Gosan, the bias is negative for the 1-day simulation period but becomes increasingly positive for longer simulation periods (Table 2). This systematically increasing bias is inherent to all purely observation based baseline methods and cannot be corrected without adding model information. In contrast, Stohl's baseline level 300 decreases with longer backward simulation periods as higher direct emission contributions are subtracted from the pre-selected  observations. Consequently, the bias of the a priori mixing ratios changes less between 10 and 50 days of backward simulation ( Fig. 4e/f). This is confirmed by statistical parameters in Table 2, showing also only little change between 10 and 50 days.
At Ragged Point (Fig. 5) the a priori mixing ratios determined by the REBS method fit the observation data very well for short backward simulation periods, where baseline and prior mixing ratios overlap because of small direct emission contribu-305 tions ( Fig. 5a/b). This is expected, since the method determines the baseline by fitting the observation data, while iteratively excluding outliers. Since Ragged Point is uninfluenced by regional emissions, no significant measurement peaks need to be excluded. Therefore, the REBS baseline fits well through the measurement data, resulting in a good statistical model-observation agreement (Table 2). However, the smooth baseline is unable to reproduce the observed variability. In the case of a simulation period of 50 days (Fig. 5c), more direct emission contributions give higher a priori mixing ratios, overestimating the measure-310 ments and causing a large bias. In contrast, due to its 25th percentile pre-selection of observations, Stohl's method shifts the baseline curve towards the lower observations. For low direct emission contributions ( Fig. 5a/b), a priori mixing ratios thus underestimate the observations. The resulting bias is almost unaffected by the different backward simulation periods (Table 2, Fig. 5c), showing the method's ability to compensate for increasing direct emission contributions. However, the rather ad hoc 25th percentile pre-selection of data for the baseline is obviously not justified for a background station with few pollution 315 episodes, leading to a systematic underestimation of modeled a priori mixing ratios, irrespective of the length of the backward simulation.
The GDB method is illustrated for all tested backward simulation periods, including a case without any backward simulation (0 days). In this extreme case the baseline is obtained directly from the value of the global mixing ratio field simulated with FLEXPART CTM in the spatio-temporal grid cell of the respective observation. At Gosan, FLEXPART CTM reproduces 320 observed mixing ratios well, even capturing a few pollution events (Fig. 6a). In the 1-day backward simulation case ( Fig. 6b), the method computes a highly variable baseline partly representing the observed variability. This results in a much better agreement between a priori and observed mixing ratios than using the REBS or Stohl's method ( Table 2). The GDB baseline becomes smoother and lower with increasing backward simulation time. The loss of variability arises from the fact that the GDB method calculates the baseline from a weighted average of grid cell mixing ratios at the trajectory termination points.

325
The longer particles are followed backward in time, the more widely dispersed over large geographical regions termination points become, thus resulting in a smoother baseline. The lowering of the GDB baseline is compensated by the increase of the direct emission contributions (see Section 2.4.3 and Fig. 3), ensuring a seamless transition between forward (Flexpart CTM) and backward simulations. As a result a priori mixing ratios in Fig. 6 show no large changes with increasing simulation period between 5 and 50 days.
330 Figure 6 also demonstrates the advantage of the Lagrangian backward simulation. As FLEXPART CTM is limited in resolution and particle number, it can only reproduce a few pollution events at Gosan, underestimates the highest and overestimates the lowest measured SF 6 mixing ratios (Fig. 6a). The backward simulation is initiated at the exact location of the measurement point and provides, in principle, infinite resolution ( Fig. 6b-f). If the backward calculation period is long enough that back trajectories reach important emission regions, mixing ratio spikes similar to the observed ones can be simulated. At the same time,  the lowered baseline for intrusions of southern air masses during the Asian summer monsoon also allows capturing the lowest observed values. As an indicator, Table 2 shows exclusively improving correlation between modeled and observed values with increasing backward simulation periods. Figure 7 illustrates the GDB method at the Ragged Point station. FLEXPART CTM (Fig. 7a) reproduces the measured mixing ratios well, however generates more variability than observed at this station. This is partly due to the limited number of particles in the domain-filling simulation, which introduces noise into the model results. This is averaged out by the GDB method with increasing backward simulation time, as the baseline becomes a weighted average over many grid cells. Nevertheless, the baseline maintains variability for all tested simulation periods, fitting the observed signal well (Fig. 7b-e). It is noteworthy that at Ragged Point a substantial part of the observed SF 6 variability seems to be caused by transport from different latitudes/regions without direct emission contributions, exemplified by the quite variable baseline even for the 50-day 345 backward simulation. In contrast, the direct emissions accumulated over the 50 days of the backward simulation are producing an almost constant enhancement over the baseline. This is very different from a station like Gosan that is strongly influenced by pollution episodes.
Notice also that for backward simulation times of 10 days and longer, the combined FLEXPART CTM/backward model is able to reproduce short episodes of very low observed mixing ratios at Ragged Point that are caused by episodic transport from is also unclear whether the FLEXPART CTM nudging routine was able to properly correct mixing ratios at higher altitudes, as aircraft measurements were available only over North America (with one exception).
Statistical parameters (bias, MSE, and r 2 ) were separately calculated for every observation station and respective averages over all stations are shown in Table 2. One should keep in mind that the REBS and Stohl's method are based on the observations themselves and thus the observed and modeled a priori mixing ratios are not independent. Therefore, it is remarkable 365 that overall the GDB method obtains smaller bias and MSE values than the other two methods. Regarding correlation, it is not surprising, that Table 2 shows the largest r 2 values for the REBS method, where the baseline is basically a fit of the observation data. However, it is noteworthy that for the GDB method, the r 2 value improves systematically with growing backward simulation time and for 50 days even exceeds the value derived by Stohl's method. By extending the backward calculation period from 10 to 50 days the GDB r 2 value increases by 0.045, meaning that an extra 4.5% of the observed variability can be 370 explained by the model. This is quite a substantial improvement. Notice also the improvement in bias and MSE, which can be observed for the GDB and Stohl's method, when extending the simulation period from 10 to 50 days. The REBS method does not show these improvements due to its systematical increase of bias with backward simulation time.   When using the REBS method (Fig. 8b), the inversion produces negative emission increments in almost all areas of the globe,

Inversion Results
indicating that calculated baselines are too high overall. This is consistent with the assumption that the method overestimates 385 the baseline at individual stations by wrongly classifying observations as baseline observations that are actually influenced by emissions within the backward calculation period. In contrast, the inversion algorithm produces positive increments almost everywhere around the globe when applying Stohl's method (Fig. 8c), suggesting that the method systematically underestimates the baseline (not only at background stations) which generally leads to a priori emissions that are too high. In case of the GDB method (Fig. 8d) negative and positive increments are more balanced, showing no sign of a systematical under-or 390 overestimation of the baseline. Large positive increments can be seen in East Asian regions and parts of Europe, whereas the inversion tends to produce slightly negative increments in the Southern Hemisphere.

National emissions
As the verification of emission reports to UNFCCC takes place on a national scale, the impact of baseline methods on national emissions is of great interest (Fig. 9). In countries with very low emission sensitivity (e.g., Brazil) inversion increments are 395 very small in all three cases and the baseline choice has therefore little impact. However, considering countries with higher emission sensitivities (e.g., China), the a posteriori emissions are very sensitive to the baseline definition. In almost all cases the REBS method leads to smaller and Stohl's method to larger national emissions than the GDB method, again revealing systematic problems in the first two methods. Due to the large emissions in China these problems become especially apparent there, with almost a factor 3 emission difference, corresponding to almost 30% of the 2012 global SF 6 emissions.

Global emissions
The 2012 SF 6 global emissions are shown in Fig. 10. The bars represent inversion results using different backward calculation periods between 1 and 50 days (light to dark shading). The horizontal dashed line illustrates a reference value calculated by Simmonds et al. (2020) with the AGAGE 12-box model. Notice that this is the same value used to calculate the a priori emissions, so the line represents also the global a priori emissions. Since the uncertainty of the global emissions is relatively

The advantage of longer backward simulation periods
As an argument for a relatively short backward simulation period Stohl et al. (2009) stated that "the value for the inversion of every additional simulation day decreases rapidly with time backward". Certainly, this is true for countries and regions that are well covered by the global monitoring network. For instance, for France the SRR increases rapidly in the first few backward simulation days but flattens to a linear increase for longer backward simulation periods (Fig. 11a). A similar behavior can be 430 observed for many countries in the Northern Hemisphere, although the curve's slope for the first few days varies. For countries poorly covered by the monitoring network, however, the SRR is close to zero for the first 5 to 15 backward days and increases exponentially afterwards (see Fig. 11b). For these poorly-monitored countries only backward simulations beyond the usual 5-10 days used in most studies provide information for the inversion. For these countries, the SRR increase with time flattens to a linear increase only for very long transport times, even beyond the 50 days used in this study.    increments for the GDB method and for backward simulation periods of 1, 10 and 50 days. In case of 1-day backward calculations (Fig. 12a) the inversion significantly optimizes a priori emissions only in East Asia and parts of Europe. As the backward simulation period is extended to 10 days (Fig. 12b), the inversion optimizes emissions in larger parts of the Northern Hemisphere but in the Southern Hemisphere emission increments are still small. In case of 50 days (Fig. 12c), the inversion 440 optimizes emissions even far away from observation stations (e.g. South America or South Africa). In India, where SRR values are also small and a priori emissions (and thus emission uncertainties) are high (see also 11b), the emission increments even switch from positive to negative by extending the period from 10 to 50 days. Also, the calculated relative uncertainty reduction increases by extending the backward simulation period (see Fig. A2a-c).
The use of flask samples 445 An advantage of the GDB method is the possibility to include flask measurements from fixed sites or moving platforms in the inversion. By contrast, the REBS and Stohl's method require short measurement intervals at fixed sites for the statistical baseline calculation. Figure 13 shows the relative change in a posteriori emissions using flask measurements additionally to the in situ measurements for the 50 days backward simulation. One can see substantial differences in the USA, Eastern Europe, Asia, and Southern Africa, illustrating the great value of this additional information. Further, the inclusion of flask 450 measurements slightly increases the relative error reduction in their surroundings (e.g. USA, South Africa, see Fig. A2c,d).

Reliable global emissions can only be obtained with long backward simulation periods
In previous sections, we have used global mixing ratio fields from the GDB method where great care has been taken to avoid biases that would affect the baseline, and we have used global a priori emissions that correspond to the rather well known global SF 6 emissions. These are optimal conditions for the inversion that are rarely fulfilled for other species than SF 6 . For many species global emissions are less well known, and with fewer observations than for SF 6 also the global distribution (and, thus, the baseline) is more uncertain. However, a skillful inversion should tolerate such biases and still produce reliable results. While we lack information for verifying that regional emissions are reliable, for SF 6 we can at least test whether global emissions can be determined by our inversion in the presence of biases. Figure 14 shows global a posteriori emissions when biases in (1) the a priori emissions and (2) global mixing ratio fields 460 were added. This is shown for different backward simulation periods between 1 and 50 days and for the 50 days case with inclusion of flask measurements. Note here that for all these sensitivity cases we use the same absolute a priori emission uncertainties, as for the original a priori emissions without any artificial bias.
Comparing the inversion results for doubled (Fig. 14a) and halved (Fig. 14b) a priori emissions clearly shows that the corresponding biases in the global a posteriori emissions are reduced substantially with increasing backward simulation period 465 and converge towards the rather well known global SF 6 emission from the box model. However, it is clear that a substantial bias remains even with a backward simulation period of 50 days. It seems likely that an extension of the backward simulation period beyond 50 days would further reduce the bias. The inclusion of flask measurements leads to slight additional improvements.
Another sensitivity test was performed with artificially biased global mixing ratio fields by subtracting (Fig. 14c)  Finally, we also combined doubled a priori emissions with a -0.003 ppt bias in the global mixing ratio fields (Fig. 14e) and 485 halved a priori emissions with a +0.003 ppt bias (Fig. 14f). For both cases, the inversion becomes less sensitive to biases in the a priori emissions and the global fields with longer backward simulation periods.

Conclusions
We have examined the use of LPDMs for inverse modeling of GHG emissions by varying the backward simulation period in the range of 1 to 50 days, testing several methods for estimating a baseline, investigating the influence of biases in the a priori 490 emissions and the baseline, and exploring the value of flask measurements for the inversion. We found the following: -A baseline method that is purely based on the observations at the observation site itself, such as the REBS method, leads to unreliable inversion results that are highly sensitive to the length of the LPDM backward simulation and can lead to entirely unrealistic a posteriori global total emissions. For instance, for the year 2012 inversions with the REBS method produce a posteriori global total SF 6 emissions of 9.8 Gg/yr and 3.2 Gg/yr for backward simulation periods of 1 day and 495 50 days, respectively, compared to a well known reference value of around 8.0 Gg/yr.
-A baseline method that is based on the observations at the site itself but corrects for emissions occurring during the LPDM backward simulation period leads to smaller sensitivity to the backward calculation time but may still lead to substantially biased emissions irrespective of backward simulation period. For instance, inversions with Stohl's method overestimate the well known 2012 SF 6 global total emissions by 2.2 -3.6 Gg/yr (28-45%). 500 field, is superior and leads to a posteriori emissions that are far less sensitive to the LPDM backward calculation length and that are consistent with global total emissions. In contrast to station-specific baselines, the GDB method allows the inclusion of low-frequency measurements (e.g., flask samples) or data from mobile platforms into the inversion.
-Statistical comparisons of a priori modeled versus observed mixing ratios show that longer LPDM backward simulations 505 outperform shorter simulations. In particular, extending the trajectory length from the usual 5-10 days to 50 days reduces the mean squared error and increases the correlation. .
-Inverse modelling is highly sensitive to biases in the a priori emissions as well as biases in the baseline. We could show that this sensitivity decreases with the length of the backward simulation period. While it is nearly impossible to correct biased global a priori emissions with backward simulation periods of 1-10 days, 50-day backward simulations 510 can capture global emissions quite accurately even in the presence of large biases.
-The additional use of flask measurements improves the observational constraint on SF 6 emissions substantially. However, existing flask sampling sites are often not well located for inversion purposes. We suggest that placing a few additional flask sampling sites downwind of potential emission regions in currently undersampled regions (in particular, tropical South America, tropical Africa, India, Australia and the Maritime Continent) would have disproportionately large value 515 for improving regional and global a posteriori emission fields.
Following these results, we strongly recommend to abandon the use of baseline methods based purely on the observations of individual sites, for inverse modeling. We also recommend to employ longer LPDM backward simulation periods, beyond the usual 5-10 days, as this leads to improvements in overall model performance, allows to constrain emissions in regions poorly covered by the monitoring network, and produces more robust global emission estimates. When consistency between regional 520 and global emission estimates is important, even longer backward simulation periods than 50 days may be useful. Finally, we suggest to take additional flask measurements at continental sites in the Tropics and Southern Hemisphere as they would greatly enhance inversion-derived global emission fields.
Code and data availability. The used source codes of FLEXPART 10.4 and FLEXINVERT+ (with small modifications to the original version freely available at https://flexinvert.nilu.no/download.html; downloaded in July 2020; (described in detail by Thompson and Stohl, 2014) Table A3. Setting parameters of the REBS method. For more information see Ruckstuhl et al. (2012).