Automated Monte Carlo-based quantification and updating of geological uncertainty with borehole data (AutoBEL v1.0)

Yin, Zhen; Strebelle, Sebastien; Caers, Jef

doi:https://doi.org/10.5194/gmd-13-651-2020

Articles | Volume 13, issue 2

Development and technical paper

19 Feb 2020

Abstract

Geological uncertainty quantification is critical to subsurface modeling and prediction, such as groundwater, oil or gas, and geothermal resources, and needs to be continuously updated with new data. We provide an automated method for uncertainty quantification and the updating of geological models using borehole data for subsurface developments within a Bayesian framework. Our methodologies are developed with the Bayesian evidential learning protocol for uncertainty quantification. Under such a framework, newly acquired borehole data directly and jointly update geological models (structure, lithology, petrophysics, and fluids), globally and spatially, without time-consuming model rebuilding. To address the above matters, an ensemble of prior geological models is first constructed by Monte Carlo simulation from the prior distribution. Once the prior model is tested by means of a falsification process, a sequential direct forecasting is designed to perform the joint uncertainty quantification. The direct forecasting is a statistical learning method that learns from a series of bijective operations to establish “Bayes–linear-Gauss” statistical relationships between model and data variables. Such statistical relationships, once conditioned to actual borehole measurements, allow for fast computation of posterior geological models. The proposed framework is completely automated in an open-source project. We demonstrate its application by applying it to a generic gas reservoir dataset. The posterior results show significant uncertainty reduction in both the spatial geological model and the gas volume prediction and cannot be falsified by new borehole observations. Furthermore, our automated framework completes the entire uncertainty quantification process efficiently for such large models.

Received: 21 Aug 2019 – Discussion started: 22 Oct 2019 – Revised: 15 Jan 2020 – Accepted: 20 Jan 2020 – Published: 19 Feb 2020

1 Introduction

Uncertainty quantification (UQ) is at the heart of decision making. This is particularly true in subsurface applications such as groundwater, geothermal resources, fossil fuels, CO₂ sequestration, or mineral resources. Uncertainty on the geological structures, rocks, and fluids is due to the lack of access to the subsurface geological medium. For most of the subsurface applications, knowledge of the geological settings is mainly gained through the drilling of well boreholes where geophysical or rock physical measurements are made. For example, several tens to hundreds of boreholes are drilled in geothermal or groundwater appraisals (e.g., Le Borgne et al., 2006; Klepikova et al., 2011; Vogt et al., 2010), while in mineral resources and shale gas, the number of boreholes can even be in the thousands (e.g., Curtis, 2002; Abbott, 2013). From borehole data, geological models are constructed for appraisal and uncertainty quantification, such as estimating water volumes stored in groundwater systems or heat storage in a geothermal system. Realistic geological modeling involves complex procedures (Caumon, 2010, 2018; de la Varga et al., 2019). This is due to the hierarchical nature of geological formations: fluids are contained in a porous medium, the porous medium is defined by various lithologies, and lithological variation is contained in faults and layers (structure). In addition, boreholes are not drilled all at once but throughout the lifetime of managing the Earth's resource.

Representing the unknown subsurface geological reality by a single deterministic model has been commonly done (Beven, 1993; Royse, 2010), mostly by means of a single realization of the structure (layers or faults), rock, and fluid model derived from the borehole data with other supporting geological and geophysical interpretations (e.g., Fischer et al., 2015; Kaufmann and Martin, 2008). However, relying on a single model cannot reflect the inherent geological uncertainty (Neuman, 2003). Recent advances in geostatistics have shown the importance of using multiple model realizations for uncertainty quantification in many geoscience fields, including glaciology (e.g., Cullen et al., 2017), hydrogeology (e.g., Barfod et al., 2018; Zhou et al., 2014), hydrology (e.g., Goovaerts, 2000; Marko et al., 2014), hydrocarbon reservoir modeling (e.g., Caers and Zhang, 2004; Christie et al., 2002; Dutta et al., 2019; Yin et al., 2019), and geothermal (e.g., Rühaak et al., 2015; Vogt et al., 2010). Geostatistical approaches can provide multiple geological models that are conditioned or constrained to borehole data. When new boreholes are drilled, uncertainty needs to be updated. While uncertainty updating in the form of data assimilation is commonly applied to various subsurface applications, it is rarely used for updating with newly drilled borehole data, often termed “hard data” in the geostatistical literature (Goovaerts, 1997). Elfeki and Dekking (2007) used a coupled Markov chain (CMC) approach to calibrate a hydrogeological lithology model by conditioning on boreholes in the central Rhine–Meuse delta in the Netherlands, and they then ran a Monte Carlo simulation to reevaluate the hydrogeological uncertainty. A similar approach was also used by Li et al. (2016) to reduce the uncertainty in near-surface geology for the risk assessment of soil slope stability and safety in Western Australia. Jiménez et al. (2016) updated 3-D hydrogeological models by adding new geological features identified from borehole tracer tests. Eidsvik and Ellefmo (2013) and Soltani-Mohammadi et al. (2016) investigated the value of information of additional boreholes for uncertainty reduction in mineral resource evaluations.

The problem of geological uncertainty, due to its interpretative nature and the presence of prior information, is often handled in a Bayesian framework (Scheidt et al., 2018). The key part often lies in the joint quantification of the prior uncertainty on all modeling parameters, whether structural, lithological, petrophysical, or fluid. A common problem is that the observed data may lie outside the defined prior model, in which case the prior is falsified. Another major issue is that most of the state-of-the-art uncertainty updating practices deal with each geological model component separately (a silo treatment of each UQ problem). However, the borehole data inform all components jointly, and hence any separate treatment ignores the likely dependency between the model components, possibly returning unrealistic uncertainty quantification. A final, more practical concern is the automation of uncertainty updating. Geological modeling often requires significant individual or group expertise and manual intervention to make the model adhere to geological rules, hence often requiring months of work when new data are acquired. To date, there is no method that addresses, with borehole data, the falsification, the joint uncertainty quantification, and the automation problem.

Recently, an uncertainty quantification protocol termed Bayesian evidential learning has been proposed to address decision making under uncertainty, and it has been applied to cases in oil or gas, groundwater contaminant remediation and geothermal energy (Athens and Caers, 2019; Hermans et al., 2018, 2019; Scheidt et al., 2018). It provides explicit standards that need to be reached at each stage of its UQ design with the purpose of decision making, including model falsification, global sensitivity analysis, prior elicitation, and data-science-driven uncertainty reduction under the principle of Bayesianism. Compared to the previous works on Bayesian evidential learning (BEL), model falsification, statistical learning-based uncertainty reduction approaches, and automation are what is of concern in this paper. Also, we will deal with one specific data source: borehole data, through logging or coring, for geological uncertainty quantification. First, we will introduce a scheme to address the model falsification problem involving borehole data by using robust Mahalanobis distance. We will then extend a statistical learning approach termed direct forecasting (Hermans et al., 2016; Satija et al., 2017; Satija and Caers, 2015) to reduce uncertainty of all geological model parameters jointly, using all (new) borehole data simultaneously. To achieve this, we will present a model formulation that involves updating based on the hierarchy typically found in subsurface formation: structures, then lithology, and then property and fluid distribution. Finally, we will show how the proposed framework can be completely automated in an open-source project. With a generalized field case study of uncertainty quantification of gas volume in an offshore reservoir, we will illustrate our approach and emphasize the need for automation, minimizing the need for tuning parameters that require human interpretation.

2 Methodology

2.1 Bayesian evidential learning

2.1.1 Overview

We establish the geological uncertainty quantification framework based on BEL, which is briefly reviewed in this section. BEL is not a method, but a prescriptive and normative data-scientific protocol for designing uncertainty quantification within the context of decision making (Athens and Caers, 2019; Hermans et al., 2018; Scheidt et al., 2018). It integrates four constituents of UQ (data, model, prediction, and decision) under the scientific methods and philosophy of Bayesianism. In BEL, the data are used as evidence to infer model and/or prediction hypotheses via “learning” from the prior distribution, whereas decision making is ultimately informed by the model and prediction hypotheses. The BEL protocol consists of six UQ steps: (1) formulating the decision questions and prediction variables; (2) statement of model parametrization and prior uncertainty; (3) Monte Carlo and prior model falsification with data; (4) global sensitivity analysis between data and prediction variables; (5) uncertainty reduction based on statistical learning methods that reflect the principle of Bayesian philosophy; (6) posterior falsification and decision making. Bayesian methods, particularly in the Earth sciences, rely on the statement of prior uncertainty. However, such a statement may be inconsistent with data in the sense that the prior cannot predict the data, hence the important falsification step. We next provide important elements of BEL within the problem of this paper: prior model definition, falsification, and inversion by direct forecasting.

2.1.2 Hierarchical model definition

In geological uncertainty quantification, any prior uncertainty statement needs to involve all model components jointly. A geological model m typically consists of four components that are modeled in hierarchical order: structural model χ (e.g., faults, stratigraphic horizons), rock types ζ (which are categorical, e.g., sedimentary or architectural facies), petrophysics model κ (e.g., density, porosity, permeability), and subsurface fluid distribution τ (e.g., water saturation, salinity).

\begin{matrix} (1) & m = \{χ, ζ, κ, τ\} \end{matrix}

The uncertainty model then becomes the following sequential decomposition:

\begin{matrix} (2) & \begin{aligned} f (m) = & f (χ, ζ, κ, τ) = f (τ | χ, ζ, κ) f (κ | χ, ζ) \\ f (ζ | χ) f (χ) . \end{aligned} \end{matrix}

In addition, because of the spatial context of all geological formations, we divide the model variables into global and spatial ones. The global variables, such as proportions, depositional system interpretation, or trend, are scalars and not attached to any specific grid locations, whereas the spatial variables are gridded. Here, we term the global variables m_gl and the spatial ones m_sp. In this way, the geological model variables are

\begin{matrix} (3) & m = \{(χ_{gl}, χ_{sp}), (ζ_{gl}, ζ_{sp}), (κ_{gl}, κ_{sp}), (τ_{gl}, τ_{sp})\} . \end{matrix}

The prior uncertainty f(m) of the global and spatial variables needs to be specified for each model component; this is problem specific and may require a substantial amount of work by considering the existing data (e.g., the system is deltaic) and any prior knowledge about the interpreted systems. Using the prior distribution f(m), we run Monte Carlo to generate a set of L model realizations $\{m^{(1)}, m^{(2)}, \dots, m^{(L)}\}$ . This means instantiating all geological variables $χ, ζ, κ, τ$ jointly.
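The sequential decomposition in Eq. (2) maps naturally onto ancestral sampling: each component is drawn conditionally on the components above it in the hierarchy. The sketch below illustrates this on a toy 1-D grid. Every distribution, parameter value, and name in it (`sample_prior_model`, `n_cells`, the sand/shale coding) is a hypothetical placeholder chosen for illustration and is not taken from AutoBEL or from the field case.

```python
import numpy as np

def sample_prior_model(rng, n_cells=50):
    """Draw one toy realization m = {chi, zeta, kappa, tau} following Eq. (2).

    All distributions below are made-up placeholders for illustration only.
    """
    # f(chi): structure, here a layer thickness per cell
    mean_thickness = rng.uniform(20.0, 40.0)                    # global variable
    chi = mean_thickness + rng.normal(0.0, 2.0, size=n_cells)   # spatial variable

    # f(zeta | chi): rock type (0 = shale, 1 = sand); thicker cells favor sand
    p_sand = 1.0 / (1.0 + np.exp(-(chi - 30.0) / 5.0))
    zeta = (rng.uniform(size=n_cells) < p_sand).astype(int)

    # f(kappa | chi, zeta): porosity with a facies-dependent mean
    kappa = np.where(zeta == 1, 0.25, 0.08) + rng.normal(0.0, 0.01, size=n_cells)

    # f(tau | chi, zeta, kappa): water saturation decreasing with porosity
    tau = np.clip(1.0 - 2.5 * kappa + rng.normal(0.0, 0.02, size=n_cells), 0.0, 1.0)

    return {"chi": chi, "zeta": zeta, "kappa": kappa, "tau": tau}

rng = np.random.default_rng(seed=0)
L = 200  # number of Monte Carlo realizations
prior_ensemble = [sample_prior_model(rng) for _ in range(L)]
```

Because each component is sampled given the previous ones, dependencies between components (here, porosity on facies) are carried by every realization rather than imposed afterwards.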

Since borehole data provide information at the locations of drilling, we define the data variables d through an operator G_d.

\begin{matrix} (4) & d = G_{d} m \end{matrix}

G_d is simply a matrix in which each element is either 0 or 1, identifying the locations of boreholes in the model m. In this sense, borehole data are linear data because of the linear forward operator. By applying G_d to the prior geological model realizations, we obtain a set of L samples of the borehole data variable.

\begin{matrix} (5) & d = \{d^{(1)}, d^{(2)}, \dots, d^{(L)}\} \end{matrix}

Note that we term the actual acquired data d_obs.
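As a minimal sketch of Eqs. (4) and (5), the operator G_d can be written as a 0/1 selection matrix with one row per borehole cell. The grid size, the ensemble size, and the borehole indices below are hypothetical values used only to make the example concrete.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n_cells = 12                 # toy model size (a real grid has many more cells)
L = 5                        # number of prior realizations
borehole_cells = [2, 7, 9]   # hypothetical grid indices penetrated by boreholes

# Prior ensemble: one flattened model realization per row
m_ensemble = rng.normal(size=(L, n_cells))

# G_d has one row per borehole cell and a single 1 in the matching column
G_d = np.zeros((len(borehole_cells), n_cells))
G_d[np.arange(len(borehole_cells)), borehole_cells] = 1.0

# d^(l) = G_d m^(l), evaluated for all realizations at once
d_ensemble = m_ensemble @ G_d.T
```

For large grids, one would typically index the model array directly (`m_ensemble[:, borehole_cells]`) instead of building the matrix; the explicit matrix is shown here only to mirror Eq. (4).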

The prediction variable h, such as storage volume of a groundwater aquifer or the heat storage of a geothermal reservoir, is defined through another operator (linear or nonlinear):

\begin{matrix} (6) & h = G_{h} (m) . \end{matrix}

Applying this function to the prior model realizations we get

\begin{matrix} (7) & h = \{h^{(1)}, h^{(2)}, \dots, h^{(L)}\} . \end{matrix}

A common problem in practice is that the statement of the prior may be too narrow (overconfidence) and hence may not in fact predict the observed data. In falsification, we use hypothetico-deductive reasoning to attempt to reject the prior by means of data, namely by stating the null hypothesis that the prior can predict the observation and then attempting to reject it. This step does not involve matching models to data; it is only a statistical test. One way of achieving this is using outlier detection as discussed in the next section.

2.1.3 Falsification using multivariate outlier detection

The goal of falsification is to test that the prior model is not wrong. The prior model should be able to predict the data. Our reasoning then is that a prior model is falsified if the observed data d_obs are not within the same population as the samples $d^{(1)}, d^{(2)}, \dots, d^{(L)}$ ; i.e., d_obs is an outlier. Evidently, the data variable can be high dimensional due to a large number of wells with various types of measurements on structure, facies, petrophysics, and saturation, which calls for multivariate outlier detection. We propose in this paper to use a robust statistical procedure based on Mahalanobis distance to perform the outlier detection. The robust Mahalanobis distance (RMD) for each data variable realization d^(l) or d_obs is calculated as

\begin{matrix} (8) & RMD (d^{(l)}) = \sqrt{{(d^{(l)} - μ)}^{T} Σ^{- 1} (d^{(l)} - μ)}, \quad \text{for } l = 1, 2, \dots, L, \end{matrix}

where μ and Σ are the robust estimates of the mean and covariance of the data (Hubert and Debruyne, 2010; Rousseeuw and Driessen, 1999). Assuming d follows a multivariate Gaussian distribution, [RMD(d^(l))]² follows a chi-squared distribution $χ_{d}^{2}$ . We will use the 97.5th percentile of $\sqrt{χ_{d}^{2}}$ as the tolerance for the multivariate data points d^(l). If RMD(d_obs) falls outside the tolerance ( $RMD (d_{obs}) > \sqrt{χ_{d, 97.5}^{2}}$ ), d_obs will be regarded as an outlier, which means the prior model has a very small probability of predicting the actual observations; hence it is falsified. It should be noted that the d_obs dealt with in this paper is at model grid resolution. Outlier detection using the Mahalanobis distance has the advantage of providing robust statistical calculations. In addition, diagnostic plots can be used to visualize the result for high-dimensional data. However, it requires the marginal distribution of data to be Gaussian. If the data variables are not Gaussian, other outlier detection approaches such as one-class support vector machine (SVM) (Schölkopf et al., 2001) or isolation forest (Liu et al., 2008) can be used.
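The falsification test of Eq. (8) can be sketched in a few lines. The example below assumes that scikit-learn's `MinCovDet` (a minimum covariance determinant estimator) is an acceptable stand-in for the robust mean and covariance estimation; whether this matches the estimator settings used in AutoBEL is not stated here. The helper name `falsify_prior` and the synthetic Gaussian data are hypothetical.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def falsify_prior(d_prior, d_obs, quantile=0.975, random_state=0):
    """Return (rmd_obs, threshold, falsified) following Eq. (8).

    d_prior : (L, n) array of simulated data variables
    d_obs   : (n,) array of observed data
    """
    # Robust mean and covariance from the prior data samples
    mcd = MinCovDet(random_state=random_state).fit(d_prior)
    # MinCovDet.mahalanobis returns squared distances, so take the square root
    rmd_obs = float(np.sqrt(mcd.mahalanobis(d_obs.reshape(1, -1))[0]))
    # Tolerance: square root of the chi-squared quantile with n degrees of freedom
    threshold = float(np.sqrt(chi2.ppf(quantile, df=d_prior.shape[1])))
    return rmd_obs, threshold, rmd_obs > threshold

# Synthetic prior data samples (illustration only): L = 500, n = 3
rng = np.random.default_rng(seed=0)
d_prior = rng.multivariate_normal(mean=[0.0, 0.0, 0.0], cov=np.eye(3), size=500)

# An observation near the center of the prior samples, and one far outside
rmd_in, tol, falsified_in = falsify_prior(d_prior, np.array([0.2, -0.1, 0.3]))
rmd_out, _, falsified_out = falsify_prior(d_prior, np.array([6.0, 6.0, 6.0]))
```

In this toy setting the first observation lies inside the tolerance and the second outside it, so only the second would lead to a falsified prior. For high-dimensional borehole data, the test would be applied after dimension reduction so that the covariance estimate remains well conditioned.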

2.2 Direct forecasting

2.2.1 Review

If the prior model cannot be falsified, we will use direct forecasting to reduce geological model uncertainty. Direct forecasting (DF) is a prediction-focused data science approach for inverse modeling (Hermans et al., 2016; Satija et al., 2017; Satija and Caers, 2015). The aim is to estimate/learn the conditional distribution f(h|d) between the prediction variable h and data variable d from prior Monte Carlo samples. Then, instead of using traditional inverse methods that require rebuilding models to update prediction, direct forecasting directly calculates the conditional prediction distribution f(h|d_obs) through the statistical learning based on data. The learning strategy of direct forecasting is that, by employing bijective operations, the non-Gaussian problem f(h|d) can be transformed into a linear-Gauss problem of transformed variables $(h^{*}, d^{*})$ :

\begin{matrix} (9) & \begin{aligned} h^{*} \sim \exp (- \frac{1}{2} {(h^{*} - h_{prior}^{*})}^{T} C_{prior}^{- 1} (h^{*} - h_{prior}^{*})); \\ d_{obs}^{*}; d^{*} = G h^{*} \end{aligned}, \end{matrix}

where G is the matrix of coefficients that linearly maps h^∗ to d^∗. This makes $f (h^{*} | d_{obs}^{*})$ a “Bayes–linear-Gauss” problem that has an analytical solution:

\begin{matrix} (10) & \begin{aligned} E [h^{*} | d_{obs}^{*}] = h_{posterior}^{*} = h_{prior}^{*} + C_{prior} G^{T} \\ {({GC}_{prior} G^{T})}^{- 1} (d_{obs}^{*} - G h_{prior}^{*}), \\ Var [h^{*} | d_{obs}^{*}] = C_{posterior} = C_{prior} - C_{prior} G^{T} \\ {({GC}_{prior} G^{T})}^{- 1} {GC}_{prior} \end{aligned} . \end{matrix}
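Equation (10) translates directly into a few matrix operations. The sketch below is a minimal NumPy version with made-up numbers; the function name `bayes_linear_gauss` is hypothetical. As in Eq. (10), no data error term is included, so $G C_{prior} G^{T}$ must be invertible.

```python
import numpy as np

def bayes_linear_gauss(h_prior, C_prior, G, d_obs):
    """Posterior mean and covariance of h* given d*_obs = G h* (Eq. 10)."""
    S = G @ C_prior @ G.T                       # covariance of the predicted data
    # K = C_prior G^T S^{-1}, computed with a linear solve instead of an inverse
    K = np.linalg.solve(S, G @ C_prior).T
    h_post = h_prior + K @ (d_obs - G @ h_prior)
    C_post = C_prior - K @ G @ C_prior
    return h_post, C_post

# Toy example with hypothetical numbers: 3 prediction and 2 data components
h_prior = np.zeros(3)
C_prior = np.array([[1.0, 0.5, 0.2],
                    [0.5, 1.0, 0.3],
                    [0.2, 0.3, 1.0]])
G = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
d_obs = np.array([0.8, -0.4])

h_post, C_post = bayes_linear_gauss(h_prior, C_prior, G, d_obs)
```

In this example the first two components are observed exactly, so their posterior variance collapses to zero, while the third component is only partially constrained through its prior correlation with the other two.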

In detail, the specific steps of direct forecasting are

1. Monte Carlo: generate L samples of the prior model and run the forward function to evaluate the data and prediction variables.
2. Orthogonality: PCA (principal component analysis) on data variable d and prediction variable h.
3. Linearization: maximize the linear correlation between the orthogonalized data and prediction variables by normal score transform and CCA (canonical correlation analysis), obtaining transformed $h^{*}, d^{*}$ .
4. Bayes–linear-Gauss: calculate the conditional mean and covariance of the transformed prediction variable.
5. Sampling: sample from the posterior distribution of the transformed prediction variable $h_{posterior}^{*}$ .
6. Reconstruction: invert all bijective operations, obtaining h_posterior in the original space.

One key question in direct forecasting is how to determine the Monte Carlo sample size L. Usually, the sample size L lies between 100 and 1000, according to the studies in water resources (Satija and Caers, 2015), hydrogeophysics (Hermans et al., 2016), and hydrocarbon reservoirs (Satija et al., 2017).

Direct forecasting can also be extended to update model variables, by simply replacing the prediction variable h by model variable m in the above algorithms, to obtain f(m|d_obs) without conventional model inversions (Park, 2019). However, the high dimensionality of spatial models (millions of grid cells) poses a challenge to such an extension. This is because CCA requires the sum of input data and model variable dimensions to be smaller than the Monte Carlo sample size L: $L > \dim (d) + \dim (m)$ . Otherwise it will always produce perfect correlations (correlation coefficients equal to 1) (Pezeshki et al., 2004). Although PCA can significantly reduce the dimensionality of m from L×P to L×L, where P is the number of model parameters and L≪P, this requirement is still difficult to meet. Global sensitivity analysis is therefore applied to select a subset of the PCA orthogonalized m that is most informed by the data variables. The subset m may retain only a few principal components (PCs) (Hoffmann et al., 2019), depending on how informative the boreholes are. Unselected (non-sensitive) model variables remain random according to their prior empirical distribution. Both the sensitive and non-sensitive variables will be used for posterior reconstruction in step 6. In this paper, we use a distance-based generalized sensitivity analysis (DGSA) method (Fenwick et al., 2014; Park et al., 2016) to perform sensitivity analysis. Compared to the other global sensitivity analyses, such as variance-based methods (e.g., Sobol, 1993, 2001), regionalized methods (e.g., Pappenberger et al., 2008; Spear and Hornberger, 1980), or tree-based methods (e.g., Wei et al., 2015), DGSA has its specific advantages for high-dimensional problems while requiring no functional form between model responses and model parameters. It can efficiently compute global sensitivity, which makes it preferred for our geological UQ problem where the models are large and computationally intensive. When performing PCA on the data variable d, we select the PCs by preserving 90 % variance. Note that borehole data are of much lower dimension than spatial models and hence are already low dimensional.
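The sample size requirement of CCA can be checked numerically. The sketch below computes canonical correlations with plain NumPy (QR factorization followed by an SVD) for two sets of mutually independent random variables, so any high correlation it reports is spurious. The dimensions and sample sizes are hypothetical, and the helper `canonical_correlations` is written for this illustration rather than taken from AutoBEL.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X and Y (rows are samples)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the two centered column spaces
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

rng = np.random.default_rng(seed=0)
dim_d, dim_m = 12, 12

# Too few samples: L = 20 is smaller than dim(d) + dim(m) = 24
rho_small = canonical_correlations(rng.normal(size=(20, dim_d)),
                                   rng.normal(size=(20, dim_m)))

# Enough samples: L = 500 is much larger than 24
rho_large = canonical_correlations(rng.normal(size=(500, dim_d)),
                                   rng.normal(size=(500, dim_m)))
```

With L = 20 the leading canonical correlations are numerically equal to 1 even though the two variable sets are independent, whereas with L = 500 they stay well below 1. This is the reason the model variables are reduced by PCA and sensitivity-based selection before CCA is applied.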

2.2.2 Direct forecasting on a sequential model decomposition

We defined our prior uncertainty model (Eq. 2) through a sequential decomposition of hierarchical model components. Likewise, the conditioning of such model components to borehole data will be done, using direct forecasting in a sequential fashion:

\begin{matrix} (11) & \begin{aligned} f (χ, ζ, κ, τ | d_{obs}) = \\ f (τ | χ_{posterior}, κ_{posterior}, ζ_{posterior}, d_{obs, τ}) \\ f (κ | χ_{posterior}, ζ_{posterior}, d_{obs, κ}) \\ f (ζ | χ_{posterior}, d_{obs, ζ}) f (χ | d_{obs, χ}) \end{aligned} . \end{matrix}

Following this equation, the joint uncertainty quantification is equivalent to a sequential uncertainty quantification, where the uncertainty quantification of each model component is conditioned on the borehole data and on the posterior models of the preceding components. Direct forecasting has not been applied within this framework of Eq. (11); hence this is one of the new contributions in this paper. In applying direct forecasting we will use the posterior realizations of χ and prior realizations of ζ to determine a conditional distribution f(ζ|χ_posterior); then we evaluate this using borehole observations d_obs,ζ of ζ.

To apply this framework to discrete variables such as lithology, we need a different method for dimension reduction than using PCA. PCA relies on a reconstruction by a linear combination of principal component vectors, which becomes challenging when the target variable is discrete. Figure 1 illustrates this problem: a discrete lithology model cannot be recovered from inverse PCA. To avoid this, a level set method of signed distance function (Osher and Fedkiw, 2003; Deutsch and Wilde, 2013) is employed to transform rock type models into a continuous scalar field of signed distances before applying PCA. Here, considering S discrete rock types in model ζ, for each sth ( $s = 1, 2, \dots, S$ ) rock type, the signed distance ψ_s(x) from location x to its closest boundary x_β can be computed as

\begin{matrix} (12) & ψ_{s} (x) = \begin{cases} + ∥x - x_{β}∥, & \text{if } ζ (x) = s \\ - ∥x - x_{β}∥, & \text{otherwise} \end{cases} \quad s = 1, 2, \dots, S . \end{matrix}

Figure 2 illustrates the concept of using a signed distance function to first transform a sedimentary lithology model to continuous signed distances for PCA. We observe that, with the signed distance as an intermediate transformation, the inverse PCA recovers the lithology model. In the case of multiple categories, we will have multiple signed distance functions.
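The signed distance transform of Eq. (12) and its back-transform can be sketched as follows. This is a brute-force NumPy version intended only for small grids: the distance to the boundary is approximated by the distance to the nearest cell on the other side of it, and rock types are coded 0 to S−1 rather than 1 to S. The function name `signed_distance` and the toy lithology layout are hypothetical. For large grids a dedicated distance transform routine (for example `scipy.ndimage.distance_transform_edt`) would be a more practical choice.

```python
import numpy as np

def signed_distance(zeta, s):
    """Signed distance of Eq. (12) for rock type s on a small 2-D grid.

    Assumes that type s and at least one other type are both present.
    """
    # Cell-center coordinates in the same (row-major) order as zeta.ravel()
    coords = np.argwhere(np.ones_like(zeta, dtype=bool)).astype(float)
    inside = (zeta == s).ravel()
    # Pairwise Euclidean distances between all cell centers
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    psi = np.empty(zeta.size)
    # Cells of type s: positive distance to the nearest cell of another type
    psi[inside] = dist[np.ix_(inside, ~inside)].min(axis=1)
    # All other cells: negative distance to the nearest cell of type s
    psi[~inside] = -dist[np.ix_(~inside, inside)].min(axis=1)
    return psi.reshape(zeta.shape)

# Toy lithology model with S = 3 rock types (hypothetical layout)
zeta = np.zeros((20, 20), dtype=int)
zeta[5:12, 3:15] = 1
zeta[14:18, 8:19] = 2

# One continuous signed distance field per rock type: shape (S, ny, nx)
psi = np.stack([signed_distance(zeta, s) for s in range(3)])

# Back-transform: each cell takes the rock type with the largest signed distance
zeta_recovered = psi.argmax(axis=0)
```

On the untouched fields the back-transform recovers the original model exactly, because only the field of the true rock type is positive in each cell. After a truncated PCA and its inverse the fields are only approximately reproduced, but the same largest-value rule still returns a discrete lithology model.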

Figure 1. PCA on a discrete lithology model: (a) the original lithology model; (b) scree plot of PCA on the lithology model; (c) the reconstructed model from inverse PCA using the preserved PCs (marked by the red dashed line on the scree plot).
