The representation of subsurface structures is an
essential aspect of a wide variety of geoscientific investigations
and applications, ranging from geofluid reservoir studies and raw
material investigations to geosequestration, as well as many
branches of geoscientific research and applications in
geological surveys. A wide range of methods exist to generate
geological models. However, the most powerful of these methods are
locked behind paywalls in expensive commercial packages. We present here a fully
open-source geomodeling method, based on an implicit potential-field
interpolation approach. The interpolation algorithm is comparable to
implementations in commercial packages and capable of constructing
complex, full 3-D geological models, including fault networks,
fault–surface interactions, unconformities and dome
structures. This algorithm is implemented in the programming
language Python, making use of a highly efficient underlying library
for code generation (

We commonly capture our knowledge about relevant geological features in the subsurface in
the form of geological models, as 3-D representations of the geometric structural
setting. Computer-aided geological modeling methods have existed for decades, and many
advanced and elaborate commercial packages exist to generate these models (e.g., GoCAD,
Petrel, GeoModeller). But even though these packages partly expose
the modeling functionality through APIs or scripting interfaces, it is a
significant disadvantage that the source code is not accessible, and the true
inner workings therefore remain unclear. More importantly still, the possibility of extending these
methods is limited – and, especially given the current rapid development of highly
efficient open-source libraries for machine learning and computational inference (e.g.,

However, there is to date no fully flexible open-source project that integrates
state-of-the-art geological modeling methods. Conventional 3-D construction tools (CAD,
e.g.,

Example of models generated using

With the aim of closing this gap, we present

Especially in this current time of rapid development of open-source scientific software packages and powerful machine-learning frameworks, we consider an open-source implementation of a geological modeling tool to be essential. We therefore aim to open up this possibility to a wide community by combining state-of-the-art implicit geological modeling techniques with additional sophisticated Python packages for scientific programming and data analysis in an open-source ecosystem. The aim is explicitly not to rival the existing commercial packages – with their well-designed graphical user interfaces, underlying databases and highly advanced workflows for specific tasks in subsurface engineering – but to provide an environment in which to enhance existing methodologies, as well as to give access to an advanced modeling algorithm for scientific experiments in the field of geomodeling.

In the following, we will present the implementation of our code in the form of core
modules, related to the task of geological modeling itself, and additional assets, which
provide the link to external libraries, e.g., to facilitate stochastic geomodeling and
the inversion of structural data. Each part is supported and supplemented with Jupyter
notebooks that are available as additional online material and
part of the package
documentation, which enable the direct testing of our methods (see
Sect.

In this section, we describe the core functionality of

After describing the simple functionality required to construct models, we go deeper into
the underlying architecture of

The potential-field method developed by

Let us break down what we actually mean by this: imagine that a geological setting is
formed by a perfect sequence of horizontal layers piled one above the other. If we knew
the exact timing of when one of these surfaces was deposited, we would know that any
layer above it must have been deposited afterwards, while any layer below must have been
deposited earlier. Obviously, we cannot have data for each of these infinitesimal
synchronal layers, but we can interpolate the “date” between them. In reality, the exact
year of the synchronal deposition is meaningless – as it is not possible to remotely
obtain accurate estimates. What has value for generating a 3-D geomodel is the location
of those synchronal layers and especially the lithological interfaces where the change of
physical properties is notable. Because of this, instead of interpolating

The advantages of using a global interpolator instead of interpolating each layer of interest independently are twofold: (i) the location of one layer affects the location of others in the same depositional environment, making it impossible for two layers in the same potential field to cross; and (ii) it enables the use of data between the interfaces of interest, opening the range of possible measurements that can be used in the interpolation.

Example of scalar field. The input data are formed by six points distributed in
two layers (

The interpolation function is obtained as a weighted interpolation based on universal
cokriging

So far we have shown what we want to obtain and how universal cokriging is a suitable
interpolation method to get there. In the following, we will describe the concrete steps
from taking our input data to the final interpolation function

Note that in this context the scalar field property

Considering Eq. (

The algebraic dependency between

As we can see in Eq. (

Furthermore, since the choice of covariance parameters is ad hoc
(Appendix

In most scenarios the goal of structural modeling is to define the spatial distribution of geological structures, such as layers, interfaces and faults. In practice, this segmentation is usually done either by using a volumetric discretization or by representing the interfaces as surfaces.
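The volumetric discretization can be illustrated with a minimal sketch (all values are hypothetical, and this uses plain NumPy rather than the actual GemPy implementation): each grid cell receives the ID of the lithological unit delimited by the iso-values of the scalar field at the interfaces of interest.

```python
import numpy as np

# Hypothetical scalar field evaluated on a small 1-D profile of cells
# (values are illustrative only and increase with "depth").
scalar_field = np.array([0.2, 0.9, 1.7, 2.4, 3.1, 3.8])

# Iso-values of the scalar field at the two interfaces of interest; in a
# real model these come from the interpolation at the interface points.
iso_values = np.array([1.0, 3.0])

# Volumetric discretization: every cell is assigned the ID of the unit it
# falls into, delimited by the interface iso-values.
lithology_ids = np.digitize(scalar_field, iso_values)

print(lithology_ids)  # three units: 0 (above), 1 (between), 2 (below)
```

The same iso-values also define the second, surface-based segmentation: the interfaces are exactly the isosurfaces of the scalar field at these values.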

The result of the kriging interpolation is the random function

At the time of this manuscript's preparation,

The second alternative segmentation consists of locating the layer isosurfaces.

Code to generate a single scalar field model
(as seen in Fig.

Example of different lithological units and their relation to scalar fields;

In reality, most geological settings are formed by a concatenation of depositional phases partitioned by unconformity boundaries and subjected to tectonic stresses that displace and deform the layers. While the interpolation is able to represent realistic folding – given enough data – the method fails to describe discontinuities. To overcome this limitation, it is possible to combine several scalar fields to recreate the desired result.

So far the implemented discontinuities in

Modeling unconformities is rather straightforward. Once we have grouped the layers into their respective series, younger series will overlay older ones beyond the unconformity. The scalar fields themselves, computed for each of these series, could be seen as a continuous depositional sequence in the absence of an unconformity.
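The principle can be sketched as follows (hypothetical values, plain NumPy rather than the actual implementation): each series is segmented by its own scalar field, and the younger series truncates the older one wherever the younger field lies above the iso-value of its basal (unconformity) surface.

```python
import numpy as np

# Two hypothetical scalar fields evaluated on the same profile of cells,
# one per depositional series (all values are purely illustrative).
older_field   = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5])
younger_field = np.array([2.0, 1.6, 1.2, 0.8, 0.4, 0.0])

# Segmentation of the older series into its own units (IDs 1 and 2).
older_ids = np.digitize(older_field, [3.0]) + 1

# Iso-value of the basal surface (the unconformity) of the younger series.
basal_iso = 1.0

# Wherever the younger scalar field lies above its basal iso-value, the
# younger unit (ID 0) truncates the older series; elsewhere the older
# segmentation is kept.
lithology = np.where(younger_field >= basal_iso, 0, older_ids)

print(lithology)
```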

Extension of the code in Listing 1 to generate an
unconformity by using two scalar fields. The corresponding
model is shown in Fig.

Faults are modeled by the inclusion of an extra drift term into the kriging system
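The geometric effect of such a drift term can be sketched independently of the kriging formulation (a toy construction, not GemPy's actual fault implementation): adding a step function that is constant on each fault block offsets the iso-lines of the scalar field while keeping the field continuous within each block.

```python
import numpy as np

# Regular grid of a 2-D cross section (x horizontal, z depth), both 0..10.
x, z = np.meshgrid(np.linspace(0, 10, 11), np.linspace(0, 10, 11))

# Without faults, assume perfectly flat layering: the scalar field is
# simply the depth (an illustrative simplification).
base_field = z

# Step-like drift term, constant on each fault block. In the kriging
# system this enters as an extra drift function; here we only add its
# numerical effect to show the resulting geometry.
fault_throw = 3.0
drift = np.where(x > 5.0, fault_throw, 0.0)

faulted_field = base_field + drift

# The iso-lines (future layer interfaces) of the faulted field are offset
# by the throw across x = 5, while within each block the field is unchanged.
```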

Code to generate a model with an unconformity and a fault
using the three scalar fields model (as seen in Fig.

The computation of the segmentation of fault compartments (called

An important detail to consider is that drift functions will bend the isosurfaces
according to the given rules, but they will preserve their continuity. This differs from
the intuitive idea of an offset, where the interface exhibits a sharp jump. This fact has a
direct impact on the geometry of the final model and can, for example, affect certain
meshing algorithms. Furthermore, in the ideal case of choosing the perfect drift
function, the isosurface would bend exactly along the faulting plane. In the current
state,

The architecture of

Graph of the logical structure of

Regarding data structure, we make use of the Python package

It is important to keep in mind that, in this structure, once data enters the
symbolic part of the graph, only algebraic operations are allowed. This limits the use of many
high-level coding structures (e.g., dictionaries or loops of undetermined length) and external
dependencies. As a result, the data must be prepared exhaustively before
starting the computation. This includes ordering the data within the arrays, passing
the exact lengths of the subsets needed later during the interpolation, and
calculating many necessary constant parameters. The preprocessing of data is done
within the sub-classes of

The rest of the package is formed by an ever-growing series of modules that perform
different tasks using the geological model as input (see
Sect.

Efficiently solving a large number of algebraic equations, and especially their
derivatives, can easily become unmanageable in terms of both time and memory. Up to this
point we have referenced

Within the Python implementation,

The symbolic graph is later analyzed to perform the optimization, the symbolic differentiation and the compilation to a language faster than Python (C or CUDA). This process is computationally demanding and should therefore be performed as rarely as possible.
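The principle of separating one expensive compilation step from many cheap evaluations can be illustrated with a deliberately simplified pure-Python analogy (using the built-in compile; the actual graph library performs far more, including optimization and differentiation):

```python
import numpy as np

# "Symbolic" description of a computation, here just an expression string
# with hypothetical variable names.
expression = "weights @ covariances + drift"

# Expensive step, done once: compilation into a faster representation
# (Python bytecode in this analogy).
compiled = compile(expression, "<graph>", "eval")

def evaluate(weights, covariances, drift):
    # Cheap step, done many times: evaluation of the compiled expression
    # with new numerical inputs, without re-analyzing the expression.
    return eval(compiled, {"weights": weights,
                           "covariances": covariances,
                           "drift": drift})

result = evaluate(np.ones(3), np.eye(3), 0.5)
```

The design lesson is the same as for the real graph: keep the model definition stable so that recompilation is triggered only when the structure of the computation changes, not when the numerical inputs do.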

Among the most outstanding optimizers included with

However, although

Many of the most advanced algorithms in computer science rely on an inverse framework,
i.e., the result of a forward computation,

The alternative is to create the symbolic differentiation of
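The idea behind propagating derivatives alongside values can be illustrated with a minimal forward-mode automatic differentiation sketch based on dual numbers – a toy construction for illustration only; the library underlying GemPy instead differentiates its symbolic graph:

```python
class Dual:
    """Minimal forward-mode automatic differentiation via dual numbers."""

    def __init__(self, value, deriv=0.0):
        self.value = value   # function value
        self.deriv = deriv   # derivative carried alongside the value

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule, applied automatically at every operation.
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__


def f(x):
    return x * x + 3 * x   # f'(x) = 2x + 3

x = Dual(2.0, 1.0)         # seed the derivative dx/dx = 1
y = f(x)
print(y.value, y.deriv)    # 10.0 and 7.0
```

The derivative is exact to machine precision, in contrast to finite differences, which is precisely what makes gradient-based inference methods practical.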

In this second half of the paper we will explore different features that complement and
expand the construction of the geological model itself. These extensions are just some
examples of how

The segmentation of meaningful units is the central task of geological modeling. It is often a prerequisite for engineering projects or process simulations. An intuitive 3-D visualization of a geological model is therefore a fundamental requirement.

In-built

For its data and model visualization,

On top of these features,

For additional high-quality visualization, we can generate vtk files using

For sharing models,

In short,

In recent years gravity measurements have increased in quality

Forward gravity response overlaid on top of a 3-D lithology block
sliced on the

As an example, we show here the forward gravity response of the geological model in
Fig.

Computing forward gravity of a

The computation of forward gravity is a required step towards a fully coupled gravity
inversion. Embedding this step into a Bayesian inference allows us to condition the
initial data used to create the model to the final gravity response. This idea will be
further developed in Sect.
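The underlying principle – superposition of the vertical attraction of all density voxels – can be sketched as follows (axis and sign conventions, as well as all numerical values, are assumptions for illustration and do not reproduce the optimized implementation):

```python
import numpy as np

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2

def forward_gravity_z(device_xyz, voxel_centers, densities, voxel_volume):
    """Vertical gravity response of a discretized density model at one point.

    Superposition of point-mass contributions, one per voxel:
        g_z = G * sum_i rho_i * V * dz_i / r_i^3
    with z taken positive downwards in this sketch.
    """
    d = voxel_centers - device_xyz        # vectors device -> voxel centers
    r = np.linalg.norm(d, axis=1)         # distances
    return G * np.sum(densities * voxel_volume * d[:, 2] / r**3)

# A single deep voxel far from the device behaves like a point mass.
centers = np.array([[0.0, 0.0, 1000.0]])  # 1 km below the device
rho = np.array([2670.0])                  # kg m^-3, a typical crustal density
gz = forward_gravity_z(np.zeros(3), centers, rho, voxel_volume=10.0**3)
```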

The concept of topology provides a useful tool to describe adjacency relations in
geomodels, such as stratigraphic contacts or across-fault connectivity

Section of the example geomodel with overlaid topology graph. The geomodel contains eight unique regions (graph nodes) and 13 unique connections (graph edges). White edges represent stratigraphic and unconformity connections, while black edges correspond to across-fault connections.

To analyze the model topology,

Topology analysis of a GemPy geomodel.
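The extraction of adjacency relations from a segmented model can be sketched with a simple neighbor comparison (a hypothetical label array and a simplified algorithm, not the actual topology asset): two regions are connected – i.e., share a graph edge – if any two neighboring cells carry their IDs.

```python
import numpy as np

# Hypothetical segmented cross section: integer IDs of geobodies per cell.
geomodel = np.array([[1, 1, 2, 2],
                     [1, 3, 3, 2],
                     [3, 3, 4, 4]])

def adjacency_edges(labels):
    """Unique pairs of region IDs sharing a cell face (the graph edges)."""
    edges = set()
    # Compare each cell with its right neighbor ...
    for a, b in zip(labels[:, :-1].ravel(), labels[:, 1:].ravel()):
        if a != b:
            edges.add((min(a, b), max(a, b)))
    # ... and with its lower neighbor.
    for a, b in zip(labels[:-1, :].ravel(), labels[1:, :].ravel()):
        if a != b:
            edges.add((min(a, b), max(a, b)))
    return edges

print(sorted(adjacency_edges(geomodel)))
```

The unique region IDs form the graph nodes; further attributes (e.g., whether an edge crosses a fault) would require additional bookkeeping not shown here.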

Raw geological data are noisy and measurements are usually sparse. As a result,
geological models contain significant uncertainties

The answers to these questions are still actively debated in research and are highly
dependent on the type of mathematical and computational framework chosen. Uncertainty
quantification and its logical extension into probabilistic machine learning will not be
covered in depth in this paper due to the broad scope of the subject. However, the
main goal of

As we have seen so far, the cokriging algorithm enables the construction of geological
models for a wide range of geometric and topological settings with a limited number of
parameters (Fig.

geometric parameters: interface points

geophysical parameters, e.g., density;

model parameters, e.g., covariance at distance zero

Therefore, an implicit geological model is simply a graphical representation of a
deterministic mathematical operation on these parameters, and as such any of these
parameters can act as a latent variable. From a probabilistic point of view

On the other hand, the latest version,

In this context, the purpose of

For this paper we will use

An essential aspect of probabilistic programming is the inherent capability
to quantify uncertainty. Monte Carlo error propagation

In this paper, for example Fig.

The first step in the creation of a PGM is to define the parameters that are supposed to
be stochastic and the probability functions that describe them. To do so,

Probabilistic model construction and inference using

The suite of possible realizations of the geological model are stored, as traces, in a database of choice (HDF5, SQL or Python pickles) for further analysis and visualization.

In 2-D we can display all possible locations of the interfaces on a cross section at the
center of the model (see Fig.
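The essence of this Monte Carlo error propagation can be sketched with a one-dimensional toy model (all distributions and values are hypothetical): the depth of a single interface is perturbed, the model is rebuilt for every sample, and the per-cell frequency of each unit summarizes the resulting uncertainty.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

depths = np.arange(0.5, 10.0, 1.0)   # cell centers along a vertical profile

# Prior on the interface depth: normally distributed around 5 (sigma 0.8).
samples = rng.normal(loc=5.0, scale=0.8, size=1000)

# Forward model per sample: unit 0 above the interface, unit 1 below.
realizations = (depths[None, :] >= samples[:, None]).astype(int)

# Per-cell probability of lying in unit 1 across all realizations -- the
# kind of summary that can be displayed on a cross section to visualize
# model uncertainty.
p_unit1 = realizations.mean(axis=0)
```

Cells far from the interface are almost certain, while cells near the mean interface depth carry probabilities close to 0.5, which is exactly where the uncertainty visualization shows the widest spread.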

Probabilistic graphical model generated with

Although computing the forward gravity has its own value for many applications, the main
aim of

As we have shown above, topological graphs can represent the connectivity among the
segmented areas of a geological model. As expected, stochastic perturbations of the
input data can rapidly alter the configuration of these graphs. In order to preserve
a given topological configuration partially or totally, we can construct specific
likelihood functions. To exemplify the use of a topological likelihood function, we will
use the topology computed in Sect.

The first challenge is to find a metric that captures the similarity of two graphs. As a
graph is nothing but a set of nodes and their edges, we can compare the intersection and
union of two different sets using the Jaccard index
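A sketch of this comparison on two hypothetical edge sets (the actual likelihood may additionally take nodes into account):

```python
def jaccard_index(edges_a, edges_b):
    """Similarity of two topology graphs compared via their edge sets:
    |A intersect B| / |A union B|, from 0 (disjoint) to 1 (identical)."""
    a, b = set(edges_a), set(edges_b)
    return len(a & b) / len(a | b)

# Edges of a reference topology graph and of a perturbed realization.
reference = {(1, 2), (2, 3), (3, 4), (2, 4)}
perturbed = {(1, 2), (2, 3), (3, 4), (1, 3)}

similarity = jaccard_index(reference, perturbed)  # 3 shared / 5 total = 0.6
```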

Probabilistic model construction and inference using

Gravity likelihoods aim to exploit the spatial distribution of density, which can be
related to different lithotypes

Probabilistic programming results on a cross section at the middle of the model
(

Defining the topology potential and gravity likelihood on the same Bayesian network
creates a joint likelihood value that will define the posterior space. To sample from the
posterior we use adaptive Metropolis (
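For illustration, the following is a minimal non-adaptive random-walk Metropolis sampler on a hypothetical one-dimensional posterior – a simplification of the adaptive variant actually used for the inference:

```python
import math
import random

def log_posterior(theta):
    # Hypothetical 1-D posterior: standard normal (up to a constant).
    return -0.5 * theta ** 2

def metropolis(n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: propose a local move, accept with
    probability min(1, posterior ratio), otherwise keep the current state."""
    rng = random.Random(seed)
    theta = 0.0
    trace = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        if math.log(rng.random()) < log_posterior(proposal) - log_posterior(theta):
            theta = proposal
        trace.append(theta)
    return trace

trace = metropolis(5000)
```

The adaptive variant additionally tunes the proposal scale from the history of the chain, which typically improves acceptance rates in higher dimensions.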

As a result of applying likelihood functions, we can appreciate a clear change in the
posterior (i.e., the possible outcomes) of the inference. A closer look shows two main
zones of influence, each of them related to one of the likelihood functions. On the one
hand, we observe a reduction of uncertainty along the fault plane due to the restrictions
that the topology function imposes by conditioning the models to high Jaccard values. On
the other hand, what in the first example – i.e., Monte Carlo error propagation, left in
Fig.

We have introduced

Until now, implicit geological modeling was limited to proprietary software suites –
for the petroleum industry (GoCad, Petrel, JewelSuite) or the mining sector (MicroMine,
MIRA Geoscience, GeoModeller, Leapfrog) – with an important focus on industry needs and
user experience (e.g., graphical user interfaces or data compatibility). Despite access
to the APIs of many of these software packages, their lack of transparency and the
inability to fully manipulate the underlying algorithms represent a serious obstacle to
conducting appropriate

Implicit methods rely on interpolation functions to automate some or all of the
construction steps. Different mathematical approaches have been developed and improved in
recent years to tackle many of the challenges that particular geological settings pose

Using

Another important feature of

Up to now, structural geological models have significantly relied on the best
deterministic and explicit realization that an expert is able to construct using often
noisy and sparse data. Research into the interpretation uncertainty of geological data
sets (e.g., seismic data) has recognized the significant impact of interpreter education
and bias on the extracted input data for geological models

In the transition to a world dominated by data and optimization algorithms – e.g., deep
neural networks or big data analytics – there are many attempts to apply these advances
to geological modeling

We propose a more general approach. By embedding the geological model construction into a
probabilistic machine-learning framework

Despite the convincing mathematical formulation of Bayesian inferences, there are caveats
to be dealt with for practical applications. As mentioned in Sect.

Nevertheless, performing AD does not come free of cost. The required code structures
limit the use of libraries that do not perform AD themselves, which in essence imposes a
requirement to rewrite most of the mathematical algorithms involved in the Bayesian
network. Under these circumstances, we have rewritten the potential field method

Currently

In conclusion,

GemPy is a free, open-source
Python library licensed under the GNU Lesser General Public License v3.0
(LGPLv3). It is hosted on the GitHub repository

Installing GemPy can be done in two ways: (i) by cloning the GitHub repository
with

We provide Jupyter notebooks as part of the online documentation. These notebooks can be
executed in a local Python environment (if the required dependencies are correctly
installed; see above). Static versions of the notebooks can currently be inspected
directly on the GitHub repository web page or through nbviewer. Furthermore, it is
possible to run interactive notebooks through binder (provided
through

The

If all tests are successful, you are ready to continue.

The following equations have been derived from the work in

The gradient covariance matrix,

2-D representation of the decomposition of the orientation vectors into
Cartesian axes. Each Cartesian axis represents a variable of a sub-cokriging system. The
dashed green line represents the covariance distance,

As in our case the directional derivatives used are the three Cartesian directions, we
can rewrite the gradient covariance,

Notice, however, that covariance functions are by definition described in a polar
coordinate system, and it will therefore be necessary to apply the chain rule for

Distances

In a practical sense, keeping the value of the scalar field at every interface unfixed
forces us to consider the covariance between the points within an interface as well as
the covariance between different layers following equation

Combining Eqs. (

In a cokriging system, the relation between the interpolated parameters is given by a
cross-covariance function. As we saw above, the gradient covariance is subdivided into
covariances with respect to the three Cartesian directions (Eq.

Distances

As the interfaces are relative to a reference point per layer

As the mean value of the scalar field is always unknown, it needs to be
estimated from the data itself. The simplest approach is to consider the mean constant
over the whole domain, i.e., ordinary kriging. However, in the

In normal kriging the right-hand term of the kriging system (Eq.

Since, in this case, the parameters of the variogram functions are chosen arbitrarily,
the kriging variance does not hold any physical information about the domain. As we are
interested only in the mean value, we can therefore solve the kriging system in its dual
form

The choice of the covariance function governs the shape of the isosurfaces of the scalar field. As opposed to other uses of kriging, here the choice cannot be based on empirical measurements. The choice of the covariance function is therefore somewhat arbitrary, aiming to mimic coherent geological structures as closely as possible.

Representation of a cubic variogram and covariance for an arbitrary range and nugget effect.

The main requirement when choosing a covariance function is that it has to be twice
differentiable,

The most widely used function in the potential-field method is the cubic
covariance, due to its mathematical robustness and its coherent geological description of space.
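Following the form commonly given in the potential-field literature, the cubic covariance can be written and evaluated as below (the parameterization shown is an assumption of this sketch):

```python
import numpy as np

def cubic_covariance(r, a, c0):
    """Cubic covariance: twice differentiable and exactly 0 at the range a.

    C(r) = c0 * (1 - 7(r/a)^2 + 35/4 (r/a)^3 - 7/2 (r/a)^5 + 3/4 (r/a)^7)
    for r <= a, and 0 beyond the range.
    """
    h = np.minimum(np.asarray(r, dtype=float) / a, 1.0)
    return c0 * (1 - 7 * h**2 + 35 / 4 * h**3 - 7 / 2 * h**5 + 3 / 4 * h**7)

# Evaluate on a profile reaching past the range to show the flat tail.
r = np.linspace(0.0, 12.0, 100)
cov = cubic_covariance(r, a=10.0, c0=1.0)
```

At r = 0 the function equals the sill c0, it decays monotonically to exactly 0 at the range, and both first and second derivatives vanish there, which keeps the interpolated scalar field smooth.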

Estimating convergence of MCMC simulations. Here we show selected plots to evaluate
convergence of the MCMC process. In Fig.

Geweke values of the parameters belonging to the inference of all likelihoods.
Every point represents the mean of separated intervals of the chain. If interval

Traces of the parameters belonging to the inference with all likelihoods. The asymptotic behavior suggests ergodicity. The first 1000 iterations were used as burn-in and are not displayed here.
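The Geweke diagnostic compares the means of an early and a late segment of the chain through a z-score; scores within roughly ±2 give no evidence against convergence. A minimal sketch (ignoring autocorrelation corrections, with hypothetical chains):

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """z-score comparing the mean of an early chain segment with a late one.

    This simplified version neglects autocorrelation when estimating the
    variances of the segment means.
    """
    chain = np.asarray(chain, dtype=float)
    a = chain[: int(first * len(chain))]
    b = chain[-int(last * len(chain)):]
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a)
                                           + b.var(ddof=1) / len(b))

# A stationary chain (white noise) scores close to zero ...
rng = np.random.default_rng(seed=1)
z_stationary = geweke_z(rng.normal(size=2000))

# ... while a drifting (non-converged) chain scores far outside +/-2.
z_drifting = geweke_z(np.linspace(0.0, 10.0, 2000))
```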

Throughout the paper we have mentioned and shown Blender visualizations
(Fig.

Extra functionality needed to create GemPy models in Blender.

MdlV and FW contributed to project conceptualization and method development. MdlV wrote and maintained the code with the help of AS (topology asset and the initial vtk-based visualization). MdlV prepared the manuscript with contributions of both co-authors in reviewing and editing. AS was involved in visualizing some results and writing some chapters (topology and visualization). FW provided overall project supervision and funding.

The authors declare that they have no conflict of interest.

The authors would like to acknowledge all of the people – all around the world – who have contributed to the final state of the library, either by stimulating mathematical discussions or by finding bugs. This project would not have been possible without their invaluable help.

Edited by: Lutz Gross

Reviewed by: Sally Cripps and one anonymous referee