AutoQS v1: automatic parametrization of QuickSampling based on training images analysis

Gravey, Mathieu; Mariethoz, Grégoire

doi:https://doi.org/10.5194/gmd-16-5265-2023

Articles | Volume 16, issue 17

https://doi.org/10.5194/gmd-16-5265-2023

Articles | Volume 16, issue 17

Model description paper

14 Sep 2023

Model description paper |

| 14 Sep 2023

AutoQS v1: automatic parametrization of QuickSampling based on training images analysis

Mathieu Gravey and Grégoire Mariethoz

Abstract

Multiple-point geostatistics are widely used to simulate complex spatial structures based on a training image. The practical applicability of these methods relies on the possibility of finding optimal training images and parametrization of the simulation algorithms. While methods for automatically selecting training images are available, parametrization can be cumbersome. Here, we propose to find an optimal set of parameters using only the training image as input. The difference between this and previous work that used parametrization optimization is that it does not require the definition of an objective function. Our approach is based on the analysis of the errors that occur when filling artificially constructed patterns that have been borrowed from the training image. Its main advantage is to eliminate the risk of overfitting an objective function, which may result in variance underestimation or in verbatim copy of the training image. Since it is not based on optimization, our approach finds a set of acceptable parameters in a predictable manner by using the knowledge and understanding of how the simulation algorithms work. The technique is explored in the context of the recently developed QuickSampling algorithm, but it can be easily adapted to other pixel-based multiple-point statistics algorithms using pattern matching, such as direct sampling or single normal equation simulation (SNESIM).

Download & links

Article (PDF, 9113 KB)

Download & links

How to cite.

Received: 15 Sep 2022 – Discussion started: 17 Oct 2022 – Revised: 02 May 2023 – Accepted: 26 Jun 2023 – Published: 14 Sep 2023

Highlights

Adaptive calibration as a function of the simulation progression
Calibration depends on each training image
Robust parametrization based on a rapid prior analysis of the training image

1 Introduction

Geostatistics is extensively used in natural sciences to map spatial variables such as surface properties (e.g., soils, geomorphology, meteorology) and subsurface geological features (e.g., porosity, hydraulic conductivity, 3D geological facies). Its main applications involve the estimation and simulation of natural phenomena. In this paper, we focus on simulation approaches.

Traditional two-point geostatistical simulations preserve the histogram and variogram inferred from point data (Matheron, 1973). However, inherent limitations make the reproduction of complex structures difficult (Gómez-Hernández and Wen, 1998; Journel and Zhang, 2006). Multiple-point statistics (MPS), by accounting for more complex relations, enables the reproduction of such complex structures (Guardiano and Srivastava, 1993) but comes with its own limitations (Mariethoz and Caers, 2014). The main requirements for using MPS algorithms are (1) analog images (called training images) and (2) appropriate parametrization, while training images can often be provided by expert knowledge, and several methods have been proposed to automatically select one or a subset of appropriate training images among a set of candidates (Pérez et al., 2014; Abdollahifard et al., 2019). However, the parametrization of an MPS algorithm depends not only on the chosen training image but also on the specifics of the algorithm. This makes the task of finding a good parametrization cumbersome, and therefore users often have to resort to trial-and-error approaches (Meerschman et al., 2013). Here we will mainly focus on QuickSampling (QS) (Gravey and Mariethoz, 2020), which has two main parameters: n, which defines the maximum number of conditional data points to consider during the search process, and k, which is the number of best candidates from which to sample the simulated value. Additionally, QS supports a kernel that allows weighting each conditioning pixel in the pattern based on its position related to the simulated pixel. Direct sampling (DS) has the following parameters: n, which has an identical role as in QS; “th”, which represents the pattern acceptance threshold, or the degree of similarity between local data patterns and the training image; and f, which is the maximum proportion of the image that can be explored for each simulated pixel. In summary, n controls the spatial continuity, and k or “th” and f control the variability.

Over the last few years, several studies have addressed the challenge of automatically finding appropriate parameters for MPS simulation. These can be categorized in two approaches. The first approach is to assume that an optimal parametrization is related to the simulation grid (including possible conditioning data), the training image, and the MPS algorithm. In this vein, Dagasan et al. (2018) proposed a method that uses the known hard data from the simulation grid as a reference for computing the Jensen–Shannon divergence between histograms. Following this, they employ a simulated annealing optimization to update the MPS parameters until the metrics achieve the lowest divergence. This method is flexible enough to be adapted to any other metric. The second type of approach assumes that the parametrization is only related to the training image and the MPS algorithm. Along these lines, Baninajar et al. (2019) propose the MPS automatic parameter optimizer (MPS-APO) method based on the cross-validation of the training image (TI) to optimize simulation quality and CPU cost. In this approach, artificially generated gaps in the high-gradient areas of the training image are created, and a MPS algorithm is used to fill those gaps. The performance of a particular parametrization is quantified by assessing the correspondence between the filled and original training data. By design, this approach is extremely interesting for gap-filling problems. The authors state that it can be used for the parametrization of unconditional simulations; however, the use of limited gaps cannot guarantee the reproduction of long-range dependencies. Furthermore, due to the design of the framework for generating gaps, only MPS algorithms able to handle gap-filling problems can be used.

While both approaches yield good results based on their objective functions, they all rely on a stochastic optimization process; therefore, the duration of the optimization process cannot be predetermined or controlled by the user. Furthermore, an objective function is needed, which can be difficult because it depends on the training image used: many metrics can be accounted for in the objective function, such as histogram, variogram, pattern histogram, connectivity function, and Euler characteristic, among others (Boisvert et al., 2010; Renard and Allard, 2013; Tan et al., 2013), or a weighted combination of these. Similarly, one has to define meta-parameters linked to the optimization algorithm itself, such as the cooling rate in simulated annealing or maximum number of iterations. As a result, MPS parameter optimization approaches tend to be complex and difficult to use.

In this contribution, we propose a simplified optimization procedure for simulating complex systems. Rather than using a complex optimization algorithm, our approach focuses on finding optimal parameters to accurately simulate a single pixel in the system. The underlying principle of our approach is that if each pixel is accurately simulated, the resulting sequence of pixels will converge to an accurate representation of the real-world system being simulated. The goal is therefore to find the optimal parameters to simulate a single pixel using the training image as the only reference. Baninajar et al. (2019) showed that computing the prediction error (i.e., the error between the simulation and the reference) is an appropriate metric to identify optimal parameters. To find the optimal parameters for simulating a single pixel, we propose an exhaustive exploration of the parameter space and a computation of the prediction error between the simulation and the reference image.

The remainder of this paper is structured as follows. Section 2 presents the proposed method. Section 3 evaluates the approach in terms of quantitative and qualitative metrics. Finally, Sect. 4 discusses the strengths and weaknesses of the proposed approach and presents the conclusions of this work.

2 Understanding and addressing verbatim copy in multiple-point simulation

The principle underlying multiple-point simulation is that the neighborhood of a given pixel x (the pattern generated by known or previously simulated pixels) is informative enough to constrain the probability density function of the value Z(x). This requires a training image with several pattern repetitions. The extended normal equation simulation (ENESIM) algorithm (Guardiano and Srivastava, 1993) computes the full probability distribution for each simulated pixel. To ensure that enough samples are used, the SNESIM (Strebelle, 2002) and the Impala (Straubhaar et al., 2011) algorithms include a parameter to define a minimum number of pattern replicates. Direct sampling (DS) (Mariethoz et al., 2010) adopts a different strategy by allowing for the interrupted exploration of the training image. It includes a distance threshold parameter that defines what is an acceptable match for a neighborhood; however, too small a threshold typically results in a single acceptable pattern in the training image, leading to exact replication of parts of the training image – a phenomenon known as verbatim copy. To reduce this issue, a parameter f is introduced, controlling the fraction of the explored training image. QuickSampling (QS) (Gravey and Mariethoz, 2020) also suffers from verbatim copy when the number of candidate patterns is set to k=1; the authors recommend the use of k>1 and highlight that k is similar to the number of replicates in SNESIM or IMPALA. A value k=1.5 in QS can be seen as SNESIM with a minimum number of replicates of 1 for 50 % of the simulated values and 2 for the remaining values.

The definition of verbatim copy is the unintended pasting of a large section from the training image to the simulation (patch-based approaches do so intentionally, e.g., Rezaee et al., 2013). This means that the relative position of the simulated values is the same as that in the training image. This occurs when the neighborhood constraints on the simulated pixels are too strong, and only the exact same patterns as those in the training image are acceptable. To detect this issue, a common strategy is to create a position map (similar to the index map), which represents the provenance of simulated values by mapping their original coordinates in the training image, as shown in Fig. 1.

Figure 1 illustrates the most common forms of verbatim copy. The pure verbatim (the most common type of verbatim copy) is a simple copy of a large part of the image, with all pixels in the same order inside of the patches. Block verbatim typically appears when there are many replicates of a very specific type of pattern in the training image and few replicates of all other patterns. Consequently, the MPS algorithm uses common patterns for transitioning between copied blocks resulting from rare patterns. Structural verbatim occurs when the copied portion spreads throughout the simulation without giving a direct impression of copying (e.g., pure verbatim over a subset of pixels). Structural verbatim tends to appear when large-scale structures are unique in the training image, which often allows a visually satisfying image to be quickly obtained, but with large non-stationary features identical to the training image. Often, users are willing to allow verbatim on large-scale structures, but this can easily introduce bias between simulations. This is one of the hardest types of verbatim to detect. Typically, this can occur when the maximum neighborhood radius is too large, leading to the duplication of large structures in the initial phase of the simulation. Finally, no verbatim, which is the expected result of simulations, occurs when the position of pixels does not have any particular structure (i.e., their position is unpredictable).

https://gmd.copernicus.org/articles/16/5265/2023/gmd-16-5265-2023-f01

Figure 1Visualization of verbatim copies using a position map. This is an extreme case that highlights that verbatim is not defined by the values simulated but by their position in the training image.

AutoQS v1: automatic parametrization of QuickSampling based on training images analysis

5.1 Optimization of two parameters

5.2 Optimization of three parameters

5.3 Sequential simulation using automatic calibration

A1 Stone

A2 Strebelle

A3 Lena river delta