This paper proposes a new method that combines checkpointing methods with error-controlled lossy compression for large-scale high-performance full-waveform inversion (FWI), an inverse problem commonly used in geophysical exploration. This combination can significantly reduce data movement, allowing a reduction in run time as well as peak memory.

In the exascale computing era, data movement (e.g., across memory buses, PCIe links to GPUs, or the network) is often the performance bottleneck, rather than the peak FLOPS of the processing unit.

Like many other adjoint-based optimization problems, FWI is costly in terms of floating-point operations, has a large memory footprint during backpropagation, and incurs significant data transfer overheads. Past work on adjoint methods has developed checkpointing schemes that reduce the peak memory requirements during backpropagation at the cost of additional floating-point computations.

Combining this traditional checkpointing with error-controlled lossy compression, we explore the three-way tradeoff between memory, precision, and time to solution. We investigate how approximation errors introduced by lossy compression of the forward solution impact the objective function gradient and final inverted solution. Empirical results from these numerical experiments indicate that high lossy-compression rates (compression factors ranging up to 100) have a relatively minor impact on convergence rates and the quality of the final solution.

Full-waveform inversion (FWI) is an adjoint-based optimization problem used in seismic imaging to infer the Earth's subsurface structure and physical parameters

The FWI algorithm is explained in more detail in Sect.

Estimated computational requirements of a full waveform inversion problem based on the SEAM model

FWI is similar to other inverse problems like brain-imaging

An illustration of the approach presented in this paper: checkpoints are passed through an error-controlled lossy compressor, combining lossy compression with the checkpoint–recompute strategy.

Many algorithmic optimizations and approximations are commonly applied to reduce the computational load from the numbers calculated in Table

A commonly used method to deal with this large memory footprint is domain decomposition over the message-passing interface (MPI), where the computational domain is split into subdomains across multiple compute nodes to use their combined memory. The efficacy of this method depends on the ability to hide the MPI communication overhead behind the computations within each subdomain. For effective communication–computation overlap, the subdomains must be big enough that the computations within a subdomain take at least as long as the MPI communication. This places a lower bound on subdomain size (and hence on peak memory consumption per MPI rank) that is a function of the network interconnect – this lower bound might be too large for slow interconnects, e.g., on cloud systems.
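The overlap condition above can be made concrete with a toy performance model. All machine parameters below (stencil cost, node FLOP rate, halo width, interconnect bandwidth) are illustrative assumptions, not measurements from this work: the smallest viable subdomain is the one for which stencil compute time first matches halo-exchange time.

```python
# Toy model (not from the paper) estimating the minimum cubic subdomain edge
# for which halo-exchange time can be hidden behind stencil computation.
# All machine parameters are illustrative assumptions.

def min_subdomain_edge(flops_per_cell=360.0,   # stencil work per cell (assumed)
                       flop_rate=1e12,         # node compute rate, FLOP/s (assumed)
                       halo_width=4,           # grid points per halo layer (assumed)
                       bytes_per_value=4,      # single precision
                       bandwidth=10e9):        # interconnect bandwidth, B/s (assumed)
    """Smallest edge n of a cubic subdomain such that compute time >= halo time."""
    n = 1
    while True:
        t_comp = n**3 * flops_per_cell / flop_rate
        halo_bytes = 6 * n**2 * halo_width * bytes_per_value  # six faces
        t_comm = halo_bytes / bandwidth
        if t_comp >= t_comm:
            return n
        n += 1

# A slower interconnect pushes the minimum subdomain (and per-rank memory) up:
fast = min_subdomain_edge(bandwidth=100e9)
slow = min_subdomain_edge(bandwidth=1e9)
assert slow > fast
```

Because compute time grows as n³ while halo traffic grows as n², there is always some edge length beyond which communication can be hidden; the model simply finds the crossover point for a given bandwidth.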

Some iterative frequency domain methods, e.g.,

Hybrid methods that combine time domain methods, as well as frequency domain methods, have also been tried

In the following subsections, we discuss three techniques that are commonly used to alleviate this memory pressure – namely numerical approximations, checkpointing, and data compression. The common element in these techniques is that all three solve the problem of high memory requirements by increasing the operational intensity of the computation – doing more computations per byte transferred from memory. With the gap between memory and computational speeds growing wider as we move into the exaFLOP era, we expect such techniques to become increasingly important.

There has been some recent work on alternate floating-point representations

Another approximation commonly applied to reduce the memory pressure in FWI in the time domain is subsampling. Here, the time step rate of the gradient computation (see Eq.

Instead of storing the wave field at every time step during the forward computation, it is possible to store it at only a subset of the time steps. During the subsequent computation, which proceeds in reverse order to calculate the gradient, if the forward wave field is required at a time step that was not stored, it can be recovered by restarting the forward computation from the last stored time step. This is commonly known as checkpointing. Algorithms have been developed to define the optimal checkpointing schedule involving forward, store, backward, load, and recompute events under different assumptions
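As a minimal sketch of the checkpoint–recompute idea (not the actual pyRevolve implementation; the `step` function is a placeholder standing in for one forward time step of the wave solver):

```python
# Minimal checkpoint-recompute sketch. `step` is a stand-in for one
# deterministic forward time step of the wave solver.

def step(state):
    # Placeholder forward operator: any deterministic update works here.
    return [0.5 * s + 1.0 for s in state]

def forward_with_checkpoints(state, nsteps, every):
    """Run forward, storing the state only every `every` steps."""
    checkpoints = {0: list(state)}
    for t in range(nsteps):
        state = step(state)
        if (t + 1) % every == 0:
            checkpoints[t + 1] = list(state)
    return checkpoints

def state_at(t, checkpoints):
    """Recover the forward state at time t, recomputing from the last
    stored checkpoint when t itself was not stored."""
    t0 = max(c for c in checkpoints if c <= t)
    state = list(checkpoints[t0])
    for _ in range(t - t0):          # recompute the missing steps
        state = step(state)
    return state

# The reverse pass walks t = nsteps..0 and asks for each forward state.
# Because recomputation repeats the same deterministic operations, the
# recovered states are bitwise identical to a fully stored forward run:
ckpts = forward_with_checkpoints([1.0, 2.0], nsteps=10, every=4)
full = [[1.0, 2.0]]
for _ in range(10):
    full.append(step(full[-1]))
assert all(state_at(t, ckpts) == full[t] for t in range(11))
```

The final assertion illustrates the property noted below: checkpointing alone changes run time, not numerical results.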

Schematic of the checkpointing strategy. Wall-clock time is on the horizontal axis, while the vertical axis represents simulation time. The blue line represents forward computation. The dotted red line represents how the reverse computation would have proceeded after the forward computation had there been enough memory to store all the necessary checkpoints. Checkpoints are shown as black dots. The reverse computation under the checkpointing strategy is shown as the solid red line. It can be seen that the reverse computation proceeds only where the results of the forward computation are available. When not available, the forward computation is restarted from the last available checkpoint to recompute the results of the forward computation – shown here as the blue-shaded regions.

In previous work, we introduced the open-source software pyRevolve, a Python module that can automatically manage the checkpointing strategies under different scenarios with minimal modification to the computation code

The most significant advantage of checkpointing is that the numerical result remains unchanged by applying this technique. Note that we will shortly combine this technique with lossy compression, which might introduce an error, but checkpointing alone is expected to maintain bitwise equivalence. Another advantage is that the increase in run time incurred by the recomputation is predictable.

Compression or bit-rate reduction is a concept originally from signal processing. It involves representing information in fewer bits than the original representation. Since there is usually some computation required to go from one representation to another, compression can be seen as a memory–compute tradeoff.

Perhaps the best-known and most widely used compression algorithm is zlib (which implements DEFLATE, the algorithm behind gzip)

For floating-point data, another possibility is lossy compression, where the compressed and decompressed data are not exactly the same as the original data but a close approximation. The precision of this approximation is often set by the user of the compression algorithm. Two popular algorithms in this class are SZ
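To make the notion of an error-bounded lossy compressor concrete, the following toy sketch (a stand-in for ZFP and SZ, which use far more sophisticated transforms) guarantees a user-set absolute error tolerance via uniform quantization and then packs the quantized codes losslessly:

```python
import struct
import zlib

# Toy error-bounded lossy compressor (NOT ZFP or SZ): quantizing with step
# 2*atol guarantees |x - decompress(compress(x))| <= atol; zlib then packs
# the (now highly repetitive) integer codes losslessly.

def compress(values, atol):
    codes = [round(v / (2.0 * atol)) for v in values]
    raw = struct.pack(f"<{len(codes)}q", *codes)
    return zlib.compress(raw)

def decompress(blob, n, atol):
    codes = struct.unpack(f"<{n}q", zlib.decompress(blob))
    return [c * 2.0 * atol for c in codes]

field = [0.001 * i * i for i in range(10000)]   # smooth stand-in "wave field"
atol = 1e-2
blob = compress(field, atol)
recon = decompress(blob, len(field), atol)

# The pointwise error respects the requested absolute tolerance ...
assert max(abs(a - b) for a, b in zip(field, recon)) <= atol
# ... while the compressed size beats float32 storage of the field.
cf = (len(field) * 4) / len(blob)
assert cf > 1.0
```

Loosening `atol` coarsens the quantization, makes the code stream more repetitive, and raises the compression factor – the same precision-for-memory tradeoff that the real compressors expose through their tolerance parameter.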

Compression has often been used to reduce the memory footprint of adjoint computations in the past, including

Floating-point representation can itself be seen as a compressed representation that is not entirely precise. However, the errors introduced by the floating-point representation are already accounted for as noise in standard numerical analysis. The errors introduced by ZFP's compression of fields are more subtle, since the compression loss is pattern sensitive. Hence, we study them empirically here.

Schematic of the three-way tradeoff presented in this paper. With the use of checkpointing, it was possible to trade off memory and execution time (the horizontal line). With the use of compression alone, it was possible to trade off memory and accuracy. The combined approach presented in this work provides a novel three-way tradeoff.

The last few sections discussed existing methods that allow tradeoffs useful in solving FWI on limited resources. While checkpointing allows a tradeoff between computational time and memory, compression allows a tradeoff between memory and accuracy. This work combines the two into a single three-way tradeoff between memory, accuracy, and time to solution.

In previous work

In this work, we evaluate this method on the specific problem of FWI, focusing on solver convergence and accuracy.

To this end, we conduct an empirical study of the following factors:

the propagation of errors when starting from a lossy checkpoint,

the effect of checkpoint errors on the gradient computation,

the effect of decimation and subsampling on the gradient computation,

the accumulation of errors through the stacking of multiple shots,

the effect of the lossy gradient on the convergence of FWI.

The rest of the paper is organized as follows. Section

FWI is designed to numerically simulate a seismic survey experiment and invert for the Earth parameters that best explain the observations. In the physical experiment, a ship sends an acoustic impulse through the water by triggering an explosion. The waves created by this impulse travel through the water into the Earth's subsurface. The reflections and turning components of these waves are recorded by an array of receivers towed by the ship. A recording of one signal sent and the corresponding signals received at each of the receiver locations is called a shot. A single experiment typically consists of

Having recorded this collection of data (

Using this, it is possible to set up an optimization problem that aims to find the value of

This objective function

The computation described in the previous paragraph is for a single shot and must be repeated for every shot, and the final gradient is calculated by averaging the gradients calculated for the individual shots. This is repeated for every iteration of the minimization. This entire minimization problem is one step of a multi-grid method that starts by inverting only the low-frequency components on a coarse grid and adding higher-frequency components that require finer grids over successive inversions.

We use Devito

The FWI experiments were run on Azure's Kubernetes Service, using the framework described in

Following the method outlined in

As can be seen in Sect.

Let

: peak signal-to-noise ratio, we define this as follows:

: we treat

: we also report some errors by defining the error vector
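The metric definitions above are truncated in this copy. As a hedged sketch, one common form of PSNR for a reference field against its lossy reconstruction is the following (this exact formula is an assumption for illustration and may differ from the paper's definition):

```python
import math

# One common PSNR definition (assumed, for illustration): peak reference
# amplitude over the root-mean-square of the pointwise error, in decibels.

def psnr(ref, lossy):
    err = [a - b for a, b in zip(ref, lossy)]
    rmse = math.sqrt(sum(e * e for e in err) / len(err))
    peak = max(abs(v) for v in ref)
    return float("inf") if rmse == 0.0 else 20.0 * math.log10(peak / rmse)

# Higher PSNR means a more faithful reconstruction; a perfect copy is infinite.
assert psnr([1.0, -2.0, 3.0], [1.0, -2.0, 3.0]) == float("inf")
assert psnr([1.0, 0.0], [0.9, 0.0]) > psnr([1.0, 0.0], [0.5, 0.0])
```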

Evolution of compressibility through the simulation. We compressed every time step of the reference problem using an absolute error tolerance (

To understand the evolution of compressibility, we compress every time step of the reference problem using the same compression setting and report on the achieved compression factor as a function of the time step.
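This experiment can be sketched as follows (with a synthetic field standing in for the wave solver, and lossless zlib standing in for the error-bounded compressor; both are assumptions for illustration). The point is that early snapshots, where the wave has touched little of the domain, compress far better than late ones where energy has spread everywhere:

```python
import struct
import zlib

# Sketch of the compressibility-evolution experiment: compress the field at
# every time step and record the compression factor. The synthetic field and
# the use of lossless zlib are stand-ins, not the paper's actual setup.

def toy_field(t, n=2000):
    # The wave "front" has reached roughly the first 20*(t+1) cells;
    # the rest of the domain is still zero.
    reach = min(n, 20 * (t + 1))
    wave = [((i * 7919) % 1000) / 1000.0 for i in range(reach)]
    return wave + [0.0] * (n - reach)

def compression_factor(values):
    raw = struct.pack(f"<{len(values)}f", *values)
    return len(raw) / len(zlib.compress(raw))

cfs = [compression_factor(toy_field(t)) for t in range(100)]

# Compressibility decreases as the wave spreads through the domain.
assert cfs[0] > cfs[-1]
```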

This is shown in Fig.

A 2D slice of the last time step of the reference solution. The wave has spread through most of the domain. The values in this figure have been thresholded to

Therefore, we pick the last time step as the reference for further experiments. A 2D cross section of this snapshot is shown in Fig.

A histogram of pressure values in the wave field shown in Fig.

To understand the direct effects of compression, we compressed the reference wave field using different absolute tolerance (

Direct compression. We compress the wave field at the last time step of the reference solution using different absolute error tolerance (

Next, we ran the simulation for 550 time steps and took the state of the field at that point. We passed this state through a lossy compression–decompression cycle, giving two checkpoints – a reference checkpoint and a lossy checkpoint. We then restarted the simulation from step 550 and compared the progression of the run restarted from the lossy checkpoint with that of a reference run restarted from the original checkpoint. We ran this test for another 250 time steps.

Figure

Forward propagation. We stop the simulation after about 500 time steps. We then compress the state of the wave field at this point using

Gradient computation. In this experiment we carry out the full forward–reverse computation to get a gradient for a single shot while compressing the checkpoints at different

Gradient computation. Angle between the lossy gradient vector and the reference gradient vector (in radians) vs.

Gradient error.

Gradient error. In this plot we measure the effect of a varying number of checkpoints on the error in the gradient. We report the PSNR of the lossy vs. the reference gradient as a function of the number of checkpoints for four different compression settings.

Shot stacking. The gradient is first computed for each individual shot and then added up for all the shots. In this experiment we measure the propagation of errors through this step. This plot shows that while errors do have the potential to accumulate through the step, as can be seen from the curve for

True model used for FWI.

Reference solution for the complete FWI problem. This is the solution after running reference FWI for 30 iterations.

Convergence. As the last experiment, we run complete FWI to convergence (up to max 30 iterations). Here we show the convergence profiles for

The final image after running FWI

Final image after running FWI

Subsampling. We set up an experiment with subsampling as a baseline for comparison. Subsampling is when the gradient computation is carried out at a lower time-step rate than the simulation itself. This requires less data to be carried over from the forward to the reverse computation, at the cost of solution accuracy, and is thus comparable to lossy checkpoint compression. This plot shows the angle between the lossy gradient and the reference gradient vs. the compression factor CF

In this experiment, we do the complete gradient computation, as shown in Fig.

In Fig.

In Fig.

In Fig.

We also repeated the gradient computation for different numbers of checkpoints. Fewer checkpoints apply lossy compression at fewer points in the calculation, while the amount of recomputation is higher. More checkpoints require less recomputation but apply lossy compression at more points during the calculation. Figure

It can be seen from the plots that the errors induced in the checkpoint compression do not propagate significantly until the gradient computation step. In fact, the

After gradient computation on a single shot, the next step in FWI is the accumulation of the gradients for individual shots by averaging them into a single gradient. We call this stacking. In this experiment we studied the accumulation of errors through this stacking process. Figure
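The stacking step itself is a simple average over shots. As a hedged illustration of why stacking need not amplify compression errors, the toy below assumes per-shot errors behave like independent zero-mean noise (an assumption for illustration, not a result from the paper), in which case averaging shrinks them relative to the signal:

```python
import random

# Toy model of shot stacking: the final gradient is the mean of per-shot
# gradients. Per-shot compression error is modeled as independent zero-mean
# Gaussian noise (an assumption), which averaging suppresses.

random.seed(7)
n_shots, n_points = 64, 500
true_grad = [1.0] * n_points          # idealized "exact" gradient

def noisy_shot_gradient(sigma=0.5):
    return [g + random.gauss(0.0, sigma) for g in true_grad]

def stack(gradients):
    """Average the per-shot gradients into a single gradient."""
    n = len(gradients)
    return [sum(col) / n for col in zip(*gradients)]

def rms(err):
    return (sum(e * e for e in err) / len(err)) ** 0.5

single = noisy_shot_gradient()
stacked = stack([noisy_shot_gradient() for _ in range(n_shots)])

# Stacking 64 shots cuts the RMS error well below that of a single shot
# (ideally by a factor of 8 for independent noise).
assert rms([s - 1.0 for s in stacked]) < rms([s - 1.0 for s in single]) / 4
```

Correlated errors would not cancel this way, which is consistent with the observation below that only the loosest compression setting shows accumulation.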

This plot shows that the errors across the different shots do not add up and that the cumulative error does not grow with the number of shots, except for the compression setting of

Finally, we measure the effect of an approximate gradient on the convergence of the FWI problem. In practice, FWI is run for only a few iterations at a time as a fine-tuning step interspersed with other imaging steps. Here we run a fixed number of FWI iterations (30) to make it easier to compare different experiments. To make this a practical test problem, we extract a 2D slice from the original 3D velocity model and run a 2D FWI instead of a 3D FWI. We compare the convergence trajectory with that of the reference problem and report the results. For reference, Fig.

Figure

As a comparison baseline, we also use subsampling to reduce the memory footprint in a separate experiment and track the errors. Subsampling is another method to reduce the memory footprint of FWI that is often used in industry. The method is set up so that the forward and adjoint computations continue at the same time-step rate as the reference problem. However, the gradient computation is no longer done at the same rate – it is reduced by a factor
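A sketch of the subsampled imaging condition (an illustrative model, not the exact operator used in the paper): the gradient at each grid point is a zero-lag cross-correlation of the forward and adjoint fields over time, and with subsampling factor k only every k-th time step contributes, scaled by k to keep the time integral consistent.

```python
# Toy zero-lag cross-correlation imaging condition with subsampling.
# u_series and v_series are lists of per-time-step field snapshots
# (stand-ins for the forward and adjoint wave fields).

def gradient(u_series, v_series, k=1):
    """Accumulate u*v over every k-th time step, one scalar per grid point."""
    npoints = len(u_series[0])
    g = [0.0] * npoints
    for t in range(0, len(u_series), k):
        for i in range(npoints):
            g[i] += u_series[t][i] * v_series[t][i]
    return [x * k for x in g]   # compensate for the coarser time sampling

# Smooth toy fields: the subsampled gradient closely approximates the
# full-rate gradient, at a quarter of the carried-over data.
u = [[(t * 0.01) * (i + 1) for i in range(4)] for t in range(200)]
v = [[(t * 0.01) for _ in range(4)] for t in range(200)]
g_full = gradient(u, v, k=1)
g_sub = gradient(u, v, k=4)
assert all(abs(a - b) / abs(a) < 0.05 for a, b in zip(g_full, g_sub))
```

For smooth fields the relative error stays small, but unlike checkpoint compression the accuracy loss here is tied to the temporal bandwidth of the fields rather than to a user-set tolerance.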

Figure

Comparing Fig.

The results indicate that significant lossy compression can be applied to the checkpoints before the solution is adversely affected. This is an interesting result because, while it is common for approximate methods to lead to approximate solutions, this is not what we see in our results: the solution error does not change much even for large compression errors. This being an empirical study, we can only speculate on the reasons. We know that in the proposed method, the adjoint computation is not affected at all – the only effect is on the wave field carried over from the forward computation to the gradient computation step. Since the gradient computation is a cross-correlation, we expect only correlated signals to grow in magnitude in the gradient calculation and when gradients are stacked. The optimization steps are likely to be error-correcting processes as well, since even with an approximate gradient (

the error distribution,

the compression and decompression times, and

the achieved compression factors.

Our method accepts an error tolerance as input for every gradient evaluation. We expect this to be provided as part of an adaptive optimization scheme that requires approximate gradients in the first few iterations of the optimization and progressively more accurate gradients as the solution approaches the optimum. Such a scheme was previously described in

In the preceding sections, we have shown that with lossy compression, high compression factors can be achieved without significantly impacting the convergence or final solution of the inversion solver. This is a very promising result for the use of lossy compression in FWI. The use of compression in large computations like this is especially important in the exascale era, where the gap between compute and memory speeds is increasingly wide. Compression can reduce the strain on memory bandwidth by trading it off for extra computation. This is especially useful since modern CPUs are hard to saturate with low-operational-intensity (OI) computations.

In future work, we would like to study the interaction between compression errors and the velocity model for which FWI is being solved, as well as the source frequency. We would also like to compare multiple lossy compression algorithms, e.g., SZ.

Direct compression. On the left,

Gradient computation.

The data used are from the overthrust model, provided by SEG/EAGE

Most of the coding and experimentation was done by NK. The experiments were planned jointly by JH and NK. ML helped set up meaningful experiments. JW contributed to fine-tuning the experiments and to the presentation of results. PHJK and GJG gave the overall direction of the work. Everybody contributed to the writing.

The contact author has declared that neither they nor their co-authors have any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research was carried out with the support of Georgia Research Alliance and partners of the ML4Seismic Center. We would like to thank the anonymous reviewers for the constructive feedback that helped us improve this paper.

This research has been supported by the Engineering and Physical Sciences Research Council (grant no. EP/R029423/1) and the Department of Energy, Labor and Economic Growth (grant no. DE-AC02-06CH11357).

This paper was edited by Adrian Sandu and reviewed by Arash Sarshar and two anonymous referees.