The manuscript “Lossy Checkpoint Compression in Full Waveform Inversion: a case study with ZFP v0.5.5 and the Overthrust Model” by Kukreja et al. discusses a strategy to mitigate the huge memory bottleneck of time-domain adjoint simulations that require access to the entire forward wavefield in reverse order.
The topic is highly interesting and relevant for realistic FWI on modern compute architectures. The specific contribution of the manuscript is to compress checkpoints during the forward simulation before writing them to disk and to restart the computations from de-compressed checkpoints during the adjoint run. Admittedly, this is a rather small step, but given its importance in an FWI workflow, it could still warrant a publication. A synthetic 2D FWI demonstrates that even with significant compression factors the rate of convergence is not negatively affected. This is a great and very relevant result.
I primarily see two aspects in the current manuscript that require some additional work, which I list below. Furthermore, some parts of the manuscript are a bit sloppy and could use more care. Several figures (8, 10, 11, 12, 14, 16) are not explained or even referenced in the text.
+++ Technical aspects
Can you provide more information on some technical aspects of the compression? For instance:
- How do you manage parallel file formats for distributed simulations with compressed chunks of data, where the size of each chunk may vary and is unknown prior to the simulation?
- Did you compare different compression algorithms, and can you comment on the computational overhead for (de)compressing the wavefield?
- Which fields do you compress (pressure, velocity, pressure gradient, …)? Do you apply different tolerances or compression strategies for different fields?
- Is it possible to guarantee a priori an absolute or relative tolerance after decompression?
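
To make the first question above concrete, here is a minimal sketch of one conceivable layout, written under my own assumptions and not taken from the manuscript: compressed chunks are appended to a single stream, and an index of (offset, length) pairs is recorded so that chunks of unequal size can be read back in reverse order. Here zlib is only a lossless stand-in for the lossy compressor, and the names write_chunks and read_chunk are hypothetical.

```python
import io
import zlib


def write_chunks(stream, chunks):
    """Append compressed chunks to stream and return a list of (offset, length)."""
    index = []
    for raw in chunks:
        blob = zlib.compress(raw)  # stand-in for the lossy compressor
        index.append((stream.tell(), len(blob)))  # size is known only after compression
        stream.write(blob)
    return index


def read_chunk(stream, index, i):
    """Read and decompress chunk i using the recorded offset and length."""
    offset, length = index[i]
    stream.seek(offset)
    return zlib.decompress(stream.read(length))


# Hypothetical checkpoints with different compressibility
chunks = [bytes(1000), bytes(range(256)) * 4, b"wavefield" * 50]
stream = io.BytesIO()
index = write_chunks(stream, chunks)

# The adjoint run needs the checkpoints in reverse order
restored_in_reverse = [read_chunk(stream, index, i) for i in reversed(range(len(chunks)))]
```

In a distributed setting, the offsets would presumably have to be agreed upon across ranks, for example via an exclusive prefix sum over the per-rank compressed sizes. It would be helpful to know how the authors handle this.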
+++ Error analysis
Can you elaborate on the absolute tolerance atol used in the numerical examples? I think it would be better to relate the tolerance to the maximum amplitude of the wavefield or of the source. An absolute number on its own is rather meaningless.
When looking at Fig. 7, I wonder whether the frequency content of the (de)compressed snapshot is altered, which would explain why the error grows over the first few time steps. Furthermore, is the decreasing trend simply a result of the decaying amplitudes in the wavefield, or is the error normalized in some way? It would help to also show relative errors per time step.
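
To illustrate why relative errors per time step would be informative, consider the following sketch. Everything in it is an assumption of mine: a uniform quantizer stands in for the lossy compressor (it is not ZFP), and the decaying wavefield is synthetic. The point is only that a fixed absolute tolerance corresponds to a relative error that changes with the amplitude of each snapshot.

```python
import math


def quantize(values, atol):
    """Stand-in lossy compressor: uniform quantization with |error| <= atol."""
    step = 2.0 * atol
    return [round(v / step) * step for v in values]


def errors_per_step(snapshots, atol):
    """Return (max absolute error, error relative to peak amplitude) per snapshot."""
    rows = []
    for snap in snapshots:
        approx = quantize(snap, atol)
        abs_err = max(abs(a - b) for a, b in zip(snap, approx))
        peak = max(abs(v) for v in snap)
        rel_err = abs_err / peak if peak > 0 else 0.0
        rows.append((abs_err, rel_err))
    return rows


# Synthetic wavefield whose amplitude halves at every time step (hypothetical data)
snapshots = [[0.5 ** t * math.sin(0.3 * i) for i in range(200)] for t in range(8)]
atol = 1e-4
rows = errors_per_step(snapshots, atol)

for t, (abs_err, rel_err) in enumerate(rows):
    print(f"step {t}: abs error {abs_err:.2e}, relative error {rel_err:.2e}")
```

In this toy setting the absolute error stays below atol at every step, while the error relative to the peak amplitude grows as the wavefield decays. A similar normalization applied to Fig. 7 would make the trend much easier to interpret.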
+++ Minor comments
page 2, line 22:
Referencing an equation long before it appears in the manuscript is bad style. Furthermore, the equation is not called TTI. TTI refers to the medium / model parameterization or stress-strain relation, respectively, and should not be used as an acronym without introduction.
page 2, Table 1:
“Forward propagation” is misleading and should be “time steps” instead. Calling it “peak memory” is very misleading for gradient computations because no reasonable implementation would do that.
page 3, line 48/49:
Either remove the reference to eq. (1) or state the equation here.
page 4, line 81:
Why is there an asterisk after Louboutin?
page 5, Figure 2:
Add labels and annotation to make it easier to read. The horizontal axis could count multiples of single simulations.
page 9, line 191:
What density model are you using? I would still consider it an inverse crime if it were a scaled version of the velocity model or even just a homogeneous model.
This section seems a bit disconnected from the rest. I would recommend merging it with section 4. In particular, I don’t see a reason to introduce the subsections of section 4 already here with a single paragraph.
page 11, Figures 4 and 5:
How does the absolute tolerance of 1e-4 relate to the pressure amplitude? Which other fields do you compress? Is this really an absolute tolerance and not a relative one?
page 12, caption Figure 6:
There is a reference missing: “See figures A1 and ??”
page 12, Figure 6:
How do you define the signal to noise ratio in this case?
page 15, Figure 10:
The same results are shown again in Fig. 19, so I would either show the entire x-axis here or remove the figure.
page 15, Figure 11:
What error is shown on the y-axis? And why is it huge?
page 17, line 269:
Please put atol in context to the maximum amplitude of the wavefields that are compressed.
page 17, Figure 14:
Should this be “true model” instead of “true solution”? The figure is not referenced in the text.
page 17, Figure 15:
Some of the recovered structure appears to be significantly smaller than a wavelength for the given frequency content. Could you comment on a possible inverse crime? Are you inverting for density as well (see the question on the density model above)?
page 20, line 293:
What does atol > 4 mean?