Evaluation of lossless and lossy algorithms for the compression of scientific datasets in NetCDF-4 or HDF5 formatted files

The increasing volume of scientific datasets requires compression to reduce data storage and transmission costs, especially for the oceanographic and meteorological datasets generated by Earth observation mission ground segments. These data are mostly produced as NetCDF formatted files. Indeed, the NetCDF-4/HDF5 file formats are widespread in the global scientific community because of the useful features they offer. In particular, HDF5 offers the dynamically loaded filter plugin functionality, which allows users to write filters, such as compression/decompression filters, to process the data before writing it to or reading it from disk. In this work, we evaluate the performance of lossy and lossless compression/decompression methods through NetCDF-4 and HDF5 tools on analytical and real scientific floating-point datasets. We also introduce the Digit Rounding algorithm, a new relative-error-bounded data reduction method inspired by the Bit Grooming algorithm. The Digit Rounding algorithm achieves a high compression ratio while preserving a given number of significant digits in the dataset. It achieves a higher compression ratio than the Bit Grooming algorithm while maintaining a similar compression speed.
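As a concrete illustration of the plugin mechanism the abstract refers to, the short C sketch below attaches a compression filter to a chunked dataset through the standard HDF5 C API. The filter ID and the cd_values parameter are hypothetical placeholders, not the identifiers used by the paper; at runtime HDF5 resolves the ID against the plugins found on HDF5_PLUGIN_PATH.

    #include "hdf5.h"

    #define EXAMPLE_FILTER_ID 32000u   /* hypothetical plugin filter ID */

    /* Create a chunked float dataset whose chunks pass through a
     * dynamically loaded filter on every write and read. */
    hid_t make_filtered_dataset(hid_t file, const char *name)
    {
        hsize_t dims[1]  = {1000000};
        hsize_t chunk[1] = {10000};
        unsigned cd_values[1] = {3};   /* e.g. a precision parameter */

        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 1, chunk);  /* filters require chunked layout */
        H5Pset_filter(dcpl, EXAMPLE_FILTER_ID, H5Z_FLAG_MANDATORY,
                      1, cd_values);

        hid_t dset = H5Dcreate2(file, name, H5T_NATIVE_FLOAT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Pclose(dcpl);
        H5Sclose(space);
        return dset;
    }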


… so reliably and quickly. The study presents an original advance in lossy compression whose implementation unfortunately hampers its utility. The study is understandable yet poorly written. This potentially useful study of lossy compression techniques needs a thorough overhaul before publication.

Specific Comments
Originality: DR is an improvement on "Bit Grooming" (BG), which I invented as an improvement on "Bit Shaving". In that sense I am qualified to comment on its originality. The heart of DR is essentially a continuous version of BG: whereas BG fixes the number of bits masked for each specified precision, and masks these bits for every value, DR recomputes the number of bits masked for each quantized value to achieve the same precision. BG did not implement the continuous method because I thought that computing the logarithm of each value would be expensive, inelegant, and yield only marginally more compression. However, DR cleverly uses the exponent field instead of computing logarithms, and so deciphers the correct number of bits to mask while avoiding expensive floating-point math. This results in significantly more compressibility that (apparently) incurs no significant speed penalty (possibly because it compresses better and thus the lossless step is faster?). Hence DR appears to be a significant algorithmic advance, and I congratulate the authors for their insight.
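To make the exponent-field trick concrete, here is a minimal sketch of the per-value idea in C for IEEE-754 doubles. This is my own illustrative reconstruction under stated assumptions, not the authors' code: the function name, the constants, and the rounding details are assumptions, and a production version would need to treat edge cases more carefully.

    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    /* Illustrative sketch only: round one IEEE-754 double so that about
     * `nsd` significant decimal digits survive, deriving the number of
     * mantissa bits to keep from the exponent field rather than from
     * log10() of the data value. */
    static double digit_round_sketch(double v, int nsd)
    {
        uint64_t bits;
        memcpy(&bits, &v, sizeof bits);

        int biased = (int)((bits >> 52) & 0x7FF);
        if (biased == 0 || biased == 0x7FF)    /* zero/subnormal/Inf/NaN */
            return v;

        int e = biased - 1023;                 /* floor(log2|v|) */

        /* Approximate floor(log10|v|) from the binary exponent alone;
         * this can undershoot by one digit, which only makes the
         * rounding more conservative. */
        int d10 = (int)floor(e * 0.30102999566398120);

        /* Smallest p with 2^(e-p-1) <= 0.5 * 10^(d10 - nsd + 1), i.e.
         * the half-ulp of the kept precision stays at or below half of
         * the least significant preserved decimal digit. */
        int p = (int)ceil(e - (d10 - nsd + 1) * 3.3219280948873623);
        if (p >= 52) return v;                 /* nothing to discard */
        if (p < 0)   p = 0;

        uint64_t half = 1ULL << (52 - p - 1);  /* round to nearest ...  */
        uint64_t mask = ~((half << 1) - 1);    /* ... then zero the tail */
        bits = (bits + half) & mask;  /* a carry into the exponent field
                                         is the correct round-up case */
        memcpy(&v, &bits, sizeof v);
        return v;
    }

Note that the per-value work is a handful of integer and bit operations rather than a log10() call on the data value, which is consistent with DR showing no measurable speed penalty relative to BG.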
The manuscript stumbles in places due to low-quality English, and cries out for more fluent editing. Not only is the word choice often awkward, but the manuscript is like a continuously choppy sea of standalone sentences with few well-developed paragraphs that swell with meaning then yield gently to the next idea. GMD readers deserve and expect better.
Does DR guarantee that it will never create a relative error greater than half the value of the least significant digit? BG chooses the number of digits to mask conservatively, so it can and does guarantee that it always preserves the specified precision. Equations (1)–(7) imply that DR can make the same claim, but this claim is never explicitly tested or made. The absence of this guarantee is puzzling because it would strengthen the confidence of users in the algorithm. However, the guarantee must be explicitly tested, because it undergirds the premise that the comparison between DR and BG is fair.
In any case, clearly state whether DR ever violates the desired precision, even if that happens only rarely.
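One straightforward way to do so is a brute-force check over many values. The sketch below (reusing the hypothetical digit_round_sketch above, so again an assumption rather than the authors' code) rounds random normal doubles and verifies that the error never exceeds half of the least significant preserved decimal digit:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Empirical check of the half-digit guarantee for the sketch above. */
    int main(void)
    {
        const int nsd = 3;                  /* significant digits to keep */
        srand(42);
        for (long i = 0; i < 1000000; i++) {
            double m = 0.5 + (double)rand() / RAND_MAX;   /* [0.5, 1.5) */
            double v = ldexp(m, rand() % 80 - 40);        /* vary exponents */
            double r = digit_round_sketch(v, nsd);
            double tol = 0.5 * pow(10.0, floor(log10(fabs(v))) - nsd + 1);
            if (fabs(v - r) > tol) {
                printf("violation at v = %.17g (rounded to %.17g)\n", v, r);
                return 1;
            }
        }
        printf("no violations in 1e6 samples\n");
        return 0;
    }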
p. 16 L13: "Code and data availability: The Digit Rounding software source code and the data are currently only available upon request to Xavier Delaunay (xavier.delaunay@thalesgroup.com) or to Flavien Gouillon (Flavien.Gouillon@cnes.fr)." The GMD policy on code and data is here: https://www.geoscientific-model-development.net/about/code_and_data_policy.html. This manuscript provides no code access or explanation, and no dataset access, and thus appears to violate GMD policy in these areas.
Common comparisons would help build confidence in your results. It would have been more synergistic to evaluate the algorithms on at least one of the same datasets as Zender (2016), which are all publicly available. I am glad the authors used the publicly available NCO executables. Why not release the DR software in the same spirit so that the geoscience community can use (and possibly improve) it?
The lossless and lossy compression algorithms analyzed seem like a fairly balanced collection of those most relevant to GMD readers. Most methods that were omitted are, to my knowledge, either non-competitive (e.g., Packing) or not user-friendly, e.g., research-grade but not widely available (e.g., Layer Packing) and too hard to independently implement.

Table 6 on p. 19 shows the maximum absolute error (MAE) of BG is quite similar to DR, as I would expect. However, Table 7 on p. 20 shows the MAE of BG is nearly 10x less than that of DR. Why are the MAEs similar for dataset s1 …