The increasing volume of scientific datasets requires the use of compression to reduce data storage and transmission costs, especially for the oceanographic and meteorological datasets generated by Earth observation mission ground segments. These data are mostly produced as netCDF files. Indeed, the netCDF-4/HDF5 file formats are widely used throughout the global scientific community because of the useful features they offer. HDF5 in particular offers a dynamically loaded filter plugin mechanism so that users can write their own compression/decompression filters, for example, and process data transparently as they are written to or read from disk. This study evaluates lossy and lossless compression/decompression methods through netCDF-4 and HDF5 tools on analytical and real scientific floating-point datasets. We also introduce the Digit Rounding algorithm, a new relative error-bounded data reduction method inspired by the Bit Grooming algorithm. The Digit Rounding algorithm offers a high compression ratio while keeping a given number of significant digits in the dataset. It achieves a higher compression ratio than the Bit Grooming algorithm, with slightly lower compression speed.

Ground segments processing scientific mission data are facing challenges due to the ever-increasing resolution of on-board instruments and the volume of data to be processed, stored and transmitted. This is the case for oceanographic and meteorological missions, for instance. Earth observation mission ground segments produce very large files mostly in netCDF format, which is standard in the oceanography field and widely used by the meteorological community. This file format is widely used throughout the global scientific community because of its useful features. The fourth version of the netCDF library, denoted netCDF-4/HDF5 (as it is based on the HDF5 layer), offers “Deflate” and “Shuffle” algorithms as native compression features. However, the compression ratio achieved does not fully meet ground processing requirements, which are to significantly reduce the storage and dissemination cost as well as the I/O times between two modules in the processing chain.

In response to the ever-increasing volume of data, scientists are keen to compress data. However, they have certain requirements: both compression and decompression have to be fast. Lossy compression is acceptable only if the compression ratios are higher than those of lossless algorithms and if the precision, or data loss, can be controlled. There is a trade-off between the data volume and the accuracy of the compressed data. Nevertheless, scientists can accept small losses if they remain below the data's noise level. Noise is difficult to compress and of little interest to scientists, so they do not consider data degradation that remains under the noise level as a loss (Baker et al., 2016). In order to increase the compression ratio within the processing chain, “clipping” methods may be used to degrade the data before compression. These methods increase the compression ratio by removing the least significant digits in the data. Indeed, at some level, these least significant digits may not be scientifically meaningful in datasets corrupted by noise.

This paper studies compression and clipping methods that can be applied to scientific datasets in order to maximize the compression ratio while preserving scientific data content and numerical accuracy. It focuses on methods that can be applied to scientific datasets, i.e. vectors or matrices of floating-point numbers. First, lossless compression algorithms can be applied to any kind of data. The standard is the “Deflate” algorithm (Deutsch, 1996), native in the netCDF-4/HDF5 libraries. It is widely used in compression tools such as the zip, gzip, and zlib libraries, and has become a benchmark for lossless data compression. Recently, alternative lossless compression algorithms have emerged. These include Google Snappy, LZ4 (Collet, 2013) and Zstandard (Collet and Turner, 2016). None of these algorithms uses Huffman coding, which allows them to compress faster than the Deflate algorithm. Second, preprocessing methods such as Shuffle, available in HDF5, or Bitshuffle (Masui et al., 2015) are used to optimize lossless compression by rearranging the data bytes or bits into a “more compressible” order. Third, some lossy/lossless compression algorithms, such as FPZIP (Lindstrom and Isenburg, 2006), ZFP (Lindstrom, 2014) and Sz (Tao et al., 2017), are specifically designed for scientific data – and in particular floating-point data – and can control data loss. Fourth, data reduction methods such as Linear Packing (Caron, 2014a), Layer Packing (Silver and Zender, 2017), Bit Shaving (Caron, 2014b), and Bit Grooming (Zender, 2016a) lose some data content without necessarily reducing its volume. Preprocessing methods and lossless compression can then be applied to obtain a higher compression ratio.
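As a concrete example of the fourth category, Bit Shaving can be sketched as zeroing the trailing mantissa bits of an IEEE 754 float32 value. This is a minimal illustration under our own naming; the function and the `keep_bits` parameter are ours, not taken from the cited implementations:

```python
import struct

def bit_shave(x, keep_bits):
    # Reinterpret the float32 value as its 32-bit pattern, zero the trailing
    # (23 - keep_bits) mantissa bits, and reinterpret back. The relative
    # error introduced is bounded by roughly 2 ** -keep_bits.
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    return struct.unpack('<f', struct.pack('<I', i & mask))[0]
```

Shaved values contain long runs of zero bits, which the subsequent preprocessing and lossless coding steps can exploit, even though bit shaving itself does not reduce the data volume.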

This paper focuses on compression methods implemented for netCDF-4 or HDF5 files. These scientific file formats are widespread among the oceanographic and meteorological communities. HDF5 offers a dynamically loaded filter plugin that allows users to write compression/decompression filters (among others), and to process data before reading or writing them to disk. Consequently, many compression/decompression filters – such as Bitshuffle, Zstandard, LZ4, and Sz – have been implemented by members of the HDF5 user community and are freely available. The netCDF Operator toolkit (NCO) (Zender, 2016b) also offers some compression features, such as Bit Shaving, Decimal Rounding and Bit Grooming.

The rest of this paper is divided into five sections. Section 2 presents the lossless and lossy compression schemes for scientific floating-point datasets. Section 3 introduces the Digit Rounding algorithm, which is an improvement of the Bit Grooming algorithm that optimizes the number of mantissa bits preserved. Section 4 defines the performance metrics used in this paper. Section 5 describes the performance assessment of a selection of lossless and lossy compression methods on synthetic datasets. It presents the datasets and compression results before making some recommendations. Section 6 provides some compression results obtained with real CFOSAT and SWOT datasets. Finally, Sect. 7 provides our conclusions.

Compression schemes for scientific floating-point datasets usually entail several steps: data reduction, preprocessing, and lossless coding. These three steps can be chained as illustrated in Fig. 1. The lossless coding step is reversible. It does not degrade the data while reducing its volume. It can be implemented by lossless compression algorithms such as Deflate, Snappy, LZ4 or Zstandard. The preprocessing step is also reversible. It rearranges the data bytes or bits to enhance lossless coding efficiency. It can be implemented by algorithms such as Shuffle or Bitshuffle. The data reduction step is not reversible because it entails data losses. The goal is to remove irrelevant data such as noise or other scientifically meaningless data. Data reduction can reduce data volume, depending on the algorithm used. For instance, the Linear Packing and Sz algorithms reduce data volume, but Bit Shaving and Bit Grooming algorithms do not.
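The chaining of the reversible steps can be sketched with Shuffle preprocessing followed by Deflate coding applied to float32 samples. This is a minimal illustration using Python's standard struct and zlib modules; in the full chain a lossy data reduction step would precede `shuffle_bytes`, and the function names are ours:

```python
import struct
import zlib

def shuffle_bytes(raw, itemsize):
    # Preprocessing (reversible): regroup the bytes so that all first bytes
    # of the samples come first, then all second bytes, etc.
    n = len(raw) // itemsize
    return bytes(raw[i * itemsize + b] for b in range(itemsize) for i in range(n))

def compress_chain(samples, level=1):
    raw = struct.pack('<%df' % len(samples), *samples)  # float32 samples
    preprocessed = shuffle_bytes(raw, 4)                # Shuffle step
    return zlib.compress(preprocessed, level)           # Deflate step
```

On smooth data the shuffled byte stream contains long runs of identical bytes, which Deflate compresses far better than the interleaved original.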

Compression chain showing the data reduction, preprocessing and lossless coding steps.

This paper evaluates the lossless compression algorithms Deflate, LZ4, and Zstandard: Deflate because it is the benchmark algorithm; LZ4 because it is a widely used, very high-speed compressor; and Zstandard because it provides better results than Deflate in terms of both compression ratio and compression/decompression speed. The Deflate algorithm uses LZ77 dictionary coding (Ziv and Lempel, 1977) and Huffman entropy coding (Huffman, 1952). LZ77 and Huffman coding exploit different types of redundancies to enable Deflate to achieve high compression ratios. However, the computational cost of the Huffman coder is high and makes Deflate compression rather slow. LZ4 is a dictionary coding algorithm designed to provide high compression/decompression speeds rather than a high compression ratio. It does this without an entropy coder. Zstandard is a fast lossless compressor offering high compression ratios. It makes use of dictionary coding (repcode modeling) and a finite-state entropy coder (tANS) (Duda, 2013). It offers a compression ratio similar to that of Deflate coupled with high compression/decompression speeds.
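The speed/ratio trade-off of Deflate's compression levels is easy to demonstrate with Python's zlib module, which implements Deflate (a minimal sketch on synthetic repetitive data):

```python
import zlib

data = bytes(range(256)) * 1000  # 256 000 bytes of repetitive sample data

fast = zlib.compress(data, 1)    # low level: favors compression speed
small = zlib.compress(data, 9)   # high level: searches harder for matches

# Deflate is lossless: both streams decompress back to the original bytes
assert zlib.decompress(fast) == data
assert zlib.decompress(small) == data
```

Higher levels spend more time searching the LZ77 window for longer matches, which typically shrinks the output at the cost of compression speed.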

This paper also evaluates Shuffle and Bitshuffle. Shuffle groups all the data samples' first bytes together, all the second bytes together, and so on. In smooth datasets, or datasets with highly correlated consecutive sample values, this rearrangement creates long runs of similar bytes, improving the dataset's compression. Bitshuffle extends the concept of Shuffle to the bit level by grouping together all the data samples' first bits, second bits, etc.
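The bit-level rearrangement can be illustrated with a naive sketch (this is not the vectorized Bitshuffle implementation, which operates on SIMD-friendly blocks; the function name is ours):

```python
def bitshuffle(raw, itemsize):
    # Gather bit 0 of every sample, then bit 1 of every sample, and so on,
    # then repack the reordered bit sequence into bytes.
    nbits = itemsize * 8
    n = len(raw) // itemsize
    bits = []
    for b in range(nbits):
        for i in range(n):
            byte = raw[i * itemsize + b // 8]
            bits.append((byte >> (b % 8)) & 1)
    out = bytearray(len(raw))
    for j, bit in enumerate(bits):
        if bit:
            out[j // 8] |= 1 << (j % 8)
    return bytes(out)
```

When consecutive samples share most of their high-order bits, the reordered stream contains long runs of identical bits, which the lossless coder can exploit.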

Last, we evaluate the lossy compression algorithms Sz, Decimal Rounding and
Bit Grooming. We chose to evaluate the Sz algorithm because it provides
better rate-distortion results than FPZIP and ZFP, see Tao et al. (2017).
The Sz algorithm predicts data samples using an

Representation of the value of

The Digit Rounding algorithm is similar to the Decimal Rounding algorithm in
the sense that it computes a quantization factor

The Digit Rounding algorithm uses uniform scalar quantization with
reconstruction at the bin center:
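A simplified numeric sketch of this quantization follows, assuming a power-of-two quantization factor derived from the requested number of significant digits (nsd). It omits the optimized bitwise factor computation of the actual implementation and uses our own function name:

```python
import math

def digit_round(values, nsd):
    """Sketch: keep `nsd` significant decimal digits by uniform scalar
    quantization with reconstruction at the bin center."""
    out = []
    for v in values:
        if v == 0.0:
            out.append(0.0)
            continue
        # position of the most significant decimal digit of |v|
        d = math.floor(math.log10(abs(v))) + 1
        # largest power of two not exceeding the precision implied by nsd
        q = 2.0 ** math.floor(math.log2(10.0 ** (d - nsd)))
        # quantize and reconstruct at the bin center (error bounded by q/2)
        out.append((math.floor(v / q) + 0.5) * q)
    return out
```

Because the quantization factor is a power of two, the reconstructed values have many trailing zero mantissa bits, which makes the subsequent Shuffle and Deflate steps more effective.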

We have developed an HDF5 dynamically loaded filter plugin so as to apply
the Digit Rounding algorithm to netCDF-4 or HDF5 datasets. It should be
noted that data values rounded by the Digit Rounding algorithm can be read
directly: there is no reverse operation to Digit Rounding, and users do not
need any specific software to read the rounded data. Table 2 provides the results of
the Digit Rounding algorithm on the value of

Representation of the value of

Maximum absolute errors, mean absolute errors and mean errors of
the Digit Rounding algorithm preserving a varying number of significant
digits (nsd) on an artificial dataset composed of 1 000 000 values evenly
spaced over the interval

Comparison of the Bit Grooming and Digit Rounding algorithms for the compression of a MERRA dataset. Shuffle and Deflate (level 1) are applied. The compression ratio (CR) is defined by the ratio of the compressed file size over the reference data size (244.3 MB) obtained with Deflate (level 5) compression. Bit Grooming results are extracted from Zender (2016a). Bold values indicate where Digit Rounding performs better than Bit Grooming.

The following section first defines the various performance metrics used hereinafter, then studies the performance of various lossless and lossy compression algorithms – including Digit Rounding – when applied to both synthetic and real scientific datasets.

One of the features required for lossy scientific data compression is
control over the amount of loss, or the accuracy, of the compressed data.
Depending on the data, this accuracy can be expressed by an absolute or a
relative error bound. The maximum absolute error is defined by

A nearly exhaustive list of metrics for assessing the performance of
lossy compression of scientific datasets is provided in Tao et al. (2019).
For the sake of conciseness, only a few of them are presented in this paper.
The following metrics were chosen for this study:

compression ratio CR

compression speed CS

The following metrics were chosen to assess the data degradation of the
lossy compression algorithms:

maximum absolute error

mean error

mean absolute error

SNR to evaluate the signal to compression error ratio. It is defined by the ratio of the signal level over the root mean square compression error and is expressed in decibels (dB):
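These metrics can be computed straightforwardly. The sketch below assumes one common convention for each, e.g. CR as original size over compressed size and the signal level taken as the RMS of the original samples; the function names are ours:

```python
import math

def compression_ratio(original_size, compressed_size):
    # CR: original size divided by compressed size (one common convention)
    return original_size / compressed_size

def max_abs_error(original, approx):
    return max(abs(x - y) for x, y in zip(original, approx))

def mean_error(original, approx):
    return sum(x - y for x, y in zip(original, approx)) / len(original)

def mean_abs_error(original, approx):
    return sum(abs(x - y) for x, y in zip(original, approx)) / len(original)

def snr_db(original, approx):
    # ratio of the signal level (here, RMS of the original samples) over
    # the root mean square compression error, expressed in decibels
    n = len(original)
    rms_signal = math.sqrt(sum(x * x for x in original) / n)
    rms_error = math.sqrt(sum((x - y) ** 2 for x, y in zip(original, approx)) / n)
    return 20.0 * math.log10(rms_signal / rms_error)
```

Compression speed (CS) is simply the uncompressed data size divided by the compression time, and decompression speed is defined analogously.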

Synthetic datasets

The lossless compression algorithms evaluated are Deflate and Zstandard,
with or without the Shuffle or Bitshuffle preprocessing step. LZ4 is always
evaluated with the Bitshuffle preprocessing step because that step is
imposed by the LZ4 filter implementation we used. We ran the lossless
compression algorithms using the h5repack tool from the HDF5 library,
version 1.8.19, with Deflate implemented in zlib 1.2.11 and Zstandard
version 1.3.1 with the corresponding HDF5 filter available on the HDF web
portal
(

Figures 2 and 3 provide the results obtained for the compression and
decompression of dataset

Results obtained for the lossless compression of the

Results obtained for the lossless compression of the

To summarize, these results show that preprocessing by Shuffle or Bitshuffle
is very helpful in increasing compression efficiency. They also show that
Zstandard can provide higher compression and decompression speeds than
Deflate at low compression levels. However, on the

The lossy compression algorithms evaluated are error-bounded compression algorithms. They can constrain either the maximum absolute error or the maximum relative error, or both. The compression algorithms evaluated are Sz, Decimal Rounding, Bit Grooming and the Digit Rounding algorithm introduced in this paper. The Sz compression algorithm works in both error-bounded modes. Decimal Rounding allows a specific number of decimal digits to be preserved. In this sense, it bounds the maximum absolute error. Bit Grooming allows a specific number of significant digits to be preserved. In this sense, it bounds the maximum relative error. Like the Bit Grooming algorithm, Digit Rounding preserves a specific number of significant digits and bounds the maximum relative error.

We ran Sz version 2.1.1 using the h5repack tool and the Sz HDF5 filter plugin, applying the Deflate lossless compression algorithm integrated in the Sz software. We ran the Decimal Rounding and Bit Grooming algorithms using NCO version 4.7.9, applying Shuffle and Deflate compression in the call to the NCO tool. Last, we ran the Digit Rounding algorithm using the h5repack tool and our custom implementation of the algorithm in an HDF5 filter plugin. The Supplement provides the command lines and options used.

This section compares the performance of the absolute error-bounded
compression algorithms: Sz and Decimal Rounding. The results reported were
obtained by applying Sz configured with the options SZ_BEST_SPEED and Gzip_BEST_SPEED.
Shuffle and Deflate with

Table 5 compares the results obtained in absolute error-bounded compression
mode for

Compression results of the absolute error-bounded compression
algorithms Sz and Decimal Rounding on datasets

Comparison of the compression results (SNR vs. compression ratio)
of the Sz and Decimal Rounding algorithms in absolute error-bounded
compression mode, on the

Figure 4 compares the Sz and Decimal Rounding algorithms in terms of SNR versus
compression ratio. This figure was obtained with the following parameters:

For the Sz algorithm, the

For the Decimal Rounding algorithm, the dsd parameter was successively set to 4, 3, 2, 1, 0,

This section compares the performance of the relative error-bounded
compression algorithms: Sz, Bit Grooming, and Digit Rounding. The results
reported were obtained by applying Sz configured with the options
SZ_DEFAULT_COMPRESSION and Gzip_BEST_SPEED. Shuffle and Deflate with

We first focus on the results obtained with dataset

Compression results of the relative error-bounded compression
algorithms Sz, Bit Grooming, and Digit Rounding on dataset

Comparison of the compression results (SNR vs. compression ratio)
of the Sz, Bit Grooming and Digit Rounding algorithms in relative
error-bounded compression mode, on the

Figure 5a compares Sz, Bit Grooming, and Digit Rounding algorithms in
terms of SNR versus compression ratio. This figure has been obtained with
the following parameters:

For the Sz algorithm, the

For the Bit Grooming algorithm, the nsd parameter was successively set to 6, 5, 4, 3, 2, 1;

For the Digit Rounding algorithm, the nsd parameter was successively set to 6, 5, 4, 3, 2, 1.

Compression ratio as a function of the user-specified number of
significant digits (nsd) for the Sz, Bit Grooming and Digit Rounding
algorithms, on the

We now focus on the results obtained with dataset

Compression results of Sz, Bit Grooming, and Digit Rounding in
relative error-bounded compression mode on dataset

Figure 5b compares Sz, Bit Grooming, and Digit Rounding algorithms in
terms of SNR versus compression ratio. This figure has been obtained with
the following parameters:

For the Sz algorithm, the

For the Bit Grooming algorithm, the nsd parameter was successively set to 6, 5, 4, 3, 2, 1;

For the Digit Rounding algorithm, the nsd parameter was successively set to 6, 5, 4, 3, 2, 1.

The Bit Grooming and Digit Rounding algorithms provide similar compression
ratios, but even higher compression ratios are obtained with Sz. Figure 6b compares the compression ratio obtained as a function of the nsd
parameter, which is the user-specified number of significant digits. As for
dataset

Those results show that the Digit Rounding algorithm can be competitive with the Bit Grooming and Sz algorithms in relative error-bounded compression mode. It is thus applied to real scientific datasets in the next section.

The Chinese-French Oceanography Satellite (CFOSAT) is a cooperative program between the French and Chinese space agencies (CNES and CNSA respectively). CFOSAT is designed to characterize ocean surfaces in order to better model and predict ocean states, and to improve knowledge of ocean/atmosphere exchanges. CFOSAT products will help marine and weather forecasting and will also be used to monitor the climate. The CFOSAT satellite will carry two scientific payloads – SCAT, a wind scatterometer, and SWIM, a wave scatterometer – for the joint characterization of ocean surface winds and waves. The SWIM (Surface Wave Investigation and Monitoring) instrument delivered by CNES is dedicated to measuring the directional wave spectrum (density spectrum of wave slopes as a function of the direction and wavenumber of the waves). The CFOSAT L1A product contains calibrated and geocoded waveforms. By the end of the mission in 2023/2024, CFOSAT will have generated about 350 TB of data. Moreover, during the routine phase, users should have access to the data less than 3 h after acquisition. I/O and compression performance is thus critical.

Currently, the baseline for compression of the CFOSAT L1A product involves a
clipping method as a data reduction step, with Shuffle preprocessing and
Deflate lossless coding with a compression level

We studied the following compression methods:

CFOSAT clipping followed by Shuffle and Deflate (

CFOSAT clipping followed by Shuffle and Zstandard (

Sz followed by Deflate in the absolute error bounded mode;

Decimal Rounding followed by Shuffle and Deflate (

Bit Grooming (nsd

Digit Rounding (nsd

Compression results for the

The results for the compression of the full CFOSAT L1A product of 7.34 GB
(uncompressed) are provided in Table 9. The maximum absolute error and the
mean absolute error are not provided because this dataset contains several
variables compressed with different parameters. Compared to the CFOSAT
baseline compression, Zstandard increases the compression speed by about
40 % while offering a similar compression ratio. It was not possible to
apply Sz compression to the full dataset, since the Sz configuration file has to
be modified to adapt the

Compression results for the CFOSAT L1A product.

The Surface Water and Ocean Topography (SWOT) mission is a partnership between NASA and CNES, and continues the long history of altimetry missions with an innovative instrument known as KaRin, a Ka-band synthetic aperture radar. The launch is foreseen for 2021. SWOT addresses both the oceanographic and hydrological communities, accurately measuring the water level of oceans, rivers, and lakes.

SWOT has two processing modes, so two different types of products are
generated: high-resolution products dedicated to hydrology, and
low-resolution products mostly dedicated to oceanography. The Pixel Cloud
product (called L2_HR_PIXC) contains data from
the KaRin instrument's high-resolution (HR) mode. It contains information on
the pixels that are detected as being over water. This product is generated
when the HR mask is turned on. The Pixel Cloud product is organized into
sub-orbit tiles for each swath and each pass; it is an intermediate
product between the L1 Single Look Complex products and the L2 lake/river
ones. The product granularity is a tile 64 km long in the along-track
direction, and it covers either the left or right swath (

The compression of two different datasets was evaluated:

A simplified simulated SWOT L2_HR_PIXC pixel cloud product of 460 MB (uncompressed);

A realistic and representative SWOT L2 pixel cloud dataset of 199 MB (uncompressed).

The current baseline for the compression of the simplified simulated SWOT L2
pixel cloud product involves Shuffle preprocessing and Deflate lossless
coding with a compression level

Shuffle and Deflate (

Shuffle and Zstandard (

Sz with Deflate in the relative error bounded mode;

Bit Grooming followed by Shuffle and Deflate (

Digit Rounding followed by Shuffle and Deflate (

Compression results for the

Compression results for the

Compression results for the simplified simulated SWOT L2_HR_PIXC pixel cloud product.

Compression results for the representative SWOT L2 pixel cloud product.

Next we focused on the

Table 12 provides the results of the compression of the full simulated SWOT L2_HR_PIXC pixel cloud product. The maximum absolute error and the mean absolute error are not provided because this dataset contains several variables compressed with different parameters. Compared to the SWOT baseline compression, Zstandard increases the compression speed by more than 5 times while offering a similar compression ratio. Sz compression was not applied because it cannot achieve the high precision required for some variables. Bit Grooming and Digit Rounding were configured on a per-variable basis to keep the precision required by the scientists for each variable. Compared to the baseline, Bit Grooming and Digit Rounding increase the compression ratio by 20 % and 30 % respectively, with similar compression speeds and faster decompression.

The results for the compression of the representative SWOT L2 pixel cloud product are provided in Table 13. Compared to the baseline, Zstandard compression is nearly 4 times faster while offering a similar compression ratio. Bit Grooming increases the compression ratio by 29 % with a higher compression speed, and Digit Rounding increases the compression ratio by 34 % with a slightly lower compression speed than Bit Grooming. Bit Grooming and Digit Rounding provide the fastest decompression. Our recommendation for the compression of SWOT datasets is thus to use the Digit Rounding algorithm to achieve high compression, at the price of a compression speed lower than that of the lossless solutions, considering that for SWOT the driver is product size, and taking into account the ratio between compression time and processing time.

This study evaluated lossless and lossy compression algorithms both on synthetic datasets and on realistic simulated datasets of future science satellites. The compression methods were applied using netCDF-4 and HDF5 tools. It has been shown that the impact of the compression level options of Zstandard or Deflate on the compression ratio achieved is not significant compared to the impact of the Shuffle or Bitshuffle preprocessing. However, high compression levels can significantly reduce the compression speed. Deflate and Zstandard with low compression levels are both reasonable options to consider for the compression of scientific datasets, but they should always follow a Shuffle or Bitshuffle preprocessing step. It has been shown that Zstandard can speed up the compression of CFOSAT and SWOT datasets compared to the baseline solution based on Deflate.

The lossy compression of scientific datasets can be achieved in two different error-bounded modes: absolute and relative error-bounded. Four algorithms have been studied: Sz, Decimal Rounding, Bit Grooming and Digit Rounding. One useful feature of the last three is that the accuracy of the compressed data can easily be interpreted: rather than defining an absolute or a relative error bound, they define the number of significant decimal digits or the number of significant digits. In absolute error-bounded mode, Sz provides higher compression ratios than Decimal Rounding on most datasets. However, for the compression of netCDF/HDF5 datasets composed of several variables, its usability is reduced by the fact that only one absolute error bound can be set for all the variables. It cannot easily be configured to achieve the precision required for each individual variable. This is why we recommend the Decimal Rounding algorithm instead, to achieve fast and effective compression of the CFOSAT dataset. In relative error-bounded mode, the Digit Rounding algorithm introduced in this work provides higher compression ratios than the Bit Grooming algorithm from which it derives, but with lower compression speed. Sz can provide even higher compression ratios but fails to achieve the high precision required for some variables. This is why we recommend the Digit Rounding algorithm instead, to achieve relative error-bounded compression of SWOT datasets with a compression ratio 30 % higher than the baseline solution for SWOT compression.

The Digit Rounding software source code is available from CNES GitHub at

The supplement related to this article is available online at:

XD designed and implemented the Digit Rounding software and wrote most of the manuscript. AC performed most of the compression experiments and generated the analytical datasets. FG provided the scientific datasets used in the experiments, supervised the study, and contributed both to its design and to the writing of the manuscript.

The authors declare that they have no conflict of interest.

This work was funded by CNES and carried out at Thales Services. We would like to thank Hélène Vadon, Damien Desroches, Claire Pottier and Delphine Libby-Claybrough for their contributions to the SWOT section and for their help in proofreading. We also thank Charles S. Zender and the anonymous reviewers for their comments, which helped improve the quality of this paper.

This research has been supported by the Centre National d'Etudes Spatiales (CNES) (grant no. 170850/00).

This paper was edited by Steve Easterbrook and reviewed by Charles Zender, Dingwen Tao, and one anonymous referee.