A Parquet Cube alternative to store gridded data for data analytics and modeling
Abstract. The volume of data in the field of Earth observation has increased considerably, especially with the emergence of new generations of satellites and models that provide much more precise measurements and thus more voluminous data and files. One of the most traditional and popular data formats used in scientific and education communities (reference) is the NetCDF format. However, it was designed before the development of cloud storage and parallel processing in big data architectures. Alternative solutions, under open-source or proprietary licences, have appeared in the past few years (see Rasdaman, Opendatacube). These data cubes manage the storage and the services for easy access to the data, but they also alter the input information by applying conversions and/or reprojections to homogenize their internal data structure, introducing a bias in the scientific value of the data. The consequence is that users are driven into a closed infrastructure made of customized storage and access services.
The objective of this study is to propose a lightweight new open-source solution that stores gridded datasets in a native big data format and makes the data available for parallel processing, analytics or artificial-intelligence training. There is a demand for a single storage solution that would be open to different users:
- Scientists, who set up their prototypes and models in their customized environment and qualify their data for publication, as Copernicus datasets for instance;
- Operational teams, in charge of the daily processing of data, which may run in a different environment, who ingest the products into an archive and make them available to end users for additional modeling and data science processing.
Data ingestion and storage are key factors to study in order to ensure good performance in subsequent subsetting services and parallel processing.
Through typical end-user use cases, four storage and service implementations are compared via benchmarks:
- Unidata's THREDDS Data Server (TDS), a traditional NetCDF data access service built on the NetCDF-Java library,
- an extension of the THREDDS Data Server using an object store,
- the Pangeo/Dask/Python ecosystem,
- and the alternative Hadoop/Spark/Parquet solution, driven by CLS technical and business requirements.
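To make the Parquet storage model concrete, the following minimal sketch flattens a gridded NetCDF file into a long-format table and writes it as daily-partitioned Parquet. It assumes a Python environment with xarray, pandas and pyarrow; the file name, variable layout and partitioning column are illustrative only and do not reflect the authors' actual pipeline.

    import xarray as xr

    # Open a gridded NetCDF file (hypothetical name) with dimensions (time, lat, lon).
    ds = xr.open_dataset("sst_20210101.nc")

    # Flatten the cube: one row per (time, lat, lon) point, with coordinates materialized as columns.
    df = ds.to_dataframe().reset_index()

    # Daily partitioning key, in the spirit of the per-day layout discussed in the paper.
    df["day"] = df["time"].dt.strftime("%Y-%m-%d")

    # Each day becomes its own Parquet directory, readable in parallel by Spark, Dask or pandas.
    df.to_parquet("sst_parquet/", partition_cols=["day"], engine="pyarrow")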
Withdrawal notice
This preprint has been withdrawn.
Interactive discussion
Status: closed
RC1: 'Comment on gmd-2021-138', Michael Kuhn, 14 Oct 2021
The authors propose to use Parquet for storing gridded data efficiently. They compare NetCDF (using the THREDDS Data Server), NetCDF on S3, Pangeo-Zarr and Spark-Parquet using a number of different scenarios.
The comparison shown in the paper is interesting, but it suffers from a few flaws: it is hard to compare the different results since the experiments use different hardware configurations and different compression algorithms. Most notably, Pangeo-Zarr seems to perform best in the majority of experiments (sometimes being orders of magnitude faster than Spark-Parquet), yet the authors still propose using Spark-Parquet. Overall, the benefits of Spark-Parquet do not become clear. The paper would benefit significantly from more thorough explanations and uniform experiment configurations.
General comments and questions:
- 41: The citation format doesn't seem to adhere to the journal's guidelines (https://www.geoscientific-model-development.net/submission.html#references). This also applies to the following citations.
- Figure 1: All figures should be referenced explicitly in the text. This also applies to all other figures.
- 146: Why do you describe the system configuration here? It's also mentioned later.
- 182: The parallel processing scenario is not very detailed. How does it work? What happens in parallel? In general, all three scenarios could use some more explanations, for instance, showing their (pseudo) code.
- 192: The code for the different benchmarks should be provided to allow readers to compare/reproduce them. The Git repository only seems to contain code for converting NetCDF to Parquet.
- 193: How is the THREDDS implementation used, a Java application or something else? It does not become clear from the description.
- 229: Pangeo-Zarr uses LZ4 compression. Does NetCDF also use compression in your experiments? If not, how are you able to compare them fairly? Compression can add significant overhead.
- 250: How does this scalable process work?
- Figure 3: What does this figure show? It does not become clear from the text. Moreover, the text in the figure is quite blurry and its resolution should be increased.
- 301: Spark-Parquet seems to use Snappy, which has different performance characteristics than the LZ4 used previously. For a fair comparison, all approaches should use the same compression algorithm (or none); a codec-pinning sketch follows this review.
- Figure 5: This figure shows that Snappy and LZ4 result in different file sizes, which in turn means that the benchmarks will have to read/write different amounts of data, possibly skewing the comparison.
- 330: Every scenario seems to run on different hardware, making it hard to compare the results. 512 GB vs. 12 GB RAM will have significant impact on caching. Pangeo-Zarr uses GPFS, while NetCDF is stored on local disks. GPFS is expected to be much faster than a local disk.
- 350: How did you make sure they were using the same amount of memory? Did you also account for differences in CPU performance (apart from using the same amount of cores)?
- Table 2: Why is THREDDS-S3 larger than NetCDF? Aren't they both using NetCDF?
- 373: What are these numbers supposed to tell the reader?
- 387: Pangeo-Zarr and Spark-Parquet were also using NetCDF then? According to Table 2, all of their datasets were below 1 TB.
- 395: What does "10 cores of 4 Gb" mean? 4 GB in total or 4 GB per core?
- Figure 9: The different scaling on these figures implies that Pangeo-Zarr and Spark-Parquet had similar performance, even though Pangeo-Zarr was significantly faster.
- Figure 11: The different scaling makes it hard to compare results again.
- Figures 15 and 16: These figures seem to imply that Pangeo-Zarr has much better scaling behavior than Spark-Parquet. Can this be explained somehow?
- Figure 18: Why are you only using 1 to 3 cores here?
- 534: Why do you compare 50 to 40 cores? That makes it hard to judge the differences.
- 539: It seems NetCDF-S3 has several limitations. Have you also considered HDF5 (https://www.hdfgroup.org/solutions/enterprise-support/cloud-amazon-s3-storage-hdf5-connector/)?
- 555: You mention that Pangeo-Zarr is limited to Python. Isn't THREDDS also limited to Java? Which languages are supported by Spark-Parquet?
- 586: How do you support this conclusion? Pangeo-Zarr seems to be the fastest format and Python most likely has the largest community. Moreover, Pangeo-Zarr shows much better scaling than Spark-Parquet in a few of your experiments.
- 615: The last access dates for all references seem to be in French.
Layout problems, typos etc.:
- 186: "4 go" - Is this supposed to be "4 GB"?
- 209: "manor" - Should be "manner".
- Figure 4: This figure is also blurry.
- Figures 5 and 6: The axis descriptions are very hard to read due to the low resolution.
- 339: "Go" - Should be "GB". Also applies to the following lines.
- Figure 11: The second row seems to be titled incorrectly, THREDDS-NC is missing and instead Spark-Parquet is shown twice.
- Figure 19: This figure is so blurry that it's impossible to read.
Citation: https://doi.org/10.5194/gmd-2021-138-RC1
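To illustrate the compression-parity point raised in the comments above (lines 229 and 301), here is a minimal sketch that pins the same codec on both the Zarr and Parquet writes. It assumes xarray, numcodecs and pyarrow; the dataset, variable name and paths are hypothetical, and Blosc-LZ4 and Parquet's LZ4 codec are not byte-identical implementations, but this at least removes the Snappy/LZ4 asymmetry between the setups.

    import xarray as xr
    from numcodecs import Blosc

    ds = xr.open_dataset("sst_20210101.nc")  # hypothetical gridded input

    # Zarr: request LZ4 through the Blosc compressor used by the Pangeo stack.
    ds.to_zarr("sst.zarr", mode="w",
               encoding={"analysed_sst": {"compressor": Blosc(cname="lz4")}})

    # Parquet via pandas/pyarrow: the LZ4 codec instead of the Snappy default.
    ds.to_dataframe().reset_index().to_parquet("sst.parquet",
                                               compression="lz4", engine="pyarrow")

On the Spark side, the equivalent would be writing Parquet with the compression option set to lz4 (spark.sql.parquet.compression.codec).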
RC2: 'Comment on gmd-2021-138', Peter Baumann, 26 Oct 2021
The authors perform a comparison of several systems on 3D and 4D gridded data, with a particular view on parallelization. Tools addressed are THREDDS/NetCDF on standard file system and an object store, pangeo/Dask, and Hadoop/Spark/Parquet. The conceptual model is a 4D cube where the vertical dimension is "nullable" in case of 3D data. This resembles the model of the Barrodale engine on top of Informix, for example.
In Section 2, I get puzzled about the experimental setup description:
- the "criteria identified to stress" are not justified and sometimes surprising: why is it relevant for storage (!) whether a pixel is ocean or land? why is complex processing, heterogeneous data fusion, etc. not considered?
- "enrichment of CSV locations": CSV = comma-separated values? if so, why is that format choice important over, e.g., JSON? what does enrichment mean, and what is the scenario? why is it relevant?
- "parallelized processing" (check language use!) is not a user scenario, but an implementation detail. What would be interesting is what operations exactly to parallelize - for example, an edge filter or matrix operations are harder to parallelize than the trivial pixel operations such as NDVI or subsetting.Generally there seems to be a lack in concise description of the relevant choices, impacts, and measures. Just one example (p 11): "We repeated the tests several times to obtain the most reliable metrics." A rigorous approach might "run all tests 5 times on a [hot|cold] setup, discarding the maximum and minimum value and averaging the 3 remaining measurements". The algorithms used for the scenarios, such as extracting significant continuous areas and enrichment, are only given cursorily, without a rigorous pseudo code or mathematical description.
In 3.3, it is claimed that datacubes need reprojection, which should be avoided. However, for join/fusion of datasets in different projections there needs to be a reprojection. And rescaling (which likewise involves resampling) is performed routinely by the approach. So I do not see the substantial difference. In fact, substantial preprocessing takes place at datacube creation time, massaging data to make them suitable for the processing later on. For example, all cubes are forced into the same space/time resolution - a brute-force method, and certainly more dangerous from a scientist's perspective than reprojection. Other datacube approaches work on the original data and perform dynamic recombination.
The Parquet format does not seem efficient for cubes. By materializing the coordinates for each point the data volume is blown up immensely, and processing (including data bus transport between RAM and CPU) gets slowed down. Combined with the 3x explosion contributed by HDFS (not to speak of its inflexible page size), it is not clear how this approach can be efficient in comparison to others published. Certainly not "green computing"!
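A rough back-of-the-envelope illustration of that blow-up, under an assumed grid size (the dimensions below are hypothetical, not figures taken from the paper):

    # Dense array storage vs. a long-format table that materializes
    # (time, depth, lat, lon) for every value. Grid dimensions are assumed.
    nt, nz, ny, nx = 24, 50, 720, 1440           # hypothetical hourly 4D grid
    n_values = nt * nz * ny * nx

    dense_bytes = n_values * 4                   # float32 value only
    long_bytes = n_values * (4 * 8 + 4)          # four float64 coordinates + one float32 value

    print(f"uncompressed blow-up: {long_bytes / dense_bytes:.0f}x")   # 9x, before HDFS 3x replication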
Partitioning, a well-known technique for gridded data, is applied here as well. The parameters chosen are not justified, though: why regular partitioning? why a partition length of 1 day along time? We just learn "the partitioning-per-day makes sense", probably because the demonstration scenarios have been pre-trimmed to that. This will be problematic in a general-purpose operational deployment.
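For reference, the regular per-day partitioning discussed here corresponds to a write along the lines of the following PySpark sketch; the input path and column names are assumptions, and whether one day is the right partition length depends on the query mix.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical long-format table with a "time" timestamp column.
    df = spark.read.parquet("sst_long_format/")

    # Regular partitioning: one Parquet directory per day.
    df.withColumn("day", F.to_date("time")) \
      .write.partitionBy("day").mode("overwrite").parquet("sst_by_day/")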
Sadly, the performance comparison of the different implementations was done on rather different infrastructure, so comparison of the results is problematic.
In Table 3, performance results for single-file conversion are reported. There is a breathtaking span from 1 to 12816 seconds per file. Unfortunately, it is not explained sufficiently.
Performance results (Figure 8) are a little hard to follow due to nonlinearity - eg, time axis extraction starts with 1D (incidentally the stored grid resolution) and then scales with a factor 30 (?), 3, 2, 2. Equidistant spacing would help understanding the results.
Surprisingly, performance degrades quadratically with increasing data volume returned, whereas other tools in the field commonly show linear behavior. Parallelization does not seem to help much.
Also surprising (and unexplained) is the non-monotonicity in Fig 11 for THREDDS / North Sea, while THREDDS / global shows reasonable monotonicity. Further, in Fig 13 there is a huge outlier for THREDDS / hourly / 1M - is this a measurement issue, or a real result? Unfortunately, no explanation is attempted.
On p 18 I would like to understand how "significant" data are determined algorithmically - as it stands, the system load generated cannot be estimated. Further, as "continuous" areas are retrieved there is likely some kernel operation involved. How does kernel computation work at partition boundaries, is it still correct (i.e., does it fetch values from neighboring partitions where necessary)?
Looking at the results on p 19: Performing a simple subsetting returning an estimated 5 MB of data using 10 cores in 30 seconds is breathtakingly slow - other systems can do that in less than 1 second.
Overall, given that the data set in the conclusion is described as 2 TB and Spark/Dask were used on 50/40 cores: that means about 40 GB per node - loading that into the RAM of each node and using just numpy etc. should be faster by orders of magnitude. Shouldn't the test go well beyond the cumulated RAM?
Bottom line, what I take home is
- THREDDS is not really scalable (which is confirmed by other studies)
- Parquet works well in situations where data and scenarios are carefully aligned
- benchmark deployments have so many special tweaks and differences that a comparison is difficult
- both Spark-Parquet and Pangeo-Zarr fall significantly behind the performance of other tools around
- dynamic partitioning (as studied by Paula Furtado, for example) has not yet found wide recognition
Editorial comments:
- p 3: URLs in the text make it ugly to read; better make them references.
- p 3: "N variables" - what does that want to tell us? Unknown, variable, or...?
- maybe recheck for typos, such as p 4 "librairies"
- best use uniform nomenclature, not "4G" and "4 go" for 4 GB
- p 5: "The number of cores is increased for the Pangeo-Zarr and the Spark-Parquet environments." ...why? what is the number of cores there? Best motivate such decisions.
- Figure 9: as the diagram lines are greyscale they are not easy to distinguish. If color is not possible consider dashed lines etc.
- check for French words occurring, such as "Novembre"
Citation: https://doi.org/10.5194/gmd-2021-138-RC2
EC1: 'Comment on gmd-2021-138', Juan Antonio Añel, 26 Oct 2021
For the record, after checking the reports by the reviewers, my decision on the manuscript is 'Major revisions'.
Citation: https://doi.org/10.5194/gmd-2021-138-EC1
AC1: 'AC1', elisabeth lambert, 17 Nov 2021
Dear Michael Kuhn,
I would like first to thank you for your precise comments and questions, which have helped us to improve the presentation of our work.
Please find attached a Word document where you can find a reply for all the items you noted in RC1.
Best regards.
Jean-Michel
AC2: 'AC-2', elisabeth lambert, 17 Nov 2021
Dear Peter Baumann,
It was an honor to read your comments, and I would like first to thank you for your precise comments and questions, which have helped us to improve the quality of the paper.
Please find attached a PDF document where you can find a reply for all the items you noted in RC2.
Best regards.
Jean-Michel