A Parquet Cube alternative to store gridded data for data analytics and modeling
Abstract. The volume of data in the field of Earth observation has increased considerably, especially with the emergence of new generations of satellites and models providing much more precise measurements and thus far larger data volumes and files. One of the most traditional and popular data formats used in the scientific and education communities (reference) is the NetCDF format. However, it was designed before the development of cloud storage and parallel processing in big data architectures. Alternative solutions, under open-source or proprietary licences, have appeared in the past few years (see Rasdaman, Opendatacube). These data cubes manage the storage and the services providing easy access to the data, but they also alter the input information by applying conversions and/or reprojections to homogenize their internal data structure, which introduces a bias in the scientific value of the data. The consequence is that they drive users into a closed infrastructure made of customized storage and access services.
The objective of this study is to propose a new lightweight open-source solution able to store gridded datasets in a native big data format and make the data available for parallel processing, analytics, or artificial intelligence learning. There is a demand for a single storage solution that would be open to different users:
- Scientists, who set up their prototypes and models in their customized environments and qualify their data for publication, for instance as Copernicus datasets;
- Operational teams, in charge of the daily processing of data, possibly run in another environment, who ingest the products into an archive and make them available to end users for additional modeling and data science processing.
Data ingestion and storage are key factors to study in order to ensure good performance of downstream subsetting services and parallel processing.
Through typical end-user use cases, four storage and service implementations are compared through benchmarks:
- Unidata's THREDDS Data Server (TDS), a traditional NetCDF data access service built on the NetCDF-Java library,
- an extension of the THREDDS Data Server using an object store,
- the Pangeo/Dask/Python ecosystem,
- and the alternative Hadoop/Spark/Parquet solution, driven by CLS technical and business requirements.
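As an illustration of the Parquet approach at the core of the study, a minimal sketch of converting a gridded NetCDF file into a long-format Parquet table with xarray and pandas (the file name and variable name are hypothetical) could look like:

```python
import xarray as xr

# Hypothetical input: a gridded NetCDF file containing a "sea_surface_temperature" variable
ds = xr.open_dataset("input.nc")

# Flatten the grid into a long-format table: one row per (time, lat, lon) point,
# with the coordinates materialized as ordinary columns
df = ds["sea_surface_temperature"].to_dataframe().reset_index()

# Write a columnar Parquet file that Spark, Dask or pandas can read back in parallel
df.to_parquet("sea_surface_temperature.parquet", engine="pyarrow")
```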
Withdrawal notice
This preprint has been withdrawn.
Interactive discussion
Status: closed
RC1: 'Comment on gmd-2021-138', Michael Kuhn, 14 Oct 2021
The authors propose to use Parquet for storing gridded data efficiently. They compare NetCDF (using the THREDDS Data Server), NetCDF on S3, Pangeo-Zarr and Spark-Parquet using a number of different scenarios.
The comparison shown in the paper is interesting, but suffers from a few flaws: it is hard to compare the different results since the experiments use different hardware configurations and different compression algorithms. Most notably, Pangeo-Zarr seems to perform best in the majority of experiments (sometimes being orders of magnitude faster than Spark-Parquet), but the authors still propose using Spark-Parquet. Overall, the benefits of Spark-Parquet do not become clear. The paper would benefit significantly from more thorough explanations and uniform experiment configurations.
General comments and questions:
- 41: The citation format doesn't seem to adhere to the journal's guidelines (https://www.geoscientific-model-development.net/submission.html#references). This also applies to the following citations.
- Figure 1: All figures should be referenced explicitly in the text. This also applies to all other figures.
- 146: Why do you describe the system configuration here? It's also mentioned later.
- 182: The parallel processing scenario is not very detailed. How does it work? What happens in parallel? In general, all three scenarios could use some more explanations, for instance, showing their (pseudo) code.
- 192: The code for the different benchmarks should be provided to allow readers to compare/reproduce them. The Git repository only seems to contain code for converting NetCDF to Parquet.
- 193: How is the THREDDS implementation used, a Java application or something else? It does not become clear from the description.
- 229: Pangeo-Zarr uses LZ4 compression. Does NetCDF also use compression in your experiments? If not, how are you able to compare them fairly? Compression can add significant overhead.
- 250: How does this scalable process work?
- Figure 3: What does this figure show? It does not become clear from the text. Moreover, the text in the figure is quite blurry and its resolution should be increased.
- 301: Spark-Parquet seems to use Snappy, which has different performance characteristics than LZ4 used previously. For a fair comparison, all approaches should use the same compression algorithm (or none).
- Figure 5: This figure shows that Snappy and LZ4 result in different file sizes, which in turn means that the benchmarks will have to read/write different amounts of data, possibly skewing the comparison.
- 330: Every scenario seems to run on different hardware, making it hard to compare the results. 512 GB vs. 12 GB RAM will have significant impact on caching. Pangeo-Zarr uses GPFS, while NetCDF is stored on local disks. GPFS is expected to be much faster than a local disk.
- 350: How did you make sure they were using the same amount of memory? Did you also account for differences in CPU performance (apart from using the same amount of cores)?
- Table 2: Why is THREDDS-S3 larger than NetCDF? Aren't they both using NetCDF?
- 373: What are these numbers supposed to tell the reader?
- 387: Pangeo-Zarr and Spark-Parquet were also using NetCDF then? According to Table 2, all of their datasets were below 1 TB.
- 395: What does "10 cores of 4 Gb" mean? 4 GB in total or 4 GB per core?
- Figure 9: The different scaling on these figures implies that both Pangeo-Zarr and Spark-Parquet had similar performance but Pangeo-Zarr was significantly faster.
- Figure 11: The different scaling makes it hard to compare results again.
- Figures 15 and 16: These figures seem to imply that Pangeo-Zarr has much better scaling behavior than Spark-Parquet. Can this be explained somehow?
- Figure 18: Why are you only using 1 to 3 cores here?
- 534: Why do you compare 50 to 40 cores? That makes it hard to judge the differences.
- 539: It seems NetCDF-S3 has several limitations. Have you also considered HDF5 (https://www.hdfgroup.org/solutions/enterprise-support/cloud-amazon-s3-storage-hdf5-connector/)?
- 555: You mention that Pangeo-Zarr is limited to Python. Isn't THREDDS also limited to Java? Which languages are supported by Spark-Parquet?
- 586: How do you support this conclusion? Pangeo-Zarr seems to be the fastest format and Python most likely has the largest community. Moreover, Pangeo-Zarr shows much better scaling than Spark-Parquet in a few of your experiments.
- 615: The last access dates for all references seem to be in French.
Layout problems, typos etc.:
- 186: "4 go" - Is this supposed to be "4 GB"?
- 209: "manor" - Should be "manner".
- Figure 4: This figure is also blurry.
- Figures 5 and 6: The axis descriptions are very hard to read due to the low resolution.
- 339: "Go" - Should be "GB". Also applies to the following lines.
- Figure 11: The second row seems to be titled incorrectly, THREDDS-NC is missing and instead Spark-Parquet is shown twice.
- Figure 19: This figure is so blurry that it's impossible to read.
Citation: https://doi.org/10.5194/gmd-2021-138-RC1
RC2: 'Comment on gmd-2021-138', Peter Baumann, 26 Oct 2021
The authors perform a comparison of several systems on 3D and 4D gridded data, with a particular view on parallelization. Tools addressed are THREDDS/NetCDF on standard file system and an object store, pangeo/Dask, and Hadoop/Spark/Parquet. The conceptual model is a 4D cube where the vertical dimension is "nullable" in case of 3D data. This resembles the model of the Barrodale engine on top of Informix, for example.
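A hypothetical column layout matching that conceptual model, sketched here with pyarrow (field names and types are assumptions, not taken from the manuscript), would be:

```python
import pyarrow as pa

# Hypothetical 4D-cube row schema with a nullable depth column for 3D datasets
schema = pa.schema([
    pa.field("time", pa.timestamp("ms"), nullable=False),
    pa.field("depth", pa.float32(), nullable=True),   # null for 3D products
    pa.field("lat", pa.float32(), nullable=False),
    pa.field("lon", pa.float32(), nullable=False),
    pa.field("value", pa.float32(), nullable=False),
])
print(schema)
```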
In Section 2, I get puzzled about the experimental setup description:
- the "criteria identified to stress" are not justified and sometimes surprising: why is it relevant for storage (!) whether a pixel is ocean or land? why is complex processing, heterogeneous data fusion, etc. not considered?
- "enrichment of CSV locations": CSV = comma-separated values? if so, why is that format choice important over, e.g., JSON? what does enrichment mean, and what is the scenario? why is it relevant?
- "parallelized processing" (check language use!) is not a user scenario, but an implementation detail. What would be interesting is what operations exactly to parallelize - for example, an edge filter or matrix operations are harder to parallelize than the trivial pixel operations such as NDVI or subsetting.Generally there seems to be a lack in concise description of the relevant choices, impacts, and measures. Just one example (p 11): "We repeated the tests several times to obtain the most reliable metrics." A rigorous approach might "run all tests 5 times on a [hot|cold] setup, discarding the maximum and minimum value and averaging the 3 remaining measurements". The algorithms used for the scenarios, such as extracting significant continuous areas and enrichment, are only given cursorily, without a rigorous pseudo code or mathematical description.
In 3.3, it is claimed that datacubes need reprojection which should be avoided. However, for join/fusion of datasets in different projections there needs to be a reprojection. And rescaling (which likewise involves resampling) is performed routinely by the approach. So I do not see the substantial difference. In fact, substantial preprocessing takes place at datacube creation time, massaging data to make them suitable for the processing later on. For example, all cubes are forced into the same space/time resolution - a brute-force method, and certainly more dangerous from a scientist perspective than reprojection. Other datacube approaches work on the original data and perform dynamic recombination.
The Parquet format does not seem efficient for cubes. By materializing the coordinates for each point the data volume is blown up immensely, and processing (including data bus transport between RAM and CPU) gets slowed down. Combine this with the 3x explosion contributed by HDFS (not to speak of its inflexible page size), and it is not clear how this approach can be efficient in comparison to others published. Certainly not "green computing"!
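To illustrate the blow-up, flattening even a modest hypothetical cube into a row-per-point table (as a Parquet layout requires) materializes every coordinate explicitly; a rough sketch with made-up dimensions:

```python
import numpy as np
import xarray as xr

# Hypothetical daily cube: 24 time steps x 10 depths on a 1-degree grid, float32 values
cube = xr.DataArray(
    np.zeros((24, 10, 180, 360), dtype="float32"),
    dims=("time", "depth", "lat", "lon"),
    coords={
        "time": np.arange(24),
        "depth": np.arange(10.0),
        "lat": np.linspace(-89.5, 89.5, 180),
        "lon": np.linspace(-179.5, 179.5, 360),
    },
    name="temperature",
)
print(f"dense array: {cube.nbytes / 1e6:.0f} MB")  # ~62 MB

# Long-format (row-per-point) table: every coordinate is materialized per row
table = cube.to_dataframe().reset_index()
print(f"long table : {table.memory_usage(deep=True).sum() / 1e6:.0f} MB")  # several times larger
```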
Partitioning, a well-known technique for gridded data, is applied here as well. The parameters chosen are not justified, though: why regular partitioning? why a partition length of 1 day along time? We just learn "the partitioning-per-day makes sense", probably because the demonstration scenarios have been pre-trimmed to that. This will be problematic in a general-purpose operational deployment.
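For reference, the per-day partitioning under discussion amounts to something like the following hedged PySpark sketch (paths and column names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("parquet-cube").getOrCreate()

# Hypothetical long-format dataset of grid points with a timestamp column
df = spark.read.parquet("cube_unpartitioned.parquet")

# Regular per-day partitioning: one directory per day along the time axis
(df.withColumn("day", F.to_date("time"))
   .write.partitionBy("day")
   .parquet("cube_partitioned.parquet"))
```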
Sadly, the performance comparison of the different implementations was done on rather different infrastructure, so comparison of the results is problematic.
In Table 3, performance results for single-file conversion are reported. There is a breathtaking span from 1 to 12816 seconds per file. Unfortunately, it is not explained sufficiently.
Performance results (Figure 8) are a little hard to follow due to nonlinearity - eg, time axis extraction starts with 1D (incidentally the stored grid resolution) and then scales with a factor 30 (?), 3, 2, 2. Equidistant spacing would help understanding the results.
Surprisingly, performance degrades quadratically with increasing data volume returned, whereas other tools in the field commonly show linear behavior. Parallelization does not seem to help much.
Also surprising (and unexplained) is the non-monotonicity in Fig 11 for THREDDS / North Sea, while THREDDS / global shows reasonable monotonicity. Further, in Fig 13 there is a huge outlier for THREDDS / hourly / 1M - is this a measurement issue, or a real result? Unfortunately, no explanation is attempted.
On p 18 I would like to understand how "significant" data are determined algorithmically - as it stands, the system load generated cannot be estimated. Further, as "continuous" areas are retrieved there is likely some kernel operation involved. How does kernel computation work at partition boundaries, is it still correct (i.e., fetches values from neighboring partitions where necessary)?
Looking at the results on p 19: Performing a simple subsetting returning an estimated 5 MB of data using 10 cores in 30 seconds is breathtakingly slow - other systems can do that in less than 1 second.
Overall, seeing that the data set in the conclusion is described as having 2 TB and that Spark/Dask were used on 50/40 cores: that means about 40 GB per node - loading that into RAM on each node and using just numpy etc. should be faster by orders of magnitude. Shouldn't the test go well beyond the cumulated RAM?
Bottom line, what I take home is
- THREDDS is not really scalable (which is confirmed by other studies)
- Parquet works well in situations where data and scenarios are carefully aligned
- benchmark deployments have so many special tweaks and differences that a comparison is difficult
- both Spark-Parquet and Pangeo-Zarr fall significantly behind the performance of other tools around
- dynamic partitioning (as studied by Paula Furtado, for example) has not yet found wide recognition
Editorial comments:
- p 3: URLs in the text make it ugly to read; better make them references.
- p 3: "N variables" - what does that want to tell us? Unknown, variable, or...?
- maybe recheck for typos, such as p 4 "librairies"
- best use uniform nomenclature, not "4G" and "4 go" for 4 GB
- p 5: "The number of cores is increased for the Pangeo-Zarr and the Spark-Parquet environments." ...why? what is the number of cores there? Best motivate such decisions.
- Figure 9: as the diagram lines are greyscale they are not easy to distinguish. If color is not possible consider dashed lines etc.
- check for French words occurring, such as "Novembre"
Citation: https://doi.org/10.5194/gmd-2021-138-RC2
EC1: 'Comment on gmd-2021-138', Juan Antonio Añel, 26 Oct 2021
For the record, after checking the reports by the reviewers, my decision on the manuscript is 'Major revisions'.
Citation: https://doi.org/10.5194/gmd-2021-138-EC1
AC1: 'AC1', elisabeth lambert, 17 Nov 2021
Dear Michael Kuhn,
I would like first to thank you for your precise comments and questions, which have helped us to improve the presentation of our work.
Please find attached a Word document where you can find a reply for all the items you noted in RC1.
Best regards.
Jean-Michel
AC2: 'AC-2', elisabeth lambert, 17 Nov 2021
Dear Peter Baumann,
It was an honor to read your comments, and I would like first to thank you for your precise comments and questions, which have helped us to improve the quality of the paper.
Please find attached a PDF document where you can find a reply for all the items you noted in RC2.
Best regards.
Jean-Michel