https://doi.org/10.5194/gmd-2021-138
Submitted as: development and technical paper | 02 Jul 2021
Status: this preprint has been withdrawn by the authors.

A Parquet Cube alternative to store gridded data for data analytics and modeling

Jean-Michel Zigna, Reda Semlal, Flavien Gouillon, Ethan Davis, Elisabeth Lambert, Frédéric Briol, Romain Prod-Homme, Sean Arms, and Lionel Zawadzki

Abstract. The volume of Earth observation data has increased considerably, especially with the emergence of new generations of satellites and models that provide much more precise measurements and thus much larger datasets and files. One of the most traditional and popular data formats in the scientific and education communities (reference) is NetCDF. However, it was designed before the development of cloud storage and parallel processing in big data architectures. Alternative solutions, either open source or under proprietary licences, have appeared in the past few years (see Rasdaman, Opendatacube). These data cubes manage storage and provide services for easy access to the data, but they also alter the input information by applying conversions and/or reprojections to homogenize their internal data structure, which introduces a bias in the scientific value of the data. As a consequence, users are drawn into a closed infrastructure made of customized storage and access services.

The objective of this study is to propose a new, lightweight open-source solution that can store gridded datasets in a native big data format and make the data available for parallel processing, analytics, or artificial intelligence learning. There is demand for a single storage solution that would be open to different users:

  • Scientists, setting up their prototypes and models in their customized environment and qualifying their data for publication, for instance as Copernicus datasets;
  • Operational teams, in charge of the daily processing of data which can be run in another environment, to ingest the product in an archive and make it available to end-users for additional model and data science processing.

Data ingestion and storage are key factors to study in order to ensure good performance of the downstream subsetting access services and parallel processing.
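To make the storage idea concrete, here is a minimal sketch of how a gridded dataset could be flattened into a column-oriented table, which is the data model that Parquet stores, and then subset by a bounding box. It is not the authors' implementation: the function names `grid_to_columns` and `subset_bbox`, the column names, and the tiny sample grid are all hypothetical and chosen only for illustration. Coordinates and values are copied unchanged, with no conversion or reprojection.

```python
from itertools import product


def grid_to_columns(times, lats, lons, values):
    """Flatten a (time, lat, lon) grid into column-oriented lists.

    values[t][i][j] is the measurement at times[t], lats[i], lons[j].
    Coordinates are copied as-is: no reprojection or unit conversion.
    """
    columns = {"time": [], "lat": [], "lon": [], "value": []}
    for (t, time), (i, lat), (j, lon) in product(
        enumerate(times), enumerate(lats), enumerate(lons)
    ):
        columns["time"].append(time)
        columns["lat"].append(lat)
        columns["lon"].append(lon)
        columns["value"].append(values[t][i][j])
    return columns


def subset_bbox(columns, lat_min, lat_max, lon_min, lon_max):
    """Keep only the rows whose coordinates fall inside the bounding box."""
    keep = [
        k
        for k, (lat, lon) in enumerate(zip(columns["lat"], columns["lon"]))
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
    ]
    return {name: [col[k] for k in keep] for name, col in columns.items()}


# Hypothetical 1 x 2 x 3 grid (one time step, two latitudes, three longitudes).
times = ["2021-07-02"]
lats = [10.0, 20.0]
lons = [0.0, 5.0, 10.0]
values = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]

table = grid_to_columns(times, lats, lons, values)
subset = subset_bbox(table, lat_min=15.0, lat_max=25.0, lon_min=0.0, lon_max=5.0)
```

In practice, a dictionary of columns like `table` could be written to a Parquet file with a library such as pyarrow (`pyarrow.Table.from_pydict` followed by `pyarrow.parquet.write_table`). That is one possible choice and is not stated in the abstract.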

Through typical end users’ use cases, four storage and services implementations are compared through benchmarks:

  • Unidata's THREDDS Data Server (TDS), a traditional NetCDF data access service built on the NetCDF-Java library,
  • an extension of the THREDDS Data Server using an object store,
  • the Pangeo/Dask/Python ecosystem,
  • and the alternative Hadoop/Spark/Parquet solution, driven by CLS technical and business requirements.


Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.

Interactive discussion

Status: closed

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on gmd-2021-138', Michael Kuhn, 14 Oct 2021
  • RC2: 'Comment on gmd-2021-138', Peter Baumann, 26 Oct 2021
  • EC1: 'Comment on gmd-2021-138', Juan Antonio Añel, 26 Oct 2021
  • AC1: 'AC1', elisabeth lambert, 17 Nov 2021
  • AC2: 'AC-2', elisabeth lambert, 17 Nov 2021


Viewed

Total article views: 1,958 (including HTML, PDF, and XML)
  • HTML: 1,144
  • PDF: 758
  • XML: 56
  • Total: 1,958
  • BibTeX: 39
  • EndNote: 35
Views and downloads calculated since 02 Jul 2021.

Viewed (geographical distribution)

Total article views: 1,851 (including HTML, PDF, and XML), of which 1,851 have a defined geography and 0 are of unknown origin.
Latest update: 19 May 2024

Short summary
The Parquet Cube storage alternative presented here is compared with the Pangeo and THREDDS platforms for accessing gridded data for large-scale processing and modeling. When the three implementations are stressed through three data scientists' scenarios, the Parquet Cube alternative appears to be a good candidate for sharing gridded data in a cloud environment across different communities of users. This open-source alternative can be enriched with additional services to subset, enrich, or explore data.