https://doi.org/10.5194/gmd-2021-138
Submitted as: development and technical paper
02 Jul 2021
Status: this preprint has been withdrawn by the authors.

A Parquet Cube alternative to store gridded data for data analytics and modeling

Jean-Michel Zigna1, Reda Semlal1, Flavien Gouillon3, Ethan Davis4, Elisabeth Lambert1, Frédéric Briol2, Romain Prod-Homme1, Sean Arms4, and Lionel Zawadzki3
  • 1Software Engineering Division, CLS, Toulouse, 31520, France
  • 2Environmental Business Unit, CLS, Toulouse, 31520, France
  • 3Earth Observation division, altimetry and radar department, CNES, Toulouse, 31400, France
  • 4Division of the University Corporation for Atmospheric Research, UNIDATA, Boulder CO 80301, United States

Abstract. The volume of Earth observation data has increased considerably, especially with the emergence of new generations of satellites and models that provide much more precise measurements and therefore much larger data volumes and files. One of the most traditional and popular data formats in the scientific and education communities (reference) is NetCDF. However, it was designed before the development of cloud storage and of parallel processing in big data architectures. Alternative solutions, under open source or proprietary licences, have appeared in the past few years (see Rasdaman, Opendatacube). These data cubes manage storage and provide services for easy access to the data, but they also alter the input information by applying conversions and/or reprojections to homogenize their internal data structure, which introduces a bias in the scientific value of the data. As a consequence, users are driven into a closed infrastructure made of customized storage and access services.

The objective of this study is to propose a new, lightweight open source solution that can store gridded datasets in a native big data format and make the data available for parallel processing, analytics or artificial intelligence learning. There is a demand for a single storage solution that would be open to different users:

  • Scientists, who set up their prototypes and models in their customized environment and qualify their data for publication, for instance as Copernicus datasets;
  • Operational teams, in charge of the daily processing of data (which can be run in another environment), who ingest the product into an archive and make it available to end users for additional modeling and data science processing.

Data ingestion and storage are key factors to study in order to ensure good performance of the subsequent subsetting access services and parallel processing.
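To make the storage idea more concrete, the following minimal sketch shows one way a gridded field could be flattened into a table with one row per grid cell, which is the kind of tabular layout that columnar formats such as Parquet store. It is an illustration only, not the authors' implementation: the function names `grid_to_rows` and `subset`, the column names, and the sample values are all hypothetical, and only the Python standard library is used. Writing the rows to an actual Parquet file would require a Parquet library, which is left out here.

```python
from itertools import product


def grid_to_rows(times, lats, lons, values):
    """Flatten a (time, lat, lon) grid into one row per grid cell.

    values[t][i][j] is the measurement at times[t], lats[i], lons[j].
    Coordinates and values are copied unchanged (no conversion or reprojection).
    """
    rows = []
    for (t, time), (i, lat), (j, lon) in product(
        enumerate(times), enumerate(lats), enumerate(lons)
    ):
        rows.append(
            {"time": time, "lat": lat, "lon": lon, "value": values[t][i][j]}
        )
    return rows


def subset(rows, lat_min, lat_max, lon_min, lon_max):
    """Bounding-box subsetting reduces to a row filter on the coordinate columns."""
    return [
        r
        for r in rows
        if lat_min <= r["lat"] <= lat_max and lon_min <= r["lon"] <= lon_max
    ]


# Hypothetical 1 x 2 x 3 grid with made-up values, for illustration only.
times = ["2021-07-02"]
lats = [43.0, 43.5]
lons = [1.0, 1.5, 2.0]
values = [[[0.10, 0.12, 0.11], [0.09, 0.13, 0.14]]]

rows = grid_to_rows(times, lats, lons, values)
box = subset(rows, 43.5, 43.5, 1.5, 2.0)
```

In this layout, a spatial or temporal subset is a filter on coordinate columns, which is the kind of operation that parallel engines can distribute over partitions of the table. The trade-off is that coordinates are repeated on every row, so the approach relies on the columnar format's encoding and compression to keep storage compact.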

Through typical end-user use cases, four storage and service implementations are compared by means of benchmarks:

  • Unidata's THREDDS Data Server (TDS), a traditional NetCDF data access service built on the NetCDF-Java library,
  • an extension of the THREDDS Data Server using an object store,
  • the Pangeo/Dask/Python ecosystem,
  • and the alternative Hadoop/Spark/Parquet solution, driven by CLS technical and business requirements.

Interactive discussion

Status: closed

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on gmd-2021-138', Michael Kuhn, 14 Oct 2021
  • RC2: 'Comment on gmd-2021-138', Peter Baumann, 26 Oct 2021
  • EC1: 'Comment on gmd-2021-138', Juan Antonio Añel, 26 Oct 2021
  • AC1: 'AC1', Elisabeth Lambert, 17 Nov 2021
  • AC2: 'AC-2', Elisabeth Lambert, 17 Nov 2021


Viewed

Total article views: 1,267 (including HTML, PDF, and XML)
  • HTML: 847
  • PDF: 388
  • XML: 32
  • Total: 1,267
  • BibTeX: 15
  • EndNote: 16
Views and downloads calculated since 02 Jul 2021.

Viewed (geographical distribution)

Total article views: 1,182 (including HTML, PDF, and XML) Thereof 1,182 with geography defined and 0 with unknown origin.
Latest update: 08 Dec 2022

Short summary
The Parquet Cube storage alternative presented here is compared with the Pangeo and THREDDS platforms for accessing gridded data for large-scale processing and modeling. After stressing the 3 implementations through 3 data scientists' scenarios, the Parquet Cube alternative appears to be a good candidate for sharing gridded data in a cloud environment across different communities of users. This open source alternative can be extended with additional services to subset, enrich or explore data.