https://doi.org/10.5194/gmd-2021-138

Submitted as: development and technical paper | 02 Jul 2021

Review status: this preprint is currently under review for the journal GMD.

A Parquet Cube alternative to store gridded data for data analytics and modeling

Jean-Michel Zigna1, Reda Semlal1, Flavien Gouillon3, Ethan Davis4, Elisabeth Lambert1, Frédéric Briol2, Romain Prod-Homme1, Sean Arms4, and Lionel Zawadzki3
  • 1Software Engineering Division, CLS, Toulouse, 31520, France
  • 2Environmental Business Unit, CLS, Toulouse, 31520, France
  • 3Earth Observation division, altimetry and radar department, CNES, Toulouse, 31400, France
  • 4Division of the University Corporation for Atmospheric Research, UNIDATA, Boulder CO 80301, United States

Abstract. The volume of Earth observation data has increased considerably, especially with the emergence of new generations of satellites and models that provide much more precise measurements and therefore much larger data volumes and files. One of the most traditional and popular data formats used in the scientific and educational communities (reference) is the NetCDF format. However, it was designed before the development of cloud storage and of parallel processing in big data architectures. Alternative solutions, under open source or proprietary licences, have appeared in the past few years (see Rasdaman, Opendatacube). These data cubes manage storage and services to provide easy access to the data, but they also alter the input information by applying conversions and/or reprojections to homogenize their internal data structure, which introduces a bias in the scientific value of the data. As a consequence, users are driven into a closed infrastructure made of customized storage and access services.

The objective of this study is to propose a new, lightweight open source solution that stores gridded datasets in a native big data format and makes the data available for parallel processing, analytics, or artificial intelligence learning. There is a demand for a single storage solution that is open to different users:

  • Scientists, who set up prototypes and models in their own customized environments and qualify their data for publication, for instance as Copernicus datasets;
  • Operational teams, in charge of the daily data processing, which may run in a different environment, who ingest the products into an archive and make them available to end users for further modeling and data science processing.

Data ingestion and storage are key factors to study, because they determine the performance of the subsetting access services and of the parallel processing built on top of them.
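To make the storage question concrete, here is a minimal sketch of how a gridded variable can be flattened into a columnar table (one row per grid cell, one column per coordinate or variable), which is the kind of layout a columnar format such as Parquet holds. The abstract does not describe the actual Parquet Cube schema, so the column names, the helper functions `grid_to_columns` and `subset`, and the sample values are all hypothetical. The sketch uses only the Python standard library and does not write a Parquet file.

```python
from itertools import product


def grid_to_columns(times, lats, lons, values):
    """Flatten values[t][i][j] into columns, one row per grid cell.

    The values are copied as-is: no conversion or reprojection is applied.
    """
    columns = {"time": [], "lat": [], "lon": [], "value": []}
    for (t, time), (i, lat), (j, lon) in product(
        enumerate(times), enumerate(lats), enumerate(lons)
    ):
        columns["time"].append(time)
        columns["lat"].append(lat)
        columns["lon"].append(lon)
        columns["value"].append(values[t][i][j])
    return columns


def subset(columns, lat_range, lon_range):
    """Keep the rows whose coordinates fall inside an inclusive bounding box."""
    keep = [
        k
        for k in range(len(columns["value"]))
        if lat_range[0] <= columns["lat"][k] <= lat_range[1]
        and lon_range[0] <= columns["lon"][k] <= lon_range[1]
    ]
    return {name: [col[k] for k in keep] for name, col in columns.items()}


# Hypothetical 1 x 2 x 3 grid (time x lat x lon); all values are made up.
times = ["2021-07-02"]
lats = [43.0, 44.0]
lons = [1.0, 2.0, 3.0]
values = [[[10.0, 11.0, 12.0], [20.0, 21.0, 22.0]]]

table = grid_to_columns(times, lats, lons, values)
box = subset(table, lat_range=(44.0, 44.0), lon_range=(2.0, 3.0))
```

In this layout a spatial subset is a plain row filter on the coordinate columns, which is the kind of operation that engines such as Spark can distribute across workers.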

Four storage and service implementations are benchmarked against typical end-user use cases:

  • Unidata's THREDDS Data Server (TDS), a traditional NetCDF data access service built on the NetCDF-Java library,
  • an extension of the THREDDS Data Server that uses an object store,
  • the Pangeo/Dask/Python ecosystem,
  • and the alternative Hadoop/Spark/Parquet solution, driven by CLS technical and business requirements.

Status: open (extended)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on gmd-2021-138', Michael Kuhn, 14 Oct 2021

Viewed

Total article views: 430 (including HTML, PDF, and XML)
  • HTML: 360
  • PDF: 64
  • XML: 6
  • Total: 430
  • BibTeX: 6
  • EndNote: 7
Views and downloads (calculated since 02 Jul 2021)

Viewed (geographical distribution)

Total article views: 384 (including HTML, PDF, and XML), of which 384 have a defined geography and 0 are of unknown origin.
Latest update: 25 Oct 2021
Short summary
The Parquet Cube storage alternative presented here is compared with the Pangeo and THREDDS platforms for access to gridded data for large-scale processing and modeling. Stress-tested alongside them in 3 implementations through 3 data scientists' scenarios, the Parquet Cube alternative appears to be a good candidate for sharing gridded data in a cloud environment and across different communities of users. This open source alternative can be extended with additional services to subset, enrich, or explore data.