A Parquet Cube alternative to store gridded data for data analytics and modeling

The volume of data in the field of Earth observation has increased considerably, especially with the emergence of new generations of satellites and models providing much more precise measurements and thus voluminous data and files. One of the most traditional and popular data formats used in the scientific and education communities (reference) is the NetCDF format. However, it was designed before the development of cloud storage and parallel processing in big data architectures. Alternative solutions, under open source or proprietary licences, have appeared in the past few years (see Rasdaman, Opendatacube). These data cubes manage the storage and the services for easy access to the data, but they also alter the input information by applying conversions and/or reprojections to homogenize their internal data structure, introducing a bias in the scientific value of the data. The consequence is that users are driven into a closed infrastructure, made of customized storage and access services. The objective of this study is to propose a new lightweight open source solution which is able to store gridded datasets in a native big data format and make data available for parallel processing, analytics or artificial intelligence learning. There is a demand for a unique storage solution that would be open to different users:


Introduction
Thanks to the possibility of managing big and varied datasets with velocity in a big data architecture, there is an opportunity to quickly extract larger data volumes, process them into new value-added products on larger geographical and temporal scales, with higher geographical and temporal resolutions, and provide more precise output data to build new data science models from existing scientific data. For instance, in the animal monitoring field, scientists need to correlate locations and in-situ observations with satellite or modelled environmental data over a long time period for the detection and the modeling of any new threat to the animal habitat due to climate and environment changes (reference EO4wildlife).
https://doi.org/10.5194/gmd-2021-138 Preprint. Discussion started: 2 July 2021. © Author(s) 2021. CC BY 4.0 License.
Cloud and big data technologies can help through their focus on horizontal scalability, by distributing the workload and I/O across a potentially resizable pool of resources. However, the effort to acquire new technical skills and to switch to this type of technology can be expensive and a barrier for scientists developing their data processing and artificial intelligence models on their own. Many oceanographic, climate and Earth observation datasets are in NetCDF format. This format is widely used to represent multidimensional arrays (see the Unidata site). While the NetCDF format and the library ecosystem were initially designed for local, random-access files, several approaches have been developed to enhance the NetCDF libraries to make efficient use of cloud storage and parallel cloud processing. Support has also been added to the NetCDF-Java library to use HTTP byte-range reading into NetCDF files stored in S3-compatible object stores to emulate random-access file reading. Using the new S3 capabilities of the NetCDF-Java library, the THREDDS Data Server (TDS) can provide data access services to S3-stored NetCDF files. Work has also been done to add support for reading and writing Zarr data in the NetCDF core libraries (NetCDF-C and -Java). Preliminary results based on this initial THREDDS S3 work are presented in this paper.
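The idea behind this byte-range emulation can be sketched in plain Python (a minimal illustration, not the NetCDF-Java code; a `BytesIO` buffer stands in for the object store, and all names are ours):

```python
import io

class RangeReader:
    """Emulates random-access reads on a remote object: each call
    fetches only the requested byte range, never the whole file."""

    def __init__(self, blob: bytes):
        # Stand-in for the object store; a real reader would keep the
        # object URL and issue an HTTP GET per call.
        self._blob = io.BytesIO(blob)

    def read_range(self, start: int, length: int) -> bytes:
        # Equivalent of a GET with header "Range: bytes=start-(start+length-1)".
        self._blob.seek(start)
        return self._blob.read(length)

# A fake 256-byte "file": reading bytes 10..13 touches only 4 bytes.
reader = RangeReader(bytes(range(256)))
print(reader.read_range(10, 4))
```

Because a NetCDF-4 file is internally chunked, a reader that knows the file layout can serve a subset request with a handful of such small range reads instead of a full download.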
Pangeo is already widely used by scientists in the geoscience field to process data in big data architectures using Python (see https://speakerdeck.com/rabernat/pangeo-earthcube-tac-update?slide=13). The conversion of NetCDF data into Zarr data, based on NumPy, and the use of Dask are logical in the Pangeo ecosystem. In industry, in many firms such as CLS, we can still find places where the development environment is Python oriented but the operational environment is based on Spark/Java/Scala technologies. Thus the idea is to find a common storage format and provide a unique data lake for both environments.
Several big data formats are candidates to extend the use of data in a Cloudera/Spark infrastructure: Cloud Optimized GeoTIFF, the HSDS object store, Parquet. Each of them has pros and cons (reference https://ntrs.nasa.gov/citations/20200001178). Even if Parquet is not suitable for all data types, such as individual data points, it is relevant for gridded data. The question is to see how, and to what extent, Parquet can be a viable and performant solution to promote. The different implementations tested in this publication to store and access data are:
• THREDDS-NC for the traditional NetCDF environment,
• THREDDS-S3 for a new possibility using THREDDS services managing data stored in S3 storage (Amazon Simple Storage Service),
• Pangeo-Zarr for the Pangeo/Python/Zarr/Dask implementation,
• Spark-Parquet for the Parquet Cube implementation.
For the datasets, the following criteria were identified to stress these implementations:
• Significant data volume: data is available over a one-year duration,
• Variability in the geographical distribution: the geographical coverage is typically over the ocean or over the land,
• Significant set of variables, to estimate the impact on the performances of the services.
For the use cases, we identified 3 typical scenarios:
• extraction of data in a geographical area over a time period for a subset of variables (3D or 4D) via a subsetter service,
• enrichment of CSV locations with variable values as additional CSV columns,
• parallelized processing.
All these datasets are available as NetCDF files. Depending on the implementation, datasets can be converted into another format to optimize the execution time of the scenarios.
The first scenario is made of a set of 100 requests to a subsetter service. It is defined in order to run and replay the same extractions for each implementation.
Storage sizes for the inputs and outputs of the requests and the execution times of the subsetter service are discussed in the results section for each implementation. For the second scenario, an enrichment along tracks is performed. A set of additional variables, growing in number, is added to all positions in the tracks, over a specific region or worldwide. To get a significant number of real tracks, anonymized ship locations were exported from the Automatic Identification System archive available at CLS premises. During the enrichment, a linear interpolation using the surrounding values was performed. The implemented interpolation is simply a weighted arithmetic mean of the neighbours' values, where the weight of each value corresponds to the inverse of the distance (in meters) between the neighbour and the track point.
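This weighted mean can be sketched as follows (function and parameter names are illustrative, not those of the benchmark code):

```python
import numpy as np

def idw_interpolate(distances_m, values):
    """Inverse-distance-weighted mean of the surrounding grid values.

    distances_m: distance in meters from the track point to each neighbour;
    values: the environmental variable value at each neighbour.
    """
    distances_m = np.asarray(distances_m, dtype=float)
    values = np.asarray(values, dtype=float)
    if np.any(distances_m == 0):
        # A track point sitting exactly on a grid node takes that value.
        return float(values[distances_m == 0][0])
    weights = 1.0 / distances_m          # weight = inverse of the distance
    return float(np.sum(weights * values) / np.sum(weights))

# Equidistant neighbours reduce to a plain average.
print(idw_interpolate([1000.0, 1000.0], [10.0, 20.0]))  # 15.0
```

A neighbour three times closer thus contributes three times more weight to the enriched value.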

Figure 1: Interpolation along track
For the third scenario, which is parallelized processing, the CMEMS sea level anomalies product is used to compute and display the yearly mean sea surface elevation since the beginning of altimetry measurements in 1993.
Each implementation is detailed below to illustrate its storage mechanism, its services for geographical extractions or along-track enrichment, and its limits and potential workarounds. Then the results are presented to provide an overview of their performances. A first set of requests is performed to try to assess the incidence of the dataset characteristics on the subsetting results and to estimate the limits of the implementation due to the time coverage or the number of variables to extract. The system configuration for these extractions is a Linux server with 10 cores and 4 GB of memory.
A second set of requests is performed to see how the number of cores, for a given dataset and a given request, impacts the subsetting performances and how scalable the solution is. The number of cores is increased for the Pangeo-Zarr and the Spark-Parquet environments. All the requests are described in an Excel file available with the tested datasets (see Data Availability).

Scenario 2 details: enrichment along track
Run: extraction of environmental data along tracks (set of positions; input in CSV, output in CSV)

Metrics:
• Global execution time, on a hot instance.
The parameters used to stress the solutions are described below. The incidence of the number of extracted variables and of the number of cores used to enrich the locations is also studied.

Scenario 3 details: parallel processing
Run: calculate the annual average elevation of the oceans over the globe over 25 years
- Fill value to ignore
- Output: curve plot (png)
- Conf: 4 GB per executor, 5 cores per executor (optimum), up to 50 cores

Metrics:
• Global execution time, with the initialization time of the cluster, to have a significant estimation from the end user point of view.
With the sea level anomaly variable, draw on the same plot:
• the average by time step of the values,
• the sliding average with a window size of 365 days.
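These two curves can be sketched with NumPy on toy arrays (variable and function names are illustrative, not those of the actual Spark or Dask job):

```python
import numpy as np

def timestep_mean(grid, fill_value):
    """Spatial mean per time step, ignoring the fill value.

    grid: array of shape (time, lat, lon)."""
    g = np.where(grid == fill_value, np.nan, grid.astype(float))
    return np.nanmean(g, axis=(1, 2))

def sliding_mean(series, window=365):
    """Trailing moving average over `window` time steps via cumulative sums."""
    s = np.asarray(series, dtype=float)
    c = np.cumsum(np.insert(s, 0, 0.0))
    return (c[window:] - c[:-window]) / window

# Two daily maps of 2x2 points, one masked value in the first map.
grid = np.array([[[1.0, 2.0], [-999.0, 4.0]],
                 [[5.0, 5.0], [5.0, 5.0]]])
print(timestep_mean(grid, fill_value=-999.0))  # approx. [2.33, 5.0]
```

The cumulative-sum trick keeps the 365-day window in O(n) regardless of the window size, which matters for a 25-year daily series.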

THREDDS implementation
Unidata's THREDDS Data Server (TDS) provides a variety of data access services that can be used to serve NetCDF datasets.
The NetCDF Subset Service (NCSS) supports coordinate system requests and can return the requested data in several formats including NetCDF-4. Unidata's TDS uses the NetCDF-Java library to read the datasets to be served by the TDS. This is the THREDDS implementation.
By extending the NetCDF-Java library to support byte-range access to object stores that support the S3 API, the existing TDS services can now be used to serve NetCDF files stored as objects in an object store. This is the THREDDS-S3 implementation.
While this new S3 access can be used to serve individual NetCDF files, extending the new S3 capability to TDS aggregation of datasets is still being implemented. This means that the year of daily files for the two CMEMS hourly and daily mean dataset collections cannot be served as a virtual aggregation of the existing files but must instead be concatenated into a single file per dataset collection. This concatenation was done using the NCO `ncrcat` command.
TDS services (including NCSS) were designed for targeted subset data access and were not designed to handle requests that result in very large responses. Some of the benchmark requests resulted in responses that exceeded size limits in place on the systems. A second, perhaps related, issue involved requests reaching timeout limits.
While the TDS handles multiple requests in a multi-threaded manner, individual requests are not currently handled in a multi-threaded manner. Parallelizing reads to S3 objects is on the THREDDS list for future work (see list below). To work around the size and timeout limitations, some of the original requests were modified by adding horizontal and/or temporal strides, thus reducing the volume of the resulting data. Adding a stride means that not every data point in the requested spatial/temporal extent is returned. Instead, only every n-th data point along the spatial or temporal dimensions is returned, where n is the value given by the stride parameter.
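A stride of n amounts to a strided slice along each requested dimension; a minimal NumPy sketch on a toy array (not TDS code):

```python
import numpy as np

# Hypothetical (time, lat, lon) block of gridded data.
data = np.arange(4 * 6 * 8).reshape(4, 6, 8)

def strided_subset(arr, time_stride=1, lat_stride=1, lon_stride=1):
    """Return every n-th point along each dimension, shrinking the
    response volume by roughly the product of the strides."""
    return arr[::time_stride, ::lat_stride, ::lon_stride]

sub = strided_subset(data, lat_stride=2, lon_stride=2)
print(data.shape, "->", sub.shape)  # (4, 6, 8) -> (4, 3, 4)
```

Strides of 2 on both horizontal dimensions thus cut the response to roughly a quarter of its full size, at the cost of skipping three out of every four grid points.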

Pangeo implementation
Pangeo is, first and foremost, a community of people working collaboratively to develop software and infrastructure for geoscience research on large volumes of data. Among the products developed by this community, there is interconnected software that can be deployed in cloud computing or high-performance computing (HPC) environments. Such a deployment is also known as a "Pangeo environment." These environments are built around a software stack using the Python language and reference libraries, allowing large-scale processing. The central element of this ecosystem is the Dask library. This library makes it possible to run calculations on a distributed or local cluster. Dask uses task graphs and data structures to distribute tasks and memory. Data structures are handled in memory by Dask. Third-party libraries such as Zarr take care of reading and writing data to disk. Data processing libraries, such as Dask, manage data internally by dividing it into smaller chunks to enable distributed access and efficient compression and to reduce input/output access time on the file system.
In this implementation, the conversion of the different NetCDF datasets into Zarr format is performed by a specific Python script based on the NetCDF4 module. The written data is compressed with the LZ4 algorithm using a compression level of 5.
When relevant for certain physical variables, filters (e.g., delta, scale factor, add offset) are added to improve the compression rate of these variables.

Parquet Cube implementation
The development of the Parquet Cube solution is driven by the following scientists' requirements:
1. Do not alter the NetCDF input quality and precision.
2. Make it simple, generic, robust and scalable.
3. Allow further large-scale processing for ML or DL models.
4. Be compliant with traditional scientists' habits (having everything on their computer).
5. Open the use of the solution to a large community of users.
For item 1: existing data cubes ingest data to be able to process them in the same projection and, if needed, the same geospatial resolution (see open cube ingestion). Scientists want to keep the original information or have a complete description of the ingestion mechanism, which is not clear once data is available through the data cube access services. So reprojections, conversions of units and interpolations are prohibited in this solution. This introduces a limitation in managing the X,Y geographical representation used in polar datasets for instance. The single recommendation in place is to homogenise the precision in latitude and longitude between datasets, to be able to use such latitude or longitude key values during select or join operations. The decimal precision is set to 0.00001 degrees, representing approximately a 1-meter precision at the Equator. Depending on the variable type defining longitude and latitude, this rounding operation is also useful to avoid floating point errors and failing joins during requests.
For item 2: a scalable Docker ingestion process is in place to deal with inputs of gigabytes and thousands of files.
For item 3: Parquet is a common big data format standard. Several libraries, in Java, Scala, Python or R, familiar to the data science and educational communities, can read it. Final storage can be done on additional local storage disks, HDFS or object storage like S3 buckets for the big volumes available in big data infrastructures. Be aware that HDFS replicated storage can be typically three times the original size of a file. The operational storage costs are not studied here.
For item 4: the proposed Parquet Cube solution can be the input format of a scalable subsetter service, to extract data for scientists matching their geographical coverage, time coverage and list of environmental variables of interest. The output of the subsetter service can load data in memory for further parallel processing or can generate a downloadable output file for any scientist who wants to process data directly on his computer. The first option is to write the output as a NetCDF file, but the storage and the time to process the output quickly become prohibitive. The second option is to benefit from the scalability of a big data enabled format such as Parquet as an output format. The results of the two options are described in the "Results" section.
For item 5: the conversion from NetCDF to Parquet is done using open source software. The objective is to open the source code to users for any further optimization or modifications required to ingest their NetCDF inputs. To meet the above requirements, the Parquet Cube implementation uses the following ingestion mechanism. The overall idea is to store each data point from the original gridded data as a row in a Parquet table. Thus, each row has columns corresponding to the NetCDF dimensions and to the variable values.

Figure 2: Flattening a multi-dimensional array in tabular format
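The flattening step can be sketched as follows (a toy 2x2x2 grid; the column layout is illustrative):

```python
import numpy as np

# Toy gridded variable with dimensions (time, lat, lon).
times = np.array([0, 86400])          # unix epoch seconds
lats = np.array([10.0, 20.0])
lons = np.array([100.0, 110.0])
var = np.arange(8, dtype=float).reshape(2, 2, 2)

# One row per grid point: dimension columns followed by the variable column.
t, y, x = np.meshgrid(np.arange(2), np.arange(2), np.arange(2), indexing="ij")
rows = [(int(times[i]), float(lats[j]), float(lons[k]), float(v))
        for i, j, k, v in zip(t.ravel(), y.ravel(), x.ravel(), var.ravel())]

print(len(rows))   # 8 rows for a 2x2x2 grid
print(rows[0])     # (0, 10.0, 100.0, 0.0)
```

Each row is self-describing (time, latitude, longitude, value), which is what allows Parquet's columnar filters and joins on the dimension columns later on.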
In addition to this Parquet file, two separate files are generated:
• A dimension index file: an index for each dimension (one file per dimension). This allows us to benefit from quick partition pruning based on dates, since almost all our use cases are applied to a pre-defined time interval.
• A metadata file: this file stores all the attributes extracted from the original NetCDF files of the dataset. This can be useful to generate a NetCDF output in line with the NetCDF input, but also to implement a description service of the dataset (not described here).

During the data transformation from NetCDF to Parquet format, the following basic transformations are performed to avoid additional operations in the subsetter service:
• Normalization and rescaling of latitude/longitude values (latitudes are rescaled to the range [-90, 90] and longitudes to the interval [-180, 180]). This makes it possible to join several datasets on latitude/longitude without considering the ranges.
• Time dimension values are converted to unix epoch timestamps, to avoid disparities between datasets.
• Variable values read from NetCDF files are rescaled using the offset and the scale factor declared in the NetCDF file metadata, to allow comparisons of values between datasets (this would otherwise require calculating the real values during processing).
• Note that we implement a special handling for datasets that mix 3D and 4D variables: since all variables, regardless of their dimensions, are grouped in the same Parquet table, data is stored as if the depth dimension was nullable and all variables were 4D variables.
• The following example illustrates what the Parquet table looks like in this specific case: the 4D variables are considered "null" (empty cell) when the depth is "null", and the 3D variables are considered "null" when the depth is not.
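These per-column transformations can be sketched as small pure-Python helpers (the function names are ours, not the ingestion job's; the epoch and packing conventions follow the NetCDF/CF attributes described above):

```python
import datetime as dt

def wrap_longitude(lon):
    """Rescale any longitude (e.g. 0..360) to the interval [-180, 180]."""
    return ((lon + 180.0) % 360.0) - 180.0

def round_coordinate(value, precision=1e-5):
    """Homogenise lat/lon precision (~1 m at the Equator) so that joins
    between datasets do not fail on floating point noise."""
    return round(value / precision) * precision

def to_epoch_seconds(t):
    """Convert a timezone-aware datetime to a unix epoch timestamp."""
    return int(t.timestamp())

def unpack(raw, scale_factor, add_offset):
    """Apply the NetCDF scale/offset attributes to get the physical value."""
    return raw * scale_factor + add_offset

print(wrap_longitude(350.0))    # -10.0
print(unpack(1200, 0.01, 0.0))  # 12.0
```

Applying these at ingestion time means the subsetter never has to re-derive physical values or reconcile coordinate conventions at query time.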
For the tested datasets, the Parquet storage is designed to get a daily Parquet file per folder to obtain a timely indexation, with a storage size compliant with good I/O performances. The compression format is Snappy, since it is splittable (see the Cloudera documentation) and performant in read and write access for any dataset.
Additional metadata columns are added in the footer of the Parquet format to quickly filter data between physical value ranges, for instance (see Faster Performance for Selective Queries). The Parquet file structure is stored on HDFS, and Apache Spark is used to read and process data from the Parquet files in a distributed manner. This Parquet storage can be accessed through a dedicated subsetter service or through several technologies like Spark or Python Dask notebooks.
As a reminder, value subsetting consists of queries that select only data points for which the associated value is bounded by a predefined interval. Parquet metadata are used to select chunks of interest while running a value range search, performing scans only on the chunks for which the interval [min, max] intersects with the value range of interest.
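A minimal sketch of this pruning logic (the chunk statistics below are invented for illustration; the real ones live in the Parquet footer metadata):

```python
# Hypothetical per-chunk statistics mimicking Parquet column min/max metadata.
chunks = [
    {"path": "part-0", "min": -2.0, "max": -0.5},
    {"path": "part-1", "min": -0.8, "max": 0.4},
    {"path": "part-2", "min": 0.3, "max": 1.9},
]

def prune(chunks, lo, hi):
    """Keep only the chunks whose [min, max] interval intersects the
    requested value range; the others are never scanned."""
    return [c["path"] for c in chunks if c["max"] >= lo and c["min"] <= hi]

print(prune(chunks, 0.0, 1.0))  # ['part-1', 'part-2']
```

Chunks whose statistics fall entirely outside the requested interval are skipped without any I/O, which is what makes selective queries cheap on a well-partitioned table.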
Note that the subsetter service used in the trials is under CLS intellectual property rights and is currently not exposed or distributed.

Execution environments
The hardware architecture for each solution is described below; all solutions run on UNIX virtual machines.
The THREDDS implementation uses two TDS instances. One accesses the single-file benchmark datasets from local disk. The second accesses the single-file benchmark datasets from an on-prem object store appliance. The two TDS instances provide us with an interesting comparison for the benchmark testing.
The TDS service that manages data stored on a local file system is a dual Xeon Gold 6138 machine with 512 GB of memory and a 120 TB RAID running CentOS. The TDS that serves data stored on an object store is a VM on a cluster co-located with the object store; it has 4 CPUs with 12 GB of RAM and 80 GB of disk and is also running CentOS.
The CNES HPC cluster was used to deploy Pangeo on a Conda environment. PBS Pro drives this cluster, and GPFS operates the distributed file system. The main libraries used with our Pangeo environment are Dask, Dask-jobqueue, Zarr, and Pyinterp.
Pyinterp was used to perform the interpolations. For each test, PBS was used to limit the resources allocated on the cluster. We repeated the tests several times to obtain the most reliable metrics. Nodes in the CNES HPC have the following characteristics:
• CPU: 24 cores / 2.2 GHz (Intel Broadwell)
• Memory: 128 GB / 2400 MHz
• Disk storage: 2 x SAS 15,000 rpm / 300 GB / RAID0 / 500 GB
• Network: Gigabit Ethernet
• Infiniband: FDR 56 Gb/s
The Parquet Cube solution is deployed on the CLS big data infrastructure (Cloudera version 5.13.1).

Figure 7: Parquet Cube ingestion and processing diagram
The ingestion processing produces the Parquet files consumed later by Spark jobs to extract data through Jupyter or Zeppelin notebooks or through APIs such as the subsetter service.

Unfortunately, within the scope of this study, it was impossible to deploy each solution on the same infrastructure. To try to minimize the impact of the hardware architecture, performances for executing the requests were measured with the same number of CPUs and the same memory size, but with different disk storage performances.

Ingestion of NetCDF data
The first THREDDS deployment uses the current NetCDF format, so no conversion is required. The second one uses S3 storage with a unique object per dataset. The conversion uses the NCO `ncrcat` command to generate a single object per dataset.
Concerning the Pangeo-Zarr solution, a Python script was developed using the xarray module to convert NetCDF into Zarr format. However, the xarray module was not used to convert the biggest datasets; a specific script was developed using the NetCDF4 Python module for input reads and Dask/Zarr for output writes. The ingestion service for the Spark-Parquet conversion is implemented as a Java/Scala job encapsulated in a Docker container (see the git repository for a complete description).
For big datasets, we can see the impact of the compression method and of the replication of data. There is no replication for the Pangeo-Zarr or THREDDS implementations. Note that HDFS storage replicates the data chunks 3 times to ensure data integrity and disaster recovery (so the effective disk storage is 3 times the presented figures). For the Lake Water Quality dataset, the ingestion size in the Parquet format could be optimized by the generation of a second dataset with the variables first_observation_day, last_observation_day and number_of_observations, which are metadata and not observations. This can be discussed according to the usage of the lake water quality physical measurements.
The Zarr conversion on the HPC was not measured precisely, but using all the expected nodes of the CNES HPC, the ingestion of all the datasets took approximately 1 day. In comparison, the Parquet ingestion of all the datasets took 7.4 days on the CLS infrastructure without parallelization.

Subsetting access results
The first idea concerning the implementation of the scenario was to generate NetCDF outputs to provide extracted data in a format familiar to scientists. But limits were found, both in the output file size and in the execution time needed to generate the NetCDF result.
In the end, the NetCDF output was not generated when the output size reached the limits of the tested implementation.
For THREDDS, since the output is NetCDF, a stride strategy was put in place to limit the output of the subsetting to a threshold of 600 Gb compliant with NCSS. Some requests, even with stride values given, still resulted in timeout issues or out-of-memory issues in the case of S3 access. Changing the settings for how the TDS caches S3 reads solved some out-of-memory issues.
Remaining out-of-memory issues are believed to be related to how data is handled when generating the resulting NetCDF-4 files. These issues meant that the TDS testing does not have results for all the benchmark requests; these are identified as N/A below.
For the Pangeo-Zarr and the Spark-Parquet implementations, it was initially decided to generate NetCDF outputs when their size was less than 1 TB, and then to generate Zarr and Parquet outputs respectively to benefit from the scalability of the architecture.
To refine the estimation of the execution times, requests were run 5 times for Pangeo-Zarr and 3 times for Spark-Parquet. The mean result is presented below.

Incidence of geographical and time coverage
The sea_surface_height_above_geoid variable of the global-analysis-forecast-phy-001-024 dataset was extracted (for Pangeo-Zarr and Spark-Parquet, using 10 cores of 4 GB).

Figure 8: 3D Subsetting in time (seconds) and area of interest
We can see that the execution time is quite linear between 1 month and 6 months for THREDDS-NC and THREDDS-S3, with a coefficient equal to 1 for a given area, but due to the striding strategy it is not possible to conclude on a linear factor for the geographical dimension.

Figure 9: 3D Subsetting time in seconds
We can see a quadratic behaviour for the extractions performed with Pangeo-Zarr and Spark-Parquet. Pangeo-Zarr has a higher sensitivity to the area size than Spark-Parquet, for which the extraction times are of the same order except for the global coverage.
Even if the execution times are around 1 minute for the global extraction over one year for Spark-Parquet on the CLS infrastructure, they are around 10 times slower than the Pangeo-Zarr results in the HPC data centre for the global extraction, and up to 60 times slower for some local areas.

Incidence of the number of dimensions
The 4D sea_water_potential_temperature variable is extracted for depth 5.078224. Compared to 3D extractions, we can see a significant increase in time for 4D requests for the Pangeo-Zarr and the Spark-Parquet implementations: 3 times slower for Pangeo-Zarr, only half as much more for Spark-Parquet. This is not the case for the THREDDS implementation because of the stride policy put in place to be able to extract data within the THREDDS NCSS file size limit, which is 600 MB. Striding means that a data point is extracted every N points along a dimension (N can be between 2 and 20 depending on the request).

Incidence of the number of variables
The requests to study the impact of the number of variables are based on the following variables:
1: sea_surface_height_above_geoid
4: sea_surface_height_above_geoid, eastward_sea_water_velocity, northward_sea_water_velocity, sea_water_potential_temperature
11: sea_water_salinity, sea_water_potential_temperature_at_sea_floor, ocean_mixed_layer_thickness_defined_by_sigma_theta, sea_surface_height_above_geoid, eastward_sea_water_velocity, northward_sea_water_velocity, sea_water_potential_temperature, sea_ice_area_fraction, sea_ice_thickness, northward_sea_ice_velocity, eastward_sea_ice_velocity
As before, the requests are processed by combining:
• a given area: North Sea, Atlantic Ocean, Pacific Ocean, Global,
• a period of time, from the lightest to the darkest line: 1 day, 1 month, 3 months, 6 months, 1 year.
The THREDDS execution times are slower than the ones in the Pangeo-Zarr or the Spark-Parquet implementations. The THREDDS implementation also includes the stride bias. For Pangeo-Zarr and Spark-Parquet, we can see that the execution time is less sensitive to the number of variables for small areas and becomes more linear for the global coverage. The Pangeo-Zarr implementation on the CNES HPC can retrieve data faster than the CLS Spark-Parquet implementation but depends more on the number of variables. The THREDDS-S3 implementation did not provide enough results to allow an interpretation.

Incidence of number of cores
One of the key factors in the execution time is the number of cores used to process data. The diagram below illustrates the results for the extraction of the sea_water_potential_temperature, eastward_sea_water_velocity, northward_sea_water_velocity and sea_surface_height_above_geoid variables from the global-analysis-forecast-phy-001-024 dataset over a 3-month time period.

Incidence of the number of time steps
The test is to assess whether the number of time steps in a dataset is a key factor in terms of performances and to try to find out what the limits are in the different implementations. The request is the same: 4 variables (sea_water_potential_temperature, eastward_sea_water_velocity, northward_sea_water_velocity, sea_surface_height_above_geoid), extracted first from the global-analysis-forecast-phy-001-024 daily dataset and then from global-analysis-forecast-phy-001-024-hourly-t-u-v-ssh, the hourly dataset having 24 times more time steps than the daily one. The size of the first one is approximately twice that of the second one since it includes more variables. The THREDDS-S3 implementation provided a single value for the hourly dataset and is not considered in the analysis.

Incidence of non-significant data
The idea was to compare the incidence of the data type: significant continuous data over wide areas (seas) versus a lot of small data spots (lakes). The extraction for the Africa region is performed with 10 cores, and with 50 cores for both Africa and Europe. The execution time of the extraction for the non-continuous data is greater than the global extraction for the oceanographic data with 50 cores (see Figure 12). In the Parquet case, with 100 cores, the result of the European and African coverage extraction is 5 times greater than the oceanographic data worldwide extraction performed with 50 cores. Even if the tested data does not amount to a big volume, it is necessary to read a lot of chunks to extract it. These formats are not optimized for gridded data with a poor geographical coverage distribution, unless the chunk size is optimized for this kind of dataset.

Enrichment along track results
Note: the TDS services are not designed for such a use case, since a call to the THREDDS service would have to be done for each location. Therefore, the enrichment along track use case was not performed for these solutions. The dataset used for the enrichment is global-analysis-forecast-phy-001-024 daily.
Tests are run with 10 cores of 4 GB, except for the study of the incidence of the number of cores.

Incidence of the geographical coverage
Comparing enrichments of positions in a local area or worldwide is another topic to consider: for the enrichment of 152,484 positions in Rotterdam and of 176,735 positions worldwide, the execution time is quite constant for the Pangeo-Zarr implementation (respectively 0.80 s and 0.82 s) but is 4 times higher for the Spark-Parquet implementation (57 s instead of 14 s).

Incidence of the temporal coverage
The temporal coverage from 1 to 50 days includes a significant increase in the number of positions. The increase in time coverage and number of positions is well supported by the Pangeo-Zarr implementation, less so by the Spark-Parquet one, which seems to be parabolic.

Incidence of the number of variables
Runs were executed with the following requests for positions around the world. For the Pangeo-Zarr implementation, the mean ratio between the execution times for 1 and 7 variables is 16, whereas it is only 1.6 for the Spark-Parquet implementation, which is however more time consuming overall.

Incidence of the number of cores
Increasing the number of cores used to perform a worldwide extraction over 50 days for 8,897,349 positions provides the following result: the scalability of the big data architecture enables a significant improvement for both implementations, but their limits were not reached; a test with 50 or 60 cores could be set up to find them.

Parallel processing results
The Jupyter notebook on the CNES HPC is used to run the computation of the sea level yearly mean; its conversion into a Spark/Scala job is run in a Zeppelin notebook on the CLS big data architecture.
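The yearly-mean computation parallelizes naturally because the mean over time decomposes into per-chunk partial sums. A minimal NumPy sketch with synthetic data (the grid shape and values are assumptions) shows the decomposition that Dask (Pangeo) or Spark (CLS) distributes across workers:

```python
import numpy as np

# Hypothetical year of daily sea-level grids (day, lat, lon), synthetic values
daily = np.random.default_rng(1).normal(0.0, 0.1, size=(365, 60, 120))

# Serial yearly mean over the time axis
yearly_mean = daily.mean(axis=0)

# The same reduction expressed as per-chunk partial sums: each term of the
# sum can run on a different worker when the cube is chunked along time
chunks = np.array_split(daily, 12, axis=0)   # roughly monthly chunks
partial = sum(c.sum(axis=0) for c in chunks)
chunked_mean = partial / daily.shape[0]

print(np.allclose(yearly_mean, chunked_mean))
```

Both approaches produce the same grid of yearly means; only the execution strategy differs.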

THREDDS
In future versions, UNIDATA intends to make the partitioning strategy customizable by dataset (in order to avoid having too many small files when data can be grouped by month, for example). While the initial S3 access supports a number of data access patterns, the work of this study has shown that there are issues with handling very large datasets and with requests that result in very large response sizes. To better support large collections of data and reduce the need for rewriting existing datasets, the THREDDS team is working to implement support for virtual aggregation of datasets stored in S3 object stores.
The current study has suggested a number of possible items for future work:
• Requesting NetCDF-4 files with NCSS is RAM intensive and requires the resulting NetCDF-4 file to be written on the server before the server can start sending the HTTP response. The TDS CdmRemote service can return an ncstream response, which can start as the data is being retrieved. This would, to some extent, give a better indication of data access times by separating them from the time required to write the NetCDF-4 file.
• Further investigation is needed into the caching of S3 reads in the TDS, to see whether changes might improve performance and memory usage.
• Once virtual aggregations of S3 datasets are supported, look for possible performance improvements.
• Longer term, the THREDDS team will look at how to parallelize reads from S3 object stores.

Pangeo-Zarr
The performance of the Pangeo solution is the best of the three solutions compared. Its main limitation is that it is accessible only to Python users.

Spark-Parquet
Several opportunities to optimize performance are being studied. The use of quadtrees, GeoWave Hilbert curves or Google S2 indexes can optimize geographical extractions by zone, by storing the chunks of partitions according to their geographical vicinity. A quadtree data structure could be used to build an index that associates each cell id with the location of the data chunk that contains its tile.
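The common idea behind these spatial indexes is a space-filling curve that maps 2-D positions to 1-D cell ids while preserving locality. As a sketch, here is a Z-order (Morton) code, a simpler relative of the Hilbert and S2 curves mentioned above (the bit depth and scaling are illustrative assumptions):

```python
def interleave_bits(x, y, bits=16):
    """Interleave the bits of x and y into a Z-order (Morton) code."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code

def cell_id(lat, lon, bits=16):
    """Map a position to a cell id such that geographically close
    positions get numerically close ids."""
    x = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    y = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    return interleave_bits(x, y, bits)

# Two neighbouring positions share the high-order bits of their cell id,
# so partitions sorted by cell_id keep neighbouring chunks close on disk
print(cell_id(52.00, 4.40), cell_id(52.01, 4.41))
```

Sorting or partitioning the Parquet rows by such an id keeps geographically close data physically close, reducing the number of chunks read per zonal extraction.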
Some scientific data have the particularity of being tightly linked, meaning that for most scientific use cases these data are always accessed together. A notable example for climate data is wind or current data. In NetCDF files, they are encoded using two variables embedded in the same file: u and v. When processing or visualizing them, the user or the process accesses both variables over the area and/or the period of interest. They should therefore be stored in the same chunk, or in close chunks, to avoid additional read accesses.
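In a columnar layout, co-locating linked variables simply means carrying both in the same record. A minimal pandas sketch (the values and schema are hypothetical, not the study's actual Parquet schema) shows one row per grid cell holding both components, so a single read serves a derived computation such as current speed:

```python
import pandas as pd

# Hypothetical co-located storage of tightly linked variables: one row per
# (time, lat, lon) cell carries both u and v, so one read returns both
df = pd.DataFrame({
    "time": ["2020-01-01"] * 4,
    "lat": [52.0, 52.0, 52.1, 52.1],
    "lon": [4.4, 4.5, 4.4, 4.5],
    "u": [0.12, 0.10, 0.11, 0.09],
    "v": [-0.05, -0.04, -0.06, -0.03],
})

# Deriving current speed touches a single table / chunk, with no extra
# read access to fetch the second component
speed = (df["u"] ** 2 + df["v"] ** 2) ** 0.5
print(speed.tolist())
```

If u and v lived in separate chunks, the same computation would double the number of reads for every area and period of interest.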
A tuning step could then be performed to adjust the chunk size to the dataset hierarchy and size. In this study, the daily partitioning was suitable for the tested data, but a specific test could have been performed to estimate the impact of the partition size. A parameter could be exposed to adjust the dataset structure to the storage partition size. Additional tests will lead to recommendations according to dataset characteristics, but a classification of dataset types should be studied first to help users of the Parquet Cube solution tune it. Reducing the ingestion time by scaling the ingestion process is an easy step to perform. This scalability is implemented in the operational environment at CLS premises to ingest files in parallel and speed up the ingestion of datasets with a long time coverage.
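As an illustration of what a configurable daily partitioning could look like, here is a hypothetical Hive-style path layout (the actual layout used in the study may differ; the function and path scheme are assumptions):

```python
from datetime import date, timedelta

# Hypothetical Hive-style daily partition layout for a Parquet cube
def partition_path(dataset, day):
    return (f"{dataset}/year={day.year}"
            f"/month={day.month:02d}/day={day.day:02d}/part-0.parquet")

start = date(2020, 1, 1)
for d in (start + timedelta(days=i) for i in range(3)):
    print(partition_path("global-analysis-forecast-phy-001-024", d))
```

Grouping by month instead of day would only change the path template, which is the kind of per-dataset parameter the tuning step could expose.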

Conclusion
The Parquet format is a convenient way to open the data to a large community of users. Through a scalable ingestion process in a big data infrastructure, NetCDF data is converted into a form that facilitates the processing of gridded data: extracting the expected pieces of information easily, and building high added value processing chains or artificial intelligence models.
The presented results demonstrate that the Parquet format is suitable for the typical users' use cases in processing big volumes of gridded data: extraction of a subset of data to study or model, enrichment of positions with environmental variables, and parallel processing to tune models all become accessible to scientists and compliant with operational processing requirements.
Alternatively, the benchmarks could be run using Parquet/Dask/Python, TileDB or Cloud Optimized GeoTIFF formats, for instance. The inputs and scenarios provided here are detailed and free for anyone willing to compare such new implementations of gridded data storage and access services in big data architectures.

Data availability
The gridded data amounts to around 2 TB and is available on a NAS storage at CLS premises. Access to download the data through HTTP can be provided on request: please send an email to bigdata4science@groupcls.com, and a time slot, a URL and a password for the downloads will be provided.
The tests with the rolling year can also be performed online on the Copernicus services, but the availability of these datasets is not guaranteed in the long term.