the Creative Commons Attribution 4.0 License.
The ESGF Virtual Aggregation (CMIP6 v20240125)
Abstract. The Earth System Grid Federation (ESGF) holds several petabytes of climate data distributed across millions of files held in data centers worldwide. Obtaining and manipulating the scientific information (climate variables) held in these files is non-trivial. The ESGF Virtual Aggregation is one of several solutions for providing an out-of-the-box, aggregated, analysis-ready view of those variables. Here we discuss the ESGF Virtual Aggregation in the context of the existing infrastructure and of some of the other solutions that provide analysis-ready data. We describe how it is constructed and how it can be used, and we provide some performance evaluation. It will be seen that the ESGF Virtual Aggregation provides a sustainable solution to some of the problems encountered in producing analysis-ready data, without the cost of data replication to different formats, albeit at the cost of more data movement within the analysis than some alternatives. If heavily used, it may also require more ESGF data servers than are currently deployed at data nodes. The need for such data servers should be a component of ongoing discussions about the future of the ESGF and its constituent core services.
Status: closed
RC1: 'Comment on gmd-2024-120', Anonymous Referee #1, 01 Oct 2024
In the manuscript titled “The ESGF Virtual Aggregation (CMIP6 v20240125)”, the authors study the Earth System Grid Federation (ESGF) Virtual Aggregation, which provides a sustainable solution to some of the problems encountered in producing analysis-ready climate data. In case of an overload of access requests, the Virtual Aggregation requires more ESGF data servers than are currently deployed. The study addresses this critical issue by focusing on data retrieval, improving delivery capabilities beyond conventional file search and download. Addressing this gap in the research makes the study timely and relevant, especially for enhancing the efficiency and productivity of climate data analysis.
While the current focus on database operations might make the paper more suitable for a data management journal, there is potential for it to fit within Geoscientific Model Development if the aspects relating to numerical models of the Earth system are emphasized. If the primary contribution remains database-oriented, the manuscript might not meet the scientific inquiry expectations of Geoscientific Model Development. Reframing the study to address Earth system model processes more directly could improve its suitability, but if that is not feasible, the authors might consider submitting to a journal more focused on database management and querying. Please refer to the aims and scope of GMD: https://www.geoscientific-model-development.net/about/aims_and_scope.html
Citation: https://doi.org/10.5194/gmd-2024-120-RC1
AC1: 'Reply on RC1', Ezequiel Cimadevilla, 11 Oct 2024
Thank you for your review of our manuscript. We agree that the paper currently has a strong focus on data management aspects, and we need to make the connection to model evaluation clearer. In the revision we will address this by incorporating a detailed example of the impact of the ESGF Virtual Aggregation on model evolution. In doing so we hope to make it clearer why we chose GMD, why this work fits within the GMD scope, and why it should be of interest to the GMD readership.
Citation: https://doi.org/10.5194/gmd-2024-120-AC1
RC3: 'Reply on AC1', Anonymous Referee #1, 11 Oct 2024
I agree with the authors' approach and remain open-minded. If the journal editor and other reviewers have other considerations regarding the applicability of this study's scope, I am also agreeable.
Citation: https://doi.org/10.5194/gmd-2024-120-RC3
RC2: 'Comment on gmd-2024-120', Anonymous Referee #2, 09 Oct 2024
In the manuscript titled “The ESGF Virtual Aggregation (CMIP6 v20240125)”, a Virtual Aggregation solution is described that supports an analysis-ready data view of files stored on distributed data servers in the ESGF data infrastructure. Besides providing a standardized virtual aggregation description, the major advantage lies in the possibility of generating logical aggregation descriptions without the need to inspect storage details of the netCDF files. This allows for an efficient implementation in the ESGF data federation based on the ESGF metadata search capability and ESGF data access services. The implementation also revealed options, problems and possibilities in the current operational ESGF data federation to better enable virtual aggregation in the future; these aspects are highlighted. This provides important input for the evolution of ESGF to better support data analysis exploiting virtual aggregation descriptions.
The dependency of the described solution on available OPeNDAP servers (not foreseen in the future ESGF infrastructure planning) is described. A short comment on how other types of lightweight data servers, e.g. based on xpublish, could be an option for the future would be helpful. A short comment on the nature of this dependency would also be helpful; for example, DMR++ OPeNDAP is not mentioned.
The manuscript focuses on the distributed database and data management aspects and not so much on model development, yet there are important implications for the climate model components responsible for generating standardized model (meta)data to enable data distribution, e.g. in ESGF (CMOR, versioning, chunking and aggregation problems with the currently available ESGF metadata are mentioned in the paper). With a more explicit description of this aspect it might fit better within Geoscientific Model Development.
Minor, more technical comment:
- The numbers in Figure 6 are a bit misleading because they refer to different aspects. The illustrated ESGF index is much smaller (GB scale) than the mentioned ~21 PB of data that this index addresses. Yet the ESGF index is used to generate the local SQL database. The fact that other aggregation methods such as kerchunk would need to inspect the indexed ~21 PB of data is probably something that should not be included in the figure.
Citation: https://doi.org/10.5194/gmd-2024-120-RC2
AC2: 'Reply on RC2', Ezequiel Cimadevilla, 14 Oct 2024
Thank you for your review of our manuscript. We agree that the paper currently has a strong focus on data management aspects, and we need to make the connection to model evaluation clearer. In the revision we will address this by incorporating a detailed example of the impact of the ESGF Virtual Aggregation on model evolution. We can see the need for a more detailed exposition of the workflow necessary for setting up efficient data access, and we will add some material on this. Thanks also for the suggestion to add something on other lightweight data services; we will address that too.
Citation: https://doi.org/10.5194/gmd-2024-120-AC2
CEC1: 'Comment on gmd-2024-120', Astrid Kerkweg, 18 Oct 2024
Dear authors,
congratulations, you managed the shortest non-empty code availability section I've ever seen.
However, it would be much appreciated if you could provide some more information on what is contained in the Zenodo directory: code, plot scripts, data, etc.
Furthermore, the DOI should be presented in the form of a proper citation.
Best regards, Astrid Kerkweg (Executive Editor)
Citation: https://doi.org/10.5194/gmd-2024-120-CEC1
AC3: 'Reply on CEC1', Ezequiel Cimadevilla, 21 Oct 2024
Thank you for your review of the manuscript. In the revision we will provide a description of the contents of the repository in the code availability section. We will fix the DOI form too.
Citation: https://doi.org/10.5194/gmd-2024-120-AC3
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote |
---|---|---|---|---|---|
261 | 94 | 439 | 794 | 6 | 9 |