This work is distributed under the Creative Commons Attribution 4.0 License.
Technology to aid the analysis of large-volume multi-institute climate model output at a central analysis facility (PRIMAVERA Data Management Tool V2.10)
Ag Stephens
Matthew S. Mizielinski
Pier Luigi Vidale
Malcolm J. Roberts
- Final revised paper (published on 20 Nov 2023)
- Preprint (discussion started on 04 May 2023)
Interactive discussion
Status: closed
RC1: 'Comment on gmd-2023-46', Anonymous Referee #1, 14 Jun 2023
This manuscript describes the process of creating a capability that allows scientists to analyze data where the data resides. In this particular example, several high-resolution datasets were created by many centers. The data were then transferred to a central location, cataloged, and made available for analysis on that platform. The authors describe a database, connected to a web portal, that scientists can use to request data for analysis. This work is an important step in understanding how this needed technology could work. As datasets become larger, transferring data to one's home institution creates many copies of data around the world, and that workflow is becoming more difficult because of the time needed to copy the data. I would like to commend the authors on their work in this area and their initiative.
I would like to suggest the following ideas for consideration:
The title: The title suggests "The analysis of large-volume multi-institute climate model output .." The paper talks more about the technology to make the analysis possible instead of the analysis itself. Could the title reflect this more?
Page 2, lines 4-7: The last 3 lines in that paragraph seem a little choppy. Could the flow be fixed?
Page 2, line 26: Could you add a quick comment about the compute hardware on JASMIN's 4000 cores?
Page 6, line 3: Was the 440 TiB size the high-water mark (i.e. maximum) of the storage needed? Some extra clarification would be helpful.
Page 8, line 3-4: You mention that creating data requests earlier in the process would have made things more efficient. Could you elaborate on this a little more?
Page 9, first paragraph:
- Could you clarify what the difference between "data requests" and "unique data requests" is?
- Could you indicate if there's a purge policy and what the retention policy is?
- Are there any requests that are given higher priority?
- How are requests handled when demand is larger than the resource?
Page 9, second paragraph: Are users able to use Jupyter notebooks?
Page 9, line 15: Could you add a response rate for the survey?
Page 10, first paragraph: I applaud this effort. Were there any extra efforts needed to help the scientists that had slow and unreliable internet connections? And if so, could you add a brief mention of what was needed?
Citation: https://doi.org/10.5194/gmd-2023-46-RC1
RC2: 'Review of gmd-2023-46', Anonymous Referee #2, 21 Jul 2023
Summary
In their submission to GMD, the authors describe a programmatic approach to managing large volumes of well-organized earth system model (ESM) simulation output at a central analysis facility (CAF). This approach includes a central orchestration application which facilitates automated data transfer between storage backends (upon user request) in order to enable efficient data processing on the compute partition of the CAF (the UK's JASMIN in this case).
Further, the process of standardizing, transferring and quality-checking the ESM data produced at various HPC sites across Europe, required for access at JASMIN, is also described. However, since these are more or less standard procedures required to enable global sharing of ESM simulation output in the framework of CMIPs, the focus of the paper is clearly placed on the description of data handling at JASMIN. Thus, the paper is very well structured.
The tool described in the paper is the Data Management Tool (DMT) developed in the framework of the PRIMAVERA project. The DMT allows users to query the entire database of available PRIMAVERA data on JASMIN using a GUI. These data are stored either on disk (online) or tape (offline) – a necessity because disk space on JASMIN is limited. Once users find a dataset they want to analyse, the DMT allows for a piecewise retrieval of the data from tape to disk depending on the status of the users' analysis workflow. When the analysis is done, the users accordingly flag the retrieved dataset in the DMT's GUI. This results in the data being removed from disk storage, making space for the datasets requested by other users. Adoption of this system has been good – which may also be due to the fact that the DMT seemed to be the only way to get access to PRIMAVERA data at JASMIN(?).
The authors also include some interesting thoughts and back-of-the-envelope calculations regarding the use of public/commercial storage and compute providers (e.g. Amazon) versus the use of institutional resources such as JASMIN, and also highlight the portability of the DMT software stack to other HPC centres in order to allow easier management of, and access to, data hosted on CAFs.
With the scope of the use case provided by the PRIMAVERA project resulting in relatively well-defined boundary conditions, e.g. data must comply with the CMIP (meta)data standard and be organized according to the well-defined DRS, the design and implementation of the DMT's data-handling routines probably did not pose a significant challenge from a computational science perspective. However, this does not downplay the importance of this contribution's content, because exactly such approaches to data handling at large CAFs are required to cope with the ever-increasing data volumes and complexities stemming from state-of-the-art ESM simulations, e.g. in the context of Destination Earth and associated projects. It is therefore important to document this apparent success story to inform ongoing developments at various international CAFs.
That being said, I have no serious concerns with this contribution to GMD. The paper will contribute an important piece to the landscape of ongoing developments in frameworks for handling and analysing state-of-the-art ESM output. However, I would like to see a bit more content related to the design process of the DMT and how user adoption was enabled and user feedback incorporated. Furthermore, the authors only hint at the portability of the DMT to other infrastructures and that this would require additional work – I would like to see something more concrete here as well. Please see below for a more concrete formulation of my thoughts.
Further, I have some specific minor points. See also below.
Overall, I recommend major revisions – but these should be manageable.
General points
Portability
As stated above, I assume that the boundary conditions of the PRIMAVERA project made it relatively straightforward to design and implement the DMT and its capacities for use at JASMIN. However, the limitations of available fast storage (disk) are increasingly impeding operational day-to-day work at HPC centres, because state-of-the-art ESM simulation output is approaching PB scales for single simulations. For these activities, no agreed-upon standards exist, and users are confronted with a plethora of different file formats, chunkings and data organization principles – all at one CAF. The challenge here is to combine fast (disk, SSD, cloud?) and slow (tape) storage tiers in a way that enables data handling that is both efficient and economic (in terms of energy costs). The DMT described here could be a cornerstone of such a system.
So my questions here are: Could the DMT handle several different output organization principles at once? How generalizable is the database in the DMT's backend? Could the DMT also be used to display data holdings at external institutions?
Design of the DMT
The DMT described in the contribution needs to fulfill two main purposes: allowing data handling given the constraints of limited disk space on JASMIN and providing ESM simulation output users with an intuitive and easy-to-use GUI. While the first requirement provides a hard boundary condition, the second requirement is more complicated to fulfill.
I recommend that the authors provide more information regarding the design process of the DMT and its functionalities. How were the requirements for the DMT gathered from users, and how has user feedback guided software development? The authors write that the project began in 2015, development of the DMT began in 2016, and the first data became available in 2017. Development has continued since then. What were the shortcomings that led to improvements of "the flow of data through the system"? Was user feedback gathered during this process?
I am asking this because, in my experience, it is very hard to impose new ways of data handling on users, i.e. scientists. For example, the availability of only a GUI would probably be accepted with great reluctance by the users at my institution. A command-line interface would definitely be preferred – along with Python APIs which can be called directly from the analysis code.
Were any of these features ever considered in the design of the DMT? Or were the users simply presented with a finished solution which they had to use – whether they liked it or not?
Specific points
Page 2, lines 5-7: this sentence thematically does not fit here, because it motivates the need for high-res ESM simulations from a physical point of view. Could it be placed further up in the introduction, e.g. maybe even as the second sentence?
Page 2, line 8: cite the DMP here, because the reader (at least myself) immediately asks whether the DMP is publicly available
Page 2, line 10: same here, please provide the reference to the DMT already here
Page 2, lines 9-10: can it please be clarified exactly which aspects of the DMP were implemented by the DMT? For example, from my reading, the role of the DMT in long-term archiving within CEDA or in ESGF publication, i.e. essential parts of the DMP, is not clear from the manuscript.
Page 2, lines 13-18: this paragraph is more of a conclusion and should be placed towards the end of the manuscript
Page 3, lines 23-24: can the amount of resources made available for the data management part of the project be quantified? “Sufficient resource” is a bit vague in this context.
Page 5, lines 8-9: How is consistency of the catalog achieved when data is moved between storage tiers? How is the process of data migration triggered? Is it completely automated, or does it involve human intervention?
Page 9, section 7: logging data access is a very nice feature of the DMT (or would be of any tool) because it allows the storage location of data to be chosen according to anticipated usage. Just out of curiosity, can the authors provide some information on the most requested model output variables and fields?
Page 9, line 26: use the acronym CAF
Page 11, lines 8-10: although I have often heard that using commercial cloud providers such as Amazon is very expensive compared to using institutional resources, I have never seen such a calculation. Although it is probably more of a back-of-the-envelope calculation, the result is impressive. For more context: can the authors please provide a reference to these costs or at least provide information as to when these numbers were obtained?
Page 11, lines 24-34: This paragraph focusing on the Pangeo framework does not make a lot of sense to me from a thematic point of view. Rather than focusing on the Pangeo concept, I would focus on how the DMT is part of the IT ecosystem at JASMIN and how it can/could interoperate with the available Jupyter and cloud environments. Or is the DMT indeed a stand-alone tool at JASMIN which cannot be integrated with other tools?
Page 12-13: although it is clear that large IT providers for the ESM community as mentioned here will continue to play a vital role in the evolving European/global infrastructure environment, it is not clear to me what the authors want to say with the formulation “but continued expansion of CAFs such as at… is necessary…”. Can this be elaborated on a bit more?
Citation: https://doi.org/10.5194/gmd-2023-46-RC2
RC3: 'Comment on gmd-2023-46', Anonymous Referee #3, 21 Jul 2023
This paper presents an approach to enabling a community of users to analyze large-scale data via a common analysis platform. These kinds of platforms will continue to grow in importance as data volumes increase, thus this report provides important insight for future projects.
The paper could be strengthened, however, by adding more details about lessons learned, problems encountered, and adjustments made. The value of a paper like this is in providing demonstrations and insight for others who might be building such platforms now or in the future. The user feedback data shown in Section 8 is nice, and illustrates the effectiveness of this project's approach. I also like the discussion in Section 9 of potential adjustments and changes going forward. It would be great to have more of these lessons learned included. This was a multi-year project with many stakeholders and contributors. Such projects are rarely completely linear. Decisions are debated, problems encountered, and adjustments are made. More insight into these kinds of practical challenges would increase the value of this paper to potentially interested readers. More specific comments on this theme are listed below.
Clarification question - the paper could be a bit more clear about who had access to the CAF at which stages. My understanding from the paper is that only PRIMAVERA partners had access to the CAF for analysis, e.g. stage 4 as shown in Figure 1. Is that correct? If not, then this needs to be clarified. If it is correct, this makes sense from a resource management point of view. But it raises questions about the overall impact of such analysis platforms, because the majority of potentially interested users (external people) can't use it, and instead would be accessing the data via the standard portal-based tools such as ESGF downloads? I know the authors cannot solve the problem of making large scale compute available for all users worldwide, but this is an equity issue that the community needs to deal with. It would be nice to at least see this acknowledged.
Specific comments:
Pg. 5, last line - "When their analysis was complete, users then marked the data as finished in the DMT’s web interface and the DMT would then delete the data from disk to create space for other data." Did this work? Did people actually delete their data, or did people leave data on the system for longer than desired?
Pg. 8, lines 3-4 - "The DMT worked well, although some effort was required to maintain the set of expected data requests. Programmatically creating data requests when each file is validated and added to the DMT would have been more efficient." Can you say more about this? This is a good example of my general comment above, namely, that more detail about lessons learned and adjustments made would be nice.
Pg. 12, lines 17-18 - "The tools developed by the PRIMAVERA project have successfully demonstrated the feasibility of such techniques. The tools have been made freely available, but require some development before they can be seamlessly adopted by other projects." Can you say more about this also? What additional development may be required for someone else to use the tools? What is your recommendation on how to do this? Contribute to your Github repositories? Fork them and customize from there? Are you looking for active development and community contributions, or is this toolbase more or less solidified and not interested in active contributions?
Citation: https://doi.org/10.5194/gmd-2023-46-RC3
AC1: 'Comment on gmd-2023-46', Jon Seddon, 04 Sep 2023
We would like to thank the three reviewers for their thoughtful and constructive comments, which we believe have helped to improve the manuscript. We have implemented all of their suggestions and our responses to each comment are listed in the attached file.