the Creative Commons Attribution 4.0 License.
A Standardised Validation Framework for Ocean Physics Models: Application to the Northwest European Shelf
Jeff Polton
Enda O'Dea
Joanne Williams
Abstract. Validation is one of the most important stages of a model's development. By comparing outputs to observations, we can estimate how well the model simulates reality, which is the ultimate aim of many models. During development, validation may be iterated to improve the simulation and to compare against similar existing models or previous versions of the same configuration. As models become more complex, data storage requirements increase and analyses improve, so scientific communities must develop standardised validation workflows for efficient and accurate analyses, with the ultimate goal of complete, automated validation.
We set out the process and principles used to construct a standardised and partially automated validation system. This is discussed alongside five principles that are fundamental to our system: scalability, independence from data source, reproducible workflows, an expandable code base and objective scoring. We also describe the current version of our own validation workflow and discuss how it adheres to the above principles. We use the COAsT Python package as a framework within which to build our analyses. COAsT provides a set of standardised oceanographic data objects ideal for representing both modelled and observed data. We use the package to compare two model configurations of the Northwest European Shelf against tide gauge and profile observations.
David Byrne et al.
Status: closed
-
RC1: 'Comment on gmd-2022-218', Anonymous Referee #1, 26 Jan 2023
This manuscript appears to describe a python package that can be used to compare models with observations and quantify the differences in a reproducible way. It gives a very nice example of this comparison using two versions of the NEMO model. There are some fantastic moments in this manuscript, for example the section about vertical interpolation options on line 281 is wonderful. One of my struggles in reviewing this paper has been that it is not clear if it is a documentation paper for the COAsT python package. I do not wish to review the whole of the COAsT package because its goals are a bit nebulous. The scope of this paper is narrower and makes more sense to me, but I still feel that the manuscript could more clearly state that the COAsT python package contains other tools that are not useful for comparing models with observations in this way.
The programming decisions here make sense in the operational context in which this paper was written: a context in which many versions of the same coastal model are being run, with the goal of improving its realism, and in which the same operations are being performed many times. The list in the introduction to the paper reads like a list of generic values that are important for all scientific code. I hope that it can be better tuned to make it clear why particular choices were taken. The software framework described here puts the data in a very specific format: this specific format is an advantage in this context, and the goal of this work is not to write general code for comparing any model with any observations.
I recommend that the authors make significant changes in response to my comments. Some updates to the documentation of the COAsT package may also be appropriate.
Major comments:
1. On first read, it was not clear to me that using classes like the "Gridded" class had real benefits. It seemed to me that this data could simply be stored as an xarray dataset, and the relevant dimension names could be input into any plotting or calculation functions. I eventually realized that if you were performing similar operations multiple times, putting all of this information into an object where the details are abstracted away from the user probably reduces errors. But I didn't understand that until I had gone away and thought about it a lot. Please rewrite the beginning of the paper to emphasize this and any other advantages of classes that I may have missed.
Perhaps this is the same point, but I was confused by the sentence "By providing a middle layer into the workflow, it is much simpler to apply the analysis technique to multiple data sources, to share it with others and to expand upon it in the future." I do not understand what "providing a middle layer to the workflow" means, and I would like to understand more about why classes were chosen for this task.
2. I can't find any examples of this python package being used on gridded datasets that are not based on NEMO. It is fine if this package (and hence framework) is actually mainly designed for NEMO data, but then the paper should clearly state this. If this package will be applied to other gridded datasets, I'd like to see a discussion of how different kinds of data (netcdf, zarr, binary) could be read by the package, e.g. via xarray. Lines 30-37 say some really important things about the need for lazy loading, but it's no good having lazy loading if I have to rewrite all my data in a different format in order to even load it into the package.
In addition, I am not sure that this package makes full use of lazy loading. If the data is in netcdf format with no use of kerchunk indices, then you must load the whole netcdf file in order to access the data. The dask tutorial on the COAsT website is a bit lacking here. Certainly computation can be delayed and some parallelization should be possible because the objects are based on xarray, but again it's not clear why building these new classes is helpful, because the user has to use xarray/dask in order to parallelize anyway. Why not just use xarray objects directly?
3. It seems to me that COAsT is a bit of an "everything but the kitchen sink" package at the moment. Having an expandable code base is nice, but xarray already exists and some more clarity on the goals of COAsT would certainly help people to understand what is going on here.
4. If this manuscript is meant to document the python package (and the first half of the manuscript suggests that it is), then I'd like to see a significant discussion of testing. Part of having an expandable code base is having well-designed tests. I see the package has some testing set up. Good code coverage is also necessary for the testing, so that untested code isn't constantly being added.
5. I like the Matched Harmonic Analysis section, but I'd like to see a bit more context at the beginning. What is the overall goal of the comparison?
6. I was not able to understand the description of CRPS provided between line 290 and 299. Please provide more detail on what F, x and y represent.
7. The code actually used to make the figures presented here does not appear to be available anywhere (potentially it is located somewhere in the package, but its location is not given). For a paper that talks about reproducibility, I think that the plotting code should at least be provided. Ideally the datasets used to generate the figures would also be made available, but I understand that they might be too large.
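On comment 1: the benefit the reviewer eventually identifies, that a class can translate source-specific dimension names into one standard set once, so every downstream function relies on the same names, can be sketched in plain Python. This is a hypothetical illustration only, not the actual COAsT API; the class and method names here are invented for the sketch.

```python
# Hypothetical sketch (NOT the actual COAsT API; names are invented):
# a class translates source-specific dimension names into one standard
# set once, so downstream analysis and plotting functions can rely on
# the same names instead of asking the user for them each time.

STANDARD_DIMS = ("t_dim", "z_dim", "y_dim", "x_dim")

class Gridded:
    """Minimal stand-in for a gridded-data object with standardised dims."""

    def __init__(self, variables, dim_map):
        # dim_map translates the source's dimension names,
        # e.g. {"time_counter": "t_dim"} for NEMO-style output.
        self.variables = {
            name: tuple(dim_map.get(d, d) for d in dims)
            for name, dims in variables.items()
        }

    def dims_of(self, name):
        return self.variables[name]

# Two differently named sources map onto the same standard names,
# so one analysis function can serve both.
nemo = Gridded({"ssh": ("time_counter", "y", "x")},
               {"time_counter": "t_dim", "y": "y_dim", "x": "x_dim"})
other = Gridded({"ssh": ("ocean_time", "lat", "lon")},
                {"ocean_time": "t_dim", "lat": "y_dim", "lon": "x_dim"})
```

Once both sources expose `t_dim`, `y_dim` and `x_dim`, an analysis written against the standard names needs no per-source branching, which is the error-reduction the reviewer describes.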
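On comment 2: the lazy loading and chunking under discussion can be illustrated schematically. Real workflows would use xarray and dask; this pure-Python generator, with invented function names, is only an analogue of the idea that data is described up front but materialised one chunk at a time, when a computation demands it.

```python
# Schematic, pure-Python analogue of lazy loading and chunking (real
# workflows use xarray + dask; these function names are invented).

def lazy_chunks(n_values, chunk_size):
    """Yield index chunks lazily; nothing is 'read' until iterated."""
    for start in range(0, n_values, chunk_size):
        # In a real workflow, this is where one slice of a netCDF or
        # zarr store would be read from disk.
        yield range(start, min(start + chunk_size, n_values))

def chunked_mean(n_values, chunk_size):
    """Compute a mean chunk by chunk, never holding all values at once."""
    total, count = 0.0, 0
    for chunk in lazy_chunks(n_values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count
```

The reviewer's point about netCDF without kerchunk indices is about where the chunk boundary sits: delayed computation only pays off if the storage format lets each chunk be read independently.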
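On comment 6: in the usual definition, the CRPS integrates the squared difference between the forecast CDF F(x) and the Heaviside step function at the observation y. A minimal sketch, assuming the standard ensemble formulation rather than the paper's exact implementation:

```python
# Illustrative sketch of the ensemble CRPS (assumed standard definition,
# not necessarily the paper's implementation). F is the empirical CDF
# built from the model values x_i; y is the observation. The integral of
# (F(x) - H(x - y))**2 over x reduces, for a finite set of values, to:
#   CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j|

def crps(ensemble, y):
    """CRPS of a finite ensemble of values against one observation y."""
    n = len(ensemble)
    spread_to_obs = sum(abs(x - y) for x in ensemble) / n
    internal_spread = sum(abs(a - b) for a in ensemble
                          for b in ensemble) / (2 * n * n)
    return spread_to_obs - internal_spread
```

For a single model value the score collapses to the absolute error, which is one way to read it: a generalisation of absolute error from point forecasts to distributions.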
Minor points:
1. I'd like to see some citations for technical concepts like lazy loading, chunking etc. I know that traditional references for these concepts may not be available, but I think non-expert readers would benefit from some references.
2. I would also like to see more citations for concepts introduced between line 145 and line 165. e.g. "the estimation of tides is a vital step for the validation of sea surface height in our regional models" Please reference an example. "Non-tidal residual signals can be generated by many processes but in coastal regions the modest (sic) significant are generated by atmospheric processes". How do we know this?
3. Figures 1 and 3 have colorbars with white in the middle. I would recommend choosing a different colorbar so that we can see all the observations.
Typos:
1. Line 175: "quick and easy" should be "quickly and easily"
Citation: https://doi.org/10.5194/gmd-2022-218-RC1
-
RC2: 'Comment on gmd-2022-218', Anonymous Referee #2, 12 Feb 2023
- General comments
The authors propose the foundation of a framework based on the COAsT Python package to evaluate kilometric scale regional modeling outputs against observations. After describing their principles to construct such a workflow, they showcase two applications: the comparisons to tide gauges and mooring profiles along the coasts of the Northwest European Shelf.
The paper is well written, and the balance between the description of the method and the two applications is good. The authors claim that validation is part of the model's development and that analyses should be automated, and they are correct. Their work is a cornerstone for achieving such a goal.
The main issues concern, first, the position of their tool among all the Python packages dedicated to ocean analyses and, second, the fact that some of their fundamental principles (scalability, independence from data source) may not be fully demonstrated.
- specific comments
Title: As the framework is entirely related to the COAsT Python package, why not mention COAsT in the title?
Introduction
The authors mention that they use the COAsT Python package in their standardized validation framework. But throughout the paper, they describe classes, methods, and analyses available in COAsT. This paper looks like a scientific/engineering application of the COAsT library. Thus, the capabilities and novelty of the COAsT package should be well explained in a specific paragraph to outline its contribution among other ocean analysis tools. And the authors must include references to other related packages (in Python language, at least) in the paper.
L66. The authors specify that "many" principles can be satisfied by using the COAsT package. Indeed, principles 3, 4 and 5 are straightforward, but not 1 and 2. So it is worth clarifying which principles still need to be fully achieved and why.
Methods
L116. Are there any intermediate steps, such as saving sub-datasets in zarr format?
L127. Please, for easy reading, summarize what differs between the two configurations (even though one can get the information from the table). And clearly state what is rigorously the same.
Table 3. What is the coastline product? Also, information on the output files is helpful to understand better the difficulty of reading (more complicated and long as the number of files is large even with xarray reading methods) and the impact of temporal interpolations.
L131. Please indicate the multiple sources.
L132. "These locations... Section-3" should be removed to stick to the general description. But a (bigger) figure dedicated to the locations and types of data is welcome because it is much easier to see the locations of the tidal amplitudes and phases in Figure 3. And it will allow for renumbering the figures in a progressive manner in accordance with the text: the current 1, 2, 3, 4 numbers will become 4, 3, 2, 5. Moreover, the regions over which profile comparisons are averaged could be visualized.
Validation against tide gauge data
L182 Reformulate? For each location, the analysis lengths have been identified from observations.
L194. "it must not be ignored". What do you mean? The harmonic analysis can be performed, but considering the uncertainty?
L195 - L203 - L228. The reader can get confused. L195 states "both the MHA approach and an application of the harmonic variability", but L203 says "As discussed in Section-3.1, we cannot apply a matched harmonic analysis to this analysis", and L228 says "apply the matched harmonic analysis described in Section-3.1". Could you add "(Section 3.2)" and "(Section 3.3)" to the sentence at L195?
L258. And we can conclude that the models do not capture large events, or that large events are underestimated in the models, right? A short interpretation of the figure is welcome.
Validation against profile data
L285-... The separation between the methodology and the results makes the reading a bit confusing. Please, move this sentence into the previous paragraph and add a reference to figure 7. Then start a new paragraph about CRPS at the sea surface.
L304 for temperature and salinity.
L306. Please comment on the results shown in figures 9-10.
Figures 7-8 and 9-10. Why do the regions differ between the two panels? "Irish sea" versus "off Shelf"?
Further discussion
L314 Scalability is one of the fundamental principles on which this tool is based, and rightly so. However, the showcases do not fully demonstrate this capability. The comparisons use a large number of vertical profiles and time series. In this sense, scalability is achieved. Even though the model outputs are huge, the analyzed data volume (time series and profiles) remains small because Pangeo tools (dask, xarray, etc.) effectively sub-select specific locations. Perhaps the authors should slightly moderate their conclusion or clarify the potential beyond the actual results shown in the paper. Similarly, the design intends data-source independence, but so far, only NEMO model outputs seem to have been used.
- technical corrections
Table 1. Should be Gridded (t_dim, z_dim, y_dim, x_dim)? Is the order correct for Indexed?
Table 2. Profile: isn't time a coordinate?
Table 3. Initial conditions "Analysis period starts 2004" should be set in the text.
L201. Figures 1 and 4. Figures 2 and 3 are excluded.
L214. Just curious. From figures 1 and 4, CO9p0 improves the representation of the tides along the NW coast of the UK? Why? Different coastline? Or bottom friction formulation?... Even though I regret it, no additions should be made to the paper to explain the improvements, as it is not the topic of this paper. So feel free to answer my question or not.
L244. Correct the sentence "Both models have are similar"
Figure 2. Please complete the legend. Could the colorbar on the right panel be changed? It isn't easy to distinguish the squares. And why not reduce the geographical extension of the domain?
Figure 3. Please (a) add the units (day?); (b) according to the text, the unit is % of the M2 amplitude (not m).
Figure 5. The color scale is not discriminating in panels a and b.
Citation: https://doi.org/10.5194/gmd-2022-218-RC2
-
AC1: 'Responses to reviewers', David Byrne, 06 Apr 2023