Novel clustering framework using k-means (S k-means) for mining spatiotemporal structured climate data
- 1Center for Computational Sciences, University of Tsukuba, Japan
- 2Hanoi University of Sciences, National University Hanoi, Vietnam
- 3Research Applications Laboratory, National Center for Atmospheric Research, USA
Abstract. Dramatic increases in climate data underlie a gradual paradigm shift in knowledge-acquisition methods from physical-based models to data-based mining techniques. k-Means is one of the most popular data clustering/mining techniques, and it has been used to detect hidden patterns in climate systems. k-Means is established based on distance metrics for pattern recognition, which is relatively ineffective when dealing with “structured” data that are dominant in climate science, that is, data in time and space domains. Here, we propose (i) a novel structural similarity recognition-based k-means algorithm called structural k-means or S k-means for climate data mining and (ii) a new clustering uncertainty representation/evaluation framework based on the information entropy concept. We demonstrated that the novel S k-means could provide higher-quality clustering outcomes in terms of general silhouette analysis, although it requires higher computational resources compared with conventional algorithms. The results are consistent with different demonstration problem settings using different types of input data, including two-dimensional weather patterns, historical climate change in terms of time series, and tropical cyclone paths. Additionally, by quantifying the uncertainty underlying the clustering outcomes we for the first time evaluated the “meaningfulness” of applying a given clustering algorithm for a given dataset. We expect that this study will constitute a new standard of k-means clustering with “structural” input data, as well as a new framework for uncertainty representation/evaluation of clustering algorithms for (but not limited to) climate science.
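The structural similarity idea the abstract refers to can be illustrated with the SSIM index of Wang et al. (2004), which the referee comments below identify as the basis of the method. The following is a minimal sketch, not the authors' implementation: it assumes a single global window, the commonly used constants K1 = 0.01 and K2 = 0.03, and the joint value range of the two inputs as the dynamic range. The function name `ssim_global` is hypothetical, and the manuscript's exact S-SIM formulation may differ.

```python
import numpy as np

def ssim_global(x, y, data_range=None, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally sized arrays.

    Follows the combined luminance/contrast/structure form of
    Wang et al. (2004); the window choice and constants are assumptions.
    """
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    if x.shape != y.shape:
        raise ValueError("x and y must have the same number of elements")
    if data_range is None:
        # Assumption: take the joint value range as the dynamic range L.
        data_range = max(x.max(), y.max()) - min(x.min(), y.min())
    if data_range == 0:
        data_range = 1.0  # avoid 0/0 for constant inputs
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Usage: a sine-shaped series compared with itself, an offset copy, and its mirror.
series = np.sin(np.linspace(0, 2 * np.pi, 50))
print(ssim_global(series, series))        # identical inputs give 1
print(ssim_global(series, series + 0.3))  # same shape, shifted mean
print(ssim_global(series, -series))       # opposite structure gives a negative value
```

A dissimilarity suitable for clustering can then be taken as `1 - ssim_global(x, y)`; whether the manuscript uses exactly this conversion is not stated on this page.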
Quang-Van Doan et al.
Status: final response (author comments only)
RC1: 'Comment on gmd-2022-172', Anonymous Referee #1, 16 Oct 2022
The manuscript by Doan et al. presents an S k-means clustering framework that improves on standard k-means clustering, and demonstrates its application to several climate datasets.
The manuscript presents a methods-focused study, but it lacks sufficient discussion to demonstrate the benefits of the proposed algorithmic improvements over the standard k-means algorithm. The section "Results and Discussions" focuses more on results and less on discussion, which is the critical weakness of the manuscript in its current form.
1. The reference list is missing several key references cited in the text: Wang et al., 2004; Wang and Bovik, 2009; Mo et al., 2014; Han and Szunyogh, 2018; Doan et al., 2021.
2. One of the motivations for the proposed work, as discussed in the introduction, is to mine the unique "structuredness" of temporal and spatial climate data (Lines 67-81). However, the rest of the manuscript focuses on comparing clustering methods using silhouette scores, uncertainty degree, etc. The proposed S k-means consistently shows better scores than the other methods, but whether and how it better captures the "structuredness" of the data needs to be discussed, since that is the key contribution of the study.
3. The structural similarity metric (Section 2.2) is the most important part of the study. However, several symbols and terms in Equations 2 and 3 and on Lines 142-145 are not defined or explained, in particular in the equations for luminance, contrast, and structure. The cited articles that developed the similarity metrics (Wang et al., 2004; Wang and Bovik, 2009) are also missing from the reference list. This makes the similarity metric difficult to understand. Aside from describing the equations for S-SIM, there is no discussion, in the methods section or later, of how these structural metrics capture the spatial and temporal structuredness of climate data.
4. The discussion of clustering results in Section 5.2 is very high level. The question remains: aside from slightly higher scores, what unique and new insights does S k-means clustering enable?
5. I am glad to see S k-means compared with three other k-means variants. They were all run for 11 different values of 'k' with 10 random ensembles each, resulting in a total of 1320 clustering runs. But were all four k-means variants run with exactly the same random starting centroids for the purpose of comparison? This is important for a fair comparison. Also, was a consistent convergence criterion used for all four methods? A convergence criterion is mentioned on Lines 128-129, but which criterion was used in the study is never discussed.
6. Lines 364-365 "As the first study to address this issue, we believe that CUEF can constitute a new standard for addressing uncertainty issues when performing data clustering in (but not limited to) climate science." -- This is an overstatement. It is well known that clustering algorithms are local search methods that are sensitive to random starts; however, there are a number of approaches in the published literature to identify good seeds and ensure that the algorithms converge to a consistent cluster set.
7. Lines 370-374: "This makes sense because different data have different topologies, which can make them unsuitable or even invalid for a clustering solution. The question of whether it is valid or meaningful to apply a clustering solution to a dataset is more important than how to find the best method of clustering. Although this issue is fundamentally important, to the authors’ best knowledge, no studies have addressed this question or proposed a solution, at least among the climate sciences." -- This again is a broad and biased inference based on the demonstrated applications and results.
8. The authors term their clustering framework novel, including in the title of the manuscript, which in my opinion is overstated and not justified. The paper has three key methodological elements plus applications to three selected climate datasets.
The application component of the study is weak and limited in scope. The authors acknowledge, however, that application and interpretation were not the focus of their study (Lines 277-278): "We do not intend to physically interpret the specific clustering outcomes, although some phenomenal explanations are provided in the manuscript." So the novelty is not in the three applications.
The three elements of the methodology are adopted from published literature:
1. Structural similarity-based k-means, adopted from Wang et al., 2004 and Wang and Bovik, 2009.
2. Evaluation of clustering algorithms using similarity distributions (adopted from Doan et al., 2021) and silhouette scores (adopted from Hassani and Seidl, 2017).
3. Clustering uncertainty degree and information theory (Vinh et al., 2009).
Building upon published literature is the normal course of scientific research, but I suggest reconsidering the use of the term "novel".
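Comment 5 above asks whether all variants were started from the same random centroids and stopped by the same rule. A minimal sketch of such a paired comparison follows, under stated assumptions: `kmeans`, `euclidean`, and `one_minus_corr` are hypothetical helpers, the correlation-based dissimilarity is only a stand-in for a structural metric (it is not the manuscript's S-SIM), and "assignments unchanged" is used as the shared convergence criterion. It does not describe how the authors actually ran their experiments.

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def one_minus_corr(a, b):
    # Stand-in structural dissimilarity; NOT the manuscript's S-SIM metric.
    return 1.0 - float(np.corrcoef(a, b)[0, 1])

def kmeans(data, init_idx, dissim, max_iter=100):
    """Lloyd-style k-means with a pluggable dissimilarity.

    init_idx selects the rows of `data` used as starting centroids, so
    different variants can be started from exactly the same seeds.
    """
    centroids = data[init_idx].copy()
    labels = None
    for _ in range(max_iter):
        dist = np.array([[dissim(x, c) for c in centroids] for x in data])
        new_labels = dist.argmin(axis=1)
        # Shared convergence criterion: stop when no assignment changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = data[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
data = rng.normal(size=(60, 12))
k, n_runs = 3, 5

# Draw the starting centroids once, then reuse them for every variant.
seeds = [rng.choice(len(data), size=k, replace=False) for _ in range(n_runs)]
variants = {"euclidean": euclidean, "one_minus_corr": one_minus_corr}
results = {
    name: [kmeans(data, seed, dissim)[0] for seed in seeds]
    for name, dissim in variants.items()
}
```

Because each variant sees the same `seeds` list, any difference between `results["euclidean"][i]` and `results["one_minus_corr"][i]` comes from the dissimilarity measure rather than from the random initialization.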
AC1: 'Reply on RC1', Quang-Van Doan, 01 Nov 2022
The comment was uploaded in the form of a supplement: https://gmd.copernicus.org/preprints/gmd-2022-172/gmd-2022-172-AC1-supplement.pdf
RC2: 'Comment on gmd-2022-172', Anonymous Referee #2, 17 Nov 2022
This study by Doan et al. presents S k-means clustering as a better alternative to traditional k-means methods for clustering data in climate and atmospheric science. The study introduces a novel framework to identify uncertainty within clustering methodologies; the framework lets researchers compare different clustering techniques with each other without requiring a ground-truth dataset to compare results against. The study presents the methodology in an excellent manner that seems easy to replicate and apply in future studies.
S k-means is a useful technique that adapts SSIM techniques, traditionally used in image comparison analysis, to climate data. It improves on traditional distance-metric comparisons because it takes into account both spatial and temporal differences in datasets. This manuscript does a good job of summarizing the use of these techniques in three example tests for typical climate situations in which clustering is used. However, the discussion and summary sections are lacking. The manuscript needs to place more emphasis on the usefulness of the new uncertainty framework compared with currently available methodology. The results are well explained, but there is a lack of discussion about how this work brings a significant change to current techniques or improves current understanding.
1. Table 1 provides a nice summary of the metrics compared between the k-means models used in the study. The text mentions the mean and standard deviation from the table, but the other metrics are mentioned only in passing. The Shannon metric needs to be explained more, and some presentation of the data should be given in the text to give the reader context as to its meaning and how it is used in this study.
2. Many works cited in the text are missing from the reference list. Please check the references in the paper to make sure all are listed; here are a few that I found missing: Jancey 1966, Lloyd 1957, Wang et al. 2004, etc.
3. This study intends to establish both the uncertainty framework and the S k-means methodology as a new standard for data mining in the climate sciences. While the uncertainty framework definitely provides a new standard by which to test the usefulness and effectiveness of different clustering algorithms against each other, no evidence is shown of the ability of S k-means clustering itself. While comparisons are shown between S k-means and other k-means clustering measures, we cannot objectively say from this study that the S k-means method better captured the underlying structures within the data than the other k-means models. A more comprehensive case study would be needed, rather than the short test cases, that applies the methodologies to a known problem with a ground truth to compare against.
4. The use of three different test case scenarios to test the uncertainty framework was a great idea and well presented. It gives good insight into how this methodology can be used in the wide array of applications in climate science.
5. Lines 370-374. The question of applying the framework to see whether data are suitable for clustering is a much more novel approach, and more useful to the science, than comparing the initializations. There are many other methodologies for obtaining suitable initializations for clustering and helping datasets converge on useful clusterings.
6. Lines 370-374. It is tough to say with respect to WPs that clustering may be ineffective. WPs present a lot of uncertainty compared with other types of climate data, so without care as to what is being analyzed or searched for in the data, uncertainty analysis may present false positives for datasets that would not be suitable for clustering. This is not a problem with the methodology; the authors do note that these are inherently data issues, which the methodology does not take into account. The authors would do well to note similar situations in the manuscript for those who would use this method in the future.
7. Some figures need revision, specifically Figures 3, 4, and 5. In Figure 3, the silhouette score charts are very small compared with the WP plots. Make them a similar size and make the text more legible. Figures 4 and 5 have the silhouette score charts inside the other figures. There is already far too much going on inside these figures, and adding the silhouette plots inside them makes them more cluttered and confusing. Move them outside the plots and enlarge them.
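Comment 1 above asks for more context on the Shannon metric. As general background only (the manuscript's own definition of its clustering uncertainty degree is not reproduced on this page), the sketch below computes the Shannon entropy of a single partition and the mutual information between two partitions of the same samples, the kind of information-theoretic quantities discussed by Vinh et al. (2009). The function names are hypothetical.

```python
import numpy as np

def shannon_entropy(labels):
    """Shannon entropy (in bits) of the cluster-size distribution."""
    _, counts = np.unique(np.asarray(labels).ravel(), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(labels_a, labels_b):
    """Mutual information (in bits) between two labelings of the same samples."""
    _, a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)
    _, b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)
    # Contingency table of co-occurring labels, normalized to a joint distribution.
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1)
    joint /= joint.sum()
    p_a = joint.sum(axis=1, keepdims=True)
    p_b = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float((joint[nonzero] * np.log2(joint[nonzero] / (p_a @ p_b)[nonzero])).sum())

# Usage: two runs that agree up to relabeling, and two that are unrelated.
run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]   # same partition, labels swapped
run_3 = [0, 1, 0, 1]   # independent of run_1
print(shannon_entropy(run_1))
print(mutual_information(run_1, run_2))
print(mutual_information(run_1, run_3))
```

Mutual information is insensitive to how clusters are numbered, which is why it is a natural basis for comparing repeated runs of a randomly initialized algorithm; low agreement across runs indicates that the clustering outcome depends strongly on the initialization.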
Minor notes:
Lines 72-74: Rephrase the wording, it is confusing in this state.
Line 75: Remove ".It is" and use "because" to join the two sentences into one for better flow.
Line 125: Should cite the SSIM technique (Wang et al. 2004)
Lines 158-160: What's the interpolation method used?
Line 238: Could explain cluster realization better/earlier. Explaining it in this sentence while also introducing a new concept could cause confusion to the reader.
Lines 239-240: What do you mean by partition set? Is this the same thing as the cluster realization?
Line 246: What do you mean by weakness? Is it related to the randomness you discuss in the next few lines?
Line 297: Change tense of "were" to "are".
Line 316: What does "completed by C K-means" mean? Is it a typo?
AC2: 'Reply on RC2', Quang-Van Doan, 23 Nov 2022
The comment was uploaded in the form of a supplement: https://gmd.copernicus.org/preprints/gmd-2022-172/gmd-2022-172-AC2-supplement.pdf
Model code and software
S k-means Quang-Van Doan https://github.com/doan-van/S-k-means
Viewed
- HTML: 397
- PDF: 107
- XML: 15
- Total: 519
- Supplement: 36
- BibTeX: 5
- EndNote: 5