This work is distributed under the Creative Commons Attribution 4.0 License.
Clustering analysis of very large measurement and model datasets on high performance computing platforms
Abstract. Spatiotemporal clustering of data is an important analytical technique with many applications in air quality, including source identification, monitoring-network analysis, and airshed partitioning. Hierarchical agglomerative clustering is one such algorithm, in which sets of input data are grouped by a chosen similarity metric without requiring any a priori information about the final arrangement of clusters. Modern implementations of the algorithm have O(n² log(n)) computational complexity and O(n²) memory usage, where n is the number of initial clusters. This dependence can strain the resources of even very large individual computers as the number of initial clusters increases into the tens or hundreds of thousands, for example when clustering all the points in an air-quality model's simulation grid as part of an airshed analysis (~10⁵ to 10⁶ time series to be clustered). Using two parallelization techniques, the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), we have reduced the wallclock time of, and increased the memory available to, a new hierarchical clustering program by dividing the program into blocks which run on separate CPUs but communicate with each other to produce a single result. The new algorithm opens up directions for large-scale data analysis which had previously not been possible. Here we present a massively parallelized version of an agglomerative hierarchical clustering algorithm which is able to cluster an entire year of hourly regional air-quality model output (538×540 domain; 290,520 hourly concentration time series) in 12 hours 37 minutes of wallclock time by spreading the computation across 8000 Intel® Xeon® Platinum 8830 CPU cores with a total of 2 TB of RAM. We then show how the new algorithm enables a new form of air-quality analysis starting from air-quality model output. We present maps of the different airsheds within the model domain, identifying a distinct set of regions for each chemical species. These regions can be used as an aid in determining the placement of surface air-quality monitors that gives the most representative sampling for a fixed number of monitors, or the number of monitors required for a given level of similarity between airsheds. Finally, we demonstrate the new algorithm's application to source apportionment of very large observational datasets through the analysis of a year of Canada's hourly National Air Pollution Surveillance Program data, comprising 366,427 original observation vectors: a problem size that would be intractable for other source-apportionment programs such as Positive Matrix Factorization.
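The preprint's source code is not reproduced on this page. The following minimal sketch, written against mpi4py and NumPy, illustrates only the general row-block decomposition described in the abstract, in which the O(n²) distance matrix is split into blocks held on separate processors that communicate to find the next pair of clusters to merge. All sizes, names, and the Euclidean metric are illustrative assumptions, not the paper's actual choices.

```python
# Illustrative sketch only: row-block decomposition of the pairwise
# distance matrix across MPI ranks. Data sizes and the Euclidean metric
# are assumptions, not the preprint's actual configuration.
# Run with e.g.: mpirun -n 4 python block_distances.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n, t = 1000, 720                 # hypothetical: 1000 series of 720 hours
rng = np.random.default_rng(42)  # same seed, so identical data on all ranks
series = rng.random((n, t))

# Each rank owns rows [lo, hi) of the n x n distance matrix, so the
# O(n^2) storage is divided across the ranks' separate memories.
lo, hi = rank * n // size, (rank + 1) * n // size
block = np.empty((hi - lo, n))
for i in range(lo, hi):
    block[i - lo] = np.linalg.norm(series - series[i], axis=1)
    block[i - lo, i] = np.inf    # mask the zero self-distance

# One agglomerative step: each rank proposes its locally nearest pair,
# and the root rank selects the global minimum among the proposals.
r, c = np.unravel_index(np.argmin(block), block.shape)
candidates = comm.gather((block[r, c], lo + r, c), root=0)
if rank == 0:
    d, i, j = min(candidates)
    print(f"first merge: clusters {i} and {j} at distance {d:.4f}")
```

In a full agglomerative run, the selected merge would be broadcast back to all ranks, each rank's rows updated, and the search repeated n-1 times; per-rank loops such as the row computation above are where OpenMP-style threading would be applied in a compiled implementation.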
Withdrawal notice
This preprint has been withdrawn.
-
Interactive discussion
Status: closed
-
RC1: 'Comment on gmd-2023-185', Anonymous Referee #1, 03 Jan 2024
The comment was uploaded in the form of a supplement: https://gmd.copernicus.org/preprints/gmd-2023-185/gmd-2023-185-RC1-supplement.pdf
-
RC2: 'Comment on gmd-2023-185', Anonymous Referee #2, 21 Jan 2024
Dear Author(s),
Firstly, I would like to commend your efforts in developing a hierarchical clustering method based on the integration of MPI and OpenMP. The application of your method to air-quality model analysis, and its efficacy on large-scale data in a distributed memory architecture cluster, is indeed noteworthy.
However, there are several areas that could be improved before the paper can be considered for acceptance. I would like to highlight these and provide some suggestions:
1. The paper could benefit from a more comprehensive review of related work on the parallelization of hierarchical clustering and other clustering methods. There is existing literature in which scholars have made efforts to parallelize hierarchical clustering on distributed clusters, such as "A scalable algorithm for single-linkage hierarchical clustering on distributed-memory architectures," "A novel parallelization approach for hierarchical clustering," and "A hierarchical clustering algorithm for MIMD architecture." The current version of the manuscript seems to lack a thorough survey of these related studies.
2. In lines 80 to 86, the manuscript discusses the shift towards parallel computing to address the high time and space complexity associated with hierarchical clustering of ultra-large-scale data. However, I would suggest considering existing high-performance hierarchical clustering algorithms that reduce complexity at the algorithmic level before resorting to increased computational resources, which could be more effective; such improvements would be further amplified when combined with additional computational resources. Several algorithms have already been optimized for time and space complexity in the field of hierarchical clustering. Here are a few notable methods (an illustrative usage sketch for two of them follows this list):
a. SLINK (Single-Linkage Clustering): Optimized for single-linkage hierarchical clustering, the SLINK algorithm reduces time complexity to O(n^2) and space complexity to O(n).
b. CLINK (Complete-Linkage Clustering): Similar to SLINK, optimized for complete-linkage clustering with a time complexity of O(n^2), although it may be slower with different data types.
c. Fastcluster: An efficient hierarchical clustering library for larger datasets, providing efficient implementations for different linkage strategies such as single, complete, and average linkage.
d. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): Designed for large datasets, BIRCH initially compresses data using a CF Tree before applying hierarchical clustering.
e. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise): A density-based clustering algorithm extending DBSCAN, converting the concept of core distance into hierarchical clustering to identify clusters of varying density.
f. Efficient Agglomerative Clustering using a Heap (EAC): EAC uses a heap data structure to optimize the process of finding the nearest cluster pair, making the merging steps more efficient.
g. Divisive Analysis Clustering (DIANA): A divisive clustering algorithm that starts with a single cluster containing all samples and recursively splits it into smaller clusters, which may be more efficient for large datasets than agglomerative methods.
3. The manuscript would also benefit from an expanded discussion of the necessity of the hierarchical clustering method. Readers not specialized in the field may wonder why one must use hierarchical clustering, which is relatively more complex, rather than faster clustering methods such as K-means. The balance between effectiveness and efficiency could be better articulated.
4. Additional experimentation would strengthen the paper's contribution. Comparisons of the proposed method with previous parallel hierarchical clustering algorithms, traditional high-performance hierarchical clustering methods like HDBSCAN, and simpler clustering methods such as K-means, would be highly illustrative. Such comparisons would highlight the necessity, advantages, and advancements of your work.
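For illustration, here is a minimal sketch of how two of the libraries named above are typically invoked; both packages are available on PyPI. The data, sizes, and parameters are placeholders, not tuned choices for the authors' datasets.

```python
# Illustrative usage of two off-the-shelf codes named in the list above.
# The data and all parameters here are placeholders.
import numpy as np
import fastcluster            # efficient O(n^2)-time linkage implementations
import hdbscan                # density-based hierarchical clustering

X = np.random.default_rng(1).random((2000, 24))  # 2000 hypothetical daily profiles

# fastcluster: same dendrogram output format as scipy.cluster.hierarchy.
# linkage_vector avoids materializing the full distance matrix, giving
# O(n) memory for the supported methods ('single', 'ward', ...).
Z = fastcluster.linkage_vector(X, method='ward')

# HDBSCAN: extracts clusters of varying density from a mutual-reachability
# hierarchy, with no explicit distance matrix required from the caller.
labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(X)
print(Z.shape, np.unique(labels))
```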
Regrettably, considering the aforementioned points, I believe the article is not yet suitable for acceptance in its current form.
Best regards
Citation: https://doi.org/10.5194/gmd-2023-185-RC2
-
AC1: 'Comment on gmd-2023-185', Colin Lee, 23 Jan 2024
We thank the reviewers for their thorough and carefully considered comments. We note that both reviewers appear to agree that the paper needs more context from the wide field of existing clustering algorithms in order to present this implementation as novel and useful, and this point is well taken. We will require some time to perform preliminary analysis and investigation before posting a final author response comment, and we request your patience while we perform these tasks. We will respond to the reviewers' specific comments in that final response in a few weeks.
Citation: https://doi.org/10.5194/gmd-2023-185-AC1
-
AC2: 'Final author response to reviewer comments', Colin Lee, 16 Feb 2024
-
EC1: 'Reply on AC2', Xiaomeng Huang, 17 Feb 2024
Dear Colin,
After careful review by two independent referees, it has become apparent that their assessments were predominantly negative. We therefore regret to inform you that your manuscript has been rejected for publication in GMD. The reviewers raised concerns regarding the methodology, the coverage of the broader literature, the data analysis, and the overall contribution of your research. We appreciate your response to the reviewers' comments, but we agree with their assessments that significant revisions would be necessary before your work could be reconsidered for GMD publication. Therefore, we do not encourage you to invest further time in submitting modified versions at this time. Thank you for considering GMD for the publication of your research. Good luck.
Best regards,
Xiaomeng Huang
Citation: https://doi.org/10.5194/gmd-2023-185-EC1
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 526 | 135 | 41 | 702 | 44 | 42 |