This work is distributed under the Creative Commons Attribution 4.0 License.
Clustering analysis of very large measurement and model datasets on high performance computing platforms
Abstract. Spatiotemporal clustering of data is an important analytical technique with many applications in air quality, including source identification, monitoring-network analysis, and airshed partitioning. Hierarchical agglomerative clustering is one such algorithm, in which sets of input data are grouped by a chosen similarity metric without requiring any a priori information about the final arrangement of clusters. Modern implementations of the algorithm have O(n² log(n)) computational complexity and O(n²) memory usage, where n is the number of initial clusters. This dependence can strain the resources of even very large individual computers as the number of initial clusters grows into the tens or hundreds of thousands, for example when clustering all the points in an air-quality model's simulation grid as part of airshed analysis (~10⁵ to 10⁶ time series to be clustered). Using two parallelization techniques – the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) – we have reduced the wallclock time required by, and increased the memory available to, a new hierarchical clustering program, by dividing the program into blocks that run on separate CPUs but communicate with one another to produce a single result. The new algorithm opens up directions for large-data analysis that had previously not been possible. Here we present a massively parallelized version of an agglomerative hierarchical clustering algorithm that can cluster an entire year of hourly regional air-quality model output (538 Ɨ 540 domain; 290,520 hourly concentration time series) in 12 hours 37 minutes of wallclock time by spreading the computation across 8000 Intel® Xeon® Platinum 8830 CPU cores with a total of 2 TB of RAM. We then show how the new algorithm allows a new form of air-quality analysis to be carried out starting from air-quality model output. We present maps of the different airsheds within the model domain, identifying distinct regions for each chemical species. These regions can be used as an aid in determining the placement of surface air-quality monitors that gives the most representative sampling for a fixed number of monitors, or the number of monitors required for a given level of similarity between airsheds. We then demonstrate the new algorithm's application to source apportionment of very large observation datasets through the analysis of a year of hourly data from Canada's National Air Pollution Surveillance Program, comprising 366,427 original observation vectors – a problem size that would be intractable for other source-apportionment programs such as Positive Matrix Factorization.
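To make the block decomposition described in the abstract concrete, the following minimal sketch shows how the closest-pair search at the heart of each agglomeration step could be split across MPI ranks and OpenMP threads. It is a sketch under assumed names (closest_pair, dist, and a row-block matrix layout), not the authors' implementation.

```c
/* Minimal sketch of the hybrid MPI+OpenMP pattern described above: the
 * pairwise-dissimilarity matrix is divided into row blocks, each MPI rank
 * scans its block with OpenMP threads for the most similar cluster pair,
 * and one MPI reduction yields the global minimum. Illustrative only
 * (not the authors' program); compile with e.g. `mpicc -fopenmp`. */
#include <mpi.h>
#include <omp.h>
#include <float.h>

/* dist holds this rank's rows [row_lo, row_hi) of the full n x n matrix. */
void closest_pair(const double *dist, int n, int row_lo, int row_hi,
                  int *merge_i, int *merge_j)
{
    double local_min = DBL_MAX;
    int local_i = -1, local_j = -1;

    #pragma omp parallel
    {
        double tmin = DBL_MAX;
        int ti = -1, tj = -1;
        #pragma omp for nowait           /* threads share this rank's rows */
        for (int i = row_lo; i < row_hi; i++)
            for (int j = i + 1; j < n; j++) {
                double d = dist[(size_t)(i - row_lo) * n + j];
                if (d < tmin) { tmin = d; ti = i; tj = j; }
            }
        #pragma omp critical             /* combine per-thread minima */
        if (tmin < local_min) { local_min = tmin; local_i = ti; local_j = tj; }
    }

    /* One collective finds the global minimum and which rank owns it. */
    struct { double d; int rank; } in, out;
    in.d = local_min;
    MPI_Comm_rank(MPI_COMM_WORLD, &in.rank);
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);

    /* The owning rank tells every rank which two clusters merge next. */
    int pair[2] = { local_i, local_j };
    MPI_Bcast(pair, 2, MPI_INT, out.rank, MPI_COMM_WORLD);
    *merge_i = pair[0];
    *merge_j = pair[1];
    /* ... merge the pair, update the affected rows, repeat n-1 times ... */
}
```

Because each rank stores only its own row block, the O(n²) distance matrix is spread across the aggregate memory of all nodes, which is the property that lets problem sizes reach the ~10⁵ to 10⁶ series quoted in the abstract.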
Withdrawal notice
This preprint has been withdrawn.
Interactive discussion
Status: closed
RC1: 'Comment on gmd-2023-185', Anonymous Referee #1, 03 Jan 2024
The comment was uploaded in the form of a supplement: https://gmd.copernicus.org/preprints/gmd-2023-185/gmd-2023-185-RC1-supplement.pdf
RC2: 'Comment on gmd-2023-185', Anonymous Referee #2, 21 Jan 2024
Dear Author(s),
Firstly, I would like to commend your efforts in developing a hierarchical clustering method based on the integration of MPI and OpenMP. The application of your method to air-quality model analysis, and its efficacy on large-scale data in a distributed memory architecture cluster, is indeed noteworthy.
However, before this paper can be considered outstanding, there are several areas that could be improved upon. I would like to highlight these and provide some suggestions:
1. The paper could benefit from a more comprehensive review of related work on the parallelization of hierarchical clustering and other clustering methods. There is existing literature where scholars have made efforts to parallelize hierarchical clustering on distributed clusters, such as "A scalable algorithm for single-linkage hierarchical clustering on distributed-memory architectures," "A novel parallelization approach for hierarchical clustering," and "A hierarchical clustering algorithm for MIMD architecture." The current version of the manuscript seems to lack a thorough survey of these related studies.
2. In lines 80 to 86, the manuscript discusses the shift towards parallel computing to address the issues of high time and space complexity associated with hierarchical clustering of ultra-large-scale data. However, I would suggest considering existing high-performance hierarchical clustering algorithms that reduce complexity at the algorithmic level before resorting to increased computational resources, which could be more effective. Improvements could be further amplified when combined with the additional computational resources. Several high-performance algorithms have already been optimized for time and space complexity in the field of hierarchical clustering. Here are a few notable methods:
a. SLINK (Single-Linkage Clustering): Optimized for single-linkage hierarchical clustering, the SLINK algorithm reduces the time complexity to O(n^2) and the space complexity to O(n).
b. CLINK (Complete-Linkage Clustering): Similar to SLINK, optimized for complete-linkage clustering with a time complexity of O(n^2), although it may be slower with different data types.
c. Fastcluster: An efficient hierarchical clustering library for larger datasets, providing efficient implementations for different linkage strategies such as single, complete, and average linkage.
d. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): Designed for large datasets, BIRCH initially compresses data using a CF Tree before applying hierarchical clustering.
e. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise): A density-based clustering algorithm extending DBSCAN, converting the concept of core distance into hierarchical clustering to identify clusters of varying density.
f. Efficient Agglomerative Clustering using a Heap (EAC): EAC uses a heap data structure to optimize the process of finding the nearest cluster pair, making the merging steps more efficient.
g. Divisive Analysis Clustering (DIANA): A divisive clustering algorithm that starts with a single cluster containing all samples and recursively splits it into smaller clusters, which may be more efficient for large datasets than agglomerative methods.
3. The manuscript would also benefit from an expanded discussion of the necessity of using a hierarchical clustering method. Readers not specialized in the field may wonder why one must use hierarchical clustering, which is relatively more complex, rather than faster clustering methods such as K-means. The balance between effectiveness and efficiency could be better articulated.
4. Additional experimentation would strengthen the paper's contribution. Comparisons of the proposed method with previous parallel hierarchical clustering algorithms, traditional high-performance hierarchical clustering methods like HDBSCAN, and simpler clustering methods such as K-means, would be highly illustrative. Such comparisons would highlight the necessity, advantages, and advancements of your work.
Regrettably, considering the aforementioned points, I believe the article is not yet suitable for acceptance in its current form.
Best regards
Citation: https://doi.org/10.5194/gmd-2023-185-RC2
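As a concrete illustration of the heap-based approach the reviewer lists under item f, the following is a minimal sketch of an agglomerative merge loop driven by a binary min-heap. It is an illustration under assumed names (Pair, Heap, alive), not code from the manuscript or from any cited library.

```c
/* Sketch of a heap-based merge loop (reviewer item f): candidate cluster
 * pairs sit in a binary min-heap keyed on dissimilarity, and stale
 * entries are discarded lazily when popped. Illustrative only. */
#include <stdlib.h>

typedef struct { double d; int i, j; } Pair;  /* dissimilarity + cluster ids */
typedef struct { Pair *a; int n; } Heap;      /* array-backed binary min-heap */

static void heap_push(Heap *h, Pair p) {
    int c = h->n++;
    h->a[c] = p;
    while (c > 0 && h->a[(c - 1) / 2].d > h->a[c].d) {       /* sift up */
        Pair t = h->a[c]; h->a[c] = h->a[(c - 1) / 2]; h->a[(c - 1) / 2] = t;
        c = (c - 1) / 2;
    }
}

static Pair heap_pop(Heap *h) {
    Pair top = h->a[0];
    h->a[0] = h->a[--h->n];
    for (int c = 0;;) {                                      /* sift down */
        int l = 2 * c + 1, r = l + 1, m = c;
        if (l < h->n && h->a[l].d < h->a[m].d) m = l;
        if (r < h->n && h->a[r].d < h->a[m].d) m = r;
        if (m == c) break;
        Pair t = h->a[c]; h->a[c] = h->a[m]; h->a[m] = t;
        c = m;
    }
    return top;
}

/* Pop the closest pair; skip entries whose clusters were already merged. */
void agglomerate(Heap *h, int *alive, int n_clusters)
{
    while (n_clusters > 1 && h->n > 0) {
        Pair p = heap_pop(h);
        if (!alive[p.i] || !alive[p.j]) continue;            /* stale entry */
        alive[p.j] = 0;                                      /* fold j into i */
        n_clusters--;
        /* ... recompute d(i, k) for each surviving k under the chosen
         * linkage and heap_push the refreshed pairs ... */
    }
}
```

Lazy deletion keeps each pop at O(log m) without searching the heap when clusters disappear; the trade-off is that stale entries accumulate, so a production version also needs per-cluster merge counters (or equivalent) to invalidate pairs computed before a cluster's most recent merge.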
AC1: 'Comment on gmd-2023-185', Colin Lee, 23 Jan 2024
We thank the reviewers for their thorough and carefully considered comments. We note that both reviewers appear to agree that the paper needs more context from the wide fields of already available clustering algorithms in order to present this implementation as novel and useful, and this point is well-received. We will require some time to perform preliminary analysis and investigation before posting a final author response comment and request your patience while we perform these tasks. We will respond to the reviewers' specific comments in that final response in a few weeks.
Citation: https://doi.org/10.5194/gmd-2023-185-AC1
AC2: 'Final author response to reviewer comments', Colin Lee, 16 Feb 2024
EC1: 'Reply on AC2', Xiaomeng Huang, 17 Feb 2024
Dear Colin,
After careful review by two independent referees, it has become apparent that their assessments were predominantly negative. We therefore regret to inform you that your manuscript will be rejected for publication in GMD. The reviewers raised concerns regarding the methodology, the coverage of the broader literature, the data analysis, and the overall contribution of your research. We appreciate your response to the reviewers' comments, but we agree with their assessment that significant revisions would be necessary before your work could be reconsidered for publication in GMD. Therefore, we do not encourage you to invest further time in submitting modified versions at this time. Thank you for considering GMD for the publication of your research. Good luck.
Best regards,
Xiaomeng Huang
Citation: https://doi.org/10.5194/gmd-2023-185-EC1
Viewed
HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
548 | 137 | 43 | 728 | 47 | 45