TY - JOUR
T1 - Optimizing Parallel Clustering Throughput in Shared Memory
AU - Gowanlock, Michael
AU - Blair, David M.
AU - Pankratius, Victor
N1 - Funding Information:
We thank the anonymous reviewers for their useful comments and suggestions. We acknowledge support from US National Science Foundation ACI-1442997.
Publisher Copyright:
© 1990-2012 IEEE.
PY - 2017/9/1
Y1 - 2017/9/1
AB - This article studies the optimization of parallel clustering throughput in the context of variant-based parallelism, which exploits commonalities and reuse among variant computations for multithreading scalability. This direction is motivated by challenging scientific applications where scientists have to execute multiple runs of clustering algorithms with different parameters to determine which ones best explain phenomena observed in empirical data. To make this process more efficient, we propose a novel set of optimizations to maximize the throughput of Density-Based Spatial Clustering of Applications with Noise (DBSCAN), a frequently used algorithm for scientific data mining in astronomy, geoscience, and many other fields. Our approach executes multiple algorithm variants in parallel, computes clusters concurrently, and leverages heuristics to maximize the reuse of results from completed variants. As scientific datasets continue to grow, maximizing clustering throughput with our techniques may accelerate the search and identification of natural phenomena of interest with computational support, i.e., Computer-Aided Discovery. We present evaluations on a whole spectrum of datasets, such as geoscience data on space weather phenomena, astronomical data from the Sloan Digital Sky Survey on intermediate-redshift galaxies, as well as synthetic datasets to characterize performance properties. Selected results show a 1,115 percent performance improvement due to indexing tailored for variant-based clustering, and a 2,209 percent performance improvement when applying all of our proposed optimizations.
KW - clustering throughput
KW - computer-aided discovery
KW - data mining
KW - DBSCAN
KW - parallel clustering
UR - http://www.scopus.com/inward/record.url?scp=85029533937&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85029533937&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2017.2675421
DO - 10.1109/TPDS.2017.2675421
M3 - Article
AN - SCOPUS:85029533937
SN - 1045-9219
VL - 28
SP - 2595
EP - 2607
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 9
M1 - 7865993
ER -