Abstract
This article studies the optimization of parallel clustering throughput in the context of variant-based parallelism, which exploits commonalities and reuse among variant computations for multithreading scalability. This direction is motivated by challenging scientific applications where scientists have to execute multiple runs of clustering algorithms with different parameters to determine which ones best explain phenomena observed in empirical data. To make this process more efficient, we propose a novel set of optimizations to maximize the throughput of Density-Based Spatial Clustering of Applications with Noise (DBSCAN), a frequently used algorithm for scientific data mining in astronomy, geoscience, and many other fields. Our approach executes multiple algorithm variants in parallel, computes clusters concurrently, and leverages heuristics to maximize the reuse of results from completed variants. As scientific datasets continue to grow, maximizing clustering throughput with our techniques may accelerate the search and identification of natural phenomena of interest with computational support, i.e., Computer-Aided Discovery. We present evaluations on a whole spectrum of datasets, such as geoscience data on space weather phenomena, astronomical data from the Sloan Digital Sky Survey on intermediate-redshift galaxies, as well as synthetic datasets to characterize performance properties. Selected results show a 1,115 percent performance improvement due to indexing tailored for variant-based clustering, and a 2,209 percent performance improvement when applying all of our proposed optimizations.
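The idea of running several DBSCAN variants concurrently while sharing work among them can be illustrated with a minimal sketch. This is not the article's implementation: the function names (`build_neighbor_table`, `dbscan_variant`, `run_variants`) are hypothetical, and the only reuse shown is a neighbor table computed once at the largest `eps` and read by every variant. The article's tailored indexing and its heuristics for reusing results from completed variants are not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist


def build_neighbor_table(points, max_eps):
    """Compute once, for every point, its neighbors within max_eps.

    Each entry is a list of (distance, index) pairs. Every variant with
    eps <= max_eps can filter this table instead of rescanning the data.
    """
    table = []
    for p in points:
        near = []
        for j, q in enumerate(points):
            d = dist(p, q)
            if d <= max_eps:
                near.append((d, j))
        table.append(near)
    return table


def dbscan_variant(table, eps, min_pts):
    """Plain DBSCAN for one (eps, min_pts) variant, reading the shared table."""

    def neighbors(i):
        # A point counts as its own neighbor (distance 0).
        return [j for d, j in table[i] if d <= eps]

    labels = [None] * len(table)  # None = unvisited, -1 = noise
    cluster = -1
    for i in range(len(table)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nbrs)
    return labels


def run_variants(points, variants, workers=4):
    """Run all (eps, min_pts) variants concurrently over one shared table."""
    table = build_neighbor_table(points, max(eps for eps, _ in variants))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {v: pool.submit(dbscan_variant, table, *v) for v in variants}
        return {v: f.result() for v, f in futures.items()}


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    for variant, labels in run_variants(pts, [(1.5, 3), (0.5, 2)]).items():
        print(variant, labels)
```

In CPython the thread pool only illustrates the structure, since the global interpreter lock prevents this pure-Python loop from speeding up across threads. The quadratic neighbor table is likewise a stand-in for a real spatial index.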
Original language | English (US) |
---|---|
Article number | 7865993 |
Pages (from-to) | 2595-2607 |
Number of pages | 13 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 28 |
Issue number | 9 |
State | Published - Sep 1 2017 |
Externally published | Yes |
Keywords
- DBSCAN
- Parallel clustering
- clustering throughput
- computer-aided discovery
- data mining
ASJC Scopus subject areas
- Signal Processing
- Hardware and Architecture
- Computational Theory and Mathematics