Abstract
The self-join finds all objects in a dataset within a threshold of each other defined by a similarity metric. As such, the self-join is a fundamental building block for the field of databases and data mining. In low dimensionality, there are several challenges associated with efficiently computing the self-join on the graphics processing unit (GPU). Low dimensional data results in higher data densities, causing a significant number of distance calculations and a large result set, and as dimensionality increases, index searches become increasingly exhaustive. We propose several techniques to optimize the self-join using the GPU that include a GPU-efficient index that employs a bounded search, a batching scheme to accommodate large result sets, and duplicate search removal with low overhead. Furthermore, we propose a performance model that reveals bottlenecks related to the result set size and enables us to choose a batch size that mitigates two sources of performance degradation. Our approach outperforms the state-of-the-art on most scenarios.
| Original language | English (US) |
|---|---|
| Pages (from-to) | 107-123 |
| Number of pages | 17 |
| Journal | Journal of Parallel and Distributed Computing |
| Volume | 133 |
| DOIs | |
| State | Published - Nov 2019 |
| Externally published | Yes |
Keywords
- GPGPU
- In-memory database
- Query optimization
- Self-join
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Hardware and Architecture
- Computer Networks and Communications
- Artificial Intelligence