Fast and Scalable Mixed Precision Euclidean Distance Calculations Using GPU Tensor Cores

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Modern GPUs are equipped with tensor cores (TCs) that are commonly used for matrix multiplication in artificial intelligence workloads. Because TCs have high computational throughput, they can also deliver significant performance gains in other algorithms that successfully exploit them. We examine using TCs to compute Euclidean distances, which are used in many data analytics applications. Prior work has only investigated using 64-bit floating point (FP64) data for computation; however, TCs can also operate on lower-precision floating point data (i.e., 16-bit matrix multiplication and 32-bit accumulation), which we refer to as FP16-32. FP16-32 TC peak throughput is so high that TCs are easily starved of data. We propose the Fast and Scalable Tensor core Euclidean Distance (FaSTED) algorithm. To achieve high computational throughput, we design FaSTED for significant hierarchical reuse of data and maximize memory utilization at every level (global memory, shared memory, and registers). We apply FaSTED to similarity searches, which typically employ an indexing data structure to eliminate superfluous Euclidean distance calculations. We compare FaSTED to the state-of-the-art (SOTA) TC Euclidean distance algorithm in the literature, which employs FP64, as well as to two single-precision (FP32) CUDA core algorithms that both employ an index. We find that across four real-world high-dimensional datasets spanning 128-960 dimensions, the mixed-precision brute force approach achieves a speedup of 2.5-51× over the SOTA algorithms. We also quantify the accuracy loss of our mixed-precision algorithm to be < 0.06% when compared to the FP64 baseline.
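
The abstract does not spell out how a distance calculation is mapped onto matrix multiplication. One standard formulation, assumed here for illustration and not confirmed by the abstract as FaSTED's method, expands the squared distance as ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a · b), so that the a · b terms for many point pairs become a single matrix product. The sketch below roughly emulates FP16-32 arithmetic on the CPU using only the Python standard library: inputs are rounded to 16-bit floats and the running sum is rounded to 32-bit floats. All function names are hypothetical, and real tensor core rounding behavior may differ.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value.

    Raises OverflowError if |x| exceeds the FP16 range (about 65504).
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16_32(a, b) -> float:
    """Dot product with FP16 inputs and an FP32 accumulator (emulated)."""
    acc = 0.0
    for x, y in zip(a, b):
        # The product of two FP16 values fits exactly in FP32.
        prod = to_fp16(x) * to_fp16(y)
        # Round the running sum to FP32 after every addition.
        acc = to_fp32(acc + prod)
    return acc


def sq_dist_mixed(a, b) -> float:
    """Squared distance via ||a||^2 + ||b||^2 - 2(a . b) in emulated FP16-32."""
    aa = dot_fp16_32(a, a)
    bb = dot_fp16_32(b, b)
    ab = dot_fp16_32(a, b)
    # Clamp at zero: rounding can make the expansion slightly negative.
    return max(to_fp32(aa + bb - 2.0 * ab), 0.0)


def sq_dist_fp64(a, b) -> float:
    """Reference squared distance computed directly in double precision."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


if __name__ == "__main__":
    a = [0.1, 0.2, 0.3]
    b = [0.4, 0.5, 0.6]
    print("mixed:", sq_dist_mixed(a, b))
    print("fp64: ", sq_dist_fp64(a, b))
```

Comparing the two functions on the same pair of points gives a feel for the kind of accuracy loss that the abstract quantifies against an FP64 baseline, although this toy emulation says nothing about the reported < 0.06% figure itself.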

Original language: English (US)
Title of host publication: 54th International Conference on Parallel Processing, ICPP 2025 - Main Conference Proceedings
Publisher: Association for Computing Machinery, Inc
Pages: 288-298
Number of pages: 11
ISBN (Electronic): 9798400720741
DOIs
State: Published - Dec 20 2025
Event: 54th International Conference on Parallel Processing, ICPP 2025 - San Diego, United States
Duration: Sep 8 2025 to Sep 11 2025

Publication series

Name: 54th International Conference on Parallel Processing, ICPP 2025 - Main Conference Proceedings

Conference

Conference: 54th International Conference on Parallel Processing, ICPP 2025
Country/Territory: United States
City: San Diego
Period: 9/8/25 to 9/11/25

Keywords

  • CUDA
  • Euclidean Distance
  • GPU
  • Mixed Precision Floating Point
  • Self-Join
  • Similarity Search
  • Tensor Cores

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software
  • General Mathematics
