Tensor Core-Adapted Sparse Matrix Multiplication for Accelerating Sparse Deep Neural Networks

Abstract

Sparse matrix–matrix multiplication (SpMM) is essential for deep learning models and scientific computing. Recently, Tensor Cores (TCs) on GPUs, originally designed for dense matrix multiplication with mixed precision, have gained prominence. However, utilizing TCs for SpMM is challenging due to irregular memory access patterns and a varying number of non-zero elements in a sparse matrix. To improve data locality, previous studies have proposed reordering sparse matrices before multiplication, but this adds computational overhead. In this paper, we propose Tensor Core-Adapted SpMM (TCA-SpMM), which leverages TCs without requiring matrix reordering and uses the compressed sparse row (CSR) format. To optimize TC usage, the SpMM algorithm’s dot product operation is transformed into a blocked matrix–matrix multiplication. Addressing load imbalance and minimizing data movement are critical to optimizing the SpMM kernel. Our TCA-SpMM dynamically allocates thread blocks to process multiple rows simultaneously and efficiently uses shared memory to reduce data movement. Performance results on sparse matrices from the Deep Learning Matrix Collection public dataset demonstrate that TCA-SpMM achieves up to 29.58× speedup over state-of-the-art SpMM implementations optimized with TCs.
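To make the abstract concrete, the sketch below shows the baseline computation that TCA-SpMM accelerates: multiplying a sparse matrix stored in the compressed sparse row (CSR) format by a dense matrix. This is only a scalar reference implementation, not the thesis's Tensor Core kernel; the function name `csr_spmm` and its argument layout are illustrative assumptions. The per-row dot products in the inner loop are the operations the thesis reorganizes into blocked matrix–matrix multiplications suitable for Tensor Cores.

```python
import numpy as np

def csr_spmm(indptr, indices, data, B):
    """Reference SpMM: C = A @ B, where A (M x K) is sparse in CSR form
    (indptr, indices, data) and B (K x N) is dense.

    Hypothetical scalar baseline for illustration only; TCA-SpMM maps
    these row-wise dot products onto blocked matrix multiplications
    executed on Tensor Cores.
    """
    M = len(indptr) - 1
    N = B.shape[1]
    C = np.zeros((M, N), dtype=B.dtype)
    for row in range(M):
        # data[start:end] holds row `row`'s non-zero values;
        # indices[start:end] holds their column positions.
        start, end = indptr[row], indptr[row + 1]
        for p in range(start, end):
            C[row, :] += data[p] * B[indices[p], :]
    return C
```

Because the number of non-zeros per row (`indptr[row+1] - indptr[row]`) varies, a naive one-row-per-thread-block GPU mapping of this loop is load-imbalanced, which is the problem the abstract's dynamic thread-block allocation addresses.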


Table of Contents

1 Introduction 1
2 Background 6
2.0.1 Sparse Matrix Representation 6
2.0.2 Sparse Matrix–Dense Matrix Multiplication (SpMM) 8
2.0.3 Tensor Cores on GPUs 9
3 Related Work on SpMM Using Tensor Cores 12
4 GPU Implementation of TCA-SpMM 15
4.0.1 Design Overview of TCA-SpMM 15
4.0.2 Parallelization of TCA-SpMM 21
Maximizing Tensor Core Utilization 25
Achieving Load Balancing 27
4.0.3 Detailed Complexity Analysis of TCA-SpMM 30
5 Experimental Evaluation 34
5.0.1 Experimental Setup 34
5.0.2 Performance Evaluation 36
Speedup 36
Effectiveness of Load Balancing Scheme in TCA-SpMM 38
6 Conclusion 42
Bibliography 44