Tensor Core-Adapted Sparse Matrix Multiplication for Accelerating Sparse Deep Neural Networks

Abstract

Sparse matrix–matrix multiplication (SpMM) is essential for deep learning models and scientific computing. Recently, Tensor Cores (TCs) on GPUs, originally designed for dense matrix multiplication with mixed precision, have gained prominence. However, utilizing TCs for SpMM is challenging due to irregular memory access patterns and a varying number of non-zero elements in a sparse matrix. To improve data locality, previous studies have proposed reordering sparse matrices before multiplication, but this adds computational overhead. In this paper, we propose Tensor Core-Adapted SpMM (TCA-SpMM), which leverages TCs without requiring matrix reordering and uses the compressed sparse row (CSR) format. To optimize TC usage, the SpMM algorithm’s dot product operation is transformed into a blocked matrix–matrix multiplication. Addressing load imbalance and minimizing data movement are critical to optimizing the SpMM kernel. Our TCA-SpMM dynamically allocates thread blocks to process multiple rows simultaneously and efficiently uses shared memory to reduce data movement. Performance results on sparse matrices from the Deep Learning Matrix Collection public dataset demonstrate that TCA-SpMM achieves up to 29.58× speedup over state-of-the-art SpMM implementations optimized with TCs.
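To make the abstract concrete, the sketch below shows the baseline computation that TCA-SpMM accelerates: multiplying a sparse matrix stored in the compressed sparse row (CSR) format by a dense matrix. This is only a scalar reference implementation, not the thesis's Tensor Core kernel; the function name `csr_spmm` and its argument layout are illustrative assumptions. The per-row dot products in the inner loop are the operations the thesis reorganizes into blocked matrix–matrix multiplications suitable for Tensor Cores.

```python
import numpy as np

def csr_spmm(indptr, indices, data, B):
    """Reference SpMM: C = A @ B, where A (M x K) is sparse in CSR form
    (indptr, indices, data) and B (K x N) is dense.

    Hypothetical scalar baseline for illustration only; TCA-SpMM maps
    these row-wise dot products onto blocked matrix multiplications
    executed on Tensor Cores.
    """
    M = len(indptr) - 1
    N = B.shape[1]
    C = np.zeros((M, N), dtype=B.dtype)
    for row in range(M):
        # data[start:end] holds row `row`'s non-zero values;
        # indices[start:end] holds their column positions.
        start, end = indptr[row], indptr[row + 1]
        for p in range(start, end):
            C[row, :] += data[p] * B[indices[p], :]
    return C
```

Because the number of non-zeros per row (`indptr[row+1] - indptr[row]`) varies, a naive one-row-per-thread-block GPU mapping of this loop is load-imbalanced, which is the problem the abstract's dynamic thread-block allocation addresses.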


Table of Contents

1 Introduction 1
2 Background 6
2.0.1 Sparse Matrix Representation 6
2.0.2 Sparse Matrix–Dense Matrix Multiplication (SpMM) 8
2.0.3 Tensor Cores on GPUs 9
3 Related Work on SpMM Using Tensor Cores 12
4 GPU Implementation of TCA-SpMM 15
4.0.1 Design Overview of TCA-SpMM 15
4.0.2 Parallelization of TCA-SpMM 21
Maximizing Tensor Core Utilization 25
Achieving Load Balancing 27
4.0.3 Detailed Complexity Analysis of TCA-SpMM 30
5 Experimental Evaluation 34
5.0.1 Experimental Setup 34
5.0.2 Performance Evaluation 36
Speedup 36
Effectiveness of Load Balancing Scheme in TCA-SpMM 38
6 Conclusion 42
Bibliography 44