Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication
- 주제어 (키워드) SpMM , sparse matrix reordering , Tensor Cores
- 발행기관 서강대학교 일반대학원
- 지도교수 문의현
- 발행년도 2025
- 학위수여년월 2025. 2
- 학위명 석사
- 학과 및 전공 일반대학원 컴퓨터공학과
- 실제 URI http://www.dcollection.net/handler/sogang/000000079610
- UCI I804:11029-000000079610
- 본문언어 영어
- 저작권 서강대학교 논문은 저작권 보호를 받습니다.
초록 (요약문)
Sparse Matrix-Multivector (SpMM) multiplication is a key kernel in deep learning models and scientific computing applications. However, achieving high performance for SpMM is challenging due to the irregular distribution of non-zero elements and memory access patterns. Therefore, several sparse matrix reordering algorithms have been developed to improve data locality for SpMM. However, existing approaches for reordering sparse matrix have not considered block sparsity during the reordering process. In this paper, we present a novel algorithm for sparse matrix reordering that considers block sparsity to enhance data locality for SpMM on Tensor Cores. To alleviate the main bottleneck of reordering, which involves substantial computations for measuring similarity between rows, we develop an efficient GPU implementation by adapting dynamic parallelism and synchronization schemes. Experimental results on a large number of sparse matrices demonstrate the effectiveness of our reordering algorithm and the benefits of leveraging Tensor Cores for SpMM. Our approach achieves a significant performance improvement over various state-of-the-art SpMM implementations.
more목차
1 Introduction 1
2 Background and Related Work 4
2.1 Sparse Matrix-Multivector Multiplication (SpMM) and Sparse Matrix Representation 4
2.2 GPUs with Tensor Cores 5
2.3 Related Work on Sparse Matrix Reordering for Optimizing SpMM 5
3 BSA-SpMM: Block-Sparsity-Aware SpMM 7
3.1 Overview of BSA-SpMM 7
3.2 Details of BSA-SpMM 8
3.2.1 Similarity Measure in BSA Reordering 8
3.2.2 Acceleration of BSA Reordering for GPUs 9
3.2.3 Determination of Dense Tiles 15
3.2.4 Compressing Blocked-ELL Format to Reduce Complexity 15
4 Experimental Evaluation 17
4.1 Experimental Setup 17
4.1.1 Datasets 17
4.1.2 SpMM Implementations Compared 18
4.2. Performance Evaluation 19
4.2.1 Speedup 19
4.2.2 Effectiveness of BSA Reordering 20
4.2.3 Reordering Overhead 20
5 Conclusion 22
Bibliography 23