Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling
- Keywords: Deep Learning, Model Optimization, Lightweight Deep Neural Networks
- Publisher: Graduate School, Sogang University
- Advisor: 문의현
- Publication Year: 2024
- Degree Conferred: February 2024
- Degree: Master's
- Department and Major: Department of Computer Science and Engineering, Graduate School
- URI: http://www.dcollection.net/handler/sogang/000000076644
- UCI: I804:11029-000000076644
- Language: English
- Copyright: Sogang University theses are protected by copyright.
Abstract
Sparsifying the Transformer has garnered considerable interest, as training the Transformer is computationally demanding. Prior efforts to sparsify the Transformer have used either fixed patterns or data-driven approaches to reduce the number of operations in multi-head attention (MHA), the main bottleneck of the Transformer. However, existing methods suffer from inherent problems: a uniform fixed pattern applied across all layers risks discarding essential sequence features, while data-driven approaches increase the model size or operation count because additional parameters must be learned to capture sparsity patterns in the attention operations. In this paper, we propose a novel sparsification scheme for the Transformer that integrates convolution filters with the flood fill method to efficiently capture the layer-wise sparse pattern of attention operations. Our approach reduces the computational complexity and memory footprint of the Transformer during training while spending only minimal additional time on pattern searching. Moreover, by focusing only on the essential parts of the attention matrix, this efficient pattern-capturing algorithm reduces the number of operations by up to 5.91X. We develop efficient GPU implementations of the layer-wise sparsified attention algorithm; the resulting model, SPION, achieves up to 2.78X speedup and 7.24X memory reduction over existing state-of-the-art sparse Transformer models, with better evaluation quality.
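The abstract only sketches the pattern-capturing idea at a high level. As a rough, hypothetical illustration of how a convolution filter combined with a flood-fill-style step could turn an attention map into a layer-wise sparsity mask, consider the minimal Python sketch below. The function name sparsity_mask_from_attention, the use of SciPy's uniform_filter and label routines, and the filter size and threshold values are all assumptions made for illustration; they are not taken from the thesis and do not reproduce SPION's actual algorithm or GPU kernels.

```python
import numpy as np
from scipy.ndimage import uniform_filter, label


def sparsity_mask_from_attention(attn, filter_size=3, threshold=0.5):
    """Illustrative sketch: derive a block-sparse mask from one attention map.

    attn: (seq_len, seq_len) array of attention scores for one layer/head.
    A small averaging (convolution) filter smooths the map so that isolated
    spikes are suppressed and clustered high-attention regions stand out.
    Connected-component labeling (a flood-fill-style grouping) then keeps
    only contiguous regions whose smoothed score exceeds the threshold.
    """
    smoothed = uniform_filter(attn, size=filter_size)       # convolution step
    seeds = smoothed > threshold * smoothed.max()            # candidate entries
    regions, _ = label(seeds)                                 # flood-fill-like grouping
    mask = regions > 0                                        # binary sparsity pattern
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attn = rng.random((64, 64)) * 0.2
    # inject a diagonal band of strong attention, a shape often seen in practice
    for i in range(64):
        attn[i, max(0, i - 2): i + 3] += 0.8
    mask = sparsity_mask_from_attention(attn)
    print(f"retained {mask.mean():.1%} of the attention entries")
```

In SPION itself, the mask would be generated per layer and the masked MHA executed with custom sparse GPU kernels (see Sections 4.2 and 4.3 of the table of contents); the snippet above only mimics the mask-generation step on a toy attention map.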
Table of Contents
1 Introduction
2 Background and Related Work
2.1 Transformer
2.1.1 Sparsification of MHA
2.2 Flood Fill Algorithm
2.3 Related Work
2.3.1 Fixed Sparse Pattern
2.3.2 Data-driven Sparse Pattern
3 Motivation: Analysis of Sparse Patterns in MHA
3.1 Shape of Sparse Pattern
3.2 Degree of Sparsity
4 SPION: Layer-Wise Sparse Attention in Transformer
4.1 Overview of SPION
4.2 Sparsity Pattern Generation with Convolutional Flood Fill Algorithm
4.3 Acceleration of Sparse MHA Implementation on GPUs
4.3.1 Sparse MHA
4.3.2 Sparse Softmax Kernel
4.3.3 Integration of GPU Kernels with PyTorch
5 Experimental Evaluation
5.1 Performance Evaluation
5.2 Computational Complexity Analysis
6 Conclusion
Bibliography