Leveraging Self-Distillation in Automatic Speech Recognition for Real-World Deployments
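- Publisher: Sogang University Graduate School
- Advisor: Hyung-Min Park
- Year of publication: 2026
- Degree conferred: February 2026
- Degree: Master's
- Department and major: Graduate School, Interdisciplinary Program in Artificial Intelligence
- URI: http://www.dcollection.net/handler/sogang/000000082793
- UCI: I804:11029-000000082793
- Text language: English
- Copyright: This thesis is protected by copyright.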
Abstract
Automatic speech recognition (ASR) systems have achieved strong accuracy and efficiency, yet their performance often deteriorates in real-world deployments because of background noise and domain mismatch. Existing approaches to noise robustness frequently add modules or require more computation, which complicates deployment and can reduce generalization to unseen conditions. We propose a resource-efficient self-distillation framework aimed at improving ASR performance in practical deployment rather than only on synthetic noisy benchmarks. Our goal is to identify a self-distillation strategy that improves both noise robustness and real-world usability while keeping model size and inference cost unchanged. During training, two identical copies of the network process the same utterance: one receives the original input and the other an augmented version. Consistency between the two outputs is enforced with a Kullback–Leibler (KL) divergence loss. To further regularize the encoder and encourage noise-invariant representations, we perform k-means clustering on encoder embeddings and minimize the KL divergence between the cluster-similarity distributions computed for clean and noisy inputs. The encoder is pre-trained with self-supervised learning on unlabeled data and then fine-tuned with a connectionist temporal classification (CTC) objective combined with the proposed self-distillation losses. To evaluate generalization to unseen, realistic recording conditions, we use the CHiME-4 benchmark as an out-of-distribution test set. Experiments show that our approach improves robustness while preserving model compactness: a 49M-parameter model trained with our method achieves a WER of 9.02 on the CHiME-4 challenge benchmark, a 20% relative reduction compared to the baseline. Ablation studies confirm the individual contributions of the self-distillation and clustering-based regularization terms. The proposed method offers a simple and effective regularization strategy for deploying compact ASR systems in noisy real-world conditions.
Keywords: automatic speech recognition, self-distillation, noise robustness, self-supervised learning, k-means clustering, Kullback–Leibler divergence
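The two regularization terms described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the thesis implementation: the KL direction (clean as the target), the use of cosine similarity with a softmax temperature to form the cluster-similarity distribution, the loss weights `alpha` and `beta`, and all function names are assumptions made for illustration. The centroids are assumed to come from a k-means run that is not shown here.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(clean_logits, aug_logits):
    """Mean per-frame KL between output distributions of the two branches.

    Assumption: the clean branch is the target distribution.
    """
    per_frame = [kl_divergence(softmax(c), softmax(a))
                 for c, a in zip(clean_logits, aug_logits)]
    return sum(per_frame) / len(per_frame)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def cluster_similarity_distribution(embedding, centroids, temperature=0.1):
    """Softmax over similarities to each centroid (assumed: cosine + temperature)."""
    return softmax([cosine(embedding, c) for c in centroids], temperature)

def cluster_loss(clean_embs, noisy_embs, centroids):
    """Mean per-frame KL between clean and noisy cluster-similarity distributions."""
    per_frame = [
        kl_divergence(cluster_similarity_distribution(c, centroids),
                      cluster_similarity_distribution(n, centroids))
        for c, n in zip(clean_embs, noisy_embs)
    ]
    return sum(per_frame) / len(per_frame)

def total_loss(ctc, consistency, cluster, alpha=1.0, beta=1.0):
    """Hypothetical weighted sum of CTC and the two regularizers."""
    return ctc + alpha * consistency + beta * cluster
```

Both regularizers are zero when the clean and augmented branches agree and grow as their distributions diverge, which is the behavior a consistency term needs. Because they only add terms to the training loss, the deployed model keeps the same size and inference cost, in line with the goal stated in the abstract.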
Table of Contents
List of Figures
List of Tables
Abstract
1 Introduction
1.1 Motivation
1.2 Overview of the Proposed Method
2 Related Works
2.1 Noise Robust ASR
2.2 Self-Supervised Learning via Self-Distillation
3 Proposed Method
3.1 Methodology
3.1.1 ASR pipeline
3.1.2 Self-Distillation pipeline
3.1.3 Consistency regularization
3.1.4 Preventing Representation Collapse
3.2 Training Objective
4 Experiments
4.1 Experimental Setup
4.1.1 Dataset
4.1.2 Implementation Detail
4.1.3 Training Criteria
4.2 Experimental Results
4.2.1 Comparison with Existing Methods
4.2.2 Impact of Varying layer choices
4.2.3 Hyperparameter Sensitivity Analysis
4.2.4 Effect of Augmentation Strategy in Self-Supervised Learning
4.2.5 Ablation Study
5 Conclusion
5.1 Limitations and Future Work
5.2 Summary of Contributions
Bibliography

