Noise-Robust Contextual ASR with Audio Context Integration and SummaryMixing-Based Encoder
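- Keywords: Automatic Speech Recognition, Contextual ASR, SummaryMixing, Multi-modal Fusion, Zipformer, Audio Context Integration
- Publisher: Sogang University Graduate School
- Advisor: Hyung-Min Park (박형민)
- Year of publication: 2025
- Degree conferral date: February 2025
- Degree: Master's
- Department and major: Graduate School, Interdisciplinary Program in Artificial Intelligence
- Actual URI: http://www.dcollection.net/handler/sogang/000000079587
- UCI: I804:11029-000000079587
- Language of text: English
- Copyright: Sogang University theses are protected by copyright.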
Table of Contents
1 Introduction 1
1.1 Motivation 1
1.2 Overview of the Proposed Method 3
2 Related Works 4
2.1 End-to-End ASR Encoder (Zipformer) 4
2.2 Contextual Speech Recognition 6
2.2.1 Contextual Speech Recognition Modeling 6
2.2.2 Dataset for Contextual ASR 7
2.3 Efficient Alternatives to Attention Mechanisms 8
2.3.1 Core Components f and s of SummaryMixing 9
2.4 Audio Representation Learning 12
3 Proposed Method 15
3.1 Overview 15
3.2 Applying SummaryMixing to Zipformer 15
3.3 Alternative Fusion Methods for Zipformer in Contextual ASR 16
3.4 Improving Contextual ASR with Audio Representations in Noisy Conditions 19
4 Experiments 20
4.1 Dataset and Evaluation Metrics 20
4.1.1 Datasets 20
4.1.2 Evaluation Metrics 21
4.1.3 Noisy Data Synthesis 22
4.2 Experimental Settings 23
4.2.1 Hardware Configurations 23
4.2.2 Model Configurations 23
4.3 Results 24
4.3.1 Impact of Replacing Attention with SummaryMixing 24
4.3.2 Comparison of Fusion Methods 26
4.3.3 Effectiveness of Using Previous Audio Representations 27
4.3.4 Robustness under Noisy Conditions 29
5 Conclusion 31
5.1 Conclusion 31
5.2 Limitations and Future Works 32
Bibliography 34