Noise-Robust Contextual ASR with Audio Context Integration and SummaryMixing-Based Encoder

Table of Contents

1 Introduction
1.1 Motivation
1.2 Overview of the Proposed Method
2 Related Works
2.1 End-to-End ASR Encoder (Zipformer)
2.2 Contextual Speech Recognition
2.2.1 Contextual Speech Recognition Modeling
2.2.2 Dataset for Contextual ASR
2.3 Efficient Alternatives to Attention Mechanisms
2.3.1 Core Components f and s of SummaryMixing
2.4 Audio Representation Learning
3 Proposed Method
3.1 Overview
3.2 Applying SummaryMixing to Zipformer
3.3 Alternative Fusion Methods for Zipformer in Contextual ASR
3.4 Improving Contextual ASR with Audio Representations in Noisy Conditions
4 Experiments
4.1 Dataset and Evaluation Metrics
4.1.1 Datasets
4.1.2 Evaluation Metrics
4.1.3 Noisy Data Synthesis
4.2 Experimental Settings
4.2.1 Hardware Configurations
4.2.2 Model Configurations
4.3 Results
4.3.1 Impact of Replacing Attention with SummaryMixing
4.3.2 Comparison of Fusion Methods
4.3.3 Effectiveness of Using Previous Audio Representations
4.3.4 Robustness under Noisy Conditions
5 Conclusion
5.1 Conclusion
5.2 Limitations and Future Works
Bibliography
