Leveraging Multi-View Integration for Improved Occlusion Robustness in Visual Speech Recognition
- Keywords: visual speech recognition, multi-view, reinforcement learning, deep learning
- Issuing institution: Sogang University Graduate School
- Advisor: 박형민
- Publication year: 2024
- Degree conferred: August 2024
- Degree: Master's
- Department and major: Department of Artificial Intelligence, Graduate School
- URI: http://www.dcollection.net/handler/sogang/000000079235
- UCI: I804:11029-000000079235
- Language of text: English
- Copyright: Sogang University theses are protected by copyright.
Abstract
In everyday communication, individuals rely on a multi-modal approach to speech perception, integrating auditory and visual cues such as facial expressions and lip movements. This integration has spurred advancements in automatic speech recognition (ASR) systems, inspiring this paper’s focus on developing occlusion-robust visual speech recognition (VSR) systems. This research extends multi-view lipreading methodologies to the VSR domain and proposes novel hybrid multi-view training strategies. These strategies leverage pre-trained cross-view models to enhance multi-view training, aiming to address the limitations of traditional single-view VSR systems, particularly their vulnerability to occlusions. Additionally, we introduce policy-based multi-view processing methods, optimized via policy-gradient techniques, to create a more adaptable multi-view system. By simulating real-world occlusions with multiple occluders that obscure the lip region across various views, we assess the occlusion-robustness of our VSR models. Our experimental findings highlight the substantial benefits of integrating multi-view inputs, significantly boosting the robustness and accuracy of VSR systems in environments with visual obstructions. The results emphasize the potential of this research direction in advancing multi-view VSR, paving the way for more resilient systems capable of managing the complexities inherent in visual inputs.
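The occlusion simulation described above (detailed in Sec. 3.1.2, "Brownian motion simulation") can be sketched as a moving occluder whose centre follows a 2-D random walk over the lip-region clip. This is a minimal illustration, not the thesis's actual implementation: the frame size, patch size, and step scale below are assumed values chosen for the example.

```python
# Hedged sketch: a square occluder driven by 2-D Brownian motion over a
# lip-region video clip. Parameters (patch=24, step_std=3.0, 96x96 crop)
# are illustrative assumptions, not the thesis's settings.
import numpy as np

def simulate_occlusion(frames: np.ndarray, patch: int = 24,
                       step_std: float = 3.0, seed: int = 0) -> np.ndarray:
    """Overlay a square occluder whose centre follows a random walk.

    frames: (T, H, W) grayscale clip with values in [0, 1].
    Returns a copy with the occluded pixels set to 0.
    """
    rng = np.random.default_rng(seed)
    T, H, W = frames.shape
    out = frames.copy()
    pos = np.array([H / 2, W / 2], dtype=float)  # start at frame centre
    for t in range(T):
        # Brownian increment: i.i.d. Gaussian step each frame.
        pos += rng.normal(0.0, step_std, size=2)
        # Keep the whole patch inside the frame.
        pos = np.clip(pos, patch / 2, [H - patch / 2, W - patch / 2])
        y0, x0 = (pos - patch / 2).astype(int)
        out[t, y0:y0 + patch, x0:x0 + patch] = 0.0
    return out

# Example: a 75-frame, 96x96 clip (a common lip-crop resolution).
clip = np.ones((75, 96, 96), dtype=np.float32)
occluded = simulate_occlusion(clip)
```

Per-view occluders with independent seeds would then obscure the lip region differently in each camera view, which is the setting the robustness evaluation targets.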
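The policy-based multi-view processing mentioned above (Sec. 3.2.2) rests on the policy-gradient idea: a stochastic policy over view choices is updated in the direction that increases expected reward. The toy sketch below illustrates the REINFORCE update on a stand-in problem — the per-view rewards are invented for the example and do not come from the thesis.

```python
# Hedged sketch of the policy-gradient (REINFORCE) mechanism behind
# policy-based view selection: a softmax policy samples one of several
# camera views and is reinforced by a noisy scalar reward. The rewards
# here are toy stand-ins (e.g. imagine view 1 is least occluded).
import numpy as np

rng = np.random.default_rng(0)
n_views = 3
theta = np.zeros(n_views)                 # policy logits, one per view
true_reward = np.array([0.2, 0.9, 0.4])   # assumed per-view reward

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for _ in range(2000):
    probs = softmax(theta)
    view = rng.choice(n_views, p=probs)               # sample an action
    reward = true_reward[view] + rng.normal(0, 0.05)  # noisy feedback
    # REINFORCE: grad of log pi(view) is one_hot(view) - probs.
    grad_log_pi = -probs
    grad_log_pi[view] += 1.0
    theta += lr * reward * grad_log_pi                # ascend expected reward

probs = softmax(theta)
best = int(np.argmax(probs))   # the policy concentrates on the best view
```

In the thesis's setting the reward would instead come from the VSR model's recognition quality, so the policy learns which views to rely on under occlusion; this sketch only shows the update rule itself.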
Table of Contents
1 Introduction 1
1.1 Research background 1
1.2 Our contributions 3
1.3 Paper structure 4
2 Related works 5
2.1 End-to-end conformer 5
2.2 Lip occlusion 7
2.3 Reinforcement learning 8
2.3.1 Markov decision process 8
2.3.2 Policy gradient method 9
3 Proposed methods 11
3.1 Multi-view occlusion modeling 11
3.1.1 Camera geometry 11
3.1.2 Brownian motion simulation 13
3.1.3 Occlusion strategy 14
3.2 Multi-view visual speech recognition architecture 15
3.2.1 Merging-based methods 16
3.2.2 Policy-based methods 17
3.2.3 Multi-view VSR training strategies 20
4 Experiments 23
4.1 Data 23
4.2 Setup 25
4.2.1 Training 25
4.2.2 Evaluation 27
5 Results 28
5.1 Training strategy comparison 28
5.2 Occlusion-robustness analysis 30
5.3 Ablation studies 32
5.3.1 Multi-view data augmentation 32
5.3.2 Handling unknown viewpoints 33
6 Conclusion 35
Bibliography 37