Leveraging Multi-View Integration for Improved Occlusion Robustness in Visual Speech Recognition

Abstract

In everyday communication, individuals rely on a multi-modal approach to speech perception, integrating auditory and visual cues such as facial expressions and lip movements. This integration has spurred advancements in automatic speech recognition (ASR) systems, inspiring this paper’s focus on developing occlusion-robust visual speech recognition (VSR) systems. This research extends multi-view lipreading methodologies to the VSR domain and proposes novel hybrid multi-view training strategies. These strategies leverage pre-trained cross-view models to enhance multi-view training, aiming to address the limitations of traditional single-view VSR systems, particularly their vulnerability to occlusions. Additionally, we introduce policy-based multi-view processing methods, optimized via policy-gradient techniques, to create a more adaptable multi-view system. By simulating real-world occlusions with multiple occluders that obscure the lip region across various views, we assess the occlusion-robustness of our VSR models. Our experimental findings highlight the substantial benefits of integrating multi-view inputs, significantly boosting the robustness and accuracy of VSR systems in environments with visual obstructions. The results emphasize the potential of this research direction in advancing multi-view VSR, paving the way for more resilient systems capable of managing the complexities inherent in visual inputs.

Table of contents

1 Introduction
1.1 Research background
1.2 Our contributions
1.3 Paper structure
2 Related works
2.1 End-to-end conformer
2.2 Lip occlusion
2.3 Reinforcement learning
2.3.1 Markov decision process
2.3.2 Policy gradient method
3 Proposed methods
3.1 Multi-view occlusion modeling
3.1.1 Camera geometry
3.1.2 Brownian motion simulation
3.1.3 Occlusion strategy
3.2 Multi-view visual speech recognition architecture
3.2.1 Merging-based methods
3.2.2 Policy-based methods
3.2.3 Multi-view VSR training strategies
4 Experiments
4.1 Data
4.2 Setup
4.2.1 Training
4.2.2 Evaluation
5 Results
5.1 Training strategy comparison
5.2 Occlusion-robustness analysis
5.3 Ablation studies
5.3.1 Multi-view data augmentation
5.3.2 Handling unknown viewpoints
6 Conclusion
Bibliography
