A Lightweight Audio-Visual Voice Activity Detection Using Temporal Shift Module
- Keywords: Multi-modal Voice Activity Detection, End-point Detection
- Publisher: Sogang University Graduate School
- Advisor: 박형민
- Year of publication: 2022
- Degree conferred: February 2022
- Degree: Master's
- Department: Department of Electronic Engineering, Graduate School
- URI: http://www.dcollection.net/handler/sogang/000000066542
- UCI: I804:11029-000000066542
- Language: English
- Copyright: Theses from Sogang University are protected by copyright.
Abstract
Many modern intelligent devices are operated by voice commands, and voice can serve as an interface in various fields including automobiles, IoT devices, AI speakers, and robots. High speech-recognition performance inevitably requires accurate detection of voice activity intervals, so that the spoken parts are recognized while the remaining parts are ignored. Most current voice activity detection (VAD) methods use audio information only. However, conventional audio-based VAD methods frequently suffer significant performance degradation in real-world noisy environments. Since acoustic noise is usually independent of visual noise, video information can be very helpful when available. Therefore, this thesis proposes a VAD method that uses both the audio and video modalities. By exploiting audio and video information together, the multi-modal VAD method is robust to external noise and can even distinguish the speech of multiple speakers. In particular, our model reduces the number of parameters by using MobileNetV2 with the temporal shift module. Experimental results on the MOBIO and Columbia datasets demonstrate the effectiveness of the proposed attention architecture. Compared with related work, our multi-modal VAD model achieves state-of-the-art performance with the fewest parameters. We propose an early-fusion multi-modal model that prioritizes weight reduction and a multi-modal fusion attention model that focuses on performance improvement.
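For illustration, here is a minimal sketch of the temporal shift operation referred to above, assuming a PyTorch implementation operating on 5-D video features; the function name, tensor layout, and `shift_div` fraction are illustrative assumptions, not details taken from the thesis.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels along the temporal axis (TSM-style).

    x is assumed to have shape (batch, time, channels, height, width).
    One 1/shift_div slice of channels moves one frame backward in time,
    another slice moves one frame forward, and the rest stay in place,
    so temporal mixing is obtained at zero extra parameter cost.
    """
    b, t, c, h, w = x.size()
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift back: frame t sees t+1
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift forward: frame t sees t-1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels unchanged
    return out
```

In a TSM-augmented backbone such as MobileNetV2, a shift like this is typically inserted inside the residual branch of each block before the convolutions, which is how a 2-D image network gains temporal modeling without any increase in parameter count.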