A Lightweight Audio-Visual Voice Activity Detection Using Temporal Shift Module
- Subject Multi-modal Voice Activity Detection, End-point Detection
- Publisher Sogang University Graduate School
- Advisor Park Hyung-min
- Issued Date 2022
- Awarded Date February 2022
- Thesis Degree Master's
- Major Department of Electronic Engineering, Graduate School
- URI Entity http://www.dcollection.net/handler/sogang/000000066542
- UCI I804:11029-000000066542
- Language English
- Rights Theses of Sogang University are protected by copyright.
Abstract
Recently, many intelligent devices have come to be operated by voice commands, and voice serves as an interface in various fields including automobiles, IoT devices, AI speakers, and robots. In speech recognition, high recognition performance inevitably requires accurate detection of voice activity intervals, so that spoken parts are recognized while the other parts are ignored. Most current voice activity detection (VAD) methods use audio information only. However, conventional audio-based VAD methods frequently suffer significant performance degradation in real-world noisy environments. Since acoustic noise is usually independent of visual noise, video information can be very helpful when available. Therefore, this thesis proposes a VAD method that uses both audio and video modalities. By exploiting audio and video information together, the multi-modal VAD method is robust to external noise and can even distinguish speech from multiple speakers. In particular, our model reduces the number of parameters by using MobileNetV2 with the temporal shift module. Experimental results on the MOBIO and Columbia datasets demonstrate the effectiveness of the proposed attention architecture. Compared with related work, our multi-modal VAD model achieves state-of-the-art performance with the fewest parameters. We propose an early-fusion multi-modal model that prioritizes weight reduction and a multi-modal fusion attention architecture model that focuses on performance improvement.
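As a rough illustration of the temporal shift module the abstract refers to, the PyTorch sketch below shifts a fraction of feature channels along the time axis so that the 2D convolutions of a backbone such as MobileNetV2 can mix temporal context at zero extra parameter cost. This follows the published TSM formulation (Lin et al., 2019); the tensor layout, function name, and shift fraction are assumptions for illustration, not details taken from the thesis itself.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Bidirectional temporal shift over a clip of 2D feature maps.

    x: features of shape (batch, time, channels, height, width).
    shift_div: 1/shift_div of the channels shift backward in time and
    another 1/shift_div shift forward; 8 is the value suggested in the
    TSM paper (an assumption here, since the thesis does not state it).
    """
    b, t, c, h, w = x.size()
    fold = c // shift_div
    out = torch.zeros_like(x)
    # First fold: pull features from the next frame (shift backward in time).
    out[:, :-1, :fold] = x[:, 1:, :fold]
    # Second fold: pull features from the previous frame (shift forward in time).
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]
    # Remaining channels pass through unchanged.
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]
    return out

# Usage example: 8 frames of 96-channel feature maps from a video clip.
feats = torch.randn(2, 8, 96, 14, 14)
shifted = temporal_shift(feats)
assert shifted.shape == feats.shape
```

Because the shift merely reindexes existing activations, inserting it before the convolutions of each block adds temporal modeling without increasing the parameter count, which is what makes it attractive for a lightweight audio-visual VAD model.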

