A Lightweight Audio-Visual Voice Activity Detection Using Temporal Shift Module

Abstract

Many modern intelligent devices are operated by voice commands, which serve as an interface in various fields including automobiles, IoT devices, AI speakers, and robots. High speech recognition performance inevitably requires accurate detection of voice activity intervals, so that spoken parts are recognized while the remaining parts are ignored. Most current voice activity detection (VAD) methods use audio information only. However, conventional audio-based VAD methods frequently suffer significant performance degradation in real-world noisy environments. Since acoustic noise is usually independent of visual noise, video information can be very helpful when available. Therefore, this thesis proposes a VAD method that uses both audio and video modalities. By exploiting audio and video information together, the multi-modal VAD method is robust to external noise and can even distinguish the speech of multiple speakers. In particular, our model reduces the number of parameters by using MobileNetV2 with the temporal shift module. Experimental results on the MOBIO and Columbia datasets demonstrate the effectiveness of the proposed attention architecture. Compared with related theses, our multi-modal VAD model achieves state-of-the-art performance with the fewest parameters. We propose two models: an early-fusion multi-modal model that prioritizes weight reduction, and a multi-modal fusion attention architecture model that focuses on performance improvement.
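The temporal shift module mentioned above lets a 2D backbone such as MobileNetV2 capture temporal context at essentially zero parameter cost: a fraction of the feature channels is shifted one step forward or backward along the time axis before each convolution. The following is a minimal NumPy sketch of that shift operation (the function name, the `shift_div` fraction, and the tensor layout are illustrative assumptions, not the thesis's actual implementation):

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Sketch of a temporal shift: for a clip of shape (T, C, H, W),
    move 1/shift_div of the channels back by one frame, another
    1/shift_div forward by one frame, and leave the rest in place.
    Vacated positions are zero-filled. Hypothetical example, not the
    thesis's exact code."""
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]              # first fold: pull features from the next frame
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]  # second fold: pull from the previous frame
    out[:, 2 * fold:] = x[:, 2 * fold:]         # remaining channels are untouched
    return out

# Tiny clip: 2 frames, 8 channels, 1x1 spatial
x = np.arange(2 * 8, dtype=float).reshape(2, 8, 1, 1)
y = temporal_shift(x)
```

Because the shift is a pure memory movement, it adds no learnable parameters, which is consistent with the abstract's emphasis on a lightweight model.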
