
Audio-Visual Speech Recognition Based on Gaussian Dual Cross-Modality Attentions with the Transformer Model

  • Subject (keywords): audio-visual speech recognition
  • Institution: Graduate School, Sogang University
  • Advisor: 박형민
  • Year of publication: 2020
  • Degree conferred: August 2020
  • Degree: Master's
  • Department and major: Department of Electronic Engineering, Graduate School
  • UCI: I804:11029-000000065264
  • Language of text: English
  • Copyright: Theses of Sogang University are protected by copyright.

Abstract

Since the attention mechanism was introduced in neural machine translation, attention has been combined with long short-term memory (LSTM) networks or has replaced the LSTM entirely, as in the Transformer model, to overcome the limitations of LSTMs in sequence-to-sequence problems. In contrast to neural machine translation, audio-visual speech recognition (AVSR) may achieve improved performance by learning the correlation between the audio and visual modalities. Because the audio signal carries richer information than the video of the lips, however, it is hard to train attentions in AVSR with the two modalities balanced. In order to raise the role of the visual modality to the level of the audio modality by fully exploiting the input information when learning attentions, we propose a dual cross-modality (DCM) attention scheme that utilizes both an audio context vector computed with video queries and a video context vector computed with audio queries. Furthermore, we introduce a connectionist temporal classification (CTC) loss and Gaussian multi-head attention in combination with our attention-based model to enforce the monotonic alignments required in AVSR. Experimental results on the LRS2-BBC and LRS3-TED datasets demonstrated the effectiveness of the proposed model with the DCM attention scheme and the hybrid CTC/attention architecture.
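To make the DCM attention scheme concrete, the following is a minimal PyTorch sketch of the core idea of querying each modality with the other: video features serve as queries over the audio stream to form an audio context, and audio features serve as queries over the video stream to form a video context. The module name DCMAttention, the dimensions, and the use of nn.MultiheadAttention are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of dual cross-modality (DCM) attention as described in the
# abstract. All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class DCMAttention(nn.Module):
    """Each modality attends to the other, so the visual stream is forced
    to play a role comparable to the audio stream."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Video queries attend over audio keys/values -> audio context.
        self.audio_ctx_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Audio queries attend over video keys/values -> video context.
        self.video_ctx_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio: (batch, T_a, d_model), video: (batch, T_v, d_model)
        audio_ctx, _ = self.audio_ctx_attn(query=video, key=audio, value=audio)
        video_ctx, _ = self.video_ctx_attn(query=audio, key=video, value=video)
        return audio_ctx, video_ctx  # shapes follow the query lengths

# Toy usage: 80 audio frames and 20 video frames at d_model = 256.
audio = torch.randn(2, 80, 256)
video = torch.randn(2, 20, 256)
audio_ctx, video_ctx = DCMAttention()(audio, video)
print(audio_ctx.shape, video_ctx.shape)  # (2, 20, 256) and (2, 80, 256)
```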
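The abstract also pairs a CTC loss with Gaussian multi-head attention to push the attention weights toward the near-monotonic audio-text alignment that speech recognition requires. The sketch below shows one common way a Gaussian bias can be added to the attention logits before the softmax; the diagonal-centered form and the sigma value are assumptions made for illustration and may differ from the formulation in the thesis.

```python
# Sketch of a Gaussian bias on attention logits that favors keys near the
# diagonal alignment. The exact bias form used in the thesis may differ.
import torch

def gaussian_biased_scores(scores: torch.Tensor, sigma: float = 3.0) -> torch.Tensor:
    """scores: (batch, heads, T_q, T_k) dot-product logits before the softmax."""
    T_q, T_k = scores.shape[-2], scores.shape[-1]
    q_pos = torch.arange(T_q, dtype=scores.dtype).unsqueeze(1)  # (T_q, 1)
    k_pos = torch.arange(T_k, dtype=scores.dtype).unsqueeze(0)  # (1, T_k)
    # Expected key position per query, assuming a roughly linear alignment.
    center = q_pos * (T_k - 1) / max(T_q - 1, 1)
    bias = -((k_pos - center) ** 2) / (2.0 * sigma ** 2)        # peaks on the diagonal
    return scores + bias  # softmax over the last dim is applied as usual afterwards

# Toy check: with uniform logits, the bias makes weights peak near the diagonal.
logits = torch.zeros(1, 1, 4, 10)
weights = torch.softmax(gaussian_biased_scores(logits), dim=-1)
print(weights[0, 0].argmax(dim=-1))  # tensor([0, 3, 6, 9])
```

In a hybrid CTC/attention architecture, training typically interpolates the two objectives, for example L = λ·L_CTC + (1 − λ)·L_att; the abstract does not state the interpolation weight used here.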
