Multilingual Speech-to-Vocal Tract Visualization Using Deep Learning for Pronunciation Training

Abstract

Visualizing the vocal tract during speech remains challenging, even with advances in open-source algorithms and data. One major issue is the lack of multimodal datasets that pair speech with information about the internal structures of the mouth, which limits the development of effective algorithms. In this work, we propose a new visualization algorithm that translates speech into vocal tract movements, allowing individuals to see how their pronunciation compares to a correctly pronounced reference. The application is especially useful for language learners and for people who are deaf from birth, as it provides a visual way to understand speech production. We address the dataset gap by creating a new dataset that maps speech to VocalTractLab parameters. Using Wav2Vec 2.0 and HuBERT to extract audio features, we model the task as a sequence-to-sequence problem with a Bi-GRU that predicts VocalTractLab parameters, from which we then generate visualizations. In addition, we create separate datasets for General American English, Korean, and Brazilian Portuguese using the same method and train a distinct model on each. We demonstrate the model's effectiveness through both qualitative and quantitative evaluations and assess its usefulness for pronunciation correction.
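To make the pipeline concrete, the sketch below shows one way the described architecture could be wired together: a pretrained self-supervised encoder (Wav2Vec 2.0 here; HuBERT would be interchangeable) produces frame-level audio features, and a bidirectional GRU regresses a VocalTractLab parameter vector per frame. This is a minimal sketch, not the thesis implementation; the checkpoint name, hidden sizes, and number of VTL parameters are illustrative assumptions.

```python
# Minimal sketch (assumptions: PyTorch + Hugging Face `transformers`; the checkpoint
# "facebook/wav2vec2-base-960h", the hidden sizes, and n_vtl_params are illustrative).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeechToVTL(nn.Module):
    """Maps raw speech to a per-frame sequence of VocalTractLab parameters."""
    def __init__(self, n_vtl_params: int = 30, hidden: int = 256):
        super().__init__()
        # Pretrained self-supervised audio encoder (kept frozen here for simplicity).
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.encoder.requires_grad_(False)
        # Bi-GRU reads the feature sequence in both directions.
        self.bigru = nn.GRU(
            input_size=self.encoder.config.hidden_size,
            hidden_size=hidden,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
        )
        # Linear head regresses one VTL parameter vector per frame.
        self.head = nn.Linear(2 * hidden, n_vtl_params)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        feats = self.encoder(waveform).last_hidden_state  # (batch, frames, dim)
        seq, _ = self.bigru(feats)                         # (batch, frames, 2*hidden)
        return self.head(seq)                              # (batch, frames, n_vtl_params)

model = SpeechToVTL()
dummy = torch.randn(1, 16000)   # one second of placeholder 16 kHz audio
params = model(dummy)
print(params.shape)             # e.g. torch.Size([1, 49, 30])
```

The predicted parameter sequence would then be handed to VocalTractLab to render the corresponding vocal tract configuration for each frame.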


Table of Contents

1 Introduction
1.1 Thesis Structure
1.2 Overview
2 Related Works
3 Background
3.1 Wav2Vec 2.0
3.2 HuBERT: Self-Supervised Speech Representation Learning
3.3 Gated Recurrent Unit (GRU)
3.4 Bidirectional Gated Recurrent Unit (Bi-GRU)
3.5 Voice Conversion and Diff-HierVC
3.6 Anatomy of the Vocal Tract
3.7 International Phonetic Alphabet and Vowel Diagram
3.8 Vocal Tract Lab (VTL)
4 Methods
4.1 Extension of VTL
4.2 Dataset
4.3 Model
4.4 Experimental Setup
5 Results and Discussion
5.1 Dataset Discussion
5.2 Audio Representation Model Comparison
5.3 Qualitative Results
5.4 Quantitative Results
5.5 Pronunciation Correction Survey
6 Conclusion
Bibliography
