Multilingual Speech-to-Vocal Tract Visualization Using Deep Learning for Pronunciation Training
- Keywords: vocal tract visualization, speech to vocal tract, pronunciation training, pronunciation correction
- Publisher: Sogang University Graduate School
- Advisor: 박운상
- Publication year: 2025
- Degree conferral date: February 2025
- Degree: Master's
- Department and major: Graduate School, Department of Computer Science and Engineering
- URI: http://www.dcollection.net/handler/sogang/000000079329
- UCI: I804:11029-000000079329
- Language of text: English
- Copyright: Sogang University theses are protected by copyright.
Abstract
Visualizing the vocal tract during speech is challenging, even with recent advances in open-source algorithms and data. One major obstacle is the lack of multimodal datasets that pair speech with information about internal mouth structures, which limits the development of effective algorithms. In this work, we propose a new visualization algorithm that translates speech into vocal tract movements, allowing individuals to see how their pronunciation compares to a correctly pronounced reference. The application is especially useful for language learners and for people who are deaf from birth, as it provides a visual way to understand speech production. We address the dataset gap by creating a new dataset that maps speech to VocalTractLab parameters. Using Wav2Vec 2.0 and HuBERT to extract audio features, we model the task as a sequence-to-sequence problem and use a Bi-GRU to predict VocalTractLab parameters, which then drive the visualizations. We additionally build separate datasets for General American English, Korean, and Brazilian Portuguese with the same method and train a model for each. We demonstrate the model's effectiveness through both qualitative and quantitative evaluations and assess its usefulness for pronunciation correction.
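The pipeline the abstract describes — per-frame audio features fed through a bidirectional GRU, with a linear readout producing one articulatory parameter vector per frame — can be sketched as follows. This is a minimal illustration only, not the thesis implementation: the feature dimension of 768 (typical of a Wav2Vec 2.0 base model), the hidden size, and the parameter count of 19 are assumed values, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Single-direction GRU with randomly initialized weights (untrained sketch)."""
    def __init__(self, input_dim, hidden_dim, rng):
        s = 1.0 / np.sqrt(hidden_dim)
        m = lambda rows, cols: rng.uniform(-s, s, (rows, cols))
        self.hidden_dim = hidden_dim
        # Weights for the update gate (z), reset gate (r), and candidate state (h~).
        self.Wz, self.Uz, self.bz = m(hidden_dim, input_dim), m(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wr, self.Ur, self.br = m(hidden_dim, input_dim), m(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wh, self.Uh, self.bh = m(hidden_dim, input_dim), m(hidden_dim, hidden_dim), np.zeros(hidden_dim)

    def run(self, xs):
        """xs: (T, input_dim) -> hidden states (T, hidden_dim)."""
        h = np.zeros(self.hidden_dim)
        states = []
        for x in xs:
            z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)
            r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)
            h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h) + self.bh)
            h = (1.0 - z) * h + z * h_cand
            states.append(h)
        return np.stack(states)

def bigru_to_vtl_params(features, fwd, bwd, W_out, b_out):
    """Map per-frame speech features to per-frame articulatory parameters."""
    h_fwd = fwd.run(features)               # left-to-right pass
    h_bwd = bwd.run(features[::-1])[::-1]   # right-to-left pass, re-aligned in time
    h = np.concatenate([h_fwd, h_bwd], axis=1)
    return h @ W_out.T + b_out              # linear readout per frame

rng = np.random.default_rng(0)
FEAT_DIM, HIDDEN, N_PARAMS, T = 768, 64, 19, 50  # assumed, illustrative sizes
fwd = GRUCell(FEAT_DIM, HIDDEN, rng)
bwd = GRUCell(FEAT_DIM, HIDDEN, rng)
W_out = rng.standard_normal((N_PARAMS, 2 * HIDDEN)) * 0.01
b_out = np.zeros(N_PARAMS)

features = rng.standard_normal((T, FEAT_DIM))  # stand-in for extracted audio features
params = bigru_to_vtl_params(features, fwd, bwd, W_out, b_out)
print(params.shape)  # one parameter vector per audio frame: (50, 19)
```

In a real system the predicted parameter trajectories would be handed to VocalTractLab to render the articulator movements, and the GRU and readout weights would be learned from the speech-to-parameter dataset rather than sampled at random.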
Table of Contents
1 Introduction
1.1 Thesis Structure
1.2 Overview
2 Related Works
3 Background
3.1 Wav2Vec 2.0
3.2 HuBERT: Self-Supervised Speech Representation Learning
3.3 Gated Recurrent Unit (GRU)
3.4 Bidirectional Gated Recurrent Unit (Bi-GRU)
3.5 Voice Conversion and Diff-HierVC
3.6 Anatomy of the Vocal Tract
3.7 International Phonetic Alphabet and Vowel Diagram
3.8 Vocal Tract Lab (VTL)
4 Methods
4.1 Extension of VTL
4.2 Dataset
4.3 Model
4.4 Experimental Setup
5 Results and Discussion
5.1 Dataset Discussion
5.2 Audio Representation Model Comparison
5.3 Qualitative Results
5.4 Quantitative Results
5.5 Pronunciation Correction Survey
6 Conclusion
Bibliography

