Multilingual Speech-to-Vocal Tract Visualization Using Deep Learning for Pronunciation Training

Abstract

Visualizing the vocal tract during speech remains challenging, even with advances in open-source algorithms and data. One major issue is the lack of multimodal datasets that pair speech with information about the internal structures of the mouth, which limits the development of effective algorithms. In this work, we propose a new visualization algorithm that translates speech into vocal tract movements, allowing individuals to see how their pronunciation compares to a correctly pronounced reference. The application is especially useful for language learners and for people who are deaf from birth, as it provides a visual way to understand speech production. We address the dataset gap by creating a new dataset that maps speech to VocalTractLab parameters. Using Wav2Vec 2.0 and HuBERT to extract audio features, we model the task as a sequence-to-sequence problem with a Bi-GRU that predicts VocalTractLab parameters, from which we then generate visualizations. In addition, we create separate datasets for General American English, Korean, and Brazilian Portuguese using the same method and train a distinct model on each. We demonstrate the model's effectiveness through both qualitative and quantitative evaluations and assess its usefulness for pronunciation correction.
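To make the pipeline concrete, the sketch below shows one way the described architecture could be wired together: a pretrained self-supervised encoder (Wav2Vec 2.0 here; HuBERT would be interchangeable) produces frame-level audio features, and a bidirectional GRU regresses a VocalTractLab parameter vector per frame. This is a minimal sketch, not the thesis implementation; the checkpoint name, hidden sizes, and number of VTL parameters are illustrative assumptions.

```python
# Minimal sketch (assumptions: PyTorch + Hugging Face `transformers`; the checkpoint
# "facebook/wav2vec2-base-960h", the hidden sizes, and n_vtl_params are illustrative).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeechToVTL(nn.Module):
    """Maps raw speech to a per-frame sequence of VocalTractLab parameters."""
    def __init__(self, n_vtl_params: int = 30, hidden: int = 256):
        super().__init__()
        # Pretrained self-supervised audio encoder (kept frozen here for simplicity).
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.encoder.requires_grad_(False)
        # Bi-GRU reads the feature sequence in both directions.
        self.bigru = nn.GRU(
            input_size=self.encoder.config.hidden_size,
            hidden_size=hidden,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
        )
        # Linear head regresses one VTL parameter vector per frame.
        self.head = nn.Linear(2 * hidden, n_vtl_params)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        feats = self.encoder(waveform).last_hidden_state  # (batch, frames, dim)
        seq, _ = self.bigru(feats)                         # (batch, frames, 2*hidden)
        return self.head(seq)                              # (batch, frames, n_vtl_params)

model = SpeechToVTL()
dummy = torch.randn(1, 16000)   # one second of placeholder 16 kHz audio
params = model(dummy)
print(params.shape)             # e.g. torch.Size([1, 49, 30])
```

The predicted parameter sequence would then be handed to VocalTractLab to render the corresponding vocal tract configuration for each frame.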


Table of Contents

1 Introduction
1.1 Thesis Structure
1.2 Overview
2 Related Works
3 Background
3.1 Wav2Vec 2.0
3.2 HuBERT: Self-Supervised Speech Representation Learning
3.3 Gated Recurrent Unit (GRU)
3.4 Bidirectional Gated Recurrent Unit (Bi-GRU)
3.5 Voice Conversion and Diff-HierVC
3.6 Anatomy of the Vocal Tract
3.7 International Phonetic Alphabet and Vowel Diagram
3.8 Vocal Tract Lab (VTL)
4 Methods
4.1 Extension of VTL
4.2 Dataset
4.3 Model
4.4 Experimental Setup
5 Results and Discussion
5.1 Dataset Discussion
5.2 Audio Representation Model Comparison
5.3 Qualitative Results
5.4 Quantitative Results
5.5 Pronunciation Correction Survey
6 Conclusion
Bibliography
