
NeRF-THIS : Neural Radiance Field based Talking Head Synthesis Incorporating Text-to-Speech

Abstract

With the progress of deep learning techniques, automatic video generation from audio or text inputs has emerged as a promising and rapidly evolving area of research. This paper presents NeRF-THIS (Neural Radiance Field based Talking Head Synthesis Incorporating Text-to-Speech), a novel approach to text-driven talking head generation that combines the strengths of text-based audio generation models with those of audio-driven video generation models. The method builds a Neural Radiance Fields (NeRF) based talking head generation architecture integrated with text-to-speech (TTS). This approach has several advantages: 1) it needs only 5 minutes of training data; 2) it is not constrained by Automatic Speech Recognition (ASR) models, and is therefore free from language barriers; 3) it can support real-time inference at low computational cost. Our findings indicate a promising direction for future research in multimedia content generation, opening new avenues for applications in virtual reality, digital entertainment, and interactive media.


Table of Contents

1 Introduction
1.1 Motivation
1.2 Overview of the Proposed Method
2 Related Works
2.1 Speech Synthesis
2.1.1 Neural Audio Codec based TTS
2.2 Talking Head Generation
2.2.1 Text-Driven
2.2.2 Audio-Driven
2.2.3 NeRF-Based Talking Head Synthesis
3 Proposed Method
3.1 Overview
3.2 Input Features
3.2.1 Audio and Text Features
3.2.2 Video Features
3.3 Text-to-Speech Module
3.4 NeRF-based Talking Head Generation Module
3.5 Loss Function
4 Experiments
4.1 Dataset and Evaluation Metrics
4.2 Quantitative Evaluation Results
4.3 Qualitative Evaluation Results
4.3.1 User Study
4.4 Advantages of TTS-integrated Synthesis
4.4.1 Synchronization
4.4.2 Efficiency
5 Conclusion
5.1 Conclusion
5.2 Limitations and Future Work
Bibliography
