
Generating Whispered Speech for Recognition through Self-supervised Learning

Generation and Recognition of Whispered Speech through Self-supervised Learning

Abstract

Whispered speech is a natural way of speaking characterized by low energy and a barely present fundamental frequency. Generative speech models for diverse voices have recently gained popularity, and whispered speech is in demand for ASMR content aimed at sleep induction and stress relief. It is also useful where speech interaction must take place in quiet environments. This paper introduces a deep learning model that converts normal speech to whispered speech and aims to understand the characteristics of whispered speech. Whereas earlier work on whispered speech conversion relied on LPC-based methods or GMM-based models, this paper proposes using self-supervised learning-based models such as HuBERT to learn and generate whispered speech characteristics. It also compares the characteristics of whispered and normal speech to understand how they differ. Augmenting training data with whispered speech produced by this generation model is expected to improve speech recognition performance in various languages, including Korean, where whispered data is scarce.


Table of Contents

1 Introduction
1.1 Background
1.2 Overview of the proposed method
2 Conventional Method
2.1 Signal Processing-Based Voice Conversion
2.2 Deep Learning-Based Voice Conversion
2.3 Voice Conversion with Self-supervised Model
3 Proposed Method
3.1 Speech Representation Extractor (SRE)
3.2 Whispered Speech Generator (WSG)
4 Experiments
4.1 Datasets
4.1.1 LibriSpeech
4.1.2 wTIMIT
4.1.3 CHAINS
4.2 Model Configurations
4.2.1 Speech Representation Extractor
4.2.2 Whispered Speech Generator
4.2.3 Evaluation Metrics
4.3 Experimental Results
4.3.1 MCD Result
4.3.2 t-SNE Result
4.3.3 ASR Result
5 Conclusion
Bibliography
