Generating Whispered Speech for Recognition through Self-supervised Learning
Korean title: Whispered Speech Generation and Recognition through Self-supervised Learning
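- Keywords: whispered speech generation, voice conversion, speech recognition, self-supervised learning
- Publisher: Sogang University Graduate School
- Advisor: 박형민
- Year of publication: 2024
- Degree conferred: August 2024
- Degree: Master's
- Department and major: Graduate School, Department of Artificial Intelligence
- URI: http://www.dcollection.net/handler/sogang/000000079222
- UCI: I804:11029-000000079222
- Language of text: English
- Copyright: Sogang University theses are protected by copyright.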
Abstract
Whispered speech is a natural mode of speaking characterized by low energy and only a faint presence of the fundamental frequency. Speech generative models for diverse voices have recently gained popularity, and whispered speech is in demand for ASMR content aimed at sleep induction and stress relief. It is also useful in scenarios that require speech interaction in quiet environments. This paper introduces a deep learning model that converts normal speech to whispered speech and aims to understand the characteristics of whispered speech. Whereas earlier work on whispered speech conversion relied on LPC-based methods or GMM-based models, this paper proposes using self-supervised learning-based models such as HuBERT to learn and generate whispered speech characteristics. It also compares the characteristics of whispered and normal speech in order to understand how they differ. Augmenting training data with whispered speech produced by this generation model is expected to improve speech recognition performance across various languages, including Korean, for which whispered data is scarce.
Table of Contents
1 Introduction
1.1 Background
1.2 Overview of the proposed method
2 Conventional Method
2.1 Signal Processing-Based Voice Conversion
2.2 Deep Learning-Based Voice Conversion
2.3 Voice Conversion with Self-supervised Model
3 Proposed Method
3.1 Speech Representation Extractor (SRE)
3.2 Whispered Speech Generator (WSG)
4 Experiments
4.1 Datasets
4.1.1 LibriSpeech
4.1.2 wTIMIT
4.1.3 CHAINS
4.2 Model Configurations
4.2.1 Speech Representation Extractor
4.2.2 Whispered Speech Generator
4.2.3 Evaluation Metrics
4.3 Experimental Results
4.3.1 MCD Result
4.3.2 t-SNE Result
4.3.3 ASR Result
5 Conclusion
Bibliography