Asymmetric Encoder-decoder Using Time-frequency Correlation for Universal Speech Enhancement and Separation

Table of Contents

List of Figures vi
List of Tables xi
Abstract xiii
Korean Abstract xvi
1 Introduction 1
1.1 Motivation 1
1.2 Overview of the Proposed Method 7
1.3 Organization of This Thesis 11
2 Related Works 14
2.1 Speech Enhancement 14
2.1.1 Macro Framework for Speech Enhancement 15
2.1.2 Loss Function 17
2.2 Speech Separation 19
2.2.1 Permutation Invariant Training 20
2.2.2 Time-domain Audio Separation Network (TasNet) 22
2.2.3 Time-frequency Dual-path Model 23
2.3 Multi-channel Speech Enhancement and Separation 25
2.3.1 Inter-microphone Phase Difference (IPD) 26
2.3.2 Guided Beamforming by DNN 26
2.3.3 Neural Beamforming 29
2.3.4 Continuous Speech Separation 31
2.4 Encoder-decoder: Symmetric vs. Asymmetric Structure 32
2.4.1 Symmetric Encoder-decoder 34
2.4.2 Asymmetric Encoder-decoder 35
2.5 Remark: From Related Works to Our Unified Model 38
3 Asymmetric Encoder-decoder for Speech Separation 39
3.1 Architectural Motivation 43
3.2 Method 45
3.2.1 Overall Pipeline 45
3.2.2 Architecture of Separator 45
3.2.3 Global and Local Transformer for Long Sequences 48
3.2.4 Boosting Discriminative Learning by Multi-loss 50
3.3 Experimental Settings 51
3.3.1 Dataset 51
3.3.2 Training and Model Configuration 52
3.4 Results 55
3.4.1 Ablation Studies of SepRe Method 55
3.4.2 Effects of the SepRe Method in Other Networks 57
3.4.3 Ablation Studies of Unit Blocks 57
3.4.4 Visualization of Discriminative Learning 60
3.4.5 Comparison With Existing Models 62
4 Spatial Correlation for Multi-channel Speech 66
4.1 TF-CorrNet for Speech Separation 68
4.1.1 Spatial Correlations With PHAT-β 69
4.1.2 Time-frequency Spatial Module 70
4.1.3 Spectral Module 71
4.1.4 Filter Estimation 71
4.2 Efficient Global and Local Transformer 72
4.2.1 Efficient Feed-forward Network 72
4.2.2 Global and Local Transformer 72
4.3 Experiment 73
4.3.1 Dataset and Evaluation 73
4.3.2 Model Configuration and Training 76
4.3.3 Comparison on the Simulation Data 77
4.3.4 Results on LibriCSS 79
4.3.5 Ablation Study 79
5 Temporal Correlation for Reverberant Speech 82
5.1 IF-CorrNet for Speech Dereverberation 85
5.1.1 Inter-frame Correlations for Deep Filter Estimation 85
5.1.2 Time-frequency Module 87
5.1.3 Transformer Block With ConvFFN Module 87
5.2 Experimental Setups 88
5.2.1 Datasets and Evaluation 88
5.2.2 Training and Model Configuration 89
5.3 Experimental Results 89
5.3.1 Investigation on the Number of Taps 89
5.3.2 Comparison With Existing Models 90
5.3.3 Ablation Study 92
6 Asymmetric Encoder-decoder for Correlation for Universal Speech Enhancement and Separation 95
6.1 Correlation for Filter Estimation 99
6.1.1 Formulation 99
6.1.2 Spatio-spectro-temporal Correlation as an Input Feature 100
6.1.3 Multi-channel Time-frequency Filter Estimation 103
6.2 SR-CorrNet 105
6.2.1 Correlation-to-filter Design 106
6.2.2 Asymmetric Encoder-decoder Structure 106
6.2.3 Unit Processing TF-module 108
6.2.4 Split Module 108
6.2.5 Boosting SepRe Method by Early Supervision 110
6.3 Experimental Setup 111
6.3.1 Monaural Speech Separation for Anechoic Clean Mixture 112
6.3.2 Speech Separation for Simulated Noisy-reverberant Mixture 113
6.3.3 Continuous Speech Separation for Real-recorded Mixture 114
6.4 Evaluation Results 118
6.4.1 Monaural Speech Separation for Anechoic Clean Mixture 118
6.4.2 Speech Separation under Noisy-reverberant Conditions 121
6.4.3 Ablation Study 122
6.5 Application to Real-recorded Meeting Diarization 126
6.5.1 Dataset and Evaluation Metric 128
6.5.2 System Pipeline and Evaluation 131
6.5.3 Diarization Results 133
7 Conclusions and Further Works 136
7.1 Conclusions 136
7.2 Further Works 138
Bibliography 140