NeXt-TDNN : A Modernized Backbone Network for Speaker Recognition
- Keywords: speaker verification, speaker recognition, TDNN, ConvNeXt, multi-scale, gating mechanism
- Institution: Sogang University, Graduate School
- Advisor: Hyung-Min Park
- Year of publication: 2024
- Degree conferred: February 2024
- Degree: Master's
- Department and major: Department of Electronic Engineering, Graduate School
- URI: http://www.dcollection.net/handler/sogang/000000076942
- UCI: I804:11029-000000076942
- Language: English
- Copyright: Sogang University theses are protected by copyright.
Abstract
With the advent of ECAPA-TDNN, speaker verification has seen remarkable improvement through the use of a one-dimensional (1D) Res2Net block and a squeeze-and-excitation (SE) module, enhanced by multi-layer feature aggregation (MFA) and a channel-dependent attentive pooling layer. Meanwhile, in the vision community, ConvNet architectures have been improved by incorporating design elements from the Transformer, leading to performance gains. This thesis presents an improved block design for Time Delay Neural Networks (TDNNs) in speaker verification, drawing inspiration from these recent developments in ConvNets. The SE-Res2Net block in ECAPA-TDNN is replaced with a novel 1D two-step multi-scale ConvNeXt block (TS-ConvNeXt), which comprises two separate sub-modules: a temporal multi-scale convolution (MSC) and a frame-wise feed-forward network (FFN). This two-step design allows inter-frame and intra-frame contexts to be captured flexibly. Additionally, global response normalization (GRN) is introduced in the FFN modules to enable more selective feature propagation, analogous to the SE module in ECAPA-TDNN. Furthermore, global response standardization (GRS) is proposed for the feature normalization phase of GRN; it utilizes second-order statistics and contributes to improved performance in speaker verification tasks. Experimental results demonstrate that NeXt-TDNN, with its modernized backbone block, significantly improves speaker verification performance while reducing parameter count and inference time.
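The GRN mechanism referenced in the abstract (originally introduced for ConvNeXt V2) can be summarized in three steps: global per-channel feature aggregation, divisive normalization across channels, and learnable channel-wise calibration with a residual path. The following is a minimal pure-Python sketch for 1D (channels × frames) features, intended only as an illustration of the published GRN formulation, not the thesis's actual implementation; the proposed GRS variant, which replaces the normalization step with one based on second-order statistics, is not reproduced here.

```python
import math

def grn(x, gamma=None, beta=None, eps=1e-6):
    """Global Response Normalization, sketched for 1D features.

    x     : list of C channels, each a list of T frame values
    gamma : per-channel learnable scale (zeros by default, so an
            untrained GRN reduces to the identity via its residual path)
    beta  : per-channel learnable offset (zeros by default)
    """
    C = len(x)
    gamma = gamma if gamma is not None else [0.0] * C
    beta = beta if beta is not None else [0.0] * C
    # 1) Aggregate: global L2 norm of each channel over the time axis.
    g = [math.sqrt(sum(v * v for v in ch)) for ch in x]
    # 2) Normalize: divide each channel norm by the mean norm across channels.
    mean_g = sum(g) / C
    n = [gi / (mean_g + eps) for gi in g]
    # 3) Calibrate: channel-wise scaling plus a residual connection.
    return [[gamma[c] * (v * n[c]) + beta[c] + v for v in x[c]]
            for c in range(C)]
```

With gamma and beta at their zero initialization the residual path makes the module an identity map; during training, channels whose global response is large relative to the others are amplified or suppressed selectively, which is the "selective feature propagation" role the abstract attributes to GRN.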
Table of Contents
1 Introduction
1.1 Background
1.2 Motivation
1.3 Outline of the thesis
2 Related works
2.1 Backbone architecture
2.1.1 Res2Net
2.1.2 ConvNeXt
2.2 Pooling functions
2.3 Attention and gating mechanism
2.3.1 Squeeze-and-Excitation
2.3.2 Convolutional Block Attention Module
3 Proposed NeXt-TDNN architecture
3.1 MFA layer and channel-dependent ASP as temporal pooling
3.2 Proposed TS-ConvNeXt block
3.3 Rationale of the backbone block design
3.4 Global Response Standardization
4 Experimental setup
4.1 Dataset and metrics
4.2 Configurations
5 Experimental result
6 Conclusion
Bibliography