
NeXt-TDNN: A Modernized Backbone Network for Speaker Recognition

Abstract

Since the advent of ECAPA-TDNN, speaker verification has improved remarkably through the use of a one-dimensional (1D) Res2Net block and a squeeze-and-excitation (SE) module, enhanced by multi-layer feature aggregation (MFA) and a channel-dependent attentive pooling layer. Meanwhile, in the vision community, ConvNet structures have been improved by incorporating design elements from the Transformer, leading to performance gains. This paper presents an improved block design for Time Delay Neural Networks (TDNN) in speaker verification, drawing inspiration from recent developments in ConvNets. The SE-Res2Net block in ECAPA-TDNN is replaced with a novel 1D two-step multi-scale ConvNeXt block (TS-ConvNeXt). This TS-ConvNeXt block comprises two separate sub-modules: a temporal multi-scale convolution (MSC) and a frame-wise feed-forward network (FFN). This two-step design allows inter-frame and intra-frame contexts to be captured flexibly. Additionally, global response normalization (GRN) is introduced for the FFN modules to enable more selective feature propagation, similar to the SE module in ECAPA-TDNN. In addition, global response standardization (GRS) is proposed for the feature normalization phase of GRN; it utilizes second-order statistics and contributes to improved performance in speaker verification tasks. Experimental results demonstrate that NeXt-TDNN, with a modernized backbone block, significantly improves performance in speaker verification tasks while reducing parameter size and inference time.
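As a rough illustration of the selective feature propagation the abstract attributes to GRN, the sketch below applies global response normalization, as formulated for ConvNeXt V2, to a channels-by-frames feature map. The function name, the nested-list layout, and the `eps` value are illustrative assumptions and are not taken from the thesis; the thesis's own 1D variant may differ, and the proposed GRS step is not shown.

```python
import math


def global_response_norm(x, gamma, beta, eps=1e-6):
    """Apply GRN to x, a list of C channels, each a list of T frame values.

    gamma and beta are per-channel scale and bias lists of length C
    (learnable parameters in a real network; plain lists here).
    """
    # Step 1: aggregate each channel over time with an L2 norm.
    g = [math.sqrt(sum(v * v for v in channel)) for channel in x]
    # Step 2: normalize each channel's response by the mean response
    # across all channels, so channels compete with one another.
    mean_g = sum(g) / len(g)
    n = [g_c / (mean_g + eps) for g_c in g]
    # Step 3: rescale the input by its normalized response and add
    # the residual connection.
    return [
        [gamma[c] * v * n[c] + beta[c] + v for v in channel]
        for c, channel in enumerate(x)
    ]


# Two channels, two frames: the first channel carries all the energy.
features = [[3.0, 4.0], [0.0, 0.0]]
scaled = global_response_norm(features, gamma=[1.0, 1.0], beta=[0.0, 0.0])
identity = global_response_norm(features, gamma=[0.0, 0.0], beta=[0.0, 0.0])
```

With `gamma` and `beta` set to zero the layer reduces to the identity, which is the usual initialization in ConvNeXt V2; with non-zero `gamma`, channels whose response exceeds the cross-channel mean are amplified relative to weaker ones.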


Table of Contents

1 Introduction
1.1 Background
1.2 Motivation
1.3 Outline of the thesis
2 Related works
2.1 Backbone architecture
2.1.1 Res2Net
2.1.2 ConvNeXt
2.2 Pooling functions
2.3 Attention and gating mechanism
2.3.1 Squeeze-and-Excitation
2.3.2 Convolutional Block Attention Module
3 Proposed NeXt-TDNN architecture
3.1 MFA layer and channel-dependent ASP as temporal pooling
3.2 Proposed TS-ConvNeXt block
3.3 Rationale of the backbone block design
3.4 Global Response Standardization
4 Experimental setup
4.1 Dataset and metrics
4.2 Configurations
5 Experimental result
6 Conclusion
Bibliography
