
Graph-based Decoding Network Using WFST and Its Use in the Integration of End-to-End Speech Recognition and Language Models

Abstract

Most modern automatic speech recognition (ASR) systems are built in an end-to-end manner. Using methods such as connectionist temporal classification (CTC) and attention mechanisms, end-to-end ASR overcomes the dependence on hidden Markov model forced alignments found in conventional deep neural network (DNN)-weighted finite-state transducer (WFST) systems, and it allows the acoustic and language models to be trained jointly in a single DNN. However, training uses only paired audio and text. Rich linguistic knowledge from large-scale text data benefits many natural language processing tasks, yet collecting paired speech-text data is far more expensive than collecting text or speech alone. Consequently, the quality and quantity of paired speech-text data limit the achievable performance of end-to-end speech recognition systems. In contrast, DNN-WFST-based speech recognition can train its language model on large-scale text-only data to improve recognition performance, and such data are easier and cheaper to collect than paired speech-text data.

Decoding networks have been explored as a way to combine end-to-end ASR with external language models trained on text-only data. They come in two types. A partially expanded decoding network passes the output probabilities of the end-to-end ASR model to the language model and combines them with the language model probabilities to score the entire sequence. Partially expanded decoding networks allow the two models to be trained independently, permit decoding with external information, and can be applied regardless of model type (input form and topology) because they consider only the models' output probabilities. However, they must perform the same model-call operation at every time step, require the end-to-end and language models to share the same word set, and suffer from high memory usage because the graph must store all paths.

This work proposes a fully expanded decoding network because it (1) can use off-the-shelf pre-trained end-to-end ASR and language models without modification and (2) outperforms partially expanded methods. The fully expanded decoding network is implemented as a WFST, as in many previous studies, where WFSTs are used statically to incorporate fixed lattices into discriminative sequence criteria. To integrate a graph-based language model into ASR, we propose an enhanced CTC transducer for collapsing output labels in CTC-based ASR. The new CTC transducer, which closely follows the standard CTC algorithm, reduces the word error rate (WER). We then introduce a tokenization transducer for handling non-speech symbols and show that, with adequately designed schemes for processing non-speech symbols, it significantly improves decoding performance compared with a vanilla WFST-based decoder.
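The score combination performed by a partially expanded decoding network can be illustrated with a short sketch. The Python snippet below is a minimal, simplified illustration and not the thesis's implementation: it rescores an n-best list by log-linearly combining each hypothesis's end-to-end log-probability with an external language model score. The function name, the interpolation weight lm_weight, and the toy uniform language model are assumptions made for the example only.

    import math

    def shallow_fusion_rescore(nbest, lm_score, lm_weight=0.3):
        """Combine end-to-end ASR scores with an external language model
        by log-linear interpolation (sequence-level score combination).

        nbest     : list of (hypothesis_tokens, e2e_log_prob) pairs
        lm_score  : callable returning the LM log-probability of a token list
        lm_weight : interpolation weight for the LM (assumed; tuned on dev data)
        """
        rescored = []
        for tokens, e2e_logp in nbest:
            total = e2e_logp + lm_weight * lm_score(tokens)
            rescored.append((tokens, total))
        # Return hypotheses sorted by combined score, best first.
        return sorted(rescored, key=lambda x: x[1], reverse=True)

    if __name__ == "__main__":
        # Toy usage with a hypothetical uniform LM over a 100-word vocabulary.
        uniform_lm = lambda tokens: len(tokens) * math.log(1.0 / 100)
        nbest = [(["hello", "world"], -4.2), (["hollow", "word"], -4.0)]
        print(shallow_fusion_rescore(nbest, uniform_lm))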
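The CTC transducer discussed above collapses frame-level CTC outputs into label sequences by merging repeated labels and removing blanks. The sketch below shows the standard collapsing rule and the arcs of a conventional collapsing transducer of the kind used in earlier WFST-based CTC decoders; it is a reference illustration under that assumption, not the thesis's enhanced CTC transducer, and the symbol names <blk> and <eps> are placeholders chosen for the example.

    def ctc_collapse(frame_labels, blank="<blk>"):
        """Reference CTC collapsing rule B: merge consecutive repeats,
        then drop blank symbols (e.g. a a <blk> a b b -> a a b)."""
        out, prev = [], None
        for lab in frame_labels:
            if lab != prev and lab != blank:
                out.append(lab)
            prev = lab
        return out

    def build_ctc_collapsing_arcs(labels, blank="<blk>"):
        """Arcs of a conventional CTC collapsing transducer.
        State 0 is the start/blank state; state i+1 means 'just emitted labels[i]'.
        Each arc is (src_state, dst_state, input_label, output_label)."""
        eps = "<eps>"
        arcs = [(0, 0, blank, eps)]          # blanks at the start are consumed
        for i, lab in enumerate(labels):
            s = i + 1
            arcs.append((0, s, lab, lab))    # first frame of a label emits it
            arcs.append((s, s, lab, eps))    # repeated frames are collapsed
            arcs.append((s, 0, blank, eps))  # a blank resets to the start state
            for j, other in enumerate(labels):
                if j != i:                   # switching labels emits the new one
                    arcs.append((s, j + 1, other, other))
        return arcs

    if __name__ == "__main__":
        print(ctc_collapse(["a", "a", "<blk>", "a", "b", "b"]))  # ['a', 'a', 'b']
        print(len(build_ctc_collapsing_arcs(["a", "b", "c"])))   # 16 arcs

In WFST-based decoding, such a collapsing transducer is typically composed with lexicon and grammar transducers to form the fully expanded decoding network; the enhanced transducer proposed in this thesis refines this standard construction.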
