dCollection Digital Academic Information Distribution System

Development of an End-to-End Deep Neural Network-Based Speaker Diarization and Denoising Integration System

Original title (Korean): 종단형 심층 신경망 기반 화자 분할 및 소음 제거 통합 시스템 개발

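Publisher: Sogang University Graduate School
Advisor: Myoung-Wan Koo (구명완)
Year of publication: 2024
Degree conferred: February 2024
Degree: Master's
Department and major: Graduate School, Department of Artificial Intelligence
URI: http://www.dcollection.net/handler/sogang/000000077259
UCI: I804:11029-000000077259
Language of text: Korean
Copyright: Sogang University theses are protected by copyright.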

Abstract

Speaker diarization is a crucial technology in speech processing. It estimates when each speaker's utterances start and end ("who spoke when") in a speech mixture that contains overlapping speech. Denoising, by contrast, aims to remove unwanted sounds, such as background noise or reverberation, from the input speech mixture. Recent research focuses on preprocessing systems that combine several speech preprocessing techniques to improve speech recognition performance. However, the current modular approach to speech preprocessing has drawbacks, including independently trained modules and increased implementation complexity. In particular, building an integrated system for dereverberation and denoising based on end-to-end deep neural networks is challenging because there are no established precedents. In this study, we therefore propose an integrated preprocessing pipeline based on end-to-end deep neural networks that performs speaker diarization, speech separation, denoising, and dereverberation in single-channel environments. We also build a Korean dataset of overlapped speech with noise and reverberation for training the proposed system. The pipeline centers on a core module that handles speaker diarization and speech separation simultaneously, while pre- and post-processing modules remove noise and reverberation. According to the system design and training results, the pipeline that applies the dereverberation module, the denoising module, and the core module, in that order, performs best. It achieved a character error rate (CER) of 33.1% for the system with a fixed number of two speakers and 46.8% for the system with a flexible number of two or three speakers.
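The abstract reports results as character error rate (CER). As a reading aid, the following is a minimal sketch of how CER is commonly computed: the character-level edit distance between a reference transcript and an ASR hypothesis, divided by the reference length. The function names and the `ignore_spaces` option are illustrative assumptions; the record does not state how the thesis normalizes text or handles whitespace when scoring.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, ignore_spaces: bool = True) -> float:
    """Character error rate: edit distance divided by reference length.

    ignore_spaces is an assumption made for this sketch, not a detail
    taken from the thesis.
    """
    if ignore_spaces:
        ref = "".join(ref.split())
        hyp = "".join(hyp.split())
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted syllable out of five reference characters.
    print(cer("안녕하세요", "안녕하세오"))
```

Under this definition, a CER of 33.1% means roughly one character-level edit for every three reference characters. CER can exceed 100% when the hypothesis contains many insertions.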

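Table of Contents

1. Introduction
1.1. The Need for Speaker Diarization and Denoising Technology
1.1.1. Speaker Diarization Technology
1.1.2. Denoising Technology
1.2. Proposal of an End-to-End Deep Neural Network-Based Integrated System
1.3. Contributions
1.4. Overview
2. Related Work
2.1. Speaker Diarization Networks
2.1.1. Traditional Speaker Diarization Networks
2.1.2. End-to-End Deep Neural Network-Based Networks
2.2. Source Separation Networks
2.2.1. Traditional Source Separation Networks
2.2.2. End-to-End Deep Neural Network-Based Networks
2.3. End-to-End Integrated Speaker Diarization and Source Separation Networks
2.4. Overlapped Speech Datasets
2.4.1. Noise-Based Overlapped Speech Datasets
2.4.2. Noise- and Reverberation-Based Overlapped Speech Datasets
3. End-to-End Speaker Diarization and Denoising System
3.1. System Overview
3.2. End-to-End Speaker Diarization Network
3.3. End-to-End Speaker Diarization and Denoising System
3.3.1. Speaker Diarization and Speech Separation Module
3.3.2. Denoising and Dereverberation Module
3.4. End-to-End Integrated Speech Preprocessing System
3.4.1. System Including Both Denoising and Dereverberation
3.4.2. System Including Either Denoising or Dereverberation
3.4.3. System Performing Only Speaker Diarization and Speech Separation
4. Experiments and Results
4.1. Dataset Composition
4.1.1. Construction of a Korean Overlapped Speech Dataset with Noise and Reverberation
4.1.2. Construction Process of the Overlapped Speech Dataset
4.1.3. Construction Process of the Sparsely Overlapped Speech Dataset
4.2. Experimental Setup
4.3. Experimental Environment
4.4. Experimental Results
4.4.1. Performance Evaluation Results of the Speaker Diarization and Speech Separation Module
4.4.2. Performance Evaluation Results of the Denoising and Dereverberation Module
4.4.3. Performance Evaluation Results by System Pipeline
4.4.4. ASR Evaluation Results of Output Speech by System Pipeline
5. Conclusion
Bibliography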
