A Speech Segregation Method Based on Pitch and Binaural Cues in Reverberant Environments
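- Subject (keywords): CASA, binaural, pitch, ITDs
- Publisher: Graduate School, Sogang University
- Advisor: 박형민
- Year of publication: 2017
- Degree conferral date: February 2017
- Degree: Master's
- Department and major: Graduate School, Department of Electronic Engineering
- URI: http://www.dcollection.net/handler/sogang/000000061432
- Language of text: English
- Copyright: Sogang University theses are protected by copyright.
Abstract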
Humans have a remarkable ability to attend to a single voice. Based on computational auditory scene analysis (CASA), which models this human auditory processing, we propose a method that enhances target speech in reverberant environments by removing noise components, exploiting both monaural and binaural cues. After initial segmentation of the time-frequency (T-F) components of the signal at each of the two ears, pitch contours related to harmonics are obtained from both signals and used to group segments into foreground and background. Then, the direction of each sound source is carefully estimated from the T-F components belonging to that source, and the interaural time differences (ITDs) estimated in T-F units are used to perform sequential grouping, including segregation of unvoiced speech components. After sequential grouping, we estimate more reliable speech T-F components by checking, for each segment from the initial segmentation, whether the mask obtained from the ITDs covers more than 50% of that segment. Segments that pass this check are called overlap segments. In a second round, we use the overlap segments to re-estimate the pitch contours for more reliable information and find the segments linked to these second pitch contours. The T-F masks obtained from the ITDs and the pitch contours, together with the overlap segments, are combined to yield a final mask for speech enhancement. It should be noted that, although the method is derived from ASA, we use two sensors (with no object between them) that are spaced far more closely than human ears. This spacing avoids spatial aliasing for frequencies up to half the sampling frequency and suits a compact device: the largest possible delay between the sensors is always less than half a period at all frequencies of interest.
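Two steps in the abstract are concrete enough to sketch in code: selecting overlap segments by the 50% coverage rule, and the sensor-spacing bound implied by keeping the inter-sensor delay below half a period up to half the sampling frequency. The Python below is only an illustrative sketch under stated assumptions, not the thesis implementation. The function names, the integer label map (with 0 meaning "no segment"), the binary mask layout, and the speed of sound of 343 m/s are all hypothetical choices made for this example.

```python
from collections import defaultdict


def overlap_segments(segment_labels, itd_mask, threshold=0.5):
    """Return labels of segments covered by the ITD mask by more than
    `threshold` (a fraction of the segment's T-F units).

    segment_labels: 2-D list of ints, one label per T-F unit (assumed layout).
    itd_mask: 2-D list of 0/1 values with the same shape (assumed layout).
    """
    total = defaultdict(int)    # number of T-F units in each segment
    covered = defaultdict(int)  # units of each segment inside the ITD mask
    for label_row, mask_row in zip(segment_labels, itd_mask):
        for label, in_mask in zip(label_row, mask_row):
            if label == 0:  # assumption: 0 marks units in no segment
                continue
            total[label] += 1
            if in_mask:
                covered[label] += 1
    return {lab for lab in total if covered[lab] / total[lab] > threshold}


def max_sensor_spacing(sample_rate_hz, speed_of_sound=343.0):
    """Upper bound on spacing d (metres) so that the delay d / c stays
    below half a period at the Nyquist frequency fs / 2, i.e. d < c / fs."""
    return speed_of_sound / sample_rate_hz


# Toy example: rows and columns index T-F units, with three segments.
labels = [[1, 1, 2],
          [1, 2, 2],
          [0, 3, 3]]
mask = [[1, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
selected = overlap_segments(labels, mask)
spacing = max_sensor_spacing(16000)
```

In the toy example, segment 1 is fully covered, segment 2 is covered in one of three units, and segment 3 sits exactly at 50%, so only segment 1 passes the strict "more than 50%" test. For an assumed 16 kHz sampling rate, the spacing bound works out to roughly 2.1 cm.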