A Speech Segregation Method Based on Pitch and Binaural Cues in Reverberant Environments

Abstract

Humans have a remarkable ability to attend to a single voice among competing sounds. Inspired by this auditory scene analysis (ASA) and its computational counterpart, computational auditory scene analysis (CASA), we propose a method that enhances target speech by removing noise components using monaural and binaural cues in reverberant environments. After an initial segmentation of the time-frequency (T-F) components of the signals at the two ears, pitch contours related to harmonic structure are used to group segments and to segregate foreground from background. The direction of each sound source is then estimated from the T-F components attributed to that source, and interaural time differences (ITDs) estimated in individual T-F units drive a sequential grouping stage that also segregates unvoiced speech components. After sequential grouping, more reliable speech T-F components are identified: a segment from the initial segmentation is retained as an overlap segment if more than 50% of it is covered by the ITD-based mask. In a second round, the overlap segments are used to re-estimate the pitch contours, yielding more reliable pitch information, and the segments linked to these second-pass contours are found. The T-F masks obtained from the ITDs and the pitch contours, together with the overlap segments, are combined into a final mask that enhances the speech. It should be noted that although the method is derived from ASA, we use two sensors (with no object between them) spaced far more closely than human ears, both to fit a compact device and to avoid spatial aliasing at frequencies up to half the sampling frequency; the largest possible delay between the sensors is therefore always less than half a period at every frequency of interest.
