Speech Segregation Using an Auditory Vocoder With Event-Synchronous Enhancements

Author

Irino, Toshio ; Patterson, Roy D. ; Kawahara, Hideki

Author_Institution

Fac. of Syst. Eng., Wakayama Univ.

Volume

14

Issue

6

fYear

2006

Firstpage

2212

Lastpage

2221

Abstract

We propose a new method to segregate concurrent speech sounds using an auditory version of a channel vocoder. The auditory representation of sound, referred to as an "auditory image," preserves fine temporal information, unlike conventional window-based processing systems. This makes it possible to segregate speech sources with an event-synchronous procedure. Fundamental frequency information is used to estimate the sequence of glottal pulse times for a target speaker, and to repress the glottal events of other speakers. The procedure leads to robust extraction of the target speech and effective segregation even when the signal-to-noise ratio is as low as 0 dB. Moreover, the segregation performance remains high when the speech contains jitter, or when the estimate of the fundamental frequency F0 is inaccurate. This contrasts with conventional comb-filter methods, where errors in F0 estimation produce a marked reduction in performance. We compared the new method to a comb-filter method using a cross-correlation measure and perceptual recognition experiments. The results suggest that the new method has the potential to supplant comb-filter and harmonic-selection methods for speech enhancement.

Keywords

feature extraction; speaker recognition; speech coding; speech enhancement; speech synthesis; vocoders; auditory image; auditory vocoder; channel vocoder; cross-correlation measure; event-synchronous enhancements; glottal pulse times; perceptual recognition experiments; signal-to-noise ratio; speech enhancement; speech segregation; speech sounds; target speaker synthesis; target speech robust extraction; window-based processing systems; Biomedical engineering; Data mining; Frequency estimation; Image analysis; Loudspeakers; Power harmonic filters; Signal to noise ratio; Speech enhancement; Systems engineering and theory; Vocoders; Auditory image; auditory scene analysis; channel vocoder; comb filter; pitch/F0 extraction

fLanguage

English

Journal_Title

Audio, Speech, and Language Processing, IEEE Transactions on

Publisher

IEEE

ISSN

1558-7916

Type

jour

DOI

10.1109/TASL.2006.872611

Filename

1709908