DocumentCode
667554
Title
Keynote addresses: From auditory masking to binary classification: Machine learning for speech separation
Author
DeLiang Wang
Author_Institution
Ohio State Univ., Columbus, OH, USA
fYear
2013
fDate
20-23 Oct. 2013
Firstpage
1
Lastpage
1
Abstract
Summary form only given. Speech separation, or the cocktail party problem, is a widely acknowledged challenge. Part of the challenge stems from confusion about what the computational goal should be. While the separation of every sound source in a mixture is considered the gold standard, I argue that such an objective is neither realistic nor what the human auditory system does. Motivated by the auditory masking phenomenon, we have suggested instead the ideal time-frequency binary mask as a main goal for computational auditory scene analysis. This leads to a new formulation of speech separation that classifies time-frequency units into two classes: those dominated by the target speech and the rest. In supervised learning, a paramount issue is generalization to conditions unseen during training. I describe novel methods to deal with the generalization issue where support vector machines (SVMs) are used to estimate the ideal binary mask. One method employs distribution fitting to adapt to unseen signal-to-noise ratios and iterative voice activity detection to adapt to unseen noises. Another method learns more linearly separable features using deep neural networks (DNNs) and then couples DNN and linear SVM for training on a variety of noisy conditions. Systematic evaluations show high-quality separation in new acoustic environments.
Keywords
iterative methods; learning (artificial intelligence); neural nets; signal classification; signal detection; source separation; speech processing; support vector machines; time-frequency analysis; DNNs; acoustic environments; auditory masking phenomenon; binary classification; cocktail party problem; computational auditory scene analysis; deep neural networks; distribution fitting; human auditory system; ideal time-frequency binary mask estimation; iterative voice activity detection; linear SVM; machine learning; signal-to-noise ratio; sound source separation; speech separation; supervised learning; systematic evaluations; time-frequency unit classification; Acoustics; Awards activities; Biological neural networks; Conferences; Educational institutions; Speech; Speech processing
fLanguage
English
Publisher
IEEE
Conference_Titel
Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on
Conference_Location
New Paltz, NY
ISSN
1931-1168
Type
conf
DOI
10.1109/WASPAA.2013.6701900
Filename
6701900