Title :
Localization based stereo speech separation using deep networks
Author :
Yu, Yang ; Wang, Wenwu ; Luo, Jian ; Feng, Pengming
Author_Institution :
School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China, 710072
Abstract :
Time-frequency (T-F) masking is an effective method for stereo speech source separation. However, reliable estimation of the T-F mask from sound mixtures is a challenging task, especially when room reverberation is present in the mixtures. In this paper, we propose a new stereo speech separation system in which deep networks are used to generate a soft T-F mask for separation. More specifically, the deep network, which is composed of two sparse autoencoders and a softmax classifier, is used to estimate the orientations of the target and interferers at each T-F unit, based on low-level features such as the mixing vector (MV) and the interaural level and phase differences (ILD/IPD). The deep network is trained by a greedy layer-wise method using a dataset generated by convolving room impulse responses (RIRs) with clean speech signals positioned at different angles with respect to the sensors. With the trained deep networks, the probability that each T-F unit belongs to the target or an interferer can be estimated from the localization cues and used to generate the soft mask. Experiments using real binaural RIRs and the TIMIT dataset are provided to show the performance of the proposed system on reverberant speech mixtures, as compared with a recently proposed model-based T-F masking technique.
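The abstract's chain from localization cues to a soft mask can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`interaural_cues`, `softmax`, `soft_mask`), the array shapes, and the number of candidate angles are all hypothetical, random logits stand in for the output of the trained autoencoder stack, and the mixing-vector feature is omitted. It assumes the stereo mixture has already been transformed to complex T-F representations of the left and right channels.

```python
import numpy as np

def interaural_cues(left, right, eps=1e-12):
    """Return ILD (dB) and IPD (radians) for every T-F unit of a stereo pair."""
    ild = 20.0 * np.log10((np.abs(left) + eps) / (np.abs(right) + eps))
    ipd = np.angle(left * np.conj(right))  # phase of left relative to right
    return ild, ipd

def softmax(logits):
    """Numerically stable softmax over the last axis (the orientation classes)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def soft_mask(posteriors, target_classes):
    """Sum the class probabilities of the orientations assigned to the target."""
    return posteriors[..., target_classes].sum(axis=-1)

# Hypothetical sizes: 4 frequency bins, 5 frames, 7 candidate orientations.
rng = np.random.default_rng(0)
n_freq, n_frames, n_angles = 4, 5, 7
left = rng.standard_normal((n_freq, n_frames)) + 1j * rng.standard_normal((n_freq, n_frames))
right = rng.standard_normal((n_freq, n_frames)) + 1j * rng.standard_normal((n_freq, n_frames))

ild, ipd = interaural_cues(left, right)

# Placeholder for the network: random logits instead of a trained classifier.
logits = rng.standard_normal((n_freq, n_frames, n_angles))
posteriors = softmax(logits)

# Assume (for illustration only) that the target sits at orientation class 3.
mask = soft_mask(posteriors, target_classes=[3])
target_left = mask * left  # masked left-channel T-F representation
```

Because the softmax output sums to one over the orientation classes, the mask for the target and the mask for all remaining classes sum to one at every T-F unit, which is what makes the mask "soft" rather than binary.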
Keywords :
Feature extraction; Neural networks; Reverberation; Source separation; Speech; Speech processing; Training; Deep learning; Deep networks; Soft mask
Conference_Title :
Digital Signal Processing (DSP), 2015 IEEE International Conference on
Conference_Location :
Singapore, Singapore
DOI :
10.1109/ICDSP.2015.7251849