Title :
Voice Activity Detection Based on an Unsupervised Learning Framework
Author :
Dongwen Ying ; Yonghong Yan ; Jianwu Dang ; Frank K. Soong
Author_Institution :
ThinkIT Lab., Chinese Academy of Sciences, Beijing, China
Abstract :
Constructing models for speech/nonspeech discrimination is a crucial issue for voice activity detectors (VADs). Semi-supervised learning is the most popular approach to model construction in conventional VADs. In this correspondence, we propose an unsupervised learning framework for constructing statistical models for VAD. The framework is realized by a sequential Gaussian mixture model (GMM) and comprises an initialization process and an updating process. At each subband, the GMM is first initialized using the EM algorithm and then sequentially updated frame by frame. From the GMM, a self-regulatory discrimination threshold is derived at each subband. Several constraints are imposed on the GMM to ensure reliability. Because of its unsupervised learning, the proposed VAD does not rely on the assumption, widely used in most VADs, that the first several frames of an utterance are nonspeech. Moreover, the speech presence probability in the time-frequency domain is a byproduct of this VAD. We tested the method on speech from the TIMIT database and noise from the NOISEX-92 database. The evaluations showed its promising performance in comparison with standard VADs such as ITU-T G.729 Annex B, GSM AMR, and a typical semi-supervised VAD.
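The abstract describes the framework only at a high level. As one illustrative reading of it, the Python sketch below implements a sequential two-component GMM for a single subband: EM initialization on a batch of initial frames, then a frame-by-frame update with exponential forgetting, with the speech presence probability falling out as the posterior of the speech component. The component count, the use of subband log-energy features, the forgetting factor `alpha`, the mean-ordering constraint, and the names `em_init`/`sequential_update` are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

EPS = 1e-8

def em_init(x, n_iter=20):
    """Fit a two-component 1-D GMM to initial frames of one subband via EM.
    Component 0 is intended as noise (lower mean), component 1 as speech.
    NOTE: a sketch under assumed settings, not the authors' exact algorithm."""
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.full(2, x.var() + EPS)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for every frame
        lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        post = lik / (lik.sum(axis=1, keepdims=True) + EPS)
        # M-step: re-estimate weights, means, and variances
        nk = post.sum(axis=0) + EPS
        w = nk / len(x)
        mu = (post * x[:, None]).sum(axis=0) / nk
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / nk + EPS
    return w, mu, var

def sequential_update(x_t, w, mu, var, alpha=0.05):
    """Update the subband GMM with one new frame x_t (exponential forgetting)
    and return the speech presence probability as a byproduct."""
    lik = w * np.exp(-0.5 * (x_t - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    post = lik / (lik.sum() + EPS)
    w = (1 - alpha) * w + alpha * post
    rate = alpha * post / np.maximum(w, EPS)  # per-component learning rate
    mu = mu + rate * (x_t - mu)
    var = np.maximum(var + rate * ((x_t - mu) ** 2 - var), EPS)
    # Illustrative reliability constraint: keep the noise mean below the speech mean
    if mu[0] > mu[1]:
        order = np.argsort(mu)
        w, mu, var = w[order], mu[order], var[order]
    return w, mu, var, post[1]

# Usage sketch on synthetic subband log-energies (noise, then noise + speech)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 100)])
w, mu, var = em_init(x[:50])
for x_t in x[50:]:
    w, mu, var, p_speech = sequential_update(x_t, w, mu, var)
    is_speech = p_speech > 0.5  # equal-posterior decision point
```

In this sketch the decision rule `p_speech > 0.5` plays the role of a self-regulating threshold: the crossover point between the two weighted Gaussians moves as the model tracks the noise, so no fixed energy threshold is needed. The paper derives its subband threshold from the GMM as well, but the exact derivation is not given in the abstract.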
Keywords :
Gaussian processes; speech recognition; unsupervised learning; Gaussian mixture model; NOISEX-92 database; TIMIT database; nonspeech discrimination; semi-supervised learning; speech discrimination; statistical models; time-frequency domain; mathematical model; signal-to-noise ratio; speech; model-based Gaussian clustering; sequential Gaussian mixture model (GMM); speech presence probability; voice activity detection (VAD)
Journal_Title :
IEEE Transactions on Audio, Speech, and Language Processing
DOI :
10.1109/TASL.2011.2125953