Title :
Using latent topic features to improve binary classification of spoken documents
Author :
Wintrode, Jonathan
Author_Institution :
Center for Language & Speech Process., Johns Hopkins Univ., Baltimore, MD, USA
Abstract :
In many topic identification applications, supervised training labels are only indirectly related to the semantic content of the documents being classified. For example, many topically distinct emails will all be assigned a single broad category label of "spam" or "not-spam", and a two-class classifier will lack direct knowledge of the underlying topic structure. This paper examines the degradation of topic identification performance on conversational speech when multiple semantic topics are combined into a single broad category. We then develop techniques that use document clustering and Latent Dirichlet Allocation (LDA) to exploit the underlying semantic topics, improving performance over classifiers trained on the single category label by up to 20%.
Keywords :
speech recognition; LDA; conversational speech identification performance; latent Dirichlet allocation; latent topic features; spoken document binary classification; two-class classifier; Detectors; Error analysis; Semantics; Speech; Support vector machines; Training; clustering; topic identification
Conference_Title :
Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on
Conference_Location :
Prague
Print_ISBN :
978-1-4577-0538-0
ISSN :
1520-6149
DOI :
10.1109/ICASSP.2011.5947615