Title :
Unsupervised topic clustering of switchboard speech messages
Author :
Carlson, Beth A.
Author_Institution :
Lincoln Lab., MIT, Lexington, MA, USA
Abstract :
This paper presents a statistical technique which can be used to automatically group speech data records based on the similarity of their content. A tree-based clustering algorithm is used to generate a hierarchical structure for the corpus. This structure can then be used to guide the search for similar material in data from other corpora. The SWITCHBOARD Speech Corpus was used to demonstrate these techniques, since it provides sets of speech files which are nominally on the same topic. Excellent automatic clustering was achieved on the truth text transcripts provided with the SWITCHBOARD corpus, with an average cluster purity of 97.3%. Degraded clustering was achieved using the output transcriptions of a speech recognizer, with a clustering purity of 61.4%.
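The abstract reports results as average cluster purity but this record does not state how purity is computed. The following is a minimal sketch under one common assumption: the purity of a cluster is the fraction of its messages that carry the cluster's most frequent nominal topic, and the average is an unweighted mean over clusters. The function names and the toy topic labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cluster_purity(labels):
    """Fraction of items in one cluster that share its most common topic label."""
    if not labels:
        raise ValueError("cluster must not be empty")
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)


def average_purity(clusters):
    """Unweighted mean of per-cluster purities (an assumed definition)."""
    if not clusters:
        raise ValueError("need at least one cluster")
    return sum(cluster_purity(c) for c in clusters) / len(clusters)


# Hypothetical toy data: each inner list holds the nominal topic
# of every message assigned to one cluster.
clusters = [
    ["pets", "pets", "pets", "pets"],
    ["music", "music", "crime"],
    ["crime", "crime", "crime", "pets"],
]

print([round(cluster_purity(c), 3) for c in clusters])
print(round(average_purity(clusters), 3))
```

A size-weighted average (total majority-label count divided by total message count) is another plausible reading of the metric; the two agree only when all clusters have equal size.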
Keywords :
pattern classification; speech recognition; statistical analysis; tree data structures; SWITCHBOARD Speech Corpus; automatic clustering; automatically group; average cluster purity; degraded clustering; hierarchical structure; output transcriptions; speech data records; speech files; speech recognizer; statistical technique; switchboard speech messages; tree-based clustering algorithm; truth text transcripts; unsupervised topic clustering; Automatic speech recognition; Clustering algorithms; Communication switching; Databases; Degradation; Electronic mail; Information retrieval; Laboratories; Speech recognition; Tree data structures
Conference_Title :
Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on
Conference_Location :
Atlanta, GA
Print_ISBN :
0-7803-3192-3
DOI :
10.1109/ICASSP.1996.541095