Using hidden Markov models for topic segmentation of meeting transcripts

Author

Sherman, Melissa ; Liu, Yang

Author_Institution

Behavioral & Brain Sci., Univ. of Texas at Dallas, Dallas, TX

fYear

2008

fDate

15-19 Dec. 2008

Firstpage

185

Lastpage

188

Abstract

In this paper, we present a hidden Markov model (HMM) approach to segmenting meeting transcripts into topics. To learn the model, we use unsupervised learning to cluster the text segments obtained from topic boundary information. Using modified WinDiff and P_k metrics on the ICSI meeting corpus, we demonstrate that the HMM outperforms LCSeg, a state-of-the-art lexical-chain-based method for topic segmentation. We evaluate the effect of language model (LM) order, the number of hidden states, and the use of stop words. Our experimental results show that a unigram LM is better than a trigram LM, that using too many hidden states degrades topic segmentation performance, and that removing stop words from the transcripts does not improve segmentation performance.
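
For readers unfamiliar with the two evaluation metrics named in the abstract, the following is a minimal sketch of the standard P_k and WindowDiff definitions (the abstract calls the latter WinDiff). It does not reproduce the modified variants used in the paper, which this record does not describe. The function names, the 0/1 boundary-list representation, and the default window size (half the mean reference segment length) are assumptions made for illustration.

```python
from typing import List, Optional


def default_window(ref: List[int]) -> int:
    """Half the mean reference segment length (in units), at least 1."""
    n_units = len(ref) + 1      # one boundary slot between each pair of units
    n_segments = sum(ref) + 1   # each boundary adds one segment
    return max(1, round(n_units / n_segments / 2))


def _check(ref: List[int], hyp: List[int], k: Optional[int]) -> int:
    """Validate inputs and return the window size to use."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same length")
    if k is None:
        k = default_window(ref)
    if k < 1 or k > len(ref):
        raise ValueError("window size k is out of range")
    return k


def p_k(ref: List[int], hyp: List[int], k: Optional[int] = None) -> float:
    """Standard P_k: fraction of windows where ref and hyp disagree on
    whether the two window ends fall in the same segment."""
    k = _check(ref, hyp, k)
    n_windows = len(ref) - k + 1
    errors = 0
    for i in range(n_windows):
        ref_same = sum(ref[i:i + k]) == 0   # no boundary inside the window
        hyp_same = sum(hyp[i:i + k]) == 0
        errors += ref_same != hyp_same
    return errors / n_windows


def window_diff(ref: List[int], hyp: List[int], k: Optional[int] = None) -> float:
    """Standard WindowDiff: fraction of windows where ref and hyp
    contain a different number of boundaries."""
    k = _check(ref, hyp, k)
    n_windows = len(ref) - k + 1
    errors = 0
    for i in range(n_windows):
        errors += sum(ref[i:i + k]) != sum(hyp[i:i + k])
    return errors / n_windows


# Hypothetical example: 9 units, reference boundaries after units 3 and 6.
reference = [0, 0, 1, 0, 0, 1, 0, 0]
no_boundaries = [0] * 8
print(p_k(reference, no_boundaries, k=2))
print(window_diff(reference, no_boundaries, k=2))
```

Both scores are error rates, so lower is better, and a hypothesis identical to the reference scores 0 on each. Here entry `j` of a boundary list is 1 when a topic boundary falls between unit `j` and unit `j + 1`.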

Keywords

hidden Markov models; information analysis; unsupervised learning; Pk metrics; language model order; lexical chain; stop words; text segment clustering; topic boundary information; topic segmentation performance; Broadcasting; Coherence; Computer science; Decision trees; Degradation; Feature extraction; Machine learning algorithms; Speech analysis; LCSeg; Meeting Transcript; Topic Segmentation

fLanguage

English

Publisher

IEEE

Conference_Titel

Spoken Language Technology Workshop, 2008. SLT 2008. IEEE

Conference_Location

Goa

Print_ISBN

978-1-4244-3471-8

Electronic_ISBN

978-1-4244-3472-5

Type

conf

DOI

10.1109/SLT.2008.4777871

Filename

4777871