Title :
Topic segmentation using Markov models on section level
Author :
Matusov, Evgeny ; Peters, Jochen ; Meyer, Carsten ; Ney, Hermann
Author_Institution :
Philips Res. Labs., Aachen, Germany
Date :
30 Nov.-3 Dec. 2003
Abstract :
Topic segmentation, i.e., the combined task of document segmentation and topic identification, is an interesting issue both from a theoretical point of view and for practical applications. Previous studies have mainly focused on applications exhibiting rather weak correlations in the topic order (e.g., broadcast news). In this work, we concentrate on documents that follow a typical structure with respect to the sequence and organization of the individual sections. We propose an algorithm that allows us to explicitly add such structures as additional knowledge sources by modeling the document structure on the level of complete sections. Specifically, we address the issues of explicit section length modeling and modeling of typical section start phrases. On a database of dictated reports, we show significant improvements over state-of-the-art approaches on both manually and automatically transcribed text. Moreover, we show that our approach is significantly more robust against recognition errors than a phrase matching approach that exploits only the typical section start phrases.
Keywords :
Markov processes; identification; text analysis; dictated reports database; document segmentation; document structure modeling; explicit section length modeling; knowledge sources; recognition errors; section level Markov models; section organization; section sequence; section start phrase modeling; topic identification; topic segmentation; transcribed text; Broadcasting; Databases; Laboratories; Law; Legal factors; Medical simulation; Probability; Protocols; Robustness; Switches
Conference_Title :
Automatic Speech Recognition and Understanding, 2003. ASRU '03. 2003 IEEE Workshop on
Print_ISBN :
0-7803-7980-2
DOI :
10.1109/ASRU.2003.1318486