• DocumentCode
    2423188
  • Title
    Identifying Sentence-Level Semantic Content Units with Topic Models
  • Author
    Hennig, Leonhard ; Strecker, Thomas ; Narr, Sascha ; De Luca, Ernesto William ; Albayrak, Sahin
  • Author_Institution
    Distrib. Artificial Intell. Lab. (DAI-Lab.), Tech. Univ., Berlin, Germany
  • fYear
    2010
  • fDate
    Aug. 30 2010-Sept. 3 2010
  • Firstpage
    59
  • Lastpage
    63
  • Abstract
    Statistical approaches to document content modeling typically focus either on broad topics or on discourse-level subtopics of a text. We present an analysis of the performance of probabilistic topic models on the task of learning sentence-level topics that are similar to facts. The identification of sentential content with the same meaning is an important task in multi-document summarization and the evaluation of multi-document summaries. In our approach, each sentence is represented as a distribution over topics, and each topic is a distribution over words. We compare the topic-sentence assignments discovered by a topic model to gold-standard assignments that were manually annotated on a set of closely related pairs of news articles. We observe a clear correspondence between automatically identified and annotated topics. The high accuracy of automatically discovered topic-sentence assignments suggests that topic models can be utilized to identify (sub-)sentential semantic content units.
  • Keywords
    content management; data mining; text analysis; document content modeling; gold standard assignment; multidocument summarization; probabilistic topic model; sentence level semantic content units identification; sentence level topics learning; Analytical models; Humans; Petroleum; Probabilistic logic; Resource management; Semantics; Storage tanks; latent dirichlet allocation; text summarization; topic models
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database and Expert Systems Applications (DEXA), 2010 Workshop on
  • Conference_Location
    Bilbao
  • ISSN
    1529-4188
  • Print_ISBN
    978-1-4244-8049-4
  • Type
    conf
  • DOI
    10.1109/DEXA.2010.33
  • Filename
    5592003