• DocumentCode
    3166291
  • Title
    Using Burstiness to Improve Clustering of Topics in News Streams
  • Author
    He, Qi; Chang, Kuiyu; Lim, Ee-Peng
  • Author_Institution
    Nanyang Technological University, Nanyang Avenue
  • fYear
    2007
  • fDate
    28-31 Oct. 2007
  • Firstpage
    493
  • Lastpage
    498
  • Abstract
    Specialists who analyze online news have a hard time separating the wheat from the chaff. Moreover, automatic data-mining techniques like clustering of news streams into topical groups can fully recover the underlying true class labels of data if and only if all classes are well separated. In reality, especially for news streams, this is clearly not the case. The question to ask is thus this: if we cannot recover the full C classes by clustering, what is the largest K < C clusters we can find that best resemble the K underlying classes? Using the intuition that bursty topics are more likely to correspond to important events that are of interest to analysts, we propose several new bursty vector space models (B-VSM) for representing a news document. B-VSM takes into account the burstiness (across the full corpus and whole duration) of each constituent word in a document at the time of publication. We benchmarked our B-VSM against the classical TFIDF-VSM on the task of clustering a collection of news stream articles with known topic labels. Experimental results show that B-VSM was able to find the burstiest clusters/topics. Further, it also significantly improved the recall and precision for the top K clusters/topics.
  • Keywords
    data mining; document handling; information resources; media streaming; automatic data mining; burstiness; bursty topics; bursty vector space model; news document representation; news stream article; news stream clustering; online news analysis; topic label; topics clustering; Clustering methods; Data engineering; Data mining; Functional analysis; Helium; Nominations and elections; Organizing; Telecommunication traffic;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
  • Conference_Location
    Omaha, NE
  • ISSN
    1550-4786
  • Print_ISBN
    978-0-7695-3018-5
  • Type
    conf
  • DOI
    10.1109/ICDM.2007.17
  • Filename
    4470279