• DocumentCode
    3105152
  • Title

    Subjectivity Categorization of Weblog with Part-of-Speech Based Smoothing

  • Author

    Huang, Shen ; Sun, Jian-Tao ; Wang, Xuanhui ; Zeng, Hua-Jun ; Chen, Zheng

  • Author_Institution
    Microsoft Res. Asia, Beijing
  • fYear
    2006
  • fDate
    18-22 Dec. 2006
  • Firstpage
    285
  • Lastpage
    294
  • Abstract
    Experts from different domains try to mine users' comments on Weblogs for different reasons such as politics or commerce. All these needs necessitate automatically distinguishing subjective Weblog contents from objective ones, namely subjectivity categorization. Since Weblogs contain various topics from different domains, limited training data can hardly cover all the topics, and "unseen words" become a serious problem for categorization tasks. In this paper, part-of-speech (POS) based smoothing is proposed to alleviate the "unseen words" problem. In conjunction with a naive Bayes model constructed from limited training data, the probability of an unseen word in a new domain can be well smoothed by the probability of its POS result. Empirical studies on five datasets show that our approach consistently outperforms the basic naive Bayes with Laplace smoothing. In a cross-domain experiment, our approach achieves a 22.0% improvement in Macro F1 and 24.4% in Micro F1 over basic naive Bayes. These results verify that POS based smoothing can indeed benefit subjectivity categorization, especially in cases with a large number of unseen words.
  • Keywords
    Bayes methods; Web sites; classification; smoothing methods; speech processing; Laplace smoothing; Weblogs; naive Bayes model; part-of-speech based smoothing; subjective Weblog contents; subjectivity categorization; training data; unseen words problem; Asia; Cameras; Data mining; Information services; Internet; Mood; Neural networks; Smoothing methods; Training data; Web sites
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2006. ICDM '06. Sixth International Conference on
  • Conference_Location
    Hong Kong
  • ISSN
    1550-4786
  • Print_ISBN
    0-7695-2701-7
  • Type

    conf

  • DOI
    10.1109/ICDM.2006.156
  • Filename
    4053056
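
The abstract describes smoothing the probability of an unseen word by the probability of its POS result, on top of a naive Bayes model. The record does not give the paper's formulas, so the following is only a minimal sketch of that general idea under stated assumptions: documents arrive already tagged as (word, tag) pairs, words seen in training use Laplace-smoothed word counts, and unseen words back off to Laplace-smoothed per-class tag counts. The class name `PosSmoothedNaiveBayes`, the back-off rule, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


class PosSmoothedNaiveBayes:
    """Naive Bayes sketch that backs off to POS-tag counts for unseen words.

    Assumption: this back-off rule is illustrative, not the paper's estimator.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing constant

    def fit(self, docs, labels):
        # docs: list of documents, each a list of (word, pos_tag) pairs.
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.tag_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            for word, tag in doc:
                self.word_counts[label][word] += 1
                self.tag_counts[label][tag] += 1
        self.vocab = set().union(*self.word_counts.values())
        self.tags = set().union(*self.tag_counts.values())
        self.tokens_per_class = {
            c: sum(self.word_counts[c].values()) for c in self.classes
        }
        return self

    def _log_token_prob(self, word, tag, c):
        total = self.tokens_per_class[c]
        if word in self.vocab:
            # Seen word: ordinary Laplace-smoothed word likelihood.
            count, n_types = self.word_counts[c][word], len(self.vocab)
        else:
            # Unseen word: use the class-conditional likelihood of its POS tag.
            # The extra +1 reserves mass for a tag never seen in training.
            count, n_types = self.tag_counts[c][tag], len(self.tags) + 1
        return math.log((count + self.alpha) / (total + self.alpha * n_types))

    def predict(self, doc):
        n_docs = sum(self.doc_counts.values())

        def score(c):
            prior = math.log(self.doc_counts[c] / n_docs)
            return prior + sum(self._log_token_prob(w, t, c) for w, t in doc)

        return max(self.classes, key=score)


# Toy, hand-tagged training data (hypothetical, for illustration only).
train_docs = [
    [("i", "PRP"), ("love", "VBP"), ("this", "DT"), ("great", "JJ"), ("phone", "NN")],
    [("awful", "JJ"), ("terrible", "JJ"), ("service", "NN")],
    [("the", "DT"), ("phone", "NN"), ("costs", "VBZ"), ("200", "CD"), ("dollars", "NNS")],
    [("store", "NN"), ("opens", "VBZ"), ("at", "IN"), ("9", "CD")],
]
train_labels = ["subjective", "subjective", "objective", "objective"]

model = PosSmoothedNaiveBayes(alpha=1.0).fit(train_docs, train_labels)
print(model.predict([("superb", "JJ"), ("gorgeous", "JJ")]))
```

With plain Laplace smoothing, every unseen word gets one constant probability per class that ignores the word itself, so a document made of unseen words is decided by the prior and the class sizes rather than by its content. In this sketch the tag still carries signal, for example adjectives being more frequent in the subjective class of the toy data. Note that mixing word-level and tag-level estimates does not yield a properly normalized distribution; a real implementation would need to handle that, and would obtain tags from an actual POS tagger.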