• DocumentCode
    751080
  • Title

    Text classification without negative examples revisit

  • Author

    Fung, Gabriel Pui Cheong ; Yu, Jeffrey X. ; Lu, Hongjun ; Yu, Philip S.

  • Author_Institution
    Dept. of Syst. Eng. & Eng. Manage., Chinese Univ. of Hong Kong, China
  • Volume
    18
  • Issue
    1
  • fYear
    2006
  • Firstpage
    6
  • Lastpage
    20
  • Abstract
    Traditionally, building a classifier requires two sets of examples: positive examples and negative examples. This paper studies the problem of building a text classifier using positive examples (P) and unlabeled examples (U). The unlabeled set contains a mixture of positive and negative examples. Since no negative example is given explicitly, the task of building a reliable text classifier becomes far more challenging. Simply treating all of the unlabeled examples as negative examples and building a classifier thereafter is undoubtedly a poor approach to tackling this problem. Generally speaking, most studies have solved this problem with a two-step heuristic: first, extract negative examples (N) from U; second, build a classifier based on P and N. Surprisingly, most studies did not try to extract positive examples from U. Intuitively, enlarging P by P´ (positive examples extracted from U) and building a classifier thereafter should enhance the effectiveness of the classifier. Throughout our study, we find that extracting P´ is very difficult. A document in U that possesses the features exhibited in P is not necessarily a positive example, and vice versa. The very large size of, and very high diversity in, U also contribute to the difficulty of extracting P´. In this paper, we propose a labeling heuristic called PNLH to tackle this problem. PNLH aims at extracting high-quality positive examples and negative examples from U and can be used on top of any existing classifier. Extensive experiments based on several benchmarks are conducted. The results indicate that PNLH is highly feasible, especially when |P| is extremely small.
  • Keywords
    data mining; learning (artificial intelligence); pattern classification; text analysis; labeling heuristic; partially supervised learning; text categorization; text classification; unlabeled example; Computer Society; Labeling; Supervised learning; labeling unlabeled data
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2006.16
  • Filename
    1549824
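
The generic two-step heuristic described in the abstract (extract negatives N from U, then build a classifier from P and N) can be illustrated with a minimal sketch. This is not PNLH, whose details are not given in this record. The function names, the bag-of-words representation, the cosine-similarity rule for picking negatives, the `threshold` value, and the centroid-based classifier are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (illustrative representation, an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of vectors; cosine ignores scale, so no averaging is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def extract_negatives(P, U, threshold=0.1):
    """Step 1 (assumed rule): unlabeled documents whose similarity to the
    centroid of P falls below `threshold` are treated as negatives."""
    p_centroid = centroid(vectorize(d) for d in P)
    return [d for d in U if cosine(vectorize(d), p_centroid) < threshold]


def train(P, N):
    """Step 2: build a simple centroid classifier from P and N."""
    return centroid(vectorize(d) for d in P), centroid(vectorize(d) for d in N)


def predict(model, doc):
    """Return True if doc is closer to the positive centroid than the negative."""
    p_centroid, n_centroid = model
    v = vectorize(doc)
    return cosine(v, p_centroid) > cosine(v, n_centroid)
```

A typical call sequence would be `N = extract_negatives(P, U)` followed by `model = train(P, N)` and `predict(model, doc)`. Documents in U that are neither clearly negative nor known positive are simply left unused here, which is the gap the abstract says extracting P´ is meant to address.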