• DocumentCode
    2848165
  • Title

    Text Classification without Labeled Negative Documents

  • Author

    Fung, Gabriel Pui Cheong ; Yu, Jeffrey Xu ; Lu, Hongjun

  • Author_Institution
    Chinese Univ. of Hong Kong, China
  • fYear
    2005
  • fDate
    05-08 April 2005
  • Firstpage
    594
  • Lastpage
    605
  • Abstract
    This paper presents a new solution to the problem of building a text classifier from a small set of labeled positive documents (P) and a large set of unlabeled documents (U). The unlabeled set contains a mix of positive and negative documents; in other words, no document is labeled as negative. This makes building a reliable text classifier challenging. Existing approaches to this problem generally use two steps: i) extract negative documents (N) from U; and ii) build a classifier based on P and N. However, none of the reported studies tries to further extract positive documents (P′) from U. Intuitively, extracting P′ from U should increase the reliability of the classifier, but doing so is difficult. A document in U that possesses some of the features exhibited in P is not necessarily a positive document, and vice versa. Extracting positive documents is also delicate, because the extracted positive samples may become noise. The very large size of U and its very high diversity add to the difficulty. In this paper, we propose a partition-based heuristic that aims to extract both positive and negative documents from U. Extensive experiments are conducted on three benchmarks. The favorable results indicate that the proposed heuristic significantly outperforms all of the existing approaches, especially when the size of P is extremely small.
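    The conventional two-step approach mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's method: the function names, the `min_ratio` threshold, the rule "a document in U sharing no positive feature is treated as negative", and the centroid classifier are all hypothetical choices. The partition-based heuristic itself is not specified in the abstract and is not reproduced here.

```python
from collections import Counter
import math


def tokenize(text):
    # Naive whitespace tokenizer (assumption; real systems use better ones).
    return text.lower().split()


def positive_features(P, U, min_ratio=2.0):
    # Words whose document frequency rate in P is more than min_ratio
    # times their rate in U. Both the rule and the threshold are hypothetical.
    p_df = Counter(w for d in P for w in set(tokenize(d)))
    u_df = Counter(w for d in U for w in set(tokenize(d)))
    return {
        w for w, c in p_df.items()
        if c / len(P) > min_ratio * (u_df[w] / len(U))
    }


def extract_negatives(P, U):
    # Step i: treat documents in U that contain no positive feature as N.
    feats = positive_features(P, U)
    return [d for d in U if not feats & set(tokenize(d))]


def centroid(docs):
    # L2-normalized term-count vector summed over a set of documents.
    total = Counter()
    for d in docs:
        total.update(tokenize(d))
    norm = math.sqrt(sum(v * v for v in total.values())) or 1.0
    return {w: v / norm for w, v in total.items()}


def train(P, N):
    # Step ii: build a (very simple) classifier from P and N.
    return centroid(P), centroid(N)


def classify(model, doc):
    # Positive if the document is closer to the P centroid than to the N centroid.
    pos, neg = model
    counts = Counter(tokenize(doc))

    def score(c):
        return sum(c.get(w, 0.0) * v for w, v in counts.items())

    return score(pos) > score(neg)


# Toy data, invented for this example only.
P = ["stock market rises", "market shares fall"]
U = [
    "stock prices and market news",
    "football match tonight",
    "rain expected tomorrow",
    "team wins football cup",
]
N = extract_negatives(P, U)
model = train(P, N)
```

    Note that this sketch discards the document in U that resembles P ("stock prices and market news") rather than using it; the abstract's point is that recovering such documents as additional positives (P′) is the step existing approaches omit.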
  • Keywords
    learning (artificial intelligence); pattern classification; text analysis; labeled positive documents; negative documents; partition-based heuristic; text classification; Costs; Labeling; Text categorization; Text recognition
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on
  • ISSN
    1084-4627
  • Print_ISBN
    0-7695-2285-8
  • Type

    conf

  • DOI
    10.1109/ICDE.2005.139
  • Filename
    1410177