Text classification without negative examples revisit

Author

Fung, Gabriel Pui Cheong ; Yu, Jeffrey X. ; Lu, Hongjun ; Yu, Philip S.

Author_Institution

Dept. of Syst. Eng. & Eng. Manage., Chinese Univ. of Hong Kong, China

Volume

18

Issue

1

fYear

2006

Firstpage

6

Lastpage

20

Abstract

Traditionally, building a classifier requires two sets of examples: positive examples and negative examples. This paper studies the problem of building a text classifier using only positive examples (P) and unlabeled examples (U). The unlabeled set is a mixture of positive and negative examples. Since no negative example is given explicitly, building a reliable text classifier becomes far more challenging. Simply treating all of the unlabeled examples as negative and building a classifier on that basis is a poor approach to this problem. Most studies solve the problem with a two-step heuristic: first, extract negative examples (N) from U; second, build a classifier based on P and N. Surprisingly, most studies do not try to extract positive examples from U. Intuitively, enlarging P by P' (positive examples extracted from U) before building the classifier should improve its effectiveness. In our study, we find that extracting P' is very difficult. A document in U that possesses the features exhibited in P is not necessarily a positive example, and a document that lacks them is not necessarily a negative one. The very large size of U and its very high diversity add to the difficulty of extracting P'. In this paper, we propose a labeling heuristic called PNLH to tackle this problem. PNLH aims at extracting high-quality positive and negative examples from U and can be used on top of any existing classifier. Extensive experiments are conducted on several benchmarks. The results indicate that PNLH is highly feasible, especially when |P| is extremely small.
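
The generic two-step heuristic mentioned in the abstract (extract N from U, then build a classifier on P and N) can be illustrated with a minimal Python sketch. This is not PNLH and is not taken from the paper: the cosine-similarity ranking, the nearest-centroid classifier, the `fraction` parameter, and all function names are assumptions chosen only to keep the illustration short.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector for a whitespace-tokenized document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Average of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def extract_negatives(positive, unlabeled, fraction=0.5):
    """Step 1 (assumed rule): take the unlabeled documents least similar to P as N."""
    p_centroid = centroid([vectorize(d) for d in positive])
    ranked = sorted(unlabeled, key=lambda d: cosine(vectorize(d), p_centroid))
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def build_classifier(positive, negative):
    """Step 2 (assumed model): nearest-centroid classifier trained on P and N."""
    p_centroid = centroid([vectorize(d) for d in positive])
    n_centroid = centroid([vectorize(d) for d in negative])

    def classify(doc):
        # True means "predicted positive".
        vec = vectorize(doc)
        return cosine(vec, p_centroid) > cosine(vec, n_centroid)

    return classify
```

Calling `extract_negatives(P, U)` followed by `build_classifier(P, N)` yields a function that returns `True` for documents closer to the positive centroid. The sketch deliberately omits the extraction of additional positive examples (P') from U, which the abstract identifies as the hard part.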

Keywords

data mining; learning (artificial intelligence); pattern classification; text analysis; labeling heuristic; partially supervised learning; text categorization; text classification; unlabeled example; Computer Society; labeling; supervised learning; labeling unlabeled data

fLanguage

English

Journal_Title

Knowledge and Data Engineering, IEEE Transactions on

Publisher

IEEE

ISSN

1041-4347

Type

jour

DOI

10.1109/TKDE.2006.16

Filename

1549824