Title :
A new framework for uncertainty sampling: exploiting uncertain and positive-certain examples in similarity-based text classification
Author :
Lee, Kang H. ; Kang, Byeong H.
Author_Institution :
Sch. of Inf. Technol., Sydney Univ., NSW, Australia
Abstract :
One of the major concerns with supervised learning approaches to text classification is that they require a large number of labeled examples to achieve a high level of effectiveness. Labeling so many examples places a considerable burden on human experts. Two common approaches to reducing the number of labeled examples required are: (1) selecting informative uncertain examples for human labeling and (2) using a large amount of inexpensive unlabeled data together with a small number of labeled examples. While previous work in text classification focused on only one of these approaches, we investigate a framework that combines both in similarity-based text classification. By applying our new thresholding strategy (RinSCut) to uncertainty sampling, we propose a new framework that automatically selects informative uncertain data to be presented to a human expert for labeling and positive-certain data to be used directly for learning without human labeling. Experiments were conducted on the Reuters-21578 data set with our similarity-based learning algorithm (KAN). The proposed scheme was compared with random sampling and conventional uncertainty sampling on micro- and macro-averaged F1. The results show that when both macro- and micro-averaged measures are of concern, our framework may be the best choice.
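The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the RinSCut rule or the KAN scoring function, so everything below is an assumption made for illustration: the function name `partition_by_score`, the two fixed thresholds `lower` and `upper`, and the idea that each unlabeled document already has a similarity score for one category. It shows only the general shape of the idea, routing high-scoring documents to automatic use and borderline ones to a human expert, and should not be read as the authors' method.

```python
from typing import Dict, List, Tuple


def partition_by_score(
    scores: Dict[str, float],
    lower: float,
    upper: float,
) -> Tuple[List[str], List[str], List[str]]:
    """Split unlabeled documents into three groups by similarity score.

    Hypothetical rule (not taken from the paper):
      score >= upper          -> positive-certain, used without human labeling
      lower <= score < upper  -> uncertain, sent to a human expert
      score < lower           -> left in the unlabeled pool
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")

    positive_certain: List[str] = []
    uncertain: List[str] = []
    remaining: List[str] = []

    for doc_id, score in scores.items():
        if score >= upper:
            positive_certain.append(doc_id)
        elif score >= lower:
            uncertain.append(doc_id)
        else:
            remaining.append(doc_id)

    return positive_certain, uncertain, remaining


if __name__ == "__main__":
    # Made-up scores for one category, purely for demonstration.
    demo_scores = {"doc_a": 0.91, "doc_b": 0.48, "doc_c": 0.07}
    certain, ask_expert, pool = partition_by_score(demo_scores, lower=0.3, upper=0.7)
    print("positive-certain:", certain)
    print("uncertain:", ask_expert)
    print("remaining:", pool)
```

In this sketch the thresholds are plain constants chosen by the caller; how the actual framework derives them per category is not stated in the abstract.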
Keywords :
learning by example; pattern classification; text analysis; uncertainty handling; Reuters-21578 data set; RinSCut thresholding strategy; human experts; human-labeling; inexpensive unlabeled data; informative uncertain examples; labeled examples; positive-certain examples; similarity-based learning algorithm; similarity-based text classification; supervised learning; uncertainty sampling; Australia; Humans; Information technology; Labeling; Machine learning; Natural languages; Sampling methods; Supervised learning; Text categorization; Uncertainty;
Conference_Title :
Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on
Print_ISBN :
0-7695-2108-8
DOI :
10.1109/ITCC.2004.1286699