DocumentCode :
2816581
Title :
A new framework for uncertainty sampling: exploiting uncertain and positive-certain examples in similarity-based text classification
Author :
Lee, Kang H. ; Kang, Byeong H.
Author_Institution :
Sch. of Inf. Technol., Sydney Univ., NSW, Australia
Volume :
2
fYear :
2004
fDate :
5-7 April 2004
Firstpage :
474
Abstract :
A major concern with supervised learning approaches to text classification is that they require a large number of labeled examples to achieve a high level of effectiveness, and labeling that many examples places a considerable burden on human experts. Two common approaches to reducing the number of labeled examples required are: (1) selecting informative uncertain examples for human labeling and (2) using a large amount of inexpensive unlabeled data together with a small number of labeled examples. While previous work in text classification focused on only one of these approaches, we investigate a framework that combines both in similarity-based text classification. By applying our new thresholding strategy (RinSCut) to uncertainty sampling, we propose a framework that automatically selects informative uncertain data to be presented to a human expert for labeling, and positive-certain data that are used directly for learning without human labeling. Experiments were conducted on the Reuters-21578 data set with our similarity-based learning algorithm (KAN). The proposed scheme was compared with random sampling and with conventional uncertainty sampling, based on micro- and macroaveraged F1. The results suggest that when both macro- and microaveraged measures are of concern, our framework may be the best choice.
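The abstract compares methods on both micro- and macroaveraged F1, and the two can disagree when category sizes are uneven. The following is a minimal sketch of the standard definitions only, not code from the paper; the category names and counts are hypothetical illustration values.

```python
from typing import Dict, Tuple

# Per-category counts: (true positives, false positives, false negatives).
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_category: Dict[str, Counts]) -> float:
    """Pool the counts over all categories, then compute a single F1."""
    tp = sum(c[0] for c in per_category.values())
    fp = sum(c[1] for c in per_category.values())
    fn = sum(c[2] for c in per_category.values())
    return f1(tp, fp, fn)


def macro_f1(per_category: Dict[str, Counts]) -> float:
    """Compute F1 per category, then take the unweighted mean."""
    scores = [f1(*c) for c in per_category.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical counts: one frequent category and one rare category.
counts = {"frequent": (90, 10, 10), "rare": (1, 1, 3)}
print(f"micro-F1 = {micro_f1(counts):.3f}")
print(f"macro-F1 = {macro_f1(counts):.3f}")
```

Microaveraging is dominated by the frequent category, while macroaveraging weights every category equally, so a method that does poorly on rare categories loses more on the macroaveraged score.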
Keywords :
learning by example; pattern classification; text analysis; uncertainty handling; Reuters-21578 data set; RinSCut thresholding strategy; human experts; human-labeling; inexpensive unlabeled data; informative uncertain examples; labeled examples; positive-certain examples; similarity-based learning algorithm; similarity-based text classification; supervised learning; uncertainty sampling; Australia; Humans; Information technology; Labeling; Machine learning; Natural languages; Sampling methods; Supervised learning; Text categorization; Uncertainty;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on
Print_ISBN :
0-7695-2108-8
Type :
conf
DOI :
10.1109/ITCC.2004.1286699
Filename :
1286699