Regional Information Center for Science and Technology - A new framework for uncertainty sampling: exploiting uncertain and positive-certain examples in similarity-based text classification

DocumentCode :

2816581

Title :

A new framework for uncertainty sampling: exploiting uncertain and positive-certain examples in similarity-based text classification

Author :

Lee, Kang H. ; Kang, Byeong H.

Author_Institution :

Sch. of Inf. Technol., Sydney Univ., NSW, Australia

fYear :

2004

fDate :

5-7 April 2004

Firstpage :

474

Abstract :

One of the major concerns with supervised learning approaches to text classification is that they require a large number of labeled examples to achieve a high level of effectiveness. Labeling so many examples places a considerable burden on human experts. Two common approaches to reducing the number of labeled examples required are: (1) selecting informative uncertain examples for human labeling and (2) using large amounts of inexpensive unlabeled data together with a small number of labeled examples. While previous work in text classification focused on only one of these approaches, we investigate a framework that combines both in similarity-based text classification. By applying our new thresholding strategy (RinSCut) to uncertainty sampling, we propose a new framework that automatically selects informative uncertain data, which should be presented to a human expert for labeling, and positive-certain data, which are used directly for learning without human labeling. Experiments were conducted on the Reuters-21578 data set with our similarity-based learning algorithm (KAN). Our proposed scheme was compared with random sampling and with conventional uncertainty sampling, based on micro- and macroaveraged F₁. The results showed that when both macro- and microaveraged measures are of concern, the optimal choice might be our framework.
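The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the details of RinSCut or KAN, so this is not the authors' method: it assumes a single per-category similarity threshold with a symmetric ambiguity band around it, and the names `split_pool`, `threshold`, and `margin` are hypothetical. Documents scoring inside the band are treated as uncertain (to be sent to a human expert), and documents scoring above the band are treated as positive-certain (usable as training data without human labeling).

```python
from typing import Dict, List, Tuple


def split_pool(
    scores: Dict[str, float], threshold: float, margin: float
) -> Tuple[List[str], List[str]]:
    """Partition unlabeled documents for one category by similarity score.

    Hypothetical illustration only: a fixed band of width 2 * margin around
    the threshold stands in for whatever ambiguity zone the real
    thresholding strategy defines.
    """
    lower = threshold - margin
    upper = threshold + margin
    # Scores inside the band are too close to the threshold to trust:
    # these documents would be presented to a human expert for labeling.
    uncertain = [doc for doc, s in scores.items() if lower <= s <= upper]
    # Scores clearly above the band are taken as positive examples
    # without asking a human.
    positive_certain = [doc for doc, s in scores.items() if s > upper]
    return uncertain, positive_certain


# Made-up similarity scores for four unlabeled documents.
scores = {"d1": 0.92, "d2": 0.55, "d3": 0.48, "d4": 0.10}
uncertain, positive_certain = split_pool(scores, threshold=0.5, margin=0.1)
```

With these made-up scores, `d2` and `d3` fall inside the band and would go to the expert, `d1` would be added to the training set directly, and `d4` would be left in the unlabeled pool.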

Keywords :

learning by example; pattern classification; text analysis; uncertainty handling; Reuters-21578 data set; RinSCut thresholding strategy; human experts; human-labeling; inexpensive unlabeled data; informative uncertain examples; labeled examples; positive-certain examples; similarity-based learning algorithm; similarity-based text classification; supervised learning; uncertainty sampling; Australia; Humans; Information technology; Labeling; Machine learning; Natural languages; Sampling methods; Supervised learning; Text categorization; Uncertainty;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on

Print_ISBN :

0-7695-2108-8

Type :

conf

DOI :

10.1109/ITCC.2004.1286699

Filename :

1286699

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2816581