Title :
A New Method of Training Sample Selection in Text Classification
Author :
Liao, Yixing ; Pan, Xuezeng
Author_Institution :
Dept. of Comput. Sci. & Technol., Zhejiang Univ., Hangzhou, China
Abstract :
To address noise samples in the training dataset, this paper proposes a new method for reducing the size of the training dataset in text classification. The method describes the distribution of the training dataset according to the representativeness score of each sample within the class it belongs to, thereby exposing the representative samples and the noise samples in each class. The method is applied to the Chinese text dataset provided by the Fudan Database Center. Experiments show that the proposed method effectively reduces noise samples, improves classification performance, and decreases computational cost.
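The abstract does not define the representativeness score, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes the score of a sample is the cosine similarity between its term-frequency vector and the term-frequency centroid of its class, and that the lowest-scoring samples in each class are treated as noise and dropped. The names `cosine`, `select_samples`, and `keep_ratio` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_samples(samples, keep_ratio=0.75):
    """Keep the most representative samples of each class.

    samples: list of (tokens, label) pairs, where tokens is a list of str.
    keep_ratio: fraction of each class to keep (an assumed knob, not from the paper).
    """
    # Group the token lists by class label.
    by_class = {}
    for tokens, label in samples:
        by_class.setdefault(label, []).append(tokens)

    kept = []
    for label, docs in by_class.items():
        # Assumed class profile: summed term frequencies of all class members.
        centroid = Counter()
        for doc in docs:
            centroid.update(doc)

        # Assumed representativeness score: similarity to the class centroid.
        ranked = sorted(
            docs,
            key=lambda doc: cosine(Counter(doc), centroid),
            reverse=True,
        )

        # Drop the lowest-scoring tail, treating it as noise.
        n_keep = max(1, math.ceil(keep_ratio * len(ranked)))
        kept.extend((doc, label) for doc in ranked[:n_keep])
    return kept
```

Calling `select_samples(samples, keep_ratio=0.75)` returns the reduced training set. In practice the threshold and the scoring function would need to be tuned on held-out data, since an aggressive cut can also remove legitimate but atypical samples.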
Keywords :
classification; natural language processing; text analysis; noise samples reduction; text classification; training dataset distribution; training sample selection; Computational efficiency; Computer science; Educational technology; Frequency; Iterative methods; Mutual information; Noise reduction; Paper technology; Probability; Text categorization; representativeness score; training dataset selection
Conference_Title :
Education Technology and Computer Science (ETCS), 2010 Second International Workshop on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-6388-6
Electronic_ISBN :
978-1-4244-6389-3
DOI :
10.1109/ETCS.2010.621