DocumentCode :
2448695
Title :
A comparative study of centroid-based, neighborhood-based and statistical approaches for effective document categorization
Author :
Tam, Vincent ; Santoso, Ardi ; Setiono, Rudy
Author_Institution :
Dept. of Electr. & Electron. Eng., Hong Kong Univ., China
Volume :
4
fYear :
2002
fDate :
2002
Firstpage :
235
Abstract :
Associating documents with relevant categories is critical for effective document retrieval. Here, we compare the well-known k-nearest neighborhood (kNN) algorithm, the centroid-based classifier and the highest average similarity over retrieved documents (HASRD) algorithm for effective document categorization. We use various measures, such as the micro and macro F1 values, to evaluate their performance on the Reuters-21578 corpus. The empirical results show that kNN performs the best, followed by our adapted HASRD and the centroid-based classifier, for common document categories, while the centroid-based classifier and kNN outperform our adapted HASRD for rare document categories. Additionally, our study clearly indicates that each classifier performs optimally only when a suitable term weighting scheme is used. All these significant results lead to many exciting directions for future exploration.
Keywords :
classification; information retrieval; statistical analysis; HASRD algorithm; Reuters-21578 corpus; centroid-based document categorization; document retrieval; highest average similarity algorithm; k-nearest neighborhood algorithm; kNN algorithm; macro F1 values; micro F1 values; neighborhood-based document categorization; optimal classification; statistical document categorization; term weighting scheme; Bayesian methods; Extraterrestrial measurements; Frequency; Information retrieval; Internet; Nearest neighbor searches; Performance analysis; Software libraries; Statistical analysis; Testing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Pattern Recognition, 2002. Proceedings. 16th International Conference on
ISSN :
1051-4651
Print_ISBN :
0-7695-1695-X
Type :
conf
DOI :
10.1109/ICPR.2002.1047440
Filename :
1047440