DocumentCode :
2042101
Title :
A feature selection method for document clustering based on part-of-speech and word co-occurrence
Author :
Liu, Zitao ; Yu, Wenchao ; Deng, Yalan ; Wang, Yongtao ; Bian, Zhiqi
Author_Institution :
Int. Sch. of Software, Wuhan Univ., Wuhan, China
Volume :
5
fYear :
2010
fDate :
10-12 Aug. 2010
Firstpage :
2331
Lastpage :
2334
Abstract :
Feature selection is the process of choosing a subset of the original feature set according to some rules. The selected features retain their original physical meaning and provide a better understanding of the data and the learning process. However, few modern feature selection approaches take advantage of features' context information. Based on this analysis, we propose a novel feature selection method based on part-of-speech and word co-occurrence. According to the components of Chinese document text, we use the words' part-of-speech attributes to filter out many meaningless terms. We then define co-occurrence words by their part-of-speech and use them to select features. In the evaluation, we run experiments on the text corpus from Sogou Lab and use Entropy and Precision as criteria to give an objective evaluation of document clustering performance. The results show that our method selects better features and achieves better clustering performance.
Keywords :
feature extraction; pattern clustering; speech synthesis; text analysis; unsupervised learning; word processing; Chinese document; Sogou lab; context information; document clustering; feature selection method; learning process; part of speech; text corpus; word co-occurrence; Context; Educational institutions; Entropy; Feature extraction; Machine learning; Software; Speech; document clustering; feature selection; part-of-speech; word co-occurrence;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on
Conference_Location :
Yantai, Shandong
Print_ISBN :
978-1-4244-5931-5
Type :
conf
DOI :
10.1109/FSKD.2010.5569827
Filename :
5569827