DocumentCode :
3120634
Title :
Semantic feature reduction in Chinese document clustering
Author :
Meng, Xianjun ; Chen, Qingcai ; Wang, Xiaolong
Author_Institution :
Shenzhen Grad. Sch., Harbin Inst. of Technol. Shenzhen, Harbin
fYear :
2008
fDate :
12-15 Oct. 2008
Firstpage :
3721
Lastpage :
3726
Abstract :
Text clustering techniques are commonly used to organize text documents into topic-related groups, which helps users gain a comprehensive understanding of a corpus or of the results returned by an information retrieval system. Most existing text clustering algorithms, which are derived from traditional clustering of formatted data, rely heavily on term analysis methods and adopt the vector space model (VSM) as their document representation. However, because of essential characteristics of text such as the high dimensionality of the feature vector space, sparseness has a strong impact on the clustering algorithm. Feature reduction is therefore an important preprocessing step for improving the efficiency and accuracy of a clustering algorithm by removing redundant and irrelevant terms from the corpus. Although clustering is considered an unsupervised learning method, text still offers some prior knowledge that can be obtained through NLP-based analysis. In this paper, we propose a feature reduction method based on semantic analysis for use in Chinese text clustering. Our method is based on a dedicated selection of part-of-speech tags and on synonym consolidation, and it reduces the feature space of documents more effectively than the traditional feature reduction methods of tf-idf and stopword removal, while preserving or sometimes even improving the accuracy of the clustering algorithm. In our experiments, we tested our feature reduction method with the bisecting k-means algorithm, which has been shown to be efficient in text clustering. The results show that our method reduces the feature space significantly while achieving better clustering accuracy in terms of purity.
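The two reduction steps and the purity measure named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the kept tag set (`KEPT_TAGS`), the synonym table (`SYNONYMS`), and the function names are hypothetical placeholders, and the input is assumed to be already segmented and POS-tagged.

```python
from collections import Counter

# Hypothetical tag set and synonym table, for illustration only;
# the abstract does not list the tags or synonym resource actually used.
KEPT_TAGS = {"n", "v"}
SYNONYMS = {"电脑": "计算机"}  # map a synonym to one canonical term


def reduce_features(tagged_tokens, kept_tags=KEPT_TAGS, synonyms=SYNONYMS):
    """Keep tokens whose POS tag is selected and merge synonyms into one term."""
    return Counter(
        synonyms.get(word, word)
        for word, tag in tagged_tokens
        if tag in kept_tags
    )


def purity(cluster_ids, class_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(class_labels)


# One pre-tagged toy document: (word, POS tag) pairs.
doc = [("电脑", "n"), ("的", "u"), ("计算机", "n"), ("运行", "v")]
features = reduce_features(doc)
score = purity([0, 0, 1, 1], ["a", "a", "a", "b"])
```

In this toy input the particle is dropped by the tag filter and the two synonymous nouns collapse into a single feature, which is the kind of dimensionality reduction the abstract describes. The reduced term counts would then feed a clustering step such as bisecting k-means.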
Keywords :
information retrieval; natural language processing; text analysis; unsupervised learning; Chinese document clustering; document representation; information retrieval system; natural language processing; part-of-speech tags; semantic feature reduction; term analysis; text clustering; unsupervised learning; vector space model; Algorithm design and analysis; Clustering algorithms; Clustering methods; Functional analysis; Information retrieval; Navigation; Search engines; Testing; Text mining; Unsupervised learning; feature selection; part-of-speech; synonym; text clustering;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Systems, Man and Cybernetics, 2008. SMC 2008. IEEE International Conference on
Conference_Location :
Singapore
ISSN :
1062-922X
Print_ISBN :
978-1-4244-2383-5
Electronic_ISBN :
1062-922X
Type :
conf
DOI :
10.1109/ICSMC.2008.4811878
Filename :
4811878