Clustering Efficient Method on Mass Chinese Text Based on Semantic Concept

Author

Jinling, Liu ; Hong, Zhou

Author_Institution

Comput. Eng. Fac., Huaiyin Inst. of Technol., Huaian, China

Volume

2

fYear

2010

fDate

16-18 July 2010

Firstpage

151

Lastpage

155

Abstract

Most current Chinese text clustering algorithms are limited by the scalability of the data and the interpretability of the results. This paper presents an efficient Chinese text clustering method based on semantic concepts. Working from the text itself, the method uses the classified hierarchy of subject words in the Thesaurus of Modern Chinese to extract conceptional tuples from a high-dimensional collection of text vectors, forming high-level concepts that express the clustering results. The samples are then partitioned according to these high-level concepts, which completes the text clustering process. While preserving the accuracy of the clustering results, the method greatly reduces the amount of data that must be processed and improves the scalability of the clustering algorithm. The experimental results show that the algorithm achieves satisfactory clustering results and higher execution efficiency.
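
The idea outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no algorithmic detail, so everything here is an assumption. The toy dictionary `TOY_THESAURUS`, its dotted category codes, the helper names `concept_of`, `concept_tuple` and `cluster_by_concept`, and the parameters `level` and `top_k` are all hypothetical. The sketch assumes each document is already segmented into words, maps each word to a category at a chosen depth of the hierarchy, keeps the most frequent categories as the document's concept tuple, and groups documents that share a tuple.

```python
from collections import Counter, defaultdict

# Hypothetical toy thesaurus: word -> hierarchical category code.
# These codes are invented for illustration and are NOT taken from
# the Thesaurus of Modern Chinese.
TOY_THESAURUS = {
    "足球": "B.sport.ball",            # football
    "篮球": "B.sport.ball",            # basketball
    "游泳": "B.sport.water",           # swimming
    "股票": "D.finance.market",        # stock
    "债券": "D.finance.market",        # bond
    "银行": "D.finance.institution",   # bank
}


def concept_of(word, level=2):
    """Return the category code of `word` truncated to `level` components,
    or None if the word is not in the toy thesaurus."""
    code = TOY_THESAURUS.get(word)
    if code is None:
        return None
    return ".".join(code.split(".")[:level])


def concept_tuple(tokens, level=2, top_k=1):
    """Reduce a segmented document to its `top_k` most frequent concepts."""
    concepts = (concept_of(t, level) for t in tokens)
    counts = Counter(c for c in concepts if c is not None)
    # Rank by descending frequency, breaking ties by name for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return tuple(sorted(name for name, _ in ranked[:top_k]))


def cluster_by_concept(docs, level=2, top_k=1):
    """Group document ids that share the same concept tuple."""
    clusters = defaultdict(list)
    for doc_id, tokens in docs.items():
        clusters[concept_tuple(tokens, level, top_k)].append(doc_id)
    return dict(clusters)


# Documents are assumed to be segmented into words already.
docs = {
    "d1": ["足球", "篮球", "银行"],
    "d2": ["游泳", "足球"],
    "d3": ["股票", "债券", "银行"],
    "d4": ["天气"],  # weather: not in the toy thesaurus
}
clusters = cluster_by_concept(docs)
for key, members in clusters.items():
    print(key, members)
```

Because each document is reduced to a short tuple of concept labels before grouping, the grouping step works on far fewer dimensions than the original word vectors, and each cluster is labeled by a readable concept. Documents with no thesaurus match fall into the cluster keyed by the empty tuple in this sketch.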

Keywords

natural language processing; pattern clustering; text analysis; Chinese text clustering; classified hierarchy subject word; clustering efficient method; high-dimensional text vector collection; mass Chinese text; modern Chinese thesaurus; semantic concept; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Clustering methods; Dictionaries; Semantics; Thesauri; Chinese text; classified dictionary; clustering; conceptional tuple; semantic

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Technology and Applications (IFITA), 2010 International Forum on

Conference_Location

Kunming

Print_ISBN

978-1-4244-7621-3

Electronic_ISBN

978-1-4244-7622-0

Type

conf

DOI

10.1109/IFITA.2010.77

Filename

5634880