Title :
Text Document Preprocessing and Dimension Reduction Techniques for Text Document Clustering
Author :
Ammar Ismael Kadhim;Yu-N Cheah;Nurul Hashimah Ahamed
Author_Institution :
Sch. of Comput. Sci., Univ. Sains Malaysia, Minden, Malaysia
Abstract :
Text mining is, broadly, the process of extracting interesting (non-trivial) features and knowledge from unstructured text documents. It is an interdisciplinary field that draws on information retrieval, data mining, machine learning, parameter statistics and computational linguistics. Standard text mining and information retrieval techniques for text documents usually rely on similar categories. An alternative method of retrieving information is to cluster documents, for which the text must be preprocessed. The preprocessing steps have a large effect on how successfully knowledge can be extracted. This study implements TF-IDF and singular value decomposition (SVD) dimensionality reduction techniques. The proposed system presents effective preprocessing and dimensionality reduction techniques that support document clustering with the k-means algorithm. Finally, experimental results show that the proposed method improves the performance of English text document clustering, and simulation results on the BBC news and BBC sport datasets show the superiority of the proposed algorithm.
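The pipeline named in the abstract (preprocessing, TF-IDF weighting, k-means clustering) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the helper names (`preprocess`, `tfidf`, `kmeans`), the tiny `STOPWORDS` set, the four sample sentences, the smoothed IDF formula and the fixed seed documents are all assumptions made for illustration. The SVD reduction step is left out here to keep the sketch within the standard library; in practice it would be applied to the TF-IDF vectors before clustering.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "in", "a", "of"}

def preprocess(text):
    """Lowercase, tokenise on letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for tokenised docs."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Smoothed IDF variant (an assumption; the paper's exact formula is not given here).
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        v = [tf[t] / len(d) * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vocab, vectors

def kmeans(vectors, seeds, iters=20):
    """Plain Lloyd's k-means; `seeds` are indices of the initial centroids."""
    centroids = [list(vectors[i]) for i in seeds]
    labels = None
    for _ in range(iters):
        # Assign each vector to the nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up sample documents: two about sport, two about business.
texts = [
    "The team won the football match",
    "The striker scored in the football match",
    "The bank raised interest rates",
    "Interest rates hit the bank shares",
]
docs = [preprocess(t) for t in texts]
vocab, vectors = tfidf(docs)
labels = kmeans(vectors, seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Seeding with fixed document indices keeps the toy run deterministic; a practical system would more likely use random restarts or a k-means++ style initialisation.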
Keywords :
"Clustering algorithms","Text mining","Algorithm design and analysis","Data models","Singular value decomposition","Indexing"
Conference_Title :
Artificial Intelligence with Applications in Engineering and Technology (ICAIET), 2014 4th International Conference on
DOI :
10.1109/ICAIET.2014.21