Title :
An Approach for Document Pre-processing and K Means Algorithm Implementation
Author :
Gowtham, S. ; Goswami, Mausumi ; Balachandran, Krishna ; Purkayastha, B.S.
Author_Institution :
Fac. of Eng., Christ Univ., Bangalore, India
Abstract :
Web mining is a cutting-edge technology that covers gathering information from the web and classifying it. This paper presents the concepts of document pre-processing: keywords are extracted from documents fetched from the web and processed to generate a term-document matrix and TF-IDF (term frequency-inverse document frequency) weights, computed under several TF-IDF variants, for each document. The final step clusters these results with the K Means algorithm and compares the performance of each TF-IDF variant. The algorithm is implemented on an X64 architecture and coded in Java and MATLAB. The results are tabulated.
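The pipeline summarized in the abstract (keyword extraction, TF-IDF weighting, K Means clustering) might look roughly like the following minimal Java sketch. It is an illustration under stated assumptions, not the authors' implementation: the stop-word list is a tiny hypothetical one, stemming is omitted, the weighting shown is only the raw-count variant tf * ln(N / df), and K Means is seeded with the first k documents for reproducibility. The class and method names are invented for this example.

```java
import java.util.*;

public class TfIdfKMeansSketch {
    // Hypothetical, tiny stop-word list used only for illustration.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "is", "of", "and", "on"));

    // Lowercase, split on non-letters, drop stop words (no stemming in this sketch).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }

    // Alphabetically sorted vocabulary over all tokenized documents.
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (List<String> d : docs) vocab.addAll(d);
        return new ArrayList<>(vocab);
    }

    // Rows are documents, columns are terms; each entry is tf * ln(N / df).
    static double[][] tfIdf(List<List<String>> docs, List<String> vocab) {
        int n = docs.size(), m = vocab.size();
        double[][] w = new double[n][m];
        int[] df = new int[m];
        for (int d = 0; d < n; d++) {
            for (String t : docs.get(d)) w[d][vocab.indexOf(t)] += 1.0; // raw term count
            for (int j = 0; j < m; j++) if (w[d][j] > 0) df[j]++;       // document frequency
        }
        for (int d = 0; d < n; d++)
            for (int j = 0; j < m; j++)
                if (df[j] > 0) w[d][j] *= Math.log((double) n / df[j]);
        return w;
    }

    // Plain K Means with squared Euclidean distance.
    // Assumption: the first k rows serve as initial centroids (requires k <= x.length).
    static int[] kMeans(double[][] x, int k, int maxIter) {
        int n = x.length, m = x[0].length;
        double[][] c = new double[k][];
        for (int j = 0; j < k; j++) c[j] = x[j].clone();
        int[] assign = new int[n];
        for (int it = 0; it < maxIter; it++) {
            boolean changed = false;
            for (int i = 0; i < n; i++) {            // assignment step
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double dist = 0;
                    for (int t = 0; t < m; t++) {
                        double diff = x[i][t] - c[j][t];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = j; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed && it > 0) break;           // assignments are stable
            for (int j = 0; j < k; j++) {            // update step
                double[] sum = new double[m];
                int count = 0;
                for (int i = 0; i < n; i++) {
                    if (assign[i] != j) continue;
                    count++;
                    for (int t = 0; t < m; t++) sum[t] += x[i][t];
                }
                if (count > 0) {
                    for (int t = 0; t < m; t++) sum[t] /= count;
                    c[j] = sum;
                }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        String[] raw = {"The cat sat", "Stocks and bonds", "A cat and a kitten", "Bonds of the market"};
        List<List<String>> docs = new ArrayList<>();
        for (String s : raw) docs.add(tokenize(s));
        double[][] weights = tfIdf(docs, vocabulary(docs));
        System.out.println(Arrays.toString(kMeans(weights, 2, 10)));
    }
}
```

Swapping in another weighting, such as the logarithmic or augmented term frequency named in the keywords, would only change how the raw count is transformed inside tfIdf before the idf factor is applied; the clustering step stays the same, which is what makes a per-variant comparison straightforward.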
Keywords :
Internet; Java; classification; data mining; document handling; pattern clustering; Matlab platform; TF-IDF; Web mining; World Wide Web; X64 architecture; cutting edge technology; document preprocessing; information classification; information gathering; k means algorithm implementation; term frequency inverse document frequency; term-document matrix; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Data mining; Information retrieval; MATLAB; K Means clustering; Stop words; augmented; frequency; logarithmic; stemming; tf-idf;
Conference_Title :
Advances in Computing and Communications (ICACC), 2014 Fourth International Conference on
Conference_Location :
Cochin
Print_ISBN :
978-1-4799-4364-7
DOI :
10.1109/ICACC.2014.46