• DocumentCode
    248401
  • Title

    An Approach for Document Pre-processing and K Means Algorithm Implementation

  • Author

    Gowtham, S. ; Goswami, Mausumi ; Balachandran, Krishna ; Purkayastha, B.S.

  • Author_Institution
    Fac. of Eng., Christ Univ., Bangalore, India
  • fYear
    2014
  • fDate
    27-29 Aug. 2014
  • Firstpage
    162
  • Lastpage
    166
  • Abstract
    Web mining is a cutting-edge technology that covers gathering and classifying information on the web. This paper presents an approach to document pre-processing: keywords are extracted from documents fetched from the web and processed to generate a term-document matrix, and TF-IDF (term frequency-inverse document frequency) weights are computed for each document using several TF-IDF variants. The last step clusters these results with the K Means algorithm, and the performance of each approach is compared. The algorithm is implemented on an x64 architecture in Java and Matlab. The results are tabulated.
  • Keywords
    Internet; Java; classification; data mining; document handling; pattern clustering; Matlab platform; TF-IDF; Web mining; World Wide Web; X64 architecture; cutting edge technology; document preprocessing; information classification; information gathering; k means algorithm implementation; term frequency inverse document frequency; term-document matrix; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Information retrieval; MATLAB; K Means clustering; Stop words; augmented; frequency; logarithmic; stemming;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advances in Computing and Communications (ICACC), 2014 Fourth International Conference on
  • Conference_Location
    Cochin
  • Print_ISBN
    978-1-4799-4364-7
  • Type

    conf

  • DOI
    10.1109/ICACC.2014.46
  • Filename
    6906015
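
The pipeline summarized in the abstract (keyword extraction, term-document matrix, TF-IDF variants, K Means) can be illustrated with a minimal sketch. This is not the authors' Java/Matlab implementation: the stop-word list, the sample documents, every function name, and the choice of seeding the centroids with the first k documents are assumptions made for illustration. The three term-frequency schemes (raw, logarithmic, augmented) follow the standard textbook definitions suggested by the record's keywords; stemming is omitted.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "on", "in", "to"}

# Made-up sample documents, used only to exercise the sketch.
DOCS = [
    "Java code compiles Java classes",
    "Clustering groups similar documents",
    "Java classes and Java code",
    "Similar documents form clustering groups",
]


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def tf_weight(count, max_count, scheme):
    """Term-frequency variants: raw count, logarithmic, augmented."""
    if count == 0:
        return 0.0
    if scheme == "raw":
        return float(count)
    if scheme == "log":
        return 1.0 + math.log(count)
    if scheme == "augmented":
        return 0.5 + 0.5 * count / max_count
    raise ValueError(f"unknown scheme: {scheme}")


def tfidf_matrix(docs, scheme="raw"):
    """Return (vocab, rows); rows[i][j] is the weight of term j in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n = len(docs)
    # idf(t) = log(N / number of documents containing t)
    idf = {t: math.log(n / sum(1 for c in counts if t in c)) for t in vocab}
    rows = []
    for c in counts:
        max_count = max(c.values(), default=1)
        rows.append([tf_weight(c[t], max_count, scheme) * idf[t] for t in vocab])
    return vocab, rows


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=100):
    """Lloyd's iteration; the first k rows seed the centroids (a simplification)."""
    centroids = [list(r) for r in rows[:k]]
    labels = [-1] * len(rows)
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(r, centroids[j]))
            for r in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    for scheme in ("raw", "log", "augmented"):
        _, rows = tfidf_matrix(DOCS, scheme)
        print(scheme, kmeans(rows, k=2))
```

Seeding with the first k rows keeps the run deterministic for a small demonstration; random or k-means++ seeding is the more common choice in practice, and comparing cluster assignments across the three schemes mirrors the kind of comparison the abstract describes.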