• DocumentCode
    3582537
  • Title

    Feature extraction for co-occurrence-based cosine similarity score of text documents

  • Author

    Kadhim, Ammar Ismael ; Cheah, Yu-N. ; Ahamed, Nurul Hashimah ; Salman, Lubab A.

  • Author_Institution
    Sch. of Comput. Sci., Univ. Sains Malaysia, Minden, Malaysia
  • fYear
    2014
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    A major challenge in topic classification (TC) is the high dimensionality of the feature space. Feature extraction (FE) therefore plays a vital role in topic classification in particular and in text mining in general. FE based on a cosine similarity score is commonly used to reduce the dimensionality of datasets with tens or hundreds of thousands of features, which can otherwise be impossible to process further. In this study, TF-IDF term weighting is used to extract features. Selecting relevant features and determining how to encode them for a machine learning method have a large impact on that method's ability to extract a good model. Two weighting methods (TF-IDF and TF-IDF Global) were used and tested on the Reuters-21578 text categorization test collection. The results suggest that the approach is a good candidate for enhancing FE performance on English topics. Simulation results on the Reuters-21578 text categorization test collection show the superiority of the proposed algorithm.
  • Keywords
    data mining; feature extraction; feature selection; learning (artificial intelligence); pattern classification; text analysis; FE; Reuters-21578 text categorization test collection; TC; TF-IDF global method; TF-IDF term weighting; co-occurrence-based cosine similarity score; dataset dimensionality reduction; feature extraction; feature selection; feature space high dimensionality; learning machine method; text documents; text mining; topic classification; Feature extraction; Indexing; Iron; Measurement; Text categorization; Vectors; Vocabulary; TF-IDF weighting; cosine similarity score; feature extraction; topic classification;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Research and Development (SCOReD), 2014 IEEE Student Conference on
  • Print_ISBN
    978-1-4799-6427-7
  • Type

    conf

  • DOI
    10.1109/SCORED.2014.7072954
  • Filename
    7072954
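
The abstract names TF-IDF term weighting and a cosine similarity score but gives no formulas. The sketch below is only an illustration of those two ideas under stated assumptions: it uses the textbook weighting tf * log(N / df) on whitespace-tokenized text, it does not attempt the paper's "TF-IDF Global" variant (the abstract does not define it), and the function names and sample sentences are hypothetical, not taken from the paper or from Reuters-21578.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document.

    Assumption: weight = (term count / document length) * log(N / df).
    This is the textbook form, not necessarily the paper's exact formula.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, for illustration only.
docs = [
    "oil prices rise as crude supply falls".split(),
    "crude oil supply falls again".split(),
    "wheat harvest grows in the north".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # documents sharing several terms
sim_02 = cosine_similarity(vecs[0], vecs[2])  # documents sharing no terms
```

In this toy corpus the first two documents share terms and so score above zero, while the first and third share none and score exactly zero. A term that occurs in every document gets weight zero under this weighting, which is one way such a scheme can shrink the effective feature space.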