DocumentCode
3582537
Title
Feature extraction for co-occurrence-based cosine similarity score of text documents
Author
Kadhim, Ammar Ismael ; Cheah, Yu.-N. ; Ahamed, Nurul Hashimah ; Salman, Lubab A.
Author_Institution
Sch. of Comput. Sci., Univ. Sains Malaysia, Minden, Malaysia
fYear
2014
Firstpage
1
Lastpage
4
Abstract
A major challenge in topic classification (TC) is the high dimensionality of the feature space. Therefore, feature extraction (FE) plays a vital role in topic classification in particular and text mining in general. FE based on cosine similarity score is commonly used to reduce the dimensionality of datasets with tens or hundreds of thousands of features, which can be impossible to process further. In this study, TF-IDF term weighting is used to extract features. Selecting relevant features and determining how to encode them for a learning machine method have a vast impact on the learning machine method's ability to extract a good model. Two different weighting methods (TF-IDF and TF-IDF Global) were used and tested on the Reuters-21578 text categorization test collection. The obtained results emerged as a good candidate for enhancing the performance of FE for English topics. Simulation results on the Reuters-21578 text categorization test collection show the superiority of the proposed algorithm.
Keywords
data mining; feature extraction; feature selection; learning (artificial intelligence); pattern classification; text analysis; FE; Reuters-21578 text categorization test collection; TC; TF-IDF global method; TF-IDF term weighting; co-occurrence-based cosine similarity score; dataset dimensionality reduction; feature extraction; feature selection; feature space high dimensionality; learning machine method; text documents; text mining; topic classification; Feature extraction; Indexing; Iron; Measurement; Text categorization; Vectors; Vocabulary; TF-IDF weighting; cosine similarity score; feature extraction; topic classification;
fLanguage
English
Publisher
IEEE
Conference_Titel
Research and Development (SCOReD), 2014 IEEE Student Conference on
Print_ISBN
978-1-4799-6427-7
Type
conf
DOI
10.1109/SCORED.2014.7072954
Filename
7072954