DocumentCode
3580549
Title
Towards Reliable Clustering of English Text Documents Using Correlation Coefficient
Author
Bhaumik, Hrishikesh ; Chakraborty, Biswanath ; Mukherjee, Anirban ; Bhattacharyya, Siddhartha ; Chattopadhyay, Manojit
Author_Institution
Dept. of Inf. Technol., RCC Inst. of Inf. Technol., Kolkata, India
fYear
2014
Firstpage
530
Lastpage
535
Abstract
This paper proposes a new approach for clustering English text documents based on the pairwise correlation of the documents in a given set. The correlation coefficient for each pair of documents is calculated from ranks assigned to the words in the documents. The words occurring in a document are ranked by their weights, computed with the conventional TF-IDF factor. The proposed method clusters a given set of text documents into classes according to their contents, without the number of classes being known a priori. Experimental results show that the proposed correlation-based text categorization method performs better than some other text categorization methods, including methods that use artificial neural networks.
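The pipeline outlined in the abstract (TF-IDF weights, word ranks, pairwise correlation, clusters of unknown count) can be illustrated with a minimal sketch. Several choices here are assumptions, not details taken from the paper: the correlation is computed as a Spearman-style coefficient (Pearson correlation of average ranks), the IDF is the plain log(N/df) form, and clusters are formed by linking every pair whose correlation reaches a hypothetical `threshold` and taking connected components. All function names are illustrative.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Return one TF-IDF weight per vocabulary word for each document."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks) or 1
        # Assumed weighting: relative term frequency times log(N / df).
        vectors.append([(tf[w] / total) * math.log(n / df[w]) for w in vocab])
    return vectors


def average_ranks(values):
    """Rank values (1 = largest weight); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_correlation(u, v):
    """Spearman-style coefficient: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(u), average_ranks(v))


def cluster(docs, threshold=0.5):
    """Group document indices; the number of groups is not fixed in advance."""
    vecs = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Assumed linking rule: merge any pair whose correlation meets the threshold.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if rank_correlation(vecs[i], vecs[j]) >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


docs = [
    "cat dog cat pet",
    "dog cat pet pet",
    "stock market bank stock",
    "bank market stock bank",
]
print(cluster(docs))  # → [[0, 1], [2, 3]]
```

In this toy input the two pet documents correlate at 0.85 with each other and at -0.9 with the finance documents, so the threshold separates them into two groups without the group count being supplied. The threshold value itself is arbitrary here and would need tuning on real data.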
Keywords
natural language processing; pattern clustering; statistical analysis; text analysis; English text document; TF-IDF factor; correlation coefficient; pairwise correlation; reliable clustering; text categorization; Classification algorithms; Clustering algorithms; Correlation; Correlation coefficient; Equations; Text categorization; Vectors; clustering; correlation coefficient; text classification;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computational Intelligence and Communication Networks (CICN), 2014 International Conference on
Print_ISBN
978-1-4799-6928-9
Type
conf
DOI
10.1109/CICN.2014.121
Filename
7065541