Information retrieval using fuzzy c-means clustering and modified vector space model

Author

Chowdhury, Chandrani Ray ; Bhuyan, Prachet

Author_Institution

Sch. of Comput. Eng., KIIT Univ., Bhubaneswar, India

Volume

1

fYear

2010

fDate

9-11 July 2010

Firstpage

696

Lastpage

700

Abstract

This paper presents a method to improve the performance of an Information Retrieval System (IRS) by increasing the number of relevant documents retrieved. Several types of uncertainty and fuzziness are associated with an IRS, such as search term uncertainty and the relevance uncertainty involved in retrieving irrelevant documents. The aim of this paper is to eliminate these types of uncertainty and increase the chance of retrieving relevant documents. The proposed framework first calculates the similarity between the query and the document clusters, so that it retrieves not only the documents matching the query terms but also documents similar to those retrieved. This helps to reduce search term uncertainty. The method then tries to reduce the fuzziness associated with document relevance in two steps: first, the general term frequency-inverse document frequency (tf-idf) scoring mechanism is modified to give importance to the informativeness of a document's contents; second, the overlap between the query and the document summary is calculated. All of this information is used to compute the document relevance score. Finally, the retrieved documents are filtered by the Pearson correlation coefficient between the query vector and the document vector, so that only documents correlated with the query are kept. The experiments used the standard NPL test collection prepared by Vaswani and Cameron at the National Physical Laboratory in England. After full implementation of this methodology, the proposed work was found to perform better than existing methods.
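
The final filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pearson` and `filter_by_correlation`, the threshold of 0.0, the treatment of constant vectors as uncorrelated, and the toy term-weight vectors are all assumptions made for illustration. The sketch computes the Pearson correlation coefficient between a query vector and each document vector and keeps only the documents whose coefficient exceeds the threshold.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant vector is treated as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def filter_by_correlation(query_vec, doc_vecs, threshold=0.0):
    """Return ids of documents whose correlation with the query exceeds threshold.

    doc_vecs maps a document id to its term-weight vector (hypothetical layout).
    """
    return [
        doc_id
        for doc_id, vec in doc_vecs.items()
        if pearson(query_vec, vec) > threshold
    ]


# Toy vectors over a four-term vocabulary (illustrative values only).
query = [1, 0, 1, 0]
docs = {
    "d1": [2, 0, 2, 0],  # same term pattern as the query
    "d2": [0, 1, 0, 1],  # opposite term pattern
    "d3": [1, 1, 1, 1],  # constant weights
}
kept = filter_by_correlation(query, docs)
```

In this toy run only `d1` survives the filter. In practice the vectors would hold the modified tf-idf weights, and the threshold would need tuning against a test collection.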

Keywords

correlation methods; document handling; fuzzy set theory; information filtering; information retrieval systems; pattern clustering; relevance feedback; search problems; vectors; IRS; National Physical Laboratory; Pearson correlation coefficient; document cluster similarity; document relevance; document summary; document vector; fuzziness; fuzzy c-means clustering; information retrieval system; irrelevant documents; modified vector space model; query vector; relevant document retrieval; search term uncertainty; standard NPL test collection; term frequency-inverse document frequency; tf-idf scoring mechanism; Correlation; Discrete wavelet transforms; Clustering; Correlation ratio; Document cluster; Document summary; Fuzzy c-means; Information Retrieval; document frequency

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on

Conference_Location

Chengdu

Print_ISBN

978-1-4244-5537-9

Type

conf

DOI

10.1109/ICCSIT.2010.5564542

Filename

5564542