Title :
Information retrieval using fuzzy c-means clustering and modified vector space model
Author :
Chowdhury, Chandrani Ray ; Bhuyan, Prachet
Author_Institution :
Sch. of Comput. Eng., KIIT Univ., Bhubaneswar, India
Abstract :
This paper presents a method to improve the performance of an Information Retrieval System (IRS) by increasing the number of relevant documents retrieved. Several types of uncertainty and fuzziness are associated with an IRS, such as search term uncertainty and relevance uncertainty, and these lead to the retrieval of irrelevant documents. The aim of this paper is to eliminate these types of uncertainty and increase the chance of retrieving relevant documents. The proposed method first calculates the similarity between the query and each document cluster, so that it retrieves not only the documents matching the query terms but also documents similar to those retrieved. This helps to reduce search term uncertainty. The method then tries to reduce the fuzziness associated with document relevance in two steps: first, the general term frequency-inverse document frequency (tf-idf) scoring mechanism is modified to give importance to the informativeness of a document's contents; second, the overlap between the query and the document summary is calculated. All of this information is used to compute the document relevance score. Finally, the retrieved documents are filtered by the Pearson correlation coefficient between the query vector and the document vector, so that only documents correlated with the query are kept. The experiments used the standard NPL test collection prepared by Vaswani and Cameron at the National Physical Laboratory in England. After full implementation of this methodology, the proposed approach was found to perform better than the existing methods it was compared with.
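The final filtering step named in the abstract can be illustrated with a minimal sketch. It assumes plain tf-idf weights (the abstract does not specify the paper's modified scoring), a binary query vector, and a correlation threshold of 0.0. The function names, the threshold, and the toy documents are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tf_idf_vectors(docs, vocab):
    """Standard tf-idf weights per document (assumption: NOT the paper's modified scheme)."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            [tf[t] * math.log(n / df[t]) if df[t] else 0.0 for t in vocab]
        )
    return vectors


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def filter_by_correlation(query_vec, doc_vecs, threshold=0.0):
    """Keep indices of documents whose correlation with the query exceeds the threshold.

    The threshold value is an assumption made for this sketch.
    """
    return [i for i, d in enumerate(doc_vecs) if pearson(query_vec, d) > threshold]


# Toy data, invented for illustration only.
docs = [
    ["fuzzy", "clustering", "retrieval"],
    ["wavelet", "transform"],
    ["fuzzy", "retrieval", "query"],
]
vocab = sorted({t for d in docs for t in d})
doc_vecs = tf_idf_vectors(docs, vocab)
query = ["fuzzy", "retrieval"]
query_vec = [1.0 if t in query else 0.0 for t in vocab]
print(filter_by_correlation(query_vec, doc_vecs))
```

In this toy example the second document shares no terms with the query, so its correlation with the query vector is negative and it is filtered out. The clustering and summary-overlap stages mentioned in the abstract are not covered by this sketch.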
Keywords :
correlation methods; document handling; fuzzy set theory; information filtering; information retrieval systems; pattern clustering; relevance feedback; search problems; vectors; IRS; National Physical Laboratory; Pearson correlation coefficient; document cluster similarity; document relevance; document summary; document vector; fuzziness; fuzzy c-means clustering; information retrieval system; irrelevant documents; modified vector space model; query vector; relevant document retrieval; search term uncertainty; standard NPL test collection; term frequency-inverse document frequency; tf-idf scoring mechanism; Correlation; Discrete wavelet transforms; Clustering; Correlation ratio; Document cluster; Document summary; Fuzzy c-means; Information Retrieval; document frequency
Conference_Title :
Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4244-5537-9
DOI :
10.1109/ICCSIT.2010.5564542