DocumentCode :
2308269
Title :
Document-Oriented Pruning of the Inverted Index in Information Retrieval Systems
Author :
Zheng, Lei ; Cox, Ingemar J.
Author_Institution :
Univ. Coll. London, London
fYear :
2009
fDate :
26-29 May 2009
Firstpage :
697
Lastpage :
702
Abstract :
Searching very large collections can be costly in both computation and storage. To reduce this cost, recent research has focused on reducing the size of (pruning) the inverted index. The inverted index can be viewed as a table whose rows and columns are the terms in the lexicon and the documents in the collection, respectively. A non-zero entry in the table, known as a posting, indicates that the corresponding document contains the term. Previous research on static index pruning was either (i) posting-oriented, in which less important postings are removed from the table, or (ii) term-oriented, in which less important terms are removed from the table. In this paper, we investigate a new, document-oriented pruning strategy that removes entire columns of the table, i.e., it removes less important documents from the collection. Three methods for estimating the importance of a document are proposed. Methods 1 and 2 depend on the score function of the retrieval system (e.g., Okapi BM25), while Method 3 is independent of the retrieval system. Experiments compare the three proposed methods with Carmel et al.'s posting-oriented approach, using both the FT and LA Times collections and both ordinary and difficult queries. Based on mean average precision and precision at 10, the results show that Method 3 generally performs best on the FT collection for pruned indexes down to 35% of the original size. However, for more severe pruning, Carmel et al.'s algorithm is better. For the LA Times collection, the relative performance of Method 3 and Carmel et al.'s algorithm is reversed. This variation in performance across collections has not been previously reported.
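The table view of the inverted index and the idea of removing whole document columns can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy documents, the function names, and above all the importance scores are hypothetical placeholders, since the abstract does not specify how Methods 1 to 3 estimate document importance.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term (row) to the set of document ids (columns) that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)  # one posting per (term, document) pair
    return dict(index)


def prune_documents(index, importance, n_keep):
    """Document-oriented pruning: keep only the n_keep most important documents."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    kept = set(ranked[:n_keep])
    pruned = {}
    for term, postings in index.items():
        remaining = postings & kept
        if remaining:  # a term left with no postings drops out of the lexicon
            pruned[term] = remaining
    return pruned


def count_postings(index):
    """Index size measured as the number of non-zero table entries."""
    return sum(len(postings) for postings in index.values())


# Toy collection (hypothetical).
docs = {
    "d1": "inverted index pruning",
    "d2": "index compression",
    "d3": "web search advertising",
}

# Hypothetical importance scores; a real system would estimate these,
# for example from the retrieval score function or from document statistics.
importance = {"d1": 0.9, "d2": 0.2, "d3": 0.6}

full_index = build_index(docs)
pruned_index = prune_documents(full_index, importance, n_keep=2)

print(count_postings(full_index), count_postings(pruned_index))
```

In this sketch, removing document d2 deletes every posting in its column at once, and any term that occurred only in d2 disappears from the lexicon as a side effect. A posting-oriented method would instead remove individual entries while leaving every document partially represented.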
Keywords :
information retrieval; very large databases; document-oriented pruning; information retrieval systems; inverted index; nonzero entry; posting-oriented approach; very large collections; Advertising; Computer networks; Costs; Data structures; Educational institutions; Indexing; Information retrieval; Search engines; Vocabulary; Web search;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Advanced Information Networking and Applications Workshops, 2009. WAINA '09. International Conference on
Conference_Location :
Bradford
Print_ISBN :
978-1-4244-3999-7
Electronic_ISBN :
978-0-7695-3639-2
Type :
conf
DOI :
10.1109/WAINA.2009.147
Filename :
5136730