DocumentCode :
1963549
Title :
Space-Efficient Framework for Top-k String Retrieval Problems
Author :
Hon, Wing-Kai ; Shah, Rahul ; Vitter, Jeffrey Scott
Author_Institution :
Dept. of Comput. Sci., Nat. Tsing-Hua Univ., Hsinchu, Taiwan
fYear :
2009
fDate :
25-27 Oct. 2009
Firstpage :
713
Lastpage :
722
Abstract :
Given a set D = {d1, d2, ..., dD} of D strings of total length n, our task is to report the "most relevant" strings for a given query pattern P. This requires more advanced query functionality than ordinary pattern matching, because some notion of "most relevant" is involved. In the information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for a predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search goes through all occurrences of the pattern over the entire string collection, which may be far more numerous than the relevant documents required. The first formal framework for studying this kind of retrieval problem was given by Muthukrishnan. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking O(n log n) words of space. We study this problem in a slightly different framework: reporting the top k most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives a linear-space data structure with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems studied by Muthukrishnan while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.
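To make the problem statement concrete, the following is a minimal brute-force sketch of top-k retrieval under the frequency metric. It is not the paper's data structure: it scans every document on each query and only illustrates what a query should return. The function names and the tie-breaking rule (lower document index first) are assumptions introduced for this illustration.

```python
import heapq


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_by_frequency(docs, pattern, k):
    """Return up to k (doc_index, frequency) pairs, most frequent first.

    Brute-force baseline: every document is scanned for every query.
    Documents that do not contain the pattern are never reported.
    Ties are broken by lower document index (an assumption of this sketch).
    """
    scored = []
    for index, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            scored.append((freq, index))
    # Sort key: higher frequency first, then lower index.
    best = heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))
    return [(index, freq) for freq, index in best]


if __name__ == "__main__":
    docs = ["banana", "ananas", "bandana", "cherry"]
    print(top_k_by_frequency(docs, "ana", 2))
```

The baseline spends time proportional to the total length n on every query, regardless of k. The abstract's point is that an index can avoid this cost and report only the k documents requested.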
Keywords :
data structures; indexing; pattern matching; query processing; relevance feedback; trees (mathematics); arbitrary score function; frequency metric; information retrieval; inverted indexes; linear space data structure; pattern-matching data structures; proximity metric; query functionality; relevance metrics; space-efficient framework; suffix arrays; suffix tree search; threshold-based approach; top-k string retrieval problem; Computer science; Data structures; Databases; Extraterrestrial measurements; Frequency; Indexing; Information retrieval; Pattern matching; Tree data structures; USA Councils; document retrieval; succinct data structures; text indexing; top-$k$ queries;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Foundations of Computer Science, 2009. FOCS '09. 50th Annual IEEE Symposium on
Conference_Location :
Atlanta, GA
ISSN :
0272-5428
Print_ISBN :
978-1-4244-5116-6
Type :
conf
DOI :
10.1109/FOCS.2009.19
Filename :
5438585