DocumentCode :
3145226
Title :
Selectivity estimation for extraction operators over text data
Author :
Wang, Daisy Zhe ; Wei, Long ; Li, Yunyao ; Reiss, Frederick ; Vaithyanathan, Shivakumar
Author_Institution :
Univ. of California, Berkeley, CA, USA
fYear :
2011
fDate :
11-16 April 2011
Firstpage :
685
Lastpage :
696
Abstract :
Recently, there has been increasing interest in extending relational query processing to efficiently support extraction operators, such as dictionaries and regular expressions, over text data. Many text processing queries are sophisticated in that they involve multiple extraction and join operators, resulting in many possible query plans. However, there has been little research on selectivity or cost estimation for these extraction operators, which is crucial for an optimizer to pick a good query plan. In this paper, we define the problem of selectivity estimation for dictionaries and regular expressions, and propose to develop document synopses over a text corpus, from which the selectivity can be estimated. We first adapt language models from the Natural Language Processing literature to form the top-k n-gram synopsis as the baseline document synopsis. Then we develop two classes of novel document synopses: the stratified Bloom filter synopsis and the roll-up synopsis. We also develop techniques to decompose a complicated regular expression into subparts to achieve more effective and accurate estimation. We conduct experiments over the Enron email corpus using both real-world and synthetic workloads to compare the accuracy of selectivity estimation across different classes and variations of synopses. The results show that the top-k stratified Bloom filter synopsis and the roll-up synopsis are the most accurate for dictionary and regular expression selectivity estimation, respectively.
Keywords :
data structures; dictionaries; natural language processing; query processing; text analysis; Enron email corpus; database management; document synopses; extraction operators; join operators; regular expressions; relational query processing; roll-up synopsis; selectivity estimation; text data; text processing queries; top-k stratified bloom filter synopsis; Accuracy; Adaptation model; Arrays; Blogs; Data mining; Estimation
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Engineering (ICDE), 2011 IEEE 27th International Conference on
Conference_Location :
Hannover
ISSN :
1063-6382
Print_ISBN :
978-1-4244-8959-6
Electronic_ISBN :
1063-6382
Type :
conf
DOI :
10.1109/ICDE.2011.5767931
Filename :
5767931