DocumentCode
154142
Title
A Sampling-Based Hybrid Approximate Query Processing System in the Cloud
Author
Yuxiang Wang ; Junzhou Luo ; Aibo Song ; Fang Dong
Author_Institution
Sch. of Comput. Sci. & Eng., Southeast Univ., Nanjing, China
fYear
2014
fDate
9-12 Sept. 2014
Firstpage
291
Lastpage
300
Abstract
Sampling-based approximate query processing lets users of "Big Data" analytical applications save time and resources when the estimated results satisfy their accuracy expectations well before the final exact results become available. Online aggregation (OLA) is an attractive technique of this kind: it answers aggregation queries with approximate results whose confidence intervals tighten over time. OLA has been built into MapReduce-based cloud systems for big data analytics, allowing users to monitor query progress and save money by terminating the computation once sufficient accuracy has been obtained. Unfortunately, a major obstacle remains: OLA estimation failure, which results from biased sample sets that violate the unbiasedness assumption of OLA sampling, degrades OLA performance. To handle this problem, we first propose a hybrid approximate query processing model to improve overall OLA performance, in which a dynamic scheme switching mechanism switches unpromising OLA queries to the bootstrap scheme for further processing, avoiding the full dataset scan that OLA estimation failure would otherwise cause. In addition, we present a progressive estimation method to reduce the false positive ratio of the dynamic scheme switching mechanism. Moreover, we have implemented our hybrid approximate query processing system in Hadoop and conducted extensive experiments on the TPC-H benchmark with skewed data distributions. Our results demonstrate that the hybrid system can produce acceptable approximate results in a time period one order of magnitude shorter than the original OLA over Hadoop.
Keywords
cloud computing; estimation theory; query processing; sampling methods; Hadoop; MapReduce; OLA estimation failure; TPC-H benchmark; big data analytics; bootstrap scheme; cloud system; confidence interval; data distribution; dynamic scheme switching mechanism; hybrid approximate query processing; online aggregation; progressive estimation method; sampling method; Accuracy; Aggregates; Educational institutions; Estimation; Query processing; Silicon; Switches;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel Processing (ICPP), 2014 43rd International Conference on
Conference_Location
Minneapolis, MN
ISSN
0190-3918
Type
conf
DOI
10.1109/ICPP.2014.38
Filename
6957238
Link To Document