Title :
When Good-Enough is Enough: Complex Queries at Fixed Cost
Author :
Mickulicz, Nathan D. ; Martins, Rolando ; Narasimhan, Priya ; Gandhi, Rajeev
Author_Institution :
Dept. of Electr. & Comput. Eng., Carnegie Mellon Univ., Pittsburgh, PA, USA
fDate :
March 30 2015-April 2 2015
Abstract :
Collections of time-series data appear in a wide variety of contexts. To gain insight into the underlying phenomenon (that the data represents), one must analyze the time-series data. Analysis can quickly become challenging for very large data (~terabytes or more) sets, and it may be infeasible to scan the entire data-set on each query due to time limits or resource constraints. To avoid this problem, one might pre-compute partial results by scanning the data-set (usually as the data arrives). However, for complex queries, where the value of a new data record depends on all of the data previously seen, this might be infeasible because incorporating a large amount of historical data into a query requires a large amount of storage. We present an approach to performing complex queries over very large data-sets in a manner that is (i) practical, meaning that a query does not require a scan of the entire data-set, and (ii) fixed-cost, meaning that the amount of storage required only depends on the time-range spanned by the entire data-set (and not the size of the data-set itself). We evaluate our approach with three different data-sets: (i) a 4-year commercial analytics data-set from a production content-delivery platform with over 15 million mobile users, (ii) an 18-year data-set from the Linux-kernel commit-history, and (iii) an 8-day data-set from Common Crawl HTTP logs. Our evaluation demonstrates the feasibility and practicality of our approach for a diverse set of complex queries on a diverse set of very large data-sets.
Keywords :
Linux; hypermedia; operating system kernels; query processing; time series; very large databases; Linux-kernel commit-history; commercial analytics data-set; common crawl HTTP log; complex query; data record; fixed cost; mobile user; production content-delivery platform; time-series data; very large data; Aggregates; Data structures; Indexes; Kernel; Production; Radiation detectors; Time series analysis; big data; large scale; query; time series;
Conference_Titel :
Big Data Computing Service and Applications (BigDataService), 2015 IEEE First International Conference on
Conference_Location :
Redwood City, CA
DOI :
10.1109/BigDataService.2015.24