Title :
A two-pass exact algorithm for selection on Parallel Disk Systems
Author :
Tian Mi ; Rajasekaran, Sanguthevar
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Connecticut, Storrs, CT, USA
Abstract :
Numerous OLAP queries in data warehousing applications involve selection operations such as “top N”, median, and “top 5%”. Selection is a well-studied problem with many applications in the management of data and databases, since a complex data query can typically be reduced to a series of basic operations such as sorting and selection. Parallel selection has also become an important fundamental operation, especially since the introduction of parallel databases. In this paper, we present a deterministic algorithm, Recursive Sampling Selection (RSS), that solves the exact out-of-core selection problem, and we show that it needs no more than (2 + ε) passes (ε being a very small fraction). We compare RSS with two other algorithms from the literature, namely Deterministic Sampling Selection (DSS) and QuickSelect, on Parallel Disk Systems. Our analysis shows that DSS is a (2 + ε)-pass algorithm when the total number of input elements N is polynomial in the memory size M (i.e., N = M^c for some constant c). In contrast, our proposed algorithm RSS runs in (2 + ε) passes without any such assumption. Experimental results indicate that both RSS and DSS outperform QuickSelect on Parallel Disk Systems. In particular, RSS is more scalable and more robust in handling big data when the input size is far greater than the core memory size, including the case of N ≪ M^c.
Keywords :
Big Data; data mining; deterministic algorithms; parallel databases; DSS; OLAP queries; QuickSelect; RSS; big data handling; deterministic algorithm; deterministic sampling selection; out-of-core selection problem; parallel disk systems; recursive sampling selection; two-pass exact algorithm; Algorithm design and analysis; Approximation algorithms; Decision support systems; Gaussian distribution; Robustness; Sorting; Uncertainty; Median; OLAP queries; Parallel Disk System; Selection;
Conference_Title :
Computers and Communications (ISCC), 2013 IEEE Symposium on
Conference_Location :
Split
DOI :
10.1109/ISCC.2013.6755015