• DocumentCode
    3455915
  • Title

    A two-pass exact algorithm for selection on Parallel Disk Systems

  • Author

    Tian Mi ; Rajasekaran, Sanguthevar

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Univ. of Connecticut, Storrs, CT, USA
  • fYear
    2013
  • fDate
    7-10 July 2013
  • Abstract
    Many OLAP queries in data warehousing applications involve selection operations such as “top N”, median, and “top 5%”. Selection is a well-studied problem with numerous applications in data and database management because, typically, any complex data query can be reduced to a series of basic operations such as sorting and selection. Parallel selection has also become an important fundamental operation, especially since the introduction of parallel databases. In this paper, we present a deterministic algorithm, Recursive Sampling Selection (RSS), for the exact out-of-core selection problem, and we show that it needs no more than (2 + ε) passes (ε being a very small fraction). We compare RSS with two other algorithms from the literature, namely Deterministic Sampling Selection (DSS) and QuickSelect, on Parallel Disk Systems. Our analysis shows that DSS is a (2 + ε)-pass algorithm when the total number of input elements N is a polynomial in the memory size M (i.e., N = M^c for some constant c), whereas RSS runs in (2 + ε) passes without any such assumption. Experimental results indicate that both RSS and DSS outperform QuickSelect on Parallel Disk Systems. In particular, RSS is more scalable and robust for big data when the input size is far greater than the core memory size, including the case of N ≪ M^c.
  • Keywords
    Big Data; data mining; deterministic algorithms; parallel databases; DSS; OLAP queries; QuickSelect; RSS; big data handling; deterministic algorithm; deterministic sampling selection; out-of-core selection problem; parallel disk systems; recursive sampling selection; two-pass exact algorithm; Algorithm design and analysis; Approximation algorithms; Decision support systems; Gaussian distribution; Robustness; Sorting; Uncertainty; Median; OLAP queries; Parallel Disk System; Selection;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computers and Communications (ISCC), 2013 IEEE Symposium on
  • Conference_Location
    Split
  • Type

    conf

  • DOI
    10.1109/ISCC.2013.6755015
  • Filename
    6755015
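
The abstract describes exact out-of-core selection in roughly two passes using deterministic sampling. The sketch below illustrates that general idea only; it is not the authors' RSS or DSS algorithm, whose details are not given in this record. It simulates the disk with an in-memory list, and the names `read_chunks`, `two_pass_select`, `mem_size`, and `step` are hypothetical. Pass 1 sorts each memory-sized chunk and keeps every `step`-th element as a sample; the sorted samples give two pivots that are guaranteed to bracket the k-th smallest element. Pass 2 counts the elements below the lower pivot and keeps only the elements between the pivots.

```python
def read_chunks(data, mem_size):
    """Simulate one sequential pass over 'disk', yielding memory-sized chunks."""
    for start in range(0, len(data), mem_size):
        yield data[start:start + mem_size]


def two_pass_select(data, k, mem_size, step):
    """Return the k-th smallest element (1-indexed) using two passes over data.

    Illustrative sketch of sampling-based selection (assumed design, not the
    algorithm from the paper).
    """
    if not 1 <= k <= len(data):
        raise ValueError("k out of range")

    # Pass 1: sort each chunk in memory and keep every step-th element.
    samples = []
    num_chunks = 0
    for chunk in read_chunks(data, mem_size):
        num_chunks += 1
        samples.extend(sorted(chunk)[step - 1::step])
    samples.sort()

    # Upper pivot: at least (i + 1) * step elements are <= samples[i],
    # so the smallest i with (i + 1) * step >= k bounds the answer from above.
    hi_idx = (k + step - 1) // step - 1
    hi = samples[hi_idx] if hi_idx < len(samples) else float("inf")

    # Lower pivot: at most i * step + (step - 1) * num_chunks elements are
    # strictly below samples[i], so this i keeps that count below k.
    lo_idx = (k - 1 - (step - 1) * num_chunks) // step
    lo = samples[lo_idx] if lo_idx >= 0 else float("-inf")

    # Pass 2: count elements below the window and keep those inside it.
    below = 0
    window = []
    for chunk in read_chunks(data, mem_size):
        for x in chunk:
            if x < lo:
                below += 1
            elif x <= hi:
                window.append(x)

    window.sort()
    return window[k - below - 1]
```

For example, `two_pass_select([5, 3, 3, 1], 2, mem_size=2, step=1)` returns 3. In a real out-of-core setting, `step` would have to be chosen so that both the sample set and the final window fit in memory; this sketch does not enforce that.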