• DocumentCode
    166680
  • Title
    Parallel query evaluation as a Scientific Data Service
  • Author
    Bin Dong; Surendra Byna; Kesheng Wu
  • Author_Institution
    Comput. Res. Div., Lawrence Berkeley Nat. Lab., Berkeley, CA, USA
  • fYear
    2014
  • fDate
    22-26 Sept. 2014
  • Firstpage
    194
  • Lastpage
    202
  • Abstract
    Scientific experiments and simulations produce mountains of data in file formats such as HDF5, NetCDF, and FITS. Often, a relatively small amount of data holds the key to new scientific insight. Locating that critical information in these large files is challenging because existing solutions need significant user involvement in preparing the data, generating indexes, and answering queries. Data management systems that support querying, such as SciDB, require a costly process of loading data from scientific data formats into these systems. The search results also need to be converted back to a format required by the subsequent data analysis and visualization tools. These steps are time-consuming, tedious, and error-prone. Toward providing efficient data management directly on these scientific file formats, we introduce a framework called Scientific Data Services (SDS). SDS aims to provide efficient data management optimizations as services. In this paper, we introduce the design and implementation of one such service, the parallel querying service. To answer queries efficiently, we transparently augment user data with bitmap indexes and ordered datasets. We design the querying service to manage these augmented datasets and to redirect queries automatically to bitmap indexes or to ordered datasets based on their availability and the expected query response time. The generation of bitmap indexes and sorted datasets, as well as the querying itself, is parallelized to work on large supercomputers. We show that SDS achieves 22X, 55X, and 62X speedups compared to a conventional full-scan approach of sifting through data in answering three queries from a plasma physics analysis application.
  • Keywords
    parallel processing; query processing; FITS file format; HDF5 file format; NetCDF file format; SDS framework; SciDB system; data management systems; parallel query evaluation; parallel querying service; plasma physics analysis application; query answering; query response time; scientific data service; Indexing; Libraries; Optimization; Query processing; Servers; Sorting; Parallel Query Processing; Scientific Data Services;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing (CLUSTER), 2014 IEEE International Conference on
  • Conference_Location
    Madrid
  • Type
    conf
  • DOI
    10.1109/CLUSTER.2014.6968765
  • Filename
    6968765