Title :
Parallel query evaluation as a Scientific Data Service
Author :
Bin Dong; Surendra Byna; Kesheng Wu
Author_Institution :
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Abstract :
Scientific experiments and simulations produce mountains of data in file formats such as HDF5, NetCDF, and FITS. Often, a relatively small amount of data holds the key to new scientific insight. Locating that critical information in these large files is challenging because existing solutions need significant user involvement in preparing the data, generating indexes, and answering queries. Data management systems that support querying, such as SciDB, require a costly process of loading data from scientific data formats into these systems. The search results also need to be converted back to a format required by the subsequent data analysis and visualization tools. These steps are time-consuming, tedious, and possibly error-prone. Toward providing efficient data management directly on these scientific file formats, we introduce a framework called Scientific Data Services (SDS). SDS aims to provide efficient data management optimizations as services. In this paper, we introduce the design and implementation of one such service, the parallel querying service. To answer queries efficiently, we transparently augment user data with bitmap indexes and ordered datasets. We design the querying service to manage these augmented datasets and to redirect queries automatically to bitmap indexes or to ordered datasets based on their availability and the expected query response time. The generation of bitmap indexes and sorted datasets, as well as query evaluation, is parallelized to run on large supercomputers. We show that SDS achieves 22X, 55X, and 62X speedups over a conventional full-scan approach that sifts through the data in answering three queries from a plasma physics analysis application.
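The redirection idea in the abstract (answer a range query from a bitmap index or an ordered copy of the data when one exists, otherwise fall back to a full scan) can be illustrated with a small serial sketch. This is not the SDS implementation: every function name, the equal-width binning, and the fixed cost table are assumptions made for illustration, and the sketch omits the parallelism and the HDF5/NetCDF/FITS I/O that the paper targets. It assumes all values fall inside the bin edges given to `build_bitmap`.

```python
from bisect import bisect_left, bisect_right

def full_scan(values, lo, hi):
    """Baseline: test every element against lo <= v < hi."""
    return [i for i, v in enumerate(values) if lo <= v < hi]

def build_sorted(values):
    """Ordered dataset: sorted keys plus their original positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return [values[i] for i in order], order

def query_sorted(sorted_ds, lo, hi):
    """Two binary searches bound the matching run of sorted keys."""
    keys, order = sorted_ds
    start, stop = bisect_left(keys, lo), bisect_left(keys, hi)
    return sorted(order[start:stop])

def build_bitmap(values, edges):
    """One bitmap per bin [edges[k], edges[k+1]); a Python int acts as the bit vector."""
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        k = bisect_right(edges, v) - 1
        if 0 <= k < len(bitmaps):
            bitmaps[k] |= 1 << i
    return edges, bitmaps

def _positions(bits):
    """Indices of the set bits, in increasing order."""
    out, i = [], 0
    while bits:
        if bits & 1:
            out.append(i)
        bits >>= 1
        i += 1
    return out

def query_bitmap(index, values, lo, hi):
    """OR the fully covered bins; re-check raw values only in boundary bins."""
    edges, bitmaps = index
    hits = 0
    for k, bits in enumerate(bitmaps):
        b_lo, b_hi = edges[k], edges[k + 1]
        if b_hi <= lo or b_lo >= hi:
            continue                      # bin does not overlap the query range
        if lo <= b_lo and b_hi <= hi:
            hits |= bits                  # bin lies entirely inside the range
        else:
            for i in _positions(bits):    # candidate check against the raw data
                if lo <= values[i] < hi:
                    hits |= 1 << i
    return _positions(hits)

# Hypothetical relative cost estimates; a real service would derive these
# from data size, index size, and query selectivity.
DEFAULT_COST = {"scan": 1.0, "bitmap": 0.2, "sorted": 0.1}

def answer_query(values, lo, hi, sorted_ds=None, bitmap=None, cost=DEFAULT_COST):
    """Pick the cheapest available plan and return (plan name, matching positions)."""
    plans = {"scan": lambda: full_scan(values, lo, hi)}
    if sorted_ds is not None:
        plans["sorted"] = lambda: query_sorted(sorted_ds, lo, hi)
    if bitmap is not None:
        plans["bitmap"] = lambda: query_bitmap(bitmap, values, lo, hi)
    best = min(plans, key=cost.__getitem__)
    return best, plans[best]()
```

Calling `answer_query(values, lo, hi)` with no auxiliary structures runs the full scan; passing `bitmap=build_bitmap(values, edges)` or `sorted_ds=build_sorted(values)` lets the dispatcher redirect the same query without the caller changing how it is expressed, which is the transparency the abstract describes.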
Keywords :
parallel processing; query processing; FITS file format; HDF5 file format; NetCDF file format; SDS framework; SciDB system; data management systems; parallel query evaluation; parallel querying service; plasma physics analysis application; query answering; query response time; scientific data service; Indexing; Libraries; Optimization; Servers; Sorting; Parallel Query Processing; Scientific Data Services
Conference_Title :
Cluster Computing (CLUSTER), 2014 IEEE International Conference on
Conference_Location :
Madrid
DOI :
10.1109/CLUSTER.2014.6968765