• DocumentCode
    3717172
  • Title

    DSDQuery DSI - Querying scientific data repositories with structured operators

  • Author

    Roee Ebenstein;Gagan Agrawal

  • Author_Institution
    Department of Computer Science and Engineering, The Ohio State University
  • fYear
    2015
  • Firstpage
    485
  • Lastpage
    492
  • Abstract
    Scientific data is often distributed through repositories that host a large number of files in formats such as NetCDF or HDF5. With recent and anticipated increases in the size of observational and simulation data, it is important to transport only the data of interest from a large distributed dataset. Unfortunately, existing portals provide limited querying interfaces, typically a set of predefined, hard-coded subsetting options, which limits users' querying flexibility. This paper describes a system that addresses this gap. Relational algebra is adapted for scientific array querying, which allows a subset of SQL to be adapted for this domain and enables nuanced subsetting conditions to be applied to a set of dataset files within a repository. A query processing algorithm extracts and collects data from relevant datasets, based on metadata extracted earlier by an automatic metadata extraction engine. Finally, the system stitches together a new structured NetCDF file to be returned as the result set, allowing the returned data to be used and analyzed by existing tools. The system has been extensively evaluated to show its ability to handle increasing data sizes and/or numbers of files.
  • Keywords
    "Arrays","Algebra","Metadata","Distributed databases","Portals","Data mining"
  • Publisher
    IEEE
  • Conference_Titel
    Big Data (Big Data), 2015 IEEE International Conference on
  • Type

    conf

  • DOI
    10.1109/BigData.2015.7363790
  • Filename
    7363790