• DocumentCode
    2441580
  • Title
    PreDatA – preparatory data analytics on peta-scale machines
  • Author
    Zheng, Fang ; Abbasi, Hasan ; Docan, Ciprian ; Lofstead, Jay ; Liu, Qing ; Klasky, Scott ; Parashar, Manish ; Podhorszki, Norbert ; Schwan, Karsten ; Wolf, Matthew
  • Author_Institution
    Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
  • fYear
    2010
  • fDate
    19-23 April 2010
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    Peta-scale scientific applications running on High End Computing (HEC) platforms can generate large volumes of data. For high performance storage and in order to be useful to science end users, such data must be organized in its layout, indexed, sorted, and otherwise manipulated for subsequent data presentation, visualization, and detailed analysis. In addition, scientists desire to gain insights into selected data characteristics 'hidden' or 'latent' in these massive datasets while data is being produced by simulations. PreDatA, short for Preparatory Data Analytics, is an approach to preparing and characterizing data while it is being produced by the large scale simulations running on peta-scale machines. By dedicating additional compute nodes on the machine as 'staging' nodes and by staging simulations' output data through these nodes, PreDatA can exploit their computational power to perform select data manipulations with lower latency than attainable by first moving data into file systems and storage. Such in-transit manipulations are supported by the PreDatA middleware through asynchronous data movement to reduce write latency, application-specific operations on streaming data that are able to discover latent data characteristics, and appropriate data reorganization and metadata annotation to speed up subsequent data access. PreDatA enhances the scalability and flexibility of the current I/O stack on HEC platforms and is useful for data pre-processing, runtime data analysis and inspection, as well as for data exchange between concurrently running simulations.
  • Keywords
    data analysis; data visualisation; electronic data interchange; meta data; middleware; storage management; HEC platforms; I/O stack; PreDatA middleware; application-specific operations; asynchronous data movement; data access; data exchange; data inspection; data manipulations; data preprocessing; data presentation; data reorganization; data visualization; file storage; file systems; high end computing platforms; high performance storage; large scale simulations; metadata annotation; peta-scale machines; peta-scale scientific applications; preparatory data analytics; runtime data analysis; streaming data; write latency; Analytical models; Computational modeling; Data analysis; Data visualization; Delay; File systems; Large-scale systems; Middleware; Performance analysis; Scalability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on
  • Conference_Location
    Atlanta, GA
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4244-6442-5
  • Type
    conf
  • DOI
    10.1109/IPDPS.2010.5470454
  • Filename
    5470454