DocumentCode :
169840
Title :
Exploratory Analysis of Raw Data Files through Dataflows
Author :
Silva, Valter ; de Oliveira, Daniel ; Mattoso, Marta
Author_Institution :
Comput. Sci. - COPPE, Fed. Univ. of Rio de Janeiro, Rio de Janeiro, Brazil
fYear :
2014
fDate :
22-24 Oct. 2014
Firstpage :
114
Lastpage :
119
Abstract :
Scientific applications generate raw data files at very large scale. Most of these files follow a standard format established by the application domain, such as HDF5, NetCDF and FITS. These formats are supported by a variety of programming languages, libraries and programs. Because the files are so large, analyzing them requires writing a specific program. Generic data analysis systems such as database management systems (DBMS) are not well suited because of data loading and data transformation at large scale. Recently there have been several proposals for indexing and querying raw data files without the overhead of using a DBMS, such as noDB, RAW and FastBit. Their goal is to offer query support over the raw data file after a scientific program has generated it. However, these solutions focus on the analysis of a single large file. When a large number of related files are all required for the evaluation of one scientific hypothesis, the relationships among them must be managed manually or by writing specific programs. The proposed approach takes advantage of the existing provenance data support in Scientific Workflow Management Systems (SWfMS). When scientific applications are managed by a SWfMS, the data is registered in the provenance database at runtime. Therefore, this provenance data may act as a description of these files. When the SWfMS is dataflow aware, it registers domain data in the same database. The resulting database becomes an important access method to the large number of files generated by the scientific workflow execution, complementing the single raw data file analysis support. In this work, we present our dataflow approach for analyzing data from several raw data files and evaluate it with the Montage application from the astronomy domain.
Keywords :
data analysis; natural sciences computing; programming languages; workflow management software; DBMS; FITS; FastBit; HDF5; Montage application; NetCDF; RAW; SWfMS; astronomy domain; data loading; data transformation; database management systems; dataflow aware; dataflows; domain area application; exploratory analysis; generic data analysis systems; indexing; noDB; programming languages; provenance data support; provenance database; query support; querying raw data files; scientific applications; scientific hypothesis; scientific program; scientific workflow execution; scientific workflow management systems; single raw data file analysis support; specific programs; standard format; Astronomy; Awards activities; Big data; Data mining; Indexing; Runtime; data analysis; high performance computing; in situ processing; raw data processing; scientific workflows
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on
Conference_Location :
Paris
Type :
conf
DOI :
10.1109/SBAC-PADW.2014.32
Filename :
6972025