Title :
A Framework for Multitasking Data-Intensive Management Services in High Performance Computing Environments
Author :
Kulasekaran, Sivakumar; Esteva, Maria; Trelogan, Jessica; Liu, Si
Date :
March 30-April 2, 2015
Abstract :
Data management entails a continuum of tasks to develop sustainable and reusable collections throughout their lifecycle. Large collections with complex data formats and structures may require what we define as "multitasking data management," involving a combination of manual and automated iterative tasks. When conducted by curators in a desktop computing environment, these tasks can be labor-intensive and disruptive to research. While the process can be made much more efficient within a Data-Intensive High Performance Computing (DIC/HPC) infrastructure, it remains a challenge to implement generalizable services so that automated workflows can be easily performed by non-expert users. This paper introduces a framework for automating data management activities as data-intensive computing jobs within a multitasking workflow. Using a set of legacy data from an archaeological collection in need of reorganization as a case study, we identified the steps required to re-sort and move approximately 27,000 data files into a structured collection architecture. Because not all data management workflows are the same, and because there is a wide range of requirements for job submission within data-intensive HPC resources, we derived a set of generalizable modules that can be used as a guide for curators and HPC consultants. This framework may accommodate collections with different data types and data management requirements, and can be used by curators who are trained in HPC usage but lack extensive computational expertise. After testing, we implemented the framework as a service on a DIC/HPC cluster.
Keywords :
data handling; data structures; parallel processing; workflow management software; DIC-HPC infrastructure; archaeological collection; automated iterative tasks; automated workflows; complex data formats; complex data structures; data management workflows; data-intensive HPC resources; data-intensive high performance computing infrastructure; desktop computing environment; generalizable modules; multitasking data-intensive management services; multitasking workflow; structured collection architecture; Big data; Computer architecture; Documentation; Metadata; Multitasking; Software; Multitasking data management services; archaeology data; data intensive computing; high performance computing
Conference_Title :
Big Data Computing Service and Applications (BigDataService), 2015 IEEE First International Conference on
Conference_Location :
Redwood City, CA
DOI :
10.1109/BigDataService.2015.42