DocumentCode :
3496522
Title :
Workflow-driven programming paradigms for distributed analysis of biological big data
Author :
Altintas, Ilkay
Author_Institution :
San Diego Supercomput. Center, Univ. of California, San Diego, La Jolla, CA, USA
fYear :
2013
fDate :
12-14 June 2013
Firstpage :
1
Lastpage :
1
Abstract :
Scientific workflows have been used as a programming model to automate scientific tasks ranging from short pipelines to complex workflows that span heterogeneous data and computing resources. While the use of scientific workflow technologies varies slightly across scientific disciplines, all informatics and computational science disciplines provide a common set of attributes to facilitate and accelerate workflow-driven research. Scientific workflows make it easy to assemble complex processing in local or distributed environments via rich and expressive programming models. They enable transparent access to diverse resources ranging from local clusters and traditional supercomputers to elastic and heterogeneous Cloud resources. They support the incorporation of multiple software tools, ranging from domain-specific tools for standard processing to custom generalized workflows and middleware tools that can be reused in various contexts. They often collect provenance information on workflow entities, e.g., workflow definitions, their executions and run-time parameters, and, in turn, assure a level of reproducibility while enabling results to be referenced and replicated. In doing all this, scientific workflows often foster an open-source, open-access and standards-driven community development model based on sharing and collaboration. Cyberinfrastructure platforms and gateways commonly employ scientific workflows to bridge the gap between the infrastructure and users' needs. While capturing and communicating the scientific process formally, workflows ensure flexibility and synergy between users, optimize the usage of resources, increase reuse, and ensure compliance with system-specific data models and community-driven standards. Currently, scientific workflows are used widely in the life sciences at different stages of the end-to-end data lifecycle, from generation to analysis and publication of biological data. The data handled by such workflows can be produced by sequencers, sensor networks, medical imaging instruments and other heterogeneous resources at significant rates and decreasing costs, making the analysis and archival of such data a 'big data' challenge. Additionally, these new biological data resources are making new and exciting research possible in areas including metagenomics and personalized medicine. However, the analysis of big biological data is still very costly, requiring new scalable computational models and programming paradigms to be applied to biological analysis. Although some new paradigms exist for the analysis of big data, the application of these best practices to the life sciences is still in its infancy. Scientific workflows can act as a scaffold and help speed this process up by combining existing programming models and computational models with the challenges of biological problems as reusable blocks. In this talk, I will discuss such an approach, which builds upon distributed data-parallel patterns, e.g., MapReduce, and underlying execution engines, e.g., Hadoop, and matches the computational requirements of bioinformatics tools with such patterns and engines. The results of the presented approach are developed as part of the bioKepler (bioKepler.org) module and can be downloaded to work within release 2.4 of the Kepler scientific workflow system (kepler-project.org).
Keywords :
bioinformatics; cloud computing; distributed algorithms; Cyberinfrastructure platforms; Hadoop; MapReduce; biological big data distributed analysis; computational science; distributed data parallel patterns; heterogeneous Cloud resources; medical imaging instruments; scientific workflows; sensor networks; sequencers; workflow driven programming paradigms; Biological system modeling; Computational modeling; Data handling; Data storage systems; Information management; Programming
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Advances in Bio and Medical Sciences (ICCABS), 2013 IEEE 3rd International Conference on
Conference_Location :
New Orleans, LA
Type :
conf
DOI :
10.1109/ICCABS.2013.6629243
Filename :
6629243