DocumentCode :
170316
Title :
HTSeq-Hadoop: Extending HTSeq for Massively Parallel Sequencing Data Analysis Using Hadoop
Author :
Siretskiy, Alexey ; Spjuth, Ola
Author_Institution :
Dept. of Inf. Technol., Uppsala Univ., Uppsala, Sweden
Volume :
1
fYear :
2014
fDate :
20-24 Oct. 2014
Firstpage :
317
Lastpage :
323
Abstract :
Hadoop is a convenient framework in e-Science enabling scalable distributed data analysis. In molecular biology, next-generation sequencing produces vast amounts of data and requires flexible frameworks for constructing analysis pipelines. We extend the popular HTSeq package into the Hadoop realm by introducing massively parallel versions of short read quality assessment as well as functionality to count genes mapped by the short reads. We use the Hadoop-streaming library, which allows the components to run in both Hadoop and regular Linux systems, and evaluate their performance in two different execution environments: a single node on a computational cluster and a Hadoop cluster in a private cloud. We compare the implementations with Apache Pig, showing improved runtime performance of our developed methods. We also inject the components into the graphical platform Cloudgene to simplify user interaction.
Keywords :
Linux; biology computing; data analysis; genetics; molecular biophysics; parallel processing; pipeline processing; Apache Pig; Cloudgene; HTSeq; HTSeq package; HTSeq-Hadoop; Hadoop Linux systems; Hadoop-streaming library; analysis pipelines; computational cluster; e-Science; graphical platform; massively parallel sequencing data analysis; molecular biology; next-generation sequencing; private cloud; regular Linux systems; runtime performance; scalable distributed data analysis; user interaction; Bioinformatics; Cloud computing; Genomics; Libraries; Sequential analysis; Timing; Hadoop; Map-Reduce; Massively Parallel Sequencing
fLanguage :
English
Publisher :
IEEE
Conference_Title :
e-Science (e-Science), 2014 IEEE 10th International Conference on
Conference_Location :
Sao Paulo
Print_ISBN :
978-1-4799-4288-6
Type :
conf
DOI :
10.1109/eScience.2014.27
Filename :
6972279