DocumentCode :
168679
Title :
Cluster-Based SNP Calling on Large-Scale Genome Sequencing Data
Author :
Kutlu, Mucahid ; Agrawal, Gagan
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear :
2014
fDate :
26-29 May 2014
Firstpage :
455
Lastpage :
464
Abstract :
The volume of available genetic data is increasing rapidly, driven by new high-throughput, low-cost technologies. While this data has enormous potential to impact scientific and medical advances, such data volumes cannot be processed without the use of parallelism. Most existing work on analyzing this data has focused on the accuracy of the analyses rather than on performance: the algorithms are either serial or parallelized with very simple, non-scalable techniques. In this paper, we address the problem of identifying variants in large-scale genome sequencing data. After examining different possible approaches, we identify one that does not require any communication. However, achieving load balance is non-trivial because of the data-dependent nature of the processing. We develop three scheduling schemes: a dynamic scheme, which reduces scheduling overheads by using two different chunk sizes; a static scheme, which uses a pre-processing step to estimate workloads; and a combined scheme. In evaluating our schemes, we find that using a pre-processing step (histogram computation) to estimate workloads is very effective, and thus our combined scheme gives the best results. With a 32× increase in the number of cores, approximately a 24× performance improvement is seen, establishing that scalable processing of genomic data is possible. We also perform a comparison against an implementation based on Hadoop, and show that, with our combined scheme, our implementation outperforms the one using Hadoop.
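The abstract does not spell out how the histogram is turned into a schedule, so the following is only an illustrative sketch of one common way to do static, histogram-driven load balancing, not the authors' algorithm. It assumes the pre-processing step yields one estimated workload per genomic region (for example, a read count) and assigns regions to workers with a greedy "heaviest first, lightest worker next" rule. The names `assign_chunks` and `parallel_efficiency` and the sample histogram are hypothetical. The second helper simply restates the reported scaling: a 24× improvement on 32× more cores corresponds to a parallel efficiency of 24/32 = 0.75.

```python
import heapq


def assign_chunks(histogram, num_workers):
    """Statically assign regions to workers from estimated workloads.

    histogram   -- estimated workload per region (assumed: read counts
                   produced by a pre-processing pass)
    num_workers -- number of workers to balance across

    Returns (assignment, loads), where assignment[w] lists the region
    indices given to worker w and loads[w] is that worker's total
    estimated workload.
    """
    # Min-heap of (current load, worker id): the lightest worker is on top.
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]

    # Visit regions from heaviest to lightest estimated workload.
    order = sorted(range(len(histogram)),
                   key=lambda i: histogram[i], reverse=True)
    for idx in order:
        load, w = heapq.heappop(heap)
        assignment[w].append(idx)
        heapq.heappush(heap, (load + histogram[idx], w))

    loads = [sum(histogram[i] for i in regions) for regions in assignment]
    return assignment, loads


def parallel_efficiency(speedup, core_ratio):
    """Speedup divided by the factor by which the core count grew."""
    return speedup / core_ratio


if __name__ == "__main__":
    # Hypothetical per-region workload estimates.
    sample_histogram = [8, 1, 7, 2, 6, 3, 5, 4]
    assignment, loads = assign_chunks(sample_histogram, 2)
    print("assignment:", assignment)
    print("estimated loads:", loads)
    print("efficiency:", parallel_efficiency(24, 32))
```

Because every region is assigned before processing starts, a scheme of this kind needs no communication at run time; its quality depends entirely on how well the histogram predicts the actual, data-dependent processing cost.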
Keywords :
biology computing; data analysis; genetics; parallel processing; resource allocation; scheduling; cluster-based SNP calling; data analysis; dynamic scheme; genetic data; high-throughput technology; histogram computation; large-scale genome sequencing data; load-balance; low-cost technology; nonscalable parallelization techniques; scheduling overhead reduction; scheduling schemes; static scheme; variant identification; workload estimation; Algorithm design and analysis; Bioinformatics; Dynamic scheduling; Genomics; Histograms; Software; Software algorithms;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Conference_Location :
Chicago, IL
Type :
conf
DOI :
10.1109/CCGrid.2014.111
Filename :
6846481