DocumentCode
168679
Title
Cluster-Based SNP Calling on Large-Scale Genome Sequencing Data
Author
Kutlu, Mucahid ; Agrawal, Gagan
Author_Institution
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear
2014
fDate
26-29 May 2014
Firstpage
455
Lastpage
464
Abstract
The volume of available genetic data is increasing rapidly, driven by new high-throughput, low-cost technologies. While this data has enormous potential to drive scientific and medical advances, such volumes cannot be processed without parallelism. Most existing work on analyzing this data has focused on accuracy rather than performance: the algorithms are either serial or parallelized with very simple, non-scalable techniques. In this paper, we address the problem of identifying variants in large-scale genome sequencing data. After examining several possible approaches, we identify one that requires no communication. Achieving load balance is nevertheless non-trivial because the processing is data-dependent. We develop three scheduling schemes: a dynamic scheme, which reduces scheduling overhead by using two different chunk sizes; a static scheme, which uses a pre-processing step to estimate workloads; and a combined scheme. In evaluating our schemes, we find that using a pre-processing step (histogram computation) to estimate workloads is very effective, and thus our combined scheme gives the best results. With a 32× increase in the number of cores, we observe approximately a 24× performance improvement, establishing that scalable processing of genomic data is possible. We also compare against a Hadoop-based implementation and show that our implementation with the combined scheme outperforms it.
Keywords
biology computing; data analysis; genetics; parallel processing; resource allocation; scheduling; cluster-based SNP calling; dynamic scheme; genetic data; high-throughput technology; histogram computation; large-scale genome sequencing data; load-balance; low-cost technology; nonscalable parallelization techniques; scheduling overhead reduction; scheduling schemes; static scheme; variant identification; workload estimation; Algorithm design and analysis; Bioinformatics; Dynamic scheduling; Genomics; Histograms; Software; Software algorithms
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Conference_Location
Chicago, IL
Type
conf
DOI
10.1109/CCGrid.2014.111
Filename
6846481