Regional Information Center for Science and Technology - Cluster-Based SNP Calling on Large-Scale Genome Sequencing Data

DocumentCode :

168679

Title :

Cluster-Based SNP Calling on Large-Scale Genome Sequencing Data

Author :

Kutlu, Mucahid ; Agrawal, Gagan

Author_Institution :

Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA

fYear :

2014

fDate :

26-29 May 2014

Firstpage :

455

Lastpage :

464

Abstract :

The volume of available genetic data is increasing rapidly, driven by new high-throughput, low-cost technologies. While this data has enormous potential to impact scientific and medical advances, such data volumes cannot be processed without the use of parallelism. Most existing work on analyzing this data has focused on the accuracy of the analyses rather than on performance: the algorithms are either serial or rely on very simple, non-scalable parallelization techniques. In this paper, we address the problem of identifying variants in large-scale genome sequencing data. After examining different possible approaches, we identify one that does not require any communication. However, achieving load balance is non-trivial because of the data-dependent nature of the processing. We develop three scheduling schemes: a dynamic scheme, which reduces scheduling overheads by using two different chunk sizes; a static scheme, which uses a pre-processing step to estimate workloads; and a combined scheme. In evaluating our schemes, we find that using a pre-processing step (histogram computation) to estimate workloads is very effective, and thus our combined scheme gives the best results. With a 32× increase in the number of cores, approximately a 24× performance improvement is seen, establishing that scalable processing of genomic data is possible. We also perform a comparison against an implementation based on Hadoop, and show that, with our combined scheme, our implementation outperforms the one using Hadoop.
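The abstract does not give the details of the static scheme, so the following is only a minimal sketch of the general idea of histogram-based workload estimation, under stated assumptions: each genome region is assumed to cost work proportional to its read count, and regions are assigned with a standard greedy "heaviest first, least-loaded worker" rule, which is a stand-in and not necessarily the authors' method. The names `assign_regions`, `worker_loads`, and `parallel_efficiency` are hypothetical. The last function simply restates the reported scaling figure as a ratio of speedup to core increase.

```python
import heapq
from typing import List


def assign_regions(read_counts: List[int], num_workers: int) -> List[List[int]]:
    """Assign region indices to workers using a precomputed histogram.

    read_counts[i] is the estimated workload (e.g. number of reads) of
    region i. Regions are taken heaviest first and each one goes to the
    worker with the smallest load so far (an assumed greedy heuristic).
    """
    # Min-heap of (current_load, worker_id).
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(num_workers)]

    heaviest_first = sorted(
        range(len(read_counts)), key=lambda r: read_counts[r], reverse=True
    )
    for region in heaviest_first:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(region)
        heapq.heappush(heap, (load + read_counts[region], worker))
    return assignment


def worker_loads(read_counts: List[int], assignment: List[List[int]]) -> List[int]:
    """Total estimated workload per worker for a given assignment."""
    return [sum(read_counts[r] for r in regions) for regions in assignment]


def parallel_efficiency(speedup: float, core_ratio: float) -> float:
    """Fraction of ideal scaling achieved (speedup / increase in cores)."""
    return speedup / core_ratio


if __name__ == "__main__":
    counts = [90, 10, 40, 60, 30, 70]  # made-up histogram, for illustration only
    plan = assign_regions(counts, num_workers=2)
    print(plan, worker_loads(counts, plan))
    # 24x improvement on 32x more cores, as reported in the abstract.
    print(parallel_efficiency(24, 32))
```

Because the assignment is computed once from the histogram, workers need no coordination at run time, which is consistent with the communication-free approach the abstract describes. A dynamic scheme would instead hand out chunks on demand.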

Keywords :

biology computing; data analysis; genetics; parallel processing; resource allocation; scheduling; cluster-based SNP calling; data analysis; dynamic scheme; genetic data; high-throughput technology; histogram computation; large-scale genome sequencing data; load-balance; low-cost technology; nonscalable parallelization techniques; scheduling overhead reduction; scheduling schemes; static scheme; variant identification; workload estimation; Algorithm design and analysis; Bioinformatics; Dynamic scheduling; Genomics; Histograms; Software; Software algorithms;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on

Conference_Location :

Chicago, IL

Type :

conf

DOI :

10.1109/CCGrid.2014.111

Filename :

6846481

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=168679