DocumentCode :
2784612
Title :
De Novo Assembly of High-Throughput Sequencing Data with Cloud Computing and New Operations on String Graphs
Author :
Chang, Yu-Jung ; Chen, Chien-Chih ; Ho, Jan-Ming ; Chen, Chuen-Liang
Author_Institution :
Inst. of Inf. Sci., Acad. Sinica, Taipei, Taiwan
fYear :
2012
fDate :
24-29 June 2012
Firstpage :
155
Lastpage :
161
Abstract :
Next-generation sequencing technologies are increasing the throughput of DNA sequencing at a much faster rate than the growth in computer speed predicted by Moore's Law. Even loading these sequencing data into memory and processing them there is a problem. There is an urgent need for de novo assemblers that can efficiently handle huge amounts of sequencing data using scalable commodity servers in the cloud. In this paper, we present CloudBrush, a parallel algorithm that runs on the MapReduce framework of cloud computing for de novo assembly of high-throughput sequencing data. The algorithm is based on Myers's bi-directed string graphs and consists of two main stages: graph construction and graph simplification. First, a vertex is defined for each non-redundant sequence read. We present a prefix-and-extend algorithm to identify overlaps between pairs of reads and to reduce transitive edges. The graph is further simplified using conventional operations, including path compression, tip removal, and bubble removal. We also present a new operation, Similar Neighbour Edge Adjustment, to remove error topology structures in string graphs, and we disconnect repeat regions using revised A-statistics. The goal is to partition the string graph so that all paths in each connected subgraph correspond to similar subsequences of the underlying genome. We then traverse each connected subgraph to find a long path, supported by a sufficient number of reads, to represent the subgraph. Preliminary results show that the CloudBrush assembler, compared with Contrail and Edena on sequencing data of E. coli genomes, may yield longer contigs.
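The two stages named in the abstract (graph construction, then simplification by transitive-edge reduction and path compression) can be illustrated with a minimal single-machine sketch. This is not the CloudBrush implementation: it does not use MapReduce or the prefix-and-extend algorithm, it handles forward strands only rather than Myers's bi-directed graphs, and it assumes error-free reads with no read contained in another. The function names `overlap_len`, `build_graph`, `reduce_transitive`, and `compress_paths`, and the `min_len` parameter, are hypothetical names chosen for illustration.

```python
from itertools import permutations


def overlap_len(a, b, min_len):
    """Longest proper suffix of a equal to a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_graph(reads, min_len):
    """One vertex per non-redundant read; edge a -> b is labelled with the overlap length."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep input order
    graph = {r: {} for r in reads}
    for a, b in permutations(reads, 2):  # naive all-pairs comparison (assumption)
        k = overlap_len(a, b, min_len)
        if k:
            graph[a][b] = k
    return graph


def reduce_transitive(graph):
    """Drop edge u -> w when a path u -> v -> w places w at the same offset from u."""
    drop = set()
    for u, outs in graph.items():
        for v, ov_uv in outs.items():
            for w, ov_vw in graph[v].items():
                ov_uw = outs.get(w)
                if ov_uw is None:
                    continue
                if (len(u) - ov_uv) + (len(v) - ov_vw) == len(u) - ov_uw:
                    drop.add((u, w))
    return {u: {v: k for v, k in outs.items() if (u, v) not in drop}
            for u, outs in graph.items()}


def compress_paths(graph):
    """Merge unbranched chains of vertices into contig strings.

    Isolated cycles with no branching are not emitted by this sketch.
    """
    preds = {u: [] for u in graph}
    for u, outs in graph.items():
        for v in outs:
            preds[v].append(u)

    def mergeable(v):
        # v can be folded into its predecessor if it is that vertex's only
        # successor and v has no other predecessor.
        return len(preds[v]) == 1 and len(graph[preds[v][0]]) == 1

    contigs = []
    for start in graph:
        if mergeable(start):
            continue
        contig, cur = start, start
        while len(graph[cur]) == 1:
            (nxt, k), = graph[cur].items()
            if not mergeable(nxt):
                break
            contig += nxt[k:]  # append only the non-overlapping part
            cur = nxt
        contigs.append(contig)
    return contigs


# Toy example: four length-6 reads (one duplicated) tiled over a 12-base sequence.
reads = ["ATGCGT", "GCGTAC", "GTACCT", "ACCTGA", "GCGTAC"]
graph = build_graph(reads, min_len=2)
reduced = reduce_transitive(graph)
contigs = compress_paths(reduced)
print(contigs)
```

On this toy input the edge from the first read to the third is removed as transitive, leaving a single chain that compresses into one contig. Tip removal, bubble removal, Similar Neighbour Edge Adjustment, and the A-statistics step are omitted because the abstract does not specify them in enough detail to sketch.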
Keywords :
DNA; bioinformatics; cloud computing; directed graphs; genomics; parallel algorithms; A-statistics; CloudBrush assembler; Contrail; DNA sequencing; E. coli genomes; Edena; MapReduce framework; Myers bidirected string graphs; bubble removal; cloud computing; conventional operations; de novo assembly; error topology structure removal; graph construction; graph simplification; high-throughput sequencing data; next-generation sequencing technologies; parallel algorithm; path compression; prefix-and-extend algorithm; scalable commodity servers; similar neighbour edge adjustment; tip removal; Cloud computing; Conferences; Decision support systems; bioinformatics cloud; de novo sequence assembly; map-reduce;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on
Conference_Location :
Honolulu, HI
ISSN :
2159-6182
Print_ISBN :
978-1-4673-2892-0
Type :
conf
DOI :
10.1109/CLOUD.2012.123
Filename :
6253501