DocumentCode :
1785100
Title :
Spaced seed data structures
Author :
Birol, Inanc ; Mohamadi, Hamid ; Raymond, Anthony ; Raghavan, Karthika ; Chu, James ; Vandervalk, Benjamin P. ; Jackman, Shaun D. ; Warren, Rene L.
Author_Institution :
Canada's Michael Smith Genome Sci. Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
fYear :
2014
fDate :
2-5 Nov. 2014
Firstpage :
15
Lastpage :
22
Abstract :
This past decade, genome sciences have benefitted from rapid advances in DNA sequencing technologies, and development of efficient algorithms for processing short nucleotide sequences played a key role in enabling their uptake in the field. In particular, reassembly of human genomes (de novo or reference-guided) from short DNA sequence reads had a substantial impact on health research. De novo assembly of a genome is essential in the absence of a reference genome sequence of a species. It is also gaining traction even when one is available, due to the utility of the method to resolve ambiguous or rearranged genomic regions with high specificity. With commercial high-throughput sequencing technologies increasing their throughput and their read lengths, the de Bruijn graph (DBG) paradigm used by many assembly algorithms needs to be revisited. DBG uses a table of k-mers, sequences of length k base pairs derived from the reads, and their k-1 base pair overlaps to assemble sequences. Despite longer k-mers unlocking longer genomic features for assembly, associated increases in memory usage and other compute resources are tradeoffs that limit the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we introduce three data structure designs for paired k-mers, or spaced seeds, each addressing memory and run time constraints imposed by longer reads. In spaced seeds, a fixed distance separates k-mer pairs, providing increased sequence specificity with increased distance, while keeping memory usage low. Further, we describe a data structure based on Bloom filters that would be suitable to implicitly store spaced seeds, and would be tolerant to sequencing errors. Building on the spaced seeds Bloom filter, we describe a data structure for tracking the frequencies of observed spaced seeds. 
We expect the data structure designs we introduce in this study to have broad applications in genomics research, with niche applications in genome, transcriptome and metagenome assemblies, and in read error correction.
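Illustrative note (not part of the original record): the abstract describes spaced seeds as k-mer pairs separated by a fixed distance, stored implicitly in a Bloom filter. The sketch below is one minimal reading of that idea, not the authors' implementation. The names `spaced_seeds` and `BloomFilter`, the definition of the gap as the number of bases between the two k-mers, and the use of salted BLAKE2b hashes are all assumptions made for illustration.

```python
import hashlib


def spaced_seeds(read, k, gap):
    """Yield (left, right) k-mer pairs from `read`.

    Assumption: `gap` is the number of bases between the end of the
    left k-mer and the start of the right k-mer.
    """
    span = 2 * k + gap  # total number of bases covered by one seed
    for i in range(len(read) - span + 1):
        yield read[i:i + k], read[i + k + gap:i + span]


class BloomFilter:
    """A plain Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive independent hash values by salting BLAKE2b with the hash index.
        for index in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("ascii"),
                digest_size=8,
                salt=index.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def seed_key(left, right):
    """Hypothetical encoding of a seed as one string; the gap bases are not stored."""
    return left + ":" + right


# Usage: store every spaced seed of a read, then query one of them.
seed_filter = BloomFilter(num_bits=1 << 16, num_hashes=3)
for left, right in spaced_seeds("ACGTACGTAC", k=3, gap=2):
    seed_filter.add(seed_key(left, right))
print(seed_key("ACG", "CGT") in seed_filter)
```

Because only the two flanking k-mers are hashed, the stored key length does not grow with the gap, which is one way to read the abstract's point that specificity increases with distance while memory usage stays low.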
Keywords :
DNA; biochemistry; bioinformatics; data structures; digital storage; feature extraction; genomics; graph theory; information storage; molecular biophysics; molecular configurations; sequences; Bloom filter-based data structure; DBG paradigm; DBG practicability; DNA sequencing technology; ambiguous genomic region resolution; assembly algorithm; assembly archetype; commercial high-throughput sequencing technology; compute resource; data structure design; de Bruijn graph paradigm; de novo human genome reassembly; genome science; genomic feature; genomics research; health research; k base pair; k-1 base pair overlap; k-mer length; k-mer pair distance separation; memory constraint; memory usage; metagenome assembly application; nucleotide sequence processing algorithm; paired k-mer data structure; read error correction; read length; rearranged genomic region resolution; reference genome sequence; reference-guided human genome reassembly; run time constraint; sequence assembly; sequence length; sequence specificity; sequencing error tolerance; short DNA sequence read; spaced seed Bloom filter; spaced seed data structures; spaced seed frequency tracking; spaced seed storage; transcriptome assembly application; Assembly; Bioinformatics; Data structures; Genomics; Information filters; Sequential analysis; ABySS; Bloom filter; de Bruijn graph; de novo assembly; error correction;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on
Conference_Location :
Belfast
Type :
conf
DOI :
10.1109/BIBM.2014.6999305
Filename :
6999305