• DocumentCode
    1376119
  • Title

    A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences

  • Author

    Steinbiss, S. ; Kurtz, S.

  • Author_Institution
    Center for Bioinf., Univ. of Hamburg, Hamburg, Germany
  • Volume
    9
  • Issue
    2
  • fYear
    2012
  • Firstpage
    345
  • Lastpage
    357
  • Abstract
    Today's genome analysis applications require sequence representations allowing for fast access to their contents while also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence representations exist, lack of a generic implementation of efficient sequence storage has led to a plethora of poorly reusable or programming language-specific implementations. We present a novel, space-efficient data structure (GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character transformations, wildcard support, and an assortment of internal representations optimized for different distributions of wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our representation requires only 2 + 8 · 10^-6 bits per character. Implemented in C, our portable software implementation provides a variety of methods for random and sequential access to characters and substrings (including different reading directions) using an object-oriented interface. In addition, it includes access to metadata like sequence descriptions or character distributions. The library is extensible to be used from various scripting languages. GtEncseq is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show that it is competitive with respect to space and time requirements.
  • Keywords
    authoring languages; benchmark testing; bioinformatics; genetics; genomics; information retrieval; object-oriented programming; optimisation; programming languages; software engineering; spatial data structures; benchmarks; character distributions; genome analysis applications; internal representation optimization; large-scale data; metadata like sequence descriptions; multiple biosequence retrieval; multiple biosequence storage; object-oriented interface; portable software implementation; programming language; reading directions; scripting languages; space-efficient data structure; wildcards; Bioinformatics; Data structures; Encoding; Genomics; Libraries; Particle separators; Software; Data storage representations; biology and genetics; reusable libraries; software engineering; Algorithms; Computational Biology; Databases, Genetic; Information Storage and Retrieval; Models, Genetic; Multigene Family; Sequence Analysis
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2011.146
  • Filename
    6081847
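
The abstract reports a cost of roughly 2 bits per character with wildcards handled separately. The sketch below illustrates one way such a figure can arise, assuming a four-letter DNA alphabet packed at 2 bits per base and wildcard runs kept in a sorted side table. It is not the GtEncseq implementation or its API: the class `TwoBitSeq`, its methods, and the 64-bit cost assumed per wildcard range are all hypothetical names and values chosen for illustration.

```python
from bisect import bisect_right

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


class TwoBitSeq:
    """Hypothetical 2-bit sequence store with a wildcard range table."""

    def __init__(self, seq, wildcard="N"):
        self.length = len(seq)
        self.wildcard = wildcard
        # Four bases per byte, 2 bits each.
        self.packed = bytearray((len(seq) + 3) // 4)
        # Sorted, non-overlapping wildcard runs: start positions and lengths.
        self.range_starts = []
        self.range_lengths = []
        for i, ch in enumerate(seq.upper()):
            if ch == wildcard:
                last_end = (
                    self.range_starts[-1] + self.range_lengths[-1]
                    if self.range_starts
                    else -1
                )
                if last_end == i:
                    self.range_lengths[-1] += 1  # extend the current run
                else:
                    self.range_starts.append(i)  # open a new run
                    self.range_lengths.append(1)
                continue  # wildcard positions keep code 0 in the packed array
            # Any character other than A, C, G, T or the wildcard raises KeyError.
            self.packed[i // 4] |= CODE[ch] << (2 * (i % 4))

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Binary search for the last wildcard run starting at or before i.
        k = bisect_right(self.range_starts, i) - 1
        if k >= 0 and i < self.range_starts[k] + self.range_lengths[k]:
            return self.wildcard
        return DECODE[(self.packed[i // 4] >> (2 * (i % 4))) & 3]

    def bits_per_char(self, bits_per_range=64):
        # Assumed cost model: packed bits plus a fixed cost per wildcard run.
        total = 8 * len(self.packed) + bits_per_range * len(self.range_starts)
        return total / self.length


seq = TwoBitSeq("ACGTNNNNACGT")
decoded = "".join(seq[i] for i in range(seq.length))
```

Under this cost model the overhead above 2 bits per character depends on the number of wildcard runs rather than the number of wildcard characters, so a long sequence with few, long runs approaches 2 bits per character. For scale, 3.1 gigabases at 2 bits per character amounts to about 775 million bytes before any wildcard overhead.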