• DocumentCode
    1426910
  • Title
    Rapid Sequence Homology Assessment by Subsampling the Genome Space Using Difference Sets
  • Author
    Brodzik, Andrzej K.
  • Author_Institution
    MITRE Corp., Bedford, MA, USA
  • Volume
    56
  • Issue
    2
  • fYear
    2010
  • Firstpage
    756
  • Lastpage
    770
  • Abstract
    Availability of DNA data is growing roughly at the rate specified by Moore's law. In many molecular biology applications this data must be compared with a reference sequence, either to establish similarity of genomes or to identify functionally homologous subsequences. Current approaches based on pair-wise sequence alignments are computationally expensive and often data dependent. To ameliorate this problem, alternative, less complex sequence comparison schemes, designed to capture the essential features of genomes, must be explored. In this work a new sequence comparison approach, based on difference set models, is proposed. These models are conceptually appropriate, as they quantify, in a certain sense, two key genome attributes: sequence complexity and symbol repetition. Moreover, it is shown that difference sets are abundant in bacterial genomes and that they coincide with homologous sequence segments. These findings motivate the construction of compact representations of DNA sequences in the difference set space. An alignment of these representations permits computationally efficient identification of differences between the DNA sequences. To illustrate the efficacy of the difference set approach, characterization of indels in closely related Bacillus anthracis strains is performed, resulting in the discovery of two previously unreported collections of polymorphisms. In addition to these results, an open problem of extending the difference set approach to difference set and almost difference set families, for the analysis of more distant DNA sequences, is discussed.
  • Keywords
    DNA; biological techniques; genomics; microorganisms; molecular biophysics; Bacillus anthracis strains; bacterial genomes; difference set models; genome space; homologous sequence segments; molecular biology; polymorphisms; rapid sequence homology; subsampling; Bioinformatics; Biological information theory; Biology computing; Boats; Moore's Law; Performance evaluation; Sequences; Bacillus anthracis; Almost difference sets; DNA sequence alignment; DNA sequence homology assessment; DNA sequence markers; correlation; cyclic difference sets; phase-only matched filter; random sequence; repetition; sequence complexity
  • fLanguage
    English
  • Journal_Title
    Information Theory, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9448
  • Type
    jour
  • DOI
    10.1109/TIT.2009.2037036
  • Filename
    5420294