Author_Institution :
MITRE Corp., Bedford, MA, USA
Abstract :
The availability of DNA data is growing at roughly the rate specified by Moore's law. In many molecular biology applications this data must be compared with a reference sequence, either to establish the similarity of genomes or to identify functionally homologous subsequences. Current approaches based on pairwise sequence alignment are computationally expensive and often data-dependent. To ameliorate this problem, alternative, less complex sequence comparison schemes, designed to capture the essential features of genomes, must be explored. In this work a new sequence comparison approach, based on difference set models, is proposed. These models are conceptually appropriate, as they quantify, in a certain sense, two key genome attributes: sequence complexity and symbol repetition. Moreover, it is shown that difference sets are abundant in bacterial genomes and that they coincide with homologous sequence segments. These findings motivate the construction of compact representations of DNA sequences in the difference set space. Aligning these representations permits computationally efficient identification of differences between the DNA sequences. To illustrate the efficacy of the difference set approach, indels in closely related Bacillus anthracis strains are characterized, resulting in the discovery of two previously unreported collections of polymorphisms. In addition to these results, the open problem of extending the difference set approach to difference set families and almost difference set families, for the analysis of more distant DNA sequences, is discussed.
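For readers unfamiliar with the term, the following is a minimal sketch of the textbook definition of a cyclic difference set: a set of k positions modulo v in which every nonzero residue occurs exactly λ times as a difference of two distinct positions. It is not the model construction used in this work; how symbol positions in a DNA sequence are mapped to such sets is not specified in the abstract, and the function names below are hypothetical.

```python
from collections import Counter


def difference_counts(positions, v):
    """Count how often each residue mod v occurs as a difference a - b
    of two distinct elements of `positions` (assumed distinct mod v)."""
    counts = Counter()
    for a in positions:
        for b in positions:
            if a != b:
                counts[(a - b) % v] += 1
    return counts


def cyclic_difference_set_lambda(positions, v):
    """Return lambda if `positions` is a (v, k, lambda) cyclic difference
    set, i.e. every nonzero residue occurs equally often; else None."""
    counts = difference_counts(positions, v)
    values = {counts.get(r, 0) for r in range(1, v)}
    if len(values) == 1:
        lam = values.pop()
        return lam if lam > 0 else None
    return None


# {1, 2, 4} mod 7 is the classic (7, 3, 1) example.
print(cyclic_difference_set_lambda([1, 2, 4], 7))
# {0, 1, 2} mod 7 is not a difference set: residues 3 and 4 never occur.
print(cyclic_difference_set_lambda([0, 1, 2], 7))
```

The uniform difference count is what gives such sets their two-valued periodic autocorrelation, which is one plausible reading of the "correlation" and "repetition" keywords listed for this paper.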
Keywords :
DNA; biological techniques; genomics; microorganisms; molecular biophysics; Bacillus anthracis strains; DNA; bacterial genomes; difference set models; genome space; homologous sequence segments; molecular biology; polymorphisms; rapid sequence homology; subsampling; Bioinformatics; Biological information theory; Biology computing; Boats; DNA; Genomics; Microorganisms; Moore's Law; Performance evaluation; Sequences; Bacillus anthracis; Almost difference sets; DNA sequence alignment; DNA sequence homology assessment; DNA sequence markers; correlation; cyclic difference sets; phase-only matched filter; random sequence; repetition; sequence complexity