• DocumentCode
    1362157
  • Title

    Multiple Sequence Assembly from Reads Alignable to a Common Reference Genome

  • Author

    Peng, Qian ; Smith, Andrew D.

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Univ. of California, La Jolla, CA, USA
  • Volume
    8
  • Issue
    5
  • fYear
    2011
  • Firstpage
    1283
  • Lastpage
    1295
  • Abstract
    We describe a set of computational problems motivated by certain analysis tasks in genome resequencing. These are assembly problems for which multiple distinct sequences must be assembled, but where the relative positions of reads to be assembled are already known. This information is obtained from a common reference genome and is characteristic of resequencing experiments. The simplest variant of the problem aims at determining a minimum set of superstrings such that each sequenced read matches at least one superstring. We give an algorithm with time complexity O(N), where N is the sum of the lengths of reads, substantially improving on previous algorithms for solving the same problem. We also examine the problem of finding the smallest number of reads to remove such that the remaining reads are consistent with k superstrings. By exploiting a surprising relationship with the minimum cost flow problem, we show that this problem can be solved in polynomial time when nested reads are excluded. If nested reads are permitted, this problem of removing the minimum number of reads becomes NP-hard. We show that permitting mismatches between reads and their nearest superstrings generally renders these problems NP-hard.
  • Keywords
    biology computing; computational complexity; genomics; molecular biophysics; molecular configurations; polynomials; superstrings; NP-hard problem; assembly problems; common reference genome; genome resequencing; minimum cost flow problem; multiple sequence assembly; polynomial time; resequencing; sequenced read matches; superstrings; time complexity; Assembly; Bioinformatics; Computational biology; DNA; Genomics; Instruments; Organisms; Combinatorics; chain and antichain; haplotyping; sequence assembly; superstring.; Algorithms; Base Sequence; Genomics; Haplotypes; Models, Genetic; Molecular Sequence Data; Ploidies; Sequence Alignment; Sequence Analysis, DNA;
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    ieee
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2010.107
  • Filename
    5611488