Abstract :
Summary form only given. Given a set of n sequences, the multiple sequence alignment problem is to align these n sequences, with or without gaps, so that what the sequences have in common is brought out. Let m be the total length of the input sequences, A the alphabet size of the input sequences, and P the final number of unique patterns, fixed by the user, that cause an alignment between sequences. The algorithm then runs in O(m(A + P)) time in the worst case, which is linear in m for fixed A and P. Our algorithm handles sequences over both small and large alphabets. It forms the alignment by first discovering patterns, and is therefore also a pattern discovery solution. We support our theoretical conclusions with experimental results obtained by running our algorithm on GenPept sequences and human genome sequences from the GenBank public domain database. Our algorithm uses direct n-wise alignment and constant memory space irrespective of the value of m. What differentiates this algorithm from most others is that it is deterministic: it is proved that all patterns of arbitrary length that occur in at least k sequences, and that are responsible for the multiple sequence alignment, are guaranteed to be found by the algorithm, where k is specified by the user.
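To make the "occurs in at least k sequences" criterion concrete, the following is a minimal illustrative sketch, not the algorithm summarized above. It assumes fixed-length exact substrings as patterns and uses a plain dictionary, so it is neither constant-space nor restricted to the O(m(A + P)) bound; the function name `patterns_in_at_least_k` and the toy sequences are hypothetical.

```python
from collections import defaultdict


def patterns_in_at_least_k(sequences, length, k):
    """Return {pattern: set of sequence indices} for every substring of the
    given length that occurs in at least k distinct sequences.

    Naive illustration only: memory grows with the number of distinct
    substrings, unlike the constant-space method described in the abstract.
    """
    occurrences = defaultdict(set)
    for index, seq in enumerate(sequences):
        # Slide a window of the chosen length over each sequence.
        for start in range(len(seq) - length + 1):
            occurrences[seq[start:start + length]].add(index)
    # Keep only patterns supported by at least k different sequences.
    return {p: ids for p, ids in occurrences.items() if len(ids) >= k}


if __name__ == "__main__":
    toy = ["ACGTAC", "TTACGT", "GGACGA"]  # hypothetical toy input
    print(patterns_in_at_least_k(toy, length=3, k=3))
```

Lowering k admits patterns shared by fewer sequences, which is the role the user-specified k plays in the guarantee stated above.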
Keywords :
biology computing; computational complexity; deterministic algorithms; genetics; pattern recognition; sequences; trees (mathematics); GenBank public domain database; GenPept sequences; alphabet size; constant memory space; deterministic constant-space linear-time algorithm; direct n-wise alignment; human genome sequences; input sequences; linear worst case time; multiple sequence alignment; pattern discovery; selective tree growing; time bound; unique patterns; Bioinformatics; Clocks; Computer Society; DNA; Databases; Genomics; Humans; Linux; Pattern matching; Sequences;