Author_Institution :
Dept. of Comput. Sci., Vermont Univ., Burlington, VT
Abstract :
The edit distance between two given strings X and Y is the minimum number of edit operations that transform X into Y. Ordinarily, string editing is based on character insert, delete, and substitute operations. It has been suggested that extending this model with block (substring) edits would be useful in applications such as DNA sequence comparison. In its general form, the resulting problem is NP-hard. However, there are efficient algorithms when string edits include only character and block replacements. We introduce a new edit model which permits insertions, deletions, and substitutions at the character level, and also reversals of substrings. We present an algorithm whose worst-case time complexity is O(n^2 m), where n = |X| ≤ m = |Y|, and we prove that the average running time of the algorithm is O(nm). Our experiments on randomly generated strings verify these results. The main contribution of this paper is an algorithm that finds all possible reversals using a generalized suffix tree and is fast on average.
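To make the edit model concrete, the following is a minimal Python sketch of a dynamic program over character insertions, deletions, and substitutions, extended with substring reversals. It is not the paper's algorithm: it does not use a generalized suffix tree, it finds reversals by naive slice comparison, and it makes no attempt to match the stated complexity bounds. The function name `edit_distance_with_reversals`, the unit cost for every operation, and the rule that a reversed block of X must exactly equal the corresponding block of Y are all assumptions made for illustration.

```python
def edit_distance_with_reversals(x: str, y: str) -> int:
    """Edit distance with unit-cost insert, delete, substitute, and
    exact substring reversal (illustrative assumptions, see lead-in)."""
    n, m = len(x), len(y)
    # d[i][j] = minimum cost of transforming x[:i] into y[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters of x[:i]
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters of y[:j]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            best = min(
                d[i - 1][j] + 1,        # delete x[i-1]
                d[i][j - 1] + 1,        # insert y[j-1]
                d[i - 1][j - 1] + sub,  # match or substitute
            )
            # Assumed reversal rule: a block of length k >= 2 ending at
            # x[i-1] may be reversed at cost 1 if the reversed block
            # equals the block of the same length ending at y[j-1].
            for k in range(2, min(i, j) + 1):
                if x[i - k:i][::-1] == y[j - k:j]:
                    best = min(best, d[i - k][j - k] + 1)
            d[i][j] = best
    return d[n][m]


if __name__ == "__main__":
    print(edit_distance_with_reversals("abcd", "dcba"))
    print(edit_distance_with_reversals("xabcdy", "xdcbay"))
```

In this sketch, reversing "abcd" into "dcba" counts as a single operation, whereas character edits alone would need several. The inner loop over block lengths with slice comparison is the naive part; the abstract's contribution is precisely a faster way to locate candidate reversals.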
Keywords :
DNA; biology computing; computational complexity; molecular biophysics; string matching; trees (mathematics); DNA sequence comparison; block replacements; block reversal; character delete; character insert; character substitute operations; generalized suffix tree; randomly generated strings; string edit distance; substring reversals; Bioinformatics; Biological system modeling; Biology computing; Computational modeling; Computer science; DNA; Genetic mutations; Genomics; Sequences; Sorting;