Title :
Linear global detectors of redundant and rare substrings
Author :
Apostolico, Alberto ; Bock, Mary Ellen ; Lonardi, Stefano
Author_Institution :
Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA
Abstract :
The identification of strings that are, by some measure, redundant or rare in the context of larger sequences is an implicit goal of any data compression method. In the straightforward approach to searching for unusual substrings, the words (up to a certain length) are enumerated more or less exhaustively and individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and significance thereof. As is well known, clever methods are available to compute and organize the counts of occurrences of all substrings of a given string. The corresponding tables take up the tree-like structure of a special kind of digital search index or trie. We show here that under several accepted measures of deviation from expected frequency, the candidate over- or under-represented words are restricted to the O(n) words that end at internal nodes of a compact suffix tree, as opposed to the Θ(n^2) possible substrings. This surprising fact is a consequence of properties of the following form: if a word that ends in the middle of an arc is, say, over-represented, then its extension to the nearest node of the tree is even more so. Based on this, we design global linear detectors of favored and unfavored words for our probabilistic framework, and display the results of some preliminary experiments that apply our constructions to the analysis of genomic sequences.
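The arc property described in the abstract can be illustrated with a small sketch. It is not the authors' linear-time construction: it uses a naive quadratic enumeration, and the function names (`substring_tables`, `extend_to_node`, `branching_words`) and the `$` end marker are assumptions made for illustration. It shows only that a word ending in the middle of an arc has the same occurrence count as its extension to the nearest node. Under a model in which the expected count cannot grow when a word is extended (an assumption here, not something the abstract states), this is consistent with the longer word being at least as over-represented.

```python
from collections import defaultdict


def substring_tables(text):
    """Naive O(n^2) tables: occurrence count and set of following symbols
    for every distinct substring of `text` (which should end in a unique
    end marker such as '$')."""
    counts = defaultdict(int)
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            word = text[i:j]
            counts[word] += 1
            if j < n:
                followers[word].add(text[j])
    return counts, followers


def extend_to_node(word, followers):
    """Extend `word` while exactly one symbol can follow it, i.e. walk
    down the arc to the nearest node of the compact suffix tree."""
    while len(followers.get(word, ())) == 1:
        (symbol,) = followers[word]
        word += symbol
    return word


def branching_words(followers):
    """Non-empty words followed by at least two distinct symbols; these
    end at internal nodes of the compact suffix tree."""
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


if __name__ == "__main__":
    text = "abracadabra$"  # '$' is a hypothetical end marker
    counts, followers = substring_tables(text)
    for word in ("ab", "b", "r"):
        node = extend_to_node(word, followers)
        print(word, counts[word], "->", node, counts[node])
    print(len(branching_words(followers)), "internal-node words out of",
          len(counts), "distinct substrings")
```

In this toy input, only a handful of words end at internal nodes, while the number of distinct substrings grows quadratically; a real detector would build the suffix tree in linear time rather than enumerate substrings.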
Keywords :
data compression; probability; redundancy; sequences; string matching; tree data structures; tree searching; digital search index; genomic sequences; identification; linear global detectors; probabilistic framework; rare substrings; redundant substrings; suffix tree; trie; Application software; Bioinformatics; Biology computing; Buildings; Computational biology; Councils; Data compression; Detectors; Displays; Feature extraction; Frequency; Frequency measurement; Genomics; Statistics
Conference_Title :
Data Compression Conference, 1999. Proceedings. DCC '99
Conference_Location :
Snowbird, UT
Print_ISBN :
0-7695-0096-X
DOI :
10.1109/DCC.1999.755666