DocumentCode
1514465
Title
Discovery of Delta Closed Patterns and Noninduced Patterns from Sequences
Author
Wong, Andrew K.C. ; Zhuang, Dennis ; Li, Gary C L ; Lee, En-Shiun Annie
Author_Institution
Dept. of Syst. Design Eng., Univ. of Waterloo, Waterloo, ON, Canada
Volume
24
Issue
8
fYear
2012
Firstpage
1408
Lastpage
1421
Abstract
Discovering patterns from sequence data has a significant impact on many aspects of science and society, especially in genomics and proteomics. Here we consider multiple strings as the input sequence data and substrings as patterns. In practice, a large set of patterns can usually be discovered, yet many of them are redundant, which degrades the output quality. This paper improves the output quality by removing two types of redundant patterns. First, the notion of the delta tolerance closed itemset is employed to remove redundant patterns that are not delta closed. Second, the concept of statistically induced patterns is proposed to capture redundant patterns that appear statistically significant only because their significance is induced by strongly significant subpatterns. Mining the remaining nonredundant patterns (delta closed patterns and noninduced patterns) is computationally intensive. To discover these patterns efficiently in very large sequence data, two algorithms have been developed through an innovative use of the suffix tree. Three sets of experiments were conducted to evaluate their performance, and the algorithms render excellent results when applied to genomic data. The experiments confirm that the proposed algorithms are efficient and that they produce a relatively small set of patterns that reveal interesting information in the sequences.
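As an illustration of the first redundancy filter, the following is a minimal brute-force sketch, not the suffix-tree algorithm of the paper. It assumes the usual delta tolerance reading, which the abstract does not spell out: a substring pattern with support s is delta closed if no one-character extension of it has support of at least (1 - delta) * s. Support is taken here to be the total number of occurrences across all input strings, and the names `substring_counts`, `delta_closed_patterns`, `min_support`, and `max_len` are hypothetical.

```python
from collections import Counter


def substring_counts(sequences, max_len):
    """Count every occurrence of every substring of length <= max_len."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            for j in range(i + 1, min(i + max_len, len(seq)) + 1):
                counts[seq[i:j]] += 1
    return counts


def delta_closed_patterns(sequences, delta, min_support=2, max_len=8):
    """Return {pattern: support} for patterns that are delta closed.

    Assumed definition: a pattern is kept when every one-character
    extension (on the left or on the right) has support strictly below
    (1 - delta) * support(pattern). Checking one-character extensions is
    enough because any longer superstring occurs no more often than the
    one-character extension it contains.
    """
    # Count one extra character so extensions of the longest patterns are seen.
    counts = substring_counts(sequences, max_len + 1)
    alphabet = {ch for seq in sequences for ch in seq}
    result = {}
    for pattern, support in counts.items():
        if len(pattern) > max_len or support < min_support:
            continue
        threshold = (1 - delta) * support
        extensions = [c + pattern for c in alphabet]
        extensions += [pattern + c for c in alphabet]
        if all(counts.get(ext, 0) < threshold for ext in extensions):
            result[pattern] = support
    return result


if __name__ == "__main__":
    print(delta_closed_patterns(["ABCABC", "ABCD"], delta=0.0))
    # prints {'ABC': 3}
```

With delta = 0 this reduces to ordinary closed patterns: in the usage example, A, B, C, AB, and BC all occur three times but are each absorbed by ABC, which occurs equally often. Raising delta also removes patterns whose support is only slightly higher than that of an extension. The enumeration is quadratic in the sequence length, which is the cost the suffix tree is meant to avoid.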
Keywords
data mining; sequences; delta closed pattern discovery; delta tolerance closed itemset; genomics; noninduced patterns; output quality; proteomics; redundant patterns; sequence data; statistically induced patterns; suffix tree; Algorithm design and analysis; Frequency measurement; Hidden Markov models; Itemsets; Markov processes; Sequence pattern discovery; delta closed patterns
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2011.100
Filename
5765954