  • DocumentCode
    1514465
  • Title
    Discovery of Delta Closed Patterns and Noninduced Patterns from Sequences
  • Author
    Wong, Andrew K.C. ; Zhuang, Dennis ; Li, Gary C.L. ; Lee, En-Shiun Annie
  • Author_Institution
    Dept. of Syst. Design Eng., Univ. of Waterloo, Waterloo, ON, Canada
  • Volume
    24
  • Issue
    8
  • fYear
    2012
  • Firstpage
    1408
  • Lastpage
    1421
  • Abstract
    Discovering patterns from sequence data has a significant impact on many aspects of science and society, especially genomics and proteomics. Here we consider multiple strings as the input sequence data and substrings as patterns. In practice, a large set of patterns can usually be discovered, yet many of them are redundant, which degrades the output quality. This paper improves the output quality by removing two types of redundant patterns. First, the notion of the delta tolerance closed itemset is employed to remove redundant patterns that are not delta closed. Second, the concept of statistically induced patterns is proposed to capture redundant patterns that appear statistically significant only because their significance is induced by strongly significant subpatterns. Mining these nonredundant patterns (delta closed patterns and noninduced patterns) is computationally intensive. To discover them efficiently in very large sequence data, two algorithms have been developed through an innovative use of the suffix tree. Three sets of experiments were conducted to evaluate their performance, and the algorithms yield excellent results when applied to genomic data. The experiments confirm that the proposed algorithms are efficient and that they produce a relatively small set of patterns that reveal interesting information in the sequences.
  • Keywords
    data mining; sequences; delta closed pattern discovery; delta tolerance closed itemset; genomics; noninduced patterns; output quality; proteomics; redundant patterns; sequence data; statistically induced patterns; suffix tree; Algorithm design and analysis; Frequency measurement; Hidden Markov models; Itemsets; Markov processes; Sequence pattern discovery; delta closed patterns
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2011.100
  • Filename
    5765954