• DocumentCode
    3334563
  • Title
    Mining Frequent Patterns with Wildcards from Biological Sequences
  • Author
    He, Yu ; Wu, Xindong ; Zhu, Xingquan ; Arslan, Abdullah N.
  • Author_Institution
    Univ. of Vermont, Burlington
  • fYear
    2007
  • fDate
    13-15 Aug. 2007
  • Firstpage
    329
  • Lastpage
    334
  • Abstract
    Frequent pattern mining from sequences is a crucial step for many domain experts, such as molecular biologists, to discover rules or patterns hidden in their data. To find specific patterns, many existing tools require users to specify gap constraints beforehand. In practice, it is often nontrivial for a user to provide such gap constraints. In addition, a change to the gap values may give completely different results and require a separate, time-consuming re-mining procedure. Consequently, it is desirable to develop an algorithm that automatically and efficiently finds patterns without user-specified gap constraints. In this paper, a frequent pattern mining problem without user-specified gap constraints is presented and studied. Given a sequence and a support threshold value, all subsequences whose support is not less than the given threshold value will be discovered. These frequent subsequences then form patterns. Two heuristic methods (one-way vs. two-way scan) are proposed to mine frequent subsequences and estimate the maximum support for both artificial and real-world data. For a given pattern, the simulation results demonstrate that the one-way scan heuristic performs better, estimating the maximum support with more than ninety percent accuracy.
  • Keywords
    DNA; biology computing; data mining; pattern recognition; DNA biological sequences; discover rules; frequent pattern mining; frequent subsequence mining; gap constraints; molecular biology; pattern finding; wildcard; Amino acids; Bioinformatics; Biological information theory; Biology; Computer science; DNA; Data engineering; Genomics; Helium; Protein sequence;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Reuse and Integration, 2007. IRI 2007. IEEE International Conference on
  • Conference_Location
    Las Vegas, NV
  • Print_ISBN
    1-4244-1500-4
  • Electronic_ISBN
    1-4244-1500-4
  • Type
    conf
  • DOI
    10.1109/IRI.2007.4296642
  • Filename
    4296642