• DocumentCode
    3269013
  • Title

    Unordered tree mining with applications to phylogeny

  • Author

    Shasha, Dennis ; Wang, Jason T L ; Zhang, Sen

  • Author_Institution
    Courant Inst. of Math. Sci., New York Univ., NY, USA
  • fYear
    2004
  • fDate
    30 March-2 April 2004
  • Firstpage
    708
  • Lastpage
    719
  • Abstract
    Frequent structure mining (FSM) aims to discover and extract patterns frequently occurring in structural data, such as trees and graphs. FSM finds many applications in bioinformatics, XML processing, Web log analysis, and so on. We present a new FSM technique for finding patterns in rooted unordered labeled trees. The patterns of interest are cousin pairs in these trees. A cousin pair is a pair of nodes sharing the same parent, the same grandparent, or the same great-grandparent, etc. Given a tree T, our algorithm finds all interesting cousin pairs of T in O(|T|^2) time where |T| is the number of nodes in T. Experimental results on synthetic data and phylogenies show the scalability and effectiveness of the proposed technique. To demonstrate the usefulness of our approach, we discuss its applications to locating co-occurring patterns in multiple evolutionary trees, evaluating the consensus of equally parsimonious trees, and finding kernel trees of groups of phylogenies. We also describe extensions of our algorithms for undirected acyclic graphs (or free trees).
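    The cousin-pair definition in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it is a naive enumeration, and the function name `cousin_pairs`, the parent-map representation, the `max_distance` cutoff, and the distance numbering (0 for a shared parent, 1 for a shared grandparent, and so on) are all assumptions made for illustration.

```python
from itertools import combinations


def cousin_pairs(parent, max_distance=1):
    """Enumerate node pairs that share a parent, grandparent, etc.

    parent: dict mapping each node to its parent (the root maps to None).
    Returns (u, v, k) tuples, where k = 0 means the pair shares a parent,
    k = 1 a grandparent, and so on. The numbering is an illustrative
    convention, not necessarily the paper's.
    """
    def ancestors(node):
        # Chain of ancestors from the parent up to the root.
        chain = []
        node = parent[node]
        while node is not None:
            chain.append(node)
            node = parent[node]
        return chain

    anc = {n: ancestors(n) for n in parent}
    pairs = []
    for u, v in combinations(sorted(parent), 2):
        # Nodes sharing the same k-th ancestor must be at the same depth.
        if len(anc[u]) != len(anc[v]):
            continue
        # Find the first level at which the two ancestor chains coincide.
        for k, (a, b) in enumerate(zip(anc[u], anc[v])):
            if a == b:
                if k <= max_distance:
                    pairs.append((u, v, k))
                break
    return pairs


# Hypothetical toy tree: r has children a, b; a has children c, d; b has child e.
toy_tree = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a", "e": "b"}
print(cousin_pairs(toy_tree))
```

    The sketch compares every pair of nodes and walks their ancestor chains, so it makes no attempt to match the O(|T|^2) bound stated in the abstract, and it ignores node labels and any notion of which pairs count as "interesting".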
  • Keywords
    data mining; graph theory; pattern recognition; tree data structures; Web log analysis; XML processing; bioinformatics; co-occurring pattern; cousin pair; free tree; frequent structure mining; kernel tree; multiple evolutionary tree; pattern discovery; pattern extraction; phylogeny; rooted unordered labeled trees; structural data; undirected acyclic graph; unordered tree mining; Bioinformatics; Data mining; Educational institutions; History; Kernel; Organisms; Phylogeny; Scalability; Tree graphs; XML;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2004. Proceedings. 20th International Conference on
  • ISSN
    1063-6382
  • Print_ISBN
    0-7695-2065-0
  • Type

    conf

  • DOI
    10.1109/ICDE.2004.1320039
  • Filename
    1320039