• DocumentCode
    3106411
  • Title

    GraphRank: Statistical Modeling and Mining of Significant Subgraphs in the Feature Space

  • Author

    He, Huahai ; Singh, Ambuj K.

  • Author_Institution
    Dept. of Comput. Sci., Univ. of California, Santa Barbara, CA
  • fYear
    2006
  • fDate
    18-22 Dec. 2006
  • Firstpage
    885
  • Lastpage
    890
  • Abstract
    We propose a technique for evaluating the statistical significance of frequent subgraphs in a database. A graph is represented by a feature vector that is a histogram over a set of basis elements. The set of basis elements is chosen based on domain knowledge and consists generally of vertices, edges, or small graphs. A given subgraph is transformed to a feature vector and the significance of the subgraph is computed by considering the significance of occurrence of the corresponding vector. The probability of occurrence of the vector in a random vector is computed based on the prior probability of the basis elements. This is then used to obtain a probability distribution on the support of the vector in a database of random vectors. The statistical significance of the vector/subgraph is then defined as the p-value of its observed support. We develop efficient methods for computing p-values and lower bounds. A simplified model is further proposed to improve the efficiency. We also address the problem of feature vector mining, a generalization of item-set mining where counts are associated with items and the goal is to find significant sub-vectors. We present an algorithm that explores closed frequent sub-vectors to find significant ones. Experimental results show that the proposed techniques are effective, efficient, and useful for ranking frequent subgraphs by their statistical significance.
  • Keywords
    data mining; data structures; database management systems; graph theory; statistical distributions; GraphRank; feature space; feature vector mining; frequent subgraphs statistical significance; histogram; item-set mining; occurrence probability; probability distribution; random vectors database; significant subgraphs mining; significant subgraphs modeling; Chemical analysis; Computer science; Data mining; Helium; Histograms; Itemsets; Multimedia databases; Probability distribution; Proteins; Spatial databases;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2006. ICDM '06. Sixth International Conference on
  • Conference_Location
    Hong Kong
  • ISSN
    1550-4786
  • Print_ISBN
    0-7695-2701-7
  • Type

    conf

  • DOI
    10.1109/ICDM.2006.79
  • Filename
    4053121
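
The significance computation outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and makes two assumptions that the abstract does not state: features are treated as independent, so the occurrence probability of a vector is a product of per-feature tail probabilities, and the support in a database of random vectors is treated as binomial. The names `occurrence_probability`, `p_value`, and `tail_probs` are hypothetical.

```python
from math import comb


def occurrence_probability(x, tail_probs):
    """Probability that feature vector x occurs in one random vector.

    Assumption (not from the abstract): features are independent, and
    tail_probs[i][k] is the prior probability that feature i of a random
    vector has a count of at least k.
    """
    p = 1.0
    for i, count in enumerate(x):
        p *= tail_probs[i][count]
    return p


def p_value(observed_support, db_size, p):
    """Upper tail P(support >= observed_support).

    Assumption (not from the abstract): support follows a binomial
    distribution over db_size independent random vectors, each containing
    the vector with probability p.
    """
    return sum(
        comb(db_size, k) * p**k * (1.0 - p) ** (db_size - k)
        for k in range(observed_support, db_size + 1)
    )


# Hypothetical priors for two basis elements.
tail_probs = [
    [1.0, 0.5, 0.25],  # P(count >= 0), P(count >= 1), P(count >= 2)
    [1.0, 0.2],        # P(count >= 0), P(count >= 1)
]
p = occurrence_probability((2, 1), tail_probs)
significance = p_value(observed_support=5, db_size=100, p=p)
```

A smaller p-value means the observed support is less likely under the random model, so ranking frequent subgraphs by ascending p-value places the most significant ones first.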