• DocumentCode
    2725079
  • Title

    More Efficient Classification of Web Content Using Graph Sampling

  • Author

    Bennett, Chris

  • Author_Institution
    Dept. of Comput. Sci., Univ. of Georgia, Athens, GA
  • fYear
    2007
  • fDate
    March 1 2007-April 5 2007
  • Firstpage
    485
  • Lastpage
    490
  • Abstract
    In mining information from very large graphs, processing time and system memory become computational bottlenecks because the properties of large graphs must be compared in each iteration of an algorithm. The problem is particularly pronounced for complex properties. For example, distance metrics are used in many fundamental data mining algorithms, including k-nearest neighbors for the classification task. Even relatively efficient distance and similarity heuristics, though, often require processing time and memory well beyond linear in the size of the input, which rapidly becomes intractable for very large inputs. Computing complex properties such as the distance between two graphs can be extremely costly, but calculating the same properties on samples of these large graphs significantly reduces memory requirements and processing time without sacrificing classification quality. Because the vast amount of Web data is easily and robustly represented with graphs, a data reduction technique that preserves the accuracy of mining algorithms on such inputs is important. The sampling techniques presented here show that very large graphs of Web content can be condensed into significantly smaller yet equally expressive graphs that lead to accurate but more efficient classification of Web content.
  • Keywords
    Internet; classification; data mining; graph theory; Web content classification; graph sampling; information mining; k-nearest neighbors; very large graphs; Clustering algorithms; Computational intelligence; Computer science; Image processing; Information systems; Robustness; Sampling methods; Sorting; USA Councils
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence and Data Mining, 2007. CIDM 2007. IEEE Symposium on
  • Conference_Location
    Honolulu, HI
  • Print_ISBN
    1-4244-0705-2
  • Type

    conf

  • DOI
    10.1109/CIDM.2007.368914
  • Filename
    4221338