DocumentCode
2725079
Title
More Efficient Classification of Web Content Using Graph Sampling
Author
Bennett, Chris
Author_Institution
Dept. of Comput. Sci., Georgia Univ., Athens, GA
fYear
2007
fDate
March 1 2007-April 5 2007
Firstpage
485
Lastpage
490
Abstract
In mining information from very large graphs, processing time and system memory become computational bottlenecks, because the properties of large graphs must be compared in each iteration of an algorithm. The problem is particularly pronounced for complex properties. For example, distance metrics are used in many fundamental data mining algorithms, including k-nearest neighbors for classification. Even the relatively efficient distance and similarity heuristics for large inputs often require processing time and memory well beyond linear in the size of the input, which rapidly becomes intractable for very large inputs. Complex properties such as the distance between two graphs can be extremely costly to compute, but calculating the same properties on samples of these large graphs significantly reduces memory requirements and processing time without sacrificing classification quality. Because the vast amount of Web data is easily and robustly represented with graphs, a data reduction technique that preserves the accuracy of mining algorithms on such inputs is important. The sampling techniques presented here show that very large graphs of Web content can be condensed into significantly smaller yet equally expressive graphs that lead to accurate and more efficient classification of Web content.
Keywords
Internet; classification; data mining; graph theory; Web content classification; graph sampling; information mining; k-nearest neighbors; very large graphs; Clustering algorithms; Computational intelligence; Computer science; Data mining; Image processing; Information systems; Robustness; Sampling methods; Sorting; USA Councils
fLanguage
English
Publisher
IEEE
Conference_Titel
Computational Intelligence and Data Mining, 2007. CIDM 2007. IEEE Symposium on
Conference_Location
Honolulu, HI
Print_ISBN
1-4244-0705-2
Type
conf
DOI
10.1109/CIDM.2007.368914
Filename
4221338