Title :
More Efficient Classification of Web Content Using Graph Sampling
Author_Institution :
Dept. of Comput. Sci., Georgia Univ., Athens, GA
Date :
March 1 2007-April 5 2007
Abstract :
In mining information from very large graphs, processing time and system memory become computational bottlenecks, because the properties of large graphs must be compared in each iteration of an algorithm. The problem is particularly pronounced for complex properties. For example, distance metrics are used in many fundamental data mining algorithms, including k-nearest neighbors for the classification task. Even relatively efficient distance and similarity heuristics, though, often require processing time and memory well beyond linear in the size of the input, which rapidly becomes intractable for very large inputs. Computing complex properties such as the distance between two graphs can be extremely costly, but calculating the same properties on samples of these large graphs reduces memory requirements and processing time significantly without sacrificing quality of classification. Because the vast amount of Web data is easily and robustly represented with graphs, a data reduction technique that preserves the accuracy of mining algorithms on such inputs is important. The sampling techniques presented here show that very large graphs of Web content can be condensed into significantly smaller yet equally expressive graphs that lead to accurate but more efficient classification of Web content.
Keywords :
Internet; classification; data mining; graph theory; Web content classification; graph sampling; information mining; k-nearest neighbors; very large graphs; Clustering algorithms; Computational intelligence; Computer science; Image processing; Information systems; Robustness; Sampling methods; Sorting; USA Councils
Conference_Title :
Computational Intelligence and Data Mining, 2007. CIDM 2007. IEEE Symposium on
Conference_Location :
Honolulu, HI
Print_ISBN :
1-4244-0705-2
DOI :
10.1109/CIDM.2007.368914