More Efficient Classification of Web Content Using Graph Sampling

Author

Bennett, Chris

Author_Institution

Dept. of Comput. Sci., Georgia Univ., Athens, GA

fYear

2007

fDate

March 1 2007-April 5 2007

Firstpage

485

Lastpage

490

Abstract

In mining information from very large graphs, processing time and system memory become computational bottlenecks because the properties of large graphs must be compared in each iteration of an algorithm. The problem is particularly pronounced for complex properties. For example, distance metrics are used in many fundamental data mining algorithms, including k-nearest neighbors for classification. Even relatively efficient distance and similarity heuristics, however, often require processing time and memory well beyond linear in the size of the input, which rapidly becomes intractable for very large inputs. Computing complex properties such as the distance between two graphs can be extremely costly, but calculating the same properties on samples of these large graphs significantly reduces memory requirements and processing time without sacrificing classification quality. Because the vast amount of Web data is easily and robustly represented with graphs, a data reduction technique that preserves the accuracy of mining algorithms on such inputs is important. The sampling techniques presented here show that very large graphs of Web content can be condensed into significantly smaller yet equally expressive graphs, leading to accurate but more efficient classification of Web content.
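
The general idea in the abstract (classify with k-nearest neighbors, but compute graph distances on small samples rather than on the full graphs) can be sketched in Python. The abstract does not state which sampling scheme or distance measure the paper uses, so everything below is an assumption made for illustration: the degree-based sampling, the Jaccard distance over edge sets, the function names, and the toy graphs are hypothetical and are not the author's method.

```python
from collections import Counter

def sample_graph(edges, k):
    """Keep the subgraph induced by the k highest-degree nodes.

    Assumed sampling rule (not from the paper); ties are broken by node label.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    keep = set(sorted(degree, key=lambda n: (-degree[n], n))[:k])
    return {(u, v) for u, v in edges if u in keep and v in keep}

def edge_distance(a, b):
    """Jaccard distance between two edge sets (assumed metric): 0 = same, 1 = disjoint."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def knn_classify(query, training, k_neighbors=1, sample_size=3):
    """Label a query graph by majority vote of its nearest sampled training graphs."""
    q = sample_graph(query, sample_size)
    scored = sorted(
        (edge_distance(q, sample_graph(g, sample_size)), label)
        for g, label in training
    )
    votes = Counter(label for _, label in scored[:k_neighbors])
    return votes.most_common(1)[0][0]

# Toy term graphs (hypothetical data): an edge links two terms that occur together.
sports = {("team", "wins"), ("team", "game"), ("team", "score"),
          ("game", "score"), ("fan", "game")}
finance = {("stock", "price"), ("stock", "market"), ("market", "price"),
           ("stock", "trader"), ("bank", "market")}
query = {("team", "game"), ("team", "score"), ("game", "score"),
         ("team", "coach")}

predicted = knn_classify(query, [(sports, "sports"), (finance, "finance")])
print(predicted)
```

In this sketch each distance computation touches only the edges among the sampled nodes, so its cost depends on the sample size rather than on the size of the original graph. Whether accuracy is preserved depends on the sampling rule and the data, which is the question the paper itself addresses.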

Keywords

Internet; classification; data mining; graph theory; Web content classification; graph sampling; information mining; k-nearest neighbors; very large graphs; Clustering algorithms; Computational intelligence; Computer science; Data mining; Image processing; Information systems; Robustness; Sampling methods; Sorting; USA Councils

fLanguage

English

Publisher

IEEE

Conference_Titel

Computational Intelligence and Data Mining, 2007. CIDM 2007. IEEE Symposium on

Conference_Location

Honolulu, HI

Print_ISBN

1-4244-0705-2

Type

conf

DOI

10.1109/CIDM.2007.368914

Filename

4221338