Title :
Leveraging Web 2.0 Sources for Web Content Classification
Author :
Banerjee, Somnath ; Scholz, Martin
Author_Institution :
Hewlett-Packard Labs., Bangalore
Abstract :
This paper addresses practical aspects of Web page classification not captured by the classical text mining framework. Classifiers are supposed to perform well on a broad variety of pages. We argue that constructing training corpora is a bottleneck for building such classifiers, and that care has to be taken if the goal is to generalize to previously unseen kinds of pages on the Web. We study techniques for building training corpora automatically from publicly available Web resources, quantify the discrepancy between them, and demonstrate that encouraging agreement between classifiers given such diverse sources drastically outperforms methods that ignore the different natures of data sources on the Web.
Keywords :
Internet; classification; data mining; text analysis; Web 2.0 source; Web content classification; text mining; Buildings; Information filtering; Information filters; Information services; Intelligent agent; Labeling; Web pages; Web sites; corpus construction; web 2.0; web classification
Conference_Title :
Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT '08. IEEE/WIC/ACM International Conference on
Conference_Location :
Sydney, NSW
Print_ISBN :
978-0-7695-3496-1
DOI :
10.1109/WIIAT.2008.291