DocumentCode :
264731
Title :
Large-Scale Web Page Classification
Author :
Marath, Sathi T. ; Shepherd, Morgan ; Milios, Evangelos ; Duffy, Jack
Author_Institution :
DNA 13, Ottawa, ON, Canada
fYear :
2014
fDate :
6-9 Jan. 2014
Firstpage :
1813
Lastpage :
1822
Abstract :
This research investigates the design of a unified framework for the content-based classification of highly imbalanced hierarchical datasets, such as web directories. In an imbalanced dataset, the prior probability distribution of a category indicates the presence or absence of class imbalance. Class imbalance may take the form of a lack of positive training instances (rarity) or an overabundance of positive instances. We partitioned the subcategories of the Yahoo! web directory into five mutually exclusive groups based on the prior probability distribution. The best-performing classification methods for each prior probability distribution were identified and used to design a content-based classification model for the complete (as of 2007) Yahoo! web directory of 639,671 categories and 4,140,629 web pages. The methodology was validated using a DMOZ subset of 17,217 categories and 130,594 web pages, and we demonstrated statistically that it works equally well on large and small datasets.
Keywords :
Web sites; pattern classification; probability; DMOZ subset; Yahoo! Web directory; best performing classification methods; content-based classification model; large-scale Web page classification; prior probability distribution; Classification algorithms; Probability distribution; Support vector machines; Taxonomy; Text categorization; Training; Web pages; scarcity; unbalanced distribution; web page classification;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
System Sciences (HICSS), 2014 47th Hawaii International Conference on
Conference_Location :
Waikoloa, HI
Type :
conf
DOI :
10.1109/HICSS.2014.229
Filename :
6758827