DocumentCode :
2082759
Title :
Web page categorization using hierarchical headings structure
Author :
Soonthornphisaj, N. ; Chartbanchachai, Pisit ; Pratheeptham, Thanapol ; Kijsirikul, Boonserm
Author_Institution :
Dept. of Comput. Sci., Kasetsart Univ., Bangkok, Thailand
fYear :
2002
fDate :
2002
Firstpage :
37
Abstract :
The goal of Web page categorization is to classify Web documents into a certain number of predefined categories. Previous work in this area employed a large number of labeled training documents for supervised learning. The problem is that labeled training documents are difficult to create: while unlabeled documents are easy to collect, manually categorizing them to build a training set is not. Therefore, a new machine learning algorithm should be investigated to overcome these difficulties. We propose a new algorithm called Iterative Cross-Training (ICT). The paper also presents a new feature set, the hierarchical structure of headings appearing in the Web page, to enhance classification performance. We found that the hierarchical structure of headings has a high impact on classification performance and can improve it.
Keywords :
Internet; classification; learning (artificial intelligence); Web documents; Web page categorization; World Wide Web; feature sets; hierarchical structure; iterative cross-training; labeled training documents; machine learning algorithm; supervised learning; Computer science; Iterative algorithms; Knowledge engineering; Machine intelligence; Machine learning; Machine learning algorithms; Power capacitors; Search engines; Supervised learning; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Information Technology Interfaces, 2002. ITI 2002. Proceedings of the 24th International Conference on
ISSN :
1330-1012
Print_ISBN :
953-96769-5-9
Type :
conf
DOI :
10.1109/ITI.2002.1024649
Filename :
1024649