Title :
A Multi-label and Adaptive Genre Classification of Web Pages
Author :
Jebari, Chaker ; Wani, M. Arif
Author_Institution :
Comput. Sci. Dept., Fac. of Sci. of Tunis, Tunis, Tunisia
Abstract :
This paper proposes a new centroid-based approach to classifying web pages by genre using character n-grams extracted from different information sources, such as the URL, title, headings, and anchors. To deal with the complexity of web pages and the rapid evolution of web genres, our approach implements a multi-label and adaptive classification scheme in which web pages are classified one by one and each web page can be assigned to more than one genre. Depending on the similarity between the new page and each genre centroid, our approach either adapts the genre centroid under consideration or treats the new page as a noise page and discards it. The experimental results show that our approach is very fast and achieves better results than existing multi-label classifiers.
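The scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The trigram size, the cosine similarity measure, the fixed threshold of 0.3, the running-mean centroid update, and all names (`char_ngrams`, `AdaptiveCentroidClassifier`, `seed`, `classify`) are assumptions made for illustration, since the abstract does not specify them. The text passed in stands for the concatenated URL, title, headings, and anchors of a page.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class AdaptiveCentroidClassifier:
    """Hypothetical multi-label, adaptive centroid classifier."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed similarity cut-off
        self.centroids = {}         # genre -> mean n-gram vector
        self.counts = {}            # genre -> pages absorbed so far

    def seed(self, genre, texts):
        """Build an initial centroid from labelled example texts."""
        for text in texts:
            self._absorb(genre, char_ngrams(text))

    def _absorb(self, genre, vec):
        """Fold one page vector into the genre centroid (running mean)."""
        centroid = self.centroids.setdefault(genre, {})
        m = self.counts.get(genre, 0)
        for k in set(centroid) | set(vec):
            centroid[k] = (centroid.get(k, 0.0) * m + vec.get(k, 0)) / (m + 1)
        self.counts[genre] = m + 1

    def classify(self, text):
        """Return every genre whose centroid is similar enough to the page.

        Each matching centroid is adapted with the page. An empty list
        means the page was treated as noise and no centroid was changed.
        """
        vec = char_ngrams(text)
        labels = []
        for genre in list(self.centroids):
            if cosine(vec, self.centroids[genre]) >= self.threshold:
                labels.append(genre)
                self._absorb(genre, vec)
        return labels


clf = AdaptiveCentroidClassifier()
clf.seed("blog", ["my personal blog post diary", "blog post about my day"])
clf.seed("shop", ["buy cheap shoes online shop cart",
                  "online shop add to cart checkout"])
print(clf.classify("blog post diary"))
```

Pages are processed one at a time, so each accepted page shifts the matching centroids before the next page is seen; this is what makes the sketch adaptive rather than a fixed nearest-centroid classifier.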
Keywords :
Web sites; classification; information retrieval; URL; Web genre; Web page classification; adaptive genre classification scheme; anchors; character ngram; genre centroid; headings; information source extraction; multilabel genre classification scheme; noise page; title; Classification algorithms; Complexity theory; Data mining; Search engines; Training; Vectors; Web pages; Multi-label; adaptive; centroid; classification; genre;
Conference_Title :
Machine Learning and Applications (ICMLA), 2012 11th International Conference on
Conference_Location :
Boca Raton, FL
Print_ISBN :
978-1-4673-4651-1
DOI :
10.1109/ICMLA.2012.106