Title :
Genre Categorization of Web Pages
Author :
Jebari, Chaker ; Ounelli, Habib
Author_Institution :
King Saud Univ., Riyadh
Abstract :
With the growing number of web pages, it is difficult to find the desired information quickly and easily among the thousands of pages retrieved by a search engine. To address this problem, many studies propose classifying documents by genre, a criterion distinct from topic. Most of these works assign a document to only one genre. In this paper we propose a new flexible approach for document genre categorization. Flexibility means that our approach assigns a document to all predefined genres with different weights. The proposed approach combines two homogeneous classifiers: a contextual classifier and a structural classifier. The contextual classifier uses the URL, while the structural classifier uses the document structure. Both are centroid-based classifiers. Experiments yield a micro-averaged breakeven point (BEP) of more than 85%, which is better than the results obtained by other categorization approaches.
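The abstract names the ingredients (a URL-based classifier, a structure-based classifier, centroids, and per-genre weights) but not the details. The following is only a minimal sketch of how such a combination could look, not the authors' implementation. The URL tokenization, the use of HTML tag counts as the structural features, cosine similarity, the mixing parameter `alpha`, and the normalization of scores into weights are all assumptions made for illustration, as are the function names.

```python
import math
import re
from collections import Counter


def url_features(url):
    """Contextual view (assumed): lowercase alphanumeric tokens of the URL."""
    return Counter(re.findall(r"[a-z0-9]+", url.lower()))


def tag_features(html):
    """Structural view (assumed): counts of opening HTML tag names."""
    return Counter(t.lower() for t in re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html))


def centroid(vectors):
    """Average a list of sparse count vectors into one centroid."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {k: c / len(vectors) for k, c in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train(examples, featurize):
    """Build one centroid per genre from (raw_input, genre) pairs."""
    by_genre = {}
    for raw, genre in examples:
        by_genre.setdefault(genre, []).append(featurize(raw))
    return {g: centroid(vs) for g, vs in by_genre.items()}


def genre_weights(url, html, url_centroids, tag_centroids, alpha=0.5):
    """Return a weight for every known genre, not a single label.

    alpha (hypothetical) balances the contextual and structural scores;
    the weights are normalized to sum to 1 when any score is non-zero.
    """
    u, t = url_features(url), tag_features(html)
    scores = {
        g: alpha * cosine(u, url_centroids[g]) + (1 - alpha) * cosine(t, tag_centroids[g])
        for g in url_centroids
    }
    total = sum(scores.values())
    return {g: (s / total if total else 0.0) for g, s in scores.items()}
```

Returning a weight for every genre, rather than the single best match, mirrors the flexibility described in the abstract; a hard label can still be recovered by taking the genre with the largest weight.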
Keywords :
Internet; classification; information retrieval; search engines; text analysis; Web page retrieval; contextual classifier; document classification; document genre categorization; search engine; structural classifier; Computer science; Conferences; Data mining; Educational institutions; Graphics; HTML; Information retrieval; Search engines; Uniform resource locators; Web pages;
Conference_Title :
Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on
Conference_Location :
Omaha, NE
Print_ISBN :
978-0-7695-3019-2
Electronic_ISBN :
978-0-7695-3033-8
DOI :
10.1109/ICDMW.2007.120