DocumentCode :
174897
Title :
A Pure URL-Based Genre Classification of Web Pages
Author :
Jebari, Chaker
Author_Institution :
Inf. Technol. Dept., Ibri Coll. of Appl. Sci., Ibri, Oman
fYear :
2014
fDate :
1-5 Sept. 2014
Firstpage :
233
Lastpage :
237
Abstract :
In this paper, we propose a new approach to multi-label genre classification of web pages that uses character n-grams extracted from the URL of a web page rather than from its content. Using only the URL reduces feature-extraction time because the page content does not need to be downloaded. The approach handles the complexity of web pages through multi-label classification, in which each web page can be assigned to more than one genre. It also introduces a new weighting technique that exploits the structure of the URL. Experiments conducted on a known multi-label dataset show that the approach achieves encouraging results.
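The abstract's idea of URL-only features can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the three-way split into host, path, and query, and the values in `SEGMENT_WEIGHTS` are all assumptions made for illustration, since the abstract does not specify how the weighting is computed.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical weights per URL segment. The abstract only says the weighting
# "exploits the structure of the URL"; these numbers are illustrative.
SEGMENT_WEIGHTS = {"host": 1.0, "path": 2.0, "query": 0.5}


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def url_ngram_features(url, n=3, weights=SEGMENT_WEIGHTS):
    """Build a weighted bag of character n-grams from the URL alone.

    The page content is never downloaded; only the URL string is parsed.
    """
    parts = urlsplit(url)
    segments = {"host": parts.netloc, "path": parts.path, "query": parts.query}
    features = Counter()
    for name, text in segments.items():
        for gram in char_ngrams(text, n):
            # Each occurrence contributes the weight of the segment it came from.
            features[gram] += weights[name]
    return features


features = url_ngram_features("http://example.com/blog/post")
```

A feature vector of this kind could then be passed to any multi-label classifier, so that a single page can receive more than one genre label.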
Keywords :
Internet; computational complexity; pattern classification; Web pages; character n-gram extraction; feature extraction; multilabel genre classification; pure URL-based genre classification; weighting technique; Classification algorithms; Feature extraction; HTML; Measurement; Search engines; Uniform resource locators; Web pages; URL structure; genre; multi-label classification; web page;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Database and Expert Systems Applications (DEXA), 2014 25th International Workshop on
Conference_Location :
Munich
ISSN :
1529-4188
Print_ISBN :
978-1-4799-5721-7
Type :
conf
DOI :
10.1109/DEXA.2014.56
Filename :
6974855