Title :
Using a thesaurus-based approach for the categorisation of web sites
Author :
Pudaruth, Sameerchand ; Ankiah, Youven ; Sembhoo, Keshav
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Mauritius, Réduit, Mauritius
Abstract :
With the increasing number of Mauritian-owned websites on the internet, the need for classification is becoming increasingly important. Our objective in this research is to classify a list of websites into seven broad categories, namely education, entertainment, government, health, tourism, sports and shopping. The homepages of three hundred and nineteen websites were used in this study. We exploited the rich source of information (features) contained in each homepage, such as the meta tags, title tag, heading tags, hyperlinks, the content of the website and the domain name of the website. This information was then used to classify each website into its most appropriate category. Several parameters, such as the weight applied to each feature and the keywords used to classify the websites, were tuned to yield better results. The experimental evaluation revealed that the implemented method provides very high accuracy. In particular, we obtained an accuracy of about 95%, which is higher than that of all existing approaches considered so far in the research literature.
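The weighted, keyword-based scoring described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature weights, the three small keyword lists, the `classify` function and the example homepage are all hypothetical, chosen only to show how per-feature weights and per-category keywords might combine into a category score.

```python
import re
from collections import Counter

# Hypothetical feature weights; the paper tunes its own values,
# which are not given in the abstract.
FEATURE_WEIGHTS = {
    "domain": 3.0,
    "title": 3.0,
    "meta": 2.0,
    "headings": 2.0,
    "links": 1.0,
    "content": 1.0,
}

# Hypothetical, tiny keyword lists standing in for a per-category thesaurus.
CATEGORY_KEYWORDS = {
    "education": {"school", "university", "course", "student"},
    "health": {"clinic", "hospital", "doctor", "health"},
    "tourism": {"hotel", "beach", "resort", "tour"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(features):
    """Return (best_category, scores) for a dict of feature name -> text."""
    scores = Counter()
    for feature, text in features.items():
        weight = FEATURE_WEIGHTS.get(feature, 0.0)
        if weight == 0.0:
            continue  # ignore features that carry no weight
        for token in tokenize(text):
            for category, keywords in CATEGORY_KEYWORDS.items():
                if token in keywords:
                    # Each keyword hit adds the weight of the feature it was found in.
                    scores[category] += weight
    if not scores:
        return None, {}
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Made-up homepage features, for illustration only.
example = {
    "domain": "blue-bay-resort.example",
    "title": "Blue Bay Resort and Spa",
    "meta": "beach hotel with guided tour packages",
    "headings": "Rooms and Wellness",
    "links": "Book a tour Contact",
    "content": "Our hotel has a health club and a resident doctor.",
}

label, scores = classify(example)
print(label, scores)
```

In this sketch a keyword found in the title or domain name counts three times as much as one found in the body text, so a few health-related words in the content do not outweigh the tourism terms in the more prominent features. Tuning would then amount to adjusting the values in `FEATURE_WEIGHTS` and the entries in `CATEGORY_KEYWORDS`.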
Keywords :
Internet; Web sites; classification; thesauri; Mauritian-owned Web sites; Web sites categorisation; education; entertainment; government; heading tags; health; hyperlinks; meta tags; shopping; sports; thesaurus-based approach; title tag; tourism; Accuracy; Classification algorithms; Web pages; controlled vocabulary; natural language processing; thesaurus; website
Conference_Title :
Contemporary Computing (IC3), 2014 Seventh International Conference on
Conference_Location :
Noida
Print_ISBN :
978-1-4799-5172-7
DOI :
10.1109/IC3.2014.6897245