DocumentCode :
1787895
Title :
Novel frequent sequential patterns based probabilistic model for effective classification of web documents
Author :
Haleem, Hammad ; Sharma, Praveen Kumar ; Sufyan Beg, M.M.
Author_Institution :
Dept. of Comput. Eng., Jamia Millia Islamia, New Delhi, India
fYear :
2014
fDate :
26-28 Sept. 2014
Firstpage :
361
Lastpage :
371
Abstract :
Web page classification is one of the essential tasks in web information retrieval, supporting applications such as delivering content-specific search results, focused crawling, and maintaining web-directory projects like DMOZ. This paper presents a novel probabilistic web page classification scheme that uses the occurrences of frequent sequential patterns to determine the class of a document. As suggested by many previous works in the field of text mining, patterns carry more relevant information about a document than individual words. This paper attempts to make use of this hypothesis for the classification of web documents. After testing this approach on the RCV1 dataset, we were able to classify the test documents with 88% accuracy.
Keywords :
Internet; classification; data mining; information retrieval; probability; RCV1 dataset; Web document classification; Web information retrieval; Web page classification; frequent sequential pattern; probabilistic model; text mining; Abstracts; Accuracy; Probabilistic logic; Testing; Text mining; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer and Communication Technology (ICCCT), 2014 International Conference on
Conference_Location :
Allahabad
Print_ISBN :
978-1-4799-6757-5
Type :
conf
DOI :
10.1109/ICCCT.2014.7001520
Filename :
7001520