DocumentCode
1787895
Title
Novel frequent sequential patterns based probabilistic model for effective classification of web documents
Author
Haleem, Hammad ; Sharma, Praveen Kumar ; Sufyan Beg, M.M.
Author_Institution
Dept. of Comput. Eng., Jamia Millia Islamia, New Delhi, India
fYear
2014
fDate
26-28 Sept. 2014
Firstpage
361
Lastpage
371
Abstract
Web page classification is an essential task in web information retrieval, supporting applications such as delivering content-specific search results, focused crawling, and maintaining web-directory projects like DMOZ. This paper presents a novel probabilistic web page classification scheme that uses the occurrences of frequent sequential patterns to determine the class of a document. As suggested by many previous works in the field of text mining, patterns carry more relevant information about a document than individual words. This paper attempts to exploit this hypothesis for the classification of web documents. When tested on the RCV1 dataset, the approach classified the test documents with 88% accuracy.
Keywords
Internet; classification; data mining; information retrieval; probability; RCV1 dataset; Web document classification; Web information retrieval; Web page classification; frequent sequential pattern; probabilistic model; text mining; Abstracts; Accuracy; Probabilistic logic; Testing; Text mining; Web pages;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer and Communication Technology (ICCCT), 2014 International Conference on
Conference_Location
Allahabad
Print_ISBN
978-1-4799-6757-5
Type
conf
DOI
10.1109/ICCCT.2014.7001520
Filename
7001520