DocumentCode :
160998
Title :
An Effective Forum Crawler
Author :
Sreeja, S.R. ; Chaudhari, Sneha
Author_Institution :
Dept. of Comput. Sci. & Eng., A.C. Patil Coll. of Eng., Navi Mumbai, India
fYear :
2014
fDate :
4-5 April 2014
Firstpage :
230
Lastpage :
234
Abstract :
Web Forums, or Internet Forums, provide a space for users to share, discuss and request information. Web Forums are sources of a huge amount of structured information that changes rapidly, so crawling them requires special software. A Generic Deep Web Crawler or a Focused Crawler cannot be used for this purpose. In this paper, we propose an effective Web Crawler designed specifically for Internet Forums. This Forum Crawler overcomes the drawbacks of many existing Forum Crawlers. It can detect the Entry URL of a Forum site given any page of that site, and starting the crawling process from the Entry URL increases coverage. The different URLs in Web Forums are classified into four categories, and our Forum Crawler can detect these URLs even when they are JavaScript-based, which most existing Forum Crawlers cannot do. The entire process is divided into a learning part and an online crawling part. The learning part classifies the URLs in the forum site into the four categories: Index URL, Thread URL, Index-Page-Turning URL and Thread-Page-Turning URL. For online crawling, this Forum Crawler uses a Freshness First Strategy rather than BFS (Breadth First Strategy), which is advantageous when limited system resources are available.
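The abstract does not say how the Freshness First Strategy orders URLs, so the following is only a minimal sketch of one plausible reading: a crawl frontier that always pops the URL with the most recent activity, instead of the first-in-first-out order that BFS would use. The names `UrlType`, `FrontierEntry`, `FreshnessFirstFrontier`, `crawl_order`, the `last_activity` timestamp and the `budget` parameter are all assumptions made for illustration and do not come from the paper.

```python
import heapq
from dataclasses import dataclass, field
from enum import Enum


class UrlType(Enum):
    # The four URL categories named in the abstract.
    INDEX = "index"
    THREAD = "thread"
    INDEX_PAGE_TURNING = "index-page-turning"
    THREAD_PAGE_TURNING = "thread-page-turning"


@dataclass(order=True)
class FrontierEntry:
    # heapq is a min-heap, so the negated timestamp puts the freshest URL first.
    sort_key: float
    url: str = field(compare=False)
    url_type: UrlType = field(compare=False)


class FreshnessFirstFrontier:
    """Hypothetical frontier that pops the most recently active URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, url_type, last_activity):
        # last_activity: assumed Unix timestamp of the newest post behind the URL.
        if url in self._seen:
            return  # skip URLs that were already queued
        self._seen.add(url)
        heapq.heappush(self._heap, FrontierEntry(-last_activity, url, url_type))

    def pop(self):
        entry = heapq.heappop(self._heap)
        return entry.url, entry.url_type

    def __len__(self):
        return len(self._heap)


def crawl_order(frontier, budget):
    """Return up to `budget` URLs in the order a resource-limited crawl would fetch them."""
    order = []
    while len(frontier) and len(order) < budget:
        url, _ = frontier.pop()
        order.append(url)
    return order
```

Under this assumption, a crawler with a small fetch budget spends it on the most recently active threads and index pages first, which is one way to interpret the claimed advantage when system resources are limited.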
Keywords :
Internet; Web sites; information retrieval; tree searching; BFS; Entry URL; Internet forum crawler; JavaScript-based URLs; Web Forum crawling; Web forum crawler; breadth first strategy; focused crawler; forum site; freshness first strategy; generic deep Web crawler; index URL; index-page-turning URL; information requesting; information sharing; online crawling; thread URL; thread-page-turning URL; Crawlers; Indexes; Information technology; Kernel; Web pages; URL type; crawling strategy; forum crawling; page classification
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Circuits, Systems, Communication and Information Technology Applications (CSCITA), 2014 International Conference on
Conference_Location :
Mumbai
Type :
conf
DOI :
10.1109/CSCITA.2014.6839264
Filename :
6839264