DocumentCode :
5783
Title :
FoCUS: Learning to Crawl Web Forums
Author :
Jingtian Jiang ; Xinying Song ; Nenghai Yu ; Chin-Yew Lin
Author_Institution :
Inf. Process. Center, Univ. of Sci. & Technol. of China, Beijing, China
Volume :
25
Issue :
6
fYear :
2013
fDate :
June 2013
Firstpage :
1293
Lastpage :
1306
Abstract :
In this paper, we present Forum Crawler Under Supervision (FoCUS), a supervised web-scale forum crawler. The goal of FoCUS is to crawl relevant forum content from the web with minimal overhead. Forum threads contain the information content that forum crawlers target. Although forums have different layouts or styles and are powered by different forum software packages, they always have similar implicit navigation paths, connected by specific URL types, that lead users from entry pages to thread pages. Based on this observation, we reduce the web forum crawling problem to a URL-type recognition problem. We then show how to learn accurate and effective regular expression patterns of implicit navigation paths from automatically created training sets, using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as five annotated forums and applied to a large set of unseen forums. Our test results show that FoCUS achieved over 98 percent effectiveness and 97 percent coverage on a large set of test forums powered by over 150 different forum software packages. In addition, the results of applying FoCUS to more than 100 community question-and-answer sites and blog sites demonstrated that the concept of implicit navigation paths could apply to other social media sites.
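The reduction of crawling to URL-type recognition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three patterns, the URL shapes, and the names URL_TYPE_PATTERNS, classify_url and select_links are all hypothetical, and the patterns are written by hand here, whereas FoCUS learns them from automatically created training sets.

```python
import re

# Hypothetical, hand-written patterns for one imaginary forum.
# FoCUS learns such regular expressions; nothing here comes from the paper.
URL_TYPE_PATTERNS = {
    "index": re.compile(r"^/forumdisplay\.php\?f=\d+$"),
    "thread": re.compile(r"^/showthread\.php\?t=\d+$"),
    "page_flipping": re.compile(
        r"^/(?:forumdisplay|showthread)\.php\?[ft]=\d+&page=\d+$"
    ),
}


def classify_url(path):
    """Return the URL type of a site-relative path, or None if no pattern matches."""
    for url_type, pattern in URL_TYPE_PATTERNS.items():
        if pattern.match(path):
            return url_type
    return None


def select_links(paths):
    """Keep only the links whose URL type is recognized, paired with that type."""
    selected = []
    for path in paths:
        url_type = classify_url(path)
        if url_type is not None:
            selected.append((path, url_type))
    return selected
```

Under these assumptions, a crawler would follow only the links returned by select_links and skip everything else (member profiles, login pages and similar), which is one way to keep crawling overhead low.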
Keywords :
Web sites; information retrieval; pattern classification; software packages; FoCUS; URL types; URL-type recognition problem; annotated forums; blog sites; community question-and-answer sites; entry pages; forum crawler under supervision; forum software packages; forum threads; implicit navigation paths; robust page type classifiers; social media sites; supervised Web-scale forum crawler; thread pages; Crawlers; Indexes; Layout; Message systems; Navigation; Software packages; Training; EIT path; ITF regex; URL pattern learning; URL type; forum crawling; page classification; page type;
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2012.56
Filename :
6165295