Title :
Focused crawling for building Web comment corpora
Author :
Neunerdt, M. ; Niermann, M. ; Mathar, Rudolf ; Trevisan, B.
Author_Institution :
Inst. for Theor. Inf. Technol., RWTH Aachen Univ., Aachen, Germany
Abstract :
Web 2.0 provides various types of social media applications, e.g., blogs, forums and news sites that allow users to post Web comments. This kind of communication plays an important role in acceptance research. To extract different opinions from such data, it is necessary to build Web comment corpora. Building such corpora requires focused crawling. Many focused Web crawling algorithms are known to build topic-specific Web collections. However, the type of Web pages is typically not considered. In this paper, we introduce a new type-specific focused crawler, which uses a classifier based on HTML meta information. Its application allows for collecting only Web pages that cover Web comments from various domains.
Keywords :
Internet; hypermedia markup languages; social networking (online); HTML meta information; Web 2.0; Web comment corpora; Web crawling algorithms; blogs; focused crawling; forums; news sites; social media applications; Blogs; Buildings; Crawlers; Prediction algorithms; Search engines; Web pages
Conference_Title :
Consumer Communications and Networking Conference (CCNC), 2013 IEEE
Conference_Location :
Las Vegas, NV
Print_ISBN :
978-1-4673-3131-9
DOI :
10.1109/CCNC.2013.6488526
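
The abstract describes a type-specific focused crawler driven by a classifier over HTML meta information, but it does not give the features or the crawl strategy. The following is only a minimal sketch of that general idea, under stated assumptions: the cue words in COMMENT_CUES, the attribute names inspected, the score threshold, and the best-first frontier are all hypothetical choices made for illustration, not the authors' method. The simple cue count stands in for the trained classifier, and TOY_WEB is an invented in-memory stand-in for real HTTP fetching.

```python
import heapq
from html.parser import HTMLParser

# Hypothetical cue words; the abstract does not list the classifier's features.
COMMENT_CUES = ("comment", "reply", "disqus")


class MetaFeatureParser(HTMLParser):
    """Collects outgoing links and counts comment cues in tag attributes."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.cue_hits = 0

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if tag == "a" and name == "href":
                self.links.append(value)
            # Assumed "meta information": selected attribute values only.
            if name in ("class", "id", "name", "content"):
                lowered = value.lower()
                self.cue_hits += sum(cue in lowered for cue in COMMENT_CUES)


def analyze(html):
    """Return (cue score, outgoing links) for one page."""
    parser = MetaFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.cue_hits, parser.links


def crawl(seeds, fetch, max_pages=10, threshold=2):
    """Best-first crawl: links found on comment-like pages are visited first."""
    frontier = [(0, url) for url in seeds]  # (negated parent score, url)
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    visited = 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)
        html = fetch(url)
        if html is None:  # page could not be fetched
            continue
        visited += 1
        score, links = analyze(html)
        if score >= threshold:  # stand-in for the classifier's decision
            collected.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return collected


# Invented toy pages so the sketch runs without network access.
TOY_WEB = {
    "http://example.org/": '<a href="http://example.org/blog">blog</a>'
                           '<a href="http://example.org/about">about</a>',
    "http://example.org/blog": '<div id="comments"><div class="comment">hi</div></div>'
                               '<a href="http://example.org/post2">next</a>',
    "http://example.org/about": '<p class="intro">About us</p>',
    "http://example.org/post2": '<ul class="comment-list">'
                                '<li class="comment reply">ok</li></ul>',
}

if __name__ == "__main__":
    print(crawl(["http://example.org/"], TOY_WEB.get))
```

In this sketch, fetch is any callable that maps a URL to an HTML string or None, so the toy dictionary lookup could be swapped for a real downloader. A trained classifier would replace the cue count in analyze without changing the crawl loop.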