Title :
Implementation of a distributed web community crawler
Author :
Seonyoung Park ; Youngseok Lee
Author_Institution :
Chungnam Nat. Univ., Daejeon, South Korea
Abstract :
A web community is an important space for online users to exchange information, ideas, and thoughts. Because of the collective intelligence of web communities, marketing and advertising activities have focused heavily on these sites. Although articles in web communities are open to the public, they cannot be easily collected and analyzed, because they are written in natural language and their formats are diverse. Many web crawlers are available, but they are not well suited to gathering such documents. First, the URLs of web articles change frequently and are redundant, which makes crawling difficult. Second, the number of articles is so large that the crawler must be designed to scale. Therefore, we propose a distributed web crawler optimized for collecting articles from popular communities. Our experiments show that our implementation achieves higher throughput than the open-source crawler Nutch.
Keywords :
Internet; information retrieval; public domain software; Nutch open-source crawler; Web document gathering; advertisement activity; collective intelligence; distributed Web community crawler; marketing activity; Communities; Crawlers; Linux; Throughput; Uniform resource locators; Web pages; Distributed web crawler; community; web forum;
Conference_Title :
Network Operations and Management Symposium (APNOMS), 2014 16th Asia-Pacific
Conference_Location :
Hsinchu
DOI :
10.1109/APNOMS.2014.6996586