• DocumentCode
    1781168
  • Title
    Implementation of a distributed web community crawler
  • Author
    Seonyoung Park ; Youngseok Lee
  • Author_Institution
    Chungnam Nat. Univ., Daejeon, South Korea
  • fYear
    2014
  • fDate
    17-19 Sept. 2014
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    A web community is an important space for online users to exchange information, ideas, and thoughts. Because of the collective intelligence of web communities, marketing and advertisement activities have focused heavily on these sites. While articles in web communities are open to the public, they cannot be easily collected and analyzed, because they are written in natural language and their formats are diverse. Though many web crawlers are available, they are not good at gathering these web documents. First, the URLs of web articles change frequently and are often redundant, which makes crawling difficult. Second, the number of articles is so large that the crawler must be designed to scale. Therefore, we propose a distributed web crawler optimized for collecting articles from popular communities. Our experiments show that our implementation achieves high throughput compared with the open-source crawler Nutch.
  • Keywords
    Internet; information retrieval; public domain software; Nutch open-source crawler; Web document gathering; advertisement activity; collective intelligence; distributed Web community crawler; marketing activity; Communities; Crawlers; Linux; Throughput; Uniform resource locators; Web pages; Distributed web crawler; community; web forum;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Network Operations and Management Symposium (APNOMS), 2014 16th Asia-Pacific
  • Conference_Location
    Hsinchu
  • Type
    conf
  • DOI
    10.1109/APNOMS.2014.6996586
  • Filename
    6996586