• DocumentCode
    3765460
  • Title

    Distributed web crawling: A framework for crawling of micro-blog data

  • Author

    Jie Xia;Wanggen Wan;Renzhong Liu;Guodong Chen;Qing Feng

  • Author_Institution
    School of Communication and Information Engineering, Shanghai, China
  • fYear
    2015
  • fDate
    7/1/2015 12:00:00 AM
  • Firstpage
    62
  • Lastpage
    68
  • Abstract
    These days, social networks attract people to express and share their interests. We aim to monitor public opinion and make other valuable discoveries using data collected from the social network website Sina Weibo. This paper presents a distributed web crawler framework called SWORM, which runs on the Raspberry Pi (a cheap, card-sized single-board computer) to fetch micro-blog data and outperforms traditional web crawlers in efficiency, scale, scalability, and cost. The framework can easily be extended to meet the specific needs of the user with the help of some simple Python scripts. This paper first proposes a model of the micro-blog network to determine what our crawler will collect from the social website and how. Second, it introduces the implementation details of the whole distributed system, and finally it presents experimental results. We ran several crawlers within our framework on the Raspberry Pi and stored the obtained resources in a shared MongoDB database, a type of NoSQL store. Experimental results demonstrate that using a distributed framework can greatly improve the efficiency and accuracy of data collection.
  • Publisher
    iet
  • Conference_Titel
    Smart and Sustainable City and Big Data (ICSSC), 2015 International Conference on
  • Type

    conf

  • DOI
    10.1049/cp.2015.0255
  • Filename
    7446438