• DocumentCode
    1331027
  • Title
    Practical Recommendations on Crawling Online Social Networks
  • Author
    Gjoka, Minas ; Kurant, Maciej ; Butts, C.T. ; Markopoulou, Athina
  • Author_Institution
    California Inst. for Telecommun. & Inf. Technol. (CalIT2), Univ. of California, Irvine, CA, USA
  • Volume
    29
  • Issue
    9
  • fYear
    2011
  • fDate
    10/1/2011 12:00:00 AM
  • Firstpage
    1872
  • Lastpage
    1892
  • Abstract
    Our goal in this paper is to develop a practical framework for obtaining a uniform sample of users in an online social network (OSN) by crawling its social graph. Such a sample allows us to estimate any user property and some topological properties as well. To this end, first, we consider and compare several candidate crawling techniques. Two approaches that can produce approximately uniform samples are the Metropolis-Hastings random walk (MHRW) and a re-weighted random walk (RWRW). Both have pros and cons, which we demonstrate through a comparison to each other as well as to the "ground truth." In contrast, using Breadth-First-Search (BFS) or an unadjusted Random Walk (RW) leads to substantially biased results. Second, in addition to offline performance assessment, we introduce online formal convergence diagnostics to assess sample quality during the data collection process. We show how these diagnostics can be used to effectively determine when a random walk sample is of adequate size and quality. Third, as a case study, we apply the above methods to Facebook and collect what is, to the best of our knowledge, the first representative sample of Facebook users. We make it publicly available and employ it to characterize several key properties of Facebook.
  • Keywords
    random processes; sampling methods; social networking (online); tree searching; Facebook; MHRW; Metropolis-Hastings random walk; RWRW; breadth-first-search; crawling; data collection process; online formal convergence diagnostics; online social network; re-weighted random walk; social graph; unadjusted random walk; Context; Convergence; Markov processes; Peer to peer computing; Privacy; Graph sampling; Measurements; Random Walks; Social network services
  • fLanguage
    English
  • Journal_Title
    Selected Areas in Communications, IEEE Journal on
  • Publisher
    IEEE
  • ISSN
    0733-8716
  • Type
    jour
  • DOI
    10.1109/JSAC.2011.111011
  • Filename
    6027868