Author :
Wang, Dong ; Li, Zhenyu ; Xie, Gaogang
Author_Institution :
Inst. of Comput. Technol., Chinese Acad. of Sci. (CAS), Beijing, China
Abstract :
Unbiased sampling of online social networks (OSNs) makes it possible to obtain accurate statistical properties of large-scale OSNs. However, the most widely used sampling methods, Breadth-First-Search (BFS) and Greedy, are known to be biased towards high-degree nodes, yielding inaccurate statistical results. To give a general requirement for unbiased sampling, we model the crawling process as a Markov chain and derive a necessary and sufficient condition, which enables us to design various efficient unbiased sampling methods. To the best of our knowledge, we are among the first to give such a condition. Metropolis-Hastings Random Walk (MHRW) is an example that satisfies the condition. However, walkers in MHRW may stay at some low-degree nodes for a long time, resulting in considerable self-loops on these nodes, which adversely affect the crawling efficiency. Based on the condition, a new unbiased sampling method, called USRS, is proposed to reduce the probabilities of self-loops. We use the dataset of Renren, the largest OSN in China, to evaluate the performance of USRS. The results demonstrate that USRS generates unbiased samples with low self-loop probabilities and achieves higher crawling efficiency.
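To illustrate the self-loop behaviour of MHRW mentioned in the abstract, the following is a minimal sketch of the textbook Metropolis-Hastings random walk with a uniform target distribution over nodes. It is not the authors' implementation and does not show USRS. The toy graph `TOY_GRAPH`, the function name `mhrw`, and its parameters are hypothetical names chosen for illustration only.

```python
import random
from collections import Counter

# Hypothetical toy graph (undirected adjacency lists): one hub, five low-degree nodes.
TOY_GRAPH = {
    0: [1, 2, 3, 4, 5],
    1: [0, 2],
    2: [0, 1],
    3: [0],
    4: [0],
    5: [0],
}


def mhrw(graph, start, steps, rng):
    """Textbook MHRW targeting the uniform distribution over nodes.

    Returns (visit counts per node, number of rejected moves, i.e. self-loops).
    """
    current = start
    visits = Counter()
    self_loops = 0
    for _ in range(steps):
        # Propose a neighbour uniformly at random.
        candidate = rng.choice(graph[current])
        # Accept with probability min(1, deg(current) / deg(candidate)).
        accept_prob = len(graph[current]) / len(graph[candidate])
        if rng.random() < accept_prob:
            current = candidate
        else:
            # Rejected proposal: the walker stays put (a self-loop).
            self_loops += 1
        visits[current] += 1
    return visits, self_loops


def visit_fractions(graph, steps=200_000, seed=0):
    """Run the walk from node 0 and return per-node visit fractions and the self-loop rate."""
    rng = random.Random(seed)
    visits, self_loops = mhrw(graph, start=0, steps=steps, rng=rng)
    fractions = {node: visits[node] / steps for node in graph}
    return fractions, self_loops / steps


if __name__ == "__main__":
    fractions, loop_rate = visit_fractions(TOY_GRAPH)
    print("visit fractions:", fractions)
    print("self-loop rate:", loop_rate)
```

In this sketch a walker at a degree-1 node proposes the hub and is accepted only with probability 1/5, so most steps taken from low-degree nodes are self-loops. That is the inefficiency the abstract attributes to MHRW, even though the long-run visit fractions approach uniform.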
Keywords :
Markov processes; data analysis; probability; sampling methods; social networking (online); tree searching; Markov chain; Renren dataset; breadth-first-search method; crawling efficiency; crawling process; greedy method; Metropolis-Hastings random walk; online social networks; self-loop probability; statistical properties; unbiased sampling; Algorithm design and analysis; IEEE Communications Society; Peer to peer computing; Social network services; Sufficient conditions