Discovering Communities with Self-Adaptive k Clustering in Microblog Data

Author

Ting Huang ; Dunlu Peng ; Lidong Cao

Author_Institution

Sch. of Opt.-Electr. & Comput. Eng., Univ. of Shanghai for Sci. & Technol., Shanghai, China

fYear

2012

fDate

1-3 Nov. 2012

Firstpage

383

Lastpage

390

Abstract

Microblogging has become a popular social network service whose user population has grown rapidly in the past few years. Many companies regard microblogging services as an indispensable medium for obtaining timely opinions directly from customers and potential customers. A community in a social network is a group of people who share similar interests or pay attention to the same things. Recognizing user communities in a microblogging service is important for identifying hot topics and users' interests, which helps companies improve their marketing strategies. However, the massive volume of unstructured tweet data makes it very challenging to efficiently mine the valuable communities hidden in it. Tweet data is voluminous, spans many fields, and consists of short, unstructured messages, which makes tweets quite different from conventional text documents. To analyze the data more effectively, we propose a set of techniques for preprocessing tweets, including word identification, category matching, and data standardization. We present an unsupervised learning method that automatically clusters microblog users into communities. The method uses a CLARANS algorithm optimized for the characteristics of microblog data. During clustering, the interactive relationships between tweets are also exploited to improve clustering quality. In addition, a self-adaptive k strategy is employed to make the approach more applicable. To investigate the performance of the approach from different aspects, we conducted a series of experiments on microblog data collected from Sina Weibo.
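The abstract names CLARANS and a self-adaptive k strategy but does not give their details, so the following is only a minimal sketch of the general idea, not the authors' optimized method. It shows textbook CLARANS (randomized search over sets of k medoids) and, as an assumption, picks k by the highest mean silhouette score, which stands in for the paper's unspecified adaptive rule. All names (`clarans`, `choose_k`, `num_local`, `max_neighbor`, the toy `points`) are hypothetical, and the 2-D points with Euclidean distance merely stand in for preprocessed user feature vectors.

```python
import math
import random

def total_cost(points, medoids, dist):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def clarans(points, k, dist, num_local=4, max_neighbor=50, seed=0):
    """Textbook CLARANS: randomized local search over sets of k medoid indices."""
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(points)")
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(num_local):                # independent restarts
        current = rng.sample(range(n), k)
        cost = total_cost(points, current, dist)
        tried = 0
        while tried < max_neighbor:
            # A neighbour differs from the current solution in exactly one medoid.
            out_pos = rng.randrange(k)
            candidate = rng.choice([i for i in range(n) if i not in current])
            neighbor = current[:out_pos] + [candidate] + current[out_pos + 1:]
            new_cost = total_cost(points, neighbor, dist)
            if new_cost < cost:
                current, cost, tried = neighbor, new_cost, 0
            else:
                tried += 1
        if cost < best_cost:
            best, best_cost = current, cost
    return sorted(best), best_cost

def assign(points, medoids, dist):
    """Label each point with the index of its nearest medoid."""
    return [min(medoids, key=lambda m: dist(p, points[m])) for p in points]

def silhouette(points, labels, dist):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def choose_k(points, dist, k_max):
    """Assumed adaptive-k rule: try k = 2..k_max, keep the best silhouette."""
    best = None
    for k in range(2, k_max + 1):
        medoids, _ = clarans(points, k, dist)
        score = silhouette(points, assign(points, medoids, dist), dist)
        if best is None or score > best[0]:
            best = (score, k, medoids)
    return best[1], best[2]

# Toy data: three well-separated groups standing in for user feature vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
k, medoids = choose_k(points, math.dist, k_max=5)
print("chosen k:", k, "medoid indices:", medoids)
```

Because the search only needs a pairwise distance function, `math.dist` could be swapped for a text-oriented measure (for example a cosine distance over category vectors) without changing the rest of the sketch.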

Keywords

data mining; pattern clustering; pattern matching; social networking (online); unsupervised learning; CLARANS algorithm; Sina Weibo; category matching; clustering quality; community discovery; data standardization; microblog data; microblogging; self-adaptive k-clustering; social network service; tweet data; tweet preprocessing; unsupervised learning method; user community recognition; word identification; Algorithm design and analysis; Clustering algorithms; Communities; Market research; Probability; Social network services; adaptive k; clustering; community recognition; social network

fLanguage

English

Publisher

IEEE

Conference_Titel

Cloud and Green Computing (CGC), 2012 Second International Conference on

Conference_Location

Xiangtan

Print_ISBN

978-1-4673-3027-5

Type

conf

DOI

10.1109/CGC.2012.92

Filename

6382845