DocumentCode
635412
Title
Exploiting side information in distance dependent Chinese restaurant processes for data clustering
Author
Cheng Li ; Dinh Phung ; Sohel Rana ; Svetha Venkatesh
Author_Institution
Centre for Pattern Recognition & Data Analytics (PRaDA), Deakin Univ., Geelong, VIC, Australia
fYear
2013
fDate
15-19 July 2013
Firstpage
1
Lastpage
6
Abstract
Multimedia content often comes with weakly annotated data such as tags, links and interactions. This weakly annotated data, called side information, is auxiliary to the data itself and provides hints for exploring its link structure. Most clustering algorithms use only the pure data. A model that combines pure data with side information, such as images with tags or documents with keywords, can better capture the underlying structure of the data. We demonstrate how to incorporate different types of side information into a recently proposed Bayesian nonparametric model, the distance dependent Chinese restaurant process (DD-CRP). When side information takes the form of subsets of discrete labels, our algorithm embeds the affinity of this information into the decay function of the DD-CRP. This makes it possible to measure distance based on arbitrary side information rather than only the spatial layout or time stamps of observations. For noisy and incomplete side information, we set the decay function so that the DD-CRP reduces to the traditional Chinese restaurant process, so that such side information does not degrade the clustering. Experimental evaluations on two real-world datasets, NUS-WIDE and 20 Newsgroups, show that exploiting side information in the DD-CRP significantly improves clustering performance.
Keywords
Bayes methods; multimedia systems; nonparametric statistics; pattern clustering; Bayesian nonparametric model; DD-CRP decay function; NUS_WIDE; auxiliary information; data clustering algorithms; discrete label subsets; distance dependent Chinese restaurant process; link data structure; multimedia contents; real-world datasets; side information; weakly annotated data; Clustering algorithms; Layout; Multimedia communication; Mutual information; Noise measurement; Time measurement; Side information; annotated data; clustering; distance dependent Chinese restaurant processes; multimedia
fLanguage
English
Publisher
IEEE
Conference_Titel
Multimedia and Expo (ICME), 2013 IEEE International Conference on
Conference_Location
San Jose, CA
ISSN
1945-7871
Type
conf
DOI
10.1109/ICME.2013.6607475
Filename
6607475