Semi-supervised document clustering using Seeds affinity propagation and consensus algorithm in multi-domain settings

Author

Radha, R. ; Mirnalinee, T.T. ; Trueman, T.E.

Author_Institution

Dept. of Comput. Sci. & Eng., Anna Univ. of Technol., Chennai, India

fYear

2012

fDate

19-21 April 2012

Firstpage

85

Lastpage

90

Abstract

Domain adaptation is the process of transferring knowledge from a source domain to a different but related domain. In this paper, we first apply a 'Consensus Regularization' based algorithm to merge multiple source domains into a single source domain. We then propose multi-domain adaptation in document clustering using Seeds Affinity Propagation and the Consensus Regularization algorithm. A semi-supervised document clustering algorithm called Seeds Affinity Propagation (SAP), which is based on the effective clustering algorithm Affinity Propagation (AP), is applied. The labeled and unlabeled documents are preprocessed through stop-word removal, word stemming, and word-frequency computation, and the resulting structured documents are given as the input. Tri-set Computation, a feature extraction technique, is used to find the features through the Co-feature set, Unilateral feature set, and Significant Co-feature set methods. The similarity between documents is then calculated, and labels are assigned to the documents that match. Finally, clustered documents are obtained through seeds affinity propagation using the similarity measurements. The performance of the algorithm can be further evaluated and improved.
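The preprocessing and Tri-set Computation steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact definitions, so everything here is an assumption made for illustration: the stop-word list, the crude suffix stemmer, the helper names `preprocess` and `tri_sets`, the `top_k` parameter, and the reading of the three sets as shared terms (co-feature), terms found in only one document (unilateral), and shared terms that rank among the most frequent in both documents (significant co-feature).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the abstract does not give one).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, stem, and count term frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOP_WORDS)

def tri_sets(freq_i, freq_j, top_k=2):
    """Return (co-feature, unilateral, significant co-feature) sets for
    document i compared with document j (hypothetical definitions)."""
    co = set(freq_i) & set(freq_j)                 # terms in both documents
    unilateral = set(freq_i) - set(freq_j)         # terms only in document i
    top_i = {w for w, _ in freq_i.most_common(top_k)}
    top_j = {w for w, _ in freq_j.most_common(top_k)}
    significant_co = top_i & top_j                 # frequent in both documents
    return co, unilateral, significant_co

doc_a = preprocess("Clustering documents and clustering the labeled source documents")
doc_b = preprocess("Labeled documents are clustered; documents in the target domain")
co, uni, sig = tri_sets(doc_a, doc_b)
```

In a full pipeline, a similarity score built from these three sets would be passed to the affinity propagation step; that scoring function is not specified in the abstract and is left out of this sketch.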

Keywords

document handling; learning (artificial intelligence); merging; pattern clustering; clustering algorithm affinity propagation; co-feature set method; consensus regularization based algorithm; domain adaptation process; domain merging; feature extraction technique; multidomain setting; multiple source domain; seeds affinity propagation; semi-supervised document clustering; significant co-feature set method; similarity measurement; single source domain; stop words removal process; tri-set computation technique; unilateral feature set method; word frequency finding process; word stemming process; Algorithm design and analysis; Clustering algorithms; Computer science; Convergence; Educational institutions; Entropy; Feature extraction; Consensus Regularization; Document Clustering; Multi-domain adaptation; Seeds Affinity Propagation (SAP); Tri set Computation;

fLanguage

English

Publisher

ieee

Conference_Titel

Recent Trends In Information Technology (ICRTIT), 2012 International Conference on

Conference_Location

Chennai, Tamil Nadu

Print_ISBN

978-1-4673-1599-9

Type

conf

DOI

10.1109/ICRTIT.2012.6206802

Filename

6206802