Title :
Semi-supervised document clustering using Seeds affinity propagation and consensus algorithm in multi-domain settings
Author :
Radha, R. ; Mirnalinee, T.T. ; Trueman, T.E.
Author_Institution :
Dept. of Comput. Sci. & Eng., Anna Univ. of Technol., Chennai, India
Abstract :
Domain adaptation is the process of transferring knowledge from a source domain to a different but related domain. In this paper, we first apply a 'Consensus Regularization' based algorithm to merge multiple source domains into a single source domain. We then propose multi-domain adaptation in document clustering using Seeds Affinity Propagation and the Consensus Regularization algorithm. A semi-supervised document clustering algorithm called Seeds Affinity Propagation (SAP), which is based on the effective clustering algorithm Affinity Propagation (AP), is applied. The labeled and unlabeled documents are preprocessed through stop-word removal, word stemming, and word-frequency computation, and the results are given as the input. After preprocessing, structured documents are obtained. Tri-set Computation, a feature extraction technique, is used to find the features through the Co-feature Set, Unilateral Feature Set, and Significant Co-feature Set methods. The similarity measure of the documents is then calculated, and labels are assigned to the documents that match. Finally, clustered documents are obtained through seeds affinity propagation via similarity measurement. Further, the performance of the algorithm can be evaluated and improved.
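The preprocessing and feature-set steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the crude suffix-stripping stemmer, and the helper names (`naive_stem`, `preprocess`, `co_and_unilateral`) are all assumptions made for illustration, and the set operations only approximate what a co-feature set (features shared by two documents) and a unilateral feature set (features found in one document only) might look like.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the paper's list is not given).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "are"}


def naive_stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop stop words, stem, and count word frequencies."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)


def co_and_unilateral(freq_a, freq_b):
    """Return (features shared by both documents, features only in document A)."""
    a, b = set(freq_a), set(freq_b)
    return a & b, a - b


doc_a = preprocess("the clusters are merging in the domain")
doc_b = preprocess("clustering of documents in a domain")
shared, only_a = co_and_unilateral(doc_a, doc_b)
```

In this toy example the two documents share the stems `cluster` and `domain`, while `merg` appears only in the first. A similarity measure built on such sets could then feed a clustering step such as affinity propagation.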
Keywords :
document handling; learning (artificial intelligence); merging; pattern clustering; clustering algorithm affinity propagation; co-feature set method; consensus regularization based algorithm; domain adaptation process; domain merging; feature extraction technique; multidomain setting; multiple source domain; seeds affinity propagation; semi-supervised document clustering; significant co-feature set method; similarity measurement; single source domain; stop words removal process; tri-set computation technique; unilateral feature set method; word frequency finding process; word stemming process; Algorithm design and analysis; Clustering algorithms; Computer science; Convergence; Educational institutions; Entropy; Feature extraction; Consensus Regularization; Document Clustering; Multi-domain adaptation; Seeds Affinity Propagation (SAP); Tri set Computation;
Conference_Title :
Recent Trends In Information Technology (ICRTIT), 2012 International Conference on
Conference_Location :
Chennai, Tamil Nadu
Print_ISBN :
978-1-4673-1599-9
DOI :
10.1109/ICRTIT.2012.6206802