Title :
Feature Selection and Semi-supervised Clustering Using Multiobjective Optimization
Author :
Alok, Abhay Kumar ; Saha, Sriparna ; Ekbal, Asif
Author_Institution :
Comput. Sci. Eng., Indian Inst. of Technol., Patna, Patna, India
Abstract :
In this paper, we couple the feature selection problem with semi-supervised clustering. Semi-supervised clustering techniques are used to overcome the problems associated with unsupervised and supervised classification. In general, however, not all features present in a data set are important for clustering, so selecting an appropriate subset of the full feature set is highly relevant from a clustering point of view. A recently developed multiobjective simulated annealing based optimization technique, archived multiobjective simulated annealing (AMOSA), is used as the underlying optimization technique. Features and cluster centers are encoded in the form of a string. We assume that class label information is known for 10% of the data points in each data set. Four objective functions are used. The first two represent, respectively, the total symmetry present in the clusters and the total compactness of the partitioning; they are based on point symmetry and Euclidean distance computations. The third is an external cluster validity index that measures the similarity between the clustering obtained on the labeled data and the original labeling, and the fourth counts the number of features. Our objective is to optimize the values of the cluster validity indices while increasing the number of features, in order to remove the bias of internal cluster validity indices toward lower dimensions. AMOSA is used to detect the appropriate subset of features, the actual number of clusters, and the true partitioning. Data points are assigned to their respective clusters using a new point symmetry distance based methodology. Mutation changes the feature combinations as well as the set of cluster centers. We also implement a novel method to select a single solution from the Pareto-optimal front. The proposed Semi-FeaClustMOO technique is thus intended to obtain the actual number of clusters as well as the true partitioning. Its efficacy is shown on three real-life data sets and compared with the genetic algorithm based VGAPS clustering technique and the K-means clustering technique. Those two clustering techniques work with all available features of the data sets, whereas Semi-FeaClustMOO uses a subset of features during the computation.
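As an illustration of the third objective, the sketch below scores a candidate partition against the known labels of the labeled subset only. It uses the adjusted Rand index (ARI), which appears among the record's keywords; whether ARI is the exact external index the paper optimizes is an assumption here, and the function names `adjusted_rand_index` and `ari_on_labeled_subset`, as well as the dictionary layout for the known labels, are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same points."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: labelings agree trivially

    # Contingency counts: cells, row sums (true classes), column sums (clusters).
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single class and a single cluster
    return (sum_cells - expected) / (max_index - expected)


def ari_on_labeled_subset(cluster_ids, known_labels):
    """Evaluate ARI using only the points whose class label is known.

    cluster_ids: cluster assignment for every point, indexed by position.
    known_labels: hypothetical dict mapping point index -> class label
                  for the labeled fraction (10% in the abstract).
    """
    indices = sorted(known_labels)
    truth = [known_labels[i] for i in indices]
    predicted = [cluster_ids[i] for i in indices]
    return adjusted_rand_index(truth, predicted)
```

In a multiobjective search of the kind described, a value like this would be computed for each candidate solution alongside the symmetry-based index, the compactness-based index, and the feature count.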
Keywords :
Pareto optimisation; feature selection; genetic algorithms; pattern clustering; simulated annealing; unsupervised learning; AMOSA; K-mean clustering technique; Pareto-optimal front; archived multiobjective simulated annealing; cluster centers; coupled feature selection problem; euclidean distance computations; external cluster validity index; genetic algorithm based VGAPS clustering technique; innovative methodology; multiobjective optimization; multiobjective simulated annealing based optimization technique; semiFeaClustMOO technique; semisupervised clustering techniques; unsupervised classification; Cancer; Clustering algorithms; Euclidean distance; Indexes; Iris; Linear programming; Optimization; AMOSA; ARI; Clustering; MinkowskiScore; Semi-supervised clustering; Sym-index; XB-index; feature selection; multiobjective optimization (MOO);
Conference_Title :
Soft Computing and Machine Intelligence (ISCMI), 2014 International Conference on
DOI :
10.1109/ISCMI.2014.19