Title :
Minimum entropy clustering and applications to gene expression analysis
Author :
Li, Haifeng ; Zhang, Keshu ; Jiang, Tao
Author_Institution :
California Univ., Riverside, CA, USA
Abstract :
Clustering is a common methodology for analyzing gene expression data. We present a new clustering algorithm from an information-theoretic point of view. First, we propose the minimum entropy criterion, measured on a posteriori probabilities, which is the conditional entropy of clusters given the observations. Fano's inequality indicates that it could be a good criterion for clustering. We generalize the criterion by replacing Shannon's entropy with Havrda-Charvat's structural α-entropy. Interestingly, when α = 2 the minimum entropy criterion based on structural α-entropy is equal to the probability of error of the nearest neighbor method. This is further evidence that the proposed criterion is good for clustering. With a nonparametric approach for estimating a posteriori probabilities, an efficient iterative algorithm is then established to minimize the entropy. The experimental results show that the clustering algorithm performs significantly better than k-means/medians, hierarchical clustering, SOM, and EM in terms of adjusted Rand index. In particular, our algorithm performs very well even when the correct number of clusters is unknown. In addition, most clustering algorithms produce poor partitions in the presence of outliers, while our method can correctly reveal the structure of the data and effectively identify outliers at the same time.
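As an illustration of the quantities named in the abstract, the following Python sketch evaluates a structural α-entropy on a few a posteriori probability vectors and averages it over observations, which is one way to read "conditional entropy of clusters given the observations". The abstract does not state the normalization, so the common Havrda-Charvat form H_α(p) = (2^(1-α) - 1)^(-1) (Σ p_i^α - 1) is assumed here. The function names and the toy posterior values are hypothetical, and nothing below reproduces the authors' estimator or their iterative algorithm.

```python
import math


def structural_alpha_entropy(p, alpha):
    """Structural alpha-entropy of a probability vector p (assumed
    Havrda-Charvat normalization; alpha > 0)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if abs(alpha - 1.0) < 1e-12:
        # Limit alpha -> 1 of the assumed form: Shannon entropy in bits.
        return -sum(q * math.log2(q) for q in p if q > 0)
    norm = 1.0 / (2.0 ** (1.0 - alpha) - 1.0)
    return norm * (sum(q ** alpha for q in p if q > 0) - 1.0)


def mean_posterior_entropy(posteriors, alpha):
    """Average the entropy of each observation's posterior vector,
    an empirical stand-in for the conditional entropy H(C | x)."""
    return sum(structural_alpha_entropy(row, alpha) for row in posteriors) / len(posteriors)


def nearest_neighbor_error(posteriors):
    """Average of 1 - sum_i p_i^2, the asymptotic nearest neighbor error
    expressed through the posteriors."""
    return sum(1.0 - sum(q * q for q in row) for row in posteriors) / len(posteriors)


# Hypothetical posteriors for three observations and two clusters.
posteriors = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]

h2 = mean_posterior_entropy(posteriors, 2.0)
nn_err = nearest_neighbor_error(posteriors)
# Under the assumed normalization, h2 is exactly twice nn_err.
print(h2, nn_err)
```

Under this assumed normalization the α = 2 value is twice 1 - Σ p_i², so it matches the nearest neighbor error up to a constant factor. A different normalization would change that constant but not the minimizer.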
Keywords :
biology computing; entropy; estimation theory; genetics; iterative methods; Havrda-Charvat structural α-entropy; Fano inequality; Shannon entropy; a posteriori probabilities; adjusted Rand index; expectation maximization; gene expression analysis; hierarchical clustering; information theory; iterative algorithm; k-means/medians; minimum entropy clustering; nearest neighbor method; outliers; probability error; self-organizing maps; Bioinformatics; Biological processes; Cells (biology); Clustering algorithms; Data analysis; Entropy; Gene expression; Iterative algorithms; Nearest neighbor searches; Partitioning algorithms;
Conference_Title :
Computational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings. 2004 IEEE
Print_ISBN :
0-7695-2194-0
DOI :
10.1109/CSB.2004.1332427