Order-preserving clustering and its application to gene expression data

Author

Syeda-Mahmood, Tanveer

Author_Institution

IBM Almaden Res. Center, San Jose, CA, USA

Volume

4

fYear

2004

fDate

23-26 Aug. 2004

Firstpage

637

Abstract

Clustering of ordered data sets is a common problem in many pattern recognition tasks. Existing clustering methods either fail to capture the data or rely on restrictive models such as HMMs or AR models. In this paper, we present a general order-preserving clustering algorithm that allows arbitrary patterns of data evolution by representing each ordered set as a curve. Clustering of the data then reduces to grouping curves based on shape similarity. We develop a novel measure of shape similarity between curves using scale-space distance. Shape similarity or dissimilarity is judged by composing higher-dimensional curves from constituent curves and noting the additional twists and turns in such curves that can be attributed to shape differences. An algorithm analogous to K-means clustering is then developed that uses prototypical curves for cluster representation. Results are demonstrated on ordered gene expression data sets obtained from gene chips.
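
The K-means-like loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not define the scale-space distance, so `shape_distance` below substitutes a plain Euclidean distance between mean-centered, unit-norm curves. The function names, the farthest-point initialization, and the use of the member mean as the prototypical curve are all assumptions made for illustration.

```python
import numpy as np

def normalize_shape(curves):
    """Center each curve and scale it to unit norm so only shape is compared."""
    curves = np.asarray(curves, dtype=float)
    centered = curves - curves.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # flat curves stay at zero instead of dividing by 0
    return centered / norms

def shape_distance(a, b):
    # Stand-in measure (assumption): Euclidean distance between normalized
    # curves. The paper's scale-space distance is NOT reproduced here.
    return np.linalg.norm(a - b)

def cluster_curves(curves, k, n_iter=50):
    """K-means-style grouping of ordered sequences treated as curves."""
    shapes = normalize_shape(curves)

    # Farthest-point initialization (assumption; deterministic for the demo).
    prototypes = [shapes[0]]
    while len(prototypes) < k:
        nearest = np.min(
            [[shape_distance(s, p) for p in prototypes] for s in shapes], axis=1
        )
        prototypes.append(shapes[int(np.argmax(nearest))])
    prototypes = np.array(prototypes)

    labels = np.full(len(shapes), -1)
    for _ in range(n_iter):
        # Assignment step: each curve joins its closest prototypical curve.
        dists = np.array(
            [[shape_distance(s, p) for p in prototypes] for s in shapes]
        )
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Update step: the prototype becomes the mean of its member curves.
        for j in range(k):
            members = shapes[labels == j]
            if len(members):
                prototypes[j] = members.mean(axis=0)
    return labels, prototypes

# Toy expression profiles: two rising and two falling, at different scales.
profiles = [[1, 2, 3, 4], [10, 20, 30, 40], [4, 3, 2, 1], [40, 30, 20, 10]]
labels, prototypes = cluster_curves(profiles, k=2)
```

In this toy input the two rising profiles share a cluster and the two falling profiles share the other, even though their magnitudes differ by a factor of ten, because the normalization discards offset and scale before distances are computed.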

Keywords

data structures; genetics; pattern clustering; K-means clustering; data evolution; data representation; gene chips; gene expression data; order-preserving clustering algorithm; pattern recognition; scale-space distance; shape similarity; Clustering algorithms; Clustering methods; Data analysis; Data models; Gene expression; Hidden Markov models; Information analysis; Pattern recognition; Prototypes; Shape measurement

fLanguage

English

Publisher

IEEE

Conference_Titel

Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on

ISSN

1051-4651

Print_ISBN

0-7695-2128-2

Type

conf

DOI

10.1109/ICPR.2004.1333853

Filename

1333853