Title :
Clustering gene expression data: an experimental analysis
Author :
Ortiz-Gama, Sergio ; Sucar, L. Enrique ; Rodríguez, Andrés F.
Author_Institution :
Tec de Monterrey, Morelos, Mexico
Abstract :
Recent advances in information technologies and molecular biology have led to an exponential growth in genome data. DNA micro-array experiments are an important tool for monitoring and analyzing the gene expression profiles of thousands of genes simultaneously. In particular, we are interested in identifying similar expression patterns among the genes of the bacterium E. coli, which will help to improve the understanding of its regulation pathways. We applied the KDD (knowledge discovery in databases) methodology, in particular a clustering algorithm, to gene expression data from micro-array experiments on E. coli under different conditions. Using AutoClass on a database of more than 1000 E. coli genes, we identified about 70 clusters of genes that exhibit similar patterns of expression level, and compared them to the groups of regulated genes that have been identified by biologists. The results show many points of agreement, but also important differences. These differences provide important clues for future research on the regulation process in E. coli. The contributions of this work are threefold. First, we illustrate the application of the KDD methodology to a difficult problem in molecular biology, including the preprocessing steps needed before the clustering techniques can be applied. Second, we make an objective comparison of the clusters obtained from the data with the groups of regulated genes considered by the experts, using two different methodologies. One is based on the Jaccard index. The other is a methodology proposed by us to compare two different clusterings. Third, we identify possible groups of co-regulated genes in E. coli that merit further research toward understanding the gene regulation pathways.
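The abstract does not say how the Jaccard index was applied, so the following is only a minimal sketch under one common assumption: each cluster and each expert-defined group is treated as a set of gene identifiers, and a cluster is matched to the group with which it has the highest Jaccard index |A ∩ B| / |A ∪ B|. The function names `jaccard` and `best_matches`, and the identifiers `c1`, `r1`, `g1` and so on, are hypothetical placeholders and are not taken from the paper or its data.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two collections; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(clusters, groups):
    """For each cluster, return the (group name, score) pair with the highest Jaccard index.

    Assumption: clusters and groups are dicts mapping a name to a set of gene ids.
    """
    result = {}
    for cluster_name, cluster_genes in clusters.items():
        scored = (
            (group_name, jaccard(cluster_genes, group_genes))
            for group_name, group_genes in groups.items()
        )
        result[cluster_name] = max(scored, key=lambda pair: pair[1])
    return result


# Toy, made-up identifiers used purely for illustration.
clusters = {"c1": {"g1", "g2", "g3"}, "c2": {"g4", "g5"}}
groups = {"r1": {"g1", "g2"}, "r2": {"g3", "g4", "g5"}}

matches = best_matches(clusters, groups)
for cluster_name, (group_name, score) in matches.items():
    print(f"{cluster_name} best matches {group_name} (Jaccard = {score:.2f})")
```

A score of 1 means the cluster and the group contain exactly the same genes, and a score of 0 means they share none; intermediate values are where agreement and disagreement with the expert groups would show up.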
Keywords :
biology computing; data mining; genetics; molecular biophysics; pattern clustering; AutoClass; DNA micro-arrays experiment; E. coli bacteria; Jaccard index; KDD methodology; clustering algorithm; database knowledge discovery; gene expression data clustering; gene expression profile; genome data; information technology; molecular biology; Bioinformatics; Clustering algorithms; DNA; Data analysis; Databases; Gene expression; Genomics; Information technology; Microorganisms; Monitoring;
Conference_Title :
Computer Science, 2004. ENC 2004. Proceedings of the Fifth Mexican International Conference in
Print_ISBN :
0-7695-2160-6
DOI :
10.1109/ENC.2004.1342602