Title :
Integrative data mining: the new direction in bioinformatics
Author :
Bertone, Paul ; Gerstein, Mark
Author_Institution :
Dept. of Molecular, Cellular, & Dev. Biol., Yale Univ., New Haven, CT, USA
Abstract :
Biological research is becoming increasingly database driven, motivated, in part, by the advent of large-scale functional genomics and proteomics experiments such as those comprehensively measuring gene expression. These provide a wealth of information on each of the thousands of proteins encoded by a genome. Consequently, a challenge in bioinformatics is integrating databases to connect this disparate information as well as performing large-scale studies to collectively analyze many different data sets. This approach represents a paradigm shift away from traditional single-gene biology, and it often involves statistical analyses focusing on the occurrence of particular features (e.g., folds, functions, interactions, pseudogenes, or localization) in a large population of proteins. Moreover, the explicit application of machine learning techniques can be used to discover trends and patterns in the underlying data. In this article, we give several examples of these techniques in a genomic context: clustering methods to organize microarray expression data, support vector machines to predict protein function, Bayesian networks to predict subcellular localization, and decision trees to optimize target selection for high-throughput proteomics.
Keywords :
biology computing; data mining; decision trees; deductive databases; genetics; learning automata; proteins; self-organising feature maps; unsupervised learning; Bayesian networks; SOFM; bioinformatics; clustering methods; gene expression; high-throughput proteomics; integrative data mining; large-scale studies; machine learning techniques; microarray expression data; protein function; subcellular localization; support vector machines; target selection; Biological information theory; Genomics; Large scale integration; Large-scale systems; Proteomics; Spatial databases
Journal_Title :
Engineering in Medicine and Biology Magazine, IEEE