DocumentCode :
2075058
Title :
sBGMM: A Stratified Beta-Gaussian Mixture Model for Clustering Genes with Multiple Data Sources
Author :
Dai, Xiaofeng ; Lahdesmaki, Harri ; Yli-Harja, Olli
Author_Institution :
Dept. of Signal Process., Tampere Univ. of Technol., Tampere
fYear :
2008
fDate :
June 29 2008-July 5 2008
Firstpage :
94
Lastpage :
99
Abstract :
Cluster analysis is widely applied to discover the function of previously unannotated genes. This paper presents a novel stratified beta-Gaussian mixture model, sBGMM, for clustering genes based on gene expression data, protein-DNA binding data, and data that can be used to construct priors, such as protein-protein interaction (PPI) data. An expectation maximization (EM) type algorithm for the Beta mixture model is first developed and then combined with that of the Gaussian mixture model. The combined algorithm jointly estimates the parameters of both the Beta and Gaussian distributions and forms the core of the sBGMM method. The stratification property of sBGMM takes the form of stratum-specific prior probabilities, which in this study are constructed from pre-clustering results obtained from PPI data. The proposed sBGMM method differs from other mixture-model-based methods in that it integrates two different data types into a single, unified probabilistic modeling framework and incorporates prior information from a third data source. Several well-studied model selection methods, such as the Akaike information criterion (AIC), modified AIC (AIC3), Bayesian information criterion (BIC), and integrated classification likelihood-BIC (ICL-BIC), are applied to estimate the number of clusters, and simulation results show that AIC3 works best for sBGMM. Simulations also indicate that combining two different data sources into a single mixture model can greatly improve clustering accuracy and stability, and that employing priors to stratify the model can further enhance its performance. The proposed method makes more efficient use of multiple data sources than methods that analyze different data sources separately.
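The model selection criteria named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions of AIC, AIC3, BIC, and ICL-BIC, which may differ in detail from the paper's own formulation. The function name `information_criteria` and its parameters are hypothetical, and the fitted log-likelihood, free-parameter count, and posterior membership probabilities are assumed to come from an already fitted mixture model.

```python
import math


def information_criteria(log_lik, n_params, n_obs, posteriors=None):
    """Return AIC, AIC3, BIC and (optionally) ICL-BIC; lower is better.

    log_lik    -- maximized log-likelihood of the fitted mixture (assumed given)
    n_params   -- number of free parameters in the mixture
    n_obs      -- number of observations (e.g. genes)
    posteriors -- optional list of rows, one per observation, holding the
                  posterior cluster membership probabilities
    """
    deviance = -2.0 * log_lik
    scores = {
        "AIC": deviance + 2.0 * n_params,
        "AIC3": deviance + 3.0 * n_params,
        "BIC": deviance + n_params * math.log(n_obs),
    }
    if posteriors is not None:
        # Entropy of the soft assignments; terms with p == 0 contribute 0.
        entropy = -sum(
            p * math.log(p) for row in posteriors for p in row if p > 0.0
        )
        # ICL-BIC adds twice the classification entropy to BIC, which
        # penalizes poorly separated clusters.
        scores["ICL-BIC"] = scores["BIC"] + 2.0 * entropy
    return scores
```

To estimate the number of clusters in this way, one would fit the mixture for each candidate cluster count, compute the chosen criterion for each fit, and keep the count with the lowest score.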
Keywords :
Gaussian distribution; biology computing; expectation-maximisation algorithm; genetic engineering; molecular biophysics; parameter estimation; probability; proteins; statistical analysis; AIC3; Akaike information criterion; Bayesian information criterion; Beta distribution parameter estimation; Gaussian distribution parameter estimation; ICL-BIC; Stratum specific prior probabilities; cluster analysis; cluster number estimation; expectation maximization type algorithm; gene clustering; gene expression data; gene function discovery; integrated classification likelihood-BIC; modified AIC; multiple data sources; probabilistic modeling framework; protein-DNA binding data; protein-protein interaction data; sBGMM stratification property; stratified beta Gaussian mixture model; Bayesian methods; Bioinformatics; Biological system modeling; Biomedical signal processing; Clustering algorithms; Power system modeling; Proteins; Signal analysis; Signal processing algorithms; Stability; BGMM (Beta-Gaussian mixture model); BMM (Beta mixture model); EM (Expectation maximization); GMM (Gaussian mixture model); sBGMM (stratified Beta-Gaussian mixture model)
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Biocomputation, Bioinformatics, and Biomedical Technologies, 2008. BIOTECHNO '08. International Conference on
Conference_Location :
Bucharest
Print_ISBN :
978-0-7695-3191-5
Electronic_ISBN :
978-0-7695-3191-5
Type :
conf
DOI :
10.1109/BIOTECHNO.2008.12
Filename :
4561141