DocumentCode
2580678
Title
Detecting homogeneity in protein sequence clusters for automatic functional annotation and noise detection
Author
Chen, Chien-Yu
Author_Institution
Graduate Sch. of Biotechnol. & Bioinformatics, Yuan Ze Univ., Chung-Li, Taiwan
fYear
2005
fDate
15-16 Aug. 2005
Abstract
Protein sequence clustering aims to identify sets of homologous proteins in a protein database (Kriventseva et al., 2001). The information derived from clustering is widely used for further analysis such as protein family discovery, function prediction, and database compression. For some applications, it is highly desirable to generate a hierarchical structure, also referred to as a dendrogram, that shows how proteins are clustered at various levels. However, it is not easy to decide the boundaries of the natural clusters that correspond to protein families. According to our previous studies, the weighted average precision of the homogeneous clusters in the hierarchy of the Swiss-Prot 41.0 database is 98.5% (Chen et al., 2004). Our experimental results show that 2158 protein families achieve their best matching rate on a homogeneous cluster, the largest of which contains 293 proteins. This result shows that many protein families possess the homogeneity property in their sequences. These 2158 best-matched clusters deliver a weighted average precision of 97.34% and a weighted average recall of 91.41%.
Keywords
biology computing; pattern clustering; proteins; sequences; automatic functional annotation; automatic noise detection; database compression; dendrogram; function prediction; homologous proteins; protein database; protein family discovery; protein sequence clustering; Bioinformatics; Biotechnology; Clustering algorithms; Educational institutions; Gaussian distribution; Information analysis; Protein sequence; Spatial databases; Statistical distributions; Testing
fLanguage
English
Publisher
IEEE
Conference_Titel
Emerging Information Technology Conference, 2005.
Print_ISBN
0-7803-9328-7
Type
conf
DOI
10.1109/EITC.2005.1544342
Filename
1544342