DocumentCode
2341384
Title
Towards automatic clustering of protein sequences
Author
Yang, Jiong ; Wang, Wei
Author_Institution
IBM Thomas J. Watson Res. Center, USA
fYear
2002
fDate
2002
Firstpage
175
Lastpage
186
Abstract
Analyzing protein sequence data has become increasingly important. Most previous work in this area has focused on building classification models. In this paper we investigate the problem of automatic clustering of unlabeled protein sequences. As a widely recognized technique in statistics and computer science, clustering has proven very useful in detecting unknown object categories and revealing hidden correlations among objects. One difficulty that prevents clustering from being performed directly on protein sequences is the lack of an effective similarity measure that can be computed efficiently. Therefore, we propose a novel model for protein sequence clustering by exploring significant statistical properties possessed by the sequences. The concept of imprecise probabilities is introduced into the original probabilistic suffix tree to monitor the convergence of the empirical measurement and to guide the clustering process. It is demonstrated that the proposed method can successfully discover meaningful families without the necessity of learning models of different families from pre-labeled "training data".
Keywords
DNA; biology computing; computational complexity; convergence; pattern clustering; probability; trees (mathematics); CLUSEQ algorithm; classification models; clustering algorithm; complexity; convergence; probabilistic suffix tree; probability; protein sequences; Computer science; Convergence; Data analysis; Monitoring; Object detection; Performance evaluation; Probability; Protein sequence; Statistics; Training data;
fLanguage
English
Publisher
IEEE
Conference_Titel
Bioinformatics Conference, 2002. Proceedings. IEEE Computer Society
Print_ISBN
0-7695-1653-X
Type
conf
DOI
10.1109/CSB.2002.1039340
Filename
1039340