DocumentCode :
2506548
Title :
CLUSEQ: efficient and effective sequence clustering
Author :
Yang, Jiong ; Wang, Wei
Author_Institution :
Dept. of Comput. Sci., Illinois Univ., Urbana, IL, USA
fYear :
2003
fDate :
5-8 March 2003
Firstpage :
101
Lastpage :
112
Abstract :
Analyzing sequence data has become increasingly important in areas such as biological sequences, text documents, and Web access logs. We investigate the problem of clustering sequences based on their sequential features. As a widely recognized technique, clustering has proven very useful in detecting unknown object categories and revealing hidden correlations among objects. One difficulty that prevents clustering from being performed extensively on sequence data (in categorical domains) is the lack of an effective yet efficient similarity measure. We therefore propose a novel model (CLUSEQ) for sequence clustering that exploits significant statistical properties of the sequences. The conditional probability distribution (CPD) of the next symbol given a preceding segment is derived and used to characterize sequence behavior and to support the similarity measure. A variation of the suffix tree, namely the probabilistic suffix tree, is employed to organize (the significant portion of) the CPD in a concise way. A novel algorithm is devised to efficiently discover high-quality clusters; it automatically adjusts the number of clusters to its optimal range via a unique combination of successive new cluster generation and cluster consolidation. The performance of CLUSEQ is demonstrated via extensive experiments on several real and synthetic sequence databases.
Keywords :
database management systems; probability; statistical analysis; CLUSEQ model; CPD; clustering sequences; conditional probability distribution; probabilistic suffix tree; real sequence database; sequential features; synthetic sequence database; Amino acids; Biological information theory; Biology; Computer science; Data analysis; Data mining; Object detection; Performance evaluation; Probability distribution; Protein sequence;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Engineering, 2003. Proceedings. 19th International Conference on
Print_ISBN :
0-7803-7665-X
Type :
conf
DOI :
10.1109/ICDE.2003.1260785
Filename :
1260785