DocumentCode :
3060850
Title :
Generalized Sequence Signatures through Symbolic Clustering
Author :
Dorr, Dietmar ; Denton, Anne
Author_Institution :
North Dakota State Univ., Fargo
fYear :
2007
fDate :
13-15 Dec. 2007
Firstpage :
567
Lastpage :
572
Abstract :
Traditionally, sequence motifs and domains, also called signatures, are defined such that insertions, deletions, and mismatched regions are small compared with matched regions. We introduce an algorithm for identifying generalized sequence signatures that can be composed of windows distributed throughout the sequence. Our approach is based on cluster analysis of recurring subsequences of a predefined length, which we refer to as symbols. Symbols are not required to be located in close proximity to each other. The clustering algorithm groups sequences so as to maximize the number of shared symbols among sequences. We evaluate our signatures against those obtained from the InterPro database and show that our approach has benefits for deriving sequence annotations compared with InterPro's signatures.
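The idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give the clustering procedure, so the greedy single-link grouping below is a stand-in, and the function names, the `min_support` and `min_shared` thresholds, and the toy sequences are all assumptions made for illustration. It shows only the two ingredients the abstract names: fixed-length recurring windows ("symbols") and grouping sequences by the symbols they share, regardless of where those symbols sit in each sequence.

```python
from collections import Counter
from itertools import combinations


def extract_symbols(sequence, length):
    """Return the set of all windows of the given length in one sequence."""
    return {sequence[i:i + length] for i in range(len(sequence) - length + 1)}


def recurring_symbols(sequences, length, min_support=2):
    """Keep, per sequence, only windows found in at least min_support sequences."""
    per_seq = {sid: extract_symbols(seq, length) for sid, seq in sequences.items()}
    support = Counter(sym for syms in per_seq.values() for sym in syms)
    return {sid: {s for s in syms if support[s] >= min_support}
            for sid, syms in per_seq.items()}


def cluster_by_shared_symbols(symbols, min_shared=2):
    """Stand-in clustering: link sequences that share >= min_shared symbols."""
    parent = {sid: sid for sid in symbols}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(symbols, 2):
        if len(symbols[a] & symbols[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for sid in symbols:
        clusters.setdefault(find(sid), set()).add(sid)
    return list(clusters.values())


# Toy data (hypothetical): p1 and p2 share two windows in opposite order
# and far apart; p3 shares nothing with either.
sequences = {
    "p1": "MKTAYIAKQRGGGGLVELS",
    "p2": "LVELSPPPPPMKTAY",
    "p3": "WWWWHHHHCCCC",
}
symbols = recurring_symbols(sequences, length=5)
clusters = cluster_by_shared_symbols(symbols, min_shared=2)
```

In this toy case `p1` and `p2` end up in one group because both contain the windows `MKTAY` and `LVELS`, even though the windows appear in a different order and are separated by unrelated residues. A conventional motif definition that tolerates only short gaps would not capture that pairing.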
Keywords :
biology computing; pattern clustering; proteins; sequences; InterPro database; clustering algorithm group sequences; clustering analysis; generalized sequence signatures; recurring subsequences; sequence annotations; sequence motifs; symbolic clustering; Application software; Bioinformatics; Clustering algorithms; Computer science; Databases; Genomics; Hidden Markov models; Machine learning; Neodymium; Proteins;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Applications, 2007. ICMLA 2007. Sixth International Conference on
Conference_Location :
Cincinnati, OH
Print_ISBN :
978-0-7695-3069-7
Type :
conf
DOI :
10.1109/ICMLA.2007.41
Filename :
4457290