DocumentCode :
1348141
Title :
CLASS: a general approach to classifying categorical sequences
Author :
Kelil, Abdellali ; Nordell-Markovits, Alexei ; Zaralahy, Parakh Ousman Yassine ; Wang, Shengrui
Author_Institution :
Dept. of Comput. Sci., Univ. de Sherbrooke, Sherbrooke, QC, Canada
Volume :
34
Issue :
4
fYear :
2009
Firstpage :
158
Lastpage :
166
Abstract :
The rapid growth of available data in the form of categorical sequences, such as biological sequences, natural language texts, and network and retail transactions, makes the classification of categorical sequences increasingly important. The main challenge is to identify significant features hidden behind the chronological and structural dependencies that characterize their intrinsic properties. Almost all existing algorithms designed to perform this task are based on matching patterns in chronological order, but categorical sequences often have similar features in non-chronological order. In addition, these algorithms have serious difficulty outperforming domain-specific algorithms. In this paper we propose CLASS, a general approach for the classification of categorical sequences. By using an effective matching scheme called SPM (Significant Patterns Matching), CLASS is able to capture the intrinsic properties of categorical sequences. Furthermore, the use of Latent Semantic Analysis allows CLASS to capture semantic relations using global information extracted from a large number of sequences, rather than merely comparing pairs of sequences. Moreover, CLASS employs a classifier called SNN (Significant Nearest Neighbours), inspired by the K Nearest Neighbours approach with a dynamic estimation of K, which allows the reduction of both false positives and false negatives in the classification. Extensive tests performed on a range of datasets from different fields show that CLASS is often competitive with domain-specific approaches.
Keywords :
pattern matching; statistical analysis; CLASS; K nearest neighbour approach; SNN; SPM; categorical sequences; latent semantic analysis; matching scheme; significant nearest neighbour; significant patterns matching; Indexes; Matrix decomposition; Proteins; Semantics; Testing; Training; significant patterns; nearest neighbours; N-gram
fLanguage :
English
Journal_Title :
Electrical and Computer Engineering, Canadian Journal of
Publisher :
IEEE
ISSN :
0840-8688
Type :
jour
DOI :
10.1109/CJECE.2009.5599423
Filename :
5599423