Nearest neighbor training of side effect machines for sequence classification

Author

Ashlock, Daniel ; McEachern, Andrew

Author_Institution

Dept. of Math. & Stat., Univ. of Guelph, Guelph, ON, Canada

fYear

2010

fDate

2-5 May 2010

Firstpage

1

Lastpage

8

Abstract

Side effect machines operate by associating side effects with the states of a finite state machine. The use of side effect machines permits the researcher to leverage information stored in the state transition structure, making machines that might be identical as recognizers behave differently as classifiers. The side effect machines in this study associate a counter with each state, so that the number of times each state is visited becomes a numerical feature associated with that state. The key to effective use of these numerical features is to locate side effect machines for which the count vectors are good feature sets. In this study, side effect machines are selected with an evolutionary algorithm. The Rand index of nearest neighbor classification of the count vectors serves as the fitness function for selecting side effect machines. A parameter study is performed on simple synthetic data, and then side effect machines are trained to classify two sets of biological sequences. The first set comprises two categories of HLA sequences from the human major histocompatibility complex. The second set comprises positive and negative examples of human endogenous retroviral sequences taken from the human genome. The retroviral sequences are challenging, but good results are obtained. The HLA data is classified with complete accuracy.
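
The fitness evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The transition-table layout, the choice to count a state each time it is entered (without counting the start state), the squared Euclidean distance, the leave-one-out form of nearest neighbor classification, and all function names are assumptions made for illustration. The evolutionary algorithm that selects machines is not shown.

```python
from itertools import combinations


def count_vector(transitions, start, sequence):
    """Run the machine over the sequence and count how often each state is entered.

    Assumption: the start state is not counted before the first symbol is read.
    """
    counts = [0] * len(transitions)
    state = start
    for symbol in sequence:
        state = transitions[state][symbol]
        counts[state] += 1
    return counts


def nearest_neighbor_labels(vectors, labels):
    """Leave-one-out 1-NN: each vector takes the label of its closest other vector."""
    predicted = []
    for i, v in enumerate(vectors):
        best = min(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(v, vectors[j])),
        )
        predicted.append(labels[best])
    return predicted


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which the two labelings agree (same vs. different)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def fitness(transitions, start, sequences, labels):
    """Rand index between the true labels and the nearest neighbor labels."""
    vectors = [count_vector(transitions, start, s) for s in sequences]
    return rand_index(labels, nearest_neighbor_labels(vectors, labels))


# Hypothetical two-state machine over the DNA alphabet:
# state 1 is entered on C or G, state 0 on A or T.
row = {"A": 0, "C": 1, "G": 1, "T": 0}
machine = [dict(row), dict(row)]

# Made-up toy sequences and class labels, not data from the paper.
sequences = ["AATTAT", "ATATTA", "GCGCCG", "CCGGAC"]
labels = [0, 0, 1, 1]

print(count_vector(machine, 0, "CCGGAC"))
print(fitness(machine, 0, sequences, labels))
```

In this toy machine the count vector reduces to the A/T and C/G counts of a sequence. A machine with more states and a less regular transition structure would yield features that depend on the order of the symbols as well as their frequency.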

Keywords

biology computing; evolutionary computation; finite state machines; genomics; learning (artificial intelligence); pattern classification; Rand index; biological sequences; count vectors; evolutionary algorithm; finite state machine; fitness function; human endogenous retroviral sequences; human genome; human major histocompatibility complex; nearest neighbor training; sequence classification; side effect machines; state transition structure; synthetic data; Automata; Bioinformatics; Clustering algorithms; DNA; Evolution (biology); Evolutionary computation; Genomics; Humans; Nearest neighbor searches; Sequences;

fLanguage

English

Publisher

IEEE

Conference_Titel

Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2010 IEEE Symposium on

Conference_Location

Montreal, QC

Print_ISBN

978-1-4244-6766-2

Type

conf

DOI

10.1109/CIBCB.2010.5510426

Filename

5510426