Title :
DNA classifications with self-organizing maps (SOMs)
Author :
Naenna, T. ; Bress, Robert A. ; Embrechts, Mark J.
Author_Institution :
Dept. of Eng. Sci., Rensselaer Polytech. Inst., Troy, NY, USA
Abstract :
The objective of this paper is to apply self-organizing maps (SOMs) to exon/intron classification in DNA using windowed splice junction data. Splice junctions are groups of nucleotides that serve as boundaries between sections of DNA that code for genetic material and sections that do not. Genes are often interrupted by sections of non-coding DNA sequences. The data used for this study are human DNA data taken from the National Center for Biotechnology Information (http://www.ncbi.nih.gov/). The DNA dataset contains 1,424 DNA sequences with 128 descriptors for each sequence. SOMs are used to classify each DNA sequence into one of three categories: transition from gene (exon) to non-gene (intron), transition from non-gene (intron) to gene (exon), and no transition, where the two-base-pair code for the splice junction is coincidental. The multidimensional sequences are clustered into a two-dimensional space that is displayed graphically for data exploration and classification. The visual and graphical capabilities of SOMs are applied to classify the DNA dataset. The topographic properties of SOMs keep similar sequences close to each other on the output map. Clusters of the dataset are determined and labeled based on the classes of the output neurons in the cluster. Each output neuron is labeled with the class most frequently mapped to it.
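The clustering and neuron-labeling procedure described in the abstract can be illustrated with a minimal sketch in Python. This is not the authors' implementation: the grid size, learning-rate and neighborhood schedules, the function names (train_som, best_matching_unit, label_neurons), and the synthetic 8-dimensional toy data are all assumptions made for illustration. The actual dataset has 1,424 sequences with 128 descriptors each.

```python
import numpy as np
from collections import Counter


def best_matching_unit(weights, x):
    """Return the (row, col) of the neuron whose weight vector is closest to x."""
    dist = np.linalg.norm(weights - x, axis=-1)
    row, col = np.unravel_index(np.argmin(dist), dist.shape)
    return int(row), int(col)


def train_som(data, rows=4, cols=4, epochs=10, lr0=0.5, sigma0=2.0, seed=0):
    """Train a rectangular SOM with a Gaussian neighborhood (assumed schedule)."""
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    weights = rng.random((rows, cols, dim))
    # Grid coordinates of every neuron, shape (rows, cols, 2).
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    total_steps = epochs * n
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)               # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5   # shrinking neighborhood radius
            bmu = np.array(best_matching_unit(weights, data[i]))
            d2 = np.sum((grid - bmu) ** 2, axis=-1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))  # neighborhood strength per neuron
            weights += lr * h[..., None] * (data[i] - weights)
            step += 1
    return weights


def label_neurons(weights, data, labels):
    """Label each neuron that receives samples with its most frequent class."""
    votes = {}
    for x, y in zip(data, labels):
        votes.setdefault(best_matching_unit(weights, x), Counter())[y] += 1
    return {unit: counts.most_common(1)[0][0] for unit, counts in votes.items()}


# Toy stand-in for the splice junction data: three synthetic classes, 8 descriptors.
rng = np.random.default_rng(1)
class_names = ["exon->intron", "intron->exon", "no transition"]
centers = [0.1, 0.5, 0.9]
data = np.vstack([rng.normal(c, 0.05, size=(30, 8)) for c in centers])
labels = [name for name in class_names for _ in range(30)]

weights = train_som(data)
neuron_labels = label_neurons(weights, data, labels)
```

A new sequence could then be classified by looking up the label of its best matching unit, for example neuron_labels.get(best_matching_unit(weights, x)). Neurons that received no training samples have no label in this sketch, so the lookup may return None.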
Keywords :
DNA; biology computing; data visualisation; genetics; molecular biophysics; pattern classification; self-organising feature maps; sequences; DNA classification; DNA dataset; DNA sequence; National Center for Bioinformatics Information; SOM; class frequency; class mapping; cluster labeling; data classification; data exploration; dataset cluster; exon classification; gene-nongene transition; genetic material code; genetic section; graphical capability; graphical display; human DNA data; intron classification; multidimensional sequence; neuron class; nongene-gene transition; nucleotide; output map; output neuron; self-organizing map; sequence clustering; sequence descriptor; topographic property; transition category; two-base pair code; two-dimensional space; visual capability; windowed splice junction data; Bioinformatics; Biological materials; DNA; Frequency; Genetics; Humans; Multidimensional systems; Neurons; Self organizing feature maps; Sequences;
Conference_Title :
Soft Computing in Industrial Applications, 2003. SMCia/03. Proceedings of the 2003 IEEE International Workshop on
Print_ISBN :
0-7803-7855-5
DOI :
10.1109/SMCIA.2003.1231361