DocumentCode :
3363117
Title :
Undersampling Strategy Based on Clustering to Improve the Performance of Splice Site Classification in Human Genes
Author :
Galarda Varassin, Claudia ; Plastino, Alexandre ; Da Gama Leitao, Helena C. ; Zadrozny, Bianca
Author_Institution :
Comput. Sci. Dept., Fed. Univ. of ES / Fluminense Fed. Univ., Vitoria, Brazil
fYear :
2013
fDate :
26-30 Aug. 2013
Firstpage :
85
Lastpage :
89
Abstract :
The recognition of splice sites plays an important role in the annotation of eukaryotic gene structure. The detection of such sites is a highly imbalanced classification task because the number of negative examples found in DNA sequences is much higher than the number of positive ones. One possible strategy for dealing with this imbalance is to use training sets that are more balanced than the original dataset. It is then necessary to choose which of the majority-class examples will compose those sets. To improve learning on this problem, we propose a new undersampling procedure. In this strategy, the negative examples used to train the classifier are selected based on clusters obtained from the majority class. The experimental results show that, for the splice site problem, this approach can increase classification performance compared with simpler undersampling techniques.
Keywords :
DNA; bioinformatics; genetics; genomics; learning (artificial intelligence); pattern classification; pattern clustering; sampling methods; DNA sequences; eukaryotic gene structure; human genes; learning ability; splice site classification performance; splice site problem; splice site recognition; training sets; undersampling procedure; undersampling strategy; undersampling techniques; Bioinformatics; Computer science; Conferences; Data mining; Educational institutions; Support vector machines; Training; class imbalance; classification; clustering; information gain; splice sites; undersampling;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Database and Expert Systems Applications (DEXA), 2013 24th International Workshop on
Conference_Location :
Los Alamitos, CA
ISSN :
1529-4188
Print_ISBN :
978-0-7695-5070-1
Type :
conf
DOI :
10.1109/DEXA.2013.40
Filename :
6621351