• DocumentCode
    3363117
  • Title

    Undersampling Strategy Based on Clustering to Improve the Performance of Splice Site Classification in Human Genes

  • Author

    Galarda Varassin, Claudia ; Plastino, Alexandre ; Da Gama Leitao, Helena C. ; Zadrozny, Bianca

  • Author_Institution
    Comput. Sci. Dept., Fed. Univ. of ES / Fluminense Fed. Univ., Vitoria, Brazil
  • fYear
    2013
  • fDate
    26-30 Aug. 2013
  • Firstpage
    85
  • Lastpage
    89
  • Abstract
    The recognition of splice sites plays an important role in the annotation of eukaryotic gene structure. The detection of such sites is a highly imbalanced classification task because the number of negative examples found in DNA sequences is much higher than the number of positive ones. One possible strategy to deal with this imbalance is to use training sets that are more balanced than the original dataset. It is then necessary to choose which of the majority-class examples will compose those sets. Aiming to increase the learning ability in this problem, we propose a new undersampling procedure. In this strategy, the negative examples used to train the classifier are selected based on clusters obtained from the majority class. The experimental results show that, for the splice site problem, it is possible to increase classification performance when compared to simpler undersampling techniques.
  • Keywords
    DNA; bioinformatics; genetics; genomics; learning (artificial intelligence); pattern classification; pattern clustering; sampling methods; DNA sequences; eukaryotic gene structure; human genes; learning ability; splice site classification performance; splice site problem; splice site recognition; training sets; undersampling procedure; undersampling strategy; undersampling techniques; Bioinformatics; Computer science; Conferences; Data mining; Educational institutions; Support vector machines; Training; class imbalance; classification; clustering; information gain; splice sites; undersampling;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database and Expert Systems Applications (DEXA), 2013 24th International Workshop on
  • Conference_Location
    Los Alamitos, CA
  • ISSN
    1529-4188
  • Print_ISBN
    978-0-7695-5070-1
  • Type

    conf

  • DOI
    10.1109/DEXA.2013.40
  • Filename
    6621351
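
The abstract describes selecting negative (majority-class) training examples based on clusters obtained from that class. The record does not say which clustering algorithm or selection rule the authors use, so the following is only a minimal sketch under stated assumptions: a plain k-means clustering and a per-cluster quota proportional to cluster size. The names `kmeans`, `cluster_undersample`, `n_keep`, and `k` are hypothetical and do not come from the paper.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on lists of floats; returns one cluster index per point.

    Assumption: k-means stands in for whatever clustering the paper uses.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            for p in points
        ]
        # Recompute each center as the mean of its members; an empty
        # cluster keeps its previous center.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_undersample(negatives, n_keep, k, seed=0):
    """Pick roughly n_keep negatives, spread across the clusters.

    Assumption: each cluster contributes in proportion to its size,
    with at least one example per non-empty cluster.
    """
    rng = random.Random(seed)
    labels = kmeans(negatives, k, seed=seed)
    selected = []
    for c in range(k):
        members = [p for p, lab in zip(negatives, labels) if lab == c]
        if not members:
            continue
        quota = max(1, round(n_keep * len(members) / len(negatives)))
        selected.extend(rng.sample(members, min(quota, len(members))))
    return selected
```

In use, the selected negatives would be combined with all positive examples to form a more balanced training set. Because of rounding and the one-per-cluster minimum, the number of selected examples is close to `n_keep` but not guaranteed to equal it.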