• DocumentCode
    1378930
  • Title

    Identifying Relevant Data for a Biological Database: Handcrafted Rules versus Machine Learning

  • Author

    Sehgal, Aditya Kumar ; Das, Sanmay ; Noto, Keith ; Saier, Milton H. ; Elkan, Charles

  • Author_Institution
    Parity Comput., Core Technol. Group, San Diego, CA, USA
  • Volume
    8
  • Issue
    3
  • fYear
    2011
  • Firstpage
    851
  • Lastpage
    857
  • Abstract
    With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records, for incorporation into a specialized biological database of transport proteins named TCDB. We show that both learning approaches outperform rules created by hand by a human expert. As one of the first case studies involving two different approaches to updating a deployed database, both the methods compared and the results will be of interest to curators of many specialized databases.
  • Keywords
    bioinformatics; data analysis; learning (artificial intelligence); molecular biophysics; proteins; MEDLINE documents; Swiss-Prot protein records; TrEMBL protein records; biological databases; data analysis; machine learning; protein sequence; Association rules; Bioinformatics; Computer science; Data mining; Databases; Genomics; Humans; Information retrieval; Machine learning; Proteins; Bioinformatics (genome or protein) databases; association rules; biomedical text classification; classification; clustering; data mining; text mining; Algorithms; Artificial Intelligence; Carrier Proteins; Cluster Analysis; Data Mining; Databases, Genetic; Genomics; Humans; MEDLINE; Proteins
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2009.83
  • Filename
    5374367