• DocumentCode
    2591086
  • Title
    Protein-protein interaction extraction from bio-literature with compact features and data sampling strategy
  • Author
    Zhang, Hongtao ; Huang, Minglie ; Zhu, Xiaoyan
  • Author_Institution
    Dept. of Comput. Sci. & Tech, Tsinghua Univ., Beijing, China
  • Volume
    4
  • fYear
    2011
  • fDate
    15-17 Oct. 2011
  • Firstpage
    1767
  • Lastpage
    1771
  • Abstract
    A large number of protein-protein interactions (PPIs) are buried in the massive body of biomedical articles published over the years, which has motivated the development of automatic PPI extraction methods. However, existing methods based on supervised machine learning still face two challenges: (1) the feature space exploited by these methods is very sparse; and (2) the training data are imbalanced with respect to the categories to be classified. In this paper, we first construct rich and compact features to alleviate feature sparseness. With these features, our method outperforms the baselines by up to 9.58% in F-score on the original AIMed corpus. Furthermore, we propose a data sampling strategy based on under-sampling to address the class imbalance problem. To re-balance the data distribution, samples of the majority class are removed iteratively according to the prediction results. In this way, our method achieves a further 2.49% improvement in F-score on the original AIMed corpus.
  • Keywords
    bioinformatics; classification; learning (artificial intelligence); molecular biophysics; proteins; sampling methods; AIMed corpus; automatic PPI extraction method; bioliterature; biomedical articles; class imbalance problem; classification; compact features; data sampling; feature sparseness; iterative method; protein-protein interaction extraction; supervised machine learning; undersampling; Bioinformatics; Data mining; Feature extraction; Protein engineering; Proteins; Support vector machines; Training; Class Imbalance; Compact Features; Feature Sparseness; Protein-Protein Interaction Extraction; Under-Sampling;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Biomedical Engineering and Informatics (BMEI), 2011 4th International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-4244-9351-7
  • Type
    conf
  • DOI
    10.1109/BMEI.2011.6098714
  • Filename
    6098714
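
  • Note
    The abstract says that majority-class samples are removed iteratively according to prediction results, but it does not give the removal criterion. The Python sketch below is only one possible reading, not the authors' procedure. It assumes that in each round a simple nearest-centroid model is refit and the majority samples it scores as most confidently negative are dropped. The function name iterative_undersample, the parameters target_ratio and drop_frac, and the nearest-centroid model are all hypothetical choices made for illustration; the keywords suggest the paper itself used support vector machines.

```python
import random


def centroid(rows):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def margin(x, pos_c, neg_c):
    """Larger values mean x lies closer to the negative (majority) centroid."""
    return sq_dist(x, pos_c) - sq_dist(x, neg_c)


def iterative_undersample(pos, neg, target_ratio=1.0, drop_frac=0.2):
    """Shrink the majority class `neg` until len(neg) <= target_ratio * len(pos).

    ASSUMPTION (not stated in the abstract): each round refits a
    nearest-centroid model and removes the `drop_frac` share of majority
    samples with the largest margin toward the negative centroid.
    `pos` must be non-empty.
    """
    neg = list(neg)
    limit = int(target_ratio * len(pos))
    while len(neg) > limit:
        pos_c, neg_c = centroid(pos), centroid(neg)
        # Most confidently negative samples come first.
        ranked = sorted(neg, key=lambda x: margin(x, pos_c, neg_c), reverse=True)
        # Drop at least one sample per round, but never go below the limit.
        n_drop = min(max(1, int(drop_frac * len(neg))), len(neg) - limit)
        neg = ranked[n_drop:]
    return neg


# Synthetic data with a 1:10 class ratio (illustrative only, not AIMed).
random.seed(0)
pos = [[random.gauss(2, 1), random.gauss(2, 1)] for _ in range(20)]
neg = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
kept = iterative_undersample(pos, neg)
print(len(pos), len(kept))
```

    Removing the easiest negatives keeps the samples near the class boundary, which is one common motivation for prediction-guided under-sampling. Other criteria, such as removing misclassified or noisy majority samples, are equally consistent with the abstract's wording.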