• DocumentCode
    1785102
  • Title
    Adopting the MapReduce framework to pre-train 1-D and 2-D protein structure predictors with large protein datasets
  • Author
    Eickholt, Jesse; Karki, Suman
  • Author_Institution
    Dept. of Comput. Sci., Central Michigan Univ., Mount Pleasant, MI, USA
  • fYear
    2014
  • fDate
    2-5 Nov. 2014
  • Firstpage
    23
  • Lastpage
    29
  • Abstract
    Sequence-based machine learning approaches for 1-D and 2-D protein structure prediction tasks have long been limited by relatively small datasets, namely proteins with experimentally determined structure. Recent advances in machine learning provide a means of using unlabeled data, which opens up access to a much larger sequence space in the context of protein structure prediction. Here we present a 3-stage pipeline to construct a representative protein sequence dataset, generate training data, and pre-train deep network models for 1-D and 2-D protein structure prediction tasks. To handle the complexities of managing the large dataset, we implemented our pipeline using the MapReduce framework, which allowed us to leverage existing tools such as Hadoop. The result is the ability to apply large amounts of novel protein sequence data to 1-D and 2-D protein structure prediction. We also used our pipeline to curate a non-redundant protein sequence dataset that we have made available with accompanying data.
  • Keywords
    biology computing; data handling; learning (artificial intelligence); molecular biophysics; parallel processing; proteins; 1D protein structure prediction task; 1D protein structure predictor; 2D protein structure prediction task; 2D protein structure predictor; 3-stage pipeline; Hadoop; MapReduce framework; determined protein structure; large protein dataset; nonredundant protein sequence dataset; pretrain deep network model; sequence based machine learning; Data models; Pipelines; Protein engineering; Protein sequence; Training; Training data; MapReduce; deep networks; protein structure prediction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on
  • Conference_Location
    Belfast
  • Type
    conf
  • DOI
    10.1109/BIBM.2014.6999306
  • Filename
    6999306