• DocumentCode
    167287
  • Title
    Latent Dirichlet Allocation based on Gibbs Sampling for gene function prediction
  • Author
    Pinoli, Pietro ; Chicco, Davide ; Masseroli, Marco
  • Author_Institution
    Dipt. di Elettron. Inf. e Bioingegneria, Politec. di Milano, Milan, Italy
  • fYear
    2014
  • fDate
    21-24 May 2014
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    Gene function annotations are key elements in biology and bioinformatics. A typical annotation is the association between a gene and a controlled vocabulary term that describes a functional feature of the gene (e.g., a Gene Ontology (GO) feature term). Unfortunately, available annotations contain errors, and biologically validated ones are incomplete by definition, since new knowledge is continuously discovered. Computational algorithms that provide ranked lists of predicted new gene annotations are therefore a valuable contribution to bioinformatics research. Here, we propose two variants of the well-known Latent Dirichlet Allocation (LDA) algorithm, applied to the prediction of gene annotations. LDA is a very efficient machine learning method built on a set of multinomial probability distributions over a set of topics, given a document (a gene, in our case), and on a set of multinomial probability distributions over a set of words (feature terms, in our case), given a topic. In topic modeling, a topic can be considered a latent meta-category of words, and a document a mixture of topics. Our two LDA variants use the collapsed Gibbs Sampling method during the training phase, with two distinct initialization approaches that adapt the LDA mathematical model to the biomolecular annotation scenario. Using six outdated datasets of GO annotations of human and brown rat genes, we compared the annotations predicted by our methods to those given by the previously developed truncated Singular Value Decomposition (tSVD) method; we then validated them against the annotations available in an updated version of the same datasets. The results obtained show the efficiency of our newly proposed algorithms.
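The model described in the abstract can be illustrated with a minimal sketch of standard collapsed Gibbs sampling for LDA, treating each gene as a document and each feature term as a word. This is not the authors' implementation: the abstract does not specify their two initialization approaches, so plain random initialization is assumed here, and the function names (`collapsed_gibbs_lda`, `score_annotations`), the hyperparameters `alpha` and `beta`, and the iteration count are all hypothetical choices for illustration.

```python
import random

def collapsed_gibbs_lda(docs, n_topics, n_terms, alpha=0.1, beta=0.01,
                        n_iter=200, seed=0):
    """docs: one list of feature-term ids per gene. Returns (theta, phi)."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]             # gene-topic counts
    n_kw = [[0] * n_terms for _ in range(n_topics)]   # topic-term counts
    n_k = [0] * n_topics                              # assignments per topic
    z = []                                            # topic of each annotation

    # Assumption: uniform random initialization (the paper's two
    # initialization variants are not reproduced here).
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional topic distribution with theta and phi
                # integrated out (the "collapsed" part).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + n_terms * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Point estimates of the two families of multinomial distributions.
    theta = [[(n_dk[d][t] + alpha) / (len(doc) + n_topics * alpha)
              for t in range(n_topics)] for d, doc in enumerate(docs)]
    phi = [[(n_kw[t][w] + beta) / (n_k[t] + n_terms * beta)
            for w in range(n_terms)] for t in range(n_topics)]
    return theta, phi

def score_annotations(theta, phi):
    """Score of term w for gene d: sum over topics of theta[d][k] * phi[k][w]."""
    n_topics, n_terms = len(phi), len(phi[0])
    return [[sum(th[k] * phi[k][w] for k in range(n_topics))
             for w in range(n_terms)] for th in theta]
```

Under these assumptions, sorting each row of `score_annotations(theta, phi)` in descending order and skipping terms already annotated to the gene would give one possible ranked list of candidate new annotations.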
  • Keywords
    bioinformatics; genomics; learning (artificial intelligence); ontologies (artificial intelligence); sampling methods; LDA algorithm; LDA variants; collapsed Gibbs sampling method; controlled vocabulary term; feature terms; gene annotation prediction; gene function annotations; gene function prediction; gene functional feature; gene ontology feature term; gene-feature term association; latent Dirichlet allocation; latent word metacategory; machine learning method; multinomial probability distributions; tSVD comparison; truncated singular value decomposition; Bioinformatics; Ontologies; Prediction algorithms; Probability distribution; Resource management; Semantics; Vectors;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence in Bioinformatics and Computational Biology, 2014 IEEE Conference on
  • Conference_Location
    Honolulu, HI
  • Type
    conf
  • DOI
    10.1109/CIBCB.2014.6845514
  • Filename
    6845514