• DocumentCode
    2413033
  • Title

    Probabilistic topic modeling for genomic data interpretation

  • Author

    Chen, Xin ; Hu, Xiaohua ; Shen, Xiajiong ; Rosen, Gail

  • Author_Institution
    Coll. of Inf. Sci. & Technol., Drexel Univ., Philadelphia, PA, USA
  • fYear
    2010
  • fDate
    18-21 Dec. 2010
  • Firstpage
    149
  • Lastpage
    152
  • Abstract
    Recently, the concept of a species containing both core and distributed genes, known as the supra- or pangenome theory, has been introduced. In this paper, we aim to develop a new method that analyzes the genome-level composition of DNA sequences in order to characterize a set of common genomic features shared by the same species and to identify their functional roles. To this end, we first apply a composition-based approach that breaks DNA sequences down into sub-reads called N-mers and represents each sequence by its N-mer frequencies. We then introduce the Latent Dirichlet Allocation (LDA) model to study the genome-level statistical patterns (a.k.a. latent topics) of the N-mer features. Each estimated latent topic represents a certain component of the whole genome. With the help of the BioJava toolkit, we access the gene region information of reference sequences from the NCBI database. We use our data mining framework to investigate two questions: 1) do strains within a species share similar core and distributed topics? and 2) do genes with similar functional roles contain similar latent topics? After studying the mutual information between latent topics and gene regions, we provide examples of each, where the BioCyc database is used to correlate pathway and reaction information with the genes. The examples demonstrate the effectiveness of the proposed method.
  • Keywords
    DNA; bioinformatics; genetics; genomics; molecular biophysics; BioCyc database; BioJava toolkit; DNA sequences; N-mer features; NCBI database; composition-based approach; core genes; distributed genes; gene functional roles; genome-level statistic patterns; genomic data interpretation; latent dirichlet allocation model; pangenome theory; probabilistic topic modeling; reaction information; supragenome theory; Bioinformatics; Biological system modeling; Data models; Databases; Genomics; Proteins; Strain; Latent Dirichlet Allocation; N-mer feature; core and distributed genes; functional annotation; genomic dataformatting;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Bioinformatics and Biomedicine (BIBM), 2010 IEEE International Conference on
  • Conference_Location
    Hong Kong
  • Print_ISBN
    978-1-4244-8306-8
  • Electronic_ISBN
    978-1-4244-8307-5
  • Type

    conf

  • DOI
    10.1109/BIBM.2010.5706554
  • Filename
    5706554
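
The composition-based step described in the abstract (breaking a DNA sequence into N-mers and representing it by N-mer frequencies) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `nmer_frequencies`, the default `n = 3`, the use of overlapping windows, and the decision to skip windows containing ambiguous bases are all assumptions, since the abstract does not specify them.

```python
from collections import Counter
from itertools import product


def nmer_frequencies(sequence: str, n: int = 3) -> dict[str, float]:
    """Return relative frequencies of overlapping N-mers in a DNA sequence.

    Hypothetical helper for illustration; the abstract does not state N or
    how ambiguous bases are handled.
    """
    seq = sequence.upper()
    # Slide a window of width n across the sequence, one base at a time.
    windows = (seq[i:i + n] for i in range(len(seq) - n + 1))
    # Assumption: windows containing anything other than A, C, G, T are skipped.
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    # Use a fixed vocabulary of all 4**n N-mers so that every sequence
    # produces a feature vector with the same keys in the same order.
    vocab = ("".join(p) for p in product("ACGT", repeat=n))
    return {w: (counts[w] / total if total else 0.0) for w in vocab}


if __name__ == "__main__":
    freqs = nmer_frequencies("ACGTAC", n=2)
    print({k: v for k, v in freqs.items() if v > 0})
```

Frequency vectors of this kind play the role of "word counts" for a topic model such as LDA, with each sequence treated as a document and each N-mer as a word.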