• DocumentCode
    1252884
  • Title

    DNA sequence classification via an expectation maximization algorithm and neural networks: a case study

  • Author

    Ma, Qicheng; Wang, Jason T. L.; Shasha, Dennis; Wu, Cathy H.

  • Author_Institution
    Novartis Pharmaceuticals Corp., Summit, NJ, USA
  • Volume
    31
  • Issue
    4
  • fYear
    2001
  • fDate
    11/1/2001
  • Firstpage
    468
  • Lastpage
    475
  • Abstract
    This paper presents new techniques for biosequence classification, with a focus on recognizing E. coli promoters in DNA. Specifically, given an unlabeled DNA sequence S, we want to determine whether S is an E. coli promoter. We use an expectation maximization (EM) algorithm to locate the -35 and -10 binding sites in an E. coli promoter sequence. Our EM algorithm differs from previously published EM algorithms in one respect: instead of assuming a uniform distribution for the spacer lengths (between the -35 and -10 binding sites, and between the -10 binding site and the transcriptional start site), it deduces the probability distribution of these lengths. Based on the located binding sites, we select features in each E. coli promoter sequence according to their information content and represent the features using an orthogonal encoding method. We then feed the features to a neural network for promoter recognition. Empirical studies show that the proposed approach achieves good performance on different data sets.
  • Keywords
    Bayes methods; DNA; biocybernetics; biology computing; encoding; microorganisms; neural nets; optimisation; pattern classification; probability; sequences; Bayesian inference; DNA sequence classification; E. coli promoter recognition; binding site location; bioinformatics; biosequence classification; case study; data mining; expectation maximization algorithm; feature selection; information contents; neural networks; orthogonal encoding method; performance; probability distribution; spacer length; transcriptional start site; unlabeled DNA sequence; Computer aided software engineering; DNA; Data mining; Delta modulation; Encoding; Network topology; Neural networks; Probability distribution; Sequences; Training data
  • fLanguage
    English
  • Journal_Title
    Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1094-6977
  • Type

    jour

  • DOI
    10.1109/5326.983930
  • Filename
    983930
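
The abstract mentions selecting features by information content and representing them with an orthogonal encoding. This record does not give the paper's exact formulas, so the following is only a minimal sketch under stated assumptions: orthogonal encoding is taken to mean one-hot encoding over the base order A, C, G, T; information content is taken to be 2 minus the Shannon entropy of a column (in bits, uniform background); and the function names are hypothetical, not the authors'.

```python
import math

BASES = "ACGT"  # assumed base order for the one-hot vectors


def orthogonal_encode(seq):
    """One-hot encode a DNA string: each base becomes a 4-element 0/1 vector."""
    vec = []
    for base in seq.upper():
        # A base outside ACGT yields an all-zero group of four.
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


def information_content(aligned, pos):
    """Information content in bits of one column of equal-length sequences.

    Assumes a uniform background and only A, C, G, T in the column:
    IC = 2 - H, where H is the Shannon entropy of the base frequencies.
    """
    counts = {b: 0 for b in BASES}
    for s in aligned:
        counts[s[pos].upper()] += 1
    n = len(aligned)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)
    return 2.0 - entropy


def select_positions(aligned, k):
    """Return the k column indices with the highest information content,
    listed in sequence order."""
    length = len(aligned[0])
    ranked = sorted(
        range(length),
        key=lambda p: information_content(aligned, p),
        reverse=True,
    )
    return sorted(ranked[:k])


def encode_selected(seq, positions):
    """One-hot encode only the selected positions of a sequence."""
    return orthogonal_encode("".join(seq[p] for p in positions))
```

In this sketch, the vector returned by `encode_selected` would be the input to a classifier; the network architecture and the EM step that aligns the binding sites are not covered here.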