Genomewide motif identification using a dictionary model

Author

Sabatti, Chiara ; Lange, Kenneth

Author_Institution

Departments of Human Genetics and Statistics, University of California, Los Angeles, CA, USA

Volume

90

Issue

11

fYear

2002

fDate

11/1/2002

Firstpage

1803

Lastpage

1810

Abstract

This paper surveys and extends models and algorithms for identifying binding sites in noncoding regions of DNA. Binding sites control the transcription of genes into messenger RNA in preparation for translation into proteins. The base sequence of most binding sites is not entirely fixed, with the different permitted spellings collectively constituting a "motif." After summarizing the underlying biological issues, we review three different models for binding site identification. Each model was developed with a different type of dataset as reference. We then present a unified model that borrows from the previous ones and integrates their main features. In our unified model, one can identify motifs and their unknown positions along a sequence. One can also fit the model to data using maximum likelihood and maximum a posteriori algorithms. These algorithms rely on recursive formulas and the maximization/minorization principle. Finally, we conclude with a prospectus of future data analyses and theoretical research.

Keywords

DNA; biology computing; genetics; physiological models; proteins; binding sites; expectation-maximization algorithm; genes transcription; genomic sequence; maximum a posteriori algorithms; maximum likelihood algorithms; messenger RNA; permitted spellings; text segmentation; unknown positions along sequence; Bioinformatics; Biological cells; Biological system modeling; Dictionaries; Genomics; Humans; Sequences; Statistics

fLanguage

English

Journal_Title

Proceedings of the IEEE

Publisher

IEEE

ISSN

0018-9219

Type

jour

DOI

10.1109/JPROC.2002.804689

Filename

1046958