DocumentCode
2461102
Title
An Improved Minimum Description Length Learning Algorithm for Nucleotide Sequence Analysis
Author
Evans, Scott ; Markham, Steve ; Torres, Andrew ; Kourtidis, Antonis ; Conklin, Douglas
Author_Institution
U.S. Army Med. Res. Acquisition Activity, Fort Detrick, MD
fYear
2006
fDate
Oct. 29 2006-Nov. 1 2006
Firstpage
1843
Lastpage
1850
Abstract
We present an improved minimum description length (MDL) learning algorithm, MDLCompress, for nucleotide sequence analysis that achieves better compression than other grammar-based coding methods such as DNA Sequitur while retaining a two-part code that highlights biologically significant phrases. The phrases recursively added to the MDLCompress model are not necessarily the longest matches or the most frequently repeated phrases of a given length; each is chosen for a combination of length and repetition such that including it in the model maximizes compression. The deep recursion of MDLCompress, combined with its two-part coding, makes it uniquely able to identify biologically meaningful sequence without limiting assumptions. The ability to quantify the cost in bits of phrases in the MDL model supports prediction of fragile regions where single nucleotide polymorphisms (SNPs) may have the most impact on biological activity. MDLCompress improves on our previous algorithm in runtime performance, through an innovative data structure, and in specificity of motif detection (compression), through improved heuristics. We also discuss recent results from MDLCompress analysis of 144 known overexpressed genes from a breast cancer cell line, BT474. Novel motifs, including potential microRNA (miRNA) binding sites, have been identified within certain genes and are being considered for in vitro validation studies.
Keywords
biological organs; cancer; cellular biophysics; genetics; learning (artificial intelligence); macromolecules; medical computing; molecular biophysics; MDLCompress; biological activity; breast cancer cell line; genes; heuristics; learning algorithm; microRNA binding sites; minimum description length algorithm; motif detection; nucleotide sequence analysis; single nucleotide polymorphisms; two-part coding; Algorithm design and analysis; Biological information theory; Biological system modeling; Breast cancer; Costs; DNA; Data structures; Predictive models; Runtime; Sequences;
fLanguage
English
Publisher
IEEE
Conference_Titel
Signals, Systems and Computers, 2006. ACSSC '06. Fortieth Asilomar Conference on
Conference_Location
Pacific Grove, CA
ISSN
1058-6393
Print_ISBN
1-4244-0784-2
Electronic_ISBN
1058-6393
Type
conf
DOI
10.1109/ACSSC.2006.355081
Filename
4176891