Title :
An Improved Minimum Description Length Learning Algorithm for Nucleotide Sequence Analysis
Author :
Evans, Scott ; Markham, Steve ; Torres, Andrew ; Kourtidis, Antonis ; Conklin, Douglas
Author_Institution :
U.S. Army Med. Res. Acquisition Activity, Fort Detrick, MD
fDate :
Oct. 29 2006-Nov. 1 2006
Abstract :
We present MDLCompress, an improved minimum description length (MDL) learning algorithm for nucleotide sequence analysis. It achieves better compression than other grammar-based coding methods such as DNA Sequitur while retaining a two-part code that highlights biologically significant phrases. The phrases added recursively to the MDLCompress model are not necessarily the longest matches, nor the most frequently repeated phrases of a given length; each is chosen for the combination of length and repetition whose inclusion in the model maximizes compression. The deep recursion of MDLCompress, combined with its two-part coding, makes it uniquely able to identify biologically meaningful sequences without limiting assumptions. Because the cost in bits of each phrase in the MDL model can be quantified, the model supports prediction of fragile regions where single nucleotide polymorphisms (SNPs) may have the most impact on biological activity. MDLCompress improves on our previous algorithm in runtime performance, through an innovative data structure, and in specificity of motif detection (compression), through improved heuristics. We also discuss recent results from MDLCompress analysis of 144 genes known to be overexpressed in the breast cancer cell line BT474. Novel motifs, including potential microRNA (miRNA) binding sites, have been identified within certain genes and are being considered for in vitro validation studies.
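The phrase-selection idea in the abstract (favoring the combination of length and repetition that saves the most bits, not the longest or the most frequent phrase) can be illustrated with a minimal sketch. This is not the MDLCompress implementation: the cost model (a flat 2 bits per symbol, one model entry plus one replacement symbol per occurrence), the function names, and the lowercase replacement symbols are all assumptions made for illustration.

```python
def phrase_savings(length, repeats, bits_per_symbol=2.0):
    """Assumed toy cost model, not the paper's: replacing `repeats` copies of
    a phrase of `length` symbols costs one model entry (`length` symbols)
    plus one new symbol per occurrence. Returns the estimated bits saved."""
    symbols_saved = repeats * length - (length + repeats)
    return symbols_saved * bits_per_symbol


def best_phrase(seq, min_len=2):
    """Return (phrase, gain) for the phrase with the largest positive
    estimated savings, or (None, 0.0) if no phrase saves any bits."""
    best, best_gain = None, 0.0
    # A phrase must fit twice without overlap, so length <= len(seq) // 2.
    for length in range(min_len, len(seq) // 2 + 1):
        seen = set()
        for start in range(len(seq) - length + 1):
            phrase = seq[start:start + length]
            if phrase in seen:
                continue
            seen.add(phrase)
            # str.count counts non-overlapping occurrences.
            gain = phrase_savings(length, seq.count(phrase))
            if gain > best_gain:
                best, best_gain = phrase, gain
    return best, best_gain


def greedy_compress(seq, min_len=2):
    """Repeatedly move the best phrase into the model until nothing saves bits.
    Assumes uppercase input and fewer than 26 phrases (hypothetical naming:
    each phrase is replaced by the next lowercase letter)."""
    model = {}
    while True:
        phrase, _gain = best_phrase(seq, min_len)
        if phrase is None:
            return model, seq
        symbol = chr(ord("a") + len(model))
        model[symbol] = phrase
        seq = seq.replace(phrase, symbol)


if __name__ == "__main__":
    print(best_phrase("ACGTACGTACGTTT"))      # ('ACGT', 10.0)
    print(greedy_compress("ACGTACGTACGTTT"))  # ({'a': 'ACGT'}, 'aaaTT')
```

In this toy input, "ACGT" wins even though "AC" repeats just as often and longer repeats such as "ACGTACGT" exist, because its length and repetition together give the largest estimated saving. The result is a two-part description: the model (the phrase dictionary) and the rewritten sequence.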
Keywords :
biological organs; cancer; cellular biophysics; genetics; learning (artificial intelligence); macromolecules; medical computing; molecular biophysics; MDLCompress; biological activity; breast cancer cell line; genes; heuristics; learning algorithm; microRNA binding sites; minimum description length algorithm; motif detection; nucleotide sequence analysis; single nucleotide polymorphisms; two-part coding; Algorithm design and analysis; Biological information theory; Biological system modeling; Breast cancer; Costs; DNA; Data structures; Predictive models; Runtime; Sequences;
Conference_Titel :
Signals, Systems and Computers, 2006. ACSSC '06. Fortieth Asilomar Conference on
Conference_Location :
Pacific Grove, CA
Print_ISBN :
1-4244-0784-2
Electronic_ISBN :
1058-6393
DOI :
10.1109/ACSSC.2006.355081