• DocumentCode
    2461102
  • Title
    An Improved Minimum Description Length Learning Algorithm for Nucleotide Sequence Analysis
  • Author
    Evans, Scott ; Markham, Steve ; Torres, Andrew ; Kourtidis, Antonis ; Conklin, Douglas
  • Author_Institution
    U.S. Army Med. Res. Acquisition Activity, Fort Detrick, MD
  • fYear
    2006
  • fDate
    Oct. 29 2006-Nov. 1 2006
  • Firstpage
    1843
  • Lastpage
    1850
  • Abstract
    We present MDLCompress, an improved minimum description length (MDL) learning algorithm for nucleotide sequence analysis. It achieves better compression than other grammar-based coding methods such as DNA Sequitur while retaining a two-part code that highlights biologically significant phrases. The phrases recursively added to the MDLCompress model are not necessarily the longest matches or the most frequently repeated phrases of a given length; each is chosen for a combination of length and repetition such that its inclusion in the model maximizes compression. The deep recursion of MDLCompress, combined with its two-part coding, makes it uniquely able to identify biologically meaningful sequences without limiting assumptions. The ability to quantify the cost in bits of phrases in the MDL model supports prediction of fragile regions where single nucleotide polymorphisms (SNPs) may have the most impact on biological activity. MDLCompress improves on our previous algorithm in runtime performance, through an innovative data structure, and in specificity of motif detection (compression), through improved heuristics. We also discuss recent results from MDLCompress analysis of 144 known overexpressed genes from a breast cancer cell line, BT474. Novel motifs, including potential microRNA (miRNA) binding sites, have been identified within certain genes and are being considered for in vitro validation studies.
  • Keywords
    biological organs; cancer; cellular biophysics; genetics; learning (artificial intelligence); macromolecules; medical computing; molecular biophysics; MDLCompress; biological activity; breast cancer cell line; genes; heuristics; learning algorithm; microRNA binding sites; minimum description length algorithm; motif detection; nucleotide sequence analysis; single nucleotide polymorphisms; two-part coding; Algorithm design and analysis; Biological information theory; Biological system modeling; Breast cancer; Costs; DNA; Data structures; Predictive models; Runtime; Sequences;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Signals, Systems and Computers, 2006. ACSSC '06. Fortieth Asilomar Conference on
  • Conference_Location
    Pacific Grove, CA
  • ISSN
    1058-6393
  • Print_ISBN
    1-4244-0784-2
  • Electronic_ISBN
    1058-6393
  • Type
    conf
  • DOI
    10.1109/ACSSC.2006.355081
  • Filename
    4176891
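
The abstract's selection rule (a phrase is chosen for the combination of length and repetition that maximizes compression, not for being the longest or the most repeated) can be illustrated with a minimal sketch. This is not the MDLCompress implementation: the cost model below counts symbols rather than bits, and the names `net_savings`, `best_phrase`, and `compress` are hypothetical helpers introduced only for illustration.

```python
def net_savings(length: int, repeats: int) -> int:
    """Symbols saved by moving one phrase into the model (assumed cost model).

    Each of the `repeats` occurrences shrinks from `length` symbols to one
    placeholder, while the model part pays `length + 1` symbols to store the
    phrase plus a delimiter. The paper measures cost in bits; this symbol
    count is a simplifying assumption.
    """
    return repeats * (length - 1) - (length + 1)


def best_phrase(seq: str, min_len: int = 2, max_len: int = 12):
    """Return (phrase, gain) with the largest positive gain, or (None, 0)."""
    best, best_gain = None, 0
    for length in range(min_len, min(max_len, len(seq) // 2) + 1):
        seen = set()
        for start in range(len(seq) - length + 1):
            phrase = seq[start:start + length]
            if phrase in seen:
                continue
            seen.add(phrase)
            # str.count counts non-overlapping occurrences, left to right.
            gain = net_savings(length, seq.count(phrase))
            if gain > best_gain:
                best, best_gain = phrase, gain
    return best, best_gain


def compress(seq: str, max_rules: int = 10):
    """Greedily add phrases to the model until no phrase saves symbols.

    Placeholders are lowercase letters, so the input is assumed to be an
    uppercase nucleotide string. Later rules may contain earlier
    placeholders, which gives the recursive (grammar-like) structure.
    """
    rules = {}
    for i in range(max_rules):
        phrase, _gain = best_phrase(seq)
        if phrase is None:
            break
        symbol = chr(ord("a") + i)
        rules[symbol] = phrase
        seq = seq.replace(phrase, symbol)
    return rules, seq


rules, body = compress("ACGTACGTACGTTT")
print(rules, body)
```

In this toy input, `ACGT` (length 4, three repeats) is selected ahead of the equally frequent but shorter `AC` and ahead of longer substrings that occur only once without overlap, which mirrors the length-versus-repetition trade-off described in the abstract. The pair `(rules, body)` plays the role of the two-part code: the model and the sequence encoded with it.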