Title :
Genome compression using normalized maximum likelihood models for constrained Markov sources
Author :
Tabus, Ioan ; Korodi, Gergely
Author_Institution :
Dept. of Signal Process., Tampere Univ. of Technol., Tampere
Abstract :
The paper presents exact and implementable solutions to the problem of universal coding of approximate repeats, using the normalized maximum likelihood (NML) model for the class of first-order Markov sources and incorporating constraints that are standard in fast similarity searches over full genomes. A coding scheme that combines universal codes for memoryless sources with universal codes for sources with memory is then presented. Results on compressing the full human genome show that the combined scheme provides slight improvements over the existing state of the art. As a side result, the method can identify interesting pairs of sequences that are highly similar under the new NML model for Markov sources but receive a lower similarity score under the NML model for memoryless sources.
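To make the contrast between the two model classes concrete, the following is a minimal sketch of textbook NML code lengths for a binary sequence, once under a memoryless (Bernoulli) class and once under a first-order Markov class. It is not the constrained scheme of the paper. Several things here are assumptions for illustration only: reading the input as a match/mismatch mask between a block and a candidate approximate repeat, the fixed initial state `start`, the function names, and the brute-force normalizer (which enumerates all 2^n sequences and is only practical for short masks).

```python
import math
from itertools import product


def bernoulli_max_loglik(bits):
    """log2 of the maximized Bernoulli likelihood (0 * log 0 treated as 0)."""
    n = len(bits)
    ones = sum(bits)
    loglik = 0.0
    for count in (ones, n - ones):
        if count > 0:
            loglik += count * math.log2(count / n)
    return loglik


def nml_memoryless_bits(bits):
    """NML code length in bits for the binary memoryless class."""
    n = len(bits)
    # Normalizer: sum of maximized likelihoods over all sequences,
    # grouped by the number of ones k.
    norm = 0.0
    for k in range(n + 1):
        ml = 1.0
        if k > 0:
            ml *= (k / n) ** k
        if n - k > 0:
            ml *= ((n - k) / n) ** (n - k)
        norm += math.comb(n, k) * ml
    return -bernoulli_max_loglik(bits) + math.log2(norm)


def markov_max_loglik(bits, start=0):
    """log2 of the maximized first-order Markov likelihood.

    Assumption: the first symbol is conditioned on a fixed state `start`.
    """
    counts = [[0, 0], [0, 0]]  # counts[prev][cur]
    prev = start
    for b in bits:
        counts[prev][b] += 1
        prev = b
    loglik = 0.0
    for row in counts:
        total = sum(row)
        for c in row:
            if c > 0:
                loglik += c * math.log2(c / total)
    return loglik


def nml_markov_bits(bits, start=0):
    """NML code length in bits for the binary first-order Markov class.

    The normalizer is computed by brute force over all 2^n sequences,
    so this is only a sketch for short inputs.
    """
    n = len(bits)
    norm = sum(2.0 ** markov_max_loglik(seq, start)
               for seq in product((0, 1), repeat=n))
    return -markov_max_loglik(bits, start) + math.log2(norm)


if __name__ == "__main__":
    # Hypothetical mask with clustered mismatches: 1 = match, 0 = mismatch.
    mask = [1] * 8 + [0] * 8
    print("memoryless NML bits:", nml_memoryless_bits(mask))
    print("Markov NML bits:    ", nml_markov_bits(mask))
```

For a mask whose mismatches are clustered into long runs, the Markov class yields the shorter code length, which loosely mirrors the side result mentioned in the abstract: a pair can look more similar under a model with memory than under a memoryless one.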
Keywords :
Markov processes; genetic engineering; genetics; maximum likelihood estimation; Markov sources; coding scheme; constrained Markov sources; genome compression; memoryless sources; normalized maximum likelihood models; universal coding; Bioinformatics; Context modeling; DNA; Encoding; Genomics; Humans; Paper technology; Pattern matching; Sequences; Signal processing;
Conference_Title :
Information Theory Workshop, 2008. ITW '08. IEEE
Conference_Location :
Porto
Print_ISBN :
978-1-4244-2269-2
Electronic_ISBN :
978-1-4244-2271-5
DOI :
10.1109/ITW.2008.4578663