DocumentCode :
554171
Title :
Algebraic length distribution of sequence duplications in whole genomes
Author :
Taillefer, E. ; Miller, Jason
Author_Institution :
Phys. & Biol. Unit, Okinawa Inst. of Sci. & Technol., Okinawa, Japan
Volume :
3
fYear :
2011
fDate :
26-28 July 2011
Firstpage :
1454
Lastpage :
1460
Abstract :
The field of comparative genomics relies upon inference of neutrality or selection from sequence conservation. Recent studies of exactly-conserved sequences have revealed an anomalous, algebraic distribution of conserved sequence lengths that is inconsistent with standard models of neutral evolution based solely on local mutation. It has been proposed that linkage contributes to the shape of this anomalous distribution. Here we identify, for a variety of species, all 'maximal' repeats, direct or reverse-complement, within a chromosomal or whole-genome sequence of a single genome. For a set of maximal repeats of a given nucleotide length L, we report that the number of elements in the set, F(L), typically exhibits an algebraic tail. We propose a method based on a cost function that allows us to analyze this distribution and estimate the range over which the distribution is most likely to be well approximated by a power law. We find that the range is proportional to the genome size and that, although the power-law exponent differs between species, it falls chiefly within a relatively narrow range of values. A sharp cut-off in the power-law regime is observed for some genomes; it turns out to coincide with a peak in contig lengths and can therefore be attributed to artifacts of genome assembly, leading to a prediction that the extent of the power-law regime will increase as assemblies are improved. The typical algebraic behavior of the length-frequency distribution is the most remarkable observation emerging from our analysis. The algebraic form of the empirical distribution of duplication lengths characterized here suggests that recombination events might, as a general rule, involve transfer of chunks of sequence with an algebraic length distribution. It also places strong constraints on any model of genome evolution.
The observation of an algebraic distribution of exactly-duplicated sequence lengths within a genome is a direct demonstration of the net impact of linkage on genome evolution, and is consistent with the proposal that linkage contributes to the anomalous distribution of strongly-conserved sequence lengths.
Keywords :
algebra; biology computing; genomics; inference mechanisms; algebraic length distribution; algebraic tail; chromosomal sequence; comparative genomics; exactly duplicated sequence lengths; local mutation; neutral evolution; neutrality inference; power law exponent; power law regime; sequence conservation selection; sequence duplications; whole genome sequence; Assembly; Bioinformatics; Biological cells; Cost function; Couplings; Evolution (biology); Genomics; Chromosome; Frequency; Genome; Ultraduplication; distribution; length; power-law;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Natural Computation (ICNC), 2011 Seventh International Conference on
Conference_Location :
Shanghai
ISSN :
2157-9555
Print_ISBN :
978-1-4244-9950-2
Type :
conf
DOI :
10.1109/ICNC.2011.6022506
Filename :
6022506