Randomized Genome Sequences

 

You can use randomized DNA sequences to assess whether some observed property of a genomic DNA sequence is likely to occur by chance. There are, however, different ways to generate a random sequence. In the "most random" nucleotide sequence, each of the letters A, C, G, and T occurs at any given position with a probability of 25%. This is not realistic because few real genomes have a G+C content near 50%, so you may want to reproduce the G+C content of the original sequence by setting different probabilities for different letters (Bernoulli model, see below). You can go even further and reproduce the dinucleotide (or short oligonucleotide) composition of the genome by using multiple sets of probabilities that depend on the preceding letter(s) (Markov models).
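
The two homogeneous approaches described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this tool: the function names `bernoulli_sequence` and `markov_sequence` are made up for the example, and the probabilities are simply estimated from counts in a template sequence.

```python
import random
from collections import Counter, defaultdict


def bernoulli_sequence(template, length, rng):
    """Draw each base independently, using the base frequencies of `template`."""
    counts = Counter(template)
    bases = sorted(counts)
    weights = [counts[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


def markov_sequence(template, length, order, rng):
    """Draw each base conditioned on the preceding `order` bases."""
    # Count how often each base follows each context of `order` letters.
    transitions = defaultdict(Counter)
    for i in range(len(template) - order):
        transitions[template[i:i + order]][template[i + order]] += 1

    # Assumption: seed the sequence with a context taken from the template.
    start = rng.randrange(len(template) - order + 1)
    seq = list(template[start:start + order])

    overall = Counter(template)  # fallback for contexts without a successor
    while len(seq) < length:
        context = "".join(seq[-order:]) if order else ""
        options = transitions.get(context) or overall
        bases = sorted(options)
        seq.append(rng.choices(bases, weights=[options[b] for b in bases])[0])
    return "".join(seq[:length])
```

With `order=0` the Markov function reduces to the Bernoulli case; with `order=1` it reproduces dinucleotide composition on average, and so on for higher orders.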

 

Both Bernoulli and Markov models are homogeneous, i.e., the probabilities do not depend on the position in the sequence. This is unrealistic because sequence composition varies significantly between genes and intergenic regions and between different parts of the chromosome. To make our models more realistic, we arbitrarily introduce heterogeneity by "chopping" the original sequence into segments corresponding to individual genes and intergenic sequences. We then design a separate model for each segment that matches selected characteristics of that particular segment (different types of models can be used for protein-coding and noncoding segments). We use that model to generate a random sequence of the same length as the original segment, and finally join the segments into a randomized sequence of the complete genome. In addition to Bernoulli and Markov models, the heterogeneous models can use periodic Markov or Bernoulli models for genes. These models use three sets of probabilities, one for each codon position (reading frame), thus reproducing the 3-bp periodic pattern that results from biased codon usage.
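
The segment-by-segment procedure can be sketched as follows, using a Bernoulli model for intergenic segments and a periodic Bernoulli model for genes (the combination coded bbp in the summary table). The function names and the gene coordinate format are assumptions made for this example: genes are given as sorted, non-overlapping, half-open (start, end) pairs on the forward strand, and reverse-strand genes are ignored for simplicity.

```python
import random
from collections import Counter


def sample(counts, rng):
    """Pick one base with probability proportional to its count."""
    bases = sorted(counts)
    return rng.choices(bases, weights=[counts[b] for b in bases])[0]


def bernoulli_segment(segment, rng):
    """Random segment of the same length and expected base composition."""
    counts = Counter(segment)
    return "".join(sample(counts, rng) for _ in segment)


def periodic_bernoulli_segment(segment, rng):
    """Like bernoulli_segment, but with one set of base frequencies
    for each of the three codon positions."""
    counts = [Counter(segment[p::3]) for p in range(3)]
    return "".join(sample(counts[i % 3], rng) for i in range(len(segment)))


def randomize_genome(genome, genes, rng):
    """Randomize each gene and intergenic segment separately, then join.

    `genes` is a hypothetical input format: sorted, non-overlapping
    half-open (start, end) coordinates of forward-strand genes.
    """
    pieces = []
    pos = 0
    for start, end in genes:
        if start > pos:  # intergenic segment before this gene
            pieces.append(bernoulli_segment(genome[pos:start], rng))
        pieces.append(periodic_bernoulli_segment(genome[start:end], rng))
        pos = end
    if pos < len(genome):  # trailing intergenic segment
        pieces.append(bernoulli_segment(genome[pos:], rng))
    return "".join(pieces)
```

Because every segment is replaced by a random segment of the same length, the randomized genome has the same length and the same gene coordinates as the original.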

 

Note that none of these models captures all the complexities of DNA sequences, and different models may be suitable for different tasks. If you are not sure which model to use, try comparing the real sequence data to several different models. For more details and examples of application, see my 2006 paper in MBE.
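
One way to compare real data against a model is to compute the same statistic on the real sequence and on many randomized sequences. The sketch below is only an illustration under stated assumptions: the statistic is an overlapping motif count, the model is a simple Bernoulli generator, and the empirical p-value uses the common (hits + 1) / (n + 1) Monte Carlo estimate. All function names are hypothetical, and any other model function with the same signature could be passed in instead.

```python
import random
from collections import Counter


def count_motif(seq, motif):
    """Count occurrences of `motif` in `seq`, including overlapping ones."""
    last = len(seq) - len(motif) + 1
    return sum(1 for i in range(last) if seq.startswith(motif, i))


def bernoulli_model(seq, rng):
    """Random sequence with the same length and expected base composition."""
    counts = Counter(seq)
    bases = sorted(counts)
    weights = [counts[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=len(seq)))


def empirical_p_value(seq, statistic, model, n_random, rng):
    """Fraction of randomized sequences whose statistic is at least as
    large as the one observed in the real sequence."""
    observed = statistic(seq)
    hits = sum(
        1 for _ in range(n_random)
        if statistic(model(seq, rng)) >= observed
    )
    return (hits + 1) / (n_random + 1)
```

For example, `empirical_p_value(seq, lambda s: count_motif(s, "GATC"), bernoulli_model, 999, random.Random(0))` estimates how often a Bernoulli-randomized sequence contains at least as many GATC sites as `seq`. Repeating the calculation with different models shows how strongly the conclusion depends on the choice of model.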

 

NOTES:

 

Summary of properties of stochastic models used to generate random sequences

| Type | Code | Model | Characteristics of the original sequence accurately reproduced by the model |
|---|---|---|---|
| Homogeneous | b | Bernoulli | Overall nucleotide composition |
| Homogeneous | m1 | 1st order Markov | Overall dinucleotide composition |
| Homogeneous | m3 | 3rd order Markov | Overall tetranucleotide composition |
| Homogeneous | m5 | 5th order Markov | Overall hexanucleotide composition |
| Heterogeneous | bb | Bernoulli for intergenic, Bernoulli for genes | Nucleotide composition, heterogeneity at the gene scale, differences between protein-coding and non-coding regions |
| Heterogeneous | bbp | Bernoulli for intergenic, periodic Bernoulli for genes | Nucleotide composition, heterogeneity at the gene scale, differences between protein-coding and non-coding regions, differences between the three codon positions in genes |
| Heterogeneous | m1m1 | 1st order Markov for intergenic, 1st order Markov for genes | Dinucleotide composition, sequence heterogeneity at the gene scale, differences between protein-coding and non-coding regions |
| Heterogeneous | m1m1p | 1st order Markov for intergenic, periodic 1st order Markov for genes | Dinucleotide composition, heterogeneity at the gene scale, differences between protein-coding and non-coding regions, differences between the three codon positions in genes |
| Heterogeneous | m1c | 1st order Markov for intergenic, independent codons for genes | Dinucleotide composition in intergenic sequences, heterogeneity at the gene scale, differences between protein-coding and non-coding regions, codon frequencies in genes |
| Heterogeneous | m1c1 | 1st order Markov for intergenic, codons dependent on the preceding nucleotide for genes | Dinucleotide composition in intergenic sequences, heterogeneity at the gene scale, differences between protein-coding and non-coding regions, frequencies of tetranucleotides spanning a codon and the last base of the preceding codon in genes |
| Heterogeneous | m3m3p | 3rd order Markov for intergenic, periodic 3rd order Markov for genes | Tetranucleotide composition in both genes and intergenic sequences, heterogeneity at the gene scale, differences between protein-coding and non-coding regions, differences among the three codon positions (reading frames) |