Randomized Genome Sequences
You can use randomized DNA sequences to assess whether some observed property
of a genomic DNA sequence is likely to occur by chance. However, there are
different ways to generate a random sequence. In the "most random" nucleotide
sequence, each of the letters A, C, G, and T occurs at any given position with
a probability of 25%. This is not realistic because few real genomes have a
G+C content near 50%, so you may want to reproduce the G+C content of the
original sequence by setting different probabilities for different letters
(Bernoulli model, see below). You can go even further and reproduce the
dinucleotide (or short oligonucleotide) composition of the genome by using
multiple sets of probabilities that depend on the preceding letter(s) (Markov
models).
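The difference between the two model families can be illustrated with a minimal Python sketch. The function names, the seeding of the Markov chain with the original prefix, and the fallback to overall letter frequencies are assumptions made for this illustration; they are not taken from the tool described on this page.

```python
import random
from collections import Counter, defaultdict


def bernoulli_random(seq, rng):
    """Draw each letter independently, using the letter frequencies of seq."""
    counts = Counter(seq)
    letters = sorted(counts)
    weights = [counts[c] for c in letters]
    return "".join(rng.choices(letters, weights=weights, k=len(seq)))


def markov_random(seq, order, rng):
    """Draw each letter conditioned on the preceding `order` letters."""
    if len(seq) <= order:
        return seq
    # transitions[context][letter] = how often `letter` follows `context` in seq
    transitions = defaultdict(Counter)
    for i in range(len(seq) - order):
        transitions[seq[i:i + order]][seq[i + order]] += 1
    overall = Counter(seq)
    # Assumption: start from the first `order` letters of the original.
    out = list(seq[:order])
    while len(out) < len(seq):
        context = "".join(out[len(out) - order:])
        # Assumption: a context never seen with a successor (possible only for
        # the last k-mer of seq) falls back to overall letter frequencies.
        options = transitions.get(context) or overall
        letters = sorted(options)
        weights = [options[c] for c in letters]
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(1)
    original = "ATGCGCGCATATATGCGCAT" * 10
    print(bernoulli_random(original, rng)[:40])
    print(markov_random(original, 1, rng)[:40])
```

With `order=1` the second function tends to reproduce dinucleotide frequencies, while the first reproduces only single-letter frequencies.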
Both Bernoulli and Markov models are homogeneous, i.e., the probabilities do not
depend on the position in the sequence. This is unrealistic as sequence
composition varies significantly between genes and intergenic regions and in
different parts of the chromosome. In order to make our models more realistic
we arbitrarily introduce heterogeneity by "chopping" the original sequence into
segments corresponding to individual genes and intergenic sequences. Then we
design a separate model for each segment that matches selected characteristics
of that particular segment (different types of models can be used for protein
coding and noncoding segments). We use that model to generate a random sequence
of the same length as the original segment, and finally join the segments into
a randomized sequence of the complete genome. In addition to Bernoulli and
Markov models, the heterogeneous models can use periodic Markov or Bernoulli
models for genes. These models use three sets of probabilities, one for each
codon position (reading frame), thus reproducing the 3-bp periodic pattern
resulting from biased codon usage.
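The segment-by-segment procedure can be sketched as follows, using a Bernoulli model for intergenic segments and a periodic Bernoulli model for genes. The segment format (`start`, `end`, `is_gene` with 0-based, end-exclusive coordinates) and all function names are hypothetical, and the sketch assumes every gene segment starts at codon position 1 on the forward strand, which real annotations do not guarantee.

```python
import random
from collections import Counter


def sample_like(source, n, rng):
    """Draw n letters independently with the letter frequencies of source."""
    counts = Counter(source)
    if not counts:
        return ""
    letters = sorted(counts)
    weights = [counts[c] for c in letters]
    return "".join(rng.choices(letters, weights=weights, k=n))


def periodic_bernoulli(gene, rng):
    """Use a separate set of letter frequencies for each codon position."""
    out = [""] * len(gene)
    for pos in range(3):
        column = gene[pos::3]  # all letters at this codon position
        out[pos::3] = list(sample_like(column, len(column), rng))
    return "".join(out)


def randomize_genome(genome, segments, rng):
    """Randomize each segment with its own model, then join the pieces.

    segments: hypothetical list of (start, end, is_gene) tuples that tile
    the genome in order.
    """
    pieces = []
    for start, end, is_gene in segments:
        segment = genome[start:end]
        if is_gene:
            pieces.append(periodic_bernoulli(segment, rng))
        else:
            pieces.append(sample_like(segment, len(segment), rng))
    return "".join(pieces)


if __name__ == "__main__":
    rng = random.Random(7)
    genome = "TTATTAATAT" + "ATGGCCGCAGGCTAA" + "ATATTTAA"
    segments = [(0, 10, False), (10, 25, True), (25, 33, False)]
    print(randomize_genome(genome, segments, rng))
```

Because each segment is modeled separately, the randomized genome keeps the length and local composition of every gene and intergenic region.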
Note that none of these models captures all the complexities of DNA sequences, and
different models may be suitable for different tasks. If you are not sure which
model to use, try comparing the real sequence data to several different models.
For more details and examples of applications, see my 2006 paper in MBE.
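One way to compare a real sequence against a model is to compute the same statistic on many randomized sequences and see how often it reaches the observed value. The sketch below is only an illustration: the motif-count statistic, the function names, and the add-one correction in the p-value are choices made here, and the Bernoulli generator stands in for whichever model you want to test.

```python
import random
from collections import Counter


def count_motif(seq, motif):
    """Count occurrences of motif in seq, including overlapping ones."""
    last_start = len(seq) - len(motif)
    return sum(1 for i in range(last_start + 1) if seq.startswith(motif, i))


def bernoulli_random(seq, rng):
    """Placeholder model: independent letters with the frequencies of seq."""
    counts = Counter(seq)
    letters = sorted(counts)
    weights = [counts[c] for c in letters]
    return "".join(rng.choices(letters, weights=weights, k=len(seq)))


def empirical_p_value(seq, statistic, generate, n_random, rng):
    """Fraction of randomized sequences scoring at least the observed value.

    The +1 in numerator and denominator (an assumption of this sketch)
    keeps the estimate from being exactly zero.
    """
    observed = statistic(seq)
    hits = sum(
        1 for _ in range(n_random)
        if statistic(generate(seq, rng)) >= observed
    )
    return (hits + 1) / (n_random + 1)


if __name__ == "__main__":
    rng = random.Random(3)
    real = "GATCGATCGATC" + "ACGTTGCA" * 10
    p = empirical_p_value(real, lambda s: count_motif(s, "GATC"),
                          bernoulli_random, 200, rng)
    print(f"empirical p-value for GATC count: {p:.3f}")
```

Swapping `bernoulli_random` for a generator based on another model shows how sensitive the conclusion is to the choice of model.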
NOTES:
Summary of properties of stochastic models used to generate random sequences

| Code | Model | Characteristics of the original sequence accurately reproduced by the model |
|---|---|---|
| Homogeneous models | | |
| b | Bernoulli | Overall nucleotide composition |
| m1 | 1st order Markov | Overall dinucleotide composition |
| m3 | 3rd order Markov | Overall tetranucleotide composition |
| m5 | 5th order Markov | Overall hexanucleotide composition |
| Heterogeneous models | | |
| bb | Bernoulli for intergenic, Bernoulli for genes | Nucleotide composition, heterogeneity at the gene scale, differences between protein-coding and non-coding regions |
| bbp | Bernoulli for intergenic, periodic Bernoulli for genes | Nucleotide composition, heterogeneity at the gene scale, differences between protein-coding and non-coding regions, differences between the three codon positions in genes |
| m1m1 | 1st order Markov for intergenic, 1st order Markov for genes | Dinucleotide composition, sequence heterogeneity at the gene scale, differences between protein-coding and non-coding regions |
| m1m1p | 1st order Markov for intergenic, periodic 1st order Markov for genes | Dinucleotide composition, heterogeneity at the gene scale, differences between protein-coding and non-coding regions, differences between the three codon positions in genes |
| m1c | 1st order Markov for intergenic, independent codons for genes | Dinucleotide composition in intergenic sequences, heterogeneity at the gene scale, differences between protein-coding and non-coding regions, codon frequencies in genes |
| m1c1 | 1st order Markov for intergenic, codons dependent on the preceding nucleotide for genes | Dinucleotide composition in intergenic sequences, heterogeneity at the gene scale, differences between protein-coding and non-coding regions, frequencies of tetranucleotides spanning a codon and the last base of the preceding codon in genes |
| m3m3p | 3rd order Markov for intergenic, periodic 3rd order Markov for genes | Tetranucleotide composition in both genes and intergenic sequences, heterogeneity at the gene scale, differences between protein-coding and non-coding regions, differences among the three codon positions (reading frames) |
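The "independent codons" idea used for genes in the m1c model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: the function name is hypothetical, all codons (including start and stop codons) are treated alike, and a trailing partial codon is copied unchanged.

```python
import random
from collections import Counter


def independent_codons(gene, rng):
    """Draw whole codons independently, using the codon frequencies of gene."""
    n_codons = len(gene) // 3
    codons = [gene[3 * i:3 * i + 3] for i in range(n_codons)]
    counts = Counter(codons)
    if not counts:
        return gene  # shorter than one codon: nothing to randomize
    kinds = sorted(counts)
    weights = [counts[c] for c in kinds]
    drawn = rng.choices(kinds, weights=weights, k=n_codons)
    # Assumption: leftover bases after the last full codon are kept as they are.
    return "".join(drawn) + gene[3 * n_codons:]


if __name__ == "__main__":
    rng = random.Random(5)
    gene = "ATGGCCGCAGGCGCCGCATAA"
    print(independent_codons(gene, rng))
```

Because codons are drawn as units, the output tends to preserve codon frequencies, but a stop codon may appear in the middle of the randomized gene in this simplified version.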