Analysis of sequence heterogeneity by sliding window plots
This service provides access to
tools described in Karlin (2001)
and Mrázek and Karlin (1998),
and allows users to generate sliding window plots of seven different sequence
properties. It is intended for analysis of prokaryotic genomes but it can be
applied to eukaryotic chromosomes with some limitations.
1. INPUT
The input sequence has to be in
GenBank format and it has to include gene annotations (CDS features) if you use
the S3, codon usage or amino acid usage methods.
2. OUTPUT
You can choose to receive the plots
in PostScript or PDF format, as well as text files with the numerical data. The text
files are formatted for use with gnuplot
but can also be used with Excel or similar programs. You may prefer the text
files if you want to customize the plots, e.g., for a publication.
3. METHODS
3.1 G+C content
Percentage of G and C nucleotides in
the sliding window is plotted with respect to the position of the center of the
window.
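The calculation can be sketched as follows. This is a minimal illustration, not the code behind the service: the function name and the `window` and `step` parameters are hypothetical, and the sketch assumes that every character in the window counts toward the denominator.

```python
def gc_content_windows(seq, window, step):
    """Return (center, percent_gc) pairs for each full window along seq.

    Assumption: the denominator is the window length, so ambiguous
    bases (e.g. N) lower the reported percentage.
    """
    seq = seq.upper()
    points = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        center = start + window / 2  # 0-based coordinate of the window center
        points.append((center, 100.0 * gc / window))
    return points


# Toy input: three non-overlapping windows of length 4.
points = gc_content_windows("GGCCAATTGGCC", window=4, step=4)
```

Each pair holds the window center and the G+C percentage, which is the form needed for an x-y plot.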
3.2 S3
G+C content limited to third codon
position of genes. The G+C percentage is calculated for all genes with their
mid-point within the window. S3 correlates strongly with G+C content but shows
larger variance.
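A rough sketch of the S3 calculation for one window is shown below. The gene representation, a list of hypothetical (start, end, cds) tuples with in-frame coding sequences, is an assumption made for illustration, as is the choice to pool third positions over all selected genes rather than average per-gene values.

```python
def gc3_percent(cds_list):
    """Percent G+C at third codon positions, pooled over in-frame CDS strings."""
    gc = total = 0
    for cds in cds_list:
        third = cds.upper()[2::3]  # every third base, starting at codon position 3
        gc += third.count("G") + third.count("C")
        total += len(third)
    return 100.0 * gc / total if total else None


def s3_for_window(genes, win_start, win_end):
    """S3 for genes whose mid-point lies in [win_start, win_end).

    genes: list of (start, end, cds) tuples (hypothetical layout).
    Returns None when no gene mid-point falls in the window.
    """
    selected = [cds for start, end, cds in genes
                if win_start <= (start + end) / 2 < win_end]
    return gc3_percent(selected)


genes = [(0, 9, "ATGAAATAA"), (20, 29, "ATGGCCTAG"), (100, 109, "ATGCCCTGA")]
s3 = s3_for_window(genes, 0, 50)  # uses the first two genes only
```

Returning None for a window without genes makes the gap explicit instead of plotting a misleading zero.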
3.3 δ*-differences
This measures a difference in
dinucleotide relative abundances between the sequence in the window and the complete
chromosome. Peaks in the plot identify regions that differ from the rest of the
chromosome in terms of nearest neighbor propensities. Dinucleotide relative
abundance of a dinucleotide XY (X and Y stand for any of A, C, G, or T) is
defined as
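$\rho^*_{XY} = f_{XY} / (f_X f_Y)$,
where $f_{XY}$, $f_X$, and $f_Y$ are frequencies of the
dinucleotide XY, nucleotide X, and nucleotide Y in the analyzed sequence,
respectively. δ*-differences between two DNA sequences A and B are then defined as
$\delta^*(A,B) = \frac{1}{16} \sum_{XY} |\rho^*_{XY}(A) - \rho^*_{XY}(B)|$,
where $\rho^*_{XY}(A)$ and $\rho^*_{XY}(B)$ are the relative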
abundances of the dinucleotide XY in the sequences A and B, respectively,
and the sum extends over all 16 dinucleotides. In this application, A refers to the sequence in the sliding
window and B refers to the complete
chromosome.
Note that the ρ* values factor
out nucleotide frequencies and therefore are independent of G+C content.
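The definitions above can be sketched in a few lines. This is an illustration under stated assumptions rather than the service's implementation: the function names are hypothetical, frequencies are counted on the given strand only (the published method also combines the sequence with its inverted complement), the average over the 16 dinucleotides follows the definition in Karlin and Burge (1995), and a relative abundance is set to 0 when one of the nucleotides is absent.

```python
from itertools import product

BASES = "ACGT"


def relative_abundances(seq):
    """Dinucleotide relative abundances f_XY / (f_X * f_Y) for one sequence."""
    seq = seq.upper()
    n = len(seq)
    f1 = {b: seq.count(b) / n for b in BASES}
    pairs = [seq[i:i + 2] for i in range(n - 1)]  # overlapping dinucleotides
    rho = {}
    for x, y in product(BASES, repeat=2):
        fxy = pairs.count(x + y) / len(pairs)
        denom = f1[x] * f1[y]
        # Assumption: report 0.0 when X or Y does not occur at all.
        rho[x + y] = fxy / denom if denom else 0.0
    return rho


def delta_star(seq_a, seq_b):
    """Average absolute difference in relative abundances over 16 dinucleotides."""
    ra, rb = relative_abundances(seq_a), relative_abundances(seq_b)
    return sum(abs(ra[d] - rb[d]) for d in ra) / 16


window_seq = "AACCGGTTAACC"       # toy stand-in for the sliding window
chromosome = "ACGTACGTTGCAACGT"   # toy stand-in for the complete chromosome
d = delta_star(window_seq, chromosome)
```

In a sliding window plot, `delta_star` would be evaluated once per window against the same chromosome-wide abundances, so those can be computed once and reused.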
For more information on δ*-differences,
dinucleotide relative abundances, and genome signatures, see for example Karlin
and Burge (1995),
Blaisdell et al. (1996),
Karlin et al. (1997),
Karlin (1998),
Karlin et al. (1999),
and Gentles and Karlin (2001).
3.4 Synonymous codon usage
This method plots a difference in
synonymous codon usage between the collection of genes with the mid-point in
the sliding window (designated G) and
the complete set of genes in the chromosome (designated C) (also called codon bias of G
with respect to C). The codon bias is
calculated as
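$B(G|C) = \sum_a p_a(G) \left[ \sum_{(x,y,z)=a} |g(x,y,z) - c(x,y,z)| \right]$,
where $p_a(G)$ are the average amino acid frequencies in the gene
collection G, $g(x,y,z)$ is the frequency for
the codon (x,y,z) in the gene collection G normalized such that
$\sum_{(x,y,z)=a} g(x,y,z) = 1$ for each amino acid, $c(x,y,z)$ are similarly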
normalized codon frequencies in the gene collection C, and the second sum extends over all codons translated to the
amino acid a.
Peaks in this plot will indicate
regions containing many genes with synonymous codon usage significantly
different from that of an average gene of the chromosome.
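A minimal sketch of the codon bias calculation follows. It assumes the standard genetic code, in-frame coding sequences given as plain strings, and amino acid frequencies pooled over all codons of the gene collection; the function names are hypothetical, and stop codons and codons with ambiguous bases are ignored.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(cds_list):
    """Count sense codons over a list of in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if CODON_TABLE.get(codon, "*") != "*":  # skip stops and ambiguous codons
                counts[codon] += 1
    return counts


def normalized_usage(counts):
    """Return (codon usage summing to 1 per amino acid, amino acid frequencies)."""
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[CODON_TABLE[codon]] += n
    total = sum(aa_totals.values())
    usage = {codon: n / aa_totals[CODON_TABLE[codon]] for codon, n in counts.items()}
    aa_freq = {aa: n / total for aa, n in aa_totals.items()}
    return usage, aa_freq


def codon_bias(genes_g, genes_c):
    """Codon bias of gene collection G with respect to gene collection C."""
    g, p_g = normalized_usage(codon_counts(genes_g))
    c, _ = normalized_usage(codon_counts(genes_c))
    bias = 0.0
    for codon, aa in CODON_TABLE.items():
        if aa != "*" and aa in p_g:
            bias += p_g[aa] * abs(g.get(codon, 0.0) - c.get(codon, 0.0))
    return bias


b = codon_bias(["GCTGCT"], ["GCTGCC"])  # alanine only: |1 - 0.5| + |0 - 0.5|
```

Amino acids absent from G contribute nothing because their weight p_a(G) is zero, which is why the loop skips them.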
For more information, see Karlin et
al. (1998a),
Karlin et al. (1998b),
and Karlin (2001).
3.5 Amino acid usage
This method plots the difference in
amino acid frequencies between the collection of genes with the mid-point in
the sliding window (designated G) and
the complete set of genes in the chromosome (designated C). The difference is defined as
$A(G|C) = \frac{1}{20} \sum_a |p_a(G) - p_a(C)|$,
where $p_a(G)$ and $p_a(C)$ are frequencies of the
amino acid a in the gene collections G and C, respectively, and the sum extends over all 20 amino acids.
Peaks in this plot will indicate regions containing many
genes encoding proteins of unusual amino acid composition.
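The comparison can be sketched as below. The function names are hypothetical, protein sequences are assumed to be given as one-letter strings, and the division by 20 (an average over amino acids, following the amino acid bias published by Karlin) is an assumption about how this service scales the value.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(proteins):
    """Pooled amino acid frequencies over a list of protein sequences."""
    counts = Counter()
    for protein in proteins:
        # Characters outside the 20 standard amino acids (e.g. X, *) are ignored.
        counts.update(ch for ch in protein.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {a: (counts[a] / total if total else 0.0) for a in AMINO_ACIDS}


def aa_bias(proteins_g, proteins_c):
    """Average absolute difference in amino acid frequencies between G and C.

    Assumption: the sum over 20 amino acids is divided by 20.
    """
    pg, pc = aa_frequencies(proteins_g), aa_frequencies(proteins_c)
    return sum(abs(pg[a] - pc[a]) for a in AMINO_ACIDS) / 20


bias = aa_bias(["AAAA"], ["AAGG"])  # (|1 - 0.5| + |0 - 0.5|) / 20
```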
3.6 G-C skew
Lobry (1996)
noticed a significant compositional asymmetry in bacterial genomes between the
leading and lagging DNA strands with respect to replication. The leading strand
tends to contain more G and less C than the lagging strand. This method plots
the value (G-C)/(G+C) (difference in G and C counts divided by the sum of G and
C counts), known as G-C skew, in the sliding window. In most bacteria, it tends
to have positive values right of the origin of replication and left of the
terminus of replication, whereas negative values generally apply to regions
right of the terminus and left of the origin.
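The skew value for each window can be computed as in the following sketch. The function name and parameters are hypothetical, and returning 0.0 for a window with neither base is an assumption made to avoid division by zero.

```python
def skew_windows(seq, window, step, plus="G", minus="C"):
    """Return (center, skew) pairs, where skew = (plus - minus) / (plus + minus)."""
    seq = seq.upper()
    points = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        p, m = chunk.count(plus), chunk.count(minus)
        # Assumption: a window containing neither base is reported as 0.0.
        skew = (p - m) / (p + m) if p + m else 0.0
        points.append((start + window / 2, skew))
    return points


gc_skew = skew_windows("GGGCAAAACCCG", window=4, step=4)
```

Calling the same function with `plus="A"` and `minus="T"` gives the corresponding (A-T)/(A+T) value.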
For more information, see for example Lobry (1996),
Mrázek
and Karlin (1998),
Rocha (2002),
Rocha and Sueoka (2002).
3.7 A-T skew
This is analogous to G-C skew.
However, A-T skew is less consistently associated with differences between
leading and lagging DNA strands related to replication and may reflect biases
related to transcription. For more information, see for example Mrázek and Kypr (1994),
Mrázek
and Karlin (1998),
Francino and Ochman (2001).
4. LIMITATIONS FOR EUKARYOTIC CHROMOSOMES
Although this service is intended
for use with prokaryotic chromosomes, the methods G+C content, δ*-differences,
G-C skew, and A-T skew will work with any DNA sequence supplied in the
appropriate format. The methods S3, codon bias, and amino acid bias can be reasonably
applied only to chromosomes relatively densely packed with genes (prokaryotes,
yeast, etc.). Using them on genomes of higher eukaryotes, where the vast majority of
DNA is non-coding and genes can extend over very long regions, may return errors
because some windows may contain few or no genes.