Analysis of sequence heterogeneity by sliding window plots

 

This service provides access to tools described in Karlin (2001) and Mrázek and Karlin (1998), and allows users to generate sliding window plots of seven different sequence properties. It is intended for analysis of prokaryotic genomes but it can be applied to eukaryotic chromosomes with some limitations.

 

1. INPUT

The input sequence must be in GenBank format. If you use the S3, codon usage, or amino acid usage methods, it must also include gene annotations (CDS features).

 

2. OUTPUT

You can choose to receive the plots in PostScript or PDF format, along with text files containing the numerical data. The text files are formatted for use with gnuplot but can also be used with Excel or similar programs. You may prefer the text files if you want to customize the plots, e.g., for a publication.

 

3. METHODS

 

3.1 G+C content

The percentage of G and C nucleotides in the sliding window is plotted against the position of the center of the window.
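The calculation can be illustrated with a minimal Python sketch. It is not the service's own code: the function name, the `window` and `step` parameters, and the 0-based coordinates are assumptions made for illustration, and every character in the window (including any ambiguous bases) counts toward the denominator.

```python
def gc_content_windows(seq, window, step):
    """Return (window_center, percent_GC) pairs for a sliding window.

    Hypothetical helper: window and step sizes are chosen by the caller,
    and positions are 0-based offsets into the sequence.
    """
    seq = seq.upper()
    points = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        center = start + window / 2  # position the value is plotted against
        points.append((center, 100.0 * gc / window))
    return points
```

For example, `gc_content_windows("GGCCAATTGGCC", window=4, step=4)` evaluates three non-overlapping windows, one per four-nucleotide block.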

 

3.2 S3

S3 is the G+C content restricted to the third codon positions of genes. The G+C percentage is calculated over all genes whose mid-point lies within the window. S3 correlates strongly with G+C content but shows larger variance.
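A rough Python sketch of this selection rule follows. The gene representation (tuples of start, end, and coding sequence), the function name, and the choice to pool third positions across all selected genes rather than average per gene are assumptions for illustration only. The sketch also counts every codon it is given, so whether stop codons are included depends on the input.

```python
def s3_for_window(genes, win_start, win_end):
    """G+C percentage at third codon positions for one window.

    genes: iterable of (start, end, cds) tuples, where cds is the coding
    sequence read 5' to 3' in frame (hypothetical input format).
    Returns None when no gene mid-point falls inside the window.
    """
    gc = total = 0
    for start, end, cds in genes:
        midpoint = (start + end) / 2
        if win_start <= midpoint < win_end:
            third = cds.upper()[2::3]  # every third nucleotide, starting at codon position 3
            gc += third.count("G") + third.count("C")
            total += len(third)
    return 100.0 * gc / total if total else None
```

Returning `None` for an empty window makes the gene-poor case explicit instead of dividing by zero.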

 

3.3 δ*-differences

This method measures the difference in dinucleotide relative abundances between the sequence in the window and the complete chromosome. Peaks in the plot identify regions that differ from the rest of the chromosome in terms of nearest-neighbor propensities. The relative abundance of a dinucleotide XY (where X and Y each stand for any of A, C, G, or T) is defined as

$$\rho^*_{XY} = \frac{f^*_{XY}}{f^*_X f^*_Y},$$

where $f^*_{XY}$, $f^*_X$, and $f^*_Y$ are frequencies of the dinucleotide XY, nucleotide X, and nucleotide Y in the analyzed sequence, respectively. δ*-differences between two DNA sequences A and B are then defined as

$$\delta^*(A, B) = \frac{1}{16} \sum_{XY} \left| \rho^*_{XY}(A) - \rho^*_{XY}(B) \right|,$$

where $\rho^*_{XY}(A)$ and $\rho^*_{XY}(B)$ are the relative abundances of the dinucleotide XY in the sequences A and B, respectively, and the sum extends over all 16 dinucleotides. In this application, A refers to the sequence in the sliding window and B refers to the complete chromosome.

Note that the ρ* values factor out nucleotide frequencies and therefore are independent of G+C content.

For more information on δ*-differences, dinucleotide relative abundances, and genome signatures, see for example Karlin and Burge (1995), Blaisdell et al. (1996), Karlin et al. (1997), Karlin (1998), Karlin et al. (1999), and Gentles and Karlin (2001).
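As an illustration, the δ*-difference can be sketched in Python as the average absolute difference of the 16 dinucleotide relative abundances. Several points here are assumptions rather than a description of the service: the function names are hypothetical, frequencies are taken from the strand as given (in the cited literature the starred quantities are commonly computed from the sequence combined with its reverse complement), characters other than A, C, G, and T still count toward the sequence length, and a dinucleotide whose component nucleotide is absent gets a relative abundance of 0.

```python
from itertools import product

def rho_star(seq):
    """Dinucleotide relative abundances f_XY / (f_X * f_Y) for one sequence.

    Simplified sketch: single strand only, sequence length of at least 2.
    """
    seq = seq.upper()
    n = len(seq)
    mono = {b: seq.count(b) / n for b in "ACGT"}
    pairs = [seq[i:i + 2] for i in range(n - 1)]  # overlapping dinucleotides
    rho = {}
    for x, y in product("ACGT", repeat=2):
        f_xy = pairs.count(x + y) / len(pairs)
        denom = mono[x] * mono[y]
        rho[x + y] = f_xy / denom if denom else 0.0  # assumption: 0 when X or Y is absent
    return rho

def delta_star(seq_a, seq_b):
    """Average absolute difference of relative abundances over all 16 dinucleotides."""
    ra, rb = rho_star(seq_a), rho_star(seq_b)
    return sum(abs(ra[d] - rb[d]) for d in ra) / 16
```

In the sliding window setting, `seq_a` would be the window and `seq_b` the complete chromosome, so the chromosome's relative abundances only need to be computed once in a real implementation.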

 

3.4 Synonymous codon usage

This method plots the difference in synonymous codon usage between the collection of genes whose mid-point lies in the sliding window (designated G) and the complete set of genes in the chromosome (designated C). This difference is also called the codon bias of G with respect to C. The codon bias is calculated as

$$B(G|C) = \sum_a p_a(G) \left[ \sum_{(x,y,z)=a} \left| g(x,y,z) - c(x,y,z) \right| \right],$$

where $p_a(G)$ are the average amino acid frequencies in the gene collection G, $g(x,y,z)$ is the frequency of the codon (x,y,z) in the gene collection G normalized such that $\sum_{(x,y,z)=a} g(x,y,z) = 1$ for each amino acid a, $c(x,y,z)$ are similarly normalized codon frequencies in the gene collection C, and the second sum extends over all codons translated to the amino acid a.

Peaks in this plot will indicate regions containing many genes with synonymous codon usage significantly different from that of an average gene of the chromosome.

For more information, see Karlin et al. (1998a), Karlin et al. (1998b), and Karlin (2001).
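The codon bias calculation can be sketched in Python as a weighted sum, over amino acids, of absolute differences between per-amino-acid normalized codon frequencies. The sketch rests on stated assumptions: it uses the standard genetic code, ignores stop codons and unrecognized codons, pools codon counts over all genes in a collection to obtain the amino acid frequencies, and skips any amino acid that is missing from either collection. All names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_counts(cds_list):
    """Pooled counts of sense codons over a list of in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if CODE.get(codon, "*") != "*":  # drop stop and unrecognized codons
                counts[codon] += 1
    return counts

def codon_bias(group, reference):
    """Codon bias of the gene collection `group` with respect to `reference`."""
    g, c = codon_counts(group), codon_counts(reference)
    total_g = sum(g.values())
    bias = 0.0
    for aa in set(AMINO) - {"*"}:
        codons = [k for k, v in CODE.items() if v == aa]
        n_g = sum(g[k] for k in codons)
        n_c = sum(c[k] for k in codons)
        if n_g == 0 or n_c == 0:
            continue  # assumption: amino acid absent from one collection is skipped
        p_a = n_g / total_g  # amino acid frequency in the group
        bias += p_a * sum(abs(g[k] / n_g - c[k] / n_c) for k in codons)
    return bias
```

Here `group` would hold the coding sequences of genes with their mid-point in the window, and `reference` all coding sequences of the chromosome.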

 

3.5 Amino acid usage

This method plots the difference in amino acid frequencies between the collection of genes with the mid-point in the sliding window (designated G) and the complete set of genes in the chromosome (designated C). The difference is defined as

$$A(G|C) = \frac{1}{20} \sum_a \left| p_a(G) - p_a(C) \right|,$$

where $p_a(G)$ and $p_a(C)$ are frequencies of the amino acid a in the gene collections G and C, respectively, and the sum extends over all 20 amino acids.

Peaks in this plot will indicate regions containing many genes encoding proteins of unusual amino acid composition.
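A small Python sketch of this comparison is given below. It assumes that the difference is the absolute frequency difference averaged over the 20 amino acids, that frequencies are pooled over all proteins in a collection, and that the input is a list of protein sequences in one-letter code. The function names are hypothetical, and each collection must contain at least one standard residue.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter code

def aa_frequencies(proteins):
    """Pooled amino acid frequencies over a list of protein sequences."""
    counts = Counter(ch for p in proteins for ch in p.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {a: counts[a] / total for a in AMINO_ACIDS}

def amino_acid_bias(group, reference):
    """Average absolute difference in amino acid frequencies (assumed 1/20 normalization)."""
    pg, pc = aa_frequencies(group), aa_frequencies(reference)
    return sum(abs(pg[a] - pc[a]) for a in AMINO_ACIDS) / 20
```

Two collections with identical composition give a value of 0, and the value grows as the compositions diverge.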

 

3.6 G-C skew

Lobry (1996) noticed a significant compositional asymmetry in bacterial genomes between the leading and lagging DNA strands with respect to replication. The leading strand tends to contain more G and less C than the lagging strand. This method plots the value (G-C)/(G+C) (difference in G and C counts divided by the sum of G and C counts), known as G-C skew, in the sliding window. In most bacteria, it tends to have positive values right of the origin of replication and left of the terminus of replication, whereas negative values generally apply to regions right of the terminus and left of the origin.

For more information, see for example Lobry (1996), Mrázek and Karlin (1998), Rocha (2002), and Rocha and Sueoka (2002).
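The skew value for each window can be sketched in Python as follows. The function name, the `window` and `step` parameters, the 0-based window centers, and the choice to report 0 for a window containing neither nucleotide are assumptions for illustration, not a description of the service's implementation.

```python
def skew_windows(seq, window, step, pair=("G", "C")):
    """Return (window_center, skew) pairs, where skew = (X - Y) / (X + Y).

    pair=("G", "C") gives G-C skew; pair=("A", "T") gives the analogous A-T skew.
    """
    seq = seq.upper()
    x, y = pair
    points = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        nx, ny = chunk.count(x), chunk.count(y)
        value = (nx - ny) / (nx + ny) if nx + ny else 0.0  # assumption: 0 when both counts are 0
        points.append((start + window / 2, value))
    return points
```

A sign change in the plotted values along the chromosome is what would be expected near the origin or terminus of replication in genomes that show the strand asymmetry described above.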

 

3.7 A-T skew

This method is analogous to G-C skew, with the value (A-T)/(A+T) plotted in the sliding window. However, A-T skew is less consistently associated with differences between leading and lagging DNA strands related to replication and may reflect biases related to transcription. For more information, see for example Mrázek and Kypr (1994), Mrázek and Karlin (1998), and Francino and Ochman (2001).

 

4. LIMITATIONS FOR EUKARYOTIC CHROMOSOMES

Although this service is intended for use with prokaryotic chromosomes, the G+C content, δ*-differences, G-C skew, and A-T skew methods will work with any DNA sequence supplied in the appropriate format. The S3, codon bias, and amino acid bias methods can be reasonably applied only to chromosomes that are relatively densely packed with genes (prokaryotes, yeast, etc.). Using them on genomes of higher eukaryotes, where the vast majority of DNA is non-coding and genes can extend over very long regions, may return errors because some windows may contain few or no genes.