1. Method
General
Highly expressed genes often exhibit biased
synonymous codon usage. We use genes encoding ribosomal proteins (RP), major
chaperones (CH), and the main translation and transcription processing factors (TF)
as prototypes of highly expressed genes, and postulate that a gene with
synonymous codon usage similar to these three standard classes but different
from the average gene (all genes and orfs of the genome combined; labeled C) is
Predicted Highly eXpressed (PHX),
whereas genes with synonymous codon usage significantly different from all four
standards RP, CH, TF, and C are considered Putative
Alien (PA).
Synonymous codon usage bias
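Let $G$ be a group of genes with average codon frequencies $G(x,y,z)$
for the codon $(x,y,z)$, normalized such that
$\sum_{(x,y,z)=a} G(x,y,z) = 1$ for each amino acid $a$.
Similarly, let $g(x,y,z)$ indicate the average
codon frequencies for a gene $g$. The
codon usage difference of $g$ relative
to $G$ is calculated by the formula

$$B(g|G) = \sum_{a} p_a(g) \left[ \sum_{(x,y,z)=a} \left| g(x,y,z) - G(x,y,z) \right| \right]$$

where $p_a(g)$ are the average amino acid frequencies in the gene $g$.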
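The calculation described above can be sketched in Python. This is a minimal illustration, not the program behind the web service. It assumes the codon usage difference is the sum over amino acids of the amino acid frequency in the gene times the summed absolute differences between the gene's and the group's synonymous codon frequencies. The function names, the standard genetic code table, and the choice to pool codon counts over the group (rather than averaging per-gene frequencies) are assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codon_counts(seq):
    """Count sense codons in an in-frame coding sequence."""
    seq = seq.upper()
    counts = Counter()
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        # Skip stop codons and codons with ambiguous bases.
        if CODON_TABLE.get(codon, "*") != "*":
            counts[codon] += 1
    return counts


def amino_acid_counts(counts):
    """Sum codon counts per encoded amino acid."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return totals


def codon_frequencies(counts):
    """Normalize so synonymous codons of each amino acid sum to 1."""
    totals = amino_acid_counts(counts)
    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}


def codon_usage_difference(gene_seq, group_seqs):
    """Codon usage difference of one gene relative to a group of genes."""
    g_counts = codon_counts(gene_seq)
    group_counts = Counter()
    for s in group_seqs:  # assumption: pool counts over the whole group
        group_counts.update(codon_counts(s))
    g = codon_frequencies(g_counts)
    G = codon_frequencies(group_counts)
    aa_totals = amino_acid_counts(g_counts)
    n_codons = sum(g_counts.values())
    total = 0.0
    for codon, aa in CODON_TABLE.items():
        if aa == "*" or aa_totals[aa] == 0:
            continue
        p_a = aa_totals[aa] / n_codons  # amino acid frequency in the gene
        total += p_a * abs(g.get(codon, 0.0) - G.get(codon, 0.0))
    return total
```

For example, `codon_usage_difference("GCTGCTGCC", ["GCTGCC"])` compares a gene using GCT twice and GCC once against a group using both alanine codons equally. A gene compared with itself gives zero.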
Definitions of PHX and PA genes
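A gene $g$ is predicted highly expressed
if its codon usage difference relative to the average gene, $B(g|C)$,
is high while the differences relative to the three standards,
$B(g|RP)$, $B(g|CH)$, and $B(g|TF)$, are low. The definition of PHX
genes with respect to individual standard gene classes is based on the ratios
$E_{RP}(g) = B(g|C)/B(g|RP)$,
$E_{CH}(g) = B(g|C)/B(g|CH)$,
$E_{TF}(g) = B(g|C)/B(g|TF)$,
and a combined overall predicted expression measure

$$E(g) = \frac{B(g|C)}{\tfrac{1}{2}B(g|RP) + \tfrac{1}{4}B(g|CH) + \tfrac{1}{4}B(g|TF)}.$$

A gene is designated PHX if the following two conditions are satisfied:
at least two of the three ratios $E_{RP}(g)$, $E_{CH}(g)$,
and $E_{TF}(g)$ exceed 1.05, and the overall
expression measure $E(g)$ is ≥1.00.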
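The two PHX conditions can be illustrated with a short Python sketch. It assumes each ratio is the codon usage difference relative to the average gene (C) divided by the difference relative to one standard class, and that the overall measure weights RP, CH, and TF by 1/2, 1/4, and 1/4 in the denominator, as in the authors' publications. The function and variable names are hypothetical, and the inputs are the four precomputed codon usage differences of one gene.

```python
def expression_measures(b_c, b_rp, b_ch, b_tf):
    """Return the three class ratios and the overall expression measure.

    b_c, b_rp, b_ch, b_tf: codon usage differences of one gene relative to
    the average gene (C) and to the RP, CH, and TF standards.
    """
    e_rp = b_c / b_rp
    e_ch = b_c / b_ch
    e_tf = b_c / b_tf
    # Assumed weighting of the three standards: 1/2, 1/4, 1/4.
    e_all = b_c / (0.5 * b_rp + 0.25 * b_ch + 0.25 * b_tf)
    return e_rp, e_ch, e_tf, e_all


def is_phx(b_c, b_rp, b_ch, b_tf):
    """Apply the two PHX conditions stated in the text."""
    e_rp, e_ch, e_tf, e_all = expression_measures(b_c, b_rp, b_ch, b_tf)
    ratios_above = sum(r > 1.05 for r in (e_rp, e_ch, e_tf))
    return ratios_above >= 2 and e_all >= 1.00
```

A gene with differences 0.6 (C), 0.3 (RP), 0.4 (CH), and 0.5 (TF) has ratios 2.0, 1.5, and 1.2 and an overall measure of 1.6, so `is_phx(0.6, 0.3, 0.4, 0.5)` returns `True`.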
A gene is designated PA if all four codon usage differences
$B(g|RP)$, $B(g|CH)$, $B(g|TF)$, and $B(g|C)$
exceed a threshold based on $M(g)$, where
$M(g)$ is the median codon bias $B(g|C)$
among all genes of similar length as $g$.
We also define a combined measure
of alien character of the gene $g$.
See references
for details and justification.
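The PA rule can be sketched as follows. The exact threshold and the meaning of "similar length" are given in the references and are not reproduced here, so this example treats both as parameters: `rel_window` (the relative length tolerance) and `margin` (the amount added to the median) are hypothetical names with placeholder values chosen only for illustration.

```python
from statistics import median


def median_bias_for_length(length, genes, rel_window=0.2):
    """Median bias relative to C among genes of similar length.

    genes: list of (length_in_codons, bias_relative_to_C) pairs.
    rel_window: hypothetical tolerance; genes within +/- rel_window * length
    are treated as being of "similar length".
    """
    similar = [b for (l, b) in genes if abs(l - length) <= rel_window * length]
    return median(similar)


def is_pa(b_c, b_rp, b_ch, b_tf, threshold):
    """A gene is PA if all four codon usage differences exceed the threshold."""
    return all(b > threshold for b in (b_c, b_rp, b_ch, b_tf))


# Usage with made-up numbers; margin is a placeholder, not the published value.
genes = [(100, 0.2), (110, 0.3), (90, 0.4), (500, 0.9)]
margin = 0.1
threshold = median_bias_for_length(100, genes) + margin
print(is_pa(0.5, 0.6, 0.7, 0.45, threshold))
```

The median is taken over genes of comparable length because codon usage estimates for short genes are noisier, so a single genome-wide cutoff would flag short genes disproportionately.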
2. Input
Two web interfaces are provided. The
basic interface can
be used for organisms with a single chromosome. The program requires four input
files. The complete genome sequence and annotation must be uploaded in
GenBank format. For most complete genomes, the appropriate files can be
downloaded from the NCBI ftp
server (use the *.gbk files). In addition, you have to prepare the lists of
RP, CH, and TF genes by extracting their annotations from the genome file (see examples). The files have to be saved as plain text
(.txt).
A second
interface is provided for genomes with multiple chromosomes or
megaplasmids. You have to prepare and upload these four files for each
chromosome. The primary (largest) chromosome should be first; the order of the
other chromosomes or plasmids does not matter. If a particular chromosome or plasmid
does not have any genes of the RP, CH, or TF classes, leave the appropriate box
blank.
Selection of genes for the standard gene classes RP, CH, and TF
These
standard classes serve as prototypes of highly expressed genes to which other
genes are compared. The method is relatively robust with respect to the
selection of genes in the standard classes, i.e., including a fraction of genes
which are not in fact highly expressed has qualitatively little effect on the
results. Below are some general suggestions based on our experience:
RP: Include all
ribosomal protein genes but not genes that modify ribosomal proteins.
CH: This collection
should generally contain DnaK, GroEL, GroES, and Tig, or thermosome and proteasome subunits
in Archaea. Other proteins that are often PHX and can be included: FtsH, ClpB,
some peptidyl-prolyl cis-trans isomerases (e.g., cyclophilin), and thioredoxin.
TF: This class would
usually include translation elongation factors, ribosome release factor and main
RNA polymerase subunits RpoA, RpoB, and RpoC. Translation initiation factors
can also be included whereas specialized transcription factors should not be
used. RNA polymerase subunits are generally not among the most abundant
proteins in 2D gels but they are appropriate because they exhibit strong codon
biases characteristic of highly expressed genes.
Warning: The standard
classes should be sufficiently large so that each amino acid (except Met and
Trp, which use a single codon) is represented at least several times in each
standard. If an amino acid is not present in a standard the program will crash.
If there is such a problem it is usually due to lack of cysteine in the CH
standard.
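Before uploading, a standard class can be checked for missing amino acids with a few lines of Python. This is a sketch under stated assumptions: the coding sequences of the class are available as in-frame DNA strings, the standard genetic code applies, and `missing_amino_acids` and `min_count` are hypothetical names that are not part of the web service.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def missing_amino_acids(cds_sequences, min_count=1):
    """List amino acids (one-letter codes) seen fewer than min_count times.

    Met (M) and Trp (W) are ignored because each uses a single codon.
    """
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[i:i + 3])
            if aa is not None and aa != "*":
                counts[aa] += 1
    required = set(AMINO_ACIDS) - {"*", "M", "W"}
    return sorted(aa for aa in required if counts[aa] < min_count)
```

If the function returns `["C"]` for the CH class, the class lacks cysteine and should be extended with further chaperone genes. Raising `min_count` checks that each amino acid is represented several times rather than just once.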
3. Output
You
will receive 8 files (7×N+1 files if your genome includes N chromosomes) attached
to an email from webservices@poplar.mib.uga.edu:
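*PHX-PA-data.txt includes the
output data for all annotated genes and ORFs in the genome or chromosome. The
following values are included: the codon usage differences relative to the
four standards C, RP, CH, and TF, the expression ratios with respect to the
individual standard classes, the overall predicted expression measure, the
combined alien measure, gene length in codons (not counting the stop codon and the
initial ATG), G+C content at codon site 3, and the starting position of the
gene in the genome. The data are preceded by labels H or A signifying PHX and
PA genes, respectively.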
*PHX-PA-data.highexp.txt: data
for PHX genes with some information extracted from the annotations attached.
*PHX-PA-data.highexpSort.txt: same
as above but the genes are sorted by decreasing overall predicted expression measure.
*PHX-PA-data.alien.txt: similar
list of PA genes.
*PHX-PA-data.alienSort.txt: same
as above but the genes are sorted by decreasing combined alien measure.
*PHX-PA-data.htc.pdf
or PHX-PA-data.htc.ps: graphical representations of the distribution of PHX (red/orange)
and PA (blue/green) genes in a circular format, and significant clusters and
gaps in the distribution of PA and PHX genes detected by r-scan statistics (Dembo & Karlin, Ann. Appl. Prob. 1992;
Karlin & Brendel, Science 1992; Karlin et al., NAR 1996).
*PHX-PA-data.htl.pdf
or PHX-PA-data.htl.ps: same as above but in linear format.
*PHX-PA-data.cbtab.txt: table of pairwise codon bias differences
for the four standard
gene classes C, RP, CH, and TF. Ideally, the differences among the RP, CH, and TF classes
should be smaller than those with C. If they are slightly larger, you will probably
still receive reasonable results due to the robustness of the method, but you should
interpret the data with caution.
Warning: assignments of
annotations in the files *PHX-PA-data.highexp.txt, *PHX-PA-data.highexpSort.txt,
*PHX-PA-data.alien.txt and *PHX-PA-data.alienSort.txt as well as the graphical
output files rely on a specific
format of the input file and may be wrong if the format is not as expected. The
assignments should be correct if the first input file is in the common GenBank
format. If in doubt, check if the starting position in the CDS line is the same
as the starting position in the PHX/PA data line (the last number).
Note: If you are using
Windows, open the files with WordPad or Word. These are plain text files
generated by Linux and may not display properly in Notepad.
4. Please do:
5. Please don't: