Highly expressed genes often exhibit biased synonymous codon usage. We use genes encoding ribosomal proteins (RP), major chaperones (CH), and the main translation and transcription processing factors (TF) as prototypes of highly expressed genes, and postulate that a gene with synonymous codon usage similar to these three standard classes but different from the average gene (all genes and orfs of the genome combined; labeled C) is Predicted Highly eXpressed (PHX), whereas genes with synonymous codon usage significantly different from all four standards RP, CH, TF, and C are considered Putative Alien (PA).
Synonymous codon usage bias
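Let G be a group of genes with average codon frequencies $G(x,y,z)$ for the codon $(x,y,z)$, normalized such that $\sum_{(x,y,z)=a} G(x,y,z) = 1$ for each amino acid $a$, where the sum runs over all codons $(x,y,z)$ encoding $a$. Similarly, let $g(x,y,z)$ indicate the average codon frequencies for a gene g. The codon usage difference of g relative to G is calculated by the formula
$$B(g|G) = \sum_a p_a(g) \left[ \sum_{(x,y,z)=a} \left| g(x,y,z) - G(x,y,z) \right| \right]$$
where $p_a(g)$ are the average amino acid frequencies in the gene g.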
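The codon usage difference can be sketched in a few lines of Python. This is a minimal illustration, not the program behind the web interface. It assumes the published form of the measure: a sum over amino acids of the amino acid frequency in g, times the summed absolute differences between the normalized codon frequencies of g and G. The function names are hypothetical. The sketch pools codon counts over the group, uses the standard genetic code, and counts every sense codon, including the start codon.

```python
from collections import Counter

# Standard genetic code, bases in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_counts(seqs):
    """Count sense codons over one or more in-frame coding sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            # Skip stop codons and codons with ambiguous bases.
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    return counts


def normalized_freqs(counts):
    """Codon frequencies that sum to 1 within each amino acid."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {c: n / totals[CODON_TABLE[c]] for c, n in counts.items()}


def codon_usage_difference(gene_seq, group_seqs):
    """Codon usage difference of one gene relative to a group of genes."""
    g_counts = codon_counts([gene_seq])
    g = normalized_freqs(g_counts)
    G = normalized_freqs(codon_counts(group_seqs))
    total = sum(g_counts.values())
    aa_counts = Counter()
    for codon, n in g_counts.items():
        aa_counts[CODON_TABLE[codon]] += n
    b = 0.0
    for aa, n in aa_counts.items():
        p_a = n / total  # amino acid frequency in the gene
        codons = [c for c, a in CODON_TABLE.items() if a == aa]
        # Assumption: an amino acid absent from the group contributes
        # frequencies of 0 here instead of raising an error.
        b += p_a * sum(abs(g.get(c, 0.0) - G.get(c, 0.0)) for c in codons)
    return b
```

For example, `codon_usage_difference("GCTGCC", ["GCTGCT"])` compares a gene that uses two alanine codons equally with a group that uses only GCT.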
Definitions of PHX and PA genes
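A gene g is predicted highly expressed if its codon usage difference from the average gene, $B(g|C)$, is high while its differences from the three standard classes, $B(g|RP)$, $B(g|CH)$, and $B(g|TF)$, are low. The definition of PHX genes with respect to the individual standard gene classes is based on the ratios $E_{RP}(g) = B(g|C)/B(g|RP)$, $E_{CH}(g) = B(g|C)/B(g|CH)$, and $E_{TF}(g) = B(g|C)/B(g|TF)$, and on a combined overall predicted expression measure
$$E(g) = \frac{B(g|C)}{\frac{1}{2} B(g|RP) + \frac{1}{4} B(g|CH) + \frac{1}{4} B(g|TF)}$$
A gene is designated PHX if the following two conditions are satisfied: at least two of the three ratios $E_{RP}(g)$, $E_{CH}(g)$, and $E_{TF}(g)$ exceed 1.05, and the overall expression measure $E(g)$ is ≥ 1.00.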
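The PHX rule can be written as a short Python check. This is a sketch with hypothetical function names, not the program's own code. It takes the four codon usage differences of a gene (from C, RP, CH, and TF) as inputs. It assumes the combined measure is the difference from C divided by a weighted average of the other three with weights 1/2, 1/4, and 1/4, as in the published method; check the references before relying on the weights.

```python
def expression_measures(b_c, b_rp, b_ch, b_tf):
    """Return the three class ratios and the combined expression measure.

    Inputs are the codon usage differences of one gene from the average
    gene (b_c) and from the RP, CH, and TF standards. All three standard
    differences must be positive, otherwise the ratios are undefined.
    """
    e_rp = b_c / b_rp
    e_ch = b_c / b_ch
    e_tf = b_c / b_tf
    # Assumed weights 1/2, 1/4, 1/4 for RP, CH, TF.
    e = b_c / (0.5 * b_rp + 0.25 * b_ch + 0.25 * b_tf)
    return e_rp, e_ch, e_tf, e


def is_phx(b_c, b_rp, b_ch, b_tf, ratio_cutoff=1.05, overall_cutoff=1.00):
    """Apply the two PHX conditions stated in the text."""
    e_rp, e_ch, e_tf, e = expression_measures(b_c, b_rp, b_ch, b_tf)
    ratios_above = sum(r > ratio_cutoff for r in (e_rp, e_ch, e_tf))
    return ratios_above >= 2 and e >= overall_cutoff
```

A gene with differences 0.6 from C and 0.3, 0.4, 0.5 from RP, CH, TF has ratios 2.0, 1.5, 1.2 and a combined measure of 1.6, so `is_phx(0.6, 0.3, 0.4, 0.5)` is true.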
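A gene is designated PA if all four codon usage differences of g from the standards, $B(g|C)$, $B(g|RP)$, $B(g|CH)$, and $B(g|TF)$, exceed a threshold that depends on $M(g)$, where $M(g)$ is the median codon bias among all genes of similar length as g.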
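The PA rule can be sketched the same way. The text does not pin down several details, so they are assumptions here: the function names are hypothetical, "similar length" is taken to mean within a relative window of the gene's length (the 20% default is arbitrary), and the threshold is passed in by the caller, not hard-coded.

```python
from statistics import median


def median_bias_similar_length(genes, length, rel_window=0.2):
    """Median codon bias among genes of similar length.

    genes: list of (length_in_codons, codon_bias) pairs for the genome.
    rel_window: assumed definition of "similar length" (here +/- 20%).
    """
    low, high = length * (1 - rel_window), length * (1 + rel_window)
    similar = [bias for n, bias in genes if low <= n <= high]
    return median(similar)


def is_pa(b_c, b_rp, b_ch, b_tf, threshold):
    """True if all four codon usage differences exceed the threshold."""
    return all(b > threshold for b in (b_c, b_rp, b_ch, b_tf))
```

The caller computes the threshold from the median according to the rule given in the references and passes it to `is_pa`.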
We also define a value that serves as a combined measure of the alien character of the gene g.
See references for details and justification.
Two web interfaces are provided. The basic interface can be used for organisms with a single chromosome. The program requires four input files. The complete genome sequence and annotation have to be uploaded in GenBank format. For most complete genomes, the appropriate files can be downloaded from the NCBI ftp server (use the *.gbk files). In addition, you have to prepare the lists of RP, CH, and TF genes by extracting their annotations from the genome file (see examples). The files have to be saved as plain text (.txt). A second interface is provided for genomes with multiple chromosomes or megaplasmids. You have to prepare and upload these four files for each chromosome. The primary (largest) chromosome should be first; the order of the other chromosomes or plasmids does not matter. If a particular chromosome or plasmid does not have any genes of the RP, CH, or TF classes, leave the appropriate box blank.
Selection of genes for the standard gene classes RP, CH, and TF
These standard classes serve as prototypes of highly expressed genes to which other genes are compared. The method is relatively robust with respect to the selection of genes in the standard classes, i.e., including a fraction of genes which are not in fact highly expressed has qualitatively little effect on the results. Below are some general suggestions based on our experience:
RP: Include all ribosomal protein genes but not genes that modify ribosomal proteins.
CH: This collection should generally contain DnaK, GroEL, GroES, and Tig, or the thermosome and proteasome subunits in Archaea. Other proteins that are often PHX and can be included: FtsH, ClpB, some peptidyl-prolyl cis-trans isomerases (e.g., cyclophilin), and thioredoxin.
TF: This class would usually include translation elongation factors, ribosome release factor and main RNA polymerase subunits RpoA, RpoB, and RpoC. Translation initiation factors can also be included whereas specialized transcription factors should not be used. RNA polymerase subunits are generally not among the most abundant proteins in 2D gels but they are appropriate because they exhibit strong codon biases characteristic of highly expressed genes.
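Candidate genes for a standard class can be collected from the GenBank file with a short script before manual review. The sketch below is one possible approach, not a tool provided with the program. It assumes the usual GenBank flat-file layout (feature keys starting in column 6, qualifiers indented below them), and the function names and keyword lists are hypothetical. It returns whole CDS feature blocks; the exact format expected for the uploaded lists should be taken from the examples on the site.

```python
import re


def cds_blocks(genbank_text):
    """Return every CDS feature of a GenBank flat file as a text block."""
    blocks, current, in_features = [], None, False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        # A feature key sits in column 6; qualifier lines are indented further.
        new_feature = line.startswith("     ") and line[5:6].strip() != ""
        # An unindented keyword (ORIGIN, CONTIG, //) ends the feature table.
        end_of_table = line[:1].strip() != ""
        if new_feature or end_of_table:
            if current is not None:
                blocks.append("\n".join(current))
            current = None
            if end_of_table:
                in_features = False
            elif line.split()[0] == "CDS":
                current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        blocks.append("\n".join(current))
    return blocks


def select_by_product(blocks, include, exclude=()):
    """Keep blocks whose /product matches an include keyword and no
    exclude keyword. Keywords must be given in lower case."""
    selected = []
    for block in blocks:
        # Join wrapped qualifier lines before searching.
        flat = " ".join(block.split()).lower()
        match = re.search(r'/product="([^"]*)"', flat)
        product = match.group(1) if match else ""
        if any(k in product for k in include) and not any(
            k in product for k in exclude
        ):
            selected.append(block)
    return selected
```

For the RP class one might call `select_by_product(cds_blocks(text), include=("ribosomal protein",), exclude=("methyltransferase", "modification"))` and then inspect the result by hand, since keyword matching cannot reliably separate ribosomal proteins from the enzymes that modify them.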
Warning: The standard classes should be sufficiently large so that each amino acid (except Met and Trp, which use a single codon) is represented at least several times in each standard. If an amino acid is not present in a standard the program will crash. If there is such a problem it is usually due to lack of cysteine in the CH standard.
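A quick check before uploading can catch a standard class that lacks an amino acid. The following sketch counts the amino acids encoded by a set of coding sequences and reports those seen fewer than `min_count` times. The function names and the default of 3 are assumptions (the text says only "at least several times"), and the standard genetic code is assumed.

```python
from collections import Counter

# Standard genetic code, bases in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# Met (M) and Trp (W) use a single codon and are not checked.
MULTI_CODON_AAS = sorted(set(AMINO) - set("*MW"))


def amino_acid_counts(coding_seqs):
    """Count encoded amino acids over in-frame coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            aa = CODON_TABLE.get(seq[i:i + 3])
            if aa and aa != "*":
                counts[aa] += 1
    return counts


def underrepresented(coding_seqs, min_count=3):
    """Map each multi-codon amino acid seen fewer than min_count times
    to its count (one-letter amino acid codes)."""
    counts = amino_acid_counts(coding_seqs)
    return {aa: counts[aa] for aa in MULTI_CODON_AAS if counts[aa] < min_count}
```

Running `underrepresented` on the coding sequences of the CH class, for instance, would list `C` with a low count if cysteine is the problem.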
You will receive 8 files (7×N+1 files if your genome includes N chromosomes) attached to an email from email@example.com:
*PHX-PA-data.txt includes the output data for all annotated genes and ORFs in the genome or chromosome. The following values are included: the nine codon usage difference, expression, and alien-character values defined above, gene length in codons (not counting the stop codon and the initial ATG), G+C content at codon site 3, and the starting position of the gene in the genome. The data are preceded by labels H or A signifying PHX and PA genes, respectively.
*PHX-PA-data.highexp.txt: data for PHX genes with some information extracted from the annotations attached.
*PHX-PA-data.highexpSort.txt: same as above but the genes are sorted by decreasing predicted expression measure.
*PHX-PA-data.alien.txt: similar list of PA genes.
*PHX-PA-data.alienSort.txt: same as above but the genes are sorted by decreasing measure of alien character.
*PHX-PA-data.htc.pdf or PHX-PA-data.htc.ps: graphical representations of the distribution of PHX (red/orange) and PA (blue/green) genes in a circular format, and significant clusters and gaps in the distribution of PA and PHX genes detected by r-scan statistics (Dembo & Karlin, Ann. Appl. Prob. 1992; Karlin & Brendel, Science 1992; Karlin et al., NAR 1996).
*PHX-PA-data.htl.pdf or PHX-PA-data.htl.ps: same as above but in linear format.
*PHX-PA-data.cbtab.txt: table of pairwise codon bias differences for the four standard gene classes C, RP, CH, and TF. Ideally, the differences among the RP, CH, and TF classes should be smaller than their differences from C. If they are slightly larger, you will probably still receive reasonable results owing to the robustness of the method, but you should interpret the data with caution.
Warning: assignments of annotations in the files *PHX-PA-data.highexp.txt, *PHX-PA-data.highexpSort.txt, *PHX-PA-data.alien.txt and *PHX-PA-data.alienSort.txt as well as the graphical output files rely on a specific format of the input file and may be wrong if the format is not as expected. The assignments should be correct if the first input file is in the common GenBank format. If in doubt, check if the starting position in the CDS line is the same as the starting position in the PHX/PA data line (the last number).
Note: If you are using Windows, open the files with WordPad or Word. These are plain text files generated on Linux and may not display properly in Notepad.