Ab Initio Motif Finding Environment (AIMIE)
The purpose of this environment is
to provide tools for discovery and interpretation of significantly
overrepresented DNA sequence motifs in prokaryotic genomes. The discovery phase
uses the DNA sequence as the only input. Annotation is used in the
interpretation phase. The discovery phase ends with significantly
overrepresented sequence motifs displayed on the screen. The interpretation
phase provides tools for subsequent analysis of distribution of these motifs in
the analyzed sequence and with respect to genes in order to gain more insights
into their possible biological roles.
1. MOTIF DISCOVERY
The algorithm for motif discovery
starts with identification
of frequent words (Karlin et al. 1996, Mrázek et al. 2002, Mrázek and Karlin 1996). The only input is
an annotated DNA sequence in GenBank format (example). You can upload
the sequence file or select from a local database, which is synchronized
(irregularly) with complete prokaryotic genomes stored on the NCBI FTP server (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/).
The motif discovery algorithm is briefly described below.
1. The frequent word
technique provides a list of significantly overrepresented words of a fixed
length (between 10 and 12 bp for prokaryotic chromosomes, depending on the
sequence length). The frequent words are sorted by statistical significance and
a given number of top frequent words (user-defined parameter, default 100) are
selected for further analysis.
2. All copies of the
top frequent words in the analyzed sequence are found and combined into Segments
Consisting of Overlapping Frequent words (SCOFs). SCOFs
are of variable length but many represent different occurrences of the same
sequence motif.
3. SCOFs are
clustered into groups corresponding to the same (or similar) sequence motifs. The
distance between two SCOFs is defined as the minimum number of mismatched
nucleotides between the optimally aligned SCOFs (without inserting gaps)
divided by the length of the shorter SCOF. The standard UPGMA hierarchical
clustering algorithm is used to cluster the SCOFs. All SCOFs joined into a
single node below a given clustering cutoff (user-defined parameter, default
0.3) are considered the same sequence motif. The default value may not be
suitable for all sequences and you may want to experiment with different values
of the clustering cutoff.
4. Each sequence
motif is represented by an alignment of SCOFs that belong to that motif. A
consensus sequence is generated from the alignment using the degenerate nucleotide
alphabet (standard NC-IUB, formerly IUPAC code). The consensus-generating
algorithm ignores nucleotides that occur in less than a given fraction of SCOFs
in the alignment (user-defined parameter, default 10%). For example, if the
frequencies of A, C, G, and T at a given position in the alignment are 70%,
20%, 5%, and 5%, respectively, the consensus will have the letter M (A or C) at
that position. Ambiguous codes corresponding to three or four different nucleotides
(N, B, D, H, and V) at both termini are removed. Consequently, the consensus
sequence can be either longer or shorter than the initial word length s.
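Steps 2 to 4 above can be illustrated with a short Python sketch. This is not the AIMIE source code, and all function names are hypothetical. It makes several assumptions that the description above does not specify: only one strand is searched, two word occurrences "overlap" when they share at least one nucleotide, the shorter SCOF is aligned entirely within the longer one, and the aligned SCOFs are passed as equal-length strings with "-" marking uncovered positions.

```python
# Degenerate nucleotide codes, keyed by the (alphabetically sorted) bases they stand for.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

def build_scofs(seq, words):
    """Step 2: merge overlapping occurrences of frequent words into segments."""
    hits = []
    for word in words:
        start = seq.find(word)
        while start != -1:
            hits.append((start, start + len(word)))
            start = seq.find(word, start + 1)
    hits.sort()
    segments = []
    for start, end in hits:
        if segments and start < segments[-1][1]:  # shares a base with the previous segment
            segments[-1][1] = max(segments[-1][1], end)
        else:
            segments.append([start, end])
    return [seq[s:e] for s, e in segments]

def scof_distance(a, b):
    """Step 3: fewest mismatches over all gapless placements, per length of the shorter SCOF.

    Assumption: the shorter SCOF must lie entirely within the longer one.
    """
    short, long_ = sorted((a, b), key=len)
    best = min(
        sum(x != y for x, y in zip(short, long_[offset:offset + len(short)]))
        for offset in range(len(long_) - len(short) + 1)
    )
    return best / len(short)

def consensus(aligned, min_fraction=0.10):
    """Step 4: degenerate consensus, ignoring bases rarer than min_fraction."""
    letters = []
    for column in zip(*aligned):
        bases = [b for b in column if b in "ACGT"]  # skip '-' padding
        kept = [b for b in "ACGT"
                if bases.count(b) > 0 and bases.count(b) >= min_fraction * len(bases)]
        letters.append(CODES.get("".join(kept), "N"))
    # Remove codes for three or four nucleotides from both termini.
    return "".join(letters).strip("NBDHV")
```

With 20 aligned SCOFs whose column contains 14 A, 4 C, 1 G, and 1 T (70%, 20%, 5%, 5%), `consensus` returns M for that position, matching the example in step 4. The pairwise distances from `scof_distance` could then be passed to any UPGMA (average-linkage) implementation and the tree cut at the clustering cutoff.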
At this point you are presented with a
list of detected sequence motifs, which starts the interpretation phase. You can
also select some of the motifs to be masked out in the analyzed sequence (by
marking the checkboxes in the right column and clicking the button at the
bottom of the page) and repeat the motif discovery phase to find additional (less
significant) sequence motifs.
2. MOTIF INTERPRETATION
You can analyze the distribution of
matches to any of the consensus sequences in the analyzed sequence (the "Analyze Consensus" button) or matches to the position-specific
score matrix (PSSM) representation of the aligned SCOFs. The PSSM represents
the motif using log-odds scores to account for the background nucleotide
frequencies. You can also view the aligned SCOFs for each motif. The PSSM
representation requires a score cutoff. That is, all substrings of the analyzed
sequence that score higher than the cutoff will be reported as matching the
motif. The cutoff can be defined directly or as a percentage of SCOFs in the
original alignment that score below the cutoff. For example, setting the cutoff
as "20" will be interpreted as the direct value of the score cutoff, whereas
"20%" will set the score cutoff equal to the 20th percentile among
scores for all SCOFs in the alignment. By default, the score cutoff is set to
10% but not less than zero.
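The PSSM scoring and the two ways of specifying the cutoff can be sketched in Python as follows. This is an illustrative sketch only, not the code used by AIMIE or Motif Locator. The function names, the pseudocount, the base-2 logarithm, and the nearest-rank percentile rule are all assumptions. The sketch also assumes gap-free, equal-length aligned SCOFs containing only A, C, G, and T, and it applies the "not less than zero" floor only to the default cutoff.

```python
import math

def build_pssm(aligned, background, pseudocount=1.0):
    """Log-odds PSSM: one dict of per-nucleotide scores for each alignment column."""
    n = len(aligned)
    pssm = []
    for column in zip(*aligned):
        scores = {}
        for nt in "ACGT":
            # The pseudocount (an assumption) keeps unseen nucleotides at a finite score.
            freq = (column.count(nt) + pseudocount * background[nt]) / (n + pseudocount)
            scores[nt] = math.log2(freq / background[nt])
        pssm.append(scores)
    return pssm

def score(pssm, site):
    """Sum of the column scores for a site of the same length as the PSSM."""
    return sum(column[nt] for column, nt in zip(pssm, site))

def resolve_cutoff(spec, pssm, aligned):
    """'20' is a direct score; '20%' is a percentile of SCOF scores; None is the default."""
    floor_at_zero = spec is None          # default: 10% but not less than zero
    spec = "10%" if spec is None else spec.strip()
    if not spec.endswith("%"):
        return float(spec)
    ranked = sorted(score(pssm, s) for s in aligned)
    index = min(int(float(spec[:-1]) * len(ranked) / 100), len(ranked) - 1)
    cutoff = ranked[index]
    return max(cutoff, 0.0) if floor_at_zero else cutoff

def find_matches(seq, pssm, cutoff):
    """Start positions of all substrings that score higher than the cutoff."""
    width = len(pssm)
    return [i for i in range(len(seq) - width + 1)
            if score(pssm, seq[i:i + width]) > cutoff]
```

For example, with `background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}` and a PSSM built from a handful of aligned SCOFs, `resolve_cutoff("20", ...)` returns the score 20.0 directly, whereas `resolve_cutoff("20%", ...)` returns a score taken from the ranked SCOF scores.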
The "Analyze Consensus" button links
to a modified Pattern Locator interface. You can manually modify the consensus
sequence using any syntax allowed by Pattern Locator. You will receive the
results by email. They include
locations of all matches to the consensus sequence in the analyzed sequence
(the .pll and .plq files), analysis of the distribution of the matches by r-scan statistics (text
output and graphical representation in the PostScript or PDF format), and "Pattern
vicinity analysis". The latter provides a list of all annotated genes
adjacent to or overlapping the matching motifs and a brief summary statistic. You
can also request histograms showing how many times a matching
motif occurs at a specific distance from annotated starts and ends of
genes.
The "Analyze PSSM" button links to a
similar interface (also available independent of AIMIE as Motif Locator), which
uses the PSSM representation as described above instead of the consensus
representation of the motif.
WARNING:
The pattern vicinity analysis relies on the provided annotation and can only be
as accurate as the annotation. Moreover, the program can get confused by an
unexpected format of the annotation in the GenBank file. You may want to
manually check any result that is important to you.
3. LIMITATIONS
Although this environment is
intended for use with prokaryotic chromosomes and plasmids, it will analyze any
DNA sequence you upload with the following limitations.
4. REFERENCES
Mrázek, J., Xie, S., Guo, X., and Srivastava, A. (2008) "AIMIE: A Web-based Environment
for Detection and Interpretation of Significant Sequence Motifs in Prokaryotic
Genomes" Bioinformatics 24, 1041-1048.