Ab Initio Motif Finding Environment (AIMIE)

 

The purpose of this environment is to provide tools for discovery and interpretation of significantly overrepresented DNA sequence motifs in prokaryotic genomes. The discovery phase uses the DNA sequence as the only input. Annotation is used in the interpretation phase. The discovery phase ends with significantly overrepresented sequence motifs displayed on the screen. The interpretation phase provides tools for subsequent analysis of distribution of these motifs in the analyzed sequence and with respect to genes in order to gain more insights into their possible biological roles.

 

1. MOTIF DISCOVERY

The algorithm for motif discovery starts with identification of frequent words (Karlin et al. 1996, Mrázek et al. 2002, Mrázek and Karlin 1996). The only input is an annotated DNA sequence in GenBank format (example). You can upload the sequence file or select from a local database, which is synchronized (irregularly) with complete prokaryotic genomes stored on the NCBI FTP server (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). The motif discovery algorithm is briefly described below.

 

1.      The frequent word technique provides a list of significantly overrepresented words of a fixed length (between 10 and 12 bp for prokaryotic chromosomes, depending on the sequence length). The frequent words are sorted by statistical significance and a given number of top frequent words (user-defined parameter, default 100) are selected for further analysis.

2.      All copies of the top frequent words in the analyzed sequence are found and combined into Segments Consisting of Overlapping Frequent words (SCOFs). SCOFs are of variable length but many represent different occurrences of the same sequence motif.

3.      SCOFs are clustered into groups corresponding to the same (or similar) sequence motifs. The distance between two SCOFs is defined as the minimum number of mismatched nucleotides between the optimally aligned SCOFs (without inserting gaps) divided by the length of the shorter SCOF. The standard UPGMA hierarchical clustering algorithm is used to cluster the SCOFs. All SCOFs joined into a single node below a given clustering cutoff (user-defined parameter, default 0.3) are considered the same sequence motif. The default value may not be suitable for all sequences and you may want to experiment with different values of the clustering cutoff.

4.      Each sequence motif is represented by an alignment of SCOFs that belong to that motif. A consensus sequence is generated from the alignment using the degenerate nucleotide alphabet (standard NC-IUB, formerly IUPAC, code). The consensus-generating algorithm ignores nucleotides that occur in fewer than a given fraction of SCOFs in the alignment (user-defined parameter, default 10%). For example, if the frequencies of A, C, G, and T at a given position in the alignment are 70%, 20%, 5%, and 5%, respectively, the consensus will have the letter M (A or C) at that position. Ambiguous codes corresponding to three or four different nucleotides (N, B, D, H, and V) at both termini are removed. Consequently, the consensus sequence can be either longer or shorter than the initial word length s.
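
Steps 3 and 4 above can be illustrated with a minimal Python sketch. It is not the AIMIE implementation. The function names (scof_distance, consensus) are hypothetical, the distance function assumes the shorter SCOF slides entirely within the longer one (the text does not say how overhangs are handled), and the consensus function assumes the SCOFs are already aligned into equal-length strings padded with "-".

```python
from collections import Counter

# Degenerate nucleotide codes keyed by the set of nucleotides they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def scof_distance(a, b):
    """Minimum mismatches over ungapped alignments, divided by the shorter length.

    Assumption: the shorter SCOF is placed fully inside the longer one.
    """
    short, long_ = sorted((a, b), key=len)
    best = len(short)
    for offset in range(len(long_) - len(short) + 1):
        # zip stops at the end of the shorter SCOF
        mismatches = sum(x != y for x, y in zip(short, long_[offset:]))
        best = min(best, mismatches)
    return best / len(short)


def consensus(aligned, min_fraction=0.10):
    """Degenerate consensus of equal-length aligned SCOFs ('-' marks no coverage)."""
    letters = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in "ACGT")
        total = sum(counts.values())
        # Ignore nucleotides rarer than min_fraction at this position.
        kept = frozenset(n for n, k in counts.items() if k / total >= min_fraction)
        letters.append(IUPAC.get(kept, "N"))  # uncovered column falls back to N
    # Remove codes for three or four nucleotides from both termini.
    return "".join(letters).strip("NBDHV")


# The 70% / 20% / 5% / 5% example from step 4, as one alignment column.
column_example = ["A"] * 14 + ["C"] * 4 + ["G", "T"]
print(consensus(column_example))
print(scof_distance("ACGT", "TTACGATT"))
```

A matrix of such pairwise distances could then be passed to any average-linkage (UPGMA) clustering routine and the resulting tree cut at the clustering cutoff (default 0.3).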

 

At this point you are presented with a list of detected sequence motifs, which starts the interpretation phase. You can also select some of the motifs to be masked out in the analyzed sequence (by marking the checkboxes in the right column and clicking the button at the bottom of the page) and repeat the motif discovery phase to find additional (less significant) sequence motifs.

 

2. MOTIF INTERPRETATION

You can analyze the distribution of matches to any of the consensus sequences in the analyzed sequence (the "Analyze Consensus" button) or matches to the position-specific score matrix (PSSM) representation of the aligned SCOFs. The PSSM represents the motif using log-odds scores to account for the background nucleotide frequencies. You can also view the aligned SCOFs for each motif. The PSSM representation requires a score cutoff. That is, all substrings of the analyzed sequence that score higher than the cutoff will be reported as matching the motif. The cutoff can be defined directly or as a percentage of SCOFs in the original alignment that score below the cutoff. For example, setting the cutoff as "20" will be interpreted as the direct value of the score cutoff, whereas "20%" will set the score cutoff equal to the 20th percentile among scores for all SCOFs in the alignment. By default, the score cutoff is set to 10% but not less than zero.
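
The PSSM scoring and the two ways of specifying the cutoff can be sketched as follows. This is an illustration under stated assumptions, not the code behind the "Analyze PSSM" button: the function names, the pseudocount, the base-2 logarithm, and the exact percentile convention are all assumptions that the text above does not specify.

```python
import math


def build_pssm(aligned, background, pseudocount=1.0):
    """Log-odds matrix: one dict per column, log2(column frequency / background).

    Assumption: a background-weighted pseudocount avoids log(0).
    """
    pssm = []
    for column in zip(*aligned):
        counts = {n: pseudocount * background[n] for n in "ACGT"}
        for c in column:
            if c in counts:  # skip '-' padding and ambiguous letters
                counts[c] += 1
        total = sum(counts.values())
        pssm.append({n: math.log2(counts[n] / total / background[n]) for n in "ACGT"})
    return pssm


def score(pssm, window):
    """Sum of per-position log-odds; unknown letters contribute 0."""
    return sum(col.get(c, 0.0) for col, c in zip(pssm, window))


def resolve_cutoff(spec, scof_scores):
    """'20' is a direct score cutoff; '20%' is a percentile of the SCOF scores."""
    if spec.endswith("%"):
        fraction = float(spec[:-1]) / 100
        ranked = sorted(scof_scores)
        index = min(int(fraction * len(ranked)), len(ranked) - 1)
        return ranked[index]
    return float(spec)


def find_matches(sequence, pssm, cutoff):
    """Start positions of all substrings scoring higher than the cutoff."""
    width = len(pssm)
    return [i for i in range(len(sequence) - width + 1)
            if score(pssm, sequence[i:i + width]) > cutoff]


background = {n: 0.25 for n in "ACGT"}
scofs = ["ACGT", "ACGT", "ACGA", "ACGT"]
pssm = build_pssm(scofs, background)
scof_scores = [score(pssm, s) for s in scofs]
# Default described above: the 10% setting, but not less than zero.
cutoff = max(resolve_cutoff("10%", scof_scores), 0.0)
print(find_matches("TTACGTTT", pssm, cutoff))
```

With a percentage cutoff, roughly that share of the original SCOFs would fail to match their own motif, so a lower percentage gives a more permissive search.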

The "Analyze Consensus" button links to a modified Pattern Locator interface. You can manually modify the consensus sequence using any syntax allowed by Pattern Locator. You will receive the results by email. They include locations of all matches to the consensus sequence in the analyzed sequence (the .pll and .plq files), analysis of the distribution of the matches by r-scan statistics (text output and graphical representation in PostScript or PDF format), and "Pattern vicinity analysis". The latter provides a list of all annotated genes adjacent to or overlapping the matching motifs, along with brief summary statistics. You can also request histograms showing how many times a matching motif occurs at a specific distance from annotated starts and ends of genes.

The "Analyze PSSM" button links to a similar interface (also available independent of AIMIE as Motif Locator), which uses the PSSM representation as described above instead of the consensus representation of the motif.

 

WARNING: The pattern vicinity analysis relies on the provided annotation and can be only as accurate as the annotation. Moreover, the program can get confused by an unexpected format of the annotation in the GenBank file. You may want to check manually any result that is important to you.

 

3. LIMITATIONS

Although this environment is intended for use with prokaryotic chromosomes and plasmids, it will analyze any DNA sequence you upload with the following limitations.

 

 

4. REFERENCES

Mrázek, J., Xie, S., Guo, X., and Srivastava, A. (2008) "AIMIE: A Web-based Environment for Detection and Interpretation of Significant Sequence Motifs in Prokaryotic Genomes" Bioinformatics 24, 1041-1048.