Motif Locator

 

This program takes as input a set of aligned DNA sequence motifs (e.g., a set of transcription factor binding sites) and finds similar motifs in a DNA sequence (e.g., a prokaryotic chromosome). The primary output is a set of coordinates in the analyzed DNA sequence of motifs similar to those in the alignment. These coordinates can be subsequently passed to other programs (r-scan statistics, pattern vicinity analysis) in order to provide additional information about the distribution of the matching motifs in the analyzed sequence and with respect to genes.

 

 

Algorithm

 

The alignment is converted into a position-specific score matrix (PSSM). The PSSM is a $4 \times n$ matrix of log-odds scores assigned to each nucleotide at every position in the alignment, where $n$ is the width of the alignment. Let $s_{ij} = \log(p_{ij}/b_i)$ be the score for nucleotide $i$ ($i$ = A, C, G, or T) at motif position $j$. $p_{ij}$ is the probability of finding nucleotide $i$ at position $j$ of the motif (the target probability), estimated from the alignment as the number of times nucleotide $i$ occurs at position $j$ divided by the number of sequences in the alignment. Pseudocounts equal to the background probabilities are added so that no estimated probability equals zero; their effect diminishes as the number of motifs in the alignment grows. $b_i$ is the probability of finding nucleotide $i$ at any given position in the analyzed sequence (the background probability). Any nucleotide sequence of length $n$ can now be assigned a score $S = \sum_{j=1}^{n} s_{i_j j}$, where $i_j$ is the nucleotide at position $j$ of the sequence at hand. The probabilistic rationale for the PSSM representation can be found in most bioinformatics textbooks.
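The PSSM construction described above can be sketched in a few lines of Python. This is an illustration only, not the program's actual code: the function names `build_pssm` and `score_word`, the use of the natural logarithm, and the normalization by (number of sequences + 1) are assumptions. The last one follows from adding one pseudocount per nucleotide equal to its background probability, so that the four pseudocounts sum to 1.

```python
import math

BASES = "ACGT"

def build_pssm(motifs, background):
    """Return one dict of log-odds scores per alignment column.

    motifs: equal-length upper-case strings (the aligned motifs).
    background: dict mapping each base to its background probability
                (assumed to sum to 1).
    """
    n = len(motifs[0])
    if any(len(m) != n for m in motifs):
        raise ValueError("all aligned motifs must have the same width")
    total = len(motifs)
    pssm = []
    for j in range(n):
        column = [m[j] for m in motifs]
        scores = {}
        for base in BASES:
            # Assumption: the pseudocount for each base equals its background
            # probability, so the four pseudocounts add up to 1 ("+ 1" below).
            p = (column.count(base) + background[base]) / (total + 1)
            # Assumption: natural logarithm; the log base is not stated.
            scores[base] = math.log(p / background[base])
        pssm.append(scores)
    return pssm

def score_word(pssm, word):
    """Sum the per-position log-odds scores of a word of width n."""
    return sum(pssm[j][base] for j, base in enumerate(word))

# Hypothetical toy alignment with a uniform background.
motifs = ["TTGACA", "TTGACA", "TTTACA", "TTGATA"]
background = {b: 0.25 for b in BASES}
pssm = build_pssm(motifs, background)
consensus_score = score_word(pssm, "TTGACA")
```

With only four aligned motifs the pseudocounts shift the estimates noticeably; with dozens of motifs their contribution becomes small, as noted above.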

 

After the aligned motifs are converted into a PSSM, the analyzed DNA sequence is scanned for all words of length n with a score S at or above a given cutoff S0. The cutoff can be specified in two ways: the user can provide the actual cutoff value, or a percentile of the distribution of scores among the motifs in the input alignment. For example, specifying 10% when the alignment contains 50 sequences sets the cutoff to the score of the 6th lowest-scoring motif in the alignment, so that the 5 lowest-scoring motifs (10%) fall below it. By default, the cutoff is the 10% percentile score, or zero if that score is negative. r-scan statistics and distribution analysis are applied to all copies of the motif with scores ≥S0.
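The cutoff selection and the scan can be sketched as follows. This is a minimal illustration under stated assumptions, not the program's implementation: `percentile_cutoff` and `scan` are hypothetical names, only the given strand is scanned (the text does not say whether the reverse complement is also searched), reported coordinates are 1-based, words containing characters other than A, C, G, or T are skipped, and the zero floor is applied whenever `floor` is left at its default.

```python
def percentile_cutoff(motif_scores, percent=10.0, floor=0.0):
    """Score cutoff that leaves `percent` of the aligned motifs below it."""
    ranked = sorted(motif_scores)
    # Number of motifs allowed to score below the cutoff,
    # e.g. 10% of 50 motifs -> 5, so the cutoff is the 6th lowest score.
    skip = min(int(len(ranked) * percent / 100), len(ranked) - 1)
    return max(ranked[skip], floor)

def scan(sequence, pssm, cutoff):
    """Return (1-based start, score) for every word scoring >= cutoff."""
    n = len(pssm)
    hits = []
    for start in range(len(sequence) - n + 1):
        word = sequence[start:start + n]
        if any(base not in pssm[j] for j, base in enumerate(word)):
            continue  # skip words with ambiguous characters such as N
        score = sum(pssm[j][base] for j, base in enumerate(word))
        if score >= cutoff:
            hits.append((start + 1, score))
    return hits

# Hypothetical 2-column PSSM that favours the word "AC".
toy_pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
hits = scan("GACNACT", toy_pssm, cutoff=0.0)
```

In practice, the scores passed to `percentile_cutoff` would be those of the aligned motifs themselves, computed with the same PSSM that is used for the scan.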

 

Output

 

The results are returned by email. The primary output consists of files whose names end with "mll.txt" (coordinates of the matching motifs) and "mlq.txt" (sequences of the matching motifs and their flanks). The files with "rscan" in the name contain the output of r-scan statistics. The file "patvic.txt" lists the genes adjacent to each matching motif and the motif's position relative to them. The files with "Gene_start" and "Gene_end" in their names contain histograms of the counts of matching motifs at specific distances from the starts or ends of annotated genes.

 

References

 

Mrázek, Xie, Guo, and Srivastava (2008) AIMIE: A Web-based Environment for Detection and Interpretation of Significant Sequence Motifs in Prokaryotic Genomes. Bioinformatics, in press.