Motif Locator
This program takes as input a set of aligned DNA sequence motifs (e.g., a set of transcription
factor binding sites) and finds similar motifs in a DNA sequence (e.g., a
prokaryotic chromosome). The primary output is a set of coordinates in the
analyzed DNA sequence of motifs similar to those in the alignment. These
coordinates can be subsequently passed to other programs (r-scan statistics, pattern vicinity analysis) in order to provide
additional information about the distribution of the matching motifs in the
analyzed sequence and with respect to genes.
Algorithm
The alignment is converted into a position-specific score matrix (PSSM). The PSSM is a $4 \times n$ matrix of log-odds scores assigned to each nucleotide at every position in the alignment, where $n$ is the width of the alignment. Let $s_{ij} = \log(p_{ij}/q_i)$ be the score for nucleotide $i$ ($i$ = A, C, G, or T) at motif position $j$. Here $p_{ij}$ is the probability of finding nucleotide $i$ at position $j$ of the motif (the target probability), estimated from the alignment as the number of times nucleotide $i$ occurs at position $j$ divided by the number of sequences in the alignment. Pseudocounts equal to the background probabilities $q_i$ are used to keep the estimated probabilities from being equal to zero; because the pseudocounts are fixed, their effect becomes less significant as the number of motifs in the alignment grows. $q_i$ is the probability of finding nucleotide $i$ at any given position in the analyzed sequence (the background probability). Any nucleotide sequence of length $n$ can now be assigned a score $S = \sum_{j=1}^{n} s_{i_j j}$, where $i_j$ is the nucleotide at position $j$ in the sequence at hand. The probabilistic rationale for the PSSM representation can be found in most bioinformatics textbooks.
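The PSSM construction and word scoring described above can be sketched in Python. This is a minimal illustration, not the program's actual code. The function names `build_pssm` and `score_word`, the natural-log base, the toy alignment, and the uniform background are all assumptions. In particular, the pseudocount formula used here (add the background probability to each count and 1 to the number of sequences) is only one common reading of "pseudocounts equal to the background probabilities".

```python
import math

def build_pssm(alignment, background):
    """Return one dict per motif position, mapping nucleotide -> log-odds score."""
    n_seqs = len(alignment)
    width = len(alignment[0])
    pssm = []
    for j in range(width):
        column = [seq[j] for seq in alignment]
        scores = {}
        for nt in "ACGT":
            # Assumed pseudocount scheme: add the background probability to the
            # count, and 1 (the sum of all background probabilities) to the total.
            p = (column.count(nt) + background[nt]) / (n_seqs + 1)
            scores[nt] = math.log(p / background[nt])
        pssm.append(scores)
    return pssm

def score_word(pssm, word):
    """Sum the per-position scores of a word as long as the motif."""
    return sum(column[nt] for column, nt in zip(pssm, word))

# Hypothetical toy input: four aligned 6-bp sites and a uniform background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
alignment = ["TTGACA", "TTGACT", "TTTACA", "TAGACA"]
pssm = build_pssm(alignment, background)

print(score_word(pssm, "TTGACA"))  # consensus-like word, positive score
print(score_word(pssm, "CCCCCC"))  # poor match, negative score
```

With this pseudocount scheme the estimated probabilities in every column still sum to 1, and the weight of the pseudocount shrinks as more sequences are added to the alignment.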
After the aligned motifs are converted into a PSSM, the analyzed DNA sequence is scanned for all words of length $n$ with a score $S$ at or above a given cutoff $S_0$. The cutoff can be specified in two ways: the user can provide the actual cutoff value or a percentile referring to the distribution of scores among the motifs in the input alignment. For example, specifying 10% when the alignment contains 50 sequences will set the score cutoff equal to the score of the 6th lowest-scoring motif in the alignment. By default, the score cutoff is set to the 10% percentile, but not less than zero. r-scan statistics and analysis of distribution are applied to all copies of the motif with scores $\geq S_0$.
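The cutoff selection and the scan can be sketched as follows. This is an illustration under stated assumptions rather than the program's implementation. The names `percentile_cutoff` and `scan`, the hand-written toy PSSM, the 0-based coordinates, and the forward-strand-only scan are all hypothetical choices. The percentile rule is chosen so that it reproduces the example in the text (10% of 50 sequences gives the 6th lowest score).

```python
def percentile_cutoff(motif_scores, percent, floor=None):
    """Pick the cutoff from the scores of the motifs in the input alignment."""
    ranked = sorted(motif_scores)
    # Number of alignment motifs allowed to fall below the cutoff.
    k = int(len(ranked) * percent // 100)
    k = min(k, len(ranked) - 1)
    cutoff = ranked[k]
    if floor is not None:
        cutoff = max(cutoff, floor)  # e.g. floor=0.0 for "not less than zero"
    return cutoff

def scan(sequence, pssm, cutoff):
    """Return (start, word, score) for every word scoring at or above cutoff."""
    n = len(pssm)
    hits = []
    for start in range(len(sequence) - n + 1):
        word = sequence[start:start + n]
        # Skip words containing symbols the PSSM has no score for (e.g. N).
        if any(nt not in column for column, nt in zip(pssm, word)):
            continue
        score = sum(column[nt] for column, nt in zip(pssm, word))
        if score >= cutoff:
            hits.append((start, word, score))
    return hits

# Hypothetical 3-column PSSM that favours the word "ACG".
toy_pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
hits = scan("TTACGTAAGACG", toy_pssm, cutoff=2.0)
print(hits)
```

Calling `percentile_cutoff(scores, 10, floor=0.0)` corresponds to the default behaviour described above; passing an explicit number directly to `scan` corresponds to the user-supplied cutoff.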
Output
The results are returned by email. The primary output consists of files whose names end with "mll.txt" (coordinates of matching motifs) and "mlq.txt" (sequences of the matching motifs and their flanks). The files with "rscan" in the name contain the output of r-scan statistics. The file "patvic.txt" describes the genes adjacent to each matching motif and their relative position. The files with "Gene_start" and "Gene_end" in their names include histograms of counts of matching motifs at specific distances from the starts or ends of annotated genes.
References