Our interface is easy to use and consists of three steps. In step one, users
choose a DNA sequence from the provided list, which contains the complete
prokaryotic genomes downloaded from the National Center for Biotechnology
Information (NCBI) FTP server (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/), or
upload their own DNA sequence. In step two, users type or paste the search
pattern(s) into the provided text area. In step three, users enter the email
address to which the results will be sent. The process is completed by clicking
"Submit Query". The two output files and the error/warning log are then sent to
that email address. The CGI interface was written in Python, and a second Python
script periodically updates the list of complete genomes.
Input and output:
Pattern
Locator reads sequences in standard FASTA or GenBank
format. Patterns are read from a separate file, one pattern per line. Pattern
Locator generates two output files. One contains only locations of the patterns
found in the genomic sequence, whereas the other prints the actual nucleotide
sequences and their flanks.
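As an illustration of the input conventions described above, the following is a minimal sketch of reading a FASTA sequence and a pattern file with one pattern per line. It is not Pattern Locator's own code: the function names read_fasta and read_patterns are hypothetical, GenBank parsing is omitted, and skipping blank lines in the pattern file is an assumption.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}; sequences are upper-cased."""
    records: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]  # header text without the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def read_patterns(lines: Iterable[str]) -> List[str]:
    """Return one pattern per non-empty line (blank lines skipped by assumption)."""
    return [line.strip() for line in lines if line.strip()]
```

Both functions accept any iterable of lines, so an open file handle can be passed directly.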
Specifying sequence patterns:
Pattern Locator emphasizes ease of use and uses an intuitive syntax for pattern description. We use the standard IUPAC code (e.g., NC-IUB 1986) to refer to individual nucleotides (S for G or C, W for A or T, Y for C or T, R for A or G, M for A or C, K for G or T, B for C or G or T, D for A or G or T, H for A or C or T, V for A or C or G, and N for any base). Additional codes include +n, referring to the actual nucleotide (A, C, G or T) at the n-th position in the pattern or past an active reference point, and -n, signifying the nucleotide complementary to that at position n. These codes can be used to describe direct and inverted repeats. The symbol '#' sets the reference point, which affects the subsequent part of the pattern until reset by another '#' (Table 1).
In addition, a specified number of errors (mismatches) can be allowed in any segment of the pattern (encoded as {...}[k], where k is the maximum number of errors in the segment within the curly brackets), and any subpattern can be repeated a given number of times (encoded as (...)[n:m], where n and m signify the minimum and maximum number of repeats, respectively, of the segment in the parentheses). Table 1 shows several examples of pattern descriptions. Note that parentheses can be nested whereas curly brackets cannot, i.e., constructions such as ({()}) are allowed but {({})} are not.
The characters '>' and '<' may be included at the start of the pattern definition to restrict the search to the direct strand (>) or the complementary strand (<), or to search both strands (<>). If no strand is specified, only the direct strand is searched.
Multiple patterns can be located simultaneously. For example, stem-loop structures of the type NNNNN(N)[3:7]-5-4-3-2-1 that allow a single-base bulge in any stem segment can be located by a simultaneous search for the patterns NNNNN(N)[3:7]-5N-4-3-2-1, NNNNNN(N)[3:7]-6-4-3-2-1, NNNNN(N)[3:7]-5-4N-3-2-1, NNNNNN(N)[3:7]-6-5-3-2-1, NNNNN(N)[3:7]-5-4-3N-2-1, NNNNNN(N)[3:7]-6-5-4-2-1, NNNNN(N)[3:7]-5-4-3-2N-1, and NNNNNN(N)[3:7]-6-5-4-3-1.
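The eight bulged stem-loop patterns listed above follow a regular rule, so they can be generated rather than typed by hand. The sketch below is only an illustration of that rule: the helper name bulged_stem_loop_patterns and its parameters are hypothetical and are not part of Pattern Locator.

```python
from typing import List


def bulged_stem_loop_patterns(stem: int = 5, loop: str = "(N)[3:7]") -> List[str]:
    """Build stem-loop patterns that allow one single-base bulge in the stem."""
    patterns: List[str] = []
    for k in range(stem, 1, -1):
        # Bulge in the second arm: an extra unpaired N right after code -k.
        second_arm = "".join(
            f"-{i}N" if i == k else f"-{i}" for i in range(stem, 0, -1)
        )
        patterns.append("N" * stem + loop + second_arm)
        # Bulge in the first arm: the arm is one base longer, and position k
        # is left unpaired by omitting -k from the complementary arm.
        refs = "".join(f"-{i}" for i in range(stem + 1, 0, -1) if i != k)
        patterns.append("N" * (stem + 1) + loop + refs)
    return patterns


if __name__ == "__main__":
    # One pattern per line, ready to be pasted into the pattern text area.
    print("\n".join(bulged_stem_loop_patterns()))
```

With the default arguments the function returns the eight patterns in the order in which they are listed in the text.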
Examples of pattern descriptions (Table 1):
1: GAATTC
EcoRI site
2: {GAATTC}[1]
Same as #1 but allowing one error (mismatch)
3: (RY)[10:]
Alternating purine-pyrimidine pattern repeated
ten times or more (upper limit can be omitted)
4: {(RY)[10:]}[3]
Same as #3 but allowing up to 3 errors
5: NNNNN-5-4-3-2-1
Any exact 10-bp palindrome (like ACTTGCAAGT)
6: NNNNN(N)[0:20]-5-4-3-2-1
or (N)[
A close inverted 5bp repeat separated by up to 20 bp
of any sequence
7: NNNNN(N)[0:20]{-5-4-3-2-1}[1]
Same as #6 but allowing one error
8: {SSSSS}[1](N)[0:20]{-5-4-3-2-1}[1]
Same as #7 but with an additional requirement that at least 4 of the
first 5 nucleotides are C or G
9: NNN(+1+2+3)[9:]
Exact tandem repeat of any trinucleotide at
least ten times in a row.
10: NNNNN((N)[0:10]+1+2+3+4+5)[5:]
Any pentanucleotide repeated 6 or more times
with gaps not exceeding 10 bp
11: NNNNN((N)[0:10]{+1+2+3+4+5}[1])[5:]
Same as #10 but allowing one mismatch in each repeat relative to the
first pentanucleotide
12: <>{SSSSSSSS}[3](N)[
A subset of E. coli rho-independent terminators (a G+C rich stem-loop structure
followed by a T-rich segment). '<>' at the beginning signifies
that both direct and complementary DNA strands will be searched.
13: (#NNN-3-2-1)[3:]
Three or more 6-bp palindromes in a row. Note the '#', which resets the reference point for the "-3-2-1" segment within each repeat. This will match sequences such as ATGCATTGGCCACCCGGG.
14: {SSSSS}[2](#NN-2-1)[2:]+1+2+3+4+5((N)[0:50] #{SSSSS}[2](#NN-2-1)[2:]+1+2+3+4+5)[2:2]
Three patterns separated by no more than 50 bp each composed of two or more 4-bp palindromes in a row surrounded by a 5-bp direct repeat composed mostly of G and C. Note that the '#' is "forgotten" when leaving a subpattern in (...) or {...}.
15: WW#NNNNNNNN{((N)[0:5]#NNN-3-2-1)[3:5]}[1](N)[0:5]{+1+2+3+4+5+6+7+8}[3]WW
Three to five 6-bp palindromes separated by 0-5 bp from each other, surrounded by an 8-bp direct repeat separated by 0-5 bp from the palindromes, allowing three mismatches in the direct repeat and one mismatch in all the palindromes combined, and with two weak bases (A or T) on each side.
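To make the simplest entries of Table 1 concrete, the sketch below checks entries 1 and 2 (a fixed-length IUPAC segment with a limited number of mismatches) and entry 5 (an exact palindrome expressed with the -n codes). It is a brute-force illustration under stated assumptions, not Pattern Locator's recursive algorithm: the function names are hypothetical, positions are 0-based, only the direct strand is scanned, and the palindrome check assumes the sequence contains only A, C, G and T.

```python
from typing import List

# IUPAC nucleotide codes as listed in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "S": "GC", "W": "AT", "Y": "CT", "R": "AG", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def find_with_mismatches(seq: str, codes: str, max_errors: int = 0) -> List[int]:
    """Start positions where `codes` matches with at most `max_errors` mismatches."""
    hits = []
    for start in range(len(seq) - len(codes) + 1):
        errors = sum(seq[start + i] not in IUPAC[c] for i, c in enumerate(codes))
        if errors <= max_errors:
            hits.append(start)
    return hits


def find_palindromes(seq: str, half: int = 5) -> List[int]:
    """Start positions of exact palindromes of length 2*half (entry 5 for half=5)."""
    hits = []
    for start in range(len(seq) - 2 * half + 1):
        arm = seq[start:start + half]
        # -half ... -1: complements of the first arm, read from its last base back.
        expected = "".join(COMPLEMENT[base] for base in reversed(arm))
        if seq[start + half:start + 2 * half] == expected:
            hits.append(start)
    return hits
```

For example, find_with_mismatches("TTGACTTCAA", "GAATTC", 1) reports position 2, where the sequence differs from the EcoRI site by one base, whereas the same call with max_errors=0 reports nothing.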
Limitations:
Pattern Locator uses a recursive algorithm, which allows flexible pattern
definitions. On the downside, it can become slow when the combinatorial
complexity of the search increases; this complexity is affected mainly by the
number of allowed mismatches and/or repeated segments of variable length. In
particular, Pattern Locator is impractical for finding distant direct or
inverted repeats. Patterns such as NNNNNN(N)[0:1000]+1+2+3+4+5+6 (a 6-bp direct
repeat within a 1000-bp region) can be found much more efficiently by
specialized programs that use algorithms designed for that purpose. Note that
it is the variability of a gap, not its length, that increases the search time.
For example, searching for the pattern NNNNNN(N)[990:1000]+1+2+3+4+5+6 will
take roughly the same time as searching for NNNNNN(N)[0:10]+1+2+3+4+5+6.
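The dependence on gap variability can be illustrated with a brute-force search for a direct repeat separated by a variable gap. This is only a sketch under simplifying assumptions and does not reproduce Pattern Locator's recursive algorithm: the function name find_direct_repeats and the comparison counter are hypothetical. For each start position the loop tries max_gap - min_gap + 1 gap lengths, so the work depends on the width of the gap range and not on where the range begins.

```python
from typing import List, Tuple


def find_direct_repeats(
    seq: str, unit: int = 6, min_gap: int = 0, max_gap: int = 10
) -> Tuple[List[Tuple[int, int]], int]:
    """Return ((start, gap) hits, number of comparisons) for an exact direct repeat."""
    hits = []
    comparisons = 0
    for start in range(len(seq) - 2 * unit - min_gap + 1):
        first = seq[start:start + unit]
        for gap in range(min_gap, max_gap + 1):
            second_start = start + unit + gap
            if second_start + unit > len(seq):
                break  # the second copy would run past the end of the sequence
            comparisons += 1
            if seq[second_start:second_start + unit] == first:
                hits.append((start, gap))
    return hits, comparisons
```

On the same sequence, the ranges [0:10] and [990:1000] lead to a similar number of comparisons per start position, whereas [0:1000] leads to roughly a hundred times more.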