The web interface guides the user through three steps. In step one, users either choose a DNA sequence from the provided list, which contains the complete prokaryotic genomes downloaded from the National Center for Biotechnology Information (NCBI) FTP server (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/), or upload their own DNA sequence. In step two, users type or paste the search pattern(s) into the provided text area. In step three, users enter the email address to which the results will be sent. Clicking "Submit Query" completes the process. The two output files and the error/warning log are then sent to that email address. The CGI interface was written in Python, and a second Python script periodically updates the list of complete genomes.


Input and output:


Pattern Locator reads sequences in standard FASTA or GenBank format. Patterns are read from a separate file, one pattern per line. Pattern Locator generates two output files: one contains only the locations of the patterns found in the genomic sequence, whereas the other lists the matching nucleotide sequences together with their flanks.
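The input conventions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Pattern Locator's own parser: the function names `read_fasta` and `read_patterns` are hypothetical, only the FASTA case is covered (GenBank is omitted), and each record is keyed by the first word of its header line.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary.

    Hypothetical helper for illustration; the record name is taken to be
    the first word after '>' on the header line.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    # Join the sequence lines of each record into one string.
    return {n: "".join(parts) for n, parts in records.items()}


def read_patterns(text):
    """Return the non-empty lines of a pattern file, one pattern per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For instance, `read_patterns("GAATTC\n{GAATTC}[1]\n")` yields the two pattern strings in file order.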



Specifying sequence patterns:


Pattern Locator emphasizes ease of use and relies on an intuitive syntax for pattern description. We use the standard IUPAC code (e.g., NC-IUB 1986) to refer to individual nucleotides (S for G or C, W for A or T, Y for C or T, R for A or G, M for A or C, K for G or T, B for C or G or T, D for A or G or T, H for A or C or T, V for A or C or G and N for any base). Additional codes include +n, which refers to the actual nucleotide (A, C, G or T) at the n-th position in the pattern or past an active reference point, and -n, which signifies the nucleotide complementary to that at position n. These codes can be used to describe direct and inverted repeats. The symbol '#' sets the reference point, which affects the subsequent part of the pattern until it is reset by another '#' (Table 1). In addition, a specified number of errors (mismatches) can be allowed in any segment of the pattern (encoded as {...}[k], where k is the maximum number of errors in the segment within the curly brackets), and any subpattern can be repeated a given number of times (encoded as (...)[n:m], where n and m signify the minimum and maximum number of repeats, respectively, of the segment in the parentheses). Table 1 shows several examples of pattern descriptions. Note that parentheses can be nested whereas curly brackets cannot, i.e., constructions such as ({()}) are allowed but {({})} are not. The characters '>' and '<' may be placed at the start of the pattern definition to restrict the search to the direct strand (>) or the complementary strand (<), or to search both strands (<>). If no strand is specified, only the direct strand is searched. Multiple patterns can be located simultaneously.
For example, stem-loop structures of the type NNNNN(N)[3:7]-5-4-3-2-1 that additionally allow a single-base bulge in any stem segment can be located by a simultaneous search for the patterns NNNNN(N)[3:7]-5N-4-3-2-1, NNNNNN(N)[3:7]-6-4-3-2-1, NNNNN(N)[3:7]-5-4N-3-2-1, NNNNNN(N)[3:7]-6-5-3-2-1, NNNNN(N)[3:7]-5-4-3N-2-1, NNNNNN(N)[3:7]-6-5-4-2-1, NNNNN(N)[3:7]-5-4-3-2N-1, and NNNNNN(N)[3:7]-6-5-4-3-1.
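The eight bulge patterns listed above follow a regular rule, so they can be generated rather than typed by hand. The following Python sketch shows one way to do this; the function name `bulge_variants` and its parameters are hypothetical and are not part of Pattern Locator. For each interior stem position it emits one pattern with an extra unpaired N in the second arm and one pattern with an extra unpaired base in the first arm.

```python
def bulge_variants(stem=5, loop="(N)[3:7]"):
    """Build stem-loop patterns that allow a single-base bulge.

    Hypothetical helper: 'stem' is the number of paired bases and 'loop'
    is the pattern fragment that describes the loop.
    """
    variants = []
    for k in range(stem, 1, -1):
        # Bulge in the second arm: an unpaired N follows back-reference -k.
        refs = "".join(
            f"-{i}" + ("N" if i == k else "") for i in range(stem, 0, -1)
        )
        variants.append("N" * stem + loop + refs)
        # Bulge in the first arm: one extra base, position k is left unpaired.
        refs = "".join(f"-{i}" for i in range(stem + 1, 0, -1) if i != k)
        variants.append("N" * (stem + 1) + loop + refs)
    return variants


# One pattern per line, ready to be pasted into the pattern input.
print("\n".join(bulge_variants()))
```

With the default arguments the function reproduces the eight patterns given in the text, in the same order.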



Examples of pattern descriptions (Table 1):


 1: GAATTC
    EcoRI site

 2: {GAATTC}[1]
    Same as #1 but allowing one error (mismatch).

 3: (RY)[10:]
    Alternating purine-pyrimidine pattern repeated ten times or more (upper limit can be omitted).

 4: {(RY)[10:]}[3]
    Same as #3 but allowing up to 3 errors.

 5: NNNNN-5-4-3-2-1
    Any exact 10-bp palindrome (like ACTTGCAAGT).

 6: NNNNN(N)[0:20]-5-4-3-2-1 or (N)[5:25]-5-4-3-2-1
    A close inverted 5-bp repeat separated by up to 20 bp of any sequence.

 7: NNNNN(N)[0:20]{-5-4-3-2-1}[1]
    Same as #6 but allowing one error.

 8: {SSSSS}[1](N)[0:20]{-5-4-3-2-1}[1]
    Same as #7 but with an additional requirement that at least 4 of the first 5 nucleotides are C or G.

 9: NNN(+1+2+3)[9:]
    Exact tandem repeat of any trinucleotide at least ten times in a row.

10: NNNNN((N)[0:10]+1+2+3+4+5)[5:]
    Any pentanucleotide repeated 6 or more times with gaps not exceeding 10 bp.

11: NNNNN((N)[0:10]{+1+2+3+4+5}[1])[5:]
    Same as #10 but allowing one mismatch in each repeat relative to the first pentanucleotide.

12: <>{SSSSSSSS}[3](N)[3:10]{-8-7-6-5-4-3-2-1}[1]{TTTTTTT}[3]
    A subset of E. coli rho-independent terminators (a G+C rich stem-loop structure followed by a T-rich segment). '<>' at the beginning signifies that both direct and complementary DNA strands will be searched.

13: (#NNN-3-2-1)[3:]
    Three or more 6-bp palindromes in a row. Note the '#', which resets the reference point for the -3-2-1 segment within each repeat. This will match sequences such as ATGCATTGGCCACCCGGG.

14: {SSSSS}[2](#NN-2-1)[2:]+1+2+3+4+5((N)[0:50] #{SSSSS}[2](#NN-2-1)[2:]+1+2+3+4+5)[2:2]
    Three patterns separated by no more than 50 bp, each composed of two or more 4-bp palindromes in a row surrounded by a 5-bp direct repeat composed mostly of G and C. Note that the '#' is "forgotten" when leaving a subpattern in (...) or {...}.

15: WW#NNNNNNNN{((N)[0:5]#NNN-3-2-1)[3:5]}[1](N)[0:5]{+1+2+3+4+5+6+7+8}[3]WW
    Three to five 6-bp palindromes separated by 0-5 bp from each other, surrounded by an 8-bp direct repeat separated by 0-5 bp from the palindromes, allowing three mismatches in the direct repeat and one mismatch in all the palindromes combined, and with two weak bases (A or T) on each side.
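Some of the simpler examples in Table 1 can be checked by hand with a short Python sketch. It covers only fixed-length patterns written in IUPAC codes with an optional mismatch limit (as in #1 and #2), plus the reverse-complement test that underlies the palindromes in #5 and #13. It is not Pattern Locator's algorithm, and all function names are hypothetical.

```python
# IUPAC nucleotide codes as listed in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "S": "GC", "W": "AT", "Y": "CT", "R": "AG", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindrome(seq):
    """True if seq equals its own reverse complement (cf. NNNNN-5-4-3-2-1)."""
    return seq == reverse_complement(seq)


def count_mismatches(pattern, window):
    """Number of positions where the base is not allowed by the IUPAC code."""
    return sum(base not in IUPAC[code] for code, base in zip(pattern, window))


def find_with_errors(pattern, genome, max_errors=0):
    """0-based start positions where pattern matches with <= max_errors mismatches."""
    n = len(pattern)
    return [
        i for i in range(len(genome) - n + 1)
        if count_mismatches(pattern, genome[i:i + n]) <= max_errors
    ]


genome = "TTGAATTCAAGACTTCGG"  # made-up test sequence
print(find_with_errors("GAATTC", genome))                # [2]
print(find_with_errors("GAATTC", genome, max_errors=1))  # [2, 10]
print(is_palindrome("ACTTGCAAGT"))                       # True
```

The sequence ATGCATTGGCCACCCGGG given for #13 can be verified the same way by applying `is_palindrome` to each of its three consecutive 6-bp blocks.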






Pattern Locator uses a recursive algorithm, which allows flexible pattern definitions. The drawback is that the search can become slow as its combinatorial complexity grows; the complexity is affected mainly by the number of allowed mismatches and by repeated segments of variable length. In particular, Pattern Locator is impractical for finding distant direct or inverted repeats. Patterns such as NNNNNN(N)[0:1000]+1+2+3+4+5+6 (a 6-bp direct repeat within a 1000-bp region) can be found much more efficiently by specialized programs that use purpose-built algorithms. Note that it is the variability of a gap, not its length, that increases the search time. For example, searching for the pattern NNNNNN(N)[990:1000]+1+2+3+4+5+6 will take roughly the same time as searching for NNNNNN(N)[0:10]+1+2+3+4+5+6.
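The effect of gap variability can be illustrated with a naive direct-repeat search in Python. This is a sketch under simplifying assumptions and not Pattern Locator's recursive algorithm; the names `gap_choices` and `find_direct_repeats` are hypothetical. The point is that the work done per start position is proportional to the number of admissible gap lengths, which is the same for [0:10] and [990:1000] but about a hundred times larger for [0:1000].

```python
def gap_choices(min_gap, max_gap):
    """Number of distinct gap lengths that must be tried per start position."""
    return max_gap - min_gap + 1


def find_direct_repeats(seq, unit=6, min_gap=0, max_gap=10):
    """Naive search for NNNNNN(N)[min_gap:max_gap]+1+2+3+4+5+6-style repeats.

    Returns (start, gap) pairs, 0-based, for every exact direct repeat of
    length 'unit' whose two copies are separated by an admissible gap.
    """
    hits = []
    for start in range(len(seq) - 2 * unit - min_gap + 1):
        first = seq[start:start + unit]
        for gap in range(min_gap, max_gap + 1):
            second = start + unit + gap
            if second + unit > len(seq):
                break  # the second copy would run past the end
            if seq[second:second + unit] == first:
                hits.append((start, gap))
    return hits


print(gap_choices(0, 10), gap_choices(990, 1000), gap_choices(0, 1000))
# 11 11 1001
print(find_direct_repeats("ACGTACTTACGTAC"))
# [(0, 2)]
```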