r-scan statistics

 

This method can be used to detect anomalies in the distribution of markers (sequence patterns, genes, etc.) in a DNA sequence. Consider a sequence of length L containing n markers located at positions x1, x2, ..., xn, ordered so that

$$x_1 \le x_2 \le \dots \le x_n$$

For a given r, one can find the minimum and maximum distance between a marker and the r-th next marker:

$$m_r = \min_{1 \le i \le n-r} \left( x_{i+r} - x_i \right) \quad \text{and} \quad M_r = \max_{1 \le i \le n-r} \left( x_{i+r} - x_i \right),$$

respectively. Dembo and Karlin (1992, Ann. Appl. Probab. 2, 329-357) derived formulas to estimate the probability that the distance m_r or M_r exceeds a given threshold, assuming that the markers are randomly distributed. Given a probability cutoff, these formulas yield an expected range for both the minimum and the maximum observed distance. An observed minimum distance below the expected range can be interpreted as significant clumping (clustering) of the markers. Analogously, a maximum distance above the expected range can be interpreted as significant overdispersion (a gap) in the distribution of the markers. In addition, a minimum distance that is too high or a maximum distance that is too low indicates a significantly even distribution of the markers. We use two probability cutoffs, 1% and 5%, to analyze the distribution of patterns found with Pattern Locator.
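The following is a minimal sketch of the quantities described above, not the implementation used by this tool. It computes the distances between each marker and the r-th next marker, and estimates an expected range for their minimum and maximum by Monte Carlo simulation of uniformly placed markers. The simulation is a stand-in for the Dembo-Karlin formulas, which are not reproduced here. The function names, the linear (non-circular) treatment of the sequence, and the application of the cutoff to each tail separately are all assumptions made for illustration.

```python
import random


def r_scan_distances(positions, r):
    """Distances between each marker and the r-th next marker (linear sequence)."""
    xs = sorted(positions)
    return [xs[i + r] - xs[i] for i in range(len(xs) - r)]


def simulated_range(length, n, r, alpha=0.05, trials=2000, seed=0):
    """Estimate expected ranges of the minimum and maximum r-scan distance.

    Places n markers uniformly at random on a sequence of the given length,
    repeats this `trials` times, and returns
    ((min_low, min_high), (max_low, max_high)),
    where each pair holds the empirical alpha and 1 - alpha quantiles.
    This is a simulation substitute, not the Dembo-Karlin approximation.
    """
    rng = random.Random(seed)
    mins, maxs = [], []
    for _ in range(trials):
        sample = [rng.uniform(0, length) for _ in range(n)]
        d = r_scan_distances(sample, r)
        mins.append(min(d))
        maxs.append(max(d))
    mins.sort()
    maxs.sort()
    lo = int(trials * alpha)      # index of the lower-tail quantile
    hi = trials - 1 - lo          # index of the upper-tail quantile
    return (mins[lo], mins[hi]), (maxs[lo], maxs[hi])


def interpret(observed_min, observed_max, min_range, max_range):
    """Label the observation following the interpretation in the text."""
    labels = []
    if observed_min < min_range[0]:
        labels.append("clustering")
    if observed_max > max_range[1]:
        labels.append("gap")
    if observed_min > min_range[1] or observed_max < max_range[0]:
        labels.append("even distribution")
    return labels
```

For example, with hypothetical marker positions `[120, 150, 155, 160, 900, 2400, 2450, 4800]` and r = 2, `r_scan_distances` returns `[35, 10, 745, 2240, 1550, 2400]`, so the minimum distance is 10 and the maximum is 2400. These values can then be compared against the ranges returned by `simulated_range(5000, 8, 2)`.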

 

Different values of r can be used to assess the marker distribution at different scales. In practice, the r values to which r-scans can be applied are limited by the number of markers n: the Dembo-Karlin formulas use asymptotic approximations, and the approximation error grows as r increases and as n decreases.
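To illustrate how the choice of r changes the scale of the analysis, the sketch below computes the minimum and maximum distance for several r values on one set of markers. It also reports how many distances are available, which is n - r for a linear sequence and so shrinks as r grows. The function names and the example positions are hypothetical.

```python
def r_scan_extremes(positions, r):
    """Return (minimum, maximum, count) of the distances to the r-th next marker."""
    xs = sorted(positions)
    d = [xs[i + r] - xs[i] for i in range(len(xs) - r)]
    if not d:
        raise ValueError("r must be smaller than the number of markers")
    return min(d), max(d), len(d)


def scan_scales(positions, r_values):
    """Map each r to its (minimum, maximum, count) triple."""
    return {r: r_scan_extremes(positions, r) for r in r_values}


if __name__ == "__main__":
    markers = [10, 20, 25, 400, 410, 1000]  # hypothetical positions
    for r, (lowest, highest, count) in scan_scales(markers, [1, 2, 3]).items():
        print(r, lowest, highest, count)
```

With these six markers, r = 1 gives five distances (minimum 5, maximum 590), while r = 3 leaves only three (minimum 390, maximum 975).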

 

 

Output files

 

You can choose to receive the output in text format, graphical format, or both. The text output lists the positions of significant clusters and/or gaps. The graphical output can be circular or linear. The circular plot features a circle indicating the scale (position 0 is at the top); positions of the patterns found by Pattern Locator are shown in black outside the circle, and significant clusters (blue) and gaps (orange) are shown inside the circle. Clusters and gaps significant at the 1% probability cutoff are drawn as thick bars, whereas those significant only at the 5% cutoff are drawn as thin bars. The linear graph consists of multiple scaled lines, each corresponding to 500 kb of sequence. Positions of the patterns are indicated above each line, and clusters and gaps below it.
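The layout arithmetic implied by this description can be sketched as follows. This is only an illustration: the function names are hypothetical, positions are assumed to be 0-based, and the clockwise direction of the circular plot is an assumption, since the text states only that position 0 is at the top.

```python
ROW_LENGTH = 500_000  # 500 kb of sequence per line, as stated in the text


def linear_row(position):
    """Return (row index, offset within the row) for a 0-based position."""
    return divmod(position, ROW_LENGTH)


def circular_angle(position, length):
    """Angle in degrees measured from the top of the circle.

    Clockwise orientation is assumed here.
    """
    return 360.0 * position / length
```

For instance, `linear_row(1_250_000)` places the position on the third line (row index 2) at offset 250,000, and a position one quarter of the way along the sequence sits at 90 degrees from the top of the circle.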

 

Overlapping patterns

 

Our r-scan implementation automatically combines all overlapping patterns into a single marker before the r-scan is applied, regardless of whether you choose to combine overlapping patterns in Pattern Locator.
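One way to combine overlapping patterns is sketched below, under stated assumptions: each pattern occurrence is represented as a (start, end) pair with an inclusive end, and the merged marker is represented by the start of the merged interval. Neither the representation nor the choice of marker position is specified in this document, and the function names are hypothetical.

```python
def merge_overlapping(patterns):
    """Merge overlapping (start, end) intervals; end is treated as inclusive."""
    merged = []
    for start, end in sorted(patterns):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def marker_positions(patterns):
    """One marker per merged interval, placed at its start (an assumption)."""
    return [start for start, _ in merge_overlapping(patterns)]
```

For example, the occurrences `(100, 110)`, `(105, 115)` and `(112, 120)` overlap in a chain and collapse into a single interval `(100, 120)`, so they contribute one marker rather than three to the r-scan.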

 

References

 

For examples of practical applications of r-scan statistics, see the following papers:

 

Karlin, S. and Brendel, V. (1992) Chance and statistical significance in protein and DNA sequence analysis. Science 257, 39-49.

 

Karlin, S., Mrázek, J., and Campbell, A.M. (1996) Frequent oligonucleotides and peptides of the Haemophilus influenzae genome. Nucleic Acids Res. 24, 4263-4272.

 

Mrázek, J., Bhaya, D., Grossman, A.R., and Karlin, S. (2001) Highly expressed and alien genes of the Synechocystis genome. Nucleic Acids Res. 29, 1590-1601.

 

Mrázek, J., Gaynon, L.H., and Karlin, S. (2002) Frequent oligonucleotide motifs in genomes of three streptococci. Nucleic Acids Res. 30, 4216-4221.