r-scan statistics
This method can be used to detect anomalies in the distribution of markers (sequence patterns, genes, etc.) in a DNA sequence. Consider a sequence of length L containing n markers located at positions x1 < x2 < ... < xn.
For a given r, one can find the minimum and maximum distance between a marker
and the r-th next marker:
m_r = min over i of (x(i+r) - x(i))
and
M_r = max over i of (x(i+r) - x(i)),
respectively.
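As an illustration, the minimum and maximum distance between a marker and its r-th next marker can be computed as in the following minimal sketch. The function name r_scan_extremes is hypothetical, and the sketch assumes a linear sequence (no wrap-around past the last marker); a circular genome would need the wrap-around distances as well.

```python
from typing import Sequence, Tuple


def r_scan_extremes(positions: Sequence[int], r: int) -> Tuple[int, int]:
    """Return (minimum, maximum) distance between a marker and its r-th next marker.

    Assumption: the sequence is linear, so the last r markers have no
    r-th next neighbour and contribute no distance.
    """
    xs = sorted(positions)
    if r < 1 or r >= len(xs):
        raise ValueError("r must satisfy 1 <= r < number of markers")
    # Distance from marker i to marker i + r, for every i where both exist.
    distances = [xs[i + r] - xs[i] for i in range(len(xs) - r)]
    return min(distances), max(distances)


markers = [100, 150, 900, 1000, 1020, 5000]
print(r_scan_extremes(markers, 1))  # (20, 3980)
print(r_scan_extremes(markers, 2))  # (120, 4000)
```

With r = 1 the result describes adjacent markers; larger r values look at stretches spanning r + 1 markers.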
Dembo and Karlin (1992, Ann. Appl. Prob. 2, 329-357) derived formulas to
estimate the probability that the minimum or the maximum distance exceeds a given
threshold, assuming that the markers are randomly distributed. These formulas
can be used to determine an expected range for both the minimum and the maximum observed
distance at a given probability cutoff. If the observed minimum distance
is below the expected
range, it can be interpreted as significant clumping (clustering) of the
markers. Analogously, a maximum distance above the
expected range can be interpreted as a significant overdispersion (gap) in the
distribution of the markers. In addition, too high a minimum distance or too
low a maximum distance indicates a significantly even distribution of the
markers. We use two probability cutoffs, 1% and 5%, to analyze the distribution of
patterns found with Pattern Locator.
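The interpretation rules above can be sketched in code. The exact formulas used by this tool are not reproduced here, so the example below relies on an assumption: the standard leading-order Poisson-type approximations for n uniformly random points, namely P(min > t) ~ exp(-lam) with lam = n^(r+1) (t/L)^r / r!, and P(max < t) ~ exp(-mu) with mu = exp(-x) / (r-1)! where n t / L = ln n + (r-1) ln ln n + x. All function and class names are hypothetical, and each cutoff is treated as one-sided.

```python
import math
from typing import List, NamedTuple


class ExpectedRange(NamedTuple):
    low: float
    high: float


def _check(n: int, r: int, alpha: float) -> None:
    if not (1 <= r < n) or not (0.0 < alpha < 0.5):
        raise ValueError("need 1 <= r < n and 0 < alpha < 0.5")


def min_distance_range(n: int, length: float, r: int, alpha: float) -> ExpectedRange:
    """Expected range of the minimum distance under random placement (assumed approximation)."""
    _check(n, r, alpha)

    def solve(lam: float) -> float:
        # Invert lam = n**(r + 1) * (t / length)**r / r! for t.
        return length * (math.factorial(r) * lam / n ** (r + 1)) ** (1.0 / r)

    low = solve(-math.log(1.0 - alpha))  # P(min <= low) is about alpha
    high = solve(-math.log(alpha))       # P(min > high) is about alpha
    return ExpectedRange(low, high)


def max_distance_range(n: int, length: float, r: int, alpha: float) -> ExpectedRange:
    """Expected range of the maximum distance under random placement (assumed approximation)."""
    _check(n, r, alpha)

    def solve(mu: float) -> float:
        # Invert mu = exp(-x) / (r - 1)! for x, then convert x to a distance.
        x = -math.log(mu * math.factorial(r - 1))
        return length / n * (math.log(n) + (r - 1) * math.log(math.log(n)) + x)

    low = solve(-math.log(alpha))         # P(max < low) is about alpha
    high = solve(-math.log(1.0 - alpha))  # P(max >= high) is about alpha
    return ExpectedRange(low, high)


def classify(observed_min: float, observed_max: float,
             n: int, length: float, r: int, alpha: float) -> List[str]:
    """Apply the interpretation rules from the text to one pair of observed distances."""
    min_range = min_distance_range(n, length, r, alpha)
    max_range = max_distance_range(n, length, r, alpha)
    findings = []
    if observed_min < min_range.low:
        findings.append("clustering")
    if observed_max > max_range.high:
        findings.append("gap")
    if observed_min > min_range.high or observed_max < max_range.low:
        findings.append("even")
    return findings


print(classify(0.5, 50000.0, n=100, length=1_000_000, r=1, alpha=0.01))  # ['clustering']
```

Running the classification once with alpha = 0.01 and once with alpha = 0.05 mirrors the two cutoffs mentioned above. Because the approximations are asymptotic, the computed ranges should be treated with caution for small n or large r.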
Different values of r can be used to assess the marker
distribution at different scales. In practice, the r values for which r-scans
can be applied are limited by the number of markers n because the Dembo-Karlin formulas use asymptotic approximations
and the error increases with increasing r
and with decreasing n.
Output files
You can choose to receive the output
in text and graphical format. The text output lists positions of significant
clusters and/or gaps. Graphical output can have circular or linear form. The
circular plot features a circle indicating the scale (position 0 is at the top)
with positions of the patterns found by Pattern Locator shown in black outside
of the circle, and significant clusters indicated in blue and gaps in orange
inside the circle. Clusters/gaps significant at 1% probability cutoff are shown
as thick bars whereas those corresponding to 5% cutoff are indicated by thin
bars. The linear graph consists of multiple scaled lines, each corresponding
to 500 kb of sequence. Positions of the patterns are indicated above the
line, and clusters/gaps below it.
Overlapping patterns
Our r-scan implementation automatically combines all overlapping
patterns into a single marker before the r-scan
is applied, regardless of whether you choose to combine overlapping patterns in
Pattern Locator or not.
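Combining overlapping patterns amounts to a standard interval merge. The sketch below is only an illustration under stated assumptions: patterns are given as (start, end) pairs with inclusive coordinates, patterns that merely touch without sharing a position stay separate, and the name merge_overlapping is hypothetical. How the tool itself assigns a single position to a merged marker is not stated here.

```python
from typing import List, Tuple


def merge_overlapping(patterns: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping (start, end) intervals; end coordinates are inclusive."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(patterns):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


hits = [(10, 15), (13, 18), (17, 22), (40, 45)]
print(merge_overlapping(hits))  # [(10, 22), (40, 45)]
```

Each merged interval then counts as one marker, so a run of overlapping hits does not register as an artificial cluster.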
References
For examples of practical applications of r-scan statistics, see the following papers: