This Title All WIREs
How to cite this WIREs title:
WIREs Comp Stat

# Computation of exact probabilities associated with overlapping pattern occurrences

Can't access this content? Tell your librarian.

Abstract Searching for patterns in data is important because it can lead to the discovery of sequence segments that play a functional role. The complexity of pattern statistics that are used in data analysis and the need of the sampling distribution of those statistics for inference renders efficient computation methods as paramount. This article gives an overview of the main methods used to compute distributions of statistics of overlapping pattern occurrences, specifically, generating functions, correlation functions, the Goulden‐Jackson cluster method, recursive equations, and Markov chain embedding. The underlying data sequence will be assumed to be higher‐order Markovian, which includes sparse Markov models and variable length Markov chains as special cases. Also considered will be recent developments for extending the computational capabilities of the Markov chain‐based method through an algorithm for minimizing the size of the chain's state space, as well as improved data modeling capabilities through sparse Markov models. An application to compute a distribution used as a test statistic in sequence alignment will serve to illustrate the usefulness of the methodology. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Pattern Recognition Data: Types and Structure > Categorical Data Statistical and Graphical Methods of Data Analysis > Modeling Methods and Algorithms
Context tree corresponding to the variable length Markov chain with m = 3, Σ = {0, 1}, and partition of Γ = {γ1, …, γ4} = {{000, 100, 010, 110}, {011, 111}, {001}, {101}}. Here contexts are read from left to right, as in the text, and correspond to a path in the plot reading up from a leaf node. For example 001 means that xt − 2 = 0, xt − 1 = 0 and xt = 1, and corresponds to the path 0‐0‐1 reading up from the leaf node 0 at the bottom. Also, the 0 one level down from the root node of the plot corresponds to the collection of three‐tuples that end in 0, {000, 100, 010, 110}
[ Normal View | Magnified View ]
Coverage distribution for the Patternhunter seed, with partition Γ = {γ1, …, γ4} = {{000, 100, 010, 110}, {011, 111}, {001}, {101}}, and conditional success probabilities for classes given by (0.6,0.9,0.7,0.8)
[ Normal View | Magnified View ]
State space ϒ for computing the coverage distribution of spaced seed S = 1 * 11 * 1. States in green are those for which coverage is updated
[ Normal View | Magnified View ]
Parsimonious context tree representation of the partition Γ = {{AA, AC, AG, AT}, {CA, CC, CG, CT}, {GA, GC, GG, GT}, {TA, TC, TG, TT}}. The context is to be read from bottom to top, with, for example, the context {AA, AC, AG, AT} being represented by the path A − {A, C, G, T}
[ Normal View | Magnified View ]