Approximate substring indexing is accomplished by decomposing each string
in a database into overlapping "positional q-grams", sequences of a predetermined
length q, and containing information regarding the "position" of each q-gram within
the string (i.e., 1st q-gram, 4th q-gram, etc.). An index
is then formed of the tuples of the positional q-gram data (such as, for example,
a B-tree index or a hash index). Each query applied to the database is similarly
parsed into a plurality of positional q-grams (of the same length), and a candidate
set of matches is found. Position-directed filtering is used to remove the candidates
which have the q-grams in the wrong order and/or too far apart to form a "verified"
output of matching candidates. If errors are permitted (defined in terms of an
edit distance between each candidate and the query), an edit distance calculation
can then be performed to produce the final set of matching strings.