Processing of source documents to generate data for indexing, and of
queries to generate data for searching, is done in accordance with
retrieved tokenization rules and, if desired, retrieved normalization
rules. Tokenization rules are used to define exactly which characters
(letters, numbers, punctuation characters, etc.) and exactly which
patterns of those characters (one or more contiguous characters, every
individual character, etc.) constitute indexable and searchable units of
data. Normalization rules are used to (potentially) modify the tokens
created by the tokenizer during indexing and/or searching operations.
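As a concrete illustration, the following sketch applies one hypothetical pair of such rules; the specific choices here (contiguous letter/digit runs as tokens, case folding and accent stripping as normalization) are illustrative assumptions, not rules given in the source:

```python
import re
import unicodedata

# Hypothetical tokenization rule: one or more contiguous
# letter/digit characters form an indexable, searchable unit.
TOKEN_PATTERN = re.compile(r"[^\W_]+", re.UNICODE)

def tokenize(text):
    """Split text into tokens according to the rule above."""
    return TOKEN_PATTERN.findall(text)

def normalize(token):
    """Hypothetical normalization rules: case folding (to support
    case-insensitive searching) and accent stripping (to match
    accepted spelling variations such as 'resume'/'résumé')."""
    token = token.casefold()
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def index_terms(text):
    """The same pipeline must run at both indexing and query time."""
    return [normalize(t) for t in tokenize(text)]

print(index_terms("Résumé review, 2nd draft"))
# ['resume', 'review', '2nd', 'draft']
```

Because `index_terms` is shared by both paths, a query for "resume" finds documents containing "Résumé", which is the point of applying identical rules on both sides.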
Normalization accounts for things such as case-insensitive searching and
language-specific nuances in which document authors may use accepted
variations in the spelling of words. Query processing must employ the
same tokenization and normalization rules as source processing in order
for queries to search the databases accurately, and must additionally
recognize a set of characters reserved for use by the query language.
This set of "reserved" characters includes the characters used for
wildcard searching, quoted strings, field-qualified searching, range
searching, and so forth.
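One way to picture the reserved-character handling is a pass that flags which query characters must be treated as operators rather than searchable text. The particular reserved set below is a hypothetical example; the real set depends entirely on the query syntax being supported:

```python
# Hypothetical set of reserved query-language characters and the
# query feature each one serves; assumed for illustration only.
RESERVED = {
    "*": "wildcard",              # e.g. wild* matches wildcard, wildest
    "?": "single-char wildcard",  # e.g. wom?n matches woman, women
    '"': "quoted string",         # e.g. "annual report" as a phrase
    ":": "field qualifier",       # e.g. title:report
    "-": "range separator",       # e.g. 2019-2023 in a date field
}

def classify_query_chars(query):
    """List each reserved character found in a query, in order,
    so a parser can route it to operator handling instead of
    passing it through tokenization as ordinary text."""
    return [(ch, RESERVED[ch]) for ch in query if ch in RESERVED]

print(classify_query_chars('title:"annual report" 201?*'))
```

A real query parser would of course consume these characters as part of a grammar (and allow escaping them to search for them literally), but the classification step shows why the reserved set must be defined alongside, yet kept distinct from, the concordable characters the tokenization rules define.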