A multi-lingual indexing and search system is presented that performs tokenization and stemming independently of whether index entries and search terms appear as words in a dictionary. The system includes a tokenizer that separates a string of text into individual word tokens and eliminates predetermined types of tokens from further processing. The system also includes a stemmer that reduces words to grammatical stems by removing known word endings associated with the various languages to be supported. The stemmer removes these endings from the word tokens without attempting to guarantee that the remaining stem is contained in a dictionary. In an embodiment, the stemmer removes only those word endings that are associated with nouns. The system further includes an indexer that stores the stems in an index.
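
The tokenizer, stemmer, and indexer described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the ending lists in `NOUN_ENDINGS`, the `MIN_STEM_LEN` guard, the choice of which token types to discard (purely numeric and single-character tokens), and all function names are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical noun endings per language (illustrative, not taken from the source).
NOUN_ENDINGS = {
    "en": ["es", "s"],
    "de": ["en", "er", "e", "n", "s"],
}
MIN_STEM_LEN = 3  # assumed guard so that very short words are left intact


def tokenize(text):
    """Split text into lowercase word tokens and drop predetermined token types."""
    tokens = re.findall(r"\w+", text.lower())
    # Assumption: numeric tokens and single characters are the eliminated types.
    return [t for t in tokens if not t.isdigit() and len(t) > 1]


def stem(token, lang):
    """Strip the longest known ending; no dictionary lookup is performed."""
    for ending in sorted(NOUN_ENDINGS.get(lang, []), key=len, reverse=True):
        if token.endswith(ending) and len(token) - len(ending) >= MIN_STEM_LEN:
            return token[: -len(ending)]
    return token


def build_index(docs, lang):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[stem(token, lang)].add(doc_id)
    return index


def search(index, query, lang):
    """Return ids of documents that contain every stemmed query term."""
    results = [index.get(stem(t, lang), set()) for t in tokenize(query)]
    return set.intersection(*results) if results else set()
```

Because the query passes through the same tokenizer and stemmer as the indexed text, a made-up word such as "glorps" is reduced to "glorp" and can still be matched, even though neither form appears in any dictionary.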

 