Techniques are provided for identifying near-duplicate items in large collections of items. A list of (value, frequency) pairs is received, and a sample (value, instance) is returned. The value is chosen from the values in the list, and the instance is a value less than the frequency paired with that value. The sample is selected in such a way that the probability of selecting the same sample from two lists is equal to the similarity of the two lists.
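The sampling property can be illustrated with a minimal Python sketch. This is not the method claimed in the patent. It is a naive expansion approach that assumes non-negative integer frequencies, and it takes "similarity" to mean the weighted Jaccard similarity (sum of minimum frequencies divided by sum of maximum frequencies), which is an assumption not stated in the abstract. The function names `consistent_sample`, `weighted_jaccard`, and `estimate_similarity`, and the `seed` parameter, are hypothetical.

```python
import hashlib


def _pair_hash(value, instance, seed):
    # Deterministic hash of one (value, instance) pair under a given seed.
    data = repr((seed, value, instance)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def consistent_sample(pairs, seed=0):
    """Return a (value, instance) sample with 0 <= instance < frequency.

    Naive sketch: every (value, frequency) pair is expanded into
    `frequency` distinct (value, instance) pairs, and the pair with the
    smallest hash wins. Cost is proportional to the sum of frequencies.
    Returns None for an empty list.
    """
    best, best_hash = None, None
    for value, frequency in pairs:
        for instance in range(frequency):
            h = _pair_hash(value, instance, seed)
            if best_hash is None or h < best_hash:
                best, best_hash = (value, instance), h
    return best


def weighted_jaccard(a, b):
    # Assumed similarity measure: sum of min frequencies / sum of max.
    fa, fb = dict(a), dict(b)
    keys = set(fa) | set(fb)
    num = sum(min(fa.get(k, 0), fb.get(k, 0)) for k in keys)
    den = sum(max(fa.get(k, 0), fb.get(k, 0)) for k in keys)
    return num / den if den else 1.0  # convention: two empty lists match


def estimate_similarity(a, b, trials=400):
    # Fraction of seeds for which both lists yield the same sample.
    hits = sum(
        consistent_sample(a, s) == consistent_sample(b, s)
        for s in range(trials)
    )
    return hits / trials


doc_a = [("apple", 2), ("pear", 1)]
doc_b = [("apple", 1), ("pear", 1)]
print(consistent_sample(doc_a))
print(weighted_jaccard(doc_a, doc_b), estimate_similarity(doc_a, doc_b))
```

Because both lists are hashed with the same function, they select the same sample exactly when the minimum-hash pair of their combined expansion lies in both expansions. Under the usual assumption that the hash behaves like a uniformly random one, this happens with probability equal to the weighted Jaccard similarity. The expansion makes the running time grow with the frequencies, so this sketch is only practical for small integer counts.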

 