Varying cluster number in a scalable clustering system for use with large databases

In one exemplary embodiment the invention provides a data mining system for use in finding cluster of data items in a database or any other data storage medium. A portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original model data distributions in each of the K clusters in a clustering model. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. Each time the data is read from the database, a holdout set of data is used to evaluate the model then current as well as other possible cluster models chosen from a candidate set of cluster models. The evaluation of the holdout data set allows a cluster model with a different cluster number K' to be chosen if that model more accurately models the data based upon the evaluation of the holdout set.

Web www.patentalert.com

< (none)

< Privacy data escrow system and method

> Optical scanner

> (none)

~ 00044