The invention concerns a detection of duplicate tuples in a database. Previous
domain independent detection of duplicated tuples relied on standard similarity
functions (e.g., edit distance, cosine metric) between multi-attribute tuples.
However, such prior art approaches result in large numbers of false positives if
they are used to identify domain-specific abbreviations and conventions. In accordance
with the invention a process for duplicate detection is implemented based on interpreting
records from multiple dimensional tables in a data warehouse, which are associated
with hierarchies specified through key—foreign key relationships in a snowflake
schema. The invention exploits the extra knowledge available from the table hierarchy
to develop a high quality, scalable duplicate detection process.