Adam C. Scott: Duplicate Keys

Cleaning Data

Duplicate keys happen. I see it most when you feed data into a database and the source data is dirty.

Source data is usually dirty, that's why they want it in a database.

The UNIQUE constraint clause in SQL prevents duplicate keys from ever getting into your pristine database--at least in theory. Sometimes a plain old attribute just needs to be turned into a key for some "practical" reason.

A common data cleaning need is finding and removing duplicate keys. Don't forget to turn on unique constraints for your newly clean keys when you are ready. You never know when you might get hit by a drive-by data sludger.

Finding Duplicate Keys

SELECT my_key, count(*) FROM my_table GROUP BY my_key HAVING count(*) > 1;

Remove Duplicate Keys

DELETE
FROM     tableA
WHERE uniquekey NOT IN
    (SELECT     MAX(duplicates.uniquekey)
      FROM           tableA As duplicates
      GROUP BY     duplicates.column1, duplicates.column2, duplicates.column3)

For relatively small tables these queries work just fine. When these take a long time, they can hog resources in your database (CPU, RAM, disk I/O). Then, you should create stored procedures and cursors and force commit every 100 rows or so. This can help prevent filling redo logs in Oracle as well.

Adam C. Scott

Monday, April 23, 2012

Duplicate Keys

Cleaning Data

Finding Duplicate Keys

Remove Duplicate Keys

No comments:

Post a Comment

About Me

Blog Archive