We all know we shouldn’t have duplicates in the database. And despite my best efforts, somehow they sneak in from some legacy code, or from the hyper-active-compulsory-submit-the-form-fifty-times-in-five-seconds-bloke.
So now and then I find myself writing the same SQL to track them down. Most solutions online say that you have to use a temporary table or rename tables to get rid of them. This solution works without either, providing my way of eliminating the duplicates is sufficient for your needs.
Here is my code to find duplicates:
SELECT * FROM users u JOIN ( SELECT username, min(userid) AS minid FROM users GROUP BY username HAVING ( COUNT(username) > 1 ) ) AS dupusers ON u.username = dupusers.username ORDER BY userid DESC
Let’s first look at the nested Select:
SELECT username, min(userid) AS minid FROM users GROUP BY username HAVING ( COUNT(username) > 1 )
Because I identify a record by the same username, I group on that. Once grouped on username, I can use Having to check the count for that username. If the count is > 1, then it is a duplicate of another record. I am taking the viewpoint that the first user inserted is the one we are going to keep. All users with the same username after that one is hence a duplicate. I am returning the minimum userid for each duplicate record so I know which one I will not delete later.
This query by itself would find you all the duplicates, but isn’t as handy because you can’t view all the columns of each record.
Now if we look at the full join, I join the two tables together: The nested table is the set of all users that have a duplicate in the database inner joined with the set of all users. This will give us the set of all users that have duplicates.
Ok, so how do we go about deleting the sibling duplicates of a base record and not every record that has a duplicate? In other words, if I have:
I only want to delete records with ID’s of 2 and 3; not the ID of 1. Let’s look at my Delete syntax:
DELETE users FROM users u JOIN ( SELECT username, min(userid) AS minid FROM users GROUP BY username HAVING ( COUNT(username) > 1 ) ) AS dupusers ON u.username = dupusers.username WHERE u.userid <> minid
The code is the exact same with two exceptions. One, I switched from the Select statement to a Delete. The second is the filter :
WHERE u.userid <> minid
This will delete every userid that isn’t the smallest userid for that group of duplicates. Using my example set above, ID’s 2 and 3 would not equal the minid of 1, and therefore would be deleted, leaving one non-duplicated record.
This code assumes that a row is a duplicate of another if their username is the same. In most cases, a duplicate is only a duplicate if a combination of fields are the same. This can be accommodated by changing the nested Select statement:
SELECT * FROM users u JOIN ( SELECT firstname, lastname, min(userid) AS minid FROM users GROUP BY firstname, lastname HAVING ( COUNT(*) > 1 ) ) AS dupusers ON u.firstname = dupusers.firstname AND u.lastname = dupusers.lastname ORDER BY u.lastname, u.firstname
This is a lame example since there can be many people with the same name, but you get the picture. In reality, I have had to use this technique when looking for duplicate addresses. For that I grouped on address, city and zip.
So there you go…a way to nuke duplicates without using temp tables.