SQL: Removing Duplicate Entries in a Table

Here’s a scenario I just faced: a long time ago our client had a CMS built that had it’s own newsletter app built into it. They used it lovingly for many years. A few years ago they had a redesign done(not by us). That designer decided that they were going to use Mailchimp from now on; even though all they did was re-skin the codebase. The client used Mailchimp for a while but didn’t like it and they wanted to go back to their integrated newsletter.

But they had transferred over their old mailing list when they first used Mailchimp.

So the object was to export their Mailchimp list and import that list into their current newsletter database. And remove all of the duplicate entries.

The export and import were cake. The duplicate entries, not so much. Here is the solution I eventually came to.

DELETE FROM ThisTable WHERE KeyField IN

( SELECT all_duplicates.KeyField FROM ( SELECT KeyField FROM ThisTable WHERE (KeyField) IN

( SELECT KeyField FROM ThisTable GROUP BY KeyField having count(*) > 1 ) )
AS all_duplicates

LEFT JOIN

(SELECT KeyField FROM ThisTable GROUP BY KeyField having count(*) > 1 )
AS grouped_duplicates ON all_duplicates.KeyField = grouped_duplicates.KeyField )

Let’s take a walk-through this code and find out what’s going on here.

First, a few words on the above code. ThisTable is the table that we are scouring through looking for duplicate entries. KeyField is the column in the table whose value we are going to use to determine if the entry is a duplicate or not. I chose to use EMAIL because that had the greatest chance of remaining constant should a user have signed up in the client’s old CMS and then later Mailchimp.

I also added returns to the code to aid in readability as well as how I’m going to break up the code. SQL doesn’t give a rat’s arse about returns, line breaks, etc.

DELETE FROM ThisTable WHERE KeyField IN

This should be fairly basic if you you’ve worked with SQL before. This says that we are looking for KeyField in all of the rows of ThisTable and we want to delete the row if it meets a certain condition.

That condition is defined by the next sections of code.

( SELECT all_duplicates.KeyField FROM ( SELECT KeyFieldL FROM ThisTable WHERE (KeyField) IN

( SELECT KeyField FROM ThisTable GROUP BY KeyField having count(*) > 1 ) )
AS all_duplicates

At first glance these look near identical, but for the magic word between then: IN.

Think of this section as working backwards. Consider the second section fist. It is searching ThisTable for any rows that have the same value in KeyField, which would make it a duplicate. It is taking all of these duplicates and grouping them together and calling the group all_duplicates.

Now for the first section of code. It is selecting all of the entries in the group that have KeyFields that match; just think of it as it’s selecting the entire group for later use.

JOIN LEFT

Joins can be a hard concept to understand. What JOIN LEFT does is returns all rows from the left table(or the first table in your SQL statement) that have a match in the right table(the second table in your SQL call.) The first table is to the left of the second one if you would put everything on one line in the SQL statement so it’s called JOIN LEFT.

But don’t we only have one table???

We do, but we have a group. So SQL will use the group as the right table for it’s comparison.

(SELECT KeyField FROM ThisTable GROUP BY KeyField having count(*) > 1 ) AS grouped_duplicates ON all_duplicates.KeyField = grouped_duplicates.KeyField )

Now for the last bit.

This says to select the rows from ThisTable that have duplicate values in KeyField and place them in a group called grouped_duplicates, but only on rows that the grouped_duplicates.KeyField(which syntax merely states GROUP.COLUMN) matches the value in all_duplicates.KeyField, which is our first group of duplicates.

These matches are the rows that are going to be DELETED in the very first line of our statement. (Sounds like SQL might have been written by Cybermen…Hmmmm……)

Think of the whole deal as this; we search the table for possible duplicates and place them into a group. This group creates a key that we are going to use to compare with the original table. We compare the table to our key-group and if we find a match the duplicate entry gets deleted.

Also, the duplicate entry is always the more recent entry. So if you have an entry from yesterday and one form today with the same EMAIL, as per our example, today’s entry would be deleted.

This may sound like a complicated way to do things, placing duplicate entries into a group to compare to the same table to place in another group to be deleted, but SQL needs a way to compare all of the rows. This example creates a group of rows that is more manageable to use to compare on subsequent searches.

If not it would need to take row one and compare it to all the rows, then row two, then row three…. if you have 10,000 rows you will need to go through the table 10,000 times, making 10,000 comparisons each time, to compare all of them. That is 100,000,000 comparisons!! In this case you go through once and make the key group, which is significantly smaller than the original table. You then compare the table to the key group and place what is returned into a different group and delete that.

So if you have 10,000 but only 1500 duplicates, you make 15,000,000. That’s a big difference in resources.

Which will make your poor old server very happy.

I’m sure this isn’t the only way to do it and would love to hear about how you would tackle this in the comments below. Thanks for reading!

Leave a Reply

Your email address will not be published. Required fields are marked *