
Wednesday, January 7, 2009

Eliminating Duplicate Primary Keys in SSIS

One of the common tasks in ETL and Data Warehousing is cleaning source system data. One of the common "cleaning" tasks is to weed out duplicate entries before you update dimension tables. Eliminating duplicate primary keys can be dead easy... but it can also be pretty difficult. Here are three ways - from dead easy to pretty hard - to do just that.
All of these methods work inside a single Data Flow Task in Integration Services - either 2005 or 2008, doesn't matter. I've created a SQL 2008 package that demonstrates all three methods - download it from my SkyDrive.
Dead Easy Duplicate Key Removal
The most basic way to detect and "weed out" duplicate rows is simple, but only really works if you don't care about the values in any "non-primary key" columns.
First, feed the rows into a Sort component. Sort by your primary key column(s) and check the box at the bottom to "discard duplicate rows". Done!
The problem with this is that you don't have (much) control over which of the duplicate rows gets passed through the Sort component and which one(s) get discarded.
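
If you want to see what that checkbox boils down to, here's a rough (non-SSIS) sketch of the same idea in C#. The int key and string value are made-up stand-ins for your key and non-key columns; inside the package the Sort component does all of this for you:

    using System.Collections.Generic;
    using System.Linq;

    public static class DeadEasySketch
    {
        // The int key stands in for your primary key column(s); the string value
        // stands in for everything else in the row.
        public static IEnumerable<KeyValuePair<int, string>> DiscardDuplicateKeys(
            IEnumerable<KeyValuePair<int, string>> rows)
        {
            return rows
                .OrderBy(r => r.Key)        // the Sort component: order by the key
                .GroupBy(r => r.Key)
                .Select(g => g.First());    // keep one arbitrary survivor per key
        }
    }
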
Moderately Difficult Duplicate Key Removal
If you know you have duplicates, but you are required to use "non-primary key" column values from the "best" version of the row, this method may work. It's appropriate if you have a way of "scoring" a row that will enable you to identify which row is "best", without looking at the other rows that share the same primary key.
  1. Send your data into a Derived Column component that creates a temporary "score" or "value" column containing some value that will let you (later) pick out the "best" row. Perhaps some measurement as to how many fields of non-primary key data are filled in. You might not even need this Derived Column component if there's already some column that tells you exactly which row you're most interested in - like an "updated date" (where later is always "best"). If you can't do this because you need to know information from other rows that share the primary key, you have to go use the "Pretty Difficult" method.
  2. Multicast your flow into two streams, "A" and "B". (They aren't labeled that way in the designer, but I'll refer to them as streams "A" and "B" from here on.)
  3. Send stream "A" through an Aggregate component, grouping by your primary key columns and ONLY your primary key columns. Use an aggregation operation - like Maximum (for the date example) or Minimum - on your "score" column. (All the other column information will be lost here - that's OK.)
  4. Send stream "A" through a Sort component, sorting by the primary key column(s), then your "score" column. (This is necessary for the Merge Join component coming up.)
  5. Send stream "A" into the LEFT side of a Merge Join component.
  6. Send stream "B" into a Sort component, sorting by the primary key column(s), and your "score" column. Do NOT check the "discard duplicates" box.
  7. Send stream "B" into the RIGHT side of the Merge Join component.
  8. Edit the Merge Join, make it an Inner Join (the default), and pick all the columns from the right-hand (stream "B") list. Because both inputs are sorted on the primary key plus the "score", the join only lets through the rows whose score matches the per-key best - a conceptual sketch of that logic follows this list.
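
As a conceptual sketch only, here's roughly what the Aggregate plus Merge Join combination works out to if you wrote it by hand in C#. CustomerRow, LastUpdated (the "score"), and the column names are made-up placeholders, not anything from the package:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public class CustomerRow
    {
        public int CustomerId;        // the primary key column
        public DateTime LastUpdated;  // the "score" column (later = "best" here)
        public string Name;           // a non-key column we want from the best row
    }

    public static class BestRowSketch
    {
        public static List<CustomerRow> KeepBestRowPerKey(IEnumerable<CustomerRow> rows)
        {
            List<CustomerRow> source = rows.ToList();

            // Stream "A": the Aggregate - group by the key, take the maximum score.
            var bestScorePerKey = source
                .GroupBy(r => r.CustomerId)
                .Select(g => new { CustomerId = g.Key, Score = g.Max(r => r.LastUpdated) });

            // Stream "B" joined back to stream "A" on key + score (the Merge Join).
            // Note: if two rows tie for the best score, both will still come through.
            return source
                .Join(bestScorePerKey,
                      b => new { b.CustomerId, Score = b.LastUpdated },
                      a => new { a.CustomerId, a.Score },
                      (b, a) => b)
                .ToList();
        }
    }

The Sort components in steps 4 and 6 exist only because the Merge Join requires sorted inputs; the sketch above skips that detail.
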
Pretty Difficult Duplicate Key Removal

If you can't decide which row of the (potential) duplicates to pass through without knowing what's in the other rows, you need to use a Script Transform in asynchronous mode. (Asynchronous in this instance means that we're not outputting one row for each and every row that comes into the component.)
  1. Place a Script component in your flow, identifying it as a transformation component (the default).
  2. Open the Script component editor, pick the "Input Columns" tab. Select the key columns, and any other columns that you want to pass through and/or will be used to determine which row is "best" within the duplicates.
  3. Still in the Script component editor, pick the "Inputs and Outputs" tab. Pick the "Output0" node in the treeview. Change the SynchronousInputID value to "None".
  4. Still on the "Inputs and Outputs" tab of the Script component editor, open the "Output0" node, and select the "Output Columns" folder icon. Press the "Add Column" button, and add your first primary key column to the output - pay attention to data types! You will have to repeat this process and add back ALL of the columns. By making the Script asynchronous, we've essentially removed the entire output column definition that's usually constructed for us by the Script component.
  5. Now you get to edit the Script! (Too much to post here - download the full package from my SkyDrive; a rough sketch of the approach appears after the note below.)
One note for the script component... As it stands, the code in there does NOT match the Microsoft documentation, or the way I'm used to the Script component or custom components working. I've posted a bug on Connect asking for clarification, since I have it in a repeatable form...
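
To give a feel for the pattern (this is NOT the SkyDrive script, just a sketch): assume two hypothetical columns, CustomerId (int) and Name (string), were selected in step 2 and added back to Output0 in step 4. An asynchronous transformation that keeps one "best" row per key then looks roughly like this:

    // Rough sketch only. Input0Buffer, Output0Buffer and UserComponent are
    // generated for you by the Script component; CustomerId and Name are
    // hypothetical columns assumed to exist on both Input0 and Output0.
    using System.Collections.Generic;

    public class ScriptMain : UserComponent
    {
        private class Candidate
        {
            public int CustomerId;
            public string Name;
            public int Score;   // whatever "best" means for your data
        }

        // The best candidate seen so far for each primary key value.
        private readonly Dictionary<int, Candidate> best = new Dictionary<int, Candidate>();

        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {
            // Score this row - here, simply "has a Name" beats "doesn't have one".
            Candidate candidate = new Candidate();
            candidate.CustomerId = Row.CustomerId;
            candidate.Name = Row.Name_IsNull ? null : Row.Name;
            candidate.Score = Row.Name_IsNull ? 0 : 1;

            Candidate existing;
            if (!best.TryGetValue(candidate.CustomerId, out existing)
                || candidate.Score > existing.Score)
            {
                best[candidate.CustomerId] = candidate;
            }
        }

        public override void Input0_ProcessInput(Input0Buffer Buffer)
        {
            while (Buffer.NextRow())
            {
                Input0_ProcessInputRow(Buffer);
            }

            // Only after the last input buffer arrives do we write any output rows.
            if (Buffer.EndOfRowset())
            {
                foreach (Candidate c in best.Values)
                {
                    Output0Buffer.AddRow();
                    Output0Buffer.CustomerId = c.CustomerId;
                    if (c.Name == null)
                    {
                        Output0Buffer.Name_IsNull = true;
                    }
                    else
                    {
                        Output0Buffer.Name = c.Name;
                    }
                }
                Output0Buffer.SetEndOfRowset();
            }
        }
    }

The important parts are that nothing is written to Output0Buffer until Buffer.EndOfRowset() reports the last input buffer, and that the dictionary holds only one candidate per key, so memory grows with the number of distinct keys rather than the number of input rows.
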
Other Methods?
If you know of some other slick ways of eliminating or processing duplicates, please share...
Update
If you're interested in extracting the duplicates from the data flow and doing something with them instead of discarding them, read Extracting Duplicate Primary Key Rows in SSIS.

6 comments:

  1. If you are getting the rows from SQL Server, you can use the Rank() function, as shown by Rafael here: http://rafael-salas.blogspot.com/2007/04/remove-duplicates-using-t-sql-rank.html

  2. Thanks for the idea, John. You can do the same type of thing as RANK() by using SQLIS.com's Row Number transformation: http://www.sqlis.com/post/Row-Number-Transformation.aspx

  3. Thanks mate...!!! This really helped. :-)

  4. Just came back for the third time - have needed all three techniques now. Very clear info and exceedingly handy, thanks very much.

  5. For flat files, a simple scoring mechanism for the moderately difficult approach would be row numbering the flat file and taking the last appearance of the dupe. Hope it helps.

    http://sqlsaga.com/ssis/how-to-generate-an-incrementing-number-in-ssis-using-a-script-component/

  6. The Dead Easy method worked quite well for a quick fix. Thanks.
