Pages

Friday, January 9, 2009

Extracting Duplicate Primary Key Rows in SSIS

Wow.  This seems to be a hot topic this week at the Microsoft SSIS forums, so here's another fleshed out explanation of something very similar (and very related to) Eliminating Duplicate Primary Keys in SSIS.
Perhaps you don't want to discard the "duplicate" rows - perhaps you want to separate them out in order to log some information, report an error, or "fix" them somehow.  It only takes one and a half steps more than the "Moderately Difficult" process described in Eliminating Duplicate Primary Keys in SSIS.
Replace step 8 with: Edit the Merge Join and set the join type to "Full Outer Join".  Pick all the columns from the right-hand list.  Pick ONE column from the left-hand list (any column, doesn't matter).  Down in the bottom half of the Merge Join editor, you'll see all the columns the Merge Join is creating on the output.  Scroll down to the bottom, and you'll see the column you just picked from the "left" side - in the "Output Alias", it will most likely have the column name plus a " (1)" on it, because it's the same name as a column you already picked coming from the "right" side.  Change the column name to "duplicate_if_null".
Step 9 is to add a Conditional Split after the Merge Join.  Call the first output "Duplicates" with a condition of "ISNULL(duplicate_if_null)".  Name the default output "Originals" (or something like that).
Now you've got two outputs from the Conditional Split - one that gives you all of the "duplicate" rows, and another that gives you the "original" or "best" rows.

No comments:

Post a Comment