Pages

Thursday, August 20, 2009

Inserting Records AND Getting The Identity in SSIS

This is (another) topic that comes up in the MSDN SSIS Forums fairly frequently - and will likely do so again, so here's my reference for future use.
Update 2011-08-29: See part 2...
The Business Problem
It's not uncommon to have some sort of ETL process that requires you to insert some records into a SQL table that has an identity column, and then use the identity value for those newly inserted records in a subsequent process.  This is typically a requirement of parent-child processes such as sales order header and detail lines.
The Technical Problem(s)
Using SSIS, this doesn't appear to be possible using out-of-the-box components.  None of the stock Destination components has an output (other than a possible error output) - so even if they could retrieve the identity value for each row they inserted, they don't have a way to pass it along.  The OLE DB Command component can be configured to issue an INSERT command instead of the typical UPDATE - but again, there's no facility in that component for adding a new "return" column to the data flow, or even populating an existing column.
Seems like a major oversight on Microsoft's part, doesn't it?  But it isn't really - not when you look under the covers.  It just seems like a "hole" simply because everything else with the above scenario is so very easy to accomplish with SSIS.
So why do I say this isn't an "oversight"?  Well, it's only an oversight if it's possible to accomplish using other tools - particularly T-SQL.  And, well - it isn't possible AFAIK - at least not using "set-based" commands.  (Do let me know if I'm wrong!)  Think about it - how would you issue an INSERT command that would automatically SELECT the resulting inserted records?  (In special circumstances, this is possible - but those circumstances are usually easy to solve with other means, so you probably wouldn't be here - but I'll solve those too.)  Not even the new OUTPUT clause in SQL 2005+ can help us here (it doesn't return identity or computed columns).
UPDATE 2009-10-24: I just found out I was wrong about that OUTPUT clause - I was reading the documentation incorrectly!  Unfortunately, that still doesn't help us with the problem in SSIS... but now that I know it's possible in T-SQL, I'll be thinking about how to incorporate that ability within a custom component, or how to best post the suggestion on Connect.  TYVM RBarryYoung...
The Solutions
All is not lost!  Those of you who know a little T-SQL will know that you can work around the problem even without the OUTPUT clause - you just have to give up set-based methods and turn to iterative, cursor-based methods.  Yes - this means RBAR (row by agonizing row).  So therefore, most of the solutions given below perform poorly, because they're performing row by row.
I'll start with the "simplest" methods and progress to more complex ones.  I mean "simple" as in easiest to describe and implement quickly in SSIS.  For these (and most other things in life) simple typically does NOT mean "best".  I suggest you evaluate more than one of these alternatives to see if it is fit for your specific purposes - and modify the ideas freely to achieve a workable solution for you.
Insert, Then Re-Read And Merge
This solution applies when the rows you're inserting into the database can be uniquely identified by a (set of) non-identity columns - in other words, it has a primary key.
In your existing Data Flow, insert a Multicast component immediately preceding the Destination component. Attach a Raw File Destination component to another Multicast output, so that the same data gets written to a RAW file.  When the data gets sent to the Destination, an identity value will be given to each column.
Start a second Data Flow, and use the RAW file as a Source in that flow.  Then use the Lookup component to look up the identity value for each row, using the primary key column(s) to look it up.  (This solution only requires that the table have a conceptual unique constraint - regardless of whether the RDBMS it's stored in has defined a primary key or other referential integrity construct to enforce it.)
Add A GUID or Other "Unique" Column To The Destination Table
This solution applies if you are permitted and able to add a column to the destination table.  We're essentially adding data to your table to be able to solve the problem using the prior solution.
In your existing destination table definition, add an NCHAR(36) column (to hold a GUID), or an INT column (to hold a generated ID).  You can choose to ALTER your table to add this column before the rows are inserted, then drop it after - or you can permanently add this column to your table.  Choosing to use a GUID will work if concurrency is an issue (multiple package instances are adding at the same time), where an INT column will only work if exclusivity can be guaranteed.
If you choose to use an INT, I recommend that you issue an UPDATE to the table prior to your INSERT that SETs the column to zero - this makes assigning numbers easier.
In your existing Data Flow, you need to add a Script transform component immediately prior to the Destination component.  Using a Derived Column component will NOT work - there's no way to assign a unique GUID in a Derived Column component, or row numbers.  Inside this Script, you need to set the GUID/INT column as ReadWrite, and use the .Net Guid class to assign new GUIDs and/or unique integer values (typically a row number) to each row.  For INT columns, you could alternatively use one of the Row Number transforms available on the SSIS Community Tasks and Components site.
Similar to the prior solution, use a Multicast component to split the flow, sending one copy to the Destination component, and the other to a Raw File Destination.  Start a second Data Flow, read the Raw File, and use a Lookup component to look up the GUID/INT identifying column, and retrieve the identity value.
Use a Staging Table and Stored Procedure/Query
This solution applies when you are permitted and able to add a temporary table to the database.
Create a temporary table in the RDBMS of the identical structure as the actual destination table - including the identity column.  However, remove the identity specification from the column, leaving it as a non-identity, and allow it to be nullable.  Add another column to the table - a "temporary_identity" column, defined as an indentity field.
1. Alter your Data Flow to dump the data into the temporary table - but do not map any column to the original "identity" column (which will cause it to remain NULL when records are added), or the new "temporary_identity". 
2. Then add an Execute SQL Task that executes a query/stored procedure that defines a loop that will exit when all of the entries in the staging table have a non-NULL value in the identity column.  Inside the loop,
    a) Select the contents of the TOP 1 row in your staging table that has a NULL "identity".  Make sure you store the "temporary_identity" value.
    b) INSERT those values into your destination table.
    c) Read the value in @@IDENTITY (or @@SCOPE_IDENTITY or @@IDENT_CURRENT)
    d) UPDATE your temporary table, SETting the identity column to the value retrieved from @@IDENTITY, keyed on the temporary_identity column
3. Read in the temporary table in a subsequent Data Flow (without the temporary_identity column)
Other Alternatives
Like I said, none of the above is particularly pretty, but they'll all work.  If none of the above turns you on, you could always do something a little more complex, like use a Script - but you're still going to be implementing an RBAR, or a temporary table type of solution.

8 comments:

  1. My solution to this problem was to add a unique number to a comment field to find the rows I just inserted. Once I had them, I then clear out the comment field (it maps to a comment field on the user interface). Again, not pretty, but it does work.

    Thank you for this blog.

    JamesNT

    ReplyDelete
  2. Doesn't this elegantly solve the issue?

    http://stackoverflow.com/questions/8914818/ssis-insert-a-single-row-and-return-its-primary-key

    ReplyDelete
  3. Yes, it does - but at a cost. I found that solution a while back, but hadn't updated this entry.
    Using an OLE DB Command with an output parameter works... but it's also an OLE DB Command... and that's very slow!
    Use what works for you - thanks for the comment.

    ReplyDelete
  4. Spin it the way you want, but this is an oversight from MS, or atleast a design constraint that should not exist.

    ReplyDelete
  5. i found a solution to this issue

    ReplyDelete
  6. Hi
    Whats the solution you found?

    Can you pelase share.

    Thanks

    ReplyDelete
  7. For a solution that doesn't involve a RBAR loop, start with an OLEDB Command that querys destination table and returns MAX(ID) to a variable. Next, insert into the destination table. Finally, using an OLEDB Source, select ID from destination where ID > previously saved max id.

    ReplyDelete
  8. "Insert, Then Re-Read And Merge" explained above worked for me. Many Thanks!

    ReplyDelete