Monday, August 29, 2011

Inserting Records AND Getting The Identity in SSIS - Part 2

Since I posted a while back on retrieving the identity value for inserts in SQL Server Integration Services, I've learned many things - one of which is an inline solution to that problem that I didn't know back then.  Check out that previous post for the business problem, then come back here for an "inline" solution using SSIS - no scripts, just regular SSIS tasks and components.
It's a Love/Hate Relationship
The OLE DB Command component in SSIS is a near-universally hated object due to its inherent performance problems.  But it's so hard to hate when it can actually get the job done when nothing else can.
Yes, our old nemesis is back to solve a problem for us... at the same price it's always asked.  Do you want your packages to run fast, but be complicated?  Or do you want them to run slow, but be easy to understand?  Sadly, there is not much middle ground.
The Keys
It turns out to be pretty easy, and I should have gotten it into my thick skull before, given all the hints I'd read.  A stored procedure with an OUTPUT parameter and the OLE DB Command are the keys.  For this example, assume we're using a table that has only two columns: ID and TEXTVALUE, with ID being the automatically generated identity column.
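For reference, the example table described above might be defined like this (a sketch - the exact column types and seed are assumptions, since the post only describes the table in prose):

```sql
-- Example table: ID is the auto-generated identity column
CREATE TABLE IdentityTable (
  ID INT IDENTITY(1,1) PRIMARY KEY,
  TEXTVALUE CHAR(10)
);
```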
The Stored Procedure
Unfortunately, this part doesn't use much of the GUI power of SSIS - it's old-fashioned hand-coding.  For our simple table, this would be the stored procedure definition:

CREATE PROCEDURE InsertData
  @textvalue AS CHAR(10),
  @id AS INT OUTPUT
AS
INSERT INTO IdentityTable (textvalue)
VALUES (@textvalue)

SET @id = SCOPE_IDENTITY()
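To sanity-check the procedure outside SSIS, you can call it from a query window like this (the literal value is just an example):

```sql
DECLARE @newid INT;

-- Insert a row and capture the generated identity in @newid
EXEC InsertData @textvalue = 'hello', @id = @newid OUTPUT;

SELECT @newid AS NewIdentity;
```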

The SSIS Data Flow
Use a Derived Column component to add the ID column to the data flow, with an expression of NULL(DT_I4).  (We need a column to hold the identity value we get back.)


Following that, we can work on the OLE DB Command component, using a statement like this:

EXEC InsertData ?, ? OUTPUT


Map the columns as you'd expect to map them.
If you'd like, I've got an SSIS 2008 package showing this technique.
The Big Picture
Sure, it works - but here are the drawbacks:
It's going to perform badly - horribly if the package is run on a different server from the database the stored procedure is in.  I can't stress that enough.  The OLE DB Command component makes a round-trip call to the database for each row.  Even on the same machine, that's an expensive operation.
You're doing RBAR (row by agonizing row) processing, which is much, much slower than the set-based processing SQL Server is better suited for.  (SSIS itself is fine with RBAR - it's designed that way.)  Neither the SQL Server process nor the Integration Services runtime will work very hard during this part of the data flow - I doubt even one core would be fully utilized, due to the wait states involved.
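For contrast, the set-based approach from part 1 can return every generated identity in a single round trip using the OUTPUT clause - a sketch, assuming the example table above:

```sql
-- Insert a batch of rows and return each generated identity
-- alongside its inserted value, in one statement
INSERT INTO IdentityTable (textvalue)
OUTPUT INSERTED.ID, INSERTED.textvalue
VALUES ('row one'), ('row two'), ('row three');
```

This is why the RBAR design hurts so much by comparison: one statement replaces one round trip per row.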
If you're interested, here's a sample package to show this process.  It creates a database on your local machine, creates the sample table, generates one row, performs the insert (and shows the result to you), then tears the database down.
Can we mitigate this RBAR flaw?  Stick around (or subscribe in your RSS reader) to find out in an upcoming post.

7 comments:

  1. Todd, I wrote about another method of doing something similar to this for inferred members a couple of weeks ago. Can I get your opinion on the pros/cons of the method I went with? One thing I don't like is that some of it is hardcoded. :( http://www.bidn.com/blogs/DevinKnight/ssis/2102/ssis-inferred-member-insert-in-fact-table-load

  2. This is very much like what we use for our inferred members, but we have a lookup against the target table first to fetch that ID if present and insert an inferred member if not. (Sorry, Devin - can't remember if we got this idea from something of yours or not. We've been using it for ~1.5 years and kudos go to our manager at the time for the suggestion whether he found it or came up with it on his own.) I'm looking forward to the next post to see how we can get away from the RBAR part of this, if that's even possible for our configuration.

  3. @Devin - yes, another way to go is scripting. I wish it weren't ever necessary! There should be enough tools in the box to work with. I think both our examples have the RBAR issue to overcome. The only thing I'd change in yours (since it works for you) is to use managed connections this way - but that's up to you.

    @Peter - Yes, a full-cached lookup would be appropriate for dimension tables. This is more general purpose, but you could use it in your "no match" flow - only if you know you won't try to look up the same value twice.

    The RBAR won't go away in the next post(s) - we'll just try to parallelize it. To make the RBAR go away, you have to go back to my older post (part 1).

  4. Our problem at the time was that the "no cache" lookup was too slow and any actual caching would try (and all too often succeed) to insert duplicate inferred members. We used a raw file cache w/ something very similar to what you suggested here. It just added an extra lookup against the root table just in case it had already been handled. I appreciate you covering it again because that duplicate issue was killing us ~2-3 years back. I wish we'd had something like this to reference. :) I also appreciate seeing it without scripting, if only because that's way outside my normal comfort level. And I'll look forward to seeing some parallel processing. Anything to speed up some of these processes we're running now, though I dread the refactoring that will likely ensue.

  5. Hi, I can't download the sample. Can you upload it again?

  6. Probably a curse of SkyDrive not having permalinks. I've updated the link - hope that keeps working!

  7. Good article, thank you. It turns out that the sample is no longer available again. I would like to have it. Thanks!
