Thursday, August 20, 2009

Inserting Records AND Getting The Identity in SSIS

This is (another) topic that comes up in the MSDN SSIS Forums fairly frequently - and will likely do so again, so here's my reference for future use.
Update 2011-08-29: See part 2...
The Business Problem
It's not uncommon to have some sort of ETL process that requires you to insert some records into a SQL table that has an identity column, and then use the identity value for those newly inserted records in a subsequent process.  This is typically a requirement of parent-child processes such as sales order header and detail lines.
The Technical Problem(s)
Using SSIS, this doesn't appear to be possible using out-of-the-box components.  None of the stock Destination components has an output (other than a possible error output) - so even if they could retrieve the identity value for each row they inserted, they don't have a way to pass it along.  The OLE DB Command component can be configured to issue an INSERT command instead of the typical UPDATE - but again, there's no facility in that component for adding a new "return" column to the data flow, or even populating an existing column.
Seems like a major oversight on Microsoft's part, doesn't it?  But it isn't really - not when you look under the covers.  It just seems like a "hole" simply because everything else with the above scenario is so very easy to accomplish with SSIS.
So why do I say this isn't an "oversight"?  Well, it's only an oversight if it's possible to accomplish using other tools - particularly T-SQL.  And, well - it isn't possible AFAIK - at least not using "set-based" commands.  (Do let me know if I'm wrong!)  Think about it - how would you issue an INSERT command that would automatically SELECT the resulting inserted records?  (In special circumstances, this is possible - but those circumstances are usually easy to solve with other means, so you probably wouldn't be here - but I'll solve those too.)  Not even the new OUTPUT clause in SQL 2005+ can help us here (it doesn't return identity or computed columns).
UPDATE 2009-10-24: I just found out I was wrong about that OUTPUT clause - I was reading the documentation incorrectly!  Unfortunately, that still doesn't help us with the problem in SSIS... but now that I know it's possible in T-SQL, I'll be thinking about how to incorporate that ability within a custom component, or how to best post the suggestion on Connect.  TYVM RBarryYoung...
The Solutions
All is not lost!  Those of you who know a little T-SQL will know that you can work around the problem even without the OUTPUT clause - you just have to give up set-based methods and turn to iterative, cursor-based methods.  Yes - this means RBAR (row by agonizing row).  So therefore, most of the solutions given below perform poorly, because they're performing row by row.
I'll start with the "simplest" methods and progress to more complex ones.  I mean "simple" as in easiest to describe and implement quickly in SSIS.  For these (and most other things in life) simple typically does NOT mean "best".  I suggest you evaluate more than one of these alternatives to see if it is fit for your specific purposes - and modify the ideas freely to achieve a workable solution for you.
Insert, Then Re-Read And Merge
This solution applies when the rows you're inserting into the database can be uniquely identified by a (set of) non-identity columns - in other words, it has a primary key.
In your existing Data Flow, insert a Multicast component immediately preceding the Destination component. Attach a Raw File Destination component to another Multicast output, so that the same data gets written to a RAW file.  When the data gets sent to the Destination, an identity value will be given to each column.
Start a second Data Flow, and use the RAW file as a Source in that flow.  Then use the Lookup component to look up the identity value for each row, using the primary key column(s) to look it up.  (This solution only requires that the table have a conceptual unique constraint - regardless of whether the RDBMS it's stored in has defined a primary key or other referential integrity construct to enforce it.)
Add A GUID or Other "Unique" Column To The Destination Table
This solution applies if you are permitted and able to add a column to the destination table.  We're essentially adding data to your table to be able to solve the problem using the prior solution.
In your existing destination table definition, add an NCHAR(36) column (to hold a GUID), or an INT column (to hold a generated ID).  You can choose to ALTER your table to add this column before the rows are inserted, then drop it after - or you can permanently add this column to your table.  Choosing to use a GUID will work if concurrency is an issue (multiple package instances are adding at the same time), where an INT column will only work if exclusivity can be guaranteed.
If you choose to use an INT, I recommend that you issue an UPDATE to the table prior to your INSERT that SETs the column to zero - this makes assigning numbers easier.
In your existing Data Flow, you need to add a Script transform component immediately prior to the Destination component.  Using a Derived Column component will NOT work - there's no way to assign a unique GUID in a Derived Column component, or row numbers.  Inside this Script, you need to set the GUID/INT column as ReadWrite, and use the .Net Guid class to assign new GUIDs and/or unique integer values (typically a row number) to each row.  For INT columns, you could alternatively use one of the Row Number transforms available on the SSIS Community Tasks and Components site.
Similar to the prior solution, use a Multicast component to split the flow, sending one copy to the Destination component, and the other to a Raw File Destination.  Start a second Data Flow, read the Raw File, and use a Lookup component to look up the GUID/INT identifying column, and retrieve the identity value.
Use a Staging Table and Stored Procedure/Query
This solution applies when you are permitted and able to add a temporary table to the database.
Create a temporary table in the RDBMS of the identical structure as the actual destination table - including the identity column.  However, remove the identity specification from the column, leaving it as a non-identity, and allow it to be nullable.  Add another column to the table - a "temporary_identity" column, defined as an indentity field.
1. Alter your Data Flow to dump the data into the temporary table - but do not map any column to the original "identity" column (which will cause it to remain NULL when records are added), or the new "temporary_identity". 
2. Then add an Execute SQL Task that executes a query/stored procedure that defines a loop that will exit when all of the entries in the staging table have a non-NULL value in the identity column.  Inside the loop,
    a) Select the contents of the TOP 1 row in your staging table that has a NULL "identity".  Make sure you store the "temporary_identity" value.
    b) INSERT those values into your destination table.
    c) Read the value in @@IDENTITY (or @@SCOPE_IDENTITY or @@IDENT_CURRENT)
    d) UPDATE your temporary table, SETting the identity column to the value retrieved from @@IDENTITY, keyed on the temporary_identity column
3. Read in the temporary table in a subsequent Data Flow (without the temporary_identity column)
Other Alternatives
Like I said, none of the above is particularly pretty, but they'll all work.  If none of the above turns you on, you could always do something a little more complex, like use a Script - but you're still going to be implementing an RBAR, or a temporary table type of solution.

Thursday, August 13, 2009

New Jersey PASS August: Presentation Slides and Notes

I had a fantastic time at the NJSQL chapter meeting!  You folks in New Jersey certainly get a bad rap - I didn't see any greasy hair or thick gold chains - and nobody threatened to kneecap me if I couldn't solve their SSIS performance problem.  Many thanks to Melissa Demsak, the members that attended, and SetFocus for making it all possible.
As the abstract said, I talked about the typical things asked of SSIS when loading a Data Warehouse: slowly changing dimension processing, surrogate key lookups for fact table loading, and performance tweaks.  It certainly was a lot to cover - and since it was, I buried a bunch of links to supporting information that expanded on each topic in the slide deck.  If you're looking for more information on any of the topics we covered, download the slides and look at the notes, and/or look at the resources slide near the end.
I'll be sure to let Melissa know when I'm in Jersey visiting family again...

Monday, August 10, 2009

I'm Looking Forward To: Parallelism in .Net 4.0

I've written a lot here to date about what I've been able to do with SQL Server Integration Services and Data Warehousing - within my organization as well as within the community.  The largest push for me to do any and all of that is to address "pain points" - irritations with the current process, or a desire to do away with repetitive manual interaction.  I'm not only talking about relieving pain points that I run into.  Part of my job (and my passion) has always been to use what tools and knowledge I have to relieve others' pain points.  Because nobody pays me to make my own job easier - they pay me to make their jobs easier, and achieve better results doing it.
Today I'm going to look at things the other way.  Who out there is going to help me?  (Selfish question, isn't it!)  There are a few answers to that question, I'm glad to say.  I'm looking forward to some new technologies that are on their way, some learning opportunities, some economic recovery, and lots of other things.  Here's the first in (what will hopefully turn out to be) a series of posts to make you (and I) thankful to receive the fruits of other people's labour!
Parallelism in .Net 4.0
A great technological cherry I'm waiting for is in the .Net Framework v4.0 - Parallelism. 
Even before I went to DevTeach and listened to a very informative talk by Ranjan Sen, I was always disgusted by the waste in today's software.  Disgust is a strong word - but I mean it.  This is an era where "waste" in software development typically describes developers or applications that consume massive amounts of memory or disk space, but the kind of waste I'm talking about is a little unorthodox.  I don't know how you feel - but it's driven me to the point of insanity over the past few years having to wait for software to grind through what I've asked it to do - all the while watching it use only one of my processor's cores.  And I've only ever had a dual-core machine!  I would surely have been in the looney bin if I'd sprung for a quad-core.  (In fact, that's one reason I didn't!)
Upcoming in the .Net Framework v4.0 are some great advances in bringing parallel computing to the masses.  Let me start by saying that (AFAIK) there is absolutely nothing in here for the "hard core" parallel computing fan.  Everything a true power-user of .Net's threading capabilities is still there, relatively unchanged.  What is new is parallelism for developers (and applications) that don't understand (or don't have the time to understand) how to manually multi-thread their application.  The architect(s) who pushed for these changes in Microsoft is/are rock stars - I wish I could say he/she had groupies, but sadly, we all know they don't. 
The new capabilities make retrofitting parallelism into existing applications a ton easier, and using parallelism in a new application will be amazingly simple.  How are they doing that?  Not by adding more locking constructs, or thread pooling - no, those are still "too close to the metal" for developers uncomfortable with threading to feel good about.  What they've done is basically taken some common situations where threading would do an app some good, and built some language constructs to almost transparently support multi-threading for them.  Think "design patterns" for threading.
Future<T>
Let's take an example: populating controls on a form - say, dropdown lists.  Now, I'm an ancient developer - I'm not so into databinding and all that, and one of the reasons is that if I databind a combobox control to a database source, it's going to cause visible performance problems in my app at runtime as it goes to "fetch" the data.  So I've typically done this loading manually in a threaded manner, in order to provide a better UI experience.  The new capabilities in .Net 4.0 make this even easier, by introducing a new generic called Future<T> Quite simply, it makes anything you assign to that variable an asynchronous call, but on top of that, it automatically joins the spawned thread back once you actually want to use the results.  Take this code:
List<string> items = MyClass.GoGetItems();
List<string> otherItems = MyOtherClass.GoGetOtherItems();
// do some other work here
// now use my first item list
foreach (string item in items)
{
    cboItems.Items.Add(item);
}
// and now the second list
lblOtherItemCount.Text = otherItems.Count.ToString() + " items in other list.";
The problem with the above is that if it takes "GoGetItems" ten seconds to retrieve whatever it is that it's retrieving, those are ten seconds that the UI (and the user) has to wait.  Then they both have to wait again for "GoGetOtherItems"!
The Future<T> generic allows you to seamlessly spawn threads to handle those calls, like this:
Future<List<string>> items = MyClass.GoGetItems();
Future<List<string>> otherItems = MyOtherClass.GoGetOtherItems();
// do some other work here
// now use my first item list
foreach (string item in items)
{
 cboItems.Items.Add(item);
}
// and now the second list
lblOtherItemCount.Text = otherItems.Count.ToString() + " items in other list.";
What's happening here is that the call to GoGetItems is being started, but doesn't block the rest of the code from executing.  The call to GoGetOtherItems is made immediately, and so is the "do some other work here".  Transparently to you, the compiler recognizes that the start of the foreach loop is requesting use of the results of your first call.  If the spawned call to GoGetItems has already completed, it executes immediately.  If not, it waits until the results are available - again, all transparent to you, the developer.
More Capabilities
Add on to that the parallel for loops and other goodies - and I think you'll see this is a "good thing".  Here's hoping that developers everywhere start to use more than one core... and yes, I'm looking at you Microsofties!
Additional Information
On the Parallel Programming Team's blog:
.NET 4 Beta 1 is now available, with parallelism!
Parallelism Videos Galore
Parallel For Loops over Non-Integral Types

Friday, August 7, 2009

File Properties Task for SSIS

This seems like such an essential tool for Integration Services...
What Is It For?
If you've ever needed to know if a specific file exists, whether you'll be able to open it to read or write to it, or what some of the date or file attribute properties of it are, you probably resorted to using a Script Task and the System.IO namespace.  If you were a little more resourceful, you probably tracked down the File Properties task that Brian Knight of Pragmatic Works developed.  Scripts work well - except for the reusability and opaqueness aspect, and the fact that you need to know the System.IO .Net namespace to get anything done.  The Pragmatic Works task is fine too - except it's only available for SSIS 2005.
If you need more than the above can provide, hopefully this Task can come to the rescue.
What Does It Do?
The task will read properties from a file you specify as a literal string, using property expressions, or from an SSIS variable.  You can fail the task if the file doesn't exist, or you can record whether or not it exists in an SSIS boolean-typed variable.  If it does exist, you can read all kinds of properties off the file into appropriately-typed SSIS variables - perfect for conditionally controlling your package execution with Precedence Constraints set to Expression and Constraint.
Here's a sample view of the Task editor (click to enlarge):
How Do I Get This?
The Task is available for SQL Server Integration Services 2005 and 2008 from here.
If you have any difficulties at all, post in the Project Discussion area.
Can I See How It Works?
You bet!  The entire Task source including the installer projects are on CodePlex.  Download the source, tweak it to add some more capabilities, and then upload a patch back to CodePlex to share your new feature.

Saturday, August 1, 2009

The Post-Build Command Line for SSIS Custom Objects (Updated)

(Refer to my previous post on this topic.)
I was recently troubleshooting a custom object I'd constructed from my template - and (maddeningly) it didn't work!  Of course, trying to spot the undeniably simple problem in the wee hours isn't always productive... but it does sometimes help get some productive supporting work done.  In this case, I came up with the brilliant hypothesis that builds of my components were interfering with each other, so I decided to revisit my post-build commands.
The good thing, I suppose, was that I only made the decision to do that in the wee hours - because the actual problem had nothing to do with that, but I couldn't see that in my befuddled state.  I packed it in and made the execution of that decision my first task the next morning.  I wanted to remove all possibility of "clutter" of previously registered/copied components and tasks could have caused.  Unfortunately, the project properties in Visual Studio has no "clean" command that corresponds to the "post-build"... maybe they should add that...
Since that didn't exist, I needed a batch file that would remove all (installed) traces of a particular SSIS custom object.  In order to reliably do so, I needed to review my "install" script - the post-build events.  I decided I would kill two birds with one stone and encapsulate the several lines that were in my post-build into one batch file - and make it multi-purpose, able to install Tasks, Components, and ForEachEnumerators.  (Those are the kinds of custom objects I've built so far.)  In the end, I planned to have two batch files - an "installer" and an "uninstaller".  Of course, neither of those could add or remove the object(s) from the BIDS toolbox.  Maddeningly, you still have to do that by hand - vote for this Connect issue if you want that resolved.
The result - a fairly large pair of batch files.  I'll share them now to avoid recrimination on pointing people towards an inferior solution earlier...  Find those batch files on my SkyDrive.  The batch files are easily extensible to the other types of custom objects - I leave that as an exercise for the reader...