Todd McDermid's Blog: SSIS "SCD" Component

One would have thought this would have merited more attention from the SSIS team at Microsoft before SQL Server Integration Services debuted in SQL 2005, but apparently not. On top of some sketchy support for ODBC out of the box, SSIS came packaged with a Slowly Changing Dimension "wizard." One or two uses of this wizard demonstrate that it's cute - but definitely not ready for prime time.

My gripes:

It's a "wizard" that isn't really a wizard at all. IMO, a wizard assists in creating a structure of multiple items that are otherwise individually configurable. In other words, a wizard is not required to get the job done - it just helps with initial setup. Once the initial setup is done, the individual components can be edited directly. The packaged SCD "component" doesn't fit that bill. You are required to use the wizard for initial creation AND editing. Sounds like a self-contained editor of an SSIS component instead of a wizard, doesn't it? Unfortunately not. True to my definition of a wizard, it creates multiple components with individual configurations. You can (and should) edit the properties of the constituent components the wizard creates - except you can't edit the central "SCD" component itself individually. You have to use the wizard to do so - and in doing so, most of the edits you made to the other components the wizard initially created are overwritten.
It doesn't handle surrogate keys. (Many opinions on this one - IMO, relying on SQL identity columns to handle SKs is a cop out. Might as well use T-SQL to perform your ETL...)
It's SLOW.

I can live with slow. I can't live with not being able to edit my components easily. I make too many mistakes!

I finally bit the bullet, and true to my nature, dug in to coding my own custom SSIS component that does what I think a Kimball Method slowly changing dimension processor ought to. Apologies up front to Ralph Kimball for using his name so freely - but it's almost public domain in DW/BI, isn't it? I don't have the billions that Microsoft has to shove behind a product such as SSIS (and I think they ought to have put more behind it - it's a great tool), and yet in a couple weeks I was able to whip up something a little more than serviceable, if I may say so. It wasn't easy, because the other thing left behind in the crush to deliver SQL 2005 and SSIS was documentation on developing custom components. Many thanks to the blog posts of Jamie Thompson, Kirk Haselden, Phil Brammer, the MSDN forum posts of "DarrenSQLIS" (whomever he really is), and primarily the example provided by Alberto Ferrari's TableDifference SSIS component.

I might go into the intricacies of SSIS custom component programming in another post - it's a long haul all in itself. I sure don't know as much as the MS guys, but looking at code always helps me (that's why Alberto's stuff pushed me over the top).

To me, a Kimball Method SCD component in SSIS ought to do the following:

Take (at least) two inputs - the "existing dimension" table rows, and rows from the "source system" table. I'll refer to those as ED and SS inputs. The ED input is basically the entire contents of your data warehouse dimension table - the component needs that to determine if any changes have been made in the source system since the last time the dimension was loaded. The SS input should be information from your source system(s), mangled as you need to in order to conform it to your dimensional model.
Supply several outputs: First, an "error" output which I'll call an "invalid input" (II) output (since it doesn't really match the definition of a true SSIS error data flow - truncation errors aren't the only invalid input to the SCD component). Rows will flow to the II output when the component detects that the input row (ED or SS) can't be processed - the business key is invalid, dupliated, ... Second, an "unchanged" output (O) that flows through those rows that exist in the ED (as the "current" row) and SS with no differences. Third, an "SCD1 Changes" (SCD1) output that contains rows requiring an UPDATE to be issued to your dimension table. Fourth, an "SCD2 Expired" (SCD2X) output that contains all the rows that (up until now) were marked as the "current" records in your dimension, but an SCD2 change in the source system now requires them to "expire". Fifth, a "New" (N) output which contains brand-new rows, as well as new versions of rows due to SCD2 changes.
Be easy to identify the "SCD" participation of each column in the ED input, and map the SS columns to the equivalent ED columns for business key and SCD comparisons.
Allow the component to manage surrogate keys all by itself, with some help, or according to rigid directions.
Decent performance.

There were a couple other things I wanted as well, that I wouldn't consider strictly "Kimball Method":

An input for what I call "special members" (SM) for my dimension. Typically, this is only one record for the "unknown member". You know, the one you usually assign the "-1" surrogate key. For some of my dimensions, I have several "special" members to manage, so a separate input appealed to me.
Ability to reject columns if they contain nulls. Nulls just don't belong in a Dimension as business keys, or attributes.

I've finally made it work - here are some action shots:

On the design surface:

The first page of the editor - to indicate the participation of each Existing Dimension column:

Page two, to link the "Special Member" columns (if any) to the Existing Dimension columns:

Page three, to link the "Source System" columns to the Existing Dimension columns:

And an options page to set up surrogate key handling and comparisons, among other things:

How well does it work? With a little performance tweaking, it processed a 150 thousand member dimension table with 70column/1KB rows in about 50 seconds on my laptop (a Toshiba Tecra A7 - T2400(1.83GHz), 4GB RAM, Vista 32bit). After I run it through some more significant testing, I'll probably post the source somewhere...

October 2, 2008 Update: The CodePlex open source project

October 2, 2008 Update: Demo Videos - Part 1 (configuring inputs), Part 2 (configuring outputs)

October 19, 2008 Update: Now available for SSIS 2005

Todd McDermid's Blog

Pages

Tuesday, August 26, 2008

SSIS "SCD" Component

5 comments: