Todd McDermid's Blog: An Alternative to SSIS's SCD Wizard

This will be the last time I plug this for a while - I promise! I just wanted to make sure this page would come up in Google for those who are struggling with the Slowly Changing Dimension Wizard in SSIS.

If you want an SCD component for SQL Server Integration Services that is a single component and doesn't take hours and hours to execute, you ought to try out this alternative. It's available (open-source) on CodePlex - the KimballSCD project. It's modeled on the Kimball Method of dimension processing, which ought to make some of it's more advanced features recognizable and very useful.
(UPDATE 2010-01-27: Version 1.5 has been released!)

Here's a short list of the advantages this component has over the stock SCD Wizard:

It's one component. The "stock" component creates five or six other components on the design surface in addition to itself. The major difficulty with that is that when you have to make changes to the component (and you will) then it will completely wipe out any changes you made to those other attached components - so you'll have to change those again. And again. And again. The KimballSCD component is only one component - making changes doesn't mess up the rest of your data flow.
It's (about 100 times) faster. Especially when configured properly, it can start processing rows before all rows have arrived on the inputs. On my biggest dimension (not that big, I know) I have 380,000 records that are about 1KB in size. Each row has 75 columns, about half are SCD1 and half are SCD2. The KimballSCD component took 2 minutes 26 seconds to process 110,000 incoming rows and determine that 155,000 rows needed updating due to SCD1 changes, 22,000 rows needed to be "expired" due to SCD2 changes, and 9,000 new rows had to be added. It's been ten minutes now waiting on the "stock" component, and it's processed 4% of the records. The major reasons are that it's "cached" (you supply the existing dimension table as a data flow) and multi-threaded.
It supports case (in)sensitivity on column compares. The "stock" component does not, making it much more difficult to construct a data flow that ignores and/or is strict about case.
It supports leading and trailing space (in)sensitivity on column compares. The "stock" component does not, requiring pre- and possibly post-processing to align comparisons between fields.
It supports Surrogate Key generation. A great many SSIS users leave surrogate keys to a database IDENTITY column. However, there can be situations when even those users might like more control over key generation, or want to know keys prior to the insert (for auditing or other processing).
It has auditing support. You're on your own with the "stock" component. The KimballSCD component has two ways of helping you audit changes. You can simply hook up integer variables inside the component that get input and output row counts written to them, or you can attach the Auditing output to the logging mechanism of your choice. The Auditing output includes all the row counts, and can "link" that auditing information to every changed row that comes out of the component via the "audit key" columns. With the Auditing features in the KimballSCD component, you can track what changed on which rows at a very fine level of detail - automatically and effortlessly.
It supports a separate "Special Members" input. For those situations where you maintain more than one "unknown member" (for example, an "unknown date", and an "unknown future date") - or any number of other "special" dimension members - no special pre-or post-processing is required, the KimballSCD component takes care of it.
It has more SCD2 tracking flexibility and completeness. The KimballSCD component completely handles SCD2 housekeeping columns according to flexible rules you specify - no downstream Derived Column transforms required.
It allows you to completely eliminate NULL values from your dimension table. The "stock" SCD component requires NULLs in the SCD 2 expiry column for "current" entries, and doesn't enforce any non-NULL requirements you may have. The KimballSCD component requires all values to be non-NULL by default (although you can change this to allow them), and uses a true (configurable) date for expiries (although you can change this to use NULL if you wish).

If that sounds good to you, go download and install the binaries from CodePlex. And please provide some feedback or feature suggestions - I'll be sure to do those that will help me out, and I do have a tiny bit of spare time. Or download the source, make your changes, and send me a patch!

4 comments:

MorenoAugust 25, 2011 at 10:48 AM
I started using the SCD nad have once question, how do you handle the surrogate key creation. In the scd tutorial they turn off the creation of surrogate keys.
Todd McDermidAugust 25, 2011 at 12:47 PM
The DMSCD has the option of generating surrogate keys for you. Most people use an Identity column in SQL Server to do this, but I wanted some flexibility on numbering, so I rolled it into the component. You have no options with the SCD Wizard.
MorenoAugust 25, 2011 at 7:43 PM
If I try to use it seed 1 increment 1, a warning message pops up saying, Performance is not optimized. Will work with a variable, but is that the right appoach.

I have also noted the package will run without errors, but it is not appending data to the table.
Todd McDermidAugust 26, 2011 at 8:29 AM
I appreciate that you're trying to use it, but this isn't the place for a step-by-step personal tutorial on the DMSCD, Slowly Changing Dimensions, or SSIS. Please take advantage of the tons of training opportunities (for free) available on the web to learn how SSIS works, what you use SCDs for, and how you properly configure the DMSCD. There are videos, walkthroughs, you name it - have fun!

Todd McDermid's Blog

Pages

Wednesday, October 15, 2008

An Alternative to SSIS's SCD Wizard

4 comments: