And it's about ^%#@ time! I simply can't believe it's been over eight months - all for a few new features. Well, okay, maybe not just a few...
What Is The Kimball Method Slowly Changing Dimension Component?
The Kimball Method Slowly Changing Dimension component is a superior alternative to the Slowly Changing Dimension Wizard in SQL Server Integration Services. Both components implement a very common data warehousing process. If you have a data warehouse ETL system, chances are you're using the SCD Wizard... or you got frustrated and resorted to a more performant but less maintainable alternative: a Script component, tens of data flow transforms, or a T-SQL MERGE statement.
I - like many others - was a little disappointed by the poor performance and lack of flexibility of the SCD wizard included in SSIS. Not that the wizard is horrible - it encapsulates the SCD process very nicely - but it's just "not good enough" for me. I have to count myself lucky: I know a little .Net and happen to be a glutton for up-front punishment - so I investigated the extensibility of Integration Services and ended up creating a custom component that makes some different architectural design choices than the SCD wizard does. This different design has many advantages and requires very few compromises - in my opinion, of course.
All you need to do to use it in place of the SCD wizard is to download and install it, add it to your BIDS toolbox, then drop it on your data flow design surface. If you're familiar with processing slowly changing dimensions for a Kimball method data warehouse, I think you'll be using it in seconds.
What Does The Kimball Method SCD Component Do?
Basically, the same thing as the SCD wizard - just differently, and with quite a few more configurable options. The component "processes" slowly changing dimensions. "Processing" means that it compares the current information from your line-of-business system with a dimension table in your data warehouse. (This line-of-business system is also known as your OLTP or "source" system.) It detects changes that have occurred in the OLTP system, and identifies new business entities that need to be inserted into the dimension table, retroactive changes that need to be made to existing dimension rows, and changes that require historical "versioning" of those business entities.
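If the terminology is new to you, here's a minimal T-SQL sketch of the two most common kinds of change handling, using a hypothetical DimCustomer dimension with made-up column names (CustomerID as the business key, plus IsCurrent, RowStartDate, and RowEndDate) - the component routes rows to outputs for you inside the data flow rather than running statements like these itself.

-- Type 1 ("changing") attribute: overwrite in place, keeping no history.
UPDATE dim
SET    dim.PhoneNumber = src.PhoneNumber
FROM   dbo.DimCustomer AS dim
JOIN   staging.Customer AS src
       ON src.CustomerID = dim.CustomerID
WHERE  dim.IsCurrent = 1
  AND  dim.PhoneNumber <> src.PhoneNumber;

-- Type 2 ("historical") attribute: expire the current row...
UPDATE dim
SET    dim.IsCurrent  = 0,
       dim.RowEndDate = GETDATE()
FROM   dbo.DimCustomer AS dim
JOIN   staging.Customer AS src
       ON src.CustomerID = dim.CustomerID
WHERE  dim.IsCurrent = 1
  AND  dim.City <> src.City;

-- ...then insert a new "current" version (this also picks up brand-new business entities).
INSERT INTO dbo.DimCustomer (CustomerID, PhoneNumber, City, RowStartDate, RowEndDate, IsCurrent)
SELECT src.CustomerID, src.PhoneNumber, src.City, GETDATE(), NULL, 1
FROM   staging.Customer AS src
WHERE  NOT EXISTS (SELECT 1
                   FROM dbo.DimCustomer AS d
                   WHERE d.CustomerID = src.CustomerID
                     AND d.IsCurrent = 1);

Type 1 overwrites in place; type 2 expires the current row and creates a new version - the same distinction the wizard calls "changing" versus "historical" attributes.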
How Is The Kimball Method SCD "Better" Than The SCD Wizard?
There are quite a few differences that I think make the KSCD superior to the SCD wizard, but the "big two" reasons are that you can actually edit the KSCD without destroying your data flow logic, and it operates about 100x faster. There are many more things differentiating them, but for now I'll explore just those two. And really - aren't those two things alone worth spending a few minutes trying this out?
Non-Destructive Edits
First, I'll talk about the "issue" with the built-in SCD wizard. The SCD wizard steps you through "configuring" your SCD processing logic quite nicely. It does use some terminology that will seem "foreign" to anyone who's read up on the Kimball methodology for describing SCD processing - but nothing too odd. But once you get to the end of the wizard, it doesn't place just one component in the data flow, it places multiple components! There's the SCD component itself, plus some derived columns, union alls, and OLE DB commands. Some of the choices you made in the wizard are encoded in the SCD component itself, but quite a few are represented in those other components and the way they're connected to each other.
You can modify some of those properties directly as you need to (as many experts recommend you do) in order to have the process operate a little more efficiently. The drawback is that you can't change everything you might want to that way. Should you need to change a column's method of "change tracking" from "historical" (type 2) to "changing" (type 1), or should a new column be introduced to your dimension table... you have to run the SCD wizard all over again. Unfortunately, that destroys and recreates all the components that made up the processing logic - even if you've made tweaks to them. All that optimizing and customizing... destroyed in seconds!
Good news for you - the KSCD is completely editable, and doesn't espouse a scorched-earth policy on the design surface. Edit the component all you want - the worst that will happen is that columns will get added or dropped from the outputs just like you'd expect, requiring no more effort to adjust your data flow than any other component.
Performance
The other large obstacle with the SCD wizard is its (frankly) underwhelming performance. It's completely understandable once you know the design choice that was made (see below) - but that choice makes it basically unusable well before your dimensions get into the hundred-thousand-row range.
The KSCD component trades off higher memory usage in order to blow the freakin' doors off the CPU, powering through dimensions in about 1% of the time the SCD wizard takes. (That 100x speed advantage is actually conservative, in my opinion.)
My thought when I was starting this project was - if you're processing a hundred-thousand-row-plus dimension, you probably have a hundred MB of RAM you wouldn't mind using to make it go faster.
What Were The Design Choices That Made This Possible?
As I've already alluded to, the major design choices were a commitment to encapsulate SCD processing into one component only, and a tradeoff of memory versus lookups. Everything else the component has going for it, I think the SCD wizard could do - if the SSIS team had some free time on their hands.
Multiple Versus Single Components
The first choice is puzzling to me - the SSIS team could have encapsulated all of the SCD processing logic into one component had they made that decision (with time and resources to back it up - but probably not that much). Every other component in the toolbox works that way - why not the SCD? Perhaps it was a time constraint issue after all...
Lookup Versus Caching
The second choice was a significant one that constrains both the SCD wizard and the KSCD to opposite sides of a coin - there are definite consequences to each design that can't be ignored.
The SCD wizard included in SSIS takes only one input - the rows coming from your OLTP or "source" system. It uses a (limited choice of) Connection Manager to perform lookups on your data warehouse's dimension table. The benefit of this is that the resulting component operates synchronously, doesn't require any memory overhead, and might be able to process multiple "changes" per business entity in one pass. But due to the lookup implementation inside the SCD wizard, it isn't as performant or flexible as the logic in the standalone Lookup component. It's a very basic "uncached" lookup - and this means that it suffers a complete round-trip SQL query penalty for each and every row sent through the component. Think about that. Take your best possible performance profile for a round-trip lookup query, be it 20, 10 or even 5ms (!)... then multiply that by the number of rows in your OLTP table. Completely ignoring any other performance penalties with SQL query round trips, a 10ms cost per row on a 100,000 row flow is over 16 minutes. Not too horrible, but you'll soon realize how unscalable that math is, and how impervious it is to whatever big or expensive hardware you might throw at it. It doesn't scale up or out - at all.
The KSCD component uses a different approach. It takes the same OLTP "source" system input as the SCD wizard - but instead of using a Connection Manager to perform lookups, it asks you to provide another input with the entire contents of your dimension table. The benefit of this design is that the component can use the full power of however many cores your SSIS box has in order to make multithreaded comparisons of two in-memory data sets. The flipside to that benefit is that it requires memory to hold those rows in cache - and depending on how "badly" your data flow is optimized, the component could theoretically have to cache one of the input streams completely. In a worst-case scenario, for a 100,000 row dimension with a ridiculous 8K row size, that could mean almost 800MB of RAM. Again, not particularly scary - even bumping it up to a million-row dimension (of those same 8K rows!) only means 8GB of RAM. Under my (basic) tests, instead of 16 minutes for 100,000 rows or over 2.5 hours for one million, you'd be looking at times well under ten minutes, probably less than two. And seriously - if you can't tweak your data flow to cut that memory usage in half, I'll come and fire you, because you have tons of options for improving the KSCD cache's performance.
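If you want to play with the back-of-the-envelope math from the last two paragraphs, here's a tiny T-SQL sketch that reproduces it - the 10ms lookup cost and 8K row size are assumptions you'd swap for your own measurements, not numbers taken from either component.

-- Rough estimates only; adjust the assumptions to your environment.
DECLARE @Rows     BIGINT = 100000;  -- rows flowing through, and rows in the dimension
DECLARE @LookupMs FLOAT  = 10;      -- assumed round-trip cost of one uncached lookup
DECLARE @RowBytes BIGINT = 8192;    -- assumed worst-case dimension row size

SELECT (@Rows * @LookupMs) / 1000.0 / 60.0     AS WizardLookupMinutes, -- ~16.7 minutes
       (@Rows * @RowBytes) / (1024.0 * 1024.0) AS WorstCaseCacheMB;    -- ~781 MB

Bump @Rows to a million and you get the 2.5+ hour and roughly 8GB figures quoted above.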
The Sensible Way To Process Slowly Changing Dimensions
Review the pros and cons of the Kimball Method Slowly Changing Dimension component, and I think you'll find that it gives the SCD wizard, scripting, using multiple derived columns and conditional splits, and even the T-SQL MERGE statement a massive beat-down. Each one of those ways of processing SCDs has a single strength: the SCD wizard is easy to set up, scripting is flexible, using multiple components is easily maintainable, and MERGE performs great. But none of them delivers on all those aspects like the KSCD does - it's easy to set up, flexible, easy to maintain, and wicked fast. But don't just take my word for it - check out the reviews people submitted about the previous version. Version 1.5 delivers a lot more flexibility and usability on top of what they've experienced.
I would also very much like to thank the community that's grown around this project. I have had discussions with scores of colleagues online and in person that have resulted in me learning much more about "practical" data warehouse ETL processes. Those SSIS experts have spent an inordinate amount of time diagnosing problems, educating me, and fixing code with me. You have prompted almost all of the feature improvements inside v1.5 and continue to energize me about extending SSIS - thanks.
I'll be posting up some walkthroughs and more videos soon as well... stay tuned!
Hi Todd,
I have one suggestion.
Could we have an option, in the form of a property, that lets the user select whether the component should use a cache or a lookup kind of implementation? I mean, if the user has limited memory, the cache architecture makes this component a non-usable option. But if the user had a choice, then depending on how much memory the machine has, one could provide a hint to this component on how to function.
I greatly admire this component, but I think this idea might add value :)
Hi Siddharth,
Unfortunately, that's a very involved request. Implementing a lookup would remove (at least) half of the benefit of using the component - the performance. Sure, the usability of the component would remain.
I would love to be able to use the Cache Connection architecture - but... it's not documented anywhere I know of, and the internals are something I'd need to know. It also may not fit into how I manage my caches - which would be a big issue.
But overall - what kind of memory pressures are you expecting/experiencing? When I run this against a 1GB dimension, it only consumes 100MB of RAM. I didn't think that would be much of an issue.
Hi Todd,
I just stumbled on this component Friday. I've been playing with it all weekend. So far I just keep saying things like "Oh my god, this is so (cool, amazing, other positive adjective)" as I test each piece of functionality. I was wondering if you would mind (or maybe you would love it?) if I presented its use at SQL Saturday 41 in Atlanta, GA in April? I was thinking of presenting how to use the MERGE SQL statement for SCDs instead of the canned SCD, but I would love to add this to my presentation as another, probably better alternative.
--Julie Smith
Abso-frickin-lootely!
Hit me up by email through CodePlex's "contact user" form...
Glad you're liking it!
Hi Todd,
I've been playing around with this component, and I am also very happy with it!! I am especially interested in keeping track of the changes that you can also save in the dimension table (the Row Changed Reason). One question, though: I've seen the instructions on YouTube and on the CodePlex page, but can you share an example of how you go further in the flow? For example, I put different Union Alls on the different outputs, one for the rows that have to be inserted and one for the rows that need to be updated. On the insert flow I put a Destination component, and on the update flow an OLE DB Command. Do you have a better suggestion?
Thanks!
Erik
To improve performance of updates (the OLE DB Command is very slow) you should insert into a temp table, then issue an UPDATE T-SQL command from the Control Flow. Or use the Batch Update Destination or MERGE Destination from CodePlex.
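To make that concrete, here's a hedged sketch of the staging-table pattern - the table and column names are made up for illustration. In the data flow, send the component's update-type output rows to an OLE DB Destination pointed at a staging table; then, back in the Control Flow, run one set-based statement from an Execute SQL Task:

-- Illustrative only: assumes the update rows were bulk-inserted into staging.DimCustomerUpdates.
UPDATE dim
SET    dim.PhoneNumber = stg.PhoneNumber,
       dim.RowEndDate  = stg.RowEndDate,
       dim.IsCurrent   = stg.IsCurrent
FROM   dbo.DimCustomer AS dim
JOIN   staging.DimCustomerUpdates AS stg
       ON stg.CustomerSK = dim.CustomerSK;   -- join on the dimension's surrogate key

TRUNCATE TABLE staging.DimCustomerUpdates;   -- clear the staging table for the next run

The win over the OLE DB Command is that this is one set-based UPDATE instead of one round-trip statement per row.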
Todd, we love KSCD 1.5.
(1) is it correct that SCD2 can accept a delta as the source input, provided that you disconnect the Deleted output?
(2) if the SCD2 delta input contains zero rows, will the ED input halt, to save time?
(3) what's the correct approach to feeding an ETL run number into the batch: a derived column at the source, or on the output?
(4) thanks!
dave,
TYVM - I'm glad you're enjoying it.
1) Without thinking hard about it, it seems like it could work for you by ignoring the Deleted output. I didn't design it that way, so doing so makes me nervous - test that theory a lot!
2) The component has to read in all rows, but it does speed up internally a little on those edge cases.
3) The best way to feed in a "batch" or "run" number is by using the audit features. (I have to rewrite that stuff from the v1.4 docs - I unlinked them on the project.) If you have your run number in an SSIS variable, you can update a column (or several) with that value when the component detects any change (SCD 1 or 2, new row, deletion), or just when there's an SCD 1 or 2 change. I'd suggest playing around with that.
4) Thanks! You're very welcome. Make sure all your colleagues know about it - because it's just plain mean to let them continue to use the SCD Wizard ;)
Look for an update to v1.5 soon - I've been trying to speed it up some more - I hate waiting 15 minutes/day to load all my dimensions, and I think I can get it faster... and there's still one or two guys experiencing hangs that these changes should also address.
Hi Todd,
Firstly thanks for taking the time to create an improvement on the standard SCD component.
I am currently evaluating whether to use your component in a BI project. I am comparing it to the Change Data Capture (CDC) functionality exposed in SQL Server 2008.
One aspect of a solution using CDC is the ability to do partial loads of data based on the last processed Log Sequence Number. If I were to use this solution I could execute two ETL runs on the same day without causing duplication.
The way I had considered implementing similar functionality using the Kimball SCD component was to allow records to be expired within minutes or seconds of creation, so that running the process more than once in one day would not result in problems.
It appears from my (so far brief) inspection of the component that the only options for record expiry (in the SCD2 Date Handling tab) are based on whole days being the smallest division of time over which a record can be valid - for example: 'Old Ends Yesterday New Starts Today'.
How might I be able to implement something like 'Old Ends One Second Ago New Starts Now'?
Or am I perhaps barking up the wrong tree?
Hi Todd,
It looks like someone else has been pondering the same problem as I have.
They have created a modification to the SCD component that adds a new option for millisecond-level effective/expiry dates.
http://kimballscd.codeplex.com/Thread/View.aspx?ThreadId=203432
Yes - your request is a popular one - it's the most up-voted on the site so far, feel free to add your own vote for it! I'll be reviewing that thread as well as my own notes on how to implement it when I get around to it.
At present, it's not a concern for my environment - that's why it hasn't been coded yet. However, it appears to be such an easy modification to the code that I'm sure it won't be too long before I can find a few minutes to do it.
Let me know how your project turns out, RDelayed!
Can your component be connected through JDBC, with Vertica as the database?
Thanks
Annie
Ana - this component doesn't use connections (unlike the SCD provided with SSIS). It uses data flows instead, and requires at least two incoming flows - one for the "source data" (like the standard SCD component), and one for the "existing dimension" data (unlike the SCD component).
So if you can connect to either (or both) of those databases with OLE DB or ADO.Net sources, then yes, this component will work with them.
Hi Todd, Many thanks for the component.
but I am having a little trouble using it. I don't understand how it actually works - how it inserts, updates, or deletes records in my data warehouse (dimension tables). With the built-in SCD, I just run the wizard and it creates the related transformations and destinations for me.
Here is where I am confused about your component.
I have observed that your component has many outputs, like "New", "Deleted", and "Expired SCD2" - so do I need to create separate OLE DB destinations or update commands myself to update my data warehouse? If that is the case, how does it insert all rows at once, as you have mentioned here in the blog?
Looking forward to your response.
Yes, you do have to create and manage the persisting of data to your database.
I've just posted new videos on YouTube, as well as full-quality versions (so you can read the code as I talk about it) on the CodePlex project front page.
The one about the KSCD is here on YouTube, and the full-quality version is here on SkyDrive.
Hope those help.
Hi, I saw where you said that the SSIS version does a round trip to the DB for every row that comes through. Do you have anything to substantiate that? All of my tests show that it caches it somehow.
@Anonymous - Not quite sure what kind of tests you're running, but a Profiler trace should show the uncached row-by-row lookups quite clearly. But don't take my word for it: the SSIS Team Blog says the transform does not cache any lookup results from the reference dimension, so every incoming row results in a query against the database. There is absolutely no caching going on at all.
Todd,
I am not able to get any records to go to the delete output port when I know for sure there are business keys in the target table that are not in the source.
Is there anything special I need to do in the configuration?
@Anonymous - it really depends how you have the component set up. Those rows are probably being sent to the SCD2 Expired output...
If you can't figure it out, please send me an email via CodePlex.
Hi
Do you have a version for SSIS 2012?
Does the Conversion Wizard support KSCD?
What is the minimum recommended dimension table size to start seeing a performance win from KSCD?
No version for 2012 yet, but I plan to have that in the next few months as I prepare to move to SQL 2014 (I'll add 2012 and 2014 versions).
If you use DMSCD (no longer KSCD) in 2005 projects, and open them on a system with 2008 or 2008R2 BIDS/DMSCD installed, the package will upgrade successfully.
I'd use the DMSCD for any table size - because any difference with small tables is negligible compared to the re-use, editing, and other reasons to use the DMSCD. But I'm biased :)