Something occurred to me today as I was working on optimizing my Kimball Method SCD component for SSIS. The component caches rows from two inputs - much like a Merge Join - until it can match rows between the sources. But unlike the Merge Join, the algorithm inside the component works whether or not the sources are sorted. (Vote for this Connect issue if you think the Merge Join's requirement for sorted inputs should be changed.) If the inputs are sorted, though, there are some additional things the component can do to process the rows faster.
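To make the caching idea concrete, here's a minimal sketch of that match-with-caches approach. It's purely illustrative Python - the real component is a compiled SSIS pipeline component, and these names and structures are my simplification, not its internals.

```python
# A minimal sketch of the cache-and-match idea, assuming (key, payload)
# tuples and unique business keys per input. Illustrative only.
from itertools import zip_longest

def match_unsorted(source_rows, existing_rows):
    """Pair rows from two *unsorted* inputs on a business key."""
    source_cache = {}    # Source System rows still waiting for a partner
    existing_cache = {}  # Existing Dimension rows still waiting for a partner

    # Interleave the inputs roughly the way a data flow delivers buffers.
    for src, exi in zip_longest(source_rows, existing_rows):
        if src is not None:
            key, payload = src
            if key in existing_cache:
                yield key, payload, existing_cache.pop(key)  # matched - release the cached row
            else:
                source_cache[key] = payload                  # hold until the other side shows up
        if exi is not None:
            key, payload = exi
            if key in source_cache:
                yield key, source_cache.pop(key), payload
            else:
                existing_cache[key] = payload

    # Anything left never found a partner: brand-new rows, or deleted/expired rows.
    for key, payload in source_cache.items():
        yield key, payload, None
    for key, payload in existing_cache.items():
        yield key, None, payload
```

The point of the sketch is simply that the caches only shrink when a partner row arrives - which is exactly what sort order can influence.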
The Issue
In one particular dimension of mine, the business keys are somewhat "time sensitive". Like all dimensions, the Source System only contains the "current" version of the business entity - but in this case, part of what makes the entity "current" happens to be encoded in the business key itself. I'm talking specifically about an invoice dimension, where the invoice number tends to be (but isn't always) a number that increases over time. When an invoice is old enough (and completely paid), it gets purged from our Source System. That action gets captured by the KSCD component (unlike the SCD Wizard) by marking the row as "expired."
The Hypothesis
Since I was already taking advantage of some optimizations in the component related to sorting, it occurred to me that I might want to try sorting in descending order instead, with the invoice number moved to the most significant sort position. I posited that if I did that, the most current rows would tend to come in from both inputs first, leaving the older, expired records from the dimension table to come in last. I expected two effects: the size of my internal cache would be reduced, and I'd be able to make more decisions sooner. The caches would be smaller because I wouldn't have to hold on to as many rows from the Existing Dimension input for as long, waiting for potential matches to arrive from the Source System - and a smaller cache is also cheaper to manage. The second expected benefit was that all rows from the Source System would be matched up very early in the process, allowing the component to take a significant shortcut: the rest of the older, expired Existing Dimension rows could simply be pushed out the Unchanged output.
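Here's a rough sketch of that shortcut, again in illustrative Python with made-up names (route_existing_row, send_unchanged and so on are not the component's real internals) - just my reading of the idea described above.

```python
# Hypothetical sketch of the shortcut: once the Source System input has
# ended, an Existing Dimension row that has no cached source partner can
# never be matched, so it bypasses the cache entirely.
def route_existing_row(key, payload, source_input_finished,
                       source_cache, existing_cache,
                       send_unchanged, send_matched):
    if key in source_cache:
        send_matched(key, source_cache.pop(key), payload)
    elif source_input_finished:
        # Shortcut: no source row can ever claim this key now - the dimension
        # row is untouched and can go straight out the Unchanged output
        # without being cached.
        send_unchanged(key, payload)
    else:
        existing_cache[key] = payload  # still possible that a match will arrive
```

With the descending sort pushing the current invoices to the front of both inputs, the hope was that the `source_input_finished` branch would kick in for the long tail of old, expired dimension rows.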
The Test
The first obstacle that came up for me was the concept of a "descending" sort order - I'd never really worked with it before in the context of Integration Services. I was used to hacking at the OLE DB Source's output columns' SortKeyPosition property to reflect the ORDER BY I'd used in my SELECT statement... but where was the property to indicate that the order was descending? A quick examination of the output metadata of a Sort transform educated me on that - descending sort orders are represented by negative numbers. (Search for "negative" on this BoL page for details.) Proof of the axiom that you learn something new every day.
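For illustration, here's how that signed-SortKeyPosition convention could be honored when comparing rows - a Python sketch of my understanding (the absolute value gives the column's position in the sort, the sign gives the direction), with invented column names.

```python
# Sketch of the convention as I read it from the Sort transform's metadata:
# abs(SortKeyPosition) is the column's position in the sort order, and a
# negative value means that column is sorted descending. Illustrative only.
from functools import cmp_to_key

def build_row_comparer(sort_key_positions):
    """sort_key_positions: {column name: signed SortKeyPosition}, 0 = not sorted."""
    ordered = sorted(
        (abs(pos), name, -1 if pos < 0 else 1)
        for name, pos in sort_key_positions.items() if pos != 0)

    def compare(row_a, row_b):
        for _, name, direction in ordered:
            if row_a[name] != row_b[name]:
                return direction if row_a[name] > row_b[name] else -direction
        return 0

    return cmp_to_key(compare)

# ORDER BY InvoiceNumber DESC, LineNumber ASC would be represented as
# {"InvoiceNumber": -1, "LineNumber": 2}:
rows = [{"InvoiceNumber": 7, "LineNumber": 2}, {"InvoiceNumber": 9, "LineNumber": 1}]
rows.sort(key=build_row_comparer({"InvoiceNumber": -1, "LineNumber": 2}))  # invoice 9 first
```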
The next obstacle was realizing that my component - much like myself - had no idea that descending (negative) sort orders existed. In fact, as it was coded, it didn't even register a problem, and would have merrily processed the data improperly[1]. Once I'd fixed that up, I was off to the races to test my hypothesis.
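In the same spirit as that fix, a component can at least fail loudly when it sees sort metadata it doesn't understand. A trivial, hypothetical sketch:

```python
# Hypothetical validation sketch: complain up front about sort metadata the
# matching logic can't honor, instead of silently producing wrong results.
def validate_sort_metadata(sort_key_positions, handles_descending):
    for name, pos in sort_key_positions.items():
        if pos < 0 and not handles_descending:
            raise ValueError(
                f"Column '{name}' is sorted descending (SortKeyPosition={pos}), "
                "which this matching logic doesn't support.")
```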
The Results
Through very unscientific means (only one run in each configuration), I did find what I was expecting - the component had to cache fewer rows during processing, and was able to take greater advantage of its optimizations. The net result was a faster execution with lower memory usage - a win/win. For hard numbers: a 180,000-row Source System set processed into a 4-million-row dimension table with a 32-byte business key, on a dual-socket, four-cores-per-socket (8 cores total), 6GB, 64-bit machine, took under 9 minutes.
If you're looking to tweak a little more performance out of your Kimball Method SCD component - you might want to consider running some tests on different sort orders...
[1]A side-effect of this investigation was discovering that the component wasn't taking advantage of some other optimizations, despite my best intentions. Look for more on that issue as I dig into it...