Tuesday, September 28, 2010

Performing a Cross Join (Cartesian Product) in SSIS

There are a few times where you might want to perform a cross join/cartesian join/cartesian product operation in SSIS.  One of those scenarios could be performing some kind of a double-sided "range join" where you can't line up a specific key in your data flow to a specific key in your lookup table.  Quite often you're planning to  have a Conditional Split right after your cross join operation to heuristically eliminate the rows that don't qualify for your "range" matching.
Warning - performing a cross join is an inherently expensive endeavour!  It's called a cartesian product for a reason - you're multiplying rows.  Every row from one source will be replicated for each and every row from the other source.  If you have 100 rows from each source, your cartesian product will result in 100x100 = 10,000 rows.  This can quickly get to be a rediculously large number.  A source of 200,000 rows joined to a 3,000 row table - quite a reasonably small "range lookup" scenario - will result in 600,000,000 rows.  Don't say you haven't been warned about performance - in terms of processing power AND memory.
The obvious component you'll try to get this functionality from is the Merge Join - but it only supports Full Outer joins, Left Outer joins, and Inner joins.  There are no other components that offer join capability in SQL Server Integration Services, so how can you perform this kind of join?
Cross Join Pattern #1
By using several components together, you can replicate a cross join result.  The first method I'll describe will work with any two data flows, from any data source, at any point in the data flow.  It won't perform spectacularly - but then I already warned you about the dangers of performing cross joins, didn't I?  It's actually quite simple.
I've chosen to (quite uselessly) cross join SQL Server's sysobjects and sysindexes tables.
Step 1 - Add an Artificial Key Column
Drop Derived Column components onto your design surface - one per flow.  In each of those Derived Columns, add a brand-new column called "join_key" and give it a value of 1.  (You can call your column whatever you like, and give it whatever value you like - as long as you give the column in each flow the same value.)
Step 2 - Sort on the Artificial Key
Now drop two Sort components on your surface, attaching them to the outputs of the Derived Columns, and configure them to sort on the artificial key.  In my case, I set them to sort on "join_key" (of course).
Step 3 - Perform a Full Outer Join
Place a Merge Join component on the design surface, attach both flows to it - the inputs you attach them to don't matter at all.  Configure the component to perform a Full Outer Join.  Select the columns you want from each input - and be careful to label them appropriately as they exit the component.  You do NOT need to select the artificial key column - I've left my "join_key" unselected.
The Result
As you can see, my queries of sysobjects and sysindexes have been cross joined to each other.  Individually, they have very few rows each, but the cross join makes that row count explode.

Cross Join Pattern #2
This pattern performs a little better than #1 - but it has more restrictions on it.  This technique can't be used anywhere in the data flow - it has to be used at the top.  It can also only be used on sources where we can ask for artificially generated columns - so it's not going to work on text files.
Step 1 - Add an Artificial Key Column
We still have to do this - but we do it a different way this time.  First, we have to add the join_key to the source query.  We don't have to issue an ORDER BY operation in the query (not that SSIS would pay attention to it), because the join_key is a constant.
Step 2 - Sort on the Artificial Key
Like we said before, we don't actually need to sort the data - we just have to let SSIS know that it is sorted, otherwise it won't let us use the Merge Join component to merge the flows.  In order to do that, we have to right-click on the OLE DB Source, and select Advanced Editor.  Change to the Input and Output Properties tab, select the "OLE DB Source Output" node, and find the IsSorted property.  Set that property to true.
Next, expand the "OLE DB Source Output" node, expand the "Output Columns" node, select the "join_key" node (or whatever column name you used), and find the SortKeyPosition property.  Set that property to 1.  (Click on the images to get larger, readable copies.)
Step 3 - Perform a Full Outer Join
Place a Merge Join component on the design just like in pattern #1, and you've got your cross join set up.

Thursday, September 23, 2010

A Busy Fall Now Includes Speaking at PASS Summit

Fall is looking pretty exciting from where I'm standing. I've got a few travel events lined up, a couple projects nearing completion, and the anticipation/dread of MVP renewal day on October 1st.
Events
I've got an internal company training event mid-October where close to our whole IT team gets together to share where we're at, as well as expose each other to what future projects are close to coming online.  It's not a big team, but we're spread out across the continent, so actually seeing people face to face isn't a usual occurrence.
The guys that work on our internally-developed ERP system have been doing some excellent UX work recently, so I'm looking forward to having them show some of those features as well as what they've got in the pipeline.  It won't be this year, but the time when we move away from ISAM to a real RDBMS is coming too.  We haven't started evaluating, but SQL Server is (of course) in the mix, and likely a Postgres variant too.  But when you have to live with a badly supported ODBC ISAM driver, just about any database vendor will do!
We've also (finally) got some new server hardware coming online which is carrying with it a bunch of other goodies.  We run a Citrix remote environment and will be moving to XenApp 6 on the new hardware.  The move should also include an upgrade to Office 2010 if internal app remediation projects continue to exhibit positive results.  I'll be explaining the changes and new features in Office 2010 - and this provides an opportunity to expose people to PowerPivot to see if it can generate some interest.  I'm pretty sure that the server-side management framework will be appealing to help control the spreadsheet sprawl.  Although if it does, that will mean somebody will have to bone up on SharePoint, which we haven't deployed nicely yet.  (Is an "elegant Sharepoint deployment" an oxymoron?)
Early November is jam-packed full of fun.  I'll be at SQL Connections in Las Vegas presenting a trio of SSIS ETL sessions - if you're going to be there, please introduce yourself!  It's the first time I've gone to Connections, and it will be very interesting to compare it to...
...The PASS Summit that's in Seattle the next week.  In the spring, I sent in the maximum number of proposals to get a session slot in the main conference, but failed to make the cut.  But failure is supposed to be just another opportunity to try again, so I consoled myself that there was always next year.  Then Jeremiah Peschka announced that PASS would open some slots from the traditionally Microsoft-only Chalk Talk segment of sessions to community submissions.  Unfortunately, they didn't mention a limit on the number of applications, so I believe my attempt to overwhelm them ended up overwriting my initial abstracts.  The good news is that they happened to like one of the topics that did stick, and I'm now officially a speaker at the seminal event of the SQL Server world.  I'll be hosting a discussion on debugging Integration Services packages - in development and in production.  I've got a few techniques, and anticipate some suggestions and advice from attendees on the topic to make it really fruitful.  At least - I hope to have an audience there, otherwise the one-sided "dialogue" will result in me being carted off to the Buck Woody wing of whatever sanitarium Microsoft endowed for him in the Seattle area.  So for the legions of followers one fan of this blog, I do expect to see you there.
Projects
I have a few company projects nearing internal (informal) milestones - most notably I've set (and plan to reach) targets on my data warehousing initiative.  It's taken far too long to get to where it is, but I think I can finally start to see... the real start of the project :)
But perhaps most exciting is that it's only a matter of time (like always) until I get version 1.6 of the Kimball Method Slowly Changing Dimension component for Integration Services released.  I've already got an alpha up on CodePlex that I've got some feedback from - many thanks to those that took the risk.  I'm much more satisfied about the internal thread management behaviour, even though I'm dissatisfied with the internal code organization of it.  I would really like to get this finished up in October so that I have something cool to talk about when people ask me "what do I do" at the conferences.  (Well - at least I'll think it's cool.)
MVP Renewal Day
You may or may not know that I'm reaching the anniversary of my year as a Microsoft Most Valuable Professional.  You may or may not know that this is an annual award, and continued recognition isn't a given - it's an honour that you have to earn anew.  For that, I'm very grateful, because it means a continual high quality of resources that are easily recognizeable to the community.
There are four renewal days each year - one per quarter - and over the last year I've seen some remarkable professionals receive their due with this award.  (I'm certain I'll miss people in this list, but) Tim Mitchell, Ted Kreuger, and Nitesh Rai were new recipients over the last year, demonstrating the high bar of contribution and expertise required to be awarded.  I hope I make that bar and get renewed, because it sure does help my credibility (with people that don't otherwise know me) if I can keep those three letters after my name.
One of the influential people within my organization is fond of saying that he "doesn't want to belong to any group that will have him."  His intent is to express that he's striving to achieve membership in groups that don't yet think he's good enough - and I certainly subscribe to that mentality.  However, being an MVP isn't quite the same as an exclusive country club membership or appearing on Page Six.  I can say wholeheartedly that I'm very fortunate to have been included in its ranks - if even for just a year.

Monday, September 6, 2010

Parallelism in SSIS - Multiple Lookups

This question just came up in the forums, and it highlights how difficult it can be to try to optimize Integration Services.  It's difficult because the SSIS concept of the Data Flow - an in-memory pipeline of buffers - isn't the paradigm that a great many of its users are familiar with.
Specifically, the question assumes that the Lookups occur serially, and somehow there must be a better arrangement to have them occur in parallel.
The Assumption
It's a good question to ask.  Just about any question is because they all come with answers, sometimes more than one.  On its face, it seems reasonable to think that manually arranging the data flow in a parallel manner will allow Integration Services to process the lookups at the same time.  Especially because (as the package developer) you know that the lookups aren't going to touch the same columns - and Integration Services can't be expected to know that, can it?  Once we're done doing the parallel lookup, we'll just stitch the results back together, easy-peasy, right?
Implementing Manual Parallelism
So what does that look like?  Something like this to start...
And then you have to merge the results...

Oh - I caught that.  I see you questioning our premise!  Merge Joins aren't very efficient, are they?  Will the gains from manually parallelizing the lookups be washed out by the performance penalty associated with merging the datasets back together?  Does it depend on how many lookups are being done?  Or how big the rowsets are?
How Do You Spell "Ass u me"?
Perhaps a little harsh - there's no need to tarnish an inquiring mind!  But the moral behind that statement remains - we're assuming that Integration Services needs our help to parallelize the workflow.  Fortunately not!  We also assumed that "stitching the results back together" will actually be fast and easy.  Far from it as we've seen - but that ends up being pretty irrelevant in the end.
Sequential Parallelism - An Assembly Line
That makes no sense, does it?  Maybe not - but it's the best way I could describe what Integration Services is actually doing.  I've found that comparing the data flow pipeline to an automotive assembly line can be very enlightening.  One of the great achievements of the industrial age was the development of the assembly line - the paradigm that allowed the product to be worked on by different people at different stations, rather than a single worker performing multiple tasks on a single product.  The efficiencies gained from being able to train workers more thoroughly at specific tasks, and being able to "tune the workflow" by placing more resources on the "slower" steps allows modern manufacturers to smoothly push a consistent flow of product out of their plants.
A strength of the assembly line process is that almost any one step can be parallelized fairly easily.  If step #10 in a process takes twice as long as the other steps, simply set up two stations to do the work, and split the "cars" between those two paths.  Another strength is that the plant doesn't have a lot of inventory "tied up" in the manufacturing process - if there are one hundred steps in manufacturing the car, then there are only one hundred cars under construction inside the plant (disregarding parallel optimizations).  Quite similar to that strength is that it doesn't take that long (or waste that much product) to finish the first car - it only takes as long as the sum of the steps.  (In the alternative workflow - a batch approach - it could take much longer to get the first product out the door, and mean lots of in-process chassis in the workflow.)
Those same strengths are achieved in the Integration Services data flow.  For example, a Derived Column transformation (or any of the non-blocking synchronous transforms) can easily be parallelized - multiple threads can process different buffers at the same time, coordinating amongst themselves.  Similarly, non-blocking transformations mean that the buffers can get passed from transform to transform efficiently, such that there aren't that many buffers (memory) occupied at any one time.  Finally, since we're working with buffers, it doesn't take all that long to get the first results out of the pipeline and into the hands of the storage engine to "put away" in the finished car lot.
Enough About Cars - What's Going On With The Data?
Quite simply put - the design surface of Integration Services is lying through it's teeth to you.  It may very well look like your data flow is sequential in nature, and that "Lookup A" has to go through all the data before "Lookup B" touches anything - but that's not the case.  You should immediately see through this fiction when you run your packages, and see something like I have pictured to the right.  You'll see that all of those lookups are being executed (because they're yellow).  If you look closely at the row counts shown between the lookups, you'll see that Lookup A has (approximately) ten thousand more rows input to it than it's output, and that this pattern continues through the next lookups.  What this is telling you is that as soon as Lookup A has finished it's lookup operation on one of the buffers (about 10,000 rows) it passes the buffer on to the next lookup.  It's not what I'd call "pure" parallelism - but it's definitely most of the way there.
Compare to the Manual Method
Let's take a snapshot of what's going on in the "manual" method.  A quick look at the execution snapshot shows that all five lookups appear to be operating on the same buffer.  That's fantastic!  It definitely is a step towards better parallelism.
Unfortunately, it comes at some cost.
The cost doesn't happen to be "buffer copies" by the Multicast component.  The Multicast doesn't actually copy any data - it's another Data Flow designer fiction.  It only makes you think there are multiple copies, when in fact it's using the same memory for five operations at the same time.
The cost comes in the Merge Joins.  They aren't "inefficient" by any means - it's just that they're not as efficient as not using them at all.  In fact, the performance gain we may see from manually parallelizing the lookups disappears - and then some - due to their use.
Execution Results
My tests using this package on my dual-core machine with plenty of headroom showed pretty conclusive results.  The apparently unoptimized "standard" arrangement of doing lookups was more than SEVEN TIMES faster than the manually parallelized arrangement.  That's correct.  I pushed ten million rows through five lookups in just over 33s using the standard arrangement compared to the manual rearrangement runtime of 241s.  Based on the CPU usage profiles of the runs, I would expect a significantly wider gap in runtimes in "real-life" scenarios where your server has four or more cores.  All CPU cores were maxed out in the "standard" arrangement, but hovered between 55% and 60% for the majority of the manual test - of which 5-10% was consumed with running my other desktop apps.  Those CPU results clearly demonstrate that attempting to manually parallelize a series of lookups actually results in serialization.  Brutal serialization, in fact.
There is Still Room for Improvement
So what if you're still dissatisfied with the performance of your Lookups, and you were hoping "parallelizing" them would help you?  I suggest you try tweaking your Lookups individually - adjust the caching behaviour, and analyze how you've indexed the underlying lookup tables.  I also suggest you see the many articles from reputable sources about improving Lookup performance:
One thing you'll note about all of those articles is that none of them suggest attempting to manually parallelize the data flow like I demonstrated above.  Once you understand the internals of the Data Flow (which isn't easy) - it's one of those ideas that just dies in your brain because you know it no longer makes any sense at all.

Wednesday, September 1, 2010

Set ComponentMetaData.Version in ProvideComponentProperties

This post is for a fairly small audience (perhaps only myself) and applies to those creating custom components for SQL Integration Services.  The SSIS API provides a reasonably nice way of handling upgrades to components through the PerformUpgrade method.  If you're not sure how to do that, I suggest reading Using PerformUpgrade.
However, there's one issue that I constantly run into that I didn't mention in that post.  For whatever reason, when you drop a completely new component onto the design surface, SSIS does not save the correct version information into the DTSX.  To be clear, we're talking about an internal versioning system here.  This has nothing to do with assembly versioning.  We're talking about how two version numbers interact - what I call the "runtime" version number, and the "metadata" version number. 
The Runtime Version
The runtime version number is the version number that's encoded inside your custom component's assembly - it's the number you've hardcoded in the DtsPipelineComponentAttribute at the top of your custom component's class. This version number corresponds to the way your component reads and writes properties to ComponentMetaData's CustomPropertyCollection.  "Version 1" of your component will expect a different set of properties, or a different way of interpreting the information in those properties that isn't compatible with what you expected in "version 0."
The Metadata Version
The metadata version number is the version number stored in the DTSX file.  This version number gets written by SSIS when it saves your component's custom properties.  This version number should correspond to the stored format of the custom properties, and indicate that a component of the same version can interpret them properly.
The Versioning Process
When you create a brand-new custom component, you'll likely start off with a version number of zero in your DtsPipelineComponentAttribute.  Placing a component on the design surface will result in a version of zero being written to the DTSX.  Everything works well.
As you work on your component, the time will come when you need to alter how it stores information in its custom properties.  If you simply change how your component expects these properties to look without coding a PerformUpgrade, your existing packages will almost surely break, and you'll have to open them up, delete all instances of your component, add them back to the data flow, and configure them from scratch.  If you implement a PerformUpgrade, you should be able to "read in" the old properties, interpret them, and "write back" the new format to ComponentMetaData such that the current version of your code will understand them.
The Problem
Unfortunately, a nice feature was missed - so much so that I'd call it a bug. 
If your code currently has its DtsPipelineComponentAttribute set to a non-zero version, and you have a PerformUpgrade override, you'll see something you probably don't expect when you drop an instance on the data flow and attempt to edit it.
Attach a debugging session to BIDS and place a breakpoint on PerformUpgrade.  Drop a new component on the data flow and edit it.  What just happened?  Yes - you hit that breakpoint.  Why?  There's no upgrade necessary, is there?  According to SSIS, there is.  If you examine the ComponentMetaData.Version property, you'll see it's set to zero - not to the value you've set on your DtsPipelineComponentAttribute.
The Workaround
No, I didn't file this on Connect like I should have - but I have brought it to the team's attention.  Whether it gets fixed or not based on that lame communication is on me - sorry about that.  Here's what you need to do... add the following code to ProvideComponentProperties for every custom component you make:

DtsPipelineComponentAttribute componentAttribute
=
  (DtsPipelineComponentAttribute)Attribute.GetCustomAttribute(this.GetType(),
    typeof(DtsPipelineComponentAttribute), false);
ComponentMetaData.Version = componentAttribute.CurrentVersion;


This code explicitly writes your "current" version number to the DTSX.  If you watch your code with the debug session again, you'll now see that SSIS doesn't call PerformUpgrade when you edit a brand-new component anymore.