Pages

Sunday, August 3, 2014

Poor Man's Balanced Data Distributor

If you're looking to increase the performance of your SSIS packages like I am, (and you should be,) then you'll know about the Balanced Data Distributor component that Microsoft released a couple years ago. 
The Balanced Data Distributor is a component that works much like a blend between a Multicast and a Conditional Split.  It splits the incoming data into several outputs, roughly equally, and with a very efficient internal method.  This allows you to copy a part of your data flow once, twice, or more times - and have the BDD help you parallelize the execution.  One of the downsides - as with any custom component - is that you need to install it on each machine that will run the package.  That can be a little problematic in some environments.

Simple Doesn't Work

You can achieve something that looks similar by using a Row Numbering type of transform and a Conditional Split.  The Conditional Split would divide the rows by some function of the row number, allowing different threads to process the rows.  The problem with doing so is that the division of rows isn't reliably made on a "buffer boundary" - even if you think you've crafted it that way.  Buffer sizes can't be specified by you, they're dynamically chosen by the execution engine.  Your Conditional Split doesn't allow you to phrase an expression that will let you tell it to pass a complete buffer out output #1, and the next complete buffer out output #2.  The result of using a Row Number plus Conditional Split might be a slight increase in parallelism - but not an efficient one.

A Little More Work Does

Here's how to make that split much more efficient - perfectly along buffer boundaries - and get almost all of the benefits of the BDD without having to install it on all your systems.
The "poor man's" BDD has 5 parts:
  1. A Derived Column - to create our "split" column (called BufferNumber)
  2. A Script - to fill the "split" column
  3. A Conditional Split - to split the buffers
  4. Space for your parallel parts
  5. A Union All (or series of Merges) - to serialize the flow again
It's steps 2 and 3 that need the explanation.  Step 1 is pretty self-explanatory, especially if you know my aversion to creating columns in a Script (the interface is so opaque).  Step 2 should be a "transform" script, as you'd expect - but we will not be using the pre-supplied ProcessInputRow method.  Instead, paste this code inside the ScriptMain class:

private int _bufferCounter = 0;
public override void Input0_ProcessInput(Input0Buffer Buffer)
{
    while (Buffer.NextRow())
    {
        Buffer.BufferNumber = this._bufferCounter;
    }
    this._bufferCounter++;
}

Step 3's contents depend on how many parallel flows you're making.  And, unlike the BDD, we have to coordinate the contents of the Conditional Split with the parallel flows - it doesn't do the split automatically.  So, let's say you want three (3) parallel flows.  You'll need three outputs from the Conditional Split, which means two expressions (leaving the remainder to flow out the default).  The first expression should be "(BufferNumber % 3) == 1".  The second should be "(BufferNumber % 3) == 2".  Your Conditional Split should now have three outputs for you to hook up to identical copies of a particular data flow you're looking to parallelize.  Then use step 5 to join the flows back together.

Step 2 Explained

Why use a script this way?  What's that method you pasted in, and why use it instead of ProcessInputRow?  Because of how the internals of SSIS, and the script component work.  Typically, you're presented with ProcessInputRow - a nice, useful method for doing things with rows one by one.  And really, that's pretty much all you are ever concerned with in a script - or any SSIS component - because trying to do things across multiple rows isn't particularly what it's designed for.  However, inside the script component, if you look at the generated code behind what you're presented with, you'll see that there's a method called Input0_ProcessInput that calls ProcessInput in a loop, once for each row that it's presented.  However, you have to know some SSIS internals to know that Input0_ProcessInput is called once per buffer.  Now you know why we override that default implementation with our own.  We're labeling each buffer with a number, perfectly identifying which rows belong to which buffers, so that we can route rows in buffer-sized chunks, whatever that size may be.

Step 3 Explained

This is an explainer for those that don't know what modulus is.  If you do, never mind.  In SSIS' expression language, like many other languages, the "percent" sign means "modulus."  This essentially means that you want to divide the two numbers, but instead of getting the quotient, you want the remainder.  So for our example, since we wanted three parallel flows, we wanted to know what the remainder of the BufferNumber divided by 3 was.  That answer in the expression will cause one, and only one of the outputs to be true.  If the BufferNumber is evenly divisible, the modulus will be zero - nothing left over - in which case the rows will go out the default output.


No comments:

Post a Comment