The Balanced Data Distributor (BDD) is a component that works much like a blend between a Multicast and a Conditional Split. It splits the incoming data across several outputs in roughly equal shares, using a very efficient internal method. This allows you to copy a part of your data flow once, twice, or more times - and have the BDD help you parallelize the execution. One of the downsides - as with any custom component - is that you need to install it on each machine that will run the package. That can be a little problematic in some environments.
Simple Doesn't Work
You can achieve something that looks similar by using a Row Numbering type of transform and a Conditional Split. The Conditional Split would divide the rows by some function of the row number, allowing different threads to process the rows. The problem with doing so is that the division of rows isn't reliably made on a "buffer boundary" - even if you think you've crafted it that way. You can't specify buffer sizes; the execution engine chooses them dynamically. And the Conditional Split doesn't let you write an expression that tells it to pass one complete buffer out output #1 and the next complete buffer out output #2. The result of using a Row Number plus Conditional Split might be a slight increase in parallelism - but not an efficient one.
A Little More Work Does
Here's how to make that split much more efficient - perfectly along buffer boundaries - and get almost all of the benefits of the BDD without having to install it on all your systems.
The "poor man's" BDD has 5 parts:
- A Derived Column - to create our "split" column (called BufferNumber)
- A Script - to fill the "split" column
- A Conditional Split - to split the buffers
- Space for your parallel parts
- A Union All (or series of Merges) - to serialize the flow again
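The Script in step 2 stamps every row in a buffer with the same number, then bumps the counter for the next buffer:
    // Counts the buffers this component has seen so far.
    private int _bufferCounter = 0;

    public override void Input0_ProcessInput(Input0Buffer Buffer)
    {
        // Stamp every row in this buffer with the same number...
        while (Buffer.NextRow())
        {
            Buffer.BufferNumber = this._bufferCounter;
        }
        // ...then move on to the next number for the next buffer.
        this._bufferCounter++;
    }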
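If you want to convince yourself of the difference outside of SSIS, here's a minimal standalone sketch. It's plain C# console code, not the SSIS API: the class name BufferSplitDemo, the method OutputsTouched, and the buffer sizes are all made up for illustration. It simulates a stream of unevenly sized buffers, tags each row either by a running row number or by a running buffer number, and counts how many outputs each buffer would be scattered across when routed by "tag % flows".

```csharp
using System;
using System.Collections.Generic;

public static class BufferSplitDemo
{
    // For each simulated buffer, count how many distinct outputs its rows
    // land on when routed by (tag % flows). A count of 1 means the buffer
    // stayed whole; anything higher means it was split across outputs.
    public static List<int> OutputsTouched(int[] bufferSizes, int flows, bool tagByBuffer)
    {
        var result = new List<int>();
        int rowNumber = 0;      // running row counter (the "Row Number" approach)
        int bufferCounter = 0;  // running buffer counter (what the script does)

        foreach (int size in bufferSizes)
        {
            var outputs = new HashSet<int>();
            for (int i = 0; i < size; i++)
            {
                int tag = tagByBuffer ? bufferCounter : rowNumber;
                outputs.Add(tag % flows);
                rowNumber++;
            }
            bufferCounter++;
            result.Add(outputs.Count);
        }
        return result;
    }

    public static void Main()
    {
        int[] sizes = { 5, 7, 4 };  // hypothetical buffer sizes
        Console.WriteLine("By row number:    " + string.Join(", ", OutputsTouched(sizes, 3, false)));
        Console.WriteLine("By buffer number: " + string.Join(", ", OutputsTouched(sizes, 3, true)));
    }
}
```

With these made-up sizes, tagging by row number spreads every buffer over all three outputs, while tagging by buffer number keeps each buffer on exactly one output. Real buffers are much larger, but the pattern is the same.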
Step 3's contents depend on how many parallel flows you're making. And, unlike the BDD, we have to coordinate the contents of the Conditional Split with the parallel flows - it doesn't do the split automatically. So, let's say you want three (3) parallel flows. You'll need three outputs from the Conditional Split, which means two expressions (leaving the remainder to flow out the default). The first expression should be "(BufferNumber % 3) == 1". The second should be "(BufferNumber % 3) == 2". Your Conditional Split should now have three outputs for you to hook up to identical copies of a particular data flow you're looking to parallelize. Then use step 5 to join the flows back together.
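The same pattern extends to any number of flows: N flows need N-1 expressions, with the "% N == 0" buffers falling through to the default output. Here's a small sketch that prints the expressions to paste into the Conditional Split. It's a convenience helper of my own for illustration, not anything SSIS provides; the names SplitExpressions and Build are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class SplitExpressions
{
    // Returns the N-1 Conditional Split expressions for N parallel flows.
    // Buffers where (BufferNumber % N) == 0 are left for the default output.
    public static List<string> Build(int flows)
    {
        if (flows < 2)
            throw new ArgumentOutOfRangeException(nameof(flows), "Need at least two flows.");

        var expressions = new List<string>();
        for (int k = 1; k < flows; k++)
        {
            expressions.Add($"(BufferNumber % {flows}) == {k}");
        }
        return expressions;
    }

    public static void Main()
    {
        // Three flows, as in the example above: prints the two expressions.
        foreach (string expression in Build(3))
        {
            Console.WriteLine(expression);
        }
    }
}
```

For three flows this prints "(BufferNumber % 3) == 1" and "(BufferNumber % 3) == 2", matching the two expressions described above.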