Pages

Sunday, August 3, 2014

Imposing Sort Order on Already Sorted Rows in SSIS

There are times when you know a data flow is sorted in Integration Services, but SSIS doesn't know this fact.  For example, you may have data coming out of a Derived Column or synchronous Script where you can guarantee a certain sort order based on how you're specifically processing the rows.  (Synchronous components modify data in the rows without changing row order, so they're not permitted to change the IsSorted or SortKeyPosition properties.)  Without having SSIS know how the data is sorted, there are limitations to what you can do with it.

What Limitations?

The most common impact is that you can't use a Merge Join (or the lesser-used Merge).  SSIS will require that you throw a Sort component into the flow in order to use a Merge Join.  Of course, doing that is worse than useless... the data is already sorted, and placing a Sort component in the flow causes all of the rows to be accumulated and stored in-memory before the flow can continue!  You're not exactly going to get great performance out of that arrangement.

So How Do I Deal With That?

What can you do to optimize your data flow?  Typical suggestions include pushing the rows into SQL Server (or other persistent location) and reading it again.  Since all Source components are asynchronous, they allow you inform SSIS of a sort order by using the Advanced Editor.  The drawback of that advice is the expense (network or disk I/O) of persisting the data.

Is there another option? 

In fact, there is.  We can use an asynchronous Script to our advantage with much lower performance impact than either a Sort component or persisting data.  Yes, there will be a performance hit - not so much because of the "asynchronous" attribute of the Script, but because of the COM interop and memory copying.

The Details

Unfortunately, you'll have to set up an asynchronous script component.  This means you'll have to define columns on the Output that match what's on your Input (because you want exactly the same columns out as come in).  Unfortunately, this means your metadata suffers a break here, and if any column definitions change, you have to update this component manually.
In order to create an asynchronous script, you have to change the SynchronousInputID of Output0 on the Inputs and Outputs tab of the script component editor to "None".  Once you do that, you'll see a new property available called IsSorted.  Set this to true.
We're not done, but the meat of the script is really quite simple.  Select all of the columns on the Input Columns tab with checkmarks.  Then open the script code editor.  Inside the Input0_ProcessInputRow method, use code like this:
while (Row.NextRow())
{
    Output0Buffer.AddRow();
    Output0Buffer.Column1 = Row.Column1;
    Output0Buffer.Column2 = Row.Column2;
    Output0Buffer.Column3 = Row.Column3;
    // ...
}

Of course, I'm certain that your column names will be more imaginative, and practical, than mine.

One Further Note on Performance

I did say that this will have a performance hit because it's jumping over the COM interop layer.  The net effect of this (if your data flow is otherwise fast) will be to pin a single core, as SSIS will allocate one thread to perform all the duties of this script.  If you have cores to spare, you may want to try the Balanced Data Distributor, or my poor-man's BDD in order to drive more parallelism...

No comments:

Post a Comment