Monday, August 30, 2010

Convert Several Columns in SSIS With Fewer Clicks and Confusion

I've had this one in my toolbox for a while now, decided to publish it on CodePlex a while back, and am now getting around to blogging about it.  If you've seen this Connect submission, you've got an idea as to why I've constructed it.  Jamie Thomson (blog|twitter) and I had a good discussion on the topic there.  It helps me solve the issues I regularly face converting batches of columns - usually between Unicode and Non-Unicode.  I'd developed a pattern for converting columns using standard SSIS components, but it still fell short on usability.
If you understand what the Data Conversion component does, you'll have a good start at understanding what the Replacing Data Conversion component does - and perhaps why I had such a hard time coming up with a name for it. I'd like to take this space to explain why I created it - and what kinds of small changes could be made to the stock component in SSIS that could have eliminated that need.
Same-Name, In-Place Conversion
One of the big issues I had with the Data Conversion component was the naming problem.  I regularly used the original column - usually named exactly how I wanted it, and placed in the column list exactly where I expected it - even though it was the wrong data type, and I had a converted column available further down the column list, with a different name.  That was the primary rationale for my data conversion pattern.
Using the Replacing Data Conversion component fixes that problem.  The columns output from the component match those that enter - with respect to column names and positions.  The only change in the output flow is the data type of the columns you've converted.  You can't possibly make a mistake later on in your flow.
Quick Bulk Conversions
Another problem I frequently had with the Data Conversion component was that I typically had to convert multiple columns in exactly the same fashion.  Either I had to convert a bunch of Unicode string (DT_WSTR) columns to Non-unicode (DT_STR), or a bunch of floats (DT_R4) to exact numeric types (DT_NUMERIC).  Doing this with the Data Conversion component was a long series of clicks - and even more clicks and typing if I followed my own best practice.
The Replacing Data Conversion component fixes that problem too.  It has two "modes" - a simple mode, where you can define a blanket conversion such as "all DT_WSTR to DT_STR", and apply that to your column set.  You can filter only columns with specific data types, or leave the bulk conversion wide open to catch all matching types.  Quite convenient for those wider tables, I've got to tell you.
The Drawbacks
Of course, like anything I do - rest assured it's not all unicorns and rainbows.  Here's what you sacrifice by using this component.
A Little Performance
This component is an asynchronous component.  In slightly less technical terms, it doesn't just modify the data as it flies by in the data flow pipeline.  It has to copy all of the data in each buffer to a newly formatted buffer.  That's going to incur some additional memory I/O and CPU activity that you wouldn't see with a Data Conversion component, as well as change how the optimizer sees the flow.  That said, this is not a "worst case" asynchronous component like the Sort.  It's not going to hold up all the rows in the flow, just one buffer at a time similar to the Union All component.  For myself, the improvement in the design experience is worth it.
An Imperfect Interface
Despite my best intentions, I don't get a lot of time to polish these things off.  Despite being somewhat of a perfectionist - especially when exposing my work to the world and building a "brand" - the interface is definitely not as good as I'd want it.
No Quick Installer
For the same reason as above, I don't have the time to construct an installer.  And to be honest, even if I did, you'd still have to do some manual work to get it completely installed.  (Good news on that front coming up.)
Happy Converting
On the whole, I find it very useful.  I hope you do too - and please keep suggestions on improvements coming...

Tuesday, August 24, 2010

Use ReadOnlyVariables and ReadWriteVariables properties in Scripts

For those of you just starting out with SQL Server Integration Services, you'll find a LOT of information out there on the internet.  This series (posts tagged with "Best Practice") is intended to bring the focus back to fundamentals.  I've found that it's easy to get confused with advanced techniques that seem to be targeted at your issue - but are overkill and meant for specific tricky situations.  I've done it on both sides - got suckered in to using something more complicated than I needed to, as well as written about more advanced things that got used where they didn't need to be.
The Misconception
One of those techniques is getting at SSIS variables from within a Script Task or Script component.  You'll find a lot of posts out there about the LockOneForRead, LockOneForWrite, LockForRead, LockForWrite and GetVariables methods.  Here's a small sample of posts - but go ahead and Bing/Google yourself:
Even on Twitter, from one of the SSIS demigods:
The way those posts are phrased, it sounds like those methods are what you should use to get at SSIS variables when you're using a Script.  In my opinion, they're not.
The truth is that 99.9% of the time, getting dirty with those method calls isn't necessary. Using them requires you to understand what those methods do, what sequence they should be called in, and the fact you need to call Unlock after you're done.  (Oh, yeah - that wasn't in the list of VariableDispenser methods, was it?  Easy to miss...)
The reason those posts are out there is because the experts who've posted them know that most of the time, the simple way works perfectly, but they've run into a situation where it doesn't.  They're helpfully posting how to use the advanced technique - since the advanced technique isn't typically documented well.  But by doing so - and being very enthusiastic about the advanced technique, they (we) tend to lead others astray.  Additionally, they're confident in their abilities to use the advanced technique correctly every time.
The Simple Method
SSIS provides a really simple way to get at variables, and it doesn't take code.  See these two properties on the Script Task?
And these two on a Script component?
Quite simply - use those instead!  Here's the complete rundown.
Accessing Variables in the Script Task
Type the fully qualified variable name - or select it off the list exposed by the ellipsis (...) button - into either the ReadOnlyVariables property or ReadWriteVariables property.  Separate multiple variables using a comma (,).
Inside the code, those variables are now available in the Dts.Variables collection.  To read a variable called "StringVar", do the following:
// C# code to read an SSIS variable
string scriptVariable;
scriptVariable = (string)Dts.Variables["User::StringVar"].Value;

' VB code to read an SSIS variable
Dim scriptVariable As String
scriptVariable = Dts.Variables("User::StringVar").Value


If you have a variable you've placed in the ReadWriteVariables property, you can do the above to read the value out of it so you can use it in your script.  You can also push values into the SSIS variable, like this:
// C# code to write to an SSIS variable
Dts.Variables["User::StringVar"].Value = "New Value";

' VB code to write to an SSIS variable
Dts.Variables("User::StringVar").Value = "New Value"


Accessing Variables in the Script Component
Identically to the Script Task, type the fully qualified variable name - or select it off the list exposed by the ellipsis (...) button - into either the ReadOnlyVariables property or ReadWriteVariables property. Separate multiple variables using a comma (,).
The difference here is that accessing the variables is much more user-friendly... but where you're allowed to access them is not and I'll explain that next.  So first, to read SSIS variables in a script component, use code like this:
// C# code for reading an SSIS variable
string scriptVariable;
scriptVariable = Variables.StringVar;

' VB code for reading an SSIS variable
Dim scriptVariable As String
scriptVariable = Variables.StringVar


To write to a variable inside a Script component, do this:
// C# code for writing to an SSIS variable
Variables.StringVar = "New Value";

' VB code for writing to an SSIS variable
Variables.StringVar = "New Value"


Now for the "tricky" part in Script components... you can't use the above code just anywhere (nor should you).  The ReadOnly variables can be read from anywhere in your script - the PreExecute method is the most common place to read them, but you can also get at them in ProcessInputRow and PostExecute.  The ReadWrite variables can only be used in the PostExecute method - your script will report an error at runtime of "The collection of variables locked for read and write access is not available outside of PostExecute."  You can't even read them in the PreExecute method - all access to ReadWrite variables is limited to PostExecute.  If you need to read a variable's value in the script, do the work, then update the value, you'll need to use two SSIS variables to get that job done.
The Rationale for the Advanced Method
You've heard about the "simple" technique, and yet there are these fancy methods that other articles and posts may have said "work better."  It's absolutely true - but for a very small fraction of scenarios.  Here they are:
Tighter Control Over Locking (Concurrency)
When you place variables in the ReadOnlyVariables and ReadWriteVariables properties, SSIS writes code for you that you never see.  This code calls all those Lock* methods in the proper sequence and makes sure to Unlock them after you're done.  The code to lock the variables is run in the PreExecute phase of the task, and the unlocking is done in the PostExecute.  This means that for a long-running task (either a Script Task or a Data Flow Task containing a script) your variables are locked for a long time.  Is this bad?  Only if you have other tasks running at the same time that want to read or update those same variables.  They simply won't be able to do so, and SSIS will report errors.
Using the Lock* and Unlock methods allows you to reduce that window to only the portion of time that you need it.
This is usually a horrible rationale.  The fact that SSIS is reporting concurrent access problems with a resource should be drawing your attention to the fact that your package is improperly architected.  You have resources that two tasks are using.  One of those tasks will depend on settings made by the other task... but you want to run them in parallel?  Resolve the contention issue by serializing the tasks.  If you code the variable locking to a smaller area of code, all you'll be doing is reducing the risk of having a lock contention issue - you will not be eliminating it.  Your package will eventually fail in production - and you probably won't be able to replicate the failure to fix it.  If your intent is to have two script components in one data flow use the same variable - that's just bad architecture on your part.  There's absolutely no guarantee that one of those scripts will finish before the other starts - even if the data flow looks like it should happen that way.  Use the advanced technique in these situations at your own risk!
Reading From ReadWriteVariables In PreExecute
As you now know, placing a variable in a Script component's ReadWriteVariables property means you can only read and write to it in the PostExecute method.  Sometimes you may have a script process - such as a row numbering operation - that may need to read an "initial" value from a variable, run through the rows, then update the variable with the "final" value for later use elsewhere in your package.  In order to do this with the "simple" technique, you need to use two variables. 
Using the advanced technique, you only need one.  In the PreExecute method, call LockForRead and GetVariables, or call LockOneForRead.  In the PostExecute method, call LockForWrite and GetVariables, or LockOneForWrite.  Don't forget to Unlock the variable as well.
This rationale is understandable.  The alternative is maintaining two variables in SSIS in order to accomplish a simple task.  The downside is that you now have to understand what those methods do, and your use of the variables is not exposed to anyone who might support your package unless they read your script line by line.  Using the ReadWriteVariables property exposes your use of those variables very nicely to everyone else.  Using the advanced technique here makes for a cleaner set of variables in the package, so I can't say it's a horrible use case.
Best Practices for Reading and Writing Variables in Scripts
Given all those facilities and restrictions on using variables, and the freedom the advanced technique gives you - what should you do?  Here are my best practices for using variables.  They won't apply to every situation you run into, but you should try to adhere to them for the reasons I mention with them.
  1. Use the ReadOnlyVariables and ReadWriteVariables properties instead of the Lock* methods.
    * Advertises to other package designers which variables you're using in your Scripts.
    * Guarantees that the variable locking is handled properly.
  2. In Script components, read variables in PreExecute - not in ProcessInputRow.
    * Reading variables crosses the COM/.Net barrier and that's expensive.  If you need the value in the variable when you're processing rows, read it in the PreExecute method into a module-level variable, and use that.
  3. In Script Tasks, read or write to variables once and only once.
    * The same reasoning as #2 - crossing that COM barrier is expensive.  Pull the value out and store it locally for use.  When writing, don't use the SSIS variable as an intermediate storage location - use a local variable and only assign the final value once.
  4. In Script components, if you need to read a variable, process data, then write back to the same variable... don't.  Use two variables instead.
    * See the reasons for #1 - they apply here just as well.
    * Your SSIS variables will be labeled appropriately for "before" and "after" states.
Learn Something?
I hope you did - I know it took me quite a while to figure all that stuff out.  Some of the blame lies with guys like me - we tend to broadcast the "neat" and "difficult" solutions, and forget about the simple ones that took a long time to learn.

Monday, August 16, 2010

Integration Services vNext Coming Soon?

That title may be a little premature - but I've recently seen a (relative) boatload of Connect submissions getting resolved, either positively or negatively.  Some submissions of mine have been "fixed", and several of the other MVPs have had movement on their issues, like Simon Sabin's (blog|twitterrecent post about the long-awaited "undo" feature.
I'm encouraged by three of my issue resolutions.  First, a cosmetic Data Flow Sequence Container got a "fixed" resolution.  Not a huge feature, to be sure, but an indication that the designer interface is improving.  Here's hoping they fix up the inconsistencies and lack of polish on the Task and component editor interfaces that many of us have complained mercilessly reported to them over the years about.  Second, the ability to add Input Dependency directives to custom data flow components will be quite significant to me - and perhaps you if you use the Kimball Method SCD Component.  That capability should help me (the component developer) better communicate to the SSIS data flow engine about how I consume buffers, allowing it to better manage memory and other resources.  Third, and perhaps largest for the "third-party" developers out there is that there will (allegedly) be a method to install custom objects into the toolbox.  This will make using third-party objects in SSIS a lot less daunting for most organizations - and hopefully make using them a little more commonplace.
On the other hand, some very basic suggestions have been nixed.  My suggestion to extend the Aggregate component to allow Min and Max operations on strings was just rejected.  I have to say I simply can't understand it.  Seriously, I have to code this by hand in a Script?
Of course, just because those three example suggestions have been marked as "fixed" by the SSIS team that doesn't mean that they're fixed.  The proof will be in the product itself, whenever we get our sweaty little palms on it.
And despite our collective tingles as we watch Connect items get resolved - we may be jumping the gun when it comes to drawing conclusions like the title of this post.  Since I haven't been observing the release process for SQL Server for very long, assuming that resolving these issues means we're close to a CTP or something like that may be false.  Even if it is an indicator that they've "locked down" the release feature set... it may be a year... or two... before the product works its way through the rest of their development cycle.
So take those Connect items for exactly what they are - a commitment from the product team that they heard you, and did what they could to deliver.

Friday, August 6, 2010

VanPASSBI August Presentation Recap

Thanks to Scott Stauffer (twitter) and the people that turned up in-person and online for my "In-Depth with the SSIS Script Task" session on Friday.  Lots of good questions from the audience - I've got to add a slide on the Office API and add some more code to ensure I can whip out a relevant sample when needed.
Scott will be getting up the recording of the session soon - so check the Vancouver BI PASS chapter site for that.  My slide deck and sample SSIS package (2008 format) is up in my SkyDrive Presentations folder - browse through the slides for links to samples and articles that I mentioned in the session.
Please, please give me some feedback on the session on SpeakerRate - specifically, please leave comments on how the presentation flowed for you, and if there were any technical details I didn't cover that you would have liked to see.  It's nice to know if the session was good or bad - but it's better for me to know how it was good or bad - and mostly I want to hear the bad stuff so I can fix it.
If you missed this presentation, and you're REALLY bummed out about it - don't watch the recording!  Come see me in person at SQL Connections in Las Vegas in November...

Tuesday, August 3, 2010

Presenting at SQL Connections in November

This will be my first trip to Las Vegas that isn't to catch a connection at McCarran, and it's also the first time I'll be going to SQL Connections.  Several months ago, the call for abstracts went out, and Denny Cherry (blog|twitter) and I were the two lucky "new guys" that were given spots.  The SQL Connections conference is part of a larger meta-conference involving tons of other Microsoft technologies (ASP, Silverlight, Visual Studio, Sharepoint, Exchange, Windows, ...) and even a DotNetNuke conference.  Your typical heavy hitters will be on tap - Kimberly Tripp, Paul Randal, Itzik Ben-Gan, Brad McGehee, Allen White, Brent Ozar, Craig Utley, ... - most of whom should be moving right on to the PASS Summit the next week.
I've been selected to present three sessions on SSIS: Data Warehouse ETL with Integration Services, In-Depth with the SSIS Script Task, and In-Depth with the SSIS Script Component.  The session difficulties range from intermediate to advanced - but it looks like mine is the only Integration Services content at the conference, so you're stuck with that!  Don't worry, there will be plenty of time for Q&A on any SSIS topic, and plenty of demos and take-home code/packages for you to play with.
Come join me in Las Vegas with the rest of the SQL twitter posse.