Wednesday, July 27, 2011

Quick Hit: Using XLSM Files With SSIS

If you've tried to use an Excel 2010 XLSM file - a "macro-enabled" spreadsheet - with Integration Services, you'll have seen the incorrect (and badly worded) message that the "File path contains invalid Excel file.  Please provide file with .xls or .xlsx extension."  It looks like the SSIS team just didn't know that Excel makes other flavours of files.
How Do You Get Around This Limitation?
  1. Create an Excel Connection Manager to refer to ANY other spreadsheet with a "valid" XLS or XLSX extension.  (You don't have to save your XLSM file "as" an XLSX.)
  2. Select the Excel Connection Manager in your Connection Managers pane.
  3. Open the Properties Window (F4).
  4. Change the "Server Name" property to the full pathname of your XLSM file.
Done!

Tuesday, July 19, 2011

PASS Summit 2011 Content - It's Your Choice

If you're reading this post and attending the PASS Summit 2011 in Seattle, you absoutely should vote for two things with regards to the Summit: Dr. DeWitt's keynote subject, and the Community Sessions.
Dr. DeWitt's Keynote
I've been to the Summit for the past two years (I think... it's hazy) and one of the absolute highlights has been Dr. DeWitt's presentations.  2009 was on columnar database technology, 2010 was about query optimization.  Believe me - he does not dumb things down, and he's not from marketing.  There's a reason everyone raves about his presentations - he talks to geeks the way we want to be talked to.
Just like last year, PASS is asking you to help select Dr. DeWitt's topic.  The theme this year seems to be parallelism technology - big data or big iron.  Would you like the doctor to explain Hadoop and distributed data processing, or describe the coming impact of the increasing number of CPU cores and I/O changes?  If you're a DBA, I can see you being interested in the Moore's Law presentation.  But since I'm closer to the DBD demographic, I've voted for the Hadoop talk.
Best quote from Dr. DeWitt last year (I'm paraphrasing): "Guys that drop out of query optimizing because it's too hard... go build rockets for NASA."
Community Sessions
Apparently, the PASS Summit Program Committee is taking a day off, because PASS is asking for our help to round out the Summit schedule.  (I kid - the PC did a great job this year and deserve a day off... even though I know they're not taking one.)  From various sources, the Program Committee has selected 20 candidate sessions that are competing for five slots.  Here are my choices, based solely on what I need to learn and want to see:
Robert Cain on Project Juneau - The team I'm with needs to understand databases better.  It's really hard to help them get there when the code they write is so very disconnected from the database.  As I understand it, Juneau is a step towards integrating code and the database better - which would give me better tools to encourage using better code and a real database.
Wendy Pastrick on Visualizing Indexes - My understanding of index internals is woefully deficient.  I certainly think I know the basics, but I could really use more.
Chris Webb on SSAS Security - This one is probably over my head at the moment - but I know it's inevitable that I'll be called upon to deliver it.  My org is presently just too paranoid (IMO) about spilling too many secrets to parallel lines of business.
John Welch on Advanced DW Scenarios - Some of what I want to be able to deliver in my DW seems to require what John's talking about here... but I don't know enough to say for sure, that's why I have to go.
Jen Stirrup on Data Visualizations - This topic is both fascinating to me and severely frustrating.  I yearn for the ability to get rid of the complex reams of tabular reporting that require significant mental effort to decipher.  I've done my best with Excel 2007, but I need to expand my toolset to accomplish transforming my business' expectations.
Vote Now
If you're attending the Summit, please vote for the keynote and session schedule you'd like to see.  You've only got until the 20th of July.  Please don't make it a popularity contest about the speaker - because that means I'll lose :)  Yes, I happen to be listed twice!  I have a session on complex SSIS executions and another on processing slowly changing dimensions in the running.  If straightforward SQL Agent Jobs can't solve your SSIS execution scenario, I've got some techniques to help.  If John Welch's presentation on advanced DW scenarios is a little too advanced, maybe mine fits in your ballpark a little better.
Whatever your interests, please do pick some.  It's a shame that so many great sessions are in competition with each other - but I'll be happy to be a loser if it means you get sessions that will do you some good!

Monday, July 18, 2011

Presenting Upserts to PASS AppDev VC

Hopefully this crowd knows a little about what they're getting into!  John Jakubowski (blog|twitter) has graciously asked me to present on upsert strategies with SSIS.  What are "upserts"?  A not-so-fancy mashup of the words "update" and "insert".
If you struggle with incremental table loading performance - updating rows that already exist, and inserting rows that didn't - then you might want to drop by.
When?
July 26th, 4pmGMT/12pm ET/9am PT
Where?
Go to the PASS AppDev website, and find the LiveMeeting link.
What?
The full abstract:
One of the most sought after techniques in Integration Services is how to perform "upserts" - propagating changes from one system to another that may require new records to be inserted, or existing records to be updated. Integration Services can't accomplish this with a single click, but it does contain all the tools required to construct packages that will achieve this goal. The task of synchronizing data will be broken down into two parts: detecting changes, and updating or inserting. Refinements to use of the Lookup and alternatives to the OLE DB Command transform will be explored.

Thursday, July 7, 2011

What is Pipeline Backpressure?

First - I'll get the "what's a pipeline got to do with Integration Services" question out of the way.  The "pipeline" I'm referring to here is the word commonly used to describe the flow of data in the Data Flow Task.  That term doesn't seem to show up in any of the MSDN documentation, but the Microsoft bloggers just can't seem to talk enough about it.  I suppose it's an internal term that's leaked out into general usage.
The SQL Server Integration Services' Data Flow Task reads data from one or more sources, flows that data through this "pipeline" where some parts of it will get transformed (changed), and then sends the data to one or more destinations.  The choice of the word "pipeline" will tend to get people thinking of a liquid in a pipe - and that's an apt metaphor for the phenomenon of backpressure.  The same kind of action takes place in SSIS.
Fluid Backpressure
When talking about a liquid, backpressure refers to the resistance that the fluid encounters flowing through a pipe.  When the fluid encounters a constricted section of the pipe, it can't flow as fast through that section.  But since (most) fluids are incompressible and most pipes are rigid, the reduction in flow rate at this constricted point means that the flow is "backed up" all the way to the start of the pipe.  If a kink is stopping the flow of water near the end of the hose, the tap isn't able to inject any more water into the hose - there's simply no space to put it.  The rate at which you can inject water into the hose is directly related to two things: how constricted the smallest part of the hose is, and how much the hose can "expand" prior to the blockage.  (If the hose can inflate like a balloon, you can still pump water in.)
Data Flow Backpressure
SSIS's Data Flow Task has very similar constraints.  The source - or "tap" - wants to inject data into the pipeline as fast as it can.  But that data has to flow through the components - the "pipe" - in order to reach the destination - the "end" of the hose.  Somewhat like a garden hose, the data flow pipeline can stretch to allow the source to inject more data into the pipeline than is ultimately getting through to the end.  But there's a limit, of course, just like with a hose.  If it's stretched too much, it will burst. 
In SSIS's case, since it's a "fictitious" hose, the nice engineers that built it were able to set limits on how much of a balloon you're allowed to turn your hose into.  It's the RAM limit of the process and/or machine you're running your package on.  In most cases, taking advantage of SSIS' "stretchy hose" is a very bad thing - you'll easily consume all the available memory on your machine, and start spilling into virtual memory... and that's slow disk!
Backpressure In Action
Here's an SSIS package that demonstrates backpressure quite well. 

You can build your own quite easily from what I'll describe here, in case you don't have SSIS 2008.  The first task in it is a Data Flow Task that looks like this:
The first component hooks up to your local instance's master database, fetching a cross-joined resultset from sysobjects.  (It's the volume of rows that matter here - not the content.)  The second component is one of my Data Flow wiretap scripts that will dump timing information to the logging stream that we'll be able to watch in the Output window.  The last component is a script destination.  Normally, I'd just use a Union All or Derived Column to cap the flow, but those result in a lot of "unused column" warnings that would clutter up the logging stream.
As you can see below, running this flow results in something like this pretty quickly.  In my case, under 1.5 seconds, just about as quick as SQL Server can deliver the rows to SSIS.  (So fast I didn't bother to capture an "in-flight" picture.)
The next Data Flow Task in the package shows how an "impeded" data flow will behave.  In this flow, there's one more script component in there.
The component inserted at the third position is intended to mimic a transformation that takes effort to perform.  This is the equivalent of an uncached Lookup, OLE DB Command, Merge Join, etc...  What does this look like at runtime?  Well, no problem taking a screen capture in mid-execution on this one...
How does this image show the effects of backpressure?  It was taken over 20 seconds after the data flow started.  Let me explain in more detail.
Our first "free-flowing" data flow loaded the data into SSIS' pipeline in under 1.5 seconds.  The second "impeded" data flow hasn't even finished loading the 100K rows after 20 seconds... why not?  Because the "sleep" transform is slowing things down.  So why isn't the source simply reading as fast as it can, and piling up the data at the slow transform's doorstep?  Because if it did that, it would consume memory.  The developers of SSIS thought that consuming memory in this situation would be completely unnecessary if they could just tell the source to "slow down" to a manageable level... so they did.
Any "execution tree" (there's another bit of SSIS internals jargon) will experience this backpressure - forcing every component to operate no faster than the slowest.  One of the key parts of managing the performance of your packages is understanding pipeline backpressure.  It can mislead you about what's performing slowly in your flow, causing you to spend time and effort where it will have no effect.  Tweaking the DefaultBufferSize and DefaultBufferMaxRows properties can change some of the effects of backpressure - but doing so only delays the inevitable with sufficiently large data sets, and typically offers little meaningful benefit.  The real key is understanding your data flow's behaviour through decomposition.
I have one more example data flow for you in the package, where I add a sort component to the middle of the flow.  How will this affect the flow of data?  The sleep component will definitely cause the end-to-end flow to be slow, but will the source stream quickly like the "free-flowing" example, or will it be throttled like the "impeded" flow?  I'll leave that up to you to explain...