Monday, June 28, 2010

Beware of OLE DB Commands, Data Types, and Calculations

Square peg...
The OLE DB Command in Integration Services is evil - and for more than the typically cited reason of poor row-by-row performance.  Here's another example of where it can cause difficulties, and where a set-based update can save the day.
I recently had need to perform some mathematical calculations on a fact table to populate values retroactively.  I'm not a fan of having to do that, because it means I need to fully process the fact table after each update - but I'll work on optimizing this process later.  The important part is the math.
I have sale amounts in this fact table - gross sale amounts.  I have a need to also provide a calculated value on each row for the dollar value that the sale was discounted.  Some discounts can be known exactly at the time of sale, but others may not be known for weeks or months later.  The reason for this is national contracts with larger customers where volume discounts are promised based on the actual volume of orders that were placed.  At the time the individual orders are being processed, there's no way to know if the customer will reach a specific discount level, so it's not possible to record an exact discount amount.  At the end of the contract period we can calculate the total discount the customer should receive, and it helps the business users if we can push that discount back down to each line item on a proportional basis, so they can see the "true" value of each sale.
A simple way of looking at this is with an example of a customer we promise a $1000 discount should they spend more than $50,000 with us.  If, at the end of the contract period, they spent $65,000 - we owe them $1000.  This means our net revenue was actually $64,000 - and we'd like individual transactions to reflect that reduction.  Since I wanted the reduction to be applied equally to each transaction - and in this case "equally" meant proportional to the amount sold - populating that "discount" column with a percentage of the transaction sale amount would do fine.  It works even better that way, because I wouldn't have to apply a specific T-SQL command for each row, I could apply a calculation to a set of rows.
In our example, I wanted to populate each transaction's discount amount with 1/65th of the sale amount.  My data flow calculated the ratio I wanted to apply - in this case, a sufficiently high precision value of 0.01538.  I set up the OLE DB Command to issue an UPDATE command, and then surveyed the results of my work.  A quick SELECT SUM(discount) arrived at a total applied discount of $650.  Nowhere near the expected $1000.
I won't bore you with the hours of investigation that led me to the real problem.  If you're eagle-eyed, you may see that the $650 actually applied corresponds to a factor of 0.01, rather than the 0.01538 factor I listed earlier.  I'll protect my ego by saying that this example is contrived to be simple - and this correlation wasn't obvious in the real data.  This is in fact the root of the problem - and it's exposed by the OLE DB Command component.  I used a command of "UPDATE FACT_TABLE SET amount_discount = amount_sold * ?".  The data type of the "amount_discount" column in SQL is a DECIMAL(10,2).  The OLE DB Command component therefore assumed that the parameter placeholder should represent a value of that data type, and so constrained the value it passed to SQL.  So instead of issuing a command of "UPDATE FACT_TABLE SET amount_discount = amount_sold * 0.01538", Integration Services issued a command of "UPDATE FACT_TABLE SET amount_discount = amount_sold * 0.15".
Not exactly what I was looking for - but then, I'm not quite sure what else I would expect from the OLE DB Command and its parameter replacement.  But that's what you get when you try to fit a square peg into a round hole.

Monday, June 21, 2010

Feature Packs for SQL Server (and SSIS)

Some of the best extensions for Integration Services (and SQL Server itself) are found in what Microsoft has released as "Feature Packs".  They aren't very well advertised, but include some of the essential tools for certain scenarios - like the DB2 connectors, PowerPivot, and Report Builder.  Most of those tools are available individually (somewhere) but the feature pack is a good place to find them all.
Microsoft recently released the first SQL 2008 R2 feature pack (for RTM) - the link is below along with links to the most recent feature packs for the other recent SQL versions.
Feature Pack for SQL Server 2005
Download here. Integration Services relevant bits:
Microsoft OLEDB Provider for DB2
Microsoft SQL Server 2000 DTS Designer Components
Microsoft SQL Server 2005 Backward Compatibility Components
Microsoft SQL Server 2005 JDBC Driver
Microsoft .NET Data Provider for mySAP Business Suite, Preview Version
Feature Pack for SQL Server 2008
Download here. Integration Services relevant bits:
Microsoft SQL Server 2008 Analysis Services 10.0 OLE DB Provider
Microsoft SQL Server 2005 Backward Compatibility Components
Microsoft Connector 1.0 for SAP BI
Microsoft SQL Server Driver for PHP 1.0
Microsoft SQL Server 2005 JDBC Driver 1.2
Microsoft OLEDB Provider for DB2
Feature Pack for SQL Server 2008 R2
Download here.  Integration Services relevant bits:
Microsoft® SQL Server® JDBC Driver 3.0
Microsoft® Connect 1.1 for SAP BW for SQL Server® 2008 R2
Microsoft® OLEDB Provider for DB2
Microsoft® SQL Server® Driver for PHP 1.1
Other Addons for Integration Services
Of course, if you're looking for addons for SSIS, the place to go is the SSIS Community Tasks and Components directory on CodePlex.

Wednesday, June 16, 2010

SQL Saturday #43 Redmond - Recap

Sorry for the delay in getting this post up - I've been busy with the regular day job.
This was my second SQL Saturday - very fun, and very different from the first.  I didn't happen to get to spend all day at the event - I spent the morning in downtown Seattle with the family.  I wanted to be there in the morning to catch Scott Stauffer and Matt Masson's sessions - in fact, I missed Matt entirely!  I'm sure I'll catch up with him sometime before the PASS Summit, if not then.
My session was extremely well attended, I thought - probably the highest headcount I've had staring back at me.  The number and inquisitiveness of the questions confirmed to me that you guys were indeed paying attention and cared about the subject as well.  I hope I was able to let you see a few of the options you have for loading your dimension tables with SSIS - and I know there are a couple of you waiting to see the demo package.  Like I counseled in the session - try a few of them to see how they "feel" to you.  Know that they all can be tweaked and altered - some more than others, of course, but that's why I showed you four techniques.
The slide deck is now posted on my SkyDrive, as well as the demonstration package.  In order to run the fourth sample task in each package, you'll need to install the Kimball Method SCD Component from CodePlex.  Feedback on the presentation would be great - I've set up an account at SpeakerRate - please provide some comments about things I could do better.  For those that want more information on things like dimensional database design vs. normalized operational database design, you may want to review my other presentations/slide decks/demos - I've got them all listed on my Presentations Gallery page.
Hope to see you at SQL Connections, or the PASS Summit!

Tuesday, June 8, 2010

Guerilla BI Sanctioned... With Plausible Deniability

(You'll have to read through past posts on this subject to fill in the background on my progress in bringing data warehousing to my own organization.)
I'm in a good mood, and this is only one of the reasons.  Perhaps the past few years of recession has brought this about - or it could be the expected upturn in housing activity related to coming out of said recession - but hunger for data analysis at my company has grown.  So much, in fact, that I'm fielding inquiries that would otherwise not be answerable without significant expenditure of hours on the part of current data analysts.  (Even though they don't call them that.)
Given that, I've been forced out of the woodwork - somewhat.  I'm now working with a group to provide some metrics on an exploratory basis for the business decision makers in my organization.  That in itself is a good thing.  But better is the fact that I have now exposed some of the traditional data analysts to the capabilities of a real data warehouse versus Access and Excel on their own.  I'm optimistic that when this project is over, another project will be born to solidify delivery of business intelligence information - reliably, automatically, and trustworthy - through a data warehouse.  Of course, the possibility exists that everything could just go back to the way it was...

Wednesday, June 2, 2010

SQL Saturday #43 - Redmond!

Yet another SQL Saturday close enough to get to - but this time, I'll only be asking you to suffer through one session with me.  If you didn't make it down to Portland for SQL Saturday #27, make sure you drop by in Redmond and say hi.  I'll be doing my (continually morphing) session on Dimension Processing with SSIS - explaining and demonstrating alternative dimension processing techniques.  Using any one of the techniques should cut your load time significantly - as in "orders of magnitude".  If you're interested in SSIS, there's a whole track for it - don't miss an intro by Scott Stauffer (twitter), performance design patterns by Matt Masson (blog|twitter), and an SSIS+CDC session by Doug Wheaton.  Once again, scheduling will preclude me from watching Remus Rusanu's talk on real-time ETL (without SSIS).
Go sign up for SQL Saturday #43 - there's lots of great content, and best of all - it's free!