Saturday, March 27, 2010

Quick Reference: Formatting Code in Blogger

I'd been looking for something to help with this - pasting code into the Blogger post editor results in... well... garbage.  It took so much time to format code that I typically didn't do it.  I'll probably go overboard in the other direction - but at least I found a tool to help.
Head on over to Phil Factor's Code Prettifier.  I'm going to be using that facility (and this post to find it) for a great many posts.  A close runner-up is Format My Source Code (thanks Andrew Beacock for blogging about that).  I like Phil's colours, so I'm going with that one now - but I do like the boxed code Greg Houston's site makes.

Thursday, March 25, 2010

Presenting Virtually to Columbus PASS Chapter in May

Jeremiah Peschka (blog|twitter) sent out a call for speakers for the Columbus PASS chapter over the twitterverse last month, and invited virtual presenters to apply.  After a short back and forth about topics, we settled on talking about using SSIS to do ETL for data warehouses.  Yes, I've done that one before - but I think it's a great idea to do a presentation more than once.  As Brent Ozar (blog|twitter) just happened to say, there's more "junior" professionals out there than "senior" - so tailor your presentations for your audience!  As much as I'd love to do presentations about coding complex Integration Services data flow components (and I would), I can probably count a worldwide audience for that without running out of fingers - and that includes devs I now know in the SSIS team in Redmond.
The Abstract
Microsoft designed Integration Services with Data Warehouse ETL processes in mind, so loading one should be pretty easy, right?  It's not that bad, but there are a few things you should avoid, and some non-obvious best practices to use to make that ETL faster and more reliable. Assuming attendees have a basic level of familiarity with a Kimball method data warehousing methodology, this talk will explain how to accomplish Kimball style ETL using SQL Server Integration Services. We'll review the SCD Wizard and how it works for processing your dimension tables - what it's good for, what it's not, and what alternatives you have to it. We'll also examine fact table loading using a surrogate key replacement pipeline, especially using the Lookup component efficiently. Finally, we’ll look at specific and easy performance tweaks, techniques and best practices that apply directly to DW loads. That includes reviewing which Data Flow components to avoid and how to still get your work done without them, and tuning inserts and updates.
The Goals
For the members of CBusPASS, your goals should be:
  • To walk away with firm knowledge that SQL Server Integration Services really can load a data warehouse, and fast.
  • To pick up at least one useful tip to improve your warehousing ETL processes.
  • To take advantage of the fact I can't crack a whip and demand your attention like I might be able to if I was physically present - ask demanding questions and don't be intimidated!
And there's always goals for myself:
  • Improve the content and flow of the presentation, since this is the second time around.
  • Learn what it's like to present virtually.  I'd like to know if I can pull it off, what works and what doesn't - it'll come in handy for my day job.
  • Do my best to impress Jeremiah and hope he passes a comment or two the way of the PASS Community Summit program committee.  (Wink, wink!)
The Time and Place
Synchronize your watches for 6:45pm ET on May 13th - details for connecting via LiveMeeting will be up on the Columbus SQL Server User Group website.  If you aren't in Columbus, attend anyway!  I know they won't mind at all.
UPDATE 2010-05-14: I've posted the slides and demo package on SkyDrive.

Tuesday, March 9, 2010

SSIS, the KSCD and Ascending/Descending Sort Order

Something occurred to me today as I was working with optimizing my Kimball Method SCD component for SSIS.  The component caches rows from two inputs - much like a Merge Join - until it can match rows between the sources.  But unlike the Merge Join, the algorithm I have inside the component works whether or not the sources are sorted.  (Vote for this Connect issue if your think requiring the Merge Join's inputs to be sorted should be changed.)  But if the inputs are sorted, there are some additional things that it can do to process the rows faster.
The Issue
In one particular dimension of mine, the business keys are somewhat "time sensitive".  Like all dimensions, the Source System only contains the "current" version of the business entity - but in this case, part of what makes the entity "current" happens to be encoded in the business key of the entity.  I'm talking specifically about an invoice dimension, where the invoice number tends to be (but isn't always) a number that increases over time.  When the invoice is old enough (and completely paid) it gets purged from our Source System.  That action gets captured in by the KSCD component (unlike the SCD Wizard) by marking the row as "expired."
The Hypothesis
Since I was taking advantage of some optimizations in the component related to sorting, it occurred to me that I might want to try sorting in descending order instead, with that invoice number moved to the most significant sort position.  I posited that if I did that, the most current rows would tend to come in from both inputs first, leaving the older expired records from the dimension table to come in last.  The two effects I was expecting to see were that the size of my internal cache would be reduced, and I'd be able to make more decisions quicker.  I figured the caches would be smaller because I wouldn't have to hold on to as many rows from the Existing Dimension input for as long, waiting for potential matches to arrive from the Source System.  An added bonus there is that a smaller cache means that managing those caches is faster.  The second expected benefit relates to hoping to have all rows from the Source System matched up very early in the process, allowing the component to take a significant shortcut in determining the rest of the older expired Existing Dimension rows should just be pushed out the Unchanged output.
The Test
The first obstacle that came up for me was the concept of "descending" sort order - I'd never really worked with it before in the context of Integration Services.  I was used to hacking at the OLE DB Source's output column's SortKeyPosition property to reflect the ORDER BY I'd used in my SELECT statement... but where was the property to indicate that the order was descending?  A quick examination of the output metadata of a Sort transform educated me on that - descending sort orders are represented by negative numbers.  (Search for "negative" on this BoL page for details.)  Proof to the axiom that you learn something new every day.
The next obstacle was realizing that my component - much like myself - had no idea descending (negative) sort orders existed.  In fact, as it was coded it didn't even register a problem, and would have merrily processed information improperly[1].  Once I'd fixed that up, I was off to the races to test my hypothesis.
The Results
Through very unscientific means (only one run in each configuration) I did find what I was expecting - the component had to cache fewer rows during processing, and was able to take greater advantage of optimizations.  The net result was a faster execution with lower memory usage - a win/win.  By hard numbers, I have a 180 thousand row source system set being processed into a 4 million row dimension table with a 32 byte business key.  On a dual-socket, four-core per socket (8 cores) 6GB 64-bit machine, this took under 9 minutes to process.
If you're looking to tweak a little more performance out of your Kimball Method SCD component - you might want to consider running some tests on different sort orders...
[1]A side-effect of this investigation was discovering that it wasn't taking advantage of some other optimizations, despite my best intentions.  Look for more on that issue as I dig into it...

Wednesday, March 3, 2010

Presenting at Calgary PASS in March

I finally have an opportunity to get out to see one of our manufacturing plants in Calgary in mid-March, and as part of that road trip, I've arranged to present at the Calgary PASS chapter.  I've been talking to Noel Tan on and off for a few months, and he's graciously agreed to let me bore the pants off his prairie brethren with some Integration Services claptrap.
What
I'll be doing an SSIS topic - or maybe two, depending on how indulgent the crowd is on that night.  We'll start off with Diagnosing and Addressing Performance in SSIS, and if we make it through that one with most of the audience awake, we'll take a caffeine break then jump In-Depth with the SSIS Script Component.  To expand upon those vague titles, here's what's in store:
Diagnosing and Addressing Performance in SSIS
Integration Services is a complex system of parallelism and in-memory buffers. Determining what parts of your package are causing sub-par performance requires some specific techniques. Before you apply any "top ten" performance enhancements, you need to identify where the issue is - or you'll be wasting your time. Learn what tools are available for monitoring and diagnosing poor performance in SSIS packages and how to apply them. Dive in to the only technique to diagnose Data Flow bottlenecks - decomposition - and how to effectively address the most common causes.
In-Depth with the SSIS Script Component
The Script component is the Swiss Army knife of data transformation in Integration Services. If there isn't a built-in transformation that will parse, reformat, restructure, or otherwise mash your data the way you need, then you should look to the Script transform. The Script component has access to SSIS variables, connections, and all of the columns in the data flow. It can act as a data source (perhaps reading information from a web service), a destination (maybe writing to an EDI text file), or a transformation (possibly encrypting a column's contents). The Script component has full access to the .Net Framework, so the possibilities are endless. But there are a lot of ground rules a .Net developer or IS package developer needs to know before they can effectively solve problems with the Script. Learn what you need to know.
When
The Calgary chapter doesn't have monthly meetings - I interpret that to mean that every meeting is a special event!  This special event takes place Wednesday, March 17th (St. Patrick's Day) - 5th Floor, 715-5th Avenue SW.  I hope to see you there, and perhaps after to find some green barley sandwiches...