Saturday, January 30, 2010

Quick Reference: Interacting With Office in SSIS

(Again, this a quick reference post for me, but I'm sure this information will be very valuable for you out there as well.)
There are lots of potholes and confined spaces in SQL Server Integration Services all by itself - but when you add in the very likely endpoint of an Office document, that's when the gremlins really start coming out to get you.  Whether you're reading data from an Access database, or writing information to an Excel spreadsheet, it's not uncommon for data warehousing and even regular data integration.  And sometimes your project asks for you to manipulate the contents of an existing Office document - which is not a great idea.  This will be an organic post, I'll add new Q&As as I run into them on the forums.  Now, on with the tidbits and links!
Is Office Required to Read or Write to Excel or Access?
With Office 2007, you may receive a message such as "The 'Microsoft.ACE.OLEDB.12.0' provider is not registered on the local machine."
No - a full Office install or installation of the specific product is NOT necessary.  You do need the Data Providers though.  For Office 2007, those are here: 2007 Office System Driver: Data Connectivity Components.  If you're using Office 2010 (beta), the providers are here: 2010 Office System Driver Beta: Data Connectivity Components.  (I will attempt to revisit and relink when the RTM components are released.)
It Runs Fine in BIDS, But Not On My Server!
This might be a manifestation of the prior issue - not many network admins like Office installed on servers (I know mine don't) - but it could be that your servers are 64-bit.  If they are, check out this post for the answers: Quick Reference: SSIS in 32- and 64-bits.
It Worked With My XLS When Testing, but Fails in Production!
If it's not one of the above, you might be getting "metadata errors" in validation, or NULLs showing up in columns in your Data Flow where you see perfectly good data in the source spreadsheet.
The likely cause of this kind of problem is with the Excel data provider that actually reads the data out of the Excel spreadsheet and gives it to SSIS.  The problem originates with the simple fact that Excel isn't a database, and doesn't require that all the values in a particular column meet "rules" for a specific data type.  You can put text, numbers, dates - whatever you want - anywhere you want in an Excel spreadsheet.  Very powerful for doing what spreadsheets are good at - but reason #1 that any DBA will tell you not to to use Excel like a database.  Since there are no data type "rules" for Excel to follow, it guesses.  You heard right - it completely ignores any formatting you have set on the cells, and by default only takes a look at the first eight rows to guess at what kind of data to pass on to Integration Services.  If the first eight rows make it think the column contains numbers, then any string value it runs into magically gets imported as NULL.
So how do you get around it?  It's unfortunately fairly complicated - but there is one spectacular resource out there to help: Douglas Laudenschlager's blog posts tagged "Excel".  Pure gold for this and any other Excel issue you happen to run into.
How Do I Read A Section Of A Spreadsheet?
As in "how do I read columns 5 to 10?"  Use a SELECT statement, using the standard Excel column notation.  Such as "SELECT F, G, H, I, J FROM SheetName"
But what about "how do I read columns 5 through 10, between rows 4 and 1500?"  For that, you need Excel's named ranges to help out.  Define a named range in Excel, then use a SELECT statement FROM that named range.  (I suggest NOT calling your named range the same name as a worksheet.)
Can I Read Formulas Out Of A Spreadsheet?
Simple!  No.  The Excel provider only provides data, not how the Excel spreadsheet arrived at it.  However, you may use the Office object model to manipulate a spreadsheet and extract the formulas that way.  But I don't recommend it!  See the "macros" issue below.
Can I Write Formulas Into A Spreadsheet?
Trivial answer!  No.  See above. 
How Do I Run Macros On An Office Document?
Not a good idea... But you can (but shouldn't) use a Script Task to use the Office COM API.  You'll probably regret it.
Why am I so down on this idea?  Because Excel, Word, Access, and all other Office apps were designed to be run by an end-user, not the server.  "Oh, get off your high horse - client-side apps won't bring down your server."  No, they won't - they just won't work reliably for you.  I'm not denigrating the quality of Office applications - I'm talking actual architecture here.  Office applications depend on having access to a user's registry for mundane little things like retrieving the user's initials.  Yes - you remember that, right?  The first thing any Office app asks you after installation.  Guess what happens when a headless server process like SQL Agent starts up an Integration Services package?  No user registry is loaded - and no, it doesn't help to use an actual user account as a proxy - the user registry still isn't loaded.  So what does Excel do?  Ask for the user's initials!  The net effect is that this hangs your package waiting for user input that will never come.
Review Considerations for server-side Automation of Office for specific details and unsavory options.
Bottom line - if you can accomplish the task you're asking about any other way - do it the other way.  Even closing Excel reliably using the API with an interactive user session is a problem.  (Grist for another blog post on how I've gotten as close as humanly possible.)
Where Do I Get The Office COM API?
If you absolutely must, and are willing to accept the risks stated above... You should use the Office Primary Interop Assemblies to do this.  If you do, you won't have to install a full Office on your server.
For Office XP: Office XP PIAs
For Office 2003: Office 2003 Update: Redistributable Primary Interop Assemblies
For Office 2007: 2007 Microsoft Office System Update: Redistributable Primary Interop Assemblies
Info on how to install and use them: Installing and Using the Office 2003 Primary Interop Assemblies
Links Restated
Douglas Laudenschlager's Excel Posts
Considerations for server-side Automation of Office
Using Excel 2007 XSLX Files as Destinations

Sunday, January 24, 2010

Kimball Method Slowly Changing Dimension Component v1.5 Released!

nd it's about ^%#@ time!  I simply can't believe it's been over eight months - all for a few new features.  Well, okay, maybe not just a few...
What Is The Kimball Method Slowly Changing Dimension Component?
The Kimball Method Slowly Changing Dimension component is a superior alternative to the Slowly Changing Dimension Wizard in SQL Server Integration Services.  Both components implement a very common data warehousing process.  If you have a data warehouse ETL system, chances are you're using the SCD Wizard...  Or you got frustrated and resorted to a more performant but less maintainable Script, tens of data flow transforms, or a T-SQL MERGE.
I - like many others - was a little disappointed by the poor performance and lack of flexibility in the included SCD wizard in SSIS.  Not that the wizard is horrible - it encapsulates the SCD process very nicely - but it's just "not good enough" for me.  I have to count myself lucky, I know a little .Net and happen to be a glutton for up-front punishment - so I investigated the extensibility of Integration Services and ended up creating a custom component that made some different architectural design choices than the SCD wizard.  This different design has many advantages, requiring very few compromises - in my opinion, of course.
All you need to do to use it in place of the SCD wizard is to download and install it, add it to your BIDS toolbox, then drop it on your data flow design surface.  If you're familiar with processing slowly changing dimensions for a Kimball method data warehouse, I think you'll be using it in seconds.
What Does The Kimball Method SCD Component Do?
Basically, the same thing as the SCD wizard - just differently, and with quite a few more configurable options.  The component "processes" slowly changing dimensions.  "Processing" means that it compares the current information from your line-of-business system information with a dimension table in your data warehouse.  (This line-of-business system is also known as your OLTP or "source" system.)  It detects changes that have occurred in the OLTP system, identifies new business entities that need to be inserted into the dimension table, retroactive changes that need to be made to the dimension table, or changes that require historical "versioning" of those business entities.

How Is The Kimball Method SCD "Better" Than The SCD Wizard?
There are quite a few differences that I think make the KSCD superior to the SCD wizard, but the "big two" reasons are that you can actually edit the KSCD without destroying your data flow logic, and it operates about 100x faster.  There are many more things differentiating them, but for now I'll explore just those two.  And really - aren't those two things alone worth spending a few minutes trying this out?
Non-Destructive Edits
First, I'll talk about the "issue" with the built-in SCD wizard.  The SCD wizard steps you through "configuring" your SCD processing logic quite nicely.  It does use some "foreign" terminology for someone that's read up on the Kimball methodology for describing SCD processing - but nothing too odd.  But once you get to the end of the wizard, it doesn't place just one component in the data flow, it places multiple components!  There's the SCD component itself, plus some derived columns, union alls, and OLE DB commands.  Some of the choices you made in the wizard are encoded in the SCD component itself, but quite a few are represented in those other components and the way they're connected up to each other. 
You can modify some of those properties directly as you need to (as many experts recommend you do) in order to have the process operate a little more efficiently.  The drawback of doing this is that you can't change everything you might want to that way.  Should you need to change a column's method of "change tracking" from "historical" (type 2) to "changing" (type 1), or a new column is introduced to your dimension table... you have to run the SCD wizard all over again.  Unfortunately, this results in the destruction and recreation of all the components that made up the processing logic - even if you've made tweaks to them.  All that optimizing and customizing... destroyed in seconds!
Good news for you - the KSCD is completely editable, and doesn't espouse a scorched-earth policy on the design surface.  Edit the component all you want - the worst that will happen is that columns will get added or dropped from the outputs just like you'd expect, requiring no more effort to adjust your data flow than any other component.
Performance
The other large obstacle in the SCD wizard is it's (frankly) underwhelming performance.  It's completely understandable once you know the design choice that was made (see below) - but that choice makes it basically unusable well before you have dimensions that get into the hundred thousand row range.
The KSCD component trades off higher memory usage in order to blow the freakin' doors off the CPU, powering through dimensions in about 1% of the time the SCD wizard takes.  (That 100x speed advantage is actually conservative, in my opinion.)
My thought when I was starting this project was - if you're processing a hundred-thousand-row-plus dimension, you probably have a hundred MB of RAM you wouldn't mind using to make it go faster.
What Were The Design Choices That Made This Possible?
As I've already alluded - the major design choices were a commitment to encapsulate SCD processing into one component only, and a tradeoff of memory versus lookups.  Everything else the component has going for it, I think the SCD wizard could do - if the SSIS team had some free time on their hands.
Multiple Versus Single Components
The first is puzzling to me - the SSIS team could have encapsulated all of the SCD processing logic into one component had they made that decision (with time and resources to back it up - but probably not that much).  Every other component in the toolbox works that way, why not the SCD?  Perhaps it was a time constraint issue after all...
Lookup Versus Caching
The second choice was a significant one that constrains both the SCD wizard and the KSCD to opposite sides of a coin - there are definite consequences to each design that can't be ignored.
The SCD wizard included in SSIS takes only one input - the rows coming from your OLTP or "source" system.  It uses a (limited choice of) Connection Manager to perform lookups on your data warehouse's dimension table.  The benefit of this is that the resulting component operates synchronously, doesn't require any memory overhead, and might be able to process multiple "changes" per business entity in one pass.  But due to the lookup implementation inside the SCD wizard - it isn't as performant or flexible as the logic in the standalone Lookup component.  It's a very basic "uncached" lookup - and this means that it suffers a complete round-trip SQL query penalty for each and every row sent through the component.  Think about that.  Take your best possible performance profile for a round trip lookup query, be it 20, 10 or even 5ms (!)... then multiply that by the number of rows in your OLTP table.  Completely ignoring any other performance penalties with SQL query round trips, a 10ms cost per row on a 100,000 row flow is over 16 minutes.  Not too horrible, but you'll soon realize how unscalable that math is, and how impervious to how big or expensive the hardware is you might throw at it.  It doesn't scale up or out - at all.
The KSCD component uses a different approach.  It takes the same OLTP "source" system input as the SCD wizard - but instead of a Connection Manager to perform lookups, it asks you to provide another input to it with the entire contents of your dimension table.  The benefit to this design is that the component can use the full power of however many cores your SSIS box has in order to make multithreaded comparisons of two in-memory data sets.  The flipside to that benefit is that it requires memory in order to hold those rows in cache - and depending on how "badly" your data flow is optimized, the component could theoretically have to cache one of the input streams completely.  In a worst-case scenario, for a 100,000 row dimension with a rediculous 8K row size, that could mean almost 800MB of RAM.  Again, not particularly scary - even bumping it up to a million row (of 8K rows!) dimension only means 8GB of RAM.  Under my (basic) tests, instead of 16 minutes for 100,000 rows or over 2.5 hours for one million, you'd be looking at times well under ten minutes, probably less than two.  And seriously - if you can't tweak your data flow to cut that memory usage in half, I'll come and fire you, because you have tons of options for improving the KSCD cache's performance.
The Sensible Way To Process Slowly Changing Dimensions
Review the pros and cons of the Kimball Method Slowly Changing Dimension component, and I think you'll find that it gives the SCD wizard, scripting, using multiple derived columns and conditional splits, and even the T-SQL MERGE statement a massive beat-down.  Each one of those ways of processing SCDs has a single strength: the SCD wizard is easy to set up, scripting is flexible, using multiple components is easily maintainable, and MERGE is performs great.  But none of them deliver on all those aspects like the KSCD does - it's easy to set up, flexible, easy to maintain, and wicked fast.  But don't believe me - check out the reviews people submitted about the previous version.  Version 1.5 delivers a lot more flexibility and usability on top of what they've experienced. 
I would also very much like to thank the community that's grown around this project.  I have had discussions with scores of colleagues online and in person that have resulted in me learning much more about "practical" data warehouse ETL processes.  Those SSIS experts have spent an inordinate amount of time diagnosing problems, educating me, and fixing code with me.  You have prompted almost all of the feature improvements inside v1.5 and continue to energize me about extending SSIS - thanks.
I'll be posting up some walkthroughs and more videos soon as well... stay tuned!

Friday, January 22, 2010

Vancouver PASS January Presentation Recap


As usual, everything at the January chapter meeting of the Vancouver PASS was exceptional - many thanks to Richard Baumet (blog|twitter), Scott Stauffer (twitter), and all the attendees.  The slides for "How a Data Warehouse is Different From a Regular Database" are up on my SkyDrive, and a recording of the presentation will be available at the Vancouver PASS chapter site soon as well.
I got some great feedback - specifically, I'm quite glad that a couple of you mentioned that the presentation "flowed" very well.  In my mind, that's one of the most important parts of presenting - because even if you otherwise had spectacular technical content, I think the audience doesn't benefit from it if they can't engage with the material and follow how it all fits together. 
Oh, and I got one other comment thanking me for not saying the words "slowly changing dimension".  At first, I thought that was a planted comment from Scott or Rich, because I have droned on to them about my SCD extension to SQL Server Integration Services.  But it wasn't, in fact - it was a reference to how I had described the intent and table structure implementation (at a high level) of using SCDs to track changes in a data warehouse - without using the "jargon" that DW/BI practitioners tend to use.  I'd have to say that avoiding that jargon wasn't consciously intended - but was part of my overall intention to not talk to the crowd like it was the BI chapter of PASS.  That's always a strong concern of mine when I've presented to Richard's chapter - I don't want to bore them to tears with information they won't be able to use - or even relate to - in their more "traditional" DBA roles.
There is one more thing that I would really appreciate hearing from those who attended.  My presentation was a little shorter than I expected it to be - it only took an hour to go over my material.  What other information would you have wanted to know?  Could I have spent more time on a particular point, slide, or section?
Thanks again for providing me with the opportunity to present, and look for my ugly mug at a future BI chapter meeting!

Monday, January 18, 2010

SSAS Calculated Measures for Business (II): Annualized Measures

As part of my SQL Server Analysis Services cube, I need to base some measures on how much of the current time slice occurs in the future, and how much occurs in the past.  The measures I'm talking about are "projections" of normal measures - such as "projected sales" for 2009, when all of 2009 hasn't happened yet.  These aren't forecasts or budgets, they're just straight-line projections based on elapsed and remaining time.  I have other measures that build on those - GMROI and Inventory Turns, for example.  This is the second post in a series related to twisting SSAS into calculating those business measures - catch up by reading the first post.
Annualized Sales

The intent of creating this calculated measure is to have an annualized value for any time period sliced by the date dimension.  For example, if the "current date" is December 15th, 2009 and I want to know the "2009 annualized sales" (based on the calendar year), I would take the total sales from January 1st to December 15th, and multiply that by 365/349.  I do that because the total sales figure I get from Jan 1 to Dec 15 isn't the complete year, it represents only 95.62% of 2009 (approximately).  In order to "annualize" my sales figures, I have to multiply by the reciprocal (or divide by that percentage).  To complete the example with real numbers: if my total sales up to and including Dec 15th were $123,000, my annualized sales are $128,638.97.  Numbers like that are valuable when comparing projections to targets and budgets.
My fumbling about in MDX books and even a question or two in the forums didn't turn up any magical functions or techniques for accomplishing this.  So I continued my fumbling about with MDX and calculated measures and ended up with the following.  I can't vouch for the scalability or relative performance of this method - I only know it arrives at the right answer!  If there are any real MDX experts out there who happen upon this post - I'd love to hear how you would improve rewrite this.
Time Intelligence?
A great deal of my problem probably started with trying to figure out what the "time intelligence wizard" capability in the Analysis Services designer (BIDS) generated. I still can't quite figure out how a subset of your time dimension ends up as a calculated measure... and how in order to use a "YTD" column, it had to be tacked on to every slice of a date dimension... and how "YTD" stopped adding up numbers at the beginning of the time slice you're looking at - rather than "today".  I'm sure I missed something very basic - and I fear someone will tell me most of the work I did getting what I need was already available, and I was just too thick to understand it.
Annualized Margin
To supply a concrete target for this scenario, I'll talk about annualized margin dollars.  For those who need it, a little background:  Margin is the difference between what you paid for a product or service, and what you sold it for.  Margin can include only "direct" costs (gross margin), or can include all of the other indirect costs of providing the product or service for sale (net margin).  Margin can also be expressed as a percentage or an absolute dollar value.  Another word often used for margin is profit.
In my calculation of this metric, I've broken it out into several steps and intermediate calculated measures.  This is almost entirely due to my attempts to verify that the numbers I arrive at are correct.  It may very well be possible to combine the entire series of calculated measures into one calculation, and one calculated measure.  Given my lack of experience with MDX, attempting that would be sheer folly, and an invitation to need to relearn everything all over again should I suspect a calculation problem.
Step 1: Determining "Today" (part 1)
Unfortunately for me, this is actually difficult in MDX (at least to my freshman brain).  I had to do this in two steps - it didn't seem to work at all when I nested them.  That didn't turn out to be a bad thing however, as the first step did allow me to have a calculated measure that informs the business users when the cube was processed last:
"[Date].[Calendar]"
  + ".[" + Format(Now(), "yyyy") + "]"
  + ".[Q" + IIF(Month(Now()) <= 3, "1", IIF(Month(Now()) <= 6, "2", IIF(Month(Now()) <= 9, "3", "4"))) + "]"
  + ".[" + Format(Now(), "MMMM") + "]"
  + ".[" + Format(Now(), "ddd MMM d, yyyy") + "]"
The above simply turns the current date into a string which is the specific, explicit incarnation of today's date within my (standard) calendar hierarchy.  For example, December 15th 2009 would be "[Date].[Calendar].[2009].[Q4].[December].[Tue Dec 15, 2009]".  I called this calculated member "Today Calendar Date Member As String", and use it in step 2.
Step 2: Determining "Today" (part 2)
As a second step, I turn the above string into it's counterpart in the fiscal calendar hierarchy, prepared as a set for use later in the calculations.  To do this, I use the StrToMember function to turn the above string into the "object" which is the MDX member the string represents, then use the LinkMember to get the equivalent fiscal calendar member, which I then enclose in a set:
{ LinkMember(StrToMember([Today Calendar Date Member As String]), [Date].[Fiscal Calendar]) }
Just a quick note as to why steps one and two were necessary, as you may be asking why I just didn't pick a member of the Fiscal date hierarchy directly.  It's because it's a fiscal calendar, the month, quarter, and even year that some dates belong to is not what you'd think, so I can't directly determine those just by looking at a date.  I actually have to look them up in a "calendar date to fiscal date" table - which is almost exactly what I'm doing here with the LinkMember function.
Step 3: Fiscal Days Prior To Today
In order to annualize anything, I need to know both the total number of days within the current slice, as well as the number of days that are in the past within that slice.  The result of that will be my divisor in any annualization calculation.  The calculation I have here is likely the one that will make an MDX expert puke at, or send to the Daily WTF's CodeSOD.
Intersect({ [Date].[Fiscal Calendar].[Day].MEMBERS.Item(0) : [Today Fiscal Calendar Set].Item(0) }, Descendants([Date].[Fiscal Calendar].CurrentMember, [Date].[Fiscal Calendar].[Day])).Count
Essentially what I'm doing here is intersecting two sets.  The first set is the set of days from the "beginning of time" (according to my date dimension) up to the first member of the time slice I'm currently looking at.  The second set is the complete set of days in the date dimension.  The end result of the Count property applied to the intersection is to give me the total number of days in the current date slice that occur in the past.
Step 4: Annualized Margin $
The last step is to put some of those intermediate calculations together, and do a little bit more.
IIF([Fiscal Days Prior To Today],
  ([Amount Sold Gross] - [Amount Material Cost])
    / ([Fiscal Days Prior To Today]
    / Descendants(
      Ancestor([Date].[Fiscal Calendar].CurrentMember,
        [Date].[Fiscal Calendar].[Fiscal Year]),
      [Date].[Fiscal Calendar].[Day]).Count),
  0)
Unwinding this from the inside, we'll start with Ancestor call.  This function finds the fiscal year member in the date dimension that my current slice belongs to.  So if I'm looking at a column for [December 2009], this function returns [2009].  It functions regardless of the level in the dimension you're slicing at, so a slice looking at [Q4 2009] or [Week 49 2009] would still return [2009].
The second step of unwinding this is to look at the Descendants call.  This function returns all of the Day-level members included in the Year-level member I just found with the Ancestor call.  If this were a standard calendar-based date dimension, I'd expect this call to return 365 or 366.  Since this is a fiscal calendar, and our fiscal calendar is based on complete weeks, I expect 364 or 371.
One step further in the unwinding is a little easier on the brain - it's just regular math.  I'm taking the difference between what I sold and what it cost (the margin dollars), and dividing it by the percentage of the year that's gone by.
The last step out is the IIF, where I only go through this complex calculation if there is any valid data in this date slice I'm looking at.  (In MDX, 0 = True.)
Annualize That!
Using the above has provided me with a relatively easy way to calculate important metrics like annualized margin dollars, projected annual sales, and projected annual costs.  Watch for business calculations that build off of this calculated measure in future installments of this series.

Thursday, January 14, 2010

Another Instance of Plagiarism

This morning, quite a few of the people I follow on Twitter and my blogroll discovered that some of their blog posts had been reposted without permission on another site.  The person that had set up this blog has visited numerous respected bloggers' feeds, as well as Microsoft's TechNet content, and reposted it on his site.  This person didn't even bother to remove things inside the content that clearly indicated where it came from - making his transgression insanely easy to catch.  It seems that absolutely nothing on his site was created by himself, and it's simply an "aggregation" of other people's work.
The general consensus within five minutes of discovering this "rip-off artist" was to:
  • Flood his site with comments calling him a thief.
  • Send him "DMCA" style takedown nastygrams.
  • Post information on your own site with his name on it, branding him a plagiarist, so his future employers' searches would find out.
  • Getting lawyers on his case, to shut down his site permanently.
Stop, or I'll Shoot!
Whoa.  Scary sh*t.  Anything more than the first option up there probably has him filling his pants with bio-degradable material.  I know I would.
But I'd ask you all to take a step back for a minute here and get some perspective.  Take a deep breath.  I know I'm talking about sensitive stuff - this is material that a lot of people I respect very much have put a lot of effort into.
Examine the Evidence
First, by looking directly at the posts, it's obvious that this person is either completely incompetent at plagiarism, or has no concern for having the content thought of as something he made.  It's obvious because he hasn't removed glaring attributions inside the posts, or links to the original authors' other posts.  If making people think these posts were generated by him was the intent, he fails 99%.  In fact, he fails so badly at it that it raises the very real probability that there was no intent to do that.  (Sadly, I didn't take a screen picture showing examples of the obviousness of the scraping.)
Second, look at his bio page (which is probably taken down or changed by now - so here's a picture - I did have that page left open):
It seems quite clear to me that he's done exactly what he's said - he's scraped material that he finds very useful from other sites and reposted it here.  And he's very thankful for the existence of that information.
Inside the Criminal Mind
So, from a (hopefully) objective perspective, here's my analysis of the situation.
  • This guy loves what he's read online elsewhere - it helped him.
  • He wants other people to see it, and get that same help.
  • He's very uneducated in copyright law.  (He has his own copyright notice in the site footer!)
  • He is unaware of how it makes authors feel when their content is scraped and reposted without consultation.
Instead of unleashing ungodly fury on this kid for doing something he clearly should not be doing, I think he should be gently informed of the right way to go about satisfying what he wants to accomplish, lest he expose himself to sphincter-relaxing outcomes.  The "right way" is to attribute properly, with the permission of the original author.  That way, he gets what he wants - a collection of work he likes, and a way to share with the world his collection.  And the original authors get what they want - exposure.
The Victims' Reaction
Here's why I think that the "victims" went about this in completely the wrong way - and maybe I read too much TechDirt.
Ignorance Of the Law Is No Defense!
First and foremost, the people who had their work reposted are "educators" within the SQL community.  And yet their first reaction wasn't to educate this person, it was to draw and quarter them.  Shameful.  You're acting by assuming something that's done incorrectly by this guy has been intentionally crafted that way despite full knowledge of the correct way.  That's like seeing somebody's database with AutoShrink turned on and assuming they're doing it solely to piss off their users and SAN administrator.  You guys would never react that way in that case - so why do you in this case?
He's a Thief!
Secondly, going to the extreme by accusing the guy of "stealing".  This is a highly controversial and sensitive subject, and I'm definitely influenced by TechDirt on this one - but the logic is irrefutable.  You can't "steal" content, period.  There is nothing physical to "take away".  Just because someone else "has it" doesn't mean you still don't "have yours".  Mind you, I do look on this argument as splitting hairs - because it sure does feel like stealing, doesn't it?  But think about it.  It's not stealing - it's the possibility that he's receiving credit for your work that's "wrong".  In that respect, I can understand characterizing it as "stealing" - but not with attribution, whether expressly given or not.
Attribution Without Permission Is Still Stealing!
Third, would you feel any different if the content he'd "stolen" was attributed properly?  From the comments on Twitter this morning, most of you thought absolutely not.  And that's where I think you're absolutely wrong and short-sighted.  And this is the point I'd really like you to think about and take away from this post.  Think about the gains and losses of having someone re-post your content WITH attribution, but WITHOUT permission.  Stop reading now for a second, and please do that.  If you've been honest with yourself, I think you'll find there aren't any losses, and there are only gains. 
The "Loss Myths"
Did you "lose readers"?  No - because people reading your stuff on his site didn't find yours, so they weren't your readers to begin with.
Ah - but he's stolen search engine hits from you!  Maybe - but probably not.  If your blog/site is really as good as you (and I) think it is, then your site almost certainly ranks higher than his, so you should be getting those hits.  He's "earning" his own traffic through other means.  (And this guy, I'm sure, isn't out there trying to flog his site.)  Balance that with the fact that the attribution links to your site, so he's driving traffic and "respect" your way.  Think of your recent searches online.  I'm sure you've found hits (as I have) on "aggregator" sites - which are stupidly easy to recognize, aren't they?  Don't you click through to the source site?  Maybe not to read the original article, but definitely to see the author's other content?  Of course you do.
He's stolen ad revenue from you!  Again, I doubt it - same arguments as in the previous paragraph.  At the very worst, he's contributed to the devaluation of CPM rates across the internet by the watering down of ad clickthroughs.  But again, that shouldn't be a loss to you, since your site attracts knowledgeable SQL people - people advertisers want.
The Real and True Gains
Anyone who does see your stuff on his site and likes it will surely know where to go to get more.  And no - not his site.  Please be serious and think about how outrageous it would be for this guy to actually pass himself off as someone who could have produced that content.  Could he really land a job commensurate with that knowledge?  Get a book deal?  A speaking engagement?
You get validation that the great content you're giving away for free into the boundless internet is actually getting read and appreciated.  Sure, comments on your blog do that too - but this is one step further, IMO.
What The F*** Do I Know?
How can I comment on this, if I've never been plagiarized before?
That's not entirely true.  I recently discovered that SSAS-Info.com, run by Vidas Matelis (blog|twitter) - a very respected and knowledgeable SQL Server MVP (at least by me) had been reposting some of my content without permission on SSAS-Info.  I discovered this via Google Analytics of my own blog's activity showing hits coming from there, wondering why, and digging deeper into it.  Turns out that Vidas (just like many of us) scours the internet looking for good material.  When he finds it, he posts it on SSAS-Info.com - with a few important caveats.  He identifies the author and source on every post (not as clearly as I'd like), and only shows a paragraph of the article followed by a click-through to your site that hosts the complete article.
To be perfectly honest, my initial reaction was "WTF?  Where does he get off copying my stuff, making it look like he made it?"  (Of course, I didn't see my name there to start.)  However, after a little poking around, that initial feeling disappeared entirely. 
Was it that he only posted some of my article, and not all of it?  No - on reflection, I always hate it when I get RSS feeds that way.  I don't (intentionally) write my articles to have a teaser in the first paragraph to induce a click-through.  I want readers to read the whole thing.  They'll only read the whole thing if they like it, of course - and then my (and I think their) first reaction is to find out who wrote it so I can find more.
Was it because my name was there?  No - because to be honest, it wasn't very prominent, AND one could argue it made it look like I was a "columnist" there, and people could come back to SSAS-Info for more of my content.  (Never mind that fact is true - Vidas would have kept reposting new stuff from me, regardless of whether I found out.)
Was it because, by clicking through two other links (All Articles By "X" in the Latest Author Articles box, then the "read this" link) I discovered what Vidas was intentionally doing with his repostings?  Not exactly, although it reinforced the good feeling that I was getting.
My initial feeling disappeared because I realized I was getting exposure.  And that's one of the specific things I wanted by blogging.  Who cares if Vidas makes some money of ads on his site?  I don't have ads on mine at all.  Should I begrudge him for making money off me if I didn't want to make money off me?  Just because he's more aware, on-the-ball, or adept at doing that, I should try to stop him - even if doing so does nothing to increase my ability to do what he's doing?  Stupid isn't it?
As soon as I read his "procedure" I sent him an email asking him to post full versions of whatever he wanted to on his site.  I am now insanely happy with this arrangement - more so because Vidas decided to pick out my stuff all on his own without any lobbying from me.
The Verdict
As you've read, I'm sorely disappointed by the SQL Server "community leaders" reaction to this.  I think that those who have put in a lot of effort to educate the technical community about how to best use SQL Server would fly off the handle and jump to threats before considering a course of action that should be second-nature: education and understanding.  I do hope the site does change some things - attributions and permission would be a good start.  But more than that, I hope this guy hasn't written off the SQL speaker/blogger community as a bunch of selfish people.  I hope that he continues to "steal" content much like Vidas does to help promote the community that I'm a part of, and expand awareness that there is such a community.
No doubt there will be those who will plagiarize "better" than this guy - and intend to do it maliciously for personal gain.  Or some that refuse to attribute properly when asked.  For those, I don't mind taking advantage of the protections that copyright affords.  But rounding up a lynch mob to string up a guy on first contact is not the right thing to do.
Think.  Before you act.

Monday, January 11, 2010

Long Time Coming - my DW Server is Here!

I am feeling GREAT today - caused by only one event which is outweighing quite a few other negatives, like a multi-thousand dollar transmission repair bill.  Why is that?  I'm FINALLY getting my data warehouse server built. 
Muwahahahaha!



The delay has all been about logistics with an ever-decreasing resource base to work with to repurpose an existing server. But now the wait (for me) is over, and everyone else can now wait on me to deliver... even though they don't know they're waiting. Now I just have to make sure I don't get over-enthused, and still do enough work on the stuff that "has to get done" (read: Access sh*t) so that I don't get in trouble working on the BI stuff.

Tuesday, January 5, 2010

SSAS Calculated Measures for Business (I): Introduction

Using SQL Server Analysis Services to present a cube to business users can be extremely informative for them.  Such a cube is typically constructed from a data warehouse - in my case, a dimensionally-modeled Kimball Method warehouse.  That warehouse has a lot of data in it, but it's all "simple" data: direct costs and sale amounts of every invoice line item and periodic snapshots of inventory levels.  Using SSAS, aggregations of those facts can easily be presented: "total sales", "total costs", and "average inventory".  With a little more elbow grease, you can do some more complex calculations involving ratios (like price and margin) that aggregate correctly.  (Hint: prices and margins don't sum or average, and shouldn't be a column in your fact table!)  Using more complex calculations, it's also possible to project values into the future, and combine aggregated facts from different measure groups into a new measure.
In a series of blog posts starting with this one, I'll be describing how I put together some business calculations that matter to my organization - GMROI and Inventory Turns.  I'll start with some simple ones in this post, but then get into the projections and combined measure group calculations in subsequent posts.  I'm almost positive I've gone about them in a less than optimal manner - so if you know a better way, don't hesitate to leave a comment!  For a fantastic series on the power of SSAS, Bill Pearson has a virtual encyclopedia written on Database Journal - and more gets added every month.  Scroll down to the Introduction to MSSQL Server Analysis Services Series section on his bio page.
Simple Aggregates
Any column you've got in your fact table that isn't a foreign key is automatically set up as a measure by SQL Server Analysis Services.  Most of the time, the default aggregation (sum) is what you want to do to those facts, because they're usually dollar amounts or quantities.
If you think you want an average, or some other form of aggregate, be very mindful of what you're asking for.  Something as simple as an average may not be so simple to implement.  The built-in average aggregate in SSAS will divide the sum of the selected rows by the number of rows.  There isn't any way to change the choice of denominator - it's always based on row count.
Simple Calculated Measures
For anything more complex than the simple aggregates, you're going to have to get dirtier and make a calculated measure.  For those of us who aren't MDX wizards, the recommended solution of constructing an MDX query in SQL Server Management Studio isn't a great option.  I'd never get the calculation past a syntax check - I just can't drive MDX well at all yet.
What I can do is use the semi-structured interface in BIDS' Calculations tab in the cube.  For simple calculated measures, like one that gets margin or price, the setup is quite easy.  You pick a name for your measure, then type in the "expression" for it.  For our simple calculated measures, we don't need any fancy MDX functions - just simple division will do.  A margin calculated measure expression would be something as simple as "([Amount Sold] - [Amount Cost]) / [Amount Sold]".  (I'm assuming the sold amount will never be zero - you may find you have to use an IIF function here to protect your calculation.)
Earlier I said that you shouldn't have a measure like "price" in your fact table.  You shouldn't have a "margin" measure in there either, because you can't aggregate either of them.  It simply makes no sense to sum them - it doesn't result in a meaningful number.  You also can't use the average aggregate unless you can guarantee that each row in your fact table refers to only one "unit" - whatever your price is based on.  If you can't guarantee that, then an average will be incorrect.
What's Next?
Eventually, I want to get to a calculation for GMROI and Inventory Turns - but I'm going to lead you there slowly instead of by the zig-zag repetitive iteration that I went through to figure it out.  Be patient - next up is projecting measures out into the future.