Monday, December 28, 2009

Microsoft BI Conference in New Orleans in 2010?

According to Vidas Matelis' blog at SSAS-info.com, the Microsoft BI Conference for 2010 is to be held in conjunction with TechEd 2010 in New Orleans on June 7th.  The Microsoft BI Conference website doesn't make any mention of this change - it still states that Seattle is intended to be the site for the (now) bi-annual shindig.  The source of the news - Bruce Kyle, an ISV architect posted the news to the US ISV Dev Community Blog.  The blog mentions that "demand generation" will begin in early January - so I'm assuming that emails and the website will change soon - if the information is accurate.  The posting doesn't clarify if the BI conference will be at the same time as TechEd, or in the week before or after TechEd.
I'm hoping it's not - Seattle is so much easier for me to get to.  If the posting ends up being accurate, I think I'll have to find some creative ways to get there...
UPDATE: Yes, it is going to be co-located with TechEd North America 2010.  If you register for the BI Conference, you get access to the TechEd "BI" tracks, but can't get in to the other TechEd track sessions.  Suffering that restriction to your session choices knocks $800 off the registration price.

Saturday, December 26, 2009

Presenting at Vancouver PASS in January


I've weaseled my way to bore the pants off another session of the Vancouver PASS chapter's monthly meetings.  Sooner or later, the chapter leader Richard Baumet (Blog|Twitter) will catch on to the fact that nobody wants to hear me wax lyrical on Business Intelligence junk, and he'll politely decline next time I ask. 
All kidding aside, I'll try and make it worth your time!  The January topic is data warehousing - but with a "regular DBA" angle.  I hope to start a discussion about what makes a Data Warehouse database different from a "regular" OLTP database.  I'm going to put forward some reasons why it's OK to violate just about everything you've had hammered into you about normalization and other relational design best practices.  If you happened to make it to my talk at Vancouver TechFest, I did start to lay out some of the differences about halfway in to the presentation (recorded here) - but I skimmed the surface.
If you want to know why normalization isn't a good practice with a data warehouse database, why it's absolutely acceptable to have a dimension table with wide character columns and redundant data, and why we're really only interested in optimizing for read performance, you'll want to drop by.  I also hope to let you know what SQL Server Analysis Services brings to the party, and a little on how you can monitor your relational and SSAS databases to detect and optimize queries.  If we have time, I'll talk a little about the column-oriented database technology that Donald Farmer touched on at last month's meeting, and Dr. Dewitt talked about at the PASS Summit.
Hope to see you there on January 21st at the Sierra Systems building, 25th floor at 5:30pm!

Wednesday, December 9, 2009

Using RAW Files for Staging Data in SSIS

In my last post, I mentioned that I stored staged source system data in RAW files for SSIS as part of managing metadata changes. I knew that would be "controversial", and sure enough, the first comment on that post questioned my choice.  Now I'm  going to take the opportunity to explain why I chose that format for persisting source system staged data.  Please feel free to continue the discussion about that choice here.
My Criteria
The question of how to store staged data is fairly important - the staging format has to have a couple characteristics IMO:
  • Atomic Load Management - It must be easy to manage the "atomic unit" of a single extract for the data. By this I don't mean manipulating the contents of the staged data, I mean being able to back it up, restore it, and secure it - operations on the complete collection of tables that were extracted. It should be "difficult" to inadvertently manage only a part of the load.
  • Content Modification Control - It should be difficult to change the contents of the data. Although I'm lucky enough to not have to comply with strict data governance policies, it should never be "easy" to manipulate the contents of the data files - that defeats a few of the points from my (and Marco Russo/Alberto Ferrari's) reasons to stage the data "unadulterated".
  • Metadata Association - The associated metadata should accompany the data, and ideally be inseparable from it. It's good to be able to have "original" data from the source system, and good to have it in a format that's difficult to change. But all that is meaningless if the data can be easily misinterpreted if data types or sizes can be incorrectly applied to that data.
  • Sequential R/W Performance - The format used to store the data should be fairly performant for sequential reads and writes. Random read and/or extraction (sorting) performance are not required, but may be beneficial. Random write performance is irrelevant for this scenario.
  • Lightweight Access - The storage mechanism should be "lightweight" in that it would be best if additional software wasn't required to interact with the storage. If additional software is required, configuration should be as minimal as possible, and should consume few system resources during activity, leaving them for Integration Services to use.
  • Durability - Whatever form the data is stored in should be accessible for years to come. It shouldn't become obsolete or inaccessible over time.
If you think I've missed a good characteristic for the staged data to have - please discuss it in the comments. Some of the above requirements may be more or less important in certain situations, and may lead to different choices of format.

The Persistence Choices

Now that we have criteria for what's important in whatever form of persistence (which you may or may not agree with), we can now discuss the storage formats for the data. The options are almost limitless, as SSIS has a wide range of connectors available, and connectors can be custom-coded for virtually any source. I'll focus on a smaller group of "standard" sources - but please feel free to suggest a source I've left off the list if you think it's a valid alternative.

SQL Server - There are a few "subchoices" here. You could store one load's worth of data per database, or store multiple loads in one database separating them by schema or table prefix.

Flat Files - Pretty self-explanatory - store data one table per file, storing one load per folder or prefixing the file names.

RAW Files - Same naming mechanism options as Flat Files.

XML/Access/Other - some other form of marked-up or hosted file.

Evaluating the Choices

Let's go one by one, applying the criteria as objectively as possible.

SQL Server

Storing the extracts in SQL Server is certainly a reasonable choice. I'd say that it scores pretty high on the atomic load management criteria - you can easily back up, restore, and secure the complete load contents. It also takes special actions to break that atomicity and manage only certain tables rather than the whole. Content modification control doesn't score so highly for me, as SQL has a fantastic ability to easily modify data. That said, you can lock it down pretty tightly - but that's an additional step. Metadata association in SQL is near-perfect - everything is strongly typed, and if you really need to, you can constrain and relate the data as well. (Although I would definitely recommend against that.) Sequential read/write performance of SQL is also very high, and it does come with the added benefit of being able to request the data in sorted order, allowing SSIS packages to avoid having to use the bottleneck Sort component. On the lightweight scale, SQL doesn't score that highly for me. It can be a pretty sizeable job to install and configure SQL - especially if it is just to store your staged data. On top of that, if you're going to realize the benefits on the other criteria, you're going to have to spend some time configuring the server appropriately. SQL can also be fairly heavy on resources, and doesn't "give back" memory once interaction with the storage is done. I think SQL passes with flying colours on the durability front - Microsoft isn't about to make restores incompatible with future SQL versions!

Flat Files
One of the worst choices, IMO - and I know that might be controversial at first glance, but evaluate the criteria along with me. Atomic load management is almost as good as a SQL repository - it's easy to back up and restore files, perhaps easier than SQL in some environments because file-backup software is more common than SQL backup. One drawback is that it's not difficult to only manage some of the files in the set - you can inadvertently miss backing a few up without being a dummy. Flat files also fall down heavily on content management control. You can control access to the file with ACLs - no doubt about that - but anyone that does have access to read/write can seriously screw up the data. Metadata association? Non-existent - and probably the worst drawback of flat files. It's way too easy for a package developer to misinterpret what data types/sizes exist in the data. In the particular scenario I find myself in - certain text fields are now larger than they used to be - there is absolutely no indication to package developers using the staged data that they can't continue to operate "as usual". If they don't change the downstream transformations, serious problems will eventually result, perhaps in a cryptic way.  Sequential access performance is just fine for flat files, but there's no added benefit of being able to get the data in an arbitrary sort order. It's hard to beat flat files for their light weight though - the OS handles everything for us, and pretty fast too. Durability isn't an issue here - the OS filesystem will be around for the forseeable future.

RAW Files
My personal choice, because they score well on my criteria. Just like flat files, they're very easy to manage atomically, but do expose the possibility (however remote) of being incompletely managed. A very good thing about RAW files is that they are very difficult to edit - you have to construct an SSIS package to do that, even if you have write permissions. Metadata association is also very strong, and in fact better than SQL Server because they already use the SSIS type system, and there are no implied conversions going on. Sequential read and write are the only things RAW files can do - and they do it very well.  In fact, the SSIS team says RAW files perform better than flat files. RAW files are also extremely lightweight - definitely lighter than SQL, and even lighter than flat files because there's no conversions required. Durability of the format is an open question - I don't know if the SSIS team placed an internal version number inside the RAW format in order to plan for changes. They may not have, given their poor responses to suggestions to improve the format. Despite this, I don't see Microsoft deprecating the format unless they deprecate the entire Integration Services product line, which I don't see happening.

XML/Access/Other "Marked Up" File

This is a mixed bag - I won't go into specifics here, aside from saying that selecting any format in this category is likely to have more drawbacks than strengths when compared to the first three options. Feel free to disagree.

The Best Staging Storage Format
The best staging storage format always depends on your specific situation! Some of you may weight some of my criteria more heavily than others. Some of you may have different criteria that are dictated to you by your environment. But whatever your reasons are for selecting a staging format - make sure you have some! If the above analysis points out some things you need haven't thought about - take the time to do so. Ensuring your source system data is properly persisted is just like formulating a backup strategy for SQL Server - you don't really have a strategy until you've put it to the test in a restore scenario. Don't give yourself a false sense of security that you can reload your entire data warehouse from scratch should you find a mistake in your transformation layer - test it out and make sure you can!

Friday, December 4, 2009

Allowing for Gradual Source System Changes in ETL

Recently, the main source system for my company's (eventual) data warehouse underwent a significant change.  It's an in-house developed and maintained system - so most changes are very gradual over time, and the developers are very aware to keep the interface to external systems as static as possible.  However, there always comes a time when the interface has to change, regardless of who is managing change in your operational system.  I've thought through the issue, and have a solution I'm going to try out.
Staging Source System Data
At the most recent PASS Community Summit, I managed to attend Alberto Ferrari and Marco Russo's session on their SQLBI Methodology for data warehousing.  I highly recommend that you find out about it - it's a very practical "extension" to Kimball's methodology, incorporating some of the strengths of the Inmon CIF approach.  I was pleasantly surprised to find that I was already on the path to implementing one of the core tenets of their approach - staging the source system data as "unadulterated" a form as possible.  Basically, they advocate extracting data from the source system without any transformations, dumping it to a persistent intermediate location.  The primary reason for this is to act as a stable "safety net" for the rest of the ETL process.  It serves this function in several ways:
  • It highly simplifies the initial extract - meaning the load on your source system will probably be as light and quick as possible.
  • It decouples the "E" from the "TL" portions of your ETL, such that the more "fragile" portion of ETL - the "TL" can be restarted in case of failure without worrying about impacting the source system or missing an extract window.
  • The warehouse can always be reloaded from the ground up.  Much like you can restore a SQL Server database from logs alone, without any data backups.
  • It satisfies concerns about auditability.  The ETL team can always demonstrate that the data for each load came from the appropriate source.
It's a great idea, and one that I appear to have inadvertently subscribed to.  I really need to read over their paper to pick up what I'm sure are other gems about practical design and operation of data warehousing.
One Step Further - Versioning
The one extension to that - at least I think it is, not having read their papers - is how to handle versioning of this staged source system data.  Like I said at the top of this post, my source system recently had a large change - we expanded key business fields like the invoice number.  I'd dealt with minor changes before - usually by adding some hardcoded date-detection logic into my packages to behave differently before and after this cutoff date.  I quickly came to the conclusion that this wasn't scalable, but hadn't sat down to figure out a more workable solution.  But this larger change was the kick in the pants that I needed - I couldn't keep a nasty version-handling tree embedded in my ETL packages any more.
I've chosen to stage my data as SSIS RAW files - for reasons that probably warrant a separate post. My solution was to mark my staged data with a version number, included in the file name itself as RAW files are opaque and not easily viewable outside of the data flow execution pipeline.  When the source system metadata changes, I'll alter my extract package to handle the change, and increment the version number of the resulting RAW file.
The transform-load stage will then require modification to handle the metadata change.  Once again, decoupling is the order of the day.  For a particular day's load, I'll examine the staged source system data, pulling the version numbers of each file.  If a file doesn't exist with the current version number, then whatever version of that file that is in the folder is then upgraded and reconstructed into the current version.  This step doesn't remove the "original" staged file - it creates a second file.  (This is to preserve the audit trail I talked about earlier.)  The second part of the transformation phase will then take this "current version" and process it according to "current" rules.
The end result is that my collection of staged files from the source system may have multiple versions within the staging folder where I keep them: the original "as extracted" version, and the "current" version.  Enough talking about it - it's time for me to go do it!  If cold impersonal reality causes some alterations to this plan, I'll let you know.