Staging Source System Data
At the most recent PASS Community Summit, I managed to attend Alberto Ferrari and Marco Russo's session on their SQLBI Methodology for data warehousing. I highly recommend that you find out about it - it's a very practical "extension" to Kimball's methodology, incorporating some of the strengths of the Inmon CIF approach. I was pleasantly surprised to find that I was already on the path to implementing one of the core tenets of their approach - staging the source system data in as "unadulterated" a form as possible. Basically, they advocate extracting data from the source system without any transformations and dumping it to a persistent intermediate location. The primary reason for this is to act as a stable "safety net" for the rest of the ETL process. It serves this function in several ways:
- It greatly simplifies the initial extract - meaning the load on your source system will be as light and brief as possible.
- It decouples the "E" from the "TL" portion of your ETL, so that the more "fragile" part - the "TL" - can be restarted after a failure without impacting the source system or missing an extract window.
- The warehouse can always be reloaded from the ground up - much as a SQL Server database can be rolled forward from its chain of log backups.
- It satisfies concerns about auditability. The ETL team can always demonstrate that the data for each load came from the appropriate source.
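The extract phase described above can be sketched in a few lines. This is an illustrative Python sketch, not the post's actual SSIS package: CSV stands in for the staging format, and the table, column names, and sample rows are all hypothetical. The point is simply that the extract applies no transformations at all.

```python
import csv
from pathlib import Path

def stage_extract(columns, rows, dest: Path) -> int:
    """Dump source rows to a persistent staging file with no
    transformations - the staged file, not the source system, is what
    the downstream TL phase will read."""
    with dest.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)
    return len(rows)

# In practice the rows would come from a plain SELECT against the
# source system - no joins, no filters, no type coercion.
columns = ["InvoiceNumber", "CustomerId", "Amount"]
rows = [("INV-001", 42, "19.99"), ("INV-002", 7, "5.00")]
staged = stage_extract(columns, rows, Path("Invoices_staged.csv"))
print(staged)  # 2
```

Because the extract is this dumb, it is also this fast - all the fragile logic lives downstream, where it can be rerun against the staged file at leisure.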
One Step Further - Versioning
The one extension to that - at least I think it's an extension, not having read their papers - is how to handle versioning of this staged source system data. Like I said at the top of this post, my source system recently had a large change - we expanded key business fields like the invoice number. I'd dealt with minor changes before - usually by adding hardcoded date-detection logic to my packages so they'd behave differently before and after the cutoff date. I quickly concluded that this wasn't scalable, but I hadn't sat down to figure out a more workable solution. This larger change was the kick in the pants I needed - I couldn't keep a nasty version-handling tree embedded in my ETL packages any more.
I've chosen to stage my data as SSIS RAW files - for reasons that probably warrant a separate post. My solution was to mark my staged data with a version number, included in the file name itself, since RAW files are opaque and not easily viewable outside the data flow execution pipeline. When the source system metadata changes, I'll alter my extract package to handle the change and increment the version number of the resulting RAW file.
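The naming convention might look something like this - a hedged Python sketch, where the entity name, date format, and the specific version number are all hypothetical illustrations rather than the post's actual scheme:

```python
import re

CURRENT_VERSION = 3  # hypothetical; incremented whenever the extract's metadata changes

def staged_name(entity: str, load_date: str, version: int) -> str:
    """Encode the schema version in the file name, since a RAW file's
    contents can't easily be inspected outside the data flow."""
    return f"{entity}_{load_date}_v{version}.raw"

def parse_version(filename: str) -> int:
    """Recover the schema version from a staged file's name."""
    m = re.search(r"_v(\d+)\.raw$", filename)
    if not m:
        raise ValueError(f"no version marker in {filename!r}")
    return int(m.group(1))

name = staged_name("Invoices", "20091115", CURRENT_VERSION)
print(name)                 # Invoices_20091115_v3.raw
print(parse_version(name))  # 3
```

Putting the version in the name (rather than inside the file) means the TL phase can decide how to treat a file without ever opening it.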
The transform-load stage will then require modification to handle the metadata change. Once again, decoupling is the order of the day. For a particular day's load, I'll examine the staged source system data, pulling the version number of each file. If no file exists with the current version number, then whatever version of that file is in the folder is upgraded and reconstructed into the current version. This step doesn't remove the "original" staged file - it creates a second file. (This is to preserve the audit trail I talked about earlier.) The second part of the transformation phase will then take this "current version" and process it according to "current" rules.
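The upgrade step might be sketched like this - again an illustrative Python version with hypothetical names, and a placeholder where the real upgrade logic (reshaping the old layout into the current one) would go:

```python
import tempfile
from pathlib import Path

CURRENT_VERSION = 3  # hypothetical; bumped with each metadata change

def ensure_current_version(folder: Path, entity: str, load_date: str) -> Path:
    """Return a current-version staged file for the load date, upgrading
    an older version if necessary. The original file is left in place to
    preserve the audit trail."""
    current = folder / f"{entity}_{load_date}_v{CURRENT_VERSION}.raw"
    if current.exists():
        return current
    candidates = sorted(
        folder.glob(f"{entity}_{load_date}_v*.raw"),
        key=lambda p: int(p.stem.rsplit("_v", 1)[1]),
    )
    if not candidates:
        raise FileNotFoundError(f"no staged file for {entity} on {load_date}")
    old = candidates[-1]
    # Placeholder "upgrade": a real one would reconstruct the old layout
    # into the current one (e.g. widen the expanded invoice-number field).
    current.write_bytes(old.read_bytes())
    return current

folder = Path(tempfile.mkdtemp())
(folder / "Invoices_20091115_v2.raw").write_bytes(b"old-format rows")
upgraded = ensure_current_version(folder, "Invoices", "20091115")
print(upgraded.name)  # Invoices_20091115_v3.raw
print((folder / "Invoices_20091115_v2.raw").exists())  # True: original preserved
```

Note that the v2 file survives alongside the new v3 file - the "second file" behaviour described above - so the original extract can always be produced for an audit.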
The end result is that my collection of staged files from the source system may have multiple versions within the staging folder where I keep them: the original "as extracted" version, and the "current" version. Enough talking about it - it's time for me to go do it! If cold impersonal reality causes some alterations to this plan, I'll let you know.
Hi Todd, I wonder why you chose to use RAW files. As you mentioned, the obvious disadvantage is that one can't interrogate this data outside the Data Flow engine (though a free utility called RAW File Viewer is available that works with SSIS 2005). Also, this guarantees that any change in the metadata of the source system means a change in the transform source, or a new source adapter to read from the RAW file. If the data is placed in some free-form medium like a text file, or in relational form, e.g. in a database, then there is always an option to manipulate or filter the data through a stored procedure. Staging in RAW files makes the metadata tightly coupled, which cancels out the effect of decoupling E from TL :)
Thanks for reading - and yes, I hear what you're saying - but that's the behaviour I want! I'm going to post a follow up on this exact topic - because I am intentionally breaking compatibility between my E and TL layers. Perhaps I'm using the term "decouple" incorrectly... stay tuned...