My Criteria
The question of how to store staged data is fairly important - the staging format has to have a couple characteristics IMO:
- Atomic Load Management - It must be easy to manage the "atomic unit" of a single extract for the data. By this I don't mean manipulating the contents of the staged data, I mean being able to back it up, restore it, and secure it - operations on the complete collection of tables that were extracted. It should be "difficult" to inadvertently manage only a part of the load.
- Content Modification Control - It should be difficult to change the contents of the data. Although I'm lucky enough to not have to comply with strict data governance policies, it should never be "easy" to manipulate the contents of the data files - that defeats a few of the points from my (and Marco Russo/Alberto Ferrari's) reasons to stage the data "unadulterated".
- Metadata Association - The associated metadata should accompany the data, and ideally be inseparable from it. It's good to be able to have "original" data from the source system, and good to have it in a format that's difficult to change. But all that is meaningless if the data can be easily misinterpreted if data types or sizes can be incorrectly applied to that data.
- Sequential R/W Performance - The format used to store the data should be fairly performant for sequential reads and writes. Random read and/or extraction (sorting) performance are not required, but may be beneficial. Random write performance is irrelevant for this scenario.
- Lightweight Access - The storage mechanism should be "lightweight" in that it would be best if additional software wasn't required to interact with the storage. If additional software is required, configuration should be as minimal as possible, and should consume few system resources during activity, leaving them for Integration Services to use.
- Durability - Whatever form the data is stored in should be accessible for years to come. It shouldn't become obsolete or inaccessible over time.
The Persistence Choices
Now that we have criteria for what's important in whatever form of persistence (which you may or may not agree with), we can now discuss the storage formats for the data. The options are almost limitless, as SSIS has a wide range of connectors available, and connectors can be custom-coded for virtually any source. I'll focus on a smaller group of "standard" sources - but please feel free to suggest a source I've left off the list if you think it's a valid alternative.
SQL Server - There are a few "subchoices" here. You could store one load's worth of data per database, or store multiple loads in one database separating them by schema or table prefix.
Flat Files - Pretty self-explanatory - store data one table per file, storing one load per folder or prefixing the file names.
RAW Files - Same naming mechanism options as Flat Files.
XML/Access/Other - some other form of marked-up or hosted file.
Evaluating the Choices
Let's go one by one, applying the criteria as objectively as possible.
SQL Server
Storing the extracts in SQL Server is certainly a reasonable choice. I'd say that it scores pretty high on the atomic load management criteria - you can easily back up, restore, and secure the complete load contents. It also takes special actions to break that atomicity and manage only certain tables rather than the whole. Content modification control doesn't score so highly for me, as SQL has a fantastic ability to easily modify data. That said, you can lock it down pretty tightly - but that's an additional step. Metadata association in SQL is near-perfect - everything is strongly typed, and if you really need to, you can constrain and relate the data as well. (Although I would definitely recommend against that.) Sequential read/write performance of SQL is also very high, and it does come with the added benefit of being able to request the data in sorted order, allowing SSIS packages to avoid having to use the bottleneck Sort component. On the lightweight scale, SQL doesn't score that highly for me. It can be a pretty sizeable job to install and configure SQL - especially if it is just to store your staged data. On top of that, if you're going to realize the benefits on the other criteria, you're going to have to spend some time configuring the server appropriately. SQL can also be fairly heavy on resources, and doesn't "give back" memory once interaction with the storage is done. I think SQL passes with flying colours on the durability front - Microsoft isn't about to make restores incompatible with future SQL versions!
Flat Files
One of the worst choices, IMO - and I know that might be controversial at first glance, but evaluate the criteria along with me. Atomic load management is almost as good as a SQL repository - it's easy to back up and restore files, perhaps easier than SQL in some environments because file-backup software is more common than SQL backup. One drawback is that it's not difficult to only manage some of the files in the set - you can inadvertently miss backing a few up without being a dummy. Flat files also fall down heavily on content management control. You can control access to the file with ACLs - no doubt about that - but anyone that does have access to read/write can seriously screw up the data. Metadata association? Non-existent - and probably the worst drawback of flat files. It's way too easy for a package developer to misinterpret what data types/sizes exist in the data. In the particular scenario I find myself in - certain text fields are now larger than they used to be - there is absolutely no indication to package developers using the staged data that they can't continue to operate "as usual". If they don't change the downstream transformations, serious problems will eventually result, perhaps in a cryptic way. Sequential access performance is just fine for flat files, but there's no added benefit of being able to get the data in an arbitrary sort order. It's hard to beat flat files for their light weight though - the OS handles everything for us, and pretty fast too. Durability isn't an issue here - the OS filesystem will be around for the forseeable future.
RAW Files
My personal choice, because they score well on my criteria. Just like flat files, they're very easy to manage atomically, but do expose the possibility (however remote) of being incompletely managed. A very good thing about RAW files is that they are very difficult to edit - you have to construct an SSIS package to do that, even if you have write permissions. Metadata association is also very strong, and in fact better than SQL Server because they already use the SSIS type system, and there are no implied conversions going on. Sequential read and write are the only things RAW files can do - and they do it very well. In fact, the SSIS team says RAW files perform better than flat files. RAW files are also extremely lightweight - definitely lighter than SQL, and even lighter than flat files because there's no conversions required. Durability of the format is an open question - I don't know if the SSIS team placed an internal version number inside the RAW format in order to plan for changes. They may not have, given their poor responses to suggestions to improve the format. Despite this, I don't see Microsoft deprecating the format unless they deprecate the entire Integration Services product line, which I don't see happening.
XML/Access/Other "Marked Up" File
This is a mixed bag - I won't go into specifics here, aside from saying that selecting any format in this category is likely to have more drawbacks than strengths when compared to the first three options. Feel free to disagree.
The Best Staging Storage Format
The best staging storage format always depends on your specific situation! Some of you may weight some of my criteria more heavily than others. Some of you may have different criteria that are dictated to you by your environment. But whatever your reasons are for selecting a staging format - make sure you have some! If the above analysis points out some things you need haven't thought about - take the time to do so. Ensuring your source system data is properly persisted is just like formulating a backup strategy for SQL Server - you don't really have a strategy until you've put it to the test in a restore scenario. Don't give yourself a false sense of security that you can reload your entire data warehouse from scratch should you find a mistake in your transformation layer - test it out and make sure you can!
Hi Todd,
ReplyDeleteGreat post. Again on this subject I have some different opinion.
I agree with your opinion about all other different options, but RAW Files and SQL Server.
Staging in my viewpoint should not be done keeping in view only the data warehousing technology platform being used. When you stage data in RAW Files, this staged data can't be used by other tools except SSIS. For eg, Say Sony has internal sub-companies like Sony Singapore and Sony UK, and both uses different data warehousing tools and this staged data is the delta that gets extracted each time from the transactional system.
Staging in RAW files in this case would force all sub-companies to use SSIS which is not desired. Do let me know your thoughts on the same and in case if you find discussing on comments inconvenient, I can be reached on siddhumehta@gmail.com
"In fact, the SSIS team says RAW files perform better than flat files"
ReplyDeleteWell actually...
I recently saw some perf tests comparing RAW File vs Flat File. It turns out that the flat file is actually faster (especially with fast disks). It apparently does some buffering/read ahead tricks that give it an edge over the RAW file reader.
The RAW file does have the advantage of containing its metadata, and I still recommend it for temporary staging.
For long term / persisted staging, I'd have to go with SQL Server.
John Welch recently did a basic comparison of RAW vs Flat File performance which was pretty interesting. I'm sure the answer "depends" on how insane you need the performance to be for your ETL window, and what kind of hardware you have access to.
ReplyDeleteMatt - if you happen back this way, tell me why you like SQL storage better. Are your criteria different than mine, did you evaluate those criteria differently, or weight them differently?