Monday, April 26, 2010

Building Multiple VMs Easily in Microsoft Virtual PC

I've currently got a couple virtual machines that I use for development purposes.  Why? Because I don't care to clutter up my primary workspace with side-by-side installations of three different versions of Visual Studio, SQL Server, Office, etc...  I set up a VM with the specific software I need for development and testing purposes, and whenever I need that environment, I fire it up.  The benefits for me are a faster and less cluttered "primary" environment, with a completely segregated set of legacy environments.
Creating a Virtual Farm of VMs
Of course, setting up a variety of VMs with different sets of developer tools and services can take a lot of time - and a lot of disk space too.  Unless you make smart use of differencing disks, that is.  I found a great solution courtesy Andrew Connell - and if you have to spin up VMs every once in a while, it ought to work for you too.  There's a little more work up-front, but a lot less later.
Read his full article here: HOWTO: Use Virtual PC's Differencing Disks to your Advantage.
Of course - this is a reference for me, and I can find this post easier than finding his...
Part I: Make a "Base" VHD
  1. Start up a fresh VM, install an OS on it, and whatever "standard" apps/tools you'll want on ALL of your images of that OS.
  2. Defragment the VM filesystem inside the VM.
  3. Mount the Virtual Disk Precompactor ISO inside the VM to zero out the now-consolidated empty space.
  4. Shut down the VM and compact the VHD
  5. Delete the VMC/VMCX file for this VM, and mark the VHD read-only.
Part II: Build Your Real VM Using the "Base" VHD
  1. Create a new VM, but don't create a new disk, or use an existing disk - tell Virtual PC to create a differencing disk from your "base" VHD.
  2. Fire up the VM, make any changes you want!
Repeat from Part II, step 1 whenever you want another VM based on the OS/tools you happen to have in one of your "base" VHDs.
Thanks Andrew!

Wednesday, April 14, 2010

Working With UTC Dates in Integration Services

It's quite common to have to take some kind of a string representation of a date and convert it to a true date type, or convert a numeric date representation into a date - but I've recently had the requirement to process UTC dates in my packages.  In order to process this information in SSIS, the DT_DBTIMESTAMPOFFSET data type is provided, which stores a high-precision date and time as well as a time zone offset.
So how do you work with a this type of data?  Here are a couple considerations to be aware of.
Know What You're Looking At
One of the most common sources for data in Integration Services is your plain flat file - and in my case, a recent one included UTC date information.  Plain text files don't have unambiguous formats!  The trick here is understanding exactly how the author of the file encoded the time zone information.  This isn't much different from understanding a localized date/time string - except that we typically don't "think" in time-zone offsets, so they can become confusing pretty fast.
Convert Early!
I highly recommend performing all conversions as soon as possible, and VERY explicitly.  I always hated VB6 because it allowed so many implicit conversions without any warnings - and most of your destination components will allow this too.  So if you're storing UTC data in your destination, be sure to convert it in your data flow to a DT_DBTIMESTAMPOFFSET as soon as possible, using the format described below. At least then you'll be able to see any odd conversion behaviour in a data viewer, with Profiler, or whatever tools you have at your disposal.  If you push a string-typed column straight through to the destination, problems with your conversion format will appear out of nowhere.
Converting String UTC Dates
Converting string typed data to a real UTC date type is just as finicky as converting strings to regular dates.  Any ambiguity in the format can cause SSIS (or any other tool) to misinterpret your conversion and you'll end up with an error, or worse - silent bad data.  Read over my previous article on converting strings to dates to get my recommendations for the basic format and tools. 
To convert to a DT_DBTIMESTAMPOFFSET, we just need to build a little on the foundation laid there.  The unambiguous format for converting to an SSIS UTC datetime is:
yyyy-MM-ddThh:mm:ss.fffffff+hh:mm
or
yyyy-MM-ddThh:mm:ss.fffffff-hh:mm
The number of "f"s in that format (the fraction of seconds) is flexible.  You can specify none, or up to seven decimal places.  In order to convert that to a UTC date in SSIS, you need to perform a regular cast operation, but you also have to tell SSIS how many decimal places of fractional seconds you want it to store as well.  So something like this will work, as it specifies three decimal places for seconds:
(DT_DBTIMESTAMPOFFSET, 3)"2010-03-05T16:46.154-08:00"
Storing UTC Dates
Now that you have a UTC date type, you probably need to store it somewhere.  In SQL Server, the appropriate column data type for this SSIS type is datetimeoffset.  Sure, you could store it in a regular datetime column - but you just put in a bunch of work to convert it to the most accurate and appropriate data type - keep that work intact!

Thursday, April 8, 2010

Kimball SCD Internals: A High-Performance Unique Queue

In order to satisfy some of the performance desires that I had for the Kimball Method Slowly Changing Dimension component, I had to implement some code constructs to support the processing.  This post describes one of those constructs - what I call a HashQueue.  See Kimball SCD Internals labeled posts for more information.
The Problem
The Kimball SCD component utilizes multiple threads to process data in several stages.  In order to pass work from the "matching" stage to the "processing" stage, I needed a "collection" type of construct that would deliver the following properties:
  1. It must be thread-safe for readers and writers.  There will be many threads attempting to add and remove items from a single construct.
  2. It must detect and reject duplicate items.  The "matching" process could result in two threads arriving at the same answer independently - but the answer must only be processed once.
  3. It must allow a non-specific element to be retrieved.  When the "processing" threads retrieve items, they don't care which item in the collection they get, they just want a "next" one.  Specifically, they don't know anything about any of the objects that are in the collection.
  4. All operations should be O(1), or O(log n) at worst - no O(n).  The collections could potentially contain millions of items, and O(n) operations cost dearly at that scale.
The Possibilities
The built-in .Net collections that could possibly be used are constructs like Lists, Queues, and Hashtables.  But those all didn't fit some part of the bill.  First, all of them would have to be "wrapped" to provide thread-safe behaviour, since they don't come that way out of the box.  Lists have O(n) performance for interrogations and removals - unless removals only occur at the end of the list, which would result in actual O(1) performance there.  Lists wouldn't require unique items - unless you paid the O(n) penalty to interrogate first.  Queues failed the uniqueness requirement as well, having the same O(n) interrogation flaw.  Hashtables looked promising - with O(1) operations on interrogations, additions, and removals - they were almost OK.  Hashtables can't retrieve an element without knowing the "key" for that element - which doesn't fit the "non-specific" requirement.  Sure, you could pick the "first" item off of the Values property - but that would require (internally to the Hashtable) traversing the whole structure in order to build the Values property - an O(n) operation.
The Plan
What I decided on was a hybrid between a Hashtable and a Queue.  I'd use all the high-performance capabilities of the Hashtable to manage uniqueness on a Queue, and take advantage of the Queue's ability to retrieve a single element at a time.  The fact that a Queue encapsulated FIFO behaviour was comfortably nice - but not required.  The concept of the Dequeue operation both returning the item as well as removing it from the collection in an internally thread-safe manner translated well - I didn't have to manage locking over two separate external calls.  I took that atomic concept and applied it to the Enqueue process as well.
The Result
What I ended up with was a very well performing construct for the constraints I'd wanted.  It's not terribly generic - it doesn't implement any interfaces, for example - so it may only be useful for the narrow purpose for which I'm using it.  However, it does demonstrate that even if you don't have a perfect match in the built-in .Net framework, you can construct something yourself to accomplish your goals.
The HashQueue:
  • Prevents duplicate items from appearing on the collection over the lifetime of the collection
  • Has O(1) behaviour on "adds" (Enqueues)
  • Has O(1) behaviour on "removes" (Dequeues)
  • Encapsulates all required thread safety - no external locking via a public SyncRoot-type property is required.
    #region CLASS: HashQueue
    
/// <summary>
    /// An implementation of a Queue that only permits unique items to exist on it,
    /// and can be interrogated as to whether a specific item exists on the queue
    /// in a high-performance fashion (O(1) rather than iterating over all elements,
    /// which would be O(n))
    /// </summary>
    /// <typeparam name="K"></typeparam>
    
public class HashQueue<K>
    
{
        #region
Private Variables
        
private readonly object _lock = new object();
        
private Queue<K> _queue;
        
private Hashtable _hashtable;
        
#endregion

        #region Constructor
        
public HashQueue()
        
{
            
this._queue = new Queue<K>();
            
this._hashtable = new Hashtable();
        
}
        #endregion

        #region Properties
        
public int Count
        {
            get {
lock (this._lock) { return this._queue.Count; } }
        }

        
public bool Contains(K key)
        
{
            
lock (this._lock) { return this._hashtable.Contains(key); }
        }
        #endregion

        #region Methods
        
public bool Enqueue(K key)
        
{
            
lock (this._lock)
            
{
                
if (!this._hashtable.Contains(key))
                
{
                    
this._queue.Enqueue(key);
                    
this._hashtable.Add(key, true);
                    
return true;
                
}
                
else
                
{
                    
return false;
                
}
            }
        }

        
public int EnqueueRange(IEnumerable keySet)
        
{
            
int numItemsQueued = 0;

            
lock (this._lock)
            
{
                
foreach (K key in keySet)
                
{
                    
if (!this._hashtable.Contains(key))
                    
{
                        
this._queue.Enqueue(key);
                        
this._hashtable.Add(key, true);
                        
numItemsQueued++;
                    
}
                }
            }

            
return numItemsQueued;
        
}

        
public K Dequeue()
        
{
            K key
;
            
lock (this._lock)
            
{
                
if (this._queue.Count == 0)
                
{
                    key
= default(K);
                
}
                
else
                
{
                    key
= this._queue.Dequeue();
                
}
            }
            
return key;
        
}
        #endregion
    }
    #endregion


For the latest source of this construct, visit the Kimball Method Slowly Changing Dimension component on CodePlex, and retrieve the HashQueue.cs file from the source.

Friday, April 2, 2010

SQLSaturday #41 - Atlanta - Slowly Changing Dimensions

I'd love to be there - but it's really far away from where I live.  A couple months ago though, I started talking to Julie Smith (blog) as she had a question or two about the Kimball Method Slowly Changing Dimension component.  Turns out that Julie is going to do a session at the Atlanta SQLSaturday on processing dimensions using a few different methods - with the KSCD being one of them.  (I still can't believe that Julie asked permission to do that talk - I'm quite flattered and ecstatic.)  Here's the abstract:
This session will address the different ways to handle slowly changing dimensions with SSIS. I'll cover the native SCD Component, the SQL Merge Statement , and the new Codeplex Kimball Slowly Changing Dimension Component. The goal is to show how to incorporate auditing, deal with Type 1 and Type 2 attributes, and most importantly, how to get it done QUICKLY!
I'd encourage anyone who reads my drivel, and lives within three hours of "hotlanta" to take in Julie's session.  I'm certain that Julie will do a fantastic job of highlighting the strengths and weaknesses of each method - reading her blog and emails on the subject leave no doubt she's very sharp.
It's the kind of session I'd considered doing myself - and in fact recently submitted to the closest SQLSaturday I might be able to get to (in Portland).  Of course, I'll have to work very hard at being unbiased.  If I can possibly get Julie's permission ;) I'm sure to lift a few of the more objective ideas from her presentation to even out my own.
Good luck, Julie!