Pages

Thursday, April 8, 2010

Kimball SCD Internals: A High-Performance Unique Queue

In order to satisfy some of the performance desires that I had for the Kimball Method Slowly Changing Dimension component, I had to implement some code constructs to support the processing.  This post describes one of those constructs - what I call a HashQueue.  See Kimball SCD Internals labeled posts for more information.
The Problem
The Kimball SCD component utilizes multiple threads to process data in several stages.  In order to pass work from the "matching" stage to the "processing" stage, I needed a "collection" type of construct that would deliver the following properties:
  1. It must be thread-safe for readers and writers.  There will be many threads attempting to add and remove items from a single construct.
  2. It must detect and reject duplicate items.  The "matching" process could result in two threads arriving at the same answer independently - but the answer must only be processed once.
  3. It must allow a non-specific element to be retrieved.  When the "processing" threads retrieve items, they don't care which item in the collection they get, they just want a "next" one.  Specifically, they don't know anything about any of the objects that are in the collection.
  4. All operations should be O(1), or O(log n) at worst - no O(n).  The collections could potentially contain millions of items, and O(n) operations cost dearly at that scale.
The Possibilities
The built-in .Net collections that could possibly be used are constructs like Lists, Queues, and Hashtables.  But those all didn't fit some part of the bill.  First, all of them would have to be "wrapped" to provide thread-safe behaviour, since they don't come that way out of the box.  Lists have O(n) performance for interrogations and removals - unless removals only occur at the end of the list, which would result in actual O(1) performance there.  Lists wouldn't require unique items - unless you paid the O(n) penalty to interrogate first.  Queues failed the uniqueness requirement as well, having the same O(n) interrogation flaw.  Hashtables looked promising - with O(1) operations on interrogations, additions, and removals - they were almost OK.  Hashtables can't retrieve an element without knowing the "key" for that element - which doesn't fit the "non-specific" requirement.  Sure, you could pick the "first" item off of the Values property - but that would require (internally to the Hashtable) traversing the whole structure in order to build the Values property - an O(n) operation.
The Plan
What I decided on was a hybrid between a Hashtable and a Queue.  I'd use all the high-performance capabilities of the Hashtable to manage uniqueness on a Queue, and take advantage of the Queue's ability to retrieve a single element at a time.  The fact that a Queue encapsulated FIFO behaviour was comfortably nice - but not required.  The concept of the Dequeue operation both returning the item as well as removing it from the collection in an internally thread-safe manner translated well - I didn't have to manage locking over two separate external calls.  I took that atomic concept and applied it to the Enqueue process as well.
The Result
What I ended up with was a very well performing construct for the constraints I'd wanted.  It's not terribly generic - it doesn't implement any interfaces, for example - so it may only be useful for the narrow purpose for which I'm using it.  However, it does demonstrate that even if you don't have a perfect match in the built-in .Net framework, you can construct something yourself to accomplish your goals.
The HashQueue:
  • Prevents duplicate items from appearing on the collection over the lifetime of the collection
  • Has O(1) behaviour on "adds" (Enqueues)
  • Has O(1) behaviour on "removes" (Dequeues)
  • Encapsulates all required thread safety - no external locking via a public SyncRoot-type property is required.
    #region CLASS: HashQueue
    
/// <summary>
    /// An implementation of a Queue that only permits unique items to exist on it,
    /// and can be interrogated as to whether a specific item exists on the queue
    /// in a high-performance fashion (O(1) rather than iterating over all elements,
    /// which would be O(n))
    /// </summary>
    /// <typeparam name="K"></typeparam>
    
public class HashQueue<K>
    
{
        #region
Private Variables
        
private readonly object _lock = new object();
        
private Queue<K> _queue;
        
private Hashtable _hashtable;
        
#endregion

        #region Constructor
        
public HashQueue()
        
{
            
this._queue = new Queue<K>();
            
this._hashtable = new Hashtable();
        
}
        #endregion

        #region Properties
        
public int Count
        {
            get {
lock (this._lock) { return this._queue.Count; } }
        }

        
public bool Contains(K key)
        
{
            
lock (this._lock) { return this._hashtable.Contains(key); }
        }
        #endregion

        #region Methods
        
public bool Enqueue(K key)
        
{
            
lock (this._lock)
            
{
                
if (!this._hashtable.Contains(key))
                
{
                    
this._queue.Enqueue(key);
                    
this._hashtable.Add(key, true);
                    
return true;
                
}
                
else
                
{
                    
return false;
                
}
            }
        }

        
public int EnqueueRange(IEnumerable keySet)
        
{
            
int numItemsQueued = 0;

            
lock (this._lock)
            
{
                
foreach (K key in keySet)
                
{
                    
if (!this._hashtable.Contains(key))
                    
{
                        
this._queue.Enqueue(key);
                        
this._hashtable.Add(key, true);
                        
numItemsQueued++;
                    
}
                }
            }

            
return numItemsQueued;
        
}

        
public K Dequeue()
        
{
            K key
;
            
lock (this._lock)
            
{
                
if (this._queue.Count == 0)
                
{
                    key
= default(K);
                
}
                
else
                
{
                    key
= this._queue.Dequeue();
                
}
            }
            
return key;
        
}
        #endregion
    }
    #endregion


For the latest source of this construct, visit the Kimball Method Slowly Changing Dimension component on CodePlex, and retrieve the HashQueue.cs file from the source.

1 comment:

  1. I must say... your a smart dude! I love the kimball SCD, thanks for this.

    ReplyDelete