Wednesday, July 28, 2010

Converting to a Star Schema for Performance - Stealth Mode

I was recently contacted by an end user in my organization about query performance problems.  Turns out they were using an Excel pivot table to query a 17GB table.  The server powering this database is what I'd describe (at best) as "departmental" - a 32 bit OS, only 4GB of RAM, and plenty of other databases on it that typically get used more often.  The table in question had 13 million rows, and the users' standard use of the table was a fairly wide-open template Excel pivot table that they narrowed down later.  The table was also fairly wide, with quite a few character columns.
The Best Solution... is Not Appropriate
In a perfect world, I would take the time to solve this problem properly by completely rearchitecting the table (and the rest of the tables in that database) into a star schema and placing a cube on top of it.  The primary purpose of the database in question was analytics, hence the pivot table.  Problem is, I truly didn't have the time to investigate all of the stakeholders of this database.  I knew that there were some automated .Net processes that updated it, but there very well could have been other readers and writers, so I worked for a quick fix that didn't change the appearance of the table or kill insert performance.  The fix I'm going to describe works - but I completely expect that the true DBAs out there might have a few more suggestions for me that I might want to consider.  So - as you should with every other suggestion or piece of advice you read on the internet - don't take this to the bank, test it in your scenario.  This advice may be worth no more than exactly what you paid for it.
The Duct Tape Epiphany
Upon examination of the table structure and contents, I found that quite a few of the character columns had a very low cardinality - which means they had very few different values within the table.  The net effect of that is that I had a table with 13 million rows where the vast majority of those rows had redundant, repeating values.  My natural reaction was to split that column out into another table and reference it with a foreign key.  But like I said earlier, that would change how everyone would have to interact with the table.
Unless... I covered up that underlying schema change with a view (for readers), and decorated the view with INSTEAD OF triggers for inserts and deletes!  (I didn't need to support updates, although I'm almost certain you can extend to do this very easily.)  I'm blogging about this because it may help out on a thread in the Integration Services forums - hopefully it does at least that.  Once again, I'll state that there are experts out there who know way more about the query engine, performance, locks, and all that important stuff than I do.  They'll probably tell you this method has a few pitfalls - and I'm hoping they tell me too.  But for the past month, it's been working better than fine for me.  So on to how it's done, followed by a concrete example!
Choosing Lookups
The first step was identifying individual columns and groups of columns to separate from the main table in order to replace them with lookup values.  The columns I looked at first were the larger ones - CHARs and VARCHARs.  I stood to gain the most "compression" from converting those columns to lookups.  From there, I had to test the range of those fields - how many unique values did they hold?  The cardinality of that column relative to the total row count basically equates to the compression level I'd get by moving it off to a lookup table.  I wasn't too shocked to discover CHAR(10) columns with a cardinality in single digits in my 13 million row table.  You may even be able to go a few steps further to see if you can stuff more than one column into a lookup table as a group.  There were columns in my table that were related and covariant - prime candidates for grouping together into a single lookup table.
The individual lookup tables were simple - a primary identity key column with one or more CHAR/VARCHAR data columns.  The main table simply replaced the column(s) of real data for each lookup with a foreign key column.  (Don't restrict yourself to just examining the CHAR/VARCHAR columns - although for me, just doing that was more than enough.)
Creating the View
The simplest scenario to handle in maintaining backwards compatibility was defining a view that appeared to be identical in structure to the original monolithic table.  A straightforward view definition SELECTing from the main table with JOINs to the lookup tables using the FK/PK relationships did the job.
I manually inserted a few rows into the set of tables to test out the view - and after a not too strenuous ten minutes, I was done.  For reads, that is.  Yes, you can use views to insert and update data - but that ability comes with restrictions.  Disregarding what surely would be other reasons why inserting and updating wouldn't be possible for this view - there's no way SQL would know to try to look up a value in one of the lookup tables to use the key if it was found, or add a new row to the lookup table and use the newly generated key if it wasn't.
The Tricky Triggers
Here's the part I wasn't trained for - the INSTEAD OF triggers.  Based on my untested memory of what I'd read about, they ought to be able to do the job.  The intent is to cover up the inability of the view to handle inserts and deletes with a "manual" set of instructions that do the operations properly.  Of course, one of the keys here is to perform your inserts and deletes in a set based manner - avoid cursors, and don't assume only one row is being inserted or deleted at a time.  To be clear, these triggers are being placed on the view - not on any of the underlying tables.  In order to construct them, we're going to make use of the inserted and deleted temporary tables that SQL Server provides us inside the triggers.
The INSERT Trigger
The first stage of this trigger is to ensure that you're going to have valid lookups for all the lookup tables when you get around to adding the row to the main table.  In order to do so, you need to issue one command for each lookup table, to add values it doesn't aleady have in it.  Each one of these commands is simple - but you'll have one per lookup table.
The second stage of the trigger is to add the rows to the main table with the redundant data elements replaced with lookup values.  You'll need a long INSERT statement to do that, but it's not much more complicated than the view definition that reconstructs the original table schema.
The DELETE Trigger
The DELETE trigger is actually much simpler than the INSERT.  It's effectively a massive JOIN operation like the view itself to allow the criteria from the DELETE to be applied to the underlying data columns.  You probably don't have to bother removing unused lookup values - your lookup table shouldn't be very large with or without them, and you won't be able to re-use the primary keys... unless you get a lot more complicated.
An Example
Here's a contrived example to form the basis for actual code.  Imagine I have a table containing transactions for customers, defined as follows:

CREATE TABLE OriginalMonolithicTable
(
  
customer_code CHAR(6),
  
customer_name VARCHAR(50),
  
transaction_date DATETIME,
  
product_code CHAR(10),
  
sale_amount CURRENCY
)


Taking a look at that table, I'm sure you'll all see what I do - the customer information is most certainly repeated within the table, as is the product code.  You could go to the trouble of testing the other columns for cardinality, but the payback of replacing 66 bytes with perhaps 8 (two INTs or less) will result in a row size 60% less than it is now.
Specifically, assuming the customer information is covariant and both the customer and product columns are low enough cardinality to be represented by INTs, the above table would change three (key relationships and indexes optional):

CREATE TABLE NewTable_main
(
  
customer_key INT,
  
transaction_date DATETIME,
  
product_key INT,
  
sale_amount CURRENCY
)


CREATE TABLE NewTable_lookup_Customer
(
  
customer_key INT IDENTITY,
  
customer_code CHAR(6),
  
customer_name VARCHAR(50)
)


CREATE TABLE NewTable_lookup_Product
(
  
product_key INT IDENTITY,
  
product_code CHAR(10)
)


The read-only view definition would look like:

CREATE VIEW NewMonolithicView
AS
SELECT
LC.customer_code, LC.customer_name, M.transaction_date,
  
LP.product_code, M.sale_amount

FROM NewTable_main AS M
JOIN NewTable_lookup_Customer AS LC
  
ON LC.customer_key = M.customer_key

JOIN NewTable_lookup_Product AS LP
  
ON LP.product_key = M.product_key


The INSERT trigger:

CREATE TRIGGER NewMonolithicView_insert ON NewMonolithicView
INSTEAD
OF INSERT
AS
BEGIN
  SET NOCOUNT ON

  
-- Stage One: insert "new" lookup values

  
INSERT INTO NewTable_lookup_Product
    
(product_code)
    
SELECT DISTINCT I.product_code
      
FROM inserted AS I
      
LEFT JOIN NewTable_lookup_Product AS LP
        
ON LP.product_code = I.product_code
      
WHERE LC.product_code IS NULL
        AND
I.product_code IS NOT NULL

  
INSERT INTO NewTable_lookup_Customer
    
(customer_code, customer_name)
    
SELECT DISTINCT I.customer_code, I.customer_name
      
FROM inserted AS I
      
LEFT JOIN NewTable_lookup_Customer AS LC
        
ON LC.customer_code = I.customer_code
        
AND LC.customer_name = I.customer_name
      
WHERE LC.customer_code IS NULL -- testing one column is sufficient
        
AND (I.customer_code IS NOT NULL
          OR
I.customer_name IS NOT NULL)

  
-- Stage Two: insert rows with lookup values

  
INSERT INTO NewTable_main
    
(customer_key, transaction_date, product_key, sale_amount)
    
SELECT LC.customer_key, I.transaction_date, LP.product_key, I.sale_amount
    
FROM inserted AS I
    
JOIN NewTable_lookup_Customer AS LC
      
ON LC.customer_code = I.customer_code
      
AND LC.customer_name = I.customer_name
    
JOIN NewTable_lookup_Product AS LP
      
ON LP.product_code = I.product_code

END

The DELETE trigger:

CREATE TRIGGER NewMonolithicView_delete ON NewMonolithicView
INSTEAD
OF DELETE
AS
BEGIN
  SET NOCOUNT ON

  DELETE FROM
NewTable_main
    
FROM NewTable_main AS M
    
JOIN NewTable_lookup_Customer AS LC
      
ON LC.customer_key = M.customer_key
    
JOIN NewTable_lookup_Product AS LP
      
ON LP.product_key = M.product_key
    
JOIN deleted AS D
      
ON D.customer_code = LC.customer_code
      
AND D.customer_name = LC.customer_name
      
AND D.transaction_date = M.transaction_date
      
AND D.product_code = LP.product_code
      
AND D.sale_amount = M.sale_amount

END

Loose Ends
Of course, my example is a simplistic representation of a simplistic real problem I had.  The above doesn't support UPDATE commands, and is surely not as optimized as it could be.  But it's not bad for duct tape.

Wednesday, July 21, 2010

PASS Summit Community Sessions

Allan Kinsel (blog|twitter) has just let us know that PASS will have a track at Summit for sessions voted on by the community.  They're going to label this a "Community" track, giving you the opportunity to second guess the program selection committee and pick sessions you want to attend.  So when the details come out next week, get off your duffs and make your vote count.
Now that I have the altruistic rant out of the way - I'm here to pitch one of my sessions for the Community track.  None of my abstracts got accepted by the program committee - competition was extremely fierce, and I know I have niche topics.  Out of the four topics I submitted, I believe one is uniquely suited to the Community track - and I hope you agree and cast your vote for it.  That one session is "Expand your SSIS Toolbox":
Despite the large number of Tasks and Components shipped in SSIS, a common complaint from ETL developers is that certain capabilities just aren't there. The only way of addressing those issues seems to be resorting to Scripts and diving into unfamiliar .Net code. But there is an alternative! Microsoft exposed an API to developers that allows them to extend SSIS by creating custom Tasks, Components and more. Once installed, these elements function just like standard parts of SSIS - but encapsulate that "missing" functionality you've been looking for. Take a tour of some of the most popular and useful extensions available for SSIS, and find out where to go to find more - usually for free.
What could be more appropriate for a Community Track than highlighting tools built by the community to improve your productivity inside SQL Server Integration Services?
UPDATE: Yes, I (read: "my ego") skipped over the tiny detail that only 20 sessions from the over 500 that weren't selected by the program committee will be placed on the ballot - and I have no idea if mine will make it on that list.  Look forward to more plugs if it does happen to make it ;)

Monday, July 19, 2010

Presenting to VanPASSBI in August

Via Donabel Santos (blog|twitter) "Join us for a morning of just pure SSIS fun!"  I'm not so sure that's accurate - unless you use Integration Services a lot :)
I'll be at the Vancouver BI PASS chapter meeting on August 6th, presenting "In Depth with the SSIS Script Task."  Basically - everything you'd ever want to know about what you can do in the Script Task, and how to go about doing it.  Here's the abstract:
The Script Task is the Swiss Army knife of Integration Services process orchestration. It’s rare to be able to construct a real-world ETL solution with only the tools provided with the built-in SSIS tasks. When no task in the toolbox can accomplish what you need, then the Script Task is a powerful alternative. Scripts have easy access to SSIS variables and connections, and have the full power of the .Net Framework behind them. But scripting in Integration Services isn’t all unicorns and rainbows. Knowledge of C# or VB.Net is essential even for very basic operations, and restrictions placed on the Script for the privilege of operating within the SSIS control flow must be understood. Find out how to get results quickly with the Script Task, focusing on common problems it’s being used to solve in the real world.
I hope to see you at Sierra Systems (25th floor - 1177 W. Hastings) around 9am for general chit-chat.  If you can't make it in person, it will be broadcast live (and recorded).

Monday, July 12, 2010

Conversion Between Unicode and Non-Unicode Data Flow Pattern


I don't think I'm alone when I find myself having to convert several columns in my Data Flow from Unicode strings to non-Unicode strings, or vice versa.  Reading from text files, Excel sources, ODBC drivers, or lookups resulted in my Data Flow being populated with one string type or the other. 
My destination (naturally) wanted the other type of string, and SSIS's type strictness rightly prevented a straight-up implicit conversion, resulting in "cannot convert between unicode and non-unicode string data type" errors:

Under the covers, I'm trying to push a non-unicode string column called UserName into a database table that wants a unicode string.  What's really occurring in the data flow looks something like this, and it doesn't work:
The "By The Book" Solution
To explicitly convert those columns, you're supposed to use a Data Conversion component (although a Derived Column will serve this purpose admirably as well).  When you hook the Data Conversion component up to the flow, you select the columns you want to convert, and select the (target) Data Type. 
The component (unhelpfully) names this new converted column "Copy of [original column name]" which you can choose to alter, and you carry on with the rest of your flow.  This is how the data flow now looks:

I've shown the "automatic" mappings with green arrows, and the things you have to manually set with the blue arrows.  In the above, the blue arrows represent the work you have to do to set up the Data Conversion component to change the data type (creating the new column) and the work you have to do to manually change the mapping in the OLE DB Destination.
Issues With "The Book"
The prescribed solution above works perfectly... for perfect people... but I'm not one of those.  I frequently found myself stumbling over a few problems.
I really don't like the "Copy of" name that was given to the new column.  Primarily because it wasn't a copy of that column - it was a derived value.  To me, that was like using a Derived Column transformation to create a new column from an existing date column called "Today" with the expression DATEADD("day", 1, [Today]) and calling the new column "Copy of Today".  It's not.  It's tomorrow.  Of course, SSIS has absolutely no way of knowing this semantic distinction.  I found myself changing the default converted output column name to things like "original column_str"... but that didn't quite sit right because of the occasional mixture of spaces and underscores.  (Yeah, I'm anal that way.)
With a little counseling, I could probably see past the naming issues... except that I would find myself inadvertently using the original column later on in the data flow - because it was named appropriately, and appeared in the column list before the converted column I should have been using.  Quite frequently, I found my mistake at the end of the data flow after realizing I'd used the wrong column all the way through the flow.

Facepalm.
In order to protect myself and still properly transform the data, I found myself performing the following "dance"...
The Workaround
The end result I wanted was to convert the columns to the desired data type, but keep the column name the same - because it was a column name I (and those that will read my package later) would understand.  I also wanted to remove the possibility that the original inappropriately typed column couldn't be used.  I also wanted to make it perfectly clear what was happening in my workaround - being clever is fine... but don't outsmart yourself (or the next guy).
The workaround consists of using a Data Conversion component immediately followed by a Union All.  The Data Conversion did its standard job - creating a new, converted column with a dumb name.  The purpose of the Union All was to reconstruct the data flow by eliminating the "original" columns, and rename the converted columns back to the original names, like this:
You'll see that I've still used the green and blue arrows here - the result I'm looking for is that last green arrow, the automatic mapping of the correct column name.
Using the Pattern - Pros and Cons
The above workaround is foolproof - I know it is because it saves me from making that mistake every day.  I use it every time I do a simple data type conversion, whether it's strings to integers or floats to currency.  That's the overwhelming positive for me.  However, there are drawbacks to using this pattern - the Union All is an asynchronous component, which ends up throwing a few curves our way. 
On the good side, it's not a "blocking" component - it doesn't hold up all the incoming rows until it's seen them all before sending rows out (like the Sort component).  This is a good thing, because it doesn't consume a lot of memory.
On the bad side, it causes the Data Flow engine to stop using the buffer (block of memory) it has the data in, and causes a memory copy to be made into a new buffer of a different "shape" (columns and data types).  It does this because it assumes that I am joining several flows together, as well as renaming and removing columns from the buffer.  Joining several flows together (probably) means that they all have different shapes, and as such, you can't simply pass them out the other side - the output has to have a consistent shape.
The end result (regardless of reasons) is that a memory copy takes place in this pattern where it otherwise shouldn't have to - and that can slow down your data flow in two ways.  First, the copy simply takes time.  Second, copying means that you're breaking the "execution tree" of the data flow into two parts manually - meaning SSIS's optimizer is stuck with working around that decision, which may result in a less optimal worker thread allocation.  (In SSIS 2005, it could actually improve performance in long flows, as 2005 didn't optimize those execution trees as well as 2008 does.)
Still Unresolved
Even with this pattern, I still have one of my original gripes unresolved, and have actually added some more.  I still have a ton of clicks to go through to set this up, especially when I have more than one column to convert.  I've just published a component on CodePlex that addresses those issues for me - and I'll blog about that next...
UPDATE 2010-08-30: See how to do the above a lot faster with a new custom component.