Wednesday, July 28, 2010

Converting to a Star Schema for Performance - Stealth Mode

I was recently contacted by an end user in my organization about query performance problems.  Turns out they were using an Excel pivot table to query a 17GB table.  The server powering this database is what I'd describe (at best) as "departmental" - a 32-bit OS, only 4GB of RAM, and plenty of other databases on it that typically get used more often.  The table in question had 13 million rows, and the user's standard approach was a fairly wide-open template Excel pivot table that they narrowed down later.  The table was also fairly wide, with quite a few character columns.
The Best Solution... is Not Appropriate
In a perfect world, I would take the time to solve this problem properly by completely rearchitecting the table (and the rest of the tables in that database) into a star schema and placing a cube on top of it.  The primary purpose of the database in question was analytics, hence the pivot table.  Problem is, I truly didn't have the time to investigate all of the stakeholders of this database.  I knew that there were some automated .Net processes that updated it, but there very well could have been other readers and writers, so I went for a quick fix that didn't change the appearance of the table or kill insert performance.  The fix I'm going to describe works - but I fully expect that the true DBAs out there will have a few more suggestions I ought to consider.  So - as you should with every other suggestion or piece of advice you read on the internet - don't take this to the bank; test it in your scenario.  This advice may be worth no more than exactly what you paid for it.
The Duct Tape Epiphany
Upon examination of the table structure and contents, I found that quite a few of the character columns had very low cardinality - which means they had very few distinct values within the table.  The net effect was that I had a table with 13 million rows where the vast majority of those rows carried redundant, repeating values.  My natural reaction was to split those columns out into separate tables and reference them with foreign keys.  But like I said earlier, that would change how everyone would have to interact with the table.
Unless... I covered up that underlying schema change with a view (for readers), and decorated the view with INSTEAD OF triggers for inserts and deletes!  (I didn't need to support updates, although I'm almost certain you can extend this approach to handle them very easily.)  I'm blogging about this because it may help out on a thread in the Integration Services forums - hopefully it does at least that.  Once again, I'll state that there are experts out there who know way more about the query engine, performance, locks, and all that important stuff than I do.  They'll probably tell you this method has a few pitfalls - and I'm hoping they tell me too.  But for the past month, it's been working better than fine for me.  So on to how it's done, followed by a concrete example!
Choosing Lookups
The first step was identifying individual columns and groups of columns to separate from the main table in order to replace them with lookup values.  The columns I looked at first were the larger ones - CHARs and VARCHARs.  I stood to gain the most "compression" from converting those columns to lookups.  From there, I had to test the range of those fields - how many unique values did they hold?  The cardinality of that column relative to the total row count basically equates to the compression level I'd get by moving it off to a lookup table.  I wasn't too shocked to discover CHAR(10) columns with a cardinality in single digits in my 13 million row table.  You may even be able to go a few steps further to see if you can stuff more than one column into a lookup table as a group.  There were columns in my table that were related and covariant - prime candidates for grouping together into a single lookup table.
The individual lookup tables were simple - a primary identity key column with one or more CHAR/VARCHAR data columns.  The main table simply replaced the column(s) of real data for each lookup with a foreign key column.  (Don't restrict yourself to just examining the CHAR/VARCHAR columns - although for me, just doing that was more than enough.)
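If you want something concrete for that test, the cardinality check is nothing fancier than a few COUNT(DISTINCT) queries compared against the total row count.  A minimal sketch against the example table that appears later in this post (swap in your own table and columns):

SELECT
  COUNT(*)                      AS total_rows,
  COUNT(DISTINCT customer_code) AS distinct_customer_codes,
  COUNT(DISTINCT product_code)  AS distinct_product_codes
FROM OriginalMonolithicTable

Columns whose distinct counts are tiny relative to the total row count are the prime lookup candidates.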
Creating the View
The simplest scenario to handle in maintaining backwards compatibility was defining a view that appeared to be identical in structure to the original monolithic table.  A straightforward view definition SELECTing from the main table with JOINs to the lookup tables using the FK/PK relationships did the job.
I manually inserted a few rows into the set of tables to test out the view - and after a not too strenuous ten minutes, I was done.  For reads, that is.  Yes, you can use views to insert and update data - but that ability comes with restrictions.  Disregarding what surely would be other reasons why inserting and updating wouldn't be possible for this view - there's no way SQL would know to try to look up a value in one of the lookup tables to use the key if it was found, or add a new row to the lookup table and use the newly generated key if it wasn't.
The Tricky Triggers
Here's the part I wasn't trained for - the INSTEAD OF triggers.  Based on my untested memory of what I'd read about them, they ought to be able to do the job.  The intent is to cover up the view's inability to handle inserts and deletes with a "manual" set of instructions that performs the operations properly.  Of course, one of the keys here is to perform your inserts and deletes in a set-based manner - avoid cursors, and don't assume only one row is being inserted or deleted at a time.  To be clear, these triggers are being placed on the view - not on any of the underlying tables.  In order to construct them, we're going to make use of the inserted and deleted virtual tables that SQL Server provides inside triggers.
The INSERT Trigger
The first stage of this trigger is to ensure that you're going to have valid lookups for all the lookup tables when you get around to adding the rows to the main table.  In order to do so, you need to issue one command for each lookup table, to add any values it doesn't already contain.  Each one of these commands is simple - but you'll have one per lookup table.
The second stage of the trigger is to add the rows to the main table with the redundant data elements replaced with lookup values.  You'll need a long INSERT statement to do that, but it's not much more complicated than the view definition that reconstructs the original table schema.
The DELETE Trigger
The DELETE trigger is actually much simpler than the INSERT.  It's effectively a massive JOIN operation like the view itself to allow the criteria from the DELETE to be applied to the underlying data columns.  You probably don't have to bother removing unused lookup values - your lookup table shouldn't be very large with or without them, and you won't be able to re-use the primary keys... unless you get a lot more complicated.
An Example
Here's a contrived example to form the basis for actual code.  Imagine I have a table containing transactions for customers, defined as follows:

CREATE TABLE OriginalMonolithicTable
(
  customer_code CHAR(6),
  customer_name VARCHAR(50),
  transaction_date DATETIME,
  product_code CHAR(10),
  sale_amount MONEY
)


Taking a look at that table, I'm sure you'll all see what I do - the customer information is most certainly repeated within the table, as is the product code.  You could go to the trouble of testing the other columns for cardinality, but just replacing those 66 bytes of customer and product data with perhaps 8 (two INTs, or less) already shrinks the row size by roughly 60%.
Specifically, assuming the customer information is covariant and both the customer and product columns have low enough cardinality to be represented by INTs, the above table would change to these three (key relationships and indexes optional):

CREATE TABLE NewTable_main
(
  customer_key INT,
  transaction_date DATETIME,
  product_key INT,
  sale_amount MONEY
)

CREATE TABLE NewTable_lookup_Customer
(
  customer_key INT IDENTITY,
  customer_code CHAR(6),
  customer_name VARCHAR(50)
)

CREATE TABLE NewTable_lookup_Product
(
  product_key INT IDENTITY,
  product_code CHAR(10)
)


The read-only view definition would look like:

CREATE VIEW NewMonolithicView
AS
SELECT LC.customer_code, LC.customer_name, M.transaction_date,
       LP.product_code, M.sale_amount
  FROM NewTable_main AS M
  JOIN NewTable_lookup_Customer AS LC
    ON LC.customer_key = M.customer_key
  JOIN NewTable_lookup_Product AS LP
    ON LP.product_key = M.product_key
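For true "stealth mode", of course, the view would take over the original table's name so existing readers don't have to change a thing.  A hypothetical sketch of that swap (the retired name here is made up):

-- Park the old table under a new name, freeing up the original name for the view
EXEC sp_rename 'OriginalMonolithicTable', 'OriginalMonolithicTable_retired'
-- ...then create the view above using the name OriginalMonolithicTable
-- instead of NewMonolithicView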


The INSERT trigger:

CREATE TRIGGER NewMonolithicView_insert ON NewMonolithicView
INSTEAD OF INSERT
AS
BEGIN
  SET NOCOUNT ON

  -- Stage One: insert "new" lookup values
  INSERT INTO NewTable_lookup_Product
    (product_code)
    SELECT DISTINCT I.product_code
      FROM inserted AS I
      LEFT JOIN NewTable_lookup_Product AS LP
        ON LP.product_code = I.product_code
      WHERE LP.product_code IS NULL
        AND I.product_code IS NOT NULL

  INSERT INTO NewTable_lookup_Customer
    (customer_code, customer_name)
    SELECT DISTINCT I.customer_code, I.customer_name
      FROM inserted AS I
      LEFT JOIN NewTable_lookup_Customer AS LC
        ON LC.customer_code = I.customer_code
        AND LC.customer_name = I.customer_name
      WHERE LC.customer_code IS NULL -- testing one column is sufficient
        AND (I.customer_code IS NOT NULL
          OR I.customer_name IS NOT NULL)

  -- Stage Two: insert rows with lookup values
  INSERT INTO NewTable_main
    (customer_key, transaction_date, product_key, sale_amount)
    SELECT LC.customer_key, I.transaction_date, LP.product_key, I.sale_amount
      FROM inserted AS I
      JOIN NewTable_lookup_Customer AS LC
        ON LC.customer_code = I.customer_code
        AND LC.customer_name = I.customer_name
      JOIN NewTable_lookup_Product AS LP
        ON LP.product_code = I.product_code
END
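One handy side effect of routing inserts through the view: migrating the existing data into the new structure can work the same way.  As a rough sketch (assuming the view and the INSERT trigger above are already in place), you could load the old table through the view and let the trigger build the lookup tables as it goes - though on a 13 million row table you'd want to batch it:

INSERT INTO NewMonolithicView
  (customer_code, customer_name, transaction_date, product_code, sale_amount)
SELECT customer_code, customer_name, transaction_date, product_code, sale_amount
  FROM OriginalMonolithicTable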

The DELETE trigger:

CREATE TRIGGER NewMonolithicView_delete ON NewMonolithicView
INSTEAD OF DELETE
AS
BEGIN
  SET NOCOUNT ON

  DELETE FROM NewTable_main
    FROM NewTable_main AS M
    JOIN NewTable_lookup_Customer AS LC
      ON LC.customer_key = M.customer_key
    JOIN NewTable_lookup_Product AS LP
      ON LP.product_key = M.product_key
    JOIN deleted AS D
      ON D.customer_code = LC.customer_code
      AND D.customer_name = LC.customer_name
      AND D.transaction_date = M.transaction_date
      AND D.product_code = LP.product_code
      AND D.sale_amount = M.sale_amount
END

Loose Ends
Of course, my example is a simplistic representation of a simplistic real problem I had.  The above doesn't support UPDATE commands, and is surely not as optimized as it could be.  But it's not bad for duct tape.
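For the curious, here's a minimal, untested sketch of what an INSTEAD OF UPDATE trigger might look like for the example view.  It treats an update as a delete followed by a re-insert through the view, which assumes the view's rows are unique and that nested trigger execution is enabled (the default) so the INSERT trigger can handle any new lookup values:

CREATE TRIGGER NewMonolithicView_update ON NewMonolithicView
INSTEAD OF UPDATE
AS
BEGIN
  SET NOCOUNT ON

  -- Remove the old versions of the rows (same join logic as the DELETE trigger)
  DELETE FROM NewTable_main
    FROM NewTable_main AS M
    JOIN NewTable_lookup_Customer AS LC
      ON LC.customer_key = M.customer_key
    JOIN NewTable_lookup_Product AS LP
      ON LP.product_key = M.product_key
    JOIN deleted AS D
      ON D.customer_code = LC.customer_code
      AND D.customer_name = LC.customer_name
      AND D.transaction_date = M.transaction_date
      AND D.product_code = LP.product_code
      AND D.sale_amount = M.sale_amount

  -- Re-insert the new versions through the view so the INSERT trigger
  -- takes care of adding any lookup values it hasn't seen before
  INSERT INTO NewMonolithicView
    (customer_code, customer_name, transaction_date, product_code, sale_amount)
  SELECT customer_code, customer_name, transaction_date, product_code, sale_amount
    FROM inserted
END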

2 comments:

  1. Hi Todd,

    Wouldn't Power Pivot solve this user's problem without any additional schema changes?

    ~Matt

    I think it certainly would - if our environment were mature enough to have Office 2010 and SharePoint 2010 installed and managed. I think we'll get there; upgrades I didn't think were even under consideration are underway. A managed PowerPivot (server-side) sheet should cover it - but I'd still want to integrate it into a conformed DW so they have drill-across and context capabilities...
