Wednesday, July 28, 2010

Converting to a Star Schema for Performance - Stealth Mode

I was recently contacted by an end user in my organization about query performance problems.  Turns out they were using an Excel pivot table to query a 17GB table.  The server powering this database is what I'd describe (at best) as "departmental" - a 32-bit OS, only 4GB of RAM, and plenty of other databases on it that typically get used more often.  The table in question had 13 million rows, and the user's standard approach was a fairly wide-open template Excel pivot table that they narrowed down later.  The table was also fairly wide, with quite a few character columns.
The Best Solution... is Not Appropriate
In a perfect world, I would take the time to solve this problem properly by completely rearchitecting the table (and the rest of the tables in that database) into a star schema and placing a cube on top of it.  The primary purpose of the database in question was analytics, hence the pivot table.  Problem is, I truly didn't have the time to investigate all of the stakeholders of this database.  I knew that there were some automated .Net processes that updated it, but there very well could have been other readers and writers, so I went for a quick fix that didn't change the appearance of the table or kill insert performance.  The fix I'm going to describe works - but I fully expect that the true DBAs out there will have a few more suggestions I ought to consider.  So - as you should with every other suggestion or piece of advice you read on the internet - don't take this to the bank; test it in your scenario.  This advice may be worth no more than exactly what you paid for it.
The Duct Tape Epiphany
Upon examination of the table structure and contents, I found that quite a few of the character columns had very low cardinality - which means they had very few distinct values within the table.  The net effect was that I had a table with 13 million rows where the vast majority of those rows carried redundant, repeating values.  My natural reaction was to split those columns out into separate tables and reference them with foreign keys.  But like I said earlier, that would change how everyone would have to interact with the table.
Unless... I covered up that underlying schema change with a view (for readers), and decorated the view with INSTEAD OF triggers for inserts and deletes!  (I didn't need to support updates, although I'm almost certain you can extend this approach to handle them very easily.)  I'm blogging about this because it may help out on a thread in the Integration Services forums - hopefully it does at least that.  Once again, I'll state that there are experts out there who know way more about the query engine, performance, locks, and all that important stuff than I do.  They'll probably tell you this method has a few pitfalls - and I'm hoping they tell me too.  But for the past month, it's been working better than fine for me.  So on to how it's done, followed by a concrete example!
Choosing Lookups
The first step was identifying individual columns and groups of columns to separate from the main table in order to replace them with lookup values.  The columns I looked at first were the larger ones - CHARs and VARCHARs.  I stood to gain the most "compression" from converting those columns to lookups.  From there, I had to test the range of those fields - how many unique values did they hold?  The cardinality of that column relative to the total row count basically equates to the compression level I'd get by moving it off to a lookup table.  I wasn't too shocked to discover CHAR(10) columns with a cardinality in single digits in my 13 million row table.  You may even be able to go a few steps further to see if you can stuff more than one column into a lookup table as a group.  There were columns in my table that were related and covariant - prime candidates for grouping together into a single lookup table.
The individual lookup tables were simple - a primary identity key column with one or more CHAR/VARCHAR data columns.  The main table simply replaced the column(s) of real data for each lookup with a foreign key column.  (Don't restrict yourself to just examining the CHAR/VARCHAR columns - although for me, just doing that was more than enough.)
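If you want something concrete for that test, the cardinality check is nothing fancier than a few COUNT(DISTINCT) queries compared against the total row count.  A minimal sketch against the example table that appears later in this post (swap in your own table and columns):

SELECT
  COUNT(*)                      AS total_rows,
  COUNT(DISTINCT customer_code) AS distinct_customer_codes,
  COUNT(DISTINCT product_code)  AS distinct_product_codes
FROM OriginalMonolithicTable

Columns whose distinct counts are tiny relative to the total row count are the prime lookup candidates.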
Creating the View
The simplest scenario to handle in maintaining backwards compatibility was defining a view that appeared to be identical in structure to the original monolithic table.  A straightforward view definition SELECTing from the main table with JOINs to the lookup tables using the FK/PK relationships did the job.
I manually inserted a few rows into the set of tables to test out the view - and after a not too strenuous ten minutes, I was done.  For reads, that is.  Yes, you can use views to insert and update data - but that ability comes with restrictions.  Disregarding what surely would be other reasons why inserting and updating wouldn't be possible for this view - there's no way SQL would know to try to look up a value in one of the lookup tables to use the key if it was found, or add a new row to the lookup table and use the newly generated key if it wasn't.
The Tricky Triggers
Here's the part I wasn't trained for - the INSTEAD OF triggers.  Based on my untested memory of what I'd read about them, they ought to be able to do the job.  The intent is to cover up the view's inability to handle inserts and deletes with a "manual" set of instructions that performs the operations properly.  Of course, one of the keys here is to perform your inserts and deletes in a set-based manner - avoid cursors, and don't assume only one row is being inserted or deleted at a time.  To be clear, these triggers are being placed on the view - not on any of the underlying tables.  In order to construct them, we're going to make use of the inserted and deleted virtual tables that SQL Server provides inside triggers.
The INSERT Trigger
The first stage of this trigger is to ensure that you're going to have valid lookups for all the lookup tables when you get around to adding the rows to the main table.  In order to do so, you need to issue one command for each lookup table, to add any values it doesn't already contain.  Each one of these commands is simple - but you'll have one per lookup table.
The second stage of the trigger is to add the rows to the main table with the redundant data elements replaced with lookup values.  You'll need a long INSERT statement to do that, but it's not much more complicated than the view definition that reconstructs the original table schema.
The DELETE Trigger
The DELETE trigger is actually much simpler than the INSERT.  It's effectively a massive JOIN operation like the view itself to allow the criteria from the DELETE to be applied to the underlying data columns.  You probably don't have to bother removing unused lookup values - your lookup table shouldn't be very large with or without them, and you won't be able to re-use the primary keys... unless you get a lot more complicated.
An Example
Here's a contrived example to form the basis for actual code.  Imagine I have a table containing transactions for customers, defined as follows:

CREATE TABLE OriginalMonolithicTable
(
  customer_code CHAR(6),
  customer_name VARCHAR(50),
  transaction_date DATETIME,
  product_code CHAR(10),
  sale_amount MONEY
)


Taking a look at that table, I'm sure you'll all see what I do - the customer information is most certainly repeated within the table, as is the product code.  You could go to the trouble of testing the other columns for cardinality, but just replacing those 66 bytes of customer and product data with perhaps 8 (two INTs, or less) already shrinks the row size by roughly 60%.
Specifically, assuming the customer information is covariant and both the customer and product columns have low enough cardinality to be represented by INTs, the above table would change to these three (key relationships and indexes optional):

CREATE TABLE NewTable_main
(
  customer_key INT,
  transaction_date DATETIME,
  product_key INT,
  sale_amount MONEY
)

CREATE TABLE NewTable_lookup_Customer
(
  customer_key INT IDENTITY,
  customer_code CHAR(6),
  customer_name VARCHAR(50)
)

CREATE TABLE NewTable_lookup_Product
(
  product_key INT IDENTITY,
  product_code CHAR(10)
)


The read-only view definition would look like:

CREATE VIEW NewMonolithicView
AS
SELECT LC.customer_code, LC.customer_name, M.transaction_date,
       LP.product_code, M.sale_amount
  FROM NewTable_main AS M
  JOIN NewTable_lookup_Customer AS LC
    ON LC.customer_key = M.customer_key
  JOIN NewTable_lookup_Product AS LP
    ON LP.product_key = M.product_key
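For true "stealth mode", of course, the view would take over the original table's name so existing readers don't have to change a thing.  A hypothetical sketch of that swap (the retired name here is made up):

-- Park the old table under a new name, freeing up the original name for the view
EXEC sp_rename 'OriginalMonolithicTable', 'OriginalMonolithicTable_retired'
-- ...then create the view above using the name OriginalMonolithicTable
-- instead of NewMonolithicView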


The INSERT trigger:

CREATE TRIGGER NewMonolithicView_insert ON NewMonolithicView
INSTEAD OF INSERT
AS
BEGIN
  SET NOCOUNT ON

  -- Stage One: insert "new" lookup values
  INSERT INTO NewTable_lookup_Product
    (product_code)
    SELECT DISTINCT I.product_code
      FROM inserted AS I
      LEFT JOIN NewTable_lookup_Product AS LP
        ON LP.product_code = I.product_code
      WHERE LP.product_code IS NULL
        AND I.product_code IS NOT NULL

  INSERT INTO NewTable_lookup_Customer
    (customer_code, customer_name)
    SELECT DISTINCT I.customer_code, I.customer_name
      FROM inserted AS I
      LEFT JOIN NewTable_lookup_Customer AS LC
        ON LC.customer_code = I.customer_code
        AND LC.customer_name = I.customer_name
      WHERE LC.customer_code IS NULL -- testing one column is sufficient
        AND (I.customer_code IS NOT NULL
          OR I.customer_name IS NOT NULL)

  -- Stage Two: insert rows with lookup values
  INSERT INTO NewTable_main
    (customer_key, transaction_date, product_key, sale_amount)
    SELECT LC.customer_key, I.transaction_date, LP.product_key, I.sale_amount
      FROM inserted AS I
      JOIN NewTable_lookup_Customer AS LC
        ON LC.customer_code = I.customer_code
        AND LC.customer_name = I.customer_name
      JOIN NewTable_lookup_Product AS LP
        ON LP.product_code = I.product_code
END
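One handy side effect of routing inserts through the view: migrating the existing data into the new structure can work the same way.  As a rough sketch (assuming the view and the INSERT trigger above are already in place), you could load the old table through the view and let the trigger build the lookup tables as it goes - though on a 13 million row table you'd want to batch it:

INSERT INTO NewMonolithicView
  (customer_code, customer_name, transaction_date, product_code, sale_amount)
SELECT customer_code, customer_name, transaction_date, product_code, sale_amount
  FROM OriginalMonolithicTable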

The DELETE trigger:

CREATE TRIGGER NewMonolithicView_delete ON NewMonolithicView
INSTEAD OF DELETE
AS
BEGIN
  SET NOCOUNT ON

  DELETE FROM NewTable_main
    FROM NewTable_main AS M
    JOIN NewTable_lookup_Customer AS LC
      ON LC.customer_key = M.customer_key
    JOIN NewTable_lookup_Product AS LP
      ON LP.product_key = M.product_key
    JOIN deleted AS D
      ON D.customer_code = LC.customer_code
      AND D.customer_name = LC.customer_name
      AND D.transaction_date = M.transaction_date
      AND D.product_code = LP.product_code
      AND D.sale_amount = M.sale_amount
END

Loose Ends
Of course, my example is a simplistic representation of a simplistic real problem I had.  The above doesn't support UPDATE commands, and is surely not as optimized as it could be.  But it's not bad for duct tape.
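For the curious, here's a minimal, untested sketch of what an INSTEAD OF UPDATE trigger might look like for the example view.  It treats an update as a delete followed by a re-insert through the view, which assumes the view's rows are unique and that nested trigger execution is enabled (the default) so the INSERT trigger can handle any new lookup values:

CREATE TRIGGER NewMonolithicView_update ON NewMonolithicView
INSTEAD OF UPDATE
AS
BEGIN
  SET NOCOUNT ON

  -- Remove the old versions of the rows (same join logic as the DELETE trigger)
  DELETE FROM NewTable_main
    FROM NewTable_main AS M
    JOIN NewTable_lookup_Customer AS LC
      ON LC.customer_key = M.customer_key
    JOIN NewTable_lookup_Product AS LP
      ON LP.product_key = M.product_key
    JOIN deleted AS D
      ON D.customer_code = LC.customer_code
      AND D.customer_name = LC.customer_name
      AND D.transaction_date = M.transaction_date
      AND D.product_code = LP.product_code
      AND D.sale_amount = M.sale_amount

  -- Re-insert the new versions through the view so the INSERT trigger
  -- takes care of adding any lookup values it hasn't seen before
  INSERT INTO NewMonolithicView
    (customer_code, customer_name, transaction_date, product_code, sale_amount)
  SELECT customer_code, customer_name, transaction_date, product_code, sale_amount
    FROM inserted
END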

2 comments:

  1. Hi Todd,

    Wouldn't Power Pivot solve this user's problem without any additional schema changes?

    ~Matt

    I think it certainly would - if our environment were mature enough to have Office 2010 and SharePoint 2010 installed and managed. I think we'll get there; upgrades I didn't think were even under consideration are underway. A managed PowerPivot (server-side) sheet should cover it - but I'd still want to integrate it into a conformed DW so they have drill-across and context capabilities...
