Pages

Wednesday, October 7, 2009

More on Dealing with NULLs in Integration Services

In an earlier post, I talked about how having NULLs in your Data Warehouse is probably not the best thing - in dimension tables or in fact tables.  That post went over how to detect and cleanse the NULL values out of your source data before inserting into your Data Warehouse.
This post will discuss some of the issues that NULL values cause inside SQL Server Integration Services when you're not proactively handling them.  The typical indication of a rampant NULL is the dreaded DTS_E_INDUCEDTRANSFORMFAILUREONERROR message.  It's human nature to assume that the information you're retrieving from a text file, an Excel spreadsheet, or even a database is actually information - and not NULL.  It's the kind of thing that will end up biting you in the nether regions sooner or later, after having caused you to run around trying to find out where this cryptic error from SSIS is coming from.
Unnatural NULL Behaviour
The root of the problems with NULL - on top of usually being "unexpected" when you should be expecting it - is that it behaves completely unlike other values.  Let me give you a quick rundown of facts that you'll have to force into your brain (eventually) and accept as truth:
  • NULL is not equal to zero or blank
  • NULL is not equal to NULL
  • NULL is not "not equal" to NULL
  • Expressions with NULL in them typically evaluate to NULL, and eventually cause errors
Ah, I see your brain hurts already.
Why Is NULL Not Equal To Zero or Blank?
NULL is an "unknown" value.  Because it specifically represents "unknown", you can't compare it to a known value such as zero or a blank string.  Since we don't know what value NULL represents, we can't be sure if it is or isn't equivalent to zero or a blank string.  The result of evaluating an expression comparing NULL to a real value is... NULL.  (Read later for why.)
So an expression of:
[Column1] == [Column2]
Can cause you real problems in a Conditional Split component when either of Column1 or Column2 happens to be NULL for a particular row.
(I've described comparing NULL to zero or a blank string here - two different data types - because that's a common issue. In SSIS, NULL is actually typed. A NULL-valued integer column is different from a NULL-valued string column.  This is kindof irrelevant, because you typically don't compare a NULL-valued integer to a NULL-valued string.)
Why Is NULL Not Equal To NULL?
Can you hear the Twilight Zone music?  This is where it really starts to get weird.  If you accept the fact above that you can't detect NULL values by comparing them to zero or an empty string... your next logical option could be to try comparing it to NULL itself.  In SSIS, you could try an expression like this:
[Column1] == NULL(DT_I4) ? "Column1 is NULL!" : "Column1 is not NULL"
Except that won't work.  Why not?  Because NULL != NULL... which seems to make no sense!  0 == 0, right?  1 == 1.  Why does NULL != NULL?  If you remember, NULL represents "unknown."  Taken that way, the value could be anything... and since it could be anything, we can't say that one "value of something we don't know" is equal to another "value of something we don't know" because those two "somethings" could be different.  We just don't know.  And that leads us to...
Why Is NULL Not "Not Equal" To NULL?
Wait a minute.  I just told you that NULL == NULL was incorrect.  Isn't there some mathematical theorem that says that must mean NULL is not equal to NULL then?  You could be experiencing a full-on aneurysm... or we could be talking about NULLs.  To put you on solid ground - yes, there is such a theorem.  But no, it doesn't apply here - because I didn't tell you that NULL == NULL was false... I just told you they weren't equal.  The same twisted logic applies to trying to evaluate NULL != NULL.  It's not true, any more than NULL == NULL was.  With the same logic, we just can't know that two "unknowns" aren't equivalent.  And that leads us to...
Expressions With NULL In Them Typically Evaluate To NULL, And Eventually Cause Errors
All of the following statements will evaluate to NULL in Integration Services if Column1 is NULL-valued:
[Column1] == [Column2]
[Column1] != [Column2]
[Column1] + [Column2]
SUBSTRING([Column1], 1, 1)
FINDSTRING([Column1], "x", 1)
DATEDIFF("day", [Column1], GETDATE())
...etc...
The nasty problem here is that you probably have complex expressions like this:
SUBSTRING([Column2], FINDSTRING([Column1], "abc", 1), 10 - [Column3])
Which will pass a syntax check at design time just fine, but will fail miserably if Column3 is NULL in your data.  The tricky thing is (in the above example) if Column1 and/or Column2 is NULL also... SSIS won't throw an error - it will return NULL for the expression.  (It's able to do that because the NULL math in the SUBSTRING argument doesn't need to get executed, because the result of taking any SUBSTRING of NULL is... NULL!)
Dealing With NULL - "Properly"
The net effect of this is that if you have a NULL value somewhere in an expression, that NULL can get "bubbled up" from one nested function/operator into another, into another.  It can be difficult to track down where the problem is coming from.
Prevention
An ounce is worth a pound of cure, right?  So take your medicine and check for NULL values as soon as you can in the data flow.  If you don't expect certain columns to contain NULL at runtime, then use a Conditional Split to route those rows to your logging/audit mechanism to alert you of the bad data.  You do have a logging mechanism that you monitor, right?  You can (and probably should) be overzealous here - unless you know for sure that you may get NULL values in a certain column and you've checked that your data flow works as you expect with NULL values in it - detect rows with NULLs in that column.  If you find out later that a row comes in with that column NULL, and in your data quality investigation ("did you want this data?") they tell you they need it, and how to interpret it, you can then modify your flow.  Of course, in the perfect world, you would have the answers to all this up-front.  But then again, those responsible for data sources have been known to be mistaken...
The same caution should be taken after Merge Join components - if a left join is used, it's possible for rows to be emitted that have NULL values in them.  It costs very, very little to ensure you're data's clean by using a Conditional Split component to detect NULLs.  You only have to check one column on the output that came from the "right" input of the Merge Join to make sure.
Diagnosis
Well - you're likely to have a report from SSIS about exactly which component failed due to the NULL value sneaking through your flow.  Again, the problem here could be that the DTS_E_INDUCEDTRANSFORMFAILUREONERROR reported by a specific component could just be the "last straw" in the chain.  Since there are lots of cases where having a NULL value can slip through expressions undetected, showing up later in the data flow than the problem crept in.
The best way to debug your flow is with Data Viewers.  If it's not obvious to you what column the offending NULL appeared in, this feature will help you.  Run the flow, and the Data Viewer will "pause" the data flow for each buffer (set of rows) that passes through it.  You can click on the column headers to sort the buffer by the column - which will nicely bring the NULLs to the top of the list.  (Even though NULL means we don't know what the value is... it's "less" than any other value with respect to this sort.)
Once you've detected which column is at fault, you can start to find out where the NULL was introduced.  If just knowing the column that has the NULL doesn't inform you of where it came from, then place Data Viewers higher and higher upstream until you find the component that spawned them.
Treatment
OK - so you've found the problem - now what?  It depends, of course.  The NULL may be a result of an incorrect query, a misunderstood Merge Join - you should be able to spot the problem if that's it, and fix it by fixing the query, or executing the Merge Join properly.
But if it's simply a case of the NULL being a part of source data, or an inevitable result of a Merge Join that you can't avoid or fix by altering the Join, you'll need to deal with that NULL and change it. The way you do that is with a Derived Column component.  The two tools you'll use in that component are the ISNULL function, and the Conditional Operator (?).  You can replace the contents of the column with an expression like this:
ISNULL([Column1]) ? <null replacement> : [Column1]
Where you'd place a literal value you wanted to replace your NULL value with.  Here are some common examples.
To replace a NULL-valued string column with a blank (empty) string:
ISNULL([Column1]) ? "" : [Column1]
To replace a NULL-valued string column with a specific string value, in this example "n/a":
ISNULL([Column1]) ? "n/a" : [Column1]
To replace a NULL-valued integer column with a zero:
ISNULL([Column1]) ? 0 : [Column1]
To replace a NULL-valued date with a specific date, in this example January 1, 1900:
ISNULL([Column1]) ? (DT_DBTIMESTAMP)"1900-01-01" : [Column1]
Good luck tracking down and zapping those NULLs!

No comments:

Post a Comment