Contrary to conventional "database wisdom", even though you've stored your Data Warehouse in a relational database, it's better to eliminate NULL values and use "real" values instead[1]. Business users don't understand NULLs, and saving space in your dimension tables isn't a priority. Use "n/a", "unknown", or some other value your business users will understand, and be able to work with - that's adding business value to your Warehouse.
In this post, I'll deal with several scenarios where NULLs come into play in SSIS:
* Cleaning source system data - converting NULL values into "real" values
* Cleaning source system data - detecting "invalid" values and replacing with NULL (for later cleaning)
Detecting NULL Values
There are more than a few instances where you want to detect NULLs - in order to throw errors, report bad quality data, attempt to "fix" data, or simply replace them with more meaningful information.
You have quite a few tools in SSIS at your disposal for doing that. The one you use will depend on what you want to do with the row/column after you've detected the NULL.
Using the Conditional Split Transform
This method works best when you're redirecting rows that contain NULLs to some kind of error reporting mechanism, or if you're going to do some more complicated repair work that you don't want to subject the rest of your rows to. Basically, this transform is going to split up your data into two flows - one that has NULL values in the particular column you're interested in, and one that doesn't.
Place a Conditional Split on the design surface and hook up the input. Open the editor, and add this expression to the first "case" (where [column] is the column you're testing):
ISNULL([column])
Call that case "Column Is NULL", and rename the "Default Output" as "Column Is Not NULL" (or something equally informative). Now your Conditional Split will have two outputs - one you can connect to the "regular" downstream processing, and one you can redirect to some error processing, data repair, or other NULL handling. (You can rejoin those rows to the original flow later with a Union All or Merge component, if your intent was to "fix up" the NULLs.)
Using the Derived Column Transform
This component is used more for simple in-line conversions, where whatever action you take to "clean" a NULL value is very simple. You shouldn't use this transform to detect NULLs and redirect rows to another flow - even though it's technically possible - use the Conditional Split instead.
Again, you'll use the SSIS expression language's ISNULL function to detect the NULL value - and then do something with it, like this conversion of an integer column:
ISNULL([integer_column]) ? -1 : [integer_column]
Using the Script Transform
By far the most flexible option - but you'll pay for that flexibility in setup difficulty. This method is really only beneficial when your NULL processing is complex, or asynchoronous (NULL values mean you want to make two rows instead of one, for example), or you've already got a Script component where you need it in the flow.
Of course, you need to set the Script's Input page to identify that you want to read the value out of the column you intend to detect NULLs in. Inside the script, detect NULLs for a column named "column" by using the "column_IsNull" property of the Row object passed in to the InputX_ProcessInputRow method. For example, if you wanted to count the NULL occurrences in a column called "integer_column" (even though the Aggregate Transform can do that), you could use this code:
if (Row.integer_column_IsNull)
{
_counterVariable++;
}
Inserting a NULL Value Into a Column
There can be several scenarios that cause you to do this - and there are several methods of going about doing it.
Using the Derived Column Transform
The easiest and most common method of filling a column with a NULL value is to use the Derived Column transform. To hammer in a NULL value for every row, use the "NULL Functions" in the upper-right hand corner of the Derived Column transform editor. There is one "NULL" function for each data type - just pick the right one. You can even drag and drop the function name into the expression for your column.
Of course, you don't need to be so "brutal" - you can pick and choose when you are going to replace your column's value with NULL, rather than replacing every row's value. For example, if you want to replace an integer column with NULL if the value is negative, use the following expression:
([integer_column] < 0) ? NULL(DT_I4) : [integer_column]
If you want to replace a date column with NULL if the value is "in the future", use the following expression:
([date_column] > GETDATE()) ? NULL(DT_DBTIMESTAMP) : [date_column]
The above two examples ought to point you in the right direction for using the other type-specific NULL functions...
Using a Script Transformation
Using a Script transform to replace column values with NULLs isn't my first recommendation. It's a lot of heavy lifting to get something simple done (see above). However, if you've got a condition that's a little more complicated than you can wrap your head around using SSIS's expression syntax - or you're already using a Script component in the "right place" in your Data Flow, it's pretty easy to add this in.
First, of course, you have to set up your Script component. The bare minimum is to set that column to "read/write" on the Input Columns page of the Script editor. Select the column, and change it's Usage Type to Read/Write.
Then, inside the Script itself, you'll want to edit the Input?_ProcessInputRow method. Do your condition detecting to see if you need to set a column to NULL. If you do need to set this row's column to NULL, do the following (assuming your column is named "column"):
Row.column_IsNull = True;
Don't mess with the "Row.column" property at all. Attempting to set it to "null" or "DBNull" won't do a thing except cause runtime errors.
Conclusion
Dealing with NULLs is up to you - but I strongly recommend you replace all NULL values in your dimension and fact tables with "dummy" or known-"unknown" values. I've penned some more thoughts and recommendations on dealing with NULLs in Integration Services as well.
[1] Kimball Design Tip #43: Dealing with Nulls in the Dimensional Model
This is really nasty.
ReplyDelete