Wednesday, January 28, 2009

Kimball Method SCD Project Updated

(UPDATE 2010-01-27: Version 1.5 has been released)
It is a good day - I've finally put the pieces together to update the Kimball Method SCD project on CodePlex.  Version 1.4 is available, and I've been quite pleased with the improvements I've made, as I've been using it for the past month or so, tweaking as I went.  If it's not a viable alternative for the SCD Wizard, I'd like to know why.

In addition to featuring an installer and Inferred member support, the new release contains a sample package for comparing the SCD Wizard (the component included in SSIS) to the Kimball SCD component.  Making that package almost drove me insane as I had to deal with the SCD Wizard (again).  My #1 gripe (refreshed) about the Wizard is that any little change to how I want the SCD handled results in it recreating itself (and all it's little "helper" components), disconnecting it from the rest of the flow, and basically causing carnage in my package.

The end result, however, is that the package shows (in my environment) that I can definitely claim a 100x speed improvement over the SCD Wizard.  The primary reason for this is that this component doesn't do lookups to the Dimension table, it streams them in.  If you're as frustrated as I was with the SCD Wizard, you owe it to yourself to try this out.

Full details at the Kimball Method SCD project on CodePlex.  I've currently got it labeled as "beta" - so I'd appreciate a little feedback in order to push it to "stable" status.  Let me know how it goes for you.

Friday, January 23, 2009

Converting Numbers to Dates in the Derived Column Component

While not as frequent as questions about converting strings to dates, this is a common request at the SSIS forums.  The difference here is that the incoming number can represent quite a wide variety of interpretations of the date value - although integers can be easier to work with.
Unlike the post on converting strings to dates, this post is less about the tools (functions) you'll use to get the job done, and more about the specific number formats I've seen questions about.  But to start, we'll cover the tools.
Basic Tools for Converting Numbers to Dates
Casting to String
Typically you'll cast to a Unicode string type (DT_WSTR) since it's only an intermediate type (before you do your final cast to a date type) and fussing with code pages in DT_STR is unnecessary.
Check the samples below, as well as the post on converting strings to dates in order to understand the format you're going to need to cast the number to, or what you have to do with your string after you've casted it.
A cast from a number to a string is straightforward - the only possible tricky part is making sure you've given enough room (characters) to your cast command, such as:
(DT_WSTR, 10)
DATEADD
The DATEADD function works great for building a date from the ground up.  The arguments might be different from what you might be used to in other languages, so pay attention!  The first argument is the date part; You'll be using "Year", "Month", and "Day".  The second argument is the number of those elements to add to the third argument: the base date.  Unfortunately, SSIS doesn't have a literal character set to denote a typed-in date (like VBA does, for example) - so you'll have to start with a string, casted to a date.  Again, check the post about strings to dates for the reasons why the date format is important in these examples.  You will likely always use the following for your "base" date:
(DT_DBDATE)"1900-01-01"
Breaking Up the Date
It's also quite common to have your incoming date be a "concatenated" format, like YYYYMMDD.  That kind of format can be quite convoluted to deal with in a single expression in a Derived Column transformation.  You might be more comfortable dealing with concatenated integer values in two steps: breaking the number down, then converting it.  That requires two Derived Column components to accomplish, but adds the debugging capability of using a Data Viewer between them to examine the intermediate "decomposed" values.
Conversion Examples
Integer in YYYYMMDD Format
The following examples will use a column called [DateYYYYMMDD] to generate a date type.
Convert Directly to String, Then Date
First, deconstruct your single integer value into its components, then build up a string to cast to the date type of your choice:
(DT_DBDATE)((DT_WSTR, 4)(ROUND([DateYYYYMMDD] / 10000, 0) + "-" + (DT_WSTR, 2)(ROUND([DateYYYYMMDD] / 100, 0) % 100) + "-" + (DT_WSTR, 2)([DateYYYYMMDD] % 100))
Create Date from Scratch
DATEADD("Year", ROUND([DateYYYYMMDD] / 10000, 0), DATEADD("Month", ROUND([DateYYYYMMDD] / 100, 0) % 100), DATEADD("Day", [DateYYYYMMDD] % 100, (DT_DBDATE)"1899-12-31")))
Or, Decompose the Integer, Then Convert or Create
Place two Derived Column components in your flow.  The first one will create the following columns:
[DateYear]: ROUND([DateYYYYMMDD] / 10000, 0)
[DateMonth]: ROUND([DateYYYYMMDD] / 100, 0) % 100
[DateDay]: [DateYYYYMMDD] % 100
Which will then be used in the next Derived Column component to get the date by conversion to a string first, or by using DATEADD - your preference:
(DT_DBDATE)((DT_WSTR, 4)[DateYear] + "-" + (DT_WSTR, 2)[DateMonth] + "-" + (DT_WSTR, 2)[DateDay])
or
DATEADD("Year", [DateYear], DATEADD("Month", [DateMonth], DATEADD("Day", [DateDay], (DT_DBDATE)"1899-12-31")))
Serial Date Numbers from Excel (or Other Sources)
Dates from Excel (or other sources) can arrive at your package as a serial number that is "interpreted" by Excel (or the other source) as a date.  You'll have to research other source's interpretations for yourself, but they're likely to match Excel's (and SQL's) fairly closely.  Excel stores dates as floating point numbers - the integral part is the number of days past Dec 31, 1899, and the fractional part is the fraction of the day (see How to use Dates and Times in Excel).  SQL stores dates similarly, with the first two bytes containing the number of days since the base date of Jan 1, 1900, and the second two bytes containing the number of milliseconds past midnight (see datetime and smalldatetime).
That was long-winded... the short form is this:
DATEADD("Day", [SerialDateFromExcel], (DT_DBDATE)"1899-12-31")
Questions?
If you have any questions as to how to convert any particular number format into a date, please post here, or over at the MSDN SSIS Forum.  I'll update this post if any other notable conversion information comes my way.

Monday, January 12, 2009

Community Tasks and Components - Updated

About two months ago I told you about a new place you could go to get information on what custom components and tasks the community had built for SSIS.  John Welch and I have added to that resource over time, and there's now:
  • 4 Tasks (for Mail, Twitter, ...)
  • 1 Connection Manager (for SQL Server Data Services)
  • 6 Data Sources (Ragged-right delimited file, SSDS, Sharepoint, ...)
  • 24 Transforms (Geocoder, Vector, Checksum, CodePage Converter, ...)
  • 6 Destinations (Batch/Merge, Sharepoint, ...)
Most of them are hosted on CodePlex, but some of them are provided by other people/organizations that saw a need and filled it.  If you are ever banging your head on how to do something with SSIS, or can't believe something you're trying to accomplish with twenty components hasn't been encapsulated into one, I'd suggest you search the >SSIS Community Tasks and Components project on CodePlex.  If you run across a task or component you like and it's not listed, please let me know!

Friday, January 9, 2009

Extracting Duplicate Primary Key Rows in SSIS

Wow.  This seems to be a hot topic this week at the Microsoft SSIS forums, so here's another fleshed out explanation of something very similar (and very related to) Eliminating Duplicate Primary Keys in SSIS.
Perhaps you don't want to discard the "duplicate" rows - perhaps you want to separate them out in order to log some information, report an error, or "fix" them somehow.  It only takes one and a half steps more than the "Moderately Difficult" process described in Eliminating Duplicate Primary Keys in SSIS.
Replace step 8 with: Edit the Merge Join and set the join type to "Full Outer Join".  Pick all the columns from the right-hand list.  Pick ONE column from the left-hand list (any column, doesn't matter).  Down in the bottom half of the Merge Join editor, you'll see all the columns the Merge Join is creating on the output.  Scroll down to the bottom, and you'll see the column you just picked from the "left" side - in the "Output Alias", it will most likely have the column name plus a " (1)" on it, because it's the same name as a column you already picked coming from the "right" side.  Change the column name to "duplicate_if_null".
Step 9 is to add a Conditional Split after the Merge Join.  Call the first output "Duplicates" with a condition of "ISNULL(duplicate_if_null)".  Name the default output "Originals" (or something like that).
Now you've got two outputs from the Conditional Split - one that gives you all of the "duplicate" rows, and another that gives you the "original" or "best" rows.

Wednesday, January 7, 2009

Eliminating Duplicate Primary Keys in SSIS

One of the common tasks in ETL and Data Warehousing is cleaning source system data. One of the common "cleaning" tasks is to weed out duplicate entries before you update dimension tables. Eliminating duplicate primary keys can be dead easy... but it can also be pretty difficult. Here are three ways - from dead easy to pretty hard - about how to do just that.
All of these methods work inside a single Data Flow Task in Integration Services - either 2005 or 2008, doesn't matter. I've created a SQL 2008 package that demonstrates all three methods - download it from my SkyDrive.
Dead Easy Duplicate Key Removal
The most basic way to detect and "weed out" duplicate rows is simple, but only really works if you don't care about the values in any "non-primary key" columns.
First, feed the rows into a Sort component. Sort by your primary key column(s) and check the box at the bottom to "discard duplicate rows". Done!
The problem with this is that you don't have (much) control over which of the duplicate rows gets passed through the Sort component and which one(s) get discarded.
Moderately Difficult Duplicate Key Removal
If you know you have duplicates, but you are required to use "non-primary key" column values from the "best" version of the row, this method may work. It's appropriate if you have a way of "scoring" a row that will enable you to identify which row is "best", without looking at the other rows that share the same primary key.
  1. Send your data into a Derived Column component that creates a temporary "score" or "value" column containing some value that will let you (later) pick out the "best" row. Perhaps some measurement as to how many fields of non-primary key data are filled in. You might not even need this Derived Column component if there's already some column that tells you exactly which row you're most interested in - like an "updated date" (where later is always "best"). If you can't do this because you need to know information from other rows that share the primary key, you have to go use the "Pretty Difficult" method.
  2. Multicast your flow into streams "A" and "B". (They aren't labeled that way, but I'll talk about stream "A" and "B" differently.)
  3. Send stream "A" through an Aggregate component, grouping by your primary key columns and ONLY your primary key columns. Use an aggregation operation - like Maximum (for the date example) or Minimum - on your "score" column. (All the other column information will be lost here - that's OK.)
  4. Send stream "A" through a Sort component, sorting by the primary key column(s), then your "score" column. (This is necessary for the Merge Join component coming up.)
  5. Send stream "A" into the LEFT side of a Merge Join component.
  6. Send stream "B" into a Sort component, sorting by the primary key column(s), and your "score" column. Do NOT check the "discard duplicates" box.
  7. Send stream "B" into the RIGHT side of the Merge Join component.
  8. Edit the Merge Join, make it an Inner Join (the default), and pick all the columns from the right-hand list.
Pretty Difficult Duplicate Key Removal

If you can't decide which row of the (potential) duplicates to pass through without knowing what's in the other rows, you need to use a Script Transform in asynchronous mode. (Asynchronous in this instance means that we're not outputting one row for each and every row that comes into the component.)
  1. Place a Script component in your flow, identifying it as a transformation component (the default).
  2. Open the Script component editor, pick the "Input Columns" tab. Select the key columns, and any other columns that you want to pass through and/or will be used to determine which row is "best" within the duplicates.
  3. Still in the Script component editor, pick the "Inputs and Outputs" tab. Pick the "Output0" node in the treeview. Change the SynchronousInputID value to "None".
  4. Still on the "Inputs and Outputs" tab of the Script component editor, open the "Output0" node, and select the "Output Columns" folder icon. Press the "Add Column" button, and add your first primary key column to the output - pay attention to data types! You will have to repeat this process and add back ALL of the columns. By making the Script asynchronous, we've essentially removed the entire output column definition that's usually constructed for us by the Script component.
  5. Now you get to edit the Script! (Too much to post here - download from my SkyDrive)
One note for the script component... As it is now, the code in there is NOT according to Microsoft documentation, or how I'm used to the Script component or custom components working. I've posted a bug at Connect for some clarification, since I have it in a repeatable form...
Other Methods?
If you know of some other slick ways of eliminating or processing duplicates, please share...
Update
If you're interested in extracting the duplicates from the data flow and doing something with them instead of discarding them, read Extracting Duplicate Primary Key Rows in SSIS.

Monday, January 5, 2009

The Post-Build Command Line for SSIS Custom Components

I saw a search engine hit on my blog for this today, and thought I'd share my Post-Build Command Line. I'd found the bones of this from other bloggers, but added a few things of my own:

"C:\Program Files\Microsoft SDKs\Windows\v6.0A\bin\gacutil.exe" -u $(TargetName)
copy "$(TargetDir)$(TargetName).DLL" "C:\Program Files\Microsoft SQL Server\100\DTS\PipelineComponents\"
copy "$(TargetDir)$(TargetName).DLL" "C:\Program Files (x86)\Microsoft SQL Server\100\DTS\PipelineComponents\"
"C:\Program Files\Microsoft SDKs\Windows\v6.0A\bin\gacutil.exe" -i "$(TargetDir)$(TargetName).DLL"

First thing to know is that you MUST start Visual Studio as an Administrator (in Vista anyways) in order for the GAC commands to work.

Breaking down that set of commands:
The first command uses GACUtil to remove the component from the GAC - this is NOT strictly necessary, as installing a new component replaces the old one. However, I like to do it step by step in case there are problems, and this way I can tell where the problem occurred. My GACUtil was installed with the .Net SDK - yours may be located somewhere else, just use Windows Search to find it.
The next two lines copy the project's target DLL to the (standard) x86 and x64 PipelineComponents folders (I run 64-bit Vista). If you're just using a 32-bit OS, remove the "(x86)" line.
The last line adds your component to the GAC, plain and simple.