Friday, October 14, 2011

PASS Community Summit 2011 Day 3 Keynote

Dr. DeWitt has spoken to the PASS Summit for the past two years – turning what was otherwise a very forgettable “day 3” keynote slot into a must-see event for anyone going to the Summit.
The past two years Dr. DeWitt took us through the internals of columnstore databases and query optimization.  They were fantastically content-rich and dug right to the bottom of the technology.  On top of explaining the mechanics of the query optimizer and theory of column-based storage, Dr. DeWitt demonstrated understandable applications of the change in technology to common queries and functions we all enjoy today in SQL Server.  He's also a pretty funny presenter.
That brings us to today and the changing world of data.  We’re going to get the concept of “big data” funneled into our brains faster than we can process it – but that’s part of why I’m here.  I hope to capture some of the content that comes flying at us this morning about Hadoop and distributed data processing.  But my other job is to comment on how this movement will affect our industry going forward, changing your data, your job, your server architecture, and even SQL Server internals.  I have no doubt that I’ll only scratch the surface of what Dr. DeWitt conveys this morning – and for that reason alone, I recommend you don’t miss his keynote (if they can get him) next year at the PASS Community Summit.
According to Dr. DeWitt, big data is a large collection of data – petabytes – on a large collection of machines.  To others, “big data” means NoSQL, but that's not a requirement to Dr. DeWitt.  For some sophisticated commentary on Dr. DeWitt’s presentation (and some inspired discussion), I expect to see something on Jeremiah Peschka’s blog – I’d encourage you to go read it if you end up interested in this type of technology.
The data explosion is largely due to automation, a sea change in collecting data.  In the past, it was all manually entered, but now cell phones collect location data automatically, clickstreams rule advertising, and on and on.  The “old guard” (and Dr. DeWitt freely considers himself part of that) would handle these petabyte data warehouses on a parallel database system.  The new guard wants to use NoSQL.  In his observations, current implementations of NoSQL take about 5 times the server hardware as parallel technologies.  For example, EBay stores 10PB on 256 nodes vs. Facebook’s 20PB on 2700 nodes, and Bing’s 150PB on 27,000 nodes!  Of course, parallel DB technologies are typically (for good reason) delivered as appliances of high-quality hardware compared to NoSQL usage of commodity type (possibly-heterogeneous) servers.  As with anything to do with databases - "it depends."
NoSQL isn’t all the same – it’s a big tent of technology – but there are a couple major types: Key/value stores (CouchBase, MongoDB, …) and Hadoop.  The former are intended primarily for single-record updates/retrievals – you can think of them as the “OLTP NoSQL” technology.  Hadoop is designed for processing massive data sets, and is roughly expressible as “data warehouse NoSQL.”
The claim that NoSQL works on unstructured data, and RDBMSs are for structured data is a misconception.  NoSQL databases do have structure – it’s just that the structure is applied to the data later in the process than for your traditional RDBMS.  Instead of the metadata being stored with the data – in fact being a requirement to actually store the data – NoSQL metadata is “applied” when the analysis function processes the data.  I think of it as “compiled” vs. “interpreted” code.
Only eight years ago, Google started in on the big data problem in order to analyze clickstream data.  Hadoop is an open-source interpretation of Google’s Map/Reduce+GFS system.  The general structure has two parts – a fault-tolerant data storage layer (HDFS) and a fault tolerant processing layer.
The data storage layer provides fault tolerant storage using file redundancy across different nodes in the cluster of servers, protecting against individual server failure as well as entire racks failing.  You can think of this like RAID – data is smartly distributed over the cluster.  If one node fails, the data exists on other nodes, so it isn’t lost.  The main difference here is that RAID only addresses single disk failures, whereas the HDFS protects against query, disk, server, switch, rack, and datacenter failures without necessarily having to restart or recover… using only one “redundancy” algorithm.
The layer that sits on top of the storage is the processing engine – Map/Reduce.  Map/Reduce is best at performing aggregations, but it isn’t designed to do joins.  Joins take a ton of code to achieve – four full screens of multiple “programs” compared to a quarter-screen of SQL (HiveQL).  Dr. DeWitt compared performance of a few TPC-H “structured” queries between Hive and PDW – all showing significant wins by parallel database “old guard” technology.  The reason is primarily due to “old technology” being intentionally designed to handle these specific type of queries – even though they’re aggregations that he’d just said perform typically better in NoSQL scenarios.  Why?  Because of the inherent optimizations possible when the data used is a TPC-H workload, which is already schema’d and loaded.  If the data structure wasn’t known ahead of time, a parallel database would not be capable of servicing the query at all – until metadata was “applied” by performing ETL.
There still is a strong use case for NoSQL – and his reasoning follows how I think about the relationship between SSAS (cubes) and PowerPivot (tabular) analysis technologies.  In Map/Reduce, the schema is stored in the application that’s doing the analysis itself – not within the data like in SQL.  So it’s a lot “easier” and better to use when you’re doing exploratory analysis of data.  Much like PowerPivot, you only need to define enough schema to get your particular job done – not enough schema to get any type of query done, or “explain” each piece of data in the set.  But at some point, if your exploratory analysis finds value in the data, you’re going to want to move it into an RDBMS to get the query performance out of it, and have a consistent schema applied to avoid two different analysis applications interpreting the data in inconsistent ways.  Currently, there are techniques and technologies for moving data between these platforms, but they can be pretty difficult and involved.  One of those is called Sqoop.
Dr.DeWitt envisions a better way to share data between SQL and NoSQL worlds than Sqoop.  He’s investigating the concept of an “Enterprise Data Manager” that can relate and use unstructured and structured data so that the appropriate engine can traverse and analyze whichever type of data it’s presented with.
A few ideas Dr. DeWitt had for this year that “didn’t make the cut”, but could be a topic for next year’s Summit:
  • What the “Cloud” means for databases
  • The impact of Solid State Drives and Phase Change Memory
  • ... whatever you might want to see – send your ideas to PASS!
If you want more, visit David DeWitt and his team on the web, there are links to the PASS talks on the left side.  My thanks go to Dr. DeWitt and the team that got him up on stage - I really hope they can do so again next year.

Friday, October 7, 2011

What I Learned at #SPC11

This week, I attended the Microsoft SharePoint Conference 2011 (#SPC11) in Anaheim.  Not my usual beat, to be sure, but our company is starting to use SharePoint more, and we have no internal resources trained on it.  We have a "wild west" set of sites already, plan to add more for cross-functional project team sites, want to host BI solutions delivery, and we've got a project on the books to migrate an existing old-tech Intranet to SharePoint.  Better late than never, we're taking this opportunity to learn a little more.
I'm A Newb Here
Overall, the conference was pretty good - I learned things I needed to know and got more comfortable with the usage, architecture, and management of SharePoint.  Of course, having not known these things prior to our "wild west" deployment, I now know we did it all wrong.  Some good news is that even though I may not yet know the "perfect" way to deploy SharePoint for our needs, I know what questions to ask, and generally how to "fix" our current deployment.
There was one more good opportunity I took advantage of at the conference.  I found myself in a fairly typical position for other people: I'm an almost complete newbie at the technology, and I didn't know anyone in the community.  That sounds eerily familiar to the situation that first-timers have at the SQL PASS Community Summit.  I was in the same position with SharePoint that they'll be in with SQL Server.  The difference is that I now know what I didn't when I was a first-timer at Summit.  I now know a great deal about the technology, the community, and the conference itself.  I'm participating as a "big brother" or "guide" for a gaggle of first-timers... Could I put my own "first-timer" experience at the SharePoint conference, contrast it with my "veteran" status at Summit, and help my first-timers get great value out of the Summit?  You bet.
And It Wasn't A Great Experience
To start off, I'll describe my experience at SPC11 for you - the discomfort, unfamiliarity, and frustration of it all.  Yes, I wasn't terribly comfortable at the conference.  Despite what you may assume from reading here or meeting me in person, I'm an introvert.  I only exhibit this extrovert type of behaviour when I'm around friends, or in an environment where I'm comfortable with my perceived "expertise."  I didn't really want to start conversations with people I ate breakfast and lunch with in the convention dining hall.  Saying more than hi - even to the bare extent of offering my name as an introduction was difficult.  I forced myself to, but my reluctance was reinforced by the lack of responses I usually got.  I hardly acknowledged people sitting next to me in sessions - I often sat two or three seats away from anyone else.  Sound like your typical convention-goer?  It does to me.
Add to that that I'm completely unfamiliar with the technology.  How do I know what sessions to go to, or what vendors to talk to?  I know that my organization has issues with SharePoint - not knowing how to use it, manage it, or architect it well.  But is there a session that covers that?  Not that I could find.  One may argue that a convention isn't the place I should be going to find this stuff out... but here I am at SPC11 in exactly this boat, and I'm positive there are other wallflowers bailing water with me.  I don't even know the terminology well enough to know where to start!  From what I now know about SharePoint and language used to describe it, I missed some helpful sessions.  Luckily they'll be online for me to review later.  But I won't get the opportunity to talk to the speaker after the session, or meet others who were also up at the podium asking similar questions (making me feel better about being a dummy) or talk to other experts that were trolling the area, offering assistance.  Opportunity definitely lost...
Being uncomfortable with the setting and unfamiliar with the technology made me frustrated.  I'm not used to being in that position, and the frustration drove me to learn more so I didn't stay in that situation.  Unfortunately, there's not much in the way of hand- or foot-holds to help me climb out of that pit... so I got more frustrated!
It wasn't a complete disaster, I did get a lot of useful information out of the conference.  Thanks in no small part to Mahmood Jaffer - he helped fill in that basic knowledge I was missing.  Yes - it only took one contact to "make things better."  But the experience itself was illuminating to me as to what little things could have helped me get so much more out of it.  I'm glad to say that PASS' programs - of which the first-timer program is an example - do help.  But even though PASS is doing many things "right", I think they could do more, and relatively easily. 
Necessity Is The Mother...
Here were my issues on being a first-time conference attendee at the SharePoint Conference:
  • I couldn't identify "experts".  (Perhaps I should clarify that I'm equating "experts" with "people that will help me".)  I think speakers had a different badge accent colour - but I'm not sure.  Vendors definitely did.  MVPs did not.  No ribbons either.  I spent the entire welcome reception in vain trying to find one - not even vendors knew who or where they were.
  • I didn't know my (independently reserved) hotel had a shuttle to the conference until I registered.
  • I didn't know my (independently reserved) hotel internet access was covered until I checked out.  (I bought it anyway - had to in order to do office work, and was prepared to expense it.)
  • Apparently, there was a "birds of a feather" area.  The only information I saw on it was on a room directory, but I didn't investigate further.  I don't know what kind of "feathers" they were - industry or feature - and I didn't know when it was either.
  • Apparently, there were Hands-On labs available.  Again, I only saw passing reference to rooms on signage, without any description of lab content in the conference registration package or emails.  (They did promote certification exams though.)
I think it's also important to note I had the double-whammy of being a product newbie as well as a conference newbie.  Here were my issues on being completely unfamiliar to the technology of SharePoint:
  • I didn't know the product features or purpose - not even enough to know what I didn't know.
  • I didn't know how important knowing certain things was. Was understanding claims-based security more important than knowing how to leverage content types?
  • I still don't even know what I don't know...
Even though I'm sure some of the above two lists was in the documentation or registration package, I was more interested in addressing other issues than in reading all that to pull out a nugget or two. 
...Of Invention
I think the conference organizers could have helped me out with those problems by:
  • Delivering better preparatory email to new/all attendees.  I was actually shocked by the grand total of two - yes TWO - emails from the conference organizers that only served to help me prepare logistically... minus key shuttle and hotel internet information.
  • Targeting emails by attendee "type".  I don't want a ton of spam either.  Separating communications for "newbs" versus "returnees" would help keep content from being inadvertently categorized as spam or irrelevant by the receiver.
  • Identifying experts.  SPC11 didn't have the ribbons all us PASS vets know (and I hope you love as much as I now do).  Yes, they had a "badge customization" booth you could go to after registration to "add bling" to your badge (their words, not mine) with fairly unhelpful ribbons like "Internet Sensation" and "Hire Me".  (Really?)  Summit's ribbons are better, but perhaps they aren't as helpful as they could be.  Instead of a ribbon for "MVP" or "Speaker" (can't remember all the ones I've had), they should have one of the colours (say red) with technology words.  Pasting a red "Clustering" ribbon on your badge would mean "I know clustering, and I can (and want to) answer your questions about it."  SPC11 didn't put those "bling" ribbons in everyone's packet (nor would I want PASS to), they had a free-for-all booth to pick and apply your own.  That would work for a set of "I'm here to help with" ribbons (borrowing Andy Leonard's trademark).
SPC11 could have helped a product ignoramus like me more easily by:
  • Having an experts area manned at all hours of the conference.  Heck, "experts" is asking too much.  "People with a passing familiarity" would do!  I would have LOVED to have sat down with someone at the Welcome Reception who could have:
    • Found out what I knew about SharePoint and what I didn't
    • Found out what my learning goals were for the conference
    • Given those, helped me to identify appropriate sessions/events to attend
  • Having an "I'm new to SharePoint" track.  I know that program selection is a hard job - one of the problems is having appropriately leveled content.  If there was a dedicated "I'm new to <product>" track with ONLY 100-level sessions in it, I think a newbie would do quite well, thank you.  Yes, as a SQL newb, I might only care about DBA type content, not BI, cloud, and/or modeling content that would (should) be delivered in such a track.  But if the first session(s) in that track explained the big picture - the stack, the different tech, AND the typical job descriptions associated with SQL Server, I think that would help newbs a LOT.  Follow that up with intro sessions to the sub-products and features.  With that taken care of, the program committee could feel more freedom in loading up on 300/400 level sessions.
  • Alternatively, have an "I'm new to <product>" pre-con to cover that...
  • Pass out/win "I know <product> <feature>, thanks to <conference>" shirts with "Ask me about <feature>" on the back.  Have them at the ATE booths for "experts" to give to attendees they think learned something about the subject... with a soft commitment from attendees to WEAR THE SHIRT during the conference and bring it next year.
Make Conferences Better!
The above lists are directed at the SharePoint Conference - but I hope PASS can take away some nuggets as well.  I think most of those points could help make the first-timer and/or newbie experience so much better, and draw attendees into the awesome community that we have already.  Since this has turned into an "open letter" to PASS (I'll be drawing their attention to it), I would also encourage members of the PASS board to attend a conference every year that they've never attended before - big, small, on a technology which they're completely new to, doesn't matter - for the express purpose of putting themselves in the shoes of those they want to serve.

Monday, October 3, 2011

A "New Guy" Joins Our Team

It's a great day for me and Moulding and Millwork (the company I work for).  A few weeks back, I let you all know that I'd been given the green light on finding someone to help me out with Data Warehousing.  One of the first people I turned to for suggestions on finding someone was Scott Stauffer, the chapter leader of the Vancouver BI PASS chapter, PASS regional mentor, and SQLSaturday organizer.  I was hoping Scott could help me find the right person out there to help grow and support the fact-based decision making engine I envisioned working at M&M.  Scott has a ton of contacts from being a leader in the community, and having a busy consulting practice in the area for years and years.
Scott thought my news was great - he was really excited for me that what I've been working towards was looking like it was gaining a lot of traction.  But then he surprised me.  He didn't start by giving me a list of people he knew that would fit the position.  Nor did he point me towards recruiters or job boards.  The first thing Scott asked was if I thought he would fit the position.  That totally threw me for a loop.  I was expecting to have either a naive kid to train up completely, or a SSAS geek to learn from.  It won't surprise you to find out that I was (after the shock) VERY interested in working out arrangements with Scott.  (If you get an MVP interested in working with you... you find out a way to say yes.)  On top of being a great community supporter and organizer - which only touches the tip of his interpersonal capabilities and vision for the big picture - Scott is an exceptional technical talent at the whole Microsoft BI stack.  Yes, he'll claim that I've got him beat in Integration Services, but I actually don't think so.  I'm a hacker (in the classical sense) in SSIS, and only have one "project" to claim on my resume.  I've managed to milk that into an MVP award for the third year in a row now (yay!) which I just found out about this past weekend.  But Scott has a list longer than my arm of successes in this space that I now get to draw on and use to help our IT department support our business with insightful data to help them make visionary decisions quickly and more confidently than ever before.
With that, I'll be accepting your congratulations to me for landing Scott as a partner in crime here at Moulding and Millwork.  It's a real coup for me, and I'm pretty happy with myself at the moment. ;) I'll make sure his badge at PASS Summit is appropriately labeled with his new allegiance.  Oh... and I guess you can send your congratulations to him on twitter or in person at the Summit.  He'll have to get used to the good-natured ribbing I dole out - welcome to M&M Scott!

Tuesday, September 6, 2011

Parallelizing Slow Parts of the Data Flow - Part 1 - Preparation

It's quite common to have parts of your Data Flow that are slow, and there are some techniques you can use to improve performance of those parts.  One of them is to try to parallelize the slow operation - but that only works for operations that are parallelizable.  Sort isn't one of them - sorry to say.  And you really have to know how the data flow works to make sure you don't try fixing a problem that looks "serial" but is, in fact parallel - you'll only make things worse.
Identifying Parallelization Opportunities
So what scenarios does parallelization work for? 
Positive Signs
Here are some characteristics that should identify a good candidate for parallelizing a portion of your data flow.
  • The operation applies independently to single rows, or to an easily identifiable subset of rows.
  • There's a known-slow resource in use that allows concurrent use (SQL Server, disk, web, ...)
Counter Indications
If your data flow has any of the following characteristics, it's not likely that parallelizing (alone) will help speed anything up.
  • The server's CPU is at or near capacity during the slow section of the data flow.
  • Any external resources (SQL Server, file system, network) the slow section of the data flow uses are already at or near capacity.
  • Some part of the slow section requires all rows at once - an aggregate or sort, for example.
Test, Backup, Change, Test, Backup, Change, ...
After applying a parallelization technique, you'll need to figure out if it's better, right?  Then the first thing you have to do is measure the performance of your current design (obviously!).  Make sure you measure a few things - time to completion, CPU, RAM, network, and disk utilizations.  Run the test at least three times in a row to eliminate caching effects and get a good average. 
We measure all those attributes so that we can see what we're trading speed for.  The idea is to trade CPU utilization for speed - but we could have to pay RAM and I/O as well - to a degree you might not be comfortable with.  That's why we keep an eye on those metrics.
Before you make any changes... make sure you back up your original!  Your quest to improve performance using one technique might work somewhat, and you may want to try another technique.  If you have a backup of your original package, it'll (probably) be easier to implement the second technique from there, while ensuring your actual logic remains consistent.  Do also back up your attempts - you might try a few techniques but decide the first one worked "best".
Make your change to parallelize the package, then double-check the result is the same as your original.  It doesn't do any good to make your data go faster if you ruin it in the process.
Be Prepared For Disappointment
Even if your slow section seems to fit the profile I outlined, there's no guarantee applying a parallelization technique will help.  You may not be able to squeeze any more time out of the process due to some limiting factor you weren't aware of when you started.  Or the trade of speed for some other resource just isn't something you're ready to make.
What follows will be posts on using the Balanced Data Distributor, then a "poor man's" BDD using existing components for those of you that can't install additional software (even when published by Microsoft).

Monday, August 29, 2011

Inserting Records AND Getting The Identity in SSIS - Part 2

Since I posted a ways back on retrieving the identity value for inserts in SQL Server Integration Services, I've learned many things - one of which is an inline solution for that problem I didn't know back then.  Check out that previous post for the business problem, then come back here for an "inline" solution using SSIS - without scripts - just regular SSIS tasks and components.
It's a Love/Hate Relationship
The OLE DB Command component in SSIS is a near-universally hated object due to it's inherent performance problems.  But it's so hard to hate when it can actually get the job done when nobody else can.
Yes, our old nemesis is back to solve a problem for us... at the same price it's always asked.  Do you want your packages to run fast, but be complicated?  Or do you want them to run slow, but be easy to understand?  Sadly, there is not much middle ground.
The Keys
It turns out to be pretty easy, and I should have got it into my thick skull before, from all the hints I'd read.  A stored procedure with an OUTPUT parameter, and the OLE DB Command are the keys.  For this example, assume we're using a table that has only two columns: ID and TEXTVALUE, with ID being the automatically generated identity column.
The Stored Procedure
Unfortunately, this doesn't use much of the GUI power of SSIS, it's old hand-coding.  For our simple table, this would be the stored procedure definition:

CREATE PROCEDURE InsertData
  
@textvalue AS CHAR(10),
  
@id AS INT OUTPUTAS
INSERT INTO
IdentityTable
  
(textvalue)
  
VALUES
  
(@textvalue)
SET
@id = SCOPE_IDENTITY()

The SSIS Data Flow
Use a Derived Column to generate the ID column with a NULL(DT_I4) value in it.  (We need somewhere to place our identity value in.)


Following that, we can work on the OLE DB Command component, using a statement like this:

EXEC InsertData ?, ? OUTPUT


Map the columns as you'd expect to map them.
If you'd like I've got an SSIS 2008 package showing this technique.
The Big Picture
Sure, it works - but here are the drawbacks:
It's going to perform badly. Horribly if the package is run on a different server from the database the stored procedure is on.  I can't stress that enough.  The OLE DB Command component is making a round-trip call to the database for each row.  Even on the same machine, that's an expensive operation.
You're doing an RBAR (row by agonizing row) process which is much, much slower than a set-based process that SQL Server is more suited for.  (SSIS is fine with RBAR, it's designed that way.)  Neither SQL Server process nor the Integration Services runtime will be working very hard during this part of the data flow - I don't think even one core would get utilized fully due to the wait-states involved.
If you're interested, here's a sample package to show this process.  It creates a database on your local machine, creates the sample table, generates one row, performs the insert (and shows the result to you), then tears the database down.
Can we mitigate this RBAR flaw?  Stick around or subscribe your RSS reader to find out in an upcoming post.