For the past two years, Dr. DeWitt has taken us through the internals of columnstore databases and query optimization. Those sessions were fantastically content-rich and dug right to the bottom of the technology. On top of explaining the mechanics of the query optimizer and the theory of column-based storage, Dr. DeWitt demonstrated understandable applications of the change in technology to common queries and functions we all enjoy today in SQL Server. He's also a pretty funny presenter.

According to Dr. DeWitt, big data is a large collection of data – petabytes – on a large collection of machines. To others, “big data” means NoSQL, but that's not a requirement for Dr. DeWitt. For some sophisticated commentary on Dr. DeWitt’s presentation (and some inspired discussion), I expect to see something on Jeremiah Peschka’s blog – I’d encourage you to go read it if you end up interested in this type of technology.

NoSQL isn’t all the same – it’s a big tent of technology – but there are a couple major types: Key/value stores (CouchBase, MongoDB, …) and Hadoop. The former are intended primarily for single-record updates/retrievals – you can think of them as the “OLTP NoSQL” technology. Hadoop is designed for processing massive data sets, and is roughly expressible as “data warehouse NoSQL.”
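
To make the “OLTP NoSQL” idea concrete, here's a minimal sketch of single-record writes and reads against MongoDB using pymongo. The connection string, database, and collection names are made up for illustration; nothing here comes from the keynote itself.

```python
# Minimal single-record put/get against MongoDB via pymongo.
# Connection string, database, and collection names are illustrative only.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
events = client["clickstream"]["events"]

# "OLTP-style" single-record write: one document, no schema declared up front.
events.insert_one({"user_id": "u42", "url": "/pricing", "ts": "2011-10-13T09:15:00Z"})

# Single-record read, looked up by a key-like field.
doc = events.find_one({"user_id": "u42"})
print(doc)
```
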
The claim that NoSQL is for unstructured data and RDBMSs are for structured data is a misconception. NoSQL databases do have structure – it’s just that the structure is applied to the data later in the process than in a traditional RDBMS. Instead of the metadata being stored with the data – in fact being a requirement to actually store the data – NoSQL metadata is “applied” when the analysis function processes the data. I think of it as “compiled” vs. “interpreted” code.
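
One way to picture “metadata applied when the analysis function processes the data”: the raw file is just bytes, and the structure lives in the code that parses it at analysis time. Here's a hedged Python sketch; the tab-separated clickstream format is entirely made up.

```python
# Schema-on-read sketch: the "schema" is whatever the analysis code decides
# to parse out of the raw line at processing time. The log format is hypothetical.
from datetime import datetime

def parse_click(line: str) -> dict:
    # Structure is imposed here, at analysis time, not when the data was written.
    ts, user_id, url = line.rstrip("\n").split("\t")
    return {"ts": datetime.fromisoformat(ts), "user_id": user_id, "url": url}

raw_line = "2011-10-13T09:15:00\tu42\t/pricing"
print(parse_click(raw_line))
```
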
Only eight years ago, Google started in on the big data problem in order to analyze clickstream data. Hadoop is an open-source interpretation of Google’s Map/Reduce + GFS system. The general structure has two parts – a fault-tolerant data storage layer (HDFS) and a fault-tolerant processing layer (MapReduce).
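
As a rough sketch of what the processing layer looks like from a developer's point of view, here is a toy Hadoop Streaming-style job in Python that counts hits per URL from the made-up log format above. The mapper and reducer are plain scripts that read lines on stdin and emit tab-separated key/value pairs on stdout; combining them in one file here is just for compactness.

```python
# Toy MapReduce sketch in Hadoop Streaming style: count page hits per URL.
# The clickstream format is the same hypothetical one as above.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        ts, user_id, url = line.rstrip("\n").split("\t")
        # Emit key \t value; the framework groups and sorts by key.
        print(f"{url}\t1")

def reducer(lines):
    # Streaming sorts mapper output by key, so all counts for one URL arrive together.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for url, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{url}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)(sys.stdin)
```

On a real cluster these two pieces would be handed to the Hadoop Streaming jar with HDFS paths for input and output, and the framework takes care of distributing the work and sorting between the map and reduce phases.
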
The data storage layer provides fault tolerance through file redundancy across different nodes in the cluster of servers, protecting against individual server failures as well as entire racks failing. You can think of this like RAID – data is intelligently distributed over the cluster, so if one node fails, the data still exists on other nodes and isn’t lost. The main difference is that RAID only addresses single-disk failures, whereas HDFS protects against query, disk, server, switch, rack, and datacenter failures without necessarily having to restart or recover, using only one “redundancy” algorithm.
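
HDFS’s actual placement logic is more involved than this, but a toy sketch of the idea – keep several copies of each block and make sure they don’t all land on the same rack – might look like the following. This is purely illustrative, not HDFS code, and the cluster layout is invented.

```python
# Toy illustration only -- not HDFS's real placement algorithm. The point is
# just that copies of a block are spread across nodes on different racks, so
# losing a node (or a whole rack) still leaves a live copy elsewhere.
import random

# Hypothetical cluster layout: rack -> data nodes on that rack.
cluster = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
    "rack3": ["node7", "node8", "node9"],
}

def place_replicas(block_id: str, copies: int = 3) -> dict:
    # Choose `copies` distinct racks, then one node on each rack.
    racks = random.sample(list(cluster), k=copies)
    return {block_id: [random.choice(cluster[rack]) for rack in racks]}

print(place_replicas("blk_0001"))
```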

There is still a strong use case for NoSQL – and his reasoning parallels how I think about the relationship between SSAS (cubes) and PowerPivot (tabular) analysis technologies. In Map/Reduce, the schema lives in the application doing the analysis – not with the data, as in SQL. That makes it a lot “easier” and better suited to exploratory analysis of data. Much like PowerPivot, you only need to define enough schema to get your particular job done – not enough schema to satisfy any type of query or to “explain” every piece of data in the set. But at some point, if your exploratory analysis finds value in the data, you’re going to want to move it into an RDBMS to get query performance out of it and to have a consistent schema applied, so that two different analysis applications don’t interpret the data in inconsistent ways. Currently, there are techniques and technologies for moving data between these platforms, but they can be pretty difficult and involved. One of them is called Sqoop.
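
Sqoop is the tool that automates that kind of transfer between Hadoop and relational stores at scale; just to illustrate the “consistent schema applied on the way in” part, here is a hedged pyodbc sketch that loads a few parsed rows into SQL Server under a fixed table definition. The driver, database, and table names are assumptions for the example, not anything from the talk.

```python
# Illustrative only: the parsed, exploratory rows get a fixed schema once they
# land in the RDBMS. Driver, database, and table names are made-up examples.
import pyodbc

rows = [
    ("2011-10-13 09:15:00", "u42", "/pricing"),
    ("2011-10-13 09:16:10", "u7",  "/docs"),
]

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Clickstream;Trusted_Connection=yes;"
)
cur = conn.cursor()
cur.execute("""
    IF OBJECT_ID('dbo.PageViews') IS NULL
        CREATE TABLE dbo.PageViews (
            EventTime DATETIME2     NOT NULL,
            UserId    VARCHAR(64)   NOT NULL,
            Url       VARCHAR(2048) NOT NULL
        )
""")
cur.executemany(
    "INSERT INTO dbo.PageViews (EventTime, UserId, Url) VALUES (?, ?, ?)",
    rows,
)
conn.commit()
```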

Some ideas for a future keynote:

- What the “Cloud” means for databases
- The impact of Solid State Drives and Phase Change Memory
- ... whatever you might want to see – send your ideas to PASS!