Bill Vorhies
Business Foundation Series, #1

Summary:  If you’re new to Big Data, you’re not alone in trying to nail down exactly what it is and how it may affect your life.  Several characteristics of Big Data are commonly quoted, such as volume, velocity, and variety, but none of these provides a definitive test dividing Big Data from not-so-big-data.  It’s much easier to understand this opportunity if you look at it from the perspective of the new tools and techniques that underlie Big Data.  In our opinion, Big Data is about new breakthrough techniques in data storage, retrieval, and analysis in search of appropriate problems to solve.

The Challenge of Understanding Big Data

If you’ve done some reading or research into Big Data on your own, you can skip this post and read further down in our Foundation Series.  However, if you’re new to the topic and want to figure out what all the fuss is about, and possibly whether you need Big Data, then start here.

It seems to be a requirement for this topic to start out by stating the obvious: the data in our business, professional, and personal lives is growing rapidly, and in fact exponentially.  OK, but why all the sudden hoopla?  What’s this all about?

Since July I’ve been participating in the National Institute of Standards & Technology (NIST) Big Data Working Group.  Big Data has been recognized as one of those critical emerging technologies that the US as a nation needs to stay out in front of, and our goal, through several subcommittees, is to establish the definitions, taxonomy, reference architecture, and technology roadmap for Big Data.  All of this is meant to establish a kind of common language and baseline understanding.  But here’s the real kicker: all these experts from business, academia, research, and government are having one heck of a time agreeing on just what Big Data is.  I’m sure that when we’re done in September or October we’ll have agreed on something, but I’m just saying, you’re not the only one who’s confused.

The issue is that when you try to find one or more unique characteristics of data that make it “Big” you can make a list, but every attribute needs significant qualification.  Right now it’s a lot like great art; you’ll know it when you see it.

It’s About the Technology

The reason for this confusion is that Big Data is not so much about the data as it is about the Data Engineering, which is a technical way of saying it’s the tools or technical solution that are unique.  And what’s unique?  Well up to this point, at least up to about 2008, everyone stored their data to be analyzed in a data warehouse.  Essentially all data warehouses are based on relational database management systems (RDBMS) – no need to get more technical than that.

Around 2008, some companies you know, Google, Yahoo, Amazon, eBay, Facebook, Netflix, and their peers, were recognizing a problem.  At that time, the question was how to develop a “personalization-of-one”, that is, how to make each and every customer’s experience unique to them.

It should be apparent that in order to do that, you have to store everything you know about each customer, analyze it, and act on it.  If the problem could be solved, then Netflix could offer you a unique list of shows you personally might be interested in, Amazon could offer unique books and merchandise just for you, and so on.  The problem was that this is a lot more data than anyone had been storing, and the basic RDBMS technology that had served us well since the 1980s was too expensive and too slow.

Also, what is known about each customer is contained not just in our transactional systems (ERP, CRM, Order and Fulfillment Systems and the like) but in web logs that contain the full history of each customer’s journey through our web sites, in social media comments about our companies, in email, in call center logs and other text sources, and even in sensor and video data that show events like where the customer is physically and how he is behaving.  Most of these latter data types simply don’t fit well in our traditional RDBMS data warehouses, and they arrive in such large quantities that they outstrip our current ability (and budget) to store them.

The Rise of the Big Data Solution

Some very smart folks in Silicon Valley set out to solve the problem by inventing a new technology for data storage and retrieval with the name Hadoop.  Hadoop grew from research papers that Google published in 2003 and 2004 describing the Google File System and MapReduce.  Development began in 2006 and Yahoo made the first commercial implementation in 2008.

The key elements of Hadoop are a “key value store” (replacing RDBMS) for storage and “MapReduce” for query and retrieval (previously mostly done with SQL).  If you’ve heard of Hadoop, Cloudera, Hortonworks, or any of the other branded names for these solutions, they are all variations on this new approach.  These two technology elements are then implemented across a large number of servers linked together and operating as a whole.  This is known as Massively Parallel Processing (MPP).  Key Value Store and MapReduce implemented through Massively Parallel Processing are the foundations on which essentially all new Big Data tools are based.
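
To make the idea a little more concrete, here is a minimal sketch in Python of the map, shuffle, and reduce steps applied to key/value pairs.  It is a toy that runs on one machine and only simulates spreading the work across servers; it is not Hadoop's actual API, and the sample records and function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical web-log records: (customer_id, page_viewed).
LOG_RECORDS = [
    ("cust-1", "/home"),
    ("cust-2", "/home"),
    ("cust-1", "/books/123"),
    ("cust-3", "/cart"),
    ("cust-1", "/cart"),
    ("cust-2", "/books/456"),
]


def partition(records, num_servers):
    """Spread records round-robin across simulated servers (shards)."""
    shards = [[] for _ in range(num_servers)]
    for i, record in enumerate(records):
        shards[i % num_servers].append(record)
    return shards


def map_phase(shard):
    """Map: emit a (key, value) pair for every record in one shard."""
    return [(customer_id, 1) for customer_id, _page in shard]


def shuffle(mapped_outputs):
    """Shuffle: group the emitted values by key across all shards."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def page_views_per_customer(records, num_servers=3):
    shards = partition(records, num_servers)
    # On a real cluster each shard would be mapped on its own server in
    # parallel; here the shards are simply processed one after another.
    mapped = [map_phase(shard) for shard in shards]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(page_views_per_customer(LOG_RECORDS))
```

The answer does not depend on how many simulated servers the records are split across, which is the property that lets this style of processing scale out by adding machines.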

New Technology – Big Benefits

It didn’t take long for others to realize that this new model for data storage was going to be a very big deal indeed and the market has responded accordingly.  Why is it a big deal?  Well for starters:

  1. Cost: Data storage is now as much as an order of magnitude less expensive.  Large ‘data warehouses’ for Big Data could be built for 10% or 20% of what old RDBMS data stores cost.  In fact, it’s likely that as this evolves over the next decade, data stores based on this new technology will almost completely replace all but the most modest of data warehouses.
  2. Compatible with the Cloud: Big Data storage technology is also very compatible with cloud storage, further allowing the business and scientific communities to rent rather than buy the servers they need for storage.  And these servers can be inexpensive commodity servers (think the Dell desktop machine you may have at work) instead of the expensive hyper-reliable fault-tolerant servers that RDBMS requires.
  3. Speed: Even when we’re talking about Terabytes, Petabytes, or more of the data needed to track everything about a customer or event, from the web pages they’ve visited to everything they’ve ever put in an on-line shopping cart, purchased or not, the data can be retrieved for analysis at lightning speed.  RDBMS data warehouses may have taken several days to extract similar data from data stores this large, much too slow to be able to offer you what you might want to see next while you’re still on the web site.  Visa, for example, used Hadoop to process 36 terabytes of data in 13 minutes where it previously took one month for the same task.
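
To put the Visa figures in perspective, here is a quick back-of-the-envelope calculation in Python.  The only inputs are the numbers quoted above (36 terabytes, 13 minutes, one month); treating "one month" as 30 days is our assumption, and the function names are just for illustration.

```python
MINUTES_PER_DAY = 24 * 60
ASSUMED_DAYS_PER_MONTH = 30  # assumption: "one month" taken as 30 days


def speedup(old_minutes, new_minutes):
    """How many times faster the new run is than the old one."""
    return old_minutes / new_minutes


def throughput_tb_per_minute(terabytes, minutes):
    """Average terabytes processed per minute of run time."""
    return terabytes / minutes


old_minutes = ASSUMED_DAYS_PER_MONTH * MINUTES_PER_DAY  # 43,200 minutes
new_minutes = 13
data_tb = 36

if __name__ == "__main__":
    print(f"Speedup: about {speedup(old_minutes, new_minutes):,.0f}x")
    print(f"Old rate: {throughput_tb_per_minute(data_tb, old_minutes):.5f} TB per minute")
    print(f"New rate: {throughput_tb_per_minute(data_tb, new_minutes):.2f} TB per minute")
```

Under that 30-day assumption the job runs roughly 3,300 times faster, which is the difference between a monthly report and something you can act on while the customer is still on the site.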

So in a nutshell, Big Data is really a new data storage and retrieval technology in search of appropriate problems to solve.  It’s much more about the technology solution than any specific characteristic of the data.

Characteristics of Big Data

In later posts we’ll talk about more specifics, but here are some of the basics that determine whether you need Big Data in your business or professional life:

Volume:  If you have Terabytes or more of data to store, it may be Big Data.  There’s no specific cutoff that says at this size you need to switch.  If you are successfully managing your business with Gigabytes of data you will continue to do fine with your current RDBMS warehouse.  Later on in this series we’ll talk about what you can do with Terabytes of data if you choose to explore this further.

Velocity:  Data scientists talk about data-at-rest and data-in-motion.  The simplest examples of data-in-motion are the streams of data that can be captured about a customer from a web site (capturing and storing web logs), all of the transactions being authorized by a credit card company, or all the measurements from an industrial control system such as at a chemical refinery.  The point is to capture and examine this data in real time so you can react to it in real time.  High velocity data is necessarily also High Volume data, but velocity presents a separate set of considerations.
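
As a rough illustration of reacting to data-in-motion, the Python sketch below looks at card authorizations one at a time as they arrive and flags any card that shows too many of them within a short window.  The event format, the thresholds, and the function name are all hypothetical; real streaming systems are far more involved, and this only shows the "examine each event as it arrives" idea.

```python
from collections import defaultdict, deque


def flag_rapid_cards(events, max_events=3, window_seconds=60):
    """Yield (timestamp, card_id) when a card exceeds max_events in the window.

    `events` is an iterable of (timestamp_in_seconds, card_id) pairs that is
    assumed to arrive in time order, one event at a time.
    """
    recent = defaultdict(deque)  # card_id -> timestamps still inside the window
    for timestamp, card_id in events:
        window = recent[card_id]
        window.append(timestamp)
        # Drop timestamps that have aged out of the window.
        while timestamp - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_events:
            yield timestamp, card_id


if __name__ == "__main__":
    # Hypothetical stream of authorizations.
    stream = [(0, "A"), (10, "A"), (20, "B"), (30, "A"), (40, "A"), (200, "A")]
    for when, card in flag_rapid_cards(stream):
        print(f"t={when}s: card {card} has unusually rapid activity")
```

Because the function only keeps the last few timestamps per card, it never needs the whole history in memory, which is one of the separate considerations that velocity brings.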

Variety:  Frequently it’s desirable to aggregate and combine data from many different sources and then merge them for analysis.  Big Data technology is very good at this; RDBMS less so.
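
Here is a small sketch in Python of what combining varied sources can look like: structured CRM rows, raw web-log lines, and free-text call center notes are folded into a single profile per customer.  All of the records, field names, and the log format are invented for illustration; this is not how any particular Big Data product does it.

```python
from collections import defaultdict

# Hypothetical sources, each in its own shape.
CRM_ROWS = [{"customer_id": "cust-1", "name": "Ada", "segment": "gold"}]
WEB_LOG_LINES = ["cust-1 GET /books/123", "cust-2 GET /home", "cust-1 GET /cart"]
CALL_CENTER_NOTES = [("cust-2", "Asked about delivery times.")]


def build_profiles(crm_rows, web_log_lines, call_center_notes):
    """Merge three differently shaped sources into one profile per customer."""
    profiles = defaultdict(lambda: {"crm": None, "pages": [], "notes": []})

    for row in crm_rows:  # structured rows
        profiles[row["customer_id"]]["crm"] = row

    for line in web_log_lines:  # semi-structured text lines
        customer_id, _method, path = line.split()
        profiles[customer_id]["pages"].append(path)

    for customer_id, note in call_center_notes:  # free text
        profiles[customer_id]["notes"].append(note)

    return dict(profiles)


if __name__ == "__main__":
    for customer_id, profile in build_profiles(
        CRM_ROWS, WEB_LOG_LINES, CALL_CENTER_NOTES
    ).items():
        print(customer_id, profile)
```

Notice that a customer can appear in the web logs without having a CRM record at all; the profile simply holds whatever is known, rather than forcing every source into one fixed table layout up front.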

So do you have a problem that warrants the use of Big Data solutions?  And which of the Big Data solutions are appropriate for what you want to accomplish?  That’s a much more complicated question.  Just remember that it’s the analysis of the data through reports, dashboards, visualization tools, or predictive models that is the goal.  The answer to whether you need a Big Data solution always hinges on what you want to accomplish.

Bill Vorhies, President & COO – Data-Magnum – © 2013, all rights reserved.


