Bill Vorhies
Business Foundation Series, #5

Summary: To understand how far and how fast we have come in our ability to store, retrieve, and analyze Big Data, it helps to know a little about its history.  It’s not very long, but it’s moving very fast.  When considering how Big Data and analytics fit in your company, it will be helpful to know the roots of these developments and the variations that are flowering today.

During the first half of the 2000s, the big on-line data aggregators, search, social media, and ecommerce giants (e.g. Google, Yahoo, Facebook, Amazon, and Twitter) had both a blessing and a curse.  They had been accumulating huge amounts of potentially valuable data about their users, but their ability to analyze it was running up against the wall of traditional data storage and retrieval technologies.  The year 2004 was an important dividing line.

The Age of Relational Databases:  Prior to 2004 essentially all business data was stored in data warehouses built on relational database management systems (RDBMS).  When data is placed in an RDBMS it is subjected to a variety of steps that make it useful.  The structure imposed on the data during entry is at the root of the definition of ‘structured data’.  We know where it’s located, and that name ‘A’ goes with birthday ‘B’, salary ‘C’, and address ‘D’.  Even if these data are not stored on the same record or even on the same system, they can be tied together with a ‘key’ during retrieval.  Also, during entry the data is scrubbed so that illogical variants, such as text in a numeric field, cannot exist.

By far the most common way to query these RDBMS databases is with a language called SQL (Structured Query Language).  Today IT departments are populated by experts in RDBMS databases and SQL queries, and our colleges continue to churn out graduates with these skills, since even today the vast majority of databases in existence are RDBMS and require them.
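As a minimal illustration of the structure and key-based retrieval described above, here is a small sketch using Python’s built-in sqlite3 module; the table and column names are invented for the example, and any real RDBMS (Oracle, SQL Server, DB2, MySQL, and so on) works on the same principles of typed columns and shared keys.

    import sqlite3

    # In-memory database purely for illustration; a production RDBMS
    # applies the same ideas of typed columns and keys.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Structure is imposed at entry: every column has a defined type.
    cur.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, birthday TEXT)")
    cur.execute("CREATE TABLE payroll (person_id INTEGER, salary REAL)")

    cur.execute("INSERT INTO person VALUES (1, 'A. Smith', '1970-03-14')")
    cur.execute("INSERT INTO payroll VALUES (1, 85000.0)")

    # The shared key (person_id) ties records together at retrieval time,
    # even though they live in different tables.
    cur.execute("""
        SELECT p.name, p.birthday, pay.salary
        FROM person p JOIN payroll pay ON pay.person_id = p.person_id
    """)
    print(cur.fetchall())   # [('A. Smith', '1970-03-14', 85000.0)]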

The downside of RDBMS databases is revealed as they grow very large, say at the outer limits of gigabytes and well into terabytes.  The weaknesses are threefold:

1)  They require very robust and expensive specialized servers designed to prevent failure and loss of information, making scaling costs high.

2)  SQL queries can take a very long time to execute, in the range of several days to search and download the desired data at very large data volumes (e.g. at Visa a mission-critical query took more than a month of machine time to process).

3)  You need to know what kinds of questions you want to ask when building the RDBMS so that the data is present and structured in a way that allows it to be retrieved.

The Birth of NoSQL:  In the few years leading up to 2004, Google set out with an in-house team of scientists to build a new type of proprietary database they called ‘Bigtable’.  Bigtable successfully broke through the barriers presented by RDBMS, but Google was making the investment to solve its own problem, not anyone else’s.

In 2003 and 2004 the Google scientists published two scholarly research papers on their project: the first describing the Google File System (GFS), a means of storing data across distributed machines, and the second describing Google MapReduce, a distributed number-crunching and retrieval framework running atop GFS.  That work found its way into the hands of Doug Cutting and Mike Cafarella, two independent developers who convinced Yahoo that this new architecture was the solution to its search and indexing challenges.

By 2006 Yahoo had the first prototype, called ‘Hadoop’, and in 2008 Yahoo announced the first commercial implementation of Hadoop.  Facebook was a fast follower, and within a very short period Twitter, eBay, and other major players were also adopting the technology.  In the four or five short years since 2008, Hadoop, MapReduce, and their variants have become the source of a technology revolution that IDC forecasts will top $813 million by 2015.

The Age of NoSQL:  Following 2008 the general term ‘NoSQL’ was adopted to set apart these new databases, or more correctly file systems, that do not require SQL (or, as it is sometimes put, are ‘Not-Only-SQL’) from their RDBMS predecessors.  There are a number of major variants, all with different strengths and weaknesses, each designed for specific applications.  They all have several things in common:

1)  Low Cost: They run on inexpensive commodity servers at perhaps 10% to 20% of the cost of specialized RDBMS servers.

2)  Fast Massively Parallel Processing: They are ‘horizontally scalable’, meaning that you can hook up hundreds or thousands of these low-cost machines and the software will partition the storage and retrieval work across what came to be known as ‘Massively Parallel Processing’ (MPP) networks.  This dramatically reduces storage time and particularly retrieval time, so that queries that previously took several days can be done in a few hours or even minutes.

3)  Cloud Friendly: Hand-in-hand with the MPP architecture, these new storage farms are very cloud friendly, so very large MPP server farms can be set up quickly and with no capital cost by simply renting them from major providers like Amazon.

4)  Fault Tolerant:  MPP allows storing the same data on several different machines so that the failure of one does not endanger the data.  This is made even more practical by the low cost of the commodity hardware.

5)  Store Anything Together: Depending on the flavor of NoSQL system you install, you can store literally anything, including things that were never practical to store before: massive social networking text files, satellite images, graphs, huge-volume web logs, and other types of massive data files.  You can store the same data you keep in your RDBMS alongside raw unstructured text, sensor values, or data pulled in from outside sources, including images, DNA analysis data, or 3D representations of complex molecules or architectural drawings, all in the same place using the same technology.

6)  Store It Now and Find the Value Later: Thanks to the new ‘key-value store’ architecture, any data, including previously undefined or unstructured data, can be stored and then retrieved using a companion technology, ‘MapReduce’, which distributes the data across the massively parallel processing servers and also handles its retrieval, all without having to define your question in advance (a simplified sketch follows this list).
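To make the map-and-reduce idea concrete, here is a deliberately simplified, single-machine sketch in plain Python (no Hadoop involved, and the sample text is invented): the ‘map’ step emits key-value pairs from each partition of raw text, and the ‘reduce’ step combines the values for each key.  A real MapReduce framework runs the same two steps in parallel across many commodity servers.

    from collections import defaultdict
    from itertools import chain

    # Raw, unstructured input split into partitions, as it might be
    # spread across the nodes of an MPP cluster.
    partitions = [
        ["the quick brown fox", "jumps over the lazy dog"],
        ["the dog barks", "the fox runs"],
    ]

    def map_phase(lines):
        # Map step: emit a (key, value) pair for every word seen.
        for line in lines:
            for word in line.split():
                yield (word, 1)

    def reduce_phase(pairs):
        # Reduce step: combine all values that share the same key.
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    # Each partition is mapped independently (on a real cluster, in parallel),
    # then the emitted pairs are brought together and reduced.
    mapped = chain.from_iterable(map_phase(p) for p in partitions)
    print(reduce_phase(mapped))   # e.g. {'the': 4, 'fox': 2, 'dog': 2, ...}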

Major Types of NoSQL Databases

When users began to consider the applications of NoSQL file systems, the first major division in their minds was between structured and unstructured data, with structured belonging to the world of RDBMS databases and everything else falling into unstructured.  However, not all types of unstructured data are alike, and different variations on NoSQL were quickly developed for different specialty applications.  A more detailed analysis will have to wait for a separate posting, but the major types are these:

1)  Key-Value Stores (e.g. Hadoop):  These allow the storage of data without a pre-existing schema and handle complex, unstructured, and semi-structured data well, with fast storage and retrieval.  This type of data encompasses most text, semi-structured text, social media, web logs, and the dominant types of business-oriented data.  As a consequence, most Big Data projects default to a Key-Value Store such as Hadoop.  Note that traditional structured data can be handled equally well here, and at lower cost than on RDBMS systems.  MapReduce is a key component of Key-Value Stores.  In one example, the added speed of Key-Value Store file systems has allowed Visa to process 36 terabytes of data in 13 minutes, work that used to require a full month.  (A sketch after this list contrasts how the same record might be represented under each of the four models described here.)

2)  Document Oriented Databases:  Second in popularity in the business world behind Key-Value Stores are Document Oriented Databases.  Here an entire document is treated as a record.  While they can accommodate completely unstructured text, they excel at semi-structured text, that is, text that has been encoded in a known format such as XML, YAML, JSON, PDF, or even MS Office.  Smaller Document Oriented Databases are found at the heart of traditional Enterprise Content Management Systems (sometimes known as Records Management Systems).  Search can be further facilitated by adding metadata or keys, and several query languages exist depending on the specific flavor you are using.  Document Oriented Databases excel at tasks like patent search, litigation support, legal precedent search, search of scientific papers and experimental data, email compliance searches, or simply retrieving knowledge on a particular topic hidden among a forest of internally or externally prepared reports and documents.

3)  Column Oriented Databases (CODs):  CODs may be selected when the application focuses on calculating metrics from a particular set of columns or when updates tend to touch columnar data.  CODs are inefficient at analyzing across rows or writing new rows, but they can achieve some data compression advantages over row-oriented databases.

4)  Graph Databases:  These highly specialized databases use an ‘associative model’ of data for which there is no specific index; they are not record based and instead represent data as nodes connected by edges that express relationships.  They are excellent for graph algorithms, such as those used in image storage, and for ‘semantic web’ applications.  The semantic web gives meaning to data by defining relationships and properties, or ontologies.  Specialized languages such as SPARQL are used for querying.
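As a rough, purely illustrative sketch (all field names and values below are invented), here is how one and the same employee record might be laid out under each of the four models above, using nothing but plain Python data structures:

    # One logical record: an employee with a name, a department, and a salary.

    # 1) Key-value store: an opaque value looked up by a single key.
    key_value = {
        "employee:1001": b'{"name": "A. Smith", "dept": "Sales", "salary": 85000}',
    }

    # 2) Document store: a self-describing (JSON-like) document,
    #    queryable by its fields.
    document = {"_id": 1001, "name": "A. Smith", "dept": "Sales", "salary": 85000}

    # 3) Column-oriented store: values for each column are kept together,
    #    which makes column-wide metrics (e.g. average salary) cheap.
    columns = {
        "id":     [1001, 1002, 1003],
        "name":   ["A. Smith", "B. Jones", "C. Wu"],
        "dept":   ["Sales", "Sales", "Engineering"],
        "salary": [85000, 72000, 91000],
    }
    print(sum(columns["salary"]) / len(columns["salary"]))   # column-wide metric

    # 4) Graph store: nodes and edges capture relationships explicitly.
    nodes = {1001: {"type": "Employee", "name": "A. Smith"},
             2001: {"type": "Department", "name": "Sales"}}
    edges = [(1001, "WORKS_IN", 2001)]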

Let a Thousand Flowers Bloom – 2008 to Today:  While all these variations may seem overwhelming, the fact is that Key-Value Stores dominate business applications of Big Data.  Among the best known is Hadoop, both because it was first and because it has been given over to the Apache Software Foundation, which continues to nurture its development as open source, available free of license fees.

And while our brief history will wrap up about here, you should also know that the ever-creative marketplace has been busy creating both open source and proprietary variants of Hadoop.  Forrester Research offers one way to categorize these: according to whether the utilization can be batch (Hadoop) or must support real-time decisioning, and whether the content is polystructured (different types of data together in one place) or homogeneous.  The Forrester chart lists only a few of the competitors that should be considered today.

Rest assured that the marketplace has also recognized that, as with any completely new technology, there are few true experts around; variants now exist that even allow us to use our traditional SQL tools, so that our current IT staffs can continue to contribute in this new Big Data world.

 

Bill Vorhies, President & COO – Data-Magnum – © 2013, all rights reserved.