Bill Vorhies

Business Foundation Series, #4

Summary:  Big Data is indeed growing rapidly, but not all types or in all geographies.  Here are four things you need to know to evaluate the opportunity.


It seems a requirement to start every business discussion of Big Data with a mind-blowing recitation of how fast data is growing.  The fact is that it is growing fast.  But what about the specific opportunities to exploit Big Data?  Are they all growing equally fast?  Are they all growing in markets that you care about?  The answer to both questions is an emphatic NO.  So, to get our bearings in this future world of Big Data, we offer four ways for business leaders to evaluate these opportunities more rationally.

1. Overall Growth of Data

Looking back about seven years and forward about seven more to 2020, IDC reports that 130 exabytes of data were created in 2005 and about 2,837 exabytes in 2012, and forecasts 40,000 exabytes by 2020.  From 2005 to 2020 that’s a growth factor of about 300.  From 2010 to 2020 it’s a growth factor of 50, or roughly a doubling of data every two years.

Source: IDC's Digital Universe Study, Dec. 2012
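As a quick check on these growth figures, here is a minimal Python sketch.  It assumes steady exponential growth, which is our simplification and not something IDC states; the function names are ours, and the only inputs are the numbers quoted above.

```python
import math

# Figures quoted in the text from IDC's Digital Universe Study (exabytes).
DATA_2005_EB = 130
DATA_2020_EB = 40_000


def growth_factor(start_eb, end_eb):
    """How many times larger the end volume is than the start volume."""
    return end_eb / start_eb


def doubling_time_years(factor, years):
    """Years per doubling implied by a `factor`-fold increase over `years`,
    assuming steady exponential growth (a simplifying assumption)."""
    return years * math.log(2) / math.log(factor)


factor_2005_2020 = growth_factor(DATA_2005_EB, DATA_2020_EB)
doubling_2005_2020 = doubling_time_years(factor_2005_2020, 15)
doubling_2010_2020 = doubling_time_years(50, 10)  # the factor of 50 is from the text

print(f"2005-2020 growth factor: {factor_2005_2020:.0f}")
print(f"Implied doubling time 2005-2020: {doubling_2005_2020:.2f} years")
print(f"Implied doubling time 2010-2020: {doubling_2010_2020:.2f} years")
```

Under that assumption both periods imply a doubling time of roughly 1.8 years, consistent with the "every two years" rule of thumb.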


It’s important to keep in mind that not all data now or in the future is or should be stored.  Much of it is transitory – phone calls, images viewed and not saved, packets temporarily stored in routers, digital surveillance images purged from memory when new images come in, and so on. It’s also important to remember that only data that is stored and analyzed can create value.

For example, a four-engine jumbo jet crossing the Atlantic will create 640 terabytes of performance and engine monitoring data.  Multiply that by about 25,000 flights each day and you have access to 16 exabytes of data.[i]  Should all of that be saved and analyzed, and for how long retained?  You can begin to appreciate the scale of the problem.
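The arithmetic behind the 16 exabyte figure can be reproduced in a few lines of Python.  This sketch assumes decimal units (1 exabyte = 1,000,000 terabytes); the function name is ours.

```python
TB_PER_EB = 1_000_000  # decimal units: 1 exabyte = 10**6 terabytes (assumption)


def daily_flight_data_eb(tb_per_flight, flights_per_day):
    """Total monitoring data per day, converted from terabytes to exabytes."""
    return tb_per_flight * flights_per_day / TB_PER_EB


# 640 TB per crossing and about 25,000 flights per day, both from the text.
daily_eb = daily_flight_data_eb(640, 25_000)
print(f"{daily_eb:.0f} exabytes per day")
```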

2. How Much Data Can and Should be Analyzed

The answer depends on business value, which will be unique to each reader.  Today, data that can be analyzed resides almost exclusively in structured relational databases, along with a relatively small percentage of advanced applications in Hadoop that take advantage of semi-structured and unstructured data like text, social media, web logs, video, and sensor data.  IDC estimates that only 0.5% of data today is being analyzed, that 3% of data today is tagged and could be analyzed, and that of the 2,837 exabytes created in 2012, 23% would be useful if tagged and analyzed.[ii]  Clearly a great deal of potential value is being lost.

The challenge is to create semi-structured (tagged) data that can be captured, stored, retrieved, and analyzed.  If that is done, IDC estimates that as much as 33% of the 40,000 exabytes of data forecast for 2020 could be usefully analyzed and create value.
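To put those percentages in absolute terms, the sketch below converts them to exabytes.  One assumption to flag: IDC’s 0.5% and 3% figures refer to "data today," and we apply them to the 2,837 exabytes created in 2012 purely for illustration.  The variable names are ours.

```python
CREATED_2012_EB = 2_837    # from the text
FORECAST_2020_EB = 40_000  # from the text

# Percentages quoted from IDC. Applying the first two to the 2012 total
# is our assumption, made only to illustrate scale.
PERCENT_2012 = {
    "analyzed today": 0.5,
    "tagged, could be analyzed": 3,
    "useful if tagged and analyzed": 23,
}

volumes_2012_eb = {
    label: pct * CREATED_2012_EB / 100 for label, pct in PERCENT_2012.items()
}
useful_2020_eb = 33 * FORECAST_2020_EB / 100

for label, volume in volumes_2012_eb.items():
    print(f"2012, {label}: {volume:,.1f} EB")
print(f"2020, could be usefully analyzed: {useful_2020_eb:,.0f} EB")
```

On these numbers, the volume that could be usefully analyzed in 2020 (about 13,200 exabytes) is several times all the data created in 2012.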

Four facts related to data growth have come together to drive the new Big Data technologies, particularly the massively parallel processing of platforms like Hadoop and NoSQL databases running on inexpensive commodity servers.

  1. The rate of data growth is outstripping Moore’s Law;
  2. The amount of data we want to store would be too expensive with current technology;
  3. The speed at which read heads in large disk drives can retrieve data is not keeping up with data growth, so retrieval times are simply too long; and
  4. Newly available semi-structured and unstructured data can yield new insights with the new technology.

3. Where is Data Growing

So far we’ve been talking about worldwide data growth.  If you’re an international business that’s what you need to pay attention to.

Source: IDC's Digital Universe Study, Dec. 2012


But the results vary by region.  From 2005 through 2012 Big Data was largely a phenomenon of the developed world, which accounted for 51% of data in 2012 (see IDC 2012 chart).  By 2020, however, the pendulum will swing to the developing world, where emerging markets will account for 62%.[iii]

Over the next 7 years the majority of opportunities as measured by newly available data volumes (translating to information about new consumers and customers) will be in emerging markets.  Still, the explosion of data in the developed world will be remarkable and therefore full of opportunity.
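A rough sketch of what that shift means in volume terms follows.  It assumes a simple two-way split between developed and emerging markets (so the shares sum to 100%), which is our simplification; IDC’s regional definitions may differ.  The function name is ours.

```python
TOTAL_2012_EB = 2_837    # from the text
TOTAL_2020_EB = 40_000   # from the text
DEVELOPED_PCT_2012 = 51  # from the text
EMERGING_PCT_2020 = 62   # from the text


def two_way_split(total_eb, first_pct):
    """Split a total into (first, remainder), assuming only two regions."""
    first = total_eb * first_pct / 100
    return first, total_eb - first


developed_2012, emerging_2012 = two_way_split(TOTAL_2012_EB, DEVELOPED_PCT_2012)
emerging_2020, developed_2020 = two_way_split(TOTAL_2020_EB, EMERGING_PCT_2020)

developed_growth = developed_2020 / developed_2012
emerging_growth = emerging_2020 / emerging_2012

print(f"Developed world: {developed_2012:,.0f} -> {developed_2020:,.0f} EB "
      f"({developed_growth:.1f}x)")
print(f"Emerging markets: {emerging_2012:,.0f} -> {emerging_2020:,.0f} EB "
      f"({emerging_growth:.1f}x)")
```

Even with a shrinking share, developed-world data grows roughly tenfold under these assumptions, while emerging-market data grows closer to eighteenfold.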

4. What Types of Data are Growing

When looking to the future to decide where to position resources it will be important to understand what categories and types of data are growing at what rates.  All things being equal the data types with the highest growth rates represent the greatest opportunities for new insights – depending on your business of course.

Shaun Connolly of Hortonworks published an excellent graphic that neatly categorizes types of data[iv] and helps us with this analysis.  He says Big Data can best be understood as the sum of Transactions + Interactions + Observations.

The category of transactions fits neatly with traditional structured data generated by our ERPs, CRMs, and SCMs which are stored and analyzed through our current relational databases. Connolly’s interactions category we will split into three subsets:  aggregated external data, web logs, and much of social media. The third category he labels observations we will split into plain text, sensor data, and surveillance.

Transactions:  This category of data is structured, lives in our current data warehouses, and is readily available for analysis.  It is growing, though estimates of the growth rate are hard to find.  A question sometimes asked is: if the economy is growing only very slowly, shouldn’t this transactional data also be growing slowly?  In terms of the number of transactions (the vertical dimension) that’s true, but we are capturing increasing levels of detail and whole new types of data about each transaction, so the data is expanding horizontally.

Some of this is driven by increased sophistication of our transactional systems capturing more detail, some by increased regulation requiring the capture of new data types to verify compliance.  Since only about 1/6th of this data is currently being analyzed and only about one company in eight has adopted predictive analytics, this is a large and immediate opportunity, Big Data or not.

Aggregated External Data:  This type of data is structured and can be accommodated in our current data warehouses.  An example is adding externally purchased data about our customers to their customer records to make predictive analytics more accurate.  Think household size, age, owner/renter, and the hundreds of other demographic data types that can be purchased commercially today.  Direct marketers have been doing this for at least a decade but it is increasingly being adopted by all B2C organizations.

The reason that it is sometimes categorized with Big Data is that it is an aspect of Variety in data and both traditional and new Big Data technologies can be used to marry these new data types to your current records.  Adding this data to current records would increase their size and would need to be periodically refreshed, but should not grow faster than your transactional data.  The level of opportunity here is very high because it uses existing traditional technology and lends itself to immediate new insights.
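As an illustration of marrying purchased data to existing records, here is a minimal Python sketch.  The customer records, field names, and the `enrich` function are all hypothetical and exist only to show the join; a real implementation would more likely be a join in the data warehouse itself.

```python
# Hypothetical in-house customer records.
customers = [
    {"customer_id": 1, "name": "A. Smith"},
    {"customer_id": 2, "name": "B. Jones"},
]

# Hypothetical purchased demographics, keyed by customer_id.
purchased_demographics = {
    1: {"household_size": 3, "owner_or_renter": "owner"},
}


def enrich(records, demographics):
    """Return copies of the records with any matching external fields added.
    Records without a match are returned unchanged."""
    enriched = []
    for record in records:
        extra = demographics.get(record["customer_id"], {})
        enriched.append({**record, **extra})
    return enriched


enriched_customers = enrich(customers, purchased_demographics)
for row in enriched_customers:
    print(row)
```

Because `enrich` returns new records and leaves the originals untouched, a periodic refresh simply reruns the join against the latest purchased file.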

Interactions:  Web Logs:  Web logs are the digital history of every visit made to your web site, click-by-click, page-by-page, including everything searched, downloaded, placed in a basket, or actually purchased.  Web logs fall in the category of semi-structured data since they have a known taxonomy and frequently have tags that can be used to interpret content making it accessible for analysis.
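To show what "semi-structured" means in practice, the sketch below parses one web log line into named fields.  It assumes the widely used Common Log Format; your web server may be configured differently.  The sample line and the `parse_line` helper are hypothetical.

```python
import re

# Pattern for the Common Log Format (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical log entry: a visitor adding an item to a basket.
sample = ('203.0.113.7 - - [10/Oct/2012:13:55:36 -0700] '
          '"GET /cart/add?item=42 HTTP/1.1" 200 2326')

fields = parse_line(sample)
print(fields["path"], fields["status"])
```

Once each line is reduced to fields like these, click-by-click analysis becomes an ordinary counting and grouping exercise.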

In 2012 the world added 51 Million new web sites for a total of 634 Million web sites.  There were 2.4 Billion web users worldwide, with about 1.0 Billion of those in the developed world.  Across North America and Europe about 70% of the population were web users, implying slower growth in the developed world going forward but a potential explosion in users in the rest of the world, where penetration is currently only about 23%.[v]  Growth of this data in US commercial markets is likely to be moderate, but so little has been tapped that there is a huge opportunity here for companies to make their web sites easier to use for their customers and more profitable at the same time.

Interactions:  Social Media and Plain Text:  Although Twitter and Facebook and other social media entries may look like plain text, much of social media today is in fact tagged, semi-structured data that makes it easier to analyze.  True plain text documents include the documents produced inside your company, email, and customer service logs.  Although these are categorized as unstructured data, which makes analysis more challenging, they frequently have a recognizable internal structure (e.g. email or invoices) that document storage systems like Enterprise Content Management Systems can separate semantically into meaningful components.

Both types but particularly social media are subject to new Big Data analytics techniques of which ‘sentiment analysis’ is a common type.  Sentiment analysis is designed to scan an extremely large volume of social media data and focus on key words and phrases that indicate customer satisfaction or dissatisfaction, or reaction to new products, services, features, pricing, and the like.
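The keyword-scanning idea can be sketched in a few lines of Python.  This is a toy illustration only: the word lists are hypothetical, and production sentiment analysis relies on far larger lexicons or trained models.

```python
import re

# Tiny hypothetical keyword lists, for illustration only.
POSITIVE = {"love", "great", "happy", "recommend"}
NEGATIVE = {"hate", "broken", "refund", "disappointed"}


def sentiment_score(text):
    """Count positive keywords minus negative keywords in the text.
    A score above zero suggests satisfaction, below zero dissatisfaction."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


posts = [
    "I love the new pricing, great value!",
    "Arrived broken, I want a refund",
]
for post in posts:
    print(sentiment_score(post), post)
```

Summing scores like these across a large volume of posts, grouped by product or feature, gives a first rough read on customer reaction.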

Plain text data is likely to grow at a rate proportionate to business growth, modest and linear over the next seven years.  Social media which has had a huge run-up in volume to 2012 (see the chart of US users of social media by age[vi]) is now forecast to slow dramatically and grow at just 4.1% per year in North America and between 12% and 23% in the emerging and developing world.[vii]
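To see what those annual rates compound to, here is a short sketch.  The seven-year horizon is our assumption, borrowed from the planning window used elsewhere in this article; the cited forecast may cover a different period.

```python
YEARS = 7  # assumed horizon, not part of the cited forecast


def compound_growth(annual_rate, years):
    """Total growth multiple after compounding an annual rate over `years`."""
    return (1 + annual_rate) ** years


north_america = compound_growth(0.041, YEARS)  # 4.1% per year, from the text
emerging_low = compound_growth(0.12, YEARS)    # 12% per year, from the text
emerging_high = compound_growth(0.23, YEARS)   # 23% per year, from the text

print(f"North America: {north_america:.2f}x")
print(f"Emerging and developing world: {emerging_low:.2f}x to {emerging_high:.2f}x")
```

Over seven years, 4.1% a year compounds to about a one-third increase, while 12% to 23% a year compounds to between roughly two and four times today’s volume.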

Observations:  Surveillance:  It’s tempting to lump this entire category into sensor data but IDC breaks out surveillance as a separate case and perhaps for good reason.  This isn’t as sinister as it sounds.  IDC is drawing our attention to the fact that almost all digital still and video images have embedded tags that show at least date, time, and location, and in many cases much more such as tagged face information.  These tags are what make the future of digital imagery so potentially valuable as we refine location-based GIS analysis of consumer behavior.  As IDC observes, “the amount of information individuals create themselves — writing documents, taking pictures, downloading music, etc. — is far less than the amount of information being created about them in the digital universe.”[viii]

Observations:  Sensors:  For me the real story in Big Data is in sensors.  Sensors are everywhere, in our phones, cars, refrigerators, on our bodies (FitBit), RFID tags on clothing and merchandise, vending machines, parking meters, home security systems, healthcare devices, and also in their traditional domains of equipment and throughout transportation and the logistical supply chain.

The big news is that sensors are becoming so inexpensive and pervasive that by 2015 the data they produce will outstrip all of social media, and thereafter will grow at 10 to 20 times the rate of social media[ix] making it the dominant type of information by far in 2020.  This includes not only sensors that we will interact with directly but the whole new domain of machine-to-machine (M2M) coordination currently blossoming in the sensor world.

Where IDC forecasts 40,000 exabytes of data by 2020, several organizations forecast that sensors alone will produce 50,000 exabytes[x], leaving the top end of the forecast open to even higher expectations.  All agree that we will pass 1 Trillion sensors in use by 2020, with some estimates as high as 7 Trillion by 2017.[xi]

One of the key things to remember about sensors is that their output is binary and numeric, making it easily accessible for analysis and, depending on how much is to be stored, even suitable for storage and analysis in traditional relational database systems.  Based on its projected dominance among all information types by 2020, sensor data gets the nod for the greatest potential coming opportunity in Big Data.


Key Takeaways:

  • Understanding how much, where, and what types of Big Data will grow over the next several years is central to planning how to distribute your resources for maximum value and competitive advantage.
  • If you are not currently utilizing Big Data or predictive analytics, start now by inventorying your current opportunities, then build a plan that integrates this new knowledge.

“Information is the oil of the 21st century, and analytics is the combustion engine.” – Peter Sondergaard, Senior Vice President, Gartner Research.


Bill Vorhies, President & COO – Data-Magnum – © 2013, all rights reserved.


[i] Instrumentation Newsletter Jan 15, 2013, National Instruments Corp. Dr. Tom Bradicich.

[ii] IDC’s Digital Universe Study, sponsored by EMC, December 2012

[iii] Ibid.

[iv] 7 Key Drivers for the Big Data Market, Shaun Connolly, Hortonworks, Blog post, May 14th, 2012

[v] Internet 2012 in numbers, Pingdom, January 16th, 2013

[vi] The Growth of Social Media, Search Engine Journal, 2013.

[vii] The Can’t-Miss Social Media Trends For 2013, Ryan Holmes, Hootsuite, November 29, 2012, Fast Company

[viii] IDC’s Digital Universe Study, sponsored by EMC, December 2012

[ix] Sensor data is data analytics’ future goldmine, Kevin Kwang, June 11, 2010

[xi] Ibid.