Bill Vorhies

Summary: Simpson’s Paradox is a source of risk for real time analytics and for the citizen data scientist.





Most of us practicing the predictive arts know to look for sources of bias in our data. There are seven that are common, the first six of which are:

  1. Confirmation Bias
  2. Selection Bias
  3. Outliers
  4. Over Fitting and Under Fitting
  5. Confounding Variables
  6. Non-normality


The seventh, and the one for today’s thoughts, is Simpson’s Paradox, sometimes described as a special case of Confounding Variables. While the other sources of bias can lead to answers that are probably directionally correct but wrong in the specifics of the forecast, Simpson’s Paradox can give you an answer that points in exactly the wrong direction.

There are many famous examples of this, but for the sake of simplicity let’s use this very short one for review. Two drugs, A and B, are being tested and compared for efficacy. There are two separate sets of observations; call them Test 1 and Test 2. The results of those tests are:

[Table image “Simpson’s Paradox 2”: cure rates for Drug A and Drug B in Test 1 and Test 2]

It is evident to the observers of Test 1 and Test 2 that Drug B is more effective, having cured a higher percentage of patients in both instances.

But a sharp data scientist might question this, wondering if Simpson’s Paradox could be at work. He examines the combined results and finds:

[Table image “Simpson’s Paradox 1”: combined results of Test 1 and Test 2 for Drug A and Drug B]

When the data are combined, the result is exactly the opposite of the first conclusion: Drug A now shows the higher cure rate. In real life this could have been disastrous.
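To make the arithmetic concrete, here is a minimal Python sketch of such a reversal. The counts are hypothetical, chosen only so that Drug B wins each test while Drug A wins overall; they are not the figures from the tables above, and the helper names are illustrative.

```python
from fractions import Fraction

# Hypothetical (cured, treated) counts, NOT the article's original figures.
trials = {
    "Test 1": {"A": (63, 90), "B": (8, 10)},
    "Test 2": {"A": (4, 10), "B": (45, 90)},
}


def cure_rate(cured, treated):
    """Exact cure rate as a fraction, to avoid float rounding in comparisons."""
    return Fraction(cured, treated)


def best_drug(arms):
    """Name of the drug with the highest cure rate among the given arms."""
    return max(arms, key=lambda drug: cure_rate(*arms[drug]))


def pool(trials):
    """Add up cured and treated counts per drug across all tests."""
    totals = {}
    for arms in trials.values():
        for drug, (cured, treated) in arms.items():
            c, t = totals.get(drug, (0, 0))
            totals[drug] = (c + cured, t + treated)
    return totals


winners = {test: best_drug(arms) for test, arms in trials.items()}
pooled = pool(trials)
pooled_winner = best_drug(pooled)

for test, arms in trials.items():
    rates = ", ".join(
        f"{drug}: {float(cure_rate(*counts)):.0%}" for drug, counts in arms.items()
    )
    print(f"{test}: {rates} -> better drug: {winners[test]}")

rates = ", ".join(
    f"{drug}: {float(cure_rate(*counts)):.0%}" for drug, counts in pooled.items()
)
print(f"Combined: {rates} -> better drug: {pooled_winner}")
```

With these numbers Drug B wins Test 1 (80% vs. 70%) and Test 2 (50% vs. 40%), yet Drug A wins the pooled data (67% vs. 53%). The reversal comes from the unequal group sizes: most of Drug A’s patients sit in the test where both drugs do well, and most of Drug B’s sit in the test where both do poorly.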

And this is exactly what Simpson’s Paradox is: a trend that appears in different groups of data disappears or reverses when these groups are combined. This is not meant to be a comprehensive review, so I leave it to you to explore some of the other ways Simpson’s Paradox can confuse and misdirect.

So why the focus on Simpson’s Paradox, and why now? Two reasons.

  1. Real Time Analytics: The entire thrust of real time analytics is to be able to spot a pattern and take action in shorter and shorter time periods. The shorter the time period, the more likely it is that the true overall trend is masked by short-term misdirections. If you’re involved in real time analytics, I suggest you do a serious risk analysis of what the consequences would be for your employer or client if you were misdirected by Simpson’s Paradox and took exactly the wrong action.
  2. Citizen Data Scientists: This is the term that Gartner has given to well-intentioned managers who are given access to data, and even some predictive analytic tools, with the intent that they discover insights for themselves. A very significant portion of software development in predictive analytics is attempting to automate data science to the point that Citizen Data Scientists can achieve this goal. You see software offerings in automated data prep and cleansing, heavily templated and simplified tools for regression and decision trees, and of course lots and lots of data viz tools that are supposed to let you “see” the answer in complex correlations just by looking. All of this is exacerbated by the shortage of data scientists and the inexperience of most organizations in implementing predictive analytics. If you are relying on heavily templated and packaged software and have no awareness of what’s going on under the hood, what’s the likelihood that you would spot this bias or, for that matter, any of the other six? Pretty much none.
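One way to act on both points is to build a simple guard into the pipeline. The sketch below is a hypothetical helper, not part of any particular analytics product: it assumes each time window or segment supplies (successes, trials) counts for every option, and it flags the case where every window favors one option while the pooled data favors another.

```python
from fractions import Fraction


def favored(counts):
    """Option with the strictly highest success rate, or None on a tie or no data."""
    rates = {name: Fraction(s, n) for name, (s, n) in counts.items() if n > 0}
    if not rates:
        return None
    best = max(rates.values())
    leaders = [name for name, r in rates.items() if r == best]
    return leaders[0] if len(leaders) == 1 else None


def simpson_reversal(windows):
    """True when all windows favor the same option but the pooled counts favor another."""
    per_window = {favored(w) for w in windows}
    if len(per_window) != 1 or None in per_window:
        return False  # windows disagree, are tied, or are empty: nothing to reverse
    pooled = {}
    for w in windows:
        for name, (s, n) in w.items():
            ps, pn = pooled.get(name, (0, 0))
            pooled[name] = (ps + s, pn + n)
    pooled_choice = favored(pooled)
    return pooled_choice is not None and pooled_choice not in per_window


# Hypothetical counts: unbalanced windows, where a reversal occurs.
unbalanced = [
    {"offer_x": (63, 90), "offer_y": (8, 10)},
    {"offer_x": (4, 10), "offer_y": (45, 90)},
]
# Hypothetical counts: balanced windows, where pooled and per-window views agree.
balanced = [
    {"offer_x": (30, 50), "offer_y": (35, 50)},
    {"offer_x": (20, 50), "offer_y": (25, 50)},
]

print(simpson_reversal(unbalanced))
print(simpson_reversal(balanced))
```

A flag from a check like this does not say which view is right. It only says that the short-interval view and the overall view disagree, which is the point at which a person who understands the data should look before any action is taken.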

So let’s not forget the basics of questioning data for its hidden biases, especially as data speeds up and intervals of analysis become shorter and shorter. And if you’re interacting with the Citizen Data Scientists in your organization who are getting carried away frolicking through the tea leaves, let’s (gently) instruct them in these inherent dangers.



September 1, 2015

Bill Vorhies, President & Chief Data Scientist – Data-Magnum – © 2015, all rights reserved.


About the author: Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. Bill is also Editorial Director for Data Science Central.