Summary: Do you think we can replace data scientists with software?

Talk about a fraught concept; this one ought to give you the willies. I don’t mean to be a Luddite about the magical abilities of technology, but the concept here is to replace data scientists with software.

For as long as I have practiced data science, I have kept coming upon new and unexpected reasons that my results may be misguided. As a reminder, you might look back at my recent articles on Simpson’s Paradox or Why Big Data Isn’t Better Data.

So even if the most recent NoSQL ML techniques are too directional and not sufficiently specific in their results to lend themselves to automation (or perhaps these are exactly the required conditions), the question remains open as to whether more traditional ML techniques like predictive modeling can be successfully automated. Success there would open the door to their application by untrained users or, charitably, “citizen data scientists”.

Tom Simonite takes on this topic in his February article of the same name. The premise, he says, is this: there are not enough data scientists to go around, so the process must be automated.

For example, Google is apparently funding work on the “Automatic Statistician”.

“It’s not meant to replace exactly what a statistician would do, but it can help a lot,” says Zoubin Ghahramani, professor of information engineering at the University of Cambridge, who developed the software. “Sometimes it finds patterns that a regular data analyst would not,” he adds.

The Automatic Statistician uses an iterative, building-block approach to create mathematical models. The software first tries out the simplest of its methods on the data; it then selects the ones that best explain the data for another round of experimentation, adding more mathematical techniques to see what happens. The best model is then used to generate the final written report. (There’s a hint of genetic programming here, which I think is a much overlooked ML technique, but more on that separately.)
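To make the "try simple pieces, keep what explains the data, then add more" loop concrete, here is a minimal sketch of a greedy forward search. It is not the Automatic Statistician's actual method, which the article does not spell out. The building blocks (`BLOCKS`), the synthetic data, the BIC scoring rule, and the choice to keep only the single best candidate per round are all assumptions made for illustration.

```python
import math
import random

# Hypothetical building blocks; each turns an input x into one feature value.
BLOCKS = {
    "constant": lambda x: 1.0,
    "linear": lambda x: x,
    "quadratic": lambda x: x * x,
    "periodic": lambda x: math.sin(x),
}


def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def bic(names, xs, ys):
    """Least-squares fit using the named blocks; return its BIC (lower is better)."""
    cols = [[BLOCKS[name](x) for x in xs] for name in names]
    k, n = len(cols), len(xs)
    # Normal equations, with a tiny ridge term for numerical safety.
    gram = [[sum(p * q for p, q in zip(cols[i], cols[j])) + (1e-9 if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(p * y for p, y in zip(cols[i], ys)) for i in range(k)]
    w = solve(gram, rhs)
    rss = sum((y - sum(w[i] * cols[i][t] for i in range(k))) ** 2
              for t, y in enumerate(ys))
    return n * math.log(rss / n + 1e-12) + k * math.log(n)


def greedy_search(xs, ys):
    """Each round, add the one block that lowers BIC most; stop when none helps."""
    chosen, best = [], float("inf")
    while True:
        candidates = [(bic(chosen + [name], xs, ys), name)
                      for name in BLOCKS if name not in chosen]
        if not candidates:
            break
        score, name = min(candidates)
        if score >= best:
            break
        chosen.append(name)
        best = score
    return chosen, best


# Synthetic data (an assumption): a trend plus a cycle plus a little noise.
rng = random.Random(0)
xs = [i * 0.1 for i in range(100)]
ys = [2.0 + 0.5 * x + 3.0 * math.sin(x) + rng.gauss(0, 0.1) for x in xs]

chosen, best = greedy_search(xs, ys)
print("blocks kept:", chosen, "BIC:", round(best, 1))
```

The complexity penalty in the score is what stops the search from simply adding every block. Note what the loop does not do: it never questions whether the data were prepared sensibly, which is where the human judgment discussed in this article comes in.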

Another entrant to this field is Skytree. Simonite reports “it claims (to be) the first commercial tool that can automatically select the best model to explain a particular data set.” When I read the Skytree site and the most recent 451 Research report on Skytree, I see references to ‘near real time’ modeling and results from Big Data sets, with connectors for the big three Hadoop distributions. It appears to compete directly with SAS and to operate the same way. The automagical operation claim isn’t easy to spot.

As we all know, the modeling takes only a small fraction of the total time. It’s the decisions about data prep, such as whether and how to impute missing values or which variables to include or exclude, that take time and iteration. Skytree claims “accurate” models. My take is that speed means “good enough” models, and if you’re satisfied with that, I’ll be happy to be your competitor any day.
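As a small illustration of why those data prep decisions matter, the sketch below applies three common treatments of missing values to the same toy table and measures the correlation between the two columns each time. The data, the column meanings, and the helper names (`drop_missing`, `fill`, `correlation`) are hypothetical and chosen only to make the effect visible.

```python
from math import sqrt
from statistics import mean, median

# Hypothetical toy data: (income, spend) pairs; None marks a missing income.
rows = [(20, 5), (25, 6), (None, 7), (30, 8), (None, 30), (200, 9), (35, 10)]


def correlation(pairs):
    """Pearson correlation between the two columns of a list of pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def drop_missing(rows):
    """Discard every row whose income is missing."""
    return [(x, y) for x, y in rows if x is not None]


def fill(rows, stat):
    """Replace missing incomes with a statistic of the observed incomes."""
    value = stat([x for x, _ in rows if x is not None])
    return [(value if x is None else x, y) for x, y in rows]


prepared = {
    "drop": drop_missing(rows),
    "mean": fill(rows, mean),
    "median": fill(rows, median),
}
r = {name: correlation(data) for name, data in prepared.items()}
for name, value in r.items():
    print(f"{name:>6}: r = {value:+.3f}")
```

On this toy table the three treatments give noticeably different correlations, and nothing in the numbers alone says which treatment is right. That choice depends on why the values are missing, which is a judgment call an automated tool has to either make silently or hand back to the analyst.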

Finally, Simonite identifies Narrative Science, a company that provides a service to turn numerical data into readable reports. Cofounder Kristian Hammond, who is also a professor at Northwestern University, says that Google’s Automatic Statistician could help data scientists be more efficient, but its reports would offer little to those who are unfamiliar with statistics. “Most business people don’t want to know about mathematical models,” says Hammond, “they want to know that they could save money by reducing factory activity by 50 percent between the hours of 1 a.m. and 6 a.m.”

Hmmm. Since the interpretation of results, the storytelling, is such a major portion of any project, I’ll withhold judgement. We should wait for some proof on this one and look for some head-to-head tests of this automated output pitted against the interpretive skills of a decent senior data scientist.

These are just three examples and I’m sure there are 10 times that number in development somewhere. I can’t wait for some serious adopters of data science automation to report their results so we can get some head-to-head comparisons.

You can read Tom Simonite’s original article here.


September 14, 2015

Bill Vorhies, President & Chief Data Scientist – Data-Magnum – © 2015, all rights reserved.


About the author: Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. Bill is also Editorial Director for Data Science Central.