In 1997, IBM ran a funny television ad building on the hype surrounding The Internet. Two office workers are sitting at an office table, and the older one is shaking his head at some Gartner report. “It says here that the internet is the future… We need to be on the internet.” His colleague asks, “Why?” The older man pauses and squints at the paper, “It doesn’t say!”
Data science is in a situation today similar to the one the internet was in 20 years ago. Because it is producing such incredible results in both academic research and business endeavors, it has naturally drawn a lot of attention. Some are eager to ride the wave of success and usher in change, but a significant number of people resent the new culture. To them, data science is either hype without substance or not new at all. In short, the old-timers need to step up their game because data science is significant and is only going to grow.

The business perspective

In the late 90s, a quantitative analyst could pick a clustering algorithm from their favorite statistics software, put the results in a customer-segmentation report, then cash their check. Since then, basic turn-key analytic functions have become ubiquitous. They exist in everything from databases to logistics tools to campaign management tools, so essentially all major businesses have them. Unless you’ve drained the market of absolutely cutting-edge talent, having “a SAS strategy” will provide as much of an advantage as having “an iPhone strategy”.
Though many – no doubt talented – quantitative, BI, data-mining, and analytics experts have taken the opportunity to update their title to the sexiest job of the 21st century, the expected skillset of a data scientist is not trivial. For a business to differentiate itself with regard to data-driven decision making, it must develop novel tools and platforms that its competitors don’t have. This type of delivery can only be reasonably expected from someone with the skills of a data scientist:
Solid coding skills
It falls on data scientists to build their own tools and conduct academic-style research. Hacking skills and the ability to improvise are mandatory. If you’re looking to do streaming image processing and want to train ConvNets, you won’t find them in the drop-down menu of your favorite analytics tool! It must all be coded from scratch using something like MXNet, Kafka, and Spark Streaming.
Interview Question: “Explain how machine epsilon characterizes floating point precision.” [1]
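To give a feel for what the machine epsilon question is probing, here is a minimal Python sketch. It is not from the original discussion, and the function name machine_epsilon is made up for illustration. It assumes Python floats are IEEE 754 double precision with round-to-nearest, which holds on common platforms. The idea is to halve a candidate gap until adding half of it to 1.0 no longer changes the result.

```python
import sys


def machine_epsilon() -> float:
    """Return the gap between 1.0 and the next larger representable float.

    Hypothetical helper for illustration only. Assumes IEEE 754 doubles
    with round-to-nearest arithmetic.
    """
    eps = 1.0
    # Keep halving while half of the current candidate is still
    # large enough to change 1.0 when added to it.
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps


eps = machine_epsilon()

# The loop result should agree with the value Python reports for its float type.
print(eps == sys.float_info.epsilon)

# Anything at or below half of eps is lost when added to 1.0.
print(1.0 + eps / 2.0 == 1.0)

# Rounding error is why exact equality on floats is fragile:
# compare against a tolerance scaled by eps instead.
print(0.1 + 0.2 == 0.3)
print(abs((0.1 + 0.2) - 0.3) <= eps)
```

A good answer to the interview question would connect this gap to relative rounding error: each basic floating point operation is accurate only to within a small multiple of machine epsilon, relative to the size of the result.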
Multidisciplinary grasp of mathematics
Working with machine learning and streaming data requires several forms of mathematics: asymptotic analysis to gauge computational complexity, graph theory for data structures, enumerative combinatorics for probabilistic data structures, optimization for conic solvers, and so forth. Studying statistical methods is a minimal prerequisite, not the end state. The more mathematics you know, the better.
Interview Question: “Discuss the computational complexity of rank n tensor decompositions.” [2]
Experience in technical computing and distributed systems
Tailoring a computing environment to suit data science requires strong familiarity with operating systems principles, networking, hardware performance metrics, and other factors that can affect the outcome of workloads.
Interview Question: “Give examples of stacked task schedulers leading to priority inversion.” [3]
The ability to tell a story
Perhaps it’s becoming clear why the word “scientist” fits this emerging role. It requires the practitioner to design tools, gather data, conduct multiple experiments, and then finally communicate the results. Explaining the results in a pedagogical way is an important part of the job.
Interview Question: “Explain the use of instrumental variables in estimating causal relationships.” [4]
My experience corroborates what industry reports are saying, which is that as of early 2016 this specific combination of skills is scarce. This does not mean that it is impossible to find data scientists, but when many companies are offering high salaries, things like having an exciting workplace can be the deciding factor in attracting talent. Nevertheless, the best place to start looking is at universities that have been forming curriculums around data science for a few years. The cardinal mistake is to re-brand the old customer intelligence team, data warehouse people, or systems engineers and ask them to “build a Hadoop”.
Data science as an emerging discipline is undeniably making waves and is, in my opinion, a game changer. You’ll find people today still denying that “the cloud” has changed the way we do business, but what they’re really noticing is that it can take a decade or longer for such a technology to be generally adopted and change an entire industry. The problem is that if you wait until a technology is easy to implement or “proven” before capitalizing on it, it will not offer the same competitive advantage it once did, and in the worst case you’ll be left behind.

The scientific perspective

Discussions concerning what data science “truly” means can be overly heated in some communities. Normally, when discussing semantics, people spend a few seconds describing what they intend by a word and then get on with the conversation. Something else is going on here, and it stems from the fact that a new discipline with a foundation in computer science has developed which, in the minds of many, has made statistics irrelevant. Of course such a notion is silly, but it cuts to the heart of an issue which is personal to many statisticians.
Statistics in the most liberal sense is a branch of science that pertains to data analysis. It has a clear history in academia that spans nearly a century, but many of its techniques are older than that. Mathematical demigod Carl Gauss invented the least squares algorithm in 1795 to calculate the orbits of planets. Imagine if he had been here today and said dismissively “statistics is nonsense, you’re just describing applied physics.” That wouldn’t sound quite right, because it’s not right! A field of study doesn’t own some mathematical method or other. I’d be willing to wager that modern physicists are more intimately familiar with the particulars of probability density functions than most statisticians. Does that mean that we need to master theoretical physics to understand the Weibull distribution? Of course not. The takeaway here is that statistics is not some monolithic body of knowledge as much as an active field of scientific study with a distinguished history. By extension, what actually differentiates statistics from data science is the aim of the community more than anything else.
Esteemed scientist Leo Breiman elucidated this in a 2001 paper that was remarkably ahead of its time:

  • Statistics assumes that data is generated by a stochastic model and this hypothesis must be tested.
  • Data Science treats data mechanisms as complex and a hypothesis is generated algorithmically.

I find the distinction to be valid in a broad sense, and regularly meet high-level statisticians who view deep learning as a “hack”. Notwithstanding my own opinion, the criterion for any field of science is to produce results. During the AI winter of the 80s and 90s, the computational complexity of non-linear non-convex loss functions made it practically impossible to test advanced learning theories empirically, but that’s not the case anymore. Despite claiming machine learning as its own, the statistics community has yet to produce theoretical foundations even for simple things like learning halfspaces with SVMs, much less deep learning.
This is not without a stroke of irony, given how many self-described statistics practitioners wish to position their field as pure and mathematical, and data science as some sort of fumbling engineering discipline. It only takes a cursory glance at the papers presented at NIPS and ICML to understand that data science has a strong foundation in mathematics, and needs no “assistance” in finding provably efficient solutions to unresolved learning problems.
Throughout history, the field of statistics has made great and valuable contributions to data analysis, probability theory, and much more. What the future has in store for it as a scientific community, though, seems uncertain. Meanwhile, the explosive growth of data science has only just begun.