### Your competitors are probably not using it yet.

Software from well-established analytics vendors has historically lagged 5 to 10 years behind the leading edge of machine learning. Consider the following timeline:

- 2001 – The Random Forest algorithm is formalized by Leo Breiman
- 2002 – An official R implementation is released by Breiman himself
- 2012 – A full decade later, SAS announces the HP FOREST procedure

Given how XGBoost (a tree-ensemble method in the same family as random forests) has **absolutely dominated** machine learning competitions, the competitive advantage such algorithms could have offered businesses a decade ago is far from negligible.

Agile, data-driven companies have exploited this gap to stay ahead of the competition. In 2011, Microsoft published a notable paper describing how **deep neural networks** cut the error rate of their speech recognition software by about a third (33%) relative to the classic GMM approach. Around the same time, a number of universities and other large companies reported similarly impressive results and began devoting tremendous resources to deep learning techniques such as the following:

- ReLU and Maxout activation functions
- Random Search hyperparameter tuning
- Momentum and Adagrad learning rate strategies
- Dropout and Stochastic Pooling regularization
- Elastic Averaging stochastic gradient descent
- Batch Normalization of input
- BinaryNet weight and activation constraints
- Tensor representations
- GPU acceleration
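Two of the items above, ReLU activations and dropout regularization, are simple enough to sketch in a few lines. The following is a minimal, illustrative pure-Python sketch (the input values and dropout rate are made up for demonstration), not a production implementation:

```python
import random

def relu(x):
    """ReLU activation: pass positive values through, zero out the rest."""
    return [max(0.0, v) for v in x]

def dropout(x, rate, training=True):
    """Inverted dropout: during training, zero each unit with probability
    `rate` and scale survivors by 1/(1-rate) so the expected activation
    stays the same; at inference time, pass values through untouched."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if random.random() < keep else 0.0 for v in x]

# Toy forward step for one hidden layer.
activations = relu([-2.0, -0.5, 0.0, 1.5, 3.0])
print(activations)  # [0.0, 0.0, 0.0, 1.5, 3.0]

random.seed(0)
print(dropout(activations, rate=0.5))  # some units zeroed, the rest doubled
```

The dropout scaling trick is what lets the same network run at inference time with no extra bookkeeping.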

As of 2016, support for deep learning is practically nonexistent in SAS, SPSS, and Stata. They provide no standardized way to implement deep learning architectures, much less any of the techniques above.

### Why you should care about deep learning.

Deep learning is at the root of some incredible new developments in image processing, natural language processing, biology, autonomous driving, artificial intelligence, and more. This is because compared to many traditional ML techniques, deep learning excels in four closely related areas:

**1 – Detecting complex nonlinear patterns**

We know that a car’s resale value cannot be determined accurately from a single feature like “engine volume”. One must include more features, such as the car’s age, its brand, its upkeep, and more. One may additionally choose to include more volatile signals such as the state of the economy, trending colors, marketing campaigns, and so on. What is important to understand is that all these variables are interrelated: the effect of changing one depends on the values of the others. The effect of having a 5 L engine depends on whether you are talking about a sports car or an old pickup truck. The age of a car interacts with how collectible and how well maintained it is. When all these features are taken into account, a rather complex picture emerges.
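As a toy illustration of such an interaction (the pricing rule and numbers below are invented for demonstration, not market data), consider a world where a large engine raises the value of a sports car but lowers the value of an aging pickup:

```python
def toy_resale_value(engine_liters, age_years, is_sports_car):
    """Invented pricing rule: the *same* engine size helps or hurts
    depending on body type, so no single per-feature weight can
    describe its effect."""
    base = 30000 if is_sports_car else 15000
    engine_effect = engine_liters * (2000 if is_sports_car else -500)
    return base + engine_effect - 1000 * age_years

# A 5L engine adds value on a 3-year-old sports car...
print(toy_resale_value(5.0, 3, True))    # 37000.0
# ...but subtracts value on a 10-year-old pickup truck.
print(toy_resale_value(5.0, 10, False))  # 2500.0
```

No single per-feature coefficient can capture the engine’s effect here; this is exactly the kind of interaction a deep model can learn on its own.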

Below is a striking plot made by H2O.ai illustrating how much more flexible deep learning is than classic approaches at learning convoluted patterns. Notice also the deceptively good score (AUC) of the other plots even though they generalize very badly; such models would not produce good predictions.

**2 – Learning the importance of variables automatically**

Because deep learning algorithms fold their problem space so effectively, redundant variables, ones that carry the same information, end up canceling each other out (in a manner of speaking).

When the deep autoencoder algorithm was developed by Hinton et al., the paper included a great plot showing the difference between a classic shallow method and deep learning. The data consisted of roughly 800,000 newswire articles categorized into eight different classes.

The first plot below comes from PCA, a real workhorse of statistical analysis that is often used to illustrate which variables contribute most to a given classification. Here, the image is essentially unintelligible.

The second plot comes from a deep learning architecture that specializes in discarding unimportant variables. When the 2,000 variables are reduced down to just 2 and then plotted, something remarkable happens: the eight categories become astoundingly clear, and a point’s distance from the center of the plot signifies how strongly it belongs to its class.
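For a sense of scale, the encoder in that paper squeezes 2,000 word-count inputs through progressively narrower layers (2000–500–250–125–2) down to a 2-number code. The sketch below reproduces only that shape, with random untrained weights and a stand-in squashing function; it illustrates the dimensions involved, not the trained model:

```python
import random

random.seed(42)
LAYER_SIZES = [2000, 500, 250, 125, 2]  # encoder shape reported in the paper

def random_matrix(rows, cols):
    """Small random weights, purely for demonstration (untrained)."""
    return [[random.gauss(0.0, 0.01) for _ in range(cols)] for _ in range(rows)]

weights = [random_matrix(a, b) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]

def encode(x):
    """Forward pass: each layer multiplies by its weight matrix and applies
    a cheap squashing nonlinearity, ending in a 2-number code."""
    for w in weights:
        x = [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]
        x = [v / (1.0 + abs(v)) for v in x]  # stand-in for sigmoid
    return x

doc_vector = [random.random() for _ in range(2000)]
code = encode(doc_vector)
print(len(code))  # 2 – each article becomes a single point on a 2-D plot
```

Once every article is a 2-D point, plotting the whole corpus and coloring by class is trivial, which is exactly what makes the paper’s figure so legible.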

**3 – Learning highly variable patterns from few examples**

Deep architectures allow a feature to be learned without having to observe an exponentially large number of configurations. If only *N* out of *D* variables are required to correctly classify something and there are *K* classes, a deep learning algorithm needs on the order of N × K observations to learn the classes. In contrast, traditional methods can require an exponential N^D observations! This huge difference has been researched extensively by Bengio et al. Another way to think of it is that deep learning generalizes very well: it can correctly classify inputs it has never seen before, because it does not need to memorize every configuration the way simpler models must.
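Plugging in some illustrative numbers (chosen here for demonstration, not taken from the research) makes the size of that gap concrete:

```python
# Illustrative values only: 5 relevant variables, 8 classes, 50 total variables.
N, K, D = 5, 8, 50

deep_needs = N * K       # observations a deep model may need
shallow_needs = N ** D   # worst case for a purely local method

print(deep_needs)        # 40
print(shallow_needs)     # 88817841970012523233890533447265625 (~8.9e34)
```

Forty observations versus a 35-digit number: no dataset on Earth covers the latter, which is why exploiting structure instead of enumerating configurations matters.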

Why does this matter? With around 5,000 examples per class, deep convolutional networks achieve excellent performance; with datasets on the order of 10 million labeled examples, they can match or surpass human performance.

**4 – Learning abstract representations**

The philosophical principle of compositionality lies at the very core of deep learning. It is probably impossible to describe the concept of “dog” using a single layer of binary pixel intensity values. Rather, one should use a stack of increasingly abstract representations of a dog. Andrew Ng et al. discovered that these abstractions are not only important to solving complicated problems, but once learned, they can be *transferred* across problems.

A fantastic example of this principle comes from Microsoft, who discovered that a deep network trained to understand English text improved by 5% after learning another language, such as French. Learning even more languages improved its English capabilities further. Not only that, but each new language was much easier to learn once the network had been pre-trained on others.

A critique I often hear about deep learning is that the models it produces are not easy to “interpret”. This is simply not correct. If you model a simple relationship between two variables, like a car’s tire wear and its stopping distance, a deep neural network will be just as easy to interpret as a decision tree. The crux of the matter is that when a more complex phenomenon is being modeled, it is the phenomenon itself that is unintelligible. Moreover, these complicated problems are seldom solved by the old analytical methods at all, and an unsolved problem is hardly more intelligible.

A business owner should not be asking for a one-sentence answer to why a marketing campaign performed as it did; there might not be a simple answer. Instead they should ask, “Is it possible for us to predict, with an even higher degree of accuracy, which individuals will respond positively to a specific campaign in the future?”

### Deep Business Decisions

Don’t sell your business short by assuming it is too simple to benefit from deep learning. Extremely complex emergent patterns exist in every facet of life, not least in business. You might want to optimize your order fulfillment, improve your hardware maintenance strategy, or adjust asset prices in real time. The question then becomes: what is an improvement of a few percent worth? In many cases these marginal improvements translate into enormous savings.

More interestingly though, understanding how customers interact with a product (and each other) can open up entirely new business models where the primary asset is the insight itself.