[Image: NVIDIA Titan X]
If you’re serious about big data and machine learning, you’re already taking advantage of GPU-, MIC-, and FPGA-powered analytics tools. This new breed of software allows a single workstation to outperform a 100-node compute cluster on tasks like machine learning, graph analysis, financial modeling, and many other scenarios. Don’t believe it? In this post I will give a simple example of GPU-accelerated ML using grayscale images from the MNIST database. Each record contains 784 pixels (a 28×28 image) depicting a handwritten digit.
[Image: sample MNIST digits]
I decided to start by setting a baseline with the Random Forest algorithm because it’s the most versatile and advanced ML algorithm available to users of SAS, SPSS, Apache Spark, and so forth. In most deployments Random Forest runs as a sequential (single-threaded) process living in a multi-tenant environment, so people just accept that their modeling tools are slow as dirt. Starting off with the randomForest package, which is based on Leo Breiman’s original implementation, I trained a 50-tree forest on the MNIST data. The code is at the bottom of this post. Here is the result:

randomForest – 50 trees – 10 minutes – 5.75% error rate

This means the model misclassified digits 5.75% of the time. Not so great. I happen to know that this result can be improved by at least a whole order of magnitude, so I increased the number of trees to 500 and trained a new model. The result:

randomForest – 500 trees – 3.5 hours – 2.93% error rate

Well, that result was twice as good, but it took a whole afternoon to get there! Not acceptable. What if we use a parallel, in-memory, highly optimized implementation of random forest? For this test I turned to one of my favorite machine learning tools, H2O, and trained a 250-tree model with a number of tuned parameters. The result:

H2O – 250 trees – 11 minutes – 2.92% error rate

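An H2O run along these lines looks roughly like the sketch below (R code; only the tree count is taken from the text above, while the other settings and the reuse of the train.data frame from the randomForest code at the bottom of the post are illustrative assumptions):

library(h2o)
h2o.init(nthreads = -1)                          # use every available core

train.hex <- as.h2o(train.data)                  # same 785-column MNIST frame as the randomForest run
train.hex[, 785] <- as.factor(train.hex[, 785])  # treat the label as a class, not a number

rf.h2o <- h2o.randomForest(x = 1:784, y = 785,
                           training_frame = train.hex,
                           ntrees = 250,         # tree count from the text
                           max_depth = 30,       # illustrative tuning values
                           sample_rate = 0.8)
h2o.performance(rf.h2o)
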
OK, we’re getting there. H2O had my CPU pegged at 100% and the training time isn’t so bad anymore, but this is pretty much the end of the road for random forest. The original MNIST authors were only able to publish sub-1% error rates by writing a lot of custom code. Let’s see if we can get closer to their results using a simple conv-net. When setting up my own machine learning models I prefer to use MXNet because it has:
– the broadest language support including R, Python, C++, Java, Scala, and Julia
– a blend of symbolic (computation graph) and imperative programming styles
– automatic scaling to multiple CPUs and GPUs across multiple compute nodes
– blazing fast code

MXNet – i7-3930K CPU – 10 minutes – 1.00% error rate

That’s more like it! I cut training off at 20 epochs (the number of passes through the data) so the run would take about 10 minutes, making it comparable to the random forest tests above. I think we can do even better. My personal computer has a mid-range graphics card from 2012. It’s not good enough to play any modern games, but it has 2GB of on-board memory and CUDA support, which means it can be used for some light-weight scientific computing. After switching the compute context to the GPU, the results were:

MXNet – GeForce 670 2GB GPU – 53 seconds – 0.998% error rate

I was surprised, to say the least! The dusty old 670 GPU outperformed a pretty beefy 6-core i7 processor by more than a factor of 10… meaning the four-year-old gaming card in my personal computer is faster than your average data mart. 🙂
A workstation with four Titan X cards gives you roughly 28 teraflops of compute power in one box.
What would you do if you had access to 1000x more analytical compute power than your competitors?
randomForest in R:

> library(randomForest)
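> # train.data: the MNIST training set, assumed already loaded as a data frame with
> # 784 pixel columns plus the digit label in column 785 (loading code not shown)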
> system.time(rf <- randomForest(train.data[,-785], train.data[,785], ntree=500, importance=T))
    user  system   elapsed
12608.60    0.76  12614.04

> show(rf)
Call:
 randomForest(x = train.data[, -785], y = train.data[, 785], ntree = 500, importance = T)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 28
        OOB estimate of  error rate: 2.93%

MXNet in Julia on CPU:

julia> using MXNet
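julia> # Not shown in the original post: the input symbol and the MNIST data providers
julia> # used by mx.fit below (a sketch; the batch size and file paths are assumptions).
julia> data = mx.Variable(:data)
julia> batch_size = 100
julia> train_provider = mx.MNISTProvider(image="data/train-images-idx3-ubyte",
                                         label="data/train-labels-idx1-ubyte",
                                         batch_size=batch_size, shuffle=true, flat=false)
julia> eval_provider  = mx.MNISTProvider(image="data/t10k-images-idx3-ubyte",
                                         label="data/t10k-labels-idx1-ubyte",
                                         batch_size=batch_size, shuffle=false, flat=false)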
julia> conv1 = @mx.chain mx.Convolution(data=data, kernel=(5,5), num_filter=20)  =>
                         mx.Activation(act_type=:tanh) =>
                         mx.Pooling(pool_type=:max, kernel=(2,2), stride=(2,2))
MXNet.mx.SymbolicNode(MXNet.mx.MX_SymbolHandle(Ptr{Void} @0x000000002599ae40))
julia> conv2 = @mx.chain mx.Convolution(data=conv1, kernel=(5,5), num_filter=50) =>
                         mx.Activation(act_type=:tanh) =>
                         mx.Pooling(pool_type=:max, kernel=(2,2), stride=(2,2))
MXNet.mx.SymbolicNode(MXNet.mx.MX_SymbolHandle(Ptr{Void} @0x000000002599aee0))
julia> fc1   = @mx.chain mx.Flatten(data=conv2) =>
                         mx.FullyConnected(num_hidden=500) =>
                         mx.Activation(act_type=:tanh)
MXNet.mx.SymbolicNode(MXNet.mx.MX_SymbolHandle(Ptr{Void} @0x0000000025971e30))
julia> fc2   = mx.FullyConnected(data=fc1, num_hidden=10)
MXNet.mx.SymbolicNode(MXNet.mx.MX_SymbolHandle(Ptr{Void} @0x0000000025971eb0))
julia> lenet = mx.SoftmaxOutput(data=fc2, name=:softmax)
MXNet.mx.SymbolicNode(MXNet.mx.MX_SymbolHandle(Ptr{Void} @0x00000000259719d0))
julia> model = mx.FeedForward(lenet, context=mx.cpu())
MXNet.mx.FeedForward(MXNet.mx.SymbolicNode(MXNet.mx.MX_SymbolHandle(Ptr{Void} @0x0000000025b491f0)),[GPU0],#undef,#undef,#undef)
julia> optimizer = mx.SGD(lr=0.05, momentum=0.9, weight_decay=0.00001)
MXNet.mx.SGD(MXNet.mx.SGDOptions(0.05,0.9,0,1.0e-5,MXNet.mx.LearningRate.Fixed(0.05),MXNet.mx.Momentum.Fixed(0.9)),#undef)
julia> mx.fit(model, optimizer, train_provider, n_epoch=20, eval_data=eval_provider)
INFO: == Epoch 011 ==========
INFO: ## Training summary
INFO:           accuracy = 0.9997
INFO:               time = 52.6990 seconds
INFO: ## Validation summary
INFO:           accuracy = 0.9900
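
For the GPU run, only the compute context passed to FeedForward changes; the network definition and training call stay the same. A minimal sketch, assuming a CUDA-enabled MXNet build (and, for the multi-device scaling mentioned in the feature list, more than one card):

julia> model = mx.FeedForward(lenet, context=mx.gpu())
julia> # or spread the batches across several cards:
julia> model = mx.FeedForward(lenet, context=[mx.gpu(0), mx.gpu(1)])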