Hi, Albin here again! Part of the work I do at LakeTide is making our machine learning models more computationally efficient. For instance, sometimes the neural networks we develop end up being big enough that they don’t fit on GPUs anymore. Other times we just need to increase the throughput of our models so they can handle streaming data better. In this blog post I will walk you through an easy way to decompose convolutional layers, significantly reducing their size and computational demands, with no loss in accuracy. The proposed scheme is developed by Cheng et al. https://arxiv.org/abs/1511.06067. As always, Julia and MXNet are our weapons of choice!
First we download and define two data providers for testing and training our data. In MXNet data providers for Cifar10 are neatly provided which is convenient for proof of concept implementations.
Next we define an original network structure. Here is a network of 6 convolutional layers where the first and the third have 5×5 that will later be decomposed.
We bind the network to a feed forward model and use an SGD optimizer.
Now we train the model for 30 epochs and reach an accuracy of 80.02%.
So far so good, now we want to compress the two convolutional layers that are most computationally expensive, layer one and three. The compression algorithm is based on the following scheme:
The convolutional kernel can be described as a 4D tensor where N and C are the number of output and input channels respectively and d is the kernel size. Let the input feature map be given as which gives the output feature map as:
Now the approximation scheme used looks like:
Where K is a hyperparameter controlling the rank of the approximation, forms a horizontal filter and forms a vertical filter. With this the convolution becomes:
Here the function we want to minimize is over the Frobenius norm as:
This is a minimization problem that has a closed form solution. To solve the problem we use the following algorithm:
Define a bijection that maps a tensor to a matrix as:
by letting the tensor element map to the matrix element as
Then define and take the singular value decomposition as
Now we set:
Which is a non-unique solution to the minimization problem.
The proof is included at the end of the blog post.
Based on the algorithm of Cheng et al. we define a function that takes an NDArray of weights from the pre-trained convolutional layer together with the desired rank of the decomposition. It returns two NDArrays containing the weights for the decomposed network layer we will define below. The method is extensively described at the end of this blog post.
Next we define a function to overwrite the network weights that are not in layer one or three.
Now we define a new network architecture where we have decomposed the first and third layer with rank 10 and 20 respectively.
Next we bind this architecture to a feed forward model andinitialize it.
Now we use our functions to overwrite the weights from the original network and calculate the weights for the decomposed layers of the network.
We fine-tune the parameters for only 5 epochs a get a new accuracy of 80.08%.
Voila! With a fast check on the forward pass we see that we’ve managed to speed it up by ~20% !
That’s a non-negligible increase for many applications that put pressure on the throughput of your network.
We have also reduced the number of weights in the first convolutional layer by x1.48 and for the third convolutional layer by x16!
Fast and smooth as promised.
Where . Set as a solution to equation
1. Then construct a solution to equation 3 as:
Then we know (because the Frobenius norm is separable):
We know that so further on:
Here is any solution to 2. Then we construct a solution to equation 1 with equation 2 which gives:
With equation 4 this gives:
Which proves that is a solution to equation 1