SGD and TensorFlow
TensorFlow
TensorFlow is literally the “flow” of “tensors”! It represents a series of computations as a graph of nodes. At each node an “Operation” occurs that consumes zero or more “Tensor”s and produces zero or more “Tensor”s. Each “Tensor” is a structured, typed representation of the data.
TensorFlow has a few high-level types we need to understand:
- Constant: a static value that does not depend on any input data (e.g. an m×n matrix of 3’s)
- Tensor: this represents the data, the only thing passed between ops
- Variable: these record and maintain the state of the system/model
- Operation: an arbitrary operation, e.g. a function that is called on a Tensor with Variables
- Session (this is the workhorse!)
The “Session” is the thing that launches the graph describing the computations. The “Session” provides the methods with which to run the Operations, and also places each operation onto the correct device (e.g. CPU or GPU).
Setting up TensorFlow to do a computation requires two stages:
- Set up the graph
- Execute the graph
For example, if we wanted to multiply two matrices:
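The notebook’s code isn’t preserved here, but a minimal TF 1.x sketch of the two stages might look like this (the matrix values are my own, chosen so the product matches the output below):

```python
import tensorflow as tf

# stage 1: set up the graph
matrix1 = tf.constant([[3., 2.]])      # 1x2 constant matrix
matrix2 = tf.constant([[2.], [3.5]])   # 2x1 constant matrix
product = tf.matmul(matrix1, matrix2)  # matmul Operation producing a 1x1 Tensor

# stage 2: execute the graph in a Session
with tf.Session() as sess:
    result = sess.run(product)

print(result)
print(type(result))
```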
```
[[ 13.]]
<type 'numpy.ndarray'>
```
We see that we get back a 1x1 numpy array equal to the product of the two matrices.
We could get complicated here and go into how TensorFlow can distribute the executable operations across available compute resources, but I’m more interested in how the API works.
Stochastic Gradient Descent
I want a refresher on stochastic gradient descent, and since I’m also keen to learn how to use TensorFlow, I’m going to follow the tutorial given here.
In usual gradient descent (see this previous post) we minimised the cost function, simultaneously updating every feature parameter using the full data set. Stochastic gradient descent, however, updates using a single sample at a time, or a mini-batch of samples. It is an approximation to gradient descent on the full set of samples, and should converge to the same solution. It can be an advantage for large data sets, where evaluating the gradient of the cost function over every sample becomes expensive, or for data sets with streaming input.
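As a reminder, in standard notation (mine, not the tutorial’s), each step updates the parameters $\theta$ using only the $i$-th sample (or mini-batch):

$$\theta \leftarrow \theta - \alpha \, \nabla_{\theta} J\left(\theta;\, x^{(i)}, y^{(i)}\right)$$

where $\alpha$ is the learning rate and $J$ is the cost evaluated on just that sample.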
The implementation
The first step is to download the MNIST data, a database of images of handwritten digits, tagged with labels indicating the digit they represent.
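The tutorial provides a helper that downloads and unpacks the data; with the labels loaded as one-hot vectors it looks like:

```python
from tensorflow.examples.tutorials.mnist import input_data

# downloads the data to MNIST_data/ (if not already there) and loads it,
# with the labels encoded as one-hot vectors
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
```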
```
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
```
A quick look at what we have:
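(The notebook’s inspection code isn’t preserved; a sketch along these lines reproduces the output below:)

```python
# mnist is a namedtuple of three DataSet objects: train, validation, test
print(type(mnist))

splits = [('training', mnist.train),
          ('validation', mnist.validation),
          ('testing', mnist.test)]

for name, ds in splits:
    print('Number of {} examples {}'.format(name, ds.num_examples))

# images are flattened to 784 elements; labels are one-hot over 10 classes
for name, ds in splits:
    print(ds.images.shape, ds.labels.shape)
```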
```
<class 'tensorflow.contrib.learn.python.learn.datasets.base.Datasets'>
Number of training examples 55000
Number of validation examples 5000
Number of testing examples 10000
(55000, 784) (55000, 10)
(5000, 784) (5000, 10)
(10000, 784) (10000, 10)
```
(The Datasets namedtuple exposes count, index, test, train and validation; each of train, validation and test exposes epochs_completed, images, labels, next_batch and num_examples.)
We have training, validation and test sets containing the images and labels. Each image is the 28x28 pixel grid flattened into a 784-element vector, stored as a numpy array. The labels are “one-hot” vectors: the digit n is represented as a vector which is 1 in the nth element and zero everywhere else (e.g. the digit 3 is [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]), also stored as numpy arrays.
Let’s peek at the actual data
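(Again a sketch, assuming matplotlib; the notebook presumably plotted something similar:)

```python
import matplotlib.pyplot as plt
import numpy as np

# reshape the first flattened training image back into 28x28 and plot it,
# with the true digit (the argmax of the one-hot label) as the title
img = mnist.train.images[0].reshape(28, 28)
plt.imshow(img, cmap='gray')
plt.title('label: {}'.format(np.argmax(mnist.train.labels[0])))
plt.show()
```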
Next step: how to train a model
This problem is many-class classification, so we are going to train a model for each class. Specifically, for each class we are going to train a set of weights used to calculate a weighted sum of all the pixel intensities in the image. This weighted sum should be largest when the class of the weights matches the true class of the written digit. I.e. if the true label of the written digit is “3”, then we want the weights for class “3” to produce the largest weighted sum for that image.
The “evidence” for class $i$ is:

$$\text{evidence}_i = \sum_j W_{i,j}\, x_j + b_i$$

where the sum is over all the pixels $j$ of the image $x$. $W_{i,j}$ are the weights for class $i$ in each pixel $j$, and $b_i$ is a bias that deals with class imbalances (e.g. maybe there are an overwhelming number of “3”s in the data set compared to anything else, so the “evidence” should favor “3” before any input data is even seen).
We need to turn this into a probability distribution: for any image the probabilities across all classes must sum to 1. The probability distribution is defined using the softmax function (chosen so that it quickly becomes large for larger values):

$$y_i = \text{softmax}(\text{evidence})_i = \frac{e^{\text{evidence}_i}}{\sum_{k=1}^{K} e^{\text{evidence}_k}}$$

where $K$ is the total number of classes.
Now implement in TensorFlow! Note all code below is taken from the tutorial, but the explanations are my own!
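The model setup and training loop (reproduced from the tutorial, from memory, so check the original):

```python
import tensorflow as tf

# placeholders for the flattened images and their one-hot labels
x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])

# the model: per-class weights and biases, softmaxed weighted sums
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)

# cross-entropy cost, minimised by gradient descent with step size 0.5
cross_entropy = tf.reduce_mean(
    -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# each step trains on a random mini-batch of 100 samples
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
```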
So the “stochastic” bit above was that we grabbed a mini-batch of 100 random data samples to train the model with at each step, rather than using the full data set.
Evaluation with TensorFlow
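The tutorial evaluates on the test set by comparing the predicted class (the argmax of the softmax output) with the true class:

```python
# fraction of test images where the predicted class matches the label
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

print(sess.run(accuracy,
               feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
```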
```
0.9169
```
So our model is nearly 92% accurate, which is actually quite bad apparently!
How to get the data back!
From the docs:
“A Tensor is a symbolic handle to one of the outputs of an Operation. It does not hold the values of that operation’s output, but instead provides a means of computing those values in a TensorFlow Session.”
So how do we look at the data we’ve created?
More from the docs:
“After the graph has been launched in a session, the value of the Tensor can be computed by passing it to Session.run().”
We get it returned from the graph node itself, “feeding” in the data we want used in the operation. See the example below that plots some of the classification failures.
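(My sketch of that example, evaluating on the training set since the count below is out of 55000:)

```python
import numpy as np
import matplotlib.pyplot as plt

# feed the training images through the graph to get predicted classes
predicted = sess.run(tf.argmax(y, 1), feed_dict={x: mnist.train.images})
true = np.argmax(mnist.train.labels, axis=1)

wrong = np.where(predicted != true)[0]
print('{} incorrect labels out of {}'.format(len(wrong),
                                             mnist.train.num_examples))

# plot a handful of the failures, predicted vs true label
fig, axes = plt.subplots(1, 5, figsize=(12, 3))
for ax, i in zip(axes, wrong[:5]):
    ax.imshow(mnist.train.images[i].reshape(28, 28), cmap='gray')
    ax.set_title('pred {}, true {}'.format(predicted[i], true[i]))
    ax.axis('off')
plt.show()
```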
```
4478 incorrect labels out of 55000
```
A visual inspection gives an idea of why this model is such a “failure” with only 92% accuracy.
It’s easy to see how the prediction could have occurred in many cases: the predicted number often shares many high-intensity pixels with the true number.
The model didn’t explicitly account for correlations between pixel intensities, e.g. if one pixel is high intensity, it’s very likely its neighbouring pixels will be too. I would imagine adding this kind of non-linearity to the model would easily bring the accuracy up to 97%+.