In the previous post on
the Movie Lens data set I had a look at collaborative filtering to predict movie
ratings using methods based upon user-user and movie-movie similarities.
In this post I will play with matrix factorisation to learn “hidden” variables
for each user and movie that interact in some way to produce the final rating.
We don’t learn what these “hidden” variables actually correspond to; on the user side we would probably find variables that correlate with e.g. the age of the user, and on the movie side with e.g. how much action there is in the movie. I have followed this blog post very closely, but have attempted to explain the problem in my own words and to streamline the code to match my more modest coding ability.
The image below shows how this technique works: two smaller matrices, one corresponding to the users (and their variables) and one to the movies (and their variables). In this example there are only two variables ($d = 2$) for each user and movie. The user matrix has $n_{\text{users}}$ rows and $d$ columns, and the movie matrix has $d$ rows and $n_{\text{movies}}$ columns.
The equation for the rating of user 0 of film 0 is shown: this demonstrates how the variables of the user and the movie interact to predict this rating,
$$\hat{r}_{0,0} = u_{0,0}\, m_{0,0} + u_{0,1}\, m_{1,0},$$
where $u_{0,0}$ and $u_{0,1}$ are the two variables for that user, and $m_{0,0}$ and $m_{1,0}$ are the two variables for that movie.
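To make the arithmetic concrete, here is a tiny NumPy example with two made-up factor vectors (the numbers are arbitrary and exist only to show the dot product):

```python
import numpy as np

# Hypothetical learned factors for user 0 and movie 0 (d = 2 hidden variables).
user_0 = np.array([1.2, 0.4])   # row 0 of the user matrix
movie_0 = np.array([0.9, 2.0])  # column 0 of the movie matrix

# Predicted rating: u_00 * m_00 + u_01 * m_10
predicted_rating = user_0 @ movie_0
print(predicted_rating)  # 1.2*0.9 + 0.4*2.0 = 1.88
```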
It’s possible the two user variables might represent the user’s age and location
(e.g. a large number for very urban location, small for rural), and the movie
variables might encode representations of the movie genre e.g. one variable may
relate to crime thrillers and the other to comedy. Then, if there existed a trend for older people who live in urban areas to like comedic crime movies, this method would predict a high rating.
Finally, because not all users assign ratings in the same way (some users may be harsher raters and never give a movie above a 4, etc.), we want to take account of this “bias” by adding a column of user bias values. A similar effect may be
present for different types of movies (an epic movie may naturally garner higher
ratings than a low budget rom-com) so we take account of this with an equivalent
row of movie bias values. A corresponding column/row of ones is added to the
user/movie matrix so the biases are treated separately (as shown in the
equation).
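Putting the factors and both bias terms together, the predicted rating for user $i$ and movie $j$ takes a form along these lines (the notation is mine, not necessarily that of the original figure):

$$\hat{r}_{i,j} = b^{\mathrm{user}}_{i} + b^{\mathrm{movie}}_{j} + \sum_{k=1}^{d} u_{i,k}\, m_{k,j}$$

The added column/row of ones simply ensures that each bias term is multiplied by 1, so it is added straight on to the factor product.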
A quick reminder of the data set is shown below:

   user_id  movie_id  rating  timestamp
0      196       242       3  881250949
1      186       302       3  891717742
2       22       377       1  878887116
3      244        51       2  880606923
4      166       346       1  886397596
The algorithm
The function below details the algorithm that learns the hidden factors via gradient descent, minimising a cost function which is basically the regularised sum of squared differences between the predicted and true ratings.
The most important arguments are (besides the actual training_set data):
d: the number of hidden variables to learn
lam: the amount of regularisation (0=no regularisation)
lr: the learning rate of the gradient descent
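A stripped-down NumPy version of such a routine might look something like the sketch below. The function name factorise, its exact signature and the plain batch gradient descent loop are my illustration rather than the original implementation, and the bias handling via the mode argument used later is omitted for brevity.

```python
import numpy as np

def factorise(training_set, n_users, n_movies, d=2, lam=0.1, lr=1.0,
              n_iter=1000, tol=1e-6):
    """Learn user and movie factor matrices by batch gradient descent on the
    regularised squared error over the known ratings.

    training_set : array of (user_id, movie_id, rating) rows, 0-indexed ids
    d            : number of hidden variables to learn per user/movie
    lam          : amount of regularisation (0 = no regularisation)
    lr           : learning rate of the gradient descent
    """
    users = training_set[:, 0].astype(int)
    movies = training_set[:, 1].astype(int)
    r = training_set[:, 2].astype(float)
    n = len(r)

    # Small random starting values for the two factor matrices.
    U = 0.1 * np.random.randn(n_users, d)    # n_users x d
    M = 0.1 * np.random.randn(d, n_movies)   # d x n_movies

    prev_cost = 0.0
    for it in range(n_iter):
        # One prediction per known rating: dot product of the user's row
        # with the corresponding movie's column.
        pred = np.sum(U[users] * M[:, movies].T, axis=1)
        err = pred - r
        cost = (np.sum(err ** 2) + lam * (np.sum(U ** 2) + np.sum(M ** 2))) / (2 * n)
        if it % 100 == 0:
            print(f"Iteration {it + 1} current cost = {cost:g} "
                  f"Delta-cost = {abs(prev_cost - cost):g}")
        if abs(prev_cost - cost) < tol:
            print(f"Converged at iteration {it + 1}")
            break
        prev_cost = cost

        # Accumulate the error gradient for every user row / movie column.
        grad_U = np.zeros_like(U)
        grad_M = np.zeros_like(M)
        np.add.at(grad_U, users, err[:, None] * M[:, movies].T)
        np.add.at(grad_M.T, movies, err[:, None] * U[users])
        U -= lr * (grad_U + lam * U) / n
        M -= lr * (grad_M + lam * M) / n

    rmse = np.sqrt(np.mean((np.sum(U[users] * M[:, movies].T, axis=1) - r) ** 2))
    print(f"Final RMSE = {rmse:g}")
    return U, M
```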
The training set
As a first exploration of this method we just use the entire 10,000 ratings to
train with.
In practice not only would you use a subset to train with, but you would also set aside a cross-validation set to learn the optimum number of hidden variables and the optimum amount of regularisation to use, as sketched below.
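A split along these lines would do the job (the 80/20 proportion and the random seed are arbitrary choices of mine):

```python
import numpy as np

def train_cv_split(ratings, cv_fraction=0.2, seed=0):
    """Shuffle the (user_id, movie_id, rating) rows and hold out a
    cross-validation set for tuning d and the regularisation lam."""
    rng = np.random.default_rng(seed)
    shuffled = ratings[rng.permutation(len(ratings))]
    n_cv = int(cv_fraction * len(ratings))
    return shuffled[n_cv:], shuffled[:n_cv]   # training set, CV set
```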
Not including user or item bias
To begin we see what kind of precision we get when we ignore the user and item biases. Here we set the number of hidden variables to 100 (100 per user + 100 per movie) and set the learning rate to 10 (for fast convergence).
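With the sketch function above, the equivalent call would look roughly like this (the regularisation strength used for the runs in this post isn’t stated, so the value below is a placeholder; the cost trace that follows is from the post’s original run, not from this sketch):

```python
# ratings: a (user_id, movie_id, rating) NumPy array built from the data frame above
n_users = int(ratings[:, 0].max()) + 1
n_movies = int(ratings[:, 1].max()) + 1

# 100 hidden variables per user and per movie, learning rate 10 for fast convergence
U, M = factorise(ratings, n_users, n_movies, d=100, lam=0.1, lr=10)
```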
Iteration 1 current cost = 1.13354 Delta-cost = 1.1335362196
Iteration 101 current cost = 0.572312 Delta-cost = 0.000516891
Iteration 201 current cost = 0.548027 Delta-cost = 0.000113606
Iteration 301 current cost = 0.540612 Delta-cost = 4.91738e-05
Iteration 401 current cost = 0.536911 Delta-cost = 2.85506e-05
Iteration 501 current cost = 0.534622 Delta-cost = 1.87159e-05
Iteration 601 current cost = 0.533034 Delta-cost = 1.32322e-05
Iteration 701 current cost = 0.531851 Delta-cost = 1.06096e-05
Iteration 801 current cost = 0.530925 Delta-cost = 8.10623e-06
Iteration 901 current cost = 0.530173 Delta-cost = 6.79493e-06
Final RMSE = 1.02912
The results are OK: the root-mean-square error is about +/- 1 rating. However, we can see from the distribution of ratings that the predictions end up too tightly clustered around the mean overall rating.
Learn user bias alone
While we learn the column of user biases we can also see how well having only a single column of user bias values can predict the ratings.
Note that now mode is set to be user_bias
Iteration 1 current cost = 0.638533 Delta-cost = 0.638532996178
Iteration 101 current cost = 0.542808 Delta-cost = 0.000134468
Iteration 201 current cost = 0.535901 Delta-cost = 3.67165e-05
Iteration 301 current cost = 0.533597 Delta-cost = 1.54972e-05
Iteration 401 current cost = 0.53255 Delta-cost = 7.21216e-06
Iteration 501 current cost = 0.53201 Delta-cost = 3.8743e-06
Iteration 601 current cost = 0.531713 Delta-cost = 2.26498e-06
Iteration 701 current cost = 0.531543 Delta-cost = 1.13249e-06
Converged at iteration 704
0.53154 0.531539
Final RMSE = 1.03106
The performance is very similar to running without bias but with 100 hidden
variables. However the predictions look more clustered around the mean rating
than before.
Learn movie bias alone
Repeating the same thing, but now for the movie bias.
Note that now mode is set to be movie_bias
Iteration 1 current cost = 0.638533 Delta-cost = 0.638532996178
Iteration 101 current cost = 0.547283 Delta-cost = 0.000288308
Iteration 201 current cost = 0.529536 Delta-cost = 0.000110269
Iteration 301 current cost = 0.521484 Delta-cost = 5.93066e-05
Iteration 401 current cost = 0.516879 Delta-cost = 3.67761e-05
Iteration 501 current cost = 0.513903 Delta-cost = 2.42591e-05
Iteration 601 current cost = 0.511821 Delta-cost = 1.78218e-05
Iteration 701 current cost = 0.510283 Delta-cost = 1.33514e-05
Iteration 801 current cost = 0.509098 Delta-cost = 1.055e-05
Iteration 901 current cost = 0.508156 Delta-cost = 8.70228e-06
Final RMSE = 1.00736
So far the performance is best when just including the movie bias!
Including bias
Note that now mode is set to be a tuple of the learned user bias matrix and
the learned movie bias matrix.
Iteration 1 current cost = 0.958202 Delta-cost = 0.9582015872
Iteration 101 current cost = 0.469477 Delta-cost = 0.000476152
Iteration 201 current cost = 0.445877 Delta-cost = 0.000118017
Iteration 301 current cost = 0.438018 Delta-cost = 5.33164e-05
Iteration 401 current cost = 0.433981 Delta-cost = 3.10242e-05
Iteration 501 current cost = 0.431411 Delta-cost = 2.18451e-05
Iteration 601 current cost = 0.429543 Delta-cost = 1.66595e-05
Iteration 701 current cost = 0.428053 Delta-cost = 1.33812e-05
Iteration 801 current cost = 0.426783 Delta-cost = 1.20699e-05
Iteration 901 current cost = 0.42565 Delta-cost = 1.06394e-05
Final RMSE = 0.921525
43% of predictions were correct, and 47% were only 1 rating off, so 10% of predictions were 2 or more ratings off. I don’t know how “good” these results actually are for typical implementations of this method, though a quick Google search showed RMSEs comparable to ~0.9 (the result achieved here is 0.92). However, if you are within 1 rating (out of a total of 5) 90% of the time, then 90% of the time you can at least discriminate between whether a user “likes” or “dislikes” a movie.
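For reference, the kind of calculation behind those percentages might look like this; rounding the predictions to the nearest whole star before comparing is my assumption:

```python
import numpy as np

def rating_accuracy(predicted, true):
    """Fraction of predictions that are exactly right, and the fraction
    that are within one rating, after rounding to whole star ratings."""
    rounded = np.clip(np.rint(predicted), 1, 5)
    diff = np.abs(rounded - true)
    return np.mean(diff == 0), np.mean(diff <= 1)
```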