This is a classic data set for trying out recommender-system approaches. It was set up in 1997 by a research group at the University of Minnesota who wanted to gather research data on personalised recommendations.
We have the following three tables:

- the ratings users gave to movies
- data on the users (age, gender, occupation, zip code)
- data on the movies (title, release date, genre(s))

In this version of the data set we have 100,000 ratings of 1682 movies from 943 users.
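A quick sketch of how the three tables might be loaded with pandas, assuming the standard MovieLens 100K file layout (u.data, u.item and u.user); adjust the paths and separators if your copy is organised differently.

```python
import pandas as pd

# Ratings file (u.data): tab-separated, no header row
ratings = pd.read_csv(
    "ml-100k/u.data", sep="\t",
    names=["user_id", "movie_id", "rating", "timestamp"],
)

# Movie file (u.item): pipe-separated, latin-1 encoded, one 0/1 column per genre
genre_cols = [
    "unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]
movies = pd.read_csv(
    "ml-100k/u.item", sep="|", encoding="latin-1",
    names=["movie_id", "movie_title", "release_date",
           "video_release_date", "IMDb_URL"] + genre_cols,
).set_index("movie_id")

# User file (u.user): pipe-separated demographics
users = pd.read_csv(
    "ml-100k/u.user", sep="|",
    names=["user_id", "age", "gender", "occupation", "zip_code"],
).set_index("user_id")
```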
User ratings for movies
This table is the ‘meat’ required for any recommender approach: the actual ratings.
|   | user_id | movie_id | rating | timestamp |
|---|---------|----------|--------|-----------|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22  | 377 | 1 | 878887116 |
| 3 | 244 | 51  | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
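The summary figures below can be reproduced along these lines, using the ratings DataFrame from the loading sketch above:

```python
print("Number of unique users =", ratings["user_id"].nunique())
print("Number of unique movies =", ratings["movie_id"].nunique())
print("Number of ratings =", len(ratings))
print("Mean rating =", ratings["rating"].mean())

# Movies that appear exactly once in the ratings table
ratings_per_movie = ratings["movie_id"].value_counts()
print((ratings_per_movie == 1).sum(), "movies were rated by only 1 user")
```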
Number of unique users = 943
Number of unique movies = 1682
Number of ratings = 100000
Mean rating = 3.52986
141 movies were rated by only 1 user
Movie data
| movie_id | movie_title | release_date | video_release_date | IMDb_URL | unknown | Action | Adventure | Animation | Children | Comedy | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Toy Story (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | GoldenEye (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | Four Rooms (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | Get Shorty (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | Copycat (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 23 columns
Users
| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
User-item matrix
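One way to build the user-item matrix is to pivot the ratings table, with a zero wherever a user has not rated a movie; the ‘sparsity’ figure below is then just the fraction of non-zero entries. A sketch, reusing the ratings DataFrame from above:

```python
import numpy as np

# Rows = users, columns = movies, entries = ratings (0 where no rating exists)
R = ratings.pivot_table(index="user_id", columns="movie_id",
                        values="rating", fill_value=0).to_numpy(dtype=float)

filled = np.count_nonzero(R) / R.size   # fraction of (user, movie) pairs with a rating
print(f"Sparsity of the user-item matrix = {100 * filled:.1f}%")
```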
Sparsity of the user-item matrix = 6.3%
Step 0: Most basic recommendation
Predict every rating to simply be the mean movie rating! This is the baseline we seek to improve upon.
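I read ‘the mean movie rating’ here as the overall mean of all 100,000 ratings, which makes the baseline a one-liner; predicting a constant equal to the mean means the RMSE is just the standard deviation of the ratings:

```python
import numpy as np

# Predict the overall mean rating for every (user, movie) pair
global_mean = ratings["rating"].mean()
rmse = np.sqrt(((ratings["rating"] - global_mean) ** 2).mean())
print(f"Maximum root-mean-square error = {rmse:.2f}")
```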
Maximum root-mean-square error = 1.13
Step 1: Rating prediction based upon user-user similarity
We define “similarity” using the cosine similarity metric. Imagining each user’s set of ratings as a vector, the cosine similarity of two users is simply the cosine of the angle between their two vectors. This is given by the dot product of the two vectors divided by the product of their magnitudes:

$$\mathrm{sim}(u, v) = \cos\theta = \frac{\mathbf{r}_u \cdot \mathbf{r}_v}{\lVert \mathbf{r}_u \rVert \, \lVert \mathbf{r}_v \rVert}$$
And for ease I’ll simply do a ‘leave-one-out’ approach: for every user I will use the other users’ ratings to predict that user’s ratings, and then calculate an overall root-mean-square error on those predictions.
The code below is also slow because it’s not taking full advantage of matrix
operations, but THIS CODE IS FOR ILLUSTRATION ONLY!
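The similarity matrix itself can at least be done with one matrix product (the slow part is the rating-by-rating prediction loop further down). A sketch, using the user-item matrix R from earlier:

```python
import numpy as np

# Cosine similarity between every pair of user rating vectors (rows of R)
norms = np.linalg.norm(R, axis=1, keepdims=True)   # ||r_u|| for each user
user_sim = (R @ R.T) / (norms * norms.T)           # dot products / product of magnitudes
np.fill_diagonal(user_sim, 0)                      # a user shouldn't count as their own neighbour
print(user_sim.shape)
```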
(943, 943)
Predict a rating as exactly the rating given by the most similar user
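A sketch of this first attempt, looping rating-by-rating (hence slow): for each (user, movie) pair, leave that user out, find the most similar user who also rated the movie, and copy their rating. It assumes user and movie IDs run contiguously from 1, as they do in this data set, so ID − 1 indexes straight into R and user_sim.

```python
import numpy as np

preds, actuals = [], []
for _, row in ratings.iterrows():
    u, m = row["user_id"] - 1, row["movie_id"] - 1   # 0-based indices into R / user_sim
    raters = np.where(R[:, m] > 0)[0]
    raters = raters[raters != u]                     # leave the target user out
    if len(raters) == 0:
        continue                                     # nobody else rated this movie
    best = raters[np.argmax(user_sim[u, raters])]    # most similar user who rated it
    preds.append(R[best, m])
    actuals.append(row["rating"])

preds, actuals = np.array(preds), np.array(actuals)
print("Failed to make", len(ratings) - len(preds), "predictions")
print("Number of predictions made =", len(preds))
print("Root mean square error =", np.sqrt(np.mean((preds - actuals) ** 2)))
```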
Failed to make 141 predictions
Number of predictions made = 99859
Root mean square error = 1.30217928718
Erm, it got worse! Let’s try something more sensible. (The failed predictions are for movies that were rated by only a single user, so there is no other user’s rating to copy.)
Predict a rating as the weighted mean of all ratings
Weighted by the user similarities …
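This version can be fully vectorised: with the diagonal of user_sim already zeroed, one matrix product sums every other user’s rating weighted by similarity, and dividing by the total similarity of the users who actually rated each movie gives the weighted mean. For the movies rated by only one user there is nothing to average, so this sketch falls back to the global mean rating (the original may have handled those differently).

```python
import numpy as np

rated = (R > 0).astype(float)
numer = user_sim @ R        # similarity-weighted sum of other users' ratings
denom = user_sim @ rated    # total similarity of the users who rated each movie

# Weighted mean where possible, global mean rating where no other user rated the movie
pred_matrix = np.divide(numer, denom,
                        out=np.full_like(numer, global_mean), where=denom > 0)

mask = R > 0
print("Number of predictions made =", int(mask.sum()))
print("Root mean square error =",
      np.sqrt(np.mean((pred_matrix[mask] - R[mask]) ** 2)))
```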
141 movies had only one user rating
Number of predictions made = 100000
Root mean square error = 1.01584998447
It improved! Can we do any better by only counting the top $k$ most similar users in the weighted sum?
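One way to do that, going back to a (slow) loop so the top-k selection is easy to read; topk_user_rmse is a helper name of my own, and it again falls back to the global mean when no other user has rated the movie:

```python
import numpy as np

def topk_user_rmse(k, R, sim, fallback):
    """Leave-one-out RMSE when each rating is predicted from the k most similar
    other rows of R (users here) that have a rating for the same column (movie)."""
    preds, actuals = [], []
    rows, cols = np.where(R > 0)
    for u, m in zip(rows, cols):
        raters = np.where(R[:, m] > 0)[0]
        raters = raters[raters != u]                         # leave the target out
        if len(raters) == 0:
            preds.append(fallback)
        else:
            top = raters[np.argsort(sim[u, raters])[-k:]]    # k most similar raters
            w = sim[u, top]
            preds.append(np.dot(w, R[top, m]) / w.sum() if w.sum() > 0 else fallback)
        actuals.append(R[u, m])
    preds, actuals = np.array(preds), np.array(actuals)
    return np.sqrt(np.mean((preds - actuals) ** 2))

for k in [2, 10, 25, 50, 75, 100, 300]:
    print(f"Using top {k} most similar users to predict rating")
    print("Root mean square error =", topk_user_rmse(k, R, user_sim, global_mean))
```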
Using top 2 most similar users to predict rating
Number of predictions made = 100000
Root mean square error = 1.14548307354
Using top 10 most similar users to predict rating
Number of predictions made = 100000
Root mean square error = 1.01567130006
Using top 25 most similar users to predict rating
Number of predictions made = 100000
Root mean square error = 1.0023325746
Using top 50 most similar users to predict rating
Number of predictions made = 100000
Root mean square error = 1.0039703331
Using top 75 most similar users to predict rating
Number of predictions made = 100000
Root mean square error = 1.00673516576
Using top 100 most similar users to predict rating
Number of predictions made = 100000
Root mean square error = 1.00894503162
Using top 300 most similar users to predict rating
Number of predictions made = 100000
Root mean square error = 1.01523721106
Yes, it improves if we only use the top $k$ most similar users, with $k = 25$ looking like it gives the best improvement.
Step 2: item-item similarity
We can do exactly the same process for items instead of users: this time we treat an item as a vector of ratings and calculate the similarity between two items in the same manner, using cosine similarity.
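Transposing the user-item matrix makes this almost a one-line change: the columns of R (movies) become the vectors, and the top-k helper from Step 1 can be reused with the roles of users and movies swapped, so each rating is predicted from that user’s own ratings of the k most similar movies. A sketch:

```python
import numpy as np

# Cosine similarity between every pair of movie rating vectors (columns of R)
item_norms = np.linalg.norm(R, axis=0, keepdims=True)
item_sim = (R.T @ R) / (item_norms.T * item_norms)
np.fill_diagonal(item_sim, 0)
print(item_sim.shape)

for k in [2, 10, 25, 50, 75, 100, 300]:
    print(f"Using top {k} most similar movies to predict rating")
    # Same helper as Step 1, with users and movies swapped via the transpose
    print("Root mean square error =", topk_user_rmse(k, R.T, item_sim, global_mean))
```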
(1682, 1682)
Using top 2 most similar movies to predict rating
Number of predictions made = 100000
Root mean square error = 1.0627135524
Using top 10 most similar movies to predict rating
Number of predictions made = 100000
Root mean square error = 0.959118494932
Using top 25 most similar movies to predict rating
Number of predictions made = 100000
Root mean square error = 0.965039905069
Using top 50 most similar movies to predict rating
Number of predictions made = 100000
Root mean square error = 0.980703964443
Using top 75 most similar movies to predict rating
Number of predictions made = 100000
Root mean square error = 0.989516609768
Using top 100 most similar movies to predict rating
Number of predictions made = 100000
Root mean square error = 0.995218394229
Using top 300 most similar movies to predict rating
Number of predictions made = 100000
Root mean square error = 1.01019290087
Step 3: User bias!
Not all users rate movies in the same way, so the collaborative filtering should be more useful if it looks at the relative differences between movie ratings rather than the absolute values. For example, there is a large scatter in the distribution of the mean rating given by each user; some of this is coming from noise (some users will have only rated 10 movies), but some is also coming from the fact that some users tend to consistently rate things higher than other users. To account for this we simply re-define the predicted rating for user $u$ and item $i$ as:

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v} \mathrm{sim}(u, v)\,\bigl(r_{vi} - \bar{r}_v\bigr)}{\sum_{v} \mathrm{sim}(u, v)}$$

where $\bar{r}_u$ is user $u$’s mean rating and the sum runs over the top $k$ most similar users $v$ who have rated item $i$.
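A sketch of the bias-corrected prediction for the top-25 case, with each user’s mean taken over only the movies they actually rated; topk_user_bias_rmse is again a helper name of my own:

```python
import numpy as np

# Each user's mean rating, over only the movies they actually rated
user_means = R.sum(axis=1) / (R > 0).sum(axis=1)

def topk_user_bias_rmse(k, R, sim, means, fallback):
    """Leave-one-out RMSE for top-k user-based prediction on mean-centred ratings."""
    preds, actuals = [], []
    rows, cols = np.where(R > 0)
    for u, m in zip(rows, cols):
        raters = np.where(R[:, m] > 0)[0]
        raters = raters[raters != u]
        if len(raters) == 0:
            preds.append(fallback)
        else:
            top = raters[np.argsort(sim[u, raters])[-k:]]
            w = sim[u, top]
            if w.sum() > 0:
                # prediction = mean_u + weighted mean of (rating_v - mean_v)
                preds.append(means[u] + np.dot(w, R[top, m] - means[top]) / w.sum())
            else:
                preds.append(fallback)
        actuals.append(R[u, m])
    preds, actuals = np.array(preds), np.array(actuals)
    return np.sqrt(np.mean((preds - actuals) ** 2))

print("Using top 25 most similar users to predict rating")
print("Root mean square error =",
      topk_user_bias_rmse(25, R, user_sim, user_means, global_mean))
```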
Using top 25 most similar users to predict rating
Number of predictions made = 100000
Root mean square error = 0.99968273602
It improved slightly on the version without the bias correction!
Summary
Alright! So we have tried predicting user $u$’s rating of movie $i$ as:

- the same rating as given by the most similar user to user $u$ who has rated movie $i$ (result=bad)
- the weighted sum of ratings by all other users who have rated movie $i$, with the weights given by the other users’ similarities to user $u$ (result=ok)
- the weighted sum of ratings by the top $k$ most similar users to user $u$ who have also rated movie $i$ (result=ok)
- as above, but taking account of “user bias” (result=ok)
- the weighted sum of the user’s own ratings for the top $k$ most similar movies to movie $i$ (result=best)
Using the top-10 most similar items with an item-item collaborative filtering approach seems to perform the best!
To be continued … next up, playing with one or more of matrix factorisation and additional features!