Collaborative filtering made easy
December 12th, 2006 by Bryan O'Sullivan
During a lunchtime conversation the other day, a coworker mentioned that he was hacking in his spare time on an entry for the Netflix Prize. This got me to thinking about collaborative filtering: why had I never seen a good description of how to do it? I suspect that people who might ordinarily have a casual interest in the subject hear that there are some statistics involved, whereupon they immediately freeze in the mathematical headlights, and turn the conversation to something else, anything else.
In early 2005, a researcher named Daniel Lemire published, with Anna Maclachlan, a paper with the jazzy title of “Slope One Predictors for Online Rating-Based Collaborative Filtering“. This is an important paper, because it presents a family of really simple collaborative filtering schemes. I mean really simple: there are no statistics involved, just a little bit of linear algebra.
Unfortunately, because the Lemire/Maclachlan paper is aimed at an academic audience, it’s not trivial to tell by merely reading it how simple the technique it presents is, so what follows is my attempt to translate its ideas into my native language of Attention Deficit Programmerese. To make things more concrete, I’m going to present an implementation in less than 40 lines of Python (and I’ll try to explain any obscure bits of Python as I go). Cool, huh?
Regardless of the underlying implementation, collaborative filters tend to try to solve the same broad problem using much the same data.
The coloured cells above indicate which elements of the matrix contain useful information. I’ll describe why most of the matrix is empty later.
To keep things simple, let’s represent our matrices as two levels of Python dict objects (a dict is simply a hash table, if you’re not familiar with Python). The key of each dict is an item ID, so to get the matrix entry for “squid vs cuttlefish”, we look in the first-level dict for squid, then the second-level dict for cuttlefish.
- You have your crowd of users.
- You have your big pile of items.
- Some of your users have been generous enough to tell you what they think of some of your items.
- You want to hawk more items to a user, and you’d prefer to make your suggestions relevant to their likely interests.
The coloured cells above indicate which elements of the matrix contain useful information. I’ll describe why most of the matrix is empty later.
To keep things simple, let’s represent our matrices as two levels of Python dict objects (a dict is simply a hash table, if you’re not familiar with Python). The key of each dict is an item ID, so to get the matrix entry for “squid vs cuttlefish”, we look in the first-level dict for squid, then the second-level dict for cuttlefish.
class SlopeOne(object):
def __init__(self):
self.diffs = {}
self.freqs = {}
The self.diffs matrix will hold differential ratings (I’ll describe what these are soon), and the self.freqs matrix will contain the number of times we’ve computed a differential rating for each pair of items. We’ll fill in the first and second levels of the dicts once we know what our data looks like, so let’s talk about the shape of the data.
We need to be able to express ratings such as “Alice rated a squid as 1, and a cuttlefish as 4″. This is easy to express with two levels of dicts; the first maps from a user to all of that user’s ratings, and the second maps items to their ratings. In other words, our user data looks like this in Python:
userdata = dict(alice=dict(squid=1, cuttlefish=4)),
bob=dict(squid=3, octopus=2))
The time has come to load the data into our matrices; let’s call the method update.
def update(self, data):
To fill in the matrices, we must look at each rating from every user.
for user, ratings in data.iteritems():
For each user and rating they have submitted, we will be iterating over every rating they have submitted a second time. The first thing we do is make sure that the necessary second-level dicts in each matrix exist.
for item, rating in ratings.iteritems():
self.freqs.setdefault(item, {})
self.diffs.setdefault(item, {})
The setdefault call is an unfortunately ugly Pythonic idiom that means “if there’s no value for the given key (the first parameter) in the dict, assign it the given value (the second parameter)”.
Inside this loop, we iterate over the user’s ratings again, to compare each rating against every other.
for item2, rating2 in ratings.iteritems():
self.freqs[item].setdefault(item2, 0)
self.diffs[item].setdefault(item2, 0.0)
Here, we ensure that there are zeroes present in the matrix for every pair of ratings. (Notice that we’re using an integer zero to count frequencies, and a floating point zero for ratings.)
For every pair of ratings, we modify the appropriate cell in self.freqs by adding one to the number of times we’ve seen that pair of ratings.
self.freqs[item][item2] += 1
And to update the self.diffs matrix, we compute the difference between the ratings for the two items.
self.diffs[item][item2] += rating - rating2
This is why I called self.diffs the “differential rating” matrix. Every time a user rates an item, we update the appropriate cells in this matrix to reflect the accumulated difference between that rating and every other rating the user has made.
Once we’re done, we scale each differential rating based on the number of times we’ve seen that pair of items.
for item, ratings in self.diffs.iteritems():
for item2, rating in ratings.iteritems():
ratings[item2] /= self.freqs[item][item2]
Now let’s pull the whole thing together, so you can see how simple this update method is.
def update(self, data):
for user, prefs in data.iteritems():
for item, rating in prefs.iteritems():
self.freqs.setdefault(item, {})
self.diffs.setdefault(item, {})
for item2, rating2 in prefs.iteritems():
self.freqs[item].setdefault(item2, 0)
self.diffs[item].setdefault(item2, 0.0)
self.freqs[item][item2] += 1
self.diffs[item][item2] += item - item2
for item, ratings in self.diffs.iteritems():
for item2, rating in ratings.iteritems():
ratings[item2] /= self.freqs[item][item2]
That’s a measly 13 lines of code, 4 of which are due to us using dict objects to represent matrices. Nice!
(By the way, notice that the update method made no use of the user IDs we passed in. This technique doesn’t care who rated what, just how an item has been rated. Collaborative filtering techniques that don’t use user information directly are called “item-based”, while those that do are called “user-based”.)
It’s easy to fill out our matrices:
s = SlopeOne()
s.update(dict(alice=dict(squid=1.0, cuttlefish=4.0)),
bob=dict(squid=1.0, octupus=3.0)))
With a little data in place, we’d like to make a prediction. To do this, we’ll pass the ratings a user has made to the predict method; it will return predictions of ratings for items the user hasn’t rated.
def predict(self, userprefs):
We begin with two empty dicts; preds will accumulate prediction data, and freqs will accumulate counts.
preds, freqs = {}, {}
We iterate over every item the user has rated, for items that have been rated by anyone (the try/except clause below lets us skip over unrated item pairs).
for item, rating in userprefs.iteritems():
for diffitem, diffratings in self.diffs.iteritems():
try:
freq = self.freqs[diffitem][item]
except KeyError:
continue
We make sure there are zeroes present in our dicts before we accumulate anything into them.
preds.setdefault(diffitem, 0.0)
freqs.setdefault(diffitem, 0)
The accumulation step is simple.
preds[diffitem] += freq * (diffratings[item] + rating)
freqs[diffitem] += freq
We must filter the result a little before we return it. We return a new dict, containing weighted predictions for every item, but omitting items that the user has rated (no point in predicting those), and omitting items for which we have a cumulative frequency of zero.
return dict([(item, value / freqs[item])
for item, value in preds.iteritems()
if item not in userprefs and freqs[item] > 0])
And now let’s see the predict method without all of those intrusive comments, again to see how simple it is.
def predict(self, userprefs):
preds, freqs = {}, {}
for item, rating in userprefs.iteritems():
for diffitem, diffratings in self.diffs.iteritems():
try:
freq = self.freqs[diffitem][item]
except KeyError:
continue
preds.setdefault(diffitem, 0.0)
freqs.setdefault(diffitem, 0)
preds[diffitem] += freq * (diffratings[item] + rating)
freqs[diffitem] += freq
return dict([(item, value / freqs[item])
for item, value in preds.iteritems()
if item not in userprefs and freqs[item] > 0])
Our proof-of-concept Weighted Slope One predictor clocks in at 33 lines of Python. If we had a sparse 2D matrix class handy, we could trim six of those lines. That’s not bad!
It wouldn’t be difficult to go from this trivial implementation to something truly efficient. From the way we construct the matrices in the update method, we can make a few potentially space saving observations.
- Some portions of each matrix are entirely empty. They don’t contain zeroes; they contain nothing, because nobody has rated those pairs of items. If Alice rates squid and cuttlefish, while Bob rates squid and octopus, there will be no entries for cuttlefish vs octopus in either matrix.
- The diagonal of the self.diffs matrix will always be either empty or zero, so there’s no need to store values on the diagonal of the self.freqs matrix.
- Each entry in the upper right of the self.diffs matrix is the negative of the symmetric entry in the bottom left. The upper right of the self.freqs matrix is symmetric to the bottom left (corresponding entries have identical values).

[...] Bryan O’Sullivan wrote an excellent Collaborative Filtering tutorial. He gives out the first Slope One Collaborative Filtering algorithm in Python that I know of. Excellent work. [...]
Wow, that’s simple. Thanks for posting it!
good work!! i’ve implemented a few different CF algorithms. the smallest one being about 10 times what you have done
Nice clear explaination
I think there is a typo in the article:
self.diffs[item][item2] += item – item2
should be:
self.diffs[item][item2] += rating – rating2
Thanks for the fantastic article!
This is a excellent post .Thanks for it.
[...] months ago, I wrote a Python implementation of Daniel Lemire’s Weighted Slope One collaborative filtering [...]
Just for the sake of comparison, here’s an implementation of the same algorithm in D.
http://pastebin.ca/673772
It’s not quite as neat, but it comes surprisingly close for a strong-typed, C-derivative language.
[...] Collaborative filtering made easy In Python [...]
Thanks for the great explanation and tutorial
I’m a little curious about implementing this using a sparse matrix. Would I need a special sparse matrix class or could I just pass in dicts that aren’t full (dense) matrices.
I.E, can I just pass in a matrix (formed using dicts) that just contains non-zero ratings.
I’m guessing when I do this I could also remove these two lines:
self.freqs[item].setdefault(item2, 0)
self.diffs[item].setdefault(item2, 0.0)
Looking through the code, I can’t see a reason why it wouldn’t work, but I’m new to this so if you can chime in, I’d appreciate it
Hi,
I run this algorithm against the MovieLens 100K data set.
When I try to get recommendation for a movie sometimes the predicted recommendation is far above 5. Notice that the valid rating should be in between 1 and 5.
Is this an issue with this SlopeOne algo?
Thanks
@Doug,
Yes, that is expected. If you put the algorithm into words (in the simplest case of only two items) it would be something like:
“On average, everyone else rates Octopus with two more points than they rate Squid. Therefore, your predicted rating for Octopus should be two points more than your actual rating for Squid.”
If your actual rating for Squid is 4, then your predicted rating for Octopus will be calculated as 6. If the acceptable range of ratings is 1 to 5, then you just have to truncate the prediction to make it fit.