Collaborative filtering is a technique using in machine learning to find similarities and trends between people and items due to choices they have made, items they have looked at and ratings they have given.
Think of Amazon.com – whenever you view an item you get that ‘similar items’ section. You also get emailed the latest offers tailored to your tastes. In fact the whole site morphs into your own personal shopping centre. There is obviously some clever machine learning going on behind the scenes, and that’s what I am hoping to outline.
There are three major sections to collaborative filtering :
Rating
We need some metrics to allow us to measure how similar two people are.
Scoring
How we use the metrics to ‘score’ one person against another.
Ranking
Once we have collected all the scores, how can we use this information?
Rating
Lets make up an example. Say we have a film rental system, that infers from old rentals the movies a customer is most likely to enjoy. The main metric in this case would be some kind of movie rating system, let’s say out of 5. So here below is an example of two members of our film site.

As you can see, these people have watched and rated different films, and this will be true of virtually every member.
Scoring
So the first problem we come up against is ‘how to compare two people?’.
As (we just mentioned) two people will have watched (and rated) different films, then the first thing we have to do is to find the intersection of films rated by them. So simply chop out non-common films :

Now taking the remaining intersection of films, we can grab a unique vector from each customer, describing their personal preferences.
Mr A Ms B
|2| |4|
|5| |2|
Now we have a comparable format, we can simply use one of the ‘out of the box’ scoring algorithms such as :
Euclidean Distance
Pearson Coeffient
These basically take the pair of vectors, work their magic and return a score. The higher the score the closer match the users are.
If you’ve had a look at those above algorithms, you may be a bit overwhelmed by the maths – don’t worry, underneath it all they’re pretty simple, I’ll take Euclid as an example.
We’ll throw a new person in the mix, ‘Dr X’, he’s pretty evil, so loved ‘Aliens’ (5/5), and totally hated ‘ The Notebook’ (1/5). Let us compare these three customers using Euclidean Distance :

From the graph you can see the distance between Ms B and Dr X is smaller than that between Ms B and Mr A – this means their taste is more similar (with respect to these two films). All we need to do now to get a decent score is invert the distance, so that we have more similar people having higher scores (not smaller ones), all we do for this is :
score = 1 / (1 + dist)
and voila – we have our scoring system to compare users.
Ranking
Now the hard work is out of the way all we have to do is :
1) Choose a target user, (to find their closest matches)
2) Score this user against every other using the method above.
3) Order them by their scores.
So, as we saw, to find out Ms B’s matches, you figure out her score w.r.t Mr A and Dr X and simply rank them. So Dr X comes top followed by Mr A.
We now have a ranked list of similar users, allowing our film application to recommend like people, and the movies they have rented – and it’s as simple as that!
You can perform exactly the same actions on the items themselves (by simply swapping places of the users and items). This secondary technique is called item-based filtering as opposed to what we just went through which was user-based filtering.
If you want to check out more of this kind of thing I recommend taking a look at the book http://www.amazon.co.uk/Programming-Collective-Intelligence-Building-Applications/dp/0596529325/ref=sr_1_1?ie=UTF8&s=books&qid=1268602889&sr=8-1