Wednesday, July 16, 2008

Netflix Prize for the best collaborative filtering algorithm

Collaborative filtering, which is closely related to social tagging, is increasingly popular on the Internet these days, but it is not yet clear how much information, and what kind, we can extract from such systems. I think the Netflix Prize is a great challenge for answering those questions.

@. What to predict?

Simple. Netflix provides a training set containing over 100M movie ratings on a 1-to-5 scale, and from it we need to predict the unknown ratings in the given qualifying set. More specifically, each record in the training set is a quadruple <user, movie, date of grade, grade>, and each record in the qualifying set is <user, movie, date of grade, unknown grade>. We need to fill in the unknown grades by prediction and submit them for evaluation.
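
To make the task concrete, here is a minimal Python sketch of the two record types and of filling in unknown grades. The `Rating` class, the function names, and the toy data are all hypothetical, and the per-movie mean is only a naive baseline, not the method of any contestant.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Rating:
    # Hypothetical record mirroring <user, movie, date of grade, grade>.
    user: int
    movie: int
    day: date
    grade: Optional[float]  # None marks an unknown grade (qualifying set)


def fit_movie_means(training):
    """Return (per-movie mean grade, global mean grade) from known ratings."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in training:
        totals[r.movie] += r.grade
        counts[r.movie] += 1
    global_mean = sum(totals.values()) / sum(counts.values())
    movie_means = {m: totals[m] / counts[m] for m in totals}
    return movie_means, global_mean


def fill_unknown(qualifying, movie_means, global_mean):
    """Fill each unknown grade with the movie's mean, else the global mean."""
    return [
        Rating(r.user, r.movie, r.day, movie_means.get(r.movie, global_mean))
        for r in qualifying
    ]


# Toy data, invented for illustration only.
training = [
    Rating(1, 10, date(2005, 1, 1), 4),
    Rating(2, 10, date(2005, 1, 2), 2),
    Rating(1, 20, date(2005, 1, 3), 5),
]
qualifying = [
    Rating(2, 20, date(2005, 2, 1), None),
    Rating(3, 10, date(2005, 2, 2), None),
    Rating(3, 30, date(2005, 2, 3), None),  # movie never seen in training
]

movie_means, global_mean = fit_movie_means(training)
predictions = fill_unknown(qualifying, movie_means, global_mean)
for p in predictions:
    print(p.user, p.movie, round(p.grade, 3))
```

A movie that never appears in the training data falls back to the global mean, which is the simplest way to avoid leaving a grade unfilled.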

In addition to the training and qualifying sets, Netflix provides a probe set, which is a problem set with known answers. With it, we can roughly estimate accuracy without consulting the scoring oracle.
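
The contest scores submissions by root mean squared error (RMSE), so a local estimate on the probe set can be sketched as below. The function name and the toy values are assumptions for illustration; a real run would compare a model's predictions with the actual probe answers.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of grades."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


# Toy stand-ins for probe answers and a constant-guess predictor.
probe_answers = [4, 2, 5, 3]
constant_guess = [3.5] * len(probe_answers)

probe_rmse = rmse(constant_guess, probe_answers)
print(f"probe RMSE: {probe_rmse:.4f}")
```

Because the probe answers are known locally, this number can be computed as often as needed, whereas the scoring oracle only sees submitted predictions for the qualifying set.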

@. How to predict?

Hard. However, we can learn from the front-runners listed on the Leaderboard. The winner of the first annual Progress Prize is BellKor, and they have written about their algorithm.

@. Number-wise story

#. of ratings in the training set = 100,480,507
#. of users in the training set = 480,189
#. of movies in the training set = 17,770
#. of ratings to predict in the qualifying set = 2,817,131
#. of ratings in the probe set = 1,408,395
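
A quick back-of-the-envelope calculation on these counts shows how sparse the data is. This is a small sketch that uses only the numbers listed above; the variable names are my own.

```python
# Counts copied from the list above.
n_ratings = 100_480_507
n_users = 480_189
n_movies = 17_770

# Fraction of the full user-by-movie matrix that actually holds a rating.
density = n_ratings / (n_users * n_movies)

# Average number of ratings per user and per movie.
ratings_per_user = n_ratings / n_users
ratings_per_movie = n_ratings / n_movies

print(f"density: {density:.2%}")
print(f"ratings per user: {ratings_per_user:.1f}")
print(f"ratings per movie: {ratings_per_movie:.1f}")
```

Only about 1.2 percent of the possible user-movie pairs have a rating, so any algorithm has to cope with a matrix that is almost entirely empty.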

@. Research problems

As a student trying to apply machine learning algorithms to various applications, I find the following questions interesting to study:

  1. What kinds of machine learning algorithms can be applied to analyze the Netflix collaborative filtering data? Furthermore, can we find more general algorithms that can be used for other data on the net?
  2. How can such computations be sped up on parallel, multi-core platforms? Is it also possible to use other computing resources, such as cloud computing?
  3. Can we improve accuracy by adding other information that is easily accessible from the Internet? What kind of Internet infrastructure can help with this? Web 2.0, or something else?

@. Reading list

J. Herlocker, J. Konstan, L. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5--53, 2004.

R. Bell, Y. Koren, and C. Volinsky. Modeling relationships at multiple scales to improve accuracy of large recommender systems. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 95--104, 2007.

R. Bell and Y. Koren. Improved Neighborhood-based Collaborative Filtering. KDD-Cup and Workshop.

R. M. Bell and Y. Koren. Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights. Proceedings of the IEEE International Conference on Data Mining (ICDM'07), 2007.

Other potentially useful papers can be found on the Internet. I will keep this list up to date as I read through them.
