SWAMI was a class project in the Large Datasets course, Fall 1999. The class paper we wrote up is also available.
Some of the users can be eliminated outright, because they cast too few votes.
Interestingly, each of the movie votes has a weight and a score, where the weight denotes whether or not the user actually saw the movie, and where the score denotes the user's movie rating. (For our algorithms, we ignored the weight, setting it to 1 for all users.)
Just FYI, here are some interesting stats on the EachMovie dataset:
Some "macro-statistics" about the dataset as a whole.
These stats were taken only over the users who actually
voted, and movies for which votes were cast.
number of users who actually voted: 61263
number of movies with at least 1 vote: 1623
maximum number of votes by a user: 1455
minimum number of votes by a user: 1
mean number of votes by a user: 45.9
median number of votes by a user: 26
maximum number of votes for a movie: 32864
minimum number of votes for a movie: 1
mean number of votes for a movie: 1732.4
median number of votes for a movie: 379
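For readers who want to reproduce this kind of summary, here is a minimal Python sketch of how such per-user and per-movie vote counts could be computed. It is not the code we used for the numbers above; the `(user_id, movie_id, score)` tuple layout and the name `vote_count_stats` are assumptions made for illustration, and the toy votes are invented.

```python
from collections import Counter
from statistics import mean, median


def vote_count_stats(votes):
    """Summarise how many votes each user cast and each movie received.

    `votes` is an iterable of (user_id, movie_id, score) tuples. This
    layout is an assumption for the sketch, not the EachMovie file format.
    Users and movies with no votes never appear, so the statistics are
    taken only over users who voted and movies that were voted on.
    """
    votes = list(votes)
    per_user = Counter(user for user, _movie, _score in votes)
    per_movie = Counter(movie for _user, movie, _score in votes)

    def summarise(counts):
        values = list(counts.values())
        return {
            "count": len(values),
            "max": max(values),
            "min": min(values),
            "mean": mean(values),
            "median": median(values),
        }

    return {"users": summarise(per_user), "movies": summarise(per_movie)}


# Invented toy data, only to show the call.
toy_votes = [
    ("u1", "m1", 4), ("u1", "m2", 3), ("u1", "m3", 5),
    ("u2", "m1", 2),
    ("u3", "m1", 5), ("u3", "m2", 4),
]
stats = vote_count_stats(toy_votes)
```

The mean sitting well above the median in the real numbers is what one would expect when a few very active users (or very popular movies) account for a large share of the votes.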
Roughly speaking, a user cast a mean vote of about 3.5 with a variation of about 1.25. This suggests that the Microsoft paper, which showed a number of algorithms that predicted votes to within 1 vote, didn't really do much better than what one can do by just predicting the user's mean vote on every movie.
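The "just predict the user's mean vote" baseline mentioned above is easy to sketch. The Python below is an illustration under assumptions, not SWAMI's code: votes are `(user_id, movie_id, score)` tuples, and the function names `user_means` and `mean_vote_mae` are hypothetical. It reports the mean absolute error of the baseline on a set of held-back votes.

```python
from collections import defaultdict


def user_means(train_votes):
    """Return each user's mean score over the training votes."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user, _movie, score in train_votes:
        totals[user] += score
        counts[user] += 1
    return {user: totals[user] / counts[user] for user in totals}


def mean_vote_mae(train_votes, test_votes):
    """Mean absolute error when every prediction is the user's mean vote.

    Test votes from users with no training votes are skipped, since the
    baseline has nothing to predict from for them.
    """
    means = user_means(train_votes)
    errors = [
        abs(score - means[user])
        for user, _movie, score in test_votes
        if user in means
    ]
    if not errors:
        raise ValueError("no test votes from users seen in training")
    return sum(errors) / len(errors)


# Invented toy data: u1 averages 3.0 and u2 averages 5.0 in training.
train = [("u1", "m1", 4), ("u1", "m2", 2), ("u2", "m1", 5), ("u2", "m2", 5)]
test = [("u1", "m3", 4), ("u2", "m3", 4)]
baseline_error = mean_vote_mae(train, test)
```

Any predictor worth its running time should beat this number on the same held-back votes.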
Here are some interesting attempts at visualizing the EachMovie dataset:
Here are the MATLAB files used to generate these graphs.
The Pre-indexed data is generated from the EachMovie dataset and organizes the votes so that they are fast to retrieve by user or by movie. We sort the file and then create a BTree-like index on top of it. More information on the indexing process is available.
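To give a feel for the sort-then-index idea, here is a small in-memory Python sketch. It is a stand-in, not our on-disk format: instead of a BTree it uses binary search (`bisect`) over a sorted key list, and the class name `SortedVoteIndex` and the `(user_id, movie_id, score)` tuple layout are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right


class SortedVoteIndex:
    """Votes sorted on one field, with binary-search lookup on that field.

    key_position 0 indexes by user, 1 indexes by movie (assumed layout:
    (user_id, movie_id, score)).
    """

    def __init__(self, votes, key_position):
        # sorted() is stable, so votes sharing a key keep their input order.
        self._votes = sorted(votes, key=lambda vote: vote[key_position])
        self._keys = [vote[key_position] for vote in self._votes]

    def lookup(self, key):
        """Return all votes whose indexed field equals `key`."""
        lo = bisect_left(self._keys, key)
        hi = bisect_right(self._keys, key)
        return self._votes[lo:hi]


# Invented toy data.
votes = [(2, 10, 4), (1, 10, 3), (2, 11, 5), (1, 12, 1)]
by_user = SortedVoteIndex(votes, key_position=0)
by_movie = SortedVoteIndex(votes, key_position=1)
```

Keeping one sorted copy per access path (by user and by movie) trades space for lookup speed, which is the same trade the pre-indexed data makes.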
In some cases, we use a subset of the data for performance reasons. The subset is either the original data with certain members aggregated together using Pearson clustering, or simply a random subset of the data.
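The clustering details are not spelled out here, but a Pearson-based grouping presumably rests on the Pearson correlation between two users' votes. The Python below sketches only that similarity measure, computed over the movies both users rated. The function name `pearson`, the dict-of-scores representation, and the choice to return 0.0 when the correlation is undefined are all assumptions for illustration.

```python
from math import sqrt


def pearson(votes_a, votes_b):
    """Pearson correlation of two users over the movies both have rated.

    `votes_a` and `votes_b` map movie_id -> score (assumed representation).
    Returns 0.0 when there are fewer than two common movies or when either
    user gave the same score to every common movie, since the correlation
    is undefined in those cases.
    """
    common = sorted(set(votes_a) & set(votes_b))
    if len(common) < 2:
        return 0.0
    a = [votes_a[movie] for movie in common]
    b = [votes_b[movie] for movie in common]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return covariance / sqrt(var_a * var_b)


# Invented toy users: bob rates like alice, carol rates the opposite way.
alice = {"m1": 1, "m2": 2, "m3": 3}
bob = {"m1": 2, "m2": 4, "m3": 6, "m4": 1}
carol = {"m1": 3, "m2": 2, "m3": 1}
```

Users whose correlation is close to 1 are natural candidates for being aggregated into one member of the subset.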
The filters let you filter out votes based on user, movie, or list of votes. The filters come in handy when doing evaluation (explained in more detail below).
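As a rough picture of what such filters do, here is a Python sketch that expresses each filter as a predicate over votes. It is not the SWAMI filter interface: the helper names (`exclude_user`, `exclude_movie`, `exclude_votes`, `apply_filters`) and the `(user_id, movie_id, score)` tuple layout are hypothetical.

```python
def exclude_user(user):
    """Filter that drops every vote cast by `user`."""
    return lambda vote: vote[0] != user


def exclude_movie(movie):
    """Filter that drops every vote cast on `movie`."""
    return lambda vote: vote[1] != movie


def exclude_votes(vote_list):
    """Filter that drops exactly the votes in `vote_list`."""
    blocked = set(vote_list)
    return lambda vote: vote not in blocked


def apply_filters(votes, *filters):
    """Keep only the votes that pass every filter."""
    return [vote for vote in votes if all(keep(vote) for keep in filters)]


# Invented toy data.
votes = [("u1", "m1", 4), ("u1", "m2", 2), ("u2", "m1", 5), ("u2", "m2", 3)]
without_u1 = apply_filters(votes, exclude_user("u1"))
without_m2 = apply_filters(votes, exclude_movie("m2"))
without_one = apply_filters(votes, exclude_votes([("u1", "m1", 4)]))
```

Filtering by an explicit list of votes is the case that matters most for evaluation, because it lets you hide specific known votes from a predictor.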
The predictors predict what a user's vote will be on a movie. Currently, we have a few simple predictors and a few more complex ones. The predictors are the heart of a collaborative filtering system, and are what we were most interested in evaluating in SWAMI.
The evaluation is a suite of scripts and Java classes for seeing how well a predictor works. There is a range of attributes for ranking predictors, such as elapsed time and accuracy of prediction. We were most interested in the accuracy of predictions. The approach we used was to take a user, filter out some of the votes that this user had actually cast, and then use a predictor to predict a score for each filtered-out vote. Each predicted score was then compared to the actual score.
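The hide-some-votes-then-predict procedure can be sketched in a few lines of Python. This is an illustration under assumptions, not the SWAMI evaluation suite (which is scripts and Java classes): votes are `(user_id, movie_id, score)` tuples, the function names are hypothetical, the error measure shown is mean absolute error, and the movie-mean predictor is only a simple stand-in for a real predictor.

```python
import random


def hold_out(votes, user, fraction, rng):
    """Hide a fraction of one user's votes; return (visible, hidden)."""
    own = [vote for vote in votes if vote[0] == user]
    how_many = min(len(own), max(1, int(len(own) * fraction)))
    hidden = set(rng.sample(own, how_many))
    visible = [vote for vote in votes if vote not in hidden]
    return visible, sorted(hidden)


def make_movie_mean_predictor(visible):
    """Stand-in predictor: a movie's mean visible score, else the global mean."""
    by_movie = {}
    for _user, movie, score in visible:
        by_movie.setdefault(movie, []).append(score)
    global_mean = sum(score for _u, _m, score in visible) / len(visible)

    def predict(_user, movie):
        scores = by_movie.get(movie)
        return sum(scores) / len(scores) if scores else global_mean

    return predict


def evaluate(votes, user, make_predictor, fraction=0.5, seed=0):
    """Mean absolute error of the predictor on the user's hidden votes."""
    visible, hidden = hold_out(votes, user, fraction, random.Random(seed))
    if not hidden:
        raise ValueError("user has no votes to hide")
    predict = make_predictor(visible)
    errors = [abs(predict(u, movie) - score) for u, movie, score in hidden]
    return sum(errors) / len(errors)


# Invented toy data; hiding all of u1's votes makes the result deterministic.
votes = [
    ("u1", "m1", 4), ("u1", "m2", 2),
    ("u2", "m1", 5), ("u2", "m2", 3),
    ("u3", "m1", 3),
]
error = evaluate(votes, "u1", make_movie_mean_predictor, fraction=1.0)
```

Because the predictor is passed in as a factory, the same harness can score any predictor on exactly the same hidden votes, which is what makes accuracy numbers comparable.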