Introduction

SWAMI is a framework for running collaborative filtering algorithms and evaluating the effectiveness of those algorithms. It uses the EachMovie dataset, generously provided by Compaq.

SWAMI was a class project in the Large Datasets course, Fall 1999. The class paper we wrote up is also available.


The EachMovie Dataset

The EachMovie movie preference dataset was collected by DEC (now Compaq) Research. It contains 72,916 users, 1,628 movies, and 2,811,983 movie votes. More information, including the original EachMovie data and terms of usage, is available.


Our Thoughts on the EachMovie Dataset

Some of the users can be eliminated outright because they cast too few votes.

Interestingly, each of the movie votes has a weight and a score, where the weight denotes whether or not the user actually saw the movie, and where the score denotes the user's movie rating. (For our algorithms, we ignored the weight, setting it to 1 for all users.)

Just FYI, here are some interesting stats on the EachMovie dataset:

Here are some interesting graphs of the EachMovie dataset:

Just FYI, here are some interesting attempts at visualizing the EachMovie dataset:

Here are the MATLAB files used to generate these graphs.


Overview of SWAMI Architecture

SWAMI can be roughly partitioned into four parts: pre-indexed data, filters, predictors, and evaluation. The four parts are, for the most part, loosely coupled and interchangeable. A system block diagram is depicted below.

The pre-indexed data is generated from the EachMovie dataset. It organizes the votes so that they are fast to retrieve by user or by movie. After sorting the vote file, we build a B-tree-like index on top of it. More information on the indexing process is available.
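
To make the sort-then-index idea concrete, here is a minimal in-memory sketch. It is not SWAMI's actual indexing code: the class and method names (VoteIndexSketch, Vote, votesOf) are hypothetical, and a java.util.TreeMap stands in for the B-tree-like index over the on-disk file. It sorts the votes by user and records, for each user, the range of positions that holds that user's votes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch; names and layout are assumptions, not SWAMI's classes.
class VoteIndexSketch {
    // One vote: who voted, on which movie, and the score given.
    static final class Vote {
        final int userId, movieId;
        final double score;
        Vote(int userId, int movieId, double score) {
            this.userId = userId; this.movieId = movieId; this.score = score;
        }
    }

    private final List<Vote> sorted;                                   // votes sorted by userId
    private final TreeMap<Integer, int[]> byUser = new TreeMap<>();    // userId -> {start, endExclusive}

    VoteIndexSketch(List<Vote> votes) {
        sorted = new ArrayList<>(votes);
        sorted.sort(Comparator.comparingInt(v -> v.userId));           // stable sort
        int start = 0;
        for (int i = 1; i <= sorted.size(); i++) {
            // Close the current run when the user changes or the list ends.
            if (i == sorted.size() || sorted.get(i).userId != sorted.get(start).userId) {
                byUser.put(sorted.get(start).userId, new int[] {start, i});
                start = i;
            }
        }
    }

    // All votes cast by one user, found without scanning the whole list.
    List<Vote> votesOf(int userId) {
        int[] range = byUser.get(userId);
        if (range == null) return Collections.emptyList();
        return sorted.subList(range[0], range[1]);
    }
}
```

A second index keyed by movie could be built the same way from a copy of the votes sorted by movieId.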

In some cases, we use a subset of the data for performance reasons. The subset is either the original data with certain members aggregated together using Pearson clustering, or simply a random sample of the data.
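
The clustering procedure itself is not described here. As a rough illustration only, the sketch below computes the Pearson correlation between two users over the movies both have rated, which is presumably the similarity measure such a clustering builds on. The class name PearsonSketch and the map-based representation (movie id to score) are assumptions made for the example.

```java
import java.util.Map;

// Hypothetical sketch; shows only the correlation, not the clustering step.
class PearsonSketch {
    // Pearson correlation over co-rated movies; returns 0 when it is undefined
    // (fewer than two shared movies, or no variation in one user's scores).
    static double correlation(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sumA = 0, sumB = 0;
        int n = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) { sumA += e.getValue(); sumB += other; n++; }
        }
        if (n < 2) return 0.0;
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0, varA = 0, varB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) continue;       // skip movies only one user rated
            double da = e.getValue() - meanA, db = other - meanB;
            cov += da * db; varA += da * da; varB += db * db;
        }
        double denom = Math.sqrt(varA * varB);
        return denom == 0 ? 0.0 : cov / denom;
    }
}
```

Two users whose scores rise and fall together get a value near 1, and users with opposite tastes get a value near -1.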

The filters let you filter out votes based on user, movie, or list of votes. The filters come in handy when doing evaluation (explained in more detail below).
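
One way to picture a filter is as a predicate over votes: any vote the predicate matches is hidden, and the rest pass through. The sketch below takes that view. It is an assumption about the shape of the idea, not SWAMI's filter API, and the names VoteFilterSketch, byUser, byMovie, and filterOut are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; not SWAMI's actual filter classes.
class VoteFilterSketch {
    static final class Vote {
        final int userId, movieId;
        final double score;
        Vote(int userId, int movieId, double score) {
            this.userId = userId; this.movieId = movieId; this.score = score;
        }
    }

    // Matches every vote cast by the given user.
    static Predicate<Vote> byUser(int userId) { return v -> v.userId == userId; }

    // Matches every vote cast on the given movie.
    static Predicate<Vote> byMovie(int movieId) { return v -> v.movieId == movieId; }

    // Returns the votes that survive: everything the filter does not match.
    static List<Vote> filterOut(List<Vote> votes, Predicate<Vote> hidden) {
        List<Vote> kept = new ArrayList<>();
        for (Vote v : votes) {
            if (!hidden.test(v)) kept.add(v);
        }
        return kept;
    }
}
```

Predicates compose, so byUser(1).and(byMovie(10)) hides exactly one user's vote on one movie, which is the kind of filtering an evaluation run needs.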

The predictors predict what a user's vote will be on a movie. Currently, we have a few simple predictors and a few more complex ones. The predictors are the heart of a collaborative filtering system, and are what we were most interested in evaluating in SWAMI.
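
For a sense of what a simple predictor might look like, here is a sketch of a baseline that predicts the average score a movie has received so far. It is offered as an illustration only and is not necessarily one of the predictors shipped with SWAMI. The interface PredictorSketch, the class MovieAveragePredictor, and the fallback parameter are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface; SWAMI's real predictor interface may differ.
interface PredictorSketch {
    double predict(int userId, int movieId);
}

// Baseline: predict the mean of all scores seen for the movie.
class MovieAveragePredictor implements PredictorSketch {
    private final Map<Integer, double[]> totals = new HashMap<>(); // movieId -> {sum, count}
    private final double fallback;                                 // used for unseen movies

    MovieAveragePredictor(double fallback) { this.fallback = fallback; }

    // Feed one known vote into the running totals.
    void addVote(int userId, int movieId, double score) {
        double[] t = totals.computeIfAbsent(movieId, k -> new double[2]);
        t[0] += score;
        t[1] += 1;
    }

    @Override
    public double predict(int userId, int movieId) {
        double[] t = totals.get(movieId);
        return t == null ? fallback : t[0] / t[1];
    }
}
```

This baseline ignores who is asking, so a collaborative predictor would be expected to improve on it by weighting the votes of similar users more heavily.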

The evaluation is a suite of scripts and Java classes for seeing how well a predictor works. Predictors can be ranked on a range of attributes, such as elapsed time and prediction accuracy. We were most interested in prediction accuracy. Our approach was to take a user, filter out some of the votes that user actually cast, and then use a predictor to predict a score for each hidden vote. The predicted score was then compared to the actual score.
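
The hold-out comparison described above can be sketched as follows. The error measure used here, mean absolute error, is an assumption chosen for illustration, since the text does not say which measure was used. The names EvaluationSketch, Predictor, and meanAbsoluteError are likewise hypothetical.

```java
import java.util.List;

// Hypothetical sketch of scoring a predictor on held-out votes.
class EvaluationSketch {
    interface Predictor {
        double predict(int userId, int movieId);
    }

    static final class Vote {
        final int userId, movieId;
        final double score;
        Vote(int userId, int movieId, double score) {
            this.userId = userId; this.movieId = movieId; this.score = score;
        }
    }

    // Average of |predicted - actual| over votes the predictor never saw.
    static double meanAbsoluteError(Predictor predictor, List<Vote> heldOut) {
        if (heldOut.isEmpty()) {
            throw new IllegalArgumentException("no held-out votes to evaluate");
        }
        double total = 0;
        for (Vote v : heldOut) {
            total += Math.abs(predictor.predict(v.userId, v.movieId) - v.score);
        }
        return total / heldOut.size();
    }
}
```

A lower value means the predictions sit closer to the scores the user actually gave.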


Downloading SWAMI and Running it on Your System

System Requirements

Here are the files for download

Functionality and Non-Functionality


Current Status and Contacting Us

SWAMI was developed in 1999 as a class project. As you can probably guess, the current status of the software is "as is". As graduate students, we don't really have too much spare time to revise the software. However, if you want to submit contributions and make them available to the research community, we'd be glad to host them on our web pages.