CS2710 / ISSP 2160: Homework 5

Machine Learning (Chapter 18)

Assigned: November 20, 2006

Due: December 4, 2006 (by the beginning of class), submitted both to the dropbox (at least the non-graphical parts) and as hardcopy. Please write your Pitt email alias on your hardcopies.

Programming Assignment (100 pts)

For this homework you will be using WEKA, a machine-learning toolkit written in Java. You will also be working with a dataset called OpenSecret, linked to a subset of Census data from April 2000. OpenSecret lists contributions to presidential candidates in the 2004 race along with information about each contributor: name, city, state, zip code, employer, and job title. From the 2000 Census data, you will have access to information about each zip code, including race, education level, income level, and employment.

Your job is to build a machine learning model from this data using WEKA such that, given a new person with all of the associated information specified above, your model will predict

  • Which presidential candidate the person will donate money to (Bush or Kerry)

    You can install WEKA and the Java runtime environment from the WEKA site. It is straightforward to run. Here is an example (Linux command line), assuming the training data is in file.arff and weka.jar is on your classpath:

    java weka.classifiers.rules.ZeroR -t file.arff

    This will run ZeroR (zero rules) on the "file.arff" file, show the learned model, and evaluate it using cross-validation. A tutorial is also available.
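
    For the actual assignment you will be running the decision-tree and naive Bayes learners rather than ZeroR; in WEKA 3.x these are the classes weka.classifiers.trees.J48 and weka.classifiers.bayes.NaiveBayes (the package names have moved between WEKA versions, so check your installation). A small Python sketch that builds and runs these command lines, which you could grow into one of the experiment scripts you hand in:

```python
# Sketch: build WEKA command lines for the learners used in this assignment.
# Class names are from WEKA 3.x; adjust them if your version differs.
import subprocess

CLASSIFIERS = {
    "zeror": "weka.classifiers.rules.ZeroR",
    "j48": "weka.classifiers.trees.J48",            # C4.5 decision tree
    "naivebayes": "weka.classifiers.bayes.NaiveBayes",
}

def weka_cmd(classifier, train_arff, extra_args=()):
    """Return the command list for evaluating `classifier` on `train_arff`
    with WEKA's default 10-fold cross-validation (the -t flag)."""
    return ["java", CLASSIFIERS[classifier], "-t", train_arff, *extra_args]

def run(classifier, train_arff):
    """Run WEKA (requires weka.jar on the CLASSPATH) and return its
    text report, which includes the learned model and the evaluation."""
    return subprocess.run(weka_cmd(classifier, train_arff),
                          capture_output=True, text=True).stdout
```

    For example, `run("j48", "file.arff")` produces the same kind of report as the ZeroR command above, but for the decision-tree learner.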

    The data can be found here.

    After pre-processing the data to put it in the necessary format for WEKA, you will experiment with the impact of three parameters on the accuracy of your results:

  • Learning algorithm: You will experiment with two different learning algorithms: decision trees and naive Bayes. You are asked to document which algorithm produces more accurate results on this data set. You should discuss in your writeup why you think the algorithms perform differently.
  • Attributes: You will use feature selection to pinpoint which attributes most significantly impact results. You may experiment with extracting different types of features from the data. For example, it should be relatively straightforward to use zip code as a feature, but you could also extract gender from the data. You might be able to improve accuracy by experimenting with attributes that are implicit in the data (i.e., not in the list of attributes given above). See the notes on feature selection below.
  • Training data size: You will experiment with different sizes of the training data set to determine the smallest amount of training data you need to get good results.
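
    For the training-data-size experiment, one simple approach is to subsample the training portion of the ARFF file at several fractions and run each resulting file through WEKA. A hypothetical sketch (the ARFF handling here is deliberately minimal: it assumes @data is on its own line and that there are no comment lines inside the data section):

```python
import random

def split_arff(text):
    """Split ARFF text into (header_lines, data_lines).
    Assumes the @data marker sits on its own line (case-insensitive)."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() == "@data":
            return lines[: i + 1], [l for l in lines[i + 1:] if l.strip()]
    raise ValueError("no @data section found")

def subsample_arff(text, fraction, seed=0):
    """Return a new ARFF string containing a random `fraction` of the
    instances -- one point on the training-set-size curve."""
    header, data = split_arff(text)
    rng = random.Random(seed)
    k = max(1, int(len(data) * fraction))
    return "\n".join(header + rng.sample(data, k)) + "\n"
```

    Writing out subsampled files for, say, fractions 0.1 through 1.0 and plotting WEKA's reported accuracy against training-set size gives you the chart asked for in the report.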

    You will use cross validation to select the model that you think is most accurate. We will test your submissions on a new set of test data and you will receive a ranking of your model against all other assignments.

    Notes on feature selection: You may experiment with feature selection entirely within WEKA or outside of WEKA. WEKA implements an incremental search over features (attributes) to select the ones that are the best predictors of your classes, as described in the paper that is assigned for commentary. However, you may wish to have finer-grained control over the selection of features. For example, you might represent the feature "personal information" as a single text string containing name, title, city, state, and zip code. Or, you might extract a separate feature for each one of these attributes. If you do the latter, you can experiment outside of WEKA by systematically measuring the impact of each attribute on results. There are many choices for how to do this. It is possible that one attribute determines everything (e.g., that zip code determines whether you give money and to whom). In that case, you can continue experimenting by removing the one dominating feature and then measuring how the other attributes contribute.
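
    One way to do that systematic exploration outside of WEKA is an ablation script: drop one attribute at a time from the ARFF file and re-run the learner on each reduced file. A naive sketch (it assumes comma-separated data with no quoted commas and no sparse-format instances; a real script would need to handle those):

```python
def drop_attribute(arff_text, index):
    """Return ARFF text with attribute `index` (0-based) removed from
    both the @attribute headers and every data row.
    Naive: assumes plain comma-separated rows with no quoted commas."""
    out, attr_seen, in_data = [], 0, False
    for line in arff_text.splitlines():
        s = line.strip().lower()
        if not in_data and s.startswith("@attribute"):
            if attr_seen == index:
                attr_seen += 1
                continue              # drop this attribute's header line
            attr_seen += 1
        elif s == "@data":
            in_data = True
        elif in_data and line.strip():
            vals = line.split(",")
            del vals[index]           # drop the matching column
            line = ",".join(vals)
        out.append(line)
    return "\n".join(out) + "\n"
```

    Comparing cross-validation accuracy with and without each attribute tells you which ones the learner actually relies on, including whether one attribute dominates.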

    Note that, when performing feature selection, your data should be divided into at least three sets: a training set (on which you train your model), a validation set (on which you evaluate your selection of features), and a test set (on which you perform the final evaluation). In particular, this means that you should never use any of your test data when running the feature selection algorithm (the test data is used only after the choice of features has been finalized).
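
    A minimal sketch of that three-way split. The 60/20/20 proportions are just one reasonable choice, not a requirement of the assignment:

```python
import random

def three_way_split(instances, seed=0, train=0.6, valid=0.2):
    """Shuffle and split instances into (train, validation, test) lists.
    Tune feature selection on the validation set only; touch the test
    set once, after the feature choice is finalized."""
    rows = list(instances)
    random.Random(seed).shuffle(rows)   # fixed seed => reproducible split
    n = len(rows)
    n_train, n_valid = int(n * train), int(n * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])
```

    For example, `three_way_split(data_rows)` on 100 instances yields 60 training, 20 validation, and 20 test instances.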

    You are to write a several-page report that includes:

  • A description of the model you submitted, with charts that document why the parameters you chose did best. This part of the report should show the differences in learning between the two methods you used, both in terms of accuracy and in terms of what is learned. For your decision tree results, you should identify generalizations that were made and discuss whether these learned generalizations are meaningful (e.g., do they correspond to your intuition about why these attributes play a role?). (25 points)
  • Quantification of the effect of different attributes on the learning process. Which attributes were most important? Did accuracy degrade as you reduced the number of attributes? Use charts and a description of the charts to answer this question. Discuss your results and explain why they do or do not make sense. (30 points)
  • Quantification of the effect of amount of data on the learning process. Again, use charts and description of the charts to show how accuracy is affected by data set size. (15 points)

    You should hand in:

  • Any scripts you wrote to assist you in running the experiments (15 points)
  • A readme file describing your approach and the scripts (1 page) (10 points)
  • The report (70 points as divided above)
  • In addition to the above, 5 points of your grade will reflect your model's performance on both the training data (which you have access to now) and the test data (which you will not see).