Assigned: November 20, 2006
Due: December 4, 2006 (by the beginning of class, via both the dropbox (at least for the non-graphical parts) and hardcopy). Please write your Pitt email alias on your hardcopies.
For this homework you will be using WEKA, a machine learning toolkit written in Java. You will also be working with a dataset called OpenSecret, linked to a subset of Census data from April 2000. OpenSecret lists contributions to Presidential candidates from the 2004 race along with information about each contributor, including name, city, state, zip code, employer, and job title. From the 2000 Census data, you will have access to information about each zip code, including race, education level, income level, and employment.
Your job is to build a machine learning model from this data using WEKA such that, given a new person with all of the associated information specified above, your model will predict whether that person contributes money and, if so, to whom.
You can install WEKA and the Java Runtime Environment from the WEKA site. It is easy to run from the command line. Here's an example (Linux command line), assuming the training data is in file.arff:
java weka.classifiers.rules.ZeroR -t file.arff
This will run ZeroR ("zero rules", a baseline that ignores all attributes and always predicts the majority class) on the "file.arff" file, show the learned model, and evaluate it using cross-validation (ten-fold by default). A tutorial is also available.
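To make the baseline concrete, here is the logic that ZeroR implements, sketched in plain Python (an illustration only, not WEKA code; the labels are invented for the example):

```python
from collections import Counter

def zero_r(labels):
    """Return the majority class among the training labels.
    ZeroR ignores every attribute and always predicts this class."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical contribution labels, purely for illustration.
train_labels = ["Bush", "Kerry", "Kerry", "none", "Kerry"]
majority = zero_r(train_labels)  # predicts "Kerry" for every new person
```

Whatever accuracy ZeroR achieves is the floor that any real model should beat, which is why it makes a useful first run.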
The data can be found here.
After pre-processing the data to put it in the necessary format for WEKA, you will experiment with the impact of three parameters on the accuracy of your results:
You will use cross-validation to select the model that you think is most accurate. We will test your submissions on a new, held-out set of test data, and you will receive a ranking of your model against all other submissions.
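The idea behind k-fold cross-validation can be sketched in plain Python (an illustration only; WEKA performs this for you, and the train/predict functions below are placeholders):

```python
def k_fold_accuracy(examples, labels, train_fn, predict_fn, k=10):
    """Estimate accuracy with k-fold cross-validation: each fold serves
    once as the held-out test fold while the model trains on the rest."""
    correct = 0
    for fold in range(k):
        # Train on everything outside the current fold.
        train_X = [x for j, x in enumerate(examples) if j % k != fold]
        train_y = [y for j, y in enumerate(labels) if j % k != fold]
        model = train_fn(train_X, train_y)
        # Score on the held-out fold.
        for j, x in enumerate(examples):
            if j % k == fold and predict_fn(model, x) == labels[j]:
                correct += 1
    return correct / len(examples)

# Placeholder learner: a ZeroR-style majority-class model.
train_majority = lambda X, y: max(set(y), key=y.count)
predict_majority = lambda model, x: model
```

Because every example is tested exactly once, the resulting accuracy is a less optimistic estimate than accuracy on the training data itself.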
Notes on feature selection: You may experiment with feature selection entirely within WEKA or outside of it. WEKA implements an incremental search over features (attributes) to select those that best predict your classes, as described in the paper assigned for commentary. However, you may wish to have finer-grained control over the selection of features. For example, you might represent a single feature "personal information" whose value is the text string containing name, title, city, state, and zip code; or you might extract a separate feature for each one of these attributes. If you do the latter, you can experiment outside of WEKA by systematically measuring the impact of the different attributes on your results. There are many ways to do this.

There is a chance that one attribute determines everything (e.g., that zip code determines whether you give money and to whom). If this is the case, you can continue experimenting by removing the one dominating feature and then measuring how the other attributes contribute.
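One simple way to explore attribute impact outside of WEKA is an ablation study: rerun your evaluation with each attribute removed in turn and record the drop in accuracy. A minimal sketch, where `accuracy_fn` is a hypothetical wrapper of your own that trains and cross-validates a model on the given attribute subset:

```python
def ablation_study(attributes, accuracy_fn):
    """Return, for each attribute, the accuracy lost by removing it.
    A large drop suggests that attribute is a strong predictor."""
    baseline = accuracy_fn(attributes)
    return {a: baseline - accuracy_fn([b for b in attributes if b != a])
            for a in attributes}
```

If one attribute (say, zip code) turns out to dominate, drop it from the attribute list and rerun the study to see what the remaining attributes contribute.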
Note that, when performing feature selection, your data should be divided into at least three sets: a training set (on which you train your model), a validation set (on which you evaluate your selection of features), and a test set (on which you perform the final evaluation). In particular, this means that you should never use any of your test data to run the feature selection algorithm; the test data is used only after the choice of features has been finalized.
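A minimal sketch of such a three-way split in Python (the proportions and function name are my own for illustration, not part of the assignment):

```python
import random

def three_way_split(rows, train=0.6, valid=0.2, seed=0):
    """Shuffle the data and split it into training, validation, and
    test sets. Evaluate feature selection on the validation set only;
    touch the test set once, after the features are finalized."""
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # fixed seed for reproducibility
    n_train = int(len(rows) * train)
    n_valid = int(len(rows) * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])
```

Shuffling before splitting matters here because the raw data may be ordered (e.g., by state or zip code), which would otherwise bias the three sets.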
You are to write a several-page report that describes:
You should hand in: