HOMEWORK 4 (CS 1573): Learning from Examples

Assigned: March 18, 2004

Due: April 6, 2004

For this homework we'll be using WEKA, a machine learning package written in Java. It implements a wide variety of learners and is set up to automatically evaluate them using cross-validation.

You can install WEKA and the Java runtime environment from the WEKA site.

It's very easy to run. Here's an example (Linux command line), assuming the training data is in file.arff:

java weka.classifiers.rules.ZeroR -t file.arff

This will run ZeroR (zero rules) on the "file.arff" file, show the learned model, and evaluate it using cross-validation.

You can also download extra datasets from the UCI Machine Learning Repository in the WEKA arff file format (the datasets are described here).
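To help with describing labor.arff, note that an ARFF file is a plain-text format: a header declares the relation name and its attributes, and the rows after @data give one instance per line. Here is a minimal made-up example (not one of the WEKA datasets):

```
@relation weather-toy

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,65,yes
```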

Run the following three classifiers on the labor data included with WEKA (in data/labor.arff):

This is the decision tree classifier. It is based on C4.5.

java weka.classifiers.trees.j48.J48 -t data/labor.arff

This is a boosted version.

java weka.classifiers.meta.AdaBoostM1 -W weka.classifiers.trees.j48.J48 -t data/labor.arff

This is a strawman algorithm that always picks the majority class.

java weka.classifiers.rules.ZeroR -t data/labor.arff
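To make the baseline concrete, here is a short Python sketch (our own illustration, not part of WEKA) of what ZeroR does: it ignores the attributes entirely and always predicts the most frequent class seen in the training data.

```python
from collections import Counter

def zero_r_fit(labels):
    """Return the majority class of the training labels."""
    return Counter(labels).most_common(1)[0][0]

def zero_r_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the training majority class."""
    majority = zero_r_fit(train_labels)
    correct = sum(1 for y in test_labels if y == majority)
    return correct / len(test_labels)

# Toy example: "good" is the majority class in training.
train = ["good", "good", "bad", "good"]
test = ["good", "bad", "good"]
print(zero_r_fit(train))                          # -> good
print(round(zero_r_accuracy(train, test), 3))     # -> 0.667
```

Any learner that cannot beat this baseline on cross-validation has not learned anything useful from the attributes.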

  1. Describe labor.arff in English. (10 pts)
  2. Note the model, training set accuracy, and cross-validation accuracy in the output for each execution. (10 pts)
  3. Rank the classifiers in terms of accuracy in testing through cross-validation. (10 pts)
  4. Rank the classifiers in terms of learning time. (10 pts)
  5. Which classifier has the greatest discrepancy between training set and cross-validation test set accuracy (5 pts)? Why might that be (10 pts)?
  6. For only the decision tree classifier, explain the learned tree in English (10 pts). Try changing two options specific to weka.classifiers.trees.j48.J48 (one related to pruning, one related to instances) to get different results (5 pts), and discuss what happens and why (10 pts). Also, instead of using cross-validation, reserve 25% of the data as a test set, and use (up to) the remaining 75% of the data for training (10 pts). Construct a learning curve to show how performance on the test set varies as you increase the number of examples used for training (10 pts).
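The learning-curve part of question 6 can be sketched independently of WEKA: hold out a fixed test set, train on increasing prefixes of the remaining data, and record test accuracy at each training size. The Python illustration below uses a 1-nearest-neighbor classifier on synthetic 1-D data purely as a stand-in; in your actual experiments the classifier is J48 and the data comes from labor.arff.

```python
import random

def nn1_predict(train, x):
    """1-nearest-neighbor: predict the label of the closest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test instances the classifier labels correctly."""
    correct = sum(1 for x, y in test if nn1_predict(train, x) == y)
    return correct / len(test)

random.seed(0)
# Synthetic data: the true label is "pos" whenever the feature exceeds 0.5.
data = [(x, "pos" if x > 0.5 else "neg")
        for x in (random.random() for _ in range(80))]
random.shuffle(data)

test, pool = data[:20], data[20:]      # reserve 25% as the test set
for size in (5, 15, 30, 60):           # growing training-set sizes
    print(size, round(accuracy(pool[:size], test), 2))
```

Plotting test accuracy against training-set size gives the learning curve; expect it to rise (noisily) and then flatten as the learner saturates.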

Please submit hardcopies and electronic versions of your experiments, showing the input and output of WEKA, as well as the answers to the above questions. Bring the hardcopies to class, and submit electronically following the class submission policies.

IMPORTANT NOTE: Points will be deducted if the submission procedure is not followed.