Time: Monday, 4:40pm - 7:15pm, TL 306
Instructor: Milos Hauskrecht
Wachman Hall 313, x1-5775
e-mail: milos@joda.cis.temple.edu
office hours: by appointment
TA: Dragoljub Pokrajac "Pokie"
Wachman Hall 303-A, x1-5908
e-mail: pokie@snowhite.cis.temple.edu
office hours: Thursday 1:30pm -2:30pm
The goal of the field of machine learning is to build computer systems that learn from experience and are capable of adapting to their environments. Learning techniques and methods developed by researchers in this field have been successfully applied to a variety of learning tasks in a broad range of areas, including text classification, gene discovery, financial forecasting, credit card fraud detection, collaborative filtering, the design of adaptive web agents, and others.
This introductory machine learning course will give an overview of many techniques and algorithms in machine learning, beginning with topics such as simple concept learning and ending with more recent topics such as boosting, support vector machines, and reinforcement learning. The objective of the course is not only to present modern machine learning methods but also to give the basic intuitions behind them, as well as a more formal understanding of how and why they work.
Topics to be covered
Textbooks (available at the bookstore)
R.O. Duda, P.E. Hart, D.G. Stork. Pattern Classification. Second edition. John Wiley and Sons, 2000.
T. Mitchell. Machine Learning. Mc Graw Hill, 1997.
Additional readings
see lecture descriptions
Lecture 1. (January 22) Administrivia and Introduction.
Objectives of machine learning.
Examples.
Types of machine learning problems:
- supervised (regression, classification)
- unsupervised (clustering, density estimation)
- reinforcement learning.
Learning modes:
- batch
- on-line
Design cycle: data, feature selection, model selection, learning, evaluation.
Training and generalization error.
Overfitting.
Bias and variance.
Readings: DHS - chapter 1, Mitchell - chapter 1.
***************************
Lecture 2. (January 29). Concept learning.
Learning concepts.
Instance space.
Hypothesis space.
Mitchell's version space algorithm.
PAC framework.
Sample complexity bound.
Efficient PAC learnability.
Measuring inductive biases.
Vapnik-Chervonenkis (VC) dimension.
Improved sample complexity bound.
Adding noise to examples.
Readings: Mitchell - chapters 2, 7
Additional readings (distributed during class):
David Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, vol. 36, 1988.
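As a quick illustration of the sample complexity bound above (a Python sketch of my own, not part of the course materials), the bound m >= (1/epsilon)(ln|H| + ln(1/delta)) for a finite, consistent hypothesis space can be computed directly:

```python
import math

def pac_sample_bound(hyp_space_size, epsilon, delta):
    """Classic PAC bound for a finite, consistent hypothesis space:
    m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice so that,
    with probability >= 1 - delta, any hypothesis consistent with the
    sample has true error at most epsilon."""
    return math.ceil((1.0 / epsilon) *
                     (math.log(hyp_space_size) + math.log(1.0 / delta)))

# Conjunctions over n boolean variables: |H| = 3^n (each variable
# appears positive, negated, or not at all).
n = 10
m = pac_sample_bound(3 ** n, epsilon=0.1, delta=0.05)
```

For conjunctions over 10 boolean variables, the bound is only 140 examples, even though the instance space has 2^10 points.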
*****************************
February 5 lecture: cancelled due to a snow storm.
*****************************
Lecture 3. (February 12). Supervised learning.
Regression.
Classification.
Supervised learning.
Regression problems.
Linear regression. Squared error loss function.
Parameter fitting.
Generalized additive models.
Statistical model with Gaussian noise.
On-line gradient-based techniques.
Classification problems.
Binary classification.
Generative model of classification.
Class-conditional distributions.
Decision boundaries for two Gaussians. Parameter estimation.
Logistic regression.
Exponential family.
Parameter estimation.
On-line learning.
Readings: DHS - chapters 2 (excluding 2.9 and later), 3.2, 5
Additional readings:
Michael Jordan. Why the logistic function? A tutorial discussion on probabilities and neural networks. TR 9503, Computational Cognitive Science, MIT, 1995.
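To make the squared-error loss and parameter fitting concrete, here is a minimal one-dimensional least-squares fit in Python (a sketch of my own; the course itself recommends Matlab):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b, minimizing the squared-error
    loss sum_i (y_i - w*x_i - b)^2 via the closed-form normal equations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = my - w * mx
    return w, b

# Noise-free data generated from y = 2x + 1 is recovered exactly.
w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```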
*******************************
Lecture 4 (February 19). Multi-layer neural networks.
Regression and classification review.
Linear units.
On-line methods.
Parameter estimation for logistic regression.
Multi-layer neural networks.
Extending linear units with feature (basis) functions to
model non-linearities.
Cascading linear units.
Computing weight derivatives. Backpropagation.
Examples.
Readings: DHS - chapter 6, Mitchell - chapter 4.
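The chain-rule computation of weight derivatives can be checked numerically. Below is a two-unit (one hidden sigmoid, one output sigmoid) backpropagation sketch in Python with a finite-difference sanity check; the tiny network and all names are mine, not from the lecture notes:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """Two cascaded units: h = sigmoid(w1*x), y = sigmoid(w2*h)."""
    h = sigmoid(w1 * x)
    return h, sigmoid(w2 * h)

def loss_and_grads(x, t, w1, w2):
    """Squared-error loss E = 0.5*(y-t)^2, with gradients obtained by
    back-propagating the error through both units (chain rule)."""
    h, y = forward(x, w1, w2)
    e = y - t
    # dE/dw2 = e * y*(1-y) * h   (sigmoid derivative is y*(1-y))
    g2 = e * y * (1 - y) * h
    # dE/dw1: back-propagate through w2 and the hidden sigmoid
    g1 = e * y * (1 - y) * w2 * h * (1 - h) * x
    return 0.5 * e * e, g1, g2

# Sanity check: analytic gradients match central finite differences.
x, t, w1, w2 = 0.5, 1.0, 0.3, -0.7
_, g1, g2 = loss_and_grads(x, t, w1, w2)
eps = 1e-6
num_g1 = (loss_and_grads(x, t, w1 + eps, w2)[0]
          - loss_and_grads(x, t, w1 - eps, w2)[0]) / (2 * eps)
num_g2 = (loss_and_grads(x, t, w1, w2 + eps)[0]
          - loss_and_grads(x, t, w1, w2 - eps)[0]) / (2 * eps)
```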
******************************
Lecture 5 (February 26). Multiway classification.
Unsupervised learning.
Introduction to Bayesian belief networks.
Multiway classification.
Naive approaches.
Softmax model.
Softmax and posteriors for the exponential family.
Decision boundaries.
Parameter estimation.
Unsupervised learning.
Bayesian belief networks.
Representing large joint distributions with conditional independences.
Components of the Bayesian belief network (BBN): DAG + parameters.
Example.
Probabilistic inference in the BBN.
Readings: Mitchell - chapter 6 (6.11), DHS - chapter 2 (2.11).
Additional reading (distributed during the class):
E. Charniak. Bayesian networks without tears. AI Magazine, pp. 50-63, 1991.
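A minimal sketch of the softmax model (Python, my own naming): K real-valued scores are mapped to class posteriors that are positive and sum to one.

```python
import math

def softmax(scores):
    """Softmax model for multiway classification: converts K real-valued
    scores into a posterior distribution over K classes. Subtracting the
    max score first keeps exp() numerically stable without changing
    the result."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

p = softmax([2.0, 1.0, 0.1])
```

Note that equal scores yield equal posteriors, and the largest score always gets the largest posterior.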
************************************
Lecture 6 (March 12). Learning Bayesian belief networks.
Bayesian belief networks with discrete values.
Conditional independences.
Learning and inference advantage.
Parameter learning
for complete data.
Maximum likelihood (ML) and maximum a posteriori (MAP) estimates.
Representing priors.
Beta and Dirichlet distributions.
Full Bayesian approach.
Structure learning for complete data.
ML and MAP criteria.
Score decomposability.
Occam's Razor.
Approximations of the Bayesian criteria (Akaike, BIC, MDL).
Readings:
D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, 1995.
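The ML vs. MAP distinction for a single Bernoulli parameter with a Beta prior fits in two lines. In the sketch below (my own, using the standard posterior-mode formula), a Beta(alpha, beta) prior acts like alpha-1 extra heads and beta-1 extra tails:

```python
def bernoulli_ml(heads, n):
    """Maximum-likelihood estimate of a coin's bias: fraction of heads."""
    return heads / n

def bernoulli_map(heads, n, alpha, beta):
    """MAP estimate under a Beta(alpha, beta) prior (posterior mode):
    the prior contributes alpha-1 pseudo-heads and beta-1 pseudo-tails."""
    return (heads + alpha - 1) / (n + alpha + beta - 2)

ml = bernoulli_ml(3, 4)            # 0.75
map_ = bernoulli_map(3, 4, 2, 2)   # (3+1)/(4+2) = 2/3, pulled toward 1/2
```

With the uniform Beta(1, 1) prior the MAP estimate reduces to the ML estimate, as it should.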
**************************************
Lecture 7 (March 19). Clustering. Mixture of Gaussians.
Expectation-maximization.
Clustering.
K-means algorithm.
Mixture of Gaussians model.
Soft K-means.
Expectation-maximization (EM).
General form.
Proof that each EM iteration increases the log-likelihood.
Application of EM to learning BBNs with hidden variables
and missing values.
Learning Naive Bayes model.
Readings: DHS - chapter 10, section 3.9
Optional reading:
A.P. Dempster, N.M. Laird, D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-28, 1977.
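A bare-bones K-means on scalar data (a Python sketch of my own) shows the assign-then-re-estimate loop that soft K-means and EM generalize:

```python
def kmeans_1d(points, centers, iters=20):
    """Plain K-means on scalar data: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda k: (p - centers[k]) ** 2)
            clusters[j].append(p)
        # Empty clusters keep their old center.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated groups; converges near centers 1.0 and 5.0.
centers = kmeans_1d([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], [0.0, 6.0])
```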
****************************************
Lecture 8 (March 26). Hidden Markov models.
Feature selection. Dimensionality reduction.
Hidden Markov models.
Example.
Inference: the forward-backward algorithm.
Learning: the Baum-Welch algorithm.
Data preprocessing.
Input normalization.
Feature selection. Dimensionality reduction.
Feature/input selection:
Feature subset selection.
Reduction through the combination of inputs/features.
Feature subset selection:
Mutual information.
Selection independent of / dependent on the original learning task.
Cross-validation to estimate generalization error.
Dimensionality reduction.
PCA. Derivation. Example.
Non-linear dimensionality reduction through auto-associative neural networks.
Reduction through clustering.
Additional readings:
L.R. Rabiner, B.H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, vol. 3, no. 1, pp. 4-16, 1986.
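The forward algorithm can be written in a few lines. The Python sketch below (my own) computes the likelihood of an observation sequence by propagating alpha values, avoiding the exponential sum over all state paths:

```python
def hmm_forward(obs, pi, A, B):
    """Forward algorithm: P(observation sequence) for an HMM with
    initial distribution pi, transition matrix A, and emission matrix B.
    Runs in O(T * N^2) instead of summing over all N^T state paths."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                 for j in range(n)]
    return sum(alpha)

# A deterministic 2-state HMM that must alternate symbols 0, 1, 0, ...
pi = [1.0, 0.0]
A = [[0.0, 1.0], [1.0, 0.0]]   # states alternate
B = [[1.0, 0.0], [0.0, 1.0]]   # state i emits symbol i
p = hmm_forward([0, 1, 0], pi, A, B)
```

Here the alternating sequence gets probability 1 and any non-alternating sequence gets probability 0.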
******************************************************
Lecture 9 (April 2). Mixture of experts.
Decision trees.
Hierarchical mixture of experts.
Dimensionality reduction review.
Dimensionality reduction through clustering.
Combining multiple learners.
Mixture of experts.
On-line learning of parameters.
Decision trees.
Impurity measures.
Growing the tree.
Drawback of the greedy selection. Parity functions.
Overfitting and pruning of the tree.
Hierarchical mixture of experts.
Relation to decision trees.
Gating functions and responsibilities for tree parameters.
On-line learning.
Readings:
Mixture of experts: DHS - chapter 9.7.
Decision trees:
Mitchell - chapter 3, DHS - chapter 8
Hierarchical mixtures of experts:
Michael Jordan, Robert Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, vol. 6, pp. 181-214, 1994.
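Impurity measures and greedy splitting can be illustrated directly. The sketch below (Python, my own) computes entropy impurity and the information gain of a candidate split:

```python
import math

def entropy(labels):
    """Entropy impurity of a set of class labels, in bits."""
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def info_gain(labels, left, right):
    """Information gain of a candidate split: impurity of the parent
    minus the size-weighted impurity of the two children."""
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

# A pure split of a balanced binary node gains the full 1 bit.
gain = info_gain([0, 0, 1, 1], [0, 0], [1, 1])
```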
**********************************************************
Lecture 10 (April 9). Ensemble methods. Bagging and boosting.
Readings:
L. Breiman. Arcing classifiers. The Annals of Statistics, vol. 26, no. 3, pp. 801-849, 1998.
Y. Freund, R. Schapire. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, 1996.
R.E. Schapire, Y. Freund, P. Bartlett, W.S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, vol. 26, pp. 1651-1686, 1998.
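One round of AdaBoost-style weight updating (a Python sketch of my own, following the standard exponential reweighting) shows why misclassified examples gain weight: after renormalization they carry exactly half the total mass:

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost round: given the current example weights and, for
    each example, whether the weak learner classified it correctly,
    return the learner's vote alpha and the renormalized weights."""
    err = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * math.log((1 - err) / err)   # weak learner's vote
    # Shrink weights of correct examples, grow weights of mistakes.
    new = [w * math.exp(-alpha if c else alpha)
           for w, c in zip(weights, correct)]
    z = sum(new)
    return alpha, [w / z for w in new]

# Four equally weighted examples, one mistake: err = 0.25.
alpha, w = adaboost_round([0.25] * 4, [True, True, True, False])
```

After the update the single misclassified example holds weight 0.5, so the next weak learner is forced to pay attention to it.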
**********************************************************
Lecture 11 (April 16). Support vector machines.
Non-parametric density estimation.
Support vector machines.
Classification for the linearly separable case. Separating hyperplane.
Algorithms (perceptron, linear programming).
Maximum margin hyperplane. Support vectors.
Finding the maximum margin hyperplane.
Linearly non-separable case.
Extension to the non-linear case.
Kernel functions.
Non-parametric density estimation.
Histograms.
Parzen windows.
Parzen windows with Gaussian kernels.
K-nearest neighbor.
Readings:
Support vector machines:
C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, vol. 2, pp. 121-167, 1998.
DHS - sections 5.10, 5.11
Non-parametric density estimation:
DHS - chapter 4
Mitchell - section 8.2.
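Parzen windows with a Gaussian kernel amount to averaging one unit-area bump per sample. A one-dimensional Python sketch (my own; data and bandwidth are illustrative):

```python
import math

def parzen_density(x, samples, h):
    """Parzen-window density estimate with a Gaussian kernel of
    bandwidth h: the average of one unit-area Gaussian bump centered
    on each sample."""
    n = len(samples)
    k = 1.0 / (h * math.sqrt(2 * math.pi))
    return sum(k * math.exp(-0.5 * ((x - s) / h) ** 2)
               for s in samples) / n

data = [0.0, 0.1, -0.1, 5.0]
```

The estimate is high where samples cluster and low in the gap between the two groups, with the bandwidth h controlling the smoothness.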
**********************************************************
Lecture 12 (April 23). Reinforcement learning.
Reinforcement learning.
Basic RL scheme.
Objective functions.
Exploration/Exploitation dilemma.
Learning with immediate rewards.
Learning with delayed rewards.
Markov decision process (MDP).
Finding the optimal policy for an MDP.
Learning optimal policies.
Model-based learning.
Model-free learning. Q-learning.
RL speed-ups.
Readings:
L.P. Kaelbling, M.L. Littman, A.W. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 1996.
Mitchell - chapter 13.
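A model-free Q-learning sketch on a toy chain MDP (Python, my own; the states, rewards, and learning rate are illustrative assumptions) shows the tabular update rule in action:

```python
import random

def q_learning(n_states, episodes=500, gamma=0.9, lr=0.5, seed=0):
    """Tabular Q-learning on a deterministic chain: states 0..n-1,
    actions 0 (left) and 1 (right); reaching the last state pays
    reward 1 and ends the episode. Model-free: only observed
    (s, a, r, s') transitions are used, never the transition model."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            a = rng.randrange(2)                   # explore uniformly
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            target = r + gamma * max(q[s2])        # terminal q stays 0
            q[s][a] += lr * (target - q[s][a])     # Q-learning update
            s = s2
    return q

q = q_learning(4)
```

Despite the purely random exploration, the greedy policy read off the learned Q-table moves right in every state, and the values converge to the discounted optimal returns (1, 0.9, 0.81 for moving right from states 2, 1, 0).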
The course grade will be determined roughly as follows:
There will be 3-4 homework assignments during the semester. These will have the character of small projects and will require you to implement some of the learning algorithms covered during the semester and apply them. No collaboration on homeworks is allowed. To implement homework assignments you may use a programming language and graphics tools of your choice. However, it is strongly recommended that you use the Matlab package, which is now available in the new CIS lab in Wachman Hall 200.
****************************************************
Homework assignment 1. Out: 2/12/2001. Due: 2/26/2001.
Data sets to be used with the assignment.
Problem 1. Boston housing data.
Problem 2. Binary classification.
********************************************************
Homework assignment 2. Out: 3/19/2001. Due: 4/2/2001.
Datasets to be used with the assignment.
Problem 1. Multiway classification.
Problem 2. Customer profiling and brand preference predictions.
The term project is due at the end of the semester and accounts for about 60% of your grade. You are free to choose your own problem topic; however, the project must have a distinctive, non-trivial learning or adaptive component. In general, a project may consist of a replication of previously published results, the design of new learning methods and their testing, or an application of machine learning to a domain or problem of your interest.
Matlab is a mathematical tool for numerical computation and manipulation, with excellent graphing capabilities. It provides a great deal of support for the things you will need to run machine learning experiments.
To use Matlab you need to log in to one of the machines in Wachman Hall, Room 200. If you have an active NT account with the CIS department, just log in to your existing account and click on the Matlab icon on the desktop to run the program. Note that all machines in Room 200 are dual-boot; if you prefer to work under Linux, you can run Matlab there as well by typing matlab in an xterm window. Please see the course staff if you do not have an active account with the CIS department.
Pokie gave a two-hour tutorial on February 6, 2001. The file with short demos from the tutorial is now available on-line.
Click here to download the tutorial demo file.
dopen.m - the file used in the demo for downloading the data.
Other Matlab resources on the web:
Online MATLAB documentation
Online Mathworks documentation, including MATLAB toolboxes