Time: Monday, 4:40pm - 7:15pm, TL 306
Instructor: Milos Hauskrecht
Wachman Hall 313, x1-5775
e-mail: milos@joda.cis.temple.edu
office hours: by appointment
TA: Dragoljub Pokrajac "Pokie"
Wachman Hall 303-A, x1-5908
e-mail: pokie@snowhite.cis.temple.edu
office hours: Thursday 1:30pm -2:30pm
The goal of the field of machine learning is to build computer systems that learn from experience and are capable of adapting to their environments. Learning techniques and methods developed by researchers in this field have been successfully applied to a variety of learning tasks in a broad range of areas, including text classification, gene discovery, financial forecasting, credit card fraud detection, collaborative filtering, the design of adaptive web agents, and others.
This introductory machine learning course will give an overview of many techniques and algorithms in machine learning, beginning with topics such as simple concept learning and ending with more recent topics such as boosting, support vector machines, and reinforcement learning. The objective of the course is not only to present modern machine learning methods but also to give the basic intuitions behind them, as well as a more formal understanding of how and why they work.
Topics to be covered
Textbooks (available at the bookstore)
R.O. Duda, P.E. Hart, D.G. Stork. Pattern Classification. Second edition. John Wiley and Sons, 2000.
T. Mitchell. Machine Learning. Mc Graw Hill, 1997.
Additional readings
see lecture descriptions
Lecture 1. (January 22) Administrivia and Introduction.
Objectives of machine learning.
Examples.
Types of machine learning problems:
- supervised (regression, classification)
- unsupervised (clustering, density estimation)
- reinforcement learning.
Learning modes:
- batch
- on-line
Design cycle: data, feature selection, model selection, learning, evaluation.
Training and generalization error.
Overfitting.
Bias and variance.
Readings: DHS - chapter 1, Mitchell - chapter 1.
***************************
Lecture 2. (January 29). Concept learning.
Learning concepts.
Instance space.
Hypothesis space.
Mitchell's version space algorithm.
PAC framework.
Sample complexity bound.
Efficient PAC learnability.
Measuring inductive biases.
Vapnik-Chervonenkis (VC) dimension.
Improved sample complexity bound.
Adding noise to examples.
Readings: Mitchell - chapters 2, 7
Additional readings (distributed during class):
David Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, vol. 36, 1988.
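As a quick illustration of the sample complexity bound above (a Python sketch of my own, not part of the course materials), the bound m >= (1/epsilon)(ln|H| + ln(1/delta)) for a finite, consistent hypothesis space can be computed directly:

```python
import math

def pac_sample_bound(hyp_space_size, epsilon, delta):
    """Classic PAC bound for a finite, consistent hypothesis space:
    m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice so that,
    with probability >= 1 - delta, any hypothesis consistent with the
    sample has true error at most epsilon."""
    return math.ceil((1.0 / epsilon) *
                     (math.log(hyp_space_size) + math.log(1.0 / delta)))

# Conjunctions over n boolean variables: |H| = 3^n (each variable
# appears positive, negated, or not at all).
n = 10
m = pac_sample_bound(3 ** n, epsilon=0.1, delta=0.05)
```

For conjunctions over 10 boolean variables, the bound is only 140 examples, even though the instance space has 2^10 points.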
*****************************
February 5 lecture: cancelled due to a snow storm.
*****************************
Lecture 3. (February 12). Supervised learning.
Regression.
Classification.
Supervised learning.
Regression problems.
Linear regression. Squared error loss function.
Parameter fitting.
Generalized additive models.
Statistical model with Gaussian noise.
On-line gradient-based techniques.
Classification problems.
Binary classification.
Generative model of classification.
Class-conditional distributions.
Decision boundaries for two Gaussians. Parameter estimation.
Logistic regression.
Exponential family.
Parameter estimation.
On-line learning.
Readings: DHS - chapters 2 (excluding 2.9 and later), 3.2, 5
Additional readings:
Michael Jordan. Why the logistic function? A tutorial discussion on probabilities and neural networks. TR 9503, Computational Cognitive Science, MIT, 1995.
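To make the squared-error loss and parameter fitting concrete, here is a minimal one-dimensional least-squares fit in Python (a sketch of my own; the course itself recommends Matlab):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b, minimizing the squared-error
    loss sum_i (y_i - w*x_i - b)^2 via the closed-form normal equations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = my - w * mx
    return w, b

# Noise-free data generated from y = 2x + 1 is recovered exactly.
w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```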
*******************************
Lecture 4 (February 19). Multi-layer neural networks.
Regression and classification review.
Linear units.
On-line methods.
Parameter estimation for logistic regression.
Multi-layer neural networks.
Extending linear units with feature (basis) functions to
model non-linearities.
Cascading linear units.
Computing weight derivatives. Backpropagation.
Examples.
Readings: DHS - chapter 6, Mitchell - chapter 4.
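The chain-rule computation of weight derivatives can be checked numerically. Below is a two-unit (one hidden sigmoid, one output sigmoid) backpropagation sketch in Python with a finite-difference sanity check; the tiny network and all names are mine, not from the lecture notes:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """Two cascaded units: h = sigmoid(w1*x), y = sigmoid(w2*h)."""
    h = sigmoid(w1 * x)
    return h, sigmoid(w2 * h)

def loss_and_grads(x, t, w1, w2):
    """Squared-error loss E = 0.5*(y-t)^2, with gradients obtained by
    back-propagating the error through both units (chain rule)."""
    h, y = forward(x, w1, w2)
    e = y - t
    # dE/dw2 = e * y*(1-y) * h   (sigmoid derivative is y*(1-y))
    g2 = e * y * (1 - y) * h
    # dE/dw1: back-propagate through w2 and the hidden sigmoid
    g1 = e * y * (1 - y) * w2 * h * (1 - h) * x
    return 0.5 * e * e, g1, g2

# Sanity check: analytic gradients match central finite differences.
x, t, w1, w2 = 0.5, 1.0, 0.3, -0.7
_, g1, g2 = loss_and_grads(x, t, w1, w2)
eps = 1e-6
num_g1 = (loss_and_grads(x, t, w1 + eps, w2)[0]
          - loss_and_grads(x, t, w1 - eps, w2)[0]) / (2 * eps)
num_g2 = (loss_and_grads(x, t, w1, w2 + eps)[0]
          - loss_and_grads(x, t, w1, w2 - eps)[0]) / (2 * eps)
```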
******************************
Lecture 5 (February 26). Multiway classification.
Unsupervised learning.
Introduction to Bayesian belief networks.
Multiway classification.
Naive approaches.
Softmax model.
Softmax and posteriors for the exponential family.
Decision boundaries.
Parameter estimation.
Unsupervised learning.
Bayesian belief networks.
Representing large joint distributions with conditional independences.
Components of the Bayesian belief network (BBN): DAG + parameters.
Example.
Probabilistic inference in the BBN.
Readings: Mitchell - chapter 6 (6.11), DHS - chapter 2 (2.11).
Additional reading (distributed during the class):
E. Charniak. Bayesian networks without tears. AI Magazine, pp. 50-63, 1991.
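A minimal sketch of the softmax model (Python, my own naming): K real-valued scores are mapped to class posteriors that are positive and sum to one.

```python
import math

def softmax(scores):
    """Softmax model for multiway classification: converts K real-valued
    scores into a posterior distribution over K classes. Subtracting the
    max score first keeps exp() numerically stable without changing
    the result."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

p = softmax([2.0, 1.0, 0.1])
```

Note that equal scores yield equal posteriors, and the largest score always gets the largest posterior.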
************************************
Lecture 6 (March 12). Learning Bayesian belief networks.
Bayesian belief networks with discrete values.
Conditional independences.
Learning and inference advantage.
Parameter learning
for complete data.
Maximum likelihood (ML) and maximum a posteriori (MAP) estimates.
Representing priors.
Beta and Dirichlet distributions.
Full Bayesian approach.
Structure learning for complete data.
ML and MAP criteria.
Score decomposability.
Occam's Razor.
Approximations of the Bayesian criteria (Akaike, BIC, MDL).
Readings:
D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, 1995.
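The ML vs. MAP distinction for a single Bernoulli parameter with a Beta prior fits in two lines. In the sketch below (my own, using the standard posterior-mode formula), a Beta(alpha, beta) prior acts like alpha-1 extra heads and beta-1 extra tails:

```python
def bernoulli_ml(heads, n):
    """Maximum-likelihood estimate of a coin's bias: fraction of heads."""
    return heads / n

def bernoulli_map(heads, n, alpha, beta):
    """MAP estimate under a Beta(alpha, beta) prior (posterior mode):
    the prior contributes alpha-1 pseudo-heads and beta-1 pseudo-tails."""
    return (heads + alpha - 1) / (n + alpha + beta - 2)

ml = bernoulli_ml(3, 4)            # 0.75
map_ = bernoulli_map(3, 4, 2, 2)   # (3+1)/(4+2) = 2/3, pulled toward 1/2
```

With the uniform Beta(1, 1) prior the MAP estimate reduces to the ML estimate, as it should.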
**************************************
Lecture 7 (March 19). Clustering. Mixture of Gaussians.
Expectation-maximization.
Clustering.
K-means algorithm.
Mixture of Gaussians model.
Soft K-means.
Expectation-maximization (EM).
General form.
Proof that each EM iteration increases the log-likelihood.
Application of EM to learning BBNs with hidden variables
and missing values.
Learning Naive Bayes model.
Readings: DHS - chapter 10, section 3.9
Optional reading:
A.P. Dempster, N.M. Laird, D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-28, 1977.
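A bare-bones K-means on scalar data (a Python sketch of my own) shows the assign-then-re-estimate loop that soft K-means and EM generalize:

```python
def kmeans_1d(points, centers, iters=20):
    """Plain K-means on scalar data: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda k: (p - centers[k]) ** 2)
            clusters[j].append(p)
        # Empty clusters keep their old center.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated groups; converges near centers 1.0 and 5.0.
centers = kmeans_1d([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], [0.0, 6.0])
```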
****************************************
Lecture 8 (March 26). Hidden Markov models.
Feature selection. Dimensionality reduction.
Hidden Markov models.
Example.
Inference: the forward-backward algorithm.
Learning: the Baum-Welch algorithm.
Data preprocessing.
Input normalization.
Feature selection. Dimensionality reduction.
Feature/input selection:
Feature subset selection.
Reduction through the combination of inputs/features.
Feature subset selection:
Mutual information.
Selection independent of / dependent on the original learning task.
Cross-validation to estimate generalization error.
Dimensionality reduction.
PCA. Derivation. Example.
Non-linear dimensionality reduction through auto-associative neural networks.
Reduction through clustering.
Additional readings:
L.R. Rabiner, B.H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, vol. 3, no. 1, pp. 4-16, 1986.
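The forward algorithm can be written in a few lines. The Python sketch below (my own) computes the likelihood of an observation sequence by propagating alpha values, avoiding the exponential sum over all state paths:

```python
def hmm_forward(obs, pi, A, B):
    """Forward algorithm: P(observation sequence) for an HMM with
    initial distribution pi, transition matrix A, and emission matrix B.
    Runs in O(T * N^2) instead of summing over all N^T state paths."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                 for j in range(n)]
    return sum(alpha)

# A deterministic 2-state HMM that must alternate symbols 0, 1, 0, ...
pi = [1.0, 0.0]
A = [[0.0, 1.0], [1.0, 0.0]]   # states alternate
B = [[1.0, 0.0], [0.0, 1.0]]   # state i emits symbol i
p = hmm_forward([0, 1, 0], pi, A, B)
```

Here the alternating sequence gets probability 1 and any non-alternating sequence gets probability 0.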
******************************************************
Lecture 9 (April 2). Mixture of experts.
Decision trees.
Hierarchical mixture of experts.
Dimensionality reduction review.
Dimensionality reduction through clustering.
Combining multiple learners.
Mixture of experts.
On-line learning of parameters.
Decision trees.
Impurity measures.
Growing the tree.
Drawback of the greedy selection. Parity functions.
Overfitting and pruning of the tree.
Hierarchical mixture of experts.
Relation to decision trees.
Gating functions and responsibilities for tree parameters.
On-line learning.
Readings:
Mixture of experts: DHS - chapter 9.7.
Decision trees:
Mitchell - chapter 3, DHS - chapter 8
Hierarchical mixtures of experts:
Michael Jordan, Robert Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, vol. 6, pp. 181-214, 1994.
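Impurity measures and greedy splitting can be illustrated directly. The sketch below (Python, my own) computes entropy impurity and the information gain of a candidate split:

```python
import math

def entropy(labels):
    """Entropy impurity of a set of class labels, in bits."""
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def info_gain(labels, left, right):
    """Information gain of a candidate split: impurity of the parent
    minus the size-weighted impurity of the two children."""
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

# A pure split of a balanced binary node gains the full 1 bit.
gain = info_gain([0, 0, 1, 1], [0, 0], [1, 1])
```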
**********************************************************
Lecture 10 (April 9). Ensemble methods. Bagging and boosting.
Readings:
L. Breiman. Arcing classifiers. The Annals of Statistics, vol. 26, no. 3, pp. 801-849, 1998.
Y. Freund, R. Schapire. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, 1996.
R.E. Schapire, Y. Freund, P. Bartlett, W.S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, vol. 26, pp. 1651-1686, 1998.
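One round of AdaBoost-style weight updating (a Python sketch of my own, following the standard exponential reweighting) shows why misclassified examples gain weight: after renormalization they carry exactly half the total mass:

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost round: given the current example weights and, for
    each example, whether the weak learner classified it correctly,
    return the learner's vote alpha and the renormalized weights."""
    err = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * math.log((1 - err) / err)   # weak learner's vote
    # Shrink weights of correct examples, grow weights of mistakes.
    new = [w * math.exp(-alpha if c else alpha)
           for w, c in zip(weights, correct)]
    z = sum(new)
    return alpha, [w / z for w in new]

# Four equally weighted examples, one mistake: err = 0.25.
alpha, w = adaboost_round([0.25] * 4, [True, True, True, False])
```

After the update the single misclassified example holds weight 0.5, so the next weak learner is forced to pay attention to it.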
**********************************************************
Lecture 11 (April 16). Support vector machines.
Non-parametric density estimation.
Support vector machines.
Classification for the linearly separable case. Separating hyperplane.
Algorithms (perceptron, linear programming).
Maximum margin hyperplane. Support vectors.
Finding the maximum margin hyperplane.
Linearly non-separable case.
Extension to the non-linear case.
Kernel functions.
Non-parametric density estimation.
Histograms.
Parzen windows.
Parzen windows with Gaussian kernels.
K-nearest neighbor.
Readings:
Support vector machines:
C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, vol. 2, pp. 121-167, 1998.
DHS - sections 5.10, 5.11
Non-parametric density estimation:
DHS - chapter 4
Mitchell - section 8.2.
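Parzen windows with a Gaussian kernel amount to averaging one unit-area bump per sample. A one-dimensional Python sketch (my own; data and bandwidth are illustrative):

```python
import math

def parzen_density(x, samples, h):
    """Parzen-window density estimate with a Gaussian kernel of
    bandwidth h: the average of one unit-area Gaussian bump centered
    on each sample."""
    n = len(samples)
    k = 1.0 / (h * math.sqrt(2 * math.pi))
    return sum(k * math.exp(-0.5 * ((x - s) / h) ** 2)
               for s in samples) / n

data = [0.0, 0.1, -0.1, 5.0]
```

The estimate is high where samples cluster and low in the gap between the two groups, with the bandwidth h controlling the smoothness.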
**********************************************************
Lecture 12 (April 23). Reinforcement learning.
Reinforcement learning.
Basic RL scheme.
Objective functions.
Exploration/Exploitation dilemma.
Learning with immediate rewards.
Learning with delayed rewards.
Markov decision process (MDP).
Finding the optimal policy for an MDP.
Learning optimal policies.
Model-based learning.
Model-free learning. Q-learning.
RL speed-ups.
Readings:
L.P. Kaelbling, M.L. Littman, A.W. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 1996.
Mitchell - chapter 13.
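A model-free Q-learning sketch on a toy chain MDP (Python, my own; the states, rewards, and learning rate are illustrative assumptions) shows the tabular update rule in action:

```python
import random

def q_learning(n_states, episodes=500, gamma=0.9, lr=0.5, seed=0):
    """Tabular Q-learning on a deterministic chain: states 0..n-1,
    actions 0 (left) and 1 (right); reaching the last state pays
    reward 1 and ends the episode. Model-free: only observed
    (s, a, r, s') transitions are used, never the transition model."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            a = rng.randrange(2)                   # explore uniformly
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            target = r + gamma * max(q[s2])        # terminal q stays 0
            q[s][a] += lr * (target - q[s][a])     # Q-learning update
            s = s2
    return q

q = q_learning(4)
```

Despite the purely random exploration, the greedy policy read off the learned Q-table moves right in every state, and the values converge to the discounted optimal returns (1, 0.9, 0.81 for moving right from states 2, 1, 0).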
The course grade will be determined roughly as follows:
There will be 3-4 homework assignments during the semester. These will have the character of small projects and will require you to implement some of the learning algorithms covered during the semester and apply them. No collaboration on homeworks is allowed. To implement homework assignments you may use a programming language and graphics tools of your choice. However, it is strongly recommended that you use the Matlab package, which is now available in the new CIS lab in Wachman Hall 200.
****************************************************
Homework assignment 1. Out: 2/12/2001. Due: 2/26/2001.
Data sets to be used with the assignment.
Problem 1. Boston housing data.
Problem 2. Binary classification.
********************************************************
Homework assignment 2. Out: 3/19/2001. Due: 4/2/2001.
Datasets to be used with the assignment.
Problem 1. Multiway classification.
Problem 2. Customer profiling and brand preference predictions.
The term project is due at the end of the semester and accounts for about 60% of your grade. You are free to choose your own problem topic; however, the project must have a distinctive, non-trivial learning or adaptive component. In general, a project may consist of a replication of previously published results, the design of new learning methods and their testing, or an application of machine learning to a domain or problem of your interest.
Matlab is a mathematical tool for numerical computation and manipulation, with excellent graphing capabilities. It provides a great deal of support for the things you will need to run machine learning experiments.
To use Matlab you need to log in to one of the machines in Wachman Hall, Room 200. If you have an active NT account with the CIS department, just log in to your existing account and click on the Matlab icon on the desktop to run the program. Note that all machines in Room 200 are dual-boot; if you prefer to work under Linux, you can run Matlab there as well by typing matlab in an xterm window. Please see the course staff if you do not have an active account with the CIS department.
Pokie gave a two-hour tutorial on February 6, 2001. The file with short demos from the tutorial is now available on-line.
Click here to download the tutorial demo file.
dopen.m - the file used in the demo for downloading the data.
Other Matlab resources on the web:
Online MATLAB documentation
Online Mathworks documentation, including MATLAB toolboxes