CIS 595-2  Machine Learning


Time:  Monday, 4:40pm -7:15pm,  TL 306






Instructor:  Milos Hauskrecht
Wachman Hall 313, x1-5775
e-mail: milos@joda.cis.temple.edu
office hours: by appointment

TA: Dragoljub Pokrajac "Pokie"
Wachman Hall 303-A,  x1-5908
e-mail: pokie@snowhite.cis.temple.edu
office hours: Thursday 1:30pm -2:30pm
 



Links

Course description
Handouts
Lectures
Grading
Homeworks
Term projects
Matlab
 
 

  • !!! Announcements !!!
  • Term projects (important dates):


  • Abstract

    The goal of the field of machine learning is to build computer systems that learn from experience and that are capable of adapting to their environments. Learning techniques and methods developed by researchers in this field have been successfully applied to a variety of learning tasks in a broad range of areas, including text classification, gene discovery, financial forecasting, credit card fraud detection, collaborative filtering, the design of adaptive web agents, and others.

    This introductory machine learning course will give an overview of many techniques and algorithms in machine learning, beginning with topics such as simple concept learning and ending with more recent topics such as boosting, support vector machines, and reinforcement learning. The objective of the course is not only to present the modern machine learning methods but also to give the basic intuitions behind the methods, as well as a more formal understanding of how and why they work.

    Topics to be covered


    Textbooks (available at the bookstore)

    R.O. Duda, P.E. Hart, D.G. Stork.  Pattern Classification. Second edition. John Wiley and Sons, 2000.

    T. Mitchell.  Machine Learning. Mc Graw Hill, 1997.
     

    Additional readings

    see lecture descriptions



    Electronic handouts
    1. Course information  (postscript, pdf)
    2. Syllabus  (postscript, pdf)
    Lecture handouts  (slides + additional reading material  + homework assignments) are always distributed during the class.



    Lectures

     Lecture 1.   (January 22)   Administrivia and Introduction.

    Objectives of machine learning.
    Examples.
    Types of machine learning problems:
        - supervised (regression, classification)
        - unsupervised (clustering, density estimation)
        - reinforcement learning.
    Learning modes:
        - batch
        - on-line
    Design cycle: data, feature selection, model selection, learning, evaluation.
    Training and generalization error.
        Overfitting.
        Bias and variance.

    Readings:  DHS - chapter 1, Mitchell - chapter 1.
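    The training vs. generalization error distinction above can be seen in a few lines of Python (the data and target function here are invented for illustration): a learner that simply memorizes its training data gets exactly zero training error, yet still errs on fresh test points drawn from the same noisy target.

```python
import random

random.seed(0)
def true_f(x): return 2.0 * x                     # toy target function
def sample(xs):
    return [(x, true_f(x) + random.gauss(0.0, 0.5)) for x in xs]

train = sample([0.1 * i for i in range(10)])
test = sample([0.1 * i + 0.05 for i in range(10)])

# "memorizing" learner: predict the stored y of the nearest training x
def predict(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

train_err, test_err = mse(train), mse(test)   # 0.0 vs. a positive test error
```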

    ***************************

    Lecture 2. (January 29).  Concept learning.

    Learning concepts.
        Instance space.
        Hypothesis space.
        Mitchell's version space algorithm.
    PAC framework.
        Sample complexity bound.
        Efficient PAC learnability.
        Measuring inductive biases.
        Vapnik Chervonenkis dimension.
        Improved sample complexity bound.
        Adding noise to examples.

    Readings:  Mitchell - chapters 2, 7
    Additional readings (distributed during the class):
    David Haussler.  Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence,  vol. 36, 1988.
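    The sample complexity bound for a finite hypothesis space can be evaluated directly. A minimal Python sketch, assuming a consistent learner and the bound m >= (1/eps)(ln|H| + ln(1/delta)); the |H| = 2^10 hypothesis space below is an arbitrary illustrative choice:

```python
import math

def sample_complexity(h_size, eps, delta):
    """Examples sufficient for a consistent learner over a finite
    hypothesis space H to be, with probability 1 - delta, within
    error eps: m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

m = sample_complexity(h_size=2 ** 10, eps=0.1, delta=0.05)   # 100
```

    Note how the bound grows only logarithmically in |H| and 1/delta but linearly in 1/eps.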

    *****************************
    February 5 lecture. Cancelled due to a snow storm.

    *****************************

    Lecture 3. (February 12). Supervised learning.
                                          Regression.
                                          Classification.

    Supervised learning.
    Regression problems.
        Linear regression. Squared error  loss function. Parameter fitting.
        Generalized additive models.
        Statistical model with Gaussian noise.
        On-line gradient-based techniques.
    Classification problems.
        Binary classification.
        Generative model of classification.  Class-conditional distributions.
        Decision boundaries for two Gaussians. Parameter estimation.
        Logistic regression.
        Exponential family.
        Parameter estimation.
        On-line learning.

    Readings: DHS - chapters 2 (excl. 2.9 and up), 3.2, 5
    Additional readings:
    Michael Jordan. Why the logistic function? A tutorial discussion on probabilities and neural networks.  TR 9503, Computational  Cognitive Science, MIT, 1995.
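    For the regression part, a minimal Python sketch of least-squares parameter fitting for a line y = w*x + b, the closed-form solution minimizing the squared-error loss (the data points are made up and noiseless so the fit is exact):

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit minimizing sum((w*x + b - y)^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return w, my - w * mx               # slope w, intercept b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]               # exactly y = 2x + 1
w, b = fit_line(xs, ys)
```

    The on-line gradient-based techniques from the lecture reach the same minimizer by repeated small steps instead of this one-shot solution.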

    *******************************

    Lecture 4 (February 19). Multi-layer neural networks.

    Regression and classification review.
        Linear units.
        On-line methods.
        Parameter estimation for logistic regression.
    Multi-layer neural networks.
        Extending linear units with feature (basis) functions to model non-linearities.
        Cascading linear units.
        Computing weight derivatives. Backpropagation.
        Examples.

    Readings: DHS - chapter 6, Mitchell - chapter  4.
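    Backpropagation can be sketched on the smallest possible multi-layer network, one sigmoid hidden unit feeding one sigmoid output, with the analytic gradient checked against a finite difference (weights, input, and target below are arbitrary illustrative values):

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(x, y, w1, w2):
    return 0.5 * (sig(w2 * sig(w1 * x)) - y) ** 2

def backprop(x, y, w1, w2):
    # forward pass, caching unit activations
    h = sig(w1 * x)                 # hidden unit
    o = sig(w2 * h)                 # output unit
    # backward pass (chain rule), using sig'(z) = sig(z) * (1 - sig(z))
    d_o = (o - y) * o * (1 - o)     # dL/d(net input of output unit)
    return d_o * w2 * h * (1 - h) * x, d_o * h    # dL/dw1, dL/dw2

# compare the analytic gradients with finite-difference estimates
x, y, w1, w2, eps = 0.5, 1.0, 0.3, -0.7, 1e-6
g1, g2 = backprop(x, y, w1, w2)
n1 = (loss(x, y, w1 + eps, w2) - loss(x, y, w1 - eps, w2)) / (2 * eps)
n2 = (loss(x, y, w1, w2 + eps) - loss(x, y, w1, w2 - eps)) / (2 * eps)
```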

    ******************************

    Lecture 5 (February 26). Multiway classification.
                                         Unsupervised learning.
                                         Introduction to Bayesian belief networks.

    Multiway classification.
        Naive approaches.
        Softmax model.
        Softmax and posteriors for the exponential family.
        Decision boundaries.
        Parameter estimation.
    Unsupervised learning.
    Bayesian belief networks.
         Representing large joint distributions with conditional independences.
         Components of the Bayesian belief network (BBN): DAG + parameters
         Example.
         Probabilistic inference  in the BBN.

    Readings:  Mitchell - chapter 6 (6.11), DHS - chapter 2 (2.11).
    Additional reading (distributed during the class):
    E. Charniak. Bayesian networks without tears. AI Magazine, pp. 50-63, 1991.
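    The softmax model above maps a vector of linear scores (one per class) to class posteriors; a short sketch with an illustrative three-class score vector:

```python
import math

def softmax(scores):
    """Class posteriors from linear scores; shifting by the max score
    avoids overflow without changing the result."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

p = softmax([2.0, 1.0, 0.1])   # probabilities, ordered like the scores
```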

    ************************************

    Lecture 6 (March 12). Learning Bayesian belief networks.

    Bayesian belief networks with discrete values.
    Conditional independences.
    Learning and inference advantage.
    Parameter learning for complete data.
        Maximum likelihood (ML) and maximum a posteriori (MAP) estimates.
        Representing priors. Beta and Dirichlet distributions.
        Full  Bayesian approach.
    Structure learning for complete data.
        ML and MAP criteria.
        Score decomposability.
        Occam's Razor.
        Approximations of the Bayesian criteria (Akaike, BIC, MDL).

    Readings:
    D. Heckerman.  A tutorial on learning with Bayesian belief networks. MSR-TR-95-06, 1995.
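    For parameter learning with complete data, the ML and MAP estimates of a single binary (Bernoulli) parameter with a Beta prior have simple closed forms; a sketch with made-up counts:

```python
def ml_estimate(heads, n):
    """Maximum likelihood: just the observed frequency."""
    return heads / n

def map_estimate(heads, n, a, b):
    """Mode of the Beta(a + heads, b + n - heads) posterior."""
    return (heads + a - 1) / (n + a + b - 2)

# 3 heads in 4 tosses; a Beta(2, 2) prior pulls the estimate toward 0.5
ml = ml_estimate(3, 4)           # 0.75
mp = map_estimate(3, 4, 2, 2)    # 4/6, between the data and the prior mean
```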

    **************************************

    Lecture 7 (March 19). Clustering.  Mixture of Gaussians.
                                     Expectation-maximization.

    Clustering.
        K-means algorithm.
    Mixture of Gaussians model.
        Soft K-means.
    Expectation-maximization (EM).
        General form.
        Proof of maximization of loglikelihood.
        Application of EM to learning BBNs with hidden variables and missing values.
        Learning Naive Bayes model.

    Readings:  DHS - chapters 10, 3.9

    Optional reading:
    A.P. Dempster, N.M. Laird, D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.
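    The K-means algorithm from this lecture, sketched on one-dimensional toy data: alternate assigning each point to its nearest center with moving each center to the mean of its points (the mixture-of-Gaussians EM replaces the hard assignment with soft responsibilities).

```python
def kmeans(points, centers, iters=10):
    """Plain K-means on 1-D data (toy example)."""
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            clusters[i].append(p)
        # re-estimation step: each center moves to its cluster mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [0.0, 0.2, 0.4, 9.8, 10.0, 10.2]   # two well-separated groups
centers = kmeans(points, [1.0, 9.0])         # converges to [0.2, 10.0]
```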
     

    ****************************************

    Lecture 8 (March 26). Hidden Markov models.
                                     Feature selection.  Dimensionality reduction.

    Hidden Markov models.
    Example.
    Inference. Forward-backward algorithms.
    Learning. Baum-Welch.
    Data preprocessing.
    Input normalization.
    Feature selection. Dimensionality reduction.
    Feature/input selection:
        Feature subset selection.
        Reduction through the combination of inputs/features.
    Feature subset selection:
        Mutual information.
        Selection independent/dependent on the original learning task.
        Cross-validation to estimate generalization error.
    Dimensionality reduction.
        PCA. Derivation. Example.
        Non-linear dimensionality reduction through an auto-associative neural network.
        Reduction through clustering.

    Readings:

    Additional readings.
    L.R. Rabiner, B.H. Juang. Introduction to Hidden Markov Models. IEEE ASSP magazine, vol.3, no. 1,  pp. 4-16, 1986.
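    The forward pass of HMM inference fits in a few lines; the two-state model and the 0/1 observation symbols below are hypothetical, chosen only to make the recursion concrete.

```python
def forward(init, trans, emit, obs):
    """HMM forward algorithm: alpha[s] = P(o_1..o_t, state_t = s),
    updated left to right; the final sum is P(o_1..o_T)."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[u] * trans[u][s] for u in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)

# hypothetical 2-state model with two observation symbols (0 and 1)
init  = [0.5, 0.5]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit  = [[0.9, 0.1], [0.2, 0.8]]
p = forward(init, trans, emit, [0, 1])   # likelihood of the sequence 0, 1
```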

    ******************************************************
    Lecture 9 (April 2).   Mixture of experts.
                                  Decision trees.
                                  Hierarchical mixture of experts.

    Dimensionality reduction review.
    Dimensionality reduction through clustering.
    Combining multiple learners.
    Mixture of experts.
        Combining multiple learners.
        On-line learning of parameters.
    Decision trees.
        Impurity measures.
        Growing the tree.
        Drawback of the greedy selection. Parity functions.
        Overfitting and pruning of the tree.
    Hierarchical mixture of experts.
        Relation to decision trees.
        Gating functions and responsibilities for tree  parameters.
        On-line learning.

    Readings:
    Mixture of experts: DHS - chapter 9.7.
    Decision trees: Mitchell - chapter 3, DHS - chapter 8
    Hierarchical mixtures of experts:
    M.I. Jordan, R.A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, vol. 6, pp. 181-214, 1994.
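    Growing a decision tree greedily picks the split with the largest impurity drop; a sketch using entropy as the impurity measure (the class counts are invented, and the split shown is perfectly pure, so the gain is a full bit):

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class-count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def info_gain(parent, children):
    """Impurity drop of a candidate split: parent entropy minus the
    weighted average entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

# splitting 4+/4- examples into two pure children
g = info_gain([4, 4], [[4, 0], [0, 4]])   # gain of 1.0 bit
```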

    **********************************************************
    Lecture 10 (April 9).   Ensemble methods. Bagging and boosting.

    Readings:
    L. Breiman. Arcing classifiers. The Annals of Statistics, vol. 26, pp. 801-849, 1998.
    Y. Freund, R. Schapire. Experiments with a New Boosting algorithm. In the Proceedings of the 13-th International  Conference on Machine Learning, 1996.
    R.E. Schapire, Y. Freund, P. Bartlett, W.S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, vol. 26, pp. 1651-1686, 1998.
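    One round of the AdaBoost weight update, sketched on four examples with uniform initial weights. A well-known property is checked below: after re-weighting, the misclassified examples carry half of the total weight, so the next weak learner must attend to them.

```python
import math

def adaboost_step(weights, correct):
    """One AdaBoost round: given which examples the weak learner got
    right, return its vote alpha and the re-normalized weights."""
    err = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * math.log((1 - err) / err)       # weak learner's vote
    new = [w * math.exp(-alpha if c else alpha)   # down/up-weight
           for w, c in zip(weights, correct)]
    z = sum(new)                                  # normalizer
    return alpha, [w / z for w in new]

# 4 examples, uniform weights; the weak learner errs only on the last one
alpha, w = adaboost_step([0.25] * 4, [True, True, True, False])
```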

    **********************************************************
    Lecture 11 (April 16). Support vector machines.
                                             Non-parametric density estimation.
     

    Support vector machines.
        Classification for the linearly separable case.  Separating hyperplane.
        Algorithms (perceptron, linear programming).
        Maximum margin hyperplane. Support vectors.
        Finding maximum margin hyperplane.
        Linearly non-separable case.
        Extension to the non-linear case.
        Kernel functions.
    Non-parametric density estimation.
        Histograms.
        Parzen windows.
        Parzen windows with Gaussian kernels.
        K-nearest neighbor.

    Readings:
    Support vector machines:
    C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, vol. 2, pp. 121-167, 1998.
    DHS - sections 5.10, 5.11
    Non-parametric density estimation:
    DHS - chapter 4
    Mitchell - section 8.2.
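    The K-nearest-neighbor rule from the non-parametric part, sketched for 1-D inputs with an invented training set: classify a query by a majority vote among its k closest training points.

```python
def knn_classify(x, data, k=3):
    """k-nearest-neighbor vote on 1-D inputs (toy sketch)."""
    nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)   # majority label

train = [(0.1, 'a'), (0.2, 'a'), (0.3, 'a'), (5.0, 'b'), (5.1, 'b')]
label = knn_classify(0.15, train)   # 'a': all 3 nearest neighbors are 'a'
```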

    **********************************************************
    Lecture 12 (April 23).  Reinforcement learning.

    Reinforcement learning.
        Basic RL scheme.
        Objective functions.
        Exploration/Exploitation dilemma.
    Learning with immediate rewards.
    Learning with delayed rewards.
        Markov decision process (MDP).
        Finding the optimal plan for  an MDP.
        Learning optimal policies.
            Model-based learning.
            Model-free learning. Q-learning.
         RL speed-ups.

    Readings:
    L.P. Kaelbling, M.L. Littman, A. W. Moore.  Reinforcement learning: a survey.  Journal of Artificial Intelligence Research, 1996.
    Mitchell - chapter  13.
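    A minimal sketch of model-free Q-learning on a made-up two-state chain with epsilon-greedy exploration. The learned values should approach the true discounted returns: 1 for the final step and gamma * 1 = 0.9 one step earlier.

```python
import random

def q_learning(episodes=2000, lr=0.1, gamma=0.9, eps=0.2):
    """Toy MDP: in state 0, action 1 advances to state 1 (reward 0);
    in state 1, action 1 reaches the goal (reward 1); action 0 stays
    put with reward 0. Q is learned from sampled transitions only."""
    random.seed(0)
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(episodes):
        s = 0
        while s is not None:
            if random.random() < eps:                       # explore
                a = random.randrange(2)
            else:                                           # exploit
                a = max(range(2), key=lambda i: q[s][i])
            if s == 0:
                s2, r = (1 if a == 1 else 0), 0.0
            else:
                s2, r = (None if a == 1 else 1), (1.0 if a == 1 else 0.0)
            target = r + gamma * max(q[s2]) if s2 is not None else r
            q[s][a] += lr * (target - q[s][a])              # Q-update
            s = s2
    return q

q = q_learning()   # q[1][1] approaches 1.0, q[0][1] approaches 0.9
```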



    Grading
     

    The course grade will be determined roughly as follows:



    Homeworks

    There will be 3-4 homework assignments during the semester. These will have the character of small projects and will require you to implement some of the learning algorithms covered during the semester and apply them. No collaboration on homeworks is allowed. To implement homework assignments you may use a programming language and graphing tools of your choice. However, it is strongly recommended that you use the Matlab package, which is now available in the new CIS lab in Wachman Hall 200.
     

    ****************************************************
    Homework assignment 1.   Out: 2/12/2001   Due: 2/26/2001

    Data sets to be used with the assignment.

    Problem 1. Boston housing data.


    Problem 2. Binary classification.

    To read the data from files into a matrix you can use the dopen function. For example,
    data = dopen('classification-train.data', 3)
    will create a matrix data and read into it the training data for the classification dataset, and
    data = dopen('housing-test.data', 14)
    reads in the test data for the housing dataset.

    ********************************************************
    Homework assignment 2. Out: 3/19/2001    Due: 4/2/2001

    Datasets to be used with the assignment.

    Problem 1. Multiway classification.
     


    Problem 2. Customer profiling and brand  preference predictions.



    Term projects

    The term project is due at the end of the semester and accounts for about 60% of your grade. You are free to choose your own problem topic; however, the project must have a distinctive and non-trivial learning or adaptive component. In general, a project may consist of a replication of previously published results, the design and testing of new learning methods, or an application of machine learning to a domain or problem of your interest.



    Matlab

    Matlab is a mathematical tool for numerical computation and manipulation, with excellent graphing capabilities. It provides much of the functionality you will need to run machine learning experiments.

    To use Matlab you need to log in to one of the machines in Wachman Hall, Room 200. If you have an active NT account with the CIS department, just log in to your existing account and click on the Matlab icon on the desktop to run the program. Note that all machines in Room 200 are dual-boot; if you prefer to work under Linux, you can run Matlab there as well by typing matlab in an xterm window. Please see the course staff if you do not have an active account with the CIS department!

    Pokie gave a two-hour tutorial on February 6, 2001. The file with short demos from the tutorial is now available on-line.

    Click here to download  the tutorial demo file.
    dopen.m file used in the demo  - downloading the data

    Other Matlab resources on the web:

    Online MATLAB  documentation
    Online Mathworks documentation including MATLAB toolboxes



    Last updated by Milos on 02/11/2001