1
|
- Milos Hauskrecht
- milos@cs.pitt.edu
- 5329 Sennott Square
|
2
|
- Exam:
- April 18, 2007
- Term projects & project presentations:
- April 25, 2007
- At 1:00-4:00pm in SNSQ 5313
- No class:
- on April 23, 2007
|
3
|
- Mixture of experts
- Multiple ‘base’ models (classifiers, regressors), each covers a
different part (region) of the input space
- Committee machines:
- Multiple ‘base’ models (classifiers, regressors), each covers the
complete input space
- Each base model is trained on a slightly different training set
- Combine predictions of all models to produce the output
- Goal: Improve the accuracy of the ‘base’ model
- Methods:
- Bagging
- Boosting
- Stacking (not covered)
|
4
|
- Given:
- Training set of N examples
- A class of learning models (e.g. decision trees, neural networks, …)
- Method:
- Train multiple (k) models on different samples (data splits) and
average their predictions
- Predict (test) by averaging the results of k models
- Goal:
- Improve the accuracy of one
model by using its multiple copies
- Average of misclassification errors on different data splits gives a
better estimate of the predictive ability of a learning method
|
5
|
- Training
- In each iteration t, t=1,…T
- Randomly sample with replacement N samples from the training set
- Train a chosen “base model” (e.g. neural network, decision tree) on
the samples
- Test
- For each test example
- Run all trained base models
- Predict by combining the results of all T trained models (a minimal sketch follows this slide):
- Regression: averaging
- Classification: a majority vote
|
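A minimal Python sketch of the bagging procedure above, assuming NumPy and scikit-learn; the function names, the choice of decision trees as the base model, and T=25 are illustrative, not part of the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, T=25, random_state=0):
    """Train T base models, each on a bootstrap sample (N draws with replacement).
    X: (N, d) NumPy array, y: integer class labels."""
    rng = np.random.default_rng(random_state)
    N = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)                 # sample N examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classification: majority vote over the T trained base models."""
    votes = np.stack([m.predict(X) for m in models])     # shape (T, n_examples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```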
6
|
|
7
|
- Expected error = Bias + Variance
- Expected error is the expected squared discrepancy between the estimated and
the true function
- Bias is the squared discrepancy between the averaged estimate and the true
function
- Variance is the expected squared deviation of the estimated function from its
average value
|
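Written out (the standard decomposition; the notation, with f-hat the estimated function, f the true function, and the expectation taken over training sets, is assumed rather than taken from the slides):

```latex
\underbrace{\mathbb{E}\!\left[(\hat{f}(x) - f(x))^2\right]}_{\text{expected error}}
= \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}}
+ \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
```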
8
|
- Under-fitting:
- High bias (models are not accurate)
- Small variance (smaller
influence of examples in the training set)
- Over-fitting:
- Small bias (models flexible enough to fit well to training data)
- Large variance (models depend
very much on the training set)
|
9
|
- Example
- Assume we measure a random variable x with a N(m, s²) distribution
- If only one measurement x1 is done,
- The expected mean of the measurement is m
- Variance is Var(x1) = s²
- If the random variable x is measured K times (x1, x2, …, xK)
and the value is estimated as (x1 + x2 + … + xK)/K,
- The mean of the estimate is still m
- But the variance is smaller:
- [Var(x1) + … + Var(xK)]/K² = K·s²/K²
= s²/K
- Observe: Bagging is a kind of averaging!
|
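A quick numerical check of the s²/K variance reduction (assuming NumPy; the values for m, s, and K are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, s, K = 5.0, 2.0, 10                  # mean, standard deviation, number of measurements

single = rng.normal(m, s, size=100_000)                       # one measurement per trial
averaged = rng.normal(m, s, size=(100_000, K)).mean(axis=1)   # average of K measurements per trial

print(single.var())     # close to s**2 = 4.0
print(averaged.var())   # close to s**2 / K = 0.4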
10
|
- Main property of Bagging (proof omitted)
- Bagging decreases variance of the base model without changing the
bias!!!
- Why? Averaging!
- Bagging typically helps
- When applied with an over-fitted base model
- High dependency on the actual training data
- It does not help much
- When the base model has high bias and is robust to changes in the training
data (due to sampling)
|
11
|
- Mixture of experts
- One expert per region
- Expert switching
- Bagging
- Multiple models on the complete space, a learner is not biased to any
region
- Learners are trained independently
- Boosting
- Every learner covers the complete space
- Learners are biased to regions not predicted well by other learners
- Learners are dependent
|
12
|
- PAC: Probably Approximately Correct
framework
- PAC learning:
- Learning with a pre-specified accuracy ε and
confidence δ
- the probability that the misclassification
error is larger than ε is smaller than δ
- Accuracy (ε): the percent of correctly classified samples on the
test set
- Confidence (δ):
the probability that in one experiment the given accuracy is achieved
|
13
|
- Strong (PAC) learnability:
- There exists a learning algorithm that efficiently learns the
classification with a pre-specified accuracy and confidence
- Strong (PAC) learner:
- A learning algorithm P that, given an arbitrary
- classification error ε (< 1/2), and
- confidence δ (< 1/2)
- Outputs a classifier
- With a classification accuracy > (1 − ε)
- With a confidence probability > (1 − δ)
- And runs in time polynomial in 1/δ, 1/ε
- Implies: the number of samples N is
polynomial in 1/δ, 1/ε
|
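For intuition about the polynomial dependence on 1/ε and 1/δ: for a finite hypothesis class H and a learner that outputs a hypothesis consistent with the training data, the standard PAC sample-complexity bound (a known result, not stated on the slides) is

```latex
N \;\geq\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right),
```

which guarantees error at most ε with probability at least 1 − δ.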
14
|
- Weak learner:
- A learning algorithm (learner) W
- Providing classification accuracy > 1 − ε₀
- With probability > 1 − δ₀
- For some fixed and uncontrollable
- classification error ε₀ (< 1/2)
- confidence δ₀ (< 1/2)
- And this on an arbitrary distribution of data entries
|
15
|
- Assume there exists a weak learner
- it is better than a random guess (50 %), with confidence higher than 50 %,
on any data distribution
- Question:
- Is the problem also PAC-learnable?
- Can we generate an algorithm P that achieves an arbitrary (ε, δ) accuracy?
- Why is this important?
- Usual classification methods (decision trees, neural nets) have
specified but uncontrollable performance.
- Can we improve the performance to achieve a pre-specified accuracy
(confidence)?
|
16
|
- Proof due to R. Schapire
- An arbitrary (ε, δ) improvement is
possible
- Idea: combine multiple weak learners together
- Weak learner W with confidence δ₀ and maximal error ε₀
- It is possible:
- To improve (boost) the confidence
- To improve (boost) the accuracy
- by training different weak
learners on slightly different datasets
|
17
|
|
18
|
- Training
- Sample randomly from the distribution of examples
- Train hypothesis H1 on the sample
- Evaluate the accuracy of H1 on the distribution
- Sample randomly such that H1 classifies half of the samples correctly
and the other half incorrectly; train hypothesis H2 on this sample
- Train H3 on samples from the distribution on which H1
and H2 disagree
- Test
- For each example, decide according to the majority vote of H1,
H2 and H3 (a rough sketch of this scheme follows)
|
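A rough Python sketch of the scheme, under assumed interfaces: draw() is a hypothetical helper returning one (x, y) pair from the underlying data distribution, and base_learner(S) is a hypothetical weak learner that trains on the list S and returns a classifier callable on x. The sample sizes and the filtered sampling used for the second set are simplifications, not the slides' exact procedure.

```python
import random

def boost_three(draw, base_learner, n=1000):
    # 1) h1: weak learner trained on an ordinary random sample
    S1 = [draw() for _ in range(n)]
    h1 = base_learner(S1)

    # 2) h2: trained on a sample on which h1 is correct for half of the
    #    examples and incorrect for the other half (filtered sampling)
    S2 = []
    while len(S2) < n:
        want_correct = random.random() < 0.5      # fair coin: half correct, half incorrect
        while True:                               # draw until a matching example appears
            x, y = draw()
            if (h1(x) == y) == want_correct:
                break
        S2.append((x, y))
    h2 = base_learner(S2)

    # 3) h3: trained on examples where h1 and h2 disagree
    S3 = []
    while len(S3) < n:
        x, y = draw()
        if h1(x) != h2(x):
            S3.append((x, y))
    h3 = base_learner(S3)

    # Final classifier: majority vote of h1, h2, h3
    def h(x):
        votes = [h1(x), h2(x), h3(x)]
        return max(set(votes), key=votes.count)
    return h
```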
19
|
- If each hypothesis has an error ε₀, the final classifier has error <
g(ε₀) = 3ε₀² − 2ε₀³
- Accuracy improved!
- Apply recursively to get to the target accuracy!
|
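One way to see where g(ε₀) comes from (a simplified argument that assumes the three hypotheses err independently, which the actual proof does not require): the majority vote fails exactly when at least two of the three hypotheses fail, so

```latex
g(\epsilon_0) = \binom{3}{2}\,\epsilon_0^{2}(1-\epsilon_0) + \epsilon_0^{3}
             = 3\epsilon_0^{2} - 2\epsilon_0^{3}
             \;<\; \epsilon_0
\qquad \text{for } 0 < \epsilon_0 < \tfrac{1}{2}.
```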
20
|
- Similarly to boosting the accuracy, we can boost the confidence at some small cost in accuracy
- The key result: we can improve both the accuracy and the confidence
- Problems with the theoretical algorithm
- It requires a classifier that is better than 50 % on every data distribution
- We cannot properly sample from the data distribution
- The method requires a large training set
- Solution to the sampling problem:
- Boosting by sampling
- AdaBoost algorithm and variants
|
21
|
- AdaBoost: boosting by sampling
- Classification (Freund, Schapire; 1996)
- AdaBoost.M1 (two-class problem)
- AdaBoost.M2 (multiple-class
problem)
- Regression (Drucker; 1997)
|
22
|
- Given:
- A training set of N examples (attributes + class label pairs)
- A “base” learning model
(e.g. a decision tree, a
neural network)
- Training stage:
- Train a sequence of T “base” models
on T different sampling distributions defined upon the training set (D)
- The sampling distribution Dt used to build model t is
constructed by modifying the
sampling distribution Dt-1 from step (t-1).
- Examples classified incorrectly in the previous step receive higher
weights in the new distribution (an attempt to cover the misclassified samples)
- Application (classification) stage:
- Classify according to the weighted majority of classifiers
|
23
|
|
24
|
- Training (step t)
- Sampling distribution Dt(i): the probability that
example i from the original training dataset is selected
- Take K samples from the training set according to Dt
- Train a classifier ht on the samples
- Calculate the weighted error of ht under Dt
- Classifier weight wt: proportional to the accuracy of ht
- New sampling distribution Dt+1: misclassified examples receive higher
weight (a sketch with the standard update formulas follows this slide)
|
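A minimal AdaBoost sketch for two-class labels y in {-1, +1}, assuming NumPy and scikit-learn. The formulas are the standard AdaBoost versions (weighted error, weight w = ½ ln((1 − ε)/ε), multiplicative reweighting); the slides' AdaBoost.M1 parameterization may differ slightly, and example weighting is used here in place of explicit resampling according to Dt.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    """X: (N, d) NumPy array, y: NumPy array of labels in {-1, +1}."""
    N = len(X)
    D = np.full(N, 1.0 / N)                    # sampling distribution D_1: uniform
    models, weights = [], []
    for t in range(T):
        # Weak "base model" trained with D_t as example weights (instead of resampling)
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()               # weighted error of h_t under D_t
        if eps <= 0 or eps >= 0.5:             # perfect or no-better-than-chance learner: stop
            break
        w = 0.5 * np.log((1 - eps) / eps)      # classifier weight: larger for more accurate h_t
        D = D * np.exp(-w * y * pred)          # raise weights of misclassified, lower of correct
        D = D / D.sum()                        # renormalize so D_{t+1} is a distribution
        models.append(h)
        weights.append(w)
    return models, weights

def adaboost_predict(models, weights, X):
    """Weighted majority vote: sign of the weighted sum of the +/-1 predictions."""
    scores = sum(w * h.predict(X) for h, w in zip(models, weights))
    return np.sign(scores)
```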
25
|
|
26
|
|
27
|
- We have T different classifiers ht
- The weight wt of each classifier is proportional to its accuracy
on the training set
- Classification:
- For every class j = 0, 1
- Compute the sum of the weights wt over ALL classifiers that
predict class j
- Output the class with the maximal sum of weights (weighted
majority; a small sketch follows)
|
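A small sketch of the weighted-majority rule (the function name and the two-class integer encoding are illustrative):

```python
import numpy as np

def weighted_majority(preds, weights, n_classes=2):
    """preds: class labels (0..n_classes-1) predicted by the T classifiers;
    weights: the corresponding classifier weights w_t."""
    sums = np.zeros(n_classes)
    for p, w in zip(preds, weights):
        sums[p] += w                      # add w_t to the class this classifier votes for
    return int(sums.argmax())             # class with the maximal total weight

# With the example on the next slide ("yes" = 1, "no" = 0):
# weighted_majority([1, 0, 0], [0.7, 0.3, 0.2]) -> 1, since 0.7 > 0.3 + 0.2
```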
28
|
- Classifier 1: “yes”, weight 0.7
- Classifier 2: “no”, weight 0.3
- Classifier 3: “no”, weight 0.2
- Weighted majority: “yes” (0.7) vs. “no” (0.3 + 0.2 = 0.5)
- The final choice is “yes” (+1)
|
29
|
- Each classifier specializes in a particular subset of examples
- The algorithm concentrates on “more and more difficult” examples
- Boosting can:
- Reduce the variance (the same as Bagging)
- But also eliminate the effect of the high bias of the weak learner
(unlike Bagging)
- Training versus test error behavior:
- Training errors can be driven close to 0
- But test errors do not show overfitting
- Proofs and theoretical explanations in a number of papers
|
30
|
|
31
|
- An alternative way to combine multiple models; it can be used in supervised
and unsupervised frameworks
- For example:
- The likelihood of the data can be expressed by averaging over the multiple
models
- Prediction: average the models' predictions, weighting each model by its
posterior probability given the data
|
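A minimal sketch of the averaged prediction (assumed interface: each model exposes a predict_proba(x) method returning a class distribution, and the posterior weights are given and sum to 1):

```python
import numpy as np

def averaged_prediction(models, posteriors, x):
    """Combine the class distributions of several models, weighted by
    each model's posterior probability P(model | data)."""
    return sum(p * m.predict_proba(x) for m, p in zip(models, posteriors))
```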