1

 Milos Hauskrecht
 milos@cs.pitt.edu
 5329 Sennott Square

2

 Exam:
 April 18, 2007
 Term projects & project presentations:
 April 25, 2007
 At 1:00–4:00pm in SNSQ 5313
 No class:
 on April 23, 2007

3

 Mixture of experts
 Multiple ‘base’ models (classifiers, regressors), each covers a
different part (region) of the input space
 Committee machines:
 Multiple ‘base’ models (classifiers, regressors), each covers the
complete input space
 Each base model is trained on a slightly different training set
 Combine predictions of all models to produce the output
 Goal: Improve the accuracy of the ‘base’ model
 Methods:
 Bagging
 Boosting
 Stacking (not covered)

4

 Given:
 Training set of N examples
 A class of learning models (e.g. decision trees, neural networks, …)
 Method:
 Train multiple (k) models on different samples (data splits) and
average their predictions
 Predict (test) by averaging the results of k models
 Goal:
 Improve the accuracy of a single model by combining the predictions of its multiple copies
 Averaging the misclassification errors on different data splits gives a
better estimate of the predictive ability of the learning method

5

 Training
 In each iteration t, t=1,…T
 Randomly sample with replacement N samples from the training set
 Train a chosen “base model” (e.g. neural network, decision tree) on
the samples
 Test
 For each test example
 Run each of the T trained base models
 Predict by combining results of all T trained models:
 Regression: averaging
 Classification: a majority vote
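The training and test procedure above can be sketched in Python. This is a minimal illustration, not the slides' code: the 1-nearest-neighbour base learner `nn_learner` and all function names are our own assumptions.

```python
import random
from collections import Counter

def nn_learner(xs, ys):
    """Hypothetical 'base model': 1-nearest-neighbour on the given sample."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def bagging_train(xs, ys, base_learner, T, seed=0):
    """In each of T iterations, draw N examples with replacement
    from the training set and train a base model on them."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(T):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        models.append(base_learner([xs[i] for i in idx],
                                   [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Classification: majority vote over the T trained models."""
    return Counter(m(x) for m in models).most_common(1)[0][0]
```

For regression the last step would average the T outputs instead of taking a majority vote.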

6


7

 Expected error = Bias + Variance
 Expected error is the expected discrepancy between the estimated and
true function
 Bias is squared discrepancy between averaged estimated and true
function
 Variance is expected divergence of the estimated function vs. its
average value
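In symbols, writing f for the true function and f̂ for the estimated one, the decomposition in the three bullets above (assuming squared error) reads:

```latex
\underbrace{E\big[(\hat{f}(x) - f(x))^{2}\big]}_{\text{expected error}}
= \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^{2}}_{\text{bias}}
+ \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^{2}\big]}_{\text{variance}}
```

The expectations are taken over the random draw of the training set used to fit f̂.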

8

 Underfitting:
 High bias (models are not accurate)
 Small variance (smaller
influence of examples in the training set)
 Overfitting:
 Small bias (models flexible enough to fit well to training data)
 Large variance (models depend
very much on the training set)

9

 Example
 Assume we measure a random variable x with a N(m, s^{2}) distribution
 If only one measurement x_{1} is done:
 The expected mean of the measurement is m
 The variance is Var(x_{1}) = s^{2}
 If the random variable x is measured K times (x_{1}, x_{2}, …, x_{K})
and the value is estimated as (x_{1}+x_{2}+…+x_{K})/K:
 The mean of the estimate is still m
 But the variance is smaller:
 [Var(x_{1})+…+Var(x_{K})]/K^{2} = K s^{2}/K^{2} = s^{2}/K
 Observe: Bagging is a kind of averaging!
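The variance reduction can be checked empirically; a small simulation sketch (the function name and parameters are illustrative):

```python
import random
import statistics

def variance_of_average(K, trials=20000, m=0.0, s=1.0, seed=0):
    """Estimate the variance of the average of K measurements of
    x ~ N(m, s^2) by repeating the K-measurement experiment many times."""
    rng = random.Random(seed)
    averages = [statistics.fmean(rng.gauss(m, s) for _ in range(K))
                for _ in range(trials)]
    return statistics.pvariance(averages)
```

For s = 1 the estimate comes out close to 1 for K = 1 and close to 1/10 for K = 10, matching s^{2}/K.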

10

 Main property of Bagging (proof omitted):
 Bagging decreases the variance of the base model without changing the bias!
 Why? Averaging!
 Bagging typically helps:
 When applied with an overfitted base model
 High dependency on the actual training data
 It does not help much:
 High bias, i.e. when the base model is robust to changes in the training
data (due to sampling)

11

 Mixture of experts
 One expert per region
 Expert switching
 Bagging
 Multiple models on the complete space, a learner is not biased to any
region
 Learners are trained independently
 Boosting
 Every learner covers the complete space
 Learners are biased to regions not predicted well by other learners
 Learners are dependent

12

 PAC: Probably Approximately Correct framework
 PAC learning:
 Learning with a prespecified accuracy ε and confidence δ
 the probability that the misclassification error is larger than ε is smaller than δ
 Accuracy (ε): percent of correctly classified samples in the test
 Confidence (δ): the probability that this accuracy is achieved in one experiment

13

 Strong (PAC) learnability:
 There exists a learning algorithm that efficiently learns the
classification with a prespecified accuracy and confidence
 Strong (PAC) learner:
 A learning algorithm P that, given an arbitrary
 classification error ε (< 1/2), and
 confidence δ (< 1/2)
 Outputs a classifier
 With a classification accuracy > (1 − ε)
 With a confidence probability > (1 − δ)
 And runs in time polynomial in 1/δ, 1/ε
 Implies: the number of samples N is polynomial in 1/δ, 1/ε

14

 Weak learner:
 A learning algorithm (learner) W
 Providing classification accuracy > 1 − ε_{0}
 With probability > 1 − δ_{0}
 For some fixed and uncontrollable
 classification error ε_{0} (< 1/2)
 confidence δ_{0} (< 1/2)
 And this on an arbitrary distribution of the data

15

 Assume there exists a weak learner
 it is better than a random guess (50 %), with confidence higher than 50 %, on any data distribution
 Question:
 Is the problem also PAC-learnable?
 Can we generate an algorithm P that achieves an arbitrary (ε, δ) accuracy?
 Why is this important?
 Usual classification methods (decision trees, neural nets) have
specified, but uncontrollable, performance
 Can we improve the performance to achieve a prespecified accuracy
(confidence)?

16

 Proof due to R. Schapire
 An arbitrary (ε, δ) improvement is possible
 Idea: combine multiple weak learners together
 Weak learner W with confidence δ_{0} and maximal error ε_{0}
 It is possible:
 To improve (boost) the confidence
 To improve (boost) the accuracy
 by training different weak learners on slightly different datasets

17


18

 Training
 Sample randomly from the distribution of examples
 Train hypothesis H_{1} on the sample
 Evaluate the accuracy of H_{1} on the distribution
 Sample randomly such that H_{1} gives a correct result on half of the
samples and an incorrect result on the other half; train hypothesis H_{2} on this sample
 Train H_{3} on samples from the distribution on which H_{1}
and H_{2} classify differently
 Test
 For each example, decide according to the majority vote of H_{1},
H_{2} and H_{3}

19

 If each hypothesis has an error ε_{0}, the final classifier has error <
g(ε_{0}) = 3ε_{0}^{2} − 2ε_{0}^{3}
 Accuracy improved!
 Apply recursively to get to the target accuracy!
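The improvement can be checked numerically. A majority of three hypotheses errs exactly when at least two of them err, which for independent errors gives g(ε_{0}) = 3ε_{0}^{2}(1 − ε_{0}) + ε_{0}^{3} = 3ε_{0}^{2} − 2ε_{0}^{3}. A short sketch (function names are ours):

```python
def g(e0):
    """Error of a majority vote of three hypotheses, each wrong with
    probability e0: at least two of the three must err."""
    return 3 * e0**2 * (1 - e0) + e0**3        # = 3*e0**2 - 2*e0**3

def boost_recursively(e0, levels):
    """Apply Schapire's three-hypothesis construction recursively."""
    e = e0
    for _ in range(levels):
        e = g(e)
    return e
```

For any e0 < 1/2 we have g(e0) < e0, so each level strictly improves the error; starting from e0 = 0.3, five levels already push the error below 10^-3.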

20

 Similarly to boosting the accuracy, we can boost the confidence, at some restricted accuracy cost
 The key result: we can improve both the accuracy and the confidence
 Problems with the theoretical algorithm
 Requires a good (better than 50 %) classifier on all data distributions
 We cannot properly sample from the data distribution
 The method requires a large training set
 Solution to the sampling problem:
 Boosting by sampling
 AdaBoost algorithm and variants

21

 AdaBoost: boosting by sampling
 Classification (Freund, Schapire; 1996)
 AdaBoost.M1 (two-class problem)
 AdaBoost.M2 (multi-class problem)
 Regression (Drucker; 1997)

22

 Given:
 A training set of N examples (attributes + class label pairs)
 A “base” learning model
(e.g. a decision tree, a
neural network)
 Training stage:
 Train a sequence of T “base” models
on T different sampling distributions defined upon the training set (D)
 The sampling distribution D_{t} for building model t is
constructed by modifying the sampling distribution D_{t-1} from the (t-1)-th step.
 Examples classified incorrectly in the previous step receive higher
weights in the new data (attempts to cover misclassified samples)
 Application (classification) stage:
 Classify according to the weighted majority of classifiers

23


24

 Training (step t)
 Sampling distribution D_{t}(i) – the probability that
example i from the original training dataset is selected (initially uniform: D_{1}(i) = 1/N)
 Take K samples from the training set according to D_{t}
 Train a classifier h_{t} on the samples
 Calculate the error of h_{t}: ε_{t} = sum of D_{t}(i) over all examples i misclassified by h_{t}
 Classifier weight: w_{t} = log(1/β_{t}), where β_{t} = ε_{t}/(1 − ε_{t})
 New sampling distribution: D_{t+1}(i) ∝ D_{t}(i) β_{t} if h_{t} classifies
example i correctly, D_{t+1}(i) ∝ D_{t}(i) otherwise (renormalized to sum to 1)
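The training step above can be sketched end to end. Two hedges: this uses the common reweighting variant, in which the base model minimizes the weighted error directly instead of drawing K samples from D_{t}, and the base model is a hypothetical 1-D decision stump; all names are ours.

```python
import math

def stump_learner(xs, ys, weights):
    """Hypothetical base model: the 1-D decision stump (threshold +
    polarity) with the smallest weighted error; labels are 0/1."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (0, 1):                 # pol: label predicted on x <= thr
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (pol if x <= thr else 1 - pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    _, thr, pol = best
    return lambda x: pol if x <= thr else 1 - pol

def adaboost_train(xs, ys, T):
    n = len(xs)
    dist = [1.0 / n] * n                   # D_1: uniform over examples
    models, alphas = [], []
    for _ in range(T):
        h = stump_learner(xs, ys, dist)
        eps = sum(d for d, x, y in zip(dist, xs, ys) if h(x) != y)
        if eps <= 0 or eps >= 0.5:         # base model perfect or too weak
            break
        beta = eps / (1.0 - eps)
        models.append(h)
        alphas.append(math.log(1.0 / beta))    # classifier weight w_t
        # D_{t+1}: downweight correctly classified examples, renormalize
        dist = [d * (beta if h(x) == y else 1.0)
                for d, x, y in zip(dist, xs, ys)]
        z = sum(dist)
        dist = [d / z for d in dist]
    return models, alphas

def adaboost_classify(models, alphas, x):
    """Weighted majority vote over the classes {0, 1}."""
    score = [0.0, 0.0]
    for h, a in zip(models, alphas):
        score[h(x)] += a
    return 0 if score[0] >= score[1] else 1
```

On 1-D data labelled 0, 1, 1, 0 no single stump is exact, but a weighted vote of three boosted stumps can be.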

25


26


27

 We have T different classifiers h_{t}
 The weight w_{t} of classifier h_{t} is proportional to its accuracy
on the training set
 Classification:
 For every class j = 0,1:
 Compute the sum of the weights w_{t} of ALL classifiers that
predict class j
 Output the class that corresponds to the maximal sum of weights (weighted
majority)

28

 Classifier 1: “yes”, weight 0.7
 Classifier 2: “no”, weight 0.3
 Classifier 3: “no”, weight 0.2
 Weighted majority: “yes” gets 0.7, “no” gets 0.3 + 0.2 = 0.5
 The final choice is “yes” (class +1)
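The vote in this example can be reproduced with a few lines (a sketch; the helper name is ours):

```python
from collections import defaultdict

def weighted_vote(predictions):
    """predictions: (predicted_class, classifier_weight) pairs.
    Return the class with the largest total weight."""
    totals = defaultdict(float)
    for cls, weight in predictions:
        totals[cls] += weight
    return max(totals, key=totals.get)
```

Here `weighted_vote([("yes", 0.7), ("no", 0.3), ("no", 0.2)])` returns `"yes"`, since 0.7 > 0.3 + 0.2.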

29

 Each classifier specializes on a particular subset of examples
 The algorithm concentrates on “more and more difficult” examples
 Boosting can:
 Reduce variance (the same as Bagging)
 But also eliminate the effect of the high bias of the weak learner
(unlike Bagging)
 Training versus test error performance:
 Training errors can be driven close to 0
 But test errors do not show overfitting
 Proofs and theoretical explanations in a number of papers

30


31

 An alternative way to combine multiple models; it can be used in both supervised
and unsupervised frameworks
 For example:
 The likelihood of the data can be expressed by averaging over the multiple
models
 Prediction:
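Assuming the slide intends standard Bayesian model averaging (the usual reading of “averaging over the multiple models”), the two quantities can be written as:

```latex
P(D) = \sum_{m} P(D \mid m)\, P(m)
\qquad
P(y \mid x, D) = \sum_{m} P(y \mid x, m, D)\, P(m \mid D)
```

Each model m contributes to the prediction in proportion to its posterior probability given the data.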
