Bootstrap, Bagging, Boosting

"Unity is strength"

Posted by dliu on January 29, 2020

Boosting vs Bagging

Bagging (bootstrap aggregating)

Bagging is short for bootstrap aggregating, and its algorithm is as follows:

A) Draw training sets from the original sample set. In each round, n training samples are drawn from the original sample set by bootstrapping, i.e., sampling with replacement, so some samples may be drawn multiple times while others may never be drawn. A total of k rounds of sampling are performed to obtain k training sets. (The k training sets are independent of each other.)

B) Train one model on each training set, so the k training sets yield k models in total. (Note: no specific classification or regression algorithm is prescribed here; we can use different classification or regression methods according to the specific problem, such as decision trees, perceptrons, etc.)

C) For classification problems, the k models obtained in the previous step vote to produce the final classification; for regression problems, the average of the k models' predictions is taken as the final result. (All models have the same importance.)
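As a concrete sketch of steps A)–C) (my own illustration; the dataset, the choice of decision trees as base models, and all settings are assumptions, not from the original post), bagging can be written by hand in a few lines:

```python
# A minimal hand-rolled bagging sketch (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
k, n = 25, X.shape[0]                      # k rounds, n samples drawn per round

models = []
for _ in range(k):
    idx = rng.integers(0, n, size=n)       # A) bootstrap: draw n indices with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))   # B) one model per training set

# C) classification: majority vote over the k models (all models weighted equally)
votes = np.stack([m.predict(X) for m in models])                  # shape (k, n)
y_hat = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("training accuracy of the bagged vote:", (y_hat == y).mean())
```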

Boosting

The main idea is to assemble weak classifiers into a strong classifier. Under the PAC (probably approximately correct) learning framework, weak classifiers can be combined into a strong classifier.

Two core questions about Boosting:

1) How to change the weight or probability distribution of training data in each round?

By increasing the weights of the samples that were misclassified by the weak classifier in the previous round and reducing the weights of the samples that were correctly classified, the next classifier will perform better on the previously misclassified data.

2) What methods are used to combine weak classifiers?

The weak classifiers are combined linearly through an additive model. For example, AdaBoost uses weighted majority voting: it increases the weights of classifiers with smaller error rates and reduces the weights of classifiers with larger error rates.
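As a reminder of how AdaBoost implements both answers (this is the standard textbook formulation, added here for reference), the classifier weight and the sample-weight update are

$$
\alpha_m = \frac{1}{2}\ln\frac{1-\varepsilon_m}{\varepsilon_m},
\qquad
w_i^{(m+1)} = \frac{w_i^{(m)}\exp\bigl(-\alpha_m\, y_i\, G_m(x_i)\bigr)}{Z_m},
$$

where $\varepsilon_m$ is the weighted error of the $m$-th weak classifier $G_m$, $y_i \in \{-1,+1\}$, and $Z_m$ is a normalizer. Misclassified samples ($y_i G_m(x_i) = -1$) get their weights increased, and classifiers with smaller $\varepsilon_m$ receive larger $\alpha_m$ in the final weighted majority vote $\operatorname{sign}\bigl(\sum_m \alpha_m G_m(x)\bigr)$.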

Gradient boosting, for example, gradually reduces the residual error of the fit; the model generated at each step is added to the ensemble to obtain the final model.

The difference between Bagging and Boosting

1) Sample selection:

Bagging: The training set is sampled with replacement from the original set, and the training sets selected in different rounds are independent of each other.

Boosting: The training set stays the same in every round, but the weight of each sample in the training set changes; the weights are adjusted according to the classification results of the previous round.

2) Sample weights during training:

Bagging: Uses uniform sampling; each sample has equal weight.

Boosting: The sample weights are continuously adjusted according to the errors: the more often a sample is misclassified, the greater its weight.

3) Prediction function:

Bagging: All prediction functions have equal weight.

Boosting: Each weak classifier has a corresponding weight; in the final vote, classifiers with smaller classification errors get greater weights.

4) Parallel computing:

Bagging: Each prediction function can be generated in parallel

Boosting: Each prediction function can only be generated sequentially, because each model's parameters depend on the results of the previous round's model.

Bootstrap

Codes
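A minimal sketch of the plain bootstrap for an uncertainty estimate, assuming we care about the standard error and a percentile confidence interval of the sample mean (the data below is synthetic and purely illustrative):

```python
# Bootstrap estimate of the uncertainty of a statistic (illustrative sketch).
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=200)      # synthetic sample (assumption)

B = 5000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()   # resample with replacement
    for _ in range(B)
])

print("point estimate:", data.mean())
print("bootstrap standard error:", boot_means.std(ddof=1))
print("95% percentile CI:", np.percentile(boot_means, [2.5, 97.5]))
```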

Bagging (Bootstrap aggregating)

Use the bootstrap for more accurate predictions (not just for uncertainty estimates)

Bagging for regression

Lowers the variance but keeps the same bias
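A standard way to see this claim (added here as a reference derivation): if each of the $B$ bagged models has variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\right)
= \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2},
$$

so the second term vanishes as $B$ grows, while the bias of the average equals the bias of a single model. The remaining $\rho\sigma^{2}$ term is what random forests attack by de-correlating the trees.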

Bagging for classification

The mode (majority vote) of i.i.d. classifiers that are each better than chance is a better classifier (see the simulation after this list)

  • use voting

crowds are wiser when

  • individuals are better than random
  • votes are uncorrelated
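A quick simulation of these conditions (the numbers are arbitrary assumptions): with $M$ independent classifiers that are each correct with probability $p > 0.5$, the majority vote is correct far more often than any individual.

```python
# Majority vote of independent, better-than-chance classifiers (illustrative simulation).
import numpy as np

rng = np.random.default_rng(0)
p, M, trials = 0.6, 25, 100_000     # per-classifier accuracy, committee size, repetitions

correct = rng.random((trials, M)) < p          # True where an individual classifier is right
vote_correct = correct.sum(axis=1) > M / 2     # the vote is right if more than half are right
print("individual accuracy:", p)
print("majority-vote accuracy:", vote_correct.mean())   # roughly 0.85 for p=0.6, M=25
```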

Example

Random forests

Reduce the correlation between decision trees

Feature sub-sampling

Only a random subset of the features is available for the split at each step, which further reduces the dependence between decision trees.
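A minimal scikit-learn sketch combining both ideas, bagged trees plus feature sub-sampling at each split (the dataset and hyper-parameters are illustrative assumptions, not recommendations):

```python
# Random forest: bagged trees with a random feature subset at each split (sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,       # number of bootstrapped trees
    max_features="sqrt",    # random subset of features considered at each split
    bootstrap=True,         # sample the rows with replacement (bagging)
    random_state=0,
)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```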

Features of Random Forest

  • It is unexcelled in accuracy among current algorithms.
  • It runs efficiently on large data bases.
  • It can handle thousands of input variables without variable deletion.
  • It gives estimates of what variables are important in the classification.
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses.
  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
  • It has methods for balancing error in class population unbalanced data sets.
  • Generated forests can be saved for future use on other data.
  • Prototypes are computed that give information about the relation between the variables and the classification.
  • It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
  • The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
  • It offers an experimental method for detecting variable interactions

A random forest does not overfit as more trees are added, because each tree is trained independently from scratch and the forest averages all of their predictions, which averages out the errors of the individual trees. In general, more trees result in a more stable generalization error.

Formula proof:

Stack Exchange Answer


Out of bag (OOB) samples

  • the instances not included in a bootstrap dataset can be used for validation
  • simultaneous validation of decision trees in a forest
  • no need to set aside data for cross validation

Out Of Bag (OOB) error can be used for parameter tuning (e.g., size of the forest)
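A short sketch of OOB validation in scikit-learn (the dataset and forest sizes are assumptions); each tree is evaluated on the instances left out of its bootstrap sample, so no separate validation split is needed:

```python
# Out-of-bag (OOB) error as a built-in validation signal (sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Compare forest sizes using the OOB error instead of a held-out set.
for n_trees in (25, 100, 400):
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,      # evaluate each tree on the samples it did not see
        bootstrap=True,
        random_state=0,
    ).fit(X, y)
    print(n_trees, "trees, OOB error:", 1 - forest.oob_score_)
```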

Summary

  • Bootstrap is a powerful technique to get uncertainty estimates
  • Bootstrap aggregation (Bagging) can reduce the variance of unstable models
  • Random forests (subsample data and features):
    • Bagging + further de-correlation of features at each split
    • OOB validation instead of CV
    • destroy interpretability of decision trees
    • perform well in practice
    • can fail if only few relevant features exist (due to feature-sampling)

Adaptive bases

Optimization idea

Example

Exponential loss

Note that the exponential loss grows faster than the other surrogate losses (it is more sensitive to outliers).
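For reference (the standard definition, added here rather than quoted from the post), the exponential loss for a label $y \in \{-1,+1\}$ and score $f(x)$ is

$$
L\bigl(y, f(x)\bigr) = \exp\bigl(-y\, f(x)\bigr),
$$

which blows up exponentially for badly misclassified points (large negative margin $y f(x)$), whereas the hinge loss grows only linearly; this is why the exponential loss is more sensitive to outliers.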

AdaBoost

AdaBoost algorithm

AdaBoost detailed derivations

Discrete AdaBoost

Discrete AdaBoost Algorithm Example

Example

Decision stump: a decision tree with a single split (one internal node)
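A minimal AdaBoost-with-stumps sketch in scikit-learn (settings are illustrative assumptions; this is not the exact example worked in the post):

```python
# AdaBoost with decision stumps as the weak learners (sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)   # decision stump: a single split
# Note: older scikit-learn versions call this parameter base_estimator.
ada = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```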

Gradient boosting

Algorithm

Gradient tree boosting
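A bare-bones gradient boosting sketch for squared error (my own illustration of the idea; for squared loss the negative gradient is simply the residual, so each new tree is fitted to the residuals of the current ensemble):

```python
# Gradient boosting for regression with squared loss: each tree fits the
# negative gradient of the loss, which for squared error is the residual (sketch).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)    # synthetic 1-D regression task

learning_rate, n_steps = 0.1, 100
F = np.full_like(y, y.mean())        # start from the constant prediction
trees = []

for _ in range(n_steps):
    residual = y - F                              # negative gradient of 1/2 (y - F)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    F += learning_rate * tree.predict(X)          # additive update of the ensemble

print("training MSE:", np.mean((y - F) ** 2))
```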

two ensemble methods

  • bagging & random forests (reduce variance)
    • produce models with minimal correlation
    • use their average prediction
  • boosting (reduces the bias of the weak learner)
    • models are added in steps
    • a single cost function is minimized
    • for exponential loss: interpret as re-weighting the instance (AdaBoost)
    • gradient boosting: fit the weak learner to the negative of the gradient
    • interpretation as L1 regularization for “weak learner”-selection
    • also related to max-margin classification (for large number of steps T)
  • random forests and (gradient) boosting generally perform very well

Some loss functions for gradient boosting
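Common choices and their negative gradients (a standard summary added here for reference; the next weak learner is fitted to the negative gradient):

$$
\begin{aligned}
\text{Squared loss: } & \tfrac{1}{2}\bigl(y - f(x)\bigr)^2, & -\partial L/\partial f &= y - f(x) \;\text{(the residual)}\\
\text{Absolute loss: } & \lvert y - f(x)\rvert, & -\partial L/\partial f &= \operatorname{sign}\bigl(y - f(x)\bigr)\\
\text{Exponential loss: } & \exp\bigl(-y f(x)\bigr), & -\partial L/\partial f &= y\,\exp\bigl(-y f(x)\bigr)\\
\text{Logistic loss: } & \log\bigl(1 + e^{-y f(x)}\bigr), & -\partial L/\partial f &= \frac{y}{1 + e^{\,y f(x)}}
\end{aligned}
$$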

Maximum Likelihood Estimation

  1. Given a pdf with an unknown parameter
  2. Write down the likelihood of the data X under the pdf
  3. Take the log of step 2
  4. Set the derivative with respect to the parameter you want to estimate equal to 0
  5. Solve for the MLE of that parameter (a worked example follows)
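A small worked example of the five steps, assuming an i.i.d. sample $x_1,\dots,x_n$ from an exponential distribution with rate $\lambda$ (my illustration):

$$
\begin{aligned}
&\text{1--2. } L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^{n} e^{-\lambda \sum_i x_i} \\
&\text{3. } \ell(\lambda) = n\ln\lambda - \lambda \sum_{i=1}^{n} x_i \\
&\text{4. } \frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0 \\
&\text{5. } \hat\lambda_{\text{MLE}} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}
\end{aligned}
$$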

David Dalpiaz Homework, pg. 4

Resources

Boosting vs Bagging