ML Models Pros vs Cons

"All models are wrong"

Posted by dliu on March 1, 2020

“Do one thing at a time, and do well”


Expected error = bias² + variance + irreducible error; there is a tradeoff between bias and variance as model complexity grows (written out after the figure note below).

[Figure: (expected) test and train error versus model complexity; as complexity runs from low to high, bias falls and variance rises.]
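
For squared error this is the standard decomposition (with the last term the irreducible noise):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$$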

Supervised vs. unsupervised:

Supervised (regression or classification): train the model with labeled responses y

Unsupervised (clustering): only inputs

Algorithms:

KNN (non-parametric)

KNN classifies an object based on the classes of its nearest neighbors in the dataset, as measured by a distance function; it assumes that objects near each other are similar.

Tuning parameters: 1. Number of nearest neighbors, K 2. Distance function (e.g., Euclidean). A minimal sketch follows the pros/cons list below.

  1. Pros:
    1. Simple
    2. Works well on low-dimensional datasets
  2. Cons:
    1. Struggles on high-dimensional data (distances become less meaningful as dimensionality grows)
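
A minimal sketch with scikit-learn; the iris data and K = 5 are illustrative choices, not from the notes:

```python
# KNN sketch: the two tuning parameters above map to n_neighbors and metric.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```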

Decision Tree

A decision tree predicts a response by following the decisions in the tree from the root down to a leaf node. The tree consists of branching conditions where the value of a predictor is compared to a learned threshold.

Tuning parameters: 1. Maximum depth 2. Minimum samples in a leaf 3. Maximum number of leaves. A sketch follows the cons below.

Pros: 1. Easily visualized and interpreted 2. No feature normalization needed 3. Works well with a mixture of feature types

Cons: 1. Can often overfit the data 2. Usually needs an ensemble of trees for better performance
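
A minimal sketch; the three tuning parameters above map directly to scikit-learn arguments (the values chosen here are illustrative):

```python
# Decision tree sketch with the tuning parameters from the notes.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, max_leaf_nodes=10)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```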

Random Forest: averages many fully grown decision trees (each low bias, high variance)

  1. Averaging lowers the variance
  2. Often faster and more accurate than SVM

Boosted Tree: sequentially combines shallow decision trees (each high bias, low variance)

  1. Boosting lowers the bias (a sketch contrasting the two follows)
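
A sketch of the contrast, assuming scikit-learn's implementations and an illustrative dataset: the forest grows deep trees and averages them, while gradient boosting adds up shallow ones.

```python
# Random forest averages deep (low-bias, high-variance) trees;
# gradient boosting sums shallow (high-bias, low-variance) trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0)  # fully grown trees
boost = GradientBoostingClassifier(n_estimators=200, max_depth=2, random_state=0)  # shallow trees

for name, model in [("forest", forest), ("boosting", boost)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```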

SVM

SVC: it classifies data by finding the decision boundary (hyperplane) that separates the data points of one class from those of the other while maximizing the margin between the two classes. The vectors that define the hyperplane are the support vectors; removing non-support vectors does not affect the margin. (The margin is the minimal distance from the observations to the hyperplane.)

SVM: an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels (linear, polynomial, and radial).

A kernel is a function that quantifies the similarity of two observations. When we use kernels, we need to compute the kernel function (an inner product) for each of the O(n²) pairs of training points.

Tuning parameters: 1. Kernel 2. Kernel parameters (e.g., gamma for the radial kernel) 3. Regularization parameter C

Pros: 1. Can perform well on a range of datasets 2. Works well for both low- and high-dimensional data

Cons: 1. Efficiency decreases as training-set size increases 2. Needs careful normalization and parameter tuning

  1. An SVM with a linear kernel behaves much like logistic regression
  2. If the data are not linearly separable, use a non-linear kernel
  3. Handles high-dimensional spaces well
  4. High accuracy
  5. Maximizing the margin helps avoid overfitting
  6. No distributional assumptions needed
  7. Does not suffer from multicollinearity
  8. Requires significant memory and processing power
  9. Beyond roughly 10,000 examples, training starts taking too long (a sketch follows this list)
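
A minimal sketch; the RBF kernel and parameter values are illustrative, and the scaler reflects the normalization caveat above:

```python
# SVM sketch: kernel, gamma, and C are the tuning parameters named above.
# SVMs need feature scaling, hence the StandardScaler in the pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```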

Naïve Bayes (GM)

$$p(C \mid F_1, \ldots, F_n) = \frac{p(C)\, p(F_1, \ldots, F_n \mid C)}{p(F_1, \ldots, F_n)}$$

  1. Pros:
    1. Simple
    2. Needs less data when the conditional-independence assumption holds
    3. No distribution requirement
    4. Good for categorical variables with few categories
  2. Cons:
    1. Multicollinearity (correlated features violate the independence assumption; a sketch follows)
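
A minimal sketch with the Gaussian variant; the iris data is an illustrative choice:

```python
# Gaussian naive Bayes: each feature is modeled as conditionally independent
# given the class, so fitting reduces to per-class means and variances.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
print("CV accuracy:", cross_val_score(GaussianNB(), X, y, cv=5).mean())
```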

Gaussian Processes (used in geostatistics)

  1. An infinite collection of random variables, any finite subset of which is jointly Gaussian
  2. Pros:
    1. Requires few assumptions about the data
    2. Powerful (a sketch follows)
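
A minimal regression sketch; the RBF kernel and the toy sine data are illustrative assumptions, not something fixed by these notes:

```python
# Gaussian process regression on toy 1-D data; predict() with return_std=True
# returns the mean and standard deviation of the predictive distribution.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(20, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=20)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01)
gp.fit(X, y)
mean, std = gp.predict(np.array([[2.5]]), return_std=True)
print(mean, std)
```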

K-means (unsupervised); choose the best K with the elbow method (sketched after the pros/cons below)

[Elbow plot: within-cluster sum of squares versus number of clusters.]

  1. Pros:
    1. Fast
    2. Can detect outliers
  2. Cons:
    1. Clusters are assumed spherical; cannot detect groups of other shapes
    2. Multicollinearity
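
An elbow-method sketch; the blob data and the range of K are illustrative:

```python
# Fit K-means for several K and print the within-cluster sum of squares
# (inertia_); the "elbow" where it stops dropping quickly suggests the best K.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)
```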

Lasso

  1. Pros:
    1. No distribution needed
    2. Variable selection (the L1 penalty zeroes out coefficients)
    3. Minimizes RMSE
  2. Cons:
    1. Multicollinearity (tends to pick one of a group of correlated variables arbitrarily)

Ridge

  1. Pros:
    1. No distribution needed
    2. Does not suffer from multicollinearity
    3. Minimizes RMSE
  2. Cons:
    1. No variable selection (coefficients shrink toward zero but stay nonzero; the contrast is sketched below)
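
A sketch of the Lasso/Ridge contrast on synthetic data (the alpha values and dataset are illustrative):

```python
# L1 (Lasso) zeroes out coefficients -> variable selection;
# L2 (Ridge) only shrinks them, so all stay nonzero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("nonzero lasso coefs:", np.sum(lasso.coef_ != 0))  # sparse
print("nonzero ridge coefs:", np.sum(ridge.coef_ != 0))  # all 20
```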

Logistic (classification)

It fits a model that can predict the probability of a binary response belonging to one class or the other.

Cost Function of Logistic Regression

The cross-entropy (log loss) between the predicted probabilities and the observed labels (written out below).
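
Writing the predicted probability for observation i as a sigmoid of the linear predictor, the usual form is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \log \hat{p}_i + (1 - y_i) \log\big(1 - \hat{p}_i\big) \Big], \qquad \hat{p}_i = \sigma\big(\theta^\top x_i\big)$$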

Use multinom() (from R's nnet package) when the response contains more than two categories.

  1. Pros:
    1. Can transform non-linear features into linear terms by feature engineering
    2. With regularization, resists noise and overfitting and can perform feature selection
    3. No distribution needed
    4. Easy to interpret
    5. No need to worry about highly correlated features when regularized
  2. Cons:
    1. Multicollinearity (unregularized coefficients become unstable; a sketch follows)
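
A minimal sketch; the dataset and max_iter value are illustrative:

```python
# Logistic regression: predict_proba returns the class-membership
# probabilities described above.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logit = LogisticRegression(max_iter=5000)
logit.fit(X_train, y_train)
print(logit.predict_proba(X_test[:3]))  # P(class 0), P(class 1) per row
```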

PCA

  1. Removes collinearity
  2. Reduces dimensionality
  3. Useful for highly correlated datasets (a sketch follows)
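
A minimal sketch; the dataset and the 95%-variance threshold are illustrative choices:

```python
# PCA: project correlated features onto orthogonal components, keeping
# enough components to explain ~95% of the variance.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=0.95)  # keep 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)
```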

OVERALL:

  1. If the variables are normally distributed and the categorical variables all have 5+ categories: use linear discriminant analysis
  2. If the correlations are mostly nonlinear: use SVM
  3. If sparsity and multicollinearity are a concern: use the adaptive Lasso (Ridge regression to obtain the weights, then a weighted Lasso)

General Idea

  1. Regression: predict a continuous response
  2. Classification: predict a discrete class label
  3. ANOVA: test whether group means differ
  4. Cluster Analysis: group unlabeled observations by similarity
  5. Discriminant Analysis: classify using the class-conditional distributions of the predictors
  6. Logistic: model the probability of a binary outcome
  7. ROC curve (Receiver Operating Characteristic curve): sweep through all possible cutoffs and plot sensitivity against 1 − specificity (a sketch follows the figure note)

[Figure: ROC curve with sensitivity on the vertical axis, scaled 0.0 to 1.0.]
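
A minimal sketch; the dataset and classifier are illustrative:

```python
# roc_curve sweeps every cutoff and returns the false-positive rate
# (1 - specificity) and true-positive rate (sensitivity) at each one.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```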