ML Models Pros vs Cons

"All models are wrong"

Posted by dliu on March 1, 2020

“Do one thing at a time, and do well”


Expected error = bias² + variance + irreducible error; there is a tradeoff between bias and variance as model complexity grows (written out after the figure note below).

[Figure: (expected) test and train error versus model complexity; as complexity runs from low to high, bias falls and variance rises.]
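
For squared error this is the standard decomposition (with the last term the irreducible noise):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$$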

Supervised vs. unsupervised:

Supervised (regression or classification): train the model with labeled responses y

Unsupervised (clustering): only inputs

Algorithms:

KNN (non-parametric)

KNN classifies an object based on the classes of its nearest neighbors in the dataset, as measured by a distance function; it assumes that objects near each other are similar.

Tuning parameters: 1. Number of nearest neighbors, K 2. Distance function (e.g., Euclidean). A minimal sketch follows the pros/cons list below.

  1. Pros:
    1. Simple
    2. Works well on low-dimensional datasets
  2. Cons:
    1. Struggles on high-dimensional data (distances become less meaningful as dimensionality grows)
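
A minimal sketch with scikit-learn; the iris data and K = 5 are illustrative choices, not from the notes:

```python
# KNN sketch: the two tuning parameters above map to n_neighbors and metric.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```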

Decision Tree

A decision tree predicts a response by following the decisions in the tree from the root down to a leaf node. The tree consists of branching conditions where the value of a predictor is compared to a learned threshold.

Tuning parameters: 1. Maximum depth 2. Minimum samples in a leaf 3. Maximum number of leaves. A sketch follows the cons below.

Pros: 1. Easily visualized and interpreted 2. No feature normalization needed 3. Works well with a mixture of feature types

Cons: 1. Can often overfit the data 2. Usually needs an ensemble of trees for better performance
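
A minimal sketch; the three tuning parameters above map directly to scikit-learn arguments (the values chosen here are illustrative):

```python
# Decision tree sketch with the tuning parameters from the notes.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, max_leaf_nodes=10)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```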

Random Forest: averages many fully grown decision trees (each low bias, high variance)

  1. Averaging lowers the variance
  2. Often faster and more accurate than SVM

Boosted Tree: sequentially combines shallow decision trees (each high bias, low variance)

  1. Boosting lowers the bias (a sketch contrasting the two follows)
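
A sketch of the contrast, assuming scikit-learn's implementations and an illustrative dataset: the forest grows deep trees and averages them, while gradient boosting adds up shallow ones.

```python
# Random forest averages deep (low-bias, high-variance) trees;
# gradient boosting sums shallow (high-bias, low-variance) trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0)  # fully grown trees
boost = GradientBoostingClassifier(n_estimators=200, max_depth=2, random_state=0)  # shallow trees

for name, model in [("forest", forest), ("boosting", boost)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```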

SVM

SVC: it classifies data by finding the decision boundary (hyperplane) that separates the data points of one class from those of the other while maximizing the margin between the two classes. The vectors that define the hyperplane are the support vectors; removing non-support vectors does not affect the margin. (The margin is the minimal distance from the observations to the hyperplane.)

SVM: an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels (linear, polynomial, and radial).

A kernel is a function that quantifies the similarity of two observations. When we use kernels, we need to compute the kernel function (an inner product) for each of the O(n²) pairs of training points.

Tuning parameters: 1. Kernel 2. Kernel parameters (e.g., gamma for the radial kernel) 3. Regularization parameter C

Pros: 1. Can perform well on a range of datasets 2. Works well for both low- and high-dimensional data

Cons: 1. Efficiency decreases as training-set size increases 2. Needs careful normalization and parameter tuning

  1. An SVM with a linear kernel behaves much like logistic regression
  2. If the data are not linearly separable, use a non-linear kernel
  3. Handles high-dimensional spaces well
  4. High accuracy
  5. Maximizing the margin helps avoid overfitting
  6. No distributional assumptions needed
  7. Does not suffer from multicollinearity
  8. Requires significant memory and processing power
  9. Beyond roughly 10,000 examples, training starts taking too long (a sketch follows this list)
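
A minimal sketch; the RBF kernel and parameter values are illustrative, and the scaler reflects the normalization caveat above:

```python
# SVM sketch: kernel, gamma, and C are the tuning parameters named above.
# SVMs need feature scaling, hence the StandardScaler in the pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```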

Naïve Bayes (GM)

$$p(C \mid F_1, \ldots, F_n) = \frac{p(C)\, p(F_1, \ldots, F_n \mid C)}{p(F_1, \ldots, F_n)}$$

  1. Pros:
    1. Simple
    2. Needs less data when the conditional-independence assumption holds
    3. No distribution requirement
    4. Good for categorical variables with few categories
  2. Cons:
    1. Multicollinearity (correlated features violate the independence assumption; a sketch follows)
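
A minimal sketch with the Gaussian variant; the iris data is an illustrative choice:

```python
# Gaussian naive Bayes: each feature is modeled as conditionally independent
# given the class, so fitting reduces to per-class means and variances.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
print("CV accuracy:", cross_val_score(GaussianNB(), X, y, cv=5).mean())
```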

Gaussian Processes (used in geostatistics)

  1. An infinite collection of random variables, any finite subset of which is jointly Gaussian
  2. Pros:
    1. Requires few assumptions about the data
    2. Powerful (a sketch follows)
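
A minimal regression sketch; the RBF kernel and the toy sine data are illustrative assumptions, not something fixed by these notes:

```python
# Gaussian process regression on toy 1-D data; predict() with return_std=True
# returns the mean and standard deviation of the predictive distribution.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(20, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=20)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01)
gp.fit(X, y)
mean, std = gp.predict(np.array([[2.5]]), return_std=True)
print(mean, std)
```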

K-means (unsupervised); choose the best K with the elbow method (sketched after the pros/cons below)

[Elbow plot: within-cluster sum of squares versus number of clusters.]

  1. Pros:
    1. Fast
    2. Can detect outliers
  2. Cons:
    1. Clusters are assumed spherical; cannot detect groups of other shapes
    2. Multicollinearity
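
An elbow-method sketch; the blob data and the range of K are illustrative:

```python
# Fit K-means for several K and print the within-cluster sum of squares
# (inertia_); the "elbow" where it stops dropping quickly suggests the best K.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)
```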

Lasso

  1. Pros:
    1. No distribution needed
    2. Variable selection (the L1 penalty zeroes out coefficients)
    3. Minimizes RMSE
  2. Cons:
    1. Multicollinearity (tends to pick one of a group of correlated variables arbitrarily)

Ridge

  1. Pros:
    1. No distribution needed
    2. Does not suffer from multicollinearity
    3. Minimizes RMSE
  2. Cons:
    1. No variable selection (coefficients shrink toward zero but stay nonzero; the contrast is sketched below)
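
A sketch of the Lasso/Ridge contrast on synthetic data (the alpha values and dataset are illustrative):

```python
# L1 (Lasso) zeroes out coefficients -> variable selection;
# L2 (Ridge) only shrinks them, so all stay nonzero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("nonzero lasso coefs:", np.sum(lasso.coef_ != 0))  # sparse
print("nonzero ridge coefs:", np.sum(ridge.coef_ != 0))  # all 20
```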

Logistic (classification)

It fits a model that can predict the probability of a binary response belonging to one class or the other.

Cost Function of Logistic Regression

The cross-entropy (log loss) between the predicted probabilities and the observed labels (written out below).
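
Writing the predicted probability for observation i as a sigmoid of the linear predictor, the usual form is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \log \hat{p}_i + (1 - y_i) \log\big(1 - \hat{p}_i\big) \Big], \qquad \hat{p}_i = \sigma\big(\theta^\top x_i\big)$$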

Use multinom() (from R's nnet package) when the response contains more than two categories.

  1. Pros:
    1. Can transform non-linear features into linear terms by feature engineering
    2. With regularization, resists noise and overfitting and can perform feature selection
    3. No distribution needed
    4. Easy to interpret
    5. No need to worry about highly correlated features when regularized
  2. Cons:
    1. Multicollinearity (unregularized coefficients become unstable; a sketch follows)
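
A minimal sketch; the dataset and max_iter value are illustrative:

```python
# Logistic regression: predict_proba returns the class-membership
# probabilities described above.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logit = LogisticRegression(max_iter=5000)
logit.fit(X_train, y_train)
print(logit.predict_proba(X_test[:3]))  # P(class 0), P(class 1) per row
```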

PCA

  1. Removes collinearity
  2. Reduces dimensionality
  3. Useful for highly correlated datasets (a sketch follows)
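
A minimal sketch; the dataset and the 95%-variance threshold are illustrative choices:

```python
# PCA: project correlated features onto orthogonal components, keeping
# enough components to explain ~95% of the variance.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=0.95)  # keep 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)
```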

OVERALL:

  1. If the variables are normally distributed and the categorical variables all have 5+ categories: use linear discriminant analysis
  2. If the correlations are mostly nonlinear: use SVM
  3. If sparsity and multicollinearity are a concern: use the adaptive Lasso (Ridge regression to obtain the weights, then a weighted Lasso)

General Idea

  1. Regression: predict a continuous response
  2. Classification: predict a discrete class label
  3. ANOVA: test whether group means differ
  4. Cluster Analysis: group unlabeled observations by similarity
  5. Discriminant Analysis: classify using the class-conditional distributions of the predictors
  6. Logistic: model the probability of a binary outcome
  7. ROC curve (Receiver Operating Characteristic curve): sweep through all possible cutoffs and plot sensitivity against 1 − specificity (a sketch follows the figure note)

[Figure: ROC curve with sensitivity on the vertical axis, scaled 0.0 to 1.0.]
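
A minimal sketch; the dataset and classifier are illustrative:

```python
# roc_curve sweeps every cutoff and returns the false-positive rate
# (1 - specificity) and true-positive rate (sensitivity) at each one.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```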