| Scenario | Metric | Why |
|---|---|---|
| Imbalanced classes | F1, AUC-ROC, AUC-PR | Accuracy misleading; AUC-PR better when positives rare |
| Cost FP ≫ FN (spam, fraud alert) | Precision | Minimize false alarms |
| Cost FN ≫ FP (cancer, fraud detection) | Recall | Minimize missed positives |
| Regression with outliers | MAE / Huber | Robust to large errors (L1 vs L2) |
| Regression, large errors critical | RMSE | Squares amplify large deviations |
| Ranking, recommendation | NDCG, MAP, MRR | Order-sensitive evaluation |
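Precision, recall, and F1 from the table above follow directly from confusion-matrix counts; a minimal sketch with hypothetical counts for an imbalanced problem:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of predicted positives, how many are right
    recall = tp / (tp + fn)             # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical: 10 true positives exist; classifier flags 12, catching 8
p, r, f = prf1(tp=8, fp=4, fn=2)
```

Note that none of these quantities touch true negatives, which is exactly why they stay informative when negatives dominate and accuracy does not.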
| Variant | Samples/Update | Noise | Speed |
|---|---|---|---|
| Batch GD | All m | None | Slowest per epoch |
| SGD | 1 | High | Fastest, can escape local min |
| Mini-batch | k (32–512) | Moderate | Best trade-off; GPU-vectorized |
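The mini-batch variant can be sketched in a few lines; the learning rate, batch size, and synthetic problem below are illustrative choices, not tuned recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch = 0.1, 32
for epoch in range(50):
    order = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch):
        b = order[start:start + batch]
        # MSE gradient estimated on the mini-batch only
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad
```

Setting `batch = len(X)` recovers batch GD (one noiseless step per epoch); `batch = 1` recovers SGD.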
| Property | Ridge | Lasso | ElasticNet |
|---|---|---|---|
| Sparsity | No | Yes | Yes |
| Correlated feats | Equal shrink | Picks one | Groups |
| Closed form | Yes | No (subgradient) | No |
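The closed-form row is worth making concrete: Ridge's L2 penalty keeps the normal equations solvable, while Lasso's non-differentiable L1 term does not. A minimal sketch on a tiny hypothetical design matrix:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge closed form: w = (XᵀX + λI)⁻¹ Xᵀy. No such form exists for Lasso."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([3.0, -2.0])               # noiseless toy targets
w_ols = ridge_closed_form(X, y, 0.0)        # λ=0 recovers OLS exactly
w_reg = ridge_closed_form(X, y, 5.0)        # larger λ shrinks weights toward zero
```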
| Condition | Bias | Variance | Fix |
|---|---|---|---|
| Overfitting | Low | High | Regularize, prune, more data, dropout |
| Underfitting | High | Low | More features, complex model, less regularization |
| Aspect | Generative | Discriminative |
|---|---|---|
| Models | NB, GMM, HMM, LDA | LR, SVM, NN, RF |
| Learns | P(x,y) = P(x|y)P(y) | P(y|x) directly |
| Pros | Works w/ small data; handles missing; can sample | Higher accuracy on large data |
| Cons | Strong distributional assumptions | Needs labeled data |
| Property | ID3 | C4.5 | CART |
|---|---|---|---|
| Split criterion | Information Gain | Gain Ratio | Gini (class) / Variance (reg) |
| Task | Classification | Classification | Both |
| Continuous features | ❌ | ✓ threshold search | ✓ threshold search |
| Missing values | ❌ | ✓ fractional weights | ✓ surrogate splits |
| Split arity | Multi-way | Multi-way | Binary only |
| Pruning | ❌ | ✓ post-prune | ✓ cost-complexity |
| Bias (IG) | High-cardinality attrs preferred | Corrected by SplitInfo | Gini less biased |
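The three split criteria in the table reduce to short impurity formulas; a pure-Python sketch:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """H = −Σ p·log₂(p); ID3/C4.5 impurity measure."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini = 1 − Σ p²; CART classification impurity."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(parent, splits):
    """ID3 criterion: parent entropy minus weighted child entropy."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)
```

C4.5's Gain Ratio divides `info_gain` by the SplitInfo (the entropy of the split proportions themselves), which is what penalizes high-cardinality attributes.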
| Kernel | Formula | Feature Space | Best For |
|---|---|---|---|
| Linear | xᵀz | Original | High-d sparse (text) |
| Polynomial | (γxᵀz + r)ᵈ | Degree-d poly | Images, NLP |
| RBF/Gaussian | exp(−γ||x−z||²) | Infinite-dim | General purpose |
| Sigmoid | tanh(κxᵀz + θ) | Varies | NN analog (not always valid) |
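The kernel formulas above translate directly to code; the default `gamma`, `r`, and `d` below are arbitrary placeholders:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def poly_kernel(x, z, gamma=1.0, r=1.0, d=3):
    return (gamma * (x @ z) + r) ** d

def rbf_kernel(x, z, gamma=0.5):
    # K(x, x) = 1; decays with squared distance — implicit infinite-dim feature space
    return np.exp(-gamma * np.sum((x - z) ** 2))
```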
| Scenario | SVM? | Notes |
|---|---|---|
| High-d, few samples (n≪d) | ✓ Great | Genomics, text; margin maximization helps |
| Large n (>100K) | ✗ Slow | O(n²–n³); use SGD/Linear SVM |
| Need probabilities | Partial | Platt scaling; not well-calibrated natively |
| Heavily overlapping classes | Tune C | Soft margin; RBF kernel |
| Multiclass | Manual | OvR or OvO decomposition |
| Property | MLE | MAP | Full Bayes |
|---|---|---|---|
| Prior | No | Yes (point) | Yes (full) |
| Output | Point θ | Point θ | Distribution P(θ|D) |
| Overfit risk | High | Low | Lowest |
| Tractability | Easy | Easy | Often hard |
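The MLE-vs-MAP distinction is easiest to see on a Bernoulli parameter with a Beta prior, where the prior acts as pseudo-counts; the counts and prior below are hypothetical:

```python
# 7 heads in 10 flips; Beta(a, b) prior contributes (a−1) pseudo-heads, (b−1) pseudo-tails
heads, flips = 7, 10
a, b = 2, 2                                       # mild prior favouring fairness

mle = heads / flips                               # argmax of the likelihood alone
map_est = (heads + a - 1) / (flips + a + b - 2)   # posterior mode: prior pulls toward 0.5
```

Full Bayes would instead keep the entire posterior Beta(heads + a, flips − heads + b) rather than collapsing it to a point.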
| Variant | Feature Type | P(xⱼ|Cₖ) |
|---|---|---|
| Gaussian NB | Continuous | N(μₖⱼ, σ²ₖⱼ) — estimate μ,σ per class |
| Bernoulli NB | Binary | pₖⱼ^xⱼ · (1−pₖⱼ)^(1−xⱼ) |
| Multinomial NB | Counts (text) | θₖⱼ^xⱼ — word frequencies |
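The Gaussian row can be made concrete with a minimal from-scratch classifier (a sketch, not a replacement for `sklearn.naive_bayes.GaussianNB`); it estimates one mean and variance per class per feature and predicts via log P(y) + Σⱼ log N(xⱼ; μ, σ²):

```python
import numpy as np

class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes for illustration."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        self.logprior = np.log([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        # Log-likelihood of each sample under each class's diagonal Gaussian
        ll = -0.5 * (np.log(2 * np.pi * self.var[:, None, :])
                     + (X[None] - self.mu[:, None, :]) ** 2 / self.var[:, None, :]).sum(-1)
        return self.classes[np.argmax(self.logprior[:, None] + ll, axis=0)]

# Hypothetical well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
acc = (TinyGaussianNB().fit(X, y).predict(X) == y).mean()
```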
| Method | Training | Reduces | Examples |
|---|---|---|---|
| Bagging | Parallel | Variance | Random Forest |
| Boosting | Sequential | Bias | AdaBoost, GBM, XGBoost |
| Stacking | Both | Both | Meta-learner on OOF preds |
| Hyperparameter | Effect (↑) |
|---|---|
| n_estimators | Better, no overfit risk; diminishing returns after ~200 |
| max_features | Less diversity across trees (more correlated); higher individual-tree accuracy |
| max_depth | Higher variance per tree, richer structure |
| min_samples_leaf | More regularization, smoother boundaries |
| Property | sklearn GBM | XGBoost | LightGBM | CatBoost |
|---|---|---|---|---|
| Tree growth | Level-wise | Level-wise (depthwise default) | Leaf-wise (best-first) | Symmetric (oblivious trees) |
| Taylor order | 1st only | 1st + 2nd | 1st + 2nd | 1st + 2nd |
| Speed | Slow | Fast (parallel column sort) | Fastest (GOSS + EFB) | Moderate |
| Missing values | Manual impute | ✓ native default direction | ✓ native | ✓ native |
| Categorical | Manual encode | Manual encode | ✓ native (optimal split) | ✓ target encoding with ordered boosting |
| Regularization | subsample, max_depth | γ, λ, α, subsample, colsample | min_gain, min_data_in_leaf | Symmetric structure = implicit reg |
| Overfit risk | Low | Low | Higher (leaf-wise); use min_data_in_leaf | Low (symmetric) |
| Unique optimization | — | Block structure, approx splits | GOSS + EFB histogram | Ordered boosting (no leakage) |
| Property | AdaBoost | GBM |
|---|---|---|
| Weighting mechanism | Sample reweighting | Residual fitting |
| Step size | αₜ derived from εₜ | Line search or fixed η |
| Loss function | Exponential (fixed) | Any differentiable loss |
| Outlier sensitivity | Very high (exp loss) | Depends on loss choice |
| Flexibility | Low | High (plug any loss) |
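AdaBoost's derived step size from the table is one line of math:

```python
from math import log

def adaboost_alpha(eps):
    """AdaBoost weak-learner weight α_t = ½·ln((1−ε_t)/ε_t).
    α → 0 as the weak learner approaches random guessing (ε_t = 0.5),
    and grows as its weighted error ε_t shrinks."""
    return 0.5 * log((1 - eps) / eps)
```

GBM has no such closed form: its step size comes from a line search on the chosen loss, or is just a fixed learning rate η.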
| Variant | Key Difference | Use When |
|---|---|---|
| K-Means++ | Probabilistic init | Always over random init |
| K-Medoids (PAM) | Centers = actual data points | Non-Euclidean, robust to outliers |
| Mini-batch K-Means | SGD-style centroid updates | n > 1M; online streams |
| Fuzzy C-Means | Soft memberships uᵢₖ ∈ [0,1] | Overlapping clusters |
| Bisecting K-Means | Recursively split largest cluster | Hierarchical structure needed |
| DBSCAN | Density-based, no k needed | Arbitrary shapes; detect outliers |
| HDBSCAN | Hierarchical DBSCAN | Varying density clusters |
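The K-Means++ row ("probabilistic init") can be sketched directly: each new center is drawn with probability proportional to its squared distance from the nearest existing center, spreading the seeds out:

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """K-Means++ seeding: sample new centers ∝ D(x)², D = distance to nearest center."""
    centers = [X[rng.integers(len(X))]]                 # first center uniform at random
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```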
| Property | K-Means | GMM | DBSCAN |
|---|---|---|---|
| Requires k | Yes | Yes (K) | No (ε, MinPts) |
| Soft assignment | No (hard) | Yes (γₙₖ) | No |
| Cluster shape | Spherical | Elliptical (per Σₖ) | Arbitrary |
| Outlier detection | None | Low-responsibility pts | Explicit noise pts |
| Probabilistic model | No | Yes | No |
| Scalability | Very high | Moderate (O(nKd²)) | O(n log n) w/ index |
| Convergence | Local WCSS min | Local log-lik max | Deterministic |
| Method | Description | Use When |
|---|---|---|
| k-Fold CV | k disjoint folds; rotate test fold | General; k=5 or 10 |
| Stratified k-Fold | Preserve class ratio per fold | Classification, imbalanced |
| LOOCV | n-fold; one sample left out | Very small n; high variance of estimate |
| Repeated k-Fold | Multiple random k-fold runs | Reduce CV estimate variance |
| Time series split | Walk-forward; no future leakage | Any temporal data |
| Nested CV | Outer: eval; Inner: HP tuning | Unbiased performance estimate when doing HP search |
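The walk-forward idea behind the time series split is simple enough to write out (a sketch of the scheme, akin to `sklearn.model_selection.TimeSeriesSplit`): each fold trains on everything before its test window, so no future samples leak into training:

```python
def time_series_splits(n, n_splits):
    """Yield (train_idx, test_idx) pairs with an expanding training window."""
    fold = n // (n_splits + 1)          # first fold-worth of data is train-only
    for i in range(1, n_splits + 1):
        yield list(range(i * fold)), list(range(i * fold, (i + 1) * fold))
```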
| Method | Type | Principle |
|---|---|---|
| LIME | Local, model-agnostic | Fit local linear approx in perturbed neighborhood |
| SHAP | Local + Global | Shapley values; unique fair attribution |
| Grad-CAM | Local, CNN | Gradient × activation for saliency map |
| PDP / ICE | Global / Local | Marginal effect of one feature |
| Permutation Imp. | Global | Score drop when feature shuffled |
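Permutation importance from the last row is fully model-agnostic; a minimal sketch where `model_score` is any hypothetical callable returning a higher-is-better score:

```python
import numpy as np

def permutation_importance(model_score, X, y, rng, n_repeats=5):
    """Mean score drop when each feature column is shuffled in turn."""
    base = model_score(X, y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])    # break feature j's link to y
            drops[j] += (base - model_score(Xp, y)) / n_repeats
    return drops
```

A feature the model never uses shows zero drop; an essential one shows a large drop, regardless of the model's internals.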
| Scenario | Top Choice | Alternative | Avoid |
|---|---|---|---|
| Tabular data, structured, n≥1K | XGBoost / LightGBM | Random Forest | KNN (slow predict) |
| Few features, linear relationship | Ridge / Lasso Regression | Logistic Regression | Deep trees |
| Text classification | Logistic Reg + TF-IDF | Linear SVM, Multinomial NB | KNN (high-d) |
| Small dataset (<1K samples) | SVM (RBF), NB | Logistic Regression | Deep neural networks |
| Need interpretability | Decision Tree, Linear Model | SHAP + XGBoost | Deep NN without SHAP |
| Probabilistic output needed | Logistic Regression, GMM | Calibrated RF / XGB | Hard-margin SVM |
| Clustering, unknown k | DBSCAN / HDBSCAN | Hierarchical clustering | K-Means |
| Clustering, known k, spherical | K-Means++ | Mini-batch K-Means | DBSCAN |
| Soft cluster assignments | GMM | Fuzzy C-Means | K-Means |
| Non-linear boundary, few samples | SVM + RBF kernel | Random Forest | Plain KNN |
| Online / streaming | SGD variants, NB | Hoeffding Trees | Batch SVM |
| Imbalanced binary | XGB (scale_pos_weight) + F1 | RF + class_weight | Default threshold |
| Algorithm | Train | Predict | Space |
|---|---|---|---|
| Linear Regression (GD) | O(nd·iter) | O(d) | O(d) |
| Linear Regression (Normal Eq) | O(nd² + d³) | O(d) | O(d) |
| Logistic Regression | O(nd·iter) | O(d) | O(d) |
| KNN | O(1) lazy (optional index build extra) | O(nd) | O(nd) |
| Decision Tree | O(nd log n) | O(depth) | O(n·depth) |
| SVM (kernel) | O(n²d – n³) | O(SV·d) | O(SV) |
| Naive Bayes | O(nd) | O(Kd) | O(Kd) |
| Random Forest | O(T·nd log n) | O(T·depth) | O(T·n) |
| GBM / XGBoost | O(T·nd log n) | O(T·depth) | O(T·n) |
| K-Means | O(nkd·iter) | O(kd) | O((n+k)d) |
| GMM-EM | O(nkd²·iter) | O(kd²) | O(kd²) |
| Algorithm | Key Assumptions |
|---|---|
| Linear Regression | Linear f(x), homoscedasticity, no multicollinearity, errors N(0,σ²) |
| Logistic Regression | Linear decision boundary in feature space |
| LDA | Gaussian class-conditionals, equal covariance matrices |
| Naïve Bayes | Conditional independence of features given class |
| KNN | Locally similar points share labels; smooth decision boundary |
| K-Means | Spherical, similar-size, convex, well-separated clusters |
| GMM | Data generated from K Gaussian components |
| SVM | Separable (or nearly) by hyperplane; kernel defines geometry |
| Decision Tree | Axis-aligned decision boundaries; recursive partition |
| Bayesian LR | Gaussian prior on weights, Gaussian likelihood |