ML Interview Notes

// all topics · formulas + pseudocode + comparisons · interview-ready
01

Introduction to Machine Learning

ML Taxonomy

Mitchell's def: "A computer program is said to learn from experience E with respect to task T and performance measure P, if its performance at T, as measured by P, improves with experience E."

Designing a Learning System

Core Challenges

02

ML Workflow

Data Preprocessing

Normalization (Min-Max)
x' = (x − min) / (max − min) → [0, 1]
Standardization (Z-score)
x' = (x − μ) / σ → μ=0, σ=1
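Both formulas above can be sketched column-wise in NumPy; the toy matrix `X` is illustrative only:

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column to [0, 1]: x' = (x - min) / (max - min)."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def z_score(X):
    """Standardize each column: x' = (x - mu) / sigma."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_mm = min_max_scale(X)   # each column spans [0, 1]
X_z = z_score(X)          # each column has mean 0, std 1
```

In practice, fit the min/max or μ/σ on the training split only and reuse them on the test split (see the preprocessing note in section 11).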

Class Imbalance Strategies

Accuracy misleading on imbalanced data → use F1 / AUC-PR / AUC-ROC

Performance Metrics

Classification
Accuracy = (TP+TN) / N
Precision = TP / (TP+FP) // of predicted +, how many real +
Recall = TP / (TP+FN) // of actual +, how many caught
F1 = 2·P·R / (P+R)
AUC-ROC = P(score(+) > score(−))
Regression
MAE = (1/n)Σ|yᵢ − ŷᵢ|
MSE = (1/n)Σ(yᵢ − ŷᵢ)²
RMSE = √MSE
R² = 1 − SS_res/SS_tot
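The four regression metrics above, computed directly from their definitions (toy arrays are illustrative):

```python
import numpy as np

def regression_metrics(y, y_hat):
    """Return MAE, MSE, RMSE, R² exactly as defined above."""
    err = y - y_hat
    mae = np.abs(err).mean()
    mse = (err ** 2).mean()
    rmse = np.sqrt(mse)
    ss_res = (err ** 2).sum()                  # residual sum of squares
    ss_tot = ((y - y.mean()) ** 2).sum()       # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, rmse, r2

y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 8.0])
mae, mse, rmse, r2 = regression_metrics(y, y_hat)
```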

Metric Selection Guide

| Scenario | Metric | Why |
|---|---|---|
| Imbalanced classes | F1, AUC-ROC, AUC-PR | Accuracy misleading; AUC-PR better when positives rare |
| Cost FP ≫ FN (spam, fraud alert) | Precision | Minimize false alarms |
| Cost FN ≫ FP (cancer, fraud detection) | Recall | Minimize missed positives |
| Regression with outliers | MAE / Huber | Robust to large errors (L1 vs L2) |
| Regression, large errors critical | RMSE | Squares amplify large deviations |
| Ranking, recommendation | NDCG, MAP, MRR | Order-sensitive evaluation |
03

Linear Models for Regression

Model & Loss

Hypothesis
h(x) = θᵀx = θ₀ + θ₁x₁ + ··· + θₙxₙ
MSE Loss
J(θ) = (1/2m) Σᵢ (hθ(xᵢ) − yᵢ)²
Normal Equation — Direct Solution (O(n³))
θ* = (XᵀX)⁻¹ Xᵀy
Fails if XᵀX singular (multicollinear); use pseudoinverse / Ridge. Only feasible for small feature count.
Gradient
∂J/∂θⱼ = (1/m) Σᵢ (hθ(xᵢ) − yᵢ) xᵢⱼ
In matrix form: ∇J = (1/m) Xᵀ(Xθ − y)
Set gradient to 0 analytically (normal eq.) or descend iteratively (GD)
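A minimal sketch of the normal equation, using the pseudoinverse as recommended above for the singular case (the line y = 2 + 3x is synthetic):

```python
import numpy as np

def normal_equation(X, y):
    """theta* = (X^T X)^{-1} X^T y; pinv handles a singular X^T X."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# y = 2 + 3x, with a bias column of ones prepended
X = np.c_[np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])]
y = 2.0 + 3.0 * X[:, 1]
theta = normal_equation(X, y)   # recovers [2, 3]
```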

Gradient Descent Variants

Update Rule
θⱼ := θⱼ − α · ∂J/∂θⱼ
// Mini-batch GD (general form)
Initialize θ randomly
repeat until convergence:
    for each mini-batch B ⊂ {1..m}:
        θ := θ − (α/|B|) Σᵢ∈B ∇L(θ; xᵢ,yᵢ)

| Variant | Samples/Update | Noise | Speed |
|---|---|---|---|
| Batch GD | All m | None | Slowest per epoch |
| SGD | 1 | High | Fastest; can escape local minima |
| Mini-batch | k (32–512) | Low | Best of both; GPU-vectorized |
α too large → diverge; α too small → slow. Use learning rate schedule or Adam.
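The mini-batch loop above as runnable NumPy; the data, step size, and epoch count are illustrative choices for a tiny noise-free problem:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.3, batch_size=2, epochs=2000, seed=0):
    """theta := theta - alpha * grad, with the MSE gradient averaged per mini-batch."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))           # reshuffle each epoch
        for start in range(0, len(y), batch_size):
            b = idx[start:start + batch_size]
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)   # (1/|B|) X^T (X theta - y)
            theta -= alpha * grad
    return theta

# y = 1 + 2x on a [0, 1] feature; bias column prepended
X = np.c_[np.ones(6), np.linspace(0.0, 1.0, 6)]
y = 1.0 + 2.0 * X[:, 1]
theta = minibatch_gd(X, y)
```

Note the step size only converges because the feature is scaled to [0, 1]; unscaled features would need a smaller α, which is one practical reason for the standardization step in section 02.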

Regularization

Ridge (L2) — Normal Eq becomes invertible
J(θ) = MSE + λΣⱼ θⱼ²
θ* = (XᵀX + λI)⁻¹ Xᵀy
Shrinks weights continuously; never exactly 0; handles multicollinearity
Lasso (L1) — Feature selection
J(θ) = MSE + λΣⱼ |θⱼ|
Sparsity: corners of L1 ball → exact zeros; use coordinate descent
ElasticNet
J(θ) = MSE + λ₁Σ|θⱼ| + λ₂Σθⱼ²
Groups correlated features; L1 selects one, ElasticNet selects group
| | Ridge | Lasso | ElasticNet |
|---|---|---|---|
| Sparsity | No | Yes | Yes |
| Correlated feats | Equal shrink | Picks one | Groups |
| Closed form | Yes | No (subgradient) | No |
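Ridge's closed form can be sketched directly; the duplicated-column data is a contrived example of the multicollinear case where plain OLS breaks but the λI term keeps the system invertible:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """theta* = (X^T X + lam I)^{-1} X^T y — invertible for any lam > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Two perfectly correlated columns: X^T X is singular, yet Ridge still solves,
# and it splits the weight equally across the correlated pair ("equal shrink")
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = ridge_fit(X, y, lam=0.1)
```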

Bias-Variance Decomposition

Decomposition
E[(y − ŷ)²] = Bias²(ŷ) + Var(ŷ) + σ²
Bias: error from wrong model assumptions (underfitting)
Variance: sensitivity to training sample (overfitting)
σ²: irreducible noise — can never eliminate
Bias of Estimator
Bias(ŷ) = E[ŷ] − f(x) // systematic offset from truth
| Condition | Bias | Variance | Fix |
|---|---|---|---|
| Overfitting | Low | High | Regularize, prune, more data, dropout |
| Underfitting | High | Low | More features, complex model, less regularization |
Bagging reduces variance; boosting reduces bias; regularization shifts tradeoff

Basis Function Models

General Form
h(x) = θᵀφ(x) where φ: X → ℝᵐ
04

Linear Models for Classification

Logistic Regression

Sigmoid Activation
σ(z) = 1 / (1 + e^(−z)), z = θᵀx
P(y=1|x;θ) = σ(θᵀx)
Binary Cross-Entropy Loss (NLL of Bernoulli)
J(θ) = −(1/m) Σᵢ [yᵢ log ŷᵢ + (1−yᵢ) log(1−ŷᵢ)]
Gradient — same form as linear regression!
∂J/∂θⱼ = (1/m) Σᵢ (ŷᵢ − yᵢ) xᵢⱼ
In matrix: ∇J = (1/m) Xᵀ(ŷ − y)
Multiclass — Softmax
P(y=k|x) = exp(θₖᵀx) / Σⱼ exp(θⱼᵀx)
Decision Boundary
θᵀx = 0 → linear hyperplane in feature space
Predict 1 if P(y=1|x) ≥ 0.5, i.e., θᵀx ≥ 0
Max likelihood of Bernoulli outputs; convex loss → unique global minimum via GD
Perfect separability → weights diverge; fix with L2 regularization
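The BCE gradient above in a minimal GD loop; the small L2 term (`lam`) illustrates the fix for the separability issue just mentioned (for simplicity this sketch also penalizes the bias, which production implementations usually skip). The four-point dataset is synthetic:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, alpha=0.5, iters=2000, lam=0.01):
    """GD on BCE: grad = (1/m) X^T (y_hat - y) + lam*theta (L2 caps weight growth)."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m + lam * theta
        theta -= alpha * grad
    return theta

X = np.c_[np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])]   # bias + 1 feature
y = np.array([0.0, 0.0, 1.0, 1.0])                        # separable at x = 0
theta = logreg_fit(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
```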

Decision Theory

| | Generative | Discriminative |
|---|---|---|
| Models | NB, GMM, HMM, LDA | LR, SVM, NN, RF |
| Learns | P(x,y) = P(x|y)P(y) | P(y|x) directly |
| Pros | Works w/ small data; handles missing; can sample | Higher accuracy on large data |
| Cons | Strong distributional assumptions | Needs labeled data |

Linear Discriminant Analysis (LDA)

Projection Criterion (Fisher)
max_w (wᵀSBw) / (wᵀSWw)
SB: between-class scatter; SW: within-class scatter
Solution
SW⁻¹ SB w = λw → generalized eigenproblem
Reduce to (K−1) dimensions for K classes
05

Decision Trees

Information Theory Fundamentals

Entropy (impurity measure)
H(S) = −Σₖ pₖ log₂ pₖ
pₖ = fraction of class k; 0 log 0 ≡ 0; range [0, log₂K]
Information Gain (ID3 criterion)
IG(S, A) = H(S) − Σᵥ (|Sᵥ|/|S|) H(Sᵥ)
A: attribute; Sᵥ: subset where A=v
Gain Ratio (C4.5 — corrects IG bias for high-cardinality attrs)
GR(S,A) = IG(S,A) / SplitInfo(S,A)
SplitInfo = −Σᵥ (|Sᵥ|/|S|) log₂(|Sᵥ|/|S|)
Gini Impurity (CART)
Gini(S) = 1 − Σₖ pₖ²
Range [0, 1−1/K]; 0 = pure node
Variance Reduction (CART regression)
VR = Var(S) − Σᵥ (|Sᵥ|/|S|) Var(Sᵥ)
Entropy = expected bits to encode label. IG = entropy reduction from knowing attribute value.
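Entropy and information gain from their definitions, on a contrived four-sample set with one perfectly predictive attribute and one useless one:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(S) = -sum_k p_k log2 p_k (empty terms skipped, matching 0 log 0 = 0)."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, attr):
    """IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    n = len(labels)
    ig = entropy(labels)
    for v in set(attr):
        sub = [l for l, a in zip(labels, attr) if a == v]
        ig -= len(sub) / n * entropy(sub)
    return ig

labels = ['+', '+', '-', '-']
perfect = ['a', 'a', 'b', 'b']    # splits the classes exactly -> IG = H(S) = 1 bit
useless = ['a', 'b', 'a', 'b']    # independent of the class -> IG = 0
```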

ID3 Algorithm — Pseudocode

// ID3 — Entropy + Info Gain
function BuildTree(S, Attrs):
    if all S same class C: return Leaf(C)
    if Attrs empty: return Leaf(majority(S))
    A* = argmax_{A∈Attrs} IG(S, A)
    node = Node(A*)
    for each value v of A*:
        Sᵥ = {s ∈ S : s[A*] = v}
        if Sᵥ empty:
            node.child[v] = Leaf(majority(S))
        else:
            node.child[v] = BuildTree(Sᵥ, Attrs\{A*})
    return node

// CART (Binary splits — Classification)
function BestSplit(S, Attrs):
    best_gain = 0; best_split = null
    for each attr A:
        for each threshold t in sorted A values:
            S_L = {s: s[A] ≤ t}; S_R = {s: s[A] > t}
            gain = Gini(S) − (|S_L|/|S|)Gini(S_L) − (|S_R|/|S|)Gini(S_R)
            if gain > best_gain: best_gain = gain; best_split = (A, t)
    return best_split

Continuous & Missing Values

Continuous Threshold Search
Sort xᵢ values: v₁ ≤ v₂ ≤ ... ≤ vₙ
Candidate thresholds: tₖ = (vₖ + vₖ₊₁)/2
t* = argmax_t IG(S, A_t)
O(n log n) per attribute per node

Overfitting & Pruning

Deep trees: overfit. Low depth: underfit. Use CV to pick optimal depth / α.

ID3 vs C4.5 vs CART — Comparison

| Property | ID3 | C4.5 | CART |
|---|---|---|---|
| Split criterion | Information Gain | Gain Ratio | Gini (class) / Variance (reg) |
| Task | Classification | Classification | Both |
| Continuous features | ✗ | ✓ threshold search | ✓ threshold search |
| Missing values | ✗ | ✓ fractional weights | ✓ surrogate splits |
| Split arity | Multi-way | Multi-way | Binary only |
| Pruning | ✗ | ✓ post-prune | ✓ cost-complexity |
| Bias (IG) | High-cardinality attrs preferred | Corrected by SplitInfo | Gini less biased |
06

Instance-Based Learning

K-Nearest Neighbors (KNN)

Prediction
ŷ = majority_vote({yᵢ : i ∈ kNN(x)}) // classification
ŷ = (1/k) Σᵢ∈kNN(x) yᵢ // regression
Distance Metrics
Euclidean: d(x,z) = √Σⱼ(xⱼ−zⱼ)²
Manhattan: d(x,z) = Σⱼ|xⱼ−zⱼ|
Minkowski: d(x,z) = (Σⱼ|xⱼ−zⱼ|^p)^(1/p)
Cosine sim: (x·z)/(||x||·||z||)
Weighted KNN
ŷ = Σᵢ wᵢ yᵢ / Σᵢ wᵢ, wᵢ = 1/d(x, xᵢ)²
No model; memorize training set; predict by proximity
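The majority-vote prediction above, sketched with Euclidean distance on a toy two-cluster dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every point
    nearest = np.argsort(d)[:k]                  # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
y_train = np.array([0, 0, 1, 1])
```

Training is just storing `(X_train, y_train)`; all cost is paid at query time, which is why the complexity table at the end lists O(nd) predict.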

Locally Weighted Regression (LWR)

Objective (per query point x)
J(θ; x) = Σᵢ wᵢ(x) · (yᵢ − θᵀxᵢ)²
Gaussian Kernel Weight
wᵢ(x) = exp(−||x − xᵢ||² / 2τ²)
τ (bandwidth): small → more local; large → global linear fit
Closed-Form Solution
θ*(x) = (XᵀW(x)X)⁻¹ XᵀW(x)y
W(x) = diag(w₁(x), ..., wₙ(x))
Fit linear model around query point using proximity-weighted samples
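The per-query closed form above as a sketch (the query re-solves the weighted normal equation each time; data is a synthetic exact line, which any weighting recovers):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Solve theta = (X^T W X)^{-1} X^T W y with Gaussian weights around x_query."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y
    return x_query @ theta

# exact line y = 1 + 2x; bias column prepended (query must carry the bias 1 too)
X = np.c_[np.ones(5), np.arange(5.0)]
y = 1.0 + 2.0 * X[:, 1]
y_hat = lwr_predict(X, y, np.array([1.0, 2.5]))   # interpolate at x = 2.5
```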

Radial Basis Functions (RBF)

RBF Network Output
f(x) = Σⱼ₌₁ᴷ wⱼ φ(||x − cⱼ||) + w₀
cⱼ: centers; φ: radial basis function; wⱼ: output weights
Common Basis Functions
Gaussian: φ(r) = exp(−r²/2σ²)
Multiquadric: φ(r) = √(r² + c²)
Thin-plate: φ(r) = r² log(r)
// RBF Network Training
1. Choose centers cⱼ (k-means on X, or use training points)
2. Fix widths σⱼ (heuristic: mean distance to neighbors)
3. Compute Φᵢⱼ = φ(||xᵢ − cⱼ||) for all i,j
4. Solve linear system: w* = (ΦᵀΦ)⁻¹Φᵀy
Universal approximator; interpolates smoothly between centers
07

Support Vector Machines

Hard-Margin SVM (Linearly Separable)

Primal Problem
min_{w,b} ½||w||²
s.t. yᵢ(wᵀxᵢ + b) ≥ 1 ∀i
Margin = 2/||w||. Maximize margin ↔ minimize ½||w||².
Lagrangian
L(w,b,α) = ½||w||² − Σᵢ αᵢ[yᵢ(wᵀxᵢ+b) − 1]
Dual Problem (substitute KKT: w = Σαᵢyᵢxᵢ)
max_α Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ
s.t. αᵢ ≥ 0 ∀i, Σᵢ αᵢyᵢ = 0
Decision Function
f(x) = sign(Σᵢ αᵢyᵢ xᵢᵀx + b)
Only support vectors (αᵢ > 0) contribute (KKT sparsity)
Bias term b
b = yₛ − Σᵢ αᵢyᵢ xᵢᵀxₛ for any support vector s
Maximum margin → best generalization by VC theory; only SVs matter

Soft-Margin SVM (Non-Separable)

Primal with Slack Variables
min_{w,b,ξ} ½||w||² + C Σᵢ ξᵢ
s.t. yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0
ξᵢ: slack — how much sample i violates margin
Dual (same form with upper bound on α)
max_α Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ
s.t. 0 ≤ αᵢ ≤ C, Σᵢ αᵢyᵢ = 0
Hinge Loss Equivalence
min_w ||w||² + C Σᵢ max(0, 1 − yᵢ(wᵀxᵢ+b))
SVM = L2-regularized hinge loss minimization

Kernel Trick (Mercer)

Kernel Substitution in Dual
xᵢᵀxⱼ → K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ)
Never compute φ explicitly; compute K directly in O(d)
Mercer Condition (valid kernel)
K must be symmetric positive semi-definite:
Σᵢ Σⱼ cᵢcⱼ K(xᵢ,xⱼ) ≥ 0 for all {cᵢ}, {xᵢ}
| Kernel | Formula | Feature Space | Best For |
|---|---|---|---|
| Linear | xᵀz | Original | High-d sparse (text) |
| Polynomial | (γxᵀz + r)ᵈ | Degree-d poly | Images, NLP |
| RBF/Gaussian | exp(−γ||x−z||²) | Infinite-dim | General purpose |
| Sigmoid | tanh(κxᵀz + θ) | Varies | NN analog (not always valid) |
Kernelized Decision Function
f(x) = sign(Σᵢ αᵢyᵢ K(xᵢ, x) + b)
Map to high-d implicitly; linear boundary in φ-space = nonlinear boundary in x-space
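A tiny check of the trick itself: for the degree-2 kernel K(x,z) = (xᵀz)² in 2-D, the explicit map is φ(x) = [x₁², √2·x₁x₂, x₂²], and the kernel equals the dot product in φ-space without ever constructing it:

```python
import numpy as np

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2 — computed in O(d), never touching phi-space."""
    return (x @ z) ** 2

def phi(x):
    """Explicit 3-D feature map for the 2-D degree-2 kernel."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
# poly2_kernel(x, z) == phi(x) @ phi(z) == 121
```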

SVM: When to Use / Avoid

| Scenario | SVM? | Notes |
|---|---|---|
| High-d, few samples (n≪d) | ✓ Great | Genomics, text; margin maximization helps |
| Large n (>100K) | ✗ Slow | O(n²–n³); use SGD/Linear SVM |
| Need probabilities | Partial | Platt scaling; not well-calibrated natively |
| Heavily overlapping classes | Tune C | Soft margin; RBF kernel |
| Multiclass | Manual | OvR or OvO decomposition |
08

Bayesian Learning

MLE vs MAP vs Full Bayes

Bayes Rule
P(θ|D) = P(D|θ) · P(θ) / P(D)
Posterior ∝ Likelihood × Prior
MLE — Maximum Likelihood
θ_MLE = argmax_θ P(D|θ) = argmax_θ Σᵢ log P(xᵢ|θ)
No prior; can overfit with small data; point estimate
MAP — Maximum A Posteriori
θ_MAP = argmax_θ log P(D|θ) + log P(θ)
Gaussian prior P(θ) = N(0, λ⁻¹I) → L2 regularization
Laplace prior P(θ) ∝ exp(−λ|θ|) → L1 regularization
Full Bayesian Prediction
P(y*|x*,D) = ∫ P(y*|x*,θ) P(θ|D) dθ
Averages over all θ; intractable for most models; use MCMC / VI
| | MLE | MAP | Full Bayes |
|---|---|---|---|
| Prior | No | Yes (point) | Yes (full) |
| Output | Point θ | Point θ | Distribution P(θ|D) |
| Overfit risk | High | Low | Lowest |
| Tractability | Easy | Easy | Often hard |

Naïve Bayes Classifier

Conditional Independence Assumption
P(x|Cₖ) = Πⱼ P(xⱼ|Cₖ)
Classification Rule
ŷ = argmax_k log P(Cₖ) + Σⱼ log P(xⱼ|Cₖ)
Log-space: avoids underflow from multiplying small probabilities
Laplace Smoothing (prevents zero probs)
P(xⱼ=v|Cₖ) = (count(xⱼ=v, Cₖ) + α) / (count(Cₖ) + α·|V|)
α=1: add-one; |V|: vocab/category size
| Variant | Feature Type | P(xⱼ|Cₖ) |
|---|---|---|
| Gaussian NB | Continuous | N(μₖⱼ, σ²ₖⱼ) — estimate μ,σ per class |
| Bernoulli NB | Binary | pₖⱼ^xⱼ · (1−pₖⱼ)^(1−xⱼ) |
| Multinomial NB | Counts (text) | θₖⱼ^xⱼ — word frequencies |
Independence assumption wrong but ranking often correct; fast, works with tiny data
Poor probability calibration despite good accuracy; feature correlation breaks it
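A minimal categorical NB sketch combining the log-space classification rule with Laplace smoothing; the weather-style toy data and the helper names (`nb_fit`, `predict`) are made up for illustration:

```python
import numpy as np
from collections import Counter, defaultdict

def nb_fit(X, y, alpha=1.0):
    """Categorical NB; rows of X are tuples of discrete feature values."""
    class_counts = Counter(y)
    n, n_feats = len(y), len(X[0])
    n_values = [len({row[j] for row in X}) for j in range(n_feats)]   # |V| per feature
    value_counts = defaultdict(Counter)      # (class, feature_idx) -> value counts
    for row, c in zip(X, y):
        for j, v in enumerate(row):
            value_counts[(c, j)][v] += 1

    def predict(x):
        # argmax_k log P(C_k) + sum_j log P(x_j | C_k), with add-alpha smoothing
        best, best_score = None, -np.inf
        for c, nc in class_counts.items():
            s = np.log(nc / n)
            for j, v in enumerate(x):
                s += np.log((value_counts[(c, j)][v] + alpha)
                            / (nc + alpha * n_values[j]))
            if s > best_score:
                best, best_score = c, s
        return best

    return predict

X = [('sunny', 'hot'), ('sunny', 'mild'), ('rain', 'cool'), ('rain', 'mild')]
y = ['no', 'no', 'yes', 'yes']
predict = nb_fit(X, y)
```

Note that an unseen value (count 0) still gets probability α/(count + α|V|) instead of zero, which is exactly what the smoothing term buys.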

Bayesian Linear Regression

Prior over weights
p(w) = N(0, α⁻¹I)
Likelihood
p(y|X,w) = N(Xw, β⁻¹I)
Posterior (conjugate — also Gaussian)
p(w|X,y) = N(mN, SN)
SN = (αI + βXᵀX)⁻¹
mN = β · SN · Xᵀy
Predictive Distribution
p(y*|x*,X,y) = N(mNᵀφ(x*), σ²_N(x*))
σ²_N(x*) = β⁻¹ + φ(x*)ᵀ SN φ(x*)
Uncertainty grows where data is sparse — automatic credible intervals
Full posterior over weights → predictive uncertainty quantification; mN is the MAP estimate, and as α→0 (flat prior) it reduces to the MLE/OLS solution
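The posterior and predictive formulas above as a sketch (identity basis, i.e. φ(x) = x; the noise-free toy line and the α, β values are illustrative):

```python
import numpy as np

def blr_posterior(X, y, alpha=1.0, beta=25.0):
    """S_N = (alpha I + beta X^T X)^{-1}; m_N = beta S_N X^T y."""
    d = X.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(d) + beta * (X.T @ X))
    m_N = beta * S_N @ X.T @ y
    return m_N, S_N

def blr_predict(x_star, m_N, S_N, beta=25.0):
    """Predictive mean x^T m_N and variance 1/beta + x^T S_N x."""
    return x_star @ m_N, 1.0 / beta + x_star @ S_N @ x_star

# y = 2x observed on [0, 1]; bias column prepended
X = np.c_[np.ones(5), np.linspace(0.0, 1.0, 5)]
y = 2.0 * X[:, 1]
m_N, S_N = blr_posterior(X, y)
_, var_near = blr_predict(np.array([1.0, 0.5]), m_N, S_N)   # inside the data range
_, var_far = blr_predict(np.array([1.0, 5.0]), m_N, S_N)    # far extrapolation
```

`var_far > var_near` demonstrates the "uncertainty grows where data is sparse" point directly.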

Optimal Bayes Classifier

ŷ = argmax_k P(Cₖ|x) = argmax_k P(x|Cₖ)P(Cₖ)
09

Ensemble Learning

Ensemble Types & Error Decomposition

| Method | Training | Reduces | Examples |
|---|---|---|---|
| Bagging | Parallel | Variance | Random Forest |
| Boosting | Sequential | Bias | AdaBoost, GBM, XGBoost |
| Stacking | Both | Both | Meta-learner on OOF preds |
Variance of Average of n Models
Var(avg) = ρσ² + (1−ρ)σ²/n
ρ: inter-model correlation. Lower ρ → more reduction. Need diversity + accuracy.
Diverse accurate models that fail on different samples → strong ensemble

Bagging

// Bootstrap Aggregating
for t = 1 to T:
    Dₜ = bootstrap_sample(D, n)   // n samples w/ replacement
    hₜ = train(Dₜ)
return H(x) = majority_vote(hₜ(x))   // or mean for regression

Random Forest

Feature Importance (Gini / Mean Decrease Impurity)
FIⱼ = Σ_trees Σ_nodes[split on j] · ΔImpurity · (n_node/n_total)
Normalized to sum=1. Biased toward high-cardinality continuous features!
Permutation Importance (unbiased)
PIⱼ = score(original) − score(with feature j shuffled)
| Hyperparameter | Effect (↑) |
|---|---|
| n_estimators | Better, no overfit risk; diminishing returns after ~200 |
| max_features | Less diversity per tree but higher individual accuracy |
| max_depth | Higher variance per tree, richer structure |
| min_samples_leaf | More regularization, smoother boundaries |
Wisdom of many uncorrelated trees; OOB provides free validation
9.4.1

AdaBoost

AdaBoost — Algorithm & Equations

Final Classifier
H(x) = sign(F(x)), F(x) = Σₜ αₜ hₜ(x)
// AdaBoost (Binary ±1 labels)
Initialize: wᵢ = 1/m ∀i
for t = 1 to T:
    hₜ = train weak learner on (X, y, w)   // e.g., depth-1 tree
    εₜ = Σᵢ wᵢ · 𝟙[hₜ(xᵢ) ≠ yᵢ]            // weighted error
    αₜ = ½ ln((1 − εₜ) / εₜ)               // learner confidence
    for each i:
        wᵢ ← wᵢ · exp(−αₜ · yᵢ · hₜ(xᵢ))
        // misclassified: yᵢhₜ(xᵢ)=−1 → weight × exp(+αₜ) (↑)
        // correct: yᵢhₜ(xᵢ)=+1 → weight × exp(−αₜ) (↓)
    wᵢ ← wᵢ / Σⱼ wⱼ                        // normalize to sum=1
Learner Weight
αₜ = ½ ln((1−εₜ)/εₜ)
εₜ → 0: αₜ → ∞ (perfect learner heavily weighted)
εₜ = 0.5: αₜ = 0 (random; ignored)
Requires εₜ < 0.5 (better than random)
Training Error Upper Bound
Train error ≤ exp(−2 Σₜ γₜ²) where γₜ = 0.5 − εₜ
Exponentially decreasing with T; theoretically zero with enough boosting rounds
Loss Function (implicit)
L(y, F) = exp(−y·F(x)) — exponential loss
Very sensitive to outliers and label noise
At each step, focus on what previous models got wrong by upweighting their mistakes
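The pseudocode above, made runnable with exhaustive decision stumps as the weak learner (a reasonable but illustrative choice); the 1-D toy data is separable by a single threshold:

```python
import numpy as np

def stump_train(X, y, w):
    """Best single-feature threshold stump under sample weights w (labels ±1)."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= t, sign, -sign)
                err = w[pred != y].sum()          # weighted error eps_t
                if err < best[0]:
                    best = (err, (j, t, sign))
    return best

def adaboost(X, y, T=5):
    m = len(y)
    w = np.full(m, 1.0 / m)
    ensemble = []
    for _ in range(T):
        err, (j, t, sign) = stump_train(X, y, w)
        err = max(err, 1e-12)                     # guard log(0) on a perfect stump
        a = 0.5 * np.log((1 - err) / err)         # alpha_t
        pred = np.where(X[:, j] <= t, sign, -sign)
        w *= np.exp(-a * y * pred)                # upweight mistakes, downweight correct
        w /= w.sum()
        ensemble.append((a, j, t, sign))
    return ensemble

def ada_predict(ensemble, X):
    F = sum(a * np.where(X[:, j] <= t, s, -s) for a, j, t, s in ensemble)
    return np.sign(F)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, -1, -1])
ens = adaboost(X, y)
```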
9.4.2

Gradient Boosting Machine (GBM)

GBM — Algorithm & Equations

Additive Model
F_M(x) = F₀(x) + Σₘ₌₁ᴹ γₘ hₘ(x)
hₘ: weak learner (shallow tree); γₘ: step size via line search
// Gradient Boosting (Friedman 2001)
F₀(x) = argmin_γ Σᵢ L(yᵢ, γ)   // constant init (e.g., mean y)
for m = 1 to M:
    // Compute pseudo-residuals (negative gradient)
    rᵢₘ = −[∂L(yᵢ, F(xᵢ)) / ∂F(xᵢ)]_{F=Fₘ₋₁}
    // Fit weak learner to pseudo-residuals
    hₘ = fit_tree({(xᵢ, rᵢₘ)})
    // Line search for step size (or use fixed η)
    γₘ = argmin_γ Σᵢ L(yᵢ, Fₘ₋₁(xᵢ) + γ hₘ(xᵢ))
    Fₘ(x) = Fₘ₋₁(x) + η · γₘ · hₘ(x)
Pseudo-Residuals for Common Losses
MSE (L2): rᵢ = yᵢ − Fₘ₋₁(xᵢ) // actual residuals
MAE (L1): rᵢ = sign(yᵢ − Fₘ₋₁(xᵢ))
Log-loss: rᵢ = yᵢ − σ(Fₘ₋₁(xᵢ))
Leaf Region Output (regression, MSE)
γⱼₘ = mean{yᵢ − Fₘ₋₁(xᵢ) : xᵢ ∈ Rⱼₘ}
Rⱼₘ: region (leaf j of tree m)
Gradient descent in function space; each tree fits the gradient of loss w.r.t. current predictions
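A sketch of MSE gradient boosting with depth-1 trees (stumps), where the pseudo-residuals are plain residuals y − F(x); the step-function toy data and hyperparameters are illustrative:

```python
import numpy as np

def best_split(X, r):
    """Stump split minimizing squared error of leaf-mean predictions on residuals r."""
    best = (np.inf, 0, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:     # drop last value so both sides nonempty
            left = X[:, j] <= t
            sse = (((r[left] - r[left].mean()) ** 2).sum()
                   + ((r[~left] - r[~left].mean()) ** 2).sum())
            if sse < best[0]:
                best = (sse, j, t)
    return best[1], best[2]

def gbm_fit(X, y, M=50, eta=0.3):
    F0 = y.mean()                             # constant init for MSE
    F = np.full(len(y), F0)
    trees = []
    for _ in range(M):
        r = y - F                             # pseudo-residuals = negative gradient
        j, t = best_split(X, r)
        left = X[:, j] <= t
        gamma_l, gamma_r = r[left].mean(), r[~left].mean()   # leaf outputs
        trees.append((j, t, gamma_l, gamma_r))
        F += eta * np.where(left, gamma_l, gamma_r)          # shrinkage step
    return F0, trees

def gbm_predict(model, X, eta=0.3):
    F0, trees = model
    F = np.full(len(X), F0)
    for j, t, gl, gr in trees:
        F += eta * np.where(X[:, j] <= t, gl, gr)
    return F

X = np.arange(8.0).reshape(-1, 1)
y = np.where(X[:, 0] > 3, 10.0, 0.0)          # a step function: one stump fits it
model = gbm_fit(X, y)
```

Because each round removes a fixed fraction η of the remaining residual, the training error here shrinks geometrically, illustrating the shrinkage note in the XGBoost section.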
9.4.3

XGBoost — eXtreme Gradient Boosting

XGBoost — Full Derivation

Regularized Objective
Obj = Σᵢ L(yᵢ, ŷᵢ) + Σₖ Ω(fₖ)
Ω(f) = γT + ½λ Σⱼ₌₁ᵀ wⱼ²
T: #leaves; wⱼ: leaf score; γ: min gain threshold; λ: L2 on leaf weights
2nd-Order Taylor Approximation of Obj at step t
Obj⁽ᵗ⁾ ≈ Σᵢ [gᵢ fₜ(xᵢ) + ½ hᵢ fₜ(xᵢ)²] + Ω(fₜ) + const
gᵢ = ∂_{ŷ⁽ᵗ⁻¹⁾} L(yᵢ, ŷ⁽ᵗ⁻¹⁾) // 1st derivative
hᵢ = ∂²_{ŷ⁽ᵗ⁻¹⁾} L(yᵢ, ŷ⁽ᵗ⁻¹⁾) // 2nd derivative (Hessian)
GBM uses only gᵢ (1st order); XGBoost adds hᵢ for better curvature info
Optimal Leaf Weight (closed-form, per leaf j)
wⱼ* = −Gⱼ / (Hⱼ + λ)
Gⱼ = Σᵢ∈Iⱼ gᵢ, Hⱼ = Σᵢ∈Iⱼ hᵢ
Iⱼ: set of sample indices in leaf j
Optimal Obj Score (after finding best tree structure)
Obj* = −½ Σⱼ₌₁ᵀ Gⱼ² / (Hⱼ + λ) + γT
Split Gain (greedy — maximize per split)
Gain = ½ [ G²_L/(H_L+λ) + G²_R/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ) ] − γ
Split only if Gain > 0; γ acts as minimum gain for pruning (pre-stop)
Regression (MSE): gᵢ, hᵢ
L = ½(yᵢ − ŷᵢ)²
gᵢ = ŷᵢ − yᵢ (predicted − actual)
hᵢ = 1 (constant Hessian)
Binary Classification (Log-loss)
L = −yᵢ log pᵢ − (1−yᵢ) log(1−pᵢ), pᵢ = σ(ŷᵢ)
gᵢ = pᵢ − yᵢ
hᵢ = pᵢ(1 − pᵢ) // non-constant → better curvature
Final output: P(y=1|x) = σ(Fₜ(x))
// XGBoost Single Tree Build
Compute {gᵢ, hᵢ} for all samples
Start with all samples in root
while depth < max_depth:
    for each leaf node:
        for each feature f:
            Sort instances by f value
            Scan thresholds; compute Gain(f,t)
        best = argmax_{f,t} Gain(f,t)
        if max Gain > 0: split node
        else: mark leaf
Assign wⱼ* = −Gⱼ/(Hⱼ+λ) to leaves
Update: Fₜ(x) += η · wⱼ*(leaf of x)
  • Shrinkage (η): scale each tree's contribution → reduce each tree's influence, allow later trees to correct
  • Column subsampling: random feature fraction per tree/level/node → like RF, reduces correlation
  • Row subsampling: stochastic GBM — random sample fraction per tree
  • Approximate splits: sketch-based quantiles (weighted) for distributed/large data
  • Sparsity-aware: missing values → learn default branch direction during split
  • Cache-aware: data stored in compressed column blocks; prefetch for CPU cache efficiency
  • max_delta_step: bounds leaf weight update — stabilizes logistic regression training

GBM vs XGBoost vs LightGBM vs CatBoost

| Property | sklearn GBM | XGBoost | LightGBM | CatBoost |
|---|---|---|---|---|
| Tree growth | Level-wise | Level-wise (depth-first) | Leaf-wise (best-first) | Symmetric (oblivious trees) |
| Taylor order | 1st only | 1st + 2nd | 1st + 2nd | 1st + 2nd |
| Speed | Slow | Fast (parallel column sort) | Fastest (GOSS + EFB) | Moderate |
| Missing values | Manual impute | ✓ native default direction | ✓ native | ✓ native |
| Categorical | Manual encode | Manual encode | ✓ native (optimal split) | ✓ target encoding with ordered boosting |
| Regularization | subsample, max_depth | γ, λ, α, subsample, colsample | min_gain, min_data_in_leaf | Symmetric structure = implicit reg |
| Overfit risk | Low | Low | Higher (leaf-wise); use min_data_in_leaf | Low (symmetric) |
| Unique optimization | — | Block structure, approx splits | GOSS + EFB histogram | Ordered boosting (no leakage) |
LightGBM GOSS: keep large-gradient samples + random low-gradient sample → reduce data without losing signal. EFB: bundle mutually exclusive features.

AdaBoost vs GBM

| | AdaBoost | GBM |
|---|---|---|
| Weighting mechanism | Sample reweighting | Residual fitting |
| Step size | αₜ derived from εₜ | Line search or fixed η |
| Loss function | Exponential (fixed) | Any differentiable loss |
| Outlier sensitivity | Very high (exp loss) | Depends on loss choice |
| Flexibility | Low | High (plug any loss) |
AdaBoost = GBM with exponential loss + exact line search. GBM generalizes to any loss.

Boosting Hyperparameter Guide

Boosting sensitive to outliers with MSE loss → use Huber or quantile loss for robustness
10

Unsupervised Learning

K-Means Clustering

Objective (Within-Cluster Sum of Squares, WCSS)
J = Σₖ Σᵢ∈Cₖ ||xᵢ − μₖ||²
// Lloyd's Algorithm
Initialize μ₁,...,μₖ   // random or k-means++
repeat until assignments don't change:
    // E-step: Assign to nearest centroid
    cᵢ = argmin_k ||xᵢ − μₖ||²  ∀i
    // M-step: Recompute centroids
    μₖ = (1/|Cₖ|) Σᵢ∈Cₖ xᵢ  ∀k
K-Means++ Initialization
1. c₁ = uniform random sample from X
2. For j = 2..k:
       cⱼ ~ P(x) ∝ min_{i<j} ||x − cᵢ||²   (prefer distant points)
O(log k) approximation guarantee; reduces bad local minima
Silhouette Score (per sample)
s = (b − a) / max(a, b)
a = mean intra-cluster distance; b = mean nearest-cluster distance
Range: [−1, +1]; higher = better; >0.5 is good
Sensitive to scale → standardize. Assumes spherical, equal-size clusters. Fails on non-convex shapes.
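Lloyd's algorithm as a sketch; for determinism this uses greedy farthest-point seeding (a simpler deterministic cousin of k-means++, not the sampled version above), and the two-blob data is synthetic:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm with greedy farthest-point initialization."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])         # point farthest from chosen centers
    mu = np.array(centers, dtype=float)
    for _ in range(iters):
        # E-step: assign each point to its nearest centroid
        assign = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1).argmin(1)
        # M-step: recompute centroids (empty cluster keeps its old centroid)
        new_mu = np.array([X[assign == j].mean(0) if np.any(assign == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, assign

X_demo = np.vstack([np.zeros((5, 2)), 10.0 + np.zeros((5, 2))])   # two tight blobs
mu, assign = kmeans(X_demo, 2)
```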

K-Means Variants

| Variant | Key Difference | Use When |
|---|---|---|
| K-Means++ | Probabilistic init | Always over random init |
| K-Medoids (PAM) | Centers = actual data points | Non-Euclidean, robust to outliers |
| Mini-batch K-Means | SGD-style centroid updates | n > 1M; online streams |
| Fuzzy C-Means | Soft memberships uᵢₖ ∈ [0,1] | Overlapping clusters |
| Bisecting K-Means | Recursively split largest cluster | Hierarchical structure needed |
| DBSCAN | Density-based, no k needed | Arbitrary shapes; detect outliers |
| HDBSCAN | Hierarchical DBSCAN | Varying density clusters |
DBSCAN Parameters
ε: neighborhood radius; MinPts: min samples to form core point
Core point: |{x' : d(x,x') ≤ ε}| ≥ MinPts
Noise = points not reachable from any core point

EM Algorithm

Goal: maximize log P(X|θ) with latent variables Z
log P(X|θ) ≥ L(q,θ) = Σᵢ Σₖ q(zᵢ=k) log [P(xᵢ,zᵢ=k|θ)/q(zᵢ=k)]
ELBO (Evidence Lower BOund) — maximize by iterating E and M steps
// EM General Algorithm
Initialize θ⁰ randomly
repeat until convergence:
    // E-step: Compute Q function (expected complete-data log-likelihood)
    Q(θ|θᵗ) = E_{Z|X,θᵗ} [log P(X,Z|θ)]
    // M-step: Maximize Q over θ
    θᵗ⁺¹ = argmax_θ Q(θ|θᵗ)
Alternate between inferring missing data (E) and fitting parameters (M)

Gaussian Mixture Models (GMM)

Generative Model
P(x) = Σₖ₌₁ᴷ πₖ N(x | μₖ, Σₖ)
πₖ: mixing coefficients (Σπₖ=1, πₖ≥0); z ~ Categorical(π) is latent
// GMM-EM Steps
Initialize μₖ, Σₖ, πₖ   // e.g., from k-means
repeat until convergence:
    // E-step: Responsibilities
    γₙₖ = πₖ N(xₙ|μₖ,Σₖ) / Σⱼ πⱼ N(xₙ|μⱼ,Σⱼ)
    // M-step: Update parameters
    Nₖ = Σₙ γₙₖ
    μₖ = (1/Nₖ) Σₙ γₙₖ xₙ
    Σₖ = (1/Nₖ) Σₙ γₙₖ (xₙ−μₖ)(xₙ−μₖ)ᵀ
    πₖ = Nₖ / N
Log-Likelihood (to monitor convergence)
ℓ(θ) = Σₙ log [Σₖ πₖ N(xₙ|μₖ,Σₖ)]
Covariance Types (from most to least flexible)
Full: Σₖ arbitrary (K·d² params)
Tied: Σₖ = Σ shared (d²)
Diagonal: Σₖ = diag(σ²ₖ) (K·d params)
Spherical: Σₖ = σ²ₖI (K params)
Singularity if μₖ collapses to a data point with full covariance → regularize: Σₖ += εI
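The GMM-EM steps above in 1-D, including the εI-style variance floor just mentioned; the quantile initialization and the two-cluster sample are illustrative choices:

```python
import numpy as np

def gmm_em_1d(x, k=2, iters=200):
    """EM for a 1-D GMM; crude deterministic init spreads means over the data range."""
    mu = np.quantile(x, np.linspace(0, 1, k))     # e.g., [min, max] for k=2
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities gamma[n, j] ∝ pi_j N(x_n | mu_j, var_j)
        p = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        g = p / p.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted MLE updates
        Nk = g.sum(axis=0)
        mu = (g * x[:, None]).sum(axis=0) / Nk
        var = (g * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-6   # guards collapse
        pi = Nk / len(x)
    return pi, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.5, 100), rng.normal(10.0, 0.5, 100)])
pi, mu, var = gmm_em_1d(x)
```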

K-Means vs GMM vs DBSCAN

| Property | K-Means | GMM | DBSCAN |
|---|---|---|---|
| Requires k | Yes | Yes (K) | No (ε, MinPts) |
| Soft assignment | No (hard) | Yes (γₙₖ) | No |
| Cluster shape | Spherical | Elliptical (per Σₖ) | Arbitrary |
| Outlier detection | None | Low-responsibility pts | Explicit noise pts |
| Probabilistic model | No | Yes | No |
| Scalability | Very high | Moderate (O(nKd²)) | O(n log n) w/ index |
| Convergence | Local WCSS min | Local log-lik max | Deterministic |
11

ML Model Evaluation & Comparison

Cross-Validation Methods

| Method | Description | Use When |
|---|---|---|
| k-Fold CV | k disjoint folds; rotate test fold | General; k=5 or 10 |
| Stratified k-Fold | Preserve class ratio per fold | Classification, imbalanced |
| LOOCV | n-fold; one sample left out | Very small n; high variance of estimate |
| Repeated k-Fold | Multiple random k-fold runs | Reduce CV estimate variance |
| Time series split | Walk-forward; no future leakage | Any temporal data |
| Nested CV | Outer: eval; Inner: HP tuning | Unbiased performance estimate when doing HP search |
Fit all preprocessors (scalers, imputers) on training fold only; never on test fold.

Statistical Tests for Model Comparison

Paired t-test (k-fold CV)
tₛₜₐₜ = d̄ / (s_d / √k)
dᵢ = acc_A(foldᵢ) − acc_B(foldᵢ)
H₀: models equivalent; reject if |t| > t_{α/2, k-1}
5×2 CV Test (Dietterich — more reliable)
Use 5 repetitions of 2-fold CV
pᵢ⁽ʲ⁾ = err_A(fold j, rep i) − err_B(fold j, rep i)
sᵢ² = (pᵢ⁽¹⁾ − p̄ᵢ)² + (pᵢ⁽²⁾ − p̄ᵢ)², p̄ᵢ = (pᵢ⁽¹⁾ + pᵢ⁽²⁾)/2
tₛₜₐₜ = p₁⁽¹⁾ / √((1/5) Σᵢ₌₁⁵ sᵢ²) — compare to t distribution with 5 df
McNemar's Test (single test set)
χ² = (|b − c| − 1)² / (b + c)
b: A right, B wrong; c: B right, A wrong
Tests if disagreements are asymmetric; no assumption of independence
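The continuity-corrected McNemar statistic is a one-liner; the b/c counts below are invented for illustration:

```python
def mcnemar_chi2(b, c):
    """(|b - c| - 1)^2 / (b + c); compare to chi2(1) critical value 3.84 at alpha=0.05."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# b = 30 cases where A is right and B wrong, c = 10 the reverse:
stat = mcnemar_chi2(30, 10)   # 19^2 / 40 = 9.025 > 3.84 -> reject H0
```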

Bias, Fairness & Interpretability

| Method | Type | Principle |
|---|---|---|
| LIME | Local, model-agnostic | Fit local linear approx in perturbed neighborhood |
| SHAP | Local + Global | Shapley values; unique fair attribution |
| Grad-CAM | Local, CNN | Gradient × activation for saliency map |
| PDP / ICE | Global / Local | Marginal effect of one feature |
| Permutation Imp. | Global | Score drop when feature shuffled |
SHAP Shapley Value
φᵢ = Σ_{S⊆F\{i}} [|S|!(|F|−|S|−1)!/|F|!] · [f(S∪{i}) − f(S)]
Average marginal contribution across all feature orderings. Satisfies: efficiency, symmetry, dummy, additivity.
SHAP is the only method satisfying all 4 fairness axioms for feature attribution

Common Interview Traps

QR

Quick Reference — Algo Selection & Complexity

Algorithm Selection Guide

| Scenario | Top Choice | Alternative | Avoid |
|---|---|---|---|
| Tabular data, structured, n≥1K | XGBoost / LightGBM | Random Forest | KNN (slow predict) |
| Few features, linear relationship | Ridge / Lasso Regression | Logistic Regression | Deep trees |
| Text classification | Logistic Reg + TF-IDF | Linear SVM, Multinomial NB | KNN (high-d) |
| Small dataset (<1K samples) | SVM (RBF), NB | Logistic Regression | Deep neural networks |
| Need interpretability | Decision Tree, Linear Model | SHAP + XGBoost | Deep NN without SHAP |
| Probabilistic output needed | Logistic Regression, GMM | Calibrated RF / XGB | Hard-margin SVM |
| Clustering, unknown k | DBSCAN / HDBSCAN | Hierarchical clustering | K-Means |
| Clustering, known k, spherical | K-Means++ | Mini-batch K-Means | DBSCAN |
| Soft cluster assignments | GMM | Fuzzy C-Means | K-Means |
| Non-linear boundary, few samples | SVM + RBF kernel | Random Forest | Plain KNN |
| Online / streaming | SGD variants, NB | Hoeffding Trees | Batch SVM |
| Imbalanced binary | XGB (scale_pos_weight) + F1 | RF + class_weight | Default threshold |

Complexity Reference

| Algorithm | Train | Predict | Space |
|---|---|---|---|
| Linear Regression (GD) | O(nd·iter) | O(d) | O(d) |
| Linear Regression (Normal Eq) | O(nd² + d³) | O(d) | O(d) |
| Logistic Regression | O(nd·iter) | O(d) | O(d) |
| KNN | O(1), or O(nd) to build index | O(nd) | O(nd) |
| Decision Tree | O(nd log n) | O(depth) | O(n·depth) |
| SVM (kernel) | O(n²d – n³) | O(SV·d) | O(SV) |
| Naive Bayes | O(nd) | O(Kd) | O(Kd) |
| Random Forest | O(T·nd log n) | O(T·depth) | O(T·n) |
| GBM / XGBoost | O(T·nd log n) | O(T·depth) | O(T·n) |
| K-Means | O(nkd·iter) | O(kd) | O((n+k)d) |
| GMM-EM | O(nkd²·iter) | O(kd²) | O(kd²) |

Model Assumptions Reference

| Algorithm | Key Assumptions |
|---|---|
| Linear Regression | Linear f(x), homoscedasticity, no multicollinearity, errors N(0,σ²) |
| Logistic Regression | Linear decision boundary in feature space |
| LDA | Gaussian class-conditionals, equal covariance matrices |
| Naïve Bayes | Conditional independence of features given class |
| KNN | Locally similar points share labels; smooth decision boundary |
| K-Means | Spherical, similar-size, convex, well-separated clusters |
| GMM | Data generated from K Gaussian components |
| SVM | Separable (or nearly) by hyperplane; kernel defines geometry |
| Decision Tree | Axis-aligned decision boundaries; recursive partition |
| Bayesian LR | Gaussian prior on weights, Gaussian likelihood |