ML Interview Notes

// all topics · formulas + pseudocode + comparisons · interview-ready
01

Introduction to Machine Learning

ML Taxonomy

Mitchell's def: "A computer program is said to learn from experience E with respect to task T and performance measure P, if its performance at T, as measured by P, improves with experience E."

Designing a Learning System

Core Challenges

02

ML Workflow

Data Preprocessing

Normalization (Min-Max)
x' = (x − min) / (max − min) → [0, 1]
Standardization (Z-score)
x' = (x − μ) / σ → μ=0, σ=1
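Both formulas above can be sketched column-wise in NumPy; the toy matrix `X` is illustrative only:

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column to [0, 1]: x' = (x - min) / (max - min)."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def z_score(X):
    """Standardize each column: x' = (x - mu) / sigma."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_mm = min_max_scale(X)   # each column spans [0, 1]
X_z = z_score(X)          # each column has mean 0, std 1
```

In practice, fit the min/max or μ/σ on the training split only and reuse them on the test split (see the preprocessing note in section 11).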

Class Imbalance Strategies

Accuracy misleading on imbalanced data → use F1 / AUC-PR / AUC-ROC

Performance Metrics

Classification
Accuracy = (TP+TN) / N
Precision = TP / (TP+FP) // of predicted +, how many real +
Recall = TP / (TP+FN) // of actual +, how many caught
F1 = 2·P·R / (P+R)
AUC-ROC = P(score(+) > score(−))
Regression
MAE = (1/n)Σ|yᵢ − ŷᵢ|
MSE = (1/n)Σ(yᵢ − ŷᵢ)²
RMSE = √MSE
R² = 1 − SS_res/SS_tot
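The four regression metrics above, computed directly from their definitions (toy arrays are illustrative):

```python
import numpy as np

def regression_metrics(y, y_hat):
    """Return MAE, MSE, RMSE, R² exactly as defined above."""
    err = y - y_hat
    mae = np.abs(err).mean()
    mse = (err ** 2).mean()
    rmse = np.sqrt(mse)
    ss_res = (err ** 2).sum()                  # residual sum of squares
    ss_tot = ((y - y.mean()) ** 2).sum()       # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, rmse, r2

y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 8.0])
mae, mse, rmse, r2 = regression_metrics(y, y_hat)
```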

Metric Selection Guide

| Scenario | Metric | Why |
|---|---|---|
| Imbalanced classes | F1, AUC-ROC, AUC-PR | Accuracy misleading; AUC-PR better when positives rare |
| Cost FP ≫ FN (spam, fraud alert) | Precision | Minimize false alarms |
| Cost FN ≫ FP (cancer, fraud detection) | Recall | Minimize missed positives |
| Regression with outliers | MAE / Huber | Robust to large errors (L1 vs L2) |
| Regression, large errors critical | RMSE | Squares amplify large deviations |
| Ranking, recommendation | NDCG, MAP, MRR | Order-sensitive evaluation |
03

Linear Models for Regression

Model & Loss

Hypothesis
h(x) = θᵀx = θ₀ + θ₁x₁ + ··· + θₙxₙ
MSE Loss
J(θ) = (1/2m) Σᵢ (hθ(xᵢ) − yᵢ)²
Normal Equation — Direct Solution (O(n³))
θ* = (XᵀX)⁻¹ Xᵀy
Fails if XᵀX singular (multicollinear); use pseudoinverse / Ridge. Only feasible for small feature count.
Gradient
∂J/∂θⱼ = (1/m) Σᵢ (hθ(xᵢ) − yᵢ) xᵢⱼ
In matrix form: ∇J = (1/m) Xᵀ(Xθ − y)
Set gradient to 0 analytically (normal eq.) or descend iteratively (GD)
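A minimal sketch of the normal equation, using the pseudoinverse as recommended above for the singular case (the line y = 2 + 3x is synthetic):

```python
import numpy as np

def normal_equation(X, y):
    """theta* = (X^T X)^{-1} X^T y; pinv handles a singular X^T X."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# y = 2 + 3x, with a bias column of ones prepended
X = np.c_[np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])]
y = 2.0 + 3.0 * X[:, 1]
theta = normal_equation(X, y)   # recovers [2, 3]
```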

Gradient Descent Variants

Update Rule
θⱼ := θⱼ − α · ∂J/∂θⱼ
// Mini-batch GD (general form)
Initialize θ randomly
repeat until convergence:
    for each mini-batch B ⊂ {1..m}:
        θ := θ − (α/|B|) Σᵢ∈B ∇L(θ; xᵢ,yᵢ)

| Variant | Samples/Update | Noise | Speed |
|---|---|---|---|
| Batch GD | All m | None | Slowest per epoch |
| SGD | 1 | High | Fastest; can escape local minima |
| Mini-batch | k (32–512) | Low | Best of both; GPU-vectorized |
α too large → diverge; α too small → slow. Use learning rate schedule or Adam.
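The mini-batch loop above as runnable NumPy; the data, step size, and epoch count are illustrative choices for a tiny noise-free problem:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.3, batch_size=2, epochs=2000, seed=0):
    """theta := theta - alpha * grad, with the MSE gradient averaged per mini-batch."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))           # reshuffle each epoch
        for start in range(0, len(y), batch_size):
            b = idx[start:start + batch_size]
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)   # (1/|B|) X^T (X theta - y)
            theta -= alpha * grad
    return theta

# y = 1 + 2x on a [0, 1] feature; bias column prepended
X = np.c_[np.ones(6), np.linspace(0.0, 1.0, 6)]
y = 1.0 + 2.0 * X[:, 1]
theta = minibatch_gd(X, y)
```

Note the step size only converges because the feature is scaled to [0, 1]; unscaled features would need a smaller α, which is one practical reason for the standardization step in section 02.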

Regularization

Ridge (L2) — Normal Eq becomes invertible
J(θ) = MSE + λΣⱼ θⱼ²
θ* = (XᵀX + λI)⁻¹ Xᵀy
Shrinks weights continuously; never exactly 0; handles multicollinearity
Lasso (L1) — Feature selection
J(θ) = MSE + λΣⱼ |θⱼ|
Sparsity: corners of L1 ball → exact zeros; use coordinate descent
ElasticNet
J(θ) = MSE + λ₁Σ|θⱼ| + λ₂Σθⱼ²
Groups correlated features; L1 selects one, ElasticNet selects group
| | Ridge | Lasso | ElasticNet |
|---|---|---|---|
| Sparsity | No | Yes | Yes |
| Correlated feats | Equal shrink | Picks one | Groups |
| Closed form | Yes | No (subgradient) | No |
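Ridge's closed form can be sketched directly; the duplicated-column data is a contrived example of the multicollinear case where plain OLS breaks but the λI term keeps the system invertible:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """theta* = (X^T X + lam I)^{-1} X^T y — invertible for any lam > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Two perfectly correlated columns: X^T X is singular, yet Ridge still solves,
# and it splits the weight equally across the correlated pair ("equal shrink")
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = ridge_fit(X, y, lam=0.1)
```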

Bias-Variance Decomposition

Decomposition
E[(y − ŷ)²] = Bias²(ŷ) + Var(ŷ) + σ²
Bias: error from wrong model assumptions (underfitting)
Variance: sensitivity to training sample (overfitting)
σ²: irreducible noise — can never eliminate
Bias of Estimator
Bias(ŷ) = E[ŷ] − f(x) // systematic offset from truth
| Condition | Bias | Variance | Fix |
|---|---|---|---|
| Overfitting | Low | High | Regularize, prune, more data, dropout |
| Underfitting | High | Low | More features, complex model, less regularization |
Bagging reduces variance; boosting reduces bias; regularization shifts tradeoff

Basis Function Models

General Form
h(x) = θᵀφ(x) where φ: X → ℝᵐ
04

Linear Models for Classification

Logistic Regression

Sigmoid Activation
σ(z) = 1 / (1 + e^(−z)), z = θᵀx
P(y=1|x;θ) = σ(θᵀx)
Binary Cross-Entropy Loss (NLL of Bernoulli)
J(θ) = −(1/m) Σᵢ [yᵢ log ŷᵢ + (1−yᵢ) log(1−ŷᵢ)]
Gradient — same form as linear regression!
∂J/∂θⱼ = (1/m) Σᵢ (ŷᵢ − yᵢ) xᵢⱼ
In matrix: ∇J = (1/m) Xᵀ(ŷ − y)
Multiclass — Softmax
P(y=k|x) = exp(θₖᵀx) / Σⱼ exp(θⱼᵀx)
Decision Boundary
θᵀx = 0 → linear hyperplane in feature space
Predict 1 if P(y=1|x) ≥ 0.5, i.e., θᵀx ≥ 0
Max likelihood of Bernoulli outputs; convex loss → unique global minimum via GD
Perfect separability → weights diverge; fix with L2 regularization
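The BCE gradient above in a minimal GD loop; the small L2 term (`lam`) illustrates the fix for the separability issue just mentioned (for simplicity this sketch also penalizes the bias, which production implementations usually skip). The four-point dataset is synthetic:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, alpha=0.5, iters=2000, lam=0.01):
    """GD on BCE: grad = (1/m) X^T (y_hat - y) + lam*theta (L2 caps weight growth)."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m + lam * theta
        theta -= alpha * grad
    return theta

X = np.c_[np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])]   # bias + 1 feature
y = np.array([0.0, 0.0, 1.0, 1.0])                        # separable at x = 0
theta = logreg_fit(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
```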

Decision Theory

| | Generative | Discriminative |
|---|---|---|
| Models | NB, GMM, HMM, LDA | LR, SVM, NN, RF |
| Learns | P(x,y) = P(x|y)P(y) | P(y|x) directly |
| Pros | Works w/ small data; handles missing; can sample | Higher accuracy on large data |
| Cons | Strong distributional assumptions | Needs labeled data |

Linear Discriminant Analysis (LDA)

Projection Criterion (Fisher)
max_w (wᵀSBw) / (wᵀSWw)
SB: between-class scatter; SW: within-class scatter
Solution
SW⁻¹ SB w = λw → generalized eigenproblem
Reduce to (K−1) dimensions for K classes
05

Decision Trees

Information Theory Fundamentals

Entropy (impurity measure)
H(S) = −Σₖ pₖ log₂ pₖ
pₖ = fraction of class k; 0 log 0 ≡ 0; range [0, log₂K]
Information Gain (ID3 criterion)
IG(S, A) = H(S) − Σᵥ (|Sᵥ|/|S|) H(Sᵥ)
A: attribute; Sᵥ: subset where A=v
Gain Ratio (C4.5 — corrects IG bias for high-cardinality attrs)
GR(S,A) = IG(S,A) / SplitInfo(S,A)
SplitInfo = −Σᵥ (|Sᵥ|/|S|) log₂(|Sᵥ|/|S|)
Gini Impurity (CART)
Gini(S) = 1 − Σₖ pₖ²
Range [0, 1−1/K]; 0 = pure node
Variance Reduction (CART regression)
VR = Var(S) − Σᵥ (|Sᵥ|/|S|) Var(Sᵥ)
Entropy = expected bits to encode label. IG = entropy reduction from knowing attribute value.
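Entropy and information gain from their definitions, on a contrived four-sample set with one perfectly predictive attribute and one useless one:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(S) = -sum_k p_k log2 p_k (empty terms skipped, matching 0 log 0 = 0)."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, attr):
    """IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    n = len(labels)
    ig = entropy(labels)
    for v in set(attr):
        sub = [l for l, a in zip(labels, attr) if a == v]
        ig -= len(sub) / n * entropy(sub)
    return ig

labels = ['+', '+', '-', '-']
perfect = ['a', 'a', 'b', 'b']    # splits the classes exactly -> IG = H(S) = 1 bit
useless = ['a', 'b', 'a', 'b']    # independent of the class -> IG = 0
```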

ID3 Algorithm — Pseudocode

// ID3 — Entropy + Info Gain
function BuildTree(S, Attrs):
    if all S same class C: return Leaf(C)
    if Attrs empty: return Leaf(majority(S))
    A* = argmax_{A∈Attrs} IG(S, A)
    node = Node(A*)
    for each value v of A*:
        Sᵥ = {s ∈ S : s[A*] = v}
        if Sᵥ empty:
            node.child[v] = Leaf(majority(S))
        else:
            node.child[v] = BuildTree(Sᵥ, Attrs\{A*})
    return node

// CART (Binary splits — Classification)
function BestSplit(S, Attrs):
    best_gain = 0; best_split = null
    for each attr A:
        for each threshold t in sorted A values:
            S_L = {s: s[A] ≤ t}; S_R = {s: s[A] > t}
            gain = Gini(S) − (|S_L|/|S|)Gini(S_L) − (|S_R|/|S|)Gini(S_R)
            if gain > best_gain: best_gain = gain; best_split = (A, t)
    return best_split

Continuous & Missing Values

Continuous Threshold Search
Sort xᵢ values: v₁ ≤ v₂ ≤ ... ≤ vₙ
Candidate thresholds: tₖ = (vₖ + vₖ₊₁)/2
t* = argmax_t IG(S, A_t)
O(n log n) per attribute per node

Overfitting & Pruning

Deep trees: overfit. Low depth: underfit. Use CV to pick optimal depth / α.

ID3 vs C4.5 vs CART — Comparison

| Property | ID3 | C4.5 | CART |
|---|---|---|---|
| Split criterion | Information Gain | Gain Ratio | Gini (class) / Variance (reg) |
| Task | Classification | Classification | Both |
| Continuous features | ✗ | ✓ threshold search | ✓ threshold search |
| Missing values | ✗ | ✓ fractional weights | ✓ surrogate splits |
| Split arity | Multi-way | Multi-way | Binary only |
| Pruning | ✗ | ✓ post-prune | ✓ cost-complexity |
| Bias (IG) | High-cardinality attrs preferred | Corrected by SplitInfo | Gini less biased |
06

Instance-Based Learning

K-Nearest Neighbors (KNN)

Prediction
ŷ = majority_vote({yᵢ : i ∈ kNN(x)}) // classification
ŷ = (1/k) Σᵢ∈kNN(x) yᵢ // regression
Distance Metrics
Euclidean: d(x,z) = √Σⱼ(xⱼ−zⱼ)²
Manhattan: d(x,z) = Σⱼ|xⱼ−zⱼ|
Minkowski: d(x,z) = (Σⱼ|xⱼ−zⱼ|^p)^(1/p)
Cosine sim: (x·z)/(||x||·||z||)
Weighted KNN
ŷ = Σᵢ wᵢ yᵢ / Σᵢ wᵢ, wᵢ = 1/d(x, xᵢ)²
No model; memorize training set; predict by proximity
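The majority-vote prediction above, sketched with Euclidean distance on a toy two-cluster dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every point
    nearest = np.argsort(d)[:k]                  # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
y_train = np.array([0, 0, 1, 1])
```

Training is just storing `(X_train, y_train)`; all cost is paid at query time, which is why the complexity table at the end lists O(nd) predict.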

Locally Weighted Regression (LWR)

Objective (per query point x)
J(θ; x) = Σᵢ wᵢ(x) · (yᵢ − θᵀxᵢ)²
Gaussian Kernel Weight
wᵢ(x) = exp(−||x − xᵢ||² / 2τ²)
τ (bandwidth): small → more local; large → global linear fit
Closed-Form Solution
θ*(x) = (XᵀW(x)X)⁻¹ XᵀW(x)y
W(x) = diag(w₁(x), ..., wₙ(x))
Fit linear model around query point using proximity-weighted samples
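The per-query closed form above as a sketch (the query re-solves the weighted normal equation each time; data is a synthetic exact line, which any weighting recovers):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Solve theta = (X^T W X)^{-1} X^T W y with Gaussian weights around x_query."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y
    return x_query @ theta

# exact line y = 1 + 2x; bias column prepended (query must carry the bias 1 too)
X = np.c_[np.ones(5), np.arange(5.0)]
y = 1.0 + 2.0 * X[:, 1]
y_hat = lwr_predict(X, y, np.array([1.0, 2.5]))   # interpolate at x = 2.5
```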

Radial Basis Functions (RBF)

RBF Network Output
f(x) = Σⱼ₌₁ᴷ wⱼ φ(||x − cⱼ||) + w₀
cⱼ: centers; φ: radial basis function; wⱼ: output weights
Common Basis Functions
Gaussian: φ(r) = exp(−r²/2σ²)
Multiquadric: φ(r) = √(r² + c²)
Thin-plate: φ(r) = r² log(r)
// RBF Network Training
1. Choose centers cⱼ (k-means on X, or use training points)
2. Fix widths σⱼ (heuristic: mean distance to neighbors)
3. Compute Φᵢⱼ = φ(||xᵢ − cⱼ||) for all i,j
4. Solve linear system: w* = (ΦᵀΦ)⁻¹Φᵀy
Universal approximator; interpolates smoothly between centers
07

Support Vector Machines

Hard-Margin SVM (Linearly Separable)

Primal Problem
min_{w,b} ½||w||²
s.t. yᵢ(wᵀxᵢ + b) ≥ 1 ∀i
Margin = 2/||w||. Maximize margin ↔ minimize ½||w||².
Lagrangian
L(w,b,α) = ½||w||² − Σᵢ αᵢ[yᵢ(wᵀxᵢ+b) − 1]
Dual Problem (substitute KKT: w = Σαᵢyᵢxᵢ)
max_α Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ
s.t. αᵢ ≥ 0 ∀i, Σᵢ αᵢyᵢ = 0
Decision Function
f(x) = sign(Σᵢ αᵢyᵢ xᵢᵀx + b)
Only support vectors (αᵢ > 0) contribute (KKT sparsity)
Bias term b
b = yₛ − Σᵢ αᵢyᵢ xᵢᵀxₛ for any support vector s
Maximum margin → best generalization by VC theory; only SVs matter

Soft-Margin SVM (Non-Separable)

Primal with Slack Variables
min_{w,b,ξ} ½||w||² + C Σᵢ ξᵢ
s.t. yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0
ξᵢ: slack — how much sample i violates margin
Dual (same form with upper bound on α)
max_α Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ
s.t. 0 ≤ αᵢ ≤ C, Σᵢ αᵢyᵢ = 0
Hinge Loss Equivalence
min_w ||w||² + C Σᵢ max(0, 1 − yᵢ(wᵀxᵢ+b))
SVM = L2-regularized hinge loss minimization

Kernel Trick (Mercer)

Kernel Substitution in Dual
xᵢᵀxⱼ → K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ)
Never compute φ explicitly; compute K directly in O(d)
Mercer Condition (valid kernel)
K must be symmetric positive semi-definite:
Σᵢ Σⱼ cᵢcⱼ K(xᵢ,xⱼ) ≥ 0 for all {cᵢ}, {xᵢ}
| Kernel | Formula | Feature Space | Best For |
|---|---|---|---|
| Linear | xᵀz | Original | High-d sparse (text) |
| Polynomial | (γxᵀz + r)ᵈ | Degree-d poly | Images, NLP |
| RBF/Gaussian | exp(−γ||x−z||²) | Infinite-dim | General purpose |
| Sigmoid | tanh(κxᵀz + θ) | Varies | NN analog (not always valid) |
Kernelized Decision Function
f(x) = sign(Σᵢ αᵢyᵢ K(xᵢ, x) + b)
Map to high-d implicitly; linear boundary in φ-space = nonlinear boundary in x-space
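A tiny check of the trick itself: for the degree-2 kernel K(x,z) = (xᵀz)² in 2-D, the explicit map is φ(x) = [x₁², √2·x₁x₂, x₂²], and the kernel equals the dot product in φ-space without ever constructing it:

```python
import numpy as np

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2 — computed in O(d), never touching phi-space."""
    return (x @ z) ** 2

def phi(x):
    """Explicit 3-D feature map for the 2-D degree-2 kernel."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
# poly2_kernel(x, z) == phi(x) @ phi(z) == 121
```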

SVM: When to Use / Avoid

| Scenario | SVM? | Notes |
|---|---|---|
| High-d, few samples (n≪d) | ✓ Great | Genomics, text; margin maximization helps |
| Large n (>100K) | ✗ Slow | O(n²–n³); use SGD/Linear SVM |
| Need probabilities | Partial | Platt scaling; not well-calibrated natively |
| Heavily overlapping classes | Tune C | Soft margin; RBF kernel |
| Multiclass | Manual | OvR or OvO decomposition |
08

Bayesian Learning

MLE vs MAP vs Full Bayes

Bayes Rule
P(θ|D) = P(D|θ) · P(θ) / P(D)
Posterior ∝ Likelihood × Prior
MLE — Maximum Likelihood
θ_MLE = argmax_θ P(D|θ) = argmax_θ Σᵢ log P(xᵢ|θ)
No prior; can overfit with small data; point estimate
MAP — Maximum A Posteriori
θ_MAP = argmax_θ log P(D|θ) + log P(θ)
Gaussian prior P(θ) = N(0, λ⁻¹I) → L2 regularization
Laplace prior P(θ) ∝ exp(−λ|θ|) → L1 regularization
Full Bayesian Prediction
P(y*|x*,D) = ∫ P(y*|x*,θ) P(θ|D) dθ
Averages over all θ; intractable for most models; use MCMC / VI
| | MLE | MAP | Full Bayes |
|---|---|---|---|
| Prior | No | Yes (point) | Yes (full) |
| Output | Point θ | Point θ | Distribution P(θ|D) |
| Overfit risk | High | Low | Lowest |
| Tractability | Easy | Easy | Often hard |

Naïve Bayes Classifier

Conditional Independence Assumption
P(x|Cₖ) = Πⱼ P(xⱼ|Cₖ)
Classification Rule
ŷ = argmax_k log P(Cₖ) + Σⱼ log P(xⱼ|Cₖ)
Log-space: avoids underflow from multiplying small probabilities
Laplace Smoothing (prevents zero probs)
P(xⱼ=v|Cₖ) = (count(xⱼ=v, Cₖ) + α) / (count(Cₖ) + α·|V|)
α=1: add-one; |V|: vocab/category size
| Variant | Feature Type | P(xⱼ|Cₖ) |
|---|---|---|
| Gaussian NB | Continuous | N(μₖⱼ, σ²ₖⱼ) — estimate μ,σ per class |
| Bernoulli NB | Binary | pₖⱼ^xⱼ · (1−pₖⱼ)^(1−xⱼ) |
| Multinomial NB | Counts (text) | θₖⱼ^xⱼ — word frequencies |
Independence assumption wrong but ranking often correct; fast, works with tiny data
Poor probability calibration despite good accuracy; feature correlation breaks it
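A minimal categorical NB sketch combining the log-space classification rule with Laplace smoothing; the weather-style toy data and the helper names (`nb_fit`, `predict`) are made up for illustration:

```python
import numpy as np
from collections import Counter, defaultdict

def nb_fit(X, y, alpha=1.0):
    """Categorical NB; rows of X are tuples of discrete feature values."""
    class_counts = Counter(y)
    n, n_feats = len(y), len(X[0])
    n_values = [len({row[j] for row in X}) for j in range(n_feats)]   # |V| per feature
    value_counts = defaultdict(Counter)      # (class, feature_idx) -> value counts
    for row, c in zip(X, y):
        for j, v in enumerate(row):
            value_counts[(c, j)][v] += 1

    def predict(x):
        # argmax_k log P(C_k) + sum_j log P(x_j | C_k), with add-alpha smoothing
        best, best_score = None, -np.inf
        for c, nc in class_counts.items():
            s = np.log(nc / n)
            for j, v in enumerate(x):
                s += np.log((value_counts[(c, j)][v] + alpha)
                            / (nc + alpha * n_values[j]))
            if s > best_score:
                best, best_score = c, s
        return best

    return predict

X = [('sunny', 'hot'), ('sunny', 'mild'), ('rain', 'cool'), ('rain', 'mild')]
y = ['no', 'no', 'yes', 'yes']
predict = nb_fit(X, y)
```

Note that an unseen value (count 0) still gets probability α/(count + α|V|) instead of zero, which is exactly what the smoothing term buys.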

Bayesian Linear Regression

Prior over weights
p(w) = N(0, α⁻¹I)
Likelihood
p(y|X,w) = N(Xw, β⁻¹I)
Posterior (conjugate — also Gaussian)
p(w|X,y) = N(mN, SN)
SN = (αI + βXᵀX)⁻¹
mN = β · SN · Xᵀy
Predictive Distribution
p(y*|x*,X,y) = N(mNᵀφ(x*), σ²_N(x*))
σ²_N(x*) = β⁻¹ + φ(x*)ᵀ SN φ(x*)
Uncertainty grows where data is sparse — automatic credible intervals
Full posterior over weights → predictive uncertainty quantification; mN is the MAP estimate, and as α→0 (flat prior) it reduces to the MLE/OLS solution
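The posterior and predictive formulas above as a sketch (identity basis, i.e. φ(x) = x; the noise-free toy line and the α, β values are illustrative):

```python
import numpy as np

def blr_posterior(X, y, alpha=1.0, beta=25.0):
    """S_N = (alpha I + beta X^T X)^{-1}; m_N = beta S_N X^T y."""
    d = X.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(d) + beta * (X.T @ X))
    m_N = beta * S_N @ X.T @ y
    return m_N, S_N

def blr_predict(x_star, m_N, S_N, beta=25.0):
    """Predictive mean x^T m_N and variance 1/beta + x^T S_N x."""
    return x_star @ m_N, 1.0 / beta + x_star @ S_N @ x_star

# y = 2x observed on [0, 1]; bias column prepended
X = np.c_[np.ones(5), np.linspace(0.0, 1.0, 5)]
y = 2.0 * X[:, 1]
m_N, S_N = blr_posterior(X, y)
_, var_near = blr_predict(np.array([1.0, 0.5]), m_N, S_N)   # inside the data range
_, var_far = blr_predict(np.array([1.0, 5.0]), m_N, S_N)    # far extrapolation
```

`var_far > var_near` demonstrates the "uncertainty grows where data is sparse" point directly.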

Optimal Bayes Classifier

ŷ = argmax_k P(Cₖ|x) = argmax_k P(x|Cₖ)P(Cₖ)
09

Ensemble Learning

Ensemble Types & Error Decomposition

| Method | Training | Reduces | Examples |
|---|---|---|---|
| Bagging | Parallel | Variance | Random Forest |
| Boosting | Sequential | Bias | AdaBoost, GBM, XGBoost |
| Stacking | Both | Both | Meta-learner on OOF preds |
Variance of Average of n Models
Var(avg) = ρσ² + (1−ρ)σ²/n
ρ: inter-model correlation. Lower ρ → more reduction. Need diversity + accuracy.
Diverse accurate models that fail on different samples → strong ensemble

Bagging

// Bootstrap Aggregating
for t = 1 to T:
    Dₜ = bootstrap_sample(D, n)   // n samples w/ replacement
    hₜ = train(Dₜ)
return H(x) = majority_vote(hₜ(x))   // or mean for regression

Random Forest

Feature Importance (Gini / Mean Decrease Impurity)
FIⱼ = Σ_trees Σ_nodes[split on j] · ΔImpurity · (n_node/n_total)
Normalized to sum=1. Biased toward high-cardinality continuous features!
Permutation Importance (unbiased)
PIⱼ = score(original) − score(with feature j shuffled)
| Hyperparameter | Effect (↑) |
|---|---|
| n_estimators | Better, no overfit risk; diminishing returns after ~200 |
| max_features | Less diversity per tree but higher individual accuracy |
| max_depth | Higher variance per tree, richer structure |
| min_samples_leaf | More regularization, smoother boundaries |
Wisdom of many uncorrelated trees; OOB provides free validation
9.4.1

AdaBoost

AdaBoost — Algorithm & Equations

Final Classifier
H(x) = sign(F(x)), F(x) = Σₜ αₜ hₜ(x)
// AdaBoost (Binary ±1 labels)
Initialize: wᵢ = 1/m ∀i
for t = 1 to T:
    hₜ = train weak learner on (X, y, w)   // e.g., depth-1 tree
    εₜ = Σᵢ wᵢ · 𝟙[hₜ(xᵢ) ≠ yᵢ]            // weighted error
    αₜ = ½ ln((1 − εₜ) / εₜ)               // learner confidence
    for each i:
        wᵢ ← wᵢ · exp(−αₜ · yᵢ · hₜ(xᵢ))
        // misclassified: yᵢhₜ(xᵢ)=−1 → weight × exp(+αₜ) (↑)
        // correct: yᵢhₜ(xᵢ)=+1 → weight × exp(−αₜ) (↓)
    wᵢ ← wᵢ / Σⱼ wⱼ                        // normalize to sum=1
Learner Weight
αₜ = ½ ln((1−εₜ)/εₜ)
εₜ → 0: αₜ → ∞ (perfect learner heavily weighted)
εₜ = 0.5: αₜ = 0 (random; ignored)
Requires εₜ < 0.5 (better than random)
Training Error Upper Bound
Train error ≤ exp(−2 Σₜ γₜ²) where γₜ = 0.5 − εₜ
Exponentially decreasing with T; theoretically zero with enough boosting rounds
Loss Function (implicit)
L(y, F) = exp(−y·F(x)) — exponential loss
Very sensitive to outliers and label noise
At each step, focus on what previous models got wrong by upweighting their mistakes
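The pseudocode above, made runnable with exhaustive decision stumps as the weak learner (a reasonable but illustrative choice); the 1-D toy data is separable by a single threshold:

```python
import numpy as np

def stump_train(X, y, w):
    """Best single-feature threshold stump under sample weights w (labels ±1)."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= t, sign, -sign)
                err = w[pred != y].sum()          # weighted error eps_t
                if err < best[0]:
                    best = (err, (j, t, sign))
    return best

def adaboost(X, y, T=5):
    m = len(y)
    w = np.full(m, 1.0 / m)
    ensemble = []
    for _ in range(T):
        err, (j, t, sign) = stump_train(X, y, w)
        err = max(err, 1e-12)                     # guard log(0) on a perfect stump
        a = 0.5 * np.log((1 - err) / err)         # alpha_t
        pred = np.where(X[:, j] <= t, sign, -sign)
        w *= np.exp(-a * y * pred)                # upweight mistakes, downweight correct
        w /= w.sum()
        ensemble.append((a, j, t, sign))
    return ensemble

def ada_predict(ensemble, X):
    F = sum(a * np.where(X[:, j] <= t, s, -s) for a, j, t, s in ensemble)
    return np.sign(F)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, -1, -1])
ens = adaboost(X, y)
```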
9.4.2

Gradient Boosting Machine (GBM)

GBM — Algorithm & Equations

Additive Model
F_M(x) = F₀(x) + Σₘ₌₁ᴹ γₘ hₘ(x)
hₘ: weak learner (shallow tree); γₘ: step size via line search
// Gradient Boosting (Friedman 2001)
F₀(x) = argmin_γ Σᵢ L(yᵢ, γ)   // constant init (e.g., mean y)
for m = 1 to M:
    // Compute pseudo-residuals (negative gradient)
    rᵢₘ = −[∂L(yᵢ, F(xᵢ)) / ∂F(xᵢ)]_{F=Fₘ₋₁}
    // Fit weak learner to pseudo-residuals
    hₘ = fit_tree({(xᵢ, rᵢₘ)})
    // Line search for step size (or use fixed η)
    γₘ = argmin_γ Σᵢ L(yᵢ, Fₘ₋₁(xᵢ) + γ hₘ(xᵢ))
    Fₘ(x) = Fₘ₋₁(x) + η · γₘ · hₘ(x)
Pseudo-Residuals for Common Losses
MSE (L2): rᵢ = yᵢ − Fₘ₋₁(xᵢ) // actual residuals
MAE (L1): rᵢ = sign(yᵢ − Fₘ₋₁(xᵢ))
Log-loss: rᵢ = yᵢ − σ(Fₘ₋₁(xᵢ))
Leaf Region Output (regression, MSE)
γⱼₘ = mean{yᵢ − Fₘ₋₁(xᵢ) : xᵢ ∈ Rⱼₘ}
Rⱼₘ: region (leaf j of tree m)
Gradient descent in function space; each tree fits the gradient of loss w.r.t. current predictions
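A sketch of MSE gradient boosting with depth-1 trees (stumps), where the pseudo-residuals are plain residuals y − F(x); the step-function toy data and hyperparameters are illustrative:

```python
import numpy as np

def best_split(X, r):
    """Stump split minimizing squared error of leaf-mean predictions on residuals r."""
    best = (np.inf, 0, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:     # drop last value so both sides nonempty
            left = X[:, j] <= t
            sse = (((r[left] - r[left].mean()) ** 2).sum()
                   + ((r[~left] - r[~left].mean()) ** 2).sum())
            if sse < best[0]:
                best = (sse, j, t)
    return best[1], best[2]

def gbm_fit(X, y, M=50, eta=0.3):
    F0 = y.mean()                             # constant init for MSE
    F = np.full(len(y), F0)
    trees = []
    for _ in range(M):
        r = y - F                             # pseudo-residuals = negative gradient
        j, t = best_split(X, r)
        left = X[:, j] <= t
        gamma_l, gamma_r = r[left].mean(), r[~left].mean()   # leaf outputs
        trees.append((j, t, gamma_l, gamma_r))
        F += eta * np.where(left, gamma_l, gamma_r)          # shrinkage step
    return F0, trees

def gbm_predict(model, X, eta=0.3):
    F0, trees = model
    F = np.full(len(X), F0)
    for j, t, gl, gr in trees:
        F += eta * np.where(X[:, j] <= t, gl, gr)
    return F

X = np.arange(8.0).reshape(-1, 1)
y = np.where(X[:, 0] > 3, 10.0, 0.0)          # a step function: one stump fits it
model = gbm_fit(X, y)
```

Because each round removes a fixed fraction η of the remaining residual, the training error here shrinks geometrically, illustrating the shrinkage note in the XGBoost section.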
9.4.3

XGBoost — eXtreme Gradient Boosting

XGBoost — Full Derivation

Regularized Objective
Obj = Σᵢ L(yᵢ, ŷᵢ) + Σₖ Ω(fₖ)
Ω(f) = γT + ½λ Σⱼ₌₁ᵀ wⱼ²
T: #leaves; wⱼ: leaf score; γ: min gain threshold; λ: L2 on leaf weights
2nd-Order Taylor Approximation of Obj at step t
Obj⁽ᵗ⁾ ≈ Σᵢ [gᵢ fₜ(xᵢ) + ½ hᵢ fₜ(xᵢ)²] + Ω(fₜ) + const
gᵢ = ∂_{ŷ⁽ᵗ⁻¹⁾} L(yᵢ, ŷ⁽ᵗ⁻¹⁾) // 1st derivative
hᵢ = ∂²_{ŷ⁽ᵗ⁻¹⁾} L(yᵢ, ŷ⁽ᵗ⁻¹⁾) // 2nd derivative (Hessian)
GBM uses only gᵢ (1st order); XGBoost adds hᵢ for better curvature info
Optimal Leaf Weight (closed-form, per leaf j)
wⱼ* = −Gⱼ / (Hⱼ + λ)
Gⱼ = Σᵢ∈Iⱼ gᵢ, Hⱼ = Σᵢ∈Iⱼ hᵢ
Iⱼ: set of sample indices in leaf j
Optimal Obj Score (after finding best tree structure)
Obj* = −½ Σⱼ₌₁ᵀ Gⱼ² / (Hⱼ + λ) + γT
Split Gain (greedy — maximize per split)
Gain = ½ [ G²_L/(H_L+λ) + G²_R/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ) ] − γ
Split only if Gain > 0; γ acts as minimum gain for pruning (pre-stop)
Regression (MSE): gᵢ, hᵢ
L = ½(yᵢ − ŷᵢ)²
gᵢ = ŷᵢ − yᵢ (predicted − actual)
hᵢ = 1 (constant Hessian)
Binary Classification (Log-loss)
L = −yᵢ log pᵢ − (1−yᵢ) log(1−pᵢ), pᵢ = σ(ŷᵢ)
gᵢ = pᵢ − yᵢ
hᵢ = pᵢ(1 − pᵢ) // non-constant → better curvature
Final output: P(y=1|x) = σ(Fₜ(x))
// XGBoost Single Tree Build
Compute {gᵢ, hᵢ} for all samples
Start with all samples in root
while depth < max_depth:
    for each leaf node:
        for each feature f:
            Sort instances by f value
            Scan thresholds; compute Gain(f,t)
        best = argmax_{f,t} Gain(f,t)
        if max Gain > 0: split node
        else: mark leaf
Assign wⱼ* = −Gⱼ/(Hⱼ+λ) to leaves
Update: Fₜ(x) += η · wⱼ*(leaf of x)
  • Shrinkage (η): scale each tree's contribution → reduce each tree's influence, allow later trees to correct
  • Column subsampling: random feature fraction per tree/level/node → like RF, reduces correlation
  • Row subsampling: stochastic GBM — random sample fraction per tree
  • Approximate splits: sketch-based quantiles (weighted) for distributed/large data
  • Sparsity-aware: missing values → learn default branch direction during split
  • Cache-aware: data stored in compressed column blocks; prefetch for CPU cache efficiency
  • max_delta_step: bounds leaf weight update — stabilizes logistic regression training

GBM vs XGBoost vs LightGBM vs CatBoost

| Property | sklearn GBM | XGBoost | LightGBM | CatBoost |
|---|---|---|---|---|
| Tree growth | Level-wise | Level-wise (depth-first) | Leaf-wise (best-first) | Symmetric (oblivious trees) |
| Taylor order | 1st only | 1st + 2nd | 1st + 2nd | 1st + 2nd |
| Speed | Slow | Fast (parallel column sort) | Fastest (GOSS + EFB) | Moderate |
| Missing values | Manual impute | ✓ native default direction | ✓ native | ✓ native |
| Categorical | Manual encode | Manual encode | ✓ native (optimal split) | ✓ target encoding with ordered boosting |
| Regularization | subsample, max_depth | γ, λ, α, subsample, colsample | min_gain, min_data_in_leaf | Symmetric structure = implicit reg |
| Overfit risk | Low | Low | Higher (leaf-wise); use min_data_in_leaf | Low (symmetric) |
| Unique optimization | — | Block structure, approx splits | GOSS + EFB histogram | Ordered boosting (no leakage) |
LightGBM GOSS: keep large-gradient samples + random low-gradient sample → reduce data without losing signal. EFB: bundle mutually exclusive features.

AdaBoost vs GBM

| | AdaBoost | GBM |
|---|---|---|
| Weighting mechanism | Sample reweighting | Residual fitting |
| Step size | αₜ derived from εₜ | Line search or fixed η |
| Loss function | Exponential (fixed) | Any differentiable loss |
| Outlier sensitivity | Very high (exp loss) | Depends on loss choice |
| Flexibility | Low | High (plug any loss) |
AdaBoost = GBM with exponential loss + exact line search. GBM generalizes to any loss.

Boosting Hyperparameter Guide

Boosting sensitive to outliers with MSE loss → use Huber or quantile loss for robustness
10

Unsupervised Learning

K-Means Clustering

Objective (Within-Cluster Sum of Squares, WCSS)
J = Σₖ Σᵢ∈Cₖ ||xᵢ − μₖ||²
// Lloyd's Algorithm
Initialize μ₁,...,μₖ   // random or k-means++
repeat until assignments don't change:
    // E-step: Assign to nearest centroid
    cᵢ = argmin_k ||xᵢ − μₖ||²  ∀i
    // M-step: Recompute centroids
    μₖ = (1/|Cₖ|) Σᵢ∈Cₖ xᵢ  ∀k
K-Means++ Initialization
1. c₁ = uniform random sample from X
2. For j = 2..k:
       cⱼ ~ P(x) ∝ min_{i<j} ||x − cᵢ||²   (prefer distant points)
O(log k) approximation guarantee; reduces bad local minima
Silhouette Score (per sample)
s = (b − a) / max(a, b)
a = mean intra-cluster distance; b = mean nearest-cluster distance
Range: [−1, +1]; higher = better; >0.5 is good
Sensitive to scale → standardize. Assumes spherical, equal-size clusters. Fails on non-convex shapes.
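Lloyd's algorithm as a sketch; for determinism this uses greedy farthest-point seeding (a simpler deterministic cousin of k-means++, not the sampled version above), and the two-blob data is synthetic:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm with greedy farthest-point initialization."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])         # point farthest from chosen centers
    mu = np.array(centers, dtype=float)
    for _ in range(iters):
        # E-step: assign each point to its nearest centroid
        assign = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1).argmin(1)
        # M-step: recompute centroids (empty cluster keeps its old centroid)
        new_mu = np.array([X[assign == j].mean(0) if np.any(assign == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, assign

X_demo = np.vstack([np.zeros((5, 2)), 10.0 + np.zeros((5, 2))])   # two tight blobs
mu, assign = kmeans(X_demo, 2)
```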

K-Means Variants

| Variant | Key Difference | Use When |
|---|---|---|
| K-Means++ | Probabilistic init | Always over random init |
| K-Medoids (PAM) | Centers = actual data points | Non-Euclidean, robust to outliers |
| Mini-batch K-Means | SGD-style centroid updates | n > 1M; online streams |
| Fuzzy C-Means | Soft memberships uᵢₖ ∈ [0,1] | Overlapping clusters |
| Bisecting K-Means | Recursively split largest cluster | Hierarchical structure needed |
| DBSCAN | Density-based, no k needed | Arbitrary shapes; detect outliers |
| HDBSCAN | Hierarchical DBSCAN | Varying density clusters |
DBSCAN Parameters
ε: neighborhood radius; MinPts: min samples to form core point
Core point: |{x' : d(x,x') ≤ ε}| ≥ MinPts
Noise = points not reachable from any core point

EM Algorithm

Goal: maximize log P(X|θ) with latent variables Z
log P(X|θ) ≥ L(q,θ) = Σᵢ Σₖ q(zᵢ=k) log [P(xᵢ,zᵢ=k|θ)/q(zᵢ=k)]
ELBO (Evidence Lower BOund) — maximize by iterating E and M steps
// EM General Algorithm
Initialize θ⁰ randomly
repeat until convergence:
    // E-step: Compute Q function (expected complete-data log-likelihood)
    Q(θ|θᵗ) = E_{Z|X,θᵗ} [log P(X,Z|θ)]
    // M-step: Maximize Q over θ
    θᵗ⁺¹ = argmax_θ Q(θ|θᵗ)
Alternate between inferring missing data (E) and fitting parameters (M)

Gaussian Mixture Models (GMM)

Generative Model
P(x) = Σₖ₌₁ᴷ πₖ N(x | μₖ, Σₖ)
πₖ: mixing coefficients (Σπₖ=1, πₖ≥0); z ~ Categorical(π) is latent
// GMM-EM Steps
Initialize μₖ, Σₖ, πₖ   // e.g., from k-means
repeat until convergence:
    // E-step: Responsibilities
    γₙₖ = πₖ N(xₙ|μₖ,Σₖ) / Σⱼ πⱼ N(xₙ|μⱼ,Σⱼ)
    // M-step: Update parameters
    Nₖ = Σₙ γₙₖ
    μₖ = (1/Nₖ) Σₙ γₙₖ xₙ
    Σₖ = (1/Nₖ) Σₙ γₙₖ (xₙ−μₖ)(xₙ−μₖ)ᵀ
    πₖ = Nₖ / N
Log-Likelihood (to monitor convergence)
ℓ(θ) = Σₙ log [Σₖ πₖ N(xₙ|μₖ,Σₖ)]
Covariance Types (from most to least flexible)
Full: Σₖ arbitrary (K·d² params)
Tied: Σₖ = Σ shared (d²)
Diagonal: Σₖ = diag(σ²ₖ) (K·d params)
Spherical: Σₖ = σ²ₖI (K params)
Singularity if μₖ collapses to a data point with full covariance → regularize: Σₖ += εI
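The GMM-EM steps above in 1-D, including the εI-style variance floor just mentioned; the quantile initialization and the two-cluster sample are illustrative choices:

```python
import numpy as np

def gmm_em_1d(x, k=2, iters=200):
    """EM for a 1-D GMM; crude deterministic init spreads means over the data range."""
    mu = np.quantile(x, np.linspace(0, 1, k))     # e.g., [min, max] for k=2
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities gamma[n, j] ∝ pi_j N(x_n | mu_j, var_j)
        p = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        g = p / p.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted MLE updates
        Nk = g.sum(axis=0)
        mu = (g * x[:, None]).sum(axis=0) / Nk
        var = (g * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-6   # guards collapse
        pi = Nk / len(x)
    return pi, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.5, 100), rng.normal(10.0, 0.5, 100)])
pi, mu, var = gmm_em_1d(x)
```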

K-Means vs GMM vs DBSCAN

| Property | K-Means | GMM | DBSCAN |
|---|---|---|---|
| Requires k | Yes | Yes (K) | No (ε, MinPts) |
| Soft assignment | No (hard) | Yes (γₙₖ) | No |
| Cluster shape | Spherical | Elliptical (per Σₖ) | Arbitrary |
| Outlier detection | None | Low-responsibility pts | Explicit noise pts |
| Probabilistic model | No | Yes | No |
| Scalability | Very high | Moderate (O(nKd²)) | O(n log n) w/ index |
| Convergence | Local WCSS min | Local log-lik max | Deterministic |
11

ML Model Evaluation & Comparison

Cross-Validation Methods

| Method | Description | Use When |
|---|---|---|
| k-Fold CV | k disjoint folds; rotate test fold | General; k=5 or 10 |
| Stratified k-Fold | Preserve class ratio per fold | Classification, imbalanced |
| LOOCV | n-fold; one sample left out | Very small n; high variance of estimate |
| Repeated k-Fold | Multiple random k-fold runs | Reduce CV estimate variance |
| Time series split | Walk-forward; no future leakage | Any temporal data |
| Nested CV | Outer: eval; Inner: HP tuning | Unbiased performance estimate when doing HP search |
Fit all preprocessors (scalers, imputers) on training fold only; never on test fold.

Statistical Tests for Model Comparison

Paired t-test (k-fold CV)
tₛₜₐₜ = d̄ / (s_d / √k)
dᵢ = acc_A(foldᵢ) − acc_B(foldᵢ)
H₀: models equivalent; reject if |t| > t_{α/2, k-1}
5×2 CV Test (Dietterich — more reliable)
Use 5 repetitions of 2-fold CV
pᵢ⁽ʲ⁾ = err_A(fold j, rep i) − err_B(fold j, rep i)
sᵢ² = (pᵢ⁽¹⁾ − p̄ᵢ)² + (pᵢ⁽²⁾ − p̄ᵢ)², p̄ᵢ = (pᵢ⁽¹⁾ + pᵢ⁽²⁾)/2
tₛₜₐₜ = p₁⁽¹⁾ / √((1/5) Σᵢ₌₁⁵ sᵢ²) — compare to t distribution with 5 df
McNemar's Test (single test set)
χ² = (|b − c| − 1)² / (b + c)
b: A right, B wrong; c: B right, A wrong
Tests if disagreements are asymmetric; no assumption of independence
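The continuity-corrected McNemar statistic is a one-liner; the b/c counts below are invented for illustration:

```python
def mcnemar_chi2(b, c):
    """(|b - c| - 1)^2 / (b + c); compare to chi2(1) critical value 3.84 at alpha=0.05."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# b = 30 cases where A is right and B wrong, c = 10 the reverse:
stat = mcnemar_chi2(30, 10)   # 19^2 / 40 = 9.025 > 3.84 -> reject H0
```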

Bias, Fairness & Interpretability

| Method | Type | Principle |
|---|---|---|
| LIME | Local, model-agnostic | Fit local linear approx in perturbed neighborhood |
| SHAP | Local + Global | Shapley values; unique fair attribution |
| Grad-CAM | Local, CNN | Gradient × activation for saliency map |
| PDP / ICE | Global / Local | Marginal effect of one feature |
| Permutation Imp. | Global | Score drop when feature shuffled |
SHAP Shapley Value
φᵢ = Σ_{S⊆F\{i}} [|S|!(|F|−|S|−1)!/|F|!] · [f(S∪{i}) − f(S)]
Average marginal contribution across all feature orderings. Satisfies: efficiency, symmetry, dummy, additivity.
SHAP is the only method satisfying all 4 fairness axioms for feature attribution

Common Interview Traps

QR

Quick Reference — Algo Selection & Complexity

Algorithm Selection Guide

| Scenario | Top Choice | Alternative | Avoid |
|---|---|---|---|
| Tabular data, structured, n≥1K | XGBoost / LightGBM | Random Forest | KNN (slow predict) |
| Few features, linear relationship | Ridge / Lasso Regression | Logistic Regression | Deep trees |
| Text classification | Logistic Reg + TF-IDF | Linear SVM, Multinomial NB | KNN (high-d) |
| Small dataset (<1K samples) | SVM (RBF), NB | Logistic Regression | Deep neural networks |
| Need interpretability | Decision Tree, Linear Model | SHAP + XGBoost | Deep NN without SHAP |
| Probabilistic output needed | Logistic Regression, GMM | Calibrated RF / XGB | Hard-margin SVM |
| Clustering, unknown k | DBSCAN / HDBSCAN | Hierarchical clustering | K-Means |
| Clustering, known k, spherical | K-Means++ | Mini-batch K-Means | DBSCAN |
| Soft cluster assignments | GMM | Fuzzy C-Means | K-Means |
| Non-linear boundary, few samples | SVM + RBF kernel | Random Forest | Plain KNN |
| Online / streaming | SGD variants, NB | Hoeffding Trees | Batch SVM |
| Imbalanced binary | XGB (scale_pos_weight) + F1 | RF + class_weight | Default threshold |

Complexity Reference

| Algorithm | Train | Predict | Space |
|---|---|---|---|
| Linear Regression (GD) | O(nd·iter) | O(d) | O(d) |
| Linear Regression (Normal Eq) | O(nd² + d³) | O(d) | O(d) |
| Logistic Regression | O(nd·iter) | O(d) | O(d) |
| KNN | O(1), or O(nd) to build index | O(nd) | O(nd) |
| Decision Tree | O(nd log n) | O(depth) | O(n·depth) |
| SVM (kernel) | O(n²d – n³) | O(SV·d) | O(SV) |
| Naive Bayes | O(nd) | O(Kd) | O(Kd) |
| Random Forest | O(T·nd log n) | O(T·depth) | O(T·n) |
| GBM / XGBoost | O(T·nd log n) | O(T·depth) | O(T·n) |
| K-Means | O(nkd·iter) | O(kd) | O((n+k)d) |
| GMM-EM | O(nkd²·iter) | O(kd²) | O(kd²) |

Model Assumptions Reference

| Algorithm | Key Assumptions |
|---|---|
| Linear Regression | Linear f(x), homoscedasticity, no multicollinearity, errors N(0,σ²) |
| Logistic Regression | Linear decision boundary in feature space |
| LDA | Gaussian class-conditionals, equal covariance matrices |
| Naïve Bayes | Conditional independence of features given class |
| KNN | Locally similar points share labels; smooth decision boundary |
| K-Means | Spherical, similar-size, convex, well-separated clusters |
| GMM | Data generated from K Gaussian components |
| SVM | Separable (or nearly) by hyperplane; kernel defines geometry |
| Decision Tree | Axis-aligned decision boundaries; recursive partition |
| Bayesian LR | Gaussian prior on weights, Gaussian likelihood |