MFML · Interview Notes

Mathematical Foundations for ML

01

Linear Systems & Matrices

Matrix Form of Linear System

  • System Ax = b → augmented matrix [A | b]
  • Coefficient matrix: just A (n×n)
  • Augmented matrix: [A | b] includes RHS
  • System has solution iff rank(A) = rank([A|b])

Elementary Row Operations

  • Rᵢ ↔ Rⱼ — swap two rows
  • Rᵢ → c·Rᵢ — multiply row by nonzero constant
  • Rᵢ → Rᵢ + a·Rⱼ — add multiple of one row to another
  • All are reversible · do not change solution set

REF & RREF

  • REF (Row Echelon Form): leading entry each row to right of row above; rows of zeros at bottom
  • RREF (Reduced REF): additionally, each pivot = 1 and only nonzero in its column
  • Pivot variable: corresponds to pivot column
  • Free variable: corresponds to non-pivot column → infinite solutions
  • Eigenvalues of triangular matrix = diagonal entries

Rank & Consistency

  • Rank: number of pivot elements (= dim of row/column space)
  • No solution: rank(A) < rank([A|b]) — last row [0…0 | c≠0]
  • Unique solution: rank(A) = rank([A|b]) = n
  • Infinite solutions: rank(A) = rank([A|b]) < n (free variables)
  • Homogeneous Ax=0: always consistent; nontrivial soln iff rank(A) < n
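The three consistency cases above can be checked directly with NumPy's rank function — a small sketch (the function name `classify_system` is mine):

```python
import numpy as np

def classify_system(A, b):
    """Classify Ax = b by comparing rank(A) with rank([A|b])."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    r_A = np.linalg.matrix_rank(A)
    r_Ab = np.linalg.matrix_rank(np.hstack([A, b]))
    n = A.shape[1]                       # number of unknowns
    if r_A < r_Ab:
        return "no solution"             # inconsistent row [0 ... 0 | c != 0]
    return "unique solution" if r_A == n else "infinite solutions"
```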

Special Matrices

  • Upper triangular U: all entries below diagonal = 0
  • Lower triangular L: all entries above diagonal = 0
  • Diagonal D: all non-diagonal entries = 0; D^k = diag(d₁^k, …, dₙ^k)
  • det(triangular) = product of diagonal entries
  • Inverse of diagonal: replace each dᵢ with 1/dᵢ (requires all dᵢ ≠ 0)
02

Vector Spaces, Basis & Linear Independence

Vector Space

  • Set V with addition + and scalar multiplication · that satisfies closure, associativity, commutativity, identity, inverse, distributivity
  • Rⁿ = {(x₁,…,xₙ) : xᵢ∈R} is a vector space
  • Vectors used in ML: data encoding, embeddings, model weights, activations

Linear Independence

  • Set {v₁,…,vₖ} is linearly independent if c₁v₁+…+cₖvₖ=0 implies all cᵢ=0
  • Linearly dependent if some vector is a linear combination of others
  • n vectors in Rᵐ where n>m → always linearly dependent
  • Any subset of an LI set is LI; any superset of LD set is LD
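Linear independence reduces to a rank check: stack the vectors as columns and test for full column rank. A minimal sketch (function name is mine):

```python
import numpy as np

def is_linearly_independent(vectors):
    """LI iff the matrix with these vectors as columns has full column rank."""
    M = np.column_stack([np.asarray(v, dtype=float) for v in vectors])
    return np.linalg.matrix_rank(M) == M.shape[1]
```

Note that three vectors in R² always come out dependent, matching the n > m rule above.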

Span, Basis & Dimension

  • Span(A): set of all linear combinations of vectors in A
  • A generates/spans V if every v∈V can be written as linear combination of A
  • Basis: linearly independent generating set · minimal spanning set
  • Dimension: number of vectors in any basis
  • All bases of a vector space have the same size

ML Relevance

  • Inputs encoded as vectors (tabular, image pixels, token embeddings)
  • Neural net layers: linear maps between vector spaces
  • Rank = number of "independent directions" in data — used in PCA
  • Free variables → underdetermined systems → regularization needed
03

Distance & Similarity Measures

Data Matrix

  • n data objects × p attributes → matrix X of size n×p
  • Dissimilarity matrix: n×n symmetric matrix D where d(i,j)=d(j,i), d(i,i)=0
  • Measures how unlike two data points are

Key Distance Measures

Euclidean: d(i,j) = √Σ(xᵢₖ − xⱼₖ)²

Simple dissim (categorical): d(i,j) = (p − m) / p
p = # attributes, m = # matches

Cosine sim: cos(θ) = ⟨x,y⟩ / (‖x‖‖y‖)
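The three formulas above, written out in NumPy (function names are mine):

```python
import numpy as np

def euclidean(x, y):
    """d(i,j) = sqrt(sum of squared differences)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.sum((x - y) ** 2))

def simple_dissimilarity(x, y):
    """Categorical: (p - m) / p, m = number of matching attributes."""
    x, y = np.asarray(x), np.asarray(y)
    return (len(x) - np.sum(x == y)) / len(x)

def cosine_similarity(x, y):
    """cos(theta) = <x, y> / (||x|| ||y||)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
```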

When to Use Which

Measure                 Best For
Euclidean               Continuous numerical attributes
Simple dissimilarity    Categorical / binary attributes
Cosine similarity       Text / high-dim sparse vectors (direction matters, not magnitude)
Inner product           When a basis / inner product space is defined
04

Inner Products & Orthogonality

Inner Product Axioms

  • Symmetry: ⟨x,y⟩ = ⟨y,x⟩
  • Linearity: ⟨αx+y, z⟩ = α⟨x,z⟩ + ⟨y,z⟩
  • Positive-definiteness: ⟨x,x⟩ > 0 for x≠0
  • General inner product: ⟨x,y⟩ = x̂ᵀAŷ where A is symmetric positive definite
  • Dot product is the standard inner product when A = I

Orthogonality

  • x ⊥ y iff ⟨x,y⟩ = 0
  • Zero vector is orthogonal to all vectors
  • Orthogonality depends on which inner product is used
  • cos(ω) = ⟨x,y⟩ / (‖x‖·‖y‖)

Orthogonal Matrices

AᵀA = I = AAᵀ → Aᵀ = A⁻¹
Columns are orthonormal: aᵢᵀaⱼ = δᵢⱼ
  • Preserve lengths: ‖Ax‖² = xᵀAᵀAx = xᵀx = ‖x‖²
  • 2D rotation matrix is orthogonal: [[cosθ, -sinθ], [sinθ, cosθ]]
  • Left-inverse = right-inverse for square matrices
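A quick check of both properties on the 2D rotation matrix (numbers are illustrative):

```python
import numpy as np

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.allclose(R.T @ R, np.eye(2))   # R^T R = I  =>  R^T = R^-1
x = np.array([3.0, 4.0])
assert np.isclose(np.linalg.norm(R @ x), np.linalg.norm(x))  # lengths preserved
```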

Norms (from inner products)

L2 norm: ‖x‖₂ = √(x₁²+…+xₙ²) = √⟨x,x⟩
L1 norm: ‖x‖₁ = Σ|xᵢ|
Lp norm: ‖x‖ₚ = (Σ|xᵢ|ᵖ)^(1/p)
  • L2 → ridge regularization; L1 → lasso (sparsity)
05

Eigenvalues & Eigenvectors

Definitions

Av = λv (v ≠ 0)

Characteristic equation: det(A − λI) = 0
Eigenspace of λ: Eλ = null(A − λI)
  • λ: eigenvalue · v: corresponding eigenvector (non-zero)
  • Eigenspace = set of all eigenvectors for λ + zero vector
  • Find λ: solve det(A-λI)=0 · find v: solve (A-λI)v=0
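The defining equation can be verified numerically — `np.linalg.eig` returns eigenvalues and eigenvectors (as columns), and each pair should satisfy Av = λv:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eig(A)            # solves det(A - lambda*I) = 0
for lam, v in zip(vals, vecs.T):         # columns of vecs are eigenvectors
    assert np.allclose(A @ v, lam * v)   # verify A v = lambda v
```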

Key Properties

  • Eigenvalues of triangular matrix = diagonal entries
  • A and Aᵀ have the same eigenvalues
  • Symmetric positive-definite matrices → all eigenvalues real and positive
  • Repeated eigenvalue: geometric multiplicity (# of independent eigenvectors) can be anywhere from 1 up to its algebraic multiplicity
  • det(A) = product of eigenvalues · trace(A) = sum of eigenvalues

Diagonalization

A = P D P⁻¹
D = diag(λ₁, …, λₙ)
P = [eigenvector₁ | … | eigenvectorₙ]
  • A is diagonalizable iff it has n linearly independent eigenvectors
  • Allows fast matrix powers: Aᵏ = P Dᵏ P⁻¹
  • All symmetric matrices are diagonalizable (spectral theorem)
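The fast-powers identity Aᵏ = P Dᵏ P⁻¹ is easy to check (example matrix chosen to have distinct eigenvalues 5 and 2, so it is diagonalizable):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])               # eigenvalues 5 and 2
vals, P = np.linalg.eig(A)               # columns of P are eigenvectors
k = 5
A_k = P @ np.diag(vals ** k) @ np.linalg.inv(P)   # A^k = P D^k P^-1
```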

ML Relevance

  • PCA: eigenvectors of covariance matrix = principal components
  • Largest eigenvalue → direction of maximum variance
  • Spectral clustering: eigenvectors of graph Laplacian
  • Page Rank: dominant eigenvector of transition matrix
  • Stability of RNNs related to spectral radius of weight matrix
06

Matrix Decompositions

Cholesky Decomposition

A = LLᵀ
L: lower triangular with positive diagonal
Requires A to be symmetric positive definite
  • det(A) = det(L)² = (product of diag entries of L)²
  • ML uses: sampling from multivariate Gaussian, solving linear systems efficiently
  • If A not SPD → Cholesky fails (use LU decomposition instead)
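A small sketch of Cholesky on an SPD matrix, checking both the factorization and the determinant identity above:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])               # symmetric positive definite
L = np.linalg.cholesky(A)                # lower triangular, positive diagonal
assert np.allclose(L @ L.T, A)
# det(A) = (product of diagonal entries of L)^2
assert np.isclose(np.linalg.det(A), np.prod(np.diag(L)) ** 2)
```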

SVD (Singular Value Decomposition)

A = U Σ Vᵀ
A ∈ R^(m×n)
U ∈ R^(m×m): left singular vectors (orthogonal)
Σ ∈ R^(m×n): singular values on diagonal (σ₁≥σ₂≥…≥0)
V ∈ R^(n×n): right singular vectors (orthogonal)

Computing: σᵢ = √(eigenvalues of AᵀA)
U columns = eigenvectors of AAᵀ
V columns = eigenvectors of AᵀA
uᵢ = (1/σᵢ) A vᵢ

SVD Properties & Uses

  • Exists for ANY matrix (not just square/symmetric)
  • Rank of A = number of nonzero singular values
  • Low-rank approximation: keep top-k singular values → best rank-k approx
  • Connection to PCA: eigenvalues of covariance S are λᵢ = σᵢ²/N, where σᵢ are the singular values of the (centered) data matrix X
  • Moore-Penrose pseudoinverse: A⁺ = V Σ⁺ Uᵀ
  • Condition number = σ_max / σ_min (large = ill-conditioned)
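The rank and low-rank-approximation properties, checked numerically (random matrix for illustration; the Frobenius-error identity is the Eckart–Young theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt

# rank = number of nonzero singular values
assert np.linalg.matrix_rank(A) == np.sum(s > 1e-10)

# best rank-k approximation: keep the top-k singular values
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# Frobenius error equals the energy in the dropped singular values
assert np.isclose(np.linalg.norm(A - A_k), np.sqrt(np.sum(s[k:] ** 2)))
```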

Decompositions Compared

Decomposition        Form        Requirement                        Key Use in ML
Cholesky             A = LLᵀ     Symmetric positive definite        Gaussian sampling, fast solvers
Eigendecomposition   A = PDP⁻¹   Square, n indep. eigenvectors      PCA, spectral methods
SVD                  A = UΣVᵀ    Any matrix                         Dim. reduction, pseudoinverse, low-rank approx
LU                   A = LU      Square (pivoting for stability)    Solving linear systems
07

Vector Calculus

Derivative (Univariate)

df/dx = lim_{h→0} [f(x+h) − f(x)] / h

Intuition: slope of tangent = direction of steepest ascent
d/dx(xⁿ) = nxⁿ⁻¹
  • Chain rule: d/dx f(g(x)) = f'(g(x))·g'(x)
  • Product rule: d/dx(fg) = f'g + fg'

Partial Derivatives & Gradient

∂f/∂xᵢ: derivative w.r.t. xᵢ, all others fixed

Gradient: ∇f(x) = [∂f/∂x₁, …, ∂f/∂xₙ]ᵀ ∈ Rⁿ
Points in direction of steepest ascent of f
  • Gradient of f(x) = xᵀAx is (A + Aᵀ)x; if A symmetric → 2Ax
  • Setting gradient to zero → critical points (min, max, saddle)
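The (A + Aᵀ)x formula can be sanity-checked against a central-difference numerical gradient (matrix and point are illustrative):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])               # deliberately not symmetric
f = lambda x: x @ A @ x                  # f(x) = x^T A x
grad = lambda x: (A + A.T) @ x           # analytic gradient

x = np.array([1.0, -2.0])
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(2)])     # central differences
assert np.allclose(num, grad(x), atol=1e-5)
```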

Jacobian

f: Rⁿ → Rᵐ

J = df/dx ∈ R^(m×n)

J = ⎡ ∂f₁/∂x₁ … ∂f₁/∂xₙ ⎤
    ⎢    ⋮     ⋱    ⋮   ⎥
    ⎣ ∂fₘ/∂x₁ … ∂fₘ/∂xₙ ⎦

Special case: f(x) = Ax → J = A ∈ R^(m×n)
  • Row i = gradient of fᵢ
  • Used in backpropagation to compute weight gradients

Hessian

H(f) ∈ R^(n×n) for f: Rⁿ → R

Hᵢⱼ = ∂²f / (∂xᵢ∂xⱼ)

H is symmetric if f has continuous 2nd derivatives

  • 2nd-order optimality test (two-variable case, det(H) = fₓₓfᵧᵧ − fₓᵧ²):
  • fₓₓ > 0 and det(H) > 0 → local min
  • fₓₓ < 0 and det(H) > 0 → local max
  • det(H) < 0 → saddle point
  • H positive definite everywhere → convex function → unique global min

Chain Rule (Multivariate)

h = f∘g, g: Rⁿ→Rᵐ, f: Rᵐ→Rᵏ

dh/dx = (∂f/∂g)·(∂g/∂x) [matrix product]

Backprop: ∂L/∂w = (∂L/∂a)·(∂a/∂w)
(output grad × local grad)
  • Each layer in a neural net applies chain rule
  • Neural net layer: fᵢ(xᵢ₋₁) = σ(Aᵢ₋₁xᵢ₋₁ + bᵢ₋₁)
08

Taylor Series & Approximations

Taylor Polynomial & Series

Taylor polynomial degree n at x₀:
Tₙ(x) = Σ_{k=0}^{n} f⁽ᵏ⁾(x₀)/k! · (x−x₀)ᵏ

Maclaurin series: x₀ = 0 special case

2nd-Order Multivariate Taylor

f(a+h, b+k) ≈ f(a,b) + hfₓ(a,b) + kfᵧ(a,b) + ½(h²fₓₓ + 2hkfₓᵧ + k²fᵧᵧ)
Used to derive second-order optimality conditions
  • Sign of Q(h,k) = h²fₓₓ + 2hkfₓᵧ + k²fᵧᵧ determines min/max/saddle

ML Applications

  • Optimization: approximate cost function locally → enables gradient/Newton methods
  • Newton's method: uses 2nd-order Taylor (Hessian) for faster convergence
  • Regression: polynomial features come from Taylor approximation
  • Model interpretation: expanding model around a point reveals feature sensitivity
  • Taylor approx is exact for polynomials of degree ≤ n
09

Gradient Descent & Optimization

Unconstrained Optimization

  • Goal: minimize differentiable objective function f(θ)
  • Necessary condition for local min: ∇f(θ*) = 0
  • Sufficient condition: ∇f = 0 AND H is positive definite
  • Gradient descent: use negative gradient to walk toward minimum
θ_{t+1} = θ_t − α·∇L(θ_t)
α = learning rate (step size)
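A minimal sketch of this update rule on a noise-free least-squares problem (data is synthetic; the mean-gradient form keeps the step size stable):

```python
import numpy as np

# Synthetic noise-free least squares: L(theta) = mean ||X theta - y||^2
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

theta = np.zeros(3)
alpha = 0.1                                      # learning rate
for _ in range(500):
    grad = (2 / len(y)) * X.T @ (X @ theta - y)  # gradient of mean squared error
    theta -= alpha * grad                        # theta <- theta - alpha * grad
```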

Batch vs SGD vs Mini-Batch

Method          Update Uses          Pros                         Cons
Batch GD        All n samples        Stable, smooth convergence   Slow for large data
SGD             1 sample             Fast, escapes local optima   Noisy, oscillates
Mini-Batch GD   Batch of k samples   Balance of both              Hyperparameter k

SGD: the gradient of one random sample is an unbiased estimate of the full gradient — the noisy steps average out to the true descent direction over many updates

Learning Rate Issues

  • Too large α: overshoots, oscillates around optimum
  • Too small α: very slow convergence
  • Best practice: use decaying learning rate — large initially, decreasing over time
  • Decaying update: θ_{t+1} = θ_t − αₜ·∇L, where αₜ decreases with t
  • With a suitably decaying α (Σαₜ = ∞, Σαₜ² < ∞), SGD converges almost surely to a stationary point

GD Failure Modes

  • Local optima: ∇f=0 but not global min — bad initialization causes this
  • Saddle points: ∇f=0 but neither min nor max — common in high-dim
  • Flat regions: gradient ≈ 0, updates nearly zero — GD stalls
  • High-dim functions have exponentially many local minima

Momentum

v_t = βv_{t-1} + (1−β)∇L(θ_t) [smoothed gradient]
θ_{t+1} = θ_t − α·v_t

β = momentum coefficient (typically 0.9)
  • Accumulates velocity in direction of consistent gradient
  • Dampens oscillations in directions with conflicting gradients
  • Helps escape flat regions and local optima
  • Analogy: marble rolling down hill — builds momentum over consistent slopes
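The two momentum equations above, sketched on an ill-conditioned quadratic where plain GD with the same step size would diverge along the steep coordinate (function name and test problem are mine):

```python
import numpy as np

def gd_momentum(grad, theta0, alpha=0.1, beta=0.9, steps=300):
    """Gradient descent with an exponentially averaged (momentum) gradient."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(theta)   # smoothed gradient
        theta = theta - alpha * v
    return theta

# Ill-conditioned quadratic f(x) = 0.5*(x0^2 + 25*x1^2): with alpha = 0.1,
# plain GD oscillates and diverges along x1; the momentum average damps it.
grad = lambda x: np.array([x[0], 25.0 * x[1]])
opt = gd_momentum(grad, [5.0, 5.0])
```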

Loss Function (Regression)

L(θ) = Σᵢ ‖θᵀxᵢ − yᵢ‖² [MSE / least squares]
L(θ) = Σᵢ Lᵢ(θ) [separable → enables SGD]
Gradient update: θ_{t+1} = θ_t − α·Σᵢ ∇Lᵢ(θ_t)
  • Least squares is convex → unique global minimum
  • Separability of L enables stochastic approximation
10

Constrained Optimization, Lagrange & KKT

Lagrange Multipliers

Minimize f(x) subject to g(x) = 0

Condition: ∇f = λ·∇g (gradients parallel)
λ = Lagrange multiplier

Lagrangian: L(x,λ) = f(x) − λ·g(x)
Solve: ∇ₓL = 0 and ∇λL = 0
  • At extremum, objective and constraint surfaces share a tangent plane
  • Each constraint gets its own λ
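A worked example of the ∇f = λ∇g condition (my own toy problem): minimize ‖x‖² subject to aᵀx = 1. Stationarity gives 2x = λa, and the constraint then forces x* = a/‖a‖²:

```python
import numpy as np

# minimize f(x) = ||x||^2  subject to  g(x) = a^T x - 1 = 0
# Stationarity 2x = lambda*a with a^T x = 1 gives x* = a / ||a||^2
a = np.array([3.0, 4.0])
x_star = a / (a @ a)

assert np.isclose(a @ x_star, 1.0)       # feasible
# every other feasible point x* + d (with a^T d = 0) has a larger norm
rng = np.random.default_rng(0)
for _ in range(100):
    d = rng.standard_normal(2)
    d -= (a @ d) / (a @ a) * a           # project d onto the constraint surface
    assert np.linalg.norm(x_star + d) >= np.linalg.norm(x_star) - 1e-12
```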

KKT Conditions (Inequality Constraints)

min f(x) s.t. gᵢ(x) ≤ 0, hⱼ(x) = 0

KKT conditions at optimal (x*, λ*, ν*):
1. Primal feasibility: gᵢ(x*) ≤ 0, hⱼ(x*) = 0
2. Dual feasibility: λᵢ* ≥ 0
3. Complementary slackness: λᵢ* gᵢ(x*) = 0
4. Stationarity: ∇f(x*) + Σλᵢ*∇gᵢ + Σνⱼ*∇hⱼ = 0

Duality

  • Lagrangian dual: D(λ) = minₓ L(x,λ) — unconstrained in x
  • Weak duality: D(λ) ≤ f(x*) always — dual gives lower bound
  • Strong duality: D(λ*) = f(x*) — primal = dual optimal
  • Slater's condition: if f convex, gᵢ convex, hⱼ linear, and strict interior feasible point exists → strong duality holds
  • SVM uses duality to move from primal (w,b) to dual (αᵢ)

Complementary Slackness (Intuition)

  • λᵢ*·gᵢ(x*) = 0 means: for each constraint either…
  • λᵢ* = 0 → constraint is inactive (not binding at solution)
  • gᵢ(x*) = 0 → constraint is active (binding / tight)
  • In SVM: αᵢ = 0 for non-support vectors, αᵢ > 0 for support vectors
11

Principal Component Analysis (PCA)

Goal & Intuition

  • Linear dimensionality reduction: D → M dimensions (M < D)
  • Find projection directions that retain maximum variance
  • Each principal component = direction of maximum residual variance
  • Output: lower-dim representation zₙ = Bᵀxₙ ∈ Rᴹ
  • B: D×M matrix with orthonormal columns (principal axes)

Algorithm (Step-by-Step)

  • 1. Standardize: subtract mean μ_d, divide by σ_d per dimension
  • 2. Compute covariance matrix: S = (1/N) Σ xₙxₙᵀ = (1/N)XXᵀ
  • 3. Eigendecompose S: Sb = λb → eigenvalues + eigenvectors
  • 4. Sort: eigenvectors by descending eigenvalue
  • 5. Select top M eigenvectors as columns of B
  • 6. Project: z* = Bᵀx* (new coordinates)
  • 7. Reconstruct: x̃ = BBᵀx* (in original space)
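The seven steps above in NumPy (random data for illustration; centering only, skipping the per-dimension scaling of step 1; the final assert is the identity that the mean reconstruction error equals the sum of the dropped eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 200))           # D=5 dimensions, N=200 samples
X = X - X.mean(axis=1, keepdims=True)       # 1. center each dimension
N = X.shape[1]
S = (X @ X.T) / N                           # 2. covariance matrix (D x D)
vals, vecs = np.linalg.eigh(S)              # 3. eigendecompose (symmetric S)
order = np.argsort(vals)[::-1]              # 4. sort by descending eigenvalue
vals, vecs = vals[order], vecs[:, order]
M = 2
B = vecs[:, :M]                             # 5. top-M eigenvectors as columns
Z = B.T @ X                                 # 6. project to M dimensions
X_rec = B @ Z                               # 7. reconstruct in original space

assert np.isclose(np.mean(np.sum((X - X_rec) ** 2, axis=0)), vals[M:].sum())
```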

PCA via SVD

Data matrix X ∈ R^(D×N), SVD: X = UΣVᵀ

Covariance: S = (1/N)XXᵀ = (1/N)UΣΣᵀUᵀ

Eigenvalues of S: λ_d = σ_d² / N
Eigenvectors of S = columns of U

→ compute PCA via SVD of X directly
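The λ_d = σ_d²/N identity can be confirmed by computing both routes on the same centered data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 100))           # D=4, N=100
X = X - X.mean(axis=1, keepdims=True)
N = X.shape[1]

# Eigendecomposition route
S = X @ X.T / N
eig_route = np.sort(np.linalg.eigvalsh(S))[::-1]

# SVD route: lambda_d = sigma_d^2 / N, eigenvectors = columns of U
U, s, Vt = np.linalg.svd(X, full_matrices=False)
svd_route = s ** 2 / N
```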

Key Properties

  • Projected variance on m-th PC = λₘ (m-th eigenvalue of S)
  • Total variance retained = Σ_{m=1}^{M} λₘ / Σ_{d=1}^{D} λ_d
  • PCs are orthogonal → projections are uncorrelated
  • PCA minimizes reconstruction error = MSE in lower-dim space
  • Limitation: linear only; no class labels used (unsupervised)
  • Use LDA instead for class-discriminant dimensionality reduction

Choosing Number of Components

  • Scree plot: plot λ₁, λ₂, …; look for "elbow"
  • Variance threshold: choose M s.t. Σλ_m / Σλ_d ≥ 0.95 (95% variance retained)
  • For reconstruction: pick M where reconstruction error is acceptable
12

Support Vector Machines (SVM)

Core Idea

  • Binary classifier: f(x,w,b) = sign(wᵀx + b)
  • Goal: find hyperplane that maximizes margin between classes
  • Margin: distance between hyperplane and nearest points of each class
  • Support vectors: data points closest to hyperplane (on margin boundary)
  • Only support vectors determine the decision boundary

Hard-Margin SVM (Primal)

min_{w,b} ½‖w‖²

s.t. yᵢ(wᵀxᵢ + b) ≥ 1 ∀i

Margin width = 2/‖w‖
Support vectors lie on the margin: wᵀxᵢ + b = ±1, i.e. yᵢ(wᵀxᵢ+b) = 1
  • Requires perfectly linearly separable data
  • Maximizing margin = minimizing ‖w‖²

Soft-Margin SVM (with C)

min_{w,b,ξ} ½‖w‖² + C·Σᵢ ξᵢ

s.t. yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0

Hinge loss form (unconstrained):
min_{w,b} ½‖w‖² + C·Σᵢ max(0, 1−yᵢ(wᵀxᵢ+b))
  • ξᵢ: slack variable — allows misclassification
  • Large C: penalize misclassification heavily → small margin, overfit
  • Small C: allow more misclassification → large margin, underfit

Dual Formulation

max_α Σᵢ αᵢ − ½ Σᵢⱼ αᵢαⱼyᵢyⱼ xᵢᵀxⱼ

s.t. 0 ≤ αᵢ ≤ C, Σᵢ αᵢyᵢ = 0

Recover: w = Σᵢ αᵢyᵢxᵢ
Predict: f(x) = sign(Σᵢ αᵢyᵢ xᵢᵀx + b)
  • αᵢ = 0 → non-support vector (doesn't affect decision boundary)
  • αᵢ > 0 → support vector
  • Dual depends on data only through inner products xᵢᵀxⱼ → enables kernel trick

Kernel Trick

  • Replace xᵢᵀxⱼ with K(xᵢ,xⱼ) = φ(xᵢ)ᵀφ(xⱼ) — implicit high-dim feature map
  • Never compute φ(x) explicitly — only need kernel evaluations
Linear: K(xᵢ,xⱼ) = xᵢᵀxⱼ
Polynomial: K(xᵢ,xⱼ) = (1 + xᵢᵀxⱼ)ᵖ
RBF/Gauss: K(xᵢ,xⱼ) = exp(−‖xᵢ−xⱼ‖²/2σ²)
Sigmoid: K(xᵢ,xⱼ) = tanh(β₀xᵢᵀxⱼ + β₁)

Valid kernel must correspond to an inner product in some feature space
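A sketch of the RBF kernel as a Gram matrix (function name is mine); for a valid kernel the matrix should be symmetric with ones on the diagonal and positive semidefinite:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K = rbf_kernel(X, X)
assert np.allclose(np.diag(K), 1.0)                # K(x, x) = 1
assert np.allclose(K, K.T)                         # symmetric
assert np.min(np.linalg.eigvalsh(K)) > -1e-10      # positive semidefinite
```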

SVM: Hard vs Soft vs Kernel — Quick Reference

                       Hard Margin          Soft Margin            Kernel SVM
Data requirement       Linearly separable   Noisy / overlapping    Non-linearly separable
Objective              min ½‖w‖²            min ½‖w‖² + CΣξᵢ       min ½‖w‖² + CΣξᵢ with K
Hyperparameter         None                 C (regularization)     C + kernel params (σ, p…)
Constraint violation   Not allowed          Allowed (penalized)    Allowed (penalized)
Decision boundary      Linear               Linear                 Non-linear in input space