MFML · Interview Notes

Mathematical Foundations for ML

01

Linear Systems & Matrices

Matrix Form of Linear System

  • System Ax = b → augmented matrix [A | b]
  • Coefficient matrix: just A (n×n)
  • Augmented matrix: [A | b] includes RHS
  • System has solution iff rank(A) = rank([A|b])

Elementary Row Operations

  • Rᵢ ↔ Rⱼ — swap two rows
  • Rᵢ → c·Rᵢ — multiply row by nonzero constant
  • Rᵢ → Rᵢ + a·Rⱼ — add multiple of one row to another
  • All are reversible · do not change solution set

REF & RREF

  • REF (Row Echelon Form): leading entry each row to right of row above; rows of zeros at bottom
  • RREF (Reduced REF): additionally, each pivot = 1 and only nonzero in its column
  • Pivot variable: corresponds to pivot column
  • Free variable: corresponds to non-pivot column → infinite solutions
  • Eigenvalues of triangular matrix = diagonal entries

Rank & Consistency

  • Rank: number of pivot elements (= dim of row/column space)
  • No solution: rank(A) < rank([A|b]) — last row [0…0 | c≠0]
  • Unique solution: rank(A) = rank([A|b]) = n
  • Infinite solutions: rank(A) = rank([A|b]) < n (free variables)
  • Homogeneous Ax=0: always consistent; nontrivial soln iff rank(A) < n
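The three consistency cases above can be checked directly with NumPy's rank function — a small sketch (the function name `classify_system` is mine):

```python
import numpy as np

def classify_system(A, b):
    """Classify Ax = b by comparing rank(A) with rank([A|b])."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    r_A = np.linalg.matrix_rank(A)
    r_Ab = np.linalg.matrix_rank(np.hstack([A, b]))
    n = A.shape[1]                       # number of unknowns
    if r_A < r_Ab:
        return "no solution"             # inconsistent row [0 ... 0 | c != 0]
    return "unique solution" if r_A == n else "infinite solutions"
```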

Special Matrices

  • Upper triangular U: all entries below diagonal = 0
  • Lower triangular L: all entries above diagonal = 0
  • Diagonal D: all non-diagonal entries = 0; D^k = diag(d₁^k, …, dₙ^k)
  • det(triangular) = product of diagonal entries
  • Inverse of diagonal: replace each dᵢ with 1/dᵢ (requires all dᵢ ≠ 0)
02

Vector Spaces, Basis & Linear Independence

Vector Space

  • Set V with addition + and scalar multiplication · that satisfies closure, associativity, commutativity, identity, inverse, distributivity
  • Rⁿ = {(x₁,…,xₙ) : xᵢ∈R} is a vector space
  • Vectors used in ML: data encoding, embeddings, model weights, activations

Linear Independence

  • Set {v₁,…,vₖ} is linearly independent if c₁v₁+…+cₖvₖ=0 implies all cᵢ=0
  • Linearly dependent if some vector is a linear combination of others
  • n vectors in Rᵐ where n>m → always linearly dependent
  • Any subset of an LI set is LI; any superset of LD set is LD
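Linear independence reduces to a rank check: stack the vectors as columns and test for full column rank. A minimal sketch (function name is mine):

```python
import numpy as np

def is_linearly_independent(vectors):
    """LI iff the matrix with these vectors as columns has full column rank."""
    M = np.column_stack([np.asarray(v, dtype=float) for v in vectors])
    return np.linalg.matrix_rank(M) == M.shape[1]
```

Note that three vectors in R² always come out dependent, matching the n > m rule above.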

Span, Basis & Dimension

  • Span(A): set of all linear combinations of vectors in A
  • A generates/spans V if every v∈V can be written as linear combination of A
  • Basis: linearly independent generating set · minimal spanning set
  • Dimension: number of vectors in any basis
  • All bases of a vector space have the same size

ML Relevance

  • Inputs encoded as vectors (tabular, image pixels, token embeddings)
  • Neural net layers: linear maps between vector spaces
  • Rank = number of "independent directions" in data — used in PCA
  • Free variables → underdetermined systems → regularization needed
03

Distance & Similarity Measures

Data Matrix

  • n data objects × p attributes → matrix X of size n×p
  • Dissimilarity matrix: n×n symmetric matrix D where d(i,j)=d(j,i), d(i,i)=0
  • Measures how unlike two data points are

Key Distance Measures

Euclidean: d(i,j) = √Σ(xᵢₖ − xⱼₖ)²

Simple dissim (categorical): d(i,j) = (p − m) / p
p = # attributes, m = # matches

Cosine sim: cos(θ) = ⟨x,y⟩ / (‖x‖‖y‖)
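The three formulas above, written out in NumPy (function names are mine):

```python
import numpy as np

def euclidean(x, y):
    """d(i,j) = sqrt(sum of squared differences)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.sum((x - y) ** 2))

def simple_dissimilarity(x, y):
    """Categorical: (p - m) / p, m = number of matching attributes."""
    x, y = np.asarray(x), np.asarray(y)
    return (len(x) - np.sum(x == y)) / len(x)

def cosine_similarity(x, y):
    """cos(theta) = <x, y> / (||x|| ||y||)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
```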

When to Use Which

Measure                 Best For
Euclidean               Continuous numerical attributes
Simple dissimilarity    Categorical / binary attributes
Cosine similarity       Text / high-dim sparse vectors (direction matters, not magnitude)
Inner product           When a basis / inner product space is defined
04

Inner Products & Orthogonality

Inner Product Axioms

  • Symmetry: ⟨x,y⟩ = ⟨y,x⟩
  • Linearity: ⟨αx+y, z⟩ = α⟨x,z⟩ + ⟨y,z⟩
  • Positive-definiteness: ⟨x,x⟩ > 0 for x≠0
  • General inner product: ⟨x,y⟩ = x̂ᵀAŷ where A is symmetric positive definite
  • Dot product is the standard inner product when A = I

Orthogonality

  • x ⊥ y iff ⟨x,y⟩ = 0
  • Zero vector is orthogonal to all vectors
  • Orthogonality depends on which inner product is used
  • cos(ω) = ⟨x,y⟩ / (‖x‖·‖y‖)

Orthogonal Matrices

AᵀA = I = AAᵀ → Aᵀ = A⁻¹
Columns are orthonormal: aᵢᵀaⱼ = δᵢⱼ
  • Preserve lengths: ‖Ax‖² = xᵀAᵀAx = xᵀx = ‖x‖²
  • 2D rotation matrix is orthogonal: [[cosθ, -sinθ], [sinθ, cosθ]]
  • Left-inverse = right-inverse for square matrices
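A quick check of both properties on the 2D rotation matrix (numbers are illustrative):

```python
import numpy as np

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.allclose(R.T @ R, np.eye(2))   # R^T R = I  =>  R^T = R^-1
x = np.array([3.0, 4.0])
assert np.isclose(np.linalg.norm(R @ x), np.linalg.norm(x))  # lengths preserved
```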

Norms (from inner products)

L2 norm: ‖x‖₂ = √(x₁²+…+xₙ²) = √⟨x,x⟩
L1 norm: ‖x‖₁ = Σ|xᵢ|
Lp norm: ‖x‖ₚ = (Σ|xᵢ|ᵖ)^(1/p)
  • L2 → ridge regularization; L1 → lasso (sparsity)
05

Eigenvalues & Eigenvectors

Definitions

Av = λv (v ≠ 0)

Characteristic equation: det(A − λI) = 0
Eigenspace of λ: Eλ = null(A − λI)
  • λ: eigenvalue · v: corresponding eigenvector (non-zero)
  • Eigenspace = set of all eigenvectors for λ + zero vector
  • Find λ: solve det(A-λI)=0 · find v: solve (A-λI)v=0
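The defining equation can be verified numerically — `np.linalg.eig` returns eigenvalues and eigenvectors (as columns), and each pair should satisfy Av = λv:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eig(A)            # solves det(A - lambda*I) = 0
for lam, v in zip(vals, vecs.T):         # columns of vecs are eigenvectors
    assert np.allclose(A @ v, lam * v)   # verify A v = lambda v
```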

Key Properties

  • Eigenvalues of triangular matrix = diagonal entries
  • A and Aᵀ have the same eigenvalues
  • Symmetric positive-definite matrices → all eigenvalues real and positive
  • Repeated eigenvalue: geometric multiplicity (# of independent eigenvectors) can be anywhere from 1 up to its algebraic multiplicity
  • det(A) = product of eigenvalues · trace(A) = sum of eigenvalues

Diagonalization

A = P D P⁻¹
D = diag(λ₁, …, λₙ)
P = [eigenvector₁ | … | eigenvectorₙ]
  • A is diagonalizable iff it has n linearly independent eigenvectors
  • Allows fast matrix powers: Aᵏ = P Dᵏ P⁻¹
  • All symmetric matrices are diagonalizable (spectral theorem)
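The fast-powers identity Aᵏ = P Dᵏ P⁻¹ is easy to check (example matrix chosen to have distinct eigenvalues 5 and 2, so it is diagonalizable):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])               # eigenvalues 5 and 2
vals, P = np.linalg.eig(A)               # columns of P are eigenvectors
k = 5
A_k = P @ np.diag(vals ** k) @ np.linalg.inv(P)   # A^k = P D^k P^-1
```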

ML Relevance

  • PCA: eigenvectors of covariance matrix = principal components
  • Largest eigenvalue → direction of maximum variance
  • Spectral clustering: eigenvectors of graph Laplacian
  • Page Rank: dominant eigenvector of transition matrix
  • Stability of RNNs related to spectral radius of weight matrix
06

Matrix Decompositions

Cholesky Decomposition

A = LLᵀ
L: lower triangular with positive diagonal
Requires A to be symmetric positive definite
  • det(A) = det(L)² = (product of diag entries of L)²
  • ML uses: sampling from multivariate Gaussian, solving linear systems efficiently
  • If A not SPD → Cholesky fails (use LU decomposition instead)
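A small sketch of Cholesky on an SPD matrix, checking both the factorization and the determinant identity above:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])               # symmetric positive definite
L = np.linalg.cholesky(A)                # lower triangular, positive diagonal
assert np.allclose(L @ L.T, A)
# det(A) = (product of diagonal entries of L)^2
assert np.isclose(np.linalg.det(A), np.prod(np.diag(L)) ** 2)
```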

SVD (Singular Value Decomposition)

A = U Σ Vᵀ
A ∈ R^(m×n)
U ∈ R^(m×m): left singular vectors (orthogonal)
Σ ∈ R^(m×n): singular values on diagonal (σ₁≥σ₂≥…≥0)
V ∈ R^(n×n): right singular vectors (orthogonal)

Computing: σᵢ = √(eigenvalues of AᵀA)
U columns = eigenvectors of AAᵀ
V columns = eigenvectors of AᵀA
uᵢ = (1/σᵢ) A vᵢ

SVD Properties & Uses

  • Exists for ANY matrix (not just square/symmetric)
  • Rank of A = number of nonzero singular values
  • Low-rank approximation: keep top-k singular values → best rank-k approx
  • Connection to PCA: eigenvalues of covariance S are λᵢ = σᵢ²/N, where σᵢ are the singular values of the (centered) data matrix X
  • Moore-Penrose pseudoinverse: A⁺ = V Σ⁺ Uᵀ
  • Condition number = σ_max / σ_min (large = ill-conditioned)
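The rank and low-rank-approximation properties, checked numerically (random matrix for illustration; the Frobenius-error identity is the Eckart–Young theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt

# rank = number of nonzero singular values
assert np.linalg.matrix_rank(A) == np.sum(s > 1e-10)

# best rank-k approximation: keep the top-k singular values
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# Frobenius error equals the energy in the dropped singular values
assert np.isclose(np.linalg.norm(A - A_k), np.sqrt(np.sum(s[k:] ** 2)))
```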

Decompositions Compared

Decomposition        Form        Requirement                        Key Use in ML
Cholesky             A = LLᵀ     Symmetric positive definite        Gaussian sampling, fast solvers
Eigendecomposition   A = PDP⁻¹   Square, n indep. eigenvectors      PCA, spectral methods
SVD                  A = UΣVᵀ    Any matrix                         Dim. reduction, pseudoinverse, low-rank approx
LU                   A = LU      Square (pivoting for stability)    Solving linear systems
07

Vector Calculus

Derivative (Univariate)

df/dx = lim_{h→0} [f(x+h) − f(x)] / h

Intuition: slope of tangent = direction of steepest ascent
d/dx(xⁿ) = nxⁿ⁻¹
  • Chain rule: d/dx f(g(x)) = f'(g(x))·g'(x)
  • Product rule: d/dx(fg) = f'g + fg'

Partial Derivatives & Gradient

∂f/∂xᵢ: derivative w.r.t. xᵢ, all others fixed

Gradient: ∇f(x) = [∂f/∂x₁, …, ∂f/∂xₙ]ᵀ ∈ Rⁿ
Points in direction of steepest ascent of f
  • Gradient of f(x) = xᵀAx is (A + Aᵀ)x; if A symmetric → 2Ax
  • Setting gradient to zero → critical points (min, max, saddle)
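The (A + Aᵀ)x formula can be sanity-checked against a central-difference numerical gradient (matrix and point are illustrative):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])               # deliberately not symmetric
f = lambda x: x @ A @ x                  # f(x) = x^T A x
grad = lambda x: (A + A.T) @ x           # analytic gradient

x = np.array([1.0, -2.0])
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(2)])     # central differences
assert np.allclose(num, grad(x), atol=1e-5)
```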

Jacobian

f: Rⁿ → Rᵐ

J = df/dx ∈ R^(m×n)

J = ⎡ ∂f₁/∂x₁ … ∂f₁/∂xₙ ⎤
    ⎢    ⋮     ⋱    ⋮   ⎥
    ⎣ ∂fₘ/∂x₁ … ∂fₘ/∂xₙ ⎦

Special case: f(x) = Ax → J = A ∈ R^(m×n)
  • Row i = gradient of fᵢ
  • Used in backpropagation to compute weight gradients

Hessian

H(f) ∈ R^(n×n) for f: Rⁿ → R

Hᵢⱼ = ∂²f / (∂xᵢ∂xⱼ)

H is symmetric if f has continuous 2nd derivatives

  • 2nd-order optimality test (two-variable case, det(H) = fₓₓfᵧᵧ − fₓᵧ²):
  • fₓₓ > 0 and det(H) > 0 → local min
  • fₓₓ < 0 and det(H) > 0 → local max
  • det(H) < 0 → saddle point
  • H positive definite everywhere → convex function → unique global min

Chain Rule (Multivariate)

h = f∘g, g: Rⁿ→Rᵐ, f: Rᵐ→Rᵏ

dh/dx = (∂f/∂g)·(∂g/∂x) [matrix product]

Backprop: ∂L/∂w = (∂L/∂a)·(∂a/∂w)
(output grad × local grad)
  • Each layer in a neural net applies chain rule
  • Neural net layer: fᵢ(xᵢ₋₁) = σ(Aᵢ₋₁xᵢ₋₁ + bᵢ₋₁)
08

Taylor Series & Approximations

Taylor Polynomial & Series

Taylor polynomial degree n at x₀:
Tₙ(x) = Σ_{k=0}^{n} f⁽ᵏ⁾(x₀)/k! · (x−x₀)ᵏ

Maclaurin series: x₀ = 0 special case

2nd-Order Multivariate Taylor

f(a+h, b+k) ≈ f(a,b) + hfₓ(a,b) + kfᵧ(a,b) + ½(h²fₓₓ + 2hkfₓᵧ + k²fᵧᵧ)
Used to derive second-order optimality conditions
  • Sign of Q(h,k) = h²fₓₓ + 2hkfₓᵧ + k²fᵧᵧ determines min/max/saddle

ML Applications

  • Optimization: approximate cost function locally → enables gradient/Newton methods
  • Newton's method: uses 2nd-order Taylor (Hessian) for faster convergence
  • Regression: polynomial features come from Taylor approximation
  • Model interpretation: expanding model around a point reveals feature sensitivity
  • Taylor approx is exact for polynomials of degree ≤ n
09

Gradient Descent & Optimization

Unconstrained Optimization

  • Goal: minimize differentiable objective function f(θ)
  • Necessary condition for local min: ∇f(θ*) = 0
  • Sufficient condition: ∇f = 0 AND H is positive definite
  • Gradient descent: use negative gradient to walk toward minimum
θ_{t+1} = θ_t − α·∇L(θ_t)
α = learning rate (step size)
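A minimal sketch of this update rule on a noise-free least-squares problem (data is synthetic; the mean-gradient form keeps the step size stable):

```python
import numpy as np

# Synthetic noise-free least squares: L(theta) = mean ||X theta - y||^2
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

theta = np.zeros(3)
alpha = 0.1                                      # learning rate
for _ in range(500):
    grad = (2 / len(y)) * X.T @ (X @ theta - y)  # gradient of mean squared error
    theta -= alpha * grad                        # theta <- theta - alpha * grad
```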

Batch vs SGD vs Mini-Batch

Method          Update Uses          Pros                         Cons
Batch GD        All n samples        Stable, smooth convergence   Slow for large data
SGD             1 sample             Fast, escapes local optima   Noisy, oscillates
Mini-Batch GD   Batch of k samples   Balance of both              Hyperparameter k

SGD: the gradient of one random sample is an unbiased estimate of the full gradient — the noisy steps average out to the true descent direction over many updates

Learning Rate Issues

  • Too large α: overshoots, oscillates around optimum
  • Too small α: very slow convergence
  • Best practice: use decaying learning rate — large initially, decreasing over time
  • Decaying update: θ_{t+1} = θ_t − αₜ·∇L, where αₜ decreases with t
  • With a suitably decaying α (Σαₜ = ∞, Σαₜ² < ∞), SGD converges almost surely to a stationary point

GD Failure Modes

  • Local optima: ∇f=0 but not global min — bad initialization causes this
  • Saddle points: ∇f=0 but neither min nor max — common in high-dim
  • Flat regions: gradient ≈ 0, updates nearly zero — GD stalls
  • High-dim functions have exponentially many local minima

Momentum

v_t = βv_{t-1} + (1−β)∇L(θ_t) [smoothed gradient]
θ_{t+1} = θ_t − α·v_t

β = momentum coefficient (typically 0.9)
  • Accumulates velocity in direction of consistent gradient
  • Dampens oscillations in directions with conflicting gradients
  • Helps escape flat regions and local optima
  • Analogy: marble rolling down hill — builds momentum over consistent slopes
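The two momentum equations above, sketched on an ill-conditioned quadratic where plain GD with the same step size would diverge along the steep coordinate (function name and test problem are mine):

```python
import numpy as np

def gd_momentum(grad, theta0, alpha=0.1, beta=0.9, steps=300):
    """Gradient descent with an exponentially averaged (momentum) gradient."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(theta)   # smoothed gradient
        theta = theta - alpha * v
    return theta

# Ill-conditioned quadratic f(x) = 0.5*(x0^2 + 25*x1^2): with alpha = 0.1,
# plain GD oscillates and diverges along x1; the momentum average damps it.
grad = lambda x: np.array([x[0], 25.0 * x[1]])
opt = gd_momentum(grad, [5.0, 5.0])
```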

Loss Function (Regression)

L(θ) = Σᵢ ‖θᵀxᵢ − yᵢ‖² [MSE / least squares]
L(θ) = Σᵢ Lᵢ(θ) [separable → enables SGD]
Gradient update: θ_{t+1} = θ_t − α·Σᵢ ∇Lᵢ(θ_t)
  • Least squares is convex → unique global minimum
  • Separability of L enables stochastic approximation
10

Constrained Optimization, Lagrange & KKT

Lagrange Multipliers

Minimize f(x) subject to g(x) = 0

Condition: ∇f = λ·∇g (gradients parallel)
λ = Lagrange multiplier

Lagrangian: L(x,λ) = f(x) − λ·g(x)
Solve: ∇ₓL = 0 and ∇λL = 0
  • At extremum, objective and constraint surfaces share a tangent plane
  • Each constraint gets its own λ
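A worked example of the ∇f = λ∇g condition (my own toy problem): minimize ‖x‖² subject to aᵀx = 1. Stationarity gives 2x = λa, and the constraint then forces x* = a/‖a‖²:

```python
import numpy as np

# minimize f(x) = ||x||^2  subject to  g(x) = a^T x - 1 = 0
# Stationarity 2x = lambda*a with a^T x = 1 gives x* = a / ||a||^2
a = np.array([3.0, 4.0])
x_star = a / (a @ a)

assert np.isclose(a @ x_star, 1.0)       # feasible
# every other feasible point x* + d (with a^T d = 0) has a larger norm
rng = np.random.default_rng(0)
for _ in range(100):
    d = rng.standard_normal(2)
    d -= (a @ d) / (a @ a) * a           # project d onto the constraint surface
    assert np.linalg.norm(x_star + d) >= np.linalg.norm(x_star) - 1e-12
```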

KKT Conditions (Inequality Constraints)

min f(x) s.t. gᵢ(x) ≤ 0, hⱼ(x) = 0

KKT conditions at optimal (x*, λ*, ν*):
1. Primal feasibility: gᵢ(x*) ≤ 0, hⱼ(x*) = 0
2. Dual feasibility: λᵢ* ≥ 0
3. Complementary slackness: λᵢ* gᵢ(x*) = 0
4. Stationarity: ∇f(x*) + Σλᵢ*∇gᵢ + Σνⱼ*∇hⱼ = 0

Duality

  • Lagrangian dual: D(λ) = minₓ L(x,λ) — unconstrained in x
  • Weak duality: D(λ) ≤ f(x*) always — dual gives lower bound
  • Strong duality: D(λ*) = f(x*) — primal = dual optimal
  • Slater's condition: if f convex, gᵢ convex, hⱼ linear, and strict interior feasible point exists → strong duality holds
  • SVM uses duality to move from primal (w,b) to dual (αᵢ)

Complementary Slackness (Intuition)

  • λᵢ*·gᵢ(x*) = 0 means: for each constraint either…
  • λᵢ* = 0 → constraint is inactive (not binding at solution)
  • gᵢ(x*) = 0 → constraint is active (binding / tight)
  • In SVM: αᵢ = 0 for non-support vectors, αᵢ > 0 for support vectors
11

Principal Component Analysis (PCA)

Goal & Intuition

  • Linear dimensionality reduction: D → M dimensions (M < D)
  • Find projection directions that retain maximum variance
  • Each principal component = direction of maximum residual variance
  • Output: lower-dim representation zₙ = Bᵀxₙ ∈ Rᴹ
  • B: D×M matrix with orthonormal columns (principal axes)

Algorithm (Step-by-Step)

  • 1. Standardize: subtract mean μ_d, divide by σ_d per dimension
  • 2. Compute covariance matrix: S = (1/N) Σ xₙxₙᵀ = (1/N)XXᵀ
  • 3. Eigendecompose S: Sb = λb → eigenvalues + eigenvectors
  • 4. Sort: eigenvectors by descending eigenvalue
  • 5. Select top M eigenvectors as columns of B
  • 6. Project: z* = Bᵀx* (new coordinates)
  • 7. Reconstruct: x̃ = BBᵀx* (in original space)
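The seven steps above in NumPy (random data for illustration; centering only, skipping the per-dimension scaling of step 1; the final assert is the identity that the mean reconstruction error equals the sum of the dropped eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 200))           # D=5 dimensions, N=200 samples
X = X - X.mean(axis=1, keepdims=True)       # 1. center each dimension
N = X.shape[1]
S = (X @ X.T) / N                           # 2. covariance matrix (D x D)
vals, vecs = np.linalg.eigh(S)              # 3. eigendecompose (symmetric S)
order = np.argsort(vals)[::-1]              # 4. sort by descending eigenvalue
vals, vecs = vals[order], vecs[:, order]
M = 2
B = vecs[:, :M]                             # 5. top-M eigenvectors as columns
Z = B.T @ X                                 # 6. project to M dimensions
X_rec = B @ Z                               # 7. reconstruct in original space

assert np.isclose(np.mean(np.sum((X - X_rec) ** 2, axis=0)), vals[M:].sum())
```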

PCA via SVD

Data matrix X ∈ R^(D×N), SVD: X = UΣVᵀ

Covariance: S = (1/N)XXᵀ = (1/N)UΣΣᵀUᵀ

Eigenvalues of S: λ_d = σ_d² / N
Eigenvectors of S = columns of U

→ compute PCA via SVD of X directly
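The λ_d = σ_d²/N identity can be confirmed by computing both routes on the same centered data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 100))           # D=4, N=100
X = X - X.mean(axis=1, keepdims=True)
N = X.shape[1]

# Eigendecomposition route
S = X @ X.T / N
eig_route = np.sort(np.linalg.eigvalsh(S))[::-1]

# SVD route: lambda_d = sigma_d^2 / N, eigenvectors = columns of U
U, s, Vt = np.linalg.svd(X, full_matrices=False)
svd_route = s ** 2 / N
```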

Key Properties

  • Projected variance on m-th PC = λₘ (m-th eigenvalue of S)
  • Total variance retained = Σ_{m=1}^{M} λₘ / Σ_{d=1}^{D} λ_d
  • PCs are orthogonal → projections are uncorrelated
  • PCA minimizes reconstruction error = MSE in lower-dim space
  • Limitation: linear only; no class labels used (unsupervised)
  • Use LDA instead for class-discriminant dimensionality reduction

Choosing Number of Components

  • Scree plot: plot λ₁, λ₂, …; look for "elbow"
  • Variance threshold: choose M s.t. Σλ_m / Σλ_d ≥ 0.95 (95% variance retained)
  • For reconstruction: pick M where reconstruction error is acceptable
12

Support Vector Machines (SVM)

Core Idea

  • Binary classifier: f(x,w,b) = sign(wᵀx + b)
  • Goal: find hyperplane that maximizes margin between classes
  • Margin: distance between hyperplane and nearest points of each class
  • Support vectors: data points closest to hyperplane (on margin boundary)
  • Only support vectors determine the decision boundary

Hard-Margin SVM (Primal)

min_{w,b} ½‖w‖²

s.t. yᵢ(wᵀxᵢ + b) ≥ 1 ∀i

Margin width = 2/‖w‖
Support vectors lie on the margin: wᵀxᵢ + b = ±1, i.e. yᵢ(wᵀxᵢ+b) = 1
  • Requires perfectly linearly separable data
  • Maximizing margin = minimizing ‖w‖²

Soft-Margin SVM (with C)

min_{w,b,ξ} ½‖w‖² + C·Σᵢ ξᵢ

s.t. yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0

Hinge loss form (unconstrained):
min_{w,b} ½‖w‖² + C·Σᵢ max(0, 1−yᵢ(wᵀxᵢ+b))
  • ξᵢ: slack variable — allows misclassification
  • Large C: penalize misclassification heavily → small margin, overfit
  • Small C: allow more misclassification → large margin, underfit

Dual Formulation

max_α Σᵢ αᵢ − ½ Σᵢⱼ αᵢαⱼyᵢyⱼ xᵢᵀxⱼ

s.t. 0 ≤ αᵢ ≤ C, Σᵢ αᵢyᵢ = 0

Recover: w = Σᵢ αᵢyᵢxᵢ
Predict: f(x) = sign(Σᵢ αᵢyᵢ xᵢᵀx + b)
  • αᵢ = 0 → non-support vector (doesn't affect decision boundary)
  • αᵢ > 0 → support vector
  • Dual depends on data only through inner products xᵢᵀxⱼ → enables kernel trick

Kernel Trick

  • Replace xᵢᵀxⱼ with K(xᵢ,xⱼ) = φ(xᵢ)ᵀφ(xⱼ) — implicit high-dim feature map
  • Never compute φ(x) explicitly — only need kernel evaluations
Linear: K(xᵢ,xⱼ) = xᵢᵀxⱼ
Polynomial: K(xᵢ,xⱼ) = (1 + xᵢᵀxⱼ)ᵖ
RBF/Gauss: K(xᵢ,xⱼ) = exp(−‖xᵢ−xⱼ‖²/2σ²)
Sigmoid: K(xᵢ,xⱼ) = tanh(β₀xᵢᵀxⱼ + β₁)

Valid kernel must correspond to an inner product in some feature space
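A sketch of the RBF kernel as a Gram matrix (function name is mine); for a valid kernel the matrix should be symmetric with ones on the diagonal and positive semidefinite:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K = rbf_kernel(X, X)
assert np.allclose(np.diag(K), 1.0)                # K(x, x) = 1
assert np.allclose(K, K.T)                         # symmetric
assert np.min(np.linalg.eigvalsh(K)) > -1e-10      # positive semidefinite
```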

SVM: Hard vs Soft vs Kernel — Quick Reference

                       Hard Margin          Soft Margin            Kernel SVM
Data requirement       Linearly separable   Noisy / overlapping    Non-linearly separable
Objective              min ½‖w‖²            min ½‖w‖² + CΣξᵢ       min ½‖w‖² + CΣξᵢ with K
Hyperparameter         None                 C (regularization)     C + kernel params (σ, p…)
Constraint violation   Not allowed          Allowed (penalized)    Allowed (penalized)
Decision boundary      Linear               Linear                 Non-linear in input space