DL Interview Notes

// deep learning · forward + backprop equations · architectures + pseudocode · interview-ready
01

Introduction to Deep Learning

What / Why / Where

Classical ML | Deep Learning
Manual feature engineering | Learned features end-to-end
Works well on small data | Needs large data (or transfer)
Interpretable (trees, LR) | Black box (need SHAP/GradCAM)
Tabular: XGB wins | Images / text / speech: DL wins
Fast train + predict | Slow train; fast predict (GPU)
DL = representation learning; each layer = progressively abstract features

Universal Approximation Theorem

Statement
A single hidden layer NN with N neurons and non-polynomial activation
can approximate any continuous function f: ℝⁿ → ℝ on compact domain
to arbitrary precision ε > 0
Arbitrary-width version (a dual arbitrary-depth, bounded-width version also exists). Doesn't say how many neurons are needed or how to find the weights.
Depth efficiency: some functions representable by depth-n circuits need exponential (2^n) width in a shallow net but only polynomial size when depth is allowed to grow

DL Taxonomy

Architecture | Inductive Bias | Best For
MLP / DFNN | None (fully connected) | Tabular, small structured
CNN | Translation equivariance, local receptive field | Images, time-series, audio
RNN / LSTM | Sequential order, shared weights over time | Text, speech, time-series
Transformer | Global attention, permutation equivariant | NLP, vision (ViT), multimodal
GNN | Message passing over graph structure | Molecules, social graphs
VAE / GAN / Diffusion | Latent space structure | Generative modeling
02

Artificial Neuron & Perceptron

Artificial Neuron Model (McCulloch-Pitts)

Single Neuron Computation
z = w₁x₁ + w₂x₂ + ··· + wₙxₙ + b = wᵀx + b
a = g(z)
z: pre-activation (net input); a: activation (output); g: activation function
Component | Biological Analogy | Role
Input xᵢ | Dendrites / synapse input | Receive signals
Weight wᵢ | Synaptic strength | Scale each input
Bias b | Neuron firing threshold | Shift activation boundary
Σ (dot product) | Cell body (soma) summation | Aggregate inputs
Activation g | Action potential fire/no-fire | Non-linear transformation
Output a | Axon output signal | Pass to next layer
Biological neurons: spike trains, ~7 types, ~10⁴ connections each. Artificial: gross simplification.

Perceptron

Perceptron Output (step activation)
ŷ = sign(wᵀx + b) = { +1 if wᵀx+b ≥ 0, −1 otherwise }
Perceptron Learning Rule
// Update only on misclassification
w ← w + η · yᵢ · xᵢ if yᵢ(wᵀxᵢ + b) ≤ 0
b ← b + η · yᵢ
η: learning rate (typically 1 for perceptron). Equivalent to SGD on the perceptron criterion max(0, −y(wᵀx+b)); the 0-1 loss itself has zero gradient almost everywhere.
// Perceptron Algorithm
Initialize w=0, b=0
repeat until no misclassifications (or max epochs):
for each (xᵢ, yᵢ) in D:
if yᵢ · (wᵀxᵢ + b) ≤ 0: // misclassified
w ← w + η·yᵢ·xᵢ
b ← b + η·yᵢ
Perceptron Convergence Theorem
If data linearly separable with margin γ, ||x||≤R, converges in ≤ (R/γ)² steps
If NOT linearly separable → infinite loop. No convergence guarantee.
XOR problem: not linearly separable → single perceptron fails. Need hidden layers (Minsky & Papert 1969).
Decision boundary = hyperplane; moves toward correctly classifying each mistake
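A minimal NumPy sketch of the algorithm above; the function name perceptron_train and the max_epochs cap are illustrative, not from any library.
import numpy as np

def perceptron_train(X, y, lr=1.0, max_epochs=100):
    """Perceptron learning rule. X: (n, d) features, y: (n,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:      # misclassified (or on the boundary)
                w += lr * yi * xi           # move boundary toward correct side
                b += lr * yi
                mistakes += 1
        if mistakes == 0:                   # converged (only if linearly separable)
            break
    return w, b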

Perceptron vs Logistic Regression vs SVM

Property | Perceptron | Logistic Reg | SVM
Output | Binary ±1 | Probability [0,1] | Binary ±1
Loss | 0-1 (implicit) | Cross-entropy | Hinge loss
Update | Only on mistakes | Always (gradient) | Only on SVs
Boundary | Any separating | MLE boundary | Max margin
Probabilistic | No | Yes | No (needs Platt)
Convergence | If separable only | Always (convex) | Always (convex)
03

Linear Neural Network for Regression

Single Output Neuron — Regression

Forward Pass
z = wᵀx + b (no activation — linear output)
ŷ = z
Loss Function (MSE)
L(w,b) = (1/2)(ŷ − y)² = (1/2)(wᵀx + b − y)²
½ for clean gradient; over dataset: J = (1/2m)Σᵢ(ŷᵢ−yᵢ)²
Backward Pass (Gradients)
∂L/∂ŷ = ŷ − y // loss → output
∂L/∂w = (ŷ − y) · x // output → weights
∂L/∂b = (ŷ − y) // output → bias
Weight Update (GD)
w ← w − α · ∂L/∂w = w − α(ŷ−y)x
b ← b − α · ∂L/∂b = b − α(ŷ−y)
Identical to linear regression; neuron notation makes extension to deep nets natural
Activation for regression output: identity (none) or softplus (if output must be >0)
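A minimal NumPy sketch tying the forward pass, gradients, and update above together for a single linear neuron trained with per-example SGD; names are illustrative.
import numpy as np

def linear_neuron_sgd(X, y, lr=0.01, epochs=100):
    """Single linear neuron with MSE, matching the gradients above."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = w @ xi + b          # forward: z = w^T x + b, identity output
            err = y_hat - yi            # dL/dy_hat = y_hat - y
            w -= lr * err * xi          # dL/dw = (y_hat - y) x
            b -= lr * err               # dL/db = (y_hat - y)
    return w, b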

MLE Justification for MSE

Assume y = f(x;w) + ε, ε ~ N(0,σ²)
P(y|x;w) = N(y; wᵀx, σ²)
MLE: max Σᵢ log P(yᵢ|xᵢ;w)
= max −(1/2σ²) Σᵢ (ŷᵢ − yᵢ)²
≡ min MSE
Huber Loss (robust)
L_δ(r) = { ½r² if |r| ≤ δ
δ(|r|−½δ) if |r| > δ }
Quadratic near 0 (smooth gradient); linear for large errors (robust to outliers)
04

Linear Neural Network for Classification

Binary Classification — Sigmoid Output

Forward Pass
z = wᵀx + b
a = σ(z) = 1/(1+e⁻ᶻ) = P(y=1|x)
Loss — Binary Cross-Entropy (BCE)
L = −[y log(a) + (1−y) log(1−a)]
Backward Pass (chain rule)
∂L/∂a = −y/a + (1−y)/(1−a) = (a−y)/(a(1−a))
∂a/∂z = σ(z)(1−σ(z)) = a(1−a) // sigmoid derivative
∂L/∂z = ∂L/∂a · ∂a/∂z = a − y // clean! loss derivative cancels sigmoid
∂L/∂w = (a − y) · x
∂L/∂b = a − y
∂L/∂z = a−y for sigmoid+BCE is the same form as linear regression. No vanishing gradient in output layer for this pairing.

Multiclass Classification — Softmax Output

Forward Pass (K classes)
zₖ = wₖᵀx + bₖ, k = 1,...,K
aₖ = softmax(z)ₖ = exp(zₖ) / Σⱼ exp(zⱼ) = P(y=k|x)
Numerical Stability Trick
aₖ = exp(zₖ − max(z)) / Σⱼ exp(zⱼ − max(z))
Subtract max before exp to prevent overflow; doesn't change the output (numerator and denominator are both scaled by the same constant exp(−max(z)))
Loss — Categorical Cross-Entropy
L = −Σₖ yₖ log(aₖ) = −log(a_{true_class}) (one-hot y)
Backward Pass — Softmax + CCE gradient
∂L/∂zₖ = aₖ − yₖ // extremely clean! identical to BCE form
Proof: ∂L/∂zₖ = Σⱼ (∂L/∂aⱼ)(∂aⱼ/∂zₖ)
For j=k: ∂aⱼ/∂zₖ = aₖ(1−aₖ) → contributes −(yₖ/aₖ)·aₖ(1−aₖ)
For j≠k: ∂aⱼ/∂zₖ = −aⱼaₖ → contributes +(yⱼ/aⱼ)·aⱼaₖ = yⱼaₖ
Sum = aₖ·Σⱼyⱼ − yₖ = aₖ − yₖ (since Σyⱼ=1 for one-hot)
Weight / Bias Gradients
∂L/∂Wₖ = (aₖ − yₖ) · xᵀ // outer product for matrix form
∂L/∂bₖ = aₖ − yₖ
Softmax + CCE gradient = predicted − one-hot label. Elegant and stable.
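A minimal NumPy sketch of the stability trick plus the clean softmax+CCE gradient, assuming a batch of logits z and one-hot labels; the function name is illustrative.
import numpy as np

def softmax_ce(z, y_onehot):
    """Numerically stable softmax + cross-entropy; returns mean loss and dL/dz."""
    z_shift = z - z.max(axis=-1, keepdims=True)        # stability: subtract row max
    exp = np.exp(z_shift)
    a = exp / exp.sum(axis=-1, keepdims=True)          # softmax probabilities
    loss = -np.sum(y_onehot * np.log(a + 1e-12), axis=-1).mean()
    dz = (a - y_onehot) / z.shape[0]                   # clean gradient: a - y (per mean loss)
    return loss, dz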

Output Layer Cheat Sheet

Task | Output Activation | Loss Function | ∂L/∂z
Regression | Linear (none) | MSE = ½(ŷ−y)² | ŷ − y
Regression (bounded) | Sigmoid × range | MSE | Chain rule
Binary classification | Sigmoid σ(z) | BCE | a − y
Multiclass (excl.) | Softmax | Cat. Cross-Entropy | a − y (vec)
Multilabel | Sigmoid (per class) | BCE (per class sum) | aₖ − yₖ each
Counting / Poisson | exp(z) or softplus | Poisson NLL | exp(z) − y
Never use sigmoid + MSE for classification: not convex, flat gradients far from boundary. Always use BCE.
05

Deep Feedforward Neural Networks (MLP)

5.0

Activation Functions

Activation Functions — Formulas + Derivatives + Properties

Activation | Formula | Derivative g'(z) | Range | Key Notes
Sigmoid | σ(z) = 1/(1+e⁻ᶻ) | σ(z)(1−σ(z)) | (0,1) | Vanishing gradient for |z|≫0; saturates; not zero-centered; slow
Tanh | tanh(z) = (eᶻ−e⁻ᶻ)/(eᶻ+e⁻ᶻ) | 1 − tanh²(z) | (−1,1) | Zero-centered (better than sigmoid); still saturates; tanh(z)=2σ(2z)−1
ReLU | max(0, z) | 0 if z<0; 1 if z>0 | [0,∞) | No saturation for z>0; sparse activation; dead neurons (z<0 forever)
Leaky ReLU | max(αz, z), α≈0.01 | α if z<0; 1 if z>0 | (−∞,∞) | Fixes dying ReLU; α is hyperparameter or learned (PReLU)
ELU | z if z≥0; α(eᶻ−1) if z<0 | 1 if z≥0; α·eᶻ if z<0 | (−α,∞) | Smooth at 0; negative outputs push mean activation toward zero; α≈1.0
GELU | z·Φ(z) ≈ 0.5z(1+tanh[√(2/π)(z+0.044715z³)]) | Φ(z)+z·φ(z) | (≈−0.17,∞) | Transformer default; stochastic regularization interpretation; smooth everywhere
Swish | z · σ(βz) | σ(βz)+βz·σ(βz)(1−σ(βz)) | (≈−0.28,∞) | β=1 common; self-gated; outperforms ReLU on deep nets empirically
Softplus | log(1+eᶻ) | σ(z) | (0,∞) | Smooth approximation of ReLU; derivative is sigmoid; rarely used in hidden layers
Dead ReLU Problem
If z = wᵀx+b < 0 for all training examples → ∂L/∂w = 0 → w never updates
Causes: large negative bias, large learning rate causing weight to "flip" negative permanently
Fix: Leaky ReLU / ELU / He initialization / smaller learning rate / batch norm
ReLU default for hidden layers; GELU for Transformers; sigmoid/tanh only for gates or output layers
5.1

Forward Propagation — General L-Layer Network

Forward Pass — Exact Equations

Notation
L: number of layers (1..L); layer 0 = input
W⁽ˡ⁾ ∈ ℝ^{nˡ × nˡ⁻¹}: weight matrix for layer l
b⁽ˡ⁾ ∈ ℝ^{nˡ}: bias vector for layer l
a⁽⁰⁾ = x: input features
Per-Layer Forward Pass (l = 1, 2, ..., L)
z⁽ˡ⁾ = W⁽ˡ⁾ a⁽ˡ⁻¹⁾ + b⁽ˡ⁾ // pre-activation
a⁽ˡ⁾ = g⁽ˡ⁾(z⁽ˡ⁾) // post-activation
// Hidden: g = ReLU; Output: g = softmax or identity
// Forward Pass (vectorized, batch)
A⁽⁰⁾ = X // X ∈ ℝ^{d×m}, m=batch size
for l = 1 to L:
Z⁽ˡ⁾ = W⁽ˡ⁾ A⁽ˡ⁻¹⁾ + b⁽ˡ⁾ // b broadcasts
A⁽ˡ⁾ = g⁽ˡ⁾(Z⁽ˡ⁾)
// Cache Z⁽ˡ⁾, A⁽ˡ⁾ for backprop
ŷ = A⁽ᴸ⁾
L = loss(ŷ, y)
Dimensions Check (critical for debugging)
W⁽ˡ⁾: [nˡ, nˡ⁻¹]; b⁽ˡ⁾: [nˡ, 1]
Z⁽ˡ⁾: [nˡ, m]; A⁽ˡ⁾: [nˡ, m]
Inner dim must match: nˡ⁻¹ = cols of W⁽ˡ⁾ = rows of A⁽ˡ⁻¹⁾
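A minimal NumPy sketch of the vectorized forward pass above, assuming ReLU hidden layers and a linear output; biases are (nˡ, 1) column vectors so they broadcast over the batch. Names are illustrative.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(X, weights, biases):
    """X: (d, m); weights[l]: (n_l, n_{l-1}); biases[l]: (n_l, 1).
    Returns output A^L and a cache of (Z, A) pairs needed for backprop."""
    A = X
    cache = [(None, A)]                               # layer 0: input only
    for l, (W, b) in enumerate(zip(weights, biases)):
        Z = W @ A + b                                 # pre-activation; b broadcasts over batch
        A = relu(Z) if l < len(weights) - 1 else Z    # hidden: ReLU; output: identity
        cache.append((Z, A))
    return A, cache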

Backpropagation — Exact Equations

Error Signal (delta) — Output Layer L
δ⁽ᴸ⁾ = ∂L/∂Z⁽ᴸ⁾ = ∂L/∂A⁽ᴸ⁾ ⊙ g'⁽ᴸ⁾(Z⁽ᴸ⁾)
// For softmax+CCE or sigmoid+BCE: δ⁽ᴸ⁾ = A⁽ᴸ⁾ − Y (clean form)
Backpropagation — Hidden Layer l (l = L−1, ..., 1)
δ⁽ˡ⁾ = (W⁽ˡ⁺¹⁾)ᵀ δ⁽ˡ⁺¹⁾ ⊙ g'⁽ˡ⁾(Z⁽ˡ⁾)
⊙: element-wise (Hadamard) product
(W⁽ˡ⁺¹⁾)ᵀ: "routes" errors backward through same weights
g'⁽ˡ⁾(Z⁽ˡ⁾): gates gradient flow by how much neuron was "active"
Parameter Gradients
∂L/∂W⁽ˡ⁾ = (1/m) δ⁽ˡ⁾ (A⁽ˡ⁻¹⁾)ᵀ
∂L/∂b⁽ˡ⁾ = (1/m) Σᵢ δ⁽ˡ⁾ᵢ = (1/m) sum over batch
Weight Update
W⁽ˡ⁾ ← W⁽ˡ⁾ − α · ∂L/∂W⁽ˡ⁾
b⁽ˡ⁾ ← b⁽ˡ⁾ − α · ∂L/∂b⁽ˡ⁾
// Full Backprop Algorithm
// Output layer delta
δ⁽ᴸ⁾ = A⁽ᴸ⁾ − Y (softmax+CCE or sigmoid+BCE)
for l = L downto 1:
dW⁽ˡ⁾ = (1/m) δ⁽ˡ⁾ · (A⁽ˡ⁻¹⁾)ᵀ
db⁽ˡ⁾ = (1/m) rowsum(δ⁽ˡ⁾)
if l > 1:
δ⁽ˡ⁻¹⁾ = (W⁽ˡ⁾)ᵀ · δ⁽ˡ⁾ ⊙ g'(Z⁽ˡ⁻¹⁾)
Update W⁽ˡ⁾, b⁽ˡ⁾ via optimizer
Cache Z⁽ˡ⁾ and A⁽ˡ⁾ during forward pass — needed to compute g'(Z⁽ˡ⁾) in backprop
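A matching NumPy sketch of the backprop loop, assuming the ReLU forward pass sketched earlier and an output layer where δ⁽ᴸ⁾ = A⁽ᴸ⁾ − Y (softmax+CCE or linear+MSE). Names are illustrative.
import numpy as np

def mlp_backward(Y, weights, cache):
    """cache: list of (Z, A) from mlp_forward. Returns lists dW, db matching weights."""
    m = Y.shape[1]
    grads_W, grads_b = [], []
    _, A_last = cache[-1]
    delta = A_last - Y                                    # output-layer error signal
    for l in range(len(weights), 0, -1):                  # l = L ... 1
        Z_prev, A_prev = cache[l - 1]
        grads_W.insert(0, (delta @ A_prev.T) / m)         # dW^l = (1/m) delta (A^{l-1})^T
        grads_b.insert(0, delta.sum(axis=1, keepdims=True) / m)
        if l > 1:
            delta = (weights[l - 1].T @ delta) * (Z_prev > 0)   # route back, gate by ReLU mask
    return grads_W, grads_b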

Vanishing / Exploding Gradients

Gradient at Layer l (simplified)
∂L/∂W⁽¹⁾ ∝ Π_{k=2}^{L} (W⁽ᵏ⁾ · g'(z⁽ᵏ⁾))
Product of L−1 matrices and activation derivatives
Sigmoid: g'(z) ≤ 0.25 → product → 0 exponentially (vanishing)
Large weights: ||W|| > 1 → product → ∞ exponentially (exploding)
Problem | Symptom | Fix
Vanishing gradient | Early layers learn ~nothing; loss plateau | ReLU, skip connections, BatchNorm, LSTM/GRU gates
Exploding gradient | Loss NaN; weights blow up | Gradient clipping, smaller lr, weight decay, BatchNorm
Gradient Clipping (for exploding)
if ||g|| > threshold: g ← g · (threshold / ||g||)
Clips gradient norm, not individual values. Standard in RNN training (threshold=1.0 or 5.0).
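A minimal NumPy sketch of clipping by global norm across all parameter gradients; the helper name is illustrative.
import numpy as np

def clip_by_global_norm(grads, threshold=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm <= threshold."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        scale = threshold / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads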

Weight Initialization

Xavier / Glorot (tanh, sigmoid)
W ~ U[−√(6/(nˡ⁻¹+nˡ)), +√(6/(nˡ⁻¹+nˡ))]
Var(W) = 2 / (nˡ⁻¹ + nˡ)
Keeps variance of activations constant across layers for linear-like activations
He / Kaiming (ReLU)
W ~ N(0, 2/nˡ⁻¹)
Var(W) = 2 / nˡ⁻¹
Accounts for ReLU zeroing ~half the activations; factor 2 compensates
LeCun (SELU)
Var(W) = 1 / nˡ⁻¹
Goal: maintain signal variance through forward pass AND gradient variance through backward pass
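A minimal NumPy sketch of the three initialization schemes above, with fan-in/fan-out taken from the layer shape; names are illustrative.
import numpy as np

def init_weights(n_in, n_out, scheme="he", seed=0):
    """Xavier/Glorot (uniform), He (normal), LeCun (normal) initialization."""
    rng = np.random.default_rng(seed)
    if scheme == "xavier":
        limit = np.sqrt(6.0 / (n_in + n_out))
        return rng.uniform(-limit, limit, size=(n_out, n_in))
    if scheme == "he":
        return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
    if scheme == "lecun":
        return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))
    raise ValueError(scheme)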

Depth vs Width

Property | More Depth | More Width
Expressivity | Exponential gain (compositionality) | Linear gain
Parameters | Fewer for same power | More parameters
Optimization | Harder (more local minima, vanishing grad) | Easier (flatter landscape)
Inductive bias | Hierarchical feature reuse | Richer per-level features
Generalization | Better with regularization | Can overfit quickly
Depth ↔ program length; width ↔ memory per step
06

Convolutional Neural Networks

Convolution Operation

2D Cross-Correlation (what CNNs actually compute)
(I ★ K)[i,j] = Σₘ Σₙ I[i+m, j+n] · K[m,n]
I: input; K: kernel/filter; ★: cross-correlation (not full convolution)
Output Spatial Size
H_out = ⌊(H_in + 2P − F_H) / S⌋ + 1
W_out = ⌊(W_in + 2P − F_W) / S⌋ + 1
H_in: input height; P: padding; F_H: filter height; S: stride
Parameter Count (conv layer)
Params = C_out × (C_in × F_H × F_W + 1)
+1 for bias per output channel. Shared across spatial locations → massive param saving vs FC.
Padding Modes
'valid': no padding → output shrinks
'same': P = (F−1)/2 → output = input size (stride=1)
Receptive Field (layer l, filter size f, per-layer strides sₖ)
RF_l = RF_{l-1} + (f−1) × Π_{k=1}^{l-1} sₖ
Grows with depth; dilated convolutions increase RF without extra params
Weight sharing = same filter detects feature everywhere → translation equivariance

Convolution — Forward & Backward

Forward Pass (single filter, no bias)
Z[i,j] = Σₘ Σₙ A_prev[i·S+m, j·S+n] · W[m,n]
A = g(Z)
Backward Pass — Gradient w.r.t. Filter
∂L/∂W[m,n] = Σᵢ Σⱼ δ[i,j] · A_prev[i·S+m, j·S+n]
Correlation of input patch with upstream gradient = convolve input with δ
Backward Pass — Gradient w.r.t. Input (for prev layer)
∂L/∂A_prev[p,q] = Σᵢ Σⱼ δ[i,j] · W[p−i·S, q−j·S]
Full convolution of δ with flipped kernel W (180° rotation = true convolution)
With Channels (general: C_in input channels, C_out filters)
Z_{c_out}[i,j] = Σ_{c_in} Σₘ Σₙ A_prev_{c_in}[i+m,j+n] · W_{c_out,c_in}[m,n] + b_{c_out}
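A naive NumPy sketch of the multi-channel forward convolution above, written as explicit loops for clarity (single example, no padding); names are illustrative.
import numpy as np

def conv2d_forward(A_prev, W, b, stride=1):
    """A_prev: (C_in, H, W_in); W: (C_out, C_in, f, f); b: (C_out,). Returns Z."""
    C_out, C_in, f, _ = W.shape
    _, H, W_in = A_prev.shape
    H_out = (H - f) // stride + 1              # output size formula with P = 0
    W_out = (W_in - f) // stride + 1
    Z = np.zeros((C_out, H_out, W_out))
    for co in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                patch = A_prev[:, i*stride:i*stride+f, j*stride:j*stride+f]
                Z[co, i, j] = np.sum(patch * W[co]) + b[co]   # shared filter over positions
    return Z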

Pooling & Channels

Max Pooling (no learned params)
P[i,j] = max{A[i·S+m, j·S+n] : m,n ∈ [0,f)}
Average Pooling
P[i,j] = (1/f²) Σₘ Σₙ A[i·S+m, j·S+n]
Global Average Pooling (GAP)
P_c = (1/HW) Σᵢ Σⱼ A_c[i,j] // one value per channel
Replaces FC layers in modern CNNs (GoogLeNet, ResNet). No params; spatial invariance; regularization effect.
Max Pool Backward
Gradient flows only to max-activation position; zero elsewhere
"Switch" remembers which position was max during forward pass
Layer Pattern | Effect
Conv + BN + ReLU | Standard modern block
CONV(3×3) × N → Pool | VGG-style
Concat([conv1×1, conv3×3, conv5×5, maxpool]) | Inception-style
x + F(x) (residual) | ResNet-style
6.1

CNN Architectures

AlexNet → VGGNet → InceptionNet → ResNet — Key Innovations

Architecture | Year | Key Innovation | Params | Architecture Summary
AlexNet | 2012 | ReLU over tanh (10× faster); Dropout; Data augment; Multi-GPU; Local Response Norm | 60M | 5 CONV [11,5,3,3,3] + 3 FC. Large filters, MaxPool after 1,2,5
VGGNet-16 | 2014 | All 3×3 convs — two 3×3 = one 5×5 RF with fewer params and extra nonlinearity; very deep (16-19 layers) | 138M | 5 blocks of [CONV3×3 ×2-3 → MaxPool2×2] + 3 FC. Uniform design.
GoogLeNet (InceptionV1) | 2014 | Inception module: parallel 1×1, 3×3, 5×5 convs + maxpool; 1×1 bottleneck for dim reduction; GAP instead of FC; Auxiliary classifiers for gradient flow | 6.8M | 9 Inception modules + GAP + Softmax
InceptionV3 | 2016 | Factorize n×n → 1×n + n×1 (asymmetric); BatchNorm everywhere; Label smoothing | 23.8M | Factorized Inception + BN + GAP
ResNet-50 | 2015 | Residual (skip) connections: solve vanishing gradient, enable training 100+ layers; BN after each conv; GAP + single FC | 25M | 1 CONV + 4 stages of [bottleneck blocks: 1×1→3×3→1×1] + GAP + FC
ResNet Bottleneck | 2015 | 1×1 → 3×3 → 1×1: reduce channels for 3×3, then restore → 3× fewer params than plain 3×3 block | Residual: F(x)+x; dimension match via 1×1 projection if needed
Residual Block (ResNet) — Forward
F(x) = W₂ · ReLU(BN(W₁x)) // simplified
y = ReLU(F(x) + x) // skip connection
// if dim mismatch: x_proj = W_s · x (1×1 conv)
y = ReLU(F(x) + W_s·x)
Identity shortcut: gradient = 1 → no vanishing. F(x) learns residual (correction), not full mapping.
Why skip connections work — gradient perspective
∂L/∂x = ∂L/∂y · (∂F/∂x + 1)
+1 adds a direct identity term, so the gradient cannot vanish even when ∂F/∂x is small
Also: creates ensemble of paths from input to output
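A conceptual NumPy sketch of a residual block, with plain matrix multiplies standing in for conv+BN and an optional W_s for the 1×1 projection; names are illustrative.
import numpy as np

def residual_block(x, W1, W2, W_s=None):
    """F(x) = W2 @ relu(W1 @ x); output = relu(F(x) + shortcut)."""
    F = W2 @ np.maximum(0.0, W1 @ x)           # residual branch learns a correction
    shortcut = x if W_s is None else W_s @ x   # identity, or projection if dims differ
    return np.maximum(0.0, F + shortcut)       # skip keeps a direct gradient path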
Inception Module — Forward
h₁ = CONV1×1(x)
h₂ = CONV3×3(CONV1×1(x)) // 1×1 bottleneck first
h₃ = CONV5×5(CONV1×1(x))
h₄ = CONV1×1(MaxPool(x))
out = Concat([h₁, h₂, h₃, h₄], axis=channel)
1×1 Convolution — Why It Matters
Acts as learned channel-wise projection / mixing
Reduces channels: C_in=256 → 64 → 256 saves 9× FLOPs for 3×3 conv
Adds non-linearity across channels with minimal spatial cost
VGG: simple and deep. Inception: multi-scale features. ResNet: residual learning = train arbitrarily deep. Modern default: ResNet family.

Transfer Learning

Feature Hierarchy
Early layers: edges, colors, textures (universal)
Mid layers: parts, patterns (semi-universal)
Late layers: task-specific (dataset-specific)
Scenario | Strategy | What to Do
Small data + similar domain | Feature extraction | Freeze all; replace + train only top FC layers
Large data + similar domain | Fine-tune deeper | Unfreeze later conv blocks + retrain with small lr
Small data + different domain | Feature extraction | Freeze all; linear classifier on penultimate features
Large data + different domain | Full fine-tune | Initialize with pretrained; train all layers
Pretrained = compressed knowledge of millions of images; adapt last few layers for new task

Modern CNN Variants

Architecture | Key Idea | Use Case
ResNeXt | Grouped convolutions (32 groups) — wider + grouped | Better ResNet
DenseNet | Each layer connected to all subsequent layers: x_l = H([x₀,x₁,...,x_{l-1}]) | Dense feature reuse; small datasets
MobileNetV2 | Depthwise sep + inverted residuals (expand→DWConv→project) | Mobile / edge deployment
EfficientNet | Compound scaling: depth × width × resolution simultaneously via NAS | State-of-art efficiency tradeoff
ViT | Split image into patches → linear embed → Transformer encoder; no conv | Large-data image classification
DenseNet Forward — Layer l
x_l = H_l([x₀, x₁, ..., x_{l-1}]) // concat all prev activations
Growth rate k: each layer adds k channels
Alleviates vanishing gradient (direct path to all layers); feature reuse; fewer params than ResNet for same accuracy
07

Recurrent Neural Networks

Vanilla RNN — Equations

Forward Pass (timestep t)
hₜ = tanh(Wₕ hₜ₋₁ + Wₓ xₜ + bₕ)
= tanh(W[hₜ₋₁; xₜ] + bₕ) // compact: concatenate
yₜ = softmax(Wᵧ hₜ + bᵧ) // if output at each step
Wₕ: hidden-to-hidden; Wₓ: input-to-hidden; weights shared across time
Loss (sum over all timesteps)
L = Σₜ₌₁ᵀ Lₜ(yₜ, ŷₜ)
Dimensions
xₜ ∈ ℝᵈ; hₜ ∈ ℝʰ; yₜ ∈ ℝᵏ
Wₕ ∈ ℝ^{h×h}; Wₓ ∈ ℝ^{h×d}; Wᵧ ∈ ℝ^{k×h}
Same W reused at each timestep → shared representation of dynamics; sequence as directed graph through time
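A minimal NumPy sketch of one vanilla-RNN timestep matching the equations above; names are illustrative.
import numpy as np

def rnn_step(x_t, h_prev, W_h, W_x, b_h, W_y=None, b_y=None):
    """One step: h_t = tanh(W_h h_{t-1} + W_x x_t + b_h); optional per-step logits."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b_h)
    y_t = None if W_y is None else W_y @ h_t + b_y   # logits (apply softmax outside)
    return h_t, y_t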

Backpropagation Through Time (BPTT)

Gradient w.r.t. Wₕ
∂L/∂Wₕ = Σₜ ∂Lₜ/∂Wₕ
∂Lₜ/∂Wₕ = Σₖ₌₁ᵗ (∂Lₜ/∂hₜ) · (∂hₜ/∂hₖ) · (∂hₖ/∂Wₕ)
Gradient of hₜ w.r.t. hₖ (chain over time)
∂hₜ/∂hₖ = Πⱼ₌ₖ₊₁ᵗ ∂hⱼ/∂hⱼ₋₁ = Πⱼ₌ₖ₊₁ᵗ Wₕᵀ · diag(tanh'(zⱼ))
Product of (t−k) Jacobians. If ||Wₕᵀ·diag(g')|| < 1: vanishes; if > 1: explodes. tanh' ≤ 1 (sigmoid' ≤ 0.25) makes the compounding shrink the gradient.
// Truncated BPTT
Unroll T steps max (T=50–200 typical)
for each chunk of T timesteps:
Forward pass: compute h₁,...,hT, loss
Backward pass: gradient flows back T steps
Carry forward hT as h₀ for next chunk
Vanishing gradient: RNN can't learn long-range dependencies (>~10 steps with vanilla). Fix: LSTM/GRU.
7.1

LSTM — Long Short-Term Memory

LSTM — All Gate Equations (Forward Pass)

Input: concatenate [hₜ₋₁, xₜ] → input vector
iₜ = σ(Wᵢ[hₜ₋₁, xₜ] + bᵢ) // Input gate: what new info to write
fₜ = σ(Wf[hₜ₋₁, xₜ] + bf) // Forget gate: what to erase from cell
oₜ = σ(Wo[hₜ₋₁, xₜ] + bo) // Output gate: what to expose from cell
g̃ₜ = tanh(Wg[hₜ₋₁, xₜ] + bg) // Candidate cell: new info to add
Cell State Update
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ g̃ₜ
fₜ⊙cₜ₋₁: selectively forget past
iₜ⊙g̃ₜ: selectively write new info
Hidden State Update
hₜ = oₜ ⊙ tanh(cₜ)
Read from cell through output gate → filtered memory exposure
Dimensions
xₜ ∈ ℝᵈ, hₜ, cₜ ∈ ℝʰ
Each W ∈ ℝ^{h×(h+d)}, each b ∈ ℝʰ
Total params ≈ 4 × (h² + hd + h)
4× more params than vanilla RNN of same hidden size
Why LSTM Solves Vanishing Gradient
∂cₜ/∂cₜ₋₁ = fₜ // just element-wise mult!
Gradient of cell: ∂L/∂c₀ = ∂L/∂cT · Πₜ fₜ
When fₜ ≈ 1 (forget gate open): gradient flows unchanged
No matrix product → no exponential decay
Highway for gradients through cell state
Peephole Connections (Gers & Schmidhuber)
Gates also see cell state:
iₜ = σ(Wᵢ[hₜ₋₁, xₜ, cₜ₋₁] + bᵢ)
fₜ = σ(Wf[hₜ₋₁, xₜ, cₜ₋₁] + bf)
oₜ = σ(Wo[hₜ₋₁, xₜ, cₜ] + bo)
Cell state = conveyor belt; forget gate erases; input gate writes; output gate reads — long-range memory via near-linear gradient paths
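A minimal NumPy sketch of one LSTM step, with the four gate weight matrices stacked into a single W of shape (4h, h+d) (a common but not universal packing convention); names are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W: (4h, h+d) stacked [i; f; o; g] weights; b: (4h,)."""
    h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0:h])            # input gate: what new info to write
    f = sigmoid(z[h:2*h])          # forget gate: what to erase
    o = sigmoid(z[2*h:3*h])        # output gate: what to expose
    g = np.tanh(z[3*h:4*h])        # candidate cell content
    c_t = f * c_prev + i * g       # cell update: forget old, write new
    h_t = o * np.tanh(c_t)         # expose filtered memory
    return h_t, c_t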
7.2

GRU — Gated Recurrent Unit

GRU — All Equations

Gate Equations
zₜ = σ(Wz[hₜ₋₁, xₜ] + bz) // Update gate: blend old vs new
rₜ = σ(Wr[hₜ₋₁, xₜ] + br) // Reset gate: how much to forget
Candidate Hidden State
h̃ₜ = tanh(W[rₜ ⊙ hₜ₋₁, xₜ] + b)
rₜ ⊙ hₜ₋₁: reset gate masks past hidden state. If rₜ→0, h̃ₜ ignores past = restart.
Final Hidden State
hₜ = (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ
zₜ: interpolation between past and candidate. If zₜ→1: update fully. If zₜ→0: keep past.
Dimensions
Total params ≈ 3 × (h² + hd + h) // 25% fewer than LSTM
GRU = simplified LSTM; merges cell and hidden state; 2 gates vs 3; often matches LSTM with fewer params

LSTM vs GRU vs Vanilla RNN

Property | Vanilla RNN | LSTM | GRU
Gates | 0 | 3 (i, f, o) + cell | 2 (z, r)
Hidden states | hₜ only | hₜ + cₜ (cell) | hₜ only
Params (d=100, h=100) | ~20K | ~80K | ~60K
Long-range memory | Poor (<10 steps) | Excellent (>1000 steps) | Good
Train speed | Fastest | Slowest | Middle
When to use | Very short sequences | Complex long sequences; when cₜ semantics matter | Default; less data; faster experiments
Empirically: LSTM and GRU similar on most tasks. GRU preferred for speed/fewer params.

BiLSTM & Stacked RNNs

Bidirectional LSTM
h_fwd_t = LSTM_fwd(xₜ, h_fwd_{t-1}) // forward in time
h_bwd_t = LSTM_bwd(xₜ, h_bwd_{t+1}) // backward in time
hₜ = [h_fwd_t ; h_bwd_t] // concatenate
2× params; can't use for autoregressive generation (needs full sequence)
Stacked (Deep) RNN
hₜ⁽ˡ⁾ = LSTM(hₜ⁽ˡ⁻¹⁾, hₜ₋₁⁽ˡ⁾) for l=1..L
Input to layer l = hidden state of layer l−1. Adds abstraction. 2-4 layers common; dropout between layers.
BiLSTM for encoding (classification, NER); unidirectional for generation; stacking for deeper abstraction
08

Attention Mechanism

Seq2Seq + Bahdanau Attention

Problem: encoder-decoder bottleneck
All encoder info compressed into single vector c = h_T
Information loss for long sequences → attention solves this
Bahdanau (Additive) Attention — Equations
eₜⱼ = vₐᵀ tanh(Wₐ sₜ₋₁ + Uₐ hⱼ) // alignment score
αₜⱼ = softmax(eₜⱼ) = exp(eₜⱼ)/Σⱼ'exp(eₜⱼ') // attention weights
cₜ = Σⱼ αₜⱼ hⱼ // context vector
sₜ = tanh(W[sₜ₋₁; yₜ₋₁; cₜ] + b) // decoder update
sₜ: decoder hidden; hⱼ: encoder hidden at pos j; vₐ,Wₐ,Uₐ: learned; O(T²d) per sequence
Luong (Multiplicative) Attention — Variants
dot: eₜⱼ = sₜᵀ hⱼ
general: eₜⱼ = sₜᵀ Wₐ hⱼ
concat: eₜⱼ = vᵀ tanh(W[sₜ; hⱼ])
Decoder "queries" encoder outputs; weights = how relevant each encoder position is for current output step

Scaled Dot-Product Attention (Transformer)

Core Equation
Attention(Q,K,V) = softmax(QKᵀ / √dₖ) · V
Q ∈ ℝ^{n×dₖ}: queries; K ∈ ℝ^{m×dₖ}: keys; V ∈ ℝ^{m×dᵥ}: values
n: #query positions; m: #key-value positions; output ∈ ℝ^{n×dᵥ}
Why √dₖ scaling?
Each score qᵀk = Σᵢ qᵢkᵢ has variance dₖ (for unit-variance q, k)
→ magnitude grows with dₖ → softmax saturates → tiny gradients
√dₖ normalization restores unit variance
Self-Attention
Q = X Wᵠ; K = X Wᴷ; V = X Wᵛ where X ∈ ℝ^{n×d}
Attention(Q,K,V) — Q, K, V all from same sequence X
Each position attends to all others; captures global dependencies in O(n²d)
Cross-Attention (encoder-decoder)
Q = decoder states; K = V = encoder outputs
Decoder queries encoder to compute context
Masked Self-Attention (causal / autoregressive)
Set eᵢⱼ = −∞ for j > i (future positions)
After softmax: attention weight = 0 for future tokens
Position i can only attend to positions 1..i
Attention = soft retrieval: query finds relevant keys, retrieves weighted values
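A minimal NumPy sketch of scaled dot-product attention with an optional causal mask (the causal case assumes n = m); names are illustrative.
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: (n, d_k); K: (m, d_k); V: (m, d_v). Returns (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n, m) alignment scores
    if causal:                                          # mask future positions j > i
        n, m = scores.shape
        scores = np.where(np.tril(np.ones((n, m), dtype=bool)), scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                                  # weighted sum of values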

Multi-Head Attention

Multi-Head Attention (h heads)
head_i = Attention(Q Wᵠᵢ, K Wᴷᵢ, V Wᵛᵢ)
MHA(Q,K,V) = Concat(head₁,...,head_h) Wᴼ
Wᵠᵢ ∈ ℝ^{d×dₖ}, Wᴷᵢ ∈ ℝ^{d×dₖ}, Wᵛᵢ ∈ ℝ^{d×dᵥ}, Wᴼ ∈ ℝ^{hdᵥ×d}
Typically: dₖ = dᵥ = d/h (so total compute ≈ single-head)
Complexity
Self-attention: O(n²d) time, O(n²+nd) space
Bottleneck: n² attention matrix → O(n²) memory for long sequences
FlashAttention — Key Idea
Fuse QKᵀ softmax V into single GPU kernel; tile computation to fit SRAM
O(n²) FLOPS but only O(n) memory → 2–4× speedup; no accuracy loss
Multiple heads = multiple "attention programs" running in parallel; each specializes
09

Transformer

Transformer Architecture — Complete Equations

Positional Encoding (sinusoidal)
PE(pos, 2i) = sin(pos / 10000^{2i/d})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d})
Input embedding: x_pos = Embed(token) + PE(pos)
Injects sequential position info (self-attention alone is permutation-equivariant, so order must be supplied explicitly)
Learned PE also common (BERT uses learned)
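A minimal NumPy sketch of the sinusoidal encoding above (assumes even d); the resulting (max_len, d) matrix is added to the token embeddings.
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE(pos, 2i) = sin(pos / 10000^{2i/d}); PE(pos, 2i+1) = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(0, d, 2)[None, :]                    # even dimension indices
    angles = pos / np.power(10000.0, i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe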
Encoder Block (repeated N times)
z₁ = LayerNorm(x + MHA(x,x,x)) // self-attn + residual
z₂ = LayerNorm(z₁ + FFN(z₁)) // FFN + residual
Decoder Block (repeated N times)
z₁ = LayerNorm(x + Masked-MHA(x,x,x)) // causal self-attn
z₂ = LayerNorm(z₁ + Cross-MHA(z₁,enc,enc)) // cross-attn
z₃ = LayerNorm(z₂ + FFN(z₂))
Feed-Forward Network (FFN) per position
FFN(x) = max(0, x W₁ + b₁) W₂ + b₂
// ReLU or GELU; W₁ expands d→4d, W₂ projects 4d→d
Params: 2 × d × 4d = 8d² per layer
Layer Norm (Pre-LN vs Post-LN)
LN(x) = γ · (x − μ) / √(σ² + ε) + β
// Post-LN (original): LN(x + sublayer(x))
// Pre-LN (modern): x + sublayer(LN(x)) — more stable training
γ, β: learned scale and shift per feature
Original Transformer (Vaswani 2017) Config
N=6 layers; d=512; h=8 heads; dₖ=dᵥ=64; FFN d_ff=2048
Encoder: 6 × [MHA(8 heads) + FFN]
Decoder: 6 × [Masked-MHA + Cross-MHA + FFN]
Output: linear projection → softmax over vocab
Full Forward Pass (seq2seq)
src → embed + PE → encoder → enc_out
tgt → embed + PE → decoder(enc_out) → dec_out
logits = dec_out · Wᵥᵒᶜᵃᵇ
loss = CrossEntropy(logits, tgt_shifted)
Teacher forcing during training: feed ground-truth target tokens shifted right
Transformer Attention Complexity
Self-attention: O(n²·d) — quadratic in sequence length
FFN: O(n·d²) — linear in sequence length
For n<d: FFN dominates. For n>d: attention dominates.
Backprop Through Transformer Block
∂L/∂z₁ = ∂L/∂z₂ · ∂FFN/∂z₁ + ∂L/∂z₂ // residual: gradient splits into sublayer path + identity path
Residual ensures identity path: ∂L/∂x has additive term 1 from skip

BERT vs GPT vs T5 — Transformer Variants

Model | Architecture | Training Objective | Attention Mask | Best For
BERT | Encoder only | MLM (15% mask) + NSP | Bidirectional (full) | Classification, NER, QA (understanding)
GPT family | Decoder only | Causal LM (next token) | Causal (left-only) | Generation, chat, code completion
T5 | Encoder-Decoder | Span corruption (replace spans) | Enc: full; Dec: causal | Seq2seq tasks: translation, summarization
XLNet | Encoder (AR) | Permutation LM | Permuted | Bidirectional + autoregressive
RoBERTa | Encoder only | MLM (no NSP, dynamic mask) | Bidirectional | Better BERT (more data, better training)
BERT MLM Loss
L = −Σ_{masked} log P(xᵢ | x\{xᵢ}; θ)
GPT Causal LM Loss
L = −Σᵢ log P(xᵢ | x₁,...,xᵢ₋₁; θ)

Positional Encoding — Types

Type | Method | Can Extrapolate? | Used In
Sinusoidal (absolute) | Fixed sin/cos of position | Moderate | Original Transformer
Learned (absolute) | Trainable embedding per position | No (up to max_len) | BERT, GPT-2
RoPE (relative rotary) | Rotate Q,K by position angle | Yes (good) | GPT-NeoX, LLaMA
ALiBi (relative bias) | Linear bias to attention logits: −|i−j|·m | Yes (best) | BLOOM, MPT
T5 Relative Bias | Learned bias per distance bucket | Moderate | T5, mT5
RoPE — Rotary Position Embedding
q'ₘ = Rₘ qₘ, k'ₙ = Rₙ kₙ
qₘᵀkₙ = qₘᵀ Rₘᵀ Rₙ kₙ = f(m−n) // depends only on relative offset
Rₘ: rotation matrix by angle m·θ (block-diagonal 2D rotations)
10

Optimization for Deep Models

SGD + Momentum

Vanilla SGD
g = ∇_θ L(θ; mini-batch)
θ ← θ − α · g
SGD + Momentum (Polyak)
v ← β v − α g // v: velocity (β≈0.9)
θ ← θ + v
Accumulates past gradients; dampens oscillations; escapes shallow local minima; β=0.9 typical
Nesterov Accelerated Gradient (NAG)
v ← β v − α ∇L(θ + βv) // gradient at "lookahead" position
θ ← θ + v
Look ahead where momentum will take you, then compute gradient there → smarter correction
Momentum = heavy ball rolling; accumulates speed in gradient direction; dampens oscillations in high-curvature directions

Adaptive Optimizers

AdaGrad
Gₜ ← Gₜ₋₁ + gₜ² // accumulated squared gradients
θₜ ← θₜ₋₁ − (α/√(Gₜ+ε)) · gₜ
Rare features get large updates; learning rate monotonically decreases → eventually stops learning
RMSprop
vₜ ← ρ vₜ₋₁ + (1−ρ) gₜ² // exponential moving average; ρ≈0.9
θₜ ← θₜ₋₁ − (α/√(vₜ+ε)) · gₜ
Fixes AdaGrad's monotonic decay; ε: numerical stability (1e-8)
Adam — Adaptive Moment Estimation
mₜ ← β₁ mₜ₋₁ + (1−β₁) gₜ // 1st moment (momentum; β₁≈0.9)
vₜ ← β₂ vₜ₋₁ + (1−β₂) gₜ² // 2nd moment (RMSprop; β₂≈0.999)
m̂ₜ ← mₜ / (1−β₁ᵗ) // bias correction
v̂ₜ ← vₜ / (1−β₂ᵗ) // bias correction
θₜ ← θₜ₋₁ − α · m̂ₜ / (√v̂ₜ + ε)
Defaults: α=1e-3, β₁=0.9, β₂=0.999, ε=1e-8
Bias correction critical at early steps (m₀=v₀=0)
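A minimal NumPy sketch of one Adam step with bias correction; the moment buffers m, v and the step count t are carried by the caller, and names are illustrative.
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m, v: running 1st/2nd moments; t: 1-based step count."""
    m = b1 * m + (1 - b1) * g               # 1st moment (momentum)
    v = b2 * v + (1 - b2) * g ** 2          # 2nd moment (RMS)
    m_hat = m / (1 - b1 ** t)               # bias correction (m0 = v0 = 0)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v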
AdamW — Adam + Decoupled Weight Decay
θₜ ← θₜ₋₁ − α · [m̂ₜ/(√v̂ₜ+ε) + λθₜ₋₁]
Standard Adam: L2 reg ≠ weight decay (adaptive lr corrupts it). AdamW decouples → better generalization. Default for Transformers.
Adam = momentum + adaptive per-parameter learning rates. AdamW = Adam but correct regularization.

Learning Rate Schedules

Step Decay
α(t) = α₀ · γ^{⌊t/step_size⌋} // γ≈0.1 every k epochs
Cosine Annealing
α(t) = α_min + ½(α_max − α_min)(1 + cos(πt/T))
Smooth decay to α_min; can restart (SGDR: reset to α_max periodically)
Warmup + Decay (Transformer default)
α(t) = d^{−0.5} · min(t^{−0.5}, t · warmup_steps^{−1.5})
Linear warmup for warmup_steps, then √t decay. Warmup: stabilize early training with small gradients before adaptive estimates are reliable.
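A minimal sketch of the warmup-then-inverse-sqrt schedule above, in plain Python; the function name is illustrative.
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Vaswani et al. schedule: linear warmup, then 1/sqrt(step) decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)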
Cyclical Learning Rate (CLR)
α(t) = α_min + (α_max − α_min) · max(0, 1 − |t/stepsize − 2q + 1|), q = ⌊1 + t/(2·stepsize)⌋ (cycle index)
Triangular cycles between min/max lr; may escape saddle points
Schedule | Best For
Step decay | CV tasks; well-tuned baselines
Cosine annealing | General DL; CNNs
Warmup + cosine | Transformers, LLMs
ReduceLROnPlateau | When val loss stagnates; simple baseline

Optimizer Comparison

Optimizer | Memory | Convergence | Best For
SGD | O(p) | Slow, noisy | CV (with momentum, tuned lr); best generalization
SGD+Momentum | O(2p) | Faster, smooth | CV, ResNets
AdaGrad | O(2p) | Good (sparse) | Sparse features, NLP (old)
RMSprop | O(2p) | Good | RNNs, online
Adam | O(3p) | Fast | Default for NLP/Transformers
AdamW | O(3p) | Fast + generalize | Transformers, LLMs (default)
Adam often converges faster but generalizes worse than SGD+momentum on image classification. SGD still preferred for ResNet/ImageNet. AdamW = standard for LLMs.

Loss Landscape & Optimization Issues

Saddle Points vs Local Minima
Saddle: ∂²L/∂θᵢ² >0 for some i, <0 for others → gradient=0 but not minimum
In high-dim: local minima rare; saddle points common
SGD noise helps escape saddle points

Batch Size Effects

Property | Small Batch (8-64) | Large Batch (512-8K)
Gradient noise | High → implicit regularization | Low → sharp minima risk
GPU utilization | Low | High
Steps per epoch | Many (more updates) | Few
Generalization | Often better | Often worse (sharp minima)
LR scaling | Baseline lr | Linear scaling rule: lr = base_lr × B/B₀
Linear Scaling Rule (Goyal et al.)
If base lr = α₀ with batch B₀, use lr = α₀ · B/B₀ for batch B
With warmup for first ~5 epochs to avoid instability
11

Regularization for Deep Models

L1 / L2 Weight Regularization

L2 (Weight Decay)
J(θ) = L(θ) + (λ/2) Σᵢ θᵢ²
∂J/∂θ = ∂L/∂θ + λθ
Update: θ ← θ(1 − αλ) − α ∂L/∂θ // shrinks toward 0
L1
J(θ) = L(θ) + λ Σᵢ |θᵢ|
∂J/∂θ = ∂L/∂θ + λ sign(θ)
Subgradient at 0; promotes sparsity; less common in DL than L2
Adam + L2 ≠ weight decay (adaptive lr rescales penalty). Use AdamW for correct weight decay.

Dropout

Forward Pass (training)
mask ~ Bernoulli(p) // p = keep prob (e.g., 0.8)
ã = mask ⊙ a / p // inverted dropout: scale by 1/p
Inverted dropout: no change needed at test time (just use a directly)
Forward Pass (inference)
ã = a // no dropout; all neurons active
// Inverted dropout: already normalized during training
Backward Pass
∂L/∂a = mask ⊙ (∂L/∂ã) / p // same mask as forward
Each training step: different subnet. Inference: implicit ensemble of all subnets.
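A minimal NumPy sketch of inverted dropout for the forward pass; the returned mask would be reused in the backward pass, and names are illustrative.
import numpy as np

def dropout_forward(a, keep_prob=0.8, training=True, rng=None):
    """Inverted dropout: scale by 1/keep_prob at train time so inference is a no-op."""
    if not training:
        return a, None                       # inference: all neurons active, no scaling
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(a.shape) < keep_prob   # Bernoulli(keep_prob) mask
    return a * mask / keep_prob, mask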

Batch Normalization

Forward Pass (training, per mini-batch)
μ_B = (1/m) Σᵢ zᵢ // batch mean
σ²_B = (1/m) Σᵢ (zᵢ − μ_B)² // batch variance
ẑᵢ = (zᵢ − μ_B) / √(σ²_B + ε) // normalize
yᵢ = γ ẑᵢ + β // scale + shift (learned)
Forward Pass (inference)
Use running averages (EMA over training batches):
μ_running ← α μ_running + (1−α) μ_B
σ²_running ← α σ²_running + (1−α) σ²_B
ŷ = γ(z − μ_running)/√(σ²_running + ε) + β
Backward Pass
∂L/∂ẑᵢ = ∂L/∂yᵢ · γ
∂L/∂σ²_B = Σᵢ ∂L/∂ẑᵢ · (zᵢ−μ_B) · −½(σ²_B+ε)^{−3/2}
∂L/∂μ_B = Σᵢ ∂L/∂ẑᵢ · −1/√(σ²_B+ε) + ∂L/∂σ²_B · Σᵢ −2(zᵢ−μ_B)/m
∂L/∂zᵢ = ∂L/∂ẑᵢ/√(σ²_B+ε) + ∂L/∂σ²_B · 2(zᵢ−μ_B)/m + ∂L/∂μ_B/m
∂L/∂γ = Σᵢ ∂L/∂yᵢ · ẑᵢ; ∂L/∂β = Σᵢ ∂L/∂yᵢ
Normalizes inputs to each layer → smoother loss landscape; higher lr; acts as regularizer (batch statistics = noise); reduces internal covariate shift
Batch size <8: batch stats noisy → use GroupNorm or LayerNorm instead. Incompatible with online learning.
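A minimal NumPy sketch of the BatchNorm forward pass for a (batch, features) input, covering training (batch stats + EMA update) and inference (running stats); names are illustrative.
import numpy as np

def batchnorm_forward(z, gamma, beta, running_mu, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    """Normalize over the batch axis (axis 0); returns output and updated running stats."""
    if training:
        mu = z.mean(axis=0)
        var = z.var(axis=0)
        running_mu = momentum * running_mu + (1 - momentum) * mu     # EMA for inference
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mu, running_var                            # fixed stats at test time
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta, running_mu, running_var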

Normalization Methods — Comparison

Method | Normalizes Over | Learned Params | Best For | Issue
Batch Norm | Batch (N) + spatial (HW) per channel | γ, β per channel | CNNs, large batch CV | Fails small batch; train/test discrepancy
Layer Norm | Features (C, H, W) per sample | γ, β per feature | NLP, Transformers, RNNs | Weaker channel normalization for CNN
Instance Norm | Spatial (HW) per channel per sample | γ, β per channel | Style transfer, GAN | Loses global statistics
Group Norm | Group of channels + spatial per sample | γ, β per group | CV with small batch, detection | Hyperparameter: num_groups
Layer Norm Equations
μ = (1/d) Σⱼ aⱼ; σ² = (1/d) Σⱼ(aⱼ − μ)²
LN(a) = γ ⊙ (a−μ)/√(σ²+ε) + β
Same equation as BN but over feature dim (not batch dim) → works for batch=1
Rule of thumb: BN for CNNs with large batch; LN for Transformers/RNNs; GN for detection/segmentation with small batch

Data Augmentation

Domain | Technique | Effect
Image | Random crop, flip, color jitter, rotate, cutout | Spatial + photometric invariance
Image | MixUp: x̃ = λxᵢ + (1−λ)xⱼ, ỹ = λyᵢ + (1−λ)yⱼ | Smoother decision boundary; calibration
Image | CutMix: paste random patch from xⱼ into xᵢ | Better than MixUp for detection
Image | AutoAugment / RandAugment: learned/random policy | State-of-art augmentation
Text | Synonym replace, back-translation, random insert/delete | Lexical diversity
Audio | SpecAugment: mask freq/time bands in spectrogram | Speech robustness; default for ASR
MixUp Loss
L = λ · CE(f(x̃), yᵢ) + (1−λ) · CE(f(x̃), yⱼ) where λ ~ Beta(α,α), α=0.2 typical

Early Stopping & Other Regularizers

Label Smoothing — Effect on Loss
L_LS = (1−ε) · H(y,p) + ε · H(u,p)
where u = uniform = 1/K
= H(y,p) − ε·H(y,p) + ε/K·Σlog(1/pₖ)
Penalizes overconfident predictions; reduces gap between max logit and others; improves calibration
12

Additional Key Topics

12.1

Batch Norm vs Layer Norm — Detailed Gradient Flow

Normalization — Gradient Effects

BN Position Conventions
Original: Conv → BN → ReLU (Post-activation BN)
Modern best practice: Pre-activation (He et al. 2016): BN → ReLU → Conv
Pre-activation: gradient passes through identity before activation → cleaner backprop

Generative Models — Brief Overview

Model | Objective | Inference | Pros / Cons
VAE | ELBO = E[log p(x|z)] − KL(q(z|x)||p(z)) | z~q(z|x) → decode | Fast; blurry samples; smooth latent space
GAN | min_G max_D E[logD(x)] + E[log(1−D(G(z)))] | z~N(0,I) → G(z) | Sharp samples; mode collapse; unstable training
Diffusion | ELBO over noise schedule; predict ε from xₜ | Iterative denoising (T steps) | Best quality; slow sampling; no mode collapse
Flow | Exact log-likelihood via change of variables | Invertible transform z=f(x) | Exact likelihood; restricted architectures
Diffusion Forward Process
q(xₜ|xₜ₋₁) = N(xₜ; √(1−βₜ)xₜ₋₁, βₜI)
q(xₜ|x₀) = N(xₜ; √ᾱₜ x₀, (1−ᾱₜ)I) where ᾱₜ = Πₛ₌₁ᵗ(1−βₛ)

DL Training Recipe — Interview Checklist

Gradient Check
f'(θ) ≈ [L(θ+ε) − L(θ−ε)] / 2ε (central difference)
relative error = ||f'_approx − f'_backprop|| / (||f'_approx|| + ||f'_backprop||)
Accept if < 1e-5; investigate if > 1e-3
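A minimal NumPy sketch of the central-difference gradient check above; loss_fn and the backprop gradient analytic_grad are supplied by the caller, and names are illustrative.
import numpy as np

def grad_check(loss_fn, theta, analytic_grad, eps=1e-5):
    """Compare numeric central-difference gradient of loss_fn(theta) to analytic_grad."""
    numeric = np.zeros_like(theta)
    it = np.nditer(theta, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        orig = theta[idx]
        theta[idx] = orig + eps; plus = loss_fn(theta)    # L(theta + eps)
        theta[idx] = orig - eps; minus = loss_fn(theta)   # L(theta - eps)
        theta[idx] = orig                                 # restore
        numeric[idx] = (plus - minus) / (2 * eps)
        it.iternext()
    rel_err = (np.linalg.norm(numeric - analytic_grad)
               / (np.linalg.norm(numeric) + np.linalg.norm(analytic_grad) + 1e-12))
    return rel_err          # accept if < 1e-5; investigate if > 1e-3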
QR

Quick Reference

Architecture Selection Guide

Task | Primary Choice | Alternative | Notes
Image classification | ResNet-50 / EfficientNet | ViT (large data) | Transfer from ImageNet always
Object detection | YOLOv8 / Faster R-CNN | DETR (Transformer) | YOLO for speed; DETR for simplicity
Semantic segmentation | UNet (medical) / DeepLabV3+ | SegFormer | Skip connections critical
Text classification | BERT fine-tune | DistilBERT (fast) | Always fine-tune, don't train from scratch
Text generation | GPT-style decoder | T5 (seq2seq) | GPT for open-ended; T5 for structured output
Machine translation | Transformer (encoder-decoder) | mBART, NLLB | Pretrained MT models >> training from scratch
Time-series forecast | Temporal Conv Net / Transformer | LSTM | Patching (PatchTST) competitive with Transformers
Tabular data | XGBoost / LightGBM | TabNet, FT-Transformer | DL rarely beats GBDT on tabular; try both

Complexity Reference

Layer | Params | FLOPs (forward)
Linear (d→h) | dh + h | 2dh
Conv (C_in, C_out, f×f) | C_out(C_in·f²+1) | 2·C_in·C_out·f²·H_out·W_out
MultiHead Attn (d,h) | 4d² | 8n·d² + 4n²·d
FFN (d,4d) | 8d²+5d | 16n·d²
BN / LN | 2·features | ~6·features·n
LSTM (d→h) | 4(h²+hd+h) | 8h(h+d) per step

Common Interview Questions

Gradient Computation Summary

Layer | ∂L/∂z (upstream δ) | Key Rule
Linear z=Wx+b | δ = Wᵀ·δ_next | ∂W = δ·xᵀ
ReLU a=max(0,z) | δ = δ_next ⊙ 𝟙(z>0) | Gate by forward mask
Sigmoid a=σ(z) | δ = δ_next ⊙ a(1−a) | Derivative = output × (1−output)
Tanh a=tanh(z) | δ = δ_next ⊙ (1−a²) | Derivative = 1−output²
Softmax+CCE | δ = a − y | Clean — loss cancels derivative
BN (simplified) | Complex (see BN section) | Involves batch sum terms
MaxPool | δ flows to argmax | Switch from forward pass