DL Interview Notes

// deep learning · forward + backprop equations · architectures + pseudocode · interview-ready
01

Introduction to Deep Learning

What / Why / Where

Classical ML | Deep Learning
Manual feature engineering | Learned features end-to-end
Works well on small data | Needs large data (or transfer)
Interpretable (trees, LR) | Black box (need SHAP/GradCAM)
Tabular: XGB wins | Images / text / speech: DL wins
Fast train + predict | Slow train; fast predict (GPU)
DL = representation learning; each layer = progressively abstract features

Universal Approximation Theorem

Statement
A single hidden layer NN with N neurons and non-polynomial activation
can approximate any continuous function f: ℝⁿ → ℝ on compact domain
to arbitrary precision ε > 0
Arbitrary-width version (a dual arbitrary-depth, bounded-width version also exists). Doesn't say how many neurons are needed or how to find the weights.
Depth efficiency: some functions representable by depth-n circuits need exponential (2^n) width in a shallow net but only polynomial size when depth is allowed to grow

DL Taxonomy

Architecture | Inductive Bias | Best For
MLP / DFNN | None (fully connected) | Tabular, small structured
CNN | Translation equivariance, local receptive field | Images, time-series, audio
RNN / LSTM | Sequential order, shared weights over time | Text, speech, time-series
Transformer | Global attention, permutation equivariant | NLP, vision (ViT), multimodal
GNN | Message passing over graph structure | Molecules, social graphs
VAE / GAN / Diffusion | Latent space structure | Generative modeling
02

Artificial Neuron & Perceptron

Artificial Neuron Model (McCulloch-Pitts)

Single Neuron Computation
z = w₁x₁ + w₂x₂ + ··· + wₙxₙ + b = wᵀx + b
a = g(z)
z: pre-activation (net input); a: activation (output); g: activation function
Component | Biological Analogy | Role
Input xᵢ | Dendrites / synapse input | Receive signals
Weight wᵢ | Synaptic strength | Scale each input
Bias b | Neuron firing threshold | Shift activation boundary
Σ (dot product) | Cell body (soma) summation | Aggregate inputs
Activation g | Action potential fire/no-fire | Non-linear transformation
Output a | Axon output signal | Pass to next layer
Biological neurons: spike trains, ~7 types, ~10⁴ connections each. Artificial: gross simplification.

Perceptron

Perceptron Output (step activation)
ŷ = sign(wᵀx + b) = { +1 if wᵀx+b ≥ 0, −1 otherwise }
Perceptron Learning Rule
// Update only on misclassification
w ← w + η · yᵢ · xᵢ if yᵢ(wᵀxᵢ + b) ≤ 0
b ← b + η · yᵢ
η: learning rate (typically 1 for perceptron). Equivalent to SGD on the perceptron criterion max(0, −y(wᵀx+b)); the 0-1 loss itself has zero gradient almost everywhere.
// Perceptron Algorithm
Initialize w=0, b=0
repeat until no misclassifications (or max epochs):
for each (xᵢ, yᵢ) in D:
if yᵢ · (wᵀxᵢ + b) ≤ 0: // misclassified
w ← w + η·yᵢ·xᵢ
b ← b + η·yᵢ
Perceptron Convergence Theorem
If data linearly separable with margin γ, ||x||≤R, converges in ≤ (R/γ)² steps
If NOT linearly separable → infinite loop. No convergence guarantee.
XOR problem: not linearly separable → single perceptron fails. Need hidden layers (Minsky & Papert 1969).
Decision boundary = hyperplane; moves toward correctly classifying each mistake
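A minimal NumPy sketch of the algorithm above; the function name perceptron_train and the max_epochs cap are illustrative, not from any library.
import numpy as np

def perceptron_train(X, y, lr=1.0, max_epochs=100):
    """Perceptron learning rule. X: (n, d) features, y: (n,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:      # misclassified (or on the boundary)
                w += lr * yi * xi           # move boundary toward correct side
                b += lr * yi
                mistakes += 1
        if mistakes == 0:                   # converged (only if linearly separable)
            break
    return w, b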

Perceptron vs Logistic Regression vs SVM

Property | Perceptron | Logistic Reg | SVM
Output | Binary ±1 | Probability [0,1] | Binary ±1
Loss | 0-1 (implicit) | Cross-entropy | Hinge loss
Update | Only on mistakes | Always (gradient) | Only on SVs
Boundary | Any separating | MLE boundary | Max margin
Probabilistic | No | Yes | No (needs Platt)
Convergence | If separable only | Always (convex) | Always (convex)
03

Linear Neural Network for Regression

Single Output Neuron — Regression

Forward Pass
z = wᵀx + b (no activation — linear output)
ŷ = z
Loss Function (MSE)
L(w,b) = (1/2)(ŷ − y)² = (1/2)(wᵀx + b − y)²
½ for clean gradient; over dataset: J = (1/2m)Σᵢ(ŷᵢ−yᵢ)²
Backward Pass (Gradients)
∂L/∂ŷ = ŷ − y // loss → output
∂L/∂w = (ŷ − y) · x // output → weights
∂L/∂b = (ŷ − y) // output → bias
Weight Update (GD)
w ← w − α · ∂L/∂w = w − α(ŷ−y)x
b ← b − α · ∂L/∂b = b − α(ŷ−y)
Identical to linear regression; neuron notation makes extension to deep nets natural
Activation for regression output: identity (none) or softplus (if output must be >0)
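A minimal NumPy sketch tying the forward pass, gradients, and update above together for a single linear neuron trained with per-example SGD; names are illustrative.
import numpy as np

def linear_neuron_sgd(X, y, lr=0.01, epochs=100):
    """Single linear neuron with MSE, matching the gradients above."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = w @ xi + b          # forward: z = w^T x + b, identity output
            err = y_hat - yi            # dL/dy_hat = y_hat - y
            w -= lr * err * xi          # dL/dw = (y_hat - y) x
            b -= lr * err               # dL/db = (y_hat - y)
    return w, b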

MLE Justification for MSE

Assume y = f(x;w) + ε, ε ~ N(0,σ²)
P(y|x;w) = N(y; wᵀx, σ²)
MLE: max Σᵢ log P(yᵢ|xᵢ;w)
= max −(1/2σ²) Σᵢ (ŷᵢ − yᵢ)²
≡ min MSE
Huber Loss (robust)
L_δ(r) = { ½r² if |r| ≤ δ
δ(|r|−½δ) if |r| > δ }
Quadratic near 0 (smooth gradient); linear for large errors (robust to outliers)
04

Linear Neural Network for Classification

Binary Classification — Sigmoid Output

Forward Pass
z = wᵀx + b
a = σ(z) = 1/(1+e⁻ᶻ) = P(y=1|x)
Loss — Binary Cross-Entropy (BCE)
L = −[y log(a) + (1−y) log(1−a)]
Backward Pass (chain rule)
∂L/∂a = −y/a + (1−y)/(1−a) = (a−y)/(a(1−a))
∂a/∂z = σ(z)(1−σ(z)) = a(1−a) // sigmoid derivative
∂L/∂z = ∂L/∂a · ∂a/∂z = a − y // clean! loss derivative cancels sigmoid
∂L/∂w = (a − y) · x
∂L/∂b = a − y
∂L/∂z = a−y for sigmoid+BCE is the same form as linear regression. No vanishing gradient in output layer for this pairing.

Multiclass Classification — Softmax Output

Forward Pass (K classes)
zₖ = wₖᵀx + bₖ, k = 1,...,K
aₖ = softmax(z)ₖ = exp(zₖ) / Σⱼ exp(zⱼ) = P(y=k|x)
Numerical Stability Trick
aₖ = exp(zₖ − max(z)) / Σⱼ exp(zⱼ − max(z))
Subtract max before exp to prevent overflow; doesn't change the output (numerator and denominator are both scaled by the same constant exp(−max(z)))
Loss — Categorical Cross-Entropy
L = −Σₖ yₖ log(aₖ) = −log(a_{true_class}) (one-hot y)
Backward Pass — Softmax + CCE gradient
∂L/∂zₖ = aₖ − yₖ // extremely clean! identical to BCE form
Proof: ∂L/∂zₖ = Σⱼ (∂L/∂aⱼ)(∂aⱼ/∂zₖ)
For j=k: ∂aⱼ/∂zₖ = aₖ(1−aₖ) → contributes −(yₖ/aₖ)·aₖ(1−aₖ)
For j≠k: ∂aⱼ/∂zₖ = −aⱼaₖ → contributes +(yⱼ/aⱼ)·aⱼaₖ = yⱼaₖ
Sum = aₖ·Σⱼyⱼ − yₖ = aₖ − yₖ (since Σyⱼ=1 for one-hot)
Weight / Bias Gradients
∂L/∂Wₖ = (aₖ − yₖ) · xᵀ // outer product for matrix form
∂L/∂bₖ = aₖ − yₖ
Softmax + CCE gradient = predicted − one-hot label. Elegant and stable.
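A minimal NumPy sketch of the stability trick plus the clean softmax+CCE gradient, assuming a batch of logits z and one-hot labels; the function name is illustrative.
import numpy as np

def softmax_ce(z, y_onehot):
    """Numerically stable softmax + cross-entropy; returns mean loss and dL/dz."""
    z_shift = z - z.max(axis=-1, keepdims=True)        # stability: subtract row max
    exp = np.exp(z_shift)
    a = exp / exp.sum(axis=-1, keepdims=True)          # softmax probabilities
    loss = -np.sum(y_onehot * np.log(a + 1e-12), axis=-1).mean()
    dz = (a - y_onehot) / z.shape[0]                   # clean gradient: a - y (per mean loss)
    return loss, dz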

Output Layer Cheat Sheet

Task | Output Activation | Loss Function | ∂L/∂z
Regression | Linear (none) | MSE = ½(ŷ−y)² | ŷ − y
Regression (bounded) | Sigmoid × range | MSE | Chain rule
Binary classification | Sigmoid σ(z) | BCE | a − y
Multiclass (excl.) | Softmax | Cat. Cross-Entropy | a − y (vec)
Multilabel | Sigmoid (per class) | BCE (per class sum) | aₖ − yₖ each
Counting / Poisson | exp(z) or softplus | Poisson NLL | exp(z) − y
Never use sigmoid + MSE for classification: not convex, flat gradients far from boundary. Always use BCE.
05

Deep Feedforward Neural Networks (MLP)

5.0

Activation Functions

Activation Functions — Formulas + Derivatives + Properties

Activation | Formula | Derivative g'(z) | Range | Key Notes
Sigmoid | σ(z) = 1/(1+e⁻ᶻ) | σ(z)(1−σ(z)) | (0,1) | Vanishing gradient for |z|≫0; saturates; not zero-centered; slow
Tanh | tanh(z) = (eᶻ−e⁻ᶻ)/(eᶻ+e⁻ᶻ) | 1 − tanh²(z) | (−1,1) | Zero-centered (better than sigmoid); still saturates; tanh(z)=2σ(2z)−1
ReLU | max(0, z) | 0 if z<0; 1 if z>0 | [0,∞) | No saturation for z>0; sparse activation; dead neurons (z<0 forever)
Leaky ReLU | max(αz, z), α≈0.01 | α if z<0; 1 if z>0 | (−∞,∞) | Fixes dying ReLU; α is hyperparameter or learned (PReLU)
ELU | z if z≥0; α(eᶻ−1) if z<0 | 1 if z≥0; α·eᶻ if z<0 | (−α,∞) | Smooth at 0; negative outputs push mean activation toward zero; α≈1.0
GELU | z·Φ(z) ≈ 0.5z(1+tanh[√(2/π)(z+0.044715z³)]) | Φ(z)+z·φ(z) | (≈−0.17,∞) | Transformer default; stochastic regularization interpretation; smooth everywhere
Swish | z · σ(βz) | σ(βz)+βz·σ(βz)(1−σ(βz)) | (≈−0.28,∞) | β=1 common; self-gated; outperforms ReLU on deep nets empirically
Softplus | log(1+eᶻ) | σ(z) | (0,∞) | Smooth approximation of ReLU; derivative is sigmoid; rarely used in hidden layers
Dead ReLU Problem
If z = wᵀx+b < 0 for all training examples → ∂L/∂w = 0 → w never updates
Causes: large negative bias, large learning rate causing weight to "flip" negative permanently
Fix: Leaky ReLU / ELU / He initialization / smaller learning rate / batch norm
ReLU default for hidden layers; GELU for Transformers; sigmoid/tanh only for gates or output layers
5.1

Forward Propagation — General L-Layer Network

Forward Pass — Exact Equations

Notation
L: number of layers (1..L); layer 0 = input
W⁽ˡ⁾ ∈ ℝ^{nˡ × nˡ⁻¹}: weight matrix for layer l
b⁽ˡ⁾ ∈ ℝ^{nˡ}: bias vector for layer l
a⁽⁰⁾ = x: input features
Per-Layer Forward Pass (l = 1, 2, ..., L)
z⁽ˡ⁾ = W⁽ˡ⁾ a⁽ˡ⁻¹⁾ + b⁽ˡ⁾ // pre-activation
a⁽ˡ⁾ = g⁽ˡ⁾(z⁽ˡ⁾) // post-activation
// Hidden: g = ReLU; Output: g = softmax or identity
// Forward Pass (vectorized, batch)
A⁽⁰⁾ = X // X ∈ ℝ^{d×m}, m=batch size
for l = 1 to L:
Z⁽ˡ⁾ = W⁽ˡ⁾ A⁽ˡ⁻¹⁾ + b⁽ˡ⁾ // b broadcasts
A⁽ˡ⁾ = g⁽ˡ⁾(Z⁽ˡ⁾)
// Cache Z⁽ˡ⁾, A⁽ˡ⁾ for backprop
ŷ = A⁽ᴸ⁾
L = loss(ŷ, y)
Dimensions Check (critical for debugging)
W⁽ˡ⁾: [nˡ, nˡ⁻¹]; b⁽ˡ⁾: [nˡ, 1]
Z⁽ˡ⁾: [nˡ, m]; A⁽ˡ⁾: [nˡ, m]
Inner dim must match: nˡ⁻¹ = cols of W⁽ˡ⁾ = rows of A⁽ˡ⁻¹⁾
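A minimal NumPy sketch of the vectorized forward pass above, assuming ReLU hidden layers and a linear output; biases are (nˡ, 1) column vectors so they broadcast over the batch. Names are illustrative.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(X, weights, biases):
    """X: (d, m); weights[l]: (n_l, n_{l-1}); biases[l]: (n_l, 1).
    Returns output A^L and a cache of (Z, A) pairs needed for backprop."""
    A = X
    cache = [(None, A)]                               # layer 0: input only
    for l, (W, b) in enumerate(zip(weights, biases)):
        Z = W @ A + b                                 # pre-activation; b broadcasts over batch
        A = relu(Z) if l < len(weights) - 1 else Z    # hidden: ReLU; output: identity
        cache.append((Z, A))
    return A, cache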

Backpropagation — Exact Equations

Error Signal (delta) — Output Layer L
δ⁽ᴸ⁾ = ∂L/∂Z⁽ᴸ⁾ = ∂L/∂A⁽ᴸ⁾ ⊙ g'⁽ᴸ⁾(Z⁽ᴸ⁾)
// For softmax+CCE or sigmoid+BCE: δ⁽ᴸ⁾ = A⁽ᴸ⁾ − Y (clean form)
Backpropagation — Hidden Layer l (l = L−1, ..., 1)
δ⁽ˡ⁾ = (W⁽ˡ⁺¹⁾)ᵀ δ⁽ˡ⁺¹⁾ ⊙ g'⁽ˡ⁾(Z⁽ˡ⁾)
⊙: element-wise (Hadamard) product
(W⁽ˡ⁺¹⁾)ᵀ: "routes" errors backward through same weights
g'⁽ˡ⁾(Z⁽ˡ⁾): gates gradient flow by how much neuron was "active"
Parameter Gradients
∂L/∂W⁽ˡ⁾ = (1/m) δ⁽ˡ⁾ (A⁽ˡ⁻¹⁾)ᵀ
∂L/∂b⁽ˡ⁾ = (1/m) Σᵢ δ⁽ˡ⁾ᵢ = (1/m) sum over batch
Weight Update
W⁽ˡ⁾ ← W⁽ˡ⁾ − α · ∂L/∂W⁽ˡ⁾
b⁽ˡ⁾ ← b⁽ˡ⁾ − α · ∂L/∂b⁽ˡ⁾
// Full Backprop Algorithm
// Output layer delta
δ⁽ᴸ⁾ = A⁽ᴸ⁾ − Y (softmax+CCE or sigmoid+BCE)
for l = L downto 1:
dW⁽ˡ⁾ = (1/m) δ⁽ˡ⁾ · (A⁽ˡ⁻¹⁾)ᵀ
db⁽ˡ⁾ = (1/m) rowsum(δ⁽ˡ⁾)
if l > 1:
δ⁽ˡ⁻¹⁾ = (W⁽ˡ⁾)ᵀ · δ⁽ˡ⁾ ⊙ g'(Z⁽ˡ⁻¹⁾)
Update W⁽ˡ⁾, b⁽ˡ⁾ via optimizer
Cache Z⁽ˡ⁾ and A⁽ˡ⁾ during forward pass — needed to compute g'(Z⁽ˡ⁾) in backprop
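A matching NumPy sketch of the backprop loop, assuming the ReLU forward pass sketched earlier and an output layer where δ⁽ᴸ⁾ = A⁽ᴸ⁾ − Y (softmax+CCE or linear+MSE). Names are illustrative.
import numpy as np

def mlp_backward(Y, weights, cache):
    """cache: list of (Z, A) from mlp_forward. Returns lists dW, db matching weights."""
    m = Y.shape[1]
    grads_W, grads_b = [], []
    _, A_last = cache[-1]
    delta = A_last - Y                                    # output-layer error signal
    for l in range(len(weights), 0, -1):                  # l = L ... 1
        Z_prev, A_prev = cache[l - 1]
        grads_W.insert(0, (delta @ A_prev.T) / m)         # dW^l = (1/m) delta (A^{l-1})^T
        grads_b.insert(0, delta.sum(axis=1, keepdims=True) / m)
        if l > 1:
            delta = (weights[l - 1].T @ delta) * (Z_prev > 0)   # route back, gate by ReLU mask
    return grads_W, grads_b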

Vanishing / Exploding Gradients

Gradient at Layer l (simplified)
∂L/∂W⁽¹⁾ ∝ Π_{k=2}^{L} (W⁽ᵏ⁾ · g'(z⁽ᵏ⁾))
Product of L−1 matrices and activation derivatives
Sigmoid: g'(z) ≤ 0.25 → product → 0 exponentially (vanishing)
Large weights: ||W|| > 1 → product → ∞ exponentially (exploding)
Problem | Symptom | Fix
Vanishing gradient | Early layers learn ~nothing; loss plateau | ReLU, skip connections, BatchNorm, LSTM/GRU gates
Exploding gradient | Loss NaN; weights blow up | Gradient clipping, smaller lr, weight decay, BatchNorm
Gradient Clipping (for exploding)
if ||g|| > threshold: g ← g · (threshold / ||g||)
Clips gradient norm, not individual values. Standard in RNN training (threshold=1.0 or 5.0).
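A minimal NumPy sketch of clipping by global norm across all parameter gradients; the helper name is illustrative.
import numpy as np

def clip_by_global_norm(grads, threshold=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm <= threshold."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        scale = threshold / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads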

Weight Initialization

Xavier / Glorot (tanh, sigmoid)
W ~ U[−√(6/(nˡ⁻¹+nˡ)), +√(6/(nˡ⁻¹+nˡ))]
Var(W) = 2 / (nˡ⁻¹ + nˡ)
Keeps variance of activations constant across layers for linear-like activations
He / Kaiming (ReLU)
W ~ N(0, 2/nˡ⁻¹)
Var(W) = 2 / nˡ⁻¹
Accounts for ReLU zeroing ~half the activations; factor 2 compensates
LeCun (SELU)
Var(W) = 1 / nˡ⁻¹
Goal: maintain signal variance through forward pass AND gradient variance through backward pass
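A minimal NumPy sketch of the three initialization schemes above, with fan-in/fan-out taken from the layer shape; names are illustrative.
import numpy as np

def init_weights(n_in, n_out, scheme="he", seed=0):
    """Xavier/Glorot (uniform), He (normal), LeCun (normal) initialization."""
    rng = np.random.default_rng(seed)
    if scheme == "xavier":
        limit = np.sqrt(6.0 / (n_in + n_out))
        return rng.uniform(-limit, limit, size=(n_out, n_in))
    if scheme == "he":
        return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
    if scheme == "lecun":
        return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))
    raise ValueError(scheme)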

Depth vs Width

Property | More Depth | More Width
Expressivity | Exponential gain (compositionality) | Linear gain
Parameters | Fewer for same power | More parameters
Optimization | Harder (more local minima, vanishing grad) | Easier (flatter landscape)
Inductive bias | Hierarchical feature reuse | Richer per-level features
Generalization | Better with regularization | Can overfit quickly
Depth ↔ program length; width ↔ memory per step
06

Convolutional Neural Networks

Convolution Operation

2D Cross-Correlation (what CNNs actually compute)
(I ★ K)[i,j] = Σₘ Σₙ I[i+m, j+n] · K[m,n]
I: input; K: kernel/filter; ★: cross-correlation (not full convolution)
Output Spatial Size
H_out = ⌊(H_in + 2P − F_H) / S⌋ + 1
W_out = ⌊(W_in + 2P − F_W) / S⌋ + 1
H_in: input height; P: padding; F_H: filter height; S: stride
Parameter Count (conv layer)
Params = C_out × (C_in × F_H × F_W + 1)
+1 for bias per output channel. Shared across spatial locations → massive param saving vs FC.
Padding Modes
'valid': no padding → output shrinks
'same': P = (F−1)/2 → output = input size (stride=1)
Receptive Field (layer l, filter size f, per-layer strides sₖ)
RF_l = RF_{l-1} + (f−1) × Π_{k=1}^{l-1} sₖ
Grows with depth; dilated convolutions increase RF without extra params
Weight sharing = same filter detects feature everywhere → translation equivariance

Convolution — Forward & Backward

Forward Pass (single filter, no bias)
Z[i,j] = Σₘ Σₙ A_prev[i·S+m, j·S+n] · W[m,n]
A = g(Z)
Backward Pass — Gradient w.r.t. Filter
∂L/∂W[m,n] = Σᵢ Σⱼ δ[i,j] · A_prev[i·S+m, j·S+n]
Correlation of input patch with upstream gradient = convolve input with δ
Backward Pass — Gradient w.r.t. Input (for prev layer)
∂L/∂A_prev[p,q] = Σᵢ Σⱼ δ[i,j] · W[p−i·S, q−j·S]
Full convolution of δ with flipped kernel W (180° rotation = true convolution)
With Channels (general: C_in input channels, C_out filters)
Z_{c_out}[i,j] = Σ_{c_in} Σₘ Σₙ A_prev_{c_in}[i+m,j+n] · W_{c_out,c_in}[m,n] + b_{c_out}
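A naive NumPy sketch of the multi-channel forward convolution above, written as explicit loops for clarity (single example, no padding); names are illustrative.
import numpy as np

def conv2d_forward(A_prev, W, b, stride=1):
    """A_prev: (C_in, H, W_in); W: (C_out, C_in, f, f); b: (C_out,). Returns Z."""
    C_out, C_in, f, _ = W.shape
    _, H, W_in = A_prev.shape
    H_out = (H - f) // stride + 1              # output size formula with P = 0
    W_out = (W_in - f) // stride + 1
    Z = np.zeros((C_out, H_out, W_out))
    for co in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                patch = A_prev[:, i*stride:i*stride+f, j*stride:j*stride+f]
                Z[co, i, j] = np.sum(patch * W[co]) + b[co]   # shared filter over positions
    return Z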

Pooling & Channels

Max Pooling (no learned params)
P[i,j] = max{A[i·S+m, j·S+n] : m,n ∈ [0,f)}
Average Pooling
P[i,j] = (1/f²) Σₘ Σₙ A[i·S+m, j·S+n]
Global Average Pooling (GAP)
P_c = (1/HW) Σᵢ Σⱼ A_c[i,j] // one value per channel
Replaces FC layers in modern CNNs (GoogLeNet, ResNet). No params; spatial invariance; regularization effect.
Max Pool Backward
Gradient flows only to max-activation position; zero elsewhere
"Switch" remembers which position was max during forward pass
Layer Pattern | Effect
Conv + BN + ReLU | Standard modern block
CONV(3×3) × N → Pool | VGG-style
Concat([conv1×1, conv3×3, conv5×5, maxpool]) | Inception-style
x + F(x) (residual) | ResNet-style
6.1

CNN Architectures

AlexNet → VGGNet → InceptionNet → ResNet — Key Innovations

Architecture | Year | Key Innovation | Params | Architecture Summary
AlexNet | 2012 | ReLU over tanh (10× faster); Dropout; Data augment; Multi-GPU; Local Response Norm | 60M | 5 CONV [11,5,3,3,3] + 3 FC. Large filters, MaxPool after 1,2,5
VGGNet-16 | 2014 | All 3×3 convs — two 3×3 = one 5×5 RF with fewer params and extra nonlinearity; very deep (16-19 layers) | 138M | 5 blocks of [CONV3×3 ×2-3 → MaxPool2×2] + 3 FC. Uniform design.
GoogLeNet (InceptionV1) | 2014 | Inception module: parallel 1×1, 3×3, 5×5 convs + maxpool; 1×1 bottleneck for dim reduction; GAP instead of FC; Auxiliary classifiers for gradient flow | 6.8M | 9 Inception modules + GAP + Softmax
InceptionV3 | 2016 | Factorize n×n → 1×n + n×1 (asymmetric); BatchNorm everywhere; Label smoothing | 23.8M | Factorized Inception + BN + GAP
ResNet-50 | 2015 | Residual (skip) connections: solve vanishing gradient, enable training 100+ layers; BN after each conv; GAP + single FC | 25M | 1 CONV + 4 stages of [bottleneck blocks: 1×1→3×3→1×1] + GAP + FC
ResNet Bottleneck | 2015 | 1×1 → 3×3 → 1×1: reduce channels for 3×3, then restore → 3× fewer params than plain 3×3 block | Residual: F(x)+x; dimension match via 1×1 projection if needed
Residual Block (ResNet) — Forward
F(x) = W₂ · ReLU(BN(W₁x)) // simplified
y = ReLU(F(x) + x) // skip connection
// if dim mismatch: x_proj = W_s · x (1×1 conv)
y = ReLU(F(x) + W_s·x)
Identity shortcut: gradient = 1 → no vanishing. F(x) learns residual (correction), not full mapping.
Why skip connections work — gradient perspective
∂L/∂x = ∂L/∂y · (∂F/∂x + 1)
+1 adds a direct identity term, so the gradient cannot vanish even when ∂F/∂x is small
Also: creates ensemble of paths from input to output
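A conceptual NumPy sketch of a residual block, with plain matrix multiplies standing in for conv+BN and an optional W_s for the 1×1 projection; names are illustrative.
import numpy as np

def residual_block(x, W1, W2, W_s=None):
    """F(x) = W2 @ relu(W1 @ x); output = relu(F(x) + shortcut)."""
    F = W2 @ np.maximum(0.0, W1 @ x)           # residual branch learns a correction
    shortcut = x if W_s is None else W_s @ x   # identity, or projection if dims differ
    return np.maximum(0.0, F + shortcut)       # skip keeps a direct gradient path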
Inception Module — Forward
h₁ = CONV1×1(x)
h₂ = CONV3×3(CONV1×1(x)) // 1×1 bottleneck first
h₃ = CONV5×5(CONV1×1(x))
h₄ = CONV1×1(MaxPool(x))
out = Concat([h₁, h₂, h₃, h₄], axis=channel)
1×1 Convolution — Why It Matters
Acts as learned channel-wise projection / mixing
Reduces channels: C_in=256 → 64 → 256 saves 9× FLOPs for 3×3 conv
Adds non-linearity across channels with minimal spatial cost
VGG: simple and deep. Inception: multi-scale features. ResNet: residual learning = train arbitrarily deep. Modern default: ResNet family.

Transfer Learning

Feature Hierarchy
Early layers: edges, colors, textures (universal)
Mid layers: parts, patterns (semi-universal)
Late layers: task-specific (dataset-specific)
Scenario | Strategy | What to Do
Small data + similar domain | Feature extraction | Freeze all; replace + train only top FC layers
Large data + similar domain | Fine-tune deeper | Unfreeze later conv blocks + retrain with small lr
Small data + different domain | Feature extraction | Freeze all; linear classifier on penultimate features
Large data + different domain | Full fine-tune | Initialize with pretrained; train all layers
Pretrained = compressed knowledge of millions of images; adapt last few layers for new task

Modern CNN Variants

Architecture | Key Idea | Use Case
ResNeXt | Grouped convolutions (32 groups) — wider + grouped | Better ResNet
DenseNet | Each layer connected to all subsequent layers: x_l = H([x₀,x₁,...,x_{l-1}]) | Dense feature reuse; small datasets
MobileNetV2 | Depthwise sep + inverted residuals (expand→DWConv→project) | Mobile / edge deployment
EfficientNet | Compound scaling: depth × width × resolution simultaneously via NAS | State-of-art efficiency tradeoff
ViT | Split image into patches → linear embed → Transformer encoder; no conv | Large-data image classification
DenseNet Forward — Layer l
x_l = H_l([x₀, x₁, ..., x_{l-1}]) // concat all prev activations
Growth rate k: each layer adds k channels
Alleviates vanishing gradient (direct path to all layers); feature reuse; fewer params than ResNet for same accuracy
07

Recurrent Neural Networks

Vanilla RNN — Equations

Forward Pass (timestep t)
hₜ = tanh(Wₕ hₜ₋₁ + Wₓ xₜ + bₕ)
= tanh(W[hₜ₋₁; xₜ] + bₕ) // compact: concatenate
yₜ = softmax(Wᵧ hₜ + bᵧ) // if output at each step
Wₕ: hidden-to-hidden; Wₓ: input-to-hidden; weights shared across time
Loss (sum over all timesteps)
L = Σₜ₌₁ᵀ Lₜ(yₜ, ŷₜ)
Dimensions
xₜ ∈ ℝᵈ; hₜ ∈ ℝʰ; yₜ ∈ ℝᵏ
Wₕ ∈ ℝ^{h×h}; Wₓ ∈ ℝ^{h×d}; Wᵧ ∈ ℝ^{k×h}
Same W reused at each timestep → shared representation of dynamics; sequence as directed graph through time
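A minimal NumPy sketch of one vanilla-RNN timestep matching the equations above; names are illustrative.
import numpy as np

def rnn_step(x_t, h_prev, W_h, W_x, b_h, W_y=None, b_y=None):
    """One step: h_t = tanh(W_h h_{t-1} + W_x x_t + b_h); optional per-step logits."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b_h)
    y_t = None if W_y is None else W_y @ h_t + b_y   # logits (apply softmax outside)
    return h_t, y_t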

Backpropagation Through Time (BPTT)

Gradient w.r.t. Wₕ
∂L/∂Wₕ = Σₜ ∂Lₜ/∂Wₕ
∂Lₜ/∂Wₕ = Σₖ₌₁ᵗ (∂Lₜ/∂hₜ) · (∂hₜ/∂hₖ) · (∂hₖ/∂Wₕ)
Gradient of hₜ w.r.t. hₖ (chain over time)
∂hₜ/∂hₖ = Πⱼ₌ₖ₊₁ᵗ ∂hⱼ/∂hⱼ₋₁ = Πⱼ₌ₖ₊₁ᵗ Wₕᵀ · diag(tanh'(zⱼ))
Product of (t−k) Jacobians. If ||Wₕᵀ·diag(g')|| < 1: vanishes; if > 1: explodes. tanh' ≤ 1 (sigmoid' ≤ 0.25) makes the compounding shrink the gradient.
// Truncated BPTT
Unroll T steps max (T=50–200 typical)
for each chunk of T timesteps:
Forward pass: compute h₁,...,hT, loss
Backward pass: gradient flows back T steps
Carry forward hT as h₀ for next chunk
Vanishing gradient: RNN can't learn long-range dependencies (>~10 steps with vanilla). Fix: LSTM/GRU.
7.1

LSTM — Long Short-Term Memory

LSTM — All Gate Equations (Forward Pass)

Input: concatenate [hₜ₋₁, xₜ] → input vector
iₜ = σ(Wᵢ[hₜ₋₁, xₜ] + bᵢ) // Input gate: what new info to write
fₜ = σ(Wf[hₜ₋₁, xₜ] + bf) // Forget gate: what to erase from cell
oₜ = σ(Wo[hₜ₋₁, xₜ] + bo) // Output gate: what to expose from cell
g̃ₜ = tanh(Wg[hₜ₋₁, xₜ] + bg) // Candidate cell: new info to add
Cell State Update
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ g̃ₜ
fₜ⊙cₜ₋₁: selectively forget past
iₜ⊙g̃ₜ: selectively write new info
Hidden State Update
hₜ = oₜ ⊙ tanh(cₜ)
Read from cell through output gate → filtered memory exposure
Dimensions
xₜ ∈ ℝᵈ, hₜ, cₜ ∈ ℝʰ
Each W ∈ ℝ^{h×(h+d)}, each b ∈ ℝʰ
Total params ≈ 4 × (h² + hd + h)
4× more params than vanilla RNN of same hidden size
Why LSTM Solves Vanishing Gradient
∂cₜ/∂cₜ₋₁ = fₜ // just element-wise mult!
Gradient of cell: ∂L/∂c₀ = ∂L/∂cT · Πₜ fₜ
When fₜ ≈ 1 (forget gate open): gradient flows unchanged
No matrix product → no exponential decay
Highway for gradients through cell state
Peephole Connections (Gers & Schmidhuber)
Gates also see cell state:
iₜ = σ(Wᵢ[hₜ₋₁, xₜ, cₜ₋₁] + bᵢ)
fₜ = σ(Wf[hₜ₋₁, xₜ, cₜ₋₁] + bf)
oₜ = σ(Wo[hₜ₋₁, xₜ, cₜ] + bo)
Cell state = conveyor belt; forget gate erases; input gate writes; output gate reads — long-range memory via near-linear gradient paths
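A minimal NumPy sketch of one LSTM step, with the four gate weight matrices stacked into a single W of shape (4h, h+d) (a common but not universal packing convention); names are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W: (4h, h+d) stacked [i; f; o; g] weights; b: (4h,)."""
    h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0:h])            # input gate: what new info to write
    f = sigmoid(z[h:2*h])          # forget gate: what to erase
    o = sigmoid(z[2*h:3*h])        # output gate: what to expose
    g = np.tanh(z[3*h:4*h])        # candidate cell content
    c_t = f * c_prev + i * g       # cell update: forget old, write new
    h_t = o * np.tanh(c_t)         # expose filtered memory
    return h_t, c_t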
7.2

GRU — Gated Recurrent Unit

GRU — All Equations

Gate Equations
zₜ = σ(Wz[hₜ₋₁, xₜ] + bz) // Update gate: blend old vs new
rₜ = σ(Wr[hₜ₋₁, xₜ] + br) // Reset gate: how much to forget
Candidate Hidden State
h̃ₜ = tanh(W[rₜ ⊙ hₜ₋₁, xₜ] + b)
rₜ ⊙ hₜ₋₁: reset gate masks past hidden state. If rₜ→0, h̃ₜ ignores past = restart.
Final Hidden State
hₜ = (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ
zₜ: interpolation between past and candidate. If zₜ→1: update fully. If zₜ→0: keep past.
Dimensions
Total params ≈ 3 × (h² + hd + h) // 25% fewer than LSTM
GRU = simplified LSTM; merges cell and hidden state; 2 gates vs 3; often matches LSTM with fewer params

LSTM vs GRU vs Vanilla RNN

Property | Vanilla RNN | LSTM | GRU
Gates | 0 | 3 (i, f, o) + cell | 2 (z, r)
Hidden states | hₜ only | hₜ + cₜ (cell) | hₜ only
Params (d=100, h=100) | ~20K | ~80K | ~60K
Long-range memory | Poor (<10 steps) | Excellent (>1000 steps) | Good
Train speed | Fastest | Slowest | Middle
When to use | Very short sequences | Complex long sequences; when cₜ semantics matter | Default; less data; faster experiments
Empirically: LSTM and GRU similar on most tasks. GRU preferred for speed/fewer params.

BiLSTM & Stacked RNNs

Bidirectional LSTM
h_fwd_t = LSTM_fwd(xₜ, h_fwd_{t-1}) // forward in time
h_bwd_t = LSTM_bwd(xₜ, h_bwd_{t+1}) // backward in time
hₜ = [h_fwd_t ; h_bwd_t] // concatenate
2× params; can't use for autoregressive generation (needs full sequence)
Stacked (Deep) RNN
hₜ⁽ˡ⁾ = LSTM(hₜ⁽ˡ⁻¹⁾, hₜ₋₁⁽ˡ⁾) for l=1..L
Input to layer l = hidden state of layer l−1. Adds abstraction. 2-4 layers common; dropout between layers.
BiLSTM for encoding (classification, NER); unidirectional for generation; stacking for deeper abstraction
08

Attention Mechanism

Seq2Seq + Bahdanau Attention

Problem: encoder-decoder bottleneck
All encoder info compressed into single vector c = h_T
Information loss for long sequences → attention solves this
Bahdanau (Additive) Attention — Equations
eₜⱼ = vₐᵀ tanh(Wₐ sₜ₋₁ + Uₐ hⱼ) // alignment score
αₜⱼ = softmax(eₜⱼ) = exp(eₜⱼ)/Σⱼ'exp(eₜⱼ') // attention weights
cₜ = Σⱼ αₜⱼ hⱼ // context vector
sₜ = tanh(W[sₜ₋₁; yₜ₋₁; cₜ] + b) // decoder update
sₜ: decoder hidden; hⱼ: encoder hidden at pos j; vₐ,Wₐ,Uₐ: learned; O(T²d) per sequence
Luong (Multiplicative) Attention — Variants
dot: eₜⱼ = sₜᵀ hⱼ
general: eₜⱼ = sₜᵀ Wₐ hⱼ
concat: eₜⱼ = vᵀ tanh(W[sₜ; hⱼ])
Decoder "queries" encoder outputs; weights = how relevant each encoder position is for current output step

Scaled Dot-Product Attention (Transformer)

Core Equation
Attention(Q,K,V) = softmax(QKᵀ / √dₖ) · V
Q ∈ ℝ^{n×dₖ}: queries; K ∈ ℝ^{m×dₖ}: keys; V ∈ ℝ^{m×dᵥ}: values
n: #query positions; m: #key-value positions; output ∈ ℝ^{n×dᵥ}
Why √dₖ scaling?
Each score qᵀk = Σᵢ qᵢkᵢ has variance dₖ (for unit-variance q, k)
→ magnitude grows with dₖ → softmax saturates → tiny gradients
√dₖ normalization restores unit variance
Self-Attention
Q = X Wᵠ; K = X Wᴷ; V = X Wᵛ where X ∈ ℝ^{n×d}
Attention(Q,K,V) — Q, K, V all from same sequence X
Each position attends to all others; captures global dependencies in O(n²d)
Cross-Attention (encoder-decoder)
Q = decoder states; K = V = encoder outputs
Decoder queries encoder to compute context
Masked Self-Attention (causal / autoregressive)
Set eᵢⱼ = −∞ for j > i (future positions)
After softmax: attention weight = 0 for future tokens
Position i can only attend to positions 1..i
Attention = soft retrieval: query finds relevant keys, retrieves weighted values
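A minimal NumPy sketch of scaled dot-product attention with an optional causal mask (the causal case assumes n = m); names are illustrative.
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: (n, d_k); K: (m, d_k); V: (m, d_v). Returns (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n, m) alignment scores
    if causal:                                          # mask future positions j > i
        n, m = scores.shape
        scores = np.where(np.tril(np.ones((n, m), dtype=bool)), scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                                  # weighted sum of values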

Multi-Head Attention

Multi-Head Attention (h heads)
head_i = Attention(Q Wᵠᵢ, K Wᴷᵢ, V Wᵛᵢ)
MHA(Q,K,V) = Concat(head₁,...,head_h) Wᴼ
Wᵠᵢ ∈ ℝ^{d×dₖ}, Wᴷᵢ ∈ ℝ^{d×dₖ}, Wᵛᵢ ∈ ℝ^{d×dᵥ}, Wᴼ ∈ ℝ^{hdᵥ×d}
Typically: dₖ = dᵥ = d/h (so total compute ≈ single-head)
Complexity
Self-attention: O(n²d) time, O(n²+nd) space
Bottleneck: n² attention matrix → O(n²) memory for long sequences
FlashAttention — Key Idea
Fuse QKᵀ softmax V into single GPU kernel; tile computation to fit SRAM
O(n²) FLOPS but only O(n) memory → 2–4× speedup; no accuracy loss
Multiple heads = multiple "attention programs" running in parallel; each specializes
09

Transformer

Transformer Architecture — Complete Equations

Positional Encoding (sinusoidal)
PE(pos, 2i) = sin(pos / 10000^{2i/d})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d})
Input embedding: x_pos = Embed(token) + PE(pos)
Injects sequential position info (self-attention alone is permutation-equivariant, so order must be supplied explicitly)
Learned PE also common (BERT uses learned)
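A minimal NumPy sketch of the sinusoidal encoding above (assumes even d); the resulting (max_len, d) matrix is added to the token embeddings.
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE(pos, 2i) = sin(pos / 10000^{2i/d}); PE(pos, 2i+1) = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(0, d, 2)[None, :]                    # even dimension indices
    angles = pos / np.power(10000.0, i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe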
Encoder Block (repeated N times)
z₁ = LayerNorm(x + MHA(x,x,x)) // self-attn + residual
z₂ = LayerNorm(z₁ + FFN(z₁)) // FFN + residual
Decoder Block (repeated N times)
z₁ = LayerNorm(x + Masked-MHA(x,x,x)) // causal self-attn
z₂ = LayerNorm(z₁ + Cross-MHA(z₁,enc,enc)) // cross-attn
z₃ = LayerNorm(z₂ + FFN(z₂))
Feed-Forward Network (FFN) per position
FFN(x) = max(0, x W₁ + b₁) W₂ + b₂
// ReLU or GELU; W₁ expands d→4d, W₂ projects 4d→d
Params: 2 × d × 4d = 8d² per layer
Layer Norm (Pre-LN vs Post-LN)
LN(x) = γ · (x − μ) / √(σ² + ε) + β
// Post-LN (original): LN(x + sublayer(x))
// Pre-LN (modern): x + sublayer(LN(x)) — more stable training
γ, β: learned scale and shift per feature
Original Transformer (Vaswani 2017) Config
N=6 layers; d=512; h=8 heads; dₖ=dᵥ=64; FFN d_ff=2048
Encoder: 6 × [MHA(8 heads) + FFN]
Decoder: 6 × [Masked-MHA + Cross-MHA + FFN]
Output: linear projection → softmax over vocab
Full Forward Pass (seq2seq)
src → embed + PE → encoder → enc_out
tgt → embed + PE → decoder(enc_out) → dec_out
logits = dec_out · Wᵥᵒᶜᵃᵇ
loss = CrossEntropy(logits, tgt_shifted)
Teacher forcing during training: feed ground-truth target tokens shifted right
Transformer Attention Complexity
Self-attention: O(n²·d) — quadratic in sequence length
FFN: O(n·d²) — linear in sequence length
For n<d: FFN dominates. For n>d: attention dominates.
Backprop Through Transformer Block
∂L/∂z₁ = ∂L/∂z₂ · ∂FFN/∂z₁ + ∂L/∂z₂ // residual: gradient splits into sublayer path + identity path
Residual ensures identity path: ∂L/∂x has additive term 1 from skip

BERT vs GPT vs T5 — Transformer Variants

Model | Architecture | Training Objective | Attention Mask | Best For
BERT | Encoder only | MLM (15% mask) + NSP | Bidirectional (full) | Classification, NER, QA (understanding)
GPT family | Decoder only | Causal LM (next token) | Causal (left-only) | Generation, chat, code completion
T5 | Encoder-Decoder | Span corruption (replace spans) | Enc: full; Dec: causal | Seq2seq tasks: translation, summarization
XLNet | Encoder (AR) | Permutation LM | Permuted | Bidirectional + autoregressive
RoBERTa | Encoder only | MLM (no NSP, dynamic mask) | Bidirectional | Better BERT (more data, better training)
BERT MLM Loss
L = −Σ_{masked} log P(xᵢ | x\{xᵢ}; θ)
GPT Causal LM Loss
L = −Σᵢ log P(xᵢ | x₁,...,xᵢ₋₁; θ)

Positional Encoding — Types

Type | Method | Can Extrapolate? | Used In
Sinusoidal (absolute) | Fixed sin/cos of position | Moderate | Original Transformer
Learned (absolute) | Trainable embedding per position | No (up to max_len) | BERT, GPT-2
RoPE (relative rotary) | Rotate Q,K by position angle | Yes (good) | GPT-NeoX, LLaMA
ALiBi (relative bias) | Linear bias to attention logits: −|i−j|·m | Yes (best) | BLOOM, MPT
T5 Relative Bias | Learned bias per distance bucket | Moderate | T5, mT5
RoPE — Rotary Position Embedding
q'ₘ = Rₘ qₘ, k'ₙ = Rₙ kₙ
qₘᵀkₙ = qₘᵀ Rₘᵀ Rₙ kₙ = f(m−n) // depends only on relative offset
Rₘ: rotation matrix by angle m·θ (block-diagonal 2D rotations)
10

Optimization for Deep Models

SGD + Momentum

Vanilla SGD
g = ∇_θ L(θ; mini-batch)
θ ← θ − α · g
SGD + Momentum (Polyak)
v ← β v − α g // v: velocity (β≈0.9)
θ ← θ + v
Accumulates past gradients; dampens oscillations; escapes shallow local minima; β=0.9 typical
Nesterov Accelerated Gradient (NAG)
v ← β v − α ∇L(θ + βv) // gradient at "lookahead" position
θ ← θ + v
Look ahead where momentum will take you, then compute gradient there → smarter correction
Momentum = heavy ball rolling; accumulates speed in gradient direction; dampens oscillations in high-curvature directions

Adaptive Optimizers

AdaGrad
Gₜ ← Gₜ₋₁ + gₜ² // accumulated squared gradients
θₜ ← θₜ₋₁ − (α/√(Gₜ+ε)) · gₜ
Rare features get large updates; learning rate monotonically decreases → eventually stops learning
RMSprop
vₜ ← ρ vₜ₋₁ + (1−ρ) gₜ² // exponential moving average; ρ≈0.9
θₜ ← θₜ₋₁ − (α/√(vₜ+ε)) · gₜ
Fixes AdaGrad's monotonic decay; ε: numerical stability (1e-8)
Adam — Adaptive Moment Estimation
mₜ ← β₁ mₜ₋₁ + (1−β₁) gₜ // 1st moment (momentum; β₁≈0.9)
vₜ ← β₂ vₜ₋₁ + (1−β₂) gₜ² // 2nd moment (RMSprop; β₂≈0.999)
m̂ₜ ← mₜ / (1−β₁ᵗ) // bias correction
v̂ₜ ← vₜ / (1−β₂ᵗ) // bias correction
θₜ ← θₜ₋₁ − α · m̂ₜ / (√v̂ₜ + ε)
Defaults: α=1e-3, β₁=0.9, β₂=0.999, ε=1e-8
Bias correction critical at early steps (m₀=v₀=0)
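A minimal NumPy sketch of one Adam step with bias correction; the moment buffers m, v and the step count t are carried by the caller, and names are illustrative.
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m, v: running 1st/2nd moments; t: 1-based step count."""
    m = b1 * m + (1 - b1) * g               # 1st moment (momentum)
    v = b2 * v + (1 - b2) * g ** 2          # 2nd moment (RMS)
    m_hat = m / (1 - b1 ** t)               # bias correction (m0 = v0 = 0)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v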
AdamW — Adam + Decoupled Weight Decay
θₜ ← θₜ₋₁ − α · [m̂ₜ/(√v̂ₜ+ε) + λθₜ₋₁]
Standard Adam: L2 reg ≠ weight decay (adaptive lr corrupts it). AdamW decouples → better generalization. Default for Transformers.
Adam = momentum + adaptive per-parameter learning rates. AdamW = Adam but correct regularization.

Learning Rate Schedules

Step Decay
α(t) = α₀ · γ^{⌊t/step_size⌋} // γ≈0.1 every k epochs
Cosine Annealing
α(t) = α_min + ½(α_max − α_min)(1 + cos(πt/T))
Smooth decay to α_min; can restart (SGDR: reset to α_max periodically)
Warmup + Decay (Transformer default)
α(t) = d^{−0.5} · min(t^{−0.5}, t · warmup_steps^{−1.5})
Linear warmup for warmup_steps, then √t decay. Warmup: stabilize early training with small gradients before adaptive estimates are reliable.
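A minimal sketch of the warmup-then-inverse-sqrt schedule above, in plain Python; the function name is illustrative.
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Vaswani et al. schedule: linear warmup, then 1/sqrt(step) decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)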
Cyclical Learning Rate (CLR)
α(t) = α_min + (α_max − α_min) · max(0, 1 − |t/stepsize − 2q + 1|), q = ⌊1 + t/(2·stepsize)⌋ (cycle index)
Triangular cycles between min/max lr; may escape saddle points
Schedule | Best For
Step decay | CV tasks; well-tuned baselines
Cosine annealing | General DL; CNNs
Warmup + cosine | Transformers, LLMs
ReduceLROnPlateau | When val loss stagnates; simple baseline

Optimizer Comparison

Optimizer | Memory | Convergence | Best For
SGD | O(p) | Slow, noisy | CV (with momentum, tuned lr); best generalization
SGD+Momentum | O(2p) | Faster, smooth | CV, ResNets
AdaGrad | O(2p) | Good (sparse) | Sparse features, NLP (old)
RMSprop | O(2p) | Good | RNNs, online
Adam | O(3p) | Fast | Default for NLP/Transformers
AdamW | O(3p) | Fast + generalize | Transformers, LLMs (default)
Adam often converges faster but generalizes worse than SGD+momentum on image classification. SGD still preferred for ResNet/ImageNet. AdamW = standard for LLMs.

Loss Landscape & Optimization Issues

Saddle Points vs Local Minima
Saddle: ∂²L/∂θᵢ² >0 for some i, <0 for others → gradient=0 but not minimum
In high-dim: local minima rare; saddle points common
SGD noise helps escape saddle points

Batch Size Effects

Property | Small Batch (8-64) | Large Batch (512-8K)
Gradient noise | High → implicit regularization | Low → sharp minima risk
GPU utilization | Low | High
Steps per epoch | Many (more updates) | Few
Generalization | Often better | Often worse (sharp minima)
LR scaling | Baseline lr | Linear scaling rule: lr = base_lr × B/B₀
Linear Scaling Rule (Goyal et al.)
If base lr = α₀ with batch B₀, use lr = α₀ · B/B₀ for batch B
With warmup for first ~5 epochs to avoid instability
11

Regularization for Deep Models

L1 / L2 Weight Regularization

L2 (Weight Decay)
J(θ) = L(θ) + (λ/2) Σᵢ θᵢ²
∂J/∂θ = ∂L/∂θ + λθ
Update: θ ← θ(1 − αλ) − α ∂L/∂θ // shrinks toward 0
L1
J(θ) = L(θ) + λ Σᵢ |θᵢ|
∂J/∂θ = ∂L/∂θ + λ sign(θ)
Subgradient at 0; promotes sparsity; less common in DL than L2
Adam + L2 ≠ weight decay (adaptive lr rescales penalty). Use AdamW for correct weight decay.

Dropout

Forward Pass (training)
mask ~ Bernoulli(p) // p = keep prob (e.g., 0.8)
ã = mask ⊙ a / p // inverted dropout: scale by 1/p
Inverted dropout: no change needed at test time (just use a directly)
Forward Pass (inference)
ã = a // no dropout; all neurons active
// Inverted dropout: already normalized during training
Backward Pass
∂L/∂a = mask ⊙ (∂L/∂ã) / p // same mask as forward
Each training step: different subnet. Inference: implicit ensemble of all subnets.
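A minimal NumPy sketch of inverted dropout for the forward pass; the returned mask would be reused in the backward pass, and names are illustrative.
import numpy as np

def dropout_forward(a, keep_prob=0.8, training=True, rng=None):
    """Inverted dropout: scale by 1/keep_prob at train time so inference is a no-op."""
    if not training:
        return a, None                       # inference: all neurons active, no scaling
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(a.shape) < keep_prob   # Bernoulli(keep_prob) mask
    return a * mask / keep_prob, mask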

Batch Normalization

Forward Pass (training, per mini-batch)
μ_B = (1/m) Σᵢ zᵢ // batch mean
σ²_B = (1/m) Σᵢ (zᵢ − μ_B)² // batch variance
ẑᵢ = (zᵢ − μ_B) / √(σ²_B + ε) // normalize
yᵢ = γ ẑᵢ + β // scale + shift (learned)
Forward Pass (inference)
Use running averages (EMA over training batches):
μ_running ← α μ_running + (1−α) μ_B
σ²_running ← α σ²_running + (1−α) σ²_B
ŷ = γ(z − μ_running)/√(σ²_running + ε) + β
Backward Pass
∂L/∂ẑᵢ = ∂L/∂yᵢ · γ
∂L/∂σ²_B = Σᵢ ∂L/∂ẑᵢ · (zᵢ−μ_B) · −½(σ²_B+ε)^{−3/2}
∂L/∂μ_B = Σᵢ ∂L/∂ẑᵢ · −1/√(σ²_B+ε) + ∂L/∂σ²_B · Σᵢ −2(zᵢ−μ_B)/m
∂L/∂zᵢ = ∂L/∂ẑᵢ/√(σ²_B+ε) + ∂L/∂σ²_B · 2(zᵢ−μ_B)/m + ∂L/∂μ_B/m
∂L/∂γ = Σᵢ ∂L/∂yᵢ · ẑᵢ; ∂L/∂β = Σᵢ ∂L/∂yᵢ
Normalizes inputs to each layer → smoother loss landscape; higher lr; acts as regularizer (batch statistics = noise); reduces internal covariate shift
Batch size <8: batch stats noisy → use GroupNorm or LayerNorm instead. Incompatible with online learning.
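A minimal NumPy sketch of the BatchNorm forward pass for a (batch, features) input, covering training (batch stats + EMA update) and inference (running stats); names are illustrative.
import numpy as np

def batchnorm_forward(z, gamma, beta, running_mu, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    """Normalize over the batch axis (axis 0); returns output and updated running stats."""
    if training:
        mu = z.mean(axis=0)
        var = z.var(axis=0)
        running_mu = momentum * running_mu + (1 - momentum) * mu     # EMA for inference
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mu, running_var                            # fixed stats at test time
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta, running_mu, running_var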

Normalization Methods — Comparison

Method | Normalizes Over | Learned Params | Best For | Issue
Batch Norm | Batch (N) + spatial (HW) per channel | γ, β per channel | CNNs, large batch CV | Fails small batch; train/test discrepancy
Layer Norm | Features (C, H, W) per sample | γ, β per feature | NLP, Transformers, RNNs | Weaker channel normalization for CNN
Instance Norm | Spatial (HW) per channel per sample | γ, β per channel | Style transfer, GAN | Loses global statistics
Group Norm | Group of channels + spatial per sample | γ, β per group | CV with small batch, detection | Hyperparameter: num_groups
Layer Norm Equations
μ = (1/d) Σⱼ aⱼ; σ² = (1/d) Σⱼ(aⱼ − μ)²
LN(a) = γ ⊙ (a−μ)/√(σ²+ε) + β
Same equation as BN but over feature dim (not batch dim) → works for batch=1
Rule of thumb: BN for CNNs with large batch; LN for Transformers/RNNs; GN for detection/segmentation with small batch

Data Augmentation

Domain | Technique | Effect
Image | Random crop, flip, color jitter, rotate, cutout | Spatial + photometric invariance
Image | MixUp: x̃ = λxᵢ + (1−λ)xⱼ, ỹ = λyᵢ + (1−λ)yⱼ | Smoother decision boundary; calibration
Image | CutMix: paste random patch from xⱼ into xᵢ | Better than MixUp for detection
Image | AutoAugment / RandAugment: learned/random policy | State-of-art augmentation
Text | Synonym replace, back-translation, random insert/delete | Lexical diversity
Audio | SpecAugment: mask freq/time bands in spectrogram | Speech robustness; default for ASR
MixUp Loss
L = λ · CE(f(x̃), yᵢ) + (1−λ) · CE(f(x̃), yⱼ) where λ ~ Beta(α,α), α=0.2 typical

Early Stopping & Other Regularizers

Label Smoothing — Effect on Loss
L_LS = (1−ε) · H(y,p) + ε · H(u,p)
where u = uniform = 1/K
= H(y,p) − ε·H(y,p) + ε/K·Σlog(1/pₖ)
Penalizes overconfident predictions; reduces gap between max logit and others; improves calibration
12

Additional Key Topics

12.1

Batch Norm vs Layer Norm — Detailed Gradient Flow

Normalization — Gradient Effects

BN Position Conventions
Original: Conv → BN → ReLU (Post-activation BN)
Modern best practice: Pre-activation (He et al. 2016): BN → ReLU → Conv
Pre-activation: gradient passes through identity before activation → cleaner backprop

Generative Models — Brief Overview

Model | Objective | Inference | Pros / Cons
VAE | ELBO = E[log p(x|z)] − KL(q(z|x)||p(z)) | z~q(z|x) → decode | Fast; blurry samples; smooth latent space
GAN | min_G max_D E[logD(x)] + E[log(1−D(G(z)))] | z~N(0,I) → G(z) | Sharp samples; mode collapse; unstable training
Diffusion | ELBO over noise schedule; predict ε from xₜ | Iterative denoising (T steps) | Best quality; slow sampling; no mode collapse
Flow | Exact log-likelihood via change of variables | Invertible transform z=f(x) | Exact likelihood; restricted architectures
Diffusion Forward Process
q(xₜ|xₜ₋₁) = N(xₜ; √(1−βₜ)xₜ₋₁, βₜI)
q(xₜ|x₀) = N(xₜ; √ᾱₜ x₀, (1−ᾱₜ)I) where ᾱₜ = Πₛ₌₁ᵗ(1−βₛ)

DL Training Recipe — Interview Checklist

Gradient Check
f'(θ) ≈ [L(θ+ε) − L(θ−ε)] / 2ε (central difference)
relative error = ||f'_approx − f'_backprop|| / (||f'_approx|| + ||f'_backprop||)
Accept if < 1e-5; investigate if > 1e-3
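A minimal NumPy sketch of the central-difference gradient check above; loss_fn and the backprop gradient analytic_grad are supplied by the caller, and names are illustrative.
import numpy as np

def grad_check(loss_fn, theta, analytic_grad, eps=1e-5):
    """Compare numeric central-difference gradient of loss_fn(theta) to analytic_grad."""
    numeric = np.zeros_like(theta)
    it = np.nditer(theta, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        orig = theta[idx]
        theta[idx] = orig + eps; plus = loss_fn(theta)    # L(theta + eps)
        theta[idx] = orig - eps; minus = loss_fn(theta)   # L(theta - eps)
        theta[idx] = orig                                 # restore
        numeric[idx] = (plus - minus) / (2 * eps)
        it.iternext()
    rel_err = (np.linalg.norm(numeric - analytic_grad)
               / (np.linalg.norm(numeric) + np.linalg.norm(analytic_grad) + 1e-12))
    return rel_err          # accept if < 1e-5; investigate if > 1e-3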
QR

Quick Reference

Architecture Selection Guide

Task | Primary Choice | Alternative | Notes
Image classification | ResNet-50 / EfficientNet | ViT (large data) | Transfer from ImageNet always
Object detection | YOLOv8 / Faster R-CNN | DETR (Transformer) | YOLO for speed; DETR for simplicity
Semantic segmentation | UNet (medical) / DeepLabV3+ | SegFormer | Skip connections critical
Text classification | BERT fine-tune | DistilBERT (fast) | Always fine-tune, don't train from scratch
Text generation | GPT-style decoder | T5 (seq2seq) | GPT for open-ended; T5 for structured output
Machine translation | Transformer (encoder-decoder) | mBART, NLLB | Pretrained MT models >> training from scratch
Time-series forecast | Temporal Conv Net / Transformer | LSTM | Patching (PatchTST) competitive with Transformers
Tabular data | XGBoost / LightGBM | TabNet, FT-Transformer | DL rarely beats GBDT on tabular; try both

Complexity Reference

Layer | Params | FLOPs (forward)
Linear (d→h) | dh + h | 2dh
Conv (C_in, C_out, f×f) | C_out(C_in·f²+1) | 2·C_in·C_out·f²·H_out·W_out
MultiHead Attn (d,h) | 4d² | 8n·d² + 4n²·d
FFN (d,4d) | 8d²+5d | 16n·d²
BN / LN | 2·features | ~6·features·n
LSTM (d→h) | 4(h²+hd+h) | 8h(h+d) per step

Common Interview Questions

Gradient Computation Summary

Layer | ∂L/∂z (upstream δ) | Key Rule
Linear z=Wx+b | δ = Wᵀ·δ_next | ∂W = δ·xᵀ
ReLU a=max(0,z) | δ = δ_next ⊙ 𝟙(z>0) | Gate by forward mask
Sigmoid a=σ(z) | δ = δ_next ⊙ a(1−a) | Derivative = output × (1−output)
Tanh a=tanh(z) | δ = δ_next ⊙ (1−a²) | Derivative = 1−output²
Softmax+CCE | δ = a − y | Clean — loss cancels derivative
BN (simplified) | Complex (see BN section) | Involves batch sum terms
MaxPool | δ flows to argmax | Switch from forward pass