| Classical ML | Deep Learning |
|---|---|
| Manual feature engineering | Learned features end-to-end |
| Works well on small data | Needs large data (or transfer) |
| Interpretable (trees, LR) | Black box (need SHAP/GradCAM) |
| Tabular: XGB wins | Images / text / speech: DL wins |
| Fast train + predict | Slow train; fast predict (GPU) |

| Architecture | Inductive Bias | Best For |
|---|---|---|
| MLP / DFNN | None (fully connected) | Tabular, small structured |
| CNN | Translation equivariance, local receptive field | Images, time-series, audio |
| RNN / LSTM | Sequential order, shared weights over time | Text, speech, time-series |
| Transformer | Global attention, permutation equivariant | NLP, vision (ViT), multimodal |
| GNN | Message passing over graph structure | Molecules, social graphs |
| VAE / GAN / Diffusion | Latent space structure | Generative modeling |

| Component | Biological Analogy | Role |
|---|---|---|
| Input xᵢ | Dendrites / synapse input | Receive signals |
| Weight wᵢ | Synaptic strength | Scale each input |
| Bias b | Neuron firing threshold | Shift activation boundary |
| Σ (dot product) | Cell body (soma) summation | Aggregate inputs |
| Activation g | Action potential fire/no-fire | Non-linear transformation |
| Output a | Axon output signal | Pass to next layer |
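
Read top to bottom, the table is a single forward computation. A minimal NumPy sketch (the weights, bias, and inputs are illustrative values, not from a trained model):

```python
import numpy as np

def neuron(x, w, b, g=np.tanh):
    """One artificial neuron: aggregate weighted inputs (soma), then activate (fire)."""
    z = np.dot(w, x) + b    # Σ wᵢxᵢ + b
    return g(z)             # non-linear output passed to the next layer

x = np.array([0.5, -1.0, 2.0])   # incoming signals
w = np.array([0.1, 0.4, -0.2])   # synaptic strengths
a = neuron(x, w, b=0.05)
```
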

| Property | Perceptron | Logistic Reg | SVM |
|---|---|---|---|
| Output | Binary ±1 | Probability [0,1] | Binary ±1 |
| Loss | 0-1 (implicit) | Cross-entropy | Hinge loss |
| Update | Only on mistakes | Always (gradient) | Only on SVs |
| Boundary | Any separating | MLE boundary | Max margin |
| Probabilistic | No | Yes | No (needs Platt) |
| Convergence | If separable only | Always (convex) | Always (convex) |
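
The "only on mistakes" update rule is a few lines; a sketch on a toy linearly separable dataset (the data below is invented for illustration):

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Update w, b only on misclassified points; labels y are in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # wrong side of (or on) the boundary
                w, b = w + yi * xi, b + yi
                mistakes += 1
        if mistakes == 0:                # converged — guaranteed only if separable
            break
    return w, b

# Toy separable data: the sign of x₁ decides the class
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
```
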

| Task | Output Activation | Loss Function | ∂L/∂z |
|---|---|---|---|
| Regression | Linear (none) | MSE = ½(ŷ−y)² | ŷ − y |
| Regression (bounded) | Sigmoid × range | MSE | (ŷ−y)·range·σ'(z) via chain rule |
| Binary classification | Sigmoid σ(z) | BCE | a − y |
| Multiclass (excl.) | Softmax | Cat. Cross-Entropy | a − y (vec) |
| Multilabel | Sigmoid (per class) | BCE (per class sum) | aₖ − yₖ each |
| Counting / Poisson | exp(z) or softplus | Poisson NLL | exp(z) − y |
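
The clean ∂L/∂z = a − y entry for softmax + cross-entropy can be verified numerically with a finite-difference check:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift by max for numerical stability
    return e / e.sum()

def cce(z, y):
    return -np.log(softmax(z) @ y)   # cross-entropy, y one-hot

z = np.array([2.0, 0.5, -1.0])
y = np.array([0.0, 1.0, 0.0])        # true class = index 1

analytic = softmax(z) - y            # table: ∂L/∂z = a − y
eps = 1e-6
numeric = np.array([(cce(z + eps * np.eye(3)[i], y) -
                     cce(z - eps * np.eye(3)[i], y)) / (2 * eps)
                    for i in range(3)])
```
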

| Activation | Formula | Derivative g'(z) | Range | Key Notes |
|---|---|---|---|---|
| Sigmoid | σ(z) = 1/(1+e⁻ᶻ) | σ(z)(1−σ(z)) | (0,1) | Vanishing gradient for |z|≫0; saturates; not zero-centered; slow |
| Tanh | tanh(z) = (eᶻ−e⁻ᶻ)/(eᶻ+e⁻ᶻ) | 1 − tanh²(z) | (−1,1) | Zero-centered (better than sigmoid); still saturates; tanh(z)=2σ(2z)−1 |
| ReLU | max(0, z) | 0 if z<0; 1 if z>0 | [0,∞) | No saturation for z>0; sparse activation; dead neurons (z<0 forever) |
| Leaky ReLU | max(αz, z), α≈0.01 | α if z<0; 1 if z>0 | (−∞,∞) | Fixes dying ReLU; α is hyperparameter or learned (PReLU) |
| ELU | z if z≥0; α(eᶻ−1) if z<0 | 1 if z≥0; α·eᶻ if z<0 | (−α,∞) | Smooth at 0; negative outputs push mean activation toward zero; α≈1.0 |
| GELU | z·Φ(z) ≈ 0.5z(1+tanh[√(2/π)(z+0.044715z³)]) | Φ(z)+z·φ(z) | (≈−0.17,∞) | Transformer default; stochastic regularization interpretation; smooth everywhere |
| Swish | z · σ(βz) | σ(βz)+βz·σ(βz)(1−σ(βz)) | (≈−0.28,∞) | β=1 common; self-gated; outperforms ReLU on deep nets empirically |
| Softplus | log(1+eᶻ) | σ(z) | (0,∞) | Smooth approximation of ReLU; derivative is sigmoid; rarely used in hidden layers |
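
A quick finite-difference sanity check of the derivative column (GELU and Swish omitted to keep the sketch short):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# (function, analytic derivative) pairs from the table
acts = {
    "sigmoid":    (sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z))),
    "tanh":       (np.tanh, lambda z: 1 - np.tanh(z) ** 2),
    "relu":       (lambda z: np.maximum(0.0, z), lambda z: float(z > 0)),
    "leaky_relu": (lambda z: np.maximum(0.01 * z, z), lambda z: 0.01 if z < 0 else 1.0),
    "softplus":   (lambda z: np.log1p(np.exp(z)), sigmoid),  # derivative is exactly σ(z)
}

# Compare each analytic derivative against a central difference at a sample point
z0, eps = 0.7, 1e-6
errors = {name: abs((f(z0 + eps) - f(z0 - eps)) / (2 * eps) - df(z0))
          for name, (f, df) in acts.items()}
```
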

| Problem | Symptom | Fix |
|---|---|---|
| Vanishing gradient | Early layers learn ~nothing; loss plateau | ReLU, skip connections, BatchNorm, LSTM/GRU gates |
| Exploding gradient | Loss NaN; weights blow up | Gradient clipping, smaller lr, weight decay, BatchNorm |
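
Gradient clipping by global norm is a few lines; this is a framework-free sketch of what utilities like PyTorch's `clip_grad_norm_` do:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """If the global L2 norm of all gradients exceeds max_norm, rescale them jointly."""
    total = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # no-op when already within budget
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]   # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
```

Joint rescaling preserves the gradient direction, unlike clipping each tensor independently.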

| Property | More Depth | More Width |
|---|---|---|
| Expressivity | Exponential gain (compositionality) | Linear gain |
| Parameters | Fewer for same power | More parameters |
| Optimization | Harder (more local minima, vanishing grad) | Easier (flatter landscape) |
| Inductive bias | Hierarchical feature reuse | Richer per-level features |
| Generalization | Better with regularization | Can overfit quickly |

| Layer Pattern | Effect |
|---|---|
| Conv + BN + ReLU | Standard modern block |
| CONV(3×3) × N → Pool | VGG-style |
| Concat([conv1×1, conv3×3, conv5×5, maxpool]) | Inception-style |
| x + F(x) (residual) | ResNet-style |
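
The residual pattern x + F(x) in a toy fully-connected form; with the last layer of F zero-initialized, the block starts out as the identity, which is part of why deep residual stacks stay trainable:

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x) with F = W2·ReLU(W1·x); the identity path carries the gradient."""
    return x + W2 @ np.maximum(0.0, W1 @ x)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1 = rng.standard_normal((4, 4))
W2 = np.zeros((4, 4))              # F ≡ 0 → the block is exactly the identity
y = residual_block(x, W1, W2)
```
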

| Architecture | Year | Key Innovation | Params | Architecture Summary |
|---|---|---|---|---|
| AlexNet | 2012 | ReLU over tanh (~6× faster training); Dropout; Data augmentation; Multi-GPU; Local Response Norm | 60M | 5 CONV [11,5,3,3,3] + 3 FC. Large filters, MaxPool after conv 1, 2, 5 |
| VGGNet-16 | 2014 | All 3×3 convs — two 3×3 = one 5×5 RF with fewer params and extra nonlinearity; very deep (16-19 layers) | 138M | 5 blocks of [CONV3×3 ×2-3 → MaxPool2×2] + 3 FC. Uniform design. |
| GoogLeNet (InceptionV1) | 2014 | Inception module: parallel 1×1, 3×3, 5×5 convs + maxpool; 1×1 bottleneck for dim reduction; GAP instead of FC; Auxiliary classifiers for gradient flow | 6.8M | 9 Inception modules + GAP + Softmax |
| InceptionV3 | 2016 | Factorize n×n → 1×n + n×1 (asymmetric); BatchNorm everywhere; Label smoothing | 23.8M | Factorized Inception + BN + GAP |
| ResNet-50 | 2015 | Residual (skip) connections: solve vanishing gradient, enable training 100+ layers; BN after each conv; GAP + single FC | 25M | 1 CONV + 4 stages of [bottleneck blocks: 1×1→3×3→1×1] + GAP + FC |
| ResNet Bottleneck | 2015 | 1×1 → 3×3 → 1×1: reduce channels for 3×3, then restore → 3× fewer params than plain 3×3 block | — | Residual: F(x)+x; dimension match via 1×1 projection if needed |

| Scenario | Strategy | What to Do |
|---|---|---|
| Small data + similar domain | Feature extraction | Freeze all; replace + train only top FC layers |
| Large data + similar domain | Fine-tune deeper | Unfreeze later conv blocks + retrain with small lr |
| Small data + different domain | Feature extraction | Freeze all; linear classifier on penultimate features |
| Large data + different domain | Full fine-tune | Initialize with pretrained; train all layers |
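
The freeze logic behind the feature-extraction rows, as a framework-free sketch (the parameter names are hypothetical, not a real checkpoint format):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical "pretrained" parameters; names and shapes are illustrative only.
params = {"backbone.W": rng.standard_normal((8, 4)),   # pretrained features → freeze
          "head.W":     rng.standard_normal((3, 8))}   # new task head → train
frozen = {"backbone.W"}

def sgd_step(params, grads, lr=0.1):
    """Feature extraction: update only unfrozen parameters."""
    return {k: v if k in frozen else v - lr * grads[k] for k, v in params.items()}

grads = {k: np.ones_like(v) for k, v in params.items()}
new = sgd_step(params, grads)
```

Fine-tuning deeper simply means removing later blocks from `frozen` (usually with a smaller learning rate for the pretrained weights).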

| Architecture | Key Idea | Use Case |
|---|---|---|
| ResNeXt | Grouped convolutions (32 groups) — wider + grouped | Better ResNet |
| DenseNet | Each layer connected to all subsequent layers: x_l = H([x₀,x₁,...,x_{l-1}]) | Dense feature reuse; small datasets |
| MobileNetV2 | Depthwise sep + inverted residuals (expand→DWConv→project) | Mobile / edge deployment |
| EfficientNet | Compound scaling: depth × width × resolution simultaneously via NAS | State-of-the-art efficiency tradeoff |
| ViT | Split image into patches → linear embed → Transformer encoder; no conv | Large-data image classification |

| Property | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Gates | 0 | 3 (i, f, o) + cell | 2 (z, r) |
| Hidden states | hₜ only | hₜ + cₜ (cell) | hₜ only |
| Params (d=100,h=100) | ~20K | ~80K | ~60K |
| Long-range memory | Poor (<10 steps) | Excellent (>1000 steps) | Good |
| Train speed | Fastest | Slowest | Middle |
| When to use | Very short sequences | Complex long sequences; when cₜ semantics matter | Default; less data; faster experiments |
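
The parameter counts in the table follow from one formula: each gate needs an input matrix, a recurrent matrix, and a bias:

```python
def cell_params(d, h, gates):
    """Each gate: input matrix (h×d) + recurrent matrix (h×h) + bias (h)."""
    return gates * (h * d + h * h + h)

d = h = 100
counts = {"RNN":  cell_params(d, h, 1),   # single candidate update
          "GRU":  cell_params(d, h, 3),   # z, r, candidate
          "LSTM": cell_params(d, h, 4)}   # i, f, o, cell candidate
```
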

| Model | Architecture | Training Objective | Attention Mask | Best For |
|---|---|---|---|---|
| BERT | Encoder only | MLM (15% mask) + NSP | Bidirectional (full) | Classification, NER, QA (understanding) |
| GPT family | Decoder only | Causal LM (next token) | Causal (left-only) | Generation, chat, code completion |
| T5 | Encoder-Decoder | Span corruption (replace spans) | Enc: full; Dec: causal | Seq2seq tasks: translation, summarization |
| XLNet | Encoder (AR) | Permutation LM | Permuted | Bidirectional + autoregressive |
| RoBERTa | Encoder only | MLM (no NSP, dynamic mask) | Bidirectional | Better BERT (more data, better training) |

| Type | Method | Can Extrapolate? | Used In |
|---|---|---|---|
| Sinusoidal (absolute) | Fixed sin/cos of position | Moderate | Original Transformer |
| Learned (absolute) | Trainable embedding per position | No (up to max_len) | BERT, GPT-2 |
| RoPE (relative rotary) | Rotate Q,K by position angle | Yes (good) | GPT-NeoX, LLaMA |
| ALiBi (relative bias) | Linear bias to attention logits: −|i−j|·m | Yes (best) | BLOOM, MPT |
| T5 Relative Bias | Learned bias per distance bucket | Moderate | T5, mT5 |
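
The sinusoidal row is the only one with a simple closed form; a sketch of the original Transformer's encoding:

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d, 2)[None, :]
    angles = pos / (10000.0 ** (i / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)    # even dims: sine
    pe[:, 1::2] = np.cos(angles)    # odd dims: cosine
    return pe

pe = sinusoidal_pe(max_len=50, d=16)
```
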

| Schedule | Best For |
|---|---|
| Step decay | CV tasks; well-tuned baselines |
| Cosine annealing | General DL; CNNs |
| Warmup + cosine | Transformers, LLMs |
| ReduceLROnPlateau | When val loss stagnates; simple baseline |
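
Warmup + cosine in one function, a minimal sketch of the Transformer/LLM default (the step counts and peak learning rate below are illustrative):

```python
import math

def warmup_cosine(step, warmup_steps, total_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

lrs = [warmup_cosine(s, warmup_steps=100, total_steps=1000, peak_lr=3e-4)
       for s in range(1000)]
```
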

| Optimizer | Memory | Convergence | Best For |
|---|---|---|---|
| SGD | O(p) | Slow, noisy | CV (with momentum, tuned lr); best generalization |
| SGD+Momentum | O(2p) | Faster, smooth | CV, ResNets |
| AdaGrad | O(2p) | Good (sparse) | Sparse features, NLP (old) |
| RMSprop | O(2p) | Good | RNNs, online |
| Adam | O(3p) | Fast | Default for NLP/Transformers |
| AdamW | O(3p) | Fast + generalize | Transformers, LLMs (default) |

| Property | Small Batch (8-64) | Large Batch (512-8K) |
|---|---|---|
| Gradient noise | High → implicit regularization | Low → sharp minima risk |
| GPU utilization | Low | High |
| Steps per epoch | Many (more updates) | Few |
| Generalization | Often better | Often worse (sharp minima) |
| LR scaling | Baseline lr | Linear scaling rule: lr = base_lr × B/B₀ |

| Method | Normalizes Over | Learned Params | Best For | Issue |
|---|---|---|---|---|
| Batch Norm | Batch (N) + spatial (HW) per channel | γ, β per channel | CNNs, large batch CV | Fails small batch; train/test discrepancy |
| Layer Norm | Features (C, H, W) per sample | γ, β per feature | NLP, Transformers, RNNs | Weaker channel normalization for CNN |
| Instance Norm | Spatial (HW) per channel per sample | γ, β per channel | Style transfer, GAN | Loses global statistics |
| Group Norm | Group of channels + spatial per sample | γ, β per group | CV with small batch, detection | Hyperparameter: num_groups |
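
For the two most common methods, the whole difference is the axis being normalized; a minimal sketch with the learned γ, β omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per sample, over the feature axis — the Transformer/NLP choice."""
    mu, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    """Per feature, over the batch axis — training-mode statistics."""
    mu, var = x.mean(axis=0, keepdims=True), x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).standard_normal((8, 16))   # (batch, features)
ln, bn = layer_norm(x), batch_norm(x)
```

Because Layer Norm never touches the batch axis, it behaves identically at train and test time and with batch size 1 — the Batch Norm failure cases noted above.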

| Domain | Technique | Effect |
|---|---|---|
| Image | Random crop, flip, color jitter, rotate, cutout | Spatial + photometric invariance |
| Image | MixUp: x̃ = λxᵢ + (1−λ)xⱼ, ỹ = λyᵢ + (1−λ)yⱼ | Smoother decision boundary; calibration |
| Image | CutMix: paste random patch from xⱼ into xᵢ | Better than MixUp for detection |
| Image | AutoAugment / RandAugment: learned/random policy | State-of-the-art augmentation |
| Text | Synonym replace, back-translation, random insert/delete | Lexical diversity |
| Audio | SpecAugment: mask freq/time bands in spectrogram | Speech robustness; default for ASR |
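
The MixUp row in code; α and the toy inputs are illustrative:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """MixUp: convex combination of two examples and their (one-hot) labels."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # λ ~ Beta(α, α)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x1, y1 = np.ones(4), np.array([1.0, 0.0])    # toy "image" + one-hot label
x2, y2 = np.zeros(4), np.array([0.0, 1.0])
xm, ym = mixup(x1, y1, x2, y2, rng=np.random.default_rng(0))
```
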

| Model | Objective | Inference | Pros / Cons |
|---|---|---|---|
| VAE | ELBO = E[log p(x|z)] − KL(q(z|x)||p(z)) | z~N(0,I) → decode (reconstruction: z~q(z|x)) | Fast; blurry samples; smooth latent space |
| GAN | min_G max_D E[logD(x)] + E[log(1−D(G(z)))] | z~N(0,I) → G(z) | Sharp samples; mode collapse; unstable training |
| Diffusion | ELBO over noise schedule; predict ε from xₜ | Iterative denoising (T steps) | Best quality; slow sampling; no mode collapse |
| Flow | Exact log-likelihood via change of variables | Invertible transform z=f(x) | Exact likelihood; restricted architectures |

| Task | Primary Choice | Alternative | Notes |
|---|---|---|---|
| Image classification | ResNet-50 / EfficientNet | ViT (large data) | Always transfer from ImageNet weights |
| Object detection | YOLOv8 / Faster R-CNN | DETR (Transformer) | YOLO for speed; DETR for simplicity |
| Semantic segmentation | UNet (medical) / DeepLabV3+ | SegFormer | Skip connections critical |
| Text classification | BERT fine-tune | DistilBERT (fast) | Always fine-tune, don't train from scratch |
| Text generation | GPT-style decoder | T5 (seq2seq) | GPT for open-ended; T5 for structured output |
| Machine translation | Transformer (encoder-decoder) | mBART, NLLB | Pretrained MT models >> training from scratch |
| Time-series forecast | Temporal Conv Net / Transformer | LSTM | Patching (PatchTST) competitive with Transformers |
| Tabular data | XGBoost / LightGBM | TabNet, FT-Transformer | DL rarely beats GBDT on tabular; try both |

| Layer | Params | FLOPs (forward) |
|---|---|---|
| Linear (d→h) | dh + h | 2dh |
| Conv (C_in, C_out, f×f) | C_out(C_in·f²+1) | 2·C_in·C_out·f²·H_out·W_out |
| MultiHead Attn (d,h) | 4d² | 8n·d² + 4n²·d |
| FFN (d,4d) | 8d²+5d | 16n·d² |
| BN / LN | 2·features | ~6·features·n |
| LSTM (d→h) | 4(h²+hd+h) | 8h(h+d) per step |
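
The table's parameter formulas as plain functions, handy for back-of-envelope sizing (d=512 is an illustrative dimension, biases included where the table includes them):

```python
def linear_params(d, h):    return d * h + h
def conv_params(ci, co, f): return co * (ci * f * f + 1)
def attn_params(d):         return 4 * d * d            # Wq, Wk, Wv, Wo projections
def ffn_params(d):          return 8 * d * d + 5 * d    # d → 4d → d, with biases
def lstm_params(d, h):      return 4 * (h * h + h * d + h)  # 4 gates

block_params = attn_params(512) + ffn_params(512)       # one Transformer block, d=512
```
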

| Layer | ∂L/∂z (upstream δ) | Key Rule |
|---|---|---|
| Linear z=Wx+b | δ = Wᵀ·δ_next | ∂W = δ·xᵀ |
| ReLU a=max(0,z) | δ = δ_next ⊙ 𝟙(z>0) | Gate by forward mask |
| Sigmoid a=σ(z) | δ = δ_next ⊙ a(1−a) | Derivative = output × (1−output) |
| Tanh a=tanh(z) | δ = δ_next ⊙ (1−a²) | Derivative = 1−output² |
| Softmax+CCE | δ = a − y | Clean — loss cancels derivative |
| BN (simplified) | Complex (see BN section) | Involves batch sum terms |
| MaxPool | δ flows to argmax | Switch from forward pass |
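
The rules above chain together into a full backward pass; the sketch below applies them to a tiny two-layer net (random illustrative weights) and checks one analytic gradient against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
y = np.array([0.0, 1.0])                 # one-hot target
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(W1, W2):
    a1 = 1.0 / (1.0 + np.exp(-(W1 @ x)))     # Linear → Sigmoid
    return -np.log(softmax(W2 @ a1) @ y)     # Linear → Softmax + CCE

# Backward pass, applying the table rules layer by layer
a1 = 1.0 / (1.0 + np.exp(-(W1 @ x)))
a2 = softmax(W2 @ a1)
d2 = a2 - y                              # Softmax+CCE: δ = a − y
dW2 = np.outer(d2, a1)                   # Linear: ∂W = δ·xᵀ
d1 = (W2.T @ d2) * a1 * (1.0 - a1)       # Linear backprop, then sigmoid gate a(1−a)
dW1 = np.outer(d1, x)

# Finite-difference check on a single weight of W1
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(Wp, W2) - loss(Wm, W2)) / (2 * eps)
```
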