Core Idea
- Binary classifier: f(x,w,b) = sign(wᵀx + b)
- Goal: find hyperplane that maximizes margin between classes
- Margin: distance between hyperplane and nearest points of each class
- Support vectors: data points closest to hyperplane (on margin boundary)
- Only support vectors determine the decision boundary
Hard-Margin SVM (Primal)
min_{w,b} ½‖w‖²
s.t. yᵢ(wᵀxᵢ + b) ≥ 1 ∀i
Margin width = 2/‖w‖
Support vectors satisfy yᵢ(wᵀxᵢ+b) = 1 (equivalently wᵀxᵢ+b = ±1)
- Requires perfectly linearly separable data
- Maximizing margin = minimizing ‖w‖²
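A minimal numeric check of the geometry above, with a hand-picked separator on toy data (the data, w, and b here are illustrative, not the output of a solver):

```python
import numpy as np

# Toy separable data: positive class on the right, negative on the left.
X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A max-margin separator for this data (hand-picked): w = [1, 0], b = 0.
w, b = np.array([1.0, 0.0]), 0.0

# Functional margins y_i (w^T x_i + b): support vectors achieve exactly 1.
margins = y * (X @ w + b)
print(margins)               # points 0 and 2 sit on the margin boundary

# Margin width between the two classes: 2 / ||w||.
width = 2.0 / np.linalg.norm(w)
print(width)
```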
Soft-Margin SVM (with C)
min_{w,b,ξ} ½‖w‖² + C·Σᵢ ξᵢ
s.t. yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0
Hinge loss form (unconstrained):
min_{w,b} ½‖w‖² + C·Σᵢ max(0, 1−yᵢ(wᵀxᵢ+b))
- ξᵢ: slack variable — allows misclassification
- Large C: slack penalized heavily → narrower margin, fits training data closely (risk of overfitting)
- Small C: more slack tolerated → wider margin, smoother boundary (risk of underfitting)
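The unconstrained hinge-loss form lends itself to subgradient descent. A sketch in NumPy on toy data; the learning rate, epoch count, and dataset are ad-hoc choices for illustration, not a production solver:

```python
import numpy as np

# Subgradient descent on 0.5*||w||^2 + C * sum(max(0, 1 - y_i (w^T x_i + b))).
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                    # points violating the margin
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
print(pred)                                     # recovers the training labels
```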
Dual Formulation
max_α Σᵢ αᵢ − ½ Σᵢⱼ αᵢαⱼyᵢyⱼ xᵢᵀxⱼ
s.t. 0 ≤ αᵢ ≤ C, Σᵢ αᵢyᵢ = 0
Recover: w = Σᵢ αᵢyᵢxᵢ
Predict: f(x) = sign(Σᵢ αᵢyᵢ xᵢᵀx + b)
- αᵢ = 0 → non-support vector (no effect on the decision boundary)
- αᵢ > 0 → support vector (0 < αᵢ < C: exactly on the margin; αᵢ = C: inside the margin or misclassified)
- Dual depends on data only through inner products xᵢᵀxⱼ → enables kernel trick
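How the dual quantities map back to a predictor, using hand-computed multipliers for a two-point toy problem (the αᵢ here are illustrative, not obtained from a QP solver):

```python
import numpy as np

X = np.array([[1.0, 0.0], [-1.0, 0.0]])   # the two support vectors
y = np.array([1, -1])
alpha = np.array([0.5, 0.5])              # satisfies sum(alpha_i * y_i) = 0

w = (alpha * y) @ X                       # w = sum_i alpha_i y_i x_i
b = y[0] - w @ X[0]                       # from y_1 (w^T x_1 + b) = 1
print(w, b)                               # [1. 0.] 0.0

# Kernelized prediction: never forms w, only inner products with the data.
def predict(x):
    return np.sign(np.sum(alpha * y * (X @ x)) + b)

print(predict(np.array([3.0, -2.0])))
```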
Kernel Trick
- Replace xᵢᵀxⱼ with K(xᵢ,xⱼ) = φ(xᵢ)ᵀφ(xⱼ) — implicit high-dim feature map
- Never compute φ(x) explicitly — only need kernel evaluations
Linear: K(xᵢ,xⱼ) = xᵢᵀxⱼ
Polynomial: K(xᵢ,xⱼ) = (1 + xᵢᵀxⱼ)ᵖ
RBF/Gaussian: K(xᵢ,xⱼ) = exp(−‖xᵢ−xⱼ‖²/(2σ²))
Sigmoid: K(xᵢ,xⱼ) = tanh(β₀xᵢᵀxⱼ + β₁)
A valid kernel must correspond to an inner product in some feature space; by Mercer's condition, the Gram matrix [K(xᵢ,xⱼ)] must be positive semidefinite on every dataset (the sigmoid kernel satisfies this only for certain β₀, β₁)
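Mercer's condition can be spot-checked numerically: a valid kernel's Gram matrix is positive semidefinite on any sample. A sketch with arbitrary σ and p values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

def rbf_kernel(X, sigma=1.0):
    # Pairwise squared distances, then exp(-d^2 / (2 sigma^2)).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def poly_kernel(X, p=3):
    return (1 + X @ X.T) ** p

checks = []
for K in (rbf_kernel(X), poly_kernel(X)):
    eigs = np.linalg.eigvalsh(K)          # K is symmetric
    checks.append(eigs.min() >= -1e-8)    # PSD up to numerical noise
print(checks)
```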
SVM: Hard vs Soft vs Kernel — Quick Reference
| | Hard Margin | Soft Margin | Kernel SVM |
|---|---|---|---|
| Data requirement | Linearly separable | Noisy / overlapping | Non-linearly separable |
| Objective | min ½‖w‖² | min ½‖w‖² + CΣξᵢ | min ½‖w‖² + CΣξᵢ with K |
| Hyperparameters | None | C (regularization) | C + kernel params (σ, p, …) |
| Constraint violation | Not allowed | Allowed (penalized) | Allowed (penalized) |
| Decision boundary | Linear | Linear | Non-linear in input space |