Prob Stats · DS/ML Interview Notes

Quick Revision | Masters Notes

Contents: Statistics Basics · Probability · Bayes + NB · Random Variables · Distributions · CLT + Estimation · Hypothesis Testing · ANOVA · MLE · Regression · Time Series · GMM

01 · Statistics Basics

Data Types

  • Categorical: Nominal (no order), Ordinal (ordered)
  • Numerical: Discrete (countable), Continuous (measurable)
  • Scales: Nominal < Ordinal < Interval (no true 0) < Ratio (true 0)
  • Examples: Interval: temperature in Celsius · Ratio: weight, age

Central Tendency

  • Mean: Σx/n · affected by outliers · use for symmetric data
  • Median: Middle value · robust to outliers · 50th percentile
  • Mode: Most frequent · works for nominal data
  • Empirical skew rule: Mode ≈ 3·Median − 2·Mean

Skewness

  • Right skew: Mean > Median > Mode (long right tail)
  • Left skew: Mean < Median < Mode (long left tail)
  • Symmetric: Mean = Median = Mode
  • Use median for heavily skewed data

Variability / Spread

  • Range: Max − Min · unreliable (uses 2 points only)
  • Variance (σ²): Mean of squared deviations
  • SD (σ): √Variance · same units as data
  • IQR: Q3 − Q1 · robust to outliers
  • Sample variance uses n−1 (Bessel's correction) — avoids bias
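The n vs n−1 distinction maps directly onto Python's `statistics` module; a quick check with a made-up sample:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean = 5, sum of squared deviations = 32

# Population variance: divide by n
pop_var = statistics.pvariance(data)   # 32/8 = 4.0
# Sample variance: divide by n−1 (Bessel's correction)
samp_var = statistics.variance(data)   # 32/7 ≈ 4.571
```

`statistics.variance` applies Bessel's correction automatically; `statistics.pvariance` divides by n.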

5-Number Summary & Boxplot

  • Min · Q1 · Median · Q3 · Max
  • Outlier fence: Q1 − 1.5·IQR to Q3 + 1.5·IQR
  • Major outlier: beyond Q1 − 3·IQR or Q3 + 3·IQR
  • QD = IQR / 2 (variation in middle 50%)
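The summary and fences can be computed directly; a minimal sketch with made-up data (quantile conventions vary between libraries, so q1/q3 here use the `statistics` module's 'inclusive' interpolation):

```python
import statistics

data = [1, 3, 4, 5, 5, 6, 7, 8, 30]   # 30 is a suspect point

# Quartiles via the stdlib; 'inclusive' interpolates on (n−1)·p positions
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]   # [30]
```

Here q1=4, q3=7, so the fences are (−0.5, 11.5) and only 30 is flagged.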

02 · Probability

Definitions

  • Experiment: Uncertain outcome process
  • Sample Space S: All possible outcomes
  • Event: Subset of S
  • Axioms: P(S)=1, 0≤P(E)≤1, P(A∪B)=P(A)+P(B) if mutually exclusive

Key Rules

Addition: P(A∪B) = P(A)+P(B)−P(A∩B)
Complement: P(Aᶜ) = 1 − P(A)
Multiplication: P(A∩B) = P(A)·P(B|A)
Independence: P(A∩B) = P(A)·P(B)

ME vs Independent

              Mut. Exclusive   Independent
P(A∩B)        0                P(A)·P(B)
Both occur?   Never            Possible
Venn          No overlap       Overlap

Conditional Probability

P(B|A) = P(A∩B) / P(A)
P(A|B) = P(A∩B) / P(B)
  • Multiplication: P(A∩B∩C) = P(A)·P(B|A)·P(C|A∩B)

Total Probability

P(B) = Σ P(B|Aᵢ)·P(Aᵢ)
  • Aᵢ must be mutually exclusive & exhaustive (partition of S)
  • Classic use: disease test, binary channel, device types

03 · Bayes Theorem & Naive Bayes

Bayes Theorem

P(Eᵢ|A) = P(Eᵢ)·P(A|Eᵢ) / ΣP(Eⱼ)·P(A|Eⱼ)
P(A|B) = P(B|A)·P(A) / P(B)
  • Posterior = Prior × Likelihood / Evidence
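Posterior = Prior × Likelihood / Evidence is easiest to see numerically; a sketch with hypothetical disease-screening numbers (all values made up for illustration):

```python
prior = 0.01          # P(D): 1% prevalence
sensitivity = 0.95    # P(+|D)
false_pos = 0.05      # P(+|¬D)

# Evidence via total probability: P(+) = P(+|D)P(D) + P(+|¬D)P(¬D)
evidence = sensitivity * prior + false_pos * (1 - prior)
# Posterior via Bayes: P(D|+) = P(+|D)·P(D) / P(+)
posterior = sensitivity * prior / evidence
print(round(posterior, 3))   # 0.161 — low despite a "95% accurate" test
```

This is the classic interview trap: with a rare disease, most positives are false positives.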

Naive Bayes Classifier

  • Key assumption: Features are conditionally independent given class
  • P(C|X) ∝ P(C) · P(X|C)
  • P(C|X) ∝ P(C) · Π P(Xᵢ|C) (Conditional Independence)
  • Predict class with highest posterior
  • Laplace smoothing: Add 1 to all counts → avoids zero probability

Why Conditional Independence Matters

  • Without it: need all feature combinations → exponential params
  • With it: only n params per class (one per feature)
  • 5 binary features: 2⁵ = 32 joint combinations vs just 5 per-feature probabilities per class
  • Apps: spam filtering, sentiment analysis, document classification

NB Steps (Categorical)

  • 1. Build frequency table from training data
  • 2. Compute P(class) and P(feature|class)
  • 3. For new point: compute P(class) · Π P(xᵢ|class)
  • 4. Pick argmax class
  • If any P(feature|class)=0 → apply Laplace (+1)
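The four steps can be sketched end-to-end; the weather/temperature training data below is made up for illustration:

```python
from collections import Counter, defaultdict

# Toy training set: (weather, temp) → play?
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
     ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
y = ["no", "yes", "yes", "yes", "no", "no"]

classes = set(y)
class_counts = Counter(y)

# Steps 1-2: frequency tables → P(class) and P(feature=value | class)
feat_counts = defaultdict(Counter)      # (feature_idx, class) → value counts
for xi, ci in zip(X, y):
    for j, v in enumerate(xi):
        feat_counts[(j, ci)][v] += 1

def posterior_score(x, c):
    """Unnormalized P(c) · Π P(x_j | c), with Laplace (+1) smoothing."""
    score = class_counts[c] / len(y)
    for j, v in enumerate(x):
        values = {xi[j] for xi in X}    # distinct values of feature j
        score *= (feat_counts[(j, c)][v] + 1) / (class_counts[c] + len(values))
    return score

# Steps 3-4: argmax over classes for a new point
new_point = ("rainy", "cool")
pred = max(classes, key=lambda c: posterior_score(new_point, c))   # "yes"
```

The Laplace `+1` in the numerator (with `+|values|` in the denominator) is what prevents an unseen feature value from zeroing out the whole product.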

04 · Random Variables & Distributions

Discrete RV

  • PMF: f(x) = P(X=x), Σf(x)=1, f(x)≥0
  • CDF: F(x) = P(X≤x) = Σf(xᵢ) for xᵢ≤x
  • E[X] = Σ x·f(x)
  • Var[X] = E[X²] − (E[X])²

Continuous RV

  • PDF: f(x)≥0, ∫f(x)dx=1
  • P(a≤X≤b) = ∫ₐᵇ f(x)dx
  • P(X=exact value) = 0
  • E[X] = ∫x·f(x)dx
  • Var[X] = ∫(x−μ)²f(x)dx

Joint Distribution

  • f(x,y) = P(X=x, Y=y)
  • Marginal: f_X(x) = Σ_y f(x,y)
  • Conditional: f(x|y) = f(x,y)/f_Y(y)
  • Independence: f(x,y) = f_X(x)·f_Y(y)
  • Cov(X,Y) = E[XY] − E[X]·E[Y]

05 · Key Probability Distributions

Bernoulli

P(X=x) = pˣ·(1−p)^(1−x), x∈{0,1}
  • Single trial, 2 outcomes
  • E[X]=p, Var[X]=p(1−p)=pq
  • Special case of Binomial with n=1

Binomial B(n,p)

P(X=x) = C(n,x)·pˣ·qⁿ⁻ˣ
  • n fixed independent trials, constant p
  • E[X]=np, Var[X]=npq
  • Conditions: fixed n, independent, constant p, binary outcome
  • Phrases: "at least k" = 1 − P(X≤k−1)

Poisson (λ)

P(X=x) = e⁻λ·λˣ/x!
  • Rare events, large n, small p, λ=np
  • E[X] = Var[X] = λ (mean = variance!)
  • Approx Binomial when n≥100, np≤10
  • Use: arrivals, defects, errors per page
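The Binomial→Poisson approximation is easy to verify numerically; a sketch with n=200, p=0.02 (so λ = np = 4):

```python
from math import comb, exp, factorial

n, p = 200, 0.02
lam = n * p                     # λ = 4

def binom_pmf(x):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x):
    return exp(-lam) * lam**x / factorial(x)

# The two PMFs agree closely in the large-n, small-p regime
for x in range(6):
    print(x, round(binom_pmf(x), 4), round(poisson_pmf(x), 4))
```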

Normal N(μ, σ²)

f(x) = (1/σ√2π) · exp(−(x−μ)²/2σ²)
  • Bell-shaped, symmetric, Mean=Median=Mode
  • Empirical rule: 68%-95%-99.7% within 1,2,3σ
  • Z = (X−μ)/σ → N(0,1)
  • Normal approx to Binomial: if np≥15 and nq≥15
  • Continuity correction: P(X≤k) → P(X≤k+0.5)

Sampling Distributions

Distribution        Definition                                 Key Use
Chi-Square χ²(k)    Sum of k squared N(0,1) vars               Goodness of fit, variance tests; right-skewed
t(k)                Z / √(χ²/k) — heavier tails than normal    Small samples, unknown σ (uses s instead of σ)
F(v₁,v₂)            Ratio of two chi-squares / their dfs       Compare variances (ANOVA)

06 · Sampling Theory, CLT & Estimation

Central Limit Theorem

  • x̄ ~ N(μ, σ²/n) as n→∞ regardless of population dist.
  • SE of mean = σ/√n
  • Z = (x̄ − μ) / (σ/√n)
  • Finite population correction: multiply SE by √((N−n)/(N−1))
  • Use FPC if n/N > 0.05
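The claim x̄ ~ N(μ, σ²/n) with SE = σ/√n can be checked by simulation; a sketch drawing repeated samples from a Uniform(0,1) population (μ = 0.5, σ² = 1/12):

```python
import random
import statistics

random.seed(0)
n = 50                                    # sample size
# 5000 sample means from a Uniform(0,1) population
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(5000)]

se_theory = (1 / 12) ** 0.5 / n ** 0.5    # σ/√n ≈ 0.0408
se_empirical = statistics.stdev(means)    # SD of the sample means
```

The empirical SD of the means lands close to σ/√n even though the population is uniform, not normal.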

Sampling Methods

  • Simple Random: Each unit equal probability (homogeneous pop.)
  • Systematic: Every k-th unit (k=N/n), first unit random
  • Stratified: Divide heterogeneous pop. into strata; SRS within each; proportional allocation
  • Non-probability: judgment, convenience, quota, snowball

Confidence Intervals

CI for μ (σ known): x̄ ± z_{α/2} · σ/√n
CI for μ (σ unknown): x̄ ± t_{α/2,n-1} · s/√n
CI for proportion: p̂ ± z·√(p̂q̂/n)
  • α=0.05 → z=1.96 · α=0.01 → z=2.576
  • Width ↑ when σ↑ or n↓ or confidence↑
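The z-interval can be sketched with the stdlib `NormalDist` (summary numbers are made up):

```python
from statistics import NormalDist

xbar, s, n = 72.0, 8.0, 64      # hypothetical sample summary (large n → z)
conf = 0.95
z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # ≈ 1.96 for 95%
se = s / n ** 0.5                              # SE = s/√n = 1.0
ci = (xbar - z * se, xbar + z * se)            # ≈ (70.04, 73.96)
```

Rerunning with `conf = 0.99` widens the interval, since z jumps to 2.576.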

Sampling Distribution of Proportion

p̂ ~ N(p, pq/n) if np>15, nq>15
SE = √(pq/n)
Z = (p̂ − p) / SE

07 · Hypothesis Testing

Framework

  • H₀: Null — no difference / neutral
  • H₁: Alternate — what we want to prove
  • Never "prove" H₀; only reject or fail to reject
  • α = significance level = P(Type I error)
  • p-value < α → Reject H₀

Error Types

               H₀ True       H₀ False
Reject H₀      Type I (α)    Correct (Power)
Don't Reject   Correct       Type II (β)

Power = 1 − β = P(reject H₀ | H₁ true)

Test Types & When

  • Z-test: n≥30, σ known · or proportion test
  • 1-sample t: n<30, σ unknown
  • 2-sample t (unpaired): Independent groups, compare means
  • Paired t: Before/after same subjects · use d=y−x
  • Left/Right tailed: H₁ has < or > · Two-tailed: H₁ has ≠

Z-test Formulas at a Glance

1-sample mean (σ known): Z = (x̄ − μ₀) / (σ/√n)
2-sample mean: Z = (x̄₁ − x̄₂) / √(σ₁²/n₁ + σ₂²/n₂)
1-sample proportion: Z = (p̂ − p₀) / √(p₀q₀/n)
2-sample proportion: Z = (p̂₁ − p̂₂) / SE(p̂₁ − p̂₂)
Pooled SE (under H₀: p₁ = p₂): SE = √(p̄q̄(1/n₁ + 1/n₂)), p̄ = (x₁+x₂)/(n₁+n₂)
t-test (small sample): t = (x̄ − μ₀) / (s/√n), df = n−1
Paired t: t = d̄ / (s_d/√n), d = y−x, df = n−1
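A one-sample z-test from the first formula, with made-up numbers and a two-tailed p-value from the standard normal CDF:

```python
from statistics import NormalDist

# One-sample z-test (σ known); hypothetical numbers
xbar, mu0, sigma, n = 52.0, 50.0, 6.0, 36
z = (xbar - mu0) / (sigma / n ** 0.5)          # = 2.0

# Two-tailed: p = 2·P(Z > |z|)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # ≈ 0.0455
reject = p_value < 0.05                        # True → reject H₀
```

For a one-tailed H₁ you would drop the factor of 2 and use the appropriate tail.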

08 · Analysis of Variance (ANOVA)

Why ANOVA (not multiple t-tests)?

  • Each pairwise t-test carries Type I error α → risk inflates across C(k,2) comparisons
  • Familywise error = 1 − (1−α)^m for m tests (k=5 groups → m=10 pairs → ≈40% at α=0.05)
  • ANOVA tests H₀: μ₁=μ₂=…=μₖ simultaneously
  • Assumptions: Normal, equal variances, independent

One-way ANOVA Partitioning

SST = SSTR + SSE

SSTR = Σ nⱼ(x̄ⱼ − x̄)² [between groups, df=k−1]
SSE = Σ Σ (xᵢⱼ − x̄ⱼ)² [within groups, df=n−k]
MSTR = SSTR/(k−1)
MSE = SSE/(n−k)
F = MSTR / MSE
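The partitioning above can be verified on a tiny made-up dataset:

```python
from statistics import fmean

# Three hypothetical treatment groups
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = fmean(x for g in groups for x in g)

# Between-group and within-group sums of squares
sstr = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)   # 38
sse = sum((x - fmean(g)) ** 2 for g in groups for x in g)           # 6

mstr = sstr / (k - 1)     # between-group mean square
mse = sse / (n - k)       # within-group mean square
F = mstr / mse            # 19 — compare against F(k−1, n−k)
```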

ANOVA Table

Source        SS      df     MS      F
Treatments    SSTR    k−1    MSTR    MSTR/MSE
Error         SSE     n−k    MSE
Total         SST     n−1

F_cal > F_tab → Reject H₀ (means differ)

Two-way ANOVA

  • Tests effect of two factors: Treatment + Block
  • SST = SSTR + SSBL + SSE
  • Reduces error variance vs one-way
  • F_treatment = MSTR/MSE, F_block = MSBL/MSE
  • Two H₀: no treatment diff + no block diff

Shortcut: Correction Factor Method

CF = G² / n (G: Grand Total)
SST = Σ xᵢⱼ² − CF
SSTR = Σ (Cᵢ²/nᵢ) − CF
SSE = SST − SSTR
Two-way: SSBL = Σ(Rᵢ²/k) − CF, where Rᵢ = total of row i

09 · Maximum Likelihood Estimation

Concept

  • Find θ that maximizes likelihood of observing the data
  • Likelihood: L(θ) = Π f(xᵢ|θ)
  • Log-likelihood: ℓ(θ) = Σ log f(xᵢ|θ)
  • MLE: dℓ/dθ = 0 and d²ℓ/dθ² < 0
  • If calculus fails (likelihood maximized at a boundary, e.g. Uniform) → use order statistics

MLEs for Common Distributions

Bernoulli/Binomial: p̂ = x̄ (= k/n)
Poisson: λ̂ = x̄
Normal: μ̂ = x̄, σ̂² = Σ(xᵢ−x̄)²/n
Exponential: λ̂ = 1/x̄
Uniform(a,b): â = x_(1) (min), b̂ = x_(n) (max)
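The claim p̂ = x̄ for Bernoulli can be confirmed by brute force: grid-search the log-likelihood and check the maximizer lands on the sample mean (data made up):

```python
from math import log
from statistics import fmean

data = [1, 0, 1, 1, 0, 1, 1, 1]   # 8 Bernoulli trials, 6 successes

def loglik(p):
    """ℓ(p) = Σ log f(xᵢ|p) for Bernoulli."""
    return sum(log(p) if x else log(1 - p) for x in data)

# Grid-search p ∈ (0, 1); the maximum lands at p̂ = x̄ = 6/8 = 0.75
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=loglik)
```

The same pattern (write ℓ(θ), maximize) verifies the other closed forms, e.g. λ̂ = x̄ for Poisson.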

Properties of Good Estimators

  • Unbiased: E[θ̂] = θ
  • Consistent: θ̂ → θ as n → ∞
  • Efficient: Minimum variance among all unbiased estimators
  • Sufficient: Uses all information in data about θ
  • MLE is generally consistent & asymptotically efficient

10 · Correlation & Regression

Covariance vs Correlation

  • Cov(X,Y) = E[(X−μₓ)(Y−μᵧ)] = E[XY]−E[X]E[Y]
  • Shows direction, but unit-dependent
  • r = Cov(X,Y)/(σₓ·σᵧ) — unit-free, range [−1, +1]
  • Correlation ≠ causation: even a strong r does not imply a causal link

Pearson r Interpretation

|r| = 0: no correlation
0.01–0.25: weak
0.26–0.75: moderate
0.76–0.99: strong
r = ±1: perfect

Simple Linear Regression

Model: Y = a + bX + ε
b = Σ(X−X̄)(Y−Ȳ) / Σ(X−X̄)² (= Σxy/Σx² in deviation-from-mean notation)
a = Ȳ − b·X̄
  • Least squares minimizes Σ(Yᵢ − Ŷᵢ)²
  • b: for every 1-unit increase in X, Y changes by b units

Coefficient of Determination

SST = SSR + SSE
R² = SSR/SST = 1 − SSE/SST
  • R² = % of variation in Y explained by regression
  • R² = r² for simple linear regression
  • Error assumptions: E[ε]=0, constant variance, independent, normal
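The slope, intercept, and R² formulas, applied to a small made-up dataset:

```python
from statistics import fmean

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
xbar, ybar = fmean(X), fmean(Y)

# Least-squares slope and intercept
b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) \
    / sum((x - xbar) ** 2 for x in X)          # 0.6
a = ybar - b * xbar                            # 2.2

# R² = 1 − SSE/SST
Y_hat = [a + b * x for x in X]
sst = sum((y - ybar) ** 2 for y in Y)
sse = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))
r2 = 1 - sse / sst                             # 0.6
```

Here r = 6/√60 ≈ 0.775, and r² = 0.6 matches R², illustrating R² = r² for simple linear regression.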

Spearman Rank Correlation

R = 1 − 6Σdᵢ² / (n(n²−1))
dᵢ = Rank(Xᵢ) − Rank(Yᵢ)
  • Non-parametric alternative to Pearson r
  • Works for ordinal or non-linear monotonic relationships
  • Tie handling: assign average ranks
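A sketch of Spearman's R, including the average-rank tie handling mentioned above (data made up):

```python
def ranks(values):
    """1-based ranks; ties receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                       # extend over a tie group
        avg = (i + j) / 2 + 1            # average rank for the group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

X = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]
Y = [2, 20, 28, 27, 50, 29, 7, 17, 6, 12]

rx, ry = ranks(X), ranks(Y)
n = len(X)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # Σdᵢ² = 194
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))            # ≈ −0.176
```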

11 · Time Series Analysis

Components

  • Trend (T): Long-term direction (upward/downward)
  • Seasonal (S): Fixed, regular patterns within year (quarterly, monthly)
  • Cyclical (C): Multi-year irregular waves (economic cycles)
  • Irregular (I): Random, unpredictable noise

Additive vs Multiplicative

          Additive              Multiplicative
Model     Y = T+C+S+I           Y = T×C×S×I
Season    Constant magnitude    Grows with trend
When      Linear trend          Exponential growth

Moving Averages

  • MA(k) = Average of last k observations
  • Centered MA: aligns to middle of window (for even k: MA of MA)
  • Larger k = smoother curve but more lag
  • Low-pass filter: removes high-frequency noise
  • Choose k by comparing MSE

Exponential Smoothing

Simple ES: F_{t+1} = α·Y_t + (1−α)·F_t
α close to 1 → more weight on recent (reactive)
α close to 0 → smoother, less reactive
Holt (trend): adds trend component β
Holt-Winters: adds seasonality component γ

Forecast Error Metrics

MSE = Σ(Y_t − F_t)² / n [penalizes outliers]
MAD = Σ|Y_t − F_t| / n [robust to outliers]
MAPE = (100/n)·Σ|Y_t−F_t|/Y_t [scale-free %]
LAD = max|Y_t − F_t| [worst case]
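Simple exponential smoothing together with the error metrics above, on a made-up series (the usual convention of seeding F₁ with the first observation is assumed):

```python
Y = [10, 12, 11, 13, 12, 14]     # observed series (hypothetical)
alpha = 0.3
F = [Y[0]]                       # seed F₁ = Y₁
for t in range(len(Y) - 1):
    F.append(alpha * Y[t] + (1 - alpha) * F[t])   # F_{t+1} = αY_t + (1−α)F_t

errs = [y - f for y, f in zip(Y, F)]
mse  = sum(e ** 2 for e in errs) / len(errs)            # penalizes outliers
mad  = sum(abs(e) for e in errs) / len(errs)            # robust to outliers
mape = sum(abs(e) / y for e, y in zip(errs, Y)) / len(errs) * 100
lad  = max(abs(e) for e in errs)                        # worst case
```

Rerunning with a larger α makes the forecasts track recent values more closely; comparing MSE across α values is the usual way to tune it.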

Autocorrelation

  • ACF: Correlation of series with its own lags
  • PACF: ACF removing effects of shorter lags
  • Stationarity check: Dickey-Fuller test (H₀: unit root)
  • White noise: uncorrelated, mean 0, constant variance

AR, MA, ARMA, ARIMA

  • AR(p): Y_t = c + Σφᵢ·Y_{t−i} + ε_t (regression on the series' own lags)
  • MA(q): Y_t = ε_t + Σθᵢ·ε_{t−i} [moving average over past error terms]
  • ARMA(p,q): combines both:
    Y_t = c + Σφᵢ·Y_{t−i} + Σθᵢ·ε_{t−i} + ε_t
  • ARIMA(p,d,q): d = differencing order for stationarity; fit ARMA to the d-th differences of Y_t, since the differenced series is stationary

12 · Gaussian Mixture Models (GMM)

What is GMM?

  • Probabilistic model = mixture of K Gaussian distributions
  • Soft clustering: assigns probability to each cluster (vs hard KMeans)
  • p(x) = Σ πₖ · N(x|μₖ, Σₖ) where Σπₖ = 1
  • Parameters: μₖ (mean), Σₖ (covariance), πₖ (mixing coefficient)

EM Algorithm

  • E-step: Compute responsibilities γ(zₖ) = P(k|x) using Bayes
  • M-step: Update μₖ, Σₖ, πₖ using weighted averages
  • Repeat until log-likelihood converges
  • Guaranteed non-decrease in likelihood per step
γ(zₖ) = πₖ·N(x|μₖ,Σₖ) / Σⱼ πⱼ·N(x|μⱼ,Σⱼ)

Responsibility (E-step detail)

  • γ = posterior probability point belongs to component k
  • If the component densities N(x|μ₁,σ₁²) and N(x|μ₂,σ₂²) are equal at x, then γ₁ = π₁/(π₁+π₂)
  • M-step updated mean: μₖ = Σ γ(zₖ)·xᵢ / Σ γ(zₖ)
  • Updated weight: πₖ = (1/N) Σ γ(zₖ)
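The E/M loop above can be sketched in one dimension with two components; the data and initial guesses below are made up:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# Two well-separated 1-D blobs (hypothetical data)
data = [1.0, 1.2, 0.8, 1.1, 4.9, 5.1, 5.0, 5.2]

mu = [0.0, 6.0]        # initial means
var = [1.0, 1.0]       # initial variances
w = [0.5, 0.5]         # mixing coefficients πₖ

for _ in range(20):    # EM iterations
    # E-step: responsibilities γ(z_k) for every point via Bayes
    gamma = []
    for x in data:
        num = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
        s = sum(num)
        gamma.append([nk / s for nk in num])
    # M-step: weighted updates of μₖ, σₖ², πₖ
    for k in range(2):
        nk = sum(g[k] for g in gamma)
        mu[k] = sum(g[k] * x for g, x in zip(gamma, data)) / nk
        var[k] = sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, data)) / nk
        w[k] = nk / len(data)
```

With this separation the means converge to roughly 1.03 and 5.05, and each πₖ to 0.5; a production implementation would iterate in log space and stop on log-likelihood convergence.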

GMM vs KMeans

             KMeans              GMM
Assignment   Hard (1 cluster)    Soft (probabilities)
Shape        Spherical           Elliptical (via Σ)
Algorithm    Assign-Update       E-M steps
Output       Cluster labels      P(cluster|x)

Applications

  • Soft clustering (overlapping groups)
  • Density estimation
  • Anomaly detection (low likelihood = anomaly)
  • Speaker recognition, image segmentation
  • Challenge: K must be chosen; Gaussian assumption may not hold