Prob Stats · DS/ML Interview Notes

Quick Revision | Masters Notes

Contents: Statistics Basics · Probability · Bayes + NB · Random Variables · Distributions · CLT + Estimation · Hypothesis Testing · ANOVA · MLE · Regression · Time Series · GMM

01 · Statistics Basics

Data Types

  • Categorical: Nominal (no order), Ordinal (ordered)
  • Numerical: Discrete (countable), Continuous (measurable)
  • Scales: Nominal < Ordinal < Interval (no true 0) < Ratio (true 0)
  • Examples: Interval: temperature in Celsius · Ratio: weight, age

Central Tendency

  • Mean: Σx/n · affected by outliers · use for symmetric data
  • Median: Middle value · robust to outliers · 50th percentile
  • Mode: Most frequent · works for nominal data
  • Empirical skew rule: Mode ≈ 3·Median − 2·Mean

Skewness

  • Right skew: Mean > Median > Mode (long right tail)
  • Left skew: Mean < Median < Mode (long left tail)
  • Symmetric: Mean = Median = Mode
  • Use median for heavily skewed data

Variability / Spread

  • Range: Max − Min · unreliable (uses 2 points only)
  • Variance (σ²): Mean of squared deviations
  • SD (σ): √Variance · same units as data
  • IQR: Q3 − Q1 · robust to outliers
  • Sample variance uses n−1 (Bessel's correction) — avoids bias
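The n vs n−1 distinction maps directly onto Python's `statistics` module; a quick check with a made-up sample:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean = 5, sum of squared deviations = 32

# Population variance: divide by n
pop_var = statistics.pvariance(data)   # 32/8 = 4.0
# Sample variance: divide by n−1 (Bessel's correction)
samp_var = statistics.variance(data)   # 32/7 ≈ 4.571
```

`statistics.variance` applies Bessel's correction automatically; `statistics.pvariance` divides by n.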

5-Number Summary & Boxplot

  • Min · Q1 · Median · Q3 · Max
  • Outlier fence: Q1 − 1.5·IQR to Q3 + 1.5·IQR
  • Major outlier: beyond Q1 − 3·IQR or Q3 + 3·IQR
  • QD = IQR / 2 (variation in middle 50%)
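The summary and fences can be computed directly; a minimal sketch with made-up data (quantile conventions vary between libraries, so q1/q3 here use the `statistics` module's 'inclusive' interpolation):

```python
import statistics

data = [1, 3, 4, 5, 5, 6, 7, 8, 30]   # 30 is a suspect point

# Quartiles via the stdlib; 'inclusive' interpolates on (n−1)·p positions
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]   # [30]
```

Here q1=4, q3=7, so the fences are (−0.5, 11.5) and only 30 is flagged.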

02 · Probability

Definitions

  • Experiment: Uncertain outcome process
  • Sample Space S: All possible outcomes
  • Event: Subset of S
  • Axioms: P(S)=1, 0≤P(E)≤1, P(A∪B)=P(A)+P(B) if mutually exclusive

Key Rules

Addition: P(A∪B) = P(A)+P(B)−P(A∩B)
Complement: P(Aᶜ) = 1 − P(A)
Multiplication: P(A∩B) = P(A)·P(B|A)
Independence: P(A∩B) = P(A)·P(B)

ME vs Independent

              Mut. Exclusive   Independent
P(A∩B)        0                P(A)·P(B)
Both occur?   Never            Possible
Venn          No overlap       Overlap

Conditional Probability

P(B|A) = P(A∩B) / P(A)
P(A|B) = P(A∩B) / P(B)
  • Multiplication: P(A∩B∩C) = P(A)·P(B|A)·P(C|A∩B)

Total Probability

P(B) = Σ P(B|Aᵢ)·P(Aᵢ)
  • Aᵢ must be mutually exclusive & exhaustive (partition of S)
  • Classic use: disease test, binary channel, device types

03 · Bayes Theorem & Naive Bayes

Bayes Theorem

P(Eᵢ|A) = P(Eᵢ)·P(A|Eᵢ) / ΣP(Eⱼ)·P(A|Eⱼ)
P(A|B) = P(B|A)·P(A) / P(B)
  • Posterior = Prior × Likelihood / Evidence
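Posterior = Prior × Likelihood / Evidence is easiest to see numerically; a sketch with hypothetical disease-screening numbers (all values made up for illustration):

```python
prior = 0.01          # P(D): 1% prevalence
sensitivity = 0.95    # P(+|D)
false_pos = 0.05      # P(+|¬D)

# Evidence via total probability: P(+) = P(+|D)P(D) + P(+|¬D)P(¬D)
evidence = sensitivity * prior + false_pos * (1 - prior)
# Posterior via Bayes: P(D|+) = P(+|D)·P(D) / P(+)
posterior = sensitivity * prior / evidence
print(round(posterior, 3))   # 0.161 — low despite a "95% accurate" test
```

This is the classic interview trap: with a rare disease, most positives are false positives.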

Naive Bayes Classifier

  • Key assumption: Features are conditionally independent given class
  • P(C|X) ∝ P(C) · P(X|C)
  • P(C|X) ∝ P(C) · Π P(Xᵢ|C) (Conditional Independence)
  • Predict class with highest posterior
  • Laplace smoothing: Add 1 to all counts → avoids zero probability

Why Conditional Independence Matters

  • Without it: need all feature combinations → exponential params
  • With it: only n params per class (one per feature)
  • 5 binary features: 2⁵ = 32 joint combinations vs just 5 per-feature probabilities per class
  • Apps: spam filtering, sentiment analysis, document classification

NB Steps (Categorical)

  • 1. Build frequency table from training data
  • 2. Compute P(class) and P(feature|class)
  • 3. For new point: compute P(class) · Π P(xᵢ|class)
  • 4. Pick argmax class
  • If any P(feature|class)=0 → apply Laplace (+1)
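The four steps can be sketched end-to-end; the weather/temperature training data below is made up for illustration:

```python
from collections import Counter, defaultdict

# Toy training set: (weather, temp) → play?
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
     ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
y = ["no", "yes", "yes", "yes", "no", "no"]

classes = set(y)
class_counts = Counter(y)

# Steps 1-2: frequency tables → P(class) and P(feature=value | class)
feat_counts = defaultdict(Counter)      # (feature_idx, class) → value counts
for xi, ci in zip(X, y):
    for j, v in enumerate(xi):
        feat_counts[(j, ci)][v] += 1

def posterior_score(x, c):
    """Unnormalized P(c) · Π P(x_j | c), with Laplace (+1) smoothing."""
    score = class_counts[c] / len(y)
    for j, v in enumerate(x):
        values = {xi[j] for xi in X}    # distinct values of feature j
        score *= (feat_counts[(j, c)][v] + 1) / (class_counts[c] + len(values))
    return score

# Steps 3-4: argmax over classes for a new point
new_point = ("rainy", "cool")
pred = max(classes, key=lambda c: posterior_score(new_point, c))   # "yes"
```

The Laplace `+1` in the numerator (with `+|values|` in the denominator) is what prevents an unseen feature value from zeroing out the whole product.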

04 · Random Variables & Distributions

Discrete RV

  • PMF: f(x) = P(X=x), Σf(x)=1, f(x)≥0
  • CDF: F(x) = P(X≤x) = Σf(xᵢ) for xᵢ≤x
  • E[X] = Σ x·f(x)
  • Var[X] = E[X²] − (E[X])²

Continuous RV

  • PDF: f(x)≥0, ∫f(x)dx=1
  • P(a≤X≤b) = ∫ₐᵇ f(x)dx
  • P(X=exact value) = 0
  • E[X] = ∫x·f(x)dx
  • Var[X] = ∫(x−μ)²f(x)dx

Joint Distribution

  • f(x,y) = P(X=x, Y=y)
  • Marginal: f_X(x) = Σ_y f(x,y)
  • Conditional: f(x|y) = f(x,y)/f_Y(y)
  • Independence: f(x,y) = f_X(x)·f_Y(y)
  • Cov(X,Y) = E[XY] − E[X]·E[Y]

05 · Key Probability Distributions

Bernoulli

P(X=x) = pˣ·(1−p)^(1−x), x∈{0,1}
  • Single trial, 2 outcomes
  • E[X]=p, Var[X]=p(1−p)=pq
  • Special case of Binomial with n=1

Binomial B(n,p)

P(X=x) = C(n,x)·pˣ·qⁿ⁻ˣ
  • n fixed independent trials, constant p
  • E[X]=np, Var[X]=npq
  • Conditions: fixed n, independent, constant p, binary outcome
  • Phrases: "at least k" = 1 − P(X≤k−1)

Poisson (λ)

P(X=x) = e⁻λ·λˣ/x!
  • Rare events, large n, small p, λ=np
  • E[X] = Var[X] = λ (mean = variance!)
  • Approx Binomial when n≥100, np≤10
  • Use: arrivals, defects, errors per page
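The Binomial→Poisson approximation is easy to verify numerically; a sketch with n=200, p=0.02 (so λ = np = 4):

```python
from math import comb, exp, factorial

n, p = 200, 0.02
lam = n * p                     # λ = 4

def binom_pmf(x):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x):
    return exp(-lam) * lam**x / factorial(x)

# The two PMFs agree closely in the large-n, small-p regime
for x in range(6):
    print(x, round(binom_pmf(x), 4), round(poisson_pmf(x), 4))
```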

Normal N(μ, σ²)

f(x) = (1/σ√2π) · exp(−(x−μ)²/2σ²)
  • Bell-shaped, symmetric, Mean=Median=Mode
  • Empirical rule: 68%-95%-99.7% within 1,2,3σ
  • Z = (X−μ)/σ → N(0,1)
  • Normal approx to Binomial: if np≥15 and nq≥15
  • Continuity correction: P(X≤k) → P(X≤k+0.5)

Sampling Distributions

Distribution        Definition                                 Key Use
Chi-Square χ²(k)    Sum of k squared N(0,1) vars               Goodness of fit, variance tests; right-skewed
t(k)                Z / √(χ²/k) — heavier tails than normal    Small samples, unknown σ (uses s instead of σ)
F(v₁,v₂)            Ratio of two chi-squares / their dfs       Compare variances (ANOVA)

06 · Sampling Theory, CLT & Estimation

Central Limit Theorem

  • x̄ ~ N(μ, σ²/n) as n→∞ regardless of population dist.
  • SE of mean = σ/√n
  • Z = (x̄ − μ) / (σ/√n)
  • Finite population correction: multiply SE by √((N−n)/(N−1))
  • Use FPC if n/N > 0.05
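The claim x̄ ~ N(μ, σ²/n) with SE = σ/√n can be checked by simulation; a sketch drawing repeated samples from a Uniform(0,1) population (μ = 0.5, σ² = 1/12):

```python
import random
import statistics

random.seed(0)
n = 50                                    # sample size
# 5000 sample means from a Uniform(0,1) population
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(5000)]

se_theory = (1 / 12) ** 0.5 / n ** 0.5    # σ/√n ≈ 0.0408
se_empirical = statistics.stdev(means)    # SD of the sample means
```

The empirical SD of the means lands close to σ/√n even though the population is uniform, not normal.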

Sampling Methods

  • Simple Random: Each unit equal probability (homogeneous pop.)
  • Systematic: Every k-th unit (k=N/n), first unit random
  • Stratified: Divide heterogeneous pop. into strata; SRS within each; proportional allocation
  • Non-probability: judgment, convenience, quota, snowball

Confidence Intervals

CI for μ (σ known): x̄ ± z_{α/2} · σ/√n
CI for μ (σ unknown): x̄ ± t_{α/2,n-1} · s/√n
CI for proportion: p̂ ± z·√(p̂q̂/n)
  • α=0.05 → z=1.96 · α=0.01 → z=2.576
  • Width ↑ when σ↑ or n↓ or confidence↑
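The z-interval can be sketched with the stdlib `NormalDist` (summary numbers are made up):

```python
from statistics import NormalDist

xbar, s, n = 72.0, 8.0, 64      # hypothetical sample summary (large n → z)
conf = 0.95
z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # ≈ 1.96 for 95%
se = s / n ** 0.5                              # SE = s/√n = 1.0
ci = (xbar - z * se, xbar + z * se)            # ≈ (70.04, 73.96)
```

Rerunning with `conf = 0.99` widens the interval, since z jumps to 2.576.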

Sampling Distribution of Proportion

p̂ ~ N(p, pq/n) if np>15, nq>15
SE = √(pq/n)
Z = (p̂ − p) / SE

07 · Hypothesis Testing

Framework

  • H₀: Null — no difference / neutral
  • H₁: Alternate — what we want to prove
  • Never "prove" H₀; only reject or fail to reject
  • α = significance level = P(Type I error)
  • p-value < α → Reject H₀

Error Types

               H₀ True       H₀ False
Reject H₀      Type I (α)    Correct (Power)
Don't Reject   Correct       Type II (β)

Power = 1 − β = P(reject H₀ | H₁ true)

Test Types & When

  • Z-test: n≥30, σ known · or proportion test
  • 1-sample t: n<30, σ unknown
  • 2-sample t (unpaired): Independent groups, compare means
  • Paired t: Before/after same subjects · use d=y−x
  • Left/Right tailed: H₁ has < or > · Two-tailed: H₁ has ≠

Z-test Formulas at a Glance

1-sample mean (σ known): Z = (x̄ − μ₀) / (σ/√n)
2-sample mean: Z = (x̄₁ − x̄₂) / √(σ₁²/n₁ + σ₂²/n₂)
1-sample proportion: Z = (p̂ − p₀) / √(p₀q₀/n)
2-sample proportion: Z = (p̂₁ − p̂₂) / SE(p̂₁ − p̂₂)
Pooled SE (under H₀: p₁ = p₂): SE = √(p̄q̄(1/n₁ + 1/n₂)), p̄ = (x₁+x₂)/(n₁+n₂)
t-test (small sample): t = (x̄ − μ₀) / (s/√n), df = n−1
Paired t: t = d̄ / (s_d/√n), d = y−x, df = n−1
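A one-sample z-test from the first formula, with made-up numbers and a two-tailed p-value from the standard normal CDF:

```python
from statistics import NormalDist

# One-sample z-test (σ known); hypothetical numbers
xbar, mu0, sigma, n = 52.0, 50.0, 6.0, 36
z = (xbar - mu0) / (sigma / n ** 0.5)          # = 2.0

# Two-tailed: p = 2·P(Z > |z|)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # ≈ 0.0455
reject = p_value < 0.05                        # True → reject H₀
```

For a one-tailed H₁ you would drop the factor of 2 and use the appropriate tail.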

08 · Analysis of Variance (ANOVA)

Why ANOVA (not multiple t-tests)?

  • Each pairwise t-test carries Type I error α → risk inflates across C(k,2) comparisons
  • Familywise error = 1 − (1−α)^m for m tests (k=5 groups → m=10 pairs → ≈40% at α=0.05)
  • ANOVA tests H₀: μ₁=μ₂=…=μₖ simultaneously
  • Assumptions: Normal, equal variances, independent

One-way ANOVA Partitioning

SST = SSTR + SSE

SSTR = Σ nⱼ(x̄ⱼ − x̄)² [between groups, df=k−1]
SSE = Σ Σ (xᵢⱼ − x̄ⱼ)² [within groups, df=n−k]
MSTR = SSTR/(k−1)
MSE = SSE/(n−k)
F = MSTR / MSE
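The partitioning above can be verified on a tiny made-up dataset:

```python
from statistics import fmean

# Three hypothetical treatment groups
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = fmean(x for g in groups for x in g)

# Between-group and within-group sums of squares
sstr = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)   # 38
sse = sum((x - fmean(g)) ** 2 for g in groups for x in g)           # 6

mstr = sstr / (k - 1)     # between-group mean square
mse = sse / (n - k)       # within-group mean square
F = mstr / mse            # 19 — compare against F(k−1, n−k)
```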

ANOVA Table

Source        SS      df     MS      F
Treatments    SSTR    k−1    MSTR    MSTR/MSE
Error         SSE     n−k    MSE
Total         SST     n−1

F_cal > F_tab → Reject H₀ (means differ)

Two-way ANOVA

  • Tests effect of two factors: Treatment + Block
  • SST = SSTR + SSBL + SSE
  • Reduces error variance vs one-way
  • F_treatment = MSTR/MSE, F_block = MSBL/MSE
  • Two H₀: no treatment diff + no block diff

Shortcut: Correction Factor Method

CF = G² / n (G: Grand Total)
SST = Σ xᵢⱼ² − CF
SSTR = Σ (Cᵢ²/nᵢ) − CF
SSE = SST − SSTR
Two-way: SSBL = Σ(Rᵢ²/k) − CF, where Rᵢ = total of row i

09 · Maximum Likelihood Estimation

Concept

  • Find θ that maximizes likelihood of observing the data
  • Likelihood: L(θ) = Π f(xᵢ|θ)
  • Log-likelihood: ℓ(θ) = Σ log f(xᵢ|θ)
  • MLE: dℓ/dθ = 0 and d²ℓ/dθ² < 0
  • If calculus fails (likelihood maximized at a boundary, e.g. Uniform) → use order statistics

MLEs for Common Distributions

Bernoulli/Binomial: p̂ = x̄ (= k/n)
Poisson: λ̂ = x̄
Normal: μ̂ = x̄, σ̂² = Σ(xᵢ−x̄)²/n
Exponential: λ̂ = 1/x̄
Uniform(a,b): â = x_(1) (min), b̂ = x_(n) (max)
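The claim p̂ = x̄ for Bernoulli can be confirmed by brute force: grid-search the log-likelihood and check the maximizer lands on the sample mean (data made up):

```python
from math import log
from statistics import fmean

data = [1, 0, 1, 1, 0, 1, 1, 1]   # 8 Bernoulli trials, 6 successes

def loglik(p):
    """ℓ(p) = Σ log f(xᵢ|p) for Bernoulli."""
    return sum(log(p) if x else log(1 - p) for x in data)

# Grid-search p ∈ (0, 1); the maximum lands at p̂ = x̄ = 6/8 = 0.75
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=loglik)
```

The same pattern (write ℓ(θ), maximize) verifies the other closed forms, e.g. λ̂ = x̄ for Poisson.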

Properties of Good Estimators

  • Unbiased: E[θ̂] = θ
  • Consistent: θ̂ → θ as n → ∞
  • Efficient: Minimum variance among all unbiased estimators
  • Sufficient: Uses all information in data about θ
  • MLE is generally consistent & asymptotically efficient

10 · Correlation & Regression

Covariance vs Correlation

  • Cov(X,Y) = E[(X−μₓ)(Y−μᵧ)] = E[XY]−E[X]E[Y]
  • Shows direction, but unit-dependent
  • r = Cov(X,Y)/(σₓ·σᵧ) — unit-free, range [−1, +1]
  • Correlation ≠ causation: even a strong r does not imply a causal link

Pearson r Interpretation

|r| = 0: no correlation
0.01–0.25: weak
0.26–0.75: moderate
0.76–0.99: strong
r = ±1: perfect

Simple Linear Regression

Model: Y = a + bX + ε
b = Σ(X−X̄)(Y−Ȳ) / Σ(X−X̄)² (= Σxy/Σx² in deviation-from-mean notation)
a = Ȳ − b·X̄
  • Least squares minimizes Σ(Yᵢ − Ŷᵢ)²
  • b: for every 1-unit increase in X, Y changes by b units

Coefficient of Determination

SST = SSR + SSE
R² = SSR/SST = 1 − SSE/SST
  • R² = % of variation in Y explained by regression
  • R² = r² for simple linear regression
  • Error assumptions: E[ε]=0, constant variance, independent, normal
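The slope, intercept, and R² formulas, applied to a small made-up dataset:

```python
from statistics import fmean

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
xbar, ybar = fmean(X), fmean(Y)

# Least-squares slope and intercept
b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) \
    / sum((x - xbar) ** 2 for x in X)          # 0.6
a = ybar - b * xbar                            # 2.2

# R² = 1 − SSE/SST
Y_hat = [a + b * x for x in X]
sst = sum((y - ybar) ** 2 for y in Y)
sse = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))
r2 = 1 - sse / sst                             # 0.6
```

Here r = 6/√60 ≈ 0.775, and r² = 0.6 matches R², illustrating R² = r² for simple linear regression.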

Spearman Rank Correlation

R = 1 − 6Σdᵢ² / (n(n²−1))
dᵢ = Rank(Xᵢ) − Rank(Yᵢ)
  • Non-parametric alternative to Pearson r
  • Works for ordinal or non-linear monotonic relationships
  • Tie handling: assign average ranks
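A sketch of Spearman's R, including the average-rank tie handling mentioned above (data made up):

```python
def ranks(values):
    """1-based ranks; ties receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                       # extend over a tie group
        avg = (i + j) / 2 + 1            # average rank for the group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

X = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]
Y = [2, 20, 28, 27, 50, 29, 7, 17, 6, 12]

rx, ry = ranks(X), ranks(Y)
n = len(X)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # Σdᵢ² = 194
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))            # ≈ −0.176
```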

11 · Time Series Analysis

Components

  • Trend (T): Long-term direction (upward/downward)
  • Seasonal (S): Fixed, regular patterns within year (quarterly, monthly)
  • Cyclical (C): Multi-year irregular waves (economic cycles)
  • Irregular (I): Random, unpredictable noise

Additive vs Multiplicative

          Additive              Multiplicative
Model     Y = T+C+S+I           Y = T×C×S×I
Season    Constant magnitude    Grows with trend
When      Linear trend          Exponential growth

Moving Averages

  • MA(k) = Average of last k observations
  • Centered MA: aligns to middle of window (for even k: MA of MA)
  • Larger k = smoother curve but more lag
  • Low-pass filter: removes high-frequency noise
  • Choose k by comparing MSE

Exponential Smoothing

Simple ES: F_{t+1} = α·Y_t + (1−α)·F_t
α close to 1 → more weight on recent (reactive)
α close to 0 → smoother, less reactive
Holt (trend): adds trend component β
Holt-Winters: adds seasonality component γ

Forecast Error Metrics

MSE = Σ(Y_t − F_t)² / n [penalizes outliers]
MAD = Σ|Y_t − F_t| / n [robust to outliers]
MAPE = (100/n)·Σ|Y_t−F_t|/Y_t [scale-free %]
LAD = max|Y_t − F_t| [worst case]
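Simple exponential smoothing together with the error metrics above, on a made-up series (the usual convention of seeding F₁ with the first observation is assumed):

```python
Y = [10, 12, 11, 13, 12, 14]     # observed series (hypothetical)
alpha = 0.3
F = [Y[0]]                       # seed F₁ = Y₁
for t in range(len(Y) - 1):
    F.append(alpha * Y[t] + (1 - alpha) * F[t])   # F_{t+1} = αY_t + (1−α)F_t

errs = [y - f for y, f in zip(Y, F)]
mse  = sum(e ** 2 for e in errs) / len(errs)            # penalizes outliers
mad  = sum(abs(e) for e in errs) / len(errs)            # robust to outliers
mape = sum(abs(e) / y for e, y in zip(errs, Y)) / len(errs) * 100
lad  = max(abs(e) for e in errs)                        # worst case
```

Rerunning with a larger α makes the forecasts track recent values more closely; comparing MSE across α values is the usual way to tune it.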

Autocorrelation

  • ACF: Correlation of series with its own lags
  • PACF: ACF removing effects of shorter lags
  • Stationarity check: Dickey-Fuller test (H₀: unit root)
  • White noise: uncorrelated, mean 0, constant variance

AR, MA, ARMA, ARIMA

  • AR(p): Y_t = c + Σφᵢ·Y_{t−i} + ε_t (regression on the series' own lags)
  • MA(q): Y_t = ε_t + Σθᵢ·ε_{t−i} [moving average over past error terms]
  • ARMA(p,q): combines both:
    Y_t = c + Σφᵢ·Y_{t−i} + Σθᵢ·ε_{t−i} + ε_t
  • ARIMA(p,d,q): d = differencing order for stationarity; fit ARMA to the d-th differences of Y_t, since the differenced series is stationary

12 · Gaussian Mixture Models (GMM)

What is GMM?

  • Probabilistic model = mixture of K Gaussian distributions
  • Soft clustering: assigns probability to each cluster (vs hard KMeans)
  • p(x) = Σ πₖ · N(x|μₖ, Σₖ) where Σπₖ = 1
  • Parameters: μₖ (mean), Σₖ (covariance), πₖ (mixing coefficient)

EM Algorithm

  • E-step: Compute responsibilities γ(zₖ) = P(k|x) using Bayes
  • M-step: Update μₖ, Σₖ, πₖ using weighted averages
  • Repeat until log-likelihood converges
  • Guaranteed non-decrease in likelihood per step
γ(zₖ) = πₖ·N(x|μₖ,Σₖ) / Σⱼ πⱼ·N(x|μⱼ,Σⱼ)

Responsibility (E-step detail)

  • γ = posterior probability point belongs to component k
  • If the component densities N(x|μ₁,σ₁²) and N(x|μ₂,σ₂²) are equal at x, then γ₁ = π₁/(π₁+π₂)
  • M-step updated mean: μₖ = Σ γ(zₖ)·xᵢ / Σ γ(zₖ)
  • Updated weight: πₖ = (1/N) Σ γ(zₖ)
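The E/M loop above can be sketched in one dimension with two components; the data and initial guesses below are made up:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# Two well-separated 1-D blobs (hypothetical data)
data = [1.0, 1.2, 0.8, 1.1, 4.9, 5.1, 5.0, 5.2]

mu = [0.0, 6.0]        # initial means
var = [1.0, 1.0]       # initial variances
w = [0.5, 0.5]         # mixing coefficients πₖ

for _ in range(20):    # EM iterations
    # E-step: responsibilities γ(z_k) for every point via Bayes
    gamma = []
    for x in data:
        num = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
        s = sum(num)
        gamma.append([nk / s for nk in num])
    # M-step: weighted updates of μₖ, σₖ², πₖ
    for k in range(2):
        nk = sum(g[k] for g in gamma)
        mu[k] = sum(g[k] * x for g, x in zip(gamma, data)) / nk
        var[k] = sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, data)) / nk
        w[k] = nk / len(data)
```

With this separation the means converge to roughly 1.03 and 5.05, and each πₖ to 0.5; a production implementation would iterate in log space and stop on log-likelihood convergence.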

GMM vs KMeans

             KMeans              GMM
Assignment   Hard (1 cluster)    Soft (probabilities)
Shape        Spherical           Elliptical (via Σ)
Algorithm    Assign-Update       E-M steps
Output       Cluster labels      P(cluster|x)

Applications

  • Soft clustering (overlapping groups)
  • Density estimation
  • Anomaly detection (low likelihood = anomaly)
  • Speaker recognition, image segmentation
  • Challenge: K must be chosen; Gaussian assumption may not hold