Statistics, Played

Lesson 01 ● The law of large numbers

One flip is random. A thousand flips are a law.

A single coin flip is genuinely unpredictable — heads or tails, no way to call it. Yet the moment you repeat, structure appears. The share of heads lurches around at first, then quietly settles toward the true probability and stays there. Randomness doesn’t disappear; it averages out.

Flip a few times — the line whips around. Flip a thousand times — it presses flat against 0.50. That convergence is the engine under every average you’ve ever trusted.
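You can watch the same settling without the widget. The sketch below is a minimal simulation, not the page’s own code: the function name `running_share_of_heads`, the seed, and the flip counts are all assumptions chosen for illustration.

```python
import random

def running_share_of_heads(n_flips, p_heads=0.5, seed=0):
    """Return the share of heads after each of n_flips simulated flips."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    heads = 0
    shares = []
    for flip in range(1, n_flips + 1):
        heads += rng.random() < p_heads  # a True result counts as 1
        shares.append(heads / flip)
    return shares

shares = running_share_of_heads(10_000)
for n in (10, 100, 1_000, 10_000):
    print(f"after {n:>6} flips: share of heads = {shares[n - 1]:.3f}")
```

The early entries of `shares` jump around; the late ones sit close to 0.50. Nothing about any single flip changed, only how many of them are being averaged.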

Running share of heads after each flip. Dashed line = the true probability, 0.50.

Readouts: flips, heads, share of heads.

Where this shows up in ML

Nearly every number you read off a model is an average over a sample: accuracy on a test set, the mean loss in a batch, a reward estimate. The law of large numbers is the promise that those estimates settle near the truth if the sample is big enough and drawn from the data you actually care about. A validation accuracy measured on 50 examples can wobble by several points from one sample to the next; on 50,000 the wobble shrinks to a fraction of a point. Same law, every time.

Trust averages, not single draws — and trust them more the more you average.

Lesson 02 ● The normal distribution

One bell, two numbers, the whole world.

The normal distribution — the bell curve — is fixed completely by just two numbers: where it’s centered (the mean μ) and how wide it spreads (the standard deviation σ). Slide them below. Notice that no matter where you put μ or how you stretch σ, the same fractions of the area always sit within one, two, and three σ of the center: 68%, 95%, 99.7%. That ruler never moves.

Drag σ wide and the bell flattens; drag it narrow and it spikes — but the shaded bands keep the same shares. Hit Sample to drop real draws and watch the histogram fill the curve from underneath.
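The fixed ruler can be checked in a few lines. This is a sketch under stated assumptions: `share_within` uses the standard identity that a normal draw lands within k standard deviations with probability erf(k/√2), and `empirical_share` (a hypothetical helper, with an arbitrary seed and sample size) counts real draws for whatever μ and σ you pick.

```python
import math
import random

def share_within(k):
    """Exact probability that a normal draw lands within k sigma of the mean."""
    return math.erf(k / math.sqrt(2))

def empirical_share(k, mu, sigma, n=100_000, seed=0):
    """Fraction of n simulated draws from N(mu, sigma) that land within k sigma."""
    rng = random.Random(seed)
    hits = sum(abs(rng.gauss(mu, sigma) - mu) <= k * sigma for _ in range(n))
    return hits / n

for k in (1, 2, 3):
    print(f"within ±{k}σ: exact {share_within(k):.1%}, "
          f"sampled with μ=5, σ=3: {empirical_share(k, 5.0, 3.0):.1%}")
```

The exact column prints 68.3%, 95.4%, and 99.7%. The sampled column should land close to it for any μ and σ you substitute, which is the point: the bands depend on k alone.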

The bell, with ±1σ / ±2σ / ±3σ bands shaded. Bars = sampled draws.

Controls: mean μ = 0.0, std dev σ = 1.0. Readouts: within ±1σ 68.3%, within ±2σ 95.4%, within ±3σ 99.7%, sample mean.

Where this shows up in ML

The Gaussian is the default assumption almost everywhere: neural-net weights are commonly initialized from it, L2 regularization (weight decay) corresponds to a Gaussian prior on those weights, measurement noise is modeled as Gaussian, and “this point is 3σ out” is how simple anomaly detection flags an outlier. Two numbers, center and spread, carry an enormous amount of a model’s behavior.

Center and spread are not summaries of the bell — they are the bell.

Lesson 03 ● The central limit theorem

Average anything enough times and you get a bell.

Take the average of a sample from almost any distribution, however lopsided, as long as its variance is finite. Collect many of those averages and they shape themselves into the same curve, more closely the larger each sample is: the normal bell from Lesson 02. The underlying population can have two humps, a long tail, anything. The distribution of its means barely cares.

The population strip below has two peaks — nothing bell-shaped about it. Yet the averages of n draws from it land in a clean bell. Raise n and the bell tightens. This is why so much of the world looks normal even when its parts don’t.
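Here is a rough way to reproduce the tightening. The two-humped population is an assumption (an even mix of normals centered at -2 and +2, each with spread 0.5), not the page’s actual strip, and `sample_means` is a hypothetical helper. For that mix the population standard deviation is √4.25, so the spread of the means should track √4.25 / √n.

```python
import math
import random
import statistics

def draw_population(rng):
    """One draw from an assumed two-humped population."""
    center = -2.0 if rng.random() < 0.5 else 2.0
    return rng.gauss(center, 0.5)

def sample_means(n, how_many=2_000, seed=0):
    """Collect how_many averages, each over n independent draws."""
    rng = random.Random(seed)
    means = []
    for _ in range(how_many):
        draws = [draw_population(rng) for _ in range(n)]
        means.append(statistics.fmean(draws))
    return means

POPULATION_STD = math.sqrt(4.25)  # sqrt(2^2 + 0.5^2) for the mix above

for n in (5, 20, 80):
    spread = statistics.pstdev(sample_means(n))
    print(f"n={n:>2}: spread of means = {spread:.3f}, "
          f"predicted = {POPULATION_STD / math.sqrt(n):.3f}")
```

Quadrupling n should roughly halve the spread. A histogram of `sample_means(20)` would show one hump, even though no single draw ever lands near zero.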

Top: histogram of collected sample means. Bottom: the two-humped population they’re drawn from.

Control: sample size n = 5. Readouts: means collected, mean of means, spread.

Where this shows up in ML

It’s why mini-batch gradients work: each batch gradient is an average of per-example gradients, so it concentrates around the true gradient, and the larger the batch, the tighter (the spread shrinks like 1/√n). It’s also why error bars, confidence intervals, and A/B-test significance remain approximately valid even when the raw data is nowhere near normal, provided the sample is reasonably large.

You almost never see the population. You see averages of it — and averages are born normal.

Lesson 04 ● Bayes & the rare disease

A positive test usually means you’re fine.

The test came back positive. Are you sick? Surprisingly often: probably not. When a disease is rare, the few true cases are buried under a pile of false alarms — because even a small error rate, applied to the huge healthy majority, produces more wrong positives than there are sick people. The intuition is almost impossible to feel from the formula. So count the dots.

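1,000 people. Red = actually sick, blue = actually healthy. The test splits them into negative and positive. Look at the positive column: at 1% prevalence with a good 90% test, most of the flagged dots are still blue, healthy people caught by a false alarm. Of the 10 sick people, the test catches 9; of the 990 healthy people, it wrongly flags 99. So only 9 of 108 positives, about 8%, are actually sick.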
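The dot count is Bayes’ rule written as arithmetic. A minimal sketch, with `p_sick_given_positive` as a hypothetical helper name, using the page’s default sliders of 1% prevalence and 90% sensitivity and specificity:

```python
def p_sick_given_positive(prevalence, sensitivity, specificity):
    """Bayes' rule: share of positive tests that belong to sick people."""
    true_hits = prevalence * sensitivity                   # sick and flagged
    false_alarms = (1 - prevalence) * (1 - specificity)    # healthy but flagged
    return true_hits / (true_hits + false_alarms)

for prevalence in (0.01, 0.10, 0.50):
    p = p_sick_given_positive(prevalence, sensitivity=0.90, specificity=0.90)
    print(f"prevalence {prevalence:.0%}: P(sick | positive) = {p:.1%}")
```

At 1% prevalence this gives about 8.3%; at 10% it gives 50%; at 50% it gives 90%. The test is identical in all three rows.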

Legend: actually sick, actually healthy. Controls: prevalence 1%, sensitivity 90%, specificity 90%. Readouts: P(sick | positive), true hits, false alarms.

Slide prevalence up: as the disease gets common, the same test suddenly becomes trustworthy. The test never changed — only how rare the thing it’s looking for is. That’s Bayes: what you should believe after the evidence depends on what was true before it.

Where this shows up in ML

This is the precision problem. A fraud or disease classifier with 99% accuracy can still be wrong most of the time it fires an alarm, because the positive class is rare — the base rate dominates. Bayes is also the spine of Naive Bayes classifiers, Bayesian deep learning, and every “prior × likelihood” update. If you ignore the base rate, your confident model lies to you.

Evidence updates belief; it doesn’t replace it. Always start from the base rate.

Lesson 05 ● Maximum likelihood

Training is sliding a curve until the data stops being surprised.

Here is a cloud of data points. Suppose they came from a bell — but which one? Maximum likelihood says: pick the curve under which the data you actually saw is least surprising. Slide μ and σ to move and stretch the candidate bell. The log-likelihood readout scores how well the curve explains the points — higher is better. There is exactly one best answer, and you can find it by hand.

Drag μ off-center and the score plummets — the curve is putting its mass where there’s no data. Center it and tune σ to the cloud’s real width, and the score peaks. Hit Snap to MLE to jump to the mathematical best fit — it lands exactly on the sample’s own mean and spread.
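The score and the snap can both be written down directly. This is a sketch, not the page’s implementation: the data cloud is simulated (true mean 1.0, spread 2.0, arbitrary seed), and `log_likelihood` and `gaussian_mle` are hypothetical names. The fit uses the divide-by-n spread, which is the maximum-likelihood estimate for a normal.

```python
import math
import random

def log_likelihood(data, mu, sigma):
    """Log-probability density of the data under a normal with mean mu, std sigma."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - squared_error / (2 * sigma ** 2)

def gaussian_mle(data):
    """Closed-form best fit: the sample mean and the divide-by-n standard deviation."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma

rng = random.Random(0)
data = [rng.gauss(1.0, 2.0) for _ in range(200)]  # assumed data cloud

mu_hat, sigma_hat = gaussian_mle(data)
print(f"guess  mu=0.00, sigma=1.00: {log_likelihood(data, 0.0, 1.0):.1f}")
print(f"fitted mu={mu_hat:.2f}, sigma={sigma_hat:.2f}: "
      f"{log_likelihood(data, mu_hat, sigma_hat):.1f}")
```

Any other (μ, σ) you try should score no higher than the fitted pair, which is what “best possible” means in the readout.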

Dots = observed data (rug + histogram). Curve = your candidate bell. The taller the curve sits over the dots, the higher the likelihood.

Controls: guess mean μ = 0.0, guess std σ = 1.0. Readouts: log-likelihood, best possible, sample mean / std.

Where this shows up in ML

Many of the most common loss functions are disguised negative log-likelihoods. Fitting a regression with mean squared error is exactly maximum likelihood under Gaussian noise with a fixed spread, the same kind of calculation you just did by hand. Cross-entropy (Lesson 07) is maximum likelihood for classification. “Minimize the loss” and “make the data most likely” are the same sentence.

A trained model is just the parameters that make your data least surprising.

Lesson 06 ● Gradient descent

Roll downhill. The step size decides everything.

A model learns by minimizing a loss. Picture the loss as a valley and the model as a ball: at each step, look at the slope under your feet and take a step downhill. That’s gradient descent — the algorithm behind essentially all of modern ML. The only real knob is the learning rate: how big a step you take. It looks innocent. It is not.

Set a tiny learning rate and the ball crawls — correct, but glacially. Set it just right and it slides to the bottom in a few steps. Push it too high and the ball overshoots, bouncing higher each time until it flies off the surface entirely. Same valley, same start — the step size alone is the difference between learning and diverging.
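The three regimes show up even on the simplest valley. The sketch below assumes a loss of L(w) = w², which is not necessarily the page’s surface; for that loss each step multiplies w by (1 − 2η), so anything above η = 1 grows instead of shrinking. The name `gradient_descent` and the starting point are illustrative.

```python
def gradient_descent(learning_rate, w0=4.0, steps=25):
    """Minimize the assumed loss L(w) = w**2 starting from w0; return the final w."""
    w = w0
    for _ in range(steps):
        gradient = 2 * w              # dL/dw for L(w) = w**2
        w -= learning_rate * gradient
    return w

for eta in (0.01, 0.4, 1.1):
    w = gradient_descent(eta)
    print(f"eta={eta:<4}: w after 25 steps = {w:.4g}, loss = {w * w:.4g}")
```

With η = 0.01 the ball has barely moved after 25 steps, with η = 0.4 it is essentially at the minimum, and with η = 1.1 it ends farther from the bottom than it started.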

The loss surface, with the ball’s path traced. Flag = the minimum we’re trying to reach.

Control: learning rate η = 0.10. Readouts: step, position w, loss, status.

Where this shows up in ML

This is the whole training loop. Backpropagation just computes the slope; gradient descent takes the step. The learning rate is the single most important hyperparameter you tune — too high and the loss explodes to NaN, too low and training takes forever. Schedulers, warm-up, Adam, momentum — all of them are smarter answers to the one question you just felt: how big a step?

The gradient picks the direction. The learning rate decides whether you arrive or detonate.

Lesson 07 ● Entropy & cross-entropy

Build the classification loss with your own hands.

A classifier outputs a guess: a probability for each class. How wrong is it? Cross-entropy answers this — and it’s the loss function behind nearly every classifier ever trained. The truth p is fixed (the real label). Your prediction q is the bars you control. Cross-entropy measures the surprise of seeing the truth through the eyes of your prediction. Push a bar toward the true class and the loss drops; bet confidently on the wrong class and it spikes toward infinity.

Move the sliders to set your predicted probabilities (they auto-normalize to sum to 1). KL divergence is the pure penalty — how far q is from p — and it hits exactly zero, its floor, the instant your prediction matches the truth. That moment is what a classifier is chasing.
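The three readouts are a few sums. A minimal sketch measured in bits, with hypothetical function names and an example truth and prediction that are not taken from the page:

```python
import math

def entropy(p):
    """H(p) in bits: the irreducible surprise in the truth itself."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in bits: surprise of the truth p as seen through the prediction q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # the truth was declared impossible
            total -= pi * math.log2(qi)
    return total

def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p): the avoidable part of the loss."""
    return cross_entropy(p, q) - entropy(p)

truth = [0.0, 1.0, 0.0]  # one-hot label: class 1 is correct
for name, q in [("humble and right", [0.1, 0.7, 0.2]),
                ("sure and right", [0.01, 0.98, 0.01]),
                ("sure and wrong", [0.98, 0.01, 0.01])]:
    print(f"{name:>17}: H(p,q) = {cross_entropy(truth, q):.3f} bits, "
          f"KL = {kl_divergence(truth, q):.3f} bits")
```

With a one-hot truth, H(p) is zero, so cross-entropy and KL coincide. With a soft truth they differ by exactly H(p), and only the KL part can be driven to zero.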

Legend: truth p (the real label), your prediction q. Readouts: cross-entropy H(p,q), entropy H(p), KL(p‖q), the loss.

Where this shows up in ML

Cross-entropy is the loss in logistic regression, in every softmax classifier, in language models predicting the next token. Minimizing it is maximum likelihood (Lesson 05) for categorical data. Its steepness near a confident-but-wrong answer is exactly why it trains well — and why a single overconfident mistake dominates the loss. Measured in bits, it’s also literally how many bits your model wastes encoding the truth.

Cross-entropy rewards being right and humble, and punishes being wrong and sure.

Lesson 08 ● The bias–variance tradeoff

Fit harder, fit worse.

Here are a few noisy points from a smooth hidden curve. You get to choose how flexible your model is — the degree of the polynomial it fits. A straight line (degree 1) is too stiff: it misses the shape (high bias, underfitting). A very high degree threads through every single point perfectly — and then thrashes wildly between them, chasing the noise instead of the signal (high variance, overfitting). The training error keeps dropping. The error on new data does not.

Drag the degree up. Watch the training error (fit to the dots) fall toward zero while the test error (fit to the true curve) first improves, then turns and climbs. The best model is in the valley between too simple and too clever — never the one that fits the data best.
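The same experiment fits in a short script. Everything specific here is an assumption for illustration: the hidden curve (sin(πx) on [-1, 1]), the noise level, the 12 training points, and the helper names. The fit solves the least-squares normal equations by Gaussian elimination, which is fine at these small degrees but is not how you would do it at scale.

```python
import math
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    m = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    for col in range(m):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    coeffs = [0.0] * m
    for i in reversed(range(m)):  # back substitution
        tail = sum(A[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - tail) / A[i][i]
    return coeffs

def predict(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))

def mse(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def true_curve(x):
    return math.sin(math.pi * x)  # assumed hidden curve

rng = random.Random(0)
train_x = [-1 + 2 * i / 11 for i in range(12)]
train_y = [true_curve(x) + rng.gauss(0, 0.3) for x in train_x]  # noisy dots
test_x = [-1 + 2 * i / 199 for i in range(200)]
test_y = [true_curve(x) for x in test_x]                        # noise-free truth

train_err, test_err = {}, {}
for degree in (1, 3, 9):
    coeffs = fit_polynomial(train_x, train_y, degree)
    train_err[degree] = mse(coeffs, train_x, train_y)
    test_err[degree] = mse(coeffs, test_x, test_y)
    print(f"degree {degree}: train {train_err[degree]:.4f}, test {test_err[degree]:.4f}")
```

Training error cannot rise as the degree goes up, because each higher-degree model contains the lower ones. Test error carries no such guarantee, and comparing the two columns is the whole lesson.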

Dots = noisy training data. Faint line = the hidden true curve. Bold line = your fitted polynomial.

Control: polynomial degree d = 3. Readouts: training error, test error (vs truth), verdict.

Where this shows up in ML

This is the central tension of the whole field. Model capacity (depth, width, parameters) is the degree slider; regularization, dropout, early stopping, and more data are how you pull back from the overfitting cliff. The reason you keep a held-out test set at all is everything this lesson shows: the score that matters is the one on data the model has never seen.

A model that has memorized the past has not understood it. Generalization lives in the valley.

Lesson 09 ● Markov chains

Where you go next depends only on where you are now.

A Markov chain is the simplest model of a thing that changes over time: a set of states, and a rule that says, given today’s state, the odds of each state tomorrow. It has no memory. The chain doesn’t know or care how it got here. Yet from that goldfish-brained hopping, something durable emerges: run it long enough and the share of time spent in each state stops depending on where it began. That long-run mix is the stationary distribution, and any chain that can get from every state to every other and doesn’t cycle in lockstep (irreducible and aperiodic, in the jargon) falls into it.

The weather below hops Sunny → Cloudy → Rainy by the rule in the table. Start it raining or start it sunny — doesn’t matter. Step it a few hundred times and the three bars press toward the same fixed heights (the dashed markers), the stationary weather the rule implies.

Transition rule P — read a row as “if today is X, tomorrow is…”
	→ Sunny	→ Cloudy	→ Rainy
Sunny →	0.70	0.20	0.10
Cloudy →	0.30	0.40	0.30
Rainy →	0.25	0.35	0.40
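The dashed markers can be computed straight from the table. The sketch below uses the transition rule exactly as given; the two helpers (`stationary`, which repeatedly pushes a distribution through the rule, and `empirical_shares`, which simulates the hopping) are hypothetical names, and the seed and step counts are arbitrary.

```python
import random

STATES = ["Sunny", "Cloudy", "Rainy"]
P = [[0.70, 0.20, 0.10],   # Sunny  -> Sunny, Cloudy, Rainy
     [0.30, 0.40, 0.30],   # Cloudy -> ...
     [0.25, 0.35, 0.40]]   # Rainy  -> ...

def stationary(P, iterations=200):
    """Power iteration: apply the rule to a distribution until it stops changing."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi

def empirical_shares(P, start, steps, seed=0):
    """Simulate the chain and return the share of time spent in each state."""
    rng = random.Random(seed)
    state = start
    counts = [0] * len(P)
    for _ in range(steps):
        state = rng.choices(range(len(P)), weights=P[state])[0]
        counts[state] += 1
    return [c / steps for c in counts]

pi = stationary(P)
print("stationary:", ", ".join(f"{s} {p:.3f}" for s, p in zip(STATES, pi)))
for start in (0, 2):  # start sunny, then start rainy
    shares = empirical_shares(P, start, steps=50_000)
    print(f"started {STATES[start]:>5}:", ", ".join(f"{x:.3f}" for x in shares))
```

For this table the stationary distribution works out to 51/106, 31/106, and 24/106, roughly 48% sunny, 29% cloudy, and 23% rainy. Both simulated runs should land near those shares regardless of the starting state.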

Top: the most recent states, newest at the right. Bottom: share of time in each state so far (bars) vs. the stationary distribution (dashed markers).

Readouts: current state (starts at Sunny), steps, empirical % (Sunny / Cloudy / Rainy), stationary π (Sunny / Cloudy / Rainy).

Where this shows up in ML

Markov chains are the skeleton under an enormous amount of AI. PageRank is the stationary distribution of a random surfer; reinforcement learning models the world as a Markov decision process; MCMC samples impossible distributions by building a chain whose stationary distribution is the target; and an n-gram language model is literally a Markov chain over words. The memoryless assumption is wrong about almost everything — and useful anyway.

Forget where you started. A Markov chain remembers only the present, yet always arrives at the same long run.

Lesson 10 ● Monte Carlo

Compute the uncomputable by throwing darts.

Some numbers are hard to derive but easy to sample. Take π. Drop random points into a square with a circle inscribed in it. The fraction that land inside the circle is the ratio of their areas — π/4. So multiply that fraction by four and you have estimated π using nothing but a random-number generator and counting. No calculus, no formula for the circle. This is the whole Monte Carlo idea: when you can’t calculate a quantity, replace it with the average of random draws.

Throw a hundred darts — the estimate is rough. Throw ten thousand — it homes in on 3.14159. The error shrinks like 1/√N: to get one more digit you need a hundred times the darts. Slow, but it works in any number of dimensions, which is exactly where the formulas give up.
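The dart board is a few lines of code. A minimal sketch, assuming a square from -1 to 1 on each side with the unit circle inscribed in it; `estimate_pi` and the seed are illustrative choices.

```python
import math
import random

def estimate_pi(n_darts, seed=0):
    """Throw n_darts uniformly at the square [-1, 1] x [-1, 1]; count hits in the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_darts):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        inside += x * x + y * y <= 1.0
    return 4 * inside / n_darts  # inside / n estimates the area ratio pi / 4

for n in (100, 10_000, 1_000_000):
    estimate = estimate_pi(n)
    print(f"{n:>9} darts: estimate = {estimate:.5f}, error = {abs(estimate - math.pi):.5f}")
```

Each hundredfold increase in darts should buy roughly one more correct digit, which is the 1/√N rate in action. Any single run can do a little better or worse than that.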

Legend: inside the circle, outside. Readouts: darts, inside, estimate of π, error.

Where this shows up in ML

Whenever a quantity is an expectation you can’t integrate, you Monte-Carlo it. Reinforcement learning estimates the value of a state by averaging sampled returns; dropout and Bayesian nets average many random forward passes; variational inference and policy gradients are expectations approximated by samples. “Just sample it and average” is one of the most powerful moves in the field — and it’s the law of large numbers (Lesson 01) doing a job.

If you can’t compute it but you can sample it, sample it. The average is your answer.

Lesson 11 ● The reparameterization trick

Sample from a distribution — and still send a gradient through it.

Here is a problem that stops a lot of models cold. You want a layer that samples: draw z from a normal with mean μ and spread σ, where μ and σ are things the network learns. But sampling is random — you can’t differentiate through a dice roll, so the gradient can’t reach μ and σ, and the layer can’t train. The fix is a sleight of hand. Don’t draw z directly. Draw the randomness once, as plain noise ε from a fixed N(0,1), and then build z out of it: z = μ + σ·ε. Now all the randomness lives in ε — frozen, no parameters — and μ and σ enter through an ordinary, differentiable formula.

The grey ticks are ε: a fixed cloud of standard noise that never changes. Each one is tied to a blue point z below it. Drag μ and the whole cloud of z slides together; drag σ and it stretches around μ — rigidly, deterministically. Because ∂z/∂μ = 1 and ∂z/∂σ = ε, the gradient walks straight back to the knobs. Hit New noise to redraw ε and watch the same machinery reshape it.
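To see the gradient actually arrive at the knobs, pick a toy objective. The sketch below assumes a loss of z², which is not from the page; its expectation is μ² + σ², so the true gradients are 2μ and 2σ and the estimates can be checked against them. The function name, seed, and sample count are illustrative.

```python
import random

def reparam_gradients(mu, sigma, n=50_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[z**2], where z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)       # parameter-free noise
        z = mu + sigma * eps            # deterministic in mu and sigma
        dloss_dz = 2 * z                # assumed toy loss: z**2
        grad_mu += dloss_dz * 1.0       # chain rule with dz/dmu = 1
        grad_sigma += dloss_dz * eps    # chain rule with dz/dsigma = eps
    return grad_mu / n, grad_sigma / n

g_mu, g_sigma = reparam_gradients(mu=0.5, sigma=1.5)
print(f"d/dmu    estimate {g_mu:.3f}   (exact {2 * 0.5:.1f})")
print(f"d/dsigma estimate {g_sigma:.3f}   (exact {2 * 1.5:.1f})")
```

An autodiff framework does the chain-rule lines for you; they are written out here so the two partial derivatives from the readout are visible in the code.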

Top: frozen noise ε ~ N(0,1). Bottom: z = μ + σ·ε ~ N(μ,σ²), a normal with mean μ and standard deviation σ. Same noise, deterministically reshaped by the two knobs.

Controls: mean μ = 0.0, spread σ = 1.5. Readouts: sample mean of z, sample spread of z, gradient path (∂z/∂μ = 1, ∂z/∂σ = ε).

Where this shows up in ML

This single trick is what makes the variational autoencoder trainable, and it powers the latent sampling in diffusion models and stochastic policies. The principle generalizes far past Gaussians: push the randomness out to a fixed, parameter-free source and make your parameters enter through a differentiable transform. Then backpropagation — which only knows how to follow deterministic arrows — can train a layer that genuinely samples.

Freeze the randomness, keep the knobs differentiable, and a coin flip becomes something you can train.

Lesson 12 ● Attention as a probability

Attention is a weighted average — and softmax sets the weights.

The mechanism under every transformer sounds exotic and is, underneath, pure probability. A query compares itself to a set of keys, producing a raw similarity score for each. Those scores aren’t a distribution — they’re just numbers. Softmax turns them into one: exponentiate, then normalize so they sum to 1. Now they’re probabilities — the attention weights — and the output is simply the weighted average of the values under them. Attention is the model deciding, probabilistically, what to look at.

Drag the query along the row. Each key’s weight (blue) grows as the query nears it. The temperature is the dial that matters: turn it down and softmax sharpens to a near one-hot spike — the query attends to a single key. Turn it up and the weights flatten toward uniform — the query blurs across everything. Same scores, completely different behavior.
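A one-dimensional version is short enough to read in full. The similarity score used here (negative squared distance between query and key), the key positions, and the values are assumptions for illustration, not the page’s internals; `softmax` and `attend` are hypothetical names.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into weights that are positive and sum to 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, temperature):
    """Score the query against each key, softmax the scores, average the values."""
    scores = [-(query - k) ** 2 for k in keys]   # assumed similarity: closer is higher
    weights = softmax(scores, temperature)
    output = sum(w * v for w, v in zip(weights, values))
    return weights, output

keys = [0.0, 0.25, 0.5, 0.75, 1.0]
values = [1.0, 3.0, 2.0, 5.0, 4.0]

for tau in (0.005, 0.15, 5.0):
    weights, output = attend(0.5, keys, values, tau)
    print(f"tau={tau:<5}: largest weight = {max(weights):.3f}, output = {output:.3f}")
```

At the lowest temperature nearly all the weight sits on the key at 0.5 and the output is close to that key’s value. At the highest, the weights are nearly uniform and the output drifts toward the plain mean of the values.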

Faint bars = each key’s value. Blue bars = attention weight on it. Green line = the output, the weighted average of the values.

Controls: query position q = 0.50, temperature τ = 0.15. Readouts: largest weight, attention entropy, output value.

Where this shows up in ML

This is self-attention, the engine of transformers: GPT, BERT, and most modern language and vision models. Real attention scores queries against keys with a dot product and scales by 1/√d, where d is the key dimension (that scaling acts as a temperature, keeping softmax from saturating). Stack this operation in parallel heads and in layers and you get a model that, token by token, computes a probability distribution over what to attend to, and reads out the weighted average. Intelligence, it turns out, leans hard on a softmax.

Strip the jargon and attention is one sentence: softmax the scores, average the values.

