Machine learning is applied probability wearing a hoodie. Loss functions are likelihoods, training is optimization on a statistical surface, and most failures are statistics misunderstood. This is a course in those foundations: twelve ideas you play with by hand, then meet again the moment you train a model.
A single coin flip is genuinely unpredictable — heads or tails, no way to call it. Yet the moment you repeat, structure appears. The share of heads lurches around at first, then quietly settles toward the true probability and stays there. Randomness doesn’t disappear; it averages out.
Flip a few times — the line whips around. Flip a thousand times — it presses flat against 0.50. That convergence is the engine under every average you’ve ever trusted.
Running share of heads after each flip. Dashed line = the true probability, 0.50.
Trust averages, not single draws — and trust them more the more you average.
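The same settling can be reproduced away from the figure. What follows is a minimal Python sketch, not the course's own JavaScript: it assumes a fair coin and a seeded standard-library generator, and the function name `running_share_of_heads` is made up for illustration.

```python
import random


def running_share_of_heads(n_flips, p=0.5, seed=0):
    """Return the share of heads after each flip of a coin with P(heads) = p."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    heads = 0
    shares = []
    for i in range(1, n_flips + 1):
        heads += rng.random() < p  # True counts as 1, False as 0
        shares.append(heads / i)
    return shares


shares = running_share_of_heads(10_000)
for n in (10, 100, 1_000, 10_000):
    print(f"after {n:>6} flips: share of heads = {shares[n - 1]:.3f}")
```

The early entries typically swing widely; the later ones should sit close to 0.50, though any single run can still drift a little.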
The normal distribution — the bell curve — is fixed completely by just two numbers: where it’s centered (the mean μ) and how wide it spreads (the standard deviation σ). Slide them below. Notice that no matter where you put μ or how you stretch σ, the same fractions of the area always sit within one, two, and three σ of the center: 68%, 95%, 99.7%. That ruler never moves.
Drag σ wide and the bell flattens; drag it narrow and it spikes — but the shaded bands keep the same shares. Hit Sample to drop real draws and watch the histogram fill the curve from underneath.
The bell, with ±1σ / ±2σ / ±3σ bands shaded. Bars = sampled draws.
Center and spread are not summaries of the bell — they are the bell.
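The fixed ruler can be checked numerically. This sketch computes the area within k standard deviations from the normal CDF, written with the standard library's `math.erf`; the helper names `normal_cdf` and `share_within` are illustrative, and the three (μ, σ) pairs are arbitrary choices.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a normal with mean mu and std sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def share_within(k, mu, sigma):
    """Area under the bell between mu - k*sigma and mu + k*sigma."""
    return normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)


# Arbitrary centers and spreads: the three shares should not change.
for mu, sigma in ((0.0, 1.0), (10.0, 3.0), (-4.0, 0.5)):
    shares = [share_within(k, mu, sigma) for k in (1, 2, 3)]
    print(f"mu={mu:>5}, sigma={sigma:>4}: " + ", ".join(f"{s:.4f}" for s in shares))
```

Each row should read roughly 0.6827, 0.9545, 0.9973: the 68%, 95%, 99.7% bands, whatever μ and σ are.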
Take the average of a sample from almost any distribution, however lopsided, as long as its variance is finite. Collect many of those averages and, once each sample is reasonably large, they shape themselves into the same curve: the normal bell from Lesson 02. The underlying population can have two humps, a long tail, anything. The distribution of its means barely cares.
The population strip below has two peaks — nothing bell-shaped about it. Yet the averages of n draws from it land in a clean bell. Raise n and the bell tightens. This is why so much of the world looks normal even when its parts don’t.
Top: histogram of collected sample means. Bottom: the two-humped population they’re drawn from.
You almost never see the population. You see averages of it — and averages are born normal.
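To see the tightening in numbers, here is a small sketch under stated assumptions: the two-humped population is a made-up 50/50 mixture of two normals (the figure's actual population is not specified in the text), and the function names are illustrative.

```python
import math
import random
import statistics


def draw_from_population(rng):
    """One draw from a hypothetical two-humped population (mixture of two normals)."""
    if rng.random() < 0.5:
        return rng.gauss(-2.0, 0.5)
    return rng.gauss(3.0, 0.8)


def collect_sample_means(n, how_many, seed=0):
    """Average n draws, and repeat that how_many times."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw_from_population(rng) for _ in range(n))
        for _ in range(how_many)
    ]


rng = random.Random(1)
population = [draw_from_population(rng) for _ in range(20_000)]
pop_sd = statistics.pstdev(population)

for n in (1, 5, 30):
    means = collect_sample_means(n, 2_000)
    print(
        f"n={n:>2}: spread of means = {statistics.pstdev(means):.3f}, "
        f"population spread / sqrt(n) = {pop_sd / math.sqrt(n):.3f}"
    )
```

The two columns should roughly agree: the bell of means narrows like 1/√n, even though no single draw comes from a bell.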
The test came back positive. Are you sick? Surprisingly often: probably not. When a disease is rare, the few true cases are buried under a pile of false alarms — because even a small error rate, applied to the huge healthy majority, produces more wrong positives than there are sick people. The intuition is almost impossible to feel from the formula. So count the dots.
1,000 people. Red = actually sick. The test splits them into negative and positive. Look at the positive column: at 1% prevalence with a good 90% test, most red-flagged dots are still blue — healthy people caught by a false alarm.
Slide prevalence up: as the disease gets common, the same test suddenly becomes trustworthy. The test never changed — only how rare the thing it’s looking for is. That’s Bayes: what you should believe after the evidence depends on what was true before it.
Evidence updates belief; it doesn’t replace it. Always start from the base rate.
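Counting the dots is Bayes' rule in disguise. The sketch below assumes that "a good 90% test" means 90% sensitivity and 90% specificity (the text does not say which), and the function name is illustrative.

```python
def posterior_sick_given_positive(prevalence, sensitivity=0.90, specificity=0.90):
    """P(sick | positive test) by Bayes' rule.

    Assumption: the 90% figure applies to both sensitivity and specificity.
    """
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


for prevalence in (0.01, 0.10, 0.50):
    p = posterior_sick_given_positive(prevalence)
    print(f"prevalence {prevalence:>4.0%}: P(sick | positive) = {p:.1%}")
```

Under that assumption, 1,000 people at 1% prevalence give 9 true positives against 99 false alarms, so a positive result means about an 8.3% chance of being sick. At 10% prevalence the same test gives 50%, and at 50% prevalence it gives 90%.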
Here is a cloud of data points. Suppose they came from a bell — but which one? Maximum likelihood says: pick the curve under which the data you actually saw is least surprising. Slide μ and σ to move and stretch the candidate bell. The log-likelihood readout scores how well the curve explains the points — higher is better. There is exactly one best answer, and you can find it by hand.
Drag μ off-center and the score plummets — the curve is putting its mass where there’s no data. Center it and tune σ to the cloud’s real width, and the score peaks. Hit Snap to MLE to jump to the mathematical best fit — it lands exactly on the sample’s own mean and spread.
Dots = observed data (rug + histogram). Curve = your candidate bell. The taller the curve sits over the dots, the higher the likelihood.
A trained model is just the parameters that make your data least surprising.
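The score readout and the snap can be sketched in a few lines. This assumes the data are modelled as independent draws from one normal; the sample here is synthetic, and `log_likelihood` and `fit_mle` are illustrative names.

```python
import math
import random
import statistics


def log_likelihood(data, mu, sigma):
    """Log-likelihood of the data under a normal with mean mu and std sigma."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (
        -n * math.log(sigma)
        - 0.5 * n * math.log(2.0 * math.pi)
        - squared_error / (2.0 * sigma**2)
    )


def fit_mle(data):
    """Closed-form MLE for a normal: the sample mean and the divide-by-n std."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data, mu)
    return mu, sigma


rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(200)]  # synthetic cloud of points

mu_hat, sigma_hat = fit_mle(data)
print(f"MLE: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
print(f"score at MLE:       {log_likelihood(data, mu_hat, sigma_hat):.2f}")
print(f"score, mu shifted:  {log_likelihood(data, mu_hat + 1.0, sigma_hat):.2f}")
print(f"score, sigma wider: {log_likelihood(data, mu_hat, sigma_hat * 2.0):.2f}")
```

Any nudge away from the fitted pair lowers the score, which is what dragging the sliders off the peak shows.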
A model learns by minimizing a loss. Picture the loss as a valley and the model as a ball: at each step, look at the slope under your feet and take a step downhill. That’s gradient descent — the algorithm behind essentially all of modern ML. The only real knob is the learning rate: how big a step you take. It looks innocent. It is not.
Set a tiny learning rate and the ball crawls — correct, but glacially. Set it just right and it slides to the bottom in a few steps. Push it too high and the ball overshoots, bouncing higher each time until it flies off the surface entirely. Same valley, same start — the step size alone is the difference between learning and diverging.
The loss surface, with the ball’s path traced. Flag = the minimum we’re trying to reach.
Set it too high and the loss blows up to NaN; too low and training takes forever. Schedulers, warm-up, Adam, momentum are all smarter answers to the one question you just felt: how big a step?
The gradient picks the direction. The learning rate decides whether you arrive or detonate.
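The three regimes show up even on the simplest possible valley. The sketch assumes a stand-in loss f(x) = x² (the figure's actual surface is not given), and the three learning rates are picked only to illustrate crawling, converging, and diverging.

```python
def gradient_descent(learning_rate, start=5.0, steps=20):
    """Plain gradient descent on the stand-in loss f(x) = x**2."""
    x = start
    path = [x]
    for _ in range(steps):
        gradient = 2.0 * x  # derivative of x**2
        x = x - learning_rate * gradient  # step downhill
        path.append(x)
    return path


for learning_rate in (0.01, 0.4, 1.1):
    final_x = gradient_descent(learning_rate)[-1]
    print(f"learning rate {learning_rate:>4}: x after 20 steps = {final_x:.6g}")
```

On this loss each step multiplies x by (1 − 2·learning rate). At 0.01 that factor is 0.98, so the ball barely moves; at 0.4 it is 0.2, so the ball reaches the bottom almost at once; at 1.1 it is −1.2, so every step overshoots further than the last.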
A classifier outputs a guess: a probability for each class. How wrong is it? Cross-entropy answers this — and it’s the loss function behind nearly every classifier ever trained. The truth p is fixed (the real label). Your prediction q is the bars you control. Cross-entropy measures the surprise of seeing the truth through the eyes of your prediction. Push a bar toward the true class and the loss drops; bet confidently on the wrong class and it spikes toward infinity.
Move the sliders to set your predicted probabilities (they auto-normalize to sum to 1). KL divergence is the pure penalty — how far q is from p — and it hits exactly zero, its floor, the instant your prediction matches the truth. That moment is what a classifier is chasing.
Cross-entropy rewards being right and humble, and punishes being wrong and sure.
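Both readouts are short formulas. A minimal sketch, assuming natural logarithms and a one-hot truth; the function names and the small `eps` guard against log(0) are choices made here, not taken from the course code.

```python
import math


def normalize(raw):
    """Rescale non-negative slider values so they sum to 1."""
    total = sum(raw)
    return [r / total for r in raw]


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum p_i * log q_i, skipping terms where p_i is 0."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p): the part of the loss the prediction controls."""
    return cross_entropy(p, q) - cross_entropy(p, p)


truth = [0.0, 1.0, 0.0]  # the real label is class 1
predictions = {
    "right and humble": normalize([1, 8, 1]),
    "right and sure": normalize([1, 98, 1]),
    "wrong and sure": normalize([98, 1, 1]),
    "exactly the truth": truth,
}
for name, q in predictions.items():
    print(f"{name:>17}: CE = {cross_entropy(truth, q):.3f}, KL = {kl_divergence(truth, q):.3f}")
```

With a one-hot truth the entropy of p is zero, so cross-entropy and KL coincide: both equal −log of the probability placed on the true class.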
Here are a few noisy points from a smooth hidden curve. You get to choose how flexible your model is — the degree of the polynomial it fits. A straight line (degree 1) is too stiff: it misses the shape (high bias, underfitting). A very high degree threads through every single point perfectly — and then thrashes wildly between them, chasing the noise instead of the signal (high variance, overfitting). The training error keeps dropping. The error on new data does not.
Drag the degree up. Watch the training error (fit to the dots) fall toward zero while the test error (fit to the true curve) first improves, then turns and climbs. The best model is in the valley between too simple and too clever — never the one that fits the data best.
Dots = noisy training data. Faint line = the hidden true curve. Bold line = your fitted polynomial.
A model that has memorized the past has not understood it. Generalization lives in the valley.
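The degree slider can be imitated with a small least-squares fit. Everything specific here is an assumption for illustration: the hidden curve is taken to be sin(πx), the noise level is 0.3, there are ten training points, and the fit solves the normal equations with a hand-written Gaussian elimination so the sketch needs no libraries.

```python
import math
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first)."""
    k = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    aty = [sum(y * x**i for x, y in zip(xs, ys)) for i in range(k)]
    return solve_linear(ata, aty)


def evaluate(coeffs, x):
    """Evaluate the polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def mean_squared_error(coeffs, xs, ys):
    return sum((evaluate(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def hidden_curve(x):
    return math.sin(math.pi * x)  # assumed stand-in for the true curve


rng = random.Random(0)
train_x = [-1.0 + 2.0 * i / 9 for i in range(10)]
train_y = [hidden_curve(x) + rng.gauss(0.0, 0.3) for x in train_x]
test_x = [-1.0 + 2.0 * i / 199 for i in range(200)]
test_y = [hidden_curve(x) for x in test_x]

for degree in (1, 3, 9):
    coeffs = fit_polynomial(train_x, train_y, degree)
    print(
        f"degree {degree}: train error = {mean_squared_error(coeffs, train_x, train_y):.4f}, "
        f"test error = {mean_squared_error(coeffs, test_x, test_y):.4f}"
    )
```

The training error can only fall as the degree rises, and at degree 9 the curve passes through all ten points. The test error usually tells the other half of the story, though the exact numbers depend on the noise drawn.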
A Markov chain is the simplest model of a thing that changes over time: a set of states, and a rule that gives, for today’s state, the odds of each state tomorrow. It has no memory. The chain doesn’t know or care how it got here. Yet from that goldfish-brained hopping, something durable emerges: run it long enough and the share of time spent in each state stops depending on where it began. That long-run mix is the stationary distribution, and any chain that can get from every state to every other falls into it.
The weather below hops Sunny → Cloudy → Rainy by the rule in the table. Start it raining or start it sunny — doesn’t matter. Step it a few hundred times and the three bars press toward the same fixed heights (the dashed markers), the stationary weather the rule implies.
| | → Sunny | → Cloudy | → Rainy |
|---|---|---|---|
| Sunny → | 0.70 | 0.20 | 0.10 |
| Cloudy → | 0.30 | 0.40 | 0.30 |
| Rainy → | 0.25 | 0.35 | 0.40 |
Top: the most recent states, newest at the right. Bottom: share of time in each state so far (bars) vs. the stationary distribution (dashed markers).
Forget where you started. A Markov chain remembers only the present, yet always arrives at the same long run.
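The dashed markers can be computed directly from the table. This sketch pushes a starting distribution through the transition rule over and over; the function names are illustrative, and 100 steps is an arbitrary but comfortable number for this chain.

```python
STATES = ["Sunny", "Cloudy", "Rainy"]

# Rows = today's state, columns = tomorrow's state (the table above).
TRANSITION = [
    [0.70, 0.20, 0.10],
    [0.30, 0.40, 0.30],
    [0.25, 0.35, 0.40],
]


def step(distribution, transition):
    """One day forward: new[j] = sum over i of old[i] * transition[i][j]."""
    n = len(distribution)
    return [sum(distribution[i] * transition[i][j] for i in range(n)) for j in range(n)]


def long_run(start, transition, steps=100):
    """Apply the transition rule repeatedly to a starting distribution."""
    distribution = start
    for _ in range(steps):
        distribution = step(distribution, transition)
    return distribution


for name, start in (("sunny", [1.0, 0.0, 0.0]), ("rainy", [0.0, 0.0, 1.0])):
    result = long_run(start, TRANSITION)
    print(f"start {name}: " + ", ".join(f"{s} {p:.3f}" for s, p in zip(STATES, result)))
```

Both starts land on the same mix. Solving πP = π by hand for this table gives exactly 51/106, 31/106 and 24/106, or about 0.481, 0.292 and 0.226.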
Some numbers are hard to derive but easy to sample. Take π. Drop random points into a square with a circle inscribed in it. The fraction that land inside the circle is the ratio of their areas — π/4. So multiply that fraction by four and you have estimated π using nothing but a random-number generator and counting. No calculus, no formula for the circle. This is the whole Monte Carlo idea: when you can’t calculate a quantity, replace it with the average of random draws.
Throw a hundred darts — the estimate is rough. Throw ten thousand — it homes in on 3.14159. The error shrinks like 1/√N: to get one more digit you need a hundred times the darts. Slow, but it works in any number of dimensions, which is exactly where the formulas give up.
If you can’t compute it but you can sample it, sample it. The average is your answer.
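The dart board fits in a dozen lines. A minimal sketch, assuming points drawn uniformly from the square [−1, 1] × [−1, 1] with the unit circle inscribed in it; `estimate_pi` is an illustrative name.

```python
import math
import random


def estimate_pi(n_darts, seed=0):
    """Estimate pi as 4 * (share of uniform darts that land inside the unit circle)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_darts):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_darts


for n_darts in (100, 10_000, 1_000_000):
    estimate = estimate_pi(n_darts)
    print(f"{n_darts:>9} darts: pi ~ {estimate:.5f} (error {abs(estimate - math.pi):.5f})")
```

The error should shrink roughly tenfold for every hundredfold increase in darts, with some wobble from run to run.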
Here is a problem that stops a lot of models cold. You want a layer that samples: draw z from a normal with mean μ and spread σ, where μ and σ are things the network learns. But sampling is random — you can’t differentiate through a dice roll, so the gradient can’t reach μ and σ, and the layer can’t train. The fix is a sleight of hand. Don’t draw z directly. Draw the randomness once, as plain noise ε from a fixed N(0,1), and then build z out of it: z = μ + σ·ε. Now all the randomness lives in ε — frozen, no parameters — and μ and σ enter through an ordinary, differentiable formula.
The grey ticks are ε: a fixed cloud of standard noise that never changes. Each one is tied to a blue point z below it. Drag μ and the whole cloud of z slides together; drag σ and it stretches around μ — rigidly, deterministically. Because ∂z/∂μ = 1 and ∂z/∂σ = ε, the gradient walks straight back to the knobs. Hit New noise to redraw ε and watch the same machinery reshape it.
Top: frozen noise ε ~ N(0, 1). Bottom: z = μ + σ·ε ~ N(μ, σ²). Same noise, deterministically reshaped by the two knobs.
Freeze the randomness, keep the knobs differentiable, and a coin flip becomes something you can train.
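To check that gradients really do reach the knobs, here is a sketch with a made-up downstream loss, loss(z) = z². That choice is purely an assumption: it is convenient because E[z²] = μ² + σ², so the exact gradients are 2μ and 2σ and the estimate can be compared against them.

```python
import random


def reparameterized_gradients(mu, sigma, eps):
    """Estimate d/dmu and d/dsigma of E[z**2], where z = mu + sigma * eps."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for e in eps:
        z = mu + sigma * e  # deterministic in mu and sigma
        dloss_dz = 2.0 * z  # derivative of the stand-in loss z**2
        grad_mu += dloss_dz * 1.0  # chain rule: dz/dmu = 1
        grad_sigma += dloss_dz * e  # chain rule: dz/dsigma = eps
    n = len(eps)
    return grad_mu / n, grad_sigma / n


rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(50_000)]  # frozen noise, no parameters

mu, sigma = 1.5, 0.8
g_mu, g_sigma = reparameterized_gradients(mu, sigma, eps)
print(f"d/dmu:    estimate {g_mu:.3f}, exact {2.0 * mu:.3f}")
print(f"d/dsigma: estimate {g_sigma:.3f}, exact {2.0 * sigma:.3f}")
```

The estimates should land close to 3.0 and 1.6. In a real network the chain-rule lines are what automatic differentiation would do for you.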
The mechanism under every transformer sounds exotic and is, underneath, pure probability. A query compares itself to a set of keys, producing a raw similarity score for each. Those scores aren’t a distribution — they’re just numbers. Softmax turns them into one: exponentiate, then normalize so they sum to 1. Now they’re probabilities — the attention weights — and the output is simply the weighted average of the values under them. Attention is the model deciding, probabilistically, what to look at.
Drag the query along the row. Each key’s weight (blue) grows as the query nears it. The temperature is the dial that matters: turn it down and softmax sharpens to a near one-hot spike — the query attends to a single key. Turn it up and the weights flatten toward uniform — the query blurs across everything. Same scores, completely different behavior.
Faint bars = each key’s value. Blue bars = attention weight on it. Green line = the output, the weighted average of the values.
Strip the jargon and attention is one sentence: softmax the scores, average the values.
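That one sentence translates almost word for word into code. In this sketch the query and keys are positions on a line and the similarity score is the negative squared distance between them; that scoring rule is an assumption chosen to match the drag-the-query figure (transformers typically use scaled dot products), and the keys and values are made up.

```python
import math


def softmax(scores, temperature=1.0):
    """Exponentiate the scaled scores, then normalize so they sum to 1."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, temperature=1.0):
    """Return (attention weights, weighted average of the values)."""
    scores = [-((query - k) ** 2) for k in keys]  # assumed similarity measure
    weights = softmax(scores, temperature)
    output = sum(w * v for w, v in zip(weights, values))
    return weights, output


keys = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [1.0, 3.0, 2.0, 5.0, 4.0]

for temperature in (0.05, 1.0, 100.0):
    weights, output = attend(1.2, keys, values, temperature)
    shown = ", ".join(f"{w:.2f}" for w in weights)
    print(f"temperature {temperature:>6}: weights = [{shown}], output = {output:.2f}")
```

At a low temperature nearly all the weight sits on the key nearest the query, so the output is close to that key's value. At a high temperature the weights approach uniform and the output approaches the plain mean of the values.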
A hands-on course in the statistics under machine learning: each lesson rebuilt independently in vanilla JavaScript and Canvas, no libraries, nothing to install. The visual, play-it-don’t-derive-it spirit owes a debt to Seeing Theory (Daniel Kunin, Brown University); none of its code is used here. Every figure is live: resize the window, sample again, move a slider, and the math recomputes in front of you.