Zero is the Universal Bayesian Prior
A slightly better explanation of Stein's Paradox
Stein’s paradox is legitimately disturbing. Here’s a great video explaining it and giving an okayish but not terribly satisfying resolution. If you don’t already know what Stein’s paradox is, you should watch this video as background before reading the rest of this post, in which I’ll try to give a better resolution.
In summary, if you’re estimating a quantity with three or more dimensions (for example, a single box might have length, width, and height), you can make your estimate more accurate, in expected squared error, by making it slightly smaller, regardless of what the estimate actually is.
The thing which is very disturbing about this is that it isn’t exactly ‘smaller’, it’s towards zero. Why zero? No particular reason; it just has to be a valid value. Why not towards 2x, which moves your estimate in the exact opposite direction? That would be fine if you happened to pick 2x out of a hat beforehand, but if you pick it after getting your data, as a value calculated from x, it makes your estimate less accurate. This makes no sense. Analysis of data should come entirely out of the data, not out of the brain of the data analyzer. Multiple analyses of the same data should ideally come to the same conclusions. This is the exact opposite of that, depending critically on a completely arbitrary decision the data analyzer made.
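The claim really does hold up in simulation. Here’s a minimal sketch (my own illustration, not from the post or the video) comparing the mean squared error of the raw observations against the positive-part James-Stein shrinkage toward zero, assuming unit-variance Gaussian noise and some arbitrary true values I made up:

```python
# Compare raw estimates against James-Stein shrinkage toward zero:
# estimate = (1 - (d - 2) / ||x||^2) * x, clipped at zero (positive-part).
import random

def james_stein(x):
    """Shrink the observation vector x toward zero by the James-Stein factor."""
    d = len(x)
    sq_norm = sum(v * v for v in x)
    factor = max(0.0, 1.0 - (d - 2) / sq_norm)  # positive-part variant
    return [factor * v for v in x]

def sq_error(est, truth):
    return sum((e - t) ** 2 for e, t in zip(est, truth))

random.seed(0)
truth = [5.0, -3.0, 2.0, 7.0]   # arbitrary true means, d = 4 >= 3
trials = 20_000
raw_mse = js_mse = 0.0
for _ in range(trials):
    x = [random.gauss(t, 1.0) for t in truth]  # one noisy observation per dimension
    raw_mse += sq_error(x, truth)
    js_mse += sq_error(james_stein(x), truth)
print(raw_mse / trials, js_mse / trials)  # the shrunken estimate wins, slightly
```

The improvement is small when the true values are far from zero, but it is reliably there, and it would vanish (and reverse) if you instead pushed the estimate away from zero.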
This shows up in seemingly innocuous applications. If we’re measuring temperature why use Celsius? Why not Fahrenheit? Or Kelvin? If we’re measuring speed, why use the Earth as our inertial reference frame? Why not the Sun? Or the Earth if it wasn’t spinning on its axis? Or the speed of light?
Temperature is the most illuminating example. We don’t move our value closer to zero Kelvin because that’s a ridiculous value which we know isn’t going to happen. While Celsius and Fahrenheit have different zero points, both are set to values which can occur in normal everyday experience. In the case of Fahrenheit that’s a bit of a stretch, though; using ‘room temperature’ would be more reasonable.
This gets to the crux of what’s going on. The James-Stein estimator is getting a Bayesian prior by cheating. It’s a reasonable assumption that any units in common use have zero set to a ‘normal’ value. This isn’t always true: in the example of the dimensions of a box, a box obviously can’t have zero length, so shrinking toward zero there is probably downright counterproductive. But Bayesian priors are generally reasonable things to use, and this now sounds a lot less weird.
Of course you shouldn’t use the not-admitting-it’s-a-Bayesian-prior value of zero; you should use a real Bayesian prior, for example room temperature, or something a bit more germane to your experiment. Given this, we can now rephrase Stein’s paradox in a way which sounds much more reasonable:
If you’re measuring something consisting of many values, and each of those values is measured independently (note major caveat!), then each of those measurements will individually add noise to the total, making it increasingly reasonable to bias your final estimate towards a Bayesian prior as the number of values goes up.
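Shrinking toward a real prior instead of toward zero is a one-line change: apply the same shrinkage to the residuals from the prior point. A sketch of this, using a hypothetical set of temperature readings and a made-up ‘room temperature’ prior of 20 °C:

```python
# The same James-Stein shrinkage, but toward a chosen prior point nu instead
# of zero: shrink the residuals x - nu, then add nu back.
def shrink_toward(x, nu):
    """James-Stein shrinkage of observations x toward the fixed prior point nu."""
    d = len(x)
    residual = [v - p for v, p in zip(x, nu)]
    sq_norm = sum(r * r for r in residual)
    factor = max(0.0, 1.0 - (d - 2) / sq_norm)  # positive-part variant
    return [p + factor * r for p, r in zip(nu, residual)]

# Hypothetical example: four independent readings in Celsius, pulled toward
# a room-temperature prior of 20 degrees rather than toward 0.
readings = [21.3, 18.9, 22.4, 19.5]
prior = [20.0] * len(readings)
print(shrink_toward(readings, prior))  # each value moves part-way toward 20
```

Choosing nu = 0 recovers the textbook estimator, which is exactly the sense in which zero was never special: it was just a prior nobody admitted to picking.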
And now it doesn’t seem so paradoxical any more.
One useful lesson we can draw is that when estimating a value which is known to be positive, we can improve accuracy a bit by averaging our measurements on a log scale. That does ‘bend towards zero’ a bit, but it does so in a way which is much more reasonable, and it requires that zero sit exactly on the boundary of invalid values, like zero Kelvin or zero length.
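Averaging on a log scale is just the geometric mean, which is never larger than the arithmetic mean, so it nudges the estimate toward zero while staying inside the valid positive range. A quick sketch with made-up measurements:

```python
# Averaging on a log scale (the geometric mean) vs. the ordinary arithmetic
# mean. The geometric mean is always <= the arithmetic mean for positive
# values, so it 'bends toward zero' without ever reaching an invalid value.
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def log_scale_mean(xs):
    """Average the logs, then exponentiate back: the geometric mean."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

lengths = [2.0, 3.0, 12.0]  # hypothetical positive measurements
print(arithmetic_mean(lengths))  # 5.666...
print(log_scale_mean(lengths))   # cube root of 72, about 4.16
```

Note that the log-scale average only makes sense when zero is a genuinely impossible value; `math.log` will refuse a zero or negative input, which is the code telling you the same thing the argument above does.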