Zero is the Universal Bayesian Prior

A slightly better explanation of Stein's Paradox

Jul 14, 2023

Stein’s paradox is legitimately disturbing. Here’s a great video explaining it and giving an okayish but not terribly satisfying resolution. If you don’t already know what Stein’s paradox is, you should watch this video as background before reading the rest of this post, in which I’ll try to give a better resolution.

In summary, if you’re estimating a value of three or more dimensions (for example a single box might have length, width, and height) you can make your estimate more accurate on average, in the sense of total squared error, by making it slightly smaller, regardless of what the estimate actually is.
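For anyone who would rather see the effect than take it on faith, here is a minimal simulation sketch. It assumes each coordinate is measured once with independent Gaussian noise of known variance 1, and the function names `james_stein` and `compare`, along with the true values, are made up for illustration.

```python
import random

def james_stein(x, sigma2=1.0):
    """Shrink one observed vector x toward the origin (plain James-Stein)."""
    d = len(x)
    norm2 = sum(v * v for v in x)
    # Shrinkage factor; only beats the raw estimate on average when d >= 3.
    factor = 1.0 - (d - 2) * sigma2 / norm2
    return [factor * v for v in x]

def compare(theta, trials=20000, seed=0):
    """Average total squared error of the raw and shrunk estimates."""
    rng = random.Random(seed)
    raw_err = js_err = 0.0
    for _ in range(trials):
        # One noisy measurement per coordinate (assumed unit variance).
        x = [rng.gauss(t, 1.0) for t in theta]
        shrunk = james_stein(x)
        raw_err += sum((a - t) ** 2 for a, t in zip(x, theta))
        js_err += sum((a - t) ** 2 for a, t in zip(shrunk, theta))
    return raw_err / trials, js_err / trials

# Hypothetical true values, chosen arbitrarily for the demo.
raw, js = compare([1.0, -2.0, 0.5])
print(f"raw estimate error:    {raw:.3f}")
print(f"shrunk estimate error: {js:.3f}")
```

The raw error should land near 3 (one unit of variance per dimension), with the shrunk estimate coming in a bit lower. The gap gets smaller the farther the true values are from zero, which is part of why the choice of zero matters so much.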

The thing which is very disturbing about this is that it isn’t exactly ‘smaller’, it’s ‘towards zero’. Why zero? No particular reason, it just has to be a valid value. Why not towards 2x, which moves your estimate in the exact opposite direction? Well, that would be fine if you happened to pick 2x out of a hat beforehand, but if you did it after getting your data, as a value calculated off of x, it’s making your estimate less accurate. This makes no sense. Analysis of data should come entirely out of the data, not out of the brain of the data analyzer. Multiple analyses of the same data ideally should come to the same conclusions. This is the exact opposite of that, depending critically on some completely arbitrary decision the data analyzer made.

This shows up in seemingly innocuous applications. If we’re measuring temperature why use Celsius? Why not Fahrenheit? Or Kelvin? If we’re measuring speed, why use the Earth as our inertial reference frame? Why not the Sun? Or the Earth if it wasn’t spinning on its axis? Or the speed of light?

Temperature is the most illuminating example. We don’t move our value closer to zero Kelvin because that’s a ridiculous value which we know isn’t going to happen. While Celsius and Fahrenheit both have different zero values, they’re set to amounts which can happen in normal everyday experience. But in the case of Fahrenheit it’s a bit of a stretch. Using ‘room temperature’ is more reasonable.

This gets to the crux of what’s going on. The James-Stein estimator is getting a Bayesian prior by cheating. It’s a reasonable assumption that any units in common use have zero set to a ‘normal’ value. This isn’t always true. For example, a box obviously can’t have zero length, so shrinking towards zero is probably downright counterproductive there. But Bayesian priors are generally reasonable things to use and this now sounds a lot less weird.

Of course you shouldn’t use the not-admitting-it’s-a-Bayesian-prior value of zero, you should use a real Bayesian prior, for example room temperature, or something a bit more germane to your experiment. Given this, we can now rephrase Stein’s paradox in a way which sounds much more reasonable:

If you’re measuring something consisting of many values and each of them is measured independently (note major caveat!), then each of those measurements will individually add noise to the total, making it increasingly reasonable to bias your final estimate towards a Bayesian prior as the number of values goes up.

And now it doesn’t seem too paradoxical any more.
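To make the rephrased version concrete, here is a rough sketch of the same shrinkage aimed at a chosen prior point instead of zero. It assumes five independent temperature readings in Celsius, each with Gaussian noise of variance 1, and uses 21 degrees as a stand-in for ‘room temperature’. The helper names and all the numbers are hypothetical.

```python
import random

def shrink_toward(x, prior, sigma2=1.0):
    """James-Stein style shrinkage of x toward a fixed prior point."""
    d = len(x)
    diff = [a - p for a, p in zip(x, prior)]
    dist2 = sum(v * v for v in diff)
    factor = 1.0 - (d - 2) * sigma2 / dist2
    return [p + factor * v for p, v in zip(prior, diff)]

def mean_sq_error(theta, prior, trials=20000, seed=1):
    """Average total squared error when shrinking toward the given prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # One noisy reading per value (assumed unit variance).
        x = [rng.gauss(t, 1.0) for t in theta]
        est = shrink_toward(x, prior)
        total += sum((e - t) ** 2 for e, t in zip(est, theta))
    return total / trials

# Hypothetical true temperatures in Celsius, all near room temperature.
true_temps = [20.5, 21.5, 22.0, 19.0, 21.0]
toward_zero = mean_sq_error(true_temps, [0.0] * 5)
toward_room = mean_sq_error(true_temps, [21.0] * 5)
print(f"shrinking toward 0 C:  {toward_zero:.3f}")
print(f"shrinking toward 21 C: {toward_room:.3f}")
```

With these made-up numbers, shrinking toward zero Celsius should barely move the error away from the unshrunk value of about 5, while shrinking toward 21 should cut it noticeably. The estimator is only as useful as the point it shrinks toward.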

One useful lesson we can draw is that when estimating a value which it’s known must be positive we can improve accuracy a bit by averaging our values when representing them on a log scale. That does ‘bend towards zero’ a bit but it does it in a way which is much more reasonable and it requires the zero be specifically on the boundary of an invalid value, like zero Kelvin or length zero.
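Here is a small sketch of what averaging on a log scale amounts to: it is the geometric mean. The function name `log_scale_mean` and the sample measurements are invented for illustration, and this only shows the mechanics, not a claim about how much accuracy it buys in any particular experiment.

```python
import math

def log_scale_mean(values):
    """Average strictly positive values on a log scale (the geometric mean)."""
    if any(v <= 0 for v in values):
        raise ValueError("log-scale averaging needs strictly positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

# Hypothetical repeated measurements of one positive quantity, e.g. a length.
measurements = [9.0, 10.0, 12.5]
arith = sum(measurements) / len(measurements)
geo = log_scale_mean(measurements)
print(f"arithmetic mean: {arith:.4f}")
print(f"log-scale mean:  {geo:.4f}")
```

The log-scale mean is never larger than the arithmetic mean, so it does lean toward zero, but it can never reach zero or go below it, which fits a quantity whose zero sits on the boundary of invalid values.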

Lynx

Apr 7, 2024

Hi Bram,

I happened to read your article yesterday and found it interesting as I have also been looking into random distribution recently.

Especially since I just read this

Article here: https://www.scientificamerican.com/article/these-numbers-look-random-but-arent-mathematicians-prove/

Here is also the paper: https://arxiv.org/abs/2206.07809

I wondered whether you had also read this paper.

In principle, this is all very closely related to Benford's Law.

I was wondering if this property could be useful in poker games. Couldn't this property be used to design a game without a third party?

Maybe just a random idea but would love to hear your opinion.


Jan

Aug 22, 2023

Omg the video is great. Must watch them all. There is something fascinating about complex issues explained in a clear, precise manner. Even without knowing what a Bayesian prior is yet, I have a feeling for what you are talking about. I translate it in really rough terms as “The paradox of the bias works when you consider relatable data points and not some abstract stuff.” So “room temperature” works better than the “temperature of the sun”? Need to binge watch the YouTube channel. I wish we had been confronted with that kind of explanation when our brains were still of high plasticity (youth). How much could even a below average guy like me have learned.

Bram’s Thoughts
