Scott Alexander writes about the mystery of the genetics of schizophrenia. Some of the weirdness is fully explained by the numbers in genetic correlates being counterintuitive, but two mysteries remain:
Why can we only find a small fraction of the genetic causes of schizophrenia?
Why do fraternal twins indicate smaller genetic causality than identical twins?
I’m going to argue that this is just math: the tools we have at hand only look for linear interactions, but the real phenomenon is probably fairly nonlinear, and both of the above artifacts are exactly what we’d expect if that’s the case.1
Let’s consider two very different possible causes of a disease which occurs in about 1% of the population: one linear and the other very nonlinear.
In the linear case there’s a single genetic cause which occurs in about 1% of the population and causes the disease 100% of the time. In this case identical twins will have the disease with perfect correlation, indicating that it’s 100% genetic, and fraternal twins will get it about half the time when the other one has it, as expected. The one genetic cause is known and the measured fraction of the genetic cause which it makes up is all of it, so no mystery here.2
In the nonlinear case there are two genetic contributors to the disease, both of which occur in about 10% of the population. Neither of them alone causes it, but the combination of both causes it 100% of the time. In this case identical twins will have it 100% of the time. But fraternal twins of someone with the disease will only get it about a quarter of the time, seemingly indicating a lower amount of genetic cause. The amount of cause measured by each gene alone will be about 10%, so the contribution of known genetic factors will be about 20%, leaving a mystery of where the other 80% is coming from.
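Here is a rough sketch of the arithmetic behind both cases, for anyone who wants to check it. It bakes in assumptions the text above doesn't spell out: each variant is dominant (one copy is enough to count as having it), mating is random, the two variants in the nonlinear case are inherited independently, and the 'amount of cause' a linear tool measures is taken to be the squared correlation between a variant and the disease. The function names are mine.

```python
from itertools import product
from math import sqrt


def sibling_carrier_prob(carrier_freq):
    """P(sibling carries a dominant variant | proband carries it).

    Assumes random mating and Hardy-Weinberg genotype proportions.
    """
    q = 1 - sqrt(1 - carrier_freq)  # risk-allele frequency
    genotypes = {0: (1 - q) ** 2, 1: 2 * q * (1 - q), 2: q ** 2}
    both_carry = 0.0
    for (g1, p1), (g2, p2) in product(genotypes.items(), repeat=2):
        # Chance that a child of these parents gets at least one risk allele.
        child = 1 - (1 - g1 / 2) * (1 - g2 / 2)
        # Two siblings are independent once the parents are fixed.
        both_carry += p1 * p2 * child ** 2
    return both_carry / carrier_freq


def marginal_r2(carrier_freq, n_loci):
    """Squared correlation between one variant and a disease that
    requires all n_loci independent variants at once."""
    prevalence = carrier_freq ** n_loci
    cov = prevalence - carrier_freq * prevalence
    var_variant = carrier_freq * (1 - carrier_freq)
    var_disease = prevalence * (1 - prevalence)
    return cov ** 2 / (var_variant * var_disease)


# Linear case: one variant at 1% that always causes the disease.
linear_sibling = sibling_carrier_prob(0.01)
linear_r2 = marginal_r2(0.01, 1)

# Nonlinear case: two variants at 10%, both required.
nonlinear_sibling = sibling_carrier_prob(0.10) ** 2
nonlinear_r2 = 2 * marginal_r2(0.10, 2)

print(f"linear:    sibling {linear_sibling:.2f}, found {linear_r2:.2f}")
print(f"nonlinear: sibling {nonlinear_sibling:.2f}, found {nonlinear_r2:.2f}")
```

Under these assumptions the linear case comes out at about 0.50 for the sibling and 1.00 for the fraction found, and the nonlinear case at about 0.30 and 0.18. The sibling number lands a bit above a quarter because a 10% variant isn't all that rare, which is the sort of fractional stuff the 'about's are covering for.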
It’s also possible for there to be different types of genetic interactions, including ones where the individual traits have a protective effect against each other, or more complex interactions between multiple genes. But the kind described above is the most common style of interaction: there are multiple redundant systems in the body, and all of them need to be broken in order for disease to happen, leading to superlinear thresholding phenomena.
Given this sort of phenomenon, the problem of only being able to find 20% or so of the genetic causes of a disease seems less mysterious and more like what we’d expect for any disease where a complex redundant system fails. You might then wonder why we don’t simply look for non-linear interactions. In the example above the interaction between the two traits would be easy enough to find. The problem is that a lot of the causes will fall below the threshold for statistical significance. The genome is very long, so a huge sample size is required to look for even linear phenomena, and when you get into pairs of things there are so many possibilities that statistical significance is basically impossible. The example given above is special because there are so few causes that they can be individually identified. In most cases you won’t even figure out the genes involved.
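To put rough numbers on how much worse pairs are, here is a minimal sketch. The count of one million tested variants is an assumption picked for illustration, and I'm using a plain Bonferroni correction, which is only one of several ways to handle multiple testing.

```python
from math import comb

n_variants = 1_000_000  # assumed number of variants tested, for illustration
alpha = 0.05            # overall false-positive rate we are willing to accept

single_tests = n_variants
pair_tests = comb(n_variants, 2)  # every unordered pair of variants

# Bonferroni: each individual test has to clear alpha / (number of tests).
single_threshold = alpha / single_tests
pair_threshold = alpha / pair_tests

print(f"single variants: {single_tests:,} tests, p < {single_threshold:.1e}")
print(f"variant pairs:   {pair_tests:,} tests, p < {pair_threshold:.1e}")
```

With these made-up inputs the single-variant threshold is 5e-8 and the pairwise threshold is around 1e-13, and that is before considering triples or anything more complicated.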
If you want to find non-linear causes of genetic disease your best bet right now - and I cringe as I write this - is to train a neural network on the available data, then test it on data which was withheld from training. Because it only gives a single answer to each case, getting statistical significance on its accuracy is no big deal. That will get you a useful diagnostic tool and give you a measure of how much of the genetic cause it’s accounting for, but it’s far from ideal. What you have is basically a ‘trust me bro’ expert. Different training runs might give wildly different answers to the same case, and it offers no reasoning behind the diagnosis. You can start trying to glean its reasoning by seeing how its answers change when you modify the inputs, but that’s a bit of a process. Hopefully in the future neural networks will be able to explain themselves better and the tooling for gleaning their reasoning will be improved.
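As a sketch of why significance is easy on the withheld data: the whole model boils down to one test, so a plain binomial tail does the job. Nothing is trained here. The holdout size and number of correct calls are made up, and I'm assuming a holdout set balanced between cases and controls so that blind guessing gets 50%.

```python
from fractions import Fraction
from math import comb


def binomial_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p), p given as a Fraction."""
    total = sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))
    return float(total)


n_holdout = 1000        # hypothetical balanced holdout set
n_correct = 560         # hypothetical number of correct predictions
chance = Fraction(1, 2)  # accuracy of blind guessing on a balanced set

p_value = binomial_tail(n_holdout, n_correct, chance)
print(f"accuracy {n_correct / n_holdout:.1%}, one-sided p = {p_value:.1e}")
```

Even a model that is only a few points better than guessing clears ordinary significance thresholds on a holdout set of this size, because there is exactly one hypothesis being tested instead of billions.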
I’m glossing over the distinction between a genetic cause and a genetic trait which is correlated with a confounder which is the actual cause. Scott explains that better than I can in the linked essay and the distinction doesn’t matter for the math here. For the purposes of exposition I’m assuming the genetic correlation is causal.
The word ‘about’ is used a lot here because of some fractional stuff which matters less as the disease gets rarer. I think it’s convention to skip explaining the details and leave out all the ‘about’s but I’m pedantic enough that it feels wrong to not have them when I skipped explaining the details.