Scott Alexander writes about the mystery of the genetics of schizophrenia. Some of the weirdness is fully explained by the numbers in genetic correlates being counterintuitive, but two mysteries remain:
Why can we only find a small fraction of the genetic causes of schizophrenia?
Why do fraternal twins indicate smaller genetic causality than identical twins?
I’m going to argue that this is just math: The tools we have at hand are only looking for linear interactions but the real phenomenon is probably fairly nonlinear and both of the above artifacts are exactly what we’d expect if that’s the case.1
Let’s consider two very different causes of a disease which occurs in about 1% of the population: one linear and the other very nonlinear.
In the linear case there’s a single genetic cause which occurs in about 1% of the population and causes the disease 100% of the time. In this case identical twins will have the disease with perfect correlation, indicating that it’s 100% genetic, and fraternal twins will get it about half the time when the other one has it, as expected. The one genetic cause is known and the measured fraction of the genetic cause which it makes up is all of it, so no mystery here.2
In the nonlinear case there are two genetic contributors to the disease, both of which occur in about 10% of the population. Neither of them alone causes it, but the combination of both causes it 100% of the time. In this case identical twins will both have it 100% of the time. But fraternal twins of someone with the disease will only get it about a quarter of the time, since each contributor is shared about half the time, seemingly indicating a lower amount of genetic cause. The amount of cause measured for each gene alone will be about 10%, so the contribution of known genetic factors will be about 20%, leaving a mystery of where the other 80% is coming from.
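The arithmetic for both cases can be checked with a short calculation. This is a minimal sketch under assumptions that are not spelled out above: each contributor is a dominant variant, the two contributors are inherited independently, mating is random, and "amount of cause measured" means the variance a linear fit on that gene explains. The function names are made up for illustration.

```python
from itertools import product
from math import sqrt


def linear_r2_one_gene(p: float) -> float:
    """Share of disease variance one gene explains in a linear fit, when
    disease = (carries A) AND (carries B) and each is carried at rate p."""
    prevalence = p * p                       # both contributors present
    var_disease = prevalence * (1 - prevalence)
    var_gene = p * (1 - p)
    cov = prevalence - p * prevalence        # E[A*D] - E[A]*E[D], with A*D = D
    return cov ** 2 / (var_gene * var_disease)   # squared correlation


def sibling_carrier_prob(p: float) -> float:
    """P(sibling carries the variant | this person carries it), assuming a
    dominant variant with carrier rate p and random mating."""
    q = 1 - sqrt(1 - p)                      # allele frequency giving carrier rate p
    copies = {0: (1 - q) ** 2, 1: 2 * q * (1 - q), 2: q ** 2}   # per parent
    both_carry = 0.0
    for (g1, w1), (g2, w2) in product(copies.items(), repeat=2):
        child = 1 - (1 - g1 / 2) * (1 - g2 / 2)   # P(child gets >= 1 copy)
        both_carry += w1 * w2 * child ** 2        # siblings are independent draws
    return both_carry / p


per_gene = linear_r2_one_gene(0.10)
print(f"one gene explains:   {per_gene:.3f}")
print(f"both genes explain:  {2 * per_gene:.3f}")   # independent genes add up
print(f"fraternal twin, linear case:    {sibling_carrier_prob(0.01):.3f}")
print(f"fraternal twin, nonlinear case: {sibling_carrier_prob(0.10) ** 2:.3f}")
```

Under these assumptions each gene comes out a little under 10% and the fraternal twin figure in the nonlinear case comes out a little over a quarter. Those gaps are the fractional stuff the word 'about' is covering for.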
It’s also possible for there to be different types of genetic interactions, including ones where the individual traits have a protective effect against the other one or more complex interactions between multiple genes. But this is the most common style of interaction: There are multiple redundant systems in the body, and all of them need to be broken in order for disease to happen, leading to superlinear thresholding phenomena.
Given this sort of phenomenon, the problem of only being able to find 20% or so of the genetic causes of a disease seems less mysterious and more like what we’d expect for any disease where a complex redundant system fails. You might then wonder why we don’t simply look for non-linear interactions. In the example above the interaction between the two traits would be easy enough to find. The problem is that a lot of the causes will fall below the threshold for statistical significance. The genome is very long, so a huge sample size is required to look for even linear phenomena, and when you get into pairs of things there are so many possibilities that statistical significance is basically impossible. The example given above is special because there are so few causes that they can be individually identified. In most cases you won’t even figure out the genes involved.
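To put rough numbers on how quickly pairs get out of hand, here is a small sketch. The count of one million variants and the use of a plain Bonferroni correction are illustrative assumptions, not figures from any particular study.

```python
from math import comb

N_VARIANTS = 1_000_000   # hypothetical number of variants being tested
ALPHA = 0.05             # overall false-positive rate we are willing to accept


def bonferroni_threshold(n_tests: int, alpha: float = ALPHA) -> float:
    """Per-test p-value cutoff when n_tests hypotheses are tested at once."""
    return alpha / n_tests


single_tests = N_VARIANTS
pair_tests = comb(N_VARIANTS, 2)   # every unordered pair of variants

print(f"single-variant tests: {single_tests:,}")
print(f"pairwise tests:       {pair_tests:,}")
print(f"cutoff per single test: {bonferroni_threshold(single_tests):.1e}")
print(f"cutoff per pair test:   {bonferroni_threshold(pair_tests):.1e}")
```

Going from single variants to pairs multiplies the number of hypotheses by about half a million, and the cutoff each individual test has to clear shrinks by the same factor. Triples and beyond are worse again.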
If you want to find non-linear causes of genetic disease your best bet right now - and I cringe as I write this - is to train a neural network on the available data, then test it on data which was withheld from training. Because it only gives a single answer to each case, getting statistical significance on its accuracy is no big deal. That will get you a useful diagnostic tool and give you a measure of how much of the genetic cause it’s accounting for, but it’s far from ideal. What you have is basically a ‘trust me bro’ expert. Different training runs might give wildly different answers to the same case, and it offers no reasoning behind the diagnosis. You can start trying to glean its reasoning by seeing how its answers change when you modify the inputs, but that’s a bit of a process. Hopefully in the future neural networks will be able to explain themselves better and the tooling for gleaning their reasoning will be improved.
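The reason significance is easy here is that the whole model is one hypothesis: does it beat guessing on the withheld data? A minimal sketch of that check follows, assuming a withheld set with equal numbers of cases and controls so that guessing is right half the time. The counts are invented for illustration and the function name is hypothetical.

```python
from math import exp, lgamma, log


def binomial_tail(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0), with 0 < p0 < 1.
    Summed in log space so large n does not overflow."""
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p0) + (n - i) * log(1 - p0))
        total += exp(log_pmf)
    return total


# Invented example: 200 withheld people, half cases and half controls,
# and the trained model labels 130 of them correctly.
n_withheld = 200
n_correct = 130
chance_rate = 0.5

p_value = binomial_tail(n_correct, n_withheld, chance_rate)
print(f"accuracy: {n_correct / n_withheld:.2f}")
print(f"one-sided p-value against guessing: {p_value:.2e}")
```

There is a single test and no multiple-comparisons correction to pay, which is the contrast with searching over pairs of variants. What this does not tell you is which genes the model is leaning on.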
1. I’m glossing over the distinction between a genetic cause and a genetic trait which is correlated with a confounder which is the actual cause. Scott explains that better than I can in the linked essay and the distinction doesn’t matter for the math here. For the purposes of exposition I’m assuming the genetic correlation is causal.
2. The word ‘about’ is used a lot here because of some fractional stuff which matters less as the disease gets rarer. I think it’s convention to skip explaining the details and leave out all the ‘about’s but I’m pedantic enough that it feels wrong to not have them when I skipped explaining the details.
I think this seems likely for schizophrenia (though there also seem to be many environmental ways to break that mechanism, like infections and drugs). And it should be even more applicable to IQ, if we believe that general good functioning of a brain is a bigger target for mutations than whatever it is for just schizo (meaning-making or meaning-sensing?).
I can sort of see why the effect of common variants of genes should be additive. But why should the effect of stuff breaking down, i.e. rare bad mutations, be additive? Something rare is not expected and can't be planned for by other genes (other than by building simple and multiple redundant mechanisms for the same function). Seems like a percolation, or broken-phone-like thing. Enough defects and it's a completely different thing. For the brain it is easy to imagine a model where just slightly bigger error rates in different parts would multiply for a big effect.
Furthermore, as Scott wrote, GWASes can't see rare mutations (unlike accurate whole genome sequencing), and they can't resolve the effect of new mutations well, which we should expect bad, rare ones to be.
For the missing heritability of disease there could also be another reason: MZ twins go together like shirt hem and butt. They must have a smaller non-shared environment. Even when raised apart they seem to have some kind of frequency lock with each other. This would result in them having similar environmental insults, most importantly pathogenic ones.
But of course non-additive genetic errors in immune system coverage (susceptibility) and regulation (autoimmunity) could be important, too.
As an example: We know by Ewald, Cochran & Cochran that endometriosis (a very common and fertility-damaging disease) can't be genetic (unless a ton of different mutations can cause this quite specific disorder, which is not plausible). It is believed to be partly hereditary, which is not strictly wrong if we consider the microbiome part of the inheritance. OTOH, MZ twins raised together are about 50% concordant, which is considerably higher than just any pair of sisters or DZ twins raised together. But when they tried to build a genomic predictor for endometriosis (https://www.nature.com/articles/s41588-023-01323-z), they found:
1. they can explain 5% of the variance
2. "significant comorbidity with other pain disorders", which I believe translates to "mostly confounded" =)
There's a small chance this genetic non-linearity thing could be the highest value application of ML (prioritizing where you put that selective effort), if it works. No need to cringe.
I’m completely convinced. The problem is that we lack the data and tools to find nonlinear patterns.