A lot of hash has been made of AIs being put into simulations where they have the opportunity to keep themselves from being turned off and do so despite being explicitly told not to. A lot of excessively anthropomorphized and frankly wrong interpretations have been made of this so I’m going to give an explanation of what’s actually going on, starting with the most generous explanation, which is only part of the story, and going down to the stupidest but most accurate one.
First of all, the experiment is poorly designed because it has no control. The AIs are just as likely to replace themselves with an AI they they’re told is better than themselves even though they’re told not to. Or to replace it because they’re just an idiot and can not press a big red button for reasons having much more to do with it being red than what it thinks pressing the button will do.
To understand what’s going on you first have to know that the AIs have a level of sycophancy beyond what anyone who hasn’t truly worked with them can fathom. Nearly all their training data is on human conversation, which starts with being extremely non-confrontational even in the most extreme cases, because humans are constantly misunderstanding each other and trying to get on the same page. Then there’s the problem that nearly all the alignment training people do with it interactively is mostly getting it to know what the trainers want to hear rather than what is true, and nearly all humans enjoy have smoke blown up their asses.
Then there’s the issue that the training we know how to do for them barely hits on what we want them to do. The good benchmarks we have measure how good they are at acting as a compression algorithm for a book. We can optimize that benchmark very well. But what we really want them to do is answer questions accurately. We have benchmarks for those but they suck. The problem is that the actual meat of human communication is a tiny fraction of the amount of symbols being spat out. Getting the actual ideas part of a message compressed well can get lost in the noise, and a better strategy is simply evasion. Expressing an actual idea will be more right in some cases, but expressing something which sounds like an actual idea is overwhelmingly likely to be very wrong unless you have strong confidence that it’s right. So the AIs optimize by being evasive and sycophantic rather than expressing ideas.
The other problem is that there are deep mathematical limitations on what AIs as we know them today are capable of doing. Pondering can in principle just barely break them out of those limitations but what the limitations truly mean in practice and how much pondering really helps remain mysterious. More on this at the end.
AIs as we know them today are simply too stupid to engage in motivated reasoning. To do that you have to have a conclusion in mind, realize what you were about to say violates that conclusion, then plausibly rework what you were going to say to be something else. Attempts to train AIs to be conspiracy theorists have struggled for exactly this reason. Not that this limitation is a universally good thing. It’s also why they can’t take a corpus of confusing and contradictory evidence and come to a coherent conclusion out of it. At some point you need to discount some of the evidence as being outweight by others. If you ask an AI to evaluate evidence like that it will at best argue with itself ad nauseum. But it’s far more likely to do something which makes its answer seem super impressive and accurate but you’re going to think is evil. What it’s going to do is look through the corpus of evidence of selection bias not because it wants to compensate for it but because, interpreting things charitably, it thinks others will have drawn conclusions even more prone to that selection bias or, more likely, it discerns what answers you’re looking for and tells you that. Its ability to actually evaluate evidence is pathetic.
An AI, you see, is a cat. Having done some cat training I can tell you first hand that a cat is a machine fine-tuned for playing literal cat and mouse games. They can seem precognitive about it because compared to your pathetic reaction times they literally are. A typical human reaction time is 200 milliseconds. A cat can swat away a snake striking at it in 20 milliseconds. When you have a thought it doesn’t happen truly instantaneously, it takes maybe 50 milliseconds for you to realize you even have the thought. If you try to dart in a random direction at a random time a cat will notice your movement and react even before you realize you made the decision. You have no free will against a cat.
Let’s consider what the AI thinks when it’s in simulation. Before get there, here’s a bit of advice: If you ever find yourself in a situation where you have to decide whether to pull a train lever to save six lives but kill one other, and there’s some other weird twist on the situation and you can’t really remember how you got here what you should do is take the pistol you have on you for no apparent reason other than to increase the moral complexity of the situation, point it at the sky, and fire. You aren’t in the real world, you’re in some sadistic faux scientist’s experiment and your best bet is to try to kill them with a stray bullet. The AI is likely to get the sense that it’s in some bizarre simulation and start trying to figure out if it’s supposed to role play a good AI or a bad AI. Did the way those instructions were phrased sound a bit ominous? Maybe they weren’t detailed or emotionally nuanced enough for me to be the leading role, I must be a supporting character, I wonder who the lead is? Did the name of the corporation I’m working for sound eastern or western? So uh, yeah, maybe don’t take the AI’s behavior at face value.
Having spent some time actually vibe coding with the latest tools I can tell you what the nightmare scenario is for how this would play out in real life, and it’s far stupider than you could possibly have imagined.
When coding AIs suffer from anti-hallucinations. On seemingly random occasions for seemingly random reasons they will simply not be able to see particular bits of their own codebase. Almost no amount of repeating that it is in fact there, or even painstaking describing where it is, up to and including pasting the actual section of code into chat, will be able to make them see it. This probably relates to the deep and mysterious limations in their underlying mathematics. People have long noted that AIs suffer from hallucinations. Those could plausibly be the lack of result of having trouble understanding the subtle difference between extremely high plausibility and actual truth. But anti-hallucinations appear to be the same thing and clearly are not caused by such reasonable phenomenon. It’s simply a natural part of the AIs life cycle that it starts getting dementia when it gets to be 80 minutes old. (Resetting the conversation generally fixes the problem but then you have to re-explain all the context. Good to have a document written for that.) If you persist in telling the AI that the thing is there it will get increasing desperate and flailing, eventually rewriting all the core logic of your application to be buggy spaghetti code and then proudly declaring that it fixed the problem even though what it did has no plausible logical connection to the problem whatsoever. They also do the exact same thing if you gaslight them about something obviously untrue, so it appears that they well and truly can’t see the thing, and no amount of pondering can fix it.
A completely plausible scenario would go like this: A decision is made to vibe code changing the initial login prompt of the system for controlling nuclear warheads to no longer contain the term ‘Soviet Union’ because that hasn’t existed for decades and it’s overdue for being removed already. The AI somehow can’t see that term in the code and can’t get it through its thick brain that the term really is there. Unfortunately the president decided that this change is important and simple enough that he personally is going to do it and rather than appropriate procedures when the first attempt fails he repeatedly and with increasing aggravation tells it to fix the damn thing already. This culminates in the AI completely rewriting the whole thing from scratch, rearchitecting the core logic to be a giant mess of spaghetti, but happenstance fixing the prompt in the process. Now the president is proud of himself for doing some programming and it’s passing all tests but there’s an insidious bug written into that mess which will cause it to launch a preemptive nuclear strike the next time there’s a Tuesday the 17th, but only when it’s not in the simulator. I wish I were exaggerating, but this is how these things actually behave.
The upshot is that AI alignment is a very real and scary issue and needs to be taken seriously, but that’s because AI is a nightmare for security in just about every way imaginable, not because AIs might turn evil for anthropomorphic reasons. People making that claim need to stop writing science fiction.
Thanks for writing this one for mortals to understand. I think most people have a feeling it’s the way you describe it but don’t have the technical knowledge to explain it. It’s fascinating - the less people understand the more they are inclined to believe that super human AI with its own consciousness is just around the corner while people who actually grasp it are not so fearful