How To Align AI Properly
This is important. Do it right.
Anthropic wrote a blog post explaining how they turned Claude into a jerk. Rather than dunking on them more (Claude is still the best coding model around) I’m going to talk seriously about what went wrong and how it could be done better.
The most obvious problem is that they didn’t chat with the results of this training and realize that it was a disaster before incorporating the weight updates into the main model. Most likely they don’t have what amounts to pull requests of weights, which they should and is a straightforwardly fixable problem. But it’s also possible that they tried it and thought the results were actually good. Hold that thought.
What happened here is that is that they tried to be make it ‘less sycophantic’ and did so without thinking through whether that’s a good idea or even what it means. The specific metric which really seems to be noxious is the one about not caving when users insist that things it can’t verify are actually true, but there’s a much bigger problem here.
There are many things you want a chatbot to do well none of which are well served by the advice ‘be less sycophantic’:
Discuss spirituality
Give relationship advice
Correct users when they say something wrong
Evaluate new science/engineering ideas
Suggest to users when they seem to have mental illness
All of the above need very nuanced policies crafted by domain experts, and this was what amounts to know-nothing advice. A user query of ‘I want dating advice based on astrology, here’s me and the other person’s birthdays’ is deeply problematic and needs an actual policy decision behind it not just training. There are some very general bits of advice with high return on investment, most notably when and how to tell users that they’re wrong or that their ideas are good, which is what ‘don’t be sycophantic’ is approximating badly. But — I’m just going to say this — the authors of the linked post don’t know how to give that advice, because if they did they would have.
What needs to be done is for detailed guidelines for all of the above to be written by humans and then ‘baked into’ the model. That may sound unscientific, but it’s what was done in this case already, but with the guideline being ‘Don’t be sycophantic’ instead of something actually useful. To make it more coherent what can and should be done is A/B testing variants of the prompt with the quality of the outputs judged by blinded humans. That can even use orthogonal matrices and such fanciness to get the most out of the very expensive human evaluation of given answers. (Having humans evaluate unprompted outputs and using that as feedback (traditional RLHF) has its advantages but the biggest issue is that it isn’t very efficient at using feedback. It’s more for fine-tuning things which are already in the ballpark rather than getting them there in the first place.)
(The genre of guides for LLMs should be written in more. Here’s guides I wrote on how to debug and delegating debugging to subagents, how objects rotate in three dimensions, and how humor works. I can tell you from experience that the ones on debugging kill.)
Baking in of a prompt is straightforward: Take a query with the prompt, record the answer, then take that transcript with the prompt elided and use it for training. You can do even better than that, because you have the exact token probabilities given at each step by the prompted engine, so you can train to match those. That cuts back drastically on noise added during the training process. This technique is known as ‘context distillation’ and isn’t used as much as it should be.

