Automated Chess Commentary's Sorry State And Possible Improvements
We haven't even taken the first step
Computer tools for providing commentary on Chess games are currently awful. People play over games using Stockfish, which is a useful but not terribly relevant tool, and use it as a guide for their own commentary. There are Chess Youtubers who aren't strong players, and it's obvious even to someone of my own mediocre playing strength (1700 on a good day) that they don't know what they're talking about: in many situations there's an obvious best move which fails due to some insane computer line, but they don't even cover it because the computer thinks it's clearly inferior. Presumably commentary I generated using Stockfish as a guide would be equally transparent to someone of stronger playing strength than mine. People have been talking about using computers to make automated commentary on Chess positions since computers started getting good at Chess, and the amount of progress made has been pathetic. I'm now going to suggest a tool which would be a good first step in that process, although it still requires a human to put together the color commentary. It would also be a fun AI project on its own merits, and possibly has a darker use which I'll get to at the end.
There's only one truly objective metric of how good a Chess position is, and that's whether it's a win, loss, or draw with perfect play. In a lost position all moves are equally bad. In a won position any move, no matter how ridiculous, which preserves the theoretical win is equally good. Chess commentary based on this sort of analysis would be insane. Most high level games would be a theoretical draw until some point deep into territory already lost for a human, at which point some uninteresting move would be labeled the losing blunder because it missed out on some way of theoretically eking out a draw. Obviously such commentary wouldn't be useful. But commentary from Stockfish isn't much better. Stockfish's evaluation is how a roughly 3000-rated player feels about the position, assuming it's playing against an opponent of roughly equal strength. That's a super specific type of player, and not one terribly relevant to how humans might fare in a given position. It's close enough to perfect that a lot of the aforementioned ridiculousness shows up. There are many exciting tactical positions which are 'only' fifteen moves or so from being done where the engine says 'ho hum, nothing to see here, I worked it out and it's a dead draw'. What we need for Chess commentary is a tool geared towards human play, which says something about human games.
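To make the contrast concrete, here's a minimal sketch using the python-chess library, which can probe both a Syzygy tablebase (the perfect-play verdict) and Stockfish (the roughly-3000-player verdict) for the same position. The Stockfish binary name and the tablebase path are assumptions about your setup, not anything the tools guarantee:

```python
import chess
import chess.engine
import chess.syzygy

# A simple K+P endgame, Black to move.
board = chess.Board("8/8/8/8/8/4k3/4p3/4K3 b - - 0 1")

# Perfect-play verdict: win/draw/loss only, with no notion of difficulty.
# Assumes Syzygy tablebase files are downloaded to the (hypothetical) path.
with chess.syzygy.open_tablebase("path/to/syzygy") as tb:
    wdl = tb.probe_wdl(board)  # 2 = win, 0 = draw, -2 = loss for side to move
    print("perfect play:", {2: "win", 0: "draw", -2: "loss"}.get(wdl, "cursed/blessed"))

# Stockfish's verdict: a score, i.e. how a ~3000 player feels about it.
with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    info = engine.analyse(board, chess.engine.Limit(depth=20))
    print("engine score:", info["score"].pov(board.turn))
```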
Here's an idea of what to build: make an AI engine which takes as input the position, the ratings of the two players, the time controls, and the time remaining, and gives probabilities for a win, loss, or draw. This could be trained on a corpus of real human games by optimizing for Brier score. Without any lookahead this approach is limited in how strong an evaluation it can reach, but that isn't relevant for most people. Current engines at one node are probably around 2500 or so, so it might peter out in usefulness for strong grandmaster play, but you have my permission to throw in full Stockfish evaluations as another input when writing game commentary. The limited supply of human games might hurt its overall playing strength, but throwing in a bunch of engine games for training, or starting from an existing one-node network, is likely to help a lot. That last one in particular should save a lot of training time.
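As a rough illustration of the shape such a model might take, here's a minimal PyTorch sketch. The input encoding (12 piece planes plus five scalar features) and the layer sizes are my assumptions, not a prescription. The Brier score for a game is just the squared error between the predicted probability vector and the one-hot actual result:

```python
import torch
import torch.nn as nn

class OutcomeNet(nn.Module):
    """Predicts P(win), P(draw), P(loss) for the side to move."""
    def __init__(self):
        super().__init__()
        # 12 piece planes on an 8x8 board, flattened, plus 5 scalar features:
        # both players' ratings, base time, increment, and clock time remaining.
        self.net = nn.Sequential(
            nn.Linear(12 * 64 + 5, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 3),
        )

    def forward(self, planes, scalars):
        logits = self.net(torch.cat([planes.flatten(1), scalars], dim=1))
        return torch.softmax(logits, dim=1)

def brier_loss(probs, outcome):
    """Mean squared error against the one-hot game result: the Brier score."""
    target = nn.functional.one_hot(outcome, num_classes=3).float()
    return ((probs - target) ** 2).sum(dim=1).mean()
```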
For further useful information you could train a neural network on the same corpus of games to predict the probability that a player will make each of the available legal moves based on their rating and the amount of time they spend making their move. Maybe the amount of time the opponent spent making their previous move should be included as well.
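The move-prediction side could share most of that machinery: a network outputs logits over some fixed move vocabulary, the rating and move-time features enter as extra scalar inputs, and illegal moves are masked out before the softmax. A minimal sketch of the masking step, with the vocabulary scheme (e.g. 64×64 from-to squares) left as an assumption:

```python
import torch

def legal_move_probs(logits, legal_indices):
    """Softmax restricted to the legal moves.

    logits: 1-D tensor over a fixed move vocabulary;
    legal_indices: indices of the currently legal moves in that vocabulary.
    """
    masked = torch.full_like(logits, float("-inf"))  # illegal moves get P = 0
    masked[legal_indices] = logits[legal_indices]
    return torch.softmax(masked, dim=0)
```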
With all the above information it would be easy to generate useful human commentary like 'The obvious move here is X, but that's a bad idea because of line Y, which even strong players are unlikely to see', or 'This position is an objective win but it's very tricky with very little time left on the clock'. The necessary information to make those observations is available, even if writing the color commentary is a whole other layer. Maybe an LLM could be trained to do that. It would probably help a lot for the LLM to be able to ask for evaluations of follow-on moves.
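As a hedged sketch of how the two models could drive that first kind of observation, something like the following could flag 'obvious but bad' moves. Here `policy_probs` and `expected_score` are hypothetical stand-ins for the move-prediction and outcome models above, and both thresholds are made up:

```python
import chess

def find_traps(board, rating, policy_probs, expected_score):
    """Flag moves a player of this rating is likely to try but shouldn't.

    `policy_probs` (hypothetical) returns {move: probability of playing it};
    `expected_score` (hypothetical) returns the expected score for the side
    to move (win=1, draw=0.5, loss=0), derived from the outcome model.
    """
    before = expected_score(board, rating)
    for move, p_play in policy_probs(board, rating).items():
        if p_play < 0.2:                 # only consider "obvious" candidates
            continue
        child = board.copy()
        child.push(move)
        after = 1.0 - expected_score(child, rating)  # flip back to mover's view
        if before - after > 0.25:        # a big drop in practical chances
            yield (f"The obvious move here is {board.san(move)}, but it's a "
                   f"trap: practical chances fall from {before:.0%} to {after:.0%}.")
```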
What all the above is missing is the ability to give any useful commentary on positional themes going on in games. Baby steps. Having any of the above would be a huge improvement in the state of the art. The insight that commentary needs to account for how the situation looks at different skill levels and time controls will remain an essential one moving forward.
What I'd really like to see come out of the above is better Chess instruction. There are religious wars constantly going on about what the best practical advice for lower-rated players is, and the truth is we simply don't know. When people collect data from real games they get results like the Caro-Kann being by far the best-scoring opening for lower-rated players as Black. That may or may not hold up, but it indicates that the advice given to lower-rated players based on what's theoretically best is more than a little bit dodgy.
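That kind of data collection is straightforward to sketch. Assuming a PGN dump such as the Lichess open database, something like this tallies Black's score by ECO opening code within a rating band; the file name, the 1000-1400 band, and grouping by ECO code are all assumptions:

```python
import chess.pgn
from collections import defaultdict

scores = defaultdict(lambda: [0.0, 0])  # ECO code -> [points for Black, games]

with open("games.pgn") as pgn:  # hypothetical PGN dump
    while (game := chess.pgn.read_game(pgn)) is not None:
        headers = game.headers
        try:
            black_elo = int(headers["BlackElo"])
        except (KeyError, ValueError):
            continue
        if not 1000 <= black_elo <= 1400:   # "lower rated" band (assumed)
            continue
        result = {"0-1": 1.0, "1/2-1/2": 0.5, "1-0": 0.0}.get(headers.get("Result"))
        if result is None:
            continue
        eco = headers.get("ECO", "?")
        scores[eco][0] += result
        scores[eco][1] += 1

for eco, (pts, n) in sorted(scores.items(), key=lambda kv: -kv[1][1])[:10]:
    print(f"{eco}: {pts / n:.1%} for Black over {n} games")
```

Grouping by ECO code is crude, and raw score conflates selection effects with opening quality, which is part of why such results might or might not be true.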
A darker use of the above would be to make a nearly undetectable cheating engine. With the addition of an output giving the range of time a player would plausibly spend in a given position, it could produce in real time an indistinguishable facsimile of a player of a given strength whose only tell is being a bit too typical/generic, and even that would be easy enough to mask by throwing in some bias. When it wanted to plausibly win a game against a much higher-rated opponent it could filter out candidate moves whose practical chances in the given situation are bad. That would result in very non-Stockfish-like play: seemingly a player of that skill level who happens to play particularly well that game. Good luck coming up with anti-cheat algorithms to detect that.
A slightly different approach: I think I would go for a prediction of, given this position and this player's rating, what move they will make. Then you could use that to deduce things like: this move is technically worse, but it can only be refuted by a 3000-level move, and it is very likely to beat a 2600.
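That deduction drops out of the same move-prediction model: push the technically-worse move and ask how likely an opponent of a given rating is to find the refuting reply. A tiny sketch, reusing the hypothetical `policy_probs` from earlier:

```python
def refutation_found_prob(board, move, refutation, opponent_rating, policy_probs):
    """P(an opponent of this rating plays the refuting reply).

    `policy_probs` is the same hypothetical move-prediction model as above,
    returning {move: probability of playing it} for the side to move.
    """
    child = board.copy()   # python-chess board assumed
    child.push(move)
    return policy_probs(child, opponent_rating).get(refutation, 0.0)
```

Against a 2600 this might come out near zero while against a 3000 it approaches one, which is exactly the 'technically worse but practically winning' distinction.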
The question of how to train an LLM to, e.g., explain to me when I should push my h-pawn positionally seems tougher. You can at least get a corpus of when it's actually correct and when it isn't, then maybe fine-tune the LLM on that?