A central question in distributed version control is whether there’s a reasonable merge algorithm which will eventually produce the same file regardless of the order in which merges are done. Surprisingly, the answer turns out to be yes, though there’s some latitude in how the algorithm works and it could use more polish.
The idea is to think of a file not as a list of lines but as a set of states over every possible line in every possible position. Line positions form a tree. At the first level is every possible line, assumed to be in lexical order (or the lexical order of their hashes, which has the same effect). Above that are lines wedged in between those, each specifying the first-level line it comes after (with a special value for ‘the beginning’). There’s a lot of room for interpretation in deciding what the lines in a file you’re evaluating actually are. Treating the first line of a new file as first-level, the next line as second-level wedged in after it, the third line as third-level wedged in after the second line, and so on, is completely serviceable, although accumulating all that depth seems weird. Maybe there’s another way of specifying line positions which is a little less goofy, but this works as a proof of concept.
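As a sketch of that interpretation (the encoding here is a hypothetical choice, not part of any real tool): a position can be represented as a tuple of content hashes, so that a line wedged in after another sorts immediately after its anchor under plain tuple comparison.

```python
import hashlib

def line_id(text):
    """Content hash used to place a line in lexical (hash) order."""
    return hashlib.sha256(text.encode()).hexdigest()

def positions_for_new_file(lines):
    """The first line is first-level; each later line is wedged in one
    level deeper, after the line before it. A position is a tuple of
    hashes, and a tuple sorts before any longer tuple it is a prefix of,
    so a wedged line lands right after its anchor."""
    positions = []
    current = ()
    for text in lines:
        current = current + (line_id(text),)
        positions.append(current)
    return positions

pos = positions_for_new_file(["first", "second", "third"])
# prefixes sort before their extensions, so file order equals sort order
assert sorted(pos) == pos
```

The growing tuple length is exactly the “all that depth” awkwardness the text mentions; a cleverer encoding would keep positions shallower.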
The state of each line is determined by a generation count: 0 means it has never existed, 1 means it has been brought into creation, 2 means it was created then deleted, 3 means it has been re-created, and so on. When merging two different file states, the greater generation count of every possible line wins.
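A minimal sketch of that merge rule, assuming a file state is just a map from line position to generation count (the string keys here are hypothetical position keys chosen to sort in file order):

```python
def merge(a, b):
    """Merge two file states: the greater generation count per line wins."""
    return {pos: max(a.get(pos, 0), b.get(pos, 0)) for pos in set(a) | set(b)}

def visible_lines(state):
    """Odd generation counts are alive (1 created, 3 re-created, ...);
    even ones are absent (0 never existed, 2 deleted, ...)."""
    return [pos for pos in sorted(state) if state[pos] % 2 == 1]

ours   = {"1:X": 1, "2:A": 1}   # we created X and A
theirs = {"1:X": 1, "2:A": 2}   # the other side created A, then deleted it
merged = merge(ours, theirs)
assert visible_lines(merged) == ["1:X"]   # the deletion (generation 2) wins
```

Because pointwise max is commutative and associative, merging states in any order yields the same result, which is what makes the eventual file order-independent.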
In this approach an update is not a patch but a list of increased generation counts. It’s very important that the interpretation of any given commit is fixed at the time it’s created, and that later merges respect those mappings of lines in the file to potential lines with no reinterpretation, so that later merges always have the same ultimate result. It’s also important that any cherry-picking which is done be listed as generation count updates. This conveniently means cherry-picks don’t have to reference far-flung history to say what they’re cherry-picking, as long as they use the exact same line positions; by contrast, trying to do implicit cherry-picking by applying identical patches is likely to result in the same code appearing repeatedly in the eventually merged file. It may make sense to come up with a diff format which gives full information about line ids and generation counts.
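An update in this scheme can be sketched (hypothetically) as a map from line position to its raised generation count. Applying one is the same pointwise max as a merge, so receiving the same update twice along different paths changes nothing extra:

```python
def apply_update(state, update):
    """An update is not a patch but a list of raised generation counts.
    Applying it is a pointwise max, so applying the same update twice,
    or merging two histories that both contain it, is harmless."""
    out = dict(state)
    for pos, gen in update.items():
        out[pos] = max(out.get(pos, 0), gen)
    return out

state  = {"1:X": 1, "2:A": 1}
update = {"2:A": 2, "3:B": 1}   # delete A, insert B
once   = apply_update(state, update)
twice  = apply_update(once, update)
assert once == twice == {"1:X": 1, "2:A": 2, "3:B": 1}
```

This idempotence is why explicit cherry-picks, expressed as generation count updates on the exact same line positions, merge cleanly instead of duplicating code.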
The problematic edge cases with this have to do with conflicts. Universally in distributed version control there’s an issue where, if one person changes XY to XAY (assuming each letter is a whole line) and another changes it to XBY, then one person merges those together as XABY, another merges them as XBAY, and somebody tries to merge both of those results, it’s very unclear what the outcome should be. This bleeds into UX. Merge conflicts are always presented to the programmer with changes from one side shown first and changes from the other side shown second. Traditionally the order in which they’re presented is determined by which changes are ‘local’ and which are ‘remote’, which in a distributed system is a truly awful idea. The changes should be presented in the same order as they’ll appear to everybody else. In the case of a single line insertion on each side, presenting them in the predetermined order of those two lines works great. In more complicated cases, where there are multiple lines and multiple edits on both sides, it’s much more involved, with a lot of subtleties about which line position interpretations were used and what heuristic determines which side to show first. It’s possible to make something completely serviceable there, but in the face of a truly adversarial example there may be cases where a bit of behavioral correctness is sacrificed at the altar of eventual correctness. Then again, most systems already get these cases wrong, so one where they’ve been thought through carefully is likely to work better in practice.
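The XAY/XBY scenario can be checked concretely. Under the generation-count scheme described above, both merge orders produce the identical state, with A and B appearing in the predetermined order of their line positions (the string keys here are hypothetical sort keys standing in for real positions):

```python
def merge(a, b):
    """Pointwise max over generation counts, per the rule above."""
    return {pos: max(a.get(pos, 0), b.get(pos, 0)) for pos in set(a) | set(b)}

x_a_y = {"1:X": 1, "2:A": 1, "9:Y": 1}   # one side inserted A between X and Y
x_b_y = {"1:X": 1, "3:B": 1, "9:Y": 1}   # the other side inserted B

# Two people merge in different orders, then merge each other's results...
first  = merge(merge(x_a_y, x_b_y), x_b_y)
second = merge(merge(x_b_y, x_a_y), x_a_y)
assert first == second   # ...and converge on the same state

visible = [p.split(":")[1] for p in sorted(first) if first[p] % 2 == 1]
assert visible == ["X", "A", "B", "Y"]   # A before B, in predetermined order
```

The unresolved UX question is only which side to *display* first in a conflict; the merged state itself is fixed by the positions, never by who merged when.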