The State of Merging Technology

Some progress has been made but more is possible

Dec 13, 2023

The scope of this post is my current thoughts on merging technology as it exists today on Git and GitHub specifically.

Before getting into what’s improved and what can still be improved, it’s important to note that there’s one big feature which people want but for which there aren’t any serious proposals: the ability to say ‘Oops, I didn’t want this update here, I’d like to unroll it locally’. The deep UX problem is specifying what ‘locally’ means. Git thinks of itself as a distributed system where you can merge however you please (subject to technical limitations, more on that later) and doesn’t have any way to describe ‘the main branch’, or branches at all, when you’re specifying code changes. This is an important feature and more work should be done on it, but so far proposals have stalled out because they’ve tried to start with technical solutions before answering the big up-front question of the semantics of locality. That said, I’ll make a very speculative proposal at the bottom of this post.

Now on to the successes.

The biggest success so far is that there now seems to be general agreement that in the case of criss-cross merges the merge algorithm should be history aware. You’re very unlikely to be aware of this because there isn’t much public explanation which isn’t buried in extremely technical details, but it’s been going on quietly as people gain more experience. In any case, this is what seems to be the consensus: both the ‘rebase’ and ‘merge’ approaches to merging should be history aware, and in the criss-cross case both should pull in everything from both sides past the least common ancestor. Neither one is quite as history aware as it should be (more on that below), but the goal of supporting this behavior is generally, if implicitly, agreed upon. Part of what’s happened here is that experience has taught everyone that it’s best to simply admit that version control doesn’t currently support local undo, and that it’s a bad idea to damage merge behavior on behalf of half-assed support for it.
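To make the criss-cross case concrete, here is a minimal sketch in Python. It is an illustration only, not how git computes merge bases: the graph, the commit names and the helper functions are all made up for this example. It shows that when two branches have each merged the other, there is no single least common ancestor to run a plain 3-way merge against.

```python
# Hypothetical commit graph: each commit maps to its list of parents.
# A1 and B1 both branch from root; A2 merges B1 into A, B2 merges A1 into B.
CRISS_CROSS = {
    "root": [],
    "A1": ["root"],
    "B1": ["root"],
    "A2": ["A1", "B1"],
    "B2": ["B1", "A1"],
}


def ancestors(parents, commit):
    """Return the set containing commit and everything reachable from it."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def merge_bases(parents, a, b):
    """Return the common ancestors of a and b that are not themselves
    ancestors of some other common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c
        for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    }


print(merge_bases(CRISS_CROSS, "A2", "B2"))  # two bases: A1 and B1
```

With two equally good bases, a merge that picks just one of them throws away information from the other side, which is the situation history-aware merging is meant to handle.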

The biggest quick win available in version control today is that git should switch the default diff algorithm to histogram. If you’re a programmer you should set this as your own default locally (git config --global diff.algorithm histogram) and never worry about it again. It’s like keeping your head down when there’s enemy fire overhead. It almost never makes a difference, but when it does the alternative is extremely ugly. I don’t know why the default hasn’t been changed yet; probably mostly momentum.

The next easiest win would be to use the weave algorithm for non-rebase merges. A weave is a data structure which contains, in order, all the lines which have ever existed anywhere in the history of a file, and diffs are done against that entire list. This adds history awareness to non-rebase merges and handles criss-cross properly. Currently git uses simple 3-way merges for these and can bungle them horribly. This would be some work to implement, but it only changes the output the program produces when doing a merge, with zero changes to the history format or protocol, so it’s low risk. (A much higher risk, much lower reward use of weaves would be to use them for the history format. That would allow files to be kept compressed at all times while still allowing any of the compressed versions to be retrieved quickly, but would involve extensive changes everywhere. Since git takes a very conservative approach to modifying the history format, and there are good reasons for that, this enhancement is unlikely to happen any time soon.)
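As a rough illustration of the data structure described above, here is a toy weave in Python. It is a sketch under simplifying assumptions, not a proposal for git’s implementation: the Weave class and its method names are invented for this example, versions are added one at a time, and the diff is done with the standard library’s difflib rather than histogram diff.

```python
from difflib import SequenceMatcher


class Weave:
    """Toy weave: every line that ever existed, in order, each tagged
    with the set of versions that contain it."""

    def __init__(self):
        self.lines = []  # list of [text, set_of_version_ids]

    def add_version(self, version, new_lines):
        # Diff the new version against the entire weave, dead lines included.
        old_texts = [text for text, _ in self.lines]
        matcher = SequenceMatcher(a=old_texts, b=new_lines, autojunk=False)
        merged = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Matched lines now also belong to the new version.
                for entry in self.lines[i1:i2]:
                    entry[1].add(version)
                    merged.append(entry)
            else:
                # Unmatched old lines stay in the weave but not in this
                # version; unmatched new lines are inserted at this position.
                merged.extend(self.lines[i1:i2])
                merged.extend([text, {version}] for text in new_lines[j1:j2])
        self.lines = merged

    def get_version(self, version):
        # Any version is recovered by filtering the single ordered list.
        return [text for text, versions in self.lines if version in versions]


weave = Weave()
weave.add_version("v1", ["a", "b", "c"])
weave.add_version("v2", ["a", "c", "d"])
print(weave.get_version("v1"))  # ['a', 'b', 'c']
print(weave.get_version("v2"))  # ['a', 'c', 'd']
```

Because deleted lines keep their place in the list, a later merge can tell ‘this side deleted the line’ apart from ‘this side never had it’, which is the history awareness a plain 3-way merge lacks.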

Slightly more difficult than adding history awareness to non-rebase merges would be adding more history awareness to rebase merges. Specifically, individual commits should track what their original identity was, and when rebasing on top of something else, local changes should be skipped when they’ve already been applied to the thing being rebased onto, even if their nominal commit id has changed because they themselves were rebased. Currently the painstaking work of keeping track of what’s already been applied and not applying it twice is dumped on the programmer, with some hacks in place so that the system can handle it cleanly when they screw it up, as long as further changes haven’t already been made. This is exactly the sort of thing version control systems should do automatically, and keeping track of it in the data would allow much more flexible support for what’s applied where and when than is even possible with the command line options available today. This requires cramming some extra metadata into the history format, likely in the form of semantically meaningful comments about original ids put into commit messages, but nothing risky, because it only affects merging and not history reconstruction.
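The idea of tracking original identity can be sketched in a few lines of Python. This is a hypothetical model, not git’s data format or rebase logic: the Commit fields, the rebase function and the way new ids are formed are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    commit_id: str    # nominal id, changes every time the commit is rebased
    original_id: str  # id the change had when first authored, never changes
    message: str


def rebase(local, upstream):
    """Replay local commits on top of upstream, skipping any change whose
    original identity is already present upstream."""
    applied = {c.original_id for c in upstream}
    result = list(upstream)
    for commit in local:
        if commit.original_id in applied:
            continue  # already applied upstream under a different nominal id
        # Hypothetical new nominal id; the original id is carried along.
        rebased = Commit(commit.commit_id + "'", commit.original_id, commit.message)
        result.append(rebased)
        applied.add(commit.original_id)
    return result


upstream = [Commit("u1", "u1", "upstream work"), Commit("x9", "a1", "fix typo")]
local = [Commit("a1", "a1", "fix typo"), Commit("b2", "b2", "add feature")]
for commit in rebase(local, upstream):
    print(commit.commit_id, commit.original_id, commit.message)
```

Here the ‘fix typo’ change is skipped even though its nominal id upstream (x9) differs from the local one (a1), because the comparison is made on the original id.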

Once the groundwork has been laid, with history awareness added everywhere, it may be possible to finally add local undo. One approach would be to treat ‘local undo’ as a special case of ‘local change’, where you make a change locally and then apply both that change and an undo of that change to the main branch, so the local change will win on a final merge. For undo specifically, the undo might conflict horribly but the redo never does, because it leaves everything unchanged, so it would be good to have a special option which avoids having to make a specific image of the undo and skips straight to the redo, leaving everything unchanged. This isn’t a fully fleshed out proposal, but it seems like a viable approach which would at long last provide actual support for local undo.

Michael Toomim

Dec 15, 2023

Hi Bram, I've long been a fan of your work, and have met you at a party or two in SF. I do research on synchronization in the Braid.org group, and have had a particular interest in generalizing VCS theory with OT and CRDT theory. I followed many of your old posts when learning about VCS theory, and am grateful for your work and how you write it up for the public.

In research last year I discovered a mathematical equivalence between the Weave data structure and CRDT, and between "rebase" and OT. I suspect that these fields (CRDT, OT, and VCS) are about to converge, and we can benefit from clarifying their equivalence, and borrowing techniques from one to use in the other.

In particular, the modern implementations of the CRDT and OT algorithms do something you want: they precisely track specific changes through history, and provide high-performance abilities to check out any particular version from the past, and to rebase edits on other versions, without losing track of changes through history. Maybe I could share more on this with you sometime. Are you continuing to work on this topic?


Bram’s Thoughts
