A Novel Data Structure For Doing Merging Well

We can have good cherry-picking!

Dec 23, 2023

Continuing on my posts about eventually consistent merge algorithms, this one finally gets something decent. While eventual consistency sounds a bit impractical and unimportant it turns out this approach leads to a straightforward and reliable implementation of cherry picking which makes the whole exercise very compelling.

The other day I presented an algorithm for doing merging which is eventually consistent but complained that it’s ‘goofy’. The problem is that it invokes lexical order as a tiebreak way too often, with the ideal amount being ‘never’. That ideal isn’t reachable because in a distributed system if one person inserts a line into AB to make ACB and another does ADB then some tiebreak rule to determine the relative order of C and D which is consistent across the entire universe needs to be invoked. That said, if a series of updates are made to a file sequentially there doesn’t appear to be a reason why you shouldn’t be able to record the entire history with no lexical tiebreaks whatsoever, and I’ll now present a slightly quirky data structure which accomplishes exactly that.

The core idea is to organize all lines in a file in a tree. Each node in the tree corresponds to a line in the file except for the root which is the file’s beginning. Every node also has a notation as to whether it’s before or after its parent in the file. When generating an actual file out of this the rule is that if A is not a descendent of B and C is a descendent of B then B and C must both be on the same side of A. Only if a node has two descendants both in the same direction does the lexical order tiebreak rule get invoked. Usually it doesn’t have to be, because whenever a line is inserted between two others there will always be at least one of the two neighbors which has never had a descendant in the direction of the new line. (Note that this is a weave and deleted lines need to be included when deciding where a new line is inserted!)

Okay that was a bit of a mouthful so let’s do an example. Let’s say that we make an initial commit of a file as ABC. There are a few ways this can be interpreted, including:

File beginning
- → A
  - → B
    - → C

File beginning
- → C
  - ← A
    - ← B

File beginning
- → B
  - ← A
  - → C

It isn’t clear which of these is best. The first option seems cleanest. A binary splitting one will result in smaller representations of cherry picks. It may have about as much significance as which orientation you install toilet paper, or maybe there’s some compelling motivation for some order I haven’t thought of yet. Much more important than the specific interpretation is that one needs to be selected at commit time and that needs to be respected forever after.

For the sake of clarity let’s go with the first interpretation. Now let’s say you insert two lines to make ADEBC. There are two possible interpretations of this:

File beginning
- → A
  - ← D
    - → E
  - → B
    - → C

File beginning
- → A
  - ← E
    - ← D
  - → B
    - → C

Again there’s no compelling reason to prefer one over the other. Let’s go with the first one because it seems less weird. The three ways of inserting a line F around DE now result in:

File beginning
- → A
  - ← D
    - ← F
    - → E
  - → B
    - → C

File beginning
- → A
  - ← D
    - → E
      - ← F
  - → B
    - → C

File beginning
- → A
  - ← D
    - → E
      - → F
  - → B
    - → C

To do merges with this approach all the usual weave stuff applies: Every line which has ever been included in the file is remembered with a generation count where even numbers mean deleted (or in the case of 0 not born yet) and odd numbers mean present. When two versions of the same line are merged together the larger generation count wins. For two lines to be ‘the same’ they need not just their contents but their entire path to the root to be identical.

To do cherry picks with this structure the modified lines are presented with their updated generation counts. Note that sometimes these lines are dependent on other lines which aren’t updated. Those should be presented with generation count zero, so they’re enforcing position but their state is always deferred to the other side of a merge. Note that although cherry picks are explicit and can handle just about any historical ordering they don’t reference the history themselves.

There’s an interesting question of what order new lines should go in relative to lines which were previously deleted in the same location. The simple heuristic ‘new lines always go first’ may work fine but I haven’t thought through what edge cases there might be.

What form of lexical order should be used for tiebreaks probably doesn’t matter much but there may be some reason why it’s a good idea for empty lines to specifically go first or last.

If you’re presenting a conflict and have, for example, lines which are in the order 1234 and one side contains 13 and the other 24 rather than blow up the order by presenting that conflict as a simple two way one it’s probably better to present it as a conflict between 1 and 2 followed by a conflict between 3 and 4. If there’s a conflict between 13 and 2 that can be presented as a conflict between 1 and 2 followed by a conflict between 3 and nothing.

Although this supports cherry picking very well it requires explicit specification of what the cherry picks are to work properly. You can and should make it so that if the exact same patches are made in the exact same order then they’ll clean merge but if any sort of squashing is done differently on one side all bets are off. It appears that implicit cherry-picking is a fundamentally busted concept, similar to how implicit undo is a fundamentally busted concept. That said, we now have explicit local undo which works very well! To use it, make a local undo, then a create a redo which increments the generation counts to undo the undo, and apply that redo to the main branch. This is remarkably straightforward given the amount of flailing which has gone into figuring out methods of supporting it.

Another thing this doesn’t support is moves within files. It may be that implicit moves within files is a fundamentally busted concept because cleanly applying a change to a similar looking but actually different function is much more of a disaster than having an unnecessary conflict. Maybe there’s a way of supporting explicit moves within files, but that would be a big extension to the underlying data structure and a lot of complicated UX for very limited functionality gained.

Bram’s Thoughts

Discussion about this post