First, Bram, I love that you are working on this. Git has been in the wild for over a decade, and version control could use some fresh thinking, especially some challenging ideas.
I plan to try to digest this for a bit; it's not 100% clear to me how this will work all of the time. I'd like to see more examples to fully understand it. Merges never failing doesn't yet make sense to me. So far I can only picture your solution as a merge being a superposition of commits.
I love the idea of improving rebasing, and you seem to be on a good path. On my current personal project I've started to use feature branches and then merge to an intermediate branch that mirrors main, so that merge-conflict resolution happens on the intermediate branch, and then I merge ff-only back to main. This preserves history, doesn't use a squash, and keeps main with only clean, stable commits.
My 3 largest version control pain points are:
* agentic coding - WIP commits tend to have a few orders of magnitude more changed lines of code than hand-written ones. The commit message and the commit size are not well summarized or understood. Code agents can refactor like a beast given the right prompt for hours at a time; I once switched from React to Svelte in a few prompts.
* left vs. right code ancestries always seem to confuse me. Rather than ordinal labels, I'd prefer logical assignment: which branch is left and which is right? Whose (per blame) branch is left, and whose is right? Which commit message is left, and which is right?
* binaries - git only works with lines of text, and only works well with 'pretty' code. If your code is heavy on syntactic sugar, is minified or compressed, or is a binary, then git breaks down quickly.
Compiled JavaScript that is all on one line is almost pointless to store in git. One statement per line works best, but if you define multiple vars, a longer anonymous object, or anything with any degree of complexity, then git treats the whole line as a change. Then you have to compare the A and B versions of the two lines and visually spot what changed. Some IDEs will show you sub-token changes between the two lines; most won't.
In the era of agentic coding, imagine a pull request where two complex lines look identical: to a human it looks like only one variable name was added or removed, but in reality a hacker changed one of the characters in one of the other variables to an international look-alike character to go undetected. In this case you could sneak in a change on one of the other variables and switch code behavior unexpectedly.
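For what it's worth, the homoglyph trick described above is mechanically detectable. Here's a minimal sketch (my own illustration, not part of any existing tool; a production checker would consult Unicode's confusables data rather than flagging every non-ASCII letter):

```python
import unicodedata

def flag_suspicious_identifiers(line):
    """Flag non-ASCII letters in a source line.

    Deliberately crude: it reports every non-ASCII letter along with its
    Unicode name, so a reviewer can spot Cyrillic/Greek look-alikes.
    """
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in line
        if ord(ch) > 127 and unicodedata.category(ch).startswith("L")
    ]

# The 'а' in 'vаlue' below is CYRILLIC SMALL LETTER A, not the Latin 'a'.
print(flag_suspicious_identifiers("total = vаlue + offset"))
# → [('а', 'CYRILLIC SMALL LETTER A')]
```

Hooked into a pre-receive check or a review bot, even this naive version would have caught the scenario above.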
On the binary front it would be cool if version control software could become 'token'-aware, in much the same way that LLMs now tokenize their input and output first. Tokens could be full variable names, or parts of a variable name (camel case, snake case, etc.).
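As a rough illustration of what token-aware diffing could look like, here's a toy sketch using Python's `difflib` with a naive camelCase/snake_case tokenizer (all names and the sample lines are made up for the example):

```python
import difflib
import re

def tokenize(line):
    # Split into identifiers, numbers, and single punctuation characters,
    # then further split identifiers on '_' and camelCase boundaries.
    raw = re.findall(r"[A-Za-z_]\w*|\d+|\S", line)
    tokens = []
    for t in raw:
        tokens.extend(p for p in re.split(r"_|(?<=[a-z])(?=[A-Z])", t) if p)
    return tokens

def token_diff(a, b):
    """Yield (op, old_tokens, new_tokens) for each token-level change."""
    ta, tb = tokenize(a), tokenize(b)
    sm = difflib.SequenceMatcher(a=ta, b=tb)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            yield (op, ta[i1:i2], tb[j1:j2])

old = "const maxRetryCount = parseConfig(opts).retries;"
new = "const maxAttemptCount = parseConfig(opts).retries;"
print(list(token_diff(old, new)))
# → [('replace', ['Retry'], ['Attempt'])]
```

Instead of reporting the whole line as changed, the diff narrows the edit down to the one sub-token that actually differs, which is exactly the granularity a reviewer wants on long or minified lines.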
Maybe the version control software supports tokenizer plugins, or maybe it supports file-type plugins for binaries?
Maybe the version control has plugins for different languages, PDFs, bitmaps, JPEGs, videos?
I've also been thinking about what I love about Resilio Sync and what feels missing from it. Version control seems like something that I both miss in, and that conflicts with, Resilio Sync, Dropbox, and other file-sync solutions.
Nothing really works well at both file sync across teams and version control for most file types. File-sync solutions seem to conflict with git, and git doesn't work well with binaries.
I should probably add a writeup of the CRDT anchoring algorithm and why it has eventual consistency. It’s a fairly esoteric algorithm which I independently reinvented. The short of it is that every line which could ever exist has a generation count, all starting at zero, and when you merge together multiple files you set the generation count of each line to the max across all of its parents. Lines then have a canonical ordering, and all lines with odd generation counts are included in the current version. The rest of the details are about making the data structure simple and compact.
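A toy model of that generation-count rule, assuming stable line ids (the dict representation is an illustrative simplification; the real data structure and the canonical ordering are more involved):

```python
def merge(*versions):
    """Merge file versions, each mapping a stable line id to a generation
    count. A line's generation starts at 0 and is bumped by one on every
    insert or delete, so odd means 'present' and even means 'absent'.
    Merging takes the per-line max, which never fails and is independent
    of merge order."""
    merged = {}
    for version in versions:
        for line_id, gen in version.items():
            merged[line_id] = max(merged.get(line_id, 0), gen)
    return merged

def live_lines(version):
    # Lines with odd generation counts are in the current text.
    return sorted(lid for lid, gen in version.items() if gen % 2 == 1)

base  = {"a": 1, "b": 1}             # lines a and b present
left  = {"a": 2, "b": 1}             # left deleted a  (1 -> 2, now even)
right = {"a": 1, "b": 1, "c": 1}     # right inserted c (0 -> 1, now odd)

print(live_lines(merge(left, right)))   # → ['b', 'c']
```

The delete and the insert merge cleanly with no conflict state, which is the "merges never fail" property in miniature.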
Having work-in-progress commits would be fairly reasonable in this schema. You’d put a ‘major ancestor’ advisory hint at the end of all the work in progress, similar to what’s suggested for rebases, although in this case it points to a more distant ancestor, not an immediate parent.
The ‘left’ and ‘right’ monikers are strictly advisory, and it would be trivial to change the API so actual descriptors could be passed in for what the two sides are. (Or more: this implementation doesn’t allow more than two things to be merged at once, but that’s completely doable.)
How does this compare to Pijul and Darcs, which at least at first glance seem to have a similar basis and properties? Darcs is the older of the two but sadly seems to have insurmountable performance problems in practice. Pijul seems to be progressing, but without much (recent) fanfare, and I'm not certain what its current state is.
Edit: Removed the reference to project age which is irrelevant for the question.
Those seem to work by keeping all historical patches around, reordering the patches canonically, and then doing some kind of mess to handle the cases where they don't apply. I've never seen an explanation of how they do it which I could understand, much less one that convinced me it's reliable. The approach I've come up with here is mathematically sound, and it's straightforward to prove that it's always commutative and associative. It's also highly performant and easy to develop an intuitive understanding of its behavior.
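Since each merge reduces to a per-line max of generation counts, the commutativity and associativity claims can be spot-checked mechanically. A toy property test over random versions (not the real data structure, just the max rule):

```python
import random

def merge2(x, y):
    # Per-line max of generation counts; missing lines default to 0.
    return {k: max(x.get(k, 0), y.get(k, 0)) for k in set(x) | set(y)}

random.seed(0)

def random_version():
    # Random generation counts over a small fixed set of line ids.
    return {k: random.randint(0, 5) for k in "abcde" if random.random() < 0.8}

# Spot-check commutativity and associativity on random inputs.
for _ in range(1000):
    a, b, c = random_version(), random_version(), random_version()
    assert merge2(a, b) == merge2(b, a)
    assert merge2(merge2(a, b), c) == merge2(a, merge2(b, c))

print("commutativity and associativity hold on all samples")
```

This isn't a proof, of course, but it matches the algebra: `max` is commutative and associative, and lifting it pointwise over line ids preserves both properties.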
The Pijul blog and documentation claim various properties (e.g., commutativity of independent patches, associativity of patches, consistent line ordering through merges), but I also haven't looked deeply enough to find a full explanation of how it guarantees those properties. Which is not to say that I dispute Pijul's claims - I just haven't tried to verify them.
https://www.youtube.com/watch?v=YXyaGe4N9oA
https://github.com/hyoo-ru/benzen -- it's already done here
I think the Manyana unified-style diff is better, but the wording in the diff is still very confusing to me, even after reading your textual description.