Descendants of abandoned commits should be rebased onto their parents,
or the rewritten parents if they had been rewritten. This patch
teaches `DescendantRebaser` to do that. It updates `jj rebase -r` to
use the functionality. I plan to also use it in `jj abandon`
(naturally, given the name), and for rebasing descendants of deleted
refs imported from `jj git refresh/fetch/push`.
The fact that `DescendantRebaser` visits some commits that don't need
to be rebased is mostly an implementation detail. I can't think of a
reason that callers would care about these commits.
The command's help text says "Abandon a revision", which I think is a
good indication that the command's name should be `abandon`. This
patch renames the command and other user-facing occurrences of the
word. The remaining occurrences should be removed when I remove
support for evolution.
This patch moves the function for updating branches after rewrite from
`commands.rs` into `rewrite.rs`.
It also changes the function to update branches even if they were
conflicted or become conflicted. I think that seems better than
leaving branches on old commits. For example, let's say you start
with this:
```
C main
|
B origin@main
|
A
```
You now pull from origin, which has updated the main branch from B to
B'. We apply that change to both the remote branch and the local
branch, which results in a conflict in the local branch:
```
C main?
|
B B' main? origin@main
|/
A
```
If you now rewrite C to C', the conflicted main branch will still
point to C, which is just weird. This patch changes that so the
conflicted side of main gets repointed to C'.
I also refactored the code to reuse our existing
`MutableRepo::merge_single_ref()`, which improves the behavior in
several cases, such as the conflict-resolution case in the last test
case.
As the updated test case shows, when rebasing forward, we missed
commits that fork off from the section between the source and the
destination.
As part of the fix, I also restructured the code a bit to prepare for
support for rebasing descendants of multiple rewritten commits.
It turns out that `FETCH_HEAD` is not the remote's `HEAD` (it's
actually not even a normal symbolic ref; it contains many lines of
commits and names). We're supposed to ask the remote for its default
branch instead. That's what this patch does.
It's annoying to have to add `--branch main` every time I push to
GitHub.
Maybe we should make it push only the current branch by default, but
we don't even have a concept of a current branch yet...
Before this change, you could end up with an index segment with 10
commits, then a child segment with 9 commits, then another child with
8 commits, and so on. That's not what I had intended. This change
makes it so we squash if a segment has more than half as many commits
as its parent instead.
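To illustrate the squash rule, here is a minimal sketch. It is not the
actual index code: `add_segment` and the plain list of segment sizes
are made up for this example, and real segments of course hold commits
rather than counts.

```rust
/// Segment sizes ordered from the oldest ancestor (first) to the newest
/// child (last). Hypothetical stand-in for the real index segments.
fn add_segment(segments: &mut Vec<usize>, new_commits: usize) {
    segments.push(new_commits);
    while segments.len() >= 2 {
        let child = segments[segments.len() - 1];
        let parent = segments[segments.len() - 2];
        // Stop once the child has at most half as many commits as its parent.
        if 2 * child <= parent {
            break;
        }
        // Otherwise squash the child into its parent and check again.
        segments.pop();
        *segments.last_mut().unwrap() += child;
    }
}
```

With this rule, a segment of 10 commits followed by one of 9 is
squashed into a single segment of 19, so each segment in the chain ends
up at most half the size of its parent.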
We have had support for ignores via `.gitignore` files since
3b326a942c, and we haven't had the problem with the temporary
`.git/` directory created by libgit2 since 88f7f4732b.
Git doesn't want `.git` entries in its trees, so at least when using
the Git backend, we need to ignore such paths. Let's just ignore
`.git` paths regardless of backend to keep it simple.
Closes #24.
When I added the function for rebasing descendants, I forgot to call
the existing `rebase()` function and instead simply created a new
commit with the new parents but the old contents.
This should be useful in lots of places. For example, `jj rebase -r`
currently rebases all descendants, because that's what the auto-evolve
feature does. I think it would be nice to instead copy from
Mercurial's `-s` flag for also rebasing descendants. Then `jj rebase
-r` can be made to pull a commit out of a stack, rebasing descendants
onto the rebased commit's parents. I also intend to use this
functionality for rebasing descendants when remote branches have been
rewritten.
The auto-rebasing of descendants doesn't work if you have an open
commit checked out, which means that you may still end up with orphans
in that case (though that's usually a short-lived problem since they
get rebased when you close the commit). I'm also about to make
branches update to successors, but that also doesn't work when the
branch is on a working copy commit that gets rewritten. To fix this
problem, I've decided to make the caller of `WorkingCopy::commit()`
responsible for the transaction.
I expect that some of the code that this change moves from the lib
crate to the cli crate will later move back into the lib crate in some
form.
With this change, we no longer fail if the user moves a branch
sideways or backwards and then pushes.
The push should ideally only succeed if the remote branch is where we
thought it was (like `git push --force-with-lease`), but that requires
rust-lang/git2-rs#733 to be fixed first.
Otherwise remote-tracking branches just pile up.
It seems that both git and libgit2 remove the remote-tracking branch
when you push a deletion, so `jj branch --delete foo; jj git push
--branch foo` already sees `foo` disappear locally as well. However,
if a branch has been deleted on the remote, we would never know before
this change.
Now that we have native branches, we can make `jj git push` only be
about pushing a branch to a remote branch with the same name.
We may want to add back support for the more advanced case of pushing
an arbitrary commit to an arbitrary branch later, but let's get the
common case simplified first.
This adds support for resolving tags and branches in revsets. Branches
and tags can be resolved by specifying their name (e.g. "main"). To
specify a branch's target on a remote, use e.g. "main@origin". In case
of conflicts, they get resolved to their "adds".
Now that we have our own representation of branches and tags, let's
update them when we import git refs. The View object's git refs are
now just a record of what the refs were in the underlying git repo the last
time we imported them (we don't -- and won't -- provide a way for the
user to update our record of the git refs). We can therefore do a nice
3-way ref-merge using the `refs` module we added recently. That means
that we'll detect conflicts caused by changes made concurrently in the
underlying git repo and in jj's view.
I've finally decided to copy Git's branching model (issue #21), except
that I'm letting the name identify the branch across
remotes. Actually, now that I think about it, that makes them more like
Mercurial's "bookmarks". Each branch will record the commit it points
to locally, as well as the commits it points to on each remote (as far
as the repo knows, of course). Those records are effectively the same
thing as Git's "remote-tracking branches"; the difference is that we
consider them the same branch. Consequently, when you pull a new
branch from a remote, we'll create that branch locally.
For example, if you pull branch "main" from a remote called "origin",
that will result in a local branch called "main", and also a record of
the position on the remote, which we'll show as "main@origin" in the
CLI (not part of this commit). If you then update the branch locally
and also pull a new target for it from "origin", the local "main"
branch will be divergent. I plan to make it so that pushing "main"
will update the remote's "main" iff it was currently at "main@origin"
(i.e. like using Git's `git push --force-with-lease`).
This commit adds a place to store information about branches in the
view model. The existing git_refs field will be used as input for the
branch information. For example, we can use it to tell if
"refs/heads/main" has changed and how it has changed. We will then use
that ref diff to update our own record of the "main" branch. That will
come later. In order to let git_refs take a back seat, I've also added
tags (like Git's lightweight tags) to the model in this commit.
I haven't ruled out *also* having some more persistent type of
branches (like Mercurial's branches or topics).
I'm about to add some support for branches and tags (for issue #21)
and it seems that we didn't have explicit testing of merging of
views. There was `test_import_refs_merge()` in `test_git.rs` but
that's specifically for git refs. It seems that it's made obsolete by
the tests added by this commit, so I'm removing it.
I had previously created commit messages based only on the ref name,
which meant that `commit4` and `commit5` ended up being the same
commit. This fixes that problem.
There were some tests that discarded a transaction only because it
used to be easier to do that than to commit and reload the repo. We
get the new repo back when we commit the transaction these days, so
now it's often easier to commit the transaction instead.
When there are two concurrent operations, we would resolve conflicting
updates of git refs quite arbitrarily before this change. This change
introduces a new `refs` module with a function for doing a 3-way merge
of ref targets. For example, if both sides moved a ref forward but by
different amounts, we pick the descendant-most target. If we can't
resolve it, we leave it as a conflict. That's fine to do for git refs
because they can be resolved by simply running `jj git refresh` to
import refs again (the underlying git repo is the source of truth).
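The merge rule can be sketched roughly as follows. This is only an
illustration under simplifying assumptions, not the code in the `refs`
module: `Graph`, `RefTarget`, `merge_ref()` and `example_graph()` are
names invented for the example, commits are plain strings, and each
input is a single non-conflicted target.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical commit graph: maps each commit to its parents.
type Graph = HashMap<&'static str, Vec<&'static str>>;

#[derive(Debug, PartialEq)]
enum RefTarget {
    Normal(&'static str),
    Conflict {
        removes: Vec<&'static str>,
        adds: Vec<&'static str>,
    },
}

/// Returns true if `ancestor` is reachable from `descendant` (or equal to it).
fn is_ancestor(graph: &Graph, ancestor: &'static str, descendant: &'static str) -> bool {
    let mut stack = vec![descendant];
    let mut seen = HashSet::new();
    while let Some(commit) = stack.pop() {
        if commit == ancestor {
            return true;
        }
        if seen.insert(commit) {
            if let Some(parents) = graph.get(commit) {
                stack.extend(parents.iter().copied());
            }
        }
    }
    false
}

/// 3-way merge of one ref, with `base` as the common starting point.
fn merge_ref(
    graph: &Graph,
    base: &'static str,
    left: &'static str,
    right: &'static str,
) -> RefTarget {
    // Only one side changed, or both sides made the same change.
    if left == right || right == base {
        return RefTarget::Normal(left);
    }
    if left == base {
        return RefTarget::Normal(right);
    }
    // Both sides moved forward: pick the descendant-most target.
    let both_forward = is_ancestor(graph, base, left) && is_ancestor(graph, base, right);
    if both_forward && is_ancestor(graph, left, right) {
        return RefTarget::Normal(right);
    }
    if both_forward && is_ancestor(graph, right, left) {
        return RefTarget::Normal(left);
    }
    // Anything else is left as a conflict.
    RefTarget::Conflict {
        removes: vec![base],
        adds: vec![left, right],
    }
}

/// Example history: A <- B <- C, plus D forking off from A.
fn example_graph() -> Graph {
    HashMap::from([
        ("A", vec![]),
        ("B", vec!["A"]),
        ("C", vec!["B"]),
        ("D", vec!["A"]),
    ])
}
```

In the example graph, merging a move from A to B with a move from A to
C picks C, while merging moves to C and to D leaves a conflict that
removes A and adds both C and D.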
As with the previous change, I'm doing this now mostly because
it is a good stepping stone towards branch support (issue #21). We'll
soon use the same 3-way merging for updating the local branch
definition (once we add that) when a branch changes in the git repo or
on a remote.
This adds support for having conflicting git refs in the view, but we
never create conflicts yet. The `git_refs()` revset includes all "add"
sides of any conflicts. Similarly `origin/main` (for example) resolves
to all "adds" if it's conflicted (meaning that `jj co origin/main` and
many other commands will error out if `origin/main` is
conflicted). The `git_refs` template renders the reference for all
"adds" and adds a "?" as suffix for conflicted refs.
The reason I'm adding this now is not because it's high priority on
its own (it's likely extremely uncommon to run two concurrent `jj git
refresh` and *also* update refs in the underlying git repo at the same
time) but because it's a building block for the branch support I've
planned (issue #21).
This copies the conflict marker format I added a while ago to
Mercurial (https://phab.mercurial-scm.org/D9551), except that it uses
`+++++++` instead of `=======` for sections that are pure adds. The
reason I made that change is because we also have support for pure
removes (Mercurial never ends up in that situation because it has
exactly one remove and two adds).
This change resolves part of issue #19.
I think `files::merge()` will be a useful place to share code for
resolving conflicting hunks after all. We'll want `MergeHunk` to
support multi-way merges then.
When there are conflicts between different types of tree entries, we
currently materialize them as "Unresolved complex conflict.". This
change makes it so we mention what types were involved and what their
ids were (though we still don't have an easy way of resolving an id).
The new `diff::DiffHunk` type is very similar but more generic. We
don't need the generality here. I just don't want two very similar types
with the same name.
I have been trying to figure out how to generalize diffs and merges
for arbitrary number of inputs. For example, I want to have an
internal representation of an octopus merge adding 5 inputs (file
states/contents) and removing 4 inputs. I also want to be able to represent
a diff from a regular 3-way-conflict state to a resolved state. Such a
diff would be from a state adding two inputs and removing one, to a
state adding just one input.
I finally realized last week that the problem is simple if you don't
care about adds vs removes. Instead, you line up the matching and
differing parts of all the inputs. It's then up to the caller to use
it in an appropriate way for its use case. For example, a regular diff
would pass in two inputs and would get back a list of matching and
differing hunks. It might then present the first element of differing
hunks in red and the second element in green. Similarly, a 3-way merge
would pass in three inputs with the base first. It would then compare
the sides and decide on a resolution (or leave it unresolved if all
three sides are different).
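As a sketch of how a caller might consume such a result for a 3-way
merge (the `DiffHunk` shape and `resolve_three_way()` below are
simplified assumptions for illustration, not necessarily the type this
change adds):

```rust
/// One aligned piece of the inputs: either all inputs agree, or we get
/// one differing part per input, in input order.
#[derive(Debug, PartialEq)]
enum DiffHunk<'a> {
    Matching(&'a str),
    Different(Vec<&'a str>),
}

/// Resolves hunks from a 3-input diff where the inputs were passed as
/// [base, left, right]. Returns `None` if any hunk can't be resolved.
fn resolve_three_way(hunks: &[DiffHunk<'_>]) -> Option<String> {
    let mut out = String::new();
    for hunk in hunks {
        match hunk {
            DiffHunk::Matching(text) => out.push_str(text),
            DiffHunk::Different(parts) => match parts.as_slice() {
                // Both sides agree, or only the left side changed.
                [base, left, right] if left == right || right == base => out.push_str(left),
                // Only the right side changed.
                [base, left, right] if left == base => out.push_str(right),
                // All three differ (or the hunk doesn't have 3 parts).
                _ => return None,
            },
        }
    }
    Some(out)
}
```

A regular 2-input diff would use the same hunks but simply present the
two parts of each `Different` hunk as removed and added text.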
This change adds a type representing this kind of multi-way
diff. Coming changes will update existing code to use it. In addition
to making the existing code simpler and more consistent, having this
in place should also make it much easier to:
* Present merge conflicts involving more than 3 parts.
* Experiment with different ways of displaying diffs from/to conflict
states.
* Experiment with sub-line-level merging.
Unlike the other places I fixed in 134940d2bb, the calls in
`working_copy.rs` should not simply use an existing file if the target
file was open. They should probably try again instead, but I'll leave
that for later.
On Windows, it seems that you can't rename a file if the target file
is open (Stebalien/tempfile#131). I think that's the reason for our
failing tests on Windows. This patch adds a simple wrapper around
`NamedTempFile::persist()` that returns the existing file instead of
failing, if there is one.
I don't know why these used to fail. Perhaps it was just that
GitHub's Windows machines were not powerful enough to run them with 100
threads doing concurrent commits. Maybe they will pass now that we
limit the number of threads to the number of CPUs. This change enables
the tests so we can see what GitHub CI thinks.
This change teaches `Tree::diff()` to filter by a matcher. It only
filters the result so far; it does not restrict the tree walk to what
`Matcher::visit()` says is necessary yet. It also doesn't teach the
CLI to create a matcher and pass it in.
The two types have become very similar so it doesn't seem that there's
any point in having two types. We should probably do the same with
`ReadonlyEvolution` and `MutableEvolution`.
This patch makes it so we attempt to resolve a symbol as the
non-obsolete commits in a change id if all other resolutions
fail.
This addresses issue #15. I decided to not require any operator for
looking up by change id. I want to make it as easy as possible to use
change ids instead of commit ids to see how well it works to interact
mostly with change ids instead of commit ids (I'll try to test that by
using it myself).
The fact that the default change id in git repos is currently a prefix
of the commit id makes it impossible to use for resolving a prefix of
the change id to commits. This patch addresses that by reversing the
bits of the change id (relative to the commit id). The next patch will
make it so a change id (or a prefix thereof) is a valid revset.
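A minimal sketch of the idea, assuming the change id is derived by
reversing the bit order of some commit id bytes (the function name and
the exact byte range used are assumptions made for this example):

```rust
/// Reverses the order of all bits in `id`: the last bit of the last
/// byte becomes the first bit of the first byte.
fn reverse_id_bits(id: &[u8]) -> Vec<u8> {
    id.iter().rev().map(|byte| byte.reverse_bits()).collect()
}

/// Formats bytes as lowercase hex, the way ids are shown on the CLI.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|byte| format!("{:02x}", byte)).collect()
}
```

For example, commit id bytes `ab cd 01` give a change id of `80b3d5`,
which no longer shares a hex prefix with the commit id it was derived
from.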
I'd like to experiment with mostly using change ids instead of commit
ids on the CLI. Then it needs to be easy to refer to the non-obsolete
commits in a change, which means we probably don't want to require any
operators (i.e. a plain change id should resolve to the non-obsolete
commits in the change). This patch prepares for letting a change id
resolve to (possibly) many commits.
I had initially hoped that the type-safety provided by the separate
`FileRepoPath` and `DirRepoPath` types would help prevent bugs. I'm
not sure if it has prevented any bugs so far. It has turned out that
there are more cases than I had hoped where it's unknown whether a
path is for a directory or a file. One such example is for the path of
a conflict. Since it can be conflict between a directory and a file,
it doesn't make sense to use either. Instead we end up with quite a
bit of conversion between the types. I feel like they are not worth
the extra complexity. This patch therefore starts simplifying it by
replacing uses of `FileRepoPath` by `RepoPath`. `DirRepoPath` is a
little more complicated because its string form ends with a '/'. I'll
address that in separate patches.
I thought I had looked for this case and cleaned up all the places
when I made `Transaction::commit()` return a new `ReadonlyRepo`. I
must have forgotten to do that, because there were tons of places to
clean up left.
This commit rewrites the divergence-resolution part of `evolve()` as an
iterator (though not implementing the `Iterator` trait). Iterators are
just much easier to work with: they can easily be stopped, and errors
are easy to propagate. This patch therefore lets us propagate errors
from writing to stdout (typically pipe errors).
This makes the working copy walk skip an entire ignored directory if
there are no negative patterns later in the ignore file. That speeds
up `jj st` in this repo with ~13k files in `target/` from ~100 ms to
~25 ms (6.0dB). This closes issue #8.
This is to address issue #8. I haven't added the optimization to avoid
walking all the files in `target/` yet. Even so, this patch still
speeds up `jj st` in this repo, with ~13k files in `target/`, from
~320 ms to ~100 ms (-5.1dB). The time actually checking if paths match
gitignores seems to go down from 116 ms to 6 ms. I think that's mostly
because libgit2 has to look for `.gitignore` files in every parent
directory every time we ask it about a file, while the rewritten code
looks for a `.gitignore` file only when visiting a new directory.
When using the command line interface (which is the only interface so
far), it seems more useful to see the exact command that was run than
a logical description of what it does. This patch makes the CLI record
that information in the operation metadata in a new key/value field. I
put it in a generic key/value field instead of a more specialized
field because the key/value field seems like a useful thing to have in
general. However, that means that we "have to" do shell-escaping when
saving the data instead of leaving the data unescaped and adding the
shell-escaping when presenting it. I added very simple shell-escaping
for now.
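Something along these lines would count as "very simple"
shell-escaping. This is a sketch, not the code in this patch;
`shell_escape()`, `command_line()` and the set of characters treated as
safe are assumptions for the example.

```rust
/// Quotes `arg` for a POSIX-style shell unless it consists only of
/// characters that are safe to leave unquoted.
fn shell_escape(arg: &str) -> String {
    let is_safe = |c: char| c.is_ascii_alphanumeric() || "-_./=:,@".contains(c);
    if !arg.is_empty() && arg.chars().all(is_safe) {
        arg.to_string()
    } else {
        // Wrap in single quotes; an embedded single quote becomes '\''.
        format!("'{}'", arg.replace('\'', r"'\''"))
    }
}

/// Joins the escaped arguments into one string suitable for storing
/// in a key/value metadata field.
fn command_line(args: &[&str]) -> String {
    args.iter()
        .map(|arg| shell_escape(arg))
        .collect::<Vec<_>>()
        .join(" ")
}
```

With this, the arguments `jj`, `describe`, `-m`, `fix bug` would be
recorded as `jj describe -m 'fix bug'`.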
I've wanted the API to look like this for a while. It seems like a
good API to me. It means that the caller won't have to reload the repo
after committing. The cost seems relatively small. It involves copying
potentially a lot of data in memory (at least the View object), but it
shouldn't involve reading from disk or any other processing. To reduce
the amount of data to copy, it may be worth switching to persistent
data types. I've also wanted to do that for the copying we do when
we start a transaction.
I couldn't measure any slowdown caused by this change.
The git.git repo seems to have lots of merges from far back in the
history into newer history. That results in `jj log -r 'git_refs()'`
being completely useless because of the number of such edges. For
example, v2.31.0 has almost 600 edges going out of it and presumably
merging (forking) back into various different previous versions. Git,
unlike Mercurial, seems to remove an edge from the graph if the edge
can also be reached via a longer path. This commit makes it so we also
do that (i.e. the filtered graph is a transitive reduction of the
graph before filtering).
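For reference, a transitive reduction of a small DAG can be computed
like this. It is a naive sketch for illustration, not the iterator
adaptor itself; `Edges`, `edges_from()` and the use of integers for
commits are assumptions.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical graph representation: node -> set of direct targets.
type Edges = BTreeMap<u32, BTreeSet<u32>>;

/// Builds an edge map from (source, target) pairs.
fn edges_from(pairs: &[(u32, u32)]) -> Edges {
    let mut edges = Edges::new();
    for &(source, target) in pairs {
        edges.entry(source).or_default().insert(target);
    }
    edges
}

/// Returns true if `to` can be reached from `from` (or is equal to it).
fn reachable(edges: &Edges, from: u32, to: u32) -> bool {
    let mut stack = vec![from];
    let mut seen = BTreeSet::new();
    while let Some(node) = stack.pop() {
        if node == to {
            return true;
        }
        if seen.insert(node) {
            if let Some(next) = edges.get(&node) {
                stack.extend(next.iter().copied());
            }
        }
    }
    false
}

/// Drops every direct edge whose target can also be reached via a
/// longer path through one of the node's other direct targets.
/// Assumes the graph is acyclic.
fn transitive_reduction(edges: &Edges) -> Edges {
    let mut reduced = Edges::new();
    for (&node, targets) in edges {
        let kept: BTreeSet<u32> = targets
            .iter()
            .copied()
            .filter(|&target| {
                !targets
                    .iter()
                    .any(|&other| other != target && reachable(edges, other, target))
            })
            .collect();
        reduced.insert(node, kept);
    }
    reduced
}
```

Given edges 3→2, 2→1 and 3→1, the edge 3→1 is dropped because 1 is
already reachable from 3 via 2.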
This slows down `jj log -r ,,v2.0.0 -T ""` by about 2%. That's still
small enough that it doesn't seem worth it to have a separate iterator
for contiguous ranges (which would be an option).
When rendering a non-contiguous subset of the commits, we want to
still show the connections between the commits in the graph, even
though they're not directly connected. This commit introduces an
adaptor for the revset iterators that also yields the edges to show in
such a simplified graph.
This has no measurable impact on `jj log -r ,,v2.0.0` in the git.git
repo.
The output of `jj log -r 'v1.0.0 | v2.0.0'` now looks like this:
```
o e156455ea491 e156455ea491 gitster@pobox.com 2014-05-28 11:04:19.000 -07:00 refs/tags/v2.0.0
:\ Git 2.0
: ~
o c2f3bf071ee9 c2f3bf071ee9 junkio@cox.net 2005-12-21 00:01:00.000 -08:00 refs/tags/v1.0.0
~ GIT 1.0.0
```
Before this commit, it looked like this:
```
o e156455ea491 e156455ea491 gitster@pobox.com 2014-05-28 11:04:19.000 -07:00 refs/tags/v2.0.0
| Git 2.0
| o c2f3bf071ee9 c2f3bf071ee9 junkio@cox.net 2005-12-21 00:01:00.000 -08:00 refs/tags/v1.0.0
| |\ GIT 1.0.0
```
The output of `jj log -r 'git_refs()'` in the git.git repo is still
completely useless (it's >350k lines and >500MB of data). I think
that's because we don't filter out edges to ancestors that we have
transitive edges to. Mercurial also doesn't filter out such edges, but
Git (with `--simplify-by-decoration`) seems to filter them out. I'll
change it soon so we filter them out.
This adds a `git_refs()` revset that includes all commits pointed to
by a git ref. It's not very useful yet because the graph log doesn't
use the right type of edges for non-contiguous commits.
Merging is currently done with line-level granularity, so it makes
sense to have newlines after the markers. That makes them easier to
edit out when resolving conflicts.
This lets you use the same operator as we currently have for ancestors
and descendants (`,,`) to also specify a DAG range. That's what
Mercurial uses the `::` operator for and what Git has `git log
--ancestry-path` for.
It seems clearer to let the parsed `RevsetExpression`s have only root
and head expressions instead of adding the ancestors when building the
expression tree.
I really liked the idea of having the operators for parents and
ancestors (etc.) look similar, but that turned out to be problematic
when we want to add an infix operator for a DAG range (hg's `::`
revset operator and git's `--ancestry-path` flag). Let's say we chose
`:*:` as the operator. Part of the problem is how to parse `foo:*:bar`
without eagerly parsing the `foo:`. It would also be nicer to use
exactly the same operator as prefix, postfix, and infix. Since the
"parents" operator can be repeated, we can't have it be just `:` and
the "ancestors" operator be `::`. We could make the "ancestors"
operator be something like `*:*` (or anything symmetric with the `:`
symbol on the inside). However, at that point, the operator is getting
ugly and hard to type. Another option would be to use `:` for
ancestors and `::` for parents, but that is counterintuitive and gets
annoying if you want to repeat it. So it seems that the best option is
to simply pick different symbols for parents/children and
ancestors/descendants/range.
This patch changes the ancestors/descendants operators to both be
`,,`. I'm not at all attached to that particular symbol. I suspect
we'll change it later.
Now that expressions may contain literal strings, we can simply have
functions accept only expression arguments. That simplifies both the
grammar and the code.
A small drawback is that `description((foo), bar)` is now allowed and
does a search for the string "foo" (not "(foo)"). That seems
unlikely to trip up users.
Git refs with names containing e.g. "-" are currently not accepted
symbol names, and I don't plan to change the grammar to accept
them. Instead, let's have the user quote symbol names containing
unusual characters. That way we can keep these symbols reserved for
revset operators.
With this patch the user can do e.g. `jj diff -r '"v2.9.0-rc2"'`.
This adds `children(<set>)` and `<set>:` for the children of the given
set, and `descendants(<set>)` and `<set>:*` for the descendants of the
given set. The children and descendants are filtered to be among
ancestors of non-obsolete commits. I haven't added a way of overriding
that yet.
This is especially important now that we leak the rule names into the
`SyntaxError` message. For example, the error message when doing `jj
diff -r :` will now mention "expected parents_op, ancestors_op, or
primary". It seems much clearer with the "_op" suffixes there. Longer
term, we should think more about how we can best surface syntax errors
from the library crate.
The tests don't need any complex set up (no repo necessary), so they
can be in the `revset` module itself. I'm sure we'll need to split up
that module later (at least separate out the parsing), but that's a
separate problem.
I don't know why I made it walk by generation number to start
with. Walking by position is better in at least two ways: 1) revsets
now depend on the walks to be by descending index position (though
they could equally well depend on the walks to be by generation number
-- it just needs to be consistent), and 2) the log output gets less
interleaved.
This commit makes the number of bytes in the graphlog output in the
git.git repo drop by ~40% due to the reduced amount of
interleaving. Also, it reduces the time of `jj bench walkrevs v1.0.0
v2.0.0` in the git.git repo by 32% (9.4ms -> 6.4ms) and `jj bench
walkrevs v2.0.0 v1.0.0` by 33% (7.7ms -> 5.1ms).
This change adds a `non_obsolete_heads(<set>)` revset, which walks up
ancestors of the input set until it gets to a non-obsolete and
non-pruned commit. That's what we do by default in `jj log`
(i.e. without `--all`). Now we can make `jj log` use revsets and teach
it a `-r` option!
This adds `parents(foo)` and `ancestors(foo)` as alternative ways of
writing `:foo` and `*:foo`.
I haven't added support for whitespace yet; the parsing is very
strict. The error messages will also need to be improved later.
This patch adds initial support for a DSL for specifying revisions
inspired by Mercurial's "revset" language. The initial support
includes prefix operators ":" (parents) and "*:" (ancestors) with
naive parsing of the revsets. Mercurial uses postfix operator "^" for
parent 1 just like Git does. It uses prefix operator "::" for
ancestors and the same operator as postfix operator for descendants. I
did it differently because I like the idea of using the same operator
as prefix/postfix depending on desired direction, so I wanted to apply
that to parents/children as well (and for
predecessors/successors). The "*" in the "*:" operator is copied from
regular expression syntax. Let's see how it works out. This is an
experimental VCS, after all.
I've updated the CLI to use the new revset support.
The implementation feels a little messy, but you have to start
somewhere...
This actually seems to make it slightly slower, but it fixes an
important bug (we used to evolve only one topological branch per `jj
evolve` call). The slowdown seemed to be on the order of 5% when
evolving 100 commits on git.git's "what's cooking" branch.
I suspect that at least one reason that I didn't make
`MutableRepo::base_repo` be an `Arc<ReadonlyRepo>` before was that I
thought that that would mean that `start_transaction()` would need to be
moved off of `ReadonlyRepo` so it can be given an
`&Arc<ReadonlyRepo>`, which would make it much less convenient to
use. It turns out that a `self` argument can actually be of type
`&Arc<ReadonlyRepo>`.
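A stripped-down sketch of the pattern (the struct fields and the
`new_repo()` helper are placeholders for this example, not the real
types):

```rust
use std::sync::Arc;

/// Placeholder for the real repo type.
struct ReadonlyRepo {
    name: String,
}

/// Placeholder for the transaction; it keeps the base repo alive.
struct Transaction {
    base_repo: Arc<ReadonlyRepo>,
}

impl ReadonlyRepo {
    /// The receiver is `&Arc<ReadonlyRepo>`, so the method can clone the
    /// `Arc` while callers still write `repo.start_transaction()`.
    fn start_transaction(self: &Arc<ReadonlyRepo>) -> Transaction {
        Transaction {
            base_repo: Arc::clone(self),
        }
    }
}

/// Helper for the example: creates a repo wrapped in an `Arc`.
fn new_repo(name: &str) -> Arc<ReadonlyRepo> {
    Arc::new(ReadonlyRepo {
        name: name.to_string(),
    })
}
```

Calling `new_repo("demo").start_transaction()` then works with ordinary
method-call syntax, and the transaction shares ownership of the repo
instead of copying it.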
See test case for details.
Before:
test bench_diff_10k_lines_reversed ... bench: 36,249,659 ns/iter (+/- 174,455)
test bench_diff_10k_modified_lines ... bench: 37,258,890 ns/iter (+/- 803,963)
test bench_diff_10k_unchanged_lines ... bench: 4,252 ns/iter (+/- 69)
test bench_diff_1k_lines_reversed ... bench: 982,834 ns/iter (+/- 6,467)
test bench_diff_1k_modified_lines ... bench: 3,343,469 ns/iter (+/- 23,243)
test bench_diff_1k_unchanged_lines ... bench: 231 ns/iter (+/- 2)
test bench_diff_git_git_read_tree_c ... bench: 95,559 ns/iter (+/- 816)
After:
test bench_diff_10k_lines_reversed ... bench: 36,186,715 ns/iter (+/- 196,903)
test bench_diff_10k_modified_lines ... bench: 37,511,000 ns/iter (+/- 1,370,476)
test bench_diff_10k_unchanged_lines ... bench: 3,099 ns/iter (+/- 8)
test bench_diff_1k_lines_reversed ... bench: 986,010 ns/iter (+/- 11,565)
test bench_diff_1k_modified_lines ... bench: 3,370,938 ns/iter (+/- 17,041)
test bench_diff_1k_unchanged_lines ... bench: 230 ns/iter (+/- 2)
test bench_diff_git_git_read_tree_c ... bench: 102,189 ns/iter (+/- 1,052)
So this patch makes diffing even slower (but still easily fast enough
for all cases I've run into in real life). There's probably a lot that
can be done to make things faster, but the first priority is that the
diffs are correct and easy to read.
This is yet another step towards making it easy to propagate
`BrokenPipe` errors. The `jj diff` code (naturally) diffs two trees
and prints the diffs. If the printing fails, we shouldn't just crash
like we do today.
The new code is probably slower since it does more copying (the
callback got references to the `FileRepoPath` and `TreeValue`). I hope
that won't make a noticeable difference. At least `jj diff -r
334afbc76fbd --summary` didn't seem to get measurably slower.
The iterator version is easier to use and we get rid of the ugly type
parameter for the error type. I also simplified the code by using
`Peekable` iterators.
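The general shape of a `Peekable`-based diff over two sorted entry
lists looks something like the sketch below. The `Diff` enum, the
`(name, id)` tuples and `diff_entries()` are simplified stand-ins for
this example, not the actual tree types.

```rust
use std::cmp::Ordering;

/// Simplified diff entry: a name plus the old and/or new value id.
#[derive(Debug, PartialEq)]
enum Diff<'a> {
    Added(&'a str, u32),
    Removed(&'a str, u32),
    Modified(&'a str, u32, u32),
}

/// Lazily diffs two entry lists that are both sorted by name.
fn diff_entries<'a>(
    before: &'a [(&'a str, u32)],
    after: &'a [(&'a str, u32)],
) -> impl Iterator<Item = Diff<'a>> + 'a {
    let mut left = before.iter().copied().peekable();
    let mut right = after.iter().copied().peekable();
    std::iter::from_fn(move || loop {
        // Peek at both sides to decide which one to advance.
        let order = match (left.peek(), right.peek()) {
            (Some(l), Some(r)) => l.0.cmp(r.0),
            (Some(_), None) => Ordering::Less,
            (None, Some(_)) => Ordering::Greater,
            (None, None) => return None,
        };
        match order {
            Ordering::Less => {
                let (name, id) = left.next()?;
                return Some(Diff::Removed(name, id));
            }
            Ordering::Greater => {
                let (name, id) = right.next()?;
                return Some(Diff::Added(name, id));
            }
            Ordering::Equal => {
                let (name, old) = left.next()?;
                let (_, new) = right.next()?;
                if old != new {
                    return Some(Diff::Modified(name, old, new));
                }
                // Unchanged entry: keep looping to the next pair.
            }
        }
    })
}
```

Because the result is an ordinary iterator, a caller that prints each
entry can stop early and return the error as soon as a write fails.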
The new diff algorithm produces pretty bad diffs in some cases, such
as cc4b1e9230 in this repo (the parent of this commit). I think the
problem there is that many words are repeated over and over. Diffing
first at the line level and then refining the diff of the changed
ranges at the word level gives much better results. That's what this
patch does. After this patch, `jj diff -r cc4b1e923091` looks pretty
similar to the diff in GitHub's UI.
I hope to get around to doing the same for the merge code soon.
Impact on benchmarks:
Before:
test bench_diff_10k_lines_reversed ... bench: 42,647,532 ns/iter (+/- 765,347)
test bench_diff_10k_modified_lines ... bench: 21,407,980 ns/iter (+/- 126,366)
test bench_diff_10k_unchanged_lines ... bench: 4,235 ns/iter (+/- 16)
test bench_diff_1k_lines_reversed ... bench: 1,190,483 ns/iter (+/- 7,192)
test bench_diff_1k_modified_lines ... bench: 1,919,766 ns/iter (+/- 9,665)
test bench_diff_1k_unchanged_lines ... bench: 231 ns/iter (+/- 1)
test bench_diff_git_git_read_tree_c ... bench: 174,702 ns/iter (+/- 1,199)
After:
test bench_diff_10k_lines_reversed ... bench: 38,289,509 ns/iter (+/- 129,004)
test bench_diff_10k_modified_lines ... bench: 33,140,659 ns/iter (+/- 3,989,339)
test bench_diff_10k_unchanged_lines ... bench: 3,099 ns/iter (+/- 14)
test bench_diff_1k_lines_reversed ... bench: 973,551 ns/iter (+/- 94,895)
test bench_diff_1k_modified_lines ... bench: 3,033,818 ns/iter (+/- 29,513)
test bench_diff_1k_unchanged_lines ... bench: 230 ns/iter (+/- 1)
test bench_diff_git_git_read_tree_c ... bench: 79,100 ns/iter (+/- 963)
So most of them get slower, as expected. The last one, taken from a
real diff in the git.git repo, gets faster, however (which is also what
I would have expected).
I made a quite late change in a recent patch to make the merge code
merge based on lines instead of words. I forgot to update the tests
(and to even run them). Sorry :(
The previous patch switched over the content-merge code to use the new
histogram diff code. This patch switches over the content-diff code to
use the histogram diff code. As before, the immediate goal is to speed
it up. `jj diff -r c28ded83fc` in the git.git repo is a good example
of a diff that's extremely slow to calculate with our current
LCS-based diff. With this patch, that drops from 35 s to 0.12 s.
The diff was slightly better before. I think that's mostly because of
our different definition of a "word" in the data. We can improve that
later. The speedup we get now is easily worth the slightly worse diff.
With the histogram diff code from the previous patch, we can now start
using that for finding the "sync regions" in 3-way merge. That helps a
lot with the slow merging we had before this patch. `jj diff -r
9d540e9726` in the git.git repo drops from 22 s to 0.15 s with this
patch. (That commit is a rather arbitrary merge commit from around 5
years ago.)
With the new diff algorithm, the output of `jj diff -r 9d540e9726` in
git.git looks better if we find unchanged sync regions based on lines
than on words, so that's what I'm using in this patch. That's a change
compared to the LCS-based diff we used before this patch. I suspect
the reason that finding sync regions based on words works worse now is
not because of the change from LCS to histogram but because of the
change in how we define a word. My goal right now is mostly to make it
faster; I'll get back to refining the diff result later.
The current diff algorithm does a full LCS on the words of the texts,
which is really slow. Diffing the working copy when e.g.
`src/commands.rs` has changes far apart takes seconds. This patch adds
an implementation inspired by JGit's Histogram diff. I say "inspired"
because I just didn't quite understand it :P In particular, I didn't
understand what it does when it finds non-unique elements. I decided
to line up the leading common elements on both sides of the merge. I
don't know if that usually gives good enough results in practice.
I'm sure this can still be optimized a lot, but this seems good enough
as a start. There are also many things to improve about the quality of
the diffs.