Any commit in _map_git is already known, so there's no point walking further
down the DAG.
For a repo with over 50,000 commits, this brings down a no-op hg pull from 38
seconds to 2.5.
getnewgitcommits() does a weird traversal where a particular commit SHA is
visited as many times as the number of parents it has, effectively doubling
object reads in the standard case with one parent. This patch makes the
convert_list a cache for objects, so that a particular Git object is read just
once.
On a mostly linear repository with over 50,000 commits, this brings a no-op hg
pull down from 70 seconds to 38, which is close to half the time, as expected.
Note that even a no-op hg pull currently does a full DAG traversal -- an
upcoming patch will fix this.
For a repo with over 50,000 commits, this brings down the computation of
'export' from 1.25 seconds to 0.25 seconds.
To scale this to hundreds of thousands of commits, one solution might be to
maintain the mapping in a DAG data structure mirroring the changelog, over
which findcommonmissing can be used.
Before this patch, in the git to hg conversion, .hgsubstate once created is
never deleted, even if no submodules are any longer present. This is broken
state, as shown by the test for which the SHA changes. Fix that by looking at
the diff instead of just what submodules are present.
Since 'gitlinks' now contains *changed* gitlinks, not *all* gitlinks, it no
longer makes sense to gate gitmodules checks on that.
This patch simply demonstrates that the test was broken; an upcoming patch will
introduce more tests.
Bonus: this also makes the import process faster because we no longer need to
walk the entire tree to collect gitlinks.
This will cause the SHAs of repos that have submodules added and then removed
to change.
Currently, to figure out which gitlinks are in a repository we walk through the
entire tree. This patch lets us use get_files_changed to detect which gitlinks
have changed.
This is an adaptation of the original patch submitted in [1], without the
monkey-patching: a patch has been committed in dulwich [2] which allows clients
to supply a custom urllib2 "opener" for opening the url; here, we provide such
an opener, which provides authentication information obtained from the hg
config.
[1] https://groups.google.com/forum/#!topic/hg-git/9clPr1wdtiw
[2] https://bugs.launchpad.net/dulwich/+bug/909037
Consider two octopus merges, one of which is a child of the other. Without this
patch, get_git_parents() called on the second octopus merge checks that each p1
is neither in the middle of an octopus merge nor the end of it. Since the end
of the first octopus merge is a p1 of the second one, this asserts.
Change the sanity check to only make sure that p1 is not in the middle of an
octopus merge.
This was crafted mostly via a bunch of aimless flailing in the
code. I'm pretty well convinced at this point that the incoming
support needs to be rewritten slightly to behave properly in the new
world order (specifically, the overlayrepo class probably should be
subclassing localrepo, or else more directly reimplementing things
instead of trying to forward methods.)
I've been waiting for dulwich upstream to fix this *and* for a test
from domruf that's acceptable. Having gotten neither over a period of
/months/, and having hit the bug myself, I'm moving on and accepting a
patch without tests. This will likely break again, but hopefully
before we'd break it dulwich will be fixed.
Previously, we emitted every Git tree when updating between Mercurial
changesets. With this patch, we now only emit Git trees that changed. A
side-effect of the implementation is that we now only update in-memory
Git trees objects that changed. Before, we always touched Git trees,
invalidating them in the process and causing Dulwich to recalculate
their SHA-1. Profiling revealed this to be expensive and removing the
extra calculation shows a nice performance win.
Another optimization is to not sort the order that changed paths are
processed in. Previously, we sorted by length, longest to shortest.
Profiling revealed that the sorts took a non-trivial amount of time.
While sorted execution resulted in likely idempotent behavior, it
shouldn't be strictly required.
On the author's machine, conversion of the Mercurial repository itself
decreased from ~493s to ~333s. Even more impressive is conversion of
Firefox's main repository (which is considerably larger). Converting the
first 200 revisions of that repository decreased from ~152s to ~42s.
This replaces the brute force Mercurial to Git export with one that is
incremental. It results in a decent performance win and paves the road
for parallel export via using multiple incremental exporters.
If dulwich is presented with a "sub minute" timezone offset, it throws
an exception (see tests/test-timezone.t). This patch rounds the timezone
down to the next minute before passing the value to dulwich.
As pointed out by l33t, Hg-Git's output for push doesn't currently do a very
good job of telling the user what happened. My previous changes in this area
had moved some of the output from status to note, making it only show if
--verbose was specified. However, I hadn't realized at the time that the
reference information (though overly verbose) was providing a valueable purpose
that otherwise wasn't met; telling the user that a remote reference had changed.
This changeset makes it so that:
* default output will include simple messages like "adding reference
refs/heads/feature" and "updating reference refs/heads/master" (omitting any
mention of unchanged references)
* verbose output will include more detailed messages like "adding reference
default::refs/heads/feature => GIT:aba43c" and "updating reference
default::refs/heads/master => GIT:aba43c" (omitting any mention of unchanged
references)
* debug output will include the detailed output like in verbose, but
addtionally will include messages like "unchanged reference
default::refs/heads/other => GIT:aba43c"
https://bitbucket.org/durin42/hg-git/issue/64/push-confirmation
l33t pointed out that currently, Hg-Git doesn't provide any confirmation that a
push was successful other than the exit code. Normal Mercurial provides a
couple other messages followed by "added X changesets with Y changes to
Z files". After this change, Hg-Git will provide much more similar output.
It's not identical, as the underlying model is substantially different, but the
concept is the same. The main message is "added X commits with Y trees and
Z blobs".
This change doesn't affect the output of what references/branches were touched.
That will be addressed in a subsequent commit.
Dulwich doesn't provide an easy hook to get the information needed for this
output. Instead of passing generate_pack_contents as the pack generator
function to send_pack, I pass a custom function that determines the "missing"
objects, stores the counts, and then calls generate_pack_contents (which then
will determine the "missing" objects again.
The new expected output:
searching for changes # unless quiet true
<N> commits found # if verbose true
list of commits: # if debugflag true and at least one commit found
<each hash> # if debugflag true and at least one commit found
adding objects # if at least one commit found unless quiet true
added <N> commits with <N> trees and <N> blobs # if at least one object unless
# quiet true
https://bitbucket.org/durin42/hg-git/issue/64/push-confirmation