bookmarks.write is deprecated and was emitting warning messages in
test-hg-branch.t with the latest test runner from core Mercurial. Tested with
both hg 2.8 and hg tip.
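The replacement isn't spelled out above; a minimal sketch of the likely shape,
assuming the bmstore API of that era:

    # before: deprecated module-level helper, triggers the warning
    bookmarks.write(repo)

    # after: write through the bookmark store attached to the repo
    repo._bookmarks.write()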
Dulwich treats ref names internally as unicode strings (probably
because of Python 3?), which means that at some points it calls
os.path.join on the repo path and the unicode ref name. That fails
miserably if we construct the repo with a str rather than a unicode.
Kludge around this problem.
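To illustrate the failure mode (a Python 2 sketch; the path is made up):

    import os

    # mixing str and unicode makes os.path.join implicitly decode the
    # byte string as ASCII, which blows up on any non-ASCII byte
    os.path.join('/tmp/caf\xc3\xa9/repo', u'refs/heads/master')
    # UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...

    # the kludge, roughly: hand dulwich a unicode path up front
    # (assumption: a UTF-8 filesystem encoding)
    repo_path = repo_path.decode('utf-8')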
Fixes issue 172.
We avoid using dulwich's refs method because it is incredibly slow. On a repo
with a few hundred branches and a few thousand tags, dulwich took about 200ms
to load everything.
This patch traverses only the remote ref directory and cuts that time down to
about 50ms.
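A minimal sketch of the approach (function name hypothetical; packed refs and
symrefs are ignored for brevity):

    import os

    def remote_refs(gitdir):
        # walk only <gitdir>/refs/remotes on disk instead of asking
        # dulwich to enumerate and resolve every ref in the repo
        base = os.path.join(gitdir, 'refs', 'remotes')
        for dirpath, _subdirs, files in os.walk(base):
            for name in files:
                path = os.path.join(dirpath, name)
                ref = os.path.relpath(path, gitdir).replace(os.sep, '/')
                with open(path) as fp:
                    yield ref, fp.read().strip()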
It is unclear to me why we keep a file of remote refs (which can become out of
sync) instead of just using dulwich. This caught a missing remote ref in the
test suite.
By testing the uri early, we can reuse logic later in the method to parse the
git uri. We rely on the isgitsshuri heuristic to return True or False, and if
True, prepend 'git+ssh://' to the uri.
Arguably, this is fragile, and I am open to better ideas, but I can't think of
anything else currently.
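In miniature (isgitsshuri is hg-git's existing heuristic for scp-style URIs
such as git@host:path):

    # test early so the generic URI parsing below can handle the result
    if isgitsshuri(uri):
        uri = 'git+ssh://' + uri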
Previously, there was an edge case for Git repositories that started as
Mercurial repositories and had used subrepos: a deleted .hgsubstate would be
ignored and therefore reintroduced.
This patch fixes that behavior by checking for the deleted .hgsubstate file
first.
A test has been added to verify behavior.
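A hedged sketch of the check (variable names hypothetical):

    # ctx.files() lists files changed by the commit; absence from the
    # manifest means the commit deleted the file
    if '.hgsubstate' in ctx.files() and '.hgsubstate' not in ctx:
        # honor the deletion instead of regenerating .hgsubstate
        # from the subrepo state
        removed.add('.hgsubstate')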
If the importer encounters an error halfway through a large import, all the
commits are saved, but the mapfile is not written, so the process starts over
from the beginning when run again.
This adds an optional config value that saves the map file every X commits. I
thought about just hard-coding this to 100 or something, but doing it this way
seems a little less invasive.
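A sketch of the idea (hggit.mapsavefrequency is the spelling hg-git settled
on; the helper names here are hypothetical):

    mapsavefreq = ui.configint('hggit', 'mapsavefrequency', default=0)
    for i, commit in enumerate(commits):
        import_git_commit(commit)
        # flush the map periodically so an interrupted import can
        # resume from the last checkpoint instead of from scratch
        if mapsavefreq and (i + 1) % mapsavefreq == 0:
            save_map()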
The default dulwich graph walker only walks from refs/heads. During the
discovery phase of fetching this causes it to redownload commits that are only
referenced by refs/remotes. In a normal hg-git case, this seems to mean it
redownloads the entire git repo on every hg pull.
Added --debug to a test to check the object count (it decreased from 21 to 10
with this patch).
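A hedged sketch of the fix, using dulwich's public Repo API:

    # seed discovery with every ref we already have, not just
    # refs/heads, so the server is told about refs/remotes history too
    all_refs = git.get_refs()
    heads = [sha for ref, sha in all_refs.items()
             if ref.startswith('refs/')]
    graphwalker = git.get_graph_walker(heads=heads)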
filectx.renamed() returns a 2-tuple or None. memfilectx.__init__ expects
the copied argument to be either None or a string. Before, we were
passing a 2-tuple, leading to the memfilectx storing the wrong type.
This eventually resulted in doing a key lookup against a manifest
with a 2-tuple, which made manifest.c throw an error.
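The shape of the fix (memfilectx's exact signature varies across Mercurial
versions, so take this as a sketch):

    renamed = fctx.renamed()          # None or (source path, file node)
    copied = renamed and renamed[0]   # keep only the path string
    return context.memfilectx(path, data, islink, isexec, copied)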
There really is no point to this -- the sorting is expensive to compute and
the structure is never actually used.
For a mapfile with 1.5 million entries, this speeds up save_map from 3.6
seconds to 0.87.
This is probably the limit of the speedups we can get with pure-Python code.
Any further speedups will have to be made by rewriting these bits in C.
Sorting a list of tuples is much more expensive than sorting a list of strings.
For a mapfile with 1.5 million entries, this speeds up save_map from 6 seconds
to 3.5.
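The change in miniature (names hypothetical):

    # format each entry first, then sort flat strings; comparing
    # strings is far cheaper than comparing (gitsha, hgsha) tuples
    lines = sorted('%s %s\n' % (gitsha, hgsha)
                   for gitsha, hgsha in map_git.items())
    fp.writelines(lines)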
While this has been done since the beginning of time, there's no apparent
justification for it. If an imported commit works out to the same hash as an
existing one, it simply won't be added to the revlog.
The tests all continue to pass. There's already test coverage for reimporting
commits in test-pull-after-strip.t. Also, gimport has worked this way all this
while.
Consider a Mercurial commit with hash 'h1'. Originally, if the only Mercurial
field stored is the branch info (which is stored in the commit message rather
than as an extra field), we'd store the rename source explicitly as a Git extra
field -- let's call the original exported hash 'g1'.
In Git, some operations throw all extra fields away. (One such example is a
rebase.) If such an operation happens, we'll be left with a frankencommit with
the branch info but without the rename source. Let's call this hash 'g2'. For a
setup where Git is the source of truth, let's say that this 'g2' frankencommit
is what gets pushed to the server.
When 'g2' is subsequently imported into Mercurial, we'd look at the fact that
it contains a Mercurial field in the commit message and believe that it was a
legacy commit from the olden days when all info was stored in the commit
message. In that case, in an attempt to preserve the hash, we wouldn't store
any extra rename source info, resulting in 'h1'. Then, when the commit is
re-exported to Git, we'd add the rename source again and produce 'g1' -- and
thus break bidirectionality.
Prevent this situation by not storing the rename source if we're adding branch
info to the commit message. Then for 'h1' we export as 'g2' directly and never
produce 'g1'.
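Roughly, the guard looks like this (flag and field names hypothetical):

    if branch_info_in_message:
        # a Git rewrite would strip the extra anyway; omitting it keeps
        # h1 exporting to g2 every time instead of bouncing via g1
        git_extras.pop('rename-source', None)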
What happens if we not only need to store branch info but also other extra
info, like renames? For 'h1' we'd produce 'g1', then it'd be rewritten on the
Git side to 'g2' throwing away all that extra information. 'g2' being
subsequently imported into Mercurial would produce a new hash, say 'h2'. That's
fine because the commit did get rewritten in Git. We unfortunately wouldn't
perform rename detection thinking that the commit is from Mercurial and had no
renames recorded there, but when the commit is re-exported to Git we'd export
it to 'g2' again. This at least preserves bidirectionality.
See comment inline for explanation. Also add tests for this (the bug was masked
with rename detection disabled -- it only appeared with rename detection
enabled).
See inline comments for why the additional metadata needs to be stored.
This literally breaks all the hashes because of the additional metadata. The
changing of hashes is unfortunate but necessary to preserve bidirectionality.
While this could be broken up into multiple commits, there was no way to do
that while preserving bidirectionality. Following the principle that every
intermediate commit must result in a correct state, I decided to combine the
commits.
We use Dulwich's rename detector to detect any renames over the specified
similarity threshold.
This isn't fully bidirectional yet -- when the commit is exported to Git
the hashes will no longer be the same. That's why that isn't tested here. In
upcoming patches we'll make sure it's bidirectional and will add the
corresponding tests.
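For context, the dulwich API in play looks like this (the threshold is a
percentage; names other than dulwich's are hypothetical):

    from dulwich.diff_tree import RenameDetector, tree_changes

    detector = RenameDetector(object_store, rename_threshold=75)
    for change in tree_changes(object_store, old_tree_id, new_tree_id,
                               rename_detector=detector):
        if change.type == 'rename':
            print('%s -> %s' % (change.old.path, change.new.path))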
hg-git translates octopus merges into a series of merges, called an octopus
explosion. The intermediate octopus explosion commits are not recorded in
the mapfile -- only the final commit is. This means that they show up in the
export list and then have to be detected and filtered out.
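A hedged sketch of the filtering (the 'hg-git' extra marker is an assumption
about how the explosion commits are tagged):

    def is_octopus_part(ctx):
        # intermediate explosion commits carry a marker in their extras
        return ctx.extra().get('hg-git', '') == 'octopus'

    to_export = [ctx for ctx in candidates if not is_octopus_part(ctx)]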
Don't initialize the incremental exporter with octopus explosion commits.
Previously, if there were octopus merges early in the history, we'd initialize
the incremental exporter with the first one's parent, then calculate the diff
of that against the first commit we actually cared about exporting. That's slow
and wasteful.
For a particular real-world repo with one octopus merge 1/3 of the way in
history, this is a 10x improvement for 'hg gexport' with one commit to export
-- 60 seconds to 6.
This prepares for an upcoming patch.
In theory, we could pass the context into export_hg_commit, but there's some
encoding shenanigans going on there that I don't want to delve into.