hg pull calls listkeys for bookmarks. This would previously cause a pack with
all refs to be fetched. For Mercurial mirrors of Git repositories where only
some refs were mirrored, this would cause problems in a bunch of ways:
- A larger pack would be fetched than necessary.
- The final refs written out to the Git repo would only be the set of refs we
were actually interested in. If a GC was subsequently run, unreferenced
objects would be deleted. Those objects might be referred to on subsequent
fetches, which could cause hg-git to crash.
We replace all that logic with a simple null fetch. The tests introduced in the
previous patch ensure no regressions.
hg perfrevset 'max(fromgit())' on a repo with around 60,000 commits:
before: ! wall 1.055093 comb 1.050000 user 1.050000 sys 0.000000 (best of 10)
after: ! wall 0.148586 comb 0.140000 user 0.140000 sys 0.000000 (best of 62)
In reality, perfrevset doesn't clear the Git-to-Mercurial map, which means that
a call like `hg log -r 'max(fromgit())'` speeds up from around 1.5 seconds to
0.6.
For a repository with around 60,000 commits, perfrevset for gitnode becomes:
before: ! wall 1.130716 comb 1.130000 user 1.130000 sys 0.000000 (best of 9)
after: ! wall 0.178828 comb 0.180000 user 0.180000 sys 0.000000 (best of 54)
In reality, perfrevset doesn't clear the Git-to-Mercurial map, which means that
a call like `hg log -r 'gitnode(...)'` speeds up from around 1.5 seconds to
0.6.
Previously, whenever a tree that wasn't the root ('') was stored, we'd prepend
a '/' to it. Then, when we'd try retrieving the entry, we'd do so without the
leading '/'. This caused data loss because existing tree entries were dropped
on the floor. Fix that by only adding '/' if we're adding to a non-empty
initial path.
This wasn't detected in tests because most of them deal only with files in the
root and not ones in subdirectories.
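A minimal sketch of the corrected path handling. The helper name
join_tree_path is hypothetical and is not the function hg-git uses; it only
illustrates adding the separator when the initial path is non-empty.

```python
def join_tree_path(initial_path, name):
    """Join a tree entry name onto a parent path.

    The root tree is represented by the empty string, so a '/' is only
    added when the parent path is non-empty. (Hypothetical helper.)
    """
    if initial_path:
        return initial_path + '/' + name
    return name
```

With this shape, an entry stored under the root is keyed as 'src' rather than
'/src', so a later lookup without the leading '/' finds it.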
Previously, we'd spin up the Mercurial incremental exporter from the null
commit and build up state from there. This meant that for the first exported
commit, we'd have to read all the files in that commit and compute Git blobs
and trees based on that.
The current Mercurial to Git conversion scheme makes most sense with
Mercurial's current default storage format, where manifests are diffed against
the numerically previous revision. At some point in the future, the default
will switch to generaldelta, where manifests would be diffed against one of
their parents. In that world it might make more sense to have a stateless
exporter that diffed each commit against its generaldelta parent and calculated
dirty trees based on that instead. However, more experiments need to be done to
see what export scheme is best.
For a repo with around 50,000 files, this brings down an incremental 'hg
gexport' of one commit from 18 seconds with a hot file cache (and tens of
minutes with a cold one) to around 2 seconds with a hot file cache.
The usage of getattr was unsafe. Use hgutil.safehasattr instead.
util.safehasattr has been around since Mercurial 2.0.
This also fixes the formerly disabled test in test-pull.t.
Previously we'd attempt to import every single reachable commit in the Git
object store.
The test adds another branch to the Git repo and doesn't import it until much
later. Previously we'd import it when we ran `hg -R hgrepo pull -r beta`. Now
we won't.
The return value as implemented in git_handler.fetch was pretty bogus. It used
to return the number of values that changed in the 'refs/heads/' namespace,
regardless of whether multiple values in there point to the same Mercurial
commit, or whether particular heads were even imported. Fix all of that by
using the actual heads in the changelog, just like vanilla Mercurial.
The test output changes demonstrate examples where the code was buggy.
Since Mercurial is commit-oriented, the 'no changes found' message really
should rely on what new commits are in the repo, not on new heads. This also
makes an upcoming patch much simpler.
Since everything around this code is completely broken anyway, writing a test
for this that doesn't trigger other bugs is close to impossible. An upcoming
patch will include tests.
The test output change is for an empty clone -- the output is precisely how
vanilla Mercurial treats an empty clone.
The theme of this and upcoming patches is that relying on self.git.object_store
to figure out which commits/tags/bookmarks to import is not great. This breaks
if the git repo is manually put in place (as might be done in a server-based
replication scenario), or if a partial fetch pulled too many commits in for
whatever reason. Indeed we were just about always pulling an entire pack in,
because listkeys for bookmarks currently calls fetch_pack without any
filtering. (This is probably a bug and should be fixed, but this series doesn't
do that.)
Instead, rely on whether we actually imported the commit into Mercurial to
determine whether to import the tag. This is clean, straightforward, and
clearly correct.
There is a whole series of bugs in this code that any test case for this would
hit -- an upcoming patch will include a test for all these bugs at once.
object_store.add_object doesn't check to see if the object is already in a
pack, so it is still written out in that case. Do the check ourselves before
calling add_object.
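A rough sketch of the check, using a stand-in store rather than the real
dulwich object store. FakeObjectStore, its add_object signature and
add_if_missing are all hypothetical names made up for illustration.

```python
class FakeObjectStore(object):
    """Stand-in object store; not the dulwich API."""

    def __init__(self, packed_shas):
        self.packed = set(packed_shas)
        self.loose = {}

    def __contains__(self, sha):
        return sha in self.packed or sha in self.loose

    def add_object(self, sha, data):
        # Mirrors the behaviour described above: writes unconditionally.
        self.loose[sha] = data


def add_if_missing(store, sha, data):
    """Write the object only when the store does not already hold it."""
    if sha in store:
        return False
    store.add_object(sha, data)
    return True
```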
Since the Git to Mercurial conversion process is incremental, it's at risk of
missing files, or recording files the wrong way, or recording the wrong commit
metadata. Add a command called 'gverify' that can verify the contents of a
particular Mercurial rev against the corresponding Git commit.
Currently, this is limited to checking file names, flags and contents, but this
can be made as robust as desired. Further additions will probably require
refactoring git_handler.py a bit though.
This function is pretty fast: on a Linux machine with a warm cache, verifying a
repository with around 50,000 files takes just 20 seconds. There is scope for
further improvement through parallelization, but conducting tree walks in
parallel is non-trivial with the current worker infrastructure in Mercurial.
This allows other functions to be able to use the `git` property without
needing to care about initializing it.
An upcoming patch will remove the `init_if_missing` function.
Previously, we'd try to access commit.parents[0] and fail. Now, check for
commit.parents being empty and return what Mercurial thinks is a repository
root in that case.
Previously we'd just test if gitrev was falsy, which it is if the rev returned
is 0, even though it shouldn't be. With this patch, test against None
explicitly.
This unmasks another bug: see next patch for a fix and a test.
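The difference between the two checks can be sketched as follows. The function
describe_rev is a hypothetical example and is not part of hg-git.

```python
def describe_rev(gitrev):
    """Treat revision 0 as valid by comparing against None explicitly."""
    # `if not gitrev` would wrongly reject revision 0, which is falsy.
    if gitrev is None:
        return 'unknown'
    return 'rev %d' % gitrev
```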
Previously we'd recompute the repo tags each time we'd consider importing a Git
tag. This is O(n^2) in the number of tags and produced noticeable slowdowns in
repos with large numbers of tags.
To fix this, compute the tags just once. This is correct because the only case
where we'd have issues is if multiple new Git tags with the same name were
introduced, which can't happen because Git tags cannot share names.
For a repository with over 200 tags, this causes a no-op hg pull to be sped up
by around 0.5 seconds.
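A minimal sketch of hoisting the tag computation out of the loop. The names
import_new_tags and compute_repo_tags are hypothetical; compute_repo_tags
stands in for the expensive repo tag lookup.

```python
def import_new_tags(git_tags, compute_repo_tags):
    """Return the Git tags that are not already repo tags.

    compute_repo_tags is called exactly once, outside the loop, instead
    of once per candidate tag.
    """
    existing = set(compute_repo_tags())
    return [name for name in git_tags if name not in existing]
```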
A new property called _tagscache was introduced in Mercurial 2.0, so the cache
wasn't actually working.
The contract for tags() also changed at some point -- it stopped returning
nodes that weren't in the repo. This will need to be accounted for if we
start using the tags cache again. However, it isn't very clear whether the
Mercurial tags cache is actually worth doing, since we already have a
separate in-memory cache for Git tags in the handler.
Previously, the correctness of _handle_subrepos was based on the order the
files were processed in. For example, consider the case where a subrepo at
location 'loc' is replaced with a file at 'loc', while another subrepo exists.
This would cause .hgsubstate and .hgsub to be modified and the file added.
If .hgsubstate was seen _before_ 'loc' in the modified/added loop, then
_handle_subrepos would run and remove 'loc' correctly, before 'loc' was added
back later. If, however, .hgsubstate was seen _after_ 'loc', then
_handle_subrepos would run after 'loc' was added and would remove 'loc'.
With this patch, _handle_subrepos merely computes the changes that need to be
applied. The changes are then applied, making sure removed files and subrepos
are processed before added ones.
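The compute-then-apply idea can be sketched like this, assuming the commit
state is modelled as a plain dict. apply_changes is a hypothetical helper, not
the hg-git implementation.

```python
def apply_changes(files, removed, added):
    """Apply removals before additions.

    files:   dict mapping path -> content (stand-in for commit state)
    removed: iterable of paths to drop
    added:   dict mapping path -> new content

    The result does not depend on the order in which the changed files
    were originally seen.
    """
    result = dict(files)
    for path in removed:
        result.pop(path, None)
    result.update(added)
    return result
```

In the 'loc' example, the subrepo at 'loc' is removed first and the file at
'loc' is added afterwards, so the file always survives.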
This was detected by setting a random PYTHONHASHSEED (in this case, 3910358828)
and running the test suite against it. An upcoming patch will randomize the
PYTHONHASHSEED in run-tests.py, as is done in Mercurial.
Since a fresh GitHandler is no longer created for every commit, this speeds up
the {gitnode} template massively.
For a repo with over 50,000 commits, the command
hg log -l 10 --template '{gitnode}\n'
speeds up from 2.4 seconds to 0.3.
Previously we'd load the git and hg maps twice on separate git handler objects.
This avoids that.
For a repo with over 50,000 commits, this brings a no-op hg pull down from 2.45
seconds to 2.37.
Currently we call hgrepo.tags() separately for each tag. (This should be fixed
at some point.) This avoids initializing a separate git handler for each tag.
For a repository with over 150 tags, this brings down a no-op hg pull by 0.05
seconds.
Any commit in _map_git is already known, so there's no point walking further
down the DAG.
For a repo with over 50,000 commits, this brings down a no-op hg pull from 38
seconds to 2.5.
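A simplified sketch of a traversal that stops at known commits. The function
new_commits and its arguments are hypothetical; the known set plays the role
that _map_git plays above.

```python
def new_commits(heads, parents, known):
    """Return commits reachable from heads that are not already known.

    parents: dict mapping commit id -> list of parent ids
    known:   set of commit ids that were already converted
    """
    seen = set()
    todo = list(heads)
    result = []
    while todo:
        sha = todo.pop()
        if sha in known or sha in seen:
            # Do not walk past a commit that is already known.
            continue
        seen.add(sha)
        result.append(sha)
        todo.extend(parents.get(sha, []))
    return result
```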
getnewgitcommits() does a weird traversal where a particular commit SHA is
visited as many times as the number of parents it has, effectively doubling
object reads in the standard case with one parent. This patch makes the
convert_list a cache for objects, so that a particular Git object is read just
once.
On a mostly linear repository with over 50,000 commits, this brings a no-op hg
pull down from 70 seconds to 38, which is close to half the time, as expected.
Note that even a no-op hg pull currently does a full DAG traversal -- an
upcoming patch will fix this.
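The caching idea can be sketched as below. CachingReader and read_object are
hypothetical names; the real code uses convert_list for this purpose.

```python
class CachingReader(object):
    """Read each object at most once, however often it is requested."""

    def __init__(self, read_object):
        self._read = read_object  # hypothetical expensive reader
        self._cache = {}

    def get(self, sha):
        if sha not in self._cache:
            self._cache[sha] = self._read(sha)
        return self._cache[sha]
```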
For a repo with over 50,000 commits, this brings down the computation of
'export' from 1.25 seconds to 0.25 seconds.
To scale this to hundreds of thousands of commits, one solution might be to
maintain the mapping in a DAG data structure mirroring the changelog, over
which findcommonmissing can be used.
Before this patch, in the git to hg conversion, .hgsubstate once created is
never deleted, even if no submodules are any longer present. This is broken
state, as shown by the test for which the SHA changes. Fix that by looking at
the diff instead of just what submodules are present.
Since 'gitlinks' now contains *changed* gitlinks, not *all* gitlinks, it no
longer makes sense to gate gitmodules checks on that.
This patch simply demonstrates that the test was broken; an upcoming patch will
introduce more tests.
Bonus: this also makes the import process faster because we no longer need to
walk the entire tree to collect gitlinks.
This will cause the SHAs of repos that have submodules added and then removed
to change.
Currently, to figure out which gitlinks are in a repository we walk through the
entire tree. This patch lets us use get_files_changed to detect which gitlinks
have changed.
This is an adaptation of the original patch submitted in [1], without the
monkey-patching: a patch has been committed in dulwich [2] which allows clients
to supply a custom urllib2 "opener" for opening the url; here, we provide such
an opener, which provides authentication information obtained from the hg
config.
[1] https://groups.google.com/forum/#!topic/hg-git/9clPr1wdtiw
[2] https://bugs.launchpad.net/dulwich/+bug/909037
Consider two octopus merges, one of which is a child of the other. Without this
patch, get_git_parents() called on the second octopus merge checks that each p1
is neither in the middle of an octopus merge nor the end of it. Since the end
of the first octopus merge is a p1 of the second one, this asserts.
Change the sanity check to only make sure that p1 is not in the middle of an
octopus merge.
This was crafted mostly via a bunch of aimless flailing in the
code. I'm pretty well convinced at this point that the incoming
support needs to be rewritten slightly to behave properly in the new
world order (specifically, the overlayrepo class probably should be
subclassing localrepo, or else more directly reimplementing things
instead of trying to forward methods.)
I've been waiting for dulwich upstream to fix this *and* for a test
from domruf that's acceptable. Having gotten neither over a period of
/months/, and having hit the bug myself, I'm moving on and accepting a
patch without tests. This will likely break again, but hopefully
before we'd break it dulwich will be fixed.
Previously, we emitted every Git tree when updating between Mercurial
changesets. With this patch, we now only emit Git trees that changed. A
side-effect of the implementation is that we now only update in-memory
Git tree objects that changed. Before, we always touched Git trees,
invalidating them in the process and causing Dulwich to recalculate
their SHA-1. Profiling revealed this to be expensive and removing the
extra calculation shows a nice performance win.
Another optimization is to not sort the order that changed paths are
processed in. Previously, we sorted by length, longest to shortest.
Profiling revealed that the sorts took a non-trivial amount of time.
While sorted execution resulted in likely idempotent behavior, it
shouldn't be strictly required.
On the author's machine, conversion of the Mercurial repository itself
decreased from ~493s to ~333s. Even more impressive is conversion of
Firefox's main repository (which is considerably larger). Converting the
first 200 revisions of that repository decreased from ~152s to ~42s.
This replaces the brute force Mercurial to Git export with one that is
incremental. It results in a decent performance win and paves the road
for parallel export using multiple incremental exporters.
If dulwich is presented with a "sub minute" timezone offset, it throws
an exception (see tests/test-timezone.t). This patch rounds the timezone
down to the next minute before passing the value to dulwich.
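One way to drop the sub-minute part of an offset is sketched below, assuming
the offset is an integer number of seconds. round_timezone is a hypothetical
helper name.

```python
def round_timezone(offset_seconds):
    """Drop any sub-minute component from a timezone offset.

    Python's % returns a non-negative remainder for a positive modulus,
    so this always rounds toward negative infinity.
    """
    return offset_seconds - (offset_seconds % 60)
```

Offsets that are already a whole number of minutes pass through unchanged.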
As pointed out by l33t, Hg-Git's output for push doesn't currently do a very
good job of telling the user what happened. My previous changes in this area
had moved some of the output from status to note, making it only show if
--verbose was specified. However, I hadn't realized at the time that the
reference information (though overly verbose) was serving a valuable purpose
that otherwise wasn't met: telling the user that a remote reference had changed.
This changeset makes it so that:
* default output will include simple messages like "adding reference
refs/heads/feature" and "updating reference refs/heads/master" (omitting any
mention of unchanged references)
* verbose output will include more detailed messages like "adding reference
default::refs/heads/feature => GIT:aba43c" and "updating reference
default::refs/heads/master => GIT:aba43c" (omitting any mention of unchanged
references)
* debug output will include the detailed output like in verbose, but
additionally will include messages like "unchanged reference
default::refs/heads/other => GIT:aba43c"
https://bitbucket.org/durin42/hg-git/issue/64/push-confirmation
l33t pointed out that currently, Hg-Git doesn't provide any confirmation that a
push was successful other than the exit code. Normal Mercurial provides a
couple other messages followed by "added X changesets with Y changes to
Z files". After this change, Hg-Git will provide much more similar output.
It's not identical, as the underlying model is substantially different, but the
concept is the same. The main message is "added X commits with Y trees and
Z blobs".
This change doesn't affect the output of what references/branches were touched.
That will be addressed in a subsequent commit.
Dulwich doesn't provide an easy hook to get the information needed for this
output. Instead of passing generate_pack_contents as the pack generator
function to send_pack, I pass a custom function that determines the "missing"
objects, stores the counts, and then calls generate_pack_contents (which then
will determine the "missing" objects again).
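The wrapping approach might look roughly like this. Every name here is
hypothetical: find_missing stands in for whatever determines the "missing"
objects, and the signatures are not the dulwich API.

```python
def make_counting_generator(find_missing, generate_pack_contents, counts):
    """Wrap a pack generator so object counts are recorded first.

    find_missing(have, want) is assumed to return (type_name, sha)
    pairs; counts is a dict that is updated in place.
    """
    def generator(have, want):
        for type_name, _sha in find_missing(have, want):
            counts[type_name] = counts.get(type_name, 0) + 1
        # The wrapped generator works out the missing objects again.
        return generate_pack_contents(have, want)
    return generator
```

The recorded counts can then feed a message such as "added X commits with Y
trees and Z blobs".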
The new expected output:
searching for changes # unless quiet true
<N> commits found # if verbose true
list of commits: # if debugflag true and at least one commit found
<each hash> # if debugflag true and at least one commit found
adding objects # if at least one commit found unless quiet true
added <N> commits with <N> trees and <N> blobs # if at least one object unless
# quiet true
https://bitbucket.org/durin42/hg-git/issue/64/push-confirmation
This isn't a real implementation of phases support. Rather, it's just enough
to avoid the traceback.
Traceback (most recent call last):
File "/usr/local/share/python/hg", line 38, in <module>
mercurial.dispatch.run()
File "/usr/local/lib/python2.7/site-packages/mercurial/dispatch.py", line 28, in run
sys.exit((dispatch(request(sys.argv[1:])) or 0) & 255)
File "/usr/local/lib/python2.7/site-packages/mercurial/dispatch.py", line 65, in dispatch
return _runcatch(req)
File "/usr/local/lib/python2.7/site-packages/mercurial/dispatch.py", line 88, in _runcatch
return _dispatch(req)
File "/usr/local/lib/python2.7/site-packages/mercurial/dispatch.py", line 741, in _dispatch
cmdpats, cmdoptions)
File "/usr/local/lib/python2.7/site-packages/mercurial/dispatch.py", line 514, in runcommand
ret = _runcommand(ui, options, cmd, d)
File "/usr/local/lib/python2.7/site-packages/mercurial/dispatch.py", line 831, in _runcommand
return checkargs()
File "/usr/local/lib/python2.7/site-packages/mercurial/dispatch.py", line 802, in checkargs
return cmdfunc()
File "/usr/local/lib/python2.7/site-packages/mercurial/dispatch.py", line 738, in <lambda>
d = lambda: util.checksignature(func)(ui, *args, **cmdoptions)
File "/usr/local/lib/python2.7/site-packages/mercurial/util.py", line 472, in check
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/mercurial/commands.py", line 3942, in incoming
return hg.incoming(ui, repo, source, opts)
File "/usr/local/lib/python2.7/site-packages/mercurial/hg.py", line 525, in incoming
return _incoming(display, subreporecurse, ui, repo, source, opts)
File "/usr/local/lib/python2.7/site-packages/mercurial/hg.py", line 494, in _incoming
displaychlist(other, chlist, displayer)
File "/usr/local/lib/python2.7/site-packages/mercurial/hg.py", line 524, in display
displayer.show(other[n])
File "/usr/local/lib/python2.7/site-packages/mercurial/cmdutil.py", line 670, in show
self._show(ctx, copies, matchfn, props)
File "/usr/local/lib/python2.7/site-packages/mercurial/cmdutil.py", line 691, in _show
label='log.changeset changeset.%s' % ctx.phasestr())
File "/usr/local/lib/python2.7/site-packages/mercurial/context.py", line 203, in phasestr
return phases.phasenames[self.phase()]
File "/usr/local/lib/python2.7/site-packages/mercurial/context.py", line 201, in phase
return self._repo._phasecache.phase(self._repo, self._rev)
AttributeError: 'overlaychangectx' object has no attribute '_repo'
This should fix a bug introduced by 4f4ab2d which caused all tags to be
duplicated as bookmarks on pull.
Test coverage has been added for pull to allow verifying the fix.
When communicating with the user on push/outgoing, Mercurial doesn't show an
"exporting hg objects to git" message, so we shouldn't either. The message is
now shown only if --verbose is specified.
When communicating with the user on push, Mercurial doesn't show much on
success. Currently, Hg-Git shows every changed ref. After this change,
the default output will more closely match Mercurial's regular behavior (no
per-ref output), while changed refs will be shown if --verbose is specified,
and all refs will be shown if --debug is specified.
This changeset adds test coverage for comparing "hg outgoing -B" in normal
Mercurial usage with Hg-Git usage. This didn't match, since previously, gitrepo
didn't provide a meaningful listkeys implementation. Now, it does.
gitrepo now has access to a GitHandler when a localrepo is available. This
handler is used to access the information needed to implement listkeys for
namespaces (currently, only bookmarks) and bookmarks.
A couple of other tests were testing "divergent bookmark" scenarios. These
tests have been updated to filter out the divergent bookmark output, as it isn't
consistent across the supported Mercurial versions.
This change wraps hg.peer to allow for capturing the repo object. It is then
passed in to new gitrepo instances. This will be needed to implement later
functionality, such as richer bookmark support using pushkeys.
In the logic that was attempting to handle the case where the local repo doesn't
have any bookmarks, the assumption was being made that tip resolved to a
non-null revision. In the case of a totally empty local repo, however, that
isn't a valid assumption, and resulted in attempting to set the master ref
to None, which broke dulwich.
The "fix", which avoids the traceback and allows the push to complete (though
still do nothing, since in this case there aren't any changes to push), is to
not tweak the refs at all if tip is nullid. Leaving the special capabilities
ref and not adding a master ref appears to be fine in this case.
The output for "hg push" when there were no changes didn't quite match between
Mercurial with and without Hg-Git, so I changed the behavior to bring it into
synch. The existing "creating and sending data" message was changed to be
included if --verbose is specified.
Mercurial has support for including information about the tested versions of
Mercurial for an extension when it detects that an extension has broken. This
change includes the appropriate attribute in the extension.
Mercurial has support for including a link to an issue tracker when it detects
that an extension has broken. This change includes the appropriate attribute
in the extension, pointing it at the issue tracker for the main BitBucket repo.
There was a bug introduced in fa5f235be2cd such that calling hg outgoing on
a Git repository would result in all refs being deleted from the remote
repository (with the possible exception of the currently checked out branch).
It wasn't noticed before because the existing test for outgoing didn't actually
verify the refs on the remote. This changeset fixes the bug, as well as adding
test coverage to allow verifying that the fix works.
When exporting Git commits, verify that the tree and parents objects
exist in the repository before allowing the commit to be exported. If a
tree or parent commit is missing, then the repository is not valid and
the export should not be allowed.
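A minimal sketch of such a check, with a commit modelled as a dict and the
object store as any container supporting `in`. check_exportable is a
hypothetical name, and ValueError is an arbitrary choice of exception.

```python
def check_exportable(commit, object_store):
    """Raise if the commit's tree or any of its parents is missing.

    commit:       dict with 'id', 'tree' and 'parents' keys (stand-in)
    object_store: container of known object ids
    """
    if commit['tree'] not in object_store:
        raise ValueError('missing tree %s for commit %s'
                         % (commit['tree'], commit['id']))
    for parent in commit['parents']:
        if parent not in object_store:
            raise ValueError('missing parent %s for commit %s'
                             % (parent, commit['id']))
```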
While working on some other tests, I noticed that the push command was returning
exit code 1 on success. This changeset makes hgrepo.push use the same return
code contract as localrepo.push, which makes the exit codes behave as expected.
that mimic a branchname to be maintained on the git side without
a particular suffix - e.g. if the hg repo had a branch "release_05",
and a bookmark created onto it "release_05_bookmark", the branch on the
git side would be named "release_05". When pulling branches back from
git, if an hg named branch of that name exists, the suffix is appended
back onto the name before creating a bookmark on the hg side.
This is strictly so that a git repo can be generated that has the
same "branch names" as an older hg repo that has named branches, and
has had bookmarks added in to mirror the branch names.
This is given the restrictions that
A. hg named branches can never be renamed and B. hg-git only supports
hg bookmarks, not branches
Signed-off-by: Ehsan Akhgari <ehsan.akhgari@gmail.com>
---
I found a number of bugs when I was trying to convert Mozilla's hg repository
to git using hg-git. This patch fixes a number of bugs with irregular
author lines present in hg repositories. Git cannot correctly process a
commit object which has a committer or author line in a format that it does
not understand, so it cannot handle repositories
that have such commit objects.
The added test cases show the irregular cases that this patch is able to
deal with.
The wrapped version of findoutgoing unconditionally mangled the
keyword arguments, but only did version fixups when the
remote was a git repository. This change only mangles the argument
list when the remote is a git repository.
Only show importing/exporting messages when there is something
to do. Change "importing Hg objects into Git" to "exporting
hg objects to git" (and lowercase the other direction).
With this patch, attempts to push (or run outgoing) to read-only git URLs
at github return github's helpful error message instead of just saying
the remote end hung up.
Previously, we appended to .hg/localtags on every pull. This meant
that we never deleted refs that disappeared on the remote server, and
the file length grew without bound. Now we use our own file
(.hg/git-remote-refs) and we do prune refs that disappear from the
remote server.
Use an exact match with the ref name ('foo' in 'refs/heads/foo'),
instead of just checking if it ended with '/foo'.
This allows
$ hg pull -r foo
to run successfully on a repo containing the branches
- 'foo',
- 'mine/foo',
- 'theirs/foo'
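The difference between the two matching rules can be sketched as follows.
matching_heads is a hypothetical helper, not the hg-git function.

```python
def matching_heads(refs, name):
    """Return the heads whose branch name is exactly `name`.

    A suffix test such as ref.endswith('/' + name) would also match
    'refs/heads/mine/foo' when asked for 'foo'.
    """
    prefix = 'refs/heads/'
    return [ref for ref in refs
            if ref.startswith(prefix) and ref[len(prefix):] == name]
```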
dulwich recently changed apply_delta() [1] to return lists. Invoke
join() on the output with an empty string, as dulwich does in its
codebase.
[1] git reference: a2709f6 (Return chunks from apply_delta.)
If a git tag is of the annotated-type, the git server sends an
additional line with the SHA-1 the tag dereferences to (e.g.
refs/tags/mytag^{}). These aren't "real" tags, so don't store them.
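Filtering out the dereferenced entries might look like this, assuming the
advertised refs are available as a dict of ref name to SHA-1. real_tags is a
hypothetical helper name.

```python
PEELED_SUFFIX = '^{}'


def real_tags(refs):
    """Keep tag refs, dropping the peeled '^{}' entries.

    refs: dict mapping ref name -> SHA-1 as advertised by the server
    """
    return dict((name, sha) for name, sha in refs.items()
                if name.startswith('refs/tags/')
                and not name.endswith(PEELED_SUFFIX))
```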
Splice the ref name only once, and don't loop through refs/heads
multiple times.
This changes the order in which hg tags are created; update test output
to reflect this.