A user recently managed to track and export a .git directory by
mistake, which confuses git and causes it to emit very odd errors.
For example, cloning one such repository (which has a symlink for
.git) produces this output from git:

    Cloning into 'git'...
    done.
    error: Updating '.git' would lose untracked files in it

and another (which has a .git directory checked in) produces this:

    Cloning into 'git'...
    done.
    error: Invalid path '.git/hooks/post-update'

If it ended there, that'd be fine, but this led to a line of
investigation that ended with CVE-2014-9390, so git now refuses to
check these revisions out. We should try to prevent foot-shooting on
our end. Since some servers (notably GitHub) are blocking trees that
contain these entries, default to refusing to export any path
component that looks like it folds to .git. Since some histories
probably contain this already, we offer an escape hatch via the
config option git.blockdotgit that allows users to resume
foot-shooting behavior.
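As a rough illustration of the default described above, the sketch below
rejects any path with a component that matches .git case-insensitively. The
function name and the block_dotgit parameter are hypothetical, and only plain
case folding is handled; a real check would likely need to cover more
filesystem-specific foldings than this.

```python
def check_export_path(path, block_dotgit=True):
    """Return ``path`` unchanged, or raise if a component folds to '.git'.

    Hypothetical helper: ``block_dotgit`` stands in for the value of the
    git.blockdotgit config option, and only case folding is considered.
    """
    if not block_dotgit:
        # Escape hatch: the user explicitly asked to export such paths.
        return path
    for component in path.split('/'):
        if component.lower() == '.git':
            raise ValueError(
                "refusing to export path with a .git component: %r" % path)
    return path
```

Assuming the usual section.key mapping of Mercurial config options, the escape
hatch would presumably be spelled blockdotgit = false in the [git] section of
the hgrc.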
Previously, whenever a tree that wasn't the root ('') was stored, we'd prepend
a '/' to its path. Then, when retrieving the entry, we'd look it up without the
leading '/'. This caused data loss because the lookup missed and existing tree
entries were dropped on the floor. Fix that by only adding '/' if we're adding
to a non-empty initial path.
This wasn't detected in tests because most of them deal only with files in the
root and not ones in subdirectories.
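The fix can be pictured with a small sketch. The helper name and the dict used
as a tree cache are assumptions for illustration, not the actual code.

```python
def join_tree_path(initial_path, name):
    """Join a tree path and an entry name without ever adding a leading '/'."""
    if initial_path:
        return initial_path + '/' + name
    # Adding to the root (''): no separator, so the key has no leading '/'.
    return name

# Storing and retrieving use the same key, so nested entries are found again.
trees = {}
trees[join_tree_path('', 'docs')] = 'tree for docs'
trees[join_tree_path('docs', 'api')] = 'tree for docs/api'
```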
Previously, we'd spin up the Mercurial incremental exporter from the null
commit and build up state from there. This meant that for the first exported
commit, we'd have to read all the files in that commit and compute Git blobs
and trees based on that.
The current Mercurial to Git conversion scheme makes most sense with
Mercurial's current default storage format, where manifests are diffed against
the numerically previous revision. At some point in the future, the default
will switch to generaldelta, where manifests would be diffed against one of
their parents. In that world it might make more sense to have a stateless
exporter that diffed each commit against its generaldelta parent and calculated
dirty trees based on that instead. However, more experiments need to be done to
see what export scheme is best.
For a repo with around 50,000 files, this brings down an incremental 'hg
gexport' of one commit from 18 seconds with a hot file cache (and tens of
minutes with a cold one) to around 2 seconds with a hot file cache.
Previously, the correctness of _handle_subrepos was based on the order the
files were processed in. For example, consider the case where a subrepo at
location 'loc' is replaced with a file at 'loc', while another subrepo exists.
This would cause .hgsubstate and .hgsub to be modified and the file added.
If .hgsubstate was seen _before_ 'loc' in the modified/added loop, then
_handle_subrepos would run and remove 'loc' correctly, before 'loc' was added
back later. If, however, .hgsubstate was seen _after_ 'loc', then
_handle_subrepos would run after 'loc' was added and would incorrectly remove
the newly added file.
With this patch, _handle_subrepos merely computes the changes that need to be
applied. The changes are then applied, making sure removed files and subrepos
are processed before added ones.
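A minimal sketch of the compute-then-apply idea follows. The function and the
dict standing in for the tree state are hypothetical; the point is only that
removals are applied before additions, so iteration order no longer matters.

```python
def apply_changes(tree, removed, added):
    """Apply removals first, then additions, regardless of input order.

    ``tree`` maps path -> content, ``removed`` is an iterable of paths and
    ``added`` maps path -> content. All names here are illustrative.
    """
    for path in removed:
        tree.pop(path, None)
    for path, content in added.items():
        tree[path] = content
    return tree

# A subrepo at 'loc' is replaced with a regular file at 'loc', while
# another subrepo continues to exist.
tree = {'loc': 'subrepo', 'other': 'subrepo'}
result = apply_changes(tree, removed=['loc'], added={'loc': 'file contents'})
```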
This was detected by setting a random PYTHONHASHSEED (in this case, 3910358828)
and running the test suite against it. An upcoming patch will randomize the
PYTHONHASHSEED in run-tests.py, just as Mercurial does.
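For illustration, a test runner could randomize the seed along these lines.
This is a sketch with a hypothetical helper name, not the code in
run-tests.py; returning the seed lets the runner report it so that a failing
run can be reproduced.

```python
import os
import random

def env_with_random_hashseed(base_env=None):
    """Return (env, seed): a copy of the environment with PYTHONHASHSEED set."""
    env = dict(os.environ if base_env is None else base_env)
    # PYTHONHASHSEED accepts an integer in the range [0, 4294967295].
    seed = random.randint(0, 4294967295)
    env['PYTHONHASHSEED'] = str(seed)
    return env, seed
```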
Previously, we emitted every Git tree when updating between Mercurial
changesets. With this patch, we now only emit Git trees that changed. A
side-effect of the implementation is that we now only update in-memory
Git tree objects that changed. Before, we always touched every Git tree,
invalidating it in the process and causing Dulwich to recalculate
its SHA-1. Profiling revealed this to be expensive, and removing the
extra calculation shows a nice performance win.
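One way to picture "only trees that changed" is to derive the set of affected
directories from the changed file paths. The sketch below is an assumption
about the general idea, with a hypothetical function name, and is not the
exporter's actual implementation.

```python
def dirty_trees(changed_paths):
    """Return the set of tree paths affected by the given changed files.

    The root tree is represented by ''. Trees outside this set are left
    untouched, so their cached SHA-1 never needs to be recalculated.
    """
    dirty = set()
    for path in changed_paths:
        dirty.add('')  # any change at all dirties the root tree
        parts = path.split('/')[:-1]  # directory components only
        for depth in range(1, len(parts) + 1):
            dirty.add('/'.join(parts[:depth]))
    return dirty
```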
Another optimization is to stop sorting changed paths before processing
them. Previously, we sorted by length, longest to shortest. Profiling
revealed that the sorts took a non-trivial amount of time. While sorted
execution likely resulted in idempotent behavior, it shouldn't be
strictly required.
On the author's machine, conversion of the Mercurial repository itself
decreased from ~493s to ~333s. Even more impressive is conversion of
Firefox's main repository (which is considerably larger). Converting the
first 200 revisions of that repository decreased from ~152s to ~42s.
This replaces the brute force Mercurial to Git export with one that is
incremental. It results in a decent performance win and paves the road
for parallel export using multiple incremental exporters.