Commit Graph

18 Commits

Author SHA1 Message Date
Siddharth Agarwal
cbe9637432 hg2git: regularize mercurial imports 2015-12-31 12:27:07 -08:00
Sean Farley
3b19ebc41e hg2git: flake8 cleanup 2015-04-22 16:42:48 -07:00
Augie Fackler
6efcd59b56 hg2git: audit path components during export (CVE-2014-9390)
A user recently got confused and managed to track and export a .git
directory, which confuses git and causes it to emit very odd
errors. For example, cloning one such repository (which has a symlink
for .git) produces this output from git:

  Cloning into 'git'...
  done.
  error: Updating '.git' would lose untracked files in it

and another (which has a .git directory checked in) produces this:

  Cloning into 'git'...
  done.
  error: Invalid path '.git/hooks/post-update'

If it ended there, that'd be fine, but this led to a line of
investigation that ended with CVE-2014-9390, so now git will block
checking these revisions out, so we should try to prevent
foot-shooting on our end. Since some servers (notably github) are
blocking trees that contain these entries, default to refusing to
export any path component that looks like it folds to .git. Since some
histories probably contain this already, we offer an escape hatch via
the config option git.blockdotgit that allows users to resume
foot-shooting behavior.
2014-11-23 19:06:21 -05:00
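
The path check described in this commit can be sketched roughly as follows. This is a minimal illustration, not hg-git's actual code: the function names and the `block_dotgit` parameter are hypothetical stand-ins (the commit only names the config option git.blockdotgit), and the case-folding rules are a simplified approximation of what Git checks for (case-insensitive match, HFS+ ignorable code points, and the NTFS short name `git~1`).

```python
# Hypothetical sketch; names and rules are assumptions, not hg-git's API.

# Code points that HFS+ ignores when comparing file names (assumed list).
HFS_IGNORABLE = frozenset(
    "\u200c\u200d\u200e\u200f"
    "\u202a\u202b\u202c\u202d\u202e"
    "\u206a\u206b\u206c\u206d\u206e\u206f"
    "\ufeff"
)


def folds_to_dotgit(component):
    """Return True if a single path component could be treated as '.git'."""
    stripped = "".join(ch for ch in component if ch not in HFS_IGNORABLE)
    return stripped.lower() in (".git", "git~1")


def audit_path(path, block_dotgit=True):
    """Raise ValueError if any component of a '/'-separated path folds to .git.

    Passing block_dotgit=False models the escape hatch and skips the check.
    """
    if not block_dotgit:
        return
    for component in path.split("/"):
        if folds_to_dotgit(component):
            raise ValueError(
                "refusing to export path component %r in %r" % (component, path)
            )
```

Checking each component rather than the whole path is what catches nested cases such as 'sub/.GIT/hooks/post-update'.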
Siddharth Agarwal
c188adb4b9 hg2git: in _init_dirs, store keys without leading '/' (issue103)
Previously, whenever a tree that wasn't the root ('') was stored, we'd prepend
a '/' to it. Then, when we'd try retrieving the entry, we'd do so without the
leading '/'. This caused data loss because existing tree entries were dropped
on the floor. Fix that by only adding '/' if we're adding to a non-empty
initial path.

This wasn't detected in tests because most of them deal only with files in the
root and not ones in subdirectories.
2014-03-25 11:11:04 -07:00
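
The key-building rule from this fix can be illustrated with a small sketch. It assumes trees are modelled as nested dicts, which is not how hg-git or Dulwich represent them; `child_key` and `collect_tree_keys` are hypothetical names used only for illustration.

```python
# Hypothetical sketch; nested dicts stand in for Git tree objects.

def child_key(base, name):
    """Join a directory key and a child name without a leading '/'."""
    # Only add the separator when the base path is non-empty, so a child
    # of the root ('') is stored as 'a' rather than '/a'.
    return base + "/" + name if base else name


def collect_tree_keys(tree, base=""):
    """Map every directory path to its subtree, with the root stored as ''."""
    keys = {base: tree}
    for name, entry in tree.items():
        if isinstance(entry, dict):
            keys.update(collect_tree_keys(entry, child_key(base, name)))
    return keys
```

With keys stored this way, a later lookup of 'a/b' finds the entry that was stored, instead of missing it because it was filed under '/a/b'.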
Siddharth Agarwal
f84c69b6c1 hg2git: start incremental conversion from a known commit
Previously, we'd spin up the Mercurial incremental exporter from the null
commit and build up state from there. This meant that for the first exported
commit, we'd have to read all the files in that commit and compute Git blobs
and trees based on that.

The current Mercurial to Git conversion scheme makes most sense with
Mercurial's current default storage format, where manifests are diffed against
the numerically previous revision. At some point in the future, the default
will switch to generaldelta, where manifests would be diffed against one of
their parents. In that world it might make more sense to have a stateless
exporter that diffed each commit against its generaldelta parent and calculated
dirty trees based on that instead. However, more experiments need to be done to
see what export scheme is best.

For a repo with around 50,000 files, this brings down an incremental 'hg
gexport' of one commit from 18 seconds with a hot file cache (and tens of
minutes with a cold one) to around 2 seconds with a hot file cache.
2014-03-14 20:45:09 -07:00
Siddharth Agarwal
e5bd941852 hg2git: implement a method to initialize _dirs from a Git commit
Upcoming patches will start incrementally exporting from a particular commit
instead of from null. This function will be used for that.
2014-03-14 19:17:09 -07:00
Siddharth Agarwal
6b4e5f67db hg2git: fix subrepo handling to be deterministic
Previously, the correctness of _handle_subrepos was based on the order the
files were processed in. For example, consider the case where a subrepo at
location 'loc' is replaced with a file at 'loc', while another subrepo exists.
This would cause .hgsubstate and .hgsub to be modified and the file added.

If .hgsubstate was seen _before_ 'loc' in the modified/added loop, then
_handle_subrepos would run and remove 'loc' correctly, before 'loc' was added
back later. If, however, .hgsubstate was seen _after_ 'loc', then
_handle_subrepos would run after 'loc' was added and would remove 'loc'.

With this patch, _handle_subrepos merely computes the changes that need to be
applied. The changes are then applied, making sure removed files and subrepos
are processed before added ones.

This was detected by setting a random PYTHONHASHSEED (in this case, 3910358828)
and running the test suite against it. An upcoming patch will randomize the
PYTHONHASHSEED in run-tests.py, just like is done in Mercurial.
2014-02-19 20:52:59 -08:00
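
The compute-then-apply approach described above can be sketched as follows, using the 'loc' scenario from the commit message. This is an assumption-laden illustration: a flat dict stands in for the Git tree, and `plan_subrepo_changes` and `apply_changes` are hypothetical names, not functions from hg-git.

```python
# Hypothetical sketch; a flat dict of path -> entry stands in for the tree.

def plan_subrepo_changes(old_subrepos, new_subrepos):
    """Compute subrepo removals and additions without touching the tree."""
    removed = sorted(p for p in old_subrepos if p not in new_subrepos)
    added = {p: rev for p, rev in new_subrepos.items()
             if old_subrepos.get(p) != rev}
    return removed, added


def apply_changes(tree, removed_paths, added_entries):
    """Apply every removal first, then every addition."""
    for path in removed_paths:
        tree.pop(path, None)
    for path, entry in added_entries.items():
        tree[path] = entry
    return tree


# Subrepo 'loc' is replaced by a file 'loc' while subrepo 'other' remains.
old = {"loc": "abc123", "other": "def456"}
new = {"other": "def456"}
tree = {"loc": ("subrepo", "abc123"), "other": ("subrepo", "def456")}

removed_subrepos, added_subrepos = plan_subrepo_changes(old, new)
added = {"loc": ("file", "blob-id")}
added.update({p: ("subrepo", rev) for p, rev in added_subrepos.items()})
apply_changes(tree, removed_subrepos, added)
```

Because removals are always applied before additions, the result no longer depends on whether .hgsubstate or 'loc' happens to be visited first.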
Siddharth Agarwal
689b38dc44 hg2git: move parse_subrepos to top level
durin42 expressed a desire for this function to be at the top level.
2014-02-19 20:18:43 -08:00
Siddharth Agarwal
d7dbce79bd hg2git: call _handle_subrepos when .hgsubstate is removed
Now that _handle_subrepos can handle .hgsubstate being removed, we should use
it for that.

The test changes make sure that the SHAs roundtrip.
2014-02-12 22:55:16 -08:00
Siddharth Agarwal
39d1c15298 hg2git: make _handle_subrepos work in the removed case
A test for this will be included in an upcoming patch.
2014-02-12 21:19:04 -08:00
Siddharth Agarwal
ca74d6d967 hg2git: add 'new' prefix to _handle_subrepos variables
An upcoming patch will introduce similar variables for self._ctx. This helps
disambiguate.
2014-02-12 20:34:09 -08:00
Siddharth Agarwal
3cadf19b94 hg2git: factor out subrepo parsing into a separate function
This code will be used in multiple contexts in an upcoming patch.
2014-02-12 20:28:28 -08:00
Siddharth Agarwal
44c13be822 hg2git: factor out remove path logic into a separate function
This will be used by _handle_subrepos in an upcoming patch.
2014-02-12 19:50:56 -08:00
Siddharth Agarwal
873a402c5e hg2git: call status on newctx, not newctx.rev()
There's no benefit to calling rev().
2014-02-12 18:05:12 -08:00
Siddharth Agarwal
17657a025c hg2git: store ctx instead of rev
Storing a ctx enables values like manifests to be cached on the context.
2014-02-12 17:49:14 -08:00
Siddharth Agarwal
b470bfcf51 hg2git: rename ctx to newctx in update_changeset
An upcoming patch will introduce a new field called _ctx. This helps prevent
confusion.
2014-02-12 17:47:38 -08:00
Gregory Szorc
10dcc5b5c0 Only export modified Git trees
Previously, we emitted every Git tree when updating between Mercurial
changesets. With this patch, we now only emit Git trees that changed. A
side-effect of the implementation is that we now only update in-memory
Git tree objects that changed. Before, we always touched Git trees,
invalidating them in the process and causing Dulwich to recalculate
their SHA-1. Profiling revealed this to be expensive and removing the
extra calculation shows a nice performance win.

Another optimization is to not sort the order that changed paths are
processed in. Previously, we sorted by length, longest to shortest.
Profiling revealed that the sorts took a non-trivial amount of time.
While sorted execution resulted in likely idempotent behavior, it
shouldn't be strictly required.

On the author's machine, conversion of the Mercurial repository itself
decreased from ~493s to ~333s. Even more impressive is conversion of
Firefox's main repository (which is considerably larger). Converting the
first 200 revisions of that repository decreased from ~152s to ~42s.
2013-04-14 11:11:41 -07:00
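
One way to picture "only emit Git trees that changed" is to compute the set of directories affected by a changeset's file changes. The sketch below is an assumption about the general idea, not the commit's implementation; `dirty_dirs` is a hypothetical name, and the root tree is keyed as '' purely for illustration.

```python
# Hypothetical sketch of finding which trees need to be re-emitted.

def dirty_dirs(changed_paths):
    """Return the directories whose trees change when the given files change.

    Every ancestor directory of a changed file is dirty, including the
    root, which is represented here by the empty string.
    """
    dirty = set()
    for path in changed_paths:
        dirty.add("")  # the root tree changes whenever anything changes
        parts = path.split("/")[:-1]  # drop the file name
        for depth in range(1, len(parts) + 1):
            dirty.add("/".join(parts[:depth]))
    return dirty
```

Trees outside this set keep their previous contents, so they need neither re-serialization nor a fresh SHA-1 calculation.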
Gregory Szorc
baa19027ef Export Git objects from incremental Mercurial changes
This replaces the brute force Mercurial to Git export with one that is
incremental. It results in a decent performance win and paves the road
for parallel export via using multiple incremental exporters.
2013-03-19 22:44:01 -07:00