The getchanges function of some converter_source classes can return
false positives: it sometimes claims that a file "foo" was changed in
some revision even though its contents are still the same.
convert_svn is particularly bad, but I think this can also happen with
convert_cvs and, at least in theory, with mercurial_source.
For regular conversions this is not really a problem - as long as
getfile returns the right contents, we'll get a converted revision
with the right contents. But when we use --filemap, this could lead
to superfluous revisions being converted.
Instead of fixing every converter_source, I decided to change
mercurial_sink to work around this problem.
When --filemap is used, we're interested only in revisions that touch
some specific files. If a revision doesn't change any of these files,
then we're not interested in it (at least for revisions with a single
parent; merges are special).
For mercurial_sink, we abuse this property and roll back a commit if
the manifest text hasn't changed. This avoids duplicating the logic
from localrepo.filecommit to detect unchanged files.
To handle merges correctly, this revision adds a filemap_source class
that wraps a converter_source and does the work necessary to calculate
the subgraph we're interested in.
The wrapped converter_source must provide a new getchangedfiles method
that, given a revision rev and an index N, returns the list of files
that differ between rev and its Nth parent.
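
A minimal sketch of what such a method could look like, assuming a
hypothetical in-memory source (ToySource and its revs layout are made up
for illustration and are not part of hg convert):

```python
class ToySource:
    """Hypothetical source: maps rev -> (parents, {path: contents})."""

    def __init__(self, revs):
        self.revs = revs

    def getchangedfiles(self, rev, i):
        # Compare the file set of rev with that of its i-th parent.
        parents, files = self.revs[rev]
        pfiles = self.revs[parents[i]][1]
        names = set(files) | set(pfiles)
        # A file counts as changed if it was added, removed, or its
        # contents differ between the two revisions.
        return sorted(f for f in names if files.get(f) != pfiles.get(f))


revs = {
    "a": ([], {"foo": "1"}),
    "b": (["a"], {"foo": "1", "bar": "x"}),
    "c": (["a"], {"foo": "2"}),
    "m": (["b", "c"], {"foo": "2", "bar": "x"}),  # merge of b and c
}
src = ToySource(revs)
```

For the merge "m", the answer depends on which parent is chosen, which is
why the index N is part of the interface.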
The implementation depends on the ability to skip some revisions and to
change the parents field of the commit objects that we returned earlier.
To make the conversion restartable, we assume the revisions in the
revmapfile are topologically sorted.
The --filemap support in hg convert doesn't handle merges correctly.
(And after 98d1e8c16343 I managed to break it even for simple cases
where we don't want the first revision.)
If getchanges returns a string, it's assumed to be the id of an
already converted revision. We map the current revision to the same
revision this converted revision was mapped to.
To allow skipping a root revision, getchanges can return the special
string 'hg-convert-skipped-revision' (a.k.a. common.SKIPREV), which
hopefully won't clash with any real id.
The converter_source is responsible for rewriting the parents of the
commit objects to make sure the revision graph makes sense.
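
The mapping rule above can be sketched roughly as follows. This is an
illustration only: record_result, the revmap dict and the commit callback
are hypothetical names, not the actual convert code.

```python
SKIPREV = 'hg-convert-skipped-revision'


def record_result(revmap, rev, changes, commit):
    """Update revmap for rev based on what getchanges returned.

    revmap maps source ids to destination ids (hypothetical layout);
    commit is a hypothetical callback that converts a real revision.
    """
    if isinstance(changes, str):
        if changes == SKIPREV:
            # Skipped root revision: there is nothing to map it to.
            dest = SKIPREV
        else:
            # changes names an already converted source revision;
            # reuse whatever that revision was mapped to.
            dest = revmap[changes]
    else:
        dest = commit(rev, changes)
    revmap[rev] = dest
    return dest
```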
- handle chunk headers separately rather than prepending them to
(potentially large) chunks
- break large chunks into 1M pieces for compression
- don't prepend file metadata onto (potentially large) file data
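
As a rough illustration of the second point, a large chunk can be fed to
a streaming compressor in 1M slices. The compress_chunks helper below is
a hypothetical sketch using the standard zlib module, not the code from
this change:

```python
import zlib


def compress_chunks(chunks, piece=2 ** 20):
    """Yield compressed data for an iterable of byte chunks.

    Each chunk is handed to the compressor in slices of at most
    `piece` bytes, so no single call sees a very large buffer.
    """
    z = zlib.compressobj()
    for chunk in chunks:
        for pos in range(0, len(chunk), piece):
            out = z.compress(chunk[pos:pos + piece])
            if out:
                yield out
    # Emit whatever the compressor is still buffering.
    yield z.flush()
```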
To avoid extra memory usage and performance issues with large files,
generate a trivial delta header for deltas against the null revision
rather than calling the usual delta generator.
We append the delta header to meta rather than prepending it to data
to avoid a large allocation and copy.
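
A sketch of the idea, assuming the delta format is a sequence of hunks,
each a (start, end, length) header of three big-endian 32-bit integers
followed by length bytes of new data. The helper names here are
illustrative, and apply_delta exists only to check the result:

```python
import struct


def trivial_delta_header(length):
    # One hunk replacing the empty range [0, 0) of the null revision
    # with `length` bytes of new data.
    return struct.pack(">lll", 0, 0, length)


def apply_delta(base, delta):
    """Apply a delta in the assumed hunk format to base."""
    out = []
    last = 0
    pos = 0
    while pos < len(delta):
        start, end, length = struct.unpack(">lll", delta[pos:pos + 12])
        pos += 12
        out.append(base[last:start])           # unchanged part of base
        out.append(delta[pos:pos + length])    # replacement data
        pos += length
        last = end
    out.append(base[last:])
    return b"".join(out)
```

The header is only 12 bytes, so it is cheap to attach it to the (small)
metadata instead of copying the (large) file data.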
lyhash is a very simple and fast hash function that had the fewest
hash collisions on a 3.9M line text corpus and 190k line binary corpus
and should have significantly fewer collisions than the current hash
function.
pretty easy to find after I recompiled the python interpreter and
mercurial for profiling.
In bdiff.c, the function equatelines allocates the minimum hash table
size, which can lead to tons of collisions. I introduced an
"overcommit" factor of 16; that is, I allocate 16 times more memory
than the minimum value. Overcommitting 128 times does not improve the
performance over the 16-times case.
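
The effect can be illustrated with a toy model in Python. This is not the
C code from bdiff.c: table_size and count_probes are hypothetical
helpers, the table uses plain linear probing, and every hash is treated
as a distinct entry.

```python
def table_size(nlines, overcommit=16):
    # Smallest power of two that can hold nlines entries, then scaled
    # by the overcommit factor.
    size = 1
    while size < nlines + 1:
        size *= 2
    return size * overcommit


def count_probes(hashes, size):
    """Insert hashes with linear probing; return the extra probes needed."""
    table = [None] * size
    probes = 0
    for h in hashes:
        j = h & (size - 1)
        while table[j] is not None:
            probes += 1
            j = (j + 1) & (size - 1)
        table[j] = h
    return probes


hashes = [0, 8, 16, 24]
small = count_probes(hashes, table_size(len(hashes), overcommit=1))
large = count_probes(hashes, table_size(len(hashes)))
```

With the minimum table of 8 slots all four hashes land in the same slot,
while the overcommitted table of 128 slots keeps them apart.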
We want to store version information about the revlog in the first
entry of its index. The code in packentry was using some heuristics
to detect whether this was the first entry, but these heuristics could
fail in some cases (e.g. rev 0 was empty; rev 1 descends directly from
the nullid and is stored as a delta).
We now give the revision number to packentry to avoid heuristics.
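
A sketch of the resulting logic, assuming a 64-byte index entry laid out
as ">Qiiiiii20s12x" and a version stored as a big-endian 32-bit integer
in the first four bytes of entry 0. The format strings and the signature
are assumptions made for illustration:

```python
import struct

INDEX_FORMAT = ">Qiiiiii20s12x"   # assumed entry layout (64 bytes)
VERSION_FORMAT = ">I"             # assumed version layout (4 bytes)


def packentry(entry, version, rev):
    p = struct.pack(INDEX_FORMAT, *entry)
    if rev == 0:
        # Only the very first entry carries the version; deciding by
        # revision number avoids guessing from the entry's contents.
        p = struct.pack(VERSION_FORMAT, version) + p[4:]
    return p
```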
WSGI applications are not supposed to refer to sys.stdin. In af5aceab19f4,
hgweb and hgwebdir were fixed to pass interactive=False to their ui objects,
but those objects still called sys.stdin.isatty(). This change makes sure that
only ui.fixconfig() calls ui.isatty(), by making ui._readline() (currently
called only from ui.prompt()) private. ui.fixconfig() now lets config files
override the initial interactivity setting, but does not check isatty() if
interactive=False was specified when the ui was created.
read:
- single call to len(st)
- fewer assignments for position tracking
- don't split apart tuple from unpack
- use a literal for the unpack spec
write:
- localize variables and functions
- avoid copied function call
- use % for string concatenation
- short-circuit decpath if we haven't built the _dirs map
- increment only for leaf nodes of the directory tree
(this should make construction more like O(n log n) than O(n^2))
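
The last two points can be sketched as a reference-counted directory map.
This is a standalone illustration with hypothetical free functions, where
dirs stands in for the _dirs map and None means it has not been built:

```python
def incpath(dirs, path):
    c = path.rfind('/')
    if c >= 0:
        base = path[:c]
        if base in dirs:
            # Directory already known: bump only this one counter.
            dirs[base] += 1
        else:
            # New directory: register it with its own parent first.
            incpath(dirs, base)
            dirs[base] = 1


def decpath(dirs, path):
    if dirs is None:
        # Map not built yet, so there is nothing to update.
        return
    c = path.rfind('/')
    if c >= 0:
        base = path[:c]
        if dirs[base] == 1:
            # Last entry in this directory: drop it and update its parent.
            del dirs[base]
            decpath(dirs, base)
        else:
            dirs[base] -= 1
```

Each file touches the counters of all its ancestors only when a
directory appears or disappears, instead of on every add or remove.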