If manifestctx inherits from manifestdict, it requires some weird logic to
lazily load the dict if a piece of information is asked for. This ended up being
complicated and unintuitive to use.
Let's move the dict creation to .read(). This will make even more sense once we
start adding readdelta() and other similar methods to manifestctx.
If manifestctx inherits from manifestdict, it requires some weird logic to
lazily load the dict if a piece of information is asked for. This ended up being
complicated and unintuitive to use.
Let's move the dict creation to .read(). This will make even more sense once we
start adding readdelta() and other similar methods to manifestctx.
In a future patch we will change manifestctx and treemanifestctx to no longer
derive from manifestdict and treemanifest, respectively. This means that
consumers of the _mancache will now need to be aware of the different between
the two, until we get rid of the manifest entirely and the _mancache becomes
only filled with ctxs.
Before we start using repo.manifestlog in the rest of the code base, we need to
make sure that it's capable of returning treemanifests. As we add new
functionality to manifestctx, we'll add it to treemanifestctx at the same time.
We also comment out the manifestctx p1, p2, and linkrev fields for now, since
we're not implementing them on treemanifest yet.
As part of refactoring the manifest, certain test cases started failing because
writesubtrees was called with p1 and p2 manifests that had not been loaded (so
accessing m1._dirs resulted in an empty set). Let's call _load on these before
attempting to access _dirs.
This was caught by tests when future patches were applied.
The file caches we're using to avoid reloading the manifest from disk everytime
has an annoying bug that causes the in memory structure to not be reloaded if
the mtime and the size haven't changed. This causes a breakage in the tests
because the manifestlog is not being reloaded after a commit+strip operation in
mq (the mtime is the same because it all happens in the same second, and the
resulting size is the same because we add 1 and remove 1). The only reason this
doesn't affect the manifest itself is because we touch it so often that we
had already reloaded it after the commit, but before the strip.
Once the entire manifest has migrated to manifestlog, we can get rid of these
properties, since then the manifestlog will be touched after the commit, but
before the strip, as well.
This is the start of a large refactoring of the manifest class. It introduces
the new manifestlog and manifestctx classes which will represent the collection
of all manifests and individual instances, respectively.
Future patches will begin to convert usages of repo.manifest to
repo.manifestlog, adding the necessary functionality to manifestlog and instance
as they are needed.
As part of our refactoring to split the manifest concept from its storage, we
need to start moving the revlog specific parts of the manifest implementation to
a new class. This patch creates manifestrevlog and moves the fulltextcache onto
the base class.
The old manifest cache cached both the inmemory representation and the raw text.
As part of the manifest refactor we want to separate the storage format from the
in memory representation, so let's split this cache into two caches.
This will let other manifest implementations participate in the in memory cache,
while allowing the revlog based implementations to still depend on the full text
caching where necessary.
This is a little messier than I'd like, and I'll probably come back
and do some more refactoring later, but as it is this unblocks
narrowhg. An alternative approach (which I may do as part of the
mentioned refactoring) would be to construct *all* dirlog instances up
front, so that we don't have to keep track of the linkmapper
method. This would avoid a reference cycle between the bundlemanifest
and the bundlerepository, but I was hesitant to do all the work up
front like that.
With this change, it's possible to do 'hg incoming' and 'hg pull' from
bundles in .hg/strip-backup in a treemanifest repository. Sadly, this
doesn't make it possible to 'hg clone' one of those (if you do 'hg
strip 0'), because the cg3 in the bundle gets written without a
treemanifest flag. Since that's going to be an involved refactor in a
different part of the code (which I *suspect* won't touch any of the
code I've just written here), let's leave it as an idea for Later.
lazymanifests can compute diffs significantly faster than taking the set
of two manifests and calculating the delta.
when running hg diff --git -c . on Facebook's big repo, this reduces the
run time from 2.1s to 1.5s.
The current code for generating treemanifest revisions takes the list
of files in the changeset and finds the directories from them. This
does not work for merges, since a merge may pick file A from one side
and file B from another and neither of them would appear in the
changeset's "files" list, but the manifest would still change.
Fix this by instead walking the root manifest log for all needed
revisions, storing all needed file and subdirectory revisions, then
recursively visiting the subdirectories. This also turns out to be
faster: cloning a version of hg core converted to treemanifests went
from ~28s to ~19s (timing somewhat unfair: before this patch, timed
until crash; after this patch, timed until manifests complete).
The new algorithm is used only on treemanifest repos. Although it
works equally well on flat manifests, we leave the iteration over
files in the changeset for flat manifests for now.
When using treemanifests, an on-disk manifest entry with the 't' flag
set means that that entry is a directory and not a file. When read
into memory, these become instances of the treemanifest class. The 't'
flag should therefore never be visible to outside of manifest.py, so
setflag() checks that it is not called with the 't' flag. However, it
turns out that it will be useful for the narrowhg extension to expose
the 't' flag to the user (see below), so let's drop the assertion.
The narrowhg extension allows cloning only a given set of files and
directories. Filelogs and dirlogs that don't match that set will not
be included in the clone. The extension currently doesn't work with
treemanifests. I plan on changing it so directories outside the narrow
clone appear in the manifest. For example, if a directory 'outside/'
is not part of the narrow clone, it will look like a file 'outside'
with the 't' flag set. That will make e.g. manifestmerge() just work
in most cases (and make it well prepared to handle the other
cases).
In repos with treemanifests, there is no specific verification of
directory manifest revlogs. It simply collects all file nodes by
reading each manifest delta. With treemanifests, that's means calling
the manifest._slowreaddelta(). If there are missing revlog entries in
a subdirectory revlog, 'hg verify' will simply report the exception
that occurred while trying to read the root manifest:
manifest@0: reading delta 1700e2e92882: meta/b/00manifest.i@67688a370455: no node
This patch changes the verify code to load only the root manifest at
first and verify all revisions of it, then verify all revisions of
each direct subdirectory, and so on, recursively. The above message
becomes
b/@0: parent-directory manifest refers to unknown revision 67688a370455
Since the new algorithm reads a single revlog at a time and in order,
'hg verify' on a treemanifest version of the hg core repo goes from
~50s to ~14s. As expected, there is no significant difference on a
repo with flat manifests.
Before a4236180df5e (match: remove unnecessary optimization where
visitdir() returns 'all', 2015-05-06), match.visitdir() used to return
the special value 'all' to indicate that it was known that all
subdirectories would also be included in the match. The purpose for
that value was to avoid calling the matcher on all the paths. It
turned out that calling the matcher was not a problem, so the special
return value was removed and the code was simplified. However, if we
use the same special value for not just avoiding calling the matcher
on each file, but to avoid iterating over each file, it's a much
bigger win. On commands like
hg st --rev .^ --rev . dom/
we run the matcher (dom/) on the two manifests, then diff the narrowed
manifest. If the size of the match is much larger than the size of the
diff, this is wasteful. In the above case, we would end up iterating
over the 15k-or-so files in dom/ for each of the manifests, only to
later discover that they are mostly the same. This means that runningt
the command above is usually slower than getting the status for the
entire repo, because that code avoids calling treemanifest.match() and
only calls treemanifest.diff(), which loads only what's needed for the
diff.
Let's fix this by reintroducing the 'all' value in match.visitdir()
and making treemanifest.match() return a lazy copy of the manifest
from dom/ and down (in the above case). This speeds up the above
command on the Firefox repo from 0.357s to 0.137s (best of 5). The
wider the match, the bigger the speedup.
We currently use 'd' to indicate that a manifest entry is a
directory. Let's switch to 't', since that's not a valid hex digit and
therefore easier to spot in the raw manifest data.
This will break any existing repos with tree manifests, but it's still
an experimental feature and there are probably only a few test repos
in existence with 'd' flags.
In large repos, the existing manifest fastdelta computation (which performs a
bisect on the raw manifest for every file that is changing), is excessively
slow. This patch makes fastdelta fallback to the normal string delta algorithm
if the number of changes is large.
On a large repo with a commit of 8000 files, this reduces the commit time by 7
seconds (fastdelta goes from 8 seconds to 1).
I tested this change by modifying the function to compare the old and the new
values and running the test suite. The only difference is that the pure
text-diff algorithm sometimes produces smaller (but functionaly identical)
deltatexts than the bisect algorithm.
The old lazy-copy code formed a chain of copied manifests with each
copy. Under typical operation, the stack never got more than a couple
of manifests deep and was fine. Under conditions like hgsubversion or
convert, the stack could get hundreds of manifests deep, and
eventually overflow the recursion limit for Python. I was able to
consistently reproduce this by converting an hgsubversion clone of
svn's history to treemanifests.
This may result in fewer manifests staying in memory during operations
like convert when treemanifests are in use, and should make those
operations faster since there will be significantly fewer noop
function calls going on.
A previous attempt (never mailed) of mine to fix this problem tried to
simply have all treemanifests only have a loadfunc - that caused
somewhat weird problems because the gettext() callable passed into
read() wasn't idempotent, so the easy solution is to have a loadfunc
and a copyfunc.
Most operations on treemanifests already visit only relevant
submanifests. Notable examples include __getitem__, __contains__,
walk/matches with matcher, diff. By making submanifests lazily loaded,
we speed up all these operations.
The lazy loading is achieved by adding a _load() method that gets
defined where we currently eagerly parse the manifest. We make sure to
call it before any access to _dirs, _files or _flags.
Some timings on the Mozilla repo (with flat manifest timings for
reference):
hg cat -r . README.txt: 1.644s -> 0.096s (0.255s)
hg diff -r .^ -r . : 1.746s -> 0.137s (0.431s)
hg files -r . python : 1.508s -> 0.146s (0.335s)
hg files -r . : 2.125s -> 2.203s (0.712s)
We currently avoid saving a treemanifest revision if it's the same as
one of it's parents. This is checked by comparing the generated text
for all three versions. Let's avoid that when possible by comparing
the nodeids for clean (not dirty) nodes.
On the Mozilla repo, this speeds up commit from 2.836s to 2.343s.
Since tree manifests have a nodeid per directory, we can avoid diffing
entire directories if they have the same nodeid. The comparison is
only valid for unmodified treemanifest instances, of course, so we
need to keep track of which have been modified. Therefore, let's add a
dirty flag to treemanifest indicating whether its nodeid can be
trusted. We set it when _files or _dirs is modified, and make diff(),
and its cousin filesnotin(), not descend into subdirectories that are
the same on both sides.
On the Mozilla repo, this speeds up 'hg diff -r .^ -r .' from 1.990s
to 1.762s. The improvement will be much larger when we start lazily
loading subdirectory manifests.
Match's visitdir() was prematurely optimized to return 'all' in some cases, so
that the caller would not have to call it for directories within the current
directory. This change makes the visitdir system less flexible for future
changes, such as making visitdir consider the match's include and exclude
patterns.
As a demonstration of this optimization not actually improving performance,
I ran 'hg files -r . media' on the Mozilla repository, stored as treemanifest
revlogs.
With best of ten tries, the command took 1.07s both with and without the
optimization, even though the optimization reduced the calls from visitdir()
from 987 to 51.
Since manifests instances are cached on the manifest log instance, we
can cache directory manifests by caching the directory manifest
logs. The directory manifest log cache is a plain dict, so it never
expires; we assume that we can keep all the directories in memory.
The cache is kept on the root manifestlog, so access to directory
manifest logs now has to go through the root manifest log.
The caching will soon not be only an optimization. When we start
lazily loading directory manifests, we need to make sure we don't
create multiple instances of the log objects. The caching takes care
of that problem.
It should be possible to debug the submanifest revlogs without having
to know where they are stored (in .hg/store/meta/), so let's add a
--dir option for this purpose.
With this change, when tree manifests are enabled (in .hg/requires),
commits will be written with one manifest revlog per directory. The
manifest revlogs are stored in
.hg/store/meta/$dir/00manifest.[id].
Flat manifests can still be read and interacted with as usual (they
are also read into treemanifest instances). The functionality for
writing treemanifest as a flat manifest to disk is still left in the
code; tests still pass with '_treeinmem=True' hardcoded.
Exchange is not yet implemented.
The very next changeset will start writing one revlog per directory
when tree manifests are enabled. That is backwards incompatible, so it
requires .hg/requires to be updated. Just like with generaldelta, we
want to update .hg/requires only when the repo is created. Updating
..hg/requires is bad for repos on shared disk. Instead, those who do
want to upgrade a repo to using treemanifest (or manifestv2, etc) can
run
hg clone --config experimental.treemanifest repo clone
which will create a new repo with the requirement set. Unlike the case
of e.g. generaldelta, it will not rewrite the changesets, since tree
manifests hash differently.
I keep having to ponder out what readfast() means, and it always
surprises me. Document the return type in the docstring so that future
readers won't have to puzzle this out again.
When we start to lazily load submanifests, it will be useful to be
able to create an treemanifest instance before manifest data gets
parsed into it. To prepare for this, extract the parsing code from
treemanifest's constructor to a separate method.
When we start writing submanifests to their own revlogs, we will not
want to write a new revision for a directory if there were no changes
to it. To prepare for this, duplicate the call to addrevision() and
move them earlier where they can more easily be avoided.
When we start writing tree manifests with one manifest revlog per
directory, it will still be nice to be able to run tests using tree
manifests in memory but writing to a flat manifest to a single
revlog. Let's break the current '_usetreemanifest' flag on the revlog
into '_treeinmem' and '_treeondisk'. Both are populated from the same
config, but after this change, one can temporarily hard-code
_treeinmem=True to see that tests still pass.
manifestdict() creates an empty manifestdict, so let's consistently
use that instead of explicitly parsing an empty string (which does
result in an empty manifest).
The condition on which manifestdict.matches() and manifestdict.walk()
take the fast path of iterating over files instead of the manifest, is
slightly different. Specifically, walk() does not take the fast path
for exact matchers and it does not avoid taking the fast path when
there are more than 100 files. Let's extract the condition so we don't
have to maintain it in two places and so walk() can gain these two
missing pieces of the condition (although there seems to be no current
caller of walk() with an exact matcher).
When checking whether we can take the fast path of iterating over
matcher files instead of manifest files, we check whether
match.files() is non-empty. However, now that return early for
match.always(), it can only be empty when there are only
include/exclude patterns, but in that case anypats() will be True, so
it's already covered. This makes manifestdict.walk() more similar to
manifestdict.matches().
This cuts down the run time of
hg files -r . > /dev/null
from ~0.850s to ~0.780s on the Firefox repo. Note that
manifest.matches() already has the corresponding optimization.
This makes treemanifest.walk() not visit submanifests that are known not to
have any matching files. It does this by calling match.visitdir() on
submanifests as it walks.
This change also updates largefiles to be able to work with this new behavior
in treemanifests. It overrides match.visitdir(), the function that dictates
how walk() and matches() skip over directories.
The greatest speed improvements are seen with narrower scopes. For example,
this commit speeds up the following command on the Mozilla repo from 1.14s
to 1.02s:
hg files -r . dom/apps/
Whereas with a wider scope, dom/, the speed only improves from 1.21s to 1.13s.
As with similar a similar optimization to treemanifest.matches(), this change
will bring out even bigger performance improvements once treemanifests are
loaded lazily. Once that happens, we won't just skip over looking at
submanifests, but we'll skip even loading them.
The _intersectfiles() method is only called from one place, it's
pretty short, and its caller has to be aware when it's appropriate to
call it (when the number of files in the matcher is not too large), so
let's inline it.