It seems like the subdirmatcher should be checking if the matcher it's
based on is matching prefixes. It was effectively doing that already
because "prefix() == not always() and not anypats() and not
isexact()", subdirmatcher was checking the first two parts of that
condition and I don't think it will ever be given an "exact" matcher
with it's directory name (because exact matchers are for matching
files, not directories). Still, let's switch to using prefix() for
clarity (and because I'm trying to remove code that reaches for
matchers internals).
Because _rootsanddirs() returns a list of directories to visit
recursively and a list of directories to visit non-recursively. For
patterns such as 'rootfilesin:foo/bar', we clearly need to visit the
directory foo/bar, but we also need to visit its parents. The method
therefore uses util.dirs() to find the parent directories of
'foo/bar'. That method does not include the root directory, but since
we obviously need to visit the root directory, we always added '.' to
the set of directories to visit non-recursively.
The visitdir() method had special handling to consider set(['.']) to
mean that no includes had been specified and would thus visit all
directories. However, when the pattern is 'rootfilesin:.', set(['.'])
is actually the real set of directories to visit and the special
handling of that set meant that all directories got visited instead of
just the root directory.
The fix is simple: add '.' to the set of parent directories in
_rootsanddirs() and stop treating set(['.']) specially. This makes
hg files -r . -I rootfilesin:.
in a treemanifest version of the Firefox repo go from 1.5s to 0.26s on
warm disk (and a *much* bigger improvement on cold disk).
Note that the -I is necessary for no good reason. We just haven't
optimized visitdir() for regular (non-include, non-exclude) patterns
yet.
The matcher subinclude functionality allows us to have .hgignore files that
include subdirectory hgignore files. Today it parses the entire repo at once,
even if we only need to test a file in one subdirectory. This patch makes the
subinclude tree creation lazy, which speeds up matcher creation significantly in
large repos with very large trees of ignore patterns.
Changeset ba7f2a1cc2d2 removed the mutable default value, but did not explicitly
tested for None. Such implicit testing can introduce semantic and performance
issue. We move to an explicit testing for None as recommended by PEP8:
https://www.python.org/dev/peps/pep-0008/#programming-recommendations
There shouldn't be a big perf hit creating a new object because
this function is complicated and does things that dwarf the cost
of creating a new PyObject.
Primarily as an optimization to avoid recursing into directories that will
never have a match inside, this classifies each matcher pattern's root as
recursive or non-recursive (erring on the side of keeping it recursive,
which may lead to wasteful directory or manifest walks that yield no matches).
I measured the performance of "rootfilesin" in two repos:
- The Firefox repo with tree manifests, with
"hg files -r . -I rootfilesin:browser".
The browser directory contains about 3K files across 249 subdirectories.
- A specific Google-internal directory which contains 75K files across 19K
subdirectories, with "hg files -r . -I rootfilesin:REDACTED".
I tested with both cold and warm disk caches. Cold cache was produced by
running "sync; echo 3 > /proc/sys/vm/drop_caches". Warm cache was produced
by re-running the same command a few times.
These were the results:
Cold cache Warm cache
Before After Before After
firefox 0m5.1s 0m2.18s 0m0.22s 0m0.14s
google3 dir 2m3.9s 0m1.57s 0m8.12s 0m0.16s
Certain extensions, notably narrowhg, can depend on this for correctness
(not trying to recurse into directories for which it has no information).
This adds a new "rootfilesin" matcher type which matches files inside a
directory, but not any subdirectories (so it matches non-recursively).
This has the "root" prefix per foozy's plan for other matchers (rootglob,
rootpath, cwdre, etc.).
The manifest.manifest class has a _treeinmem member than one can
manually set to True to test that the treemanifest class works as a
drop-in replacement for manifestdict (which is mostly a requirement
for treemanifest repos to work). However, it doesn't quite work at the
moment. These tests fail:
test-largefiles-misc.t
test-rebase-newancestor.t
test-subrepo.t
test-subrepo-deep-nested-change.t
test-subrepo-recursion.t
All but test-rebase-newancestor.t fail because they trigger calls to
subdirmatcher.visitdir(), which tries to access a _excluderoots field
that does not exist on the subdirmatcher. Let's fix that by overriding
visitdir() in a similar way to how matchfn is overridden, i.e. by
prepending the directory before calling the superclass method.
Before a4236180df5e (match: remove unnecessary optimization where
visitdir() returns 'all', 2015-05-06), match.visitdir() used to return
the special value 'all' to indicate that it was known that all
subdirectories would also be included in the match. The purpose for
that value was to avoid calling the matcher on all the paths. It
turned out that calling the matcher was not a problem, so the special
return value was removed and the code was simplified. However, if we
use the same special value for not just avoiding calling the matcher
on each file, but to avoid iterating over each file, it's a much
bigger win. On commands like
hg st --rev .^ --rev . dom/
we run the matcher (dom/) on the two manifests, then diff the narrowed
manifest. If the size of the match is much larger than the size of the
diff, this is wasteful. In the above case, we would end up iterating
over the 15k-or-so files in dom/ for each of the manifests, only to
later discover that they are mostly the same. This means that runningt
the command above is usually slower than getting the status for the
entire repo, because that code avoids calling treemanifest.match() and
only calls treemanifest.diff(), which loads only what's needed for the
diff.
Let's fix this by reintroducing the 'all' value in match.visitdir()
and making treemanifest.match() return a lazy copy of the manifest
from dom/ and down (in the above case). This speeds up the above
command on the Firefox repo from 0.357s to 0.137s (best of 5). The
wider the match, the bigger the speedup.
This has a small, but measurable, effect on performance if a pattern
file is very large. In an artificial test with 200,000 lines of
pattern data, using re2 reduced read time by 200 milliseconds.
The home of 'Abort' is 'error' not 'util' however, a lot of code seems to be
confused about that and gives all the credit to 'util' instead of the
hardworking 'error'. In a spirit of equity, we break the cycle of injustice and
give back to 'error' the respect it deserves. And screw that 'util' poser.
For great justice.
The problem was that the former name and the new name are both normalized to the
case in dirstate, so matcher._files would be ['ABC.txt', 'ABC.txt'].
localrepo.commit() calls localrepo.status(), passing along the matcher. Inside
dirstate.status(), _walkexplicit() simply grabs matcher.files() and processes
those items. Since the old name isn't present, it is silently dropped. There's
a fundamental tension here, because the status command should also accept files
that don't match the filesystem, so we can't drop the normalization in status.
The problem originated in d70aa474bd84.
Unfortunately with this change, the case of the old file must still be specified
exactly, or the old file is again silently excluded. I went back to
d70aa474bd84^, and that had the same behavior, so we are no worse off. I'm open
to ideas from a matcher or dirstate expert on how to fix that half.
Since 862fc5f60fef, .hgignore is ignored on Windows because a pat may have
a drive letter, but pathutil.join is posixpath.join.
"z:\foo\bar/z:\foo\bar\.hgignore"
Instead, this patch uses os.path.join() and util.localpath() to process both
parts as file-system paths.
Maybe we can remove os.path.join() at dirstate._ignore because 'include:' is
resolved relative to repo root? It was introduced by 29ce78e5f35c.
When reading pattern files, we just call open(path), which is relative to the
current directory. Let's fix this by resolving the paths before attempting to
read the file.
Python 2.6 introduced the "except type as instance" syntax, replacing
the "except type, instance" syntax that came before. Python 3 dropped
support for the latter syntax. Since we no longer support Python 2.4 or
2.5, we have no need to continue supporting the "except type, instance".
This patch mass rewrites the exception syntax to be Python 2.6+ and
Python 3 compatible.
This patch was produced by running `2to3 -f except -w -n .`.
The files command supports naming directories to limit the output to children of
that directory, and it also supports -S to force recursion into a subrepo. But
previously, using -S and naming a subrepo caused nothing to be output. The
reason was narrowmatcher() strips the current subrepo path off of each file,
which would leave an empty list if only the subrepo was named.
When matching on workingctx, dirstate.matches() would see the submatcher is not
always(), so it returned the list of files in dmap for each file in the matcher-
namely, an empty list. If a directory in a subrepo was named, the output was as
expected, so this was inconsistent.
The 'not anypats()' check is enforced by an existing test around line 140:
$ hg remove -I 're:.*.txt' sub1
Without the check, this removed all of the files in the subrepo.
This class is only needed on case insensitive filesystems, and only
for wdir context matches. It allows the user to not match the case of
the items in the filesystem- especially for naming directories, which
dirstate doesn't handle[1]. Making dirstate handle mismatched
directory cases is too expensive[2].
Since dirstate doesn't apply to committed csets, this is only created by
overriding basectx.match() in workingctx, and only on icasefs. The default
arguments have been dropped, because the ctx must be passed to the matcher in
order to function.
For operations that can apply to both wdir and some other context, this ends up
normalizing the filename to the case as it exists in the filesystem, and using
that case for the lookup in the other context. See the diff example in the
test.
Previously, given a directory with an inexact case:
- add worked as expected
- diff, forget and status would silently ignore the request
- files would exit with 1
- commit, revert and remove would fail (even when the commands leading up to
them worked):
$ hg ci -m "AbCDef" capsdir1/capsdir
abort: CapsDir1/CapsDir: no match under directory!
$ hg revert -r '.^' capsdir1/capsdir
capsdir1\capsdir: no such file in rev 64dae27060b7
$ hg remove capsdir1/capsdir
not removing capsdir1\capsdir: no tracked files
[1]
Globs are normalized, so that the -I and -X don't need to be specified with a
case match. Without that, the second last remove (with -X) removes the files,
leaving nothing for the last remove. However, specifying the files as
'glob:**.Txt' does not work. Perhaps this requires 're.IGNORECASE'?
There are only a handful of places that create matchers directly, instead of
being routed through the context.match() method. Some may benefit from changing
over to using ctx.match() as a factory function:
revset.checkstatus()
revset.contains()
revset.filelog()
revset._matchfiles()
localrepository._loadfilter()
ignore.ignore()
fileset.subrepo()
filemerge._picktool()
overrides.addlargefiles()
lfcommands.lfconvert()
kwtemplate.__init__()
eolfile.__init__()
eolfile.checkrev()
acl.buildmatch()
Currently, a toplevel subrepo can be named with an inexact case. However, the
path auditor gets in the way of naming _anything_ in the subrepo if the top
level case doesn't match. That is trickier to handle, because there's the user
provided case, the case in the filesystem, and the case stored in .hgsub. This
can be fixed next cycle.
--- a/tests/test-subrepo-deep-nested-change.t
+++ b/tests/test-subrepo-deep-nested-change.t
@@ -170,8 +170,15 @@
R sub1/sub2/test.txt
$ hg update -Cq
$ touch sub1/sub2/folder/bar
+#if icasefs
+ $ hg addremove Sub1/sub2
+ abort: path 'Sub1\sub2' is inside nested repo 'Sub1'
+ [255]
+ $ hg -q addremove sub1/sub2
+#else
$ hg addremove sub1/sub2
adding sub1/sub2/folder/bar (glob)
+#endif
$ hg status -S
A sub1/sub2/folder/bar
? foo/bar/abc
The narrowmatcher class may need to be tweaked when that is fixed.
[1] http://www.selenic.com/pipermail/mercurial-devel/2015-April/068183.html
[2] http://www.selenic.com/pipermail/mercurial-devel/2015-April/068191.html
The matches function was previously traversing all submanifests to look for
matching files, even though it was possible to know if a submanifest won't
contain any matches.
This change adds a visitdir function on the match object to decide quickly if
a directory should be visited when traversing. The function also decides if
_all_ subdirectories should be traversed.
Adding this logic as methods on the match object also makes the logic
modifiable by extensions, such as largefiles.
An example of a command this speeds up is running
hg status --rev .^ python/
on the Mozilla repo with the treemanifest experiment enabled.
It goes from 2.03s to 1.85s.
More improvements to speed from this change will happen when treemanifests are
lazily loaded. Because a flat manifest is still loaded and then converted
into treemanifests, speed improvements are limited.
This change has no negative effect on speed. For a worst-case example, this
command is not negatively impacted:
hg status --rev .^ 'relglob:*.js'
on the Mozilla repo. It goes from 2.83s to 2.82s.
Several commands take the fast path when match.always() is
true. However, when the user passes "." on the command line, that
results in a matcher for which match.always() == False. Let's make it
so such matchers return True, and have an empty list of .files(). This
makes e.g. "hg log ." as fast as "hg log" and "hg revert ." as fast as
"hg revert --all" (when run from repo root).
This makes _includeroots more like _fileroots and gives visitdir() a
nice symmetry in the two.
I'm hoping to later combine the two (_fileroots and _includeroots),
and having them treated similarly should make that step easier to
follow.
It seems unlikely that the optimization to avoid calling util.finddirs
twice will be noticeable, so let's drop it. This makes the two
conditions for includes and regular patterns more similar.
The repo root is nothing special when it comes to what directories to
visit: patterns like '-X relglob:*.py' should not exclude the top
directory, while '-X path:.' should (pointless as such a pattern may
be). The explicit removal of '.' from the set of excluded roots was
probably there to avoid removing the the root directory when any
patterns had been given, but since a619025a07b5 (treemanifest: visit
directory 'foo' when given e.g. '-X foo/ba?', 2015-05-27), we only
exclude directories that should be completely excluded, so we no
longer need to special-case the root directory.
The full path is propagated to the original match object since this is often
used directly for printing a file name to the user. This is cleaner than
requiring each caller to join the prefix with the file name prior to calling it,
and will lead to not having to pass the prefix around separately. It is also
consistent with the bad() and abs() methods in terms of the required input. The
uipath() method now inherits this path building property.
There is no visible change in path style for rel() because it ultimately calls
util.pathto(), which returns an os.sep based path. (The previous os.path.join()
was violating the documented usage of util.pathto(), that its third parameter be
'/' separated.) The doctest needed to be normalized to '/' separators to avoid
test differences on Windows, now that a full path is returned for a short
filename.
The test changes are to drop globs that are no longer necessary when printing an
absolute file in a subrepo, as returned from match.uipath(). Previously when
os.path.join() was used to add the prefix, the absolute path to a file in a
subrepo was printed with a mix of '/' and '\'. The absolute path for a file not
in a subrepo, as returned from match.uipath(), is still purely '/' based.
This is a utility to make it easier for subrepos to convert a file name to the
full path rooted at the top repository. It can replace the various path joining
lambdas, and doesn't require the prefix to be passed into the method that wishes
to build such a path.
The name is derived from the following pattern in annotate() and other places:
name = ((pats and rel) or abs)
The pathname separator is not os.sep in part to avoid confusion with variables
named 'abs' or similar that _are_ '/' separated, and in part because some
methods like cmdutils.forget() and maybe cmdutils.add() need to build a '/'
separated path to the root repo. This can replace the manual path building
there.
Various bit of code replace the bad method on matchers, and then delegate to the
original bad method after doing some custom processing. At least some of these
forget to restore the original method when the need has passed, and then when
the matcher is passed to the next subrepo (even a sibling), another layer is
added such that the chain looks like:
bad2 -> bad1 -> original
At best, this is a waste of processing, but sometimes spurious messages can be
emitted (e.g. 9f1524f400d9).
The trick with this copy of the matcher is to make sure it is *not* passed to
any subrepo- the original must be passed instead.
For globs like 'foo/ba?', match._roots() will return 'foo'. Since
visitdir(), excludes directories in the excluded roots, it would skip
the entire foo directory. This is incorrect, since 'foo/ba?' doesn't
mean that everything in foo/ should be exluded. Note that visitdir()
is called only from the treemanifest class, so this only affects tree
manifests. Fix by adding roots to the set of excluded roots only if
there are no excluded patterns.
Since 'glob' is the default pattern type for globs, we also need to
update some -X patterns in the tests to be of 'path' type to take
advantage of the visitdir tricks. For consistency, also update the -I
patterns.
It seems a little unfortunate that 'foo' in 'hg files -X foo' is
considered a pattern because of the implied 'glob' type, but improving
that is left for another day.