The various status types are currently documented on the
dirstate.status() method. Now that we have a class for the status
types, it makese sense to document the status types there
instead. Only leave the bits related to lookup/unsure in the status()
method documentation.
Callers of various status() methods (on dirstate, context, repo) get a
tuple of 7 elements, where each element is a list of files. This
results in lots of uses of indexes where names would be much more
readable. For example, "status.ignored" seems clearer than "status[4]"
[1]. So, let's introduce a simple named tuple containing the 7 status
fields: modified, added, removed, deleted, unknown, ignored, clean.
This patch introduces the class and updates the status methods to
return instances of it. Later patches will update the callers.
[1] Did you even notice that it should have been "status[5]"?
(tweaked by mpm to introduce the class in scmutil and only change one user)
The status tuple returned from dirstate.status() has an additional
field compared to the other status tuples: lookup/unsure. This field
is just an optimization and not something most callers care about
(they want the resolved value of 'modified' or 'clean'). To prepare
for a single future status type, let's separate out the 'lookup' field
from the rest by having dirstate.status() return a pair: (lookup,
status).
This is a small win on OS X. hg perfdirstatefoldmap:
before: wall 0.399708 comb 0.410000 user 0.390000 sys 0.020000 (best of 25)
after: wall 0.386331 comb 0.390000 user 0.370000 sys 0.020000 (best of 25)
Adds an exception when calling dirstate.setparent without having first called
dirstate.beginparentchange. This will prevent people from writing code that
modifies the dirstate parent without considering the transactionality of their
change.
This will break third party extensions that call setparents.
It's possible for the dirstate to become incoherent (issue4353) if there is an
exception in the middle of the dirstate parent and entries being written (like
if the user ctrl+c's). This change adds begin/endparentchange which a future
patch will require to be set before changing the dirstate parent. This will
allow us to prevent writing the dirstate in the event of an exception while
changing the parent.
Current callers that require just this data call workingctx.walk, which calls
dirstate.walk, which stats all the files. Even worse, workingctx.walk looks for
unknown files, significantly slowing things down, even though callers might not
be interested in them at all.
Even though "dirstate.write()" is invoked explicitly after "normal"
invocations, timestamp field of entries may be still "unset" in the
"dirstate" file itself , because "pack_dirstate" drops it when it is
equal to the timestamp of "dirstate" file itself.
This can avoid overlooking modification of files, which are updated at
same time in the second. But on the other hand, this may hide timing
critical problems.
For example, incorrect "normal"-ing (or lack of "normallookup"-ing on
the already "normal"-ed entry) is visible only when:
- the target file is modified in the working directory at T1, and
- "dirstate" file is written out at T2 (!= T1)
Otherwise, T1 is dropped by "pack_dirstate" in "dirstate.write()"
invocation, and "unset" is stored into "dirstate" file.
It often fails to reproduce problems from incorrect "normal"-ing by
Mercurial testset, because automated actions in the small repository
almost always causes that T1 and T2 are same.
This patch adds the debug feature to delay writing out to ensure
timestamp of each entries explicitly.
This feature is used to make timing critical "dirstate" problems
reproducable in subsequent patches.
With this patch, hg status and hg diff regain their previous speed.
The following tests are run against a working copy with over 270,000 files.
Here, 'before' means without this or the previous patch applied.
Note that in this case `hg perfstatus` isn't representative since it doesn't
take dirstate parsing time into account.
$ time hg status # best of 5
before: 2.03s user 1.25s system 99% cpu 3.290 total
after: 2.01s user 1.25s system 99% cpu 3.261 total
$ time hg diff # best of 5
before: 1.32s user 0.78s system 99% cpu 2.105 total
after: 1.27s user 0.79s system 99% cpu 2.066 total
Previously, while unpacking the dirstate we'd create 3-4 new CPython objects
for most dirstate values:
- the state is a single character string, which is pooled by CPython
- the mode is a new object if it isn't 0 due to being in the lookup set
- the size is a new object if it is greater than 255
- the mtime is a new object if it isn't -1 due to being in the lookup set
- the tuple to contain them all
In some cases such as regular hg status, we actually look at all the objects.
In other cases like hg add, hg status for a subdirectory, or hg status with the
third-party hgwatchman enabled, we look at almost none of the objects.
This patch eliminates most object creation in these cases by defining a custom
C struct that is exposed to Python with an interface similar to a tuple. Only
when tuple elements are actually requested are the respective objects created.
The gains, where they're expected, are significant. The following tests are run
against a working copy with over 270,000 files.
parse_dirstate becomes significantly faster:
$ hg perfdirstate
before: wall 0.186437 comb 0.180000 user 0.160000 sys 0.020000 (best of 35)
after: wall 0.093158 comb 0.100000 user 0.090000 sys 0.010000 (best of 95)
and as a result, several commands benefit:
$ time hg status # with hgwatchman enabled
before: 0.42s user 0.14s system 99% cpu 0.563 total
after: 0.34s user 0.12s system 99% cpu 0.471 total
$ time hg add new-file
before: 0.85s user 0.18s system 99% cpu 1.033 total
after: 0.76s user 0.17s system 99% cpu 0.931 total
There is a slight regression in regular status performance, but this is fixed
in an upcoming patch.
Upcoming patches will switch away from using Python tuples for dirstate values
in compiled builds. Make that easier by introducing a variable called
dirstatetuple, currently set to tuple. In upcoming patches, this will be set to
an object from the parsers module.
dirstate.walk will return unknown files that were explicitly requested, even
if listunknown is false. There's no point in dropping these files on the
floor in dirstate.status.
This has no effect on any current callers, because all of them assume the
unknown list is empty and ignore it. Future callers may find it useful,
though.
On Windows, there are two ways symlinks can manifest themselves:
1. As placeholders: text files containing the symlink's target. This is what
usually happens with fresh clones on Windows.
2. With their dereferenced contents. This happens with clones accessed over NFS
or Samba.
In order to handle case 2, 28af3e0c54f0 made dirstate.status ignore all symlink
placeholders on Windows. It doesn't ignore symlinks in the lookup set, though,
since those don't have the link bit set. This is problematic because it
violates the invariant that `hg status` with every file in the normal set
produces the same output as `hg status` with every file in the lookup set.
With this change, symlink placeholders in the normal set are no longer ignored.
We instead rely on code in localrepo.status that uses heuristics to look for
suspect placeholders.
An upcoming patch will test this out by no longer adding files written in the
last second of an update to the lookup set.
If a directory matched a regex in hgignore but the files inside the directory
did not match the regex, they would appear as deleted in hg status. This
change fixes them to appear normally in hg status.
Removing the ignore(nf) conditional here is ok because it just means we might
stat more files than we had before. My testing on a large repo shows this
causes no performance regression since the only additional files being stat'd
are the ones that are missing (i.e. status=!), which are generally rare.
Before this patch, all files in dirstate are used to build "_foldmap"
up on case insensitive filesystem regardless of their statuses.
For example, when dirstate contains both removed file 'a' and added
file 'A', "_foldmap" may be updated finally by removed file 'a'. This
causes unexpected status information for added file 'A' at "hg status"
invocation.
This patch ignores removed files at building "_foldmap" up on case
insensitive filessytem.
This patch doesn't add any test, because this issue is difficult to
reproduce intentionally: it depends on iteration order of "dirstate._map".
Consider a hypothetical extension that implements walk in a more efficient
manner and skips some known-clean files. However, that can only be done under
some situations, such as when clean files are not being asked for and a
match.traversedir callback is not set. The full flag lets walk tell these two
cases apart.
This makes a big difference to performance in some cases.
hg --time locate 'set:symlink()'
mozilla-central (70,000 files):
before: 2.92 sec
after: 2.47
another repo (170,000 files):
before: 7.87 sec
after: 6.86
An upcoming patch will speed dirstate.walk up by not querying the results dict
when it is empty. This ensures it is in some common cases.
This should be safe because subrepos and .hg aren't part of the dirstate.
The bash_completion code uses "hg status" to generate a list of
possible completions for commands that operate on files in the
working directory. In a large working directory, this can result
in a single tab-completion being very slow (several seconds) as a
result of checking the status of every file, even when there is no
need to check status or no possible matches.
The new debugpathcomplete command gains performance in a few simple
ways:
* Allow completion to operate on just a single directory. When used
to complete the right commands, this considerably reduces the
number of completions returned, at no loss in functionality.
* Never check the status of files. For completions that really must
know if a file is modified, it is faster to use status:
hg status -nm 'glob:myprefix**'
Performance:
Here are the commands used by bash_completion to complete, run in
the root of the mozilla-central working dir (~77,000 files) and
another repo (~165,000 files):
All "normal state" files (used by e.g. remove, revert):
mozilla other
status -nmcd 'glob:**' 1.77 4.10 sec
debugpathcomplete -f -n 0.53 1.26
debugpathcomplete -n 0.17 0.41
("-f" means "complete full paths", rather than the current directory)
Tracked files matching "a":
mozilla other
status -nmcd 'glob:a**' 0.26 0.47
debugpathcomplete -f -n a 0.10 0.24
debugpathcomplete -n a 0.10 0.22
We should be able to further improve completion performance once
the critbit work lands. Right now, our performance is limited by
the need to iterate over all keys in the dirstate.
hg strip -k was using dirstate.rebuild() which reset all the dirstate
entries timestamps to 0. This meant that the next time hg status was
run every file was considered to be 'unsure', which caused it to do
expensive read operations on every filelog. On a repo with >150,000
files it took 70 seconds when everything was in memory. From a cold
cache it took several minutes.
The fix is to only reset files that have changed between the working
context and the destination context.
For reference, --keep means the working directory is left alone during
the strip. We have users wanting to use this operation to store their
work-in-progress as a commit on a branch while they go work on another
branch, then come back later and be able to uncommit that work and
continue working. They currently use 'git reset HARD^' to accomplish
this in git.
util.statfiles returns a generator on python 2.7 with c extensions disabled.
This caused util.statfiles(...) [0] to break in tests. Since we're only
stat'ing one file, I just changed it to call lstat directly.
Previously dirstate.walk would return a stat object for files in the dmap
that have a symlink to a directory in their path. Now it will return None
to indicate that they are no longer considered part of the repository. This
currently only affects walks that traverse the entire directory tree (ex:
hg status) and not walks that only list the contents of the dmap (ex: hg diff).
In a situation like this:
mkdir foo && touch foo/a && hg commit -Am "a"
mv foo bar
ln -s bar foo
'hg status' will now show '! foo/a', whereas before it incorrectly considered
'foo/a' to be unchanged.
In addition to making 'hg status' report the correct information, this will
allow callers to dirstate.walk to not have to detect symlinks themselves,
which can be very expensive.