This is about 9 times faster than the Python dirstate packing code.
The relatively small speedup is due to the poor locality and memory
access patterns caused by traversing dicts and other boxed Python
values.
The fix introduced in 3509b9cf8f86 was only partially successful. It is correct
to turn dirstate 'm' merge records into normal/dirty ones but copy records are
lost in the process. To adjust them as well, we need to look in the first
parent manifest to know which files were added and preserve only related
records. But the dirstate does not have access to changesets, so the logic has to
be moved to another level, in localrepo.
The original issue was something like:
$ hg init repo
$ cd repo
$ mkdir D
$ echo a > D/a
$ hg ci -Am adda
adding D/a
$ mv D temp
$ mv temp d
$ echo b > d/b
$ hg add d/b
adding D/b
$ hg ci -m addb
$ hg mv d/b d/c
moving D/b to d/c
$ hg st
A d/c
R D/b
Here we expected:
A D/c
R D/b
the logic being we try to preserve case of path components already known in the
dirstate. This is fixed by the current patch.
Note the following stories are still not supported:
Changing directory case
$ hg mv D d
moving D/a to D/D/a
moving D/b to D/D/b
$ hg st
A D/D/a
A D/D/b
R D/a
R D/b
or:
$ hg mv D/* d
D/a: not overwriting - file exists
D/b: not overwriting - file exists
And if they were, there are probably similar issues with diffing/patching.
When rebasing, if a conflict occurs and is resolved in a way the rebased
revision becomes empty, it is not skipped, unlike revisions being emptied
without conflicts.
The reason is:
- File 'x' is merged and resolved, merge.update() marks it as 'm' in the
dirstate.
- rebase.concludenode() calls localrepo.commit(), which calls
localrepo.status() which calls dirstate.status(). 'x' shows up as 'm' and is
unconditionally added to the modified files list, instead of being checked
again.
- localrepo.commit() detects 'x' as changed and creates a new revision where only
the manifest parents and linkrev differ.
Marking 'x' as modified without checking it makes sense for regular merges. But
in rebase case, the merge looks normal but the second parent is usually
discarded. When this happens, 'm' files in dirstate are a bit irrelevant and
should be considered 'n' possibly dirty instead. That is what the current patch
does.
Another approach, maybe more efficient, would be to pass another flag to
merge.update() saying the 'branchmerge' is a bit of a lie and recordupdate()
should call dirstate.normallookup() instead of merge().
It is also tempting to add this logic to dirstate.setparents(), moving from two
to one parent is what invalidates the 'm' markers. But this is a far bigger
change to make.
v2: succumb to the temptation and move the logic in dirstate.setparents(). mpm
suggested trying _filecommit() first but it is called by commitctx() which
knows nothing about the dirstate and comes too late into the game. A second
approach was to rewrite the 'm' state into 'n' on the fly in dirstate.status()
which failed for graft in the following case:
$ hg init repo
$ cd repo
$ echo a > a
$ hg ci -qAm0
$ echo a >> a
$ hg ci -m1
$ hg up 0
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
$ hg mv a b
$ echo c > b
$ hg ci -m2
created new head
$ hg graft 1 --tool internal:local
grafting revision 1
$ hg --config extensions.graphlog= glog --template '{rev} {desc|firstline}\n'
@ 3 1
|
o 2 2
|
| o 1 1
|/
o 0 0
$ hg log -r 3 --debug --patch --git --copies
changeset: 3:19cd7d1417952af13161b94c32e901769104560c
tag: tip
phase: draft
parent: 2:b5c505595c9e9a12d5dd457919c143e05fc16fb8
parent: -1:0000000000000000000000000000000000000000
manifest: 3:3d27ce8d02241aa59b60804805edf103c5c0cda4
user: test
date: Thu Jan 01 00:00:00 1970 +0000
extra: branch=default
extra: source=a03df74c41413a75c0a42997fc36c2de97b26658
description:
1
Here, revision 3 is created because there is a copy record for 'b' in the
dirstate and thus 'b' is considered modified. But this information is discarded
at commit time since 'b' content is unchanged. I do not know if discarding this
information is correct or not, but at this time we cannot represent it anyway.
This patch therefore implements the last solution of moving the logic into
dirstate.setparents(). It does not sound crazy as 'm' files make no sense with
only one parent. It also makes dirstate.merge() call .normallookup() if there
is one parent, to preserve the invariant.
I am a bit concerned about introducing this kind of stateful behaviour to
existing code which historically treated setparents() as a basic setter without
side-effects. And doing that during the code freeze.
file in nested directory causes unexpected abort.
problems below should be fixed for recursive normalization route in
dirstate._normalize():
1. rsplit() may cause unpacking into more than 2 elements.
it should be called with 'maxsplit' argument to unpack
into 'd, f'
2. 'd' is replaced by normalized value prefixed with
'self._root', but this makes 'folded' as absolute path,
and it is unexpected one for caller of recursive
normalization
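The first point can be shown with plain string operations. This is a minimal sketch; the helper name split_dir_file is hypothetical and is not the dirstate code itself.

```python
def split_dir_file(path):
    """Split a '/'-separated path into (directory, filename).

    Hypothetical helper used only to illustrate the 'maxsplit' argument.
    """
    # maxsplit=1 guarantees at most two elements, however deep the path is.
    d, f = path.rsplit('/', 1)
    return d, f


# Without maxsplit a nested path produces more than two elements, so
# unpacking the result into 'd, f' would raise ValueError.
parts = 'a/b/c'.rsplit('/')
nested = split_dir_file('a/b/c')
```

Here parts has three elements, while nested is always a pair.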
on icasefs, "hg qnew" fails to import a letter-case change of a filename
that has already occurred in the working directory, for example:
$ hg rename a tmp
$ hg rename tmp A
$ hg qnew casechange
$ hg status
R a
$
"hg qnew" invokes 'dirstate.walk()' via 'localrepository.commit()'
with 'exact match' matching object having exact filenames of targets
in its 'files()'.
current implementation of 'dirstate.walk()' always normalizes letter
case of filenames from 'match.files()' on icasefs, even though exact
matching is required.
then, files only different in letter case are treated as one file.
this patch prevents 'dirstate.walk()' from normalizing, if exact
matching is required, even on icasefs.
filenames for 'exact matching' are given not from user command line,
but from dirstate walk result, manifest of changecontext, patch files
or fixed list for specific system files (e.g.: '.hgtags').
in such case, case normalization should not be done, so this patch
works well.
path to repo root may contain a case sensitive part, even though repo
is located in case insensitive filesystem: e.g. repo in FAT32 device
mounted on Unix.
so, case normalized root causes failure of stat(2).
this patch uses case preserved root for 'util.fspath()' invocation to
avoid this problem.
case preserved root for 'util.fspath()' may decrease efficiency of
fspath cache, but 'util.fspath()' is currently called only from
dirstate, so this fix has less impact.
this patch adds 'dirs()' to changectx/workingctx, which returns map of
all directories deduced from manifest, to examine whether specified
pattern is related to the context as directory or not quickly.
'workingctx.dirs()' uses 'dirstate.dirs()' rather than building
another copy of it.
'dirstate._normalize()', the only caller of 'util.fspath()', has
already normcase()-ed path before invocation of it.
normcase()-ed root can be cached on dirstate side, too.
so, this patch changes 'util.fspath()' API specification to avoid
normcase()-ing in it.
at the start of dirstate.walk() on a case insensitive filesystem,
normalization of '.' causes util.fspath() invocation, but '.' is not
cached in it.
this invocation is not only useless, but also harmful: initial "hg
tag" causes creation of ".hgtags" file after dirstate.walk(), and
looking up ".hgtags" in cache will fail, because directory contents of
root is already cached at util.fspath() invocation for '.'.
Complex merges with divergent renames can cause a file to be 'moved'
twice, causing dirstate.drop() to be called twice. Rather than try to
ensure there are no unexpected corner cases where this can happen, we
simply ignore drops of files that aren't tracked.
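A minimal sketch of the tolerant drop, assuming the tracked files live in a plain dict. The class TinyDirstate is a hypothetical stand-in, not the real dirstate.

```python
class TinyDirstate:
    """Hypothetical stand-in that only keeps a file -> state map."""

    def __init__(self):
        self._map = {}

    def add(self, f):
        self._map[f] = 'a'

    def drop(self, f):
        # Dropping an untracked file is a no-op instead of a KeyError, so a
        # file 'moved' twice by a merge can be dropped twice safely.
        if f in self._map:
            del self._map[f]


ds = TinyDirstate()
ds.add('x')
ds.drop('x')
ds.drop('x')  # second drop is silently ignored
```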
Before this patch, Windows always did the wrong thing with exec bits
when committing a merge: consult the flags in first parent.
Now we manually recompute the result of merging flags at commit time,
which almost always does the right thing (except when there are
conflicts between symlink and exec flags).
To do this, we:
- pull flag synthesis out into its own function
- delay building this function unless it's needed
- add a merge case that compares flags in local and other against the ancestor
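The merge case in the last item can be sketched as a three-way comparison. The function name merge_flags is hypothetical, and returning an empty flags string on conflict is an assumption made for this illustration.

```python
def merge_flags(local, other, ancestor):
    """Three-way merge of a flags string such as '', 'x' or 'l'.

    Hypothetical helper. When both sides changed the flags in different
    ways it gives up and returns '' (an assumption of this sketch).
    """
    if local == other:
        return local      # both sides agree
    if local == ancestor:
        return other      # only the other side changed the flags
    if other == ancestor:
        return local      # only the local side changed the flags
    return ''             # conflicting changes: punt


# Only the other side set the exec bit, so the merge result has it.
merged = merge_flags('', 'x', '')
```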
This has been tested in multiple ways on Linux:
- running the whole test suite with both old and new code in place,
checking for differences in each flags() result
- running the whole test suite while comparing real on-disk flags
against synthetic ones for merges
- test-issue1802 (from Martin Geisler) which disables exec bit
checking on Unix
The usual contract is that close() makes your writes permanent, so
atomictempfile's use of close() to *discard* writes (and rename() to
keep them) is rather unexpected. Thus, change it so close() makes
things permanent and add a new discard() method to throw them away.
discard() is only used internally, in __del__(), to ensure that writes
are discarded when an atomictempfile object goes out of scope.
I audited mercurial.*, hgext.*, and ~80 third-party extensions, and
found no one using the existing semantics of close() to discard
writes, so this should be safe.
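The new contract can be sketched as follows. This is an illustration under stated assumptions, not Mercurial's atomictempfile: the class name, the temporary-file naming and the use of os.replace() are all choices made for the example.

```python
import os
import tempfile


class AtomicTempFile:
    """Illustrative sketch: close() keeps the writes, discard() drops them."""

    def __init__(self, name):
        self._name = name
        d = os.path.dirname(name) or '.'
        # Write to a temporary file next to the target.
        fd, self._tempname = tempfile.mkstemp(prefix='.tmp-', dir=d)
        self._fp = os.fdopen(fd, 'wb')

    def write(self, data):
        self._fp.write(data)

    def close(self):
        # close() makes the writes permanent by renaming over the target.
        if not self._fp.closed:
            self._fp.close()
            os.replace(self._tempname, self._name)

    def discard(self):
        # discard() throws the writes away.
        if not self._fp.closed:
            self._fp.close()
            try:
                os.unlink(self._tempname)
            except OSError:
                pass

    def __del__(self):
        # An object that goes out of scope unclosed loses its writes.
        if getattr(self, '_fp', None) is not None:
            self.discard()
```

With this shape, a caller that forgets to call close() leaves the target untouched, which matches the usual expectation for file-like objects.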
It has substantially different semantics from forget at the command
layer, so change it to avoid confusion.
We can't simply combine it with remove because we need to explicitly
drop non-added files in some cases like commit.
These leaks may occur in environments that don't employ a reference
counting GC, i.e. PyPy.
This implies:
- changing opener(...).read() calls to opener.read(...)
- changing opener(...).write() calls to opener.write(...)
- changing open(...).read(...) to util.readfile(...)
- changing open(...).write(...) to util.writefile(...)
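A rough sketch of what such helpers need to do, assuming they only have to read or write a whole file and close it before returning. The bodies below are written for illustration and are not copied from util.

```python
import os
import tempfile


def readfile(path):
    """Read a whole file and close it deterministically."""
    with open(path, 'rb') as fp:
        return fp.read()


def writefile(path, data):
    """Write data to path and close the file before returning."""
    with open(path, 'wb') as fp:
        fp.write(data)


# Usage: round-trip through a temporary file without relying on
# reference counting to close the handles.
path = os.path.join(tempfile.mkdtemp(), 'example')
writefile(path, b'data')
content = readfile(path)
```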
We can get rid of the _lastnormal set by using the filesystem mtimes to
identify the problematic "lastnormal" files on status(), forcing a file
content-comparison if the file's mtime timeslot is equal to _lastnormaltime.
- consistently use mtime as mapped to dirstate granularity (needed for
filesystems like NTFS, which have sub-second resolution)
- no need to add files with mtime < _lastnormaltime
- improve comments
(issue2264, issue2516)
The race happens when two commits in a row change the same file
without changing its size, *if* those two commits happen in the same
second in the same process while holding the same repo lock. For
example:
commit 1:
M a
M b
commit 2: # same process, same second, same repo lock
M b # modify b without changing its size
M c
This first manifested in transplant, which is the most common way to
do multiple commits in the same process. But it can manifest in any
script or extension that does multiple commits under the same repo
lock. (Thus, the test script tests both transplant and a custom script.)
The problem was that dirstate.status() failed to notice the change to
b when localrepo is about to do the second commit, meaning that change
gets left in the working directory. In the context of transplant, that
means either a crash ("RuntimeError: nothing committed after
transplant") or a silently inaccurate transplant, depending on whether
any other files were modified by the second transplanted changeset.
The fix is to make status() work a little harder when we have
previously marked files as clean (state 'normal') in the same process.
Specifically, dirstate.normal() adds files to self._lastnormal, and
other state-changing methods remove them. Then dirstate.status() puts
any files in self._lastnormal into state 'lookup', which will make
localrepository.status() read file contents to see if it has really
changed. So we pay a small performance penalty for the second (and
subsequent) commits in the same process, without affecting the common
case. Anything that does lots of status updates and checks in the
same process could suffer a performance hit.
Incidentally, there is a simpler fix: call dirstate.normallookup() on
every file updated by commit() at the end of the commit. The trouble
with that solution is that it imposes a performance penalty on the
common case: it means the next status-dependent hg command after every
"hg commit" will be a little bit slower. The patch here is more
complex, but only affects performance for the uncommon case.
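The race and the fix can be modelled without touching the filesystem. In this simplified sketch the function names and the tuple layout are hypothetical, and any stat mismatch is reported as 'lookup' rather than distinguishing 'modified'.

```python
def looks_clean(entry, st_size, st_mtime):
    """Stat-only check: same size and same whole-second mtime."""
    size, mtime = entry
    return size == st_size and mtime == int(st_mtime)


def status(entries, stats, lastnormal):
    """Classify files as 'clean' or 'lookup' (needs content comparison).

    entries:    {name: (size, mtime)} recorded when the file was marked normal
    stats:      {name: (size, mtime)} as currently on disk
    lastnormal: names marked normal earlier in this process
    """
    result = {}
    for name, entry in entries.items():
        size, mtime = stats[name]
        if name in lastnormal or not looks_clean(entry, size, mtime):
            result[name] = 'lookup'
        else:
            result[name] = 'clean'
    return result


# 'b' was rewritten in the same second without changing its size.
entries = {'b': (2, 100)}
stats = {'b': (2, 100.7)}
fooled = status(entries, stats, set())   # stat data alone misses the change
guarded = status(entries, stats, {'b'})  # lastnormal forces a content check
```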
Add missing calls to close() to many places where files are
opened. Relying on reference counting to catch them soon-ish is not
portable and fails in environments with a proper GC, such as PyPy.
split can be more readable for longer lists like the list in
dirstate.invalidate. As dirstate.invalidate is used in wlock() and therefore
used heavily, I think it's worth avoiding a split there too.
Previously, branch names were ideally manipulated as UTF-8 strings,
because they were stored as UTF-8 in the dirstate and the changelog
and could not be safely converted to the local encoding and back.
However, only about 80% of branch name code was actually using the
right encoding conventions. This patch uses the localstr addition to
allow working on branch names as local strings, which simplifies
handling so that the previously incorrect code becomes correct.
This gives the repository control over which nested repository paths
should be allowed via the custom path auditor.
Since paths into subrepositories are now allowed, dirstate.walk must
now filter away more paths than before.
When the filesystem cannot handle the executable bit, we currently
ignore it completely when looking for modified files. Similarly, it is
impossible to set or clear the bit when the filesystem ignores it.
This patch makes Mercurial treat symbolic links the same way.
Symlinks are a little different since they manifest themselves as
small files containing a filename (the symlink target). On Windows,
these files show up as regular files, and on Linux and Mac they show
up as real symlinks.
Issue1888 presents a case where the symlink files are better ignored
from the Windows side. A Linux client creates symlinks in a working
copy which is shared over a network between Linux and Windows clients.
The Samba server is helpful and dereferences the symlink when the
Windows client looks at it. This means that Mercurial on the Windows
side sees file content instead of a file name in the symlink, and
hence flags the link as modified. Ignoring the change would be much
more helpful, similarly to how Mercurial does not report any changes
when executable bits are ignored in a checkout on Windows.
An initial checkout of a symbolic link on a file system that cannot
handle symbolic links will still result in a regular file containing
the target file name as its content. Sharing such a checkout with a
Linux client will not turn the file into a symlink automatically, but
'hg revert' can fix that. After the revert, the Windows client will
see the correct file content (provided by the Samba server when it
follows the link on the Linux side) and otherwise ignore the change.
Running 'hg perfstatus' 10 times gives these results:
      Before:    After:
min:  0.544703   0.546549
med:  0.547592   0.548881
avg:  0.549146   0.548549
max:  0.564112   0.551504
The median time is increased about 0.24%.
The dirstate.granularity configuration parameter was never documented,
it only adds code complexity and it is unneeded.
Adding comments describing forced 'unset' entries.
This change narrows the race guard that was introduced by ffd022830d6d
("dirstate: ignore stat data for files that were updated too recently")
to not discard the _map entry's stat data if the mtime is in the future.
Without this change, status locks files having odd mtimes in the future
into the 'unset' state, causing needless file compares later (admittedly
harmless), but also inflicting highly irritating sticky effects on
tools/plugins that directly read .hg/dirstate (e.g. TortoiseHg).
Bound methods hold a reference to self, so assigning a bound method to
an instance unavoidably creates a cycle. Work around this by choosing
a normalize method at walk time instead. Eliminate default arg while
we're at it.
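The cycle can be observed on CPython with a weak reference. This is a sketch with hypothetical class names; the check relies on CPython's reference counting and would behave differently on other interpreters.

```python
import gc
import weakref


class Cyclic:
    def __init__(self):
        # The bound method references self, and self references the bound
        # method through the instance attribute: a reference cycle.
        self.normalize = self._normalize

    def _normalize(self, path):
        return path.lower()


class Acyclic:
    def _normalize(self, path):
        return path.lower()

    def walk(self, paths):
        # Pick the method at walk time; nothing is stored on the instance.
        normalize = self._normalize
        return [normalize(p) for p in paths]


def freed_by_refcount(cls):
    """Return True if an instance dies as soon as its last name is deleted."""
    gc.disable()  # keep the cycle collector out of the picture
    try:
        obj = cls()
        ref = weakref.ref(obj)
        del obj
        return ref() is None
    finally:
        gc.enable()
        gc.collect()
```

On CPython, freed_by_refcount(Cyclic) is False because only the cycle collector can reclaim the instance, while freed_by_refcount(Acyclic) is True.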
- merge always and match with patterns
- make always and match with patterns the default
- invert dostep3 to skipstep3
- move dirignore test inside exact case
nothing will ever match on match.never
nothing new will match on match.exact (all found in step 1)
nothing new will match on match.match when
there is no pattern and
there is no directory in pats
Copy information was saved in a common loop, then refined in a git-only block.
The problem was the latter did filter out renames occurring in the current
patch and irrelevant to commit. In the non-git case, copy records still existed
in the dirstate, referencing removed files, making the commit fail. Git and
non-git copy handling paths are now separated for simplicity.
Reported by Gary Bernhardt
util module implements two versions of statfiles function
_statfiles calls lstat per file
_statfiles_clustered takes advantage of optimizations in osutil.c, stats all
files in directory at once when new directory is hit and caches the results
util.statfiles dispatches to appropriate version during module loading
The speedup on directory tree with 2k directories and 63k files is about
factor of 1.8 (1.3s -> 0.8s for hg diff - hg startup overhead about .2s)
At this point only Win32 benefits from this patch.
The rest of the OSes use the non-clustered implementation.
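The two strategies can be sketched in pure Python. This is illustrative only: it uses os.scandir() instead of the osutil.c optimizations, and the function names and generator-based interface are assumptions made for the example.

```python
import os
import tempfile


def statfiles_simple(files):
    """One lstat() call per file; None for missing files."""
    for f in files:
        try:
            yield os.lstat(f)
        except OSError:
            yield None


def statfiles_clustered(files):
    """List each directory once, cache its entries, answer from the cache."""
    cache = {}  # directory -> {entry name: stat result}
    for f in files:
        d, base = os.path.split(f)
        d = d or '.'
        if d not in cache:
            try:
                with os.scandir(d) as it:
                    cache[d] = {e.name: e.stat(follow_symlinks=False)
                                for e in it}
            except OSError:
                cache[d] = {}
        yield cache[d].get(base)


# Pick an implementation once, at import time (the platform test is an
# assumption standing in for the real dispatch condition).
statfiles = statfiles_clustered if os.name == 'nt' else statfiles_simple
```

Both functions yield one result per requested file in the same order, so callers do not need to know which one was selected.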
Normcase already takes care of upper/lower case and /->\ conversions.
What's left for normpath is folding of a/../a sequences but this should
be either done consistently on both non-folding and folding code path
or not at all, otherwise we are introducing inconsistent behavior between the
two that has nothing to do with case folding.
Second argument against it - normpath being pure Python function is very slow -
as much as 50% of time is spent just inside normpath call on my repository.
This patch fixes regression reported in 1286 that causes util.fspath
to be called for every file not in current manifest - including ignored files.
The regression is quite severe - the time for simple hg st goes from 5s to 1m38s
on one of my source trees - which basically renders mercurial useless.
Ignore unknown files if we don't need them (eg in hg diff).
It slows things down a little bit for big trees (kernel repo), since _join()
is called for each file instead of for each directory.
fix issue567
- add fast _finddirs function
- remove recursion from incpath/decpath
- split changepath into addpath/droppath
- change relax arg to check
- move incpathcheck logic into addpath
- move incpath into addpath
- move decpath into droppath
- inline code in self._dirs creation
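A non-recursive directory finder and the add/drop pair built on it could look like this. It is a sketch: the DirCounter class is hypothetical, and the path checks mentioned in the list above are left out.

```python
def finddirs(path):
    """Yield every parent directory of a '/'-separated path, deepest first."""
    pos = path.rfind('/')
    while pos != -1:
        yield path[:pos]
        pos = path.rfind('/', 0, pos)


class DirCounter:
    """Hypothetical reference-counted map of directories that contain files."""

    def __init__(self):
        self.dirs = {}

    def addpath(self, path):
        # Count one more file under each parent directory.
        for d in finddirs(path):
            self.dirs[d] = self.dirs.get(d, 0) + 1

    def droppath(self, path):
        # Forget a directory once its last file is gone.
        for d in finddirs(path):
            if self.dirs[d] == 1:
                del self.dirs[d]
            else:
                self.dirs[d] -= 1


counter = DirCounter()
counter.addpath('a/b/c')
counter.addpath('a/d')
counter.droppath('a/b/c')
```

After these calls only the directory 'a' remains, with a count of one.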
add _checklink var to dirstate
introduce dirstate.flagfunc
switch users of util.execfunc/linkfunc to flagfunc
change manifestdict.set to take a flags string
change ctx.fileflags to ctx.flags
change gitmode func to a dict
remove util.execfunc/linkfunc
This method returns the normalised form of a path. This is
- the form in the dirstate, if available, or
- the form on disk, if available, or
- the form passed on the command line
normalize() is called on the type-'f' result of statwalk.
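The lookup order described above can be sketched with two dictionaries. The maps and the use of str.lower() for case folding are assumptions made for the example, not the real data structures.

```python
def normalize(path, dirstate_names, disk_names):
    """Return the preferred spelling of path on a case-folding filesystem.

    dirstate_names and disk_names are hypothetical {folded name: stored name}
    maps standing in for the dirstate and a directory listing.
    """
    folded = path.lower()  # stand-in for the real case-folding function
    if folded in dirstate_names:
        return dirstate_names[folded]  # the form in the dirstate
    if folded in disk_names:
        return disk_names[folded]      # the form on disk
    return path                        # the form passed on the command line


tracked = {'d/b': 'D/b'}
on_disk = {'d/b': 'd/b', 'readme': 'README'}
```

With these maps, normalize('d/B', tracked, on_disk) prefers the dirstate spelling over the one on disk.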
This fixes issues 910 and 1092
This should fix the race where
hg commit foo
<change foo without changing its size>
happens in the same second and status is fooled into thinking foo
is clean.
A configuration item is used to determine the timeout, since different
filesystems may have different requirements (I think VFAT needs 3s,
while most Unix filesystems are fine with 1s).
We encode the previous state as a negative file size (AFAICS, previous
versions of hg always have size == 0 when state == 'r').
We save the state of 'm'erged and dirty files, because they're the
two states that indicate that a file has to be committed on a merge
to correctly record per-file history.
With a pattern like '^directory$' in .hgignore, a "hg status directory"
would still walk "directory" and all its subdirs.
This is the first half of a fix for issue886.
Workaround for dir-changed-to-file updates mentioned
in rev c3f3393b9096 doesn't actually work since tests
introduced in mentioned changeset prevented dirstate
updates even if working directory updates succeeded.
Make tests more relaxed for dirstate operations
not directly accessible from cli. See also issue660.
While here, move _dirs existence check from _decpath()
to _changepath() for unification.
Allow adding to dirstate files that clash with previously existing
but marked for removal. Protect from reintroducing clashes by revert.
This change doesn't address related issues with update. Current
workaround is to do "clean" update by manually removing conflicting
files/dirs from working directory.
read:
- single call to len(st)
- fewer assignments for position tracking
- don't split apart tuple from unpack
- use a literal for the unpack spec
write:
- localize variables and functions
- avoid copied function call
- use % for string concatenation
- shortcircuit decpath if we haven't built the _dirs map
- increment only for leafnodes of directory tree
(this should make construction more like O(n log n) than O(n^2))
After a hg merge, we want to include in the commit all the files that we
got from the second parent, so that we have the correct file-level
history. To make them visible to hg commit, we try to mark them as dirty.
Unfortunately, right now we can't really mark them as dirty[1] - the
best we can do is to mark them as needing a full comparison of their
contents, but they will still be considered clean if they happen to be
identical to the version in the first parent.
This changeset extends the dirstate format in a compatible way, so that
we can mark a file as dirty:
Right now we use a negative file size to indicate we don't have valid
stat data for this entry. In practice, this size is always -1.
This patch uses -2 to indicate that the entry is dirty. Older versions
of hg won't choke on this dirstate, but they may happily mark the file
as clean after a full comparison, destroying all of our hard work.
The patch adds a dirstate.normallookup method with the semantics of the
current normaldirty, and changes normaldirty to forcefully mark the
entry as dirty.
This should fix issue522.
[1] - well, we could put them in state 'm', but that state has a
different meaning.
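The effect of the two sentinel sizes, and why older versions remain compatible but weaker, can be sketched as follows. The constant and function names are hypothetical; only the values -1 and -2 come from the description above.

```python
SIZE_LOOKUP = -1  # no valid stat data: contents must be compared
SIZE_DIRTY = -2   # forced dirty: modified regardless of contents


def classify(size, contents_differ):
    """New behaviour for an entry whose size field is a sentinel."""
    if size == SIZE_DIRTY:
        return 'modified'
    # -1 falls back to a full content comparison.
    return 'modified' if contents_differ else 'clean'


def classify_old(size, contents_differ):
    """Older behaviour: every negative size just means 'compare contents'."""
    return 'modified' if contents_differ else 'clean'


# A file taken from the second parent that happens to match the first parent.
new_result = classify(SIZE_DIRTY, contents_differ=False)
old_result = classify_old(SIZE_DIRTY, contents_differ=False)
```

The new code keeps the file in the commit, while the old code accepts the dirstate but may mark the file clean.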
Theoretically, it's possible to forget modified dirstate
parents by doing:
dirstate.invalidate()
dirstate.setparents(p1, p2)
dirstate._map
The final access to _map should call _read(), which will
unconditionally overwrite dirstate._pl.
This doesn't actually happen right now because invalidate
accidentally ends up rebuilding dirstate._map.
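The theoretical failure can be reproduced with a small model in which invalidate does not rebuild the map. LazyDirstate and its getmap() accessor are hypothetical and only mimic the lazy loading described above.

```python
class LazyDirstate:
    """Hypothetical model of lazily loaded parents and map."""

    def __init__(self, on_disk_parents):
        self._on_disk = on_disk_parents
        self._pl = None
        self._map = None

    def _read(self):
        # Loading the map also reloads the parents, unconditionally.
        self._pl = self._on_disk
        self._map = {}

    def invalidate(self):
        self._pl = None
        self._map = None

    def setparents(self, p1, p2):
        self._pl = (p1, p2)

    def getmap(self):
        # Stand-in for the lazy _map attribute access.
        if self._map is None:
            self._read()
        return self._map


ds = LazyDirstate(('old1', 'old2'))
ds.invalidate()
ds.setparents('new1', 'new2')
ds.getmap()  # triggers _read(), which overwrites the modified parents
```

After the last call ds._pl is back to the on-disk value, so the parents set in memory are forgotten.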