I think it's both simpler and clearer to use any() than the current
for loop.
While at it, also drop the call to sorted(), since it doesn't matter
which order we iterate over subrepos.
As 70dbbec70c4c demonstrated with stream clones, closing files on
background threads on Windows can yield a significant speedup
because closing files that have been created/appended to is slow
on Windows/NTFS.
Working directory updates can write thousands of files. Therefore it
is susceptible to excessive slowness on Windows due to slow file
closes.
This patch enables background file closing when performing working
directory file writes. The impact when performing an `hg up tip` on
mozilla-central (136,357 files) from an empty working directory is
significant:
Before: 535s (8:55)
After: 133s (2:13)
Delta: -402s (6:42)
That's a 4x speedup!
By comparison, that same machine can perform the same operation
in ~15s on Linux. So Windows went from ~35x to ~9x slower. Not bad
but there's still work to do.
As a reminder, background file closing is only activated on Windows
because it is only beneficial on that platform. So this patch
shouldn't change non-Windows behavior at all.
It's worth noting that non-Windows systems perform working directory
updates with multiple processes. Unfortunately, worker.py doesn't
yet support Windows. So, there is still plenty of room for making
working directory updates faster on Windows. Even if multiple
processes are used on Windows, I believe background file closing
will still provide a benefit, as individual processes will still
be slowed down by the file close bottleneck (assuming the I/O system
isn't saturated).
Previously we would lstat the file to see if it was a file or a link before
attempting to process it. If the file happened to exist across a symlink, and if
that symlink was pointing to a network file system, that check could be very
expensive.
The new logic audit's the path to avoid symlinks before performing the lstat on
the file itself.
In our situation, this shaved 10 minutes off of certain hg updates.
300 files * (2 seconds - the network filesystem lookup time)
checkunknown and checkignored are currently respected for updates and regular
merges, but not for certain kinds of rebases. To be precise, they aren't
respected for rebases when:
(1) we're rebasing while currently on the destination commit, and
(2) an untracked or ignored file F is currently in the working copy, and
(3) the same file F is in a source commit, and
(4) F has different contents in the source commit.
This happens because rebases set force to True when calling merge.update.
Setting force to True makes a lot of sense in general, but it turns out the
force option is overloaded: there's a deprecated '--force' option in merge that
allows you to merge in outstanding changes, including changes in untracked
files. We use the 'mergeforce' parameter to tell those two cases apart.
I think the behavior during rebases when checkunknown is 'abort' (the default)
is wrong -- we should abort on or overwrite differing untracked files, not try
to merge them in. However that currently breaks rebases by aborting in the
middle -- we need better handling for that case before we can change the
default.
In an upcoming patch we'll have different behavior here for when 'merge
--force' is used as opposed to when other kinds of force operations are
performed, like rebases.
During a merge, each file has a current commitnode+filenode, an other
commitnode+filenode, and an ancestor commitnode+filenode. The ancestor
commitnode is not stored though, and we rely on the ability for the filectx() to
look up the commitnode by using the filenode's linkrev. In alternative backends
(like remotefilelog), linkrevs may have restriction that prevent arbitrary
linkrev look up given a filenode.
This patch accounts for that by storing the ancestor commitnode in
the merge state so that it is available later at resolve time.
This results in some test changes because the ancestor commitnode we're using at
resolve time changes slightly. Before, we used the linkrev commit, which is the
earliest commit that introduced that particular filenode (which may not be the
latest common ancestor of the commits being merged). Now we use the latest
common ancestor of the merged commits as the commitnode. This is fine though,
because that commit contains the same filenode as the linkrev'd commit.
In future commits we will want to store more data related to each file in the
merge state. This patch adds an optional record for storing a dictionary of
extras for each file.
In my patch series ending with rev c3f2eede6938 I switched most change/delete
conflicts to be handled at the resolve layer. .hgsubstate was the one file that
we weren't able to handle, so we kept the old code path around for it.
The old code path added .hgsubstate to one of the other lists as the user
specifies, including possibly the 'g' list.
Now since we did this check after converting the actions from being keyed by
file to being keyed by action type, there was nothing that actually removed
.hgsubstate from the 'cd' or 'dc' lists. This meant that the file would
eventually make its way into the 'mergeactions' list, now freshly augmented
with 'cd' and 'dc' actions.
We call subrepo.submerge for both 'g' actions and merge actions.
This means that if the resolution to an .hgsubstate change/delete conflict was
to add it to the 'g' list, subrepo.submerge would be called twice. It turns out
that this doesn't cause any adverse effects on Linux due to caching, but
apparently breaks on other operating systems including Windows.
The fix here moves this to before we convert the actions over. This ensures
that it .hgsubstate doesn't make its way into multiple lists.
The real fix here is going to be:
(1) move .hgsubstate conflict resolution into the resolve layer, and
(2) use a real data structure for the actions rather than shuffling data around
between lists and dictionaries: we need a hash (or prefix-based) index by
file and a list index by action type.
There's a very tiny behavior change here: collision detection on
case-insensitive systems will happen after this is resolved, not before. I think
this is the right change -- .hgsubstate could theoretically collide with other
files -- but in any case it makes no practical difference.
Thanks to Yuya Nishihara for investigating this.
In some real-world cases it is preferable to allow overwriting ignored files
while continuing to abort on unknown files. This primarily happens when we're
replacing build artifacts (which are ignored) with checked in files, but
continuing to abort on differing files that aren't ignored.
We're redefining merge.checkunknown to only control the behavior for files
that aren't ignored. That's fine because this config was only very recently
introduced and has not made its way into any Mercurial releases yet.
A 'colliding unknown file' is a file that meets all of the following
conditions:
- is untracked or ignored on disk
- is present in the changeset being merged or updated to
- has different contents
Previously, we would always abort whenever we saw such files. With this config
option we can choose to warn and back the unknown files up instead, or even
forgo the warning entirely and silently back the unknown files up.
Common use cases for this configuration include a large scale transition of
formerly ignored unknown files to tracked files. In some cases the files can be
given new names, but in other cases, external "convention over configuration"
constraints have determined that the file must retain the same name as before.
Previously, we were using Python's native 'os.path.isfile' method which follows
symlinks. In this case, since we're operating on repo contents, we don't want
to follow symlinks.
There's a behaviour change here, as shown by the second part of the added test.
Consider a symlink 'f' pointing to a file containing 'abc'. If we try and
replace it with a file with contents 'abc', previously we would have let it
though. Now we don't. Although this breaks naive inspection with tools like
'cat' and 'diff', on balance I believe this is the right change.
This means that the diff code does less work, potentially
significantly less in the case of treemanifests. It also should ease
implementation with narrowed clone cases (such as narrowhg) when we
don't always have the entire set of treemanifest revlogs locally.
As far as I can tell, this codepath is currently only used by record,
so it'll probably die in the near future, and then narrowhg won't have
to worry about composing with some unknown matching system.
This will be a chance for the merge driver to finish resolving or generating
any driver-resolved files.
As before, having a separate error state from 'unresolved' is too big a
refactoring for now, so we hack around it by setting unresolved to a positive
value when necessary.
We also need to update our internal state to whatever state driverpreprocess
leaves it in.
Adding an error state separate from the unresolved count is too big a
refactoring for now, so we hack around it by setting it to a positive value to
indicate an error state.
The exact semantics for what should happen (particularly with respect to error
handling) are still a bit hard to pin down, so I think it's better to
experiment with it as an extension for now. For now this stub will act as a
convenient point for extensions to hook on.
c67339617276 (while 3.4 code-freeze) made all 'update' hooks run after
releasing wlock for visibility of in-memory dirstate changes. But this
breaks paired invocation of 'preupdate' and 'update' hooks.
For example, 'hg backout --merge' for TARGET revision, which isn't
parent of CURRENT, consists of steps below:
1. update from CURRENT to TARGET
2. commit BACKOUT revision, which backs TARGET out
3. update from BACKOUT to CURRENT
4. merge TARGET into CURRENT
Then, we expects hooks to run in the order below:
- 'preupdate' on CURRENT for (1)
- 'update' on TARGET for (1)
- 'preupdate' on BACKOUT for (3)
- 'update' on CURRENT for (3)
- 'preupdate' on TARGET for (4)
- 'update' on CURRENT/TARGET for (4)
But hooks actually run in the order below:
- 'preupdate' on CURRENT for (1)
- 'preupdate' on BACKOUT for (3)
- 'preupdate' on TARGET for (4)
- 'update' on TARGET for (1), but actually on CURRENT/TARGET
- 'update' on CURRENT for (3), but actually on CURRENT/TARGET
- 'update' on CURRENT for (4), but actually on CURRENT/TARGET
Root cause of the issue focused by c67339617276 is that external
'update' hook process can't view in-memory changes (especially, of
dirstate), because they aren't written out until the end of
transaction (or wlock).
Now, hooks can be invoked just after updating, because previous
patches made in-memory changes visible to external process.
This patch may break backward compatibility from the point of view of
"scheduling hook execution", but should be reasonable because 'update'
hooks had been executed in this order before 3.4.
This patch tests "hg backout" and "hg unshelve", because the former
activates the transaction before 'update' hook invocation, but the
former doesn't.
Now, 'dirstate.write(tr)' delays writing in-memory changes out, if a
transaction is running.
This may cause treating this revision as "the first bad one" at
bisecting in some cases using external hook process inside transaction
scope, because some external hooks and editor process are still
invoked without HG_PENDING and pending changes aren't visible to them.
'dirstate.write()' callers below in localrepo.py explicitly use 'None'
as 'tr', because they can assume that no transaction is running:
- just before starting transaction
- at closing transaction, or
- at unlocking wlock
This opens the door to working slightly more closely with the manifest
type and letting it optimize out some of the diff comparisons for us,
and also makes life significantly easier for narrowhg.
Once we get a matcher down into manifestmerge, we can make narrowhg
work more easily and potentially let manifest.match().diff() do less
work in manifestmerge.
We currently allow updating and merging (with --force) when there are
unresolved merge conflicts, as long as there is only one parent of the
working copy. Even worse, when updating to another revision
(linearly), if one of the unresolved files (including any conflict
markers in the working copy) can now be merged cleanly with the target
revision, the file becomes marked as resolved.
While we could potentially allow updates that affect only files that
are not in the set of unresolved files, that's considerably more work,
and we don't have a use case for it anyway. Instead, let's keep it
simple and refuse any merge or update (without -C) when there are
unresolved conflicts.
Note that test-merge-local.t explicitly checks for conflict markers
that get carried over on update. It's unclear if that was intentional
or not, but it seems bad enough that we should forbid it. The simplest
way of fixing the test case is to leave the conflict markers in place
and just mark the files resolved, so let's just do that for now.
Currently merge.graft re-writes the dirstate so only a single
parent is kept. For some cases, like evolving a merge commit,
this behaviour is not desired. More specifically, this is
needed to fix issue4389.
We have finally laid all the groundwork to make this happen.
The only change/delete conflicts that haven't been moved are .hgsubstate
conflicts. Those are trickier to deal with and well outside the scope of this
series.
We add comprehensive testing not just for the initial selections but also for
re-resolves and all possible dirstate transitions caused by merge tools. That
testing managed to shake out several bugs in the way we were handling dirstate
transitions.
The other test changes are because we now treat change/delete conflicts as
proper merges, and increment the 'merged' counter rather than the 'updated'
counter. I believe this is the right approach here.
For third-party extensions, if they're interacting with filemerge code they
might have to deal with an absentfilectx rather than a regular filectx.
Still to come:
- add a 'leave unresolved' option to merges
- change the default for non-interactive change/delete conflicts to be 'leave
unresolved'
- add debug output to go alongside debug outputs for binary and symlink file
merges
This is somewhat different from the currently existing 'a' action, for the
following case:
- dirty working copy, with file 'fa' added and 'fm' modified
- hg merge --force with a rev that neither has 'fa' nor 'fm'
- for the change/delete conflicts we pick 'changed' for both 'fa' and 'fm'.
In this case 'branchmerge' is true, but we need to distinguish between 'fa',
which should ultimately be marked added, and 'fm', which should be marked
modified.
Our current strategy is to just not touch the dirstate at all. That works for
now, but won't work once we move change/delete conflicts to the resolve phase.
In that case we may perform repeated re-resolves, some of which might mark the
file removed or remove the file from the dirstate. We'll need to re-add the
file to the dirstate, and we need to be able to figure out whether we mark the
file added or modified. That is what the new 'am' action lets us do.
Once we move change/delete conflicts into the resolve phase, a 'dc' file might
first be resolved by picking the other side, then later be resolved by picking
the local side. For this transition we want to make sure that the file goes
back to not being in the dirstate.
This has no impact on conflicts during the initial merge.
At the moment this is a no-op (the only actions defined are 'r', 'a' and 'g'),
but soon we're going to add other sorts of actions to the dictionary returned
from mergestate.actions().
These are meant for use by custom merge drivers that might want to modify the
dirstate. Dirstate internal consistency rules require that all removes happen
before any adds -- this means that custom merge drivers shouldn't be modifying
the dirstate directly.
We're going to use this to extend the action lists in merge.applyupdates.
The somewhat funky return value is to make passing this dict directly into
recordactions easier. We're going to exploit that in an upcoming patch.
This eliminates a whole bunch of duplicate code and allows us to update the
removed count for change/delete conflicts where the delete action was chosen.
This will not only allow us to remove a bunch of duplicate code in applyupdates
in an upcoming patch, it will also allow the resolve interface to be a lot
simpler: it doesn't need to return the dirstate action to applyupdates.
We will represent a deleted file as 'nullhex' in the in-memory and on-disk
merge states. We need to be able to create absentfilectxes in that case, and
delete the file from disk rather than try to write it out.
We introduce a new record type, 'C', to indicate change/delete conflicts. This
is a separate record type because older versions of Mercurial will not be able
to handle these conflicts.
We aren't actually storing any change/delete conflicts yet -- that will come in
future patches.
This works around a bug in older Mercurial versions' handling of the v2 merge
state.
We also add a bunch of tests that make sure that
(1) we correctly abort when the merge state has an unsupported record type
(2) aborting the merge, rebase or histedit continues to work and clears out the
merge state.
This is too low-level to be the top-level documentation for mergestate. We're
restricting the top-level documentation to only be about what consumers of the
mergestate and anyone extending it need to care about.
With this patch, mergestate.clean() will no longer abort when it encounters an
unsupported merge type. However we hold off on testing it until backwards
compatibility is in place.
Eventually, we'll move the read call out of the constructor. This will:
- avoid unnecessary reads when we're going to nuke the merge state anyway
- avoid raising an exception if there's an unsupported merge record
'clean' seems like a good name for it because I wanted to avoid anything with
the word 'new' in it, and 'reset' is more an action performed on a merge state
than a way to get a new merge state.
Thanks to Martin von Zweigbergk for feedback about naming this.
We're going to catch this exception in 'hg summary' to print a better error
message.
This code is pretty untested, so there are no changes to test output. In
upcoming patches we're going to test the output more thoroughly.
Now that all internal callers pre-compute and set a destination at a higher level
it feels like we can kill this API. This will allow us to simplify this
function. However I feel like this is a bit too central and critical to break
now. I'm adding a devel warning to let extension make catch this in the next
cycle.
File/directory case folding collisions cannot be represented on case folding
systems and have to fail.
To detect this and abort early, utilize that for file/directory collisions, a
sorted list of case folded manifest names will have the colliding directory
right after the file.
(This could perhaps be optimized, but this way of doing it also has
directory/directory case folding in mind ... which however not is handled yet.)
A driver-resolved file is a file that's handled specially by the driver. A
common use case for this state would be autogenerated files, the generation of
which should happen only after all source conflicts are resolved.
This is done with an uppercase letter because older versions of Mercurial will
not know how to treat such files at all.
A 'merge driver' is a coordinator for the overall merge process. It will be
able to control:
- tools for individual files, much like the merge-patterns configuration does
today
- tools that can work across groups of files
- the ordering of file resolution
- resolution of automatically generated files
- adding and removing additional files to and from the dirstate
Since it is a critical part of the merge process, it really is part of the
merge state.
This is a lowercase character (i.e. optional) because ignoring this is fine for
older versions of Mercurial -- however, if there are any files that are
specially treated by the driver, we should abort. That will happen in upcoming
patches.
There is a potential security issue with storing the merge driver in the merge
state. See the inline comments for more details.
For the same reason, we move the bookmark related update logic into the
'destupdate' function. This requires to extend the returns of the function to
include the bookmark that needs to move (more or less) and the bookmark to
activate at the end of the function. See function documentation for details on
this returns.
We perform all that we can non-interactively before prompting the user for input
via their merge tool. This allows for a maximally consistent state when the user
is first prompted.
The test output changes indicate the actual behavior change happening.
The section of code that writes out the version of the file cached in the merge
state should only be run at preresolve time. This is so that if the premerge
keeps around conflict markers, those don't get overwritten before the main
merge.
This means that in ms.resolve we must call merge after calling premerge. This
doesn't yet mean that all premerges happen before any merges -- however, this
does get us closer to our goal.
The output differences are because we recompute the merge tool. The only
user-visible difference caused by this patch is that if the tool is missing
we'll print the warning twice. Not a huge deal, though.
The home of 'Abort' is 'error' not 'util' however, a lot of code seems to be
confused about that and gives all the credit to 'util' instead of the
hardworking 'error'. In a spirit of equity, we break the cycle of injustice and
give back to 'error' the respect it deserves. And screw that 'util' poser.
For great justice.
We can safely drop this because the very same assignment is enforcement later in
the function. Dropping it will make it simpler to extract the default
destination logic in its own function.
This is another step toward having "default" destination more clear and unified.
Not all the logic is there because some bookmark related computation happened
elsewhere. It will be moved later.
The function is private because as for the other ones, cleanup is needed before
we can proceed.
Resolving other conflicts before merge ones is better because the state before
the merge is as consistent as possible. It will also help with future work
involving automatic resolution of merge conflicts with an external merge
driver.
There are no ordering issues here because it is easy to verify that the same
file is never in both the dg/dm and the m sets.
We're going to treat these conflicts similarly to merge conflicts, and this
change to the way we store things in memory makes future code a lot simpler.
(1) These aren't currently read from anywhere, so emptying this out is
pointless.
(2) These *will* be read from later in upcoming patches, and not emptying
them out will be important then.
I actually wanted to reduce the amount of code around the call to
applyupdates(), so I tried moving these warnings a little earlier, and
I think it makes the output make a little more sense (see changes to
test cases).
This only makes a difference when a merge driver is active -- in that case we
don't want to try and merge all the files, just the ones still unresolved after
the merge driver's preprocess step is over.