Copy tracing can be up to 80% of rebase time when rebasing stacks of commits in
large repos (hundreds of thousands of files). This provides the option of
turning off the majority of copy tracing. It does not turn off _forwardcopies()
since that is used to carry copy information inside a commit across a rebase.
This will affect the situation where a user edits a file, then rebases on top of
commits that have moved that file. The move will not be detected and the user
will have to manually resolve the issue (possibly by redoing the rebase with
this flag off).
The reason to have a flag instead of trying to fix the actual copy tracing
performance is that copy tracing is fundamentally an O(number of files in the
repo) operation. In order to know if file X in the rebase source was copied
anywhere, we have to walk the filelog for every new file that exists in the
rebase destination (i.e. a file in the destination that is not in the common
ancestor). Without an index that lets us trace forward (i.e. from file Y in the
common ancestor forward to the rebase destination), it will never be an O(number
of changes in my branch) operation.
In mozilla-central, rebasing a 3 commit stack across 20,000 revs goes from 39s
to 11s.
checkcopies was using fctx.rev() which it was expecting would be
equivalent to linkrev() but was triggering the new _adjustlinkrev path.
This was making grafts and merges with large sets of potential copies
very expensive.
The root directory is not normally added to 'dirs' instances (although
I think it should be). In copies.mergecopies, we call dirname() to get
the directory of a path and then check for containment in the 'dirs'
instances ('d1' and 'd2'). In order to easily handle files in the root
directory, '/' is added to d1/d2. This results in the empty string
being added to the sets, since what comes before the slash in '/' is
an empty string. This seems less than obvious, so let's document it.
This allows passing a matcher down the pathcopies() stack to _forwardcopies().
This will let us add logic in a later patch to avoid tracing copies when not
necessary (like when doing hg diff -r 1 -r 2 foo.txt).
The _computenonoverlap function takes two manifests to allow extensions to hook
in and read the manifest nodes produced by the function. The remotefilelog
extension actually needs the entire changectx instead (which includes the
manifest) so it can prefetch the subset of files necessary for a sparse checkout
(and the sparse checkout depends on which commit is being accessed, hence the
need for the changectx).
I have tests in the remotefilelog extension that cover this.
Merge copies is traversing file history in search for copies and renames.
Since 3.3 we are doing "linkrev adjustment" to ensure duplicated filelog entry
does not confuse the traversal. This "linkrev adjustment" involved ancestry
testing and walking in the changeset graph. If we do such walk in the changesets
graph for each file, we end up with a 'O(<changesets>x<files>)' complexity
that create massive issue. For examples, grafting a changeset in Mozilla's repo
moved from 6 seconds to more than 3 minutes.
There is a mechanism to reuse such ancestors computation between all files. But
it has to be manually set up in situation were it make sense to take such
shortcut. This changesets set this mechanism up and bring back the graph time
from 3 minutes to 8 seconds.
To do so, we need a bigger control on the way 'filectx' are instantiated during
each 'checkcopies' calls that 'mergecopies' is doing. We add a new 'setupctx'
that configure and return a 'filectx' factory. The function make sure the
ancestry context is properly created and the factory make sure it is properly
installed on returned 'filectx'.
The new linkrev adjustement mechanism makes rename detection very slow, because
each file rewalks the ancestor dag. To mitigate the issue in Mercurial 3.3, we
introduce a simplistic way to share the ancestors computation for the linkrev
validation phase.
We can reuse the ancestors in that case because we do not care about
sub-branching in the ancestors graph.
The cached set will be use to check if the linkrev is valid in the search
context. This is the vast majority of the ancestors usage during copies search
since the uncached one will only be used when linkrev is invalid, which is
hopefully rare.
Commit d1f83f500b47 changed the computenonoverlap api's to not require the
manifests. We actually need the manifests in the remotefilelog extension so
we can find the file nodes for the various files that change. Let's add it
back to the function signature with a note explaining why.
This doesn't affect any behavior.
Now that we have manifestdict.filesnotin(), we can write _nonoverlap()
in terms of that method instead, enabling future speedups when
filesnotin() gets optimized, and perhaps making the code a little
clearer at the same time.
copies._computeforwardmissing() finds files in one context that is not
in the other. Let's move this code into a new method on manifestdict,
so m1.filesnotin(m2) can be optimized for various types of manifests
(we expect more types of manifests soon).
Moves the _forwardcopies missingfiles logic to a separate function so that other
extensions which need to prefetch information about the files being
processed have a hook point.
This saves extensions from having to recompute this information themselves, and
thus saves several seconds off of various commands (like rebase).
Moves the mergecopies nonoverlap logic to a separate function so that other
extensions which may need to prefetch information about the files being
processed have a hook point.
This saves extensions from having to recompute this information themselves, and
thus saves several seconds off of various commands (like rebase).
This addresses the bug described in issue4405: when obsolescence markers are
enabled, amending a commit with a file move can lead to the copy information
being lost.
However, the bug is more general and can be reproduced without obsmarkers as
well, as demonstracted by Pierre-Yves and put into the updated test.
Specifically, graph topology divergences between the filelogs and the changelog
can cause copy information to be lost during amends.
This removes an optimization that was introduced in 5a644704d5eb but was too
aggressive - as indicated by how it changed test-mq-merge.t .
We are walking filelogs to find copy sources and we can thus not be sure to hit
the base revision and find the renamed file there - it could also be in the
first ancestor of the base ... in the filelog.
We are walking the filelog and can thus not easily know when we hit the first
ancestor of the base revision and which filename to look for there. Instead, we
use _findlimit like mergecopies do: The lower bound for how far we have to go
is found from the lowest changelog revision that is an ancestor of only one of
the compared revisions. Any filelog ancestor with a revision number lower than
that revision will be the ancestor of both compared revisions, and there is
thus no reason to go further back than that.
This moves checkcopies() out of mergecopies() and makes it a top level
function in the copies module. This allows extensions to override it. For
example, I'm developing a filelog replacement that doesn't have rev numbers
so all the rev number dependent implementation here needs to be replaced
by the extension.
No logic is changed in this commit.
This is a performance win for a number of reasons:
- We don't iterate over contexts, which avoids a completely unnecessary sorted
call + the O(number of files) abstraction cost of doing that.
- We don't check membership in a context, which avoids another
O(number of files) abstraction cost.
- We iterate over the manifests in C instead of Python.
For a large repo with 170,000 files, this improves perfpathcopies from 0.34
seconds to 0.07. Anything that uses pathcopies, such as rebase or diff --git
between two revisions, benefits.
The inverse of a rename is a rename, but the inverse of a copy is not a copy.
Presenting it as such -- in particular, stuffing it into the same dict as real
copies -- causes bugs because other code starts believing the inverse copies
are real.
The only test whose output changes is test-mv-cp-st-diff.t. When a backwards
status -C command is run where a copy is involved, the inverse copy (which was
hitherto presented as a real copy) is no longer displayed.
Keeping track of inverse copies is useful in some situations -- composability
of diffs, for example, since adding "a" followed by an inverse copy "b" to "a"
is equivalent to a rename "b" to "a". However, representing them would require
a more complex data structure than the same dict in which real copies are also
stored.
The -> in debug messages is currently overloaded to mean both source to dest
and dest to source. To fix this, we add explicit labels and make the arrow
direction consistent.
Currently the "copy" dict contains both explicit copies/moves made by a
context and pending moves that need to happen because the other context moved
the directory the file was in. For explicit copies, the dict stores a
destination to source map, while for pending moves via directory renames, it
stores a source to destination map. The merge code uses this fact in a non-
obvious way to differentiate between these two cases.
We make this explicit by storing these pending moves in a separate dict. The
dict still has a source to destination map, but that is called out in the
docstring.
For divergent renames the following message is printed during merge:
note: possible conflict - file was renamed multiple times to:
newfile
file2
When a file is renamed in one branch and deleted in the other, the file still
exists after a merge. With this change a similar message is printed for mv+rm:
note: possible conflict - file was deleted and renamed to:
newfile