copytrace: use filename heuristics to quickly find moves

Summary:
Add copytracing based on a simple idea: most moves are either directory
moves or renames of a file within the same directory. That means that either
the basename or the dirname of the moved file stays the same.
More details are in the code comments.
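
As a rough illustration of the idea (a Python 3 sketch, not the extension's code; `find_move_candidates` and the sample paths are made up for this example), candidates for a missing file can be looked up through two indexes, one keyed by basename and one keyed by dirname:

```python
import posixpath
from collections import defaultdict


def find_move_candidates(missing_file, new_files):
    """Return files from new_files that share a basename or a dirname
    with missing_file (hypothetical helper, for illustration only)."""
    by_basename = defaultdict(list)
    by_dirname = defaultdict(list)
    for path in new_files:
        by_basename[posixpath.basename(path)].append(path)
        by_dirname[posixpath.dirname(path)].append(path)
    same_basename = by_basename.get(posixpath.basename(missing_file), [])
    same_dirname = by_dirname.get(posixpath.dirname(missing_file), [])
    return same_basename + same_dirname


# A directory move keeps the basename; a rename in place keeps the dirname.
dir_move = find_move_candidates("dir/file.txt", ["dir2/file.txt", "b"])
rename = find_move_candidates("a", ["dir2/file.txt", "b"])
```

In the extension itself each candidate is additionally checked with `copiesmod._related` before it is recorded as a copy; the sketch covers only the filename lookup.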

Test Plan: Run unit-tests

Reviewers: #mercurial, durham, quark, rmcelroy

Reviewed By: quark, rmcelroy

Subscribers: mjpieters, #sourcecontrol

Differential Revision: https://phabricator.intern.facebook.com/D5137372

Tasks: 18508761

Signature: t1:5137372:1496243148:8d229c1593da196b674318ee8b37af15a60831c8
Stanislau Hlebik 2017-06-01 04:39:27 -07:00
parent 1b9a5fdc5c
commit 9e8ca3ae43
3 changed files with 496 additions and 6 deletions


@ -5,12 +5,40 @@
# This software may be used and distributed according to the terms of the
# GNU General Public License version 2 or any later version.
'''extension that does copytracing fast
::
[copytrace]
# whether to enable fast copytracing or not
fastcopytrace = False
# limits the number of commits in the source "branch", i.e. the "branch"
# that is rebased or merged. These are the commits from base up to csrc
# (see the _domergecopies docstring below).
# copytracing can be too slow if there are too
# many commits in this "branch".
sourcecommitlimit = 100
# limits the number of heuristically found move candidates to check
maxmovescandidatestocheck = 5
'''
from collections import defaultdict
from functools import partial, update_wrapper
from mercurial import commands, dispatch, extensions, filemerge, util
from mercurial import (
commands,
copies as copiesmod,
dispatch,
extensions,
filemerge,
util,
)
from mercurial.i18n import _
from mercurial.node import hex, wdirid
import os
import time
_copytracinghint = ("hint: if this message is due to a moved file, you can " +
"ask mercurial to attempt to automatically resolve this " +
@ -49,6 +77,7 @@ def extsetup(ui):
filemerge.internals[':prompt'] = wrapperpromptmerge
filemerge.internalsdoc[':prompt'] = wrapperpromptmerge
filemerge.internals['internal:prompt'] = wrapperpromptmerge
extensions.wrapfunction(copiesmod, 'mergecopies', _mergecopies)
def _runcommand(orig, lui, repo, cmd, fullargs, ui, *args, **kwargs):
if "--tracecopies" in fullargs:
@ -63,13 +92,9 @@ def _promptmerge(origfunc, repo, mynode, orig, fcd, fco, *args, **kwargs):
ctx2 = _getctxfromfctx(fcd)
msg = [(ctx1.phase(), _gethex(ctx1)), (ctx2.phase(), _gethex(ctx2))]
-reporoot = repo.origroot if util.safehasattr(repo, 'origroot') else ''
-reponame = ui.config('paths', 'default', reporoot)
-if reponame:
-reponame = os.path.basename(reponame)
 if fco.isabsent() or fcd.isabsent():
 ui.log("promptmerge", "", mergechangeddeleted=('%s' % msg),
-reponame=reponame)
+reponame=_getreponame(repo, ui))
except Exception as e:
# since it's just logging we don't want an error in this code to break
# clients
@ -85,3 +110,155 @@ def _getctxfromfctx(fctx):
def _gethex(ctx):
# for workingctx return p1 hex
return ctx.hex() if ctx.hex() != hex(wdirid) else ctx.p1().hex()
def _mergecopies(orig, repo, cdst, csrc, base):
start = time.time()
try:
return _domergecopies(orig, repo, cdst, csrc, base)
except Exception as e:
# make sure we don't break clients
repo.ui.log("copytrace", "Copytrace failed: %s" % e,
reponame=_getreponame(repo, repo.ui))
return {}, {}, {}, {}, {}
finally:
repo.ui.log("copytracingduration", "",
copytracingduration=time.time() - start,
fastcopytraceenabled=_fastcopytraceenabled(repo.ui))
def _domergecopies(orig, repo, cdst, csrc, base):
""" Fast copytracing using filename heuristics
Handle one case where we assume there are no moves or merge commits in
the "source branch". The source branch consists of the commits from base
up to csrc, not including base.
If these assumptions don't hold then we fall back to the
upstream mergecopies
p
|
p <- cdst - rebase or merge destination, can be draft
.
.
. d <- csrc - commit to be rebased or merged.
| |
p d <- base
| /
p <- common ancestor
To find copies we are looking for files with similar filenames.
See description of the heuristics below.
Upstream copytracing function returns five dicts:
"copy", "movewithdir", "diverge", "renamedelete" and "dirmove". See below
for a more detailed description (mostly copied from upstream).
This extension returns "copy" dict only, everything else is empty.
"copy" is a mapping from destination name -> source name,
where source is in csrc and destination is in cdst or vice-versa.
"movewithdir" is a mapping from source name -> destination name,
where the file at source present in one context but not the other
needs to be moved to destination by the merge process, because the
other context moved the directory it is in.
"diverge" is a mapping of source name -> list of destination names
for divergent renames. At the time of writing this extension it was used
only to print a warning.
"renamedelete" is a mapping of source name -> list of destination
names for files deleted in c1 that were renamed in c2 or vice-versa.
At the time of writing this extension it was used only to print a warning.
"dirmove" is a mapping of detected source dir -> destination dir renames.
This is needed for handling changes to new files previously grafted into
renamed directories.
"""
if not repo.ui.configbool("experimental", "disablecopytrace"):
# user explicitly enabled copytracing - use it
return orig(repo, cdst, csrc, base)
if not _fastcopytraceenabled(repo.ui):
return orig(repo, cdst, csrc, base)
if not cdst or not csrc or cdst == csrc:
return {}, {}, {}, {}, {}
# avoid silly behavior for parent -> working dir
if csrc.node() is None and cdst.node() == repo.dirstate.p1():
return repo.dirstate.copies(), {}, {}, {}, {}
if cdst.rev() is None:
cdst = cdst.p1()
if csrc.rev() is None:
csrc = csrc.p1()
copies = {}
ctx = csrc
changedfiles = set()
sourcecommitnum = 0
sourcecommitlimit = repo.ui.configint('copytrace', 'sourcecommitlimit', 100)
while ctx != base:
if len(ctx.parents()) == 2:
# To keep things simple let's not handle merges
return orig(repo, cdst, csrc, base)
changedfiles.update(ctx.files())
ctx = ctx.p1()
sourcecommitnum += 1
if sourcecommitnum > sourcecommitlimit:
return orig(repo, cdst, csrc, base)
m1 = cdst.manifest()
missingfiles = [f for f in changedfiles if f not in m1]
if missingfiles:
# Use the following file name heuristic to find moves: moves are
# usually either directory moves or renames of the files in the
# same directory. That means that we can look for the files in cdst
# with either the same basename or the same dirname.
basenametofilename = defaultdict(list)
dirnametofilename = defaultdict(list)
for f in m1.filesnotin(base.manifest()):
basename = os.path.basename(f)
dirname = os.path.dirname(f)
basenametofilename[basename].append(f)
dirnametofilename[dirname].append(f)
maxmovecandidatestocheck = repo.ui.configint(
'copytrace', 'maxmovescandidatestocheck', 5)
# in case of a rebase/graft, base may not be a common ancestor
anc = cdst.ancestor(csrc)
for f in missingfiles:
basename = os.path.basename(f)
dirname = os.path.dirname(f)
samebasename = basenametofilename[basename]
samedirname = dirnametofilename[dirname]
movecandidates = samebasename + samedirname
# if file "f" is not present in csrc that means that it was deleted
# in cdst and csrc. Ignore "f" in that case
if f in csrc:
f2 = csrc.filectx(f)
for candidate in movecandidates[:maxmovecandidatestocheck]:
f1 = cdst.filectx(candidate)
if copiesmod._related(f1, f2, anc.rev()):
# if there are a few related copies then we'll merge
# changes into all of them. This matches the behaviour
# of upstream copytracing
copies[candidate] = f
if len(movecandidates) > maxmovecandidatestocheck:
repo.ui.log("copytrace", "",
reponame=_getreponame(repo, repo.ui),
toomanymovescandidates=len(movecandidates))
return copies, {}, {}, {}, {}
def _fastcopytraceenabled(ui):
return ui.configbool("copytrace", "fastcopytrace", False)
def _getreponame(repo, ui):
reporoot = repo.origroot if util.safehasattr(repo, 'origroot') else ''
reponame = ui.config('paths', 'default', reporoot)
if reponame:
reponame = os.path.basename(reponame)
return reponame
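
The walk over the source "branch" in `_domergecopies` can be sketched on its own. The following is a minimal Python 3 illustration, not the extension's code: `Commit` and `collect_changed_files` are hypothetical stand-ins for Mercurial's changectx objects and the loop above, and it assumes `base` is reachable from `csrc` through first parents.

```python
from collections import namedtuple

# Toy stand-in for a changeset: a name, its parents and the files it touched.
Commit = namedtuple("Commit", ["name", "parents", "files"])


def collect_changed_files(csrc, base, limit=100):
    """Walk first parents from csrc down to base (exclusive).

    Returns the set of touched files, or None when the caller should
    fall back to the slower upstream algorithm.
    """
    changed = set()
    seen = 0
    ctx = csrc
    while ctx is not base:
        if len(ctx.parents) == 2:
            return None  # merges are not handled by the fast path
        changed.update(ctx.files)
        ctx = ctx.parents[0]
        seen += 1
        if seen > limit:
            return None  # the source "branch" is too long
    return changed


base = Commit("base", (), ["a"])
c1 = Commit("c1", (base,), ["a"])
c2 = Commit("c2", (c1,), ["dir/file.txt"])

files = collect_changed_files(c2, base)
too_long = collect_changed_files(c2, base, limit=1)
```

Returning `None` here plays the role of `return orig(repo, cdst, csrc, base)` in the extension: the fast path gives up and the upstream implementation takes over.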

tests/copytrace.sh Normal file

@ -0,0 +1,10 @@
function initclient() {
cat >> $1/.hg/hgrc <<EOF
[copytrace]
remote = False
enablefilldb = True
fastcopytrace = True
[experimental]
disablecopytrace = True
EOF
}
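
The two settings written by `initclient` correspond to the two checks at the top of `_domergecopies`. A small Python 3 sketch of that gating follows; it uses the standard `configparser` module as a stand-in for Mercurial's own config reader, and `uses_fast_copytrace` is a name invented for this example.

```python
import configparser

HGRC = """
[copytrace]
remote = False
enablefilldb = True
fastcopytrace = True
[experimental]
disablecopytrace = True
"""


def uses_fast_copytrace(hgrc_text):
    """Mirror the two checks at the top of _domergecopies (sketch only)."""
    parser = configparser.ConfigParser()
    parser.read_string(hgrc_text)
    disabled = parser.getboolean("experimental", "disablecopytrace",
                                 fallback=False)
    fast = parser.getboolean("copytrace", "fastcopytrace", fallback=False)
    # Upstream copytracing wins unless it is disabled AND the fast path is on.
    return disabled and fast


enabled = uses_fast_copytrace(HGRC)
default = uses_fast_copytrace("")
```

With neither option set, both values fall back to `False`, so the upstream `mergecopies` is used unchanged.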

tests/test-copytrace.t Normal file

@ -0,0 +1,303 @@
$ . "$TESTDIR/copytrace.sh"
$ extpath=`dirname $TESTDIR`
$ cat >> $HGRCPATH << EOF
> [extensions]
> copytrace=$extpath/hgext3rd/copytrace.py
> rebase=
> [experimental]
> disablecopytrace=True
> EOF
Check filename heuristics (same dirname and same basename)
$ hg init server
$ cd server
$ echo a > a
$ mkdir dir
$ echo a > dir/file.txt
$ hg addremove
adding a
adding dir/file.txt
$ hg ci -m initial
$ hg mv a b
$ hg mv -q dir dir2
$ hg ci -m 'mv a b, mv dir/ dir2/'
$ cd ..
$ hg clone -q server repo
$ initclient repo
$ cd repo
$ hg up -q 0
$ echo b > a
$ echo b > dir/file.txt
$ hg ci -qm 'mod a, mod dir/file.txt'
$ hg log -G -T 'changeset: {node}\n desc: {desc}, phase: {phase}\n'
@ changeset: 557f403c0afd2a3cf15d7e2fb1f1001a8b85e081
| desc: mod a, mod dir/file.txt, phase: draft
| o changeset: 928d74bc9110681920854d845c06959f6dfc9547
|/ desc: mv a b, mv dir/ dir2/, phase: public
o changeset: 3c482b16e54596fed340d05ffaf155f156cda7ee
desc: initial, phase: public
$ hg rebase -s . -d 1
rebasing 2:557f403c0afd "mod a, mod dir/file.txt" (tip)
merging b and a to b
merging dir2/file.txt and dir/file.txt to dir2/file.txt
saved backup bundle to $TESTTMP/repo/.hg/strip-backup/557f403c0afd-9926eeff-backup.hg (glob)
$ cd ..
$ rm -rf server
$ rm -rf repo
Make sure filename heuristics do not match when the files are not related
$ hg init server
$ cd server
$ echo 'somecontent' > a
$ hg add a
$ hg ci -m initial
$ hg rm a
$ echo 'completelydifferentcontext' > b
$ hg add b
$ hg ci -m 'rm a, add b'
$ cd ..
$ hg clone -q server repo
$ initclient repo
$ cd repo
$ hg up -q 0
$ printf 'somecontent\nmoarcontent' > a
$ hg ci -qm 'mode a'
$ hg log -G -T 'changeset: {node}\n desc: {desc}, phase: {phase}\n'
@ changeset: d526312210b9e8f795d576a77dc643796384d86e
| desc: mode a, phase: draft
| o changeset: 46985f76c7e5e5123433527f5c8526806145650b
|/ desc: rm a, add b, phase: public
o changeset: e5b71fb099c29d9172ef4a23485aaffd497e4cc0
desc: initial, phase: public
$ hg rebase -s . -d 1
rebasing 2:d526312210b9 "mode a" (tip)
other [source] changed a which local [dest] deleted
hint: if this message is due to a moved file, you can ask mercurial to attempt to automatically resolve this change by re-running with the --tracecopies flag, but this will significantly slow down the operation, so you will need to be patient.
Source control team is working on fixing this problem.
use (c)hanged version, leave (d)eleted, or leave (u)nresolved? u
unresolved conflicts (see hg resolve, then hg rebase --continue)
[1]
$ cd ..
$ rm -rf server
$ rm -rf repo
Test when lca didn't modify the file that was moved
$ hg init server
$ cd server
$ echo 'somecontent' > a
$ hg add a
$ hg ci -m initial
$ echo c > c
$ hg add c
$ hg ci -m randomcommit
$ hg mv a b
$ hg ci -m 'mv a b'
$ cd ..
$ hg clone -q server repo
$ initclient repo
$ cd repo
$ hg up -q 1
$ echo b > a
$ hg ci -qm 'mod a'
$ hg log -G -T 'changeset: {node}\n desc: {desc}, phase: {phase}\n'
@ changeset: 9d5cf99c3d9f8e8b05ba55421f7f56530cfcf3bc
| desc: mod a, phase: draft
| o changeset: d760186dd240fc47b91eb9f0b58b0002aaeef95d
|/ desc: mv a b, phase: public
o changeset: 48e1b6ba639d5d7fb313fa7989eebabf99c9eb83
| desc: randomcommit, phase: public
o changeset: e5b71fb099c29d9172ef4a23485aaffd497e4cc0
desc: initial, phase: public
$ hg rebase -s . -d 2
rebasing 3:9d5cf99c3d9f "mod a" (tip)
merging b and a to b
saved backup bundle to $TESTTMP/repo/.hg/strip-backup/9d5cf99c3d9f-f02358cc-backup.hg (glob)
$ cd ..
$ rm -rf server
$ rm -rf repo
Rebase "backwards"
$ hg init server
$ cd server
$ echo 'somecontent' > a
$ hg add a
$ hg ci -m initial
$ echo c > c
$ hg add c
$ hg ci -m randomcommit
$ hg mv a b
$ hg ci -m 'mv a b'
$ cd ..
$ hg clone -q server repo
$ initclient repo
$ cd repo
$ hg up -q 2
$ echo b > b
$ hg ci -qm 'mod b'
$ hg log -G -T 'changeset: {node}\n desc: {desc}, phase: {phase}\n'
@ changeset: fbe97126b3969056795c462a67d93faf13e4d298
| desc: mod b, phase: draft
o changeset: d760186dd240fc47b91eb9f0b58b0002aaeef95d
| desc: mv a b, phase: public
o changeset: 48e1b6ba639d5d7fb313fa7989eebabf99c9eb83
| desc: randomcommit, phase: public
o changeset: e5b71fb099c29d9172ef4a23485aaffd497e4cc0
desc: initial, phase: public
$ hg rebase -s . -d 0
rebasing 3:fbe97126b396 "mod b" (tip)
merging a and b to a
saved backup bundle to $TESTTMP/repo/.hg/strip-backup/fbe97126b396-cf5452a1-backup.hg (glob)
$ cd ..
$ rm -rf server
$ rm -rf repo
Rebase draft commit on top of draft commit
$ hg init repo
$ initclient repo
$ cd repo
$ echo 'somecontent' > a
$ hg add a
$ hg ci -m initial
$ hg mv a b
$ hg ci -m 'mv a b'
$ hg up -q ".^"
$ echo b > a
$ hg ci -qm 'mod a'
$ hg log -G -T 'changeset: {node}\n desc: {desc}, phase: {phase}\n'
@ changeset: 5268f05aa1684cfb5741e9eb05eddcc1c5ee7508
| desc: mod a, phase: draft
| o changeset: 542cb58df733ee48fa74729bd2cdb94c9310d362
|/ desc: mv a b, phase: draft
o changeset: e5b71fb099c29d9172ef4a23485aaffd497e4cc0
desc: initial, phase: draft
$ hg rebase -s . -d 1
rebasing 2:5268f05aa168 "mod a" (tip)
merging b and a to b
saved backup bundle to $TESTTMP/repo/.hg/strip-backup/5268f05aa168-284f6515-backup.hg (glob)
$ cd ..
$ rm -rf server
$ rm -rf repo
Check a few potential move candidates
$ hg init repo
$ initclient repo
$ cd repo
$ mkdir dir
$ echo a > dir/a
$ hg add dir/a
$ hg ci -qm initial
$ hg mv dir/a dir/b
$ hg ci -qm 'mv dir/a dir/b'
$ mkdir dir2
$ echo b > dir2/a
$ hg add dir2/a
$ hg ci -qm 'create dir2/a'
$ hg up -q 0
$ echo b > dir/a
$ hg ci -qm 'mod dir/a'
$ hg log -G -T 'changeset: {node}\n desc: {desc}, phase: {phase}\n'
@ changeset: 6b2f4cece40fd320f41229f23821256ffc08efea
| desc: mod dir/a, phase: draft
| o changeset: 4494bf7efd2e0dfdd388e767fb913a8a3731e3fa
| | desc: create dir2/a, phase: draft
| o changeset: b1784dfab6ea6bfafeb11c0ac50a2981b0fe6ade
|/ desc: mv dir/a dir/b, phase: draft
o changeset: 36859b8907c513a3a87ae34ba5b1e7eea8c20944
desc: initial, phase: draft
$ hg rebase -s . -d 2
rebasing 3:6b2f4cece40f "mod dir/a" (tip)
merging dir/b and dir/a to dir/b
saved backup bundle to $TESTTMP/repo/.hg/strip-backup/6b2f4cece40f-503efe60-backup.hg (glob)
$ cd ..
$ rm -rf server
$ rm -rf repo
Move file in one branch and delete it in another
$ hg init repo
$ initclient repo
$ cd repo
$ echo a > a
$ hg add a
$ hg ci -m initial
$ hg mv a b
$ hg ci -m 'mv a b'
$ hg up -q ".^"
$ hg rm a
$ hg ci -m 'del a'
created new head
$ hg log -G -T 'changeset: {node}\n desc: {desc}, phase: {phase}\n'
@ changeset: 7d61ee3b1e48577891a072024968428ba465c47b
| desc: del a, phase: draft
| o changeset: 472e38d57782172f6c6abed82a94ca0d998c3a22
|/ desc: mv a b, phase: draft
o changeset: 1451231c87572a7d3f92fc210b4b35711c949a98
desc: initial, phase: draft
$ hg rebase -s 1 -d 2
rebasing 1:472e38d57782 "mv a b"
saved backup bundle to $TESTTMP/repo/.hg/strip-backup/472e38d57782-17d50e29-backup.hg (glob)
$ hg up -q c492ed3c7e35dcd1dc938053b8adf56e2cfbd062
$ ls
b
$ cd ..
$ rm -rf server
$ rm -rf repo
Too many move candidates
$ hg init repo
$ initclient repo
$ cd repo
$ echo a > a
$ hg add a
$ hg ci -m initial
$ hg rm a
$ echo a > b
$ echo a > c
$ echo a > d
$ echo a > e
$ echo a > f
$ echo a > g
$ hg add b
$ hg add c
$ hg add d
$ hg add e
$ hg add f
$ hg add g
$ hg ci -m 'rm a, add many files'
$ hg up -q ".^"
$ echo b > a
$ hg ci -m 'mod a'
created new head
$ hg log -G -T 'changeset: {node}\n desc: {desc}, phase: {phase}\n'
@ changeset: ef716627c70bf4ca0bdb623cfb0d6fe5b9acc51e
| desc: mod a, phase: draft
| o changeset: d133babe0b735059c360d36b4b47200cdd6bcef5
|/ desc: rm a, add many files, phase: draft
o changeset: 1451231c87572a7d3f92fc210b4b35711c949a98
desc: initial, phase: draft
$ hg rebase -s 2 -d 1
rebasing 2:ef716627c70b "mod a" (tip)
other [source] changed a which local [dest] deleted
hint: if this message is due to a moved file, you can ask mercurial to attempt to automatically resolve this change by re-running with the --tracecopies flag, but this will significantly slow down the operation, so you will need to be patient.
Source control team is working on fixing this problem.
use (c)hanged version, leave (d)eleted, or leave (u)nresolved? u
unresolved conflicts (see hg resolve, then hg rebase --continue)
[1]
$ cd ..
$ rm -rf server
$ rm -rf repo
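
The "Too many move candidates" case exercises the cap set by `maxmovescandidatestocheck`. A minimal Python 3 sketch of how the cap applies is shown here; `candidates_to_check` is a hypothetical helper and the lists are filled in by hand rather than read from a manifest.

```python
MAX_CANDIDATES = 5  # mirrors the default of copytrace.maxmovescandidatestocheck


def candidates_to_check(same_basename, same_dirname, limit=MAX_CANDIDATES):
    """Return (candidates to check, total found); sketch, not the
    extension's code."""
    candidates = same_basename + same_dirname
    return candidates[:limit], len(candidates)


# "a" was removed and b..g were added in the same (root) directory.
checked, total = candidates_to_check([], ["b", "c", "d", "e", "f", "g"])
over_limit = total > MAX_CANDIDATES
```

Only the first five candidates are examined; when more are found, the extension logs the total under `toomanymovescandidates` instead of checking the rest.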