Now that there is only a single call to addmodehdr() left, and there
is other similar code (for new/deleted files) around that call site,
let's inline the function there. That also makes it clearer under what
circumstances the header is actually written (when modes differ).
This is the first step towards simplifying the big loop in
trydiff(). This will make both the header code and the non-header code
clearer, and it prepares for further simplification of the many nested
if-statements in the body of the loop.
Use '1' and '2' as suffix for names just like in the parameters
'ctx[12]':
to,tn -> content1,content2
a,b -> f1, f2
omode,mode -> mode1,mode2
omode,nmode -> mode1,mode2
onode,nnode -> node1,node2
oflag,nflag -> flag1,flag2
oindex,nindex -> index1,index2
When creating a diff with copy/rename enabled, we consider added files
and check if they are either copy sources or targets. However, an
added file should never be a copy source. The test suite seems to
agree with this: all tests pass if we raise an exception when an added
file is a copy source. So, let's simplify the code by dropping the
conditions that are never true.
For those interested in the historical reasons:
Before commit c15c00e7afba (patch: separate reverse copy data
(issue1959), 2010-02-11), 'copy' seems to have been a bidirectional
map. Then that commit split it up into two unidirectional maps and
duplicated the logic to look in both maps. It was still needed at that
point to look in both maps, as the copy detection was poor and could
sometimes be reported in reverse.
A little later came 5a644704d5eb (copies: rewrite copy detection for
non-merge users, 2012-01-04). That commit fixed the copy detection to
be backwards when it should, and made the hacks in trydiff
unnecessary.
We started updating 'modifiedset' in e85e48ffb83e (trydiff: simplify
checking for additions, 2014-12-23) but in the same commit, we removed
the last use of the variable. Clean it up.
When a binary source has been copied or renamed into a non-binary
file, we don't check whether the copy source was binary. There is a
code comment explaining that a git mode will be forced anyway in order
to capture the copy record (i.e. losedatafn() will be called). This is
true, but forcing git mode is not the only effect binary files have:
when git mode was already requested, we use the binary-ness to tell us
whether to use a regular unified diff or a git binary diff. The user
sees this as a "Binary file $file has changed" instead of the binary
The 'dodiff' variable is initialized to True and may later be set to
either False or "binary". When it's set to False, we skip everything
after that point, so we can simplify by instead continue-ing (the
loop). We can then also drop the 'if dodiff', since it will always be
true.
The conditional-ness is not clear from the name and there is only one
caller, so it's clearer to check on the call site. Moving it also
makes addindexmeta() no longer close on the 'opts' variable.
In the body of the loop in trydiff(), there are conditions like:
addedset or (f in modifiedset and to is None)
The second half of that expression is to account for the fact that
merge-in additions appear as additions. By instead fixing up the sets
of modified and added files to compensate for this fact, we can
simplify the body of the loop. It also fixes one case where the
addedset was checked without the additional check (the "have we
already reported a copy above?" case in the code, also see fixed test
case).
The similar condition with 'removedset' in it seems to have served no
purpose even before this change, so it could have been simplified even
before.
Note that there is a comment saying "ctx2 date may be dynamic". The
comment was introduced in 8ed3d2a60500 (patch: use contexts for diff,
2006-12-25), but it seems like it stopped being dynamic in that very
changeset -- before that changeset, the date seems to have been the
file's mtime, but after the changeset, it seems to be the changeset's
timestamp (current time for workingctx). Since no one seems to have
missed the "dynamicness", let's simplify and extract a date2 for
symmetry with date1.
This only has a noticeable effect on diffs touching A LOT of
files. For example, it takes
hg diff -r FIREFOX_AURORA_30_BASE -r FIREFOX_AURORA_35_BASE
from 1m55.465s to 1m32.354s. That diff has over 50k files.
These aren't exactly format-breaking features -- just ones for which patches
applied to a repo will produce incorrect commits, In any case, some commands
like record and annotate only care about this feature.
Not all callers are interested in all diffopts -- for example, commands like
record (which use diff internally) break when diffopts like noprefix are
enabled. This function will allow us to add flags that callers can use to
enable only the features they're interested in.
In an upcoming patch we'll enable support as an option to 'hg diff' as well.
The tests reflect the current state of the world -- as we add support we'll see
changes in the test output.
Upcoming patches will add an option that will almost certainly break diff
output parsers when enabled. Add support for forcing an option to something in
plain mode, as a fallback. Options passed in via the CLI are not affected,
though -- it is assumed that any script passing the option in explicitly knows
what it is doing.
The following patch splits up changed lines along tabs (using
re.findall), and gives them a "diff.tab" label. This can be used by
the color extension for colorising tabs, like it does right now with
trailing whitespace.
I also provide corresponding tests.
The internal API used IOError to indicate that a file should be marked as
removed.
There is some correlation between IOError (especially with ENOENT) and files
that should be removed, but using IOErrors to represent file removal internally
required some hacks.
Instead, use the value None to indicate that the file not is present.
Before, spurious IO errors could cause commits that silently removed files.
They will now be reported like all other IO errors so the root cause can be
fixed.
Future patches will allow patch.diff to take a basectx so node1 (and node2)
could make hexfunc error out. Instead, we'll call the node function on the
context object directly.
The `hg import` command gains a `--partial` flag. When specified, a commit will
always be created from a patch import. Any hunk that fails to apply will
create .rej file, same as what `hg qimport` would do. This change is mainly
aimed at preserving changeset metadata when applying a patch, something very
important for reviewers.
In case of failure with `--partial`, `hg import` returns 1 and the following
message is displayed:
patch applied partially
(fix the .rej files and run `hg commit --amend`)
When multiple patches are imported, we stop at the first one with failed hunks.
In the future, someone may feel brave enough to tackle a --continue flag to
import.
Before this patch, "contrib/check-code.py" can't detect these
problems, because the regexp pattern to detect "% inside _()" doesn't
suppose the case that format string consists of multiple string
components concatenated implicitly or explicitly,
This patch does below for that regexp pattern to detect "% inside _()"
problems in such case.
- put "+" into separator part ("[ \t\n]") for explicit concatenation
("...." + "...." style)
- enclose "component and separator" part by "(?:....)+" for
concatenation itself ("...." "...." or "...." + "....")
When creating patches modifying binary files using "git format-patch",
git creates 'literal' and 'delta' hunks. Mercurial currently supports
'literal' hunks only, which makes it impossible to import patches with
'delta' hunks.
This changeset adds support for 'delta' hunks. It is a reimplementation
of patch-delta.c from git :
http://git.kernel.org/cgit/git/git.git/tree/patch-delta.c
This is arguably a workaround, a better fix may be in the repo to
ensure that it won't list a file 'modified' unless there is a file
context for the previous version.
addremove required paths relative to the cwd, which meant a lot of extra code
that transformed paths into relative ones. That code is now gone as well.
This reduces the likelihood of a traceback when trying to email a
patch that happens to have 'diff --git' at the beginning of a line
in the description, as this patch did:
http://markmail.org/message/wxpgowxd7ucxygwe
The check pattern only checked for whitespace between keyword and operator.
Now it also warns:
> x = f(),7
missing whitespace after ,
> x = f()+7
missing whitespace in expression
In an upcoming patch, we will add index information to all git diffs, not
only binary diffs, so this code needs to be moved to a more appropriate
place.
Also, since this information is used for patch headers, it makes more
sense to be in the patch module, along with other patch-related metadata.
addmodehdr is a header helper, same as diffline, so it doesn't
need to be a top-level function and can be nested under trydiff.
In upcoming patches we will generalize this approach for
all headers.
Make diffline more readable, using strings with placeholders
rather than appending to a list from many ifs that makes
difficult to understand the actual output format.
Before, quiet mode produced no diffline header for mercurial as a
side effect of not populating "revs". This was a weird side effect,
and we will always need revs for git index header that will be added
in upcoming patches, so now we just check ui.quiet from diffline
directly.
diffline is not part of diff computation, so it makes more sense
to place it with other header generation in patch module.
In upcoming patches we will generalize this approach for
all headers added in the patch, including the git index
header.
diffline was called from trydiff for binary diffs and from unidiff
for text diffs. In this patch we unify those calls into one.
diffline is also a header, not part of diff mechanisms, so it makes
sense to remove that responsibility from the mdiff module. In
upcoming patches we will move diffline to patch module and
keep grouping responsibilities.
b85diff generates a binary diff, so we move this code to mdiff module
along with unidiff for text diffs. All diffing mechanisms will be in the
same place.
In an upcoming patch we will remove the responsibility to print the
index header from b85diff and move it back to patch, since it's
a patch metadata header, not part of the diff generation.
When applying a patch renaming/copying 'a' to 'b' on a revision where
'a' does not exist, the patching process would abort immediately,
without processing the remaining hunks and without reporting it. This
patch makes the patching no longer abort and possible hunks applied on
the copied/renamed file be written in reject files.
There have been quite a few places where we pop elements off the
front of a list. This can turn O(n) algorithms into something more
like O(n**2). Python has provided a deque type that can do this
efficiently since at least 2.4.
As an example of the difference a deque can make, it improves
perfancestors performance on a Linux repo from 0.50 seconds to 0.36.
Since f7e538c3b7ba, if a chunk starts with "@@ -0,1", oldstart turns into
a negative value. Because diffhelpers.testhunk() doesn't expect negative bstart,
it bypasses "alen > blen - bstart" condition and segfaults at
"PyList_GET_ITEM(b, i + bstart)".
The only place where an trailing CR could be meaningful is in the "diff --git"
line as part of a filename, and existing code already rule out this
possibility. Extend the CR/LF filtering to the whole binary hunk.
$ hg import --no-commit ../mercurial_1915035238540490516.patch
applying ../mercurial_1915035238540490516.patch
abort: could not extract binary data
Becomes:
abort: could not extract "binary2" binary data
Before, import was terminating with a traceback. Now it says:
$ hg import --no-commit ../bad.patch
applying ../bad.patch
abort: could not decode binary patch: bad base85 character at position 66
Git patches are parsed in two phases: 1) extract metadata, 2) parse actual
deltas and merge them with the previous metadata. We do this to avoid
dependency issues like "modify a; copy a to b", where "b" must be copied from
the unmodified "a".
Issue3384 is caused by flaky code I wrote to synchronize the patch metadata
with the emitted hunk:
if (gitpatches and
(gitpatches[-1][0] == afile or gitpatches[-1][1] == bfile)):
gp = gitpatches.pop()[2]
With a patch like:
diff --git a/a b/c
copy from a
copy to c
--- a/a
+++ b/c
@@ -1,1 +1,2 @@
a
+a
@@ -2,1 +2,2 @@
a
+a
diff --git a/a b/a
--- a/a
+++ b/a
@@ -1,1 +1,2 @@
a
+b
the first hunk of the first block is matched with the metadata for the block
"diff --git a/a b/c", then the second hunk of the first block is matched with
the metadata of the second block "diff --git a/a b/a", because of the "or" in
the code paste above. Turning the "or" into an "and" is not enough as we have
to deal with /dev/null cases for each file.
We I remove this broken piece of code:
# copy/rename + modify should modify target, not source
if gp.op in ('COPY', 'DELETE', 'RENAME', 'ADD') or gp.mode:
afile = bfile
because "afile = bfile" set "afile" to stuff like "b/file" instead of "a/file",
and because this only happens for git patches, which afile/bfile are ignored
anyway by applydiff().
v2:
- Avoid a traceback on git metadata desynchronization
Here is how export and mq write the "Parent" header:
mq: # Parent XXXXX
export: # Parent XXXXX
then import expects exactly 2 spaces while mq tolerates one or more. So "hg
import --exact" truncates mq generated patches header by one character and
fails. This patch aligns import "Parent" header parsing on mq one. I do not
expect spaces in parent references anytime soon.
Reported by Stefan Ring <stefanrin@gmail.com>
The previous code was assuming a default context of 3 lines. When fuzzing, it
would take this value in account to reduce the amount of removed line from
hunks top or bottom. For instance, if a hunk has only 2 lines of bottom
context, fuzzing with fuzz=1 would do nothing and with fuzz=2 it would remove
one of those lines. A hunk with one line of bottom context could not be fuzzed
at all. patch(1) has apparently no such restrictions and takes the fuzz level
at face value.
- test-import.t: fuzz/offset changes at the beginning of file are explained by
the new fuzzing behaviour and match patch(1) ones. Patching locations are
different but those of my patch(1) do not make a lot of sense right now
(patched output are the same)
- test-import-bypass.t: more agressive fuzzing makes a patching supposed to
fail because of context, succeed. Change the diff to avoid this.
- test-mq-merge.t: more agressive fuzzing would allow the merged patch to apply
with fuzz, but fortunately we disallow this behaviour. The new output is
kept.
I have not enough experience with patch(1) fuzzing to know whether aligning our
implementation on it is a good or bad idea. Until now, it has been the
implementation reference. For instance, "qpush" tolerates fuzz (test-mq-merge.t
runs the special case of pushing merge revisions where fuzzing is forbidden).
When applying hunks such as:
@@ -2,1 +2,2 @@
context
+change
fuzzing would empty the "old" block and make patchfile.apply() traceback.
Instead, we apply the new block at specified location without testing.
The "bottom hunk" test was removed as patch(1) has no problem applying hunk
with no context in the middle of a file.
- It moves hunks processing weirdness where it belongs
- It helps reusing said weirdness whenever fuzzit() is called, like during the
actual hunk fuzzing.
There is no reason to discard copy sources from the set of files considered by
addremove(). It was done to handle the case where a first patch would create
'a' and a second one would move 'a' to 'b'. If these patches were applied with
--no-commit, 'a' would first be marked as added, then unlinked and dropped from
the dirstate but still passed to addremove(). A better fix is thus to exclude
removed files which ends being dropped from the dirstate instead of removed.
Reported by Jason Harris <jason@jasonfharris.com>
Prior to this patch "hg diff -U0", i.e., zero lines of context, would
output hunk headers with a start line one greater than what GNU patch
and git output. Guido van Rossum documents the unified diff format[1]
as having a start line value "one lower than one would expect" for
zero length hunks.
Comparing the behaviour of the three systems prior to this patch in
transforming
c1
c3
to
c1
c2
c3
- GNU "diff -U0" reports the hunk as "@@ -1,0 +2 @@"
- "git diff -U0" reports the hunk as "@@ -1,0 +2 @@"
- "hg diff -U0" reports the hunk as "@@ -2,0 +2,1 @@"
After this patch, "hg diff -U0" reports "@@ -1,0 +2,1 @@".
Since "hg export --config diff.unified=0" outputs zero-context unified
diffs, "hg import" has also been updated to account for start lines
one less than expected for zero length hunk ranges.
[1]: http://www.artima.com/weblogs/viewpost.jsp?thread=164293
splitblock() was added to handle blocks returned by bdiff.blocks() which differ
only by blank lines but are not made only of blank lines. I do not know exactly
how it could happen but mdiff.blocks() threshold behaviour makes me think it
can if those blocks are made of very popular lines mixed with popular blank
lines. If it is proven to be wrong, the function can be dropped.
The first implementation made annotate share diff configuration entries. But it
looks like users will user -w/b for annotate but not for diff, on both the
command line and hgweb. Since the latter cannot use command line entries, we
introduce a new [annotate] section duplicating the diff whitespace options.
The 'Bin' marker was added to every changed file for which we could not find
any diff changes. This included binary files but also copy/renames and mode
changes. Since Mercurial regular diff format emits a 'Binary file XXX has
changed' line when fed with binary files, we use that and the usual git marker
to tell them from other cases. In particular, new empty files are no longer
reported as binary.
Still, this fix is not complete since copy/renames/mode changes are now
reported as '0' lines changes, instead of 'Bin'.
The line content of continued Subject: lines was yet joined via
str.replace('\n\t', ' '), which does not handle continuation via
spaces. So expan the regular expression instead to
handle all allowed forms of mail header line continuation.
This feature is more a way to test patching without a working directory than
something people asked about. Adding a --rev option to specify the parent patch
revision would make it a little more useful.
What this change introduces is patch.repobackend class which let patches be
applied against repository revisions. The caller must supply a filestore object
to receive patched content, which can be turned into a memctx with
patch.makememctx() helper.
- Add patchmeta.copy() and emit copies from iterhunks. Modifying patchmeta
instances in applydiff() makes things simpler.
- Rename selectfile() into makepatchmeta(). It is responsible for creating
patchmeta for regular patches.
- Pass patchmeta objects to patchfile() directly
patchmeta instances were associated with git patches, for regular patches we
had to pass additional variables to tell the patch intent to patchfile().
Instead, we generate patchmeta for regular patches and pass them. This will
also help with patch filtering by matcher objects.
This information is more correctly returned by backends.
The extra updated file removed from test-mq-merge.t output came from changes
from git patches being counted before being really applied in some cases.
Synchronizing on bfile does not work on file removal where bfile is /dev/null.
We match items on afile or bfile instead. The incorrect code makes iterhunks()
to emit patchmeta and hunks separately in some cases. This is currently hidden
by applydiff() being too tolerant when processing patchmeta, and will be fixed
later.
git patches may require copies to be handled out-of-order. For instance, take
the following sequence:
* modify a
* copy a into b
Here, we have to generate b from a before its modification. To do so,
applydiff() was scanning for copy metadata and performing the copies before
processing the other changes in-order. While smart and efficient, this approach
complicates things by handling file copies and file creations at different
places and times. While a new file must not exist before being patched a copied
file already exists before applying the first hunk.
Instead of copying the files at their final destination before patching, we
store them in a temporary file location and retrieve them when patching. The
filestore always stores file content in real files but nothing prevents adding
a cache layer. The filestore class was kept separate from fsbackend for at
least two reasons:
- This class is likely to be reused as a temporary result store for a future
repository patching call (entries just have to be extended to contain copy
sources).
- Delegating this role to backends might be more efficient in a repository
backend case: the source files are already available in the repository itself
and do not need to be copied again. It also means that third-parties backend
would have to implement two other methods. If we ever decide to merge the
filestore feature into backend, a minimalistic approach would be to compose
with filestore directly. Keep in mind this copy overhead only applies for
copy/rename sources, and may even be reduced to copy sources which have to
handled ahead of time.
The patcher has to know if a file is being created or removed to check if the
target already exists, or to actually unlink the file when a hunk emptying it
is applied. This was done by embedding the creation/removal information in the
first (and only) hunk attached to the file.
There are two problems with this approach:
- creation/removal is really a property of the file being patched and not its
hunk.
- for regular patches, file creation cannot be deduced at parsing time: there
are case where the *stripped* file paths must be compared. Modifying hunks
after their creation is clumsy and prevent further refactorings related to
copies handling.
Instead, we delegate this job to selectfile() which has all the relevant
information, and remove the hunk createfile() and rmfile() methods.
Restore the previous diffstat behaviour of scaling by the maximum number of
changes to a single file. Changeset 7bb0e22a7988 modified the diffstat to be
scaled by the total number of changes. This seems to have been unintentional.
gitpatch objects emitted by iterhunks() were referencing file paths unmodified
from the input patch. _applydif() made them usable by modifying the gitpatch
objects in-place with specified path strip level. The same modified objects
were then reused by iterhunks() generator. _applydiff() now copies and update
the paths which completely decouples both routines.
As a side effect, the "git" event now receives only metadata about
copies/renames to perform the necessary copies ahead of time. Other actions are
handled in the "file" event.
Patch changes are emitted by iterhunks() in two separate events: 'file' when
hunks have to be applied and 'git' to describe other modifications like copies
or mode changes. Note that a file which mode is changed and which content is
modified by the same patch will be emitted in both events. It is more
convenient to handle all file modifications in a single event. This patch
"zips" git actions with regular changes so both kinds can be emitted at the
same place.
git afile/bfile are extracted twice, once when reading a 'diff --git', and
again when reading a unified hunk. The problem is not all git blocks have
unified hunks (renames just have metadata) and they were not extracted the same
way. This is what this patch unifies.
gitpatch objects emitted by iterhunks() are modified in place by applydiff().
Processing them earlier improves iterhunks() isolation. applydiff() modifying
them should still be fixed though.
_updatedir() is no longer used by internalpatch()
The change in test-mq-missingfiles.t comes from workingbackend not considering
the missing 'b' file as changed, thus not calling addremove() on it.
This introduces a performance regression for large files, as they will be
copied just to be clobbered afterwards since binary patching does not use
deltas. But it simplifies the code and the previous optimization will be
reintroduced later in a better way.
_applydiff() patcher argument was added to help hgsubversion like extension
monkeypatching the patching process. While it could be removed at this point, I
prefer to leave it until patch.py is completely refactored and there is a valid
and tested alternative.
Most filesystem calls are already isolated in patchfile but this is not enough:
renames are performed before patchfile is available and some chmod calls are
even done outside of the applydiff call. Once all these calls are extracted
into a backend class, we can provide cleaner APIs to write to a working
directory context directly into the repository.
The local err variable would be bound to PatchError thrown and it
keeps this value even after the except block is executed.
The whole thing worked anyway since the rejected variable is set in
the except block and this makes the function return -1 when a
PatchError is thrown.
Why?
- Mercurial internal patcher works correctly for regular patches and git
patches, is much faster at least on Windows and is more extensible.
- In theory, the external patcher can be used to handle exotic patch formats. I
do not know any and have not heard about any such use in years.
- Most patch programs cannot handle git format patches, which makes the API
caller to decide either to ignore ui.patch by calling patch.internalpatch()
directly, or take the risk of random failures with valid inputs.
- One thing a patch program could do Mercurial patcher cannot is applying with
--reverse. Apparently several shelve like extensions try to use that,
including passing the "reverse" option to Mercurial patcher, which has been
removed mid-2009. I never heard anybody complain about that, and would prefer
reimplementing it anyway.
And from the technical perspective:
- The external patcher makes everything harder to maintain and implement. EOL
normalization is not implemented, and I would bet file renames, if supported
by the patcher, are not correctly recorded in the dirstate.
- No tests.
How?
- Remove related documentation
- Clearly mark patch.externalpatch() as private
- Remove the debuginstall check. This deprecation request was actually
triggered by this last point. debuginstall is the only piece of code patching
without a repository. When migrating to an integrated patch() + updatedir()
call, this was really a showstopper, all workarounds were either ugly or
uselessly complicated to implement. If we do not support external patcher
anymore, the debuginstall check is not useful anymore.
- Remove patch.externalpatch() after 1.9 release.
The patch changes the output of "hg diff --stat" when one file whose filename
has spaces has changed, making it get the full filename instead of just the
substring between the last space and the end of the filename.
It also changes the diffstat generated by "hg email -d" when one of the commit
messages starts with "diff". Because of the regex used to parse the filename,
the diffstat generated by "hg email -d" will still be not correct if a commit
message starts with "diff -r ".
Before the patch Mercurial has the following behavior:
$ echo "foobar">"file with spaces"
$ hg add "file with spaces"
$ hg diff --stat
spaces | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
$ hg diff --git --stat
file with spaces | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
After the patch:
$ echo "foobar">"file with spaces"
$ hg add "file with spaces"
$ hg diff --stat
file with spaces | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
$ hg diff --git --stat
file with spaces | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
Before the patch:
$ hg add mercurial/patch.py tests/tests-diffstat.t
$ hg commit -m "diffstat: fix parsing of filenames"
$ hg email -d --test tip
This patch series consists of 1 patches.
diffstat: fix parsing of filenames
[...]
filenames | 0
mercurial/patch.py | 6 ++++--
tests/test-diffstat.t | 17 +++++++++++++++++
3 files changed, 21 insertions(+), 2 deletions(-)
[...]
After the patch:
$ hg email -d --test tip
This patch series consists of 1 patches.
diffstat: fix parsing of filenames
[...]
mercurial/patch.py | 6 ++++--
tests/test-diffstat.t | 17 +++++++++++++++++
3 files changed, 21 insertions(+), 2 deletions(-)
[...]
Do not pass reject file content to patchfile.writelines() to:
- Avoid line endings transformations
- Avoid polluting overriding implementations with unrelated data. They should
override write_rej() to deal or ignore reject files properly.
Bug report, analysis and original patch and test by
Shun-ichi GOTO <shunichi.goto@gmail.com>
Previously no '# ' lines came through the parser.
Now only the first '# ' lines are processed, from '# HG changeset patch' and to
the first line not starting with '# '.
Some encoding and language combinations (e.g.: UTF-8 and Japanese)
cause encoding characters into sequence of bytes more than column
width of them.
So, encoding.colwidth() should be applied instread of len() on i18n
strings.
In addition to it, formatting by '%*s'/'%-*s' also uses "number of
bytes" to calculate space padding size, and should be fixed, too.