sapling/tests/test-diff-color.t
Jun Wu 97411a0738 patch: rewrite the worddiff algorithm
Summary:
There were recent complaints about both the quality [1] [2] and the
performance [3] of the current word diff algorithm.

The current algorithm is actually bad in various ways:

  - Lines could be matched across hunks, which is confusing (report [1]).
  - Short lines can fail the "similarity" check, which means they won't be
    highlighted when they are expected to be (report [2]).
  - Various performance issues:
    - It uses difflib, which is implemented in pure Python and is both slow
      and suboptimal compared with xdiff.
    - Searching for matched lines across hunks could be O(N^2) if no
      matches are found.
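
The short-line problem can be reproduced with difflib alone. This is only a
sketch: it assumes the similarity check is a plain `SequenceMatcher.ratio()`
comparison, and both the `similar` helper and the 0.7 threshold are
hypothetical, not values taken from the old code.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off; the threshold used by the old code is not stated here.
THRESHOLD = 0.7


def similar(a, b, threshold=THRESHOLD):
    """Return True if two lines count as "similar" under a ratio check."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Short lines: only " = " matches, so the ratio is 2 * 3 / (5 + 5) = 0.6
# and the pair is rejected, although a reader would pair these lines up.
short_ok = similar("x = 1", "y = 2")

# A long line with a one-character change passes easily.
long_ok = similar("value = compute(alpha, beta, 1)",
                  "value = compute(alpha, beta, 2)")
```

With these inputs `short_ok` is false and `long_ok` is true, so under this
assumed threshold only the long pair would get word highlighting.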

Thinking about it in terms of "highlighting" is actually tricky. Consider the
following change:

```
  # before
  foo = 10

  # after
  if True:
      foo = 21 + 3
```

It's obvious that "10" and "21 + 3" need highlighting because they are
different.  But what about "if True:"? In theory it's also "different" and
needs highlighting. What about purely inserted or deleted hunks, then?
Highlighting all of them would be too noisy.

This diff rewrites the word diff algorithm. It differs in multiple ways:

  1. Get rid of the "matching lines by similarity" step.
  2. Only diff words within the same hunk.
  3. Dim unchanged words instead of highlighting changed words.
  4. Treat pure insertion or deletion hunks differently - do not dim or
     highlight words in them.
  5. Use xdiff instead of difflib.
  6. Use a better regexp to split words. This reduces the number of tokens sent
     to the diff algorithm.

1, 2, 5 and 6 help performance. 1, 2, 3 and 4 make the result more predictable
and trustworthy. 3 avoids the nasty question of what to highlight. 3 and 4 make
it more flexible for people to tweak colors. 6 makes the result better, since
it merges multiple space tokens into one, so xdiff is less likely to miss
important matches in favor of meaningless ones such as spaces.
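
Points 2, 3, 4 and 6 can be sketched in a few lines. This is an illustration
only: it uses Python's difflib in place of xdiff, and the regexp and the names
`WORD_RE`, `tokenize` and `word_diff_hunk` are hypothetical, not the ones in
patch.py. The labels follow the `diff.deleted.changed` naming.

```python
import re
from difflib import SequenceMatcher

# Hypothetical splitter: a run of word characters, a run of spaces/tabs
# (merged into ONE token), or any other single character.
WORD_RE = re.compile(r"\w+|[ \t]+|[^\w \t]")


def tokenize(text):
    """Split text into word, whitespace-run and punctuation tokens."""
    return WORD_RE.findall(text)


def word_diff_hunk(deleted, inserted):
    """Label the '-' and '+' text of a single hunk.

    Returns two lists of (label, text) pairs, one per side.
    """
    if not deleted or not inserted:
        # Pure insertion or deletion: neither dim nor highlight anything.
        return (
            [("diff.deleted", deleted)] if deleted else [],
            [("diff.inserted", inserted)] if inserted else [],
        )

    a, b = tokenize(deleted), tokenize(inserted)
    # Stand-in for xdiff: diff the two token lists of this hunk only.
    matcher = SequenceMatcher(None, a, b, autojunk=False)

    del_out, ins_out = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        state = "unchanged" if tag == "equal" else "changed"
        if i2 > i1:
            del_out.append(("diff.deleted." + state, "".join(a[i1:i2])))
        if j2 > j1:
            ins_out.append(("diff.inserted." + state, "".join(b[j1:j2])))
    return del_out, ins_out


deleted_side, inserted_side = word_diff_hunk(
    "be changed into three!", "be changed into four!"
)
# deleted_side is:
#   [("diff.deleted.unchanged", "be changed into "),
#    ("diff.deleted.changed", "three"),
#    ("diff.deleted.unchanged", "!")]
```

Because every token gets either an "unchanged" or a "changed" label, a color
scheme only has to decide how to render the two states. The sketch never has
to decide which words deserve highlighting.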

"bold" and "underline" were removed so the changed words will have regular
red/green colors. The output won't be too "noisy" even in cases where code is
changed in a way that makes inline word matching meaningless. People who want
more contrast can set:

  [color]
  diff.inserted.changed = green bold
  diff.deleted.changed = red bold

In practice, when diffing D7319718, the old code spends 4 seconds finding
matched lines in preparation for worddiff:

```
           | diffordiffstat                     cmdutil.py:1522
            \ difflabel (17467 times)           patch.py:2471
             ....
> 3927        \ _findmatches (22 times)         patch.py:2537
   348          \ __init__ (8158 times)         difflib.py:154
   340           | set_seqs (8158 times)        difflib.py:223
   328           | set_seq2 (8158 times)        difflib.py:261
   322           | __chain_b (8158 times)       difflib.py:306
  1818          \ ratio (8158 times)            difflib.py:636
  1777           | get_matching_blocks (8158 times) difflib.py:460
  1605            \ find_longest_match (51966 times) difflib.py:350
    38             | __new__ (51966 times)      <string>:8
    29            \ _make (36035 times)         <string>:12
   143      \ write (17466 times)               ui.py:883
```

The new code takes 0.14 seconds:

```
             | diffordiffstat                   cmdutil.py:1522
              \ difflabel (23401 times)         patch.py:2562
               ....
>  140          \ consumehunkbuffer (23346 times) patch.py:2585
   130           | diffsinglehunkinline (23240 times) patch.py:2496
   215        \ write (23400 times)             ui.py:883
   118    \ flush                               cmdutil.py:1606
   118     | write                              ui.py:883

```

[1]: https://fburl.com/lkb9rc9m
[2]: https://fburl.com/0r9bqf0e
[3]: https://fburl.com/pxqznw31

Reviewed By: ryanmce

Differential Revision: D7314726

fbshipit-source-id: becd979cb9ac3fd3f4adae11cb10804d535f58df
2018-04-13 21:51:30 -07:00


Setup
$ cat <<EOF >> $HGRCPATH
> [ui]
> color = yes
> formatted = always
> paginate = never
> [color]
> mode = ansi
> EOF
$ hg init repo
$ cd repo
$ cat > a <<EOF
> c
> c
> a
> a
> b
> a
> a
> c
> c
> EOF
$ hg ci -Am adda
adding a
$ cat > a <<EOF
> c
> c
> a
> a
> dd
> a
> a
> c
> c
> EOF
default context
$ hg diff --nodates
\x1b[0;1mdiff -r cf9f4ba66af2 a\x1b[0m (esc)
\x1b[0;31;1m--- a/a\x1b[0m (esc)
\x1b[0;32;1m+++ b/a\x1b[0m (esc)
\x1b[0;35m@@ -2,7 +2,7 @@\x1b[0m (esc)
c
a
a
\x1b[0;31m-b\x1b[0m (esc)
\x1b[0;32m+dd\x1b[0m (esc)
a
a
c
(check that 'ui.color=yes' match '--color=auto')
$ hg diff --nodates --config ui.formatted=no
diff -r cf9f4ba66af2 a
--- a/a
+++ b/a
@@ -2,7 +2,7 @@
c
a
a
-b
+dd
a
a
c
(check that 'ui.color=no' disable color)
$ hg diff --nodates --config ui.formatted=yes --config ui.color=no
diff -r cf9f4ba66af2 a
--- a/a
+++ b/a
@@ -2,7 +2,7 @@
c
a
a
-b
+dd
a
a
c
(check that 'ui.color=always' force color)
$ hg diff --nodates --config ui.formatted=no --config ui.color=always
\x1b[0;1mdiff -r cf9f4ba66af2 a\x1b[0m (esc)
\x1b[0;31;1m--- a/a\x1b[0m (esc)
\x1b[0;32;1m+++ b/a\x1b[0m (esc)
\x1b[0;35m@@ -2,7 +2,7 @@\x1b[0m (esc)
c
a
a
\x1b[0;31m-b\x1b[0m (esc)
\x1b[0;32m+dd\x1b[0m (esc)
a
a
c
--unified=2
$ hg diff --nodates -U 2
\x1b[0;1mdiff -r cf9f4ba66af2 a\x1b[0m (esc)
\x1b[0;31;1m--- a/a\x1b[0m (esc)
\x1b[0;32;1m+++ b/a\x1b[0m (esc)
\x1b[0;35m@@ -3,5 +3,5 @@\x1b[0m (esc)
a
a
\x1b[0;31m-b\x1b[0m (esc)
\x1b[0;32m+dd\x1b[0m (esc)
a
a
diffstat
$ hg diff --stat
a | 2 \x1b[0;32m+\x1b[0m\x1b[0;31m-\x1b[0m (esc)
1 files changed, 1 insertions(+), 1 deletions(-)
$ cat <<EOF >> $HGRCPATH
> [extensions]
> record =
> [ui]
> interactive = true
> [diff]
> git = True
> EOF
#if execbit
record
$ chmod +x a
$ hg record -m moda a <<EOF
> y
> y
> EOF
\x1b[0;1mdiff --git a/a b/a\x1b[0m (esc)
\x1b[0;36;1mold mode 100644\x1b[0m (esc)
\x1b[0;36;1mnew mode 100755\x1b[0m (esc)
1 hunks, 1 lines changed
\x1b[0;33mexamine changes to 'a'? [Ynesfdaq?]\x1b[0m y (esc)
\x1b[0;35m@@ -2,7 +2,7 @@ c\x1b[0m (esc)
c
a
a
\x1b[0;31m-b\x1b[0m (esc)
\x1b[0;32m+dd\x1b[0m (esc)
a
a
c
\x1b[0;33mrecord this change to 'a'? [Ynesfdaq?]\x1b[0m y (esc)
$ echo "[extensions]" >> $HGRCPATH
$ echo "mq=" >> $HGRCPATH
$ hg rollback
repository tip rolled back to revision 0 (undo commit)
working directory now based on revision 0
qrecord
$ hg qrecord -m moda patch <<EOF
> y
> y
> EOF
\x1b[0;1mdiff --git a/a b/a\x1b[0m (esc)
\x1b[0;36;1mold mode 100644\x1b[0m (esc)
\x1b[0;36;1mnew mode 100755\x1b[0m (esc)
1 hunks, 1 lines changed
\x1b[0;33mexamine changes to 'a'? [Ynesfdaq?]\x1b[0m y (esc)
\x1b[0;35m@@ -2,7 +2,7 @@ c\x1b[0m (esc)
c
a
a
\x1b[0;31m-b\x1b[0m (esc)
\x1b[0;32m+dd\x1b[0m (esc)
a
a
c
\x1b[0;33mrecord this change to 'a'? [Ynesfdaq?]\x1b[0m y (esc)
$ hg qpop -a
popping patch
patch queue now empty
#endif
issue3712: test colorization of subrepo diff
$ hg init sub
$ echo b > sub/b
$ hg -R sub commit -Am 'create sub'
adding b
$ echo 'sub = sub' > .hgsub
$ hg add .hgsub
$ hg commit -m 'add subrepo sub'
$ echo aa >> a
$ echo bb >> sub/b
$ hg diff -S
\x1b[0;1mdiff --git a/a b/a\x1b[0m (esc)
\x1b[0;31;1m--- a/a\x1b[0m (esc)
\x1b[0;32;1m+++ b/a\x1b[0m (esc)
\x1b[0;35m@@ -7,3 +7,4 @@\x1b[0m (esc)
a
c
c
\x1b[0;32m+aa\x1b[0m (esc)
\x1b[0;1mdiff --git a/sub/b b/sub/b\x1b[0m (esc)
\x1b[0;31;1m--- a/sub/b\x1b[0m (esc)
\x1b[0;32;1m+++ b/sub/b\x1b[0m (esc)
\x1b[0;35m@@ -1,1 +1,2 @@\x1b[0m (esc)
b
\x1b[0;32m+bb\x1b[0m (esc)
test tabs
$ cat >> a <<EOF
> one tab
> two tabs
> end tab
> mid tab
> all tabs
> EOF
$ hg diff --nodates
\x1b[0;1mdiff --git a/a b/a\x1b[0m (esc)
\x1b[0;31;1m--- a/a\x1b[0m (esc)
\x1b[0;32;1m+++ b/a\x1b[0m (esc)
\x1b[0;35m@@ -7,3 +7,9 @@\x1b[0m (esc)
a
c
c
\x1b[0;32m+aa\x1b[0m (esc)
\x1b[0;32m+\x1b[0m \x1b[0;32mone tab\x1b[0m (esc)
\x1b[0;32m+\x1b[0m \x1b[0;32mtwo tabs\x1b[0m (esc)
\x1b[0;32m+end tab\x1b[0m\x1b[0;1;41m \x1b[0m (esc)
\x1b[0;32m+mid\x1b[0m \x1b[0;32mtab\x1b[0m (esc)
\x1b[0;32m+\x1b[0m \x1b[0;32mall\x1b[0m \x1b[0;32mtabs\x1b[0m\x1b[0;1;41m \x1b[0m (esc)
$ echo "[color]" >> $HGRCPATH
$ echo "diff.tab = bold magenta" >> $HGRCPATH
$ hg diff --nodates
\x1b[0;1mdiff --git a/a b/a\x1b[0m (esc)
\x1b[0;31;1m--- a/a\x1b[0m (esc)
\x1b[0;32;1m+++ b/a\x1b[0m (esc)
\x1b[0;35m@@ -7,3 +7,9 @@\x1b[0m (esc)
a
c
c
\x1b[0;32m+aa\x1b[0m (esc)
\x1b[0;32m+\x1b[0m\x1b[0;1;35m \x1b[0m\x1b[0;32mone tab\x1b[0m (esc)
\x1b[0;32m+\x1b[0m\x1b[0;1;35m \x1b[0m\x1b[0;32mtwo tabs\x1b[0m (esc)
\x1b[0;32m+end tab\x1b[0m\x1b[0;1;41m \x1b[0m (esc)
\x1b[0;32m+mid\x1b[0m\x1b[0;1;35m \x1b[0m\x1b[0;32mtab\x1b[0m (esc)
\x1b[0;32m+\x1b[0m\x1b[0;1;35m \x1b[0m\x1b[0;32mall\x1b[0m\x1b[0;1;35m \x1b[0m\x1b[0;32mtabs\x1b[0m\x1b[0;1;41m \x1b[0m (esc)
$ cd ..
test inline color diff
$ hg init inline
$ cd inline
$ cat > file1 << EOF
> this is the first line
> this is the second line
> third line starts with space
> + starts with a plus sign
> this one with one tab
> now with full two tabs
> now tabs everywhere, much fun
>
> this line won't change
>
> two lines are going to
> be changed into three!
>
> three of those lines will
> collapse onto one
> (to see if it works)
> EOF
$ hg add file1
$ hg ci -m 'commit'
$ cat > file1 << EOF
> that is the first paragraph
> this is the second line
> third line starts with space
> - starts with a minus sign
> this one with two tab
> now with full three tabs
> now there are tabs everywhere, much fun
>
> this line won't change
>
> two lines are going to
> (entirely magically,
> assuming this works)
> be changed into four!
>
> three of those lines have
> collapsed onto one
> EOF
$ hg diff --config experimental.worddiff=False --color=debug
[diff.diffline|diff --git a/file1 b/file1]
[diff.file_a|--- a/file1]
[diff.file_b|+++ b/file1]
[diff.hunk|@@ -1,16 +1,17 @@]
[diff.deleted|-this is the first line]
[diff.deleted|-this is the second line]
[diff.deleted|- third line starts with space]
[diff.deleted|-+ starts with a plus sign]
[diff.deleted|-][diff.tab| ][diff.deleted|this one with one tab]
[diff.deleted|-][diff.tab| ][diff.deleted|now with full two tabs]
[diff.deleted|-][diff.tab| ][diff.deleted|now tabs][diff.tab| ][diff.deleted|everywhere, much fun]
[diff.inserted|+that is the first paragraph]
[diff.inserted|+ this is the second line]
[diff.inserted|+third line starts with space]
[diff.inserted|+- starts with a minus sign]
[diff.inserted|+][diff.tab| ][diff.inserted|this one with two tab]
[diff.inserted|+][diff.tab| ][diff.inserted|now with full three tabs]
[diff.inserted|+][diff.tab| ][diff.inserted|now there are tabs][diff.tab| ][diff.inserted|everywhere, much fun]
this line won't change
two lines are going to
[diff.deleted|-be changed into three!]
[diff.inserted|+(entirely magically,]
[diff.inserted|+ assuming this works)]
[diff.inserted|+be changed into four!]
[diff.deleted|-three of those lines will]
[diff.deleted|-collapse onto one]
[diff.deleted|-(to see if it works)]
[diff.inserted|+three of those lines have]
[diff.inserted|+collapsed onto one]
$ hg diff --config experimental.worddiff=True --color=debug
[diff.diffline|diff --git a/file1 b/file1]
[diff.file_a|--- a/file1]
[diff.file_b|+++ b/file1]
[diff.hunk|@@ -1,16 +1,17 @@]
[diff.deleted|-][diff.deleted.changed|this][diff.deleted.unchanged| is the first ][diff.deleted.changed|line]
[diff.deleted|-][diff.deleted.unchanged|this is the second line]
[diff.deleted|-][diff.deleted.changed| ][diff.deleted.unchanged|third line starts with space]
[diff.deleted|-][diff.deleted.changed|+][diff.deleted.unchanged| starts with a ][diff.deleted.changed|plus][diff.deleted.unchanged| sign]
[diff.deleted|-][diff.tab| ][diff.deleted.unchanged|this one with ][diff.deleted.changed|one][diff.deleted.unchanged| tab]
[diff.deleted|-][diff.tab| ][diff.deleted.unchanged|now with full ][diff.deleted.changed|two][diff.deleted.unchanged| tabs]
[diff.deleted|-][diff.tab| ][diff.deleted.unchanged|now ][diff.deleted.unchanged|tabs][diff.tab| ][diff.deleted.unchanged|everywhere, much fun]
[diff.inserted|+][diff.inserted.changed|that][diff.inserted.unchanged| is the first ][diff.inserted.changed|paragraph]
[diff.inserted|+][diff.inserted.changed| ][diff.inserted.unchanged|this is the second line]
[diff.inserted|+][diff.inserted.unchanged|third line starts with space]
[diff.inserted|+][diff.inserted.changed|-][diff.inserted.unchanged| starts with a ][diff.inserted.changed|minus][diff.inserted.unchanged| sign]
[diff.inserted|+][diff.tab| ][diff.inserted.unchanged|this one with ][diff.inserted.changed|two][diff.inserted.unchanged| tab]
[diff.inserted|+][diff.tab| ][diff.inserted.unchanged|now with full ][diff.inserted.changed|three][diff.inserted.unchanged| tabs]
[diff.inserted|+][diff.tab| ][diff.inserted.unchanged|now ][diff.inserted.changed|there are ][diff.inserted.unchanged|tabs][diff.tab| ][diff.inserted.unchanged|everywhere, much fun]
this line won't change
two lines are going to
[diff.deleted|-][diff.deleted.unchanged|be changed into ][diff.deleted.changed|three][diff.deleted.unchanged|!]
[diff.inserted|+][diff.inserted.changed|(entirely magically,]
[diff.inserted|+][diff.inserted.changed| assuming this works)]
[diff.inserted|+][diff.inserted.unchanged|be changed into ][diff.inserted.changed|four][diff.inserted.unchanged|!]
[diff.deleted|-][diff.deleted.unchanged|three of those lines ][diff.deleted.changed|will]
[diff.deleted|-][diff.deleted.changed|collapse][diff.deleted.unchanged| onto one]
[diff.deleted|-][diff.deleted.changed|(to see if it works)]
[diff.inserted|+][diff.inserted.unchanged|three of those lines ][diff.inserted.changed|have]
[diff.inserted|+][diff.inserted.changed|collapsed][diff.inserted.unchanged| onto one]
multibyte character shouldn't be broken up in word diff:
$ $PYTHON <<'EOF'
> with open("utf8", "wb") as f:
>     f.write(b"blah \xe3\x82\xa2 blah\n")
> EOF
$ hg ci -Am 'add utf8 char' utf8
$ $PYTHON <<'EOF'
> with open("utf8", "wb") as f:
>     f.write(b"blah \xe3\x82\xa4 blah\n")
> EOF
$ hg ci -m 'slightly change utf8 char' utf8
$ hg diff --config experimental.worddiff=True --color=debug -c.
[diff.diffline|diff --git a/utf8 b/utf8]
[diff.file_a|--- a/utf8]
[diff.file_b|+++ b/utf8]
[diff.hunk|@@ -1,1 +1,1 @@]
[diff.deleted|-][diff.deleted.unchanged|blah ][diff.deleted.changed|\xe3\x82\xa2][diff.deleted.unchanged| blah] (esc)
[diff.inserted|+][diff.inserted.unchanged|blah ][diff.inserted.changed|\xe3\x82\xa4][diff.inserted.unchanged| blah] (esc)