9 Commits

Author SHA1 Message Date
Jun Wu
bdbf60f28d xdiff: backport upstream changes
Summary:
I made some extra xdiff changes upstream, namely:

  - Remove unused features
  - Replace "long" (32-bit in MSVC) with int64_t to support large files
  - Add comments on some key variables

This backports them. It also includes Matt's fixes for Windows compatibility.
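
As a rough illustration of why the `long` to `int64_t` change matters (a sketch only, not part of the patch; the helper name `fits_in_32bit_long` is made up for this example), the following Python snippet inspects the width of the platform's C `long` and checks whether a byte offset fits in a signed 32-bit integer:

```python
import ctypes

INT32_MAX = 2**31 - 1


def fits_in_32bit_long(size_in_bytes):
    """Return True if a byte offset fits in a signed 32-bit integer."""
    return 0 <= size_in_bytes <= INT32_MAX


# Width of the platform's C `long`: 4 bytes with MSVC, typically 8 bytes on
# 64-bit Linux and macOS.
print("sizeof(long)    =", ctypes.sizeof(ctypes.c_long))
# int64_t is 8 bytes on every platform.
print("sizeof(int64_t) =", ctypes.sizeof(ctypes.c_int64))
# A 3 GiB file cannot be indexed with a 32-bit long.
print("3 GiB fits in 32 bits:", fits_in_32bit_long(3 * 1024**3))
```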

Reviewed By: ryanmce

Differential Revision: D7223939

fbshipit-source-id: 9287d5be22dae4ab41b05b3a4c160d836b5714a6
2018-04-13 21:51:48 -07:00
Jun Wu
78f4faea65 xdiff: add a preprocessing step that trims files
Summary:
xdiff has an `xdl_trim_ends` step that removes common lines and unmatchable
lines. That is good in theory, but it happens too late: after splitting,
hashing, and adjusting the hash values so they are unique. Those splitting,
hashing, and hash-adjusting steps can have noticeable overhead.

For not uncommon cases like diffing two large files with minor differences,
the raw performance of those preparation steps matters a lot. Even
allocating an O(N) array and storing line offsets in it is expensive.
Therefore my previous attempts [1] [2] are not good enough, since they do
not remove the O(N) array assignment.

This patch adds a preprocessing step, `xdl_trim_files`, that runs before
the other preprocessing steps. It counts the common prefix and suffix and the
lines in them (needed for displaying line numbers), without doing anything else.
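
The idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not the C implementation of `xdl_trim_files`: the function name `trim_common_ends`, its return shape, and the way line boundaries are handled are all choices made for this example.

```python
def _at_line_start(data, pos):
    """True if `pos` is the first byte of a line in `data`."""
    return pos == 0 or data[pos - 1:pos] == b"\n"


def trim_common_ends(a, b):
    """Return (prefix_bytes, prefix_lines, suffix_bytes, suffix_lines).

    Only whole lines are trimmed, and line counts are newline counts.
    """
    limit = min(len(a), len(b))

    # Scan the common prefix byte by byte, with no splitting or hashing.
    p = 0
    while p < limit and a[p] == b[p]:
        p += 1
    # Back up to the last line boundary inside the common prefix.
    p = a.rfind(b"\n", 0, p) + 1
    prefix_lines = a.count(b"\n", 0, p)

    # Scan the common suffix, never overlapping the trimmed prefix.
    s = 0
    while s < limit - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1
    # If the suffix starts mid-line in either file, drop the partial line.
    if not (_at_line_start(a, len(a) - s) and _at_line_start(b, len(b) - s)):
        k = a.find(b"\n", len(a) - s)
        s = 0 if k == -1 else len(a) - (k + 1)
    suffix_lines = a.count(b"\n", len(a) - s)

    return p, prefix_lines, s, suffix_lines


a = b"1\n2\n3\n4\n"
b = b"1\n2\nX\n4\n"
p, plines, s, slines = trim_common_ends(a, b)
# Only the middle parts still need a real diff; `plines` offsets line numbers.
print(plines, a[p:len(a) - s], b[p:len(b) - s], slines)
```

The remaining middle slices are what the expensive splitting and hashing steps would then operate on.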

Testing with a crafted large (169MB) file, with a minor change:

```
  open('a','w').write(''.join('%s\n' % (i % 100000) for i in xrange(30000000) if i != 6000000))
  open('b','w').write(''.join('%s\n' % (i % 100000) for i in xrange(30000000) if i != 6003000))
```

Running xdiff via a simple binary [3], this patch improves xdiff performance
by more than 10x for the above case:

```
  # xdiff before this patch
  2.41s user 1.13s system 98% cpu 3.592 total
  # xdiff after this patch
  0.14s user 0.16s system 98% cpu 0.309 total
  # gnu diffutils
  0.12s user 0.15s system 98% cpu 0.272 total
  # (best of 20 runs)
```

It's still slightly slower than GNU diffutils. But it's pretty close now.

Testing with real repo data:

For the whole repo, this patch makes xdiff 25% faster:

```
  # hg perfbdiff --count 100 --alldata -c d334afc585e2 --blocks [--xdiff]
  # xdiff, after
  ! wall 0.058861 comb 0.050000 user 0.050000 sys 0.000000 (best of 100)
  # xdiff, before
  ! wall 0.077816 comb 0.080000 user 0.080000 sys 0.000000 (best of 91)
  # bdiff
  ! wall 0.117473 comb 0.120000 user 0.120000 sys 0.000000 (best of 67)
```

For long files (e.g. commands.py), the speedup is more than 3x, which is very
significant:

```
  # hg perfbdiff --count 3000 --blocks commands.py.i 1 [--xdiff]
  # xdiff, after
  ! wall 0.690583 comb 0.690000 user 0.690000 sys 0.000000 (best of 12)
  # xdiff, before
  ! wall 2.240361 comb 2.210000 user 2.210000 sys 0.000000 (best of 4)
  # bdiff
  ! wall 2.469852 comb 2.440000 user 2.440000 sys 0.000000 (best of 4)
```
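
For reference, the ratios behind the quoted speedups can be recomputed from the timings above. This is a small arithmetic sketch; the helper names `speedup` and `time_saved` are made up for this example, and the inputs are the "total" and "wall" values copied from the benchmark output.

```python
def speedup(before, after):
    """How many times faster the new code is."""
    return before / after


def time_saved(before, after):
    """Fraction of the original running time that was removed."""
    return (before - after) / before


# Crafted 169MB file (total seconds).
print(round(speedup(3.592, 0.309), 1))           # 11.6
# Whole repo (wall seconds).
print(round(speedup(0.077816, 0.058861), 2))     # 1.32
print(round(time_saved(0.077816, 0.058861), 2))  # 0.24
# commands.py (wall seconds).
print(round(speedup(2.240361, 0.690583), 2))     # 3.24
```

Read this way, the "25% faster" figure for the whole repo corresponds to roughly a quarter less wall time.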

The improvement is also seen for the `json` test case mentioned in D7124455.
xdiff's time improves from 0.3s to 0.04s, similar to GNU diffutils.

This patch is also sent as https://phab.mercurial-scm.org/D2686.

[1]: https://phab.mercurial-scm.org/D2631
[2]: https://phab.mercurial-scm.org/D2634
[3]:

```
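// Code to run xdiff from command line. No proper error handling.
// Assumptions added for completeness: the vendored header is reachable as
// "xdiff.h", and `ensure` (not defined in the original snippet) aborts on failure.
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>
#include "xdiff.h"
#define ensure(x) do { if (!(x)) abort(); } while (0)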
mmfile_t readfile(const char *path) {
  struct stat st; int fd = open(path, O_RDONLY);
  fstat(fd, &st); mmfile_t f = { malloc(st.st_size), st.st_size };
  ensure(read(fd, f.ptr, st.st_size) == st.st_size); close(fd); return f; }
static int xdiff_outf(void *priv_, mmbuffer_t *mb, int nbuf) { int i;
  for (i = 0; i < nbuf; i++) { write(STDOUT_FILENO, mb[i].ptr, mb[i].size); }
  return 0; }
int main(int argc, char const *argv[]) {
  mmfile_t a = readfile(argv[1]), b = readfile(argv[2]);
  xpparam_t xpp = { XDF_INDENT_HEURISTIC, 0 };
  xdemitconf_t xecfg = { 3, 0 }; xdemitcb_t ecb = { 0, &xdiff_outf };
  xdl_diff(&a, &b, &xpp, &xecfg, &ecb); return 0; }
```

Reviewed By: ryanmce

Differential Revision: D7151582

fbshipit-source-id: 3f2dd43b74da118bd827af4fc5e1bf65be191ad2
2018-04-13 21:51:25 -07:00
Ryan Prince
573a8eb9cc fixing xdiff build on windows
Summary: fixing xdiff build on windows

Reviewed By: quark-zju

Differential Revision: D7189839

fbshipit-source-id: ef05219d911af44f3546bc51fb74539d06b443b5
2018-04-13 21:51:23 -07:00
Jun Wu
81e68a9a57 xdiff: decrease indent heuristic overhead
Summary:
Add a "boring" threshold to limit the search range of the indentation heuristic,
so the performance of the diff algorithm is mostly unaffected by turning on
the indentation heuristic.

Reviewed By: ryanmce

Differential Revision: D7145002

fbshipit-source-id: 024ec685f96aa617fb7da141f38fa4e12c4c0fc9
2018-04-13 21:51:21 -07:00
Jun Wu
511ec41260 xdiff: add a bdiff hunk mode
Summary:
xdiff generates hunks for the differences (e.g. the question marks in the
`@@ -?,?  +?,? @@` part of `diff --git` output). However, bdiff generates
matched hunks instead.

This patch adds an `XDL_EMIT_BDIFFHUNK` flag used by the output function
`xdl_call_hunk_func`.  Once set, xdiff will generate bdiff-like hunks
instead. That makes it easier to use xdiff as a drop-in replacement for bdiff.

Note that since `bdiff('', '')` returns `[(0, 0, 0, 0)]`, the shortcut path
`if (xscr)` is removed. I have checked that the functions called with the
`xscr` argument (`xdl_mark_ignorable`, `xdl_call_hunk_func`, `xdl_emit_diff`,
`xdl_free_script`) work just fine with `xscr = NULL`.
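
The relationship between the two hunk styles can be sketched in Python. This is an illustration under assumptions, not the C code in `xdl_call_hunk_func`: the function name `matched_blocks` is made up, difference hunks are assumed to be zero-based `(a_start, a_len, b_start, b_len)` tuples, and matched hunks are `(a1, a2, b1, b2)` ranges.

```python
def matched_blocks(diff_hunks, len_a, len_b):
    """Convert difference hunks into bdiff-like matched hunks.

    diff_hunks: sorted (a_start, a_len, b_start, b_len) tuples, zero-based.
    Returns (a1, a2, b1, b2) tuples covering the lines between the hunks.
    """
    blocks = []
    a_pos = b_pos = 0
    for a_start, a_len, b_start, b_len in diff_hunks:
        # Emit the matching region that precedes this difference, if any.
        if a_start > a_pos or b_start > b_pos:
            blocks.append((a_pos, a_start, b_pos, b_start))
        a_pos = a_start + a_len
        b_pos = b_start + b_len
    # The trailing region is always emitted, even when it is empty. This is
    # why two empty inputs still produce one (0, 0, 0, 0) hunk.
    blocks.append((a_pos, len_a, b_pos, len_b))
    return blocks


# No differences and empty inputs: mirrors bdiff('', '').
print(matched_blocks([], 0, 0))              # [(0, 0, 0, 0)]
# One line of `a` replaced by two lines of `b`.
print(matched_blocks([(1, 1, 1, 2)], 3, 4))  # [(0, 1, 0, 1), (2, 3, 3, 4)]
```

Always emitting the final block is what makes an empty edit script a normal case rather than a shortcut.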

Reviewed By: ryanmce

Differential Revision: D7135207

fbshipit-source-id: cfb8c363e586841c06c94af283c7f014ba65fcc0
2018-04-13 21:51:21 -07:00
Jun Wu
56a738fce4 xdiff: remove patience and histogram diff algorithms
Summary:
Patience diff is the normal diff algorithm, plus some greediness that
unconditionally matches common unique lines.  That means it is easy to
construct cases that make it generate suboptimal results, like:

```
open('a', 'w').write('\n'.join(list('a' + 'x' * 300 + 'u' + 'x' * 700 + 'a\n')))
open('b', 'w').write('\n'.join(list('b' + 'x' * 700 + 'u' + 'x' * 300 + 'b\n')))
```
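
To see why this input is hard for a greedy match on unique lines, the following Python sketch compares the size of a minimal diff with the size obtained after forcing the unique line `u` to match. It is an illustration only, not git's or xdiff's code: the function names are made up, the quadratic LCS is a textbook one, and the trailing newline element of the script above is dropped for simplicity.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (textbook O(N*M) DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def diff_size(a, b):
    """Number of "+/-" lines in a minimal diff between a and b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def anchored_diff_size(a, b, anchor):
    """Diff size when `anchor` is unconditionally matched first."""
    i, j = a.index(anchor), b.index(anchor)
    return diff_size(a[:i], b[:j]) + diff_size(a[i + 1:], b[j + 1:])


a = list('a' + 'x' * 300 + 'u' + 'x' * 700 + 'a')
b = list('b' + 'x' * 700 + 'u' + 'x' * 300 + 'b')
print(diff_size(a, b))                # 6
print(anchored_diff_size(a, b, 'u'))  # 804
```

Matching `u` splits both files into unbalanced halves, so 400 `x` lines on each side can no longer be paired.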

Patience diff has been advertised as being able to generate better results for
some C code changes. However, the more scientific way to do that is the
indentation heuristic [1].

Since patience diff can generate suboptimal results more easily and its
"better" diff feature can be replaced by the new indentation heuristic, let's
just remove it and its variant histogram diff to simplify the code.

[1]: 433860f3d0

Reviewed By: ryanmce

Differential Revision: D7124711

fbshipit-source-id: 127e8de6c75d0262687a1b60814813e660aae3da
2018-04-13 21:51:20 -07:00
Jun Wu
65d9160c6f xdiff: vendor xdiff library from git
Summary:
Vendor git's xdiff library from git commit
d7c6c2369ac21141b7c6cceaebc6414ec3da14ad under the GPL2+ license.

There is another recent user report that hg diff generates suboptimal
results. It seems the fix for issue4074 isn't good enough. I crafted some
other interesting cases, and hg diff barely has any advantage compared with
GNU diffutils or git diff.

| testcase | gnu diffutils |      hg diff |   git diff |
|          |    lines time |   lines time | lines time |
| patience |        6 0.00 |     602 0.08 |     6 0.00 |
|   random |    91772 0.90 |  109462 0.70 | 91772 0.24 |
|     json |        2 0.03 | 1264814 1.81 |     2 0.29 |

"lines" means the size of the output, i.e. the count of "+/-" lines. "time"
means the seconds needed to do the calculation. For both, smaller is better.
"hg diff" counts Python startup overhead.

Git and GNU diffutils generate optimal results. For the "json" case, git could
add an optimization that scans for a common prefix and suffix first,
and matches them if their length is greater than half of the text. See
https://neil.fraser.name/news/2006/03/12/. That would make git the fastest
for all the above cases.

About testcases:

patience:
Targets the weakness of the greedy "patience diff" algorithm.  Using
git's patience diff option would also give a suboptimal result. Generated using
this Python script:

```
open('a', 'w').write('\n'.join(list('a' + 'x' * 300 + 'u' + 'x' * 700 + 'a\n')))
open('b', 'w').write('\n'.join(list('b' + 'x' * 700 + 'u' + 'x' * 300 + 'b\n')))
```

random:
Generated using the script in `test-issue4074.t`. It practically makes the
algorithm suffer. Impressively, git wins in both performance and diff
quality.

json:
The recently user-reported case. It's a single-line movement near the end of a
very large (800K lines) JSON file.

Reviewed By: ryanmce

Differential Revision: D7124455

fbshipit-source-id: 832651115da770f9d2ed5fdff2e200453c0013f8
2018-04-13 21:51:20 -07:00
Kostia Balytskyi
67b2e1496a hg: vendor a third-party implementation of mman library for Windows
Summary: This is needed to make our C code compile on Windows.

Reviewed By: quark-zju

Differential Revision: D6970929

fbshipit-source-id: 2cfe46e0718fe75916912d0e59c5400038e03a12
2018-04-13 21:51:10 -07:00
Durham Goode
1ab0bb112d sha1: add sha1detectcoll library to setup.py
Summary:
cdatapack depends on sha1detectcoll, so let's add the library to setup.py before
we add cdatapack.

Test Plan:
hg purge --all && make local && cd tests/ && ./run-tests.py -S -j 48

Verified sha1dc was in the build output and the tests passed.

Reviewers: quark, #mercurial

Reviewed By: quark

Differential Revision: https://phabricator.intern.facebook.com/D6676405

Signature: 6676405:1515444508:2da65c6c3a18267a1d3c151c8e9acf60b674ffc2
2018-01-08 12:54:57 -08:00