sapling

mirror of https://github.com/facebook/sapling.git synced 2024-10-12 17:58:27 +03:00

Author	SHA1	Message	Date
Jun Wu	bdbf60f28d	xdiff: backport upstream changes Summary: I did some extra xdiff changes in upstream, namely: - Remove unused features - Replace "long" (32-bit in MSVC) with int64_t to support large files - Add comment on some key variables This backports them. It also includes Matt's fixes about Windows compatibility. Reviewed By: ryanmce Differential Revision: D7223939 fbshipit-source-id: 9287d5be22dae4ab41b05b3a4c160d836b5714a6	2018-04-13 21:51:48 -07:00
Jun Wu	78f4faea65	xdiff: add a preprocessing step that trims files Summary: xdiff has a `xdl_trim_ends` step that removes common lines and unmatchable lines. That is good in theory, but it happens too late: after splitting, hashing, and adjusting the hash values so they are unique. Those splitting, hashing, and hash-adjusting steps can have noticeable overhead. For not uncommon cases like diffing two large files with minor differences, the raw performance of those preparation steps matters a lot. Even allocating an O(N) array and storing line offsets in it is expensive. Therefore my previous attempts [1] [2] cannot be good enough, since they do not remove the O(N) array assignment. This patch adds a preprocessing step, `xdl_trim_files`, that runs before the other preprocessing steps. It counts the common prefix and suffix and the lines in them (needed for displaying line numbers), without doing anything else. Testing with a crafted large (169MB) file with a minor change: ``` open('a','w').write(''.join('%s\n' % (i % 100000) for i in xrange(30000000) if i != 6000000)) open('b','w').write(''.join('%s\n' % (i % 100000) for i in xrange(30000000) if i != 6003000)) ``` Running xdiff via a simple binary [3], this patch improves xdiff performance by more than 10x for the above case: ``` # xdiff before this patch 2.41s user 1.13s system 98% cpu 3.592 total # xdiff after this patch 0.14s user 0.16s system 98% cpu 0.309 total # gnu diffutils 0.12s user 0.15s system 98% cpu 0.272 total # (best of 20 runs) ``` It's still slightly slower than GNU diffutils, but it's pretty close now. Testing with real repo data: For the whole repo, this patch makes xdiff 25% faster: ``` # hg perfbdiff --count 100 --alldata -c d334afc585e2 --blocks [--xdiff] # xdiff, after ! wall 0.058861 comb 0.050000 user 0.050000 sys 0.000000 (best of 100) # xdiff, before ! wall 0.077816 comb 0.080000 user 0.080000 sys 0.000000 (best of 91) # bdiff ! wall 0.117473 comb 0.120000 user 0.120000 sys 0.000000 (best of 67) ``` For long files (e.g. commands.py), the speedup is more than 3x, which is very significant: ``` # hg perfbdiff --count 3000 --blocks commands.py.i 1 [--xdiff] # xdiff, after ! wall 0.690583 comb 0.690000 user 0.690000 sys 0.000000 (best of 12) # xdiff, before ! wall 2.240361 comb 2.210000 user 2.210000 sys 0.000000 (best of 4) # bdiff ! wall 2.469852 comb 2.440000 user 2.440000 sys 0.000000 (best of 4) ``` The improvement is also seen for the `json` test case mentioned in D7124455. xdiff's time improves from 0.3s to 0.04s, similar to GNU diffutils. This patch is also sent as https://phab.mercurial-scm.org/D2686. [1]: https://phab.mercurial-scm.org/D2631 [2]: https://phab.mercurial-scm.org/D2634 [3]: ``` // Code to run xdiff from command line. No proper error handling. mmfile_t readfile(const char *path) { struct stat st; int fd = open(path, O_RDONLY); fstat(fd, &st); mmfile_t f = { malloc(st.st_size), st.st_size }; ensure(read(fd, f.ptr, st.st_size) == st.st_size); close(fd); return f; } static int xdiff_outf(void *priv_, mmbuffer_t *mb, int nbuf) { int i; for (i = 0; i < nbuf; i++) { write(STDOUT_FILENO, mb[i].ptr, mb[i].size); } return 0; } int main(int argc, char const *argv[]) { mmfile_t a = readfile(argv[1]), b = readfile(argv[2]); xpparam_t xpp = { XDF_INDENT_HEURISTIC, 0 }; xdemitconf_t xecfg = { 3, 0 }; xdemitcb_t ecb = { 0, &xdiff_outf }; xdl_diff(&a, &b, &xpp, &xecfg, &ecb); return 0; } ``` Reviewed By: ryanmce Differential Revision: D7151582 fbshipit-source-id: 3f2dd43b74da118bd827af4fc5e1bf65be191ad2	2018-04-13 21:51:25 -07:00
Ryan Prince	573a8eb9cc	fixing xdiff build on windows Summary: fixing xdiff build on windows Reviewed By: quark-zju Differential Revision: D7189839 fbshipit-source-id: ef05219d911af44f3546bc51fb74539d06b443b5	2018-04-13 21:51:23 -07:00
Jun Wu	81e68a9a57	xdiff: decrease indent heuristic overhead Summary: Add a "boring" threshold to limit the search range of the indentation heuristic, so the performance of the diff algorithm is mostly unaffected by turning on the indentation heuristic. Reviewed By: ryanmce Differential Revision: D7145002 fbshipit-source-id: 024ec685f96aa617fb7da141f38fa4e12c4c0fc9	2018-04-13 21:51:21 -07:00
Jun Wu	511ec41260	xdiff: add a bdiff hunk mode Summary: xdiff generates hunks for the differences (e.g. the question marks in the `@@ -?,? +?,? @@` part of `diff --git` output). However, bdiff generates matched hunks instead. This patch adds an `XDL_EMIT_BDIFFHUNK` flag used by the output function `xdl_call_hunk_func`. Once set, xdiff will generate bdiff-like hunks instead. That makes it easier to use xdiff as a drop-in replacement for bdiff. Note that since `bdiff('', '')` returns `[(0, 0, 0, 0)]`, the shortcut path `if (xscr)` is removed. I have checked that the functions called with the `xscr` argument (`xdl_mark_ignorable`, `xdl_call_hunk_func`, `xdl_emit_diff`, `xdl_free_script`) work just fine with `xscr = NULL`. Reviewed By: ryanmce Differential Revision: D7135207 fbshipit-source-id: cfb8c363e586841c06c94af283c7f014ba65fcc0	2018-04-13 21:51:21 -07:00
Jun Wu	56a738fce4	xdiff: remove patience and histogram diff algorithms Summary: Patience diff is the normal diff algorithm, plus some greediness that unconditionally matches common unique lines. That means it is easy to construct cases that make it generate a suboptimal result, like: ``` open('a', 'w').write('\n'.join(list('a' + 'x' * 300 + 'u' + 'x' * 700 + 'a\n'))) open('b', 'w').write('\n'.join(list('b' + 'x' * 700 + 'u' + 'x' * 300 + 'b\n'))) ``` Patience diff has been advertised as being able to generate better results for some C code changes. However, the more scientific way to do that is the indentation heuristic [1]. Since patience diff can generate suboptimal results more easily and its "better" diff feature can be replaced by the new indentation heuristic, let's just remove it and its variant, histogram diff, to simplify the code. [1]: `433860f3d0` Reviewed By: ryanmce Differential Revision: D7124711 fbshipit-source-id: 127e8de6c75d0262687a1b60814813e660aae3da	2018-04-13 21:51:20 -07:00
Jun Wu	65d9160c6f	xdiff: vendor xdiff library from git Summary: Vendor git's xdiff library from git commit d7c6c2369ac21141b7c6cceaebc6414ec3da14ad under the GPL2+ license. There is another recent user report that hg diff generates a suboptimal result. It seems the fix to issue4074 isn't good enough. I crafted some other interesting cases, and hg diff barely has any advantage compared with gnu diffutils or git diff. Results per testcase, as (lines, time in seconds): patience: gnu diffutils 6, 0.00; hg diff 602, 0.08; git diff 6, 0.00. random: gnu diffutils 91772, 0.90; hg diff 109462, 0.70; git diff 91772, 0.24. json: gnu diffutils 2, 0.03; hg diff 1264814, 1.81; git diff 2, 0.29. "lines" means the size of the output, i.e. the count of "+/-" lines. "time" means the seconds needed to do the calculation. For both, smaller is better. "hg diff" counts Python startup overhead. Git and GNU diffutils generate optimal results. For the "json" case, git can have an optimization that does a scan for common prefix and suffix first, and matches them if the length is greater than half of the text. See https://neil.fraser.name/news/2006/03/12/. That would make git the fastest for all of the above cases. About the testcases: patience: Aiming for the weakness of the greedy "patience diff" algorithm. Using git's patience diff option would also get a suboptimal result. Generated using the Python script: ``` open('a', 'w').write('\n'.join(list('a' + 'x' * 300 + 'u' + 'x' * 700 + 'a\n'))) open('b', 'w').write('\n'.join(list('b' + 'x' * 700 + 'u' + 'x' * 300 + 'b\n'))) ``` random: Generated using the script in `test-issue4074.t`. It practically makes the algorithm suffer. Impressively, git wins in both performance and diff quality. json: The recent user-reported case. It's a single line movement near the end of a very large (800K lines) JSON file. Reviewed By: ryanmce Differential Revision: D7124455 fbshipit-source-id: 832651115da770f9d2ed5fdff2e200453c0013f8	2018-04-13 21:51:20 -07:00
Kostia Balytskyi	67b2e1496a	hg: vendor a third-party implementation of mman library for Windows Summary: This is needed to make our C code compile on Windows. Reviewed By: quark-zju Differential Revision: D6970929 fbshipit-source-id: 2cfe46e0718fe75916912d0e59c5400038e03a12	2018-04-13 21:51:10 -07:00
Durham Goode	1ab0bb112d	sha1: add sha1detectcoll library to setup.py Summary: cdatapack depends on sha1detectcoll, so let's add the library to setup.py before we add cdatapack. Test Plan: hg purge --all && make local && cd tests/ && ./run-tests.py -S -j 48 Verified sha1dc was in the build output and the tests passed. Reviewers: quark, #mercurial Reviewed By: quark Differential Revision: https://phabricator.intern.facebook.com/D6676405 Signature: 6676405:1515444508:2da65c6c3a18267a1d3c151c8e9acf60b674ffc2	2018-01-08 12:54:57 -08:00
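
The `xdl_trim_files` commit (78f4faea65) in the list above describes counting the common prefix and suffix of two files, and the lines inside them, before any splitting or hashing happens. The Python sketch below illustrates that idea only. It is not the C code from the patch: the function name `trim_common_ends`, its return format, and the decision to trim only whole lines are assumptions made for illustration.

```python
def trim_common_ends(a: bytes, b: bytes) -> dict:
    """Hypothetical helper: measure the common whole-line prefix and suffix
    of two byte strings without splitting them into lines or hashing."""
    limit = min(len(a), len(b))

    # Count common leading bytes.
    p = 0
    while p < limit and a[p] == b[p]:
        p += 1
    # Back up so the prefix ends right after a newline (whole lines only).
    # rfind returns -1 when there is no newline, which yields p == 0.
    p = a.rfind(b"\n", 0, p) + 1

    # Count common trailing bytes, never overlapping the trimmed prefix.
    s = 0
    while s < limit - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1

    # The suffix must start at a line start in both inputs. If it does not,
    # shrink it so it begins just after the first newline inside it.
    sa, sb = len(a) - s, len(b) - s
    at_line_start = (sa == 0 or a[sa - 1:sa] == b"\n") and (
        sb == 0 or b[sb - 1:sb] == b"\n"
    )
    if not at_line_start:
        nl = a.find(b"\n", sa)
        s = 0 if nl == -1 else len(a) - (nl + 1)

    return {
        "prefix_len": p,
        "suffix_len": s,
        # Line counts are needed later to report correct line numbers.
        "prefix_lines": a.count(b"\n", 0, p),
        "suffix_lines": a.count(b"\n", len(a) - s),
    }


if __name__ == "__main__":
    # Scaled-down version of the crafted test files from the commit message.
    a = b"".join(b"%d\n" % (i % 1000) for i in range(3000) if i != 600)
    b = b"".join(b"%d\n" % (i % 1000) for i in range(3000) if i != 603)
    print(trim_common_ends(a, b))
```

Under these assumptions, only the bytes between `prefix_len` and `len(x) - suffix_len` of each input would be handed to the real diff algorithm, with `prefix_lines` used to offset the reported line numbers.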

9 Commits
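
The bdiff hunk mode commit (511ec41260) contrasts xdiff's hunks, which describe differences, with bdiff's hunks, which describe matched regions. The relationship between the two can be sketched in Python as follows. This is an illustration, not the vendored C code: the function name `matching_blocks`, the `(a_start, a_count, b_start, b_count)` layout for changed regions, and the `(a1, a2, b1, b2)` layout for matched blocks are assumptions, the last one inferred from the `[(0, 0, 0, 0)]` example in the commit message.

```python
def matching_blocks(changes, len_a, len_b):
    """Hypothetical converter from changed regions to matched blocks.

    changes: sorted list of (a_start, a_count, b_start, b_count) tuples,
             0-based, each describing lines replaced in a and b.
    Returns a list of (a1, a2, b1, b2) half-open matched ranges.
    """
    blocks = []
    a_pos = b_pos = 0
    for a_start, a_count, b_start, b_count in changes:
        # Everything between the previous change and this one matches.
        if a_start > a_pos or b_start > b_pos:
            blocks.append((a_pos, a_start, b_pos, b_start))
        a_pos = a_start + a_count
        b_pos = b_start + b_count
    # A final block is always emitted, so two empty inputs with no
    # changes still produce [(0, 0, 0, 0)].
    blocks.append((a_pos, len_a, b_pos, len_b))
    return blocks


if __name__ == "__main__":
    print(matching_blocks([], 0, 0))
    print(matching_blocks([(1, 1, 1, 2)], 3, 4))
```

Always emitting the final block is what makes a shortcut for an empty change list unnecessary in this sketch: the no-change case falls through to the same code path.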