sapling

mirror of https://github.com/facebook/sapling.git synced 2024-10-11 09:17:30 +03:00

Author	SHA1	Message	Date
Gregory Szorc	72da280a1e	bdiff: don't check border condition in loop This is pretty much a copy of f0646ce71ed6, just to a different loop. The condition `p == plast` (`plast == a + len - 1`) was only true on the final iteration of the loop. So it was wasteful to check for it on every iteration. We decrease the iteration count by 1 and add an explicit check for `p == plast` after the loop. Again, we see modest wins. From the mozilla-unified repository: $ perfbdiff -m 3041e4d59df2 ! wall 0.035502 comb 0.040000 user 0.040000 sys 0.000000 (best of 100) ! wall 0.030480 comb 0.030000 user 0.030000 sys 0.000000 (best of 100) $ perfbdiff 0e9928989e9c --alldata --count 100 ! wall 4.097394 comb 4.100000 user 4.100000 sys 0.000000 (best of 3) ! wall 3.597798 comb 3.600000 user 3.600000 sys 0.000000 (best of 3) The 2nd example throws a total of ~3.3GB of data at bdiff. This change increases the throughput from ~811 MB/s to ~924 MB/s.	2016-11-20 16:56:21 -08:00
Mads Kiilerich	b3ae903762	bdiff: give slight preference to removing trailing lines [This change could be folded into the previous changeset to minimize the repo churn ...] Similar to the previous change, introduce an exception to the general preference for matches in the middle of bdiff ranges: If the best match on the B side starts at the beginning of the bdiff range, don't aim for the middle-most A side match but for the earliest. New (later) matches on the A side will only be considered better if the corresponding match on the B side not is at the beginning of the range. Thus, if the best (middle-most) match on the B side turns out to be at the beginning of the range, the earliest match on the A side will be used. The bundle size for 4.0 (hg bundle --base null -r 4.0 x.hg) happens to go from 22807275 to 22808120 bytes - a 0.004% increase.	2016-11-15 21:56:49 +01:00
Mads Kiilerich	7eb5c806da	bdiff: give slight preference to appending lines [This change could be folded into the previous changeset to minimize the repo churn ...] The general preference to matches in the middle of bdiff ranges helps getting balanced recursion and efficient computation. But, as previous changes have shown, it might also give diffs that seems "obviously wrong". To mitigate that: If the best match on the A side starts at the beginning of the bdiff range, don't aim for the middle-most B side match but for the earliest. This will make the matches balanced (by both sides being "early") even though the bisection will be less balanced. Still, this case only apply if the best and middle-most match was fully unbalanced on the A side. Each recursion will thus even in this worst case reduce the problem significantly and we are not re-introducing the problem that was fixed in d3deb406b55b. The bundle size for 4.0 (hg bundle --base null -r 4.0 x.hg) happens to go from 22806817 to 22807275 bytes - a 0.002% increase. This make the recent test-bdiff.py changes give a more pretty output ... but they no longer show that the recursion is around middle matches (because it in these cases isn't).	2016-11-15 21:56:49 +01:00
Mads Kiilerich	b5feb5a49b	bdiff: give slight preference to longest matches in the middle of the B side We already have a slight preference for matches close to the middle on the A side. Now, do the same on the B side. j is iterating the b range backwards and we thus accept a new j if the previous match was in the upper half. This makes the test-bhalf diff "correct". It obviously also gives more preference to balanced recursion than to appending to sequences. That is kind of correct, but will also unfortunately make some bundles bigger. No doubt, we can also create examples where it will make them smaller ... The bundle size for 4.0 (hg bundle --base null -r 4.0 x.hg) happens to go from 22803824 to 22806817 bytes - an 0.01% increase.	2016-11-08 18:37:33 +01:00
Mads Kiilerich	ce49bae5d5	bdiff: rearrange the "better longest match" code This is primarily to make the code more managable and prepare for later changes. More specific assignments might also be slightly faster, even thought it also might generate a bit more code.	2016-11-08 18:37:33 +01:00
Mads Kiilerich	3f0e3f18d8	bdiff: adjust criteria for getting optimal longest match in the A side middle We prefer matches closer to the middle to balance recursion, as introduced in d3deb406b55b. For ranges with uneven length, matches starting exactly in the middle should have preference. That will be optimal for matches of length 1. We will thus accept equality in the half check. For ranges with even length, half was ceil'ed when calculated but we got the preference for low matches from the 'less than half' check. To get the same result as before when we also accept equality, floor it. Without that, test-annotate.t would show some different (still correct but less optimal) results. This will change the heuristics. Tests shows a slightly different output - and sometimes slightly smaller bundles. The bundle size for 4.0 (hg bundle --base null -r 4.0 x.hg) happens to go from 22804885 to 22803824 bytes - an 0.005% reduction.	2016-11-08 18:37:33 +01:00
Gregory Szorc	9c0cfd877d	bdiff: replace hash algorithm This patch replaces lyhash with the hash algorithm used by diffutils. The algorithm has its origins in Git commit 2e9d1410, which is all the way back from 1992. The license header in the code at that revision in GPL v2. I have not performed an extensive analysis of the distribution (and therefore buckets) of hash output. However, `hg perfbdiff` gives some clear wins. I'd like to think that if it is good enough for diffutils it is good enough for us? From the mozilla-unified repository: $ perfbdiff -m 3041e4d59df2 ! wall 0.053271 comb 0.060000 user 0.060000 sys 0.000000 (best of 100) ! wall 0.035827 comb 0.040000 user 0.040000 sys 0.000000 (best of 100) $ perfbdiff 0e9928989e9c --alldata --count 100 ! wall 6.204277 comb 6.200000 user 6.200000 sys 0.000000 (best of 3) ! wall 4.309710 comb 4.300000 user 4.300000 sys 0.000000 (best of 3) From the hg repo: $ perfbdiff 35000 --alldata --count 1000 ! wall 0.660358 comb 0.660000 user 0.660000 sys 0.000000 (best of 15) ! wall 0.534092 comb 0.530000 user 0.530000 sys 0.000000 (best of 19) Looking at the generated assembly and statistical profiler output from the kernel level, I believe there is room to make this function even faster. Namely, we're still consuming data character by character instead of at the word level. This translates to more loop iterations and more instructions. At this juncture though, the real performance killer is that we're hashing every line. We should get a significant speedup if we change the algorithm to find the longest prefix, longest suffix, treat those as single "lines" and then only do the line splitting and hashing on the parts that are different. That will require a lot of C code, however. I'm optimistic this approach could result in a ~2x speedup.	2016-11-06 18:51:57 -08:00
Gregory Szorc	e6dba2eac7	bdiff: don't check border condition in loop `plast = a + len - 1`. So, this "for" loop iterates from "a" to "plast", inclusive. So, `p == plast` can only be true on the final iteration of the loop. So checking for it on every loop iteration is wasteful. This patch simply decreases the upper bound of the loop by 1 and adds an explicit check after iteration for the `p == plast` case. We can't simply add 1 to the initial value for "i" because that doesn't do the correct thing on empty input strings. `perfbdiff -m 3041e4d59df2` on the Firefox repo becomes significantly faster: ! wall 0.072763 comb 0.070000 user 0.070000 sys 0.000000 (best of 100) ! wall 0.053221 comb 0.060000 user 0.060000 sys 0.000000 (best of 100) For the curious, this code has its origins in 3605ecc924ef, which is the changeset that introduced bdiff.c in 2005. Also, GNU diffutils is able to perform a similar line-based diff in under 20ms. So there's likely more perf wins to be found in this code. One of them is the hashing algorithm. But it looks like mpm spent some time testing hash collisions in c22788816627. I'd like to do the same before switching away from lyhash, just to be on the safe side.	2016-11-06 00:37:50 -07:00
Maciej Fijalkowski	04dfc79402	bdiff: split bdiff into cpy-aware and cpy-agnostic part	2016-07-13 10:46:26 +02:00
Maciej Fijalkowski	99c86205e7	bdiff: rename functions and structs to be amenable for later exporting	2016-07-13 10:07:17 +02:00
Maciej Fijalkowski	020e7f27ca	bdiff: use ssize_t in favor of Py_ssize_t in cpython-unaware locations This function and struct will be exposed via cffi, so we need to remove the cpython API dependency they currently have.	2016-07-13 09:36:24 +02:00
Maciej Fijalkowski	8e7a874bdf	internals: move the bitmanipulation routines into its own file This is to allow more flexibility with the C sources -- now the bitmanipulation routines can be safely imported without importing Python.h	2016-06-06 13:08:13 +02:00
Matt Mackall	d3fd1530f8	bdiff: remove effectively dead code Now that we extend matches backwards in the inner loop, the final adjustment has no effect. (A similar extension for the forward direction is trickier and has less benefit.)	2016-06-02 17:11:32 -05:00
Matt Mackall	18c0002f9b	bdiff: extend matches across popular lines For very large diffs that have large numbers of identical lines (JSON dumps) that also have large blocks of identical text, bdiff could become confused about which block matches which because it can only match very limited regions. The result is very large diffs for small sets of edits. The earlier recursion rebalancing fix made this behavior more frequent because it's now more prone to match block 1 to block 2. One frequent user of large JSON files reported being unable to pass the resulting diffs through their code review system. Prior to this change, bdiff would calculate the length of a match at (i, j) as 1 + length found at (i-1, j-1). With large number of popular (ignored) lines, this often meant matches couldn't be extended backwards at all and thus all matching regions were very small. Disabling the popularity threshold is not an option because it brings back quadratic behavior. Instead, we extend a match backwards until we either found a previously discovered match or we find a mismatching line. This thus successfully bridges over any popular lines inside and before a matching region. The larger regions then significant reduce the probability of confusion.	2016-06-02 17:09:06 -05:00
Matt Mackall	7909072f7f	bdiff: further restrain potential quadratic performance This causes the longest_match search to limit itself to a window of 30000 lines during search (roughly 1MB), thus avoiding a full O(N*M) search that might occur in repetitive structured inputs. For a particular class of many MB pathological test cases, this generated the following timings: size before after 10x 1.25s 1.24s 100x 57s 33s 1000x >8400s 400s The times on the right quickly become much faster and appear more linear. While windowing means deltas are no longer "optimal", the resulting deltas were within a couple percent of expected size. While we've yet to have a report of a file with the level of repetition necessary to hit this case, some JSON/XML database dump scenario is fairly likely to hit it. This may also slightly improve the average-case performance for deltas of large binaries.	2016-04-22 13:38:02 -05:00
Matt Mackall	7110077dec	bdiff: balance recursion to avoid quadratic behavior (issue4704) For highly structured files like JSON or XML dumps with large numbers of duplicate lines (eg braces) and isolated matching lines, bdiff could find large numbers of equally good spans. Because it prefers earlier matches, this would result in pathologically unbalance recursion that resulted in quadratic performance. This patch makes it prefer matches closer to the middle that tend to balance recursion. This change improves the speed of a pathological test case from 1100s to 9s. Included is a smaller test that has a roughly 50x safety margin on the performance it accepts. It's likely to fail on pure builds because difflib also has a recursion-balancing problem.	2016-04-21 22:04:11 -05:00
Matt Mackall	f33142790b	bdiff: deal better with duplicate lines The longest_match code compares all the possible positions in two files to find the best match. Given a pair of sequences, it effectively searches a grid like this: a b b b c . d e . f 0 1 2 3 4 5 6 7 8 9 a 1 - - - - - - - - - b - 2 1 1 - - - - - - b - 1 3 2 - - - - - - b - 1 2 4 - - - - - - . - - - - - 1 - - 1 - Here, the 4 in the middle says "the first four lines of the file match", which it can compute be comparing the fourth lines and then adding one to the result found when comparing the third lines in the entry to the upper left. We generally avoid the quadratic worst case by only looking at lines that match, which is precomputed. We also avoid quadratic storage by only keeping a single column vector and then keeping track of the best match. Unfortunately, this can get us into trouble with the sequences above. Because we want to reuse the '3' value when calculating the '4', we need to be careful not to overwrite it with the '2' we calculate immediately before. If we scan left to right, top to bottom, we're going to have a problem: we'll overwrite our 3 before we use it and calculate a suboptimal best match. To address this, we can either keep two column vectors and swap between them (which significantly complicates bookkeeping), or change our scanning order. If we instead scan from left to right, bottom to top, we'll avoid ever overwriting values we'll need in the future. This unfortunately needs several changes to be made simultaneously: - change the order we build the initial hash chains for the b sequence - change the sentinel values from INT_MAX to -1 - change the visit order in the longest_match inner loop - add a tie-breaker preference for earlier matches This last is needed because we previously had an implicit tie-breaker from our visitation order that our test suite relies on. Later matches can also trigger a bug in the normalization code in diff().	2016-04-21 21:05:26 -05:00
Matt Mackall	01fb35972f	bdiff: fix latent normalization bug This bug is hidden by the current bias towards matches at the beginning of the file. When this bias is tweaked later to address recursion balancing, the normalization code could cause the next block to shrink to a negative length, thus creating invalid delta chunks. We add checks here to disallow that. This bug requires test cases that are an awkwardly large size for the test suite, but is very rapidly picked up by the included torture tester.	2016-04-21 21:53:18 -05:00
Matt Mackall	771a9f0eac	bdiff: fold in shift calculation in normalize This just makes the code harder to read without any performance advantage. We're going to make the check here more complex, let's make it simpler first.	2016-04-21 21:46:31 -05:00
Matt Mackall	6ac425cfc0	bdiff: unify duplicate normalize loops We're about to make the while loop check more complicated, so let's simplify first.	2016-04-21 21:37:13 -05:00
Matt Mackall	a40dbdbbc6	bdiff: avoid a memory error on malloc failure	2013-10-30 16:03:42 -05:00
Matt Mackall	15bc341744	bdiff: simplify overflow checks Rather than check that each delta start, end, and length is within 32 bits, we simply check that the input strings are under 4GB.	2013-02-02 16:15:22 -06:00
Adrian Buehlmann	e54da6574b	bdiff: check and cast first parameter value on putbe32() calls Eliminates mercurial/bdiff.c(383) : warning C4244: 'function' : conversion from '__int64' to 'uint32_t', possible loss of data mercurial/bdiff.c(384) : warning C4244: 'function' : conversion from '__int64' to 'uint32_t', possible loss of data mercurial/bdiff.c(385) : warning C4244: 'function' : conversion from 'Py_ssize_t' to 'uint32_t', possible loss of data when compiling for Windows x64 target using the Microsoft compiler.	2012-05-15 22:36:47 +02:00
Adrian Buehlmann	933c7c3ffa	bdiff: use Py_ssize_t instead of int Reduces the conversion warnings mercurial/bdiff.c(61) : warning C4244: '=' : conversion from '__int64' to 'int', possible loss of data mercurial/bdiff.c(307) : warning C4244: 'function' : conversion from 'Py_ssize_t' to 'int', possible loss of data mercurial/bdiff.c(308) : warning C4244: 'function' : conversion from 'Py_ssize_t' to 'int', possible loss of data mercurial/bdiff.c(362) : warning C4244: '+=' : conversion from '__int64' to 'int', possible loss of data mercurial/bdiff.c(380) : warning C4244: '=' : conversion from '__int64' to 'int', possible loss of data mercurial/bdiff.c(381) : warning C4244: 'function' : conversion from '__int64' to 'uint32_t', possible loss of data mercurial/bdiff.c(382) : warning C4244: 'function' : conversion from '__int64' to 'uint32_t', possible loss of data mercurial/bdiff.c(416) : warning C4244: '=' : conversion from 'Py_ssize_t' to 'int', possible loss of data to mercurial/bdiff.c(383) : warning C4244: 'function' : conversion from '__int64' to 'uint32_t', possible loss of data mercurial/bdiff.c(384) : warning C4244: 'function' : conversion from '__int64' to 'uint32_t', possible loss of data mercurial/bdiff.c(385) : warning C4244: 'function' : conversion from 'Py_ssize_t' to 'uint32_t', possible loss of data on the three putbe32() calls in the function bdiff when compiling for Windows x64 target using the Microsoft compiler.	2012-05-15 22:36:27 +02:00
Augie Fackler	61a03305ee	bdiff.bdiff: release the GIL before doing expensive diff operations This means that threaded webservers will have more of a chance of doing something useful while the C extension is busy computing a delta. Not doing this was causing problems for Google Code with a 25 meg text file that takes O(7 minutes) to deltify.	2012-04-20 11:08:14 -05:00
Matt Mackall	0fa9895915	util.h: replace ntohl/htonl with get/putbe32	2012-04-16 11:26:00 -05:00
Matt Mackall	fd4256c9b1	util.h: unify some common platform tweaks	2012-04-10 12:07:14 -05:00
Jim Hague	5fc330dd3c	bdiff: fix malloc(0) issue in fixws() If fixws() is called on a zero-length string, malloc(0) is called and expected to return a pointer. Which it does on e.g. Linux. AIX returns NULL, which it is also legal, but the malloc() is then assumed to have failed. So ensure a valid pointer is always returned.	2012-02-03 23:27:17 +00:00
Patrick Mezard	c4cef76e25	mdiff: replace wscleanup() regexps with C loops On my system it reduces: hg annotate -w mercurial/commands.py from 36s to less than 8s, to be compared with 6.3s when run without whitespace options.	2011-11-18 14:23:03 +01:00
Matt Mackall	776ce88bdd	bdiff: fix pointer aliasing	2011-10-11 20:21:13 -05:00
Markus F.X.J. Oberhumer	159a87219b	bdiff.c: rename all variables which hold a hash value to "hash"	2011-03-23 02:33:24 +01:00
Markus F.X.J. Oberhumer	b841bede91	bdiff.c: use unsigned arithmetic for hash computation Signed integer overflow is undefined in C.	2011-03-23 02:33:23 +01:00
Markus F.X.J. Oberhumer	26bf54022b	bdiff.c: cast to unsigned char when computing hash value	2011-03-23 02:33:22 +01:00
Markus F.X.J. Oberhumer	a68114a4d1	bdiff.c: make all local functions static	2011-03-23 02:33:21 +01:00
Martin Geisler	a76e121863	backout of e4cb9628354c Matt and a majority of crew did not like this approach.	2011-01-27 11:15:08 +01:00
Martin Geisler	d23e1973c2	specify C indention style using Emacs file local variables	2011-01-26 12:05:01 +01:00
Matt Mackall	4b21988877	bdiff: Fix bogus NULL return	2010-12-06 16:59:43 -06:00
Matt Mackall	338e4487bf	bdiff: dynamically allocate hunks Make the hunk list a singly-linked list rather than a large array.	2010-12-06 16:42:48 -06:00
Renato Cunha	51519e8461	bdiff.c: Added support for py3k. This patch adds support for py3k in bdiff.c. This is accomplished by including a header file responsible for abstracting the API differences between python 2 and python 3.	2010-06-15 19:49:56 -03:00
Alistair Bell	fbefff7aa5	bdiff: do not use recursion / avoid stackoverflow (issue1940)	2010-02-18 10:32:51 +01:00
Matt Mackall	8d99be19f0	many, many trivial check-code fixups	2010-01-25 00:05:27 -06:00
Benoit Boissinot	3978618d04	bdiff: gradually enable the popularity hack Patch from Jason Orendorff The lower the threshold, the stronger the popularity hack's influence. So at 3999 lines, the hack is disabled; and at 4000 lines, the hack is enabled at maximum strength (t=4). No source file in mercurial/crew is over 4000 lines. But there are, oh, a few such files in Mozilla. I can testify that this hack causes hg to generate some correct but eyebrow-raising patches. I think the hack should phase in gradually. The threshold should be high for small files where we don't need it so much. Like this: t = (bn < 31000) ? 1000000 / bn : bn / 1000; That would leave the popularity hack disabled for small files, then gradually phase it in: bn < 1000 -- t > bn (popularity hack is completely disabled) bn == 1000 -- t = 1000 (still effectively disabled) bn == 2000 -- t = 500 (only hits unusual files) bn == 10000 -- t = 100 (only hits especially common lines) bn == 31000 -- t = 31 (hack is at maximum power) bn == 32000 -- t = 32 (hack could backfire, ease off)	2009-10-03 23:36:08 +02:00
Matt Mackall	9d8806e3c3	bdiff: fix compile with GCC -ansi (issue1690)	2009-06-20 11:50:51 -05:00
Benoit Boissinot	10822b7c00	bdiff: add comment about normalization	2009-01-12 17:51:57 +01:00
Dirkjan Ochtman	1c96ed002e	merge with crew-stable, again	2008-10-20 14:58:49 +02:00
Thomas Arendsen Hein	380f8d8da9	spaces->tabs in one line of a C extension for consistency	2008-10-20 14:53:53 +02:00
Scott McCreary	501e6fe68f	allow Mercurial to compile on Haiku	2008-09-17 10:22:35 +02:00
Benoit Boissinot	d6688cc959	bdiff: normalize the diff (issue1295) When the common part of a diff can be moved forward, move it forward. Otherwise we don't get deterministic results (it would depends on the way we split for the recursion). That way we get identical hunks when doing the same change, it helps to solve issue1295 (inconsistent diffs on different side during a merge).	2008-10-14 20:13:53 +02:00
Jim Hague	b1341e30d4	fix calloc(0, ...) issue	2007-10-23 10:39:24 +00:00
Matt Mackall	0c1648fb37	bdiff: tweaks for large files - adjust the common line threshold to .1% this speeds up a delta of 7M lines of source from 10m to 40s - adjust the scaling of the hash array down a bit as it was raising the peak memory usage significantly	2007-10-11 00:46:56 -05:00

1 2

85 Commits