lyhash is a very simple and fast hash function that had the fewest
hash collisions on a 3.9M line text corpus and 190k line binary corpus
and should have significantly fewer collisions than the current hash
function.
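
For illustration only, a per-line hash of this kind can be written as one multiply and one add per byte. The sketch below assumes the linear congruential constants 1664525 and 1013904223 that are commonly associated with lyhash; the function name `hash_line` and its signature are hypothetical and are not quoted from the patch.

```c
#include <inttypes.h>
#include <stddef.h>

/* Hypothetical helper: hash `len` bytes starting at `p`.
 * The multiplier and increment are assumed constants, not taken
 * from the commit itself. Arithmetic wraps modulo 2^32. */
static uint32_t hash_line(const char *p, size_t len)
{
    uint32_t h = 0;
    size_t i;

    for (i = 0; i < len; i++)
        h = h * 1664525u + (unsigned char)p[i] + 1013904223u;
    return h;
}
```

Because every byte is folded in after a multiply, swapping two bytes changes the result, which a plain additive checksum would not detect.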
pretty easy to find after I recompiled the python interpreter and
mercurial for profiling.
In "bdiff.c", the function "equatelines" allocates the minimum hash
table size, which can lead to tons of collisions. I introduced an
"overcommit" factor of 16, that is, I allocate 16 times more memory
than the minimum value. Overcommitting 128 times does not improve the
performance over the 16-times case.
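
A minimal sketch of the sizing idea, assuming the minimum table is the smallest power of two that exceeds the number of lines. The helper name `table_buckets` is hypothetical, and the sketch ignores overflow and allocation failure, which real code would have to handle.

```c
#include <stddef.h>

/* Hypothetical helper: choose a bucket count for n lines.
 * First find the smallest power of two greater than n (the assumed
 * "minimum" size), then multiply by the overcommit factor. */
static size_t table_buckets(size_t n, size_t overcommit)
{
    size_t buckets = 1;

    while (buckets < n + 1)
        buckets *= 2;
    return buckets * overcommit;
}
```

With a factor of 16 the result stays a power of two, so a table index can still be derived with a bit mask rather than a modulo.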
bdiff.blocks() returns a dummy match at the end of both files; the
length of that chunk is never set, so it will sometimes contain random
heap garbage. There are apparently workarounds for this elsewhere:
# bdiff sometimes gives huge matches past eof, this check eats them,
Python 2.5 doesn't like it when we mix str objects and the "t#" format
in PyArg_ParseTuple. Change it to use "s#". Tested with python 2.3, 2.4
and 2.5.
manifest.add gives revlog.addrevision a buffer object, which may
be cached and used for a second call in the same session (as mq does
when pushing multiple patches). The other option would be to cast the
buffer to str when caching it.
On a 5.8MB (244,000 line) text file with similar lines, the hash before
this change made a diff against an empty file take 75 seconds. This
change improves that to 0.6 seconds. As a result, a clone of a smallish
repo (137MB) with some files like this takes 1 minute instead of 10
minutes.
The common case of diff is 10% slower now, probably because of worse
cache locality. But diff does not affect overall performance in the
common case (less than 1% of runtime is in diff when it is working ok),
so this tradeoff looks good.
Many projects use inttypes.h, too. stdint.h isn't available everywhere, e.g.
on some versions of Solaris, while inttypes.h is available wherever
stdint.h is.
The compile runs through without warnings, but running the newly built
hg emits a message:
| ImportError: ld.so.1: python: fatal: relocation error:
| file /opt/local/lib/python2.3/site-packages/mercurial/bdiff.so:
| symbol cmp: referenced symbol not found
Removing the inline in front of cmp corrects this error message.
Temporary fix to allow Mercurial to build on HP-UX 11, as the C
compiler on HP-UX 11 doesn't support 'inline' qualifier. The
'__inline' qualifier seemed to be supported, but not without
first resolving other associated issues.
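
One common way to handle compilers that disagree about 'inline' is a small portability macro. The sketch below is an assumption about how that could look, not what the patches did: the macro name `BDIFF_INLINE` and the helper `lines_differ` are made up. Declaring the function static gives it internal linkage, so the object file does not depend on an external definition of the symbol even when the compiler chooses not to inline it.

```c
#include <string.h>

/* Hypothetical portability shim; BDIFF_INLINE is not a real Mercurial macro. */
#if defined(_MSC_VER)
#define BDIFF_INLINE static __inline
#elif defined(__GNUC__)
#define BDIFF_INLINE static __inline__
#else
#define BDIFF_INLINE static   /* fall back to a plain static function */
#endif

/* Returns nonzero when the two lines differ in length or content. */
BDIFF_INLINE int lines_differ(const char *a, size_t alen,
                              const char *b, size_t blen)
{
    return alen != blen || memcmp(a, b, alen) != 0;
}
```

On a compiler that matches neither branch, the function is simply an ordinary static function, which costs a call but always links.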
I ran into a bug while importing a large repository into mercurial.
The diff algorithm does not allocate a big enough array of hunks
for some test cases. This results in memory corruption, and possibly,
as in my case, a seg fault.
You should be able to reproduce this problem with any case of more
than a few lines that follows this pattern:
	a	b
	=	=
	1	1
		2
	2	3
		4
	3	5
		.
	4	.
		.
	5
	.
	.
	.
I.e., "a" has blank lines on every other line that have been removed in
"b". In this case, the number of matching hunks is equal to the number
of lines in "b". This is more than ((an + bn)/4 + 2). I'm not sure what
motivates this formula, but when I changed it to the smaller of an or
bn (+ 1), it works.
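
The arithmetic can be checked with a small sketch. The helper names below are hypothetical; the two formulas are the ones quoted above. For "b" with 10 lines and "a" with 19 lines (the same 10 lines with a blank line between each pair), there are 10 matching hunks.

```c
/* Hypothetical helpers comparing the two capacity formulas. */

/* Old bound from the text: (an + bn)/4 + 2. */
static int old_max_hunks(int an, int bn)
{
    return (an + bn) / 4 + 2;
}

/* Proposed bound from the text: the smaller of an or bn, plus 1. */
static int new_max_hunks(int an, int bn)
{
    return (an < bn ? an : bn) + 1;
}
```

For an = 19 and bn = 10 the old bound gives 9, which is fewer than the 10 hunks needed, while the new bound gives 11.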
[comment added by mpm]
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
[PATCH] use <arpa/inet.h> instead of <netinet/in.h> for ntohl/htonl
From: Jed Davis <jdev@panix.com>
This fixes the Mac OS X build problem; hopefully it won't break any
other OSes, especially since SUSv3 says arpa/inet is the right header.
( http://www.opengroup.org/onlinepubs/009695399/functions/ntohl.html )
manifest hash: 2f06ff0cffefdb35e794131afcd1f34f9fdfa5cf
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCyEoFywK+sNU5EO8RAk6WAJ9v/pnr07zUXKM9EBQQGaKSZAlhxACdHrwS
XTLSL6pPGAwaRfExGF2A3DQ=
=Rtv9
-----END PGP SIGNATURE-----
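
As an illustration of the header choice in the patch above, the following sketch reads a 32-bit big-endian value from a byte buffer on a POSIX system. The helper name `read_be32` is hypothetical; only ntohl and the <arpa/inet.h> header come from the commit message.

```c
#include <arpa/inet.h>  /* ntohl/htonl, per SUSv3 */
#include <inttypes.h>
#include <string.h>

/* Hypothetical helper: decode 4 big-endian bytes into a host integer.
 * memcpy avoids unaligned access through a cast pointer. */
static uint32_t read_be32(const unsigned char *buf)
{
    uint32_t v;

    memcpy(&v, buf, sizeof(v));
    return ntohl(v);
}
```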
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
[PATCH] bdiff/mpatch under MSVC
From: K Thananchayan <thananck@yahoo.com>
The MSVC (6.0) environment does not have 'stdint.h' and does not provide
the 'inline' qualifier. The following patch is needed to make Mercurial
installable under MSVC.
manifest hash: a5b64235acced16cb451faa698922559fec4e573
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCxPapywK+sNU5EO8RAmRnAKCt9cOASaIsYB6kNUDSIStR1DmY4gCgnXlL
Jf0nMmGEkoyXtB0eV+fLzJU=
=fKD5
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Remove stdint.h from mpatch and bdiff
It's only there for ntohl and htonl and should be pulled in by in.h.
manifest hash: 65954290279241ac92c9ce04c21cf1a3c9dd54e0
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCwD8KywK+sNU5EO8RAhv2AJ40R/T72XK63IbeEFqMLSRJbRJWdACcDa9r
dOL9XpyYxR09REbAHw0JrlE=
=8wkZ
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Minor speed improvements for bdiff
Consolidate the jpos/jlen arrays to improve cache locality.
Do the same for the hash head/length arrays.
manifest hash: e6d9ed36782741b1d6fcce8c2d00155a9540e81d
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCvzWxywK+sNU5EO8RAlTMAJ9+yl0dKIeWv4RegeLy7g6wcnoYwgCgk6la
ip6KEAyBb7ktsX14KyZ5+/s=
=utNJ
-----END PGP SIGNATURE-----
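
A sketch of what consolidating two parallel arrays can look like: instead of separate position and length arrays, each entry carries both fields, so reading one entry touches a single cache line. The struct layout and the helper names are assumptions for illustration and are not copied from bdiff.c.

```c
/* One entry replaces jpos[j] and jlen[j] from two parallel arrays. */
struct pos {
    int pos;  /* index of the matching line in the other file */
    int len;  /* length of the run of matches ending here */
};

/* Hypothetical constructor, used only to keep the example short. */
static struct pos make_pos(int pos, int len)
{
    struct pos p;

    p.pos = pos;
    p.len = len;
    return p;
}

/* Hypothetical helper: extend the previous run if line i directly
 * follows it, otherwise start a new run of length 1. */
static struct pos extend_match(struct pos prev, int i)
{
    return make_pos(i, prev.pos == i - 1 ? prev.len + 1 : 1);
}
```

Both fields are read and written together in this pattern, which is the access pattern that benefits from storing them side by side.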
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
extensions: use stdint.h
Not sure why I didn't do this the first time around. Hopefully still
builds everywhere.
manifest hash: 965582286a190728f8cc0dfb8e11ee56628a59a5
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCvfRgywK+sNU5EO8RAg9SAJ4/ZVpQZcDY5xovLDTZK2txEegEgwCdF2b+
lzSIP109qq8D+KIdUWsbEPc=
=+0Yy
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Use __inline instead of inline
This should let us compile bdiff.c on the other OS.
manifest hash: c1233bb3c7fc060e49dbd2597c122d903797db9e
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCuu/WywK+sNU5EO8RAtqvAKC0d8Kv8He6xNCwmFnvKcff9BT4gACeLq7n
9JDFxYtWMrgjwlShfay1nL4=
=GDFt
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Add bdiff.blocks / minor performance tweaks
This refactors bdiff.bdiff so that we can get a list of matching
blocks of line numbers for use by annotate/unidiff.
Minor performance tweaks:
- - add a field for equivalence so we can keep h around a bit longer for cmp
- - mix len into the hash to reduce collisions
- - move an operation into the slow path in longest_match
manifest hash: b1aee590b6291b31069ea8a86b6aa8fb259ac244
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCubu2ywK+sNU5EO8RAm4FAJ9r10aJpT7qA96nqGYFHcuy4XcIHgCfeFx5
q0PyTXeZQc7Fw5kwEPcoykI=
=QXSb
-----END PGP SIGNATURE-----