If there's no inline data, revlog.revision opens the data file every
time it's called. This is useful if we're going to call chunk many
times, but, if we're going to call it only once, it's better to let
chunk open the file - if we're lucky, all the data we're going to need
is already cached and we won't need to even look at the file.
When we remove revision N from the repository, all revisions >= N are
affected: either it's a descendant from N and will also be removed, or
it's not a descendant of N and will be renumbered.
As a consequence, we have to (at least temporarily) remove all filelog
and manifest revisions that have a linkrev >= N, readding some of them
later.
Unfortunately, it's possible to have a revlog with two revisions
r1 and r2 such that r1 < r2, but linkrev(r1) > linkrev(r2). If we try
to strip revision linkrev(r1) from the repository, we'll also lose
revision r2 when we truncate this revlog.
We already use changegroupsubset to create a temporary changegroup
containing the revisions that have to be restored, but that function is
unable to detect that we also wanted to save the r2 in the case above.
So we manually calculate these extra nodes and pass it to changegroupsubset.
This should fix issue764.
Python's zlib apparently makes an internal copy of strings passed to
compress(). To avoid this, compress strings 1M at a time, then join
them at the end if the result would be smaller than the original.
For initial commits of large but compressible files, this cuts peak
memory usage nearly in half.
- use a buffer to extract the delta from a chunk
- avoid concatenating to a compressed delta
- use a buffer to directly extra full text from a trivial delta
- delete chunk and delta objects after use
- handle chunk headers separately rather than prepending them to
(potentially large) chunks
- break large chunks into 1M pieces for compression
- don't prepend file metadata onto (potentially large) file data
To avoid extra memory usage and performance issues with large files,
generate a trivial delta header for deltas against the null revision
rather than calling the usual delta generator.
We append the delta header to meta rather than prepending it to data
to avoid a large allocate and copy.
We want to store version information about the revlog in the first
entry of its index. The code in packentry was using some heuristics
to detect whether this was the first entry, but these heuristics could
fail in some cases (e.g. rev 0 was empty; rev 1 descends directly from
the nullid and is stored as a delta).
We now give the revision number to packentry to avoid heuristics.
This function is fairly performance sensitive, so we make a couple
ugly tweaks:
- keep all entries packed so we needn't test entry types
- fold index lookup/load into unpack call to eliminate
local variable setting
- remove unused defaults for p1, p2, and text
- reduce some if/else
- use better variable names
- remove some extra variables
- remove some obsolete corner tests
- simply first entry handling for revlogng
- simply inline vs outofline writeout
We expand our index by one entry so that index[nullrev] points to a
unique entry, the null revision. This naturally eliminates numerous
extra tests in the performance-sensitive index access functions, most
of which are now trivial again.
Adding new entries is now done with insert(-1, e) rather than
append(e).
This way can use one additional bit, and when encountering invalid revlogs
with the first bit set don't produce python warnings or strange error messages.
This should fix issue255.
It looks like the problem there happens when addgroup calls addrevision
to add a full revision, and addrevision decides to split the index file
into a .i/.d pair. Since addgroup has an open file handle for the
index file, the renaming of the new .i file to its final name fails on
windows.
manifest.add gives revlog.addrevision a buffer object, which may
be cached and used for a second call in the same session (as mq does
when pushing multiple patches). The other option would be to cast the
buffer to str when caching it.
Instead of converting each node from the filenode to a hex form,
convert the arg to a bin form.
For a revlog with 26711 entries, doing 100 lookup:
before: ~18s
after : ~13s
- add comments
- do a clean separation of the different cases
- don't use a list of each possible node when
doing the lookup, just keep the previous entry
Make depth calculation non-recursive
Add simple shortcut for linear ancestry
Convert context to use ancestor function
make memoized parents function
Convert revlog to use ancestor function
On the kernel repo:
$ hg heads -q
before after
RevlogNG 1.11 0.52
Revlogv0 0.80 0.69
Since the current code for tags has to find all the heads of the repo,
this also helps there:
$ hg tags
before after
RevlogNG 2.35 1.76
Revlogv0 2.04 1.90
This allows one to walk the revision graph using only revision numbers,
which can be faster than using revision hashes, especially for
RevlogNG, where the parents of a revision are stored as revision
numbers.
parseindex could fail if read returns too little data in the right
moment (e.g. when there's still leftover data from the previous
iteration and read returns less than "s" bytes).
During a clone, an inline index is not converted to a split index
file until the very end. When the conversion happens, the index
can be very large, and the inline index loading functions always load
the entire index file into ram.
This changes the revlog code to read the index in smaller chunks.
This changes revlog specify revlogng by default. Inlined
data files are also used unless a flags option is found in the .hgrc.
Some example hgrc files:
[revlog]
# use the original revlog format
format=0
[revlog]
# use revlogng. Because no flags are included, inlined data files
# also be selected
format=1
[revlog]
# use revlogng but do not inline the data files with the index
flags=
[revlog]
# the new default
format=1
flags=inline
revlog.py wasn't trying to detect the version of a revlog file that
doesn't exist on the filesystem (as is the case with old-http).
Additionally, there was an off-by-one error in httprangereader.read
(ranges in HTTP Range headers are inclusive), making it get more data
than what was asked for. This made a struct.unpack complain that
"unpack str size does not match format".
Finally, with the two fixes above, test-static-http fails, since
BaseHTTPServer doesn't understand ranges and returns too much data.
Work around that by reading only the specified amount.
it used to cache open files. this made revlogng break because it wants
to rename files when splitting .i into .i/.d, but cannot rename or unlink
open files on windows.
new code is bit slower, but safe on linux and windows. proper fix for
too many open/close of changelog/manifest belongs in different place.
can get 10% speed improvement back.
The appendfile code was not passing default version info to the
changelog or manifest classes, and so they were always being created
as version 0.
revlog.checkinlinesize had to be corrected to seek to the end
of the index file when no index file was passed (only clone does this)
revlog.ancestors can be expensive on big repos. This cuts down the overall
time for hg update by ~19% by short cutting revlog.ancestors when one of the
revisions is reachable from another.
Storing the tuple returned by struct.unpack significantly increases
the memory required to store the entire index in ram. This patch
uses struct.unpack on demand instead.
This tunes for large repositories. It does not read the whole
index file in one big chunk, but tries to buffer reads in more
reasonable chunks instead.
Search speeds are improved in two ways. When trying to find a
specific sha hash, it searches from the end of the file backward.
More recent entries are more likely to be relevant, especially the
tip.
Also, this can load only the mapping of nodes to revlog index number.
Loading the map uses less cpu (no struct.unpack) and much less
memory than loading both the map and the index.
This cuts down the time for hg tip on the 80,000 changeset
kernel repo from 1.8s to 3.69s. Most commands the pull a single
rev out of a big index get roughly the same benefit. Commands
that read the whole index are not slower.
This uses code from Matt to calculate the size change that
would result from applying a delta to keep an accurate running
total of the text size during revlog.addgroup
The revlog.checkinlinesize() uses an atomic opener to replace the
index file after converting it from inline to traditional .i and .d
files. If this operation is interrupted, the atomic file class can
overwrite a valid file with a partially written one.
This patch introduces an atomic opener that does not automatically
replace the destination file with the tempfile. This way
an interrupted checkinlinesize() call turns into a noop.
The appendfile class needs a few changes to make it work with interleaved
index files. It needs to support the tell() method, opening in a+ mode,
and it needs to delay the checkinlinesize call until after the
append file is written.
Given that open(file, "a+") doesn't always seek to the end of the file,
this adds seek operations to appendfile that understand whence args
This patch allows you to optionally inline data bytes with the
revlog index file. It saves considerable space and checkout
time by reducing the number of inodes, wasted partial blocks and
system calls.
To use the inline data add this to your .hgrc
[revlog]
# inline data only works with revlogng
format=1
# inline is the only valid flag right now.
flags=inline
revlogng results in smaller indexes, can address larger data files, and
supports flags and version numbers.
By default the original revlog format is used. To use the new format,
use the following .hgrc field:
[revlog]
# format choices are 0 (classic revlog format) and 1 revlogng
format=1
The empty changegroup can be caused by remote servers dying soon after
findincoming, and further code in pull assumes (correctly) that there are
new changesets.
Incoming ssh needs to detect the end of the changegroup, otherwise it would
block trying to read from the ssh pipe. This is done by parsing the
changegroup chunks.
bundlerepo.getchunk() already is identical to
localrepo.addchangegroup.getchunk(), which is followed by getgroup which
looks much like what you can re-use in bundlerepository.__init__() and in
write_bundle(). bundlerevlog.__init__.genchunk() looks very similar, too,
as do some while loops in localrepo.py.
Applied patch from Benoit Boissinot to move duplicate/related code
to mercurial/changegroup.py and use this to fix incoming ssh.
revlog.group cached every chunk from the revlog, the behaviour was
needed to minimize the roundtrip with old-http.
We now cache the revlog data ~4MB at a time.
The memory used server side when pulling goes down to 35Mo maximum
whereas without the patch more than 160Mo was used when cloning the linux kernel
repository.
The time used by cloning is higher mainly because of the check in revlog.revision.
before
110.25user 20.90system 2:52.00elapsed 76%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+708707minor)pagefaults 0swaps
after
117.56user 18.86system 2:50.43elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+748366minor)pagefaults 0swaps
Make it actually work (undefined variable 'rev'; allow to pass a rev parameter).
repo.branchlookup() doesn't need a copy of heads because it doesn't modify it.
Use None as default argument to heads() instead of nullid.
Doc string PEPification.
and replaces it with calls to nodesbetween.
nodesbetween calculates all the changesets needed to have a complete
revision graph between a given set of base nodes and a given set of
head nodes.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Fix out of range regression
From: Filip Brcic <brcha@users.sourceforge.net>
The old revlog.py issued "index out of range" error when cloning the repository
Now I have reverted the parts of revlog.py to the old state when prev was
initialized as -1 and later assigned self.tip() only if that is possible.
Previously prev was always initialized as self.tip() and that is where the
out of range error was.
manifest hash: c94c9aee8b6d382ef52c3981f306a6e7e5f4c4d1
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCzzIxywK+sNU5EO8RAtlcAJ0TX9FXuC2c3YHuYXNwqZhdzPWUlgCggq+a
yJzUKDKH/gvnD3Tx3jcmCn8=
=euPi
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Fix corruption resulting from skipping parts of a revision group
We were occassionally losing track of what revision a delta applied to
when we skipped over deltas we already had and applying the delta
against the wrong base. This could result in coredumps from mpatch,
consistency errors, or failed verify.
manifest hash: fcf20a8abfd81f08fae2398136b2ed66216b2083
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCzu5SywK+sNU5EO8RAi10AJ9cqIfQzOzbcdH36t1LR/rY+UMtHwCeM79p
Dtv+Jh0McLZr6nf4iJyhDgI=
=5o6U
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Fix an odd revlog bug
If revlog had a cached -empty- revision, as opposed to no cached
version, it could get confused. This cropped up in verify on a
particular repo.
manifest hash: 90ccf122087f6bbcb4322cb9d9bb8124610ba886
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCzjRaywK+sNU5EO8RAgVEAKCv3WBJt1rBOX0UlTDXFPygPIru+gCfTZxJ
CEz1lYny1gkQ+haGY26QdBs=
=C/K5
-----END PGP SIGNATURE-----
# HG changeset patch
# User mason@suse.com
Performance enhancements for manifest.add()
Improve manifest.add performance by using bisect to insert/remove
changed items into the manifest list. This also generates the
manifest delta directly based on the changes being made.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Ancestor algorithm fix
The ancestor algorithm was a bit too optimistic about node ordering
still. Add revision numbers to the comparison to sort things out.
manifest hash: f4eaf95057b5623e864359706dcaee820b10fd20
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCsTrCywK+sNU5EO8RAtqMAJ9fEJEesPn+0SMg/i/g5vZYmX/pBgCfVnhl
+s88q/Wilw27MVWP6J6oqX8=
=k9AU
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Change the size of the short hash representation
First note that this number doesn't really matter, as we always check
for ambiguous short hash ids.
Here's the math on collision probability:
>>> import math
>>> def p(f, n): return 1 - (1 / math.exp(n**2/(2*f)))
...
>>> p(2**32, 30000.0)
0.09947179164613551 # with 30000 changesets (BKCVS), we have a 9% chance
>>> p(2**32, 65000.0)
0.38850881217977273 # and with a full import from BK, we'd have a 39% chance
>>> p(2**40, 1e6)
0.36539171908447321 # we'd like to be "safe" for 1M csets, so 40 isn't enough
>>> p(2**48, 1e6)
0.001774780051374103 # But 48 looks good
>>> p(2**48, 1e7)
0.16275260939624481
>>> p(2**48, 5e6)
0.043437281083569146
>>> p(2**48, 2e6)
0.0070802434913129764
>>> p(2**48, 3e6)
0.01586009440574343
manifest hash: 24d9f928a463f46708b0e11fb781d5a241851424
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCsQoMywK+sNU5EO8RAoBBAJwII9GV6dT9QUOYAk3gZGw9z0JvjACfSI4q
IFnTu1F7P5OuLelO1GsM8Bs=
=CNWk
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
lazyparser speed ups
When we do __contains__ on a map, we might as well load the whole
index. Not doing this was slowing down finding new changesets quite by
a factor of 20. When we do a full load, we also attempt to replace the
revlog's index and nodemap with normal Python objects to avoid the
lazymap overhead.
manifest hash: 9b2b20aacc508f9027d115426c63a381d28e5485
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCreYIywK+sNU5EO8RAoNHAJ9+LmXqsTQb9Bh3mZHq0A0VfQOleQCffHmn
jC/O0vnfx5FCRsX2bUFG794=
=BDTz
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
revlog: allow duplicates
If two branches make the same change to the same parent, the result
will be an identical hash. Git apparently does this all the time. Deal
with it gracefully.
manifest hash: c6217eab4b310e1ae529dd75ab90e717dbe5d55d
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCqU61ywK+sNU5EO8RAkFqAJ9KhWUQgjZbzzB/+mTkolH0GkT1awCfa+Mj
ulbI4xCRZcvfQE492mcNwQA=
=N6In
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
fix bad assumption about uniqueness of file versions
Mercurial had assumed that a given file hash could show up in only one
changeset, and thus that the mapping from file revision to changeset
was 1-to-1. But if two people perform the same edit with the same
parents, we can get an identical hash in different changesets.
So we've got to loosen up our uniqueness checks in addgroup and in
verify.
manifest hash: 5462003241e7d071ffa1741b87a59f646c9988ed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCoMDkywK+sNU5EO8RAg9PAJ9YWSknfFBoeYve/+Z5DDGGvytDkwCgoMwj
kT01PcjNzGPr1/Oe5WRvulE=
=HC4t
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Changes to network protocol
Stream changes at the delta level rather than at whole delta groups
this breaks the protocol - we now send a zero byte delta to indicate
the end of a group rather than sending the entire group length up front
Fix filename length asymmetry while we're breaking things
Fix hidden O(n^2) bug in calculating changegroup
list.append(e) is O(n), list + [element] is not
Decompress chunks on read in revlog.group()
Improve status messages
report bytes transferred
report nothing to do
Deal with /dev/null path brokenness
Remove untriggered patch assertion
manifest hash: 3eedcfe878561f9eb4adedb04f6be618fb8ae8d8
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
iD8DBQFCmzlqywK+sNU5EO8RAn0KAJ4z4toWSSGjLoZO6FKWLx/3QbZufACglQgd
S48bumc++DnuY1iPSNWKGAI=
=lCjx
-----END PGP SIGNATURE-----
The old ancestor algorithm could get fooled into returning ancestors
closer to root than it ought to. Hopefully this one, which strictly
orders its search by distance from room, will be foolproof.
- we don't attempt to compress things under 44 bytes (empirical)
- we check whether larger objects actually compress
- we tag objects to indicate their compression
NUL means uncompressed and starts with NUL
x means gzipped and starts with x (handy)
u means uncompressed, drop the u
Delete old code
Fix calculation of newer nodes on server
Fix branch recursion on client
Fix manifest merge problems
Add more debugging and note messages to merge