If a buffer of a mutable object is passed to revlog.addrevision(), the revlog
will happily store it in its cache. Later, when the revlog reuses the cached
entry, if the manifest modified the object in between, all kinds of bugs
appear.
We fix it by:
- passing immutable objects to addrevision() if they are already available
- only storing the text in the cache if it's of str type
Then we can remove the conversion of the cache entry to str() during
retrieval. That was probably just there hiding the bug for the common cases
but not really fixing it.
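A minimal sketch of the caching rule, in Python 3 terms (bytes plays the role
of the old str, memoryview the role of buffer). TextCache and its store()
method are hypothetical names for illustration, not the revlog's actual code:

```python
class TextCache:
    """Hypothetical stand-in for the revlog's one-entry text cache."""

    def __init__(self):
        self.entry = None  # (node, text) or None

    def store(self, node, text):
        # Only immutable bytes are safe to keep; a memoryview or bytearray
        # may still be modified by its owner after we return.
        if isinstance(text, bytes):
            self.entry = (node, text)
        else:
            self.entry = None


manifest_data = bytearray(b"a.txt\n")
view = memoryview(manifest_data)       # mutable view onto the owner's data

cache = TextCache()
cache.store(b"node1", view)            # refused: the view is not bytes
unsafe_cached = cache.entry

cache.store(b"node1", bytes(view))     # accepted: an immutable snapshot
manifest_data[0] = ord("b")            # the owner mutates its data afterwards
safe_cached = cache.entry[1]           # the snapshot is unaffected
```

With this rule the cache never needs a str() conversion on retrieval, because
nothing mutable ever gets in.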
- chunk to _chunk
- _prime to _chunkraw
- _chunkclear for cache clearing
- _chunk calls _chunkraw
- clean up _prime a bit
- simplify users in revision and checkinlinesize
- drop file descriptor passing (we're better off opening fds lazily)
The built-in None object is a singleton, so it is safe to compare against
it by identity with `is`. It is also faster; how much faster depends on the
object being compared. For a simple type like str I get:
          | s = "foo" | s = None
----------+-----------+----------
s == None | 0.25 usec | 0.21 usec
s is None | 0.17 usec | 0.17 usec
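A measurement like the one above could be reproduced with timeit. This is
only a sketch; compare() is a hypothetical helper, and absolute numbers will
differ by machine and Python version:

```python
import timeit


def compare(value, number=100_000):
    """Time `s == None` against `s is None` for one value of s."""
    eq = timeit.timeit("s == None", globals={"s": value}, number=number)
    ident = timeit.timeit("s is None", globals={"s": value}, number=number)
    return eq, ident


for label, value in (('s = "foo"', "foo"), ("s = None", None)):
    eq, ident = compare(value)
    print(f"{label}: == took {eq:.4f}s, is took {ident:.4f}s")
```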
Uses a transaction instance from the local repository to journal the
truncation of revlog files, such that if a strip only partially completes,
hg recover will be able to finish the truncate of all the files.
The potential unbundling of changes that have been backed up to be restored
later will, in case of an error, have to be unbundled manually. The
difference is that it will be possible to recover the repository state so
the unbundle can actually succeed.
Because we often compute sha1(nullid), it's worthwhile to copy a precomputed
hash of nullid instead of computing the same hash every time. Similarly, when
one of the parents is null, we can avoid a < comparison (sort).
Overall, this change adds a string equality comparison on each hash() call,
but when p2 is null, we drop one string < comparison, and copy a hash instead
of computing it. Since it is common to have revisions with only one parent,
this change makes hash() 25% faster when cloning a big repository.
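A sketch of the idea, assuming 20-byte node ids and that nullid is all zero
bytes (so it always sorts first). node_hash and naive_hash are illustrative
names; naive_hash is only there to check that both give the same digest:

```python
import hashlib

nullid = b"\0" * 20
_nullhash = hashlib.sha1(nullid)   # computed once, copied on each use


def node_hash(text, p1, p2):
    if p2 == nullid:
        # nullid sorts before any other node id, so no sort is needed
        # and the precomputed hash state can be copied.
        s = _nullhash.copy()
        s.update(p1)
    else:
        a, b = sorted((p1, p2))
        s = hashlib.sha1(a)
        s.update(b)
    s.update(text)
    return s.digest()


def naive_hash(text, p1, p2):
    # Reference version: always sort, always hash from scratch.
    a, b = sorted((p1, p2))
    return hashlib.sha1(a + b + text).digest()
```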
They are unnecessary. I did leave them in localrepo.py where there is
something like:
_junk = foo()
_junk = None
to free memory early. I don't know if just `foo()` will free the return
value as early.
- create error.py for exception classes to reduce demandloading
- move revlog exceptions to it
- change users to import error and drop revlog import if possible
changegroup() has a problem when nodes which do not descend from a node
in <bases> are added to remote after the discovery phase.
If that happens, changegroup() won't send the correct set of nodes, i.e.
some nodes will be missing.
To correct it we have to find the set of nodes that both remote and self
have (called <common>), and send all the nodes not in <common>.
This fix has some overhead: in the worst case it will re-send a whole branch.
A proper fix to avoid this overhead might be to change the protocol so that
the <common> nodes are sent (instead of the <bases> of the missing nodes).
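The "send everything not in <common>" rule can be sketched on a toy DAG.
The dict-of-parents representation and both function names are assumptions
made for illustration; this is not the changegroup() code itself:

```python
def ancestors_of(parents, heads):
    """Return heads plus all their ancestors in a {node: [parents]} DAG."""
    seen = set()
    stack = list(heads)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen


def nodes_to_send(parents, common_heads):
    """Every local node that is not in <common> (dict order is kept)."""
    common = ancestors_of(parents, common_heads)
    return [n for n in parents if n not in common]


# Local history: a <- b <- c, plus d branching off a.
local = {"a": [], "b": ["a"], "c": ["b"], "d": ["a"]}
# Remote is known to have b (and therefore a).
missing = nodes_to_send(local, ["b"])
```

Here both c and d are sent, whether or not d descends from one of the bases.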
Previously, an unknown node id would lead to the following error:
abort: 00changelog.i@343445453433: no node!
All other unknown revisions would instead display as:
abort: unknown revision '343445453'!
The former error message has been suppressed in favor of the latter.
This patch adds two methods to revlog:
- ancestors: given a list of revisions, returns their ancestors
- descendants: given a list of revisions, returns their descendants
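A rough sketch of what the two methods compute, written as free functions
over a list of (p1, p2) parent revisions. The parentrevs list and the
function signatures are assumptions for illustration, with -1 standing for
the null revision:

```python
from collections import deque

nullrev = -1


def ancestors(parentrevs, revs):
    """Yield the ancestors of revs, breadth first."""
    visit = deque(revs)
    seen = {nullrev}
    while visit:
        for parent in parentrevs[visit.popleft()]:
            if parent not in seen:
                seen.add(parent)
                visit.append(parent)
                yield parent


def descendants(parentrevs, revs):
    """Yield the descendants of revs in increasing revision order."""
    seen = set(revs)
    # A child always has a higher revision number than its parents.
    for rev in range(min(revs) + 1, len(parentrevs)):
        if any(p != nullrev and p in seen for p in parentrevs[rev]):
            seen.add(rev)
            yield rev


# 0 <- 1 <- 2, 0 <- 3, and 4 merges 2 and 3.
parentrevs = [(-1, -1), (0, -1), (1, -1), (0, -1), (2, 3)]
```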
If there's no inline data, revlog.revision opens the data file every
time it's called. This is useful if we're going to call chunk many
times, but, if we're going to call it only once, it's better to let
chunk open the file - if we're lucky, all the data we're going to need
is already cached and we won't need to even look at the file.
When we remove revision N from the repository, all revisions >= N are
affected: each is either a descendant of N and will also be removed, or
not a descendant of N and will be renumbered.
As a consequence, we have to (at least temporarily) remove all filelog
and manifest revisions that have a linkrev >= N, readding some of them
later.
Unfortunately, it's possible to have a revlog with two revisions
r1 and r2 such that r1 < r2, but linkrev(r1) > linkrev(r2). If we try
to strip revision linkrev(r1) from the repository, we'll also lose
revision r2 when we truncate this revlog.
We already use changegroupsubset to create a temporary changegroup
containing the revisions that have to be restored, but that function is
unable to detect that we also wanted to save r2 in the case above.
So we manually calculate these extra nodes and pass them to changegroupsubset.
This should fix issue764.
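The r1/r2 case can be sketched as follows. collateral_revs is a hypothetical
helper that works on a plain list of linkrevs and returns revision numbers
rather than nodes; the real code has to handle every filelog and the manifest:

```python
def collateral_revs(linkrevs, striprev):
    """linkrevs[r] is the changelog revision that introduced revlog rev r.

    Return (truncation point, revs lost by accident that must be saved).
    """
    # Truncation starts at the first rev whose linkrev is being stripped.
    start = next((r for r, l in enumerate(linkrevs) if l >= striprev), None)
    if start is None:
        return None, []          # nothing in this revlog is affected
    # Everything from start on is cut off; revs with an older linkrev
    # are not being stripped and have to be restored afterwards.
    extra = [r for r in range(start, len(linkrevs))
             if linkrevs[r] < striprev]
    return start, extra


# r1 = 1 and r2 = 2, with linkrev(r1) = 5 > linkrev(r2) = 3.
start, extra = collateral_revs([0, 5, 3], striprev=5)
```

Stripping changelog revision 5 truncates this revlog at rev 1, so rev 2 has to
be added to the set passed to changegroupsubset.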
Python's zlib apparently makes an internal copy of strings passed to
compress(). To avoid this, compress strings 1M at a time, then join
them at the end if the result would be smaller than the original.
For initial commits of large but compressible files, this cuts peak
memory usage nearly in half.
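A sketch of the chunked approach using zlib.compressobj. compress_chunked is
an illustrative name, and returning None for "not worth compressing" is an
assumption of this sketch, not the revlog's calling convention:

```python
import zlib

CHUNK = 2 ** 20   # 1M


def compress_chunked(text):
    """Compress text 1M at a time; return None if it does not shrink."""
    z = zlib.compressobj()
    view = memoryview(text)      # slicing a view does not copy the data
    parts = []
    for pos in range(0, len(text), CHUNK):
        parts.append(z.compress(view[pos:pos + CHUNK]))
    parts.append(z.flush())
    # Only join (and allocate the result) when compression actually helps.
    if sum(map(len, parts)) < len(text):
        return b"".join(parts)
    return None


data = b"abc" * 10 ** 6          # about 3M of compressible input
packed = compress_chunked(data)
```

The output is an ordinary zlib stream, so zlib.decompress() reads it back
without knowing it was produced in pieces.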
- use a buffer to extract the delta from a chunk
- avoid concatenating to a compressed delta
- use a buffer to directly extract full text from a trivial delta
- delete chunk and delta objects after use
- handle chunk headers separately rather than prepending them to
(potentially large) chunks
- break large chunks into 1M pieces for compression
- don't prepend file metadata onto (potentially large) file data
To avoid extra memory usage and performance issues with large files,
generate a trivial delta header for deltas against the null revision
rather than calling the usual delta generator.
We append the delta header to meta rather than prepending it to data
to avoid a large allocate and copy.
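A sketch of a trivial delta header, assuming each delta hunk is laid out as
three big-endian 32-bit integers (start, end, length) followed by the new
data. Both function names are illustrative, and apply_delta is a minimal
patcher included only to check the header:

```python
import struct


def trivial_diff_header(length):
    # One hunk: replace bytes [0, 0) of the empty base with `length` bytes.
    return struct.pack(">lll", 0, 0, length)


def apply_delta(base, delta):
    """Minimal patcher for the assumed (start, end, length, data) layout."""
    out, last, pos = [], 0, 0
    while pos < len(delta):
        start, end, length = struct.unpack_from(">lll", delta, pos)
        pos += 12
        out.append(base[last:start])          # unchanged part of the base
        out.append(delta[pos:pos + length])   # replacement data
        pos += length
        last = end
    out.append(base[last:])
    return b"".join(out)


text = b"large file contents"
header = trivial_diff_header(len(text))
```

The header is 12 bytes regardless of the size of the text, which is why it is
cheap to append to the small metadata string instead of the data.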
We want to store version information about the revlog in the first
entry of its index. The code in packentry was using some heuristics
to detect whether this was the first entry, but these heuristics could
fail in some cases (e.g. rev 0 was empty; rev 1 descends directly from
the nullid and is stored as a delta).
We now give the revision number to packentry to avoid heuristics.
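A sketch of packing with an explicit revision number. The 64-byte entry
format string and the 4-byte version field below are assumptions used for
illustration; the point is only that the test becomes rev == 0:

```python
import struct

# Assumed layout: offset+flags, compressed length, uncompressed length,
# base rev, linkrev, p1 rev, p2 rev, 20-byte node id, 12 bytes of padding.
indexformatng = ">Qiiiiii20s12x"
versionformat = ">I"


def packentry(entry, rev, version):
    p = struct.pack(indexformatng, *entry)
    if rev == 0:
        # The first entry always starts at offset 0, so its leading four
        # bytes are free to carry the version information.
        p = struct.pack(versionformat, version) + p[4:]
    return p


entry = (0, 0, 0, 0, 0, -1, -1, b"\x11" * 20)
first = packentry(entry, 0, version=1)
other = packentry(entry, 1, version=1)
```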
This function is fairly performance sensitive, so we make a couple of
ugly tweaks:
- keep all entries packed so we needn't test entry types
- fold index lookup/load into unpack call to eliminate
local variable setting
- remove unused defaults for p1, p2, and text
- reduce some if/else
- use better variable names
- remove some extra variables
- remove some obsolete corner tests
- simplify first entry handling for revlogng
- simplify inline vs outofline writeout
We expand our index by one entry so that index[nullrev] points to a
unique entry, the null revision. This naturally eliminates numerous
extra tests in the performance-sensitive index access functions, most
of which are now trivial again.
Adding new entries is now done with insert(-1, e) rather than
append(e).
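A sketch of the sentinel trick with a plain Python list. The entry tuple
layout and addentry() are hypothetical; what matters is that nullrev is -1,
so index[nullrev] is simply the last element:

```python
nullrev = -1
nullid = b"\0" * 20
# (offset+flags, comp. length, uncomp. length, base, linkrev, p1, p2, node)
nullentry = (0, 0, 0, -1, -1, -1, -1, nullid)

index = [nullentry]              # an empty revlog still has the null entry


def addentry(index, e):
    index.insert(-1, e)          # keep the null entry in the last slot
    return len(index) - 2        # revision number of the new entry


e0 = (0, 10, 10, 0, 0, -1, -1, b"\x01" * 20)
e1 = (10, 5, 12, 0, 1, 0, -1, b"\x02" * 20)
r0 = addentry(index, e0)
r1 = addentry(index, e1)

# Looking up the parents of r0 needs no special case for nullrev.
p1_node = index[e0[5]][7]
```

The number of real revisions is len(index) - 1.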
This way we can use one additional bit, and invalid revlogs with the first
bit set no longer produce Python warnings or strange error messages.
This should fix issue255.
It looks like the problem there happens when addgroup calls addrevision
to add a full revision, and addrevision decides to split the index file
into a .i/.d pair. Since addgroup has an open file handle for the
index file, the renaming of the new .i file to its final name fails on
Windows.
manifest.add gives revlog.addrevision a buffer object, which may
be cached and used for a second call in the same session (as mq does
when pushing multiple patches). The other option would be to cast the
buffer to str when caching it.
Instead of converting each node from the filenode to a hex form,
convert the arg to a bin form.
For a revlog with 26711 entries, doing 100 lookups:
before: ~18s
after : ~13s
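A sketch of a prefix lookup that converts the argument once instead of
hex-encoding every node. partial_match is an illustrative name, and returning
None for ambiguous or missing prefixes is an assumption of this sketch:

```python
from binascii import hexlify, unhexlify


def partial_match(nodes, prefix):
    """Return the one node whose hex form starts with prefix, else None."""
    # Convert the argument to binary once. An odd trailing hex digit
    # cannot be converted, so it is checked on the few candidates below.
    bin_prefix = unhexlify(prefix[:len(prefix) // 2 * 2])
    candidates = [n for n in nodes if n.startswith(bin_prefix)]
    matches = [n for n in candidates
               if hexlify(n).decode("ascii").startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    return None


nodes = [bytes.fromhex("ab" * 20), bytes.fromhex("ac" + "00" * 19)]
```

Only the candidates that survive the cheap binary comparison are converted to
hex, so the per-node cost is a single startswith() on raw bytes.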
- add comments
- do a clean separation of the different cases
- don't use a list of each possible node when
doing the lookup, just keep the previous entry