The remaining uses of repo.manifest in the changegroup module are treating the
manifest exclusively as a revlog, so let's replace them with instances of the
revlog directly.
This is part of dropping all dependencies on repo.manifest in favor of
repo.manifestlog.
Using offsets for accessing changelog entries isn't very readable.
As a bonus, changelog.changelogrevision() also accepts a revision,
so we don't need to perform the inline node resolution either.
History has taught us that repo.changelog can add significant overhead
to loops. So cache the changelog instance outside of the loop to
avoid the lookup. While we're here, do the same for manifestlog,
since each loop would otherwise initialize a new manifestlog instance.
By default, Python defers to the operating system for choosing the
default buffer size on opened files. On my Linux machine, the default
is 4k, which is really small for 2016.
This patch bumps the write buffer size when writing
changegroups/bundles to 128k. This matches the 128k read buffer
we already use on revlogs.
It's worth noting that this only impacts when writing to an explicit
file (such as during `hg bundle`). Buffers when writing to bundle
files via the repo vfs or to a temporary file are not impacted.
When producing a none-v2 bundle file of the mozilla-unified repository,
this change caused the number of write() system calls to drop from
952,449 to 29,788. After this change, the most frequent system
calls are fstat(), read(), lseek(), and open(). There were
2,523,672 system calls after this patch (so a net decrease of
~950k is statistically significant).
This change shows no performance change on my system. But I have a
high-end system with a fast SSD. It is quite possible this change
will have a significant impact on network file systems, where
extra network round trips due to excessive I/O system calls could
introduce significant latency.
Revlog can now be configured to store full snapshot only. This is used on the
changelog. However, the changegroup packing was still recomputing deltas to be
sent over the wire.
We now just reuse the full snapshot directly in this case, skipping delta
computation. This provides use with a large speed up(-30%):
# perfchangegroupchangelog on mercurial
! wall 2.010326 comb 2.020000 user 2.000000 sys 0.020000 (best of 5)
! wall 1.382039 comb 1.380000 user 1.370000 sys 0.010000 (best of 8)
# perfchangegroupchangelog on pypy
! wall 5.792589 comb 5.780000 user 5.780000 sys 0.000000 (best of 3)
! wall 3.911158 comb 3.920000 user 3.900000 sys 0.020000 (best of 3)
# perfchangegroupchangelog on mozilla central
! wall 20.683727 comb 20.680000 user 20.630000 sys 0.050000 (best of 3)
! wall 14.190204 comb 14.190000 user 14.150000 sys 0.040000 (best of 3)
Many tests have to be updated because of the change in bundle content. All
theses update have been verified. Because diffing changelog was not very
valuable, the resulting bundle have similar size (often a bit smaller):
# full bundle of mozilla central
with delta: 1142740533B
without delta: 1142173300B
So this is a win all over the board.
As part of debugging low-level changegroup generation, I came across
what I initially thought was a weird behavior: changegroup v2 is
choosing the previous revision in the changegroup as a delta base
instead of p1. I was tempted to rewrite this to use p1, as p1
will delta better than prev in the common case. However, I realized
that taking p1 as the base would potentially require resolving a
revision fulltext and thus require more CPU for e.g. server-side
processing of getbundle requests.
This patch tweaks the code comment to note the choice of behavior.
It also notes there is room for a flag or config option to tweak
this behavior later: using p1 as the delta base would likely make
changegroups smaller at the expense of more CPU, which could be
beneficial for things like clone bundles.
This adds an implementation of readdelta to the new manifestctx class and adds a
couple consumers of it. This currently appears to have some duplicate code, but
future patches cause this function to diverge when things like "shallow" are
introduced.
Now that all users are in exchange, we can safely move the code in the
'exchange' module. This function is really about processing the argument of a
'getbundle' call, so it even makes senses to do so.
There is various version of this function that differ mostly by the way they
define the bundled set. The flexibility is now available in the outgoing object
itself so we move the complexity into the caller themself. This will allow use
to remove a good share of the similar function to obtains a changegroup in the
'changegroup.py' module.
An important side effect is that we stop calling 'computeoutgoing' in
'getchangegroup'. This is fine as code that needs such argument processing
is actually going through the 'exchange' module which already all this function
itself.
This argument can be used instead of 'commonheads' to determine the 'outgoing'
set. We remove the outgoingbetween function as its role can now be handled by
'outgoing' itself.
I've thought of using an external function instead of making the constructor
more complicated. However, there is low hanging fruit to improve the current
code flow by storing some side products of the processing of 'missingroots'. So
in my opinion it make senses to add all this to the class.
We are to introduce more code constructing such object in the code base. It will
be more convenient to pass a repository object, all current users already
operate at the repository level anyway. More changes to the contructor argument
are coming in later changeset.
changegroup.changegroupsubset() contained somewhat low-level code for
constructing an "outgoing" instance from a list of roots and heads
nodes. It feels like discovery.py is a more appropriate location for
this code.
This code can definitely be optimized, as outgoing.missing will
recompute the set of changesets we've already discovered from
cl.between(). But code shouldn't be refactored during a move, so
I've simply inserted a TODO calling attention to that.
The bundle2 changegroup part has an advisory param saying how many
changesets are in the part. Before this patch, we were setting
this part when generating bundle2 parts via the wire protocol but
not when generating local bundle2 files.
A side effect of not setting the changeset count part is that progress
bars don't work when applying changesets. As the tests show, this
impacted clone bundles, shelve, backup bundles, `hg unbundle`, and
anything touching bundle2 files.
This patch adds a backdoor to allow us to pass state from
changegroup generation into the unbundler. We store the number
of changesets in the changegroup in this state and use it to
populate the aforementioned advisory part parameter when generating
the bundle2 bundle.
I concede that I'm not thrilled by how state is being passed in
changegroup.py (it feels a bit hacky). I would love to overhaul the
rather confusing set of functions in changegroup.py with something that
passes rich objects around instead of e.g. low-level generators.
However, given the code freeze for 3.9 is imminent, I'd rather not
undertake this endeavor right now. This feels like the easiest way
to get the parameter added to the changegroup part.
When grafting/rebasing, it is common for multiple changesets to make
the same change to a subdirectory. When writing the revlog for the
directory, the revlog code already takes care of not writing the entry
again. In 3eb9fa4180d3 (changegroup: prune subdirectory dirlogs too,
2016-02-12), I added the corresponding code in changegroup (not
sending entries the client already has), but I forgot to avoid sending
the entire changegroup if no nodes remained in the pruned
set. Although that's harmless besides the wasted network traffic, the
receiving side was checking for it (copied from the changegroup code
for handling files). This resulted in the client crashing with:
abort: received dir revlog group is empty
Fix by simply not emitting a changegroup for the directory if there
were no changes is it. This matches how files are handled.
The current implementation of narrowhg needs to influence the order in
which nodes are sent to the client. adgar@ and I think this is
fixable, but it's going to require pretty substantial time investment,
so in the interim we'd like to extract this method.
I think it makes the group() code a little more obvious, as it took us
a couple of tries to isolate the exact behavior we were observing.
writebundle() writes a bundle2 bundle or a plain changegroup1. Imagine
away the "2" in "bundle2.py" for a moment and this change should makes
sense. The bundle wraps the changegroup, so it makes sense that it
knows about it. Another sign that this is correct is that the delayed
import of bundle2 in changegroup goes away.
I'll leave it for another time to remove the "2" in "bundle2.py"
(alternatively, extract a new bundle.py from it).
The progress callback is replaced by one for manifests after changelog
processing is done, but let's not depend on manifests replacing the
value and instead explicitly clear it.
The "prog" class cg1unpacker.apply() has the unit set to
"chunks". This is not correct for files, where the file itself is the
unit. The unit is not usually printed, which is probably why this has
not been fixed yet. It can be show with e.g. "--config
progress.format='topic number unit'".
The progress callback for manifests is cleared outside of
_unpackmanifests(), which means it will remain in effect while pulling
subdirectory manifests when using treemanifests. Since the total
number of revisions used for the progress is the number of changesets,
the total number of treemanifest revisions is usually larger than
that. One effect of this is that the ETA is negative. It's hard to
estimate the number of subdirectory revisions, so let's just exclude
them from progress for now.
We have a dedicated function to get just the list of files in
a changelog entry. Use it.
This will presumably speed up changegroup application since we're
no longer decoding the entire changelog entry. But I didn't measure
the impact.
Since 37e42a0009a4 (changegroup: avoid iterating the whole manifest,
2015-12-04), the manifest linkrev callback iterates over only the
files that were touched according the the changeset. Before that
change, we iterated over all files returned in
manifest.readfast(). That method returns the files in the delta, if
the delta parent is a parent, otherwise it returns the full
manifest. Most manifest revisions end up using one of the parents as
its delta parent, so most of the time, the method returns a short
manifest. It seems that that happens often enough that it doesn't
really matter; I could not reproduce the timings reported in that
change.
Since the treemanifest code now works quite differently, and since
that code also works correctly for flat manifests, let's drop the
special-casing of flat manifests.
The current code for generating treemanifest revisions takes the list
of files in the changeset and finds the directories from them. This
does not work for merges, since a merge may pick file A from one side
and file B from another and neither of them would appear in the
changeset's "files" list, but the manifest would still change.
Fix this by instead walking the root manifest log for all needed
revisions, storing all needed file and subdirectory revisions, then
recursively visiting the subdirectories. This also turns out to be
faster: cloning a version of hg core converted to treemanifests went
from ~28s to ~19s (timing somewhat unfair: before this patch, timed
until crash; after this patch, timed until manifests complete).
The new algorithm is used only on treemanifest repos. Although it
works equally well on flat manifests, we leave the iteration over
files in the changeset for flat manifests for now.
This is another step towards making the manifest generation recurse
along the directory trees. The loop over 'tmfnodes' now takes the form
of a queue. At this point, we only add to the queue twice: we add the
root manifests, and, while visiting the root manifest revisions, we
add all subdirectory revisions (for treemanifest repos). Thus, any
iterations over 'tmfnodes' after the first will not add any items and
the "queue" will just keep shrinking.
This is another step towards making the manifest generation recurse
along the directory trees. It makes the two calls to _packmanifests()
more similar.
When verbose logging is one, we report the size in bytes of the
manifest data in the changegroup. For files, we report the size per
file, but I'm not sure we need that level of detail (i.e. size per
directory manifest). Instead, report a single figure for the size of
root manifest plus submanifests.
The next few patches will rewrite the manifest generation code to work
with merges. We will then walk dirlogs recursively. This prepares for
that by moving much of the treemanifest code out of _packmanifests()
and into generatemanifests(). For this to work, it also adds
_manifestsdone() method that returns the "end of manifests" close
chunk for cg3 and an empty string for cg1 and cg2.
In b89de5ee5b31 (changegroup: don't support versions 01 and 02 with
treemanifests, 2016-01-19), I stopped supporting use of cg1 and cg2
with treemanifest repos. What I had not considered was that it's
perfectly safe to pull *to* a treemanifest repo using any changegroup
version. As reported in issue5066, I therefore broke pull from old
repos into a treemanifest repo. It was not covered by the test case,
because that pulled from a local repo while enabling treemanifests,
which enabled treemanifests on the source repo as well. After
switching to pulling via HTTP, it breaks.
Fix by splitting up changegroup.supportedversions() into
supportedincomingversions() and supportedoutgoingversions().
There were two mistakes: one was accidental reuse of the fclnode
variable from the loop gathering file nodes, and the other (masked by
that bug) was not correctly handling deleted directories. Both cases
are now fixed and the test passes.
In a few places (at least repair.py and shelve.py), we want to find
the best changegroup version that we can assume users of the repo will
understand. For example, we choose version 01 by default, but if it's
a generaldelta repo, we expect clients to support version 02 anyway,
so we choose that for new bundles (for e.g. "hg strip"). Let's create
a helper for this functionality in changegroup, so we can reuse it
elsewhere later.
Since it would be terribly expensive to convert between flat manifests
and treemanifests, we have decided to simply not support changegroup
version 01 and 02 with treemanifests. Therefore, let's stop announcing
that we support these versions on treemanifest repos.
Note that this means that older clients that try to clone from a
treemanifest repo will fail. What happens is that the server, after
this patch, finds that there are no common versions and raises
"ValueError: no common changegroup version". This results in "abort:
HTTP Error 500: Internal Server Error" on the client.
Before this patch, it was no better: The server would instead find
that there were directory manifest nodes to put in the changegroup 01
or 02 and raise an AssertionError on changegroup.py#668 (assert not
tmfnodes), which would also appear as a 500 to the client.
changegroup.getchunks() determines the end of the stream by looking
for an empty chunk group (two consecutive empty chunks). It ignores
empty groups in the first two groups. Changegroup 3 introduced an
empty chunk between the manifests and the files, which confuses
getchunks(). Since it comes after the first two, getchunks() will stop
there.
Fix by rewriting getchunks so it first counts two groups (empty or
not) and then keeps antostarts counting empty groups. With this counting,
changegroup 1 and 2 have exactly one empty group after the first two
groups, while changegroup 3 has two (one for directories and one for
files).
It's a little hard to test this at this point, but I have verified
that this patch fixes narrowhg (which was broken before this
patch). Also, future patches will fix "hg strip" with treemanifests,
and once that's done, getchunks() will be tested through tests of "hg
strip".
By putting the treemanifest code in _unpackmanifests(),
_addchangegroupfiles() will only be about files again, and we get a
nice symmetry between _packmanifests() and _unpackmanifest(). The
immediate benefit to me is that remotefilelog should not need to be
updated to work with treemanifests. It should also make
server.validate and progress output easier to get right. Probably
bundlerepo too.
Remotefilelog overrides changegroup._addchangegroupfiles(), assuming
it is about files, which seems like a natural assumption. However, in
changegroup3, directory manifests are sent in the files section of the
changegroup. These naturally make remotefilelog unhappy.
The fact that the directories are not separated from the files
(although they do come before the files) also makes server.validate
harder to implement. Since we read one chunk at a time from the steam,
once we have found a file (non-directory) entry in the stream, we
would have to push the read data back into the stream, or otherwise
refactor the code. It will be easier if we add an empty chunk after
all directory manifests.
This change adds that empty chunk, although we don't yet take
advantage of it on the reading side. We will soon move the tree
manifest stuff out of _addchangegroupfiles() and into
_unpackmanifests().
In order to give us the freedom to change the changegroup3 format,
let's hide it behind an experimental config. Since it is required by
treemanifests, that will override the cg3 config.
Sometimes a txnclose or changegroup hook wants to iterate through all
the changesets in transaction: in that situation usually the revset
`$HG_NODE:` is used to select the revisions. Unfortunately this revset
sometimes may contain too many changesets because we don't have the
write lock while the hook runs newer changes may be added to
repository in the meantime.
That's why there is a need for extra variable carrying the information about
the last change in the transaction.
By adding a mandatory 'treemanifest' parameter in the bundle2 part, we
make it possible for the recipient to set repo requirements before the
manifest revlog is accessed.
I'm not entirely happy with using a trailing / on a "file" entry for
transferring a treemanifest. We've discussed putting some flags on
each file header[0], but I'm unconvinced that's actually any better:
if we were going to add another feature to the cg format we'd still be
doing a version bump anyway to cg4, so I'm inclined to not spend time
coming up with a more sophisticated format until we actually know what
the next feature we want to stuff in a changegroup will be.
Test changes outside test-treemanifest.t are only due to the new CG3
bundlecap showing up in the wire protocol.
Many thanks to adgar@google.com and martinvonz@google.com for helping
me with various odd corners of the changegroup and treemanifest API.
0: It's not hard refactoring, nor is it a lot of work. I'm just
disinclined to do speculative work when it's not clear what the
customer would actually be.