Currently, the push discovery first determines the full set of common nodes
before looking into what changesets are outgoing. When pushing a specific
subset, this can lead to pathological situations where we search for the status
of thousand of local heads that are unrelated to the requested pushes.
To fix this, we need to teach the discovery to ignores part of the graph. Most
of the necessary pieces were already in place. This changeset just makes them
available to higher level API and tests them.
Change actually impacting pushes are coming in a later changeset.
The 'srvheads' list contains all server heads including the common ones. We
adjust 'ui.log' message to provide more useful information about server heads
locally unknown. The performance impact of turning the list to set is
negligible (about 1e-4s) compared to the rest of the discovery cost, so I'm
taking the easy path.
We log the discovery summary, the number of roundtrips and the elapsed time.
This is useful to understand where slow push might come from when lloking at
the blackbox.
The home of 'Abort' is 'error' not 'util' however, a lot of code seems to be
confused about that and gives all the credit to 'util' instead of the
hardworking 'error'. In a spirit of equity, we break the cycle of injustice and
give back to 'error' the respect it deserves. And screw that 'util' poser.
For great justice.
For '_takefullsample' we can just retrieve the list of head directly and
ignore the rest of the complex return values. This was the last call to the
infamous '_updatesample' function.
This argument exists because of the complex code flow in '_takequicksample'. It
first gets the list of heads and then calls '_updatesample' on an empty initial
sample and a size limit matching the differences between the number of heads and
the target sample size. Finally the heads and the sample from '_updatesample'
were added. To ensure this addition result had the exact target length, the code
had to ensure no elements from the heads were added to the '_updatesample'
content and therefore was passing this "always included set of heads".
Instead we can just update the initial heads sample directly and use the final
target size as target size for the update.
This removes the need for this 'always' parameter to the '_updatesample' function
The test are affected because different set building order results in different
random sampling.
As explained in a previous changeset, prioritizing heads too much behaves
pathologically when there are more heads than the sample size. To counter this,
we always inject exponential samples before reducing to the sample size limit.
This already show some benefit in the test themselves, but on a real-world example
this moves my discovery for push to pathologically headed repo from 45 rounds to
17 of them.
We should maybe ensure that at least 25% of the result sample is heads, but I
think the random sampling will be fine in practice.
The heads and exponential sample are going to end up in the same set
before any extra processing happens. We simplify the code by directly
updating a set with heads.
Changes in the order the set is built lead to small changes in the random
sampling output. But after double checking, I can confirm the input data to
the random sampling is consistent.
Before this changeset, the discovery protocol was too heads-centric. Heads of the
undiscovered set were always sent for discovery and any room remaining in the
sample were filled with exponential samples (and random ones if any room
remained).
This behaved extremely poorly when the number of heads exceeded the sample size,
because we keep just asking about the existence of heads, then their direct parent
and so on. As a result, the 'O(log(len(repo)))' discovery turns into a
'O(len(repo))' one. As a solution we take a random sample of the heads plus
exponential samples. This way we ensure some exponential sampling is achieved,
bringing back some logarithmic convergence of the discovery again.
This patch only applies this principle in one place. More places will be updated
in future patches.
One test is impacted because the random sample happen to be different. By
chance, it helps a bit in this case.
If the length of undecided is smaller than the sample size, we can just request
information for all of them.
This conditional was previously handled by '_setupsample'. But '_setupsample' is
in my opinion a problematic function with blurry semantics. Having this
conditional explicitly earlier makes the code more explicit and moves us closer
to removing this '_setupsample' function.
Some of the logic around sample building is duplicated in the sample builders,
it would clean up thing to extract it in the top function, but this requires
all codes to be in the same place.
This changeset mostly exists to make the next one more clear.
We are using full sampling of 'fullsamplesize' in both case. The only
difference is the debug message. So we factorise the sampling code and put the
message in an extra conditional.
This is going to help making changes around the sampling logic. Such changes are
needed to improve discovery performance on highly headed repository.
We were definitely being suboptimal here: we were constructing two full sets,
one with the full set of common nodes (i.e. a graph traversal) and one with all
nodes. Then we subtract one set from the other. This whole process is
O(commits) and causes discovery to be significantly slower than it should be.
Instead, keep track of common incrementally and keep undecided as small as
possible.
This makes discovery massively faster on large repos: on one such repo, 'hg
debugdiscovery' over SSH with one commit missing on the client and five on the
server went from 4.5 seconds to 1.5. (An 'hg debugdiscovery' with no commits
missing on the client, i.e. connection startup time, was 1.2 seconds.)
2ec3e28dea6b changed 'sample' from a list to a set. The iteration order is thus
undefined and the yesno indices are not stable.
To solve this, repeat the listification and comment from elsewhere in the code.
Note: the randomness in the discovery protocol can make this problem hard to
reproduce.
2ec3e28dea6b made it possible that the initial head check didn't include all
heads. If that is the case, don't use the early exit just because this random
sample happened to be 'all known'.
Note: the randomness in the discovery protocol can make this problem hard to
reproduce.
Further digging on this issue show that the limit on the sample size used in
discovery never works for heads. Here is a quote from the code itself:
desiredlen = size - len(always)
if desiredlen <= 0:
# This could be bad if there are very many heads, all unknown to the
# server. We're counting on long request support here.
The long request support never landed and evolution make the "very many heads,
all unknown to the server" case quite common.
We implement a simple and stupid hard limit of sample size for all query. This
should prevent HTTP 414 error with the current state of the code.
The set discovery start by sending a "known" command with all local heads. When
the number of local heads is massive (eg: using hidden changesets) such request
becomes too large. This lead to 414 error over http, aborting the whole
process.
We limit the size of the sample used by the first query to fix this.
The test are impacted because they do test massive number of heads. But they do
not test it over real world http setup.
This introduces a peer method into all repository classes, which currently
simply returns self. It also changes hg.repository so it now raises an
exception if the supplied paths does not resolve to a localrepo or descendant.
Finally, all call sites are changed to use the peer and local methods as
appropriate, where peer is used whenever the code is dealing with a remote
repository (even if it's on local disk).
Any secret changesets will be excluded from pull and push. Phase data are
properly synchronized on pull and push if a changeset is seen as secret locally
but is non-secret remote side.
This patch does not handle the case of a changeset secret on remote but known
locally.
When setting up the next sample, we always add all of the heads, regardless
of the desired max sample size. But if the number of heads exceeds this
size, then we don't add any more nodes from the still undecided set.
(This is debatable per se, and I'll investigate it, but it's how we designed
it at the moment.)
The bug was that we always added the overall heads, not the heads of the
remaining undecided set. Thus, if #heads>200 (desired sample size), we
did not make progress any longer.
This fixes (issue2907) a crash when using 'hg incoming --bundle' with an empty
remote repo and a non-empty local repo.
This also fixes an unreported bug that 'hg summary --remote' erroneously
reports incoming changes when the remote repo is empty and the local is not.
Also, add a test to make sure issue2907 stays fixed
This means that we now discover both subset conditions (local<remote and
remote<local) in a single roundtrip without ever constructing an actual
sample (which takes a bit of client CPU).
Adds a new discovery method based on repeatedly sampling the still
undecided subset of the local node graph to determine the set of nodes
common to both the client and the server.
For small differences between client and server, it uses about the same
or slightly fewer roundtrips than the old tree-based discovery. For
larger differences, it typically reduces the number of roundtrips
drastically (from 150 to 4, for instance).
The old discovery code now lives in treediscovery.py, the new code is
in setdiscovery.py.
Still missing is a hook for extensions to contribute nodes to the
initial sample. For instance, Augie's remotebranches could contribute
the last known state of the server's heads.
Credits for the actual sampler and computing common heads instead of
bases go to Benoit Boissinot.