A new bundle2 capability 'phases' has been added. If 'heads' is part of the
supported values for 'phases', the server supports reading and sending the
'phase-heads' bundle2 part.
The server can now process a 'phases' boolean parameter to 'getbundle'. If
'True', a 'phase-heads' bundle2 part will be included in the bundle, carrying
phase information relevant to the whole pulled set. If this method is
available, the phases listkey namespace will no longer be listed.
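As a rough illustration of the client side (the helper name is hypothetical,
and 'caps' stands in for the decoded bundle2 capabilities mapping advertised
by the server):

    def server_sends_phase_heads(caps):
        # e.g. caps == {'phases': ['heads'], ...} after decoding the
        # server's bundle2 capabilities
        return 'heads' in caps.get('phases', ())

When this returns True, the client can pass phases=True to 'getbundle' and
skip the legacy 'phases' listkey lookup.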
Besides the more efficient encoding of the data, this new method will greatly
improve the phase exchange efficiency for repositories with non-served
changesets (obsolete, secret) since we'll no longer send data about the
filtered heads.
Add a new 'devel.legacy.exchange' config item to allow fallback to the old
'listkey in bundle2' method.
Reminder: the pulled set is not just the changesets bundled by the pull. It
also contains changesets selected by the "pull specification" on the client
side (eg: everything for a bare pull). One of the reasons why the 'pulled set'
is important is to make sure we can move -common- nodes to public.
As part of reducing the number of changegroup creation APIs, let's replace the
changegroup function with makechangegroup. This pushes the responsibility of
creating the outgoing set to the caller, but that seems like a simple and
reasonable concept for the caller to be aware of.
Differential Revision: https://phab.mercurial-scm.org/D668
As part of getting rid of all the permutations of changegroup creation, let's
remove changegroupsubset and call makechangegroup instead. This moves the
responsibility of creating the outgoing set to the caller, but that seems like a
relatively reasonable unit of functionality for the caller to have to care about
(i.e. what commits should be bundled).
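A hedged sketch of the resulting call pattern (the wrapper name is
hypothetical and the exact arguments are illustrative):

    from mercurial import changegroup, discovery

    def bundlecommits(repo, other, heads):
        # the caller now computes the outgoing set itself...
        outgoing = discovery.findcommonoutgoing(repo, other, onlyheads=heads)
        # ...and hands it to the single changegroup-creation entry point
        return changegroup.makechangegroup(repo, outgoing, '01', 'pull')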
Differential Revision: https://phab.mercurial-scm.org/D665
As far as I can tell, this interface originally used 'return' here, so the
"fallthrough" to self._abort made sense. When it was switched to 'yield', this
no longer made sense, but it doesn't impact most uses because the 'plain'
wrapper in peer.py's 'batchable' decorator only attempts to yield two items
(args and value).
When using iterbatch, however, it attempts to verify that the @batchable
generators only emit 2 results, by expecting a StopIteration when attempting to
access a third.
Differential Revision: https://phab.mercurial-scm.org/D608
The wirepeer class provides concrete implementations of peer interface
methods for calling wire protocol commands. It makes sense for this
class to inherit from the peer abstract base class. So we change
that.
Since httppeer and sshpeer have already been converted to the new
interface, peerrepository is no longer adding any value. So it has
been removed. httppeer and sshpeer have been updated to reflect the
loss of peerrepository and the inheritance of the abstract base
class in wirepeer.
The code changes in wirepeer are a reordering of methods, grouping them
by interface.
Some Python code in tests was updated to reflect changed APIs.
.. api::

   peer.peerrepository has been removed. Use the repository.peer abstract
   base class to represent a peer repository.
Differential Revision: https://phab.mercurial-scm.org/D338
The last use of this API was removed in 3bcb9f9a4a63 in 2016. While
not formally deprecated, as of the last commit the code is no longer
explicitly tested. I think the new API has existed long enough for
people to transition to it.
I also have plans to more formalize the peer API and removing batch()
makes that work easier.
I'm not convinced the current client-side API around batching is
great. But it's the best we have at the moment.
.. api:: remove peer.batch()

   Replace with peer.iterbatch().
Differential Revision: https://phab.mercurial-scm.org/D320
The remote batching code is difficult to read. Let's improve it.
As part of the refactor, the future returned by method calls on
batchiter() instances is now populated. However, you still need to
consume the results() generator for the future to be set. But at
least now we can stuff the future somewhere and not have to worry
about aligning method call order with result order since you can
use a future to hold the result.
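A hedged usage sketch, assuming 'remote' is a wire peer and 'heads' is a
batchable command:

    batch = remote.iterbatch()
    f = batch.heads()      # method calls return a future for the result
    batch.submit()         # send the batched request
    list(batch.results())  # consuming results() populates f
    print(f.value)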
Also as part of the change, we now verify that @batchable generators
yield exactly 2 values. In other words, we enforce their API.
The non-iter batcher has been unused since 3bcb9f9a4a63. And to my
surprise we had no explicit unit test coverage of it! test-batching.py
has been overhauled to use the iterating batcher.
Since the iterating batcher doesn't allow non-batchable method
calls nor local calls, tests have been updated to reflect reality.
The iterating batcher has been used for multiple releases apparently
without major issue. So this shouldn't cause alarm.
.. api::

   @peer.batchable functions must now yield exactly 2 values
Differential Revision: https://phab.mercurial-scm.org/D319
@peer.batchable decorated generator functions have two forms:

    yield value, None

and

    yield args, future
    yield value
These forms have been present since the decorator was introduced.
There are currently no in-repo consumers of the first form. So this
commit removes support for it.
Note that remoteiterbatcher.submit() asserts the 2nd form. And
3bcb9f9a4a63 removed the last user of remotebatcher, forcing everyone
to remoteiterbatcher. So anything relying on this in the wild would
have been broken since 3bcb9f9a4a63.
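For illustration, a @batchable method in the surviving form, loosely modeled
on wirepeer's lookup command (decoding simplified):

    from mercurial.peer import batchable, future
    from mercurial.node import bin

    class examplepeer(object):
        @batchable
        def lookup(self, key):
            f = future()
            # first yield: the wire arguments plus a future for the result
            yield {'key': key}, f
            # second and final yield: the decoded value
            yield bin(f.value[:-1])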
.. api::

   @peer.batchable can no longer emit local values
Differential Revision: https://phab.mercurial-scm.org/D318
remoteiterbatcher (unlike remotebatcher) only supports batchable
commands. This claim can be validated by comparing their
implementations of submit() and noting how remoteiterbatcher assumes
the invoked method has a "batchable" attribute, which is set by
@peer.batchable.
remoteiterbatcher has a custom __getattr__ that was trying to
validate that only batchable methods are called. However, it was only
validating that the called method exists, not that it is batchable.
This wasn't a big deal since remoteiterbatcher.submit() would raise
an AttributeError attempting to `mtd.batchable(...)`.
Let's fix the check and convert it to a ProgrammingError, which may
not have been around when this was originally implemented.
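A sketch of the corrected check (simplified from the class around it):

    from mercurial import error, peer

    class remoteiterbatcher(peer.batcher):
        def __getattr__(self, name):
            # validate that the method is batchable, not merely present
            fn = getattr(self._remote, name)
            if not getattr(fn, 'batchable', None):
                raise error.ProgrammingError(
                    'Attempted to batch a non-batchable call to %r' % name)
            return super(remoteiterbatcher, self).__getattr__(name)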
Differential Revision: https://phab.mercurial-scm.org/D317
This is done by a script [3] using RedBaron [1], a tool designed for doing
code refactoring. All "default" values are decided by the script and are
strongly consistent with the existing code.
There are 2 changes done manually to fix tests, since RedBaron is not
confident about how to indent things [2]:
[warn] mercurial/exchange.py: experimental.bundle2-output-capture: default needs manual removal
[warn] mercurial/localrepo.py: experimental.hook-track-tags: default needs manual removal
[1]: https://github.com/PyCQA/redbaron
[2]: https://github.com/PyCQA/redbaron/issues/100
[3]:
#!/usr/bin/env python
# codemod_configitems.py - codemod tool to fill configitems
#
# Copyright 2017 Facebook, Inc.
#
# This software may be used and distributed according to the terms of the
# GNU General Public License version 2 or any later version.
from __future__ import absolute_import, print_function

import os
import sys

import redbaron

def readpath(path):
    with open(path) as f:
        return f.read()

def writepath(path, content):
    with open(path, 'w') as f:
        f.write(content)

_configmethods = {'config', 'configbool', 'configint', 'configbytes',
                  'configlist', 'configdate'}

def extractstring(rnode):
    """get the string from a RedBaron string or call_argument node"""
    while rnode.type != 'string':
        rnode = rnode.value
    return rnode.value[1:-1]  # unquote, "'str'" -> "str"

def uiconfigitems(red):
    """match *.ui.config* pattern, yield (node, method, args, section, name)"""
    for node in red.find_all('atomtrailers'):
        entry = None
        try:
            obj = node[-3].value
            method = node[-2].value
            args = node[-1]
            section = args[0].value
            name = args[1].value
            if (obj in ('ui', 'self') and method in _configmethods
                and section.type == 'string' and name.type == 'string'):
                entry = (node, method, args, extractstring(section),
                         extractstring(name))
        except Exception:
            pass
        else:
            if entry:
                yield entry

def coreconfigitems(red):
    """match coreconfigitem(...) pattern, yield (node, args, section, name)"""
    for node in red.find_all('atomtrailers'):
        entry = None
        try:
            args = node[1]
            section = args[0].value
            name = args[1].value
            if (node[0].value == 'coreconfigitem' and section.type == 'string'
                and name.type == 'string'):
                entry = (node, args, extractstring(section),
                         extractstring(name))
        except Exception:
            pass
        else:
            if entry:
                yield entry

def registercoreconfig(cfgred, section, name, defaultrepr):
    """insert coreconfigitem to cfgred AST

    section and name are plain string, defaultrepr is a string
    """
    # find a place to insert the "coreconfigitem" item
    entries = list(coreconfigitems(cfgred))
    for node, args, nodesection, nodename in reversed(entries):
        if (nodesection, nodename) < (section, name):
            # insert after this entry
            node.insert_after(
                'coreconfigitem(%r, %r,\n'
                '    default=%s,\n'
                ')' % (section, name, defaultrepr))
            return

def main(argv):
    if not argv:
        print('Usage: codemod_configitems.py FILES\n'
              'For example, FILES could be "{hgext,mercurial}/*/**.py"')
        return 1
    dirname = os.path.dirname
    reporoot = dirname(dirname(dirname(os.path.abspath(__file__))))

    # register configitems to this destination
    cfgpath = os.path.join(reporoot, 'mercurial', 'configitems.py')
    cfgred = redbaron.RedBaron(readpath(cfgpath))

    # state about what to do
    registered = set((s, n) for n, a, s, n in coreconfigitems(cfgred))
    toregister = {}  # {(section, name): defaultrepr}
    coreconfigs = set()  # {(section, name)}, whether it's used in core

    # first loop: scan all files before taking any action
    for i, path in enumerate(argv):
        print('(%d/%d) scanning %s' % (i + 1, len(argv), path))
        iscore = ('mercurial' in path) and ('hgext' not in path)
        red = redbaron.RedBaron(readpath(path))
        # find all repo.ui.config* and ui.config* calls, and collect their
        # section, name and default value information.
        for node, method, args, section, name in uiconfigitems(red):
            if section == 'web':
                # [web] section has some weirdness, ignore them for now
                continue
            defaultrepr = None
            key = (section, name)
            if len(args) == 2:
                if key in registered:
                    continue
                if method == 'configlist':
                    defaultrepr = 'list'
                elif method == 'configbool':
                    defaultrepr = 'False'
                else:
                    defaultrepr = 'None'
            elif len(args) >= 3 and (args[2].target is None or
                                     args[2].target.value == 'default'):
                # try to understand the "default" value
                dnode = args[2].value
                if dnode.type == 'name':
                    if dnode.value in {'None', 'True', 'False'}:
                        defaultrepr = dnode.value
                elif dnode.type == 'string':
                    defaultrepr = repr(dnode.value[1:-1])
                elif dnode.type in ('int', 'float'):
                    defaultrepr = dnode.value
            # inconsistent default
            if key in toregister and toregister[key] != defaultrepr:
                defaultrepr = None
            # interesting to rewrite
            if key not in registered:
                if defaultrepr is None:
                    print('[note] %s: %s.%s: unsupported default'
                          % (path, section, name))
                    registered.add(key)  # skip checking it again
                else:
                    toregister[key] = defaultrepr
                    if iscore:
                        coreconfigs.add(key)

    # second loop: rewrite files given "toregister" result
    for path in argv:
        # reconstruct redbaron - trade CPU for memory
        red = redbaron.RedBaron(readpath(path))
        changed = False
        for node, method, args, section, name in uiconfigitems(red):
            key = (section, name)
            defaultrepr = toregister.get(key)
            if defaultrepr is None or key not in coreconfigs:
                continue
            if len(args) >= 3 and (args[2].target is None or
                                   args[2].target.value == 'default'):
                try:
                    del args[2]
                    changed = True
                except Exception:
                    # redbaron fails to do the rewrite due to indentation
                    # see https://github.com/PyCQA/redbaron/issues/100
                    print('[warn] %s: %s.%s: default needs manual removal'
                          % (path, section, name))
            if key not in registered:
                print('registering %s.%s' % (section, name))
                registercoreconfig(cfgred, section, name, defaultrepr)
                registered.add(key)
        if changed:
            print('updating %s' % path)
            writepath(path, red.dumps())

    if toregister:
        print('updating configitems.py')
        writepath(cfgpath, cfgred.dumps())

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
Previously, a repo containing secret changesets would be served via
stream clone, transferring those secret changesets. While secret
changesets aren't meant to imply strong security (if you really
want to keep them secret, others shouldn't have read access to the
repo), we should at least make an effort to protect secret changesets
when possible.
After this commit, we no longer serve stream clones for repos
containing secret changesets by default. This is backwards
incompatible behavior. In case anyone is relying on the behavior,
we provide a config option to opt into the old behavior.
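For anyone relying on the old behavior, the opt-in looks like this (a
sketch; option name per this change):

    [server]
    uncompressedallowsecret = true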
Note that this defense is only beneficial for remote repos
accessed via the wire protocol: if a client has access to the
files backing a repo, they can get to the raw data and see secret
revisions.
For large enough repositories, pull-based clones take too long, and an attempt
to use one usually indicates a configuration or other issue, or an outdated
Mercurial client. Add a config option to disable them.
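The new knob (option name per this series, shown for illustration):

    [server]
    disablefullbundle = true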
os.fdopen() does not accept bytes as its second argument, which represents
the mode in which the file is to be opened. This patch makes sure native
str values are passed on py3 by using pycompat.sysstr().
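A minimal sketch of the pattern (the wrapper name is hypothetical):

    import os
    from mercurial import pycompat

    def fdopen(fd, mode=b'rb'):
        # os.fdopen() wants a native str mode on Python 3, not bytes
        return os.fdopen(fd, pycompat.sysstr(mode))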
Previously, an Abort raised during a 'getbundle' call was poorly reported
(HTTP-500 for http, some scary messages for ssh). Abort errors have been
properly reported for "push" for a long time; there is no reason for
'getbundle' to be different. We now properly catch such errors and report
them back the best way available. With bundle2, we issue a valid bundle2
reply (as expected by the client) with an 'error:abort' part. With bundle1
we do the best we can depending on whether http or ssh is in use.
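A hedged sketch of the resulting handling (simplified; the helper name is
hypothetical, 'ooberror' and 'streamres' are the wireproto response types,
and the bundle1 dispatch is the one discussed in the next changesets):

    from mercurial import bundle2, exchange
    from mercurial.wireproto import ooberror, streamres

    def abortresponse(repo, proto, exc, opts):
        # bundle1 clients: dispatch on the protocol in use
        if not exchange.bundle2requested(opts.get('bundlecaps')):
            if proto.name == 'http':
                return ooberror(str(exc) + '\n')
            raise exc  # cannot do better for bundle1 + ssh
        # bundle2 clients expect a valid bundle2 reply
        bundler = bundle2.bundle20(repo.ui)
        manargs = [('message', str(exc))]
        advargs = []
        if exc.hint is not None:
            advargs.append(('hint', exc.hint))
        bundler.addpart(bundle2.bundlepart('error:abort', manargs, advargs))
        return streamres(gen=bundler.getchunks())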
Changeset a0966f529e1b introduced a config option to have the server deny
pulls using bundle1. The original protocol was not really designed to allow
that kind of error reporting, so some hack was used. It turns out the hack
only works on HTTP and that the ssh server hangs forever when it is used.
After further digging, there is no way to report the error in a unified way.
Using `ooberror` freezes ssh, and raising 'Abort' makes HTTP return an
HTTP-500 without further details. So, with sadness, we implement a version
that dispatches according to the protocol in use.
Now the error is properly reported, but we still have an ungraceful abort
after that. The protocol does not allow anything better to happen with
bundle1.
Changeset a0966f529e1b introduced a config option to have the server deny
pushes using bundle1. The original protocol was not really designed to allow
that kind of error reporting, so some hack was used. It turns out the hack
only works on HTTP and that the ssh wire peer hangs forever when the same
hack is used. After further digging, there is no way to report the error in
a unified way. Using 'ooberror' freezes ssh, and raising 'Abort' makes HTTP
return an HTTP-500 without further details. So, with sadness, we implement
a version that dispatches according to the protocol in use.
We also add a test for pushing over ssh to make sure we won't regress in the
future. That test shows that the hint is missing; this is another bug, fixed
in the next changeset.
This commit introduces support for advertising a server's support for
media types and compression formats in accordance with the spec defined
in internals.wireproto.
The bulk of the new code is a helper function in wireproto.py to
obtain a prioritized list of compression engines available to the
wire protocol. While not utilized yet, we implement support
for obtaining the list of compression engines advertised by the
client.
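As a rough sketch of what the advertisement looks like (the helper name and
engine API details here are illustrative):

    def advertisecompengines(caps, compengines):
        # 'compengines' is the prioritized list obtained from the new
        # helper in wireproto.py; each engine exposes its wire name
        names = [e.wireprotosupport().name for e in compengines]
        caps.append('compression=%s' % ','.join(names))
        return caps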
The upcoming HTTP protocol enhancements are a bit lower-level than
existing tests (most existing tests are command centric). So,
this commit establishes a new test file that will be appropriate
for holding tests around the functionality of the HTTP protocol
itself.
Rounding out this change, `hg debuginstall` now prints compression
engines available to the server.
Previously, the capabilities list was protocol agnostic and we
advertised the same capabilities list to all clients, regardless of
transport protocol.
A few capabilities are specific to HTTP. I see no good reason why we
should advertise them to SSH clients. So this patch limits their
advertisement to HTTP clients.
This patch is BC, but SSH clients shouldn't be using the removed
capabilities so there should be no impact.
Almost all sys.stdin/out/err in hgext/ and mercurial/ are replaced by util's.
There are a few exceptions:
- lsprof.py and statprof.py are untouched since they are a kind of vendor
code and they never import mercurial modules right now.
- ui._readline() needs to replace sys.stdin and stdout to pass them to
raw_input(). We'll need another workaround here.
Currently, the "streamres" response type is populated with a generator
of chunks with compression possibly already applied. This puts the onus
on commands to perform chunking and compression. Architecturally, I
think this is the wrong place to perform this work. I think commands
should say "here is the data" and the protocol layer should take care
of encoding the final bytes to put on the wire.
Additionally, upcoming commits will improve wire protocol support for
compression. Having a central place for performing compression in the
protocol transport layer will be easier than having to deal with
compression at the commands layer.
This commit refactors the "streamres" response type to accept either
a generator or an object with a "read" method. Additionally, the type
now accepts a flag indicating whether the response is a "version 1
compressible" response. This basically identifies all commands
currently performing compression. I could have used a special type
for this, but a flag works just as well. The argument name
foreshadows the introduction of wire protocol changes, hence the "v1."
The code for chunking and compressing has been moved to the output
generation function for each protocol transport. Some code has been
inlined, resulting in the deletion of now unused methods.
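A simplified sketch of the resulting type (attribute names per this
description):

    class streamres(object):
        """wire protocol response streamed to the client

        Accepts either a chunk generator ('gen') or an object with a
        .read() method ('reader'); 'v1compressible' marks responses the
        version 1 protocol may compress.
        """
        def __init__(self, gen=None, reader=None, v1compressible=False):
            self.gen = gen
            self.reader = reader
            self.v1compressible = v1compressible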
Currently, the "getbundle" wire protocol command obtains a generator of
data, converts it to a util.chunkbuffer, then converts it back to a
generator via the protocol's groupchunks() implementation. For the SSH
protocol, groupchunks() simply reads 4kb chunks then write()s the
data to a file descriptor. For the HTTP protocol, groupchunks() reads
32kb chunks, feeds those into a zlib compressor, emits compressed data
as it is available, and that is sent to the WSGI layer, where it is
likely turned into HTTP chunked transfer chunks as is or further
buffered and turned into a larger chunk.
For both the SSH and HTTP protocols, there is inefficiency from using
util.chunkbuffer.
For SSH, emitting consistent 4kb chunks sounds nice. However, the file
descriptor it is writing to is almost certainly buffered. That means
that a Python .write() probably doesn't translate into exactly what is
written to the I/O layer.
For HTTP, we're going through an intermediate layer to zlib compress
data. So all util.chunkbuffer is doing is ensuring that the chunks we
feed into the zlib compressor are of uniform size. This means more CPU
time in Python buffering and emitting chunks in util.chunkbuffer but
fewer function calls to zlib.
This patch introduces and implements a new wire protocol abstract
method: compresschunks(). It is like groupchunks() except it operates
on a generator instead of something with a .read(). The SSH
implementation simply proxies chunks. The HTTP implementation uses
zlib compression.
To avoid duplicate code, the HTTP groupchunks() has been reimplemented
in terms of compresschunks().
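Hedged sketches of the two implementations described above (class and
method bodies heavily simplified):

    import zlib

    class sshserver(object):
        def compresschunks(self, chunks):
            # SSH: proxy chunks straight through; the fd is buffered anyway
            for chunk in chunks:
                yield chunk

    class webproto(object):
        def compresschunks(self, chunks):
            # HTTP: feed chunks through a zlib compressor as they arrive
            z = zlib.compressobj()
            for chunk in chunks:
                data = z.compress(chunk)
                if data:  # the compressor may buffer without emitting
                    yield data
            yield z.flush()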
To prove this all works, the "getbundle" wire protocol command has been
switched to compresschunks(). This removes the util.chunkbuffer from
that command. Now, data essentially streams straight from the
changegroup emitter to the wire, possibly through a zlib compressor.
Generators all the way, baby.
There were slim to no performance changes on the server as measured
with the mozilla-central repository. This is likely because CPU
time is dominated by reading revlogs, producing the changegroup, and
zlib compressing the output stream. Still, this brings us a little
closer to our ideal of using generators everywhere.
Currently, exchange.getbundle() returns either a cg1unpacker or a
util.chunkbuffer (in the case of bundle2). This is kinda OK, as
both expose a .read() to consumers. However, localpeer.getbundle()
has code inferring what the response type is based on arguments and
converts the util.chunkbuffer returned in the bundle2 case to a
bundle2.unbundle20 instance. This is a sign that the API for
exchange.getbundle() is not ideal because it doesn't consistently
return an "unbundler" instance.
In addition, unbundlers mask the fact that there is an underlying
generator of changegroup data. In both cg1 and bundle2, this generator
is being fed into a util.chunkbuffer so it can be re-exposed as a
file object.
util.chunkbuffer is a nice abstraction. However, it should only be
used "at the edges." This is because keeping data as a generator is
more efficient than converting it to a chunkbuffer, especially if we
convert that chunkbuffer back to a generator (as is the case in some
code paths currently).
This patch refactors exchange.getbundle() into
exchange.getbundlechunks(). The new API returns an iterator of chunks
instead of a file-like object.
Callers of exchange.getbundle() have been updated to use the new API.
There is a minor change of behavior in test-getbundle.t. This is
because `hg debuggetbundle` isn't defining bundlecaps. As a result,
a cg1 data stream and unpacker is being produced. This is getting fed
into a new bundle20 instance via bundle2.writebundle(), which uses
a backchannel mechanism between changegroup generation to add the
"nbchanges" part parameter. I never liked this backchannel mechanism
and I plan to remove it someday. `hg bundle` still produces the
"nbchanges" part parameter, so there should be no user-visible
change of behavior. I consider this "regression" a bug in
`hg debuggetbundle`. And that bug is captured by an existing
"TODO" in the code to use bundle2 capabilities.
groupchunks() is a generic "turn a file object into a generator"
function. It isn't limited to changegroups. Rename the argument
and update the docstring to reflect this.
This variable has been unused since 4e11f86d8ce7, which was over
2 years ago. gboptsmap should be used instead.
Marking as API because this could break extensions.
Clients escape both argument names and values when using the
batch command. Yet the server was only unescaping argument values.
Fortunately we don't have any argument names that need escaping. But
that isn't an excuse for the lack of symmetry in the code.
Since the server wasn't properly unescaping argument names, this
means we can never introduce an argument to a batchable command that
needs escaping, because an old server wouldn't properly decode its name.
So we've introduced an assertion to detect the accidental introduction
of this in the future. Of course, we could introduce a server
capability that says the server knows how to decode argument names and
allow special argument names to go through. But until there is a need
for it (which I doubt there will be), we shouldn't bother with adding
an unused capability.
Both wireproto.py and sshpeer.py had code for producing the value of
the "cmds" argument used by the "batch" command. This patch extracts
that code to a standalone function and uses it.
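A hedged sketch of the extracted helper, including the assertion from the
previous changeset (modeled on the escaping scheme used by 'batch'):

    def escapearg(plain):
        return (plain.replace(':', ':c').replace(',', ':o')
                     .replace(';', ':s').replace('=', ':e'))

    def encodebatchcmds(req):
        """Return a ;-separated value for the 'cmds' argument of 'batch'."""
        cmds = []
        for op, argsdict in req:
            # old servers didn't unescape argument names, so refuse to
            # send names that would need escaping
            assert all(escapearg(k) == k for k in argsdict)
            args = ','.join('%s=%s' % (escapearg(k), escapearg(v))
                            for k, v in argsdict.items())
            cmds.append('%s %s' % (op, args))
        return ';'.join(cmds)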
As part of teaching Mozilla's replication extension to better handle
repositories with obsolescence data, I encountered a few scenarios
where I wanted built-in wire protocol commands from replication clients
to operate on unfiltered repositories so they could have access to
obsolete changesets.
While the undocumented "web.view" config option provides a mechanism
to choose what filter/view hgweb operates on, this doesn't apply
to wire protocol commands because wireproto.dispatch() is always
operating on the "served" repo.
This patch extracts the line for obtaining the repo that
wireproto commands operate on to its own function so extensions
can monkeypatch it to e.g. return an unfiltered repo.
I stopped short of exposing a config option because I view the use
case for changing this as a niche feature, best left to the domain
of extensions.
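The extracted function is tiny; a sketch of it (function name hypothetical)
and of the kind of monkeypatch an extension could apply:

    def getdispatchrepo(repo, proto, command):
        """Obtain the repo used for processing wire protocol commands."""
        return repo.filtered('served')

    # an extension could then do, e.g.:
    # wireproto.getdispatchrepo = (
    #     lambda repo, proto, command: repo.unfiltered())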
All versions of Python we support or hope to support make the hash
functions available in the same way under the same name, so we may as
well drop the util forwards.
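For example, call sites now use hashlib directly instead of a util forward:

    import hashlib

    data = b'revision text'
    digest = hashlib.sha1(data).digest()  # previously util.sha1(data).digest()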
Now that batch can be used by remotefilelog, the quadratic string
copying this was doing was actually disastrous. In my local testing,
fetching a 56 meg file used to take 3 minutes, and now takes only a
few seconds.
writebundle() writes a bundle2 bundle or a plain changegroup1. Imagine
away the "2" in "bundle2.py" for a moment and this change should make
sense. The bundle wraps the changegroup, so it makes sense that it
knows about it. Another sign that this is correct is that the delayed
import of bundle2 in changegroup goes away.
I'll leave it for another time to remove the "2" in "bundle2.py"
(alternatively, extract a new bundle.py from it).
narrowhg (for its narrow spec) and remotefilelog (for its large batch
requests) would like to be able to make requests with argument sets so
absurdly large that they blow out total request size limit on some
http servers. As a workaround, support stuffing args at the start
of the POST body.
We will probably want to leave this behavior off by default in servers
forever, because it makes the old "POSTs are only for writes"
assumption wrong, which might break some of the simpler authentication
configurations.
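A hedged sketch of the client side (capability and header names per this
description; the real code also handles file-like request bodies):

    from urllib import urlencode  # py2-era, matching the codebase

    def buildpostbody(args, data):
        # the server advertised 'httppostargs': put the encoded arguments
        # at the start of the POST body instead of in request headers
        strargs = urlencode(sorted(args.items()))
        headers = {'X-HgArgs-Post': len(strargs)}
        return headers, strargs + (data or '')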
Unfortunately, the ssh and http implementations are slightly different
due to differences in their _callstream implementations, which
prevents ssh from behaving streamily. We should probably introduce a
new batch command that can stream results over ssh at some point in
the near future.
The streamy behavior of batch over http(s) is an enormous win for
remotefilelog over http: in my testing, it's saving about 40% on file
fetches with a cold cache against a server on localhost.
This is very much like ordinary batch(), but it will let me add a mode
for batch where we have pathologically large requests which are then
handled streamily. This will be a significant improvement for things
like remotefilelog, which may want to request thousands of entities at
once.
I recently implemented the server.bundle1* options to control whether
bundle1 exchange is allowed.
After thinking about Mozilla's strategy for handling generaldelta
rollout a bit more, I think server operators need an additional
lever: disable bundle1 if and only if the repo is generaldelta.
bundle1 exchange for non-generaldelta repos will not have the potential
for CPU explosion that generaldelta repos do. Therefore, it makes sense
for server operators to continue to allow bundle1 exchange for
non-generaldelta repos without having to set a per-repo hgrc option
to change the policy depending on whether the repo is generaldelta.
This patch introduces a new set of options to control bundle1 behavior
for generaldelta repos. These options enable server operators to limit
bundle1 restrictions to the class of repos that can be performance
issues. It also allows server operators to tie bundle1 access to store
format. In many server environments (including Mozilla's), legacy repos
will not be generaldelta and new repos will or might be. New repos often
aren't bound by legacy access requirements, so setting a global policy
that disallows access to new/generaldelta repos via bundle1 could be a
reasonable policy in many server environments. This patch makes this
policy very easy to implement (modify global hgrc, add options to
existing generaldelta repos to grandfather them in).
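For example (option names per this series), a site could keep bundle1 fully
available for legacy repos while refusing it for generaldelta ones:

    [server]
    # legacy non-generaldelta repos keep working with old clients
    bundle1 = true
    # but generaldelta repos refuse expensive bundle1 exchange
    bundle1gd = false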
bundle2 is the new and preferred wire protocol format. For various
reasons, server operators may wish to force clients to use it.
One reason is performance. If a repository is stored in generaldelta,
the server must recompute deltas in order to produce the bundle1
changegroup. This can be extremely expensive. For mozilla-central,
bundle generation typically takes a few minutes. However, generating
a non-gd bundle from a generaldelta encoded mozilla-central requires
over 30 minutes of CPU! If a large repository like mozilla-central
were encoded in generaldelta and non-gd clients connected, they could
easily flood a server by cloning.
This patch gives server operators config knobs to control whether
bundle1 is allowed for push and pull operations. The default is to
support legacy bundle1 clients, making this patch backwards compatible.
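The new knobs look like this (the push/pull variants fall back to the
general one):

    [server]
    bundle1 = true         # overall default: keep supporting bundle1
    bundle1.push = false   # but refuse bundle1 pushes specifically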