This is probably only used in the 'unbundle' command, but the code
ended up being cleaner to make it generic and treat *all* httppostargs
with a non-args request body as though they were file-like in
nature. It also means we get test coverage more or less for free. A
previous version of this change didn't use io.BytesIO, and it was a
lot more complicated.
This also fixes a server-side bug, so anyone using httppostargs should
update all of their servers to this revision or later *before* this
gets to their clients, otherwise servers will hang trying to over-read
the POST body.
Differential Revision: https://phab.mercurial-scm.org/D231
The repo instance is currently only used to provide a changeset
lookup function as part of parsing revsets. I /think/ this allows
node fragments to resolve. I'm not sure why we wouldn't want this
to always "just work" if parsing a revset string.
Plus, an upcoming commit will introduce a new consumer that needs a
handle on the repo. So passing it more often will make that code
work more.
Passing a repo instance in all callers of revset.match* results in
a bunch of test changes. Notably, branch and tags caches get
populated as part of evaluating revsets. I'm not sure if this is
desirable. So this patch takes the conservative approach and only
passes the repo if we're passing a ui instance.
Differential Revision: https://phab.mercurial-scm.org/D97
Add the followlines.js script and corresponding parameters as data attribute
on <tbody class="sourcelines"> element.
Extend CSS rules so that they also match the DOM structure of annotate view.
As previously, only address paper and gitweb styles (other styles do not have
followlines at all).
context.py seems not a good place to host these functions.
% wc -l mercurial/context.py mercurial/dagop.py
2306 mercurial/context.py
424 mercurial/dagop.py
2730 total
It seems sufficiently simple to use "profile(enabled=X)" to not justify having
a dedicated context manager just to read the config.
(I do not have a too strong opinion about this).
When on a filelog head, we are certain that there will be no descendant so the
target of the "descending" link will lead to an empty log result. Do not
display the link in this case.
When this "descend" query parameter is present along with "linerange"
parameter, we get revisions following line range in descending order. The
parameter has no effect without "linerange".
A common exploit in web applications that access paths is to insert
path separator strings like ".." to try to get the server to serve up
files it shouldn't.
We have code for detecting this in staticfile(). A subsequent commit
will need to perform this test as well. Since this is security code,
let's factor the check so we don't have to reinvent the wheel.
When "patch" query parameter is present in requests to filelog view, line ids
in patches diff are no longer unique in the page since several patches are
shown on the same page. We now prefix line id by changeset shortnode when
several patches are displayed in the same page to have unique line ids
overall.
The previous clause for filter out a diff hunk was too restrictive. We need to
consider the following cases (assuming linerange=(lb, ub) and the @s2,l2
hunkrange):
<-(s2)--------(s2+l2)->
<-(lb)---(ub)->
<-(lb)---(ub)->
<-(lb)---(ub)->
previously on the first and last situations were considered.
In test-hgweb-filelog.t, add a couple of lines at the beginning of file "b" so
that the line range we will follow does not start at the beginning of file.
This covers the change in aforementioned diff hunk filter clause.
This is used to filter out hunks based on their range (with respect to 'node2'
for patch.diffhunks() call, i.e. 'ctx' for webutil.diffs()).
This is the simplest way to filter diff hunks, here done on server side. Later
on, it might be interesting to perform this filtering on client side and
expose a "toggle" action to alternate between full and filtered diff.
We now handle a "linerange" URL query parameter to filter filelog using
a logic similar to followlines() revset.
The URL syntax is: log/<rev>/<file>?linerange=<fromline>:<toline>
As a result, filelog entries only consists of revision changing specified
line range.
The linerange information is propagated to "more"/"less" navigation links but
not to numeric navigation links as this would apparently require a dedicated
"revnav" class.
Only update the "paper" template in this patch.
Add support for a "patch" query parameter in filelog web command similar to
--patch option of `hg log` to display the diff of each changeset in the table
of revisions. The diff text is displayed in a dedicated row of the table that
follows the existing one for each entry and spans over all columns. Only
update "paper" template in this patch.
There's apparently no reason to have the "parity" of diff blocks that
webutil.diffs() generates coming from outside the function. So have it
internally managed. We thus now pass a "web" object to webutil.diffs() to get
access to both "repo" and "stripecount" attribute.
This is useful for when repositories are nested in --web-conf, and in the future
with hosted subrepositories. The previous behavior was only to render an index
at each virtual directory. There is now an explicit 'index' child for each
virtual directory. The name was suggested by Yuya, for consistency with the
other method names.
Additionally, there is now an explicit 'index' child for every repository
directory with a nested repository somewhere below it. This seems more
consistent with each virtual directory hosting an index, and more discoverable
than to only have an index for a directory that directly hosts a nested
repository. I couldn't figure out how to close the loop and provide one in each
directory without a deeper nested repository, without blocking a committed
'index' file. Keeping that seems better than rendering an empty index.
Changeset 11e325d162fe removed the mutable default value, but did not explicitly
tested for None. Such implicit testing can introduce semantic and performance
issue. We move to an explicit testing for None as recommended by PEP8:
https://www.python.org/dev/peps/pep-0008/#programming-recommendations
Changeset 45c7a22dbdc0 removed the mutable default value, but did not explicitly
tested for None. Such implicit testing can introduce semantic and performance
issue. We move to an explicit testing for None as recommended by PEP8:
https://www.python.org/dev/peps/pep-0008/#programming-recommendations
Function patch.diffhunks yields items for a "block" (i.e. a file) as a whole
so take advantage of this to simplify the algorithm and avoid parsing diff
lines to determine whether we're starting a new "block" or not. Thus we drop
to external block counter and rely on diffhunks iterations instead.
We also take advantage of the fact that patch.diffhunks() yields *lines* of
hunks (instead of a string) to avoid building a list that is ''.join-ed into a
string that is then split.
As lines in 'header' returned by patch.diffhunks() have no trailing new line,
we need to insert it ourselves to match template expectations.
There's only one case where `basectx` parameter is None (over two usages), so
it's probably not worth handling the special case as it makes code-reading
harder.
Along the way, use ctx.p1() instead of checking for ctx.parents() being empty
which should not occur.
New revsetlang module hosts parser, tokenizer, and miscellaneous functions
working on parsed tree. It does not include functions for evaluation such as
getset() and match().
2288 mercurial/revset.py
684 mercurial/revsetlang.py
2972 total
get*() functions are aliased since they are common in revset.py.
There's no apparent reason to have this "entries" generator function that
builds a list and then yields its elements in reverse order and which is only
called to build the "entries" list. So just build the list directly, in
reverse order.
Adjust "parity" generator's offset to keep rendering the same.
Content-Security-Policy (CSP) is a web security feature that allows
servers to declare what loaded content is allowed to do. For example,
a policy can prevent loading of images, JavaScript, CSS, etc unless
the source of that content is whitelisted (by hostname, URI scheme,
hashes of content, etc). It's a nifty security feature that provides
extra mitigation against some attacks, notably XSS.
Mitigation against these attacks is important for Mercurial because
hgweb renders repository data, which is commonly untrusted. While we
make attempts to escape things, etc, there's the possibility that
malicious data could be injected into the site content. If this happens
today, the full power of the web browser is available to that
malicious content. A restrictive CSP policy (defined by the server
operator and sent in an HTTP header which is outside the control of
malicious content), could restrict browser capabilities and mitigate
security problems posed by malicious data.
CSP works by emitting an HTTP header declaring the policy that browsers
should apply. Ideally, this header would be emitted by a layer above
Mercurial (likely the HTTP server doing the WSGI "proxying"). This
works for some CSP policies, but not all.
For example, policies to allow inline JavaScript may require setting
a "nonce" attribute on <script>. This attribute value must be unique
and non-guessable. And, the value must be present in the HTTP header
and the HTML body. This means that coordinating the value between
Mercurial and another HTTP server could be difficult: it is much
easier to generate and emit the nonce in a central location.
This commit introduces support for emitting a
Content-Security-Policy header from hgweb. A config option defines
the header value. If present, the header is emitted. A special
"%nonce%" syntax in the value triggers generation of a nonce and
inclusion in <script> elements in templates. The inclusion of a
nonce does not occur unless "%nonce%" is present. This makes this
commit completely backwards compatible and the feature opt-in.
The nonce is a type 4 UUID, which is the flavor that is randomly
generated. It has 122 random bits, which should be plenty to satisfy
the guarantees of a nonce.
With this commit, the HTTP transport now parses the X-HgProto-<N>
header to determine what media type and compression engine to use for
responses. So far, we only compress responses that are already being
compressed with zlib today (stream response types to specific
commands). We can expand things to cover additional response types
later.
The practical side-effect of this commit is that non-zlib compression
engines will be used if both ends support them. This means if both
ends have zstd support, zstd - not zlib - will be used to compress
data!
When cloning the mozilla-unified repository between a local HTTP
server and client, the benefits of non-zlib compression are quite
noticeable:
engine server CPU (s) client CPU (s) bundle size
zlib (l=6) 174.1 283.2 1,148,547,026
zstd (l=1) 99.2 267.3 1,127,513,841
zstd (l=3) 103.1 266.9 1,018,861,363
zstd (l=7) 128.3 269.7 919,190,278
zstd (l=10) 162.0 - 894,547,179
none 95.3 277.2 4,097,566,064
The default zstd compression level is 3. So if you deploy zstd
capable Mercurial to your clients and servers and CPU time on
your server is dominated by "getbundle" requests (clients cloning
and pulling) - and my experience at Mozilla tells me this is often
the case - this commit could drastically reduce your server-side
CPU usage *and* save on bandwidth costs!
Another benefit of this change is that server operators can install
*any* compression engine. While it isn't enabled by default, the
"none" compression engine can now be used to disable wire protocol
compression completely. Previously, commands like "getbundle" always
zlib compressed output, adding considerable overhead to generating
responses. If you are on a high speed network and your server is under
high load, it might be advantageous to trade bandwidth for CPU.
Although, zstd at level 1 doesn't use that much CPU, so I'm not
convinced that disabling compression wholesale is worthwhile. And, my
data seems to indicate a slow down on the client without compression.
I suspect this is due to a lack of buffering resulting in an increase
in socket read() calls and/or the fact we're transferring an extra 3 GB
of data (parsing HTTP chunked transfer and processing extra TCP packets
can add up). This is definitely worth investigating and optimizing. But
since the "none" compressor isn't enabled by default, I'm inclined to
punt on this issue.
This commit introduces tons of tests. Some of these should arguably
have been implemented on previous commits. But it was difficult to
test without the server functionality in place.
Moving archivespecs to the module level allows using it from other modules
(such as hgwebdir_mod), and keeping a reference to it in requestcontext allows
current code to just work.
Thus we allow dict-like indexing and "in" checks, and also preserve the order
of archive types and can generate links in a certain order (so
requestcontext.archives is no longer needed).
It would be nice for archive links to always be in a certain commonly used
order, such as 'zip', 'bz', 'gzip2'. Repo index page (hgwebdir_mod) already
shows archive links in this order, let's do the same in hgweb_mod.
Sadly, archivespecs is a regular unordered dict, and collections.OrderedDict is
new in 2.7. But requestcontext.archives is a tuple of archive types, so it can
be used as an index to archivespecs.
os.name returns unicodes on py3 and we have pycompat.osname which returns
bytes. This series of 2 patches will change every ocurrence of os.name with
pycompat.osname.
We add an attribute to the HTTP and SSH protocol implementations
identifying the transport so future patches can conditionally
expose capabilities on a per-transport basis.
This allows us to write doctests depending on a ui object, but not on global
configs.
ui.load() is a class method so we can do wsgiui.load(). All ui() calls but
for doctests are replaced with ui.load(). Some of them could be changed to
not load configs later.
I'll move createservice() to the server module, but createapp() seems good to
remain in the hgweb module because of its dependency on hgweb/hgwebdir_mod.
Almost all sys.stdin/out/err in hgext/ and mercurial/ are replaced by util's.
There are a few exceptions:
- lsprof.py and statprof.py are untouched since they are a kind of vendor
code and they never import mercurial modules right now.
- ui._readline() needs to replace sys.stdin and stdout to pass them to
raw_input(). We'll need another workaround here.
Currently, the "streamres" response type is populated with a generator
of chunks with compression possibly already applied. This puts the onus
on commands to perform chunking and compression. Architecturally, I
think this is the wrong place to perform this work. I think commands
should say "here is the data" and the protocol layer should take care
of encoding the final bytes to put on the wire.
Additionally, upcoming commits will improve wire protocol support for
compression. Having a central place for performing compression in the
protocol transport layer will be easier than having to deal with
compression at the commands layer.
This commit refactors the "streamres" response type to accept either
a generator or an object with "read." Additionally, the type now
accepts a flag indicating whether the response is a "version 1
compressible" response. This basically identifies all commands
currently performing compression. I could have used a special type
for this, but a flag works just as well. The argument name
foreshadows the introduction of wire protocol changes, hence the "v1."
The code for chunking and compressing has been moved to the output
generation function for each protocol transport. Some code has been
inlined, resulting in the deletion of now unused methods.
43e3fb1c484e introduced a call to fctx.parents() for each line in
annotate output. This function call isn't cheap, as it requires
linkrev adjustment.
Since multiple lines in annotate output tend to belong to the same
file revision, a cache of fctx.parents() lookups for each input
should be effective in the common case. So we implement one.
Since the cache has to precompute parents so an aborted generator
doesn't leave an incomplete cache, we could just return a list.
However, we preserve the generator for backwards compatibility.
The effect of this change when requesting /annotate/96ca0ecdcfa/
browser/locales/en-US/chrome/browser/downloads/downloads.dtd on
the mozilla-aurora repo is significant:
p1(43e3fb1c484e) 5.5s
43e3fb1c484e: 66.3s
this patch: 10.8s
We're still slower than before. But only by ~2x instead of ~12x.
On the tip revisions of layout/base/nsCSSFrameConstructor.cpp file in
the mozilla-unified repo, time went from 12.5s to 14.5s and back to
12.5s. I'm not sure why the mozilla-aurora repo is so slow.
Looking at the code of basefilectx.parents(), there is room for
further improvements. Notably, we still perform redundant calls to
filelog.renamed() and basefilectx._parentfilectx(). And
basefilectx.annotate() also makes similar calls, so there is potential
for object reuse. However, introducing caches here are not appropriate
for the stable branch.
Currently, the "getbundle" wire protocol command obtains a generator of
data, converts it to a util.chunkbuffer, then converts it back to a
generator via the protocol's groupchunks() implementation. For the SSH
protocol, groupchunks() simply reads 4kb chunks then write()s the
data to a file descriptor. For the HTTP protocol, groupchunks() reads
32kb chunks, feeds those into a zlib compressor, emits compressed data
as it is available, and that is sent to the WSGI layer, where it is
likely turned into HTTP chunked transfer chunks as is or further
buffered and turned into a larger chunk.
For both the SSH and HTTP protocols, there is inefficiency from using
util.chunkbuffer.
For SSH, emitting consistent 4kb chunks sounds nice. However, the file
descriptor it is writing to is almost certainly buffered. That means
that a Python .write() probably doesn't translate into exactly what is
written to the I/O layer.
For HTTP, we're going through an intermediate layer to zlib compress
data. So all util.chunkbuffer is doing is ensuring that the chunks we
feed into the zlib compressor are of uniform size. This means more CPU
time in Python buffering and emitting chunks in util.chunkbuffer but
fewer function calls to zlib.
This patch introduces and implements a new wire protocol abstract
method: compresschunks(). It is like groupchunks() except it operates
on a generator instead of something with a .read(). The SSH
implementation simply proxies chunks. The HTTP implementation uses
zlib compression.
To avoid duplicate code, the HTTP groupchunks() has been reimplemented
in terms of compresschunks().
To prove this all works, the "getbundle" wire protocol command has been
switched to compresschunks(). This removes the util.chunkbuffer from
that command. Now, data essentially streams straight from the
changegroup emitter to the wire, possibly through a zlib compressor.
Generators all the way, baby.
There were slim to no performance changes on the server as measured
with the mozilla-central repository. This is likely because CPU
time is dominated by reading revlogs, producing the changegroup, and
zlib compressing the output stream. Still, this brings us a little
closer to our ideal of using generators everywhere.
object should appear at the end, otherwise it tries to pre-empt the other
new-style classes in the MRO, resulting in an unresolvable MRO in Py3. We still
need to include object because otherwise in 2.7 we end up with an old-style
class if threading is not supported, new-style if it is.
This patch moves "fctx.annotate" used by the "annotate" webcommand, along
with the diffopts to a separated function which takes a ui and a fctx.
So it could be replaced by other implementations which don't want to replace
the core "fctx.annotate" directly.
groupchunks() is a generic "turn a file object into a generator"
function. It isn't limited to changegroups. Rename the argument
and update the docstring to reflect this.
Now that all the functionality has been moved to manifestlog/manifestrevlog/etc,
we can finally change all the uses of repo.manifest to use the new versions. A
future diff will then delete repo.manifest.
One additional change in this commit is to change repo.manifestlog to be a
@storecache property instead of @property. This is required by some uses of
repo.manifest require that it be settable (contrib/perf.py and the static http
server). We can't do this in a prior change because we can't use @storecache on
this until repo.manifest is no longer used anywhere.
Something weird is happening that breaks pyflakes installed via 'pip
install --user'. I haven't had a chance to finish debugging this, but
this at least fixes the build.
More low-level compression code elimination because we now have nice
APIs.
This patch also demonstrates why we needed and implemented the
"level" option on the "compressstream" API.
When doing streaming compression with zlib, zlib appears to emit chunks
with data after ~20-30kb on average is available. In other words, most
calls to compress() return an empty string. On the mozilla-unified repo,
only 48,433 of 921,167 (5.26%) of calls to compress() returned data.
In other words, we were sending hundreds of thousands of empty chunks
via a generator where they touched who knows how many frames (my guess
is millions). Filtering out the empty chunks from the generator
cuts down on overhead.
In addition, we were previously feeding 8kb chunks into zlib
compression. Since this function tends to emit *compressed* data after
20-30kb is available, it would take several calls before data was
produced. We increase the amount of data fed in at a time to 32kb.
This reduces the number of calls to compress() from 921,167 to
115,146. It also reduces the number of output chunks from 48,433 to
31,377. This does increase the average output chunk size by a little.
But I don't think this will matter in most scenarios.
The combination of these 2 changes appears to shave ~6s CPU time
or ~3% from a server serving the mozilla-unified repo.
Currently, running `hg serve --profile` doesn't yield anything useful:
when the process is terminated the profiling output displays results
from the main thread, which typically spends most of its time in
select.select(). Furthermore, it has no meaningful results from
mercurial.* modules because the threads serving HTTP requests don't
actually get profiled.
This patch teaches the hgweb wsgi applications to profile individual
requests. If profiling is enabled, the profiler kicks in after
HTTP/WSGI environment processing but before Mercurial's main request
processing.
The profile results are printed to the configured profiling output.
If running `hg serve` from a shell, they will be printed to stderr,
just before the HTTP request line is logged. If profiling to a file,
we only write a single profile to the file because the file is not
opened in append mode. We could add support for appending to files
in a future patch if someone wants it.
Per request profiling doesn't work with the statprof profiler because
internally that profiler collects samples from the thread that
*initially* requested profiling be enabled. I have plans to address
this by vendoring Facebook's customized statprof and then improving
it.
Before this patch, the HTTP transport protocol would always zlib
compress certain responses (notably "getbundle" wire protocol commands)
at zlib compression level 6.
zlib can be a massive CPU resource sink for servers. Some server
operators may wish to reduce server-side CPU requirements while
requiring more bandwidth. This is common on corporate intranets, for
example. Others may wish to use more CPU but reduce bandwidth.
This patch introduces a config option to allow server operators
to control the zlib compression level.
On the "mozilla-unified" generaldelta repository, setting this
value to "0" (disable compression) results in server-side CPU
utilization for a `hg clone` going from ~180s to ~124s CPU time on
my i7-6700K. A level of "1" (which increases the transfer size from
~1,074 MB at level 6 to ~1,222 MB) utilizes ~132s CPU time.
The BaseHTTPServer, SimpleHTTPServer and CGIHTTPServer has been merged into
http.server in python 3. All of them has been merged as util.httpserver to use
in both python 2 and 3. This patch adds a regex to check-code to warn against
the use of BaseHTTPServer. Moreover this patch also includes updates to lower
part of test-check-py3-compat.t which used to remain unchanged.
This patch transitions the built-in HTTPS server to use sslutil for
creating the server socket.
As part of this transition, we implement developer-only config options
to control CA loading and whether to require client certificates. This
eliminates the need for the custom extension in test-https.t to define
these.
There is a slight change in behavior with regards to protocol
selection. Before, we would always use the TLS 1.0 constant to define
the protocol version. This would *only* use TLS 1.0. sslutil defaults
to TLS 1.0+. So this patch improves the security of `hg serve` out of
the box by allowing it to use TLS 1.1 and 1.2 (if available).
Doing this will allow access to the lines in arbitrary order (because the
result of enumerate() is an iterator), and that will help calculating rowspan
for annotate blocks.
The link is embedded into a div with class="annotate-info" that only shows up
upon hover of the annotate column. To avoid duplicate hover-overs (this new
one and the one coming from link's title), drop "title" attribute from a
element and put it in the annotate-info element.
Previously, ETag headers from hgweb weren't correctly formed, because rfc2616
(section 14, header definitions) requires double quotes around the content of
the header. str(web.mtime) didn't do that.
Additionally, strong ETags signify that the resource representations are
byte-for-byte identical. That is, they can be reconstructed from byte ranges if
client so wishes. Considering ETags for all hgweb pages is just mtime of
00changelog.i and doesn't consider of e.g. .hg/hgrc with description, contact
and other fields, it's clearly shouldn't be strong. The W/ prefix marks it as
weak, which still allows caching the whole served file/page, but doesn't allow
byte-range requests.
hgweb currently offers limited functionality for "classifying"
repositories. This patch aims to change that.
The web.labels config option list is introduced. Its values
are exposed to the "index" and "summary" templates. Custom
templates can use template features like ifcontains() to e.g.
look for the presence of a specific label and engage specific
behavior. For example, a site operator may wish to assign a
"defunct" label to a repository so the repository is prominently
marked as dead in repository indexes.
I.e. when a revision blames a block of source lines, only display the
revision link on the first line of the block (this is identified by the
"blockhead" key in annotate context).
This addresses item "Visual grouping of changesets" of the blame improvements
plan (https://www.mercurial-scm.org/wiki/BlamePlan) which states: "Typically
there are block of lines all attributed to the same revision. Instead of
rendering the revision/changeset for every line, we could only render it once
per block."
Change summary webcommand to yield each element of the shortlog instead of the
entire list.
This makes generated json more readable since each entry can be formatted
separately, instead of returning all the shortlog content in a single string.
Even though it would be useless to start a web server by a command server,
it should be doable in principle. Also, we can't use sys.stdout/err directly
on Python 3 because they are unicode streams.
New frommapfile() function will make it clear when template aliases will be
loaded. They should be applied to command arguments and templates in hgrc,
but not to map files. Otherwise, our stock styles and web templates
(i.e map-file templates) could be modified unintentionally.
Future patches will add "aliases" argument to __init__(), but not to
frommapfile().