This patch also makes some expected output lines in tests glob-ed for
persistence of them.
BTW, files below aren't yet changed in 2017, but this patch also
updates copyright of them, because:
- mercurial/help/hg.1.txt
almost all of "man hg" output comes from online help of hg
command, and is already changed in 2017
- mercurial/help/hgignore.5.txt
- mercurial/help/hgrc.5
"copyright 2005-201X Matt Mackall" in them mentions about
copyright of Mercurial itself
0b5f1f2efc77 introduced handling of a crash in this case. A review comment
suggested that it was not entirely obvious that a 'dm' always would have a 'r'
for the source file.
To mitigate that risk, make the code more conservative and make less
assumptions.
Work around that 'dm' in the data model only can have one operation for the
target file, but still can have multiple and conflicting operations on the
source file where the other operation is a 'rm'. The move would thus fail with
'abort: No such file or directory'.
In this case it is "obvious" that the file should be removed, either before or
after moving it. We thus keep the 'rm' of the source file but drop the 'dm'.
This is not a pretty fix but quite "obviously" safe (famous last words...) as
it only touches a rare code path that used to crash. It is possible that it
would be better to swap the files for 'dm' as suggested on
https://bz.mercurial-scm.org/show_bug.cgi?id=5020#c13 but it is not entirely
obvious that it not just would create conflicts on the other file. That can be
revisited later.
dict.keys() is documented to return a copy, so it's surprising that
sortdict.keys() did not. I noticed this because we have an extension
that calls readlocaltags(). That method tries to remove any tags that
point to non-existent revisions (most likely stripped). However, since
it's unintentionally working on the instance it's modifying, it
sometimes fails to remove tags when there are multiple bad tags in a
row. This was not caught because localrepo.tags() does an additional
layer of filtering.
sortdict is also used in other places, but I have not checked whether
its keys() and/or __delitem__() methods are used there.
outgoing() and remote() may stall for long due to network I/O, which seems
unsafe per definition, "whether a predicate is safe for DoS attack." But I'm
not 100% sure about this. If our concern isn't elapsed time but CPU resource,
these predicates are considered safe. Perhaps that would be up to the
web/application server configuration?
Anyway, outgoing() and remote() wouldn't be useful in hgweb, so I think
it's okay to ban them.
statprof has a __main__ handler that allows viewing of previously
written data files. As Yuya pointed out during review, 82ee01726a77
broke this. This patch fixes that.
Until callsites are updated, this will have no effect. Once callsites
are updated, specifying experimental.editortmpinhg will create editor
temporary files in a subdirectory of .hg, which will make it easier
for tool integrations to determine what repository is in play when
they're asked to edit an hg-related file.
Care needs to be taken to prevent leaking potentially sensitive environment
variables through hgweb, if template support for environment variables is to be
introduced. There are a few ideas about the API for preventing accidental
leaking [1]. Option 3 seems best from the POV of not needing to configure
anything in the normal case. I couldn't figure out how to do that, so guard it
with an experimental option for now.
[1] https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-January/092383.html
Narrowhg has been using "1 << 14" as its revlog flag value for a long
time. We (Google) have many repos with that value in production
already. When the same value was reserved for EXTSTORED, it made those
repos invalid. Upgrading them will be a little painful. We should
clearly have reserved the value for narrowhg a long time ago. Since
the EXTSTORED flag is not yet in any release and Facebook also says
they have not started using it in production, so it should be okay to
change it. This patch gives the current value (1 << 14) back to
narrowhg and gives a new value (1 << 13) to EXTSTORED.
Before this change, the text about revlog flags was reflowed into a
single paragraph, which made it a bit hard to read. I don't even know
the rules around this, but adding a blank line before each flag seems
to prevent the reflowing.
util.buffer() either returns inbuilt buffer function or defines a new one which
slices. The inbuilt buffer() also has a length argument which is missing from
the ones we defined. This patch adds that length argument.
pycompat.getenv returns os.getenvb on py3 which is not available on Windows.
This patch replaces them with encoding.environ.get and checks to ensure no
new instances of os.getenv or os.setenv are introduced.
The final part of integrating the compression manager APIs into
revlog storage is the plumbing for repositories to advertise they
are using non-zlib storage and for revlogs to instantiate a non-zlib
compression engine.
The main intent of the compression manager work was to zstd all
of the things. Adding zstd to revlogs has proved to be more involved
than other places because revlogs are... special. Very small inputs
and the use of delta chains (which are themselves a form of
compression) are a completely different use case from streaming
compression, which bundles and the wire protocol employ. I've
conducted numerous experiments with zstd in revlogs and have yet
to formalize compression settings and a storage architecture that
I'm confident I won't regret later. In other words, I'm not yet
ready to commit to a new mechanism for using zstd - or any other
compression format - in revlogs.
That being said, having some support for zstd (and other compression
formats) in revlogs in core is beneficial. It can allow others to
conduct experiments.
This patch introduces *highly experimental* support for non-zlib
compression formats in revlogs. Introduced is a config option to
control which compression engine to use. Also introduced is a namespace
of "exp-compression-*" requirements to denote support for non-zlib
compression in revlogs. I've prefixed the namespace with "exp-"
(short for "experimental") because I'm not confident of the
requirements "schema" and in no way want to give the illusion of
supporting these requirements in the future. I fully intend to drop
support for these requirements once we figure out what we're doing
with zstd in revlogs.
A good portion of the patch is teaching the requirements system
about registered compression engines and passing the requested
compression engine as an opener option so revlogs can instantiate
the proper compression engine for new operations.
That's a verbose way of saying "we can now use zstd in revlogs!"
On an `hg pull` conversion of the mozilla-unified repo with no extra
redelta settings (like aggressivemergedeltas), we can see the impact
of zstd vs zlib in revlogs:
$ hg perfrevlogchunks -c
! chunk
! wall 2.032052 comb 2.040000 user 1.990000 sys 0.050000 (best of 5)
! wall 1.866360 comb 1.860000 user 1.820000 sys 0.040000 (best of 6)
! chunk batch
! wall 1.877261 comb 1.870000 user 1.860000 sys 0.010000 (best of 6)
! wall 1.705410 comb 1.710000 user 1.690000 sys 0.020000 (best of 6)
$ hg perfrevlogchunks -m
! chunk
! wall 2.721427 comb 2.720000 user 2.640000 sys 0.080000 (best of 4)
! wall 2.035076 comb 2.030000 user 1.950000 sys 0.080000 (best of 5)
! chunk batch
! wall 2.614561 comb 2.620000 user 2.580000 sys 0.040000 (best of 4)
! wall 1.910252 comb 1.910000 user 1.880000 sys 0.030000 (best of 6)
$ hg perfrevlog -c -d 1
! wall 4.812885 comb 4.820000 user 4.800000 sys 0.020000 (best of 3)
! wall 4.699621 comb 4.710000 user 4.700000 sys 0.010000 (best of 3)
$ hg perfrevlog -m -d 1000
! wall 34.252800 comb 34.250000 user 33.730000 sys 0.520000 (best of 3)
! wall 24.094999 comb 24.090000 user 23.320000 sys 0.770000 (best of 3)
Only modest wins for the changelog. But manifest reading is
significantly faster. What's going on?
One reason might be data volume. zstd decompresses faster. So given
more bytes, it will put more distance between it and zlib.
Another reason is size. In the current design, zstd revlogs are
*larger*:
debugcreatestreamclonebundle (size in bytes)
zlib: 1,638,852,492
zstd: 1,680,601,332
I haven't investigated this fully, but I reckon a significant cause of
larger revlogs is that the zstd frame/header has more bytes than
zlib's. For very small inputs or data that doesn't compress well, we'll
tend to store more uncompressed chunks than with zlib (because the
compressed size isn't smaller than original). This will make revlog
reading faster because it is doing less decompression.
Moving on to bundle performance:
$ hg bundle -a -t none-v2 (total CPU time)
zlib: 102.79s
zstd: 97.75s
So, marginal CPU decrease for reading all chunks in all revlogs
(this is somewhat disappointing).
$ hg bundle -a -t <engine>-v2 (total CPU time)
zlib: 191.59s
zstd: 115.36s
This last test effectively measures the difference between zlib->zlib
and zstd->zstd for revlogs to bundle. This is a rough approximation of
what a server does during `hg clone`.
There are some promising results for zstd. But not enough for me to
feel comfortable advertising it to users. We'll get there...
Now that compression engines declare their header in revlog chunks
and can decompress revlog chunks, we refactor revlog.decompress()
to use them.
Making full use of the property that revlog compressor objects are
reusable, revlog instances now maintain a dict mapping an engine's
revlog header to a compressor object. This is not only a performance
optimization for engines where compressor object reuse can result in
better performance, but it also serves as a cache of header values
so we don't need to perform redundant lookups against the compression
engine manager. (Yes, I measured and the overhead of a function call
versus a dict lookup was observed.)
Replacing the previous inline lookup table with a dict lookup was
measured to make chunk reading ~2.5% slower on changelogs and ~4.5%
slower on manifests. So, the inline lookup table has been mostly
preserved so we don't lose performance. This is unfortunate. But
many decompression operations complete in microseconds, so Python
attribute lookup, dict lookup, and function calls do matter.
The impact of this change on mozilla-unified is as follows:
$ hg perfrevlogchunks -c
! chunk
! wall 1.953663 comb 1.950000 user 1.920000 sys 0.030000 (best of 6)
! wall 1.946000 comb 1.940000 user 1.910000 sys 0.030000 (best of 6)
! chunk batch
! wall 1.791075 comb 1.800000 user 1.760000 sys 0.040000 (best of 6)
! wall 1.785690 comb 1.770000 user 1.750000 sys 0.020000 (best of 6)
$ hg perfrevlogchunks -m
! chunk
! wall 2.587262 comb 2.580000 user 2.550000 sys 0.030000 (best of 4)
! wall 2.616330 comb 2.610000 user 2.560000 sys 0.050000 (best of 4)
! chunk batch
! wall 2.427092 comb 2.420000 user 2.400000 sys 0.020000 (best of 5)
! wall 2.462061 comb 2.460000 user 2.400000 sys 0.060000 (best of 4)
Changelog chunk reading is slightly faster but manifest reading is
slower. What gives?
On this repo, 99.85% of changelog entries are zlib compressed (the 'x'
header). On the manifest, 67.5% are zlib and 32.4% are '\0'. This patch
swapped the test order of 'x' and '\0' so now 'x' is tested first. This
makes changelogs faster since they almost always hit the first branch.
This makes a significant percentage of manifest '\0' chunks slower
because that code path now performs an extra test. Yes, I too can't
believe we're able to measure the impact of an if..elif with simple
string compares. I reckon this code would benefit from being written
in C...
There's no apparent reason to have this "entries" generator function that
builds a list and then yields its elements in reverse order and which is only
called to build the "entries" list. So just build the list directly, in
reverse order.
Adjust "parity" generator's offset to keep rendering the same.
readline() returns '' only when EOF is encountered, in which case, Python's
getpass() raises EOFError. We should do the same to abort the session as
"response expected."
This bug was reported to
https://bitbucket.org/tortoisehg/thg/issues/4659/
The result of diffstatdata should not depend on having noprefix set or not, as
was reported in issue 4755. Forcing noprefix to false on call makes sure the
parser receives the diff in the correct format and returns the proper result.
Another way to fix this would have been to change the regular expressions in
path.diffstatdata(), but that would have introduced many unecessary special
cases.
This config knob will control whether or not to show the similarity
calculation in the diff output:
diff --git a/README.md b/foo.md
similarity index 88%
rename from README.md
rename to foo.md
--- a/README.md
+++ b/foo.md
We have 4 revset functions that take integer arguments, and they handle
their arguments in slightly different ways. This patch unifies them:
- getstring() in place of getsymbol(), which is more consistent with the
handling of integer revisions (both 1 and '1' are valid)
- say "expects" instead of "requires" for type errors
We don't need to catch TypeError since getstring() must return a string.
The rev argument has the same meaning as startrev of follow(), and I think
startrev is more informative.
followlines() is new function, we can make BC now.
Previously, compression engines had APIs for performing revlog
compression but no mechanism to perform revlog decompression. This
patch changes that.
Revlog decompression is slightly more complicated than compression
because in the compression case there is (currently) only a single
engine that can be used at a time. However for decompression, a
revlog could contain chunks from multiple compression engines. This
means decompression needs to map to multiple engines and
decompressors. This functionality is outside the scope of this patch.
But it drives the decision for engines to declare a byte header
sequence that identifies revlog data as belonging to an engine and
an API for obtaining an engine from a revlog header.
I really want to have an option of toggling a selection on a line and also
moving cursor down as a single keystroke. It also kinda makes sense for space
key to do this, because some other curses UIs in the wild do this (e.g. various
file managers, htop). So I got an idea to make a config option that defaults to
False for compatibility, but allows making crecord UI a lot more useful for
people with big hunks.
We add this an experimental option to experiment with this behavior.
This commit swaps in the just-added revlog compressor API into
the revlog class.
Instead of implementing zlib compression inline in compress(), we
now store a cached-on-first-use revlog compressor on each revlog
instance and invoke its "compress()" method.
As part of this, revlog.compress() has been refactored a bit to use
a cleaner code flow and modern formatting (e.g. avoiding
parenthesis around returned tuples).
On a mozilla-unified repo, here are the "compress" times for a few
commands:
$ hg perfrevlogchunks -c
! wall 5.772450 comb 5.780000 user 5.780000 sys 0.000000 (best of 3)
! wall 5.795158 comb 5.790000 user 5.790000 sys 0.000000 (best of 3)
$ hg perfrevlogchunks -m
! wall 9.975789 comb 9.970000 user 9.970000 sys 0.000000 (best of 3)
! wall 10.019505 comb 10.010000 user 10.010000 sys 0.000000 (best of 3)
Compression times did seem to slow down just a little. There are
360,210 changelog revisions and 359,342 manifest revisions. For the
changelog, mean time to compress a revision increased from ~16.025us to
~16.088us. That's basically a function call or an attribute lookup. I
suppose this is the price you pay for abstraction. It's so low that
I'm not concerned.
As part of "zstd all of the things," we need to teach revlogs to
use non-zlib compression formats. Because we're routing all compression
via the "compression manager" and "compression engine" APIs, we need to
introduction functionality there for performing revlog operations.
Ideally, revlog compression and decompression operations would be
implemented in terms of simple "compress" and "decompress" primitives.
However, there are a few considerations that make us want to have a
specialized primitive for handling revlogs:
1) Performance. Revlogs tend to do compression and especially
decompression operations in batches. Any overhead for e.g.
instantiating a "context" for performing an operation can be
noticed. For this reason, our "revlog compressor" primitive is
reusable. For zstd, we reuse the same compression "context" for
multiple operations. I've measured this to have a performance
impact versus constructing new contexts for each operation.
2) Specialization. By having a primitive dedicated to revlog use,
we can make revlog-specific choices and leave the door open for
more functionality in the future. For example, the zstd revlog
compressor may one day make use of dictionary compression.
A future patch will introduce a decompress() on the compressor
object.
The code for the zlib compressor is basically copied from
revlog.compress(). Although it doesn't handle the empty input
case, the null first byte case, and the 'u' prefix case. These
cases will continue to be handled in revlog.py once that code is
ported to use this API.
Upcoming patches will convert revlogs to use the compression engine
APIs to perform all things compression. The yet-to-be-introduced
APIs support a persistent "compressor" object so the same object
can be reused for multiple compression operations, leading to
better performance. In addition, compression engines like zstd
may wish to tweak compression engine state based on the revlog
(e.g. per-revlog compression dictionaries).
A global and shared decompress() function will shortly no longer
make much sense. So, we move decompress() to be a method of the
revlog class. It joins compress() there.
On the mozilla-unified repo, we can measure the impact of this change
on reading performance:
$ hg perfrevlogchunks -c
! chunk
! wall 1.932573 comb 1.930000 user 1.900000 sys 0.030000 (best of 6)
! wall 1.955183 comb 1.960000 user 1.930000 sys 0.030000 (best of 6)
! chunk batch
! wall 1.787879 comb 1.780000 user 1.770000 sys 0.010000 (best of 6
! wall 1.774444 comb 1.770000 user 1.750000 sys 0.020000 (best of 6)
"chunk" appeared to become slower but "chunk batch" got faster. Upon
further examination by running both sets multiple times, the numbers
appear to converge across all runs. This tells me that there is no
perceived performance impact to this refactor.
revlog.compress() compares the compressed size to the input size
and throws away the compressed data if it is larger than the input.
This is the correct thing to do, as storing compressed data that
is larger than the input takes up more storage space and makes reading
slower.
However, the comparison was implemented inconsistently. For the
streaming compression mode, we threw away the result if it was
greater than or equal to the input size. But for the one-shot
compression, we threw away the compression only if it was greater
than the input size!
This patch changes the comparison for the simple case so it is
consistent with the streaming case.
As a few tests demonstrate, this adds 1 byte to some revlog entries.
This is because of an added 'u' header on the chunk. It seems
somewhat wrong to increase the revlog size here. However, IMO the cost
of 1 byte in storage is insignificant compared to the performance gains
of avoiding decompression. This patch should invite questions around
the heuristic for throwing away compressed data. For example, I'd argue
we should be more liberal about rejecting compressed data, additionally
doing so where the number of bytes saved fails to reach a threshold.
But we can have this discussion another time.
This helps highlighting in third-party diff coloring (which assumes git
output) and maintains pedantic correctness with diff --git.
Tests will be added at the end of the series.
This config knob can take an integer between 0 and 40 or a
keyword ('none', 'short', 'full') to control the length of hash to
output. It will display diffs with the git index header as such,
diff --git a/mercurial/mdiff.py b/mercurial/mdiff.py
index 112edf7..d6b52c5 100644
We'll put this in the experimental section for now.
We have specific syntax for displaying the help text for a particular
revset predicate, so let's refer directly to the bisect() revset in
the verbose bisect help. It seems likely that the user doesn't care
about other revsets at that point, so they will probably not miss the
text about the other revset predicates.
There's no reason to duplicate this so many times, and it's likely an instance
will be missed if support for a new pattern is added and documented. The
stringmatcher is mostly used by revsets, though it is also used for the 'tag'
related templates, and namespace filtering in the journal extension. So maybe
there's a better place to document it. `hg help patterns` seems inappropriate,
because that is all file pattern matching.
While here, indicate how to perform case insensitive regex searches.
It was probably unintentional for regex, as the meaning of some sequences like
\S and \s is actually inverted by changing the case. For backward compatibility
however, the matching is forced to case insensitive.
Since we did a directory rename on the stores, the source
repository's lock path now references the dest repository's
lock path and the dest repository's lock path now references
a non-existent filename.
So releasing the lock on the source will unlock the dest and
releasing the lock on the dest will no-op because it fails due
to file not found. So we clean up the dest's lock manually.
The store contains more than just revlogs. This patch teaches the
upgrade code to copy regular files as well.
As the test changes demonstrate, the phaseroots file is now copied.
Our next step for in-place upgrade is to migrate store data. Revlogs
are the biggest source of data within the store and a store is useless
without them, so we implement their migration first.
Our strategy for migrating revlogs is to walk the store and call
`revlog.clone()` on each revlog. There are some minor complications.
Because revlogs have different storage options (e.g. changelog has
generaldelta and delta chains disabled), we need to obtain the
correct class of revlog so inserted data is encoded properly for its
type.
Various attempts at implementing progress indicators that didn't lead
to frustration from false "it's almost done" indicators were made.
I initially used a single progress bar based on number of revlogs.
However, this quickly churned through all filelogs, got to 99% then
effectively froze at 99.99% when it got to the manifest.
So I converted the progress bar to total revision count. This was a
little bit better. But the manifest was still significantly slower
than filelogs and it took forever to process the last few percent.
I then tried both revision/chunk bytes and raw bytes as the
denominator. This had the opposite effect: because so much data is in
manifests, it would churn through filelogs without showing much
progress. When it got to manifests, it would fill in 90+% of the
progress bar.
I finally gave up having a unified progress bar and instead implemented
3 progress bars: 1 for filelog revisions, 1 for manifest revisions, and
1 for changelog revisions. I added extra messages indicating the total
number of revisions of each so users know there are more progress bars
coming.
I also added extra messages before and after each stage to give extra
details about what is happening. Strictly speaking, this isn't
necessary. But the numbers are impressive. For example, when converting
a non-generaldelta mozilla-central repository, the messages you see are:
migrating 2475593 total revisions (1833043 in filelogs, 321156 in manifests, 321394 in changelog)
migrating 1.67 GB in store; 2508 GB tracked data
migrating 267868 filelogs containing 1833043 revisions (1.09 GB in store; 57.3 GB tracked data)
finished migrating 1833043 filelog revisions across 267868 filelogs; change in size: -415776 bytes
migrating 1 manifests containing 321156 revisions (518 MB in store; 2451 GB tracked data)
That "2508 GB" figure really blew me away. I had no clue that the raw
tracked data in mozilla-central was that large. Granted, 2451 GB is in
the manifest and "only" 57.3 GB is in filelogs. But still.
It's worth noting that gratuitous loading of source revlogs in order
to display numbers and progress bars does serve a purpose: it ensures
we can open all source revlogs. We don't want to spend several minutes
copying revlogs only to encounter a permissions error or similar later.
As part of this commit, we also add swapping of the store directory
to the upgrade function. After revlogs are converted, we move the
old store into the backup directory then move the temporary repo's
store into the old store's location. On well-behaved systems, this
should be 2 atomic operations and the window of inconsistency show be
very narrow.
There are still a few improvements to be made to store copying and
upgrading. But this commit gets the bulk of the work out of the way.
Upcoming patches will introduce functionality for in-place
repository/store "upgrades." Copying the contents of a revlog
feels sufficiently low-level to warrant being in the revlog
class. So this commit implements that functionality.
Because full delta recomputation can be *very* expensive (we're
talking several hours on the Firefox repository), we support
multiple modes of execution with regards to delta (re)use. This
will allow repository upgrades to choose the "level" of
processing/optimization they wish to perform when converting
revlogs.
It's not obvious from this commit, but "addrevisioncb" will be
used for progress reporting.
Now that all the upgrade planning work is in place, we can start
doing the real work: actually upgrading a repository.
The main goal of this commit is to get the "framework" for running
in-place upgrade actions in place.
Rather than get too clever and low-level with regards to in-place
upgrades, our strategy is to create a new, temporary repository,
copy data to it, then replace the old data with the new. This allows
us to reuse a lot of code in localrepo.py around store interaction,
which will eventually consume the bulk of the upgrade code.
But we have to start small. This patch implements adding new
repository requirements. But it still sets up a temporary
repository and locks it and the source repo before performing the
requirements file swap. This means all the plumbing is in place
to implement store copying in subsequent commits.
This commit introduces code for determining what actions/improvements
an upgrade should perform.
The "upgradefindimprovements" function introduces a mechanism to
return a list of improvements that can be made to a repository.
Each improvement is effectively an action that an upgrade will
perform. Associated with each of these improvements is metadata
that will be used to inform users what's wrong and what an
upgrade will do.
Each "improvement" is categorized as a "deficiency" or an
"optimization." TBH, I'm not thrilled about the terminology and
am receptive to constructive bikeshedding. The main difference
between a "deficiency" and an "optimization" is a deficiency
is always corrected (if it deviates from the current config) and
an "optimization" is an optional action that goes above and beyond
to improve the state of the repository (usually by requiring more
CPU during upgrade).
Our initial set of improvements identifies missing repository
requirements, a single, easily correctable problem with
changelog storage, and a set of "optimizations" related to delta
recalculation.
The main "upgraderepo" function has been expanded to handle
improvements. It queries for the list of improvements and determines
which of them will run based on the current repository state and user
I went through numerous iterations of the output format before
settling on a ReST-inspired definition list format. (I used
bulleted lists in the first submission of this commit and could
not get it to format just right.) Even with the various iterations,
I'm still not super thrilled with the format. But, this is a debug*
command, so that should mean we can refine the output without BC
concerns.
This commit introduces functionality for upgrading a repository in
place. The first part that's implemented is testing for upgrade
"compatibility." This is done by examining repository requirements.
There are 5 functions returning sets of requirements that control
upgrading. Why so many functions? Mainly to support extensions.
Functions are easier to monkeypatch than module variables.
Astute readers will see that we don't support "manifestv2" and
"treemanifest" requirements in the upgrade mechanism. I don't have
a great answer for why other than this is a complex set of patches
and I don't want to deal with the complexity of these experimental
features just yet. We can teach the upgrade mechanism about them
later, once the basic upgrade mechanism is in place.
This commit also introduces the "upgraderepo" function. This will be
our main routine for performing an in-place upgrade. Currently, it
just implements requirements checking. The structure of some code in
this function may look a bit weird (e.g. the inline function that is
only called once). But this will make sense after future commits.
Currently, if Mercurial introduces a new repository/store feature or
changes behavior of an existing feature, users must perform an
`hg clone` to create a new repository with hopefully the
correct/optimal settings. Unfortunately, even `hg clone` may not
give the correct results. For example, if you do a local `hg clone`,
you may get hardlinks to revlog files that inherit the old state.
If you `hg clone` from a remote or `hg clone --pull`, changegroup
application may bypass some optimization, such as converting to
generaldelta.
Optimizing a repository is harder than it seems and requires more
than a simple `hg` command invocation.
This commit starts the process of changing that. We introduce
`hg debugupgraderepo`, a command that performs an in-place upgrade
of a repository to use new, optimal features. The command is just
a stub right now. Features will be added in subsequent commits.
This commit does foreshadow some of the behavior of the new command,
notably that it doesn't do anything by default and that it takes
arguments that influence what actions it performs. These will be
explained more in subsequent commits.
The 'author' and 'desc' revsets are documented to be case insensitive.
Unfortunately, this was implemented in 'author' by forcing the input to
lowercase, including for regex like '\B'. (This actually inverts the meaning of
the sequence.) For backward compatibility, we will keep that a case insensitive
regex, but by using matcher options instead of brute force.
This doesn't preclude future hypothetical 'icase-literal:' style prefixes that
can be provided by the user. Such user specified cases can probably be handled
up front by stripping 'icase-', setting the variable, and letting it drop
through the existing code.
Selecting single and multiple revisions is closely related, so let's
put it in one place, so users can easily find it. We actually did not
even point to "hg help revsets" from "hg help revisions", but now that
they're on a single page, that won't be necessary.
Content-Security-Policy (CSP) is a web security feature that allows
servers to declare what loaded content is allowed to do. For example,
a policy can prevent loading of images, JavaScript, CSS, etc unless
the source of that content is whitelisted (by hostname, URI scheme,
hashes of content, etc). It's a nifty security feature that provides
extra mitigation against some attacks, notably XSS.
Mitigation against these attacks is important for Mercurial because
hgweb renders repository data, which is commonly untrusted. While we
make attempts to escape things, etc, there's the possibility that
malicious data could be injected into the site content. If this happens
today, the full power of the web browser is available to that
malicious content. A restrictive CSP policy (defined by the server
operator and sent in an HTTP header which is outside the control of
malicious content), could restrict browser capabilities and mitigate
security problems posed by malicious data.
CSP works by emitting an HTTP header declaring the policy that browsers
should apply. Ideally, this header would be emitted by a layer above
Mercurial (likely the HTTP server doing the WSGI "proxying"). This
works for some CSP policies, but not all.
For example, policies to allow inline JavaScript may require setting
a "nonce" attribute on <script>. This attribute value must be unique
and non-guessable. And, the value must be present in the HTTP header
and the HTML body. This means that coordinating the value between
Mercurial and another HTTP server could be difficult: it is much
easier to generate and emit the nonce in a central location.
This commit introduces support for emitting a
Content-Security-Policy header from hgweb. A config option defines
the header value. If present, the header is emitted. A special
"%nonce%" syntax in the value triggers generation of a nonce and
inclusion in <script> elements in templates. The inclusion of a
nonce does not occur unless "%nonce%" is present. This makes this
commit completely backwards compatible and the feature opt-in.
The nonce is a type 4 UUID, which is the flavor that is randomly
generated. It has 122 random bits, which should be plenty to satisfy
the guarantees of a nonce.
All the hgweb templates include mercurial.js in their header. All
the hgweb templates have the same <script> boilerplate to run
process_dates(). This patch factors that function call into
mercurial.js as part of a DOMContentLoaded event listener.
With this commit, the HTTP transport now parses the X-HgProto-<N>
header to determine what media type and compression engine to use for
responses. So far, we only compress responses that are already being
compressed with zlib today (stream response types to specific
commands). We can expand things to cover additional response types
later.
The practical side-effect of this commit is that non-zlib compression
engines will be used if both ends support them. This means if both
ends have zstd support, zstd - not zlib - will be used to compress
data!
When cloning the mozilla-unified repository between a local HTTP
server and client, the benefits of non-zlib compression are quite
noticeable:
engine server CPU (s) client CPU (s) bundle size
zlib (l=6) 174.1 283.2 1,148,547,026
zstd (l=1) 99.2 267.3 1,127,513,841
zstd (l=3) 103.1 266.9 1,018,861,363
zstd (l=7) 128.3 269.7 919,190,278
zstd (l=10) 162.0 - 894,547,179
none 95.3 277.2 4,097,566,064
The default zstd compression level is 3. So if you deploy zstd
capable Mercurial to your clients and servers and CPU time on
your server is dominated by "getbundle" requests (clients cloning
and pulling) - and my experience at Mozilla tells me this is often
the case - this commit could drastically reduce your server-side
CPU usage *and* save on bandwidth costs!
Another benefit of this change is that server operators can install
*any* compression engine. While it isn't enabled by default, the
"none" compression engine can now be used to disable wire protocol
compression completely. Previously, commands like "getbundle" always
zlib compressed output, adding considerable overhead to generating
responses. If you are on a high speed network and your server is under
high load, it might be advantageous to trade bandwidth for CPU.
Although, zstd at level 1 doesn't use that much CPU, so I'm not
convinced that disabling compression wholesale is worthwhile. And, my
data seems to indicate a slow down on the client without compression.
I suspect this is due to a lack of buffering resulting in an increase
in socket read() calls and/or the fact we're transferring an extra 3 GB
of data (parsing HTTP chunked transfer and processing extra TCP packets
can add up). This is definitely worth investigating and optimizing. But
since the "none" compressor isn't enabled by default, I'm inclined to
punt on this issue.
This commit introduces tons of tests. Some of these should arguably
have been implemented on previous commits. But it was difficult to
test without the server functionality in place.
Now that servers expose a capability indicating they support
application/mercurial-0.2 and compression, clients can key off
this to say they support responses that are compressed with
various compression formats.
After this commit, the HTTP wire protocol client now sends an
"X-HgProto-<N>" request header indicating its support for
"application/mercurial-0.2" media type and various compression
formats.
This commit also implements support for handling
"application/mercurial-0.2" responses. It simply reads the header
compression engine identifier then routes the remainder of the
response to the appropriate decompressor.
There were some test changes, but only to logging. That points to
an obvious gap in our test coverage. This will be addressed in a
subsequent commit once server support is in place (it is hard to
test without server support).
This commit introduces support for advertising a server's support for
media types and compression formats in accordance with the spec defined
in internals.wireproto.
The bulk of the new code is a helper function in wireproto.py to
obtain a prioritized list of compression engines available to the
wire protocol. While not utilized yet, we implement support
for obtaining the list of compression engines advertised by the
client.
The upcoming HTTP protocol enhancements are a bit lower-level than
existing tests (most existing tests are command centric). So,
this commit establishes a new test file that will be appropriate
for holding tests around the functionality of the HTTP protocol
itself.
Rounding out this change, `hg debuginstall` now prints compression
engines available to the server.
This patch implements a new compression engine API allowing
compression engines to declare support for the wire protocol.
Support is declared by returning a compression format string
identifier that will be added to payloads to signal the compression
type of data that follows and default integer priorities of the
engine.
Accessor methods have been added to the compression engine manager
class to facilitate use.
Note that the "none" and "bz2" engines declare wire protocol support
but aren't enabled by default due to their priorities being 0. It
is essentially free from a coding perspective to support these
compression formats, so we do it in case anyone may derive use from
it.
As part of adding zstd support to all of the things, we'll need
to teach the wire protocol to support non-zlib compression formats.
This commit documents how we'll implement that.
To understand how we arrived at this proposal, let's look at how
things are done today.
The wire protocol today doesn't have a unified format. Instead,
there is a limited facility for differentiating replies as successful
or not. And, each command essentially defines its own response format.
A significant deficiency in the current protocol is the lack of
payload framing over the SSH transport. In the HTTP transport,
chunked transfer is used and the end of an HTTP response body (and
the end of a Mercurial command response) can be identified by a 0
length chunk. This is how HTTP chunked transfer works. But in the
SSH transport, there is no such framing, at least for certain
responses (notably the response to "getbundle" requests). Clients
can't simply read until end of stream because the socket is
persistent and reused for multiple requests. Clients need to know
when they've encountered the end of a request but there is nothing
simple for them to key off of to detect this. So what happens is
the client must decode the payload (as opposed to being dumb and
forwarding frames/packets). This means the payload itself needs
to support identifying end of stream. In some cases (bundle2), it
also means the payload can encode "error" or "interrupt" events
telling the client to e.g. abort processing. The lack of framing
on the SSH transport and the transfer of its responsibilities to
e.g. bundle2 is a massive layering violation and a wart on the
protocol architecture. It needs to be fixed someday by inventing a
proper framing protocol.
So about compression.
The client transport abstractions have a "_callcompressable()"
API. This API is called to invoke a remote command that will
send a compressible response. The response is essentially a
"streaming" response (no framing data at the Mercurial layer)
that is fed into a decompressor.
On the HTTP transport, the decompressor is zlib and only zlib.
There is currently no mechanism for the client to specify an
alternate compression format. And, clients don't advertise what
compression formats they support or ask the server to send a
specific compression format. Instead, it is assumed that non-error
responses to "compressible" commands are zlib compressed.
On the SSH transport, there is no compression at the Mercurial
protocol layer. Instead, compression must be handled by SSH
itself (e.g. `ssh -C`) or within the payload data (e.g. bundle
compression).
For the HTTP transport, adding new compression formats is pretty
straightforward. Once you know what decompressor to use, you can
stream data into the decompressor until you reach a 0 size HTTP
chunk, at which point you are at end of stream.
So our wire protocol changes for the HTTP transport are pretty
straightforward: the client and server advertise what compression
formats they support and an appropriate compression format is
chosen. We introduce a new HTTP media type to hold compressed
payloads. The header of the payload defines the compression format
being used. Whoever is on the receiving end can sniff the first few
bytes route to an appropriate decompressor.
Support for multiple compression formats is advertised on both
server and client. The server advertises a "compression" capability
saying which compression formats it supports and in what order they
are preferred. Clients advertise their support for multiple
compression formats and media types via the introduced "X-HgProto"
request header.
Strictly speaking, servers don't need to advertise which compression
formats they support. But doing so allows clients to fail fast if
they don't support any of the formats the server does. This is useful
in situations like sending bundles, where the client may have to
perform expensive computation before sending data to the server.
Rather than simply advertise a list of supported compression formats,
we introduce an additional "httpmediatype" server capability
advertising which media types the server supports. This means servers
are explicit about what formats they exchange. IMO, this is superior
to inferring support from other capabilities (like "compression").
By advertising compression support on each request in the "X-HgProto"
header and media type and direction at the server level, we are able
to gradually transition existing commands/responses to the new media
type and possibly compression. Contrast with the old world, where we
only supported a single media type and the use of compression was
built-in to the semantics of the command on both client and server.
In the new world, if "application/mercurial-0.2" is supported,
compression is supported. It's that simple.
It's worth noting that we explicitly don't use "Accept,"
"Accept-Encoding," "Content-Encoding," or "Transfer-Encoding" for
content negotiation and compression. People knowledgeable of the HTTP
specifications will say that we should use these because that's
what they are designed to be used for. They have a point and I
sympathize with the argument. Earlier versions of this commit even
defined supported media types in the "Accept" header. However, my
years of experience rolling out services leveraging HTTP has taught
me to not trust the HTTP layer, especially if you are going outside
the normal spec (such as using a custom "Content-Encoding" value to
represent zstd streams). I've seen load balancers, proxies, and other
network devices do very bad and unexpected things to HTTP messages
(like insisting zlib compressed content is decoded and then re-encoded
at a different compression level or even stripping compression
completely). I've found that the best way to avoid surprises when
writing protocols on top of HTTP is to use HTTP as a dumb transport as
much as possible to minimize the chances that an "intelligent" agent
between endpoints will muck with your data. While the widespread use of
TLS is mitigating many intermediate network agents interfering with
HTTP, there are still problems at the edges, with e.g. the origin HTTP
server needing to convert HTTP to and from WSGI and buggy or
feature-lacking HTTP client implementations. I've found the best way to
avoid these problems is to avoid using headers like "Content-Encoding"
and to bake as much logic as possible into media types and HTTP message
bodies. The protocol changes in this commit do rely on a custom HTTP
request header and the "Content-Type" headers. But we used them before,
so we shouldn't be increasing our exposure to "bad" HTTP agents.
For the SSH transport, we can't easily implement content negotiation
to determine compression formats because the SSH transport has no
content negotiation capabilities today. And without a framing protocol,
we don't know how much data to feed into a decompressor. So in order
to implement compression support on the SSH transport, we'd need to
invent a mechanism to represent content types and an outer framing
protocol to stream data robustly. While I'm fully capable of doing
that, it is a lot of work and not something that should be undertaken
lightly. My opinion is that if we're going to change the SSH transport
protocol, we should take a long hard look at implementing a grand
unified protocol that attempts to address all the deficiencies with
the existing protocol. While I want this to happen, that would be
massive scope bloat standing in the way of zstd support. So, I've
decided to take the easy solution: the SSH transport will not gain
support for multiple compression formats. Keep in mind it doesn't
support *any* compression today. So essentially nothing is changing
on the SSH front.
Currently, bundle compression uses the default compression level
for the active compression engine. The default compression level
is tuned as a compromise between speed and size.
Some scenarios may call for a different compression level. For
example, with clone bundles, bundles are generated once and used
several times. Since the cost to generate is paid infrequently,
server operators may wish to trade extra CPU time for better
compression ratios.
This patch introduces an experimental and undocumented config
option to control the bundle compression level. As the inline
comment says, this approach is a bit hacky. I'd prefer for
the compression level to be encoded in the bundle spec. e.g.
"zstd-v2;complevel=15." However, given that the 4.1 freeze is
imminent, I'm not comfortable implementing this user-facing
change without much time to test and consider the implications.
So, we're going with the quick and dirty solution for now.
Having this option in the 4.1 release will enable Mozilla to
easily produce and test zlib and zstd bundles with non-default
compression levels in production. This will help drive future
development of the feature and zstd integration with Mercurial.
Compression engines allow options to be passed to them to control
behavior. This patch exposes an argument to bundle2.writebundle()
that passes options to the compression engine when writing compressed
bundles. The argument is honored for both bundle1 and bundle2, the
latter requiring a bit of plumbing to pass the value around.
Detailed hint message is now provided when 'pull --rebase' operation detects
unclean working dir, for example:
abort: uncommitted changes
(cannot pull with rebase: please commit or shelve your changes first)
Added tests for uncommitted merge, and for subrepo support verifying that same
hint is also passed to subrepo state check.
Moving archivespecs to the module level allows using it from other modules
(such as hgwebdir_mod), and keeping a reference to it in requestcontext allows
current code to just work.
Thus we allow dict-like indexing and "in" checks, and also preserve the order
of archive types and can generate links in a certain order (so
requestcontext.archives is no longer needed).
Add the ability for revlog objects to process revision flags and apply
registered transforms on read/write operations.
This patch introduces:
- the 'revlog._processflags()' method that looks at revision flags and applies
flag processors registered on them. Due to the need to handle non-commutative
operations, flag transforms are applied in stable order but the order in which
the transforms are applied is reversed between read and write operations.
- the 'addflagprocessor()' method allowing to register processors on flags.
Flag processors are defined as a 3-tuple of (read, write, raw) functions to be
applied depending on the operation being performed.
- an update on 'revlog.addrevision()' behavior. The current flagprocessor design
relies on extensions to wrap around 'addrevision()' to set flags on revision
data, and on the flagprocessor to perform the actual transformation of its
contents. In the lfs case, this means we need to process flags before we meet
the 2GB size check, leading to performing some operations before it happens:
- if flags are set on the revision data, we assume some extensions might be
modifying the contents using the flag processor next, and we compute the
node for the original revision data (still allowing extension to override
the node by wrapping around 'addrevision()').
- we then invoke the flag processor to apply registered transforms (in lfs's
case, drastically reducing the size of large blobs).
- finally, we proceed with the 2GB size check.
Note: In the case a cachedelta is passed to 'addrevision()' and we detect the
flag processor modified the revision data, we chose to trust the flag processor
and drop the cachedelta.
Adding the ability to passing flags to addrevision instead of simply passing
default flags to _addrevision will allow extensions relying on flag transforms
to wrap around addrevision() in order to update revlog flags.
The first use case of this patch will be the lfs extension marking nodes as
stored externally when the contents are larger than the defined threshold.
One of the reasons leading to setting flags in addrevision() wrappers in the
flag processor design is that it allows to detect files larger than the 2GB
limit before the check is performed, which allows lfs to transform the contents
into metadata.
This patch introduces a new 'raw' argument (defaults to False) to revlog's
revision() and _addrevision() methods.
When the 'raw' argument is set to True, it indicates the revision data should be
handled as raw data by the flagprocessor.
Note: Given revlog.addgroup() calls are restricted to changegroup generation, we
can always set raw to True when calling revlog._addrevision() from
revlog.addgroup().
We have enough bits to switch to the new chg pager code path in runcommand.
So just remove the legacy getpager support.
This is a red-only patch, and will break chg's pager support temporarily.
This patch implements chgui._runpager in a relatively simple way. A more
clean way is to move the core logic of "attachio" to "ui", which will be
done later after chg runs uisetup per request.
This patch adds the "pager" support for the S channel. The pager API allows
running some subcommands, namely attachio, and waiting for the client to be
properly synchronized.
It would be nice for archive links to always be in a certain commonly used
order, such as 'zip', 'bz', 'gzip2'. Repo index page (hgwebdir_mod) already
shows archive links in this order, let's do the same in hgweb_mod.
Sadly, archivespecs is a regular unordered dict, and collections.OrderedDict is
new in 2.7. But requestcontext.archives is a tuple of archive types, so it can
be used as an index to archivespecs.
Having sections for specific operator types assumes the user already knows what
type of operators are supported. By having a common heading, the user can
simply lookup help for "(revsets|filesets|templates).operators".
This has the nice property of visually breaking up the wall of text. It also
allows specific smaller sections to be called out. For example,
`hg help filesets.predicates` now prints just the predicate section. At the
moment, the revset headings are a superset of the fileset headings, so there is
consistency in how example, predicate and operator help is called out.
The reference to `hg help patterns` was moved to the overview section, so that
it isn't stuck in the examples section.
Previously S channel is only used to send system commands. It will also be
used to send pager commands. So add a type parameter.
This breaks older chg clients. But chg and hg should always come from a
single commit and be packed into a single package. Supporting running
inconsistent versions of chg and hg seems to be unnecessarily complicated
with little benefit. So just make the change and assume people won't use
inconsistent chg with hg.
This revset returns the history of a range of lines (fromline, toline) of a
file starting from `rev` or the current working directory.
Added tests in test-annotate.t which already contains a reasonably complex
repository.
This yields ancestors of `fctx` by only keeping changesets touching the file
within specified linerange = (fromline, toline).
Matching revisions are found by inspecting the result of `mdiff.allblocks()`,
filtered by `mdiff.blocksinrange()`, to find out if there are blocks of type
"!" within specified line range.
If, at some iteration, an ancestor with an empty line range is encountered,
the algorithm stops as it means that the considered block of lines actually
has been introduced in the revision of this iteration. Otherwise, we finally
yield the initial revision of the file as the block originates from it.
When a merge changeset is encountered during ancestors lookup, we consider
there's a diff in the current line range as long as there is a diff between
the merge changeset and at least one of its parents (in the current line
range).
The function filters diff blocks as generated by mdiff.allblock function based
on whether they are contained in a given line range based on the "b-side" of
blocks.
Re-use the cmdutil._changesetlabels function introduced in c400c86d547f to
have consistent labels between the "changeset: " line in log command and the
"parent: " line in summary.
The "troubles" template keyword returns a list of evolution troubles.
It is EXPERIMENTAL, as anything else related to changeset evolution.
Test it in test-obsolete.t which has troubled changesets.
Every other template has the "raw" link load "raw-file." However,
fileannotate.tmpl's "raw" link loads "raw-annotate." This feels
inconsistent and wrong.
As far as I can tell, linking to the "raw annotate" view has occurred
since 2006.
repair.strip() expects a set of root revisions to strip. It then
builds the full set of descedants by walking the descandants of
each. It is rare that more than a few roots get passed in, but if that
happens, it will wastefully walk the changelog for each root. So let's
just walk it once.
I noticed this because the narrowhg extension was passing not only
roots, but all the commits to strip. When there were tens of thousands
of commits to strip, this resulted in quadratic behavior with that
extension.
Similar to git, we add a special string:
HG: ------------------------ >8 ------------------------
that means anything below it is ignored in a commit message.
This is helpful for integrating with third-party tools that display the
The bootstrapping issue was addressed at the parsing phase and we expect
that fullreposet.__and__() fully complies to the smartset API, in which
'self & other' should return a result set in self's order. See also
ab938e7ae803.
Let's resurrect the docstring since our help module can detect the EXPERIMENTAL
tag and display it only if -v is specified.
This patch updates the test added by bbdfa2d5aaa2 since wdir() is now
documented.
Unlike p1 = null, p2 = null denotes the revision has only one parent, which
shouldn't be considered a child of the null revision. This was spotted while
fixing the issue4682 and rediscovered as issue5439.
select() is a notable example of syscalls which may fail with EINTR. If we
had a SIGWINCH handler installed, ssh would crash when the terminal window
was resized. This patch fixes the problem.
This is the behavior of the default __import__() function, which doesn't
validate the existence of the fromlist items. Later on, the missing attribute
is detected while processing the import statement.
https://hg.python.org/cpython/file/v2.7.13/Python/import.c#l2575
The comtypes library relies on this (maybe) undocumented behavior, and we
got a bug report to TortoiseHg, sigh.
https://bitbucket.org/tortoisehg/thg/issues/4647/
The test added at 0be19b069edf verifies the behavior of the import statement,
so this patch only adds the test of __import__() function and works around
CPython/PyPy difference.
According to POSIX specification, just having group write access to a
file causes EPERM at invocation of os.utime() with an explicit time
information (e.g. working on the repository shared by group access
permission).
To ignore EPERM at closing file object in such case, this patch makes
checkambigatclosing._checkambig() use filestat.avoidambig() introduced
by previous patch.
Some functions below imply this code path at truncation of an existing
(= might be owned by another user) file.
- strip() in repair.py, introduced by 4d0a08431b6f
- _playback() in transaction.py, introduced by 48fe04792102
This is a variant of issue5418.
According to POSIX specification, just having group write access to a
file causes EPERM at invocation of os.utime() with an explicit time
information (e.g. working on the repository shared by group access
permission).
To ignore EPERM at renaming in such case, this patch makes
vfs.rename() use filestat.avoidambig() introduced by previous patch.
Now, advancing stat.st_mtime by os.utime() is used to avoid file stat
ambiguity. But according to POSIX specification, utime(2) with an
explicit time information is permitted only for a process with:
- the effective user ID equal to the user ID of the file, or
- appropriate privileges
http://pubs.opengroup.org/onlinepubs/9699919799/functions/utime.html
Therefore, just having group write access to a file causes EPERM at
applying os.utime() on it (e.g. working on the repository shared by
group access permission).
This patch adds class filestat utility function avoidamgig() to avoid
file stat ambiguity but skip it if EPERM.
It is reasonable to always ignore EPERM, because utime(2) causes EPERM
only in the case described above (EACCES is used only for utime(2)
with NULL).
43e3fb1c484e introduced a call to fctx.parents() for each line in
annotate output. This function call isn't cheap, as it requires
linkrev adjustment.
Since multiple lines in annotate output tend to belong to the same
file revision, a cache of fctx.parents() lookups for each input
should be effective in the common case. So we implement one.
Since the cache has to precompute parents so an aborted generator
doesn't leave an incomplete cache, we could just return a list.
However, we preserve the generator for backwards compatibility.
The effect of this change when requesting /annotate/96ca0ecdcfa/
browser/locales/en-US/chrome/browser/downloads/downloads.dtd on
the mozilla-aurora repo is significant:
p1(43e3fb1c484e) 5.5s
43e3fb1c484e: 66.3s
this patch: 10.8s
We're still slower than before. But only by ~2x instead of ~12x.
On the tip revisions of layout/base/nsCSSFrameConstructor.cpp file in
the mozilla-unified repo, time went from 12.5s to 14.5s and back to
12.5s. I'm not sure why the mozilla-aurora repo is so slow.
Looking at the code of basefilectx.parents(), there is room for
further improvements. Notably, we still perform redundant calls to
filelog.renamed() and basefilectx._parentfilectx(). And
basefilectx.annotate() also makes similar calls, so there is potential
for object reuse. However, introducing caches here are not appropriate
for the stable branch.
Currently the warning is ambiguous about whether the new tag (possibly specified
via --rev) is being added on a branch head or whether the working directory is
based on a branch head. Clarify the error message to eliminate this ambiguity.
Some errors could in some cases show unfortunate scary and confusing warnings
from the httppeer delstructors:
abort: nodename nor servname provided, or not known
Exception AttributeError: "'httpspeer' object has no attribute 'urlopener'" in <bound method httpspeer.__del__ of <mercurial.httppeer.httpspeer object at 0x106e1f5d0>> ignored```
To mute that, take 8bdb0bb8e209 to the next level and use getattr in __del__.
cl._partialmatch() can be pretty slow if hidden revisions are involved. This
patch cancels the slowdown introduced by the previous patch by using an
unfiltered changelog, which means shortest(node) isn't always the shortest.
The result isn't perfect, but seems okay as long as shortest(node) is short
enough to type and can be used as an identifier.
(with hidden revisions)
% hg log -R hg-committed -r0:20000 -T '{node|shortest}\n' --time > /dev/null
(.^^) time: real 1.530 secs (user 1.480+0.000 sys 0.040+0.000)
(.^) time: real 43.080 secs (user 43.060+0.000 sys 0.030+0.000)
(.) time: real 1.680 secs (user 1.650+0.000 sys 0.020+0.000)
cl.index.partialmatch() isn't a drop-in replacement for cl._partialmatch().
It has no knowledge about hidden revisions, and it raises ValueError if a node
shorter than 4 chars is given. Instead, use index.partialmatch() through
cl._partialmatch(), which has no such problems and gives the identical result
with/without --pure.
The test output was sampled with --pure without this patch, which shows the
most correct result. However, we'll need to switch to using an unfiltered
changelog because _partialmatch() of a filtered changelog can be an order of
magnitude slower.
(with hidden revisions)
% hg log -R hg-committed -r0:20000 -T '{node|shortest}\n' --time > /dev/null
(.^) time: real 1.530 secs (user 1.480+0.000 sys 0.040+0.000)
(.) time: real 43.080 secs (user 43.060+0.000 sys 0.030+0.000)
This is a fix for a regression introduced by the patches for issue4028.
The test changes are due to us doing fewer _checkcopies searches now, which
makes some test outputs revert to the pre-issue4028 behavior. That issue itself
remains fixed, we only skip copy tracing for files where it isn't relevant.
As a nice side effect, this makes copy detection much faster when tracing
backwards through lots of renames.
Certifi is currently incompatible with py2exe; the Python code for certifi gets
included in library.zip, but not the cacert.pem file - and even if it were
included, SSLContext can't load a cacert.pem file from library.zip.
This currently makes it impossible to build a standalone Windows version of
Mercurial.
Guard against this, and possibly other situations where a module with the name
"certifi" exists, but is not usable.
There was a "leak", apparently introduced in b37a67b41690. When running:
hg = hglib.open('repo')
while True:
hg.log("max(branch('default'))")
all filteredset instances from branch() would be cached indefinitely by the
@util.cachefunc annotation on the max() implementation.
util.cachefunc seems dangerous as method decorator and is barely used elsewhere
in the code base. Instead, just open code caching by having the min/max
methods replace themselves with a plain lambda returning the result.
It seems like the a regression has sneaked into debug.dirstate.delaywrite in
14bddc099338. It would sleep until no files were modified "now" any more, but
when writing the dirstate it would use the old "now" and still mark files as
'unset' instead of recording the timestamp that would make the file show up as
clean instead of unknown.
Instead of getting a new "now" from the file system, we trust the computed end
time as the new "now" and thus cause the actual modification time to be
writiten to the dirstate.
debug.dirstate.delaywrite is undocumented and only used in
test-largefiles-update.t . All tests seems to work fine for me without
debug.dirstate.delaywrite . Perhaps because it not really worked as intended
without the fix in this patch, and code and tests thus have evolved to do fine
without it? It could thus perhaps make sense to drop usage of this setting in
the tests. That could speed the test up a bit.
This functionality (or something very similar) can however apparently be very
convenient in setups where checking dirty-ness is expensive - such as when
using large files and have slow file filesystems or are CPU constrained. Now it
works and we can try it. (But ideally, for the largefile use case, it should
probably only delay lfdirstate writes - not ordinary dirstate.)
Over the past week I've had to instruct multiple people to run
Python code to query the ssl module to see what TLS protocol support
is present. I think it would be useful for `hg debuginstall` to print
this info to make it easier to access and debug why Mercurial is
complaining about using an insecure TLS 1.0 protocol.
Ideally we'd also print the path to the CA cert bundle. But the APIs
for querying that in sslutil can emit warnings, making it slightly
more difficult to integrate into `hg debuginstall`. That work will
have to wait for another day.
Same as in the last commit, the old treemanifestctx stored a reference to the
revlog. If the inmemory revlog became invalid, the ctx now held an old copy and
would be incorrect. To fix this, we need the ctx to go through the manifestlog
for each access.
This is the same pattern that changectx already uses (it stores the repo, and
accesses commit data through self._repo.changelog).
The old manifestctx stored a reference to the revlog. If the inmemory revlog
became invalid, the ctx now held an old copy and would be incorrect. To fix
this, we need the ctx to go through the manifestlog for each access.
This is the same pattern that changectx already uses (it stores the repo, and
accesses commit data through self._repo.changelog).
The old @property on manifestlog was broken. It meant that we would always
recreate the manifestlog instance, which meant the cache was never hit. Since
we'll eventually remove repo.manifest and make manifestlog the only property,
let's go ahead and make manifestlog the @storecache property, have manifestlog
own the manifest instance, and have repo.manifest refer to it via manifestlog.
This means all accesses go through repo.manifestlog, which is now invalidated
correctly.
A future patch will be moving manifest creation to be inside manifestlog as part
of improving our cache guarantees. bundlerepo and unionrepo currently rely on
being able to hook into manifest creation, so let's temporarily move the actual
manifest creation to a helper function for them to intercept.
In the future manifest.manifest() will disappear entirely and this can
disappear.
By default, Python defers to the operating system for choosing the
default buffer size on opened files. On my Linux machine, the default
is 4k, which is really small for 2016.
This patch bumps the write buffer size when writing
changegroups/bundles to 128k. This matches the 128k read buffer
we already use on revlogs.
It's worth noting that this only impacts when writing to an explicit
file (such as during `hg bundle`). Buffers when writing to bundle
files via the repo vfs or to a temporary file are not impacted.
When producing a none-v2 bundle file of the mozilla-unified repository,
this change caused the number of write() system calls to drop from
952,449 to 29,788. After this change, the most frequent system
calls are fstat(), read(), lseek(), and open(). There were
2,523,672 system calls after this patch (so a net decrease of
~950k is statistically significant).
This change shows no performance change on my system. But I have a
high-end system with a fast SSD. It is quite possible this change
will have a significant impact on network file systems, where
extra network round trips due to excessive I/O system calls could
introduce significant latency.
Revlog can now be configured to store full snapshot only. This is used on the
changelog. However, the changegroup packing was still recomputing deltas to be
sent over the wire.
We now just reuse the full snapshot directly in this case, skipping delta
computation. This provides use with a large speed up(-30%):
# perfchangegroupchangelog on mercurial
! wall 2.010326 comb 2.020000 user 2.000000 sys 0.020000 (best of 5)
! wall 1.382039 comb 1.380000 user 1.370000 sys 0.010000 (best of 8)
# perfchangegroupchangelog on pypy
! wall 5.792589 comb 5.780000 user 5.780000 sys 0.000000 (best of 3)
! wall 3.911158 comb 3.920000 user 3.900000 sys 0.020000 (best of 3)
# perfchangegroupchangelog on mozilla central
! wall 20.683727 comb 20.680000 user 20.630000 sys 0.050000 (best of 3)
! wall 14.190204 comb 14.190000 user 14.150000 sys 0.040000 (best of 3)
Many tests have to be updated because of the change in bundle content. All
theses update have been verified. Because diffing changelog was not very
valuable, the resulting bundle have similar size (often a bit smaller):
# full bundle of mozilla central
with delta: 1142740533B
without delta: 1142173300B
So this is a win all over the board.