Previously, when a chg server is exiting, it does not handle connected
clients so clients may get ECONNRESET and crash:
1. client connect() # success
2. server shouldexit = True and exit
3. client recv() # ECONNRESET
297d89f2789e makes this race condition easier to reproduce if a lot of short
chg commands are started in parallel.
This patch fixes the above issue by unlinking the socket path to stop
queuing new connections and processing all pending connections before exit.
This patch changes unixforkingservice so it only calls
`self._servicehandler.unlinksocket(self.address)` at most once.
This is needed by the next patch.
These classes are pretty large and independent from revset computation.
2961 mercurial/revset.py
973 mercurial/smartset.py
3934 total
revset.prettyformatset() is renamed to smartset.prettyformat(). Smartset
classes are aliased since they are quite common in revset.py.
pager replaced stdout with a line buffered version to work around glibc
deciding on a buffering strategy on the first write to stdout. This is going
to make my next patch hard, as replacing stdout will make tracking time
spent blocked on it more challenging.
Move the line buffering requirement to util.py, and remove it from pager.
This means that the abuse of ui.formatted=True and pager set to cat or equivalent
no longer results in a line-buffered output to a pipe, hence (BC), although
I don't expect anyone to be affected
changelog.heads() first calls headrevs then converts them to nodes.
localrepo.heads() then sorts them using self.changelog.rev function and makes
useless conversion back to revs. Instead let's call changelog.headrevs() from
localrepo.heads(), sort the output and then convert to nodes. Because headrevs
does not support start parameter this optimization only works if start is None.
This is a small style update for clarity. The previous situation was:
if foo:
50 lines
else:
2 lines
In such case I tend to invert these to get the simpler branch out of the way
earlier:
if not foo:
2 lines
else:
50 lines
This makes the conditional and various alternatives fit on the same screen,
simpler to read overall.
Bundle2 has its own mechanisms to check for heads (and other) changes, so push
using bundle2 is relying on the "check:heads" bundle part of unbundle and the
'check_heads' call is not checking anything. We add a small comment to make
this clearer.
The verifier calls out to _validpath() to check if it should verify
that path and the narrowhg extension overrides _validpath() to tell
the verifier to skip that path. In treemanifest repos, the verifier
calls the same method to check if it should visit a
directory. However, the decision to visit a directory is different
from the condition that it's a matching path, and narrowhg was working
around it by returning True from its _validpath() override if *either*
was true.
Similar to how one can do "hg files -I foo/bar/ -X foo/" (making the
include pointless), narrowhg can be configured to track the same
paths. In that case match("foo/bar/baz") would be false, but
match.visitdir("foo/bar/baz") turns out to be true, causing verify to
fail. This may seem like a bug in visitdir(), but it's explicitly
documented to be undefined for subdirectories of excluded
directories. When using treemanifests, the walk would not descend into
foo/, so verification would pass. However, when using flat manifests,
there is no recursive directory walk and the file path "foo/bar/baz"
would be passed to _validpath() without "foo/" (actually without the
slash) being passed first. As explained above, _validpath() would
return true for the file path and "hg verify" would fail.
Replacing the _validpath() method by a matcher seems like the obvious
fix. Narrowhg can then pass in its own matcher and not have to
conflate the two matching functions (for dirs and files). I think it
also makes the code clearer.
0b5f1f2efc77 introduced handling of a crash in this case. A review comment
suggested that it was not entirely obvious that a 'dm' always would have a 'r'
for the source file.
To mitigate that risk, make the code more conservative and make less
assumptions.
Work around that 'dm' in the data model only can have one operation for the
target file, but still can have multiple and conflicting operations on the
source file where the other operation is a 'rm'. The move would thus fail with
'abort: No such file or directory'.
In this case it is "obvious" that the file should be removed, either before or
after moving it. We thus keep the 'rm' of the source file but drop the 'dm'.
This is not a pretty fix but quite "obviously" safe (famous last words...) as
it only touches a rare code path that used to crash. It is possible that it
would be better to swap the files for 'dm' as suggested on
https://bz.mercurial-scm.org/show_bug.cgi?id=5020#c13 but it is not entirely
obvious that it not just would create conflicts on the other file. That can be
revisited later.
dict.keys() is documented to return a copy, so it's surprising that
sortdict.keys() did not. I noticed this because we have an extension
that calls readlocaltags(). That method tries to remove any tags that
point to non-existent revisions (most likely stripped). However, since
it's unintentionally working on the instance it's modifying, it
sometimes fails to remove tags when there are multiple bad tags in a
row. This was not caught because localrepo.tags() does an additional
layer of filtering.
sortdict is also used in other places, but I have not checked whether
its keys() and/or __delitem__() methods are used there.
outgoing() and remote() may stall for long due to network I/O, which seems
unsafe per definition, "whether a predicate is safe for DoS attack." But I'm
not 100% sure about this. If our concern isn't elapsed time but CPU resource,
these predicates are considered safe. Perhaps that would be up to the
web/application server configuration?
Anyway, outgoing() and remote() wouldn't be useful in hgweb, so I think
it's okay to ban them.
statprof has a __main__ handler that allows viewing of previously
written data files. As Yuya pointed out during review, 82ee01726a77
broke this. This patch fixes that.
Until callsites are updated, this will have no effect. Once callsites
are updated, specifying experimental.editortmpinhg will create editor
temporary files in a subdirectory of .hg, which will make it easier
for tool integrations to determine what repository is in play when
they're asked to edit an hg-related file.
Care needs to be taken to prevent leaking potentially sensitive environment
variables through hgweb, if template support for environment variables is to be
introduced. There are a few ideas about the API for preventing accidental
leaking [1]. Option 3 seems best from the POV of not needing to configure
anything in the normal case. I couldn't figure out how to do that, so guard it
with an experimental option for now.
[1] https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-January/092383.html
Narrowhg has been using "1 << 14" as its revlog flag value for a long
time. We (Google) have many repos with that value in production
already. When the same value was reserved for EXTSTORED, it made those
repos invalid. Upgrading them will be a little painful. We should
clearly have reserved the value for narrowhg a long time ago. Since
the EXTSTORED flag is not yet in any release and Facebook also says
they have not started using it in production, so it should be okay to
change it. This patch gives the current value (1 << 14) back to
narrowhg and gives a new value (1 << 13) to EXTSTORED.
Before this change, the text about revlog flags was reflowed into a
single paragraph, which made it a bit hard to read. I don't even know
the rules around this, but adding a blank line before each flag seems
to prevent the reflowing.
util.buffer() either returns inbuilt buffer function or defines a new one which
slices. The inbuilt buffer() also has a length argument which is missing from
the ones we defined. This patch adds that length argument.
pycompat.getenv returns os.getenvb on py3 which is not available on Windows.
This patch replaces them with encoding.environ.get and checks to ensure no
new instances of os.getenv or os.setenv are introduced.
The final part of integrating the compression manager APIs into
revlog storage is the plumbing for repositories to advertise they
are using non-zlib storage and for revlogs to instantiate a non-zlib
compression engine.
The main intent of the compression manager work was to zstd all
of the things. Adding zstd to revlogs has proved to be more involved
than other places because revlogs are... special. Very small inputs
and the use of delta chains (which are themselves a form of
compression) are a completely different use case from streaming
compression, which bundles and the wire protocol employ. I've
conducted numerous experiments with zstd in revlogs and have yet
to formalize compression settings and a storage architecture that
I'm confident I won't regret later. In other words, I'm not yet
ready to commit to a new mechanism for using zstd - or any other
compression format - in revlogs.
That being said, having some support for zstd (and other compression
formats) in revlogs in core is beneficial. It can allow others to
conduct experiments.
This patch introduces *highly experimental* support for non-zlib
compression formats in revlogs. Introduced is a config option to
control which compression engine to use. Also introduced is a namespace
of "exp-compression-*" requirements to denote support for non-zlib
compression in revlogs. I've prefixed the namespace with "exp-"
(short for "experimental") because I'm not confident of the
requirements "schema" and in no way want to give the illusion of
supporting these requirements in the future. I fully intend to drop
support for these requirements once we figure out what we're doing
with zstd in revlogs.
A good portion of the patch is teaching the requirements system
about registered compression engines and passing the requested
compression engine as an opener option so revlogs can instantiate
the proper compression engine for new operations.
That's a verbose way of saying "we can now use zstd in revlogs!"
On an `hg pull` conversion of the mozilla-unified repo with no extra
redelta settings (like aggressivemergedeltas), we can see the impact
of zstd vs zlib in revlogs:
$ hg perfrevlogchunks -c
! chunk
! wall 2.032052 comb 2.040000 user 1.990000 sys 0.050000 (best of 5)
! wall 1.866360 comb 1.860000 user 1.820000 sys 0.040000 (best of 6)
! chunk batch
! wall 1.877261 comb 1.870000 user 1.860000 sys 0.010000 (best of 6)
! wall 1.705410 comb 1.710000 user 1.690000 sys 0.020000 (best of 6)
$ hg perfrevlogchunks -m
! chunk
! wall 2.721427 comb 2.720000 user 2.640000 sys 0.080000 (best of 4)
! wall 2.035076 comb 2.030000 user 1.950000 sys 0.080000 (best of 5)
! chunk batch
! wall 2.614561 comb 2.620000 user 2.580000 sys 0.040000 (best of 4)
! wall 1.910252 comb 1.910000 user 1.880000 sys 0.030000 (best of 6)
$ hg perfrevlog -c -d 1
! wall 4.812885 comb 4.820000 user 4.800000 sys 0.020000 (best of 3)
! wall 4.699621 comb 4.710000 user 4.700000 sys 0.010000 (best of 3)
$ hg perfrevlog -m -d 1000
! wall 34.252800 comb 34.250000 user 33.730000 sys 0.520000 (best of 3)
! wall 24.094999 comb 24.090000 user 23.320000 sys 0.770000 (best of 3)
Only modest wins for the changelog. But manifest reading is
significantly faster. What's going on?
One reason might be data volume. zstd decompresses faster. So given
more bytes, it will put more distance between it and zlib.
Another reason is size. In the current design, zstd revlogs are
*larger*:
debugcreatestreamclonebundle (size in bytes)
zlib: 1,638,852,492
zstd: 1,680,601,332
I haven't investigated this fully, but I reckon a significant cause of
larger revlogs is that the zstd frame/header has more bytes than
zlib's. For very small inputs or data that doesn't compress well, we'll
tend to store more uncompressed chunks than with zlib (because the
compressed size isn't smaller than original). This will make revlog
reading faster because it is doing less decompression.
Moving on to bundle performance:
$ hg bundle -a -t none-v2 (total CPU time)
zlib: 102.79s
zstd: 97.75s
So, marginal CPU decrease for reading all chunks in all revlogs
(this is somewhat disappointing).
$ hg bundle -a -t <engine>-v2 (total CPU time)
zlib: 191.59s
zstd: 115.36s
This last test effectively measures the difference between zlib->zlib
and zstd->zstd for revlogs to bundle. This is a rough approximation of
what a server does during `hg clone`.
There are some promising results for zstd. But not enough for me to
feel comfortable advertising it to users. We'll get there...
Now that compression engines declare their header in revlog chunks
and can decompress revlog chunks, we refactor revlog.decompress()
to use them.
Making full use of the property that revlog compressor objects are
reusable, revlog instances now maintain a dict mapping an engine's
revlog header to a compressor object. This is not only a performance
optimization for engines where compressor object reuse can result in
better performance, but it also serves as a cache of header values
so we don't need to perform redundant lookups against the compression
engine manager. (Yes, I measured and the overhead of a function call
versus a dict lookup was observed.)
Replacing the previous inline lookup table with a dict lookup was
measured to make chunk reading ~2.5% slower on changelogs and ~4.5%
slower on manifests. So, the inline lookup table has been mostly
preserved so we don't lose performance. This is unfortunate. But
many decompression operations complete in microseconds, so Python
attribute lookup, dict lookup, and function calls do matter.
The impact of this change on mozilla-unified is as follows:
$ hg perfrevlogchunks -c
! chunk
! wall 1.953663 comb 1.950000 user 1.920000 sys 0.030000 (best of 6)
! wall 1.946000 comb 1.940000 user 1.910000 sys 0.030000 (best of 6)
! chunk batch
! wall 1.791075 comb 1.800000 user 1.760000 sys 0.040000 (best of 6)
! wall 1.785690 comb 1.770000 user 1.750000 sys 0.020000 (best of 6)
$ hg perfrevlogchunks -m
! chunk
! wall 2.587262 comb 2.580000 user 2.550000 sys 0.030000 (best of 4)
! wall 2.616330 comb 2.610000 user 2.560000 sys 0.050000 (best of 4)
! chunk batch
! wall 2.427092 comb 2.420000 user 2.400000 sys 0.020000 (best of 5)
! wall 2.462061 comb 2.460000 user 2.400000 sys 0.060000 (best of 4)
Changelog chunk reading is slightly faster but manifest reading is
slower. What gives?
On this repo, 99.85% of changelog entries are zlib compressed (the 'x'
header). On the manifest, 67.5% are zlib and 32.4% are '\0'. This patch
swapped the test order of 'x' and '\0' so now 'x' is tested first. This
makes changelogs faster since they almost always hit the first branch.
This makes a significant percentage of manifest '\0' chunks slower
because that code path now performs an extra test. Yes, I too can't
believe we're able to measure the impact of an if..elif with simple
string compares. I reckon this code would benefit from being written
in C...
There's no apparent reason to have this "entries" generator function that
builds a list and then yields its elements in reverse order and which is only
called to build the "entries" list. So just build the list directly, in
reverse order.
Adjust "parity" generator's offset to keep rendering the same.
readline() returns '' only when EOF is encountered, in which case, Python's
getpass() raises EOFError. We should do the same to abort the session as
"response expected."
This bug was reported to
https://bitbucket.org/tortoisehg/thg/issues/4659/
The result of diffstatdata should not depend on having noprefix set or not, as
was reported in issue 4755. Forcing noprefix to false on call makes sure the
parser receives the diff in the correct format and returns the proper result.
Another way to fix this would have been to change the regular expressions in
path.diffstatdata(), but that would have introduced many unecessary special
cases.
This config knob will control whether or not to show the similarity
calculation in the diff output:
diff --git a/README.md b/foo.md
similarity index 88%
rename from README.md
rename to foo.md
--- a/README.md
+++ b/foo.md
We have 4 revset functions that take integer arguments, and they handle
their arguments in slightly different ways. This patch unifies them:
- getstring() in place of getsymbol(), which is more consistent with the
handling of integer revisions (both 1 and '1' are valid)
- say "expects" instead of "requires" for type errors
We don't need to catch TypeError since getstring() must return a string.
The rev argument has the same meaning as startrev of follow(), and I think
startrev is more informative.
followlines() is new function, we can make BC now.