This is a significant performance win on large repositories. perffilefoldmap:
On Linux/gcc, on a test repo with over 500,000 files:
before: wall 0.605021 comb 0.600000 user 0.560000 sys 0.040000 (best of 17)
after: wall 0.280530 comb 0.280000 user 0.250000 sys 0.030000 (best of 35)
On Mac OS X/clang, on a real-world repo with over 200,000 files:
before: wall 0.281103 comb 0.280000 user 0.260000 sys 0.020000 (best of 34)
after: wall 0.133622 comb 0.140000 user 0.120000 sys 0.020000 (best of 65)
This visibly impacts status times on case-insensitive file systems. On the Mac
OS X repo, status goes from 3.64 seconds to 3.50.
With the third-party hgwatchman extension [1], 'hg status' on the same repo
goes from 0.80 seconds to 0.65.
[1] https://bitbucket.org/facebook/hgwatchman
This is a hot path on case-insensitive filesystems -- it's guaranteed to be
called every time 'hg status' is run.
This is significantly faster than the equivalent Python code: see the following
patch for numbers.
Unlike changeset_printer, it does not hide the manifest field because JSON
output will be parsed by machine where explicit "null" will be more useful
than nothing.
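A minimal sketch of why an explicit null helps a machine consumer, using only Python's standard json module. The function names and the "node" field are illustrative assumptions, not the actual template output:

```python
import json


def render_changeset(node, manifest=None):
    # Hypothetical renderer: always emit the "manifest" key.
    # A value of None is serialized as JSON null instead of being dropped.
    return json.dumps({"node": node, "manifest": manifest}, sort_keys=True)


def read_manifest(payload):
    # The consumer can index the key directly; no missing-key check needed.
    return json.loads(payload)["manifest"]
```

With the key always present, a parser only has to handle one shape of object.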
When tree manifests are stored with one revlog per directory and
loaded lazily, it's unclear how much readdelta will help. If only a
few files change, then only a small part of the full manifest will be
loaded, and the delta chains should also be shorter for tree
manifests. Therefore, let's disable readdelta for tree manifests for
now.
These will be used in upcoming patches to efficiently create a dirstate
foldmap.
The Cygwin normcase behavior is more complicated than just a simple lowercasing
or uppercasing. That's why we specify 'other'.
For C code we don't want to pay the cost of calling into a Python function for
the common case of ASCII filenames. However, while on most POSIX platforms we
normalize filenames by lowercasing them, on Windows we uppercase them. We
define an enum here indicating the direction that filenames should be
normalized as. Some platforms (notably Cygwin) have more complicated
normalization behavior -- we add a case for that too.
In upcoming patches we'll also define a fallback function that is called if the
string has non-ASCII bytes.
This enum will be replicated in the C code to make foldmaps. There's
unfortunately no nice way to avoid that -- we can't have encoding import
parsers because of import cycles. One way might be to have parsers import
encoding, but accessing Python modules from C code is just awkward.
The name 'normcasespecs' was chosen to indicate that this is merely an integer
that specifies a behavior, not a function. The name was pluralized since in
upcoming patches we'll introduce 'normcasespec' which will be one of these
values.
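A rough sketch of how such a spec could drive normalization. The integer values, the 'normcase' helper and its 'fallback' parameter are assumptions made for illustration, and routing 'other' to the fallback is a guess, not something this patch states:

```python
class normcasespecs(object):
    # Placeholder integer values; only their distinctness matters here.
    lower = -1
    upper = 1
    other = 0


def normcase(path, spec, fallback):
    """Normalize the bytes `path` according to `spec`.

    `fallback` is a hypothetical callable used for non-ASCII input and
    for platforms whose behavior is neither plain lower nor plain upper.
    """
    try:
        path.decode('ascii')
    except UnicodeDecodeError:
        # Non-ASCII bytes: defer to the slower, platform-aware function.
        return fallback(path)
    if spec == normcasespecs.lower:
        return path.lower()
    if spec == normcasespecs.upper:
        return path.upper()
    # 'other' (e.g. Cygwin): no cheap fast path in this sketch.
    return fallback(path)
```

The point of the integer spec is that the ASCII fast path needs no call back into Python when the same decision is made in C.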
We should consider adding HTML rendering of the RST to the response as a
follow-up. I attempted to do this, but there was an empty array
returned by the rstdoc() template function. Not sure what's going on.
Will deal with it later.
These are the same dispatch function under the hood. The only difference
is the default number of entries to render and the template to use. So
it makes sense to use a shared template.
The format for {changelistentry} is similar to {changeset}. However, there
are differences in argument names and their values that prevent us from
(easily) using the same template. (Perhaps there is room to consolidate
the templates as a follow-up.)
We're currently not recording some data in {changelistentry} that exists
in {changeset}. This includes the branch name. This should be added in
a follow-up. For now, something is better than nothing.
The content of "hg help templating" is largely derived from docstrings
on functions providing functionality. Template functions are the lone
holdout.
Prepare for generating them dynamically by defining docstrings for all
template functions.
There are numerous ways these docs could be improved. Right now, the
help output simply shows function names and arguments. So literally
any accurate data is better than what is there now.
Previously, specifying a file name but not matching the dirstate case yielded
the following, even though the file was actually removed:
$ hg forget capsdir1/capsdir/abc.txt
not removing capsdir\a.txt: file is already untracked
removing CapsDir\A.txt
[1]
This change doesn't appear to cause any extra filesystem accesses, even if a
nonexistent file is specified.
If a directory is specified without a case match, it is (and was previously)
still silently ignored.
We don't require it when adding files on a case insensitive filesystem, so don't
require it to add directories for consistency.
The problem with the previous code was that _walkexplicit() was only returning
the normalized directory. The file(s) in the directory are then appended, and
passed to the matcher. But if the user asks for 'capsdir1/capsdir', the matcher
will not accept 'CapsDir1/CapsDir/AbC.txt', and the name is dropped. Matching
based on the non-normalized name is required.
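A simplified sketch of the idea, assuming a case-sensitive prefix matcher. 'make_matcher' and 'walk_dir' are hypothetical stand-ins and not the actual _walkexplicit() code:

```python
def make_matcher(pattern):
    # Case-sensitive stand-in for the real matcher: accepts the pattern
    # itself or anything underneath it.
    prefix = pattern.rstrip('/') + '/'
    return lambda path: path == pattern or path.startswith(prefix)


def walk_dir(userdir, normdir, filenames, match):
    """Yield dirstate-cased paths for files accepted by `match`."""
    for name in filenames:
        # Test the matcher against the directory as the user typed it...
        if match(userdir + '/' + name):
            # ...but report the path using the normalized casing.
            yield normdir + '/' + name
```

Testing the matcher against 'CapsDir1/CapsDir/AbC.txt' directly would reject the file, which is the failure described above.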
If not normalizing, skip the extra string building for efficiency. '.' is
replaced with '' so that, when no file is specified, the path being tested isn't
prefixed with './' (which would make the match fail).
Merge copies traverses file history in search of copies and renames.
Since 3.3 we do "linkrev adjustment" to ensure duplicated filelog entries
do not confuse the traversal. This "linkrev adjustment" involves ancestry
testing and walking the changeset graph. If we walk the changeset graph
for each file, we end up with 'O(<changesets>x<files>)' complexity,
which creates massive issues. For example, grafting a changeset in Mozilla's repo
went from 6 seconds to more than 3 minutes.
There is a mechanism to reuse such ancestor computations between all files, but
it has to be manually set up in situations where it makes sense to take such a
shortcut. This changeset sets this mechanism up and brings the graft time back
from 3 minutes to 8 seconds.
To do so, we need more control over the way 'filectx' objects are instantiated
during each 'checkcopies' call that 'mergecopies' makes. We add a new 'setupctx'
that configures and returns a 'filectx' factory. The function makes sure the
ancestry context is properly created, and the factory makes sure it is properly
installed on each returned 'filectx'.
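A minimal sketch of the factory pattern described here. 'FileCtx', 'makectx' and the 'compute_ancestors' callable are hypothetical stand-ins; only the names 'setupctx' and '_ancestrycontext' come from the patches themselves:

```python
class FileCtx(object):
    # Stand-in for filectx; carries only what this sketch needs.
    def __init__(self, path, node):
        self.path = path
        self.node = node
        self._ancestrycontext = None


def setupctx(compute_ancestors):
    """Return a factory whose contexts all share one ancestry context."""
    # Computed once here, rather than once per file.
    shared = compute_ancestors()

    def makectx(path, node):
        fctx = FileCtx(path, node)
        # Install the shared context on every filectx we hand out.
        fctx._ancestrycontext = shared
        return fctx

    return makectx
```

Because every context from one factory points at the same object, the ancestry walk is paid for once per merge rather than once per file.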
When the source rev value is 'None', the ctx is a working context. We
cannot compute the ancestors from there, so we skip directly to its
parents. This is necessary to allow a 'None' value for
'_descendantrev', which is itself necessary to make all contexts used in
'mergecopies' reuse the same '_ancestrycontext'.
The linkrev adjustment will likely do the same ancestry walk multiple times,
so we already have an optional mechanism to take advantage of this. Since
4e4e9e954fae, linkrev adjustment has been done lazily to limit the performance
impact on rename computation. However, this laziness created a quadratic
situation in 'annotate'.
Mercurial repo: hg annotate mercurial/commands.py
before: 8.090
after: 36.300
Mozilla repo: hg annotate layout/generic/nsTextFrame.cpp
before: 1.190
after: 290.230
So we set up sharing of the ancestry context in the annotate case too. Linkrev
adjustment still has an impact, but it is a much more sensible one.
Mercurial repo: hg annotate mercurial/commands.py
before: 36.300
after: 10.230
Mozilla repo: hg annotate layout/generic/nsTextFrame.cpp
before: 290.230
after: 5.560
It's unfortunate that the workingctx revision is None, which doesn't work well
in arithmetic operations or comparisons. This function is trivial but will be
used in several places.
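A sketch of what such a helper might look like. The name 'intrev' and the sentinel value are assumptions made for illustration; the message above does not name either:

```python
# Hypothetical sentinel: larger than any real revision number.
WDIR_REV = 0x7fffffff


def intrev(rev):
    """Return an integer usable in arithmetic and comparisons.

    The working directory (rev None) maps to the sentinel so that it
    sorts after every committed revision.
    """
    return WDIR_REV if rev is None else rev
```

For example, sorted(revs, key=intrev) orders a mix of integers and None without raising.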
We're going to reuse this in upcoming patches.
The change to Py_ssize_t is necessary because parsers.c doesn't define
PY_SSIZE_T_CLEAN. That macro changes the behavior of PyArg_ParseTuple but not
PyBytes_GET_SIZE.
The new manifest format is designed to be smaller, in particular to
produce smaller deltas. It stores hashes in binary and puts the hash
on a new line (for smaller deltas). It also uses stem compression to
save space for long paths. The format has room for metadata, but
that's there only for future-proofing. The parser thus accepts any
metadata and throws it away. For more information, see
http://mercurial.selenic.com/wiki/ManifestV2Plan.
The current manifest format doesn't allow an empty filename, so we use
an empty filename on the first line to tell a manifest of the new
format from the old. Since we still never write manifests in the new
format, the added code is unused, but it is tested by
test-manifest.py.
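A minimal sketch of the format check, assuming the existing layout in which each line starts with the filename followed by a NUL byte. The function name is hypothetical:

```python
def ismanifestv2(data):
    """Return True if `data` looks like a new-format manifest.

    In the existing format each line is "filename\\0hash...", and the
    filename is never empty. An empty filename on the first line
    therefore means the very first byte is NUL.
    """
    return data.startswith(b'\0')
```

This only distinguishes the two formats; it does not parse the metadata or the entries that follow.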
While it should be safe to switch to the new manifest format on an
existing repo, let's keep it simple for now and have the configuration
take effect only at repo creation time. If the configuration is
enabled at repo creation, we add an entry to requires and read
that instead of the configuration from then on.
Previously we would compute the repoview's static blockers by finding all the
children of hidden commits that were not hidden. This was O(number of commits
since first hidden change) since 'children' requires walking every commit from
tip until the first hidden change.
The new algorithm walks all heads down until it sees a public commit. This makes
the computation O(number of draft commits), which is much faster in large
repositories with a large number of commits and a low number of drafts.
On a large repo with 1000+ obsolete markers and the earliest draft commit around
tip~200000, this improves computehidden perf by 200x (2s to 0.01s).
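A simplified sketch of the new walk, using a plain dict as the graph. The function and parameter names are hypothetical, and it assumes hidden commits are never public, so the walk can stop at the first public commit on each path:

```python
def staticblockers(heads, parents, ispublic, hidden):
    """Return visible, non-public revs that have a hidden parent.

    `parents` maps rev -> list of parent revs, `ispublic` is a callable,
    and `hidden` is the set of hidden revs.
    """
    blockers = set()
    seen = set()
    # Public heads cannot have hidden ancestors worth visiting.
    stack = [r for r in heads if not ispublic(r)]
    while stack:
        rev = stack.pop()
        if rev in seen:
            continue
        seen.add(rev)
        for p in parents.get(rev, ()):
            if rev not in hidden and p in hidden:
                blockers.add(rev)
            # Stop descending as soon as we reach a public commit.
            if not ispublic(p):
                stack.append(p)
    return blockers
```

Only draft commits are ever pushed onto the stack, which is where the O(number of draft commits) bound comes from.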
It's pretty surprising phase wasn't part of this template call already.
We now expose {phase} to the {changeset} template and we expose this
data to JSON.
This brings JSON output in line with the output from `hg log -Tjson`.
The lone exception is that hgweb doesn't print the numeric rev. As has been
stated previously, I don't believe hgweb should be exposing these
unstable identifiers. (We can add them later if we really want them.)
There is still work to do to bring hgweb to parity with --verbose and
--debug output from the CLI.