Previously, "hg verify" verifies everything, which could be undesirable when
there are expensive flag processor contents.
This patch adds a "verify.skipflags" developer config. A flag processor will
be skipped if (flag & verify.skipflags) == 0.
In the LFS usecase, that means "hg verify --config verify.skipflags=8192"
will not download all LFS blobs, which could be too large to be stored
locally.
Note: "renamed" is also skipped since its default implementation may call
filelog.data() which will trigger the flag processor.
Previously, verify only checks "rawsize == len(rawtext)", if
"len(fl.read()) != fl.size()".
With flag processor, "len(fl.read()) != fl.size()" does not necessarily mean
"rawsize == len(rawtext)". So we may miss a useful check.
This patch removes the "if len(fl.read()) != fl.size()" condition so the
rawsize check is always performed.
With the condition removed, "fl.read(n)" looks unnecessary so a comment was
added to explain the side effect is wanted.
According to the document added above, we should check L1 == L2, and the
only way to get L1 in all cases is to call "rawsize()", and the only way to
get L2 is to call "revision(raw=True)". Therefore the fix.
Meanwhile there are still a lot of things about flagprocessor broken in
revlog.py. Tests will be added after revlog.py gets fixed.
It seems a good idea to list all kinds of "surprises" and expected behavior
to make the upcoming changes easier to understand.
Note: the comment added does not reflect the actual behavior of the current
code.
The verifier calls out to _validpath() to check if it should verify
that path and the narrowhg extension overrides _validpath() to tell
the verifier to skip that path. In treemanifest repos, the verifier
calls the same method to check if it should visit a
directory. However, the decision to visit a directory is different
from the condition that it's a matching path, and narrowhg was working
around it by returning True from its _validpath() override if *either*
was true.
Similar to how one can do "hg files -I foo/bar/ -X foo/" (making the
include pointless), narrowhg can be configured to track the same
paths. In that case match("foo/bar/baz") would be false, but
match.visitdir("foo/bar/baz") turns out to be true, causing verify to
fail. This may seem like a bug in visitdir(), but it's explicitly
documented to be undefined for subdirectories of excluded
directories. When using treemanifests, the walk would not descend into
foo/, so verification would pass. However, when using flat manifests,
there is no recursive directory walk and the file path "foo/bar/baz"
would be passed to _validpath() without "foo/" (actually without the
slash) being passed first. As explained above, _validpath() would
return true for the file path and "hg verify" would fail.
Replacing the _validpath() method by a matcher seems like the obvious
fix. Narrowhg can then pass in its own matcher and not have to
conflate the two matching functions (for dirs and files). I think it
also makes the code clearer.
Now that all the functionality has been moved to manifestlog/manifestrevlog/etc,
we can finally change all the uses of repo.manifest to use the new versions. A
future diff will then delete repo.manifest.
One additional change in this commit is to change repo.manifestlog to be a
@storecache property instead of @property. This is required by some uses of
repo.manifest require that it be settable (contrib/perf.py and the static http
server). We can't do this in a prior change because we can't use @storecache on
this until repo.manifest is no longer used anywhere.
In repos with treemanifests, the non-root-directory dirlogs often have
many more total revisions than the root manifest log has. This change
adds progress out to that part of 'hg verify'. Since the verification
is recursive along the directory tree, we don't know how many total
revisions there are at the beginning of the command, so instead we
report progress in units of directories, much like we report progress
for verification of files today.
I'm not very happy with passing both 'storefiles' and 'progress' into
the recursive calls. I tried passing in just a 'visitdir(dir)'
callback, but the results did not seem better overall. I'm happy to
update if anyone has better ideas.
We already report orphaned filelogs, i.e. revlogs for files that are
not mentioned in any manifest. This change adds checking for orphaned
dirlogs, i.e. revlogs that are not mentioned in any parent-directory
dirlog.
Note that, for fncachestore, only files mentioned in the fncache are
considered, there's not check for files in .hg/store/meta that are not
mentioned in the fncache. This is no different from the current
situation for filelogs.
In repos with treemanifests, there is no specific verification of
directory manifest revlogs. It simply collects all file nodes by
reading each manifest delta. With treemanifests, that's means calling
the manifest._slowreaddelta(). If there are missing revlog entries in
a subdirectory revlog, 'hg verify' will simply report the exception
that occurred while trying to read the root manifest:
manifest@0: reading delta 1700e2e92882: meta/b/00manifest.i@67688a370455: no node
This patch changes the verify code to load only the root manifest at
first and verify all revisions of it, then verify all revisions of
each direct subdirectory, and so on, recursively. The above message
becomes
b/@0: parent-directory manifest refers to unknown revision 67688a370455
Since the new algorithm reads a single revlog at a time and in order,
'hg verify' on a treemanifest version of the hg core repo goes from
~50s to ~14s. As expected, there is no significant difference on a
repo with flat manifests.
The "manifest" label that's used in error messages will instead be the
directory path for subdirectory manifests (not the root manifest), so
let's extract the constant to a variable already to make future
patches simpler.
When a changeset refers to a manifest revision that's not found in the
manifest log, we say "changeset refers to missing revision X", but
when a manifest refers to file revision that's not found in the
filelog, we say "X in manifests not found". The language used for
missing manifest revisions seems clearer, so let's use that for
missing filelog revisions too.
We include the "manifest" prefix on most other errors, so it seems
consistent to add them to the remaining messages too. Also, having the
"manifest" prefix will be more consistent with having the directory
prefix there when we add support for treemanifests. With the
"manifest" at the beginning, let's remove the now-redundant
"manifest" in the message itself.
In 1f64dad11884 (verify: filter messages about missing null manifests
(issue2900), 2011-07-13), we started ignoring nullid in the list of
manifest nodeids to check. Then, in 954b09a82a61 (verify: do not choke
on valid changelog without manifest, 2012-08-21), we stopped adding
nullid to the list to start with. So let's drop the left-over check
now.
Reasons:
* _crosscheckfiles(), as the name suggests, is about checking that
the set of files files mentioned in changesets match the set of
files mentioned in the manifests.
* The "checking" in _crosscheckfiles() looked rather strange, as it
just emitted an error for *every* entry in mflinkrevs. The reason
was that these were the entries remaining after the call to
_verifymanifest(). Moving all the processing of mflinkrevs into
_verifymanifest() makes it much clearer that it's the remaining
entries that are a problem.
Functional change: progress is no longer reported for "crosschecking"
of missing manifest entries. Since the crosschecking phase takes a
tiny fraction of the verification, I don't think this is a
problem. Also, any reports of "changeset refers to unknown manifest"
will now come before "crosschecking files in changesets and
manifests".
Similar to the previous patch, the .hg/store/meta/ directory does not
get copied when when using "hg clone --uncompressed". Fix by including
"meta/" in store.datafiles(). This seems safe to do, as there are only
a few users of this method. "hg manifest" already filters the paths by
"data/" prefix. The calls from largefiles also seem safe. The use in
verify needs updating to prevent it from mistaking dirlogs for
orphaned filelogs. That change is included in this patch.
Since the dirlogs will now be in the fncache when using fncachestore,
let's also update debugrebuildfncache(). That will also allow any
existing treemanifest repos to get their dirlogs into the fncache.
Also update test-treemanifest.t to use an a directory name that
requires dot-encoding and uppercase-encoding so we test that the path
encoding works.
In 0413f674179e (verify: move file cross checking to its own function,
2016-01-05), "mflinkrevs = None" was moved into function, so the
reference was cleared there, but the calling function now held on to
the variable. The point of clearing it was presumably to free up
memory, so let's move the clearing to the calling function where it
makes a difference. Also change "mflinkrevs = None" to "del
mflinkrevs", since the comment about scope now surely is obsolete.
_verifychangelog() and _verifymanifest() accept dictionaries that they
populate. We pass in empty dictionaries, so it's clearer to create
them in the functions and return them.
Nested functions in Python are not able to assign to variables in the
outer scope without something like the list trick because assignments
refer to the inner scope. So, we formerly used a list to give an
object to assign into.
Now that error and warning are object members, the [0] hack is no
longer needed.
This will allow us to start moving some of the nested functions inside verify()
out onto the class.
This will allow extensions to hook into verify more easily.
In order to allow extensions to hook into the verification logic more easily, we
need to refactor it into multiple functions. The first step is to move it to a
class so the shared state can be more easily accessed.
Without a hook of this nature, narrowhg[0] clones always result in 'hg
verify' reporting terrible damage to the entire repository
history. With this hook, we can ignore files that aren't supposed to
be in the clone, and then get an accurate report of any damage present
(or not) in the repo.
0: https://bitbucket.org/Google/narrowhg
The home of 'Abort' is 'error' not 'util' however, a lot of code seems to be
confused about that and gives all the credit to 'util' instead of the
hardworking 'error'. In a spirit of equity, we break the cycle of injustice and
give back to 'error' the respect it deserves. And screw that 'util' poser.
For great justice.
Python 2.6 introduced the "except type as instance" syntax, replacing
the "except type, instance" syntax that came before. Python 3 dropped
support for the latter syntax. Since we no longer support Python 2.4 or
2.5, we have no need to continue supporting the "except type, instance".
This patch mass rewrites the exception syntax to be Python 2.6+ and
Python 3 compatible.
This patch was produced by running `2to3 -f except -w -n .`.
In the very early days of hg, it was possible to commit /dev/null because our
patch importer was too simple. Repos from this era may still
exist, add a note about why we ignore this name.
Since afe2bc876c89, repo.cancopy() cannot be used to check if the repo is
a bundlerepository.
repo.url() should always have "scheme:", so it isn't necessary to parse
by util.url().
These slashes are a hangover from issue3612, fixed in d5787cfaa7cf.
Although the bugfix in that commit is correct, the test it adds
does not replicate the conditions for the bug correctly.
Before this patch, there are two ambiguous variables: "havemf" and
"hasmanifest".
"havemf" means whether there are any "manifest" entries.
"hasmanifest" means whether there are any "changelog" entries
referring to "manifest" entry.
This patch renames from "hasmanifest" to "refersmf" to clear
difference from "havemf".