A censored revision stored in a revlog should have the censored revlog index
flag bit set. This implies we must know if a revision is censored before we
add it to the revlog. When adding revisions from exchanged deltas, we would
prefer to determine this flag without decoding every single full text.
This change introduces a heuristic based on assumptions around the Mercurial
delta format and filelog metadata. Since deltas which produce a censored
revision must be full-replacement deltas, we can read the delta's first bytes
to check the filelog metadata. Since "censored" is the alphabetically first
filelog metadata key, censored filelog revisions have a well-known prefix we
can look for.
For more on the design and background of the censorship feature, see:
http://mercurial.selenic.com/wiki/CensorPlan
To ensure that exchanged deltas in the presence of censored revisions can
always be applied to the recipient repository, the deltas must replace the
entire base text. To make this restriction reasonably enforceable, the delta
must do so with a single patch operation.
For background and broader design of the censorship feature, see:
http://mercurial.selenic.com/wiki/CensorPlan
The iscensored method will be used by the exchange layer to reject
nonconforming deltas involving censored revisions (and to produce
conforming deltas).
For background and broader design of the censorship feature, see:
http://mercurial.selenic.com/wiki/CensorPlan
To ensure delta compatibility, when a revision is censored, it is
padded to match the original data in size. The previous check does
not allow for padding because it was added before padding was found
to be a requirement.
For more background and design of the censorship feature, see:
mercurial.selenic.com/wiki/CensorPlan
To support "status" operations against working directories that are
the children of censored revisions, filelog must define "cmp" and "size"
for censored content.
With this change, when a revlog revision hash does not match its content, and
the content is empty with a special metadata key, the integrity failure is
assumed to be intentionally caused to remove sensitive content from repository
history.
To allow different Mercurial functionality to handle this scenario differently
a more specific exception is raised than "ordinary" hash failures.
Alternatives to this approach include, but are not limited to:
- Calling a hook when hashes mismatch to allow arbitrary tombstone validation.
Cons: Irresponsibly easy to disable integrity checking altogether.
- Returning empty revision data eagerly instead of raising, masking the error.
Cons: Push/pull won't roundtrip the tombstone, so client repos are unusable.
- Doing nothing differently at this layer. Callers must do their own detection
of tombstoned data if they want to handle some hash checks and not others.
- Impacts dozens of callsites, many of which don't have the revision data
- Would probably be missing one or two callsites at any given time
- Currently we throw a RevlogError, as do 12 other places in revlog.py.
Callers would need to parse the exception message and/or ensure
RevlogError is not thrown from any other part of their call tree.
_parsemeta returns the dictionary and a list of keys in the order they appear
in metadata. This can be used to repack the dictionary in the same order.
_packmeta creates metadata from a dictionary and an optional key-order list.
In _parsemeta, we use slices and re.search indead of str.index so we can accept
both buffers and strings.
filelog.renamed() is an expensive call as it reads the filelog if p1 == nullid.
It's more efficient to first compute the hash, and to bail early if
the computed hash is the same as the stored nodeid.
'samehashes' variable is not strictly necessary, but helps for comprehension.
Because "\1\n" is a separator for metadata, data starting with "\1\n" is
handled specifically. It was not tested.
size() call return incorrect data if original data had been "\1\n-escaped".
There's no obvious way to fix it for now, just flag the error in the code
and add an "expected failure" kind of test.
This is similar to the __builtin__.cmp behaviour, but still not
straightforward, as the dailylife meaning of a comparison usually is
"find out if they are different".
the escaping of directories ending with .i or .d doesn't
really belong to filelog.
we put the encoding/decoding in store instead, for backwards
compat, streamclone and the fncache file format still uses the
partially encoded filenames.
This should be faster and more future-proof. Calls where the result is to be
sorted using util.sort() have been left unchanged. Calls to .items() on
configparser objects have been left as-is, too.
str.find return -1 when the substring is not found, -1 evaluate
to True and is a valid index, which can lead to bugs.
Using alternatives when possible makes the code clearer and less
prone to bugs. (and __contains__ is faster in microbenchmarks)
This changes revlog specify revlogng by default. Inlined
data files are also used unless a flags option is found in the .hgrc.
Some example hgrc files:
[revlog]
# use the original revlog format
format=0
[revlog]
# use revlogng. Because no flags are included, inlined data files
# also be selected
format=1
[revlog]
# use revlogng but do not inline the data files with the index
flags=
[revlog]
# the new default
format=1
flags=inline
revlogng results in smaller indexes, can address larger data files, and
supports flags and version numbers.
By default the original revlog format is used. To use the new format,
use the following .hgrc field:
[revlog]
# format choices are 0 (classic revlog format) and 1 revlogng
format=1