mirror of
https://github.com/facebook/sapling.git
synced 2024-10-11 01:07:15 +03:00
1802036ff3
Summary: The `lib/linelog` directory contains pure C code that is unrelated from either Mercurial or Python. The `mercurial/cyext` contains Cython extension code (although for linelog's case, the Cython extension is unrelated from Mercurial). Cython is now a hard dependence to simplify the code. Test Plan: `make local` and check `from mercurial.cyext import linelog` works. Reviewers: durham, #mercurial Reviewed By: durham Subscribers: durham, fried Differential Revision: https://phabricator.intern.facebook.com/D6678541 Signature: 6678541:1515455512:967266dc69c702dbff95fdea05671e11c32ebf28
252 lines
9.1 KiB
Plaintext
252 lines
9.1 KiB
Plaintext
linelog is a storage format inspired by the "Interleaved deltas" idea. See
|
|
https://en.wikipedia.org/wiki/Interleaved_deltas for its introduction.
|
|
|
|
0. SCCS Weave
|
|
|
|
To understand what linelog is, first we have a quick look at a simplified
|
|
(with header removed) SCCS weave format, which is an implementation of the
|
|
"Interleaved deltas" idea.
|
|
|
|
0.1 Basic SCCS Weave File Format
|
|
|
|
A SCCS weave file consists of plain text lines. Each line is either a
|
|
special instruction starting with "^A" or part of the content of the real
|
|
file the weave tracks. There are 3 important operations, where REV denotes
|
|
the revision number:
|
|
|
|
^AI REV, marking the beginning of an insertion block introduced by REV
|
|
^AD REV, marking the beginning of a deletion block introduced by REV
|
|
^AE REV, marking the end of the block started by "^AI REV" or "^AD REV"
|
|
|
|
Note on revision numbers: For any two different revision numbers, one must
|
|
be an ancestor of the other to make them comparable. This enforces linear
|
|
history. Besides, the comparison functions (">=", "<") should be efficient.
|
|
This means, if revisions are strings like git or hg, an external map is
|
|
required to convert them into integers.
|
|
|
|
For example, to represent the following changes:
|
|
|
|
REV 1 | REV 2 | REV 3
|
|
------+-------+-------
|
|
a | a | a
|
|
b | b | 2
|
|
c | 1 | c
|
|
| 2 |
|
|
| c |
|
|
|
|
A possible weave file looks like:
|
|
|
|
^AI 1
|
|
a
|
|
^AD 3
|
|
b
|
|
^AI 2
|
|
1
|
|
^AE 3
|
|
2
|
|
^AE 2
|
|
c
|
|
^AE 1
|
|
|
|
An "^AE" does not always match its nearest operation ("^AI" or "^AD"). In
|
|
the above example, "^AE 3" does not match the nearest "^AI 2" but "^AD 3".
|
|
Therefore we need some extra information for "^AE". The SCCS weave uses a
|
|
revision number. It could also be a boolean value about whether it is an
|
|
insertion or a deletion (see section 0.4).
|
|
|
|
0.2 Checkout
|
|
|
|
The "checkout" operation is to retrieve file content at a given revision,
|
|
say X. It's doable by going through the file line by line and:
|
|
|
|
- If meet ^AI rev, and rev > X, find the corresponding ^AE and jump there
|
|
- If meet ^AD rev, and rev <= X, find the corresponding ^AE and jump there
|
|
- Ignore ^AE
|
|
- For normal lines, just output them
|
|
|
|
0.3 Annotate
|
|
|
|
The "annotate" operation is to show extra metadata like the revision number
|
|
and the original line number a line comes from.
|
|
|
|
It's basically just a "Checkout". For the extra metadata, they can be stored
|
|
side by side with the line contents. Alternatively, we can infer the
|
|
revision number from "^AI"s.
|
|
|
|
Some SCM tools have to calculate diffs on the fly and thus are much slower
|
|
on this operation.
|
|
|
|
0.4 Tree Structure
|
|
|
|
The word "interleaved" is used because "^AI" .. "^AE" and "^AD" .. "^AE"
|
|
blocks can be interleaved.
|
|
|
|
If we consider insertions and deletions separately, they can form tree
|
|
structures, respectively.
|
|
|
|
+--- ^AI 1 +--- ^AD 3
|
|
| +- ^AI 2 | +- ^AD 2
|
|
| | | |
|
|
| +- ^AE 2 | +- ^AE 2
|
|
| |
|
|
+--- ^AE 1 +--- ^AE 3
|
|
|
|
More specifically, it's possible to build a tree for all insertions, where
|
|
the tree node has the structure "(rev, startline, endline)". "startline" is
|
|
the line number of "^AI" and "endline" is the line number of the matched
|
|
"^AE". The tree will have these properties:
|
|
|
|
1. child.rev > parent.rev
|
|
2. child.startline > parent.startline
|
|
3. child.endline < parent.endline
|
|
|
|
A similar tree for all deletions can also be built with the first property
|
|
changed to:
|
|
|
|
1. child.rev < parent.rev
|
|
|
|
0.5 Malformed Cases
|
|
|
|
The following cases are considered malformed in our implementation:
|
|
|
|
1. Interleaved insertions, or interleaved deletions.
|
|
It can be rewritten to a non-interleaved tree structure.
|
|
|
|
^AI/D x ^AI/D x
|
|
^AI/D y -> ^AI/D y
|
|
^AE x ^AE y
|
|
^AE y ^AE x
|
|
|
|
2. Nested insertions, where the inner one has a smaller revision number.
|
|
It can be rewritten to a non-nested form.
|
|
|
|
^AI x + 1 ^AI x + 1
|
|
^AI x -> ^AE x + 1
|
|
^AE x ^AI x
|
|
^AE x + 1 ^AE x
|
|
|
|
3. Insertion or deletion inside another deletion, where the outer deletion
|
|
block has a smaller revision number.
|
|
|
|
^AD x ^AD x
|
|
^AI/D x + 1 -> ^AE x
|
|
^AE x + 1 ^AI/D x + 1
|
|
^AE x ^AE x
|
|
|
|
Some of them may be valid in other implementations for special purposes. For
|
|
example, to "revive" a previously deleted block in a newer revision.
|
|
|
|
0.6 Cases Can Be Optimized
|
|
|
|
It's always better to get things nested. For example, the left is more
|
|
efficient than the right while they represent the same content:
|
|
|
|
+--- ^AD 2 +- ^AD 1
|
|
| +- ^AD 1 | LINE A
|
|
| | LINE A +- ^AE 1
|
|
| +- ^AE 1 +- ^AD 2
|
|
| LINE B | LINE B
|
|
+--- ^AE 2 +- ^AE 2
|
|
|
|
Our implementation sometimes generates the less efficient data. To always
|
|
get the optimal form, it requires extra code complexity that seems unworthy.
|
|
|
|
0.7 Inefficiency
|
|
|
|
The file format can be slow because:
|
|
|
|
- Inserting a new line at position P requires rewriting all data after P.
|
|
- Finding "^AE" requires walking through the content (O(N), where N is the
|
|
number of lines between "^AI/D" and "^AE").
|
|
|
|
1. Linelog
|
|
|
|
The linelog is a binary format that dedicates to speed up mercurial (or
|
|
git)'s "annotate" operation. It's designed to avoid issues mentioned in
|
|
section 0.7.
|
|
|
|
1.1 Content Stored
|
|
|
|
Linelog is not another storage for file contents. It only stores line
|
|
numbers and corresponding revision numbers, instead of actual line content.
|
|
This is okay for the "annotate" operation because usually the external
|
|
source is fast to checkout the content of a file at a specific revision.
|
|
|
|
A typical SCCS weave is also fast on the "grep" operation, which needs
|
|
random accesses to line contents from different revisions of a file. This
|
|
can be slow with linelog's no-line-content design. However we could use
|
|
an extra map ((rev, line num) -> line content) to speed it up.
|
|
|
|
Note the revision numbers in linelog should be independent from mercurial
|
|
integer revision numbers. There should be some mapping between linelog rev
|
|
and hg hash stored side by side, to make the files reusable after being
|
|
copied to another machine.
|
|
|
|
1.2 Basic Format
|
|
|
|
A linelog file consists of "instruction"s. An "instruction" can be either:
|
|
|
|
- JGE REV ADDR # jump to ADDR if rev >= REV
|
|
- JL REV ADDR # jump to ADDR if rev < REV
|
|
- LINE REV LINENUM # append the (LINENUM+1)-th line in revision REV
|
|
|
|
For example, here is the example linelog representing the same file with
|
|
3 revisions mentioned in section 0.1:
|
|
|
|
SCCS | Linelog
|
|
Weave | Addr : Instruction
|
|
------+------+-------------
|
|
^AI 1 | 0 : JL 1 8
|
|
a | 1 : LINE 1 0
|
|
^AD 3 | 2 : JGE 3 6
|
|
b | 3 : LINE 1 1
|
|
^AI 2 | 4 : JL 2 7
|
|
1 | 5 : LINE 2 2
|
|
^AE 3 |
|
|
2 | 6 : LINE 2 3
|
|
^AE 2 |
|
|
c | 7 : LINE 1 2
|
|
^AE 1 |
|
|
| 8 : END
|
|
|
|
This way, "find ^AE" is O(1) because we just jump there. And we can insert
|
|
new lines without rewriting most part of the file by appending new lines and
|
|
changing a single instruction to jump to them.
|
|
|
|
The current implementation uses 64 bits for an instruction: The opcode (JGE,
|
|
JL or LINE) takes 2 bits, REV takes 30 bits and ADDR or LINENUM takes 32
|
|
bits. It also stores the max revision number and buffer size at the first
|
|
64 bits for quick access to these values.
|
|
|
|
1.3 Comparing with Mercurial's revlog format
|
|
|
|
Apparently, linelog is very different from revlog: linelog stores rev and
|
|
line numbers, while revlog has line contents and other metadata (like
|
|
parents, flags). However, the revlog format could also be used to store rev
|
|
and line numbers. For example, to speed up the annotate operation, we could
|
|
also pre-calculate annotate results and just store them using the revlog
|
|
format.
|
|
|
|
Therefore, linelog is actually somehow similar to revlog, with the important
|
|
trade-off that it only supports linear history (mentioned in section 0.1).
|
|
Essentially, the differences are:
|
|
|
|
a) Linelog is full of deltas, while revlog could contain full file
|
|
contents sometimes. So linelog is smaller. Revlog could trade
|
|
reconstruction speed for file size - best case, revlog is as small as
|
|
linelog.
|
|
b) The interleaved delta structure allows skipping large portion of
|
|
uninteresting deltas so linelog's content reconstruction is faster than
|
|
the delta-only version of revlog (however it's possible to construct
|
|
a case where interleaved deltas degrade to plain deltas, so linelog
|
|
worst case would be delta-only revlog). Revlog could trade file size
|
|
for reconstruction speed.
|
|
c) Linelog implicitly maintains the order of all lines it stores. So it
|
|
could dump all the lines from all revisions, with a reasonable order.
|
|
While revlog could also dump all line additions, it requires extra
|
|
computation to figure out the order putting those lines - that's some
|
|
kind of "merge".
|
|
|
|
"c" makes "hg absorb" easier to implement and makes it possible to do
|
|
"annotate --deleted".
|