sapling/linelog/README
Jun Wu 68e5a15af7 linelog: add a README explaining the file format
Summary: This should help reviewers.

Test Plan: Run some spell checkers.

Reviewers: #mercurial, simonfar, ttung

Reviewed By: simonfar

Subscribers: mpm, akushner, ikostia, simonfar, mjpieters

Differential Revision: https://phabricator.intern.facebook.com/D3628440

Tasks: 12416202

Signature: t1:3628440:1469640270:a44f682e22b59615d0d9f12c93417c905b035ab6
2016-07-27 17:12:15 +01:00

220 lines
7.5 KiB
Plaintext

linelog is a storage format inspired by the "Interleaved deltas" idea. See
https://en.wikipedia.org/wiki/Interleaved_deltas for its introduction.
0. SCCS Weave
To understand what linelog is, first we have a quick look at a simplified
(with header removed) SCCS weave format, which is an implementation of the
"Interleaved deltas" idea.
0.1 Basic SCCS Weave File Format
A SCCS weave file consists of plain text lines. Each line is either a
special instruction starting with "^A" or part of the content of the real
file the weave tracks. There are 3 important operations, where REV denotes
the revision number:
^AI REV, marking the beginning of an insertion block introduced by REV
^AD REV, marking the beginning of a deletion block introduced by REV
^AE REV, marking the end of the block started by "^AI REV" or "^AD REV"
Note on revision numbers: For any two different revision numbers, one must
be an ancestor of the other to make them comparable. This enforces linear
history. Besides, the comparison functions (">=", "<") should be efficient.
This means, if revisions are strings like git or hg, an external map is
required to convert them into integers.
For example, to represent the following changes:
REV 1 | REV 2 | REV 3
------+-------+-------
a | a | a
b | b | 2
c | 1 | c
| 2 |
| c |
A possible weave file looks like:
^AI 1
a
^AD 3
b
^AI 2
1
^AE 3
2
^AE 2
c
^AE 1
An "^AE" does not always match its nearest operation ("^AI" or "^AD"). In
the above example, "^AE 3" does not match the nearest "^AI 2" but "^AD 3".
Therefore we need some extra information for "^AE". The SCCS weave uses a
revision number. It could also be a boolean value about whether it is an
insertion or a deletion (see section 0.4).
0.2 Checkout
The "checkout" operation is to retrieve file content at a given revision,
say X. It's doable by going through the file line by line and:
- If meet ^AI rev, and rev > X, find the corresponding ^AE and jump there
- If meet ^AD rev, and rev <= X, find the corresponding ^AE and jump there
- Ignore ^AE
- For normal lines, just output them
0.3 Annotate
The "annotate" operation is to show extra metadata like the revision number
and the original line number a line comes from.
It's basically just a "Checkout". For the extra metadata, they can be stored
side by side with the line contents. Alternatively, we can infer the
revision number from "^AI"s.
Some SCM tools have to calculate diffs on the fly and thus are much slower
on this operation.
0.4 Tree Structure
The word "interleaved" is used because "^AI" .. "^AE" and "^AD" .. "^AE"
blocks can be interleaved.
If we consider insertions and deletions separately, they can form tree
structures, respectively.
+--- ^AI 1 +--- ^AD 3
| +- ^AI 2 | +- ^AD 2
| | | |
| +- ^AE 2 | +- ^AE 2
| |
+--- ^AE 1 +--- ^AE 3
More specifically, it's possible to build a tree for all insertions, where
the tree node has the structure "(rev, startline, endline)". "startline" is
the line number of "^AI" and "endline" is the line number of the matched
"^AE". The tree will have these properties:
1. child.rev > parent.rev
2. child.startline > parent.startline
3. child.endline < parent.endline
A similar tree for all deletions can also be built with the first property
changed to:
1. child.rev < parent.rev
0.5 Malformed Cases
The following cases are considered malformed in our implementation:
1. Interleaved insertions, or interleaved deletions.
It can be rewritten to a non-interleaved tree structure.
^AI/D x ^AI/D x
^AI/D y -> ^AI/D y
^AE x ^AE y
^AE y ^AE x
2. Nested insertions, where the inner one has a smaller revision number.
It can be rewritten to a non-nested form.
^AI x + 1 ^AI x + 1
^AI x -> ^AE x + 1
^AE x ^AI x
^AE x + 1 ^AE x
3. Insertion or deletion inside another deletion, where the outer deletion
block has a smaller revision number.
^AD x ^AD x
^AI/D x + 1 -> ^AE x
^AE x + 1 ^AI/D x + 1
^AE x ^AE x
Some of them may be valid in other implementations for special purposes. For
example, to "revive" a previously deleted block in a newer revision.
0.6 Cases Can Be Optimized
It's always better to get things nested. For example, the left is more
efficient than the right while they represent the same content:
+--- ^AD 2 +- ^AD 1
| +- ^AD 1 | LINE A
| | LINE A +- ^AE 1
| +- ^AE 1 +- ^AD 2
| LINE B | LINE B
+--- ^AE 2 +- ^AE 2
Our implementation sometimes generates the less efficient data. To always
get the optimal form, it requires extra code complexity that seems unworthy.
0.7 Inefficiency
The file format can be slow because:
- Inserting a new line at position P requires rewriting all data after P.
- Finding "^AE" requires walking through the content (O(N), where N is the
number of lines between "^AI/D" and "^AE").
1. Linelog
The linelog is a binary format that dedicates to speed up mercurial (or
git)'s "annotate" operation. It's designed to avoid issues mentioned in
section 0.7.
1.1 Content Stored
Linelog is not another storage for file contents. It only stores line
numbers and corresponding revision numbers, instead of actual line content.
This is okay for the "annotate" operation because usually the external
source is fast to checkout the content of a file at a specific revision.
A typical SCCS weave is also fast on the "grep" operation, which needs
random accesses to line contents from different revisions of a file. This
can be slow with linelog's no-line-content design. However we could use
an extra map ((rev, line num) -> line content) to speed it up.
Note the revision numbers in linelog should be independent from mercurial
integer revision numbers. There should be some mapping between linelog rev
and hg hash stored side by side, to make the files reusable after being
copied to another machine.
1.2 Basic Format
A linelog file consists of "instruction"s. An "instruction" can be either:
- JGE REV ADDR # jump to ADDR if rev >= REV
- JL REV ADDR # jump to ADDR if rev < REV
- LINE REV LINENUM # append the (LINENUM+1)-th line in revision REV
For example, here is the example linelog representing the same file with
3 revisions mentioned in section 0.1:
SCCS | Linelog
Weave | Addr : Instruction
------+------+-------------
^AI 1 | 0 : JL 1 8
a | 1 : LINE 1 0
^AD 3 | 2 : JGE 3 6
b | 3 : LINE 1 1
^AI 2 | 4 : JL 2 7
1 | 5 : LINE 2 2
^AE 3 |
2 | 6 : LINE 2 3
^AE 2 |
c | 7 : LINE 1 2
^AE 1 |
| 8 : END
This way, "find ^AE" is O(1) because we just jump there. And we can insert
new lines without rewriting most part of the file by appending new lines and
changing a single instruction to jump to them.
The current implementation uses 64 bits for an instruction: The opcode (JGE,
JL or LINE) takes 2 bits, REV takes 30 bits and ADDR or LINENUM takes 32
bits. It also stores the max revision number and buffer size at the first
64 bits for quick access to these values.