mirror of
https://github.com/facebook/sapling.git
synced 2024-12-25 05:53:24 +03:00
newdoc: add Segmented Changelog section

Summary: I gathered information from various sources in this newdoc entry. The goal is to have an easy link that gives a general summary of what Segmented Changelog is and what it does.

Reviewed By: quark-zju

Differential Revision: D25908322

fbshipit-source-id: 8f3d2725afe8db5a9bc179189c287fd54aaca08b
This commit is contained in:
parent
46b6c1da7c
commit
10f855d28c
@@ -29,6 +29,7 @@ Contents:
   internals/Mutation
   internals/RequiresFile
   internals/RevlogNG
   internals/SegmentedChangelog
   internals/UnlinkingFilesOnWindows
   internals/Visibility
   internals/WhatGoesWhere
248
eden/scm/newdoc/dev/internals/SegmentedChangelog.rst
Normal file
@@ -0,0 +1,248 @@
Segmented Changelog
===================

The core of ``Segmented Changelog`` is an efficient representation for
``Directed Acyclic Graphs``. This new structure splits the concerns of the
``Changelog`` into 3 parts:

1. graph shape

2. node identifiers (commit hashes)

3. node information (commit messages)


Properties of Segmented Changelog:

* tiny graph shape

* efficient set representation (O(1) space for ``ancestors(master)``) and
  calculations in common cases

* efficient graph calculations: most algorithms range from ``O(1)`` to
  ``O(merges)``; the ``ancestor`` and ``common ancestor`` algorithms scale
  with the number of merges

* assumes a main branch that is updated with a pushrebase workflow

* some algorithms (e.g. descendants) do not scale well with many visible heads


Segments
--------

Commit hashes are hard to compress. To represent a graph losslessly and
efficiently, integer identities scale much better.

Given a graph, Segmented Changelog will assign (mostly) contiguous integer
identifiers to all nodes. The assignment of the integers needs to be a valid
topological sorting of the nodes in the graph.

A segment is a set of nodes with consecutive labels. We describe a segment
by the lowest and the highest integer that it contains::

    x:y = [i for i in range(x, y+1)]

Let ``parents(i)`` be the function that returns the list of parents of the
node labelled ``i``. We describe the parents of a segment as the identifiers
that are parents of nodes in the segment but are not part of the segment
themselves, with the order preserved::

    parents(x:y) = [p for i in range(x, y+1)
                      for p in parents(i)
                      if p not in range(x, y+1)]

As an observation, the parents of a segment composed of one node are equal
to the parents of that node::

    parents(i:i) == parents(i)

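The definition of segment parents translates almost directly into Python. The following is a minimal sketch, assuming the graph is available as a plain dictionary; the ``PARENTS`` table and the ``segment_parents`` name are illustrative and not part of Segmented Changelog itself.

```python
# Hypothetical parent table for a small graph: node id -> list of parent ids.
PARENTS = {
    1: [], 2: [1], 3: [], 4: [3],
    5: [2, 4], 6: [5], 7: [6], 8: [7],
}


def segment_parents(x, y, parents=PARENTS):
    """Return the parents of segment x:y that lie outside x:y, order preserved."""
    members = range(x, y + 1)
    return [p for i in members for p in parents[i] if p not in members]


print(segment_parents(5, 8))                # [2, 4]
print(segment_parents(1, 8))                # []
print(segment_parents(6, 6) == PARENTS[6])  # True
```

The last line checks the single-node observation: a segment ``i:i`` has exactly the parents of ``i``.
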
``Segments`` can be stacked to form multiple levels. A higher level segment
merges one or more contiguous lower level segments. For example::

    Level 0: 0:5, 6:10, 11:15, 16:20, ...
    Level 1: 0:10, 11:20, ...
    Level 2: 0:20, ...
    Level 3: ...

Segments in the lowest level (0) are called ``Flat Segments``. They
are special. They *and their parents* losslessly encode the shape of
the entire graph. Higher level segments are optional for correctness
but useful for better performance.

``Flat segment`` constraints:

* ``heads(x:y) = [y]``

* ``roots(x:y) = [x]``

* ``((x:y) - x) & merge() = []``

Example::

    3 -- 4 ---\         /- 9 -- 10 -\
              |         |           |
    1 -- 2 -- 5 -- 6 -- 7 -- 8 ---- 11 -- 12

The segments we have are:

* ``1:2``, ``parents(1:2) = []``

* ``3:4``, ``parents(3:4) = []``

* ``5:8``, ``parents(5:8) = [2, 4]``

* ``9:10``, ``parents(9:10) = [7]``

* ``11:12``, ``parents(11:12) = [8, 10]``

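The flat segments listed for this example can be reproduced with a short greedy pass over the ids. This is only a sketch under the assumption that ids are already assigned in topological order; ``build_flat_segments`` is a hypothetical helper, not the actual implementation.

```python
# Parent table for the example graph: node id -> list of parent ids.
PARENTS = {
    1: [], 2: [1], 3: [], 4: [3], 5: [2, 4], 6: [5],
    7: [6], 8: [7], 9: [7], 10: [9], 11: [8, 10], 12: [11],
}


def build_flat_segments(parents):
    """Greedily group ids into (low, high, segment_parents) tuples.

    A node extends the current segment only when its single parent is the
    previous id, which keeps one root, one head, and no merge after the root.
    """
    segments = []
    for i in sorted(parents):
        if segments and parents[i] == [i - 1] and segments[-1][1] == i - 1:
            low, _, seg_parents = segments[-1]
            segments[-1] = (low, i, seg_parents)
        else:
            # Roots, merges, and nodes branching off an older id start a segment.
            segments.append((i, i, list(parents[i])))
    return segments


for low, high, seg_parents in build_flat_segments(PARENTS):
    print(f"{low}:{high}, parents = {seg_parents}")
```

Because only the first node of a flat segment can have parents outside the segment, the segment parents are simply the parents of its lowest id.
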
We described our graph using the ``5:8`` segment. Note that ``5:8`` and ``5:7``
obey the ``Segment`` definition and can be used in different contexts. We can
describe this as a property:

* For ``x:y`` flat segment, ``x:i`` is a flat segment for ``x <= i <= y`` and
  ``parents(x:y) == parents(x:i)``.

A great property of flat segments is that they compress the original graph to
``O(|merge|)``. The parent function can be derived without any data loss.

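To illustrate why no data is lost, here is a hedged sketch of recovering ``parents(i)`` from the flat segments of the example graph alone. The tuple layout and the ``parents`` function are assumptions made for illustration.

```python
import bisect

# Flat segments of the example graph as (low, high, segment_parents).
FLAT_SEGMENTS = [
    (1, 2, []), (3, 4, []), (5, 8, [2, 4]), (9, 10, [7]), (11, 12, [8, 10]),
]
LOWS = [low for low, _, _ in FLAT_SEGMENTS]


def parents(i):
    """Recover parents(i) using only the flat segments."""
    idx = bisect.bisect_right(LOWS, i) - 1  # last segment with low <= i
    if idx < 0:
        raise KeyError(i)
    low, high, seg_parents = FLAT_SEGMENTS[idx]
    if i > high:
        raise KeyError(i)
    # The first node carries the segment parents; every other node in a flat
    # segment has exactly one parent, the previous id.
    return list(seg_parents) if i == low else [i - 1]


print(parents(5))   # [2, 4]
print(parents(7))   # [6]
print(parents(9))   # [7]
```
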
High-level Segments.

Constraints:

* ``heads(x:y) = [y]``

Properties:

* Compresses commits across merges.

* Using high level segments we cannot get parents of arbitrary nodes. ``Flat
  Segments`` are required.

Example. Continuing with the graph from the ``Flat Segments`` example, we have
the following high level segments:

* ``1:8``, ``parents(1:8) = []``

* ``9:12``, ``parents(9:12) = [7, 8]``

* ``1:12``, ``parents(1:12) = []``


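The single constraint on high-level segments, ``heads(x:y) = [y]``, can be checked mechanically. The sketch below assumes the example graph is given as a dictionary; ``heads`` and ``is_high_level_segment`` are hypothetical names used only for illustration.

```python
# Parent table for the example graph: node id -> list of parent ids.
PARENTS = {
    1: [], 2: [1], 3: [], 4: [3], 5: [2, 4], 6: [5],
    7: [6], 8: [7], 9: [7], 10: [9], 11: [8, 10], 12: [11],
}


def heads(x, y, parents=PARENTS):
    """Return the ids in x:y that have no child inside x:y."""
    members = range(x, y + 1)
    has_child = {p for i in members for p in parents[i] if p in members}
    return [i for i in members if i not in has_child]


def is_high_level_segment(x, y):
    """Check the constraint heads(x:y) = [y]."""
    return heads(x, y) == [y]


print(is_high_level_segment(1, 8))    # True
print(is_high_level_segment(9, 12))   # True
print(is_high_level_segment(1, 10))   # False, heads(1:10) is [8, 10]
```

The range ``1:10`` is rejected because both ``8`` and ``10`` are heads within it, so it cannot be used as a segment.
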
We can express sets using sets of ``Segments``.

Example:

* ``ancestors(12) = 1:12``

* ``ancestors(11) = 1:8 + 9:10 + 11:11``

* ``ancestors(10) = 1:2 + 3:4 + 5:7 + 9:10``

We can then operate on these sets in more complex ways. For computing common
ancestors we can write::

    ancestor(10, 8)
      = max(::10 & ::8)
      = max((1:7 + 9:10) & 1:8)
      = max(1:7)
      = 7

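The calculation above can be mirrored with sets stored as sorted lists of inclusive spans. This is a minimal sketch of that idea; the span-list representation and the ``intersect`` and ``max_id`` helpers are assumptions, not the real data structures.

```python
def intersect(a, b):
    """Intersect two span sets given as sorted lists of inclusive (low, high)."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        low = max(a[i][0], b[j][0])
        high = min(a[i][1], b[j][1])
        if low <= high:
            result.append((low, high))
        # Advance whichever span ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def max_id(spans):
    """Return the largest id contained in a non-empty span set."""
    return max(high for _, high in spans)


ancestors_10 = [(1, 7), (9, 10)]   # ::10
ancestors_8 = [(1, 8)]             # ::8
common = intersect(ancestors_10, ancestors_8)
print(common, max_id(common))      # [(1, 7)] 7
```

The cost of the intersection depends on the number of spans, not on the number of commits they cover.
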
Implementation Details
----------------------

Assigning sequential integers algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any topological order is valid though it may not be efficient. A ``Depth-First
Search`` order assignment yields good results. One optimization for merges is
to first assign numbers to the branch with the fewest merges.

To assign a number for ``x``, check its parents.

* If all parents have numbers, assign the next available number to ``x``.

* Otherwise, pick the parent with fewer merges. Assign it recursively.

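The two rules above can be sketched as a recursive depth-first assignment. The graph, the ``merge_count`` heuristic (merge commits among a node's ancestors), and all function names are illustrative assumptions; a production implementation would avoid recursion and recomputing ancestors.

```python
# Hypothetical graph: commit name -> list of parent names.
#
#   a -- b -- m -- d -- f
#   c -------/         /
#   e ----------------/
PARENTS = {
    "a": [], "b": ["a"], "c": [], "m": ["b", "c"],
    "d": ["m"], "e": [], "f": ["d", "e"],
}


def ancestors(node, parents):
    """Return the set of ancestors of node, including node itself."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen


def merge_count(node, parents):
    """Count merge commits among the ancestors of node (inclusive)."""
    return sum(1 for n in ancestors(node, parents) if len(parents[n]) > 1)


def assign_ids(head, parents):
    """Assign ids depth-first, visiting the parent with fewer merges first."""
    ids = {}

    def assign(x):
        if x in ids:
            return
        for p in sorted(parents[x], key=lambda p: merge_count(p, parents)):
            assign(p)
        ids[x] = len(ids)  # all parents are numbered, take the next free id

    assign(head)
    return ids


print(assign_ids("f", PARENTS))
```

In this graph the merge ``f`` numbers its merge-free parent ``e`` before descending into ``d``, whose history contains the merge ``m``.
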
Groups
~~~~~~

Efficiency of ``Segmented Changelog`` requires efficient representation of
sets. For example, ``ancestors(master)`` is ideally ``0:x``, not ``0:i +
j:x``. To ensure that, ``Segmented Changelog`` reserves a large integer space
for master and tries to avoid assigning master + 1 to non-master nodes. This is
achieved by ``Groups``.

In our ``Changelog`` use case, we have the ``Master`` group and the
``NonMaster`` group. It is generally assumed that the ``Master`` group will
have a lot of commits (and we mostly care about them). ``Master`` commits are
assigned starting from ``0``. ``NonMaster`` commits are assigned from ``2^56``
onwards. Local commits are stored in the ``NonMaster`` section. This has the
purpose of preserving efficient ``Segment`` construction for the ``Master``
section as new commits get added on top of this section. This allows the
``Master`` section to have segments with maximum length.

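As a rough illustration of the two id ranges, the sketch below hands out ids per group. The ``IdAllocator`` class, the group names as strings, and ``group_of`` are hypothetical; only the starting points ``0`` and ``2^56`` come from the text.

```python
MASTER_START = 0
NON_MASTER_START = 2 ** 56  # start of the NonMaster id range, per the text


class IdAllocator:
    """Hand out the next free id in each group (hypothetical helper)."""

    def __init__(self):
        self.next_id = {"master": MASTER_START, "non_master": NON_MASTER_START}

    def allocate(self, group):
        new_id = self.next_id[group]
        self.next_id[group] += 1
        return new_id


def group_of(id_):
    """Tell which group an id belongs to from its value alone."""
    return "non_master" if id_ >= NON_MASTER_START else "master"


alloc = IdAllocator()
m0 = alloc.allocate("master")
local = alloc.allocate("non_master")
m1 = alloc.allocate("master")
print(m0, m1, local == 2 ** 56)   # 0 1 True
```

Allocating a local commit between two master commits does not disturb the master range, so ``ancestors(master)`` can stay a single span.
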
Core Components
~~~~~~~~~~~~~~~

We described 3 concepts in the beginning:

1. graph shape, which we also call ``IdDag``.

2. node identifiers, which we also call ``IdMap``.

3. node information, which we also call ``HgCommits``.

The ``IdDag`` describes all graph algorithms, and all of its operations
leverage segments. The ``IdMap`` is a bidirectional map from ``Segmented
Changelog Id`` to node identifier, in the case of the changelog the ``Sha1``
hash of the commit. ``HgCommits`` is a simple key value store from ``Sha1``
commit hash to commit message.

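To make the bidirectional nature of the ``IdMap`` concrete, here is a small in-memory sketch. It is not the real storage layer; the class shape, method names, and placeholder hashes are assumptions for illustration only.

```python
class IdMap:
    """In-memory bidirectional map between integer ids and commit hashes."""

    def __init__(self):
        self._id_to_hash = {}
        self._hash_to_id = {}

    def insert(self, id_, commit_hash):
        # Each id and each hash may appear at most once.
        if id_ in self._id_to_hash or commit_hash in self._hash_to_id:
            raise ValueError("id or hash already mapped")
        self._id_to_hash[id_] = commit_hash
        self._hash_to_id[commit_hash] = id_

    def hash_of(self, id_):
        return self._id_to_hash[id_]

    def id_of(self, commit_hash):
        return self._hash_to_id[commit_hash]


idmap = IdMap()
idmap.insert(0, "aa" * 20)   # placeholder 40-character hex hashes
idmap.insert(1, "bb" * 20)
print(idmap.id_of("bb" * 20), idmap.hash_of(0) == "aa" * 20)   # 1 True
```

Graph algorithms can then work purely on integers and translate to hashes only at the boundaries.
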
For the client, ``IdDag``, ``IdMap`` and ``HgCommits`` are stored using
``IndexedLog``.

Server implementation
~~~~~~~~~~~~~~~~~~~~~

For the server, ``IdDag`` is stored fully in process for all serving instances.
At startup, an ``IdDag`` is downloaded from the ``Blobstore``, then it is
updated locally as commits come in. Protocols do not depend on "internal"
``Segmented Changelog Ids``; instead, they express shape relative to common
node identifiers (commit hashes).

The ``IdMap`` is stored in a ``SQL`` store. For the servers, the ``IdMap``
additionally stores a version. This is done to facilitate regenerating the
whole ``IdMap`` while continuing to serve requests. ``IdMap`` regeneration may
be performed when better algorithms are developed or bugs are uncovered.

``HgCommits`` have their own ``SQL`` storage. For the purpose of ``Segmented
Changelog`` we only care that we can query the respective commit identifier that
is stored in the ``IdMap``.

To construct the ``Segmented Changelog`` from a repository we use the
``Segmented Changelog Seeder``. The ``Seeder`` will construct a new ``IdMap``
version and an initial ``IdDag``.

We mentioned that at startup an ``IdDag`` has to be loaded from the
``Blobstore``, so we need a process that updates the ``IdMap`` and the
``IdDag``. This process is the ``Segmented Changelog Tailer``. It periodically
checks for new commits and incrementally adds entries to the ``IdMap`` and
reconstructs the ``IdDag``.

The client communicates with the server using ``EdenApi`` endpoints:

* ``/{repo}/commit/revlog_data``: commit content by ``Sha1`` hash in the format
  that verifies the ``Sha1`` hash.

* ``/{repo}/commit/location_to_hash``: used to translate compressed commit info
  (graph location) to hashes.

* ``/{repo}/commit/hash_to_location``: used to translate hashes to compressed
  commit info (graph location).

* ``/{repo}/clone``: downloads ``SegmentedChangelog`` repository data.

* ``/{repo}/full_idmap_clone``: downloads ``SegmentedChangelog`` repository
  data along with all commit hashes in the commit graph.