From 0bfa034edb6e67f3d4a2264bbc01f88dfc691a77 Mon Sep 17 00:00:00 2001
From: Philip C Monk
Date: Fri, 18 Sep 2015 16:45:09 -0400
Subject: [PATCH] added clay architecture doc

---
 pub/doc/arvo/clay/architecture.md | 415 ++++++++++++++++++++++++++++++
 1 file changed, 415 insertions(+)
 create mode 100644 pub/doc/arvo/clay/architecture.md

diff --git a/pub/doc/arvo/clay/architecture.md b/pub/doc/arvo/clay/architecture.md
new file mode 100644
index 000000000..ed65ccd26
--- /dev/null
+++ b/pub/doc/arvo/clay/architecture.md
@@ -0,0 +1,415 @@
+# clay
+
+## high-level
+
+clay is the primary filesystem for the arvo operating system,
+which is the core of an urbit. The architecture of clay is
+intrinsically connected with arvo, but we assume no knowledge of
+either arvo or urbit. We will point out only those features of
+arvo that are necessary for an understanding of clay, and we will
+do so only when they arise.
+
+The first relevant feature of arvo is that it is a deterministic
+system where input and output are defined as a series of events
+and effects. The state of arvo is simply a function of its event
+log. None of the effects from an event are emitted until the
+event is entered in the log and persisted, either to disk or
+another trusted source of persistence, such as a Kafka cluster.
+Consequently, arvo is a single-level store: everything in its
+state is persistent.
+
+In a more traditional OS, everything in RAM can be erased at any
+time by power failure, and is always erased on reboot. Thus, a
+primary purpose of a filesystem is to ensure files persist across
+power failures and reboots. In arvo, both power failures and
+reboots are special cases of suspending computation, which is
+done safely since our event log is already persistent. Therefore,
+clay is not needed in arvo for persistence. Why, then, do we have a
+filesystem? There are two answers to this question.
+
+First, clay provides a filesystem tree, which is a convenient
+user interface for some applications. Unix has the useful concept
+of virtual filesystems, which are used for everything from direct
+access to devices, to random number generators, to the /proc
+tree. It is easy and intuitive to read from and write to a
+filesystem tree.
+
+Second, clay has a distributed revision control system baked into
+it. Traditional filesystems are not revision controlled, so
+userspace software -- such as git -- is written on top of them to
+do so. clay natively provides the same functionality as modern
+DVCSes, and more.
+
+clay has two other unique properties that we'll cover later on:
+it supports typed data and is referentially transparent.
+
+### Revision Control
+
+Every urbit has one or more "desks", which are independently
+revision-controlled branches. Each desk contains its own mark
+definitions, apps, doc, and so forth.
+
+Traditionally, an urbit has at least a base and a home desk. The
+base desk has all the system software from the distribution. The
+home desk is a fork of base with all the stuff specific to the
+user of the urbit.
+
+A desk is a series of numbered commits, the most recent of which
+represents the current state of the desk. A commit is composed of
+(1) an absolute time when it was created, (2) a list of zero or
+more parents, and (3) a map from paths to data.
+
+Most commits have exactly one parent, but the initial commit on a
+desk may have zero parents, and merge commits have more than one
+parent.
+
+The non-meta data is stored in the map of paths to data. It's
+worth noting that no constraints are put on this map, so, for
+example, both /a/b and /a/b/c could have data. This is impossible
+in a traditional Unix filesystem since it means that /a/b is both
+a file and a directory. Conventionally, the final element in the
+path is its mark -- much like a filename extension in Unix.
+Thus, /doc/readme.md in Unix is stored as /doc/readme/md in urbit.
+
+The data is not stored directly in the map; rather, a hash of the
+data is stored, and we maintain a master blob store. Thus, if the
+same data is referred to in multiple commits (as, for example,
+when a file doesn't change between commits), only the hash is
+duplicated.
+
+In the master blob store, we either store the data directly, or
+else we store a diff against another blob. The hash is dependent
+only on the data within and not on whether or not it's stored
+directly, so we may on occasion rearrange the contents of the
+blob store for performance reasons.
+
+Recall that a desk is a series of numbered commits. Not every
+commit in a desk must be numbered. For example, if the base desk
+has had 50 commits since home was forked from it, then a merge
+from base to home will only add a single revision number to home,
+although the full commit history will be accessible by traversing
+the parentage of the individual commits.
+
+We do guarantee that the first commit is numbered 1, commits are
+numbered consecutively after that (i.e. there are no "holes"),
+the topmost commit is always numbered, and every numbered commit
+is an ancestor of every later numbered commit.
+
+There are three ways to refer to particular commits in the
+revision history. Firstly, one can use the revision number.
+Secondly, one can use any absolute time between one numbered
+commit and the next (inclusive of the first, exclusive of the
+second). Thirdly, every desk has a map of labels to revision
+numbers. These labels may be used to refer to specific commits.
+
+Additionally, clay is a global filesystem, so data on other urbits
+is easily accessible the same way as data on our local urbit. In
+general, the path to a particular revision of a desk is
+/~urbit-name/desk-name/revision. Thus, to get /try/readme/md
+from revision 5 of the home desk on ~sampel-sipnym, we refer to
+/~sampel-sipnym/home/5/try/readme/md.
+Clay's namespace is thus global and referentially transparent.
+
+XXX reactivity here?
+
+### A Typed Filesystem
+
+Since clay is a general filesystem for storing data of arbitrary
+types, in order to revision control correctly it needs to be
+aware of types all the way through. Traditional revision control
+does an excellent job of handling source code, so for source code
+we act very similarly to traditional revision control. The
+challenge is to handle other data similarly well.
+
+For example, modern VCSs generally support "binary files", which
+are files for which the standard textual diffing, patching, and
+merging algorithms are not helpful. A "diff" of two binary files
+is just a pair of the files, "patching" this diff is just
+replacing the old file with the new one, and "merging"
+non-identical diffs is always a conflict, which can't even be
+helpfully annotated. Without knowing anything about the structure
+of a blob of data, this is the best we can do.
+
+Often, though, "binary" files have some internal structure, and
+it is possible to create diff, patch, and merge algorithms that
+take advantage of this structure. An image may be the result of a
+base image with some set of operations applied. With algorithms
+aware of this set of operations, not only can revision control
+software save space by not having to save every revision of the
+image individually, these transformations can be made on parallel
+branches and merged at will.
+
+Suppose Alice is tasked with touching up a picture, improving the
+color balance, adjusting the contrast, and so forth, while Bob
+has the job of cropping the picture to fit where it's needed and
+adding textual overlay. Without type-aware revision control,
+these changes must be made serially, requiring Alice and Bob to
+explicitly coordinate their efforts. With type-aware revision
+control, these operations may be performed in parallel, and then
+the two changesets can be merged programmatically.
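
The Alice and Bob scenario can be made concrete with a small Python sketch. It assumes, purely for illustration, that a picture's edits are recorded as a flat dictionary of named settings; the `merge_fields` function and the field names are hypothetical and are not how clay represents images. The point is only that a merge routine aware of the data's structure can combine changes that a whole-file "binary" merge would have to report as a conflict.

```python
def merge_fields(base, alice, bob):
    """Three-way merge of flat settings dicts.

    Returns the merged dict, or None if both sides changed the same
    field to different values. A missing key is treated as "unset".
    """
    merged = {}
    for key in set(base) | set(alice) | set(bob):
        b, a, o = base.get(key), alice.get(key), bob.get(key)
        if a == o:          # both sides agree (or neither touched it)
            value = a
        elif a == b:        # only Bob changed this field
            value = o
        elif o == b:        # only Alice changed this field
            value = a
        else:               # both changed it, differently: conflict
            return None
        if value is not None:
            merged[key] = value
    return merged


base = {"contrast": 1.0, "color_balance": "neutral"}
# Alice touches up color and contrast.
alice = {"contrast": 1.2, "color_balance": "warm"}
# Bob crops the picture and adds a text overlay.
bob = {"contrast": 1.0, "color_balance": "neutral",
       "crop": (0, 0, 800, 600), "overlay": "Hello"}

merged = merge_fields(base, alice, bob)
conflict = merge_fields(base, alice, {"contrast": 0.8, "color_balance": "neutral"})
```

Here `merged` carries both Alice's and Bob's changes, while `conflict` is `None` because both sides changed `contrast` to different values.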
+
+Of course, even some kinds of text files may be better served by
+diff, patch, and merge algorithms aware of the structure of the
+files. Consider a file containing a pretty-printed JSON object.
+Small changes in the JSON object may result in rather significant
+changes in how the object is pretty-printed (for example, by
+adding an indentation level or splitting a single line into
+multiple lines).
+
+A text file wrapped at 80 columns also reacts suboptimally with
+unadorned Hunt-McIlroy diffs. A single word inserted in a
+paragraph may push the final word or two of the line onto the
+next line, and the entire rest of the paragraph may be flagged as
+a change. Two diffs consisting of a single added word to
+different sentences may be flagged as a conflict. In general,
+prose should be diffed by sentence, not by line.
+
+As far as the author is aware, clay is the first generalized,
+type-aware revision control system. We'll go into the workings
+of this system in some detail.
+
+### Marks
+
+Central to a typed filesystem is the idea of types. In clay, we
+call these "marks". A mark is a file that defines a type,
+conversion routines to and from the mark, and diff, patch, and
+merge routines.
+
+For example, a `%txt` mark may be a list of lines of text, and it
+may include conversions to `%mime` to allow it to be serialized
+and sent to a browser or to the Unix filesystem. It will also
+include Hunt-McIlroy diff, patch, and merge algorithms.
+
+A `%json` mark would be defined as a json object in the code, and
+it would have a parser to convert from `%txt` and a printer to
+convert back to `%txt`. The diff, patch, and merge algorithms are
+fairly straightforward for json, though they're very different
+from the text ones.
+
+More formally, a mark is a core with three arms, `++grab`,
+`++grow`, and `++grad`. In `++grab` is a series of functions to
+convert from other marks to the given mark.
+In `++grow` is a series of functions to convert from the given mark to other
+marks. In `++grad` is `++diff`, `++pact`, `++join`, and `++mash`.
+
+The types are as follows, in an informal pseudocode:
+
+    ++  grab:
+      ++  mime: mime -> mark-type
+      ++  txt: txt -> mark-type
+      ...
+    ++  grow:
+      ++  mime: mark-type -> mime
+      ++  txt: mark-type -> txt
+      ...
+    ++  grad
+      ++  diff: (mark-type, mark-type) -> diff-type
+      ++  pact: (mark-type, diff-type) -> mark-type
+      ++  join: (diff-type, diff-type) -> diff-type or NULL
+      ++  mash: (diff-type, diff-type) -> diff-type
+
+These types are basically what you would expect. Not every mark
+has each of these functions defined -- all of them are optional
+in the general case.
+
+In general, for a particular mark, the `++grab` and `++grow` entries
+(if they exist) should be inverses of each other.
+
+In `++grad`, `++diff` takes two instances of a mark and produces
+a diff of them. `++pact` takes an instance of a mark and patches
+it with the given diff. `++join` takes two diffs and attempts to
+merge them into a single diff. If there are conflicts, it
+produces null. `++mash` takes two diffs and forces a merge,
+annotating any conflicts.
+
+In general, if `++diff` called with A and B produces diff D, then
+`++pact` called with A and D should produce B. Also, if `++join`
+of two diffs does not produce null, then `++mash` of the same
+diffs should produce the same result.
+
+Alternatively, instead of `++diff`, `++pact`, `++join`, and
+`++mash`, a mark can provide the same functionality by defining
+`++sted` to be the name of another mark to which we wish to
+delegate the revision control responsibilities. Then, before
+running any of those functions, clay will convert to the other
+mark, and convert back afterward. For example, the `%hoon` mark
+is revision-controlled in the same way as `%txt`, so its `++grad`
+is simply `++sted %txt`. Of course, `++txt` must be defined in
+`++grow` and `++grab` as well.
+
+Every file in clay has a mark, and that mark must have a
+fully-functioning `++grad`.
+Marks are used for more than just clay, and other marks don't need
+a `++grad`, but if a piece of data is to be saved to clay, we must
+know how to revision-control it.
+
+Additionally, if a file is to be synced out to Unix, then it must
+have conversion routines to and from the `%mime` mark.
+
+## Using clay
+
+### Reading and Subscribing
+
+When reading from Clay, there are three types of requests. A
+`%sing` request asks for data at a single revision. A `%next`
+request asks to be notified the next time there's a change to
+a given file. A `%many` request asks to be notified on every
+change in a desk over a range of revisions.
+
+For `%sing` and `%next`, there are generally three things to be
+queried. A `%u` request simply checks for the existence of a
+file at a path. A `%x` request gets the data in the file at a
+path. A `%y` request gets a hash of the data in the file at the
+path combined with all its children and their data. Thus, `%y`
+of a node changes if it or any of its children change.
+
+A `%sing` request is fulfilled immediately if possible. If the
+requested revision is in the future, or is on another ship for
+which we don't have the result cached, we don't respond
+immediately. If the requested revision is in the future, we wait
+until the revision happens before we respond to the request. If
+the request is for data on another ship, we pass on the request
+to the other ship. In general, Clay subscriptions, like most
+things in Urbit, aren't guaranteed to return immediately.
+They'll return when they can, and they'll do so in a
+referentially transparent manner.
+
+A `%next` request checks the query at the given revision, and it
+produces the result of the query the next time it changes, along
+with the revision number when it changes.
+Thus, a `%next` of a `%u` is triggered when a file is added or
+deleted, a `%next` of a `%x` is triggered when a file is added,
+deleted, or changed, and a `%next` of a `%y` is triggered when a
+file or any of its children is added, deleted, or changed.
+
+A `%many` request is triggered every time the given desk has a
+new revision. Unlike a `%next`, a `%many` has both a start and
+an end revision, after which it stops returning. For `%next`, a
+single change is reported, and if the caller wishes to hear of
+the next change, it must resubscribe. For `%many`, every revision
+from the start to the end triggers a response. Since a `%many`
+request doesn't ask for any particular data, there aren't `%u`,
+`%x`, and `%y` versions for it.
+
+### Unix sync
+
+One of the primary functions of clay is as a convenient user
+interface. While tools exist to use clay from within urbit, it's
+often useful to be able to treat clay like any other filesystem
+from the Unix perspective -- to "mount" it, as it were.
+
+From urbit, you can run `|mount /path/to/directory %mount-point`,
+and this will mount the given clay directory to the mount-point
+directory in Unix. Every file is converted to `%mime` before it's
+written to Unix, and converted back when read from Unix. The
+entire directory is watched (a la Dropbox), and every change is
+auto-committed to clay.
+
+### Merging
+
+Merging is a fundamental operation for a distributed revision
+control system. At their root, clay's merges are similar to
+git's, but with some additions to accommodate typed data. There
+are seven different merge strategies.
+
+Throughout our discussion, we'll say that the merge is from
+Alice's desk to Bob's. Recall that a commit is a date (for all
+new commits this will be the current date), a list of parents,
+and the data itself.
+
+A `%init` merge should be used iff it's the first commit to a
+desk. The head of Alice's desk is used as the number 1 commit to
+Bob's desk.
+Obviously, the ancestry remains intact through traversing the parentage of
+the commit even though previous commits are not numbered for Bob's desk.
+
+A `%this` merge means to keep what's in Bob's desk, but join the
+ancestry. Thus, the new commit has the head of each desk as
+parents, but the data is exactly what's in Bob's desk. For those
+following along in git, this is the 'ours' merge strategy, not
+the '--ours' option to the 'recursive' merge strategy. In other
+words, even if Alice makes a change that does not conflict with
+Bob, we throw it away. It's Bob's way or the highway.
+
+A `%that` merge means to take what's in Alice's desk, but join
+the ancestry. This is the reverse of `%this`.
+
+A `%fine` merge is a "fast-forward" merge. This succeeds iff one
+head is in the ancestry of the other. In this case, we use the
+descendant as our new head.
+
+For `%meet`, `%mate`, and `%meld` merges, we first find the most
+recent common ancestor to use as our merge base. If we have no
+common ancestors, then we fail. If we have more than one most
+recent common ancestor, then we have a criss-cross situation,
+which should be handled delicately. At present, we delicately
+throw up our hands and give up, but something akin to git's
+'recursive' strategy should be implemented in the future.
+
+There's a functional inclusion ordering on `%fine`, `%meet`,
+`%mate`, and `%meld` such that if an earlier strategy would have
+succeeded, then every later strategy will produce the same
+result. Put another way, every earlier strategy is the same as
+every later strategy except with a restricted domain.
+
+A `%meet` merge only succeeds if the changes from the merge base
+to Alice's head (hereafter, "Alice's changes") are in different
+files than Bob's changes. In this case, the parents are both
+Alice's and Bob's heads, and the data is the merge base plus
+Alice's changed files plus Bob's changed files.
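
The `%meet` rule just described can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not clay's implementation: each commit's data is a plain dict from path to file contents (clay itself stores hashes into a blob store), the merge base is passed in rather than computed from the ancestry, and the function names `changed_paths` and `meet` are made up for this example. A real merge commit would also record both heads as parents.

```python
def changed_paths(base, head):
    """Paths that were added, deleted, or modified between base and head."""
    return {p for p in set(base) | set(head) if base.get(p) != head.get(p)}


def meet(base, alice, bob):
    """Merge two heads if they changed disjoint sets of files.

    Returns the merged path -> contents map, or None if any file was
    changed on both sides (the case %meet refuses to handle).
    """
    alice_changed = changed_paths(base, alice)
    bob_changed = changed_paths(base, bob)
    if alice_changed & bob_changed:
        return None
    merged = dict(base)
    for head, changed in ((alice, alice_changed), (bob, bob_changed)):
        for path in changed:
            if path in head:
                merged[path] = head[path]   # added or modified
            else:
                merged.pop(path, None)      # deleted
    return merged


base = {"/doc/readme/md": "hello", "/app/talk/hoon": "v1"}
alice = {"/doc/readme/md": "hello, world", "/app/talk/hoon": "v1"}
bob = {"/doc/readme/md": "hello", "/app/talk/hoon": "v2",
       "/app/dojo/hoon": "new"}

clean = meet(base, alice, bob)
# Both sides touch /doc/readme/md here, so the merge is refused.
refused = meet(base, alice, {"/doc/readme/md": "hi", "/app/talk/hoon": "v1"})
```

In this sketch, two sides making the identical change to the same file still count as overlapping, which follows the "different files" wording literally.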
+
+A `%mate` merge attempts to merge changes to the same file when
+both Alice and Bob change it. If the merge is clean, we use it;
+otherwise, we fail. A merge between different types of changes --
+for example, deleting a file vs changing it -- is always a
+conflict. If we succeed, the parents are both Alice's and Bob's
+heads, and the data is the merge base plus Alice's changed files
+plus Bob's changed files plus the merged files.
+
+A `%meld` merge will succeed even if there are conflicts. If
+there are conflicts in a file, then we use the merge base's
+version of that file, and we produce a set of files with
+conflicts. The parents are both Alice's and Bob's heads, and the
+data is the merge base plus Alice's changed files plus Bob's
+changed files plus the successfully merged files plus the merge
+base's version of the conflicting files.
+
+That's the extent of the merge options in clay proper. In
+userspace there's a final option `%auto`, which is the most
+common. `%auto` checks to see if Bob's desk exists, and if it
+doesn't we use a `%init` merge. Otherwise, we progressively try
+`%fine`, `%meet`, and `%mate` until one succeeds.
+
+If none succeed, we merge Bob's desk into a scratch desk. Then,
+we merge Alice's desk into the scratch desk with the `%meld`
+option to force the merge. For each file in the produced set of
+conflicting files, we call the `++mash` function for the
+appropriate mark, which annotates the conflicts if we know how.
+
+Finally, we display a message to the user informing them of the
+scratch desk's existence, which files have annotated conflicts,
+and which files have unannotated conflicts. When the user has
+resolved the conflicts, they can merge the scratch desk back into
+Bob's desk. This will be a `%fine` merge since Bob's head is in
+the ancestry of the scratch desk.
+
+### Autosync
+
+Tracking and staying in sync with another desk is another
+fundamental operation. We call this "autosync".
+This doesn't mean simply mirroring a desk, since that wouldn't allow
+local changes. We simply want to apply changes as they are made
+upstream, as long as there are no conflicts with local changes.
+
+This is implemented by watching the other desk, and, when it has
+changes, merging these changes into our desk with the usual merge
+strategies.
+
+Note that it's quite reasonable for two desks to be autosynced to
+each other. This results in any change on one desk being mirrored
+to the other and vice versa.
+
+Additionally, it's fine to set up an autosync even if one desk,
+the other desk, or both desks do not exist. The sync will be
+activated when the upstream desk comes into existence and will
+create the downstream desk if needed.
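
Autosync merges upstream changes "with the usual merge strategies". Assuming this means the `%auto` sequence described under Merging (the text does not say so explicitly), the control flow might be sketched in Python as below. Everything here is hypothetical scaffolding: `auto_merge`, the `strategies` dict of callables, and the convention that a strategy returns `None` on failure are illustration only, and the individual strategies are stubbed rather than implemented.

```python
def auto_merge(alice, bob, strategies, meld):
    """Pick a merge strategy the way %auto is described.

    alice, bob: desk heads; bob is None if Bob's desk does not exist.
    strategies: dict mapping "fine", "meet", "mate" to callables that
                return a merge result, or None if they cannot merge.
    meld:       callable that always produces a result.
    Returns (strategy_name, result).
    """
    if bob is None:
        # First commit to the desk: Alice's head becomes revision 1.
        return ("init", alice)
    for name in ("fine", "meet", "mate"):
        result = strategies[name](alice, bob)
        if result is not None:
            return (name, result)
    # Nothing merged cleanly: force the merge. In the scheme described
    # above, this result goes to a scratch desk, not to Bob's desk.
    return ("meld", meld(alice, bob))


# Stub strategies, just to exercise the control flow.
calls = []

def failing(name):
    def strategy(alice, bob):
        calls.append(name)
        return None
    return strategy

stubs = {
    "fine": failing("fine"),
    "meet": lambda alice, bob: calls.append("meet") or "merged-by-meet",
    "mate": failing("mate"),
}

picked = auto_merge("alice-head", "bob-head", stubs, meld=lambda a, b: "forced")
```

With these stubs, `%fine` is tried and fails, `%meet` succeeds, and `%mate` is never attempted, which mirrors the "progressively try until one succeeds" wording.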