diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1614024
--- /dev/null
+++ b/README.md
@@ -0,0 +1,5 @@
+# New Mars
+
+A redesigned Mars for the Urth/Mars Urbit runtime. Currently WIP.
+
+Read the [proposal](notes/a-proposal-nock-performance.md) and [hypotheses](notes/b-hypotheses.md) for an overview.
diff --git a/notes/a-proposal-nock-performance.md b/notes/a-proposal-nock-performance.md
new file mode 100644
index 0000000..d142817
--- /dev/null
+++ b/notes/a-proposal-nock-performance.md
@@ -0,0 +1,141 @@
+# Introduction
+
+The current Nock implementation is a limitation on the performance of Urbit. When the performance of code is not limited by
+algorithmic concerns, generally the only approach to increasing performance is to jet the code in question. For code such
+as arithmetic, encryption, or bitwise operations this is the correct approach. For code with more complex control flow or
+memory management behavior, writing a jet is a difficult and error-prone process.
+
+It is possible for interpreted languages to be made very fast. Evaluation of a Nock 9 currently takes many tens or even a
+few hundreds of microseconds, which in a deep and heterogeneous call trace such as that of Ames quickly adds up to multiple
+milliseconds to process one event. Memory management overhead after Nock evaluation completes adds >1 millisecond, and
+memory management overhead during evaluation is harder to measure but likely to be significant.
+
+Functional programming language implementations mostly do not mutate, they allocate. This means that many allocations
+are discarded quickly. Nock is extreme about this: there is no possible way to mutate in Nock (with the only exception
+being an optimization in the case where there is no sharing). Therefore allocation should be fast, and garbage should not
+incur management overhead.
+
+Further, while a computed-goto bytecode interpreter is far faster than naive structural recursion over the noun tree or a
+switch-based interpreter, it still requires a computed jump between instructions, and does not admit well-known
+low-level optimizations.
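+To make the dispatch overhead concrete, here is a minimal sketch of a computed-goto loop in C. It is illustrative only:
+it assumes the GCC/Clang labels-as-values extension, and the three opcodes are invented rather than drawn from the actual
+Nock bytecode set.
+
+```c
+/* Minimal computed-goto dispatch sketch.  Every instruction executed costs
+   one indirect jump (`goto *labels[...]`), which the CPU must predict. */
+typedef enum { OP_PUSH, OP_ADD, OP_HALT } op;
+
+long run(const op *pc, const long *args) {
+  static void *labels[] = { &&do_push, &&do_add, &&do_halt };
+  long stack[256], *sp = stack;
+  goto *labels[*pc];           /* initial dispatch                     */
+do_push:
+  *sp++ = *args++;             /* push the next literal argument       */
+  goto *labels[*++pc];         /* indirect jump between every two ops  */
+do_add:
+  sp--; sp[-1] += sp[0];       /* pop two operands, push their sum     */
+  goto *labels[*++pc];
+do_halt:
+  return sp[-1];
+}
+```
+
+Compiling a formula to straight-line machine code removes that per-instruction indirect jump entirely, which is part of
+the motivation for the JIT section below.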
+Urbit is a personal server. Nock is the language in which that personal server's software is provided. Browsers are personal
+clients, and JavaScript is the language in which browser software is provided. JavaScript at one time had a reputation for
+slowness due to its interpreted nature. But modern JavaScript is quite fast, and this has had a qualitative, not simply
+quantitative, effect on the types of software written for the browser platform.
+
+Making Nock much faster than it is currently would plausibly have the same effect for Urbit.
+It would provide immediate benefits in the form of Ames throughput and JSON handling for client interaction.
+Further, applications not presently feasible on Urbit would rapidly become feasible.
+
+This proposal also includes changes which would allow for incremental snapshotting and large looms, thus removing other
+limitations on implementing applications for Urbit on Urbit.
+
+# Ideas
+
+## Lexical memory management
+The current worker uses (explicit/manual) reference counting to manage allocated objects, and adds objects
+scavenged on reclamation to a free list. This means that evaluation is subject to the overhead of reference-counting all
+allocated objects and of maintaining free lists when dead objects are scavenged. For a language which allocates at the
+rate Nock (or really any functional language) does, this is not optimal.
+
+However, the reference-counting scheme has the desirable property of predictable, lexically-mappable memory-management
+behavior. This means that two runs of the same Nock program produce the same code traces, even within
+memory management code.
+
+This property could be preserved by the following system.
+
+Two 'call' stacks are maintained. Perhaps they share a memory arena and grow towards each other, analogous to roads without
+heaps. Logically, they are one interleaved stack: a push to the top of the (logical) stack pushes onto the opposite
+stack from the current top of the (logical) stack.
+
+Noun allocation is performed by extending the stack frame and writing the noun on the stack. There is no heap (but see
+the optimization for large nouns below).
+
+When it is time to pop a stack frame and return a value to the control represented by the previous stack frame,
+a limited form of copying collection is performed. The return value is copied to the return-target stack frame, which,
+because of the interleaved stack, has free space adjacent. Descendant nouns referenced by the returned noun are
+copied in their turn, and pointers are updated (see the sketch at the end of this section).
+
+Note that only references into the returning frame need to initiate copies, and there can be no references to data in
+the returning frame from outside that frame, because there is no mutation and there are no cyclic references in Nock.
+So the copied nouns can reference nouns "further up" the stack, but nouns further up the stack cannot reference nouns
+in the returning stack frame.
+
+### Optimization: hash-indexed heap for large nouns
+While this scheme should be an improvement for most computation, it can result in repeated up-the-stack copies
+of large nouns. Nouns over a certain size can be ejected to an external heap indexed by a hash table, thus providing
+de-duplication and eliminating the need to copy.
+
+### Advantages
+* Allocation in this model is very fast, as it involves only a pointer increment.
+* Allocations are compact (not interleaved with free space) and tend toward adjacency of related structures,
+  leading to generally better cache locality.
+* Pause times for 'collection' are lexically limited *by the size of the noun returned, less the parts of the noun
+  originating above the returning frame in lexical scope.*
+* The predictable and repeatable memory-management behavior is retained.
+* Memory management overhead is proportional to the size of nouns returned, *not* to the size of
+  discarded memory, as is presently the case.
+* Memory management does not require any data structures to persist between collections.
+  (Ephemeral memory for the collector can be allocated above the frame being scavenged.)
+* Big loom/snapshots: the implementation will use 64-bit pointers and thus remove the 2GB limit on loom/snapshot size.
+* Incremental snapshots: By ejecting the result of an Arvo event computation to the hash table,
+  incremental snapshotting can be done by storing only new hashes in the table.
+* Control structures for the interpreter itself are stored off the loom, simplifying it drastically.
+
+### Disadvantages
+* Copying and pointer adjustment could be expensive for large nouns (but see 'Optimization', above).
+* Scavenging is slightly less eager than with reference counting; allocations persist for a lexical scope.
+* Snapshots would not be compatible with the current loom.
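+Pulling the above together, here is a minimal hypothetical sketch of frame allocation and the copy-on-return step. The
+noun encoding and all names are assumptions, the two interleaved stacks are simplified to a single upward-growing region,
+and a real implementation would preserve sharing rather than duplicating shared subtrees.
+
+```c
+#include <stdint.h>
+
+/* Illustrative noun encoding: low bit 0 = direct atom, 1 = pointer to cell. */
+typedef uint64_t noun;
+typedef struct { noun head, tail; } cell;
+
+#define IS_CELL(n)   (((n) & 1) != 0)
+#define TO_CELL(n)   ((cell *)(uintptr_t)((n) & ~(noun)1))
+#define FROM_CELL(p) ((noun)(uintptr_t)(p) | 1)
+
+/* Bump allocation: extending the frame is just a pointer increment. */
+typedef struct { cell *top; } stack;
+static cell *alloc_cell(stack *s) { return s->top++; }
+
+/* Copy-on-return: copy the returned noun out of the returning frame
+   (addresses >= frame_base) into the caller's frame.  Nouns allocated in
+   older frames lie below frame_base and are shared, never copied --
+   nothing outside the returning frame can point into it. */
+static noun copy_up(noun n, cell *frame_base, stack *caller) {
+  if (!IS_CELL(n) || TO_CELL(n) < frame_base)
+    return n;                       /* atom, or already outside the frame */
+  cell *old   = TO_CELL(n);
+  cell *fresh = alloc_cell(caller);
+  fresh->head = copy_up(old->head, frame_base, caller);
+  fresh->tail = copy_up(old->tail, frame_base, caller);
+  return FROM_CELL(fresh);
+}
+```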
+## Just-in-time compilation
+
+Nock code is currently compiled at runtime to bytecode, which is interpreted by a looping interpreter using
+'computed gotos': program addresses for labels are computed and stored in an array, which is indexed by the
+opcodes of the bytecode representation.
+
+This approach is much faster than a naive looping or recursive structural interpreter. It can be made faster still,
+especially for code which is "hot", i.e. routinely called in rapid iteration.
+
+The approach is to keep a counter on a segment of bytecode which is incremented each time it is run. This counter would
+persist between invocations of Arvo, so as to notice code which is hot across the event loop. When a counter hits
+a threshold, the bytecode is translated into LLVM IR, which can be compiled by LLVM to yield a function pointer.
+This function pointer is then stored as an "automatic jet" for the code (see the sketch at the end of this section).
+
+Of course, the JIT compiler should also respect jet hints and link in existing jets, as LLVM is not likely to, e.g.,
+optimize the naive `%+ add x y` invocation (whose running time is proportional to the value of its argument) into code
+using the ALU.
+
+This approach of JIT-compiling hotspot code is used to great effect by JavaScript engines, in a context where
+code loading is ephemeral and the performance benefit from a particular invocation of the JIT compiler lasts only
+for the duration that a page is loaded. In a context where an Urbit persistently loops through much the same code for
+every event (until Arvo or an application is updated), the overhead could be amortized across an even greater number of
+invocations, over a longer period of time.
+
+An even simpler approach is to JIT-compile every formula to machine code, on the assumption that most code
+will not be ephemeral.
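+A minimal sketch of the hot-counter trigger described above. The descriptor layout, the threshold, and
+`compile_with_llvm` are invented stand-ins, not an existing API.
+
+```c
+#include <stdint.h>
+
+typedef uint64_t noun;
+
+/* Hypothetical descriptor for a segment of compiled Nock bytecode. */
+typedef struct {
+  const uint8_t *byc;            /* the bytecode itself                    */
+  uint64_t       hits;           /* heat counter, persisted across events  */
+  noun         (*native)(noun);  /* non-NULL once LLVM has compiled it     */
+} segment;
+
+enum { HOT_THRESHOLD = 1024 };
+
+noun interpret(const uint8_t *byc, noun subject);       /* bytecode loop     */
+noun (*compile_with_llvm(const uint8_t *byc))(noun);    /* emits machine code */
+
+noun run_segment(segment *seg, noun subject) {
+  if (seg->native)                    /* already compiled: call it directly */
+    return seg->native(subject);
+  if (++seg->hits == HOT_THRESHOLD)   /* hot enough: compile once, keep it  */
+    seg->native = compile_with_llvm(seg->byc);
+  return interpret(seg->byc, subject);
+}
+```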
+# Tasks
+
+## A new Mars
+
+***Time (core): 2 months***
+
+***Time (jet porting): ?***
+
+A new Mars implementation is written in Racket-derived C, containing a bytecode interpreter for Nock as well as
+snapshotting and event logging. The implementation initially uses Boehm GC or similar off-the-shelf memory management.
+Jets are supported by porting them to use an allocator supplied by the interpreter.
+
+## Lexical memory
+
+***Time: 3 months***
+
+The new Mars implementation is converted to use lexical memory management as described above.
+Jets may allocate memory using an allocation function provided by the interpreter, and may use this
+memory as they wish, but *must not* mutate memory that they did not allocate.
+
+Question: is Boehm modular or flexible enough that we can use all or part of it to implement this strategy?
+
+## Jets-In-Time compilation of Nock bytecode
+
+***Time: 4 months***
+
+The new Mars creates jets on the fly by using LLVM to compile Nock bytecode to machine code whenever some heat metric
+crosses a threshold (this metric is probably just a counter, as Urbit code tends to be highly persistent rather than
+ephemeral).
diff --git a/notes/b-hypotheses.md b/notes/b-hypotheses.md
new file mode 100644
index 0000000..31a12aa
--- /dev/null
+++ b/notes/b-hypotheses.md
@@ -0,0 +1,45 @@
+# Hypotheses tested by New Mars
+
+## Stack-only allocation for computation
+**Hypotheses:**
+*Stack allocation, with returned nouns copied up the stack ("lexical memory management"),
+is a feasible way to implement Nock. Specifically:*
+- *Lexical memory management is simple to implement.*
+- *Lexical memory management provides performant allocation and collection of Nock nouns.*
+
+## Just-in-time compilation
+Intuitively, compiling Nock bytecode to machine code should provide performance wins at a highly amortized cost.
+
+Urbit code is highly persistent. We don't dynamically load code for a short interactive session and then discard it;
+instead we load Hoon code, compile it to Nock *once*, and then continue using that code in an event loop for a long
+period, until the next OTA update of Arvo or an application.
+
+Especially since we already take the time to compile Hoon to Nock, it likely makes sense to compile Nock to machine code
+that can be invoked directly, without interpretation overhead.
+
+**Hypothesis:**
+*Compiling Nock bytecode to machine code and caching the compiled code results in much faster Nock execution,
+and the expense of compilation is acceptable given the amortization across a very high number of invocations.*
+
+## Large-noun hash table
+The major downside of the lexical memory management scheme is that large nouns allocated deep in the stack and returned
+from Arvo will be repeatedly copied to return them through multiple continuation frames. This can be ameliorated by using
+a separate arena for large nouns. The step of copying a noun up the stack tracks how much memory has been copied and,
+past a certain threshold, resets the stack pointer to undo the copy, instead copying the noun into the separate
+arena and returning a reference.
+
+By making this arena a hash table, we can create a store which can be copy-collected without adjusting references.
+This can also serve to deduplicate nouns in the table.
+
+The hash table serves as a place to store non-noun objects, such as bytecode or jet dashboards, and a place to store noun
+metadata. Rather than suffering the overhead of possible metadata annotations on every cell, we can instead only
+allow metadata as the head of a hash table entry.
+
+This hash table also permits differential snapshotting, by storing only the hashes which are new in the table since the
+last snapshot. It also permits paging of large nouns to disk, as a hash table entry could be marked with a path to a
+page file and paged out.
+
+**Hypotheses**:
+- *A hash-referenced memory arena for large nouns resolves the major downside of lexical memory management by preventing
+  repeated copying of large nouns.*
+- *A hash-referenced memory arena provides a store for non-noun objects in the Nock interpreter.*
+- *A hash-referenced memory arena provides transparent incremental snapshotting and disk paging of large nouns.*
\ No newline at end of file
diff --git a/notes/notes-~2021.9.23.md b/notes/notes-~2021.9.23.md
new file mode 100644
index 0000000..811acd6
--- /dev/null
+++ b/notes/notes-~2021.9.23.md
@@ -0,0 +1,21 @@
+# Notes ~2021.9.23
+## Discussion with ~rovnys-ricfer
+* Some discussion of the memory model, in particular using tag bits to identify (see the sketch at the end of this
+  section):
+  - direct noun
+  - indirect noun
+  - pointer to cell
+  - hash table reference
+* Hash table ejection strategy
+  - When the copier is returning a noun, it counts how many iterations of copying it has performed.
+  - Above a tunable threshold, an entry in the hash table is made and the noun is copied there instead.
+  - Existing pointers into the hash table are copied and re-inserted as new entries, thus maintaining the invariant
+    that a hash table entry can only reference its own memory by a direct pointer, or another hash table entry
+    by hash reference.
+  - Nouns that require metadata (jet pointers, cached bytecode) are ejected to the hash table.
+  - The hash table can also store non-noun data such as bytecode.
+  - TBD: a collection strategy for the hash table.
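+A minimal sketch of the tag-bit scheme from the first bullet above. The two-bit assignments are invented for
+illustration; the actual encoding is undecided.
+
+```c
+#include <stdint.h>
+
+/* A 64-bit noun word whose low two bits distinguish the four cases
+   discussed above.  Direct atoms pay two bits of range for the tag. */
+typedef uint64_t noun;
+
+enum tag {
+  TAG_DIRECT   = 0,   /* atom stored in the word itself            */
+  TAG_INDIRECT = 1,   /* pointer to a big atom's storage           */
+  TAG_CELL     = 2,   /* pointer to a [head tail] pair             */
+  TAG_HASH     = 3,   /* reference into the large-noun hash table  */
+};
+
+static inline enum tag noun_tag(noun n)   { return (enum tag)(n & 3); }
+static inline uint64_t direct_val(noun n) { return n >> 2; }
+static inline void    *noun_ptr(noun n)   { return (void *)(uintptr_t)(n & ~(noun)3); }
+```
+
+Allocations would need to be aligned to at least four bytes so that the low bits of a pointer are free to carry the tag.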
+## Infrastructure channel discussion
+* The interpreter should handle Nock 12 with an extra input of a scry-gate stack, so that it can be used to jet `+mink`.
+* We also need to return crash values from the interpreter, and build traces when passing through Nock 11.
+* ~master-morzod objects to calling the machine code from JIT compilation "automatic jets", and is probably right.
\ No newline at end of file
diff --git a/notes/notes-~2021.9.24.md b/notes/notes-~2021.9.24.md
new file mode 100644
index 0000000..183526e
--- /dev/null
+++ b/notes/notes-~2021.9.24.md
@@ -0,0 +1,23 @@
+# Notes ~2021.9.24
+## Exploration of `+mink`
+The [`+mink`](https://github.com/urbit/urbit/blob/fa894b9690deae9e2334ccec5492ba90cb0b38f9/pkg/arvo/sys/hoon.hoon#L5978-L6106)
+arm in [`hoon.hoon`](https://github.com/urbit/urbit/blob/master/pkg/arvo/sys/hoon.hoon) is a metacircular Nock interpreter
+intended to be jetted by invoking the host Nock interpreter. In addition to the subject and formula to evaluate, `+mink`
+takes a scry gate which is used to evaluate Nock 12 (scry) formulas.
+
+The jet uses `u3r_mean` to take apart the sample of the `+mink` gate and feeds the resulting nouns to
+`u3n_nock_et`, which runs the interpreter and produces a `toon`.
+
+Scry gates are kept in a stack because a scry gate may itself scry, and should not re-enter itself.
+
+## Exceptions in New Mars
+### Suboptimal but simple way
+The simple thing to do would be to make every continuation expect a `toon` and, on an error or block result, immediately
+call the next continuation up the stack with it, until a continuation installed (e.g. by the `+mink` jet) branches on the
+result. This is effectively writing the interpreter as a CPS/trampoline translation of `+mink`.
+
+### Likely more optimal way
+It should be possible to store exception contexts in the hash table, each comprising stack pointers for unwinding, a
+code pointer to jump to (possibly we could use setjmp/longjmp instead), and a hash reference to the next outermost
+handler. The hash reference to the current exception context would be held in a register. This structure could also hold
+the current scry gate.
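+A speculative sketch of such an exception context, assuming setjmp/longjmp for the non-local jump; the layout and names
+are invented, not a settled design.
+
+```c
+#include <setjmp.h>
+#include <stdint.h>
+
+typedef uint64_t noun;   /* noun word, as elsewhere in these notes    */
+typedef uint64_t href;   /* hash reference into the large-noun table  */
+
+/* Speculative exception-context record, stored in the hash table. */
+typedef struct {
+  noun   *stack_a, *stack_b;  /* stack pointers to unwind both stacks to */
+  jmp_buf env;                /* code context to jump back to            */
+  href    next;               /* next outermost handler                  */
+  noun    scry_gate;          /* current scry gate                       */
+} exc_ctx;
+
+/* Held in a register in the real design; globals stand in here. */
+static href current_handler;
+static noun pending_error;
+
+exc_ctx *deref(href h);                   /* look up a hash reference   */
+
+/* Raising an error unwinds to the innermost handler in one jump. */
+void raise_error(noun error) {
+  exc_ctx *ctx   = deref(current_handler);
+  pending_error  = error;                 /* left for the handler       */
+  current_handler = ctx->next;            /* pop to the outer handler   */
+  longjmp(ctx->env, 1);
+}
+```
\ No newline at end of file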