mirror of https://github.com/urbit/ares.git

commit 9fc3ff91c5 (parent e8e3383947): README and import proposal from gist

README.md | 5 (new file)
@@ -0,0 +1,5 @@

# New Mars

A redesigned Mars for the Urth/Mars Urbit runtime. Currently WIP.

Read the [proposal](notes/a-proposal-nock-performance.md) and [hypotheses](notes/b-hypotheses.md) for an overview.

notes/a-proposal-nock-performance.md | 141 (new file)
@@ -0,0 +1,141 @@

# Introduction

The current Nock implementation is a limitation on the performance of Urbit. When the performance of code is not
limited by algorithmic concerns, generally the only approach to increasing performance is to jet the code in question.
For code such as arithmetic, encryption, or bitwise operations this is the correct approach. For code with more complex
control flow or memory-management behavior, writing a jet is a difficult and error-prone process.

It is possible for interpreted languages to be made very fast. Evaluation of a Nock-9 currently takes many tens or even
a few hundreds of microseconds, which in a deep and heterogeneous call trace such as that of Ames quickly adds up to
multiple milliseconds to process one event. Memory-management overhead after Nock evaluation completes adds more than a
millisecond, and memory-management overhead during evaluation is harder to measure but likely to be significant.

Functional programming language implementations mostly do not mutate; they allocate. This means that many allocations
are discarded quickly. Nock is extreme about this: there is no way to mutate in Nock at all (the only exception being
an optimization applied when there is no sharing). Therefore allocation should be fast, and garbage should not incur
management overhead.

Further, while a computed-goto bytecode interpreter is far faster than naive structural recursion over the noun tree or
a switch-based interpreter, it still requires a computed jump between every pair of instructions, and does not admit
well-known low-level optimizations.

Urbit is a personal server. Nock is the language in which that personal server's software is provided. Browsers are
personal clients, and Javascript is the language in which browser software is provided. Javascript at one time had a
reputation for slowness due to its interpreted nature. But modern Javascript is quite fast, and this has had a
qualitative, not merely quantitative, effect on the kinds of software written for the browser platform.

Making Nock much faster than it currently is would plausibly have the same effect for Urbit.
It would provide immediate benefits in the form of Ames throughput and JSON handling for client interaction.
Further, applications not presently feasible on Urbit would rapidly become feasible.

This proposal also includes changes which would allow for incremental snapshotting and large looms, thus removing other
limitations on implementing applications for Urbit on Urbit.

# Ideas

## Lexical memory management

The current worker uses explicit, manual reference counting to manage allocated objects, and adds objects scavenged on
reclamation to a free list. This means that evaluation is subject to the overhead of reference-counting all allocated
objects and of maintaining free lists when dead objects are scavenged. For a language which allocates at the rate Nock
(or really any functional language) does, this is not optimal.

However, the reference-counting scheme has the desirable property of predictable, lexically mappable memory-management
behavior: two runs of the same Nock program produce the same code traces, even within memory-management code.

A similar property could be achieved by the following system.

Two 'call' stacks are maintained. Perhaps they share a memory arena and grow towards each other, analogous to roads
without heaps. Logically they are one interleaved stack; that is, a push to the top of the (logical) stack pushes onto
the opposite physical stack from the current top of the (logical) stack.

Noun allocation is performed by extending the stack frame and writing the noun on the stack. There is no heap (but see
the optimization below).

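As a sketch, assuming a simple arena layout (names like `nock_stack`, `frame_push`, and `noun_alloc` are illustrative,
not from any existing codebase), allocation looks like this:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t noun;

typedef struct {
    noun *west;      /* west stack: grows upward from the arena base   */
    noun *east;      /* east stack: grows downward from the arena end  */
    int   polarity;  /* which stack the current logical frame lives on */
} nock_stack;

/* Push a new frame: it goes on the opposite physical stack from the
 * current frame. (A real implementation would also record the frame
 * base so the frame can later be popped.) */
void frame_push(nock_stack *s) {
    s->polarity = !s->polarity;
}

/* Bump-allocate `words` noun-sized words in the current frame: just a
 * pointer increment, as the advantages list below notes. */
noun *noun_alloc(nock_stack *s, size_t words) {
    if (s->polarity == 0) {
        noun *p = s->west;
        s->west += words;    /* extend toward the east stack */
        return p;
    } else {
        s->east -= words;    /* extend toward the west stack */
        return s->east;
    }
}
```
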
When it is time to pop a stack frame and return a value to the control represented by the previous stack frame, a
limited form of copying collection is performed. The return value is copied to the return-target stack frame, which,
because of the interleaved stacks, also has free space adjacent. Descendant nouns referenced by the returned noun are
copied in their turn, and pointers updated.

Note that only references into the returning frame need initiate copies, and there can be no references to data in the
returning frame from outside the current frame, because there is no mutation and there are no cyclic references in
Nock. So the copied nouns can reference nouns "further up" the stack, but nouns further up the stack cannot reference
nouns in the current stack frame.

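A self-contained sketch of the copy-on-return step, under an assumed tagged-word representation (direct atoms stored in
the word, cells on the stack; indirect atoms and sharing preservation are omitted for brevity):

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t noun;
typedef struct { noun head, tail; } cell;

static bool  is_cell(noun n)    { return (n & 1) == 1; }
static cell *to_cell(noun n)    { return (cell *)(uintptr_t)(n & ~(noun)1); }
static noun  from_cell(cell *c) { return (noun)(uintptr_t)c | 1; }

/* Does this noun live inside the returning frame [lo, hi)? */
static bool in_frame(noun n, cell *lo, cell *hi) {
    return is_cell(n) && to_cell(n) >= lo && to_cell(n) < hi;
}

/* Copy `root` out of the returning frame [lo, hi), bump-allocating
 * cells in the caller's frame from *next. Nouns outside the frame are
 * already visible to the caller and are left in place. A real
 * implementation would use an explicit work stack (the proposal notes
 * ephemeral collector memory can live above the scavenged frame) and
 * would preserve sharing, which this naive copy does not. */
static noun copy_to_parent(noun root, cell *lo, cell *hi, cell **next) {
    if (!in_frame(root, lo, hi))
        return root;              /* direct atom, or an older noun   */
    cell *c = to_cell(root);
    cell *d = (*next)++;          /* allocate in the caller's frame  */
    d->head = copy_to_parent(c->head, lo, hi, next);
    d->tail = copy_to_parent(c->tail, lo, hi, next);
    return from_cell(d);
}
```
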
### Optimization: hash-indexed heap for large nouns

While for most computation this scheme should be an improvement, it can result in repeated copies up the stack of large
nouns. Nouns over a certain size can be ejected to an external heap indexed by a hash table, providing de-duplication
and eliminating the need to copy.

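A sketch of the ejection path, assuming a content hash has already been computed over the noun's serialization; the
chained table and its names are illustrative, and a production table would verify equality on hash collision or use a
collision-resistant hash:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SLOTS 4096

typedef struct entry {
    uint64_t      hash;   /* content hash of the serialized noun    */
    void         *data;   /* the noun's bytes, copied off the stack */
    size_t        len;
    struct entry *next;   /* chaining for slot collisions           */
} entry;

static entry *table[TABLE_SLOTS];

/* Eject a large noun: insert by content hash and return the stored
 * copy. If an entry with the same hash already exists we share it,
 * which is the de-duplication described above. */
void *eject(uint64_t hash, const void *noun_bytes, size_t len) {
    entry **slot = &table[hash % TABLE_SLOTS];
    for (entry *e = *slot; e != NULL; e = e->next)
        if (e->hash == hash)
            return e->data;           /* duplicate: no copy needed */
    entry *e = malloc(sizeof *e);
    e->hash = hash;
    e->data = malloc(len);
    memcpy(e->data, noun_bytes, len); /* one final copy, then stable */
    e->len  = len;
    e->next = *slot;
    *slot   = e;
    return e->data;
}
```
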
### Advantages
* Allocation in this model is very fast, as it involves only a pointer increment.
* Allocations are compact (not interleaved with free space) and tend toward adjacency of related structures, leading to
  generally better cache locality.
* Pause times for 'collection' are lexically limited *by the size of the noun returned, less the parts of the noun
  originating above the returning frame in lexical scope.*
* The predictable and repeatable memory-management behavior is retained.
* Memory-management overhead is proportional to the size of nouns returned, *not* to the size of discarded memory, as
  is presently the case.
* Memory management does not require any data structures to persist between collections. (Ephemeral memory for the
  collector can be allocated above the frame being scavenged.)
* Big loom/snapshots: the implementation will use 64-bit pointers and thus remove the 2GB limit on loom/snapshot size.
* Incremental snapshots: by ejecting the result of an Arvo event computation to the hash table, incremental
  snapshotting can be done by storing only new hashes in the table.
* Control structures for the interpreter itself are stored off the loom, simplifying it drastically.

### Disadvantages
* Copying and pointer adjustment could be expensive for large nouns (but see 'Optimization', above).
* Scavenging is slightly less eager than with reference counting: allocations persist for a lexical scope.
* Snapshots would not be compatible with the current loom.

## Just-in-time compilation

Nock code is currently compiled at runtime to bytecode, which is interpreted by a looping interpreter using 'computed
gotos'; that is, program addresses for labels are computed and stored in an array, which is indexed by the opcodes of
the bytecode representation.

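For concreteness, a minimal dispatch loop of this shape, using the GCC/Clang labels-as-values extension; the opcodes
are invented for illustration and are not the actual Nock bytecode:

```c
#include <stdint.h>
#include <stdio.h>

/* Opcodes invented for illustration; not the actual Nock bytecode. */
enum { OP_PUSH1, OP_ADD, OP_HALT };

int64_t run(const uint8_t *code) {
    /* Labels-as-values: each opcode indexes a stored program address,
     * so dispatch is one computed jump per instruction. */
    static void *labels[] = { &&do_push1, &&do_add, &&do_halt };
    int64_t stack[64], *sp = stack;

#define DISPATCH() goto *labels[*code++]
    DISPATCH();

do_push1: *sp++ = 1;              DISPATCH();
do_add:   sp--; sp[-1] += sp[0];  DISPATCH();
do_halt:  return sp[-1];
#undef DISPATCH
}

int main(void) {
    const uint8_t prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT };
    printf("%lld\n", (long long)run(prog));  /* prints 2 */
    return 0;
}
```
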
This approach is much faster than a naive looping or recursive structural interpreter. It can be made faster still,
especially for code which is "hot", i.e. routinely called in rapid iteration.

The approach is to keep a counter on a segment of bytecode, incremented each time the segment is run. This counter
would persist between invocations of Arvo, so as to notice code which is 'hot' across the event loop. When a counter
hits a threshold, the bytecode is translated into an LLVM graph, which can be fed to LLVM to produce a function
pointer. This function pointer is then stored as an "automatic jet" of the code.

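A sketch of the trigger logic; `interpret` and `compile_with_llvm` are placeholders for the interpreter loop and the
LLVM translation described above, and the threshold is an illustrative tuning value:

```c
#include <stdint.h>

typedef uint64_t (*machine_fn)(uint64_t subject);

/* Stand-ins for real components: the bytecode interpreter loop and
 * the bytecode-to-LLVM translation described above. */
extern uint64_t   interpret(const uint8_t *bytecode, uint64_t subject);
extern machine_fn compile_with_llvm(const uint8_t *bytecode);

enum { HOT_THRESHOLD = 1000 };   /* illustrative tuning value */

typedef struct {
    const uint8_t *bytecode;
    uint64_t       hits;      /* persists between Arvo invocations  */
    machine_fn     machine;   /* NULL until the segment becomes hot */
} code_seg;

uint64_t eval(code_seg *seg, uint64_t subject) {
    if (seg->machine != NULL)
        return seg->machine(subject);   /* the compiled "automatic jet" */
    if (++seg->hits >= HOT_THRESHOLD)
        seg->machine = compile_with_llvm(seg->bytecode);
    return interpret(seg->bytecode, subject);
}
```
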
Of course, the JIT compiler should also respect jet hints and link in existing jets, as LLVM is not likely to, e.g.,
optimize the naive, quadratic-time `%+ add x y` invocation into code using the ALU.

This approach of JIT compilation of hotspot code is used to great effect by Javascript, in a context where code loading
is ephemeral and the performance benefits from a particular invocation of the JIT compiler last only as long as a page
stays loaded. In a context where an Urbit persistently loops through much the same code for every event (until Arvo or
an application is updated), the overhead could be amortized across an even greater number of invocations, over a longer
period of time.

An even simpler approach is to JIT-compile every formula to machine code, on the assumption that most code will not be
ephemeral.

# Tasks

## A new Mars

***Time (core): 2 months***

***Time (jet porting): ?***

A new Mars implementation is written in Racket-derived C, containing a bytecode interpreter for Nock as well as
snapshotting and event logging. The implementation initially uses Boehm GC or similar off-the-shelf memory management.
Jets are supported by porting them to use an allocator supplied by the interpreter.

## Lexical memory

***Time: 3 months***

The new Mars implementation is converted to use lexical memory management as described above. Jets may allocate memory
using an allocation function provided by the interpreter, and may use this memory as they wish, but *must not* mutate
memory that they did not allocate.

Question: is Boehm modular or flexible enough that we can use all or part of it to implement this strategy?

## Jets-In-Time compilation of Nock bytecode

***Time: 4 months***

The new Mars creates jets on the fly by using LLVM to compile Nock bytecode to machine code whenever some metric of
heat is reached (this metric is probably just a counter, as Urbit code tends to be highly persistent rather than
ephemeral).

notes/b-hypotheses.md | 45 (new file)
@@ -0,0 +1,45 @@

# Hypotheses tested by New Mars

## Stack-only allocation for computation
**Hypotheses:**
*Stack allocation, with returned nouns copied up the stack ("lexical memory management"),
is a feasible way to implement Nock. Specifically:*
- *Lexical memory management is simple to implement.*
- *Lexical memory management provides performant allocation and collection of Nock nouns.*

## Just-in-time compilation
Intuitively, compiling Nock bytecode to machine code should provide performance wins with a highly amortized cost.

Urbit code is highly persistent. We don't dynamically load code for a short interactive session and then discard it;
instead we load Hoon code, compile it to Nock, and keep using that code in an event loop for a long period, until the
next OTA update of Arvo or an application.

Especially since we already take the time to compile Hoon to Nock, it likely makes sense to compile Nock to machine
code that can be directly invoked without interpretation overhead.

**Hypothesis:**
*Compiling Nock bytecode to machine code and caching the compilation results in much faster Nock execution,
and the expense of compilation is acceptable given the amortization across a very high number of invocations.*

## Large-noun hash table
The major downside of the lexical memory management scheme is that large nouns allocated deep in the stack and returned
from Arvo will be repeatedly copied to return them through multiple continuation frames. This can be ameliorated by
using a separate arena for large nouns. The step of copying the noun up the stack tracks how much memory has been
copied and, at a certain threshold, resets the stack pointer to undo the copy, instead copying the noun into the
separate arena and returning a reference.

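A sketch of that threshold check; `copy_to_parent` and `eject_to_arena` stand in for the copy and ejection routines
sketched in the proposal, and the threshold value is illustrative:

```c
#include <stddef.h>
#include <stdint.h>

enum { EJECT_THRESHOLD = 64 * 1024 };  /* bytes copied before ejecting */

/* Stand-ins for the copying and ejection routines sketched in the
 * proposal; `bytes_copied` accumulates how much the copier moved. */
extern uint64_t copy_to_parent(uint64_t root, size_t *bytes_copied);
extern uint64_t eject_to_arena(uint64_t root);

uint64_t return_noun(uint64_t root, char **stack_ptr, char *saved_sp) {
    size_t copied = 0;
    uint64_t result = copy_to_parent(root, &copied);
    if (copied > EJECT_THRESHOLD) {
        *stack_ptr = saved_sp;        /* reset the stack: undo the copy  */
        return eject_to_arena(root);  /* return a hash reference instead */
    }
    return result;
}
```
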
By making this arena a hash table, we can create a store which can be copy-collected without adjusting references.
This can also serve to deduplicate nouns in the table.

The hash table serves as a place to store non-noun objects, such as bytecode or jet dashboards, and a place to store
noun metadata. Rather than suffering the overhead of possible metadata annotations on every cell, we can instead allow
metadata only at the head of a hash-table entry.

This hash table also permits differential snapshotting, by storing only the hashes which are new in the table since the
last snapshot. It also permits paging of large nouns to disk, as a hash-table entry could be marked with a path to a
page file and paged out.

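A sketch of the differential-snapshot walk, assuming each table entry records the epoch at which it was inserted (the
layout and names are invented for illustration):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t    hash;   /* content hash identifying the entry         */
    uint64_t    epoch;  /* snapshot epoch when the entry was inserted */
    const void *data;
    size_t      len;
} snap_entry;

/* Differential snapshot: persist only entries inserted since the last
 * snapshot epoch; everything older is already on disk. */
void snapshot_diff(const snap_entry *entries, size_t n,
                   uint64_t last_epoch, FILE *out) {
    for (size_t i = 0; i < n; i++) {
        if (entries[i].epoch > last_epoch) {
            fwrite(&entries[i].hash, sizeof entries[i].hash, 1, out);
            fwrite(entries[i].data, 1, entries[i].len, out);
        }
    }
}
```
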
**Hypotheses**:
- *A hash-referenced memory arena for large nouns resolves the major downside of lexical memory management by
  preventing repeated copying of large nouns.*
- *A hash-referenced memory arena provides a store for non-noun objects in the Nock interpreter.*
- *A hash-referenced memory arena provides transparent incremental snapshotting and disk paging of large nouns.*

notes/notes-~2021.9.23.md | 21 (new file)
@@ -0,0 +1,21 @@

# Notes ~2021.9.23
## Discussion with ~rovnys-ricfer
* Some discussion of the memory model, in particular using tag bits to identify (see the sketch after this list):
  - direct noun
  - indirect noun
  - pointer to cell
  - hash table reference
* Hash table ejection strategy
  - When the copier is returning a noun, it counts how many iterations of copying it has performed.
  - Above a tunable threshold, an entry in the hash table is made and the noun is copied there instead.
  - Existing pointers into the hash table are copied and re-inserted as new entries, thus maintaining the invariant
    that a hash table entry can reference only its own memory (by direct pointer) or another hash table entry (by hash
    reference).
  - Nouns that require metadata (jet pointers, cached bytecode) are ejected to the hash table.
  - The hash table can also store non-noun data such as bytecode.
  - TBD: a collection strategy for the hash table.

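A sketch of one possible tag-bit encoding for the four cases in the first bullet, using the low two bits of a 64-bit
word; the particular tag assignment is illustrative:

```c
#include <stdint.h>

typedef uint64_t noun;

/* Low two bits of the word distinguish the four cases discussed. */
enum tag {
    TAG_DIRECT   = 0,  /* small atom stored directly in the word    */
    TAG_INDIRECT = 1,  /* pointer to an allocated big atom          */
    TAG_CELL     = 2,  /* pointer to a cell                         */
    TAG_HASH     = 3,  /* reference into the large-noun hash table  */
};

static inline enum tag noun_tag(noun n)      { return (enum tag)(n & 3); }
static inline uint64_t direct_value(noun n)  { return n >> 2; }
static inline void    *noun_ptr(noun n)      { return (void *)(uintptr_t)(n & ~(noun)3); }
```
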
## Infrastructure channel discussion
* The interpreter should handle Nock 12 with an extra input of a scry-gate stack, so that it can be used to jet
  `+mink`.
* We also need to return crash values from the interpreter, and build traces when passing through Nock 11.
* ~master-morzod objects to calling the machine code from JIT compilation "automatic jets", and is probably right.

notes/notes-~2021.9.24.md | 23 (new file)
@@ -0,0 +1,23 @@

# Notes ~2021.9.24
## Exploration of `+mink`
The [`+mink`](https://github.com/urbit/urbit/blob/fa894b9690deae9e2334ccec5492ba90cb0b38f9/pkg/arvo/sys/hoon.hoon#L5978-L6106)
arm in [`hoon.hoon`](https://github.com/urbit/urbit/blob/master/pkg/arvo/sys/hoon.hoon) is a metacircular Nock
interpreter intended to be jetted by invoking the host Nock interpreter. In addition to the subject and formula to
evaluate, `+mink` takes a scry gate which is used to evaluate Nock 12 (scry) formulas.

The jet uses `u3r_mean` to take apart the sample of the `+mink` gate and feeds the resulting nouns to `u3n_nock_et`,
which runs the interpreter and produces a 'toon'.

Scry gates are kept in a stack because a scry gate may itself scry, and should not re-enter itself.

|
||||
|
||||
## Exceptions in new mars
|
||||
### Suboptimal but simple way
|
||||
The simple thing to do would be to make every continuation expect a toon and on an error or block result, immediately call
|
||||
the next continuation up the stack with it, until a continuation installed e.g. by the +mink jet branched on the result.
|
||||
This is effectively writing the interpreter as a CPS/trampoline translation of +mink.
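A sketch of that shape; the toon representation and continuation layout are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

typedef enum { TOON_SUCCESS, TOON_BLOCK, TOON_CRASH } toon_tag;

typedef struct {
    toon_tag tag;
    uint64_t value;   /* result noun, block path, or crash trace */
} toon;

typedef struct cont {
    /* A continuation receives the toon from the step below it. The
     * default behavior is to pass %block/%crash straight upward; a
     * continuation installed by the +mink jet would branch on the
     * result instead of propagating it. */
    toon (*resume)(struct cont *self, toon result);
    struct cont *up;  /* next continuation up the stack */
} cont;

/* Trampoline: drive the continuation stack without C recursion. */
toon trampoline(cont *k, toon t) {
    while (k != NULL) {
        cont *up = k->up;
        t = k->resume(k, t);
        k = up;
    }
    return t;
}
```
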
### Likely more optimal way
It should be possible to store exception contexts in the hash table, comprising: stack pointers for unwinding, a code
pointer to jump to (possibly we could use setjmp/longjmp instead), and a hash reference to the next outermost handler.
The hash reference to the current exception context would be held in a register. This structure could also hold the
current scry gate.

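A sketch of the setjmp/longjmp variant, using direct pointers where the note proposes hash references, and a global
where it proposes a register; all names are illustrative:

```c
#include <setjmp.h>
#include <stdint.h>

typedef struct exn_ctx {
    jmp_buf         env;        /* where to resume (setjmp/longjmp)   */
    void           *west_sp;    /* saved stack pointers for unwinding */
    void           *east_sp;
    uint64_t        scry_gate;  /* current scry gate, per the note    */
    struct exn_ctx *next;       /* next outermost handler             */
} exn_ctx;

/* Per the note, the current context would live in a register; a
 * global stands in for that here. */
static exn_ctx *current_handler;

/* Raise a crash or block: unwind to the innermost handler. A handler
 * is installed by filling in an exn_ctx, calling setjmp(ctx.env), and
 * pushing it onto the chain; on unwind, setjmp returns the toon code
 * passed to longjmp below. */
_Noreturn void raise_exn(int toon_code) {
    exn_ctx *h = current_handler;
    current_handler = h->next;   /* pop to the next outermost handler */
    longjmp(h->env, toon_code);
}
```
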