README and import proposal from gist

Edward Amsden 2021-10-21 18:13:52 -04:00
parent e8e3383947
commit 9fc3ff91c5
5 changed files with 235 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,5 @@
# New Mars
A redesigned Mars for the Urth/Mars Urbit runtime. Currently WIP.
Read the [proposal](notes/a-proposal-nock-performance.md) and [hypotheses](notes/b-hypotheses.md) for an overview.

notes/a-proposal-nock-performance.md Normal file

@@ -0,0 +1,141 @@
# Introduction
The current Nock implementation is a limitation on the performance of Urbit. When performance of code is not limited by
algorithmic concerns, generally the only approach to increasing performance is to jet the code in question. For code such
as arithmetic, encryption, or bitwise operations this is the correct approach. For code with more complex control flow or
memory management behavior, writing a jet is a difficult and error-prone process.
It is possible for interpreted languages to be made very fast. Evaluation of a Nock-9 currently takes many tens or even a
few hundreds of microseconds, which in a deep and heterogeneous call trace such as that of Ames quickly adds up to multiple
milliseconds to process one event. Memory management overhead after nock evaluation completes adds >1 millisecond, and
memory management overhead during evaluation is harder to measure but likely to be significant.
Functional programming language implementations mostly do not mutate, they allocate. This means that many allocations
are discarded quickly. Nock is extreme about this: there is no possible way to mutate in Nock (with the only exception being
a case of optimization where there is no sharing). Therefore allocation should be fast, and garbage should not incur
management overhead.
Further, while a computed-goto bytecode interpreter is far faster than naive structural recursion over the noun tree or a
switch-based interpreter, it still requires a computed jump between every pair of instructions, and does not admit well-known
low-level optimizations.
Urbit is a personal server. Nock is the language in which that personal server's software is provided. Browsers are personal
clients, and Javascript is the language in which browser software is provided. Javascript at one time had a reputation for
slowness due to its interpreted nature. But modern Javascript is quite fast, and this has had a qualitative, not simply
quantitative, effect on the types of software written for the browser platform.
Making Nock much faster than it is currently would plausibly have the same effect for Urbit.
It would provide immediate benefits in the form of Ames throughput and JSON handling for client interaction.
Further, applications not presently feasible on Urbit would rapidly become feasible.
This proposal also includes changes which would allow for incremental snapshotting and large looms, thus removing other
limitations to implementing applications for Urbit on Urbit.
# Ideas
## Lexical memory management
The current worker uses (explicit/manual) reference counting to manage allocated objects, and adds objects
scavenged on reclamation to a free list. This means that evaluation is subject to the overhead of reference counting all
allocated objects and of maintaining free lists when dead objects are scavenged. For a language which allocates at the
rate Nock (or really any functional language) does, this is not optimal.
However, the reference-counting scheme has the desirable property of having predictable, lexically-mappable behavior for
memory management. This behavior means that two runs of the same Nock program produce the same code traces, even within
memory management code.
This could be achieved similarly by the following system.
Two 'call' stacks are maintained. Perhaps they share a memory arena and grow towards each other, analogous to roads without
heaps. Logically, they are one, interleaved stack, that is, a push to the top of the (logical) stack pushes onto the opposite
stack from the current top of the (logical) stack.
Noun allocation is performed by extending the stack frame and writing the noun on the stack. There is no heap (but see the optimization for large nouns below).
When it is time to pop a stack frame and return a value to the control represented by the previous stack frame,
a limited form of copying collection is performed. The return value is copied to the return-target stack frame, which,
because of the interleaved stack, also has free space adjacent. Descendant nouns referenced by the current noun are
copied in their turn, and pointers updated.
Note that only references to the returning frame need to initiate copies, and there can be no references to data in
the returning frame from outside the current frame, because there is no mutation and no cyclical references in Nock.
So the copied nouns can reference nouns "further up" the stack, but nouns further up the stack cannot reference nouns
in the current stack frame.
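To make the mechanics concrete, here is a minimal C sketch of the scheme under invented names (`stacks_t`, `copy_up`, and so on); indirect atoms and real frame bookkeeping are omitted, and this is not the actual New Mars layout:

```c
/* Minimal sketch of the interleaved-stack allocator, under invented
 * names; indirect atoms and real frame bookkeeping are omitted.
 * Cells are tagged in the low pointer bit, which is free because
 * allocations are 8-byte aligned. */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t noun_t;
typedef struct { noun_t head, tail; } cell_t;

#define IS_CELL(n)  ((n) & 1ull)
#define TO_CELL(n)  ((cell_t *)(uintptr_t)((n) & ~1ull))
#define AS_NOUN(p)  (((noun_t)(uintptr_t)(p)) | 1ull)

typedef struct {
  uint8_t *west;   /* grows upward: holds frames at even depth       */
  uint8_t *east;   /* grows downward: holds frames at odd depth      */
  int      depth;  /* parity selects the stack holding the top frame */
} stacks_t;

/* Allocation is a pointer bump on the active stack: no heap, no free
 * list. */
static void *bump_alloc(stacks_t *s, size_t len) {
  if ((s->depth & 1) == 0) { void *p = s->west; s->west += len; return p; }
  s->east -= len;
  return s->east;
}

/* Copy a result noun out of the returning frame [lo, hi) into the
 * caller's frame, which sits on the opposite stack and so has free
 * space adjacent.  Nouns allocated in older frames are shared, not
 * copied: nothing outside the returning frame can point into it. */
static noun_t copy_up(stacks_t *s, noun_t n, uint8_t *lo, uint8_t *hi) {
  if (!IS_CELL(n)) return n;                /* direct atom: no copy */
  uint8_t *p = (uint8_t *)TO_CELL(n);
  if (p < lo || p >= hi) return n;          /* older frame: share   */
  cell_t *old = TO_CELL(n);
  cell_t *dup = bump_alloc(s, sizeof(cell_t));
  dup->head = copy_up(s, old->head, lo, hi);
  dup->tail = copy_up(s, old->tail, lo, hi);
  return AS_NOUN(dup);
}

/* Popping a frame: the caller's frame becomes the top, then the
 * return value is copied into it. */
static noun_t pop_frame(stacks_t *s, noun_t ret, uint8_t *lo, uint8_t *hi) {
  s->depth -= 1;
  return copy_up(s, ret, lo, hi);
}
```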
### Optimization: hash-indexed heap for large nouns
While for most computation this scheme should be an improvement, it can result in repeated copies up-the-stack
of large nouns. Nouns over a certain size can be ejected to an external heap indexed by a hash table, thus providing
de-duplication and eliminating the need to copy.
### Advantages
* Allocation in this model is very fast as it involves only a pointer increment.
* Allocations are compact (not interleaved with free space) and tend toward adjacency of related structures,
leading to generally better cache locality.
* Pause times for 'collection' are lexically limited *by the size of the noun returned, less parts of the noun
originating above the returning frame in lexical scope.*
* The predictable and repeatable memory management behavior is retained.
* Memory management overhead is proportional to the size of nouns returned, *not* the size of
discarded memory as is presently the case.
* Memory management does not require any data structures to persist between collections.
(Ephemeral memory for the collector can be allocated above the frame being scavenged.)
* Big loom/snapshots: the implementation will use 64-bit pointers and thus remove the 2GB limit on loom/snapshot size.
* Incremental snapshots: By ejecting the result of an arvo event computation to the hash table,
incremental snapshotting can be done by storing only new hashes in the table.
* Control structures for the interpreter itself are stored off the loom, simplifying it drastically.
### Disadvantages
* Copying and pointer adjustment could be expensive for large nouns (but see 'Optimization', above)
* Slightly less eager scavenging than reference counting, allocations persist for a lexical scope.
* Snapshots would not be compatible with the current loom
## Just-in-time compilation
Nock code for execution is currently compiled at runtime to bytecode, which is interpreted by a looping interpreter using
'computed gotos', that is, program addresses for labels are computed and stored in an array, which is indexed by the
opcodes of the bytecode representation.
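For illustration, here is a minimal computed-goto interpreter in C (using the GCC/Clang labels-as-values extension) with an invented three-opcode bytecode, not New Mars's actual instruction set:

```c
/* Minimal computed-goto dispatch: label addresses live in an array
 * indexed by opcode, so each instruction ends in one indirect jump,
 * with no per-iteration loop or switch overhead. */
#include <stdint.h>
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_HALT };   /* invented toy opcodes */

static int64_t run(const uint8_t *pc) {
  static const void *dispatch[] = { &&do_push, &&do_add, &&do_halt };
  int64_t stack[64], *sp = stack;

  goto *dispatch[*pc++];

do_push:
  *sp++ = *pc++;                     /* one-byte immediate operand */
  goto *dispatch[*pc++];
do_add:
  sp--; sp[-1] += sp[0];
  goto *dispatch[*pc++];
do_halt:
  return sp[-1];
}

int main(void) {
  const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
  printf("%lld\n", (long long)run(prog));   /* prints 5 */
  return 0;
}
```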
This approach is much faster than a naive looping or recursive structural interpreter. It can be made faster, especially
for code which is "hot", i.e. routinely called in rapid iteration.
The approach is to keep a counter on a segment of bytecode which is incremented each time it is run. This counter would
persist in between invocations of Arvo, so as to notice code which is 'hot' across the event loop. When a counter hits
a threshold, the bytecode is translated into an LLVM graph, which can be fed to LLVM and result in a function pointer.
This function pointer is then stored as an "automatic jet" of the code.
Of course, the JIT compiler should also respect jet hints and link in existing jets, as LLVM is not likely to e.g.
optimize the naive O(max(x, y)^2) `%+ add x y` invocation into code using the ALU.
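A sketch of what the hot-spot bookkeeping could look like in C; all names are invented, and the `compile_with_llvm` stub stands in for the elided work of building the LLVM graph and JIT-compiling it:

```c
/* Sketch of hot-spot detection for a bytecode segment; all names are
 * invented, and the stubs stand in for the real interpreter and the
 * elided LLVM pipeline. */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t noun_t;
typedef noun_t (*jet_fn)(noun_t subject);

typedef struct {
  uint8_t *code;      /* compiled Nock bytecode                    */
  size_t   len;
  uint64_t heat;      /* run counter, persisted across Arvo events */
  jet_fn   machine;   /* machine code once compiled, else NULL     */
} segment_t;

#define JIT_THRESHOLD 1000

static jet_fn compile_with_llvm(const uint8_t *code, size_t len) {
  (void)code; (void)len;
  return NULL;   /* real version: build the LLVM graph, JIT, return fn */
}

static noun_t interpret(const uint8_t *code, size_t len, noun_t subject) {
  (void)code; (void)len;
  return subject;   /* placeholder for the computed-goto interpreter */
}

noun_t run_segment(segment_t *seg, noun_t subject) {
  if (seg->machine)                   /* already compiled: call directly */
    return seg->machine(subject);
  if (++seg->heat >= JIT_THRESHOLD)   /* hot: compile once, then call    */
    if ((seg->machine = compile_with_llvm(seg->code, seg->len)))
      return seg->machine(subject);
  return interpret(seg->code, seg->len, subject);
}
```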
This approach of JIT compilation of hotspot code is used to great effect by Javascript in a context where
code loading is ephemeral and the performance benefits from a particular invocation of the JIT compiler last only
for the duration that a page is loaded. In a context where an Urbit persistently loops through much the same code for
every event (until Arvo or an application are updated) the overhead could be amortized across an even greater number of
invocations, over a longer period of time.
An even simpler approach is to JIT every formula to machine code, on the assumption that most code
will not be ephemeral.
# Tasks
## A new Mars
***Time (core): 2 months***
***Time (jet porting): ?***
A new Mars implementation is written in Racket-derived C, containing a bytecode interpreter for Nock as well as snapshotting
and event logging. The implementation initially uses Boehm GC or similar off-the-shelf memory management. Jets are supported by porting them to use an allocator supplied by the interpreter.
## Lexical memory
***Time: 3 months***
The new Mars implementation is converted to use lexical memory management as described above.
Jets may allocate memory using an allocation function provided by the interpreter, and may use this
memory as they wish, but *must not* mutate memory that they did not allocate.
Question: is Boehm modular or flexible enough that we can use all or part of it to implement this strategy?
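A sketch of what such an interpreter-supplied allocation interface might look like; every name here is hypothetical, not an existing New Mars or u3 API:

```c
/* Hypothetical allocation interface handed to ported jets; the names
 * (nm_ctx, nm_alloc, ...) are illustrative only. */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t noun_t;
typedef struct nm_ctx nm_ctx;   /* opaque interpreter context */

/* Jets build scratch buffers and result nouns only through these: */
void  *nm_alloc(nm_ctx *ctx, size_t len);               /* raw frame memory */
noun_t nm_cell(nm_ctx *ctx, noun_t head, noun_t tail);  /* allocate a cell  */
noun_t nm_atom(nm_ctx *ctx, const uint8_t *buf, size_t len);

/* A jet receives its sample read-only: mutating nouns it did not
 * allocate would break the no-mutation invariant the copier relies on. */
noun_t nm_jet_add(nm_ctx *ctx, noun_t sample);
```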
## Jets-In-Time compilation of Nock bytecode
***Time: 4 months***
The new Mars creates jets on the fly by using LLVM to compile Nock bytecode to machine code whenever some metric of heat is reached (this metric is probably just a counter, as Urbit code will tend to be highly persistent rather than ephemeral).

notes/b-hypotheses.md Normal file

@@ -0,0 +1,45 @@
# Hypotheses tested by New Mars
## Stack-only allocation for computation
**Hypotheses:**
*Stack allocation, with return nouns copied up the stack ("lexical memory management"),
is a feasible way to implement Nock. Specifically:*
- *Lexical memory management is simple to implement.*
- *Lexical memory management provides performant allocation and collection of Nock nouns.*
## Just-in-time compilation
Intuitively, compiling Nock bytecode to machine code should provide performance wins with a highly-amortized cost.
Urbit code is highly persistent. We don't dynamically load code for a short interactive session and then discard it;
instead we load Hoon code, *compiling it to Nock up front*, and continue using that code in an event loop for a long period,
until the next OTA update of Arvo or an application.
Especially since we already take the time to compile Hoon to Nock, it likely makes sense to compile Nock to machine code
that can be directly invoked without interpretation overhead.
**Hypothesis:**
*Compiling Nock bytecode to machine code and caching the compilation results in much faster Nock execution,
and the expense of compilation is acceptable given the amortization across a very high number of invocations.*
## Large-noun hash table
The major downside of the lexical memory management scheme is that large nouns allocated deep in the stack and returned from
Arvo will be repeatedly copied to return them through multiple continuation frames. This can be ameliorated by using
a separate arena for large nouns. The step of copying the noun up the stack tracks how much memory has been copied, and,
at a certain threshold, resets the stack pointer to undo the copy and instead copies the noun into the separate
arena and returns a reference.
By making this arena a hash table, we can create a store which can be copy-collected without adjusting references.
This can also serve to deduplicate nouns in the table.
The hash table serves as a place to store non-noun objects, such as bytecode or jet dashboards, and a place to store noun
metadata. Rather than suffering the overhead of possible metadata annotations on every cell, we can instead only
allow metadata as the head of a hash table entry.
This hash table also permits differential snapshotting, by storing only the hashes which are new in the table since the last
snapshot. It also permits paging of large nouns to disk, as a hash table entry could be marked with a path to a page file
and paged out.
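A C sketch of the ejection path under invented names, with the hash function and table operations elided:

```c
/* Sketch of large-noun ejection; noun_hash, table_has, table_put, and
 * hash_ref are invented names for elided helpers. */
#include <stdint.h>

typedef uint64_t noun_t;
typedef struct { uint8_t digest[32]; } hash_t;   /* e.g. a SHA-256 */

hash_t noun_hash(noun_t n);            /* content hash of a noun (elided)  */
int    table_has(hash_t h);            /* membership test (elided)         */
void   table_put(hash_t h, noun_t n);  /* deep-copy the noun into the arena */
noun_t hash_ref(hash_t h);             /* noun word tagged as a hash ref   */

/* Called from the up-the-stack copier once the bytes-copied counter
 * crosses the threshold: abandon the partial copy by restoring the
 * saved allocation pointer, then intern the noun in the hash arena. */
noun_t eject_large(uint8_t **alloc_ptr, uint8_t *saved, noun_t n) {
  *alloc_ptr = saved;          /* undo the partial up-the-stack copy       */
  hash_t h = noun_hash(n);
  if (!table_has(h))           /* content addressing deduplicates; new     */
    table_put(h, n);           /* hashes mark the incremental snapshot     */
  return hash_ref(h);          /* caller keeps a small hash reference      */
}
```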
**Hypotheses**:
- *A hash-referenced memory arena for large nouns resolves the major downside of lexical memory management by preventing repeated copying of large nouns.*
- *A hash-referenced memory arena provides a store for non-noun objects in the Nock interpreter.*
- *A hash-referenced memory arena provides transparent incremental snapshotting and disk paging of large nouns.*

notes/notes-~2021.9.23.md Normal file

@@ -0,0 +1,21 @@
# Notes ~2021.9.23
## Discussion with ~rovnys-ricfer
* Some discussion of the memory model, in particular using tag bits to identify (see the sketch after this list)
  - direct noun
  - indirect noun
  - pointer to cell
  - hash table reference
* Hash table ejection strategy
- When the copier is returning a noun, it counts how many iterations of copying it has performed
- Above a tunable threshold, an entry in the hash table is made and the noun is copied there instead.
- Existing pointers into the hash table are copied and re-inserted as new entries, thus maintaining the invariant
that a hash table entry can reference only its own memory (by direct pointer) or another hash table entry
(by hash reference).
- Nouns that require metadata (jet pointers, cached bytecode) are ejected to the hash table.
- The hash table can also store non-noun data such as bytecode.
- TBD: a collection strategy for the hash table.
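One possible encoding of the four classes above in a 64-bit noun word, as a C sketch; the bit assignments are illustrative, not decided:

```c
/* Illustrative tag-bit scheme: allocations are 8-byte aligned, which
 * frees the low three bits of a pointer-sized word for a tag. */
#include <stdint.h>

typedef uint64_t noun_t;

#define NOUN_TAG(n)  ((n) & 0x7ull)
#define TAG_DIRECT   0x0ull   /* small atom, value held directly in the word */
#define TAG_INDIRECT 0x1ull   /* pointer to a big-atom allocation            */
#define TAG_CELL     0x2ull   /* pointer to a [head tail] pair               */
#define TAG_HASHREF  0x3ull   /* reference into the large-noun hash table    */

static inline void *noun_ptr(noun_t n) {
  return (void *)(uintptr_t)(n & ~0x7ull);     /* strip the tag bits */
}
static inline uint64_t direct_val(noun_t n) {
  return n >> 3;                               /* value above the tag */
}
```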
## Infrastructure channel discussion
* Interpreter should handle nock 12 with an extra input of a scry gate stack, so it can be used to jet `+mink`
* Also need to return crash values from the interpreter, and build traces when passing through nock 11.
* ~master-morzod objects to calling the machine code from JIT compilation "automatic jets" and is probably right.

notes/notes-~2021.9.24.md Normal file

@@ -0,0 +1,23 @@
# Notes ~2021.9.24
## Exploration of `+mink`
The [`+mink`](https://github.com/urbit/urbit/blob/fa894b9690deae9e2334ccec5492ba90cb0b38f9/pkg/arvo/sys/hoon.hoon#L5978-L6106)
arm in [`hoon.hoon`](https://github.com/urbit/urbit/blob/master/pkg/arvo/sys/hoon.hoon) is a metacircular Nock interpreter
intended to be jetted by invoking the host Nock interpreter. In addition to the subject and formula to evaluate, `+mink`
takes a scry gate which is used to evaluate nock 12 (scry) formulas.
The jet uses `u3r_mean` to take apart the sample of the `+mink` gate and feeds the resulting nouns to
`u3n_nock_et`, which runs the interpreter and produces a 'toon'.
Scry gates are kept in a stack because a scry gate may itself scry, and should not re-enter itself.
## Exceptions in new mars
### Suboptimal but simple way
The simple thing to do would be to make every continuation expect a toon and, on an error or block result, immediately call
the next continuation up the stack with it, until a continuation installed e.g. by the `+mink` jet branches on the result.
This is effectively writing the interpreter as a CPS/trampoline translation of `+mink`.
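A C sketch of this simple scheme, with invented names; a `handles` flag marks continuations (such as one installed by the `+mink` jet) that branch on non-success toons:

```c
/* Sketch of toon propagation through a continuation stack; all names
 * here are invented for illustration. */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t noun_t;
typedef enum { TOON_OK, TOON_BLOCK, TOON_CRASH } toon_tag;
typedef struct { toon_tag tag; noun_t val; } toon_t; /* product, block path, or trace */

typedef struct frame frame_t;
typedef toon_t (*kont_fn)(frame_t *f, toon_t t);

struct frame {
  kont_fn  k;        /* continuation for this frame                */
  int      handles;  /* nonzero if this frame branches on any toon */
  frame_t *up;       /* next continuation up the stack             */
};

/* Trampoline over the continuation stack: success results resume
 * normal execution; error and block results skip ordinary frames
 * until a handling frame (e.g. +mink's) gets to branch on them. */
toon_t resume(frame_t *f, toon_t t) {
  for (; f != NULL; f = f->up)
    if (t.tag == TOON_OK || f->handles)
      t = f->k(f, t);
  return t;
}
```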
### Likely more optimal way
It should be possible to store exception contexts in the hash table, each comprising: stack pointers for unwinding, a
code pointer to jump to (possibly we could use setjmp/longjmp instead), and a hash reference to the next-outermost
handler. The hash reference to the current exception context would be held in a register. This structure could also hold
the current scry gate.
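A sketch of what such an exception context might hold, with invented field names (`hash_t` standing for a content hash, as in the earlier hash-reference notes):

```c
/* Illustrative exception context for the heavier-weight scheme; a
 * register would hold the hash reference of the innermost context. */
#include <setjmp.h>
#include <stdint.h>

typedef uint64_t noun_t;
typedef struct { uint8_t digest[32]; } hash_t;

typedef struct {
  uint8_t *west_sp;   /* stack pointers to unwind both stacks to       */
  uint8_t *east_sp;
  jmp_buf  env;       /* longjmp target standing in for a code pointer */
  hash_t   next;      /* next-outermost handler, by hash reference     */
  noun_t   scry;      /* scry gate in scope for nock 12                */
} exn_ctx;
```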