mirror of https://github.com/urbit/ares.git

commit 9fc3ff91c5 (parent e8e3383947): README and import proposal from gist

README.md | 5 (new file)
@@ -0,0 +1,5 @@

# New Mars

A redesigned Mars for the Urth/Mars Urbit runtime. Currently WIP.

Read the [proposal](notes/a-proposal-nock-performance.md) and [hypotheses](notes/b-hypotheses.md) for an overview.

notes/a-proposal-nock-performance.md | 141 (new file)
@@ -0,0 +1,141 @@

# Introduction

The current Nock implementation is a limitation on the performance of Urbit. When the performance of code is not
limited by algorithmic concerns, generally the only approach to increasing performance is to jet the code in question.
For code such as arithmetic, encryption, or bitwise operations this is the correct approach. For code with more complex
control flow or memory-management behavior, writing a jet is a difficult and error-prone process.

It is possible for interpreted languages to be made very fast. Evaluation of a Nock-9 currently takes many tens or even
a few hundreds of microseconds, which in a deep and heterogeneous call trace such as that of Ames quickly adds up to
multiple milliseconds to process one event. Memory-management overhead after Nock evaluation completes adds more than a
millisecond, and memory-management overhead during evaluation is harder to measure but likely to be significant.

Functional programming language implementations mostly do not mutate; they allocate. This means that many allocations
are discarded quickly. Nock is extreme about this: there is no way to mutate in Nock at all (the only exception being
an optimization applied when there is no sharing). Therefore allocation should be fast, and garbage should not incur
management overhead.

Further, while a computed-goto bytecode interpreter is far faster than naive structural recursion over the noun tree or
a switch-based interpreter, it still requires a computed jump between every pair of instructions, and does not admit
well-known low-level optimizations.

Urbit is a personal server. Nock is the language in which that personal server's software is provided. Browsers are
personal clients, and Javascript is the language in which browser software is provided. Javascript at one time had a
reputation for slowness due to its interpreted nature. But modern Javascript is quite fast, and this has had a
qualitative, not merely quantitative, effect on the kinds of software written for the browser platform.

Making Nock much faster than it currently is would plausibly have the same effect for Urbit.
It would provide immediate benefits in the form of Ames throughput and JSON handling for client interaction.
Further, applications not presently feasible on Urbit would rapidly become feasible.

This proposal also includes changes which would allow for incremental snapshotting and large looms, thus removing other
limitations on implementing applications for Urbit on Urbit.

# Ideas

## Lexical memory management

The current worker uses explicit, manual reference counting to manage allocated objects, and adds objects scavenged on
reclamation to a free list. This means that evaluation is subject to the overhead of reference-counting all allocated
objects and of maintaining free lists when dead objects are scavenged. For a language which allocates at the rate Nock
(or really any functional language) does, this is not optimal.

However, the reference-counting scheme has the desirable property of predictable, lexically mappable memory-management
behavior: two runs of the same Nock program produce the same code traces, even within memory-management code.

A similar property could be achieved by the following system.

Two 'call' stacks are maintained. Perhaps they share a memory arena and grow towards each other, analogous to roads
without heaps. Logically they are one interleaved stack; that is, a push to the top of the (logical) stack pushes onto
the opposite physical stack from the current top of the (logical) stack.

Noun allocation is performed by extending the stack frame and writing the noun on the stack. There is no heap (but see
the optimization below).

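As a sketch, assuming a simple arena layout (names like `nock_stack`, `frame_push`, and `noun_alloc` are illustrative,
not from any existing codebase), allocation looks like this:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t noun;

typedef struct {
    noun *west;      /* west stack: grows upward from the arena base   */
    noun *east;      /* east stack: grows downward from the arena end  */
    int   polarity;  /* which stack the current logical frame lives on */
} nock_stack;

/* Push a new frame: it goes on the opposite physical stack from the
 * current frame. (A real implementation would also record the frame
 * base so the frame can later be popped.) */
void frame_push(nock_stack *s) {
    s->polarity = !s->polarity;
}

/* Bump-allocate `words` noun-sized words in the current frame: just a
 * pointer increment, as the advantages list below notes. */
noun *noun_alloc(nock_stack *s, size_t words) {
    if (s->polarity == 0) {
        noun *p = s->west;
        s->west += words;    /* extend toward the east stack */
        return p;
    } else {
        s->east -= words;    /* extend toward the west stack */
        return s->east;
    }
}
```
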
When it is time to pop a stack frame and return a value to the control represented by the previous stack frame, a
limited form of copying collection is performed. The return value is copied to the return-target stack frame, which,
because of the interleaved stacks, also has free space adjacent. Descendant nouns referenced by the returned noun are
copied in their turn, and pointers updated.

Note that only references into the returning frame need initiate copies, and there can be no references to data in the
returning frame from outside the current frame, because there is no mutation and there are no cyclic references in
Nock. So the copied nouns can reference nouns "further up" the stack, but nouns further up the stack cannot reference
nouns in the current stack frame.

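A self-contained sketch of the copy-on-return step, under an assumed tagged-word representation (direct atoms stored in
the word, cells on the stack; indirect atoms and sharing preservation are omitted for brevity):

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t noun;
typedef struct { noun head, tail; } cell;

static bool  is_cell(noun n)    { return (n & 1) == 1; }
static cell *to_cell(noun n)    { return (cell *)(uintptr_t)(n & ~(noun)1); }
static noun  from_cell(cell *c) { return (noun)(uintptr_t)c | 1; }

/* Does this noun live inside the returning frame [lo, hi)? */
static bool in_frame(noun n, cell *lo, cell *hi) {
    return is_cell(n) && to_cell(n) >= lo && to_cell(n) < hi;
}

/* Copy `root` out of the returning frame [lo, hi), bump-allocating
 * cells in the caller's frame from *next. Nouns outside the frame are
 * already visible to the caller and are left in place. A real
 * implementation would use an explicit work stack (the proposal notes
 * ephemeral collector memory can live above the scavenged frame) and
 * would preserve sharing, which this naive copy does not. */
static noun copy_to_parent(noun root, cell *lo, cell *hi, cell **next) {
    if (!in_frame(root, lo, hi))
        return root;              /* direct atom, or an older noun   */
    cell *c = to_cell(root);
    cell *d = (*next)++;          /* allocate in the caller's frame  */
    d->head = copy_to_parent(c->head, lo, hi, next);
    d->tail = copy_to_parent(c->tail, lo, hi, next);
    return from_cell(d);
}
```
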
### Optimization: hash-indexed heap for large nouns

While for most computation this scheme should be an improvement, it can result in repeated copies up the stack of large
nouns. Nouns over a certain size can be ejected to an external heap indexed by a hash table, providing de-duplication
and eliminating the need to copy.

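A sketch of the ejection path, assuming a content hash has already been computed over the noun's serialization; the
chained table and its names are illustrative, and a production table would verify equality on hash collision or use a
collision-resistant hash:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SLOTS 4096

typedef struct entry {
    uint64_t      hash;   /* content hash of the serialized noun    */
    void         *data;   /* the noun's bytes, copied off the stack */
    size_t        len;
    struct entry *next;   /* chaining for slot collisions           */
} entry;

static entry *table[TABLE_SLOTS];

/* Eject a large noun: insert by content hash and return the stored
 * copy. If an entry with the same hash already exists we share it,
 * which is the de-duplication described above. */
void *eject(uint64_t hash, const void *noun_bytes, size_t len) {
    entry **slot = &table[hash % TABLE_SLOTS];
    for (entry *e = *slot; e != NULL; e = e->next)
        if (e->hash == hash)
            return e->data;           /* duplicate: no copy needed */
    entry *e = malloc(sizeof *e);
    e->hash = hash;
    e->data = malloc(len);
    memcpy(e->data, noun_bytes, len); /* one final copy, then stable */
    e->len  = len;
    e->next = *slot;
    *slot   = e;
    return e->data;
}
```
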
### Advantages
* Allocation in this model is very fast, as it involves only a pointer increment.
* Allocations are compact (not interleaved with free space) and tend toward adjacency of related structures, leading to
  generally better cache locality.
* Pause times for 'collection' are lexically limited *by the size of the noun returned, less the parts of the noun
  originating above the returning frame in lexical scope.*
* The predictable and repeatable memory-management behavior is retained.
* Memory-management overhead is proportional to the size of nouns returned, *not* to the size of discarded memory, as
  is presently the case.
* Memory management does not require any data structures to persist between collections. (Ephemeral memory for the
  collector can be allocated above the frame being scavenged.)
* Big loom/snapshots: the implementation will use 64-bit pointers and thus remove the 2GB limit on loom/snapshot size.
* Incremental snapshots: by ejecting the result of an Arvo event computation to the hash table, incremental
  snapshotting can be done by storing only new hashes in the table.
* Control structures for the interpreter itself are stored off the loom, simplifying it drastically.

### Disadvantages
* Copying and pointer adjustment could be expensive for large nouns (but see 'Optimization', above).
* Scavenging is slightly less eager than with reference counting: allocations persist for a lexical scope.
* Snapshots would not be compatible with the current loom.

## Just-in-time compilation

Nock code is currently compiled at runtime to bytecode, which is interpreted by a looping interpreter using 'computed
gotos'; that is, program addresses for labels are computed and stored in an array, which is indexed by the opcodes of
the bytecode representation.

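For concreteness, a minimal dispatch loop of this shape, using the GCC/Clang labels-as-values extension; the opcodes
are invented for illustration and are not the actual Nock bytecode:

```c
#include <stdint.h>
#include <stdio.h>

/* Opcodes invented for illustration; not the actual Nock bytecode. */
enum { OP_PUSH1, OP_ADD, OP_HALT };

int64_t run(const uint8_t *code) {
    /* Labels-as-values: each opcode indexes a stored program address,
     * so dispatch is one computed jump per instruction. */
    static void *labels[] = { &&do_push1, &&do_add, &&do_halt };
    int64_t stack[64], *sp = stack;

#define DISPATCH() goto *labels[*code++]
    DISPATCH();

do_push1: *sp++ = 1;              DISPATCH();
do_add:   sp--; sp[-1] += sp[0];  DISPATCH();
do_halt:  return sp[-1];
#undef DISPATCH
}

int main(void) {
    const uint8_t prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT };
    printf("%lld\n", (long long)run(prog));  /* prints 2 */
    return 0;
}
```
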
This approach is much faster than a naive looping or recursive structural interpreter. It can be made faster still,
especially for code which is "hot", i.e. routinely called in rapid iteration.

The approach is to keep a counter on a segment of bytecode, incremented each time the segment is run. This counter
would persist between invocations of Arvo, so as to notice code which is 'hot' across the event loop. When a counter
hits a threshold, the bytecode is translated into an LLVM graph, which can be fed to LLVM to produce a function
pointer. This function pointer is then stored as an "automatic jet" of the code.

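A sketch of the trigger logic; `interpret` and `compile_with_llvm` are placeholders for the interpreter loop and the
LLVM translation described above, and the threshold is an illustrative tuning value:

```c
#include <stdint.h>

typedef uint64_t (*machine_fn)(uint64_t subject);

/* Stand-ins for real components: the bytecode interpreter loop and
 * the bytecode-to-LLVM translation described above. */
extern uint64_t   interpret(const uint8_t *bytecode, uint64_t subject);
extern machine_fn compile_with_llvm(const uint8_t *bytecode);

enum { HOT_THRESHOLD = 1000 };   /* illustrative tuning value */

typedef struct {
    const uint8_t *bytecode;
    uint64_t       hits;      /* persists between Arvo invocations  */
    machine_fn     machine;   /* NULL until the segment becomes hot */
} code_seg;

uint64_t eval(code_seg *seg, uint64_t subject) {
    if (seg->machine != NULL)
        return seg->machine(subject);   /* the compiled "automatic jet" */
    if (++seg->hits >= HOT_THRESHOLD)
        seg->machine = compile_with_llvm(seg->bytecode);
    return interpret(seg->bytecode, subject);
}
```
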
Of course, the JIT compiler should also respect jet hints and link in existing jets, as LLVM is not likely to, e.g.,
optimize the naive, quadratic-time `%+ add x y` invocation into code using the ALU.

This approach of JIT compilation of hotspot code is used to great effect by Javascript, in a context where code loading
is ephemeral and the performance benefits from a particular invocation of the JIT compiler last only as long as a page
stays loaded. In a context where an Urbit persistently loops through much the same code for every event (until Arvo or
an application is updated), the overhead could be amortized across an even greater number of invocations, over a longer
period of time.

An even simpler approach is to JIT-compile every formula to machine code, on the assumption that most code will not be
ephemeral.

# Tasks

## A new Mars

***Time (core): 2 months***

***Time (jet porting): ?***

A new Mars implementation is written in Racket-derived C, containing a bytecode interpreter for Nock as well as
snapshotting and event logging. The implementation initially uses Boehm GC or similar off-the-shelf memory management.
Jets are supported by porting them to use an allocator supplied by the interpreter.

## Lexical memory

***Time: 3 months***

The new Mars implementation is converted to use lexical memory management as described above. Jets may allocate memory
using an allocation function provided by the interpreter, and may use this memory as they wish, but *must not* mutate
memory that they did not allocate.

Question: is Boehm modular or flexible enough that we can use all or part of it to implement this strategy?

## Jets-In-Time compilation of Nock bytecode

***Time: 4 months***

The new Mars creates jets on the fly by using LLVM to compile Nock bytecode to machine code whenever some metric of
heat is reached (this metric is probably just a counter, as Urbit code tends to be highly persistent rather than
ephemeral).

notes/b-hypotheses.md | 45 (new file)
@@ -0,0 +1,45 @@

# Hypotheses tested by New Mars

## Stack-only allocation for computation
**Hypotheses:**
*Stack allocation, with returned nouns copied up the stack ("lexical memory management"),
is a feasible way to implement Nock. Specifically:*
- *Lexical memory management is simple to implement.*
- *Lexical memory management provides performant allocation and collection of Nock nouns.*

## Just-in-time compilation
Intuitively, compiling Nock bytecode to machine code should provide performance wins with a highly amortized cost.

Urbit code is highly persistent. We don't dynamically load code for a short interactive session and then discard it;
instead we load Hoon code, compile it to Nock, and keep using that code in an event loop for a long period, until the
next OTA update of Arvo or an application.

Especially since we already take the time to compile Hoon to Nock, it likely makes sense to compile Nock to machine
code that can be directly invoked without interpretation overhead.

**Hypothesis:**
*Compiling Nock bytecode to machine code and caching the compilation results in much faster Nock execution,
and the expense of compilation is acceptable given the amortization across a very high number of invocations.*

## Large-noun hash table
The major downside of the lexical memory management scheme is that large nouns allocated deep in the stack and returned
from Arvo will be repeatedly copied to return them through multiple continuation frames. This can be ameliorated by
using a separate arena for large nouns. The step of copying the noun up the stack tracks how much memory has been
copied and, at a certain threshold, resets the stack pointer to undo the copy, instead copying the noun into the
separate arena and returning a reference.

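A sketch of that threshold check; `copy_to_parent` and `eject_to_arena` stand in for the copy and ejection routines
sketched in the proposal, and the threshold value is illustrative:

```c
#include <stddef.h>
#include <stdint.h>

enum { EJECT_THRESHOLD = 64 * 1024 };  /* bytes copied before ejecting */

/* Stand-ins for the copying and ejection routines sketched in the
 * proposal; `bytes_copied` accumulates how much the copier moved. */
extern uint64_t copy_to_parent(uint64_t root, size_t *bytes_copied);
extern uint64_t eject_to_arena(uint64_t root);

uint64_t return_noun(uint64_t root, char **stack_ptr, char *saved_sp) {
    size_t copied = 0;
    uint64_t result = copy_to_parent(root, &copied);
    if (copied > EJECT_THRESHOLD) {
        *stack_ptr = saved_sp;        /* reset the stack: undo the copy  */
        return eject_to_arena(root);  /* return a hash reference instead */
    }
    return result;
}
```
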
By making this arena a hash table, we can create a store which can be copy-collected without adjusting references.
This can also serve to deduplicate nouns in the table.

The hash table serves as a place to store non-noun objects, such as bytecode or jet dashboards, and a place to store
noun metadata. Rather than suffering the overhead of possible metadata annotations on every cell, we can instead allow
metadata only at the head of a hash-table entry.

This hash table also permits differential snapshotting, by storing only the hashes which are new in the table since the
last snapshot. It also permits paging of large nouns to disk, as a hash-table entry could be marked with a path to a
page file and paged out.

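A sketch of the differential-snapshot walk, assuming each table entry records the epoch at which it was inserted (the
layout and names are invented for illustration):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t    hash;   /* content hash identifying the entry         */
    uint64_t    epoch;  /* snapshot epoch when the entry was inserted */
    const void *data;
    size_t      len;
} snap_entry;

/* Differential snapshot: persist only entries inserted since the last
 * snapshot epoch; everything older is already on disk. */
void snapshot_diff(const snap_entry *entries, size_t n,
                   uint64_t last_epoch, FILE *out) {
    for (size_t i = 0; i < n; i++) {
        if (entries[i].epoch > last_epoch) {
            fwrite(&entries[i].hash, sizeof entries[i].hash, 1, out);
            fwrite(entries[i].data, 1, entries[i].len, out);
        }
    }
}
```
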
**Hypotheses**:
- *A hash-referenced memory arena for large nouns resolves the major downside of lexical memory management by
  preventing repeated copying of large nouns.*
- *A hash-referenced memory arena provides a store for non-noun objects in the Nock interpreter.*
- *A hash-referenced memory arena provides transparent incremental snapshotting and disk paging of large nouns.*

notes/notes-~2021.9.23.md | 21 (new file)
@@ -0,0 +1,21 @@

# Notes ~2021.9.23
## Discussion with ~rovnys-ricfer
* Some discussion of the memory model, in particular using tag bits to identify (see the sketch after this list):
  - direct noun
  - indirect noun
  - pointer to cell
  - hash table reference
* Hash table ejection strategy
  - When the copier is returning a noun, it counts how many iterations of copying it has performed.
  - Above a tunable threshold, an entry in the hash table is made and the noun is copied there instead.
  - Existing pointers into the hash table are copied and re-inserted as new entries, thus maintaining the invariant
    that a hash table entry can reference only its own memory (by direct pointer) or another hash table entry (by hash
    reference).
  - Nouns that require metadata (jet pointers, cached bytecode) are ejected to the hash table.
  - The hash table can also store non-noun data such as bytecode.
  - TBD: a collection strategy for the hash table.

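A sketch of one possible tag-bit encoding for the four cases in the first bullet, using the low two bits of a 64-bit
word; the particular tag assignment is illustrative:

```c
#include <stdint.h>

typedef uint64_t noun;

/* Low two bits of the word distinguish the four cases discussed. */
enum tag {
    TAG_DIRECT   = 0,  /* small atom stored directly in the word    */
    TAG_INDIRECT = 1,  /* pointer to an allocated big atom          */
    TAG_CELL     = 2,  /* pointer to a cell                         */
    TAG_HASH     = 3,  /* reference into the large-noun hash table  */
};

static inline enum tag noun_tag(noun n)      { return (enum tag)(n & 3); }
static inline uint64_t direct_value(noun n)  { return n >> 2; }
static inline void    *noun_ptr(noun n)      { return (void *)(uintptr_t)(n & ~(noun)3); }
```
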
## Infrastructure channel discussion
* The interpreter should handle Nock 12 with an extra input of a scry-gate stack, so that it can be used to jet
  `+mink`.
* We also need to return crash values from the interpreter, and build traces when passing through Nock 11.
* ~master-morzod objects to calling the machine code from JIT compilation "automatic jets", and is probably right.

notes/notes-~2021.9.24.md | 23 (new file)
@@ -0,0 +1,23 @@

# Notes ~2021.9.24
## Exploration of `+mink`
The [`+mink`](https://github.com/urbit/urbit/blob/fa894b9690deae9e2334ccec5492ba90cb0b38f9/pkg/arvo/sys/hoon.hoon#L5978-L6106)
arm in [`hoon.hoon`](https://github.com/urbit/urbit/blob/master/pkg/arvo/sys/hoon.hoon) is a metacircular Nock
interpreter intended to be jetted by invoking the host Nock interpreter. In addition to the subject and formula to
evaluate, `+mink` takes a scry gate which is used to evaluate Nock 12 (scry) formulas.

The jet uses `u3r_mean` to take apart the sample of the `+mink` gate and feeds the resulting nouns to `u3n_nock_et`,
which runs the interpreter and produces a 'toon'.

Scry gates are kept in a stack because a scry gate may itself scry, and should not re-enter itself.

|
||||
|
||||
## Exceptions in new mars
|
||||
### Suboptimal but simple way
|
||||
The simple thing to do would be to make every continuation expect a toon and on an error or block result, immediately call
|
||||
the next continuation up the stack with it, until a continuation installed e.g. by the +mink jet branched on the result.
|
||||
This is effectively writing the interpreter as a CPS/trampoline translation of +mink.
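A sketch of that shape; the toon representation and continuation layout are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

typedef enum { TOON_SUCCESS, TOON_BLOCK, TOON_CRASH } toon_tag;

typedef struct {
    toon_tag tag;
    uint64_t value;   /* result noun, block path, or crash trace */
} toon;

typedef struct cont {
    /* A continuation receives the toon from the step below it. The
     * default behavior is to pass %block/%crash straight upward; a
     * continuation installed by the +mink jet would branch on the
     * result instead of propagating it. */
    toon (*resume)(struct cont *self, toon result);
    struct cont *up;  /* next continuation up the stack */
} cont;

/* Trampoline: drive the continuation stack without C recursion. */
toon trampoline(cont *k, toon t) {
    while (k != NULL) {
        cont *up = k->up;
        t = k->resume(k, t);
        k = up;
    }
    return t;
}
```
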
### Likely more optimal way
It should be possible to store exception contexts in the hash table, comprising: stack pointers for unwinding, a code
pointer to jump to (possibly we could use setjmp/longjmp instead), and a hash reference to the next outermost handler.
The hash reference to the current exception context would be held in a register. This structure could also hold the
current scry gate.

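A sketch of the setjmp/longjmp variant, using direct pointers where the note proposes hash references, and a global
where it proposes a register; all names are illustrative:

```c
#include <setjmp.h>
#include <stdint.h>

typedef struct exn_ctx {
    jmp_buf         env;        /* where to resume (setjmp/longjmp)   */
    void           *west_sp;    /* saved stack pointers for unwinding */
    void           *east_sp;
    uint64_t        scry_gate;  /* current scry gate, per the note    */
    struct exn_ctx *next;       /* next outermost handler             */
} exn_ctx;

/* Per the note, the current context would live in a register; a
 * global stands in for that here. */
static exn_ctx *current_handler;

/* Raise a crash or block: unwind to the innermost handler. A handler
 * is installed by filling in an exn_ctx, calling setjmp(ctx.env), and
 * pushing it onto the chain; on unwind, setjmp returns the toon code
 * passed to longjmp below. */
_Noreturn void raise_exn(int toon_code) {
    exn_ctx *h = current_handler;
    current_handler = h->next;   /* pop to the next outermost handler */
    longjmp(h->env, toon_code);
}
```
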