Add persistent memory proposal to docs directory

2024-11-22 15:08:54 +03:00 · 2022-07-30 12:22:05 -05:00 · 2022-07-30 12:22:05 -05:00 · fd1d85c171
commit fd1d85c171
parent 26424c4bae
1 changed files with 93 additions and 0 deletions
--- a/docs/persistence.md
+++ b/docs/persistence.md
@ -0,0 +1,93 @@
+# New Mars persistence
+
+New Mars requires a persistence layer. Nouns which are allocated in the top frame of a 2stackz memory manager can only be freed by resetting the stack. This loses all nouns. Therefore, for nouns which persist between computations, another arena is needed.
+
+These are also the nouns which should be included in a snapshot. Therefore, this arena ought to exist in memory mapped to persistent storage. This also represents a solution to the loom size limit: since memory is block-storage-backed, we can grow the Arvo state much larger than physical RAM or system swap space permit, and rely on virtual memory to make pages resident when we read them.
+
+Since this memory will never be used to directly allocate a noun, we do not need the speed of bump-allocating. We do require stable addressing along with fragmentation resistance. Therefore, we will take design cues from `malloc` implementations, and maintain information about free spaces in our memory arena. For nouns, freeing will be triggered by tracing or reference counting from a set of roots.
+
+This implies that any noun used as an input to a computation run in a stack arena must be maintained as a root until that computation completes. For instance: an Arvo core stored in persistent memory is used as input to an event computation. This core is maintained as a root in the persistent arena until the computation has completed.
+
+## Copy-on-Write File-Backed Memory Arena
+
+As the [Varnish](https://varnish-cache.org/docs/trunk/phk/notes.html) notes point out: a userspace program ought not to go to great lengths to juggle disk vs memory residency for persistent data. This is something the (terran) operating system already puts large amounts of work into. It is not necessary to work very hard to manage eviction and restoration of immutable data. At the greatest extent of effort, judicious use of `madvise` calls can reduce block storage wait times.
+
+Every page in our memory-backed arena is mapped to a region of a file. This ensures that the operating system can evict the page at will, without need for swap space. The file contains an index which is read when the process starts, describing which regions should be mapped to which virtual memory addresses. This permits us to persist data without serialization or deserialization. The file stores a memory image of the persistent arena.
+
+What is more difficult is to ensure that data *written* to the arena is fully synchronized and that the persistent index between file and memory locations do not point to corrupted data. This is addressed with a "copy-on-write" strategy, similar but not identical to that of [LMDB](http://www.lmdb.tech/media/20120829-LinuxCon-MDB-txt.pdf).
+
+The copy-on-write strategy ensures that commmitted data is not corrupted. Since it is not possible to "atomically" write data to disk, committed data is not overwritten, but rather copied and modified. Committing then progressively ensures that this new data is durably written to disk, and that the index of disk regions to memory regions is durably updated, before it will permit the previous copy to be overwritten.
+
+### Handling writes to the arena
+
+By default, pages in the persistent memory arena are protected read-only (`PROT_READ` flag to `mmap()` or `mprotect()`). This means that an attempt to write through a pointer into such a page will trigger a protection fault, raising a SIGSEGV to our process.
+
+To handle this signal and resolve the fault, our handler will:
+
+- record the page as being in the middle of a copy-on-write, so other threads faulting will not also attempt to copy the page.
+- `mmap()` a new page to an unused region of the backing block storage, being sure to set `MAP_SHARED`
+- Copy the contents of the faulting page to this new page.  _NOTE_: this does not yet write anything to the file.
+- `mmap()` the new region to the address of the original page, being sure to set `MAP_SHARED`, `PROT_READ`, and `PROT_WRITE`
+- `munmap()` the page mapped as a copy target
+- Move the page from the copy-on-write set to the dirty set.
+
+The net effect of this is that the file region to which the read-protected page was mapped will remain unmodified, but our process continues with identical contents in a virtual memory page mapped to a new region of the file. The process continues from the faulting instruction and writes to memory. Index metadata which maps regions of the file to regions of memory remains untouched at this juncture.
+
+### Committing writes
+
+To facilitate restoring the persistent memory arena from durable block storage, we reserve two pages at the beginning of the file for metadata storage. This does not limit us to a page for metadata, we may commit metadata objects alongside our stored data, and reference it from a page. A metadata page contains a counter and a checksum of the page's contents. The page with a valid checksum and the more recent counter is used to load the data.
+
+To atomically commit writes since the last commit to durable storage, we simply `msync()` (note: and `fsync()` or is that implied by `msync()`?) each dirty page. If all syncs succeed, we know that data in these pages is durably persisted to disk. We may now safely write a new metadata page to whichever metadata page is not in use. Once this too successfully `msync`s, our writes are durably committed.
+
+TODO: fine-grained commits e.g. for events but not a snapshot
+
+### When to commit
+
+Committing is required to know that an event is durably persisted in the log. It is also required to save a snapshot to disk.
+
+### Memory allocation
+
+Allocation of memory space for persisted nouns, and for helper structures for the runtime, follows largely the algorithm from [phkmalloc](https://papers.freebsd.org/1998/phk-malloc.files/phk-malloc-paper.pdf).
+Briefly:
+
+- Pages are tracked as free, start, continue, or partitioned
+- Small allocations seek space in a partitioned page, and only partition a page if no suitable space is found
+- Large allocations (> 1/2 a page) are allocated sufficient contiguous free pages
+
+One difference is that since all of our persistent memory is file-backed, the acquisition of new memory consists in mapping new, initially-dirty pages, backed by unused regions of block storage, instead of using the `brk` syscall.
+
+### Block region allocation
+
+As well as allocating memory to noun (fragments) which need to be copied into the persistent set, we must also allocate regions of block storage to memory pages. Fortunately this is much easier. All pages are the same size. When committing writes, we can take note of file regions which are not mapped by the new metadata, and prepopulate a free list of file regions with these regions. The list becomes usable once the commit succeeds. We simply pop regions off of the list, starting with those nearest the beginning of the file. until it is exhausted, at which point we must extend the file.
+
+It's not clear whether it would be necessary to shrink the backing file, but if so, this could initially be done manually with a separate program, or a heuristic could be applied. Perhaps when a certain ratio of free-to-used pages are run, a defragmentation step is run, followed by shrinking the file.
+
+## Snapshots
+
+After each event is processed, the working state is copied to the snapshot persistent arena to be used as input for the next event. Creating a durable snapshot consists of committing, and recording a pair of event number and noun pointer in a durable metadata record. (TODO: be less handwavy here)
+
+## Event log
+
+Epochs in the event log can be stored as linked lists in the event log persistent arena. In order to safely release effects computed from incoming events, the events must be durably written to disk.
+
+Allocations in this arena may be made by bumping a pointer within pages. Dirtying a page does not necessitate copying as long as the pointer to the most recent event can be stored consistently, and is updated *after* an event is made durable on disk. This eliminates repeated copies of a page to re-dirty it after a commit. The only time when a free block of storage must be allocated to a page is when a new page is mapped. If each epoch has its own page set, then when an epoch is discarded, pages may simply be unmapped together and their blocks re-added to the free list.
+
+## Concurrency
+
+Because nouns are immutable, and because the copy-on-write approach ensures that a new, dirty page is a copy of an existing clean page prior to mapping it, readers should never see an inconsistent state of an object so long as that object is traceable from a root.
+
+It is not necessary to support multiple writers to the arena. Event (`+poke`) computation is linearized: we cannot start computing the next event until the result of the previous event is copied to the arena. Computation of scries (`+peek`) may run concurrently, but will only read from the arena unless new code generation is necessary.
+
+Thus, it should suffice to maintain a single, global writer lock for an arena.
+Contention for this lock is not likely to be high, and this will eliminate a great deal of complexity which would obtain if page dirtying and commits required more granular synchronization.
+
+## Tuning
+
+Commit frequency of the Arvo snapshot arena is a potentially tunable value, with performance implications.
+Committing more often incurs more disk IO, saving nouns that will potentially soon be discarded (Ames queues, for instance). However, under memory pressure, it makes paging more performant, as the proportion of dirty pages which must be *written* to disk to evict is smaller. Already-committed pages can simply be evicted from memory and re-read from disk when a read fault occurs. Committing less often requires writes as part of eviction, but reduces disk IO. It is desirable as long as memory pressure as low, and highly desirable for block storage where write-wearing is a concern.
+
+## TODO
+- Migration to/from vere
+  - Portable snapshots?
+  - LMDB event log import/export
+- Raw block devices?