From 390bd52919d8e17bd111c1fa13c2a33231cd76f9 Mon Sep 17 00:00:00 2001 From: Edward Amsden Date: Thu, 25 Aug 2022 12:45:53 -0500 Subject: [PATCH] persistence specification: import prose from bounty --- docs/persistence.md | 35 +++++++++++++++++++++-------------- 1 file changed, 21 insertions(+), 14 deletions(-) diff --git a/docs/persistence.md b/docs/persistence.md index 35ef03a..6f7a779 100644 --- a/docs/persistence.md +++ b/docs/persistence.md @@ -1,4 +1,5 @@ # New Mars persistence +## Rationale New Mars requires a persistence layer. Nouns which are allocated in the top frame of a 2stackz memory manager can only be freed by resetting the stack. This loses all nouns. Therefore, for nouns which persist between computations, another arena is needed. @@ -20,9 +21,15 @@ The copy-on-write strategy ensures that committed data is not corrupted. Since ### Handling writes to the arena +There are several ways in which we can handle writes without overwriting, and possibly corrupting, data already committed to our store. + By default, pages in the persistent memory arena are protected read-only (`PROT_READ` flag to `mmap()` or `mprotect()`). This means that an attempt to write through a pointer into such a page will trigger a protection fault, raising a SIGSEGV to our process. -To handle this signal and resolve the fault, our handler will: +#### Immediate copying + +The approach most likely to succeed involves immediately copying the page to a new page, mapped to a new region of the file, and then mapping this page back over the same virtual memory space. + +To handle the SIGSEGV signal and resolve the fault, our handler will: - move the page to the copy-on-write set, so other threads faulting will not also attempt to copy the page. - `write()` the contents of the page to an unused region of the backing block storage. Note that this will not block on writing through to the disk, though it may write through if the OS decides.
@@ -31,11 +38,17 @@ To handle this signal and resolve the fault, our handler will: The net effect of this is that the file region to which the read-protected page was mapped will remain unmodified, but our process continues with identical contents in a virtual memory page mapped to a new region of the file. The process continues from the faulting instruction and writes to memory. Index metadata which maps regions of the file to regions of memory remains untouched at this juncture. -### Committing writes - To facilitate restoring the persistent memory arena from durable block storage, we reserve two pages at the beginning of the file for metadata storage. This does not limit us to a page for metadata; we may commit metadata objects alongside our stored data and reference them from a page. A metadata page contains a counter and a checksum of the page's contents. The page with a valid checksum and the more recent counter is used to load the data. -To atomically commit writes since the last commit to durable storage, we simply `msync()` (note: and `fsync()` or is that implied by `msync()`?) each dirty page. If all syncs succeed, we know that data in these pages is durably persisted to disk. We may now safely write a new metadata page to whichever metadata page is not in use. Once this too successfully `msync`s, our writes are durably committed. +To atomically commit writes since the last commit to durable storage, we simply `msync()` each dirty page. If all syncs succeed, we know that data in these pages is durably persisted to disk. We may now safely write a new metadata page to whichever metadata page is not in use. Once this too successfully `msync`s, our writes are durably committed. + +#### Lazy copying + +Another possible approach, which leverages kernel support for in-memory copy-on-write at the expense of using a bounded amount of swap, is to make writable mappings `MAP_PRIVATE`. This instructs the kernel to copy the page in the page cache on the first write.
Unfortunately, this is now an anonymous mapping, meaning it is not backed by file storage. +Thus, if the kernel elects to evict it, it will be written to swap space on disk. It is still necessary to map pages as `PROT_READ` and use a signal handler, because we must track which pages are dirtied. It is almost certainly necessary to set a configurable limit on the number of dirty pages before a snapshot will be committed, since the number of dirty pages is limited by the available RAM and swap. + +To commit, we select regions in the file from the free list, and `write()` each dirty page to a free region. +We must then achieve a successful `fsync()` to know that the file has been committed. ### When to commit @@ -58,16 +71,15 @@ As well as allocating memory to noun (fragments) which need to be copied into th It's not clear whether it would be necessary to shrink the backing file, but if so, this could initially be done manually with a separate program, or a heuristic could be applied. Perhaps when a certain ratio of free-to-used pages is reached, a defragmentation step is run, followed by shrinking the file. -## Snapshots +### Snapshots After each event is processed, the working state is copied to the snapshot persistent arena to be used as input for the next event. Creating a durable snapshot consists of committing, and recording a pair of event number and noun pointer in a durable metadata record. Metadata is referenced from the root page of a commit, and includes the event number of the snapshot and a pointer to the snapshotted noun. -## Event log +### Event log The event log should live in a separate arena from Arvo snapshot nouns. These arenas should be of incomparable seniority: that is, references are not permitted in either direction between them. This precludes synchronization issues between snapshot and event log commits, and ensures that old epochs of the event log can be freed without corrupting a snapshot.
-In practice, this is achieved by copying the event from the stack to the event log once it is successfully computed, so that the computed arvo core and effects reference the event on the stack, rather than the event in the event log. -This ensures that if the event or some subnoun of it is included in the Arvo snapshot, it is copied to the snapshot arena. +This requires logic in the copier to ensure that a reference is only preserved as a reference if the seniority of the source is ≤ the seniority of the destination. Epochs in the event log can be stored as linked lists in the event log persistent arena. In order to safely release effects computed from incoming events, the events must be durably written to disk. @@ -75,7 +87,7 @@ Allocations in this arena may be made by bumping a pointer within pages. Dirtyin The separate arena need not be maintained in an entirely separate memory region. Rather, on disk, two indexes, each with two root pages (which are alternated for commits), can be maintained. The allocator can track which pages belong to which arena and inform the persistence subsystem of this when requesting free pages, so that the persistence subsystem knows which index should track a newly dirtied page. -## Concurrency +### Concurrency Because nouns are immutable, and because the copy-on-write approach ensures that a new, dirty page is a copy of an existing clean page prior to mapping it, readers should never see an inconsistent state of an object so long as that object is traceable from a root. @@ -89,8 +101,3 @@ Contention for this lock is not likely to be high, and this will eliminate a gre Commit frequency of the Arvo snapshot arena is a potentially tunable value, with performance implications. Committing more often incurs more disk IO, saving nouns that will potentially soon be discarded (Ames queues, for instance).
However, under memory pressure, it makes paging more performant, as the proportion of dirty pages which must be *written* to disk to evict is smaller. Already-committed pages can simply be evicted from memory and re-read from disk when a read fault occurs. Committing less often requires writes as part of eviction, but reduces disk IO. It is desirable as long as memory pressure is low, and highly desirable for block storage where write-wearing is a concern. -## TODO -- Migration to/from vere - - Portable snapshots? - - LMDB event log import/export -- Raw block devices?