### Description
Resolves #163
In fulfillment of https://urbit.org/grants/loom-pointer-compression
### Benchmark
#### Basic brass pill fakezod boot benchmark - x86_64 linux
Pay primary attention to `Elapsed (wall clock) time`,
`Maximum resident set size (kbytes)`, and `Major (requiring I/O) page faults`.
##### Takeaway
We expected memory usage to increase, since that is the natural tradeoff of
alignment. Note that runs (2) and (3) included changes to align the _stack_ as
well as the _heap_. The run without stack alignment (4) shows that stack
alignment has no effect on max RSS, at least when booting from a pill. From
some basic evaluation I've done in gdb in the past, I expect DWORD-aligned
stack usage to increase by ~50% (rather than the theoretical 100%). Stack usage
is quite small compared to heap usage, however, so you shouldn't expect to see
this reflected in maximum RSS. Overall, maximum resident memory increased by
~33%.
The number of major page faults encountered during a brass boot is roughly
equal to before.
The elapsed (wall clock) time difference between (2) and (3) is essentially
zero: there is no measurable performance gain from the virtual bit size being a
compile-time constant.
There is a small latency cost in the current DWORD-aligned heap allocation
implementation compared to a runtime that doesn't require allocations to be
aligned. Compare the elapsed times of (1), 2:21.09 or 141.09 s, and (2),
2:23.50 or 143.50 s: ~1.7% increased latency. Run (4), however, which excluded
the stack alignment changes, came in at 2:22.55 or 142.55 s, splitting the
difference at ~1.0% increased latency. Note that this _is_ repeatable and the
1% difference isn't random: running the same program again on the same system
exhibits tiny variance.
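The percentages above are plain relative differences. As a minimal sketch for re-deriving them, the helper below (`pct_increase` and `print_takeaway` are names invented here, not part of vere) takes the figures copied from runs (1), (2), and (4):

```c
#include <stdio.h>

/* Relative increase of new_d over old_d, in percent. */
double
pct_increase(double old_d, double new_d)
{
  return 100.0 * (new_d - old_d) / old_d;
}

/* Print the deltas discussed in the takeaway. The inputs are the
   elapsed times (seconds) and max RSS (kbytes) reported by the runs. */
void
print_takeaway(void)
{
  printf("latency (1)->(2): %.1f%%\n", pct_increase(141.09, 143.50));
  printf("latency (1)->(4): %.1f%%\n", pct_increase(141.09, 142.55));
  printf("max RSS (1)->(2): %.1f%%\n", pct_increase(148036.0, 197176.0));
}
```

Calling `print_takeaway()` from a `main` is enough to reproduce the comparison on other machines by swapping in their numbers.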
##### 1) -O3 no pointer compression vere/develop
A run from the HEAD of vere/develop
commit 7c890c3350
```
Command being timed: "./urbit -t -q -F zod -B brass.pill -c zod"
User time (seconds): 1.25
System time (seconds): 0.03
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:21.09
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 148036
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3492
Minor (reclaiming a frame) page faults: 6188
Voluntary context switches: 68
Involuntary context switches: 5
Swaps: 0
File system inputs: 14866
File system outputs: 21544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
##### 2) -O3 compiletime determined virtual bit size
i/163/pointer-compression
State of pointer compression work prior to migration (where concessions to
runtime determined virtual bit size were made)
commit 4083f1c660
```
Command being timed: "./urbit -t -q -F zod -B brass.pill -c zod"
User time (seconds): 1.39
System time (seconds): 0.04
Percent of CPU this job got: 1%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:23.50
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 197176
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3487
Minor (reclaiming a frame) page faults: 6219
Voluntary context switches: 68
Involuntary context switches: 2
Swaps: 0
File system inputs: 14866
File system outputs: 21544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
##### 3) -O3 runtime determined virtual bit size
i/163/pointer-compression
Current state of pointer compression work, after implementation of migration
and the runtime determined virtual bit size concession
commit 8dffe067e1
```
Command being timed: "./urbit -t -q -F zod -B brass.pill -c zod"
User time (seconds): 1.40
System time (seconds): 0.06
Percent of CPU this job got: 1%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:23.52
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 197200
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3489
Minor (reclaiming a frame) page faults: 7242
Voluntary context switches: 69
Involuntary context switches: 4
Swaps: 0
File system inputs: 14866
File system outputs: 21544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
##### 4) -O3 runtime determined virtual bit size _WITHOUT_ stack alignment
barter-simsum/pointer-compression-no-align-stack
commit 8b0438ab3b
```
Command being timed: "./urbit -t -q -F zod -B brass.pill -c zod"
User time (seconds): 1.42
System time (seconds): 0.06
Percent of CPU this job got: 1%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:22.55
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 197204
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3492
Minor (reclaiming a frame) page faults: 6221
Voluntary context switches: 68
Involuntary context switches: 3
Swaps: 0
File system inputs: 14866
File system outputs: 21544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
##### FINAL BENCHMARK BEFORE MERGE:
This was run after some fairly significant changes to minimize malloc padding,
fix memory corruption when run with `U3_MEMORY_DEBUG`, and more.
It was agreed to keep stack alignment out of this PR, as it currently isn't
used and costs us a bit of latency.
Runtime vs compile-time determined pointer compression still shows no latency
difference on the x86 Linux machine tested (DDR4 memory). On an M2 MacBook Air,
there _was_ a 5% latency increase from compile-time to runtime pointer
compression. This may be fixed later and would not necessitate another
migration.
The additional free list sanity checking done in `u3a_loom_sane` introduces
negligible latency in `u3e_save`. On a relatively fragmented heap, it only
takes 60ms to complete. This will be kept in order to detect _some_ memory
corruption if it occurs and prevent that corruption from propagating to disk.
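To illustrate the kind of check meant here, the following is a minimal sketch of a doubly linked free list consistency walk. It is not the code in `u3a_loom_sane`: the `free_node` type, its fields, and the `max_i` cycle guard are all assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical free list node; the real u3a free boxes differ. */
typedef struct free_node {
  struct free_node* pre_u;  /* previous free node, or NULL at the head */
  struct free_node* nex_u;  /* next free node, or NULL at the tail     */
} free_node;

/* Walk the list from hed_u and return 1 if every back link agrees
   with the node we just came from, 0 otherwise. max_i bounds the walk
   so that a corrupted (cyclic) list cannot loop forever. */
int
free_list_sane(const free_node* hed_u, size_t max_i)
{
  const free_node* pre_u = NULL;
  size_t           len_i = 0;

  for ( const free_node* cur_u = hed_u; cur_u; cur_u = cur_u->nex_u ) {
    if ( cur_u->pre_u != pre_u ) {
      return 0;  /* back link disagrees with forward traversal */
    }
    if ( ++len_i > max_i ) {
      return 0;  /* longer than any valid list: probably a cycle */
    }
    pre_u = cur_u;
  }
  return 1;
}
```

A check like this only catches corruption that happens to damage list links, which matches the "_some_ memory corruption" caveat above.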
A brass pill boot was performed off of
13e0b43d8da4bdd318fcd4e3d3610caa3af4608a. Observe that there is no regression
in the `Elapsed (wall clock) time` statistic. Further, the maximum resident set
size has been reduced by roughly 24% (197200 to 150088 kbytes), back to its
pre pointer compression size (150M). This is likely due to a decrease in the
average allocation's padding.
Lastly, total sweep size was compared between a freshly booted pier without
pointer compression and one with pointer compression post migration. There is
no noticeable increase in the overall size of allocations.
```
Command being timed: "./burbit -t -q -F zod -B brass.pill -c brasspillbench"
User time (seconds): 1.27
System time (seconds): 0.04
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:21.49
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 150088
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3490
Minor (reclaiming a frame) page faults: 6188
Voluntary context switches: 64
Involuntary context switches: 4
Swaps: 0
File system inputs: 14866
File system outputs: 21544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
pad malloc internal pad calculations were largely responsible for the
corruption. It happened to be the case that without `U3_MEMORY_DEBUG` set
(which doubles the size of a `u3a_box`), we overallocated just enough memory
for the pad miscalculation to not affect us.
This fixes both the overallocation and the pad miscalculation.
This drastically simplifies the bizarre (`ald_w`, `alp_w`) alignment logic,
which also seems to have been the cause of issues with heap corruption
originating in handling of internal 16 byte alignment by `_ca_box_make_hat`.
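For readers unfamiliar with this style of alignment request, here is a minimal sketch of a pad calculation. It assumes `ald_w` is the required alignment in words and `alp_w` is the word offset inside the box that must land on that boundary; that reading of the two names, and the helper `pad_for` itself, are assumptions for illustration and not the code in this PR.

```c
#include <stdint.h>

/* Words of padding to place before a box that would otherwise start at
   word offset bas_w, so that the word alp_w words into the box falls on
   an ald_w-word boundary. Assumes ald_w is a nonzero power of two. */
uint32_t
pad_for(uint32_t bas_w, uint32_t ald_w, uint32_t alp_w)
{
  uint32_t mis_w = (bas_w + alp_w) & (ald_w - 1);  /* current misalignment */

  return (ald_w - mis_w) & (ald_w - 1);            /* 0 if already aligned */
}
```

Getting the final mask wrong (returning `ald_w` instead of 0 for an already aligned box) is the sort of off-by-one-unit error that overallocation can silently hide.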
Version numbers (currently for u3e and u3v) should be kept in their own header
to avoid dependency loops, since anything should be able to include them.
I would like to avoid literal comparisons like `if (ver_w == 2) { }` and
instead opt for `if (ver_w == U3V_VER2) { }` or `if (ver_w == U3V_LATEST) { }`.
Though this may seem like overkill at first, given that the only version we've
incremented is `u3H->ver_w`, this is the simplest way to avoid dependency
loops.
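A minimal sketch of what such a standalone header could contain follows. `U3V_VER2` and `U3V_LATEST` are the names used above; `U3V_VER1`, the numeric values, and the `needs_migration` helper are assumptions added for illustration.

```c
#include <stdint.h>

/* Hypothetical contents of a dependency-free version header.
   Only constants live here, so any other header can include it. */
#define U3V_VER1    1           /* assumed name and value */
#define U3V_VER2    2           /* matches the literal 2 compared against above */
#define U3V_LATEST  U3V_VER2    /* bump this alias when a new version lands */

/* Hypothetical helper: returns nonzero if a snapshot carrying version
   ver_w is older than the latest version and would need migrating. */
int
needs_migration(uint32_t ver_w)
{
  return ver_w < U3V_LATEST;
}
```

Comparing against `U3V_LATEST` rather than a literal means call sites do not need touching when the version is next incremented.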
Unsure how we should ultimately do this. Specifying 33 bits should be
conditional on the loom having compressed pointers obviously.
P.S. Now that the migration is implemented and is not optional, this is fine.
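As background on why the bit count is tied to compression: if every allocation is aligned so that the low bit of its offset is always zero, a 32-bit reference can drop that bit and address 2^33 units instead of 2^32. The sketch below assumes exactly one such "virtual" bit; `VITS`, `ref_pack`, and `ref_unpack` are illustrative names and this is not the loom's actual reference layout.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed number of always-zero low bits guaranteed by alignment. */
#define VITS 1u

/* Pack an aligned offset into a 32-bit reference by dropping the
   low VITS bits. The offset must be aligned and fit in 32+VITS bits. */
uint32_t
ref_pack(uint64_t off_d)
{
  assert( (off_d & ((1u << VITS) - 1)) == 0 );  /* alignment guarantee */
  assert( (off_d >> (32 + VITS)) == 0 );        /* within 33-bit range */
  return (uint32_t)(off_d >> VITS);
}

/* Recover the full offset from a packed reference. */
uint64_t
ref_unpack(uint32_t ref_w)
{
  return (uint64_t)ref_w << VITS;
}
```

Without the alignment guarantee the dropped bit would carry information, which is why a larger address range only makes sense on a loom with compressed pointers.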
I'm assuming the old switch-case was an attempted performance optimization -
more constants with math that could be elided by the compiler, e.g.
`if (gal_w > (UINT32_MAX - 35 >> 5))`.
However, looking at -O3 disassembly, I really doubt it's any faster. There are
at least as many conditional jumps and the instruction size is about 2x larger.
===
Moreover, this change does not artificially limit the size that gal_w can
be. For instance, in the previous implementation, for a value of a_y=2, gal_w
could not exceed the following without bailing:
`(UINT32_MAX - 35) >> 5` =>
`0x07FFFFFE`
Now, gal_w cannot exceed
`((UINT32_MAX - (32 + max_y)) >> (5 - a_y))` =>
`((UINT32_MAX - 35) >> 3)` =>
`0x1FFFFFFB`
===
This has been confirmed to return exactly the same results as before.
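The two limits quoted above can be evaluated directly. A minimal sketch, assuming `max_y` is 3 (so that `32 + max_y` equals the 35 in the old expression) and that `a_y` never exceeds 5; the function names are invented for the example:

```c
#include <stdint.h>

/* Old limit on gal_w for a_y=2, as quoted from the previous
   implementation: a fixed shift of 5. */
uint32_t
old_gal_max(void)
{
  return (UINT32_MAX - 35) >> 5;
}

/* New limit on gal_w: the shift shrinks as a_y grows.
   Assumes max_y == 3 and a_y <= 5 (a larger a_y would make the
   shift count negative, which is undefined in C). */
uint32_t
new_gal_max(uint8_t a_y)
{
  const uint8_t max_y = 3;

  return (UINT32_MAX - (32u + max_y)) >> (5 - a_y);
}
```

For `a_y=2` the new limit is roughly four times the old one, since the shift drops from 5 to 3.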
This was the minimal set of changes that allowed me to build vere on macOS
x86-64. See #131 for context. To build, I ran
`bazel build --clang_version="12.0.0" :urbit`.
## `chop`
`urbit chop <pier>` implements a simple, offline **event log
truncation**[^1] tool.
`chop` performs the following steps:
1. Gracefully stops the given pier (if running).
2. Backs up the current snapshot to `<pier>/.urb/bhk`.
3. Makes sure a current snapshot exists (i.e., is fully written to disk in
   `chk/*.bin` with no existing patch files).
4. Reads the metadata and the last event from the pier's event log.
5. Initializes a fresh event log in the `<pier>/.urb/log/chop` directory.
6. Writes the metadata and last event from the original log into the fresh one.
7. Renames the original event log to
   `<pier>/.urb/log/chop/data_<first>_<last>.mdb.bak`, where `first` and `last`
   are the first and last event numbers from the event log.
8. Exits.
Pilots are then free to move, archive, or delete their `.bak` event log
file, resume normal operation of their ship, and enjoy the many benefits
of lowered disk pressure and any reductions in associated hosting costs.
I've tested `chop` successfully on my own planet `~mastyr-bottec`
(multiple times), three different comets (all fresh), and multitudes of
fake galaxies.
Resolves #122.
Note: `knit`, which is the "undo" button for `chop`, is being
implemented in its own PR #184.
[^1]: https://roadmap.urbit.org/project/event-log-truncation