vere/pkg
Ted Blackman c8be121455
Pointer Compression to enable 8G Loom (#164)
### Description

Resolves #163

In fulfillment of https://urbit.org/grants/loom-pointer-compression

### Benchmark

#### Basic brass pill fakezod boot benchmark - x86_64 linux

Pay primary attention to `Elapsed (wall clock) time`,
`Maximum resident set size (kbytes)`, and `Major (requiring I/O) page faults`.
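
To pull just those three statistics out of a report like the ones below, a minimal sketch could look like this. The helper name `parse_time_verbose` and the `KEY_FIELDS` tuple are illustrative assumptions, not part of vere or of GNU `time`; the sample values are copied from run (1).

```python
# Hypothetical helper: extract selected statistics from GNU `time -v` output.
KEY_FIELDS = (
    "Elapsed (wall clock) time (h:mm:ss or m:ss)",
    "Maximum resident set size (kbytes)",
    "Major (requiring I/O) page faults",
)


def parse_time_verbose(report):
    """Map each `label: value` line of a `time -v` report to a dict entry."""
    fields = {}
    for line in report.splitlines():
        # Labels can contain colons themselves (see the Elapsed line),
        # so split on the last ": " rather than the first.
        label, sep, value = line.strip().rpartition(": ")
        if sep:
            fields[label] = value
    return fields


# Excerpt of run (1), copied from the benchmark output in this description.
REPORT = """\
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:21.09
Maximum resident set size (kbytes): 148036
Major (requiring I/O) page faults: 3492
Minor (reclaiming a frame) page faults: 6188
"""

fields = parse_time_verbose(REPORT)
summary = {label: fields[label] for label in KEY_FIELDS}
for label, value in summary.items():
    print(f"{label}: {value}")
```

The values are kept as strings here; converting them is left to whatever comparison is being made.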

##### Takeaway

We expected memory usage to increase, since that is the natural cost of
stricter alignment. Note that runs (2) and (3) include changes that align the
_stack_ as well as the _heap_. The run without stack alignment (4) shows that
stack alignment has no effect on max RSS, at least when booting from a pill.
From some basic evaluation I've done in gdb in the past, I expect stack usage
when DWORD-aligned to increase by ~50% (rather than a theoretical 100%). Stack
usage is quite small compared to heap usage, however, so you shouldn't expect
to see this reflected in maximum RSS. Overall, maximum resident memory
increased by ~33% (148036 kB in run (1) vs. 197176 kB in run (2)).
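
The tradeoff can be illustrated with a small sketch. This is not vere's allocator: the function names, the 29-bit reference width, the 4-byte word size, and the example allocation sizes are all assumptions chosen for illustration. The idea is that a reference counting in aligned units reaches further for the same number of bits, while every allocation has to be rounded up to that alignment.

```python
WORD_BYTES = 4  # assumption: references count 32-bit words


def align_up(size_words, align_words):
    """Round an allocation size up to the next multiple of align_words."""
    return (size_words + align_words - 1) // align_words * align_words


def compress(word_offset, shift):
    """Store a word offset in units of 2**shift words; it must be aligned."""
    if word_offset % (1 << shift):
        raise ValueError("offset is not aligned to the compression unit")
    return word_offset >> shift


def decompress(ref, shift):
    """Recover the word offset from a compressed reference."""
    return ref << shift


def addressable_bytes(ref_bits, shift):
    """Bytes reachable by a ref_bits-wide reference at the given shift."""
    return (1 << ref_bits) * (1 << shift) * WORD_BYTES


# Hypothetical allocation sizes, in words, padded to 2-word alignment.
sizes = [3, 5, 6, 9]
padded = [align_up(s, 2) for s in sizes]
print(sum(sizes), "words requested,", sum(padded), "words after padding")

# With an assumed 29-bit reference, each extra shift bit doubles the range.
for shift in (0, 1, 2):
    print(shift, addressable_bytes(29, shift) // 2**30, "GiB")
```

Under these assumed parameters a shift of 2 reaches 8 GiB, and odd-sized allocations pay for it in padding, which is the kind of cost the RSS figures reflect.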

The number of major page faults encountered during a brass boot is roughly
equal to before.

The elapsed (wall clock) time difference between (2) and (3) is essentially
zero. There is essentially no performance gained by the virtual bit size being
a compile-time constant.

There is a small latency cost in the current DWORD-aligned heap allocation
implementation compared to a runtime that doesn't require allocations to be
aligned. Compare the elapsed times of (1), 2:21.09 or 141.09 s, and (2),
2:23.50 or 143.50 s: ~1.7% increased latency. Run (4), however, which excluded
the stack alignment changes, took 2:22.55 or 142.55 s, so we split the
difference at ~1.0% increased latency. Note that this _is_ repeatable and the
1% difference isn't random: running the same program again on the same system
exhibits tiny variance.
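
The percentages above can be reproduced from the elapsed times reported for each run. The following is a small sketch; the helper names `elapsed_seconds` and `overhead_percent` are illustrative, and the timestamps are the ones from runs (1), (2), and (4).

```python
def elapsed_seconds(stamp):
    """Convert GNU time's `h:mm:ss` or `m:ss.ff` elapsed format to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def overhead_percent(baseline, candidate):
    """Relative slowdown of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100


baseline = elapsed_seconds("2:21.09")      # run (1), vere/develop
with_stack = elapsed_seconds("2:23.50")    # run (2), heap and stack aligned
without_stack = elapsed_seconds("2:22.55") # run (4), heap aligned only

print(f"run (2): {overhead_percent(baseline, with_stack):.1f}% slower")
print(f"run (4): {overhead_percent(baseline, without_stack):.1f}% slower")
```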

##### 1) -O3 no pointer compression vere/develop

A run from the HEAD of vere/develop

commit 7c890c3350

```
Command being timed: "./urbit -t -q -F zod -B brass.pill -c zod"
User time (seconds): 1.25
System time (seconds): 0.03
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:21.09
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 148036
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3492
Minor (reclaiming a frame) page faults: 6188
Voluntary context switches: 68
Involuntary context switches: 5
Swaps: 0
File system inputs: 14866
File system outputs: 21544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

##### 2) -O3 compiletime determined virtual bit size i/163/pointer-compression

State of pointer compression work prior to migration (where concessions to
runtime determined virtual bit size were made)

commit 4083f1c660

```
Command being timed: "./urbit -t -q -F zod -B brass.pill -c zod"
User time (seconds): 1.39
System time (seconds): 0.04
Percent of CPU this job got: 1%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:23.50
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 197176
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3487
Minor (reclaiming a frame) page faults: 6219
Voluntary context switches: 68
Involuntary context switches: 2
Swaps: 0
File system inputs: 14866
File system outputs: 21544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

##### 3) -O3 runtime determined virtual bit size i/163/pointer-compression

Current state of pointer compression work, after implementation of migration
and the runtime determined virtual bit size concession

commit 8dffe067e1

```
Command being timed: "./urbit -t -q -F zod -B brass.pill -c zod"
User time (seconds): 1.40
System time (seconds): 0.06
Percent of CPU this job got: 1%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:23.52
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 197200
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3489
Minor (reclaiming a frame) page faults: 7242
Voluntary context switches: 69
Involuntary context switches: 4
Swaps: 0
File system inputs: 14866
File system outputs: 21544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

##### 4) -O3 runtime determined virtual bit size _WITHOUT_ stack alignment barter-simsum/pointer-compression-no-align-stack

commit 8b0438ab3b

```
Command being timed: "./urbit -t -q -F zod -B brass.pill -c zod"
User time (seconds): 1.42
System time (seconds): 0.06
Percent of CPU this job got: 1%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:22.55
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 197204
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3492
Minor (reclaiming a frame) page faults: 6221
Voluntary context switches: 68
Involuntary context switches: 3
Swaps: 0
File system inputs: 14866
File system outputs: 21544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

##### FINAL BENCHMARK BEFORE MERGE:

This was run after some fairly significant changes to minimize malloc padding,
fix memory corruption when run with `U3_MEMORY_DEBUG`, and more.

It was agreed to keep stack alignment out of this PR, as it currently isn't
used and costs us a bit of latency.

Runtime vs compiletime determined pointer compression still shows no latency
difference on the x86 Linux machine tested (DDR4 memory). On an M2 MacBook Air,
there _was_ a 5% latency increase from compiletime to runtime pointer
compression. This may be fixed later and would not necessitate another
migration.

The additional free list sanity checking done in `u3a_loom_sane` introduces
negligible latency in `u3e_save`. On a relatively fragmented heap, it only
takes 60ms to complete. This will be kept in order to detect _some_ memory
corruption if it occurs and prevent that corruption from propagating to disk.
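
As a rough illustration of the idea (not the actual `u3a_loom_sane` implementation, whose checks are not described here), a free list sanity pass might walk the list and refuse to save if any link or bound looks wrong. Everything in this sketch is an assumption: the `FreeBlock` layout, the function names, and the particular checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FreeBlock:
    offset: int           # start of the block, in words from the heap base
    size: int             # block length in words
    prev: Optional[int]   # offset of the previous free block, or None
    next: Optional[int]   # offset of the next free block, or None


def free_list_sane(head, blocks, heap_words):
    """Walk a doubly linked free list and report whether it is consistent."""
    seen = set()
    prev, cur = None, head
    while cur is not None:
        block = blocks.get(cur)
        if block is None or cur in seen:
            return False  # dangling link or cycle
        if block.size <= 0 or block.offset + block.size > heap_words:
            return False  # block lies outside the heap
        if block.prev != prev:
            return False  # back link disagrees with the forward walk
        seen.add(cur)
        prev, cur = cur, block.next
    return True


def save_snapshot(head, blocks, heap_words, write):
    """Hypothetical save path: only write to disk if the free list is sane."""
    if not free_list_sane(head, blocks, heap_words):
        raise RuntimeError("free list corrupt; refusing to save")
    write()


blocks = {
    8: FreeBlock(offset=8, size=4, prev=None, next=32),
    32: FreeBlock(offset=32, size=16, prev=8, next=None),
}
print(free_list_sane(8, blocks, heap_words=64))
```

A check of this kind only catches corruption that happens to damage the structures it inspects, which matches the claim above that it detects _some_ memory corruption.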

A brass pill boot was performed off of
13e0b43d8da4bdd318fcd4e3d3610caa3af4608a. Observe that there is no regression
in the Elapsed (wall clock) time statistic. Further, the maximum resident set
size has been reduced by roughly 24% (197200 kB in run (3) vs. 150088 kB here),
back to about its pre pointer compression size (150M). This is likely due to a
decrease in the average allocation's padding.
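
The memory figures can be compared directly from the `Maximum resident set size (kbytes)` lines of each report. This is a small sketch; the dictionary keys and the `percent_change` helper are illustrative labels, and the values are copied from the runs in this description.

```python
# Maximum resident set size (kbytes), copied from each report.
MAX_RSS_KB = {
    "run 1": 148036,  # vere/develop, no pointer compression
    "run 2": 197176,  # compiletime virtual bit size
    "run 3": 197200,  # runtime virtual bit size
    "run 4": 197204,  # runtime, without stack alignment
    "final": 150088,  # final benchmark before merge
}


def percent_change(before, after):
    """Signed relative change from before to after, in percent."""
    return (after - before) / before * 100


increase = percent_change(MAX_RSS_KB["run 1"], MAX_RSS_KB["run 2"])
reduction = percent_change(MAX_RSS_KB["run 3"], MAX_RSS_KB["final"])
net = percent_change(MAX_RSS_KB["run 1"], MAX_RSS_KB["final"])

print(f"run 1 -> run 2: {increase:+.1f}%")   # about +33.2%
print(f"run 3 -> final: {reduction:+.1f}%")  # about -23.9%
print(f"run 1 -> final: {net:+.1f}%")        # about +1.4%
```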

Lastly, total sweep size was compared between a freshly booted pier without
pointer compression and with pointer compression post migration. There is no
noticeable increase in the overall size of allocations.

```
Command being timed: "./burbit -t -q -F zod -B brass.pill -c brasspillbench"
User time (seconds): 1.27
System time (seconds): 0.04
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:21.49
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 150088
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3490
Minor (reclaiming a frame) page faults: 6188
Voluntary context switches: 64
Involuntary context switches: 4
Swaps: 0
File system inputs: 14866
File system outputs: 21544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
2023-02-28 15:56:25 -05:00

| Name | Last commit | Date |
|---|---|---|
| c3 | refactor alignment functions | 2023-02-28 12:07:37 -05:00 |
| ent | build: fix broken import for macos-x86_64 | 2023-01-24 17:34:21 +02:00 |
| noun | u3a_loom_sane() | 2023-02-28 12:07:37 -05:00 |
| ur | Build with musl instead of glibc on Linux (#27) | 2023-01-09 13:54:11 -05:00 |
| urcrypt | Build with bazel on darwin-arm64 (#13) | 2023-01-09 13:46:53 -05:00 |
| vere | Pointer Compression to enable 8G Loom (#164) | 2023-02-28 15:56:25 -05:00 |