# u3: noun processing in C.

`u3` is the C library that makes Urbit work. If it wasn't called
`u3`, it might be called `libnoun` - it's a library for making
and storing nouns.

What's a noun? A noun is either a cell or an atom. A cell is an
ordered pair of any two nouns. An atom is an unsigned integer of
any size.
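
To make the definition concrete, here is a minimal sketch of a noun as a plain C tree. Everything named `toy_` is hypothetical and is not how `u3` stores nouns; in particular, this toy caps atoms at 64 bits, while a real atom has no size limit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* toy_noun: an atom or a cell.  Hypothetical; NOT u3's real layout.
** Real atoms are unbounded; this toy caps them at 64 bits.
*/
typedef struct toy_noun {
  uint8_t          cel_t;    // nonzero if this noun is a cell
  uint64_t         val_d;    // atom value (atoms only)
  struct toy_noun* hed_u;    // head (cells only)
  struct toy_noun* tal_u;    // tail (cells only)
} toy_noun;

/* toy_atom(): make an atom.
*/
toy_noun*
toy_atom(uint64_t val_d)
{
  toy_noun* non_u = calloc(1, sizeof(*non_u));

  if ( !non_u ) abort();
  non_u->val_d = val_d;
  return non_u;
}

/* toy_cell(): make a cell, an ordered pair of two nouns.
*/
toy_noun*
toy_cell(toy_noun* hed_u, toy_noun* tal_u)
{
  toy_noun* non_u = calloc(1, sizeof(*non_u));

  if ( !non_u ) abort();
  non_u->cel_t = 1;
  non_u->hed_u = hed_u;
  non_u->tal_u = tal_u;
  return non_u;
}

/* toy_leaves(): count the atoms in a noun.
*/
uint64_t
toy_leaves(const toy_noun* non_u)
{
  if ( !non_u->cel_t ) return 1;
  return toy_leaves(non_u->hed_u) + toy_leaves(non_u->tal_u);
}

/* toy_free(): free a noun tree; assumes no subtree is shared.
*/
void
toy_free(toy_noun* non_u)
{
  if ( non_u->cel_t ) {
    toy_free(non_u->hed_u);
    toy_free(non_u->tal_u);
  }
  free(non_u);
}
```

With these helpers, `toy_cell(toy_atom(1), toy_cell(toy_atom(2), toy_atom(3)))` builds a cell whose head is the atom 1 and whose tail is the cell of 2 and 3.
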
To the C programmer, this is not a terribly complicated data
structure, so why do you need a library for it?

One: nouns have a well-defined computation kernel, Nock, whose
spec fits on a page and gzips to 340 bytes. But the only
arithmetic operation in Nock is increment. So it's nontrivial
to compute both efficiently and correctly.
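
A rough sketch of why increment-only arithmetic is slow, written in C rather than Nock. The functions `inc`, `slow_dec`, `slow_add` and `fast_add` are hypothetical illustrations, not part of `u3`: the slow versions use nothing but increment and equality, and the fast version is the native shortcut that has to give exactly the same answer.

```c
#include <assert.h>
#include <stdint.h>

/* inc(): the only arithmetic primitive the slow path may use.
*/
static uint64_t
inc(uint64_t a_d)
{
  return a_d + 1;
}

/* slow_dec(): decrement by counting up until the successor is a_d.
** Takes a_d steps.  Toy choice: returns 0 for 0.
*/
uint64_t
slow_dec(uint64_t a_d)
{
  uint64_t i_d = 0;

  if ( 0 == a_d ) return 0;
  while ( inc(i_d) != a_d ) {
    i_d = inc(i_d);
  }
  return i_d;
}

/* slow_add(): add by incrementing a_d while decrementing b_d.
** Each step pays for a full slow_dec(), so this is quadratic in b_d.
*/
uint64_t
slow_add(uint64_t a_d, uint64_t b_d)
{
  while ( 0 != b_d ) {
    a_d = inc(a_d);
    b_d = slow_dec(b_d);
  }
  return a_d;
}

/* fast_add(): the native shortcut; must agree with slow_add().
*/
uint64_t
fast_add(uint64_t a_d, uint64_t b_d)
{
  return a_d + b_d;
}
```

The hard part is the "must agree" in the last comment: the shortcut is only correct if it matches the slow definition on every input.
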
Two: `u3` is designed to support "permanent computing," i.e., a
single-level store which is transparently checkpointed. This
implies a specialized memory-management model, etc, etc.

(Does `u3` depend on the higher levels of Urbit, Arvo and Hoon?
Yes and no. It expects you to load something shaped like an Arvo
kernel, and use it as an event-processing function. But you
don't need to use this feature if you don't want, and your kernel
can be anything you want.)

## c3: C in Urbit

Under `u3` is the simple `c3` layer, which is just how we write C
in Urbit.

When writing C in `u3`, please of course follow the conventions of
the code around you as regards indentation, etc. It's especially
important that every function have a header comment, even if it
says nothing interesting.

But some of our idiosyncrasies go beyond convention. Yes, we've
done awful things to C. Here's what we did and why we did it.
### c3: integer types

First, it's generally acknowledged that underspecified integer
types are C's worst disaster. C99 fixed this, but the `stdint`
types are wordy and annoying. We've replaced them with:

    /* Good integers.
    */
      typedef uint64_t  c3_d;    // double-word
      typedef int64_t   c3_ds;   // signed double-word
      typedef uint32_t  c3_w;    // word
      typedef int32_t   c3_ws;   // signed word
      typedef uint16_t  c3_s;    // short
      typedef int16_t   c3_ss;   // signed short
      typedef uint8_t   c3_y;    // byte
      typedef int8_t    c3_ys;   // signed byte
      typedef uint8_t   c3_b;    // bit

      typedef uint8_t   c3_t;    // boolean
      typedef uint8_t   c3_o;    // loobean
      typedef uint8_t   c3_g;    // 32-bit log - 0-31 bits
      typedef uint32_t  c3_l;    // little; 31-bit unsigned integer
      typedef uint32_t  c3_m;    // mote; also c3_l; LSB first a-z 4-char string.

    /* Bad integers.
    */
      typedef char      c3_c;    // does not match int8_t or uint8_t
      typedef int       c3_i;    // int - really bad
      typedef uintptr_t c3_p;    // pointer-length uint - really really bad
      typedef intptr_t  c3_ps;   // pointer-length int - really really bad

Some of these need explanation. A loobean is a Nock boolean -
Nock, for mysterious reasons, uses 0 as true (always say "yes")
and 1 as false (always say "no").

Nock and/or Hoon cannot tell the difference between a short atom
and a long one, but at the `u3` level every atom under `2^31` is
direct (held in the reference itself, with no allocation). The
`c3_l` type is useful to annotate this. A `c3_m` is a mote - a
string of up to 4 characters in a `c3_l`, least significant byte
first. A `c3_g` should be a 5-bit atom. Of course, C cannot
enforce these constraints, only document them.
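
As a small illustration of the mote layout, here is a sketch that packs a short string into a 32-bit value, least significant byte first, and checks the `2^31` bound. The helpers `toy_mote()` and `toy_is_direct()` are hypothetical names made up for this example; the typedefs are copied from the list above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  c3_y;    // byte
typedef uint8_t  c3_t;    // boolean
typedef uint32_t c3_w;    // word
typedef uint32_t c3_m;    // mote
typedef uint64_t c3_d;    // double-word
typedef char     c3_c;    // "bad" char, for C strings

/* toy_mote(): pack up to 4 characters into a mote, LSB first.
*/
c3_m
toy_mote(const c3_c* str_c)
{
  c3_m mot_m = 0;
  c3_w i_w;

  for ( i_w = 0; (i_w < 4) && (0 != str_c[i_w]); i_w++ ) {
    mot_m |= ((c3_m)(c3_y)str_c[i_w]) << (8 * i_w);
  }
  return mot_m;
}

/* toy_is_direct(): true if a value is under 2^31.
*/
c3_t
toy_is_direct(c3_d val_d)
{
  return (val_d < ((c3_d)1 << 31)) ? 1 : 0;
}
```

For example, `toy_mote("hoon")` is `0x6e6f6f68`: the `h` lands in the low byte and the `n` in the high byte. Because `a`-`z` are all below `0x80`, a mote built from them always stays under `2^31`.
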
Use the "bad" - i.e., poorly specified - integer types only when
interfacing with external code that expects them.

An enormous number of motes are defined in `i/c/motes.h`. There
is no reason to delete motes that aren't being used, or even to
modularize the definitions. Keep them alphabetical, though.

### c3: variables and variable naming

The `c3` style uses Hoon-style TLV variable names, with a
quasi-Hungarian syntax. This is weird, but works really well, as
long as what you're doing isn't hideous.

A TLV variable name is a random pronounceable three-letter
string, sometimes with some vague relationship to its meaning,
but usually not. Usually CVC (consonant-vowel-consonant) is a
good choice.

You should use TLVs much the way math people use Greek letters.
The same concept should in general get the same name across
different contexts. When you're working in a given area, you'll
tend to remember the binding from TLV to concept by sheer power
of associative memory. When you come back to it, it's not that
hard to relearn. And of course, when in doubt, comment it.

Variables take pseudo-Hungarian suffixes, matching in general the
suffix of the integer type:

    c3_w wor_w;     // 32-bit word

Unlike in true Hungarian, there is no change for pointer
variables. Structure variables take a `_u` suffix.
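
A short sketch of the suffix rules in use. The `toy_buf` structure and `toy_sum()` function are hypothetical, invented only to show a word, a byte pointer and a structure variable side by side.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t c3_w;    // word
typedef uint8_t  c3_y;    // byte

/* toy_buf: hypothetical byte buffer.
*/
typedef struct {
  c3_w  len_w;    // a c3_w, so _w
  c3_y* byt_y;    // pointer to c3_y: still _y, no pointer marker
} toy_buf;

/* toy_sum(): add up the bytes in a buffer.
*/
c3_w
toy_sum(toy_buf* buf_u)    // structure variable, so _u
{
  c3_w sum_w = 0;
  c3_w i_w;

  for ( i_w = 0; i_w < buf_u->len_w; i_w++ ) {
    sum_w += buf_u->byt_y[i_w];
  }
  return sum_w;
}
```
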
### c3: loobeans

The code (from `defs.h`) tells the story:

    #     define c3y      0
    #     define c3n      1

    #     define _(x)        (c3y == (x))
    #     define __(x)       ((x) ? c3y : c3n)
    #     define c3a(x, y)   __(_(x) && _(y))
    #     define c3o(x, y)   __(_(x) || _(y))

In short, use `_()` to turn a loobean into a boolean, `__()` to go
the other way. Use `!` as usual (it swaps `c3y` and `c3n`, since
they are 0 and 1), `c3y` for yes and `c3n` for no, `c3a` for "and"
and `c3o` for "or".
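
Here is a sketch of the classic mistake these macros guard against: testing a loobean directly in a C condition. The macros are copied from the listing above; `toy_pick()` and `toy_pick_wrong()` are hypothetical functions written for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  c3_o;    // loobean
typedef uint32_t c3_w;    // word

#define c3y        0
#define c3n        1
#define _(x)       (c3y == (x))
#define __(x)      ((x) ? c3y : c3n)
#define c3a(x, y)  __(_(x) && _(y))
#define c3o(x, y)  __(_(x) || _(y))

/* toy_pick(): choose yes_w if lob_o is yes, else no_w.
*/
c3_w
toy_pick(c3_o lob_o, c3_w yes_w, c3_w no_w)
{
  return _(lob_o) ? yes_w : no_w;
}

/* toy_pick_wrong(): the bug - treats the loobean as a C boolean,
** so "yes" (0) selects no_w.
*/
c3_w
toy_pick_wrong(c3_o lob_o, c3_w yes_w, c3_w no_w)
{
  return lob_o ? yes_w : no_w;
}
```

The compiler accepts both versions without complaint, which is why the `_()` wrapper has to be a habit.
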
## u3: introduction to the noun world

The division between `c3` and `u3` is that you could theoretically
imagine using `c3` as just a generic C environment. Anything to do
with nouns is in `u3`.

### u3: a map of the system

There are two kinds of symbols in `u3`: regular and irregular.
Regular symbols follow this pattern:

    prefix    purpose                      .h        .c
    -------------------------------------------------------
    u3a_      allocation                   i/n/a.h   n/a.c
    u3e_      persistence                  i/n/e.h   n/e.c
    u3h_      hashtables                   i/n/h.h   n/h.c
    u3i_      noun construction            i/n/i.h   n/i.c
    u3j_      jet control                  i/n/j.h   n/j.c
    u3m_      system management            i/n/m.h   n/m.c
    u3n_      nock computation             i/n/n.h   n/n.c
    u3r_      noun access (error returns)  i/n/r.h   n/r.c
    u3t_      profiling                    i/n/t.h   n/t.c
    u3v_      arvo                         i/n/v.h   n/v.c
    u3x_      noun access (error crashes)  i/n/x.h   n/x.c
    u3z_      memoization                  i/n/z.h   n/z.c
    u3k[a-g]  jets (transfer, C args)      i/j/k.h   j/[a-g]/*.c
    u3q[a-g]  jets (retain, C args)        i/j/q.h   j/[a-g]/*.c
    u3w[a-g]  jets (retain, nock core)     i/j/w.h   j/[a-g]/*.c

Irregular symbols always start with `u3` and obey no other rules.
They're defined in `i/n/u.h`. Finally, `i/all.h` includes all
these headers (fast compilers, yay) and is all you need to
program in `u3`.
### u3: reference counts

The only really essential thing you need to know about `u3` is
how to handle reference counts. Everything else, you can skip
and just get to work.

`u3` deals with reference-counted, immutable, acyclic nouns.
Unfortunately, we are not Apple and can't build reference
counting into your C compiler, so you need to count by hand.

Every allocated noun contains a counter which counts the number
of references to it - typically variables with type `u3_noun`.
When this counter goes to 0, the noun is freed.

To tell `u3` that you've added a reference to a noun, call the
function `u3a_gain()` or its shorthand `u3k()`. (For your
convenience, this function returns its argument.) To tell `u3`
that you've destroyed a reference, call `u3a_lose()` or `u3z()`.

(If you screw up by decrementing the counter too much, `u3` will
dump core in horrible ways. If you screw up by incrementing it
too much, `u3` will leak memory. To check for memory leaks,
set the `bug_o` flag in `u3e_boot()` - e.g., run `vere` with `-g`.
Memory leaks are difficult to debug - the best way to handle
leaks is just to revert to a version that didn't have them, and
look over your code again.)
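
The counting discipline can be sketched with a toy refcounted box. This is not `u3`'s allocator: `toy_box`, `toy_make()`, `toy_gain()`, `toy_lose()` and `toy_live()` are hypothetical stand-ins that only mimic the gain/lose behavior described above, on top of `malloc`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* toy_box: a refcounted value.  Hypothetical; not u3's allocator.
*/
typedef struct {
  uint32_t use_w;    // reference count
  uint64_t val_d;    // payload
} toy_box;

static uint32_t toy_liv_w = 0;    // boxes currently allocated

/* toy_live(): number of boxes currently allocated.
*/
uint32_t
toy_live(void)
{
  return toy_liv_w;
}

/* toy_make(): allocate a box with one reference, owned by the caller.
*/
toy_box*
toy_make(uint64_t val_d)
{
  toy_box* box_u = malloc(sizeof(*box_u));

  if ( !box_u ) abort();
  box_u->use_w = 1;
  box_u->val_d = val_d;
  toy_liv_w++;
  return box_u;
}

/* toy_gain(): record a new reference; returns its argument.
*/
toy_box*
toy_gain(toy_box* box_u)
{
  box_u->use_w++;
  return box_u;
}

/* toy_lose(): record a destroyed reference; free the box at zero.
*/
void
toy_lose(toy_box* box_u)
{
  assert(box_u->use_w > 0);    // losing too often is a bug

  if ( 0 == --box_u->use_w ) {
    free(box_u);
    toy_liv_w--;
  }
}
```

A nonzero `toy_live()` at the end of a run is this toy's version of a leak report: some reference was gained and never lost.
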
### u3: reference protocols

*THIS IS THE MOST CRITICAL SECTION IN THE `u3` DOCUMENTATION.*

The key question when calling a C function in a refcounted world
is what the function will do to the noun refcounts - and, if the
function returns a noun, whether the caller owns a reference to
the return.

There are two semantic patterns, `transfer` and `retain`. In
`transfer` semantics, the caller "gives" a use count to the
callee, which "gives back" any return. For instance, take this
code:

    {
      u3_noun foo = u3i_string("foobar");
      u3_noun bar;

      bar = u3f_futz(foo);    // transfer: our reference to foo is gone
      [...]
      u3z(bar);               // we own bar, so we release it
    }

Suppose `u3f_futz()` has `transfer` semantics. At `[...]`, my
code holds one reference to `bar` and zero references to `foo` -
which has been freed, unless it's part of `bar`. My code now
owns `bar` and gets to work with it until it's done, at which
point a `u3z()` is required.

On the other hand, if `u3f_futz()` has `retain` semantics, we
need to write

    {
      u3_noun foo = u3i_string("foobar");
      u3_noun bar;

      bar = u3f_futz(foo);    // retain: foo is still ours
      [...]
      u3z(foo);               // release foo; bar is uncounted
    }

because calling `u3f_futz()` does not release our ownership of
`foo`, which we have to free ourselves.

But if we free `bar` in the retain case, we are making a great
mistake, because our reference to it is not in any way registered
in the memory manager (which cannot track references in C
variables, of course). It is normal and healthy to have these
uncounted C references, but they must be treated with care.

The bottom line is that it's essential for the caller to know
the refcount semantics of any function which takes or returns a
noun. (In some unusual circumstances, different arguments or
returns in one function may be handled differently.)
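
To see the two protocols side by side, here is a toy model with a refcounted cell and two versions of a "head" function, one that retains and one that transfers. All the `toy_` names are hypothetical and built on `malloc`; this is a sketch of the protocol, not of how `u3` implements it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* toy_noun: refcounted atom or cell.  Hypothetical; not u3's layout.
*/
typedef struct toy_noun {
  uint32_t         use_w;    // reference count
  uint64_t         val_d;    // atom value (atoms only)
  struct toy_noun* hed_u;    // head, or NULL for an atom
  struct toy_noun* tal_u;    // tail, or NULL for an atom
} toy_noun;

static uint32_t toy_liv_w = 0;    // nouns currently allocated

/* toy_live(): number of nouns currently allocated.
*/
uint32_t toy_live(void) { return toy_liv_w; }

/* toy_gain(): add a reference; returns its argument.
*/
toy_noun* toy_gain(toy_noun* non_u) { non_u->use_w++; return non_u; }

/* toy_lose(): drop a reference; at zero, lose the children and free.
*/
void
toy_lose(toy_noun* non_u)
{
  assert(non_u->use_w > 0);

  if ( 0 == --non_u->use_w ) {
    if ( non_u->hed_u ) {
      toy_lose(non_u->hed_u);
      toy_lose(non_u->tal_u);
    }
    free(non_u);
    toy_liv_w--;
  }
}

/* toy_atom(): make an atom.  The caller owns the result.
*/
toy_noun*
toy_atom(uint64_t val_d)
{
  toy_noun* non_u = calloc(1, sizeof(*non_u));

  if ( !non_u ) abort();
  non_u->use_w = 1;
  non_u->val_d = val_d;
  toy_liv_w++;
  return non_u;
}

/* toy_cell(): make a cell.  Transfer: consumes both arguments.
*/
toy_noun*
toy_cell(toy_noun* hed_u, toy_noun* tal_u)
{
  toy_noun* non_u = toy_atom(0);

  non_u->hed_u = hed_u;
  non_u->tal_u = tal_u;
  return non_u;
}

/* toy_head_retain(): head of a cell.  RETAIN: result is uncounted.
*/
toy_noun* toy_head_retain(toy_noun* cel_u) { return cel_u->hed_u; }

/* toy_head_transfer(): head of a cell.  Transfer: consumes cel_u,
** and the caller owns the result.
*/
toy_noun*
toy_head_transfer(toy_noun* cel_u)
{
  toy_noun* hed_u = toy_gain(cel_u->hed_u);

  toy_lose(cel_u);
  return hed_u;
}
```

After `toy_head_retain(foo)`, the caller still has to `toy_lose(foo)` and must never lose the result. After `toy_head_transfer(foo)`, `foo` is gone and the result is the caller's to lose.
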
Broadly speaking, as a design question, retain semantics are more
appropriate for functions which inspect or query nouns. For
instance, `u3h()` (which takes the head of a noun) retains, so
that we can traverse a noun tree without constantly incrementing
and decrementing.

Transfer semantics are more appropriate for functions which make
nouns, which is obviously what most functions do.

In general, though, in most places it's not worth thinking about
which semantics your function should have. There is probably a
convention for it. Follow the convention.

### u3: reference conventions

The `u3` convention is that, unless otherwise specified, *all
functions have transfer semantics* - with the exception of
functions with the prefixes `u3r`, `u3x`, `u3z`, `u3q` and `u3w`,
which retain. Also, within jet directories `a` through `f` (but
not `g`), internal functions retain (for historical reasons).

If functions outside this set have retain semantics, they need to
be commented, both in the `.h` and `.c` file, with `RETAIN` in
all caps. Yes, it's this important.
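
As a sketch of what such a comment might look like, here is a hypothetical retaining function marked at both its declaration and its definition. The name `toy_peek()` and the struct are invented for illustration, and the comment layout is only modeled on the `/* Good integers.` header style shown earlier.

```c
#include <assert.h>
#include <stdint.h>

/* toy_noun: stand-in for a refcounted atom.  Hypothetical.
*/
typedef struct {
  uint32_t use_w;    // reference count
  uint64_t val_d;    // atom value
} toy_noun;

/* As it would appear in the .h file:
*/
  /* toy_peek(): read the value of an atom.  RETAIN.
  */
    uint64_t
    toy_peek(toy_noun* non_u);

/* As it would appear in the .c file:
*/
  /* toy_peek(): read the value of an atom.  RETAIN.
  */
    uint64_t
    toy_peek(toy_noun* non_u)
    {
      return non_u->val_d;    // the refcount is left untouched
    }
```
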
### u3: system and memory architecture
Describing