# u3: noun processing in C.

`u3` is the C library that makes Urbit work. If it wasn't called
`u3`, it might be called `libnoun` - it's a library for making
and storing nouns.

What's a noun? A noun is either a cell or an atom. A cell is an
ordered pair of any two nouns. An atom is an unsigned integer of
any size.

To the C programmer, this is not a terribly complicated data
structure, so why do you need a library for it?

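
To make the definition concrete, here is a minimal sketch of a noun as a naive C programmer might write it. This is not how `u3` represents nouns: the `toy_` names are hypothetical, atoms are capped at 64 bits (real atoms are unbounded), and there is no sharing or reference counting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy noun: NOT u3's representation. Atoms are capped at 64 bits
** here, whereas a real atom can be any size.
*/
typedef struct toy_noun {
  int              is_cell;   /* 1 for a cell, 0 for an atom */
  uint64_t         atom;      /* value, if an atom */
  struct toy_noun* head;      /* first child, if a cell */
  struct toy_noun* tail;      /* second child, if a cell */
} toy_noun;

/* toy_atom(): allocate an atom.
*/
toy_noun* toy_atom(uint64_t val)
{
  toy_noun* non = calloc(1, sizeof(*non));

  assert(non != NULL);
  non->atom = val;
  return non;
}

/* toy_cell(): allocate a cell; takes ownership of both children.
*/
toy_noun* toy_cell(toy_noun* hed, toy_noun* tel)
{
  toy_noun* non = calloc(1, sizeof(*non));

  assert(non != NULL);
  non->is_cell = 1;
  non->head = hed;
  non->tail = tel;
  return non;
}

/* toy_free(): free a whole tree (safe only because nothing is shared).
*/
void toy_free(toy_noun* non)
{
  if ( non->is_cell ) {
    toy_free(non->head);
    toy_free(non->tail);
  }
  free(non);
}
```

With these, `toy_cell(toy_atom(1), toy_cell(toy_atom(2), toy_atom(3)))` builds the noun `[1 2 3]`. The sketch already hints at the hard parts: unbounded atoms, shared subtrees, and who frees what.
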

One: nouns have a well-defined computation kernel, Nock, whose
spec fits on a page and gzips to 340 bytes. But the only
arithmetic operation in Nock is increment. So it's nontrivial
to compute both efficiently and correctly.

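
As a rough illustration of the efficiency problem, here is a sketch in plain C (not Nock, and not `u3` code) of decrement built from nothing but increment and equality. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* inc(): the one arithmetic primitive we allow ourselves.
*/
static uint64_t inc(uint64_t a)
{
  return a + 1;
}

/* slow_dec(): decrement using only inc() and equality. Counts up
** from 0 until the next value would be n, so it takes n - 1 steps.
*/
uint64_t slow_dec(uint64_t n)
{
  uint64_t i = 0;

  assert(n > 0);               /* 0 has no predecessor */
  while ( inc(i) != n ) {
    i = inc(i);
  }
  return i;
}

/* fast_dec(): the same answer in one machine operation.
*/
uint64_t fast_dec(uint64_t n)
{
  assert(n > 0);
  return n - 1;
}
```

Both functions agree on every positive input, but `slow_dec()` is linear in the value of its argument. An interpreter that wants to be fast has to take the short cut without ever giving a different answer from the slow definition.
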

Two: `u3` is designed to support "permanent computing," i.e., a
single-level store which is transparently checkpointed. This
implies a specialized memory-management model, etc, etc.

(Does `u3` depend on the higher levels of Urbit, Arvo and Hoon?
Yes and no. It expects you to load something shaped like an Arvo
kernel, and use it as an event-processing function. But you
don't need to use this feature if you don't want to, and your
kernel can be anything you want.)


## c3: C in Urbit

Under `u3` is the simple `c3` layer, which is just how we write C
in Urbit.

When writing C in `u3`, please of course follow the conventions of
the code around you as regards indentation, etc. It's especially
important that every function have a header comment, even if it
says nothing interesting.

But some of our idiosyncrasies go beyond convention. Yes, we've
done awful things to C. Here's what we did and why we did it.

### c3: integer types

First, it's generally acknowledged that underspecified integer
types are C's worst disaster. C99 fixed this, but the `stdint`
types are wordy and annoying. We've replaced them with:

    /* Good integers.
    */
      typedef uint64_t c3_d;    // double-word
      typedef int64_t  c3_ds;   // signed double-word
      typedef uint32_t c3_w;    // word
      typedef int32_t  c3_ws;   // signed word
      typedef uint16_t c3_s;    // short
      typedef int16_t  c3_ss;   // signed short
      typedef uint8_t  c3_y;    // byte
      typedef int8_t   c3_ys;   // signed byte
      typedef uint8_t  c3_b;    // bit

      typedef uint8_t  c3_t;    // boolean
      typedef uint8_t  c3_o;    // loobean
      typedef uint8_t  c3_g;    // 32-bit log - 0-31 bits
      typedef uint32_t c3_l;    // little; 31-bit unsigned integer
      typedef uint32_t c3_m;    // mote; also c3_l; LSB first a-z 4-char string.

    /* Bad integers.
    */
      typedef char      c3_c;   // does not match int8_t or uint8_t
      typedef int       c3_i;   // int - really bad
      typedef uintptr_t c3_p;   // pointer-length uint - really really bad
      typedef intptr_t  c3_ps;  // pointer-length int - really really bad

Some of these need explanation. A loobean is a Nock boolean -
Nock, for mysterious reasons, uses 0 as true (always say "yes")
and 1 as false (always say "no").

Nock and/or Hoon cannot tell the difference between a short atom
and a long one, but at the `u3` level every atom under `2^31` is
direct. The `c3_l` type is useful to annotate this. A `c3_m` is
a mote - a string of up to 4 characters in a `c3_l`, least
significant byte first. A `c3_g` should be a 5-bit atom. Of
course, C cannot enforce these constraints, only document them.

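
Since C only documents these constraints, a small sketch may help pin them down. The helpers `mote_of()` and `fits_direct()` below are hypothetical illustrations, not functions from `u3`; the typedefs are copied from the list above so the block stands alone.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t c3_d;    /* double-word */
typedef uint32_t c3_w;    /* word */
typedef uint32_t c3_m;    /* mote; up to 4 characters, LSB first */
typedef uint8_t  c3_o;    /* loobean: 0 is yes, 1 is no */

/* mote_of(): pack up to 4 characters into a mote, first character
** in the least significant byte. Extra characters are ignored.
*/
c3_m mote_of(const char* str_c)
{
  c3_m mot_m = 0;
  c3_w len_w;
  c3_w i_w;

  assert(str_c != NULL);
  len_w = (c3_w)strlen(str_c);

  for ( i_w = 0; (i_w < len_w) && (i_w < 4); i_w++ ) {
    mot_m |= ((c3_m)(uint8_t)str_c[i_w]) << (8 * i_w);
  }
  return mot_m;
}

/* fits_direct(): yes (0) iff val_d is under 2^31, no (1) otherwise.
*/
c3_o fits_direct(c3_d val_d)
{
  return (val_d < ((c3_d)1 << 31)) ? 0 : 1;
}
```

For example, `mote_of("hoon")` is `0x6e6f6f68`: `h` (`0x68`) lands in the low byte and `n` (`0x6e`) in the high byte. Because `a`-`z` are all below `0x80`, the top bit of a four-letter mote is always clear, so it fits in the 31 bits of a `c3_l`.
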

Use the "bad" - i.e., poorly specified - integer types only when
interfacing with external code that expects them.

An enormous number of motes are defined in `i/c/motes.h`. There
is no reason to delete motes that aren't being used, or even to
modularize the definitions. Keep them alphabetical, though.

### c3: variables and variable naming

The `c3` style uses Hoon-style TLV variable names, with a
quasi-Hungarian syntax. This is weird, but works really well, as
long as what you're doing isn't hideous.

A TLV variable name is a random pronounceable three-letter
string, sometimes with some vague relationship to its meaning,
but usually not. Usually CVC (consonant-vowel-consonant) is a
good choice.

You should use TLVs much the way math people use Greek letters.
The same concept should in general get the same name across
different contexts. When you're working in a given area, you'll
tend to remember the binding from TLV to concept by sheer power
of associative memory. When you come back to it, it's not that
hard to relearn. And of course, when in doubt, comment it.

Variables take pseudo-Hungarian suffixes, matching in general the
suffix of the integer type:

    c3_w wor_w;     // 32-bit word

Unlike in true Hungarian, there is no change for pointer
variables. Structure variables take a `_u` suffix.


### c3: loobeans

The code (from `defs.h`) tells the story:

    #   define c3y      0
    #   define c3n      1

    #   define _(x)        (c3y == (x))
    #   define __(x)       ((x) ? c3y : c3n)
    #   define c3a(x, y)   __(_(x) && _(y))
    #   define c3o(x, y)   __(_(x) || _(y))

In short, use `_()` to turn a loobean into a boolean, `__()` to go
the other way. Use `!` as usual, `c3y` for yes and `c3n` for no,
`c3a` for "and" and `c3o` for "or".

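
A small sketch of the macros in use follows. The macro definitions are copied from the listing above; `is_even()`, `count_even()` and `loobean_check()` are hypothetical helpers added for illustration. (Names like `_` and `__` are reserved by the C standard, so this leans on the compiler tolerating them, as the `c3` style itself does.)

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  c3_o;    /* loobean */
typedef uint32_t c3_w;    /* word */

#define c3y      0
#define c3n      1

#define _(x)        (c3y == (x))
#define __(x)       ((x) ? c3y : c3n)
#define c3a(x, y)   __(_(x) && _(y))
#define c3o(x, y)   __(_(x) || _(y))

/* is_even(): yes iff val_w is even.
*/
c3_o is_even(c3_w val_w)
{
  return __(0 == (val_w & 1));
}

/* count_even(): count even words; branches on a loobean via _().
*/
c3_w count_even(const c3_w* arr_w, c3_w len_w)
{
  c3_w num_w = 0;
  c3_w i_w;

  for ( i_w = 0; i_w < len_w; i_w++ ) {
    if ( _(is_even(arr_w[i_w])) ) {
      num_w++;
    }
  }
  return num_w;
}

/* loobean_check(): the basic identities, as assertions.
*/
void loobean_check(void)
{
  assert(_(c3y) && !_(c3n));          /* loobean to boolean */
  assert(c3y == __(1));               /* boolean to loobean */
  assert(c3n == !c3y);                /* ! flips a loobean too */
  assert(c3n == c3a(c3y, c3n));
  assert(c3y == c3o(c3y, c3n));
}
```

The trap to watch for is writing `if ( is_even(x) )` without the `_()`: since yes is 0, a bare loobean in a C condition tests for "no".
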

## u3: introduction to the noun world

The division between `c3` and `u3` is that you could theoretically
imagine using `c3` as just a generic C environment. Anything to do
with nouns is in `u3`.

### u3: a map of the system

There are two kinds of symbols in `u3`: regular and irregular.
Regular symbols follow this pattern:

    prefix      purpose                         .h          .c
    ----------------------------------------------------------------
    u3a_        allocation                      i/n/a.h     n/a.c
    u3e_        persistence                     i/n/e.h     n/e.c
    u3h_        hashtables                      i/n/h.h     n/h.c
    u3i_        noun construction               i/n/i.h     n/i.c
    u3j_        jet control                     i/n/j.h     n/j.c
    u3m_        system management               i/n/m.h     n/m.c
    u3n_        nock computation                i/n/n.h     n/n.c
    u3r_        noun access (error returns)     i/n/r.h     n/r.c
    u3t_        profiling                       i/n/t.h     n/t.c
    u3v_        arvo                            i/n/v.h     n/v.c
    u3x_        noun access (error crashes)     i/n/x.h     n/x.c
    u3z_        memoization                     i/n/z.h     n/z.c
    u3k[a-g]    jets (transfer, C args)         i/j/k.h     j/[a-g]/*.c
    u3q[a-g]    jets (retain, C args)           i/j/q.h     j/[a-g]/*.c
    u3w[a-g]    jets (retain, nock core)        i/j/w.h     j/[a-g]/*.c

Irregular symbols always start with `u3` and obey no other rules.
They're defined in `i/n/u.h`. Finally, `i/all.h` includes all
these headers (fast compilers, yay) and is all you need to
program in `u3`.

### u3: reference counts

The only really essential thing you need to know about `u3` is
how to handle reference counts. Everything else, you can skip
and just get to work.

`u3` deals with reference-counted, immutable, acyclic nouns.
Unfortunately, we are not Apple and can't build reference
counting into your C compiler, so you need to count by hand.

Every allocated noun contains a counter which counts the number
of references to it - typically variables with type `u3_noun`.
When this counter goes to 0, the noun is freed.

To tell `u3` that you've added a reference to a noun, call the
function `u3a_gain()` or its shorthand `u3k()`. (For your
convenience, this function returns its argument.) To tell `u3`
that you've destroyed a reference, call `u3a_lose()` or `u3z()`.

(If you screw up by decrementing the counter too much, `u3` will
dump core in horrible ways. If you screw up by incrementing it
too much, `u3` will leak memory. To check for memory leaks,
set the `bug_o` flag in `u3e_boot()` - e.g., run `vere` with `-g`.
Memory leaks are difficult to debug - the best way to handle
leaks is just to revert to a version that didn't have them, and
look over your code again.)

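
The mechanics of counting by hand can be sketched with a toy allocator. Everything below is hypothetical: `toy_gain()` and `toy_lose()` only imitate the roles described for `u3a_gain()` and `u3a_lose()`, the noun layout is not `u3`'s, and the `toy_live` counter is a crude stand-in for a leak check.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy refcounted noun; NOT u3's layout or allocator.
*/
typedef struct toy_noun {
  uint32_t         use_w;     /* reference count */
  int              is_cell;   /* 1 for a cell, 0 for an atom */
  uint64_t         atom;      /* value, if an atom */
  struct toy_noun* head;      /* children, if a cell */
  struct toy_noun* tail;
} toy_noun;

int toy_live = 0;             /* nouns currently allocated */

static toy_noun* toy_new(void)
{
  toy_noun* non = calloc(1, sizeof(*non));

  assert(non != NULL);
  non->use_w = 1;             /* the caller holds the first reference */
  toy_live++;
  return non;
}

/* toy_atom(): make an atom with one reference.
*/
toy_noun* toy_atom(uint64_t val)
{
  toy_noun* non = toy_new();

  non->atom = val;
  return non;
}

/* toy_cell(): make a cell; takes over the caller's references
** to hed and tel.
*/
toy_noun* toy_cell(toy_noun* hed, toy_noun* tel)
{
  toy_noun* non = toy_new();

  non->is_cell = 1;
  non->head = hed;
  non->tail = tel;
  return non;
}

/* toy_gain(): add a reference; returns its argument.
*/
toy_noun* toy_gain(toy_noun* non)
{
  non->use_w++;
  return non;
}

/* toy_lose(): drop a reference; at zero, release children and free.
*/
void toy_lose(toy_noun* non)
{
  assert(non->use_w > 0);     /* losing too often is a bug */

  if ( 0 == --non->use_w ) {
    if ( non->is_cell ) {
      toy_lose(non->head);
      toy_lose(non->tail);
    }
    free(non);
    toy_live--;
  }
}
```

Building `[a a]` from one atom `a` takes two gains, one per slot in the cell, after which the caller still has to lose its own reference to `a`. If `toy_live` is nonzero once everything has been lost, a gain was not matched by a lose.
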

### u3: reference protocols

*THIS IS THE MOST CRITICAL SECTION IN THE `u3` DOCUMENTATION.*

The key question when calling a C function in a refcounted world
is what the function will do to the noun refcounts - and, if the
function returns a noun, what it does to the return.

There are two semantic patterns, `transfer` and `retain`. In
`transfer` semantics, the caller "gives" a use count to the
callee, which "gives back" any return. For instance, if I have

    {
      u3_noun foo = u3i_string("foobar");
      u3_noun bar;

      bar = u3f_futz(foo);
      [...]
      u3z(bar);
    }

Suppose `u3f_futz()` has `transfer` semantics. At `[...]`, my
code holds one reference to `bar` and zero references to `foo` -
which has been freed, unless it's part of `bar`. My code now
owns `bar` and gets to work with it until it's done, at which
point a `u3z()` is required.

On the other hand, if `u3f_futz()` has `retain` semantics, we
need to write

    {
      u3_noun foo = u3i_string("foobar");
      u3_noun bar;

      bar = u3f_futz(foo);
      [...]
      u3z(foo);
    }

because calling `u3f_futz()` does not release our ownership of
`foo`, which we have to free ourselves.

But if we free `bar`, we are making a great mistake, because our
reference to it is not in any way registered in the memory
manager (which cannot track references in C variables, of
course). It is normal and healthy to have these uncounted
C references, but they must be treated with care.

The bottom line is that it's essential for the caller to know
the refcount semantics of any function which takes or returns a
noun. (In some unusual circumstances, different arguments or
returns in one function may be handled differently.)

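
To see the two protocols side by side, here is a toy sketch with one accessor written both ways. None of this is `u3` code: the noun layout, `toy_make()`, `toy_lose()`, `head_retain()` and `head_transfer()` are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy refcounted noun; NOT u3's layout.
*/
typedef struct toy_noun {
  unsigned         use;       /* reference count */
  unsigned long    atom;      /* value, if an atom */
  struct toy_noun* head;      /* both NULL for an atom */
  struct toy_noun* tail;
} toy_noun;

int toy_live = 0;             /* nouns currently allocated */

/* toy_make(): an atom if hed and tel are NULL, else a cell that
** takes over the caller's references to hed and tel.
*/
toy_noun* toy_make(unsigned long atom, toy_noun* hed, toy_noun* tel)
{
  toy_noun* non = calloc(1, sizeof(*non));

  assert(non != NULL);
  non->use = 1;
  non->atom = atom;
  non->head = hed;
  non->tail = tel;
  toy_live++;
  return non;
}

/* toy_lose(): drop a reference; free at zero.
*/
void toy_lose(toy_noun* non)
{
  assert(non->use > 0);

  if ( 0 == --non->use ) {
    if ( non->head ) {
      toy_lose(non->head);
      toy_lose(non->tail);
    }
    free(non);
    toy_live--;
  }
}

/* head_retain(): RETAIN. Returns an uncounted reference into cel.
** The caller still owns cel and must NOT lose the result.
*/
toy_noun* head_retain(toy_noun* cel)
{
  return cel->head;
}

/* head_transfer(): consumes the caller's reference to cel and
** returns a counted reference, which the caller must lose.
*/
toy_noun* head_transfer(toy_noun* cel)
{
  toy_noun* hed = cel->head;

  hed->use++;                 /* the result gets its own count */
  toy_lose(cel);              /* the argument is given up */
  return hed;
}
```

After `head_retain(cel)`, the result is only valid while `cel` is alive, and losing it would free the head out from under the cell. After `head_transfer(cel)`, the cell is gone and the head survives only because the function took out a count on it first.
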

Broadly speaking, as a design question, retain semantics are more
appropriate for functions which inspect or query nouns. For
instance, `u3h()` (which takes the head of a noun) retains, so
that we can traverse a noun tree without constantly incrementing
and decrementing.

Transfer semantics are more appropriate for functions which make
nouns, which is obviously what most functions do.

In general, though, in most places it's not worth thinking about
what your function does. There is probably a convention for it.
Follow the convention.

### u3: reference conventions

The `u3` convention is that, unless otherwise specified, *all
functions have transfer semantics* - with the exception of the
prefixes: `u3r`, `u3x`, `u3z`, `u3q` and `u3w`. Also, within
jet directories `a` through `f` (but not `g`), internal functions
retain (for historical reasons).

If functions outside this set have retain semantics, they need to
be commented, both in the `.h` and `.c` file, with `RETAIN` in
all caps. Yes, it's this important.


### u3: system and memory architecture

Describing