Merge pull request #224 from urbit/blog-text

Blog text
Sigilante 2022-11-15 07:19:41 -06:00 committed by GitHub
commit f3157b5977

@ -0,0 +1,492 @@
+++
title = "What Every Hooner Should Know About Text on Urbit"
date = "2022-11-15"
description = "How many ways can you write a single word?"
[extra]
author = "N E Davis"
ship = "~lagrev-nocfep"
image = "https://media.urbit.org/site/posts/essays/blog-text-bottles.png"
+++
![](https://media.urbit.org/site/posts/essays/blog-text-bottles.png)
# What Every Hooner Should Know About Text on Urbit
## Forms of Text
[Text strings](https://en.wikipedia.org/wiki/String_%28computer_science%29) are sequences of characters. At one level, the file containing code is itself a string; at a more fine-grained level, we take strings to mean byte sequences obtained either from literals (like `'Hello Mars'`) or from external APIs. This blog post expands on [existing docs](https://developers.urbit.org/guides/additional/strings) to explain what is going on with text in various corners of Hoon.
Setting aside [literal syntax](https://developers.urbit.org/blog/literals), Urbit distinguishes quite a few text representation types:
1. `cord`s (`@t`, [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_byte))
2. `knot`s (`@ta`)
3. `term`s (`@tas`)
4. `tape`s (`(list @tD)`)
5. UTF-32 strings (`@c`)
6. `tour`s (`(list @c)`)
7. `tank`s (formatted print trees)
8. `tang`s (`(list tank)`)
9. `wain`s (`(list cord)`)
10. `wall`s (`(list tape)`)
11. `path`s (`(list knot)`) (with alias `wire`)
12. JSON-tagged trees
13. Sail (for HTML)
Let's examine each of these in turn.
### `cord` (`@t`)
A `cord` is a [UTF-8](https://en.wikipedia.org/wiki/UTF-8) [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_byte) atom used to represent text directly. A `cord` is denoted by single quotes `'surrounding the text'` and has no restrictions other than requiring valid UTF-8 content (thus all Unicode characters). `cord`s are preferred over `tape`s when text is not being processed.
```hoon
> *@t
''
> ((sane %t) 'Hello Mars!')
%.y
```
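To make the LSB byte order concrete, here is a small Dojo sketch. It only recasts the atom to `@ux` and measures it in bytes with `++met`; the first character `H` (`0x48`) ends up in the lowest byte.

```hoon
> `@ux`'Hello'
0x6f.6c6c.6548
> `@t`0x6f.6c6c.6548
'Hello'
> (met 3 'Hello')
5
```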
One big difference between `cord`s and strings in other languages is that Urbit uniformly expects escape characters (such as `\n`, newline) to be written as their ASCII value in hexadecimal: thus, Hoon uses `\0a` for C-style `\n`.
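A short Dojo sketch of the hexadecimal escape, assuming nothing beyond the standard library: the escaped newline is a single byte with value `0xa`, so it adds exactly one byte to the `cord`.

```hoon
> `@ux`'\0a'
0xa
> (met 3 'Hello\0aMars')
10
```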
### `knot` (`@ta`)
A `knot` is an atom type that permits only a subset of the URL-safe ASCII characters (thus excluding control characters, spaces, upper-case characters, and ``!"#$%&'()*+,/:;<=>?@[\]^` {|}``). Stated positively, `knot`s can contain lower-case characters, numbers, and `-._~`. A `knot` is denoted by starting with the unique prefix `~.` sigdot. Generally `knot`s are used for paths (as in Clay, for wires, and so forth).
As the Dojo doesn't actually check for atom validity, it is possible to erroneously "cast" a value into a `knot` representation when it is not a valid `knot`. Use `++sane` to produce a check gate to avoid attempting to parse invalid `knot`s.
```hoon
> *@ta
~.
> ((sane %ta) 'Hello Mars!')
%.n
> ((sane %ta) 'hellomars')
%.y
```
You can see all ASCII characters checked for their `knot` compatibility using ``(turn (gulf 32 127) |=(a=@ [`@t`a ((sane %ta) a)]))``. `++wood` is a `cord` escape: it takes a `@t`, replaces any `@ta`-invalid characters with escape sequences, and produces a valid `@ta` (`++woad` reverses the escape).
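As an illustrative check of the character rules above (the sample strings are arbitrary), `++sane` accepts the four permitted punctuation marks and rejects spaces and upper-case letters:

```hoon
> ((sane %ta) 'hello-mars_1.0~')
%.y
> ((sane %ta) 'hello mars')
%.n
> ((sane %ta) 'Hello')
%.n
```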
### `term` (`@tas`)
A `term` is an atom type intended for marking tags, types, and labels. A value prefixed with `%` cen such as `%hello` is first a _constant_ (_q.v._) and only possesses `term`-nature if explicitly marked as such with `@tas`. A term is [defined](https://developers.urbit.org/reference/hoon/basic) as “an atomic ASCII string which obeys symbol rules: lowercase and digit only, infix hyphen, first character must be a lowercase letter.”
Urbit uses `term`s to represent internal data tags throughout the Hoon compiler, the Arvo kernel, and userspace.
(Note that the empty `term` is written `%$`, not `%~`. `%~` is a constant null value, not a `term`.)
As with `knot`s, values can be incorrectly cast to `@tas` in the Dojo. Use `++sane` to avoid issues as a result of this behavior.
Here we also use the _type spear_ `-:!>` to display the type of each value.
```hoon
> *@tas
%$
> -:!>(%hello-mars)
#t/%hello-mars
> -:!>(`@tas`%hello-mars)
#t/@tas
> ((sane %tas) 'Hello Mars!')
%.n
> ((sane %tas) 'hello-mars')
%.y
> -:!>(%~)
#t/%~
```
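A few more `++sane` checks, offered as a sketch of the symbol rules quoted above (the sample values are arbitrary): a digit may appear after the first character but not in first position, and characters such as `_` are rejected.

```hoon
> ((sane %tas) 'mars2')
%.y
> ((sane %tas) '2mars')
%.n
> ((sane %tas) 'hello_mars')
%.n
```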
### `tape` (`(list @tD)`)
A `tape` is a list of `@tD` 8-bit atoms. Similar to `cord`s, `tape`s support UTF-8 text and all Unicode characters. Each byte is represented as its own serial entry, rather than as a whole character. `tape`s are `list`s, not atoms, so they can be easily parsed and processed using `list` tools such as `++snag`, `++oust`, and so forth.
```hoon
> ""
""
> `(list @)`""
~
> "Hello Mars!"
"Hello Mars!"
> "Hello \"Mars\"!"
"Hello \"Mars\"!"
> `(list @t)`"Hello \"Mars\"!"
<|H e l l o   " M a r s " !|>
```
The `tape` type is slightly more restrictive than just `(list @t)`, and so `(list @t)` has a slightly different representation yielded to it by the pretty-printer.
```hoon
> "Hello Mars"
"Hello Mars"
> `(list @t)`"Hello Mars"
<|H e l l o   M a r s|>
```
What's the `@tD` doing in `(list @tD)`? By convention, a suffixed upper-case letter indicates the size of the entry in bits, with `A` for 2⁰ = 1, `B` for 2¹ = 2, `C` for 2² = 4, `D` for 2³ = 8, and so forth. While the inclusion of `D` isn't coercive, it is advisory: a `tape` is processed in such a way that multi-byte characters are broken into successive bytes:
```hoon
> `(list @ux)``(list @)`"küßî"
~[0x6b 0xc3 0xbc 0xc3 0x9f 0xc3 0xae]
```
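Because a `tape` is an ordinary `list`, the standard list gates apply directly. The following Dojo sketch uses only `++lent`, `++snag`, `++oust`, and `++flop`; note that `++lent` counts bytes, so the four-character string from the previous example has length seven.

```hoon
> (lent "Hello Mars!")
11
> (snag 0 "Hello Mars!")
'H'
> (oust [0 6] "Hello Mars!")
"Mars!"
> (flop "Hello Mars!")
"!sraM olleH"
> (lent "küßî")
7
```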
#### Converting Text to Hoon
There are a few ways to get from a `cord` of text to a Hoon representation.
Most commonly, one has a value as text and needs to get it as an atom, or vice versa.
- [`++scot`](https://developers.urbit.org/reference/hoon/stdlib/4m#scot) takes an aura tag and a Hoon atom, and renders the atom as a `knot`.
```hoon
> (scot %ud 1.000)
~.1.000
> (scot %ux 0xdead.beef)
~.0xdead.beef
> (scot %p ~sampel-palnet)
~.~sampel-palnet
> (scot %si --1)
~.--0i1
```
This example shows the atom literal syntax we wrote about recently:
```hoon
> (scot %t 'Hello Mars')
~.~~~48.ello.~4d.ars
> ~~~48.ello.~4d.ars
'Hello Mars'
```
- [`++scow`](https://developers.urbit.org/reference/hoon/stdlib/4m#scow) does the same but to a `tape`.
- [`++slaw`](https://developers.urbit.org/reference/hoon/stdlib/4m#slaw) parses a `cord` written in Hoon aura notation into a `unit` of an atom, yielding `~` when the text does not parse.
```hoon
> (slaw %ux '0xdead.beef')
[~ 3.735.928.559]
> (slaw %p '~sampel-palnet')
[~ 1.624.961.343]
> (slaw %p '~sample-planet')
~
```
- [`++ream`](https://developers.urbit.org/reference/hoon/stdlib/5d#ream) accepts a `cord` and shows the resulting abstract syntax tree of Hoon.
```hoon
> (ream '+(2)')
[%dtls p=[%sand p=%ud q=2]]
```
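The list above gives no transcript for `++scow`, so here is a brief sketch of it next to `++scot`'s `tape`-producing behavior. It also shows `++slav`, a standard-library sibling of `++slaw` not otherwise covered in this post, which returns the bare atom and crashes on invalid input instead of producing `~`.

```hoon
> (scow %ud 1.000)
"1.000"
> (scow %ux 0xdead.beef)
"0xdead.beef"
> (scow %p ~sampel-palnet)
"~sampel-palnet"
> (slav %ud '1.000')
1.000
```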
Other methods, such as text to number, are included in the discussion of JSON and MIME type data below.
#### Interpolation
`tape`s support interpolation: including the result of Hoon expressions as text in the middle of the tape.
Curly braces `{` kel and `}` ker splice in the result of an expression that already produces a `tape`.
```hoon
> "There are {(scow %ud (sub (pow 2 128) (pow 2 64)))} comets."
"There are 340.282.366.920.938.463.444.927.863.358.058.659.840 comets."
```
Angle brackets `<` gal and `>` gar inside the braces convert the value to text automatically:
```hoon
> "There are many ships, but {<our>} is my ship."
"There are many ships, but ~zod is my ship."
```
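Two more interpolation cases, as a sketch: `planet` is a hypothetical Dojo binding introduced only for this example. A value that is already a `tape` goes directly in braces, a non-`tape` value needs the angle brackets, and a `cord` can be spliced in after `++trip`.

```hoon
> =planet "Mars"
> "Hello {planet}!"
"Hello Mars!"
> "Two plus two is {<(add 2 2)>}."
"Two plus two is 4."
> "The cord is {(trip 'short')}."
"The cord is short."
```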
#### `cord` v. `tape`
Most commonly, developers will represent text using either `tape`s or `cord`s. Both of these facilitate straightforward direct representation as string literals using either single quotes `'example of cord'` or double quotes `"example of tape"`.
As a practical matter, `tape`s occupy more space than their corresponding `cord`s. `tape`s are implemented as linked lists in the runtime. These are easy to work with but consume more memory and can take longer to process in some ways.
Prefer `cord`s for data storage and representation, but `tape`s for data processing.
A `cord` can be transformed into a `tape` using `++trip` (mnemonic "tape rip"). The reverse transformation, from `tape` to `cord`, is accomplished via `++crip` (mnemonic "cord rip").
```hoon
> (trip 'Hello Mars!')
"Hello Mars!"
> (crip "Hello Mars!")
'Hello Mars!'
```
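In practice the two conversions bracket the processing step: unpack the `cord` with `++trip`, work on the `tape` with list and text gates, then repack with `++crip`. A minimal sketch using `++flop`, `++cass` (lower-casing), and `++weld` (concatenation):

```hoon
> (crip (flop (trip 'Hello Mars!')))
'!sraM olleH'
> (crip (cass (trip 'Hello Mars!')))
'hello mars!'
> (crip (weld (trip 'Hello ') "Mars!"))
'Hello Mars!'
```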
#### An Aside on Unicode
Unicode is a chart of character representations, with each character receiving a unique number or _codepoint_. This codepoint is then represented in various ways in binary encodings, the most common of which is [UTF-8](https://en.wikipedia.org/wiki/UTF-8). UTF-8 is a variable-byte encoding scheme which balances the economy of representing common characters like ASCII using only a single byte with the ability to represent characters from more complex character sets like Chinese `漢語` or Cherokee `ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ`. While something of a pain when processing byte-by-byte, this allows for an adaptively compact way of writing values (rather than the mostly-zeroes [UTF-32](https://en.wikipedia.org/wiki/UTF-32) mode, available in Urbit as `@c`). A `char` is nominally a single UTF-8 byte in Hoon, but it's simply an alias for `@t` and doesn't enforce bitwidth.
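The variable width is easy to observe from the Dojo. In this sketch, `++met` with a block size of 3 counts bytes: the ASCII word takes one byte per character, while each of the two Chinese characters takes three.

```hoon
> (met 3 'Mars')
4
> (met 3 '漢語')
6
> (lent "漢語")
6
```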
Joel Spolsky wrote [a classic article on Unicode](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) which happily has been partly superseded by much more extensive software support in the two decades since its publication.
### `@c` & `tour` (`(list @c)`)
As just mentioned, Unicode has several distinct encoding schemes. [UTF-32](https://en.wikipedia.org/wiki/UTF-32) represents any Unicode codepoint in four bytes, meaning that index accesses are direct (rather than needing to be calculated as with UTF-8). Urbit provides UTF-32 `@c` values mainly so that the terminal stack can track cursor position; you will rarely see them in userspace.
You can use `++taft` to convert from a UTF-8 `cord` to a UTF-32 `@c`, and `++tuft` to go the other way.
```hoon
> (taft 'hello')
~-hello
> (taft 'Hello Mars')
~-~48.ello.~4d.ars
> `@ux`(taft 'Hello Mars')
0x73.0000.0072.0000.0061.0000.004d.0000.0020.0000.006f.0000.006c.0000.006c.0000.0065.0000.0048
> (tuft ~-~48.ello.~4d.ars)
'Hello Mars'
```
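For the list form, the standard library's `++tuba` turns a `tape` into a `tour`. As a sketch, compare its output with the seven UTF-8 bytes shown earlier for the same string: here each entry is one whole codepoint (the double cast through `(list @)` is needed because `@c` does not nest directly in `@ux`).

```hoon
> `(list @ux)``(list @)`(tuba "küßî")
~[0x6b 0xfc 0xdf 0xee]
> (lent (tuba "küßî"))
4
```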
One library, `l10n`, proposes to handle text as a list of UTF-8 multi-byte characters, `calf` or `(list @t)`, rather than a `tape`, which has each byte as a separate entry. This eases processing for certain Unicode text operations.
### `tank`s (formatted print trees) & `tang`s (`(list tank)`)
Moving past the simple text types, we find that text alone provides little information about structure or display. Formatted print trees, or `tank`s, are commonly used to produce error messages and other data displays within the Dojo.
A `tank` is a structure of tagged values. The tag indicates to the pretty-printer how to convert the final value to a `tape` for output (using `ram:re`).
```hoon
> ~(ram re 'Hello Mars')
"Hello Mars"
> ~(ram re leaf+"Hello Mars")
"Hello Mars"
> ~(ram re rose+[["|" "«" "»"] leaf+"Hello Mars" leaf+"Phobos" leaf+"Deimos" ~])
"«Hello Mars|Phobos|Deimos»"
> %~  ram  re
  :-  %palm
  :-  ["|" "<" ":" ">"]
  :~  leaf+"Hello Mars"
      rose+[["║" "«" "»"] leaf+"Hello Mars" leaf+"Phobos" leaf+"Deimos" ~]
  ==
"<:Hello Mars|«Hello MarsPhobosDeimos»>"
```
Formatted text based on `tank`s is very helpful when working with `%say` generators.
### `wain`s (`(list cord)`) & `wall`s (`(list tape)`)
Collections of `cord`s and `tape`s are occasionally useful when building output.
The `shoe`/`sole` CLI libraries use `wain`s and `wall`s for various aspects of rendering an app at the CLI.
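As a sketch of how a `wain` arises, the Zuse gates `++to-wain:format` and `++of-wain:format` split a `cord` into lines on newline bytes and join them back. Neither is discussed elsewhere in this post, so treat this as an illustration and not as the way `shoe` and `sole` work internally.

```hoon
> (to-wain:format 'Hello\0aMars')
<|Hello Mars|>
> (lent (to-wain:format 'Hello\0aMars'))
2
> =((of-wain:format ~['Hello' 'Mars']) 'Hello\0aMars')
%.y
```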
### `path`s (`(list knot)`) (with alias `wire`)
Gall agents and Clay both use `path`s to uniquely identify resources such as noun data on the file system or subscriptions. Furthermore, a `wire` is an alias for a `path` which particularly denotes the subscriber's identification, preferably unique. Any sequence of valid `@ta` values separated by `/` fas becomes a `path`, and `=` tis entries in the first three slots are expanded to the current Clay `beak`.
```hoon
> /hello/mars
[%hello %mars ~]
> /1/2/3
[~.1 ~.2 ~.3 ~]
> /
~
> /===
[~.~zod ~.base ~.~2022.11.9..19.13.51..efb6 ~]
```
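The untyped literals above print as cells of constants; casting to `path` gives the familiar rendering. This sketch also shows the standard-library converters `++spat` (`path` to `cord`) and `++stab` (`cord` to `path`), and that each element of a `path` is a `knot`.

```hoon
> `path`/hello/mars
/hello/mars
> (spat /hello/mars)
'/hello/mars'
> (stab '/hello/mars')
/hello/mars
> (snag 1 `path`/hello/mars)
~.mars
```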
### JSON-style strings
[JSON](https://en.wikipedia.org/wiki/JSON) is a data interchange format based on text. Web apps and several other platforms use JSON as a fairly concise human-readable way to transmit information, including text.
Hoon represents the equivalent structure of the JSON as a tagged noun. This requires parsing a JSON string into a tagged noun structure, then reparsing that into particular Hoon values.
For our purposes here, a JSON-style string thus means a tagged string `s+'Hello Mars'`.
```hoon
> =myjson '{
"firstName": "John",
"lastName": "Smith",
"isAlive": true,
"age": 27,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021-3100"
},
"phoneNumbers": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "office",
"number": "646 555-4567"
}
],
"children": [
"Catherine",
"Thomas",
"Trevor"
],
"spouse": null
}'
> (de-json:html myjson)
[ ~
[ %o
p
{ [p='firstName' q=[%s p='John']]
[p='lastName' q=[%s p='Smith']]
[ p='children'
q=[%a p=~[[%s p='Catherine'] [%s p='Thomas'] [%s p='Trevor']]]
]
[ p='address'
q
[ %o
p
{ [p='postalCode' q=[%s p='10021-3100']]
[p='streetAddress' q=[%s p='21 2nd Street']]
[p='city' q=[%s p='New York']]
[p='state' q=[%s p='NY']]
}
]
]
[ p='phoneNumbers'
q
[ %a
p
~[
[ %o
p
{ [p='type' q=[%s p='home']]
[p='number' q=[%s p='212 555-1234']]
}
]
[ %o
p
{ [p='type' q=[%s p='office']]
[p='number' q=[%s p='646 555-4567']]
}
]
]
]
]
[p='spouse' q=~]
[p='isAlive' q=[%b p=%.y]]
[p='age' q=[%n p=~.27]]
}
]
]
```
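For a single tagged string, the round trip is short. The sketch below unwraps `s+'Hello Mars'` as a `cord` with `++so:dejs:format` and as a `tape` with `++sa:dejs:format`, then goes the other way by building a one-key object with `++frond:enjs:format` and serializing it. The `tape` output of `en-json:html` is what we would expect on a kernel of this post's vintage; later kernels may differ.

```hoon
> (so:dejs:format s+'Hello Mars')
'Hello Mars'
> (sa:dejs:format s+'Hello Mars')
"Hello Mars"
> (en-json:html (frond:enjs:format 'planet' s+'Mars'))
"{\"planet\":\"Mars\"}"
```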
#### Converting Text to Hoon (and Vice Versa)
Notice that most of the values in the `json` data structure are tagged `%s` (string), with a few exceptions: `%a` array, `%b` boolean, `%n` number, and `%o` object (a map). The tricky part of converting these values to and from text is the `%n` numbers, since Hoon has several number types.
Thus we must consider how to convert `json` values [to](https://developers.urbit.org/reference/hoon/zuse/2d_6) and [from](https://developers.urbit.org/reference/hoon/zuse/2d_1-5) Hoon representations. Fortunately, most gates one would need are already included in the Zuse standard library for handling `json` structures. The standard JSON-style operations include:
- [`++numb:enjs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_1-5#numbenjsformat) converts from `@u` to a JSON number (as `knot`).
```hoon
> (numb:enjs:format 0xdead.beef)
[%n p=~.3735928559]
```
- [`++ne:dejs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_6#nedejsformat) parses a JSON number as a real, or `@rd`.
```hoon
> (ne:dejs:format n+'0.31415e1')
.~3.1415
```
- [`++ni:dejs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_6#nidejsformat) parses a JSON number as an integer, or `@ud`.
```hoon
> (ni:dejs:format n+'65536')
65.536
```
- `++ns:dejs:format` parses a JSON number as a signed integer, or `@sd`.
```hoon
> (ns:dejs:format n+'-1')
-1
```
- [`++nu:dejs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_6#nudejsformat) parses a JSON-style string as a hexadecimal.
```hoon
> (nu:dejs:format s+'deadbeef')
0xdead.beef
```
There are date format parsers as well, such as [`++du`](https://developers.urbit.org/reference/hoon/zuse/2d_6#dudejsformat).
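These leaf parsers are usually combined by an object parser. As a sketch, assuming `myjson` is still bound in the Dojo from the earlier example, `++ot:dejs:format` takes a list of key and parser pairs and produces a tuple of the parsed values, ignoring the other keys:

```hoon
> =parsed (need (de-json:html myjson))
> ((ot:dejs:format ~[['firstName' so:dejs:format] ['age' ni:dejs:format]]) parsed)
['John' 27]
```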
Another category of converters is the MIME parsers. These are nominally for webpages serving content, but prove useful in a variety of other situations as well.
- `++en:base16:mimes:html` converts a `@ux` hexadecimal value to a `cord` with zero-padding (while `++de` goes the other way).
```hoon
> (en:base16:mimes:html 8 0x12.3456.7890.abcd)
'001234567890abcd'
> (de:base16:mimes:html '012345')
[~ [p=3 q=74.565]]
```
There are base-64 and base-58 (Bitcoin address) parsers as well.
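A hedged sketch of the base-64 encoder: it operates on `octs` (a byte count paired with the data), which `++as-octs:mimes:html` builds from a `cord`. The output shown assumes the default padded, non-URL-safe alphabet.

```hoon
> (en:base64:mimes:html (as-octs:mimes:html 'Hello Mars'))
'SGVsbG8gTWFycw=='
```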
### Sail (for HTML)
Sail is Hoon's internal markup for HTML and XML. It can support all HTML tags and attributes. The [Sail guide](https://developers.urbit.org/guides/additional/sail) contains full details on how to work with the markup format, but here I want to briefly demonstrate how text in Sail is handled.
Sail opens a tag with `;` and takes as its content either the rest of the line (after `:`) or everything up to the closing `==`.
```hoon
;html
  ;head
    ;title = My page
    ;meta(charset "utf-8");
  ==
  ;body
    ;h1: Welcome!
    ;p
      ; Hello, world!
      ; Welcome to my page.
      ; Here is an image:
      ;br;
      ;img@"https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg";
    ==
  ==
==
```
The `;` markers open a tag or, within a string like `<p>`'s content, mark subsequent lines. Since the entire Sail file is a `tape`, we can use `tape` interpolation to inject the results of Hoon expressions.
```hoon
;p
  ; Hello, world!
  ; Welcome to my page.
  ; Today is {<now.bowl>}.
  ; I have {<+(4)>} fingers.
==
```
### Further Reading
This article may be considered a sister to the Hoon School pages on [“Trees and Addressing (Tapes)”](https://developers.urbit.org/guides/core/hoon-school/G-trees#exercise-tapes-for-text) and [“Text Processing I”](https://developers.urbit.org/guides/core/hoon-school/J-stdlib-text). There are further details on many elements of working with strings in [“Working with Strings”](https://developers.urbit.org/guides/additional/strings), unsurprisingly.
You may also find [~wicdev-wisryt's “Input and Output in Hoon”](https://urbit.org/blog/io-in-hoon) an instructive supplement.