developers.urbit.org/content/blog/text-overview.md
2022-11-14 18:02:08 -06:00

493 lines
18 KiB
Markdown
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

+++
title = "What Every Hooner Should Know About Text on Urbit"
date = "2022-11-15"
description = "How many ways can you write a single word?"
[extra]
author = "N E Davis"
ship = "~lagrev-nocfep"
image = "https://media.urbit.org/site/posts/essays/blog-text-bottles.png"
+++
![](https://media.urbit.org/site/posts/essays/blog-text-bottles.png)
# What Every Hooner Should Know About Text on Urbit
## Forms of Text
[Text strings](https://en.wikipedia.org/wiki/String_%28computer_science%29%) are sequences of characters. At one level, the file containing code is itself a string—at a more fine-grained level, we take strings to mean either byte sequences obtained from literals (like `'Hello Mars'`) or from external APIs. This blog post will expand on [existing docs](https://developers.urbit.org/guides/additional/strings) to explain what is going on with text in various corners of Hoon.
Setting aside [literal syntax](https://developers.urbit.org/blog/literals), Urbit distinguishes quite a few text representation types:
1. `cord`s (`@t`, [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_byte))
2. `knot`s (`@ta`)
3. `term`s (`@tas`)
4. `tape`s (`(list @tD`)
5. UTF-32 strings (`@c`)
6. `tour`s (`(list @c)`)
7. `tank`s (formatted print trees)
8. `tang`s (`(list tank)`)
9. `wain`s (`(list cord)`)
10. `wall`s (`(list tape)`)
11. `path`s (`(list knot)`) (with alias `wire`)
12. JSON-tagged trees
13. Sail (for HTML)
Let's examine each of these in turn.
### `cord` (`@t`)
A `cord` is a [UTF-8](https://en.wikipedia.org/wiki/UTF-8) [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_byte) atom used to represent text directly. A `cord` is denoted by single quotes `'surrounding the text'` and has no restrictions other than requiring valid UTF-8 content (thus all Unicode characters). `cord`s are preferred over `tape`s when text is not being processed.
```hoon
> *@t
''
> ((sane %t) 'Hello Mars!')
%.y
```
One big difference between `cord`s and strings in other languages is that Urbit uniformly expects escape characters (such as `\n`, newline) to be written as their ASCII value in hexadecimal: thus, Hoon uses `\0a` for C-style `\n`.
### `knot` (`@ta`)
A `knot` is an atom type that permits only a subset of the URL-safe ASCII characters (thus excluding control characters, spaces, upper-case characters, and ``!"#$%&'()*+,/:;<=>?@[\]^` {|}``). Stated positively, `knot`s can contain lower-case characters, numbers, and `-._~`. A `knot` is denoted by starting with the unique prefix `~.` sigdot. Generally `knot`s are used for paths (as in Clay, for wires, and so forth).
As the Dojo doesn't actually check for atom validity, it is possible to erroneously "cast" a value into a `knot` representation when it is not a valid `knot`. Use `++sane` to produce a check gate to avoid attempting to parse invalid `knot`s.
```hoon
> *@ta
~.
> ((sane %ta) 'Hello Mars!')
%.n
> ((sane %ta) 'hellomars')
%.y
```
You can see all ASCII characters checked for their `knot` compatibility using ``(turn (gulf 32 127) |=(a=@ [`@t`a ((sane %ta) a)]))``. `++wood` is a `cord` escape: it catches `@ta`-invalid characters in `@t`s and converts them lossily to `@ta`.
### `term` (`@tas`)
A `term` is an atom type intended for marking tags, types, and labels. A value prefixed with `%` cen such as `%hello` is first a _constant_ (_q.v._) and only possesses `term`-nature if explicitly marked as such with `@tas`. A term is [defined](https://developers.urbit.org/reference/hoon/basic) as “an atomic ASCII string which obeys symbol rules: lowercase and digit only, infix hyphen, first character must be a lowercase letter.”
Urbit uses `term`s to represent internal data tags throughout the Hoon compiler, the Arvo kernel, and userspace.
(Note that the empty `term` is written `%$`, not `%~`. `%~` is a constant null value, not a `term`.)
As with `knot`s, values can be incorrectly cast to `@tas` in the Dojo. Use `++sane` to avoid issues as a result of this behavior.
Here we also use the _type spear_ `-:!>` to extract the type of the values demonstratively.
```hoon
> *@tas
%$
> -:!>(%hello-mars)
#t/%hello-mars
> -:!>(`@tas`%hello-mars)
#t/@tas
> ((sane %tas) 'Hello Mars!')
%.n
> ((sane %tas) 'hello-mars')
%.y
> -:!>(%~)
#t/%~
```
### `tape` (`(list @tD)`)
A `tape` is a list of `@tD` 8-bit atoms. Similar to `cord`s, `tape`s support UTF-8 text and all Unicode characters. Each byte is represented as its own serial entry, rather than as a whole character. `tape`s are `list`s not atoms, meaning they can be easily parsed and processed using `list` tools such as `++snag`, `++oust`, and so forth.
```hoon
> ""
""
> `(list @)`""
~
> "Hello Mars!"
"Hello Mars!"
> "Hello \"Mars\"!"
"Hello \"Mars\"!"
> `(list @t)`"Hello \"Mars\"!"
<|H e l l o   " M a r s " !|>
```
The `tape` type is slightly more restrictive than just `(list @t)`, and so `(list @t)` has a slightly different representation yielded to it by the pretty-printer.
```hoon
> "Hello Mars"
"Hello Mars"
> `(list @t)`"Hello Mars"
<|H e l l o   M a r s|>
```
What's the `@tD` doing in `(list @tD)`? By convention, a suffixed upper-case letter indicates the size of the entry in bits, with `A` for 2⁰ = 1, `B` for 2¹ = 2, `C` for 2² = 4, `D` for 2³ = 8, and so forth. While the inclusion of `D` isn't coercive, it is advisory: a `tape` is processed in such a way that multi-byte characters are broken into successive bytes:
```hoon
> `(list @ux)``(list @)`"küßî"
~[0x6b 0xc3 0xbc 0xc3 0x9f 0xc3 0xae]
```
#### Converting Text to Hoon
There are a few ways to get from a `cord` of text to a Hoon representation.
Most commonly, one has a value as text and needs to get it as an atom, or vice versa.
- [`++scot`](https://developers.urbit.org/reference/hoon/stdlib/4m#scot) takes a Hoon atom and produces a `cord` or `knot`.
```hoon
> (scot %ud 1.000)
~.1.000
> (scot %ux 0xdead.beef)
~.0xdead.beef
> (scot %p ~sampel-palnet)
~.~sampel-palnet
> > (scot %si --1)
~.--0i1
```
This example shows the atom literal syntax we wrote about recently:
```hoon
> (scot %t 'Hello Mars')
~.~~~48.ello.~4d.ars
> ~~~48.ello.~4d.ars
'Hello Mars'
```
- [`++scow`](https://developers.urbit.org/reference/hoon/stdlib/4m#scow) does the same but to a `tape`.
- [`++slaw`](https://developers.urbit.org/reference/hoon/stdlib/4m#slaw) converts a `cord` representation—in Hoon aura notation—into an `unit` of `@` atom.
```hoon
> (slaw %ux '0xdead.beef')
[~ 3.735.928.559]
> (slaw %p '~sampel-palnet')
[~ 1.624.961.343]
> (slaw %p '~sample-planet')
~
```
- [`++ream`](https://developers.urbit.org/reference/hoon/stdlib/5d#ream) accepts a `cord` and shows the resulting abstract syntax tree of Hoon.
```hoon
> (ream '+(2)')
[%dtls p=[%sand p=%ud q=2]]
```
Other methods, such as text to number, are included in the discussion of JSON and MIME type data below.
#### Interpolation
`tape`s support interpolation: including the result of Hoon expressions as text in the middle of the tape.
Curly braces `{` sel and `}` ser indicate that the result of a calculation has been converted into a `tape` directly.
```hoon
> "There are {(scow %ud (sub (pow 2 128) (pow 2 64)))} comets."
"There are 340.282.366.920.938.463.444.927.863.358.058.659.840 comets."
```
Angle brackers `<` gal and `>` gar employ automatic text conversion:
```hoon
> "There are many ships, but {<our>} is my ship."
"There are many ships, but ~zod is my ship."
```
#### `cord` v. `tape`
Most commonly, developers will represent text using either `tape`s or `cord`s. Both of these facilitate straightforward direct representation as string literals using either single quotes `'example of cord'` or double quotes `"example of tape"`.
As a practical matter, `tape`s occupy more space than their corresponding `cord`s. `tape`s are implemented as linked lists in the runtime. These are easy to work with but consume more memory and can take longer to process in some ways.
Prefer `cord`s for data storage and representation, but `tape`s for data processing.
A `cord` can be transformed into a `tape` using `++trip` (mnemonic "tape rip"). The reverse transformation, from `tape` to `cord`, is accomplished via `++crip` (mnemonic "cord rip").
```hoon
> (trip 'Hello Mars!')
"Hello Mars!"
> (crip "Hello Mars!")
'Hello Mars!'
```
#### An Aside on Unicode
Unicode is a chart of character representations, with each character receiving a unique number or _codepoint_. This codepoint is then represented in various ways in binary encodings, the most common of which is [UTF-8](https://en.wikipedia.org/wiki/UTF-8). UTF-8 is a variable-byte encoding scheme which balances the economy of representing common characters like ASCII using only a single byte with the ability to represent characters from more complex character sets like Chinese `漢語` or Cherokee `ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ`. While something of a pain when processing byte-by-byte, this allows for an adaptively compact way of writing values (rather than the mostly-zeroes [UTF-32](https://en.wikipedia.org/wiki/UTF-32) mode, available in Urbit as `@c`.) A `char` is a self-conscious UTF-8 single byte in Hoon, but it's simply an alias for `@t` and doesn't enforce bitwidth.
Joel Spolsky wrote [a classic article on Unicode](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) which happily has been partly-superseded by much more extensive software support in the two decades since its publication.
### `@c` & `tour` (`(list @c)`)
As just mentioned, Unicode has several distinct encoding schemes. [UTF-32](https://en.wikipedia.org/wiki/UTF-32) can represent any Unicode value in four bytes, meaning that index accesses are direct (rather than needing to be calculated as with UTF-8). Urbit provides UTF-32 `@c` data for the terminal stack to use with terminal cursor position, but otherwise they are not used much. You never see these in practice in userspace.
You can use `++taft` to convert from a UTF-8 `cord` to a UTF-32 `@c`, and `++tuft` to go the other way.
```hoon
> (taft 'hello')
~-hello
> (taft 'Hello Mars')
~-~48.ello.~4d.ars
> `@ux`(taft 'Hello Mars')
0x73.0000.0072.0000.0061.0000.004d.0000.0020.0000.006f.0000.006c.0000.006c.0000.0065.0000.0048
> (tuft ~-~48.ello.~4d.ars)
'Hello Mars'
```
One library, `l10n`, proposes to handle text as a list of UTF-8 multi-byte characters, `calf` or `(list @t)`, rather than a `tape`, which has each byte as a separate entry. This eases processing for certain Unicode text operations.
### `tank`s (formatted print trees) & `tang`s ((list tank))
Moving past the simple text types, we find that text alone provides little information about structure or display. Formatted print trees, or `tank`s, are commonly used to produce error messages and other data displays within the Dojo.
A `tank` is a structure of tagged values. The tag indicates to the pretty-printer how to convert the final value to a `tape` for output (using `ram:re`).
```hoon
> ~(ram re 'Hello Mars')
"Hello Mars"
> ~(ram re leaf+"Hello Mars")
"Hello Mars"
> ~(ram re rose+[["|" "«" "»"] leaf+"Hello Mars" leaf+"Phobos" leaf+"Deimos" ~])
"«Hello Mars|Phobos|Deimos»"
> %~ ram re
:- %palm
:- ["|" "<" ":" ">"]
:~ leaf+"Hello Mars"
rose+[["║" "«" "»"] leaf+"Hello Mars" leaf+"Phobos" leaf+"Deimos" ~]
==
"<:Hello Mars|«Hello Mars║Phobos║Deimos»>"
```
Formatted text based on `tank`s is very helpful when working with `%say` generators.
### `wain`s (`(list cord)`) & `wall`s (`(list tape)`)
Collections of `cord`s and `tape`s are occasionally useful when building output.
The `shoe`/`sole` CLI libraries use `wain`s and `wall`s for various aspects of rendering an app at the CLI.
### `path`s (`(list knot)`) (with alias `wire`)
Gall agents and Clay both use `path`s to uniquely identify resources such as noun data on the file system or subscriptions. Furthermore, a `wire` is an alias for a `path` which particularly denotes the subscriber's identification, preferably unique. Any valid `@ta` value separated by `/` fas values becomes a `path`, and `=` tis entries in the first three slots are expanded to the Clay `beak`.
```hoon
> /hello/mars
[%hello %mars ~]
> /1/2/3
[~.1 ~.2 ~.3 ~]
> /
~
> /===
[~.~zod ~.base ~.~2022.11.9..19.13.51..efb6 ~]
```
### JSON-style strings
[JSON](https://en.wikipedia.org/wiki/JSON) is a data interchange format based on text. Web apps and several other platforms use JSON as a fairly concise human-readable way to transmit information, including text.
Hoon represents the equivalent structure of the JSON as a tagged noun. This requires parsing a JSON string into a tagged noun structure, then reparsing that into particular Hoon values.
For our purposes here, a JSON-style string thus means a tagged string `s+'Hello Mars'`.
```hoon
> =myjson '{
"firstName": "John",
"lastName": "Smith",
"isAlive": true,
"age": 27,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021-3100"
},
"phoneNumbers": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "office",
"number": "646 555-4567"
}
],
"children": [
"Catherine",
"Thomas",
"Trevor"
],
"spouse": null
}'
> (de-json:html myjson)
[ ~
[ %o
p
{ [p='firstName' q=[%s p='John']]
[p='lastName' q=[%s p='Smith']]
[ p='children'
q=[%a p=~[[%s p='Catherine'] [%s p='Thomas'] [%s p='Trevor']]]
]
[ p='address'
q
[ %o
p
{ [p='postalCode' q=[%s p='10021-3100']]
[p='streetAddress' q=[%s p='21 2nd Street']]
[p='city' q=[%s p='New York']]
[p='state' q=[%s p='NY']]
}
]
]
[ p='phoneNumbers'
q
[ %a
p
~[
[ %o
p
{ [p='type' q=[%s p='home']]
[p='number' q=[%s p='212 555-1234']]
}
]
[ %o
p
{ [p='type' q=[%s p='office']]
[p='number' q=[%s p='646 555-4567']]
}
]
]
]
]
[p='spouse' q=~]
[p='isAlive' q=[%b p=%.y]]
[p='age' q=[%n p=~.27]]
}
]
]
```
#### Converting Text to Hoon (and Vice Versa)
Notice at this point that most of the values in the `json` data structure are tagged with `%s` string except for a few: `%a` array, `%b` boolean, `%n` number, and `%o` map. The tricky part to deal with in reparsing these values back to and from text are the `%n` numbers, since Hoon has several number types.
Thus we must consider how to convert `json` values [to](https://developers.urbit.org/reference/hoon/zuse/2d_6) and [from](https://developers.urbit.org/reference/hoon/zuse/2d_1-5) Hoon representations. Fortunately, most gates one would need are already included in the Zuse standard library for handling `json` structures. The standard JSON-style operations include:
- [`++numb:enjs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_1-5#numbenjsformat) converts from `@u` to a JSON number (as `knot`).
```hoon
> (numb:enjs:format 0xdead.beef)
[%n p=~.3735928559]
```
- [`++ne:dejs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_6#nedejsformat) parses a JSON-style string as a real, or `@rd`.
```hoon
> (ne:dejs:format n+'0.31415e1')
.~3.1415
```
- [`++ni:dejs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_6#nidejsformat) parses a JSON-style string as an integer, or `@ud`.
```hoon
> (ni:dejs:format n+'65536')
65.536
```
- `++ns:dejs:format` parses a JSON-style string as a signed integer, or `@sd`.
```hoon
> (ns:dejs:format n+'-1')
-1
```
- [`++nu:dejs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_6#nudejsformat) parses a JSON-style string as a hexadecimal.
```hoon
> (nu:dejs:format s+'deadbeef')
0xdead.beef
```
There are date format parsers as well, such as [`++du`](https://developers.urbit.org/reference/hoon/zuse/2d_6#dudejsformat).
Another category of converters are the MIME parsers. These are nominally for webpages serving content, but prove useful in a variety of other situations as well.
- `++en:base16:mimes:html` converts a `@ux` hexadecimal value to a `cord` with zero-padding (while `++de` goes the other way).
```hoon
> (en:base16:mimes:html 8 0x12.3456.7890.abcd)
'001234567890abcd'
> (de:base16:mimes:html '012345')
[~ [p=3 q=74.565]]
```
There are base-64 and base-58 (Bitcoin address) parsers as well.
### Sail (for HTML)
Sail is Hoon's internal markup for HTML and XML. It can support all HTML tags and attributes. The [Sail guide](https://developers.urbit.org/guides/additional/sail) contains full details on how to work with the markup format, but here I want to briefly demonstrate how text in Sail is handled.
Basically, Sail opens a tag and associates either the rest of the line (`:`) or continuing text until `==`.
```hoon
;html
;head
;title = My page
;meta(charset "utf-8");
==
;body
;h1: Welcome!
;p
; Hello, world!
; Welcome to my page.
; Here is an image:
;br;
;img@"https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg";
==
==
==
```
The `;` markers open a tag or, within a string like `<p>`'s content, mark subsequent lines. Since the entire Sail file is a `tape`, we can use `tape` interpolation to inject the results of Hoon expressions.
```hoon
;p
; Hello, world!
; Welcome to my page.
; Today is {<now.bowl>}.
; I have {<+(4)>} fingers.
==
```
### Further Reading
This article may be considered a sister to the Hoon School pages on [“Trees and Addressing (Tapes)”](https://developers.urbit.org/guides/core/hoon-school/G-trees#exercise-tapes-for-text) and [“Text Processing I”](https://developers.urbit.org/guides/core/hoon-school/J-stdlib-text). There are further details on many elements of working with strings in [“Working with Strings”](https://developers.urbit.org/guides/additional/strings), unsurprisingly.
You may also find [~wicdev-wisryts “Input and Output in Hoon”](https://urbit.org/blog/io-in-hoon) an instructive supplement.