+++ title = "What Every Hooner Should Know About Text on Urbit" date = "2022-11-15" description = "How many ways can you write a single word?" [extra] author = "N E Davis" ship = "~lagrev-nocfep" image = "https://media.urbit.org/site/posts/essays/blog-text-bottles.png" +++ ![](https://media.urbit.org/site/posts/essays/blog-text-bottles.png) # What Every Hooner Should Know About Text on Urbit ## Forms of Text [Text strings](https://en.wikipedia.org/wiki/String_%28computer_science%29%) are sequences of characters. At one level, the file containing code is itself a string—at a more fine-grained level, we take strings to mean either byte sequences obtained from literals (like `'Hello Mars'`) or from external APIs. This blog post will expand on [existing docs](https://developers.urbit.org/guides/additional/strings) to explain what is going on with text in various corners of Hoon. Setting aside [literal syntax](https://developers.urbit.org/blog/literals), Urbit distinguishes quite a few text representation types: 1. `cord`s (`@t`, [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_byte)) 2. `knot`s (`@ta`) 3. `term`s (`@tas`) 4. `tape`s (`(list @tD`) 5. UTF-32 strings (`@c`) 6. `tour`s (`(list @c)`) 7. `tank`s (formatted print trees) 8. `tang`s (`(list tank)`) 9. `wain`s (`(list cord)`) 10. `wall`s (`(list tape)`) 11. `path`s (`(list knot)`) (with alias `wire`) 12. JSON-tagged trees 13. Sail (for HTML) Let's examine each of these in turn. ### `cord` (`@t`) A `cord` is a [UTF-8](https://en.wikipedia.org/wiki/UTF-8) [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_byte) atom used to represent text directly. A `cord` is denoted by single quotes `'surrounding the text'` and has no restrictions other than requiring valid UTF-8 content (thus all Unicode characters). `cord`s are preferred over `tape`s when text is not being processed. ```hoon > *@t '' > ((sane %t) 'Hello Mars!') %.y ``` One big difference between `cord`s and strings in other languages is that Urbit uniformly expects escape characters (such as `\n`, newline) to be written as their ASCII value in hexadecimal: thus, Hoon uses `\0a` for C-style `\n`. ### `knot` (`@ta`) A `knot` is an atom type that permits only a subset of the URL-safe ASCII characters (thus excluding control characters, spaces, upper-case characters, and ``!"#$%&'()*+,/:;<=>?@[\]^` {|}``). Stated positively, `knot`s can contain lower-case characters, numbers, and `-._~`. A `knot` is denoted by starting with the unique prefix `~.` sigdot. Generally `knot`s are used for paths (as in Clay, for wires, and so forth). As the Dojo doesn't actually check for atom validity, it is possible to erroneously "cast" a value into a `knot` representation when it is not a valid `knot`. Use `++sane` to produce a check gate to avoid attempting to parse invalid `knot`s. ```hoon > *@ta ~. > ((sane %ta) 'Hello Mars!') %.n > ((sane %ta) 'hellomars') %.y ``` You can see all ASCII characters checked for their `knot` compatibility using ``(turn (gulf 32 127) |=(a=@ [`@t`a ((sane %ta) a)]))``. `++wood` is a `cord` escape: it catches `@ta`-invalid characters in `@t`s and converts them lossily to `@ta`. ### `term` (`@tas`) A `term` is an atom type intended for marking tags, types, and labels. A value prefixed with `%` cen such as `%hello` is first a _constant_ (_q.v._) and only possesses `term`-nature if explicitly marked as such with `@tas`. A term is [defined](https://developers.urbit.org/reference/hoon/basic) as “an atomic ASCII string which obeys symbol rules: lowercase and digit only, infix hyphen, first character must be a lowercase letter.” Urbit uses `term`s to represent internal data tags throughout the Hoon compiler, the Arvo kernel, and userspace. (Note that the empty `term` is written `%$`, not `%~`. `%~` is a constant null value, not a `term`.) As with `knot`s, values can be incorrectly cast to `@tas` in the Dojo. Use `++sane` to avoid issues as a result of this behavior. Here we also use the _type spear_ `-:!>` to extract the type of the values demonstratively. ```hoon > *@tas %$ > -:!>(%hello-mars) #t/%hello-mars > -:!>(`@tas`%hello-mars) #t/@tas > ((sane %tas) 'Hello Mars!') %.n > ((sane %tas) 'hello-mars') %.y > -:!>(%~) #t/%~ ``` ### `tape` (`(list @tD)`) A `tape` is a list of `@tD` 8-bit atoms. Similar to `cord`s, `tape`s support UTF-8 text and all Unicode characters. Each byte is represented as its own serial entry, rather than as a whole character. `tape`s are `list`s not atoms, meaning they can be easily parsed and processed using `list` tools such as `++snag`, `++oust`, and so forth. ```hoon > "" "" > `(list @)`"" ~ > "Hello Mars!" "Hello Mars!" > "Hello \"Mars\"!" "Hello \"Mars\"!" > `(list @t)`"Hello \"Mars\"!" <|H e l l o   " M a r s " !|> ``` The `tape` type is slightly more restrictive than just `(list @t)`, and so `(list @t)` has a slightly different representation yielded to it by the pretty-printer. ```hoon > "Hello Mars" "Hello Mars" > `(list @t)`"Hello Mars" <|H e l l o   M a r s|> ``` What's the `@tD` doing in `(list @tD)`? By convention, a suffixed upper-case letter indicates the size of the entry in bits, with `A` for 2⁰ = 1, `B` for 2¹ = 2, `C` for 2² = 4, `D` for 2³ = 8, and so forth. While the inclusion of `D` isn't coercive, it is advisory: a `tape` is processed in such a way that multi-byte characters are broken into successive bytes: ```hoon > `(list @ux)``(list @)`"küßî" ~[0x6b 0xc3 0xbc 0xc3 0x9f 0xc3 0xae] ``` #### Converting Text to Hoon There are a few ways to get from a `cord` of text to a Hoon representation. Most commonly, one has a value as text and needs to get it as an atom, or vice versa. - [`++scot`](https://developers.urbit.org/reference/hoon/stdlib/4m#scot) takes a Hoon atom and produces a `cord` or `knot`. ```hoon > (scot %ud 1.000) ~.1.000 > (scot %ux 0xdead.beef) ~.0xdead.beef > (scot %p ~sampel-palnet) ~.~sampel-palnet > > (scot %si --1) ~.--0i1 ``` This example shows the atom literal syntax we wrote about recently: ```hoon > (scot %t 'Hello Mars') ~.~~~48.ello.~4d.ars > ~~~48.ello.~4d.ars 'Hello Mars' ``` - [`++scow`](https://developers.urbit.org/reference/hoon/stdlib/4m#scow) does the same but to a `tape`. - [`++slaw`](https://developers.urbit.org/reference/hoon/stdlib/4m#slaw) converts a `cord` representation—in Hoon aura notation—into an `unit` of `@` atom. ```hoon > (slaw %ux '0xdead.beef') [~ 3.735.928.559] > (slaw %p '~sampel-palnet') [~ 1.624.961.343] > (slaw %p '~sample-planet') ~ ``` - [`++ream`](https://developers.urbit.org/reference/hoon/stdlib/5d#ream) accepts a `cord` and shows the resulting abstract syntax tree of Hoon. ```hoon > (ream '+(2)') [%dtls p=[%sand p=%ud q=2]] ``` Other methods, such as text to number, are included in the discussion of JSON and MIME type data below. #### Interpolation `tape`s support interpolation: including the result of Hoon expressions as text in the middle of the tape. Curly braces `{` sel and `}` ser indicate that the result of a calculation has been converted into a `tape` directly. ```hoon > "There are {(scow %ud (sub (pow 2 128) (pow 2 64)))} comets." "There are 340.282.366.920.938.463.444.927.863.358.058.659.840 comets." ``` Angle brackers `<` gal and `>` gar employ automatic text conversion: ```hoon > "There are many ships, but {} is my ship." "There are many ships, but ~zod is my ship." ``` #### `cord` v. `tape` Most commonly, developers will represent text using either `tape`s or `cord`s. Both of these facilitate straightforward direct representation as string literals using either single quotes `'example of cord'` or double quotes `"example of tape"`. As a practical matter, `tape`s occupy more space than their corresponding `cord`s. `tape`s are implemented as linked lists in the runtime. These are easy to work with but consume more memory and can take longer to process in some ways. Prefer `cord`s for data storage and representation, but `tape`s for data processing. A `cord` can be transformed into a `tape` using `++trip` (mnemonic "tape rip"). The reverse transformation, from `tape` to `cord`, is accomplished via `++crip` (mnemonic "cord rip"). ```hoon > (trip 'Hello Mars!') "Hello Mars!" > (crip "Hello Mars!") 'Hello Mars!' ``` #### An Aside on Unicode Unicode is a chart of character representations, with each character receiving a unique number or _codepoint_. This codepoint is then represented in various ways in binary encodings, the most common of which is [UTF-8](https://en.wikipedia.org/wiki/UTF-8). UTF-8 is a variable-byte encoding scheme which balances the economy of representing common characters like ASCII using only a single byte with the ability to represent characters from more complex character sets like Chinese `漢語` or Cherokee `ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ`. While something of a pain when processing byte-by-byte, this allows for an adaptively compact way of writing values (rather than the mostly-zeroes [UTF-32](https://en.wikipedia.org/wiki/UTF-32) mode, available in Urbit as `@c`.) A `char` is a self-conscious UTF-8 single byte in Hoon, but it's simply an alias for `@t` and doesn't enforce bitwidth. Joel Spolsky wrote [a classic article on Unicode](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) which happily has been partly-superseded by much more extensive software support in the two decades since its publication. ### `@c` & `tour` (`(list @c)`) As just mentioned, Unicode has several distinct encoding schemes. [UTF-32](https://en.wikipedia.org/wiki/UTF-32) can represent any Unicode value in four bytes, meaning that index accesses are direct (rather than needing to be calculated as with UTF-8). Urbit provides UTF-32 `@c` data for the terminal stack to use with terminal cursor position, but otherwise they are not used much. You never see these in practice in userspace. You can use `++taft` to convert from a UTF-8 `cord` to a UTF-32 `@c`, and `++tuft` to go the other way. ```hoon > (taft 'hello') ~-hello > (taft 'Hello Mars') ~-~48.ello.~4d.ars > `@ux`(taft 'Hello Mars') 0x73.0000.0072.0000.0061.0000.004d.0000.0020.0000.006f.0000.006c.0000.006c.0000.0065.0000.0048 > (tuft ~-~48.ello.~4d.ars) 'Hello Mars' ``` One library, `l10n`, proposes to handle text as a list of UTF-8 multi-byte characters, `calf` or `(list @t)`, rather than a `tape`, which has each byte as a separate entry. This eases processing for certain Unicode text operations. ### `tank`s (formatted print trees) & `tang`s ((list tank)) Moving past the simple text types, we find that text alone provides little information about structure or display. Formatted print trees, or `tank`s, are commonly used to produce error messages and other data displays within the Dojo. A `tank` is a structure of tagged values. The tag indicates to the pretty-printer how to convert the final value to a `tape` for output (using `ram:re`). ```hoon > ~(ram re 'Hello Mars') "Hello Mars" > ~(ram re leaf+"Hello Mars") "Hello Mars" > ~(ram re rose+[["|" "«" "»"] leaf+"Hello Mars" leaf+"Phobos" leaf+"Deimos" ~]) "«Hello Mars|Phobos|Deimos»" > %~ ram re :- %palm :- ["|" "<" ":" ">"] :~ leaf+"Hello Mars" rose+[["║" "«" "»"] leaf+"Hello Mars" leaf+"Phobos" leaf+"Deimos" ~] == "<:Hello Mars|«Hello Mars║Phobos║Deimos»>" ``` Formatted text based on `tank`s is very helpful when working with `%say` generators. ### `wain`s (`(list cord)`) & `wall`s (`(list tape)`) Collections of `cord`s and `tape`s are occasionally useful when building output. The `shoe`/`sole` CLI libraries use `wain`s and `wall`s for various aspects of rendering an app at the CLI. ### `path`s (`(list knot)`) (with alias `wire`) Gall agents and Clay both use `path`s to uniquely identify resources such as noun data on the file system or subscriptions. Furthermore, a `wire` is an alias for a `path` which particularly denotes the subscriber's identification, preferably unique. Any valid `@ta` value separated by `/` fas values becomes a `path`, and `=` tis entries in the first three slots are expanded to the Clay `beak`. ```hoon > /hello/mars [%hello %mars ~] > /1/2/3 [~.1 ~.2 ~.3 ~] > / ~ > /=== [~.~zod ~.base ~.~2022.11.9..19.13.51..efb6 ~] ``` ### JSON-style strings [JSON](https://en.wikipedia.org/wiki/JSON) is a data interchange format based on text. Web apps and several other platforms use JSON as a fairly concise human-readable way to transmit information, including text. Hoon represents the equivalent structure of the JSON as a tagged noun. This requires parsing a JSON string into a tagged noun structure, then reparsing that into particular Hoon values. For our purposes here, a JSON-style string thus means a tagged string `s+'Hello Mars'`. ```hoon > =myjson '{ "firstName": "John", "lastName": "Smith", "isAlive": true, "age": 27, "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021-3100" }, "phoneNumbers": [ { "type": "home", "number": "212 555-1234" }, { "type": "office", "number": "646 555-4567" } ], "children": [ "Catherine", "Thomas", "Trevor" ], "spouse": null }' > (de-json:html myjson) [ ~ [ %o p { [p='firstName' q=[%s p='John']] [p='lastName' q=[%s p='Smith']] [ p='children' q=[%a p=~[[%s p='Catherine'] [%s p='Thomas'] [%s p='Trevor']]] ] [ p='address' q [ %o p { [p='postalCode' q=[%s p='10021-3100']] [p='streetAddress' q=[%s p='21 2nd Street']] [p='city' q=[%s p='New York']] [p='state' q=[%s p='NY']] } ] ] [ p='phoneNumbers' q [ %a p ~[ [ %o p { [p='type' q=[%s p='home']] [p='number' q=[%s p='212 555-1234']] } ] [ %o p { [p='type' q=[%s p='office']] [p='number' q=[%s p='646 555-4567']] } ] ] ] ] [p='spouse' q=~] [p='isAlive' q=[%b p=%.y]] [p='age' q=[%n p=~.27]] } ] ] ``` #### Converting Text to Hoon (and Vice Versa) Notice at this point that most of the values in the `json` data structure are tagged with `%s` string except for a few: `%a` array, `%b` boolean, `%n` number, and `%o` map. The tricky part to deal with in reparsing these values back to and from text are the `%n` numbers, since Hoon has several number types. Thus we must consider how to convert `json` values [to](https://developers.urbit.org/reference/hoon/zuse/2d_6) and [from](https://developers.urbit.org/reference/hoon/zuse/2d_1-5) Hoon representations. Fortunately, most gates one would need are already included in the Zuse standard library for handling `json` structures. The standard JSON-style operations include: - [`++numb:enjs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_1-5#numbenjsformat) converts from `@u` to a JSON number (as `knot`). ```hoon > (numb:enjs:format 0xdead.beef) [%n p=~.3735928559] ``` - [`++ne:dejs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_6#nedejsformat) parses a JSON-style string as a real, or `@rd`. ```hoon > (ne:dejs:format n+'0.31415e1') .~3.1415 ``` - [`++ni:dejs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_6#nidejsformat) parses a JSON-style string as an integer, or `@ud`. ```hoon > (ni:dejs:format n+'65536') 65.536 ``` - `++ns:dejs:format` parses a JSON-style string as a signed integer, or `@sd`. ```hoon > (ns:dejs:format n+'-1') -1 ``` - [`++nu:dejs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_6#nudejsformat) parses a JSON-style string as a hexadecimal. ```hoon > (nu:dejs:format s+'deadbeef') 0xdead.beef ``` There are date format parsers as well, such as [`++du`](https://developers.urbit.org/reference/hoon/zuse/2d_6#dudejsformat). Another category of converters are the MIME parsers. These are nominally for webpages serving content, but prove useful in a variety of other situations as well. - `++en:base16:mimes:html` converts a `@ux` hexadecimal value to a `cord` with zero-padding (while `++de` goes the other way). ```hoon > (en:base16:mimes:html 8 0x12.3456.7890.abcd) '001234567890abcd' > (de:base16:mimes:html '012345') [~ [p=3 q=74.565]] ``` There are base-64 and base-58 (Bitcoin address) parsers as well. ### Sail (for HTML) Sail is Hoon's internal markup for HTML and XML. It can support all HTML tags and attributes. The [Sail guide](https://developers.urbit.org/guides/additional/sail) contains full details on how to work with the markup format, but here I want to briefly demonstrate how text in Sail is handled. Basically, Sail opens a tag and associates either the rest of the line (`:`) or continuing text until `==`. ```hoon ;html ;head ;title = My page ;meta(charset "utf-8"); == ;body ;h1: Welcome! ;p ; Hello, world! ; Welcome to my page. ; Here is an image: ;br; ;img@"https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg"; == == == ``` The `;` markers open a tag or, within a string like `

`'s content, mark subsequent lines. Since the entire Sail file is a `tape`, we can use `tape` interpolation to inject the results of Hoon expressions. ```hoon ;p ; Hello, world! ; Welcome to my page. ; Today is {}. ; I have {<+(4)>} fingers. == ``` ### Further Reading This article may be considered a sister to the Hoon School pages on [“Trees and Addressing (Tapes)”](https://developers.urbit.org/guides/core/hoon-school/G-trees#exercise-tapes-for-text) and [“Text Processing I”](https://developers.urbit.org/guides/core/hoon-school/J-stdlib-text). There are further details on many elements of working with strings in [“Working with Strings”](https://developers.urbit.org/guides/additional/strings), unsurprisingly. You may also find [~wicdev-wisryt’s “Input and Output in Hoon”](https://urbit.org/blog/io-in-hoon) an instructive supplement.