
+++
title = "What Every Hooner Should Know About Text on Urbit"
date = "2022-11-15"
description = "How many ways can you write a single word?"
[extra]
author = "N E Davis"
ship = "~lagrev-nocfep"
image = "https://media.urbit.org/site/posts/essays/blog-text-bottles.png"
+++

![](https://media.urbit.org/site/posts/essays/blog-text-bottles.png)

#  What Every Hooner Should Know About Text on Urbit

##  Forms of Text

[Text strings](https://en.wikipedia.org/wiki/String_%28computer_science%29) are sequences of characters.  At one level, the file containing code is itself a string; at a more fine-grained level, we take strings to mean byte sequences obtained either from literals (like `'Hello Mars'`) or from external APIs.  This blog post expands on the [existing docs](https://developers.urbit.org/guides/additional/strings) to explain what is going on with text in various corners of Hoon.

Setting aside [literal syntax](https://developers.urbit.org/blog/literals), Urbit distinguishes quite a few text representation types:

1. `cord`s (`@t`, [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_byte))
2. `knot`s (`@ta`)
3. `term`s (`@tas`)
4. `tape`s (`(list @tD)`)
5. UTF-32 strings (`@c`)
6. `tour`s (`(list @c)`)
7. `tank`s (formatted print trees)
8. `tang`s (`(list tank)`)
9. `wain`s (`(list cord)`)
10. `wall`s (`(list tape)`)
11. `path`s (`(list knot)`) (with alias `wire`)
12. JSON-tagged trees
13. Sail (for HTML)

Let's examine each of these in turn.

###  `cord` (`@t`)

A `cord` is a [UTF-8](https://en.wikipedia.org/wiki/UTF-8) [LSB](https://en.wikipedia.org/wiki/Bit_numbering#Least_significant_byte) atom used to represent text directly.  A `cord` is denoted by single quotes `'surrounding the text'` and has no restrictions other than requiring valid UTF-8 content (thus all Unicode characters).  `cord`s are preferred over `tape`s when text is not being processed.

```hoon
> *@t
''

> ((sane %t) 'Hello Mars!')
%.y
```

One big difference between `cord`s and strings in other languages is that Urbit uniformly expects escape characters (such as `\n`, newline) to be written as their ASCII value in hexadecimal:  thus, Hoon uses `\0a` for C-style `\n`.
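
As a small Dojo sketch of both points (least-significant-byte-first storage and hexadecimal escapes), casting a `cord` to `@ux` shows the raw bytes.  The expected values here follow from the ASCII codes `a` = `0x61`, `b` = `0x62`, `c` = `0x63`, and newline = `0x0a`:

```hoon
::  the first character sits in the lowest byte
> `@ux`'abc'
0x63.6261

::  length in bytes (bloq size 3 = 8 bits)
> (met 3 'abc')
3

::  a newline written as a hexadecimal escape
> `@ux`'\0a'
0xa
```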

### `knot` (`@ta`)

A `knot` is an atom type that permits only a subset of the URL-safe ASCII characters (thus excluding control characters, spaces, upper-case characters, and ``!"#$%&'()*+,/:;<=>?@[\]^` {|}``).  Stated positively, `knot`s can contain lower-case characters, numbers, and `-._~`.  A `knot` is denoted by starting with the unique prefix `~.` sigdot.  Generally `knot`s are used for paths (as in Clay, for wires, and so forth).

As the Dojo doesn't actually check for atom validity, it is possible to erroneously "cast" a value into a `knot` representation when it is not a valid `knot`.  Use `++sane` to produce a check gate to avoid attempting to parse invalid `knot`s.

```hoon
> *@ta
~.

> ((sane %ta) 'Hello Mars!')
%.n

> ((sane %ta) 'hellomars')
%.y
```

You can see all ASCII characters checked for their `knot` compatibility using ``(turn (gulf 32 127) |=(a=@ [`@t`a ((sane %ta) a)]))``.  `++wood` is a `cord` escape:  it catches `@ta`-invalid characters in `@t`s and converts them lossily to `@ta`.
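
A brief sketch of `++wood` in use, reusing the `'Hello Mars'` text from elsewhere in this post.  The escaped form matches the tail of the `(scot %t 'Hello Mars')` result shown later, and the escaped value then passes the `knot` check:

```hoon
::  upper-case letters become ~hex. escapes, the space becomes .
> (wood 'Hello Mars')
~.~48.ello.~4d.ars

> ((sane %ta) (wood 'Hello Mars'))
%.y
```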

### `term` (`@tas`)

A `term` is an atom type intended for marking tags, types, and labels.  A value prefixed with `%` cen such as `%hello` is first a _constant_ (_q.v._) and only possesses `term`-nature if explicitly marked as such with `@tas`.  A term is [defined](https://developers.urbit.org/reference/hoon/basic) as “an atomic ASCII string which obeys symbol rules: lowercase and digit only, infix hyphen, first character must be a lowercase letter.”

Urbit uses `term`s to represent internal data tags throughout the Hoon compiler, the Arvo kernel, and userspace.

(Note that the empty `term` is written `%$`, not `%~`.  `%~` is a constant null value, not a `term`.)

As with `knot`s, values can be incorrectly cast to `@tas` in the Dojo.  Use `++sane` to avoid issues as a result of this behavior.

Here we also use the _type spear_ `-:!>` to extract the type of the values demonstratively.

```hoon
> *@tas
%$

> -:!>(%hello-mars)
#t/%hello-mars

> -:!>(`@tas`%hello-mars)
#t/@tas

> ((sane %tas) 'Hello Mars!')
%.n

> ((sane %tas) 'hello-mars')
%.y

> -:!>(%~)
#t/%~
```
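
To illustrate `term`s as data tags, here is a minimal sketch of a gate that switches on a tag.  The gate, its `kind` argument, and the `%cord`/`%tape` tags are hypothetical names chosen for illustration, not part of any standard library:

```hoon
::  hypothetical gate: describe a text type given its tag
|=  kind=@tas
^-  tape
?+  kind  "unknown text type"
  %cord  "UTF-8 atom"
  %tape  "list of bytes"
==
```

Calling it with `%cord` should select the first branch, while any unrecognized tag falls through to the default.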

###  `tape` (`(list @tD)`)

A `tape` is a list of `@tD` 8-bit atoms.  Similar to `cord`s, `tape`s support UTF-8 text and all Unicode characters.  Each byte is represented as its own serial entry, rather than as a whole character.  `tape`s are `list`s not atoms, meaning they can be easily parsed and processed using `list` tools such as `++snag`, `++oust`, and so forth.

```hoon
> ""
""

> `(list @)`""
~

> "Hello Mars!"
"Hello Mars!"

> "Hello \"Mars\"!"
"Hello \"Mars\"!"

> `(list @t)`"Hello \"Mars\"!"
<|H e l l o   " M a r s " !|>
```

The `tape` type is slightly more restrictive than just `(list @t)`, and so `(list @t)` has a slightly different representation yielded to it by the pretty-printer.

```hoon
> "Hello Mars"
"Hello Mars"

> `(list @t)`"Hello Mars"
<|H e l l o   M a r s|>
```

What's the `@tD` doing in `(list @tD)`?  By convention, a suffixed upper-case letter indicates the size of the entry in bits, with `A` for 2⁰ = 1, `B` for 2¹ = 2, `C` for 2² = 4, `D` for 2³ = 8, and so forth.  While the inclusion of `D` isn't coercive, it is advisory:  a `tape` is processed in such a way that multi-byte characters are broken into successive bytes:

```hoon
> `(list @ux)``(list @)`"küßî"
~[0x6b 0xc3 0xbc 0xc3 0x9f 0xc3 0xae]
```
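
Because a `tape` is a `list`, the usual `list` gates apply directly.  The following Dojo sketch uses `++lent`, `++snag`, and `++weld`; note that the last length counts bytes, not characters, consistent with the seven bytes shown above:

```hoon
> (lent "Hello Mars!")
11

> (snag 0 "Hello Mars!")
'H'

> (weld "Hello " "Mars!")
"Hello Mars!"

::  four characters, but seven UTF-8 bytes
> (lent "küßî")
7
```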

#### Converting Text to Hoon

There are a few ways to get from a `cord` of text to a Hoon representation.

Most commonly, one has a value as text and needs to get it as an atom, or vice versa.

- [`++scot`](https://developers.urbit.org/reference/hoon/stdlib/4m#scot) takes a Hoon atom and produces a `cord` or `knot`.

```hoon
> (scot %ud 1.000)
~.1.000

> (scot %ux 0xdead.beef)
~.0xdead.beef

> (scot %p ~sampel-palnet)
~.~sampel-palnet

> (scot %si --1)
~.--0i1
```

This example shows the atom literal syntax we wrote about recently:

```hoon
> (scot %t 'Hello Mars')
~.~~~48.ello.~4d.ars

> ~~~48.ello.~4d.ars
'Hello Mars'
```

- [`++scow`](https://developers.urbit.org/reference/hoon/stdlib/4m#scow) does the same but to a `tape`.

- [`++slaw`](https://developers.urbit.org/reference/hoon/stdlib/4m#slaw) converts a `cord` representation (in Hoon aura notation) into a `unit` of `@` atom.

    ```hoon
    > (slaw %ux '0xdead.beef')
    [~ 3.735.928.559]

    > (slaw %p '~sampel-palnet')
    [~ 1.624.961.343]

    > (slaw %p '~sample-planet')
    ~
    ```

- [`++ream`](https://developers.urbit.org/reference/hoon/stdlib/5d#ream) accepts a `cord` and shows the resulting abstract syntax tree of Hoon.

```hoon
> (ream '+(2)')
[%dtls p=[%sand p=%ud q=2]]
```

Other methods, such as text to number, are included in the discussion of JSON and MIME type data below.
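
As a quick sketch of how these gates fit together, the following Dojo session round-trips a number through text.  It also uses `++slav`, a sibling of `++slaw` not discussed above, which returns the bare atom and crashes instead of producing `~` when parsing fails:

```hoon
> (scow %ud 1.000)
"1.000"

::  render to a knot, then parse it back
> (slaw %ud (scot %ud 1.000))
[~ 1.000]

> (slav %ud '1.000')
1.000

::  the parsed atom is untyped, so cast it to the aura you want
> `@p`(slav %p '~sampel-palnet')
~sampel-palnet
```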

#### Interpolation

`tape`s support interpolation:  including the result of Hoon expressions as text in the middle of the tape.

Curly braces `{` kel and `}` ker splice in the result of an expression that already produces a `tape`.

```hoon
> "There are {(scow %ud (sub (pow 2 128) (pow 2 64)))} comets."
"There are 340.282.366.920.938.463.444.927.863.358.058.659.840 comets."
```

Angle brackets `<` gal and `>` gar employ automatic text conversion:

```hoon
> "There are many ships, but {<our>} is my ship."
"There are many ships, but ~zod is my ship."
```
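
The two forms can be mixed in one `tape`.  In this small sketch the braces receive a `tape` (made from a `cord` with `++trip`), while gal and gar convert a number:

```hoon
> "{(trip 'Mars')} has {<(add 1 1)>} moons."
"Mars has 2 moons."
```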

#### `cord` v. `tape`

Most commonly, developers will represent text using either `tape`s or `cord`s.  Both of these facilitate straightforward direct representation as string literals using either single quotes `'example of cord'` or double quotes `"example of tape"`.

As a practical matter, `tape`s occupy more space than their corresponding `cord`s.  `tape`s are implemented as linked lists in the runtime.  These are easy to work with but consume more memory and can take longer to process in some ways.

Prefer `cord`s for data storage and representation, but `tape`s for data processing.

A `cord` can be transformed into a `tape` using `++trip` (mnemonic "tape rip").  The reverse transformation, from `tape` to `cord`, is accomplished via `++crip` (mnemonic "cord rip").

```hoon
> (trip 'Hello Mars!')
"Hello Mars!"

> (crip "Hello Mars!")
'Hello Mars!'
```
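
A common pattern, sketched here, is to keep text as a `cord`, convert to a `tape` only for processing, and convert back.  This example uses `++cuss` (upper-casing a `tape`) as the processing step:

```hoon
::  cord -> tape -> processed tape -> cord
> (crip (cuss (trip 'Hello Mars!')))
'HELLO MARS!'

::  the cord itself occupies eleven bytes
> (met 3 'Hello Mars!')
11
```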

#### An Aside on Unicode

Unicode is a chart of character representations, with each character receiving a unique number or _codepoint_.  This codepoint is then represented in various ways in binary encodings, the most common of which is [UTF-8](https://en.wikipedia.org/wiki/UTF-8).  UTF-8 is a variable-byte encoding scheme which balances the economy of representing common characters like ASCII using only a single byte with the ability to represent characters from more complex character sets like Chinese `漢語` or Cherokee `ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ`.  While something of a pain when processing byte-by-byte, this allows for an adaptively compact way of writing values (rather than the mostly-zeroes [UTF-32](https://en.wikipedia.org/wiki/UTF-32) mode, available in Urbit as `@c`).  A `char` is a self-conscious UTF-8 single byte in Hoon, but it's simply an alias for `@t` and doesn't enforce bitwidth.

Joel Spolsky wrote [a classic article on Unicode](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) which happily has been partly superseded by much more extensive software support in the two decades since its publication.

### `@c` & `tour` (`(list @c)`)

As just mentioned, Unicode has several distinct encoding schemes.  [UTF-32](https://en.wikipedia.org/wiki/UTF-32) can represent any Unicode value in four bytes, meaning that index accesses are direct (rather than needing to be calculated as with UTF-8).  Urbit provides UTF-32 `@c` data for the terminal stack to use with terminal cursor position, but otherwise these values are rarely used, and you will seldom see them in userspace.

You can use `++taft` to convert from a UTF-8 `cord` to a UTF-32 `@c`, and `++tuft` to go the other way.

```hoon
> (taft 'hello')
~-hello

> (taft 'Hello Mars')
~-~48.ello.~4d.ars

> `@ux`(taft 'Hello Mars')
0x73.0000.0072.0000.0061.0000.004d.0000.0020.0000.006f.0000.006c.0000.006c.0000.0065.0000.0048

> (tuft ~-~48.ello.~4d.ars)
'Hello Mars'
```
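
For the list form, the standard library also has `++tuba` (`tape` to `tour`) and `++tufa` (`tour` to `tape`), which are not covered above.  This sketch shows why a `tour` makes per-character work simpler: its length counts codepoints rather than bytes.

```hoon
::  one entry per codepoint
> (lent (tuba "küßî"))
4

::  one entry per UTF-8 byte
> (lent "küßî")
7

> (tufa (tuba "Hello Mars"))
"Hello Mars"
```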

One library, `l10n`, proposes to handle text as a list of UTF-8 multi-byte characters, `calf` or `(list @t)`, rather than a `tape`, which has each byte as a separate entry.  This eases processing for certain Unicode text operations.

### `tank`s (formatted print trees) & `tang`s (`(list tank)`)

Moving past the simple text types, we find that text alone provides little information about structure or display.  Formatted print trees, or `tank`s, are commonly used to produce error messages and other data displays within the Dojo.

A `tank` is a structure of tagged values.  The tag indicates to the pretty-printer how to convert the final value to a `tape` for output (using `ram:re`).

```hoon
> ~(ram re 'Hello Mars')
"Hello Mars"

> ~(ram re leaf+"Hello Mars")
"Hello Mars"

> ~(ram re rose+[["|" "«" "»"] leaf+"Hello Mars" leaf+"Phobos" leaf+"Deimos" ~])
"«Hello Mars|Phobos|Deimos»"

> %~  ram  re
  :-  %palm
  :-  ["|" "<" ":" ">"]
  :~  leaf+"Hello Mars"
      rose+[["║" "«" "»"] leaf+"Hello Mars" leaf+"Phobos" leaf+"Deimos" ~]
  ==
"<:Hello Mars|«Hello Mars║Phobos║Deimos»>"
```

Formatted text based on `tank`s is very helpful when working with `%say` generators.
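
As a minimal sketch of that use, a `%say` generator can return a `tang` directly by tagging its result with `%tang`, which the Dojo then renders.  The moon names are reused from the example above; the exact display order and layout are left to the Dojo and are not shown here:

```hoon
::  hypothetical generator: print two tanks from a %say generator
:-  %say
|=  *
:-  %tang
:~  leaf+"Hello Mars"
    rose+[[", " "[" "]"] leaf+"Phobos" leaf+"Deimos" ~]
==
```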

### `wain`s (`(list cord)`) & `wall`s (`(list tape)`)

Collections of `cord`s and `tape`s are occasionally useful when building output.

The `shoe`/`sole` CLI libraries use `wain`s and `wall`s for various aspects of rendering an app at the CLI.
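
For a concrete sense of a `wain`, this sketch uses `++to-wain:format` and `++of-wain:format` from Zuse (not discussed above) to split a `cord` on newlines and to join it again:

```hoon
::  split on the \0a newline byte
> (to-wain:format 'Hello\0aMars')
<|Hello Mars|>

::  joining the lines reproduces the original cord
> =('Hello\0aMars' (of-wain:format ~['Hello' 'Mars']))
%.y
```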

### `path`s (`(list knot)`) (with alias `wire`)

Gall agents and Clay both use `path`s to uniquely identify resources such as noun data on the file system or subscriptions.  A `wire` is an alias for `path` that identifies a subscription from the subscriber's side, and should preferably be unique.  Any sequence of valid `@ta` values separated by `/` fas forms a `path`, and `=` tis entries in the first three slots are expanded to the Clay `beak`.

```hoon
> /hello/mars
[%hello %mars ~]

> /1/2/3
[~.1 ~.2 ~.3 ~]

> /
~

> /===
[~.~zod ~.base ~.~2022.11.9..19.13.51..efb6 ~]
```
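
Paths also convert to and from plain text.  As a short sketch, `++spat` renders a `path` as a `cord` and `++stab` parses one back; casting a literal with `path` makes the Dojo print it in path syntax instead of as a bare list:

```hoon
> (spat /hello/mars)
'/hello/mars'

> (stab '/hello/mars')
/hello/mars

> `path`/hello/mars
/hello/mars
```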

### JSON-style strings

[JSON](https://en.wikipedia.org/wiki/JSON) is a data interchange format based on text.  Web apps and several other platforms use JSON as a fairly concise human-readable way to transmit information, including text.

Hoon represents the equivalent structure of the JSON as a tagged noun.  This requires parsing a JSON string into a tagged noun structure, then reparsing that into particular Hoon values.

For our purposes here, a JSON-style string thus means a tagged string `s+'Hello Mars'`.

```hoon
> =myjson '{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [
      "Catherine",
      "Thomas",
      "Trevor"
  ],
  "spouse": null
}'

> (de-json:html myjson)
[ ~
  [ %o
      p
    { [p='firstName' q=[%s p='John']]
      [p='lastName' q=[%s p='Smith']]
      [ p='children'
        q=[%a p=~[[%s p='Catherine'] [%s p='Thomas'] [%s p='Trevor']]]
      ]
      [ p='address'
          q
        [ %o
            p
          { [p='postalCode' q=[%s p='10021-3100']]
            [p='streetAddress' q=[%s p='21 2nd Street']]
            [p='city' q=[%s p='New York']]
            [p='state' q=[%s p='NY']]
          }
        ]
      ]
      [ p='phoneNumbers'
          q
        [ %a
            p
          ~[
            [ %o
                p
              { [p='type' q=[%s p='home']]
                [p='number' q=[%s p='212 555-1234']]
              }
            ]
            [ %o
                p
              { [p='type' q=[%s p='office']]
                [p='number' q=[%s p='646 555-4567']]
              }
            ]
          ]
        ]
      ]
      [p='spouse' q=~]
      [p='isAlive' q=[%b p=%.y]]
      [p='age' q=[%n p=~.27]]
    }
  ]
]
```

#### Converting Text to Hoon (and Vice Versa)

Notice at this point that most of the values in the `json` data structure are tagged with `%s` string except for a few:  `%a` array, `%b` boolean, `%n` number, and `%o` map.  The tricky part of converting these values to and from text is the `%n` numbers, since Hoon has several number types.

Thus we must consider how to convert `json` values [to](https://developers.urbit.org/reference/hoon/zuse/2d_6) and [from](https://developers.urbit.org/reference/hoon/zuse/2d_1-5) Hoon representations.  Fortunately, most gates one would need are already included in the Zuse standard library for handling `json` structures.  The standard JSON-style operations include:

- [`++numb:enjs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_1-5#numbenjsformat) converts from `@u` to a JSON number (as `knot`).

    ```hoon
    > (numb:enjs:format 0xdead.beef)
    [%n p=~.3735928559]
    ```

- [`++ne:dejs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_6#nedejsformat) parses a JSON-style string as a real, or `@rd`.

    ```hoon
    > (ne:dejs:format n+'0.31415e1')
    .~3.1415
    ```

- [`++ni:dejs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_6#nidejsformat) parses a JSON-style string as an integer, or `@ud`.

    ```hoon
    > (ni:dejs:format n+'65536')
    65.536
    ```

- `++ns:dejs:format` parses a JSON-style string as a signed integer, or `@sd`.

    ```hoon
    > (ns:dejs:format n+'-1')
    -1
    ```

- [`++nu:dejs:format`](https://developers.urbit.org/reference/hoon/zuse/2d_6#nudejsformat) parses a JSON-style string as a hexadecimal.

    ```hoon
    > (nu:dejs:format s+'deadbeef')
    0xdead.beef
    ```

There are date format parsers as well, such as [`++du`](https://developers.urbit.org/reference/hoon/zuse/2d_6#dudejsformat).
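
These single-value parsers are usually combined to take apart a whole object.  The following is a sketch only, assuming the `++ot` (object as tuple) and `++so` (string as `cord`) arms of `dejs:format`, which are not covered above, together with the `de-json:html` parser used earlier.  It should yield a cell of the name as a `cord` and the age as a number:

```hoon
::  expose the dejs arms, then reparse two keys of a JSON object
=,  dejs:format
%.  (need (de-json:html '{"firstName": "John", "age": 27}'))
(ot ~[['firstName' so] ['age' ni]])
```

`++need` crashes if the text is not valid JSON, which is convenient in the Dojo but something a real agent would want to handle explicitly.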

Another category of converters are the MIME parsers.  These are nominally for webpages serving content, but prove useful in a variety of other situations as well.

- `++en:base16:mimes:html` converts a `@ux` hexadecimal value to a `cord` with zero-padding (while `++de` goes the other way).

    ```hoon
    > (en:base16:mimes:html 8 0x12.3456.7890.abcd)
    '001234567890abcd'

    > (de:base16:mimes:html '012345')
    [~ [p=3 q=74.565]]
    ```

There are base-64 and base-58 (Bitcoin address) parsers as well.
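
For instance, the base-64 encoder works on `octs` (a byte count paired with the data) rather than on a bare atom.  This sketch assumes `++as-octs:mimes:html` to build the `octs` from a `cord`:

```hoon
::  cord -> octs -> base-64 cord
> (en:base64:mimes:html (as-octs:mimes:html 'Hello Mars'))
'SGVsbG8gTWFycw=='
```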

### Sail (for HTML)

Sail is Hoon's internal markup for HTML and XML.  It can support all HTML tags and attributes.  The [Sail guide](https://developers.urbit.org/guides/additional/sail) contains full details on how to work with the markup format, but here I want to briefly demonstrate how text in Sail is handled.

Basically, Sail opens a tag and associates either the rest of the line (`:`) or continuing text until `==`.

```hoon
;html
  ;head
    ;title: My page
    ;meta(charset "utf-8");
  ==
  ;body
    ;h1: Welcome!
    ;p
      ; Hello, world!
      ; Welcome to my page.
      ; Here is an image:
      ;br;
      ;img@"https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg";
    ==
  ==
==
```

The `;` markers open a tag or, within a string like `<p>`'s content, mark subsequent lines.  Since the entire Sail file is a `tape`, we can use `tape` interpolation to inject the results of Hoon expressions.

```hoon
;p
  ; Hello, world!
  ; Welcome to my page.
  ; Today is {<now.bowl>}.
  ; I have {<+(4)>} fingers.
==
```
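
The same interpolation rules apply as for a `tape`:  braces take a `tape`, while gal and gar convert any value.  Here is a minimal sketch of a gate producing a `manx` (the Sail output type); the gate and its `name` argument are hypothetical:

```hoon
::  hypothetical gate: greet a visitor in a paragraph tag
|=  name=tape
^-  manx
;p
  ; Hello, {name}!
  ; Your name is {<(lent name)>} bytes long.
==
```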

### Further Reading

This article may be considered a sister to the Hoon School pages on [“Trees and Addressing (Tapes)”](https://developers.urbit.org/guides/core/hoon-school/G-trees#exercise-tapes-for-text) and [“Text Processing I”](https://developers.urbit.org/guides/core/hoon-school/J-stdlib-text).  There are further details on many elements of working with strings in [“Working with Strings”](https://developers.urbit.org/guides/additional/strings), unsurprisingly.

You may also find [~wicdev-wisryt’s “Input and Output in Hoon”](https://urbit.org/blog/io-in-hoon) an instructive supplement.