shrub/sur/unicode-data.hoon

|%
:>  #  %unicode-data
:>    types to represent UnicdoeData.txt.
+|
++  line
  :>    an individual codepoint definition
  :>
  $:  code=@c               :<  codepoint in hexadecimal format
      name=tape             :<  character name
      gen=general           :<  type of character this is
      :>  canonical combining class for ordering algorithms
      can=@ud
      bi=bidi               :<  bidirectional category of this character
      de=decomp             :<  character decomposition mapping
      ::  todo: decimal/digit/numeric need to be parsed.
      decimal=tape          :<  decimal digit value (or ~)
      digit=tape            :<  digit value, covering non decimal radix forms
      numeric=tape          :<  numeric value, including fractions
      mirrored=?            :<  whether char is mirrored in bidirectional text
      old-name=tape         :<  unicode 1.0 compatibility name
      iso=tape              :<  iso 10646 comment field
      up=(unit @c)          :<  uppercase mapping codepoint
      low=(unit @c)         :<  lowercase mapping codepoint
      title=(unit @c)       :<  titlecase mapping codepoint
  ==
::
++  general
  :>    one of the normative or informative unicode general categories
  :>
  :>  these abbreviations are as found in the unicode standard, except
  :>  lowercased as to be valid symbols.
  $?  $lu  :<  letter, uppercase
      $ll  :<  letter, lowercase
      $lt  :<  letter, titlecase
      $mn  :<  mark, non-spacing
      $mc  :<  mark, spacing combining
      $me  :<  mark, enclosing
      $nd  :<  number, decimal digit
      $nl  :<  number, letter
      $no  :<  number, other
      $zs  :<  separator, space
      $zl  :<  separator, line
      $zp  :<  separator, paragraph
      $cc  :<  other, control
      $cf  :<  other, format
      $cs  :<  other, surrogate
      $co  :<  other, private use
      $cn  :<  other, not assigned
      ::
      $lm  :<  letter, modifier
      $lo  :<  letter, other
      $pc  :<  punctuation, connector
      $pd  :<  punctuation, dash
      $ps  :<  punctuation, open
      $pe  :<  punctuation, close
      $pi  :<  punctuation, initial quote
      $pf  :<  punctuation, final quote
      $po  :<  punctuation, other
      $sm  :<  symbol, math
      $sc  :<  symbol, currency
      $sk  :<  symbol, modifier
      $so  :<  symbol, other
  ==
::
++  bidi
  :>  bidirectional category of a unicode character
  $?  $l    :<  left-to-right
      $lre  :<  left-to-right embedding
      $lri  :<  left-to-right isolate
      $lro  :<  left-to-right override
      $fsi  :<  first strong isolate
      $r    :<  right-to-left
      $al   :<  right-to-left arabic
      $rle  :<  right-to-left embedding
      $rli  :<  right-to-left isolate
      $rlo  :<  right-to-left override
      $pdf  :<  pop directional format
      $pdi  :<  pop directional isolate
      $en   :<  european number
      $es   :<  european number separator
      $et   :<  european number terminator
      $an   :<  arabic number
      $cs   :<  common number separator
      $nsm  :<  non-spacing mark
      $bn   :<  boundary neutral
      $b    :<  paragraph separator
      $s    :<  segment separator
      $ws   :<  whitespace
      $on   :<  other neutrals
  ==
::
++  decomp
  :>  character decomposition mapping.
  :>
  :>  tag: type of decomposition.
  :>  c: a list of codepoints this decomposes into.
  (unit {tag/(unit decomp-tag) c/(list @c)})
::
++  decomp-tag
  :>  tag that describes the type of a character decomposition.
  $?  $font      :<  a font variant
      $nobreak   :<  a no-break version of a space or hyphen
      $initial   :<  an initial presentation form (arabic)
      $medial    :<  a medial presentation form (arabic)
      $final     :<  a final presentation form (arabic)
      $isolated  :<  an isolated presentation form (arabic)
      $circle    :<  an encircled form
      $super     :<  a superscript form
      $sub       :<  a subscript form
      $vertical  :<  a vertical layout presentation form
      $wide      :<  a wide (or zenkaku) compatibility character
      $narrow    :<  a narrow (or hankaku) compatibility character
      $small     :<  a small variant form (cns compatibility)
      $square    :<  a cjk squared font variant
      $fraction  :<  a vulgar fraction form
      $compat    :<  otherwise unspecified compatibility character
  ==
::
:>  #
:>  #  %case-map
:>  #
:>    types to represent fast lookups of case data
+|
++  case-offset
  :>  case offsets can be in either direction
  $%  :>  add {a} to get the new character
      [%add a=@u]
      :>  subtract {a} to get the new character
      [%sub s=@u]
      :>  take no action; return self
      [%none $~]
      :>  represents series of alternating uppercase/lowercase characters
      [%uplo $~]
  ==
::
++  case-node
  :>    a node in a case-tree.
  :>
  :>  represents a range of
  $:  start=@ux
      end=@ux
      upper=case-offset
      lower=case-offset
      title=case-offset
  ==
::
++  case-tree
  :>  a binary search tree of ++case-node items, sorted on span.
  (tree case-node)
--