shrub/sur/unicode-data.hoon

|%
:>  #  %unicode-data
:>    types to represent UnicdoeData.txt.
+|
++  line
  :>    an individual codepoint definition
  :>
  $:  code/@c               :< codepoint in hexadecimal format
      name/tape             :< character name
      gen/general           :< type of character this is
      :>  canonical combining class for ordering algorithms
      can/@ud          
      bi/bidi               :< bidirectional category of this character
      de/decomp             :< character decomposition mapping
      ::  todo: decimal/digit/numeric need to be parsed.                    
      decimal/tape          :< decimal digit value (or ~)
      digit/tape            :< digit value, covering non decimal radix forms
      numeric/tape          :< numeric value, including fractions
      mirrored/?            :< whether char is mirrored in bidirectional text
      old-name/tape         :< unicode 1.0 compatibility name
      iso/tape              :< iso 10646 comment field
      up/(unit @c)          :< uppercase mapping codepoint
      low/(unit @c)         :< lowercase mapping codepoint
      title/(unit @c)       :< titlecase mapping codepoint
  ==
::
++  general
  :>    one of the normative or informative unicode general categories
  :>
  :>  these abbreviations are as found in the unicode standard, except
  :>  lowercased as to be valid symbols.
  $?  $lu  :<  letter, uppercase
      $ll  :<  letter, lowercase
      $lt  :<  letter, titlecase
      $mn  :<  mark, non-spacing
      $mc  :<  mark, spacing combining
      $me  :<  mark, enclosing
      $nd  :<  number, decimal digit
      $nl  :<  number, letter
      $no  :<  number, other
      $zs  :<  separator, space
      $zl  :<  separator, line
      $zp  :<  separator, paragraph
      $cc  :<  other, control
      $cf  :<  other, format
      $cs  :<  other, surrogate
      $co  :<  other, private use
      $cn  :<  other, not assigned
      ::
      $lm  :<  letter, modifier
      $lo  :<  letter, other
      $pc  :<  punctuation, connector
      $pd  :<  punctuation, dash
      $ps  :<  punctuation, open
      $pe  :<  punctuation, close
      $pi  :<  punctuation, initial quote
      $pf  :<  punctuation, final quote
      $po  :<  punctuation, other
      $sm  :<  symbol, math
      $sc  :<  symbol, currency
      $sk  :<  symbol, modifier
      $so  :<  symbol, other
  ==
::
++  bidi
  :>  bidirectional category of a unicode character
  $?  $l    :<  left-to-right
      $lre  :<  left-to-right embedding
      $lri  :<  left-to-right isolate
      $lro  :<  left-to-right override
      $fsi  :<  first strong isolate
      $r    :<  right-to-left
      $al   :<  right-to-left arabic
      $rle  :<  right-to-left embedding
      $rli  :<  right-to-left isolate
      $rlo  :<  right-to-left override
      $pdf  :<  pop directional format
      $pdi  :<  pop directional isolate
      $en   :<  european number
      $es   :<  european number separator
      $et   :<  european number terminator
      $an   :<  arabic number
      $cs   :<  common number separator
      $nsm  :<  non-spacing mark
      $bn   :<  boundary neutral
      $b    :<  paragraph separator
      $s    :<  segment separator
      $ws   :<  whitespace
      $on   :<  other neutrals
  ==
::
++  decomp
  :>  character decomposition mapping.
  :>
  :>  tag: type of decomposition.
  :>  c: a list of codepoints this decomposes into.
  (unit {tag/(unit decomp-tag) c/(list @c)})
::
++  decomp-tag
  :>  tag that describes the type of a character decomposition.
  $?  $font      :<  a font variant
      $no-break  :<  a no-break version of a space or hyphen
      $initial   :<  an initial presentation form (arabic)
      $medial    :<  a medial presentation form (arabic)
      $final     :<  a final presentation form (arabic)
      $isolated  :<  an isolated presentation form (arabic)
      $circle    :<  an encircled form
      $super     :<  a superscript form
      $sub       :<  a subscript form
      $vertical  :<  a vertical layout presentation form
      $wide      :<  a wide (or zenkaku) compatibility character
      $narrow    :<  a narrow (or hankaku) compatibility character
      $small     :<  a small variant form (cns compatibility)
      $square    :<  a cjk squared font variant
      $fraction  :<  a vulgar fraction form
      $compat    :<  otherwise unspecified compatibility character
  ==
::
:>  #
:>  #  %case-map
:>  #
:>    types to represent fast lookups of case data
+|
++  case-offset
  :>  case offsets can be in either direction
  $%  :>  add {a} to get the new character
      {$add a/@u}
      :>  subtract {a} to get the new character
      {$sub s/@u}
      :>  take no action; return self
      {$none $~}
      :>  represents series of alternating uppercase/lowercase characters
      {$uplo $~}
  ==
::
++  case-node
  :>    a node in a case-tree.
  :>
  :>  represents a range of
  $:  start/@ux
      end/@ux
      upper/case-offset
      lower/case-offset
      title/case-offset
  ==
::
++  case-tree
  :>  a binary search tree of ++case-node items, sorted on span.
  (tree case-node)
--
Add real unicode uppercasing. Place UnicodeData.txt in %/lib/unicode-data/txt. The +capitalize generator should then capitalize any unicode tape, including characters outside lower ASCII. 2017-10-07 06:52:33 +03:00			`\|%`
Complete prototype for a unicode aware ++to-{upper,lower}. This uses a mark to parse UnicodeData.txt, and some ford runes to change this into a binary search tree data structure for quick lookups, along with the optimizations found in golang's unicode table to record ranges instead of individual characters. 2017-10-13 08:55:45 +03:00			`:> # %unicode-data`
			`:> types to represent UnicdoeData.txt.`
			`+\|`
Add real unicode uppercasing. Place UnicodeData.txt in %/lib/unicode-data/txt. The +capitalize generator should then capitalize any unicode tape, including characters outside lower ASCII. 2017-10-07 06:52:33 +03:00			`++ line`
			`:> an individual codepoint definition`
			`:>`
WIP simplify parsing constants to ++soft, comment placement 2017-10-13 19:47:38 +03:00			`$: code/@c :< codepoint in hexadecimal format`
			`name/tape :< character name`
			`gen/general :< type of character this is`
			`:> canonical combining class for ordering algorithms`
			`can/@ud`
			`bi/bidi :< bidirectional category of this character`
			`de/decomp :< character decomposition mapping`
			`:: todo: decimal/digit/numeric need to be parsed.`
			`decimal/tape :< decimal digit value (or ~)`
			`digit/tape :< digit value, covering non decimal radix forms`
			`numeric/tape :< numeric value, including fractions`
			`mirrored/? :< whether char is mirrored in bidirectional text`
			`old-name/tape :< unicode 1.0 compatibility name`
			`iso/tape :< iso 10646 comment field`
			`up/(unit @c) :< uppercase mapping codepoint`
			`low/(unit @c) :< lowercase mapping codepoint`
			`title/(unit @c) :< titlecase mapping codepoint`
Add real unicode uppercasing. Place UnicodeData.txt in %/lib/unicode-data/txt. The +capitalize generator should then capitalize any unicode tape, including characters outside lower ASCII. 2017-10-07 06:52:33 +03:00			`==`
			`::`
			`++ general`
			`:> one of the normative or informative unicode general categories`
			`:>`
			`:> these abbreviations are as found in the unicode standard, except`
			`:> lowercased as to be valid symbols.`
			`$? $lu :< letter, uppercase`
			`$ll :< letter, lowercase`
			`$lt :< letter, titlecase`
			`$mn :< mark, non-spacing`
			`$mc :< mark, spacing combining`
			`$me :< mark, enclosing`
			`$nd :< number, decimal digit`
			`$nl :< number, letter`
			`$no :< number, other`
			`$zs :< separator, space`
			`$zl :< separator, line`
			`$zp :< separator, paragraph`
			`$cc :< other, control`
			`$cf :< other, format`
			`$cs :< other, surrogate`
			`$co :< other, private use`
			`$cn :< other, not assigned`
			`::`
			`$lm :< letter, modifier`
			`$lo :< letter, other`
			`$pc :< punctuation, connector`
			`$pd :< punctuation, dash`
			`$ps :< punctuation, open`
			`$pe :< punctuation, close`
			`$pi :< punctuation, initial quote`
			`$pf :< punctuation, final quote`
			`$po :< punctuation, other`
			`$sm :< symbol, math`
			`$sc :< symbol, currency`
			`$sk :< symbol, modifier`
			`$so :< symbol, other`
			`==`
			`::`
			`++ bidi`
			`:> bidirectional category of a unicode character`
			`$? $l :< left-to-right`
			`$lre :< left-to-right embedding`
			`$lri :< left-to-right isolate`
			`$lro :< left-to-right override`
			`$fsi :< first strong isolate`
			`$r :< right-to-left`
			`$al :< right-to-left arabic`
			`$rle :< right-to-left embedding`
			`$rli :< right-to-left isolate`
			`$rlo :< right-to-left override`
			`$pdf :< pop directional format`
			`$pdi :< pop directional isolate`
			`$en :< european number`
			`$es :< european number separator`
			`$et :< european number terminator`
			`$an :< arabic number`
			`$cs :< common number separator`
			`$nsm :< non-spacing mark`
			`$bn :< boundary neutral`
			`$b :< paragraph separator`
			`$s :< segment separator`
			`$ws :< whitespace`
			`$on :< other neutrals`
			`==`
			`::`
			`++ decomp`
			`:> character decomposition mapping.`
			`:>`
			`:> tag: type of decomposition.`
			`:> c: a list of codepoints this decomposes into.`
			`(unit {tag/(unit decomp-tag) c/(list @c)})`
			`::`
			`++ decomp-tag`
			`:> tag that describes the type of a character decomposition.`
			`$? $font :< a font variant`
			`$no-break :< a no-break version of a space or hyphen`
			`$initial :< an initial presentation form (arabic)`
			`$medial :< a medial presentation form (arabic)`
			`$final :< a final presentation form (arabic)`
			`$isolated :< an isolated presentation form (arabic)`
			`$circle :< an encircled form`
			`$super :< a superscript form`
			`$sub :< a subscript form`
			`$vertical :< a vertical layout presentation form`
			`$wide :< a wide (or zenkaku) compatibility character`
			`$narrow :< a narrow (or hankaku) compatibility character`
			`$small :< a small variant form (cns compatibility)`
			`$square :< a cjk squared font variant`
			`$fraction :< a vulgar fraction form`
			`$compat :< otherwise unspecified compatibility character`
			`==`
Complete prototype for a unicode aware ++to-{upper,lower}. This uses a mark to parse UnicodeData.txt, and some ford runes to change this into a binary search tree data structure for quick lookups, along with the optimizations found in golang's unicode table to record ranges instead of individual characters. 2017-10-13 08:55:45 +03:00			`::`
			`:> #`
			`:> # %case-map`
			`:> #`
			`:> types to represent fast lookups of case data`
			`+\|`
			`++ case-offset`
			`:> case offsets can be in either direction`
			`$% :> add {a} to get the new character`
			`{$add a/@u}`
			`:> subtract {a} to get the new character`
			`{$sub s/@u}`
			`:> take no action; return self`
			`{$none $~}`
			`:> represents series of alternating uppercase/lowercase characters`
			`{$uplo $~}`
			`==`
			`::`
			`++ case-node`
			`:> a node in a case-tree.`
			`:>`
			`:> represents a range of`
			`$: start/@ux`
			`end/@ux`
			`upper/case-offset`
			`lower/case-offset`
			`title/case-offset`
			`==`
			`::`
			`++ case-tree`
			`:> a binary search tree of ++case-node items, sorted on span.`
			`(tree case-node)`
Add real unicode uppercasing. Place UnicodeData.txt in %/lib/unicode-data/txt. The +capitalize generator should then capitalize any unicode tape, including characters outside lower ASCII. 2017-10-07 06:52:33 +03:00			`--`