shrub/sur/unicode-data.hoon

|%
::  #  %unicode-data
::    types to represent UnicodeData.txt.
+|  %unicode-data
++  line
  ::  an individual codepoint definition
  ::
  $:  code=@c             ::  codepoint in hexadecimal format
      name=tape           ::  character name
      gen=general         ::  type of character this is
      ::  canonical combining class for ordering algorithms
      can=@ud
      bi=bidi             ::  bidirectional category of this character
      de=decomp           ::  character decomposition mapping
      ::  todo: decimal/digit/numeric need to be parsed.
      decimal=tape        ::  decimal digit value (or ~)
      digit=tape          ::  digit value, covering non-decimal radix forms
      numeric=tape        ::  numeric value, including fractions
      mirrored=?          ::  whether char is mirrored in bidirectional text
      old-name=tape       ::  unicode 1.0 compatibility name
      iso=tape            ::  iso 10646 comment field
      up=(unit @c)        ::  uppercase mapping codepoint
      low=(unit @c)       ::  lowercase mapping codepoint
      title=(unit @c)     ::  titlecase mapping codepoint
  ==
::
++  general
  ::  one of the normative or informative unicode general categories
  ::
  ::  these abbreviations are as found in the unicode standard, except
  ::  lowercased so as to be valid symbols.
  $?  $lu  ::  letter, uppercase
      $ll  ::  letter, lowercase
      $lt  ::  letter, titlecase
      $mn  ::  mark, non-spacing
      $mc  ::  mark, spacing combining
      $me  ::  mark, enclosing
      $nd  ::  number, decimal digit
      $nl  ::  number, letter
      $no  ::  number, other
      $zs  ::  separator, space
      $zl  ::  separator, line
      $zp  ::  separator, paragraph
      $cc  ::  other, control
      $cf  ::  other, format
      $cs  ::  other, surrogate
      $co  ::  other, private use
      $cn  ::  other, not assigned
      ::
      $lm  ::  letter, modifier
      $lo  ::  letter, other
      $pc  ::  punctuation, connector
      $pd  ::  punctuation, dash
      $ps  ::  punctuation, open
      $pe  ::  punctuation, close
      $pi  ::  punctuation, initial quote
      $pf  ::  punctuation, final quote
      $po  ::  punctuation, other
      $sm  ::  symbol, math
      $sc  ::  symbol, currency
      $sk  ::  symbol, modifier
      $so  ::  symbol, other
  ==
::
++  bidi
  ::  bidirectional category of a unicode character
  $?  $l    ::  left-to-right
      $lre  ::  left-to-right embedding
      $lri  ::  left-to-right isolate
      $lro  ::  left-to-right override
      $fsi  ::  first strong isolate
      $r    ::  right-to-left
      $al   ::  right-to-left arabic
      $rle  ::  right-to-left embedding
      $rli  ::  right-to-left isolate
      $rlo  ::  right-to-left override
      $pdf  ::  pop directional format
      $pdi  ::  pop directional isolate
      $en   ::  european number
      $es   ::  european number separator
      $et   ::  european number terminator
      $an   ::  arabic number
      $cs   ::  common number separator
      $nsm  ::  non-spacing mark
      $bn   ::  boundary neutral
      $b    ::  paragraph separator
      $s    ::  segment separator
      $ws   ::  whitespace
      $on   ::  other neutrals
  ==
::
++  decomp
  ::  character decomposition mapping.
  ::
  ::    tag: type of decomposition.
  ::    c: a list of codepoints this decomposes into.
  (unit {tag/(unit decomp-tag) c/(list @c)})
::
++  decomp-tag
  ::  tag that describes the type of a character decomposition.
  $?  $font      ::  a font variant
      $nobreak   ::  a no-break version of a space or hyphen
      $initial   ::  an initial presentation form (arabic)
      $medial    ::  a medial presentation form (arabic)
      $final     ::  a final presentation form (arabic)
      $isolated  ::  an isolated presentation form (arabic)
      $circle    ::  an encircled form
      $super     ::  a superscript form
      $sub       ::  a subscript form
      $vertical  ::  a vertical layout presentation form
      $wide      ::  a wide (or zenkaku) compatibility character
      $narrow    ::  a narrow (or hankaku) compatibility character
      $small     ::  a small variant form (cns compatibility)
      $square    ::  a cjk squared font variant
      $fraction  ::  a vulgar fraction form
      $compat    ::  otherwise unspecified compatibility character
  ==
::
::  #
::  #  %case-map
::  #
::    types to represent fast lookups of case data
+|  %case-map
++  case-offset
  ::  case offsets can be in either direction
  $%  ::  add {a} to get the new character
      [%add a=@u]
      ::  subtract {s} to get the new character
      [%sub s=@u]
      ::  take no action; return self
      [%none ~]
      ::  represents a series of alternating uppercase/lowercase characters
      [%uplo ~]
  ==
::
++  case-node
  ::  a node in a case-tree.
  ::
  ::  represents a range of codepoints, from {start} to {end}, that share
  ::  the same upper, lower, and title case offsets.
  $:  start=@ux
      end=@ux
      upper=case-offset
      lower=case-offset
      title=case-offset
  ==
::
++  case-tree
  ::  a binary search tree of ++case-node items, sorted on span.
  (tree case-node)
--