2017-10-07 06:52:33 +03:00
|%
:>  #  %unicode-data
:>    types to represent UnicodeData.txt.
+|
++  line
  :>    an individual codepoint definition
  :>
  $:  code/@c          :<  codepoint in hexadecimal format
      name/tape        :<  character name
      gen/general      :<  type of character this is
      :>  canonical combining class for ordering algorithms
      can/@ud
      bi/bidi          :<  bidirectional category of this character
      de/decomp        :<  character decomposition mapping
      ::  todo: decimal/digit/numeric need to be parsed.
      decimal/tape     :<  decimal digit value (or ~)
      digit/tape       :<  digit value, covering non decimal radix forms
      numeric/tape     :<  numeric value, including fractions
      mirrored/?       :<  whether char is mirrored in bidirectional text
      old-name/tape    :<  unicode 1.0 compatibility name
      iso/tape         :<  iso 10646 comment field
      up/(unit @c)     :<  uppercase mapping codepoint
      low/(unit @c)    :<  lowercase mapping codepoint
      title/(unit @c)  :<  titlecase mapping codepoint
  ==
::
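::  illustrative example (not part of the original source): the
::  UnicodeData.txt entry for U+0041 is
::
::    0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
::
::  assuming a parser that fills ++line field by field, this would
::  plausibly give: code is the codepoint 0x41, name is
::  "LATIN CAPITAL LETTER A", gen is %lu, can is 0, bi is %l, de is ~,
::  the decimal/digit/numeric tapes are empty, mirrored is false, the
::  old-name and iso tapes are empty, up is ~, low holds the codepoint
::  0x61, and title is ~.
::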
++  general
  :>    one of the normative or informative unicode general categories
  :>
  :>  these abbreviations are as found in the unicode standard, except
  :>  lowercased so as to be valid symbols.
  $?  $lu  :<  letter, uppercase
      $ll  :<  letter, lowercase
      $lt  :<  letter, titlecase
      $mn  :<  mark, non-spacing
      $mc  :<  mark, spacing combining
      $me  :<  mark, enclosing
      $nd  :<  number, decimal digit
      $nl  :<  number, letter
      $no  :<  number, other
      $zs  :<  separator, space
      $zl  :<  separator, line
      $zp  :<  separator, paragraph
      $cc  :<  other, control
      $cf  :<  other, format
      $cs  :<  other, surrogate
      $co  :<  other, private use
      $cn  :<  other, not assigned
      ::
      $lm  :<  letter, modifier
      $lo  :<  letter, other
      $pc  :<  punctuation, connector
      $pd  :<  punctuation, dash
      $ps  :<  punctuation, open
      $pe  :<  punctuation, close
      $pi  :<  punctuation, initial quote
      $pf  :<  punctuation, final quote
      $po  :<  punctuation, other
      $sm  :<  symbol, math
      $sc  :<  symbol, currency
      $sk  :<  symbol, modifier
      $so  :<  symbol, other
  ==
::
++  bidi
  :>  bidirectional category of a unicode character
  $?  $l    :<  left-to-right
      $lre  :<  left-to-right embedding
      $lri  :<  left-to-right isolate
      $lro  :<  left-to-right override
      $fsi  :<  first strong isolate
      $r    :<  right-to-left
      $al   :<  right-to-left arabic
      $rle  :<  right-to-left embedding
      $rli  :<  right-to-left isolate
      $rlo  :<  right-to-left override
      $pdf  :<  pop directional format
      $pdi  :<  pop directional isolate
      $en   :<  european number
      $es   :<  european number separator
      $et   :<  european number terminator
      $an   :<  arabic number
      $cs   :<  common number separator
      $nsm  :<  non-spacing mark
      $bn   :<  boundary neutral
      $b    :<  paragraph separator
      $s    :<  segment separator
      $ws   :<  whitespace
      $on   :<  other neutrals
  ==
::
++  decomp
  :>    character decomposition mapping.
  :>
  :>  tag: type of decomposition.
  :>  c: a list of codepoints this decomposes into.
  (unit {tag/(unit decomp-tag) c/(list @c)})
::
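::  illustrative example (not part of the original source): in
::  UnicodeData.txt, U+00C5 has the decomposition field "0041 030A" and
::  U+00A0 has "<noBreak> 0020". assuming a parser that maps an empty
::  field to ~, these would plausibly be represented as:
::
::    U+00C5: tag is ~, c holds the codepoints 0x41 and 0x30a
::    U+00A0: tag is [~ %no-break], c holds the codepoint 0x20
::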
++  decomp-tag
  :>  tag that describes the type of a character decomposition.
  $?  $font      :<  a font variant
      $no-break  :<  a no-break version of a space or hyphen
      $initial   :<  an initial presentation form (arabic)
      $medial    :<  a medial presentation form (arabic)
      $final     :<  a final presentation form (arabic)
      $isolated  :<  an isolated presentation form (arabic)
      $circle    :<  an encircled form
      $super     :<  a superscript form
      $sub       :<  a subscript form
      $vertical  :<  a vertical layout presentation form
      $wide      :<  a wide (or zenkaku) compatibility character
      $narrow    :<  a narrow (or hankaku) compatibility character
      $small     :<  a small variant form (cns compatibility)
      $square    :<  a cjk squared font variant
      $fraction  :<  a vulgar fraction form
      $compat    :<  otherwise unspecified compatibility character
  ==
::
:>  #
:>  #  %case-map
:>  #
:>    types to represent fast lookups of case data
+|
++  case-offset
  :>  case offsets can be in either direction
  $%  :>  add {a} to get the new character
      {$add a/@u}
      :>  subtract {s} to get the new character
      {$sub s/@u}
      :>  take no action; return self
      {$none $~}
      :>  represents a series of alternating uppercase/lowercase characters
      {$uplo $~}
  ==
::
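::  illustrative example (not part of the original source): for
::  'A' (0x41) a lowercase offset of [%add 32] yields 0x61 ('a'), and
::  for 'a' (0x61) an uppercase offset of [%sub 32] yields 0x41 ('A').
::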
++  case-node
  :>    a node in a case-tree.
  :>
  :>  represents a range of codepoints and the case offsets that apply
  :>  to it.
  $:  start/@ux
      end/@ux
      upper/case-offset
      lower/case-offset
      title/case-offset
  ==
::
++  case-tree
  :>  a binary search tree of ++case-node items, sorted on span.
  (tree case-node)
--