## Overview
Unicode is a standard that defines a `char set`, `encodings`, attributes and
properties of characters, and rules for processing strings and paragraphs,
including processing of text in a `locale` specific manner.
### Char Set
The unicode `char set` represents characters from all languages in the world.
Each character is assigned a code point, a unique number identifying the
character, written as `U+0076`, where the four hex digits `0076` are the
number assigned to the character.
### Encodings
A unicode character can be encoded as (see the sketch after this list):
* a fixed length encoding with a single 32-bit value directly representing the
code point, in little endian (UTF32LE) or big endian (UTF32BE) byte ordering.
See https://en.wikipedia.org/wiki/UTF-32.
* a variable length encoding with one or two 16-bit values depending on the
code point, UTF16LE and UTF16BE. See https://en.wikipedia.org/wiki/UTF-16.
* a variable length encoding with one to four 8-bit values depending on the
code point, UTF8. See https://en.wikipedia.org/wiki/UTF-8.
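As a quick illustration of these encodings, a minimal Haskell sketch using the
`text` and `bytestring` packages (discussed in the Haskell section below):
```
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
    let t = T.pack "hi\x0905" -- 'h', 'i' and DEVANAGARI LETTER A (U+0905)
    print (B.unpack (TE.encodeUtf8 t))    -- one to four bytes per code point
    print (B.unpack (TE.encodeUtf16LE t)) -- one or two 16-bit units per code point
    print (B.unpack (TE.encodeUtf32LE t)) -- one 32-bit unit per code point
```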
### i18n and L10n
Internationalization (i18n) is being able to represent and process all
languages in the world. Unicode supports i18n by representing all languages
and their common processing rules.
Localization (L10n) is being able to customize the common internationalized
processing for a country or region (`locale`). Unicode specifies various
standard locales which include customizations of the attributes and processing
rules for each locale. Custom locales can be created with custom text
processing rules.
* https://en.wikipedia.org/wiki/Internationalization_and_localization .
* http://userguide.icu-project.org/i18n
## POSIX Locales
On Debian Linux, the default system wide locale can be administered using
`localectl` or `sudo dpkg-reconfigure locales`.
In a shell, the `locale` command shows the current locale settings. When you
start a program from the shell it inherits these settings via the process
environment and the C library loads and uses the appropriate locale. Some GUI
programs, if started from the shell, also use the values from the environment.
Other GUI programs may have their own locale settings that can be configured
from their menus.
The following environment variables can override the system wide locale and
determine how a process (handled by libc) performs unicode text processing for
different locale aspects:
```
LC_CTYPE Defines character classification and case conversion.
LC_COLLATE Defines collation rules.
LC_MONETARY Defines the format and symbols used in formatting of monetary information.
LC_NUMERIC Defines the decimal delimiter, grouping, and grouping symbol for non-monetary numeric editing.
LC_TIME Defines the format and content of date and time information.
LC_MESSAGES Defines the format and values of affirmative and negative responses.
```
Each environment variable above can be set to any available `locale`. For
example `LC_CTYPE=en_US.UTF-8` (i.e. `locale.charmap`) specifies the locale
`en_US`, for the language `en` (English) and the territory `US` (USA), to be
used for character classification (e.g. `iswalpha`) and case conversion (e.g.
`toupper`). `UTF-8` is the charmap used by the locale.
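A Haskell program picks up the charmap implied by these settings for its
standard handles via the GHC runtime; a minimal sketch (behaviour assumes a
glibc system where the locale is taken from the environment as described
above):
```
import GHC.IO.Encoding (getLocaleEncoding)

-- Prints the encoding GHC derived from the locale environment, e.g. UTF-8
-- when run with LC_CTYPE=en_US.UTF-8 (or LANG set to it), and typically
-- ANSI_X3.4-1968 (ASCII) under the default C/POSIX locale.
main :: IO ()
main = getLocaleEncoding >>= print
```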
`C` and `POSIX` locales are the same and are used by default. glibc also
provides a generic `i18n` locale.
Use the `locale -a` command to see all available locales on a system and
`locale -m` for charmaps. See `man locale` as well. Also see
`/usr/share/i18n/SUPPORTED` on Linux (Debian). See
https://www.gnu.org/software/libc/manual/html_node/Locale-Names.html for the
format in which these environment variables can be specified.
The environment variables are preferred in the following order:
```
LC_ALL Overrides everything else
LC_* Individual aspect customization
LANG Used as a substitute for any unset LC_* variable
```
Normally only `LANG` should be used. If a particular aspect needs to be
customized then the `LC_*` variables can be used. `LC_ALL` overrides everything.
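This precedence can be mimicked by a program that needs to determine the
effective setting for a category itself; a minimal sketch (the function below
is ours, not a standard API; note that empty variables are treated as unset):
```
import Data.Maybe (listToMaybe)
import System.Environment (lookupEnv)

-- Effective locale for a category, e.g. effectiveLocale "LC_CTYPE":
-- LC_ALL wins, then the category's own variable, then LANG.
effectiveLocale :: String -> IO (Maybe String)
effectiveLocale category = do
    vals <- mapM lookupEnv ["LC_ALL", category, "LANG"]
    return (listToMaybe [v | Just v <- vals, not (null v)])

main :: IO ()
main = effectiveLocale "LC_CTYPE" >>= print
```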
On GNU/Linux, `LANGUAGE` can also be used as a list of preferred languages
separated by ":". `LANGUAGE` takes effect only if `LANG` is set to something
other than the default, and it then has higher priority than anything else.
* https://www.gnu.org/software/gettext/manual/gettext.html#Users has a good
overview of locale settings.
* https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
POSIX Locales standard
### Localizing Messages
GNU gettext can be used by programs to translate/localize user facing text
into multiple languages. Catalogs of localized program error, alert and
notification messages are installed at e.g.
`/usr/share/locale/en_US/LC_MESSAGES/`.
* https://www.gnu.org/software/gettext/manual/gettext.html
### Creating Locales
From the Debian manpage of the `localedef` POSIX command:

    The localedef program reads the indicated charmap and input files,
    compiles them to a binary form quickly usable by the locale functions in
    the C library (setlocale(3), localeconv(3), etc.), and places the output
    in outputpath.

See `locale-gen` for a higher level program to generate locales.
`/usr/lib/locale/locale-archive` contains the generated binary files. On
Debian you can use `sudo dpkg-reconfigure locales` to select locales and set
the system locale.
On Linux/glibc (Debian), installed `charmaps` can be found at
`/usr/share/i18n/charmaps/` and locale definition input files at
`/usr/share/i18n/locales/`.
## ICU Locales
An ICU locale ID is frequently confused with a POSIX locale ID, but the two
are not the same: ICU locale IDs do not specify the encoding and specify
variant locales differently.
* http://userguide.icu-project.org/locale
### Localizing Messages
* http://userguide.icu-project.org/locale/localizing
## Unicode Text Processing
* http://userguide.icu-project.org/posix
* http://userguide.icu-project.org/strings/properties
### Locale Independent
1) Char Properties (see the sketch after this list)
2) Normalization
3) Regex matching
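For example, character properties such as the general category depend only on
the code point and are available from `base`'s Data.Char; a small sketch:
```
import Data.Char (generalCategory, isAlpha, isSpace)

-- General category and classification are properties of the code point
-- alone; they do not change with the locale.
main :: IO ()
main = mapM_ describe ['a', '1', ' ', '\x0905']
  where
    describe c =
        putStrLn (show c ++ ": " ++ show (generalCategory c)
                  ++ " alpha=" ++ show (isAlpha c)
                  ++ " space=" ++ show (isSpace c))
```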
### Locale Specific
1) Case Mapping (see the sketch after this list):
* https://unicode.org/faq/casemap_charprop.html
* http://www.unicode.org/versions/latest/ch05.pdf#G21180
* ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
* ftp://ftp.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
2) Breaking
3) Collation
4) Charset Conversion
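On case mapping, the sketch below contrasts the simple (char to char) mapping
in Data.Char with the full, possibly multi-char mapping done by the `text`
package; locale specific tailorings (e.g. the Turkish dotted/dotless i)
additionally require locale data:
```
import Data.Char (toUpper)
import qualified Data.Text as T

main :: IO ()
main = do
    -- Simple (char to char) case mapping cannot map U+00DF (LATIN SMALL
    -- LETTER SHARP S) to "SS", so Data.Char.toUpper leaves it unchanged.
    print (toUpper '\x00DF')
    -- Full case mapping works on strings and may grow them:
    -- "straße" uppercases to "STRASSE".
    print (T.toUpper (T.pack "stra\x00DF\&e"))
```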
## Haskell
### Haskell Unicode Text Processing
Related packages on hackage:
* [base](https://www.stackage.org/lts/package/base) Data.Char module, uses libc
* [text](https://www.stackage.org/lts/package/text)
* [text-icu](https://stackage.org/lts/package/text-icu) C bindings to icu
* https://github.com/composewell/unicode-transforms
* https://github.com/llelf/prose
* [unicode-properties](https://hackage.haskell.org/package/unicode-properties) Unicode 3.2.0 character properties
* [hxt-charproperties](http://www.stackage.org/lts/package/hxt-charproperties) Character properties and classes for XML and Unicode
* [unicode-names](http://hackage.haskell.org/package/unicode-names) Unicode 3.2.0 character names
* [unicode](https://hackage.haskell.org/package/unicode) Construct and transform unicode characters
* [charset](https://www.stackage.org/lts/package/charset) Fast unicode character sets
None of the existing Haskell packages provides comprehensive and fast access
to unicode character properties. `unicode-transforms` provides
composition/decomposition data and a script to extract the data from the
unicode character database.
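For example, normalization with `unicode-transforms` (a minimal sketch,
assuming its `Data.Text.Normalize` module as published on Hackage):
```
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFC, NFD), normalize)

-- U+00E9 (precomposed é) and 'e' followed by U+0301 (COMBINING ACUTE ACCENT)
-- are canonically equivalent; normalizing to a common form makes them equal.
main :: IO ()
main = do
    let precomposed = T.pack "\x00E9"
        decomposed  = T.pack "e\x0301"
    print (precomposed == decomposed)                -- False
    print (normalize NFC decomposed == precomposed)  -- True
    print (normalize NFD precomposed == decomposed)  -- True
```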
### Haskell Localization
* http://wiki.haskell.org/Internationalization_of_Haskell_programs
* https://hackage.haskell.org/package/localize
* http://hackage.haskell.org/package/localization
* http://hackage.haskell.org/package/hgettext
* http://hackage.haskell.org/package/i18n
* https://hackage.haskell.org/package/setlocale
* http://hackage.haskell.org/package/system-locale
* https://github.com/llelf/numerals
## TODO
Factor out a `unicode-data` package from `unicode-transforms`. The
`unicode-transforms` package will depend on `unicode-data` and can continue to
be used as is. Other packages can take advantage of `unicode-data` to provide
unicode text processing services.
To begin with, this package will contain:
* char properties data
* case mapping data
* unicode normalization data
This package can be used in `Streamly.Internal.Data.Unicode.*` to provide:
* Fast access to char properties; we will no longer depend on libc for that and
no FFI will be required for iswspace, iswalpha etc.
* Correct case mappings from a single char to multi-char
* Stream based unicode normalization
* Breaking (locale independent for now)
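A hypothetical sketch of how such functions could look and be used (all names
below are illustrative, not a decided API; Data.Char stands in here for the
generated property tables):
```
-- Hypothetical sketch only; names are illustrative, not a decided API.
import qualified Data.Char as C

-- Pure, table backed property queries; no FFI to libc's iswalpha/iswspace.
isAlpha, isSpace :: Char -> Bool
isAlpha = C.isAlpha -- placeholder for a generated-table lookup
isSpace = C.isSpace -- placeholder for a generated-table lookup

-- Full case mapping maps one Char to possibly several Chars.
toUpperFull :: Char -> String
toUpperFull '\x00DF' = "SS"          -- placeholder special case
toUpperFull c        = [C.toUpper c]

main :: IO ()
main = putStrLn (concatMap toUpperFull "stra\x00DF\&e") -- "STRASSE"
```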
## Later
* Add locale data from CLDR to `unicode-data` to provide locale specific
services as well.
* Support locale specific breaking, regex, collation and charset conversion.
* Provide a facility to add customized locale data.
* Support providing application level resource data for localization.