streamly/unicode.md at 36227986f1117dc61948e260f9dfdcefb5fe0877

mirror of https://github.com/ilyakooo0/streamly.git synced 2024-09-11 08:25:40 +03:00

Harendra Kumar 688d68d17c Add a document about unicode text processing plans

2020-05-11 20:02:02 +05:30

Overview

Unicode is a standard that defines a character set, encodings of that set, attributes and properties of characters, and rules for processing strings and paragraphs, including processing text in a locale-specific manner.

Char Set

The Unicode char set represents characters from all languages in the world. Each character is assigned a code point, a unique number identifying the character, written as U+0076, where the hex digits 0076 represent the unique number assigned to the character. Code points are written with at least four hex digits and range from U+0000 to U+10FFFF.
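The U+XXXX notation can be reproduced from a Haskell Char, whose ord is its code point. The following is a minimal sketch using only base; the name codePointLabel is made up for this illustration.

```haskell
import Data.Char (ord, toUpper)
import Numeric (showHex)

-- | Render a character's code point in the conventional U+XXXX form,
-- zero-padded to at least four hex digits.
-- (codePointLabel is a hypothetical helper name, not an existing API.)
codePointLabel :: Char -> String
codePointLabel c = "U+" ++ replicate (4 - length hex) '0' ++ hex
  where
    -- showHex produces lowercase digits; the notation uses uppercase.
    hex = map toUpper (showHex (ord c) "")
```

For example, codePointLabel 'v' evaluates to "U+0076", and a character outside the Basic Multilingual Plane such as '\x1F600' gives "U+1F600".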

Encodings

A Unicode code point can be encoded using a:

fixed length encoding with a 32-bit value directly representing the code point in little endian (UTF32LE) or big endian (UTF32BE) byte ordering. See https://en.wikipedia.org/wiki/UTF-32.
variable length encoding with one or two 16-bit values depending on the code point, UTF16LE and UTF16BE. See https://en.wikipedia.org/wiki/UTF-16.
variable length encoding with one to four 8-bit values depending on the code point, UTF8. See https://en.wikipedia.org/wiki/UTF-8.
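To make the variable length encodings concrete, here is a small sketch of per-character UTF-8 and UTF-16 encoders written against base only. The function names are illustrative, byte order for UTF-16 is left out (the result is a list of 16-bit code units), and surrogate code points are not rejected as a production encoder would have to do.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word16, Word8)

-- | Encode one character as UTF-8: 1 to 4 bytes depending on the code point.
-- Illustrative only: surrogates (U+D800..U+DFFF) are not rejected.
encodeUtf8 :: Char -> [Word8]
encodeUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Payload of the leading byte: the bits above position s.
    top s = fromIntegral (n `shiftR` s)
    -- Continuation byte 10xxxxxx carrying six bits starting at position s.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Encode one character as UTF-16: one code unit for the BMP, otherwise a
-- surrogate pair (high surrogate first).
encodeUtf16 :: Char -> [Word16]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   =
      [ 0xD800 .|. fromIntegral (m `shiftR` 10)
      , 0xDC00 .|. fromIntegral (m .&. 0x3FF)
      ]
  where
    n = ord c
    m = n - 0x10000
```

With these definitions, encodeUtf8 '\x20AC' (the euro sign) yields the three bytes [0xE2, 0x82, 0xAC], while encodeUtf8 '\x1F600' yields four bytes and encodeUtf16 '\x1F600' yields the surrogate pair [0xD83D, 0xDE00].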

i18n and L10n

Internationalization (i18n) is the ability to represent and process all languages in the world. Unicode supports i18n by representing all languages and their common processing rules.

Localization (L10n) is the ability to customize the common internationalized processing for a country or region (locale). Unicode specifies various standard locales, which include customization of the attributes and processing rules for each locale. Custom locales can be created with custom text processing rules.

POSIX Locales

On Debian Linux, the default system wide locale can be administered using localectl or sudo dpkg-reconfigure locales.

In a shell, the locale command shows the current locale settings. When you start a program from the shell, it inherits these settings via the process environment, and the C library loads and uses the appropriate locale. Some GUI programs also use the values from the environment when started from the shell. Other GUI programs may have their own locale settings that can be configured from their menus.

The following environment variables can override the system wide locale and determine how a process (handled by libc) performs unicode text processing for different locale aspects:

LC_CTYPE     Defines character classification and case conversion.
LC_COLLATE   Defines collation rules.
LC_MONETARY  Defines the format and symbols used in formatting of monetary information.
LC_NUMERIC   Defines the decimal delimiter, grouping, and grouping symbol for non-monetary numeric editing.
LC_TIME      Defines the format and content of date and time information.
LC_MESSAGES  Defines the format and values of affirmative and negative responses.

Each environment variable above can be set to any of the available locale settings, written in the form locale.charmap. For example, LC_CTYPE=en_US.UTF-8 specifies the locale en_US, for the language en (English) and the territory US (USA), to be used for character classification (e.g. iswalpha) and case conversion (e.g. toupper). UTF-8 is the charmap used by the locale.
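The structure of such a locale name can be sketched as a small parser. This assumes the general glibc form language[_territory][.codeset][@modifier]; the type and function names are hypothetical, and no validation is done, so values like C or POSIX simply end up in the language field.

```haskell
-- | Components of a locale name of the assumed form
-- language[_territory][.codeset][@modifier].
-- (LocaleName and parseLocaleName are illustrative names only.)
data LocaleName = LocaleName
  { lnLanguage  :: String
  , lnTerritory :: Maybe String
  , lnCodeset   :: Maybe String
  , lnModifier  :: Maybe String
  } deriving (Eq, Show)

-- | Split a locale name into its parts without validating them.
parseLocaleName :: String -> LocaleName
parseLocaleName s = LocaleName lang (strip terr) (strip cs) (strip modi)
  where
    -- Peel the optional parts off from the right: @modifier, then
    -- .codeset, then _territory.
    (beforeMod, modi) = break (== '@') s
    (beforeCs, cs)    = break (== '.') beforeMod
    (lang, terr)      = break (== '_') beforeCs
    -- Drop the leading separator; an absent part becomes Nothing.
    strip []      = Nothing
    strip (_ : x) = Just x
```

For instance, parseLocaleName "en_US.UTF-8" gives language "en", territory Just "US", codeset Just "UTF-8" and no modifier.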

C and POSIX locales are the same and are used by default. glibc also provides a generic i18n locale.

Use the locale -a command to list all locales available on a system and locale -m to list the charmaps. See man locale as well, and /usr/share/i18n/SUPPORTED on Linux (Debian). See https://www.gnu.org/software/libc/manual/html_node/Locale-Names.html for the format in which these environment variables can be specified.

The environment variables are preferred in the following order:

LC_ALL       Overrides everything else
LC_*         Individual aspect customization
LANG         Used as a substitute for any unset LC_* variable

Normally only LANG should be used. If a particular aspect needs to be customized, then the LC_* variables can be used. LC_ALL overrides everything. On GNU/Linux, LANGUAGE can also be set to a list of preferred languages separated by ":". LANGUAGE is used by gettext for selecting message translations; it is effective only if the locale has been set to something other than the default "C" locale, and it then has higher priority than anything else.
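The precedence order above can be written down as a small pure function. This is a sketch under stated assumptions: the environment is passed in as a list of name/value pairs, empty values are treated as unset (as POSIX specifies), the fallback when nothing is set is taken to be "C", and the GNU-specific LANGUAGE variable is deliberately ignored. The name effectiveLocale is hypothetical.

```haskell
import Data.Maybe (fromMaybe, listToMaybe, mapMaybe)

-- | Resolve the locale for one category (for example "LC_CTYPE") from an
-- environment given as (name, value) pairs.
-- Precedence: LC_ALL, then the category variable, then LANG, then "C".
-- Empty values are treated as unset. LANGUAGE is not considered.
effectiveLocale :: [(String, String)] -> String -> String
effectiveLocale env category =
  fromMaybe "C" (listToMaybe (mapMaybe nonEmpty ["LC_ALL", category, "LANG"]))
  where
    -- Look a variable up, discarding it when it is missing or empty.
    nonEmpty name = case lookup name env of
      Just v | not (null v) -> Just v
      _                     -> Nothing
```

In a real program the environment list could come from System.Environment.getEnvironment. Keeping the function pure makes the ordering easy to check: with both LANG=en_US.UTF-8 and LC_CTYPE=de_DE.UTF-8 present, the result for "LC_CTYPE" is "de_DE.UTF-8", and adding LC_ALL changes the result for every category.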

https://www.gnu.org/software/gettext/manual/gettext.html#Users has a good overview of locale settings.
https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html POSIX Locales standard

Localizing Messages

GNU gettext (https://www.gnu.org/software/gettext/manual/gettext.html) can be used by programs to translate (localize) user-facing text into multiple languages. Catalogs of localized program messages (errors, alerts, notifications) are installed at e.g. /usr/share/locale/en_US/LC_MESSAGES/.

Creating Locales

From the Debian manpage of the POSIX localedef command:

The localedef program reads the indicated charmap and input files, compiles them to a binary form quickly usable by the locale functions in the C library (setlocale(3), localeconv(3), etc.), and places the output in outputpath.

See locale-gen for a higher-level program to generate locales. /usr/lib/locale/locale-archive contains the generated binary files. On Debian you can use sudo dpkg-reconfigure locales to select locales and set the system locale.

On Linux/glibc (Debian), installed charmaps can be found at /usr/share/i18n/charmaps/ and locale definition input files at /usr/share/i18n/locales/.

ICU Locales

An ICU locale is frequently confused with a POSIX locale ID. An ICU locale ID is not a POSIX locale ID. ICU locales do not specify the encoding and specify variant locales differently.

http://userguide.icu-project.org/locale

Localizing Messages

http://userguide.icu-project.org/locale/localizing

Unicode Text processing

Locale Independent

Char Properties
Normalization
Regex matching
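As a small illustration of locale-independent character properties, the sketch below queries a few of them through the Data.Char module from base. It shows what is available today, not the planned API, and describeChar is a made-up helper name.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isAlpha, isSpace)

-- | A few locale-independent properties of a character, as reported by
-- Data.Char: its Unicode general category, whether it is alphabetic, and
-- whether it is white space.
-- (describeChar is a hypothetical helper for this example.)
describeChar :: Char -> (GeneralCategory, Bool, Bool)
describeChar c = (generalCategory c, isAlpha c, isSpace c)
```

For example, describeChar 'a' is (LowercaseLetter, True, False) and describeChar ' ' is (Space, False, True). The answers depend only on the code point, never on the process locale.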

Locale Specific

Case Mapping:
- https://unicode.org/faq/casemap_charprop.html
- http://www.unicode.org/versions/latest/ch05.pdf#G21180
- ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
- ftp://ftp.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
Breaking
Collation
Charset Conversion

Haskell

Haskell Unicode Text Processing

Related packages on hackage:

base Data.Char module, uses libc
text
text-icu C bindings to icu
https://github.com/composewell/unicode-transforms
https://github.com/llelf/prose
unicode-properties Unicode 3.2.0 character properties
hxt-charproperties Character properties and classes for XML and Unicode
unicode-names Unicode 3.2.0 character names
unicode Construct and transform unicode characters
charset Fast unicode character sets

None of the existing Haskell packages provides comprehensive and fast access to character properties. unicode-transforms provides composition/decomposition data and a script to extract data from the Unicode database.

Haskell Localization

TODO

Factor out a unicode-data package from unicode-transforms. The unicode-transforms package will depend on unicode-data and can continue to be used as is. Other packages can use unicode-data to provide Unicode text processing services.

To begin with, this package will contain:

char properties data
case mapping data
unicode normalization data

This package can be used in Streamly.Internal.Data.Unicode.* to provide:

Fast access to char properties; we will no longer depend on libc for that, and no FFI will be required for iswspace, iswalpha etc.
Correct case mappings from a single char to multi-char
Stream based unicode normalization
Breaking (locale independent for now)
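The single char to multi-char case mapping item is worth a concrete illustration. Data.Char.toUpper maps one Char to one Char, so it cannot express mappings that expand, such as ß uppercasing to SS. The sketch below layers a tiny hand-copied table of unconditional expanding mappings over toUpper. The table is a three-entry excerpt chosen for this example rather than generated data, and the function names are hypothetical, not a proposed API.

```haskell
import Data.Char (toUpper)
import Data.Maybe (fromMaybe)

-- | Hand-copied sample of unconditional uppercase mappings that expand to
-- more than one character. A real implementation would generate the full
-- table from the Unicode database (SpecialCasing.txt).
specialUpper :: [(Char, String)]
specialUpper =
  [ ('\x00DF', "SS")  -- LATIN SMALL LETTER SHARP S
  , ('\xFB01', "FI")  -- LATIN SMALL LIGATURE FI
  , ('\xFB02', "FL")  -- LATIN SMALL LIGATURE FL
  ]

-- | Uppercase one character, allowing the result to be several characters.
-- Falls back to the single-character mapping from Data.Char.
toUpperFull :: Char -> String
toUpperFull c = fromMaybe [toUpper c] (lookup c specialUpper)

-- | Uppercase a string using the full mapping for each character.
upperString :: String -> String
upperString = concatMap toUpperFull
```

Because each input character maps to a short sequence of output characters, the same shape carries over naturally to a stream transformation: one element in, zero or more elements out.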

Later

Add locale data from CLDR to unicode-data to provide locale specific services as well.
Support locale specific breaking, regex, collation and charset conversion.
Provide a facility to add customized locale data.
Support providing application level resource data for localization.
