Commit Graph

259 Commits

Author SHA1 Message Date
Timothy Flynn
456211932f LibUnicode: Perform code point case conversion lookups in constant time
Similar to commit 0652cc4, we now generate 2-stage lookup tables for
case conversion information. Only about 1500 code points are actually
cased. This means that case information is rather highly compressible,
as the blocks we break the code points into will generally all have no
casing information at all.

In total, this change:

    * Does not change the size of libunicode.so (which is nice because,
      generally, the 2-stage lookup tables are expected to trade a bit
      of size for performance).

    * Reduces the runtime of the new benchmark test case added here from
      1.383s to 1.127s (about an 18.5% improvement).
2023-07-28 05:28:50 +02:00
Timothy Flynn
cb128dcf75 LibUnicode: Move the CodePointRangeComparator struct to a public header
Move it out of the generated code so that it may be used by the code
generator itself.
2023-07-26 08:36:20 +02:00
Timothy Flynn
c950f88611 LibUnicode: Stop generating Block property data
We started generating this data in commit 0505e03, but it was unused.
It's still not used, so let's remove it, rather than bloating the size
of libunicode.so with unused data. If we need it in the future, it's
trivial to add back.

Note we *have* always used the block name data from that commit, and
that is still present here.
2023-07-26 08:36:20 +02:00
Timothy Flynn
1393ed2000 AK+LibUnicode: Implement String::equals_ignoring_case without allocating
We currently fully casefold the left- and right-hand sides to compare
two strings with case-insensitivity. Now, we casefold one code point at
a time, storing the result in a view for comparison, until we exhaust
both strings.
2023-03-08 18:57:53 +00:00
Timothy Flynn
f8a0365002 LibUnicode: Detect ZWJ sequences when filtering by emoji presentation
This was preventing some unqualified emoji sequences from rendering
properly, such as the custom SerenityOS flag. We rendered the flag
correctly when given the fully qualified sequence:

    U+1F3F3 U+FEOF U+200D U+1F41E

But were not detecting the unqualified sequence as an emoji when also
filtering for emoji-presentation sequences:

    U+1F3F3 U+200D U+1F41E
2023-03-05 20:21:57 +01:00
Timothy Flynn
42c272c059 LibUnicode: Allow ignoring text presentation emoji in sequence detection
This adds an option to only detect emoji that should always present as
emoji. For example, the copyright symbol (unless followed by an emoji
presentation selector) should render as text.
2023-02-28 13:22:58 +00:00
Timothy Flynn
fa96811a22 LibUnicode: Skip over emoji sequences in grapheme boundary segmentation
Emoji sequences in the grapheme segmentation spec are a bit tricky:

    \p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic}

Our current strategy of tracking a boolean to indicate if we are in an
emoji sequence was causing us to break up emoji made of multiple sub-
sequences. For example, in the "family: man, woman, girl, boy" sequence:

    U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466

We would break at indices 0 (correctly) and 6 (incorrectly).

Instead of tracking a boolean, it's quite a bit simpler to reason about
emoji sequences by just skipping past them entirely. Note that in cases
like the above emoji, we skip one sub-sequence at a time.
2023-02-25 22:23:39 +01:00
Timothy Flynn
1484d3d9f5 LibUnicode: Add a method to check if a code point could start an emoji 2023-02-24 19:48:47 +01:00
Timothy Flynn
8c38d46c1a LibUnicode: Generate the path to emoji images alongside emoji data
This will provide for quicker emoji lookups, rather than having to
discover and allocate these paths at runtime before we find out if they
even exist.
2023-02-24 19:48:47 +01:00
Timothy Flynn
32a01a60e7 LibUnicode: Remove non-iterative text segmentation algorithms
They are now unused.
2023-02-16 11:18:53 +01:00
Timothy Flynn
6ce7ec2eb3 LibUnicode: Use iterative text segmentation algorithms for titlecasing 2023-02-16 11:18:53 +01:00
Timothy Flynn
5cbf054651 LibUnicode: Fix typos causing text segmentation on mid-word punctuation
For example the words "can't" and "32.3" should not have boundaries
detected on the "'" and "." code points, respectively.

The String test cases fixed here are because "b'ar" is now considered
one word.
2023-02-15 12:36:47 +01:00
Timothy Flynn
6e7a6e2d02 LibUnicode: Support finding the next/previous text segmentation boundary 2023-02-15 12:36:47 +01:00
Timothy Flynn
abe7786a81 LibUnicode: Allow iterating over text segmentation boundaries
This will be useful for e.g. finding the next boundary after a specific
index - we can just stop iterating once a condition is satisfied.
2023-02-15 12:36:47 +01:00
Timothy Flynn
dd4c47456e LibUnicode: Implement text segmentation algorithms for all UTF encodings
Similar to commit 6d710eeb43. Rather than
pick-and-chosing what to support, let's just support all encodings now,
as it is trivial. For example, LibGUI will want the UTF-32 overloads.
2023-02-15 12:36:47 +01:00
Timothy Flynn
2d487e4e4c LibUnicode+LibJS: Move text segmentation algorithms to their own files
These algorithms are quite chonky, and more APIs around them are to be
added, so let's move them to their own files for a bit of organization.
2023-02-15 12:36:47 +01:00
MacDue
63b11030f0 Everywhere: Use ReadonlySpan<T> instead of Span<T const> 2023-02-08 19:15:45 +00:00
Timothy Flynn
537fcaf59e AK+LibUnicode: Provide Unicode-aware caseless String matching
The Unicode spec defines much more complicated caseless matching
algorithms in its Collation spec. This implements the "basic" case
folding comparison.
2023-01-18 14:43:40 +00:00
Timothy Flynn
8f2589b3b0 LibUnicode: Parse and generate case folding code point data
Case folding rules have a similar mapping style as special casing rules,
where one code point may map to zero or more case folding rules. These
will be used for case-insensitive string comparisons. To see how case
folding can differ from other casing rules, consider "ß" (U+00DF):

    >>> "ß".lower()
    'ß'

    >>> "ß".upper()
    'SS'

    >>> "ß".title()
    'Ss'

    >>> "ß".casefold()
    'ss'
2023-01-18 14:43:40 +00:00
Timothy Flynn
8d9fb898d7 LibUnicode: Update out-of-date spec links
And remove links that aren't adding much value but will often get out of
date (i.e. links to UCD files, which are already all listed in
unicode_data.cmake).
2023-01-18 14:43:40 +00:00
Timothy Flynn
d6ddca0c0f AK+LibUnicode: Provide Unicode-aware String titlecase transformation 2023-01-16 18:33:44 -05:00
Timothy Flynn
bc51017a03 LibUnicode: Support full case folding for titlecasing a string
Unicode declares that to titlecase a string, the first cased code point
after each word boundary should be transformed to its titlecase mapping.
All other codepoints are transformed to their lowercase mapping.
2023-01-16 18:33:44 -05:00
Timothy Flynn
b562348d31 LibUnicode: Generate simple case folding mappings for titlecase
Note we already generate the special case foldings for titlecase.
2023-01-16 18:33:44 -05:00
Timothy Flynn
6d710eeb43 LibUnicode: Add an overload of word segmentation for UTF-8 strings 2023-01-16 18:33:44 -05:00
Timothy Flynn
58bc831750 LibUnicode: Return a String from Unicode normalization 2023-01-15 01:00:20 +00:00
Timothy Flynn
6fcc1c7426 AK+LibUnicode: Provide Unicode-aware String case transformations
Since AK can't refer to LibUnicode directly, the strategy here is that
if you need case transformations, you can link LibUnicode and receive
them. If you try to use either of these methods without linking it, then
you'll of course get a linker error (note we don't do any fallbacks to
e.g. ASCII case transformations). If you don't need these methods, you
don't have to link LibUnicode.
2023-01-09 19:23:46 -07:00
Timothy Flynn
12f6793223 LibUnicode: Move Unicode-aware case transformations to a helper file
These will be needed by AK::String as well, so move them to a helper
file where they can be re-used.
2023-01-09 19:23:46 -07:00
Timothy Flynn
3d22efccca LibUnicode+LibJS: Propagate OOM from Unicode normalization 2023-01-09 22:48:15 +00:00
Timothy Flynn
1ff29afc45 LibUnicode+LibJS+LibWeb: Propagate OOM from Unicode case transformations 2023-01-09 22:48:15 +00:00
Timothy Flynn
d382e77d38 LibUnicode: Fix compilation when the UCD download is disabled 2022-12-14 15:24:48 +00:00
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Gunnar Beutner
2d3567ee92 Meta+LibUnicode: Avoid relocations for static unicode data
Previously the s_decomposition_mappings variable would refer to other
data in s_decomposition_mappings_data. This would cause thousands of
avoidable relocations at load time.

This saves about 128kB RAM for each process which uses LibUnicode.
2022-11-06 17:34:06 +01:00
Tim Schumacher
ce2f1b845f Everywhere: Mark dependencies of most targets as PRIVATE
Otherwise, we end up propagating those dependencies into targets that
link against that library, which creates unnecessary link-time
dependencies.

Also included are changes to readd now missing dependencies to tools
that actually need them.
2022-11-01 14:49:09 +00:00
Andrew Kaster
b8e51425e9 Lagom+CMake: Propagate dependencies for generated custom targets
We have logic for serenity_generated_sources which works well for source
files that are specified in GENERATED_SOURCES prior to calling
serenity_lib or serenity_bin. However, code generated with
invoke_generator, and the LibWeb generators do not always follow the
pattern of the IDL and GML files.

For the LibWeb generators, we can just add_dependencies to LibWeb at the
time we declare the generate_Foo custom target. However for LibLocale,
LibTimeZone, and LibUnicode, we don't have the name of the target
available, so export the name in a variable to set into
GENERATED_SOURCES.

To make this work for Lagom, we need to make sure that lagom_lib and
serenity_bin in Lagom/CMakeLists.txt call serenity_generated_sources on
the target.

This enables the Xcode generator on macOS hosts, at least for Lagom.
2022-10-17 15:55:55 +02:00
matcool
104b51b912 LibUnicode: Fix Hangul syllable composition for specific cases
This fixes `combine_hangul_code_points` which would try to combine
a LVT syllable with a trailing consonant, resulting in a wrong
character.

Also added a test for this specific case.
2022-10-07 07:53:27 -04:00
Timothy Flynn
19b758ce8b LibUnicode: Add to-and-from string converters for NormalizationForm 2022-10-06 22:14:44 +01:00
matcool
70d0c1616f LibUnicode: Add decomposition mappings and Unicode normalization
The mappings are exposed via `Unicode::code_point_decomposition(u32)`
and `Unicode::code_point_decompositions()`, the latter being useful for
reverse searching a code point from its decomposition.

The normalization code does not make use of `Quick_Check` props (https://www.unicode.org/reports/tr44/#Decompositions_and_Normalization),
meaning no quick check optimizations.
2022-10-06 08:24:39 -04:00
Timothy Flynn
b7ef36aa36 LibUnicode: Parse and generate custom emoji added for SerenityOS
Parse emoji from emoji-serenity.txt to allow displaying their names and
grouping them together in the EmojiInputDialog.

This also adds an "Unknown" value to the EmojiGroup enum. This will be
useful for emoji that aren't found in the UCD, or for when UCD downloads
are disabled.
2022-09-11 20:33:57 +01:00
Timothy Flynn
b61eca0a1e LibUncode: Parse and generate emoji code point data
According to TR #51, the "best definition of the full set [of emojis] is
in the emoji-test.txt file". This defines not only the emoji themselves,
but the order in which they should be displayed, and what "group" of
emojis they belong to.
2022-09-08 23:12:31 +01:00
Timothy Flynn
9e860d973e LibLocale: Move locale source files to the LibLocale library
Everything is now setup to create the LibLocale library and link it
where needed.
2022-09-05 14:37:16 -04:00
Timothy Flynn
f082b6ae48 LibUnicode: Generate a separate Locale enumeration for special casing
The UCD only cares about a few locales for special casing rules (az, lt,
and tr). Unfortunately, LibUnicode cannot use LibLocale once the
libraries are separate because LibLocale will need to use LibUnicode for
many more things; thus there would be a circular dependency. Instead,
just generate the small enum needed for this one use case.
2022-09-05 14:37:16 -04:00
Timothy Flynn
43a3471298 LibLocale: Move locale source files to the LibLocale folder
These are still included in LibUnicode, but this updates their location
and the include paths of other files which include them.
2022-09-05 14:37:16 -04:00
Timothy Flynn
ff48220dca Userland: Move files destined for LibLocale to the Locale namespace 2022-09-05 14:37:16 -04:00
Timothy Flynn
6c7b05a0ff LibUnicode+LibJS: Move Unicode::get_available_currencies() to Locale.h
This is generated by GenerateLocaleData, which will soon be in the
Locale namespace. Move it out of CurrencyCode.h, as that will continue
to live in the Unicode namespace.
2022-09-05 14:37:16 -04:00
Timothy Flynn
1e0276f541 LibLocale+LibUnicode: Move generated CLDR data files to LibLocale folder
They are still included into LibUnicode, but this moves their generated
location to be under LibLocale.
2022-09-05 14:37:16 -04:00
Timothy Flynn
fc8bf7ac3e LibUnicode+Userland: Migrate generated CLDR data to LibLocaleData
Currently, LibUnicodeData contains the generated UCD and CLDR data. Move
the UCD data to the main LibUnicode library, and rename LibUnicodeData
to LibLocaleData. This is another prepatory change to migrate to
LibLocale.
2022-09-05 14:37:16 -04:00
Timothy Flynn
89d1813b5d LibUnicode: Move CLDR data generators to a LibLocale subfolder
To prepare for placing all CLDR generated data in a new library,
LibLocale, this moves the code generators for the CLDR data to the
LibLocale subfolder.
2022-09-05 14:37:16 -04:00
Timothy Flynn
e3e0602833 LibUnicode: Fully qualify use of AK::Variant in Locale.h
The generated locale data contains an enum also named Variant, as
variants are part of locale strings. This hasn't been an issue, but as
includes are reordered, the order in which the enum and AK::Variant are
included may cause an ambiguity error.
2022-09-05 14:37:16 -04:00
Timothy Flynn
6af9bf1a1e LibUnicode: Fix compilation when ENABLE_UNICODE_DATABASE_DOWNLOAD is OFF 2022-08-25 16:20:22 +01:00