Commit Graph

6 Commits

Author SHA1 Message Date
Timothy Flynn
139c575cc9 LibUnicode: Update to Unicode version 15.1.0
https://unicode.org/versions/Unicode15.1.0/

This update adds a new code point property, Indic Conjunct Break, whose
values include Consonant, Linker, and Extend. The property is used in
text segmentation to prevent breaks inside certain extended grapheme
cluster sequences.
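
A rough sketch of how such a property can suppress a break, assuming the rule takes the form "Consonant, then any run of Extend or Linker containing at least one Linker, then no break before the next Consonant". The `incb` lookup and `no_break_before` helper are hypothetical and cover only a hand-picked Devanagari subset; this is not LibUnicode's generated data or API.

```python
def incb(cp):
    """Hypothetical, hand-picked subset of Indic Conjunct Break values."""
    if 0x0915 <= cp <= 0x0939:   # Devanagari consonants KA..HA
        return "Consonant"
    if cp == 0x094D:             # Devanagari sign virama
        return "Linker"
    if cp == 0x093C:             # Devanagari sign nukta
        return "Extend"
    return "None"


def no_break_before(cps, i):
    """Return True if the conjunct rule forbids a break before cps[i]."""
    if i == 0 or incb(cps[i]) != "Consonant":
        return False
    j = i - 1
    seen_linker = False
    # Walk back over the run of Extend/Linker code points.
    while j >= 0 and incb(cps[j]) in ("Linker", "Extend"):
        if incb(cps[j]) == "Linker":
            seen_linker = True
        j -= 1
    # The run must contain a Linker and be preceded by a Consonant.
    return seen_linker and j >= 0 and incb(cps[j]) == "Consonant"
```

With this sketch, KA + virama + SSA (U+0915 U+094D U+0937) stays in one cluster, while KA + SSA without a linker does not trigger the rule.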
2023-09-15 18:30:26 +02:00
Timothy Flynn
fa96811a22 LibUnicode: Skip over emoji sequences in grapheme boundary segmentation
Emoji sequences in the grapheme segmentation spec are a bit tricky:

    \p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic}

Our current strategy of tracking a boolean to indicate if we are in an
emoji sequence was causing us to break up emoji made of multiple sub-
sequences. For example, in the "family: man, woman, girl, boy" sequence:

    U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466

We would break at indices 0 (correctly) and 6 (incorrectly).

Instead of tracking a boolean, it's quite a bit simpler to reason about
emoji sequences by just skipping past them entirely. Note that in cases
like the above emoji, we skip one sub-sequence at a time.
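
The skipping strategy described above can be sketched roughly as follows. This is a simplified illustration, not the LibUnicode implementation: the property checks are crude stand-ins (a single code point range for Extended_Pictographic, only U+FE0F for Extend), and every function name here is hypothetical.

```python
ZWJ = 0x200D


def is_extended_pictographic(cp):
    # Assumption: a crude range standing in for \p{Extended_Pictographic}.
    return 0x1F300 <= cp <= 0x1FAFF


def is_extend(cp):
    # Assumption: only variation selector-16 is treated as Extend.
    return cp == 0xFE0F


def skip_emoji_sequence(cps, i):
    """cps[i] is pictographic; return the index just past the whole sequence."""
    end = i + 1
    while True:
        k = end
        while k < len(cps) and is_extend(cps[k]):
            k += 1
        # Skip one "Extend* ZWJ Extended_Pictographic" sub-sequence at a time.
        if k + 1 < len(cps) and cps[k] == ZWJ and is_extended_pictographic(cps[k + 1]):
            end = k + 2
        else:
            return end


def grapheme_boundaries(cps):
    """Return boundary indices for a list of code points (simplified rules)."""
    boundaries = [0]
    i = 0
    while i < len(cps):
        end = skip_emoji_sequence(cps, i) if is_extended_pictographic(cps[i]) else i + 1
        # Never break before a trailing Extend or ZWJ.
        while end < len(cps) and (is_extend(cps[end]) or cps[end] == ZWJ):
            end += 1
        boundaries.append(end)
        i = end
    return boundaries


family = [0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F467, 0x200D, 0x1F466]
print(grapheme_boundaries(family))  # [0, 7]
```

In this sketch the only boundaries reported for the family emoji are the start and the end of the seven code points; no boundary is reported at index 6.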
2023-02-25 22:23:39 +01:00
Timothy Flynn
5cbf054651 LibUnicode: Fix typos causing text segmentation on mid-word punctuation
For example, the words "can't" and "32.3" should not have boundaries
detected at the "'" and "." code points, respectively.

The String test cases fixed here are because "b'ar" is now considered
one word.
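
To illustrate the intended behavior, here is a minimal sketch of the mid-word punctuation rules, assuming "'" and "." suppress a break only when they sit between two letters or between two digits. `str.isalpha` and `str.isdigit` are rough stand-ins for the real Word_Break property classes, and `word_boundaries` is a hypothetical helper, not the LibUnicode API.

```python
# Assumption: only these two code points are treated as mid-word punctuation.
MID_PUNCTUATION = {"'", "."}


def word_boundaries(text):
    """Return boundary indices for text under simplified word-break rules."""
    boundaries = [0]
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        prev2 = text[i - 2] if i >= 2 else ""
        # Letters and digits stick together in any combination.
        if (prev.isalpha() or prev.isdigit()) and (cur.isalpha() or cur.isdigit()):
            continue
        # No break before punctuation that sits between two letters or two digits.
        if cur in MID_PUNCTUATION and (
            (prev.isalpha() and nxt.isalpha()) or (prev.isdigit() and nxt.isdigit())
        ):
            continue
        # No break after such punctuation either.
        if prev in MID_PUNCTUATION and (
            (prev2.isalpha() and cur.isalpha()) or (prev2.isdigit() and cur.isdigit())
        ):
            continue
        boundaries.append(i)
    if text:
        boundaries.append(len(text))
    return boundaries


print(word_boundaries("can't"))  # [0, 5]
print(word_boundaries("32.3"))   # [0, 4]
```

A trailing apostrophe, as in "can'", still produces a boundary in this sketch because no letter follows it.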
2023-02-15 12:36:47 +01:00
Timothy Flynn
abe7786a81 LibUnicode: Allow iterating over text segmentation boundaries
This will be useful for e.g. finding the next boundary after a specific
index - we can just stop iterating once a condition is satisfied.
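
A small sketch of why iteration helps here, assuming boundaries are produced lazily so the consumer can stop early. The boundary rule below (a break wherever whitespace and non-whitespace meet) is a deliberately trivial placeholder, and both function names are hypothetical.

```python
def iter_boundaries(text):
    """Lazily yield boundary indices (placeholder rule: whitespace transitions)."""
    if not text:
        return
    yield 0
    for i in range(1, len(text)):
        if text[i - 1].isspace() != text[i].isspace():
            yield i
    yield len(text)


def next_boundary_after(text, index):
    """Return the first boundary strictly after index, or None if there is none."""
    for boundary in iter_boundaries(text):
        if boundary > index:
            return boundary  # Stop iterating; later boundaries are never computed.
    return None


print(next_boundary_after("hello world", 2))  # 5
```

Because the generator is consumed only until the condition holds, the rest of the text is not scanned.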
2023-02-15 12:36:47 +01:00
Timothy Flynn
dd4c47456e LibUnicode: Implement text segmentation algorithms for all UTF encodings
Similar to commit 6d710eeb43. Rather than
picking and choosing what to support, let's just support all encodings now,
as it is trivial. For example, LibGUI will want the UTF-32 overloads.
2023-02-15 12:36:47 +01:00
Timothy Flynn
2d487e4e4c LibUnicode+LibJS: Move text segmentation algorithms to their own files
These algorithms are quite chonky, and more APIs around them are to be
added, so let's move them to their own files for a bit of organization.
2023-02-15 12:36:47 +01:00