ICU: better letter identification in normalization

The Letter class does not include non-spacing marks that can also have a consonant or vowel meaning, especially in Indian languages. Use the alnum property instead, which includes them all. Also include the vowel-canceling Virama, which is not a letter by itself but changes the transliteration.
2024-11-22 21:28:10 +03:00 · 2022-04-28 17:20:56 +02:00 · 2022-04-28 17:20:56 +02:00 · 63dc4b39bc
commit 63dc4b39bc
parent de828b723e
1 changed file with 2 additions and 2 deletions
--- a/settings/icu_tokenizer.yaml
+++ b/settings/icu_tokenizer.yaml
@@ -8,8 +8,8 @@ normalization:
    - "ª > a"
    - "º > o"
    - "[[:Punctuation:][:Symbol:]\u02bc]  > ' '"
-    - "ß > 'ss'" # German szet is unimbigiously equal to double ss
+    - "ß > 'ss'" # German szet is unambiguously equal to double ss
-    - "[^[:Letter:] [:Number:] [:Space:]] >"
+    - "[^[:alnum:] [:Canonical_Combining_Class=Virama:] [:Space:]] >"
    - "[:Lm:] >"
    - ":: [[:Number:]] Latin ()"
    - ":: [[:Number:]] Ascii ();"
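
To illustrate the problem described in the commit message, here is a minimal Python sketch. It does not run ICU. It only approximates the old `[^[:Letter:] [:Number:] [:Space:]] >` rule using the general categories from the standard `unicodedata` module, and the helper names (`kept_by_old_rule`, `strip_like_old_rule`, `is_virama`) are hypothetical. The standard library does not expose the property behind ICU's `[:alnum:]`, so the new rule is not reproduced here. The sketch only shows which characters the old rule dropped.

```python
import unicodedata


def kept_by_old_rule(ch: str) -> bool:
    """Approximate membership in [[:Letter:] [:Number:] [:Space:]].

    Assumption: general categories L* and N* plus Python's isspace()
    stand in for the ICU classes; edge cases may differ from ICU.
    """
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat.startswith("N") or ch.isspace()


def strip_like_old_rule(text: str) -> str:
    """Drop every character the approximated old rule would delete."""
    return "".join(ch for ch in text if kept_by_old_rule(ch))


def is_virama(ch: str) -> bool:
    """True if the canonical combining class is 9 (Virama)."""
    return unicodedata.combining(ch) == 9


if __name__ == "__main__":
    # The word "Hindi" in Devanagari, written with escapes for clarity.
    word = "\u0939\u093f\u0928\u094d\u0926\u0940"
    for ch in word:
        print(
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            unicodedata.name(ch),
            "virama" if is_virama(ch) else "",
        )
    # Only the three base consonants survive; vowel signs and virama are lost.
    print(strip_like_old_rule(word))
```

The vowel signs in this word have category `Mc` and the virama has category `Mn`, so a filter based on the Letter class removes them and leaves only the bare consonants. That is the loss the changed rule is meant to avoid.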