mirror of
https://github.com/osm-search/Nominatim.git
synced 2024-11-22 21:28:10 +03:00
ICU: better letter identification in normalization
The Letter class does not include non-spacing marks that can also have a consonant or vowel meaning, especially in Indian languages. Use the alnum property instead, which includes them all. Also include the vowel-canceling Virama, which is not a letter by itself but changes the transliteration.
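The effect described above can be illustrated with a rough sketch in Python. This is not the ICU rule itself: it uses the standard library's `unicodedata` module, and `keep_letters_numbers_spaces` and `is_virama` are hypothetical helpers that only approximate the old filter `[^[:Letter:] [:Number:] [:Space:]] >` and the `Canonical_Combining_Class=Virama` test (canonical combining class 9). The ICU `alnum` class is not reproduced here.

```python
import unicodedata


def keep_letters_numbers_spaces(text: str) -> str:
    """Approximate the OLD rule: drop everything whose general category
    is not a Letter (L*) or Number (N*) and that is not whitespace."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] in ("L", "N") or ch.isspace()
    )


def is_virama(ch: str) -> bool:
    """Approximate [:Canonical_Combining_Class=Virama:], which is
    canonical combining class 9 in the Unicode Character Database."""
    return unicodedata.combining(ch) == 9


# The Devanagari word for "Hindi", written with escapes to stay encoding-safe:
# HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

for ch in word:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        f"ccc={unicodedata.combining(ch)}",
        unicodedata.name(ch),
    )

# Under the approximated old rule, the vowel signs (category Mc) and the
# virama (category Mn) are removed, leaving only the three base consonants.
stripped = keep_letters_numbers_spaces(word)
print([f"U+{ord(ch):04X}" for ch in stripped])
```

The dependent vowel signs and the virama carry meaning for transliteration, yet none of them has a Letter category, which is why a filter based on the Letter class alone loses them.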
parent de828b723e
commit 63dc4b39bc
@@ -8,8 +8,8 @@ normalization:
  - "ª > a"
  - "º > o"
  - "[[:Punctuation:][:Symbol:]\u02bc] > ' '"
- - "ß > 'ss'" # German szet is unimbigiously equal to double ss
- - "[^[:Letter:] [:Number:] [:Space:]] >"
+ - "ß > 'ss'" # German szet is unambiguously equal to double ss
+ - "[^[:alnum:] [:Canonical_Combining_Class=Virama:] [:Space:]] >"
  - "[:Lm:] >"
  - ":: [[:Number:]] Latin ()"
  - ":: [[:Number:]] Ascii ();"