Automatically repopulate the tokenizer/ directory with the PHP stub
and the postgresql module, when the directory is missing. This allows
to switch working directories and in particular run the service
from a different maschine then where it was installed.
Users still need to make sure that .env files are set up correctly
or they will shoot themselves in the foot.
See #2515.
This gives the analyzer more flexibility in choosing the normalized
form. In particular, an analyzer creating different variants can choose
the variant that will be used as the canonical form.
Mutations are regular-expression-based replacements that are applied
after variants have been computed. They are meant to be used for
variations on character level.
Add spelling variations for German umlauts.
Only one addr: tag can be processed currently, so make
sure it is the one without suffixes to not get odd data.
addr:street is the exception because it uses a different
matching mechanism.
Using partial names turned out to not work well because there are
often similarly named streets next to each other. It also
prevents us from being able to take into account all addr:street:*
tags.
This change gets all the full term tokens for the addr:street tags
from the DB. As they are used for matching only, we can assume that
the term must already be there or there will be no match. This
avoid creating unused full name tags.
Adds a tagger for names by language so that the analyzer of that
language is used. Thus variants are now only applied to names
in the specific language and only tag name tags, no longer to
reference-like tags.
Adds a mandatory section 'analyzer' to the token-analysis entries
which define, which analyser to use. Currently there is exactly
one, generic, which implements the former ICUNameProcessor.