Commit Graph

130 Commits

Author SHA1 Message Date
Sarah Hoffmann
a0ed80d821 restore the tokenizer directory when missing
Automatically repopulate the tokenizer/ directory with the PHP stub
and the postgresql module, when the directory is missing. This allows
to switch working directories and in particular run the service
from a different maschine then where it was installed.
Users still need to make sure that .env files are set up correctly
or they will shoot themselves in the foot.

See #2515.
2022-03-20 11:31:42 +01:00
Sarah Hoffmann
15beeef6ce do not expand records in select list
An expression of the form 'SELECT (func()).*' will be expanded
by Postgresql _before_ execution with the result that the function
will be called as many times as there are fields in the record.
This is not what we want. The function call needs to go into
the FROM clause instead.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
92bc3cd0a7 fix linting issue 2022-03-01 09:34:32 +01:00
Sarah Hoffmann
4a3bbd0319 adapt housenumber cleanup to new word table structure 2022-03-01 09:34:32 +01:00
Sarah Hoffmann
13ed184efd housenumber analyzer: avoid creating too many variants
Housenumber fields with lots of text are likely bad data. So is
data with many changes from letter to digit. Exclude them from adding
optional spaces.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
f03a05f6bb add new analyser for houenumbers
This analyser makes spaces optional.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
a6903651fc add framework for analysing housenumbers
This lays the groundwork for adding variants for housenumbers.
When analysis is enabled, then the 'word' field in the word table
is used as usual, so that variants can be created. There will be
only one analyser allowed which must have the fixed name
'@housenumber'.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
b8c544cc98 icu: move token deduplication into TokenInfo
Puts collection into one common place.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
243725aae1 icu: move housenumber token computation out of TokenInfo
This was the last function to use the cache. There is a more clean
separation of responsibility now.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
0bb59b2e22 handle unknown analyzer
When changing something in the default configuration of the sanatizers
that refers to an analyzer that is not yet loaded, there shouldn't be
any errors.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
837d44391c move generation of normalized token form to analyzer
This gives the analyzer more flexibility in choosing the normalized
form. In particular, an analyzer creating different variants can choose
the variant that will be used as the canonical form.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
f74228830d bdd: run full import on tests
This uncovered a couple of outdated/wrong tests which have been
fixed, too.
2022-02-24 14:27:51 +01:00
Sarah Hoffmann
a3e4e8e5cd delete unused country name tokens 2022-02-23 09:23:06 +01:00
Sarah Hoffmann
38c3ef3da0 add tests for get_string_list()
Renaming test file for sanitizer config because pytest requires
unique names for test files.
2022-02-07 11:22:24 +01:00
Sarah Hoffmann
610f2cc254 sanitizer: move helpers into a configuration class 2022-02-07 10:48:00 +01:00
Sarah Hoffmann
a79a3210e6 implement is-a-name option for housenumbers 2022-02-07 09:27:11 +01:00
Sarah Hoffmann
3ce123ab69 do not clean housenumbers in reverse-only mode 2022-01-20 20:21:13 +01:00
Sarah Hoffmann
d8b7a51ab6 add actual removal of housenumber tokens 2022-01-20 20:18:15 +01:00
Sarah Hoffmann
344a2bfc1a add new command for cleaning word tokens
Just pulls outdated housenumbers for the moment.
2022-01-20 20:05:15 +01:00
Sarah Hoffmann
1e5a8561c0 fix linting issues 2022-01-20 16:00:23 +01:00
Sarah Hoffmann
f3c9578bca complete documentation for new clean-houseunubmers sanatizer 2022-01-20 15:49:32 +01:00
Sarah Hoffmann
3741afa6dc generalize filter-kind parameter for sanatizers
Now behaves the same for tag_analyzer_by_language and
clean_housenumbers. Adds tests.
2022-01-20 15:42:42 +01:00
Sarah Hoffmann
4774e45218 clean_housenumbers: make kinds and delimiters configurable
Also adds unit tests for various options.
2022-01-20 12:07:12 +01:00
Sarah Hoffmann
206ee87188 factor out housenumber splitting into sanitizer 2022-01-19 17:27:50 +01:00
Sarah Hoffmann
3df560ea38 fix linting error 2022-01-18 11:09:21 +01:00
Sarah Hoffmann
adbaf700cd move parsing of mutation config to setup phase 2022-01-18 11:09:21 +01:00
Sarah Hoffmann
b453b0ea95 introduce mutation variants to generic token analyser
Mutations are regular-expression-based replacements that are applied
after variants have been computed. They are meant to be used for
variations on character level.

Add spelling variations for German umlauts.
2022-01-18 11:09:21 +01:00
Sarah Hoffmann
0192a7af96 move variant configuration reading in separate file 2022-01-18 11:09:21 +01:00
Sarah Hoffmann
630ad38a67 refactor variant production to use generators 2022-01-18 11:09:21 +01:00
Sarah Hoffmann
c3788d765e add consistent SPDX copyright headers 2022-01-03 16:23:58 +01:00
Sarah Hoffmann
f9b56a8581 correctly match abbreviated addr:street
This only works when addr:street is abbreviated and the street
name isn't. It does not work the other way around.
2021-12-08 21:58:43 +01:00
Sarah Hoffmann
7f7d2fd5b3 skip most addr: tags with suffixes
Only one addr: tag can be processed currently, so make
sure it is the one without suffixes to not get odd data.
addr:street is the exception because it uses a different
matching mechanism.
2021-12-06 14:55:10 +01:00
Sarah Hoffmann
44cfce1ca4 revert to using full names for street name matching
Using partial names turned out to not work well because there are
often similarly named streets next to each other. It also
prevents us from being able to take into account all addr:street:*
tags.

This change gets all the full term tokens for the addr:street tags
from the DB. As they are used for matching only, we can assume that
the term must already be there or there will be no match. This
avoid creating unused full name tags.
2021-12-06 11:38:38 +01:00
Sarah Hoffmann
7beccb7997 remove unnecessary pass statements 2021-12-02 15:54:24 +01:00
Sarah Hoffmann
14a78f55cd more unit tests for tokenizers 2021-12-02 15:46:36 +01:00
Sarah Hoffmann
37eeccbf4c ICU: use normalization from config in PHP
The TERM_NORMALIZATION config option is no longer applicable.
That was already documented but not yet implemented.
2021-10-27 11:32:44 +02:00
Sarah Hoffmann
53dbe58ada do not count words when in reverse-only mode 2021-10-26 12:00:13 +02:00
Sarah Hoffmann
85797acf1e ICU: add an index over word_ids
Needed for keyword lookup in the details response.
2021-10-25 21:33:27 +02:00
Sarah Hoffmann
ec7184c533 icu: no longer precompute terms
The ICU analyzer no longer drops frequent partials, so it is no
longer necessary to know the frequencies in advance.
2021-10-19 11:52:28 +02:00
Sarah Hoffmann
e8e2502e2f make word recount a tokenizer-specific function 2021-10-19 11:21:16 +02:00
Sarah Hoffmann
6c79a60e19 add documentation for new configuration of ICU tokenizer 2021-10-07 11:55:53 +02:00
Sarah Hoffmann
2a94bfc703 fix argument description for check_database 2021-10-07 09:49:13 +02:00
Sarah Hoffmann
299934fd2a reorganize and complete tests around generic token analysis 2021-10-06 17:03:37 +02:00
Sarah Hoffmann
b18d042832 add tests for sanitizer tagging language 2021-10-06 12:29:25 +02:00
Sarah Hoffmann
97a10ec218 apply variants by languages
Adds a tagger for names by language so that the analyzer of that
language is used. Thus variants are now only applied to names
in the specific language and only tag name tags, no longer to
reference-like tags.
2021-10-06 11:09:54 +02:00
Sarah Hoffmann
d35400a7d7 use analyser provided in the 'analyzer' property
Implements per-name choice of analyzer. If a non-default
analyzer is choosen, then the 'word' identifier is extended
with the name of the ana;yzer, so that we still have unique
items.
2021-10-05 14:10:32 +02:00
Sarah Hoffmann
92f6ec2328 remove support for properties on variants
Those are not going to be used in the near future, so no need to
carry that code around just now.
2021-10-05 10:29:36 +02:00
Sarah Hoffmann
9ba2019470 precompute replacements while loading configuration 2021-10-05 10:20:08 +02:00
Sarah Hoffmann
c171d88194 move parsing of token analysis config to analyzer
Adds a second callback for the analyzer which is responsible
for parsing the configuration rules and converting it to
whatever format necessary. This way, each analyzer implementation
can define its own configuration rules.
2021-10-04 18:31:58 +02:00
Sarah Hoffmann
7cfcbacfc7 make token analyzers configurable modules
Adds a mandatory section 'analyzer' to the token-analysis entries
which define, which analyser to use. Currently there is exactly
one, generic, which implements the former ICUNameProcessor.
2021-10-04 17:37:34 +02:00