Sandor Nagy
7e3701b64a
Fix typo in log message on replication initialisation
2022-03-15 07:50:47 +01:00
Sarah Hoffmann
15beeef6ce
do not expand records in select list
...
An expression of the form 'SELECT (func()).*' will be expanded
by Postgresql _before_ execution with the result that the function
will be called as many times as there are fields in the record.
This is not what we want. The function call needs to go into
the FROM clause instead.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
92bc3cd0a7
fix linting issue
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
4a3bbd0319
adapt housenumber cleanup to new word table structure
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
13ed184efd
housenumber analyzer: avoid creating too many variants
...
Housenumber fields with lots of text are likely bad data. So is
data with many changes from letter to digit. Exclude them from adding
optional spaces.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
f03a05f6bb
add new analyser for houenumbers
...
This analyser makes spaces optional.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
a6903651fc
add framework for analysing housenumbers
...
This lays the groundwork for adding variants for housenumbers.
When analysis is enabled, then the 'word' field in the word table
is used as usual, so that variants can be created. There will be
only one analyser allowed which must have the fixed name
'@housenumber'.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
b8c544cc98
icu: move token deduplication into TokenInfo
...
Puts collection into one common place.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
243725aae1
icu: move housenumber token computation out of TokenInfo
...
This was the last function to use the cache. There is a more clean
separation of responsibility now.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
0bb59b2e22
handle unknown analyzer
...
When changing something in the default configuration of the sanatizers
that refers to an analyzer that is not yet loaded, there shouldn't be
any errors.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
837d44391c
move generation of normalized token form to analyzer
...
This gives the analyzer more flexibility in choosing the normalized
form. In particular, an analyzer creating different variants can choose
the variant that will be used as the canonical form.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
5425394654
add migration to add new derived_names column
2022-02-24 20:50:33 +01:00
Sarah Hoffmann
f74228830d
bdd: run full import on tests
...
This uncovered a couple of outdated/wrong tests which have been
fixed, too.
2022-02-24 14:27:51 +01:00
Sarah Hoffmann
a3e4e8e5cd
delete unused country name tokens
2022-02-23 09:23:06 +01:00
Sarah Hoffmann
38c3ef3da0
add tests for get_string_list()
...
Renaming test file for sanitizer config because pytest requires
unique names for test files.
2022-02-07 11:22:24 +01:00
Sarah Hoffmann
610f2cc254
sanitizer: move helpers into a configuration class
2022-02-07 10:48:00 +01:00
Sarah Hoffmann
a79a3210e6
implement is-a-name option for housenumbers
2022-02-07 09:27:11 +01:00
Sarah Hoffmann
98432395c3
add migration for upcoming change to tiger tables
2022-01-27 11:48:27 +01:00
Sarah Hoffmann
83d2c440d5
add migration for new interpolation table layout
2022-01-27 11:14:55 +01:00
Sarah Hoffmann
e6d855b954
add migration for new lookup index
2022-01-27 11:14:55 +01:00
Sarah Hoffmann
c170d323d9
add tests for cleaning housenumbers
2022-01-20 23:47:20 +01:00
Sarah Hoffmann
3ce123ab69
do not clean housenumbers in reverse-only mode
2022-01-20 20:21:13 +01:00
Sarah Hoffmann
d8b7a51ab6
add actual removal of housenumber tokens
2022-01-20 20:18:15 +01:00
Sarah Hoffmann
344a2bfc1a
add new command for cleaning word tokens
...
Just pulls outdated housenumbers for the moment.
2022-01-20 20:05:15 +01:00
Sarah Hoffmann
1e5a8561c0
fix linting issues
2022-01-20 16:00:23 +01:00
Sarah Hoffmann
f3c9578bca
complete documentation for new clean-houseunubmers sanatizer
2022-01-20 15:49:32 +01:00
Sarah Hoffmann
3741afa6dc
generalize filter-kind parameter for sanatizers
...
Now behaves the same for tag_analyzer_by_language and
clean_housenumbers. Adds tests.
2022-01-20 15:42:42 +01:00
Sarah Hoffmann
4774e45218
clean_housenumbers: make kinds and delimiters configurable
...
Also adds unit tests for various options.
2022-01-20 12:07:12 +01:00
Sarah Hoffmann
206ee87188
factor out housenumber splitting into sanitizer
2022-01-19 17:27:50 +01:00
Sarah Hoffmann
3df560ea38
fix linting error
2022-01-18 11:09:21 +01:00
Sarah Hoffmann
adbaf700cd
move parsing of mutation config to setup phase
2022-01-18 11:09:21 +01:00
Sarah Hoffmann
b453b0ea95
introduce mutation variants to generic token analyser
...
Mutations are regular-expression-based replacements that are applied
after variants have been computed. They are meant to be used for
variations on character level.
Add spelling variations for German umlauts.
2022-01-18 11:09:21 +01:00
Sarah Hoffmann
0192a7af96
move variant configuration reading in separate file
2022-01-18 11:09:21 +01:00
Sarah Hoffmann
630ad38a67
refactor variant production to use generators
2022-01-18 11:09:21 +01:00
Sarah Hoffmann
c3788d765e
add consistent SPDX copyright headers
2022-01-03 16:23:58 +01:00
Sarah Hoffmann
f9b56a8581
correctly match abbreviated addr:street
...
This only works when addr:street is abbreviated and the street
name isn't. It does not work the other way around.
2021-12-08 21:58:43 +01:00
Sarah Hoffmann
7f7d2fd5b3
skip most addr: tags with suffixes
...
Only one addr: tag can be processed currently, so make
sure it is the one without suffixes to not get odd data.
addr:street is the exception because it uses a different
matching mechanism.
2021-12-06 14:55:10 +01:00
Sarah Hoffmann
44cfce1ca4
revert to using full names for street name matching
...
Using partial names turned out to not work well because there are
often similarly named streets next to each other. It also
prevents us from being able to take into account all addr:street:*
tags.
This change gets all the full term tokens for the addr:street tags
from the DB. As they are used for matching only, we can assume that
the term must already be there or there will be no match. This
avoid creating unused full name tags.
2021-12-06 11:38:38 +01:00
Sarah Hoffmann
54d35ddfe9
split cli tests by subcommand and extend coverage
2021-12-02 23:45:48 +01:00
Sarah Hoffmann
7beccb7997
remove unnecessary pass statements
2021-12-02 15:54:24 +01:00
Sarah Hoffmann
14a78f55cd
more unit tests for tokenizers
2021-12-02 15:46:36 +01:00
Sarah Hoffmann
a52ed366e4
add tests for migration
2021-12-01 20:27:40 +01:00
Sarah Hoffmann
810056349f
add migration for inclusive housenumber Tiger index
2021-11-24 12:03:20 +01:00
Sarah Hoffmann
10e979e841
only instantiate indexer once for replication
...
Also makes sure that indexer object exists everywhere were needed.
See #2518 .
2021-11-19 14:48:58 +01:00
Sarah Hoffmann
345c812e43
better error reporting when API script does not exist
...
Check if the API script exists on the expected location before
running php-cli. This way we can add a useful hint about the
project directory.
Fixes #2513 .
2021-11-10 11:58:20 +01:00
Sarah Hoffmann
d479a0585d
prepare release 4.0.0
2021-11-02 20:27:55 +01:00
Sarah Hoffmann
37eeccbf4c
ICU: use normalization from config in PHP
...
The TERM_NORMALIZATION config option is no longer applicable.
That was already documented but not yet implemented.
2021-10-27 11:32:44 +02:00
Sarah Hoffmann
53dbe58ada
do not count words when in reverse-only mode
2021-10-26 12:00:13 +02:00
Sarah Hoffmann
2c4b798f9b
further refactor setup to keep function small
2021-10-26 12:00:13 +02:00
Sarah Hoffmann
9934421442
make word count computation part of the import
...
Accurate word counts are now essential when using
the ICU tokenizer and don't hurt for the legacy one.
Adds about an hour import time.
2021-10-26 12:00:13 +02:00