Commit Graph

3360 Commits

Author SHA1 Message Date
Sarah Hoffmann
7cfcbacfc7 make token analyzers configurable modules
Adds a mandatory section 'analyzer' to the token-analysis entries
which define, which analyser to use. Currently there is exactly
one, generic, which implements the former ICUNameProcessor.
2021-10-04 17:37:34 +02:00
Sarah Hoffmann
52847b61a3 extend ICU config to accomodate multiple analysers
Adds parsing of multiple variant lists from the configuration.
Every entry except one must have a unique 'id' paramter to
distinguish the entries. The entry without id is considered
the default. Currently only the list without an id is used
for analysis.
2021-10-04 16:40:28 +02:00
Sarah Hoffmann
5a36559834 move flatten_config_list into config module
For general usage by other modules.
2021-10-04 11:56:54 +02:00
Sarah Hoffmann
19d4e047f6
Merge pull request #2458 from lonvia/add-tokenizer-preprocessing
Add a "sanitation" step for name and address tags before token processing
2021-10-01 21:53:34 +02:00
Sarah Hoffmann
6b348d43c6 replace test variable for PG env tests
'tty' was removed in PG14 and causes an error.
2021-10-01 12:27:24 +02:00
Sarah Hoffmann
732cd27d2e add unit tests for new sanatizer functions 2021-10-01 12:27:24 +02:00
Sarah Hoffmann
8171fe4571 introduce sanitizer step before token analysis
Sanatizer functions allow to transform name and address tags before
they are handed to the tokenizer. Theses transformations are visible
only for the tokenizer and thus only have an influence on the
search terms and address match terms for a place.

Currently two sanitizers are implemented which are responsible for
splitting names with multiple values and removing bracket additions.
Both was previously hard-coded in the tokenizer.
2021-10-01 12:27:24 +02:00
Sarah Hoffmann
16daa57e47 unify ICUNameProcessorRules and ICURuleLoader
There is no need for the additional layer of indirection that
the ICUNameProcessorRules class adds. The ICURuleLoader can
fill the database properties directly.
2021-10-01 12:27:24 +02:00
Sarah Hoffmann
5e5addcdbf fix typo 2021-09-29 14:16:09 +02:00
Sarah Hoffmann
be65c8303f export more data for the tokenizer name preparation
Adds class, type, country and rank to the exported information
and removes the rather odd hack for countries. Whether a place
represents a country boundary can now be computed by the tokenizer.
2021-09-29 11:54:14 +02:00
Sarah Hoffmann
231250f2eb add wrapper class for place data passed to tokenizer
This is mostly for convenience and documentation purposes.
2021-09-29 11:54:07 +02:00
Sarah Hoffmann
d44a428b74
Merge pull request #2455 from lonvia/adjust-address-levels-slovakia
Adjust address levels for boundaries in Slovakia
2021-09-28 11:21:08 +02:00
Sarah Hoffmann
40f9d52ad8
Merge pull request #2454 from lonvia/sort-out-token-assignment-in-sql
ICU tokenizer: switch match method to using partial terms
2021-09-28 09:45:15 +02:00
Sarah Hoffmann
7f3b05c179 adjust address levels for boundaries in Slovakia
Levels choosen according to OSM wiki. Mainly moves admin_level 6
to county level and admin_level 8 to city/town level. Higher
levels are adjusted accordingly.

Fixes #2453.
2021-09-27 23:32:11 +02:00
Sarah Hoffmann
09c9fad6c3 adapt tests to new ICU address token handling 2021-09-27 17:36:23 +02:00
Sarah Hoffmann
bb18479d5b remove unused parameter 2021-09-27 14:58:43 +02:00
Sarah Hoffmann
779ea8ac62
Merge pull request #2452 from lonvia/update-houses-on-street-name-change
Force update of surrounding houses when street or place name changes
2021-09-27 14:55:50 +02:00
Sarah Hoffmann
bd7c7ddad0 icu tokenizer: switch to matching against partial names
When matching address parts from addr:* tags against place names,
the address names where so far converted to full names and compared
those to the place names. This can become problematic with the new
ICU tokenizer once we introduce creation of different variants
depending on the place name context. It wouldn't be clear which
variant to produce to get a match, so we would have to create all of
them. To work around this issue, switch to using the partial terms
for matching. This introduces a larger fuzziness between matches but
that shouldn't be a problem because matching is always geographically
restricted.

The search terms created for address parts have a different problem:
they are already created before we even know if they are going to be
used. This can lead to spurious entries in the word table, which slows
down searching. This problem can also be circumvented by using only
partial terms for the search terms. In terms of searching that means
that the address terms would not get the full-word boost, but given
that the case where an address part does not exist as an OSM object
should be the exception, this is likely acceptable.
2021-09-27 11:36:19 +02:00
Sarah Hoffmann
c6fdcf9b0d adapt documentation for SQL tokenizer interface 2021-09-27 11:36:19 +02:00
Sarah Hoffmann
59fe74ddf6 move name matching into tokenizer module
Instead of requesting the match tokens from the tokenizer
when looking for parent streets/places and address parts,
hand in the saved tokens and ask if they match. This gives
the tokenizer more freedom to decide how name matching
should be done.
2021-09-27 11:36:19 +02:00
Sarah Hoffmann
6d7c067461 force update on rank30 children when place name changes
Name changes may have an effect on parenting. Don't update
surrounding rank30 objects with addr:place tags as this is
potentially too expensive.
2021-09-27 11:04:17 +02:00
Sarah Hoffmann
316205e455 force update of surrounding houses when street name changes
When the street changes its name then this may cause changes
in the parenting of rank-30 objects with an addr:street
tag.

Fixes #2242.
2021-09-27 10:22:41 +02:00
Sarah Hoffmann
d562f11298 slightly increase radius to look for postcodes 2021-09-24 23:56:42 +02:00
Sarah Hoffmann
972628c751
Merge pull request #2449 from lonvia/address-ranking-spain
Adjust address ranks for Spain
2021-09-24 22:48:21 +02:00
Sarah Hoffmann
09b1db63f4 adjust address ranks for Spain
Adjusts levels for boundaries according to the list on
https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative

* no admin_level 5, so drop that from addresses
* admin_level 6 has the province
* admin_level 7 has the county when it exists

Also reranks place=province so that it matches up with
admin_level 6 and introduces place=civil_parish which
is used as a place node for some admin_level=9 boundaries
in Galicia.
2021-09-24 18:39:44 +02:00
Sarah Hoffmann
e9d54f752c
Merge pull request #2447 from lonvia/fix-dynamic-address-assignment
Fix dynamic assignment of address parts
2021-09-19 15:57:28 +02:00
Sarah Hoffmann
c335025167 CI: install locale for CentOS 2021-09-19 13:49:11 +02:00
Sarah Hoffmann
2b2109c89a
Remove the installation warning
Installation has become a lot easier.
2021-09-19 13:01:32 +02:00
Sarah Hoffmann
56124546a6 fix dynamic assignment of address parts
A boolean check for dynamic changes of address parts is not
sufficient. The order of choice should be:

 1. an addr:* part matches the name
 2. the address part surrounds the object
 3. the address part was declared as isaddress

The implementation uses a slightly different ordering
to avoid geometry checks unless strictly necessary (isaddress
is false and no matching address).

See #2446.
2021-09-19 12:34:39 +02:00
Sarah Hoffmann
336258ecf8
Merge pull request #2440 from lonvia/generic-config-loader
Add generic loader for YAML configuration files
2021-09-04 17:41:15 +02:00
Sarah Hoffmann
b894d2c04a fix indent 2021-09-04 10:30:35 +02:00
Sarah Hoffmann
8e1d4818ac use yaml config loader for country info 2021-09-04 00:22:55 +02:00
Sarah Hoffmann
28c98584c1 add tests for generic YAML config reader 2021-09-03 22:31:30 +02:00
Sarah Hoffmann
1c42780bb5 introduce generic YAML config loader
Adds a function to the Configuration class to load a YAML
file. This means that searching for the file is generalised
and works the same now for all configuration files. Changes
the search logic, so that it is always possible to have a
custom version of the configuration file in the project
directory.

Move ICU tokenizer to use new load function.
2021-09-03 18:20:07 +02:00
Sarah Hoffmann
18554dfed7
Merge pull request #2437 from lonvia/tweak-ranking-searches
Some more tweaks for search interpretation
2021-09-03 14:16:23 +02:00
Sarah Hoffmann
2e493fec46
Merge pull request #2436 from lonvia/country-configuration
Move configuration of default languages into a configuration file
2021-09-03 08:55:36 +02:00
Sarah Hoffmann
98c2e08add reduce penalty for special searches by name
Additional penalty for special terms with operator None
should only go to near searches. To reduce the number
of produced searches, restrict the none operator to
appear only in conjunction with the name.
2021-09-03 08:50:38 +02:00
Sarah Hoffmann
94d3dee369 further increase penalty on housenumbers without numbers
Make the penality dependent on the length of the token:
no penalty for one letter house numbers and increasing one
for more letters.
2021-09-02 18:11:49 +02:00
Sarah Hoffmann
7e7dd769fd remove language and partition from name import 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
79da96b369 read partition and languages from config file 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
78fcabade8 move country name generation to country_info module 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
284645f505 move generation of country tables in own module 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
0b349761a8 add country configuration
The new configuration saves the default language(s) originally
maintained in the OSM wiki as well as the partition information.
2021-09-02 14:41:11 +02:00
Sarah Hoffmann
d18794931a
Merge pull request #2435 from lonvia/simplified-to-traditional-chinese
icu: normalise simplified to traditional chinese
2021-08-31 15:29:26 +02:00
Sarah Hoffmann
b7d4ff3201 icu: normalise simplified to traditional chinese
The conversion is unambigious in most cases, so that the
information loss is minimal.
2021-08-31 11:18:34 +02:00
Sarah Hoffmann
4c6d674e03
Merge pull request #2434 from lonvia/vagrant-scripts-in-actions
Test installation instructions via CI
2021-08-29 10:11:59 +02:00
Sarah Hoffmann
2c97af8021 CI: use packaged source also for test runs 2021-08-24 10:10:01 +02:00
Sarah Hoffmann
832f75a55e CI: unify jobs for different vagrant scripts 2021-08-24 10:10:01 +02:00
Sarah Hoffmann
4e77969545 add workflow for centos 8 2021-08-24 10:10:01 +02:00
Sarah Hoffmann
6ebbbfee61 CI: use vagrant scripts for import tests
Use vanilla docker images of Ubuntu and leave the setup
to the vagrant scripts. Then do the usual import tests.

Also fixes a couple of issues found with the scripts
2021-08-24 10:10:01 +02:00