Commit Graph

518 Commits

Author SHA1 Message Date
Sarah Hoffmann
d35400a7d7 use analyser provided in the 'analyzer' property
Implements per-name choice of analyzer. If a non-default
analyzer is choosen, then the 'word' identifier is extended
with the name of the ana;yzer, so that we still have unique
items.
2021-10-05 14:10:32 +02:00
Sarah Hoffmann
9ba2019470 precompute replacements while loading configuration 2021-10-05 10:20:08 +02:00
Sarah Hoffmann
7cfcbacfc7 make token analyzers configurable modules
Adds a mandatory section 'analyzer' to the token-analysis entries
which define, which analyser to use. Currently there is exactly
one, generic, which implements the former ICUNameProcessor.
2021-10-04 17:37:34 +02:00
Sarah Hoffmann
52847b61a3 extend ICU config to accomodate multiple analysers
Adds parsing of multiple variant lists from the configuration.
Every entry except one must have a unique 'id' paramter to
distinguish the entries. The entry without id is considered
the default. Currently only the list without an id is used
for analysis.
2021-10-04 16:40:28 +02:00
Sarah Hoffmann
6b348d43c6 replace test variable for PG env tests
'tty' was removed in PG14 and causes an error.
2021-10-01 12:27:24 +02:00
Sarah Hoffmann
732cd27d2e add unit tests for new sanatizer functions 2021-10-01 12:27:24 +02:00
Sarah Hoffmann
8171fe4571 introduce sanitizer step before token analysis
Sanatizer functions allow to transform name and address tags before
they are handed to the tokenizer. Theses transformations are visible
only for the tokenizer and thus only have an influence on the
search terms and address match terms for a place.

Currently two sanitizers are implemented which are responsible for
splitting names with multiple values and removing bracket additions.
Both was previously hard-coded in the tokenizer.
2021-10-01 12:27:24 +02:00
Sarah Hoffmann
16daa57e47 unify ICUNameProcessorRules and ICURuleLoader
There is no need for the additional layer of indirection that
the ICUNameProcessorRules class adds. The ICURuleLoader can
fill the database properties directly.
2021-10-01 12:27:24 +02:00
Sarah Hoffmann
be65c8303f export more data for the tokenizer name preparation
Adds class, type, country and rank to the exported information
and removes the rather odd hack for countries. Whether a place
represents a country boundary can now be computed by the tokenizer.
2021-09-29 11:54:14 +02:00
Sarah Hoffmann
231250f2eb add wrapper class for place data passed to tokenizer
This is mostly for convenience and documentation purposes.
2021-09-29 11:54:07 +02:00
Sarah Hoffmann
40f9d52ad8
Merge pull request #2454 from lonvia/sort-out-token-assignment-in-sql
ICU tokenizer: switch match method to using partial terms
2021-09-28 09:45:15 +02:00
Sarah Hoffmann
09c9fad6c3 adapt tests to new ICU address token handling 2021-09-27 17:36:23 +02:00
Sarah Hoffmann
bd7c7ddad0 icu tokenizer: switch to matching against partial names
When matching address parts from addr:* tags against place names,
the address names where so far converted to full names and compared
those to the place names. This can become problematic with the new
ICU tokenizer once we introduce creation of different variants
depending on the place name context. It wouldn't be clear which
variant to produce to get a match, so we would have to create all of
them. To work around this issue, switch to using the partial terms
for matching. This introduces a larger fuzziness between matches but
that shouldn't be a problem because matching is always geographically
restricted.

The search terms created for address parts have a different problem:
they are already created before we even know if they are going to be
used. This can lead to spurious entries in the word table, which slows
down searching. This problem can also be circumvented by using only
partial terms for the search terms. In terms of searching that means
that the address terms would not get the full-word boost, but given
that the case where an address part does not exist as an OSM object
should be the exception, this is likely acceptable.
2021-09-27 11:36:19 +02:00
Sarah Hoffmann
6d7c067461 force update on rank30 children when place name changes
Name changes may have an effect on parenting. Don't update
surrounding rank30 objects with addr:place tags as this is
potentially too expensive.
2021-09-27 11:04:17 +02:00
Sarah Hoffmann
316205e455 force update of surrounding houses when street name changes
When the street changes its name then this may cause changes
in the parenting of rank-30 objects with an addr:street
tag.

Fixes #2242.
2021-09-27 10:22:41 +02:00
Sarah Hoffmann
56124546a6 fix dynamic assignment of address parts
A boolean check for dynamic changes of address parts is not
sufficient. The order of choice should be:

 1. an addr:* part matches the name
 2. the address part surrounds the object
 3. the address part was declared as isaddress

The implementation uses a slightly different ordering
to avoid geometry checks unless strictly necessary (isaddress
is false and no matching address).

See #2446.
2021-09-19 12:34:39 +02:00
Sarah Hoffmann
8e1d4818ac use yaml config loader for country info 2021-09-04 00:22:55 +02:00
Sarah Hoffmann
28c98584c1 add tests for generic YAML config reader 2021-09-03 22:31:30 +02:00
Sarah Hoffmann
1c42780bb5 introduce generic YAML config loader
Adds a function to the Configuration class to load a YAML
file. This means that searching for the file is generalised
and works the same now for all configuration files. Changes
the search logic, so that it is always possible to have a
custom version of the configuration file in the project
directory.

Move ICU tokenizer to use new load function.
2021-09-03 18:20:07 +02:00
Sarah Hoffmann
79da96b369 read partition and languages from config file 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
78fcabade8 move country name generation to country_info module 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
284645f505 move generation of country tables in own module 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
28ee3d0949 move linking of places to the preparation stage
Linked places may bring in extra names. These names need to be
processed by the tokenizer. That means that the linking needs
to be done before the data is handed to the tokenizer. Move finding
the linked place into the preparation stage and update the name
fields. Everything else is still done in the indexing stage.
2021-08-20 22:44:17 +02:00
Sarah Hoffmann
118858a55e rename legacy_icu tokenizer to icu tokenizer
The new icu tokenizer is now no longer compatible with the old
legacy tokenizer in terms of data structures. Therefore there
is also no longer a need to refer to the legacy tokenizer in the
name.
2021-08-17 23:11:47 +02:00
Sarah Hoffmann
5f2b9e317a add tests for US state hacks
IL, AS and LA are replaced with the US state in Geocode because
the old tokenizer would simply remove the abbreviations otherwise.
2021-08-17 10:49:07 +02:00
Sarah Hoffmann
1147b83b22 php: make word list a first-class object
This separates the logic of creating word sets from the Phrase
class. A tokenizer may now derived the word sets any way they
like. The SimpleWordList class provides a standard implementation
for splitting phrases on spaces.
2021-08-16 11:51:49 +02:00
Sarah Hoffmann
87dedde5d6 allow multiple files for the import command
The files are forwarded to osm2pgsql which is now able to merge
them correctly.
2021-08-14 21:42:21 +02:00
Sarah Hoffmann
1db098c05d reinstate word column in icu word table
Postgresql is very bad at creating statistics for jsonb
columns. The result is that the query planer tends to
use JIT for queries with a where over 'info' even when
there is an index.
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
324b1b5575 bdd tests: do not query word table directly
The BDD tests cannot make assumptions about the structure of the
word table anymore because it depends on the tokenizer. Use more
abstract descriptions instead that ask for specific kinds of
tokens.
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
e42878eeda adapt unit test for new word table
Requires a second wrapper class for the word table with the new
layout. This class is interface-compatible, so that later when
the ICU tokenizer becomes the default, all tests that depend on
behaviour of the default tokenizer can be switched to the other
wrapper.
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
eb6814d74e convert word info column to json before copying 2021-07-28 11:31:47 +02:00
Sarah Hoffmann
0c023fb4d2 adapt cli tests to Python port for add-data 2021-07-26 10:41:37 +02:00
Sarah Hoffmann
878835e4bd move add-data subcommand into a separate file 2021-07-25 18:14:12 +02:00
Sarah Hoffmann
62d5984b1b limit the number of variants that can be produced 2021-07-04 10:28:28 +02:00
Sarah Hoffmann
e85f7e7aa9 fix subsequent replacements
Two replacement words directly following each other did not
work as expected because each expects a space at the
beginning/end while there was only one space available.

Also forbit composing a word after a space was added in the
end by a previous replacement.
2021-07-04 10:28:28 +02:00
Sarah Hoffmann
b9fbfeff67 only consider partials in multi-words for initial count
This ensures that it is less likely that we exclude meaningful
words like 'hauptstrasse' just because they are frequent.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
62828fc5c1 switch to a more flexible variant description format
The new format combines compound splitting and abbreviation.
It also allows to restrict rules to additional conditions
(like language or region). This latter ability is not used
yet.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
a6aa6360e0 use yaml tag syntax to mark include files 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
0d80a9b897 tests for composing decomposed suffixes 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
f70930b1a0 make compund decomposition pure import feature
Compound decomposition now creates a full name variant on
import just like abbreviations. This simplifies query time
normalization and opens a path for changing abbreviation
and compund decomposition lists for an existing database.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
9ff4f66f55 complete tests for icu tokenizer 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e81084f35 complete tests for rule loader 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
a0a7b05c9f correctly quote strings when copying in data
Encapsulate the copy string in a class that ensures that
copy lines are written with correct quoting.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2f6e4edcdb update unit tests for adapted abbreviation code 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e3c5d4c5b adapt tests for ICU tokenizer 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
8413075249 move abbreviation computation into import phase
This adds precomputation of abbreviated terms for names and removes
abbreviation of terms in the query. Basic import works but still
needs some thorough testing as well as speed improvements during
import.

New dependency for python library datrie.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
e7b4fc70e7 make sure old data gets deleted on place type change
When changing from some other place type to place=postcode
make sure that the old place type entry in the place table
is deleted.
2021-06-18 10:58:41 +02:00
Sarah Hoffmann
457982e1d2 update postcode in place if it already exists 2021-06-18 00:28:52 +02:00
Sarah Hoffmann
aa558e6080
Merge pull request #2369 from lonvia/exclude-poi-from-housenumber-search
Do not return POIs when dropping house number in query
2021-06-17 15:30:05 +02:00
Sarah Hoffmann
fe11d3cbbd do not return POIs when dropping house number in query
We've previously added searching through rank 30 in a house
number search to enable searches for house number+name.
This had the unintended side effect that rank 30 objects
are also returned in s search that dropped the house number
from the query. This is wrong because POIs cannot function
as a parent to a house number.

This fix drops all rank 30 objects from the results for a
house number search if they do not match the requested house
number.
2021-06-17 14:21:20 +02:00