Commit Graph

368 Commits

Author SHA1 Message Date
Sarah Hoffmann
c171d88194 move parsing of token analysis config to analyzer
Adds a second callback for the analyzer which is responsible
for parsing the configuration rules and converting it to
whatever format necessary. This way, each analyzer implementation
can define its own configuration rules.
2021-10-04 18:31:58 +02:00
Sarah Hoffmann
7cfcbacfc7 make token analyzers configurable modules
Adds a mandatory section 'analyzer' to the token-analysis entries
which define, which analyser to use. Currently there is exactly
one, generic, which implements the former ICUNameProcessor.
2021-10-04 17:37:34 +02:00
Sarah Hoffmann
52847b61a3 extend ICU config to accomodate multiple analysers
Adds parsing of multiple variant lists from the configuration.
Every entry except one must have a unique 'id' paramter to
distinguish the entries. The entry without id is considered
the default. Currently only the list without an id is used
for analysis.
2021-10-04 16:40:28 +02:00
Sarah Hoffmann
5a36559834 move flatten_config_list into config module
For general usage by other modules.
2021-10-04 11:56:54 +02:00
Sarah Hoffmann
732cd27d2e add unit tests for new sanatizer functions 2021-10-01 12:27:24 +02:00
Sarah Hoffmann
8171fe4571 introduce sanitizer step before token analysis
Sanatizer functions allow to transform name and address tags before
they are handed to the tokenizer. Theses transformations are visible
only for the tokenizer and thus only have an influence on the
search terms and address match terms for a place.

Currently two sanitizers are implemented which are responsible for
splitting names with multiple values and removing bracket additions.
Both was previously hard-coded in the tokenizer.
2021-10-01 12:27:24 +02:00
Sarah Hoffmann
16daa57e47 unify ICUNameProcessorRules and ICURuleLoader
There is no need for the additional layer of indirection that
the ICUNameProcessorRules class adds. The ICURuleLoader can
fill the database properties directly.
2021-10-01 12:27:24 +02:00
Sarah Hoffmann
5e5addcdbf fix typo 2021-09-29 14:16:09 +02:00
Sarah Hoffmann
be65c8303f export more data for the tokenizer name preparation
Adds class, type, country and rank to the exported information
and removes the rather odd hack for countries. Whether a place
represents a country boundary can now be computed by the tokenizer.
2021-09-29 11:54:14 +02:00
Sarah Hoffmann
231250f2eb add wrapper class for place data passed to tokenizer
This is mostly for convenience and documentation purposes.
2021-09-29 11:54:07 +02:00
Sarah Hoffmann
bb18479d5b remove unused parameter 2021-09-27 14:58:43 +02:00
Sarah Hoffmann
bd7c7ddad0 icu tokenizer: switch to matching against partial names
When matching address parts from addr:* tags against place names,
the address names where so far converted to full names and compared
those to the place names. This can become problematic with the new
ICU tokenizer once we introduce creation of different variants
depending on the place name context. It wouldn't be clear which
variant to produce to get a match, so we would have to create all of
them. To work around this issue, switch to using the partial terms
for matching. This introduces a larger fuzziness between matches but
that shouldn't be a problem because matching is always geographically
restricted.

The search terms created for address parts have a different problem:
they are already created before we even know if they are going to be
used. This can lead to spurious entries in the word table, which slows
down searching. This problem can also be circumvented by using only
partial terms for the search terms. In terms of searching that means
that the address terms would not get the full-word boost, but given
that the case where an address part does not exist as an OSM object
should be the exception, this is likely acceptable.
2021-09-27 11:36:19 +02:00
Sarah Hoffmann
b894d2c04a fix indent 2021-09-04 10:30:35 +02:00
Sarah Hoffmann
8e1d4818ac use yaml config loader for country info 2021-09-04 00:22:55 +02:00
Sarah Hoffmann
28c98584c1 add tests for generic YAML config reader 2021-09-03 22:31:30 +02:00
Sarah Hoffmann
1c42780bb5 introduce generic YAML config loader
Adds a function to the Configuration class to load a YAML
file. This means that searching for the file is generalised
and works the same now for all configuration files. Changes
the search logic, so that it is always possible to have a
custom version of the configuration file in the project
directory.

Move ICU tokenizer to use new load function.
2021-09-03 18:20:07 +02:00
Sarah Hoffmann
7e7dd769fd remove language and partition from name import 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
79da96b369 read partition and languages from config file 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
78fcabade8 move country name generation to country_info module 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
284645f505 move generation of country tables in own module 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
28ee3d0949 move linking of places to the preparation stage
Linked places may bring in extra names. These names need to be
processed by the tokenizer. That means that the linking needs
to be done before the data is handed to the tokenizer. Move finding
the linked place into the preparation stage and update the name
fields. Everything else is still done in the indexing stage.
2021-08-20 22:44:17 +02:00
Sarah Hoffmann
118858a55e rename legacy_icu tokenizer to icu tokenizer
The new icu tokenizer is now no longer compatible with the old
legacy tokenizer in terms of data structures. Therefore there
is also no longer a need to refer to the legacy tokenizer in the
name.
2021-08-17 23:11:47 +02:00
Sarah Hoffmann
90b40fc3e6 define formal public Python interface for tokenizer
This introduces an abstract class for the Tokenizer/Analyzer
for documentation purposes.
2021-08-16 11:41:54 +02:00
Sarah Hoffmann
75a5c7013f split up large setup function 2021-08-15 12:24:13 +02:00
Sarah Hoffmann
87dedde5d6 allow multiple files for the import command
The files are forwarded to osm2pgsql which is now able to merge
them correctly.
2021-08-14 21:42:21 +02:00
Sarah Hoffmann
d48793c22c fix Python linitin errors 2021-07-28 11:31:47 +02:00
Sarah Hoffmann
1db098c05d reinstate word column in icu word table
Postgresql is very bad at creating statistics for jsonb
columns. The result is that the query planer tends to
use JIT for queries with a where over 'info' even when
there is an index.
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
e42878eeda adapt unit test for new word table
Requires a second wrapper class for the word table with the new
layout. This class is interface-compatible, so that later when
the ICU tokenizer becomes the default, all tests that depend on
behaviour of the default tokenizer can be switched to the other
wrapper.
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
eb6814d74e convert word info column to json before copying 2021-07-28 11:31:47 +02:00
Sarah Hoffmann
70f154be8b switch word tokens to new word table layout 2021-07-28 11:31:47 +02:00
Sarah Hoffmann
4342b28882 switch special phrases to new word table format 2021-07-28 11:31:47 +02:00
Sarah Hoffmann
5394b1fa1b switch postcode tokens to new word table layout 2021-07-28 11:31:47 +02:00
Sarah Hoffmann
5ab0a63fd6 switch housenumber tokens to new word table layout 2021-07-28 11:31:47 +02:00
Sarah Hoffmann
1618aba5f2 switch country name tokens to new word table layout 2021-07-28 11:31:47 +02:00
Sarah Hoffmann
8377528952 new word table layout for icu tokenizer
The table now directly reflects the different token types.
Extra information is saved in a json structure that may be
dynamically extended in the future without affecting the
table layout.
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
e42349c963 replace add-data function with native Python code 2021-07-26 10:41:37 +02:00
Sarah Hoffmann
878835e4bd move add-data subcommand into a separate file 2021-07-25 18:14:12 +02:00
Sarah Hoffmann
2c8242c8df remove special code for pre9.5 postgresql
9.5 is now the minimum requirement.
2021-07-19 10:24:57 +02:00
Sarah Hoffmann
e7d6f89aca increase minimum version for PostgreSQL to 9.5
This is the minimum version we can test with the CI.
With 9.5 there is also complete support for jsonb available.
2021-07-19 10:21:19 +02:00
Sarah Hoffmann
14f777da18 use psycopg's SQL quoting where possible
Use the SQL formatting supplied with psycopg whenever the
query needs to be put together from snippets.
2021-07-12 22:05:22 +02:00
Sarah Hoffmann
6f6681ce67 add helper function for execute_values
Make psycopg2's convenience function accessible through
the cursor.
2021-07-12 21:08:20 +02:00
Sarah Hoffmann
06602b4ec0 provide wrapper function for DROP TABLE
Use psycopg2 formatting to ensure correct quoting.
2021-07-12 20:32:46 +02:00
Sarah Hoffmann
cf98cff2a1 more formatting fixes
Found by flake8.
2021-07-12 17:45:42 +02:00
Sarah Hoffmann
f8b5a63de3 factor out connection reset code 2021-07-12 14:58:44 +02:00
Sarah Hoffmann
568316f07c simplify analyse function 2021-07-12 14:47:50 +02:00
Sarah Hoffmann
daa597b300 split up variant computation for better readability 2021-07-12 14:43:50 +02:00
Sarah Hoffmann
47adb2a3fc reorganise process_place function
Move address processing into its own function as it is
rather extensive.
2021-07-12 11:57:55 +02:00
Sarah Hoffmann
fff0012249 simplify website setup code
Use formaat strings and move variable quoting code into extra
function.
2021-07-12 11:41:05 +02:00
Sarah Hoffmann
d5a1883b62 avoid repeated patterns for table name 2021-07-12 11:33:09 +02:00
Sarah Hoffmann
a08ef43e40 simplify if statements 2021-07-12 11:28:47 +02:00