Commit Graph

3120 Commits

Author SHA1 Message Date
Sarah Hoffmann
044bb6afa5 move tokenization in query into tokenizer 2021-04-30 17:41:08 +02:00
Sarah Hoffmann
3eb4d88057 boilerplate for PHP code of tokenizer
This adds an installation step for PHP code for the tokenizer. The
PHP code is split in two parts. The updateable code is found in
lib-php. The tokenizer installs an additional script in the
project directory which then includes the code from lib-php and
defines all settings that are static to the database. The website
code then always includes the PHP from the project directory.
2021-04-30 11:31:52 +02:00
Sarah Hoffmann
23fd1d032a tests for legacy tokenizer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
7cb7cf848d move amenity creation to tokenizer
The BDD tests still use the old-style amenity creation scripts
because we don't have simple means to import a hand-crafted
test file of special phrases right now.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
bef300305e move default country name creation to tokenizer
The new function is also used, when a country us updated. All SQL
function related to country names have been removed.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
dc700c25b6 cache all postcodes 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
0ba93e5ba9 reorganise address iteration in tokenizer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
0da481f207 remove debug code 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
d75a235c1f use address tokens in SQL 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
9e92759ac7 extract address tokens in tokenizer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
ffc2d82b0e move postcode normalization into tokenizer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
d8ed1bfc60 move houseunumber handling to tokenizer
Normalization and token computation are now done in the tokenizer.
The tokenizer keeps a cache to the hundred most used house numbers
to keep the numbers of calls to the database low.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
d711f5a81e move name token creation into tokenizer
Name tokens are now handed in via token_info and used from there.

Also moves the generic search name insertion function back to
placex_triggers.sql.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
fa2bc60468 introduce name analyzer
The name analyzer is the actual work horse of the tokenizer. It
is instantiated on a thread-base and provides all functions for
analysing names and queries.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
e1c5673ac3 require tokeinzer for indexer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
1b1ed820c3 introduce index for finding surrounding buildings 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
a73711f3cd add extra column for tokenizer
Add a jsonb column to the placex and location_property_osmline tables
which can be used by the installed tokenizer as required. No other
part of the software will use or otherwise rely on this column.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
9397bf54b8 introduce external processing in indexer
Indexing is now split into three parts: first a preparation step
that collects the necessary information from the database and
returns it to Python. In a second step the data is transformed
within Python as necessary and then returned to the database
through the usual UPDATE which now not only sets the indexed_status
but also other fields. The third step comprises the address
computation which is still done inside the update trigger in
the database.

The second processing step doesn't do anything useful yet.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
fbbdd31399 move word table and normalisation SQL into tokenizer
Creating and populating the word table is now the responsibility
of the tokenizer.

The get_maxwordfreq() function has been replaced with a
simple template parameter to the SQL during function installation.
The number is taken from the parameter list in the database to
ensure that it is not changed after installation.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
b5540dc35c add migration for configurable tokenizer
Adds a migration that initialises a legacy tokenizer for
an existing database. The migration is not active yet as
it will need completion when more functionality is added
to the legacy tokenizer.
2021-04-30 11:29:57 +02:00
Sarah Hoffmann
296a66558f move module installation to legacy tokenizer 2021-04-30 11:29:57 +02:00
Sarah Hoffmann
af968d4903 introduce tokenizer modules
This adds the boilerplate for selecting configurable tokenizers.
A tokenizer can be chosen at import time and will then install
itself such that it is fixed for the given database import even
when the software itself is updated.

The legacy tokenizer implements Nominatim's traditional algorithms.
2021-04-30 11:29:57 +02:00
Sarah Hoffmann
5c7b9ef909
Merge pull request #2303 from lonvia/remove-aux-support
Remove support for AUX housenumber tables
2021-04-30 11:19:35 +02:00
Sarah Hoffmann
185d369404 remove support for AUX housenumber tables
These tables have never been actively maintained and the code is
completely untested. With the upcomming changes, it is unlikely
that the code remains usable.

This removes the aux tables and all code that references them.
2021-04-30 10:08:29 +02:00
Sarah Hoffmann
51d20b19b6
Merge pull request #2299 from lonvia/update-actions
Fix database check for reverse-only
2021-04-27 12:18:45 +02:00
Sarah Hoffmann
46e8c6b112
Merge pull request #2291 from AntoJvlt/special-phrases-statistics
Special phrases statistics
2021-04-27 11:57:05 +02:00
Sarah Hoffmann
c8fb25201a do not check for extra housenumber index for reverse-only
Also adds a database check for reverse only import to the CI.
2021-04-27 10:14:26 +02:00
Sarah Hoffmann
1fd483643b add tests for different scripts 2021-04-26 23:01:06 +02:00
Sarah Hoffmann
a21a0864f1
Merge pull request #2298 from lonvia/add-warming-to-ci
Add warming to CI import tests and fix more Python 3.5 compatibility issues
2021-04-26 11:21:44 +02:00
Sarah Hoffmann
4457bf7528 avoid Path in subprocess parameters
Not supported by Python 3.5.
2021-04-26 10:55:23 +02:00
Sarah Hoffmann
5ed6f18d83 add warming to CI import test 2021-04-26 09:54:09 +02:00
AntoJvlt
abb3d56b20 Switching to log info and only send warning for invalid phrases 2021-04-25 17:57:43 +02:00
AntoJvlt
c5ecb9bae0 Implemented statistics for the import of special phrases through the SpecialPhrasesImporterStatistics class 2021-04-25 17:57:43 +02:00
AntoJvlt
1b68152fb2 reorganization of folder/file for the special phrases importer 2021-04-25 17:57:42 +02:00
Sarah Hoffmann
6812f397af
Merge pull request #2297 from lonvia/update-deployment-docs
docs: update deployment to use project directory
2021-04-24 15:35:00 +02:00
Sarah Hoffmann
68bd9c6091
Merge pull request #2296 from lonvia/disable-too-few-public-methods-check
pylint: disable too-few-public-methods check
2021-04-24 15:03:28 +02:00
Sarah Hoffmann
754f9e3a20 docs: update deployment to use project directory
Fixes #2295.
2021-04-24 15:00:46 +02:00
Sarah Hoffmann
b951b11336 fix pylint complaints 2021-04-24 11:59:32 +02:00
Sarah Hoffmann
89c90bedb9 pylint: disable check too-few-public-methods 2021-04-24 11:39:44 +02:00
Sarah Hoffmann
b4fe7d7c7d
Merge pull request #2293 from darkshredder/update-manpage
Updated manual page
2021-04-24 09:20:28 +02:00
Sarah Hoffmann
5071710db7
Merge pull request #2294 from lonvia/update-actions
CI: add import test against Python 3.5 and fix discovered issues
2021-04-23 23:33:15 +02:00
Sarah Hoffmann
9faaf3fc88 actions: add import on ubuntu 18.04
This uses oldest possible dependencies where possible.
2021-04-23 22:50:08 +02:00
Sarah Hoffmann
9c51c133f7 indexes with includes are not available for postgresql < 11 2021-04-23 22:50:08 +02:00
Sarah Hoffmann
91d2fb6b1c use group() for regex matches
Needed for compatibility with Python 3.5.
2021-04-23 22:50:08 +02:00
Sarah Hoffmann
280406c0d7 use pathlib version of open 2021-04-23 22:50:08 +02:00
Sarah Hoffmann
d5fc3b5e99 subprocess needs string argument
Compatibility change for Python 3.5.
2021-04-23 22:50:08 +02:00
Sarah Hoffmann
f8f8c7e534 check for existance of custom .env before opening 2021-04-23 22:50:08 +02:00
Sarah Hoffmann
3a642d50a4 use more generic ImportError to check for module
ModuleNotFoundError was only introduced in Python 3.6.
2021-04-23 22:50:08 +02:00
Sarah Hoffmann
9685c68e30 replace usages of fromisoformat() with strptime()
fromisoformat was only introduced with Python 3.7 while we
still support Python 3.5.

Fixes #2292.
2021-04-23 22:50:08 +02:00
Sarah Hoffmann
95e6ec091b remove argparse dependency for vagrant scripts
Users don't need to recreate the manpage.
2021-04-23 22:50:08 +02:00