Commit Graph

382 Commits

Author SHA1 Message Date
Sarah Hoffmann
9172696324 postcodes: add support for optional spaces 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
baee6f3de0 postcodes: strip leading country codes 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
28ab2f6048 add postcodes patterns without optional spaces 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
90d4d339db initial postcode cleaner for simple patterns
Moves postcodes that are either in countries without a postcode
system or don't correspond to the local pattern for postcodes into
a field for a normal address part. Makes them searchable but not as
a special address. This has two consequences: they are no longer a
skippable part of the address and the postcodes cannot be searched
on their own.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
cbb4749996 change indexing order for interpolations
Interpolations are now indexed after rank 30 objects. The housenumber
nodes no longer need information from the interpolations while the
interpolations can make use of precomputed postcodes.
2022-06-02 15:16:46 +02:00
Sarah Hoffmann
46689df668 custom comparison for SpecialPhrase
Duplicate elemination only works when a custom hash/equal function
is implemented that is based on the members.
2022-05-30 16:30:41 +02:00
Sarah Hoffmann
e828d0d3f7 move quoting hack to wiki loader
The bad quotes around the type for special phrases
specifically occure in the Wiki pages, so it should be
removed by the loader and not in the generic SpecialPhrase
object.
2022-05-30 14:40:33 +02:00
Sarah Hoffmann
cce0e5ea38 convert special phrase loaders to generators
Generators simplify the code quite a bit compared to the previous
Iterator approach.
2022-05-30 14:12:46 +02:00
Sarah Hoffmann
042e314589 remove the language parameter in the SPWikiLoader
Languages must always be configured through config or environment.
Also use monkeypatched environment in tests.
2022-05-30 10:26:20 +02:00
Sarah Hoffmann
61d813bfef add get_str_list() for config
Converts a config value written as a comma-sparated list into
a Python list of strings.
2022-05-29 13:53:50 +02:00
Sarah Hoffmann
adeebec32a switch tests to ICU tokenizer as default 2022-05-10 14:54:50 +02:00
Sarah Hoffmann
ed6fda6968
Merge pull request #2702 from lonvia/move-country-names-into-includes
Clean up country name settings
2022-05-10 09:21:16 +02:00
Marc Tobias
821dabb138 add git commit hash to --version output 2022-05-09 23:56:13 +02:00
Sarah Hoffmann
9d468f6da0 support arbitrary prefixes in country name list
This means we can now get rid of the last special cases for names.
2022-05-09 11:55:26 +02:00
Marc Tobias
0de83c4a51 fix typos of name Nominatim 2022-05-05 01:04:47 +02:00
Marc Tobias
a79ab41782 new nominatim --version CLI argument 2022-05-04 01:33:25 +02:00
Sarah Hoffmann
4f59644cc2 add tests for new data invalidation functions 2022-04-14 14:52:13 +02:00
Sarah Hoffmann
fd4ab3f262
Merge pull request #2629 from tareqpi/country-names-yaml-configuration
Move default country names into yaml configuration
2022-04-04 09:04:25 +02:00
Tareq Al-Ahdal
e9f979b67b 'read_config' is no longer a fixture
add 'read_config' to test cases that need it
2022-04-01 22:52:17 +08:00
Tareq Al-Ahdal
a323b8f63a test for loading special characters from country_settings.yaml 2022-04-01 21:58:57 +08:00
Tareq Al-Ahdal
9411c14fd2 fix reset country info before loading custom data 2022-04-01 21:55:34 +08:00
Tareq Al-Ahdal
8525e7542f custom country config loads correctly 2022-04-01 21:46:56 +08:00
Sarah Hoffmann
de18cd1523 add test for new table_has_column function 2022-03-31 15:55:20 +02:00
Tareq Al-Ahdal
b5f311d6bc separate unit test function into three functions 2022-03-30 22:06:59 +08:00
Tareq Al-Ahdal
9db13aac72 Added unit tests for loading country info from yaml file 2022-03-25 22:22:44 +08:00
Sarah Hoffmann
a0ed80d821 restore the tokenizer directory when missing
Automatically repopulate the tokenizer/ directory with the PHP stub
and the postgresql module, when the directory is missing. This allows
to switch working directories and in particular run the service
from a different maschine then where it was installed.
Users still need to make sure that .env files are set up correctly
or they will shoot themselves in the foot.

See #2515.
2022-03-20 11:31:42 +01:00
Sarah Hoffmann
0a9f971e44 add tests for new analyzed housenumbers 2022-03-01 09:34:32 +01:00
Sarah Hoffmann
837d44391c move generation of normalized token form to analyzer
This gives the analyzer more flexibility in choosing the normalized
form. In particular, an analyzer creating different variants can choose
the variant that will be used as the canonical form.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
a6b4e8ff67 add tests for housenumber-as-name feature 2022-02-07 11:45:12 +01:00
Sarah Hoffmann
38c3ef3da0 add tests for get_string_list()
Renaming test file for sanitizer config because pytest requires
unique names for test files.
2022-02-07 11:22:24 +01:00
Sarah Hoffmann
610f2cc254 sanitizer: move helpers into a configuration class 2022-02-07 10:48:00 +01:00
Sarah Hoffmann
c170d323d9 add tests for cleaning housenumbers 2022-01-20 23:47:20 +01:00
Sarah Hoffmann
d09db09849 adapt ICU tets to new housenumber sanitizer
Restrict tests to making sure that handing in multiple housenumbers
works.
2022-01-20 16:05:49 +01:00
Sarah Hoffmann
3741afa6dc generalize filter-kind parameter for sanatizers
Now behaves the same for tag_analyzer_by_language and
clean_housenumbers. Adds tests.
2022-01-20 15:42:42 +01:00
Sarah Hoffmann
560a006892 add pytest config
We are using custom marks now which need to be registered to avoid
warnings.
2022-01-20 15:38:02 +01:00
Sarah Hoffmann
4774e45218 clean_housenumbers: make kinds and delimiters configurable
Also adds unit tests for various options.
2022-01-20 12:07:12 +01:00
Sarah Hoffmann
b453b0ea95 introduce mutation variants to generic token analyser
Mutations are regular-expression-based replacements that are applied
after variants have been computed. They are meant to be used for
variations on character level.

Add spelling variations for German umlauts.
2022-01-18 11:09:21 +01:00
Sarah Hoffmann
c3788d765e add consistent SPDX copyright headers 2022-01-03 16:23:58 +01:00
Sarah Hoffmann
7f7d2fd5b3 skip most addr: tags with suffixes
Only one addr: tag can be processed currently, so make
sure it is the one without suffixes to not get odd data.
addr:street is the exception because it uses a different
matching mechanism.
2021-12-06 14:55:10 +01:00
Sarah Hoffmann
44cfce1ca4 revert to using full names for street name matching
Using partial names turned out to not work well because there are
often similarly named streets next to each other. It also
prevents us from being able to take into account all addr:street:*
tags.

This change gets all the full term tokens for the addr:street tags
from the DB. As they are used for matching only, we can assume that
the term must already be there or there will be no match. This
avoid creating unused full name tags.
2021-12-06 11:38:38 +01:00
Sarah Hoffmann
5a9fb6eaf7 specify text type in test SQL
Older version of postgres fail otherwise.
2021-12-03 13:56:23 +01:00
Sarah Hoffmann
54d35ddfe9 split cli tests by subcommand and extend coverage 2021-12-02 23:45:48 +01:00
Sarah Hoffmann
14a78f55cd more unit tests for tokenizers 2021-12-02 15:46:36 +01:00
Sarah Hoffmann
7617a9316e extend API unit tests 2021-12-01 20:48:29 +01:00
Sarah Hoffmann
a52ed366e4 add tests for migration 2021-12-01 20:27:40 +01:00
Sarah Hoffmann
7be164e2a5 more testing for refresh functions 2021-12-01 14:58:54 +01:00
Sarah Hoffmann
a24f25c0d8 more tests for exec utilities 2021-12-01 14:23:51 +01:00
Sarah Hoffmann
993b238a41 add more tests for database import 2021-12-01 11:54:58 +01:00
Sarah Hoffmann
bbbfc8201c add tests for adding additional data
Also adds checks that parameters for osm2pgsql are set
as expected.
2021-12-01 11:22:46 +01:00
Sarah Hoffmann
6f03a4d6ce add tests for flatten_config_file and other than yaml formats 2021-12-01 10:24:11 +01:00
Sarah Hoffmann
c8958a22d2 tests: add fixture for making test project directory 2021-11-30 18:01:46 +01:00
Sarah Hoffmann
37afa2180b generalize fixtures for cli tests 2021-11-30 14:07:39 +01:00
Sarah Hoffmann
b2df8e478a python test: move single-use fixtures to subdirectories 2021-11-30 12:03:16 +01:00
Sarah Hoffmann
50fccb52be remove unused test files 2021-11-30 11:44:10 +01:00
Sarah Hoffmann
b90e719da5 organise python tests in subdirectories
The directories follow the same structure as the modules in
nominatim/.
2021-11-30 11:22:26 +01:00
Sarah Hoffmann
10e979e841 only instantiate indexer once for replication
Also makes sure that indexer object exists everywhere were needed.

See #2518.
2021-11-19 14:48:58 +01:00
Sarah Hoffmann
345c812e43 better error reporting when API script does not exist
Check if the API script exists on the expected location before
running php-cli. This way we can add a useful hint about the
project directory.

Fixes #2513.
2021-11-10 11:58:20 +01:00
Sarah Hoffmann
37eeccbf4c ICU: use normalization from config in PHP
The TERM_NORMALIZATION config option is no longer applicable.
That was already documented but not yet implemented.
2021-10-27 11:32:44 +02:00
Sarah Hoffmann
5a1c3dbea3 fix parsing of operator in special phrases
Because of unstripped input, the operators wouldn't match.
2021-10-25 19:46:30 +02:00
Sarah Hoffmann
1098ab732f allow relative paths for flatnode file 2021-10-22 17:32:51 +02:00
Sarah Hoffmann
507fdd4f40 switch IMPORT_STYLE to use generic file search
Allows relative paths wrt project directory.
2021-10-22 16:49:57 +02:00
Sarah Hoffmann
0ae8d7ac08 have ADDRESS_LEVEL_CONFIG use load_sub_configuration
This means that relative paths now are looked up in the
project directory.
2021-10-22 16:36:52 +02:00
Sarah Hoffmann
c77df2d1eb replace NOMINATIM_PHRASE_CONFIG with command line option 2021-10-22 14:41:14 +02:00
Sarah Hoffmann
c1fa70639b add new replication mode catch-up
This mode gets updates until the server reports no new diffs
anymore.

Also adds additional indexing, when the main indexing step left
a couple of objects to process. This happens only when the
next update is expected to be more than 40min away.
2021-10-20 22:05:15 +02:00
Sarah Hoffmann
824562357b adapt tests for new word count mechanism 2021-10-19 12:03:48 +02:00
Sarah Hoffmann
552fb16cb2 fix template expressions for tablespaces 2021-10-15 15:11:09 +02:00
Sarah Hoffmann
3649487f5e use SP-GIST index for building index where available
Point-in-polygon queries are much faster with a SP-GIST geometry
index, so use that for the index used to check if a housenumber
is inside a building.

Only available with Postgis 3. There is an automatic fallback to
GIST for Postgis 2.
2021-10-10 21:55:38 +02:00
Sarah Hoffmann
299934fd2a reorganize and complete tests around generic token analysis 2021-10-06 17:03:37 +02:00
Sarah Hoffmann
b18d042832 add tests for sanitizer tagging language 2021-10-06 12:29:25 +02:00
Sarah Hoffmann
97a10ec218 apply variants by languages
Adds a tagger for names by language so that the analyzer of that
language is used. Thus variants are now only applied to names
in the specific language and only tag name tags, no longer to
reference-like tags.
2021-10-06 11:09:54 +02:00
Sarah Hoffmann
d35400a7d7 use analyser provided in the 'analyzer' property
Implements per-name choice of analyzer. If a non-default
analyzer is choosen, then the 'word' identifier is extended
with the name of the ana;yzer, so that we still have unique
items.
2021-10-05 14:10:32 +02:00
Sarah Hoffmann
9ba2019470 precompute replacements while loading configuration 2021-10-05 10:20:08 +02:00
Sarah Hoffmann
7cfcbacfc7 make token analyzers configurable modules
Adds a mandatory section 'analyzer' to the token-analysis entries
which define, which analyser to use. Currently there is exactly
one, generic, which implements the former ICUNameProcessor.
2021-10-04 17:37:34 +02:00
Sarah Hoffmann
52847b61a3 extend ICU config to accomodate multiple analysers
Adds parsing of multiple variant lists from the configuration.
Every entry except one must have a unique 'id' paramter to
distinguish the entries. The entry without id is considered
the default. Currently only the list without an id is used
for analysis.
2021-10-04 16:40:28 +02:00
Sarah Hoffmann
6b348d43c6 replace test variable for PG env tests
'tty' was removed in PG14 and causes an error.
2021-10-01 12:27:24 +02:00
Sarah Hoffmann
732cd27d2e add unit tests for new sanatizer functions 2021-10-01 12:27:24 +02:00
Sarah Hoffmann
8171fe4571 introduce sanitizer step before token analysis
Sanatizer functions allow to transform name and address tags before
they are handed to the tokenizer. Theses transformations are visible
only for the tokenizer and thus only have an influence on the
search terms and address match terms for a place.

Currently two sanitizers are implemented which are responsible for
splitting names with multiple values and removing bracket additions.
Both was previously hard-coded in the tokenizer.
2021-10-01 12:27:24 +02:00
Sarah Hoffmann
16daa57e47 unify ICUNameProcessorRules and ICURuleLoader
There is no need for the additional layer of indirection that
the ICUNameProcessorRules class adds. The ICURuleLoader can
fill the database properties directly.
2021-10-01 12:27:24 +02:00
Sarah Hoffmann
be65c8303f export more data for the tokenizer name preparation
Adds class, type, country and rank to the exported information
and removes the rather odd hack for countries. Whether a place
represents a country boundary can now be computed by the tokenizer.
2021-09-29 11:54:14 +02:00
Sarah Hoffmann
231250f2eb add wrapper class for place data passed to tokenizer
This is mostly for convenience and documentation purposes.
2021-09-29 11:54:07 +02:00
Sarah Hoffmann
09c9fad6c3 adapt tests to new ICU address token handling 2021-09-27 17:36:23 +02:00
Sarah Hoffmann
8e1d4818ac use yaml config loader for country info 2021-09-04 00:22:55 +02:00
Sarah Hoffmann
28c98584c1 add tests for generic YAML config reader 2021-09-03 22:31:30 +02:00
Sarah Hoffmann
1c42780bb5 introduce generic YAML config loader
Adds a function to the Configuration class to load a YAML
file. This means that searching for the file is generalised
and works the same now for all configuration files. Changes
the search logic, so that it is always possible to have a
custom version of the configuration file in the project
directory.

Move ICU tokenizer to use new load function.
2021-09-03 18:20:07 +02:00
Sarah Hoffmann
79da96b369 read partition and languages from config file 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
78fcabade8 move country name generation to country_info module 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
284645f505 move generation of country tables in own module 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
28ee3d0949 move linking of places to the preparation stage
Linked places may bring in extra names. These names need to be
processed by the tokenizer. That means that the linking needs
to be done before the data is handed to the tokenizer. Move finding
the linked place into the preparation stage and update the name
fields. Everything else is still done in the indexing stage.
2021-08-20 22:44:17 +02:00
Sarah Hoffmann
118858a55e rename legacy_icu tokenizer to icu tokenizer
The new icu tokenizer is now no longer compatible with the old
legacy tokenizer in terms of data structures. Therefore there
is also no longer a need to refer to the legacy tokenizer in the
name.
2021-08-17 23:11:47 +02:00
Sarah Hoffmann
87dedde5d6 allow multiple files for the import command
The files are forwarded to osm2pgsql which is now able to merge
them correctly.
2021-08-14 21:42:21 +02:00
Sarah Hoffmann
1db098c05d reinstate word column in icu word table
Postgresql is very bad at creating statistics for jsonb
columns. The result is that the query planer tends to
use JIT for queries with a where over 'info' even when
there is an index.
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
e42878eeda adapt unit test for new word table
Requires a second wrapper class for the word table with the new
layout. This class is interface-compatible, so that later when
the ICU tokenizer becomes the default, all tests that depend on
behaviour of the default tokenizer can be switched to the other
wrapper.
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
eb6814d74e convert word info column to json before copying 2021-07-28 11:31:47 +02:00
Sarah Hoffmann
0c023fb4d2 adapt cli tests to Python port for add-data 2021-07-26 10:41:37 +02:00
Sarah Hoffmann
878835e4bd move add-data subcommand into a separate file 2021-07-25 18:14:12 +02:00
Sarah Hoffmann
62d5984b1b limit the number of variants that can be produced 2021-07-04 10:28:28 +02:00
Sarah Hoffmann
e85f7e7aa9 fix subsequent replacements
Two replacement words directly following each other did not
work as expected because each expects a space at the
beginning/end while there was only one space available.

Also forbit composing a word after a space was added in the
end by a previous replacement.
2021-07-04 10:28:28 +02:00
Sarah Hoffmann
b9fbfeff67 only consider partials in multi-words for initial count
This ensures that it is less likely that we exclude meaningful
words like 'hauptstrasse' just because they are frequent.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
62828fc5c1 switch to a more flexible variant description format
The new format combines compound splitting and abbreviation.
It also allows to restrict rules to additional conditions
(like language or region). This latter ability is not used
yet.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
a6aa6360e0 use yaml tag syntax to mark include files 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
0d80a9b897 tests for composing decomposed suffixes 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
f70930b1a0 make compund decomposition pure import feature
Compound decomposition now creates a full name variant on
import just like abbreviations. This simplifies query time
normalization and opens a path for changing abbreviation
and compund decomposition lists for an existing database.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
9ff4f66f55 complete tests for icu tokenizer 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e81084f35 complete tests for rule loader 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
a0a7b05c9f correctly quote strings when copying in data
Encapsulate the copy string in a class that ensures that
copy lines are written with correct quoting.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2f6e4edcdb update unit tests for adapted abbreviation code 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e3c5d4c5b adapt tests for ICU tokenizer 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
8413075249 move abbreviation computation into import phase
This adds precomputation of abbreviated terms for names and removes
abbreviation of terms in the query. Basic import works but still
needs some thorough testing as well as speed improvements during
import.

New dependency for python library datrie.
2021-07-04 10:28:20 +02:00
AntoJvlt
3676310efe Improved performance of the postcodes query and some code cleaning 2021-06-12 15:46:08 +02:00
AntoJvlt
1c175e3a67 Clean and update tests for postcodes 2021-06-09 09:31:32 +02:00
AntoJvlt
e879814e43 Update tests for postcodes 2021-06-09 09:31:32 +02:00
Sarah Hoffmann
bc981d0261 fix insertion of special terms and countries into word table
Special terms need to be prefixed by a space because they are
full terms.

For countries avoid duplicate entries of word tokens.

Adds tests for adding country terms.
2021-06-02 20:22:39 +02:00
Sarah Hoffmann
24c986c842 add tests for new full name computation with ICU 2021-05-24 10:41:42 +02:00
Sarah Hoffmann
4f4d15c28a reorganize keyword creation for legacy tokenizer
- only save partial words without internal spaces
- consider comma and semicolon a separator of full words
- consider parts before an opening bracket a full word
  (but not the part after the bracket)

Fixes #244.
2021-05-24 10:41:42 +02:00
Sarah Hoffmann
430c316e45 test: fix linting errors 2021-05-19 23:07:39 +02:00
Sarah Hoffmann
01f5a9ff84 test: more use of table_factory 2021-05-19 17:37:03 +02:00
Sarah Hoffmann
af52eed0dd test: avoid use of tempfile module
Use the tmp_path fixture instead which provides automatic
cleanup.
2021-05-19 16:43:26 +02:00
Sarah Hoffmann
f93d0fa957 test: use src_dir fixture instead of self-computed paths 2021-05-19 16:03:54 +02:00
Sarah Hoffmann
c06a1d007a test: replace raw execute() with fixture code where possible 2021-05-19 12:11:04 +02:00
Sarah Hoffmann
65bd749918 test: use table_rows() and execute_values() where possible
Some uses of scalar() could also be replaced with convenience
functions from the word table mock.
2021-05-19 10:51:10 +02:00
Sarah Hoffmann
510eb53f53 test: move Testingcursor into separate class
Also adds more convenience functions: counting with a where
statement and a wrapper to execute_values().
2021-05-19 10:30:36 +02:00
Sarah Hoffmann
16bb007135
Merge pull request #2336 from lonvia/do-not-mask-error-when-loading-tokenizer
Do not hide errors when importing tokenizer
2021-05-18 23:00:10 +02:00
Sarah Hoffmann
b2722650d4 do not hide errors when importing tokenizer
Explicitly check for the tokenizer source file to check that
the name is correct. We can't use the import error for that
because it hides other import errors like a missing
library.

Fixes #2327.
2021-05-18 16:28:21 +02:00
AntoJvlt
3206bf59df Resolve conflicts 2021-05-17 13:52:35 +02:00
AntoJvlt
8b8dfc46eb Added --no-replace command for special phrases importation and added corresponding tests 2021-05-17 13:25:06 +02:00
AntoJvlt
06aab389ed Code cleaning and SPLoader deleted 2021-05-16 16:59:12 +02:00
AntoJvlt
fb0ebb5bf0 Add tests for the new SPWikiLoader and SPCsvLoader 2021-05-16 16:10:06 +02:00
Sarah Hoffmann
925726222f
Merge pull request #2323 from darkshredder/disable-search-reverse-only
Feat: Disabled search API for --reverse-only imports
2021-05-14 10:40:22 +02:00
Sarah Hoffmann
7d621389ee adapt tests to new TIGER CSV format 2021-05-14 00:02:50 +02:00
Darkshredder
e5ffc59cd5 feat: Added reverse-only-search validation 2021-05-14 02:36:21 +05:30
Sarah Hoffmann
5feece64c1 use WorkerPool for Tiger data import
Requires adding an option that SQL errors are ignored.
2021-05-13 20:36:50 +02:00
Sarah Hoffmann
f5977dac75 ignore invalid coordinates in external postcodes 2021-05-13 14:15:42 +02:00
Sarah Hoffmann
8f2746fe24 ignore entries without country code 2021-05-13 14:15:42 +02:00
Sarah Hoffmann
1ccd4360b4 correctly handle removing all postcodes for country 2021-05-13 14:15:42 +02:00
Sarah Hoffmann
bf864b2c54 index postcodes after refreshing 2021-05-13 14:15:42 +02:00
Sarah Hoffmann
4abaf71234 add and extend tests for new postcode handling 2021-05-13 14:15:42 +02:00
AntoJvlt
9d83da830f Introduction of SPCsvLoader to load special phrases from a csv file 2021-05-10 23:26:39 +02:00
AntoJvlt
00959fac57 Refactoring loading of external special phrases and importation process by introducing SPLoader and SPWikiLoader 2021-05-10 21:49:31 +02:00
Sarah Hoffmann
18c99a5c5f add unit tests for legacy ICU tokenizer 2021-05-05 10:15:27 +02:00
Sarah Hoffmann
8bdb9aa607 mock tokenizer factory for replication tests 2021-05-01 10:50:39 +02:00
Sarah Hoffmann
388ebcbae2 move index creation for word table to tokenizer
This introduces a finalization routing for the tokenizer
where it can post-process the import if necessary.
2021-04-30 17:41:08 +02:00
Sarah Hoffmann
fc995ea6b9 move database check for module to tokenizer 2021-04-30 17:41:08 +02:00
Sarah Hoffmann
893490f94e add more tests for legacy tokenizer 2021-04-30 17:41:08 +02:00
Sarah Hoffmann
3eb4d88057 boilerplate for PHP code of tokenizer
This adds an installation step for PHP code for the tokenizer. The
PHP code is split in two parts. The updateable code is found in
lib-php. The tokenizer installs an additional script in the
project directory which then includes the code from lib-php and
defines all settings that are static to the database. The website
code then always includes the PHP from the project directory.
2021-04-30 11:31:52 +02:00
Sarah Hoffmann
23fd1d032a tests for legacy tokenizer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
7cb7cf848d move amenity creation to tokenizer
The BDD tests still use the old-style amenity creation scripts
because we don't have simple means to import a hand-crafted
test file of special phrases right now.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
bef300305e move default country name creation to tokenizer
The new function is also used, when a country us updated. All SQL
function related to country names have been removed.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
ffc2d82b0e move postcode normalization into tokenizer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
fa2bc60468 introduce name analyzer
The name analyzer is the actual work horse of the tokenizer. It
is instantiated on a thread-base and provides all functions for
analysing names and queries.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
e1c5673ac3 require tokeinzer for indexer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
9397bf54b8 introduce external processing in indexer
Indexing is now split into three parts: first a preparation step
that collects the necessary information from the database and
returns it to Python. In a second step the data is transformed
within Python as necessary and then returned to the database
through the usual UPDATE which now not only sets the indexed_status
but also other fields. The third step comprises the address
computation which is still done inside the update trigger in
the database.

The second processing step doesn't do anything useful yet.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
fbbdd31399 move word table and normalisation SQL into tokenizer
Creating and populating the word table is now the responsibility
of the tokenizer.

The get_maxwordfreq() function has been replaced with a
simple template parameter to the SQL during function installation.
The number is taken from the parameter list in the database to
ensure that it is not changed after installation.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
296a66558f move module installation to legacy tokenizer 2021-04-30 11:29:57 +02:00
Sarah Hoffmann
af968d4903 introduce tokenizer modules
This adds the boilerplate for selecting configurable tokenizers.
A tokenizer can be chosen at import time and will then install
itself such that it is fixed for the given database import even
when the software itself is updated.

The legacy tokenizer implements Nominatim's traditional algorithms.
2021-04-30 11:29:57 +02:00
Sarah Hoffmann
185d369404 remove support for AUX housenumber tables
These tables have never been actively maintained and the code is
completely untested. With the upcomming changes, it is unlikely
that the code remains usable.

This removes the aux tables and all code that references them.
2021-04-30 10:08:29 +02:00
AntoJvlt
1b68152fb2 reorganization of folder/file for the special phrases importer 2021-04-25 17:57:42 +02:00
Sarah Hoffmann
9685c68e30 replace usages of fromisoformat() with strptime()
fromisoformat was only introduced with Python 3.7 while we
still support Python 3.5.

Fixes #2292.
2021-04-23 22:50:08 +02:00
Sarah Hoffmann
50b6d7298c factor out async connection handling into separate class
Also adds a test for reconnecting regularly while indexing.
2021-04-20 14:08:37 +02:00
Sarah Hoffmann
b88b952f56 simplify token precomputation
Rename function to reflect that it is only used for precomputation.
The token IDs are not really needed, so don't bother to compute
the array of tokens.
2021-04-19 17:24:19 +02:00
Darkshredder
1f898405a6 Fix: tiger-data tarfile test 2021-04-19 16:02:52 +05:30
Sarah Hoffmann
79d55357e8 simplify sql and website creation functions 2021-04-19 10:53:30 +02:00
Sarah Hoffmann
4fa6c0ad53 simplify constructor for SQL preprocessor
Use sql path from config.
2021-04-19 10:26:25 +02:00
Sarah Hoffmann
8f63f9516b simplify interface for adding tiger data
Also simplifies tests using existing fixtures.
2021-04-19 10:26:25 +02:00
AntoJvlt
b2ae715699 Only log a warning if a wrong input is detected on the wiki while importing special phrases 2021-04-17 20:19:39 +02:00
AntoJvlt
ec859e41c6 Cleaned tests and add database cleaning tests on test_import_from_wiki 2021-04-17 19:23:33 +02:00
Sarah Hoffmann
2ca11ccc6b add tests for continuing import 2021-04-17 11:10:36 +02:00
Sarah Hoffmann
0f11e311c4 add test for new postcode import function 2021-04-16 16:11:20 +02:00
Sarah Hoffmann
c64193f839
Merge pull request #2263 from AntoJvlt/special-phrases-autoupdate
Implemented auto update of special phrases while importing them
2021-04-15 10:13:25 +02:00
Darkshredder
49ee7505ed Fix: Removed error if endstatement is wrong and improved tests 2021-04-13 15:44:12 +05:30
AntoJvlt
ae2b2cb9a5 Tests added for the auto update of special phrases during import 2021-04-12 14:35:29 +02:00
AntoJvlt
e82de99e5a Cleaned tests of exceptions and fix phrase_settings.json test file name. 2021-03-29 22:07:29 +02:00
AntoJvlt
57ce75eb67 Change command 'import-special-phrases --from-wiki' to 'special-phrases --import-from-wiki'. 2021-03-26 02:22:38 +01:00
AntoJvlt
cde9389e75 Errors fixes, Cleaning code, Improvement and addition of tests 2021-03-26 01:53:33 +01:00
AntoJvlt
2c19bd5ea3 Encapsulation of tools/special_phrases.py into SpecialPhrasesImporter class and add new tests. 2021-03-25 21:13:57 +01:00
AntoJvlt
ff34198569 Code cleaning, tests simplification and use of python3-icu package 2021-03-23 23:56:39 +01:00
AntoJvlt
1ce8b530cd Introduction of PyICU for transliteration in python. Reversed changes in normalization.sql. 2021-03-23 23:34:16 +01:00
AntoJvlt
17cb59efbd Ported functions for the import of special phrases from php to python.
- the command is now --import-special-phrases
- the output is not an sql file anymore, data are directly imported to the database.
- the little part on the documentation (section data import) has been modified.
2021-03-20 19:11:50 +01:00
Darkshredder
077a8c1f95 refactored tests and made changes to code for easy readibility 2021-03-12 18:23:20 +05:30
Darkshredder
7a874d5b97 Ported createCountryNames() to python and added tests 2021-03-12 10:28:41 +05:30
Darkshredder
e5719de657 Added fixture for sql_preprocessor and fixed some issues 2021-03-11 15:39:17 +05:30
Darkshredder
8486a83cf5 Added test for tarfile 2021-03-10 18:14:17 +05:30
Darkshredder
ccfad57fca Added test and removed runlegacyscript 2021-03-10 17:18:12 +05:30
Sarah Hoffmann
09f4d767e4 port index creation to python
Also switches to jinja-based preprocessing, which allows to
simplify the SQL files. Use 'if not exists' where possible
so that the step can be rerun to fix missing indexes.
2021-03-04 11:11:47 +01:00
Sarah Hoffmann
eacabb0e96 move table creation to jinja-based preprocessing 2021-03-03 22:07:51 +01:00
Sarah Hoffmann
d2bd6aa78d introduce jinja2 for preprocessing SQL
Replaces various hand-crafted replacements of varying format with
a single Jinja2 templating mechanism. Allows full access to
configuration if necessary.
2021-03-03 17:51:08 +01:00
Sarah Hoffmann
7ae9c3a9f0 add database_version setting to tests 2021-03-01 21:49:33 +01:00
Sarah Hoffmann
3a0a4b9175 save software version in the database
The version represents the software version that was used to
import the data.
2021-03-01 20:35:15 +01:00
Sarah Hoffmann
db663dd92f remove unused import 2021-03-01 09:26:08 +01:00
Sarah Hoffmann
90a5d23016 use tmp_path fixture in config tests 2021-03-01 09:24:04 +01:00
Sarah Hoffmann
afabbeb546 older versions of Postgresql need explicit return type 2021-02-27 09:46:42 +01:00
Sarah Hoffmann
15b5906790 move setup function to python
There are still back-calls to PHP for some of the sub-steps.
These needs some larger refactoring to be moved to Python.
2021-02-26 15:02:39 +01:00
Sarah Hoffmann
3c186f8030 add a function for the intial indexing run
Also moves postcodes to fully parallel indexing.
2021-02-25 18:42:54 +01:00
Sarah Hoffmann
c7fd0a7af4 port wikipedia importance functions to python 2021-02-25 18:42:54 +01:00
Sarah Hoffmann
32683f73c7 move import-data option to native python
This adds a new dependecy to the Python psutil package.
2021-02-25 18:42:54 +01:00
Sarah Hoffmann
7222235579 introduce custom object for cmdline arguments
Allows to define special functions over the arguments.

Also splits CLI tests in two files as they have become too many.
2021-02-25 18:42:54 +01:00
Sarah Hoffmann
f6e894a53a port database setup function to python
Hide the former PHP functions in a transition command until
they are removed.
2021-02-25 18:42:54 +01:00
Sarah Hoffmann
b93ec2522e use psql for executing sql files
This allows to run larger files without needing to keep
them in memory.
2021-02-25 18:42:54 +01:00
Sarah Hoffmann
af7226393a add function to set up libpq environment
Instead of parsing the DSN for each external libpq program we
are going to execute, provide a function that feeds them all
necessary parameters through the environment.

osm2pgsql is the first user.
2021-02-25 18:42:54 +01:00
Sarah Hoffmann
e520613362 convert connect() into a context manager 2021-02-25 18:42:54 +01:00
Sarah Hoffmann
a1f0fc1a10 improve deadlock detection for various versions of psycopg2
Psycopg2 has changed the kind of exception that is emitted on
deadlocks between versions 2.7 and 2.8. The code was already
trying to catch both kind of errors but because the
psycopg2.errors package is unknown in 2.7 and below, the
code would throw an exception on anything but a deadlock error.

This commit wraps the deadlock handling into a context manager
to avoid code duplication and uses module imports to detect if
the new error codes are available.

Also sets the required psycopg2 version to 2.7 or bigger as
versions below are difficult to test.
2021-02-25 18:11:16 +01:00