Commit Graph

534 Commits

Author SHA1 Message Date
Sarah Hoffmann
e85f7e7aa9 fix subsequent replacements
Two replacement words directly following each other did not
work as expected because each expects a space at the
beginning/end while there was only one space available.

Also forbid composing a word after a space was added in the
end by a previous replacement.
2021-07-04 10:28:28 +02:00
Sarah Hoffmann
b9fbfeff67 only consider partials in multi-words for initial count
This ensures that it is less likely that we exclude meaningful
words like 'hauptstrasse' just because they are frequent.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
62828fc5c1 switch to a more flexible variant description format
The new format combines compound splitting and abbreviation.
It also allows restricting rules to additional conditions
(like language or region). This latter ability is not used
yet.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
a6aa6360e0 use yaml tag syntax to mark include files 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
0d80a9b897 tests for composing decomposed suffixes 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
f70930b1a0 make compound decomposition a pure import feature
Compound decomposition now creates a full name variant on
import just like abbreviations. This simplifies query time
normalization and opens a path for changing abbreviation
and compound decomposition lists for an existing database.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
9ff4f66f55 complete tests for icu tokenizer 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e81084f35 complete tests for rule loader 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
a0a7b05c9f correctly quote strings when copying in data
Encapsulate the copy string in a class that ensures that
copy lines are written with correct quoting.
2021-07-04 10:28:20 +02:00
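A minimal sketch of the idea behind this commit, assuming the data is sent in PostgreSQL's text COPY format (tab-separated columns, backslash escapes, \N for NULL). The class name CopyLineBuffer and its methods are hypothetical and are not taken from the commit itself.

```python
import io


# Hypothetical helper for illustration, not the project's actual class.
class CopyLineBuffer:
    """Collects rows for PostgreSQL COPY ... FROM STDIN in text format."""

    # Characters that must be backslash-escaped in COPY text format.
    _ESCAPES = str.maketrans({
        '\\': '\\\\',
        '\t': '\\t',
        '\n': '\\n',
        '\r': '\\r',
    })

    def __init__(self):
        self._buffer = io.StringIO()

    def add(self, *columns):
        """Append one row; None is written as the NULL marker \\N."""
        fields = ('\\N' if col is None else str(col).translate(self._ESCAPES)
                  for col in columns)
        self._buffer.write('\t'.join(fields) + '\n')

    def getvalue(self):
        """Return everything collected so far as one string."""
        return self._buffer.getvalue()


buf = CopyLineBuffer()
buf.add('main\tstreet', None, 5)
buf.add('back\\slash', 'two\nlines', '')
```

Keeping the escaping in one place means callers cannot forget it when building copy lines by hand.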
Sarah Hoffmann
2f6e4edcdb update unit tests for adapted abbreviation code 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e3c5d4c5b adapt tests for ICU tokenizer 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
8413075249 move abbreviation computation into import phase
This adds precomputation of abbreviated terms for names and removes
abbreviation of terms in the query. Basic import works but still
needs some thorough testing as well as speed improvements during
import.

New dependency for python library datrie.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
e7b4fc70e7 make sure old data gets deleted on place type change
When changing from some other place type to place=postcode
make sure that the old place type entry in the place table
is deleted.
2021-06-18 10:58:41 +02:00
Sarah Hoffmann
457982e1d2 update postcode in place if it already exists 2021-06-18 00:28:52 +02:00
Sarah Hoffmann
aa558e6080
Merge pull request #2369 from lonvia/exclude-poi-from-housenumber-search
Do not return POIs when dropping house number in query
2021-06-17 15:30:05 +02:00
Sarah Hoffmann
fe11d3cbbd do not return POIs when dropping house number in query
We've previously added searching through rank 30 in a house
number search to enable searches for house number+name.
This had the unintended side effect that rank 30 objects
are also returned in a search that dropped the house number
from the query. This is wrong because POIs cannot function
as a parent to a house number.

This fix drops all rank 30 objects from the results for a
house number search if they do not match the requested house
number.
2021-06-17 14:21:20 +02:00
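The rule described in this commit can be sketched roughly as follows. This is an illustration only: the Result tuple, its field names and the function name are assumptions, and the real filtering happens inside the search code rather than on a Python list.

```python
from collections import namedtuple

# Hypothetical result record; field names are assumptions.
Result = namedtuple('Result', ['name', 'rank', 'housenumber'])


def filter_housenumber_results(results, requested_housenumber):
    """Drop rank 30 objects that do not carry the requested house number."""
    return [r for r in results
            if r.rank != 30 or r.housenumber == requested_housenumber]


results = [
    Result('Main Street', 26, None),     # street: valid parent, kept
    Result('Cafe Central', 30, None),    # POI without the number, dropped
    Result('Main Street 5', 30, '5'),    # matches the house number, kept
]
kept = filter_housenumber_results(results, '5')
```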
AntoJvlt
3676310efe Improved performance of the postcodes query and some code cleaning 2021-06-12 15:46:08 +02:00
AntoJvlt
1c175e3a67 Clean and update tests for postcodes 2021-06-09 09:31:32 +02:00
AntoJvlt
e879814e43 Update tests for postcodes 2021-06-09 09:31:32 +02:00
Sarah Hoffmann
3aac51c81f switch BDD tests to always use search API 2021-06-06 15:27:52 +02:00
Sarah Hoffmann
bc981d0261 fix insertion of special terms and countries into word table
Special terms need to be prefixed by a space because they are
full terms.

For countries avoid duplicate entries of word tokens.

Adds tests for adding country terms.
2021-06-02 20:22:39 +02:00
Sarah Hoffmann
24c986c842 add tests for new full name computation with ICU 2021-05-24 10:41:42 +02:00
Sarah Hoffmann
4f4d15c28a reorganize keyword creation for legacy tokenizer
- only save partial words without internal spaces
- consider comma and semicolon a separator of full words
- consider parts before an opening bracket a full word
  (but not the part after the bracket)

Fixes #244.
2021-05-24 10:41:42 +02:00
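The three rules listed in this commit might look like this in a simplified form. The function make_keywords is hypothetical, and treating each complete comma- or semicolon-separated part as a full word is an assumption; the actual keyword creation works on normalized terms and may differ in detail.

```python
import re


def make_keywords(name):
    """Return (full_words, partial_words) for a name. Illustrative only."""
    full_words = set()
    partial_words = set()

    # Commas and semicolons separate independent full words.
    for part in re.split(r'[,;]', name):
        part = part.strip()
        if not part:
            continue
        full_words.add(part)
        # The text before an opening bracket is a full word of its own;
        # the text after the bracket is not.
        if '(' in part:
            head = part.split('(', 1)[0].strip()
            if head:
                full_words.add(head)
        # Partial words are single tokens and never contain spaces.
        partial_words.update(re.findall(r'[^\s()]+', part))

    return full_words, partial_words


full, partial = make_keywords('alte post (ehemalig); old post office')
```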
Sarah Hoffmann
10143e0ac7
Merge pull request #2342 from lonvia/icu-tokenizer-ci
Add BDD tests with icu tokenizer to CI runs
2021-05-22 10:36:35 +02:00
Sarah Hoffmann
00094c43d1 enable Tiger BDD API test for legacy_icu 2021-05-21 22:39:56 +02:00
Sarah Hoffmann
430c316e45 test: fix linting errors 2021-05-19 23:07:39 +02:00
Sarah Hoffmann
01f5a9ff84 test: more use of table_factory 2021-05-19 17:37:03 +02:00
Sarah Hoffmann
af52eed0dd test: avoid use of tempfile module
Use the tmp_path fixture instead which provides automatic
cleanup.
2021-05-19 16:43:26 +02:00
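For illustration, a test using pytest's built-in tmp_path fixture could look like the sketch below. The function write_settings and the file content are made up for this example; only the tmp_path fixture itself comes from pytest.

```python
from pathlib import Path


def write_settings(directory: Path) -> Path:
    """Hypothetical code under test: writes a small settings file."""
    target = directory / 'settings.txt'
    target.write_text('tokenizer=legacy\n')
    return target


# pytest injects `tmp_path`, a pathlib.Path pointing to a fresh directory
# that is unique to this test, and manages the directory's lifetime itself.
def test_write_settings(tmp_path):
    result = write_settings(tmp_path)
    assert result.parent == tmp_path
    assert result.read_text() == 'tokenizer=legacy\n'
```

Compared with calling the tempfile module directly, the test body needs no setup or teardown code of its own.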
Sarah Hoffmann
f93d0fa957 test: use src_dir fixture instead of self-computed paths 2021-05-19 16:03:54 +02:00
Sarah Hoffmann
c06a1d007a test: replace raw execute() with fixture code where possible 2021-05-19 12:11:04 +02:00
Sarah Hoffmann
65bd749918 test: use table_rows() and execute_values() where possible
Some uses of scalar() could also be replaced with convenience
functions from the word table mock.
2021-05-19 10:51:10 +02:00
Sarah Hoffmann
510eb53f53 test: move Testingcursor into separate class
Also adds more convenience functions: counting with a where
statement and a wrapper to execute_values().
2021-05-19 10:30:36 +02:00
Sarah Hoffmann
16bb007135
Merge pull request #2336 from lonvia/do-not-mask-error-when-loading-tokenizer
Do not hide errors when importing tokenizer
2021-05-18 23:00:10 +02:00
Sarah Hoffmann
b2722650d4 do not hide errors when importing tokenizer
Explicitly check for the tokenizer source file to check that
the name is correct. We can't use the import error for that
because it hides other import errors like a missing
library.

Fixes #2327.
2021-05-18 16:28:21 +02:00
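A rough sketch of the approach described in this commit. The function name, the `<name>_tokenizer.py` file naming and the UsageError class are assumptions made for the example, not the project's confirmed API.

```python
import importlib
from pathlib import Path


class UsageError(Exception):
    """Raised for problems the user can fix (illustrative)."""


def load_tokenizer_module(name, package, source_dir):
    """Import a tokenizer module after checking that its source file exists."""
    # Check the file first so that a misspelled name gets a clear message.
    if not (Path(source_dir) / (name + '_tokenizer.py')).exists():
        raise UsageError("Tokenizer '{}' not found.".format(name))
    # An ImportError raised here (for example a missing library) is genuine
    # and propagates unchanged instead of being reported as a bad name.
    return importlib.import_module('{}.{}_tokenizer'.format(package, name))
```

Catching ImportError around the import instead would make a missing dependency look the same as a wrong tokenizer name, which is the problem this commit describes.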
AntoJvlt
3206bf59df Resolve conflicts 2021-05-17 13:52:35 +02:00
AntoJvlt
8b8dfc46eb Added --no-replace command for special phrases importation and added corresponding tests 2021-05-17 13:25:06 +02:00
AntoJvlt
06aab389ed Code cleaning and SPLoader deleted 2021-05-16 16:59:12 +02:00
AntoJvlt
fb0ebb5bf0 Add tests for the new SPWikiLoader and SPCsvLoader 2021-05-16 16:10:06 +02:00
Sarah Hoffmann
925726222f
Merge pull request #2323 from darkshredder/disable-search-reverse-only
Feat: Disabled search API for --reverse-only imports
2021-05-14 10:40:22 +02:00
Sarah Hoffmann
7d621389ee adapt tests to new TIGER CSV format 2021-05-14 00:02:50 +02:00
Darkshredder
e5ffc59cd5 feat: Added reverse-only-search validation 2021-05-14 02:36:21 +05:30
Sarah Hoffmann
5feece64c1 use WorkerPool for Tiger data import
Requires adding an option to ignore SQL errors.
2021-05-13 20:36:50 +02:00
Sarah Hoffmann
f5977dac75 ignore invalid coordinates in external postcodes 2021-05-13 14:15:42 +02:00
Sarah Hoffmann
8f2746fe24 ignore entries without country code 2021-05-13 14:15:42 +02:00
Sarah Hoffmann
1ccd4360b4 correctly handle removing all postcodes for country 2021-05-13 14:15:42 +02:00
Sarah Hoffmann
bf864b2c54 index postcodes after refreshing 2021-05-13 14:15:42 +02:00
Sarah Hoffmann
4abaf71234 add and extend tests for new postcode handling 2021-05-13 14:15:42 +02:00
AntoJvlt
9d83da830f Introduction of SPCsvLoader to load special phrases from a csv file 2021-05-10 23:26:39 +02:00
AntoJvlt
00959fac57 Refactoring loading of external special phrases and importation process by introducing SPLoader and SPWikiLoader 2021-05-10 21:49:31 +02:00
Sarah Hoffmann
b2c6eca2c8 add missing transliterations
The ICU library only offers transliterations for a limited set of
scripts. Add transliterations for missing scripts from the PostgreSQL
module. This means that the same selection of scripts is supported
as with the old module.
2021-05-05 21:16:55 +02:00
Sarah Hoffmann
a263e54b94 enable BDD tests for different tokenizers
The tokenizer to be used can be chosen with -DTOKENIZER.

Adapt all tests, so that they work with legacy_icu tokenizer.
Move lookup in word table to a function in the tokenizer.
Special phrases are temporarily imported from the wiki until
we have an implementation that can import from file. TIGER
tests do not work yet.
2021-05-05 10:31:51 +02:00
Sarah Hoffmann
18c99a5c5f add unit tests for legacy ICU tokenizer 2021-05-05 10:15:27 +02:00
Sarah Hoffmann
8bdb9aa607 mock tokenizer factory for replication tests 2021-05-01 10:50:39 +02:00
Sarah Hoffmann
388ebcbae2 move index creation for word table to tokenizer
This introduces a finalization routine for the tokenizer
where it can post-process the import if necessary.
2021-04-30 17:41:08 +02:00
Sarah Hoffmann
fc995ea6b9 move database check for module to tokenizer 2021-04-30 17:41:08 +02:00
Sarah Hoffmann
be6262c6ce move status test to tokenizer
The availability of the module is now tested by the tokenizer.
2021-04-30 17:41:08 +02:00
Sarah Hoffmann
893490f94e add more tests for legacy tokenizer 2021-04-30 17:41:08 +02:00
Sarah Hoffmann
044bb6afa5 move tokenization in query into tokenizer 2021-04-30 17:41:08 +02:00
Sarah Hoffmann
3eb4d88057 boilerplate for PHP code of tokenizer
This adds an installation step for PHP code for the tokenizer. The
PHP code is split in two parts. The updateable code is found in
lib-php. The tokenizer installs an additional script in the
project directory which then includes the code from lib-php and
defines all settings that are static to the database. The website
code then always includes the PHP from the project directory.
2021-04-30 11:31:52 +02:00
Sarah Hoffmann
23fd1d032a tests for legacy tokenizer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
7cb7cf848d move amenity creation to tokenizer
The BDD tests still use the old-style amenity creation scripts
because we don't have simple means to import a hand-crafted
test file of special phrases right now.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
bef300305e move default country name creation to tokenizer
The new function is also used when a country is updated. All SQL
functions related to country names have been removed.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
ffc2d82b0e move postcode normalization into tokenizer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
fa2bc60468 introduce name analyzer
The name analyzer is the actual workhorse of the tokenizer. It
is instantiated once per thread and provides all functions for
analysing names and queries.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
e1c5673ac3 require tokenizer for indexer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
9397bf54b8 introduce external processing in indexer
Indexing is now split into three parts: first a preparation step
that collects the necessary information from the database and
returns it to Python. In a second step the data is transformed
within Python as necessary and then returned to the database
through the usual UPDATE which now not only sets the indexed_status
but also other fields. The third step comprises the address
computation which is still done inside the update trigger in
the database.

The second processing step doesn't do anything useful yet.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
fbbdd31399 move word table and normalisation SQL into tokenizer
Creating and populating the word table is now the responsibility
of the tokenizer.

The get_maxwordfreq() function has been replaced with a
simple template parameter to the SQL during function installation.
The number is taken from the parameter list in the database to
ensure that it is not changed after installation.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
296a66558f move module installation to legacy tokenizer 2021-04-30 11:29:57 +02:00
Sarah Hoffmann
af968d4903 introduce tokenizer modules
This adds the boilerplate for selecting configurable tokenizers.
A tokenizer can be chosen at import time and will then install
itself such that it is fixed for the given database import even
when the software itself is updated.

The legacy tokenizer implements Nominatim's traditional algorithms.
2021-04-30 11:29:57 +02:00
Sarah Hoffmann
185d369404 remove support for AUX housenumber tables
These tables have never been actively maintained and the code is
completely untested. With the upcoming changes, it is unlikely
that the code remains usable.

This removes the aux tables and all code that references them.
2021-04-30 10:08:29 +02:00
Sarah Hoffmann
46e8c6b112
Merge pull request #2291 from AntoJvlt/special-phrases-statistics
Special phrases statistics
2021-04-27 11:57:05 +02:00
Sarah Hoffmann
1fd483643b add tests for different scripts 2021-04-26 23:01:06 +02:00
AntoJvlt
1b68152fb2 reorganization of folder/file for the special phrases importer 2021-04-25 17:57:42 +02:00
Sarah Hoffmann
9685c68e30 replace usages of fromisoformat() with strptime()
fromisoformat was only introduced with Python 3.7 while we
still support Python 3.5.

Fixes #2292.
2021-04-23 22:50:08 +02:00
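The kind of replacement this commit describes can be sketched as follows. The timestamp format string and the function name are assumptions chosen for the example; the formats actually parsed in the code base may differ.

```python
from datetime import datetime, timezone

# Assumed input format for this example: UTC timestamps with a trailing 'Z'.
ISO_FORMAT = '%Y-%m-%dT%H:%M:%SZ'


def parse_timestamp(value):
    """Parse a UTC timestamp such as '2021-04-23T20:50:08Z'."""
    # datetime.fromisoformat() needs Python 3.7 or later;
    # strptime() is also available on older versions.
    return datetime.strptime(value, ISO_FORMAT).replace(tzinfo=timezone.utc)


stamp = parse_timestamp('2021-04-23T20:50:08Z')
```

The explicit format string is stricter than fromisoformat(): input that does not match it exactly raises ValueError.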
Sarah Hoffmann
788baafa26 bdd tests: fix place-dependent ranking tests
The ranks of places may differ for some countries. Force the
place nodes in the test on null island which always uses the
default ranking.
2021-04-22 17:31:00 +02:00
Sarah Hoffmann
50b6d7298c factor out async connection handling into separate class
Also adds a test for reconnecting regularly while indexing.
2021-04-20 14:08:37 +02:00
Sarah Hoffmann
b88b952f56 simplify token precomputation
Rename function to reflect that it is only used for precomputation.
The token IDs are not really needed, so don't bother to compute
the array of tokens.
2021-04-19 17:24:19 +02:00
Darkshredder
1f898405a6 Fix: tiger-data tarfile test 2021-04-19 16:02:52 +05:30
Sarah Hoffmann
79d55357e8 simplify sql and website creation functions 2021-04-19 10:53:30 +02:00
Sarah Hoffmann
4fa6c0ad53 simplify constructor for SQL preprocessor
Use sql path from config.
2021-04-19 10:26:25 +02:00
Sarah Hoffmann
8f63f9516b simplify interface for adding tiger data
Also simplifies tests using existing fixtures.
2021-04-19 10:26:25 +02:00
AntoJvlt
b2ae715699 Only log a warning if a wrong input is detected on the wiki while importing special phrases 2021-04-17 20:19:39 +02:00
AntoJvlt
ec859e41c6 Cleaned tests and add database cleaning tests on test_import_from_wiki 2021-04-17 19:23:33 +02:00
Sarah Hoffmann
2ca11ccc6b add tests for continuing import 2021-04-17 11:10:36 +02:00
Sarah Hoffmann
0f11e311c4 add test for new postcode import function 2021-04-16 16:11:20 +02:00
Sarah Hoffmann
c64193f839
Merge pull request #2263 from AntoJvlt/special-phrases-autoupdate
Implemented auto update of special phrases while importing them
2021-04-15 10:13:25 +02:00
Darkshredder
49ee7505ed Fix: Removed error if endstatement is wrong and improved tests 2021-04-13 15:44:12 +05:30
AntoJvlt
ae2b2cb9a5 Tests added for the auto update of special phrases during import 2021-04-12 14:35:29 +02:00
Sarah Hoffmann
16a66b5326 move transliteration of housenumbers into indexing
Housenumbers are now saved in transliterated form in the housenumber
column. This saves the transliteration step during lookup.
2021-04-04 15:26:47 +02:00
Sarah Hoffmann
3590e76a1c tests for finding non-ascii housenumbers 2021-04-04 15:26:47 +02:00
Darkshredder
0f9df32d11 Added Test for TokenSpecialTerm 2021-04-02 04:49:05 +05:30
AntoJvlt
e82de99e5a Cleaned tests of exceptions and fix phrase_settings.json test file name. 2021-03-29 22:07:29 +02:00
Sarah Hoffmann
09b2510219
Merge pull request #2228 from AntoJvlt/import-special-phrases-porting-python
Import special phrases porting python
2021-03-29 09:49:35 +02:00
AntoJvlt
57ce75eb67 Change command 'import-special-phrases --from-wiki' to 'special-phrases --import-from-wiki'. 2021-03-26 02:22:38 +01:00
AntoJvlt
cde9389e75 Errors fixes, Cleaning code, Improvement and addition of tests 2021-03-26 01:53:33 +01:00
AntoJvlt
2c19bd5ea3 Encapsulation of tools/special_phrases.py into SpecialPhrasesImporter class and add new tests. 2021-03-25 21:13:57 +01:00
AntoJvlt
ff34198569 Code cleaning, tests simplification and use of python3-icu package 2021-03-23 23:56:39 +01:00
AntoJvlt
1ce8b530cd Introduction of PyICU for transliteration in python. Reversed changes in normalization.sql. 2021-03-23 23:34:16 +01:00
AntoJvlt
9d1c23e4f5 Updated specialphrases_testdb.sql 2021-03-20 19:17:03 +01:00
AntoJvlt
17cb59efbd Ported functions for the import of special phrases from php to python.
- the command is now --import-special-phrases
- the output is not an sql file anymore, data are directly imported to the database.
- the short part of the documentation (section data import) has been updated.
2021-03-20 19:11:50 +01:00