Sarah Hoffmann
94d3dee369
further increase penalty on housenumbers without numbers
...
Make the penalty dependent on the length of the token:
no penalty for one-letter house numbers and an increasing
penalty for longer ones.
2021-09-02 18:11:49 +02:00
Sarah Hoffmann
7e7dd769fd
remove language and partition from name import
2021-09-02 14:41:11 +02:00
Sarah Hoffmann
79da96b369
read partition and languages from config file
2021-09-02 14:41:11 +02:00
Sarah Hoffmann
78fcabade8
move country name generation to country_info module
2021-09-02 14:41:11 +02:00
Sarah Hoffmann
284645f505
move generation of country tables into own module
2021-09-02 14:41:11 +02:00
Sarah Hoffmann
0b349761a8
add country configuration
...
The new configuration saves the default language(s) originally
maintained in the OSM wiki as well as the partition information.
2021-09-02 14:41:11 +02:00
Sarah Hoffmann
d18794931a
Merge pull request #2435 from lonvia/simplified-to-traditional-chinese
...
icu: normalise simplified to traditional chinese
2021-08-31 15:29:26 +02:00
Sarah Hoffmann
b7d4ff3201
icu: normalise simplified to traditional chinese
...
The conversion is unambiguous in most cases, so that the
information loss is minimal.
2021-08-31 11:18:34 +02:00
Sarah Hoffmann
4c6d674e03
Merge pull request #2434 from lonvia/vagrant-scripts-in-actions
...
Test installation instructions via CI
2021-08-29 10:11:59 +02:00
Sarah Hoffmann
2c97af8021
CI: use packaged source also for test runs
2021-08-24 10:10:01 +02:00
Sarah Hoffmann
832f75a55e
CI: unify jobs for different vagrant scripts
2021-08-24 10:10:01 +02:00
Sarah Hoffmann
4e77969545
add workflow for centos 8
2021-08-24 10:10:01 +02:00
Sarah Hoffmann
6ebbbfee61
CI: use vagrant scripts for import tests
...
Use vanilla docker images of Ubuntu and leave the setup
to the vagrant scripts. Then do the usual import tests.
Also fixes a couple of issues found with the scripts
2021-08-24 10:10:01 +02:00
Sarah Hoffmann
0fabeefc3e
Merge pull request #2432 from Mastercuber/patch-1
...
Added postcode
2021-08-22 09:32:31 +02:00
Mastercuber
c70d72f06b
Added postcode
...
Added postcode to the list of addressdetails
2021-08-22 02:52:41 +02:00
Sarah Hoffmann
cc141bf1a5
Add link to fixthemap to issue template
2021-08-21 20:36:16 +02:00
Sarah Hoffmann
199532c802
Merge pull request #2429 from lonvia/place-name-to-admin-boundary
...
Indexing: move linking of places to the preparation stage
2021-08-21 10:21:39 +02:00
Sarah Hoffmann
28ee3d0949
move linking of places to the preparation stage
...
Linked places may bring in extra names. These names need to be
processed by the tokenizer. That means that the linking needs
to be done before the data is handed to the tokenizer. Move finding
the linked place into the preparation stage and update the name
fields. Everything else is still done in the indexing stage.
2021-08-20 22:44:17 +02:00
Sarah Hoffmann
925195725d
Merge pull request #2428 from lonvia/rename-icu-tokenizer
...
Rename legacy_icu tokenizer to icu tokenizer
2021-08-18 15:02:19 +02:00
Sarah Hoffmann
f6d22df76e
adapt CI workflow to new tokenizer name
2021-08-18 09:08:20 +02:00
Sarah Hoffmann
118858a55e
rename legacy_icu tokenizer to icu tokenizer
...
The new icu tokenizer is now no longer compatible with the old
legacy tokenizer in terms of data structures. Therefore there
is also no longer a need to refer to the legacy tokenizer in the
name.
2021-08-17 23:11:47 +02:00
Sarah Hoffmann
656c1291b1
Merge pull request #2427 from lonvia/remove-us-states-special-casing
...
Move US state hack into legacy tokenizer
2021-08-17 21:55:32 +02:00
Sarah Hoffmann
f00b8dd1c3
move special hack for US states to legacy tokenizer
...
The hack for IL, AL and LA is only needed because these abbreviations
are removed by the legacy tokenizer as stop words. There is no need
to keep the hack for future tokenizers. Move it therefore to the
token extraction function.
2021-08-17 14:28:55 +02:00
Sarah Hoffmann
5f2b9e317a
add tests for US state hacks
...
IL, AL and LA are replaced with the US state in Geocode because
the old tokenizer would simply remove the abbreviations otherwise.
2021-08-17 10:49:07 +02:00
Sarah Hoffmann
4ae5ba7fc4
Merge pull request #2425 from lonvia/tokenizer-documentation
...
Introduce official Tokenizer API
2021-08-17 09:38:03 +02:00
Sarah Hoffmann
3656eed9ad
add mkdocstrings requirement for building docs
...
mkdocstrings also needs access to the Python sources, so set
a PYTHONPATH accordingly. This makes running mkdocs directly
a bit awkward, therefore add a `make serve-doc` target.
2021-08-16 11:51:49 +02:00
Sarah Hoffmann
2e82a6ce03
docs: extend explanation of query phrase
2021-08-16 11:51:49 +02:00
Sarah Hoffmann
c4b8a3b768
add documentation for PHP part of tokenizer
2021-08-16 11:51:49 +02:00
Sarah Hoffmann
1147b83b22
php: make word list a first-class object
...
This separates the logic of creating word sets from the Phrase
class. A tokenizer may now derive the word sets any way they
like. The SimpleWordList class provides a standard implementation
for splitting phrases on spaces.
2021-08-16 11:51:49 +02:00
Sarah Hoffmann
0fb8eade13
remove country restriction from tokenizer
...
Restricting tokens due to the search context is better done in
the generic search part instead of repeating the same test in
every tokenizer implementation.
2021-08-16 11:41:54 +02:00
Sarah Hoffmann
78d11fe628
document tokenizer SQL interface
2021-08-16 11:41:54 +02:00
Sarah Hoffmann
90b40fc3e6
define formal public Python interface for tokenizer
...
This introduces an abstract class for the Tokenizer/Analyzer
for documentation purposes.
2021-08-16 11:41:54 +02:00
Sarah Hoffmann
e25e268e2e
docs: querying and tokenizers
2021-08-16 08:59:44 +02:00
Sarah Hoffmann
68bff31cc9
docs: add developer doc page for Tokenizer
2021-08-16 08:58:56 +02:00
Sarah Hoffmann
31d9545702
Merge pull request #2424 from lonvia/multi-country-import
...
Update instructions for importing multiple regions
2021-08-16 08:48:28 +02:00
Sarah Hoffmann
e449071a35
Merge pull request #2423 from hummeltech/patch-1
...
Fix old paths for `phpcs` when using `make test`
2021-08-15 22:00:50 +02:00
Sarah Hoffmann
23e3724abb
ignore words without id for status
2021-08-15 21:59:36 +02:00
Sarah Hoffmann
75a5c7013f
split up large setup function
2021-08-15 12:24:13 +02:00
Sarah Hoffmann
56d24085f9
port multi-region update scripts to nominatim tool
...
Also updates the documentation. For the simple case of just
importing multiple regions, provide simplified instructions
that use the new multi-file import feature.
Fixes #2365.
2021-08-14 23:55:48 +02:00
Sarah Hoffmann
95b82af42a
update osm2pgsql to 1.5.1
2021-08-14 22:46:35 +02:00
Sarah Hoffmann
87dedde5d6
allow multiple files for the import command
...
The files are forwarded to osm2pgsql which is now able to merge
them correctly.
2021-08-14 21:42:21 +02:00
David Hummel
8b6489c60e
Fix old paths for `phpcs` when using `make test`
...
These paths no longer exist since db3ced17bb; they are now all located under `lib-php`.
2021-08-12 13:34:18 -07:00
Sarah Hoffmann
bf4f05fff3
Merge pull request #2413 from osm-search/helm-chart
...
Installation docs - link to Kubernetes install project
2021-08-08 11:09:36 +02:00
mtmail
b0aaa25f0d
Installation docs - link to Kubernetes install project
...
As reported by @robjuz in https://github.com/osm-search/Nominatim/discussions/2412
2021-08-03 12:02:35 +02:00
Sarah Hoffmann
c3ddc7579a
Merge pull request #2408 from lonvia/icu-change-word-table-layout
...
Change table layout of word table for ICU tokenizer
2021-07-28 14:28:49 +02:00
Sarah Hoffmann
fdff579188
php: force use of global Exception class
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
d48793c22c
fix Python linting errors
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
001b2aa9f9
fix linting issues in PHP
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
1db098c05d
reinstate word column in icu word table
...
PostgreSQL is very bad at creating statistics for jsonb
columns. The result is that the query planner tends to
use JIT for queries with a where clause over 'info' even when
there is an index.
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
324b1b5575
bdd tests: do not query word table directly
...
The BDD tests cannot make assumptions about the structure of the
word table anymore because it depends on the tokenizer. Use more
abstract descriptions instead that ask for specific kinds of
tokens.
2021-07-28 11:31:47 +02:00