Commit Graph

256 Commits

Author SHA1 Message Date
Sarah Hoffmann
3683cf7ddc optimise tag match function 2022-11-10 09:38:25 +01:00
Sarah Hoffmann
51ed55cc32 initial flex import scripts
Only implements the extratags style for the moment. Tests pass
for the same behaviour as the gazetteer output. Updates still need
to be done.
2022-11-10 09:37:38 +01:00
Sarah Hoffmann
536f08f33a ignore 5+ postcodes in the US for now
Hierarchical postcodes need a different treatment.
2022-06-24 19:24:22 +02:00
Sarah Hoffmann
e86db3001f fix postcode pattern for Mozambique
Optional groups are not implemented yet.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
ca7b46511d introduce and use analyzer for postcodes 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
18864afa8a postcodes: introduce a default pattern for countries without postcodes 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
9cf700e85d add postcodes for most of the remaining countries
Now includes all postcodes that have optional parts.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
9172696324 postcodes: add support for optional spaces 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
49626ba709 add postcode formats with optional country code
If the country code is not part of the mandatory output, the
country code filter will do the correct handling.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
28ab2f6048 add postcodes patterns without optional spaces 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
6e0014e138 add postcode patterns for numeric postcodes
Adds patterns for countries that have simple numeric-only postcodes.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
8080625747 remove postcodes from countries that don't have them
The postcodes will only be removed as a 'computed postcode' they
are still searchable for the given object.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
21fb501699 add info about countries without a postcode 2022-06-23 23:42:31 +02:00
bgo-eiu
04644102f2
added additional languages for pakistan in country settings 2022-06-16 06:26:44 -04:00
Sarah Hoffmann
8a67ddcb2b remove county nodes in Canada from addresses
Canada has complete coverage for administrative boundaries on
county level. Removing the county nodes from the addresses avoids error
due to a wide-spread doubling of place nodes for city counties.
2022-05-18 10:19:05 +02:00
Sarah Hoffmann
4002bee0c1 make ICU the default tokenizer 2022-05-10 12:02:50 +02:00
Sarah Hoffmann
9d468f6da0 support arbitrary prefixes in country name list
This means we can now get rid of the last special cases for names.
2022-05-09 11:55:26 +02:00
Sarah Hoffmann
3a8ddf736e move country names into separate include files 2022-05-09 11:55:26 +02:00
Sarah Hoffmann
63dc4b39bc ICU: better letter identification in normalization
The Letter class does not include non-spacing marks that can also
have a consonant or vowel meaning, especially in Indian languages.
Use the alnum propoerty instead which includes them all. Also
include the vowel-canceling Virama, which is not a letter by itself
but changes the transliteration.
2022-04-28 18:23:17 +02:00
Sarah Hoffmann
fd4ab3f262
Merge pull request #2629 from tareqpi/country-names-yaml-configuration
Move default country names into yaml configuration
2022-04-04 09:04:25 +02:00
Tareq Al-Ahdal
7bb7ed468a fix storing of escape sequences in database 2022-03-24 13:18:44 +08:00
Sarah Hoffmann
42f0282f14 remove special case for operator names
The OSM data has been sufficiently cleaned up by now that
the operator no longer needs to be considered a name tag.
Use 'brand' as the searchable alternative.
2022-03-18 10:48:53 +01:00
Tareq Al-Ahdal
456d439e97 Reformatting of country keys 2022-03-18 02:23:11 +08:00
Tareq Al-Ahdal
165d17f7f7 reintroduce 'name:' prefix to country name keys 2022-03-13 18:58:27 +08:00
Tareq Al-Ahdal
8b6652a40b move default country names into yaml configuration 2022-03-12 15:17:01 +08:00
Marc Tobias
1fcc9717bb documentation: clarify osm2pgsql isnt in project directory by default 2022-03-10 14:16:12 +01:00
Sarah Hoffmann
f03a05f6bb add new analyser for houenumbers
This analyser makes spaces optional.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
855909b4e9 add 'healthcare' as main tag
Given that the tag is most of the time duplicated by an amenity
tag which is already imported, only import it as a fallback when
there is no name.

Fixes #2609.
2022-02-21 11:52:17 +01:00
Sarah Hoffmann
610f2cc254 sanitizer: move helpers into a configuration class 2022-02-07 10:48:00 +01:00
Sarah Hoffmann
a79a3210e6 implement is-a-name option for housenumbers 2022-02-07 09:27:11 +01:00
Sarah Hoffmann
f3c9578bca complete documentation for new clean-houseunubmers sanatizer 2022-01-20 15:49:32 +01:00
Sarah Hoffmann
206ee87188 factor out housenumber splitting into sanitizer 2022-01-19 17:27:50 +01:00
Sarah Hoffmann
b453b0ea95 introduce mutation variants to generic token analyser
Mutations are regular-expression-based replacements that are applied
after variants have been computed. They are meant to be used for
variations on character level.

Add spelling variations for German umlauts.
2022-01-18 11:09:21 +01:00
Sarah Hoffmann
2034ed387b make ISO3166-2 references searchable 2022-01-13 09:44:42 +01:00
Sarah Hoffmann
fb54bd3fcf consider "modifier letter apostrophe" to be punctuation
While technically being a letter, the apostrophe is often replaced
with a normal apostrophe in writing which is a punctuation mark.
This makes sure that the modifier letter apostrophe yields the same
normalization results and thus is really interchangable.

Only has an effect after the next reimport.

Fixes #2569.
2022-01-10 17:40:03 +01:00
Sarah Hoffmann
5e792078b3 remove some odd varaints of addr:street from the styles
Some import has added names in partial tags which confuse the
street name matching.
2021-12-06 15:17:00 +01:00
Sarah Hoffmann
80e0a3cce4 change default rank for highway objects to 30
The highway key is being used more and more for non-ways these
days. This clashes with Nominatim's assumption that essentially
everything that has a highway tag can be used as the street part
of the address.

Change the default rank of highway objects to 30 to avoid this.
Only the known values for streets keep the rank 26 and are now
listed explicitly.
2021-11-24 22:10:40 +01:00
Sarah Hoffmann
1886952666 avoid special characters in word tokens
Transliteration should only consist of ASCII letters
and numbers. Avoid any other characters.
2021-11-10 17:14:13 +01:00
Sarah Hoffmann
0ae8d7ac08 have ADDRESS_LEVEL_CONFIG use load_sub_configuration
This means that relative paths now are looked up in the
project directory.
2021-10-22 16:36:52 +02:00
Sarah Hoffmann
c77df2d1eb replace NOMINATIM_PHRASE_CONFIG with command line option 2021-10-22 14:41:14 +02:00
Sarah Hoffmann
f4acfed48f add extended documentation of settings 2021-10-18 16:30:52 +02:00
Sarah Hoffmann
97a10ec218 apply variants by languages
Adds a tagger for names by language so that the analyzer of that
language is used. Thus variants are now only applied to names
in the specific language and only tag name tags, no longer to
reference-like tags.
2021-10-06 11:09:54 +02:00
Sarah Hoffmann
7cfcbacfc7 make token analyzers configurable modules
Adds a mandatory section 'analyzer' to the token-analysis entries
which define, which analyser to use. Currently there is exactly
one, generic, which implements the former ICUNameProcessor.
2021-10-04 17:37:34 +02:00
Sarah Hoffmann
52847b61a3 extend ICU config to accomodate multiple analysers
Adds parsing of multiple variant lists from the configuration.
Every entry except one must have a unique 'id' paramter to
distinguish the entries. The entry without id is considered
the default. Currently only the list without an id is used
for analysis.
2021-10-04 16:40:28 +02:00
Sarah Hoffmann
8171fe4571 introduce sanitizer step before token analysis
Sanatizer functions allow to transform name and address tags before
they are handed to the tokenizer. Theses transformations are visible
only for the tokenizer and thus only have an influence on the
search terms and address match terms for a place.

Currently two sanitizers are implemented which are responsible for
splitting names with multiple values and removing bracket additions.
Both was previously hard-coded in the tokenizer.
2021-10-01 12:27:24 +02:00
Sarah Hoffmann
7f3b05c179 adjust address levels for boundaries in Slovakia
Levels choosen according to OSM wiki. Mainly moves admin_level 6
to county level and admin_level 8 to city/town level. Higher
levels are adjusted accordingly.

Fixes #2453.
2021-09-27 23:32:11 +02:00
Sarah Hoffmann
09b1db63f4 adjust address ranks for Spain
Adjusts levels for boundaries according to the list on
https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative

* no admin_level 5, so drop that from addresses
* admin_level 6 has the province
* admin_level 7 has the county when it exists

Also reranks place=province so that it matches up with
admin_level 6 and introduces place=civil_parish which
is used as a place node for some admin_level=9 boundaries
in Galicia.
2021-09-24 18:39:44 +02:00
Sarah Hoffmann
79da96b369 read partition and languages from config file 2021-09-02 14:41:11 +02:00
Sarah Hoffmann
0b349761a8 add country configuration
The new configuration saves the default language(s) originally
maintained in the OSM wiki as well as the partition information.
2021-09-02 14:41:11 +02:00
Sarah Hoffmann
b7d4ff3201 icu: normalise simplified to traditional chinese
The conversion is unambigious in most cases, so that the
information loss is minimal.
2021-08-31 11:18:34 +02:00
Sarah Hoffmann
118858a55e rename legacy_icu tokenizer to icu tokenizer
The new icu tokenizer is now no longer compatible with the old
legacy tokenizer in terms of data structures. Therefore there
is also no longer a need to refer to the legacy tokenizer in the
name.
2021-08-17 23:11:47 +02:00
Sarah Hoffmann
8bc3c0a07c
Merge pull request #2382 from lonvia/remove-json-config
Remove outdated ICU tokenizer JSON config
2021-07-05 12:34:34 +02:00
Sarah Hoffmann
fd8751658f exclude name:etymology and name:signed
name:etymology contains a description of the name origin and is
thus more informative than search-worthy.

name:signed basically indicates that the feature does not have
a name.
2021-07-05 11:04:16 +02:00
Sarah Hoffmann
4db5a1a0b8 remove outdated ICU tokenizer JSON config 2021-07-05 11:01:35 +02:00
Sarah Hoffmann
e85f7e7aa9 fix subsequent replacements
Two replacement words directly following each other did not
work as expected because each expects a space at the
beginning/end while there was only one space available.

Also forbit composing a word after a space was added in the
end by a previous replacement.
2021-07-04 10:28:28 +02:00
Sarah Hoffmann
0894ce9dc3 import abbreviations from OSM Wiki
Replaces the variant rules with a slightly cleaned-up
version of the abbreviation lists at
https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
4fd2e961b6 improve normalization
Make sure all special symbols are removed during normalization already.
Those won't be interpreted in any way because they are unlikely to be
searched for.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
62828fc5c1 switch to a more flexible variant description format
The new format combines compound splitting and abbreviation.
It also allows to restrict rules to additional conditions
(like language or region). This latter ability is not used
yet.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
a6aa6360e0 use yaml tag syntax to mark include files 2021-07-04 10:28:20 +02:00
Sarah Hoffmann
1bd9f455fc add abbreviations from legacy tokenizer
These abbreviations are not a perfect fit anymore because
abbreviation replacement is now applied before transliteration.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
8413075249 move abbreviation computation into import phase
This adds precomputation of abbreviated terms for names and removes
abbreviation of terms in the query. Basic import works but still
needs some thorough testing as well as speed improvements during
import.

New dependency for python library datrie.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
6ba00e6aee icu tokenizer: move transliteration rules in separate file
The tokenizer configuration has become difficult to handle
due to the additional manual transliteration rules. Allow
to have a separate rule file that is given to the ICU library
as is.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
b2c6eca2c8 add missing transliterations
The ICU library only offers transliterations for a limited set of
script. Add transliterations for missing scripts from the PostgreSQL
module. These means that the same selection of scripts is supported
as with the old module.
2021-05-05 21:16:55 +02:00
Sarah Hoffmann
f44af49df9 add Python part for new ICU-based tokenizer 2021-05-05 10:15:27 +02:00
Marc
3dade534fd Add hint about replication update & recheck intervals being in seconds. 2021-05-04 11:47:15 +02:00
Sarah Hoffmann
af968d4903 introduce tokenizer modules
This adds the boilerplate for selecting configurable tokenizers.
A tokenizer can be chosen at import time and will then install
itself such that it is fixed for the given database import even
when the software itself is updated.

The legacy tokenizer implements Nominatim's traditional algorithms.
2021-04-30 11:29:57 +02:00
AntoJvlt
ff34198569 Code cleaning, tests simplification and use of python3-icu package 2021-03-23 23:56:39 +01:00
AntoJvlt
6d56cbb3e8 Changed phrase_settings.py to phrase-settings.json and added migration function for old php settings file. 2021-03-23 23:30:39 +01:00
AntoJvlt
d5acade4db Deleted specialphrases.php and phrase_settings.php 2021-03-20 19:48:05 +01:00
AntoJvlt
17cb59efbd Ported functions for the import of special phrases from php to python.
- the command is now --import-special-phrases
- the output is not an sql file anymore, data are directly imported to the database.
- the little part on the documentation (section data import) has been modified.
2021-03-20 19:11:50 +01:00
Sarah Hoffmann
bddfc109f8 refer to new nominatim tool in configuration comments 2021-02-02 10:56:40 +01:00
Sarah Hoffmann
45ea73913f remove setting for PYOSMIUM_BINARY
pyosmium is now called as a library from the python code,
so that pyosmium-get-changes is no longer needed.
2021-01-30 15:55:04 +01:00
Sarah Hoffmann
d78f0ba804 port replication initialisation to Python 2021-01-26 22:50:54 +01:00
Sarah Hoffmann
06d89e1d47 fix various typos 2020-12-19 14:33:04 +01:00
Sarah Hoffmann
8676e45d88 remove old default settings 2020-12-19 14:33:04 +01:00
Sarah Hoffmann
0947b61808 switch remaining settings to dotenv format
CONST_Search_AreaPolygons and CONST_Search_ReversePlanForAll have
been removed completely.
2020-12-19 14:33:04 +01:00
Sarah Hoffmann
15a1666f8a replace database settings with dotenv variant
As we can't refer to the project root dir in the module path, the
module path may now also be a relative directory which is then
taken as being relative to the project root path.

Moves the checkModulePresence() function into the Setup class, so
that it can work on the computed absolute module path.
2020-12-19 14:33:04 +01:00
Sarah Hoffmann
b5480f6e36 reorganise path settings in config
CONST_BasePath is split into separate configuration variables
for binaries, libraries and data. These variables as well as
the installation path are now set in the executable directly and
no longer configurable via project settings.

This is the first step towards an installable software. The
executables should know per installation where to find their
necessary data to execute. Project configuration needs to be
restricted to settings that really concern the specific Nominatim
installation.
2020-12-19 14:33:04 +01:00
Sarah Hoffmann
f89e71a861 make sure that admin levels in NL are kept in order 2020-11-19 09:44:02 +01:00
Hendrik Morée
dcc075b34b Admin levels 8 and 10 of the Netherlands are municipal / city 2020-11-18 11:30:24 +01:00
Sarah Hoffmann
cbbda1ddf0 adapt admin_levels for Belgium
Fixes #272.
2020-11-03 10:46:52 +01:00
Sarah Hoffmann
f050f898bc elevate most natural feature to address rank 22
Makes them be in par with landuse features.
2020-11-02 11:42:10 +01:00
Sarah Hoffmann
b81894d3d5 remove now unused settings related to website
There are two places where the website URL is still used:
for icons, replace the URL with a link to the icon repository
of the UI repo. The more URL now builds the link from the
server info.
2020-10-29 11:13:32 +01:00
Sarah Hoffmann
abd20d3ca6 disable admin level 5 in Russia
They either interfere with cities or refer to historical boundaries.
2020-10-28 10:49:26 +01:00
Sarah Hoffmann
b661c66c00 reorganize ranks of high-level place types
Rank 25 is now available for places that should appear in addresses
but not when a street is present. Use this for som block-like
place types. Also document the particularity of rank 25.

subdevisions and allotments are now at the same level as landuse
which they are frequently used together with.
2020-10-20 20:20:49 +02:00
Sarah Hoffmann
73a0ec22a3 add country-specific address ranks for Russia
Removes admin level 7, which should not exist and promotes
admin level 8 to municipality level.

place=municipality is only used for boroughs of St. Petersburg,
so demote to level 18.

Fixes #926.
2020-10-17 17:54:06 +02:00
Sarah Hoffmann
6a31691121 adapt address levels for admin boundaries in Indonesia 2020-10-09 22:28:06 +02:00
Sarah Hoffmann
f2ff351da4
Merge pull request #1971 from lonvia/drop-support-for-isin
Drop support for is_in tag
2020-09-23 09:20:35 +02:00
Sarah Hoffmann
72193a1c23 exclude unnamed highway areas
These are used to mark large paved areas. Sometimes they exists
together with named regular streets. In such cases the unnamed
area may overshadow the actual street when computing the address
parent. As unnamed highways are not very useful anyway, we
simply remove them from the database.
2020-09-22 21:42:13 +02:00
Sarah Hoffmann
d04e87fb80 drop suport for is_in tag 2020-09-22 20:26:36 +02:00
Sarah Hoffmann
7fb62ea904 postal boundary may be imported without name
Postal boundaries usually just have the postcode tag set and are
therefore officially 'nameless'. We want to have them as
boundary=postal_code anyways in order to distiguish them from postcode
points inherited from addr: tags.
2020-09-18 11:33:45 +02:00
Sarah Hoffmann
8ff1f16b7f remove host from default website URL
Just assume that Nominatim runs under the root URL. This is a
more versatile base that also makes 'make serve' work out of the
box.
2020-09-16 11:13:51 +02:00
Sarah Hoffmann
a68cdc40be improve fallback ranking
Boundaries and places now always get a rank < 26 to make sure that
they do not parent to a street. Skip boundary=place completely
because they will be covered throught the secondary place tag.
2020-09-01 17:55:40 +02:00
Sarah Hoffmann
be6ecc388c add support for place=square
Squares are now addressable (on address level 25) and thus can
be attached to a house number via addr:place. Needed to increase
the rank range for matching up addr:place to 25.
2020-08-26 12:12:52 +02:00
Sarah Hoffmann
071db1fae7 remove traffic signs from full styles
Traffic signs rarely have name and are therefore mostly not
searchable. Remove them completely. Allow street lamps only when
they have a name. Removes about 2M object from a planet instance.
2020-08-15 22:37:45 +02:00
Sarah Hoffmann
05e0d3e2d4 add wiki tags to all styles
wikipedia and wikidata tags are needed to compute the importance
so we need to put them into extra tags for all styles.

Fixes #1885.
2020-07-25 10:00:18 +02:00
Sarah Hoffmann
d364afdf3b reenable debug parameter
The parameter got lost when switching to website settings.
Given that the use of a fixed parameter is limited,
debugging output can now only be set via the URL parameter.
2020-07-08 08:32:46 +02:00
Sarah Hoffmann
8b8dcea3de exclude language-specific name:prefix and name:suffix
There are about 1k suffixes and 20k prefixes with a
language-speicfic variant in use. These should not
show up as names.
2020-07-01 18:00:53 +02:00
Sarah Hoffmann
6e4ee160ee adapt tests to new search ranks 2020-06-17 10:53:11 +02:00
Sarah Hoffmann
a5697c5279 change place node expansion for large area table
So far we've used a buffer around a place node to define its
potential address reach. This had two problems: the buffer was
so large that addresses often contain false positives and the
buffer is really distorted when getting closer to the poles.

Change the buffer here to draw a bounndig box at a certain
distance in meter. This means that we always use the same
box everywhere on the planet and can make the extent much
smaller. Using a box has the advantage that it is much faster
to figure out if a point is within the box.
2020-06-17 10:53:11 +02:00