Canada has complete coverage for administrative boundaries on
county level. Removing the county nodes from the addresses avoids error
due to a wide-spread doubling of place nodes for city counties.
The Letter class does not include non-spacing marks that can also
have a consonant or vowel meaning, especially in Indian languages.
Use the alnum propoerty instead which includes them all. Also
include the vowel-canceling Virama, which is not a letter by itself
but changes the transliteration.
The OSM data has been sufficiently cleaned up by now that
the operator no longer needs to be considered a name tag.
Use 'brand' as the searchable alternative.
Given that the tag is most of the time duplicated by an amenity
tag which is already imported, only import it as a fallback when
there is no name.
Fixes#2609.
Mutations are regular-expression-based replacements that are applied
after variants have been computed. They are meant to be used for
variations on character level.
Add spelling variations for German umlauts.
While technically being a letter, the apostrophe is often replaced
with a normal apostrophe in writing which is a punctuation mark.
This makes sure that the modifier letter apostrophe yields the same
normalization results and thus is really interchangable.
Only has an effect after the next reimport.
Fixes#2569.
The highway key is being used more and more for non-ways these
days. This clashes with Nominatim's assumption that essentially
everything that has a highway tag can be used as the street part
of the address.
Change the default rank of highway objects to 30 to avoid this.
Only the known values for streets keep the rank 26 and are now
listed explicitly.
Adds a tagger for names by language so that the analyzer of that
language is used. Thus variants are now only applied to names
in the specific language and only tag name tags, no longer to
reference-like tags.
Adds a mandatory section 'analyzer' to the token-analysis entries
which define, which analyser to use. Currently there is exactly
one, generic, which implements the former ICUNameProcessor.
Adds parsing of multiple variant lists from the configuration.
Every entry except one must have a unique 'id' paramter to
distinguish the entries. The entry without id is considered
the default. Currently only the list without an id is used
for analysis.
Sanatizer functions allow to transform name and address tags before
they are handed to the tokenizer. Theses transformations are visible
only for the tokenizer and thus only have an influence on the
search terms and address match terms for a place.
Currently two sanitizers are implemented which are responsible for
splitting names with multiple values and removing bracket additions.
Both was previously hard-coded in the tokenizer.
Levels choosen according to OSM wiki. Mainly moves admin_level 6
to county level and admin_level 8 to city/town level. Higher
levels are adjusted accordingly.
Fixes#2453.
Adjusts levels for boundaries according to the list on
https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative
* no admin_level 5, so drop that from addresses
* admin_level 6 has the province
* admin_level 7 has the county when it exists
Also reranks place=province so that it matches up with
admin_level 6 and introduces place=civil_parish which
is used as a place node for some admin_level=9 boundaries
in Galicia.
The new icu tokenizer is now no longer compatible with the old
legacy tokenizer in terms of data structures. Therefore there
is also no longer a need to refer to the legacy tokenizer in the
name.
name:etymology contains a description of the name origin and is
thus more informative than search-worthy.
name:signed basically indicates that the feature does not have
a name.
Two replacement words directly following each other did not
work as expected because each expects a space at the
beginning/end while there was only one space available.
Also forbit composing a word after a space was added in the
end by a previous replacement.
Make sure all special symbols are removed during normalization already.
Those won't be interpreted in any way because they are unlikely to be
searched for.
The new format combines compound splitting and abbreviation.
It also allows to restrict rules to additional conditions
(like language or region). This latter ability is not used
yet.
This adds precomputation of abbreviated terms for names and removes
abbreviation of terms in the query. Basic import works but still
needs some thorough testing as well as speed improvements during
import.
New dependency for python library datrie.
The tokenizer configuration has become difficult to handle
due to the additional manual transliteration rules. Allow
to have a separate rule file that is given to the ICU library
as is.
The ICU library only offers transliterations for a limited set of
script. Add transliterations for missing scripts from the PostgreSQL
module. These means that the same selection of scripts is supported
as with the old module.