Mirror of https://github.com/osm-search/Nominatim.git, synced 2024-12-26 14:36:23 +03:00
Merge pull request #2381 from lonvia/reorganise-abbreviations
Reorganise abbreviation handling
Commit d4c7bf20a2
15	.github/actions/build-nominatim/action.yml (vendored)
@@ -1,13 +1,26 @@
 name: 'Build Nominatim'

 inputs:
     ubuntu:
         description: 'Version of Ubuntu to install on'
         required: false
         default: '20'

 runs:
     using: "composite"

     steps:
         - name: Install prerequisites
           run: |
-            sudo apt-get install -y -qq libboost-system-dev libboost-filesystem-dev libexpat1-dev zlib1g-dev libbz2-dev libpq-dev libproj-dev libicu-dev python3-psycopg2 python3-pyosmium python3-dotenv python3-psutil python3-jinja2 python3-icu
+            sudo apt-get install -y -qq libboost-system-dev libboost-filesystem-dev libexpat1-dev zlib1g-dev libbz2-dev libpq-dev libproj-dev libicu-dev
+            if [ "x$UBUNTUVER" == "x18" ]; then
+              pip3 install python-dotenv psycopg2==2.7.7 jinja2==2.8 psutil==5.4.2 pyicu osmium
+            else
+              sudo apt-get install -y -qq python3-icu python3-datrie python3-pyosmium python3-jinja2 python3-psutil python3-psycopg2 python3-dotenv
+            fi
           shell: bash
+          env:
+            UBUNTUVER: ${{ inputs.ubuntu }}

         - name: Download dependencies
           run: |
9	.github/workflows/ci-tests.yml (vendored)
@@ -134,13 +134,8 @@ jobs:
                   postgresql-version: ${{ matrix.postgresql }}
                   postgis-version: ${{ matrix.postgis }}
             - uses: ./Nominatim/.github/actions/build-nominatim
-
-            - name: Install extra dependencies for Ubuntu 18
-              run: |
-                  sudo apt-get install libicu-dev
-                  pip3 install python-dotenv psycopg2==2.7.7 jinja2==2.8 psutil==5.4.2 pyicu osmium
-              shell: bash
-              if: matrix.ubuntu == 18
+              with:
+                  ubuntu: ${{ matrix.ubuntu }}

             - name: Clean installation
               run: rm -rf Nominatim build
@@ -1,7 +1,7 @@
 [MASTER]

 extension-pkg-whitelist=osmium
-ignored-modules=icu
+ignored-modules=icu,datrie

 [MESSAGES CONTROL]
@@ -258,5 +258,6 @@ install(FILES settings/env.defaults
               settings/import-address.style
               settings/import-full.style
               settings/import-extratags.style
-              settings/legacy_icu_tokenizer.json
+              settings/legacy_icu_tokenizer.yaml
+              settings/icu-rules/extended-unicode-to-asccii.yaml
        DESTINATION ${NOMINATIM_CONFIGDIR})
@@ -45,6 +45,7 @@ For running Nominatim:
   * [psutil](https://github.com/giampaolo/psutil)
   * [Jinja2](https://palletsprojects.com/p/jinja/)
   * [PyICU](https://pypi.org/project/PyICU/)
+  * [datrie](https://github.com/pytries/datrie)
   * [PHP](https://php.net) (7.0 or later)
   * PHP-pgsql
   * PHP-intl (bundled with PHP)
205	docs/admin/Tokenizers.md (new file)
@@ -0,0 +1,205 @@
# Tokenizers

The tokenizer module in Nominatim is responsible for analysing the names given
to OSM objects and the terms of an incoming query in order to make sure they
can be matched appropriately.

Nominatim offers different tokenizer modules, which behave differently and have
different configuration options. This section describes the tokenizers and how
they can be configured.

!!! important
    The use of a tokenizer is tied to a database installation. You need to choose
    and configure the tokenizer before starting the initial import. Once the import
    is done, you cannot switch to another tokenizer anymore. Reconfiguration of the
    chosen tokenizer is very limited as well. See the comments in each tokenizer
    section.
## Legacy tokenizer

The legacy tokenizer implements the analysis algorithms of older Nominatim
versions. It uses a special PostgreSQL module to normalize names and queries.
This tokenizer is currently the default.

To enable the tokenizer, add the following line to your project configuration:

```
NOMINATIM_TOKENIZER=legacy
```

The PostgreSQL module for the tokenizer is available in the `module` directory
and is also installed with the remainder of the software under
`lib/nominatim/module/nominatim.so`. You can specify a custom location for
the module with

```
NOMINATIM_DATABASE_MODULE_PATH=<path to directory where nominatim.so resides>
```

This is particularly useful when the database runs on a different server.
See [Advanced installations](Advanced-Installations.md#importing-nominatim-to-an-external-postgresql-database) for details.

There are no other configuration options for the legacy tokenizer. All
normalization functions are hard-coded.
## ICU tokenizer

!!! danger
    This tokenizer is currently in active development and still subject
    to backwards-incompatible changes.

The ICU tokenizer uses the [ICU library](http://site.icu-project.org/) to
normalize names and queries. It also offers configurable decomposition and
abbreviation handling.

### How it works

On import the tokenizer processes names in the following four stages:

1. The **Normalization** step removes all non-relevant information from the
   input.
2. Incoming names are then converted to **full names**. This process is currently
   hard-coded and mostly serves to handle name tags from OSM that contain
   multiple names (e.g. [Biel/Bienne](https://www.openstreetmap.org/node/240097197)).
3. Next the tokenizer creates **variants** from the full names. These variants
   cover decomposition and abbreviation handling. Variants are saved to the
   database, so that it is not necessary to create the variants for a search
   query.
4. The final **Tokenization** step converts the names to a simple ASCII form,
   potentially removing further spelling variants for better matching.

At query time only stages 1 and 4 are used. The query is normalized and
tokenized, and the resulting string is used for searching in the database.
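For illustration, the stages map onto the `ICUNameProcessor` API added by this
PR. A minimal sketch of how an import-time caller might drive it (the rule file
path is only an example):

```python
from pathlib import Path

from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.icu_name_processor import (ICUNameProcessor,
                                                    ICUNameProcessorRules)

# Compile the YAML configuration into rule sets (the path is a placeholder).
rules = ICUNameProcessorRules(loader=ICURuleLoader(Path('legacy_icu_tokenizer.yaml')))
proc = ICUNameProcessor(rules)

name = 'Hauptstraße'
norm = proc.get_normalized(name)                       # stage 1: normalization
variants = proc.get_variants_ascii(norm)               # stages 3 and 4: variants + ASCII form
query_token = proc.get_search_normalized('Hauptstr.')  # query time: stages 1 and 4 combined
```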
### Configuration

The ICU tokenizer is configured via a YAML file whose location can be set with
`NOMINATIM_TOKENIZER_CONFIG`. The configuration is read on import and then
saved as part of the internal database status. Later changes to the variable
have no effect.

Here is an example configuration file:

``` yaml
normalization:
    - ":: lower ()"
    - "ß > 'ss'" # German eszett is unambiguously equal to double ss
transliteration:
    - !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml
    - ":: Ascii ()"
variants:
    - language: de
      words:
        - ~haus => haus
        - ~strasse -> str
    - language: en
      words:
        - road -> rd
        - bridge -> bdge,br,brdg,bri,brg
```

The configuration file contains three sections:
`normalization`, `transliteration`, `variants`.

The normalization and transliteration sections must each contain a list of
[ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html).
The rules are applied in the order in which they appear in the file.
You can also include additional rules from an external YAML file using the
`!include` tag, as the sketch below shows. The included file must contain a
valid YAML list of ICU rules and may again include other files.
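The `!include` tag is implemented by a custom YAML constructor registered in
`ICURuleLoader` (see further down in this diff). A stand-alone sketch of the
same mechanism; `base_dir` and the file names are assumptions for the example:

```python
from pathlib import Path

import yaml

base_dir = Path('/etc/nominatim')   # example: directory of the configuration file

def _include_constructor(loader, node):
    # Resolve the included path relative to the configuration file unless absolute.
    path = Path(loader.construct_scalar(node))
    if not path.is_absolute():
        path = base_dir / path
    return yaml.safe_load(path.read_text())

yaml.add_constructor('!include', _include_constructor, Loader=yaml.SafeLoader)
rules = yaml.safe_load((base_dir / 'icu_tokenizer.yaml').read_text())
```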
!!! warning
    The ICU rule syntax contains special characters that conflict with the
    YAML syntax. You should therefore always enclose the ICU rules in
    double quotes.

The variants section defines lists of replacements which create alternative
spellings of a name. To create the variants, a name is scanned from left to
right and the longest matching replacement is applied until the end of the
string is reached.

The variants section must contain a list of replacement groups. Each group
defines a set of properties that describes where the replacements are
applicable. In addition, the `words` section defines the list of replacements
to be made. The basic replacement description is of the form:

```
<source>[,<source>[...]] => <target>[,<target>[...]]
```

The left side contains one or more `source` terms to be replaced. The right side
lists one or more replacements. Each source is replaced with each replacement
term, as the sketch below illustrates.
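A minimal stand-alone sketch of this cross-product expansion (plain Python,
not Nominatim's actual parser):

```python
import itertools

def expand(rule):
    """Expand 'a,b => x,y' into all (source, target) pairs."""
    lhs, rhs = rule.split('=>')
    sources = [s.strip() for s in lhs.split(',')]
    targets = [t.strip() for t in rhs.split(',')]
    return list(itertools.product(sources, targets))

print(expand('bridge => bdge,br,brdg,bri,brg'))
# [('bridge', 'bdge'), ('bridge', 'br'), ('bridge', 'brdg'),
#  ('bridge', 'bri'), ('bridge', 'brg')]
```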
!!! tip
    The source and target terms are internally normalized using the
    normalization rules given in the configuration. This ensures that the
    strings match as expected. In fact, it is better to use unnormalized
    words in the configuration because then it is possible to change the
    rules for normalization later without having to adapt the variant rules.

#### Decomposition

In its standard form, only full words match against the source. There
is a special notation to match the prefix and suffix of a word:

``` yaml
- ~strasse => str # matches "strasse" as full word and in suffix position
- hinter~ => hntr # matches "hinter" as full word and in prefix position
```

There is no facility to match a string in the middle of a word. The suffix
and prefix notations automatically trigger decomposition mode: two variants
are created for each replacement, one with the replacement attached to the word
and one separate. So in the above example, the tokenization of "hauptstrasse"
creates the variants "hauptstr" and "haupt str". Similarly, the name "rote strasse"
triggers the variants "rote str" and "rotestr". Because decomposition works
both ways, it is sufficient to create the variants at index time. The variant
rules are not applied at query time.
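This behaviour corresponds to the suffix branch of `_create_variants` in the
new `icu_rule_loader.py` further down in this diff. A condensed sketch of just
that branch, with word boundaries represented as spaces, as in the loader:

```python
def suffix_variants(src, repl, decompose=True):
    """Yield (source, replacement) pairs for a '~src => repl' rule.

    A pair with a leading space matches the stand-alone word; one
    without matches the suffix position inside a longer word.
    """
    src, repl = src + ' ', repl + ' '
    yield src, repl                  # suffix position: hauptstrasse -> hauptstr
    yield ' ' + src, ' ' + repl      # full word:       rote strasse -> rote str
    if decompose:
        yield src, ' ' + repl        # split suffix off: hauptstrasse -> haupt str
        yield ' ' + src, repl        # attach the word:  rote strasse -> rotestr
```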
To avoid automatic decomposition, use the `|` notation:

``` yaml
- ~strasse |=> str
```

This simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str".

#### Initial and final terms

It is also possible to restrict replacements to the beginning and end of a
name:

``` yaml
- ^south => s # matches only at the beginning of the name
- road$ => rd # matches only at the end of the name
```

So the first example would trigger a replacement for "south 45th street" but
not for "the south beach restaurant".
#### Replacements vs. variants

The replacement syntax `source => target` works as a pure replacement. It changes
the name instead of creating a variant. To create an additional version, you'd
have to write `source => source,target`. As this is a frequent case, there is
a shortcut notation for it:

```
<source>[,<source>[...]] -> <target>[,<target>[...]]
```

The simple arrow causes an additional variant to be added. Note that
decomposition also has an effect on the source here. So a rule

```yaml
- ~strasse -> str
```

means that for a word like `hauptstrasse` four variants are created:
`hauptstrasse`, `haupt strasse`, `hauptstr` and `haupt str`.

### Reconfiguration

Changing the configuration after the import is currently not possible, although
this feature may be added at a later time.
@@ -20,6 +20,7 @@ pages:
     - 'Update' : 'admin/Update.md'
     - 'Deploy' : 'admin/Deployment.md'
     - 'Customize Imports' : 'admin/Customization.md'
+    - 'Tokenizers' : 'admin/Tokenizers.md'
     - 'Nominatim UI' : 'admin/Setup-Nominatim-UI.md'
     - 'Advanced Installations' : 'admin/Advanced-Installations.md'
     - 'Migration from older Versions' : 'admin/Migration.md'
@@ -47,9 +47,7 @@ class Tokenizer

     private function makeStandardWord($sTerm)
     {
-        $sNorm = ' '.$this->oTransliterator->transliterate($sTerm).' ';
-
-        return trim(str_replace(CONST_Abbreviations[0], CONST_Abbreviations[1], $sNorm));
+        return trim($this->oTransliterator->transliterate(' '.$sTerm.' '));
     }


@@ -90,6 +88,7 @@ class Tokenizer
         foreach ($aPhrases as $iPhrase => $oPhrase) {
             $sNormQuery .= ','.$this->normalizeString($oPhrase->getPhrase());
             $sPhrase = $this->makeStandardWord($oPhrase->getPhrase());
+            Debug::printVar('Phrase', $sPhrase);
             if (strlen($sPhrase) > 0) {
                 $aWords = explode(' ', $sPhrase);
                 Tokenizer::addTokens($aTokens, $aWords);
@@ -87,25 +87,48 @@ $$ LANGUAGE SQL IMMUTABLE STRICT;

--------------- private functions ----------------------------------------------

-CREATE OR REPLACE FUNCTION getorcreate_term_id(lookup_term TEXT)
-  RETURNS INTEGER
+CREATE OR REPLACE FUNCTION getorcreate_full_word(norm_term TEXT, lookup_terms TEXT[],
+                                                 OUT full_token INT,
+                                                 OUT partial_tokens INT[])
   AS $$
 DECLARE
-  return_id INTEGER;
+  partial_terms TEXT[] = '{}'::TEXT[];
+  term TEXT;
+  term_id INTEGER;
   term_count INTEGER;
 BEGIN
-  SELECT min(word_id), max(search_name_count) INTO return_id, term_count
-    FROM word WHERE word_token = lookup_term and class is null and type is null;
+  SELECT min(word_id) INTO full_token
+    FROM word WHERE word = norm_term and class is null and country_code is null;

-  IF return_id IS NULL THEN
-    return_id := nextval('seq_word');
-    INSERT INTO word (word_id, word_token, search_name_count)
-      VALUES (return_id, lookup_term, 0);
-  ELSEIF left(lookup_term, 1) = ' ' and term_count > {{ max_word_freq }} THEN
-    return_id := 0;
+  IF full_token IS NULL THEN
+    full_token := nextval('seq_word');
+    INSERT INTO word (word_id, word_token, word, search_name_count)
+      SELECT full_token, ' ' || lookup_term, norm_term, 0 FROM unnest(lookup_terms) as lookup_term;
   END IF;

-  RETURN return_id;
+  FOR term IN SELECT unnest(string_to_array(unnest(lookup_terms), ' ')) LOOP
+    term := trim(term);
+    IF NOT (ARRAY[term] <@ partial_terms) THEN
+      partial_terms := partial_terms || term;
+    END IF;
+  END LOOP;
+
+  partial_tokens := '{}'::INT[];
+  FOR term IN SELECT unnest(partial_terms) LOOP
+    SELECT min(word_id), max(search_name_count) INTO term_id, term_count
+      FROM word WHERE word_token = term and class is null and country_code is null;
+
+    IF term_id IS NULL THEN
+      term_id := nextval('seq_word');
+      term_count := 0;
+      INSERT INTO word (word_id, word_token, search_name_count)
+        VALUES (term_id, term, 0);
+    END IF;
+
+    IF term_count < {{ max_word_freq }} THEN
+      partial_tokens := array_merge(partial_tokens, ARRAY[term_id]);
+    END IF;
+  END LOOP;
 END;
 $$
 LANGUAGE plpgsql;
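Later in this diff, `_compute_name_tokens` calls this function from Python.
Extracted as a minimal sketch (the connection setup is not shown and the
values are examples):

```python
# `conn` is an open psycopg2 connection to the Nominatim database.
norm_name = 'hauptstrasse'
variants = ['hauptstrasse', 'haupt strasse', 'hauptstr', 'haupt str']

with conn.cursor() as cur:
    # The (f(...)).* form expands the two OUT parameters into result columns.
    cur.execute("SELECT (getorcreate_full_word(%s, %s)).*",
                (norm_name, variants))
    full_token, partial_tokens = cur.fetchone()
```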
@@ -4,6 +4,7 @@ Helper functions for handling DB accesses.
 import subprocess
 import logging
 import gzip
+import io

 from nominatim.db.connection import get_pg_env
 from nominatim.errors import UsageError
@@ -57,3 +58,49 @@ def execute_file(dsn, fname, ignore_errors=False, pre_code=None, post_code=None)

     if ret != 0 or remain > 0:
         raise UsageError("Failed to execute SQL file.")
+
+
+# List of characters that need to be quoted for the copy command.
+_SQL_TRANSLATION = {ord(u'\\'): u'\\\\',
+                    ord(u'\t'): u'\\t',
+                    ord(u'\n'): u'\\n'}
+
+
+class CopyBuffer:
+    """ Data collector for the copy_from command.
+    """
+
+    def __init__(self):
+        self.buffer = io.StringIO()
+
+
+    def __enter__(self):
+        return self
+
+
+    def __exit__(self, exc_type, exc_value, traceback):
+        if self.buffer is not None:
+            self.buffer.close()
+
+
+    def add(self, *data):
+        """ Add another row of data to the copy buffer.
+        """
+        first = True
+        for column in data:
+            if first:
+                first = False
+            else:
+                self.buffer.write('\t')
+            if column is None:
+                self.buffer.write('\\N')
+            else:
+                self.buffer.write(str(column).translate(_SQL_TRANSLATION))
+        self.buffer.write('\n')
+
+
+    def copy_out(self, cur, table, columns=None):
+        """ Copy all collected data into the given table.
+        """
+        if self.buffer.tell() > 0:
+            self.buffer.seek(0)
+            cur.copy_from(self.buffer, table, columns=columns)
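A short usage sketch of the new `CopyBuffer`; the table and columns follow the
word-table usage later in this diff, and the connection setup is assumed:

```python
from nominatim.db.utils import CopyBuffer

# `conn` is an open psycopg2 connection (setup not shown).
with CopyBuffer() as copystr:
    copystr.add(' hauptstr', 0)    # one row: word_token, search_name_count
    copystr.add(' haupt str', 0)

    with conn.cursor() as cur:
        copystr.copy_out(cur, 'word',
                         columns=['word_token', 'search_name_count'])
conn.commit()
```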
142	nominatim/tokenizer/icu_name_processor.py (new file)
@@ -0,0 +1,142 @@
"""
Processor for names that are imported into the database based on the
ICU library.
"""
from collections import defaultdict
import itertools

from icu import Transliterator
import datrie

from nominatim.db.properties import set_property, get_property
from nominatim.tokenizer import icu_variants as variants

DBCFG_IMPORT_NORM_RULES = "tokenizer_import_normalisation"
DBCFG_IMPORT_TRANS_RULES = "tokenizer_import_transliteration"
DBCFG_IMPORT_REPLACEMENTS = "tokenizer_import_replacements"
DBCFG_SEARCH_STD_RULES = "tokenizer_search_standardization"


class ICUNameProcessorRules:
    """ Data object that saves the rules needed for the name processor.

        The rules can either be initialised through an ICURuleLoader or
        be loaded from a database when a connection is given.
    """
    def __init__(self, loader=None, conn=None):
        if loader is not None:
            self.norm_rules = loader.get_normalization_rules()
            self.trans_rules = loader.get_transliteration_rules()
            self.replacements = loader.get_replacement_pairs()
            self.search_rules = loader.get_search_rules()
        elif conn is not None:
            self.norm_rules = get_property(conn, DBCFG_IMPORT_NORM_RULES)
            self.trans_rules = get_property(conn, DBCFG_IMPORT_TRANS_RULES)
            self.replacements = \
                variants.unpickle_variant_set(get_property(conn, DBCFG_IMPORT_REPLACEMENTS))
            self.search_rules = get_property(conn, DBCFG_SEARCH_STD_RULES)
        else:
            assert False, "Parameter loader or conn required."


    def save_rules(self, conn):
        """ Save the rules in the property table of the given database.
            The rules can be loaded again by handing in a connection into
            the constructor of the class.
        """
        set_property(conn, DBCFG_IMPORT_NORM_RULES, self.norm_rules)
        set_property(conn, DBCFG_IMPORT_TRANS_RULES, self.trans_rules)
        set_property(conn, DBCFG_IMPORT_REPLACEMENTS,
                     variants.pickle_variant_set(self.replacements))
        set_property(conn, DBCFG_SEARCH_STD_RULES, self.search_rules)


class ICUNameProcessor:
    """ Collects the different transformation rules for normalisation of names
        and provides the functions to apply the transformations.
    """

    def __init__(self, rules):
        self.normalizer = Transliterator.createFromRules("icu_normalization",
                                                         rules.norm_rules)
        self.to_ascii = Transliterator.createFromRules("icu_to_ascii",
                                                       rules.trans_rules +
                                                       ";[:Space:]+ > ' '")
        self.search = Transliterator.createFromRules("icu_search",
                                                     rules.search_rules)

        # Intermediate reorder by source. Also compute required character set.
        immediate = defaultdict(list)
        chars = set()
        for variant in rules.replacements:
            if variant.source[-1] == ' ' and variant.replacement[-1] == ' ':
                replstr = variant.replacement[:-1]
            else:
                replstr = variant.replacement
            immediate[variant.source].append(replstr)
            chars.update(variant.source)
        # Then copy to datrie
        self.replacements = datrie.Trie(''.join(chars))
        for src, repllist in immediate.items():
            self.replacements[src] = repllist


    def get_normalized(self, name):
        """ Normalize the given name, i.e. remove all elements not relevant
            for search.
        """
        return self.normalizer.transliterate(name).strip()

    def get_variants_ascii(self, norm_name):
        """ Compute the spelling variants for the given normalized name
            and transliterate the result.
        """
        baseform = '^ ' + norm_name + ' ^'
        partials = ['']

        startpos = 0
        pos = 0
        force_space = False
        while pos < len(baseform):
            full, repl = self.replacements.longest_prefix_item(baseform[pos:],
                                                               (None, None))
            if full is not None:
                done = baseform[startpos:pos]
                partials = [v + done + r
                            for v, r in itertools.product(partials, repl)
                            if not force_space or r.startswith(' ')]
                if len(partials) > 128:
                    # If too many variants are produced, they are unlikely
                    # to be helpful. Only use the original term.
                    startpos = 0
                    break
                startpos = pos + len(full)
                if full[-1] == ' ':
                    startpos -= 1
                    force_space = True
                pos = startpos
            else:
                pos += 1
                force_space = False

        results = set()

        if startpos == 0:
            trans_name = self.to_ascii.transliterate(norm_name).strip()
            if trans_name:
                results.add(trans_name)
        else:
            for variant in partials:
                name = variant + baseform[startpos:]
                trans_name = self.to_ascii.transliterate(name[1:-1]).strip()
                if trans_name:
                    results.add(trans_name)

        return list(results)


    def get_search_normalized(self, name):
        """ Return the normalized version of the name (including transliteration)
            to be applied at search time.
        """
        return self.search.transliterate(' ' + name + ' ').strip()
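The variant scan above relies on `datrie`'s longest-prefix lookup, the same
call used in `get_variants_ascii`. A toy illustration of that primitive in
isolation (alphabet and trie contents are invented for the example):

```python
import datrie

# The trie alphabet must cover every character that can appear in a key.
trie = datrie.Trie('abcdefghijklmnopqrstuvwxyz ')
trie['strasse '] = ['str ', 'strasse ']   # replacements for the suffix "strasse"

full, repls = trie.longest_prefix_item('strasse und mehr', (None, None))
# full == 'strasse ', repls == ['str ', 'strasse ']
```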
246	nominatim/tokenizer/icu_rule_loader.py (new file)
@@ -0,0 +1,246 @@
"""
Helper class to create ICU rules from a configuration file.
"""
import io
import logging
import itertools
from pathlib import Path
import re

import yaml
from icu import Transliterator

from nominatim.errors import UsageError
import nominatim.tokenizer.icu_variants as variants

LOG = logging.getLogger()


def _flatten_yaml_list(content):
    if not content:
        return []

    if not isinstance(content, list):
        raise UsageError("List expected in ICU yaml configuration.")

    output = []
    for ele in content:
        if isinstance(ele, list):
            output.extend(_flatten_yaml_list(ele))
        else:
            output.append(ele)

    return output


class VariantRule:
    """ Saves a single variant expansion.

        An expansion consists of the normalized replacement term and
        a dictionary of properties that describe when the expansion applies.
    """

    def __init__(self, replacement, properties):
        self.replacement = replacement
        self.properties = properties or {}


class ICURuleLoader:
    """ Compiler for ICU rules from a tokenizer configuration file.
    """

    def __init__(self, configfile):
        self.configfile = configfile
        self.variants = set()

        if configfile.suffix == '.yaml':
            self._load_from_yaml()
        else:
            raise UsageError("Unknown format of tokenizer configuration.")


    def get_search_rules(self):
        """ Return the ICU rules to be used during search.
            The rules combine normalization and transliteration.
        """
        # First apply the normalization rules.
        rules = io.StringIO()
        rules.write(self.normalization_rules)

        # Then add transliteration.
        rules.write(self.transliteration_rules)
        return rules.getvalue()

    def get_normalization_rules(self):
        """ Return rules for normalisation of a term.
        """
        return self.normalization_rules

    def get_transliteration_rules(self):
        """ Return the rules for converting a string into its ASCII representation.
        """
        return self.transliteration_rules

    def get_replacement_pairs(self):
        """ Return the list of possible compound decompositions with
            application of abbreviations included.
            The result is a list of pairs: the first item is the sequence to
            replace, the second is a list of replacements.
        """
        return self.variants

    def _yaml_include_representer(self, loader, node):
        value = loader.construct_scalar(node)

        if Path(value).is_absolute():
            content = Path(value).read_text()
        else:
            content = (self.configfile.parent / value).read_text()

        return yaml.safe_load(content)


    def _load_from_yaml(self):
        yaml.add_constructor('!include', self._yaml_include_representer,
                             Loader=yaml.SafeLoader)
        rules = yaml.safe_load(self.configfile.read_text())

        self.normalization_rules = self._cfg_to_icu_rules(rules, 'normalization')
        self.transliteration_rules = self._cfg_to_icu_rules(rules, 'transliteration')
        self._parse_variant_list(self._get_section(rules, 'variants'))


    def _get_section(self, rules, section):
        """ Get the section named 'section' from the rules. If the section does
            not exist, raise a usage error with a meaningful message.
        """
        if section not in rules:
            LOG.fatal("Section '%s' not found in tokenizer config '%s'.",
                      section, str(self.configfile))
            raise UsageError("Syntax error in tokenizer configuration file.")

        return rules[section]


    def _cfg_to_icu_rules(self, rules, section):
        """ Load an ICU ruleset from the given section. If the section is a
            simple string, it is interpreted as a file name and the rules are
            loaded verbatim from the given file. The filename is expected to be
            relative to the tokenizer rule file. If the section is a list then
            each line is assumed to be a rule. All rules are concatenated and returned.
        """
        content = self._get_section(rules, section)

        if content is None:
            return ''

        return ';'.join(_flatten_yaml_list(content)) + ';'


    def _parse_variant_list(self, rules):
        self.variants.clear()

        if not rules:
            return

        rules = _flatten_yaml_list(rules)

        vmaker = _VariantMaker(self.normalization_rules)

        properties = []
        for section in rules:
            # Create the property field and deduplicate against existing
            # instances.
            props = variants.ICUVariantProperties.from_rules(section)
            for existing in properties:
                if existing == props:
                    props = existing
                    break
            else:
                properties.append(props)

            for rule in (section.get('words') or []):
                self.variants.update(vmaker.compute(rule, props))


class _VariantMaker:
    """ Generator for all necessary ICUVariants from a single variant rule.

        All text in rules is normalized to make sure the variants match later.
    """

    def __init__(self, norm_rules):
        self.norm = Transliterator.createFromRules("rule_loader_normalization",
                                                   norm_rules)


    def compute(self, rule, props):
        """ Generator for all ICUVariant tuples from a single variant rule.
        """
        parts = re.split(r'(\|)?([=-])>', rule)
        if len(parts) != 4:
            raise UsageError("Syntax error in variant rule: " + rule)

        decompose = parts[1] is None
        src_terms = [self._parse_variant_word(t) for t in parts[0].split(',')]
        repl_terms = (self.norm.transliterate(t.strip()) for t in parts[3].split(','))

        # If the source should be kept, add a 1:1 replacement
        if parts[2] == '-':
            for src in src_terms:
                if src:
                    for froms, tos in _create_variants(*src, src[0], decompose):
                        yield variants.ICUVariant(froms, tos, props)

        for src, repl in itertools.product(src_terms, repl_terms):
            if src and repl:
                for froms, tos in _create_variants(*src, repl, decompose):
                    yield variants.ICUVariant(froms, tos, props)


    def _parse_variant_word(self, name):
        name = name.strip()
        match = re.fullmatch(r'([~^]?)([^~$^]*)([~$]?)', name)
        if match is None or (match.group(1) == '~' and match.group(3) == '~'):
            raise UsageError("Invalid variant word descriptor '{}'".format(name))
        norm_name = self.norm.transliterate(match.group(2))
        if not norm_name:
            return None

        return norm_name, match.group(1), match.group(3)


_FLAG_MATCH = {'^': '^ ',
               '$': ' ^',
               '': ' '}


def _create_variants(src, preflag, postflag, repl, decompose):
    if preflag == '~':
        postfix = _FLAG_MATCH[postflag]
        # suffix decomposition
        src = src + postfix
        repl = repl + postfix

        yield src, repl
        yield ' ' + src, ' ' + repl

        if decompose:
            yield src, ' ' + repl
            yield ' ' + src, repl
    elif postflag == '~':
        # prefix decomposition
        prefix = _FLAG_MATCH[preflag]
        src = prefix + src
        repl = prefix + repl

        yield src, repl
        yield src + ' ', repl + ' '

        if decompose:
            yield src, repl + ' '
            yield src + ' ', repl
    else:
        prefix = _FLAG_MATCH[preflag]
        postfix = _FLAG_MATCH[postflag]

        yield prefix + src + postfix, prefix + repl + postfix
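A brief sketch of the loader in isolation (the path is an example; any YAML
file with the three sections documented in the new Tokenizers.md works):

```python
from pathlib import Path

from nominatim.tokenizer.icu_rule_loader import ICURuleLoader

loader = ICURuleLoader(Path('settings/legacy_icu_tokenizer.yaml'))

search_rules = loader.get_search_rules()    # normalization + transliteration, concatenated
variant_set = loader.get_replacement_pairs()  # set of ICUVariant tuples
print(len(variant_set), 'variant rules compiled')
```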
58	nominatim/tokenizer/icu_variants.py (new file)
@@ -0,0 +1,58 @@
"""
Data structures for saving variant expansions for ICU tokenizer.
"""
from collections import namedtuple
import json

_ICU_VARIANT_PROPERTY_FIELDS = ['lang']


class ICUVariantProperties(namedtuple('_ICUVariantProperties', _ICU_VARIANT_PROPERTY_FIELDS,
                                      defaults=(None, )*len(_ICU_VARIANT_PROPERTY_FIELDS))):
    """ Data container for saving properties that describe when a variant
        should be applied.

        Property instances are hashable.
    """
    @classmethod
    def from_rules(cls, _):
        """ Create a new property type from a generic dictionary.

            The function only takes into account the properties that are
            understood presently and ignores all others.
        """
        return cls(lang=None)


ICUVariant = namedtuple('ICUVariant', ['source', 'replacement', 'properties'])


def pickle_variant_set(variants):
    """ Serializes an iterable of variant rules to a string.
    """
    # Create a list of property sets, so they don't need to be duplicated.
    properties = {}
    pid = 1
    for variant in variants:
        if variant.properties not in properties:
            properties[variant.properties] = pid
            pid += 1

    # Convert the variants into a simple list.
    variants = [(v.source, v.replacement, properties[v.properties]) for v in variants]

    # Convert everything to JSON.
    return json.dumps({'properties': {v: k._asdict() for k, v in properties.items()},
                       'variants': variants})


def unpickle_variant_set(variant_string):
    """ Deserializes a variant string that was previously created with
        pickle_variant_set() into a set of ICUVariants.
    """
    data = json.loads(variant_string)

    properties = {int(k): ICUVariantProperties(**v) for k, v in data['properties'].items()}

    return set((ICUVariant(src, repl, properties[pid]) for src, repl, pid in data['variants']))
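Despite the `pickle_*` naming, the serialization format is JSON. A round-trip
sketch (the rule contents are invented):

```python
from nominatim.tokenizer.icu_variants import (ICUVariant, ICUVariantProperties,
                                              pickle_variant_set,
                                              unpickle_variant_set)

props = ICUVariantProperties.from_rules({})
rules = {ICUVariant('strasse ', 'str ', props),
         ICUVariant('strasse ', 'strasse ', props)}

blob = pickle_variant_set(rules)     # a JSON string, storable as a DB property
assert unpickle_variant_set(blob) == rules
```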
@@ -3,26 +3,23 @@ Tokenizer implementing normalisation as used before Nominatim 4 but using
 libICU instead of the PostgreSQL module.
 """
 from collections import Counter
-import functools
-import io
 import itertools
-import json
 import logging
 import re
 from textwrap import dedent
 from pathlib import Path

-from icu import Transliterator
 import psycopg2.extras

 from nominatim.db.connection import connect
 from nominatim.db.properties import set_property, get_property
+from nominatim.db.utils import CopyBuffer
 from nominatim.db.sql_preprocessor import SQLPreprocessor
+from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
+from nominatim.tokenizer.icu_name_processor import ICUNameProcessor, ICUNameProcessorRules

-DBCFG_NORMALIZATION = "tokenizer_normalization"
 DBCFG_MAXWORDFREQ = "tokenizer_maxwordfreq"
-DBCFG_TRANSLITERATION = "tokenizer_transliteration"
-DBCFG_ABBREVIATIONS = "tokenizer_abbreviations"
+DBCFG_TERM_NORMALIZATION = "tokenizer_term_normalization"

 LOG = logging.getLogger()

@@ -41,9 +38,9 @@ class LegacyICUTokenizer:
     def __init__(self, dsn, data_dir):
         self.dsn = dsn
         self.data_dir = data_dir
-        self.normalization = None
-        self.transliteration = None
-        self.abbreviations = None
+        self.naming_rules = None
+        self.term_normalization = None
+        self.max_word_frequency = None


     def init_new_db(self, config, init_db=True):
@@ -55,14 +52,14 @@ class LegacyICUTokenizer:
         if config.TOKENIZER_CONFIG:
             cfgfile = Path(config.TOKENIZER_CONFIG)
         else:
-            cfgfile = config.config_dir / 'legacy_icu_tokenizer.json'
+            cfgfile = config.config_dir / 'legacy_icu_tokenizer.yaml'

-        rules = json.loads(cfgfile.read_text())
-        self.transliteration = ';'.join(rules['normalization']) + ';'
-        self.abbreviations = rules["abbreviations"]
-        self.normalization = config.TERM_NORMALIZATION
+        loader = ICURuleLoader(cfgfile)
+        self.naming_rules = ICUNameProcessorRules(loader=loader)
+        self.term_normalization = config.TERM_NORMALIZATION
+        self.max_word_frequency = config.MAX_WORD_FREQUENCY

-        self._install_php(config)
+        self._install_php(config.lib_dir.php)
         self._save_config(config)

         if init_db:
@@ -74,9 +71,9 @@ class LegacyICUTokenizer:
         """ Initialise the tokenizer from the project directory.
         """
         with connect(self.dsn) as conn:
-            self.normalization = get_property(conn, DBCFG_NORMALIZATION)
-            self.transliteration = get_property(conn, DBCFG_TRANSLITERATION)
-            self.abbreviations = json.loads(get_property(conn, DBCFG_ABBREVIATIONS))
+            self.naming_rules = ICUNameProcessorRules(conn=conn)
+            self.term_normalization = get_property(conn, DBCFG_TERM_NORMALIZATION)
+            self.max_word_frequency = get_property(conn, DBCFG_MAXWORDFREQ)


     def finalize_import(self, config):
@@ -103,9 +100,7 @@ class LegacyICUTokenizer:
         """
         self.init_from_project()

-        if self.normalization is None\
-           or self.transliteration is None\
-           or self.abbreviations is None:
+        if self.naming_rules is None:
             return "Configuration for tokenizer 'legacy_icu' is missing."

         return None
@@ -126,26 +121,20 @@ class LegacyICUTokenizer:

             Analyzers are not thread-safe. You need to instantiate one per thread.
         """
-        norm = Transliterator.createFromRules("normalizer", self.normalization)
-        trans = Transliterator.createFromRules("trans", self.transliteration)
-        return LegacyICUNameAnalyzer(self.dsn, norm, trans, self.abbreviations)
+        return LegacyICUNameAnalyzer(self.dsn, ICUNameProcessor(self.naming_rules))


-    def _install_php(self, config):
+    # pylint: disable=missing-format-attribute
+    def _install_php(self, phpdir):
         """ Install the php script for the tokenizer.
         """
-        abbr_inverse = list(zip(*self.abbreviations))
         php_file = self.data_dir / "tokenizer.php"
         php_file.write_text(dedent("""\
             <?php
-            @define('CONST_Max_Word_Frequency', {1.MAX_WORD_FREQUENCY});
-            @define('CONST_Term_Normalization_Rules', "{0.normalization}");
-            @define('CONST_Transliteration', "{0.transliteration}");
-            @define('CONST_Abbreviations', array(array('{2}'), array('{3}')));
-            require_once('{1.lib_dir.php}/tokenizer/legacy_icu_tokenizer.php');
-            """.format(self, config,
-                       "','".join(abbr_inverse[0]),
-                       "','".join(abbr_inverse[1]))))
+            @define('CONST_Max_Word_Frequency', {0.max_word_frequency});
+            @define('CONST_Term_Normalization_Rules', "{0.term_normalization}");
+            @define('CONST_Transliteration', "{0.naming_rules.search_rules}");
+            require_once('{1}/tokenizer/legacy_icu_tokenizer.php');
+            """.format(self, phpdir)))


     def _save_config(self, config):
@@ -153,10 +142,10 @@ class LegacyICUTokenizer:
             database as database properties.
         """
         with connect(self.dsn) as conn:
-            set_property(conn, DBCFG_NORMALIZATION, self.normalization)
+            self.naming_rules.save_rules(conn)
+
             set_property(conn, DBCFG_MAXWORDFREQ, config.MAX_WORD_FREQUENCY)
-            set_property(conn, DBCFG_TRANSLITERATION, self.transliteration)
-            set_property(conn, DBCFG_ABBREVIATIONS, json.dumps(self.abbreviations))
+            set_property(conn, DBCFG_TERM_NORMALIZATION, self.term_normalization)


     def _init_db_tables(self, config):
@@ -172,25 +161,30 @@ class LegacyICUTokenizer:

         # get partial words and their frequencies
         words = Counter()
-        with self.name_analyzer() as analyzer:
-            with conn.cursor(name="words") as cur:
-                cur.execute("SELECT svals(name) as v, count(*) FROM place GROUP BY v")
+        name_proc = ICUNameProcessor(self.naming_rules)
+        with conn.cursor(name="words") as cur:
+            cur.execute(""" SELECT v, count(*) FROM
+                              (SELECT svals(name) as v FROM place)x
+                            WHERE length(v) < 75 GROUP BY v""")

-                for name, cnt in cur:
-                    term = analyzer.make_standard_word(name)
-                    if term:
-                        for word in term.split():
-                            words[word] += cnt
+            for name, cnt in cur:
+                terms = set()
+                for word in name_proc.get_variants_ascii(name_proc.get_normalized(name)):
+                    if ' ' in word:
+                        terms.update(word.split())
+                for term in terms:
+                    words[term] += cnt

         # copy them back into the word table
-        copystr = io.StringIO(''.join(('{}\t{}\n'.format(*args) for args in words.items())))
+        with CopyBuffer() as copystr:
+            for args in words.items():
+                copystr.add(*args)

-        with conn.cursor() as cur:
-            copystr.seek(0)
-            cur.copy_from(copystr, 'word', columns=['word_token', 'search_name_count'])
-            cur.execute("""UPDATE word SET word_id = nextval('seq_word')
-                           WHERE word_id is null""")
+            with conn.cursor() as cur:
+                copystr.copy_out(cur, 'word',
+                                 columns=['word_token', 'search_name_count'])
+                cur.execute("""UPDATE word SET word_id = nextval('seq_word')
+                               WHERE word_id is null""")

         conn.commit()

@@ -202,12 +196,10 @@ class LegacyICUNameAnalyzer:
         normalization.
     """

-    def __init__(self, dsn, normalizer, transliterator, abbreviations):
+    def __init__(self, dsn, name_proc):
         self.conn = connect(dsn).connection
         self.conn.autocommit = True
-        self.normalizer = normalizer
-        self.transliterator = transliterator
-        self.abbreviations = abbreviations
+        self.name_processor = name_proc

         self._cache = _TokenCache()

@@ -228,7 +220,7 @@ class LegacyICUNameAnalyzer:
         self.conn = None


-    def get_word_token_info(self, conn, words):
+    def get_word_token_info(self, words):
         """ Return token information for the given list of words.

             If a word starts with # it is assumed to be a full name,
             otherwise it is a partial name.
@@ -242,11 +234,11 @@ class LegacyICUNameAnalyzer:
         tokens = {}
         for word in words:
             if word.startswith('#'):
-                tokens[word] = ' ' + self.make_standard_word(word[1:])
+                tokens[word] = ' ' + self.name_processor.get_search_normalized(word[1:])
             else:
-                tokens[word] = self.make_standard_word(word)
+                tokens[word] = self.name_processor.get_search_normalized(word)

-        with conn.cursor() as cur:
+        with self.conn.cursor() as cur:
             cur.execute("""SELECT word_token, word_id
                            FROM word, (SELECT unnest(%s::TEXT[]) as term) t
                            WHERE word_token = t.term
@@ -254,15 +246,9 @@ class LegacyICUNameAnalyzer:
                         (list(tokens.values()), ))
             ids = {r[0]: r[1] for r in cur}

-        return [(k, v, ids[v]) for k, v in tokens.items()]
+        return [(k, v, ids.get(v, None)) for k, v in tokens.items()]


-    def normalize(self, phrase):
-        """ Normalize the given phrase, i.e. remove all properties that
-            are irrelevant for search.
-        """
-        return self.normalizer.transliterate(phrase)
-
     @staticmethod
     def normalize_postcode(postcode):
         """ Convert the postcode to a standardized form.
@@ -273,34 +259,18 @@ class LegacyICUNameAnalyzer:
         return postcode.strip().upper()


-    @functools.lru_cache(maxsize=1024)
-    def make_standard_word(self, name):
-        """ Create the normalised version of the input.
-        """
-        norm = ' ' + self.transliterator.transliterate(name) + ' '
-        for full, abbr in self.abbreviations:
-            if full in norm:
-                norm = norm.replace(full, abbr)
-
-        return norm.strip()
-
-
     def _make_standard_hnr(self, hnr):
         """ Create a normalised version of a housenumber.

            This function takes minor shortcuts on transliteration.
         """
         if hnr.isdigit():
             return hnr

-        return self.transliterator.transliterate(hnr)
+        return self.name_processor.get_search_normalized(hnr)

     def update_postcodes_from_db(self):
         """ Update postcode tokens in the word table from the location_postcode
             table.
         """
         to_delete = []
-        copystr = io.StringIO()
         with self.conn.cursor() as cur:
             # This finds us the rows in location_postcode and word that are
             # missing in the other table.
@@ -313,32 +283,31 @@ class LegacyICUNameAnalyzer:
                             ON pc = word) x
                    WHERE pc is null or word is null""")

-            for postcode, word in cur:
-                if postcode is None:
-                    to_delete.append(word)
-                else:
-                    copystr.write(postcode)
-                    copystr.write('\t ')
-                    copystr.write(self.transliterator.transliterate(postcode))
-                    copystr.write('\tplace\tpostcode\t0\n')
+            with CopyBuffer() as copystr:
+                for postcode, word in cur:
+                    if postcode is None:
+                        to_delete.append(word)
+                    else:
+                        copystr.add(
+                            postcode,
+                            ' ' + self.name_processor.get_search_normalized(postcode),
+                            'place', 'postcode', 0)

-            if to_delete:
-                cur.execute("""DELETE FROM WORD
-                               WHERE class ='place' and type = 'postcode'
-                                     and word = any(%s)
-                            """, (to_delete, ))
+                if to_delete:
+                    cur.execute("""DELETE FROM WORD
+                                   WHERE class ='place' and type = 'postcode'
+                                         and word = any(%s)
+                                """, (to_delete, ))

-            if copystr.getvalue():
-                copystr.seek(0)
-                cur.copy_from(copystr, 'word',
-                              columns=['word', 'word_token', 'class', 'type',
-                                       'search_name_count'])
+                copystr.copy_out(cur, 'word',
+                                 columns=['word', 'word_token', 'class', 'type',
                                          'search_name_count'])


     def update_special_phrases(self, phrases, should_replace):
         """ Replace the search index for special phrases with the new phrases.
         """
-        norm_phrases = set(((self.normalize(p[0]), p[1], p[2], p[3])
+        norm_phrases = set(((self.name_processor.get_normalized(p[0]), p[1], p[2], p[3])
                             for p in phrases))

         with self.conn.cursor() as cur:
@@ -350,54 +319,64 @@ class LegacyICUNameAnalyzer:
             for label, cls, typ, oper in cur:
                 existing_phrases.add((label, cls, typ, oper or '-'))

-            to_add = norm_phrases - existing_phrases
-            to_delete = existing_phrases - norm_phrases
-
-            if to_add:
-                copystr = io.StringIO()
-                for word, cls, typ, oper in to_add:
-                    term = self.make_standard_word(word)
-                    if term:
-                        copystr.write(word)
-                        copystr.write('\t ')
-                        copystr.write(term)
-                        copystr.write('\t')
-                        copystr.write(cls)
-                        copystr.write('\t')
-                        copystr.write(typ)
-                        copystr.write('\t')
-                        copystr.write(oper if oper in ('in', 'near') else '\\N')
-                        copystr.write('\t0\n')
-
-                copystr.seek(0)
-                cur.copy_from(copystr, 'word',
-                              columns=['word', 'word_token', 'class', 'type',
-                                       'operator', 'search_name_count'])
-
-            if to_delete and should_replace:
-                psycopg2.extras.execute_values(
-                    cur,
-                    """ DELETE FROM word USING (VALUES %s) as v(name, in_class, in_type, op)
-                        WHERE word = name and class = in_class and type = in_type
-                              and ((op = '-' and operator is null) or op = operator)""",
-                    to_delete)
+            added = self._add_special_phrases(cur, norm_phrases, existing_phrases)
+            if should_replace:
+                deleted = self._remove_special_phrases(cur, norm_phrases,
+                                                       existing_phrases)
+            else:
+                deleted = 0

         LOG.info("Total phrases: %s. Added: %s. Deleted: %s",
-                 len(norm_phrases), len(to_add), len(to_delete))
+                 len(norm_phrases), added, deleted)
+
+
+    def _add_special_phrases(self, cursor, new_phrases, existing_phrases):
+        """ Add all phrases to the database that are not yet there.
+        """
+        to_add = new_phrases - existing_phrases
+
+        added = 0
+        with CopyBuffer() as copystr:
+            for word, cls, typ, oper in to_add:
+                term = self.name_processor.get_search_normalized(word)
+                if term:
+                    copystr.add(word, ' ' + term, cls, typ,
+                                oper if oper in ('in', 'near') else None, 0)
+                    added += 1
+
+            copystr.copy_out(cursor, 'word',
+                             columns=['word', 'word_token', 'class', 'type',
+                                      'operator', 'search_name_count'])
+
+        return added
+
+
+    @staticmethod
+    def _remove_special_phrases(cursor, new_phrases, existing_phrases):
+        """ Remove all phrases from the database that are no longer in the
+            new phrase list.
+        """
+        to_delete = existing_phrases - new_phrases
+
+        if to_delete:
+            psycopg2.extras.execute_values(
+                cursor,
+                """ DELETE FROM word USING (VALUES %s) as v(name, in_class, in_type, op)
+                    WHERE word = name and class = in_class and type = in_type
+                          and ((op = '-' and operator is null) or op = operator)""",
+                to_delete)
+
+        return len(to_delete)


     def add_country_names(self, country_code, names):
         """ Add names for the given country to the search index.
         """
-        full_names = set((self.make_standard_word(n) for n in names))
-        full_names.discard('')
-        self._add_normalized_country_names(country_code, full_names)
+        word_tokens = set()
+        for name in self._compute_full_names(names):
+            if name:
+                word_tokens.add(' ' + self.name_processor.get_search_normalized(name))


-    def _add_normalized_country_names(self, country_code, names):
-        """ Add names for the given country to the search index.
-        """
-        word_tokens = set((' ' + name for name in names))
         with self.conn.cursor() as cur:
             # Get existing names
             cur.execute("SELECT word_token FROM word WHERE country_code = %s",
@@ -423,14 +402,13 @@ class LegacyICUNameAnalyzer:
         names = place.get('name')

         if names:
-            full_names = self._compute_full_names(names)
+            fulls, partials = self._compute_name_tokens(names)

-            token_info.add_names(self.conn, full_names)
+            token_info.add_names(fulls, partials)

             country_feature = place.get('country_feature')
             if country_feature and re.fullmatch(r'[A-Za-z][A-Za-z]', country_feature):
-                self._add_normalized_country_names(country_feature.lower(),
-                                                   full_names)
+                self.add_country_names(country_feature.lower(), names)

         address = place.get('address')

@@ -443,38 +421,65 @@ class LegacyICUNameAnalyzer:
             elif key in ('housenumber', 'streetnumber', 'conscriptionnumber'):
                 hnrs.append(value)
             elif key == 'street':
-                token_info.add_street(self.conn, self.make_standard_word(value))
+                token_info.add_street(*self._compute_name_tokens({'name': value}))
             elif key == 'place':
-                token_info.add_place(self.conn, self.make_standard_word(value))
+                token_info.add_place(*self._compute_name_tokens({'name': value}))
             elif not key.startswith('_') and \
                  key not in ('country', 'full'):
-                addr_terms.append((key, self.make_standard_word(value)))
+                addr_terms.append((key, *self._compute_name_tokens({'name': value})))

         if hnrs:
             hnrs = self._split_housenumbers(hnrs)
             token_info.add_housenumbers(self.conn, [self._make_standard_hnr(n) for n in hnrs])

         if addr_terms:
-            token_info.add_address_terms(self.conn, addr_terms)
+            token_info.add_address_terms(addr_terms)

         return token_info.data


-    def _compute_full_names(self, names):
+    def _compute_name_tokens(self, names):
+        """ Computes the full name and partial name tokens for the given
+            dictionary of names.
+        """
+        full_names = self._compute_full_names(names)
+        full_tokens = set()
+        partial_tokens = set()
+
+        for name in full_names:
+            norm_name = self.name_processor.get_normalized(name)
+            full, part = self._cache.names.get(norm_name, (None, None))
+            if full is None:
+                variants = self.name_processor.get_variants_ascii(norm_name)
+                if not variants:
+                    continue
+
+                with self.conn.cursor() as cur:
+                    cur.execute("SELECT (getorcreate_full_word(%s, %s)).*",
+                                (norm_name, variants))
+                    full, part = cur.fetchone()
+
+                self._cache.names[norm_name] = (full, part)
+
+            full_tokens.add(full)
+            partial_tokens.update(part)
+
+        return full_tokens, partial_tokens
+
+
+    @staticmethod
+    def _compute_full_names(names):
         """ Return the set of all full name word ids to be used with the
             given dictionary of names.
         """
         full_names = set()
-        for name in (n for ns in names.values() for n in re.split('[;,]', ns)):
-            word = self.make_standard_word(name)
-            if word:
-                full_names.add(word)
+        for name in (n.strip() for ns in names.values() for n in re.split('[;,]', ns)):
+            if name:
+                full_names.add(name)

-            brace_split = name.split('(', 2)
-            if len(brace_split) > 1:
-                word = self.make_standard_word(brace_split[0])
-                if word:
-                    full_names.add(word)
+                brace_idx = name.find('(')
+                if brace_idx >= 0:
+                    full_names.add(name[:brace_idx].strip())

         return full_names
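A quick illustration of the new `_compute_full_names` behaviour (the name
values are invented for the example):

```python
from nominatim.tokenizer.legacy_icu_tokenizer import LegacyICUNameAnalyzer

names = {'name': 'Brooklyn Bridge;Brücke', 'short_name': 'Bklyn Br (hist)'}

# Values are split on ';' and ','; a bracketed suffix additionally yields
# the name with the bracket part stripped.
full = LegacyICUNameAnalyzer._compute_full_names(names)
# full == {'Brooklyn Bridge', 'Brücke', 'Bklyn Br (hist)', 'Bklyn Br'}
```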
@@ -486,7 +491,7 @@ class LegacyICUNameAnalyzer:
         postcode = self.normalize_postcode(postcode)

         if postcode not in self._cache.postcodes:
-            term = self.make_standard_word(postcode)
+            term = self.name_processor.get_search_normalized(postcode)
             if not term:
                 return

@@ -502,6 +507,7 @@ class LegacyICUNameAnalyzer:
                     """, (' ' + term, postcode))
                 self._cache.postcodes.add(postcode)

+
     @staticmethod
     def _split_housenumbers(hnrs):
         if len(hnrs) > 1 or ',' in hnrs[0] or ';' in hnrs[0]:
@@ -524,7 +530,7 @@ class _TokenInfo:
     """ Collect token information to be sent back to the database.
     """
     def __init__(self, cache):
-        self.cache = cache
+        self._cache = cache
         self.data = {}

     @staticmethod
@@ -532,86 +538,44 @@ class _TokenInfo:
         return '{%s}' % ','.join((str(s) for s in tokens))


-    def add_names(self, conn, names):
+    def add_names(self, fulls, partials):
         """ Adds token information for the normalised names.
         """
-        # Start with all partial names
-        terms = set((part for ns in names for part in ns.split()))
-        # Add the full names
-        terms.update((' ' + n for n in names))
-
-        self.data['names'] = self._mk_array(self.cache.get_term_tokens(conn, terms))
+        self.data['names'] = self._mk_array(itertools.chain(fulls, partials))


     def add_housenumbers(self, conn, hnrs):
         """ Extract housenumber information from a list of normalised
             housenumbers.
         """
-        self.data['hnr_tokens'] = self._mk_array(self.cache.get_hnr_tokens(conn, hnrs))
+        self.data['hnr_tokens'] = self._mk_array(self._cache.get_hnr_tokens(conn, hnrs))
         self.data['hnr'] = ';'.join(hnrs)


-    def add_street(self, conn, street):
+    def add_street(self, fulls, _):
         """ Add addr:street match terms.
         """
-        if not street:
-            return
-
-        term = ' ' + street
-
-        tid = self.cache.names.get(term)
-
-        if tid is None:
-            with conn.cursor() as cur:
-                cur.execute("""SELECT word_id FROM word
-                               WHERE word_token = %s
-                                     and class is null and type is null""",
-                            (term, ))
-                if cur.rowcount > 0:
-                    tid = cur.fetchone()[0]
-                    self.cache.names[term] = tid
-
-        if tid is not None:
-            self.data['street'] = '{%d}' % tid
+        if fulls:
+            self.data['street'] = self._mk_array(fulls)


-    def add_place(self, conn, place):
+    def add_place(self, fulls, partials):
         """ Add addr:place search and match terms.
         """
-        if not place:
-            return
-
-        partial_ids = self.cache.get_term_tokens(conn, place.split())
-        tid = self.cache.get_term_tokens(conn, [' ' + place])
-
-        self.data['place_search'] = self._mk_array(itertools.chain(partial_ids, tid))
-        self.data['place_match'] = '{%s}' % tid[0]
+        if fulls:
+            self.data['place_search'] = self._mk_array(itertools.chain(fulls, partials))
+            self.data['place_match'] = self._mk_array(fulls)


-    def add_address_terms(self, conn, terms):
+    def add_address_terms(self, terms):
         """ Add additional address terms.
         """
         tokens = {}

-        for key, value in terms:
-            if not value:
-                continue
-            partial_ids = self.cache.get_term_tokens(conn, value.split())
-            term = ' ' + value
-            tid = self.cache.names.get(term)
-
-            if tid is None:
-                with conn.cursor() as cur:
-                    cur.execute("""SELECT word_id FROM word
-                                   WHERE word_token = %s
-                                         and class is null and type is null""",
-                                (term, ))
-                    if cur.rowcount > 0:
-                        tid = cur.fetchone()[0]
-                        self.cache.names[term] = tid
-
-            tokens[key] = [self._mk_array(partial_ids),
-                           '{%s}' % ('' if tid is None else str(tid))]
+        for key, fulls, partials in terms:
+            if fulls:
+                tokens[key] = [self._mk_array(itertools.chain(fulls, partials)),
+                               self._mk_array(fulls)]

         if tokens:
             self.data['addr'] = tokens
@@ -629,32 +593,6 @@ class _TokenCache:
         self.housenumbers = {}


-    def get_term_tokens(self, conn, terms):
-        """ Get token ids for a list of terms, looking them up in the database
-            if necessary.
-        """
-        tokens = []
-        askdb = []
-
-        for term in terms:
-            token = self.names.get(term)
-            if token is None:
-                askdb.append(term)
-            elif token != 0:
-                tokens.append(token)
-
-        if askdb:
-            with conn.cursor() as cur:
-                cur.execute("SELECT term, getorcreate_term_id(term) FROM unnest(%s) as term",
-                            (askdb, ))
-                for term, tid in cur:
-                    self.names[term] = tid
-                    if tid != 0:
-                        tokens.append(tid)
-
-        return tokens
-
-
     def get_hnr_tokens(self, conn, terms):
         """ Get token ids for a list of housenumbers, looking them up in the
             database if necessary.
@ -271,8 +271,7 @@ class LegacyNameAnalyzer:
|
||||
self.conn = None
|
||||
|
||||
|
||||
@staticmethod
|
||||
def get_word_token_info(conn, words):
|
||||
def get_word_token_info(self, words):
|
||||
""" Return token information for the given list of words.
|
||||
If a word starts with # it is assumed to be a full name
|
||||
otherwise is a partial name.
|
||||
@ -283,7 +282,7 @@ class LegacyNameAnalyzer:
|
||||
The function is used for testing and debugging only
|
||||
and not necessarily efficient.
|
||||
"""
|
||||
with conn.cursor() as cur:
|
||||
with self.conn.cursor() as cur:
|
||||
cur.execute("""SELECT t.term, word_token, word_id
|
||||
FROM word, (SELECT unnest(%s::TEXT[]) as term) t
|
||||
WHERE word_token = (CASE
|
||||
@ -404,7 +403,7 @@ class LegacyNameAnalyzer:
|
||||
FROM unnest(%s)n) y
|
||||
WHERE NOT EXISTS(SELECT * FROM word
|
||||
WHERE word_token = lookup_token and country_code = %s))
|
||||
""", (country_code, names, country_code))
|
||||
""", (country_code, list(names.values()), country_code))
|
||||
|
||||
|
||||
def process_place(self, place):
|
||||
@ -422,7 +421,7 @@ class LegacyNameAnalyzer:
|
||||
|
||||
country_feature = place.get('country_feature')
|
||||
if country_feature and re.fullmatch(r'[A-Za-z][A-Za-z]', country_feature):
|
||||
self.add_country_names(country_feature.lower(), list(names.values()))
|
||||
self.add_country_names(country_feature.lower(), names)
|
||||
|
||||
address = place.get('address')
|
||||
|
||||
|
@ -272,15 +272,15 @@ def create_country_names(conn, tokenizer, languages=None):
|
||||
|
||||
with tokenizer.name_analyzer() as analyzer:
|
||||
for code, name in cur:
|
||||
names = [code]
|
||||
names = {'countrycode' : code}
|
||||
if code == 'gb':
|
||||
names.append('UK')
|
||||
names['short_name'] = 'UK'
|
||||
if code == 'us':
|
||||
names.append('United States')
|
||||
names['short_name'] = 'United States'
|
||||
|
||||
# country names (only in languages as provided)
|
||||
if name:
|
||||
names.extend((v for k, v in name.items() if _include_key(k)))
|
||||
names.update(((k, v) for k, v in name.items() if _include_key(k)))
|
||||
|
||||
analyzer.add_country_names(code, names)
|
||||
|
||||
|
4941
settings/icu-rules/extended-unicode-to-asccii.yaml
Normal file
4941
settings/icu-rules/extended-unicode-to-asccii.yaml
Normal file
File diff suppressed because it is too large
Load Diff
24
settings/icu-rules/unicode-digits-to-decimal.yaml
Normal file
24
settings/icu-rules/unicode-digits-to-decimal.yaml
Normal file
@ -0,0 +1,24 @@
|
||||
- "[𞥐𐒠߀𖭐꤀𖩠𑓐𑑐𑋰𑄶꩐꘠᱀᭐᮰᠐០᥆༠໐꧰႐᪐᪀᧐𑵐꯰᱐𑱐𑜰𑛀𑙐𑇐꧐꣐෦𑁦0𝟶𝟘𝟬𝟎𝟢₀⓿⓪⁰] > 0"
|
||||
- "[𞥑𐒡߁𖭑꤁𖩡𑓑𑑑𑋱𑄷꩑꘡᱁᭑᮱᠑១᥇༡໑꧱႑᪑᪁᧑𑵑꯱᱑𑱑𑜱𑛁𑙑𑇑꧑꣑෧𑁧1𝟷𝟙𝟭𝟏𝟣₁¹①⑴⒈❶➀➊⓵] > 1"
|
||||
- "[𞥒𐒢߂𖭒꤂𖩢𑓒𑑒𑋲𑄸꩒꘢᱂᭒᮲᠒២᥈༢໒꧲႒᪒᪂᧒𑵒꯲᱒𑱒𑜲𑛂𑙒𑇒꧒꣒෨𑁨2𝟸𝟚𝟮𝟐𝟤₂²②⑵⒉❷➁➋⓶] > 2"
|
||||
- "[𞥓𐒣߃𖭓꤃𖩣𑓓𑑓𑋳𑄹꩓꘣᱃᭓᮳᠓៣᥉༣໓꧳႓᪓᪃᧓𑵓꯳᱓𑱓𑜳𑛃𑙓𑇓꧓꣓෩𑁩3𝟹𝟛𝟯𝟑𝟥₃³③⑶⒊❸➂➌⓷] > 3"
|
||||
- "[𞥔𐒤߄𖭔꤄𖩤𑓔𑑔𑋴𑄺꩔꘤᱄᭔᮴᠔៤᥊༤໔꧴႔᪔᪄᧔𑵔꯴᱔𑱔𑜴𑛄𑙔𑇔꧔꣔෪𑁪4𝟺𝟜𝟰𝟒𝟦₄⁴④⑷⒋❹➃➍⓸] > 4"
|
||||
- "[𞥕𐒥߅𖭕꤅𖩥𑓕𑑕𑋵𑄻꩕꘥᱅᭕᮵᠕៥᥋༥໕꧵႕᪕᪅᧕𑵕꯵᱕𑱕𑜵𑛅𑙕𑇕꧕꣕෫𑁫5𝟻𝟝𝟱𝟓𝟧₅⁵⑤⑸⒌❺➄➎⓹] > 5"
|
||||
- "[𞥖𐒦߆𖭖꤆𖩦𑓖𑑖𑋶𑄼꩖꘦᱆᭖᮶᠖៦᥌༦໖꧶႖᪖᪆᧖𑵖꯶᱖𑱖𑜶𑛆𑙖𑇖꧖꣖෬𑁬6𝟼𝟞𝟲𝟔𝟨₆⁶⑥⑹⒍❻➅➏⓺] > 6"
|
||||
- "[𞥗𐒧߇𖭗꤇𖩧𑓗𑑗𑋷𑄽꩗꘧᱇᭗᮷᠗៧᥍༧໗꧷႗᪗᪇᧗𑵗꯷᱗𑱗𑜷𑛇𑙗𑇗꧗꣗෭𑁭7𝟽𝟟𝟳𝟕𝟩₇⁷⑦⑺⒎❼➆➐⓻] > 7"
|
||||
- "[𞥘𐒨߈𖭘꤈𖩨𑓘𑑘𑋸𑄾꩘꘨᱈᭘᮸᠘៨᥎༨໘꧸႘᪘᪈᧘𑵘꯸᱘𑱘𑜸𑛈𑙘𑇘꧘꣘෮𑁮8𝟾𝟠𝟴𝟖𝟪₈⁸⑧⑻⒏❽➇➑⓼] > 8"
|
||||
- "[𞥙𐒩߉𖭙꤉𖩩𑓙𑑙𑋹𑄿꩙꘩᱉᭙᮹᠙៩᥏༩໙꧹႙᪙᪉᧙𑵙꯹᱙𑱙𑜹𑛉𑙙𑇙꧙꣙෯𑁯9𝟿𝟡𝟵𝟗𝟫₉⁹⑨⑼⒐❾➈➒⓽] > 9"
|
||||
- "[𑜺⑩⑽⒑❿➉➓⓾] > '10'"
|
||||
- "[⑪⑾⒒⓫] > '11'"
|
||||
- "[⑫⑿⒓⓬] > '12'"
|
||||
- "[⑬⒀⒔⓭] > '13'"
|
||||
- "[⑭⒁⒕⓮] > '14'"
|
||||
- "[⑮⒂⒖⓯] > '15'"
|
||||
- "[⑯⒃⒗⓰] > '16'"
|
||||
- "[⑰⒄⒘⓱] > '17'"
|
||||
- "[⑱⒅⒙⓲] > '18'"
|
||||
- "[⑲⒆⒚⓳] > '19'"
|
||||
- "[𑜻⑳⒇⒛⓴] > '20'"
|
||||
- "⅐ > ' 1/7'"
|
||||
- "⅑ > ' 1/9'"
|
||||
- "⅒ > ' 1/10'"
|
19
settings/icu-rules/variants-bg.yaml
Normal file
19
settings/icu-rules/variants-bg.yaml
Normal file
@ -0,0 +1,19 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.D0.91.D1.8A.D0.BB.D0.B3.D0.B0.D1.80.D1.81.D0.BA.D0.B8_.D0.B5.D0.B7.D0.B8.D0.BA_-_Bulgarian
|
||||
- lang: bg
|
||||
words:
|
||||
- Блок -> бл
|
||||
- Булевард -> бул
|
||||
- Вход -> вх
|
||||
- Генерал -> ген
|
||||
- Град -> гр
|
||||
- Доктор -> д-р
|
||||
- Доцент -> доц
|
||||
- Капитан -> кап
|
||||
- Митрополит -> мит
|
||||
- Площад -> пл
|
||||
- Професор -> проф
|
||||
- Свети -> Св
|
||||
- Улица -> ул
|
||||
- Село -> с
|
||||
- Квартал -> кв
|
||||
- Жилищен Комплекс -> ж к
|
90
settings/icu-rules/variants-ca.yaml
Normal file
90
settings/icu-rules/variants-ca.yaml
Normal file
@ -0,0 +1,90 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Catal.C3.A0_-_Catalan
|
||||
- lang: ca
|
||||
words:
|
||||
- aparcament -> aparc
|
||||
- apartament -> apmt
|
||||
- apartat -> apt
|
||||
- àtic -> àt
|
||||
- autopista -> auto
|
||||
- autopista -> autop
|
||||
- autovia -> autov
|
||||
- avinguda -> av
|
||||
- avinguda -> avd
|
||||
- avinguda -> avda
|
||||
- baixada -> bda
|
||||
- baixos -> bxs
|
||||
- barranc -> bnc
|
||||
- barri -> b
|
||||
- barriada -> b
|
||||
- biblioteca -> bibl
|
||||
- bloc -> bl
|
||||
- carrer -> c
|
||||
- carrer -> c/
|
||||
- carreró -> cró
|
||||
- carretera -> ctra
|
||||
- cantonada -> cant
|
||||
- cementiri -> cem
|
||||
- cinturó -> cint
|
||||
- codi postal -> CP
|
||||
- collegi -> coll
|
||||
- collegi públic -> CP
|
||||
- comissaria -> com
|
||||
- convent -> convt
|
||||
- correus -> corr
|
||||
- districte -> distr
|
||||
- drecera -> drec
|
||||
- dreta -> dta
|
||||
- entrada -> entr
|
||||
- entresòl -> entl
|
||||
- escala -> esc
|
||||
- escola -> esc
|
||||
- escola universitària -> EU
|
||||
- església -> esgl
|
||||
- estació -> est
|
||||
- estacionament -> estac
|
||||
- facultat -> fac
|
||||
- finca -> fca
|
||||
- habitació -> hab
|
||||
- hospital -> hosp
|
||||
- hotel -> H
|
||||
- monestir -> mtir
|
||||
- monument -> mon
|
||||
- mossèn -> Mn
|
||||
- municipal -> mpal
|
||||
- museu -> mus
|
||||
- nacional -> nac
|
||||
- nombre -> nre
|
||||
- número -> núm
|
||||
- número -> n
|
||||
- sense número -> s/n
|
||||
- parada -> par
|
||||
- parcel·la -> parc
|
||||
- passadís -> pdís
|
||||
- passatge -> ptge
|
||||
- passeig -> pg
|
||||
- pavelló -> pav
|
||||
- plaça -> pl
|
||||
- plaça -> pça
|
||||
- planta -> pl
|
||||
- població -> pobl
|
||||
- polígon -> pol
|
||||
- polígon industrial -> PI
|
||||
- polígon industrial -> pol ind
|
||||
- porta -> pta
|
||||
- portal -> ptal
|
||||
- principal -> pral
|
||||
- pujada -> pda
|
||||
- punt quilomètric -> PK
|
||||
- rambla -> rbla
|
||||
- ronda -> rda
|
||||
- sagrada -> sgda
|
||||
- sagrat -> sgt
|
||||
- sant -> st
|
||||
- santa -> sta
|
||||
- sobreàtic -> s/àt
|
||||
- travessera -> trav
|
||||
- travessia -> trv
|
||||
- travessia -> trav
|
||||
- urbanització -> urb
|
||||
- sortida -> sort
|
||||
- via -> v
|
6
settings/icu-rules/variants-cs.yaml
Normal file
6
settings/icu-rules/variants-cs.yaml
Normal file
@ -0,0 +1,6 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Cesky_-_Czech
|
||||
- lang: cs
|
||||
words:
|
||||
- Ulice -> Ul
|
||||
- Třída -> Tř
|
||||
- Náměstí -> Nám
|
12
settings/icu-rules/variants-da.yaml
Normal file
12
settings/icu-rules/variants-da.yaml
Normal file
@ -0,0 +1,12 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Dansk_-_Danish
|
||||
- lang: da
|
||||
words:
|
||||
- Lille -> Ll
|
||||
- Nordre -> Ndr
|
||||
- Nørre -> Nr
|
||||
- Søndre, Sønder -> Sdr
|
||||
- Store -> St
|
||||
- Gammel,Gamle -> Gl
|
||||
- ~hal => hal
|
||||
- ~hallen => hallen
|
||||
- ~hallerne => hallerne
|
136
settings/icu-rules/variants-de.yaml
Normal file
136
settings/icu-rules/variants-de.yaml
Normal file
@ -0,0 +1,136 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Deutsch_-_German
|
||||
- lang: de
|
||||
words:
|
||||
- am -> a
|
||||
- an der -> a d
|
||||
- Allgemeines Krankenhaus -> AKH
|
||||
- Altstoffsammelzentrum -> ASZ
|
||||
- auf der -> a d
|
||||
- ~bach -> B
|
||||
- Bad -> B
|
||||
- Bahnhof -> Bhf
|
||||
- Bayerisch, Bayerische, Bayerischer, Bayerisches -> Bayer
|
||||
- Berg -> B
|
||||
- ~berg |-> bg
|
||||
- Bezirk -> Bez
|
||||
- ~brücke -> Br
|
||||
- Bundesgymnasium -> BG
|
||||
- Bundespolizeidirektion -> BPD
|
||||
- Bundesrealgymnasium -> BRG
|
||||
- ~burg |-> bg
|
||||
- burgenländische,burgenländischer,burgenländisches -> bgld
|
||||
- Bürgermeister -> Bgm
|
||||
- Chaussee -> Ch
|
||||
- Deutsche, Deutscher, Deutsches -> dt
|
||||
- Deutscher Alpenverein -> DAV
|
||||
- Deutsch -> Dt
|
||||
- ~denkmal -> Dkm
|
||||
- Dorf -> Df
|
||||
- ~dorf |-> df
|
||||
- Doktor -> Dr
|
||||
- ehemalige, ehemaliger, ehemaliges -> ehem
|
||||
- Fabrik -> Fb
|
||||
- Fachhochschule -> FH
|
||||
- Freiwillige Feuerwehr -> FF
|
||||
- Forsthaus -> Fh
|
||||
- ~gasse |-> g
|
||||
- Gasthaus -> Gh
|
||||
- Gasthof -> Ghf
|
||||
- Gemeinde -> Gde
|
||||
- Graben -> Gr
|
||||
- Großer, Große, Großes -> Gr, G
|
||||
- Gymnasium und Realgymnasium -> GRG
|
||||
- Handelsakademie -> HAK
|
||||
- Handelsschule -> HASCH
|
||||
- Haltestelle -> Hst
|
||||
- Hauptbahnhof -> Hbf
|
||||
- Haus -> Hs
|
||||
- Heilige, Heiliger, Heiliges -> Hl
|
||||
- Hintere, Hinterer, Hinteres -> Ht, Hint
|
||||
- Hohe, Hoher, Hohes -> H
|
||||
- ~höhle -> H
|
||||
- Höhere Technische Lehranstalt -> HTL
|
||||
- ~hütte -> Htt
|
||||
- im -> i
|
||||
- in -> i
|
||||
- in der -> i d
|
||||
- Ingenieur -> Ing
|
||||
- Internationale, Internationaler, Internationales -> Int
|
||||
- Jagdhaus -> Jh
|
||||
- Jagdhütte -> Jhtt
|
||||
- Kapelle -> Kap, Kpl
|
||||
- Katastralgemeinde -> KG
|
||||
- Kläranlage -> KA
|
||||
- Kleiner, Kleine, Kleines -> kl
|
||||
- Klein~ -> Kl.
|
||||
- Kleingartenanlage -> KGA
|
||||
- Kleingartenverein -> KGV
|
||||
- Kogel -> Kg
|
||||
- ~kogel |-> kg
|
||||
- Konzentrationslager -> KZ, KL
|
||||
- Krankenhaus -> KH
|
||||
- ~kreuz |-> kz
|
||||
- Landeskrankenhaus -> LKH
|
||||
- Maria -> Ma
|
||||
- Magister -> Mag
|
||||
- Magistratsabteilung -> MA
|
||||
- Markt -> Mkt
|
||||
- Müllverbrennungsanlage -> MVA
|
||||
- Nationalpark -> NP
|
||||
- Naturschutzgebiet -> NSG
|
||||
- Neue Mittelschule -> NMS
|
||||
- Niedere, Niederer, Niederes -> Nd
|
||||
- Niederösterreich -> NÖ
|
||||
- nördliche, nördlicher, nördliches -> nördl
|
||||
- Nummer -> Nr
|
||||
- ob -> o
|
||||
- Oberer, Obere, Oberes -> ob
|
||||
- Ober~ -> Ob
|
||||
- Österreichischer Alpenverein -> ÖAV
|
||||
- Österreichischer Gebirgsverein -> ÖGV
|
||||
- Österreichischer Touristenklub -> ÖTK
|
||||
- östliche, östlicher, östliches -> östl
|
||||
- Pater -> P
|
||||
- Pfad -> P
|
||||
- Platz -> Pl
|
||||
- ~platz$ -> pl
|
||||
- Professor -> Prof
|
||||
- Quelle -> Q, Qu
|
||||
- Reservoir -> Res
|
||||
- Rhein -> Rh
|
||||
- Rundwanderweg -> RWW
|
||||
- Ruine -> R
|
||||
- Sandgrube, Schottergrube -> SG
|
||||
- Sankt -> St
|
||||
- Schloss -> Schl
|
||||
- See -> S
|
||||
- ~siedlung -> sdlg
|
||||
- Sozialmedizinisches Zentrum -> SMZ
|
||||
- ~Spitze -> Sp
|
||||
- Steinbruch -> Stb
|
||||
- ~stiege -> stg
|
||||
- ~strasse -> str
|
||||
- südliche, südlicher, südliches -> südl
|
||||
- Unterer, Untere, Unteres -> u, unt
|
||||
- Unter~ -> U
|
||||
- Teich -> T
|
||||
- Technische Universität -> TU
|
||||
- Truppenübungsplatz -> TÜPL, TÜPl
|
||||
- Unfallkrankenhaus -> UKH
|
||||
- ~universität -> uni
|
||||
- verfallen -> verf
|
||||
- von -> v
|
||||
- Vordere, Vorderer, Vorderes -> Vd, Vord
|
||||
- Vorder… -> Vd, Vord
|
||||
- von der -> v d
|
||||
- vor der -> v d
|
||||
- Volksschule -> VS
|
||||
- Wald -> W
|
||||
- Wasserfall -> Wsf, Wssf
|
||||
- ~weg$ -> w
|
||||
- westliche, westlicher, westliches -> westl
|
||||
- Wiener -> Wr
|
||||
- ~wiese$ -> ws
|
||||
- Wirtschaftsuniversität -> WU
|
||||
- Wirtshaus -> Wh
|
||||
- zum -> z
|
54
settings/icu-rules/variants-el.yaml
Normal file
54
settings/icu-rules/variants-el.yaml
Normal file
@ -0,0 +1,54 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.CE.95.CE.BB.CE.BB.CE.B7.CE.BD.CE.B9.CE.BA.CE.AC_-_Greek
|
||||
- lang: el
|
||||
words:
|
||||
- Αγίας -> Αγ
|
||||
- Αγίου -> Αγ
|
||||
- Αγίων -> Αγ
|
||||
- Αδελφοί -> Αφοί
|
||||
- Αδελφών -> Αφών
|
||||
- Αλέξανδρου -> Αλ
|
||||
- Ανώτατο Τεχνολογικό Εκπαιδευτικό Ίδρυμα -> ΑΤΕΙ
|
||||
- Αστυνομικό Τμήμα -> ΑΤ
|
||||
- Βασιλέως -> Β
|
||||
- Βασιλέως -> Βασ
|
||||
- Βασιλίσσης -> Β
|
||||
- Βασιλίσσης -> Βασ
|
||||
- Γρηγορίου -> Γρ
|
||||
- Δήμος -> Δ
|
||||
- Δημοτικό Σχολείο -> ΔΣ
|
||||
- Δημοτικό Σχολείο -> Δημ Σχ
|
||||
- Εθνάρχου -> Εθν
|
||||
- Εθνική -> Εθν
|
||||
- Εθνικής -> Εθν
|
||||
- Ελευθέριος -> Ελ
|
||||
- Ελευθερίου -> Ελ
|
||||
- Ελληνικά Ταχυδρομεία -> ΕΛΤΑ
|
||||
- Θεσσαλονίκης -> Θεσ/νίκης
|
||||
- Ιερά Μονή -> Ι Μ
|
||||
- Ιερός Ναός -> Ι Ν
|
||||
- Κτίριο -> Κτ
|
||||
- Κωνσταντίνου -> Κων/νου
|
||||
- Λεωφόρος -> Λ
|
||||
- Λεωφόρος -> Λεωφ
|
||||
- Λίμνη -> Λ
|
||||
- Νέα -> Ν
|
||||
- Νέες -> Ν
|
||||
- Νέο -> Ν
|
||||
- Νέοι -> Ν
|
||||
- Νέος -> Ν
|
||||
- Νησί -> Ν
|
||||
- Νομός -> Ν
|
||||
- Όρος -> Όρ
|
||||
- Παλαιά -> Π
|
||||
- Παλαιές -> Π
|
||||
- Παλαιό -> Π
|
||||
- Παλαιοί -> Π
|
||||
- Παλαιός -> Π
|
||||
- Πανεπιστήμιο -> ΑΕΙ
|
||||
- Πανεπιστήμιο -> Παν
|
||||
- Πλατεία -> Πλ
|
||||
- Ποταμός -> Π
|
||||
- Ποταμός -> Ποτ
|
||||
- Στρατηγού -> Στρ
|
||||
- Ταχυδρομείο -> ΕΛΤΑ
|
||||
- Τεχνολογικό Εκπαιδευτικό Ίδρυμα -> ΤΕΙ
|
485
settings/icu-rules/variants-en.yaml
Normal file
485
settings/icu-rules/variants-en.yaml
Normal file
@ -0,0 +1,485 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#English
|
||||
- lang: en
|
||||
words:
|
||||
- Access -> Accs
|
||||
- Air Force Base -> AFB
|
||||
- Air National Guard Base -> ANGB
|
||||
- Airport -> Aprt
|
||||
- Alley -> Al
|
||||
- Alley -> All
|
||||
- Alley -> Ally
|
||||
- Alley -> Aly
|
||||
- Alleyway -> Alwy
|
||||
- Amble -> Ambl
|
||||
- Apartments -> Apts
|
||||
- Approach -> Apch
|
||||
- Approach -> App
|
||||
- Arcade -> Arc
|
||||
- Arterial -> Artl
|
||||
- Artery -> Arty
|
||||
- Avenue -> Av
|
||||
- Avenue -> Ave
|
||||
- Back -> Bk
|
||||
- Banan -> Ba
|
||||
- Basin -> Basn
|
||||
- Basin -> Bsn
|
||||
- Beach -> Bch
|
||||
- Bend -> Bend
|
||||
- Bend -> Bnd
|
||||
- Block -> Blk
|
||||
- Boardwalk -> Bwlk
|
||||
- Boulevard -> Blvd
|
||||
- Boulevard -> Bvd
|
||||
- Boundary -> Bdy
|
||||
- Bowl -> Bl
|
||||
- Brace -> Br
|
||||
- Brae -> Br
|
||||
- Brae -> Brae
|
||||
- Break -> Brk
|
||||
- Bridge -> Bdge
|
||||
- Bridge -> Br
|
||||
- Bridge -> Brdg
|
||||
- Bridge -> Bri
|
||||
- Broadway -> Bdwy
|
||||
- Broadway -> Bway
|
||||
- Broadway -> Bwy
|
||||
- Brook -> Brk
|
||||
- Brow -> Brw
|
||||
- Brow -> Brow
|
||||
- Buildings -> Bldgs
|
||||
- Buildings -> Bldngs
|
||||
- Business -> Bus
|
||||
- Bypass -> Bps
|
||||
- Bypass -> Byp
|
||||
- Bypass -> Bypa
|
||||
- Byway -> Bywy
|
||||
- Caravan -> Cvn
|
||||
- Causeway -> Caus
|
||||
- Causeway -> Cswy
|
||||
- Causeway -> Cway
|
||||
- Center -> Cen
|
||||
- Center -> Ctr
|
||||
- Central -> Ctrl
|
||||
- Centre -> Cen
|
||||
- Centre -> Ctr
|
||||
- Centreway -> Cnwy
|
||||
- Chase -> Ch
|
||||
- Church -> Ch
|
||||
- Circle -> Cir
|
||||
- Circuit -> Cct
|
||||
- Circuit -> Ci
|
||||
- Circus -> Crc
|
||||
- Circus -> Crcs
|
||||
- City -> Cty
|
||||
- Close -> Cl
|
||||
- Common -> Cmn
|
||||
- Common -> Comm
|
||||
- Community -> Comm
|
||||
- Concourse -> Cnc
|
||||
- Concourse -> Con
|
||||
- Copse -> Cps
|
||||
- Corner -> Cnr
|
||||
- Corner -> Crn
|
||||
- Corso -> Cso
|
||||
- Cottages -> Cotts
|
||||
- County -> Co
|
||||
- County Road -> CR
|
||||
- County Route -> CR
|
||||
- Court -> Crt
|
||||
- Court -> Ct
|
||||
- Courtyard -> Cyd
|
||||
- Courtyard -> Ctyd
|
||||
- Cove -> Ce
|
||||
- Cove -> Cov
|
||||
- Cove -> Cove
|
||||
- Cove -> Cv
|
||||
- Creek -> Ck
|
||||
- Creek -> Cr
|
||||
- Creek -> Crk
|
||||
- Crescent -> Cr
|
||||
- Crescent -> Cres
|
||||
- Crest -> Crst
|
||||
- Crest -> Cst
|
||||
- Croft -> Cft
|
||||
- Cross -> Cs
|
||||
- Cross -> Crss
|
||||
- Crossing -> Crsg
|
||||
- Crossing -> Csg
|
||||
- Crossing -> Xing
|
||||
- Crossroad -> Crd
|
||||
- Crossway -> Cowy
|
||||
- Cul-de-sac -> Cds
|
||||
- Cul-de-sac -> Csac
|
||||
- Curve -> Cve
|
||||
- Cutting -> Cutt
|
||||
- Dale -> Dle
|
||||
- Dale -> Dale
|
||||
- Deviation -> Devn
|
||||
- Dip -> Dip
|
||||
- Distributor -> Dstr
|
||||
- Down -> Dn
|
||||
- Downs -> Dn
|
||||
- Drive -> Dr
|
||||
- Drive -> Drv
|
||||
- Drive -> Dv
|
||||
- Drive-In => Drive-In # prevent abbreviation here
|
||||
- Driveway -> Drwy
|
||||
- Driveway -> Dvwy
|
||||
- Driveway -> Dwy
|
||||
- East -> E
|
||||
- Edge -> Edg
|
||||
- Edge -> Edge
|
||||
- Elbow -> Elb
|
||||
- End -> End
|
||||
- Entrance -> Ent
|
||||
- Esplanade -> Esp
|
||||
- Estate -> Est
|
||||
- Expressway -> Exp
|
||||
- Expressway -> Expy
|
||||
- Expressway -> Expwy
|
||||
- Expressway -> Xway
|
||||
- Extension -> Ex
|
||||
- Fairway -> Fawy
|
||||
- Fairway -> Fy
|
||||
- Father -> Fr
|
||||
- Ferry -> Fy
|
||||
- Field -> Fd
|
||||
- Fire Track -> Ftrk
|
||||
- Firetrail -> Fit
|
||||
- Flat -> Fl
|
||||
- Flat -> Flat
|
||||
- Follow -> Folw
|
||||
- Footway -> Ftwy
|
||||
- Foreshore -> Fshr
|
||||
- Forest Service Road -> FSR
|
||||
- Formation -> Form
|
||||
- Fort -> Ft
|
||||
- Freeway -> Frwy
|
||||
- Freeway -> Fwy
|
||||
- Front -> Frnt
|
||||
- Frontage -> Fr
|
||||
- Frontage -> Frtg
|
||||
- Gap -> Gap
|
||||
- Garden -> Gdn
|
||||
- Gardens -> Gdn
|
||||
- Gardens -> Gdns
|
||||
- Gate -> Ga
|
||||
- Gate -> Gte
|
||||
- Gates -> Ga
|
||||
- Gates -> Gte
|
||||
- Gateway -> Gwy
|
||||
- George -> Geo
|
||||
- Glade -> Gl
|
||||
- Glade -> Gld
|
||||
- Glade -> Glde
|
||||
- Glen -> Gln
|
||||
- Glen -> Glen
|
||||
- Grange -> Gra
|
||||
- Green -> Gn
|
||||
- Green -> Grn
|
||||
- Ground -> Grnd
|
||||
- Grove -> Gr
|
||||
- Grove -> Gro
|
||||
- Grovet -> Gr
|
||||
- Gully -> Gly
|
||||
- Harbor -> Hbr
|
||||
- Harbour -> Hbr
|
||||
- Haven -> Hvn
|
||||
- Head -> Hd
|
||||
- Heads -> Hd
|
||||
- Heights -> Hgts
|
||||
- Heights -> Ht
|
||||
- Heights -> Hts
|
||||
- High School -> HS
|
||||
- Highroad -> Hird
|
||||
- Highroad -> Hrd
|
||||
- Highway -> Hwy
|
||||
- Hill -> Hill
|
||||
- Hill -> Hl
|
||||
- Hills -> Hl
|
||||
- Hills -> Hls
|
||||
- Hospital -> Hosp
|
||||
- House -> Ho
|
||||
- House -> Hse
|
||||
- Industrial -> Ind
|
||||
- Interchange -> Intg
|
||||
- International -> Intl
|
||||
- Island -> I
|
||||
- Island -> Is
|
||||
- Junction -> Jctn
|
||||
- Junction -> Jnc
|
||||
- Junior -> Jr
|
||||
- Key -> Key
|
||||
- Lagoon -> Lgn
|
||||
- Lakes -> L
|
||||
- Landing -> Ldg
|
||||
- Lane -> La
|
||||
- Lane -> Lane
|
||||
- Lane -> Ln
|
||||
- Laneway -> Lnwy
|
||||
- Line -> Line
|
||||
- Line -> Ln
|
||||
- Link -> Link
|
||||
- Link -> Lk
|
||||
- Little -> Lit
|
||||
- Little -> Lt
|
||||
- Lodge -> Ldg
|
||||
- Lookout -> Lkt
|
||||
- Loop -> Loop
|
||||
- Loop -> Lp
|
||||
- Lower -> Low
|
||||
- Lower -> Lr
|
||||
- Lower -> Lwr
|
||||
- Mall -> Mall
|
||||
- Mall -> Ml
|
||||
- Manor -> Mnr
|
||||
- Mansions -> Mans
|
||||
- Market -> Mkt
|
||||
- Meadow -> Mdw
|
||||
- Meadows -> Mdw
|
||||
- Meadows -> Mdws
|
||||
- Mead -> Md
|
||||
- Meander -> Mdr
|
||||
- Meander -> Mndr
|
||||
- Meander -> Mr
|
||||
- Medical -> Med
|
||||
- Memorial -> Mem
|
||||
- Mews -> Mews
|
||||
- Mews -> Mw
|
||||
- Middle -> Mid
|
||||
- Middle School -> MS
|
||||
- Mile -> Mi
|
||||
- Military -> Mil
|
||||
- Motorway -> Mtwy
|
||||
- Motorway -> Mwy
|
||||
- Mount -> Mt
|
||||
- Mountain -> Mtn
|
||||
- Mountains -> Mtn
|
||||
- Municipal -> Mun
|
||||
- Museum -> Mus
|
||||
- National Park -> NP
|
||||
- National Recreation Area -> NRA
|
||||
- National Wildlife Refuge Area -> NWRA
|
||||
- Nook -> Nk
|
||||
- Nook -> Nook
|
||||
- North -> N
|
||||
- Northeast -> NE
|
||||
- Northwest -> NW
|
||||
- Outlook -> Out
|
||||
- Outlook -> Otlk
|
||||
- Parade -> Pde
|
||||
- Paradise -> Pdse
|
||||
- Park -> Park
|
||||
- Park -> Pk
|
||||
- Parklands -> Pkld
|
||||
- Parkway -> Pkwy
|
||||
- Parkway -> Pky
|
||||
- Parkway -> Pwy
|
||||
- Pass -> Pass
|
||||
- Pass -> Ps
|
||||
- Passage -> Psge
|
||||
- Path -> Path
|
||||
- Pathway -> Phwy
|
||||
- Pathway -> Pway
|
||||
- Pathway -> Pwy
|
||||
- Piazza -> Piaz
|
||||
- Pike -> Pk
|
||||
- Place -> Pl
|
||||
- Plain -> Pl
|
||||
- Plains -> Pl
|
||||
- Plateau -> Plat
|
||||
- Plaza -> Pl
|
||||
- Plaza -> Plz
|
||||
- Plaza -> Plza
|
||||
- Pocket -> Pkt
|
||||
- Point -> Pnt
|
||||
- Point -> Pt
|
||||
- Port -> Port
|
||||
- Port -> Pt
|
||||
- Post Office -> PO
|
||||
- Precinct -> Pct
|
||||
- Promenade -> Prm
|
||||
- Promenade -> Prom
|
||||
- Quad -> Quad
|
||||
- Quadrangle -> Qdgl
|
||||
- Quadrant -> Qdrt
|
||||
- Quadrant -> Qd
|
||||
- Quay -> Qy
|
||||
- Quays -> Qy
|
||||
- Quays -> Qys
|
||||
- Ramble -> Ra
|
||||
- Ramble -> Rmbl
|
||||
- Range -> Rge
|
||||
- Range -> Rnge
|
||||
- Reach -> Rch
|
||||
- Reservation -> Res
|
||||
- Reserve -> Res
|
||||
- Reservoir -> Res
|
||||
- Rest -> Rest
|
||||
- Rest -> Rst
|
||||
- Retreat -> Rt
|
||||
- Retreat -> Rtt
|
||||
- Return -> Rtn
|
||||
- Ridge -> Rdg
|
||||
- Ridge -> Rdge
|
||||
- Ridgeway -> Rgwy
|
||||
- Right of Way -> Rowy
|
||||
- Rise -> Ri
|
||||
- Rise -> Rise
|
||||
- River -> R
|
||||
- River -> Riv
|
||||
- River -> Rvr
|
||||
- Riverway -> Rvwy
|
||||
- Riviera -> Rvra
|
||||
- Road -> Rd
|
||||
- Roads -> Rds
|
||||
- Roadside -> Rdsd
|
||||
- Roadway -> Rdwy
|
||||
- Roadway -> Rdy
|
||||
- Robert -> Robt
|
||||
- Rocks -> Rks
|
||||
- Ronde -> Rnde
|
||||
- Rosebowl -> Rsbl
|
||||
- Rotary -> Rty
|
||||
- Round -> Rnd
|
||||
- Route -> Rt
|
||||
- Route -> Rte
|
||||
- Row -> Row
|
||||
- Rue -> Rue
|
||||
- Run -> Run
|
||||
- Saint -> St
|
||||
- Saints -> SS
|
||||
- Senior -> Sr
|
||||
- Serviceway -> Swy
|
||||
- Serviceway -> Svwy
|
||||
- Shunt -> Shun
|
||||
- Siding -> Sdng
|
||||
- Sister -> Sr
|
||||
- Slope -> Slpe
|
||||
- Sound -> Snd
|
||||
- South -> S
|
||||
- South -> Sth
|
||||
- Southeast -> SE
|
||||
- Southwest -> SW
|
||||
- Spur -> Spur
|
||||
- Square -> Sq
|
||||
- Stairway -> Strwy
|
||||
- State Highway -> SH
|
||||
- State Highway -> SHwy
|
||||
- State Route -> SR
|
||||
- Station -> Sta
|
||||
- Station -> Stn
|
||||
- Strand -> Sd
|
||||
- Strand -> Stra
|
||||
- Street -> St
|
||||
- Strip -> Strp
|
||||
- Subway -> Sbwy
|
||||
- Tarn -> Tn
|
||||
- Tarn -> Tarn
|
||||
- Terminal -> Term
|
||||
- Terrace -> Tce
|
||||
- Terrace -> Ter
|
||||
- Terrace -> Terr
|
||||
- Thoroughfare -> Thfr
|
||||
- Thoroughfare -> Thor
|
||||
- Tollway -> Tlwy
|
||||
- Tollway -> Twy
|
||||
- Top -> Top
|
||||
- Tor -> Tor
|
||||
- Towers -> Twrs
|
||||
- Township -> Twp
|
||||
- Trace -> Trce
|
||||
- Track -> Tr
|
||||
- Track -> Trk
|
||||
- Trail -> Trl
|
||||
- Trailer -> Trlr
|
||||
- Triangle -> Tri
|
||||
- Trunkway -> Tkwy
|
||||
- Tunnel -> Tun
|
||||
- Turn -> Tn
|
||||
- Turn -> Trn
|
||||
- Turn -> Turn
|
||||
- Turnpike -> Tpk
|
||||
- Turnpike -> Tpke
|
||||
- Underpass -> Upas
|
||||
- Underpass -> Ups
|
||||
- University -> Uni
|
||||
- University -> Univ
|
||||
- Upper -> Up
|
||||
- Upper -> Upr
|
||||
- Vale -> Va
|
||||
- Vale -> Vale
|
||||
- Valley -> Vy
|
||||
- Viaduct -> Vdct
|
||||
- Viaduct -> Via
|
||||
- Viaduct -> Viad
|
||||
- View -> Vw
|
||||
- View -> View
|
||||
- Village -> Vill
|
||||
- Villas -> Vlls
|
||||
- Vista -> Vst
|
||||
- Vista -> Vsta
|
||||
- Walk -> Walk
|
||||
- Walk -> Wk
|
||||
- Walk -> Wlk
|
||||
- Walkway -> Wkwy
|
||||
- Walkway -> Wky
|
||||
- Waters -> Wtr
|
||||
- Way -> Way
|
||||
- Way -> Wy
|
||||
- West -> W
|
||||
- Wharf -> Whrf
|
||||
- William -> Wm
|
||||
- Wynd -> Wyn
|
||||
- Wynd -> Wynd
|
||||
- Yard -> Yard
|
||||
- Yard -> Yd
|
||||
- lang: en
|
||||
country: ca
|
||||
words:
|
||||
- Circuit -> CIRCT
|
||||
- Concession -> CONC
|
||||
- Corners -> CRNRS
|
||||
- Crossing -> CROSS
|
||||
- Diversion -> DIVERS
|
||||
- Esplanade -> ESPL
|
||||
- Extension -> EXTEN
|
||||
- Grounds -> GRNDS
|
||||
- Harbour -> HARBR
|
||||
- Highlands -> HGHLDS
|
||||
- Landing -> LANDNG
|
||||
- Limits -> LMTS
|
||||
- Lookout -> LKOUT
|
||||
- Orchard -> ORCH
|
||||
- Parkway -> PKY
|
||||
- Passage -> PASS
|
||||
- Pathway -> PTWAY
|
||||
- Private -> PVT
|
||||
- Range -> RG
|
||||
- Subdivision -> SUBDIV
|
||||
- Terrace -> TERR
|
||||
- Townline -> TLINE
|
||||
- Turnabout -> TRNABT
|
||||
- Village -> VILLGE
|
||||
- lang: en
|
||||
country: ph
|
||||
words:
|
||||
- Apartment -> Apt
|
||||
- Barangay -> Brgy
|
||||
- Barangay -> Bgy
|
||||
- Building -> Bldg
|
||||
- Commission -> Comm
|
||||
- Compound -> Cmpd
|
||||
- Compound -> Cpd
|
||||
- Cooperative -> Coop
|
||||
- Department -> Dept
|
||||
- Department -> Dep't
|
||||
- General -> Gen
|
||||
- Governor -> Gov
|
||||
- National -> Nat'l
|
||||
- National High School -> NHS
|
||||
- Philippine -> Phil
|
||||
- Police Community Precinct -> PCP
|
||||
- Province -> Prov
|
||||
- Senior High School -> SHS
|
||||
- Subdivision -> Subd
|
163
settings/icu-rules/variants-es.yaml
Normal file
163
settings/icu-rules/variants-es.yaml
Normal file
@ -0,0 +1,163 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Espa.C3.B1ol_-_Spanish
|
||||
- lang: es
|
||||
words:
|
||||
- Acequia -> Aceq
|
||||
- Alameda -> Alam
|
||||
- Alquería -> Alque
|
||||
- Andador -> Andad
|
||||
- Angosta -> Angta
|
||||
- Apartamento -> Apto
|
||||
- Apartamentos -> Aptos
|
||||
- Apeadero -> Apdro
|
||||
- Arboleda -> Arb
|
||||
- Arrabal -> Arral
|
||||
- Arroyo -> Arry
|
||||
- Asociación de Vecinos -> A VV
|
||||
- Asociación Vecinal -> A V
|
||||
- Autopista -> Auto
|
||||
- Autovía -> Autov
|
||||
- Avenida -> Av
|
||||
- Avenida -> Avd
|
||||
- Avenida -> Avda
|
||||
- Balneario -> Balnr
|
||||
- Banda -> B
|
||||
- Banda -> Bda
|
||||
- Barranco -> Branc
|
||||
- Barranquil -> Bqllo
|
||||
- Barriada -> Barda
|
||||
- Barrio -> B.º
|
||||
- Barrio -> Bo
|
||||
- Bloque -> Blq
|
||||
- Bulevar -> Blvr
|
||||
- Boulevard -> Blvd
|
||||
- Calle -> C/
|
||||
- Calle -> C
|
||||
- Calle -> Cl
|
||||
- Calleja -> Cllja
|
||||
- Callejón -> Callej
|
||||
- Callejón -> Cjón
|
||||
- Callejón -> Cllón
|
||||
- Callejuela -> Cjla
|
||||
- Callizo -> Cllzo
|
||||
- Calzada -> Czada
|
||||
- Camino -> Cno
|
||||
- Camino -> Cmno
|
||||
- Camino hondo -> C H
|
||||
- Camino nuevo -> C N
|
||||
- Camino viejo -> C V
|
||||
- Camping -> Campg
|
||||
- Cantera -> Cantr
|
||||
- Cantina -> Canti
|
||||
- Cantón -> Cant
|
||||
- Carrera -> Cra
|
||||
- Carrero -> Cro
|
||||
- Carretera -> Ctra
|
||||
- Carreterín -> Ctrin
|
||||
- Carretil -> Crtil
|
||||
- Caserío -> Csrio
|
||||
- Centro Integrado de Formación Profesional -> CIFP
|
||||
- Cinturón -> Cint
|
||||
- Circunvalación -> Ccvcn
|
||||
- Cobertizo -> Cbtiz
|
||||
- Colegio de Educación Especial -> CEE
|
||||
- Colegio de Educación Infantil -> CEI
|
||||
- Colegio de Educación Infantil y Primaria -> CEIP
|
||||
- Colegio Rural Agrupado -> CRA
|
||||
- Colonia -> Col
|
||||
- Complejo -> Compj
|
||||
- Conjunto -> Cjto
|
||||
- Convento -> Cnvto
|
||||
- Cooperativa -> Coop
|
||||
- Corralillo -> Crrlo
|
||||
- Corredor -> Crrdo
|
||||
- Cortijo -> Crtjo
|
||||
- Costanilla -> Cstan
|
||||
- Costera -> Coste
|
||||
- Dehesa -> Dhsa
|
||||
- Demarcación -> Demar
|
||||
- Diagonal -> Diag
|
||||
- Diseminado -> Disem
|
||||
- Doctor -> Dr
|
||||
- Doctora -> Dra
|
||||
- Edificio -> Edif
|
||||
- Empresa -> Empr
|
||||
- Entrada -> Entd
|
||||
- Escalera -> Esca
|
||||
- Escalinata -> Escal
|
||||
- Espalda -> Eslda
|
||||
- Estación -> Estcn
|
||||
- Estrada -> Estda
|
||||
- Explanada -> Expla
|
||||
- Extramuros -> Extrm
|
||||
- Extrarradio -> Extrr
|
||||
- Fábrica -> Fca
|
||||
- Fábrica -> Fbrca
|
||||
- Ferrocarril -> F C
|
||||
- Ferrocarriles -> FF CC
|
||||
- Galería -> Gale
|
||||
- Glorieta -> Gta
|
||||
- Gran Vía -> G V
|
||||
- Hipódromo -> Hipód
|
||||
- Instituto de Educación Secundaria -> IES
|
||||
- Jardín -> Jdín
|
||||
- Llanura -> Llnra
|
||||
- Lote -> Lt
|
||||
- Malecón -> Malec
|
||||
- Manzana -> Mz
|
||||
- Mercado -> Merc
|
||||
- Mirador -> Mrdor
|
||||
- Monasterio -> Mtrio
|
||||
- Nuestra Señora -> N.ª S.ª
|
||||
- Nuestra Señora -> Ntr.ª Sr.ª
|
||||
- Nuestra Señora -> Ntra Sra
|
||||
- Palacio -> Palac
|
||||
- Pantano -> Pant
|
||||
- Parque -> Pque
|
||||
- Particular -> Parti
|
||||
- Partida -> Ptda
|
||||
- Pasadizo -> Pzo
|
||||
- Pasaje -> Psje
|
||||
- Paseo -> P.º
|
||||
- Paseo marítimo -> P.º mar
|
||||
- Pasillo -> Psllo
|
||||
- Plaza -> Pl
|
||||
- Plaza -> Pza
|
||||
- Plazoleta -> Pzta
|
||||
- Plazuela -> Plzla
|
||||
- Poblado -> Pbdo
|
||||
- Polígono -> Políg
|
||||
- Polígono industrial -> Pg ind
|
||||
- Pórtico -> Prtco
|
||||
- Portillo -> Ptilo
|
||||
- Prazuela -> Przla
|
||||
- Prolongación -> Prol
|
||||
- Pueblo -> Pblo
|
||||
- Puente -> Pte
|
||||
- Puerta -> Pta
|
||||
- Puerto -> Pto
|
||||
- Punto kilométrico -> P k
|
||||
- Rambla -> Rbla
|
||||
- Residencial -> Resid
|
||||
- Ribera -> Rbra
|
||||
- Rincón -> Rcón
|
||||
- Rinconada -> Rcda
|
||||
- Rotonda -> Rtda
|
||||
- San -> S
|
||||
- Sanatorio -> Sanat
|
||||
- Santa -> Sta
|
||||
- Santo -> Sto
|
||||
- Santas -> Stas
|
||||
- Santos -> Stos
|
||||
- Santuario -> Santu
|
||||
- Sector -> Sect
|
||||
- Sendera -> Sedra
|
||||
- Sendero -> Send
|
||||
- Torrente -> Trrnt
|
||||
- Tránsito -> Tráns
|
||||
- Transversal -> Trval
|
||||
- Trasera -> Tras
|
||||
- Travesía -> Trva
|
||||
- Urbanización -> Urb
|
||||
- Vecindario -> Vecin
|
||||
- Viaducto -> Vcto
|
||||
- Viviendas -> Vvdas
|
8
settings/icu-rules/variants-et.yaml
Normal file
8
settings/icu-rules/variants-et.yaml
Normal file
@ -0,0 +1,8 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Eesti_-_Estonian
|
||||
- lang: et
|
||||
words:
|
||||
- Maantee -> mnt
|
||||
- Puiestee -> pst
|
||||
- Raudtee -> rdt
|
||||
- Raudteejaam -> rdtj
|
||||
- Tänav -> tn
|
6
settings/icu-rules/variants-eu.yaml
Normal file
6
settings/icu-rules/variants-eu.yaml
Normal file
@ -0,0 +1,6 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Euskara_-_Basque
|
||||
- lang: eu
|
||||
words:
|
||||
- Etorbidea -> Etorb
|
||||
- Errepidea -> Err
|
||||
- Kalea -> K
|
23
settings/icu-rules/variants-fi.yaml
Normal file
23
settings/icu-rules/variants-fi.yaml
Normal file
@ -0,0 +1,23 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Suomi_-_Finnish
|
||||
- lang: fi
|
||||
words:
|
||||
- ~alue -> al
|
||||
- ~asema -> as
|
||||
- ~aukio -> auk
|
||||
- ~kaari -> kri
|
||||
- ~katu -> k
|
||||
- ~kuja -> kj
|
||||
- ~kylä -> kl
|
||||
- ~penger -> pgr
|
||||
- ~polku -> p
|
||||
- ~puistikko -> pko
|
||||
- ~puisto -> ps
|
||||
- ~raitti -> r
|
||||
- ~rautatieasema -> ras
|
||||
- ~ranta -> rt
|
||||
- ~rinne -> rn
|
||||
- ~taival -> tvl
|
||||
- ~tie -> t
|
||||
- tienhaara -> th
|
||||
- ~tori -> tr
|
||||
- ~väylä -> vlä
|
297
settings/icu-rules/variants-fr.yaml
Normal file
297
settings/icu-rules/variants-fr.yaml
Normal file
@ -0,0 +1,297 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Fran.C3.A7ais_-_French
|
||||
- lang: fr
|
||||
words:
|
||||
- Abbaye -> ABE
|
||||
- Agglomération -> AGL
|
||||
- Aire -> AIRE
|
||||
- Aires -> AIRE
|
||||
- Allée -> ALL
|
||||
- Allée -> All
|
||||
- Allées -> ALL
|
||||
- Ancien chemin -> ACH
|
||||
- Ancienne route -> ART
|
||||
- Anciennes routes -> ART
|
||||
- Anse -> ANSE
|
||||
- Arcade -> ARC
|
||||
- Arcades -> ARC
|
||||
- Autoroute -> AUT
|
||||
- Avenue -> AV
|
||||
- Avenue -> Av
|
||||
- Barrière -> BRE
|
||||
- Barrières -> BRE
|
||||
- Bas chemin -> BCH
|
||||
- Bastide -> BSTD
|
||||
- Baston -> BAST
|
||||
- Béguinage -> BEGI
|
||||
- Béguinages -> BEGI
|
||||
- Berge -> BER
|
||||
- Berges -> BER
|
||||
- Bois -> BOIS
|
||||
- Boucle -> BCLE
|
||||
- Boulevard -> Bd
|
||||
- Boulevard -> BD
|
||||
- Bourg -> BRG
|
||||
- Butte -> BUT
|
||||
- Cité -> CITE
|
||||
- Cités -> CITE
|
||||
- Côte -> COTE
|
||||
- Côteau -> COTE
|
||||
- Cale -> CALE
|
||||
- Camp -> CAMP
|
||||
- Campagne -> CGNE
|
||||
- Camping -> CPG
|
||||
- Carreau -> CAU
|
||||
- Carrefour -> CAR
|
||||
- Carrière -> CARE
|
||||
- Carrières -> CARE
|
||||
- Carré -> CARR
|
||||
- Castel -> CST
|
||||
- Cavée -> CAV
|
||||
- Central -> CTRE
|
||||
- Centre -> CTRE
|
||||
- Chalet -> CHL
|
||||
- Chapelle -> CHP
|
||||
- Charmille -> CHI
|
||||
- Chaussée -> CHS
|
||||
- Chaussées -> CHS
|
||||
- Chemin -> Ch
|
||||
- Chemin -> CHE
|
||||
- Chemin -> Che
|
||||
- Chemin vicinal -> CHV
|
||||
- Cheminement -> CHEM
|
||||
- Cheminements -> CHEM
|
||||
- Chemins -> CHE
|
||||
- Chemins vicinaux -> CHV
|
||||
- Chez -> CHEZ
|
||||
- Château -> CHT
|
||||
- Cloître -> CLOI
|
||||
- Clos -> CLOS
|
||||
- Col -> COL
|
||||
- Colline -> COLI
|
||||
- Collines -> COLI
|
||||
- Contour -> CTR
|
||||
- Corniche -> COR
|
||||
- Corniches -> COR
|
||||
- Cottage -> COTT
|
||||
- Cottages -> COTT
|
||||
- Cour -> COUR
|
||||
- Cours -> CRS
|
||||
- Cours -> Crs
|
||||
- Darse -> DARS
|
||||
- Degré -> DEG
|
||||
- Degrés -> DEG
|
||||
- Descente -> DSG
|
||||
- Descentes -> DSG
|
||||
- Digue -> DIG
|
||||
- Digues -> DIG
|
||||
- Domaine -> DOM
|
||||
- Domaines -> DOM
|
||||
- Écluse -> ECL
|
||||
- Écluse -> ÉCL
|
||||
- Écluses -> ECL
|
||||
- Écluses -> ÉCL
|
||||
- Église -> EGL
|
||||
- Église -> ÉGL
|
||||
- Enceinte -> EN
|
||||
- Enclave -> ENV
|
||||
- Enclos -> ENC
|
||||
- Escalier -> ESC
|
||||
- Escaliers -> ESC
|
||||
- Espace -> ESPA
|
||||
- Esplanade -> ESP
|
||||
- Esplanades -> ESP
|
||||
- Étang -> ETANG
|
||||
- Étang -> ÉTANG
|
||||
- Faubourg -> FG
|
||||
- Faubourg -> Fg
|
||||
- Ferme -> FRM
|
||||
- Fermes -> FRM
|
||||
- Fontaine -> FON
|
||||
- Fort -> FORT
|
||||
- Forum -> FORM
|
||||
- Fosse -> FOS
|
||||
- Fosses -> FOS
|
||||
- Foyer -> FOYR
|
||||
- Galerie -> GAL
|
||||
- Galeries -> GAL
|
||||
- Gare -> GARE
|
||||
- Garenne -> GARN
|
||||
- Grand boulevard -> GBD
|
||||
- Grand ensemble -> GDEN
|
||||
- Grand’rue -> GR
|
||||
- Grande rue -> GR
|
||||
- Grandes rues -> GR
|
||||
- Grands ensembles -> GDEN
|
||||
- Grille -> GRI
|
||||
- Grimpette -> GRIM
|
||||
- Groupe -> GPE
|
||||
- Groupement -> GPT
|
||||
- Groupes -> GPE
|
||||
- Halle -> HLE
|
||||
- Halles -> HLE
|
||||
- Hameau -> HAM
|
||||
- Hameaux -> HAM
|
||||
- Haut chemin -> HCH
|
||||
- Hauts chemins -> HCH
|
||||
- Hippodrome -> HIP
|
||||
- HLM -> HLM
|
||||
- Île -> ILE
|
||||
- Île -> ÎLE
|
||||
- Immeuble -> IMM
|
||||
- Immeubles -> IMM
|
||||
- Impasse -> IMP
|
||||
- Impasse -> Imp
|
||||
- Impasses -> IMP
|
||||
- Jardin -> JARD
|
||||
- Jardins -> JARD
|
||||
- Jetée -> JTE
|
||||
- Jetées -> JTE
|
||||
- Levée -> LEVE
|
||||
- Lieu-dit -> LD
|
||||
- Lotissement -> LOT
|
||||
- Lotissements -> LOT
|
||||
- Mail -> MAIL
|
||||
- Maison forestière -> MF
|
||||
- Manoir -> MAN
|
||||
- Marche -> MAR
|
||||
- Marches -> MAR
|
||||
- Maréchal -> MAL
|
||||
- Mas -> MAS
|
||||
- Monseigneur -> Mgr
|
||||
- Mont -> Mt
|
||||
- Montée -> MTE
|
||||
- Montées -> MTE
|
||||
- Moulin -> MLN
|
||||
- Moulins -> MLN
|
||||
- Musée -> MUS
|
||||
- Métro -> MET
|
||||
- Métro -> MÉT
|
||||
- Nouvelle route -> NTE
|
||||
- Palais -> PAL
|
||||
- Parc -> PARC
|
||||
- Parcs -> PARC
|
||||
- Parking -> PKG
|
||||
- Parvis -> PRV
|
||||
- Passage -> PAS
|
||||
- Passage -> Pas
|
||||
- Passage -> Pass
|
||||
- Passage à niveau -> PN
|
||||
- Passe -> PASS
|
||||
- Passerelle -> PLE
|
||||
- Passerelles -> PLE
|
||||
- Passes -> PASS
|
||||
- Patio -> PAT
|
||||
- Pavillon -> PAV
|
||||
- Pavillons -> PAV
|
||||
- Petit chemin -> PCH
|
||||
- Petite allée -> PTA
|
||||
- Petite avenue -> PAE
|
||||
- Petite impasse -> PIM
|
||||
- Petite route -> PRT
|
||||
- Petite rue -> PTR
|
||||
- Petites allées -> PTA
|
||||
- Place -> PL
|
||||
- Place -> Pl
|
||||
- Placis -> PLCI
|
||||
- Plage -> PLAG
|
||||
- Plages -> PLAG
|
||||
- Plaine -> PLN
|
||||
- Plan -> PLAN
|
||||
- Plateau -> PLT
|
||||
- Plateaux -> PLT
|
||||
- Pointe -> PNT
|
||||
- Pont -> PONT
|
||||
- Ponts -> PONT
|
||||
- Porche -> PCH
|
||||
- Port -> PORT
|
||||
- Porte -> PTE
|
||||
- Portique -> PORQ
|
||||
- Portiques -> PORQ
|
||||
- Poterne -> POT
|
||||
- Pourtour -> POUR
|
||||
- Presqu’île -> PRQ
|
||||
- Promenade -> PROM
|
||||
- Promenade -> Prom
|
||||
- Pré -> PRE
|
||||
- Pré -> PRÉ
|
||||
- Périphérique -> PERI
|
||||
- Péristyle -> PSTY
|
||||
- Quai -> QU
|
||||
- Quai -> Qu
|
||||
- Quartier -> QUA
|
||||
- Raccourci -> RAC
|
||||
- Raidillon -> RAID
|
||||
- Rampe -> RPE
|
||||
- Rempart -> REM
|
||||
- Roc -> ROC
|
||||
- Rocade -> ROC
|
||||
- Rond point -> RPT
|
||||
- Roquet -> ROQT
|
||||
- Rotonde -> RTD
|
||||
- Route -> RTE
|
||||
- Route -> Rte
|
||||
- Routes -> RTE
|
||||
- Rue -> R
|
||||
- Rue -> R
|
||||
- Ruelle -> RLE
|
||||
- Ruelles -> RLE
|
||||
- Rues -> R
|
||||
- Résidence -> RES
|
||||
- Résidences -> RES
|
||||
- Saint -> St
|
||||
- Sainte -> Ste
|
||||
- Sente -> SEN
|
||||
- Sentes -> SEN
|
||||
- Sentier -> SEN
|
||||
- Sentiers -> SEN
|
||||
- Square -> SQ
|
||||
- Square -> Sq
|
||||
- Stade -> STDE
|
||||
- Station -> STA
|
||||
- Terrain -> TRN
|
||||
- Terrasse -> TSSE
|
||||
- Terrasses -> TSSE
|
||||
- Terre plein -> TPL
|
||||
- Tertre -> TRT
|
||||
- Tertres -> TRT
|
||||
- Tour -> TOUR
|
||||
- Traverse -> TRA
|
||||
- Vallon -> VAL
|
||||
- Vallée -> VAL
|
||||
- Venelle -> VEN
|
||||
- Venelles -> VEN
|
||||
- Via -> VIA
|
||||
- Vieille route -> VTE
|
||||
- Vieux chemin -> VCHE
|
||||
- Villa -> VLA
|
||||
- Village -> VGE
|
||||
- Villages -> VGE
|
||||
- Villas -> VLA
|
||||
- Voie -> VOI
|
||||
- Voies -> VOI
|
||||
- Zone -> ZONE
|
||||
- Zone artisanale -> ZA
|
||||
- Zone d'aménagement concerté -> ZAC
|
||||
- Zone d'aménagement différé -> ZAD
|
||||
- Zone industrielle -> ZI
|
||||
- Zone à urbaniser en priorité -> ZUP
|
||||
- lang: fr
|
||||
country: ca
|
||||
words:
|
||||
- Boulevard -> BOUL
|
||||
- Carré -> CAR
|
||||
- Carrefour -> CARREF
|
||||
- Centre -> C
|
||||
- Chemin -> CH
|
||||
- Croissant -> CROIS
|
||||
- Diversion -> DIVERS
|
||||
- Échangeur -> ÉCH
|
||||
- Esplanade -> ESPL
|
||||
- Passage -> PASS
|
||||
- Plateau -> PLAT
|
||||
- Rang -> RANG
|
||||
- Rond-point -> RDPT
|
||||
- Sentier -> SENT
|
||||
- Subdivision -> SUBDIV
|
||||
- Terrasse -> TSSE
|
||||
- Village -> VILLGE
|
27
settings/icu-rules/variants-gl.yaml
Normal file
27
settings/icu-rules/variants-gl.yaml
Normal file
@ -0,0 +1,27 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Galego_-_Galician
|
||||
- lang: gl
|
||||
words:
|
||||
- Asociación Veciñal -> A V
|
||||
- Asociación de Veciños -> A VV
|
||||
- Avenida -> Av
|
||||
- Avenida -> Avda
|
||||
- Centro Integrado de Formación Profesional -> CIFP
|
||||
- Colexio de Educación Especial -> CEE
|
||||
- Colexio de Educación Infantil -> CEI
|
||||
- Colexio de Educación Infantil e Primaria -> CEIP
|
||||
- Colexio Rural Agrupado -> CRA
|
||||
- Doutor -> Dr
|
||||
- Doutora -> Dra
|
||||
- Edificio -> Edif
|
||||
- Estrada -> Estda
|
||||
- Ferrocarril -> F C
|
||||
- Ferrocarrís -> FF CC
|
||||
- Instituto de Educación Secundaria -> IES
|
||||
- Rúa -> R/
|
||||
- San -> S
|
||||
- Santa -> Sta
|
||||
- Santo -> Sto
|
||||
- Santas -> Stas
|
||||
- Santos -> Stos
|
||||
- Señora -> Sra
|
||||
- Urbanización -> Urb
|
4
settings/icu-rules/variants-hu.yaml
Normal file
4
settings/icu-rules/variants-hu.yaml
Normal file
@ -0,0 +1,4 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Magyar_-_Hungarian
|
||||
- lang: hu
|
||||
words:
|
||||
- utca -> u
|
77
settings/icu-rules/variants-it.yaml
Normal file
77
settings/icu-rules/variants-it.yaml
Normal file
@ -0,0 +1,77 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Italiano_-_Italian
|
||||
- lang: it
|
||||
words:
|
||||
- Calle -> C.le
|
||||
- Campo -> C.po
|
||||
- Cascina -> C.na
|
||||
- Cinque -> 5
|
||||
- Corso -> C.so
|
||||
- Corte -> C.te
|
||||
- Decima -> X
|
||||
- Decimo -> X
|
||||
- Due -> 2
|
||||
- Fondamenta -> F.ta
|
||||
- Largo -> L.go
|
||||
- Località -> Loc
|
||||
- Lungomare -> L.mare
|
||||
- Nona -> IX
|
||||
- Nono -> IX
|
||||
- Nove -> 9
|
||||
- Otto -> 8
|
||||
- Ottava -> VIII
|
||||
- Ottavo -> VIII
|
||||
- Piazza -> P.za
|
||||
- Piazza -> P.zza
|
||||
- Piazzale -> P.le
|
||||
- Piazzetta -> P.ta
|
||||
- Ponte -> P.te
|
||||
- Porta -> P.ta
|
||||
- Prima -> I
|
||||
- Primo -> I
|
||||
- Primo -> 1
|
||||
- Primo -> 1°
|
||||
- Quarta -> IV
|
||||
- Quarto -> IV
|
||||
- Quattro -> IV
|
||||
- Quattro -> 4
|
||||
- Quinta -> V
|
||||
- Quinto -> V
|
||||
- Salizada -> S.da
|
||||
- San -> S
|
||||
- Santa -> S
|
||||
- Santo -> S
|
||||
- Sant' -> S
|
||||
- Santi -> SS
|
||||
- Santissima -> SS.ma
|
||||
- Santissime -> SS.me
|
||||
- Santissimi -> SS.mi
|
||||
- Santissimo -> SS.mo
|
||||
- Seconda -> II
|
||||
- Secondo -> II
|
||||
- Sei -> 6
|
||||
- Sesta -> VI
|
||||
- Sesto -> VI
|
||||
- Sette -> 7
|
||||
- Settima -> VII
|
||||
- Settimo -> VII
|
||||
- Stazione -> Staz
|
||||
- Strada Comunale -> SC
|
||||
- Strada Provinciale -> SP
|
||||
- Strada Regionale -> SR
|
||||
- Strada Statale -> SS
|
||||
- Terzo -> III
|
||||
- Terza -> III
|
||||
- Tre -> 3
|
||||
- Trenta -> XXX
|
||||
- Un -> 1
|
||||
- Una -> 1
|
||||
- Venti -> XX
|
||||
- Venti -> 20
|
||||
- Venticinque -> XXV
|
||||
- Venticinque -> 25
|
||||
- Ventiquattro -> XXIV
|
||||
- Ventitreesimo -> XXIII
|
||||
- Via -> V
|
||||
- Viale -> V.le
|
||||
- Vico -> V.co
|
||||
- Vicolo -> V.lo
|
32
settings/icu-rules/variants-ja.yaml
Normal file
32
settings/icu-rules/variants-ja.yaml
Normal file
@ -0,0 +1,32 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.E6.97.A5.E6.9C.AC.E8.AA.9E_.28Nihongo.29_-_Japanese
|
||||
- lang: ja
|
||||
words:
|
||||
- ~中学校 |-> 中
|
||||
- ~大学 |-> 大
|
||||
- 独立行政法人~ -> 独
|
||||
- 学校法人~ -> 学
|
||||
- ~銀行 |-> 銀
|
||||
- ~合同会社 -> 合
|
||||
- 合同会社~ -> 合
|
||||
- ~合名会社 -> 名
|
||||
- 合名会社~ -> 名
|
||||
- ~合資会社 -> 資
|
||||
- 合資会社~ -> 資
|
||||
- 一般道道~ -> 一
|
||||
- 一般府道~ -> 一
|
||||
- 一般県道~ -> 一
|
||||
- 一般社団法人~ -> 一社
|
||||
- 一般都道~ -> 一
|
||||
- 一般財団法人~ -> 一財
|
||||
- 医療法人~ -> 医
|
||||
- ~株式会社 -> 株
|
||||
- 株式会社~ -> 株
|
||||
- 国立大学法人~ -> 大
|
||||
- 公立大学法人~ -> 大
|
||||
- ~高等学校 |-> 高
|
||||
- ~高等学校 |-> 高校
|
||||
- ~小学校 |-> 小
|
||||
- 主要地方道~ -> 主
|
||||
- 有限会社~ -> 有
|
||||
- ~有限会社 -> 有
|
||||
- 財団法人~ -> 財
|
12
settings/icu-rules/variants-mg.yaml
Normal file
12
settings/icu-rules/variants-mg.yaml
Normal file
@ -0,0 +1,12 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Malagasy_-_Malgache
|
||||
- lang: mg
|
||||
words:
|
||||
- Ambato -> Ato
|
||||
- Ambinany -> Any
|
||||
- Ambodi -> Adi
|
||||
- Ambohi -> Ahi
|
||||
- Ambohitr' -> Atr'
|
||||
- Ambony -> Ani
|
||||
- Ampasi -> Asi
|
||||
- Andoha -> Aha
|
||||
- Andrano -> Ano
|
12
settings/icu-rules/variants-ms.yaml
Normal file
12
settings/icu-rules/variants-ms.yaml
Normal file
@ -0,0 +1,12 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Bahasa_Melayu_-_Malay
|
||||
- lang: ms
|
||||
words:
|
||||
- Jalan -> Jln
|
||||
- Simpang -> Spg
|
||||
- Kampong -> Kg
|
||||
- Sungai -> Sg
|
||||
- Haji -> Hj
|
||||
- Pengiran -> Pg
|
||||
- Awang -> Awg
|
||||
- Dayang -> Dyg
|
||||
|
53
settings/icu-rules/variants-nl.yaml
Normal file
53
settings/icu-rules/variants-nl.yaml
Normal file
@ -0,0 +1,53 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Nederlands_-_Dutch
|
||||
- lang: nl
|
||||
words:
|
||||
- Broeder -> Br
|
||||
- Burgemeester -> Burg
|
||||
- Commandant -> Cmdt
|
||||
- Doctor -> dr
|
||||
- Dokter -> Dr
|
||||
- Dominee -> ds
|
||||
- Gebroeders -> Gebr
|
||||
- Generaal -> Gen
|
||||
- ~gracht -> gr
|
||||
- Ingenieur -> ir
|
||||
- Jonkheer -> Jhr
|
||||
- Kolonel -> Kol
|
||||
- Kanunnik -> Kan
|
||||
- Kardinaal -> Kard
|
||||
- Kort(e) -> Kte, K
|
||||
- Koning -> Kon
|
||||
- Koningin -> Kon
|
||||
- ~laan -> ln
|
||||
- Lange -> L
|
||||
- Luitenant -> Luit
|
||||
- ~markt -> mkt
|
||||
- Meester -> Mr, mr
|
||||
- Mejuffrouw -> Mej
|
||||
- Mevrouw -> Mevr
|
||||
- Minister -> Min
|
||||
- Monseigneur -> Mgr
|
||||
- Noordzijde -> NZ, N Z
|
||||
- Oostzijde -> OZ, O Z
|
||||
- Onze-Lieve-Vrouw,Onze-Lieve-Vrouwe -> O L V, OLV
|
||||
- Pastoor -> Past
|
||||
- ~plein -> pln
|
||||
- President -> Pres
|
||||
- Prins -> Pr
|
||||
- Prinses -> Pr
|
||||
- Professor -> Prof
|
||||
- ~singel -> sngl
|
||||
- ~straat -> str
|
||||
- ~steenweg -> stwg
|
||||
- Sint -> St
|
||||
- Van -> V
|
||||
- Van De -> V D, vd
|
||||
- Van Den -> V D, vd
|
||||
- Van Der -> V D, vd
|
||||
- Verlengde -> Verl
|
||||
- ~vliet -> vlt
|
||||
- Vrouwe -> Vr
|
||||
- ~weg -> wg
|
||||
- Westzijde -> WZ, W Z
|
||||
- Zuidzijde -> ZZ, Z Z
|
||||
- Zuster -> Zr
|
11
settings/icu-rules/variants-no.yaml
Normal file
11
settings/icu-rules/variants-no.yaml
Normal file
@ -0,0 +1,11 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Norsk_-_Norwegian
|
||||
- lang: no
|
||||
words:
|
||||
# convert between Nynorsk and Bookmal here
|
||||
- vei, veg => v,vn,vei,veg
|
||||
- veien, vegen -> v,vn,veien,vegen
|
||||
- gate -> g,gt
|
||||
# convert between the two female forms
|
||||
- gaten, gata => g,gt,gaten,gata
|
||||
- plass, plassen -> pl
|
||||
- sving, svingen -> sv
|
66
settings/icu-rules/variants-pl.yaml
Normal file
66
settings/icu-rules/variants-pl.yaml
Normal file
@ -0,0 +1,66 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Polski_.E2.80.93_Polish
|
||||
- lang: pl
|
||||
words:
|
||||
- Aleja, Aleje, Alei, Alejach, Aleją -> al
|
||||
- Ulica, Ulice, Ulicą, Ulicy -> ul
|
||||
- Plac, Placu, Placem -> pl
|
||||
- Wybrzeże, Wybrzeża, Wybrzeżem -> wyb
|
||||
- Bulwar -> bulw
|
||||
- Dolny, Dolna, Dolne -> Dln
|
||||
- Drugi, Druga, Drugie -> 2
|
||||
- Drugi, Druga, Drugie -> II
|
||||
- Duży, Duża, Duże -> Dz
|
||||
- Duży, Duża, Duże -> Dż
|
||||
- Górny, Górna, Górne -> Grn
|
||||
- Kolonia -> kol
|
||||
- koło, kolo -> k
|
||||
- Mały, Mała, Małe -> Ml
|
||||
- Mały, Mała, Małe -> Mł
|
||||
- Mazowiecka, Mazowiecki, Mazowieckie -> maz
|
||||
- Miasto -> m
|
||||
- Nowy, Nowa, Nowe -> Nw
|
||||
- Nowy, Nowa, Nowe -> N
|
||||
- Osiedle, Osiedlu -> os
|
||||
- Pierwszy, Pierwsza, Pierwsze -> 1
|
||||
- Pierwszy, Pierwsza, Pierwsze -> I
|
||||
- Szkoła Podstawowa -> SP
|
||||
- Stary, Stara, Stare -> St
|
||||
- Stary, Stara, Stare -> Str
|
||||
- Trzeci, Trzecia, Trzecie -> III
|
||||
- Trzeci, Trzecia, Trzecie -> 3
|
||||
- Wielki, Wielka, Wielkie -> Wlk
|
||||
- Wielkopolski, Wielkopolska, Wielkopolskie -> wlkp
|
||||
- Województwo, Województwie -> woj
|
||||
- kardynała, kardynał -> kard
|
||||
- pułkownika, pułkownik -> płk
|
||||
- marszałka, marszałek -> marsz
|
||||
- generała, generał -> gen
|
||||
- Świętego, Świętej, Świętych, święty, święta, święci -> św
|
||||
- Świętych, święci -> śś
|
||||
- Ojców -> oo
|
||||
- Błogosławionego, Błogosławionej, Błogosławionych, błogosławiony, błogosławiona, błogosławieni -> bł
|
||||
- księdza, ksiądz -> ks
|
||||
- księcia, książe -> ks
|
||||
- doktora, doktor -> dr
|
||||
- majora, major -> mjr
|
||||
- biskupa, biskup -> bpa
|
||||
- biskupa, biskup -> bp
|
||||
- rotmistrza, rotmistrz -> rotm
|
||||
- profesora, profesor -> prof
|
||||
- hrabiego, hrabiny, hrabia, hrabina -> hr
|
||||
- porucznika, porucznik -> por
|
||||
- podpułkownika, podpułkownik -> ppłk
|
||||
- pułkownika, pułkownik -> płk
|
||||
- podporucznika, podporucznik -> ppor
|
||||
- porucznika, porucznik -> por
|
||||
- marszałka, marszałek -> marsz
|
||||
- chorążego, chorąży -> chor
|
||||
- szeregowego, szeregowego -> szer
|
||||
- kaprala, kapral -> kpr
|
||||
- plutonowego, plutonowy -> plut
|
||||
- kapitana, kapitan -> kpt
|
||||
- admirała, admirał -> adm
|
||||
- wiceadmirała, wiceadmirał -> wadm
|
||||
- kontradmirała, kontradmirał -> kontradm
|
||||
- batalionów, bataliony -> bat
|
||||
- batalionu, batalion -> bat
|
196
settings/icu-rules/variants-pt.yaml
Normal file
196
settings/icu-rules/variants-pt.yaml
Normal file
@ -0,0 +1,196 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Portugu.C3.AAs_-_Portuguese
|
||||
- lang: pt
|
||||
words:
|
||||
- Associação -> Ass
|
||||
- Alameda -> Al
|
||||
- Alferes -> Alf
|
||||
- Almirante -> Alm
|
||||
- Arquitecto -> Arq
|
||||
- Arquitecto -> Arqº
|
||||
- Arquiteto -> Arq
|
||||
- Arquiteto -> Arqº
|
||||
- Auto-estrada -> A
|
||||
- Avenida -> Av
|
||||
- Avenida -> Avª
|
||||
- Azinhaga -> Az
|
||||
- Bairro -> B
|
||||
- Bairro -> Bº
|
||||
- Bairro -> Br
|
||||
- Beco -> Bc
|
||||
- Beco -> Bco
|
||||
- Bloco -> Bl
|
||||
- Bombeiros Voluntários -> BV
|
||||
- Bombeiros Voluntários -> B.V
|
||||
- Brigadeiro -> Brg
|
||||
- Cacique -> Cac
|
||||
- Calçada -> Cc
|
||||
- Calçadinha -> Ccnh
|
||||
- Câmara Municipal -> CM
|
||||
- Câmara Municipal -> C.M
|
||||
- Caminho -> Cam
|
||||
- Capitão -> Cap
|
||||
- Casal -> Csl
|
||||
- Cave -> Cv
|
||||
- Centro Comercial -> CC
|
||||
- Centro Comercial -> C.C
|
||||
- Ciclo do Ensino Básico -> CEB
|
||||
- Ciclo do Ensino Básico -> C.E.B
|
||||
- Ciclo do Ensino Básico -> C. E. B
|
||||
- Comandante -> Cmdt
|
||||
- Comendador -> Comend
|
||||
- Companhia -> Cª
|
||||
- Conselheiro -> Cons
|
||||
- Coronel -> Cor
|
||||
- Coronel -> Cel
|
||||
- Corte -> C.te
|
||||
- De -> D´
|
||||
- De -> D'
|
||||
- Departamento -> Dept
|
||||
- Deputado -> Dep
|
||||
- Direito -> Dto
|
||||
- Dom -> D
|
||||
- Dona -> D
|
||||
- Dona -> Dª
|
||||
- Doutor -> Dr
|
||||
- Doutora -> Dr
|
||||
- Doutora -> Drª
|
||||
- Doutora -> Dra
|
||||
- Duque -> Dq
|
||||
- Edifício -> Ed
|
||||
- Edifício -> Edf
|
||||
- Embaixador -> Emb
|
||||
- Empresa Pública -> EP
|
||||
- Empresa Pública -> E.P
|
||||
- Enfermeiro -> Enfo
|
||||
- Enfermeiro -> Enfº
|
||||
- Enfermeiro -> Enf
|
||||
- Engenheiro -> Eng
|
||||
- Engenheiro -> Engº
|
||||
- Engenheira -> Eng
|
||||
- Engenheira -> Engª
|
||||
- Escadas -> Esc
|
||||
- Escadinhas -> Escnh
|
||||
- Escola Básica -> EB
|
||||
- Escola Básica -> E.B
|
||||
- Esquerdo -> Esq
|
||||
- Estação de Tratamento de Águas Residuais -> ETAR
|
||||
- Estação de Tratamento de Águas Residuais -> E.T.A.R
|
||||
- Estrada -> Estr
|
||||
- Estrada Municipal -> EM
|
||||
- Estrada Nacional -> EN
|
||||
- Estrada Regional -> ER
|
||||
- Frei -> Fr
|
||||
- Frente -> Ft
|
||||
- Futebol Clube -> FC
|
||||
- Futebol Clube -> F.C
|
||||
- Guarda Nacional Republicana -> GNR
|
||||
- Guarda Nacional Republicana -> G.N.R
|
||||
- General -> Gen
|
||||
- General -> Gal
|
||||
- Habitação -> Hab
|
||||
- Infante -> Inf
|
||||
- Instituto -> Inst
|
||||
- Irmã -> Ima
|
||||
- Irmã -> Imª
|
||||
- Irmã -> Im
|
||||
- Irmão -> Imo
|
||||
- Irmão -> Imº
|
||||
- Irmão -> Im
|
||||
- Itinerário Complementar -> IC
|
||||
- Itinerário Principal -> IP
|
||||
- Jardim -> Jrd
|
||||
- Júnior -> Jr
|
||||
- Largo -> Lg
|
||||
- Limitada -> Lda
|
||||
- Loja -> Lj
|
||||
- Lote -> Lt
|
||||
- Loteamento -> Loteam
|
||||
- Lugar -> Lg
|
||||
- Lugar -> Lug
|
||||
- Maestro -> Mto
|
||||
- Major -> Maj
|
||||
- Marechal -> Mal
|
||||
- Marquês -> Mq
|
||||
- Madre -> Me
|
||||
- Mestre -> Me
|
||||
- Ministério -> Min
|
||||
- Monsenhor -> Mons
|
||||
- Municipal -> M
|
||||
- Nacional -> N
|
||||
- Nossa -> N
|
||||
- Nossa -> Nª
|
||||
- Nossa Senhora -> Ns
|
||||
- Nosso -> N
|
||||
- Número -> N
|
||||
- Número -> Nº
|
||||
- Padre -> Pe
|
||||
- Parque -> Pq
|
||||
- Particular -> Part
|
||||
- Pátio -> Pto
|
||||
- Pavilhão -> Pav
|
||||
- Polícia de Segurança Pública -> PSP
|
||||
- Polícia de Segurança Pública -> P.S.P
|
||||
- Polícia Judiciária -> PJ
|
||||
- Polícia Judiciária -> P.J
|
||||
- Praça -> Pc
|
||||
- Praça -> Pç
|
||||
- Praça -> Pr
|
||||
- Praceta -> Pct
|
||||
- Praceta -> Pctª
|
||||
- Presidente -> Presid
|
||||
- Primeiro -> 1º
|
||||
- Professor -> Prof
|
||||
- Professora -> Prof
|
||||
- Professora -> Profª
|
||||
- Projectada -> Proj
|
||||
- Projetada -> Proj
|
||||
- Prolongamento -> Prolng
|
||||
- Quadra -> Q
|
||||
- Quadra -> Qd
|
||||
- Quinta -> Qta
|
||||
- Regional -> R
|
||||
- Rés-do-chão -> R/c
|
||||
- Rés-do-chão -> Rc
|
||||
- Rotunda -> Rot
|
||||
- Ribeira -> Rª
|
||||
- Ribeira -> Rib
|
||||
- Ribeira -> Ribª
|
||||
- Rio -> R
|
||||
- Rua -> R
|
||||
- Santa -> Sta
|
||||
- Santa -> Stª
|
||||
- Santo -> St
|
||||
- Santo -> Sto
|
||||
- Santo -> Stº
|
||||
- São -> S
|
||||
- Sargento -> Sarg
|
||||
- Sem Número -> S/n
|
||||
- Sem Número -> Sn
|
||||
- Senhor -> S
|
||||
- Senhor -> Sr
|
||||
- Senhora -> S
|
||||
- Senhora -> Sª
|
||||
- Senhora -> Srª
|
||||
- Senhora -> Sr.ª
|
||||
- Senhora -> S.ra
|
||||
- Senhora -> Sra
|
||||
- Sobre-Loja -> Slj
|
||||
- Sociedade -> Soc
|
||||
- Sociedade Anónima -> SA
|
||||
- Sociedade Anónima -> S.A
|
||||
- Sport Clube -> SC
|
||||
- Sport Clube -> S.C
|
||||
- Sub-Cave -> Scv
|
||||
- Superquadra -> Sq
|
||||
- Tenente -> Ten
|
||||
- Torre -> Tr
|
||||
- Transversal -> Transv
|
||||
- Travessa -> Trav
|
||||
- Travessa -> Trv
|
||||
- Travessa -> Tv
|
||||
- Universidade -> Univ
|
||||
- Urbanização -> Urb
|
||||
- Vila -> Vl
|
||||
- Visconde -> Visc
|
||||
- Vivenda -> Vv
|
||||
- Zona -> Zn
|
36
settings/icu-rules/variants-ro.yaml
Normal file
36
settings/icu-rules/variants-ro.yaml
Normal file
@ -0,0 +1,36 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Rom.C3.A2n.C4.83_-_Romanian
|
||||
- lang: ro
|
||||
words:
|
||||
- Aleea -> ale
|
||||
- Aleea -> al
|
||||
- Bulevardul -> bulevard
|
||||
- Bulevardul -> bulev
|
||||
- Bulevardul -> b-dul
|
||||
- Bulevardul -> blvd
|
||||
- Bulevardul -> blv
|
||||
- Bulevardul -> bdul
|
||||
- Bulevardul -> bul
|
||||
- Bulevardul -> bd
|
||||
- Calea -> cal
|
||||
- Fundătura -> fnd
|
||||
- Fundacul -> fdc
|
||||
- Intrarea -> intr
|
||||
- Intrarea -> int
|
||||
- Piața -> p-ța
|
||||
- Piața -> pța
|
||||
- Strada -> stra
|
||||
- Strada -> str
|
||||
- Stradela -> str-la
|
||||
- Stradela -> sdla
|
||||
- Șoseaua -> sos
|
||||
- Splaiul -> sp
|
||||
- Splaiul -> splaiul
|
||||
- Splaiul -> spl
|
||||
- Vârful -> virful
|
||||
- Vârful -> virf
|
||||
- Vârful -> varf
|
||||
- Vârful -> vf
|
||||
- Muntele -> m-tele
|
||||
- Muntele -> m-te
|
||||
- Muntele -> mnt
|
||||
- Muntele -> mt
|
14
settings/icu-rules/variants-ru.yaml
Normal file
14
settings/icu-rules/variants-ru.yaml
Normal file
@ -0,0 +1,14 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.D0.A0.D1.83.D1.81.D1.81.D0.BA.D0.B8.D0.B9_-_Russian
|
||||
- lang: ru
|
||||
words:
|
||||
- аллея -> ал
|
||||
- бульвар -> бул
|
||||
- набережная -> наб
|
||||
- переулок -> пер
|
||||
- площадь -> пл
|
||||
- проезд -> пр
|
||||
- проспект -> просп
|
||||
- шоссе -> ш
|
||||
- тупик -> туп
|
||||
- улица -> ул
|
||||
- область -> обл
|
20
settings/icu-rules/variants-sk.yaml
Normal file
20
settings/icu-rules/variants-sk.yaml
Normal file
@ -0,0 +1,20 @@
|
||||
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Slovensky_-_Slovak
- lang: sk
words:
- Ulica -> Ul
- Námestie -> Nám
- Svätého, Svätej -> Sv
- Generála -> Gen
- Armádneho generála -> Arm gen
- Doktora, Doktorky -> Dr
- Inžiniera, Inžinierky -> Ing
- Majora -> Mjr
- Profesora, Profesorky -> Prof
- Československej -> Čsl
- Plukovníka -> Plk
- Podplukovníka -> Pplk
- Kapitána -> Kpt
- Poručíka -> Por
- Podporučíka -> Ppor
- Sídlisko -> Sídl
- Nábrežie -> Nábr
35
settings/icu-rules/variants-sl.yaml
Normal file
@ -0,0 +1,35 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Sloven.C5.A1.C4.8Dina_-_Slovenian
- lang: sl
words:
- Cesta -> C
- Gasilski Dom -> GD
- Osnovna šola -> OŠ
- Prostovoljno Gasilsko Društvo -> PGD
- Savinjski -> Savinj
- Slovenskih -> Slov
- Spodnja -> Sp
- Spodnje -> Sp
- Spodnji -> Sp
- Srednja -> Sr
- Srednje -> Sr
- Srednji -> Sr
- Sveta -> Sv
- Svete -> Sv
- Sveti -> Sv
- Svetega -> Sv
- Šent -> Št
- Ulica -> Ul
- Velika -> V
- Velike -> V
- Veliki -> V
- Veliko -> V
- Velikem -> V
- Velika -> Vel
- Velike -> Vel
- Veliki -> Vel
- Veliko -> Vel
- Velikem -> Vel
- Zdravstveni dom -> ZD
- Zgornja -> Zg
- Zgornje -> Zg
- Zgornji -> Zg
21
settings/icu-rules/variants-sv.yaml
Normal file
@ -0,0 +1,21 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Svenska_-_Swedish
- lang: sv
words:
- ~väg, ~vägen -> v
- ~gatan, ~gata -> g
- ~gränd, ~gränden -> gr
- gamla -> G:la
- södra -> s
- södra -> s:a
- norra -> n
- norra -> n:a
- östra -> ö
- östra -> ö:a
- västra -> v
- västra -> v:a
- ~stig, ~stigen -> st
- sankt -> s:t
- sankta -> s:ta
- ~plats, ~platsen -> pl
- lilla -> l
- stora -> st
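The `~` prefix in rules like the Swedish ones above marks a suffix that may occur either attached to the preceding word or as a word of its own. A minimal illustrative sketch (not the project implementation) of what a decomposing rule such as `~weg -> w` is expected to produce, matching the VARIANT_TESTS table further down in this diff:

def toy_variants(name: str, suffix: str, repl: str) -> set:
    """Illustrative only: expand one '~suffix -> repl' rule for one name."""
    variants = {name}
    if name.endswith(suffix):
        stem = name[:-len(suffix)]
        # '->' keeps the original spelling and adds the abbreviated and
        # decomposed forms, both attached and detached.
        variants |= {stem + repl,
                     (stem + ' ' + suffix).strip(),
                     (stem + ' ' + repl).strip()}
    return variants

assert toy_variants('holzweg', 'weg', 'w') == {'holzweg', 'holzw', 'holz weg', 'holz w'}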
14
settings/icu-rules/variants-tr.yaml
Normal file
@ -0,0 +1,14 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#T.C3.BCrk.C3.A7e_-_Turkish
- lang: tr
words:
- Sokak -> Sk
- Sokak -> Sok
- Sokağı -> Sk
- Sokağı -> Sok
- Cadde -> Cd
- Caddesi -> Cd
- Bulvar -> Bl
- Bulvar -> Blv
- Bulvarı -> Bl
- Mahalle -> Mh
- Mahalle -> Mah
10
settings/icu-rules/variants-uk.yaml
Normal file
@ -0,0 +1,10 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.D0.A3.D0.BA.D1.80.D0.B0.D1.97.D0.BD.D1.81.D1.8C.D0.BA.D0.B0_-_Ukrainian
- lang: uk
words:
- бульвар -> бул
- дорога -> дор
- провулок -> пров
- площа -> пл
- проспект -> просп
- шосе -> ш
- вулиця -> вул
48
settings/icu-rules/variants-vi.yaml
Normal file
@ -0,0 +1,48 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Ti.E1.BA.BFng_Vi.E1.BB.87t_.E2.80.93_Vietnamese
- lang: vi
words:
- Thành phố -> TP
- Thị xã -> TX
- Thị trấn -> TT
- Quận -> Q
- Phường -> P
- Phường -> Ph
- Quốc lộ -> QL
- Tỉnh lộ -> TL
- Đại lộ -> ĐL
- Đường -> Đ
- Công trường -> CT
- Quảng trường -> QT
- Sân bay -> SB
- Sân bay quốc tế -> SBQT
- Phi trường -> PT
- Đường sắt -> ĐS
- Trung tâm -> TT
- Trung tâm Thương mại -> TTTM
- Khách sạn -> KS
- Khách sạn -> K/S
- Bưu điện -> BĐ
- Đại học -> ĐH
- Cao đẳng -> CĐ
- Trung học Phổ thông -> THPT
- Trung học Cơ sở -> THCS
- Tiểu học -> TH
- Khu công nghiệp -> KCN
- Khu nghỉ mát -> KNM
- Khu du lịch -> KDL
- Công viên văn hóa -> CVVH
- Công viên -> CV
- Vườn quốc gia -> VQG
- Viện bảo tàng -> VBT
- Sân vận động -> SVĐ
- Nhà thi đấu -> NTĐ
- Câu lạc bộ -> CLB
- Nhà thờ -> NT
- Nhà hát -> NH
- Rạp hát -> RH
- Công ty -> Cty
- Tổng công ty -> TCty
- Tổng công ty -> TCT
- Công ty cổ phần -> CTCP
- Công ty cổ phần -> Cty CP
- Căn cứ không quân -> CCKQ
File diff suppressed because it is too large
56
settings/legacy_icu_tokenizer.yaml
Normal file
@ -0,0 +1,56 @@
normalization:
- ":: lower ()"
- !include icu-rules/unicode-digits-to-decimal.yaml
- "'№' > 'no'"
- "'n°' > 'no'"
- "'nº' > 'no'"
- "ª > a"
- "º > o"
- "[[:Punctuation:][:Symbol:]] > ' '"
- "ß > 'ss'" # German eszett is unambiguously equal to double ss
- "[^[:Letter:] [:Number:] [:Space:]] >"
- "[:Lm:] >"
- ":: [[:Number:]] Latin ()"
- ":: [[:Number:]] Ascii ();"
- ":: [[:Number:]] NFD ();"
- "[[:Nonspacing Mark:] [:Cf:]] >;"
- "[:Space:]+ > ' '"
transliteration:
- ":: Latin ()"
- !include icu-rules/extended-unicode-to-asccii.yaml
- ":: Ascii ()"
- ":: NFD ()"
- "[^[:Ascii:]] >"
- ":: lower ()"
- ":: NFC ()"
variants:
- !include icu-rules/variants-bg.yaml
- !include icu-rules/variants-ca.yaml
- !include icu-rules/variants-cs.yaml
- !include icu-rules/variants-da.yaml
- !include icu-rules/variants-de.yaml
- !include icu-rules/variants-el.yaml
- !include icu-rules/variants-en.yaml
- !include icu-rules/variants-es.yaml
- !include icu-rules/variants-et.yaml
- !include icu-rules/variants-eu.yaml
- !include icu-rules/variants-fi.yaml
- !include icu-rules/variants-fr.yaml
- !include icu-rules/variants-gl.yaml
- !include icu-rules/variants-hu.yaml
- !include icu-rules/variants-it.yaml
- !include icu-rules/variants-ja.yaml
- !include icu-rules/variants-mg.yaml
- !include icu-rules/variants-ms.yaml
- !include icu-rules/variants-nl.yaml
- !include icu-rules/variants-no.yaml
- !include icu-rules/variants-pl.yaml
- !include icu-rules/variants-pt.yaml
- !include icu-rules/variants-ro.yaml
- !include icu-rules/variants-ru.yaml
- !include icu-rules/variants-sk.yaml
- !include icu-rules/variants-sl.yaml
- !include icu-rules/variants-sv.yaml
- !include icu-rules/variants-tr.yaml
- !include icu-rules/variants-uk.yaml
- !include icu-rules/variants-vi.yaml
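For orientation: rule lists such as `normalization` and `transliteration` above are concatenated into ICU transform rules. A minimal sketch of how such rules behave, driving PyICU the same way the tests later in this diff do; the three rule strings are taken from the file above, but the joined set is illustrative rather than complete:

from icu import Transliterator

# Each ICU rule must be terminated with a semicolon when joined.
rules = "".join(rule + ";" for rule in [
    ":: lower ()",
    "[[:Punctuation:][:Symbol:]] > ' '",
    "[:Space:]+ > ' '",
])
trans = Transliterator.createFromRules("norm-sketch", rules)

print(trans.transliterate("Baum-Straße"))   # -> "baum straße"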
@ -53,7 +53,7 @@ Feature: Import and search of names
    Scenario: Special characters in name
        Given the places
            | osm | class | type     | name              |
            | N1  | place | locality | Jim-Knopf-Str     |
            | N1  | place | locality | Jim-Knopf-Straße  |
            | N2  | place | locality | Smith/Weston      |
            | N3  | place | locality | space mountain    |
            | N4  | place | locality | space             |

@ -214,7 +214,7 @@ def check_search_name_contents(context, exclude):
    for name, value in zip(row.headings, row.cells):
        if name in ('name_vector', 'nameaddress_vector'):
            items = [x.strip() for x in value.split(',')]
            tokens = analyzer.get_word_token_info(context.db, items)
            tokens = analyzer.get_word_token_info(items)

            if not exclude:
                assert len(tokens) >= len(items), \

@ -173,6 +173,7 @@ def place_row(place_table, temp_db_cursor):
    """ A factory for rows in the place table. The table is created as a
        prerequisite to the fixture.
    """
    psycopg2.extras.register_hstore(temp_db_cursor)
    idseq = itertools.count(1001)
    def _insert(osm_type='N', osm_id=None, cls='amenity', typ='cafe', names=None,
                admin_level=None, address=None, extratags=None, geom=None):

@ -98,6 +98,13 @@ class MockWordTable:
                              WHERE class = 'place' and type = 'postcode'""")
            return set((row[0] for row in cur))

    def get_partial_words(self):
        with self.conn.cursor() as cur:
            cur.execute("""SELECT word_token, search_name_count FROM word
                           WHERE class is null and country_code is null
                                 and not word_token like ' %'""")
            return set((tuple(row) for row in cur))


class MockPlacexTable:
    """ A placex table for testing.

@ -50,3 +50,68 @@ def test_execute_file_with_post_code(dsn, tmp_path, temp_db_cursor):
    db_utils.execute_file(dsn, tmpfile, post_code='INSERT INTO test VALUES(23)')

    assert temp_db_cursor.row_set('SELECT * FROM test') == {(23, )}


class TestCopyBuffer:
    TABLE_NAME = 'copytable'

    @pytest.fixture(autouse=True)
    def setup_test_table(self, table_factory):
        table_factory(self.TABLE_NAME, 'colA INT, colB TEXT')


    def table_rows(self, cursor):
        return cursor.row_set('SELECT * FROM ' + self.TABLE_NAME)


    def test_copybuffer_empty(self):
        with db_utils.CopyBuffer() as buf:
            buf.copy_out(None, "dummy")


    def test_all_columns(self, temp_db_cursor):
        with db_utils.CopyBuffer() as buf:
            buf.add(3, 'hum')
            buf.add(None, 'f\\t')

            buf.copy_out(temp_db_cursor, self.TABLE_NAME)

        assert self.table_rows(temp_db_cursor) == {(3, 'hum'), (None, 'f\\t')}


    def test_selected_columns(self, temp_db_cursor):
        with db_utils.CopyBuffer() as buf:
            buf.add('foo')

            buf.copy_out(temp_db_cursor, self.TABLE_NAME,
                         columns=['colB'])

        assert self.table_rows(temp_db_cursor) == {(None, 'foo')}


    def test_reordered_columns(self, temp_db_cursor):
        with db_utils.CopyBuffer() as buf:
            buf.add('one', 1)
            buf.add(' two ', 2)

            buf.copy_out(temp_db_cursor, self.TABLE_NAME,
                         columns=['colB', 'colA'])

        assert self.table_rows(temp_db_cursor) == {(1, 'one'), (2, ' two ')}


    def test_special_characters(self, temp_db_cursor):
        with db_utils.CopyBuffer() as buf:
            buf.add('foo\tbar')
            buf.add('sun\nson')
            buf.add('\\N')

            buf.copy_out(temp_db_cursor, self.TABLE_NAME,
                         columns=['colB'])

        assert self.table_rows(temp_db_cursor) == {(None, 'foo\tbar'),
                                                   (None, 'sun\nson'),
                                                   (None, '\\N')}

104
test/python/test_tokenizer_icu_name_processor.py
Normal file
@ -0,0 +1,104 @@
"""
|
||||
Tests for import name normalisation and variant generation.
|
||||
"""
|
||||
from textwrap import dedent
|
||||
|
||||
import pytest
|
||||
|
||||
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
|
||||
from nominatim.tokenizer.icu_name_processor import ICUNameProcessor, ICUNameProcessorRules
|
||||
|
||||
from nominatim.errors import UsageError
|
||||
|
||||
@pytest.fixture
|
||||
def cfgfile(tmp_path, suffix='.yaml'):
|
||||
def _create_config(*variants, **kwargs):
|
||||
content = dedent("""\
|
||||
normalization:
|
||||
- ":: NFD ()"
|
||||
- "'🜳' > ' '"
|
||||
- "[[:Nonspacing Mark:] [:Cf:]] >"
|
||||
- ":: lower ()"
|
||||
- "[[:Punctuation:][:Space:]]+ > ' '"
|
||||
- ":: NFC ()"
|
||||
transliteration:
|
||||
- ":: Latin ()"
|
||||
- "'🜵' > ' '"
|
||||
""")
|
||||
content += "variants:\n - words:\n"
|
||||
content += '\n'.join((" - " + s for s in variants)) + '\n'
|
||||
for k, v in kwargs:
|
||||
content += " {}: {}\n".format(k, v)
|
||||
fpath = tmp_path / ('test_config' + suffix)
|
||||
fpath.write_text(dedent(content))
|
||||
return fpath
|
||||
|
||||
return _create_config
|
||||
|
||||
|
||||
def get_normalized_variants(proc, name):
|
||||
return proc.get_variants_ascii(proc.get_normalized(name))
|
||||
|
||||
|
||||
def test_variants_empty(cfgfile):
|
||||
fpath = cfgfile('saint -> 🜵', 'street -> st')
|
||||
|
||||
rules = ICUNameProcessorRules(loader=ICURuleLoader(fpath))
|
||||
proc = ICUNameProcessor(rules)
|
||||
|
||||
assert get_normalized_variants(proc, '🜵') == []
|
||||
assert get_normalized_variants(proc, '🜳') == []
|
||||
assert get_normalized_variants(proc, 'saint') == ['saint']
|
||||
|
||||
|
||||
VARIANT_TESTS = [
|
||||
(('~strasse,~straße -> str', '~weg => weg'), "hallo", {'hallo'}),
|
||||
(('weg => wg',), "holzweg", {'holzweg'}),
|
||||
(('weg -> wg',), "holzweg", {'holzweg'}),
|
||||
(('~weg => weg',), "holzweg", {'holz weg', 'holzweg'}),
|
||||
(('~weg -> weg',), "holzweg", {'holz weg', 'holzweg'}),
|
||||
(('~weg => w',), "holzweg", {'holz w', 'holzw'}),
|
||||
(('~weg -> w',), "holzweg", {'holz weg', 'holzweg', 'holz w', 'holzw'}),
|
||||
(('~weg => weg',), "Meier Weg", {'meier weg', 'meierweg'}),
|
||||
(('~weg -> weg',), "Meier Weg", {'meier weg', 'meierweg'}),
|
||||
(('~weg => w',), "Meier Weg", {'meier w', 'meierw'}),
|
||||
(('~weg -> w',), "Meier Weg", {'meier weg', 'meierweg', 'meier w', 'meierw'}),
|
||||
(('weg => wg',), "Meier Weg", {'meier wg'}),
|
||||
(('weg -> wg',), "Meier Weg", {'meier weg', 'meier wg'}),
|
||||
(('~strasse,~straße -> str', '~weg => weg'), "Bauwegstraße",
|
||||
{'bauweg straße', 'bauweg str', 'bauwegstraße', 'bauwegstr'}),
|
||||
(('am => a', 'bach => b'), "am bach", {'a b'}),
|
||||
(('am => a', '~bach => b'), "am bach", {'a b'}),
|
||||
(('am -> a', '~bach -> b'), "am bach", {'am bach', 'a bach', 'am b', 'a b'}),
|
||||
(('am -> a', '~bach -> b'), "ambach", {'ambach', 'am bach', 'amb', 'am b'}),
|
||||
(('saint -> s,st', 'street -> st'), "Saint Johns Street",
|
||||
{'saint johns street', 's johns street', 'st johns street',
|
||||
'saint johns st', 's johns st', 'st johns st'}),
|
||||
(('river$ -> r',), "River Bend Road", {'river bend road'}),
|
||||
(('river$ -> r',), "Bent River", {'bent river', 'bent r'}),
|
||||
(('^north => n',), "North 2nd Street", {'n 2nd street'}),
|
||||
(('^north => n',), "Airport North", {'airport north'}),
|
||||
(('am -> a',), "am am am am am am am am", {'am am am am am am am am'}),
|
||||
(('am => a',), "am am am am am am am am", {'a a a a a a a a'})
|
||||
]
|
||||
|
||||
@pytest.mark.parametrize("rules,name,variants", VARIANT_TESTS)
|
||||
def test_variants(cfgfile, rules, name, variants):
|
||||
fpath = cfgfile(*rules)
|
||||
proc = ICUNameProcessor(ICUNameProcessorRules(loader=ICURuleLoader(fpath)))
|
||||
|
||||
result = get_normalized_variants(proc, name)
|
||||
|
||||
assert len(result) == len(set(result))
|
||||
assert set(get_normalized_variants(proc, name)) == variants
|
||||
|
||||
|
||||
def test_search_normalized(cfgfile):
|
||||
fpath = cfgfile('~street => s,st', 'master => mstr')
|
||||
|
||||
rules = ICUNameProcessorRules(loader=ICURuleLoader(fpath))
|
||||
proc = ICUNameProcessor(rules)
|
||||
|
||||
assert proc.get_search_normalized('Master Street') == 'master street'
|
||||
assert proc.get_search_normalized('Earnes St') == 'earnes st'
|
||||
assert proc.get_search_normalized('Nostreet') == 'nostreet'
|
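Taken together, the fixtures above amount to the following usage pattern. This is a sketch only; the rule file name is illustrative, standing in for any file with normalization, transliteration and variants sections:

from pathlib import Path

from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.icu_name_processor import ICUNameProcessor, ICUNameProcessorRules

rules = ICUNameProcessorRules(loader=ICURuleLoader(Path('my_rules.yaml')))
proc = ICUNameProcessor(rules)

normalized = proc.get_normalized('Bauwegstraße')
variants = proc.get_variants_ascii(normalized)           # spellings to index
query_form = proc.get_search_normalized('Bauwegstraße')  # form used at query time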
264
test/python/test_tokenizer_icu_rule_loader.py
Normal file
@ -0,0 +1,264 @@
"""
|
||||
Tests for converting a config file to ICU rules.
|
||||
"""
|
||||
import pytest
|
||||
from textwrap import dedent
|
||||
|
||||
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
|
||||
from nominatim.errors import UsageError
|
||||
|
||||
from icu import Transliterator
|
||||
|
||||
@pytest.fixture
|
||||
def cfgfile(tmp_path, suffix='.yaml'):
|
||||
def _create_config(*variants, **kwargs):
|
||||
content = dedent("""\
|
||||
normalization:
|
||||
- ":: NFD ()"
|
||||
- "[[:Nonspacing Mark:] [:Cf:]] >"
|
||||
- ":: lower ()"
|
||||
- "[[:Punctuation:][:Space:]]+ > ' '"
|
||||
- ":: NFC ()"
|
||||
transliteration:
|
||||
- ":: Latin ()"
|
||||
- "[[:Punctuation:][:Space:]]+ > ' '"
|
||||
""")
|
||||
content += "variants:\n - words:\n"
|
||||
content += '\n'.join((" - " + s for s in variants)) + '\n'
|
||||
for k, v in kwargs:
|
||||
content += " {}: {}\n".format(k, v)
|
||||
fpath = tmp_path / ('test_config' + suffix)
|
||||
fpath.write_text(dedent(content))
|
||||
return fpath
|
||||
|
||||
return _create_config
|
||||
|
||||
|
||||
def test_empty_rule_file(tmp_path):
|
||||
fpath = tmp_path / ('test_config.yaml')
|
||||
fpath.write_text(dedent("""\
|
||||
normalization:
|
||||
transliteration:
|
||||
variants:
|
||||
"""))
|
||||
|
||||
rules = ICURuleLoader(fpath)
|
||||
assert rules.get_search_rules() == ''
|
||||
assert rules.get_normalization_rules() == ''
|
||||
assert rules.get_transliteration_rules() == ''
|
||||
assert list(rules.get_replacement_pairs()) == []
|
||||
|
||||
CONFIG_SECTIONS = ('normalization', 'transliteration', 'variants')
|
||||
|
||||
@pytest.mark.parametrize("section", CONFIG_SECTIONS)
|
||||
def test_missing_normalization(tmp_path, section):
|
||||
fpath = tmp_path / ('test_config.yaml')
|
||||
with fpath.open('w') as fd:
|
||||
for name in CONFIG_SECTIONS:
|
||||
if name != section:
|
||||
fd.write(name + ':\n')
|
||||
|
||||
with pytest.raises(UsageError):
|
||||
ICURuleLoader(fpath)
|
||||
|
||||
|
||||
def test_get_search_rules(cfgfile):
|
||||
loader = ICURuleLoader(cfgfile())
|
||||
|
||||
rules = loader.get_search_rules()
|
||||
trans = Transliterator.createFromRules("test", rules)
|
||||
|
||||
assert trans.transliterate(" Baum straße ") == " baum straße "
|
||||
assert trans.transliterate(" Baumstraße ") == " baumstraße "
|
||||
assert trans.transliterate(" Baumstrasse ") == " baumstrasse "
|
||||
assert trans.transliterate(" Baumstr ") == " baumstr "
|
||||
assert trans.transliterate(" Baumwegstr ") == " baumwegstr "
|
||||
assert trans.transliterate(" Αθήνα ") == " athēna "
|
||||
assert trans.transliterate(" проспект ") == " prospekt "
|
||||
|
||||
|
||||
def test_get_normalization_rules(cfgfile):
|
||||
loader = ICURuleLoader(cfgfile())
|
||||
rules = loader.get_normalization_rules()
|
||||
trans = Transliterator.createFromRules("test", rules)
|
||||
|
||||
assert trans.transliterate(" проспект-Prospekt ") == " проспект prospekt "
|
||||
|
||||
|
||||
def test_get_transliteration_rules(cfgfile):
|
||||
loader = ICURuleLoader(cfgfile())
|
||||
rules = loader.get_transliteration_rules()
|
||||
trans = Transliterator.createFromRules("test", rules)
|
||||
|
||||
assert trans.transliterate(" проспект-Prospekt ") == " prospekt Prospekt "
|
||||
|
||||
|
||||
def test_transliteration_rules_from_file(tmp_path):
|
||||
cfgpath = tmp_path / ('test_config.yaml')
|
||||
cfgpath.write_text(dedent("""\
|
||||
normalization:
|
||||
transliteration:
|
||||
- "'ax' > 'b'"
|
||||
- !include transliteration.yaml
|
||||
variants:
|
||||
"""))
|
||||
transpath = tmp_path / ('transliteration.yaml')
|
||||
transpath.write_text('- "x > y"')
|
||||
|
||||
loader = ICURuleLoader(cfgpath)
|
||||
rules = loader.get_transliteration_rules()
|
||||
trans = Transliterator.createFromRules("test", rules)
|
||||
|
||||
assert trans.transliterate(" axxt ") == " byt "
|
||||
|
||||
|
||||
class TestGetReplacements:
|
||||
|
||||
@pytest.fixture(autouse=True)
|
||||
def setup_cfg(self, cfgfile):
|
||||
self.cfgfile = cfgfile
|
||||
|
||||
def get_replacements(self, *variants):
|
||||
loader = ICURuleLoader(self.cfgfile(*variants))
|
||||
rules = loader.get_replacement_pairs()
|
||||
|
||||
return set((v.source, v.replacement) for v in rules)
|
||||
|
||||
|
||||
@pytest.mark.parametrize("variant", ['foo > bar', 'foo -> bar -> bar',
|
||||
'~foo~ -> bar', 'fo~ o -> bar'])
|
||||
def test_invalid_variant_description(self, variant):
|
||||
with pytest.raises(UsageError):
|
||||
ICURuleLoader(self.cfgfile(variant))
|
||||
|
||||
def test_add_full(self):
|
||||
repl = self.get_replacements("foo -> bar")
|
||||
|
||||
assert repl == {(' foo ', ' bar '), (' foo ', ' foo ')}
|
||||
|
||||
|
||||
def test_replace_full(self):
|
||||
repl = self.get_replacements("foo => bar")
|
||||
|
||||
assert repl == {(' foo ', ' bar ')}
|
||||
|
||||
|
||||
def test_add_suffix_no_decompose(self):
|
||||
repl = self.get_replacements("~berg |-> bg")
|
||||
|
||||
assert repl == {('berg ', 'berg '), ('berg ', 'bg '),
|
||||
(' berg ', ' berg '), (' berg ', ' bg ')}
|
||||
|
||||
|
||||
def test_replace_suffix_no_decompose(self):
|
||||
repl = self.get_replacements("~berg |=> bg")
|
||||
|
||||
assert repl == {('berg ', 'bg '), (' berg ', ' bg ')}
|
||||
|
||||
|
||||
def test_add_suffix_decompose(self):
|
||||
repl = self.get_replacements("~berg -> bg")
|
||||
|
||||
assert repl == {('berg ', 'berg '), ('berg ', ' berg '),
|
||||
(' berg ', ' berg '), (' berg ', 'berg '),
|
||||
('berg ', 'bg '), ('berg ', ' bg '),
|
||||
(' berg ', 'bg '), (' berg ', ' bg ')}
|
||||
|
||||
|
||||
def test_replace_suffix_decompose(self):
|
||||
repl = self.get_replacements("~berg => bg")
|
||||
|
||||
assert repl == {('berg ', 'bg '), ('berg ', ' bg '),
|
||||
(' berg ', 'bg '), (' berg ', ' bg ')}
|
||||
|
||||
|
||||
def test_add_prefix_no_compose(self):
|
||||
repl = self.get_replacements("hinter~ |-> hnt")
|
||||
|
||||
assert repl == {(' hinter', ' hinter'), (' hinter ', ' hinter '),
|
||||
(' hinter', ' hnt'), (' hinter ', ' hnt ')}
|
||||
|
||||
|
||||
def test_replace_prefix_no_compose(self):
|
||||
repl = self.get_replacements("hinter~ |=> hnt")
|
||||
|
||||
assert repl == {(' hinter', ' hnt'), (' hinter ', ' hnt ')}
|
||||
|
||||
|
||||
def test_add_prefix_compose(self):
|
||||
repl = self.get_replacements("hinter~-> h")
|
||||
|
||||
assert repl == {(' hinter', ' hinter'), (' hinter', ' hinter '),
|
||||
(' hinter', ' h'), (' hinter', ' h '),
|
||||
(' hinter ', ' hinter '), (' hinter ', ' hinter'),
|
||||
(' hinter ', ' h '), (' hinter ', ' h')}
|
||||
|
||||
|
||||
def test_replace_prefix_compose(self):
|
||||
repl = self.get_replacements("hinter~=> h")
|
||||
|
||||
assert repl == {(' hinter', ' h'), (' hinter', ' h '),
|
||||
(' hinter ', ' h '), (' hinter ', ' h')}
|
||||
|
||||
|
||||
def test_add_beginning_only(self):
|
||||
repl = self.get_replacements("^Premier -> Pr")
|
||||
|
||||
assert repl == {('^ premier ', '^ premier '), ('^ premier ', '^ pr ')}
|
||||
|
||||
|
||||
def test_replace_beginning_only(self):
|
||||
repl = self.get_replacements("^Premier => Pr")
|
||||
|
||||
assert repl == {('^ premier ', '^ pr ')}
|
||||
|
||||
|
||||
def test_add_final_only(self):
|
||||
repl = self.get_replacements("road$ -> rd")
|
||||
|
||||
assert repl == {(' road ^', ' road ^'), (' road ^', ' rd ^')}
|
||||
|
||||
|
||||
def test_replace_final_only(self):
|
||||
repl = self.get_replacements("road$ => rd")
|
||||
|
||||
assert repl == {(' road ^', ' rd ^')}
|
||||
|
||||
|
||||
def test_decompose_only(self):
|
||||
repl = self.get_replacements("~foo -> foo")
|
||||
|
||||
assert repl == {('foo ', 'foo '), ('foo ', ' foo '),
|
||||
(' foo ', 'foo '), (' foo ', ' foo ')}
|
||||
|
||||
|
||||
def test_add_suffix_decompose_end_only(self):
|
||||
repl = self.get_replacements("~berg |-> bg", "~berg$ -> bg")
|
||||
|
||||
assert repl == {('berg ', 'berg '), ('berg ', 'bg '),
|
||||
(' berg ', ' berg '), (' berg ', ' bg '),
|
||||
('berg ^', 'berg ^'), ('berg ^', ' berg ^'),
|
||||
('berg ^', 'bg ^'), ('berg ^', ' bg ^'),
|
||||
(' berg ^', 'berg ^'), (' berg ^', 'bg ^'),
|
||||
(' berg ^', ' berg ^'), (' berg ^', ' bg ^')}
|
||||
|
||||
|
||||
def test_replace_suffix_decompose_end_only(self):
|
||||
repl = self.get_replacements("~berg |=> bg", "~berg$ => bg")
|
||||
|
||||
assert repl == {('berg ', 'bg '), (' berg ', ' bg '),
|
||||
('berg ^', 'bg ^'), ('berg ^', ' bg ^'),
|
||||
(' berg ^', 'bg ^'), (' berg ^', ' bg ^')}
|
||||
|
||||
|
||||
def test_add_multiple_suffix(self):
|
||||
repl = self.get_replacements("~berg,~burg -> bg")
|
||||
|
||||
assert repl == {('berg ', 'berg '), ('berg ', ' berg '),
|
||||
(' berg ', ' berg '), (' berg ', 'berg '),
|
||||
('berg ', 'bg '), ('berg ', ' bg '),
|
||||
(' berg ', 'bg '), (' berg ', ' bg '),
|
||||
('burg ', 'burg '), ('burg ', ' burg '),
|
||||
(' burg ', ' burg '), (' burg ', 'burg '),
|
||||
('burg ', 'bg '), ('burg ', ' bg '),
|
||||
(' burg ', 'bg '), (' burg ', ' bg ')}
|
@ -260,7 +260,9 @@ def test_update_special_phrase_modify(analyzer, word_table, make_standard_name):


def test_add_country_names(analyzer, word_table, make_standard_name):
    analyzer.add_country_names('de', ['Germany', 'Deutschland', 'germany'])
    analyzer.add_country_names('de', {'name': 'Germany',
                                      'name:de': 'Deutschland',
                                      'short_name': 'germany'})

    assert word_table.get_country() \
        == {('de', ' #germany#'),
@ -272,7 +274,7 @@ def test_add_more_country_names(analyzer, word_table, make_standard_name):
    word_table.add_country('it', ' #italy#')
    word_table.add_country('it', ' #itala#')

    analyzer.add_country_names('it', ['Italy', 'IT'])
    analyzer.add_country_names('it', {'name': 'Italy', 'ref': 'IT'})

    assert word_table.get_country() \
        == {('fr', ' #france#'),
@ -2,10 +2,13 @@
Tests for Legacy ICU tokenizer.
"""
import shutil
import yaml

import pytest

from nominatim.tokenizer import legacy_icu_tokenizer
from nominatim.tokenizer.icu_name_processor import ICUNameProcessorRules
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.db import properties


@ -40,16 +43,10 @@ def tokenizer_factory(dsn, tmp_path, property_table,
@pytest.fixture
def db_prop(temp_db_conn):
    def _get_db_property(name):
        return properties.get_property(temp_db_conn,
                                       getattr(legacy_icu_tokenizer, name))
        return properties.get_property(temp_db_conn, name)

    return _get_db_property

@pytest.fixture
def tokenizer_setup(tokenizer_factory, test_config):
    tok = tokenizer_factory()
    tok.init_new_db(test_config)


@pytest.fixture
def analyzer(tokenizer_factory, test_config, monkeypatch,
@ -62,9 +59,15 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,
    tok.init_new_db(test_config)
    monkeypatch.undo()

    def _mk_analyser(trans=':: upper();', abbr=(('STREET', 'ST'), )):
        tok.transliteration = trans
        tok.abbreviations = abbr
    def _mk_analyser(norm=("[[:Punctuation:][:Space:]]+ > ' '",), trans=(':: upper()',),
                     variants=('~gasse -> gasse', 'street => st', )):
        cfgfile = tmp_path / 'analyser_test_config.yaml'
        with cfgfile.open('w') as stream:
            cfgstr = {'normalization' : list(norm),
                      'transliteration' : list(trans),
                      'variants' : [ {'words': list(variants)}]}
            yaml.dump(cfgstr, stream)
        tok.naming_rules = ICUNameProcessorRules(loader=ICURuleLoader(cfgfile))

    return tok.name_analyzer()

@ -72,10 +75,54 @@


@pytest.fixture
def getorcreate_term_id(temp_db_cursor):
    temp_db_cursor.execute("""CREATE OR REPLACE FUNCTION getorcreate_term_id(lookup_term TEXT)
                              RETURNS INTEGER AS $$
                                SELECT nextval('seq_word')::INTEGER; $$ LANGUAGE SQL""")
def getorcreate_full_word(temp_db_cursor):
    temp_db_cursor.execute("""CREATE OR REPLACE FUNCTION getorcreate_full_word(
                                  norm_term TEXT, lookup_terms TEXT[],
                                  OUT full_token INT,
                                  OUT partial_tokens INT[])
  AS $$
  DECLARE
    partial_terms TEXT[] = '{}'::TEXT[];
    term TEXT;
    term_id INTEGER;
    term_count INTEGER;
  BEGIN
    SELECT min(word_id) INTO full_token
      FROM word WHERE word = norm_term and class is null and country_code is null;

    IF full_token IS NULL THEN
      full_token := nextval('seq_word');
      INSERT INTO word (word_id, word_token, word, search_name_count)
        SELECT full_token, ' ' || lookup_term, norm_term, 0 FROM unnest(lookup_terms) as lookup_term;
    END IF;

    FOR term IN SELECT unnest(string_to_array(unnest(lookup_terms), ' ')) LOOP
      term := trim(term);
      IF NOT (ARRAY[term] <@ partial_terms) THEN
        partial_terms := partial_terms || term;
      END IF;
    END LOOP;

    partial_tokens := '{}'::INT[];
    FOR term IN SELECT unnest(partial_terms) LOOP
      SELECT min(word_id), max(search_name_count) INTO term_id, term_count
        FROM word WHERE word_token = term and class is null and country_code is null;

      IF term_id IS NULL THEN
        term_id := nextval('seq_word');
        term_count := 0;
        INSERT INTO word (word_id, word_token, search_name_count)
          VALUES (term_id, term, 0);
      END IF;

      IF NOT (ARRAY[term_id] <@ partial_tokens) THEN
        partial_tokens := partial_tokens || term_id;
      END IF;
    END LOOP;
  END;
  $$
  LANGUAGE plpgsql;
  """)


@pytest.fixture
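The plpgsql helper defined in the fixture above can be exercised directly from Python; a sketch only, with illustrative connection parameters (psycopg2 adapts the Python list to a PostgreSQL array):

import psycopg2

conn = psycopg2.connect(dbname='test_nominatim')  # dsn is illustrative
with conn.cursor() as cur:
    cur.execute("SELECT * FROM getorcreate_full_word(%s, %s)",
                ('holzweg', ['holzweg', 'holz weg']))
    # full_token: id of the full name; partial_tokens: ids of the
    # space-separated partial terms 'holzweg', 'holz' and 'weg'.
    full_token, partial_tokens = cur.fetchone()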
@ -91,19 +138,37 @@ def test_init_new(tokenizer_factory, test_config, monkeypatch, db_prop):
    tok = tokenizer_factory()
    tok.init_new_db(test_config)

    assert db_prop('DBCFG_NORMALIZATION') == ':: lower();'
    assert db_prop('DBCFG_TRANSLITERATION') is not None
    assert db_prop('DBCFG_ABBREVIATIONS') is not None
    assert db_prop(legacy_icu_tokenizer.DBCFG_TERM_NORMALIZATION) == ':: lower();'
    assert db_prop(legacy_icu_tokenizer.DBCFG_MAXWORDFREQ) is not None


def test_init_from_project(tokenizer_setup, tokenizer_factory):
def test_init_word_table(tokenizer_factory, test_config, place_row, word_table):
    place_row(names={'name' : 'Test Area', 'ref' : '52'})
    place_row(names={'name' : 'No Area'})
    place_row(names={'name' : 'Holzstrasse'})

    tok = tokenizer_factory()
    tok.init_new_db(test_config)

    assert word_table.get_partial_words() == {('test', 1),
                                              ('no', 1), ('area', 2),
                                              ('holz', 1), ('strasse', 1),
                                              ('str', 1)}


def test_init_from_project(monkeypatch, test_config, tokenizer_factory):
    monkeypatch.setenv('NOMINATIM_TERM_NORMALIZATION', ':: lower();')
    monkeypatch.setenv('NOMINATIM_MAX_WORD_FREQUENCY', '90300')
    tok = tokenizer_factory()
    tok.init_new_db(test_config)
    monkeypatch.undo()

    tok = tokenizer_factory()
    tok.init_from_project()

    assert tok.normalization is not None
    assert tok.transliteration is not None
    assert tok.abbreviations is not None
    assert tok.naming_rules is not None
    assert tok.term_normalization == ':: lower();'
    assert tok.max_word_frequency == '90300'


def test_update_sql_functions(db_prop, temp_db_cursor,
@ -114,7 +179,7 @@ def test_update_sql_functions(db_prop, temp_db_cursor,
    tok.init_new_db(test_config)
    monkeypatch.undo()

    assert db_prop('DBCFG_MAXWORDFREQ') == '1133'
    assert db_prop(legacy_icu_tokenizer.DBCFG_MAXWORDFREQ) == '1133'

    table_factory('test', 'txt TEXT')

@ -127,18 +192,11 @@
    assert test_content == set((('1133', ), ))


def test_make_standard_word(analyzer):
    with analyzer(abbr=(('STREET', 'ST'), ('tiny', 't'))) as anl:
        assert anl.make_standard_word('tiny street') == 'TINY ST'

    with analyzer(abbr=(('STRASSE', 'STR'), ('STR', 'ST'))) as anl:
        assert anl.make_standard_word('Hauptstrasse') == 'HAUPTST'


def test_make_standard_hnr(analyzer):
    with analyzer(abbr=(('IV', '4'),)) as anl:
        assert anl._make_standard_hnr('345') == '345'
        assert anl._make_standard_hnr('iv') == 'IV'
def test_normalize_postcode(analyzer):
    with analyzer() as anl:
        assert anl.normalize_postcode('123') == '123'
        assert anl.normalize_postcode('ab-34 ') == 'AB-34'
        assert anl.normalize_postcode('38 Б') == '38 Б'


def test_update_postcodes_from_db_empty(analyzer, table_factory, word_table):
@ -168,15 +226,15 @@ def test_update_postcodes_from_db_add_and_remove(analyzer, table_factory, word_t
def test_update_special_phrase_empty_table(analyzer, word_table):
    with analyzer() as anl:
        anl.update_special_phrases([
            ("König bei", "amenity", "royal", "near"),
            ("Könige", "amenity", "royal", "-"),
            ("König bei", "amenity", "royal", "near"),
            ("Könige ", "amenity", "royal", "-"),
            ("street", "highway", "primary", "in")
        ], True)

    assert word_table.get_special() \
        == {(' KÖNIG BEI', 'könig bei', 'amenity', 'royal', 'near'),
            (' KÖNIGE', 'könige', 'amenity', 'royal', None),
            (' ST', 'street', 'highway', 'primary', 'in')}
        == {(' KÖNIG BEI', 'König bei', 'amenity', 'royal', 'near'),
            (' KÖNIGE', 'Könige', 'amenity', 'royal', None),
            (' STREET', 'street', 'highway', 'primary', 'in')}


def test_update_special_phrase_delete_all(analyzer, word_table):
@ -222,66 +280,188 @@ def test_update_special_phrase_modify(analyzer, word_table):
            (' GARDEN', 'garden', 'leisure', 'garden', 'near')}


def test_process_place_names(analyzer, getorcreate_term_id):
def test_add_country_names_new(analyzer, word_table):
    with analyzer() as anl:
        info = anl.process_place({'name' : {'name' : 'Soft bAr', 'ref': '34'}})
        anl.add_country_names('es', {'name': 'Espagña', 'name:en': 'Spain'})

    assert info['names'] == '{1,2,3,4,5}'
    assert word_table.get_country() == {('es', ' ESPAGÑA'), ('es', ' SPAIN')}


@pytest.mark.parametrize('sep', [',' , ';'])
def test_full_names_with_separator(analyzer, getorcreate_term_id, sep):
def test_add_country_names_extend(analyzer, word_table):
    word_table.add_country('ch', ' SCHWEIZ')

    with analyzer() as anl:
        names = anl._compute_full_names({'name' : sep.join(('New York', 'Big Apple'))})
        anl.add_country_names('ch', {'name': 'Schweiz', 'name:fr': 'Suisse'})

    assert names == set(('NEW YORK', 'BIG APPLE'))
    assert word_table.get_country() == {('ch', ' SCHWEIZ'), ('ch', ' SUISSE')}


def test_full_names_with_bracket(analyzer, getorcreate_term_id):
    with analyzer() as anl:
        names = anl._compute_full_names({'name' : 'Houseboat (left)'})
class TestPlaceNames:

    assert names == set(('HOUSEBOAT (LEFT)', 'HOUSEBOAT'))
    @pytest.fixture(autouse=True)
    def setup(self, analyzer, getorcreate_full_word):
        with analyzer() as anl:
            self.analyzer = anl
            yield anl


@pytest.mark.parametrize('pcode', ['12345', 'AB 123', '34-345'])
def test_process_place_postcode(analyzer, word_table, pcode):
    with analyzer() as anl:
        anl.process_place({'address': {'postcode' : pcode}})
    def expect_name_terms(self, info, *expected_terms):
        tokens = self.analyzer.get_word_token_info(expected_terms)
        for token in tokens:
            assert token[2] is not None, "No token for {0}".format(token)

    assert word_table.get_postcodes() == {pcode, }
        assert eval(info['names']) == set((t[2] for t in tokens))


@pytest.mark.parametrize('pcode', ['12:23', 'ab;cd;f', '123;836'])
def test_process_place_bad_postcode(analyzer, word_table, pcode):
    with analyzer() as anl:
        anl.process_place({'address': {'postcode' : pcode}})
    def test_simple_names(self):
        info = self.analyzer.process_place({'name': {'name': 'Soft bAr', 'ref': '34'}})

    assert not word_table.get_postcodes()
        self.expect_name_terms(info, '#Soft bAr', '#34','Soft', 'bAr', '34')


@pytest.mark.parametrize('hnr', ['123a', '1', '101'])
def test_process_place_housenumbers_simple(analyzer, hnr, getorcreate_hnr_id):
    with analyzer() as anl:
        info = anl.process_place({'address': {'housenumber' : hnr}})
    @pytest.mark.parametrize('sep', [',' , ';'])
    def test_names_with_separator(self, sep):
        info = self.analyzer.process_place({'name': {'name': sep.join(('New York', 'Big Apple'))}})

    assert info['hnr'] == hnr.upper()
    assert info['hnr_tokens'] == "{-1}"
        self.expect_name_terms(info, '#New York', '#Big Apple',
                               'new', 'york', 'big', 'apple')


def test_process_place_housenumbers_lists(analyzer, getorcreate_hnr_id):
    with analyzer() as anl:
        info = anl.process_place({'address': {'conscriptionnumber' : '1; 2;3'}})
    def test_full_names_with_bracket(self):
        info = self.analyzer.process_place({'name': {'name': 'Houseboat (left)'}})

    assert set(info['hnr'].split(';')) == set(('1', '2', '3'))
    assert info['hnr_tokens'] == "{-1,-2,-3}"
        self.expect_name_terms(info, '#Houseboat (left)', '#Houseboat',
                               'houseboat', 'left')


def test_process_place_housenumbers_duplicates(analyzer, getorcreate_hnr_id):
    with analyzer() as anl:
        info = anl.process_place({'address': {'housenumber' : '134',
                                              'conscriptionnumber' : '134',
                                              'streetnumber' : '99a'}})
    def test_country_name(self, word_table):
        info = self.analyzer.process_place({'name': {'name': 'Norge'},
                                            'country_feature': 'no'})

        self.expect_name_terms(info, '#norge', 'norge')
        assert word_table.get_country() == {('no', ' NORGE')}


class TestPlaceAddress:

    @pytest.fixture(autouse=True)
    def setup(self, analyzer, getorcreate_full_word):
        with analyzer(trans=(":: upper()", "'🜵' > ' '")) as anl:
            self.analyzer = anl
            yield anl


    def process_address(self, **kwargs):
        return self.analyzer.process_place({'address': kwargs})


    def name_token_set(self, *expected_terms):
        tokens = self.analyzer.get_word_token_info(expected_terms)
        for token in tokens:
            assert token[2] is not None, "No token for {0}".format(token)

        return set((t[2] for t in tokens))


    @pytest.mark.parametrize('pcode', ['12345', 'AB 123', '34-345'])
    def test_process_place_postcode(self, word_table, pcode):
        self.process_address(postcode=pcode)

        assert word_table.get_postcodes() == {pcode, }


    @pytest.mark.parametrize('pcode', ['12:23', 'ab;cd;f', '123;836'])
    def test_process_place_bad_postcode(self, word_table, pcode):
        self.process_address(postcode=pcode)

        assert not word_table.get_postcodes()


    @pytest.mark.parametrize('hnr', ['123a', '1', '101'])
    def test_process_place_housenumbers_simple(self, hnr, getorcreate_hnr_id):
        info = self.process_address(housenumber=hnr)

        assert info['hnr'] == hnr.upper()
        assert info['hnr_tokens'] == "{-1}"


    def test_process_place_housenumbers_lists(self, getorcreate_hnr_id):
        info = self.process_address(conscriptionnumber='1; 2;3')

        assert set(info['hnr'].split(';')) == set(('1', '2', '3'))
        assert info['hnr_tokens'] == "{-1,-2,-3}"


    def test_process_place_housenumbers_duplicates(self, getorcreate_hnr_id):
        info = self.process_address(housenumber='134',
                                    conscriptionnumber='134',
                                    streetnumber='99a')

        assert set(info['hnr'].split(';')) == set(('134', '99A'))
        assert info['hnr_tokens'] == "{-1,-2}"


    def test_process_place_housenumbers_cached(self, getorcreate_hnr_id):
        info = self.process_address(housenumber="45")
        assert info['hnr_tokens'] == "{-1}"

        info = self.process_address(housenumber="46")
        assert info['hnr_tokens'] == "{-2}"

        info = self.process_address(housenumber="41;45")
        assert eval(info['hnr_tokens']) == {-1, -3}

        info = self.process_address(housenumber="41")
        assert eval(info['hnr_tokens']) == {-3}


    def test_process_place_street(self):
        info = self.process_address(street='Grand Road')

        assert eval(info['street']) == self.name_token_set('#GRAND ROAD')


    def test_process_place_street_empty(self):
        info = self.process_address(street='🜵')

        assert 'street' not in info


    def test_process_place_place(self):
        info = self.process_address(place='Honu Lulu')

        assert eval(info['place_search']) == self.name_token_set('#HONU LULU',
                                                                 'HONU', 'LULU')
        assert eval(info['place_match']) == self.name_token_set('#HONU LULU')


    def test_process_place_place_empty(self):
        info = self.process_address(place='🜵')

        assert 'place_search' not in info
        assert 'place_match' not in info


    def test_process_place_address_terms(self):
        info = self.process_address(country='de', city='Zwickau', state='Sachsen',
                                    suburb='Zwickau', street='Hauptstr',
                                    full='right behind the church')

        city_full = self.name_token_set('#ZWICKAU')
        city_all = self.name_token_set('#ZWICKAU', 'ZWICKAU')
        state_full = self.name_token_set('#SACHSEN')
        state_all = self.name_token_set('#SACHSEN', 'SACHSEN')

        result = {k: [eval(v[0]), eval(v[1])] for k,v in info['addr'].items()}

        assert result == {'city': [city_all, city_full],
                          'suburb': [city_all, city_full],
                          'state': [state_all, state_full]}


    def test_process_place_address_terms_empty(self):
        info = self.process_address(country='de', city=' ', street='Hauptstr',
                                    full='right behind the church')

        assert 'addr' not in info

    assert set(info['hnr'].split(';')) == set(('134', '99A'))
    assert info['hnr_tokens'] == "{-1,-2}"

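Read together, the tests above pin down the shape of the dict that process_place returns. A sketch with illustrative values (which keys appear depends on the input place):

info = {
    'names': '{1,2,3,4,5}',    # token ids covering all name variants
    'hnr': '134;99A',          # normalised house numbers, ';'-separated
    'hnr_tokens': '{-1,-2}',   # word ids of the house-number tokens
    'street': '{42}',          # token set matched against addr:street
}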
@ -180,7 +180,7 @@ def test_create_country_names(temp_db_with_extensions, temp_db_conn, temp_db_cur

    assert len(tokenizer.analyser_cache['countries']) == 2

    result_set = {k: set(v) for k, v in tokenizer.analyser_cache['countries']}
    result_set = {k: set(v.values()) for k, v in tokenizer.analyser_cache['countries']}

    if languages:
        assert result_set == {'us' : set(('us', 'us1', 'United States')),
@ -42,7 +42,7 @@
                        python3-pip python3-setuptools python3-devel \
                        expat-devel zlib-devel libicu-dev

    pip3 install --user psycopg2 python-dotenv psutil Jinja2 PyICU
    pip3 install --user psycopg2 python-dotenv psutil Jinja2 PyICU datrie


#
@ -35,7 +35,7 @@
                        python3-pip python3-setuptools python3-devel \
                        expat-devel zlib-devel libicu-dev

    pip3 install --user psycopg2 python-dotenv psutil Jinja2 PyICU
    pip3 install --user psycopg2 python-dotenv psutil Jinja2 PyICU datrie


#
@ -32,10 +32,10 @@ export DEBIAN_FRONTEND=noninteractive #DOCS:
                        php php-pgsql php-intl libicu-dev python3-pip \
                        python3-psycopg2 python3-psutil python3-jinja2 python3-icu git

# The python-dotenv package that comes with Ubuntu 18.04 is too old, so
# The python-dotenv and datrie packages that come with Ubuntu 18.04 are too old, so
# install the latest version from pip:

    pip3 install python-dotenv
    pip3 install python-dotenv datrie

#
# System Configuration
@ -33,7 +33,8 @@ export DEBIAN_FRONTEND=noninteractive #DOCS:
                        postgresql-server-dev-12 postgresql-12-postgis-3 \
                        postgresql-contrib-12 postgresql-12-postgis-3-scripts \
                        php php-pgsql php-intl libicu-dev python3-dotenv \
                        python3-psycopg2 python3-psutil python3-jinja2 python3-icu git
                        python3-psycopg2 python3-psutil python3-jinja2 \
                        python3-icu python3-datrie git

#
# System Configuration