Merge pull request #2381 from lonvia/reorganise-abbreviations

Reorganise abbreviation handling
This commit is contained in:
Sarah Hoffmann 2021-07-05 10:32:16 +02:00 committed by GitHub
commit d4c7bf20a2
64 changed files with 8671 additions and 5320 deletions


@ -1,13 +1,26 @@
name: 'Build Nominatim'
inputs:
ubuntu:
description: 'Version of Ubuntu to install on'
required: false
default: '20'
runs:
using: "composite"
steps:
- name: Install prerequisites
run: |
sudo apt-get install -y -qq libboost-system-dev libboost-filesystem-dev libexpat1-dev zlib1g-dev libbz2-dev libpq-dev libproj-dev libicu-dev python3-psycopg2 python3-pyosmium python3-dotenv python3-psutil python3-jinja2 python3-icu
sudo apt-get install -y -qq libboost-system-dev libboost-filesystem-dev libexpat1-dev zlib1g-dev libbz2-dev libpq-dev libproj-dev libicu-dev
if [ "x$UBUNTUVER" == "x18" ]; then
pip3 install python-dotenv psycopg2==2.7.7 jinja2==2.8 psutil==5.4.2 pyicu osmium
else
sudo apt-get install -y -qq python3-icu python3-datrie python3-pyosmium python3-jinja2 python3-psutil python3-psycopg2 python3-dotenv
fi
shell: bash
env:
UBUNTUVER: ${{ inputs.ubuntu }}
- name: Download dependencies
run: |


@ -134,13 +134,8 @@ jobs:
postgresql-version: ${{ matrix.postgresql }}
postgis-version: ${{ matrix.postgis }}
- uses: ./Nominatim/.github/actions/build-nominatim
- name: Install extra dependencies for Ubuntu 18
run: |
sudo apt-get install libicu-dev
pip3 install python-dotenv psycopg2==2.7.7 jinja2==2.8 psutil==5.4.2 pyicu osmium
shell: bash
if: matrix.ubuntu == 18
with:
ubuntu: ${{ matrix.ubuntu }}
- name: Clean installation
run: rm -rf Nominatim build


@ -1,7 +1,7 @@
[MASTER]
extension-pkg-whitelist=osmium
ignored-modules=icu
ignored-modules=icu,datrie
[MESSAGES CONTROL]


@ -258,5 +258,6 @@ install(FILES settings/env.defaults
settings/import-address.style
settings/import-full.style
settings/import-extratags.style
settings/legacy_icu_tokenizer.json
settings/legacy_icu_tokenizer.yaml
settings/icu-rules/extended-unicode-to-asccii.yaml
DESTINATION ${NOMINATIM_CONFIGDIR})


@ -45,6 +45,7 @@ For running Nominatim:
* [psutil](https://github.com/giampaolo/psutil)
* [Jinja2](https://palletsprojects.com/p/jinja/)
* [PyICU](https://pypi.org/project/PyICU/)
* [datrie](https://github.com/pytries/datrie)
* [PHP](https://php.net) (7.0 or later)
* PHP-pgsql
* PHP-intl (bundled with PHP)

docs/admin/Tokenizers.md (new file, 205 lines)

@ -0,0 +1,205 @@
# Tokenizers
The tokenizer module in Nominatim is responsible for analysing the names given
to OSM objects and the terms of an incoming query in order to make sure they
can be matched appropriately.
Nominatim offers different tokenizer modules, which behave differently and have
different configuration options. This section describes the tokenizers and how
they can be configured.
!!! important
The use of a tokenizer is tied to a database installation. You need to choose
and configure the tokenizer before starting the initial import. Once the import
is done, you cannot switch to another tokenizer anymore. The options for
reconfiguring the chosen tokenizer afterwards are also very limited. See the
comments in each tokenizer
section.
## Legacy tokenizer
The legacy tokenizer implements the analysis algorithms of older Nominatim
versions. It uses a special PostgreSQL module to normalize names and queries.
This tokenizer is currently the default.
To enable the tokenizer add the following line to your project configuration:
```
NOMINATIM_TOKENIZER=legacy
```
The PostgreSQL module for the tokenizer is available in the `module` directory
and also installed with the remainder of the software under
`lib/nominatim/module/nominatim.so`. You can specify a custom location for
the module with
```
NOMINATIM_DATABASE_MODULE_PATH=<path to directory where nominatim.so resides>
```
This is particularly useful when the database runs on a different server.
See [Advanced installations](Advanced-Installations.md#importing-nominatim-to-an-external-postgresql-database) for details.
There are no other configuration options for the legacy tokenizer. All
normalization functions are hard-coded.
## ICU tokenizer
!!! danger
This tokenizer is currently in active development and still subject
to backwards-incompatible changes.
The ICU tokenizer uses the [ICU library](http://site.icu-project.org/) to
normalize names and queries. It also offers configurable decomposition and
abbreviation handling.
### How it works
On import the tokenizer processes names in the following four stages:
1. The **Normalization** part removes all non-relevant information from the
input.
2. Incoming names are now converted to **full names**. This process is currently
hard coded and mostly serves to handle name tags from OSM that contain
multiple names (e.g. [Biel/Bienne](https://www.openstreetmap.org/node/240097197)).
3. Next the tokenizer creates **variants** from the full names. These variants
cover decomposition and abbreviation handling. Variants are saved to the
database, so that it is not necessary to create the variants for a search
query.
4. The final **Tokenization** step converts the names to a simple ASCII form,
potentially removing further spelling variants for better matching.
At query time, only stages 1) and 4) are used. The query is normalized and
tokenized, and the resulting string is used for searching in the database.
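For illustration, these stages map onto the `ICUNameProcessor` API that this pull
request adds in `nominatim/tokenizer/icu_name_processor.py`. A minimal sketch,
assuming the `nominatim` package and PyICU are importable; the configuration path
is a placeholder:

```python
# Rough sketch of the import-time pipeline (stages 1, 3 and 4).
from pathlib import Path

from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.icu_name_processor import (ICUNameProcessor,
                                                    ICUNameProcessorRules)

rules = ICUNameProcessorRules(loader=ICURuleLoader(Path('legacy_icu_tokenizer.yaml')))
proc = ICUNameProcessor(rules)

norm = proc.get_normalized('Hauptstraße')               # stage 1: normalization
name_variants = proc.get_variants_ascii(norm)           # stages 3+4: variants, ASCII form
query_term = proc.get_search_normalized('Hauptstraße')  # normalization used for queries
```

Stage 2, the splitting of multi-name tags into individual full names, happens in
the name analyzer and is not part of this processor.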
### Configuration
The ICU tokenizer is configured using a YAML file whose location can be set via
`NOMINATIM_TOKENIZER_CONFIG`. The configuration is read on import and then
saved as part of the internal database status. Later changes to the variable
have no effect.
Here is an example configuration file:
``` yaml
normalization:
- ":: lower ()"
- "ß > 'ss'" # German szet is unimbigiously equal to double ss
transliteration:
- !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml
- ":: Ascii ()"
variants:
- language: de
words:
- ~haus => haus
- ~strasse -> str
- language: en
words:
- road -> rd
- bridge -> bdge,br,brdg,bri,brg
```
The configuration file contains three sections:
`normalization`, `transliteration`, `variants`.
The normalization and transliteration sections each must contain a list of
[ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html).
The rules are applied in the order in which they appear in the file.
You can also include additional rules from an external YAML file using the
`!include` tag. The included file must contain a valid YAML list of ICU rules
and may again include other files.
!!! warning
The ICU rule syntax contains special characters that conflict with the
YAML syntax. You should therefore always enclose the ICU rules in
double-quotes.
The variants section defines lists of replacements which create alternative
spellings of a name. To create the variants, a name is scanned from left to
right and the longest matching replacement is applied until the end of the
string is reached.
The variants section must contain a list of replacement groups. Each group
defines a set of properties that describes where the replacements are
applicable. In addition, the `words` section defines the list of replacements
to be made. The basic replacement description is of the form:
```
<source>[,<source>[...]] => <target>[,<target>[...]]
```
The left side contains one or more `source` terms to be replaced. The right side
lists one or more replacements. Each source is replaced with each replacement
term.
!!! tip
The source and target terms are internally normalized using the
normalization rules given in the configuration. This ensures that the
strings match as expected. In fact, it is better to use unnormalized
words in the configuration because then it is possible to change the
rules for normalization later without having to adapt the variant rules.
#### Decomposition
In its standard form, only full words match against the source. There
is a special notation to match the prefix and suffix of a word:
``` yaml
- ~strasse => str # matches "strasse" as full word and in suffix position
- hinter~ => hntr # matches "hinter" as full word and in prefix position
```
There is no facility to match a string in the middle of the word. The suffix
and prefix notation automatically trigger the decomposition mode: two variants
are created for each replacement, one with the replacement attached to the word
and one separate. So in the above example, the tokenization of "hauptstrasse" will
create the variants "hauptstr" and "haupt str". Similarly, the name "rote strasse"
triggers the variants "rote str" and "rotestr". By having decomposition work
both ways, it is sufficient to create the variants at index time. The variant
rules are not applied at query time.
To avoid automatic decomposition, use the '|' notation:
``` yaml
- ~strasse |=> str
```
simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str".
#### Initial and final terms
It is also possible to restrict replacements to the beginning and end of a
name:
``` yaml
- ^south => s # matches only at the beginning of the name
- road$ => rd # matches only at the end of the name
```
So the first example would trigger a replacement for "south 45th street" but
not for "the south beach restaurant".
#### Replacements vs. variants
The replacement syntax `source => target` works as a pure replacement. It changes
the name instead of creating a variant. To create an additional version, you'd
have to write `source => source,target`. As this is a frequent case, there is
a shortcut notation for it:
```
<source>[,<source>[...]] -> <target>[,<target>[...]]
```
The simple arrow causes an additional variant to be added. Note that
decomposition has an effect here on the source as well. So a rule
```yaml
- ~strasse -> str
```
means that for a word like `hauptstrasse` four variants are created:
`hauptstrasse`, `haupt strasse`, `hauptstr` and `haupt str`.
### Reconfiguration
Changing the configuration after the import is currently not possible, although
this feature may be added at a later time.


@ -20,6 +20,7 @@ pages:
- 'Update' : 'admin/Update.md'
- 'Deploy' : 'admin/Deployment.md'
- 'Customize Imports' : 'admin/Customization.md'
- 'Tokenizers' : 'admin/Tokenizers.md'
- 'Nominatim UI' : 'admin/Setup-Nominatim-UI.md'
- 'Advanced Installations' : 'admin/Advanced-Installations.md'
- 'Migration from older Versions' : 'admin/Migration.md'


@ -47,9 +47,7 @@ class Tokenizer
private function makeStandardWord($sTerm)
{
$sNorm = ' '.$this->oTransliterator->transliterate($sTerm).' ';
return trim(str_replace(CONST_Abbreviations[0], CONST_Abbreviations[1], $sNorm));
return trim($this->oTransliterator->transliterate(' '.$sTerm.' '));
}
@ -90,6 +88,7 @@ class Tokenizer
foreach ($aPhrases as $iPhrase => $oPhrase) {
$sNormQuery .= ','.$this->normalizeString($oPhrase->getPhrase());
$sPhrase = $this->makeStandardWord($oPhrase->getPhrase());
Debug::printVar('Phrase', $sPhrase);
if (strlen($sPhrase) > 0) {
$aWords = explode(' ', $sPhrase);
Tokenizer::addTokens($aTokens, $aWords);


@ -87,25 +87,48 @@ $$ LANGUAGE SQL IMMUTABLE STRICT;
--------------- private functions ----------------------------------------------
CREATE OR REPLACE FUNCTION getorcreate_term_id(lookup_term TEXT)
RETURNS INTEGER
CREATE OR REPLACE FUNCTION getorcreate_full_word(norm_term TEXT, lookup_terms TEXT[],
OUT full_token INT,
OUT partial_tokens INT[])
AS $$
DECLARE
return_id INTEGER;
partial_terms TEXT[] = '{}'::TEXT[];
term TEXT;
term_id INTEGER;
term_count INTEGER;
BEGIN
SELECT min(word_id), max(search_name_count) INTO return_id, term_count
FROM word WHERE word_token = lookup_term and class is null and type is null;
SELECT min(word_id) INTO full_token
FROM word WHERE word = norm_term and class is null and country_code is null;
IF return_id IS NULL THEN
return_id := nextval('seq_word');
INSERT INTO word (word_id, word_token, search_name_count)
VALUES (return_id, lookup_term, 0);
ELSEIF left(lookup_term, 1) = ' ' and term_count > {{ max_word_freq }} THEN
return_id := 0;
IF full_token IS NULL THEN
full_token := nextval('seq_word');
INSERT INTO word (word_id, word_token, word, search_name_count)
SELECT full_token, ' ' || lookup_term, norm_term, 0 FROM unnest(lookup_terms) as lookup_term;
END IF;
RETURN return_id;
FOR term IN SELECT unnest(string_to_array(unnest(lookup_terms), ' ')) LOOP
term := trim(term);
IF NOT (ARRAY[term] <@ partial_terms) THEN
partial_terms := partial_terms || term;
END IF;
END LOOP;
partial_tokens := '{}'::INT[];
FOR term IN SELECT unnest(partial_terms) LOOP
SELECT min(word_id), max(search_name_count) INTO term_id, term_count
FROM word WHERE word_token = term and class is null and country_code is null;
IF term_id IS NULL THEN
term_id := nextval('seq_word');
term_count := 0;
INSERT INTO word (word_id, word_token, search_name_count)
VALUES (term_id, term, 0);
END IF;
IF term_count < {{ max_word_freq }} THEN
partial_tokens := array_merge(partial_tokens, ARRAY[term_id]);
END IF;
END LOOP;
END;
$$
LANGUAGE plpgsql;
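For reference, the ICU name analyzer further down in this diff calls the new
function as sketched below; the DSN and the example terms are placeholders only.

```python
# Hedged sketch of how getorcreate_full_word() is invoked from Python
# (mirrors _compute_name_tokens() in the legacy_icu tokenizer below).
from nominatim.db.connection import connect

norm_name = 'hauptstrasse'                       # placeholder normalized name
variants = ['hauptstrasse', 'haupt strasse', 'hauptstr', 'haupt str']

with connect('dbname=nominatim') as conn:        # placeholder DSN
    with conn.cursor() as cur:
        cur.execute("SELECT (getorcreate_full_word(%s, %s)).*",
                    (norm_name, variants))
        full_token, partial_tokens = cur.fetchone()  # INT and INT[] of word ids
```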


@ -4,6 +4,7 @@ Helper functions for handling DB accesses.
import subprocess
import logging
import gzip
import io
from nominatim.db.connection import get_pg_env
from nominatim.errors import UsageError
@ -57,3 +58,49 @@ def execute_file(dsn, fname, ignore_errors=False, pre_code=None, post_code=None)
if ret != 0 or remain > 0:
raise UsageError("Failed to execute SQL file.")
# List of characters that need to be quoted for the copy command.
_SQL_TRANSLATION = {ord(u'\\') : u'\\\\',
ord(u'\t') : u'\\t',
ord(u'\n') : u'\\n'}
class CopyBuffer:
""" Data collector for the copy_from command.
"""
def __init__(self):
self.buffer = io.StringIO()
def __enter__(self):
return self
def __exit__(self, exc_type, exc_value, traceback):
if self.buffer is not None:
self.buffer.close()
def add(self, *data):
""" Add another row of data to the copy buffer.
"""
first = True
for column in data:
if first:
first = False
else:
self.buffer.write('\t')
if column is None:
self.buffer.write('\\N')
else:
self.buffer.write(str(column).translate(_SQL_TRANSLATION))
self.buffer.write('\n')
def copy_out(self, cur, table, columns=None):
""" Copy all collected data into the given table.
"""
if self.buffer.tell() > 0:
self.buffer.seek(0)
cur.copy_from(self.buffer, table, columns=columns)
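A short usage sketch, mirroring how the ICU tokenizer below fills the `word`
table; the DSN and the word counts are made up for illustration.

```python
# Hedged sketch: collect rows in a CopyBuffer and bulk-load them via COPY.
from nominatim.db.connection import connect
from nominatim.db.utils import CopyBuffer

word_counts = {'haupt': 42, 'str': 17}           # word_token -> frequency

with connect('dbname=nominatim') as conn:        # placeholder DSN
    with CopyBuffer() as copystr:
        for token, count in word_counts.items():
            copystr.add(token, count)            # one row per call; None becomes \N
        with conn.cursor() as cur:
            copystr.copy_out(cur, 'word',
                             columns=['word_token', 'search_name_count'])
    conn.commit()
```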


@ -0,0 +1,142 @@
"""
Processor for names that are imported into the database based on the
ICU library.
"""
from collections import defaultdict
import itertools
from icu import Transliterator
import datrie
from nominatim.db.properties import set_property, get_property
from nominatim.tokenizer import icu_variants as variants
DBCFG_IMPORT_NORM_RULES = "tokenizer_import_normalisation"
DBCFG_IMPORT_TRANS_RULES = "tokenizer_import_transliteration"
DBCFG_IMPORT_REPLACEMENTS = "tokenizer_import_replacements"
DBCFG_SEARCH_STD_RULES = "tokenizer_search_standardization"
class ICUNameProcessorRules:
""" Data object that saves the rules needed for the name processor.
The rules can either be initialised through an ICURuleLoader or
be loaded from a database when a connection is given.
"""
def __init__(self, loader=None, conn=None):
if loader is not None:
self.norm_rules = loader.get_normalization_rules()
self.trans_rules = loader.get_transliteration_rules()
self.replacements = loader.get_replacement_pairs()
self.search_rules = loader.get_search_rules()
elif conn is not None:
self.norm_rules = get_property(conn, DBCFG_IMPORT_NORM_RULES)
self.trans_rules = get_property(conn, DBCFG_IMPORT_TRANS_RULES)
self.replacements = \
variants.unpickle_variant_set(get_property(conn, DBCFG_IMPORT_REPLACEMENTS))
self.search_rules = get_property(conn, DBCFG_SEARCH_STD_RULES)
else:
assert False, "Parameter loader or conn required."
def save_rules(self, conn):
""" Save the rules in the property table of the given database.
The rules can be loaded again by handing a connection to
the constructor of the class.
"""
set_property(conn, DBCFG_IMPORT_NORM_RULES, self.norm_rules)
set_property(conn, DBCFG_IMPORT_TRANS_RULES, self.trans_rules)
set_property(conn, DBCFG_IMPORT_REPLACEMENTS,
variants.pickle_variant_set(self.replacements))
set_property(conn, DBCFG_SEARCH_STD_RULES, self.search_rules)
class ICUNameProcessor:
""" Collects the different transformation rules for normalisation of names
and provides the functions to apply the transformations.
"""
def __init__(self, rules):
self.normalizer = Transliterator.createFromRules("icu_normalization",
rules.norm_rules)
self.to_ascii = Transliterator.createFromRules("icu_to_ascii",
rules.trans_rules +
";[:Space:]+ > ' '")
self.search = Transliterator.createFromRules("icu_search",
rules.search_rules)
# Intermediate reorder by source. Also compute required character set.
immediate = defaultdict(list)
chars = set()
for variant in rules.replacements:
if variant.source[-1] == ' ' and variant.replacement[-1] == ' ':
replstr = variant.replacement[:-1]
else:
replstr = variant.replacement
immediate[variant.source].append(replstr)
chars.update(variant.source)
# Then copy to datrie
self.replacements = datrie.Trie(''.join(chars))
for src, repllist in immediate.items():
self.replacements[src] = repllist
def get_normalized(self, name):
""" Normalize the given name, i.e. remove all elements not relevant
for search.
"""
return self.normalizer.transliterate(name).strip()
def get_variants_ascii(self, norm_name):
""" Compute the spelling variants for the given normalized name
and transliterate the result.
"""
baseform = '^ ' + norm_name + ' ^'
partials = ['']
startpos = 0
pos = 0
force_space = False
while pos < len(baseform):
full, repl = self.replacements.longest_prefix_item(baseform[pos:],
(None, None))
if full is not None:
done = baseform[startpos:pos]
partials = [v + done + r
for v, r in itertools.product(partials, repl)
if not force_space or r.startswith(' ')]
if len(partials) > 128:
# If too many variants are produced, they are unlikely
# to be helpful. Only use the original term.
startpos = 0
break
startpos = pos + len(full)
if full[-1] == ' ':
startpos -= 1
force_space = True
pos = startpos
else:
pos += 1
force_space = False
results = set()
if startpos == 0:
trans_name = self.to_ascii.transliterate(norm_name).strip()
if trans_name:
results.add(trans_name)
else:
for variant in partials:
name = variant + baseform[startpos:]
trans_name = self.to_ascii.transliterate(name[1:-1]).strip()
if trans_name:
results.add(trans_name)
return list(results)
def get_search_normalized(self, name):
""" Return the normalized version of the name (including transliteration)
to be applied at search time.
"""
return self.search.transliterate(' ' + name + ' ').strip()
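The rules object is meant to be built once from the configuration at import time,
persisted, and restored from the database later. A rough sketch of both
initialisation paths, with DSN and configuration path as placeholders:

```python
# Hedged sketch of the two ways to initialise ICUNameProcessorRules.
from pathlib import Path

from nominatim.db.connection import connect
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.icu_name_processor import ICUNameProcessorRules

# At import time: build the rules from the YAML configuration ...
rules = ICUNameProcessorRules(loader=ICURuleLoader(Path('legacy_icu_tokenizer.yaml')))

with connect('dbname=nominatim') as conn:        # placeholder DSN
    rules.save_rules(conn)                       # ... and store them as DB properties.

# Later, e.g. during updates: restore exactly the same rules from the database.
with connect('dbname=nominatim') as conn:
    restored = ICUNameProcessorRules(conn=conn)
```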


@ -0,0 +1,246 @@
"""
Helper class to create ICU rules from a configuration file.
"""
import io
import logging
import itertools
from pathlib import Path
import re
import yaml
from icu import Transliterator
from nominatim.errors import UsageError
import nominatim.tokenizer.icu_variants as variants
LOG = logging.getLogger()
def _flatten_yaml_list(content):
if not content:
return []
if not isinstance(content, list):
raise UsageError("List expected in ICU yaml configuration.")
output = []
for ele in content:
if isinstance(ele, list):
output.extend(_flatten_yaml_list(ele))
else:
output.append(ele)
return output
class VariantRule:
""" Saves a single variant expansion.
An expansion consists of the normalized replacement term and
a dictionary of properties that describe when the expansion applies.
"""
def __init__(self, replacement, properties):
self.replacement = replacement
self.properties = properties or {}
class ICURuleLoader:
""" Compiler for ICU rules from a tokenizer configuration file.
"""
def __init__(self, configfile):
self.configfile = configfile
self.variants = set()
if configfile.suffix == '.yaml':
self._load_from_yaml()
else:
raise UsageError("Unknown format of tokenizer configuration.")
def get_search_rules(self):
""" Return the ICU rules to be used during search.
The rules combine normalization and transliteration.
"""
# First apply the normalization rules.
rules = io.StringIO()
rules.write(self.normalization_rules)
# Then add transliteration.
rules.write(self.transliteration_rules)
return rules.getvalue()
def get_normalization_rules(self):
""" Return rules for normalisation of a term.
"""
return self.normalization_rules
def get_transliteration_rules(self):
""" Return the rules for converting a string into its asciii representation.
"""
return self.transliteration_rules
def get_replacement_pairs(self):
""" Return the list of possible compound decompositions with
application of abbreviations included.
The result is a set of ICUVariant tuples, each consisting of the source
sequence, its replacement and the properties under which it applies.
"""
return self.variants
def _yaml_include_representer(self, loader, node):
value = loader.construct_scalar(node)
if Path(value).is_absolute():
content = Path(value).read_text()
else:
content = (self.configfile.parent / value).read_text()
return yaml.safe_load(content)
def _load_from_yaml(self):
yaml.add_constructor('!include', self._yaml_include_representer,
Loader=yaml.SafeLoader)
rules = yaml.safe_load(self.configfile.read_text())
self.normalization_rules = self._cfg_to_icu_rules(rules, 'normalization')
self.transliteration_rules = self._cfg_to_icu_rules(rules, 'transliteration')
self._parse_variant_list(self._get_section(rules, 'variants'))
def _get_section(self, rules, section):
""" Get the section named 'section' from the rules. If the section does
not exist, raise a usage error with a meaningful message.
"""
if section not in rules:
LOG.fatal("Section '%s' not found in tokenizer config '%s'.",
section, str(self.configfile))
raise UsageError("Syntax error in tokenizer configuration file.")
return rules[section]
def _cfg_to_icu_rules(self, rules, section):
""" Load an ICU ruleset from the given section. If the section is a
simple string, it is interpreted as a file name and the rules are
loaded verbatim from the given file. The filename is expected to be
relative to the tokenizer rule file. If the section is a list then
each line is assumed to be a rule. All rules are concatenated and returned.
"""
content = self._get_section(rules, section)
if content is None:
return ''
return ';'.join(_flatten_yaml_list(content)) + ';'
def _parse_variant_list(self, rules):
self.variants.clear()
if not rules:
return
rules = _flatten_yaml_list(rules)
vmaker = _VariantMaker(self.normalization_rules)
properties = []
for section in rules:
# Create the property field and deduplicate against existing
# instances.
props = variants.ICUVariantProperties.from_rules(section)
for existing in properties:
if existing == props:
props = existing
break
else:
properties.append(props)
for rule in (section.get('words') or []):
self.variants.update(vmaker.compute(rule, props))
class _VariantMaker:
""" Generater for all necessary ICUVariants from a single variant rule.
All text in rules is normalized to make sure the variants match later.
"""
def __init__(self, norm_rules):
self.norm = Transliterator.createFromRules("rule_loader_normalization",
norm_rules)
def compute(self, rule, props):
""" Generator for all ICUVariant tuples from a single variant rule.
"""
parts = re.split(r'(\|)?([=-])>', rule)
if len(parts) != 4:
raise UsageError("Syntax error in variant rule: " + rule)
decompose = parts[1] is None
src_terms = [self._parse_variant_word(t) for t in parts[0].split(',')]
repl_terms = (self.norm.transliterate(t.strip()) for t in parts[3].split(','))
# If the source should be kept, add a 1:1 replacement
if parts[2] == '-':
for src in src_terms:
if src:
for froms, tos in _create_variants(*src, src[0], decompose):
yield variants.ICUVariant(froms, tos, props)
for src, repl in itertools.product(src_terms, repl_terms):
if src and repl:
for froms, tos in _create_variants(*src, repl, decompose):
yield variants.ICUVariant(froms, tos, props)
def _parse_variant_word(self, name):
name = name.strip()
match = re.fullmatch(r'([~^]?)([^~$^]*)([~$]?)', name)
if match is None or (match.group(1) == '~' and match.group(3) == '~'):
raise UsageError("Invalid variant word descriptor '{}'".format(name))
norm_name = self.norm.transliterate(match.group(2))
if not norm_name:
return None
return norm_name, match.group(1), match.group(3)
_FLAG_MATCH = {'^': '^ ',
'$': ' ^',
'': ' '}
def _create_variants(src, preflag, postflag, repl, decompose):
if preflag == '~':
postfix = _FLAG_MATCH[postflag]
# suffix decomposition
src = src + postfix
repl = repl + postfix
yield src, repl
yield ' ' + src, ' ' + repl
if decompose:
yield src, ' ' + repl
yield ' ' + src, repl
elif postflag == '~':
# prefix decomposition
prefix = _FLAG_MATCH[preflag]
src = prefix + src
repl = prefix + repl
yield src, repl
yield src + ' ', repl + ' '
if decompose:
yield src, repl + ' '
yield src + ' ', repl
else:
prefix = _FLAG_MATCH[preflag]
postfix = _FLAG_MATCH[postflag]
yield prefix + src + postfix, prefix + repl + postfix
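To see which source/replacement pairs a single rule expands to, the loader can be
exercised on its own. A minimal sketch, assuming the `nominatim` package from this
PR and PyICU are installed; the configuration content is illustrative only.

```python
# Hedged sketch: expand a single suffix rule and list the resulting variants.
import tempfile
from pathlib import Path

from nominatim.tokenizer.icu_rule_loader import ICURuleLoader

CONFIG = """\
normalization:
    - ":: lower ()"
transliteration:
    - ":: Ascii ()"
variants:
    - words:
        - ~strasse -> str
"""

with tempfile.TemporaryDirectory() as tmpdir:
    cfgfile = Path(tmpdir) / 'icu_tokenizer.yaml'
    cfgfile.write_text(CONFIG)
    loader = ICURuleLoader(cfgfile)
    # The '~' marker produces attached and detached spellings; '->' additionally
    # keeps the unabbreviated 'strasse' forms as variants of their own.
    for var in sorted(loader.get_replacement_pairs(),
                      key=lambda v: (v.source, v.replacement)):
        print(repr(var.source), '=>', repr(var.replacement))
```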


@ -0,0 +1,58 @@
"""
Data structures for saving variant expansions for ICU tokenizer.
"""
from collections import namedtuple
import json
_ICU_VARIANT_PORPERTY_FIELDS = ['lang']
class ICUVariantProperties(namedtuple('_ICUVariantProperties', _ICU_VARIANT_PORPERTY_FIELDS,
defaults=(None, )*len(_ICU_VARIANT_PORPERTY_FIELDS))):
""" Data container for saving properties that describe when a variant
should be applied.
Property instances are hashable.
"""
@classmethod
def from_rules(cls, _):
""" Create a new property type from a generic dictionary.
The function only takes into account the properties that are
understood presently and ignores all others.
"""
return cls(lang=None)
ICUVariant = namedtuple('ICUVariant', ['source', 'replacement', 'properties'])
def pickle_variant_set(variants):
""" Serializes an iterable of variant rules to a string.
"""
# Create a list of property sets so they don't need to be duplicated.
properties = {}
pid = 1
for variant in variants:
if variant.properties not in properties:
properties[variant.properties] = pid
pid += 1
# Convert the variants into a simple list.
variants = [(v.source, v.replacement, properties[v.properties]) for v in variants]
# Convert everything to JSON.
return json.dumps({'properties': {v: k._asdict() for k, v in properties.items()},
'variants': variants})
def unpickle_variant_set(variant_string):
""" Deserializes a variant string that was previously created with
pickle_variant_set() into a set of ICUVariants.
"""
data = json.loads(variant_string)
properties = {int(k): ICUVariantProperties(**v) for k, v in data['properties'].items()}
return set((ICUVariant(src, repl, properties[pid]) for src, repl, pid in data['variants']))
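A small round-trip sketch for the two helpers above; the variant values are made
up for illustration.

```python
# Hedged sketch: serialize a variant set to JSON and restore it unchanged.
from nominatim.tokenizer import icu_variants as variants

props = variants.ICUVariantProperties.from_rules({})
rules = {variants.ICUVariant(' strasse ', ' str ', props),
         variants.ICUVariant('strasse ', 'str ', props)}

blob = variants.pickle_variant_set(rules)        # plain JSON string
assert variants.unpickle_variant_set(blob) == rules
```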


@ -3,26 +3,23 @@ Tokenizer implementing normalisation as used before Nominatim 4 but using
libICU instead of the PostgreSQL module.
"""
from collections import Counter
import functools
import io
import itertools
import json
import logging
import re
from textwrap import dedent
from pathlib import Path
from icu import Transliterator
import psycopg2.extras
from nominatim.db.connection import connect
from nominatim.db.properties import set_property, get_property
from nominatim.db.utils import CopyBuffer
from nominatim.db.sql_preprocessor import SQLPreprocessor
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.icu_name_processor import ICUNameProcessor, ICUNameProcessorRules
DBCFG_NORMALIZATION = "tokenizer_normalization"
DBCFG_MAXWORDFREQ = "tokenizer_maxwordfreq"
DBCFG_TRANSLITERATION = "tokenizer_transliteration"
DBCFG_ABBREVIATIONS = "tokenizer_abbreviations"
DBCFG_TERM_NORMALIZATION = "tokenizer_term_normalization"
LOG = logging.getLogger()
@ -41,9 +38,9 @@ class LegacyICUTokenizer:
def __init__(self, dsn, data_dir):
self.dsn = dsn
self.data_dir = data_dir
self.normalization = None
self.transliteration = None
self.abbreviations = None
self.naming_rules = None
self.term_normalization = None
self.max_word_frequency = None
def init_new_db(self, config, init_db=True):
@ -55,14 +52,14 @@ class LegacyICUTokenizer:
if config.TOKENIZER_CONFIG:
cfgfile = Path(config.TOKENIZER_CONFIG)
else:
cfgfile = config.config_dir / 'legacy_icu_tokenizer.json'
cfgfile = config.config_dir / 'legacy_icu_tokenizer.yaml'
rules = json.loads(cfgfile.read_text())
self.transliteration = ';'.join(rules['normalization']) + ';'
self.abbreviations = rules["abbreviations"]
self.normalization = config.TERM_NORMALIZATION
loader = ICURuleLoader(cfgfile)
self.naming_rules = ICUNameProcessorRules(loader=loader)
self.term_normalization = config.TERM_NORMALIZATION
self.max_word_frequency = config.MAX_WORD_FREQUENCY
self._install_php(config)
self._install_php(config.lib_dir.php)
self._save_config(config)
if init_db:
@ -74,9 +71,9 @@ class LegacyICUTokenizer:
""" Initialise the tokenizer from the project directory.
"""
with connect(self.dsn) as conn:
self.normalization = get_property(conn, DBCFG_NORMALIZATION)
self.transliteration = get_property(conn, DBCFG_TRANSLITERATION)
self.abbreviations = json.loads(get_property(conn, DBCFG_ABBREVIATIONS))
self.naming_rules = ICUNameProcessorRules(conn=conn)
self.term_normalization = get_property(conn, DBCFG_TERM_NORMALIZATION)
self.max_word_frequency = get_property(conn, DBCFG_MAXWORDFREQ)
def finalize_import(self, config):
@ -103,9 +100,7 @@ class LegacyICUTokenizer:
"""
self.init_from_project()
if self.normalization is None\
or self.transliteration is None\
or self.abbreviations is None:
if self.naming_rules is None:
return "Configuration for tokenizer 'legacy_icu' are missing."
return None
@ -126,26 +121,20 @@ class LegacyICUTokenizer:
Analyzers are not thread-safe. You need to instantiate one per thread.
"""
norm = Transliterator.createFromRules("normalizer", self.normalization)
trans = Transliterator.createFromRules("trans", self.transliteration)
return LegacyICUNameAnalyzer(self.dsn, norm, trans, self.abbreviations)
return LegacyICUNameAnalyzer(self.dsn, ICUNameProcessor(self.naming_rules))
def _install_php(self, config):
# pylint: disable=missing-format-attribute
def _install_php(self, phpdir):
""" Install the php script for the tokenizer.
"""
abbr_inverse = list(zip(*self.abbreviations))
php_file = self.data_dir / "tokenizer.php"
php_file.write_text(dedent("""\
<?php
@define('CONST_Max_Word_Frequency', {1.MAX_WORD_FREQUENCY});
@define('CONST_Term_Normalization_Rules', "{0.normalization}");
@define('CONST_Transliteration', "{0.transliteration}");
@define('CONST_Abbreviations', array(array('{2}'), array('{3}')));
require_once('{1.lib_dir.php}/tokenizer/legacy_icu_tokenizer.php');
""".format(self, config,
"','".join(abbr_inverse[0]),
"','".join(abbr_inverse[1]))))
@define('CONST_Max_Word_Frequency', {0.max_word_frequency});
@define('CONST_Term_Normalization_Rules', "{0.term_normalization}");
@define('CONST_Transliteration', "{0.naming_rules.search_rules}");
require_once('{1}/tokenizer/legacy_icu_tokenizer.php');
""".format(self, phpdir)))
def _save_config(self, config):
@ -153,10 +142,10 @@ class LegacyICUTokenizer:
database as database properties.
"""
with connect(self.dsn) as conn:
set_property(conn, DBCFG_NORMALIZATION, self.normalization)
self.naming_rules.save_rules(conn)
set_property(conn, DBCFG_MAXWORDFREQ, config.MAX_WORD_FREQUENCY)
set_property(conn, DBCFG_TRANSLITERATION, self.transliteration)
set_property(conn, DBCFG_ABBREVIATIONS, json.dumps(self.abbreviations))
set_property(conn, DBCFG_TERM_NORMALIZATION, self.term_normalization)
def _init_db_tables(self, config):
@ -172,25 +161,30 @@ class LegacyICUTokenizer:
# get partial words and their frequencies
words = Counter()
with self.name_analyzer() as analyzer:
with conn.cursor(name="words") as cur:
cur.execute("SELECT svals(name) as v, count(*) FROM place GROUP BY v")
name_proc = ICUNameProcessor(self.naming_rules)
with conn.cursor(name="words") as cur:
cur.execute(""" SELECT v, count(*) FROM
(SELECT svals(name) as v FROM place)x
WHERE length(v) < 75 GROUP BY v""")
for name, cnt in cur:
term = analyzer.make_standard_word(name)
if term:
for word in term.split():
words[word] += cnt
for name, cnt in cur:
terms = set()
for word in name_proc.get_variants_ascii(name_proc.get_normalized(name)):
if ' ' in word:
terms.update(word.split())
for term in terms:
words[term] += cnt
# copy them back into the word table
copystr = io.StringIO(''.join(('{}\t{}\n'.format(*args) for args in words.items())))
with CopyBuffer() as copystr:
for args in words.items():
copystr.add(*args)
with conn.cursor() as cur:
copystr.seek(0)
cur.copy_from(copystr, 'word', columns=['word_token', 'search_name_count'])
cur.execute("""UPDATE word SET word_id = nextval('seq_word')
WHERE word_id is null""")
with conn.cursor() as cur:
copystr.copy_out(cur, 'word',
columns=['word_token', 'search_name_count'])
cur.execute("""UPDATE word SET word_id = nextval('seq_word')
WHERE word_id is null""")
conn.commit()
@ -202,12 +196,10 @@ class LegacyICUNameAnalyzer:
normalization.
"""
def __init__(self, dsn, normalizer, transliterator, abbreviations):
def __init__(self, dsn, name_proc):
self.conn = connect(dsn).connection
self.conn.autocommit = True
self.normalizer = normalizer
self.transliterator = transliterator
self.abbreviations = abbreviations
self.name_processor = name_proc
self._cache = _TokenCache()
@ -228,7 +220,7 @@ class LegacyICUNameAnalyzer:
self.conn = None
def get_word_token_info(self, conn, words):
def get_word_token_info(self, words):
""" Return token information for the given list of words.
If a word starts with # it is assumed to be a full name
otherwise it is a partial name.
@ -242,11 +234,11 @@ class LegacyICUNameAnalyzer:
tokens = {}
for word in words:
if word.startswith('#'):
tokens[word] = ' ' + self.make_standard_word(word[1:])
tokens[word] = ' ' + self.name_processor.get_search_normalized(word[1:])
else:
tokens[word] = self.make_standard_word(word)
tokens[word] = self.name_processor.get_search_normalized(word)
with conn.cursor() as cur:
with self.conn.cursor() as cur:
cur.execute("""SELECT word_token, word_id
FROM word, (SELECT unnest(%s::TEXT[]) as term) t
WHERE word_token = t.term
@ -254,15 +246,9 @@ class LegacyICUNameAnalyzer:
(list(tokens.values()), ))
ids = {r[0]: r[1] for r in cur}
return [(k, v, ids[v]) for k, v in tokens.items()]
return [(k, v, ids.get(v, None)) for k, v in tokens.items()]
def normalize(self, phrase):
""" Normalize the given phrase, i.e. remove all properties that
are irrelevant for search.
"""
return self.normalizer.transliterate(phrase)
@staticmethod
def normalize_postcode(postcode):
""" Convert the postcode to a standardized form.
@ -273,34 +259,18 @@ class LegacyICUNameAnalyzer:
return postcode.strip().upper()
@functools.lru_cache(maxsize=1024)
def make_standard_word(self, name):
""" Create the normalised version of the input.
"""
norm = ' ' + self.transliterator.transliterate(name) + ' '
for full, abbr in self.abbreviations:
if full in norm:
norm = norm.replace(full, abbr)
return norm.strip()
def _make_standard_hnr(self, hnr):
""" Create a normalised version of a housenumber.
This function takes minor shortcuts on transliteration.
"""
if hnr.isdigit():
return hnr
return self.transliterator.transliterate(hnr)
return self.name_processor.get_search_normalized(hnr)
def update_postcodes_from_db(self):
""" Update postcode tokens in the word table from the location_postcode
table.
"""
to_delete = []
copystr = io.StringIO()
with self.conn.cursor() as cur:
# This finds us the rows in location_postcode and word that are
# missing in the other table.
@ -313,32 +283,31 @@ class LegacyICUNameAnalyzer:
ON pc = word) x
WHERE pc is null or word is null""")
for postcode, word in cur:
if postcode is None:
to_delete.append(word)
else:
copystr.write(postcode)
copystr.write('\t ')
copystr.write(self.transliterator.transliterate(postcode))
copystr.write('\tplace\tpostcode\t0\n')
with CopyBuffer() as copystr:
for postcode, word in cur:
if postcode is None:
to_delete.append(word)
else:
copystr.add(
postcode,
' ' + self.name_processor.get_search_normalized(postcode),
'place', 'postcode', 0)
if to_delete:
cur.execute("""DELETE FROM WORD
WHERE class ='place' and type = 'postcode'
and word = any(%s)
""", (to_delete, ))
if to_delete:
cur.execute("""DELETE FROM WORD
WHERE class ='place' and type = 'postcode'
and word = any(%s)
""", (to_delete, ))
if copystr.getvalue():
copystr.seek(0)
cur.copy_from(copystr, 'word',
columns=['word', 'word_token', 'class', 'type',
'search_name_count'])
copystr.copy_out(cur, 'word',
columns=['word', 'word_token', 'class', 'type',
'search_name_count'])
def update_special_phrases(self, phrases, should_replace):
""" Replace the search index for special phrases with the new phrases.
"""
norm_phrases = set(((self.normalize(p[0]), p[1], p[2], p[3])
norm_phrases = set(((self.name_processor.get_normalized(p[0]), p[1], p[2], p[3])
for p in phrases))
with self.conn.cursor() as cur:
@ -350,54 +319,64 @@ class LegacyICUNameAnalyzer:
for label, cls, typ, oper in cur:
existing_phrases.add((label, cls, typ, oper or '-'))
to_add = norm_phrases - existing_phrases
to_delete = existing_phrases - norm_phrases
if to_add:
copystr = io.StringIO()
for word, cls, typ, oper in to_add:
term = self.make_standard_word(word)
if term:
copystr.write(word)
copystr.write('\t ')
copystr.write(term)
copystr.write('\t')
copystr.write(cls)
copystr.write('\t')
copystr.write(typ)
copystr.write('\t')
copystr.write(oper if oper in ('in', 'near') else '\\N')
copystr.write('\t0\n')
copystr.seek(0)
cur.copy_from(copystr, 'word',
columns=['word', 'word_token', 'class', 'type',
'operator', 'search_name_count'])
if to_delete and should_replace:
psycopg2.extras.execute_values(
cur,
""" DELETE FROM word USING (VALUES %s) as v(name, in_class, in_type, op)
WHERE word = name and class = in_class and type = in_type
and ((op = '-' and operator is null) or op = operator)""",
to_delete)
added = self._add_special_phrases(cur, norm_phrases, existing_phrases)
if should_replace:
deleted = self._remove_special_phrases(cur, norm_phrases,
existing_phrases)
else:
deleted = 0
LOG.info("Total phrases: %s. Added: %s. Deleted: %s",
len(norm_phrases), len(to_add), len(to_delete))
len(norm_phrases), added, deleted)
def _add_special_phrases(self, cursor, new_phrases, existing_phrases):
""" Add all phrases to the database that are not yet there.
"""
to_add = new_phrases - existing_phrases
added = 0
with CopyBuffer() as copystr:
for word, cls, typ, oper in to_add:
term = self.name_processor.get_search_normalized(word)
if term:
copystr.add(word, ' ' + term, cls, typ,
oper if oper in ('in', 'near') else None, 0)
added += 1
copystr.copy_out(cursor, 'word',
columns=['word', 'word_token', 'class', 'type',
'operator', 'search_name_count'])
return added
@staticmethod
def _remove_special_phrases(cursor, new_phrases, existing_phrases):
""" Remove all phrases from the databse that are no longer in the
new phrase list.
"""
to_delete = existing_phrases - new_phrases
if to_delete:
psycopg2.extras.execute_values(
cursor,
""" DELETE FROM word USING (VALUES %s) as v(name, in_class, in_type, op)
WHERE word = name and class = in_class and type = in_type
and ((op = '-' and operator is null) or op = operator)""",
to_delete)
return len(to_delete)
def add_country_names(self, country_code, names):
""" Add names for the given country to the search index.
"""
full_names = set((self.make_standard_word(n) for n in names))
full_names.discard('')
self._add_normalized_country_names(country_code, full_names)
word_tokens = set()
for name in self._compute_full_names(names):
if name:
word_tokens.add(' ' + self.name_processor.get_search_normalized(name))
def _add_normalized_country_names(self, country_code, names):
""" Add names for the given country to the search index.
"""
word_tokens = set((' ' + name for name in names))
with self.conn.cursor() as cur:
# Get existing names
cur.execute("SELECT word_token FROM word WHERE country_code = %s",
@ -423,14 +402,13 @@ class LegacyICUNameAnalyzer:
names = place.get('name')
if names:
full_names = self._compute_full_names(names)
fulls, partials = self._compute_name_tokens(names)
token_info.add_names(self.conn, full_names)
token_info.add_names(fulls, partials)
country_feature = place.get('country_feature')
if country_feature and re.fullmatch(r'[A-Za-z][A-Za-z]', country_feature):
self._add_normalized_country_names(country_feature.lower(),
full_names)
self.add_country_names(country_feature.lower(), names)
address = place.get('address')
@ -443,38 +421,65 @@ class LegacyICUNameAnalyzer:
elif key in ('housenumber', 'streetnumber', 'conscriptionnumber'):
hnrs.append(value)
elif key == 'street':
token_info.add_street(self.conn, self.make_standard_word(value))
token_info.add_street(*self._compute_name_tokens({'name': value}))
elif key == 'place':
token_info.add_place(self.conn, self.make_standard_word(value))
token_info.add_place(*self._compute_name_tokens({'name': value}))
elif not key.startswith('_') and \
key not in ('country', 'full'):
addr_terms.append((key, self.make_standard_word(value)))
addr_terms.append((key, *self._compute_name_tokens({'name': value})))
if hnrs:
hnrs = self._split_housenumbers(hnrs)
token_info.add_housenumbers(self.conn, [self._make_standard_hnr(n) for n in hnrs])
if addr_terms:
token_info.add_address_terms(self.conn, addr_terms)
token_info.add_address_terms(addr_terms)
return token_info.data
def _compute_full_names(self, names):
def _compute_name_tokens(self, names):
""" Computes the full name and partial name tokens for the given
dictionary of names.
"""
full_names = self._compute_full_names(names)
full_tokens = set()
partial_tokens = set()
for name in full_names:
norm_name = self.name_processor.get_normalized(name)
full, part = self._cache.names.get(norm_name, (None, None))
if full is None:
variants = self.name_processor.get_variants_ascii(norm_name)
if not variants:
continue
with self.conn.cursor() as cur:
cur.execute("SELECT (getorcreate_full_word(%s, %s)).*",
(norm_name, variants))
full, part = cur.fetchone()
self._cache.names[norm_name] = (full, part)
full_tokens.add(full)
partial_tokens.update(part)
return full_tokens, partial_tokens
@staticmethod
def _compute_full_names(names):
""" Return the set of all full name word ids to be used with the
given dictionary of names.
"""
full_names = set()
for name in (n for ns in names.values() for n in re.split('[;,]', ns)):
word = self.make_standard_word(name)
if word:
full_names.add(word)
for name in (n.strip() for ns in names.values() for n in re.split('[;,]', ns)):
if name:
full_names.add(name)
brace_split = name.split('(', 2)
if len(brace_split) > 1:
word = self.make_standard_word(brace_split[0])
if word:
full_names.add(word)
brace_idx = name.find('(')
if brace_idx >= 0:
full_names.add(name[:brace_idx].strip())
return full_names
@ -486,7 +491,7 @@ class LegacyICUNameAnalyzer:
postcode = self.normalize_postcode(postcode)
if postcode not in self._cache.postcodes:
term = self.make_standard_word(postcode)
term = self.name_processor.get_search_normalized(postcode)
if not term:
return
@ -502,6 +507,7 @@ class LegacyICUNameAnalyzer:
""", (' ' + term, postcode))
self._cache.postcodes.add(postcode)
@staticmethod
def _split_housenumbers(hnrs):
if len(hnrs) > 1 or ',' in hnrs[0] or ';' in hnrs[0]:
@ -524,7 +530,7 @@ class _TokenInfo:
""" Collect token information to be sent back to the database.
"""
def __init__(self, cache):
self.cache = cache
self._cache = cache
self.data = {}
@staticmethod
@ -532,86 +538,44 @@ class _TokenInfo:
return '{%s}' % ','.join((str(s) for s in tokens))
def add_names(self, conn, names):
def add_names(self, fulls, partials):
""" Adds token information for the normalised names.
"""
# Start with all partial names
terms = set((part for ns in names for part in ns.split()))
# Add the full names
terms.update((' ' + n for n in names))
self.data['names'] = self._mk_array(self.cache.get_term_tokens(conn, terms))
self.data['names'] = self._mk_array(itertools.chain(fulls, partials))
def add_housenumbers(self, conn, hnrs):
""" Extract housenumber information from a list of normalised
housenumbers.
"""
self.data['hnr_tokens'] = self._mk_array(self.cache.get_hnr_tokens(conn, hnrs))
self.data['hnr_tokens'] = self._mk_array(self._cache.get_hnr_tokens(conn, hnrs))
self.data['hnr'] = ';'.join(hnrs)
def add_street(self, conn, street):
def add_street(self, fulls, _):
""" Add addr:street match terms.
"""
if not street:
return
term = ' ' + street
tid = self.cache.names.get(term)
if tid is None:
with conn.cursor() as cur:
cur.execute("""SELECT word_id FROM word
WHERE word_token = %s
and class is null and type is null""",
(term, ))
if cur.rowcount > 0:
tid = cur.fetchone()[0]
self.cache.names[term] = tid
if tid is not None:
self.data['street'] = '{%d}' % tid
if fulls:
self.data['street'] = self._mk_array(fulls)
def add_place(self, conn, place):
def add_place(self, fulls, partials):
""" Add addr:place search and match terms.
"""
if not place:
return
partial_ids = self.cache.get_term_tokens(conn, place.split())
tid = self.cache.get_term_tokens(conn, [' ' + place])
self.data['place_search'] = self._mk_array(itertools.chain(partial_ids, tid))
self.data['place_match'] = '{%s}' % tid[0]
if fulls:
self.data['place_search'] = self._mk_array(itertools.chain(fulls, partials))
self.data['place_match'] = self._mk_array(fulls)
def add_address_terms(self, conn, terms):
def add_address_terms(self, terms):
""" Add additional address terms.
"""
tokens = {}
for key, value in terms:
if not value:
continue
partial_ids = self.cache.get_term_tokens(conn, value.split())
term = ' ' + value
tid = self.cache.names.get(term)
if tid is None:
with conn.cursor() as cur:
cur.execute("""SELECT word_id FROM word
WHERE word_token = %s
and class is null and type is null""",
(term, ))
if cur.rowcount > 0:
tid = cur.fetchone()[0]
self.cache.names[term] = tid
tokens[key] = [self._mk_array(partial_ids),
'{%s}' % ('' if tid is None else str(tid))]
for key, fulls, partials in terms:
if fulls:
tokens[key] = [self._mk_array(itertools.chain(fulls, partials)),
self._mk_array(fulls)]
if tokens:
self.data['addr'] = tokens
@ -629,32 +593,6 @@ class _TokenCache:
self.housenumbers = {}
def get_term_tokens(self, conn, terms):
""" Get token ids for a list of terms, looking them up in the database
if necessary.
"""
tokens = []
askdb = []
for term in terms:
token = self.names.get(term)
if token is None:
askdb.append(term)
elif token != 0:
tokens.append(token)
if askdb:
with conn.cursor() as cur:
cur.execute("SELECT term, getorcreate_term_id(term) FROM unnest(%s) as term",
(askdb, ))
for term, tid in cur:
self.names[term] = tid
if tid != 0:
tokens.append(tid)
return tokens
def get_hnr_tokens(self, conn, terms):
""" Get token ids for a list of housenumbers, looking them up in the
database if necessary.


@ -271,8 +271,7 @@ class LegacyNameAnalyzer:
self.conn = None
@staticmethod
def get_word_token_info(conn, words):
def get_word_token_info(self, words):
""" Return token information for the given list of words.
If a word starts with # it is assumed to be a full name
otherwise it is a partial name.
@ -283,7 +282,7 @@ class LegacyNameAnalyzer:
The function is used for testing and debugging only
and not necessarily efficient.
"""
with conn.cursor() as cur:
with self.conn.cursor() as cur:
cur.execute("""SELECT t.term, word_token, word_id
FROM word, (SELECT unnest(%s::TEXT[]) as term) t
WHERE word_token = (CASE
@ -404,7 +403,7 @@ class LegacyNameAnalyzer:
FROM unnest(%s)n) y
WHERE NOT EXISTS(SELECT * FROM word
WHERE word_token = lookup_token and country_code = %s))
""", (country_code, names, country_code))
""", (country_code, list(names.values()), country_code))
def process_place(self, place):
@ -422,7 +421,7 @@ class LegacyNameAnalyzer:
country_feature = place.get('country_feature')
if country_feature and re.fullmatch(r'[A-Za-z][A-Za-z]', country_feature):
self.add_country_names(country_feature.lower(), list(names.values()))
self.add_country_names(country_feature.lower(), names)
address = place.get('address')


@ -272,15 +272,15 @@ def create_country_names(conn, tokenizer, languages=None):
with tokenizer.name_analyzer() as analyzer:
for code, name in cur:
names = [code]
names = {'countrycode' : code}
if code == 'gb':
names.append('UK')
names['short_name'] = 'UK'
if code == 'us':
names.append('United States')
names['short_name'] = 'United States'
# country names (only in languages as provided)
if name:
names.extend((v for k, v in name.items() if _include_key(k)))
names.update(((k, v) for k, v in name.items() if _include_key(k)))
analyzer.add_country_names(code, names)

File diff suppressed because it is too large


@ -0,0 +1,24 @@
- "[𞥐𐒠߀𖭐꤀𖩠𑓐𑑐𑋰𑄶꩐꘠᱀᭐᮰᠐០᥆༠໐꧰႐᪐᪀᧐𑵐꯰᱐𑱐𑜰𑛀𑙐𑇐꧐꣐෦𑁦0𝟶𝟘𝟬𝟎𝟢₀⓿⓪⁰] > 0"
- "[𞥑𐒡߁𖭑꤁𖩡𑓑𑑑𑋱𑄷꩑꘡᱁᭑᮱᠑១᥇༡໑꧱႑᪑᪁᧑𑵑꯱᱑𑱑𑜱𑛁𑙑𑇑꧑꣑෧𑁧1𝟷𝟙𝟭𝟏𝟣₁¹①⑴⒈❶➀➊⓵] > 1"
- "[𞥒𐒢߂𖭒꤂𖩢𑓒𑑒𑋲𑄸꩒꘢᱂᭒᮲᠒២᥈༢໒꧲႒᪒᪂᧒𑵒꯲᱒𑱒𑜲𑛂𑙒𑇒꧒꣒෨𑁨2𝟸𝟚𝟮𝟐𝟤₂²②⑵⒉❷➁➋⓶] > 2"
- "[𞥓𐒣߃𖭓꤃𖩣𑓓𑑓𑋳𑄹꩓꘣᱃᭓᮳᠓៣᥉༣໓꧳႓᪓᪃᧓𑵓꯳᱓𑱓𑜳𑛃𑙓𑇓꧓꣓෩𑁩3𝟹𝟛𝟯𝟑𝟥₃³③⑶⒊❸➂➌⓷] > 3"
- "[𞥔𐒤߄𖭔꤄𖩤𑓔𑑔𑋴𑄺꩔꘤᱄᭔᮴᠔៤᥊༤໔꧴႔᪔᪄᧔𑵔꯴᱔𑱔𑜴𑛄𑙔𑇔꧔꣔෪𑁪4𝟺𝟜𝟰𝟒𝟦₄⁴④⑷⒋❹➃➍⓸] > 4"
- "[𞥕𐒥߅𖭕꤅𖩥𑓕𑑕𑋵𑄻꩕꘥᱅᭕᮵᠕៥᥋༥໕꧵႕᪕᪅᧕𑵕꯵᱕𑱕𑜵𑛅𑙕𑇕꧕꣕෫𑁫5𝟻𝟝𝟱𝟓𝟧₅⁵⑤⑸⒌❺➄➎⓹] > 5"
- "[𞥖𐒦߆𖭖꤆𖩦𑓖𑑖𑋶𑄼꩖꘦᱆᭖᮶᠖៦᥌༦໖꧶႖᪖᪆᧖𑵖꯶᱖𑱖𑜶𑛆𑙖𑇖꧖꣖෬𑁬6𝟼𝟞𝟲𝟔𝟨₆⁶⑥⑹⒍❻➅➏⓺] > 6"
- "[𞥗𐒧߇𖭗꤇𖩧𑓗𑑗𑋷𑄽꩗꘧᱇᭗᮷᠗៧᥍༧໗꧷႗᪗᪇᧗𑵗꯷᱗𑱗𑜷𑛇𑙗𑇗꧗꣗෭𑁭7𝟽𝟟𝟳𝟕𝟩₇⁷⑦⑺⒎❼➆➐⓻] > 7"
- "[𞥘𐒨߈𖭘꤈𖩨𑓘𑑘𑋸𑄾꩘꘨᱈᭘᮸᠘៨᥎༨໘꧸႘᪘᪈᧘𑵘꯸᱘𑱘𑜸𑛈𑙘𑇘꧘꣘෮𑁮8𝟾𝟠𝟴𝟖𝟪₈⁸⑧⑻⒏❽➇➑⓼] > 8"
- "[𞥙𐒩߉𖭙꤉𖩩𑓙𑑙𑋹𑄿꩙꘩᱉᭙᮹᠙៩᥏༩໙꧹႙᪙᪉᧙𑵙꯹᱙𑱙𑜹𑛉𑙙𑇙꧙꣙෯𑁯9𝟿𝟡𝟵𝟗𝟫₉⁹⑨⑼⒐❾➈➒⓽] > 9"
- "[𑜺⑩⑽⒑❿➉➓⓾] > '10'"
- "[⑪⑾⒒⓫] > '11'"
- "[⑫⑿⒓⓬] > '12'"
- "[⑬⒀⒔⓭] > '13'"
- "[⑭⒁⒕⓮] > '14'"
- "[⑮⒂⒖⓯] > '15'"
- "[⑯⒃⒗⓰] > '16'"
- "[⑰⒄⒘⓱] > '17'"
- "[⑱⒅⒙⓲] > '18'"
- "[⑲⒆⒚⓳] > '19'"
- "[𑜻⑳⒇⒛⓴] > '20'"
- "⅐ > ' 1/7'"
- "⅑ > ' 1/9'"
- "⅒ > ' 1/10'"


@ -0,0 +1,19 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.D0.91.D1.8A.D0.BB.D0.B3.D0.B0.D1.80.D1.81.D0.BA.D0.B8_.D0.B5.D0.B7.D0.B8.D0.BA_-_Bulgarian
- lang: bg
words:
- Блок -> бл
- Булевард -> бул
- Вход -> вх
- Генерал -> ген
- Град -> гр
- Доктор -> д-р
- Доцент -> доц
- Капитан -> кап
- Митрополит -> мит
- Площад -> пл
- Професор -> проф
- Свети -> Св
- Улица -> ул
- Село -> с
- Квартал -> кв
- Жилищен Комплекс -> ж к


@ -0,0 +1,90 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Catal.C3.A0_-_Catalan
- lang: ca
words:
- aparcament -> aparc
- apartament -> apmt
- apartat -> apt
- àtic -> àt
- autopista -> auto
- autopista -> autop
- autovia -> autov
- avinguda -> av
- avinguda -> avd
- avinguda -> avda
- baixada -> bda
- baixos -> bxs
- barranc -> bnc
- barri -> b
- barriada -> b
- biblioteca -> bibl
- bloc -> bl
- carrer -> c
- carrer -> c/
- carreró -> cró
- carretera -> ctra
- cantonada -> cant
- cementiri -> cem
- cinturó -> cint
- codi postal -> CP
- collegi -> coll
- collegi públic -> CP
- comissaria -> com
- convent -> convt
- correus -> corr
- districte -> distr
- drecera -> drec
- dreta -> dta
- entrada -> entr
- entresòl -> entl
- escala -> esc
- escola -> esc
- escola universitària -> EU
- església -> esgl
- estació -> est
- estacionament -> estac
- facultat -> fac
- finca -> fca
- habitació -> hab
- hospital -> hosp
- hotel -> H
- monestir -> mtir
- monument -> mon
- mossèn -> Mn
- municipal -> mpal
- museu -> mus
- nacional -> nac
- nombre -> nre
- número -> núm
- número -> n
- sense número -> s/n
- parada -> par
- parcel·la -> parc
- passadís -> pdís
- passatge -> ptge
- passeig -> pg
- pavelló -> pav
- plaça -> pl
- plaça -> pça
- planta -> pl
- població -> pobl
- polígon -> pol
- polígon industrial -> PI
- polígon industrial -> pol ind
- porta -> pta
- portal -> ptal
- principal -> pral
- pujada -> pda
- punt quilomètric -> PK
- rambla -> rbla
- ronda -> rda
- sagrada -> sgda
- sagrat -> sgt
- sant -> st
- santa -> sta
- sobreàtic -> s/àt
- travessera -> trav
- travessia -> trv
- travessia -> trav
- urbanització -> urb
- sortida -> sort
- via -> v


@ -0,0 +1,6 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Cesky_-_Czech
- lang: cs
words:
- Ulice -> Ul
- Třída -> Tř
- Náměstí -> Nám


@ -0,0 +1,12 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Dansk_-_Danish
- lang: da
words:
- Lille -> Ll
- Nordre -> Ndr
- Nørre -> Nr
- Søndre, Sønder -> Sdr
- Store -> St
- Gammel,Gamle -> Gl
- ~hal => hal
- ~hallen => hallen
- ~hallerne => hallerne


@ -0,0 +1,136 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Deutsch_-_German
- lang: de
words:
- am -> a
- an der -> a d
- Allgemeines Krankenhaus -> AKH
- Altstoffsammelzentrum -> ASZ
- auf der -> a d
- ~bach -> B
- Bad -> B
- Bahnhof -> Bhf
- Bayerisch, Bayerische, Bayerischer, Bayerisches -> Bayer
- Berg -> B
- ~berg |-> bg
- Bezirk -> Bez
- ~brücke -> Br
- Bundesgymnasium -> BG
- Bundespolizeidirektion -> BPD
- Bundesrealgymnasium -> BRG
- ~burg |-> bg
- burgenländische,burgenländischer,burgenländisches -> bgld
- Bürgermeister -> Bgm
- Chaussee -> Ch
- Deutsche, Deutscher, Deutsches -> dt
- Deutscher Alpenverein -> DAV
- Deutsch -> Dt
- ~denkmal -> Dkm
- Dorf -> Df
- ~dorf |-> df
- Doktor -> Dr
- ehemalige, ehemaliger, ehemaliges -> ehem
- Fabrik -> Fb
- Fachhochschule -> FH
- Freiwillige Feuerwehr -> FF
- Forsthaus -> Fh
- ~gasse |-> g
- Gasthaus -> Gh
- Gasthof -> Ghf
- Gemeinde -> Gde
- Graben -> Gr
- Großer, Große, Großes -> Gr, G
- Gymnasium und Realgymnasium -> GRG
- Handelsakademie -> HAK
- Handelsschule -> HASCH
- Haltestelle -> Hst
- Hauptbahnhof -> Hbf
- Haus -> Hs
- Heilige, Heiliger, Heiliges -> Hl
- Hintere, Hinterer, Hinteres -> Ht, Hint
- Hohe, Hoher, Hohes -> H
- ~höhle -> H
- Höhere Technische Lehranstalt -> HTL
- ~hütte -> Htt
- im -> i
- in -> i
- in der -> i d
- Ingenieur -> Ing
- Internationale, Internationaler, Internationales -> Int
- Jagdhaus -> Jh
- Jagdhütte -> Jhtt
- Kapelle -> Kap, Kpl
- Katastralgemeinde -> KG
- Kläranlage -> KA
- Kleiner, Kleine, Kleines -> kl
- Klein~ -> Kl.
- Kleingartenanlage -> KGA
- Kleingartenverein -> KGV
- Kogel -> Kg
- ~kogel |-> kg
- Konzentrationslager -> KZ, KL
- Krankenhaus -> KH
- ~kreuz |-> kz
- Landeskrankenhaus -> LKH
- Maria -> Ma
- Magister -> Mag
- Magistratsabteilung -> MA
- Markt -> Mkt
- Müllverbrennungsanlage -> MVA
- Nationalpark -> NP
- Naturschutzgebiet -> NSG
- Neue Mittelschule -> NMS
- Niedere, Niederer, Niederes -> Nd
- Niederösterreich -> NÖ
- nördliche, nördlicher, nördliches -> nördl
- Nummer -> Nr
- ob -> o
- Oberer, Obere, Oberes -> ob
- Ober~ -> Ob
- Österreichischer Alpenverein -> ÖAV
- Österreichischer Gebirgsverein -> ÖGV
- Österreichischer Touristenklub -> ÖTK
- östliche, östlicher, östliches -> östl
- Pater -> P
- Pfad -> P
- Platz -> Pl
- ~platz$ -> pl
- Professor -> Prof
- Quelle -> Q, Qu
- Reservoir -> Res
- Rhein -> Rh
- Rundwanderweg -> RWW
- Ruine -> R
- Sandgrube, Schottergrube -> SG
- Sankt -> St
- Schloss -> Schl
- See -> S
- ~siedlung -> sdlg
- Sozialmedizinisches Zentrum -> SMZ
- ~Spitze -> Sp
- Steinbruch -> Stb
- ~stiege -> stg
- ~strasse -> str
- südliche, südlicher, südliches -> südl
- Unterer, Untere, Unteres -> u, unt
- Unter~ -> U
- Teich -> T
- Technische Universität -> TU
- Truppenübungsplatz -> TÜPL, TÜPl
- Unfallkrankenhaus -> UKH
- ~universität -> uni
- verfallen -> verf
- von -> v
- Vordere, Vorderer, Vorderes -> Vd, Vord
- Vorder~ -> Vd, Vord
- von der -> v d
- vor der -> v d
- Volksschule -> VS
- Wald -> W
- Wasserfall -> Wsf, Wssf
- ~weg$ -> w
- westliche, westlicher, westliches -> westl
- Wiener -> Wr
- ~wiese$ -> ws
- Wirtschaftsuniversität -> WU
- Wirtshaus -> Wh
- zum -> z


@ -0,0 +1,54 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.CE.95.CE.BB.CE.BB.CE.B7.CE.BD.CE.B9.CE.BA.CE.AC_-_Greek
- lang: el
words:
- Αγίας -> Αγ
- Αγίου -> Αγ
- Αγίων -> Αγ
- Αδελφοί -> Αφοί
- Αδελφών -> Αφών
- Αλέξανδρου -> Αλ
- Ανώτατο Τεχνολογικό Εκπαιδευτικό Ίδρυμα -> ΑΤΕΙ
- Αστυνομικό Τμήμα -> ΑΤ
- Βασιλέως -> Β
- Βασιλέως -> Βασ
- Βασιλίσσης -> Β
- Βασιλίσσης -> Βασ
- Γρηγορίου -> Γρ
- Δήμος -> Δ
- Δημοτικό Σχολείο -> ΔΣ
- Δημοτικό Σχολείο -> Δημ Σχ
- Εθνάρχου -> Εθν
- Εθνική -> Εθν
- Εθνικής -> Εθν
- Ελευθέριος -> Ελ
- Ελευθερίου -> Ελ
- Ελληνικά Ταχυδρομεία -> ΕΛΤΑ
- Θεσσαλονίκης -> Θεσ/νίκης
- Ιερά Μονή -> Ι Μ
- Ιερός Ναός -> Ι Ν
- Κτίριο -> Κτ
- Κωνσταντίνου -> Κων/νου
- Λεωφόρος -> Λ
- Λεωφόρος -> Λεωφ
- Λίμνη -> Λ
- Νέα -> Ν
- Νέες -> Ν
- Νέο -> Ν
- Νέοι -> Ν
- Νέος -> Ν
- Νησί -> Ν
- Νομός -> Ν
- Όρος -> Όρ
- Παλαιά -> Π
- Παλαιές -> Π
- Παλαιό -> Π
- Παλαιοί -> Π
- Παλαιός -> Π
- Πανεπιστήμιο -> ΑΕΙ
- Πανεπιστήμιο -> Παν
- Πλατεία -> Πλ
- Ποταμός -> Π
- Ποταμός -> Ποτ
- Στρατηγού -> Στρ
- Ταχυδρομείο -> ΕΛΤΑ
- Τεχνολογικό Εκπαιδευτικό Ίδρυμα -> ΤΕΙ


@ -0,0 +1,485 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#English
- lang: en
words:
- Access -> Accs
- Air Force Base -> AFB
- Air National Guard Base -> ANGB
- Airport -> Aprt
- Alley -> Al
- Alley -> All
- Alley -> Ally
- Alley -> Aly
- Alleyway -> Alwy
- Amble -> Ambl
- Apartments -> Apts
- Approach -> Apch
- Approach -> App
- Arcade -> Arc
- Arterial -> Artl
- Artery -> Arty
- Avenue -> Av
- Avenue -> Ave
- Back -> Bk
- Banan -> Ba
- Basin -> Basn
- Basin -> Bsn
- Beach -> Bch
- Bend -> Bend
- Bend -> Bnd
- Block -> Blk
- Boardwalk -> Bwlk
- Boulevard -> Blvd
- Boulevard -> Bvd
- Boundary -> Bdy
- Bowl -> Bl
- Brace -> Br
- Brae -> Br
- Brae -> Brae
- Break -> Brk
- Bridge -> Bdge
- Bridge -> Br
- Bridge -> Brdg
- Bridge -> Bri
- Broadway -> Bdwy
- Broadway -> Bway
- Broadway -> Bwy
- Brook -> Brk
- Brow -> Brw
- Brow -> Brow
- Buildings -> Bldgs
- Buildings -> Bldngs
- Business -> Bus
- Bypass -> Bps
- Bypass -> Byp
- Bypass -> Bypa
- Byway -> Bywy
- Caravan -> Cvn
- Causeway -> Caus
- Causeway -> Cswy
- Causeway -> Cway
- Center -> Cen
- Center -> Ctr
- Central -> Ctrl
- Centre -> Cen
- Centre -> Ctr
- Centreway -> Cnwy
- Chase -> Ch
- Church -> Ch
- Circle -> Cir
- Circuit -> Cct
- Circuit -> Ci
- Circus -> Crc
- Circus -> Crcs
- City -> Cty
- Close -> Cl
- Common -> Cmn
- Common -> Comm
- Community -> Comm
- Concourse -> Cnc
- Concourse -> Con
- Copse -> Cps
- Corner -> Cnr
- Corner -> Crn
- Corso -> Cso
- Cottages -> Cotts
- County -> Co
- County Road -> CR
- County Route -> CR
- Court -> Crt
- Court -> Ct
- Courtyard -> Cyd
- Courtyard -> Ctyd
- Cove -> Ce
- Cove -> Cov
- Cove -> Cove
- Cove -> Cv
- Creek -> Ck
- Creek -> Cr
- Creek -> Crk
- Crescent -> Cr
- Crescent -> Cres
- Crest -> Crst
- Crest -> Cst
- Croft -> Cft
- Cross -> Cs
- Cross -> Crss
- Crossing -> Crsg
- Crossing -> Csg
- Crossing -> Xing
- Crossroad -> Crd
- Crossway -> Cowy
- Cul-de-sac -> Cds
- Cul-de-sac -> Csac
- Curve -> Cve
- Cutting -> Cutt
- Dale -> Dle
- Dale -> Dale
- Deviation -> Devn
- Dip -> Dip
- Distributor -> Dstr
- Down -> Dn
- Downs -> Dn
- Drive -> Dr
- Drive -> Drv
- Drive -> Dv
- Drive-In => Drive-In # prevent abbreviation here
- Driveway -> Drwy
- Driveway -> Dvwy
- Driveway -> Dwy
- East -> E
- Edge -> Edg
- Edge -> Edge
- Elbow -> Elb
- End -> End
- Entrance -> Ent
- Esplanade -> Esp
- Estate -> Est
- Expressway -> Exp
- Expressway -> Expy
- Expressway -> Expwy
- Expressway -> Xway
- Extension -> Ex
- Fairway -> Fawy
- Fairway -> Fy
- Father -> Fr
- Ferry -> Fy
- Field -> Fd
- Fire Track -> Ftrk
- Firetrail -> Fit
- Flat -> Fl
- Flat -> Flat
- Follow -> Folw
- Footway -> Ftwy
- Foreshore -> Fshr
- Forest Service Road -> FSR
- Formation -> Form
- Fort -> Ft
- Freeway -> Frwy
- Freeway -> Fwy
- Front -> Frnt
- Frontage -> Fr
- Frontage -> Frtg
- Gap -> Gap
- Garden -> Gdn
- Gardens -> Gdn
- Gardens -> Gdns
- Gate -> Ga
- Gate -> Gte
- Gates -> Ga
- Gates -> Gte
- Gateway -> Gwy
- George -> Geo
- Glade -> Gl
- Glade -> Gld
- Glade -> Glde
- Glen -> Gln
- Glen -> Glen
- Grange -> Gra
- Green -> Gn
- Green -> Grn
- Ground -> Grnd
- Grove -> Gr
- Grove -> Gro
- Grovet -> Gr
- Gully -> Gly
- Harbor -> Hbr
- Harbour -> Hbr
- Haven -> Hvn
- Head -> Hd
- Heads -> Hd
- Heights -> Hgts
- Heights -> Ht
- Heights -> Hts
- High School -> HS
- Highroad -> Hird
- Highroad -> Hrd
- Highway -> Hwy
- Hill -> Hill
- Hill -> Hl
- Hills -> Hl
- Hills -> Hls
- Hospital -> Hosp
- House -> Ho
- House -> Hse
- Industrial -> Ind
- Interchange -> Intg
- International -> Intl
- Island -> I
- Island -> Is
- Junction -> Jctn
- Junction -> Jnc
- Junior -> Jr
- Key -> Key
- Lagoon -> Lgn
- Lakes -> L
- Landing -> Ldg
- Lane -> La
- Lane -> Lane
- Lane -> Ln
- Laneway -> Lnwy
- Line -> Line
- Line -> Ln
- Link -> Link
- Link -> Lk
- Little -> Lit
- Little -> Lt
- Lodge -> Ldg
- Lookout -> Lkt
- Loop -> Loop
- Loop -> Lp
- Lower -> Low
- Lower -> Lr
- Lower -> Lwr
- Mall -> Mall
- Mall -> Ml
- Manor -> Mnr
- Mansions -> Mans
- Market -> Mkt
- Meadow -> Mdw
- Meadows -> Mdw
- Meadows -> Mdws
- Mead -> Md
- Meander -> Mdr
- Meander -> Mndr
- Meander -> Mr
- Medical -> Med
- Memorial -> Mem
- Mews -> Mews
- Mews -> Mw
- Middle -> Mid
- Middle School -> MS
- Mile -> Mi
- Military -> Mil
- Motorway -> Mtwy
- Motorway -> Mwy
- Mount -> Mt
- Mountain -> Mtn
- Mountains -> Mtn
- Municipal -> Mun
- Museum -> Mus
- National Park -> NP
- National Recreation Area -> NRA
- National Wildlife Refuge Area -> NWRA
- Nook -> Nk
- Nook -> Nook
- North -> N
- Northeast -> NE
- Northwest -> NW
- Outlook -> Out
- Outlook -> Otlk
- Parade -> Pde
- Paradise -> Pdse
- Park -> Park
- Park -> Pk
- Parklands -> Pkld
- Parkway -> Pkwy
- Parkway -> Pky
- Parkway -> Pwy
- Pass -> Pass
- Pass -> Ps
- Passage -> Psge
- Path -> Path
- Pathway -> Phwy
- Pathway -> Pway
- Pathway -> Pwy
- Piazza -> Piaz
- Pike -> Pk
- Place -> Pl
- Plain -> Pl
- Plains -> Pl
- Plateau -> Plat
- Plaza -> Pl
- Plaza -> Plz
- Plaza -> Plza
- Pocket -> Pkt
- Point -> Pnt
- Point -> Pt
- Port -> Port
- Port -> Pt
- Post Office -> PO
- Precinct -> Pct
- Promenade -> Prm
- Promenade -> Prom
- Quad -> Quad
- Quadrangle -> Qdgl
- Quadrant -> Qdrt
- Quadrant -> Qd
- Quay -> Qy
- Quays -> Qy
- Quays -> Qys
- Ramble -> Ra
- Ramble -> Rmbl
- Range -> Rge
- Range -> Rnge
- Reach -> Rch
- Reservation -> Res
- Reserve -> Res
- Reservoir -> Res
- Rest -> Rest
- Rest -> Rst
- Retreat -> Rt
- Retreat -> Rtt
- Return -> Rtn
- Ridge -> Rdg
- Ridge -> Rdge
- Ridgeway -> Rgwy
- Right of Way -> Rowy
- Rise -> Ri
- Rise -> Rise
- River -> R
- River -> Riv
- River -> Rvr
- Riverway -> Rvwy
- Riviera -> Rvra
- Road -> Rd
- Roads -> Rds
- Roadside -> Rdsd
- Roadway -> Rdwy
- Roadway -> Rdy
- Robert -> Robt
- Rocks -> Rks
- Ronde -> Rnde
- Rosebowl -> Rsbl
- Rotary -> Rty
- Round -> Rnd
- Route -> Rt
- Route -> Rte
- Row -> Row
- Rue -> Rue
- Run -> Run
- Saint -> St
- Saints -> SS
- Senior -> Sr
- Serviceway -> Swy
- Serviceway -> Svwy
- Shunt -> Shun
- Siding -> Sdng
- Sister -> Sr
- Slope -> Slpe
- Sound -> Snd
- South -> S
- South -> Sth
- Southeast -> SE
- Southwest -> SW
- Spur -> Spur
- Square -> Sq
- Stairway -> Strwy
- State Highway -> SH
- State Highway -> SHwy
- State Route -> SR
- Station -> Sta
- Station -> Stn
- Strand -> Sd
- Strand -> Stra
- Street -> St
- Strip -> Strp
- Subway -> Sbwy
- Tarn -> Tn
- Tarn -> Tarn
- Terminal -> Term
- Terrace -> Tce
- Terrace -> Ter
- Terrace -> Terr
- Thoroughfare -> Thfr
- Thoroughfare -> Thor
- Tollway -> Tlwy
- Tollway -> Twy
- Top -> Top
- Tor -> Tor
- Towers -> Twrs
- Township -> Twp
- Trace -> Trce
- Track -> Tr
- Track -> Trk
- Trail -> Trl
- Trailer -> Trlr
- Triangle -> Tri
- Trunkway -> Tkwy
- Tunnel -> Tun
- Turn -> Tn
- Turn -> Trn
- Turn -> Turn
- Turnpike -> Tpk
- Turnpike -> Tpke
- Underpass -> Upas
- Underpass -> Ups
- University -> Uni
- University -> Univ
- Upper -> Up
- Upper -> Upr
- Vale -> Va
- Vale -> Vale
- Valley -> Vy
- Viaduct -> Vdct
- Viaduct -> Via
- Viaduct -> Viad
- View -> Vw
- View -> View
- Village -> Vill
- Villas -> Vlls
- Vista -> Vst
- Vista -> Vsta
- Walk -> Walk
- Walk -> Wk
- Walk -> Wlk
- Walkway -> Wkwy
- Walkway -> Wky
- Waters -> Wtr
- Way -> Way
- Way -> Wy
- West -> W
- Wharf -> Whrf
- William -> Wm
- Wynd -> Wyn
- Wynd -> Wynd
- Yard -> Yard
- Yard -> Yd
- lang: en
country: ca
words:
- Circuit -> CIRCT
- Concession -> CONC
- Corners -> CRNRS
- Crossing -> CROSS
- Diversion -> DIVERS
- Esplanade -> ESPL
- Extension -> EXTEN
- Grounds -> GRNDS
- Harbour -> HARBR
- Highlands -> HGHLDS
- Landing -> LANDNG
- Limits -> LMTS
- Lookout -> LKOUT
- Orchard -> ORCH
- Parkway -> PKY
- Passage -> PASS
- Pathway -> PTWAY
- Private -> PVT
- Range -> RG
- Subdivision -> SUBDIV
- Terrace -> TERR
- Townline -> TLINE
- Turnabout -> TRNABT
- Village -> VILLGE
- lang: en
country: ph
words:
- Apartment -> Apt
- Barangay -> Brgy
- Barangay -> Bgy
- Building -> Bldg
- Commission -> Comm
- Compound -> Cmpd
- Compound -> Cpd
- Cooperative -> Coop
- Department -> Dept
- Department -> Dep't
- General -> Gen
- Governor -> Gov
- National -> Nat'l
- National High School -> NHS
- Philippine -> Phil
- Police Community Precinct -> PCP
- Province -> Prov
- Senior High School -> SHS
- Subdivision -> Subd

View File

@ -0,0 +1,163 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Espa.C3.B1ol_-_Spanish
- lang: es
words:
- Acequia -> Aceq
- Alameda -> Alam
- Alquería -> Alque
- Andador -> Andad
- Angosta -> Angta
- Apartamento -> Apto
- Apartamentos -> Aptos
- Apeadero -> Apdro
- Arboleda -> Arb
- Arrabal -> Arral
- Arroyo -> Arry
- Asociación de Vecinos -> A VV
- Asociación Vecinal -> A V
- Autopista -> Auto
- Autovía -> Autov
- Avenida -> Av
- Avenida -> Avd
- Avenida -> Avda
- Balneario -> Balnr
- Banda -> B
- Banda -> Bda
- Barranco -> Branc
- Barranquil -> Bqllo
- Barriada -> Barda
- Barrio -> B.º
- Barrio -> Bo
- Bloque -> Blq
- Bulevar -> Blvr
- Boulevard -> Blvd
- Calle -> C/
- Calle -> C
- Calle -> Cl
- Calleja -> Cllja
- Callejón -> Callej
- Callejón -> Cjón
- Callejón -> Cllón
- Callejuela -> Cjla
- Callizo -> Cllzo
- Calzada -> Czada
- Camino -> Cno
- Camino -> Cmno
- Camino hondo -> C H
- Camino nuevo -> C N
- Camino viejo -> C V
- Camping -> Campg
- Cantera -> Cantr
- Cantina -> Canti
- Cantón -> Cant
- Carrera -> Cra
- Carrero -> Cro
- Carretera -> Ctra
- Carreterín -> Ctrin
- Carretil -> Crtil
- Caserío -> Csrio
- Centro Integrado de Formación Profesional -> CIFP
- Cinturón -> Cint
- Circunvalación -> Ccvcn
- Cobertizo -> Cbtiz
- Colegio de Educación Especial -> CEE
- Colegio de Educación Infantil -> CEI
- Colegio de Educación Infantil y Primaria -> CEIP
- Colegio Rural Agrupado -> CRA
- Colonia -> Col
- Complejo -> Compj
- Conjunto -> Cjto
- Convento -> Cnvto
- Cooperativa -> Coop
- Corralillo -> Crrlo
- Corredor -> Crrdo
- Cortijo -> Crtjo
- Costanilla -> Cstan
- Costera -> Coste
- Dehesa -> Dhsa
- Demarcación -> Demar
- Diagonal -> Diag
- Diseminado -> Disem
- Doctor -> Dr
- Doctora -> Dra
- Edificio -> Edif
- Empresa -> Empr
- Entrada -> Entd
- Escalera -> Esca
- Escalinata -> Escal
- Espalda -> Eslda
- Estación -> Estcn
- Estrada -> Estda
- Explanada -> Expla
- Extramuros -> Extrm
- Extrarradio -> Extrr
- Fábrica -> Fca
- Fábrica -> Fbrca
- Ferrocarril -> F C
- Ferrocarriles -> FF CC
- Galería -> Gale
- Glorieta -> Gta
- Gran Vía -> G V
- Hipódromo -> Hipód
- Instituto de Educación Secundaria -> IES
- Jardín -> Jdín
- Llanura -> Llnra
- Lote -> Lt
- Malecón -> Malec
- Manzana -> Mz
- Mercado -> Merc
- Mirador -> Mrdor
- Monasterio -> Mtrio
- Nuestra Señora -> N.ª S.ª
- Nuestra Señora -> Ntr.ª Sr.ª
- Nuestra Señora -> Ntra Sra
- Palacio -> Palac
- Pantano -> Pant
- Parque -> Pque
- Particular -> Parti
- Partida -> Ptda
- Pasadizo -> Pzo
- Pasaje -> Psje
- Paseo -> P.º
- Paseo marítimo -> P.º mar
- Pasillo -> Psllo
- Plaza -> Pl
- Plaza -> Pza
- Plazoleta -> Pzta
- Plazuela -> Plzla
- Poblado -> Pbdo
- Polígono -> Políg
- Polígono industrial -> Pg ind
- Pórtico -> Prtco
- Portillo -> Ptilo
- Prazuela -> Przla
- Prolongación -> Prol
- Pueblo -> Pblo
- Puente -> Pte
- Puerta -> Pta
- Puerto -> Pto
- Punto kilométrico -> P k
- Rambla -> Rbla
- Residencial -> Resid
- Ribera -> Rbra
- Rincón -> Rcón
- Rinconada -> Rcda
- Rotonda -> Rtda
- San -> S
- Sanatorio -> Sanat
- Santa -> Sta
- Santo -> Sto
- Santas -> Stas
- Santos -> Stos
- Santuario -> Santu
- Sector -> Sect
- Sendera -> Sedra
- Sendero -> Send
- Torrente -> Trrnt
- Tránsito -> Tráns
- Transversal -> Trval
- Trasera -> Tras
- Travesía -> Trva
- Urbanización -> Urb
- Vecindario -> Vecin
- Viaducto -> Vcto
- Viviendas -> Vvdas

View File

@ -0,0 +1,8 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Eesti_-_Estonian
- lang: et
words:
- Maantee -> mnt
- Puiestee -> pst
- Raudtee -> rdt
- Raudteejaam -> rdtj
- Tänav -> tn

View File

@ -0,0 +1,6 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Euskara_-_Basque
- lang: eu
words:
- Etorbidea -> Etorb
- Errepidea -> Err
- Kalea -> K

View File

@ -0,0 +1,23 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Suomi_-_Finnish
- lang: fi
words:
- ~alue -> al
- ~asema -> as
- ~aukio -> auk
- ~kaari -> kri
- ~katu -> k
- ~kuja -> kj
- ~kylä -> kl
- ~penger -> pgr
- ~polku -> p
- ~puistikko -> pko
- ~puisto -> ps
- ~raitti -> r
- ~rautatieasema -> ras
- ~ranta -> rt
- ~rinne -> rn
- ~taival -> tvl
- ~tie -> t
- tienhaara -> th
- ~tori -> tr
- ~väylä -> vlä

View File

@ -0,0 +1,297 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Fran.C3.A7ais_-_French
- lang: fr
words:
- Abbaye -> ABE
- Agglomération -> AGL
- Aire -> AIRE
- Aires -> AIRE
- Allée -> ALL
- Allée -> All
- Allées -> ALL
- Ancien chemin -> ACH
- Ancienne route -> ART
- Anciennes routes -> ART
- Anse -> ANSE
- Arcade -> ARC
- Arcades -> ARC
- Autoroute -> AUT
- Avenue -> AV
- Avenue -> Av
- Barrière -> BRE
- Barrières -> BRE
- Bas chemin -> BCH
- Bastide -> BSTD
- Baston -> BAST
- Béguinage -> BEGI
- Béguinages -> BEGI
- Berge -> BER
- Berges -> BER
- Bois -> BOIS
- Boucle -> BCLE
- Boulevard -> Bd
- Boulevard -> BD
- Bourg -> BRG
- Butte -> BUT
- Cité -> CITE
- Cités -> CITE
- Côte -> COTE
- Côteau -> COTE
- Cale -> CALE
- Camp -> CAMP
- Campagne -> CGNE
- Camping -> CPG
- Carreau -> CAU
- Carrefour -> CAR
- Carrière -> CARE
- Carrières -> CARE
- Carré -> CARR
- Castel -> CST
- Cavée -> CAV
- Central -> CTRE
- Centre -> CTRE
- Chalet -> CHL
- Chapelle -> CHP
- Charmille -> CHI
- Chaussée -> CHS
- Chaussées -> CHS
- Chemin -> Ch
- Chemin -> CHE
- Chemin -> Che
- Chemin vicinal -> CHV
- Cheminement -> CHEM
- Cheminements -> CHEM
- Chemins -> CHE
- Chemins vicinaux -> CHV
- Chez -> CHEZ
- Château -> CHT
- Cloître -> CLOI
- Clos -> CLOS
- Col -> COL
- Colline -> COLI
- Collines -> COLI
- Contour -> CTR
- Corniche -> COR
- Corniches -> COR
- Cottage -> COTT
- Cottages -> COTT
- Cour -> COUR
- Cours -> CRS
- Cours -> Crs
- Darse -> DARS
- Degré -> DEG
- Degrés -> DEG
- Descente -> DSG
- Descentes -> DSG
- Digue -> DIG
- Digues -> DIG
- Domaine -> DOM
- Domaines -> DOM
- Écluse -> ECL
- Écluse -> ÉCL
- Écluses -> ECL
- Écluses -> ÉCL
- Église -> EGL
- Église -> ÉGL
- Enceinte -> EN
- Enclave -> ENV
- Enclos -> ENC
- Escalier -> ESC
- Escaliers -> ESC
- Espace -> ESPA
- Esplanade -> ESP
- Esplanades -> ESP
- Étang -> ETANG
- Étang -> ÉTANG
- Faubourg -> FG
- Faubourg -> Fg
- Ferme -> FRM
- Fermes -> FRM
- Fontaine -> FON
- Fort -> FORT
- Forum -> FORM
- Fosse -> FOS
- Fosses -> FOS
- Foyer -> FOYR
- Galerie -> GAL
- Galeries -> GAL
- Gare -> GARE
- Garenne -> GARN
- Grand boulevard -> GBD
- Grand ensemble -> GDEN
- Grandrue -> GR
- Grande rue -> GR
- Grandes rues -> GR
- Grands ensembles -> GDEN
- Grille -> GRI
- Grimpette -> GRIM
- Groupe -> GPE
- Groupement -> GPT
- Groupes -> GPE
- Halle -> HLE
- Halles -> HLE
- Hameau -> HAM
- Hameaux -> HAM
- Haut chemin -> HCH
- Hauts chemins -> HCH
- Hippodrome -> HIP
- HLM -> HLM
- Île -> ILE
- Île -> ÎLE
- Immeuble -> IMM
- Immeubles -> IMM
- Impasse -> IMP
- Impasse -> Imp
- Impasses -> IMP
- Jardin -> JARD
- Jardins -> JARD
- Jetée -> JTE
- Jetées -> JTE
- Levée -> LEVE
- Lieu-dit -> LD
- Lotissement -> LOT
- Lotissements -> LOT
- Mail -> MAIL
- Maison forestière -> MF
- Manoir -> MAN
- Marche -> MAR
- Marches -> MAR
- Maréchal -> MAL
- Mas -> MAS
- Monseigneur -> Mgr
- Mont -> Mt
- Montée -> MTE
- Montées -> MTE
- Moulin -> MLN
- Moulins -> MLN
- Musée -> MUS
- Métro -> MET
- Métro -> MÉT
- Nouvelle route -> NTE
- Palais -> PAL
- Parc -> PARC
- Parcs -> PARC
- Parking -> PKG
- Parvis -> PRV
- Passage -> PAS
- Passage -> Pas
- Passage -> Pass
- Passage à niveau -> PN
- Passe -> PASS
- Passerelle -> PLE
- Passerelles -> PLE
- Passes -> PASS
- Patio -> PAT
- Pavillon -> PAV
- Pavillons -> PAV
- Petit chemin -> PCH
- Petite allée -> PTA
- Petite avenue -> PAE
- Petite impasse -> PIM
- Petite route -> PRT
- Petite rue -> PTR
- Petites allées -> PTA
- Place -> PL
- Place -> Pl
- Placis -> PLCI
- Plage -> PLAG
- Plages -> PLAG
- Plaine -> PLN
- Plan -> PLAN
- Plateau -> PLT
- Plateaux -> PLT
- Pointe -> PNT
- Pont -> PONT
- Ponts -> PONT
- Porche -> PCH
- Port -> PORT
- Porte -> PTE
- Portique -> PORQ
- Portiques -> PORQ
- Poterne -> POT
- Pourtour -> POUR
- Presquîle -> PRQ
- Promenade -> PROM
- Promenade -> Prom
- Pré -> PRE
- Pré -> PRÉ
- Périphérique -> PERI
- Péristyle -> PSTY
- Quai -> QU
- Quai -> Qu
- Quartier -> QUA
- Raccourci -> RAC
- Raidillon -> RAID
- Rampe -> RPE
- Rempart -> REM
- Roc -> ROC
- Rocade -> ROC
- Rond point -> RPT
- Roquet -> ROQT
- Rotonde -> RTD
- Route -> RTE
- Route -> Rte
- Routes -> RTE
- Rue -> R
- Rue -> R
- Ruelle -> RLE
- Ruelles -> RLE
- Rues -> R
- Résidence -> RES
- Résidences -> RES
- Saint -> St
- Sainte -> Ste
- Sente -> SEN
- Sentes -> SEN
- Sentier -> SEN
- Sentiers -> SEN
- Square -> SQ
- Square -> Sq
- Stade -> STDE
- Station -> STA
- Terrain -> TRN
- Terrasse -> TSSE
- Terrasses -> TSSE
- Terre plein -> TPL
- Tertre -> TRT
- Tertres -> TRT
- Tour -> TOUR
- Traverse -> TRA
- Vallon -> VAL
- Vallée -> VAL
- Venelle -> VEN
- Venelles -> VEN
- Via -> VIA
- Vieille route -> VTE
- Vieux chemin -> VCHE
- Villa -> VLA
- Village -> VGE
- Villages -> VGE
- Villas -> VLA
- Voie -> VOI
- Voies -> VOI
- Zone -> ZONE
- Zone artisanale -> ZA
- Zone d'aménagement concerté -> ZAC
- Zone d'aménagement différé -> ZAD
- Zone industrielle -> ZI
- Zone à urbaniser en priorité -> ZUP
- lang: fr
country: ca
words:
- Boulevard -> BOUL
- Carré -> CAR
- Carrefour -> CARREF
- Centre -> C
- Chemin -> CH
- Croissant -> CROIS
- Diversion -> DIVERS
- Échangeur -> ÉCH
- Esplanade -> ESPL
- Passage -> PASS
- Plateau -> PLAT
- Rang -> RANG
- Rond-point -> RDPT
- Sentier -> SENT
- Subdivision -> SUBDIV
- Terrasse -> TSSE
- Village -> VILLGE

View File

@ -0,0 +1,27 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Galego_-_Galician
- lang: gl
words:
- Asociación Veciñal -> A V
- Asociación de Veciños -> A VV
- Avenida -> Av
- Avenida -> Avda
- Centro Integrado de Formación Profesional -> CIFP
- Colexio de Educación Especial -> CEE
- Colexio de Educación Infantil -> CEI
- Colexio de Educación Infantil e Primaria -> CEIP
- Colexio Rural Agrupado -> CRA
- Doutor -> Dr
- Doutora -> Dra
- Edificio -> Edif
- Estrada -> Estda
- Ferrocarril -> F C
- Ferrocarrís -> FF CC
- Instituto de Educación Secundaria -> IES
- Rúa -> R/
- San -> S
- Santa -> Sta
- Santo -> Sto
- Santas -> Stas
- Santos -> Stos
- Señora -> Sra
- Urbanización -> Urb

View File

@ -0,0 +1,4 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Magyar_-_Hungarian
- lang: hu
words:
- utca -> u

View File

@ -0,0 +1,77 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Italiano_-_Italian
- lang: it
words:
- Calle -> C.le
- Campo -> C.po
- Cascina -> C.na
- Cinque -> 5
- Corso -> C.so
- Corte -> C.te
- Decima -> X
- Decimo -> X
- Due -> 2
- Fondamenta -> F.ta
- Largo -> L.go
- Località -> Loc
- Lungomare -> L.mare
- Nona -> IX
- Nono -> IX
- Nove -> 9
- Otto -> 8
- Ottava -> VIII
- Ottavo -> VIII
- Piazza -> P.za
- Piazza -> P.zza
- Piazzale -> P.le
- Piazzetta -> P.ta
- Ponte -> P.te
- Porta -> P.ta
- Prima -> I
- Primo -> I
- Primo -> 1
- Primo -> 1°
- Quarta -> IV
- Quarto -> IV
- Quattro -> IV
- Quattro -> 4
- Quinta -> V
- Quinto -> V
- Salizada -> S.da
- San -> S
- Santa -> S
- Santo -> S
- Sant' -> S
- Santi -> SS
- Santissima -> SS.ma
- Santissime -> SS.me
- Santissimi -> SS.mi
- Santissimo -> SS.mo
- Seconda -> II
- Secondo -> II
- Sei -> 6
- Sesta -> VI
- Sesto -> VI
- Sette -> 7
- Settima -> VII
- Settimo -> VII
- Stazione -> Staz
- Strada Comunale -> SC
- Strada Provinciale -> SP
- Strada Regionale -> SR
- Strada Statale -> SS
- Terzo -> III
- Terza -> III
- Tre -> 3
- Trenta -> XXX
- Un -> 1
- Una -> 1
- Venti -> XX
- Venti -> 20
- Venticinque -> XXV
- Venticinque -> 25
- Ventiquattro -> XXIV
- Ventitreesimo -> XXIII
- Via -> V
- Viale -> V.le
- Vico -> V.co
- Vicolo -> V.lo

View File

@ -0,0 +1,32 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.E6.97.A5.E6.9C.AC.E8.AA.9E_.28Nihongo.29_-_Japanese
- lang: ja
words:
- ~中学校 |-> 中
- ~大学 |-> 大
- 独立行政法人~ -> 独
- 学校法人~ -> 学
- ~銀行 |-> 銀
- ~合同会社 -> 合
- 合同会社~ -> 合
- ~合名会社 -> 名
- 合名会社~ -> 名
- ~合資会社 -> 資
- 合資会社~ -> 資
- 一般道道~ -> 一
- 一般府道~ -> 一
- 一般県道~ -> 一
- 一般社団法人~ -> 一社
- 一般都道~ -> 一
- 一般財団法人~ -> 一財
- 医療法人~ -> 医
- ~株式会社 -> 株
- 株式会社~ -> 株
- 国立大学法人~ -> 大
- 公立大学法人~ -> 大
- ~高等学校 |-> 高
- ~高等学校 |-> 高校
- ~小学校 |-> 小
- 主要地方道~ -> 主
- 有限会社~ -> 有
- ~有限会社 -> 有
- 財団法人~ -> 財

View File

@ -0,0 +1,12 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Malagasy_-_Malgache
- lang: mg
words:
- Ambato -> Ato
- Ambinany -> Any
- Ambodi -> Adi
- Ambohi -> Ahi
- Ambohitr' -> Atr'
- Ambony -> Ani
- Ampasi -> Asi
- Andoha -> Aha
- Andrano -> Ano

View File

@ -0,0 +1,12 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Bahasa_Melayu_-_Malay
- lang: ms
words:
- Jalan -> Jln
- Simpang -> Spg
- Kampong -> Kg
- Sungai -> Sg
- Haji -> Hj
- Pengiran -> Pg
- Awang -> Awg
- Dayang -> Dyg

View File

@ -0,0 +1,53 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Nederlands_-_Dutch
- lang: nl
words:
- Broeder -> Br
- Burgemeester -> Burg
- Commandant -> Cmdt
- Doctor -> dr
- Dokter -> Dr
- Dominee -> ds
- Gebroeders -> Gebr
- Generaal -> Gen
- ~gracht -> gr
- Ingenieur -> ir
- Jonkheer -> Jhr
- Kolonel -> Kol
- Kanunnik -> Kan
- Kardinaal -> Kard
- Kort(e) -> Kte, K
- Koning -> Kon
- Koningin -> Kon
- ~laan -> ln
- Lange -> L
- Luitenant -> Luit
- ~markt -> mkt
- Meester -> Mr, mr
- Mejuffrouw -> Mej
- Mevrouw -> Mevr
- Minister -> Min
- Monseigneur -> Mgr
- Noordzijde -> NZ, N Z
- Oostzijde -> OZ, O Z
- Onze-Lieve-Vrouw,Onze-Lieve-Vrouwe -> O L V, OLV
- Pastoor -> Past
- ~plein -> pln
- President -> Pres
- Prins -> Pr
- Prinses -> Pr
- Professor -> Prof
- ~singel -> sngl
- ~straat -> str
- ~steenweg -> stwg
- Sint -> St
- Van -> V
- Van De -> V D, vd
- Van Den -> V D, vd
- Van Der -> V D, vd
- Verlengde -> Verl
- ~vliet -> vlt
- Vrouwe -> Vr
- ~weg -> wg
- Westzijde -> WZ, W Z
- Zuidzijde -> ZZ, Z Z
- Zuster -> Zr

View File

@ -0,0 +1,11 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Norsk_-_Norwegian
- lang: no
words:
# convert between Nynorsk and Bokmål here
- vei, veg => v,vn,vei,veg
- veien, vegen -> v,vn,veien,vegen
- gate -> g,gt
# convert between the two feminine forms
- gaten, gata => g,gt,gaten,gata
- plass, plassen -> pl
- sving, svingen -> sv

View File

@ -0,0 +1,66 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Polski_.E2.80.93_Polish
- lang: pl
words:
- Aleja, Aleje, Alei, Alejach, Aleją -> al
- Ulica, Ulice, Ulicą, Ulicy -> ul
- Plac, Placu, Placem -> pl
- Wybrzeże, Wybrzeża, Wybrzeżem -> wyb
- Bulwar -> bulw
- Dolny, Dolna, Dolne -> Dln
- Drugi, Druga, Drugie -> 2
- Drugi, Druga, Drugie -> II
- Duży, Duża, Duże -> Dz
- Duży, Duża, Duże -> Dż
- Górny, Górna, Górne -> Grn
- Kolonia -> kol
- koło, kolo -> k
- Mały, Mała, Małe -> Ml
- Mały, Mała, Małe -> Mł
- Mazowiecka, Mazowiecki, Mazowieckie -> maz
- Miasto -> m
- Nowy, Nowa, Nowe -> Nw
- Nowy, Nowa, Nowe -> N
- Osiedle, Osiedlu -> os
- Pierwszy, Pierwsza, Pierwsze -> 1
- Pierwszy, Pierwsza, Pierwsze -> I
- Szkoła Podstawowa -> SP
- Stary, Stara, Stare -> St
- Stary, Stara, Stare -> Str
- Trzeci, Trzecia, Trzecie -> III
- Trzeci, Trzecia, Trzecie -> 3
- Wielki, Wielka, Wielkie -> Wlk
- Wielkopolski, Wielkopolska, Wielkopolskie -> wlkp
- Województwo, Województwie -> woj
- kardynała, kardynał -> kard
- pułkownika, pułkownik -> płk
- marszałka, marszałek -> marsz
- generała, generał -> gen
- Świętego, Świętej, Świętych, święty, święta, święci -> św
- Świętych, święci -> śś
- Ojców -> oo
- Błogosławionego, Błogosławionej, Błogosławionych, błogosławiony, błogosławiona, błogosławieni -> bł
- księdza, ksiądz -> ks
- księcia, książe -> ks
- doktora, doktor -> dr
- majora, major -> mjr
- biskupa, biskup -> bpa
- biskupa, biskup -> bp
- rotmistrza, rotmistrz -> rotm
- profesora, profesor -> prof
- hrabiego, hrabiny, hrabia, hrabina -> hr
- porucznika, porucznik -> por
- podpułkownika, podpułkownik -> ppłk
- pułkownika, pułkownik -> płk
- podporucznika, podporucznik -> ppor
- porucznika, porucznik -> por
- marszałka, marszałek -> marsz
- chorążego, chorąży -> chor
- szeregowego, szeregowy -> szer
- kaprala, kapral -> kpr
- plutonowego, plutonowy -> plut
- kapitana, kapitan -> kpt
- admirała, admirał -> adm
- wiceadmirała, wiceadmirał -> wadm
- kontradmirała, kontradmirał -> kontradm
- batalionów, bataliony -> bat
- batalionu, batalion -> bat

View File

@ -0,0 +1,196 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Portugu.C3.AAs_-_Portuguese
- lang: pt
words:
- Associação -> Ass
- Alameda -> Al
- Alferes -> Alf
- Almirante -> Alm
- Arquitecto -> Arq
- Arquitecto -> Arqº
- Arquiteto -> Arq
- Arquiteto -> Arqº
- Auto-estrada -> A
- Avenida -> Av
- Avenida -> Avª
- Azinhaga -> Az
- Bairro -> B
- Bairro -> Bº
- Bairro -> Br
- Beco -> Bc
- Beco -> Bco
- Bloco -> Bl
- Bombeiros Voluntários -> BV
- Bombeiros Voluntários -> B.V
- Brigadeiro -> Brg
- Cacique -> Cac
- Calçada -> Cc
- Calçadinha -> Ccnh
- Câmara Municipal -> CM
- Câmara Municipal -> C.M
- Caminho -> Cam
- Capitão -> Cap
- Casal -> Csl
- Cave -> Cv
- Centro Comercial -> CC
- Centro Comercial -> C.C
- Ciclo do Ensino Básico -> CEB
- Ciclo do Ensino Básico -> C.E.B
- Ciclo do Ensino Básico -> C. E. B
- Comandante -> Cmdt
- Comendador -> Comend
- Companhia -> Cª
- Conselheiro -> Cons
- Coronel -> Cor
- Coronel -> Cel
- Corte -> C.te
- De -> D´
- De -> D'
- Departamento -> Dept
- Deputado -> Dep
- Direito -> Dto
- Dom -> D
- Dona -> D
- Dona -> Dª
- Doutor -> Dr
- Doutora -> Dr
- Doutora -> Drª
- Doutora -> Dra
- Duque -> Dq
- Edifício -> Ed
- Edifício -> Edf
- Embaixador -> Emb
- Empresa Pública -> EP
- Empresa Pública -> E.P
- Enfermeiro -> Enfo
- Enfermeiro -> Enfº
- Enfermeiro -> Enf
- Engenheiro -> Eng
- Engenheiro -> Engº
- Engenheira -> Eng
- Engenheira -> Engª
- Escadas -> Esc
- Escadinhas -> Escnh
- Escola Básica -> EB
- Escola Básica -> E.B
- Esquerdo -> Esq
- Estação de Tratamento de Águas Residuais -> ETAR
- Estação de Tratamento de Águas Residuais -> E.T.A.R
- Estrada -> Estr
- Estrada Municipal -> EM
- Estrada Nacional -> EN
- Estrada Regional -> ER
- Frei -> Fr
- Frente -> Ft
- Futebol Clube -> FC
- Futebol Clube -> F.C
- Guarda Nacional Republicana -> GNR
- Guarda Nacional Republicana -> G.N.R
- General -> Gen
- General -> Gal
- Habitação -> Hab
- Infante -> Inf
- Instituto -> Inst
- Irmã -> Ima
- Irmã -> Imª
- Irmã -> Im
- Irmão -> Imo
- Irmão -> Imº
- Irmão -> Im
- Itinerário Complementar -> IC
- Itinerário Principal -> IP
- Jardim -> Jrd
- Júnior -> Jr
- Largo -> Lg
- Limitada -> Lda
- Loja -> Lj
- Lote -> Lt
- Loteamento -> Loteam
- Lugar -> Lg
- Lugar -> Lug
- Maestro -> Mto
- Major -> Maj
- Marechal -> Mal
- Marquês -> Mq
- Madre -> Me
- Mestre -> Me
- Ministério -> Min
- Monsenhor -> Mons
- Municipal -> M
- Nacional -> N
- Nossa -> N
- Nossa -> Nª
- Nossa Senhora -> Ns
- Nosso -> N
- Número -> N
- Número -> Nº
- Padre -> Pe
- Parque -> Pq
- Particular -> Part
- Pátio -> Pto
- Pavilhão -> Pav
- Polícia de Segurança Pública -> PSP
- Polícia de Segurança Pública -> P.S.P
- Polícia Judiciária -> PJ
- Polícia Judiciária -> P.J
- Praça -> Pc
- Praça -> Pç
- Praça -> Pr
- Praceta -> Pct
- Praceta -> Pctª
- Presidente -> Presid
- Primeiro -> 1º
- Professor -> Prof
- Professora -> Prof
- Professora -> Profª
- Projectada -> Proj
- Projetada -> Proj
- Prolongamento -> Prolng
- Quadra -> Q
- Quadra -> Qd
- Quinta -> Qta
- Regional -> R
- Rés-do-chão -> R/c
- Rés-do-chão -> Rc
- Rotunda -> Rot
- Ribeira -> Rª
- Ribeira -> Rib
- Ribeira -> Ribª
- Rio -> R
- Rua -> R
- Santa -> Sta
- Santa -> Stª
- Santo -> St
- Santo -> Sto
- Santo -> Stº
- São -> S
- Sargento -> Sarg
- Sem Número -> S/n
- Sem Número -> Sn
- Senhor -> S
- Senhor -> Sr
- Senhora -> S
- Senhora -> Sª
- Senhora -> Srª
- Senhora -> Sr.ª
- Senhora -> S.ra
- Senhora -> Sra
- Sobre-Loja -> Slj
- Sociedade -> Soc
- Sociedade Anónima -> SA
- Sociedade Anónima -> S.A
- Sport Clube -> SC
- Sport Clube -> S.C
- Sub-Cave -> Scv
- Superquadra -> Sq
- Tenente -> Ten
- Torre -> Tr
- Transversal -> Transv
- Travessa -> Trav
- Travessa -> Trv
- Travessa -> Tv
- Universidade -> Univ
- Urbanização -> Urb
- Vila -> Vl
- Visconde -> Visc
- Vivenda -> Vv
- Zona -> Zn

View File

@ -0,0 +1,36 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Rom.C3.A2n.C4.83_-_Romanian
- lang: ro
words:
- Aleea -> ale
- Aleea -> al
- Bulevardul -> bulevard
- Bulevardul -> bulev
- Bulevardul -> b-dul
- Bulevardul -> blvd
- Bulevardul -> blv
- Bulevardul -> bdul
- Bulevardul -> bul
- Bulevardul -> bd
- Calea -> cal
- Fundătura -> fnd
- Fundacul -> fdc
- Intrarea -> intr
- Intrarea -> int
- Piața -> p-ța
- Piața -> pța
- Strada -> stra
- Strada -> str
- Stradela -> str-la
- Stradela -> sdla
- Șoseaua -> sos
- Splaiul -> sp
- Splaiul -> splaiul
- Splaiul -> spl
- Vârful -> virful
- Vârful -> virf
- Vârful -> varf
- Vârful -> vf
- Muntele -> m-tele
- Muntele -> m-te
- Muntele -> mnt
- Muntele -> mt

View File

@ -0,0 +1,14 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.D0.A0.D1.83.D1.81.D1.81.D0.BA.D0.B8.D0.B9_-_Russian
- lang: ru
words:
- аллея -> ал
- бульвар -> бул
- набережная -> наб
- переулок -> пер
- площадь -> пл
- проезд -> пр
- проспект -> просп
- шоссе -> ш
- тупик -> туп
- улица -> ул
- область -> обл

View File

@ -0,0 +1,20 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Slovensky_-_Slovak
- lang: sk
words:
- Ulica -> Ul
- Námestie -> Nám
- Svätého, Svätej -> Sv
- Generála -> Gen
- Armádneho generála -> Arm gen
- Doktora, Doktorky -> Dr
- Inžiniera, Inžinierky -> Ing
- Majora -> Mjr
- Profesora, Profesorky -> Prof
- Československej -> Čsl
- Plukovníka -> Plk
- Podplukovníka -> Pplk
- Kapitána -> Kpt
- Poručíka -> Por
- Podporučíka -> Ppor
- Sídlisko -> Sídl
- Nábrežie -> Nábr

View File

@ -0,0 +1,35 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Sloven.C5.A1.C4.8Dina_-_Slovenian
- lang: sl
words:
- Cesta -> C
- Gasilski Dom -> GD
- Osnovna šola -> OŠ
- Prostovoljno Gasilsko Društvo -> PGD
- Savinjski -> Savinj
- Slovenskih -> Slov
- Spodnja -> Sp
- Spodnje -> Sp
- Spodnji -> Sp
- Srednja -> Sr
- Srednje -> Sr
- Srednji -> Sr
- Sveta -> Sv
- Svete -> Sv
- Sveti -> Sv
- Svetega -> Sv
- Šent -> Št
- Ulica -> Ul
- Velika -> V
- Velike -> V
- Veliki -> V
- Veliko -> V
- Velikem -> V
- Velika -> Vel
- Velike -> Vel
- Veliki -> Vel
- Veliko -> Vel
- Velikem -> Vel
- Zdravstveni dom -> ZD
- Zgornja -> Zg
- Zgornje -> Zg
- Zgornji -> Zg

View File

@ -0,0 +1,21 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Svenska_-_Swedish
- lang: sv
words:
- ~väg, ~vägen -> v
- ~gatan, ~gata -> g
- ~gränd, ~gränden -> gr
- gamla -> G:la
- södra -> s
- södra -> s:a
- norra -> n
- norra -> n:a
- östra -> ö
- östra -> ö:a
- västra -> v
- västra -> v:a
- ~stig, ~stigen -> st
- sankt -> s:t
- sankta -> s:ta
- ~plats, ~platsen -> pl
- lilla -> l
- stora -> st

View File

@ -0,0 +1,14 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#T.C3.BCrk.C3.A7e_-_Turkish
- lang: tr
words:
- Sokak -> Sk
- Sokak -> Sok
- Sokağı -> Sk
- Sokağı -> Sok
- Cadde -> Cd
- Caddesi -> Cd
- Bulvar -> Bl
- Bulvar -> Blv
- Bulvarı -> Bl
- Mahalle -> Mh
- Mahalle -> Mah

View File

@ -0,0 +1,10 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.D0.A3.D0.BA.D1.80.D0.B0.D1.97.D0.BD.D1.81.D1.8C.D0.BA.D0.B0_-_Ukrainian
- lang: uk
words:
- бульвар -> бул
- дорога -> дор
- провулок -> пров
- площа -> пл
- проспект -> просп
- шосе -> ш
- вулиця -> вул

View File

@ -0,0 +1,48 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Ti.E1.BA.BFng_Vi.E1.BB.87t_.E2.80.93_Vietnamese
- lang: vi
words:
- Thành phố -> TP
- Thị xã -> TX
- Thị trấn -> TT
- Quận -> Q
- Phường -> P
- Phường -> Ph
- Quốc lộ -> QL
- Tỉnh lộ -> TL
- Đại lộ -> ĐL
- Đường -> Đ
- Công trường -> CT
- Quảng trường -> QT
- Sân bay -> SB
- Sân bay quốc tế -> SBQT
- Phi trường -> PT
- Đường sắt -> ĐS
- Trung tâm -> TT
- Trung tâm Thương mại -> TTTM
- Khách sạn -> KS
- Khách sạn -> K/S
- Bưu điện -> BĐ
- Đại học -> ĐH
- Cao đẳng -> CĐ
- Trung học Phổ thông -> THPT
- Trung học Cơ sở -> THCS
- Tiểu học -> TH
- Khu công nghiệp -> KCN
- Khu nghỉ mát -> KNM
- Khu du lịch -> KDL
- Công viên văn hóa -> CVVH
- Công viên -> CV
- Vườn quốc gia -> VQG
- Viện bảo tàng -> VBT
- Sân vận động -> SVĐ
- Nhà thi đấu -> NTĐ
- Câu lạc bộ -> CLB
- Nhà thờ -> NT
- Nhà hát -> NH
- Rạp hát -> RH
- Công ty -> Cty
- Tổng công ty -> TCty
- Tổng công ty -> TCT
- Công ty cổ phần -> CTCP
- Công ty cổ phần -> Cty CP
- Căn cứ không quân -> CCKQ

File diff suppressed because it is too large

View File

@ -0,0 +1,56 @@
normalization:
- ":: lower ()"
- !include icu-rules/unicode-digits-to-decimal.yaml
- "'№' > 'no'"
- "'n°' > 'no'"
- "'nº' > 'no'"
- "ª > a"
- "º > o"
- "[[:Punctuation:][:Symbol:]] > ' '"
- "ß > 'ss'" # German szet is unimbigiously equal to double ss
- "[^[:Letter:] [:Number:] [:Space:]] >"
- "[:Lm:] >"
- ":: [[:Number:]] Latin ()"
- ":: [[:Number:]] Ascii ();"
- ":: [[:Number:]] NFD ();"
- "[[:Nonspacing Mark:] [:Cf:]] >;"
- "[:Space:]+ > ' '"
transliteration:
- ":: Latin ()"
- !include icu-rules/extended-unicode-to-asccii.yaml
- ":: Ascii ()"
- ":: NFD ()"
- "[^[:Ascii:]] >"
- ":: lower ()"
- ":: NFC ()"
variants:
- !include icu-rules/variants-bg.yaml
- !include icu-rules/variants-ca.yaml
- !include icu-rules/variants-cs.yaml
- !include icu-rules/variants-da.yaml
- !include icu-rules/variants-de.yaml
- !include icu-rules/variants-el.yaml
- !include icu-rules/variants-en.yaml
- !include icu-rules/variants-es.yaml
- !include icu-rules/variants-et.yaml
- !include icu-rules/variants-eu.yaml
- !include icu-rules/variants-fi.yaml
- !include icu-rules/variants-fr.yaml
- !include icu-rules/variants-gl.yaml
- !include icu-rules/variants-hu.yaml
- !include icu-rules/variants-it.yaml
- !include icu-rules/variants-ja.yaml
- !include icu-rules/variants-mg.yaml
- !include icu-rules/variants-ms.yaml
- !include icu-rules/variants-nl.yaml
- !include icu-rules/variants-no.yaml
- !include icu-rules/variants-pl.yaml
- !include icu-rules/variants-pt.yaml
- !include icu-rules/variants-ro.yaml
- !include icu-rules/variants-ru.yaml
- !include icu-rules/variants-sk.yaml
- !include icu-rules/variants-sl.yaml
- !include icu-rules/variants-sv.yaml
- !include icu-rules/variants-tr.yaml
- !include icu-rules/variants-uk.yaml
- !include icu-rules/variants-vi.yaml
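
The loader introduced with this PR turns the configuration above into ICU transliterator rule sets (one for normalizing terms, one for producing the search tokens) and expands the included variant lists into replacement pairs; the behaviour is pinned down by the icu_rule_loader tests further below. The following is a rough sketch of inspecting the generated rules, assuming the ICURuleLoader interface from those tests; the configuration path is an assumption and depends on where the file is installed.

# Hypothetical sketch: inspect the rules produced from the configuration above.
# ICURuleLoader and the PyICU Transliterator calls mirror the tests in this PR;
# the path to the installed configuration file is an assumption.
from pathlib import Path

from icu import Transliterator
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader

loader = ICURuleLoader(Path('settings/legacy_icu_tokenizer.yaml'))

# Normalization and transliteration combined into the search-term rules.
search = Transliterator.createFromRules('search', loader.get_search_rules())
print(search.transliterate(' Baumstraße '))   # expected to yield ' baumstrasse '

# Variant lists expanded into (source, replacement) pairs.
pairs = list(loader.get_replacement_pairs())
print(len(pairs), 'replacement pairs loaded')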

View File

@ -53,7 +53,7 @@ Feature: Import and search of names
Scenario: Special characters in name
Given the places
| osm | class | type | name |
| N1 | place | locality | Jim-Knopf-Str |
| N1 | place | locality | Jim-Knopf-Straße |
| N2 | place | locality | Smith/Weston |
| N3 | place | locality | space mountain |
| N4 | place | locality | space |

View File

@ -214,7 +214,7 @@ def check_search_name_contents(context, exclude):
for name, value in zip(row.headings, row.cells):
if name in ('name_vector', 'nameaddress_vector'):
items = [x.strip() for x in value.split(',')]
tokens = analyzer.get_word_token_info(context.db, items)
tokens = analyzer.get_word_token_info(items)
if not exclude:
assert len(tokens) >= len(items), \

View File

@ -173,6 +173,7 @@ def place_row(place_table, temp_db_cursor):
""" A factory for rows in the place table. The table is created as a
prerequisite to the fixture.
"""
psycopg2.extras.register_hstore(temp_db_cursor)
idseq = itertools.count(1001)
def _insert(osm_type='N', osm_id=None, cls='amenity', typ='cafe', names=None,
admin_level=None, address=None, extratags=None, geom=None):

View File

@ -98,6 +98,13 @@ class MockWordTable:
WHERE class = 'place' and type = 'postcode'""")
return set((row[0] for row in cur))
def get_partial_words(self):
with self.conn.cursor() as cur:
cur.execute("""SELECT word_token, search_name_count FROM word
WHERE class is null and country_code is null
and not word_token like ' %'""")
return set((tuple(row) for row in cur))
class MockPlacexTable:
""" A placex table for testing.

View File

@ -50,3 +50,68 @@ def test_execute_file_with_post_code(dsn, tmp_path, temp_db_cursor):
db_utils.execute_file(dsn, tmpfile, post_code='INSERT INTO test VALUES(23)')
assert temp_db_cursor.row_set('SELECT * FROM test') == {(23, )}
class TestCopyBuffer:
TABLE_NAME = 'copytable'
@pytest.fixture(autouse=True)
def setup_test_table(self, table_factory):
table_factory(self.TABLE_NAME, 'colA INT, colB TEXT')
def table_rows(self, cursor):
return cursor.row_set('SELECT * FROM ' + self.TABLE_NAME)
def test_copybuffer_empty(self):
with db_utils.CopyBuffer() as buf:
buf.copy_out(None, "dummy")
def test_all_columns(self, temp_db_cursor):
with db_utils.CopyBuffer() as buf:
buf.add(3, 'hum')
buf.add(None, 'f\\t')
buf.copy_out(temp_db_cursor, self.TABLE_NAME)
assert self.table_rows(temp_db_cursor) == {(3, 'hum'), (None, 'f\\t')}
def test_selected_columns(self, temp_db_cursor):
with db_utils.CopyBuffer() as buf:
buf.add('foo')
buf.copy_out(temp_db_cursor, self.TABLE_NAME,
columns=['colB'])
assert self.table_rows(temp_db_cursor) == {(None, 'foo')}
def test_reordered_columns(self, temp_db_cursor):
with db_utils.CopyBuffer() as buf:
buf.add('one', 1)
buf.add(' two ', 2)
buf.copy_out(temp_db_cursor, self.TABLE_NAME,
columns=['colB', 'colA'])
assert self.table_rows(temp_db_cursor) == {(1, 'one'), (2, ' two ')}
def test_special_characters(self, temp_db_cursor):
with db_utils.CopyBuffer() as buf:
buf.add('foo\tbar')
buf.add('sun\nson')
buf.add('\\N')
buf.copy_out(temp_db_cursor, self.TABLE_NAME,
columns=['colB'])
assert self.table_rows(temp_db_cursor) == {(None, 'foo\tbar'),
(None, 'sun\nson'),
(None, '\\N')}

View File

@ -0,0 +1,104 @@
"""
Tests for import name normalisation and variant generation.
"""
from textwrap import dedent
import pytest
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.icu_name_processor import ICUNameProcessor, ICUNameProcessorRules
from nominatim.errors import UsageError
@pytest.fixture
def cfgfile(tmp_path, suffix='.yaml'):
def _create_config(*variants, **kwargs):
content = dedent("""\
normalization:
- ":: NFD ()"
- "'🜳' > ' '"
- "[[:Nonspacing Mark:] [:Cf:]] >"
- ":: lower ()"
- "[[:Punctuation:][:Space:]]+ > ' '"
- ":: NFC ()"
transliteration:
- ":: Latin ()"
- "'🜵' > ' '"
""")
content += "variants:\n - words:\n"
content += '\n'.join((" - " + s for s in variants)) + '\n'
for k, v in kwargs.items():
content += " {}: {}\n".format(k, v)
fpath = tmp_path / ('test_config' + suffix)
fpath.write_text(dedent(content))
return fpath
return _create_config
def get_normalized_variants(proc, name):
return proc.get_variants_ascii(proc.get_normalized(name))
def test_variants_empty(cfgfile):
fpath = cfgfile('saint -> 🜵', 'street -> st')
rules = ICUNameProcessorRules(loader=ICURuleLoader(fpath))
proc = ICUNameProcessor(rules)
assert get_normalized_variants(proc, '🜵') == []
assert get_normalized_variants(proc, '🜳') == []
assert get_normalized_variants(proc, 'saint') == ['saint']
VARIANT_TESTS = [
(('~strasse,~straße -> str', '~weg => weg'), "hallo", {'hallo'}),
(('weg => wg',), "holzweg", {'holzweg'}),
(('weg -> wg',), "holzweg", {'holzweg'}),
(('~weg => weg',), "holzweg", {'holz weg', 'holzweg'}),
(('~weg -> weg',), "holzweg", {'holz weg', 'holzweg'}),
(('~weg => w',), "holzweg", {'holz w', 'holzw'}),
(('~weg -> w',), "holzweg", {'holz weg', 'holzweg', 'holz w', 'holzw'}),
(('~weg => weg',), "Meier Weg", {'meier weg', 'meierweg'}),
(('~weg -> weg',), "Meier Weg", {'meier weg', 'meierweg'}),
(('~weg => w',), "Meier Weg", {'meier w', 'meierw'}),
(('~weg -> w',), "Meier Weg", {'meier weg', 'meierweg', 'meier w', 'meierw'}),
(('weg => wg',), "Meier Weg", {'meier wg'}),
(('weg -> wg',), "Meier Weg", {'meier weg', 'meier wg'}),
(('~strasse,~straße -> str', '~weg => weg'), "Bauwegstraße",
{'bauweg straße', 'bauweg str', 'bauwegstraße', 'bauwegstr'}),
(('am => a', 'bach => b'), "am bach", {'a b'}),
(('am => a', '~bach => b'), "am bach", {'a b'}),
(('am -> a', '~bach -> b'), "am bach", {'am bach', 'a bach', 'am b', 'a b'}),
(('am -> a', '~bach -> b'), "ambach", {'ambach', 'am bach', 'amb', 'am b'}),
(('saint -> s,st', 'street -> st'), "Saint Johns Street",
{'saint johns street', 's johns street', 'st johns street',
'saint johns st', 's johns st', 'st johns st'}),
(('river$ -> r',), "River Bend Road", {'river bend road'}),
(('river$ -> r',), "Bent River", {'bent river', 'bent r'}),
(('^north => n',), "North 2nd Street", {'n 2nd street'}),
(('^north => n',), "Airport North", {'airport north'}),
(('am -> a',), "am am am am am am am am", {'am am am am am am am am'}),
(('am => a',), "am am am am am am am am", {'a a a a a a a a'})
]
@pytest.mark.parametrize("rules,name,variants", VARIANT_TESTS)
def test_variants(cfgfile, rules, name, variants):
fpath = cfgfile(*rules)
proc = ICUNameProcessor(ICUNameProcessorRules(loader=ICURuleLoader(fpath)))
result = get_normalized_variants(proc, name)
assert len(result) == len(set(result))
assert set(get_normalized_variants(proc, name)) == variants
def test_search_normalized(cfgfile):
fpath = cfgfile('~street => s,st', 'master => mstr')
rules = ICUNameProcessorRules(loader=ICURuleLoader(fpath))
proc = ICUNameProcessor(rules)
assert proc.get_search_normalized('Master Street') == 'master street'
assert proc.get_search_normalized('Earnes St') == 'earnes st'
assert proc.get_search_normalized('Nostreet') == 'nostreet'

View File

@ -0,0 +1,264 @@
"""
Tests for converting a config file to ICU rules.
"""
import pytest
from textwrap import dedent
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.errors import UsageError
from icu import Transliterator
@pytest.fixture
def cfgfile(tmp_path, suffix='.yaml'):
def _create_config(*variants, **kwargs):
content = dedent("""\
normalization:
- ":: NFD ()"
- "[[:Nonspacing Mark:] [:Cf:]] >"
- ":: lower ()"
- "[[:Punctuation:][:Space:]]+ > ' '"
- ":: NFC ()"
transliteration:
- ":: Latin ()"
- "[[:Punctuation:][:Space:]]+ > ' '"
""")
content += "variants:\n - words:\n"
content += '\n'.join((" - " + s for s in variants)) + '\n'
for k, v in kwargs.items():
content += " {}: {}\n".format(k, v)
fpath = tmp_path / ('test_config' + suffix)
fpath.write_text(dedent(content))
return fpath
return _create_config
def test_empty_rule_file(tmp_path):
fpath = tmp_path / ('test_config.yaml')
fpath.write_text(dedent("""\
normalization:
transliteration:
variants:
"""))
rules = ICURuleLoader(fpath)
assert rules.get_search_rules() == ''
assert rules.get_normalization_rules() == ''
assert rules.get_transliteration_rules() == ''
assert list(rules.get_replacement_pairs()) == []
CONFIG_SECTIONS = ('normalization', 'transliteration', 'variants')
@pytest.mark.parametrize("section", CONFIG_SECTIONS)
def test_missing_normalization(tmp_path, section):
fpath = tmp_path / ('test_config.yaml')
with fpath.open('w') as fd:
for name in CONFIG_SECTIONS:
if name != section:
fd.write(name + ':\n')
with pytest.raises(UsageError):
ICURuleLoader(fpath)
def test_get_search_rules(cfgfile):
loader = ICURuleLoader(cfgfile())
rules = loader.get_search_rules()
trans = Transliterator.createFromRules("test", rules)
assert trans.transliterate(" Baum straße ") == " baum straße "
assert trans.transliterate(" Baumstraße ") == " baumstraße "
assert trans.transliterate(" Baumstrasse ") == " baumstrasse "
assert trans.transliterate(" Baumstr ") == " baumstr "
assert trans.transliterate(" Baumwegstr ") == " baumwegstr "
assert trans.transliterate(" Αθήνα ") == " athēna "
assert trans.transliterate(" проспект ") == " prospekt "
def test_get_normalization_rules(cfgfile):
loader = ICURuleLoader(cfgfile())
rules = loader.get_normalization_rules()
trans = Transliterator.createFromRules("test", rules)
assert trans.transliterate(" проспект-Prospekt ") == " проспект prospekt "
def test_get_transliteration_rules(cfgfile):
loader = ICURuleLoader(cfgfile())
rules = loader.get_transliteration_rules()
trans = Transliterator.createFromRules("test", rules)
assert trans.transliterate(" проспект-Prospekt ") == " prospekt Prospekt "
def test_transliteration_rules_from_file(tmp_path):
cfgpath = tmp_path / ('test_config.yaml')
cfgpath.write_text(dedent("""\
normalization:
transliteration:
- "'ax' > 'b'"
- !include transliteration.yaml
variants:
"""))
transpath = tmp_path / ('transliteration.yaml')
transpath.write_text('- "x > y"')
loader = ICURuleLoader(cfgpath)
rules = loader.get_transliteration_rules()
trans = Transliterator.createFromRules("test", rules)
assert trans.transliterate(" axxt ") == " byt "
class TestGetReplacements:
@pytest.fixture(autouse=True)
def setup_cfg(self, cfgfile):
self.cfgfile = cfgfile
def get_replacements(self, *variants):
loader = ICURuleLoader(self.cfgfile(*variants))
rules = loader.get_replacement_pairs()
return set((v.source, v.replacement) for v in rules)
@pytest.mark.parametrize("variant", ['foo > bar', 'foo -> bar -> bar',
'~foo~ -> bar', 'fo~ o -> bar'])
def test_invalid_variant_description(self, variant):
with pytest.raises(UsageError):
ICURuleLoader(self.cfgfile(variant))
def test_add_full(self):
repl = self.get_replacements("foo -> bar")
assert repl == {(' foo ', ' bar '), (' foo ', ' foo ')}
def test_replace_full(self):
repl = self.get_replacements("foo => bar")
assert repl == {(' foo ', ' bar ')}
def test_add_suffix_no_decompose(self):
repl = self.get_replacements("~berg |-> bg")
assert repl == {('berg ', 'berg '), ('berg ', 'bg '),
(' berg ', ' berg '), (' berg ', ' bg ')}
def test_replace_suffix_no_decompose(self):
repl = self.get_replacements("~berg |=> bg")
assert repl == {('berg ', 'bg '), (' berg ', ' bg ')}
def test_add_suffix_decompose(self):
repl = self.get_replacements("~berg -> bg")
assert repl == {('berg ', 'berg '), ('berg ', ' berg '),
(' berg ', ' berg '), (' berg ', 'berg '),
('berg ', 'bg '), ('berg ', ' bg '),
(' berg ', 'bg '), (' berg ', ' bg ')}
def test_replace_suffix_decompose(self):
repl = self.get_replacements("~berg => bg")
assert repl == {('berg ', 'bg '), ('berg ', ' bg '),
(' berg ', 'bg '), (' berg ', ' bg ')}
def test_add_prefix_no_compose(self):
repl = self.get_replacements("hinter~ |-> hnt")
assert repl == {(' hinter', ' hinter'), (' hinter ', ' hinter '),
(' hinter', ' hnt'), (' hinter ', ' hnt ')}
def test_replace_prefix_no_compose(self):
repl = self.get_replacements("hinter~ |=> hnt")
assert repl == {(' hinter', ' hnt'), (' hinter ', ' hnt ')}
def test_add_prefix_compose(self):
repl = self.get_replacements("hinter~-> h")
assert repl == {(' hinter', ' hinter'), (' hinter', ' hinter '),
(' hinter', ' h'), (' hinter', ' h '),
(' hinter ', ' hinter '), (' hinter ', ' hinter'),
(' hinter ', ' h '), (' hinter ', ' h')}
def test_replace_prefix_compose(self):
repl = self.get_replacements("hinter~=> h")
assert repl == {(' hinter', ' h'), (' hinter', ' h '),
(' hinter ', ' h '), (' hinter ', ' h')}
def test_add_beginning_only(self):
repl = self.get_replacements("^Premier -> Pr")
assert repl == {('^ premier ', '^ premier '), ('^ premier ', '^ pr ')}
def test_replace_beginning_only(self):
repl = self.get_replacements("^Premier => Pr")
assert repl == {('^ premier ', '^ pr ')}
def test_add_final_only(self):
repl = self.get_replacements("road$ -> rd")
assert repl == {(' road ^', ' road ^'), (' road ^', ' rd ^')}
def test_replace_final_only(self):
repl = self.get_replacements("road$ => rd")
assert repl == {(' road ^', ' rd ^')}
def test_decompose_only(self):
repl = self.get_replacements("~foo -> foo")
assert repl == {('foo ', 'foo '), ('foo ', ' foo '),
(' foo ', 'foo '), (' foo ', ' foo ')}
def test_add_suffix_decompose_end_only(self):
repl = self.get_replacements("~berg |-> bg", "~berg$ -> bg")
assert repl == {('berg ', 'berg '), ('berg ', 'bg '),
(' berg ', ' berg '), (' berg ', ' bg '),
('berg ^', 'berg ^'), ('berg ^', ' berg ^'),
('berg ^', 'bg ^'), ('berg ^', ' bg ^'),
(' berg ^', 'berg ^'), (' berg ^', 'bg ^'),
(' berg ^', ' berg ^'), (' berg ^', ' bg ^')}
def test_replace_suffix_decompose_end_only(self):
repl = self.get_replacements("~berg |=> bg", "~berg$ => bg")
assert repl == {('berg ', 'bg '), (' berg ', ' bg '),
('berg ^', 'bg ^'), ('berg ^', ' bg ^'),
(' berg ^', 'bg ^'), (' berg ^', ' bg ^')}
def test_add_multiple_suffix(self):
repl = self.get_replacements("~berg,~burg -> bg")
assert repl == {('berg ', 'berg '), ('berg ', ' berg '),
(' berg ', ' berg '), (' berg ', 'berg '),
('berg ', 'bg '), ('berg ', ' bg '),
(' berg ', 'bg '), (' berg ', ' bg '),
('burg ', 'burg '), ('burg ', ' burg '),
(' burg ', ' burg '), (' burg ', 'burg '),
('burg ', 'bg '), ('burg ', ' bg '),
(' burg ', 'bg '), (' burg ', ' bg ')}

View File

@ -260,7 +260,9 @@ def test_update_special_phrase_modify(analyzer, word_table, make_standard_name):
def test_add_country_names(analyzer, word_table, make_standard_name):
analyzer.add_country_names('de', ['Germany', 'Deutschland', 'germany'])
analyzer.add_country_names('de', {'name': 'Germany',
'name:de': 'Deutschland',
'short_name': 'germany'})
assert word_table.get_country() \
== {('de', ' #germany#'),
@ -272,7 +274,7 @@ def test_add_more_country_names(analyzer, word_table, make_standard_name):
word_table.add_country('it', ' #italy#')
word_table.add_country('it', ' #itala#')
analyzer.add_country_names('it', ['Italy', 'IT'])
analyzer.add_country_names('it', {'name': 'Italy', 'ref': 'IT'})
assert word_table.get_country() \
== {('fr', ' #france#'),

View File

@ -2,10 +2,13 @@
Tests for Legacy ICU tokenizer.
"""
import shutil
import yaml
import pytest
from nominatim.tokenizer import legacy_icu_tokenizer
from nominatim.tokenizer.icu_name_processor import ICUNameProcessorRules
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.db import properties
@ -40,16 +43,10 @@ def tokenizer_factory(dsn, tmp_path, property_table,
@pytest.fixture
def db_prop(temp_db_conn):
def _get_db_property(name):
return properties.get_property(temp_db_conn,
getattr(legacy_icu_tokenizer, name))
return properties.get_property(temp_db_conn, name)
return _get_db_property
@pytest.fixture
def tokenizer_setup(tokenizer_factory, test_config):
tok = tokenizer_factory()
tok.init_new_db(test_config)
@pytest.fixture
def analyzer(tokenizer_factory, test_config, monkeypatch,
@ -62,9 +59,15 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,
tok.init_new_db(test_config)
monkeypatch.undo()
def _mk_analyser(trans=':: upper();', abbr=(('STREET', 'ST'), )):
tok.transliteration = trans
tok.abbreviations = abbr
def _mk_analyser(norm=("[[:Punctuation:][:Space:]]+ > ' '",), trans=(':: upper()',),
variants=('~gasse -> gasse', 'street => st', )):
cfgfile = tmp_path / 'analyser_test_config.yaml'
with cfgfile.open('w') as stream:
cfgstr = {'normalization' : list(norm),
'transliteration' : list(trans),
'variants' : [ {'words': list(variants)}]}
yaml.dump(cfgstr, stream)
tok.naming_rules = ICUNameProcessorRules(loader=ICURuleLoader(cfgfile))
return tok.name_analyzer()
@ -72,10 +75,54 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,
@pytest.fixture
def getorcreate_term_id(temp_db_cursor):
temp_db_cursor.execute("""CREATE OR REPLACE FUNCTION getorcreate_term_id(lookup_term TEXT)
RETURNS INTEGER AS $$
SELECT nextval('seq_word')::INTEGER; $$ LANGUAGE SQL""")
def getorcreate_full_word(temp_db_cursor):
temp_db_cursor.execute("""CREATE OR REPLACE FUNCTION getorcreate_full_word(
norm_term TEXT, lookup_terms TEXT[],
OUT full_token INT,
OUT partial_tokens INT[])
AS $$
DECLARE
partial_terms TEXT[] = '{}'::TEXT[];
term TEXT;
term_id INTEGER;
term_count INTEGER;
BEGIN
SELECT min(word_id) INTO full_token
FROM word WHERE word = norm_term and class is null and country_code is null;
IF full_token IS NULL THEN
full_token := nextval('seq_word');
INSERT INTO word (word_id, word_token, word, search_name_count)
SELECT full_token, ' ' || lookup_term, norm_term, 0 FROM unnest(lookup_terms) as lookup_term;
END IF;
FOR term IN SELECT unnest(string_to_array(unnest(lookup_terms), ' ')) LOOP
term := trim(term);
IF NOT (ARRAY[term] <@ partial_terms) THEN
partial_terms := partial_terms || term;
END IF;
END LOOP;
partial_tokens := '{}'::INT[];
FOR term IN SELECT unnest(partial_terms) LOOP
SELECT min(word_id), max(search_name_count) INTO term_id, term_count
FROM word WHERE word_token = term and class is null and country_code is null;
IF term_id IS NULL THEN
term_id := nextval('seq_word');
term_count := 0;
INSERT INTO word (word_id, word_token, search_name_count)
VALUES (term_id, term, 0);
END IF;
IF NOT (ARRAY[term_id] <@ partial_tokens) THEN
partial_tokens := partial_tokens || term_id;
END IF;
END LOOP;
END;
$$
LANGUAGE plpgsql;
""")
@pytest.fixture
@ -91,19 +138,37 @@ def test_init_new(tokenizer_factory, test_config, monkeypatch, db_prop):
tok = tokenizer_factory()
tok.init_new_db(test_config)
assert db_prop('DBCFG_NORMALIZATION') == ':: lower();'
assert db_prop('DBCFG_TRANSLITERATION') is not None
assert db_prop('DBCFG_ABBREVIATIONS') is not None
assert db_prop(legacy_icu_tokenizer.DBCFG_TERM_NORMALIZATION) == ':: lower();'
assert db_prop(legacy_icu_tokenizer.DBCFG_MAXWORDFREQ) is not None
def test_init_from_project(tokenizer_setup, tokenizer_factory):
def test_init_word_table(tokenizer_factory, test_config, place_row, word_table):
place_row(names={'name' : 'Test Area', 'ref' : '52'})
place_row(names={'name' : 'No Area'})
place_row(names={'name' : 'Holzstrasse'})
tok = tokenizer_factory()
tok.init_new_db(test_config)
assert word_table.get_partial_words() == {('test', 1),
('no', 1), ('area', 2),
('holz', 1), ('strasse', 1),
('str', 1)}
def test_init_from_project(monkeypatch, test_config, tokenizer_factory):
monkeypatch.setenv('NOMINATIM_TERM_NORMALIZATION', ':: lower();')
monkeypatch.setenv('NOMINATIM_MAX_WORD_FREQUENCY', '90300')
tok = tokenizer_factory()
tok.init_new_db(test_config)
monkeypatch.undo()
tok = tokenizer_factory()
tok.init_from_project()
assert tok.normalization is not None
assert tok.transliteration is not None
assert tok.abbreviations is not None
assert tok.naming_rules is not None
assert tok.term_normalization == ':: lower();'
assert tok.max_word_frequency == '90300'
def test_update_sql_functions(db_prop, temp_db_cursor,
@ -114,7 +179,7 @@ def test_update_sql_functions(db_prop, temp_db_cursor,
tok.init_new_db(test_config)
monkeypatch.undo()
assert db_prop('DBCFG_MAXWORDFREQ') == '1133'
assert db_prop(legacy_icu_tokenizer.DBCFG_MAXWORDFREQ) == '1133'
table_factory('test', 'txt TEXT')
@ -127,18 +192,11 @@ def test_update_sql_functions(db_prop, temp_db_cursor,
assert test_content == set((('1133', ), ))
def test_make_standard_word(analyzer):
with analyzer(abbr=(('STREET', 'ST'), ('tiny', 't'))) as anl:
assert anl.make_standard_word('tiny street') == 'TINY ST'
with analyzer(abbr=(('STRASSE', 'STR'), ('STR', 'ST'))) as anl:
assert anl.make_standard_word('Hauptstrasse') == 'HAUPTST'
def test_make_standard_hnr(analyzer):
with analyzer(abbr=(('IV', '4'),)) as anl:
assert anl._make_standard_hnr('345') == '345'
assert anl._make_standard_hnr('iv') == 'IV'
def test_normalize_postcode(analyzer):
with analyzer() as anl:
anl.normalize_postcode('123') == '123'
anl.normalize_postcode('ab-34 ') == 'AB-34'
anl.normalize_postcode('38 Б') == '38 Б'
def test_update_postcodes_from_db_empty(analyzer, table_factory, word_table):
@ -168,15 +226,15 @@ def test_update_postcodes_from_db_add_and_remove(analyzer, table_factory, word_t
def test_update_special_phrase_empty_table(analyzer, word_table):
with analyzer() as anl:
anl.update_special_phrases([
("König bei", "amenity", "royal", "near"),
("Könige", "amenity", "royal", "-"),
("König bei", "amenity", "royal", "near"),
("Könige ", "amenity", "royal", "-"),
("street", "highway", "primary", "in")
], True)
assert word_table.get_special() \
== {(' KÖNIG BEI', 'könig bei', 'amenity', 'royal', 'near'),
(' KÖNIGE', 'könige', 'amenity', 'royal', None),
(' ST', 'street', 'highway', 'primary', 'in')}
== {(' KÖNIG BEI', 'König bei', 'amenity', 'royal', 'near'),
(' KÖNIGE', 'Könige', 'amenity', 'royal', None),
(' STREET', 'street', 'highway', 'primary', 'in')}
def test_update_special_phrase_delete_all(analyzer, word_table):
@ -222,66 +280,188 @@ def test_update_special_phrase_modify(analyzer, word_table):
(' GARDEN', 'garden', 'leisure', 'garden', 'near')}
def test_process_place_names(analyzer, getorcreate_term_id):
def test_add_country_names_new(analyzer, word_table):
with analyzer() as anl:
info = anl.process_place({'name' : {'name' : 'Soft bAr', 'ref': '34'}})
anl.add_country_names('es', {'name': 'Espagña', 'name:en': 'Spain'})
assert info['names'] == '{1,2,3,4,5}'
assert word_table.get_country() == {('es', ' ESPAGÑA'), ('es', ' SPAIN')}
@pytest.mark.parametrize('sep', [',' , ';'])
def test_full_names_with_separator(analyzer, getorcreate_term_id, sep):
def test_add_country_names_extend(analyzer, word_table):
word_table.add_country('ch', ' SCHWEIZ')
with analyzer() as anl:
names = anl._compute_full_names({'name' : sep.join(('New York', 'Big Apple'))})
anl.add_country_names('ch', {'name': 'Schweiz', 'name:fr': 'Suisse'})
assert names == set(('NEW YORK', 'BIG APPLE'))
assert word_table.get_country() == {('ch', ' SCHWEIZ'), ('ch', ' SUISSE')}
def test_full_names_with_bracket(analyzer, getorcreate_term_id):
with analyzer() as anl:
names = anl._compute_full_names({'name' : 'Houseboat (left)'})
class TestPlaceNames:
assert names == set(('HOUSEBOAT (LEFT)', 'HOUSEBOAT'))
@pytest.fixture(autouse=True)
def setup(self, analyzer, getorcreate_full_word):
with analyzer() as anl:
self.analyzer = anl
yield anl
@pytest.mark.parametrize('pcode', ['12345', 'AB 123', '34-345'])
def test_process_place_postcode(analyzer, word_table, pcode):
with analyzer() as anl:
anl.process_place({'address': {'postcode' : pcode}})
def expect_name_terms(self, info, *expected_terms):
tokens = self.analyzer.get_word_token_info(expected_terms)
for token in tokens:
assert token[2] is not None, "No token for {0}".format(token)
assert word_table.get_postcodes() == {pcode, }
assert eval(info['names']) == set((t[2] for t in tokens))
@pytest.mark.parametrize('pcode', ['12:23', 'ab;cd;f', '123;836'])
def test_process_place_bad_postcode(analyzer, word_table, pcode):
with analyzer() as anl:
anl.process_place({'address': {'postcode' : pcode}})
def test_simple_names(self):
info = self.analyzer.process_place({'name': {'name': 'Soft bAr', 'ref': '34'}})
assert not word_table.get_postcodes()
self.expect_name_terms(info, '#Soft bAr', '#34','Soft', 'bAr', '34')
@pytest.mark.parametrize('hnr', ['123a', '1', '101'])
def test_process_place_housenumbers_simple(analyzer, hnr, getorcreate_hnr_id):
with analyzer() as anl:
info = anl.process_place({'address': {'housenumber' : hnr}})
@pytest.mark.parametrize('sep', [',' , ';'])
def test_names_with_separator(self, sep):
info = self.analyzer.process_place({'name': {'name': sep.join(('New York', 'Big Apple'))}})
assert info['hnr'] == hnr.upper()
assert info['hnr_tokens'] == "{-1}"
self.expect_name_terms(info, '#New York', '#Big Apple',
'new', 'york', 'big', 'apple')
def test_process_place_housenumbers_lists(analyzer, getorcreate_hnr_id):
with analyzer() as anl:
info = anl.process_place({'address': {'conscriptionnumber' : '1; 2;3'}})
def test_full_names_with_bracket(self):
info = self.analyzer.process_place({'name': {'name': 'Houseboat (left)'}})
assert set(info['hnr'].split(';')) == set(('1', '2', '3'))
assert info['hnr_tokens'] == "{-1,-2,-3}"
self.expect_name_terms(info, '#Houseboat (left)', '#Houseboat',
'houseboat', 'left')
def test_process_place_housenumbers_duplicates(analyzer, getorcreate_hnr_id):
with analyzer() as anl:
info = anl.process_place({'address': {'housenumber' : '134',
'conscriptionnumber' : '134',
'streetnumber' : '99a'}})
def test_country_name(self, word_table):
info = self.analyzer.process_place({'name': {'name': 'Norge'},
'country_feature': 'no'})
self.expect_name_terms(info, '#norge', 'norge')
assert word_table.get_country() == {('no', ' NORGE')}
class TestPlaceAddress:
@pytest.fixture(autouse=True)
def setup(self, analyzer, getorcreate_full_word):
with analyzer(trans=(":: upper()", "'🜵' > ' '")) as anl:
self.analyzer = anl
yield anl
def process_address(self, **kwargs):
return self.analyzer.process_place({'address': kwargs})
def name_token_set(self, *expected_terms):
tokens = self.analyzer.get_word_token_info(expected_terms)
for token in tokens:
assert token[2] is not None, "No token for {0}".format(token)
return set((t[2] for t in tokens))
@pytest.mark.parametrize('pcode', ['12345', 'AB 123', '34-345'])
def test_process_place_postcode(self, word_table, pcode):
self.process_address(postcode=pcode)
assert word_table.get_postcodes() == {pcode, }
@pytest.mark.parametrize('pcode', ['12:23', 'ab;cd;f', '123;836'])
def test_process_place_bad_postcode(self, word_table, pcode):
self.process_address(postcode=pcode)
assert not word_table.get_postcodes()
@pytest.mark.parametrize('hnr', ['123a', '1', '101'])
def test_process_place_housenumbers_simple(self, hnr, getorcreate_hnr_id):
info = self.process_address(housenumber=hnr)
assert info['hnr'] == hnr.upper()
assert info['hnr_tokens'] == "{-1}"
def test_process_place_housenumbers_lists(self, getorcreate_hnr_id):
info = self.process_address(conscriptionnumber='1; 2;3')
assert set(info['hnr'].split(';')) == set(('1', '2', '3'))
assert info['hnr_tokens'] == "{-1,-2,-3}"
def test_process_place_housenumbers_duplicates(self, getorcreate_hnr_id):
info = self.process_address(housenumber='134',
conscriptionnumber='134',
streetnumber='99a')
assert set(info['hnr'].split(';')) == set(('134', '99A'))
assert info['hnr_tokens'] == "{-1,-2}"
def test_process_place_housenumbers_cached(self, getorcreate_hnr_id):
info = self.process_address(housenumber="45")
assert info['hnr_tokens'] == "{-1}"
info = self.process_address(housenumber="46")
assert info['hnr_tokens'] == "{-2}"
info = self.process_address(housenumber="41;45")
assert eval(info['hnr_tokens']) == {-1, -3}
info = self.process_address(housenumber="41")
assert eval(info['hnr_tokens']) == {-3}
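# --- Illustrative sketch (an assumption, not the tokenizer's actual code) ---
# The caching behaviour asserted above can be pictured as a small in-memory map
# that hands out the next negative token id for every new normalized house
# number and reuses the id for numbers seen before.
class HnrCacheSketch:
    def __init__(self):
        self._ids = {}  # normalized house number -> negative token id

    def tokens_for(self, housenumbers):
        tokens = set()
        for hnr in housenumbers:
            if hnr not in self._ids:
                self._ids[hnr] = -(len(self._ids) + 1)
            tokens.add(self._ids[hnr])
        return tokens
# Fed the same sequence as the test above ('45', '46', then '41;45', then '41'),
# this sketch produces {-1}, {-2}, {-1, -3} and {-3}, matching the assertions.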
def test_process_place_street(self):
info = self.process_address(street='Grand Road')
assert eval(info['street']) == self.name_token_set('#GRAND ROAD')
def test_process_place_street_empty(self):
info = self.process_address(street='🜵')
assert 'street' not in info
def test_process_place_place(self):
info = self.process_address(place='Honu Lulu')
assert eval(info['place_search']) == self.name_token_set('#HONU LULU',
'HONU', 'LULU')
assert eval(info['place_match']) == self.name_token_set('#HONU LULU')
def test_process_place_place_empty(self):
info = self.process_address(place='🜵')
assert 'place_search' not in info
assert 'place_match' not in info
def test_process_place_address_terms(self):
info = self.process_address(country='de', city='Zwickau', state='Sachsen',
suburb='Zwickau', street='Hauptstr',
full='right behind the church')
city_full = self.name_token_set('#ZWICKAU')
city_all = self.name_token_set('#ZWICKAU', 'ZWICKAU')
state_full = self.name_token_set('#SACHSEN')
state_all = self.name_token_set('#SACHSEN', 'SACHSEN')
result = {k: [eval(v[0]), eval(v[1])] for k,v in info['addr'].items()}
assert result == {'city': [city_all, city_full],
'suburb': [city_all, city_full],
'state': [state_all, state_full]}
def test_process_place_address_terms_empty(self):
info = self.process_address(country='de', city=' ', street='Hauptstr',
full='right behind the church')
assert 'addr' not in info
assert set(info['hnr'].split(';')) == set(('134', '99A'))
assert info['hnr_tokens'] == "{-1,-2}"
View File
@ -180,7 +180,7 @@ def test_create_country_names(temp_db_with_extensions, temp_db_conn, temp_db_cur
assert len(tokenizer.analyser_cache['countries']) == 2
result_set = {k: set(v) for k, v in tokenizer.analyser_cache['countries']}
result_set = {k: set(v.values()) for k, v in tokenizer.analyser_cache['countries']}
if languages:
assert result_set == {'us' : set(('us', 'us1', 'United States')),
View File
@ -42,7 +42,7 @@
python3-pip python3-setuptools python3-devel \
expat-devel zlib-devel libicu-dev
pip3 install --user psycopg2 python-dotenv psutil Jinja2 PyICU
pip3 install --user psycopg2 python-dotenv psutil Jinja2 PyICU datrie
#
View File
@ -35,7 +35,7 @@
python3-pip python3-setuptools python3-devel \
expat-devel zlib-devel libicu-dev
pip3 install --user psycopg2 python-dotenv psutil Jinja2 PyICU
pip3 install --user psycopg2 python-dotenv psutil Jinja2 PyICU datrie
#
View File
@ -32,10 +32,10 @@ export DEBIAN_FRONTEND=noninteractive #DOCS:
php php-pgsql php-intl libicu-dev python3-pip \
python3-psycopg2 python3-psutil python3-jinja2 python3-icu git
# The python-dotenv package that comes with Ubuntu 18.04 is too old, so
# The python-dotenv and datrie packages that come with Ubuntu 18.04 are too old, so
# install the latest versions from pip:
pip3 install python-dotenv
pip3 install python-dotenv datrie
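# (Suggested sanity check, not part of the change itself: `pip3 show` prints
# the version and install location of the pip-installed packages, which makes
# it easy to confirm they are newer than the distribution ones.)
pip3 show python-dotenv datrie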
#
# System Configuration
View File
@ -33,7 +33,8 @@ export DEBIAN_FRONTEND=noninteractive #DOCS:
postgresql-server-dev-12 postgresql-12-postgis-3 \
postgresql-contrib-12 postgresql-12-postgis-3-scripts \
php php-pgsql php-intl libicu-dev python3-dotenv \
python3-psycopg2 python3-psutil python3-jinja2 python3-icu git
python3-psycopg2 python3-psutil python3-jinja2 \
python3-icu python3-datrie git
#
# System Configuration