Merge pull request #2381 from lonvia/reorganise-abbreviations

Reorganise abbreviation handling
This commit is contained in:
Sarah Hoffmann 2021-07-05 10:32:16 +02:00 committed by GitHub
commit d4c7bf20a2
64 changed files with 8671 additions and 5320 deletions


@ -1,13 +1,26 @@
name: 'Build Nominatim'
inputs:
ubuntu:
description: 'Version of Ubuntu to install on'
required: false
default: '20'
runs:
using: "composite"
steps:
- name: Install prerequisites
run: |
sudo apt-get install -y -qq libboost-system-dev libboost-filesystem-dev libexpat1-dev zlib1g-dev libbz2-dev libpq-dev libproj-dev libicu-dev python3-psycopg2 python3-pyosmium python3-dotenv python3-psutil python3-jinja2 python3-icu
sudo apt-get install -y -qq libboost-system-dev libboost-filesystem-dev libexpat1-dev zlib1g-dev libbz2-dev libpq-dev libproj-dev libicu-dev
if [ "x$UBUNTUVER" == "x18" ]; then
pip3 install python-dotenv psycopg2==2.7.7 jinja2==2.8 psutil==5.4.2 pyicu osmium
else
sudo apt-get install -y -qq python3-icu python3-datrie python3-pyosmium python3-jinja2 python3-psutil python3-psycopg2 python3-dotenv
fi
shell: bash
env:
UBUNTUVER: ${{ inputs.ubuntu }}
- name: Download dependencies
run: |


@ -134,13 +134,8 @@ jobs:
postgresql-version: ${{ matrix.postgresql }}
postgis-version: ${{ matrix.postgis }}
- uses: ./Nominatim/.github/actions/build-nominatim
- name: Install extra dependencies for Ubuntu 18
run: |
sudo apt-get install libicu-dev
pip3 install python-dotenv psycopg2==2.7.7 jinja2==2.8 psutil==5.4.2 pyicu osmium
shell: bash
if: matrix.ubuntu == 18
with:
ubuntu: ${{ matrix.ubuntu }}
- name: Clean installation
run: rm -rf Nominatim build


@ -1,7 +1,7 @@
[MASTER]
extension-pkg-whitelist=osmium
ignored-modules=icu
ignored-modules=icu,datrie
[MESSAGES CONTROL]


@ -258,5 +258,6 @@ install(FILES settings/env.defaults
settings/import-address.style
settings/import-full.style
settings/import-extratags.style
settings/legacy_icu_tokenizer.json
settings/legacy_icu_tokenizer.yaml
settings/icu-rules/extended-unicode-to-asccii.yaml
DESTINATION ${NOMINATIM_CONFIGDIR})


@ -45,6 +45,7 @@ For running Nominatim:
* [psutil](https://github.com/giampaolo/psutil)
* [Jinja2](https://palletsprojects.com/p/jinja/)
* [PyICU](https://pypi.org/project/PyICU/)
* [datrie](https://github.com/pytries/datrie)
* [PHP](https://php.net) (7.0 or later)
* PHP-pgsql
* PHP-intl (bundled with PHP)

docs/admin/Tokenizers.md (new file, 205 lines)

@ -0,0 +1,205 @@
# Tokenizers
The tokenizer module in Nominatim is responsible for analysing the names given
to OSM objects and the terms of an incoming query in order to make sure they
can be matched appropriately.
Nominatim offers different tokenizer modules, which behave differently and have
different configuration options. This section describes the tokenizers and how
they can be configured.
!!! important
The use of a tokenizer is tied to a database installation. You need to choose
and configure the tokenizer before starting the initial import. Once the import
is done, you cannot switch to another tokenizer anymore. The options for
reconfiguring the chosen tokenizer afterwards are also very limited. See the
comments in each tokenizer
section.
## Legacy tokenizer
The legacy tokenizer implements the analysis algorithms of older Nominatim
versions. It uses a special PostgreSQL module to normalize names and queries.
This tokenizer is currently the default.
To enable the tokenizer add the following line to your project configuration:
```
NOMINATIM_TOKENIZER=legacy
```
The PostgreSQL module for the tokenizer is available in the `module` directory
and also installed with the remainder of the software under
`lib/nominatim/module/nominatim.so`. You can specify a custom location for
the module with
```
NOMINATIM_DATABASE_MODULE_PATH=<path to directory where nominatim.so resides>
```
This is particularly useful when the database runs on a different server.
See [Advanced installations](Advanced-Installations.md#importing-nominatim-to-an-external-postgresql-database) for details.
There are no other configuration options for the legacy tokenizer. All
normalization functions are hard-coded.
## ICU tokenizer
!!! danger
This tokenizer is currently in active development and still subject
to backwards-incompatible changes.
The ICU tokenizer uses the [ICU library](http://site.icu-project.org/) to
normalize names and queries. It also offers configurable decomposition and
abbreviation handling.
### How it works
On import the tokenizer processes names in the following four stages:
1. The **Normalization** part removes all non-relevant information from the
input.
2. Incoming names are now converted to **full names**. This process is currently
hard coded and mostly serves to handle name tags from OSM that contain
multiple names (e.g. [Biel/Bienne](https://www.openstreetmap.org/node/240097197)).
3. Next the tokenizer creates **variants** from the full names. These variants
cover decomposition and abbreviation handling. Variants are saved to the
database, so that it is not necessary to create the variants for a search
query.
4. The final **Tokenization** step converts the names to a simple ASCII form,
potentially removing further spelling variants for better matching.
At query time, only stages 1) and 4) are used. The query is normalized and
tokenized, and the resulting string is used for searching in the database.
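For illustration, these stages map onto the `ICUNameProcessor` API that this pull
request adds in `nominatim/tokenizer/icu_name_processor.py`. A minimal sketch,
assuming the `nominatim` package and PyICU are importable; the configuration path
is a placeholder:

```python
# Rough sketch of the import-time pipeline (stages 1, 3 and 4).
from pathlib import Path

from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.icu_name_processor import (ICUNameProcessor,
                                                    ICUNameProcessorRules)

rules = ICUNameProcessorRules(loader=ICURuleLoader(Path('legacy_icu_tokenizer.yaml')))
proc = ICUNameProcessor(rules)

norm = proc.get_normalized('Hauptstraße')               # stage 1: normalization
name_variants = proc.get_variants_ascii(norm)           # stages 3+4: variants, ASCII form
query_term = proc.get_search_normalized('Hauptstraße')  # normalization used for queries
```

Stage 2, the splitting of multi-name tags into individual full names, happens in
the name analyzer and is not part of this processor.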
### Configuration
The ICU tokenizer is configured using a YAML file whose location can be set via
`NOMINATIM_TOKENIZER_CONFIG`. The configuration is read on import and then
saved as part of the internal database status. Later changes to the variable
have no effect.
Here is an example configuration file:
``` yaml
normalization:
- ":: lower ()"
- "ß > 'ss'" # German szet is unimbigiously equal to double ss
transliteration:
- !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml
- ":: Ascii ()"
variants:
- language: de
words:
- ~haus => haus
- ~strasse -> str
- language: en
words:
- road -> rd
- bridge -> bdge,br,brdg,bri,brg
```
The configuration file contains three sections:
`normalization`, `transliteration`, `variants`.
The normalization and transliteration sections each must contain a list of
[ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html).
The rules are applied in the order in which they appear in the file.
You can also include additional rules from an external YAML file using the
`!include` tag. The included file must contain a valid YAML list of ICU rules
and may again include other files.
!!! warning
The ICU rule syntax contains special characters that conflict with the
YAML syntax. You should therefore always enclose the ICU rules in
double-quotes.
The variants section defines lists of replacements which create alternative
spellings of a name. To create the variants, a name is scanned from left to
right and the longest matching replacement is applied until the end of the
string is reached.
The variants section must contain a list of replacement groups. Each group
defines a set of properties that describes where the replacements are
applicable. In addition, the `words` section defines the list of replacements
to be made. The basic replacement description is of the form:
```
<source>[,<source>[...]] => <target>[,<target>[...]]
```
The left side contains one or more `source` terms to be replaced. The right side
lists one or more replacements. Each source is replaced with each replacement
term.
!!! tip
The source and target terms are internally normalized using the
normalization rules given in the configuration. This ensures that the
strings match as expected. In fact, it is better to use unnormalized
words in the configuration because then it is possible to change the
rules for normalization later without having to adapt the variant rules.
#### Decomposition
In its standard form, only full words match against the source. There
is a special notation to match the prefix and suffix of a word:
``` yaml
- ~strasse => str # matches "strasse" as full word and in suffix position
- hinter~ => hntr # matches "hinter" as full word and in prefix position
```
There is no facility to match a string in the middle of the word. The suffix
and prefix notation automatically trigger the decomposition mode: two variants
are created for each replacement, one with the replacement attached to the word
and one separate. So in the above example, the tokenization of "hauptstrasse" will
create the variants "hauptstr" and "haupt str". Similarly, the name "rote strasse"
triggers the variants "rote str" and "rotestr". By having decomposition work
both ways, it is sufficient to create the variants at index time. The variant
rules are not applied at query time.
To avoid automatic decomposition, use the '|' notation:
``` yaml
- ~strasse |=> str
```
simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str".
#### Initial and final terms
It is also possible to restrict replacements to the beginning and end of a
name:
``` yaml
- ^south => s # matches only at the beginning of the name
- road$ => rd # matches only at the end of the name
```
So the first example would trigger a replacement for "south 45th street" but
not for "the south beach restaurant".
#### Replacements vs. variants
The replacement syntax `source => target` works as a pure replacement. It changes
the name instead of creating a variant. To create an additional version, you'd
have to write `source => source,target`. As this is a frequent case, there is
a shortcut notation for it:
```
<source>[,<source>[...]] -> <target>[,<target>[...]]
```
The simple arrow causes an additional variant to be added. Note that
decomposition has an effect here on the source as well. So a rule
```yaml
- ~strasse -> str
```
means that for a word like `hauptstrasse` four variants are created:
`hauptstrasse`, `haupt strasse`, `hauptstr` and `haupt str`.
### Reconfiguration
Changing the configuration after the import is currently not possible, although
this feature may be added at a later time.


@ -20,6 +20,7 @@ pages:
- 'Update' : 'admin/Update.md'
- 'Deploy' : 'admin/Deployment.md'
- 'Customize Imports' : 'admin/Customization.md'
- 'Tokenizers' : 'admin/Tokenizers.md'
- 'Nominatim UI' : 'admin/Setup-Nominatim-UI.md'
- 'Advanced Installations' : 'admin/Advanced-Installations.md'
- 'Migration from older Versions' : 'admin/Migration.md'


@ -47,9 +47,7 @@ class Tokenizer
private function makeStandardWord($sTerm)
{
$sNorm = ' '.$this->oTransliterator->transliterate($sTerm).' ';
return trim(str_replace(CONST_Abbreviations[0], CONST_Abbreviations[1], $sNorm));
return trim($this->oTransliterator->transliterate(' '.$sTerm.' '));
}
@ -90,6 +88,7 @@ class Tokenizer
foreach ($aPhrases as $iPhrase => $oPhrase) {
$sNormQuery .= ','.$this->normalizeString($oPhrase->getPhrase());
$sPhrase = $this->makeStandardWord($oPhrase->getPhrase());
Debug::printVar('Phrase', $sPhrase);
if (strlen($sPhrase) > 0) {
$aWords = explode(' ', $sPhrase);
Tokenizer::addTokens($aTokens, $aWords);


@ -87,25 +87,48 @@ $$ LANGUAGE SQL IMMUTABLE STRICT;
--------------- private functions ----------------------------------------------
CREATE OR REPLACE FUNCTION getorcreate_term_id(lookup_term TEXT)
RETURNS INTEGER
CREATE OR REPLACE FUNCTION getorcreate_full_word(norm_term TEXT, lookup_terms TEXT[],
OUT full_token INT,
OUT partial_tokens INT[])
AS $$
DECLARE
return_id INTEGER;
partial_terms TEXT[] = '{}'::TEXT[];
term TEXT;
term_id INTEGER;
term_count INTEGER;
BEGIN
SELECT min(word_id), max(search_name_count) INTO return_id, term_count
FROM word WHERE word_token = lookup_term and class is null and type is null;
SELECT min(word_id) INTO full_token
FROM word WHERE word = norm_term and class is null and country_code is null;
IF return_id IS NULL THEN
return_id := nextval('seq_word');
INSERT INTO word (word_id, word_token, search_name_count)
VALUES (return_id, lookup_term, 0);
ELSEIF left(lookup_term, 1) = ' ' and term_count > {{ max_word_freq }} THEN
return_id := 0;
IF full_token IS NULL THEN
full_token := nextval('seq_word');
INSERT INTO word (word_id, word_token, word, search_name_count)
SELECT full_token, ' ' || lookup_term, norm_term, 0 FROM unnest(lookup_terms) as lookup_term;
END IF;
RETURN return_id;
FOR term IN SELECT unnest(string_to_array(unnest(lookup_terms), ' ')) LOOP
term := trim(term);
IF NOT (ARRAY[term] <@ partial_terms) THEN
partial_terms := partial_terms || term;
END IF;
END LOOP;
partial_tokens := '{}'::INT[];
FOR term IN SELECT unnest(partial_terms) LOOP
SELECT min(word_id), max(search_name_count) INTO term_id, term_count
FROM word WHERE word_token = term and class is null and country_code is null;
IF term_id IS NULL THEN
term_id := nextval('seq_word');
term_count := 0;
INSERT INTO word (word_id, word_token, search_name_count)
VALUES (term_id, term, 0);
END IF;
IF term_count < {{ max_word_freq }} THEN
partial_tokens := array_merge(partial_tokens, ARRAY[term_id]);
END IF;
END LOOP;
END;
$$
LANGUAGE plpgsql;
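For reference, the ICU name analyzer further down in this diff calls the new
function as sketched below; the DSN and the example terms are placeholders only.

```python
# Hedged sketch of how getorcreate_full_word() is invoked from Python
# (mirrors _compute_name_tokens() in the legacy_icu tokenizer below).
from nominatim.db.connection import connect

norm_name = 'hauptstrasse'                       # placeholder normalized name
variants = ['hauptstrasse', 'haupt strasse', 'hauptstr', 'haupt str']

with connect('dbname=nominatim') as conn:        # placeholder DSN
    with conn.cursor() as cur:
        cur.execute("SELECT (getorcreate_full_word(%s, %s)).*",
                    (norm_name, variants))
        full_token, partial_tokens = cur.fetchone()  # INT and INT[] of word ids
```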


@ -4,6 +4,7 @@ Helper functions for handling DB accesses.
import subprocess
import logging
import gzip
import io
from nominatim.db.connection import get_pg_env
from nominatim.errors import UsageError
@ -57,3 +58,49 @@ def execute_file(dsn, fname, ignore_errors=False, pre_code=None, post_code=None)
if ret != 0 or remain > 0:
raise UsageError("Failed to execute SQL file.")
# List of characters that need to be quoted for the copy command.
_SQL_TRANSLATION = {ord(u'\\') : u'\\\\',
ord(u'\t') : u'\\t',
ord(u'\n') : u'\\n'}
class CopyBuffer:
""" Data collector for the copy_from command.
"""
def __init__(self):
self.buffer = io.StringIO()
def __enter__(self):
return self
def __exit__(self, exc_type, exc_value, traceback):
if self.buffer is not None:
self.buffer.close()
def add(self, *data):
""" Add another row of data to the copy buffer.
"""
first = True
for column in data:
if first:
first = False
else:
self.buffer.write('\t')
if column is None:
self.buffer.write('\\N')
else:
self.buffer.write(str(column).translate(_SQL_TRANSLATION))
self.buffer.write('\n')
def copy_out(self, cur, table, columns=None):
""" Copy all collected data into the given table.
"""
if self.buffer.tell() > 0:
self.buffer.seek(0)
cur.copy_from(self.buffer, table, columns=columns)
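A short usage sketch, mirroring how the ICU tokenizer below fills the `word`
table; the DSN and the word counts are made up for illustration.

```python
# Hedged sketch: collect rows in a CopyBuffer and bulk-load them via COPY.
from nominatim.db.connection import connect
from nominatim.db.utils import CopyBuffer

word_counts = {'haupt': 42, 'str': 17}           # word_token -> frequency

with connect('dbname=nominatim') as conn:        # placeholder DSN
    with CopyBuffer() as copystr:
        for token, count in word_counts.items():
            copystr.add(token, count)            # one row per call; None becomes \N
        with conn.cursor() as cur:
            copystr.copy_out(cur, 'word',
                             columns=['word_token', 'search_name_count'])
    conn.commit()
```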


@ -0,0 +1,142 @@
"""
Processor for names that are imported into the database based on the
ICU library.
"""
from collections import defaultdict
import itertools
from icu import Transliterator
import datrie
from nominatim.db.properties import set_property, get_property
from nominatim.tokenizer import icu_variants as variants
DBCFG_IMPORT_NORM_RULES = "tokenizer_import_normalisation"
DBCFG_IMPORT_TRANS_RULES = "tokenizer_import_transliteration"
DBCFG_IMPORT_REPLACEMENTS = "tokenizer_import_replacements"
DBCFG_SEARCH_STD_RULES = "tokenizer_search_standardization"
class ICUNameProcessorRules:
""" Data object that saves the rules needed for the name processor.
The rules can either be initialised through an ICURuleLoader or
be loaded from a database when a connection is given.
"""
def __init__(self, loader=None, conn=None):
if loader is not None:
self.norm_rules = loader.get_normalization_rules()
self.trans_rules = loader.get_transliteration_rules()
self.replacements = loader.get_replacement_pairs()
self.search_rules = loader.get_search_rules()
elif conn is not None:
self.norm_rules = get_property(conn, DBCFG_IMPORT_NORM_RULES)
self.trans_rules = get_property(conn, DBCFG_IMPORT_TRANS_RULES)
self.replacements = \
variants.unpickle_variant_set(get_property(conn, DBCFG_IMPORT_REPLACEMENTS))
self.search_rules = get_property(conn, DBCFG_SEARCH_STD_RULES)
else:
assert False, "Parameter loader or conn required."
def save_rules(self, conn):
""" Save the rules in the property table of the given database.
The rules can be loaded again by handing a connection to
the constructor of the class.
"""
set_property(conn, DBCFG_IMPORT_NORM_RULES, self.norm_rules)
set_property(conn, DBCFG_IMPORT_TRANS_RULES, self.trans_rules)
set_property(conn, DBCFG_IMPORT_REPLACEMENTS,
variants.pickle_variant_set(self.replacements))
set_property(conn, DBCFG_SEARCH_STD_RULES, self.search_rules)
class ICUNameProcessor:
""" Collects the different transformation rules for normalisation of names
and provides the functions to apply the transformations.
"""
def __init__(self, rules):
self.normalizer = Transliterator.createFromRules("icu_normalization",
rules.norm_rules)
self.to_ascii = Transliterator.createFromRules("icu_to_ascii",
rules.trans_rules +
";[:Space:]+ > ' '")
self.search = Transliterator.createFromRules("icu_search",
rules.search_rules)
# Intermediate reorder by source. Also compute required character set.
immediate = defaultdict(list)
chars = set()
for variant in rules.replacements:
if variant.source[-1] == ' ' and variant.replacement[-1] == ' ':
replstr = variant.replacement[:-1]
else:
replstr = variant.replacement
immediate[variant.source].append(replstr)
chars.update(variant.source)
# Then copy to datrie
self.replacements = datrie.Trie(''.join(chars))
for src, repllist in immediate.items():
self.replacements[src] = repllist
def get_normalized(self, name):
""" Normalize the given name, i.e. remove all elements not relevant
for search.
"""
return self.normalizer.transliterate(name).strip()
def get_variants_ascii(self, norm_name):
""" Compute the spelling variants for the given normalized name
and transliterate the result.
"""
baseform = '^ ' + norm_name + ' ^'
partials = ['']
startpos = 0
pos = 0
force_space = False
while pos < len(baseform):
full, repl = self.replacements.longest_prefix_item(baseform[pos:],
(None, None))
if full is not None:
done = baseform[startpos:pos]
partials = [v + done + r
for v, r in itertools.product(partials, repl)
if not force_space or r.startswith(' ')]
if len(partials) > 128:
# If too many variants are produced, they are unlikely
# to be helpful. Only use the original term.
startpos = 0
break
startpos = pos + len(full)
if full[-1] == ' ':
startpos -= 1
force_space = True
pos = startpos
else:
pos += 1
force_space = False
results = set()
if startpos == 0:
trans_name = self.to_ascii.transliterate(norm_name).strip()
if trans_name:
results.add(trans_name)
else:
for variant in partials:
name = variant + baseform[startpos:]
trans_name = self.to_ascii.transliterate(name[1:-1]).strip()
if trans_name:
results.add(trans_name)
return list(results)
def get_search_normalized(self, name):
""" Return the normalized version of the name (including transliteration)
to be applied at search time.
"""
return self.search.transliterate(' ' + name + ' ').strip()
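The rules object is meant to be built once from the configuration at import time,
persisted, and restored from the database later. A rough sketch of both
initialisation paths, with DSN and configuration path as placeholders:

```python
# Hedged sketch of the two ways to initialise ICUNameProcessorRules.
from pathlib import Path

from nominatim.db.connection import connect
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.icu_name_processor import ICUNameProcessorRules

# At import time: build the rules from the YAML configuration ...
rules = ICUNameProcessorRules(loader=ICURuleLoader(Path('legacy_icu_tokenizer.yaml')))

with connect('dbname=nominatim') as conn:        # placeholder DSN
    rules.save_rules(conn)                       # ... and store them as DB properties.

# Later, e.g. during updates: restore exactly the same rules from the database.
with connect('dbname=nominatim') as conn:
    restored = ICUNameProcessorRules(conn=conn)
```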


@ -0,0 +1,246 @@
"""
Helper class to create ICU rules from a configuration file.
"""
import io
import logging
import itertools
from pathlib import Path
import re
import yaml
from icu import Transliterator
from nominatim.errors import UsageError
import nominatim.tokenizer.icu_variants as variants
LOG = logging.getLogger()
def _flatten_yaml_list(content):
if not content:
return []
if not isinstance(content, list):
raise UsageError("List expected in ICU yaml configuration.")
output = []
for ele in content:
if isinstance(ele, list):
output.extend(_flatten_yaml_list(ele))
else:
output.append(ele)
return output
class VariantRule:
""" Saves a single variant expansion.
An expansion consists of the normalized replacement term and
a dictionary of properties that describe when the expansion applies.
"""
def __init__(self, replacement, properties):
self.replacement = replacement
self.properties = properties or {}
class ICURuleLoader:
""" Compiler for ICU rules from a tokenizer configuration file.
"""
def __init__(self, configfile):
self.configfile = configfile
self.variants = set()
if configfile.suffix == '.yaml':
self._load_from_yaml()
else:
raise UsageError("Unknown format of tokenizer configuration.")
def get_search_rules(self):
""" Return the ICU rules to be used during search.
The rules combine normalization and transliteration.
"""
# First apply the normalization rules.
rules = io.StringIO()
rules.write(self.normalization_rules)
# Then add transliteration.
rules.write(self.transliteration_rules)
return rules.getvalue()
def get_normalization_rules(self):
""" Return rules for normalisation of a term.
"""
return self.normalization_rules
def get_transliteration_rules(self):
""" Return the rules for converting a string into its asciii representation.
"""
return self.transliteration_rules
def get_replacement_pairs(self):
""" Return the list of possible compound decompositions with
application of abbreviations included.
The result is a set of ICUVariant tuples, each consisting of the source
sequence, its replacement and the properties under which it applies.
"""
return self.variants
def _yaml_include_representer(self, loader, node):
value = loader.construct_scalar(node)
if Path(value).is_absolute():
content = Path(value).read_text()
else:
content = (self.configfile.parent / value).read_text()
return yaml.safe_load(content)
def _load_from_yaml(self):
yaml.add_constructor('!include', self._yaml_include_representer,
Loader=yaml.SafeLoader)
rules = yaml.safe_load(self.configfile.read_text())
self.normalization_rules = self._cfg_to_icu_rules(rules, 'normalization')
self.transliteration_rules = self._cfg_to_icu_rules(rules, 'transliteration')
self._parse_variant_list(self._get_section(rules, 'variants'))
def _get_section(self, rules, section):
""" Get the section named 'section' from the rules. If the section does
not exist, raise a usage error with a meaningful message.
"""
if section not in rules:
LOG.fatal("Section '%s' not found in tokenizer config '%s'.",
section, str(self.configfile))
raise UsageError("Syntax error in tokenizer configuration file.")
return rules[section]
def _cfg_to_icu_rules(self, rules, section):
""" Load an ICU ruleset from the given section. If the section is a
simple string, it is interpreted as a file name and the rules are
loaded verbatim from the given file. The filename is expected to be
relative to the tokenizer rule file. If the section is a list then
each line is assumed to be a rule. All rules are concatenated and returned.
"""
content = self._get_section(rules, section)
if content is None:
return ''
return ';'.join(_flatten_yaml_list(content)) + ';'
def _parse_variant_list(self, rules):
self.variants.clear()
if not rules:
return
rules = _flatten_yaml_list(rules)
vmaker = _VariantMaker(self.normalization_rules)
properties = []
for section in rules:
# Create the property field and deduplicate against existing
# instances.
props = variants.ICUVariantProperties.from_rules(section)
for existing in properties:
if existing == props:
props = existing
break
else:
properties.append(props)
for rule in (section.get('words') or []):
self.variants.update(vmaker.compute(rule, props))
class _VariantMaker:
""" Generater for all necessary ICUVariants from a single variant rule.
All text in rules is normalized to make sure the variants match later.
"""
def __init__(self, norm_rules):
self.norm = Transliterator.createFromRules("rule_loader_normalization",
norm_rules)
def compute(self, rule, props):
""" Generator for all ICUVariant tuples from a single variant rule.
"""
parts = re.split(r'(\|)?([=-])>', rule)
if len(parts) != 4:
raise UsageError("Syntax error in variant rule: " + rule)
decompose = parts[1] is None
src_terms = [self._parse_variant_word(t) for t in parts[0].split(',')]
repl_terms = (self.norm.transliterate(t.strip()) for t in parts[3].split(','))
# If the source should be kept, add a 1:1 replacement
if parts[2] == '-':
for src in src_terms:
if src:
for froms, tos in _create_variants(*src, src[0], decompose):
yield variants.ICUVariant(froms, tos, props)
for src, repl in itertools.product(src_terms, repl_terms):
if src and repl:
for froms, tos in _create_variants(*src, repl, decompose):
yield variants.ICUVariant(froms, tos, props)
def _parse_variant_word(self, name):
name = name.strip()
match = re.fullmatch(r'([~^]?)([^~$^]*)([~$]?)', name)
if match is None or (match.group(1) == '~' and match.group(3) == '~'):
raise UsageError("Invalid variant word descriptor '{}'".format(name))
norm_name = self.norm.transliterate(match.group(2))
if not norm_name:
return None
return norm_name, match.group(1), match.group(3)
_FLAG_MATCH = {'^': '^ ',
'$': ' ^',
'': ' '}
def _create_variants(src, preflag, postflag, repl, decompose):
if preflag == '~':
postfix = _FLAG_MATCH[postflag]
# suffix decomposition
src = src + postfix
repl = repl + postfix
yield src, repl
yield ' ' + src, ' ' + repl
if decompose:
yield src, ' ' + repl
yield ' ' + src, repl
elif postflag == '~':
# prefix decomposition
prefix = _FLAG_MATCH[preflag]
src = prefix + src
repl = prefix + repl
yield src, repl
yield src + ' ', repl + ' '
if decompose:
yield src, repl + ' '
yield src + ' ', repl
else:
prefix = _FLAG_MATCH[preflag]
postfix = _FLAG_MATCH[postflag]
yield prefix + src + postfix, prefix + repl + postfix
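To see which source/replacement pairs a single rule expands to, the loader can be
exercised on its own. A minimal sketch, assuming the `nominatim` package from this
PR and PyICU are installed; the configuration content is illustrative only.

```python
# Hedged sketch: expand a single suffix rule and list the resulting variants.
import tempfile
from pathlib import Path

from nominatim.tokenizer.icu_rule_loader import ICURuleLoader

CONFIG = """\
normalization:
    - ":: lower ()"
transliteration:
    - ":: Ascii ()"
variants:
    - words:
        - ~strasse -> str
"""

with tempfile.TemporaryDirectory() as tmpdir:
    cfgfile = Path(tmpdir) / 'icu_tokenizer.yaml'
    cfgfile.write_text(CONFIG)
    loader = ICURuleLoader(cfgfile)
    # The '~' marker produces attached and detached spellings; '->' additionally
    # keeps the unabbreviated 'strasse' forms as variants of their own.
    for var in sorted(loader.get_replacement_pairs(),
                      key=lambda v: (v.source, v.replacement)):
        print(repr(var.source), '=>', repr(var.replacement))
```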


@ -0,0 +1,58 @@
"""
Data structures for saving variant expansions for ICU tokenizer.
"""
from collections import namedtuple
import json
_ICU_VARIANT_PORPERTY_FIELDS = ['lang']
class ICUVariantProperties(namedtuple('_ICUVariantProperties', _ICU_VARIANT_PORPERTY_FIELDS,
defaults=(None, )*len(_ICU_VARIANT_PORPERTY_FIELDS))):
""" Data container for saving properties that describe when a variant
should be applied.
Property instances are hashable.
"""
@classmethod
def from_rules(cls, _):
""" Create a new property type from a generic dictionary.
The function only takes into account the properties that are
understood presently and ignores all others.
"""
return cls(lang=None)
ICUVariant = namedtuple('ICUVariant', ['source', 'replacement', 'properties'])
def pickle_variant_set(variants):
""" Serializes an iterable of variant rules to a string.
"""
# Create a list of property sets so they don't need to be duplicated.
properties = {}
pid = 1
for variant in variants:
if variant.properties not in properties:
properties[variant.properties] = pid
pid += 1
# Convert the variants into a simple list.
variants = [(v.source, v.replacement, properties[v.properties]) for v in variants]
# Convert everything to JSON.
return json.dumps({'properties': {v: k._asdict() for k, v in properties.items()},
'variants': variants})
def unpickle_variant_set(variant_string):
""" Deserializes a variant string that was previously created with
pickle_variant_set() into a set of ICUVariants.
"""
data = json.loads(variant_string)
properties = {int(k): ICUVariantProperties(**v) for k, v in data['properties'].items()}
return set((ICUVariant(src, repl, properties[pid]) for src, repl, pid in data['variants']))
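A small round-trip sketch for the two helpers above; the variant values are made
up for illustration.

```python
# Hedged sketch: serialize a variant set to JSON and restore it unchanged.
from nominatim.tokenizer import icu_variants as variants

props = variants.ICUVariantProperties.from_rules({})
rules = {variants.ICUVariant(' strasse ', ' str ', props),
         variants.ICUVariant('strasse ', 'str ', props)}

blob = variants.pickle_variant_set(rules)        # plain JSON string
assert variants.unpickle_variant_set(blob) == rules
```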


@ -3,26 +3,23 @@ Tokenizer implementing normalisation as used before Nominatim 4 but using
libICU instead of the PostgreSQL module.
"""
from collections import Counter
import functools
import io
import itertools
import json
import logging
import re
from textwrap import dedent
from pathlib import Path
from icu import Transliterator
import psycopg2.extras
from nominatim.db.connection import connect
from nominatim.db.properties import set_property, get_property
from nominatim.db.utils import CopyBuffer
from nominatim.db.sql_preprocessor import SQLPreprocessor
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.icu_name_processor import ICUNameProcessor, ICUNameProcessorRules
DBCFG_NORMALIZATION = "tokenizer_normalization"
DBCFG_MAXWORDFREQ = "tokenizer_maxwordfreq"
DBCFG_TRANSLITERATION = "tokenizer_transliteration"
DBCFG_ABBREVIATIONS = "tokenizer_abbreviations"
DBCFG_TERM_NORMALIZATION = "tokenizer_term_normalization"
LOG = logging.getLogger()
@ -41,9 +38,9 @@ class LegacyICUTokenizer:
def __init__(self, dsn, data_dir):
self.dsn = dsn
self.data_dir = data_dir
self.normalization = None
self.transliteration = None
self.abbreviations = None
self.naming_rules = None
self.term_normalization = None
self.max_word_frequency = None
def init_new_db(self, config, init_db=True):
@ -55,14 +52,14 @@ class LegacyICUTokenizer:
if config.TOKENIZER_CONFIG:
cfgfile = Path(config.TOKENIZER_CONFIG)
else:
cfgfile = config.config_dir / 'legacy_icu_tokenizer.json'
cfgfile = config.config_dir / 'legacy_icu_tokenizer.yaml'
rules = json.loads(cfgfile.read_text())
self.transliteration = ';'.join(rules['normalization']) + ';'
self.abbreviations = rules["abbreviations"]
self.normalization = config.TERM_NORMALIZATION
loader = ICURuleLoader(cfgfile)
self.naming_rules = ICUNameProcessorRules(loader=loader)
self.term_normalization = config.TERM_NORMALIZATION
self.max_word_frequency = config.MAX_WORD_FREQUENCY
self._install_php(config)
self._install_php(config.lib_dir.php)
self._save_config(config)
if init_db:
@ -74,9 +71,9 @@ class LegacyICUTokenizer:
""" Initialise the tokenizer from the project directory.
"""
with connect(self.dsn) as conn:
self.normalization = get_property(conn, DBCFG_NORMALIZATION)
self.transliteration = get_property(conn, DBCFG_TRANSLITERATION)
self.abbreviations = json.loads(get_property(conn, DBCFG_ABBREVIATIONS))
self.naming_rules = ICUNameProcessorRules(conn=conn)
self.term_normalization = get_property(conn, DBCFG_TERM_NORMALIZATION)
self.max_word_frequency = get_property(conn, DBCFG_MAXWORDFREQ)
def finalize_import(self, config):
@ -103,9 +100,7 @@ class LegacyICUTokenizer:
"""
self.init_from_project()
if self.normalization is None\
or self.transliteration is None\
or self.abbreviations is None:
if self.naming_rules is None:
return "Configuration for tokenizer 'legacy_icu' are missing."
return None
@ -126,26 +121,20 @@ class LegacyICUTokenizer:
Analyzers are not thread-safe. You need to instantiate one per thread.
"""
norm = Transliterator.createFromRules("normalizer", self.normalization)
trans = Transliterator.createFromRules("trans", self.transliteration)
return LegacyICUNameAnalyzer(self.dsn, norm, trans, self.abbreviations)
return LegacyICUNameAnalyzer(self.dsn, ICUNameProcessor(self.naming_rules))
def _install_php(self, config):
# pylint: disable=missing-format-attribute
def _install_php(self, phpdir):
""" Install the php script for the tokenizer.
"""
abbr_inverse = list(zip(*self.abbreviations))
php_file = self.data_dir / "tokenizer.php"
php_file.write_text(dedent("""\
<?php
@define('CONST_Max_Word_Frequency', {1.MAX_WORD_FREQUENCY});
@define('CONST_Term_Normalization_Rules', "{0.normalization}");
@define('CONST_Transliteration', "{0.transliteration}");
@define('CONST_Abbreviations', array(array('{2}'), array('{3}')));
require_once('{1.lib_dir.php}/tokenizer/legacy_icu_tokenizer.php');
""".format(self, config,
"','".join(abbr_inverse[0]),
"','".join(abbr_inverse[1]))))
@define('CONST_Max_Word_Frequency', {0.max_word_frequency});
@define('CONST_Term_Normalization_Rules', "{0.term_normalization}");
@define('CONST_Transliteration', "{0.naming_rules.search_rules}");
require_once('{1}/tokenizer/legacy_icu_tokenizer.php');
""".format(self, phpdir)))
def _save_config(self, config):
@ -153,10 +142,10 @@ class LegacyICUTokenizer:
database as database properties.
"""
with connect(self.dsn) as conn:
set_property(conn, DBCFG_NORMALIZATION, self.normalization)
self.naming_rules.save_rules(conn)
set_property(conn, DBCFG_MAXWORDFREQ, config.MAX_WORD_FREQUENCY)
set_property(conn, DBCFG_TRANSLITERATION, self.transliteration)
set_property(conn, DBCFG_ABBREVIATIONS, json.dumps(self.abbreviations))
set_property(conn, DBCFG_TERM_NORMALIZATION, self.term_normalization)
def _init_db_tables(self, config):
@ -172,25 +161,30 @@ class LegacyICUTokenizer:
# get partial words and their frequencies
words = Counter()
with self.name_analyzer() as analyzer:
with conn.cursor(name="words") as cur:
cur.execute("SELECT svals(name) as v, count(*) FROM place GROUP BY v")
name_proc = ICUNameProcessor(self.naming_rules)
with conn.cursor(name="words") as cur:
cur.execute(""" SELECT v, count(*) FROM
(SELECT svals(name) as v FROM place)x
WHERE length(v) < 75 GROUP BY v""")
for name, cnt in cur:
term = analyzer.make_standard_word(name)
if term:
for word in term.split():
words[word] += cnt
for name, cnt in cur:
terms = set()
for word in name_proc.get_variants_ascii(name_proc.get_normalized(name)):
if ' ' in word:
terms.update(word.split())
for term in terms:
words[term] += cnt
# copy them back into the word table
copystr = io.StringIO(''.join(('{}\t{}\n'.format(*args) for args in words.items())))
with CopyBuffer() as copystr:
for args in words.items():
copystr.add(*args)
with conn.cursor() as cur:
copystr.seek(0)
cur.copy_from(copystr, 'word', columns=['word_token', 'search_name_count'])
cur.execute("""UPDATE word SET word_id = nextval('seq_word')
WHERE word_id is null""")
with conn.cursor() as cur:
copystr.copy_out(cur, 'word',
columns=['word_token', 'search_name_count'])
cur.execute("""UPDATE word SET word_id = nextval('seq_word')
WHERE word_id is null""")
conn.commit()
@ -202,12 +196,10 @@ class LegacyICUNameAnalyzer:
normalization.
"""
def __init__(self, dsn, normalizer, transliterator, abbreviations):
def __init__(self, dsn, name_proc):
self.conn = connect(dsn).connection
self.conn.autocommit = True
self.normalizer = normalizer
self.transliterator = transliterator
self.abbreviations = abbreviations
self.name_processor = name_proc
self._cache = _TokenCache()
@ -228,7 +220,7 @@ class LegacyICUNameAnalyzer:
self.conn = None
def get_word_token_info(self, conn, words):
def get_word_token_info(self, words):
""" Return token information for the given list of words.
If a word starts with # it is assumed to be a full name
otherwise it is a partial name.
@ -242,11 +234,11 @@ class LegacyICUNameAnalyzer:
tokens = {}
for word in words:
if word.startswith('#'):
tokens[word] = ' ' + self.make_standard_word(word[1:])
tokens[word] = ' ' + self.name_processor.get_search_normalized(word[1:])
else:
tokens[word] = self.make_standard_word(word)
tokens[word] = self.name_processor.get_search_normalized(word)
with conn.cursor() as cur:
with self.conn.cursor() as cur:
cur.execute("""SELECT word_token, word_id
FROM word, (SELECT unnest(%s::TEXT[]) as term) t
WHERE word_token = t.term
@ -254,15 +246,9 @@ class LegacyICUNameAnalyzer:
(list(tokens.values()), ))
ids = {r[0]: r[1] for r in cur}
return [(k, v, ids[v]) for k, v in tokens.items()]
return [(k, v, ids.get(v, None)) for k, v in tokens.items()]
def normalize(self, phrase):
""" Normalize the given phrase, i.e. remove all properties that
are irrelevant for search.
"""
return self.normalizer.transliterate(phrase)
@staticmethod
def normalize_postcode(postcode):
""" Convert the postcode to a standardized form.
@ -273,34 +259,18 @@ class LegacyICUNameAnalyzer:
return postcode.strip().upper()
@functools.lru_cache(maxsize=1024)
def make_standard_word(self, name):
""" Create the normalised version of the input.
"""
norm = ' ' + self.transliterator.transliterate(name) + ' '
for full, abbr in self.abbreviations:
if full in norm:
norm = norm.replace(full, abbr)
return norm.strip()
def _make_standard_hnr(self, hnr):
""" Create a normalised version of a housenumber.
This function takes minor shortcuts on transliteration.
"""
if hnr.isdigit():
return hnr
return self.transliterator.transliterate(hnr)
return self.name_processor.get_search_normalized(hnr)
def update_postcodes_from_db(self):
""" Update postcode tokens in the word table from the location_postcode
table.
"""
to_delete = []
copystr = io.StringIO()
with self.conn.cursor() as cur:
# This finds us the rows in location_postcode and word that are
# missing in the other table.
@ -313,32 +283,31 @@ class LegacyICUNameAnalyzer:
ON pc = word) x
WHERE pc is null or word is null""")
for postcode, word in cur:
if postcode is None:
to_delete.append(word)
else:
copystr.write(postcode)
copystr.write('\t ')
copystr.write(self.transliterator.transliterate(postcode))
copystr.write('\tplace\tpostcode\t0\n')
with CopyBuffer() as copystr:
for postcode, word in cur:
if postcode is None:
to_delete.append(word)
else:
copystr.add(
postcode,
' ' + self.name_processor.get_search_normalized(postcode),
'place', 'postcode', 0)
if to_delete:
cur.execute("""DELETE FROM WORD
WHERE class ='place' and type = 'postcode'
and word = any(%s)
""", (to_delete, ))
if to_delete:
cur.execute("""DELETE FROM WORD
WHERE class ='place' and type = 'postcode'
and word = any(%s)
""", (to_delete, ))
if copystr.getvalue():
copystr.seek(0)
cur.copy_from(copystr, 'word',
columns=['word', 'word_token', 'class', 'type',
'search_name_count'])
copystr.copy_out(cur, 'word',
columns=['word', 'word_token', 'class', 'type',
'search_name_count'])
def update_special_phrases(self, phrases, should_replace):
""" Replace the search index for special phrases with the new phrases.
"""
norm_phrases = set(((self.normalize(p[0]), p[1], p[2], p[3])
norm_phrases = set(((self.name_processor.get_normalized(p[0]), p[1], p[2], p[3])
for p in phrases))
with self.conn.cursor() as cur:
@ -350,54 +319,64 @@ class LegacyICUNameAnalyzer:
for label, cls, typ, oper in cur:
existing_phrases.add((label, cls, typ, oper or '-'))
to_add = norm_phrases - existing_phrases
to_delete = existing_phrases - norm_phrases
if to_add:
copystr = io.StringIO()
for word, cls, typ, oper in to_add:
term = self.make_standard_word(word)
if term:
copystr.write(word)
copystr.write('\t ')
copystr.write(term)
copystr.write('\t')
copystr.write(cls)
copystr.write('\t')
copystr.write(typ)
copystr.write('\t')
copystr.write(oper if oper in ('in', 'near') else '\\N')
copystr.write('\t0\n')
copystr.seek(0)
cur.copy_from(copystr, 'word',
columns=['word', 'word_token', 'class', 'type',
'operator', 'search_name_count'])
if to_delete and should_replace:
psycopg2.extras.execute_values(
cur,
""" DELETE FROM word USING (VALUES %s) as v(name, in_class, in_type, op)
WHERE word = name and class = in_class and type = in_type
and ((op = '-' and operator is null) or op = operator)""",
to_delete)
added = self._add_special_phrases(cur, norm_phrases, existing_phrases)
if should_replace:
deleted = self._remove_special_phrases(cur, norm_phrases,
existing_phrases)
else:
deleted = 0
LOG.info("Total phrases: %s. Added: %s. Deleted: %s",
len(norm_phrases), len(to_add), len(to_delete))
len(norm_phrases), added, deleted)
def _add_special_phrases(self, cursor, new_phrases, existing_phrases):
""" Add all phrases to the database that are not yet there.
"""
to_add = new_phrases - existing_phrases
added = 0
with CopyBuffer() as copystr:
for word, cls, typ, oper in to_add:
term = self.name_processor.get_search_normalized(word)
if term:
copystr.add(word, ' ' + term, cls, typ,
oper if oper in ('in', 'near') else None, 0)
added += 1
copystr.copy_out(cursor, 'word',
columns=['word', 'word_token', 'class', 'type',
'operator', 'search_name_count'])
return added
@staticmethod
def _remove_special_phrases(cursor, new_phrases, existing_phrases):
""" Remove all phrases from the databse that are no longer in the
new phrase list.
"""
to_delete = existing_phrases - new_phrases
if to_delete:
psycopg2.extras.execute_values(
cursor,
""" DELETE FROM word USING (VALUES %s) as v(name, in_class, in_type, op)
WHERE word = name and class = in_class and type = in_type
and ((op = '-' and operator is null) or op = operator)""",
to_delete)
return len(to_delete)
def add_country_names(self, country_code, names):
""" Add names for the given country to the search index.
"""
full_names = set((self.make_standard_word(n) for n in names))
full_names.discard('')
self._add_normalized_country_names(country_code, full_names)
word_tokens = set()
for name in self._compute_full_names(names):
if name:
word_tokens.add(' ' + self.name_processor.get_search_normalized(name))
def _add_normalized_country_names(self, country_code, names):
""" Add names for the given country to the search index.
"""
word_tokens = set((' ' + name for name in names))
with self.conn.cursor() as cur:
# Get existing names
cur.execute("SELECT word_token FROM word WHERE country_code = %s",
@ -423,14 +402,13 @@ class LegacyICUNameAnalyzer:
names = place.get('name')
if names:
full_names = self._compute_full_names(names)
fulls, partials = self._compute_name_tokens(names)
token_info.add_names(self.conn, full_names)
token_info.add_names(fulls, partials)
country_feature = place.get('country_feature')
if country_feature and re.fullmatch(r'[A-Za-z][A-Za-z]', country_feature):
self._add_normalized_country_names(country_feature.lower(),
full_names)
self.add_country_names(country_feature.lower(), names)
address = place.get('address')
@ -443,38 +421,65 @@ class LegacyICUNameAnalyzer:
elif key in ('housenumber', 'streetnumber', 'conscriptionnumber'):
hnrs.append(value)
elif key == 'street':
token_info.add_street(self.conn, self.make_standard_word(value))
token_info.add_street(*self._compute_name_tokens({'name': value}))
elif key == 'place':
token_info.add_place(self.conn, self.make_standard_word(value))
token_info.add_place(*self._compute_name_tokens({'name': value}))
elif not key.startswith('_') and \
key not in ('country', 'full'):
addr_terms.append((key, self.make_standard_word(value)))
addr_terms.append((key, *self._compute_name_tokens({'name': value})))
if hnrs:
hnrs = self._split_housenumbers(hnrs)
token_info.add_housenumbers(self.conn, [self._make_standard_hnr(n) for n in hnrs])
if addr_terms:
token_info.add_address_terms(self.conn, addr_terms)
token_info.add_address_terms(addr_terms)
return token_info.data
def _compute_full_names(self, names):
def _compute_name_tokens(self, names):
""" Computes the full name and partial name tokens for the given
dictionary of names.
"""
full_names = self._compute_full_names(names)
full_tokens = set()
partial_tokens = set()
for name in full_names:
norm_name = self.name_processor.get_normalized(name)
full, part = self._cache.names.get(norm_name, (None, None))
if full is None:
variants = self.name_processor.get_variants_ascii(norm_name)
if not variants:
continue
with self.conn.cursor() as cur:
cur.execute("SELECT (getorcreate_full_word(%s, %s)).*",
(norm_name, variants))
full, part = cur.fetchone()
self._cache.names[norm_name] = (full, part)
full_tokens.add(full)
partial_tokens.update(part)
return full_tokens, partial_tokens
@staticmethod
def _compute_full_names(names):
""" Return the set of all full name word ids to be used with the
given dictionary of names.
"""
full_names = set()
for name in (n for ns in names.values() for n in re.split('[;,]', ns)):
word = self.make_standard_word(name)
if word:
full_names.add(word)
for name in (n.strip() for ns in names.values() for n in re.split('[;,]', ns)):
if name:
full_names.add(name)
brace_split = name.split('(', 2)
if len(brace_split) > 1:
word = self.make_standard_word(brace_split[0])
if word:
full_names.add(word)
brace_idx = name.find('(')
if brace_idx >= 0:
full_names.add(name[:brace_idx].strip())
return full_names
@ -486,7 +491,7 @@ class LegacyICUNameAnalyzer:
postcode = self.normalize_postcode(postcode)
if postcode not in self._cache.postcodes:
term = self.make_standard_word(postcode)
term = self.name_processor.get_search_normalized(postcode)
if not term:
return
@ -502,6 +507,7 @@ class LegacyICUNameAnalyzer:
""", (' ' + term, postcode))
self._cache.postcodes.add(postcode)
@staticmethod
def _split_housenumbers(hnrs):
if len(hnrs) > 1 or ',' in hnrs[0] or ';' in hnrs[0]:
@ -524,7 +530,7 @@ class _TokenInfo:
""" Collect token information to be sent back to the database.
"""
def __init__(self, cache):
self.cache = cache
self._cache = cache
self.data = {}
@staticmethod
@ -532,86 +538,44 @@ class _TokenInfo:
return '{%s}' % ','.join((str(s) for s in tokens))
def add_names(self, conn, names):
def add_names(self, fulls, partials):
""" Adds token information for the normalised names.
"""
# Start with all partial names
terms = set((part for ns in names for part in ns.split()))
# Add the full names
terms.update((' ' + n for n in names))
self.data['names'] = self._mk_array(self.cache.get_term_tokens(conn, terms))
self.data['names'] = self._mk_array(itertools.chain(fulls, partials))
def add_housenumbers(self, conn, hnrs):
""" Extract housenumber information from a list of normalised
housenumbers.
"""
self.data['hnr_tokens'] = self._mk_array(self.cache.get_hnr_tokens(conn, hnrs))
self.data['hnr_tokens'] = self._mk_array(self._cache.get_hnr_tokens(conn, hnrs))
self.data['hnr'] = ';'.join(hnrs)
def add_street(self, conn, street):
def add_street(self, fulls, _):
""" Add addr:street match terms.
"""
if not street:
return
term = ' ' + street
tid = self.cache.names.get(term)
if tid is None:
with conn.cursor() as cur:
cur.execute("""SELECT word_id FROM word
WHERE word_token = %s
and class is null and type is null""",
(term, ))
if cur.rowcount > 0:
tid = cur.fetchone()[0]
self.cache.names[term] = tid
if tid is not None:
self.data['street'] = '{%d}' % tid
if fulls:
self.data['street'] = self._mk_array(fulls)
def add_place(self, conn, place):
def add_place(self, fulls, partials):
""" Add addr:place search and match terms.
"""
if not place:
return
partial_ids = self.cache.get_term_tokens(conn, place.split())
tid = self.cache.get_term_tokens(conn, [' ' + place])
self.data['place_search'] = self._mk_array(itertools.chain(partial_ids, tid))
self.data['place_match'] = '{%s}' % tid[0]
if fulls:
self.data['place_search'] = self._mk_array(itertools.chain(fulls, partials))
self.data['place_match'] = self._mk_array(fulls)
def add_address_terms(self, conn, terms):
def add_address_terms(self, terms):
""" Add additional address terms.
"""
tokens = {}
for key, value in terms:
if not value:
continue
partial_ids = self.cache.get_term_tokens(conn, value.split())
term = ' ' + value
tid = self.cache.names.get(term)
if tid is None:
with conn.cursor() as cur:
cur.execute("""SELECT word_id FROM word
WHERE word_token = %s
and class is null and type is null""",
(term, ))
if cur.rowcount > 0:
tid = cur.fetchone()[0]
self.cache.names[term] = tid
tokens[key] = [self._mk_array(partial_ids),
'{%s}' % ('' if tid is None else str(tid))]
for key, fulls, partials in terms:
if fulls:
tokens[key] = [self._mk_array(itertools.chain(fulls, partials)),
self._mk_array(fulls)]
if tokens:
self.data['addr'] = tokens
@ -629,32 +593,6 @@ class _TokenCache:
self.housenumbers = {}
def get_term_tokens(self, conn, terms):
""" Get token ids for a list of terms, looking them up in the database
if necessary.
"""
tokens = []
askdb = []
for term in terms:
token = self.names.get(term)
if token is None:
askdb.append(term)
elif token != 0:
tokens.append(token)
if askdb:
with conn.cursor() as cur:
cur.execute("SELECT term, getorcreate_term_id(term) FROM unnest(%s) as term",
(askdb, ))
for term, tid in cur:
self.names[term] = tid
if tid != 0:
tokens.append(tid)
return tokens
def get_hnr_tokens(self, conn, terms):
""" Get token ids for a list of housenumbers, looking them up in the
database if necessary.


@ -271,8 +271,7 @@ class LegacyNameAnalyzer:
self.conn = None
@staticmethod
def get_word_token_info(conn, words):
def get_word_token_info(self, words):
""" Return token information for the given list of words.
If a word starts with # it is assumed to be a full name
otherwise it is a partial name.
@ -283,7 +282,7 @@ class LegacyNameAnalyzer:
The function is used for testing and debugging only
and not necessarily efficient.
"""
with conn.cursor() as cur:
with self.conn.cursor() as cur:
cur.execute("""SELECT t.term, word_token, word_id
FROM word, (SELECT unnest(%s::TEXT[]) as term) t
WHERE word_token = (CASE
@ -404,7 +403,7 @@ class LegacyNameAnalyzer:
FROM unnest(%s)n) y
WHERE NOT EXISTS(SELECT * FROM word
WHERE word_token = lookup_token and country_code = %s))
""", (country_code, names, country_code))
""", (country_code, list(names.values()), country_code))
def process_place(self, place):
@ -422,7 +421,7 @@ class LegacyNameAnalyzer:
country_feature = place.get('country_feature')
if country_feature and re.fullmatch(r'[A-Za-z][A-Za-z]', country_feature):
self.add_country_names(country_feature.lower(), list(names.values()))
self.add_country_names(country_feature.lower(), names)
address = place.get('address')


@ -272,15 +272,15 @@ def create_country_names(conn, tokenizer, languages=None):
with tokenizer.name_analyzer() as analyzer:
for code, name in cur:
names = [code]
names = {'countrycode' : code}
if code == 'gb':
names.append('UK')
names['short_name'] = 'UK'
if code == 'us':
names.append('United States')
names['short_name'] = 'United States'
# country names (only in languages as provided)
if name:
names.extend((v for k, v in name.items() if _include_key(k)))
names.update(((k, v) for k, v in name.items() if _include_key(k)))
analyzer.add_country_names(code, names)

File diff suppressed because it is too large


@ -0,0 +1,24 @@
- "[𞥐𐒠߀𖭐꤀𖩠𑓐𑑐𑋰𑄶꩐꘠᱀᭐᮰᠐០᥆༠໐꧰႐᪐᪀᧐𑵐꯰᱐𑱐𑜰𑛀𑙐𑇐꧐꣐෦𑁦0𝟶𝟘𝟬𝟎𝟢₀⓿⓪⁰] > 0"
- "[𞥑𐒡߁𖭑꤁𖩡𑓑𑑑𑋱𑄷꩑꘡᱁᭑᮱᠑១᥇༡໑꧱႑᪑᪁᧑𑵑꯱᱑𑱑𑜱𑛁𑙑𑇑꧑꣑෧𑁧1𝟷𝟙𝟭𝟏𝟣₁¹①⑴⒈❶➀➊⓵] > 1"
- "[𞥒𐒢߂𖭒꤂𖩢𑓒𑑒𑋲𑄸꩒꘢᱂᭒᮲᠒២᥈༢໒꧲႒᪒᪂᧒𑵒꯲᱒𑱒𑜲𑛂𑙒𑇒꧒꣒෨𑁨2𝟸𝟚𝟮𝟐𝟤₂²②⑵⒉❷➁➋⓶] > 2"
- "[𞥓𐒣߃𖭓꤃𖩣𑓓𑑓𑋳𑄹꩓꘣᱃᭓᮳᠓៣᥉༣໓꧳႓᪓᪃᧓𑵓꯳᱓𑱓𑜳𑛃𑙓𑇓꧓꣓෩𑁩3𝟹𝟛𝟯𝟑𝟥₃³③⑶⒊❸➂➌⓷] > 3"
- "[𞥔𐒤߄𖭔꤄𖩤𑓔𑑔𑋴𑄺꩔꘤᱄᭔᮴᠔៤᥊༤໔꧴႔᪔᪄᧔𑵔꯴᱔𑱔𑜴𑛄𑙔𑇔꧔꣔෪𑁪4𝟺𝟜𝟰𝟒𝟦₄⁴④⑷⒋❹➃➍⓸] > 4"
- "[𞥕𐒥߅𖭕꤅𖩥𑓕𑑕𑋵𑄻꩕꘥᱅᭕᮵᠕៥᥋༥໕꧵႕᪕᪅᧕𑵕꯵᱕𑱕𑜵𑛅𑙕𑇕꧕꣕෫𑁫5𝟻𝟝𝟱𝟓𝟧₅⁵⑤⑸⒌❺➄➎⓹] > 5"
- "[𞥖𐒦߆𖭖꤆𖩦𑓖𑑖𑋶𑄼꩖꘦᱆᭖᮶᠖៦᥌༦໖꧶႖᪖᪆᧖𑵖꯶᱖𑱖𑜶𑛆𑙖𑇖꧖꣖෬𑁬6𝟼𝟞𝟲𝟔𝟨₆⁶⑥⑹⒍❻➅➏⓺] > 6"
- "[𞥗𐒧߇𖭗꤇𖩧𑓗𑑗𑋷𑄽꩗꘧᱇᭗᮷᠗៧᥍༧໗꧷႗᪗᪇᧗𑵗꯷᱗𑱗𑜷𑛇𑙗𑇗꧗꣗෭𑁭7𝟽𝟟𝟳𝟕𝟩₇⁷⑦⑺⒎❼➆➐⓻] > 7"
- "[𞥘𐒨߈𖭘꤈𖩨𑓘𑑘𑋸𑄾꩘꘨᱈᭘᮸᠘៨᥎༨໘꧸႘᪘᪈᧘𑵘꯸᱘𑱘𑜸𑛈𑙘𑇘꧘꣘෮𑁮8𝟾𝟠𝟴𝟖𝟪₈⁸⑧⑻⒏❽➇➑⓼] > 8"
- "[𞥙𐒩߉𖭙꤉𖩩𑓙𑑙𑋹𑄿꩙꘩᱉᭙᮹᠙៩᥏༩໙꧹႙᪙᪉᧙𑵙꯹᱙𑱙𑜹𑛉𑙙𑇙꧙꣙෯𑁯9𝟿𝟡𝟵𝟗𝟫₉⁹⑨⑼⒐❾➈➒⓽] > 9"
- "[𑜺⑩⑽⒑❿➉➓⓾] > '10'"
- "[⑪⑾⒒⓫] > '11'"
- "[⑫⑿⒓⓬] > '12'"
- "[⑬⒀⒔⓭] > '13'"
- "[⑭⒁⒕⓮] > '14'"
- "[⑮⒂⒖⓯] > '15'"
- "[⑯⒃⒗⓰] > '16'"
- "[⑰⒄⒘⓱] > '17'"
- "[⑱⒅⒙⓲] > '18'"
- "[⑲⒆⒚⓳] > '19'"
- "[𑜻⑳⒇⒛⓴] > '20'"
- "⅐ > ' 1/7'"
- "⅑ > ' 1/9'"
- "⅒ > ' 1/10'"


@ -0,0 +1,19 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.D0.91.D1.8A.D0.BB.D0.B3.D0.B0.D1.80.D1.81.D0.BA.D0.B8_.D0.B5.D0.B7.D0.B8.D0.BA_-_Bulgarian
- lang: bg
words:
- Блок -> бл
- Булевард -> бул
- Вход -> вх
- Генерал -> ген
- Град -> гр
- Доктор -> д-р
- Доцент -> доц
- Капитан -> кап
- Митрополит -> мит
- Площад -> пл
- Професор -> проф
- Свети -> Св
- Улица -> ул
- Село -> с
- Квартал -> кв
- Жилищен Комплекс -> ж к


@ -0,0 +1,90 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Catal.C3.A0_-_Catalan
- lang: ca
words:
- aparcament -> aparc
- apartament -> apmt
- apartat -> apt
- àtic -> àt
- autopista -> auto
- autopista -> autop
- autovia -> autov
- avinguda -> av
- avinguda -> avd
- avinguda -> avda
- baixada -> bda
- baixos -> bxs
- barranc -> bnc
- barri -> b
- barriada -> b
- biblioteca -> bibl
- bloc -> bl
- carrer -> c
- carrer -> c/
- carreró -> cró
- carretera -> ctra
- cantonada -> cant
- cementiri -> cem
- cinturó -> cint
- codi postal -> CP
- collegi -> coll
- collegi públic -> CP
- comissaria -> com
- convent -> convt
- correus -> corr
- districte -> distr
- drecera -> drec
- dreta -> dta
- entrada -> entr
- entresòl -> entl
- escala -> esc
- escola -> esc
- escola universitària -> EU
- església -> esgl
- estació -> est
- estacionament -> estac
- facultat -> fac
- finca -> fca
- habitació -> hab
- hospital -> hosp
- hotel -> H
- monestir -> mtir
- monument -> mon
- mossèn -> Mn
- municipal -> mpal
- museu -> mus
- nacional -> nac
- nombre -> nre
- número -> núm
- número -> n
- sense número -> s/n
- parada -> par
- parcel·la -> parc
- passadís -> pdís
- passatge -> ptge
- passeig -> pg
- pavelló -> pav
- plaça -> pl
- plaça -> pça
- planta -> pl
- població -> pobl
- polígon -> pol
- polígon industrial -> PI
- polígon industrial -> pol ind
- porta -> pta
- portal -> ptal
- principal -> pral
- pujada -> pda
- punt quilomètric -> PK
- rambla -> rbla
- ronda -> rda
- sagrada -> sgda
- sagrat -> sgt
- sant -> st
- santa -> sta
- sobreàtic -> s/àt
- travessera -> trav
- travessia -> trv
- travessia -> trav
- urbanització -> urb
- sortida -> sort
- via -> v


@ -0,0 +1,6 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Cesky_-_Czech
- lang: cs
words:
- Ulice -> Ul
- Třída -> Tř
- Náměstí -> Nám


@ -0,0 +1,12 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Dansk_-_Danish
- lang: da
words:
- Lille -> Ll
- Nordre -> Ndr
- Nørre -> Nr
- Søndre, Sønder -> Sdr
- Store -> St
- Gammel,Gamle -> Gl
- ~hal => hal
- ~hallen => hallen
- ~hallerne => hallerne


@ -0,0 +1,136 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Deutsch_-_German
- lang: de
words:
- am -> a
- an der -> a d
- Allgemeines Krankenhaus -> AKH
- Altstoffsammelzentrum -> ASZ
- auf der -> a d
- ~bach -> B
- Bad -> B
- Bahnhof -> Bhf
- Bayerisch, Bayerische, Bayerischer, Bayerisches -> Bayer
- Berg -> B
- ~berg |-> bg
- Bezirk -> Bez
- ~brücke -> Br
- Bundesgymnasium -> BG
- Bundespolizeidirektion -> BPD
- Bundesrealgymnasium -> BRG
- ~burg |-> bg
- burgenländische,burgenländischer,burgenländisches -> bgld
- Bürgermeister -> Bgm
- Chaussee -> Ch
- Deutsche, Deutscher, Deutsches -> dt
- Deutscher Alpenverein -> DAV
- Deutsch -> Dt
- ~denkmal -> Dkm
- Dorf -> Df
- ~dorf |-> df
- Doktor -> Dr
- ehemalige, ehemaliger, ehemaliges -> ehem
- Fabrik -> Fb
- Fachhochschule -> FH
- Freiwillige Feuerwehr -> FF
- Forsthaus -> Fh
- ~gasse |-> g
- Gasthaus -> Gh
- Gasthof -> Ghf
- Gemeinde -> Gde
- Graben -> Gr
- Großer, Große, Großes -> Gr, G
- Gymnasium und Realgymnasium -> GRG
- Handelsakademie -> HAK
- Handelsschule -> HASCH
- Haltestelle -> Hst
- Hauptbahnhof -> Hbf
- Haus -> Hs
- Heilige, Heiliger, Heiliges -> Hl
- Hintere, Hinterer, Hinteres -> Ht, Hint
- Hohe, Hoher, Hohes -> H
- ~höhle -> H
- Höhere Technische Lehranstalt -> HTL
- ~hütte -> Htt
- im -> i
- in -> i
- in der -> i d
- Ingenieur -> Ing
- Internationale, Internationaler, Internationales -> Int
- Jagdhaus -> Jh
- Jagdhütte -> Jhtt
- Kapelle -> Kap, Kpl
- Katastralgemeinde -> KG
- Kläranlage -> KA
- Kleiner, Kleine, Kleines -> kl
- Klein~ -> Kl.
- Kleingartenanlage -> KGA
- Kleingartenverein -> KGV
- Kogel -> Kg
- ~kogel |-> kg
- Konzentrationslager -> KZ, KL
- Krankenhaus -> KH
- ~kreuz |-> kz
- Landeskrankenhaus -> LKH
- Maria -> Ma
- Magister -> Mag
- Magistratsabteilung -> MA
- Markt -> Mkt
- Müllverbrennungsanlage -> MVA
- Nationalpark -> NP
- Naturschutzgebiet -> NSG
- Neue Mittelschule -> NMS
- Niedere, Niederer, Niederes -> Nd
- Niederösterreich -> NÖ
- nördliche, nördlicher, nördliches -> nördl
- Nummer -> Nr
- ob -> o
- Oberer, Obere, Oberes -> ob
- Ober~ -> Ob
- Österreichischer Alpenverein -> ÖAV
- Österreichischer Gebirgsverein -> ÖGV
- Österreichischer Touristenklub -> ÖTK
- östliche, östlicher, östliches -> östl
- Pater -> P
- Pfad -> P
- Platz -> Pl
- ~platz$ -> pl
- Professor -> Prof
- Quelle -> Q, Qu
- Reservoir -> Res
- Rhein -> Rh
- Rundwanderweg -> RWW
- Ruine -> R
- Sandgrube, Schottergrube -> SG
- Sankt -> St
- Schloss -> Schl
- See -> S
- ~siedlung -> sdlg
- Sozialmedizinisches Zentrum -> SMZ
- ~Spitze -> Sp
- Steinbruch -> Stb
- ~stiege -> stg
- ~strasse -> str
- südliche, südlicher, südliches -> südl
- Unterer, Untere, Unteres -> u, unt
- Unter~ -> U
- Teich -> T
- Technische Universität -> TU
- Truppenübungsplatz -> TÜPL, TÜPl
- Unfallkrankenhaus -> UKH
- ~universität -> uni
- verfallen -> verf
- von -> v
- Vordere, Vorderer, Vorderes -> Vd, Vord
- Vorder~ -> Vd, Vord
- von der -> v d
- vor der -> v d
- Volksschule -> VS
- Wald -> W
- Wasserfall -> Wsf, Wssf
- ~weg$ -> w
- westliche, westlicher, westliches -> westl
- Wiener -> Wr
- ~wiese$ -> ws
- Wirtschaftsuniversität -> WU
- Wirtshaus -> Wh
- zum -> z


@ -0,0 +1,54 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.CE.95.CE.BB.CE.BB.CE.B7.CE.BD.CE.B9.CE.BA.CE.AC_-_Greek
- lang: el
words:
- Αγίας -> Αγ
- Αγίου -> Αγ
- Αγίων -> Αγ
- Αδελφοί -> Αφοί
- Αδελφών -> Αφών
- Αλέξανδρου -> Αλ
- Ανώτατο Τεχνολογικό Εκπαιδευτικό Ίδρυμα -> ΑΤΕΙ
- Αστυνομικό Τμήμα -> ΑΤ
- Βασιλέως -> Β
- Βασιλέως -> Βασ
- Βασιλίσσης -> Β
- Βασιλίσσης -> Βασ
- Γρηγορίου -> Γρ
- Δήμος -> Δ
- Δημοτικό Σχολείο -> ΔΣ
- Δημοτικό Σχολείο -> Δημ Σχ
- Εθνάρχου -> Εθν
- Εθνική -> Εθν
- Εθνικής -> Εθν
- Ελευθέριος -> Ελ
- Ελευθερίου -> Ελ
- Ελληνικά Ταχυδρομεία -> ΕΛΤΑ
- Θεσσαλονίκης -> Θεσ/νίκης
- Ιερά Μονή -> Ι Μ
- Ιερός Ναός -> Ι Ν
- Κτίριο -> Κτ
- Κωνσταντίνου -> Κων/νου
- Λεωφόρος -> Λ
- Λεωφόρος -> Λεωφ
- Λίμνη -> Λ
- Νέα -> Ν
- Νέες -> Ν
- Νέο -> Ν
- Νέοι -> Ν
- Νέος -> Ν
- Νησί -> Ν
- Νομός -> Ν
- Όρος -> Όρ
- Παλαιά -> Π
- Παλαιές -> Π
- Παλαιό -> Π
- Παλαιοί -> Π
- Παλαιός -> Π
- Πανεπιστήμιο -> ΑΕΙ
- Πανεπιστήμιο -> Παν
- Πλατεία -> Πλ
- Ποταμός -> Π
- Ποταμός -> Ποτ
- Στρατηγού -> Στρ
- Ταχυδρομείο -> ΕΛΤΑ
- Τεχνολογικό Εκπαιδευτικό Ίδρυμα -> ΤΕΙ


@ -0,0 +1,485 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#English
- lang: en
words:
- Access -> Accs
- Air Force Base -> AFB
- Air National Guard Base -> ANGB
- Airport -> Aprt
- Alley -> Al
- Alley -> All
- Alley -> Ally
- Alley -> Aly
- Alleyway -> Alwy
- Amble -> Ambl
- Apartments -> Apts
- Approach -> Apch
- Approach -> App
- Arcade -> Arc
- Arterial -> Artl
- Artery -> Arty
- Avenue -> Av
- Avenue -> Ave
- Back -> Bk
- Banan -> Ba
- Basin -> Basn
- Basin -> Bsn
- Beach -> Bch
- Bend -> Bend
- Bend -> Bnd
- Block -> Blk
- Boardwalk -> Bwlk
- Boulevard -> Blvd
- Boulevard -> Bvd
- Boundary -> Bdy
- Bowl -> Bl
- Brace -> Br
- Brae -> Br
- Brae -> Brae
- Break -> Brk
- Bridge -> Bdge
- Bridge -> Br
- Bridge -> Brdg
- Bridge -> Bri
- Broadway -> Bdwy
- Broadway -> Bway
- Broadway -> Bwy
- Brook -> Brk
- Brow -> Brw
- Brow -> Brow
- Buildings -> Bldgs
- Buildings -> Bldngs
- Business -> Bus
- Bypass -> Bps
- Bypass -> Byp
- Bypass -> Bypa
- Byway -> Bywy
- Caravan -> Cvn
- Causeway -> Caus
- Causeway -> Cswy
- Causeway -> Cway
- Center -> Cen
- Center -> Ctr
- Central -> Ctrl
- Centre -> Cen
- Centre -> Ctr
- Centreway -> Cnwy
- Chase -> Ch
- Church -> Ch
- Circle -> Cir
- Circuit -> Cct
- Circuit -> Ci
- Circus -> Crc
- Circus -> Crcs
- City -> Cty
- Close -> Cl
- Common -> Cmn
- Common -> Comm
- Community -> Comm
- Concourse -> Cnc
- Concourse -> Con
- Copse -> Cps
- Corner -> Cnr
- Corner -> Crn
- Corso -> Cso
- Cottages -> Cotts
- County -> Co
- County Road -> CR
- County Route -> CR
- Court -> Crt
- Court -> Ct
- Courtyard -> Cyd
- Courtyard -> Ctyd
- Cove -> Ce
- Cove -> Cov
- Cove -> Cove
- Cove -> Cv
- Creek -> Ck
- Creek -> Cr
- Creek -> Crk
- Crescent -> Cr
- Crescent -> Cres
- Crest -> Crst
- Crest -> Cst
- Croft -> Cft
- Cross -> Cs
- Cross -> Crss
- Crossing -> Crsg
- Crossing -> Csg
- Crossing -> Xing
- Crossroad -> Crd
- Crossway -> Cowy
- Cul-de-sac -> Cds
- Cul-de-sac -> Csac
- Curve -> Cve
- Cutting -> Cutt
- Dale -> Dle
- Dale -> Dale
- Deviation -> Devn
- Dip -> Dip
- Distributor -> Dstr
- Down -> Dn
- Downs -> Dn
- Drive -> Dr
- Drive -> Drv
- Drive -> Dv
- Drive-In => Drive-In # prevent abbreviation here
- Driveway -> Drwy
- Driveway -> Dvwy
- Driveway -> Dwy
- East -> E
- Edge -> Edg
- Edge -> Edge
- Elbow -> Elb
- End -> End
- Entrance -> Ent
- Esplanade -> Esp
- Estate -> Est
- Expressway -> Exp
- Expressway -> Expy
- Expressway -> Expwy
- Expressway -> Xway
- Extension -> Ex
- Fairway -> Fawy
- Fairway -> Fy
- Father -> Fr
- Ferry -> Fy
- Field -> Fd
- Fire Track -> Ftrk
- Firetrail -> Fit
- Flat -> Fl
- Flat -> Flat
- Follow -> Folw
- Footway -> Ftwy
- Foreshore -> Fshr
- Forest Service Road -> FSR
- Formation -> Form
- Fort -> Ft
- Freeway -> Frwy
- Freeway -> Fwy
- Front -> Frnt
- Frontage -> Fr
- Frontage -> Frtg
- Gap -> Gap
- Garden -> Gdn
- Gardens -> Gdn
- Gardens -> Gdns
- Gate -> Ga
- Gate -> Gte
- Gates -> Ga
- Gates -> Gte
- Gateway -> Gwy
- George -> Geo
- Glade -> Gl
- Glade -> Gld
- Glade -> Glde
- Glen -> Gln
- Glen -> Glen
- Grange -> Gra
- Green -> Gn
- Green -> Grn
- Ground -> Grnd
- Grove -> Gr
- Grove -> Gro
- Grovet -> Gr
- Gully -> Gly
- Harbor -> Hbr
- Harbour -> Hbr
- Haven -> Hvn
- Head -> Hd
- Heads -> Hd
- Heights -> Hgts
- Heights -> Ht
- Heights -> Hts
- High School -> HS
- Highroad -> Hird
- Highroad -> Hrd
- Highway -> Hwy
- Hill -> Hill
- Hill -> Hl
- Hills -> Hl
- Hills -> Hls
- Hospital -> Hosp
- House -> Ho
- House -> Hse
- Industrial -> Ind
- Interchange -> Intg
- International -> Intl
- Island -> I
- Island -> Is
- Junction -> Jctn
- Junction -> Jnc
- Junior -> Jr
- Key -> Key
- Lagoon -> Lgn
- Lakes -> L
- Landing -> Ldg
- Lane -> La
- Lane -> Lane
- Lane -> Ln
- Laneway -> Lnwy
- Line -> Line
- Line -> Ln
- Link -> Link
- Link -> Lk
- Little -> Lit
- Little -> Lt
- Lodge -> Ldg
- Lookout -> Lkt
- Loop -> Loop
- Loop -> Lp
- Lower -> Low
- Lower -> Lr
- Lower -> Lwr
- Mall -> Mall
- Mall -> Ml
- Manor -> Mnr
- Mansions -> Mans
- Market -> Mkt
- Meadow -> Mdw
- Meadows -> Mdw
- Meadows -> Mdws
- Mead -> Md
- Meander -> Mdr
- Meander -> Mndr
- Meander -> Mr
- Medical -> Med
- Memorial -> Mem
- Mews -> Mews
- Mews -> Mw
- Middle -> Mid
- Middle School -> MS
- Mile -> Mi
- Military -> Mil
- Motorway -> Mtwy
- Motorway -> Mwy
- Mount -> Mt
- Mountain -> Mtn
- Mountains -> Mtn
- Municipal -> Mun
- Museum -> Mus
- National Park -> NP
- National Recreation Area -> NRA
- National Wildlife Refuge Area -> NWRA
- Nook -> Nk
- Nook -> Nook
- North -> N
- Northeast -> NE
- Northwest -> NW
- Outlook -> Out
- Outlook -> Otlk
- Parade -> Pde
- Paradise -> Pdse
- Park -> Park
- Park -> Pk
- Parklands -> Pkld
- Parkway -> Pkwy
- Parkway -> Pky
- Parkway -> Pwy
- Pass -> Pass
- Pass -> Ps
- Passage -> Psge
- Path -> Path
- Pathway -> Phwy
- Pathway -> Pway
- Pathway -> Pwy
- Piazza -> Piaz
- Pike -> Pk
- Place -> Pl
- Plain -> Pl
- Plains -> Pl
- Plateau -> Plat
- Plaza -> Pl
- Plaza -> Plz
- Plaza -> Plza
- Pocket -> Pkt
- Point -> Pnt
- Point -> Pt
- Port -> Port
- Port -> Pt
- Post Office -> PO
- Precinct -> Pct
- Promenade -> Prm
- Promenade -> Prom
- Quad -> Quad
- Quadrangle -> Qdgl
- Quadrant -> Qdrt
- Quadrant -> Qd
- Quay -> Qy
- Quays -> Qy
- Quays -> Qys
- Ramble -> Ra
- Ramble -> Rmbl
- Range -> Rge
- Range -> Rnge
- Reach -> Rch
- Reservation -> Res
- Reserve -> Res
- Reservoir -> Res
- Rest -> Rest
- Rest -> Rst
- Retreat -> Rt
- Retreat -> Rtt
- Return -> Rtn
- Ridge -> Rdg
- Ridge -> Rdge
- Ridgeway -> Rgwy
- Right of Way -> Rowy
- Rise -> Ri
- Rise -> Rise
- River -> R
- River -> Riv
- River -> Rvr
- Riverway -> Rvwy
- Riviera -> Rvra
- Road -> Rd
- Roads -> Rds
- Roadside -> Rdsd
- Roadway -> Rdwy
- Roadway -> Rdy
- Robert -> Robt
- Rocks -> Rks
- Ronde -> Rnde
- Rosebowl -> Rsbl
- Rotary -> Rty
- Round -> Rnd
- Route -> Rt
- Route -> Rte
- Row -> Row
- Rue -> Rue
- Run -> Run
- Saint -> St
- Saints -> SS
- Senior -> Sr
- Serviceway -> Swy
- Serviceway -> Svwy
- Shunt -> Shun
- Siding -> Sdng
- Sister -> Sr
- Slope -> Slpe
- Sound -> Snd
- South -> S
- South -> Sth
- Southeast -> SE
- Southwest -> SW
- Spur -> Spur
- Square -> Sq
- Stairway -> Strwy
- State Highway -> SH
- State Highway -> SHwy
- State Route -> SR
- Station -> Sta
- Station -> Stn
- Strand -> Sd
- Strand -> Stra
- Street -> St
- Strip -> Strp
- Subway -> Sbwy
- Tarn -> Tn
- Tarn -> Tarn
- Terminal -> Term
- Terrace -> Tce
- Terrace -> Ter
- Terrace -> Terr
- Thoroughfare -> Thfr
- Thoroughfare -> Thor
- Tollway -> Tlwy
- Tollway -> Twy
- Top -> Top
- Tor -> Tor
- Towers -> Twrs
- Township -> Twp
- Trace -> Trce
- Track -> Tr
- Track -> Trk
- Trail -> Trl
- Trailer -> Trlr
- Triangle -> Tri
- Trunkway -> Tkwy
- Tunnel -> Tun
- Turn -> Tn
- Turn -> Trn
- Turn -> Turn
- Turnpike -> Tpk
- Turnpike -> Tpke
- Underpass -> Upas
- Underpass -> Ups
- University -> Uni
- University -> Univ
- Upper -> Up
- Upper -> Upr
- Vale -> Va
- Vale -> Vale
- Valley -> Vy
- Viaduct -> Vdct
- Viaduct -> Via
- Viaduct -> Viad
- View -> Vw
- View -> View
- Village -> Vill
- Villas -> Vlls
- Vista -> Vst
- Vista -> Vsta
- Walk -> Walk
- Walk -> Wk
- Walk -> Wlk
- Walkway -> Wkwy
- Walkway -> Wky
- Waters -> Wtr
- Way -> Way
- Way -> Wy
- West -> W
- Wharf -> Whrf
- William -> Wm
- Wynd -> Wyn
- Wynd -> Wynd
- Yard -> Yard
- Yard -> Yd
- lang: en
country: ca
words:
- Circuit -> CIRCT
- Concession -> CONC
- Corners -> CRNRS
- Crossing -> CROSS
- Diversion -> DIVERS
- Esplanade -> ESPL
- Extension -> EXTEN
- Grounds -> GRNDS
- Harbour -> HARBR
- Highlands -> HGHLDS
- Landing -> LANDNG
- Limits -> LMTS
- Lookout -> LKOUT
- Orchard -> ORCH
- Parkway -> PKY
- Passage -> PASS
- Pathway -> PTWAY
- Private -> PVT
- Range -> RG
- Subdivision -> SUBDIV
- Terrace -> TERR
- Townline -> TLINE
- Turnabout -> TRNABT
- Village -> VILLGE
- lang: en
country: ph
words:
- Apartment -> Apt
- Barangay -> Brgy
- Barangay -> Bgy
- Building -> Bldg
- Commission -> Comm
- Compound -> Cmpd
- Compound -> Cpd
- Cooperative -> Coop
- Department -> Dept
- Department -> Dep't
- General -> Gen
- Governor -> Gov
- National -> Nat'l
- National High School -> NHS
- Philippine -> Phil
- Police Community Precinct -> PCP
- Province -> Prov
- Senior High School -> SHS
- Subdivision -> Subd

View File

@ -0,0 +1,163 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Espa.C3.B1ol_-_Spanish
- lang: es
words:
- Acequia -> Aceq
- Alameda -> Alam
- Alquería -> Alque
- Andador -> Andad
- Angosta -> Angta
- Apartamento -> Apto
- Apartamentos -> Aptos
- Apeadero -> Apdro
- Arboleda -> Arb
- Arrabal -> Arral
- Arroyo -> Arry
- Asociación de Vecinos -> A VV
- Asociación Vecinal -> A V
- Autopista -> Auto
- Autovía -> Autov
- Avenida -> Av
- Avenida -> Avd
- Avenida -> Avda
- Balneario -> Balnr
- Banda -> B
- Banda -> Bda
- Barranco -> Branc
- Barranquil -> Bqllo
- Barriada -> Barda
- Barrio -> B.º
- Barrio -> Bo
- Bloque -> Blq
- Bulevar -> Blvr
- Boulevard -> Blvd
- Calle -> C/
- Calle -> C
- Calle -> Cl
- Calleja -> Cllja
- Callejón -> Callej
- Callejón -> Cjón
- Callejón -> Cllón
- Callejuela -> Cjla
- Callizo -> Cllzo
- Calzada -> Czada
- Camino -> Cno
- Camino -> Cmno
- Camino hondo -> C H
- Camino nuevo -> C N
- Camino viejo -> C V
- Camping -> Campg
- Cantera -> Cantr
- Cantina -> Canti
- Cantón -> Cant
- Carrera -> Cra
- Carrero -> Cro
- Carretera -> Ctra
- Carreterín -> Ctrin
- Carretil -> Crtil
- Caserío -> Csrio
- Centro Integrado de Formación Profesional -> CIFP
- Cinturón -> Cint
- Circunvalación -> Ccvcn
- Cobertizo -> Cbtiz
- Colegio de Educación Especial -> CEE
- Colegio de Educación Infantil -> CEI
- Colegio de Educación Infantil y Primaria -> CEIP
- Colegio Rural Agrupado -> CRA
- Colonia -> Col
- Complejo -> Compj
- Conjunto -> Cjto
- Convento -> Cnvto
- Cooperativa -> Coop
- Corralillo -> Crrlo
- Corredor -> Crrdo
- Cortijo -> Crtjo
- Costanilla -> Cstan
- Costera -> Coste
- Dehesa -> Dhsa
- Demarcación -> Demar
- Diagonal -> Diag
- Diseminado -> Disem
- Doctor -> Dr
- Doctora -> Dra
- Edificio -> Edif
- Empresa -> Empr
- Entrada -> Entd
- Escalera -> Esca
- Escalinata -> Escal
- Espalda -> Eslda
- Estación -> Estcn
- Estrada -> Estda
- Explanada -> Expla
- Extramuros -> Extrm
- Extrarradio -> Extrr
- Fábrica -> Fca
- Fábrica -> Fbrca
- Ferrocarril -> F C
- Ferrocarriles -> FF CC
- Galería -> Gale
- Glorieta -> Gta
- Gran Vía -> G V
- Hipódromo -> Hipód
- Instituto de Educación Secundaria -> IES
- Jardín -> Jdín
- Llanura -> Llnra
- Lote -> Lt
- Malecón -> Malec
- Manzana -> Mz
- Mercado -> Merc
- Mirador -> Mrdor
- Monasterio -> Mtrio
- Nuestra Señora -> N.ª S.ª
- Nuestra Señora -> Ntr.ª Sr.ª
- Nuestra Señora -> Ntra Sra
- Palacio -> Palac
- Pantano -> Pant
- Parque -> Pque
- Particular -> Parti
- Partida -> Ptda
- Pasadizo -> Pzo
- Pasaje -> Psje
- Paseo -> P.º
- Paseo marítimo -> P.º mar
- Pasillo -> Psllo
- Plaza -> Pl
- Plaza -> Pza
- Plazoleta -> Pzta
- Plazuela -> Plzla
- Poblado -> Pbdo
- Polígono -> Políg
- Polígono industrial -> Pg ind
- Pórtico -> Prtco
- Portillo -> Ptilo
- Prazuela -> Przla
- Prolongación -> Prol
- Pueblo -> Pblo
- Puente -> Pte
- Puerta -> Pta
- Puerto -> Pto
- Punto kilométrico -> P k
- Rambla -> Rbla
- Residencial -> Resid
- Ribera -> Rbra
- Rincón -> Rcón
- Rinconada -> Rcda
- Rotonda -> Rtda
- San -> S
- Sanatorio -> Sanat
- Santa -> Sta
- Santo -> Sto
- Santas -> Stas
- Santos -> Stos
- Santuario -> Santu
- Sector -> Sect
- Sendera -> Sedra
- Sendero -> Send
- Torrente -> Trrnt
- Tránsito -> Tráns
- Transversal -> Trval
- Trasera -> Tras
- Travesía -> Trva
- Urbanización -> Urb
- Vecindario -> Vecin
- Viaducto -> Vcto
- Viviendas -> Vvdas

View File

@ -0,0 +1,8 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Eesti_-_Estonian
- lang: et
words:
- Maantee -> mnt
- Puiestee -> pst
- Raudtee -> rdt
- Raudteejaam -> rdtj
- Tänav -> tn

View File

@ -0,0 +1,6 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Euskara_-_Basque
- lang: eu
words:
- Etorbidea -> Etorb
- Errepidea -> Err
- Kalea -> K

View File

@ -0,0 +1,23 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Suomi_-_Finnish
- lang: fi
words:
- ~alue -> al
- ~asema -> as
- ~aukio -> auk
- ~kaari -> kri
- ~katu -> k
- ~kuja -> kj
- ~kylä -> kl
- ~penger -> pgr
- ~polku -> p
- ~puistikko -> pko
- ~puisto -> ps
- ~raitti -> r
- ~rautatieasema -> ras
- ~ranta -> rt
- ~rinne -> rn
- ~taival -> tvl
- ~tie -> t
- tienhaara -> th
- ~tori -> tr
- ~väylä -> vlä

View File

@ -0,0 +1,297 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Fran.C3.A7ais_-_French
- lang: fr
words:
- Abbaye -> ABE
- Agglomération -> AGL
- Aire -> AIRE
- Aires -> AIRE
- Allée -> ALL
- Allée -> All
- Allées -> ALL
- Ancien chemin -> ACH
- Ancienne route -> ART
- Anciennes routes -> ART
- Anse -> ANSE
- Arcade -> ARC
- Arcades -> ARC
- Autoroute -> AUT
- Avenue -> AV
- Avenue -> Av
- Barrière -> BRE
- Barrières -> BRE
- Bas chemin -> BCH
- Bastide -> BSTD
- Baston -> BAST
- Béguinage -> BEGI
- Béguinages -> BEGI
- Berge -> BER
- Berges -> BER
- Bois -> BOIS
- Boucle -> BCLE
- Boulevard -> Bd
- Boulevard -> BD
- Bourg -> BRG
- Butte -> BUT
- Cité -> CITE
- Cités -> CITE
- Côte -> COTE
- Côteau -> COTE
- Cale -> CALE
- Camp -> CAMP
- Campagne -> CGNE
- Camping -> CPG
- Carreau -> CAU
- Carrefour -> CAR
- Carrière -> CARE
- Carrières -> CARE
- Carré -> CARR
- Castel -> CST
- Cavée -> CAV
- Central -> CTRE
- Centre -> CTRE
- Chalet -> CHL
- Chapelle -> CHP
- Charmille -> CHI
- Chaussée -> CHS
- Chaussées -> CHS
- Chemin -> Ch
- Chemin -> CHE
- Chemin -> Che
- Chemin vicinal -> CHV
- Cheminement -> CHEM
- Cheminements -> CHEM
- Chemins -> CHE
- Chemins vicinaux -> CHV
- Chez -> CHEZ
- Château -> CHT
- Cloître -> CLOI
- Clos -> CLOS
- Col -> COL
- Colline -> COLI
- Collines -> COLI
- Contour -> CTR
- Corniche -> COR
- Corniches -> COR
- Cottage -> COTT
- Cottages -> COTT
- Cour -> COUR
- Cours -> CRS
- Cours -> Crs
- Darse -> DARS
- Degré -> DEG
- Degrés -> DEG
- Descente -> DSG
- Descentes -> DSG
- Digue -> DIG
- Digues -> DIG
- Domaine -> DOM
- Domaines -> DOM
- Écluse -> ECL
- Écluse -> ÉCL
- Écluses -> ECL
- Écluses -> ÉCL
- Église -> EGL
- Église -> ÉGL
- Enceinte -> EN
- Enclave -> ENV
- Enclos -> ENC
- Escalier -> ESC
- Escaliers -> ESC
- Espace -> ESPA
- Esplanade -> ESP
- Esplanades -> ESP
- Étang -> ETANG
- Étang -> ÉTANG
- Faubourg -> FG
- Faubourg -> Fg
- Ferme -> FRM
- Fermes -> FRM
- Fontaine -> FON
- Fort -> FORT
- Forum -> FORM
- Fosse -> FOS
- Fosses -> FOS
- Foyer -> FOYR
- Galerie -> GAL
- Galeries -> GAL
- Gare -> GARE
- Garenne -> GARN
- Grand boulevard -> GBD
- Grand ensemble -> GDEN
- Grandrue -> GR
- Grande rue -> GR
- Grandes rues -> GR
- Grands ensembles -> GDEN
- Grille -> GRI
- Grimpette -> GRIM
- Groupe -> GPE
- Groupement -> GPT
- Groupes -> GPE
- Halle -> HLE
- Halles -> HLE
- Hameau -> HAM
- Hameaux -> HAM
- Haut chemin -> HCH
- Hauts chemins -> HCH
- Hippodrome -> HIP
- HLM -> HLM
- Île -> ILE
- Île -> ÎLE
- Immeuble -> IMM
- Immeubles -> IMM
- Impasse -> IMP
- Impasse -> Imp
- Impasses -> IMP
- Jardin -> JARD
- Jardins -> JARD
- Jetée -> JTE
- Jetées -> JTE
- Levée -> LEVE
- Lieu-dit -> LD
- Lotissement -> LOT
- Lotissements -> LOT
- Mail -> MAIL
- Maison forestière -> MF
- Manoir -> MAN
- Marche -> MAR
- Marches -> MAR
- Maréchal -> MAL
- Mas -> MAS
- Monseigneur -> Mgr
- Mont -> Mt
- Montée -> MTE
- Montées -> MTE
- Moulin -> MLN
- Moulins -> MLN
- Musée -> MUS
- Métro -> MET
- Métro -> MÉT
- Nouvelle route -> NTE
- Palais -> PAL
- Parc -> PARC
- Parcs -> PARC
- Parking -> PKG
- Parvis -> PRV
- Passage -> PAS
- Passage -> Pas
- Passage -> Pass
- Passage à niveau -> PN
- Passe -> PASS
- Passerelle -> PLE
- Passerelles -> PLE
- Passes -> PASS
- Patio -> PAT
- Pavillon -> PAV
- Pavillons -> PAV
- Petit chemin -> PCH
- Petite allée -> PTA
- Petite avenue -> PAE
- Petite impasse -> PIM
- Petite route -> PRT
- Petite rue -> PTR
- Petites allées -> PTA
- Place -> PL
- Place -> Pl
- Placis -> PLCI
- Plage -> PLAG
- Plages -> PLAG
- Plaine -> PLN
- Plan -> PLAN
- Plateau -> PLT
- Plateaux -> PLT
- Pointe -> PNT
- Pont -> PONT
- Ponts -> PONT
- Porche -> PCH
- Port -> PORT
- Porte -> PTE
- Portique -> PORQ
- Portiques -> PORQ
- Poterne -> POT
- Pourtour -> POUR
- Presquîle -> PRQ
- Promenade -> PROM
- Promenade -> Prom
- Pré -> PRE
- Pré -> PRÉ
- Périphérique -> PERI
- Péristyle -> PSTY
- Quai -> QU
- Quai -> Qu
- Quartier -> QUA
- Raccourci -> RAC
- Raidillon -> RAID
- Rampe -> RPE
- Rempart -> REM
- Roc -> ROC
- Rocade -> ROC
- Rond point -> RPT
- Roquet -> ROQT
- Rotonde -> RTD
- Route -> RTE
- Route -> Rte
- Routes -> RTE
- Rue -> R
- Rue -> R
- Ruelle -> RLE
- Ruelles -> RLE
- Rues -> R
- Résidence -> RES
- Résidences -> RES
- Saint -> St
- Sainte -> Ste
- Sente -> SEN
- Sentes -> SEN
- Sentier -> SEN
- Sentiers -> SEN
- Square -> SQ
- Square -> Sq
- Stade -> STDE
- Station -> STA
- Terrain -> TRN
- Terrasse -> TSSE
- Terrasses -> TSSE
- Terre plein -> TPL
- Tertre -> TRT
- Tertres -> TRT
- Tour -> TOUR
- Traverse -> TRA
- Vallon -> VAL
- Vallée -> VAL
- Venelle -> VEN
- Venelles -> VEN
- Via -> VIA
- Vieille route -> VTE
- Vieux chemin -> VCHE
- Villa -> VLA
- Village -> VGE
- Villages -> VGE
- Villas -> VLA
- Voie -> VOI
- Voies -> VOI
- Zone -> ZONE
- Zone artisanale -> ZA
- Zone d'aménagement concerté -> ZAC
- Zone d'aménagement différé -> ZAD
- Zone industrielle -> ZI
- Zone à urbaniser en priorité -> ZUP
- lang: fr
country: ca
words:
- Boulevard -> BOUL
- Carré -> CAR
- Carrefour -> CARREF
- Centre -> C
- Chemin -> CH
- Croissant -> CROIS
- Diversion -> DIVERS
- Échangeur -> ÉCH
- Esplanade -> ESPL
- Passage -> PASS
- Plateau -> PLAT
- Rang -> RANG
- Rond-point -> RDPT
- Sentier -> SENT
- Subdivision -> SUBDIV
- Terrasse -> TSSE
- Village -> VILLGE

View File

@ -0,0 +1,27 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Galego_-_Galician
- lang: gl
words:
- Asociación Veciñal -> A V
- Asociación de Veciños -> A VV
- Avenida -> Av
- Avenida -> Avda
- Centro Integrado de Formación Profesional -> CIFP
- Colexio de Educación Especial -> CEE
- Colexio de Educación Infantil -> CEI
- Colexio de Educación Infantil e Primaria -> CEIP
- Colexio Rural Agrupado -> CRA
- Doutor -> Dr
- Doutora -> Dra
- Edificio -> Edif
- Estrada -> Estda
- Ferrocarril -> F C
- Ferrocarrís -> FF CC
- Instituto de Educación Secundaria -> IES
- Rúa -> R/
- San -> S
- Santa -> Sta
- Santo -> Sto
- Santas -> Stas
- Santos -> Stos
- Señora -> Sra
- Urbanización -> Urb

View File

@ -0,0 +1,4 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Magyar_-_Hungarian
- lang: hu
words:
- utca -> u

View File

@ -0,0 +1,77 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Italiano_-_Italian
- lang: it
words:
- Calle -> C.le
- Campo -> C.po
- Cascina -> C.na
- Cinque -> 5
- Corso -> C.so
- Corte -> C.te
- Decima -> X
- Decimo -> X
- Due -> 2
- Fondamenta -> F.ta
- Largo -> L.go
- Località -> Loc
- Lungomare -> L.mare
- Nona -> IX
- Nono -> IX
- Nove -> 9
- Otto -> 8
- Ottava -> VIII
- Ottavo -> VIII
- Piazza -> P.za
- Piazza -> P.zza
- Piazzale -> P.le
- Piazzetta -> P.ta
- Ponte -> P.te
- Porta -> P.ta
- Prima -> I
- Primo -> I
- Primo -> 1
- Primo -> 1°
- Quarta -> IV
- Quarto -> IV
- Quattro -> IV
- Quattro -> 4
- Quinta -> V
- Quinto -> V
- Salizada -> S.da
- San -> S
- Santa -> S
- Santo -> S
- Sant' -> S
- Santi -> SS
- Santissima -> SS.ma
- Santissime -> SS.me
- Santissimi -> SS.mi
- Santissimo -> SS.mo
- Seconda -> II
- Secondo -> II
- Sei -> 6
- Sesta -> VI
- Sesto -> VI
- Sette -> 7
- Settima -> VII
- Settimo -> VII
- Stazione -> Staz
- Strada Comunale -> SC
- Strada Provinciale -> SP
- Strada Regionale -> SR
- Strada Statale -> SS
- Terzo -> III
- Terza -> III
- Tre -> 3
- Trenta -> XXX
- Un -> 1
- Una -> 1
- Venti -> XX
- Venti -> 20
- Venticinque -> XXV
- Venticinque -> 25
- Ventiquattro -> XXIV
- Ventitreesimo -> XXIII
- Via -> V
- Viale -> V.le
- Vico -> V.co
- Vicolo -> V.lo

View File

@ -0,0 +1,32 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.E6.97.A5.E6.9C.AC.E8.AA.9E_.28Nihongo.29_-_Japanese
- lang: ja
words:
- ~中学校 |-> 中
- ~大学 |-> 大
- 独立行政法人~ -> 独
- 学校法人~ -> 学
- ~銀行 |-> 銀
- ~合同会社 -> 合
- 合同会社~ -> 合
- ~合名会社 -> 名
- 合名会社~ -> 名
- ~合資会社 -> 資
- 合資会社~ -> 資
- 一般道道~ -> 一
- 一般府道~ -> 一
- 一般県道~ -> 一
- 一般社団法人~ -> 一社
- 一般都道~ -> 一
- 一般財団法人~ -> 一財
- 医療法人~ -> 医
- ~株式会社 -> 株
- 株式会社~ -> 株
- 国立大学法人~ -> 大
- 公立大学法人~ -> 大
- ~高等学校 |-> 高
- ~高等学校 |-> 高校
- ~小学校 |-> 小
- 主要地方道~ -> 主
- 有限会社~ -> 有
- ~有限会社 -> 有
- 財団法人~ -> 財

View File

@ -0,0 +1,12 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Malagasy_-_Malgache
- lang: mg
words:
- Ambato -> Ato
- Ambinany -> Any
- Ambodi -> Adi
- Ambohi -> Ahi
- Ambohitr' -> Atr'
- Ambony -> Ani
- Ampasi -> Asi
- Andoha -> Aha
- Andrano -> Ano

View File

@ -0,0 +1,12 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Bahasa_Melayu_-_Malay
- lang: ms
words:
- Jalan -> Jln
- Simpang -> Spg
- Kampong -> Kg
- Sungai -> Sg
- Haji -> Hj
- Pengiran -> Pg
- Awang -> Awg
- Dayang -> Dyg

View File

@ -0,0 +1,53 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Nederlands_-_Dutch
- lang: nl
words:
- Broeder -> Br
- Burgemeester -> Burg
- Commandant -> Cmdt
- Doctor -> dr
- Dokter -> Dr
- Dominee -> ds
- Gebroeders -> Gebr
- Generaal -> Gen
- ~gracht -> gr
- Ingenieur -> ir
- Jonkheer -> Jhr
- Kolonel -> Kol
- Kanunnik -> Kan
- Kardinaal -> Kard
- Kort(e) -> Kte, K
- Koning -> Kon
- Koningin -> Kon
- ~laan -> ln
- Lange -> L
- Luitenant -> Luit
- ~markt -> mkt
- Meester -> Mr, mr
- Mejuffrouw -> Mej
- Mevrouw -> Mevr
- Minister -> Min
- Monseigneur -> Mgr
- Noordzijde -> NZ, N Z
- Oostzijde -> OZ, O Z
- Onze-Lieve-Vrouw,Onze-Lieve-Vrouwe -> O L V, OLV
- Pastoor -> Past
- ~plein -> pln
- President -> Pres
- Prins -> Pr
- Prinses -> Pr
- Professor -> Prof
- ~singel -> sngl
- ~straat -> str
- ~steenweg -> stwg
- Sint -> St
- Van -> V
- Van De -> V D, vd
- Van Den -> V D, vd
- Van Der -> V D, vd
- Verlengde -> Verl
- ~vliet -> vlt
- Vrouwe -> Vr
- ~weg -> wg
- Westzijde -> WZ, W Z
- Zuidzijde -> ZZ, Z Z
- Zuster -> Zr

View File

@ -0,0 +1,11 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Norsk_-_Norwegian
- lang: no
words:
# convert between Nynorsk and Bokmål here
- vei, veg => v,vn,vei,veg
- veien, vegen -> v,vn,veien,vegen
- gate -> g,gt
# convert between the two feminine forms
- gaten, gata => g,gt,gaten,gata
- plass, plassen -> pl
- sving, svingen -> sv

View File

@ -0,0 +1,66 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Polski_.E2.80.93_Polish
- lang: pl
words:
- Aleja, Aleje, Alei, Alejach, Aleją -> al
- Ulica, Ulice, Ulicą, Ulicy -> ul
- Plac, Placu, Placem -> pl
- Wybrzeże, Wybrzeża, Wybrzeżem -> wyb
- Bulwar -> bulw
- Dolny, Dolna, Dolne -> Dln
- Drugi, Druga, Drugie -> 2
- Drugi, Druga, Drugie -> II
- Duży, Duża, Duże -> Dz
- Duży, Duża, Duże -> Dż
- Górny, Górna, Górne -> Grn
- Kolonia -> kol
- koło, kolo -> k
- Mały, Mała, Małe -> Ml
- Mały, Mała, Małe -> Mł
- Mazowiecka, Mazowiecki, Mazowieckie -> maz
- Miasto -> m
- Nowy, Nowa, Nowe -> Nw
- Nowy, Nowa, Nowe -> N
- Osiedle, Osiedlu -> os
- Pierwszy, Pierwsza, Pierwsze -> 1
- Pierwszy, Pierwsza, Pierwsze -> I
- Szkoła Podstawowa -> SP
- Stary, Stara, Stare -> St
- Stary, Stara, Stare -> Str
- Trzeci, Trzecia, Trzecie -> III
- Trzeci, Trzecia, Trzecie -> 3
- Wielki, Wielka, Wielkie -> Wlk
- Wielkopolski, Wielkopolska, Wielkopolskie -> wlkp
- Województwo, Województwie -> woj
- kardynała, kardynał -> kard
- pułkownika, pułkownik -> płk
- marszałka, marszałek -> marsz
- generała, generał -> gen
- Świętego, Świętej, Świętych, święty, święta, święci -> św
- Świętych, święci -> śś
- Ojców -> oo
- Błogosławionego, Błogosławionej, Błogosławionych, błogosławiony, błogosławiona, błogosławieni -> bł
- księdza, ksiądz -> ks
- księcia, książe -> ks
- doktora, doktor -> dr
- majora, major -> mjr
- biskupa, biskup -> bpa
- biskupa, biskup -> bp
- rotmistrza, rotmistrz -> rotm
- profesora, profesor -> prof
- hrabiego, hrabiny, hrabia, hrabina -> hr
- porucznika, porucznik -> por
- podpułkownika, podpułkownik -> ppłk
- pułkownika, pułkownik -> płk
- podporucznika, podporucznik -> ppor
- porucznika, porucznik -> por
- marszałka, marszałek -> marsz
- chorążego, chorąży -> chor
- szeregowego, szeregowy -> szer
- kaprala, kapral -> kpr
- plutonowego, plutonowy -> plut
- kapitana, kapitan -> kpt
- admirała, admirał -> adm
- wiceadmirała, wiceadmirał -> wadm
- kontradmirała, kontradmirał -> kontradm
- batalionów, bataliony -> bat
- batalionu, batalion -> bat

View File

@ -0,0 +1,196 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Portugu.C3.AAs_-_Portuguese
- lang: pt
words:
- Associação -> Ass
- Alameda -> Al
- Alferes -> Alf
- Almirante -> Alm
- Arquitecto -> Arq
- Arquitecto -> Arqº
- Arquiteto -> Arq
- Arquiteto -> Arqº
- Auto-estrada -> A
- Avenida -> Av
- Avenida -> Avª
- Azinhaga -> Az
- Bairro -> B
- Bairro -> Bº
- Bairro -> Br
- Beco -> Bc
- Beco -> Bco
- Bloco -> Bl
- Bombeiros Voluntários -> BV
- Bombeiros Voluntários -> B.V
- Brigadeiro -> Brg
- Cacique -> Cac
- Calçada -> Cc
- Calçadinha -> Ccnh
- Câmara Municipal -> CM
- Câmara Municipal -> C.M
- Caminho -> Cam
- Capitão -> Cap
- Casal -> Csl
- Cave -> Cv
- Centro Comercial -> CC
- Centro Comercial -> C.C
- Ciclo do Ensino Básico -> CEB
- Ciclo do Ensino Básico -> C.E.B
- Ciclo do Ensino Básico -> C. E. B
- Comandante -> Cmdt
- Comendador -> Comend
- Companhia -> Cª
- Conselheiro -> Cons
- Coronel -> Cor
- Coronel -> Cel
- Corte -> C.te
- De -> D´
- De -> D'
- Departamento -> Dept
- Deputado -> Dep
- Direito -> Dto
- Dom -> D
- Dona -> D
- Dona -> Dª
- Doutor -> Dr
- Doutora -> Dr
- Doutora -> Drª
- Doutora -> Dra
- Duque -> Dq
- Edifício -> Ed
- Edifício -> Edf
- Embaixador -> Emb
- Empresa Pública -> EP
- Empresa Pública -> E.P
- Enfermeiro -> Enfo
- Enfermeiro -> Enfº
- Enfermeiro -> Enf
- Engenheiro -> Eng
- Engenheiro -> Engº
- Engenheira -> Eng
- Engenheira -> Engª
- Escadas -> Esc
- Escadinhas -> Escnh
- Escola Básica -> EB
- Escola Básica -> E.B
- Esquerdo -> Esq
- Estação de Tratamento de Águas Residuais -> ETAR
- Estação de Tratamento de Águas Residuais -> E.T.A.R
- Estrada -> Estr
- Estrada Municipal -> EM
- Estrada Nacional -> EN
- Estrada Regional -> ER
- Frei -> Fr
- Frente -> Ft
- Futebol Clube -> FC
- Futebol Clube -> F.C
- Guarda Nacional Republicana -> GNR
- Guarda Nacional Republicana -> G.N.R
- General -> Gen
- General -> Gal
- Habitação -> Hab
- Infante -> Inf
- Instituto -> Inst
- Irmã -> Ima
- Irmã -> Imª
- Irmã -> Im
- Irmão -> Imo
- Irmão -> Imº
- Irmão -> Im
- Itinerário Complementar -> IC
- Itinerário Principal -> IP
- Jardim -> Jrd
- Júnior -> Jr
- Largo -> Lg
- Limitada -> Lda
- Loja -> Lj
- Lote -> Lt
- Loteamento -> Loteam
- Lugar -> Lg
- Lugar -> Lug
- Maestro -> Mto
- Major -> Maj
- Marechal -> Mal
- Marquês -> Mq
- Madre -> Me
- Mestre -> Me
- Ministério -> Min
- Monsenhor -> Mons
- Municipal -> M
- Nacional -> N
- Nossa -> N
- Nossa -> Nª
- Nossa Senhora -> Ns
- Nosso -> N
- Número -> N
- Número -> Nº
- Padre -> Pe
- Parque -> Pq
- Particular -> Part
- Pátio -> Pto
- Pavilhão -> Pav
- Polícia de Segurança Pública -> PSP
- Polícia de Segurança Pública -> P.S.P
- Polícia Judiciária -> PJ
- Polícia Judiciária -> P.J
- Praça -> Pc
- Praça -> Pç
- Praça -> Pr
- Praceta -> Pct
- Praceta -> Pctª
- Presidente -> Presid
- Primeiro -> 1º
- Professor -> Prof
- Professora -> Prof
- Professora -> Profª
- Projectada -> Proj
- Projetada -> Proj
- Prolongamento -> Prolng
- Quadra -> Q
- Quadra -> Qd
- Quinta -> Qta
- Regional -> R
- Rés-do-chão -> R/c
- Rés-do-chão -> Rc
- Rotunda -> Rot
- Ribeira -> Rª
- Ribeira -> Rib
- Ribeira -> Ribª
- Rio -> R
- Rua -> R
- Santa -> Sta
- Santa -> Stª
- Santo -> St
- Santo -> Sto
- Santo -> Stº
- São -> S
- Sargento -> Sarg
- Sem Número -> S/n
- Sem Número -> Sn
- Senhor -> S
- Senhor -> Sr
- Senhora -> S
- Senhora -> Sª
- Senhora -> Srª
- Senhora -> Sr.ª
- Senhora -> S.ra
- Senhora -> Sra
- Sobre-Loja -> Slj
- Sociedade -> Soc
- Sociedade Anónima -> SA
- Sociedade Anónima -> S.A
- Sport Clube -> SC
- Sport Clube -> S.C
- Sub-Cave -> Scv
- Superquadra -> Sq
- Tenente -> Ten
- Torre -> Tr
- Transversal -> Transv
- Travessa -> Trav
- Travessa -> Trv
- Travessa -> Tv
- Universidade -> Univ
- Urbanização -> Urb
- Vila -> Vl
- Visconde -> Visc
- Vivenda -> Vv
- Zona -> Zn

View File

@ -0,0 +1,36 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Rom.C3.A2n.C4.83_-_Romanian
- lang: ro
words:
- Aleea -> ale
- Aleea -> al
- Bulevardul -> bulevard
- Bulevardul -> bulev
- Bulevardul -> b-dul
- Bulevardul -> blvd
- Bulevardul -> blv
- Bulevardul -> bdul
- Bulevardul -> bul
- Bulevardul -> bd
- Calea -> cal
- Fundătura -> fnd
- Fundacul -> fdc
- Intrarea -> intr
- Intrarea -> int
- Piața -> p-ța
- Piața -> pța
- Strada -> stra
- Strada -> str
- Stradela -> str-la
- Stradela -> sdla
- Șoseaua -> sos
- Splaiul -> sp
- Splaiul -> splaiul
- Splaiul -> spl
- Vârful -> virful
- Vârful -> virf
- Vârful -> varf
- Vârful -> vf
- Muntele -> m-tele
- Muntele -> m-te
- Muntele -> mnt
- Muntele -> mt

View File

@ -0,0 +1,14 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.D0.A0.D1.83.D1.81.D1.81.D0.BA.D0.B8.D0.B9_-_Russian
- lang: ru
words:
- аллея -> ал
- бульвар -> бул
- набережная -> наб
- переулок -> пер
- площадь -> пл
- проезд -> пр
- проспект -> просп
- шоссе -> ш
- тупик -> туп
- улица -> ул
- область -> обл

View File

@ -0,0 +1,20 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Slovensky_-_Slovak
- lang: sk
words:
- Ulica -> Ul
- Námestie -> Nám
- Svätého, Svätej -> Sv
- Generála -> Gen
- Armádneho generála -> Arm gen
- Doktora, Doktorky -> Dr
- Inžiniera, Inžinierky -> Ing
- Majora -> Mjr
- Profesora, Profesorky -> Prof
- Československej -> Čsl
- Plukovníka -> Plk
- Podplukovníka -> Pplk
- Kapitána -> Kpt
- Poručíka -> Por
- Podporučíka -> Ppor
- Sídlisko -> Sídl
- Nábrežie -> Nábr

View File

@ -0,0 +1,35 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Sloven.C5.A1.C4.8Dina_-_Slovenian
- lang: sl
words:
- Cesta -> C
- Gasilski Dom -> GD
- Osnovna šola -> OŠ
- Prostovoljno Gasilsko Društvo -> PGD
- Savinjski -> Savinj
- Slovenskih -> Slov
- Spodnja -> Sp
- Spodnje -> Sp
- Spodnji -> Sp
- Srednja -> Sr
- Srednje -> Sr
- Srednji -> Sr
- Sveta -> Sv
- Svete -> Sv
- Sveti -> Sv
- Svetega -> Sv
- Šent -> Št
- Ulica -> Ul
- Velika -> V
- Velike -> V
- Veliki -> V
- Veliko -> V
- Velikem -> V
- Velika -> Vel
- Velike -> Vel
- Veliki -> Vel
- Veliko -> Vel
- Velikem -> Vel
- Zdravstveni dom -> ZD
- Zgornja -> Zg
- Zgornje -> Zg
- Zgornji -> Zg

View File

@ -0,0 +1,21 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Svenska_-_Swedish
- lang: sv
words:
- ~väg, ~vägen -> v
- ~gatan, ~gata -> g
- ~gränd, ~gränden -> gr
- gamla -> G:la
- södra -> s
- södra -> s:a
- norra -> n
- norra -> n:a
- östra -> ö
- östra -> ö:a
- västra -> v
- västra -> v:a
- ~stig, ~stigen -> st
- sankt -> s:t
- sankta -> s:ta
- ~plats, ~platsen -> pl
- lilla -> l
- stora -> st

View File

@ -0,0 +1,14 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#T.C3.BCrk.C3.A7e_-_Turkish
- lang: tr
words:
- Sokak -> Sk
- Sokak -> Sok
- Sokağı -> Sk
- Sokağı -> Sok
- Cadde -> Cd
- Caddesi -> Cd
- Bulvar -> Bl
- Bulvar -> Blv
- Bulvarı -> Bl
- Mahalle -> Mh
- Mahalle -> Mah

View File

@ -0,0 +1,10 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#.D0.A3.D0.BA.D1.80.D0.B0.D1.97.D0.BD.D1.81.D1.8C.D0.BA.D0.B0_-_Ukrainian
- lang: uk
words:
- бульвар -> бул
- дорога -> дор
- провулок -> пров
- площа -> пл
- проспект -> просп
- шосе -> ш
- вулиця -> вул

View File

@ -0,0 +1,48 @@
# Source: https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#Ti.E1.BA.BFng_Vi.E1.BB.87t_.E2.80.93_Vietnamese
- lang: vi
words:
- Thành phố -> TP
- Thị xã -> TX
- Thị trấn -> TT
- Quận -> Q
- Phường -> P
- Phường -> Ph
- Quốc lộ -> QL
- Tỉnh lộ -> TL
- Đại lộ -> ĐL
- Đường -> Đ
- Công trường -> CT
- Quảng trường -> QT
- Sân bay -> SB
- Sân bay quốc tế -> SBQT
- Phi trường -> PT
- Đường sắt -> ĐS
- Trung tâm -> TT
- Trung tâm Thương mại -> TTTM
- Khách sạn -> KS
- Khách sạn -> K/S
- Bưu điện -> BĐ
- Đại học -> ĐH
- Cao đẳng -> CĐ
- Trung học Phổ thông -> THPT
- Trung học Cơ sở -> THCS
- Tiểu học -> TH
- Khu công nghiệp -> KCN
- Khu nghỉ mát -> KNM
- Khu du lịch -> KDL
- Công viên văn hóa -> CVVH
- Công viên -> CV
- Vườn quốc gia -> VQG
- Viện bảo tàng -> VBT
- Sân vận động -> SVĐ
- Nhà thi đấu -> NTĐ
- Câu lạc bộ -> CLB
- Nhà thờ -> NT
- Nhà hát -> NH
- Rạp hát -> RH
- Công ty -> Cty
- Tổng công ty -> TCty
- Tổng công ty -> TCT
- Công ty cổ phần -> CTCP
- Công ty cổ phần -> Cty CP
- Căn cứ không quân -> CCKQ

File diff suppressed because it is too large

View File

@ -0,0 +1,56 @@
normalization:
- ":: lower ()"
- !include icu-rules/unicode-digits-to-decimal.yaml
- "'№' > 'no'"
- "'n°' > 'no'"
- "'nº' > 'no'"
- "ª > a"
- "º > o"
- "[[:Punctuation:][:Symbol:]] > ' '"
- "ß > 'ss'" # German szet is unimbigiously equal to double ss
- "[^[:Letter:] [:Number:] [:Space:]] >"
- "[:Lm:] >"
- ":: [[:Number:]] Latin ()"
- ":: [[:Number:]] Ascii ();"
- ":: [[:Number:]] NFD ();"
- "[[:Nonspacing Mark:] [:Cf:]] >;"
- "[:Space:]+ > ' '"
transliteration:
- ":: Latin ()"
- !include icu-rules/extended-unicode-to-asccii.yaml
- ":: Ascii ()"
- ":: NFD ()"
- "[^[:Ascii:]] >"
- ":: lower ()"
- ":: NFC ()"
variants:
- !include icu-rules/variants-bg.yaml
- !include icu-rules/variants-ca.yaml
- !include icu-rules/variants-cs.yaml
- !include icu-rules/variants-da.yaml
- !include icu-rules/variants-de.yaml
- !include icu-rules/variants-el.yaml
- !include icu-rules/variants-en.yaml
- !include icu-rules/variants-es.yaml
- !include icu-rules/variants-et.yaml
- !include icu-rules/variants-eu.yaml
- !include icu-rules/variants-fi.yaml
- !include icu-rules/variants-fr.yaml
- !include icu-rules/variants-gl.yaml
- !include icu-rules/variants-hu.yaml
- !include icu-rules/variants-it.yaml
- !include icu-rules/variants-ja.yaml
- !include icu-rules/variants-mg.yaml
- !include icu-rules/variants-ms.yaml
- !include icu-rules/variants-nl.yaml
- !include icu-rules/variants-no.yaml
- !include icu-rules/variants-pl.yaml
- !include icu-rules/variants-pt.yaml
- !include icu-rules/variants-ro.yaml
- !include icu-rules/variants-ru.yaml
- !include icu-rules/variants-sk.yaml
- !include icu-rules/variants-sl.yaml
- !include icu-rules/variants-sv.yaml
- !include icu-rules/variants-tr.yaml
- !include icu-rules/variants-uk.yaml
- !include icu-rules/variants-vi.yaml
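
The loader introduced with this PR turns the configuration above into ICU transliterator rule sets (one for normalizing terms, one for producing the search tokens) and expands the included variant lists into replacement pairs; the behaviour is pinned down by the icu_rule_loader tests further below. The following is a rough sketch of inspecting the generated rules, assuming the ICURuleLoader interface from those tests; the configuration path is an assumption and depends on where the file is installed.

# Hypothetical sketch: inspect the rules produced from the configuration above.
# ICURuleLoader and the PyICU Transliterator calls mirror the tests in this PR;
# the path to the installed configuration file is an assumption.
from pathlib import Path

from icu import Transliterator
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader

loader = ICURuleLoader(Path('settings/legacy_icu_tokenizer.yaml'))

# Normalization and transliteration combined into the search-term rules.
search = Transliterator.createFromRules('search', loader.get_search_rules())
print(search.transliterate(' Baumstraße '))   # expected to yield ' baumstrasse '

# Variant lists expanded into (source, replacement) pairs.
pairs = list(loader.get_replacement_pairs())
print(len(pairs), 'replacement pairs loaded')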

View File

@ -53,7 +53,7 @@ Feature: Import and search of names
Scenario: Special characters in name
Given the places
| osm | class | type | name |
| N1 | place | locality | Jim-Knopf-Str |
| N1 | place | locality | Jim-Knopf-Straße |
| N2 | place | locality | Smith/Weston |
| N3 | place | locality | space mountain |
| N4 | place | locality | space |

View File

@ -214,7 +214,7 @@ def check_search_name_contents(context, exclude):
for name, value in zip(row.headings, row.cells):
if name in ('name_vector', 'nameaddress_vector'):
items = [x.strip() for x in value.split(',')]
tokens = analyzer.get_word_token_info(context.db, items)
tokens = analyzer.get_word_token_info(items)
if not exclude:
assert len(tokens) >= len(items), \

View File

@ -173,6 +173,7 @@ def place_row(place_table, temp_db_cursor):
""" A factory for rows in the place table. The table is created as a
prerequisite to the fixture.
"""
psycopg2.extras.register_hstore(temp_db_cursor)
idseq = itertools.count(1001)
def _insert(osm_type='N', osm_id=None, cls='amenity', typ='cafe', names=None,
admin_level=None, address=None, extratags=None, geom=None):

View File

@ -98,6 +98,13 @@ class MockWordTable:
WHERE class = 'place' and type = 'postcode'""")
return set((row[0] for row in cur))
def get_partial_words(self):
with self.conn.cursor() as cur:
cur.execute("""SELECT word_token, search_name_count FROM word
WHERE class is null and country_code is null
and not word_token like ' %'""")
return set((tuple(row) for row in cur))
class MockPlacexTable:
""" A placex table for testing.

View File

@ -50,3 +50,68 @@ def test_execute_file_with_post_code(dsn, tmp_path, temp_db_cursor):
db_utils.execute_file(dsn, tmpfile, post_code='INSERT INTO test VALUES(23)')
assert temp_db_cursor.row_set('SELECT * FROM test') == {(23, )}
class TestCopyBuffer:
TABLE_NAME = 'copytable'
@pytest.fixture(autouse=True)
def setup_test_table(self, table_factory):
table_factory(self.TABLE_NAME, 'colA INT, colB TEXT')
def table_rows(self, cursor):
return cursor.row_set('SELECT * FROM ' + self.TABLE_NAME)
def test_copybuffer_empty(self):
with db_utils.CopyBuffer() as buf:
buf.copy_out(None, "dummy")
def test_all_columns(self, temp_db_cursor):
with db_utils.CopyBuffer() as buf:
buf.add(3, 'hum')
buf.add(None, 'f\\t')
buf.copy_out(temp_db_cursor, self.TABLE_NAME)
assert self.table_rows(temp_db_cursor) == {(3, 'hum'), (None, 'f\\t')}
def test_selected_columns(self, temp_db_cursor):
with db_utils.CopyBuffer() as buf:
buf.add('foo')
buf.copy_out(temp_db_cursor, self.TABLE_NAME,
columns=['colB'])
assert self.table_rows(temp_db_cursor) == {(None, 'foo')}
def test_reordered_columns(self, temp_db_cursor):
with db_utils.CopyBuffer() as buf:
buf.add('one', 1)
buf.add(' two ', 2)
buf.copy_out(temp_db_cursor, self.TABLE_NAME,
columns=['colB', 'colA'])
assert self.table_rows(temp_db_cursor) == {(1, 'one'), (2, ' two ')}
def test_special_characters(self, temp_db_cursor):
with db_utils.CopyBuffer() as buf:
buf.add('foo\tbar')
buf.add('sun\nson')
buf.add('\\N')
buf.copy_out(temp_db_cursor, self.TABLE_NAME,
columns=['colB'])
assert self.table_rows(temp_db_cursor) == {(None, 'foo\tbar'),
(None, 'sun\nson'),
(None, '\\N')}

View File

@ -0,0 +1,104 @@
"""
Tests for import name normalisation and variant generation.
"""
from textwrap import dedent
import pytest
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.tokenizer.icu_name_processor import ICUNameProcessor, ICUNameProcessorRules
from nominatim.errors import UsageError
@pytest.fixture
def cfgfile(tmp_path, suffix='.yaml'):
def _create_config(*variants, **kwargs):
content = dedent("""\
normalization:
- ":: NFD ()"
- "'🜳' > ' '"
- "[[:Nonspacing Mark:] [:Cf:]] >"
- ":: lower ()"
- "[[:Punctuation:][:Space:]]+ > ' '"
- ":: NFC ()"
transliteration:
- ":: Latin ()"
- "'🜵' > ' '"
""")
content += "variants:\n - words:\n"
content += '\n'.join((" - " + s for s in variants)) + '\n'
for k, v in kwargs.items():
content += " {}: {}\n".format(k, v)
fpath = tmp_path / ('test_config' + suffix)
fpath.write_text(dedent(content))
return fpath
return _create_config
def get_normalized_variants(proc, name):
return proc.get_variants_ascii(proc.get_normalized(name))
def test_variants_empty(cfgfile):
fpath = cfgfile('saint -> 🜵', 'street -> st')
rules = ICUNameProcessorRules(loader=ICURuleLoader(fpath))
proc = ICUNameProcessor(rules)
assert get_normalized_variants(proc, '🜵') == []
assert get_normalized_variants(proc, '🜳') == []
assert get_normalized_variants(proc, 'saint') == ['saint']
VARIANT_TESTS = [
(('~strasse,~straße -> str', '~weg => weg'), "hallo", {'hallo'}),
(('weg => wg',), "holzweg", {'holzweg'}),
(('weg -> wg',), "holzweg", {'holzweg'}),
(('~weg => weg',), "holzweg", {'holz weg', 'holzweg'}),
(('~weg -> weg',), "holzweg", {'holz weg', 'holzweg'}),
(('~weg => w',), "holzweg", {'holz w', 'holzw'}),
(('~weg -> w',), "holzweg", {'holz weg', 'holzweg', 'holz w', 'holzw'}),
(('~weg => weg',), "Meier Weg", {'meier weg', 'meierweg'}),
(('~weg -> weg',), "Meier Weg", {'meier weg', 'meierweg'}),
(('~weg => w',), "Meier Weg", {'meier w', 'meierw'}),
(('~weg -> w',), "Meier Weg", {'meier weg', 'meierweg', 'meier w', 'meierw'}),
(('weg => wg',), "Meier Weg", {'meier wg'}),
(('weg -> wg',), "Meier Weg", {'meier weg', 'meier wg'}),
(('~strasse,~straße -> str', '~weg => weg'), "Bauwegstraße",
{'bauweg straße', 'bauweg str', 'bauwegstraße', 'bauwegstr'}),
(('am => a', 'bach => b'), "am bach", {'a b'}),
(('am => a', '~bach => b'), "am bach", {'a b'}),
(('am -> a', '~bach -> b'), "am bach", {'am bach', 'a bach', 'am b', 'a b'}),
(('am -> a', '~bach -> b'), "ambach", {'ambach', 'am bach', 'amb', 'am b'}),
(('saint -> s,st', 'street -> st'), "Saint Johns Street",
{'saint johns street', 's johns street', 'st johns street',
'saint johns st', 's johns st', 'st johns st'}),
(('river$ -> r',), "River Bend Road", {'river bend road'}),
(('river$ -> r',), "Bent River", {'bent river', 'bent r'}),
(('^north => n',), "North 2nd Street", {'n 2nd street'}),
(('^north => n',), "Airport North", {'airport north'}),
(('am -> a',), "am am am am am am am am", {'am am am am am am am am'}),
(('am => a',), "am am am am am am am am", {'a a a a a a a a'})
]
@pytest.mark.parametrize("rules,name,variants", VARIANT_TESTS)
def test_variants(cfgfile, rules, name, variants):
fpath = cfgfile(*rules)
proc = ICUNameProcessor(ICUNameProcessorRules(loader=ICURuleLoader(fpath)))
result = get_normalized_variants(proc, name)
assert len(result) == len(set(result))
assert set(get_normalized_variants(proc, name)) == variants
def test_search_normalized(cfgfile):
fpath = cfgfile('~street => s,st', 'master => mstr')
rules = ICUNameProcessorRules(loader=ICURuleLoader(fpath))
proc = ICUNameProcessor(rules)
assert proc.get_search_normalized('Master Street') == 'master street'
assert proc.get_search_normalized('Earnes St') == 'earnes st'
assert proc.get_search_normalized('Nostreet') == 'nostreet'

View File

@ -0,0 +1,264 @@
"""
Tests for converting a config file to ICU rules.
"""
import pytest
from textwrap import dedent
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.errors import UsageError
from icu import Transliterator
@pytest.fixture
def cfgfile(tmp_path, suffix='.yaml'):
def _create_config(*variants, **kwargs):
content = dedent("""\
normalization:
- ":: NFD ()"
- "[[:Nonspacing Mark:] [:Cf:]] >"
- ":: lower ()"
- "[[:Punctuation:][:Space:]]+ > ' '"
- ":: NFC ()"
transliteration:
- ":: Latin ()"
- "[[:Punctuation:][:Space:]]+ > ' '"
""")
content += "variants:\n - words:\n"
content += '\n'.join((" - " + s for s in variants)) + '\n'
for k, v in kwargs.items():
content += " {}: {}\n".format(k, v)
fpath = tmp_path / ('test_config' + suffix)
fpath.write_text(dedent(content))
return fpath
return _create_config
def test_empty_rule_file(tmp_path):
fpath = tmp_path / ('test_config.yaml')
fpath.write_text(dedent("""\
normalization:
transliteration:
variants:
"""))
rules = ICURuleLoader(fpath)
assert rules.get_search_rules() == ''
assert rules.get_normalization_rules() == ''
assert rules.get_transliteration_rules() == ''
assert list(rules.get_replacement_pairs()) == []
CONFIG_SECTIONS = ('normalization', 'transliteration', 'variants')
@pytest.mark.parametrize("section", CONFIG_SECTIONS)
def test_missing_normalization(tmp_path, section):
fpath = tmp_path / ('test_config.yaml')
with fpath.open('w') as fd:
for name in CONFIG_SECTIONS:
if name != section:
fd.write(name + ':\n')
with pytest.raises(UsageError):
ICURuleLoader(fpath)
def test_get_search_rules(cfgfile):
loader = ICURuleLoader(cfgfile())
rules = loader.get_search_rules()
trans = Transliterator.createFromRules("test", rules)
assert trans.transliterate(" Baum straße ") == " baum straße "
assert trans.transliterate(" Baumstraße ") == " baumstraße "
assert trans.transliterate(" Baumstrasse ") == " baumstrasse "
assert trans.transliterate(" Baumstr ") == " baumstr "
assert trans.transliterate(" Baumwegstr ") == " baumwegstr "
assert trans.transliterate(" Αθήνα ") == " athēna "
assert trans.transliterate(" проспект ") == " prospekt "
def test_get_normalization_rules(cfgfile):
loader = ICURuleLoader(cfgfile())
rules = loader.get_normalization_rules()
trans = Transliterator.createFromRules("test", rules)
assert trans.transliterate(" проспект-Prospekt ") == " проспект prospekt "
def test_get_transliteration_rules(cfgfile):
loader = ICURuleLoader(cfgfile())
rules = loader.get_transliteration_rules()
trans = Transliterator.createFromRules("test", rules)
assert trans.transliterate(" проспект-Prospekt ") == " prospekt Prospekt "
def test_transliteration_rules_from_file(tmp_path):
cfgpath = tmp_path / ('test_config.yaml')
cfgpath.write_text(dedent("""\
normalization:
transliteration:
- "'ax' > 'b'"
- !include transliteration.yaml
variants:
"""))
transpath = tmp_path / ('transliteration.yaml')
transpath.write_text('- "x > y"')
loader = ICURuleLoader(cfgpath)
rules = loader.get_transliteration_rules()
trans = Transliterator.createFromRules("test", rules)
assert trans.transliterate(" axxt ") == " byt "
class TestGetReplacements:
@pytest.fixture(autouse=True)
def setup_cfg(self, cfgfile):
self.cfgfile = cfgfile
def get_replacements(self, *variants):
loader = ICURuleLoader(self.cfgfile(*variants))
rules = loader.get_replacement_pairs()
return set((v.source, v.replacement) for v in rules)
@pytest.mark.parametrize("variant", ['foo > bar', 'foo -> bar -> bar',
'~foo~ -> bar', 'fo~ o -> bar'])
def test_invalid_variant_description(self, variant):
with pytest.raises(UsageError):
ICURuleLoader(self.cfgfile(variant))
def test_add_full(self):
repl = self.get_replacements("foo -> bar")
assert repl == {(' foo ', ' bar '), (' foo ', ' foo ')}
def test_replace_full(self):
repl = self.get_replacements("foo => bar")
assert repl == {(' foo ', ' bar ')}
def test_add_suffix_no_decompose(self):
repl = self.get_replacements("~berg |-> bg")
assert repl == {('berg ', 'berg '), ('berg ', 'bg '),
(' berg ', ' berg '), (' berg ', ' bg ')}
def test_replace_suffix_no_decompose(self):
repl = self.get_replacements("~berg |=> bg")
assert repl == {('berg ', 'bg '), (' berg ', ' bg ')}
def test_add_suffix_decompose(self):
repl = self.get_replacements("~berg -> bg")
assert repl == {('berg ', 'berg '), ('berg ', ' berg '),
(' berg ', ' berg '), (' berg ', 'berg '),
('berg ', 'bg '), ('berg ', ' bg '),
(' berg ', 'bg '), (' berg ', ' bg ')}
def test_replace_suffix_decompose(self):
repl = self.get_replacements("~berg => bg")
assert repl == {('berg ', 'bg '), ('berg ', ' bg '),
(' berg ', 'bg '), (' berg ', ' bg ')}
def test_add_prefix_no_compose(self):
repl = self.get_replacements("hinter~ |-> hnt")
assert repl == {(' hinter', ' hinter'), (' hinter ', ' hinter '),
(' hinter', ' hnt'), (' hinter ', ' hnt ')}
def test_replace_prefix_no_compose(self):
repl = self.get_replacements("hinter~ |=> hnt")
assert repl == {(' hinter', ' hnt'), (' hinter ', ' hnt ')}
def test_add_prefix_compose(self):
repl = self.get_replacements("hinter~-> h")
assert repl == {(' hinter', ' hinter'), (' hinter', ' hinter '),
(' hinter', ' h'), (' hinter', ' h '),
(' hinter ', ' hinter '), (' hinter ', ' hinter'),
(' hinter ', ' h '), (' hinter ', ' h')}
def test_replace_prefix_compose(self):
repl = self.get_replacements("hinter~=> h")
assert repl == {(' hinter', ' h'), (' hinter', ' h '),
(' hinter ', ' h '), (' hinter ', ' h')}
def test_add_beginning_only(self):
repl = self.get_replacements("^Premier -> Pr")
assert repl == {('^ premier ', '^ premier '), ('^ premier ', '^ pr ')}
def test_replace_beginning_only(self):
repl = self.get_replacements("^Premier => Pr")
assert repl == {('^ premier ', '^ pr ')}
def test_add_final_only(self):
repl = self.get_replacements("road$ -> rd")
assert repl == {(' road ^', ' road ^'), (' road ^', ' rd ^')}
def test_replace_final_only(self):
repl = self.get_replacements("road$ => rd")
assert repl == {(' road ^', ' rd ^')}
def test_decompose_only(self):
repl = self.get_replacements("~foo -> foo")
assert repl == {('foo ', 'foo '), ('foo ', ' foo '),
(' foo ', 'foo '), (' foo ', ' foo ')}
def test_add_suffix_decompose_end_only(self):
repl = self.get_replacements("~berg |-> bg", "~berg$ -> bg")
assert repl == {('berg ', 'berg '), ('berg ', 'bg '),
(' berg ', ' berg '), (' berg ', ' bg '),
('berg ^', 'berg ^'), ('berg ^', ' berg ^'),
('berg ^', 'bg ^'), ('berg ^', ' bg ^'),
(' berg ^', 'berg ^'), (' berg ^', 'bg ^'),
(' berg ^', ' berg ^'), (' berg ^', ' bg ^')}
def test_replace_suffix_decompose_end_only(self):
repl = self.get_replacements("~berg |=> bg", "~berg$ => bg")
assert repl == {('berg ', 'bg '), (' berg ', ' bg '),
('berg ^', 'bg ^'), ('berg ^', ' bg ^'),
(' berg ^', 'bg ^'), (' berg ^', ' bg ^')}
def test_add_multiple_suffix(self):
repl = self.get_replacements("~berg,~burg -> bg")
assert repl == {('berg ', 'berg '), ('berg ', ' berg '),
(' berg ', ' berg '), (' berg ', 'berg '),
('berg ', 'bg '), ('berg ', ' bg '),
(' berg ', 'bg '), (' berg ', ' bg '),
('burg ', 'burg '), ('burg ', ' burg '),
(' burg ', ' burg '), (' burg ', 'burg '),
('burg ', 'bg '), ('burg ', ' bg '),
(' burg ', 'bg '), (' burg ', ' bg ')}

View File

@ -260,7 +260,9 @@ def test_update_special_phrase_modify(analyzer, word_table, make_standard_name):
def test_add_country_names(analyzer, word_table, make_standard_name):
analyzer.add_country_names('de', ['Germany', 'Deutschland', 'germany'])
analyzer.add_country_names('de', {'name': 'Germany',
'name:de': 'Deutschland',
'short_name': 'germany'})
assert word_table.get_country() \
== {('de', ' #germany#'),
@ -272,7 +274,7 @@ def test_add_more_country_names(analyzer, word_table, make_standard_name):
word_table.add_country('it', ' #italy#')
word_table.add_country('it', ' #itala#')
analyzer.add_country_names('it', ['Italy', 'IT'])
analyzer.add_country_names('it', {'name': 'Italy', 'ref': 'IT'})
assert word_table.get_country() \
== {('fr', ' #france#'),

View File

@ -2,10 +2,13 @@
Tests for Legacy ICU tokenizer.
"""
import shutil
import yaml
import pytest
from nominatim.tokenizer import legacy_icu_tokenizer
from nominatim.tokenizer.icu_name_processor import ICUNameProcessorRules
from nominatim.tokenizer.icu_rule_loader import ICURuleLoader
from nominatim.db import properties
@ -40,16 +43,10 @@ def tokenizer_factory(dsn, tmp_path, property_table,
@pytest.fixture
def db_prop(temp_db_conn):
def _get_db_property(name):
return properties.get_property(temp_db_conn,
getattr(legacy_icu_tokenizer, name))
return properties.get_property(temp_db_conn, name)
return _get_db_property
@pytest.fixture
def tokenizer_setup(tokenizer_factory, test_config):
tok = tokenizer_factory()
tok.init_new_db(test_config)
@pytest.fixture
def analyzer(tokenizer_factory, test_config, monkeypatch,
@ -62,9 +59,15 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,
tok.init_new_db(test_config)
monkeypatch.undo()
def _mk_analyser(trans=':: upper();', abbr=(('STREET', 'ST'), )):
tok.transliteration = trans
tok.abbreviations = abbr
def _mk_analyser(norm=("[[:Punctuation:][:Space:]]+ > ' '",), trans=(':: upper()',),
variants=('~gasse -> gasse', 'street => st', )):
cfgfile = tmp_path / 'analyser_test_config.yaml'
with cfgfile.open('w') as stream:
cfgstr = {'normalization' : list(norm),
'transliteration' : list(trans),
'variants' : [ {'words': list(variants)}]}
yaml.dump(cfgstr, stream)
tok.naming_rules = ICUNameProcessorRules(loader=ICURuleLoader(cfgfile))
return tok.name_analyzer()
@ -72,10 +75,54 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,
@pytest.fixture
def getorcreate_term_id(temp_db_cursor):
temp_db_cursor.execute("""CREATE OR REPLACE FUNCTION getorcreate_term_id(lookup_term TEXT)
RETURNS INTEGER AS $$
SELECT nextval('seq_word')::INTEGER; $$ LANGUAGE SQL""")
def getorcreate_full_word(temp_db_cursor):
temp_db_cursor.execute("""CREATE OR REPLACE FUNCTION getorcreate_full_word(
norm_term TEXT, lookup_terms TEXT[],
OUT full_token INT,
OUT partial_tokens INT[])
AS $$
DECLARE
partial_terms TEXT[] = '{}'::TEXT[];
term TEXT;
term_id INTEGER;
term_count INTEGER;
BEGIN
SELECT min(word_id) INTO full_token
FROM word WHERE word = norm_term and class is null and country_code is null;
IF full_token IS NULL THEN
full_token := nextval('seq_word');
INSERT INTO word (word_id, word_token, word, search_name_count)
SELECT full_token, ' ' || lookup_term, norm_term, 0 FROM unnest(lookup_terms) as lookup_term;
END IF;
FOR term IN SELECT unnest(string_to_array(unnest(lookup_terms), ' ')) LOOP
term := trim(term);
IF NOT (ARRAY[term] <@ partial_terms) THEN
partial_terms := partial_terms || term;
END IF;
END LOOP;
partial_tokens := '{}'::INT[];
FOR term IN SELECT unnest(partial_terms) LOOP
SELECT min(word_id), max(search_name_count) INTO term_id, term_count
FROM word WHERE word_token = term and class is null and country_code is null;
IF term_id IS NULL THEN
term_id := nextval('seq_word');
term_count := 0;
INSERT INTO word (word_id, word_token, search_name_count)
VALUES (term_id, term, 0);
END IF;
IF NOT (ARRAY[term_id] <@ partial_tokens) THEN
partial_tokens := partial_tokens || term_id;
END IF;
END LOOP;
END;
$$
LANGUAGE plpgsql;
""")
@pytest.fixture
@ -91,19 +138,37 @@ def test_init_new(tokenizer_factory, test_config, monkeypatch, db_prop):
tok = tokenizer_factory()
tok.init_new_db(test_config)
assert db_prop('DBCFG_NORMALIZATION') == ':: lower();'
assert db_prop('DBCFG_TRANSLITERATION') is not None
assert db_prop('DBCFG_ABBREVIATIONS') is not None
assert db_prop(legacy_icu_tokenizer.DBCFG_TERM_NORMALIZATION) == ':: lower();'
assert db_prop(legacy_icu_tokenizer.DBCFG_MAXWORDFREQ) is not None
def test_init_from_project(tokenizer_setup, tokenizer_factory):
def test_init_word_table(tokenizer_factory, test_config, place_row, word_table):
place_row(names={'name' : 'Test Area', 'ref' : '52'})
place_row(names={'name' : 'No Area'})
place_row(names={'name' : 'Holzstrasse'})
tok = tokenizer_factory()
tok.init_new_db(test_config)
assert word_table.get_partial_words() == {('test', 1),
('no', 1), ('area', 2),
('holz', 1), ('strasse', 1),
('str', 1)}
def test_init_from_project(monkeypatch, test_config, tokenizer_factory):
monkeypatch.setenv('NOMINATIM_TERM_NORMALIZATION', ':: lower();')
monkeypatch.setenv('NOMINATIM_MAX_WORD_FREQUENCY', '90300')
tok = tokenizer_factory()
tok.init_new_db(test_config)
monkeypatch.undo()
tok = tokenizer_factory()
tok.init_from_project()
assert tok.normalization is not None
assert tok.transliteration is not None
assert tok.abbreviations is not None
assert tok.naming_rules is not None
assert tok.term_normalization == ':: lower();'
assert tok.max_word_frequency == '90300'
def test_update_sql_functions(db_prop, temp_db_cursor,
@ -114,7 +179,7 @@ def test_update_sql_functions(db_prop, temp_db_cursor,
tok.init_new_db(test_config)
monkeypatch.undo()
assert db_prop('DBCFG_MAXWORDFREQ') == '1133'
assert db_prop(legacy_icu_tokenizer.DBCFG_MAXWORDFREQ) == '1133'
table_factory('test', 'txt TEXT')
@ -127,18 +192,11 @@ def test_update_sql_functions(db_prop, temp_db_cursor,
assert test_content == set((('1133', ), ))
def test_make_standard_word(analyzer):
with analyzer(abbr=(('STREET', 'ST'), ('tiny', 't'))) as anl:
assert anl.make_standard_word('tiny street') == 'TINY ST'
with analyzer(abbr=(('STRASSE', 'STR'), ('STR', 'ST'))) as anl:
assert anl.make_standard_word('Hauptstrasse') == 'HAUPTST'
def test_make_standard_hnr(analyzer):
with analyzer(abbr=(('IV', '4'),)) as anl:
assert anl._make_standard_hnr('345') == '345'
assert anl._make_standard_hnr('iv') == 'IV'
def test_normalize_postcode(analyzer):
with analyzer() as anl:
anl.normalize_postcode('123') == '123'
anl.normalize_postcode('ab-34 ') == 'AB-34'
anl.normalize_postcode('38 Б') == '38 Б'
def test_update_postcodes_from_db_empty(analyzer, table_factory, word_table):
@ -168,15 +226,15 @@ def test_update_postcodes_from_db_add_and_remove(analyzer, table_factory, word_t
def test_update_special_phrase_empty_table(analyzer, word_table):
with analyzer() as anl:
anl.update_special_phrases([
("König bei", "amenity", "royal", "near"),
("Könige", "amenity", "royal", "-"),
("König bei", "amenity", "royal", "near"),
("Könige ", "amenity", "royal", "-"),
("street", "highway", "primary", "in")
], True)
assert word_table.get_special() \
== {(' KÖNIG BEI', 'könig bei', 'amenity', 'royal', 'near'),
(' KÖNIGE', 'könige', 'amenity', 'royal', None),
(' ST', 'street', 'highway', 'primary', 'in')}
== {(' KÖNIG BEI', 'König bei', 'amenity', 'royal', 'near'),
(' KÖNIGE', 'Könige', 'amenity', 'royal', None),
(' STREET', 'street', 'highway', 'primary', 'in')}
def test_update_special_phrase_delete_all(analyzer, word_table):
@ -222,66 +280,188 @@ def test_update_special_phrase_modify(analyzer, word_table):
(' GARDEN', 'garden', 'leisure', 'garden', 'near')}
def test_process_place_names(analyzer, getorcreate_term_id):
def test_add_country_names_new(analyzer, word_table):
with analyzer() as anl:
info = anl.process_place({'name' : {'name' : 'Soft bAr', 'ref': '34'}})
anl.add_country_names('es', {'name': 'Espagña', 'name:en': 'Spain'})
assert info['names'] == '{1,2,3,4,5}'
assert word_table.get_country() == {('es', ' ESPAGÑA'), ('es', ' SPAIN')}
@pytest.mark.parametrize('sep', [',' , ';'])
def test_full_names_with_separator(analyzer, getorcreate_term_id, sep):
def test_add_country_names_extend(analyzer, word_table):
word_table.add_country('ch', ' SCHWEIZ')
with analyzer() as anl:
names = anl._compute_full_names({'name' : sep.join(('New York', 'Big Apple'))})
anl.add_country_names('ch', {'name': 'Schweiz', 'name:fr': 'Suisse'})
assert names == set(('NEW YORK', 'BIG APPLE'))
assert word_table.get_country() == {('ch', ' SCHWEIZ'), ('ch', ' SUISSE')}
def test_full_names_with_bracket(analyzer, getorcreate_term_id):
with analyzer() as anl:
names = anl._compute_full_names({'name' : 'Houseboat (left)'})
class TestPlaceNames:
assert names == set(('HOUSEBOAT (LEFT)', 'HOUSEBOAT'))
@pytest.fixture(autouse=True)
def setup(self, analyzer, getorcreate_full_word):
with analyzer() as anl:
self.analyzer = anl
yield anl
@pytest.mark.parametrize('pcode', ['12345', 'AB 123', '34-345'])
def test_process_place_postcode(analyzer, word_table, pcode):
with analyzer() as anl:
anl.process_place({'address': {'postcode' : pcode}})
def expect_name_terms(self, info, *expected_terms):
tokens = self.analyzer.get_word_token_info(expected_terms)
for token in tokens:
assert token[2] is not None, "No token for {0}".format(token)
assert word_table.get_postcodes() == {pcode, }
assert eval(info['names']) == set((t[2] for t in tokens))
@pytest.mark.parametrize('pcode', ['12:23', 'ab;cd;f', '123;836'])
def test_process_place_bad_postcode(analyzer, word_table, pcode):
with analyzer() as anl:
anl.process_place({'address': {'postcode' : pcode}})
def test_simple_names(self):
info = self.analyzer.process_place({'name': {'name': 'Soft bAr', 'ref': '34'}})
assert not word_table.get_postcodes()
self.expect_name_terms(info, '#Soft bAr', '#34','Soft', 'bAr', '34')
@pytest.mark.parametrize('hnr', ['123a', '1', '101'])
def test_process_place_housenumbers_simple(analyzer, hnr, getorcreate_hnr_id):
with analyzer() as anl:
info = anl.process_place({'address': {'housenumber' : hnr}})
@pytest.mark.parametrize('sep', [',' , ';'])
def test_names_with_separator(self, sep):
info = self.analyzer.process_place({'name': {'name': sep.join(('New York', 'Big Apple'))}})
assert info['hnr'] == hnr.upper()
assert info['hnr_tokens'] == "{-1}"
self.expect_name_terms(info, '#New York', '#Big Apple',
'new', 'york', 'big', 'apple')
def test_process_place_housenumbers_lists(analyzer, getorcreate_hnr_id):
with analyzer() as anl:
info = anl.process_place({'address': {'conscriptionnumber' : '1; 2;3'}})
def test_full_names_with_bracket(self):
info = self.analyzer.process_place({'name': {'name': 'Houseboat (left)'}})
assert set(info['hnr'].split(';')) == set(('1', '2', '3'))
assert info['hnr_tokens'] == "{-1,-2,-3}"
self.expect_name_terms(info, '#Houseboat (left)', '#Houseboat',
'houseboat', 'left')
def test_process_place_housenumbers_duplicates(analyzer, getorcreate_hnr_id):
with analyzer() as anl:
info = anl.process_place({'address': {'housenumber' : '134',
'conscriptionnumber' : '134',
'streetnumber' : '99a'}})
def test_country_name(self, word_table):
info = self.analyzer.process_place({'name': {'name': 'Norge'},
'country_feature': 'no'})
self.expect_name_terms(info, '#norge', 'norge')
assert word_table.get_country() == {('no', ' NORGE')}
class TestPlaceAddress:
@pytest.fixture(autouse=True)
def setup(self, analyzer, getorcreate_full_word):
with analyzer(trans=(":: upper()", "'🜵' > ' '")) as anl:
self.analyzer = anl
yield anl
def process_address(self, **kwargs):
return self.analyzer.process_place({'address': kwargs})
def name_token_set(self, *expected_terms):
tokens = self.analyzer.get_word_token_info(expected_terms)
for token in tokens:
assert token[2] is not None, "No token for {0}".format(token)
return set((t[2] for t in tokens))
@pytest.mark.parametrize('pcode', ['12345', 'AB 123', '34-345'])
def test_process_place_postcode(self, word_table, pcode):
self.process_address(postcode=pcode)
assert word_table.get_postcodes() == {pcode, }
@pytest.mark.parametrize('pcode', ['12:23', 'ab;cd;f', '123;836'])
def test_process_place_bad_postcode(self, word_table, pcode):
self.process_address(postcode=pcode)
assert not word_table.get_postcodes()
@pytest.mark.parametrize('hnr', ['123a', '1', '101'])
def test_process_place_housenumbers_simple(self, hnr, getorcreate_hnr_id):
info = self.process_address(housenumber=hnr)
assert info['hnr'] == hnr.upper()
assert info['hnr_tokens'] == "{-1}"
def test_process_place_housenumbers_lists(self, getorcreate_hnr_id):
info = self.process_address(conscriptionnumber='1; 2;3')
assert set(info['hnr'].split(';')) == set(('1', '2', '3'))
assert info['hnr_tokens'] == "{-1,-2,-3}"
def test_process_place_housenumbers_duplicates(self, getorcreate_hnr_id):
info = self.process_address(housenumber='134',
conscriptionnumber='134',
streetnumber='99a')
assert set(info['hnr'].split(';')) == set(('134', '99A'))
assert info['hnr_tokens'] == "{-1,-2}"
def test_process_place_housenumbers_cached(self, getorcreate_hnr_id):
info = self.process_address(housenumber="45")
assert info['hnr_tokens'] == "{-1}"
info = self.process_address(housenumber="46")
assert info['hnr_tokens'] == "{-2}"
info = self.process_address(housenumber="41;45")
assert eval(info['hnr_tokens']) == {-1, -3}
info = self.process_address(housenumber="41")
assert eval(info['hnr_tokens']) == {-3}
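# --- Illustrative sketch (an assumption, not the tokenizer's actual code) ---
# The caching behaviour asserted above can be pictured as a small in-memory map
# that hands out the next negative token id for every new normalized house
# number and reuses the id for numbers seen before.
class HnrCacheSketch:
    def __init__(self):
        self._ids = {}  # normalized house number -> negative token id

    def tokens_for(self, housenumbers):
        tokens = set()
        for hnr in housenumbers:
            if hnr not in self._ids:
                self._ids[hnr] = -(len(self._ids) + 1)
            tokens.add(self._ids[hnr])
        return tokens
# Fed the same sequence as the test above ('45', '46', then '41;45', then '41'),
# this sketch produces {-1}, {-2}, {-1, -3} and {-3}, matching the assertions.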
def test_process_place_street(self):
info = self.process_address(street='Grand Road')
assert eval(info['street']) == self.name_token_set('#GRAND ROAD')
def test_process_place_street_empty(self):
info = self.process_address(street='🜵')
assert 'street' not in info
def test_process_place_place(self):
info = self.process_address(place='Honu Lulu')
assert eval(info['place_search']) == self.name_token_set('#HONU LULU',
'HONU', 'LULU')
assert eval(info['place_match']) == self.name_token_set('#HONU LULU')
def test_process_place_place_empty(self):
info = self.process_address(place='🜵')
assert 'place_search' not in info
assert 'place_match' not in info
def test_process_place_address_terms(self):
info = self.process_address(country='de', city='Zwickau', state='Sachsen',
suburb='Zwickau', street='Hauptstr',
full='right behind the church')
city_full = self.name_token_set('#ZWICKAU')
city_all = self.name_token_set('#ZWICKAU', 'ZWICKAU')
state_full = self.name_token_set('#SACHSEN')
state_all = self.name_token_set('#SACHSEN', 'SACHSEN')
result = {k: [eval(v[0]), eval(v[1])] for k,v in info['addr'].items()}
assert result == {'city': [city_all, city_full],
'suburb': [city_all, city_full],
'state': [state_all, state_full]}
def test_process_place_address_terms_empty(self):
info = self.process_address(country='de', city=' ', street='Hauptstr',
full='right behind the church')
assert 'addr' not in info
assert set(info['hnr'].split(';')) == set(('134', '99A'))
assert info['hnr_tokens'] == "{-1,-2}"
View File
@ -180,7 +180,7 @@ def test_create_country_names(temp_db_with_extensions, temp_db_conn, temp_db_cur
assert len(tokenizer.analyser_cache['countries']) == 2
result_set = {k: set(v) for k, v in tokenizer.analyser_cache['countries']}
result_set = {k: set(v.values()) for k, v in tokenizer.analyser_cache['countries']}
if languages:
assert result_set == {'us' : set(('us', 'us1', 'United States')),
View File
@ -42,7 +42,7 @@
python3-pip python3-setuptools python3-devel \
expat-devel zlib-devel libicu-dev
pip3 install --user psycopg2 python-dotenv psutil Jinja2 PyICU
pip3 install --user psycopg2 python-dotenv psutil Jinja2 PyICU datrie
#
View File
@ -35,7 +35,7 @@
python3-pip python3-setuptools python3-devel \
expat-devel zlib-devel libicu-dev
pip3 install --user psycopg2 python-dotenv psutil Jinja2 PyICU
pip3 install --user psycopg2 python-dotenv psutil Jinja2 PyICU datrie
#
View File
@ -32,10 +32,10 @@ export DEBIAN_FRONTEND=noninteractive #DOCS:
php php-pgsql php-intl libicu-dev python3-pip \
python3-psycopg2 python3-psutil python3-jinja2 python3-icu git
# The python-dotenv package that comes with Ubuntu 18.04 is too old, so
# The python-dotenv and datrie packages that come with Ubuntu 18.04 are too old, so
# install the latest versions from pip:
pip3 install python-dotenv
pip3 install python-dotenv datrie
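# (Suggested sanity check, not part of the change itself: `pip3 show` prints
# the version and install location of the pip-installed packages, which makes
# it easy to confirm they are newer than the distribution ones.)
pip3 show python-dotenv datrie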
#
# System Configuration
View File
@ -33,7 +33,8 @@ export DEBIAN_FRONTEND=noninteractive #DOCS:
postgresql-server-dev-12 postgresql-12-postgis-3 \
postgresql-contrib-12 postgresql-12-postgis-3-scripts \
php php-pgsql php-intl libicu-dev python3-dotenv \
python3-psycopg2 python3-psutil python3-jinja2 python3-icu git
python3-psycopg2 python3-psutil python3-jinja2 \
python3-icu python3-datrie git
#
# System Configuration