Merge pull request #2757 from lonvia/filter-postcodes

Add filtering, normalisation and variants for postcodes
Sarah Hoffmann 2022-06-24 21:09:41 +02:00 committed by GitHub
commit 3bf3b894ea
GPG Key ID: 4AEE18F83AFDEB23
35 changed files with 1562 additions and 221 deletions


@ -13,4 +13,4 @@ ignored-classes=NominatimArgs,closing
# 'too-many-ancestors' is triggered already by deriving from UserDict
disable=too-few-public-methods,duplicate-code,too-many-ancestors,bad-option-value,no-self-use
good-names=i,x,y,fd,db
good-names=i,x,y,fd,db,cc


@ -0,0 +1,149 @@
# Customizing Per-Country Data
Whenever an OSM object is imported into Nominatim, the object is first assigned
a country. Nominatim can use this information to adapt various aspects of
the address computation to the local customs of the country. This section
explains how country assignment works and the principal per-country
localizations.
## Country assignment
Countries are assigned on the basis of country data from the OpenStreetMap
input data itself. Countries are expected to be tagged according to the
[administrative boundary schema](https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative):
an OSM relation with `boundary=administrative` and `admin_level=2`. Nominatim
uses the country code to distinguish the countries.
If there is no country data available for a point, then Nominatim uses the
fallback data imported from `data/country_osm_grid.sql.gz`. This was computed
from OSM data as well but is guaranteed to cover all countries.
Some OSM objects may also be located outside any country, for example a buoy
in the middle of the ocean. These objects do not get any country assigned and
get a default treatment when it comes to localized handling of data.
## Per-country settings
### Global country settings
The main place to configure settings per country is the file
`settings/country_settings.yaml`. This file has one section per country that
is recognised by Nominatim. Each section is tagged with the country code
(in lower case) and contains the different localization information. Only
countries which are listed in this file are taken into account for computations.
For example, the section for Andorra looks like this:
```
partition: 35
languages: ca
names: !include country-names/ad.yaml
postcode:
  pattern: "(ddd)"
  output: AD\1
```
The individual settings are described below.
#### `partition`
Nominatim internally splits the data into multiple tables to improve
performance. The partition number tells Nominatim into which table to put
the country. This is purely internal management and has no effect on the
output data.
The default is to have one partition per country.
#### `languages`
A comma-separated list of ISO-639 language codes of default languages in the
country. These are the languages used in name tags without a language suffix.
Note that this is not necessarily the same as the list of official languages
in the country. There may be officially recognised languages in a country
which are only ever used in name tags with the appropriate language suffixes.
Conversely, a non-official language may appear a lot in the name tags, for
example when used as an unofficial Lingua Franca.
List the languages in order of frequency of appearance with the most frequently
used language first. It is not recommended to add languages when there are only
very few occurrences.
If only one language is listed, then Nominatim will 'auto-complete' the
language of names without an explicit language-suffix.
#### `names`
List of names of the country and its translations. These names are used as
a baseline. It is always possible to search countries by the given names, no
matter what other names are in the OSM data. They are also used as a fallback
when a needed translation is not available.
!!! Note
    The list of names per country is currently fairly large because Nominatim
    supports translations in many languages by default. That is why the
    name lists have been separated out into extra files. You can find the
    name lists in the file `settings/country-names/<country code>.yaml`.
    The names section in the main country settings file only refers to these
    files via the special `!include` directive.
#### `postcode`
Describes the format of the postcode that is in use in the country.
When a country has no official postcodes, set this to no. Example:
```
ae:
    postcode: no
```
When a country has a postcode, you need to state the postcode pattern and
the default output format. Example:
```
bm:
    postcode:
      pattern: "(ll)[ -]?(dd)"
      output: \1 \2
```
The **pattern** is a regular expression that describes the possible formats
accepted as a postcode. The pattern follows the standard syntax for
[regular expressions in Python](https://docs.python.org/3/library/re.html#regular-expression-syntax)
with two extra shortcuts: `d` is a shortcut for a single digit ([0-9])
and `l` for a single ASCII letter ([A-Z]).
Use match groups to indicate groups in the postcode that may optionally be
separated with a space or a hyphen.
For example, the postcode for Bermuda above always consists of two letters
and two digits. They may optionally be separated by a space or hyphen. That
means that Nominatim will consider `AB56`, `AB 56` and `AB-56` spelling variants
for one and the same postcode.
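As a rough illustration of how such a pattern behaves, the following sketch expands the `d` and `l` shortcuts into character classes and tests the three spellings with Python's `re` module. The helper name `expand_shortcuts` is made up for this example and is not part of Nominatim's public interface.

```python
import re

def expand_shortcuts(pattern):
    """Replace the shortcuts 'd' and 'l' with the character classes
    they stand for (hypothetical helper, for illustration only)."""
    return pattern.replace('d', '[0-9]').replace('l', '[A-Z]')

# Pattern taken from the Bermuda example above.
bm_pattern = re.compile(expand_shortcuts('(ll)[ -]?(dd)'))

for spelling in ('AB56', 'AB 56', 'AB-56'):
    match = bm_pattern.fullmatch(spelling)
    # All three spellings match and yield the same two groups.
    print(spelling, '->', match.groups())

# A postcode with one digit too many does not match at all.
print(bm_pattern.fullmatch('AB567'))
```

Because the separator sits outside the match groups, the groups are identical for every accepted spelling, which is what allows the variants to be treated as one postcode.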
Never add the country code in front of the postcode pattern. Nominatim will
automatically accept variants with a country code prefix for all postcodes.
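The sketch below shows one way an optional country code prefix could be stripped before the country pattern is applied. It assumes a regular expression similar to the one built in `CountryPostcodeMatcher`; the function name `strip_country_prefix` is hypothetical.

```python
import re

country_code = 'bm'

# Optional upper-case country code, optionally followed by a space or hyphen.
# Everything after the prefix is captured in group 1.
prefix_pattern = re.compile(f'\\s*(?:{country_code.upper()}[ -]?)?(.*)\\s*')

def strip_country_prefix(postcode):
    """Upper-case the postcode and drop a leading country code, if present."""
    match = prefix_pattern.fullmatch(postcode.upper())
    return match.group(1) if match else None

print(strip_country_prefix('BM-AB 56'))
print(strip_country_prefix('bm ab56'))
print(strip_country_prefix('AB56'))
```

The remainder returned by the helper is what would then be checked against the country-specific pattern.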
The **output** field is an optional field that describes what the canonical
spelling of the postcode should be. The format is the
[regular expression expand syntax](https://docs.python.org/3/library/re.html#re.Match.expand) referring back to the bracket groups in the pattern.
Most simple postcodes only have one spelling variant. In that case, the
**output** can be omitted. The postcode will simply be used as is.
In the Bermuda example above, the canonical spelling would be to have a space
between letters and digits.
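To make the effect of the output template concrete, here is a minimal sketch using `re.Match.expand` with the Bermuda settings. The pattern is written with the shortcuts already expanded, and the function name `canonical` is an assumption made for this example.

```python
import re

# Bermuda pattern with 'l' and 'd' already expanded to character classes.
pattern = re.compile('([A-Z][A-Z])[ -]?([0-9][0-9])')
# Output template from the example: letters, a single space, digits.
output = r'\1 \2'

def canonical(spelling):
    """Return the canonical spelling, or None if the postcode does not match."""
    match = pattern.fullmatch(spelling)
    return match.expand(output) if match else None

for spelling in ('AB56', 'AB 56', 'AB-56'):
    # Every accepted variant is mapped to the same canonical form.
    print(spelling, '->', canonical(spelling))
```

Without a template that refers to the groups, each variant would be kept exactly as written, so the three spellings would end up as three different postcodes.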
!!! Warning
    When your postcode pattern covers multiple variants of the postcode, then
    you must explicitly state the canonical output or Nominatim will not
    handle the variations correctly.
### Other country-specific configuration
There are some other configuration files where you can set localized settings
according to the assigned country. These are:
* [Place ranking configuration](Ranking.md)
Please see the linked documentation sections for more information.


@ -205,6 +205,14 @@ The following is a list of sanitizers that are shipped with Nominatim.
rendering:
heading_level: 6
##### clean-postcodes
::: nominatim.tokenizer.sanitizers.clean_postcodes
selection:
members: False
rendering:
heading_level: 6
#### Token Analysis
@ -222,8 +230,12 @@ by a sanitizer (see for example the
The token-analysis section contains the list of configured analyzers. Each
analyzer must have an `id` parameter that uniquely identifies the analyzer.
The only exception is the default analyzer that is used when no special
analyzer was selected. There is one special id '@housenumber'. If an analyzer
with that name is present, it is used for normalization of house numbers.
analyzer was selected. There are analyzers with special ids:
* '@housenumber'. If an analyzer with that name is present, it is used
for normalization of house numbers.
* '@postcode'. If an analyzer with that name is present, it is used
for normalization of postcodes.
Different analyzer implementations may exist. To select the implementation,
the `analyzer` parameter must be set. The different implementations are
@ -356,6 +368,14 @@ house numbers of the form '3 a', '3A', '3-A' etc. are all considered equivalent.
The analyzer cannot be customized.
##### Postcode token analyzer
The analyzer `postcodes` is purpose-made to analyze postcodes. It supports
a 'lookup' variant of the token, which produces variants with optional
spaces. Use together with the clean-postcodes sanitizer.
The analyzer cannot be customized.
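The idea of "variants with optional spaces" can be sketched as follows. This is an illustrative stand-in, not the analyzer's actual implementation, and the function name `space_variants` is hypothetical.

```python
import itertools

def space_variants(postcode):
    """Return all spellings of the postcode in which each space
    is either kept or dropped."""
    parts = postcode.split(' ')
    variants = set()
    # For every gap between two parts choose either a space or nothing.
    for separators in itertools.product((' ', ''), repeat=len(parts) - 1):
        variant = parts[0]
        for sep, part in zip(separators, parts[1:]):
            variant += sep + part
        variants.add(variant)
    return sorted(variants)

print(space_variants('ab 56'))
print(space_variants('ab56'))
```

A postcode with one space therefore yields two lookup terms, while a postcode without spaces yields only itself.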
### Reconfiguration
Changing the configuration after the import is currently not possible, although


@ -245,11 +245,11 @@ Currently, tokenizers are encouraged to make sure that matching works against
both the search token list and the match token list.
```sql
FUNCTION token_normalized_postcode(postcode TEXT) RETURNS TEXT
FUNCTION token_get_postcode(info JSONB) RETURNS TEXT
```
Return the normalized version of the given postcode. This function must return
the same value as the Python function `AbstractAnalyzer->normalize_postcode()`.
Return the postcode for the object, if any exists. The postcode must be in
the form that should also be presented to the end-user.
```sql
FUNCTION token_strip_info(info JSONB) RETURNS JSONB


@ -28,6 +28,7 @@ pages:
- 'Overview': 'customize/Overview.md'
- 'Import Styles': 'customize/Import-Styles.md'
- 'Configuration Settings': 'customize/Settings.md'
- 'Per-Country Data': 'customize/Country-Settings.md'
- 'Place Ranking' : 'customize/Ranking.md'
- 'Tokenizers' : 'customize/Tokenizers.md'
- 'Special Phrases': 'customize/Special-Phrases.md'


@ -25,7 +25,12 @@ class Postcode
public function __construct($iId, $sPostcode, $sCountryCode = '')
{
$this->iId = $iId;
$this->sPostcode = $sPostcode;
$iSplitPos = strpos($sPostcode, '@');
if ($iSplitPos === false) {
$this->sPostcode = $sPostcode;
} else {
$this->sPostcode = substr($sPostcode, 0, $iSplitPos);
}
$this->sCountryCode = empty($sCountryCode) ? '' : $sCountryCode;
}


@ -190,13 +190,17 @@ class Tokenizer
if ($aWord['word'] !== null
&& pg_escape_string($aWord['word']) == $aWord['word']
) {
$sNormPostcode = $this->normalizeString($aWord['word']);
if (strpos($sNormQuery, $sNormPostcode) !== false) {
$oValidTokens->addToken(
$sTok,
new Token\Postcode($iId, $aWord['word'], null)
);
$iSplitPos = strpos($aWord['word'], '@');
if ($iSplitPos === false) {
$sPostcode = $aWord['word'];
} else {
$sPostcode = substr($aWord['word'], 0, $iSplitPos);
}
$oValidTokens->addToken(
$sTok,
new Token\Postcode($iId, $sPostcode, null)
);
}
break;
case 'S': // tokens for classification terms (special phrases)


@ -320,6 +320,11 @@ BEGIN
location := ROW(null, null, null, hstore('ref', place.postcode), 'place',
'postcode', null, null, false, true, 5, 0)::addressline;
RETURN NEXT location;
ELSEIF place.address is not null and place.address ? 'postcode'
and not place.address->'postcode' SIMILAR TO '%(,|;)%' THEN
location := ROW(null, null, null, hstore('ref', place.address->'postcode'), 'place',
'postcode', null, null, false, true, 5, 0)::addressline;
RETURN NEXT location;
END IF;
RETURN;


@ -156,7 +156,6 @@ DECLARE
linegeo GEOMETRY;
splitline GEOMETRY;
sectiongeo GEOMETRY;
interpol_postcode TEXT;
postcode TEXT;
stepmod SMALLINT;
BEGIN
@ -174,8 +173,6 @@ BEGIN
ST_PointOnSurface(NEW.linegeo),
NEW.linegeo);
interpol_postcode := token_normalized_postcode(NEW.address->'postcode');
NEW.token_info := token_strip_info(NEW.token_info);
IF NEW.address ? '_inherited' THEN
NEW.address := hstore('interpolation', NEW.address->'interpolation');
@ -207,6 +204,11 @@ BEGIN
FOR nextnode IN
SELECT DISTINCT ON (nodeidpos)
osm_id, address, geometry,
-- Take the postcode from the node only if it has a housenumber itself.
-- Note that there is a corner-case where the node has a wrongly
-- formatted postcode and therefore 'postcode' contains a derived
-- variant.
CASE WHEN address ? 'postcode' THEN placex.postcode ELSE NULL::text END as postcode,
substring(address->'housenumber','[0-9]+')::integer as hnr
FROM placex, generate_series(1, array_upper(waynodes, 1)) nodeidpos
WHERE osm_type = 'N' and osm_id = waynodes[nodeidpos]::BIGINT
@ -260,13 +262,10 @@ BEGIN
endnumber := newend;
-- determine postcode
postcode := coalesce(interpol_postcode,
token_normalized_postcode(prevnode.address->'postcode'),
token_normalized_postcode(nextnode.address->'postcode'),
postcode);
IF postcode is NULL THEN
SELECT token_normalized_postcode(placex.postcode)
FROM placex WHERE place_id = NEW.parent_place_id INTO postcode;
postcode := coalesce(prevnode.postcode, nextnode.postcode, postcode);
IF postcode is NULL and NEW.parent_place_id > 0 THEN
SELECT placex.postcode FROM placex
WHERE place_id = NEW.parent_place_id INTO postcode;
END IF;
IF postcode is NULL THEN
postcode := get_nearest_postcode(NEW.country_code, nextnode.geometry);


@ -992,7 +992,7 @@ BEGIN
{% if debug %}RAISE WARNING 'Got parent details from search name';{% endif %}
-- determine postcode
NEW.postcode := coalesce(token_normalized_postcode(NEW.address->'postcode'),
NEW.postcode := coalesce(token_get_postcode(NEW.token_info),
location.postcode,
get_nearest_postcode(NEW.country_code, NEW.centroid));
@ -1150,8 +1150,7 @@ BEGIN
{% if debug %}RAISE WARNING 'RETURN insert_addresslines: %, %, %', NEW.parent_place_id, NEW.postcode, nameaddress_vector;{% endif %}
NEW.postcode := coalesce(token_normalized_postcode(NEW.address->'postcode'),
NEW.postcode);
NEW.postcode := coalesce(token_get_postcode(NEW.token_info), NEW.postcode);
-- if we have a name add this to the name search table
IF NEW.name IS NOT NULL THEN


@ -97,10 +97,10 @@ AS $$
$$ LANGUAGE SQL IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION token_normalized_postcode(postcode TEXT)
CREATE OR REPLACE FUNCTION token_get_postcode(info JSONB)
RETURNS TEXT
AS $$
SELECT CASE WHEN postcode SIMILAR TO '%(,|;)%' THEN NULL ELSE upper(trim(postcode))END;
SELECT info->>'postcode';
$$ LANGUAGE SQL IMMUTABLE STRICT;
@ -223,3 +223,26 @@ BEGIN
END;
$$
LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION create_postcode_word(postcode TEXT, lookup_terms TEXT[])
RETURNS BOOLEAN
AS $$
DECLARE
existing INTEGER;
BEGIN
SELECT count(*) INTO existing
FROM word WHERE word = postcode and type = 'P';
IF existing > 0 THEN
RETURN TRUE;
END IF;
-- postcodes don't need word ids
INSERT INTO word (word_token, type, word)
SELECT lookup_term, 'P', postcode FROM unnest(lookup_terms) as lookup_term;
RETURN FALSE;
END;
$$
LANGUAGE plpgsql;


@ -97,10 +97,10 @@ AS $$
$$ LANGUAGE SQL IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION token_normalized_postcode(postcode TEXT)
CREATE OR REPLACE FUNCTION token_get_postcode(info JSONB)
RETURNS TEXT
AS $$
SELECT CASE WHEN postcode SIMILAR TO '%(,|;)%' THEN NULL ELSE upper(trim(postcode))END;
SELECT info->>'postcode';
$$ LANGUAGE SQL IMMUTABLE STRICT;


@ -0,0 +1,109 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Functions for formatting postcodes according to their country-specific
format.
"""
import re
from nominatim.errors import UsageError
from nominatim.tools import country_info
class CountryPostcodeMatcher:
""" Matches and formats a postcode according to a format definition
of the given country.
"""
def __init__(self, country_code, config):
if 'pattern' not in config:
raise UsageError("Field 'pattern' required for 'postcode' "
f"for country '{country_code}'")
pc_pattern = config['pattern'].replace('d', '[0-9]').replace('l', '[A-Z]')
self.norm_pattern = re.compile(f'\\s*(?:{country_code.upper()}[ -]?)?(.*)\\s*')
self.pattern = re.compile(pc_pattern)
self.output = config.get('output', r'\g<0>')
def match(self, postcode):
""" Match the given postcode against the postcode pattern for this
matcher. Returns a `re.Match` object if the match was successful
and None otherwise.
"""
# Upper-case, strip spaces and leading country code.
normalized = self.norm_pattern.fullmatch(postcode.upper())
if normalized:
return self.pattern.fullmatch(normalized.group(1))
return None
def normalize(self, match):
""" Return the default format of the postcode for the given match.
`match` must be a `re.Match` object previously returned by
`match()`
"""
return match.expand(self.output)
class PostcodeFormatter:
""" Container for different postcode formats of the world and
access functions.
"""
def __init__(self):
# Objects without a country code can't have a postcode per definition.
self.country_without_postcode = {None}
self.country_matcher = {}
self.default_matcher = CountryPostcodeMatcher('', {'pattern': '.*'})
for ccode, prop in country_info.iterate('postcode'):
if prop is False:
self.country_without_postcode.add(ccode)
elif isinstance(prop, dict):
self.country_matcher[ccode] = CountryPostcodeMatcher(ccode, prop)
else:
raise UsageError(f"Invalid entry 'postcode' for country '{ccode}'")
def set_default_pattern(self, pattern):
""" Set the postcode match pattern to use, when a country does not
have a specific pattern or is marked as country without postcode.
"""
self.default_matcher = CountryPostcodeMatcher('', {'pattern': pattern})
def get_matcher(self, country_code):
""" Return the CountryPostcodeMatcher for the given country.
Returns None if the country doesn't have a postcode and the
default matcher if there is no specific matcher configured for
the country.
"""
if country_code in self.country_without_postcode:
return None
return self.country_matcher.get(country_code, self.default_matcher)
def match(self, country_code, postcode):
""" Match the given postcode against the postcode pattern for this
matcher. Returns a `re.Match` object if the country has a pattern
and the match was successful or None if the match failed.
"""
if country_code in self.country_without_postcode:
return None
return self.country_matcher.get(country_code, self.default_matcher).match(postcode)
def normalize(self, country_code, match):
""" Return the default format of the postcode for the given match.
`match` must be a `re.Match` object previously returned by
`match()`
"""
return self.country_matcher.get(country_code, self.default_matcher).normalize(match)


@ -11,7 +11,6 @@ libICU instead of the PostgreSQL module.
import itertools
import json
import logging
import re
from textwrap import dedent
from nominatim.db.connection import connect
@ -291,33 +290,72 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
""" Update postcode tokens in the word table from the location_postcode
table.
"""
to_delete = []
analyzer = self.token_analysis.analysis.get('@postcode')
with self.conn.cursor() as cur:
# This finds us the rows in location_postcode and word that are
# missing in the other table.
cur.execute("""SELECT * FROM
(SELECT pc, word FROM
(SELECT distinct(postcode) as pc FROM location_postcode) p
FULL JOIN
(SELECT word FROM word WHERE type = 'P') w
ON pc = word) x
WHERE pc is null or word is null""")
# First get all postcode names currently in the word table.
cur.execute("SELECT DISTINCT word FROM word WHERE type = 'P'")
word_entries = set((entry[0] for entry in cur))
with CopyBuffer() as copystr:
for postcode, word in cur:
if postcode is None:
to_delete.append(word)
else:
copystr.add(self._search_normalized(postcode),
'P', postcode)
# Then compute the required postcode names from the postcode table.
needed_entries = set()
cur.execute("SELECT country_code, postcode FROM location_postcode")
for cc, postcode in cur:
info = PlaceInfo({'country_code': cc,
'class': 'place', 'type': 'postcode',
'address': {'postcode': postcode}})
address = self.sanitizer.process_names(info)[1]
for place in address:
if place.kind == 'postcode':
if analyzer is None:
postcode_name = place.name.strip().upper()
variant_base = None
else:
postcode_name = analyzer.normalize(place.name)
variant_base = place.get_attr("variant")
if variant_base:
needed_entries.add(f'{postcode_name}@{variant_base}')
else:
needed_entries.add(postcode_name)
break
# Now update the word table.
self._delete_unused_postcode_words(word_entries - needed_entries)
self._add_missing_postcode_words(needed_entries - word_entries)
def _delete_unused_postcode_words(self, tokens):
if tokens:
with self.conn.cursor() as cur:
cur.execute("DELETE FROM word WHERE type = 'P' and word = any(%s)",
(list(tokens), ))
def _add_missing_postcode_words(self, tokens):
if not tokens:
return
analyzer = self.token_analysis.analysis.get('@postcode')
terms = []
for postcode_name in tokens:
if '@' in postcode_name:
term, variant = postcode_name.split('@', 1)
term = self._search_normalized(term)
variants = {term}
if analyzer is not None:
variants.update(analyzer.get_variants_ascii(variant))
variants = list(variants)
else:
variants = [self._search_normalized(postcode_name)]
terms.append((postcode_name, variants))
if terms:
with self.conn.cursor() as cur:
cur.execute_values("""SELECT create_postcode_word(pc, var)
FROM (VALUES %s) AS v(pc, var)""",
terms)
if to_delete:
cur.execute("""DELETE FROM WORD
WHERE type ='P' and word = any(%s)
""", (to_delete, ))
copystr.copy_out(cur, 'word',
columns=['word_token', 'type', 'word'])
def update_special_phrases(self, phrases, should_replace):
@ -473,7 +511,7 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
def _process_place_address(self, token_info, address):
for item in address:
if item.kind == 'postcode':
self._add_postcode(item.name)
token_info.set_postcode(self._add_postcode(item))
elif item.kind == 'housenumber':
token_info.add_housenumber(*self._compute_housenumber_token(item))
elif item.kind == 'street':
@ -605,26 +643,38 @@ class LegacyICUNameAnalyzer(AbstractAnalyzer):
return full_tokens, partial_tokens
def _add_postcode(self, postcode):
def _add_postcode(self, item):
""" Make sure the normalized postcode is present in the word table.
"""
if re.search(r'[:,;]', postcode) is None:
postcode = self.normalize_postcode(postcode)
analyzer = self.token_analysis.analysis.get('@postcode')
if postcode not in self._cache.postcodes:
term = self._search_normalized(postcode)
if not term:
return
if analyzer is None:
postcode_name = item.name.strip().upper()
variant_base = None
else:
postcode_name = analyzer.normalize(item.name)
variant_base = item.get_attr("variant")
with self.conn.cursor() as cur:
# no word_id needed for postcodes
cur.execute("""INSERT INTO word (word_token, type, word)
(SELECT %s, 'P', pc FROM (VALUES (%s)) as v(pc)
WHERE NOT EXISTS
(SELECT * FROM word
WHERE type = 'P' and word = pc))
""", (term, postcode))
self._cache.postcodes.add(postcode)
if variant_base:
postcode = f'{postcode_name}@{variant_base}'
else:
postcode = postcode_name
if postcode not in self._cache.postcodes:
term = self._search_normalized(postcode_name)
if not term:
return None
variants = {term}
if analyzer is not None and variant_base:
variants.update(analyzer.get_variants_ascii(variant_base))
with self.conn.cursor() as cur:
cur.execute("SELECT create_postcode_word(%s, %s)",
(postcode, list(variants)))
self._cache.postcodes.add(postcode)
return postcode_name
class _TokenInfo:
@ -637,6 +687,7 @@ class _TokenInfo:
self.street_tokens = set()
self.place_tokens = set()
self.address_tokens = {}
self.postcode = None
@staticmethod
@ -665,6 +716,9 @@ class _TokenInfo:
if self.address_tokens:
out['addr'] = self.address_tokens
if self.postcode:
out['postcode'] = self.postcode
return out
@ -701,6 +755,11 @@ class _TokenInfo:
if partials:
self.address_tokens[key] = self._mk_array(partials)
def set_postcode(self, postcode):
""" Set the postcode to the given one.
"""
self.postcode = postcode
class _TokenCache:
""" Cache for token information to avoid repeated database queries.


@ -467,8 +467,9 @@ class LegacyNameAnalyzer(AbstractAnalyzer):
if key == 'postcode':
# Make sure the normalized postcode is present in the word table.
if re.search(r'[:,;]', value) is None:
self._cache.add_postcode(self.conn,
self.normalize_postcode(value))
norm_pc = self.normalize_postcode(value)
token_info.set_postcode(norm_pc)
self._cache.add_postcode(self.conn, norm_pc)
elif key in ('housenumber', 'streetnumber', 'conscriptionnumber'):
hnrs.append(value)
elif key == 'street':
@ -527,6 +528,11 @@ class _TokenInfo:
self.data['hnr_tokens'], self.data['hnr'] = cur.fetchone()
def set_postcode(self, postcode):
""" Set or replace the postcode token with the given value.
"""
self.data['postcode'] = postcode
def add_street(self, conn, street):
""" Add addr:street match terms.
"""


@ -0,0 +1,74 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Sanitizer that filters postcodes by their officially allowed pattern.
Arguments:
convert-to-address: If set to 'yes' (the default), then postcodes that do
not conform with their country-specific pattern are
converted to an address component. That means that
the postcode does not take part when computing the
postcode centroids of a country but is still searchable.
When set to 'no', non-conforming postcodes are not
searchable either.
default-pattern: Pattern to use, when there is none available for the
country in question. Warning: will not be used for
objects that have no country assigned. These are always
assumed to have no postcode.
"""
from nominatim.data.postcode_format import PostcodeFormatter
class _PostcodeSanitizer:
def __init__(self, config):
self.convert_to_address = config.get_bool('convert-to-address', True)
self.matcher = PostcodeFormatter()
default_pattern = config.get('default-pattern')
if default_pattern is not None and isinstance(default_pattern, str):
self.matcher.set_default_pattern(default_pattern)
def __call__(self, obj):
if not obj.address:
return
postcodes = ((i, o) for i, o in enumerate(obj.address) if o.kind == 'postcode')
for pos, postcode in postcodes:
formatted = self.scan(postcode.name, obj.place.country_code)
if formatted is None:
if self.convert_to_address:
postcode.kind = 'unofficial_postcode'
else:
obj.address.pop(pos)
else:
postcode.name = formatted[0]
postcode.set_attr('variant', formatted[1])
def scan(self, postcode, country):
""" Check the postcode for correct formatting and return the
normalized version. Returns None if the postcode does not
correspond to the official format of the given country.
"""
match = self.matcher.match(country, postcode)
if match is None:
return None
return self.matcher.normalize(country, match),\
' '.join(filter(lambda p: p is not None, match.groups()))
def create(config):
""" Create a postcode processing function.
"""
return _PostcodeSanitizer(config)


@ -44,6 +44,20 @@ class SanitizerConfig(UserDict):
return values
def get_bool(self, param, default=None):
""" Extract a configuration parameter as a boolean.
The parameter must be one of the yaml boolean values or a
user error will be raised. If `default` is given, then the parameter
may also be missing or empty.
"""
value = self.data.get(param, default)
if not isinstance(value, bool):
raise UsageError(f"Parameter '{param}' must be a boolean value ('yes' or 'no').")
return value
def get_delimiter(self, default=',;'):
""" Return the 'delimiter' parameter in the configuration as a
compiled regular expression that can be used to split the names on the


@ -48,8 +48,7 @@ class _AnalyzerByLanguage:
self.deflangs = {}
if use_defaults in ('mono', 'all'):
for ccode, prop in country_info.iterate():
clangs = prop['languages']
for ccode, clangs in country_info.iterate('languages'):
if len(clangs) == 1 or use_defaults == 'all':
if self.whitelist:
self.deflangs[ccode] = [l for l in clangs if l in self.whitelist]


@ -0,0 +1,65 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Specialized processor for postcodes. Supports a 'lookup' variant of the
token, which produces variants with optional spaces.
"""
from nominatim.tokenizer.token_analysis.generic_mutation import MutationVariantGenerator
### Configuration section
def configure(rules, normalization_rules): # pylint: disable=W0613
""" All behaviour is currently hard-coded.
"""
return None
### Analysis section
def create(normalizer, transliterator, config): # pylint: disable=W0613
""" Create a new token analysis instance for this module.
"""
return PostcodeTokenAnalysis(normalizer, transliterator)
class PostcodeTokenAnalysis:
""" Special normalization and variant generation for postcodes.
This analyser must not be used with anything but postcodes as
it follows some special rules: `normalize` doesn't necessarily
need to return a standard form as per normalization rules. It
needs to return the canonical form of the postcode that is also
used for output. `get_variants_ascii` then needs to ensure that
the generated variants once more follow the standard normalization
and transliteration, so that postcodes are correctly recognised by
the search algorithm.
"""
def __init__(self, norm, trans):
self.norm = norm
self.trans = trans
self.mutator = MutationVariantGenerator(' ', (' ', ''))
def normalize(self, name):
""" Return the standard form of the postcode.
"""
return name.strip().upper()
def get_variants_ascii(self, norm_name):
""" Compute the spelling variants for the given normalized postcode.
Takes the canonical form of the postcode, normalizes it using the
standard rules and then creates variants of the result where
all spaces are optional.
"""
# Postcodes follow their own transliteration rules.
# Make sure at this point, that the terms are normalized in a way
# that they are searchable with the standard transliteration rules.
return [self.trans.transliterate(term) for term in
self.mutator.generate([self.norm.transliterate(norm_name)]) if term]


@ -84,10 +84,20 @@ def setup_country_config(config):
_COUNTRY_INFO.load(config)
def iterate():
def iterate(prop=None):
""" Iterate over country code and properties.
When `prop` is None, all countries are returned with their complete
set of properties.
If `prop` is given, then only countries are returned where the
given property is set. The second item of the tuple contains only
the content of the given property.
"""
return _COUNTRY_INFO.items()
if prop is None:
return _COUNTRY_INFO.items()
return ((c, p[prop]) for c, p in _COUNTRY_INFO.items() if prop in p)
def setup_country_tables(dsn, sql_dir, ignore_partitions=False):


@ -8,6 +8,7 @@
Functions for importing, updating and otherwise maintaining the table
of artificial postcode centroids.
"""
from collections import defaultdict
import csv
import gzip
import logging
@ -16,6 +17,8 @@ from math import isfinite
from psycopg2 import sql as pysql
from nominatim.db.connection import connect
from nominatim.utils.centroid import PointsCentroid
from nominatim.data.postcode_format import PostcodeFormatter
LOG = logging.getLogger()
@ -30,20 +33,31 @@ def _to_float(num, max_value):
return num
class _CountryPostcodesCollector:
class _PostcodeCollector:
""" Collector for postcodes of a single country.
"""
def __init__(self, country):
def __init__(self, country, matcher):
self.country = country
self.collected = {}
self.matcher = matcher
self.collected = defaultdict(PointsCentroid)
self.normalization_cache = None
def add(self, postcode, x, y):
""" Add the given postcode to the collection cache. If the postcode
already existed, it is overwritten with the new centroid.
"""
self.collected[postcode] = (x, y)
if self.matcher is not None:
if self.normalization_cache and self.normalization_cache[0] == postcode:
normalized = self.normalization_cache[1]
else:
match = self.matcher.match(postcode)
normalized = self.matcher.normalize(match) if match else None
self.normalization_cache = (postcode, normalized)
if normalized:
self.collected[normalized] += (x, y)
def commit(self, conn, analyzer, project_dir):
@ -93,16 +107,16 @@ class _CountryPostcodesCollector:
WHERE country_code = %s""",
(self.country, ))
for postcode, x, y in cur:
newx, newy = self.collected.pop(postcode, (None, None))
if newx is not None:
dist = (x - newx)**2 + (y - newy)**2
if dist > 0.0000001:
pcobj = self.collected.pop(postcode, None)
if pcobj:
newx, newy = pcobj.centroid()
if abs(x - newx) > 0.0000001 or abs(y - newy) > 0.0000001:
to_update.append((postcode, newx, newy))
else:
to_delete.append(postcode)
to_add = [(k, v[0], v[1]) for k, v in self.collected.items()]
self.collected = []
to_add = [(k, *v.centroid()) for k, v in self.collected.items()]
self.collected = None
return to_add, to_delete, to_update
@ -125,8 +139,10 @@ class _CountryPostcodesCollector:
postcode = analyzer.normalize_postcode(row['postcode'])
if postcode not in self.collected:
try:
self.collected[postcode] = (_to_float(row['lon'], 180),
_to_float(row['lat'], 90))
# Do float conversion separately, it might throw
centroid = (_to_float(row['lon'], 180),
_to_float(row['lat'], 90))
self.collected[postcode] += centroid
except ValueError:
LOG.warning("Bad coordinates %s, %s in %s country postcode file.",
row['lat'], row['lon'], self.country)
@ -158,6 +174,7 @@ def update_postcodes(dsn, project_dir, tokenizer):
potentially enhances it with external data and then updates the
postcodes in the table 'location_postcode'.
"""
matcher = PostcodeFormatter()
with tokenizer.name_analyzer() as analyzer:
with connect(dsn) as conn:
# First get the list of countries that currently have postcodes.
@ -169,19 +186,17 @@ def update_postcodes(dsn, project_dir, tokenizer):
# Recompute the list of valid postcodes from placex.
with conn.cursor(name="placex_postcodes") as cur:
cur.execute("""
SELECT cc as country_code, pc, ST_X(centroid), ST_Y(centroid)
SELECT cc, pc, ST_X(centroid), ST_Y(centroid)
FROM (SELECT
COALESCE(plx.country_code,
get_country_code(ST_Centroid(pl.geometry))) as cc,
token_normalized_postcode(pl.address->'postcode') as pc,
ST_Centroid(ST_Collect(COALESCE(plx.centroid,
ST_Centroid(pl.geometry)))) as centroid
pl.address->'postcode' as pc,
COALESCE(plx.centroid, ST_Centroid(pl.geometry)) as centroid
FROM place AS pl LEFT OUTER JOIN placex AS plx
ON pl.osm_id = plx.osm_id AND pl.osm_type = plx.osm_type
WHERE pl.address ? 'postcode' AND pl.geometry IS NOT null
GROUP BY cc, pc) xx
WHERE pl.address ? 'postcode' AND pl.geometry IS NOT null) xx
WHERE pc IS NOT null AND cc IS NOT null
ORDER BY country_code, pc""")
ORDER BY cc, pc""")
collector = None
@ -189,7 +204,7 @@ def update_postcodes(dsn, project_dir, tokenizer):
if collector is None or country != collector.country:
if collector is not None:
collector.commit(conn, analyzer, project_dir)
collector = _CountryPostcodesCollector(country)
collector = _PostcodeCollector(country, matcher.get_matcher(country))
todo_countries.discard(country)
collector.add(postcode, x, y)
@ -198,7 +213,8 @@ def update_postcodes(dsn, project_dir, tokenizer):
# Now handle any countries that are only in the postcode table.
for country in todo_countries:
_CountryPostcodesCollector(country).commit(conn, analyzer, project_dir)
fmt = matcher.get_matcher(country)
_PostcodeCollector(country, fmt).commit(conn, analyzer, project_dir)
conn.commit()
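
The collection step in `_PostcodeCollector.add()` can be sketched in isolation. This is a minimal illustration, not the code from the PR: `SwedishLikeMatcher` and `SketchCollector` are hypothetical names, the regex is an assumption modelled on the Swedish test cases in this PR, and plain lists stand in for `PointsCentroid`. It shows why a one-entry cache is enough when rows arrive ordered by postcode, and that points sharing a normalized postcode are accumulated rather than overwritten.

```python
import re
from collections import defaultdict


class SwedishLikeMatcher:
    """Hypothetical stand-in for a per-country matcher (not Nominatim's)."""
    _pattern = re.compile(r'(\d{3}) ?(\d{2})')

    def match(self, postcode):
        # Only accept strings that fit the pattern in full.
        return self._pattern.fullmatch(postcode.strip())

    def normalize(self, match):
        # Canonical output form: three digits, space, two digits.
        return f"{match.group(1)} {match.group(2)}"


class SketchCollector:
    def __init__(self, matcher):
        self.matcher = matcher
        self.collected = defaultdict(list)  # normalized postcode -> points
        self.cache = None                   # (raw postcode, normalized form)
        self.match_calls = 0                # counts cache misses, for illustration

    def add(self, postcode, x, y):
        if self.cache and self.cache[0] == postcode:
            normalized = self.cache[1]
        else:
            self.match_calls += 1
            match = self.matcher.match(postcode)
            normalized = self.matcher.normalize(match) if match else None
            self.cache = (postcode, normalized)

        # Postcodes that do not fit the pattern are silently dropped.
        if normalized:
            self.collected[normalized].append((x, y))


collector = SketchCollector(SwedishLikeMatcher())
rows = [('67534', 1.0, 2.0), ('67534', 1.2, 2.0),
        ('675 34', 1.1, 2.1), ('67 345', 9.0, 9.0)]
for row in rows:
    collector.add(*row)
```

Because the SQL query orders by `cc, pc`, identical raw postcodes arrive back to back, so remembering only the last raw value avoids most repeated matching without an unbounded cache.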

View File

View File

@ -0,0 +1,48 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Functions for computation of centroids.
"""
from collections.abc import Collection
class PointsCentroid:
""" Centroid computation from single points using an online algorithm.
More points may be added at any time.
Coordinates are internally treated as a 7-digit fixed-point float
(i.e. in OSM style).
"""
def __init__(self):
self.sum_x = 0
self.sum_y = 0
self.count = 0
def centroid(self):
""" Return the centroid of all points collected so far.
"""
if self.count == 0:
raise ValueError("No points available for centroid.")
return (float(self.sum_x/self.count)/10000000,
float(self.sum_y/self.count)/10000000)
def __len__(self):
return self.count
def __iadd__(self, other):
if isinstance(other, Collection) and len(other) == 2:
if all(isinstance(p, (float, int)) for p in other):
x, y = other
self.sum_x += int(x * 10000000)
self.sum_y += int(y * 10000000)
self.count += 1
return self
raise ValueError("Can only add 2-element tuples to centroid.")
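
The fixed-point accumulation used by `PointsCentroid` can be illustrated with a standalone function. This is a sketch under assumptions, not the class above: `fixed_point_centroid` is a hypothetical helper that repeats the same arithmetic (scale by 10^7, truncate to an integer, sum, divide once at the end) so that it can be run without Nominatim installed.

```python
SCALE = 10_000_000  # 7 decimal digits, the coordinate precision used by OSM


def fixed_point_centroid(points):
    """Average (x, y) pairs using integer sums, dividing only once at the end."""
    sum_x = sum_y = count = 0
    for x, y in points:
        # int() truncates toward zero, so each point may lose up to 1e-7.
        sum_x += int(x * SCALE)
        sum_y += int(y * SCALE)
        count += 1

    if count == 0:
        raise ValueError("No points available for centroid.")

    return (sum_x / count / SCALE, sum_y / count / SCALE)


cx, cy = fixed_point_centroid([(4.564732, -0.000034)] * 10000)
```

Integer sums are exact, so the result does not drift as more points are added; the only error is the per-point truncation to the 7-digit grid.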

File diff suppressed because it is too large

View File

@ -32,6 +32,9 @@ sanitizers:
- streetnumber
convert-to-name:
- (\A|.*,)[^\d,]{3,}(,.*|\Z)
- step: clean-postcodes
convert-to-address: yes
default-pattern: "[A-Z0-9- ]{3,12}"
- step: split-name-list
- step: strip-brace-terms
- step: tag-analyzer-by-language
@ -43,6 +46,8 @@ token-analysis:
- analyzer: generic
- id: "@housenumber"
analyzer: housenumbers
- id: "@postcode"
analyzer: postcodes
- id: bg
analyzer: generic
mode: variant-only
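
To get a feel for what the `default-pattern` configured above lets through, the check can be sketched with Python's `re` module. This is an approximation only: that the pattern must match the whole upper-cased value is inferred from the `test_postcode_default_pattern_*` tests in this PR, the stripping of surrounding whitespace is an assumption, and `matches_default` is a hypothetical helper, not the sanitizer's API.

```python
import re

# Same expression as the 'default-pattern' setting in the YAML above.
DEFAULT_PATTERN = re.compile(r"[A-Z0-9- ]{3,12}")


def matches_default(postcode):
    """Return the upper-cased postcode if it fits the pattern, else None."""
    candidate = postcode.strip().upper()
    return candidate if DEFAULT_PATTERN.fullmatch(candidate) else None


accepted = [matches_default(p) for p in ('AB1', '123-456-7890', '1 as 44')]
rejected = [matches_default(p) for p in ('C', '12', 'ABC123DEF 456', '1234,5678')]
```

The pattern rejects anything shorter than 3 or longer than 12 characters as well as list-like values containing `,` or `;`.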

View File

@ -163,25 +163,8 @@ Feature: Import of postcodes
| de | 01982 | country:de |
And there are word tokens for postcodes 01982
Scenario: Different postcodes with the same normalization can both be found
Given the places
| osm | class | type | addr+postcode | addr+housenumber | geometry |
| N34 | place | house | EH4 7EA | 111 | country:gb |
| N35 | place | house | E4 7EA | 111 | country:gb |
When importing
Then location_postcode contains exactly
| country | postcode | geometry |
| gb | EH4 7EA | country:gb |
| gb | E4 7EA | country:gb |
When sending search query "EH4 7EA"
Then results contain
| type | display_name |
| postcode | EH4 7EA |
When sending search query "E4 7EA"
Then results contain
| type | display_name |
| postcode | E4 7EA |
@Fail
Scenario: search and address ranks for GB post codes correctly assigned
Given the places
| osm | class | type | postcode | geometry |
@ -195,55 +178,19 @@ Feature: Import of postcodes
| E45 2 | gb | 23 | 5 |
| Y45 | gb | 21 | 5 |
Scenario: wrongly formatted GB postcodes are down-ranked
@fail-legacy
Scenario: Postcodes outside all countries are not added to the postcode and word table
Given the places
| osm | class | type | postcode | geometry |
| N1 | place | postcode | EA452CD | country:gb |
| N2 | place | postcode | E45 23 | country:gb |
| osm | class | type | addr+postcode | addr+housenumber | addr+place | geometry |
| N34 | place | house | 01982 | 111 | Null Island | 0 0.00001 |
And the places
| osm | class | type | name | geometry |
| N1 | place | hamlet | Null Island | 0 0 |
When importing
Then location_postcode contains exactly
| postcode | country | rank_search | rank_address |
| EA452CD | gb | 30 | 30 |
| E45 23 | gb | 30 | 30 |
Scenario: search and address rank for DE postcodes correctly assigned
Given the places
| osm | class | type | postcode | geometry |
| N1 | place | postcode | 56427 | country:de |
| N2 | place | postcode | 5642 | country:de |
| N3 | place | postcode | 5642A | country:de |
| N4 | place | postcode | 564276 | country:de |
When importing
Then location_postcode contains exactly
| postcode | country | rank_search | rank_address |
| 56427 | de | 21 | 11 |
| 5642 | de | 30 | 30 |
| 5642A | de | 30 | 30 |
| 564276 | de | 30 | 30 |
Scenario: search and address rank for other postcodes are correctly assigned
Given the places
| osm | class | type | postcode | geometry |
| N1 | place | postcode | 1 | country:ca |
| N2 | place | postcode | X3 | country:ca |
| N3 | place | postcode | 543 | country:ca |
| N4 | place | postcode | 54dc | country:ca |
| N5 | place | postcode | 12345 | country:ca |
| N6 | place | postcode | 55TT667 | country:ca |
| N7 | place | postcode | 123-65 | country:ca |
| N8 | place | postcode | 12 445 4 | country:ca |
| N9 | place | postcode | A1:bc10 | country:ca |
When importing
Then location_postcode contains exactly
| postcode | country | rank_search | rank_address |
| 1 | ca | 21 | 11 |
| X3 | ca | 21 | 11 |
| 543 | ca | 21 | 11 |
| 54DC | ca | 21 | 11 |
| 12345 | ca | 21 | 11 |
| 55TT667 | ca | 21 | 11 |
| 123-65 | ca | 25 | 11 |
| 12 445 4 | ca | 25 | 11 |
| A1:BC10 | ca | 25 | 11 |
| country | postcode | geometry |
And there are no word tokens for postcodes 01982
When sending search query "111, 01982 Null Island"
Then results contain
| osm | display_name |
| N34 | 111, Null Island, 01982 |

View File

@ -168,14 +168,6 @@ Feature: Import and search of names
| ID | osm |
| 0 | R1 |
Scenario: Unprintable characters in postcodes are ignored
Given the named places
| osm | class | type | address | geometry |
| N234 | amenity | prison | 'postcode' : u'1234\u200e' | country:de |
When importing
And sending search query "1234"
Then result 0 has not attributes osm_type
Scenario Outline: Housenumbers with special characters are found
Given the grid
| 1 | | | | 2 |

View File

@ -0,0 +1,97 @@
@DB
Feature: Querying for postcode variants
Scenario: Postcodes in Singapore (6-digit postcode)
Given the grid with origin SG
| 10 | | | | 11 |
And the places
| osm | class | type | name | addr+postcode | geometry |
| W1 | highway | path | Lorang | 399174 | 10,11 |
When importing
When sending search query "399174"
Then results contain
| ID | type | display_name |
| 0 | postcode | 399174 |
@fail-legacy
Scenario Outline: Postcodes in the Netherlands (mixed postcode with spaces)
Given the grid with origin NL
| 10 | | | | 11 |
And the places
| osm | class | type | name | addr+postcode | geometry |
| W1 | highway | path | De Weide | <postcode> | 10,11 |
When importing
When sending search query "3993 DX"
Then results contain
| ID | type | display_name |
| 0 | postcode | 3993 DX |
When sending search query "3993dx"
Then results contain
| ID | type | display_name |
| 0 | postcode | 3993 DX |
Examples:
| postcode |
| 3993 DX |
| 3993DX |
| 3993 dx |
@fail-legacy
Scenario: Postcodes in Singapore (6-digit postcode)
Given the grid with origin SG
| 10 | | | | 11 |
And the places
| osm | class | type | name | addr+postcode | geometry |
| W1 | highway | path | Lorang | 399174 | 10,11 |
When importing
When sending search query "399174"
Then results contain
| ID | type | display_name |
| 0 | postcode | 399174 |
@fail-legacy
Scenario Outline: Postcodes in Andorra (with country code)
Given the grid with origin AD
| 10 | | | | 11 |
And the places
| osm | class | type | name | addr+postcode | geometry |
| W1 | highway | path | Lorang | <postcode> | 10,11 |
When importing
When sending search query "675"
Then results contain
| ID | type | display_name |
| 0 | postcode | AD675 |
When sending search query "AD675"
Then results contain
| ID | type | display_name |
| 0 | postcode | AD675 |
Examples:
| postcode |
| 675 |
| AD 675 |
| AD675 |
Scenario: Different postcodes with the same normalization can both be found
Given the places
| osm | class | type | addr+postcode | addr+housenumber | geometry |
| N34 | place | house | EH4 7EA | 111 | country:gb |
| N35 | place | house | E4 7EA | 111 | country:gb |
When importing
Then location_postcode contains exactly
| country | postcode | geometry |
| gb | EH4 7EA | country:gb |
| gb | E4 7EA | country:gb |
When sending search query "EH4 7EA"
Then results contain
| type | display_name |
| postcode | EH4 7EA |
When sending search query "E4 7EA"
Then results contain
| type | display_name |
| postcode | E4 7EA |

View File

@ -18,13 +18,19 @@ from nominatim.tokenizer import factory as tokenizer_factory
def check_database_integrity(context):
""" Check some generic constraints on the tables.
"""
# place_addressline should not have duplicate (place_id, address_place_id)
cur = context.db.cursor()
cur.execute("""SELECT count(*) FROM
(SELECT place_id, address_place_id, count(*) as c
FROM place_addressline GROUP BY place_id, address_place_id) x
WHERE c > 1""")
assert cur.fetchone()[0] == 0, "Duplicates found in place_addressline"
with context.db.cursor() as cur:
# place_addressline should not have duplicate (place_id, address_place_id)
cur.execute("""SELECT count(*) FROM
(SELECT place_id, address_place_id, count(*) as c
FROM place_addressline GROUP BY place_id, address_place_id) x
WHERE c > 1""")
assert cur.fetchone()[0] == 0, "Duplicates found in place_addressline"
# word table must not have empty word_tokens
if context.nominatim.tokenizer != 'legacy':
cur.execute("SELECT count(*) FROM word WHERE word_token = ''")
assert cur.fetchone()[0] == 0, "Empty word tokens found in word table"
################################ GIVEN ##################################

View File

@ -0,0 +1,102 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Tests for the sanitizer that normalizes postcodes.
"""
import pytest
from nominatim.tokenizer.place_sanitizer import PlaceSanitizer
from nominatim.indexer.place_info import PlaceInfo
from nominatim.tools import country_info
@pytest.fixture
def sanitize(def_config, request):
country_info.setup_country_config(def_config)
sanitizer_args = {'step': 'clean-postcodes'}
for mark in request.node.iter_markers(name="sanitizer_params"):
sanitizer_args.update({k.replace('_', '-') : v for k,v in mark.kwargs.items()})
def _run(country=None, **kwargs):
pi = {'address': kwargs}
if country is not None:
pi['country_code'] = country
_, address = PlaceSanitizer([sanitizer_args]).process_names(PlaceInfo(pi))
return sorted([(p.kind, p.name) for p in address])
return _run
@pytest.mark.parametrize("country", (None, 'ae'))
def test_postcode_no_country(sanitize, country):
assert sanitize(country=country, postcode='23231') == [('unofficial_postcode', '23231')]
@pytest.mark.parametrize("country", (None, 'ae'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_no_country_drop(sanitize, country):
assert sanitize(country=country, postcode='23231') == []
@pytest.mark.parametrize("postcode", ('12345', ' 12345 ', 'de 12345',
'DE12345', 'DE 12345', 'DE-12345'))
def test_postcode_pass_good_format(sanitize, postcode):
assert sanitize(country='de', postcode=postcode) == [('postcode', '12345')]
@pytest.mark.parametrize("postcode", ('123456', '', ' ', '.....',
'DE  12345', 'DEF12345', 'CH 12345'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_drop_bad_format(sanitize, postcode):
assert sanitize(country='de', postcode=postcode) == []
@pytest.mark.parametrize("postcode", ('1234', '9435', '99000'))
def test_postcode_cyprus_pass(sanitize, postcode):
assert sanitize(country='cy', postcode=postcode) == [('postcode', postcode)]
@pytest.mark.parametrize("postcode", ('91234', '99a45', '567'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_cyprus_fail(sanitize, postcode):
assert sanitize(country='cy', postcode=postcode) == []
@pytest.mark.parametrize("postcode", ('123456', 'A33F2G7'))
def test_postcode_kazakhstan_pass(sanitize, postcode):
assert sanitize(country='kz', postcode=postcode) == [('postcode', postcode)]
@pytest.mark.parametrize("postcode", ('V34T6Y923456', '99345'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_kazakhstan_fail(sanitize, postcode):
assert sanitize(country='kz', postcode=postcode) == []
@pytest.mark.parametrize("postcode", ('675 34', '67534', 'SE-675 34', 'SE67534'))
def test_postcode_sweden_pass(sanitize, postcode):
assert sanitize(country='se', postcode=postcode) == [('postcode', '675 34')]
@pytest.mark.parametrize("postcode", ('67 345', '671123'))
@pytest.mark.sanitizer_params(convert_to_address=False)
def test_postcode_sweden_fail(sanitize, postcode):
assert sanitize(country='se', postcode=postcode) == []
@pytest.mark.parametrize("postcode", ('AB1', '123-456-7890', '1 as 44'))
@pytest.mark.sanitizer_params(default_pattern='[A-Z0-9- ]{3,12}')
def test_postcode_default_pattern_pass(sanitize, postcode):
assert sanitize(country='an', postcode=postcode) == [('postcode', postcode.upper())]
@pytest.mark.parametrize("postcode", ('C', '12', 'ABC123DEF 456', '1234,5678', '11223;11224'))
@pytest.mark.sanitizer_params(convert_to_address=False, default_pattern='[A-Z0-9- ]{3,12}')
def test_postcode_default_pattern_fail(sanitize, postcode):
assert sanitize(country='an', postcode=postcode) == []

View File

@ -72,7 +72,8 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,
def _mk_analyser(norm=("[[:Punctuation:][:Space:]]+ > ' '",), trans=(':: upper()',),
variants=('~gasse -> gasse', 'street => st', ),
sanitizers=[], with_housenumber=False):
sanitizers=[], with_housenumber=False,
with_postcode=False):
cfgstr = {'normalization': list(norm),
'sanitizers': sanitizers,
'transliteration': list(trans),
@ -81,6 +82,9 @@ def analyzer(tokenizer_factory, test_config, monkeypatch,
if with_housenumber:
cfgstr['token-analysis'].append({'id': '@housenumber',
'analyzer': 'housenumbers'})
if with_postcode:
cfgstr['token-analysis'].append({'id': '@postcode',
'analyzer': 'postcodes'})
(test_config.project_dir / 'icu_tokenizer.yaml').write_text(yaml.dump(cfgstr))
tok.loader = nominatim.tokenizer.icu_rule_loader.ICURuleLoader(test_config)
@ -246,28 +250,69 @@ def test_normalize_postcode(analyzer):
anl.normalize_postcode('38 Б') == '38 Б'
def test_update_postcodes_from_db_empty(analyzer, table_factory, word_table):
table_factory('location_postcode', 'postcode TEXT',
content=(('1234',), ('12 34',), ('AB23',), ('1234',)))
class TestPostcodes:
with analyzer() as anl:
anl.update_postcodes_from_db()
assert word_table.count() == 3
assert word_table.get_postcodes() == {'1234', '12 34', 'AB23'}
@pytest.fixture(autouse=True)
def setup(self, analyzer, sql_functions):
sanitizers = [{'step': 'clean-postcodes'}]
with analyzer(sanitizers=sanitizers, with_postcode=True) as anl:
self.analyzer = anl
yield anl
def test_update_postcodes_from_db_add_and_remove(analyzer, table_factory, word_table):
table_factory('location_postcode', 'postcode TEXT',
content=(('1234',), ('45BC', ), ('XX45', )))
word_table.add_postcode(' 1234', '1234')
word_table.add_postcode(' 5678', '5678')
def process_postcode(self, cc, postcode):
return self.analyzer.process_place(PlaceInfo({'country_code': cc,
'address': {'postcode': postcode}}))
with analyzer() as anl:
anl.update_postcodes_from_db()
assert word_table.count() == 3
assert word_table.get_postcodes() == {'1234', '45BC', 'XX45'}
def test_update_postcodes_from_db_empty(self, table_factory, word_table):
table_factory('location_postcode', 'country_code TEXT, postcode TEXT',
content=(('de', '12345'), ('se', '132 34'),
('bm', 'AB23'), ('fr', '12345')))
self.analyzer.update_postcodes_from_db()
assert word_table.count() == 5
assert word_table.get_postcodes() == {'12345', '132 34@132 34', 'AB 23@AB 23'}
def test_update_postcodes_from_db_ambigious(self, table_factory, word_table):
table_factory('location_postcode', 'country_code TEXT, postcode TEXT',
content=(('in', '123456'), ('sg', '123456')))
self.analyzer.update_postcodes_from_db()
assert word_table.count() == 3
assert word_table.get_postcodes() == {'123456', '123456@123 456'}
def test_update_postcodes_from_db_add_and_remove(self, table_factory, word_table):
table_factory('location_postcode', 'country_code TEXT, postcode TEXT',
content=(('ch', '1234'), ('bm', 'BC 45'), ('bm', 'XX45')))
word_table.add_postcode(' 1234', '1234')
word_table.add_postcode(' 5678', '5678')
self.analyzer.update_postcodes_from_db()
assert word_table.count() == 5
assert word_table.get_postcodes() == {'1234', 'BC 45@BC 45', 'XX 45@XX 45'}
def test_process_place_postcode_simple(self, word_table):
info = self.process_postcode('de', '12345')
assert info['postcode'] == '12345'
assert word_table.get_postcodes() == {'12345', }
def test_process_place_postcode_with_space(self, word_table):
info = self.process_postcode('in', '123 567')
assert info['postcode'] == '123567'
assert word_table.get_postcodes() == {'123567@123 567', }
def test_update_special_phrase_empty_table(analyzer, word_table):
@ -437,13 +482,6 @@ class TestPlaceAddress:
assert word_table.get_postcodes() == {pcode, }
@pytest.mark.parametrize('pcode', ['12:23', 'ab;cd;f', '123;836'])
def test_process_place_bad_postcode(self, word_table, pcode):
self.process_address(postcode=pcode)
assert not word_table.get_postcodes()
@pytest.mark.parametrize('hnr', ['123a', '1', '101'])
def test_process_place_housenumbers_simple(self, hnr, getorcreate_hnr_id):
info = self.process_address(housenumber=hnr)

View File

@ -0,0 +1,60 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Tests for special postcode analysis and variant generation.
"""
import pytest
from icu import Transliterator
import nominatim.tokenizer.token_analysis.postcodes as module
from nominatim.errors import UsageError
DEFAULT_NORMALIZATION = """ :: NFD ();
'🜳' > ' ';
[[:Nonspacing Mark:] [:Cf:]] >;
:: lower ();
[[:Punctuation:][:Space:]]+ > ' ';
:: NFC ();
"""
DEFAULT_TRANSLITERATION = """ :: Latin ();
'🜵' > ' ';
"""
@pytest.fixture
def analyser():
rules = { 'analyzer': 'postcodes'}
config = module.configure(rules, DEFAULT_NORMALIZATION)
trans = Transliterator.createFromRules("test_trans", DEFAULT_TRANSLITERATION)
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
return module.create(norm, trans, config)
def get_normalized_variants(proc, name):
norm = Transliterator.createFromRules("test_norm", DEFAULT_NORMALIZATION)
return proc.get_variants_ascii(norm.transliterate(name).strip())
@pytest.mark.parametrize('name,norm', [('12', '12'),
('A 34 ', 'A 34'),
('34-av', '34-AV')])
def test_normalize(analyser, name, norm):
assert analyser.normalize(name) == norm
@pytest.mark.parametrize('postcode,variants', [('12345', {'12345'}),
('AB-998', {'ab 998', 'ab998'}),
('23 FGH D3', {'23 fgh d3', '23fgh d3',
'23 fghd3', '23fghd3'})])
def test_get_variants_ascii(analyser, postcode, variants):
out = analyser.get_variants_ascii(postcode)
assert len(out) == len(set(out))
assert set(out) == variants
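
The variant sets expected by `test_get_variants_ascii` follow a simple rule: every space in the normalized postcode may be kept or dropped. A minimal sketch of that rule, assuming a simplified normalization (lower-casing and collapsing non-alphanumeric runs to one space) in place of the ICU rules, and using the hypothetical helper name `space_variants`, which is not the analyzer's implementation:

```python
import itertools
import re


def space_variants(postcode):
    """Return all spellings obtained by keeping or dropping each space."""
    # Simplified stand-in for the ICU normalization used in the tests.
    norm = re.sub(r'[\W_]+', ' ', postcode.lower()).strip()
    parts = norm.split(' ')

    variants = set()
    # One choice (' ' or '') per gap between consecutive parts.
    for seps in itertools.product((' ', ''), repeat=len(parts) - 1):
        pieces = [parts[0]]
        for sep, part in zip(seps, parts[1:]):
            pieces.append(sep + part)
        variants.add(''.join(pieces))
    return variants


simple = space_variants('12345')
dashed = space_variants('AB-998')
triple = space_variants('23 FGH D3')
```

A postcode with n spaces yields 2^n variants, which stays small for real postcodes.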

View File

@ -11,7 +11,7 @@ import subprocess
import pytest
from nominatim.tools import postcodes
from nominatim.tools import postcodes, country_info
import dummy_tokenizer
class MockPostcodeTable:
@ -64,11 +64,26 @@ class MockPostcodeTable:
def tokenizer():
return dummy_tokenizer.DummyTokenizer(None, None)
@pytest.fixture
def postcode_table(temp_db_conn, placex_table):
def postcode_table(def_config, temp_db_conn, placex_table):
country_info.setup_country_config(def_config)
return MockPostcodeTable(temp_db_conn)
@pytest.fixture
def insert_implicit_postcode(placex_table, place_row):
"""
Inserts data into the placex and place table
which can then be used to compute one postcode.
"""
def _insert_implicit_postcode(osm_id, country, geometry, address):
placex_table.add(osm_id=osm_id, country=country, geom=geometry)
place_row(osm_id=osm_id, geom='SRID=4326;'+geometry, address=address)
return _insert_implicit_postcode
def test_postcodes_empty(dsn, postcode_table, place_table,
tmp_path, tokenizer):
postcodes.update_postcodes(dsn, tmp_path, tokenizer)
@ -193,7 +208,22 @@ def test_can_compute(dsn, table_factory):
table_factory('place')
assert postcodes.can_compute(dsn)
def test_no_placex_entry(dsn, tmp_path, temp_db_cursor, place_row, postcode_table, tokenizer):
#Rewrite the get_country_code function to verify its execution.
temp_db_cursor.execute("""
CREATE OR REPLACE FUNCTION get_country_code(place geometry)
RETURNS TEXT AS $$ BEGIN
RETURN 'yy';
END; $$ LANGUAGE plpgsql;
""")
place_row(geom='SRID=4326;POINT(10 12)', address=dict(postcode='AB 4511'))
postcodes.update_postcodes(dsn, tmp_path, tokenizer)
assert postcode_table.row_set == {('yy', 'AB 4511', 10, 12)}
def test_discard_badly_formatted_postcodes(dsn, tmp_path, temp_db_cursor, place_row, postcode_table, tokenizer):
#Rewrite the get_country_code function to verify its execution.
temp_db_cursor.execute("""
CREATE OR REPLACE FUNCTION get_country_code(place geometry)
@ -204,16 +234,4 @@ def test_no_placex_entry(dsn, tmp_path, temp_db_cursor, place_row, postcode_tabl
place_row(geom='SRID=4326;POINT(10 12)', address=dict(postcode='AB 4511'))
postcodes.update_postcodes(dsn, tmp_path, tokenizer)
assert postcode_table.row_set == {('fr', 'AB 4511', 10, 12)}
@pytest.fixture
def insert_implicit_postcode(placex_table, place_row):
"""
Inserts data into the placex and place table
which can then be used to compute one postcode.
"""
def _insert_implicit_postcode(osm_id, country, geometry, address):
placex_table.add(osm_id=osm_id, country=country, geom=geometry)
place_row(osm_id=osm_id, geom='SRID=4326;'+geometry, address=address)
return _insert_implicit_postcode
assert not postcode_table.row_set

View File

@ -0,0 +1,56 @@
# SPDX-License-Identifier: GPL-2.0-only
#
# This file is part of Nominatim. (https://nominatim.org)
#
# Copyright (C) 2022 by the Nominatim developer community.
# For a full list of authors see the git log.
"""
Tests for centroid computation.
"""
import pytest
from nominatim.utils.centroid import PointsCentroid
def test_empty_set():
c = PointsCentroid()
with pytest.raises(ValueError, match='No points'):
c.centroid()
@pytest.mark.parametrize("centroid", [(0,0), (-1, 3), [0.0000032, 88.4938]])
def test_one_point_centroid(centroid):
c = PointsCentroid()
c += centroid
assert len(c.centroid()) == 2
assert c.centroid() == (pytest.approx(centroid[0]), pytest.approx(centroid[1]))
def test_multipoint_centroid():
c = PointsCentroid()
c += (20.0, -10.0)
assert c.centroid() == (pytest.approx(20.0), pytest.approx(-10.0))
c += (20.2, -9.0)
assert c.centroid() == (pytest.approx(20.1), pytest.approx(-9.5))
c += (20.2, -9.0)
assert c.centroid() == (pytest.approx(20.13333), pytest.approx(-9.333333))
def test_manypoint_centroid():
c = PointsCentroid()
for _ in range(10000):
c += (4.564732, -0.000034)
assert c.centroid() == (pytest.approx(4.564732), pytest.approx(-0.000034))
@pytest.mark.parametrize("param", ["aa", None, 5, [1, 2, 3], (3, None), ("a", 3.9)])
def test_add_non_tuple(param):
c = PointsCentroid()
with pytest.raises(ValueError, match='2-element tuples'):
c += param