Nominatim

mirror of https://github.com/osm-search/Nominatim.git synced 2024-11-24 14:32:29 +03:00

Author	SHA1	Message	Date
Sandor Nagy	7e3701b64a	Fix typo in log message on replication initialisation	2022-03-15 07:50:47 +01:00
Sarah Hoffmann	15beeef6ce	do not expand records in select list An expression of the form 'SELECT (func()).*' will be expanded by Postgresql _before_ execution with the result that the function will be called as many times as there are fields in the record. This is not what we want. The function call needs to go into the FROM clause instead.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	92bc3cd0a7	fix linting issue	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	4a3bbd0319	adapt housenumber cleanup to new word table structure	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	13ed184efd	housenumber analyzer: avoid creating too many variants Housenumber fields with lots of text are likely bad data. So is data with many changes from letter to digit. Exclude them from adding optional spaces.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	f03a05f6bb	add new analyser for housenumbers This analyser makes spaces optional.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	a6903651fc	add framework for analysing housenumbers This lays the groundwork for adding variants for housenumbers. When analysis is enabled, then the 'word' field in the word table is used as usual, so that variants can be created. There will be only one analyser allowed which must have the fixed name '@housenumber'.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	b8c544cc98	icu: move token deduplication into TokenInfo Puts collection into one common place.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	243725aae1	icu: move housenumber token computation out of TokenInfo This was the last function to use the cache. There is a cleaner separation of responsibility now.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	0bb59b2e22	handle unknown analyzer When changing something in the default configuration of the sanitizers that refers to an analyzer that is not yet loaded, there shouldn't be any errors.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	837d44391c	move generation of normalized token form to analyzer This gives the analyzer more flexibility in choosing the normalized form. In particular, an analyzer creating different variants can choose the variant that will be used as the canonical form.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	5425394654	add migration to add new derived_names column	2022-02-24 20:50:33 +01:00
Sarah Hoffmann	f74228830d	bdd: run full import on tests This uncovered a couple of outdated/wrong tests which have been fixed, too.	2022-02-24 14:27:51 +01:00
Sarah Hoffmann	a3e4e8e5cd	delete unused country name tokens	2022-02-23 09:23:06 +01:00
Sarah Hoffmann	38c3ef3da0	add tests for get_string_list() Renaming test file for sanitizer config because pytest requires unique names for test files.	2022-02-07 11:22:24 +01:00
Sarah Hoffmann	610f2cc254	sanitizer: move helpers into a configuration class	2022-02-07 10:48:00 +01:00
Sarah Hoffmann	a79a3210e6	implement is-a-name option for housenumbers	2022-02-07 09:27:11 +01:00
Sarah Hoffmann	98432395c3	add migration for upcoming change to tiger tables	2022-01-27 11:48:27 +01:00
Sarah Hoffmann	83d2c440d5	add migration for new interpolation table layout	2022-01-27 11:14:55 +01:00
Sarah Hoffmann	e6d855b954	add migration for new lookup index	2022-01-27 11:14:55 +01:00
Sarah Hoffmann	c170d323d9	add tests for cleaning housenumbers	2022-01-20 23:47:20 +01:00
Sarah Hoffmann	3ce123ab69	do not clean housenumbers in reverse-only mode	2022-01-20 20:21:13 +01:00
Sarah Hoffmann	d8b7a51ab6	add actual removal of housenumber tokens	2022-01-20 20:18:15 +01:00
Sarah Hoffmann	344a2bfc1a	add new command for cleaning word tokens Just pulls outdated housenumbers for the moment.	2022-01-20 20:05:15 +01:00
Sarah Hoffmann	1e5a8561c0	fix linting issues	2022-01-20 16:00:23 +01:00
Sarah Hoffmann	f3c9578bca	complete documentation for new clean-housenumbers sanitizer	2022-01-20 15:49:32 +01:00
Sarah Hoffmann	3741afa6dc	generalize filter-kind parameter for sanitizers Now behaves the same for tag_analyzer_by_language and clean_housenumbers. Adds tests.	2022-01-20 15:42:42 +01:00
Sarah Hoffmann	4774e45218	clean_housenumbers: make kinds and delimiters configurable Also adds unit tests for various options.	2022-01-20 12:07:12 +01:00
Sarah Hoffmann	206ee87188	factor out housenumber splitting into sanitizer	2022-01-19 17:27:50 +01:00
Sarah Hoffmann	3df560ea38	fix linting error	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	adbaf700cd	move parsing of mutation config to setup phase	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	b453b0ea95	introduce mutation variants to generic token analyser Mutations are regular-expression-based replacements that are applied after variants have been computed. They are meant to be used for variations on character level. Add spelling variations for German umlauts.	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	0192a7af96	move variant configuration reading into a separate file	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	630ad38a67	refactor variant production to use generators	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	c3788d765e	add consistent SPDX copyright headers	2022-01-03 16:23:58 +01:00
Sarah Hoffmann	f9b56a8581	correctly match abbreviated addr:street This only works when addr:street is abbreviated and the street name isn't. It does not work the other way around.	2021-12-08 21:58:43 +01:00
Sarah Hoffmann	7f7d2fd5b3	skip most addr: tags with suffixes Only one addr: tag can be processed currently, so make sure it is the one without suffixes to not get odd data. addr:street is the exception because it uses a different matching mechanism.	2021-12-06 14:55:10 +01:00
Sarah Hoffmann	44cfce1ca4	revert to using full names for street name matching Using partial names turned out to not work well because there are often similarly named streets next to each other. It also prevents us from being able to take into account all addr:street:* tags. This change gets all the full term tokens for the addr:street tags from the DB. As they are used for matching only, we can assume that the term must already be there or there will be no match. This avoids creating unused full name tags.	2021-12-06 11:38:38 +01:00
Sarah Hoffmann	54d35ddfe9	split cli tests by subcommand and extend coverage	2021-12-02 23:45:48 +01:00
Sarah Hoffmann	7beccb7997	remove unnecessary pass statements	2021-12-02 15:54:24 +01:00
Sarah Hoffmann	14a78f55cd	more unit tests for tokenizers	2021-12-02 15:46:36 +01:00
Sarah Hoffmann	a52ed366e4	add tests for migration	2021-12-01 20:27:40 +01:00
Sarah Hoffmann	810056349f	add migration for inclusive housenumber Tiger index	2021-11-24 12:03:20 +01:00
Sarah Hoffmann	10e979e841	only instantiate indexer once for replication Also makes sure that indexer object exists everywhere where needed. See #2518.	2021-11-19 14:48:58 +01:00
Sarah Hoffmann	345c812e43	better error reporting when API script does not exist Check if the API script exists at the expected location before running php-cli. This way we can add a useful hint about the project directory. Fixes #2513.	2021-11-10 11:58:20 +01:00
Sarah Hoffmann	d479a0585d	prepare release 4.0.0	2021-11-02 20:27:55 +01:00
Sarah Hoffmann	37eeccbf4c	ICU: use normalization from config in PHP The TERM_NORMALIZATION config option is no longer applicable. That was already documented but not yet implemented.	2021-10-27 11:32:44 +02:00
Sarah Hoffmann	53dbe58ada	do not count words when in reverse-only mode	2021-10-26 12:00:13 +02:00
Sarah Hoffmann	2c4b798f9b	further refactor setup to keep function small	2021-10-26 12:00:13 +02:00
Sarah Hoffmann	9934421442	make word count computation part of the import Accurate word counts are now essential when using the ICU tokenizer and don't hurt for the legacy one. Adds about an hour import time.	2021-10-26 12:00:13 +02:00
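
Commit b453b0ea95 describes mutations as regular-expression-based replacements applied after variants have been computed, with German umlauts as the first use. The following is a minimal sketch of that idea only, not the code from the commit: the `MUTATIONS` table, the function name `apply_mutations`, and the choice of replacement lists are all assumptions made for illustration.

```python
import itertools
import re

# Hypothetical mutation table (an assumption, not Nominatim's configuration):
# each pattern maps to the spellings that may stand in for a match.
MUTATIONS = [
    (re.compile("ä"), ["ä", "ae"]),
    (re.compile("ö"), ["ö", "oe"]),
    (re.compile("ü"), ["ü", "ue"]),
]


def apply_mutations(variant, mutations=MUTATIONS):
    """Return every spelling of `variant` produced by the mutation rules."""
    results = [variant]
    for pattern, replacements in mutations:
        expanded = []
        for text in results:
            # Split around the matches; one piece means the rule does not apply.
            parts = pattern.split(text)
            if len(parts) == 1:
                expanded.append(text)
                continue
            # Choose a replacement independently for each match.
            for combo in itertools.product(replacements, repeat=len(parts) - 1):
                pieces = [parts[0]]
                for repl, rest in zip(combo, parts[1:]):
                    pieces.append(repl)
                    pieces.append(rest)
                expanded.append("".join(pieces))
        results = expanded
    return results


if __name__ == "__main__":
    print(apply_mutations("münchen"))
```

Because the rules run on already computed variants, each variant is expanded on its own, and a term without any matching character passes through unchanged.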

442 Commits
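
Commits f03a05f6bb and 13ed184efd describe a housenumber analyser that makes spaces optional but skips entries with a lot of text or many changes between letters and digits. A rough sketch of that behaviour might look like the following. The thresholds `MAX_LENGTH` and `MAX_TRANSITIONS`, the function names, and the reading that every existing space may be dropped are assumptions for illustration, not values or logic taken from the commits.

```python
import itertools

# Hypothetical thresholds (assumptions, not the values used by Nominatim).
MAX_LENGTH = 10       # longer entries count as "lots of text"
MAX_TRANSITIONS = 2   # more letter/digit changes than this look like bad data


def count_transitions(text):
    """Count changes between digit and non-digit characters, ignoring spaces."""
    chars = [c for c in text if not c.isspace()]
    return sum(1 for a, b in zip(chars, chars[1:]) if a.isdigit() != b.isdigit())


def housenumber_variants(name):
    """Return spellings of a housenumber in which each space is optional."""
    normalized = " ".join(name.split())
    # Suspicious entries keep only their normalized form.
    if len(normalized) > MAX_LENGTH or count_transitions(normalized) > MAX_TRANSITIONS:
        return [normalized]
    parts = normalized.split(" ")
    variants = []
    # Each gap between parts is either kept as a space or dropped.
    for seps in itertools.product([" ", ""], repeat=len(parts) - 1):
        text = parts[0]
        for sep, part in zip(seps, parts[1:]):
            text += sep + part
        variants.append(text)
    return variants


if __name__ == "__main__":
    print(housenumber_variants("12 a"))
```

The number of variants doubles with every space, which is one reason to exclude long or irregular entries before expanding them.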