Nominatim

mirror of https://github.com/osm-search/Nominatim.git synced 2024-11-23 13:44:36 +03:00

Author	SHA1	Message	Date
Tareq Al-Ahdal	d09670d208	modify logic to prepend 'name:' to keys	2022-03-18 06:01:25 +08:00
Tareq Al-Ahdal	d32a7c1888	initialize an empty dictionary for nested name key	2022-03-18 02:50:33 +08:00
Tareq Al-Ahdal	6be2077d92	Merge branch 'master' into country-names-yaml-configuration	2022-03-18 02:36:12 +08:00
Tareq Al-Ahdal	456d439e97	Reformatting of country keys	2022-03-18 02:23:11 +08:00
Tareq Al-Ahdal	b4bd4ff67d	fix linting error	2022-03-15 19:14:04 +08:00
Sandor Nagy	7e3701b64a	Fix typo in log message on replication initialisation	2022-03-15 07:50:47 +01:00
Tareq Al-Ahdal	165d17f7f7	reintroduce 'name:' prefix to country name keys	2022-03-13 18:58:27 +08:00
Tareq Al-Ahdal	377cf36be3	modify data import logic to load country names from yaml	2022-03-12 15:20:57 +08:00
Sarah Hoffmann	15beeef6ce	do not expand records in select list An expression of the form 'SELECT (func()).*' will be expanded by Postgresql _before_ execution with the result that the function will be called as many times as there are fields in the record. This is not what we want. The function call needs to go into the FROM clause instead.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	92bc3cd0a7	fix linting issue	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	4a3bbd0319	adapt housenumber cleanup to new word table structure	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	13ed184efd	housenumber analyzer: avoid creating too many variants Housenumber fields with lots of text are likely bad data. So is data with many changes from letter to digit. Exclude them from adding optional spaces.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	f03a05f6bb	add new analyser for housenumbers This analyser makes spaces optional.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	a6903651fc	add framework for analysing housenumbers This lays the groundwork for adding variants for housenumbers. When analysis is enabled, then the 'word' field in the word table is used as usual, so that variants can be created. There will be only one analyser allowed which must have the fixed name '@housenumber'.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	b8c544cc98	icu: move token deduplication into TokenInfo Puts collection into one common place.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	243725aae1	icu: move housenumber token computation out of TokenInfo This was the last function to use the cache. There is a more clean separation of responsibility now.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	0bb59b2e22	handle unknown analyzer When changing something in the default configuration of the sanitizers that refers to an analyzer that is not yet loaded, there shouldn't be any errors.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	837d44391c	move generation of normalized token form to analyzer This gives the analyzer more flexibility in choosing the normalized form. In particular, an analyzer creating different variants can choose the variant that will be used as the canonical form.	2022-03-01 09:34:32 +01:00
Sarah Hoffmann	5425394654	add migration to add new derived_names column	2022-02-24 20:50:33 +01:00
Sarah Hoffmann	f74228830d	bdd: run full import on tests This uncovered a couple of outdated/wrong tests which have been fixed, too.	2022-02-24 14:27:51 +01:00
Sarah Hoffmann	a3e4e8e5cd	delete unused country name tokens	2022-02-23 09:23:06 +01:00
Sarah Hoffmann	38c3ef3da0	add tests for get_string_list() Renaming test file for sanitizer config because pytest requires unique names for test files.	2022-02-07 11:22:24 +01:00
Sarah Hoffmann	610f2cc254	sanitizer: move helpers into a configuration class	2022-02-07 10:48:00 +01:00
Sarah Hoffmann	a79a3210e6	implement is-a-name option for housenumbers	2022-02-07 09:27:11 +01:00
Sarah Hoffmann	98432395c3	add migration for upcoming change to tiger tables	2022-01-27 11:48:27 +01:00
Sarah Hoffmann	83d2c440d5	add migration for new interpolation table layout	2022-01-27 11:14:55 +01:00
Sarah Hoffmann	e6d855b954	add migration for new lookup index	2022-01-27 11:14:55 +01:00
Sarah Hoffmann	c170d323d9	add tests for cleaning housenumbers	2022-01-20 23:47:20 +01:00
Sarah Hoffmann	3ce123ab69	do not clean housenumbers in reverse-only mode	2022-01-20 20:21:13 +01:00
Sarah Hoffmann	d8b7a51ab6	add actual removal of housenumber tokens	2022-01-20 20:18:15 +01:00
Sarah Hoffmann	344a2bfc1a	add new command for cleaning word tokens Just pulls outdated housenumbers for the moment.	2022-01-20 20:05:15 +01:00
Sarah Hoffmann	1e5a8561c0	fix linting issues	2022-01-20 16:00:23 +01:00
Sarah Hoffmann	f3c9578bca	complete documentation for new clean-housenumbers sanitizer	2022-01-20 15:49:32 +01:00
Sarah Hoffmann	3741afa6dc	generalize filter-kind parameter for sanitizers Now behaves the same for tag_analyzer_by_language and clean_housenumbers. Adds tests.	2022-01-20 15:42:42 +01:00
Sarah Hoffmann	4774e45218	clean_housenumbers: make kinds and delimiters configurable Also adds unit tests for various options.	2022-01-20 12:07:12 +01:00
Sarah Hoffmann	206ee87188	factor out housenumber splitting into sanitizer	2022-01-19 17:27:50 +01:00
Sarah Hoffmann	3df560ea38	fix linting error	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	adbaf700cd	move parsing of mutation config to setup phase	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	b453b0ea95	introduce mutation variants to generic token analyser Mutations are regular-expression-based replacements that are applied after variants have been computed. They are meant to be used for variations on character level. Add spelling variations for German umlauts.	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	0192a7af96	move variant configuration reading in separate file	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	630ad38a67	refactor variant production to use generators	2022-01-18 11:09:21 +01:00
Sarah Hoffmann	c3788d765e	add consistent SPDX copyright headers	2022-01-03 16:23:58 +01:00
Sarah Hoffmann	f9b56a8581	correctly match abbreviated addr:street This only works when addr:street is abbreviated and the street name isn't. It does not work the other way around.	2021-12-08 21:58:43 +01:00
Sarah Hoffmann	7f7d2fd5b3	skip most addr: tags with suffixes Only one addr: tag can be processed currently, so make sure it is the one without suffixes to not get odd data. addr:street is the exception because it uses a different matching mechanism.	2021-12-06 14:55:10 +01:00
Sarah Hoffmann	44cfce1ca4	revert to using full names for street name matching Using partial names turned out to not work well because there are often similarly named streets next to each other. It also prevents us from being able to take into account all addr:street:* tags. This change gets all the full term tokens for the addr:street tags from the DB. As they are used for matching only, we can assume that the term must already be there or there will be no match. This avoids creating unused full name tags.	2021-12-06 11:38:38 +01:00
Sarah Hoffmann	54d35ddfe9	split cli tests by subcommand and extend coverage	2021-12-02 23:45:48 +01:00
Sarah Hoffmann	7beccb7997	remove unnecessary pass statements	2021-12-02 15:54:24 +01:00
Sarah Hoffmann	14a78f55cd	more unit tests for tokenizers	2021-12-02 15:46:36 +01:00
Sarah Hoffmann	a52ed366e4	add tests for migration	2021-12-01 20:27:40 +01:00
Sarah Hoffmann	810056349f	add migration for inclusive housenumber Tiger index	2021-11-24 12:03:20 +01:00
Sarah Hoffmann	10e979e841	only instantiate indexer once for replication Also makes sure that indexer object exists everywhere where needed. See #2518.	2021-11-19 14:48:58 +01:00
Sarah Hoffmann	345c812e43	better error reporting when API script does not exist Check if the API script exists on the expected location before running php-cli. This way we can add a useful hint about the project directory. Fixes #2513.	2021-11-10 11:58:20 +01:00
Sarah Hoffmann	d479a0585d	prepare release 4.0.0	2021-11-02 20:27:55 +01:00
Sarah Hoffmann	37eeccbf4c	ICU: use normalization from config in PHP The TERM_NORMALIZATION config option is no longer applicable. That was already documented but not yet implemented.	2021-10-27 11:32:44 +02:00
Sarah Hoffmann	53dbe58ada	do not count words when in reverse-only mode	2021-10-26 12:00:13 +02:00
Sarah Hoffmann	2c4b798f9b	further refactor setup to keep function small	2021-10-26 12:00:13 +02:00
Sarah Hoffmann	9934421442	make word count computation part of the import Accurate word counts are now essential when using the ICU tokenizer and don't hurt for the legacy one. Adds about an hour import time.	2021-10-26 12:00:13 +02:00
Sarah Hoffmann	5c778c6d32	Merge pull request #2486 from lonvia/fix-special-phrases Fix parsing of operator in special phrases	2021-10-25 21:45:08 +02:00
Sarah Hoffmann	85797acf1e	ICU: add an index over word_ids Needed for keyword lookup in the details response.	2021-10-25 21:33:27 +02:00
Sarah Hoffmann	c4f5c11a4e	be case-insensitive about special phrase operator	2021-10-25 19:51:20 +02:00
Sarah Hoffmann	5a1c3dbea3	fix parsing of operator in special phrases Because of unstripped input, the operators wouldn't match.	2021-10-25 19:46:30 +02:00
Sarah Hoffmann	13e7398566	allow relative paths for log files	2021-10-25 10:26:05 +02:00
Sarah Hoffmann	1098ab732f	allow relative paths for flatnode file	2021-10-22 17:32:51 +02:00
Sarah Hoffmann	507fdd4f40	switch IMPORT_STYLE to use generic file search Allows relative paths wrt project directory.	2021-10-22 16:49:57 +02:00
Sarah Hoffmann	0ae8d7ac08	have ADDRESS_LEVEL_CONFIG use load_sub_configuration This means that relative paths now are looked up in the project directory.	2021-10-22 16:36:52 +02:00
Sarah Hoffmann	c77df2d1eb	replace NOMINATIM_PHRASE_CONFIG with command line option	2021-10-22 14:41:14 +02:00
Sarah Hoffmann	c1fa70639b	add new replication mode catch-up This mode gets updates until the server reports no new diffs anymore. Also adds additional indexing, when the main indexing step left a couple of objects to process. This happens only when the next update is expected to be more than 40min away.	2021-10-20 22:05:15 +02:00
Sarah Hoffmann	12643c5986	run Tiger import with parallel threads per default	2021-10-19 15:00:26 +02:00
Sarah Hoffmann	ec7184c533	icu: no longer precompute terms The ICU analyzer no longer drops frequent partials, so it is no longer necessary to know the frequencies in advance.	2021-10-19 11:52:28 +02:00
Sarah Hoffmann	e8e2502e2f	make word recount a tokenizer-specific function	2021-10-19 11:21:16 +02:00
Sarah Hoffmann	47417d1871	update and extend man page Provide extended descriptions for most subcommands.	2021-10-18 09:03:07 +02:00
Sarah Hoffmann	552fb16cb2	fix template expressions for tablespaces	2021-10-15 15:11:09 +02:00
Sarah Hoffmann	3649487f5e	use SP-GIST index for building index where available Point-in-polygon queries are much faster with a SP-GIST geometry index, so use that for the index used to check if a housenumber is inside a building. Only available with Postgis 3. There is an automatic fallback to GIST for Postgis 2.	2021-10-10 21:55:38 +02:00
Sarah Hoffmann	6c79a60e19	add documentation for new configuration of ICU tokenizer	2021-10-07 11:55:53 +02:00
Sarah Hoffmann	2a94bfc703	fix argument description for check_database	2021-10-07 09:49:13 +02:00
Sarah Hoffmann	299934fd2a	reorganize and complete tests around generic token analysis	2021-10-06 17:03:37 +02:00
Sarah Hoffmann	b18d042832	add tests for sanitizer tagging language	2021-10-06 12:29:25 +02:00
Sarah Hoffmann	97a10ec218	apply variants by languages Adds a tagger for names by language so that the analyzer of that language is used. Thus variants are now only applied to names in the specific language and only tag name tags, no longer to reference-like tags.	2021-10-06 11:09:54 +02:00
Sarah Hoffmann	d35400a7d7	use analyser provided in the 'analyzer' property Implements per-name choice of analyzer. If a non-default analyzer is chosen, then the 'word' identifier is extended with the name of the analyzer, so that we still have unique items.	2021-10-05 14:10:32 +02:00
Sarah Hoffmann	92f6ec2328	remove support for properties on variants Those are not going to be used in the near future, so no need to carry that code around just now.	2021-10-05 10:29:36 +02:00
Sarah Hoffmann	9ba2019470	precompute replacements while loading configuration	2021-10-05 10:20:08 +02:00
Sarah Hoffmann	c171d88194	move parsing of token analysis config to analyzer Adds a second callback for the analyzer which is responsible for parsing the configuration rules and converting it to whatever format necessary. This way, each analyzer implementation can define its own configuration rules.	2021-10-04 18:31:58 +02:00
Sarah Hoffmann	7cfcbacfc7	make token analyzers configurable modules Adds a mandatory section 'analyzer' to the token-analysis entries which defines which analyser to use. Currently there is exactly one, generic, which implements the former ICUNameProcessor.	2021-10-04 17:37:34 +02:00
Sarah Hoffmann	52847b61a3	extend ICU config to accommodate multiple analysers Adds parsing of multiple variant lists from the configuration. Every entry except one must have a unique 'id' parameter to distinguish the entries. The entry without id is considered the default. Currently only the list without an id is used for analysis.	2021-10-04 16:40:28 +02:00
Sarah Hoffmann	5a36559834	move flatten_config_list into config module For general usage by other modules.	2021-10-04 11:56:54 +02:00
Sarah Hoffmann	732cd27d2e	add unit tests for new sanitizer functions	2021-10-01 12:27:24 +02:00
Sarah Hoffmann	8171fe4571	introduce sanitizer step before token analysis Sanitizer functions allow transforming name and address tags before they are handed to the tokenizer. These transformations are visible only to the tokenizer and thus only have an influence on the search terms and address match terms for a place. Currently two sanitizers are implemented which are responsible for splitting names with multiple values and removing bracket additions. Both were previously hard-coded in the tokenizer.	2021-10-01 12:27:24 +02:00
Sarah Hoffmann	16daa57e47	unify ICUNameProcessorRules and ICURuleLoader There is no need for the additional layer of indirection that the ICUNameProcessorRules class adds. The ICURuleLoader can fill the database properties directly.	2021-10-01 12:27:24 +02:00
Sarah Hoffmann	5e5addcdbf	fix typo	2021-09-29 14:16:09 +02:00
Sarah Hoffmann	be65c8303f	export more data for the tokenizer name preparation Adds class, type, country and rank to the exported information and removes the rather odd hack for countries. Whether a place represents a country boundary can now be computed by the tokenizer.	2021-09-29 11:54:14 +02:00
Sarah Hoffmann	231250f2eb	add wrapper class for place data passed to tokenizer This is mostly for convenience and documentation purposes.	2021-09-29 11:54:07 +02:00
Sarah Hoffmann	bb18479d5b	remove unused parameter	2021-09-27 14:58:43 +02:00
Sarah Hoffmann	bd7c7ddad0	icu tokenizer: switch to matching against partial names When matching address parts from addr:* tags against place names, the address names were so far converted to full names and those compared to the place names. This can become problematic with the new ICU tokenizer once we introduce creation of different variants depending on the place name context. It wouldn't be clear which variant to produce to get a match, so we would have to create all of them. To work around this issue, switch to using the partial terms for matching. This introduces a larger fuzziness between matches but that shouldn't be a problem because matching is always geographically restricted. The search terms created for address parts have a different problem: they are already created before we even know if they are going to be used. This can lead to spurious entries in the word table, which slows down searching. This problem can also be circumvented by using only partial terms for the search terms. In terms of searching that means that the address terms would not get the full-word boost, but given that the case where an address part does not exist as an OSM object should be the exception, this is likely acceptable.	2021-09-27 11:36:19 +02:00
Sarah Hoffmann	b894d2c04a	fix indent	2021-09-04 10:30:35 +02:00
Sarah Hoffmann	8e1d4818ac	use yaml config loader for country info	2021-09-04 00:22:55 +02:00
Sarah Hoffmann	28c98584c1	add tests for generic YAML config reader	2021-09-03 22:31:30 +02:00
Sarah Hoffmann	1c42780bb5	introduce generic YAML config loader Adds a function to the Configuration class to load a YAML file. This means that searching for the file is generalised and works the same now for all configuration files. Changes the search logic, so that it is always possible to have a custom version of the configuration file in the project directory. Move ICU tokenizer to use new load function.	2021-09-03 18:20:07 +02:00
Sarah Hoffmann	7e7dd769fd	remove language and partition from name import	2021-09-02 14:41:11 +02:00
Sarah Hoffmann	79da96b369	read partition and languages from config file	2021-09-02 14:41:11 +02:00
Sarah Hoffmann	78fcabade8	move country name generation to country_info module	2021-09-02 14:41:11 +02:00
Sarah Hoffmann	284645f505	move generation of country tables in own module	2021-09-02 14:41:11 +02:00
Sarah Hoffmann	28ee3d0949	move linking of places to the preparation stage Linked places may bring in extra names. These names need to be processed by the tokenizer. That means that the linking needs to be done before the data is handed to the tokenizer. Move finding the linked place into the preparation stage and update the name fields. Everything else is still done in the indexing stage.	2021-08-20 22:44:17 +02:00
Sarah Hoffmann	118858a55e	rename legacy_icu tokenizer to icu tokenizer The new icu tokenizer is now no longer compatible with the old legacy tokenizer in terms of data structures. Therefore there is also no longer a need to refer to the legacy tokenizer in the name.	2021-08-17 23:11:47 +02:00
Sarah Hoffmann	90b40fc3e6	define formal public Python interface for tokenizer This introduces an abstract class for the Tokenizer/Analyzer for documentation purposes.	2021-08-16 11:41:54 +02:00
Sarah Hoffmann	75a5c7013f	split up large setup function	2021-08-15 12:24:13 +02:00
Sarah Hoffmann	87dedde5d6	allow multiple files for the import command The files are forwarded to osm2pgsql which is now able to merge them correctly.	2021-08-14 21:42:21 +02:00
Sarah Hoffmann	d48793c22c	fix Python linting errors	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	1db098c05d	reinstate word column in icu word table Postgresql is very bad at creating statistics for jsonb columns. The result is that the query planner tends to use JIT for queries with a where over 'info' even when there is an index.	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	e42878eeda	adapt unit test for new word table Requires a second wrapper class for the word table with the new layout. This class is interface-compatible, so that later when the ICU tokenizer becomes the default, all tests that depend on behaviour of the default tokenizer can be switched to the other wrapper.	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	eb6814d74e	convert word info column to json before copying	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	70f154be8b	switch word tokens to new word table layout	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	4342b28882	switch special phrases to new word table format	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	5394b1fa1b	switch postcode tokens to new word table layout	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	5ab0a63fd6	switch housenumber tokens to new word table layout	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	1618aba5f2	switch country name tokens to new word table layout	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	8377528952	new word table layout for icu tokenizer The table now directly reflects the different token types. Extra information is saved in a json structure that may be dynamically extended in the future without affecting the table layout.	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	e42349c963	replace add-data function with native Python code	2021-07-26 10:41:37 +02:00
Sarah Hoffmann	878835e4bd	move add-data subcommand into a separate file	2021-07-25 18:14:12 +02:00
Sarah Hoffmann	2c8242c8df	remove special code for pre9.5 postgresql 9.5 is now the minimum requirement.	2021-07-19 10:24:57 +02:00
Sarah Hoffmann	e7d6f89aca	increase minimum version for PostgreSQL to 9.5 This is the minimum version we can test with the CI. With 9.5 there is also complete support for jsonb available.	2021-07-19 10:21:19 +02:00
Sarah Hoffmann	14f777da18	use psycopg's SQL quoting where possible Use the SQL formatting supplied with psycopg whenever the query needs to be put together from snippets.	2021-07-12 22:05:22 +02:00
Sarah Hoffmann	6f6681ce67	add helper function for execute_values Make psycopg2's convenience function accessible through the cursor.	2021-07-12 21:08:20 +02:00
Sarah Hoffmann	06602b4ec0	provide wrapper function for DROP TABLE Use psycopg2 formatting to ensure correct quoting.	2021-07-12 20:32:46 +02:00
Sarah Hoffmann	cf98cff2a1	more formatting fixes Found by flake8.	2021-07-12 17:45:42 +02:00
Sarah Hoffmann	f8b5a63de3	factor out connection reset code	2021-07-12 14:58:44 +02:00
Sarah Hoffmann	568316f07c	simplify analyse function	2021-07-12 14:47:50 +02:00
Sarah Hoffmann	daa597b300	split up variant computation for better readability	2021-07-12 14:43:50 +02:00
Sarah Hoffmann	47adb2a3fc	reorganise process_place function Move address processing into its own function as it is rather extensive.	2021-07-12 11:57:55 +02:00
Sarah Hoffmann	fff0012249	simplify website setup code Use format strings and move variable quoting code into extra function.	2021-07-12 11:41:05 +02:00
Sarah Hoffmann	d5a1883b62	avoid repeated patterns for table name	2021-07-12 11:33:09 +02:00
Sarah Hoffmann	a08ef43e40	simplify if statements	2021-07-12 11:28:47 +02:00
Sarah Hoffmann	3661f7a321	avoid multiple returns of same value Found by Sonarqube.	2021-07-11 18:23:42 +02:00
Sarah Hoffmann	a2edbbf78a	cannot use capture_output in subprocess.run Only available since Python 3.7.	2021-07-06 22:57:42 +02:00
Sarah Hoffmann	1e86dc1d93	remove default parameter for namedtuple This is only available in Python 3.7.	2021-07-06 22:57:42 +02:00
Sarah Hoffmann	62d5984b1b	limit the number of variants that can be produced	2021-07-04 10:28:28 +02:00
Sarah Hoffmann	c32551b4e0	restrict partial word counting to names of reasonable length The partial word count does not split names to save a bit of time. The result is that it might encounter unreasonably long names which in truth consist of multiple words. No accurate statistics are needed so simply restrict the count to words shorter than 75 characters.	2021-07-04 10:28:28 +02:00
Sarah Hoffmann	e85f7e7aa9	fix subsequent replacements Two replacement words directly following each other did not work as expected because each expects a space at the beginning/end while there was only one space available. Also forbid composing a word after a space was added in the end by a previous replacement.	2021-07-04 10:28:28 +02:00
Sarah Hoffmann	7b0f6b7905	leave ICU variant properties empty for now Saving unused properties causes unnecessary duplicates.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	b9fbfeff67	only consider partials in multi-words for initial count This ensures that it is less likely that we exclude meaningful words like 'hauptstrasse' just because they are frequent.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	62828fc5c1	switch to a more flexible variant description format The new format combines compound splitting and abbreviation. It also allows to restrict rules to additional conditions (like language or region). This latter ability is not used yet.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	a6aa6360e0	use yaml tag syntax to mark include files	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	f70930b1a0	make compound decomposition pure import feature Compound decomposition now creates a full name variant on import just like abbreviations. This simplifies query time normalization and opens a path for changing abbreviation and compound decomposition lists for an existing database.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	9ff4f66f55	complete tests for icu tokenizer	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	32ca631b74	fix full term token in special phrases	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	2e81084f35	complete tests for rule loader	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	a0a7b05c9f	correctly quote strings when copying in data Encapsulate the copy string in a class that ensures that copy lines are written with correct quoting.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	2f6e4edcdb	update unit tests for adapted abbreviation code	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	2e3c5d4c5b	adapt tests for ICU tokenizer	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	8413075249	move abbreviation computation into import phase This adds precomputation of abbreviated terms for names and removes abbreviation of terms in the query. Basic import works but still needs some thorough testing as well as speed improvements during import. New dependency for python library datrie.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	6ba00e6aee	icu tokenizer: move transliteration rules in separate file The tokenizer configuration has become difficult to handle due to the additional manual transliteration rules. Allow to have a separate rule file that is given to the ICU library as is.	2021-07-04 10:28:20 +02:00
AntoJvlt	3676310efe	Improved performance of the postcodes query and some code cleaning	2021-06-12 15:46:08 +02:00
AntoJvlt	1c175e3a67	Clean and update tests for postcodes	2021-06-09 09:31:32 +02:00
AntoJvlt	47fb7cd3a8	Use place_exists() into can_compute() for postcodes	2021-06-09 09:31:32 +02:00
AntoJvlt	a4733eed90	Use place instead of placex to compute postcodes	2021-06-09 09:31:32 +02:00
Sarah Hoffmann	bc981d0261	fix insertion of special terms and countries into word table Special terms need to be prefixed by a space because they are full terms. For countries avoid duplicate entries of word tokens. Adds tests for adding country terms.	2021-06-02 20:22:39 +02:00
Sarah Hoffmann	72625dc72a	call freeze after running a non-updateable import Some of the tables will have already been removed but the tables for indexing are still there and should be dropped.	2021-06-02 11:08:48 +02:00
Sarah Hoffmann	cc2f152d70	commit changes to replication log table Fixes #2350.	2021-05-26 11:47:08 +02:00
Sarah Hoffmann	a0e85cc17c	only initialise tokenizer for refresh functions where needed Fixes #2347.	2021-05-25 19:16:22 +02:00
Sarah Hoffmann	24c986c842	add tests for new full name computation with ICU	2021-05-24 10:41:42 +02:00
Sarah Hoffmann	4f4d15c28a	reorganize keyword creation for legacy tokenizer - only save partial words without internal spaces - consider comma and semicolon a separator of full words - consider parts before an opening bracket a full word (but not the part after the bracket) Fixes #244.	2021-05-24 10:41:42 +02:00
Sarah Hoffmann	fa3e48c59f	use make_keywords for place search terms also Ensures that place indeed uses the same search names as other names.	2021-05-23 23:08:11 +02:00
Sarah Hoffmann	16bb007135	Merge pull request #2336 from lonvia/do-not-mask-error-when-loading-tokenizer Do not hide errors when importing tokenizer	2021-05-18 23:00:10 +02:00
AntoJvlt	799a4c9ab6	Documentation update and small code fixes	2021-05-18 22:35:21 +02:00
Sarah Hoffmann	b2722650d4	do not hide errors when importing tokenizer Explicitly check for the tokenizer source file to check that the name is correct. We can't use the import error for that because it hides other import errors like a missing library. Fixes #2327.	2021-05-18 16:28:21 +02:00
AntoJvlt	3206bf59df	Resolve conflicts	2021-05-17 13:52:35 +02:00
AntoJvlt	8b8dfc46eb	Added --no-replace command for special phrases importation and added corresponding tests	2021-05-17 13:25:06 +02:00
AntoJvlt	06aab389ed	Code cleaning and SPLoader deleted	2021-05-16 16:59:12 +02:00
Sarah Hoffmann	925726222f	Merge pull request #2323 from darkshredder/disable-search-reverse-only Feat: Disabled search API for --reverse-only imports	2021-05-14 10:40:22 +02:00
Sarah Hoffmann	7d621389ee	adapt tests to new TIGER CSV format	2021-05-14 00:02:50 +02:00
Sarah Hoffmann	35efe3b41c	use tokenizer during Tiger data import This also changes the required import format to CSV.	2021-05-14 00:02:50 +02:00
Darkshredder	e5ffc59cd5	feat: Added reverse-only-search validation	2021-05-14 02:36:21 +05:30
Sarah Hoffmann	5feece64c1	use WorkerPool for Tiger data import Requires adding an option that SQL errors are ignored.	2021-05-13 20:36:50 +02:00
Sarah Hoffmann	b9a09129fa	move WorkerPool into db module The pool is independent of the indexer and may also be used by other parts of the software.	2021-05-13 17:11:17 +02:00
Sarah Hoffmann	fc860787dd	do not preload postcodes This is too expensive for updates.	2021-05-13 16:14:12 +02:00
Sarah Hoffmann	63e35574d4	Merge pull request #2324 from lonvia/generic-external-postcodes Rework postcode handling and generalised external postcode support	2021-05-13 14:52:19 +02:00
Sarah Hoffmann	db2dbf15f7	fix token_info migration A bad indent meant that only one table received the new column.	2021-05-13 14:31:41 +02:00
Sarah Hoffmann	f5977dac75	ignore invalid coordinates in external postcodes	2021-05-13 14:15:42 +02:00
Sarah Hoffmann	8f2746fe24	ignore entries without country code	2021-05-13 14:15:42 +02:00
Sarah Hoffmann	1ccd4360b4	correctly handle removing all postcodes for country	2021-05-13 14:15:42 +02:00
Sarah Hoffmann	bf864b2c54	index postcodes after refreshing	2021-05-13 14:15:42 +02:00
Sarah Hoffmann	4abaf71234	add and extend tests for new postcode handling	2021-05-13 14:15:42 +02:00
Sarah Hoffmann	a4aba23a83	move filling of postcode table to python The Python code now takes care of reading postcodes from placex, enhancing them with potentially existing external postcodes and updating location_postcodes accordingly. The initial setup and updates use exactly the same function. External postcode handling has been generalized. External postcodes for any country are now accepted. The format of the external postcode file has changed. We now expect CSV, potentially gzipped. The postcodes are no longer saved in the database.	2021-05-13 14:15:42 +02:00
AntoJvlt	9d83da830f	Introduction of SPCsvLoader to load special phrases from a csv file	2021-05-10 23:26:39 +02:00
AntoJvlt	00959fac57	Refactoring loading of external special phrases and importation process by introducing SPLoader and SPWikiLoader	2021-05-10 21:49:31 +02:00
Sarah Hoffmann	872ab91421	fix name of transliterator Should be different from the normalisation rules.	2021-05-05 17:09:38 +02:00
Sarah Hoffmann	a263e54b94	enable BDD tests for different tokenizers The tokenizer to be used can be chosen with -DTOKENIZER. Adapt all tests, so that they work with legacy_icu tokenizer. Move lookup in word table to a function in the tokenizer. Special phrases are temporarily imported from the wiki until we have an implementation that can import from file. TIGER tests do not work yet.	2021-05-05 10:31:51 +02:00
Sarah Hoffmann	18c99a5c5f	add unit tests for legacy ICU tokenizer	2021-05-05 10:15:27 +02:00
Sarah Hoffmann	d55fc39275	cache transliteration results	2021-05-05 10:15:27 +02:00
Sarah Hoffmann	ba8ed7967d	add PHP part for new ICU-based tokenizer	2021-05-05 10:15:27 +02:00
Sarah Hoffmann	f44af49df9	add Python part for new ICU-based tokenizer	2021-05-05 10:15:27 +02:00
Sarah Hoffmann	36c624ec71	commit between migrations Later migrations may require tables set up by older ones.	2021-05-01 10:47:35 +02:00
Sarah Hoffmann	7fd871a74d	increase database version for tokenizer migration	2021-05-01 10:47:35 +02:00
Sarah Hoffmann	ced8f0f4a2	fix linting issues	2021-04-30 17:59:50 +02:00
Sarah Hoffmann	388ebcbae2	move index creation for word table to tokenizer This introduces a finalization routine for the tokenizer where it can post-process the import if necessary.	2021-04-30 17:41:08 +02:00
Sarah Hoffmann	20891abe1c	indexer: fetch extra place data asynchronously The indexer now fetches any extra data besides the place_id asynchronously while processing the places from the last batch. This also means that more places are now fetched at once.	2021-04-30 17:41:08 +02:00
Sarah Hoffmann	6ce6f62b8e	fetch place info asynchronously	2021-04-30 17:41:08 +02:00
Sarah Hoffmann	602728895e	indexer: fetch ids in batches	2021-04-30 17:41:08 +02:00
Sarah Hoffmann	fc995ea6b9	move database check for module to tokenizer	2021-04-30 17:41:08 +02:00
Sarah Hoffmann	3eb4d88057	boilerplate for PHP code of tokenizer This adds an installation step for PHP code for the tokenizer. The PHP code is split in two parts. The updateable code is found in lib-php. The tokenizer installs an additional script in the project directory which then includes the code from lib-php and defines all settings that are static to the database. The website code then always includes the PHP from the project directory.	2021-04-30 11:31:52 +02:00
Sarah Hoffmann	23fd1d032a	tests for legacy tokenizer	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	7cb7cf848d	move amenity creation to tokenizer The BDD tests still use the old-style amenity creation scripts because we don't have simple means to import a hand-crafted test file of special phrases right now.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	bef300305e	move default country name creation to tokenizer The new function is also used when a country is updated. All SQL functions related to country names have been removed.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	dc700c25b6	cache all postcodes	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	0ba93e5ba9	reorganise address iteration in tokenizer	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	9e92759ac7	extract address tokens in tokenizer	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	ffc2d82b0e	move postcode normalization into tokenizer	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	d8ed1bfc60	move housenumber handling to tokenizer Normalization and token computation are now done in the tokenizer. The tokenizer keeps a cache of the hundred most used house numbers to keep the number of calls to the database low.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	d711f5a81e	move name token creation into tokenizer Name tokens are now handed in via token_info and used from there. Also moves the generic search name insertion function back to placex_triggers.sql.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	fa2bc60468	introduce name analyzer The name analyzer is the actual work horse of the tokenizer. It is instantiated on a thread-base and provides all functions for analysing names and queries.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	e1c5673ac3	require tokenizer for indexer	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	a73711f3cd	add extra column for tokenizer Add a jsonb column to the placex and location_property_osmline tables which can be used by the installed tokenizer as required. No other part of the software will use or otherwise rely on this column.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	9397bf54b8	introduce external processing in indexer Indexing is now split into three parts: first a preparation step that collects the necessary information from the database and returns it to Python. In a second step the data is transformed within Python as necessary and then returned to the database through the usual UPDATE which now not only sets the indexed_status but also other fields. The third step comprises the address computation which is still done inside the update trigger in the database. The second processing step doesn't do anything useful yet.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	fbbdd31399	move word table and normalisation SQL into tokenizer Creating and populating the word table is now the responsibility of the tokenizer. The get_maxwordfreq() function has been replaced with a simple template parameter to the SQL during function installation. The number is taken from the parameter list in the database to ensure that it is not changed after installation.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	b5540dc35c	add migration for configurable tokenizer Adds a migration that initialises a legacy tokenizer for an existing database. The migration is not active yet as it will need completion when more functionality is added to the legacy tokenizer.	2021-04-30 11:29:57 +02:00
Sarah Hoffmann	296a66558f	move module installation to legacy tokenizer	2021-04-30 11:29:57 +02:00
Sarah Hoffmann	af968d4903	introduce tokenizer modules This adds the boilerplate for selecting configurable tokenizers. A tokenizer can be chosen at import time and will then install itself such that it is fixed for the given database import even when the software itself is updated. The legacy tokenizer implements Nominatim's traditional algorithms.	2021-04-30 11:29:57 +02:00
Sarah Hoffmann	185d369404	remove support for AUX housenumber tables These tables have never been actively maintained and the code is completely untested. With the upcoming changes, it is unlikely that the code remains usable. This removes the aux tables and all code that references them.	2021-04-30 10:08:29 +02:00
Sarah Hoffmann	51d20b19b6	Merge pull request #2299 from lonvia/update-actions Fix database check for reverse-only	2021-04-27 12:18:45 +02:00
Sarah Hoffmann	46e8c6b112	Merge pull request #2291 from AntoJvlt/special-phrases-statistics Special phrases statistics	2021-04-27 11:57:05 +02:00
Sarah Hoffmann	c8fb25201a	do not check for extra housenumber index for reverse-only Also adds a database check for reverse only import to the CI.	2021-04-27 10:14:26 +02:00
Sarah Hoffmann	4457bf7528	avoid Path in subprocess parameters Not supported by Python 3.5.	2021-04-26 10:55:23 +02:00
AntoJvlt	abb3d56b20	Switching to log info and only send warning for invalid phrases	2021-04-25 17:57:43 +02:00
AntoJvlt	c5ecb9bae0	Implemented statistics for the import of special phrases through the SpecialPhrasesImporterStatistics class	2021-04-25 17:57:43 +02:00
AntoJvlt	1b68152fb2	reorganization of folder/file for the special phrases importer	2021-04-25 17:57:42 +02:00
Sarah Hoffmann	b951b11336	fix pylint complaints	2021-04-24 11:59:32 +02:00
Sarah Hoffmann	89c90bedb9	pylint: disable check too-few-public-methods	2021-04-24 11:39:44 +02:00
Sarah Hoffmann	9c51c133f7	indexes with includes are not available for postgresql < 11	2021-04-23 22:50:08 +02:00
Sarah Hoffmann	91d2fb6b1c	use group() for regex matches Needed for compatibility with Python 3.5.	2021-04-23 22:50:08 +02:00
Sarah Hoffmann	280406c0d7	use pathlib version of open	2021-04-23 22:50:08 +02:00
Sarah Hoffmann	d5fc3b5e99	subprocess needs string argument Compatibility change for Python 3.5.	2021-04-23 22:50:08 +02:00
Sarah Hoffmann	f8f8c7e534	check for existence of custom .env before opening	2021-04-23 22:50:08 +02:00
Sarah Hoffmann	3a642d50a4	use more generic ImportError to check for module ModuleNotFoundError was only introduced in Python 3.6.	2021-04-23 22:50:08 +02:00
Sarah Hoffmann	9685c68e30	replace usages of fromisoformat() with strptime() fromisoformat was only introduced with Python 3.7 while we still support Python 3.5. Fixes #2292.	2021-04-23 22:50:08 +02:00
RhinoDevel	b7bae80616	Replace "nominatim-update" with "nominatim". If I am not mistaken, the correct command to index imported data via commandline is "nominatim index".	2021-04-22 15:40:22 +02:00
Sarah Hoffmann	f7e4aa51d3	indexer: reset query counter Reset the counter for queries after the asynchronous connections have been reopened.	2021-04-21 10:33:45 +02:00
Sarah Hoffmann	50b6d7298c	factor out async connection handling into separate class Also adds a test for reconnecting regularly while indexing.	2021-04-20 14:08:37 +02:00
Sarah Hoffmann	26a81654a8	indexer: make self.conn function-local Also switches to our internal connect function which gives us a cursor with a scalar() function.	2021-04-20 14:08:37 +02:00
Sarah Hoffmann	6430371d7d	make index() function private	2021-04-20 14:08:37 +02:00
Sarah Hoffmann	18705b3f18	move analyse function into indexing function	2021-04-20 14:08:37 +02:00
Sarah Hoffmann	c6bd2bb7fb	indexer: move runner into separate file	2021-04-20 14:08:37 +02:00
Sarah Hoffmann	79d55357e8	simplify sql and website creation functions	2021-04-19 10:53:30 +02:00
Sarah Hoffmann	4fa6c0ad53	simplify constructor for SQL preprocessor Use sql path from config.	2021-04-19 10:26:25 +02:00
Sarah Hoffmann	8f63f9516b	simplify interface for adding tiger data Also simplifies tests using existing fixtures.	2021-04-19 10:26:25 +02:00
Sarah Hoffmann	995ba2c7c2	add library directories to config Allows reducing the number of parameters in functions that take the config anyway.	2021-04-19 10:26:25 +02:00
AntoJvlt	b2ae715699	Only log a warning if a wrong input is detected on the wiki while importing special phrases	2021-04-17 20:19:39 +02:00
AntoJvlt	a95c748363	Fix occurrence regex	2021-04-17 19:24:13 +02:00
Sarah Hoffmann	d74ae669e3	add support index when continuing import at index phase Indexing scans the placex table sequentially during indexing on the initial import. That is okay because we know that all rows need to be processed anyway. When continuing the import, however, a large part might already be indexed, so that the process spends a lot of time going through rows that are no longer of interest. Create a supporting index for all unindexed rows to speed up the scan. This is the same index as used later for updates.	2021-04-17 11:07:04 +02:00
Sarah Hoffmann	da98a2102a	remove transition functions from Python	2021-04-16 18:41:14 +02:00
Sarah Hoffmann	886a01c796	port function to compute initial postcodes to Python	2021-04-16 16:11:20 +02:00
Sarah Hoffmann	76b1885595	use absolute imports in Python code Relative imports are no longer officially recommended.	2021-04-16 14:20:09 +02:00

649 Commits