Nominatim

mirror of https://github.com/osm-search/Nominatim.git synced 2024-11-27 10:43:02 +03:00

Author	SHA1	Message	Date
Sarah Hoffmann	16daa57e47	unify ICUNameProcessorRules and ICURuleLoader There is no need for the additional layer of indirection that the ICUNameProcessorRules class adds. The ICURuleLoader can fill the database properties directly.	2021-10-01 12:27:24 +02:00
Sarah Hoffmann	be65c8303f	export more data for the tokenizer name preparation Adds class, type, country and rank to the exported information and removes the rather odd hack for countries. Whether a place represents a country boundary can now be computed by the tokenizer.	2021-09-29 11:54:14 +02:00
Sarah Hoffmann	231250f2eb	add wrapper class for place data passed to tokenizer This is mostly for convenience and documentation purposes.	2021-09-29 11:54:07 +02:00
Sarah Hoffmann	90b40fc3e6	define formal public Python interface for tokenizer This introduces an abstract class for the Tokenizer/Analyzer for documentation purposes.	2021-08-16 11:41:54 +02:00
Sarah Hoffmann	6f6681ce67	add helper function for execute_values Make psycopg2's convenience function accessible through the cursor.	2021-07-12 21:08:20 +02:00
Sarah Hoffmann	cf98cff2a1	more formatting fixes Found by flake8.	2021-07-12 17:45:42 +02:00
Sarah Hoffmann	47adb2a3fc	reorganise process_place function Move address processing into its own function as it is rather extensive.	2021-07-12 11:57:55 +02:00
Sarah Hoffmann	2e3c5d4c5b	adapt tests for ICU tokenizer	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	8413075249	move abbreviation computation into import phase This adds precomputation of abbreviated terms for names and removes abbreviation of terms in the query. Basic import works but still needs some thorough testing as well as speed improvements during import. New dependency for python library datrie.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	bc981d0261	fix insertion of special terms and countries into word table Special terms need to be prefixed by a space because they are full terms. For countries avoid duplicate entries of word tokens. Adds tests for adding country terms.	2021-06-02 20:22:39 +02:00
Sarah Hoffmann	fa3e48c59f	use make_keywords for place search terms also Ensures that place indeed uses the same search names as other names.	2021-05-23 23:08:11 +02:00
AntoJvlt	3206bf59df	Resolve conflicts	2021-05-17 13:52:35 +02:00
AntoJvlt	8b8dfc46eb	Added --no-replace command for special phrases importation and added corresponding tests	2021-05-17 13:25:06 +02:00
Sarah Hoffmann	fc860787dd	do not preload postcodes This is too expensive for updates.	2021-05-13 16:14:12 +02:00
Sarah Hoffmann	a4aba23a83	move filling of postcode table to python The Python code now takes care of reading postcodes from placex, enhancing them with potentially existing external postcodes and updating location_postcodes accordingly. The initial setup and updates use exactly the same function. External postcode handling has been generalized. External postcodes for any country are now accepted. The format of the external postcode file has changed. We now expect CSV, potentially gzipped. The postcodes are no longer saved in the database.	2021-05-13 14:15:42 +02:00
Sarah Hoffmann	a263e54b94	enable BDD tests for different tokenizers The tokenizer to be used can be choosen with -DTOKENIZER. Adapt all tests, so that they work with legacy_icu tokenizer. Move lookup in word table to a function in the tokenizer. Special phrases are temporarily imported from the wiki until we have an implementation that can import from file. TIGER tests do not work yet.	2021-05-05 10:31:51 +02:00
Sarah Hoffmann	388ebcbae2	move index creation for word table to tokenizer This introduces a finalization routing for the tokenizer where it can post-process the import if necessary.	2021-04-30 17:41:08 +02:00
Sarah Hoffmann	fc995ea6b9	move database check for module to tokenizer	2021-04-30 17:41:08 +02:00
Sarah Hoffmann	3eb4d88057	boilerplate for PHP code of tokenizer This adds an installation step for PHP code for the tokenizer. The PHP code is split in two parts. The updateable code is found in lib-php. The tokenizer installs an additional script in the project directory which then includes the code from lib-php and defines all settings that are static to the database. The website code then always includes the PHP from the project directory.	2021-04-30 11:31:52 +02:00
Sarah Hoffmann	23fd1d032a	tests for legacy tokenizer	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	7cb7cf848d	move amenity creation to tokenizer The BDD tests still use the old-style amenity creation scripts because we don't have simple means to import a hand-crafted test file of special phrases right now.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	bef300305e	move default country name creation to tokenizer The new function is also used, when a country us updated. All SQL function related to country names have been removed.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	dc700c25b6	cache all postcodes	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	0ba93e5ba9	reorganise address iteration in tokenizer	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	9e92759ac7	extract address tokens in tokenizer	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	ffc2d82b0e	move postcode normalization into tokenizer	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	d8ed1bfc60	move houseunumber handling to tokenizer Normalization and token computation are now done in the tokenizer. The tokenizer keeps a cache to the hundred most used house numbers to keep the numbers of calls to the database low.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	d711f5a81e	move name token creation into tokenizer Name tokens are now handed in via token_info and used from there. Also moves the generic search name insertion function back to placex_triggers.sql.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	fa2bc60468	introduce name analyzer The name analyzer is the actual work horse of the tokenizer. It is instantiated on a thread-base and provides all functions for analysing names and queries.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	fbbdd31399	move word table and normalisation SQL into tokenizer Creating and populating the word table is now the responsibility of the tokenizer. The get_maxwordfreq() function has been replaced with a simple template parameter to the SQL during function installation. The number is taken from the parameter list in the database to ensure that it is not changed after installation.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	b5540dc35c	add migration for configurable tokenizer Adds a migration that initialises a legacy tokenizer for an existing database. The migration is not active yet as it will need completion when more functionality is added to the legacy tokenizer.	2021-04-30 11:29:57 +02:00
Sarah Hoffmann	296a66558f	move module installation to legacy tokenizer	2021-04-30 11:29:57 +02:00
Sarah Hoffmann	af968d4903	introduce tokenizer modules This adds the boilerplate for selecting configurable tokenizers. A tokenizer can be chosen at import time and will then install itself such that it is fixed for the given database import even when the software itself is updated. The legacy tokenizer implements Nominatim's traditional algorithms.	2021-04-30 11:29:57 +02:00

33 Commits