Nominatim

mirror of https://github.com/osm-search/Nominatim.git synced 2024-11-23 21:54:10 +03:00

Author	SHA1	Message	Date
Sarah Hoffmann	4342b28882	switch special phrases to new word table format	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	5394b1fa1b	switch postcode tokens to new word table layout	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	5ab0a63fd6	switch housenumber tokens to new word table layout	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	1618aba5f2	switch country name tokens to new word table layout	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	8377528952	new word table layout for icu tokenizer The table now directly reflects the different token types. Extra information is saved in a json structure that may be dynamically extended in the future without affecting the table layout.	2021-07-28 11:31:47 +02:00
Sarah Hoffmann	14f777da18	use psycopg's SQL quoting where possible Use the SQL formatting supplied with psycopg whenever the query needs to be put together from snippets.	2021-07-12 22:05:22 +02:00
Sarah Hoffmann	6f6681ce67	add helper function for execute_values Make psycopg2's convenience function accessible through the cursor.	2021-07-12 21:08:20 +02:00
Sarah Hoffmann	cf98cff2a1	more formatting fixes Found by flake8.	2021-07-12 17:45:42 +02:00
Sarah Hoffmann	daa597b300	split up variant computation for better readability	2021-07-12 14:43:50 +02:00
Sarah Hoffmann	47adb2a3fc	reorganise process_place function Move address processing into its own function as it is rather extensive.	2021-07-12 11:57:55 +02:00
Sarah Hoffmann	1e86dc1d93	remove default parameter for namedtuple This is only available in Python 3.7.	2021-07-06 22:57:42 +02:00
Sarah Hoffmann	62d5984b1b	limit the number of variants that can be produced	2021-07-04 10:28:28 +02:00
Sarah Hoffmann	c32551b4e0	restrict partial word counting to names of reasonable length The partial word count does not split names to save a bit of time. The result is that it might encounter unreasonably long names which in truth consist of multiple words. No accurate statistics are needed so simply restrict the count to words shorter than 75 characters.	2021-07-04 10:28:28 +02:00
Sarah Hoffmann	e85f7e7aa9	fix subsequent replacements Two replacement words directly following each other did not work as expected because each expects a space at the beginning/end while there was only one space available. Also forbid composing a word after a space was added in the end by a previous replacement.	2021-07-04 10:28:28 +02:00
Sarah Hoffmann	7b0f6b7905	leave ICU variant properties empty for now Saving unused properties causes unnecessary duplicates.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	b9fbfeff67	only consider partials in multi-words for initial count This ensures that it is less likely that we exclude meaningful words like 'hauptstrasse' just because they are frequent.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	62828fc5c1	switch to a more flexible variant description format The new format combines compound splitting and abbreviation. It also allows rules to be restricted by additional conditions (like language or region). This latter ability is not used yet.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	a6aa6360e0	use yaml tag syntax to mark include files	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	f70930b1a0	make compound decomposition pure import feature Compound decomposition now creates a full name variant on import just like abbreviations. This simplifies query time normalization and opens a path for changing abbreviation and compound decomposition lists for an existing database.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	9ff4f66f55	complete tests for icu tokenizer	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	32ca631b74	fix full term token in special phrases	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	2e81084f35	complete tests for rule loader	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	a0a7b05c9f	correctly quote strings when copying in data Encapsulate the copy string in a class that ensures that copy lines are written with correct quoting.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	2f6e4edcdb	update unit tests for adapted abbreviation code	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	2e3c5d4c5b	adapt tests for ICU tokenizer	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	8413075249	move abbreviation computation into import phase This adds precomputation of abbreviated terms for names and removes abbreviation of terms in the query. Basic import works but still needs some thorough testing as well as speed improvements during import. New dependency for python library datrie.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	6ba00e6aee	icu tokenizer: move transliteration rules in separate file The tokenizer configuration has become difficult to handle due to the additional manual transliteration rules. Allow to have a separate rule file that is given to the ICU library as is.	2021-07-04 10:28:20 +02:00
Sarah Hoffmann	bc981d0261	fix insertion of special terms and countries into word table Special terms need to be prefixed by a space because they are full terms. For countries avoid duplicate entries of word tokens. Adds tests for adding country terms.	2021-06-02 20:22:39 +02:00
Sarah Hoffmann	24c986c842	add tests for new full name computation with ICU	2021-05-24 10:41:42 +02:00
Sarah Hoffmann	4f4d15c28a	reorganize keyword creation for legacy tokenizer - only save partial words without internal spaces - consider comma and semicolon a separator of full words - consider parts before an opening bracket a full word (but not the part after the bracket) Fixes #244.	2021-05-24 10:41:42 +02:00
Sarah Hoffmann	fa3e48c59f	use make_keywords for place search terms also Ensures that place indeed uses the same search names as other names.	2021-05-23 23:08:11 +02:00
Sarah Hoffmann	16bb007135	Merge pull request #2336 from lonvia/do-not-mask-error-when-loading-tokenizer Do not hide errors when importing tokenizer	2021-05-18 23:00:10 +02:00
Sarah Hoffmann	b2722650d4	do not hide errors when importing tokenizer Explicitly check for the tokenizer source file to check that the name is correct. We can't use the import error for that because it hides other import errors like a missing library. Fixes #2327.	2021-05-18 16:28:21 +02:00
AntoJvlt	3206bf59df	Resolve conflicts	2021-05-17 13:52:35 +02:00
AntoJvlt	8b8dfc46eb	Added --no-replace command for special phrases importation and added corresponding tests	2021-05-17 13:25:06 +02:00
Sarah Hoffmann	fc860787dd	do not preload postcodes This is too expensive for updates.	2021-05-13 16:14:12 +02:00
Sarah Hoffmann	a4aba23a83	move filling of postcode table to python The Python code now takes care of reading postcodes from placex, enhancing them with potentially existing external postcodes and updating location_postcodes accordingly. The initial setup and updates use exactly the same function. External postcode handling has been generalized. External postcodes for any country are now accepted. The format of the external postcode file has changed. We now expect CSV, potentially gzipped. The postcodes are no longer saved in the database.	2021-05-13 14:15:42 +02:00
Sarah Hoffmann	872ab91421	fix name of transliterator Should be different from the normalisation rules.	2021-05-05 17:09:38 +02:00
Sarah Hoffmann	a263e54b94	enable BDD tests for different tokenizers The tokenizer to be used can be chosen with -DTOKENIZER. Adapt all tests, so that they work with legacy_icu tokenizer. Move lookup in word table to a function in the tokenizer. Special phrases are temporarily imported from the wiki until we have an implementation that can import from file. TIGER tests do not work yet.	2021-05-05 10:31:51 +02:00
Sarah Hoffmann	18c99a5c5f	add unit tests for legacy ICU tokenizer	2021-05-05 10:15:27 +02:00
Sarah Hoffmann	d55fc39275	cache transliteration results	2021-05-05 10:15:27 +02:00
Sarah Hoffmann	ba8ed7967d	add PHP part for new ICU-base tokenizer	2021-05-05 10:15:27 +02:00
Sarah Hoffmann	f44af49df9	add Python part for new ICU-based tokenizer	2021-05-05 10:15:27 +02:00
Sarah Hoffmann	388ebcbae2	move index creation for word table to tokenizer This introduces a finalization routine for the tokenizer where it can post-process the import if necessary.	2021-04-30 17:41:08 +02:00
Sarah Hoffmann	fc995ea6b9	move database check for module to tokenizer	2021-04-30 17:41:08 +02:00
Sarah Hoffmann	3eb4d88057	boilerplate for PHP code of tokenizer This adds an installation step for PHP code for the tokenizer. The PHP code is split in two parts. The updateable code is found in lib-php. The tokenizer installs an additional script in the project directory which then includes the code from lib-php and defines all settings that are static to the database. The website code then always includes the PHP from the project directory.	2021-04-30 11:31:52 +02:00
Sarah Hoffmann	23fd1d032a	tests for legacy tokenizer	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	7cb7cf848d	move amenity creation to tokenizer The BDD tests still use the old-style amenity creation scripts because we don't have simple means to import a hand-crafted test file of special phrases right now.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	bef300305e	move default country name creation to tokenizer The new function is also used when a country is updated. All SQL functions related to country names have been removed.	2021-04-30 11:30:51 +02:00
Sarah Hoffmann	dc700c25b6	cache all postcodes	2021-04-30 11:30:51 +02:00

60 Commits