Nominatim

mirror of https://github.com/osm-search/Nominatim.git synced 2024-12-18 18:41:51 +03:00

Author	SHA1	Message	Date
Sarah Hoffmann	bd7c7ddad0	icu tokenizer: switch to matching against partial names When matching address parts from addr:* tags against place names, the address names were so far converted to full names and then compared to the place names. This can become problematic with the new ICU tokenizer once we introduce creation of different variants depending on the place name context. It wouldn't be clear which variant to produce to get a match, so we would have to create all of them. To work around this issue, switch to using the partial terms for matching. This introduces a larger fuzziness between matches but that shouldn't be a problem because matching is always geographically restricted. The search terms created for address parts have a different problem: they are already created before we even know if they are going to be used. This can lead to spurious entries in the word table, which slows down searching. This problem can also be circumvented by using only partial terms for the search terms. In terms of searching that means that the address terms would not get the full-word boost, but given that the case where an address part does not exist as an OSM object should be the exception, this is likely acceptable.	2021-09-27 11:36:19 +02:00
Sarah Hoffmann	c6fdcf9b0d	adapt documentation for SQL tokenizer interface	2021-09-27 11:36:19 +02:00
Sarah Hoffmann	59fe74ddf6	move name matching into tokenizer module Instead of requesting the match tokens from the tokenizer when looking for parent streets/places and address parts, hand in the saved tokens and ask if they match. This gives the tokenizer more freedom to decide how name matching should be done.	2021-09-27 11:36:19 +02:00
Sarah Hoffmann	d562f11298	slightly increase radius to look for postcodes	2021-09-24 23:56:42 +02:00
Sarah Hoffmann	972628c751	Merge pull request #2449 from lonvia/address-ranking-spain Adjust address ranks for Spain	2021-09-24 22:48:21 +02:00
Sarah Hoffmann	09b1db63f4	adjust address ranks for Spain Adjusts levels for boundaries according to the list on https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative * no admin_level 5, so drop that from addresses * admin_level 6 has the province * admin_level 7 has the county when it exists Also reranks place=province so that it matches up with admin_level 6 and introduces place=civil_parish which is used as a place node for some admin_level=9 boundaries in Galicia.	2021-09-24 18:39:44 +02:00
Sarah Hoffmann	e9d54f752c	Merge pull request #2447 from lonvia/fix-dynamic-address-assignment Fix dynamic assignment of address parts	2021-09-19 15:57:28 +02:00
Sarah Hoffmann	c335025167	CI: install locale for CentOS	2021-09-19 13:49:11 +02:00
Sarah Hoffmann	2b2109c89a	Remove the installation warning Installation has become a lot easier.	2021-09-19 13:01:32 +02:00
Sarah Hoffmann	56124546a6	fix dynamic assignment of address parts A boolean check for dynamic changes of address parts is not sufficient. The order of choice should be: 1. an addr:* part matches the name 2. the address part surrounds the object 3. the address part was declared as isaddress The implementation uses a slightly different ordering to avoid geometry checks unless strictly necessary (isaddress is false and no matching address). See #2446.	2021-09-19 12:34:39 +02:00
Sarah Hoffmann	336258ecf8	Merge pull request #2440 from lonvia/generic-config-loader Add generic loader for YAML configuration files	2021-09-04 17:41:15 +02:00
Sarah Hoffmann	b894d2c04a	fix indent	2021-09-04 10:30:35 +02:00
Sarah Hoffmann	8e1d4818ac	use yaml config loader for country info	2021-09-04 00:22:55 +02:00
Sarah Hoffmann	28c98584c1	add tests for generic YAML config reader	2021-09-03 22:31:30 +02:00
Sarah Hoffmann	1c42780bb5	introduce generic YAML config loader Adds a function to the Configuration class to load a YAML file. This means that searching for the file is generalised and works the same now for all configuration files. Changes the search logic, so that it is always possible to have a custom version of the configuration file in the project directory. Move ICU tokenizer to use new load function.	2021-09-03 18:20:07 +02:00
Sarah Hoffmann	18554dfed7	Merge pull request #2437 from lonvia/tweak-ranking-searches Some more tweaks for search interpretation	2021-09-03 14:16:23 +02:00
Sarah Hoffmann	2e493fec46	Merge pull request #2436 from lonvia/country-configuration Move configuration of default languages into a configuration file	2021-09-03 08:55:36 +02:00
Sarah Hoffmann	98c2e08add	reduce penalty for special searches by name Additional penalty for special terms with operator None should only go to near searches. To reduce the number of produced searches, restrict the none operator to appear only in conjunction with the name.	2021-09-03 08:50:38 +02:00
Sarah Hoffmann	94d3dee369	further increase penalty on housenumbers without numbers Make the penalty dependent on the length of the token: no penalty for one-letter house numbers and an increasing one for more letters.	2021-09-02 18:11:49 +02:00
Sarah Hoffmann	7e7dd769fd	remove language and partition from name import	2021-09-02 14:41:11 +02:00
Sarah Hoffmann	79da96b369	read partition and languages from config file	2021-09-02 14:41:11 +02:00
Sarah Hoffmann	78fcabade8	move country name generation to country_info module	2021-09-02 14:41:11 +02:00
Sarah Hoffmann	284645f505	move generation of country tables in own module	2021-09-02 14:41:11 +02:00
Sarah Hoffmann	0b349761a8	add country configuration The new configuration saves the default language(s) originally maintained in the OSM wiki as well as the partition information.	2021-09-02 14:41:11 +02:00
Sarah Hoffmann	d18794931a	Merge pull request #2435 from lonvia/simplified-to-traditional-chinese icu: normalise simplified to traditional chinese	2021-08-31 15:29:26 +02:00
Sarah Hoffmann	b7d4ff3201	icu: normalise simplified to traditional chinese The conversion is unambiguous in most cases, so that the information loss is minimal.	2021-08-31 11:18:34 +02:00
Sarah Hoffmann	4c6d674e03	Merge pull request #2434 from lonvia/vagrant-scripts-in-actions Test installation instructions via CI	2021-08-29 10:11:59 +02:00
Sarah Hoffmann	2c97af8021	CI: use packaged source also for test runs	2021-08-24 10:10:01 +02:00
Sarah Hoffmann	832f75a55e	CI: unify jobs for different vagrant scripts	2021-08-24 10:10:01 +02:00
Sarah Hoffmann	4e77969545	add workflow for centos 8	2021-08-24 10:10:01 +02:00
Sarah Hoffmann	6ebbbfee61	CI: use vagrant scripts for import tests Use vanilla docker images of Ubuntu and leave the setup to the vagrant scripts. Then do the usual import tests. Also fixes a couple of issues found with the scripts.	2021-08-24 10:10:01 +02:00
Sarah Hoffmann	0fabeefc3e	Merge pull request #2432 from Mastercuber/patch-1 Added postcode	2021-08-22 09:32:31 +02:00
Mastercuber	c70d72f06b	Added postcode Added postcode to the list of addressdetails	2021-08-22 02:52:41 +02:00
Sarah Hoffmann	cc141bf1a5	Add link to fixthemap to issue template	2021-08-21 20:36:16 +02:00
Sarah Hoffmann	199532c802	Merge pull request #2429 from lonvia/place-name-to-admin-boundary Indexing: move linking of places to the preparation stage	2021-08-21 10:21:39 +02:00
Sarah Hoffmann	28ee3d0949	move linking of places to the preparation stage Linked places may bring in extra names. These names need to be processed by the tokenizer. That means that the linking needs to be done before the data is handed to the tokenizer. Move finding the linked place into the preparation stage and update the name fields. Everything else is still done in the indexing stage.	2021-08-20 22:44:17 +02:00
Sarah Hoffmann	925195725d	Merge pull request #2428 from lonvia/rename-icu-tokenizer Rename legacy_icu tokenizer to icu tokenizer	2021-08-18 15:02:19 +02:00
Sarah Hoffmann	f6d22df76e	adapt CI workflow to new tokenizer name	2021-08-18 09:08:20 +02:00
Sarah Hoffmann	118858a55e	rename legacy_icu tokenizer to icu tokenizer The new icu tokenizer is now no longer compatible with the old legacy tokenizer in terms of data structures. Therefore there is also no longer a need to refer to the legacy tokenizer in the name.	2021-08-17 23:11:47 +02:00
Sarah Hoffmann	656c1291b1	Merge pull request #2427 from lonvia/remove-us-states-special-casing Move US state hack into legacy tokenizer	2021-08-17 21:55:32 +02:00
Sarah Hoffmann	f00b8dd1c3	move special hack for US states to legacy tokenizer The hack for IL, AL and LA is only needed because these abbreviations are removed by the legacy tokenizer as a stop word. There is no need to keep the hack for future tokenizers. Move it therefore to the token extraction function.	2021-08-17 14:28:55 +02:00
Sarah Hoffmann	5f2b9e317a	add tests for US state hacks IL, AL and LA are replaced with the US state in Geocode because the old tokenizer would simply remove the abbreviations otherwise.	2021-08-17 10:49:07 +02:00
Sarah Hoffmann	4ae5ba7fc4	Merge pull request #2425 from lonvia/tokenizer-documentation Introduce official Tokenizer API	2021-08-17 09:38:03 +02:00
Sarah Hoffmann	3656eed9ad	add mkdocstrings requirement for building docs mkdocstrings also needs access to the Python sources, so set a PYTHONPATH accordingly. This makes running mkdocs directly a bit awkward, therefore add a `make serve-doc` target.	2021-08-16 11:51:49 +02:00
Sarah Hoffmann	2e82a6ce03	docs: extend explanation of query phrase	2021-08-16 11:51:49 +02:00
Sarah Hoffmann	c4b8a3b768	add documentation for PHP part of tokenizer	2021-08-16 11:51:49 +02:00
Sarah Hoffmann	1147b83b22	php: make word list a first-class object This separates the logic of creating word sets from the Phrase class. A tokenizer may now derive the word sets any way it likes. The SimpleWordList class provides a standard implementation for splitting phrases on spaces.	2021-08-16 11:51:49 +02:00
Sarah Hoffmann	0fb8eade13	remove country restriction from tokenizer Restricting tokens due to the search context is better done in the generic search part instead of repeating the same test in every tokenizer implementation.	2021-08-16 11:41:54 +02:00
Sarah Hoffmann	78d11fe628	document tokenizer SQL interface	2021-08-16 11:41:54 +02:00
Sarah Hoffmann	90b40fc3e6	define formal public Python interface for tokenizer This introduces an abstract class for the Tokenizer/Analyzer for documentation purposes.	2021-08-16 11:41:54 +02:00
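
The order of choice described in commit 56124546a6 (an addr:* part matches the name, then the address part surrounds the object, then the address part was declared as isaddress) can be illustrated with a small sketch. This is not Nominatim's implementation: the `AddressCandidate` class, its field names, and both functions are hypothetical names introduced here for illustration, and the sketch deliberately ignores the reordering the commit mentions for avoiding geometry checks.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class AddressCandidate:
    """Hypothetical stand-in for one candidate address part (not a Nominatim type)."""
    name: str
    matches_addr_tag: bool   # an addr:* tag matches the candidate's name
    surrounds_object: bool   # the candidate's geometry surrounds the object
    isaddress: bool          # the candidate was declared as isaddress


def priority(candidate: AddressCandidate) -> Optional[int]:
    """Return 0, 1 or 2 following the order in the commit message, or None."""
    if candidate.matches_addr_tag:
        return 0
    if candidate.surrounds_object:
        return 1
    if candidate.isaddress:
        return 2
    return None  # the candidate does not qualify at all


def choose_address_part(
    candidates: Iterable[AddressCandidate],
) -> Optional[AddressCandidate]:
    """Pick the candidate with the best (lowest) priority; first one wins ties."""
    best = None
    best_rank = None
    for candidate in candidates:
        rank = priority(candidate)
        if rank is not None and (best_rank is None or rank < best_rank):
            best, best_rank = candidate, rank
    return best
```

A plain boolean "is this an address part" check cannot express this, because a candidate that only has isaddress set must lose against one whose name matches an addr:* tag.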

3341 Commits