Commit Graph

195 Commits

Author SHA1 Message Date
Sarah Hoffmann
8f3845660f add full tokens to addresses
This is now needed to weigh results.
2024-05-02 11:47:35 +02:00
Sarah Hoffmann
07b7fd1dbb add address counts to tokens 2024-03-18 11:25:48 +01:00
Sarah Hoffmann
bb5de9b955 extend word statistics to address index
Word frequency in names is not sufficient to interpolate word
frequency in the address because names of towns, states etc. are
much more frequently used than, say street names.
2024-03-18 11:25:48 +01:00
marc tobias
7205491b84 Correct some typos 2024-02-26 18:13:30 +01:00
Sarah Hoffmann
05fad607ff make Python frontend default and PHP optional 2024-02-19 18:39:01 +01:00
Sarah Hoffmann
bc51378aee properly grant rights to read-only user when switching out word table 2024-02-06 17:30:01 +01:00
Sarah Hoffmann
81eed0680c recreate word table when refreshing counts
The counting touches a large part of the word table, leaving
bloated tables and indexes. Thus recreate the table instead and
swap it in.
2024-02-04 21:35:10 +01:00
Sarah Hoffmann
cafd8e2b1e fix typos and grammar issues 2023-08-29 12:14:44 +02:00
Sarah Hoffmann
d3372e69ec update to modern mkdocstrings python handler 2023-08-25 21:40:20 +02:00
Sarah Hoffmann
252fe42612
Merge pull request #3122 from miku0/sanitizer-final
Adds sanitizer for Japanese addresses to correspond to block address
2023-08-01 10:38:58 +02:00
miku0
67e1c7dc72 Moved KANJI_MAP to icu-rules 2023-07-31 11:57:49 +00:00
miku0
2350018106 Fixed cosmetic issues 2023-07-31 02:39:04 +00:00
miku0
fac8c32cda Moved KANJI_MAP to global variable 2023-07-26 21:43:22 +00:00
Sarah Hoffmann
8cba65809c older version of Postgres cannot convert jsonb to int 2023-07-26 17:45:21 +02:00
miku0
848e5ac5de Correction to PR's comment 2023-07-26 09:50:25 +00:00
miku0
0722495434 add japanese sanitizer 2023-07-26 07:54:58 +00:00
Sarah Hoffmann
faeee7528f move warm script to python code 2023-07-25 21:39:23 +02:00
Sarah Hoffmann
d7a3039c2a also switch legacy tokenizer to new street/place choice behaviour 2023-06-30 17:03:17 +02:00
Sarah Hoffmann
645ea5a057 use information from tokenizer to determine street vs. place address
So far the SQL logic used the information from the address field
to determine if an address is attached to a street or place.
This changes the logic to use the information provided in the
token_info. This allows sanitizers to enforce a certain parenting
without changing the visible address information.
2023-06-30 11:08:25 +02:00
biswajit-k
8f03c80ce8 generalize filter for sanitizers 2023-04-01 19:24:09 +05:30
Sarah Hoffmann
9a5f75dba7
Merge pull request #2993 from biswajit-k/delete-tags
Adds sanitizer for preventing certain tags to enter search index based on parameters
2023-03-09 14:31:45 +01:00
biswajit-k
ca149fb796 Adds sanitizer for preventing certain tags to enter search index based on parameters
fix: pylint error

added docs for delete tags sanitizer

fixed typos in docs and code comments

fix: python typechecking error

fixed rank address type

Revert "fixed typos in docs and code comments"

This reverts commit 6839eea755a87f557895f30524fb5c03dd983d60.

added default parameters and refactored code

added test for all parameters
2023-03-09 14:18:39 +05:30
biswajit-k
36388cafe9 fixed typos in docs and code comments 2023-03-06 17:09:38 +05:30
Sarah Hoffmann
2abe9e6fd9 use data paths from new nominatim.paths 2022-11-27 12:15:41 +01:00
Sarah Hoffmann
fd3dec8efe add sanitizer for TIGER tags
Currently only takes over cleaning the tiger:county data. This was
done by the import until now.
2022-11-23 10:37:27 +01:00
Sarah Hoffmann
8d082c13e0 adapt to new type annotations from typeshed
Some more functions frrom psycopg are now properly annotated.
No ignoring necessary anymore.
2022-08-09 11:06:54 +02:00
Sarah Hoffmann
9864b191b1 fix various typos 2022-07-31 17:10:35 +02:00
Sarah Hoffmann
51b6d16dc6 overhaul the token analysis interface
The functional split betweenthe two functions is now that the
first one creates the ID that is used in the word table and
the second one creates the variants. There no longer is a
requirement that the ID is the normalized version. We might
later reintroduce the requirement that a normalized version be available
but it doesn't necessarily need to be through the ID.

The function that creates the ID now gets the full PlaceName. That way
it might take into account attributes that were set by the sanitizers.

Finally rename both functions to something more sane.
2022-07-29 15:14:11 +02:00
Sarah Hoffmann
34d27ed45c move PlaceName into the generic data module 2022-07-29 11:42:20 +02:00
Sarah Hoffmann
094100bbf6 harmonize spelling
Stick with the American spelling of Analyze.
2022-07-29 10:52:01 +02:00
Sarah Hoffmann
c8873d34af harmonize interface of token analysis module
The configure() function now receives a Transliterator object instead
of the ICU rules. This harmonizes the parameters with the create
function.
2022-07-29 10:43:07 +02:00
Sarah Hoffmann
f0d640961a add documentation for custom token analysis 2022-07-29 09:41:28 +02:00
Sarah Hoffmann
3746befd88 add documentation for sanitizer interface
Also switches mkdocstrings to 0.18 with the rather unfortunate
consequence that now mkdocstrings-python-legacy is needed as well.
2022-07-28 22:00:29 +02:00
Sarah Hoffmann
d819036daa add support for external token analysis modules 2022-07-25 16:27:22 +02:00
Sarah Hoffmann
6d41046b15 add support for external sanitizer modules 2022-07-25 16:10:19 +02:00
Kian-Meng Ang
f5e52e748f docs: fix typos 2022-07-20 22:05:31 +08:00
Sarah Hoffmann
83054af46f remove typing_extensions requirement
The typing_extensions package is only necessary now when running mypy.
It won't be used at runtime anymore.
2022-07-18 09:55:58 +02:00
Sarah Hoffmann
9963261d8d add type annotations to special phrase importer 2022-07-18 09:54:29 +02:00
Sarah Hoffmann
6c6bbe5747 add type annotations for ICU tokenizer 2022-07-18 09:47:57 +02:00
Sarah Hoffmann
18b16e06ca add type annotations for legacy tokenizer 2022-07-18 09:47:57 +02:00
Sarah Hoffmann
e37cfc64d2 add type annotations to ICU tokenizer helper modules 2022-07-18 09:47:57 +02:00
Sarah Hoffmann
d35e3c25b6 add type annotations for token analysis
No annotations for ICU types yet.
2022-07-18 09:47:57 +02:00
Sarah Hoffmann
62eedbb8f6 add type hints for sanitizers 2022-07-18 09:47:57 +02:00
Sarah Hoffmann
d0c44431d0 add typing information for place_info and country_info 2022-07-18 09:47:57 +02:00
Sarah Hoffmann
cbbcbb1fd7 move country_info into data submodule 2022-07-06 11:08:36 +02:00
Sarah Hoffmann
bce93d60bd move PlaceInfo into data submodule
This data structure is shared between indexer and tokenizer.
2022-07-06 10:54:47 +02:00
Sarah Hoffmann
612d34930b handle postcodes properly on word table updates
update_postcodes_from_db() needs to do the full postcode treatment
in order to derive the correct word table entries.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
5be320368c add documentation for postcode customization 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
7f2ad4ac7e fix linting issue 2022-06-23 23:42:31 +02:00
Sarah Hoffmann
0f00f4968c fix up BDD tests for postcode changes
Includes smaller code fixes found by the tests.
2022-06-23 23:42:31 +02:00