Sarah Hoffmann
9a5f75dba7
Merge pull request #2993 from biswajit-k/delete-tags
...
Adds sanitizer for preventing certain tags to enter search index based on parameters
2023-03-09 14:31:45 +01:00
biswajit-k
ca149fb796
Adds sanitizer for preventing certain tags to enter search index based on parameters
...
fix: pylint error
added docs for delete tags sanitizer
fixed typos in docs and code comments
fix: python typechecking error
fixed rank address type
Revert "fixed typos in docs and code comments"
This reverts commit 6839eea755a87f557895f30524fb5c03dd983d60.
added default parameters and refactored code
added test for all parameters
2023-03-09 14:18:39 +05:30
biswajit-k
36388cafe9
fixed typos in docs and code comments
2023-03-06 17:09:38 +05:30
Sarah Hoffmann
2abe9e6fd9
use data paths from new nominatim.paths
2022-11-27 12:15:41 +01:00
Sarah Hoffmann
fd3dec8efe
add sanitizer for TIGER tags
...
Currently only takes over cleaning the tiger:county data. This was
done by the import until now.
2022-11-23 10:37:27 +01:00
Sarah Hoffmann
8d082c13e0
adapt to new type annotations from typeshed
...
Some more functions frrom psycopg are now properly annotated.
No ignoring necessary anymore.
2022-08-09 11:06:54 +02:00
Sarah Hoffmann
9864b191b1
fix various typos
2022-07-31 17:10:35 +02:00
Sarah Hoffmann
51b6d16dc6
overhaul the token analysis interface
...
The functional split betweenthe two functions is now that the
first one creates the ID that is used in the word table and
the second one creates the variants. There no longer is a
requirement that the ID is the normalized version. We might
later reintroduce the requirement that a normalized version be available
but it doesn't necessarily need to be through the ID.
The function that creates the ID now gets the full PlaceName. That way
it might take into account attributes that were set by the sanitizers.
Finally rename both functions to something more sane.
2022-07-29 15:14:11 +02:00
Sarah Hoffmann
34d27ed45c
move PlaceName into the generic data module
2022-07-29 11:42:20 +02:00
Sarah Hoffmann
094100bbf6
harmonize spelling
...
Stick with the American spelling of Analyze.
2022-07-29 10:52:01 +02:00
Sarah Hoffmann
c8873d34af
harmonize interface of token analysis module
...
The configure() function now receives a Transliterator object instead
of the ICU rules. This harmonizes the parameters with the create
function.
2022-07-29 10:43:07 +02:00
Sarah Hoffmann
f0d640961a
add documentation for custom token analysis
2022-07-29 09:41:28 +02:00
Sarah Hoffmann
3746befd88
add documentation for sanitizer interface
...
Also switches mkdocstrings to 0.18 with the rather unfortunate
consequence that now mkdocstrings-python-legacy is needed as well.
2022-07-28 22:00:29 +02:00
Sarah Hoffmann
d819036daa
add support for external token analysis modules
2022-07-25 16:27:22 +02:00
Sarah Hoffmann
6d41046b15
add support for external sanitizer modules
2022-07-25 16:10:19 +02:00
Kian-Meng Ang
f5e52e748f
docs: fix typos
2022-07-20 22:05:31 +08:00
Sarah Hoffmann
83054af46f
remove typing_extensions requirement
...
The typing_extensions package is only necessary now when running mypy.
It won't be used at runtime anymore.
2022-07-18 09:55:58 +02:00
Sarah Hoffmann
9963261d8d
add type annotations to special phrase importer
2022-07-18 09:54:29 +02:00
Sarah Hoffmann
6c6bbe5747
add type annotations for ICU tokenizer
2022-07-18 09:47:57 +02:00
Sarah Hoffmann
18b16e06ca
add type annotations for legacy tokenizer
2022-07-18 09:47:57 +02:00
Sarah Hoffmann
e37cfc64d2
add type annotations to ICU tokenizer helper modules
2022-07-18 09:47:57 +02:00
Sarah Hoffmann
d35e3c25b6
add type annotations for token analysis
...
No annotations for ICU types yet.
2022-07-18 09:47:57 +02:00
Sarah Hoffmann
62eedbb8f6
add type hints for sanitizers
2022-07-18 09:47:57 +02:00
Sarah Hoffmann
d0c44431d0
add typing information for place_info and country_info
2022-07-18 09:47:57 +02:00
Sarah Hoffmann
cbbcbb1fd7
move country_info into data submodule
2022-07-06 11:08:36 +02:00
Sarah Hoffmann
bce93d60bd
move PlaceInfo into data submodule
...
This data structure is shared between indexer and tokenizer.
2022-07-06 10:54:47 +02:00
Sarah Hoffmann
612d34930b
handle postcodes properly on word table updates
...
update_postcodes_from_db() needs to do the full postcode treatment
in order to derive the correct word table entries.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
5be320368c
add documentation for postcode customization
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
7f2ad4ac7e
fix linting issue
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
0f00f4968c
fix up BDD tests for postcode changes
...
Includes smaller code fixes found by the tests.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
37b2c6a830
port legacy tokenizer to new postcode handling
...
Also documents the changes to the SQL functions of the tokenizer.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
e86db3001f
fix postcode pattern for Mozambique
...
Optional groups are not implemented yet.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
80ea13437d
move postcode matcher in a separate file
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
b7704833e4
icu: switch postcodes to using the pre-formatted one
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
ca7b46511d
introduce and use analyzer for postcodes
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
18864afa8a
postcodes: introduce a default pattern for countries without postcodes
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
5ba75df507
postcode: generate a generic form
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
9172696324
postcodes: add support for optional spaces
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
baee6f3de0
postcodes: strip leading country codes
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
90d4d339db
initial postcode cleaner for simple patterns
...
Moves postcodes that are either in countries without a postcode
system or don't correspond to the local pattern for postcodes into
a field for a normal address part. Makes them searchable but not as
a special address. This has two consequences: they are no longer a
skippable part of the address and the postcodes cannot be searched
on their own.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
8080625747
remove postcodes from countries that don't have them
...
The postcodes will only be removed as a 'computed postcode' they
are still searchable for the given object.
2022-06-23 23:42:31 +02:00
Sarah Hoffmann
1821f68ca0
exclude addr:inclusion from search
2022-05-31 14:19:19 +02:00
Sarah Hoffmann
d14a585cc9
pylint: disable no-self-use check
...
This checker encourages bad behaviour (namely changing the static
status of a function during inheritence) and will be made optional
in upcoming versions of pylint.
2022-05-11 10:25:00 +02:00
Sarah Hoffmann
bb2bd76f91
pylint: avoid explicit use of format() function
...
Use psycopg2 SQL formatters for SQL and formatted string literals
everywhere else.
2022-05-11 09:48:56 +02:00
Sarah Hoffmann
7e70e5f503
always state encoding when opening files in text mode
...
Also applies to Path.write_text().
2022-05-10 15:36:29 +02:00
Sarah Hoffmann
a0ed80d821
restore the tokenizer directory when missing
...
Automatically repopulate the tokenizer/ directory with the PHP stub
and the postgresql module, when the directory is missing. This allows
to switch working directories and in particular run the service
from a different maschine then where it was installed.
Users still need to make sure that .env files are set up correctly
or they will shoot themselves in the foot.
See #2515 .
2022-03-20 11:31:42 +01:00
Sarah Hoffmann
15beeef6ce
do not expand records in select list
...
An expression of the form 'SELECT (func()).*' will be expanded
by Postgresql _before_ execution with the result that the function
will be called as many times as there are fields in the record.
This is not what we want. The function call needs to go into
the FROM clause instead.
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
92bc3cd0a7
fix linting issue
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
4a3bbd0319
adapt housenumber cleanup to new word table structure
2022-03-01 09:34:32 +01:00
Sarah Hoffmann
13ed184efd
housenumber analyzer: avoid creating too many variants
...
Housenumber fields with lots of text are likely bad data. So is
data with many changes from letter to digit. Exclude them from adding
optional spaces.
2022-03-01 09:34:32 +01:00