Commit Graph

16 Commits

Author SHA1 Message Date
Sarah Hoffmann
5ab0a63fd6 switch housenumber tokens to new word table layout 2021-07-28 11:31:47 +02:00
Sarah Hoffmann
1618aba5f2 switch country name tokens to new word table layout 2021-07-28 11:31:47 +02:00
Sarah Hoffmann
8377528952 new word table layout for icu tokenizer
The table now directly reflects the different token types.
Extra information is saved in a json structure that may be
dynamically extended in the future without affecting the
table layout.
2021-07-28 11:31:47 +02:00
Sarah Hoffmann
2c8242c8df remove special code for pre9.5 postgresql
9.5 is now the minimum requirement.
2021-07-19 10:24:57 +02:00
Sarah Hoffmann
8413075249 move abbreviation computation into import phase
This adds precomputation of abbreviated terms for names and removes
abbreviation of terms in the query. Basic import works but still
needs some thorough testing as well as speed improvements during
import.

New dependency for python library datrie.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
4f4d15c28a reorganize keyword creation for legacy tokenizer
- only save partial words without internal spaces
- consider comma and semicolon a separator of full words
- consider parts before an opening bracket a full word
  (but not the part after the bracket)

Fixes #244.
2021-05-24 10:41:42 +02:00
Sarah Hoffmann
f44af49df9 add Python part for new ICU-based tokenizer 2021-05-05 10:15:27 +02:00
Sarah Hoffmann
388ebcbae2 move index creation for word table to tokenizer
This introduces a finalization routing for the tokenizer
where it can post-process the import if necessary.
2021-04-30 17:41:08 +02:00
Sarah Hoffmann
7cb7cf848d move amenity creation to tokenizer
The BDD tests still use the old-style amenity creation scripts
because we don't have simple means to import a hand-crafted
test file of special phrases right now.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
bef300305e move default country name creation to tokenizer
The new function is also used, when a country us updated. All SQL
function related to country names have been removed.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
0ba93e5ba9 reorganise address iteration in tokenizer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
d75a235c1f use address tokens in SQL 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
ffc2d82b0e move postcode normalization into tokenizer 2021-04-30 11:30:51 +02:00
Sarah Hoffmann
d8ed1bfc60 move houseunumber handling to tokenizer
Normalization and token computation are now done in the tokenizer.
The tokenizer keeps a cache to the hundred most used house numbers
to keep the numbers of calls to the database low.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
d711f5a81e move name token creation into tokenizer
Name tokens are now handed in via token_info and used from there.

Also moves the generic search name insertion function back to
placex_triggers.sql.
2021-04-30 11:30:51 +02:00
Sarah Hoffmann
fbbdd31399 move word table and normalisation SQL into tokenizer
Creating and populating the word table is now the responsibility
of the tokenizer.

The get_maxwordfreq() function has been replaced with a
simple template parameter to the SQL during function installation.
The number is taken from the parameter list in the database to
ensure that it is not changed after installation.
2021-04-30 11:30:51 +02:00