Sarah Hoffmann
32ca631b74
fix full term token in special phrases
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e81084f35
complete tests for rule loader
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
a0a7b05c9f
correctly quote strings when copying in data
...
Encapsulate the copy string in a class that ensures that
copy lines are written with correct quoting.
2021-07-04 10:28:20 +02:00
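Note on the quoting commit above: a minimal sketch of what such an encapsulating class could look like, assuming the data is written in PostgreSQL's text COPY format (tab-separated columns, backslash escapes, `\N` for NULL). The class name `CopyBuffer` and its methods are hypothetical and are not taken from the commit itself.

```python
import io


class CopyBuffer:
    """Hypothetical helper that collects rows for a text-format COPY.

    Every value goes through quote(), so callers cannot forget to
    escape tabs, newlines or backslashes.
    """

    def __init__(self):
        self.buffer = io.StringIO()

    @staticmethod
    def quote(value):
        # NULL is written as \N in the COPY text format.
        if value is None:
            return '\\N'
        # Escape the backslash first so later escapes are not doubled.
        return (str(value)
                .replace('\\', '\\\\')
                .replace('\t', '\\t')
                .replace('\n', '\\n')
                .replace('\r', '\\r'))

    def add(self, *columns):
        # One row per call: columns joined by tabs, terminated by newline.
        self.buffer.write('\t'.join(self.quote(c) for c in columns))
        self.buffer.write('\n')

    def getvalue(self):
        return self.buffer.getvalue()


buf = CopyBuffer()
buf.add('main\tstreet', None, 3)
print(repr(buf.getvalue()))
```

The resulting string could then be wrapped in a file-like object and handed to a database driver's COPY call; that step is left out here because it depends on the driver in use.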
Sarah Hoffmann
2f6e4edcdb
update unit tests for adapted abbreviation code
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
2e3c5d4c5b
adapt tests for ICU tokenizer
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
8413075249
move abbreviation computation into import phase
...
This adds precomputation of abbreviated terms for names and removes
abbreviation of terms in the query. Basic import works but still
needs some thorough testing as well as speed improvements during
import.
Adds a new dependency on the Python library datrie.
2021-07-04 10:28:20 +02:00
Sarah Hoffmann
6ba00e6aee
icu tokenizer: move transliteration rules in separate file
...
The tokenizer configuration has become difficult to handle
due to the additional manual transliteration rules. Allow
a separate rule file that is passed to the ICU library
as is.
2021-07-04 10:28:20 +02:00
AntoJvlt
3676310efe
Improved performance of the postcodes query and some code cleaning
2021-06-12 15:46:08 +02:00
AntoJvlt
1c175e3a67
Clean and update tests for postcodes
2021-06-09 09:31:32 +02:00
AntoJvlt
47fb7cd3a8
Use place_exists() in can_compute() for postcodes
2021-06-09 09:31:32 +02:00
AntoJvlt
a4733eed90
Use place instead of placex to compute postcodes
2021-06-09 09:31:32 +02:00
Sarah Hoffmann
bc981d0261
fix insertion of special terms and countries into word table
...
Special terms need to be prefixed by a space because they are
full terms.
For countries avoid duplicate entries of word tokens.
Adds tests for adding country terms.
2021-06-02 20:22:39 +02:00
Sarah Hoffmann
72625dc72a
call freeze after running a non-updateable import
...
Some of the tables will have already been removed but
the tables for indexing are still there and should be
dropped.
2021-06-02 11:08:48 +02:00
Sarah Hoffmann
cc2f152d70
commit changes to replication log table
...
Fixes #2350.
2021-05-26 11:47:08 +02:00
Sarah Hoffmann
a0e85cc17c
only initialise tokenizer for refresh functions where needed
...
Fixes #2347.
2021-05-25 19:16:22 +02:00
Sarah Hoffmann
24c986c842
add tests for new full name computation with ICU
2021-05-24 10:41:42 +02:00
Sarah Hoffmann
4f4d15c28a
reorganize keyword creation for legacy tokenizer
...
- only save partial words without internal spaces
- consider comma and semicolon a separator of full words
- consider parts before an opening bracket a full word
  (but not the part after the bracket)
Fixes #244.
2021-05-24 10:41:42 +02:00
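Note on the keyword rules listed above: the following is a minimal sketch of one possible reading of the three rules, not the legacy tokenizer's actual code. The function name `make_keywords_sketch` is hypothetical, normalisation and transliteration are left out, and treating each comma- or semicolon-separated piece as a full word in its own right is an assumption.

```python
import re


def make_keywords_sketch(name):
    """Return (full_words, partial_words) for a raw name string."""
    full_words = set()
    # Comma and semicolon separate full words.
    for part in re.split(r'[,;]', name):
        part = part.strip()
        if not part:
            continue
        full_words.add(part)
        # The text before an opening bracket is a full word as well;
        # the text after the bracket is not added on its own.
        before, bracket, _ = part.partition('(')
        if bracket and before.strip():
            full_words.add(before.strip())

    # Partial words never contain internal spaces: take the single
    # tokens of every full word, ignoring the brackets themselves.
    partial_words = set()
    for word in full_words:
        partial_words.update(re.findall(r'[^\s()]+', word))

    return full_words, partial_words


full, partial = make_keywords_sketch('Halle (Saale); Halle an der Saale')
print(sorted(full))
print(sorted(partial))
```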
Sarah Hoffmann
fa3e48c59f
use make_keywords for place search terms also
...
Ensures that place indeed uses the same search names as other
names.
2021-05-23 23:08:11 +02:00
Sarah Hoffmann
16bb007135
Merge pull request #2336 from lonvia/do-not-mask-error-when-loading-tokenizer
...
Do not hide errors when importing tokenizer
2021-05-18 23:00:10 +02:00
AntoJvlt
799a4c9ab6
Documentation update and small code fixes
2021-05-18 22:35:21 +02:00
Sarah Hoffmann
b2722650d4
do not hide errors when importing tokenizer
...
Explicitly check that the tokenizer source file exists to verify
that the name is correct. We can't use the import error for that
because it hides other import errors like a missing
library.
Fixes #2327.
2021-05-18 16:28:21 +02:00
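Note on the tokenizer import commit above: a minimal sketch of the idea, assuming tokenizers live in a package as files named `<name>_tokenizer.py`. The function name, the file naming scheme and the default package name are illustrative assumptions, not the project's confirmed layout.

```python
import importlib
from pathlib import Path


def import_tokenizer_sketch(name, package_dir, package='tokenizer'):
    """Import the tokenizer module `name` from `package`.

    The existence of the source file is checked first. A misspelt
    tokenizer name then yields a clear error, while any ImportError
    raised inside the module (for example a missing library) is
    passed on to the caller instead of being mistaken for a bad name.
    """
    source = Path(package_dir) / f'{name}_tokenizer.py'
    if not source.is_file():
        raise ValueError(f"Tokenizer '{name}' not found at {source}")

    # No try/except here: import errors from inside the module
    # must stay visible.
    return importlib.import_module(f'{package}.{name}_tokenizer')
```

The alternative, wrapping the import in `try/except ImportError` and reporting an unknown tokenizer, is what makes a missing third-party library look like a wrong tokenizer name.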
AntoJvlt
3206bf59df
Resolve conflicts
2021-05-17 13:52:35 +02:00
AntoJvlt
8b8dfc46eb
Added --no-replace command for special phrases importation and added corresponding tests
2021-05-17 13:25:06 +02:00
AntoJvlt
06aab389ed
Code cleaning and SPLoader deleted
2021-05-16 16:59:12 +02:00
Sarah Hoffmann
925726222f
Merge pull request #2323 from darkshredder/disable-search-reverse-only
...
Feat: Disabled search API for --reverse-only imports
2021-05-14 10:40:22 +02:00
Sarah Hoffmann
7d621389ee
adapt tests to new TIGER CSV format
2021-05-14 00:02:50 +02:00
Sarah Hoffmann
35efe3b41c
use tokenizer during Tiger data import
...
This also changes the required import format to CSV.
2021-05-14 00:02:50 +02:00
Darkshredder
e5ffc59cd5
feat: Added reverse-only-search validation
2021-05-14 02:36:21 +05:30
Sarah Hoffmann
5feece64c1
use WorkerPool for Tiger data import
...
Requires adding an option that SQL errors are ignored.
2021-05-13 20:36:50 +02:00
Sarah Hoffmann
b9a09129fa
move WorkerPool into db module
...
The pool is independent of the indexer and may also be used
by other parts of the software.
2021-05-13 17:11:17 +02:00
Sarah Hoffmann
fc860787dd
do not preload postcodes
...
This is too expensive for updates.
2021-05-13 16:14:12 +02:00
Sarah Hoffmann
63e35574d4
Merge pull request #2324 from lonvia/generic-external-postcodes
...
Rework postcode handling and generalised external postcode support
2021-05-13 14:52:19 +02:00
Sarah Hoffmann
db2dbf15f7
fix token_info migration
...
A bad indent meant that only one table received the new column.
2021-05-13 14:31:41 +02:00
Sarah Hoffmann
f5977dac75
ignore invalid coordinates in external postcodes
2021-05-13 14:15:42 +02:00
Sarah Hoffmann
8f2746fe24
ignore entries without country code
2021-05-13 14:15:42 +02:00
Sarah Hoffmann
1ccd4360b4
correctly handle removing all postcodes for country
2021-05-13 14:15:42 +02:00
Sarah Hoffmann
bf864b2c54
index postcodes after refreshing
2021-05-13 14:15:42 +02:00
Sarah Hoffmann
4abaf71234
add and extend tests for new postcode handling
2021-05-13 14:15:42 +02:00
Sarah Hoffmann
a4aba23a83
move filling of postcode table to python
...
The Python code now takes care of reading postcodes from placex,
enhancing them with potentially existing external postcodes and
updating location_postcodes accordingly. The initial setup and
updates use exactly the same function.
External postcode handling has been generalized. External postcodes
for any country are now accepted. The format of the external postcode
file has changed. We now expect CSV, potentially gzipped. The
postcodes are no longer saved in the database.
2021-05-13 14:15:42 +02:00
AntoJvlt
9d83da830f
Introduction of SPCsvLoader to load special phrases from a csv file
2021-05-10 23:26:39 +02:00
AntoJvlt
00959fac57
Refactoring loading of external special phrases and importation process by introducing SPLoader and SPWikiLoader
2021-05-10 21:49:31 +02:00
Sarah Hoffmann
872ab91421
fix name of transliterator
...
Should be different from the normalisation rules.
2021-05-05 17:09:38 +02:00
Sarah Hoffmann
a263e54b94
enable BDD tests for different tokenizers
...
The tokenizer to be used can be chosen with -DTOKENIZER.
Adapt all tests, so that they work with legacy_icu tokenizer.
Move lookup in word table to a function in the tokenizer.
Special phrases are temporarily imported from the wiki until
we have an implementation that can import from file. TIGER
tests do not work yet.
2021-05-05 10:31:51 +02:00
Sarah Hoffmann
18c99a5c5f
add unit tests for legacy ICU tokenizer
2021-05-05 10:15:27 +02:00
Sarah Hoffmann
d55fc39275
cache transliteration results
2021-05-05 10:15:27 +02:00
Sarah Hoffmann
ba8ed7967d
add PHP part for new ICU-based tokenizer
2021-05-05 10:15:27 +02:00
Sarah Hoffmann
f44af49df9
add Python part for new ICU-based tokenizer
2021-05-05 10:15:27 +02:00
Sarah Hoffmann
36c624ec71
commit between migrations
...
Later migrations may require tables set up by older ones.
2021-05-01 10:47:35 +02:00
Sarah Hoffmann
7fd871a74d
increase database version for tokenizer migration
2021-05-01 10:47:35 +02:00
Sarah Hoffmann
ced8f0f4a2
fix linting issues
2021-04-30 17:59:50 +02:00