The Python code now takes care of reading postcodes from placex,
enhancing them with potentially existing external postcodes and
updating location_postcodes accordingly. The initial setup and
updates use exactly the same function.
External postcode handling has been generalized. External postcodes
for any country are now accepted. The format of the external postcode
file has changed. We now expect CSV, potentially gzipped. The
postcodes are no longer saved in the database.
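A minimal sketch of reading such a file, assuming a header row with the columns 'postcode', 'lat' and 'lon'. The column names and both function names are assumptions made for illustration; they are not taken from the actual implementation.

```python
import csv
import gzip
from pathlib import Path


def open_postcode_file(path):
    """Open a CSV file as text, transparently handling gzip compression."""
    path = Path(path)
    if path.suffix == '.gz':
        return gzip.open(path, 'rt', encoding='utf-8', newline='')
    return open(path, 'r', encoding='utf-8', newline='')


def read_external_postcodes(path):
    """Yield (postcode, lat, lon) tuples from the file.

    The column names 'postcode', 'lat' and 'lon' are assumptions.
    """
    with open_postcode_file(path) as fd:
        for row in csv.DictReader(fd):
            yield row['postcode'], float(row['lat']), float(row['lon'])
```

Deciding by file suffix keeps the reader identical for plain and gzipped input.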
The tokenizer to be used can be chosen with -DTOKENIZER.
Adapt all tests, so that they work with legacy_icu tokenizer.
Move lookup in word table to a function in the tokenizer.
Special phrases are temporarily imported from the wiki until
we have an implementation that can import from file. TIGER
tests do not work yet.
The indexer now fetches any extra data besides the place_id
asynchronously while processing the places from the last batch.
This also means that more places are now fetched at once.
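The overlap of fetching and processing can be sketched as follows. This version uses a background thread as a stand-in for the asynchronous database query; fetch_details and process are hypothetical callables and do not reflect the indexer's real interface.

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_batches(batches, fetch_details, process):
    """Process batches of ids while the details of the next batch are
    already being fetched in a background thread.

    fetch_details(batch) and process(details) are hypothetical callables.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for batch in batches:
            # Start fetching this batch before working on the previous one.
            next_fetch = pool.submit(fetch_details, batch)
            if pending is not None:
                results.append(process(pending.result()))
            pending = next_fetch
        # The last batch has no successor to overlap with.
        if pending is not None:
            results.append(process(pending.result()))
    return results
```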
This adds an installation step for PHP code for the tokenizer. The
PHP code is split in two parts. The updateable code is found in
lib-php. The tokenizer installs an additional script in the
project directory which then includes the code from lib-php and
defines all settings that are static to the database. The website
code then always includes the PHP from the project directory.
The BDD tests still use the old-style amenity creation scripts
because we don't have simple means to import a hand-crafted
test file of special phrases right now.
Normalization and token computation are now done in the tokenizer.
The tokenizer keeps a cache of the hundred most used house numbers
to keep the number of calls to the database low.
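A rough sketch of such a cache, assuming an LRU policy with 100 entries as an approximation of "most used". The class name, the db_lookup callable and the call counter are hypothetical and only serve the illustration.

```python
from functools import lru_cache


class HousenumberTokens:
    """Look up house number tokens, caching up to 100 recent results."""

    def __init__(self, db_lookup):
        self.db_calls = 0
        self._db_lookup = db_lookup
        # Per-instance LRU cache; an approximation of "most used".
        self.get_token = lru_cache(maxsize=100)(self._lookup)

    def _lookup(self, housenumber):
        # Only reached on a cache miss, i.e. when the database is needed.
        self.db_calls += 1
        return self._db_lookup(housenumber)
```

Repeated lookups of the same house number then reach the database only once.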
The name analyzer is the actual workhorse of the tokenizer. It
is instantiated once per thread and provides all functions for
analysing names and queries.
Add a jsonb column to the placex and location_property_osmline tables
which can be used by the installed tokenizer as required. No other
part of the software will use or otherwise rely on this column.
Indexing is now split into three parts: first a preparation step
that collects the necessary information from the database and
returns it to Python. In a second step the data is transformed
within Python as necessary and then returned to the database
through the usual UPDATE which now not only sets the indexed_status
but also other fields. The third step comprises the address
computation which is still done inside the update trigger in
the database.
The second processing step doesn't do anything useful yet.
Creating and populating the word table is now the responsibility
of the tokenizer.
The get_maxwordfreq() function has been replaced with a
simple template parameter to the SQL during function installation.
The number is taken from the parameter list in the database to
ensure that it is not changed after installation.
Adds a migration that initialises a legacy tokenizer for
an existing database. The migration is not active yet as
it will need completion when more functionality is added
to the legacy tokenizer.
This adds the boilerplate for selecting configurable tokenizers.
A tokenizer can be chosen at import time and will then install
itself such that it is fixed for the given database import even
when the software itself is updated.
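The selection mechanism might look roughly like this. The sketch assumes that the choice is recorded in a key/value property store that belongs to the database (a plain dict here); the class names, the registry and the 'tokenizer' key are hypothetical.

```python
class LegacyTokenizer:
    name = 'legacy'


class LegacyIcuTokenizer:
    name = 'legacy_icu'


# Hypothetical registry of the available tokenizers.
TOKENIZERS = {cls.name: cls for cls in (LegacyTokenizer, LegacyIcuTokenizer)}


def create_tokenizer(properties, configured_name):
    """Import time: instantiate the configured tokenizer and record it."""
    if configured_name not in TOKENIZERS:
        raise ValueError('Unknown tokenizer: ' + configured_name)
    properties['tokenizer'] = configured_name
    return TOKENIZERS[configured_name]()


def get_tokenizer_for_db(properties):
    """Later runs: ignore the configuration, use what the database recorded."""
    name = properties.get('tokenizer')
    if name is None:
        raise RuntimeError('No tokenizer recorded for this database.')
    return TOKENIZERS[name]()
```

Reading the name back from the database rather than from the configuration is what keeps the tokenizer fixed across software updates.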
The legacy tokenizer implements Nominatim's traditional algorithms.
The aux tables have never been actively maintained and the code is
completely untested. With the upcoming changes, it is unlikely
that the code remains usable.
This removes the aux tables and all code that references them.
The indexer scans the placex table sequentially on the initial
import. That is okay because we know that all rows need to be
processed anyway. When continuing the import,
however, a large part might already be indexed, so that the
process spends a lot of time going through rows that are no
longer of interest. Create a supporting index for all unindexed
rows to speed up the scan. This is the same index as used later
for updates.
On PostgreSQL versions 11+ add an index to speed up the lookup
of housenumbers for terms found in search_name. This is really
just a band-aid around the query planner's interpretation of the
query.
On Windows systems the timer may not be accurate enough to measure
the time between init() and done(). Avoid computing statistics with
a diff time of 0 in such cases.
Fixes #2230.
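The guard can be sketched like this. Only init() and done() are named above; the class name, add() and the optional 'now' parameter are hypothetical additions for illustration and testing.

```python
import time


class ProgressTimer:
    """Count processed items and report a rate when finished."""

    def __init__(self):
        self.start = None
        self.count = 0

    def init(self):
        self.start = time.monotonic()
        self.count = 0

    def add(self, num=1):
        self.count += num

    def done(self, now=None):
        """Return the rate in items per second, or None when the elapsed
        time is too small to be measured."""
        end = time.monotonic() if now is None else now
        diff = end - self.start
        if diff <= 0:
            # Timer resolution too coarse: skip the statistics.
            return None
        return self.count / diff
```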
- the command is now --import-special-phrases
- the output is no longer an SQL file; the data is imported directly into the database
- the short section in the documentation (data import) has been updated accordingly
Adds an 'admin --migrate' command that checks for the current
database version and runs any necessary migrations. Also
has migrations going back to 3.6.
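A simplified sketch of version-ordered migrations. It assumes versions are stored as (major, minor, patch) tuples and uses a plain list in place of a database connection; the decorator, the function names and the example migrations are all hypothetical.

```python
MIGRATIONS = []


def migration(major, minor, patch=0):
    """Register a function that migrates the database to the given version."""
    def decorator(func):
        MIGRATIONS.append(((major, minor, patch), func))
        return func
    return decorator


@migration(3, 6, 0)
def add_example_index(db):
    db.append('add_example_index')


@migration(3, 7, 0)
def add_example_column(db):
    db.append('add_example_column')


def migrate(db, db_version):
    """Run all migrations newer than db_version; return the new version."""
    for version, func in sorted(MIGRATIONS, key=lambda m: m[0]):
        if version > db_version:
            func(db)
            db_version = version
    return db_version
```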
Also switches to Jinja-based preprocessing, which allows
simplifying the SQL files. Use 'if not exists' where possible
so that the step can be rerun to fix missing indexes.
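The effect of 'if not exists' can be shown with a small example. It uses SQLite from the Python standard library as a stand-in for PostgreSQL, and the table and index names are made up.

```python
import sqlite3

CREATE_INDEX = ("CREATE INDEX IF NOT EXISTS idx_example_status "
                "ON example (indexed_status)")


def create_indexes(conn):
    """Create the index if it is missing; safe to call repeatedly."""
    conn.execute(CREATE_INDEX)
    conn.commit()


def index_names(conn):
    """Return the names of all indexes in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
    return [row[0] for row in rows]
```

Running create_indexes() a second time is a no-op instead of an error, which is what makes the step rerunnable.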
Replaces various hand-crafted replacements of varying format with
a single Jinja2 templating mechanism. Allows full access to
configuration if necessary.
If an exception is raised by other means, we still have to close
the stdin pipe to psql to make sure that it exits and releases its
connection to the database.
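A minimal sketch of the pattern, assuming the input is produced line by line. The function name is hypothetical and any command reading from stdin can take the place of psql.

```python
import subprocess
import sys


def feed_lines(cmd, lines):
    """Start cmd, write lines to its stdin and wait for it to exit.

    The pipe is closed in a finally clause so that the child sees EOF and
    terminates even when producing the input raises an exception.
    """
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    try:
        for line in lines:
            proc.stdin.write(line.encode('utf-8'))
    finally:
        proc.stdin.close()
        returncode = proc.wait()
    return returncode
```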
Instead of parsing the DSN for each external libpq program we
are going to execute, provide a function that feeds them all
necessary parameters through the environment.
osm2pgsql is the first user.
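Such a function could look roughly as follows. PGDATABASE, PGHOST, PGPORT, PGUSER and PGPASSWORD are the standard libpq environment variables; the function name and the simplified parsing are assumptions. The sketch only handles space-separated key=value pairs without quoting.

```python
import os

# Mapping from DSN keywords to the environment variables understood by libpq.
_DSN_TO_ENV = {
    'dbname': 'PGDATABASE',
    'host': 'PGHOST',
    'port': 'PGPORT',
    'user': 'PGUSER',
    'password': 'PGPASSWORD',
}


def get_pg_env(dsn, base_env=None):
    """Return a copy of the environment extended with libpq variables."""
    env = dict(os.environ if base_env is None else base_env)
    for item in dsn.split():
        key, _, value = item.partition('=')
        if key in _DSN_TO_ENV:
            env[_DSN_TO_ENV[key]] = value
    return env
```

The result can be handed to an external program via the env parameter of subprocess.run().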
Psycopg2 has changed the kind of exception that is emitted on
deadlocks between versions 2.7 and 2.8. The code was already
trying to catch both kinds of errors but because the
psycopg2.errors package is unknown in 2.7 and below, the
code would throw an exception on anything but a deadlock error.
This commit wraps the deadlock handling into a context manager
to avoid code duplication and uses module imports to detect if
the new error codes are available.
Also sets the required psycopg2 version to 2.7 or newer as
older versions are difficult to test.
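The idea can be sketched as below. The import check mirrors the description above; as a fallback for older versions this sketch compares the exception's pgcode attribute against SQLSTATE 40P01 (deadlock_detected), which is a simplification and not necessarily what the code does. The names deadlock_handler and is_deadlock are hypothetical.

```python
from contextlib import contextmanager

try:
    # The errors module only exists in psycopg2 2.8 and later.
    import psycopg2.errors
    _DEADLOCK_TYPES = (psycopg2.errors.DeadlockDetected,)
except ImportError:
    _DEADLOCK_TYPES = ()

DEADLOCK_SQLSTATE = '40P01'


def is_deadlock(exc):
    """Check the exception type first, then fall back to the SQLSTATE."""
    return (isinstance(exc, _DEADLOCK_TYPES)
            or getattr(exc, 'pgcode', None) == DEADLOCK_SQLSTATE)


@contextmanager
def deadlock_handler(on_deadlock):
    """Call on_deadlock() for deadlock errors, re-raise everything else."""
    try:
        yield
    except Exception as exc:
        if is_deadlock(exc):
            on_deadlock()
        else:
            raise
```

Every place that has to cope with deadlocks can then wrap its database call in 'with deadlock_handler(...)' instead of repeating the try/except logic.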
The gazetteer output doesn't disable these functions when
writing to the place table but the triggers may contain
operations that cause misplanning for the query planner.
This replaces {data_dir}/settings throughout the code, so that
the configuration may be placed somewhere else in the directory
structure (e.g. in /etc).