From 68bff31cc9d922b9e2800632451c17915c259b61 Mon Sep 17 00:00:00 2001
From: Sarah Hoffmann
Date: Thu, 29 Jul 2021 20:54:33 +0200
Subject: [PATCH] docs: add developer doc page for Tokenizer

---
 docs/develop/Tokenizers.md | 70 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100644 docs/develop/Tokenizers.md

diff --git a/docs/develop/Tokenizers.md b/docs/develop/Tokenizers.md
new file mode 100644
index 00000000..7d54247f
--- /dev/null
+++ b/docs/develop/Tokenizers.md
@@ -0,0 +1,70 @@
+# Tokenizers
+
+The tokenizer is the component of Nominatim that is responsible for
+analysing names of OSM objects and queries. Nominatim provides different
+tokenizers that use different strategies for normalisation. This page describes
+how tokenizers are expected to work and the public API that needs to be
+implemented when creating a new tokenizer. For information on how to configure
+a specific tokenizer for a database see the
+[tokenizer chapter in the administration guide](../admin/Tokenizers.md).
+
+## Generic Architecture
+
+### About Search Tokens
+
+Search in Nominatim is organised around search tokens. Such a token represents
+a string that can be part of the search query. Tokens are used so that the search
+index does not need to be organised around strings. Instead the database saves
+for each place which tokens match this place's name, address, house number etc.
+To be able to distinguish between these different types of information stored
+with the place, a search token also always has a certain type: name, house number,
+postcode etc.
+
+During search an incoming query is transformed into an ordered list of such
+search tokens (or rather many lists, see below) and this list is then converted
+into a database query to find the right place.
+
+It is the core task of the tokenizer to create, manage and assign the search
+tokens.
+The tokenizer is involved in two distinct operations:
+
+* __at import time__: scanning names of OSM objects, normalising them and
+  building up the list of search tokens.
+* __at query time__: scanning the query and returning the appropriate search
+  tokens.
+
+
+### Importing
+
+The indexer is responsible for enriching an OSM object (or place) with all data
+required for geocoding. It is split into two parts: the controller collects
+the places that require updating, enriches the place information as required
+and hands the place to PostgreSQL. The controller is part of the Nominatim
+library written in Python. Within PostgreSQL, the `placex_update`
+trigger is responsible for filling out all secondary tables with extra geocoding
+information. This part is written in PL/pgSQL.
+
+The tokenizer is involved in both parts. When the indexer prepares a place,
+it hands it over to the tokenizer to inspect the names and create all the
+search tokens applicable for the place. This usually involves updating the
+tokenizer's internal token lists and creating a list of all token IDs for
+the specific place. This list is later needed in the PL/pgSQL part where the
+indexer needs to add the token IDs to the appropriate search tables. To be
+able to communicate the list between the Python part and the PL/pgSQL trigger,
+the placex table contains a special JSONB column `token_info` which is there
+for the exclusive use of the tokenizer.
+
+The Python part of the tokenizer returns structured information about the
+tokens of a place to the indexer which converts it to JSON and inserts it into
+the `token_info` column. The content of the column is then handed to the PL/pgSQL
+callbacks of the tokenizer which extract the required information. Usually
+the tokenizer then removes all information from the `token_info` structure,
+so that no information is ever persistently saved in the table. All information
+that went in should by then have been processed and put into secondary tables.
+This is, however, not a hard requirement. If the tokenizer needs to store
+additional information about a place permanently, it may do so in the
+`token_info` column. However, it must never execute searches over the column
+and, consequently, must not create any special indexes on it.
+
+### Querying
+
+