# Tokenizers

The tokenizer is the component of Nominatim that is responsible for
analysing names of OSM objects and queries. Nominatim provides different
tokenizers that use different strategies for normalisation. This page describes
how tokenizers are expected to work and the public API that needs to be
implemented when creating a new tokenizer. For information on how to configure
a specific tokenizer for a database, see the
[tokenizer chapter in the Customization Guide](../customize/Tokenizers.md).

## Generic Architecture

### About Search Tokens

Search in Nominatim is organised around search tokens. Such a token represents
a string that can be part of the search query. Tokens are used so that the search
index does not need to be organised around strings. Instead the database saves
for each place which tokens match this place's name, address, house number etc.
To be able to distinguish between these different types of information stored
with the place, a search token also always has a certain type: name, house number,
postcode etc.

During search an incoming query is transformed into an ordered list of such
search tokens (or rather many lists, see below) and this list is then converted
into a database query to find the right place.
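
For example, a query like `Hauptstraße 134, Berlin` might be turned into a name
token for `hauptstraße`, a house number token for `134` and another name token
for `berlin`; the exact split depends, of course, on the tokenizer.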

It is the core task of the tokenizer to create, manage and assign the search
tokens. The tokenizer is involved in two distinct operations:

* __at import time__: scanning names of OSM objects, normalizing them and
  building up the list of search tokens.
* __at query time__: scanning the query and returning the appropriate search
  tokens.

### Importing

The indexer is responsible for enriching an OSM object (or place) with all data
required for geocoding. It is split into two parts: the controller collects
the places that require updating, enriches the place information as required
and hands the place to Postgresql. The controller is part of the Nominatim
library written in Python. Within Postgresql, the `placex_update`
trigger is responsible for filling out all secondary tables with extra geocoding
information. This part is written in PL/pgSQL.

The tokenizer is involved in both parts. When the indexer prepares a place,
it hands it over to the tokenizer to inspect the names and create all the
search tokens applicable for the place. This usually involves updating the
tokenizer's internal token lists and creating a list of all token IDs for
the specific place. This list is later needed in the PL/pgSQL part where the
indexer needs to add the token IDs to the appropriate search tables. To be
able to communicate the list between the Python part and the PL/pgSQL trigger,
the `placex` table contains a special JSONB column `token_info` which is there
for the exclusive use of the tokenizer.

The Python part of the tokenizer returns structured information about the
tokens of a place to the indexer, which converts it to JSON and inserts it into
the `token_info` column. The content of the column is then handed to the PL/pgSQL
callbacks of the tokenizer which extract the required information. Usually
the tokenizer then removes all information from the `token_info` structure,
so that no information is ever persistently saved in the table. After all, any
information that went in should have been processed and put into the secondary
tables. This is however not a hard requirement. If the tokenizer needs to store
additional information about a place permanently, it may do so in the
`token_info` column. It just must never execute searches over it and
consequently must not create any special indexes on it.
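
As a purely illustrative example, a tokenizer could store something like
`{"names": "{615,1953}", "hnr": "{42}"}` in the column while a place is being
indexed; the layout of the JSON is entirely up to the tokenizer, as long as its
own PL/pgSQL callbacks (see below) can interpret it.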

### Querying

At query time, Nominatim builds up multiple _interpretations_ of the search
query. Each of these interpretations is tried against the database in order
of the likelihood with which they match the search query. The first
interpretation that yields results wins.

The interpretations are encapsulated in the `SearchDescription` class. An
instance of this class is created by applying a sequence of
_search tokens_ to an initially empty SearchDescription. It is the
responsibility of the tokenizer to parse the search query and derive all
possible sequences of search tokens. To that end the tokenizer needs to parse
the search query and look up matching words in its own data structures.

## Tokenizer API

The following section describes the functions that need to be implemented
for a custom tokenizer implementation.

!!! warning
    This API is currently in early alpha status. While this API is meant to
    be a public API on which other tokenizers may be implemented, it is
    still far from being stable.

### Directory Structure

Nominatim expects two files for a tokenizer:

* `nominatim/tokenizer/<NAME>_tokenizer.py` containing the Python part of the
  implementation
* `lib-php/tokenizer/<NAME>_tokenizer.php` with the PHP part of the
  implementation

where `<NAME>` is a unique name for the tokenizer consisting only of lower-case
letters, digits and underscores. A tokenizer also needs to install some SQL
functions. By convention, these should be placed in `lib-sql/tokenizer`.

If the tokenizer has a default configuration file, it should be saved as
`settings/<NAME>_tokenizer.<SUFFIX>`.

### Configuration and Persistence

Tokenizers may define custom settings for their configuration. All settings
must be prefixed with `NOMINATIM_TOKENIZER_`. Settings may be transient or
persistent. Transient settings are loaded from the configuration file when
Nominatim is started and may thus be changed at any time. Persistent settings
are tied to a database installation and must only be read at installation
time. If they are needed at runtime, they must be saved in the
`nominatim_properties` table and later loaded from there.
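
To illustrate, a persistent setting could be written to `nominatim_properties`
with plain SQL along the following lines. This is only a sketch: it assumes a
psycopg2 connection, and the property name `tokenizer_example_setting` is made
up for the example.

```python
import psycopg2

def save_persistent_setting(dsn: str, value: str) -> None:
    """Save a tokenizer setting so later runs use the same configuration."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # Emulate an upsert; 'property' is the key column of the table.
            cur.execute("""DELETE FROM nominatim_properties
                           WHERE property = 'tokenizer_example_setting'""")
            cur.execute("""INSERT INTO nominatim_properties (property, value)
                           VALUES ('tokenizer_example_setting', %s)""",
                        (value,))
        conn.commit()
```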

### The Python module

The Python module is expected to export a single factory function:

```python
def create(dsn: str, data_dir: Path) -> AbstractTokenizer
```

The `dsn` parameter contains the DSN of the Nominatim database. The `data_dir`
is a directory in the project directory that the tokenizer may use to save
database-specific data. The function must return the instance of the tokenizer
class as defined below.
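
A minimal factory might look like the following sketch, where `MyTokenizer`
stands in for a real implementation of the abstract base class:

```python
from pathlib import Path

from nominatim.tokenizer.base import AbstractTokenizer


class MyTokenizer(AbstractTokenizer):
    """Placeholder tokenizer. The abstract methods of the base class are
       omitted here for brevity; a real tokenizer must implement them
       before the class can be instantiated."""

    def __init__(self, dsn: str, data_dir: Path) -> None:
        self.dsn = dsn
        self.data_dir = data_dir


def create(dsn: str, data_dir: Path) -> AbstractTokenizer:
    """Factory called by Nominatim to instantiate the tokenizer."""
    return MyTokenizer(dsn, data_dir)
```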

### Python Tokenizer Class

All tokenizers must inherit from `nominatim.tokenizer.base.AbstractTokenizer`
and implement the abstract functions defined there.

::: nominatim.tokenizer.base.AbstractTokenizer
    rendering:
        heading_level: 4

### Python Analyzer Class

::: nominatim.tokenizer.base.AbstractAnalyzer
    rendering:
        heading_level: 4

### PL/pgSQL Functions

The tokenizer must provide access functions for the `token_info` column
to the indexer which extracts the necessary information for the global
search tables. If the tokenizer needs additional SQL functions for private
use, then these functions must be prefixed with `token_` in order to ensure
that there are no naming conflicts with the SQL indexer code.

The following functions are expected:

```sql
FUNCTION token_get_name_search_tokens(info JSONB) RETURNS INTEGER[]
```

Return an array of token IDs of search terms that should match
the name(s) for the given place. These tokens are used to look up the place
by name and, where the place functions as part of an address for another place,
by address. Must return NULL when the place has no name.
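
For illustration, a tokenizer that saves its name token IDs as an array literal
under a hypothetical `names` key of `token_info` could implement the function
roughly like this:

```sql
CREATE OR REPLACE FUNCTION token_get_name_search_tokens(info JSONB)
  RETURNS INTEGER[]
AS $$
  -- ->> yields NULL when the 'names' key is missing, so unnamed
  -- places automatically return NULL as required.
  SELECT (info->>'names')::INTEGER[]
$$ LANGUAGE SQL IMMUTABLE STRICT;
```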

```sql
FUNCTION token_get_name_match_tokens(info JSONB) RETURNS INTEGER[]
```

Return an array of token IDs of full names of the place that should be used
to match addresses. The list of match tokens is usually stricter than the list
of search tokens, as it is used to find a match between two OSM tag values which
are expected to contain matching full names. Partial terms should not be
used for match tokens. Must return NULL when the place has no name.

```sql
FUNCTION token_get_housenumber_search_tokens(info JSONB) RETURNS INTEGER[]
```

Return an array of token IDs of house number tokens that apply to the place.
Note that a place may have multiple house numbers, for example when apartments
each have their own number. Must be NULL when the place has no house numbers.

```sql
FUNCTION token_normalized_housenumber(info JSONB) RETURNS TEXT
```

Return the house number(s) in the normalized form that can be matched against
a house number token text. If a place has multiple house numbers, they must
be listed with a semicolon as delimiter. Must be NULL when the place has no
house numbers.

```sql
FUNCTION token_matches_street(info JSONB, street_tokens INTEGER[]) RETURNS BOOLEAN
```

Check if the given tokens (previously saved from `token_get_name_match_tokens()`)
match against the `addr:street` tag name. Must return either NULL or FALSE
when the place has no `addr:street` tag.
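
A sketch of a possible implementation, assuming the match tokens for the street
name are stored under a hypothetical `street` key:

```sql
CREATE OR REPLACE FUNCTION token_matches_street(info JSONB, street_tokens INTEGER[])
  RETURNS BOOLEAN
AS $$
  -- && is the array overlap operator; the expression evaluates to
  -- NULL when there is no 'street' entry in token_info.
  SELECT (info->>'street')::INTEGER[] && street_tokens
$$ LANGUAGE SQL IMMUTABLE STRICT;
```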

```sql
FUNCTION token_matches_place(info JSONB, place_tokens INTEGER[]) RETURNS BOOLEAN
```

Check if the given tokens (previously saved from `token_get_name_match_tokens()`)
match against the `addr:place` tag name. Must return either NULL or FALSE
when the place has no `addr:place` tag.

```sql
FUNCTION token_addr_place_search_tokens(info JSONB) RETURNS INTEGER[]
```

Return the search token IDs extracted from the `addr:place` tag. These tokens
are used for searches by address when no matching place can be found in the
database. Must be NULL when the place has no `addr:place` tag.

```sql
FUNCTION token_get_address_keys(info JSONB) RETURNS SETOF TEXT
```

Return the set of keys for which address information is provided. This
should correspond to the list of (relevant) `addr:*` tags with the `addr:`
prefix removed or the keys used in the `address` dictionary of the place info.
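
A sketch of an implementation, assuming this tokenizer keeps the per-key
address tokens in a made-up `addr` object inside `token_info`:

```sql
CREATE OR REPLACE FUNCTION token_get_address_keys(info JSONB)
  RETURNS SETOF TEXT
AS $$
  -- Yields no rows when the 'addr' object is missing.
  SELECT * FROM jsonb_object_keys(info->'addr')
$$ LANGUAGE SQL IMMUTABLE STRICT;
```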

```sql
FUNCTION token_get_address_search_tokens(info JSONB, key TEXT) RETURNS INTEGER[]
```

Return the array of search tokens for the given address part. `key` can be
expected to be one of those returned by `token_get_address_keys()`. The
search tokens are added to the address search vector of the place when no
corresponding OSM object could be found for the given address part from which
to copy the name information.
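
Continuing the hypothetical `addr` layout from the previous example:

```sql
CREATE OR REPLACE FUNCTION token_get_address_search_tokens(info JSONB, key TEXT)
  RETURNS INTEGER[]
AS $$
  SELECT (info->'addr'->>key)::INTEGER[]
$$ LANGUAGE SQL IMMUTABLE STRICT;
```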

```sql
FUNCTION token_matches_address(info JSONB, key TEXT, tokens INTEGER[]) RETURNS BOOLEAN
```

Check if the given tokens match against the address part `key`.

__Warning:__ the tokens that are handed in are the lists previously saved
from `token_get_name_search_tokens()`, _not_ from the match token list. This
is a historical oddity which will be fixed at some point in the future.
Currently, tokenizers are encouraged to make sure that matching works against
both the search token list and the match token list.

```sql
FUNCTION token_normalized_postcode(postcode TEXT) RETURNS TEXT
```

Return the normalized version of the given postcode. This function must return
the same value as the Python function `AbstractAnalyzer->normalize_postcode()`.

```sql
FUNCTION token_strip_info(info JSONB) RETURNS JSONB
```

Return the part of the `token_info` field that should be stored in the database
permanently. The indexer calls this function when all processing is done and
replaces the content of the `token_info` column with the returned value before
the trigger stores the information in the database. May return NULL if no
information should be stored permanently.
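
A tokenizer that keeps no permanent data can simply discard everything:

```sql
CREATE OR REPLACE FUNCTION token_strip_info(info JSONB)
  RETURNS JSONB
AS $$
  SELECT NULL::JSONB
$$ LANGUAGE SQL IMMUTABLE;
```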

### PHP Tokenizer class

The PHP tokenizer class is instantiated once per request and is responsible for
analyzing the incoming query. Multiple requests may be in flight in
parallel.

The class is expected to be found under the
name `\Nominatim\Tokenizer`. To find the class, the PHP code includes the file
`tokenizer/tokenizer.php` in the project directory. This file must be created
when the tokenizer is first set up on import. The file should initialize any
configuration variables by setting PHP constants and then require the file
with the actual implementation of the tokenizer.
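
For illustration, a generated `tokenizer/tokenizer.php` could look roughly like
this; the constant name, its value and the included file name are invented for
the example:

```php
<?php
// Example configuration constant for a hypothetical 'mytok' tokenizer.
define('CONST_MyTok_Normalization', ':: lower();');

// Pull in the actual implementation defining \Nominatim\Tokenizer.
require_once(CONST_LibDir.'/tokenizer/mytok_tokenizer.php');
```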

The tokenizer class must implement the following functions:

```php
public function __construct(object &$oDB)
```

The constructor of the class receives a database connection that can be used
to query persistent data in the database.

```php
public function checkStatus()
```

Check that the tokenizer can access its persistent data structures. If there
is an issue, throw an `\Exception`.

```php
public function normalizeString(string $sTerm) : string
```

Normalize a string to a form to be used for comparisons when reordering results.
Nominatim reweighs results by how well the final display string matches the actual
query. Before comparing result and query, names and query are normalised with
this function. The tokenizer can thus remove all properties that should not be
taken into account for reweighing, e.g. special characters or case.

```php
public function tokensForSpecialTerm(string $sTerm) : array
```

Return the list of special term tokens that match the given term.

```php
public function extractTokensFromPhrases(array &$aPhrases) : TokenList
```

Parse the given phrases, splitting them into word lists, and retrieve the
matching tokens.

The phrase array may take on two forms. In unstructured searches (using the `q=`
parameter) the search query is split at the commas and the elements are
put into a sorted list. For structured searches the phrase array is an
associative array where the key designates the type of the term (street, city,
county etc.). The tokenizer may ignore the phrase type at this stage of parsing.
Matching phrase type and appropriate search token type will be done later
when the SearchDescription is built.

For each phrase in the list of phrases, the function must analyse the phrase
string and then call `setWordSets()` to communicate the result of the analysis.
A word set is a list of strings, where each string refers to a search token.
A phrase may have multiple interpretations. Therefore a list of word sets is
usually attached to the phrase. The search tokens themselves are returned
by the function in an associative array, where the key corresponds to the
strings given in the word sets. The value is a list of search tokens. Thus
a single string in the list of word sets may refer to multiple search tokens.
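
To make the data flow concrete, here is a purely hypothetical result for the
phrase `hauptstr 134`; the word set contents, the `$oPhrase` variable and the
token mapping are invented for the example:

```php
<?php
// Two interpretations of the same phrase: street name plus house
// number, or one single name containing the number.
$aWordSets = array(
    array('hauptstr', '134'),
    array('hauptstr 134'),
);
$oPhrase->setWordSets($aWordSets);

// The returned TokenList then maps each word set string to one or
// more search tokens, e.g.:
//   'hauptstr'     => a partial name token
//   '134'          => a house number token
//   'hauptstr 134' => a full name token, if such a name exists
```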