mirror of
https://github.com/osm-search/Nominatim.git
synced 2024-12-27 06:51:42 +03:00
docs: add overview over indexing
parent
a069479340
commit
5b86b2078a
152
docs/develop/Indexing.md
Normal file
@@ -0,0 +1,152 @@
# Indexing Places

In Nominatim, the word __indexing__ refers to the process that takes the raw
OpenStreetMap data from the place table, enriches it with address information
and creates the search indexes. This section explains the basic data flow.

## Initial import

After osm2pgsql has loaded the raw OSM data into the place table,
the data is copied to the final search tables placex and location_property_osmline.
While the rows are copied, some basic properties are added:

 * country_code, geometry_sector and partition
 * initial search and address rank

In addition, the column `indexed_status` is set to `1`, marking the place as one
that needs to be indexed.

All this happens in the triggers `placex_insert` and `osmline_insert`.

## Indexing

The main workhorse of the data import is the indexing step, where Nominatim
takes every place from the placex and location_property_osmline tables where
`indexed_status != 0` and computes the search terms and the address parts
of the place.

The indexing happens in three major steps:

1. **Data preparation** - The indexer gets the data for the place to be indexed
   from the database.

2. **Search name processing** - The prepared data is given to the
   tokenizer which computes the search terms from the names
   and potentially other information.

3. **Address processing** - The indexer then hands the prepared data and the
   tokenizer information back to the database via an `UPDATE` statement which
   also sets the indexed_status to `0`. This fires the update triggers
   `placex_update`/`osmline_update` which do the work of computing address
   parts and filling all the search tables.

When computing the address terms of a place, Nominatim relies on the processed
search names of all the address parts. That is why places are processed in rank
order, from smallest rank to largest. To ensure correct handling of linked
place nodes, administrative boundaries are processed before all other places.

Apart from these restrictions, each place can be indexed independently
from the others. This allows a large degree of parallelization during the indexing.
It also means that the indexing process can be interrupted at any time and
will simply pick up where it left off when restarted.

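
The ordering constraints described above can be sketched in plain Python.
This is a toy model and not Nominatim's indexer: the `Place` record, its
`is_admin_boundary` flag and the `index_place` callback are hypothetical names
made up for the illustration. The sketch reads the text as "all administrative
boundaries first, then everything else, each pass in ascending rank order" and
assumes that places of the same rank may run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import groupby
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Place:
    """Hypothetical stand-in for a row that still needs indexing."""
    place_id: int
    rank: int
    is_admin_boundary: bool = False


def indexing_batches(places: Iterable[Place]) -> List[List[Place]]:
    """Group places into batches that must run one after another.

    Boundaries come first, then everything else. Within each pass the
    batches are ordered from the smallest rank to the largest.
    """
    pending = list(places)
    batches: List[List[Place]] = []
    for want_boundary in (True, False):
        selected = sorted(
            (p for p in pending if p.is_admin_boundary == want_boundary),
            key=lambda p: p.rank,
        )
        for _, group in groupby(selected, key=lambda p: p.rank):
            batches.append(list(group))
    return batches


def run_indexing(places: Iterable[Place],
                 index_place: Callable[[Place], None],
                 workers: int = 4) -> None:
    """Index batch by batch; places inside one batch run in parallel."""
    for batch in indexing_batches(places):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() waits for the whole batch and re-raises worker errors.
            list(pool.map(index_place, batch))
```

Because a batch only starts after the previous one has finished, every place
sees fully processed address parts of lower rank. Restarting after an
interruption fits the same model: only places that are still marked as
unindexed would be passed to `run_indexing()` again.
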
### Data preparation

The data preparation step computes and retrieves all data for a place that
might be needed for the next step of processing the search name. That includes

* location information (country code)
* place classification (class, type, ranks)
* names (including names of linked places)
* address information (`addr:*` tags)

Data preparation is implemented in PL/pgSQL, mostly in the functions
`placex_indexing_prepare()` and `get_interpolation_address()`.

#### `addr:*` tag inheritance

Nominatim has limited support for inheriting address tags from a building
to POIs inside the building. This only works when the address tags are on the
building outline. Any rank 30 object inside such a building or on its outline
inherits all address tags when it does not have any address tags of its own.

The inheritance is computed in the data preparation step.

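
The inheritance rule can be written down as a small Python function. This is
only an illustration of the rule as stated in the text, not the PL/pgSQL code
that Nominatim runs. The function names are hypothetical, and the geometric
test (is the object inside the building or on its outline?) is assumed to have
been done by the caller, who passes the tags of the containing building or
`None`.

```python
from typing import Dict, Optional

Tags = Dict[str, str]


def address_tags(tags: Tags) -> Tags:
    """Return only the `addr:*` tags of an object."""
    return {k: v for k, v in tags.items() if k.startswith("addr:")}


def effective_address(rank: int, own_tags: Tags,
                      building_tags: Optional[Tags]) -> Tags:
    """Apply the inheritance rule to a single object.

    `building_tags` are the tags of the building outline that contains
    the object, or None when there is no such building.
    """
    own = address_tags(own_tags)
    # Only rank 30 objects without any address tags of their own inherit.
    if rank != 30 or own or building_tags is None:
        return own
    return address_tags(building_tags)
```

Note that inheritance is all or nothing in this sketch: a POI with a single
`addr:*` tag of its own keeps just that tag and receives nothing from the
building.
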
### Search name processing

The prepared place information is handed to the tokenizer next. This is a
Python module responsible for processing the names from both name and address
terms and building up the word index from them. The process is explained in
more detail in the [Tokenizer chapter](Tokenizers.md).

### Address processing

Finally, the preprocessed place information and the results of the search name
processing are written back to the database. At this point the update triggers
of the placex/location_property_osmline tables take over and fill all the
dependent tables. This makes up the most work-intensive part of the indexing.

Nominatim distinguishes between dependent and independent places.
**Dependent places** are all places on rank 30: house numbers, POIs etc. These
places don't have a full address of their own. Instead they are attached to
a parent street or place and use the information of the parent for searching
and displaying information. All other places are **independent places**: streets,
parks, water bodies, suburbs, cities, states etc. They receive a full address
of their own.

The address processing differs considerably between the two types of places.

#### Independent places

To compute the address of an independent place, Nominatim searches for all
places that cover it at least partially.
For places with an area, that area is used to check for coverage. For place
nodes an artificial square area is computed according to the rank of
the place. The lower the rank, the larger the area. The `location_area_large_X`
tables are there to facilitate the lookup. All places that can function as
the address of another place are saved in those tables.

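
The idea of an artificial area around a place node can be illustrated as
follows. The side lengths used here are invented for the example and are not
the values Nominatim uses. The only property taken from the text is that the
area shrinks as the rank grows. The sketch also assumes a square in plain
degree coordinates and only tests point coverage; `BBox`, `half_width()`,
`node_area()` and `covers()` are hypothetical names.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned box in degrees."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def half_width(rank: int) -> float:
    """Illustrative half side length in degrees, NOT Nominatim's values.

    Lower ranks (larger features such as cities) get a larger square.
    """
    if rank < 1:
        raise ValueError("rank must be positive")
    return 2.0 / rank


def node_area(lon: float, lat: float, rank: int) -> BBox:
    """Build the artificial square around a place node."""
    r = half_width(rank)
    return BBox(lon - r, lat - r, lon + r, lat + r)


def covers(area: BBox, lon: float, lat: float) -> bool:
    """Check whether the square contains the given point."""
    return (area.min_lon <= lon <= area.max_lon
            and area.min_lat <= lat <= area.max_lat)
```

With these made-up numbers a rank 16 node reaches further than a rank 20 node
at the same position, so a point slightly outside the smaller square can still
pick up the lower-ranked place as part of its address.
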
`addr:*` and `isin:*` tags are taken into account to compute the address, too.
Nominatim will give preference to places with the same name as in these tags
when looking for places in the vicinity. If there are no matching place names
at all, then the tags are at least added to the search index. That means that
the names will not be shown in the result as the 'address' of the place, but
searching by them still works.

Independent places are always added to the global search index `search_name`.

#### Dependent places

Dependent places skip the full address computation for performance reasons.
Instead they just find a parent place to attach themselves to.

![parenting of dependent places](parenting-flow.svg)

By default a POI
or house number will be attached to the closest street. That can be any major
or minor street indexed by Nominatim. In the default configuration that means
that it can attach itself to a footway but only when it has a name.

When the dependent place has an `addr:street` tag, then Nominatim will first
try to find a street with the same name before falling back to the closest
street.

There are also addresses in OSM where the house number does not belong
to a street at all. These have an `addr:place` tag. For these places, Nominatim
tries to find a place with the given name in the indexed places with an
address rank between 16 and 25. If none is found, then the dependent place
is attached to the closest place in that category and the `addr:place` name is
added as an *unlisted* place, which indicates to Nominatim that it needs to add
it to the address output, no matter what. This special case is necessary to
cover addresses that don't really refer to an existing object.

When an address has both the `addr:street` and the `addr:place` tag, then Nominatim
assumes that the `addr:place` tag in fact should be the city part of the address
and gives the POI the usual street number address.

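
The parenting rules above can be condensed into one decision function. The
following Python sketch mirrors the flow in the text under simplifying
assumptions: candidates are plain records, distance is planar, names match by
exact string equality, and the rank range 16 to 25 is treated as inclusive.
`Candidate`, `closest()` and `find_parent()` are hypothetical names and do not
exist in Nominatim.

```python
from dataclasses import dataclass
from math import hypot
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical nearby street or place that could become a parent."""
    name: str
    x: float
    y: float
    address_rank: int
    is_street: bool


def closest(cands: List[Candidate], x: float, y: float) -> Optional[Candidate]:
    """Return the nearest candidate, or None when the list is empty."""
    return min(cands, key=lambda c: hypot(c.x - x, c.y - y), default=None)


def find_parent(tags: Dict[str, str], x: float, y: float,
                nearby: List[Candidate]
                ) -> Tuple[Optional[Candidate], Optional[str]]:
    """Return (parent, unlisted_place_name) for a dependent place."""
    streets = [c for c in nearby if c.is_street]
    places = [c for c in nearby
              if not c.is_street and 16 <= c.address_rank <= 25]

    street_name = tags.get("addr:street")
    place_name = tags.get("addr:place")

    # addr:street wins, even when addr:place is present as well.
    if street_name is not None:
        same = [c for c in streets if c.name == street_name]
        return closest(same or streets, x, y), None

    if place_name is not None:
        same = [c for c in places if c.name == place_name]
        if same:
            return closest(same, x, y), None
        # No such place: attach to the closest one and remember the
        # name so that it still shows up in the address output.
        return closest(places, x, y), place_name

    return closest(streets, x, y), None
```

The second element of the returned tuple is only set in the fallback branch
for `addr:place`. It stands for the *unlisted* place name that has to be added
to the address regardless of the chosen parent.
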
Dependent places are only added to the global search index `search_name` when
they have either a name themselves or when they have address tags that are not
covered by the places that make up their address. The latter ensures that
addresses are always searchable by those address tags.

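
As a rough illustration, the condition for adding a dependent place to
`search_name` could look like the predicate below. It is a sketch under loud
assumptions: names are compared as case-insensitive strings rather than as
tokenizer output, and the caller is assumed to pass only name-like address
tags such as `addr:city`, not house numbers. `needs_search_entry()` is a
hypothetical helper.

```python
from typing import Dict, Iterable


def needs_search_entry(names: Dict[str, str],
                       addr_tags: Dict[str, str],
                       address_part_names: Iterable[str]) -> bool:
    """Decide whether a dependent place gets its own search entry.

    names:              name tags of the place itself
    addr_tags:          name-like address tags (assumption: no house numbers)
    address_part_names: names of the places that make up the address
    """
    if names:
        return True
    known = {n.casefold() for n in address_part_names}
    # An address tag that no address part covers must stay searchable.
    return any(v.casefold() not in known for v in addr_tags.values())
```

A nameless house whose `addr:city` matches one of its address parts needs no
entry of its own, because a search can reach it through its parent.
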
31
docs/develop/parenting-flow.plantuml
Normal file
@@ -0,0 +1,31 @@
@startuml
skinparam monochrome true

start

if (has 'addr:street'?) then (yes)
    if (street with that name\n nearby?) then (yes)
        :**Use closest street**
        **with same name**;
        kill
    else (no)
        :**Use closest**\n**street**;
        kill
    endif
elseif (has 'addr:place'?) then (yes)
    if (place with that name\n nearby?) then (yes)
        :**Use closest place**
        **with same name**;
        kill
    else (no)
        :add addr:place to address;
        :**Use closest place**\n**rank 16 to 25**;
        kill
    endif
else (otherwise)
    :**Use closest**\n**street**;
    kill
endif


@enduml
41
docs/develop/parenting-flow.svg
Normal file
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 9.8 KiB |
@@ -36,6 +36,7 @@ pages:
     - 'Developers Guide':
         - 'Architecture Overview' : 'develop/overview.md'
         - 'Database Layout' : 'develop/Database-Layout.md'
+        - 'Indexing' : 'develop/Indexing.md'
         - 'Tokenizers' : 'develop/Tokenizers.md'
         - 'Setup for Development' : 'develop/Development-Environment.md'
         - 'Testing' : 'develop/Testing.md'