Merge pull request #2502 from lonvia/improve-development-documentation

Extend developer's documentation
Sarah Hoffmann 2021-11-01 16:12:23 +01:00 committed by GitHub
commit ccf61db726
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
13 changed files with 796 additions and 34 deletions

7
.gitignore vendored
View File

@ -1,12 +1,9 @@
*.log
*.pyc
build
settings/local.php
docs/develop/*.png
data/wiki_import.sql
data/wiki_specialphrases.sql
data/osmosischange.osc
build
.vagrant
data/country_osm_grid.sql.gz

View File

@ -0,0 +1,167 @@
# Database Layout
## Import tables
OSM data is initially imported using [osm2pgsql](https://osm2pgsql.org).
Nominatim uses its own data output style 'gazetteer', which differs from the
output style created for map rendering.
The import process creates the following tables:
![osm2pgsql tables](osm2pgsql-tables.svg)
The `planet_osm_*` tables are the usual backing tables for OSM data. Note
that Nominatim uses them to look up special relations and to find nodes on
ways.
The gazetteer style produces a single table `place` as output with the following
columns:
* `osm_type` - kind of OSM object (**N** - node, **W** - way, **R** - relation)
* `osm_id` - original OSM ID
* `class` - key of principal tag defining the object type
* `type` - value of principal tag defining the object type
* `name` - collection of tags that contain a name or reference
* `admin_level` - numerical value of the tagged administrative level
* `address` - collection of tags defining the address of an object
* `extratags` - collection of additional interesting tags that are not
directly relevant for searching
* `geometry` - geometry of the object (in WGS84)
A single OSM object may appear multiple times in this table when it is tagged
with multiple tags that may constitute a principal tag. Take for example a
motorway bridge. In OSM, this would be a way which is tagged with
`highway=motorway` and `bridge=yes`. This way would appear in the `place` table
once with `class` of `highway` and once with a `class` of `bridge`. Thus the
*unique key* for `place` is (`osm_type`, `osm_id`, `class`).
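The effect of the unique key can be illustrated with a minimal sketch. The set of principal keys below is a hypothetical subset chosen for the example; the real set depends on the configured import style.

```python
# Hypothetical subset of keys treated as principal tags; the real set
# comes from the configured import style.
PRINCIPAL_KEYS = {"highway", "bridge", "amenity"}


def place_rows(osm_type, osm_id, tags):
    """Return one (osm_type, osm_id, class, type) row per principal tag."""
    return [
        (osm_type, osm_id, key, value)
        for key, value in sorted(tags.items())
        if key in PRINCIPAL_KEYS
    ]


# The motorway bridge from the text: one way, two rows in `place`.
rows = place_rows("W", 4711, {"highway": "motorway", "bridge": "yes", "lanes": "2"})
for row in rows:
    print(row)
```

Each row differs in its `class`, so (`osm_type`, `osm_id`, `class`) stays unique even though the OSM object is the same.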
How raw OSM tags are mapped to the columns in the place table is to a certain
degree configurable. See [Customizing Import Styles](../customize/Import-Styles.md)
for more information.
## Search tables
The following tables carry all information needed to do the search:
![search tables](search-tables.svg)
The **placex** table is the central table that saves all information about the
searchable places in Nominatim. The basic columns are the same as for the
place table and have the same meaning. The placex table adds the following
columns:
* `place_id` - the internal unique ID to identify the place
* `partition` - the id to use with partitioned tables (see below)
* `geometry_sector` - a location hash used for geographically close ordering
* `parent_place_id` - the next higher place in the address hierarchy, only
relevant for POI-type places (with rank 30)
* `linked_place_id` - place ID of the place this object has been merged with.
When this ID is set, then the place is invisible for search.
* `importance` - measure how well known the place is
* `rank_search`, `rank_address` - search and address rank (see [Customizing ranking](../customize/Ranking.md))
* `wikipedia` - the Wikipedia page used for computing the importance of the place
* `country_code` - the country the place is located in
* `housenumber` - normalized housenumber, if the place has one
* `postcode` - computed postcode for the place
* `indexed_status` - processing status of the place (0 - ready, 1 - freshly inserted, 2 - needs updating, 100 - needs deletion)
* `indexed_date` - timestamp when the place was processed last
* `centroid` - a point feature for the place
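As a reading aid, the `indexed_status` values listed above could be modelled as follows. The constant names are illustrative and not taken from the Nominatim code; only the numeric values come from the list.

```python
from enum import IntEnum


class IndexedStatus(IntEnum):
    """Values of `indexed_status`; the names are illustrative only."""
    READY = 0
    FRESHLY_INSERTED = 1
    NEEDS_UPDATING = 2
    NEEDS_DELETION = 100


def needs_processing(status):
    """Any status other than 0 means the place still needs work."""
    return status != IndexedStatus.READY
```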
The **location_property_osmline** table is a special table for
[address interpolations](https://wiki.openstreetmap.org/wiki/Addresses#Using_interpolation).
The columns have the same meaning and use as the columns with the same name in
the placex table. Only three columns are special:
* `startnumber` and `endnumber` - beginning and end of the number range
for the interpolation
* `interpolationtype` - a string `odd`, `even` or `all` to indicate
the interval between the numbers
Address interpolations are always ways in OSM, which is why there is no column
`osm_type`.
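A small sketch of how the three special columns describe a number range. It assumes that both ends of the range are inclusive and that `odd` and `even` select numbers by parity; the actual handling of the end points may differ.

```python
def interpolated_numbers(startnumber, endnumber, interpolationtype):
    """List the house numbers described by an interpolation row.

    Assumption: both ends are inclusive and `odd`/`even` filter by parity.
    """
    if interpolationtype not in ("odd", "even", "all"):
        raise ValueError(f"unknown interpolation type: {interpolationtype}")
    numbers = range(startnumber, endnumber + 1)
    if interpolationtype == "all":
        return list(numbers)
    parity = 1 if interpolationtype == "odd" else 0
    return [n for n in numbers if n % 2 == parity]
```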
The **location_postcode** table holds computed centroids of all postcodes that
can be found in the OSM data. The meaning of the columns is again the same
as that of the placex table.
Every place needs an address, a set of surrounding places that describe the
location of the place. The set of address places is made up of OSM places
themselves. The **place_addressline** table cross-references for each place
all the places that make up its address. Two columns define the address
relation:
* `place_id` - reference to the place being addressed
* `address_place_id` - reference to the place serving as an address part
Most of the columns cache information from the placex entry of the address
part. The exceptions are:
* `fromarea` - is true if the address part has an area geometry and can
therefore be considered precise
* `isaddress` - is true if the address part should show up in the address
output. Sometimes there are multiple places competing for the same address
type (e.g. multiple cities) and this field resolves the tie.
The **search_name** table contains the search index proper. It saves for each
place the terms with which the place can be found. The terms are split into
the name itself and all terms that make up the address. The table mirrors some
of the columns from placex for faster lookup.
Search terms are not saved as strings. Each term is assigned an integer and those
integers are saved in the name and address vectors of the search_name table. The
**word** table serves as the lookup table from string to such a word ID. The
exact content of the word table depends on the [tokenizer](Tokenizers.md) used.
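The relation between terms, word IDs and the vectors can be sketched with a toy lookup table. This is a simplified stand-in, not the real word table: an actual tokenizer normalizes terms and stores additional information.

```python
class WordTable:
    """Toy stand-in for the word table: maps term strings to integer IDs."""

    def __init__(self):
        self._ids = {}

    def word_id(self, term):
        # Assign the next free integer the first time a term is seen.
        return self._ids.setdefault(term, len(self._ids) + 1)


def name_vector(words, terms):
    """Return the sorted, de-duplicated word IDs for a list of terms."""
    return sorted({words.word_id(term) for term in terms})


words = WordTable()
street_vector = name_vector(words, ["main", "street"])
city_vector = name_vector(words, ["berlin", "main"])
```

Both vectors share the ID for `main`, which is what allows a single integer comparison to stand in for a string match.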
## Address computation tables
Next to the main search tables, there is a set of secondary helper tables used
to compute the address relations between places. These tables are partitioned.
Each country is assigned a partition number in the country_name table (see
below) and the data is then split between a set of tables, one for each
partition. Note that Nominatim still manages partitioned tables manually.
Native support for partitions in PostgreSQL only became usable with version 13.
It will be a little while before Nominatim drops support for older versions.
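A minimal sketch of manual partitioning: the partition number of a country selects the table to use. The mapping and the fallback partition are assumptions made up for the example; in a real database the assignment is read from the country_name table.

```python
# Hypothetical partition assignment; a real database reads it from the
# country_name table.
COUNTRY_PARTITION = {"de": 1, "fr": 2}
DEFAULT_PARTITION = 0  # assumption: fallback for countries without an entry


def partition_table(base_name, country_code):
    """Return the name of the partitioned table for a country."""
    partition = COUNTRY_PARTITION.get(country_code, DEFAULT_PARTITION)
    return f"{base_name}_{partition}"
```

For example, `partition_table("location_road", "de")` yields `location_road_1` under the mapping above.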
![address tables](address-tables.svg)
The **search_name_X** tables are used to look up streets that appear in the
`addr:street` tag.
The **location_area_large_X** tables are used to look up larger areas
(administrative boundaries and place nodes) either through their geographic
closeness or through `addr:*` entries.
The **location_road_X** tables are used to find the closest street for a
dependent place.
All three kinds of tables cache specific information from the placex table for
their selected subset of places:
* `keywords` and `name_vector` contain lists of term ids (from the word table)
that the full name of the place should match against
* `isguess` is true for places that are not described by an area
All other columns reflect their counterpart in the placex table.
## Static data tables
Nominatim also creates a number of static tables at import:
* `nominatim_properties` saves settings that must not be changed after
import
* `address_levels` saves the rank information from the
[ranking configuration](../customize/Ranking.md)
* `country_name` contains a fallback of names for all countries, their
default languages and saves the assignment of countries to partitions.
* `country_osm_grid` provides a fallback for country geometries
## Auxiliary data tables
Finally there are some tables for auxiliary data:
* `location_property_tiger` - saves house numbers from the Tiger import. Its
layout is similar to that of `location_property_osmline`.
* `place_class_*` tables are helper tables to facilitate lookup of POIs
by their class and type. They exist because it is not possible to create
combined indexes with geometries.

View File

@ -1,27 +0,0 @@
# OSM Data Import
OSM data is initially imported using [osm2pgsql](https://osm2pgsql.org).
Nominatim uses its own data output style 'gazetteer', which differs from the
output style created for map rendering.
## Database Layout
The gazetteer style produces a single table `place` with the following rows:
* `osm_type` - kind of OSM object (**N** - node, **W** - way, **R** - relation)
* `osm_id` - original OSM ID
* `class` - key of principal tag defining the object type
* `type` - value of principal tag defining the object type
* `name` - collection of tags that contain a name or reference
* `admin_level` - numerical value of the tagged administrative level
* `address` - collection of tags defining the address of an object
* `extratags` - collection of additional interesting tags that are not
directly relevant for searching
* `geometry` - geometry of the object (in WGS84)
A single OSM object may appear multiple times in this table when it is tagged
with multiple tags that may constitute a principal tag. Take for example a
motorway bridge. In OSM, this would be a way which is tagged with
`highway=motorway` and `bridge=yes`. This way would appear in the `place` table
once with `class` of `highway` and once with a `class` of `bridge`. Thus the
*unique key* for `place` is (`osm_type`, `osm_id`, `class`).

152
docs/develop/Indexing.md Normal file
View File

@ -0,0 +1,152 @@
# Indexing Places
In Nominatim, the word __indexing__ refers to the process that takes the raw
OpenStreetMap data from the place table, enriches it with address information
and creates the search indexes. This section explains the basic data flow.
## Initial import
After osm2pgsql has loaded the raw OSM data into the place table,
the data is copied to the final search tables placex and location_property_osmline.
While they are copied, some basic properties are added:
* country_code, geometry_sector and partition
* initial search and address rank
In addition the column `indexed_status` is set to `1` marking the place as one
that needs to be indexed.
All this happens in the triggers `placex_insert` and `osmline_insert`.
## Indexing
The main workhorse of the data import is the indexing step, where Nominatim
takes every place from the placex and location_property_osmline tables where
`indexed_status != 0` and computes the search terms and the address parts
of the place.
The indexing happens in three major steps:
1. **Data preparation** - The indexer gets the data for the place to be indexed
from the database.
2. **Search name processing** - The prepared data is given to the
tokenizer which computes the search terms from the names
and potentially other information.
3. **Address processing** - The indexer then hands the prepared data and the
tokenizer information back to the database via an `UPDATE` statement which
also sets the indexed_status to `0`. This fires the update triggers
`placex_update`/`osmline_update` which do the work of computing address
parts and filling all the search tables.
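The three steps can be sketched as a small driver function. The three callables are hypothetical stand-ins for the database access and the tokenizer; they are not Nominatim functions.

```python
def index_place(place_id, fetch_prepared, tokenize, write_back):
    """Run the three indexing steps for one place.

    All three callables are hypothetical stand-ins, not Nominatim functions.
    """
    prepared = fetch_prepared(place_id)          # 1. data preparation
    token_info = tokenize(prepared)              # 2. search name processing
    write_back(place_id, prepared, token_info)   # 3. address processing
    return token_info
```

Passing the steps in as callables keeps the sketch independent of any particular database driver or tokenizer.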
When computing the address terms of a place, Nominatim relies on the processed
search names of all the address parts. That is why places are processed in rank
order, from smallest rank to largest. To ensure correct handling of linked
place nodes, administrative boundaries are processed before all other places.
Apart from these restrictions, each place can be indexed independently
from the others. This allows a large degree of parallelization during the indexing.
It also means that the indexing process can be interrupted at any time and
will simply pick up where it left off when restarted.
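One possible reading of these ordering rules, as a sketch: administrative boundaries come first, and within each group places are sorted by ascending rank. The `Place` class and its fields are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Place:
    place_id: int
    rank: int
    is_admin_boundary: bool = False


def indexing_order(places):
    """Boundaries first, then all other places; each group by ascending rank."""
    return sorted(places, key=lambda p: (not p.is_admin_boundary, p.rank))


places = [Place(1, 26), Place(2, 8, True), Place(3, 4), Place(4, 4, True)]
order = [p.place_id for p in indexing_order(places)]
```

Places that compare equal under this key are independent of each other, so they could be handed to parallel workers in any order.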
### Data preparation
The data preparation step computes and retrieves all data for a place that
might be needed for the next step of processing the search name. That includes
* location information (country code)
* place classification (class, type, ranks)
* names (including names of linked places)
* address information (`addr:*` tags)
Data preparation is implemented in PL/pgSQL, mostly in the functions
`placex_indexing_prepare()` and `get_interpolation_address()`.
#### `addr:*` tag inheritance
Nominatim has limited support for inheriting address tags from a building
to POIs inside the building. This only works when the address tags are on the
building outline. Any rank 30 object inside such a building or on its outline
inherits all address tags when it does not have any address tags of its own.
The inheritance is computed in the data preparation step.
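The inheritance rule can be sketched as follows. The function and parameter names are hypothetical; finding the enclosing building is left out and its address tags are simply passed in.

```python
def effective_address(rank, own_address, building_address):
    """Return the address tags a place is indexed with.

    `building_address` holds the address tags of the enclosing building
    outline, or None when the place is not inside or on such a building.
    """
    if rank == 30 and not own_address and building_address:
        # A rank 30 object without own address tags inherits all of them.
        return dict(building_address)
    return dict(own_address)
```

Note that inheritance is all or nothing in this sketch: a POI with even one address tag of its own keeps only its own tags.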
### Search name processing
The prepared place information is handed to the tokenizer next. This is a
Python module responsible for processing the names from both name and address
terms and building up the word index from them. The process is explained in
more detail in the [Tokenizer chapter](Tokenizers.md).
### Address processing
Finally, the preprocessed place information and the results of the search name
processing are written back to the database. At this point the update triggers
of the placex/location_property_osmline tables take over and fill all the
dependent tables. This is the most work-intensive part of the indexing.
Nominatim distinguishes between dependent and independent places.
**Dependent places** are all places with rank 30: house numbers, POIs etc. These
places don't have a full address of their own. Instead they are attached to
a parent street or place and use the information of the parent for searching
and displaying information. All other places are **independent places**: streets,
parks, water bodies, suburbs, cities, states etc. They receive a full address
of their own.
The address processing differs considerably between the two types of places.
#### Independent places
To compute the address of an independent place, Nominatim searches for all
places that cover it at least partially.
For places with an area, that area is used to check for coverage. For place
nodes an artificial square area is computed according to the rank of
the place. The lower the rank, the larger the area. The `location_area_large_X`
tables are there to facilitate the lookup. All places that can function as
the address of another place are saved in those tables.
`addr:*` and `isin:*` tags are taken into account to compute the address, too.
Nominatim will give preference to places with the same name as in these tags
when looking for places in the vicinity. If there are no matching place names
at all, then the tags are at least added to the search index. That means that
the names will not be shown in the result as the 'address' of the place, but
searching by them still works.
Independent places are always added to the global search index `search_name`.
#### Dependent places
Dependent places skip the full address computation for performance reasons.
Instead they just find a parent place to attach themselves to.
![parenting of dependent places](parenting-flow.svg)
By default a POI or house number will be attached to the closest street. That
can be any major or minor street indexed by Nominatim. In the default
configuration this includes footways, but only when they have a name.
When the dependent place has an `addr:street` tag, then Nominatim will first
try to find a street with the same name before falling back to the closest
street.
There are also addresses in OSM, where the housenumber does not belong
to a street at all. These have an `addr:place` tag. For these places, Nominatim
tries to find a place with the given name in the indexed places with an
address rank between 16 and 25. If none is found, then the dependent place
is attached to the closest place in that category and the `addr:place` name is
added as an *unlisted* place, which indicates to Nominatim that it needs to add
it to the address output, no matter what. This special case is necessary to
cover addresses that don't really refer to an existing object.
When an address has both the `addr:street` and `addr:place` tag, then Nominatim
assumes that the `addr:place` tag in fact should be the city part of the address
and gives the POI the usual street number address.
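The parenting rules above can be summarised in a short sketch. It works on plain lists of names ordered by distance instead of database lookups, and the function name and return format are invented for the example.

```python
def choose_parent(address, streets_nearby, places_nearby):
    """Pick a parent for a dependent place.

    `streets_nearby` and `places_nearby` are lists of names ordered by
    distance, closest first. Returns (kind, parent_name, unlisted_place).
    """
    def closest(names):
        return names[0] if names else None

    street = address.get("addr:street")
    place = address.get("addr:place")
    if street is not None:
        # addr:street wins, even when addr:place is present as well.
        if street in streets_nearby:
            return ("street", street, None)
        return ("street", closest(streets_nearby), None)
    if place is not None:
        if place in places_nearby:
            return ("place", place, None)
        # No matching place: attach to the closest one and keep the tagged
        # name as an unlisted place for the address output.
        return ("place", closest(places_nearby), place)
    return ("street", closest(streets_nearby), None)
```

Checking `addr:street` before `addr:place` reflects the rule that an address with both tags is treated as a regular street address.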
Dependent places are only added to the global search index `search_name` when
they have either a name themselves or when they have address tags that are not
covered by the places that make up their address. The latter ensures that
addresses are always searchable by those address tags.

View File

@ -0,0 +1,35 @@
@startuml
skinparam monochrome true
skinparam ObjectFontStyle bold
map search_name_X {
place_id => BIGINT
address_rank => SMALLINT
name_vector => INT[]
centroid => GEOMETRY
}
map location_area_large_X {
place_id => BIGINT
keywords => INT[]
partition => SMALLINT
rank_search => SMALLINT
rank_address => SMALLINT
country_code => VARCHAR(2)
isguess => BOOLEAN
postcode => TEXT
centroid => POINT
geometry => GEOMETRY
}
map location_road_X {
place_id => BIGINT
partition => SMALLINT
country_code => VARCHAR(2)
geometry => GEOMETRY
}
search_name_X -[hidden]> location_area_large_X
location_area_large_X -[hidden]> location_road_X
@enduml

File diff suppressed because one or more lines are too long


View File

@ -0,0 +1,44 @@
@startuml
skinparam monochrome true
skinparam ObjectFontStyle bold
map planet_osm_nodes #eee {
id => BIGINT
lat => INT
lon => INT
}
map planet_osm_ways #eee {
id => BIGINT
nodes => BIGINT[]
tags => TEXT[]
}
map planet_osm_rels #eee {
id => BIGINT
parts => BIGINT[]
members => TEXT[]
tags => TEXT[]
way_off => SMALLINT
rel_off => SMALLINT
}
map place {
osm_type => CHAR(1)
osm_id => BIGINT
class => TEXT
type => TEXT
name => HSTORE
address => HSTORE
extratags => HSTORE
admin_level => SMALLINT
geometry => GEOMETRY
}
planet_osm_nodes -[hidden]> planet_osm_ways
planet_osm_ways -[hidden]> planet_osm_rels
planet_osm_ways -[hidden]-> place
planet_osm_nodes::id <- planet_osm_ways::nodes
@enduml

File diff suppressed because one or more lines are too long


View File

@ -0,0 +1,31 @@
@startuml
skinparam monochrome true
start
if (has 'addr:street'?) then (yes)
if (street with that name\n nearby?) then (yes)
:**Use closest street**
**with same name**;
kill
else (no)
:**Use closest**\n**street**;
kill
endif
elseif (has 'addr:place'?) then (yes)
if (place with that name\n nearby?) then (yes)
:**Use closest place**
**with same name**;
kill
else (no)
:add addr:place to address;
:**Use closest place**\n**rank 16 to 25**;
kill
endif
else (otherwise)
:**Use closest**\n**street**;
kill
endif
@enduml

File diff suppressed because one or more lines are too long


View File

@ -0,0 +1,99 @@
@startuml
skinparam monochrome true
skinparam ObjectFontStyle bold
left to right direction
map placex {
place_id => BIGINT
osm_type => CHAR(1)
osm_id => BIGINT
class => TEXT
type => TEXT
name => HSTORE
address => HSTORE
extratags => HSTORE
admin_level => SMALLINT
partition => SMALLINT
geometry_sector => INT
parent_place_id => BIGINT
linked_place_id => BIGINT
importance => DOUBLE
rank_search => SMALLINT
rank_address => SMALLINT
wikipedia => TEXT
country_code => VARCHAR(2)
housenumber => TEXT
postcode => TEXT
indexed_status => SMALLINT
indexed_date => TIMESTAMP
centroid => GEOMETRY
geometry => GEOMETRY
}
map search_name {
place_id => BIGINT
importance => DOUBLE
search_rank => SMALLINT
address_rank => SMALLINT
name_vector => INT[]
nameaddress_vector => INT[]
country_code => VARCHAR(2)
centroid => GEOMETRY
}
map word {
word_id => INT
word_token => TEXT
... =>
}
map location_property_osmline {
place_id => BIGINT
osm_id => BIGINT
startnumber => INT
endnumber => INT
interpolationtype => TEXT
address => HSTORE
partition => SMALLINT
geometry_sector => INT
parent_place_id => BIGINT
country_code => VARCHAR(2)
postcode => TEXT
indexed_status => SMALLINT
indexed_date => TIMESTAMP
linegeo => GEOMETRY
}
map place_addressline {
place_id => BIGINT
address_place_id => BIGINT
distance => DOUBLE
cached_rank_address => SMALLINT
fromarea => BOOLEAN
isaddress => BOOLEAN
}
map location_postcode {
place_id => BIGINT
postcode => TEXT
parent_place_id => BIGINT
rank_search => SMALLINT
rank_address => SMALLINT
indexed_status => SMALLINT
indexed_date => TIMESTAMP
geometry => GEOMETRY
}
placex::place_id <-- search_name::place_id
placex::place_id <-- place_addressline::place_id
placex::place_id <-- place_addressline::address_place_id
search_name::name_vector --> word::word_id
search_name::nameaddress_vector --> word::word_id
place_addressline -[hidden]> location_property_osmline
search_name -[hidden]> place_addressline
location_property_osmline -[hidden]-> location_postcode
@enduml

File diff suppressed because one or more lines are too long


View File

@ -34,10 +34,11 @@ pages:
- 'External data: US housenumbers from TIGER': 'customize/Tiger.md'
- 'External data: Postcodes': 'customize/Postcodes.md'
- 'Developers Guide':
- 'Setup for Development' : 'develop/Development-Environment.md'
- 'Architecture Overview' : 'develop/overview.md'
- 'OSM Data Import' : 'develop/Import.md'
- 'Database Layout' : 'develop/Database-Layout.md'
- 'Indexing' : 'develop/Indexing.md'
- 'Tokenizers' : 'develop/Tokenizers.md'
- 'Setup for Development' : 'develop/Development-Environment.md'
- 'Testing' : 'develop/Testing.md'
- 'External Data Sources': 'develop/data-sources.md'
- 'Appendix':