graphql-engine/rfcs/catalog-migration-db-stats-logging.md
Karthikeyan Chinnakonda 8acd35c2e7 rfc: log DB stats when the catalog is migrated
PR-URL: https://github.com/hasura/graphql-engine-mono/pull/2637
GitOrigin-RevId: c605d93172ae3cbde669bcac76583fe3903dec34
2021-10-20 12:33:23 +00:00

3.7 KiB

Logging database stats during source catalog migration

Metadata

---
authors:
  Karthikeyan Chinnakonda <karthikeyan@hasura.io>
state: In review
---

Description

Sometimes we need to alter the structure of the source catalog, for example adding a new column or transforming an existing column's type. The problem many users have faced during migrations is that the DDL queries run during the migration can take locks on the database and, as a result, the graphql-engine doesn't start up. When this happens, the graphql-engine emits no logs, so the user has no way to know what's happening, which is frustrating.

Proposed solutions

  1. When the source catalog migration is being done, we can log the output of the following SQL query every five seconds until the migration completes:

SELECT json_agg(
         json_build_object(
           'query', psa.query,
           'lock_granted', pl.granted,
           'lock_mode', pl.mode,
           'transaction_start_time', psa.xact_start,
           'query_start_time', psa.query_start,
           'wait_event_type', psa.wait_event_type,
           'blocking_query', SUBSTRING(blocking.query, 1, 20)
         )
         ORDER BY psa.query_start
       )
FROM      pg_stat_activity psa
JOIN      pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(psa.pid))
LEFT JOIN pg_locks pl ON psa.pid = pl.pid
WHERE     psa.query LIKE '%hdb_catalog%'
  AND     psa.wait_event_type IS NOT NULL
  AND     psa.query ILIKE ANY (ARRAY['%create%', '%drop%', '%alter%']);

The above query only looks for DDL statements run against the hdb_catalog schema. Only the first 20 characters of the blocking query are output because the query may contain sensitive data.

The output of the above query looks like the following:

[
  {
    "query": "INSERT INTO authors (name) values ('boo');",
    "lock_granted": true,
    "lock_mode": "ExclusiveLock",
    "transaction_start_time": "2021-10-19T12:08:14.903342+00:00",
    "query_start_time": "2021-10-19T12:08:14.903342+00:00",
    "wait_event_type": "Lock",
    "blocking_query": "LOCK authors IN ACCESS EXCLUSIVE MODE;"
  },
  {
    "query": "INSERT INTO authors (name) values ('boo');",
    "lock_granted": false,
    "lock_mode": "RowExclusiveLock",
    "transaction_start_time": "2021-10-19T12:08:14.903342+00:00",
    "query_start_time": "2021-10-19T12:08:14.903342+00:00",
    "wait_event_type": "Lock",
    "blocking_query": "LOCK authors IN ACCESS EXCLUSIVE MODE;"
  }
]

Doing this lets the user know that some of the queries running on the database are blocked waiting on locks.

Implementation details

  1. We'll create a new function logPGSourceCatalogStats which will repeatedly query the database using the above-mentioned SQL query, at fixed intervals of five seconds, and log the result. The type signature of the function will look like:

logPGSourceCatalogStats :: forall pgKind m. (MonadIO m, MonadTx m) => Logger Hasura -> SourceConfig ('Postgres pgKind) -> m Void
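
To make the intended behaviour concrete, here is a minimal sketch of that loop. fetchCatalogLockStats and mkCatalogStatsLog are hypothetical helpers invented for this sketch: the first would run the SQL query above via MonadTx, the second would wrap its result in an engine log line. The real names and plumbing may differ.

-- Hypothetical sketch; `fetchCatalogLockStats` and `mkCatalogStatsLog` are
-- placeholders, not existing engine functions.
-- Needs: Control.Monad (forever), Control.Concurrent (threadDelay),
-- Control.Monad.IO.Class (liftIO).
logPGSourceCatalogStats ::
  forall pgKind m.
  (MonadIO m, MonadTx m) =>
  Logger Hasura ->
  SourceConfig ('Postgres pgKind) ->
  m Void
logPGSourceCatalogStats logger sourceConfig = forever $ do
  stats <- fetchCatalogLockStats sourceConfig -- run the stats SQL above
  unLogger logger (mkCatalogStatsLog stats)   -- emit a structured log line
  liftIO $ threadDelay (5 * 1000 * 1000)      -- sleep for five seconds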

  2. The logPGSourceCatalogStats function will be run in a separate thread so as not to block the migrations. This can be done by calling logPGSourceCatalogStats using the forkManagedT function, as in the sketch below.
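
A hedged sketch of the wiring, assuming forkManagedT takes a thread label, a logger, and a never-returning action (its exact signature in the engine may differ), and assuming a simplified surrounding migration entry point:

-- Hypothetical wiring; `migrateSourceCatalogWithStats` and
-- `runSourceMigrations` are placeholders for the actual migration entry
-- points, and forkManagedT's signature is assumed here.
migrateSourceCatalogWithStats logger sourceConfig = do
  -- fork the stats logger so the migration itself is never blocked by it
  _statsLoggerThread <-
    forkManagedT "logPGSourceCatalogStats" logger $
      logPGSourceCatalogStats logger sourceConfig
  runSourceMigrations sourceConfig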

  3. To avoid implementing this feature for metadata DB catalog migrations, we need to make the 40_to_41 migration a no-op and move its logic into a source catalog migration; that is, make 40_to_41.sql a no-op migration and move the migration to 0_to_1.sql in pg_source_migrations. The 41_to_40.sql migration should change accordingly: it has to first check whether the locked column is a timestamp with time zone and, only if it is, alter it to the boolean type; otherwise it should do nothing. A sketch of such a guarded migration follows.
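
For illustration, a hedged sketch of what such a guarded 41_to_40.sql could look like. The hdb_catalog.event_log table and its locked column are assumptions about what the 40_to_41 migration touches, and the USING clause is just one possible way to map timestamps back to booleans:

DO $$
BEGIN
  -- only downgrade when the column is currently a timestamptz
  IF EXISTS (
    SELECT 1
    FROM information_schema.columns
    WHERE table_schema = 'hdb_catalog'
      AND table_name   = 'event_log'
      AND column_name  = 'locked'
      AND data_type    = 'timestamp with time zone'
  ) THEN
    ALTER TABLE hdb_catalog.event_log
      ALTER COLUMN locked TYPE boolean
      USING (locked IS NOT NULL);
  END IF;
END
$$;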

Open questions

  1. What will the user do with the above info to resolve the locking? Or is the logging merely informational?