ref
https://linear.app/ghost/issue/ENG-1783/add-time-to-create-connection-metric
- Since we added the "time to acquire" metric to get visibility into
contention in the connection pool, we've seen some anomalies where it
takes a surprisingly long time to acquire a connection (~60ms) when not
under load. Our hypothesis is that these anomalies occur when there
aren't any open connections available, so Ghost has to establish a new
connection to the DB, and that connection setup is what actually takes
most of that time. This new metric should help confirm or rule out that
hypothesis (a rough sketch of how it could be wired up follows this
list).
- This will also be an interesting metric to keep an eye on and/or
alert on — if Ghost can't create new connections to its database
quickly, it's not going to perform well overall.
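A minimal sketch of how the "time to create" metric could be wired up,
assuming knex's underlying tarn pool and its `createRequest`/`createSuccess`
events; the metric name below is a placeholder, not necessarily what Ghost
uses:

```js
// Sketch only: times how long the pool takes to open a brand-new DB
// connection. Assumes tarn pool events; the metric name is a placeholder.
const client = require('prom-client');

function instrumentConnectionCreation(knex) {
    const createDuration = new client.Summary({
        name: 'ghost_db_connection_created_seconds', // placeholder name
        help: 'Time taken to establish a new connection to the database',
        percentiles: [0.5, 0.9, 0.99]
    });

    const pendingCreates = new Map(); // eventId -> start time

    knex.client.pool.on('createRequest', (eventId) => {
        pendingCreates.set(eventId, process.hrtime.bigint());
    });

    knex.client.pool.on('createSuccess', (eventId) => {
        const start = pendingCreates.get(eventId);
        if (start === undefined) {
            return;
        }
        pendingCreates.delete(eventId);
        createDuration.observe(Number(process.hrtime.bigint() - start) / 1e9);
    });
}
```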
ref
https://linear.app/ghost/issue/ENG-1796/reuse-tcp-connections-when-sending-metrics-to-the-pushgateway
- When we rolled out the prometheus metrics collection, it overwhelmed
the pushgateway. Our hypothesis is that Ghost was creating too many new
TCP connections to the pushgateway.
- The prometheus client was opening a new TCP connection to the
pushgateway on every push, i.e. every 15 seconds.
- This commit changes the prometheus client to keep the connection
alive and reuse it instead of creating a new one on every push
(sketched below).
- It also limits the number of retries if pushing the metrics fails —
after 3 consecutive failures, Ghost will stop retrying and log an error.
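A rough sketch of both changes, assuming prom-client's `Pushgateway` with
its promise-based `pushAdd` API (prom-client v14+); the URL, job name and
interval are placeholders:

```js
// Sketch only: reuse one TCP connection via a keep-alive agent and stop
// pushing after 3 consecutive failures. URL/jobName/interval are placeholders.
const http = require('http');
const client = require('prom-client');

const gateway = new client.Pushgateway('http://localhost:9091', {
    agent: new http.Agent({keepAlive: true, maxSockets: 1}) // reuse the same socket between pushes
}, client.register);

let consecutiveFailures = 0;
const MAX_FAILURES = 3;

const timer = setInterval(async () => {
    try {
        await gateway.pushAdd({jobName: 'ghost'});
        consecutiveFailures = 0; // reset on any successful push
    } catch (err) {
        consecutiveFailures += 1;
        if (consecutiveFailures >= MAX_FAILURES) {
            console.error('Failed to push metrics 3 times in a row; giving up', err);
            clearInterval(timer);
        }
    }
}, 15000);
```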
ref
https://linear.app/ghost/issue/ENG-1769/improve-pool-utilization-metric
- Currently the connection pool metrics are all point-in-time metrics,
and with a scrape interval of 15s they don't tell us much about what's
actually happening in the pool.
- This commit adds a Summary metric to track how long each transaction
has to wait to acquire a connection from the pool, which should be a
good indicator of contention in the pool (see the sketch after this
list).
- Also moved the call to `prometheusClient.instrumentKnex` to after
`initCore` in the boot process, because the metric depends on event
listeners on `knex.client.pool`, and the pool is destroyed and
recreated in `initCore`, which removes those listeners.
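A minimal sketch of the acquire-wait Summary, assuming tarn's
`acquireRequest`/`acquireSuccess` pool events; the metric name is a
placeholder. Because the listeners are attached to `knex.client.pool`, this
has to run after `initCore` has recreated the pool:

```js
// Sketch only: measures how long callers wait for a pooled connection.
// Assumes tarn pool events; the metric name is a placeholder.
const client = require('prom-client');

function instrumentAcquireWait(knex) {
    const acquireDuration = new client.Summary({
        name: 'ghost_db_connection_acquire_seconds', // placeholder name
        help: 'Time spent waiting to acquire a connection from the pool',
        percentiles: [0.5, 0.9, 0.99]
    });

    const pendingAcquires = new Map(); // eventId -> start time

    knex.client.pool.on('acquireRequest', (eventId) => {
        pendingAcquires.set(eventId, process.hrtime.bigint());
    });

    knex.client.pool.on('acquireSuccess', (eventId) => {
        const start = pendingAcquires.get(eventId);
        if (start === undefined) {
            return;
        }
        pendingAcquires.delete(eventId);
        acquireDuration.observe(Number(process.hrtime.bigint() - start) / 1e9);
    });
}
```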
no issue
- The `instrumentKnex` method was directly accessing the `promClient`
instance to create custom metrics and keeping track of them manually in
a `customMetrics` map. This isn't necessary, since the metrics are all
tracked within the `promClient` instance's registry. The method now
uses the `prometheusClient.register...()` methods to create the metrics
and retrieves them with `getMetric()`, which reduces duplicated work
and manual bookkeeping (see the sketch below).
- This also removes the query count metric, as a count is already
included in the query duration Summary metric.
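A rough sketch of the refactored `instrumentKnex` under these assumptions:
the helper names (`registerSummary`, `getMetric`), the import path and the
metric name are illustrative, based on the description above, not a
verbatim copy of the Ghost code:

```js
const prometheusClient = require('./prometheus-client'); // hypothetical import path

function instrumentKnex(knex) {
    // Register once; the prom-client registry is the single source of truth,
    // so the old `customMetrics` map is no longer needed.
    prometheusClient.registerSummary({
        name: 'ghost_db_query_duration_seconds', // placeholder name
        help: 'Duration of SQL queries',
        percentiles: [0.5, 0.9, 0.99]
    });

    const queryStartTimes = new Map(); // __knexQueryUid -> start time

    knex.on('query', (query) => {
        queryStartTimes.set(query.__knexQueryUid, process.hrtime.bigint());
    });

    knex.on('query-response', (response, query) => {
        const start = queryStartTimes.get(query.__knexQueryUid);
        if (start === undefined) {
            return;
        }
        queryStartTimes.delete(query.__knexQueryUid);
        const seconds = Number(process.hrtime.bigint() - start) / 1e9;
        // Retrieve the metric from the registry instead of a local map
        prometheusClient.getMetric('ghost_db_query_duration_seconds').observe(seconds);
    });
}
```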
ref
https://linear.app/ghost/issue/ENG-1771/add-utility-functions-to-easily-create-custom-metrics
- Currently, adding custom metrics to our prometheus client requires
you to access `prometheusClient.client` directly to create the metrics.
- This isn't very convenient, as you then have to either keep the
metric in a local variable or manually fetch it from
`prometheusClient.client.register`.
- This commit exposes some utility functions on the `prometheusClient`
class for registering metrics and for retrieving metrics that have
already been registered (see the sketch below).
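One way such helpers might look, assuming prom-client's default registry;
the class shape, method names and the metric in the usage example are
illustrative:

```js
const client = require('prom-client');

class PrometheusClient {
    constructor() {
        this.client = client;
    }

    // Metrics register themselves on the default registry when constructed,
    // so no extra bookkeeping is needed here.
    registerCounter({name, help, labelNames = []}) {
        return new this.client.Counter({name, help, labelNames});
    }

    registerSummary({name, help, percentiles = [0.5, 0.9, 0.99]}) {
        return new this.client.Summary({name, help, percentiles});
    }

    // Look the metric up in the registry instead of keeping it in a local
    // variable or a custom map.
    getMetric(name) {
        return this.client.register.getSingleMetric(name);
    }
}

// Usage (hypothetical metric name):
// const prometheusClient = new PrometheusClient();
// prometheusClient.registerCounter({name: 'ghost_emails_sent', help: 'Emails sent'});
// prometheusClient.getMetric('ghost_emails_sent').inc();
```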
ref
https://linear.app/ghost/issue/ENG-1592/start-monitoring-connection-pool-utilization-in-ghost
- This commit adds prometheus metrics to the connection pool so we can
start tracking connection pool utilization and the number of pending
acquires, and also adds some basic SQL query summary metrics like
queries per minute and query duration percentiles (a sketch of the pool
gauges follows this list).
- The connection pool has long been theorized to be a main constraint
on Ghost's performance, but it's been challenging to get real
visibility into its state. With this change, we should be able to
directly observe, monitor and alert on the connection pool.
- Updated the grafana version to pick up a query editor bug fix that
shipped in 8.3, even though this puts us a couple of versions ahead of
production.
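A minimal sketch of the point-in-time pool gauges, assuming knex's tarn
pool API (`numUsed`, `numFree`, `numPendingAcquires`) and prom-client's
`collect()` hook; the metric names are placeholders:

```js
// Sketch only: gauges sampled at scrape/push time from the tarn pool
// exposed at knex.client.pool. Metric names are placeholders.
const client = require('prom-client');

function instrumentPool(knex) {
    const pool = knex.client.pool;

    new client.Gauge({
        name: 'ghost_db_connection_pool_used', // placeholder name
        help: 'Connections currently checked out of the pool',
        collect() {
            this.set(pool.numUsed());
        }
    });

    new client.Gauge({
        name: 'ghost_db_connection_pool_free', // placeholder name
        help: 'Idle connections currently available in the pool',
        collect() {
            this.set(pool.numFree());
        }
    });

    new client.Gauge({
        name: 'ghost_db_connection_pool_pending_acquires', // placeholder name
        help: 'Callers currently waiting to acquire a connection',
        collect() {
            this.set(pool.numPendingAcquires());
        }
    });
}
```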
ref
https://linear.app/ghost/issue/ENG-1746/enable-ghost-to-push-metrics-to-a-pushgateway
- We'd like to use prometheus to expose metrics from Ghost, but the
"standard" approach of having prometheus scrape the `/metrics` endpoint
adds some complexity and additional challenges on Pro.
- A simpler suggested alternative is to use a pushgateway: have Ghost
_push_ metrics to prometheus, rather than have prometheus scrape the
running instances.
- This PR introduces this functionality behind a configuration flag.
- It also includes a refactor of the current metrics-server
implementation so all the prometheus-related code is colocated and the
configuration is a bit more organized. `@tryghost/metrics-server` has
been renamed to `@tryghost/prometheus-metrics`, and it now includes the
metrics server and the prometheus client code itself (including the
pushgateway code).
- To enable the prometheus client alone, `prometheus:enabled` must be
true. This will _not_ enable the metrics server or the pushgateway — it
will essentially collect the metrics, but not do anything with them.
- To enable the metrics server, set `prometheus:metrics_server:enabled`
to true. You can also configure the host and port that the metrics
server should export the `/metrics` endpoint on in the
`prometheus:metrics_server` block.
- To enable the pushgateway, set `prometheus:pushgateway:enabled` to
true. You can also configure the pushgateway's `url`, the `interval` at
which it should push metrics (in milliseconds), and the `jobName` in
the `prometheus:pushgateway` block (see the example config below).
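Putting it together, a hypothetical config snippet with all three pieces
enabled. Key names follow the description above; the host, port, URL,
interval and job name values are placeholders, not Ghost defaults:

```json
{
    "prometheus": {
        "enabled": true,
        "metrics_server": {
            "enabled": true,
            "host": "127.0.0.1",
            "port": 9416
        },
        "pushgateway": {
            "enabled": true,
            "url": "http://localhost:9091",
            "interval": 15000,
            "jobName": "ghost"
        }
    }
}
```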