docs: refactor metrics table for EE

PR-URL: https://github.com/hasura/graphql-engine-mono/pull/9013
GitOrigin-RevId: f867dc5daa22a9a76357280f5079eae5942963eb
Rob Dominguez 2023-05-03 15:25:47 -05:00 committed by hasura-bot
parent c1705a09df
commit f9c55a2a04


@ -34,205 +34,234 @@ HASURA_GRAPHQL_METRICS_SECRET=<secret>
curl 'http://127.0.0.1:8080/v1/metrics' -H 'Authorization: Bearer <secret>'
```
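The same request can be made from a script. The following is a minimal sketch, assuming the endpoint responds in the Prometheus text exposition format; the helper names `build_metrics_request` and `parse_samples` are hypothetical and not part of Hasura.

```python
import re
import urllib.request

# One sample line: name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def build_metrics_request(base_url, secret):
    """Build (but do not send) a GET request for /v1/metrics with the bearer secret."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/metrics",
        headers={"Authorization": "Bearer " + secret},
    )


def parse_samples(text):
    """Parse exposition text into (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and # HELP / # TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

To fetch live data, pass the request to `urllib.request.urlopen`, decode the response body as UTF-8, and hand the text to `parse_samples`.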
:::note Note
:::info Configure a secret
- The metrics endpoint should be configured with a secret to prevent misuse and should not be exposed over the internet.
The metrics endpoint should be configured with a secret to prevent misuse and should not be exposed over the internet.
:::
## Metrics exported
<table>
<tr>
<td>Name</td>
<td>Description</td>
<td>Type</td>
<td>Labels</td>
<td>Comment</td>
</tr>
<tr>
<td><code>hasura_http_connections</code></td>
<td>Current number of active HTTP connections (excluding WebSocket connections)</td>
<td>Gauge</td>
<td>none</td>
<td>Represents the HTTP load on the server</td>
</tr>
<tr>
<td><code>hasura_websocket_connections</code></td>
<td>Current number of active WebSocket connections</td>
<td>Gauge</td>
<td>none</td>
<td>Represents the WebSocket load on the server.</td>
</tr>
<tr>
<td><code>hasura_active_subscriptions</code></td>
<td>Current number of active subscriptions</td>
<td>Gauge</td>
<td>none</td>
<td>Represents the subscription load on the server.</td>
</tr>
<tr>
<td><code>hasura_graphql_requests_total</code></td>
<td>Number of GraphQL requests received </td>
<td>Counter</td>
<td>&#8226; "operation_type": query|mutation|subscription|unknown <br/>
&#8226; The "unknown" operation type will be returned for queries that fail authorization, parsing, or certain
validations<br/>
&#8226; "response_status": success|failed
</td>
<td>Represents the GraphQL query/mutation traffic on the server.</td>
</tr>
<tr>
<td><code>hasura_graphql_execution_time_seconds</code></td>
<td>Execution time of successful GraphQL requests (excluding subscriptions)</td>
<td>Histogram<br/><br/>Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10</td>
<td>&#8226; "operation_type": query|mutation</td>
<td>If more requests are falling in the higher buckets, you should consider <a href="/latest/deployment/performance-tuning">tuning the performance</a>.</td>
</tr>
<tr>
<td><code>hasura_event_queue_time_seconds</code></td>
<td>Queue time for an event already in the processing queue</td>
<td>Histogram<br/><br/>Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100</td>
<td>none</td>
<td>More events in the higher buckets imply slow processing. Consider increasing the <a href="/latest/deployment/graphql-engine-flags/reference/#events-http-pool-size">HTTP pool size</a> or optimizing the webhook server.</td>
</tr>
<tr>
<td><code>hasura_event_fetch_time_per_batch_seconds</code></td>
<td>Latency of fetching a batch of events</td>
<td>Histogram<br/><br/>Buckets: 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10</td>
<td>none</td>
<td>A higher value indicates slower polling of events from the database. Consider looking into the performance of your database.</td>
</tr>
<tr>
<td><code>hasura_event_webhook_processing_time_seconds</code></td>
<td>The time between when an HTTP worker picks an event for delivery to the time its response is updated in the DB</td>
<td>Histogram<br/><br/>Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10</td>
<td>none</td>
<td>A higher processing time indicates a slow webhook. Try to optimize the event webhook.</td>
</tr>
<tr>
<td><code>hasura_event_processing_time_seconds</code></td>
<td>The time taken for an event to be delivered since it's been created (if first attempt) or retried (after first attempt).</td>
<td>Histogram<br/><br/>Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100</td>
<td>none</td>
<td>This metric can be considered as the end-to-end processing time for an event.</td>
</tr>
<tr>
<td><code>hasura_event_trigger_http_workers</code></td>
<td>Current number of active Event Trigger HTTP workers</td>
<td>Gauge</td>
<td>none</td>
<td>Compare this number to the <a href="/latest/deployment/graphql-engine-flags/reference/#events-http-pool-size">HTTP pool size</a>. Consider increasing it if the metric is near the current configured value.</td>
</tr>
<tr>
<td><code>hasura_event_processed_total</code></td>
<td>Total number of events processed</td>
<td>Counter</td>
<td>&#8226; "status": success|failed</td>
<td>Represents the Event Trigger egress.</td>
</tr>
<tr>
<td><code>hasura_event_invocations_total</code></td>
<td>Total number of events invoked</td>
<td>Counter</td>
<td>&#8226; "status": success|failed</td>
<td>Represents the Event Trigger webhook HTTP requests made.</td>
</tr>
<tr>
<td><code>hasura_postgres_connections</code></td>
<td>Current number of active PostgreSQL connections</td>
<td>Gauge</td>
<td>&#8226; "source_name": name of the database<br/>
&#8226; "conn_info": connection url string (password omitted) or name of the connection url environment variable<br/>
&#8226; "role": primary|replica
</td>
<td>Compare this to <a href="/latest/api-reference/syntax-defs/#pgpoolsettings">pool settings</a>.</td>
</tr>
<tr>
<td><code>hasura_cron_events_invocation_total</code></td>
<td>Total number of cron events invoked</td>
<td>Counter</td>
<td>&#8226; "status": success|failed<br /></td>
<td>Total number of invocations made for cron events.</td>
</tr>
<tr>
<td><code>hasura_cron_events_processed_total</code></td>
<td>Total number of cron events processed</td>
<td>Counter</td>
<td>&#8226; "status": success|failed<br /></td>
<td>
Compare this to <code>hasura_cron_events_invocation_total</code>. A large difference between the two metrics
indicates a high failure rate of the cron webhook.
</td>
</tr>
<tr>
<td><code>hasura_oneoff_events_invocation_total</code></td>
<td>Total number of one-off events invoked</td>
<td>Counter</td>
<td>&#8226; "status": success|failed<br /></td>
<td>Total number of invocations made for one-off events.</td>
</tr>
The following metrics are exported by Hasura GraphQL Engine:
<tr>
<td>
<code>hasura_oneoff_events_processed_total</code>
</td>
<td>Total number of one-off events processed</td>
<td>Counter</td>
<td>
&#8226; "status": success|failed
<br />
</td>
<td>
Compare this to <code>hasura_oneoff_events_invocation_total</code>. A large difference between the two metrics
indicates a high failure rate of the one-off webhook.
</td>
</tr>
<tr>
<td>
<code>hasura_active_subscription_pollers</code>
</td>
<td>Current number of active subscription pollers. A subscription poller <a href="https://github.com/hasura/graphql-engine/blob/master/architecture/live-queries.md#idea-3-batch-multiple-live-queries-into-one-sql-query">multiplexes</a> similar subscriptions together.
</td>
<td>Gauge</td>
<td>
&#8226; "subscription_kind": streaming|live-query
<br />
</td>
<td>
The value of this metric is supposed to be proportional to the number of uniquely parameterised subscriptions i.e. subscriptions with the same selection set
but with different input arguments and session variables are multiplexed on the same poller.
If this metric is high then it may be an indication that there are too many uniquely parameterised subscriptions
which could be optimized for better performance.
</td>
</tr>
<tr>
<td>
<code>hasura_active_subscription_pollers_in_error_state</code>
</td>
<td>Current number of active subscription pollers that are in the error state.
A subscription poller <a href="https://github.com/hasura/graphql-engine/blob/master/architecture/live-queries.md#idea-3-batch-multiple-live-queries-into-one-sql-query">multiplexes</a>
similar subscriptions together.
</td>
<td>Gauge</td>
<td>
&#8226; "subscription_kind": streaming|live-query
<br />
</td>
<td>
A non-zero value of this metric indicates that there are runtime errors in at least one of the subscription pollers that are running
in Hasura. In most cases, runtime errors in subscriptions are caused by changes at the data model layer, and fixing the
issue at the data model layer should automatically fix the runtime errors.
</td>
</tr>
### Hasura active subscription pollers
Current number of active subscription pollers. A subscription poller
[multiplexes](https://github.com/hasura/graphql-engine/blob/master/architecture/live-queries.md#idea-3-batch-multiple-live-queries-into-one-sql-query)
similar subscriptions together. The value of this metric should be proportional to the number of uniquely parameterized
subscriptions (i.e., subscriptions with the same selection set, but with different input arguments and session variables
are multiplexed on the same poller). If this metric is high then it may be an indication that there are too many
uniquely parameterized subscriptions which could be optimized for better performance.
| | |
| ------ | -------------------------------------------- |
| Name | `hasura_active_subscription_pollers` |
| Type | Gauge |
| Labels | `subscription_kind`: streaming \| live-query |
</table>
### Hasura active subscription pollers in error state
:::note Note
Current number of active subscription pollers that are in the error state. A subscription poller
[multiplexes](https://github.com/hasura/graphql-engine/blob/master/architecture/live-queries.md#idea-3-batch-multiple-live-queries-into-one-sql-query)
similar subscriptions together. A non-zero value of this metric indicates that there are runtime errors in at least one
of the subscription pollers that are running in Hasura. In most cases, runtime errors in subscriptions are caused by
changes at the data model layer, and fixing the issue at the data model layer should automatically fix the runtime
errors.
The GraphQL request execution time:
| | |
| ------ | --------------------------------------------------- |
| Name | `hasura_active_subscription_pollers_in_error_state` |
| Type | Gauge |
| Labels | `subscription_kind`: streaming \| live-query |
### Hasura active subscriptions
Current number of active subscriptions, representing the subscription load on the server.
| | |
| ------ | ----------------------------- |
| Name | `hasura_active_subscriptions` |
| Type | Gauge |
| Labels | none |
### Hasura cron events invocation total
Total number of cron events invoked, representing the number of invocations made for cron events.
| | |
| ------ | ------------------------------------- |
| Name | `hasura_cron_events_invocation_total` |
| Type | Counter |
| Labels | `status`: success \| failed |
### Hasura cron events processed total
Total number of cron events processed. Compare this to `hasura_cron_events_invocation_total`. A large difference between
the two metrics indicates a high failure rate of the cron webhook.
| | |
| ------ | ------------------------------------ |
| Name | `hasura_cron_events_processed_total` |
| Type | Counter |
| Labels | `status`: success \| failed |
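The comparison between the invocation and processed counters can be sketched as below. This is only an illustration: it assumes both counter values were read at the same moment and already summed across their `status` label. The function name and the sample numbers are hypothetical.

```python
def invocation_gap(invocations_total, processed_total):
    """Return (extra_invocations, extra_invocations_per_processed_event).

    Both arguments are raw counter values, e.g. from
    hasura_cron_events_invocation_total and hasura_cron_events_processed_total.
    """
    if invocations_total < 0 or processed_total < 0:
        raise ValueError("counter values cannot be negative")
    extra = invocations_total - processed_total
    if processed_total == 0:
        # Nothing processed yet: the ratio is undefined, so report infinity
        # when invocations exist and 0.0 when there is no traffic at all.
        return extra, (float("inf") if extra > 0 else 0.0)
    return extra, extra / processed_total


# Hypothetical reading: 150 invocations for 100 processed events.
gap, per_event = invocation_gap(150, 100)
print(f"{gap} extra invocations, {per_event:.2f} per processed event")
```

A ratio that keeps growing between scrapes suggests the webhook is failing more often and is worth investigating.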
### Hasura event fetch time per batch
Latency of fetching a batch of events. A higher value indicates slower polling of events from the database. Consider
looking into the performance of your database.
| | |
| ------ | ------------------------------------------------------------------------------------------ |
| Name | `hasura_event_fetch_time_per_batch_seconds` |
| Type | Histogram<br /><br />Buckets: 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10 |
| Labels | none |
### Hasura event invocations total
Total number of events invoked. Represents the Event Trigger webhook HTTP requests made.
| | |
| ------ | -------------------------------- |
| Name | `hasura_event_invocations_total` |
| Type | Counter |
| Labels | `status`: success \| failed |
### Hasura event processed total
Total number of events processed. Represents the Event Trigger egress.
| | |
| ------ | ------------------------------ |
| Name | `hasura_event_processed_total` |
| Type | Counter |
| Labels | `status`: success \| failed |
### Hasura event processing time
The time taken for an event to be delivered since it's been created (if first attempt) or retried (after first attempt).
This metric can be considered as the end-to-end processing time for an event.
| | |
| ------ | --------------------------------------------------------------------- |
| Name | `hasura_event_processing_time_seconds` |
| Type | Histogram<br /><br />Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100 |
| Labels | none |
### Hasura event queue time
Queue time for an event already in the processing queue. More events in the higher buckets imply slow processing. In
this case, consider increasing the
[HTTP pool size](/deployment/graphql-engine-flags/reference.mdx/#events-http-pool-size) or optimizing the webhook
server.
| | |
| ------ | --------------------------------------------------------------------- |
| Name | `hasura_event_queue_time_seconds` |
| Type | Histogram<br /><br />Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100 |
| Labels | none |
### Hasura event trigger HTTP workers
Current number of active Event Trigger HTTP workers. Compare this number to the
[HTTP pool size](/deployment/graphql-engine-flags/reference.mdx/#events-http-pool-size). Consider increasing it if the
metric is near the current configured value.
| | |
| ------ | ----------------------------------- |
| Name | `hasura_event_trigger_http_workers` |
| Type | Gauge |
| Labels | none |
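A minimal sketch of the comparison against the HTTP pool size follows. The helper names and the 0.9 threshold are hypothetical choices for illustration; the pool size must be supplied from your own configuration.

```python
def worker_saturation(active_workers, pool_size):
    """Fraction of the configured HTTP pool currently in use."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return active_workers / pool_size


def near_pool_limit(active_workers, pool_size, threshold=0.9):
    """True when the gauge is close to the configured pool size."""
    return worker_saturation(active_workers, pool_size) >= threshold


# Hypothetical reading: 95 active workers against a configured pool of 100.
if near_pool_limit(95, 100):
    print("Event Trigger HTTP workers are near the pool limit")
```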
### Hasura event webhook processing time
The time from when an HTTP worker picks an event for delivery until its response is updated in the DB. A higher
processing time indicates a slow webhook. Try to optimize the event webhook.
| | |
| ------ | ------------------------------------------------------------ |
| Name | `hasura_event_webhook_processing_time_seconds` |
| Type | Histogram<br /><br />Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10 |
| Labels | none |
### Hasura GraphQL execution time seconds
Execution time of successful GraphQL requests (excluding subscriptions). If more requests are falling in the higher
buckets, you should consider [tuning the performance](/deployment/performance-tuning.mdx).
| | |
| ------ | -------------------------------------------------------------- |
| Name | `hasura_graphql_execution_time_seconds` |
| Type | Histogram<br /><br />Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10 |
| Labels | `operation_type`: query \| mutation |
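To judge whether requests are "falling in the higher buckets", it can help to estimate a percentile from the bucket counts. The sketch below assumes Prometheus-style cumulative buckets (each count includes all faster requests, with a final `+Inf` bucket) and interpolates linearly within a bucket. The function name and the sample counts are hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the +Inf bucket (math.inf).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Cannot interpolate into +Inf or an empty bucket.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for the buckets listed above.
sample = [(0.01, 10), (0.03, 30), (0.1, 60), (0.3, 80),
          (1, 95), (3, 99), (10, 100), (math.inf, 100)]
print(f"estimated p95: {estimate_quantile(0.95, sample):.2f}s")
```

The result is only as precise as the bucket boundaries allow, so treat it as a rough indicator rather than an exact latency.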
### Hasura GraphQL requests total
Number of GraphQL requests received, representing the GraphQL query/mutation traffic on the server.
| | |
| ------ | -------------------------------------------------------------- |
| Name | `hasura_graphql_requests_total` |
| Type | Counter |
| Labels | `operation_type`: query \| mutation \| subscription \| unknown<br />`response_status`: success \| failed |
The `unknown` operation type will be returned for queries that fail authorization, parsing, or certain validations. The
`response_status` label will be `success` for successful requests and `failed` for failed requests.
### Hasura HTTP connections
Current number of active HTTP connections (excluding WebSocket connections), representing the HTTP load on the server.
| | |
| ------ | ------------------------- |
| Name | `hasura_http_connections` |
| Type | Gauge |
| Labels | none |
### Hasura one-off events invocation total
Total number of one-off events invoked, representing the number of invocations made for one-off events.
| | |
| ------ | --------------------------------------- |
| Name | `hasura_oneoff_events_invocation_total` |
| Type | Counter |
| Labels | `status`: success \| failed |
### Hasura one-off events processed total
Total number of one-off events processed. Compare this to `hasura_oneoff_events_invocation_total`. A large difference
between the two metrics indicates a high failure rate of the one-off webhook.
| | |
| ------ | -------------------------------------- |
| Name | `hasura_oneoff_events_processed_total` |
| Type | Counter |
| Labels | `status`: success \| failed |
### Hasura postgres connections
Current number of active PostgreSQL connections. Compare this to
[pool settings](/api-reference/syntax-defs.mdx/#pgpoolsettings).
| | |
| ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Name | `hasura_postgres_connections` |
| Type | Gauge |
| Labels | `source_name`: name of the database<br />`conn_info`: connection URL string (password omitted) or name of the connection URL environment variable<br />`role`: primary \| replica |
### Hasura WebSocket connections
Current number of active WebSocket connections, representing the WebSocket load on the server.
| | |
| ------ | ------------------------------ |
| Name | `hasura_websocket_connections` |
| Type | Gauge |
| Labels | none |
:::info GraphQL request execution time
- Uses wall-clock time, so it includes time spent waiting on I/O.
- Includes authorization, parsing, validation, planning, and execution (calls to databases, Remote Schemas).