docs: refactor metrics table for EE

PR-URL: https://github.com/hasura/graphql-engine-mono/pull/9013
GitOrigin-RevId: f867dc5daa22a9a76357280f5079eae5942963eb
2024-12-14 17:02:49 +03:00 · 2023-05-03 15:25:47 -05:00
commit f9c55a2a04
parent c1705a09df
1 changed file with 216 additions and 187 deletions
--- a/docs/docs/enterprise/metrics.mdx
+++ b/docs/docs/enterprise/metrics.mdx
@@ -34,205 +34,234 @@ HASURA_GRAPHQL_METRICS_SECRET=<secret>
 curl 'http://127.0.0.1:8080/v1/metrics' -H 'Authorization: Bearer <secret>'
 ```

-:::note Note
+:::info Configure a secret

- The metrics endpoint should be configured with a secret to prevent misuse and should not be exposed over the internet.
+The metrics endpoint should be configured with a secret to prevent misuse and should not be exposed over the internet.

 :::

 ## Metrics exported

-<table>
-  <tr>
-    <td>Name</td>
-    <td>Description</td>
-    <td>Type</td>
-    <td>Labels</td>
-    <td>Comment</td>
-  </tr>
-  <tr>
-    <td><code>hasura_http_connections</code></td>
-    <td>Current number of active HTTP connections (excluding WebSocket connections)</td>
-    <td>Gauge</td>
-    <td>none</td>
-    <td>Represents the HTTP load on the server</td>
-  </tr>
-  <tr>
-    <td><code>hasura_websocket_connections</code></td>
-    <td>Current number of active WebSocket connections</td>
-    <td>Gauge</td>
-    <td>none</td>
-    <td>Represents the websocket load on the server.</td>
-  </tr>
-  <tr>
-    <td><code>hasura_active_subscriptions</code></td>
-    <td>Current number of active subscriptions</td>
-    <td>Gauge</td>
-    <td>none</td>
-    <td>Represents the subscription load on the server.</td>
-  </tr>
-  <tr>
-    <td><code>hasura_graphql_requests_total</code></td>
-    <td>Number of GraphQL requests received </td>
-    <td>Counter</td>
-    <td>&#8226; "operation_type": query|mutation|subscription|unknown <br/>
-      &#8226; The "unknown" operation type will be returned for queries that fail authorization, parsing, or certain
-      validations<br/>
-      &#8226; "response_status": success|failed
-    </td>
-    <td>Represents the graphql query/mutation traffic on the server.</td>
-  </tr>
-  <tr>
-    <td><code>hasura_graphql_execution_time_seconds</code></td>
-    <td>Execution time of successful GraphQL requests (excluding subscriptions)</td>
-    <td>Histogram<br/><br/>Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10</td>
-    <td>&#8226; "operation_type": query|mutation</td>
-    <td>If more requests are falling in the higher buckets, you should consider <a href="/latest/deployment/performance-tuning">tuning the performance</a>.</td>
-  </tr>
-  <tr>
-    <td><code>hasura_event_queue_time_seconds</code></td>
-    <td>Queue time for an event already in the processing queue</td>
-    <td>Histogram<br/><br/>Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100</td>
-    <td>none</td>
-    <td>More events in higher bucket implies slow processing, you can consider increasing the <a href="/latest/deployment/graphql-engine-flags/reference/#events-http-pool-size">HTTP pool size</a> or optimizing the webhook server.</td>
-  </tr>
-  <tr>
-    <td><code>hasura_event_fetch_time_per_batch_seconds</code></td>
-    <td>Latency of fetching a batch of events</td>
-    <td>Histogram<br/><br/>Buckets: 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10</td>
-    <td>none</td>
-    <td>A higher metric indicates slower polling of events from the database, you should consider looking into the performance of your database.</td>
-  </tr>
-  <tr>
-    <td><code>hasura_event_webhook_processing_time_seconds</code></td>
-    <td>The time between when an HTTP worker picks an event for delivery to the time its response is updated in the DB</td>
-    <td>Histogram<br/><br/>Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10</td>
-    <td>none</td>
-    <td>A higher processing time indicates slow webhook, you should try to optimize the event webhook.</td>
-  </tr>
-  <tr>
-    <td><code>hasura_event_processing_time_seconds</code></td>
-    <td>The time taken for an event to be delivered since it's been created (if first attempt) or retried (after first attempt).</td>
-    <td>Histogram<br/><br/>Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100</td>
-    <td>none</td>
-    <td>This metric can be considered as the end-to-end processing time for an event.</td>
-  </tr>
-  <tr>
-    <td><code>hasura_event_trigger_http_workers</code></td>
-    <td>Current number of active Event Trigger HTTP workers</td>
-    <td>Gauge</td>
-    <td>none</td>
-    <td>Compare this number to the <a href="/latest/deployment/graphql-engine-flags/reference/#events-http-pool-size">HTTP pool size</a>. Consider increasing it if the metric is near the current configured value.</td>
-  </tr>
-  <tr>
-    <td><code>hasura_event_processed_total</code></td>
-    <td>Total number of events processed</td>
-    <td>Counter</td>
-    <td>&#8226; "status": success|failed</td>
-    <td>Represents the Event Trigger egress.</td>
-  </tr>
-  <tr>
-    <td><code>hasura_event_invocations_total</code></td>
-    <td>Total number of events invoked</td>
-    <td>Counter</td>
-    <td>&#8226; "status": success|failed</td>
-    <td>Represents the Event Trigger webhook HTTP requests made.</td>
-  </tr>
-  <tr>
-    <td><code>hasura_postgres_connections</code></td>
-    <td>Current number of active PostgreSQL connections</td>
-    <td>Gauge</td>
-    <td>&#8226; "source_name": name of the database<br/>
-      &#8226; "conn_info": connection url string (password omitted) or name of the connection url environment variable<br/>
-      &#8226; "role": primary|replica
-    </td>
-    <td>Compare this to <a href="/latest/api-reference/syntax-defs/#pgpoolsettings">pool settings</a>.</td>
-  </tr>
-  <tr>
-  <td><code>hasura_cron_events_invocation_total</code></td>
-  <td>Total number of cron events invoked</td>
-  <td>Counter</td>
-  <td>&#8226; "status": success|failed<br /></td>
-  <td>Total number of invocations made for cron events.</td>
-</tr>
-<tr>
-  <td><code>hasura_cron_events_processed_total</code></td>
-  <td>Total number of cron events processed</td>
-  <td>Counter</td>
-  <td>&#8226; "status": success|failed<br /></td>
-  <td>
-    Compare this to <code>hasura_cron_events_invocation_total</code>. A high difference between the two metrics
-    indicates high failure rate of the cron webhook.
-  </td>
-</tr>
-<tr>
-  <td><code>hasura_oneoff_events_invocation_total</code></td>
-  <td>Total number of one-off events invoked</td>
-  <td>Counter</td>
-  <td>&#8226; "status": success|failed<br /></td>
-  <td>Total number of invocations made for one-off events.</td>
-</tr>
+The following metrics are exported by Hasura GraphQL Engine:

-<tr>
-  <td>
-    <code>hasura_oneoff_events_processed_total</code>
-  </td>
-  <td>Total number of one-off events processed</td>
-  <td>Counter</td>
-  <td>
-    &#8226; "status": success|failed
-    <br />
-  </td>
-  <td>
-    Compare this to <code>hasura_oneoff_events_invocation_total</code>. A high difference between the two metrics
-    indicates high failure rate of the one-off webhook.
-  </td>
-</tr>
-<tr>
-  <td>
-    <code>hasura_active_subscription_pollers</code>
-  </td>
-  <td>Current number of active subscription pollers. A subscription poller <a href="https://github.com/hasura/graphql-engine/blob/master/architecture/live-queries.md#idea-3-batch-multiple-live-queries-into-one-sql-query">multiplexes </a> similar subscriptions together.
- </td>
-  <td>Gauge</td>
-  <td>
-    &#8226; "subscription_kind": streaming|live-query
-    <br />
-  </td>
-  <td>
-    The value of this metric is supposed to be proportional to the number of uniquely parameterised subscriptions i.e. subscriptions with the same selection set
-    but with different input arguments and session variables are multiplexed on the same poller.
-    If this metric is high then it may be an indication that there are too many uniquely parameterised subscriptions
-    which could be optimized for better performance.
-    </td>
-</tr>
-<tr>
-  <td>
-    <code>hasura_active_subscription_pollers_in_error_state</code>
-  </td>
-  <td>Current number of active subscription pollers that are in the error state.
-      A subscription poller <a href="https://github.com/hasura/graphql-engine/blob/master/architecture/live-queries.md#idea-3-batch-multiple-live-queries-into-one-sql-query">multiplexes </a>
-      similar subscriptions together.
- </td>
-  <td>Gauge</td>
-  <td>
-    &#8226; "subscription_kind": streaming|live-query
-    <br />
-  </td>
-  <td>
-    A non-zero value of this metric indicates that there are runtime errors in atleast one of the subscription pollers that are running
-    in Hasura. In most of the cases, runtime errors in subscriptions are caused due to the changes at the data model layer and fixing the
-    issue at the data model layer should automatically fix the runtime errors.
-    </td>
-</tr>
+### Hasura active subscription pollers

+Current number of active subscription pollers. A subscription poller
+[multiplexes](https://github.com/hasura/graphql-engine/blob/master/architecture/live-queries.md#idea-3-batch-multiple-live-queries-into-one-sql-query)
+similar subscriptions together. The value of this metric should be proportional to the number of uniquely parameterized
+subscriptions (i.e., subscriptions with the same selection set, but with different input arguments and session variables
+are multiplexed on the same poller). If this metric is high then it may be an indication that there are too many
+uniquely parameterized subscriptions which could be optimized for better performance.

+|        |                                              |
+| ------ | -------------------------------------------- |
+| Name   | `hasura_active_subscription_pollers`         |
+| Type   | Gauge                                        |
+| Labels | `subscription_kind`: streaming \| live-query |

-</table>
+### Hasura active subscription pollers in error state

-:::note Note
+Current number of active subscription pollers that are in the error state. A subscription poller
+[multiplexes](https://github.com/hasura/graphql-engine/blob/master/architecture/live-queries.md#idea-3-batch-multiple-live-queries-into-one-sql-query)
+similar subscriptions together. A non-zero value of this metric indicates that there are runtime errors in at least one
+of the subscription pollers that are running in Hasura. In most cases, runtime errors in subscriptions are caused by
+changes at the data model layer, and fixing the issue at the data model layer should automatically fix the runtime
+errors.

-The GraphQL request execution time:
+|        |                                                     |
+| ------ | --------------------------------------------------- |
+| Name   | `hasura_active_subscription_pollers_in_error_state` |
+| Type   | Gauge                                               |
+| Labels | `subscription_kind`: streaming \| live-query        |
+
+### Hasura active subscriptions
+
+Current number of active subscriptions, representing the subscription load on the server.
+
+|        |                               |
+| ------ | ----------------------------- |
+| Name   | `hasura_active_subscriptions` |
+| Type   | Gauge                         |
+| Labels | none                          |
+
+### Hasura cron events invocation total
+
+Total number of cron events invoked, representing the number of invocations made for cron events.
+
+|        |                                       |
+| ------ | ------------------------------------- |
+| Name   | `hasura_cron_events_invocation_total` |
+| Type   | Counter                               |
+| Labels | `status`: success \| failed           |
+
+### Hasura cron events processed total
+
+Total number of cron events processed. Compare this to
+`hasura_cron_events_invocation_total`. A large difference between the two metrics indicates a high failure rate of the
+cron webhook.
+
+|        |                                      |
+| ------ | ------------------------------------ |
+| Name   | `hasura_cron_events_processed_total` |
+| Type   | Counter                              |
+| Labels | `status`: success \| failed          |
+
+### Hasura event fetch time per batch
+
+Latency of fetching a batch of events. A higher value indicates slower polling of events from the database; in that
+case, consider looking into the performance of your database.
+
+|        |                                                                                            |
+| ------ | ------------------------------------------------------------------------------------------ |
+| Name   | `hasura_event_fetch_time_per_batch_seconds`                                                |
+| Type   | Histogram<br /><br />Buckets: 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10 |
+| Labels | none                                                                                       |
+
+### Hasura event invocations total
+
+Total number of events invoked. Represents the Event Trigger webhook HTTP requests made.
+
+|        |                                  |
+| ------ | -------------------------------- |
+| Name   | `hasura_event_invocations_total` |
+| Type   | Counter                          |
+| Labels | `status`: success \| failed      |
+
+### Hasura event processed total
+
+Total number of events processed. Represents the Event Trigger egress.
+
+|        |                                |
+| ------ | ------------------------------ |
+| Name   | `hasura_event_processed_total` |
+| Type   | Counter                        |
+| Labels | `status`: success \| failed    |
+
+### Hasura event processing time
+
+The time taken for an event to be delivered since it's been created (if first attempt) or retried (after first attempt).
+This metric can be considered as the end-to-end processing time for an event.
+
+|        |                                                                       |
+| ------ | --------------------------------------------------------------------- |
+| Name   | `hasura_event_processing_time_seconds`                                |
+| Type   | Histogram<br /><br />Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100 |
+| Labels | none                                                                  |
+
+### Hasura event queue time
+
+Queue time for an event already in the processing queue. More events in the higher buckets imply slow processing. In
+this case, consider increasing the
+[HTTP pool size](/deployment/graphql-engine-flags/reference.mdx/#events-http-pool-size) or optimizing the webhook
+server.
+
+|        |                                                                       |
+| ------ | --------------------------------------------------------------------- |
+| Name   | `hasura_event_queue_time_seconds`                                     |
+| Type   | Histogram<br /><br />Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100 |
+| Labels | none                                                                  |
+
+### Hasura event trigger HTTP workers
+
+Current number of active Event Trigger HTTP workers. Compare this number to the
+[HTTP pool size](/deployment/graphql-engine-flags/reference.mdx/#events-http-pool-size). Consider increasing the pool
+size if the metric is near the currently configured value.
+
+|        |                                     |
+| ------ | ----------------------------------- |
+| Name   | `hasura_event_trigger_http_workers` |
+| Type   | Gauge                               |
+| Labels | none                                |
+
+### Hasura event webhook processing time
+
+The time between when an HTTP worker picks an event for delivery and when its response is updated in the DB. A higher
+processing time indicates a slow webhook; you should try to optimize the event webhook.
+
+|        |                                                              |
+| ------ | ------------------------------------------------------------ |
+| Name   | `hasura_event_webhook_processing_time_seconds`               |
+| Type   | Histogram<br /><br />Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10 |
+| Labels | none                                                         |
+
+### Hasura GraphQL execution time seconds
+
+Execution time of successful GraphQL requests (excluding subscriptions). If more requests are falling in the higher
+buckets, you should consider [tuning the performance](/deployment/performance-tuning.mdx).
+
+|        |                                                                |
+| ------ | -------------------------------------------------------------- |
+| Name   | `hasura_graphql_execution_time_seconds`                        |
+| Type   | Histogram<br /><br />Buckets: 0.01, 0.03, 0.1, 0.3, 1, 3, 10   |
+| Labels | `operation_type`: query \| mutation                            |
+
+### Hasura GraphQL requests total
+
+Number of GraphQL requests received, representing the GraphQL query/mutation traffic on the server.
+
+|        |                                                                                                          |
+| ------ | -------------------------------------------------------------------------------------------------------- |
+| Name   | `hasura_graphql_requests_total`                                                                          |
+| Type   | Counter                                                                                                  |
+| Labels | `operation_type`: query \| mutation \| subscription \| unknown<br />`response_status`: success \| failed |
+
+The `unknown` operation type will be returned for queries that fail authorization, parsing, or certain validations. The
+`response_status` label will be `success` for successful requests and `failed` for failed requests.
+
+### Hasura HTTP connections
+
+Current number of active HTTP connections (excluding WebSocket connections), representing the HTTP load on the server.
+
+|        |                           |
+| ------ | ------------------------- |
+| Name   | `hasura_http_connections` |
+| Type   | Gauge                     |
+| Labels | none                      |
+
+### Hasura one-off events invocation total
+
+Total number of one-off events invoked, representing the number of invocations made for one-off events.
+
+|        |                                         |
+| ------ | --------------------------------------- |
+| Name   | `hasura_oneoff_events_invocation_total` |
+| Type   | Counter                                 |
+| Labels | `status`: success \| failed             |
+
+### Hasura one-off events processed total
+
+Total number of one-off events processed. Compare this to
+`hasura_oneoff_events_invocation_total`. A large difference between the two metrics indicates a high failure rate of the
+one-off webhook.
+
+|        |                                        |
+| ------ | -------------------------------------- |
+| Name   | `hasura_oneoff_events_processed_total` |
+| Type   | Counter                                |
+| Labels | `status`: success \| failed            |
+
+### Hasura Postgres connections
+
+Current number of active PostgreSQL connections. Compare this to
+[pool settings](/api-reference/syntax-defs.mdx/#pgpoolsettings).
+
+|        |                                                                                                                                                                                   |
+| ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Name   | `hasura_postgres_connections`                                                                                                                                                     |
+| Type   | Gauge                                                                                                                                                                             |
+| Labels | `source_name`: name of the database<br />`conn_info`: connection URL string (password omitted) or name of the connection URL environment variable<br />`role`: primary \| replica |
+
+### Hasura WebSocket connections
+
+Current number of active WebSocket connections, representing the WebSocket load on the server.
+
+|        |                                |
+| ------ | ------------------------------ |
+| Name   | `hasura_websocket_connections` |
+| Type   | Gauge                          |
+| Labels | none                           |
+
+:::info GraphQL request execution time

 - Uses wall-clock time, so it includes time spent waiting on I/O.
 - Includes authorization, parsing, validation, planning, and execution (calls to databases, Remote Schemas).