graphql-engine/v3/crates/sql
Phil Freeman 317193df10 [PACHA-24] Metrics in the SQL layer (#1029)

### What

- Adds datafusion row metrics to our NDC query and aggregate nodes, for
  explain output
- Aggregates all datafusion metrics in the trace attributes:
  - `rows_processed`, i.e. total number of rows considered over all
    execution plan nodes
  - `elapsed_compute`, i.e. CPU time spent in _processing_ data (not
    fetching it)
- Adds the explain output to the `create_logical_plan` span.
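
As a rough illustration of the aggregation described above, the sketch below sums per-node metrics over a plan tree. `PlanNode`, `MetricTotals` and `aggregate_metrics` are hypothetical stand-ins invented for this example; they are not the types used in this crate or in datafusion, and the unit of `elapsed_compute` is deliberately left unspecified.

```rust
/// Hypothetical stand-in for one execution plan node and the metrics it recorded.
#[allow(dead_code)]
#[derive(Debug, Default)]
struct PlanNode {
    name: &'static str,
    /// Rows considered by this node, if it recorded a row metric.
    rows: Option<usize>,
    /// CPU time spent processing (not fetching) data, if recorded.
    elapsed_compute: Option<usize>,
    children: Vec<PlanNode>,
}

/// Totals that would be attached to the trace as attributes.
#[derive(Debug, Default, PartialEq)]
struct MetricTotals {
    rows_processed: usize,
    elapsed_compute: usize,
}

/// Walk the plan tree and sum the metrics of every node.
/// Nodes that recorded no metric contribute zero.
fn aggregate_metrics(node: &PlanNode) -> MetricTotals {
    let mut totals = MetricTotals {
        rows_processed: node.rows.unwrap_or(0),
        elapsed_compute: node.elapsed_compute.unwrap_or(0),
    };
    for child in &node.children {
        let child_totals = aggregate_metrics(child);
        totals.rows_processed += child_totals.rows_processed;
        totals.elapsed_compute += child_totals.elapsed_compute;
    }
    totals
}
```

Calling `aggregate_metrics` on the root node yields one `rows_processed` and one `elapsed_compute` value for the whole plan, which is the shape of the trace attributes listed above.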

E.g. a query we don't push down to NDC:

```sql
SELECT
    COUNT(42 * invoiceId) AS odd_count
FROM
    InvoiceLine;
```

Attributes:

```text
rows_processed: 2242
total_rows: 1
elapsed_compute: 417
logical_plan: Projection: count(Int64(42) * InvoiceLine.invoiceId) AS odd_count
  Aggregate: groupBy=[[]], aggr=[[count(Int64(42) * InvoiceLine.invoiceId)]]
    TableScan: InvoiceLine
```

The metrics show that the cost in terms of rows processed per row
returned (2242 / 1) is very high in this case. The logical plan makes
the reason clear: we failed to push down the aggregate node, so the
aggregate runs in the engine on top of a plain `TableScan`.
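
The ratio above can be computed directly from the two attributes. This is a minimal sketch; the helper name `rows_processed_per_row_returned` is hypothetical and not part of the crate.

```rust
/// Rows processed per row returned, or `None` when no rows were returned
/// (the ratio is undefined in that case).
fn rows_processed_per_row_returned(rows_processed: u64, total_rows: u64) -> Option<f64> {
    if total_rows == 0 {
        None
    } else {
        Some(rows_processed as f64 / total_rows as f64)
    }
}
```

With the attributes from the example, `rows_processed_per_row_returned(2242, 1)` returns `Some(2242.0)`.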

### How


V3_GIT_ORIGIN_REV_ID: c26cce9adab9d0feb0a7d2873a3eea38542564a0
2024-08-29 00:56:46 +00:00
src [PACHA-24] Metrics in the SQL layer (#1029) 2024-08-29 00:56:46 +00:00
Cargo.toml [PACHA-4] initial support for commands (#975) 2024-08-18 03:34:15 +00:00
readme.md Refactor SQL layer to use OpenDD query IR (#925) 2024-08-05 23:38:19 +00:00

# SQL Interface

An experimental SQL interface over OpenDD models. For now it mostly targets AI use cases, because GenAI models are better at generating SQL queries than GraphQL queries.

This is implemented using the Apache DataFusion query engine by deriving the SQL metadata for DataFusion from Open DDS metadata. As the implementation currently stands, once we get a `LogicalPlan` from DataFusion, we replace each `TableScan` with an NDC query to the underlying connector. A rudimentary optimizer pushes projections down into the OpenDD query so that we don't fetch all the columns of a collection.
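
The two steps described above (replacing table scans, then pushing projections down) can be sketched on a toy plan type. Everything here is hypothetical: `Plan`, `replace_table_scans` and `push_down_projection` are invented for illustration and are far simpler than DataFusion's `LogicalPlan` or the code in this crate. The sketch also assumes a projection is a plain list of column names, so it can be absorbed entirely into the query.

```rust
/// Toy logical plan (hypothetical; not DataFusion's `LogicalPlan`).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Scan of a table; fetches every column.
    TableScan { table: String },
    /// Query sent to the connector; `fields: None` means "all columns".
    NdcQuery { collection: String, fields: Option<Vec<String>> },
    /// Keep only the named columns of the input.
    Projection { columns: Vec<String>, input: Box<Plan> },
}

/// Step 1: replace every table scan with a connector query.
fn replace_table_scans(plan: Plan) -> Plan {
    match plan {
        Plan::TableScan { table } => Plan::NdcQuery { collection: table, fields: None },
        Plan::Projection { columns, input } => Plan::Projection {
            columns,
            input: Box::new(replace_table_scans(*input)),
        },
        other => other,
    }
}

/// Step 2: fold a projection into the query directly beneath it,
/// so that only the projected columns are fetched.
fn push_down_projection(plan: Plan) -> Plan {
    match plan {
        Plan::Projection { columns, input } => match *input {
            Plan::NdcQuery { collection, fields: None } => Plan::NdcQuery {
                collection,
                fields: Some(columns),
            },
            // Anything else: keep the projection and optimize below it.
            other => Plan::Projection {
                columns,
                input: Box::new(push_down_projection(other)),
            },
        },
        other => other,
    }
}
```

Applying `push_down_projection(replace_table_scans(plan))` to a projection over a table scan leaves a single `NdcQuery` that requests only the projected columns. A plan node the optimizer does not recognize stays in the engine, which is the situation the aggregate example in the commit message illustrates.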