RFC: Column Mutability

This PR contains two pull requests: [Identity Columns](https://github.com/hasura/graphql-engine-mono/blob/rfc/identity-columns/rfcs/identity-columns.md) collects information and product decisions about identity columns. There are some decisions we need to make explicit. [Column Mutability](https://github.com/hasura/graphql-engine-mono/blob/rfc/identity-columns/rfcs/column-mutability.md) proposes an implementation strategy for identity columns and similar that should be able to elegantly accommodate differences among backends. The idea is to model the notion of _column mutability_ rather than e.g. identity columns directly. Please volunteer your opinions and perspectives on these topics in the PR comments. --- Closes #2407 PR-URL: https://github.com/hasura/graphql-engine-mono/pull/2507 GitOrigin-RevId: 5eb14a53504985fd32933c182bee4cc13bb70a02
2024-07-14 14:00:31 +03:00 · 2021-11-02 14:42:03 +01:00 · 2021-11-02 14:42:03 +01:00 · ef4d194d79
commit ef4d194d79
parent 5694d41025
2 changed files with 382 additions and 0 deletions
--- a/rfcs/column-mutability.md
+++ b/rfcs/column-mutability.md
@ -0,0 +1,243 @@
+# Column Mutability
+
+## Metadata
+
+```
+---
+authors: Philip Lykke Carlsen <philip@hasura.io>
+discussion:
+   https://github.com/hasura/graphql-engine-mono/issues/2407
+   https://github.com/hasura/graphql-engine-mono/pull/2507
+---
+```
+
+## Description
+
+Various features of databases influence the set of columns that should be
+exposed under different circumstances. Examples of this include Identity Columns
+and (DB) Generated Columns [^1].
+[^1]: https://www.postgresql.org/docs/14/sql-createtable.html
+
+This RFC proposes to add the basic, shared concept of _Column Mutability_ with
+the goal of simplifying the implementation of features including those mentioned
+above and similar, as well as improving code reuse, specifically for Schema
+Generation.
+
+### Problem
+
+In order to motivate this proposal, we'll start with a historical summary a
+previous effort of implementing support for handling identity columns.
+
+Identity Columns were introduced as first class concept, present throughout the
+different parts of the implementation that needed to adapt its behavior
+depending on them.
+
+What we didn't know at the time was that this would turn out to be awkward,
+because identity columns in Postgresql and MSSQL have different semantics. And
+because Postgres even has two variations of identity columns which need to be
+treated differently. (More on this later)
+
+The specific way in which this turned awkward was in the schema code. We want to
+avoid duplicating this code, so that we produce consistent schemas across
+backends. However, in order for this code to deal correctly with different
+behaviors of different identity column variants it had to accept further
+backend-specific customization options (or, hypothetically, be refactored).
+
+Concretely, apart from case-switching on whether columns were identity columns
+or not the code was amended with the (`class Backend b` type-level) feature flag
+`type XOnConflict b`, indicating whether a backend would support
+`_on_conflict`-based upserts. [^2]
+
+[^2]: MSSQL would not support upserts, as a concession for us missing an
+  implementation that would distinguish between what columns to insert and what
+  columns to update owning to `_on_conflict` being part of `insert_` mutations.
+
+Then came [issue #7557](https://github.com/hasura/graphql-engine/issues/7557),
+and we realised the importance of distinguishing between `GENERATED BY DEFAULT AS
+IDENTITY` and `GENERATED ALWAYS AS IDENTITY` identity columns in Postgres. [^3]
+
+[^3]: `BY DEFAULT` may be inserted but not updated, while `ALWAYS` may neither.
+  So now that the PG backend had a concept of id cols it would interpret them the
+  same way as in MSSQL (which are more similar to the PG `ALWAYS` kind), meaning
+  one user's insert mutations on a table with a `GENERATED BY DEFAULT` identity
+  col suddenly stopped working, because the id col was elided from the schema!
+
+To summarize, it seems our approach to modelling identity columns was
+inappropriate: We introduced a single _Identity Column_ modelling concept which
+had to cover for the various idiosyncratic variants between (and within!)
+databases, and as a result the shared schema generation code needed to know
+about different backends' idiosyncracies.
+
+### Why is it important?
+
+At the level of concrete features our product should support, Identity columns
+are already something our customers rely on (as evidenced by issue #7557). On Postgresql,
+Identity Columns are even the recommended successor to `SERIAL` columns [^4].
+
+[^4]: [Don't use serial](https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_serial):
+     > For new applications, identity columns should be used instead.
+     >
+     > Why not?
+     > 
+     > The serial types have some weird behaviors that make schema, dependency,
+     > and permission management unnecessarily cumbersome.
+   
+Additionally, a generic type of generated columns (database backed rather than
+GraphQL Engine backed) could also be an attractive feature to support, which
+should be easy to build on top of this proposal.
+
+On a more meta-feature level, faithfully exposing relational databases on
+GraphQL is our bread and butter. Being able to develop new features confidently
+with less effort is an important enabler of this.
+
+Our ability to achieve this (and thus the quality of the finished product)
+is affected by our choices of the abstractions that we use to express our solution.
+Therefore it's important we pick those that enable us rather than hamper us.
+
+### Success
+
+This effort is a success once insert, update, and upsert mutations are implemented
+using a backend-agnostic notion of column mutability, unaffected by backend-specific
+concepts such as Identity Columns and Generated Columns.
+
+## How
+
+This specific proposal describes an implementation strategy and does not itself
+have any user visible aspects. 
+
+Implementing this proposal in a basic form amounts to three conceptually simple
+things:
+
+1. Amend the table metadata with column mutability information. E.g. by changing
+   `Hasura.RQL.Types.Column.ColumnInfo` or
+   `Hasura.RQL.Types.Table.TableCoreInfoG`. We should be able to record whether
+   each column is insertable and updatable at the least.
+2. Updating the schema generation code to respect this new metadata. Points of
+   interest include update mutations, insert mutations, and insert with
+   `on_conflict` (the `update_columns` field should only contain updatable
+   columns).
+3. Updating the permissions code to respect this new metadata. Column Update
+   Permissions should apply to only updatable columns etc.
+
+In the actual table metadata collection code we may initially just mark every
+column as insertable and updatable, unless the work is in the context of
+e.g. implementing identity columns.
+
+### Effects and Interactions
+
+This feature may be perfectly well confined to the server. It might be
+interesting however to show or use the column mutability information in the
+Console, for example to reflect to a user how HGE is understanding their
+database schema.
+
+### Alternatives
+
+If you do not have a generic concept of column mutability, all features that
+influence the set of columns exposed in certain situations will have to
+themselves deal directly with those situations.
+
+Examples of code in an implementation without generic column mutability:
+* Schema generation needs to know and understand the concepts of Identity
+  Columns, and Generated Columns. These work differently across backends, which
+  hampers code reuse.
+* Permissions need to know about Identity Columns, and Generated Columns,
+  because Update Permissions don't make sense those columns. But only sometimes,
+  and that in backend specific ways.
+
+### Costs and Drawbacks
+
+A drawback of using this approach over the alternatives is that, as a feature
+implementer you need to know prior knowledge of the column mutability feature in
+order to exploit it. Nothing prohibits you from implementing some feature in
+parallel, that could have benefited from using it.
+
+But then again, if not exploiting the column mutability feature, you would have
+to know about all the relevant interactions of column mutability in order to
+produce a well-rounded feature.
+
+Another drawback is that this could potentially result in poor usability if it
+turns out that some features need to be handled more specifically than just
+including or excluding columns depending on their mutability affords.
+
+A somewhat mild but illustrative example of this is a user wanting to update a
+non-updatable column and being puzzled as to why this column is not present as a
+field in the update mutation.
+
+This concrete example could be remedied by annotating the metadata that
+indicates mutability with a human-readable text describing the cause, which we
+could then display in the console.
+
+But care must be taken: In the (contrived) case that we wanted to, say, give
+rich error messages in response only to update mutations that mention identity
+columns but not those that mention generated columns, we would have encountered
+a need that is not met by a generic column mutability concept. That would leave
+us having to overload/pollute column mutability with data and logic for dealing
+with non-universal edge cases, and we wouldn't have achieved anything.
+
+### Unresolved Questions
+
+> A place for authors to list out open questions
+> Ideally, it would serve as a collection point for others to chime in during
+> the review process to either identify their own questions or offer potential
+> solutions
+
+### Future Work / Out of Scope
+
+As stated earlier, this RFC does not directly imply any application of its
+ideas. Applications such as Identity Columns are treated separately.
+
+Here is a first attempt at looking more closely at how Identity Columns would
+use Column Mutability to fulfill its needs:
+
+* In MSSQL.
+
+  We have a couple of options for how we want to identity columns to work.
+
+  ```
+  IDENTITY(..), with SET IDENTITY_INSERT => not updatable, insertable
+  ``` 
+  (This variant will also require specific handling at SQL translation)
+
+  ```
+  IDENTITY(..), w/o  SET_IDENTITY_INSERT => not updatable, not insertable
+  ```
+
+* In Postgres
+
+  Similarly to MSSQL, we need to choose how to handle one flavor of identity column:
+
+  ```
+  GENERATED ALWAYS AS IDENTITY, with OVERRIDING SYSTEM VALUE => not updatable, insertable
+  ```
+  (This variant will also require specific handling at SQL translation)
+
+  ```
+  GENERATED ALWAYS AS IDENTITY, w/o  OVERRIDING SYSTEM VALUE => not updatable, not insertable
+  ```
+
+  The third variant does not even register:
+  ```
+  GENERATED BY DEFAULT AS IDENTITY => updatable, insertable
+  ```
+
+* General Computed Columns, as supported by all of MSSQL, PG 14, and MySQL:
+  `generated => not updatable, not insertable` 
+  
+Note: Careful inspection of the above suggests that the
+upsert-via-insert+on_conflict schema should work for both MSSQL and Postgresql,
+regardless of the flavors we choose to support. The only troublesome combination
+is `updatable, not insertable` which is absent above. [^why-troublesome]
+
+[^why-troublesome]: The reason `updatable, not insertable` is troublesome is that the values to be upserted are given as:
+      ```
+      insert_some_table(
+        objects: [{ <values here> }],
+          on_conflict: { update_columns: [<columns to be updated>], ...  }
+      )
+      ```
+      We want to elide fields `not insertable` from the schema of `<values here>`,
+      but that is also the place where values upserted go.
+
+Also, for something completely different, it might be interesting to make column
+mutability something a user can specify. This use case does however have some
+overlap with the Columns Permissions feature.
--- a/rfcs/identity-columns.md
+++ b/rfcs/identity-columns.md
@ -0,0 +1,139 @@
+# Handling of Identity Columns
+
+## Metadata
+
+```
+---
+authors: Philip Lykke Carlsen <philip@hasura.io>
+discussion:
+  https://github.com/hasura/graphql-engine-mono/issues/2407
+  https://github.com/hasura/graphql-engine-mono/pull/2507
+state: pending answers to unresolved questions
+---
+```
+
+## Description
+
+This RFC collects discussion and decisions on how we want Identity Columns to
+work in the GraphQL Engine.
+
+## Problem
+
+Identity Columns are an SQL standard database feature that attempts to solve
+the problem of generating row identifiers in a more sound way than naive
+auto-incrementing columns. This works by imposing restrictions on how such
+columns may be updated.
+
+This means that, in order for the GraphQL engine to correctly deal with tables
+that have identity columns it has to observe these restrictions, specifically
+when updating and inserting.
+
+It is also possible to sometimes override the constraints imposed by Identity
+Columns, and we need to decide what we want to support and how we want to
+support it.
+
+## Available options
+
+Overall, there are two flavors of identity columns we may encounter:
+
+* (Postgres only) Identity columns declared `GENERATED BY DEFAULT AS IDENTITY`
+  work just like regular `SERIAL` columns and impose no further constraints.
+  We can view these as being identity-columns-in-name-only.
+
+* The more "true" Identity Columns, supported by both MSSQL and PostgreSQL, are not
+  updatable and only insertable using and override mechanism:
+  * In MSSQL, a column declared `IDENTITY(..)` may be inserted into only when `SET
+    IDENTITY_INSERT` is applied to that table.
+  * In Postgres, a column declared `GENERATED ALWAYS AS IDENTITY`
+    may be inserted into by giving the clause `OVERRIDING SYSTEM VALUE` in an
+    `INSERT` statement.
+
+**We need to decide how/when/if we want to expose the overriding mechanism in
+our GraphQL API** (see the Unresolved Questions section below).
+
+## How
+
+Implementing the handling of identity columns should apply the architecture
+described in [Column Mutability](/rfcs/column-mutability.md).
+
+If we go with the non-overriding policy described above there should not be
+any changes necessary to SQL translation for either MSSQL or PostgreSQL.
+
+The only necessary change then ought to be amending the table metadata
+extraction (for both MSSQL and PostgreSQL) to identify identity columns
+and set column mutability accordingly (i.e. not insertable, not updatable).
+
+## Unresolved Questions
+
+_When, if ever, should we make use of the constraints overriding mechanisms
+  described above? Do we want to never override? Always? Make it configurable?_
+
+Note that:
+* Column Mutability guides us for how to implement the schema generation aspects
+  of either choice (of "non-overriding" vs "overriding")
+* Leaving this unanswered does not block implementation of basically correct
+  handling of identity columns.
+* But the implementation will have to make an (arbitrary) choice between the two.
+  A reasonable choice would be to select "non-overriding".
+* We don't expect any complications to result from amending the implementation
+  at a later point in time.
+
+
+## Appendix
+
+The purpose of this appendix is to collect relevant information on the concept
+of _Identity Columns_ and inform the implementation of GraphQL Engine.
+
+* Part of the SQL standard.
+* Motivation is to standardise DB-supplied identifiers (i.e. autoincrement/serial/..)
+  * Note: This is a concept distinct from primary keys. Identity Columnss don't introduce
+    uniqueness constraints by themselves!
+  * Also provide better semantics than naive auto-increment/serial solutions, by
+    prohibiting updating and inserting of Identity Columns (to an extent), in order to
+    avoid issue where auto-increment logic produces duplicates because conflicts
+    with manual inserts/updates.
+  * Interestingly, no-one seems to actually link to the standard they implement from.
+* Implemented in PG, MSSQL, DB2 and Oracle (also Oracle-NoSQL, ironically)
+* Not implemented in MySQL or SQLite
+* Introduces some complications/extra coordination for replication/backup.
+    
+In a sentence:
+> Identity columns are immutable, sequentially distinct values
+> provided only by the DBMS
+
+### MSSQL semantics
+
+[MSSQL TSQL Identity Columns](https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql-identity-property?view=sql-server-ver15)
+
+* Possible to `INSERT` values for Identity Columns, but guarded by a `SET INSERT_IDENTITY <tablename> ON` statement.
+* Impossible to `UPDATE` values for Identity Columns.
+* Syntax differs from SQL standard: `column IDENTITY(type, seed, increment)`.
+
+### PostgreSQL Semantics
+
+[PG Create table syntax (including GENERATED)](https://www.postgresql.org/docs/devel/sql-createtable.html)
+
+* Syntax closer to SQL standard: `column GENERATED BY DEFAULT AS IDENTITY`, `column GENERATED ALWAYS AS IDENTITY`.
+* Implemented on top of `series`.
+* Columns `GENERATED BY DEFAULT` may be both `INSERT`ed and and `UPDATE`d.
+* Columns `GENERATED ALWAYS` may be `INSERT`ed (guarded by an `OVERRIDE SYSTEM VALUE` keyword), but never `UPDATE`d.
+
+
+### Links
+
+[Don't use serial](https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_serial):
+> For new applications, identity columns should be used instead.
+>
+> Why not?
+> 
+> The serial types have some weird behaviors that make schema, dependency, and permission management unnecessarily cumbersome.
+
+[SE: pg serial vs identity](https://stackoverflow.com/questions/55300370/postgresql-serial-vs-identity)
+
+[Implementers blog post](https://www.2ndquadrant.com/en/blog/postgresql-10-identity-columns/)
+
+[Technical details blog post](https://www.depesz.com/2017/04/10/waiting-for-postgresql-10-identity-columns/)
+    
+[Wikipedia: Identity Columns](https://en.wikipedia.org/wiki/Identity_column)
+
+