move pro rfcs to pro namespace

Many pro-only feature RFCs were added in the `rfcs` folder; these were shadowed in the OSS repo

PR-URL: https://github.com/hasura/graphql-engine-mono/pull/2546
GitOrigin-RevId: 0ae965393c04398014f910b5c8ecd2dcb3d5f861
Tirumarai Selvan 2021-10-08 20:26:09 +05:30 committed by hasura-bot
parent 744cf0233e
commit 2406de7601
6 changed files with 0 additions and 609 deletions


@ -1,22 +0,0 @@
### Motivation
If Hasura provides an ability to accept multiple admin secrets, then users can implement security mechanisms like admin key rotation.
### Configuration spec
1. Introduce an env var and command-line arg `HASURA_GRAPHQL_ADMIN_SECRETS` that can take a comma-separated list of strings (instead of the single string accepted currently).
2. If an admin secret contains a comma, then it can be _literalized_ by putting the secret in quotes (as per the standard CSV spec); see the parsing sketch after this list.
For example, `HASURA_GRAPHQL_ADMIN_SECRETS: "abcd,efg", "xyz"`
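A minimal sketch of such a quote-aware splitter, under the assumption that full CSV quoting rules aren't needed (the function is hypothetical, not the actual graphql-engine parser):
```haskell
import qualified Data.Set as Set

-- Hypothetical helper: split the value of HASURA_GRAPHQL_ADMIN_SECRETS on
-- top-level commas, treating double-quoted entries as literal so that a
-- secret may itself contain commas.
parseAdminSecrets :: String -> Set.Set String
parseAdminSecrets = Set.fromList . filter (not . null) . map trim . splitQuoted
  where
    splitQuoted = go False ""
      where
        go _ acc [] = [reverse acc]
        go inQuotes acc (c:cs)
          | c == '"'                 = go (not inQuotes) acc cs      -- toggle quoting, drop the quote
          | c == ',' && not inQuotes = reverse acc : go False "" cs  -- top-level separator
          | otherwise                = go inQuotes (c : acc) cs
    trim = dropWhile (== ' ') . reverse . dropWhile (== ' ') . reverse

-- parseAdminSecrets "\"abcd,efg\", \"xyz\"" == Set.fromList ["abcd,efg", "xyz"]
```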
### Implementing rotating secrets
To implement a secret rotation mechanism, the following can be done:
1. add a new secret to list of admin secrets (and perform a rolling deploy)
2. update apps/backends to use the new secret
3. remove the old secret from the list (and perform a rolling deploy)
### Backwards compatibility
Note that we already have the `HASURA_GRAPHQL_ADMIN_SECRET` env var; if both `HASURA_GRAPHQL_ADMIN_SECRET` and `HASURA_GRAPHQL_ADMIN_SECRETS`
are given, then `HASURA_GRAPHQL_ADMIN_SECRETS` is ignored.


@ -1,149 +0,0 @@
Several mostly separate concerns:
1. How do we interrupt graphql-engine PostgreSQL client code while executing commands.
(async + postgres cancel, switch to non-blocking libpq, FFI "interruptible", ...)
2. What should communication between graphql-engine and PostgreSQL server be
when interrupting. (close connection, cancel, just let it finish, ...)
3. What should the externally observable behaviour be, to someone talking to both
graphql-engine and datasources (PostgreSQL server) as a client.
(guarantee that we communicate precisely whether a command was executed absent
networking issues?)
4. How do we time out operations inside graphql-engine from a high-level point of view?
(top-level `System.Timeout.timeout`, lower-level `timeout`, check an mvar, exception
safety, ...)
[rob]: This is a bit of a brain-dump, currently. I'd particularly like input on:
- (1.) The async/wait/catch/cancel solution seems workable to me. I'm a bit vague on
some details, but I feel it should be possible to make this reliable and correct.
But again I'm a bit inexperienced with the async runtime, FFI, so might be missing
things.
- (1.) I can dig further into non-blocking libpq, but feel it's not worth the effort
at this stage.
- (2.) I'm least unsure about this; it seems clear that cancelling commands is the way to
go. But do object :)
- (3./4.) "timeout -> send cancel -> wait for response and use regular error path"
seems cleanest, but I'm still a bit vague on the details.
- (4.) Is the top-level use of `timeout` safe enough? I have vague doubts here, lacking
familiarity with the code base and experience working with asynchronous exceptions.
## 1. Interrupting PostgreSQL command execution
The FFI calls to [PGexec*](https://www.postgresql.org/docs/current/libpq-exec.html#LIBPQ-EXEC-MAIN)
block the RTS, so a priori asynchronous exceptions won't be handled until after the
server responds.
(This point interacts closely with 2. below; i.e., simply making the calls interruptible
might not be enough for a correct solution towards PostgreSQL server.)
### Interruptible FFI (doesn't work)
Marking the FFI call [interruptible](https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/ffi.html#extension-InterruptibleFFI)
should cause the OS-level thread to receive SIGPIPE on async exceptions, which would cause
system calls to fail with EINTR. However, as verified experimentally, it appears this doesn't
cause libpq to return, likely because it retries on EINTR (compare https://doxygen.postgresql.org/fe-misc_8c.html#a03fea567926b20bbeada40c134a8fdc1).
### Call libpq using `async` (separate thread)
We can make FFI actions interruptible towards the rest of the program by running them
in a separate thread:
```haskell
do
  a <- async $ uninterruptibleFFI conn
  wait a
```
While this allows interrupting the action, it isn't thread-safe as such due to
handing over the libpq connection to another thread. What we can do is to cancel
the command (see 2. below) and then wait for the call to return:
```haskell
do
  c <- getCancel conn
  a <- async $ uninterruptibleFFI conn
  res <- try $ wait a
  case res of
    Left e -> do
      cancel c
      throwTo (asyncThreadId a) e
      wait a
    Right x -> return x
```
Couple things to figure out:
- We might want to swallow certain exceptions (e.g. `System.Timeout.Timeout`
or a custom `PGCancelCommand`), in which case we wouldn't rethrow the exception
and instead just return regularly (with a postgres "canceled" error message if
we canceled on time) -- this would allow user code to decide whether the command
was actually interrupted. (See 3. below.)
- Error handling for the call to `cancel`.
### Use libpq in non-blocking mode
We could rewrite parts of pg-client-hs to use libpq
[asynchronously](https://hackage.haskell.org/package/postgresql-libpq-0.9.4.3/docs/Database-PostgreSQL-LibPQ.html#g:9).
The general approach would be:
- get the connection socket file descriptor via `socket :: Connection -> IO (Maybe Fd)`
- loop with `poll()/select()` and `consumeInput`
([rob]: is this in the standard library somewhere? this should be interruptible)
[rob]: I haven't explored this option in detail since it seems like the change
has a far bigger impact, and the forking approach seems like an easier way to
explore this. May well be a good idea longer term though.
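A rough sketch of what that loop could look like with the `postgresql-libpq` bindings, assuming the command was issued via `sendQuery`; `threadWaitRead` (from `Control.Concurrent`, i.e. the standard threaded IO manager) is interruptible, which is what makes this approach attractive:
```haskell
import           Control.Concurrent (threadWaitRead)
import qualified Database.PostgreSQL.LibPQ as PQ

-- Sketch only: drain all results of a command previously issued with
-- PQ.sendQuery, blocking in the IO manager (interruptible) instead of in FFI.
drainResults :: PQ.Connection -> IO [PQ.Result]
drainResults conn = do
  mfd <- PQ.socket conn
  case mfd of
    Nothing -> fail "no connection socket"
    Just fd -> go fd []
  where
    go fd acc = do
      busy <- PQ.isBusy conn
      if busy
        then do
          threadWaitRead fd          -- async exceptions (e.g. a timeout) are delivered here
          _ <- PQ.consumeInput conn  -- read whatever has arrived on the socket
          go fd acc
        else do
          mres <- PQ.getResult conn
          case mres of
            Nothing  -> pure (reverse acc)  -- command complete
            Just res -> go fd (res : acc)
```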
## 2. Interrupting towards PostgreSQL
There are a number of things that our PostgreSQL client could do when interrupted:
- closing or resetting the connection
- sending a cancel message
- processing the command normally (i.e., waiting and reading the result from the connection)
Breaking the connection won't interrupt command execution server-side; it seems
cancelling is generally the right thing to do:
- https://www.postgresql.org/message-id/flat/CA%2BTgmobakAdqELHtB_t8mPkU8qn4QPSJuCrRzWJFGfCJtpoV%3Dw%40mail.gmail.com
- https://dba.stackexchange.com/questions/81408/is-a-postgres-long-running-query-aborted-if-the-connection-is-lost-broken
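For illustration, a sketch of sending the cancel message with `postgresql-libpq` (error handling simplified; the real code would feed failures into the regular error path and then still read the command's result to see whether it was actually interrupted):
```haskell
import qualified Data.ByteString.Char8 as BS8
import qualified Database.PostgreSQL.LibPQ as PQ

-- Sketch only: ask the server to cancel whatever command is currently running
-- on `conn`. Cancellation is best-effort; the command may still complete.
requestCancel :: PQ.Connection -> IO ()
requestCancel conn = do
  mcancel <- PQ.getCancel conn
  case mcancel of
    Nothing -> putStrLn "no cancel handle (connection already closed?)"
    Just c  -> do
      outcome <- PQ.cancel c
      case outcome of
        Left err -> putStrLn ("cancel request failed: " <> BS8.unpack err)
        Right () -> pure ()
```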
## 3. From the outside
Let's say we have a slow graphql mutation that translates to a PostgreSQL transaction.
It seems desirable to guarantee that we respond with:
- no error: the transaction was committed
- controlled timeout: the transaction was not committed
- other errors: case by case, generally same behaviour as without timeouts, plus
new failure cases with e.g. an error sending the cancel message
The idea would be that if there are no connection issues, the client can reliably
tell whether a mutation happened by looking at the result. This is possible to achieve
with cancellation, by ensuring that after we send the cancel message, we continue the
regular control flow, which would then return a PostgreSQL "command cancelled" error
if the command was cancelled, and the usual response otherwise. (E.g. in the case that
we send the cancel around the same time as the server sends the successful response.)
[rob]: I lack knowledge here to what extent we try to guarantee anything reliable at
all with respect to transactionality towards Postgres. It's clear that we can't
provide similar guarantees e.g. towards remote schemas.
## 4. High-level graphql-engine implementation
- Execute the operation under `System.Timeout.timeout` at the top level (`runGQL`).
[rob]: I'm unsure to what extent we're prepared to be interrupted randomly, e.g.
are we safe against things like `r <- executePostgresQuery; logResult; return r`
being interrupted before the logging step (and similar, less harmless scenarios).
- Execute slow operations under `System.Timeout.timeout` at a lower level. E.g.
we might set a deadline, and pass it in, and then use `System.Timeout.timeout`
within `pg-client-hs` or different client code to ensure we don't pass that deadline.
- For either of the above cases, we might opt to throw a different asynchronous
exception at the operation (`HasuraCancel`).
- Instead of returning `Nothing` from `timeout`, the interrupting exception could
result in an "interrupted" error being returned along the usual error return path.
- Instead of throwing an asynchronous exception, we could pass the "interrupted"
information via a different mechanism, e.g. an MVar. E.g. non-blocking postgres
code could wait on both the timed-out mvar and the connection socket.
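A small sketch of that last, exception-free variant, with purely illustrative names: the deadline fills an `MVar`, and lower-level code (e.g. a non-blocking postgres loop) checks it at safe points or waits on it alongside the connection socket.
```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, tryReadMVar)
import Data.Maybe (isJust)

-- Illustrative only: a cancellation signal that is filled once the deadline passes.
newtype CancelSignal = CancelSignal (MVar ())

newCancelAfter :: Int -> IO CancelSignal   -- deadline in microseconds
newCancelAfter micros = do
  var <- newEmptyMVar
  _ <- forkIO (threadDelay micros >> putMVar var ())
  pure (CancelSignal var)

-- Lower-level code polls this at appropriate spots instead of being hit by an
-- asynchronous exception at an arbitrary point.
isCancelled :: CancelSignal -> IO Bool
isCancelled (CancelSignal var) = isJust <$> tryReadMVar var
```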


@ -1,173 +0,0 @@
# Overview
We want to be able to time out slow queries and mutations.
Concretely specced (issue [#1232](https://github.com/hasura/graphql-engine-mono/issues/1232)):
- `api_limits` in metadata gets an extra `operation_timeout` setting
(next to `rate_limit`, `depth_limit`, `node_limit`), configurable
project-wide and per user
- any GraphQL operation should be aborted once the timeout is reached;
this includes REST endpoints via their mapping to GraphQL operations,
and subscriptions.
This RFC outlines a minimal solution, and tries to provide an
overview of the design space, mostly through several appendices
that can be ignored when evaluating the concrete plan. There's
a bit more detail in
[operation-timeout-api-limits-braindump](./operation-timeout-api-limits-braindump.md).
There's a rather wide design space here. Some questions,
the answers to which determine how we go about this:
1. Do we let the client know whether a query was actually interrupted
database-server-side? Compare Appendix C.
2. Do we need queries to timeout reliably when there
are connectivity issues between graphql-engine and the
database servers?
3. Some considerations regarding the effect of the approach
on the stability, maintainability, and understandability of the code
base, e.g.:
- Is the core query code robust with respect to exception
handling? Do we want to afford the effort of ironing out
bugs there? Is the behavioural complexity of using asynchronous
exceptions for high-level control flow a good trade-off
compared to the implementation complexity of more explicit
timeout handling?
- Is the effort to rewrite the postgres client library in
non-blocking mode worthwhile?
For the first implementation, we've generally opted for choices
which require the smallest changes. None of these choices prevent
us from reevaluating in the future. Concretely:
- Answer "no" to points 1 & 2; this allows using `timeout` (compare
Appendix B on library API design), and a minimal change
to `pg-client-hs` (compare Appendix A on pg-client-hs).
- Answer "yes / we'll see" to the robustness question, and leave
the potential non-blocking rewrite to the future.
# Implementation
There are a couple of mostly-independent things to do:
- modify backend library code to allow interrupting queries
(concretely, `pg-client-hs`)
- provide hooks in the server query execution to allow us to
apply limits
- extend the metadata to allow configuring timeouts
- implement hooks for Pro referencing the metadata
## Concrete technical plan
- `ApiLimits` in metadata gets a new `operation_timeout` field, analogous
to other limits
- Postgres backend library `pg-client-hs` is modified so that queries
are interruptible by async exceptions (absent networking issues).
This is PR [#39](https://github.com/hasura/pg-client-hs/pull/39).
- class `HasResourceLimits` (which currently has `askResourceLimits`
yielding an HTTP handler allocation limiter) is augmented to
offer `askHandlerLimits` and `askOperationLimits`
- `askOperationLimits` returns an action modifier (`m a -> m a`)
- for Pro, we implement this action modifier to call
`Timeout.timeout :: IO a -> IO (Maybe a)` on the action, and
throw an error if the timeout triggered (see the sketch after this list)
- for both HTTP (in `runGQ`) and Websocket transport (in `onStart`),
we modify query and mutation execution with this operation limiter
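A rough sketch of the last two points (the names are placeholders, not the actual `HasResourceLimits` API): the Pro limiter is an ordinary action modifier built from `System.Timeout.timeout`, turning expiry into an error on the usual error path, while the OSS limiter is the identity.
```haskell
{-# LANGUAGE RankNTypes #-}

import           Control.Exception (Exception, throwIO)
import qualified System.Timeout as Timeout

-- Placeholder for what `askOperationLimits` would return.
newtype OperationLimiter = OperationLimiter (forall a. IO a -> IO a)

data OperationTimedOut = OperationTimedOut deriving Show
instance Exception OperationTimedOut

-- Pro-style limiter: run the operation under a timeout (given in seconds) and
-- surface expiry as an error instead of a silent Nothing.
mkTimeoutLimiter :: Int -> OperationLimiter
mkTimeoutLimiter seconds = OperationLimiter $ \action -> do
  result <- Timeout.timeout (seconds * 1000000) action
  case result of
    Nothing -> throwIO OperationTimedOut
    Just x  -> pure x

-- OSS-style limiter: no operation timeout, the action runs unmodified.
noOperationLimit :: OperationLimiter
noOperationLimit = OperationLimiter id
```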
# Open questions, random thoughts
- How to handle subscriptions concretely?
- By the placement of the limiter, we're making a choice about what
concretely is part of the time-limited operation (e.g., do we
limit "cache lookup + execution", or just "execution"; should
request parsing count?)
- The graphql-engine change will affect all backends, but this RFC
doesn't cover potentially required changes analogous to the
`pg-client-hs` changes. Without those, long-running queries to
other backends would still fail, but might not return promptly
(and won't save resources).
- The unit for the timeout is seconds (but can we make that clearer? e.g.
`operation_timeout_seconds` or `operation_timeout: 5s`)
- Offering operation limits for cloud ops (SRE), as opposed to
project administrators.
# Appendix A: (Un-)Blocking postgres queries
The current implementation of `pg-client-hs` executes queries
synchronously in FFI, and doesn't provide a timeout mechanism.
When receiving an asynchronous exception, the action will block
until a response is received from the server. There are several
ways to go about this. (Compare also Appendix B on the library
API below.)
- `pg-client-hs` builds on the C library `libpq`, which is built
in a way that doesn't allow us to use "interruptible"
FFI, because it catches signals itself
- for the query to be interrupted on the database server,
we need to send a cancel message:
- [psql-hackers message](https://www.postgresql.org/message-id/flat/CA%2BTgmobakAdqELHtB_t8mPkU8qn4QPSJuCrRzWJFGfCJtpoV%3Dw%40mail.gmail.com)
- [stackoverflow question](https://dba.stackexchange.com/questions/81408/is-a-postgres-long-running-query-aborted-if-the-connection-is-lost-broken)
- options:
1. (use a native haskell postgres client)
2. (patch libpq to allow using it with interruptible FFI)
3. use libpq in non-blocking mode (and `select(2)` on the file
descriptor)
4. use libpq in non-blocking mode and busy-wait
5. cancel queries in blocking mode
# Appendix B: Timeout API design for a database client library
Say you're designing a Haskell database client library, e.g.
`pg-client-hs`, and would like to allow library users to
timeout queries. There's a range of approaches to how to
provide this functionality:
1. query is essentially of the type
`query :: Query -> IO Result`;
we just ask that `query q` respects async exceptions
(i.e., bails out cleanly and promptly when receiving an async exception,
throwing that exception)
client code would be expected to use `timeout :: DiffTime -> IO a -> IO (Maybe a)`
2. query is essentially of the type
`query :: Query -> IO (Either TimeoutError Result)`;
if `query q` receives (a specific / any) async exception, it bails out
cleanly, swallowing the exception and returning `Left TimeoutError`
client code would use a variant of `timeout` with a custom exception, that doesn't
need to decide whether the timeout was handled (`timeout' :: DiffTime -> IO a -> IO a`)
3. query is essentially of the type
`query :: TimeLimit -> Query -> IO (Either TimeoutError Result)`;
timeouts are handled internally in some way that the library user doesn't
care about
- option: Deadline vs. TimeLimit
- async exception behaviour should ideally be as in option 1 above, but
e.g. blocking in FFI code would be fine
client code just needs to pass a time limit / deadline
4. query is essentially of the type
`query :: Cancel -> Query -> IO (Either CancelError Result)`
where `Cancel` is some channel to allow signalling the desire of client code to
interrupt a query (e.g. `MVar ()`); library code would check for cancellation
at appropriate spots
Some trade-offs:
- options 2 and 4 offer most flexibility to the library user
- option 2 violates the core contract of async exceptions, namely "stop what you're
doing and bail, rethrowing the exception"
- option 1 doesn't allow the library user to determine whether the query was interrupted
server-side (i.e., there's no way to decide between "timeout hit while query was running
and caused the remote transaction to be rolled back" and "timeout hit while client query
code was still running, but remote transaction committed")
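For concreteness, option 1 seen from the caller's side might look as follows, with `Query`, `Result`, and `runQuery` as stand-ins for the real pg-client-hs API:
```haskell
import qualified System.Timeout as Timeout

-- Placeholders for the library's actual types.
data Query
data Result

runQuery :: Query -> IO Result
runQuery = undefined  -- provided by the library; promises only to respect async exceptions

-- Option 1: the caller layers the timeout (in microseconds) on top; Nothing
-- means the timeout fired, but says nothing about whether the server-side
-- transaction was rolled back.
runQueryWithTimeout :: Int -> Query -> IO (Maybe Result)
runQueryWithTimeout micros q = Timeout.timeout micros (runQuery q)
```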
# Appendix C: Behaviour towards graphql clients
For backends that allow reliably deciding whether a
transaction committed, do we provide this to the client?
I.e., we could guarantee the following.
Assuming a graphql query response is received client-side,
then either:
- there's no error and the query was executed successfully
- there's an error of a class "all bets are off" (e.g.,
network error talking to the database backend)
- there's an error of a class "database query not performed" (e.g.,
"malformed graphql", "table doesn't exist", "server shut down
while running query"); and if our new timeout triggers, it will
cause the transaction to not be committed, and reported via this
class


@ -1,75 +0,0 @@
This document outlines our support for Postgres client certificates in
multitenant.
# API
<https://github.com/hasura/graphql-engine-mono/issues/1143>
We want to use env vars to correlate certificate data with Postgres
sources. We want to allow users to add sources with client certs via
the metadata API:
```json
{
  "type": "pg_add_source",
  "args": {
    "name": "pg1",
    "configuration": {
      "connection_info": {
        "database_url": {
          "from_env": "<DB_URL_ENV_VAR>"
        },
        "ssl_configuration": {
          "sslmode": "verify-ca",
          "sslrootcert": {
            "from_env": "<SSL_ROOT_CERT_ENV_VAR>"
          },
          "sslcert": {
            "from_env": "<SSL_CERT_ENV_VAR>"
          },
          "sslkey": {
            "from_env": "<SSL_KEY_ENV_VAR>"
          },
          "sslpassword": {
            "from_env": "<SSL_PASSWORD_ENV_VAR>"
          }
        }
      }
    }
  }
}
```
Cert data is stored on disk at a path provided by the flag
`--tenantCertPath`. If this flag is not provided, the data is stored in a uniquely generated folder.
# Implementation
The `PostgresSourceConnInfo` type has been extended to include an
optional set of `PGClientCerts` which itself contains the Env Vars
provided to the Metadata API:
```haskell
data PostgresSourceConnInfo
  = PostgresSourceConnInfo
  { _psciDatabaseUrl           :: !UrlConf
  , _psciPoolSettings          :: !(Maybe PostgresPoolSettings)
  , _psciUsePreparedStatements :: !Bool
  , _psciIsolationLevel        :: !Q.TxIsolation
  , _psciSslConfiguration      :: !(Maybe (PGClientCerts CertVar CertVar))
  }
```
When resolving a `PostgresConnConfiguration` in `getSourceResolver` we
check for the existence of `PGClientCerts`. If it is present, the env
vars are then queried in the environment and the cert data they
contain is written to disk at the `tenantCertPath` in a per tenant
sub-folder.
The filepaths are then merged into the postgres connection string as
query parameters. The module `HasuraPro.PGClientCerts` encapsulates
all the cert merging and writing operations.
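An illustrative sketch of those two steps (names and types deliberately do not match `HasuraPro.PGClientCerts`): write the cert material, already resolved from the configured env vars, into the tenant's sub-folder, and extend the connection string with the resulting paths.
```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import           System.Directory (createDirectoryIfMissing)
import           System.FilePath ((</>))

-- Illustrative only: writes cert contents into <tenantCertPath>/<tenant>/ and
-- appends the file paths as libpq parameters, assuming the URL has no query
-- string yet.
writeCertsAndExtendUrl
  :: FilePath            -- ^ value of --tenantCertPath
  -> String              -- ^ tenant id, used as the sub-folder name
  -> [(String, T.Text)]  -- ^ (libpq param, cert contents), e.g. ("sslrootcert", ...)
  -> T.Text              -- ^ original database URL
  -> IO T.Text
writeCertsAndExtendUrl certRoot tenant certs dbUrl = do
  let dir = certRoot </> tenant
  createDirectoryIfMissing True dir
  params <- mapM (writeOne dir) certs
  pure $ dbUrl <> "?" <> T.intercalate "&" params
  where
    writeOne dir (param, contents) = do
      let path = dir </> param
      writeFile path (T.unpack contents)
      pure $ T.pack param <> "=" <> T.pack path
```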
Cleanup of cert data is handled in the multitenant lifecycle
loop. When a tenant is taken down or updated, the entire tenant
subfolder of `tenantCertPath` is deleted. This ensures that cert data
does not linger longer than necessary and that we don't end up with
stale cert data during updates.


@ -1,123 +0,0 @@
---
authors: Naveenaidu <naveen@hasura.io>
---
# Query Tags
## Motivation
Users of native monitoring tools (pganalyze, RDS Performance Insights) need a
way to see some "application context" in the query logs. This is a PRO-only feature.
This can be provided by appending some extra information as comments to the
SQL statements that Hasura generates.
## Brief Overview
Query Tags are SQL comments, made up of `(key=value)` pairs, that
are appended to the SQL statements generated by Hasura for GraphQL operations.
The `(key=value)` pairs will henceforth be called `attributes`. The main
idea is to convert the attributes into an SQL comment and append it to the SQL
statement that is generated when a GraphQL operation is run through Hasura (a
minimal rendering sketch follows the example below).
For example:
When the following query is sent to Hasura
```graphql
query GetChild {
child {
name
}
}
```
Hasura attaches query tags (*unless disabled*) to the generated SQL statement it
sends to the database.
```sql
SELECT * FROM child /* request_id=487c2ed5-08a4-429a-b0e0-4666a82e3cc6, field_name=child, operation_name=GetChild */
```
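The rendering itself is straightforward; a minimal sketch (not the actual implementation) of turning the attributes into a comment:
```haskell
import Data.List (intercalate)

-- Render `(key=value)` attributes as a single SQL comment that can be
-- appended to the generated statement.
attributesToComment :: [(String, String)] -> String
attributesToComment attrs =
  "/* " <> intercalate ", " [k <> "=" <> v | (k, v) <- attrs] <> " */"

-- attributesToComment [ ("request_id", "487c2ed5-08a4-429a-b0e0-4666a82e3cc6")
--                     , ("field_name", "child"), ("operation_name", "GetChild") ]
--   == "/* request_id=487c2ed5-08a4-429a-b0e0-4666a82e3cc6, field_name=child, operation_name=GetChild */"
```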
## Metadata Specification
Each vendor that supports automatic visualization has its own nitty-gritty
around the query tags format, so in the interest of (future) customization we
will need to introduce a configuration element to the metadata.
The `query_tags` configuration will be a part of the `source`, i.e.
`source` -> `query_tags`:
```yaml
sources:
  name: hge_cloud_db
  configuration:
  query_tags:   # Optional Field
    disabled:   # Optional Field | Type: Bool | Values: true or false
    format:     # Optional Field | Values: standard or sqlcommenter
```
Query Tags are enabled for all sources by default, and the default format for the
query tags is `standard` (a sketch of the corresponding configuration type follows this list). That means:
1. The `query_tags` field is optional.
2. If the `query_tags` field is not present for a source in the metadata, then
query tags are enabled for the source and the format of the
query tags is `standard`.
3. To disable query tags, set the value of the `disabled` field to `true`.
4. A user only has to write the `query_tags` field for a source
when they want to disable it for that source OR change the format of the query
tags.
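One way to picture these defaults as a configuration type (constructor and field names are illustrative, not the ones used in graphql-engine):
```haskell
-- Illustrative model of the per-source `query_tags` configuration above.
data QueryTagsFormat
  = Standard
  | SQLCommenter
  deriving (Show, Eq)

data QueryTagsConfig = QueryTagsConfig
  { _qtcDisabled :: Bool
  , _qtcFormat   :: QueryTagsFormat
  } deriving (Show, Eq)

-- Defaults when `query_tags` is omitted from the source metadata:
-- tags enabled, standard format.
defaultQueryTagsConfig :: QueryTagsConfig
defaultQueryTagsConfig = QueryTagsConfig
  { _qtcDisabled = False
  , _qtcFormat   = Standard
  }
```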
## Approach
1. Add query-tags configuration field to `source` field of metadata. This is
done by adding the query-tags to `SourceMetadata` type, since the
`configuration` field of metadata is parsed as `SourceConnConfiguration`
datatype. Different Backends will have different formats of
`SourceConnConfiguration`, so if you want to configure query-tags for a
specific backend, make sure to add `query-tags` to the backend specific
`SourceConnConfiguration`.
2. The query tags configuration is made a part of `SourceInfo b` so that it is
accessible via `SourceConfigWith`. Ideally, query tags would have been a
part of `SourceConfig`, but due to the way our incremental schema building
works, even if only the query tags configuration changes via an incremental
API, the entire `SourceConfig` gets rebuilt (_see
`resolveSourceConfigIfNeeded`_), which means the connection pools also
get rebuilt. That is a heavy cost, so we decided that it's a saner
approach to have the query tags config as part of `SourceConfigWith`.
3. Create a new typeclass `MonadQueryTags` which has a `createQueryTags`
function. This new typeclass is necessary because query-tags is a
PRO-specific feature.
4. Add a `MonadReader QueryTagsComment` constraint to the `mkDB..Plan` functions.
This constraint lets the backend-specific implementation of a `mkDB..Plan`
function ask for the query tags comment if it wants to, thus reducing
the friction for implementors of new backends, because they do not need
to be concerned about query tags. `QueryTagsComment` carries the comment to
append to the generated SQL, should a backend decide to do so (see the sketch
after this list).
5. The query tags are generated where the `mkDB..Plan` functions are called,
since all the information necessary to build them is available
there. The query tags are converted into an SQL comment at the same place and
passed down the stack to where the SQL is generated for the GraphQL operation. The
query tags are then appended to the SQL.
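A toy illustration of point 4, with placeholder types rather than the real `mkDB..Plan` signatures: a backend that cares about query tags asks for the comment and appends it to its generated SQL; a backend that doesn't can simply ignore the environment.
```haskell
import Control.Monad.Reader (MonadReader, asks)

-- Placeholder for the environment carried by the MonadReader constraint.
newtype QueryTagsComment = QueryTagsComment { unQueryTagsComment :: String }

-- A backend's plan builder can fetch the comment and append it to the SQL it
-- generates, e.g. "/* request_id=..., field_name=child, operation_name=GetChild */".
appendTagsToSQL :: MonadReader QueryTagsComment m => String -> m String
appendTagsToSQL sql = do
  comment <- asks unQueryTagsComment
  pure (sql <> " " <> comment)
```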
### Note:
The above approach is only the first iteration of query tags.
We initially planned to have `appendQueryTags` inside the `BackendExecute`
typeclass, but that did not work out well, as `appendQueryTags` would then
depend on the `PreparedQuery b` type family, which is not used by mutations and,
for some backends, only gets created after pretty-printing.


@ -1,67 +0,0 @@
## Metadata examples
1. Global allowlist
```yaml
allowlist:
- collection: bar
  scope:         # optional
    global: true # required
```
2. Role-based allowlist
```yaml
allowlist:
- collection: bar
  scope:              # required for role-based allowlist
    global: false     # required
    roles: [foo, bar] # required for role-based allowlist
```
## Metadata format
1. The allowlist is an array of allowlist entries.
2. Each allowlist entry is an object, of one of 2 forms: global and role-based.
i. Each allowlist entry is required to contain a `collection`.
ii. A role-based allowlist entry is required to have a `scope`, containing `global: false` and `roles` (which is a non-empty array of roles).
iii. A global allowlist entry is not required to have a `scope` field, but if it does, we expect `global: true` and no roles. (This means that if `scope` is not provided, then it is assumed to be a global collection, which ensures backwards compatibility.)
3. Collection names must be unique in the entire allowlist (i.e. there must not be multiple allowlist entries with the same collection name).
4. Incremental API `add_collection_to_allowlist` takes an additional (optional) field `scope`.
5. We introduce a new incremental API `update_scope_of_collection_in_allowlist`. This has the fields `collection` and `scope` (both required).
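A possible way to model these rules as a type (hypothetical names, not the actual metadata types): the scope is either global or carries a non-empty list of roles, which makes the invalid combinations above unrepresentable.
```haskell
import Data.List.NonEmpty (NonEmpty)
import Data.Text (Text)

-- Hypothetical model of the rules above.
data AllowlistScope
  = GlobalScope                      -- `global: true`, or no `scope` field at all
  | RoleBasedScope (NonEmpty Text)   -- `global: false` plus a non-empty `roles` list
  deriving (Show, Eq)

data AllowlistEntry = AllowlistEntry
  { aeCollection :: Text             -- collection names must be unique across entries
  , aeScope      :: AllowlistScope
  } deriving (Show, Eq)
```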
## API Behaviour
1. Global collections are available for ALL roles.
2. Role-based collections are _additionally_ available to the configured roles.
3. When `add_collection_to_allowlist` does not have a `scope`:
a) If the collection already exists (global or role-based), no-op success with a response saying "collection already exists and given scope is ignored. To change scope, use the `update_scope_of_collection_in_allowlist` API". The main reason for not erroring here is that `add_collection_to_allowlist` is currently idempotent and we want to stay backwards compatible.
b) If the collection doesn't exist, then it is created as a global collection.
4. When `add_collection_to_allowlist` has a `scope` field:
a) If the collection already exists (global or role-based) in allowlist, no-op success with a response saying "collection already exists and given scope is ignored. To change scope, use the `update_scope_of_collection_in_allowlist` API".
b) If the collection doesn't exist in the allowlist, then it is created with the given `scope`. If `scope` doesn't conform to the format specified in the "Metadata format" section, then an appropriate error is thrown. E.g. "roles is missing for collection with non-global scope", "roles cannot be empty for collection with non-global scope", "roles should not be provided for collection with global scope".
5. For `drop_collection_from_allowlist`:
a) If the collection exists (global or role-based) in allowlist, then we drop the collection.
b) If the collection doesn't exist in allowlist, then an error is thrown saying collection does not exist in allowlist.
6. For `update_scope_of_collection_in_allowlist`:
a) If the collection exists (global or role-based) in the allowlist, then we set its scope to the provided `scope`. If `scope` doesn't conform to the format specified in the "Metadata format" section, then an appropriate error is thrown. E.g. "roles is missing for collection with non-global scope", "roles cannot be empty for collection with non-global scope", "roles should not be provided for collection with global scope".
b) If the collection doesn't exist in allowlist, then an error is thrown saying collection does not exist in allowlist.
**Backwards compatibility NOTE**
When a metadata spec configured with role-based allowlists is used with an older Hasura version, the `scope` key is silently ignored and the collection becomes global. We should document this. Ideally, even with the allowlist, role-based authz will still kick in for the collections.