# Multiple backends architecture (historical)
This document was originally written at the project's inception. As of now, it still provides a historical perspective on how the code came to be the way it is. Some of the decision process is also recorded in a Google document.
## Table of contents

- [Project overview](#project-overview)
- [Overview](#overview)
- [Implementation challenges](#implementation-challenges)
- [Drawbacks, alternatives, open questions, and future work](#drawbacks-alternatives-open-questions-and-future-work)
## Project overview
The goal of this project is for the GraphQL engine to be able to support multiple backends – that is, to expose in our GraphQL schema tables that are stored not only in Postgres, but also in MySQL, MSSQL, MongoDB, and as many others as we need. This poses several challenges, such as:
- the codebase was originally written with the assumption that the underlying database is Postgres;
- different backends support different sets of features, which are often incompatible or differ in subtle ways.
This document details the chosen approach, its strengths and potential weaknesses.
### Design goals
The two main goals with this design were:
- as much as possible, adding support for a new backend should not require changing existing code: it should be enough for a backend to implement the required typeclasses in isolation:
  - this allows for several backends to be worked on in parallel without making it painful to rebase / merge code
  - it encourages having a clean separation in the code between backend-agnostic and backend-specific code
- the code should not explicitly branch on some runtime backend value; it should instead rely on type-level information (see the sketch below):
  - this avoids a proliferation of runtime tests such as `if backend == Postgres then ...`, favouring typeclasses instead
  - this helps us avoid an entire category of bugs, by enforcing at compile time that we don't mix incompatible backend features
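To illustrate that second point with a contrived example (all names here are hypothetical, for illustration only): rather than inspecting a runtime value to decide, say, how to quote an identifier, the backend-specific behaviour lives in an instance, and the compiler selects the right one.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Proxy (Proxy (..))
import Data.Text (Text)

data BackendType = Postgres | MSSQL
  deriving (Eq)

-- what we want to avoid: branching on a runtime backend value
quoteRuntime :: BackendType -> Text -> Text
quoteRuntime backend name =
  if backend == Postgres
    then "\"" <> name <> "\""
    else "[" <> name <> "]"

-- what we do instead: a class method, resolved at compile time
class BackendQuoting (b :: BackendType) where
  quote :: Proxy b -> Text -> Text

instance BackendQuoting 'Postgres where
  quote _ name = "\"" <> name <> "\""

instance BackendQuoting 'MSSQL where
  quote _ name = "[" <> name <> "]"
```

A call site then picks the instance via a `Proxy` (or a type application), e.g. `quote (Proxy :: Proxy 'Postgres) "users"`, and mixing up backends becomes a compile-time error rather than a runtime bug.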
A noteworthy drawback: no particular attention was given to the difficulty of onboarding new contributors to the project. The chosen design is quite complex, and over-engineering is a potential failure mode of this approach. This document is one way of trying to make onboarding new contributors easier, by providing an overview, but further attention should definitely be paid to documentation.
## Overview
From a high-level perspective, there are four required steps for a GraphQL query to hit a given backend:
- metadata must be loaded into the `SchemaCache`: a `SourceInfo` must be created, which contains, among other things, a `TableCache`, a `FunctionCache`, and all required connection information;
- a `GQLContext` must be created from that source: it details what is exposed in the schema, and how we parse an incoming query (the same code does both since the Great "Parse, Don't Validate (PDV)" Refactor of 2020);
- the output of the parsers must be representable using RQL's Intermediate Representation (IR): an AST that describes the shape of a given query;
- the output of the parsers (a `QueryRootField`, `MutationRootField`, or `SubscriptionRootField`) must be translated into a backend-specific query string and be executed over the network.
Our design was inspired by the *Trees That Grow* paper. Namely: most types are decorated with a type parameter that ties them to a specific backend, and we use type families to associate backend-specific types with that parameter whenever required. In total, five typeclasses are required:

- the `Backend` typeclass, which declares the associated type families,
- a `BackendMetadata` class for creating a `SourceInfo`,
- a `BackendSchema` class responsible for the schema / query parsing,
- a `BackendExecute` class responsible for query planning and translation to the appropriate SQL dialect,
- a `BackendTransport` class for execution over the network.
For a new backend to be supported, it "simply" needs to implement those typeclasses.
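In sketch form (the `'MyDB` backend and its associated types are hypothetical, and the real classes have many more methods than shown), adding a backend then looks like writing this family of instances:

```haskell
instance Backend 'MyDB where
  type TableName  'MyDB = MyDBTableName
  type ColumnType 'MyDB = MyDBColumnType

instance BackendMetadata 'MyDB where
  -- how to build a SourceInfo 'MyDB from the user-provided metadata
  ...

instance BackendSchema 'MyDB where
  -- backend-specific parser combinators
  ...

instance BackendExecute 'MyDB where
  -- query planning, and translation to MyDB's SQL dialect
  ...

instance BackendTransport 'MyDB where
  -- executing the resulting plan over the network
  ...
```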
### The `Backend` typeclass
Different backends have different column types: our old `PGColumnType` enum is not an appropriate representation for the column types of other backends. The `Backend` typeclass is used to provide a mapping from backend to implementation types. (Additionally, it is where we list all the class constraints on those types, to make other instances easier to write down the line.)
```haskell
class ( Show (TableName b),  Eq (TableName b)
      , Show (ColumnType b), Eq (ColumnType b)
      ) => Backend (b :: BackendType) where
  type TableName  b
  type ColumnType b

instance Backend 'Postgres where
  type TableName  'Postgres = PGTableName
  type ColumnType 'Postgres = PGColumnType
```
This typeclass is used to generalize metadata types, the IR, and all types that need to contain backend-specific information:
```haskell
-- Metadata
data TableInfo (b :: BackendType) =
  TableInfo { tableName   :: TableName b
            , tableConfig :: TableConfig
            }

-- IR
data AnnDelG (b :: BackendType) v
  = AnnDel { dqp1Table :: !(TableName b)
           , dqp1Where :: !(AnnBoolExp b v, AnnBoolExp b v)
           }
```
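Those superclass constraints pay off whenever we write backend-polymorphic code; for instance (a hypothetical helper), we can show a table name for any backend without restating `Show (TableName b)`:

```haskell
import qualified Data.Text as Text

-- Show (TableName b) comes from Backend's superclass constraints,
-- so generic code does not have to restate it
tableDescription :: Backend b => TableInfo b -> Text.Text
tableDescription info = Text.pack ("table " <> show (tableName info))
```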
Another use of the `Backend` typeclass is extension types. The IR, by definition, needs to support the union of all possible features we want to support, since any query we accept must be representable by it. Extension types allow us to prune the IR AST for a given backend, by making some features unrepresentable, or to add backend-specific information to the AST if required.
```haskell
instance Backend 'MSSQL where
  type XRelay 'MSSQL = Void

data AnnFieldG (b :: BackendType) v
  = AFColumn !(AnnColumnField b)
  ...
  | AFNodeId (XRelay b) !(TableName b) !(PrimaryKeyColumns b)
```
In this example, the code that deals with translating MSSQL's IR does not need to handle the `AFNodeId` constructor, for which there would be no corresponding implementation, since that constructor can never be constructed due to `XRelay` being `Void`.
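Concretely (a minimal sketch: `translateColumn` and `SQLExp` are made-up names), the MSSQL translation code can dismiss the impossible branch with `Data.Void.absurd`:

```haskell
{-# LANGUAGE LambdaCase #-}

import Data.Void (absurd)

translateField :: AnnFieldG 'MSSQL v -> SQLExp
translateField = \case
  AFColumn column     -> translateColumn column
  -- XRelay 'MSSQL is Void, so this branch is statically unreachable
  AFNodeId xRelay _ _ -> absurd xRelay
```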
### `BackendMetadata`
TODO: detail how we use it to create a `SourceInfo b`.
### `BackendSchema`
This typeclass is responsible for the implementation of the GraphQL parsers. We are keeping the same parser combinators approach that was already in use: ultimately, `GraphQL/Schema` will only contain generic components, such as `selectTable`, `tableSelectionSet`, or `functionArguments`, that call into the typeclass for backend-specific combinators, such as `parseColumn`. See the PDV documentation for how we write our schema code using parser combinators. At the time of writing, most components are already generic, with the notable exception of Actions and Relay.
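The resulting split looks roughly like this (signatures heavily simplified; the real combinators have more arguments and a richer `Parser` type):

```haskell
class Backend b => BackendSchema b where
  -- backend-specific: how to parse a value of one of this backend's column types
  parseColumn :: MonadSchema m => ColumnType b -> m (Parser m (ColumnValue b))

-- backend-agnostic: written once, in terms of the class methods
selectTable
  :: (BackendSchema b, MonadSchema m)
  => TableInfo b
  -> m (Parser m (SelectExp b))
selectTable tableInfo = ...
```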
### `BackendExecute`
This typeclass is responsible for query planning: given a list of root fields (selection fields appearing at the top level of the query / mutation / subscription), generate an `ExecutionPlan` by building the corresponding monadic actions. As of now, each root field maps to one `ExecutionStep`, but this will likely change as we start implementing a dataloader.
Additionally, `BackendExecute` is responsible for deciding in which monad each execution step must be executed; for instance, at the time of writing, Postgres uses `Tracing.TraceT (LazyTxT QErr IO)`, while MSSQL simply uses `IO`. How to deal with that monad, and how to lift the results back into the overarching monad, is the responsibility of the next typeclass: `BackendTransport`.
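One way to picture that split (a sketch, not the actual class, which contains much more) is as an associated type family naming each backend's execution monad, which `BackendTransport` then knows how to run and lift:

```haskell
import Data.Kind (Type)

class Backend b => BackendExecute b where
  -- the monad in which this backend's execution steps run, e.g.
  -- Tracing.TraceT (LazyTxT QErr IO) for Postgres, plain IO for MSSQL
  type ExecutionMonad b :: Type -> Type
```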
### `BackendTransport`
Finally, `BackendTransport` is responsible for executing each step over the network. Each backend can decide what to trace, what to log, and how to deal with each given step.
## Implementation challenges
### Heterogeneous containers and existential quantification
A major issue with having all types decorated with the backend type parameter is that it makes it difficult to store information about different backends in the same containers: if a GraphQL query contains a select on a Postgres table and a select on an MSSQL table, how do we store the heterogeneous list of `QueryRootField b`? In the schema cache, we keep a hashmap from `SourceName` to `SourceInfo`; how do we keep this, now that `SourceInfo` is parameterized by the backend type? The answer is existential types: we introduce wrapper types such as:
```haskell
data BackendSourceInfo = forall b. Backend b => BackendSourceInfo (SourceInfo b)
```
More generally, we use existential types at the boundaries between the different steps. The output of the metadata step is a `SourceInfo b`, hidden in the existential `BackendSourceInfo`. The parsers are generated based on each `SourceInfo`, and output an existential `QueryRootField`... Each step deals with the different backends in isolation, with its own typeclass, and outputs an existential type that erases the backend type. This allows us to keep a unified pipeline dealing with several different backends at the same time.
### Breaking the dependency cycle
HERE BE DRAGONS
A problem we have with the backend typeclasses is that we end up with a circular dependency between the existential container and the typeclass that uses it in the next step. For instance, we would ideally want `SourceInfo` to know about `BackendSchema`; this would allow for the following:
```haskell
data BackendSourceInfo = forall b. BackendSchema b => BackendSourceInfo (SourceInfo b)

buildSource
  :: (BackendSchema b, MonadSchema m)
  => SourceInfo b
  -> m SomeExistentialSchemaInfo
buildSource = ...

buildAllSources
  :: MonadSchema m
  => SourceCache
  -> m [SomeExistentialSchemaInfo]
buildAllSources cache = for cache $ \(BackendSourceInfo sourceInfo) -> buildSource sourceInfo
```
Here, we can directly call `buildSource` on the `sourceInfo`, since the `BackendSchema b` constraint was in the constructor. But alas, it is neither possible nor desirable to do this, for two reasons:
- `BackendSchema` already needs to know about `SourceInfo`: we need access to the source cache while building the schema, and that constraint is exposed in `MonadSchema`, which the combinators in `BackendSchema` rely on; this creates a cycle, which I don't believe is even possible to break (partly because I ran into a GHC bug);
- the code that generates a `SourceInfo` would also need to know about `BackendSchema`, and add the corresponding constraints throughout the metadata code; on top of mixing concerns, this would also lead to some hard-to-break cycles.
Instead, we opt for a different solution: those existential types only know about `Backend`. Each step is a "black box", and each part can be kept in isolation: no need to know about `BackendSchema` to implement `BackendMetadata`. While this sounds like a major benefit, it leaves us with a difficult question: how do we dispatch to the right typeclass instance, when the constraint was not part of the constructor? How do we implement `buildAllSources` in the example above?
We simply have to explicitly unpack / repack, sadly. We use a GADT to represent some backend tags, and we use this to identify the actual backend:
```haskell
data BackendTag (b :: BackendType) where
  PostgresTag :: BackendTag 'Postgres
  MSSQLTag    :: BackendTag 'MSSQL
  ...

data BackendSourceInfo = forall b. Backend b => BackendSourceInfo (BackendTag b) (SourceInfo b)

buildSource
  :: (BackendSchema b, MonadSchema m)
  => SourceInfo b
  -> m SomeExistentialSchemaInfo
buildSource = ...

buildAllSources
  :: MonadSchema m
  => SourceCache
  -> m [SomeExistentialSchemaInfo]
buildAllSources cache = for cache $ \(BackendSourceInfo backendTag sourceInfo) -> case backendTag of
  PostgresTag -> buildSource sourceInfo
  MSSQLTag    -> buildSource sourceInfo
  ...
```
As long as there is an instance of `BackendSchema` available in that scope, this works, even if the constraint wasn't part of the constructor. The major drawback is of course that we will now have a few explicit switch cases like this in the code base, at the boundary of each step... However, there is a solution for this as well: all of them have the exact same shape: given an existential type, switch over the tag, and call a given function on each value. We could therefore use TemplateHaskell to generate those switch cases for us, to make sure that whenever a new backend is added to the `BackendType` enum, all dispatch functions are generated correctly. This is an optional and debatable quality-of-life improvement, which trades more code complexity in exchange for fewer changes when adding a new backend.
## Drawbacks, alternatives, open questions, and future work
There are several potential failure modes with this approach, beyond the risk of over-engineering.
### Too big a `Backend` class
At the time of writing, `Backend` contains a plethora of types, and is quite verbose. This is the result of a short deadline for the first iteration of this project, which led to a pretty aggressive generalization of the codebase: it was easier to put anything "suspicious" in `Backend` and clean it up later. As we refine the code, we will probably want to do the opposite, just as aggressively: find a common representation for most types that works across backends, and reduce the number of types.
### Cross-source interactions
Likewise, at a later stage, we will need to support operations that work at a higher level, across sources, such as client-side joins. We will either need to decide on a generic representation of table names that is the same across all backends, or add more existential types... This will be a project of its own, but here again the same trade-off will apply: having types in `Backend` will allow for a more granular implementation per backend (allowing us to properly handle things such as case sensitivity, or the notion of a "public" schema) at the cost of more existentially quantified types; having generic names will be tricky with respect to some backends, but would allow for "normal" Haskell.
### Growing IR
The more backends we add, the more we might need fine-grained control over every aspect of the IR: we might want to be able to prune every single feature with an extension type... This could lead, again, to an explosion of types in `Backend`, and an increase in the complexity of the IR itself. A potential trade-off here would be to accept that some features will be representable in the IR despite not being supported by all backends, making the translation code later on partial. This might be an acceptable trade-off if we have a good enough test suite.
### Test suite
We will need to investigate how to run our existing integration test suite on other backends. For now, the metadata is set up by executing raw SQL on Postgres, which will no longer work. A solution could be to use an abstract representation of the setup, and to maintain a connector per backend that sets up a test database based on this abstract representation.
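For illustration only (every name below is hypothetical, sketching one possible shape for that abstract representation):

```haskell
import Data.Text (Text)

-- an abstract, backend-agnostic description of a test fixture
data TestColumn = TestColumn { colName :: Text, colType :: AbstractScalarType }
data TestTable  = TestTable  { tabName :: Text, tabColumns :: [TestColumn] }

-- one connector per backend, translating the abstract setup into actual DDL
class TestConnector (b :: BackendType) where
  setupTables :: BackendTag b -> [TestTable] -> IO ()
```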