# incremental

A library for caching intermediate results in `Arrow` computations. Used by `graphql-engine` to optimise schema cache updates by avoiding redundant recomputation.
## Schema cache?
The `SchemaCache` is a very complex object that, among other things, keeps track of the GraphQL schema and its parser for each user role in a `graphql-engine` instance. Its computation is based on both user-supplied metadata and the schema of each data source configured to work with Hasura. When one of these dependencies is updated (perhaps the user makes a metadata change), we'd like to rebuild the schema cache as quickly (and as memory-efficiently) as possible.

For our specific uses of `incremental`, see the implementation of `buildSchemaCacheRule` and its `Inc.cache` calls, as well as the note titled "Avoiding GraphQL schema rebuilds when changing irrelevant Metadata".
## `Arrow` 101 and `ArrowCache`
The `Arrow` abstraction describes a computation from some input to some output. A very simple example of an `Arrow` is the regular Haskell function `a -> b`: a computation that transforms some `a` into some `b`. Another good example is functions of the shape `a -> m b` for some `Monad m`: a computation that transforms some `a` into some `b` with potential side effects. The `Arrow` abstraction forms the basis of a whole hierarchy of classes, such as `ArrowChoice` and `ArrowLoop`, each of which builds on this basic idea. We can think of these as analogous to `Monad` being the basis of classes like `MonadPlus` and `MonadIO`: each class describes further specific capabilities on top of the generic `Monad` interface.
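To make these two shapes concrete, here is a small, self-contained sketch (not from this library) using the standard `Kleisli` wrapper from `Control.Arrow`, which turns any `a -> m b` into an arrow:

```haskell
import Control.Arrow (Kleisli (..), (>>>))

-- Plain functions are already arrows: (>>>) composes them left to right.
describe :: Int -> String
describe = (* 2) >>> show

-- Functions of shape a -> m b become arrows via the Kleisli wrapper;
-- with m ~ IO, the computation can perform side effects along the way.
describeLoudly :: Kleisli IO Int String
describeLoudly = Kleisli $ \n -> do
  putStrLn ("doubling " ++ show n)
  pure (show (n * 2))
```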
This library introduces the `ArrowCache` class. Sometimes, an `Arrow` represents a computation that is computationally expensive, but is also consistent in its output for any given input. For example, a compiler takes a collection of source files and produces a binary. This computation can be very expensive if we have a lot of source files and/or a complicated compiler, but the result should be deterministic. In this case, `ArrowCache` allows us to decorate an `Arrow` with a caching mechanism, avoiding recompilation of unchanged files.
When we `cache` an arrow, we keep track of the last input/output pair. If the next input matches the last input, we can skip the computation and return the last output, avoiding the need for the expensive computation. If the next input doesn't match the last input, we replace the stored pair with the new input/output pair. We can cache any arrow or composition of arrows in a pipeline, allowing us to express different granularities of caching[^1].
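As a minimal sketch of this mechanism (a simplified model, not the library's implementation), here is last-pair caching for plain functions using a mutable reference:

```haskell
import Data.IORef (newIORef, readIORef, writeIORef)

-- A simplified model of last-pair caching: remember only the most
-- recent input/output pair, and recompute only when the input changes.
cacheLast :: Eq a => (a -> b) -> IO (a -> IO b)
cacheLast f = do
  ref <- newIORef Nothing
  pure $ \input -> do
    stored <- readIORef ref
    case stored of
      Just (lastInput, lastOutput)
        | input == lastInput -> pure lastOutput   -- hit: reuse last output
      _ -> do
        let output = f input
        writeIORef ref (Just (input, output))     -- replace the stored pair
        pure output
```

The real library performs this bookkeeping inside the arrow itself rather than in `IO`, but the idea is the same.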
If thinking of programs as a composition of `Arrow`s is unfamiliar, we can instead think of our program as a directed, acyclic graph of intermediate results. For example, our compiler might first compile the individual modules, and then bundle them together. Any one of these intermediate results can be cached to avoid expensive recomputation when its dependencies remain unchanged.
## How does it work?
### `Rule` and `cache`
`Rule` is the interesting implementation of `ArrowCache`. Its instance in `Hasura.Incremental.Internal.Cache` implements the system described above, specifically in the `cached` definition of its `where` clause: if the input has not changed, then we return the last result. Otherwise, we recompute a result for the new input. Note that this is not memoisation: we don't store an input/output map. We only ever store the last input/output pair.
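As a simplified, pure model of that structure (the library's `Rule` is monadic and more involved), consider a rule that returns its result along with the rule to use on the next run, letting `cache` thread the last input/output pair through as state:

```haskell
-- A pure toy model of Rule: running it yields a result and a new rule.
newtype Rule a b = Rule { runRule :: a -> (b, Rule a b) }

cache :: Eq a => Rule a b -> Rule a b
cache rule = Rule $ \input ->
  let (output, rule') = runRule rule input
   in (output, cached input output rule')
  where
    -- If the next input equals the last one, return the last output;
    -- otherwise recompute and store the new input/output pair.
    cached lastInput lastOutput rule' = Rule $ \input ->
      if input == lastInput
        then (lastOutput, cached lastInput lastOutput rule')
        else
          let (output, rule'') = runRule rule' input
           in (output, cached input output rule'')
```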
### Keyed Dependencies
The `ArrowCache` class also features two other functions: `newDependency` and `dependOn`.

Consider an arrow that takes `Metadata` as input, but only ever accesses a small set of specific fields within it. In this case, we'd ideally like to consider only those particular fields when determining whether we should recompute! The `Dependency` type gives us a way to keep track of the parts of a structure on which we'd like to depend, and `Rule` then uses this information to cache based specifically on our `Accesses`, rather than naïvely checking for equality between the last and next input. The functions for working with `Dependency` live in `Hasura.Incremental.Internal.Dependency`.
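The library records these accesses automatically; as a rough model of the idea, here is the toy `Rule` from above extended with an explicit projection standing in for `Accesses` tracking:

```haskell
-- Compare only the part of the input the computation actually depends
-- on; changes elsewhere in the input will not invalidate the cache.
cacheOn :: Eq k => (a -> k) -> Rule a b -> Rule a b
cacheOn project rule = Rule $ \input ->
  let (output, rule') = runRule rule input
   in (output, cached (project input) output rule')
  where
    cached lastKey lastOutput rule' = Rule $ \input ->
      if project input == lastKey
        then (lastOutput, cached lastKey lastOutput rule')
        else
          let (output, rule'') = runRule rule' input
           in (output, cached (project input) output rule'')
```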
### Invalidation keys
Sometimes, we want to cache arrows that are not deterministic. In these cases, we might want to recompute some value despite the inputs being the same. For example, data source introspection is an inherently non-deterministic action: what if the database schema changes? In these cases, the provided `InvalidationKey` helper type can be used. When we `invalidate` an `InvalidationKey`, the resulting value does not equal the previous value, which forces a recomputation of the arrow when the key is given as part of the input.
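As a toy model of the contract (the library's actual representation may differ, and `initialInvalidationKey` here is illustrative), a counter suffices: each `invalidate` yields a value unequal to the previous one:

```haskell
-- The real InvalidationKey need not be a counter, but the contract is
-- the same: invalidate k /= k, so including the key in an arrow's input
-- defeats the cache's equality check on demand.
newtype InvalidationKey = InvalidationKey Int
  deriving (Eq, Show)

initialInvalidationKey :: InvalidationKey
initialInvalidationKey = InvalidationKey 0

invalidate :: InvalidationKey -> InvalidationKey
invalidate (InvalidationKey n) = InvalidationKey (n + 1)
```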
## Gotchas
### Monadic parameter-passing in `Rule`
The reason for using the `Arrow` abstraction is to talk about the computational relationship between the input and the output. In our specific case, what we want to talk about is whether or not we've seen this pairing before. However, this relies on the `Arrow` input being the only input to the computation.

When we `build` or `rebuild` a `Rule m a b` computation, we do so inside some `Applicative m`. If this `m` happens to be, say, `(->) r`, then we have access to an `r` parameter that isn't explicitly given as the `a` parameter, and it will not be considered part of the input that determines whether or not we need to recompute. This can lead to problems: if the result of the computation meaningfully changes depending on this value, then repeated calls may yield incorrect results. In general, we should be careful to make sure that our chosen `m` does not affect our decision to cache.
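To illustrate with a self-contained (and hypothetical) example, using the standard `Kleisli` wrapper in place of `Rule` for brevity: with `m ~ (->) Config`, the `Config` is an input in practice, but invisible to anything that inspects only the arrow's declared input.

```haskell
import Control.Arrow (Kleisli (..))

newtype Config = Config { verbose :: Bool }

-- The Config arrives through the monad, not through the arrow input, so
-- a cache keyed on the Int alone would wrongly reuse results across
-- different Configs.
render :: Kleisli ((->) Config) Int String
render = Kleisli $ \n config ->
  if verbose config
    then "the result is " ++ show (n * 2)
    else show (n * 2)
```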
On the other hand, this is an occasionally helpful loophole. For example, in `Hasura.RQL.Types.SchemaCache.Build`, we use a `MonadReader BuildReason m` constraint to determine whether we need to update, say, the event trigger catalogue. If the input (i.e. the schema) hasn't changed, then the `BuildReason` doesn't matter: the fact that it is accessed via the `m` means that caching works the way we'd like it to.
Another example is when `m` happens to be `IO`: in these cases, we have no guarantees about values being introduced during the computation. As mentioned in the `InvalidationKey` section, non-determinism may be desired, and in these cases the onus is on the user to invalidate the cache when recomputations are required.
[^1]: It might be tempting to cache every single arrow. However, this quite quickly leads to a large amount of memory being consumed, often for no good reason. Sometimes, a result is so simple to recompute that we're better off recomputing it rather than allocating more memory.