mirror of
https://github.com/enso-org/enso.git
synced 2024-11-22 22:10:15 +03:00
293 lines
13 KiB
Markdown
---
layout: developer-doc
title: IR Caching in the Enso Compiler
category: runtime
tags: [runtime, caching]
order: 10
---

# IR Caching in the Enso Compiler

One of the largest pain points for users of Enso at the moment is the fact that
it has to precompile the entire standard library on every project load. This is,
in essence, due to the fact that the current parser is abysmally slow, and
incredibly demanding. The obvious solution to improve this is to take the parser
out of the equation in its entirety, by serializing the parser's output.

To that end, we want to serialize the Enso IR to a format that can later be read
back in, bypassing the parser entirely. Furthermore, we can move the boundary at
which this serialization takes place to the end of the compiler pipeline,
thereby skipping most of the compilation work, and further improving startup
performance.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Serializing the IR](#serializing-the-ir)
  - [Breaking Links](#breaking-links)
- [Storing the IR](#storing-the-ir)
  - [Metadata Format](#metadata-format)
  - [Portability and Versioning](#portability-and-versioning)
- [Loading the IR](#loading-the-ir)
  - [Integrity Checking](#integrity-checking)
  - [Error Handling](#error-handling)
  - [Imports](#imports)
- [Testing the Serialization](#testing-the-serialization)
- [Future Directions](#future-directions)

<!-- /MarkdownTOC -->

## Serializing the IR

Using classical Java Serialization turned out to be unsuitably slow. Rather than
switching to another serialization framework that does the same thing, only
faster, we decided in [PR-8207](https://github.com/enso-org/enso/pull/8207) to
create our _own persistence framework_ that radically changes the way we can
read the caches. Rather than loading all the megabytes of stored data, it reads
them _lazily on demand_.

Use the following command to generate the Javadoc for the `org.enso.persist`
package:

```bash
enso$ find lib/java/persistance/src/main/java/ | grep java$ | xargs ~/bin/graalvm-21/bin/javadoc -d target/javadoc/ --snippet-path lib/java/persistance/src/test/java/
enso$ links target/javadoc/index.html
```

In order to maximize the benefits of this process, we want to serialize the IR
as _late_ in the compiler pipeline as possible. This means serializing it just
before the code generation step that generates Truffle nodes (before the
`RuntimeStubsGenerator` and `IrToTruffle` run).

This serialization should take place in an _offloaded thread_ so that it doesn't
block the compiler from continuing.

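The offloading can be pictured with a minimal sketch such as the one below. It
is not the compiler's actual code: the `BackgroundSerializer` class, the thread
name and the `serializer` function are all hypothetical, and the sketch only
shows the idea of handing the IR to a worker thread and continuing immediately.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper (not part of Enso): runs a serializer off the compiler thread. */
final class BackgroundSerializer implements AutoCloseable {
  // A single daemon worker keeps writes ordered and never blocks JVM exit.
  private final ExecutorService pool =
      Executors.newSingleThreadExecutor(
          runnable -> {
            Thread thread = new Thread(runnable, "ir-serialization");
            thread.setDaemon(true);
            return thread;
          });

  /** Schedules the serialization and returns at once with a handle to the bytes. */
  <T> Future<byte[]> submit(T ir, Function<T, byte[]> serializer) {
    Callable<byte[]> task = () -> serializer.apply(ir);
    return pool.submit(task);
  }

  @Override
  public void close() {
    pool.shutdown();
  }
}
```

The caller only needs the returned `Future` if it wants to observe failures;
otherwise the compiler can carry on with code generation straight away.
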
### Breaking Links

Doing this naïvely, however, means that we can inadvertently end up serializing
the entire module graph. This is due to the `BindingsMap`, which contains a
reference to the associated `runtime.Module`, from which there is a reference to
the `ModuleScope`. The `ModuleScope` may then reference other `runtime.Module`s
which all contain `IR.Module`s. Therefore, done in a silly fashion, we end up
serializing the entire reachable module graph. This is not what we want.

The `Persistance.write` method takes an additional `writeReplace` function,
which our cache system uses to perform the following modification just before
`ProcessingPass.Metadata` is written out:

- Modify `BindingsMap` and its child types to be able to contain an unlinked
  module pointer `case class ModulePointer(qualifiedName: List[String])` in
  place of a `Module`.
- As the `MetadataStorage` type that holds the `BindingsMap` is mutable, it
  might be tempting to update it in place, but relying on the `writeReplace`
  mechanism is safer as it only changes the format of the object being written
  down, rather than modifying objects of the live `IR`, which are potentially
  shared with other parts of the system.

Having done this, we have broken any links that the IR may hold between modules,
and can serialize each module individually.

It _may_ be safer to `duplicate` the IR before handing it to serialization, but
it shouldn't be necessary if the `writeReplace` function is written correctly.

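To make the replacement idea concrete, here is a small self-contained sketch.
It does not use the real `org.enso.persist` API: `LiveModule`, the Java
`ModulePointer` class and `LinkBreaking` are hypothetical stand-ins, and the
only assumption taken from the text above is that a function of type
`Object -> Object` is applied to every object just before it is written (and,
symmetrically, just after it is read).

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for a live runtime module; not the real `runtime.Module`. */
final class LiveModule {
  final List<String> qualifiedName;

  LiveModule(List<String> qualifiedName) {
    this.qualifiedName = qualifiedName;
  }
}

/** Unlinked pointer that is stored in the cache instead of the module itself. */
final class ModulePointer {
  final List<String> qualifiedName;

  ModulePointer(List<String> qualifiedName) {
    this.qualifiedName = qualifiedName;
  }
}

final class LinkBreaking {
  /** Applied to each object just before it is written; live objects stay untouched. */
  static final Function<Object, Object> WRITE_REPLACE =
      obj ->
          obj instanceof LiveModule
              ? new ModulePointer(((LiveModule) obj).qualifiedName)
              : obj;

  /** The inverse step on load; `lookup` is assumed to find live modules by name. */
  static Function<Object, Object> readResolve(Function<List<String>, LiveModule> lookup) {
    return obj ->
        obj instanceof ModulePointer
            ? lookup.apply(((ModulePointer) obj).qualifiedName)
            : obj;
  }
}
```

Because the function returns a new object instead of mutating its argument, the
live IR is never changed by the act of writing it.
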
## Storing the IR

The serialized IR needs to be stored in a location that is tied to the library
that it serializes. Despite this, we _also_ want to be able to ship cached IR
with libraries. This leads to a two-pronged solution where we check two
locations for the cache.

1. **With the Library:** As libraries can have a hidden `.enso` directory, we
   can use a path within that for caching. This should be
   `$package/.enso/cache/ir/enso-$version/`, and can be accessed by extending
   the `pkg` library to be aware of the cache directories.
2. **Globally:** As some library locations may not be writeable, we need to have
   a global out-of-line cache that is used if the first one is not writeable.
   This is located under `$ENSO_DATA` (whose location can be obtained from the
   `RuntimeDistributionManager`), at the path
   `$ENSO_DATA/cache/ir/$hash/enso-$version/`, where `$hash` is the `SHA3-224`
   hash of the tuple `(namespace, library_name, version)`, where
   `version = SemVer | "local"`. This hash is computed by concatenating the
   string representations of these fields.

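As an illustration of the global location, the following sketch derives the
directory from the three fields. Several details are assumptions that the text
above does not specify: the fields are concatenated without a separator,
encoded as UTF-8, and the digest is rendered as lowercase hex. The class and
method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper sketching the global cache path layout. */
final class GlobalCachePath {
  /** Hex-encoded SHA3-224 of the concatenated string forms of the three fields. */
  static String libraryHash(String namespace, String libraryName, String version) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA3-224");
      // Assumption: plain concatenation, UTF-8 encoded.
      String joined = namespace + libraryName + version;
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest(joined.getBytes(StandardCharsets.UTF_8))) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA3-224 requires Java 9 or newer", e);
    }
  }

  /** Builds `$ENSO_DATA/cache/ir/$hash/enso-$version/` for one library. */
  static String cacheDir(
      String ensoData, String namespace, String libraryName, String version, String ensoVersion) {
    return ensoData
        + "/cache/ir/"
        + libraryHash(namespace, libraryName, version)
        + "/enso-"
        + ensoVersion
        + "/";
  }
}
```
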
In each location, the IR is stored with the following assumptions:

- The IR file is located in a directory modelled after its module path, followed
  by a file named after the module itself with the extension `.ir` (e.g. the IR
  for `Standard.Base.Data.Vector` is stored in `Standard/Base/Data/Vector.ir`).
- The [metadata](#metadata-format) file is located in a directory modelled after
  its module path, followed by a file named after the module itself with the
  extension `.meta` (e.g. the metadata for `Standard.Base.Data.Vector` is
  stored in `Standard/Base/Data/Vector.meta`). This is right next to the
  corresponding `.ir` file.

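The mapping from a module name to its cache files is mechanical, as this tiny
sketch shows. `CacheFileNames` is a hypothetical name, and the sketch assumes
that module name segments never contain a dot themselves.

```java
/** Hypothetical helper: maps a qualified module name to a cache-relative path. */
final class CacheFileNames {
  /** Replaces the dots of the module name with `/` and appends the extension. */
  static String relativePath(String qualifiedModuleName, String extension) {
    return qualifiedModuleName.replace('.', '/') + extension;
  }

  public static void main(String[] args) {
    System.out.println(relativePath("Standard.Base.Data.Vector", ".ir"));
    System.out.println(relativePath("Standard.Base.Data.Vector", ".meta"));
  }
}
```
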
Storage of the IR takes place iff the intended location for that IR is _empty_.

### Metadata Format

The metadata is used for integrity checking of the cached IR to prevent loading
corrupted or out-of-date data from the cache. Due to the fact that engines can
only load IR created by their own version, and cached IR is located in a
directory named after the engine version, this format need not be forward
compatible.

It is a JSON file as follows:

```typescript
{
  sourceHash: String; // The hash of the corresponding source file.
  blobHash: String; // The hash of the blob.
  compilationStage: String; // The compilation stage of the IR.
}
```

All hashes are computed with SHA-1, for performance reasons. The engine version
is encoded in the cache path, and hence does not need to be explicitly specified
in the metadata.

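A minimal sketch of producing such a metadata record is shown below. It is not
the engine's writer: the `CacheMetadata` class is hypothetical, the hashes are
assumed to be stored as lowercase hex strings, and the JSON is assembled by
hand on the assumption that the compilation stage contains no characters that
need escaping.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that renders the three metadata fields as JSON. */
final class CacheMetadata {
  /** Lowercase hex SHA-1 of the given bytes. */
  static String sha1Hex(byte[] data) {
    try {
      StringBuilder hex = new StringBuilder();
      for (byte b : MessageDigest.getInstance("SHA-1").digest(data)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-1 is required on every Java platform", e);
    }
  }

  /** Builds the metadata for one module from its source bytes and serialized blob. */
  static String toJson(byte[] source, byte[] blob, String compilationStage) {
    return String.format(
        "{\"sourceHash\":\"%s\",\"blobHash\":\"%s\",\"compilationStage\":\"%s\"}",
        sha1Hex(source), sha1Hex(blob), compilationStage);
  }
}
```
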
### Portability and Versioning

There are two static methods in the `Persistance` class that help to create a
`byte[]` from a single object and then read it back. The array is identified by
the following header:

- 4 bytes of fixed header
- 4 bytes describing the version
- 4 bytes to locate the beginning of the object (the objects aren't written
  linearly)

That is, there are 12 bytes of overhead before the actual data starts. The
following versioning is recommended when making a change:

- When you change something really core in the `Persistance` implementation,
  change the first four bytes of the built-in header.
- When you add or remove a `Persistance` implementation, the version changes (as
  it is computed from all the IDs present in the system).
- When you change the format of some `Persistance.writeObject` method, change
  its ID.

That way the same version of Enso will recognize its `.ir` files. Different
versions of Enso will realize that the files aren't in a suitable form.

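The 12-byte layout can be sketched with a plain `ByteBuffer`. The magic value
below is made up for illustration, the big-endian byte order is an assumption,
and `CacheHeader` is a hypothetical class; only the three 4-byte fields and
their order come from the description above.

```java
import java.nio.ByteBuffer;

/** Hypothetical reader/writer for a 12-byte header: magic, version, root offset. */
final class CacheHeader {
  static final int MAGIC = 0x0A0B0C0D; // made-up value, not the real fixed header
  static final int SIZE = 12;

  /** Writes the three 4-byte fields in order (big-endian by ByteBuffer default). */
  static byte[] write(int version, int rootOffset) {
    return ByteBuffer.allocate(SIZE).putInt(MAGIC).putInt(version).putInt(rootOffset).array();
  }

  /** Returns the offset of the root object, or -1 when the file must be ignored. */
  static int rootOffset(byte[] file, int expectedVersion) {
    if (file.length < SIZE) {
      return -1; // too short to even hold a header
    }
    ByteBuffer buffer = ByteBuffer.wrap(file);
    if (buffer.getInt() != MAGIC) {
      return -1; // written by an incompatible core implementation
    }
    if (buffer.getInt() != expectedVersion) {
      return -1; // written with a different set of persistance IDs
    }
    return buffer.getInt();
  }
}
```

Rejecting a file at the header means that a mismatched cache is never partially
decoded; the caller simply treats it as absent.
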
Every `Persistance` class has a unique identifier. In order to keep definitions
consistent, one should not attempt to use smaller `id`s than previously
assigned. One should also not delete any `Persistance` classes.

Additionally, the `PerMap.serialVersionUID` value provides a seed to the version
stamp calculated from all `Persistance` classes. Increasing the
`serialVersionUID` will invalidate all caches.

## Loading the IR

Loading the IR is a multi-stage process that involves performing integrity
checking on the loaded cache. It works as follows.

1. **Find the Cache:** Look in the global cache directory under `$ENSO_DATA`. If
   there is no cached IR here that is valid for the current configuration, check
   the library's `.enso/cache` folder. This should be hooked into
   `Compiler::parseModule`.
2. **Check Integrity:** Check the module's [metadata](#metadata-format) for
   validity according to the [integrity rules](#integrity-checking).
3. **Load:** If the cache passes the integrity check, load the `.ir` file. If
   deserialization fails in any way, immediately fall back to parsing the source
   file.
4. **Re-Link:** Relinking is part of **Load**. When using `Persistance.read`,
   provide your own `readResolve` function. Such a function gets a chance to
   change and replace each object read in with an appropriate variant that
   respects the whole compiler environment.

The main subtlety here is handling the dependencies between modules. We need to
ensure that, when loading multiple cached libraries, we properly handle them
one-by-one. Doing this is as simple as hooking into `Compiler::parseModule` and
setting `AFTER_STATIC_PASSES` as the compilation state after loading the module.
This will tie into the current `ImportsResolver` and `ExportsResolver` which are
run in an un-gated fashion in `Compiler::run`.

Unlike classical Java deserialization, only registered `Persistance` subclasses
may participate in deserialization, making it much safer and less vulnerable.

### Integrity Checking

For a cache to be usable, the following properties need to be satisfied:

1. The `sourceHash` must match the hash of the corresponding source file.
2. The `blobHash` must match the hash of the corresponding `.ir` file.

If any of these fail, the cache file should be deleted where possible, or
ignored if it is in a read-only location.

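The two rules, and the delete-or-ignore policy, can be sketched as follows.
This is an illustration under assumptions, not the engine's implementation:
`IntegrityCheck` is a hypothetical class and the stored hashes are assumed to
be lowercase hex SHA-1 strings.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper implementing the two integrity rules. */
final class IntegrityCheck {
  /** Lowercase hex SHA-1 of the given bytes. */
  static String sha1Hex(byte[] data) {
    try {
      StringBuilder hex = new StringBuilder();
      for (byte b : MessageDigest.getInstance("SHA-1").digest(data)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-1 is required on every Java platform", e);
    }
  }

  /** Both recorded hashes must match the current source and the stored blob. */
  static boolean isUsable(String sourceHash, String blobHash, byte[] source, byte[] blob) {
    return sourceHash.equals(sha1Hex(source)) && blobHash.equals(sha1Hex(blob));
  }

  /** Deletes a stale cache file where possible; a read-only location is just ignored. */
  static boolean discard(Path cacheFile) {
    try {
      return Files.deleteIfExists(cacheFile);
    } catch (IOException | SecurityException e) {
      return false; // cannot delete, so the file is left alone and not used
    }
  }
}
```
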
### Error Handling

It is important, as part of this, that we fail under all circumstances into a
working state. This means that:

- If serialization fails, we report a low-priority error message and continue.
- If deserialization fails, we fall back to loading and parsing the original
  source file.

At no point should this mechanism be exposed to the user in any visible way,
other than the fact that they may see the cache files on disk.

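A minimal sketch of the "always fall back to a working state" rule for loading
is given below. The `CacheFallback` class, the logger name and the use of
`Optional` for a cache miss are all assumptions made for the example.

```java
import java.util.Optional;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: try the cache first, parse the source on any problem. */
final class CacheFallback {
  private static final Logger LOG = Logger.getLogger("ir-cache"); // made-up logger name

  static <T> T loadOrParse(Supplier<Optional<T>> fromCache, Supplier<T> fromSource) {
    try {
      Optional<T> cached = fromCache.get();
      if (cached.isPresent()) {
        return cached.get();
      }
    } catch (RuntimeException e) {
      // Low-priority message only: a broken cache must never surface to the user.
      LOG.log(Level.FINE, "Ignoring unreadable IR cache", e);
    }
    return fromSource.get();
  }
}
```
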
### Imports

Integrity Checking does not cover the situation where the cached module imports
a module whose cache has been invalidated. For example, module `A` uses a method
`foo` from module `B` and a successful compilation resulted in IR cache for both
`A` and `B`. Later, someone modified module `B` by renaming method `foo` to
`bar`. If we only compared source hashes, `B`'s IR would be re-generated while
`A`'s would be loaded from cache, thus failing to notice the method name change
until a complete cache invalidation was forced.

Therefore, the compiler performs an additional check by invalidating a module's
cache if any of its imported modules have been invalidated.

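The propagation rule can be sketched as a fixed-point computation over the
import graph. This is only one way to express the rule, not necessarily how the
compiler does it; `ImportInvalidation` and the string-keyed graph are
hypothetical.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: spreads invalidation from changed modules to their importers. */
final class ImportInvalidation {
  /**
   * @param imports maps each module to the modules it imports directly
   * @param changed modules whose own hashes no longer match
   * @return every module whose cache must not be used
   */
  static Set<String> invalidated(Map<String, List<String>> imports, Set<String> changed) {
    Set<String> invalid = new HashSet<>(changed);
    boolean grew = true;
    // Repeat until no further module is added; this also terminates on import cycles.
    while (grew) {
      grew = false;
      for (Map.Entry<String, List<String>> entry : imports.entrySet()) {
        boolean importsInvalid = !Collections.disjoint(entry.getValue(), invalid);
        if (importsInvalid && invalid.add(entry.getKey())) {
          grew = true;
        }
      }
    }
    return invalid;
  }
}
```

In the example from the text, a change to `B` puts `A` into the result as well,
so `A` is recompiled and the rename of `foo` is noticed.
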
## Testing the Serialization

Several elements need to be tested as part of this feature:

- The `persistance` project comes with its own unit tests.
- The `runtime-parser` project adds tests of various core classes used during
  `IR` serialization, such as Scala `List`, as well as checks of the _laziness_
  of Scala `Seq`.
- We need to test the serialization and deserialization process, including the
  rewrite of `BindingsMap`, to make sure it works properly.
- We also need to test the discovery of cache locations on the filesystem and
  cache eviction strategies. The best way to do this is to set `$ENSO_DATA` to a
  temporary directory and then directly interact with the filesystem. Caching
  should be disabled for existing tests. This will require adding additional
  runtime options for debugging, but also constructing the `DistributionManager`
  on context creation (removing `RuntimeDistributionManager`).

### Import/Export caching of bindings

Import and export resolution is one of the more expensive elements in the
initial pipeline. It is also the element which does not change for released
library components, as we do not expect users to modify them. During the initial
compilation stage we iteratively parse the module or load its cached IR, do
import resolution on the module, followed by export resolution, and repeat the
process with any dependent modules discovered in the process. Calculating such a
transitive closure is an expensive and repeatable process. By caching bindings
per library we are able to skip that process completely and discover all
necessary modules of the library in a single pass.

The bindings are serialized along with the library caches in a file with a
`.bindings` suffix.

Furthermore, the `.ir` files use _lazy_ `Seq` references to separate the general
part of the `IR` tree from the elements representing method bodies. As such, the
compiler can process the structure of `.ir` files while avoiding loading the
`IR` for methods that aren't being executed.

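The effect of such lazy references can be illustrated with a generic memoizing
wrapper. This is not the framework's own reference type: `LazyRef` is a
hypothetical class, and the `loader` stands in for whatever would decode a
method body from its position in the `.ir` file.

```java
import java.util.function.Supplier;

/** Hypothetical lazy reference: the payload is decoded on first access only. */
final class LazyRef<T> {
  private Supplier<T> loader;
  private T value;
  private boolean loaded;

  LazyRef(Supplier<T> loader) {
    this.loader = loader;
  }

  /** Runs the loader at most once and remembers its result. */
  synchronized T get() {
    if (!loaded) {
      value = loader.get();
      loaded = true;
      loader = null; // the loader is no longer needed once the value exists
    }
    return value;
  }

  synchronized boolean isLoaded() {
    return loaded;
  }
}
```

A module read this way can expose its structure right away, while the body of a
method that is never executed never triggers its loader.
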
## Future Directions

The `Persistance` framework gives us _laziness_ opportunities and we should use
them more:

- have a single _blob_ with all `IR`s per library and read only the parts that
  are needed

- experiment with GC, being able to release parts of unused `IR` once they have
  been used (for code generation and the like)

- make the `.ir` files smaller where possible

The use of `Persistance` has already sped up the execution time of a simple
`IO.println "Hello!"` by 16%, so let's use it to speed things up even more.