13 KiB
layout | title | category | tags | order | |||
---|---|---|---|---|---|---|---|
developer-doc | Unbounded Recursion | runtime |
|
5 |
Unbounded Recursion
The JVM (and hence, GraalVM) do not have support for segmented stacks, and hence do not allow for computation of unbounded recursion - if you make too many recursive function calls you can cause your stack to overflow. Quite obviously, this is a big problem for a functional language where recursion is the primary construct for looping.
There are two main categories of solution for working with unbounded recursion:
- Segmented Stacks: If you have the ability to allocate stacks on the heap you can allocate the stack in segments as it grows, meaning that the upper limit on the size of your stack is
- Continuation Passing Style (CPS): A program in CPS is one in which the flow of control is passed explicitly as a function of one argument (the continuation). The significant benefit of this is that it means that all calls are made in tail position, and hence no new stack frame needs to be allocated.
This document contains the details of designs and experiments for allowing the use of unbounded recursion in Enso on GraalVM.
- A Baseline
- Emulating Stack Segmentation with Threads
- Avoiding Stack Usage via a CPS Transform
- Linearised Representations
- Alternatives
- Open Questions
A Baseline
In an ideal world, we'd like the performance of Enso's recursive calls to approximate that of Haskell, which can be made to have fairly optimal performance for a functional language. Basic measurements for a Haskell program that sums the numbers up to 1 million are as follows:
- Non-TCO: 20-25 ms/op
- TCO: 0.8-1 ms/op
All benchmarks in the sections below are written in pure Java rather than in
Enso itself. This is to allow us to estimate the maximum theoretical performance
possible when executing on the JVM. They have been run on GraalVM 19.1.0, and
perform the same summation of integers. They have a variable threshold, listed
in the results as inputSize
.
Emulating Stack Segmentation with Threads
As each new thread has its own stack, we can exploit this to emulate the notion of split stacks as used in many functional programming languages. The basic idea is to work out when you're about to run out of stack space,
When to Spawn a Thread
One of the main problems with this approach is that you want to make as much use out of the stack for a given thread as possible. However, it is very difficult to get an accurate idea of when a stack may be about to overflow. There are two main approaches:
- Conservative Counting: You can explicitly maintain a counter that records the depth of your call stack.
- Catching the Overflow: When a thread on the JVM overflows, it throws a
StackOverflowError
, thus giving information as to when you've run out of stack space.
It may at first be apparent that you can rely on some other details of how JVM stacks are implemented, but the JVM spec is very loose with regards to what it permits as a valid stack implementation. This means that from a specification perspective there is very little that could actually be relied upon.
Conservative Counting
A naive and obvious solution is to maintain a counter that tracks the depth of your call stack. This would allow you to make a conservative estimate of the amount of stack you have remaining, and spawn a new thread at some threshold.
Of course, the main issue with this is that the stacks you have available become significantly under-utilised as the threshold has to be set such that overflow is impossible.
We did some brief testing to experiment with the 'depth limit' to find a rough estimate for how much utilisation we could get out of the thread stacks before they overflowed. In practice this seemed to be around 2000, though some runs could have it set higher. Using this value gave the following results.
Benchmark (inputSize) Mode Cnt Score Error Units
Main.testCountedExecutor 100 avgt 5 0.001 ± 0.001 ms/op
Main.testCountedExecutor 1000 avgt 5 0.008 ± 0.004 ms/op
Main.testCountedExecutor 10000 avgt 5 0.951 ± 0.095 ms/op
Main.testCountedExecutor 50000 avgt 5 7.279 ± 2.476 ms/op
Main.testCountedExecutor 100000 avgt 5 12.790 ± 1.101 ms/op
Main.testCountedExecutor 1000000 avgt 5 107.034 ± 2.076 ms/op
As is obvious this is quite slow when compared to the Haskell case, with around a 5x slowdown. A significant amount of the time appears to be spent on OS-level context switches, as the smaller cases that fit into the stack of a single thread are approximately equal to Haskell. It is hence possible that a method that reduces the cost of context switching could make this approach feasible.
Catching the Overflow
Though it is heavily recommended against by the Java documentation, it is indeed
possible to catch the StackOverflowError
. While this provides accurate info
about when you run out of stack space, it has one major problem: you may not
unwind enough to have enough stack space to spawn a new thread.
The following is a potential algorithm that ignores this problem for the moment:
- Each recursive call is wrapped in a
try {} catch (StackOverflowError e) {}
block in order to detect when the stack overflows. - All side-effecting operations must take place within a single Java frame.
- When the stack overflows, a
StackOverflowError
is thrown at frame creation. - This can be caught, with control-flow entering the
catch
block. - A new thread is spawned to continue the computation.
This works because the StackOverflowError
is thrown when the attempt to create
the new stack frame is made. This means that in the failure case none of the
function body has executed so we can safely resume on a new thread.
The main issue with this design is ensuring that there is enough stack space after the unwind to the catch block. If there isn't enough, then it proves impossible to spawn a new thread and this doesn't work.
The benchmarks listed here implement this algorithm without actually performing any significant computation.
Benchmark (inputSize) Mode Cnt Score Error Units
Main.testSOExecutor 100 avgt 5 ≈ 10⁻⁴ ms/op
Main.testSOExecutor 1000 avgt 5 0.003 ± 0.001 ms/op
Main.testSOExecutor 10000 avgt 5 0.031 ± 0.001 ms/op
Main.testSOExecutor 50000 avgt 5 3.927 ± 0.477 ms/op
Main.testSOExecutor 100000 avgt 5 7.724 ± 0.239 ms/op
Main.testSOExecutor 1000000 avgt 5 104.719 ± 11.411 ms/op
This performs slightly better than the conservative option discussed above. As we're guaranteed total utilisation of the stack of each thread we spawn less threads and hence reduce the context switching overhead. Nevertheless, this is still very slow compared to Haskell baseline.
Thread Pools
While a thread pool is conventionally seen as a way to amortise the cost of spawning threads, this approach to recursion requires far more threads than is really feasible to keep around in a pool, so we've not explored that approach.
Project Loom
If project loom's coroutines and / or fibres were stable, these would likely help somewhat by reducing the thread creation overhead that is primarily down to OS-level context switches.
However, Loom doesn't currently seem like a viable solution to this approach as it is not only far from stable, but also has no guarantee that it will actually make it into the JVM.
Avoiding Stack Usage via a CPS Transform
Transforming recursive calls into CPS allows us to avoid the need for using the stack instead of trying to augment it. This could be implemented as a global transformation, or as a local one only for recursive calls.
Benchmark (inputSize) Mode Cnt Score Error Units
Main.testCPS 100 avgt 5 0.001 ± 0.001 ms/op
Main.testCPS 1000 avgt 5 0.014 ± 0.004 ms/op
Main.testCPS 10000 avgt 5 0.197 ± 0.038 ms/op
Main.testCPS 50000 avgt 5 1.075 ± 0.269 ms/op
Main.testCPS 100000 avgt 5 2.258 ± 0.310 ms/op
Main.testCPS 1000000 avgt 5 27.002 ± 2.059 ms/op
The CPS-based approach is very much a trade-off. The code that is actually being executed is more complex, showing an order of magnitude slowdown in the cases where the execution profile fits into a single stack. However, once the input size grows to the point that additional stack segments are needed, the execution performance is within spitting distance of the Haskell code.
The CPS Transform
While it is tempting to perform the CPS transform globally for the whole program, this has some major drawbacks:
- As shown above, the code becomes an order of magnitude slower within the space of a single stack.
- It may be difficult to maintain a mapping from the original code to the CPS'd execution. This would greatly impact our ability to use the debugging and introspection tools which are necessary for implementing Enso Studio.
As a result, an ideal design would involve only performing the CPS
transformation on code which is actually recursive. While you can detect this
statically via whole-program analysis, you can also track execution on the
program stack in a thread-safe manner and perform the transformation at runtime
(e.g. private static ThreadLocal<Boolean> isExecuting;
).
A Hybrid Approach
As we clearly don't want to CPS transform the program globally, we need some mechanism by which we can rewrite only when necessary. As discussed above, we could do this via a dynamic runtime analysis, but we could also potentially make use of the Java stack at least in part.
The hybrid approach works as follows:
- Execute the code using standard recursion on the Java stack until we catch a
StackOverflowError
. - Spawn a new thread to rewrite the original code to CPS, and then continue execution in that style.
This avoids the CPS overhead as much as possible (when the computation fits into the Java stack), but allows for unbounded recursion in the general case. The performance profile is as follows.
Benchmark (inputSize) Mode Cnt Score Error Units
Main.testHybrid 100 avgt 5 ≈ 10⁻⁴ ms/op
Main.testHybrid 1000 avgt 5 0.003 ± 0.001 ms/op
Main.testHybrid 10000 avgt 5 0.013 ± 0.003 ms/op
Main.testHybrid 50000 avgt 5 0.069 ± 0.013 ms/op
Main.testHybrid 100000 avgt 5 1.765 ± 0.056 ms/op
Main.testHybrid 1000000 avgt 5 25.961 ± 2.775 ms/op
This hybrid implementation makes things faster overall, with some particularly good performance wins for the smaller cases.
An open question for this is how you work out exactly what code to CPS transform at the point of the stack overflow. In the simply-recursive case this is trivial, but it may require some more sophisticated tracing in the case of mutually-recursive functions.
Linearised Representations
While not something that we could feasibly do at the moment, one of the potential solutions for this is to statically compile the language to a linearised representation. Rather than trying to implement the CPS transform in a Truffle interpreter not designed for it, we could instead compile Enso to a low-level IR format which has no stack frames, and instead just uses jumps.
Whether we write this IR ourselves or use an existing one implemented as a Truffle language, such as WASM bytecode (currently very experimental) or LLVM IR (much more tried and tested), this would provide a number of benefits:
- The IR output by the compiler phase need not be fed into the truffle interpreter for said IR.
- We gain more flexibility.
- We can still support interoperation with foreign languages through Truffle.
However, such an approach also has some major downsides:
- We do not have the time to pursue such an approach in the short term.
- Such an approach would require significantly more work, as generating linear IR representations is not as simple as generating a high-level truffle node IR.
- Such an approach adds quite a lot of complexity to the compiler pipeline, which, is currently tied quite strongly into the Truffle language life-cycle.
Alternatives
At the current time there are no apparent alternatives to the three approaches discussed above. While it would be ideal for the JVM to have native support for stack segmentation on the heap, this would likely be an in-depth and significant amount of work to add, with no guarantee that it would be accepted into main.
Open Questions
The following are questions for which we don't yet have answers:
- Are there any ways to instrument a JVM thread to detect when it's about to stack overflow?
- Based on our investigation, what would your recommendation be for us to proceed?