Development backend for WebAssembly

Plan

  • Initial bringup

    • Get a wasm backend working for some of the number tests.
    • Use a separate gen_wasm directory for now, to avoid trying to do bringup and integration at the same time.
  • Get the fundamentals working

    • Come up with a way to do control flow
    • Flesh out the details of value representations between local variables and stack memory
    • Set up a way to write tests with any return value rather than just i64 and f64
    • Implement stack memory
      • Push and pop stack frames
      • Deal with returning structs
      • Distinguish which variables go in locals, own stack frame, caller stack frame, etc.
      • Ensure early Return statements don't skip stack cleanup
    • Model the stack machine as a storage mechanism, to make generated code "less bad"
    • Switch vectors to bumpalo::Vec where possible
    • Implement relocations
      • Requires knowing the byte offset of each call site. This is awkward as the backend builds a Vec<Instruction> rather than a Vec<u8>. It may be worth serialising each instruction as it is inserted.
  • Refactor for code sharing with CPU backends

    • Extract a trait from WasmBackend that looks as similar as possible to Backend, to prepare for code sharing
    • Refactor to actually share code between WasmBackend and Backend if it seems feasible
  • Integration

    • Move wasm files to gen_dev/src/wasm
    • Share tests between wasm and x64, with some way of saying which tests work on which backends, and dispatching to different eval helpers based on that.
    • Get build_module in object_builder.rs to dispatch to the wasm generator (adding some Wasm options to the Triple struct)
    • Get build_module to write to a file, or maybe return Vec<u8>, instead of returning an Object structure

Structured control flow

One of the security features of WebAssembly is that it does not allow unrestricted "jumps" to anywhere you like; there is no instruction for that. Its control instructions can only express "structured" control flow, and have names like if, loop, and block that you'd normally associate with high-level languages. There are branch (br) instructions that can jump to labelled blocks within the same function, but the blocks have to be nested in sensible ways.

This way of representing control flow is similar to parts of the Roc AST like When, If and LetRec. But the Mono IR converts these to jumps and join points, which form more of a control flow graph than a tree. In the Wasm backend we need to map that graph back to a tree.

Our solution is to wrap all joinpoint/jump graphs in an outer loop, with nested blocks inside it.
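
To illustrate the shape of that, here is a rough sketch in WebAssembly text format (this is not actual compiler output, and the labels are made up): each join point becomes a block, and each jump becomes a br targeting the label of the block it wants to reach.

  loop $outer              ;; wraps the whole joinpoint/jump graph
    block $join_1
      block $join_0
        ;; ... code that ends by jumping to join point 0 ...
        br $join_0         ;; branches forward, to the end of the $join_0 block
      end
      ;; ... body of join point 0, ending with a jump to join point 1 ...
      br $join_1
    end
    ;; ... body of join point 1; a `br $outer` here would branch backwards, to the start of the loop ...
  end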

Possible future optimisations

There are other algorithms available that may result in more optimised control flow. We are not focusing on that for our development backend, but here are some notes for future reference.

The WebAssembly compiler toolkit binaryen has an API for control-flow graphs. We're not using binaryen right now. It's a C++ library, though it does have a (very thin and somewhat hard-to-use) Rust wrapper. Binaryen's control-flow graph API implements the "Relooper" algorithm developed by the Emscripten project and described in this paper.

By the way, apparently "binaryen" rhymes with "Targaryen", the family name from the "Game of Thrones" TV series.

There is also an improvement on Relooper called "Stackifier". It can reorder the joinpoints and jumps to make code more efficient. (It also has features Roc wouldn't need but C++ does, like support for "irreducible" graphs that include goto.)

Stack machine vs register machine

Wasm's instruction set is based on a stack-machine VM. Whereas CPU instructions operate on named registers, Wasm instructions have no register names at all: they can only operate on whatever data is at the top of the VM stack.

For example, the instruction i64.add takes two operands: it pops the top two values off the VM stack and pushes the result back.

In the spec, every instruction has a type signature! This is not something you would see for CPU instructions. The type signature for i64.add is [i64 i64] → [i64] because it pops two i64's and pushes an i64.

This means that WebAssembly has a concept of type checking. When you load a .wasm file as a chunk of bytes into a Wasm runtime (like a browser or wasmer), the runtime will first validate those bytes. They have some fast way of checking whether the types being pushed and popped are consistent. So if you try to do the i64.add instruction when you have floats on the stack, it will fail validation.
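
For example, a module containing a function like the sketch below (made up for illustration) would be rejected at validation time, before any code runs, because i64.add expects two i64 values but finds two f64 values on the stack.

  (func (result i64)
    f64.const 1.0
    f64.const 2.0
    i64.add)    ;; validation error: i64.add requires [i64 i64], but the stack holds [f64 f64]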

Note that the instruction makes no mention of any source or destination registers, because there is no such thing. It just pops two values and pushes one. (This architecture choice helps to keep WebAssembly programs quite compact. There are no extra bytes specifying source and destination registers.)

Implications of the stack machine for Roc:

  • There is no such thing as register allocation, since there are no registers! There is no reason to maintain hashmaps of what registers are free or not. And there is no need to do a pass over the IR to find the "last seen" occurrence of a symbol in the IR. That means we don't need the Backend methods scan_ast, scan_ast_call, set_last_seen, last_seen_map, free_map, free_symbols, free_symbol, set_free_map.

  • There is no random access to the stack. All instructions operate on the data at the top of the stack. There is no instruction that says "get the value at index 17 in the stack". If such an instruction did exist, it wouldn't be a stack machine. And there is no way to "free up some of the slots in the stack". You have to consume the stuff at the top, then the stuff further down. However, Wasm has a concept of local variables, which do allow random access. See below.

Local variables

WebAssembly functions can have any number of local variables. They are declared at the beginning of the function, along with their types (just like C). WebAssembly has 4 value types: i32, i64, f32, f64.

In this backend, each symbol in the Mono IR gets one WebAssembly local. To illustrate, let's translate a simple Roc example to WebAssembly text format. The WebAssembly code below is completely unoptimised and uses far more locals than necessary. But that does help to illustrate the concept of locals.

app "test" provides [main] to "./platform"

main =
    1 + 2 + 4

Direct translation of Mono IR

The Mono IR contains two functions, Num.add and main, so we generate two corresponding WebAssembly functions. Since the Mono IR has a Symbol for every expression, the simplest thing is to create a local for each one. The code ends up being quite bloated, with lots of local.set and local.get instructions.

I've added comments on each line to show what is on the stack and in the locals at each point in the program.

  (func (;0;) (param i64 i64) (result i64)   ; declare function index 0 (Num.add) with two i64 parameters and an i64 result
    local.get 0              ; load param 0                                    stack=[param0]
    local.get 1              ; load param 1                                    stack=[param0, param1]
    i64.add                  ; pop two values, add, and push result            stack=[param0 + param1]
    return)                  ; return the value at the top of the stack

  (func (;1;) (result i64)   ; declare function index 1 (main) with no parameters and an i64 result
    (local i64 i64 i64 i64)  ; declare 4 local variables, all with type i64, one for each symbol in the Mono IR
    i64.const 1              ; stack=[1]
    local.set 0              ; stack=[]     local0=1
    i64.const 2              ; stack=[2]    local0=1
    local.set 1              ; stack=[]     local0=1  local1=2
    local.get 0              ; stack=[1]    local0=1  local1=2
    local.get 1              ; stack=[1,2]  local0=1  local1=2
    call 0                   ; stack=[3]    local0=1  local1=2
    local.set 2              ; stack=[]     local0=1  local1=2  local2=3
    i64.const 4              ; stack=[4]    local0=1  local1=2  local2=3
    local.set 3              ; stack=[]     local0=1  local1=2  local2=3  local3=4
    local.get 2              ; stack=[3]    local0=1  local1=2  local2=3  local3=4
    local.get 3              ; stack=[3,4]  local0=1  local1=2  local2=3  local3=4
    call 0                   ; stack=[7]    local0=1  local1=2  local2=3  local3=4
    return)                  ; return the value at the top of the stack

Handwritten equivalent

This code doesn't actually require any locals at all. (It also doesn't need the return instructions, but that's less of a problem.)

  (func (;0;) (param i64 i64) (result i64)
    local.get 0
    local.get 1
    i64.add)
  (func (;1;) (result i64)
    i64.const 1
    i64.const 2
    call 0
    i64.const 4
    call 0)

Reducing sets and gets

For our example code, we don't need any locals because the WebAssembly virtual machine effectively stores intermediate results in a stack. Since it's already storing those values, there is no need for us to create locals. If you compare the two versions, you'll see that the local.set and local.get instructions have simply been deleted and the other instructions are in the same order.

But sometimes we really do need locals! We may need to use the same symbol twice, or values could end up on the stack in the wrong order and need to be swapped around by setting a local and getting it again.

The hard part is knowing when we need a local, and when we don't. For that, the WasmBackend needs to have some understanding of the stack machine.

To help with this, the CodeBuilder maintains a vector that represents the stack. For every instruction the backend generates, CodeBuilder simulates the right number of pushes and pops for that instruction, so that we always know the state of the VM stack at every point in the program.

When the WasmBackend generates code for a Let statement, it can "label" the top of the stack with the relevant Symbol. Then at any later point in the program, when we need to retrieve a list of symbols in a certain order, we can check whether they already happen to be at the top of the stack in that order (as they were in our example above).

In practice it should be very common for values to appear on the VM stack in the right order, because in the Mono IR, statements occur in dependency order! We should only generate locals when the dependency graph is a little more complicated, and we actually need them.
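
For example, if a value is needed twice, it can't simply be left on the VM stack, because the first use would consume it. In the made-up sketch below (reusing Num.add as function index 0 from the earlier example; this is not actual compiler output), the result of the first call is stored in a local so it can be passed as both arguments of the second call.

  (func (result i64)
    (local i64)
    i64.const 1
    i64.const 2
    call 0          ;; stack=[3]
    local.set 0     ;; stack=[]     local0=3
    local.get 0     ;; stack=[3]    local0=3
    local.get 0     ;; stack=[3,3]  local0=3
    call 0          ;; stack=[6]    local0=3
    return)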

  ┌─────────────────┐     ┌─────────────┐
  │                 │     │             │
  │                 ├─────►   Storage   ├──────┐
  │                 │     │             │      │
  │                 │     └─────────────┘      │
  │                 │     Manage state about   │
  │                 │     how/where symbol     │ Delegate part of
  │   WasmBackend   │     values are stored    │ state management
  │                 │                          │ for values on
  │                 │                          │ the VM stack
  │                 │                          │
  │                 │  Generate       ┌────────▼──────┐
  │                 │  instructions   │               │
  │                 ├─────────────────►  CodeBuilder  │
  │                 │                 │               │
  └─────────────────┘                 └───────────────┘

Memory

WebAssembly programs have a "linear memory" for storing data, which is a block of memory assigned to them by the host. You can assign a min and max size to the memory, and the WebAssembly program can request 64kB pages from the host, just like a "normal" program would request pages from the OS. Addresses start at zero and go up to whatever the current size is. Zero is a perfectly normal address like any other, and dereferencing it is not a segfault. But addresses beyond the current memory size are out of bounds, and dereferencing them causes the VM to trap.

The program has full read/write access to the memory and can divide it into whatever sections it wants. Most programs will want to do the traditional split of static memory, stack memory and heap memory.

The WebAssembly module structure includes a data section that will be copied into the linear memory at a specified offset on initialisation, so you can use that for string literals etc. But the division of the rest of memory into "stack" and "heap" areas is not a first-class concept. It is up to the compiler to generate instructions to do whatever it wants with that memory.
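
As an illustration (this is not the Roc compiler's actual output), a module that declares up to two pages of memory and places a string literal at address 1024 could look like this in the text format:

  (module
    (memory 1 2)                       ;; minimum 1 page (64kB), maximum 2 pages
    (data (i32.const 1024) "Hello"))   ;; copied into linear memory at offset 1024 when the instance is created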

Stack machine vs stack memory

There are two entirely different meanings of the word "stack" that are relevant to the WebAssembly backend. It's unfortunate that the word "stack" is so overloaded. I guess it's just a useful data structure. The worst thing is that both of them tend to be referred to as just "the stack"! We need more precise terms.

When we are talking about the instruction set, I'll use the term machine stack or VM stack. This is the implicit data structure that WebAssembly instructions operate on. In the examples above, it's where i64.add gets its arguments and stores its result. I think of it as an abstraction over CPU registers, that WebAssembly uses in order to be portable and compact.

When we are talking about how we store values in memory, I'll use the term stack memory rather than just "the stack". It feels clunky but it's the best I can think of.

Of course our program can use another area of memory as a heap as well. WebAssembly doesn't mind how you divide up your memory. It just gives you some memory and some instructions for loading and storing.

Calling conventions & stack memory

In WebAssembly you call a function by pushing arguments to the stack and then issuing a call instruction, which specifies a function index. The VM knows how many values to pop off the stack by examining the type of the function. In our example earlier, Num.add had the type [i64 i64] → [i64] so it expects to find two i64's on the stack and pushes one i64 back as the result. Remember, the runtime engine will validate the module before running it, and if your generated code is trying to call a function at a point in the program where the wrong value types are on the stack, it will fail validation.

Function arguments are restricted to the four value types, i32, i64, f32 and f64. If those are all we need, then there is no need for any stack memory, stack pointer, etc. We saw this in our example earlier. We just said call 0. We didn't need any instructions to create a stack frame with a return address, and there was no "jump" instruction. Essentially, WebAssembly has a first-class concept of function calls, so you don't build it up from lower-level primitives. You could think of this as an abstraction over calling conventions.

That's all great for primitive values but what happens when we want to pass more complex data structures between functions?

Well, remember, "stack memory" is not a special kind of memory in WebAssembly, and is separate from the VM stack. It's just an area of our memory where we implement a stack data structure. But there are some conventions that it makes sense to follow so that we can easily link to Wasm code generated from Zig or other languages.

Observations from compiled C code

  • global 0 is used as the stack pointer, and its value is normally copied to a local as well (presumably because locals tend to be assigned to CPU registers)
  • Stack memory grows downwards
  • If a C function returns a struct, the compiled WebAssembly function has no return value, but instead has an extra argument. The argument is an i32 pointer to space allocated in the caller's stack, that the called function can write to.
  • There is no maximum number of arguments for a WebAssembly function, and arguments are not passed via stack memory. This makes sense because the VM stack has no size limit. It's like having a CPU with an unlimited number of registers.
  • Stack memory is only used for allocating local variables, not for passing arguments. And it's only used for values that cannot be stored in one of WebAssembly's primitive values (i32, i64, f32, f64).

These observations are based on experiments compiling C to WebAssembly via the Emscripten toolchain (which is built on top of clang). It's also in line with what the WebAssembly project describes here.
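
To make the stack-pointer and struct-return conventions concrete, here is a rough sketch of the pattern described above (the names, frame size, and the omitted store instructions are illustrative, not real clang output):

  (global $stack_pointer (mut i32) (i32.const 65536))  ;; global 0: the stack pointer

  (func $returns_struct (param $ret_ptr i32)  ;; the caller passes a pointer into its own stack frame
    (local $frame i32)
    global.get $stack_pointer
    i32.const 16
    i32.sub                        ;; stack memory grows downwards: allocate a 16-byte frame
    local.tee $frame               ;; keep a copy of the new stack pointer in a local
    global.set $stack_pointer
    ;; ... build the struct in this frame, then copy its bytes to $ret_ptr ...
    local.get $frame
    i32.const 16
    i32.add
    global.set $stack_pointer)     ;; free the frame before returning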

Modules vs Instances

What's the difference between a Module and an Instance in WebAssembly?

Well, if I compare it to running a program on Linux, it's like the difference between an ELF binary and the executable image in memory that you get when you load that ELF file. The ELF file is essentially a specification for how to create the executable image. In order to start executing the program, the OS has to actually allocate a stack and a heap, and load the text and data. If you run multiple copies of the same program, they will each have their own memory and their own execution state. (More detail here).

The Module is like the ELF file, and the Instance is like the executable image.

The Module is a specification for how to create an Instance of the program. The Module says how much memory the program needs, but the Instance actually contains that memory. In order to run the Wasm program, the VM needs to create an instance, allocate some memory for it, and copy the data section into that memory. If you run many copies of the same Wasm program, you will have one Module but many Instances. Each instance will have its own separate area of memory, and its own execution state.

Modules, object files, and linking

A WebAssembly module is equivalent to an executable file. It doesn't normally need relocations since, at the WebAssembly layer, there is no Address Space Layout Randomisation. If it does have relocations, then it's an object file.

The official spec lists the sections that are part of the final module. It doesn't mention any sections for relocations or symbol names, but it does support "custom" sections. Conventions for using those custom sections for linking are documented in the WebAssembly tool-conventions repo here, which mentions that LLVM uses those conventions.

Linking host-to-app calls

We implement a few linking operations in the Wasm backend. The most important are host-to-app calls.

In the host .wasm file, roc__mainForHost_1_exposed is defined as a Wasm Import, as if it were an external JavaScript function. But when we link the host and app, we need to make it an internal function instead.

There are a few important facts to note about the Wasm binary format:

  • Function calls refer to the callee by its function index in the file.
  • If we move a function from one index to another, all of its call sites need to be updated. So we want to minimise this to make linking fast.
  • If we remove a function, then all functions above it will implicitly have their indices shifted down by 1! This is not good for speed. We should try to swap rather than remove.
  • JavaScript imports always get the lower indices.

With that background, here are the linking steps for a single app function that gets called by the host:

  • Remove roc__mainForHost_1_exposed from the imports, updating all call sites to the new index, which is somewhere in the app.
  • Swap the last JavaScript import into the slot where roc__mainForHost_1_exposed was, updating all of its call sites in the host.
  • Insert an internally-defined dummy function at the index where the last JavaScript import used to be.
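
As a tiny made-up example (the names and indices are illustrative, not the real host), the function index space before and after linking could look like this:

  ;; Before linking: the host imports mainForHost as if it were a JavaScript function.
  (import "env" "roc__mainForHost_1_exposed" (func (result i64)))  ;; function index 0
  (import "env" "js_fn" (func))                                    ;; function index 1 (last JS import)
  (func (result i64)                                               ;; function index 2 (host-defined)
    call 0)                                                        ;; calls the import

  ;; After linking: the app defines mainForHost, js_fn is swapped into slot 0,
  ;; and a dummy function fills the old import slot so later indices don't shift.
  (import "env" "js_fn" (func))                                    ;; function index 0 (its call sites become `call 0`)
  (func)                                                           ;; function index 1: dummy
  (func (result i64)                                               ;; function index 2: host-defined, index unchanged
    call 3)                                                        ;; call site updated to point at the app function
  (func (result i64)                                               ;; function index 3: the app's mainForHost
    i64.const 42)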

The diagram below illustrates this process.

The diagram has a tiny number of functions just to make it easier to draw! Our mock host for integration tests has 48 imports and 648 defined functions.

 

[Diagram showing how host-to-app calls are linked]