Bend/README.md

# Bend

Bend is a massively parallel, high-level programming language.

Unlike low-level alternatives like CUDA and Metal, Bend has the feeling and
features of expressive languages like Python and Haskell, including fast object
allocations, higher-order functions with full closure support, unrestricted
recursion, even continuations. Yet, it runs on massively parallel hardware like
GPUs, with near-linear speedup based on core count, and zero explicit parallel
annotations: no thread spawning, no locks, mutexes, atomics. Bend is powered by
the [HVM2](https://github.com/HigherOrderCO/hvm) runtime.

A Quick Demo
------------

[![Bend live demo](https://github.com/VictorTaelin/media/blob/main/bend_live_demo.gif?raw=true)](https://x.com/i/status/1791213162525524076)

- For a more in-depth explanation on how to setup and use Bend, check [GUIDE.md](https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md).

- For an extensive list of features, check [FEATURES.md](https://github.com/HigherOrderCO/bend/blob/main/FEATURES.md).

## Using Bend

> Currently not working on Windows, please use [WSL2](https://learn.microsoft.com/en-us/windows/wsl/install) as a workaround.

> If you're having issues or have a question about Bend, please first read the [FAQ](https://github.com/HigherOrderCO/Bend/blob/main/KNOWN_ISSUES.md) page and check if your question has already been addressed.

First, install [Rust](https://www.rust-lang.org/tools/install).

If you want to use the C runtime, install a C compiler (like GCC or Clang).
If you want to use the CUDA runtime, install the CUDA toolkit (CUDA and `nvcc`) version 12.x.

> **_Note_: [Only Nvidia GPUs are supported at the moment](https://github.com/HigherOrderCO/Bend/issues/341).**

Then, install both HVM2 and Bend with:

```sh
cargo install hvm
cargo install bend-lang
```

Finally, write some Bend file, and run it with one of these commands:

```sh
bend run    <file.bend> # uses the Rust interpreter (sequential)
bend run-c  <file.bend> # uses the C interpreter (parallel)
bend run-cu <file.bend> # uses the CUDA interpreter (massively parallel)
```

You can also compile `Bend` to standalone C/CUDA files with `gen-c` and
`gen-cu`, for maximum performance. But keep in mind our code gen is still in its
infancy, and is nowhere as mature as SOTA compilers like GCC and GHC.

## Parallel Programming in Bend

To write parallel programs in Bend, all you have to do is... **nothing**. Other
than not making it *inherently sequential*! For example, the expression:

```python
(((1 + 2) + 3) + 4)
```

Can **not** run in parallel, because `+4` depends on `+3` which
depends on `(1+2)`. But the following expression:

```python
((1 + 2) + (3 + 4))
```

Can run in parallel, because `(1+2)` and `(3+4)` are independent; and it *will*,
per Bend's fundamental pledge:

> Everything that **can** run in parallel, **will** run in parallel.

For a more complete example, consider:

```python
# Sorting Network = just rotate trees!
def sort(d, s, tree):
  switch d:
    case 0:
      return tree
    case _:
      (x,y) = tree
      lft   = sort(d-1, 0, x)
      rgt   = sort(d-1, 1, y)
      return rots(d, s, lft, rgt)

# Rotates sub-trees (Blue/Green Box)
def rots(d, s, tree):
  switch d:
    case 0:
      return tree
    case _:
      (x,y) = tree
      return down(d, s, warp(d-1, s, x, y))

(...)
```

This
[file](https://gist.github.com/VictorTaelin/face210ca4bc30d96b2d5980278d3921)
implements a [bitonic sorter](https://en.wikipedia.org/wiki/Bitonic_sorter) with
*immutable tree rotations*. It is not the kind of algorithm you'd expect to
run fast on GPUs. Yet, since it uses a divide-and-conquer approach, which is
*inherently parallel*, Bend will run it multi-threaded. Some benchmarks:

- CPU, Apple M3 Max, 1 thread: **12.15 seconds**

- CPU, Apple M3 Max, 16 threads: **0.96 seconds**

- GPU, NVIDIA RTX 4090, 16k threads: **0.21 seconds**

That's a **57x speedup** by doing nothing. No thread spawning, no explicit
management of locks, mutexes. We just asked Bend to run our program on RTX, and
it did. Simple as that.

Bend isn't limited to a specific paradigm, like tensors or matrices. Any
concurrent system, from shaders to Erlang-like actor models can be emulated on
Bend. For example, to render images in real time, we could simply allocate an
immutable tree on each frame:

```python
# given a shader, returns a square image
def render(depth, shader):
  bend d = 0, i = 0:
    when d < depth:
      color = (fork(d+1, i*2+0), fork(d+1, i*2+1))
    else:
      width = depth / 2
      color = shader(i % width, i / width)
  return color

# given a position, returns a color
# for this demo, it just busy loops
def demo_shader(x, y):
  bend i = 0:
    when i < 5000:
      color = fork(i + 1)
    else:
      color = 0x000001
  return color

# renders a 256x256 image using demo_shader
def main:
  return render(16, demo_shader)
```

And it would actually work. Even involved algorithms parallelize well on Bend.
Long-distance communication is performed by *global beta-reduction* (as per the
[Interaction Calculus](https://github.com/VictorTaelin/Interaction-Calculus)),
and synchronized correctly and efficiently by
[HVM2](https://github.com/HigherOrderCO/HVM)'s *atomic linker*.


- To understand the tech behind Bend, check HVM2's [paper](https://paper.higherorderco.com).

- Bend is developed by [HigherOrderCO](https://HigherOrderCO.com) - join our [Discord](https://discord.HigherOrderCO.com)!

## Note

It is very important to reinforce that, while Bend does what it was built to
(i.e., scale in performance with cores, up to 10000+ concurrent threads), its
single-core performance is still extremely sub-par. This is the first version of
the system, and we haven't put much effort into a proper compiler yet. You can
expect the raw performance to substantially improve on every release, as we work
towards a proper codegen (including a constellation of missing optimizations).
Meanwhile, you can use the interpreters today, to have a glimpse of what
massively parallel programming looks like, from the lens of a Pythonish,
high-level language!