mirror of https://github.com/HigherOrderCO/Bend.git synced 2024-08-15 06:40:25 +03:00

Bend

A high-level, massively parallel programming language

Introduction

Bend offers the feel and features of expressive languages like Python and Haskell. This includes fast object allocations, full support for higher-order functions with closures, unrestricted recursion, and even continuations.
Bend scales like CUDA. It runs on massively parallel hardware like GPUs, with near-linear speedup as the core count grows, and needs no explicit parallelism annotations: no thread creation, locks, mutexes, or atomics.
Bend is powered by the HVM2 runtime.

Important Notes

Bend is designed to excel in scaling performance with cores, supporting over 10000 concurrent threads.
The current version may have lower single-core performance.
You can expect substantial improvements in performance as we advance our code generation and optimization techniques.
Windows is not yet supported natively. Use WSL2 in the meantime.
Currently, only NVIDIA GPUs are supported.

Install

Install dependencies

On Linux

# Install Rust if you haven't already.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# For the C version of Bend, use GCC. We recommend a version up to 12.x.
sudo apt install gcc

For the CUDA runtime, install the CUDA toolkit for Linux, version 12.x.

On Mac

# Install Rust if you haven't already.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# For the C version of Bend, use GCC. We recommend a version up to 12.x.
brew install gcc

Install Bend

Install HVM2 by running:

# HVM2 is HOC's massively parallel Interaction Combinator evaluator.
cargo install hvm

# This ensures HVM is correctly installed and accessible.
hvm --version

Install Bend by running:

# This command will install Bend
cargo install bend-lang

# This ensures Bend is correctly installed and accessible.
bend --version

Getting Started

Running Bend Programs

bend run    <file.bend> # uses the Rust interpreter (sequential)
bend run-c  <file.bend> # uses the C interpreter (parallel)
bend run-cu <file.bend> # uses the CUDA interpreter (massively parallel)

# Notes
# You can also compile Bend to standalone C/CUDA files using gen-c and gen-cu for maximum performance.
# The code generator is still in its early stages and not as mature as compilers like GCC and GHC.
# You can use the -s flag to get more information on:
  # Reductions
  # Time the code took to run
  # Interactions per second (in millions)

Testing Bend Programs

The example below sums all the numbers in the range from start to target. It can be written in two different ways: one that is inherently sequential (and thus cannot be parallelized), and another that is easily parallelizable. (We will use the -s flag in most examples, for the sake of visibility.)

Sequential version:

First, create a file named ssum.bend.

# Run this command in your terminal
touch ssum.bend

Then open ssum.bend in your text editor and paste in the code below.

# Defines the function Sum with two parameters: start and target
def Sum(start, target):
  # If the value of start is the same as target, returns start
  if start == target:
    return start
  # If start is not equal to target, recursively call Sum with start incremented by 1, and add the result to start
  else:
    return start + Sum(start + 1, target)

def main():
  # This translates to (1 + (2 + (3 + (...... + (79999999 + 80000000)))))
  return Sum(1, 80000000)

Running the file

You can run it using the Rust interpreter (sequential):

bend run ssum.bend -s

Or you can run it using the C interpreter (sequential):

bend run-c ssum.bend -s

If you have an NVIDIA GPU, you can also run it in CUDA (sequential):

bend run-cu ssum.bend -s

In this version, the next value to be calculated depends on the previous sum, meaning that it cannot proceed until the current computation is complete. Now, let's look at the easily parallelizable version.

Parallelizable version:

First, close the old file, then create psum.bend from your terminal.

# Run this command in your terminal
touch psum.bend

Then open psum.bend in your text editor and paste in the code below.

# Defines the function Sum with two parameters: start and target
def Sum(start, target):
  # If the value of start is the same as target, returns start
  if start == target:
    return start
  # If start is not equal to target, calculate the midpoint (half), then recursively call Sum on both halves
  else:
    half = (start + target) / 2
    left = Sum(start, half)  # (Start -> Half)
    right = Sum(half + 1, target)
    return left + right

# Main function to demonstrate the parallelizable sum from 1 to 80000000
def main():
  # This translates to ((1 + 2) + (3 + 4) + ... + (79999999 + 80000000)...)
  return Sum(1, 80000000)

In this example, the (3 + 4) sum does not depend on the (1 + 2) sum, so the two computations can run at the same time.
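The difference between the two versions can be made concrete by counting the longest chain of additions that must happen one after another. The sketch below is a Python model of the two Sum functions, not Bend code: the names `seq_sum` and `par_sum` are illustrative, it uses a small range (512) so Python's recursion limit is not hit, and it uses Python integers rather than Bend's number types.

```python
def seq_sum(start, target):
    """Right-nested sum. Returns (total, length of the dependent addition chain)."""
    if start == target:
        return start, 0
    rest, depth = seq_sum(start + 1, target)
    # This addition cannot start until the whole rest of the range is summed.
    return start + rest, depth + 1


def par_sum(start, target):
    """Midpoint-split sum. Returns (total, length of the longest dependent chain)."""
    if start == target:
        return start, 0
    half = (start + target) // 2
    left, left_depth = par_sum(start, half)
    right, right_depth = par_sum(half + 1, target)
    # The two halves are independent, so only the deeper one sits on the critical path.
    return left + right, max(left_depth, right_depth) + 1


n = 512
seq_total, seq_depth = seq_sum(1, n)
par_total, par_depth = par_sum(1, n)
print(seq_total, seq_depth)  # 131328 511
print(par_total, par_depth)  # 131328 9
```

Both versions perform the same 511 additions. In the sequential version they form a single chain of length 511, while in the midpoint-split version the longest chain has length 9, which leaves the rest free to be evaluated in parallel.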

Running the file

You can run it using the Rust interpreter (sequential):

bend run psum.bend -s

Or you can run it using the C interpreter (parallel):

bend run-c psum.bend -s

If you have an NVIDIA GPU, you can also run it in CUDA (massively parallel):

bend run-cu psum.bend -s

In Bend, parallelizing this program takes nothing more than changing the run command. If your code can run in parallel, it will run in parallel.
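For contrast, here is a rough sketch of the bookkeeping that explicit parallelism usually requires in a language without automatic parallelization. It is Python, not Bend, and `range_sum`, `explicit_parallel_sum`, and the `workers` parameter are illustrative names that do not come from the Bend project. In CPython, threads do not speed up CPU-bound work like this. The sketch only illustrates the manual chunking and pool management that Bend programs do not need to write.

```python
from concurrent.futures import ThreadPoolExecutor


def range_sum(start, target):
    """Sum of the integers in [start, target], split at the midpoint."""
    if start == target:
        return start
    half = (start + target) // 2
    return range_sum(start, half) + range_sum(half + 1, target)


def explicit_parallel_sum(start, target, workers=4):
    """Split the range into chunks by hand and sum each chunk on a worker thread."""
    size = target - start + 1
    step = -(-size // workers)  # ceiling division: chunk length
    chunks = [
        (lo, min(lo + step - 1, target))
        for lo in range(start, target + 1, step)
    ]
    # The pool, the chunk boundaries, and the final reduction are all managed manually.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda bounds: range_sum(*bounds), chunks)
        return sum(partials)


print(explicit_parallel_sum(1, 1000))  # 500500
```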

Speedup Examples

The code snippet below implements a bitonic sorter with immutable tree rotations. It's not the type of algorithm you would expect to run fast on GPUs. However, since it uses a divide-and-conquer approach, which is inherently parallel, Bend will execute it on multiple threads, with no thread creation and no explicit lock management on your part.

Bitonic sorter code:

# Sorting Network = just rotate trees!
def sort(d, s, tree):
 switch d:
   case 0:
     return tree
   case _:
     (x,y) = tree
     lft   = sort(d-1, 0, x)
     rgt   = sort(d-1, 1, y)
     return rots(d, s, (lft, rgt))

# Rotates sub-trees (Blue/Green Box)
def rots(d, s, tree):
 switch d:
   case 0:
     return tree
   case _:
     (x,y) = tree
     return down(d, s, warp(d-1, s, x, y))

# Swaps distant values (Red Box)
def warp(d, s, a, b):
 switch d:
   case 0:
     return swap(s ^ (a > b), a, b)
   case _:
     (a.a, a.b) = a
     (b.a, b.b) = b
     (A.a, A.b) = warp(d-1, s, a.a, b.a)
     (B.a, B.b) = warp(d-1, s, a.b, b.b)
     return ((A.a,B.a),(A.b,B.b))

# Propagates downwards
def down(d,s,t):
 switch d:
   case 0:
     return t
   case _:
     (t.a, t.b) = t
     return (rots(d-1, s, t.a), rots(d-1, s, t.b))

# Swaps a single pair
def swap(s, a, b):
 switch s:
   case 0:
     return (a,b)
   case _:
     return (b,a)

# Testing
# -------

# Generates a big tree
def gen(d, x):
 switch d:
   case 0:
     return x
   case _:
     return (gen(d-1, x * 2 + 1), gen(d-1, x * 2))

# Sums a big tree
def sum(d, t):
 switch d:
   case 0:
     return t
   case _:
     (t.a, t.b) = t
     return sum(d-1, t.a) + sum(d-1, t.b)

# Sorts a big tree
def main():
 return sum(20, sort(20, 0, gen(20, 0)))
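To see what the sorter does on a small input, the listing above can be ported almost line by line to Python. The sketch below is such a port, written for illustration and not part of the Bend repository. The helpers `from_list` and `flatten` are hypothetical additions for building and reading the tree, and each Bend `switch d` with cases `0` and `_` becomes an `if d == 0` check.

```python
def swap(s, a, b):
    # Swaps a single pair when s is non-zero
    return (a, b) if s == 0 else (b, a)

def warp(d, s, a, b):
    # Compares leaves at matching positions of a and b (Red Box)
    if d == 0:
        return swap(s ^ (a > b), a, b)
    (aa, ab), (ba, bb) = a, b
    Aa, Ab = warp(d - 1, s, aa, ba)
    Ba, Bb = warp(d - 1, s, ab, bb)
    return ((Aa, Ba), (Ab, Bb))

def down(d, s, t):
    # Propagates the rotation into both sub-trees
    if d == 0:
        return t
    ta, tb = t
    return (rots(d - 1, s, ta), rots(d - 1, s, tb))

def rots(d, s, tree):
    # Rotates sub-trees (Blue/Green Box)
    if d == 0:
        return tree
    x, y = tree
    return down(d, s, warp(d - 1, s, x, y))

def sort(d, s, tree):
    # Sorts the halves in opposite directions, then merges them
    if d == 0:
        return tree
    x, y = tree
    return rots(d, s, (sort(d - 1, 0, x), sort(d - 1, 1, y)))

def from_list(d, xs):
    # Hypothetical helper: builds a depth-d tuple tree from 2**d values
    if d == 0:
        return xs[0]
    mid = len(xs) // 2
    return (from_list(d - 1, xs[:mid]), from_list(d - 1, xs[mid:]))

def flatten(d, t):
    # Hypothetical helper: reads the leaves back, left to right
    if d == 0:
        return [t]
    return flatten(d - 1, t[0]) + flatten(d - 1, t[1])

values = [5, 2, 7, 0, 6, 3, 1, 4]
print(flatten(3, sort(3, 0, from_list(3, values))))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Passing `s = 0` sorts in ascending order and `s = 1` in descending order. At every level the two recursive calls work on disjoint sub-trees, which is the independence that makes the algorithm parallelizable.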

Benchmark

bend run: CPU, Apple M3 Max: 12.15 seconds
bend run-c: CPU, Apple M3 Max: 0.96 seconds
bend run-cu: GPU, NVIDIA RTX 4090: 0.21 seconds
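The relative speedups implied by these timings can be worked out directly. The short Python calculation below uses only the three numbers listed above. Note that the GPU run used different hardware from the CPU runs, so the ratios compare command and hardware together, not the commands alone.

```python
# Timings from the benchmark above, in seconds
timings = {
    "bend run": 12.15,     # CPU, Apple M3 Max
    "bend run-c": 0.96,    # CPU, Apple M3 Max
    "bend run-cu": 0.21,   # GPU, NVIDIA RTX 4090
}

baseline = timings["bend run"]
speedups = {command: baseline / seconds for command, seconds in timings.items()}

for command, factor in speedups.items():
    print(f"{command:12s} {factor:6.2f}x")
```

This gives roughly 12.7x for run-c and 57.9x for run-cu relative to bend run.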

If you are interested in other algorithms, check out our examples folder.

Additional Resources

To understand the technology behind Bend, check out the HVM2 paper. Bend is developed by HigherOrderCO - join our Discord!
Watch the live demo video.