Merge pull request #594 from HigherOrderCO/580-add-string-encoding-functions

#580 Add string encoding functions
Nicolas Abril 2024-06-19 12:59:07 +00:00 committed by GitHub
commit c4afa0f248
5 changed files with 236 additions and 10 deletions

@ -24,6 +24,7 @@ and this project does not currently adhere to a particular versioning scheme.
- Add `to_f24`, `to_u24` and `to_i24` number casting builtin functions. ([#582][gh-582])
- Add `IO/sleep` builtin function to sleep for a given amount of seconds as a float. ([#581][gh-581])
- Add primitive file IO functions `IO/FS/{read, write, seek, open, close}`. ([#573][gh-573])
- Add encoding/decoding builtin functions `Bytes/{decode_utf8, decode_ascii}`, `String/{encode_utf8, encode_ascii}`, `Utf8/{decode_character, REPLACEMENT_CHARACTER}`. ([#580][gh-580])
## [0.2.35] - 2024-06-06
@ -346,6 +347,7 @@ and this project does not currently adhere to a particular versioning scheme.
[gh-528]: https://github.com/HigherOrderCO/Bend/issues/528
[gh-581]: https://github.com/HigherOrderCO/Bend/issues/581
[gh-573]: https://github.com/HigherOrderCO/Bend/issues/573
[gh-580]: https://github.com/HigherOrderCO/Bend/issues/580
[gh-582]: https://github.com/HigherOrderCO/Bend/issues/582
[gh-583]: https://github.com/HigherOrderCO/Bend/issues/583
[gh-586]: https://github.com/HigherOrderCO/Bend/issues/586

@ -1,25 +1,28 @@
>this is a WIP based on [Builtins.bend](https://github.com/HigherOrderCO/Bend/blob/main/src/fun/builtins.bend).
> this is a WIP based on [Builtins.bend](https://github.com/HigherOrderCO/Bend/blob/main/src/fun/builtins.bend).
# Built-in Types and Functions
This document is a reference guide to the built-in types and functions of **Bend**. Read more at [FEATURES.md](https://github.com/HigherOrderCO/Bend/blob/main/FEATURES.md).
## String
```python
type String = (Cons head ~tail) | (Nil)
```
- **Nil**: Represents an empty string.
- **Cons head ~tail**: Represents a string with a `head` character and a `tail` string.
### Syntax
A String literal is surrounded with `"`. It accepts the same values as character literals.
```
"Hello, World!"
```
## List
```python
type List = (Cons head ~tail) | (Nil)
```
@ -28,14 +31,15 @@ type List = (Cons head ~tail) | (Nil)
- **Cons head ~tail**: Represents a list with a `head` element and a `tail` list.
### Syntax
A List of values is written using `[ ]`, with its elements separated by `,`.
```
["This", "List", "Has", "Multiple", "Values"]
```
## Tree
```python
type Tree:
  Node { ~left, ~right }
@ -49,15 +53,19 @@ Trees are a structure that naturally lends itself to parallel recursion, so writ
- **Leaf { value }**: Represents one of the ends of the tree, storing `value`.
#### Syntax
**Bend** provides the `![]` operator to create tree branches and the `!` operator to create a tree leaf.
```py
# ![a, b] => Equivalent to Tree/Node { left: a, right: b }
# !x => Equivalent to Tree/Leaf { value: x }
tree = ![![!1, !2],![!3, !4]]
```
Technically your trees don't need to end with leaves, but if you don't, your program will be very hard to reason about.
## Map
```python
type Map:
  Node { value ~left ~right }
@ -71,7 +79,9 @@ It is meant to be used as an efficient map data structure with integer keys and
- **Leaf**: Represents an unwritten, empty portion of the map.
#### Syntax
Here's how you create a new `Map` with some initial values:
```python
{ 0: 4, `hi`: "bye", 'c': 2 + 3 }
```
@ -81,6 +91,7 @@ The keys must be `U24` numbers, and can be given as literals or any other expres
The values can be anything, but storing data of different types in a `Map` will make it harder for you to reason about it.
You can read and write a value of a map with the `[]` operator:
```python
map = { 0: "zero", 1: "one", 2: "two", 3: "three" }
map[0] = "not zero"
@ -91,18 +102,21 @@ map[3] = map[1] + map[map[1]]
Here, `map` must be the name of the `Map` variable, and the keys inside `[]` can be any expression that evaluates to a `U24`.
## Map functions
### Map/empty
Initializes an empty map.
```python
Map/empty = Map/Leaf
```
### Map/get
Retrieves a `value` from the `map` based on the `key`.
Returns a tuple with the value and the `map` unchanged.
```rust
Map/get map key =
match map {
@ -123,22 +137,30 @@ Map/get map key =
```
#### Syntax
Considering the following map
```python
{ 0: "hello", 1: "bye", 2: "maybe", 3: "yes"}
```
The `get` function can be written as
```
return x[0] # Gets the value of the key 0
```
And the value returned by `get` would be:
```
"hello"
```
### Map/set
Sets a `value` in the `map` at the specified `key`.
Returns the map with the new value.
```rust
Map/set map key value =
match map {
@ -160,32 +182,44 @@ Map/set map key value =
}
}
```
#### Syntax
Considering the following map
```python
{ 0: "hello", 1: "bye", 2: "maybe", 3: "yes"}
```
The `set` function can be written as
```py
x[0] = "swapped" # Assigns the key 0 to the value "swapped"
```
And the map resulting from the `set` would be:
```py
{ 0: "swapped", 1: "bye", 2: "maybe", 3: "yes"}
```
If there's no matching `key` in the map, `set` adds a new branch to the map holding the given value:
```py
x[4] = "added" # Assigns the key 4 to the value "added"
```
The new map:
```py
{ 0: "swapped", 1: "bye", 2: "maybe", 3: "yes", 4: "added"}
```
### Map/map
Applies a function to a value in the map.
Returns the map with the value mapped.
```rust
Map/map (Map/Leaf) key f = Map/Leaf
Map/map (Map/Node value left right) key f =
@ -201,30 +235,33 @@ Map/map (Map/Node value left right) key f =
```
#### Syntax
With the same map that we `set` in the previous section, we can map its values with `@=`:
```py
x[0] @= lambda y: String/concat(y, " and mapped")
# x[0] now contains "swapped and mapped"
```
## Nat
```python
type Nat = (Succ ~pred) | (Zero)
```
- **Succ ~pred**: Represents a natural number successor.
- **Zero**: Represents the natural number zero.
### Syntax
A natural number is written as a number literal prefixed with `#`.
```
#1337
```
## IO
The basic builtin IO functions are under development and will be stable in the next milestone.
Here is the current list of functions, but be aware that they may change in the near future.
@ -232,11 +269,13 @@ Here is the current list of functions, but be aware that they may change in the
### File IO
#### File open
```python
def IO/FS/open(path, mode)
```
Opens a file, with `path` given as a string and `mode` being a string with the mode to open the file in. The mode should be one of the following:
- `"r"`: Read mode
- `"w"`: Write mode (write at the beginning of the file, overwriting any existing content)
- `"a"`: Append mode (write at the end of the file)
@ -249,11 +288,13 @@ Returns an U24 with the file descriptor. File descriptors are not necessarily th
#### File descriptors for standard files
The standard input/output files are always open and assigned the following file descriptors:
- `IO/FS/STDIN = 0`: Standard input
- `IO/FS/STDOUT = 1`: Standard output
- `IO/FS/STDERR = 2`: Standard error
#### File close
```python
def IO/FS/close(file)
```
@ -261,6 +302,7 @@ def IO/FS/close(file)
Closes the file with the given `file` descriptor.
#### File read
```python
def IO/FS/read(file, num_bytes)
```
@ -270,6 +312,7 @@ Reads `num_bytes` bytes from the file with the given `file` descriptor.
Returns a list of U24 with each element representing a byte read from the file.
#### File write
```python
def IO/FS/write(file, bytes)
```
@ -278,9 +321,8 @@ Writes `bytes`, a list of U24 with each element representing a byte, to the file
Returns nothing (`*`).
Writing discards any preexisting content that came after the current position. For example, if your file contains the text `Hello, world!` and the current position is at the `,`, writing `!` will
#### File seek
```python
def IO/FS/seek(file, offset, mode)
```
@ -288,6 +330,7 @@ def IO/FS/seek(file, offset, mode)
Moves the current position of the file with the given `file` descriptor to the given `offset`, an I24 or U24 number, in bytes.
`mode` can be one of the following:
- `IO/FS/SEEK_SET = 0`: Seek from start of file
- `IO/FS/SEEK_CUR = 1`: Seek from current position
- `IO/FS/SEEK_END = 2`: Seek from end of file
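
The three seek modes use the same numeric values as the POSIX `whence` argument. As a rough illustration only, the sketch below uses Python's `io.BytesIO` as an in-memory stand-in for a file; it does not call `IO/FS/seek`, and the variable names are made up for the example. It assumes Bend's modes behave like their POSIX counterparts, which the text above suggests but does not spell out.

```python
import io

# In-memory stand-in for a file; IO/FS/seek itself is not called here.
f = io.BytesIO(b"Hello, world!")

f.seek(7, io.SEEK_SET)    # mode 0: 7 bytes from the start of the file
word = f.read(5)          # reads b"world", leaving the position at 12

f.seek(-6, io.SEEK_CUR)   # mode 1: 6 bytes back from the current position
space = f.read(1)         # reads the single space at offset 6

f.seek(-1, io.SEEK_END)   # mode 2: 1 byte before the end of the file
last = f.read(1)          # reads b"!"
```

Negative offsets are only meaningful for the "current" and "end" modes, which is presumably why `offset` may be an I24 as well as a U24.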
@ -297,33 +340,91 @@ Returns nothing (`*`).
## Numeric operations
### log
```py
def log(x: f24, base: f24) -> f24
```
Computes the logarithm of `x` with the specified `base`.
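
A logarithm in an arbitrary base can be expressed through the change-of-base identity. The Python sketch below illustrates that identity; `log_base` is a hypothetical helper, not the Bend builtin, and Python floats are wider than `f24`, so results in Bend may differ in the last digits.

```python
import math

def log_base(x: float, base: float) -> float:
    """Logarithm of x in the given base, via log(x) / log(base)."""
    return math.log(x) / math.log(base)

value = log_base(8.0, 2.0)  # approximately 3.0, since 2**3 == 8
```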
### atan2
```py
def atan2(x: f24, y: f24) -> f24
```
Computes the arctangent of `y / x`.
Has the same behaviour as `atan2f` in the C math lib.
### to_f24
```py
def to_f24(x: any number) -> f24
```
Casts any native number to an f24.
### to_u24
```py
def to_u24(x: any number) -> u24
```
Casts any native number to a u24.
### to_i24
```py
def to_i24(x: any number) -> i24
```
Casts any native number to an i24.
## Encoding functions
### Bytes/decode_utf8
```py
def Bytes/decode_utf8(bytes: [u24]) -> String
```
Decodes a sequence of bytes to a String using utf-8 encoding.
### Bytes/decode_ascii
```py
def Bytes/decode_ascii(bytes: [u24]) -> String
```
Decodes a sequence of bytes to a String using ascii encoding.
### String/encode_utf8
```py
def String/encode_utf8(s: String) -> [u24]
```
Encodes a String to a sequence of bytes using utf-8 encoding.
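
To make the byte layout concrete, here is a minimal Python model of UTF-8 encoding that works on a list of code points and returns a list of byte values. It is a sketch of the standard encoding scheme, not the Bend implementation, and `encode_utf8` here is an illustrative Python function. Like the builtin as described, it does not validate its input (for example, surrogate code points are not rejected).

```python
def encode_utf8(code_points):
    """Encode a list of Unicode code points as a list of UTF-8 byte values."""
    out = []
    for x in code_points:
        if x <= 0x7F:        # 1 byte:  0xxxxxxx
            out.append(x)
        elif x <= 0x7FF:     # 2 bytes: 110xxxxx 10xxxxxx
            out += [0xC0 | (x >> 6), 0x80 | (x & 0x3F)]
        elif x <= 0xFFFF:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += [0xE0 | (x >> 12),
                    0x80 | ((x >> 6) & 0x3F),
                    0x80 | (x & 0x3F)]
        else:                # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += [0xF0 | (x >> 18),
                    0x80 | ((x >> 12) & 0x3F),
                    0x80 | ((x >> 6) & 0x3F),
                    0x80 | (x & 0x3F)]
    return out

encoded = encode_utf8([ord(c) for c in "Hello, 世界!"])
```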
### String/encode_ascii
```py
def String/encode_ascii(s: String) -> [u24]
```
Encodes a String to a sequence of bytes using ascii encoding.
### Utf8/decode_character
```py
def Utf8/decode_character(bytes: [u24]) -> (rune: u24, rest: [u24])
```
Decodes a single utf-8 character from the start of `bytes`. Returns a tuple containing the rune and the rest of the byte sequence.
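
The following Python sketch models the idea of decoding one character and handing back the remaining bytes. It is a simplified model, not a line-by-line port of the Bend function: the lead byte selects the sequence length, continuation bytes are masked with `0x3F` without being validated, and on an invalid or truncated sequence it consumes one byte and returns U+FFFD. That last recovery rule is an assumption of this sketch and may differ from what the builtin does.

```python
REPLACEMENT_CHARACTER = 0xFFFD

def decode_character(data):
    """Return (rune, rest) for the first UTF-8 character in a list of bytes."""
    if not data:
        return 0, []
    a = data[0]
    if a <= 0x7F:                               # 1-byte sequence (ASCII)
        return a, data[1:]
    if (a & 0xE0) == 0xC0 and len(data) >= 2:   # 2-byte sequence
        rune = ((a & 0x1F) << 6) | (data[1] & 0x3F)
        return rune, data[2:]
    if (a & 0xF0) == 0xE0 and len(data) >= 3:   # 3-byte sequence
        rune = ((a & 0x0F) << 12) | ((data[1] & 0x3F) << 6) | (data[2] & 0x3F)
        return rune, data[3:]
    if (a & 0xF8) == 0xF0 and len(data) >= 4:   # 4-byte sequence
        rune = (((a & 0x07) << 18) | ((data[1] & 0x3F) << 12)
                | ((data[2] & 0x3F) << 6) | (data[3] & 0x3F))
        return rune, data[4:]
    # Assumption of this sketch: skip one byte and report U+FFFD.
    return REPLACEMENT_CHARACTER, data[1:]

rune, rest = decode_character([228, 184, 150, 231, 149, 140, 33])
```

Calling the function repeatedly on `rest` until it is empty yields the full sequence of runes.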
### Utf8/REPLACEMENT_CHARACTER
```py
def Utf8/REPLACEMENT_CHARACTER: u24 = '\u{FFFD}'
```
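
U+FFFD is the standard Unicode replacement character used to mark bytes that could not be decoded. For comparison, this is how Python's own decoder substitutes it when asked to replace errors; exactly where Bend's decoder emits it may differ, so treat this only as an illustration of the character's role.

```python
REPLACEMENT_CHARACTER = "\uFFFD"

# 0xFF can never appear in valid UTF-8, so it is replaced with U+FFFD.
decoded = bytes([72, 105, 0xFF]).decode("utf-8", errors="replace")
code_point = ord(REPLACEMENT_CHARACTER)
```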

@ -175,3 +175,113 @@ hvm to_u24:
# Casts any native number to an i24.
hvm to_i24:
($([i24] ret) ret)
# Encoding
Utf8/REPLACEMENT_CHARACTER = '\u{FFFD}'
Bytes/decode_utf8 bytes =
  let (got, rest) = (Utf8/decode_character bytes)
  match rest {
    List/Nil: (String/Cons got String/Nil)
    List/Cons: (String/Cons got (Bytes/decode_utf8 rest))
  }
Utf8/decode_character [] = (0, [])
Utf8/decode_character [a] = if (<= a 0x7F) { (a, []) } else { (Utf8/REPLACEMENT_CHARACTER, []) }
Utf8/decode_character [a, b] =
  use Utf8/maskx = 0b00111111
  use Utf8/mask2 = 0b00011111
  if (<= a 0x7F) {
    (a, [b])
  } else {
    if (== (& a 0xE0) 0xC0) {
      let r = (| (<< (& a Utf8/mask2) 6) (& b Utf8/maskx))
      (r, [])
    } else {
      (Utf8/REPLACEMENT_CHARACTER, [])
    }
  }
Utf8/decode_character [a, b, c] =
  use Utf8/maskx = 0b00111111
  use Utf8/mask2 = 0b00011111
  use Utf8/mask3 = 0b00001111
  if (<= a 0x7F) {
    (a, [b, c])
  } else {
    if (== (& a 0xE0) 0xC0) {
      let r = (| (<< (& a Utf8/mask2) 6) (& b Utf8/maskx))
      (r, [c])
    } else {
      if (== (& a 0xF0) 0xE0) {
        let r = (| (<< (& a Utf8/mask3) 12) (| (<< (& b Utf8/maskx) 6) (& c Utf8/maskx)))
        (r, [])
      } else {
        (Utf8/REPLACEMENT_CHARACTER, [])
      }
    }
  }
Utf8/decode_character (List/Cons a (List/Cons b (List/Cons c (List/Cons d rest)))) =
  use Utf8/maskx = 0b00111111
  use Utf8/mask2 = 0b00011111
  use Utf8/mask3 = 0b00001111
  use Utf8/mask4 = 0b00000111
  if (<= a 0x7F) {
    (a, (List/Cons b (List/Cons c (List/Cons d rest))))
  } else {
    if (== (& a 0xE0) 0xC0) {
      let r = (| (<< (& a Utf8/mask2) 6) (& b Utf8/maskx))
      (r, (List/Cons c (List/Cons d rest)))
    } else {
      if (== (& a 0xF0) 0xE0) {
        let r = (| (<< (& a Utf8/mask3) 12) (| (<< (& b Utf8/maskx) 6) (& c Utf8/maskx)))
        (r, (List/Cons d rest))
      } else {
        if (== (& a 0xF8) 0xF0) {
          let r = (| (<< (& a Utf8/mask4) 18) (| (<< (& b Utf8/maskx) 12) (| (<< (& c Utf8/maskx) 6) (& d Utf8/maskx))))
          # Return the bytes that follow the 4-byte sequence instead of dropping them.
          (r, rest)
        } else {
          (Utf8/REPLACEMENT_CHARACTER, rest)
        }
      }
    }
  }
String/encode_utf8 (String/Nil) = (List/Nil)
String/encode_utf8 (String/Cons x xs) =
  use Utf8/rune1max = (- (<< 1 7) 1)
  use Utf8/rune2max = (- (<< 1 11) 1)
  use Utf8/rune3max = (- (<< 1 16) 1)
  use Utf8/tx = 0b10000000
  use Utf8/t2 = 0b11000000
  use Utf8/t3 = 0b11100000
  use Utf8/t4 = 0b11110000
  use Utf8/maskx = 0b00111111
  if (<= x Utf8/rune1max) {
    (List/Cons x (String/encode_utf8 xs))
  } else {
    if (<= x Utf8/rune2max) {
      let b1 = (| Utf8/t2 (>> x 6))
      let b2 = (| Utf8/tx (& x Utf8/maskx))
      (List/Cons b1 (List/Cons b2 (String/encode_utf8 xs)))
    } else {
      if (<= x Utf8/rune3max) {
        let b1 = (| Utf8/t3 (>> x 12))
        let b2 = (| Utf8/tx (& (>> x 6) Utf8/maskx))
        let b3 = (| Utf8/tx (& x Utf8/maskx))
        (List/Cons b1 (List/Cons b2 (List/Cons b3 (String/encode_utf8 xs))))
      } else {
        let b1 = (| Utf8/t4 (>> x 18))
        let b2 = (| Utf8/tx (& (>> x 12) Utf8/maskx))
        let b3 = (| Utf8/tx (& (>> x 6) Utf8/maskx))
        let b4 = (| Utf8/tx (& x Utf8/maskx))
        (List/Cons b1 (List/Cons b2 (List/Cons b3 (List/Cons b4 (String/encode_utf8 xs)))))
      }
    }
  }
Bytes/decode_ascii (List/Cons x xs) = (String/Cons x (Bytes/decode_ascii xs))
Bytes/decode_ascii (List/Nil) = (String/Nil)
String/encode_ascii (String/Cons x xs) = (List/Cons x (String/encode_ascii xs))
String/encode_ascii (String/Nil) = (List/Nil)

@ -0,0 +1,4 @@
def main:
  use bytes = [72, 101, 108, 108, 111, 44, 32, 228, 184, 150, 231, 149, 140, 33]
  use s = "Hello, 世界!"
  return (String/encode_utf8(s), Bytes/decode_utf8(bytes))

@ -0,0 +1,9 @@
---
source: tests/golden_tests.rs
input_file: tests/golden_tests/run_file/encode_decode_utf8.bend
---
NumScott:
([72, 101, 108, 108, 111, 44, 32, 228, 184, 150, 231, 149, 140, 33], "Hello, 世界!")
Scott:
([72, 101, 108, 108, 111, 44, 32, 228, 184, 150, 231, 149, 140, 33], "Hello, 世界!")
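
The byte values in this snapshot can be cross-checked outside Bend. The short Python sketch below uses Python's built-in UTF-8 codec, not the Bend builtins, to confirm that the string and the byte list in the test correspond to each other; the variable names are chosen for the example.

```python
text = "Hello, 世界!"
expected_bytes = [72, 101, 108, 108, 111, 44, 32,
                  228, 184, 150, 231, 149, 140, 33]

# Encode with Python's codec and compare against the snapshot's byte list.
encoded = list(text.encode("utf-8"))
# Decode the snapshot's byte list and compare against the string.
decoded = bytes(expected_bytes).decode("utf-8")
```

Both directions agree: 世 (U+4E16) and 界 (U+754C) each take three bytes, which accounts for the six values above 127.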