mirror of
https://github.com/AleoHQ/leo.git
synced 2024-11-24 10:52:29 +03:00
[RFC] Move future extensions under future extensions.
Also add subsections for each future feature. Also tweak/improve some wording.
This commit is contained in:
parent
6340929f28
commit
87c8ec2923
@ -102,29 +102,6 @@ creating literals, there is no need for the circuit to know
|
||||
which code points are disallowed.
|
||||
|
||||
The equality operators `==` and `!=` are automatically available for `char`.
|
||||
Given that characters are essentially code points,
|
||||
we may also support the ordering operators `<`, `<=`, `>`, and `>=`;
|
||||
these may be useful to check whether a character is in certain range.
|
||||
|
||||
Below is a list of possible operations we could support on characters.
|
||||
It should be fairly easy to add more.
|
||||
- [ ] `is_alphabetic` - Returns `true` if the `char` has the `Alphabetic` property.
|
||||
- [ ] `is_ascii` - Returns `true` if the `char` is in the `ASCII` range.
|
||||
- [ ] `is_ascii_alphabetic` - Returns `true` if the `char` is in the `ASCII Alphabetic` range.
|
||||
- [ ] `is_lowercase` - Returns `true` if the `char` has the `Lowercase` property.
|
||||
- [ ] `is_numeric` - Returns `true` if the `char` has one of the general categories for numbers.
|
||||
- [ ] `is_uppercase` - Returns `true` if the `char` has the `Uppercase` property.
|
||||
- [ ] `is_whitespace` - Returns `true` if the `char` has the `White_Space` property.
|
||||
- [ ] `to_digit` - Converts the `char` to the given `radix` format.
|
||||
- [ ] `from_digit` - Inverse of to_digit.
|
||||
- [ ] `to_uppercase` - Converts lowercase to uppercase, leaving others unchanged.
|
||||
- [ ] `to_lowercase` - Converts uppercase to lowercase, leaving others unchanged.
|
||||
|
||||
It seems natural to convert between `char` values
|
||||
and `u8` or `u16` or `u32` values, under suitable range conditions;
|
||||
perhaps also between `char` values and
|
||||
(non-negative) `i8` or `i16` or `i32` values.
|
||||
This will be accomplished as part of the type casting extension of Leo.
|
||||
|
||||
The following code sample illustrates three ways of defining characters:
|
||||
character literal, single-character escapes, and Unicode escapes.
|
||||
@ -212,19 +189,6 @@ that are also common for strings in other programming languages:
|
||||
* `s[i]` extracts the `i`-th character from the string `s`.
|
||||
* `s[1..]` removes the first character from the string `s`.
|
||||
|
||||
Below is a list of possible operations we could support on strings.
|
||||
It should be fairly easy to add more.
|
||||
- [ ] `u8` to `[char; 2]` hexstring, .., `u128` to `[char; 32]` hexstring.
|
||||
- [ ] Field element to `[char; 64]` hexstring. (Application can test leading zeros and slice them out if it needs to return, say, a 40-hex-digit string.)
|
||||
- [ ] Apply `to_uppercase` (see above) to every character.
|
||||
- [ ] Apply `to_lowercase` (see above) to every character.
|
||||
|
||||
Note that the latter two could be also realize via simple loops through the string.
|
||||
|
||||
Given the natural conversions between `char` values and integer values discussed earlier,
|
||||
it may be natural to also support element-wise conversions between strings and arrays of integers.
|
||||
This may be accomplished as part of the type casting extensions of Leo.
|
||||
|
||||
The following code shows a string literal and its actual transformation into an
|
||||
array of characters as well as possible array-like operations on strings:
|
||||
concatenation and comparison.
|
||||
@ -255,38 +219,6 @@ which will be interpreted as a format string
|
||||
according to the semantics of console print calls.
|
||||
The internal UTF-32 string will be translated to UTF-8 for output.
|
||||
|
||||
### Circuit Types for Character and String Operations
|
||||
|
||||
The operations on characters and lists described earlier, e.g. `is_ascii`,
|
||||
are provided as static member functions of two new built-in or library circuit types `Char` and `String`.
|
||||
Thus, an example call is `Char::is_ascii(c)`.
|
||||
This seems a general good way to organize built-in or library operations,
|
||||
and supports the use of the same name with different circuit types,
|
||||
e.g. `Char::to_uppercase` and `String::to_uppercase`.
|
||||
|
||||
These circuit types could also include constants, e.g. for certain ASCII characters.
|
||||
However, currently Leo does not support constants in circuit types,
|
||||
so that would have to be added separately first.
|
||||
|
||||
These two circuit types are just meant to collect static member functions for characters and strings.
|
||||
They are not meant to be the types of characters and strings:
|
||||
as mentioned previously, `char` is a new scalar (not circuit) type (like `bool`, `address`, `u8`, etc.)
|
||||
and there is no string type as such for now, but we use character arrays for strings.
|
||||
In the future we may want all the Leo types to be circuit types of some sort,
|
||||
but that is a separate feature that would have to be designed and developed independently.
|
||||
|
||||
### Input and Output of Literal Characters and Strings
|
||||
|
||||
Since UTF-8 is a standard encoding, it would make sense for
|
||||
the literal characters and strings in the `.in` file
|
||||
to be automatically converted to UTF-32 by the Leo compiler.
|
||||
However, the size of a string can be confusing since multiple
|
||||
Unicode code points can be composed into a single glyph which
|
||||
then appears to be a single character. If a parameter of type `[char; 10]`
|
||||
[if that is the syntax we decide on] is passed a literal string
|
||||
of a different size, the error message should explain that the
|
||||
size must be the number of codepoints needed to encode the string.
|
||||
|
||||
### Compilation to R1CS
|
||||
|
||||
So far, the discussion has been independent from R1CS
|
||||
@ -360,6 +292,8 @@ See the discussion of input files in the design section.
|
||||
|
||||
## Alternatives
|
||||
|
||||
### 32-bit Unsigned Integers for Characters
|
||||
|
||||
We could avoid the new `char` type altogether,
|
||||
and instead, rely on the existing `u32` to represent Unicode code points,
|
||||
and provide character-oriented operations on `u32` values.
|
||||
@ -378,6 +312,8 @@ even if restricted to the number of bits needed for Unicode code points,
|
||||
is less efficient than the field representation described earlier
|
||||
because `u32` requires a field element for each bit.
|
||||
|
||||
### Separate String Type, Isomorphic to Character Arrays
|
||||
|
||||
Instead of representing strings as character arrays,
|
||||
we could introduce a new type `string`
|
||||
whose values are finite sequences of zero or more characters.
|
||||
@ -392,6 +328,8 @@ Furthermore, as noted in the section on future extensions,
|
||||
this leaves the door open to
|
||||
introducing a future type `string` for resizable strings.
|
||||
|
||||
### Field Elements for Characters
|
||||
|
||||
Yet another option could be to use directly `field` to represent characters
|
||||
and `[field; N]` to represent strings of `N` characters.
|
||||
However, many values of type `field` are not valid Unicode code points,
|
||||
@ -401,7 +339,82 @@ and more in accordance with Leo's strong typing.
|
||||
|
||||
## Future Extensions
|
||||
|
||||
As alluded to in the section about design above,
|
||||
### Operations on Characters
|
||||
|
||||
Given that characters are essentially code points,
|
||||
we may also support the ordering operators `<`, `<=`, `>`, and `>=`;
|
||||
these may be useful to check whether a character is in certain range.
|
||||
|
||||
Below is a list of possible operations we could support on characters.
|
||||
It should be fairly easy to add more.
|
||||
- [ ] `is_alphabetic` - Returns `true` if the `char` has the `Alphabetic` property.
|
||||
- [ ] `is_ascii` - Returns `true` if the `char` is in the `ASCII` range.
|
||||
- [ ] `is_ascii_alphabetic` - Returns `true` if the `char` is in the `ASCII Alphabetic` range.
|
||||
- [ ] `is_lowercase` - Returns `true` if the `char` has the `Lowercase` property.
|
||||
- [ ] `is_numeric` - Returns `true` if the `char` has one of the general categories for numbers.
|
||||
- [ ] `is_uppercase` - Returns `true` if the `char` has the `Uppercase` property.
|
||||
- [ ] `is_whitespace` - Returns `true` if the `char` has the `White_Space` property.
|
||||
- [ ] `to_digit` - Converts the `char` to the given `radix` format.
|
||||
- [ ] `from_digit` - Inverse of to_digit.
|
||||
- [ ] `to_uppercase` - Converts lowercase to uppercase, leaving others unchanged.
|
||||
- [ ] `to_lowercase` - Converts uppercase to lowercase, leaving others unchanged.
|
||||
|
||||
It seems natural to convert between `char` values
|
||||
and `u8` or `u16` or `u32` values, under suitable range conditions;
|
||||
perhaps also between `char` values and
|
||||
(non-negative) `i8` or `i16` or `i32` values.
|
||||
This may be accomplished as part of the type casting extension of Leo.
|
||||
|
||||
### Operations on Strings
|
||||
|
||||
Below is a list of possible operations we could support on strings.
|
||||
It should be fairly easy to add more.
|
||||
- [ ] `u8` to `[char; 2]` hexstring, .., `u128` to `[char; 32]` hexstring.
|
||||
- [ ] Field element to `[char; 64]` hexstring. (Application can test leading zeros and slice them out if it needs to return, say, a 40-hex-digit string.)
|
||||
- [ ] Apply `to_uppercase` (see above) to every character.
|
||||
- [ ] Apply `to_lowercase` (see above) to every character.
|
||||
|
||||
Note that the latter two could be also realized via simple loops through the string.
|
||||
|
||||
Given the natural conversions between `char` values and integer values discussed earlier,
|
||||
it may be natural to also support element-wise conversions between strings and arrays of integers.
|
||||
This may be accomplished as part of the type casting extensions of Leo.
|
||||
|
||||
### Circuit Types for Character and String Operations
|
||||
|
||||
The operations on characters and lists described earlier, e.g. `is_ascii`,
|
||||
are provided as static member functions of two new built-in or library circuit types `Char` and `String`.
|
||||
Thus, an example call is `Char::is_ascii(c)`.
|
||||
This seems a general good way to organize built-in or library operations,
|
||||
and supports the use of the same name with different circuit types,
|
||||
e.g. `Char::to_uppercase` and `String::to_uppercase`.
|
||||
|
||||
These circuit types could also include constants, e.g. for certain ASCII characters.
|
||||
However, currently Leo does not support constants in circuit types,
|
||||
so that would have to be added separately first.
|
||||
|
||||
These two circuit types are just meant to collect static member functions for characters and strings.
|
||||
They are not meant to be the types of characters and strings:
|
||||
as mentioned previously, `char` is a new scalar (not circuit) type (like `bool`, `address`, `u8`, etc.)
|
||||
and there is no string type as such for now, but we use character arrays for strings.
|
||||
In the future we may want all the Leo types to be circuit types of some sort,
|
||||
but that is a separate feature that would have to be designed and developed independently.
|
||||
|
||||
### Input and Output of Literal Characters and Strings
|
||||
|
||||
Since UTF-8 is a standard encoding, it would make sense for
|
||||
the literal characters and strings in the `.in` file
|
||||
to be automatically converted to UTF-32 by the Leo compiler.
|
||||
However, the size of a string can be confusing since multiple
|
||||
Unicode code points can be composed into a single glyph which
|
||||
then appears to be a single character. If a parameter of type `[char; 10]`
|
||||
[if that is the syntax we decide on] is passed a literal string
|
||||
of a different size, the error message should explain that the
|
||||
size must be the number of codepoints needed to encode the string.
|
||||
|
||||
### String Type
|
||||
|
||||
As alluded to in the design section above,
|
||||
for now, we are avoiding the introduction of a string type,
|
||||
isomorphic to but separate from character arrays,
|
||||
because we may want to introduce later a more flexible type of strings,
|
||||
@ -415,8 +428,3 @@ This hypothetical type of resizable vectors
|
||||
may have to be parameterized over the element type,
|
||||
requiring an extension of the Leo type system
|
||||
that is much more general than strings.
|
||||
|
||||
Because of the above considerations,
|
||||
it seems premature to design a string type at this time,
|
||||
provided that the simple initial design described in the section above
|
||||
suffices to cover the initial use cases that motivate this RFC.
|
||||
|
Loading…
Reference in New Issue
Block a user