[RFC] Move future extensions under future extensions.

Also add subsections for each future feature. Also tweak/improve some wording.
2024-09-11 05:45:30 +03:00 · 2021-09-29 10:26:53 -07:00 · 2021-09-29 10:26:53 -07:00 · 87c8ec2923
commit 87c8ec2923
parent 6340929f28
1 changed files with 82 additions and 74 deletions
--- a/docs/rfc/001-initial-strings.md
+++ b/docs/rfc/001-initial-strings.md
@ -102,29 +102,6 @@ creating literals, there is no need for the circuit to know
 which code points are disallowed.

 The equality operators `==` and `!=` are automatically available for `char`.
-Given that characters are essentially code points,
-we may also support the ordering operators `<`, `<=`, `>`, and `>=`;
-these may be useful to check whether a character is in certain range.
-
-Below is a list of possible operations we could support on characters.
-It should be fairly easy to add more.
- [ ] `is_alphabetic` - Returns `true` if the `char` has the `Alphabetic` property.
- [ ] `is_ascii` - Returns `true` if the `char` is in the `ASCII` range.
- [ ] `is_ascii_alphabetic` - Returns `true` if the `char` is in the `ASCII Alphabetic` range.
- [ ] `is_lowercase` - Returns `true` if the `char` has the `Lowercase` property.
- [ ] `is_numeric` - Returns `true` if the `char` has one of the general categories for numbers.
- [ ] `is_uppercase` - Returns `true` if the `char` has the `Uppercase` property.
- [ ] `is_whitespace` - Returns `true` if the `char` has the `White_Space` property.
- [ ] `to_digit` - Converts the `char` to the given `radix` format.
- [ ] `from_digit` - Inverse of to_digit.
- [ ] `to_uppercase` - Converts lowercase to uppercase, leaving others unchanged.
- [ ] `to_lowercase` - Converts uppercase to lowercase, leaving others unchanged.
-
-It seems natural to convert between `char` values
-and `u8` or `u16` or `u32` values, under suitable range conditions;
-perhaps also between `char` values and
-(non-negative) `i8` or `i16` or `i32` values.
-This will be accomplished as part of the type casting extension of Leo.

 The following code sample illustrates three ways of defining characters:
 character literal, single-character escapes, and Unicode escapes.
@ -212,19 +189,6 @@ that are also common for strings in other programming languages:
 * `s[i]` extracts the `i`-th character from the string `s`.
 * `s[1..]` removes the first character from the string `s`.

-Below is a list of possible operations we could support on strings.
-It should be fairly easy to add more.
- [ ] `u8` to `[char; 2]` hexstring, .., `u128` to `[char; 32]` hexstring.
- [ ] Field element to `[char; 64]` hexstring. (Application can test leading zeros and slice them out if it needs to return, say, a 40-hex-digit string.)
- [ ] Apply `to_uppercase` (see above) to every character.
- [ ] Apply `to_lowercase` (see above) to every character.
-
-Note that the latter two could be also realize via simple loops through the string.
-
-Given the natural conversions between `char` values and integer values discussed earlier,
-it may be natural to also support element-wise conversions between strings and arrays of integers.
-This may be accomplished as part of the type casting extensions of Leo.
-
 The following code shows a string literal and its actual transformation into an
 array of characters as well as possible array-like operations on strings:
 concatenation and comparison.
@ -255,38 +219,6 @@ which will be interpreted as a format string
 according to the semantics of console print calls.
 The internal UTF-32 string will be translated to UTF-8 for output.

-### Circuit Types for Character and String Operations
-
-The operations on characters and lists described earlier, e.g. `is_ascii`,
-are provided as static member functions of two new built-in or library circuit types `Char` and `String`.
-Thus, an example call is `Char::is_ascii(c)`.
-This seems a general good way to organize built-in or library operations,
-and supports the use of the same name with different circuit types,
-e.g. `Char::to_uppercase` and `String::to_uppercase`.
-
-These circuit types could also include constants, e.g. for certain ASCII characters.
-However, currently Leo does not support constants in circuit types,
-so that would have to be added separately first.
-
-These two circuit types are just meant to collect static member functions for characters and strings.
-They are not meant to be the types of characters and strings:
-as mentioned previously, `char` is a new scalar (not circuit) type (like `bool`, `address`, `u8`, etc.)
-and there is no string type as such for now, but we use character arrays for strings.
-In the future we may want all the Leo types to be circuit types of some sort,
-but that is a separate feature that would have to be designed and developed independently.
-
-### Input and Output of Literal Characters and Strings
-
-Since UTF-8 is a standard encoding, it would make sense for
-the literal characters and strings in the `.in` file
-to be automatically converted to UTF-32 by the Leo compiler.
-However, the size of a string can be confusing since multiple
-Unicode code points can be composed into a single glyph which
-then appears to be a single character.  If a parameter of type `[char; 10]`
-[if that is the syntax we decide on] is passed a literal string
-of a different size, the error message should explain that the
-size must be the number of codepoints needed to encode the string.
-
 ### Compilation to R1CS

 So far, the discussion has been independent from R1CS
@ -360,6 +292,8 @@ See the discussion of input files in the design section.

 ## Alternatives

+### 32-bit Unsigned Integers for Characters
+
 We could avoid the new `char` type altogether,
 and instead, rely on the existing `u32` to represent Unicode code points,
 and provide character-oriented operations on `u32` values.
@ -378,6 +312,8 @@ even if restricted to the number of bits needed for Unicode code points,
 is less efficient than the field representation described earlier
 because `u32` requires a field element for each bit.

+### Separate String Type, Isomorphic to Character Arrays
+
 Instead of representing strings as character arrays,
 we could introduce a new type `string`
 whose values are finite sequences of zero or more characters.
@ -392,6 +328,8 @@ Furthermore, as noted in the section on future extensions,
 this leaves the door open to
 introducing a future type `string` for resizable strings.

+### Field Elements for Characters
+
 Yet another option could be to use directly `field` to represent characters
 and `[field; N]` to represent strings of `N` characters.
 However, many values of type `field` are not valid Unicode code points,
@ -401,7 +339,82 @@ and more in accordance with Leo's strong typing.

 ## Future Extensions

-As alluded to in the section about design above,
+### Operations on Characters
+
+Given that characters are essentially code points,
+we may also support the ordering operators `<`, `<=`, `>`, and `>=`;
+these may be useful to check whether a character is in certain range.
+
+Below is a list of possible operations we could support on characters.
+It should be fairly easy to add more.
+- [ ] `is_alphabetic` - Returns `true` if the `char` has the `Alphabetic` property.
+- [ ] `is_ascii` - Returns `true` if the `char` is in the `ASCII` range.
+- [ ] `is_ascii_alphabetic` - Returns `true` if the `char` is in the `ASCII Alphabetic` range.
+- [ ] `is_lowercase` - Returns `true` if the `char` has the `Lowercase` property.
+- [ ] `is_numeric` - Returns `true` if the `char` has one of the general categories for numbers.
+- [ ] `is_uppercase` - Returns `true` if the `char` has the `Uppercase` property.
+- [ ] `is_whitespace` - Returns `true` if the `char` has the `White_Space` property.
+- [ ] `to_digit` - Converts the `char` to the given `radix` format.
+- [ ] `from_digit` - Inverse of to_digit.
+- [ ] `to_uppercase` - Converts lowercase to uppercase, leaving others unchanged.
+- [ ] `to_lowercase` - Converts uppercase to lowercase, leaving others unchanged.
+
+It seems natural to convert between `char` values
+and `u8` or `u16` or `u32` values, under suitable range conditions;
+perhaps also between `char` values and
+(non-negative) `i8` or `i16` or `i32` values.
+This may be accomplished as part of the type casting extension of Leo.
+
+### Operations on Strings
+
+Below is a list of possible operations we could support on strings.
+It should be fairly easy to add more.
+- [ ] `u8` to `[char; 2]` hexstring, .., `u128` to `[char; 32]` hexstring.
+- [ ] Field element to `[char; 64]` hexstring. (Application can test leading zeros and slice them out if it needs to return, say, a 40-hex-digit string.)
+- [ ] Apply `to_uppercase` (see above) to every character.
+- [ ] Apply `to_lowercase` (see above) to every character.
+
+Note that the latter two could be also realized via simple loops through the string.
+
+Given the natural conversions between `char` values and integer values discussed earlier,
+it may be natural to also support element-wise conversions between strings and arrays of integers.
+This may be accomplished as part of the type casting extensions of Leo.
+
+### Circuit Types for Character and String Operations
+
+The operations on characters and lists described earlier, e.g. `is_ascii`,
+are provided as static member functions of two new built-in or library circuit types `Char` and `String`.
+Thus, an example call is `Char::is_ascii(c)`.
+This seems a general good way to organize built-in or library operations,
+and supports the use of the same name with different circuit types,
+e.g. `Char::to_uppercase` and `String::to_uppercase`.
+
+These circuit types could also include constants, e.g. for certain ASCII characters.
+However, currently Leo does not support constants in circuit types,
+so that would have to be added separately first.
+
+These two circuit types are just meant to collect static member functions for characters and strings.
+They are not meant to be the types of characters and strings:
+as mentioned previously, `char` is a new scalar (not circuit) type (like `bool`, `address`, `u8`, etc.)
+and there is no string type as such for now, but we use character arrays for strings.
+In the future we may want all the Leo types to be circuit types of some sort,
+but that is a separate feature that would have to be designed and developed independently.
+
+### Input and Output of Literal Characters and Strings
+
+Since UTF-8 is a standard encoding, it would make sense for
+the literal characters and strings in the `.in` file
+to be automatically converted to UTF-32 by the Leo compiler.
+However, the size of a string can be confusing since multiple
+Unicode code points can be composed into a single glyph which
+then appears to be a single character.  If a parameter of type `[char; 10]`
+[if that is the syntax we decide on] is passed a literal string
+of a different size, the error message should explain that the
+size must be the number of codepoints needed to encode the string.
+
+### String Type
+
+As alluded to in the design section above,
 for now, we are avoiding the introduction of a string type,
 isomorphic to but separate from character arrays,
 because we may want to introduce later a more flexible type of strings,
@ -415,8 +428,3 @@ This hypothetical type of resizable vectors
 may have to be parameterized over the element type,
 requiring an extension of the Leo type system
 that is much more general than strings.
-
-Because of the above considerations,
-it seems premature to design a string type at this time,
-provided that the simple initial design described in the section above
-suffices to cover the initial use cases that motivate this RFC.