From 87c8ec2923f592b3cd5ea5fa45fd02b8e80e58bc Mon Sep 17 00:00:00 2001 From: Alessandro Coglio Date: Wed, 29 Sep 2021 10:26:53 -0700 Subject: [PATCH] [RFC] Move future extensions under future extensions. Also add subsections for each future feature. Also tweak/improve some wording. --- docs/rfc/001-initial-strings.md | 156 +++++++++++++++++--------------- 1 file changed, 82 insertions(+), 74 deletions(-) diff --git a/docs/rfc/001-initial-strings.md b/docs/rfc/001-initial-strings.md index 359983ebe3..50b4be6482 100644 --- a/docs/rfc/001-initial-strings.md +++ b/docs/rfc/001-initial-strings.md @@ -102,29 +102,6 @@ creating literals, there is no need for the circuit to know which code points are disallowed. The equality operators `==` and `!=` are automatically available for `char`. -Given that characters are essentially code points, -we may also support the ordering operators `<`, `<=`, `>`, and `>=`; -these may be useful to check whether a character is in certain range. - -Below is a list of possible operations we could support on characters. -It should be fairly easy to add more. -- [ ] `is_alphabetic` - Returns `true` if the `char` has the `Alphabetic` property. -- [ ] `is_ascii` - Returns `true` if the `char` is in the `ASCII` range. -- [ ] `is_ascii_alphabetic` - Returns `true` if the `char` is in the `ASCII Alphabetic` range. -- [ ] `is_lowercase` - Returns `true` if the `char` has the `Lowercase` property. -- [ ] `is_numeric` - Returns `true` if the `char` has one of the general categories for numbers. -- [ ] `is_uppercase` - Returns `true` if the `char` has the `Uppercase` property. -- [ ] `is_whitespace` - Returns `true` if the `char` has the `White_Space` property. -- [ ] `to_digit` - Converts the `char` to the given `radix` format. -- [ ] `from_digit` - Inverse of to_digit. -- [ ] `to_uppercase` - Converts lowercase to uppercase, leaving others unchanged. -- [ ] `to_lowercase` - Converts uppercase to lowercase, leaving others unchanged. - -It seems natural to convert between `char` values -and `u8` or `u16` or `u32` values, under suitable range conditions; -perhaps also between `char` values and -(non-negative) `i8` or `i16` or `i32` values. -This will be accomplished as part of the type casting extension of Leo. The following code sample illustrates three ways of defining characters: character literal, single-character escapes, and Unicode escapes. @@ -212,19 +189,6 @@ that are also common for strings in other programming languages: * `s[i]` extracts the `i`-th character from the string `s`. * `s[1..]` removes the first character from the string `s`. -Below is a list of possible operations we could support on strings. -It should be fairly easy to add more. -- [ ] `u8` to `[char; 2]` hexstring, .., `u128` to `[char; 32]` hexstring. -- [ ] Field element to `[char; 64]` hexstring. (Application can test leading zeros and slice them out if it needs to return, say, a 40-hex-digit string.) -- [ ] Apply `to_uppercase` (see above) to every character. -- [ ] Apply `to_lowercase` (see above) to every character. - -Note that the latter two could be also realize via simple loops through the string. - -Given the natural conversions between `char` values and integer values discussed earlier, -it may be natural to also support element-wise conversions between strings and arrays of integers. -This may be accomplished as part of the type casting extensions of Leo. - The following code shows a string literal and its actual transformation into an array of characters as well as possible array-like operations on strings: concatenation and comparison. @@ -255,38 +219,6 @@ which will be interpreted as a format string according to the semantics of console print calls. The internal UTF-32 string will be translated to UTF-8 for output. -### Circuit Types for Character and String Operations - -The operations on characters and lists described earlier, e.g. `is_ascii`, -are provided as static member functions of two new built-in or library circuit types `Char` and `String`. -Thus, an example call is `Char::is_ascii(c)`. -This seems a general good way to organize built-in or library operations, -and supports the use of the same name with different circuit types, -e.g. `Char::to_uppercase` and `String::to_uppercase`. - -These circuit types could also include constants, e.g. for certain ASCII characters. -However, currently Leo does not support constants in circuit types, -so that would have to be added separately first. - -These two circuit types are just meant to collect static member functions for characters and strings. -They are not meant to be the types of characters and strings: -as mentioned previously, `char` is a new scalar (not circuit) type (like `bool`, `address`, `u8`, etc.) -and there is no string type as such for now, but we use character arrays for strings. -In the future we may want all the Leo types to be circuit types of some sort, -but that is a separate feature that would have to be designed and developed independently. - -### Input and Output of Literal Characters and Strings - -Since UTF-8 is a standard encoding, it would make sense for -the literal characters and strings in the `.in` file -to be automatically converted to UTF-32 by the Leo compiler. -However, the size of a string can be confusing since multiple -Unicode code points can be composed into a single glyph which -then appears to be a single character. If a parameter of type `[char; 10]` -[if that is the syntax we decide on] is passed a literal string -of a different size, the error message should explain that the -size must be the number of codepoints needed to encode the string. - ### Compilation to R1CS So far, the discussion has been independent from R1CS @@ -360,6 +292,8 @@ See the discussion of input files in the design section. ## Alternatives +### 32-bit Unsigned Integers for Characters + We could avoid the new `char` type altogether, and instead, rely on the existing `u32` to represent Unicode code points, and provide character-oriented operations on `u32` values. @@ -378,6 +312,8 @@ even if restricted to the number of bits needed for Unicode code points, is less efficient than the field representation described earlier because `u32` requires a field element for each bit. +### Separate String Type, Isomorphic to Character Arrays + Instead of representing strings as character arrays, we could introduce a new type `string` whose values are finite sequences of zero or more characters. @@ -392,6 +328,8 @@ Furthermore, as noted in the section on future extensions, this leaves the door open to introducing a future type `string` for resizable strings. +### Field Elements for Characters + Yet another option could be to use directly `field` to represent characters and `[field; N]` to represent strings of `N` characters. However, many values of type `field` are not valid Unicode code points, @@ -401,7 +339,82 @@ and more in accordance with Leo's strong typing. ## Future Extensions -As alluded to in the section about design above, +### Operations on Characters + +Given that characters are essentially code points, +we may also support the ordering operators `<`, `<=`, `>`, and `>=`; +these may be useful to check whether a character is in certain range. + +Below is a list of possible operations we could support on characters. +It should be fairly easy to add more. +- [ ] `is_alphabetic` - Returns `true` if the `char` has the `Alphabetic` property. +- [ ] `is_ascii` - Returns `true` if the `char` is in the `ASCII` range. +- [ ] `is_ascii_alphabetic` - Returns `true` if the `char` is in the `ASCII Alphabetic` range. +- [ ] `is_lowercase` - Returns `true` if the `char` has the `Lowercase` property. +- [ ] `is_numeric` - Returns `true` if the `char` has one of the general categories for numbers. +- [ ] `is_uppercase` - Returns `true` if the `char` has the `Uppercase` property. +- [ ] `is_whitespace` - Returns `true` if the `char` has the `White_Space` property. +- [ ] `to_digit` - Converts the `char` to the given `radix` format. +- [ ] `from_digit` - Inverse of to_digit. +- [ ] `to_uppercase` - Converts lowercase to uppercase, leaving others unchanged. +- [ ] `to_lowercase` - Converts uppercase to lowercase, leaving others unchanged. + +It seems natural to convert between `char` values +and `u8` or `u16` or `u32` values, under suitable range conditions; +perhaps also between `char` values and +(non-negative) `i8` or `i16` or `i32` values. +This may be accomplished as part of the type casting extension of Leo. + +### Operations on Strings + +Below is a list of possible operations we could support on strings. +It should be fairly easy to add more. +- [ ] `u8` to `[char; 2]` hexstring, .., `u128` to `[char; 32]` hexstring. +- [ ] Field element to `[char; 64]` hexstring. (Application can test leading zeros and slice them out if it needs to return, say, a 40-hex-digit string.) +- [ ] Apply `to_uppercase` (see above) to every character. +- [ ] Apply `to_lowercase` (see above) to every character. + +Note that the latter two could be also realized via simple loops through the string. + +Given the natural conversions between `char` values and integer values discussed earlier, +it may be natural to also support element-wise conversions between strings and arrays of integers. +This may be accomplished as part of the type casting extensions of Leo. + +### Circuit Types for Character and String Operations + +The operations on characters and lists described earlier, e.g. `is_ascii`, +are provided as static member functions of two new built-in or library circuit types `Char` and `String`. +Thus, an example call is `Char::is_ascii(c)`. +This seems a general good way to organize built-in or library operations, +and supports the use of the same name with different circuit types, +e.g. `Char::to_uppercase` and `String::to_uppercase`. + +These circuit types could also include constants, e.g. for certain ASCII characters. +However, currently Leo does not support constants in circuit types, +so that would have to be added separately first. + +These two circuit types are just meant to collect static member functions for characters and strings. +They are not meant to be the types of characters and strings: +as mentioned previously, `char` is a new scalar (not circuit) type (like `bool`, `address`, `u8`, etc.) +and there is no string type as such for now, but we use character arrays for strings. +In the future we may want all the Leo types to be circuit types of some sort, +but that is a separate feature that would have to be designed and developed independently. + +### Input and Output of Literal Characters and Strings + +Since UTF-8 is a standard encoding, it would make sense for +the literal characters and strings in the `.in` file +to be automatically converted to UTF-32 by the Leo compiler. +However, the size of a string can be confusing since multiple +Unicode code points can be composed into a single glyph which +then appears to be a single character. If a parameter of type `[char; 10]` +[if that is the syntax we decide on] is passed a literal string +of a different size, the error message should explain that the +size must be the number of codepoints needed to encode the string. + +### String Type + +As alluded to in the design section above, for now, we are avoiding the introduction of a string type, isomorphic to but separate from character arrays, because we may want to introduce later a more flexible type of strings, @@ -415,8 +428,3 @@ This hypothetical type of resizable vectors may have to be parameterized over the element type, requiring an extension of the Leo type system that is much more general than strings. - -Because of the above considerations, -it seems premature to design a string type at this time, -provided that the simple initial design described in the section above -suffices to cover the initial use cases that motivate this RFC.