diff --git a/grammar/abnf-grammar.md b/grammar/abnf-grammar.md new file mode 100644 index 0000000000..eb587deea0 --- /dev/null +++ b/grammar/abnf-grammar.md @@ -0,0 +1,1803 @@ +Leo Library + + + +Copyright (C) 2021 Aleo Systems Inc. + + + +Author: Alessandro Coglio (acoglio on GitHub) + + +-------- + + +Format Note +----------- + +The ABNF standard requires grammars to consist of lines terminated by CR LF +(i.e. carriage return followed by line feed, DOS/Windows-style), +as explained in the background on ABNF later in this file. +This file's lines are therefore terminated by CR LF. +To avoid losing this requirement across systems, +this file is marked as 'text eol=crlf' in .gitattributes: +this means that the file is textual, enabling visual diffs, +but its lines will always be terminated by CR LF on any system. + +Note that this CR LF requirement only applies to the grammar files themselves. +It does not apply to the lines of the languages described by the grammar. +ABNF grammar may describe any kinds of languages, +with any kind of line terminators, +or even without line terminators at all (e.g. for "binary" languages). + + +-------- + + +Introduction +------------ + +This file contains an initial draft of +a complete ABNF (Augmented Backus-Naur Form) grammar of Leo. +Background on ABNF is provided later in this file. + +The initial motivation for creating an ABNF grammar of Leo was that +we have a formally verified parser of ABNF grammars +(at https://github.com/acl2/acl2/tree/master/books/kestrel/abnf; +also see the paper at https://www.kestrel.edu/people/coglio/vstte18.pdf) +which we have used to parse this ABNF grammar of Leo +into a formal representation, in the ACL2 theorem prover, +of the Leo concrete syntax. +The parsing of this grammar file into an ACL2 representation +happens every time the ACL2 formalization of Leo is built. + +In addition to that initial motivation, +this ABNF grammar has now the additional and primary purpose of +providing an official definition of the syntax of Leo +that is both human-readable and machine-readable. +This grammar will be part of the (upcoming) Leo language reference, +of the Leo Language Formal Specification +(i.e. the LaTeX document in the leo-semantics repo), +and of the ACL2 formalization of Leo (which was the initial motivation). +It has also been suggested that it may be used to generate tests. + + +-------- + + +Background on ABNF +------------------ + +ABNF is an Internet standard: +see https://www.rfc-editor.org/info/rfc5234 +and https://www.rfc-editor.org/info/rfc7405. +It is used to specify the syntax of JSON, HTTP, and other standards. + +The following is copied (and "un-LaTeX'd") from the aforementioned paper +(at https://www.kestrel.edu/people/coglio/vstte18.pdf). + +ABNF adds conveniences and makes slight modifications +to Backus-Naur Form (BNF), +without going beyond context-free grammars. + +Instead of BNF's angle-bracket notation for nonterminals, +ABNF uses case-insensitive names consisting of letters, digits, and dashes, +e.g. HTTP-message and IPv6address. +ABNF includes an angle-bracket notation for prose descriptions, +e.g. , +usable as last resort in the definiens of a nonterminal. + +While BNF allows arbitrary terminals, +ABNF uses only natural numbers as terminals, +and denotes them via: +(i) binary, decimal, or hexadecimal sequences, +e.g. %b1.11.1010, %d1.3.10, and %x.1.3.A +all denote the string '1 3 10'; +(ii) binary, decimal, or hexadecimal ranges, +e.g. %x30-39 denotes any string 'n' with 48 <= n <= 57 +(an ASCII digit); +(iii) case-sensitive ASCII strings, +e.g. "Ab" denotes the string '65 98'; +and (iv) case-insensitive ASCII strings, +e.g. %i"ab", or just "ab", denotes +any string among +'65 66', +'65 98', +'97 66', and +'97 98'. +ABNF terminals in suitable sets represent ASCII or Unicode characters. + +ABNF allows repetition prefixes n*m, +where n and m are natural numbers in decimal notation; +if absent, +n defaults to 0, and +m defaults to infinity. +For example, +1*4HEXDIG denotes one to four HEXDIGs, +*3DIGIT denotes up to three DIGITs, and +1*OCTET denotes one or more OCTETs. +A single n prefix +abbreviates n*n, +e.g. 3DIGIT denotes three DIGITs. + +Instead of BNF's |, ABNF uses / to separate alternatives. +Repetition prefixes have precedence over juxtapositions, +which have precedence over /. +Round brackets group things and override the aforementioned precedence rules, +e.g. *(WSP / CRLF WSP) denotes strings +obtained by repeating, zero or more times, +either (i) a WSP or (ii) a CRLF followed by a WSP. +Square brackets also group things but make them optional, +e.g. [":" port] is equivalent to 0*1(":" port). + +Instead of BNF's ::=, ABNF uses = to define nonterminals, +and =/ to incrementally add alternatives +to previously defined nonterminals. +For example, the rule BIT = "0" / "1" +is equivalent to BIT = "0" followed by BIT =/ "1". + +The syntax of ABNF itself is formally specified in ABNF +(in Section 4 of the aforementioned RFC 5234, +after the syntax and semantics of ABNF +are informally specified in natural language +(in Sections 1, 2, and 3 of the aforementioned RFC 5234). +The syntax rules of ABNF prescribe the ASCII codes allowed for +white space (spaces and horizontal tabs), +line endings (carriage returns followed by line feeds), +and comments (semicolons to line endings). + + +-------- + + +Structure +--------- + +Language specifications often define the syntax of their languages via +(i) a lexical grammar that describes how +sequence of characters are parsed into tokens, and +(ii) a syntactic grammar that described how +tokens are parsed into expressions, statements, etc. +(The adjectives 'lexical' and 'syntactic' are +the ones used in the Java language specification; +other terms may be used by other languages, +but the essence is similar.) +The situation is sometimes more complex, +with multiple passes (e.g. Unicode escape processing in Java), +but the division between lexical and syntactic (in the sense above) stands. + +In the aforementioned language specifications, +both the lexical and syntactic grammars +are normally written in a context-free grammar notation, +augmented with natural language that may assert, for instance, +that tokenization always takes the longest sequence that constitutes a token. + +This dual structure appears to be motivated by the fact that +concerns of white space, line endings, etc. +can be handled by the lexical grammar, +allowing the syntactic grammar to focus on the more important structure. +Handling both aspects in a single context-free grammar may be unwieldy, +so having two grammars provides more clarity and readability. + +In contrast, PEG (Parsing Expression Grammar) formalisms like Pest +naturally embody a procedural interpretation +that can handle white space and tokenization in just one manageable grammar. +However, this procedural interpretaion may be sometimes +less clear and readable to humans than context-free rules. +Indeed, context-free grammar are commonly used to documentat languages. + +ABNF is a context-free grammar notation, +with no procedural interpretation, +and therefore it makes sense to define +separate lexical and syntactic ABNF grammars for Leo. +Conceptually, the two grammars define two subsequent processing phases, +as detailed below. +However, a parser implementation does not need to perform +two strictly separate phases (in fact, it typically does not), +so long as it produces the same final result. + + +-------- + + +Operator Precedence +------------------- + +We formulate the grammar rules for expressions +in a way that describes the relative precedence of operators, +as often done in language syntax specifications. + +For instance, consider the rules + + + +``` +multiplicative-expression = + exponential-expression + / multiplicative-expression "*" exponential-expression + / multiplicative-expression "/" exponential-expression +``` + + + +``` +additive-expression = + multiplicative-expression + / additive-expression "+" multiplicative-expression + / additive-expression "-" multiplicative-expression +``` + + + +this rule tells us that the additive operators '+' and '-' have +lower precedence than the multiplicative operators '*' and '/', +and that both the additive and multiplicative operators associate to the left. +This may be best understood via the examples given below. + +According to the rules, the expression + + + +``` +x + y * z +``` + + + +can only be parsed as + + + +``` + + + / \ +x * + / \ + y z +``` + + + +and not as + + + +``` + * + / \ + + z + / \ +x y +``` + + + +because a multiplicative expression cannot have an additive expression +as first sub-expression, as it would in the second tree above. + +Also according to the rules, the expression + + + +``` +x + y + z +``` + + + +can only be parsed as + + + +``` + + + / \ + + z + / \ +x y +``` + + + +and not as + + + +``` + + + / \ +x + + / \ + y z +``` + + + +because an additive expression cannot have an additive expression +as second sub-expression, as it would in the second tree above. + + +-------- + + +Naming Convention +----------------- + +For this ABNF grammar, we choose nonterminal names +that consist of complete English words, separated by dashes, +and that describe the construct the way it is in English. +For instance, we use the name 'conditional-statement' +to describe conditional statements. + +At the same time, we attempt to establish +a precise and "official" nomenclature for the Leo constructs, +by way of the nonterminal names that define their syntax. +For instance, the rule + + + +``` +group-literal = product-group-literal + / affine-group-literal +``` + + + +tells us that there are two kinds of group literals, +namely product group literals and affine group literals. +This is more precise than describing them as +integers (which are not really group elements per se), +or points (they are all points, just differently specified), +or being singletons vs. pairs (which is a bit generic). + +The only exception to the nomenclature-establishing role of the grammar +is the fact that, as discussed above, +we write the grammar rules in a way that determines +the relative precedence and the associativity of expression operators, +and that therefore we have rules like + + + +``` +unary-expression = primary-expression + / "!" unary-expression + / "-" unary-expression +``` + + + +In order to allow the recursion of the rule to stop, +we need to regard, in the grammar, a primary expression as a unary expression. +However, this is just a grammatical artifact: +ontologically, a primary expression is not really a unary expression, +because a unary expression is one that consists of +a unary operator and an operand sub-expression. +These terminological "exceptions" should be easy to identify in the rules. + + +-------- + + +Lexical Grammar +--------------- + +A Leo file is a finite sequence of Unicode characters, +represented as Unicode code points, +which are numbers in the range form 0 to 10FFFFh. +These are captured by the ABNF rule 'character' below. + +The lexical grammar defines how, conceptually, +the sequence of characters is turned into +a sequence of tokens, comments, and whitespaces. + +As stated, the lexical grammar is ambiguous. +For example, the sequence of characters '**' (i.e. two stars) +could be equally parsed as two '*' symbol tokens or one '**' symbol token +(see rule for 'symbol' below). +As another example, the sequence or characters '' +(i.e. carriage return followed by line feed) +could be equally parsed as two line terminators or one +(see rule for 'newline'). + +Thus, as often done in language syntax definitions, +the lexical grammar is disambiguated by +the extra-grammatical requirement that +the longest possible sequence of characters is always parsed. +This way, '**' must be parsed as one '**' symbol token, +and '' must be parsed as one line terminator. +(We should formalize this requirement, +along with the other extra-grammatical requirements given below, +and formally prove that they indeed make +the lexical grammar of Leo unambiguous.) + +As mentioned above, a character is any Unicode code point. +This grammar does not say how those are encoded in files (e.g. UTF-8): +it starts with a decoded sequence of Unicode code points. +Note that we allow any value, +even though some values may not be used according to the Unicode standard. + + +```abnf +character = %x0-10FFFF ; any Unicode code point +``` + +We give names to certain ASCII characters. + + +```abnf +horizontal-tab = %x9 +``` + + +```abnf +line-feed = %xA +``` + + +```abnf +carriage-return = %xD +``` + + +```abnf +space = %x20 +``` + + +```abnf +double-quote = %x22 +``` + +We give names to complements of certain ASCII characters. +These consist of all the Unicode characters except for one or two. + + +```abnf +not-double-quote = %x0-22 / %x24-10FFFF ; anything but " +``` + + +```abnf +not-star = %x0-29 / %x2B-10FFFF ; anything but * +``` + + +```abnf +not-line-feed-or-carriage-return = %x0-9 / %xB-C / %xE-10FFFF + ; anything but LF or CR +``` + + +```abnf +not-star-or-slash = %x0-29 / %x2B-2E / %x30-10FFFF ; anything but * or / +``` + +Lines in Leo may be terminated via +a single carriage return, +a line feed, +or a carriage return immediately followed by a line feed. +Note that the latter combination constitutes a single line terminator, +according to the extra-grammatical rule of the longest sequence. + + +```abnf +newline = line-feed / carriage-return / carriage-return line-feed +``` + +Go to: _[line-feed](#user-content-line-feed), [carriage-return](#user-content-carriage-return)_; + + +Line terminators form whitespace, along with spaces and horizontal tabs. + + +```abnf +whitespace = space / horizontal-tab / newline +``` + +Go to: _[horizontal-tab](#user-content-horizontal-tab), [newline](#user-content-newline), [space](#user-content-space)_; + + +There are two kinds of comments in Leo, as in other languages. +One is block comments of the form '/* ... */', +and the other is end-of-line comments '// ...'. +The first kind start at '/*' and end at the first '*/', +possibly spanning multiple (partial) lines; +they do no nest. +The second kind start at '//' and extend till the end of the line. +The rules about comments given below are similar to +the ones used in the Java language specification. + + +```abnf +comment = block-comment / end-of-line-comment +``` + +Go to: _[block-comment](#user-content-block-comment), [end-of-line-comment](#user-content-end-of-line-comment)_; + + + +```abnf +block-comment = "/*" rest-of-block-comment +``` + +Go to: _[rest-of-block-comment](#user-content-rest-of-block-comment)_; + + + +```abnf +rest-of-block-comment = "*" rest-of-block-comment-after-star + / not-star rest-of-block-comment +``` + +Go to: _[not-star](#user-content-not-star), [rest-of-block-comment-after-star](#user-content-rest-of-block-comment-after-star), [rest-of-block-comment](#user-content-rest-of-block-comment)_; + + + +```abnf +rest-of-block-comment-after-star = "/" + / "*" rest-of-block-comment-after-star + / not-star-or-slash rest-of-block-comment +``` + +Go to: _[rest-of-block-comment-after-star](#user-content-rest-of-block-comment-after-star), [rest-of-block-comment](#user-content-rest-of-block-comment), [not-star-or-slash](#user-content-not-star-or-slash)_; + + + +```abnf +end-of-line-comment = "//" *not-line-feed-or-carriage-return newline +``` + +Go to: _[newline](#user-content-newline)_; + + +Below are the keywords in the Leo language. +They cannot be used as identifiers. + + +```abnf +keyword = "address" + / "as" + / "bool" + / "circuit" + / "console" + / "const" + / "else" + / "false" + / "field" + / "for" + / "function" + / "group" + / "i8" + / "i16" + / "i32" + / "i64" + / "i128" + / "if" + / "import" + / "in" + / "input" + / "let" + / "mut" + / "return" + / "Self" + / "self" + / "static" + / "string" + / "true" + / "u8" + / "u16" + / "u32" + / "u64" + / "u128" +``` + +The following rules define (ASCII) digits +and (uppercase and lowercase) letters. + + +```abnf +digit = %x30-39 ; 0-9 +``` + + +```abnf +uppercase-letter = %x41-5A ; A-Z +``` + + +```abnf +lowercase-letter = %x61-7A ; a-z +``` + + +```abnf +letter = uppercase-letter / lowercase-letter +``` + +Go to: _[lowercase-letter](#user-content-lowercase-letter), [uppercase-letter](#user-content-uppercase-letter)_; + + +An identifier is a non-empty sequence of letters, digits, and underscores, +starting with a letter. +It must not be a keyword: this is an extra-grammatical constraint. + + +```abnf +identifier = letter *( letter / digit / "_" ) ; but not a keyword +``` + +Go to: _[letter](#user-content-letter)_; + + +A package name consists of one or more segments separated by single dashes, +where each segment is a non-empty sequence of lowercase letters and digits. + + +```abnf +package-name = 1*( lowercase-letter / digit ) + *( "-" 1*( lowercase-letter / digit ) ) +``` + +An address starts with 'aleo1' +and continues with exactly 58 lowercase letters and digits. +Thus an address always consists of 63 characters. + + +```abnf +address = "aleo1" 58( lowercase-letter / digit ) +``` + +A format string is a sequence of characters, other than double quote, +surrounded by double quotes. +Within a format string, substrings '{}' are distinguished as containers +(these are the ones that may be matched with values +whose textual representation replaces the containers +in the printed string). +There is an implicit extra-grammatical requirements that +the explicit 'formatted-string-container' instances include +all the occurrences of '{}' in the parsed character sequence: +that is, there may not be two contiguous 'not-double-quote' instances +that are '{' and '}'. + + +```abnf +formatted-string-container = "{}" +``` + + +```abnf +formatted-string = double-quote + *( not-double-quote / formatted-string-container ) + double-quote +``` + +Go to: _[double-quote](#user-content-double-quote)_; + + +Here is (part of this ABNF comment), +an alternative way to specify format strings, +which captures the extra-grammatical requirement above in the grammar +but is more complicated: + + + +``` +not-double-quote-or-open-brace = %x0-22 / %x24-7A / %x7C-10FFFF +``` + + + +``` +not-double-quote-or-close-brace = %x0-22 / %x24-7C / %x7E-10FFFF +``` + + + +``` +formatted-string-element = not-double-quote-or-open-brace + / "{" not-double-quote-or-close-brace + / formatted-string-container +``` + + + +``` +formatted-string = double-quote *formatted-string-element double-quote +``` + + + +It is not immediately clear which approach is better; there are tradeoffs. +Regardless, we should choose one eventually. + +Annotations are built out of names and arguments, which are tokens. +Two names are currently supported. +An argument is a sequence of one or more letters, digits, and underscores. + + +```abnf +annotation-name = "@" identifier +``` + +Go to: _[identifier](#user-content-identifier)_; + + +A natural (number) is a sequence of one or more digits. +Note that we allow leading zeros, e.g. '007'. + + +```abnf +natural = 1*digit +``` + +An integer (number) is either a natural or its negation. +Note that we also allow leading zeros in negative numbers, e.g. '-007'. + + +```abnf +integer = [ "-" ] natural +``` + +Go to: _[natural](#user-content-natural)_; + + +An untyped literal is just an integer. + + +```abnf +untyped-literal = integer +``` + +Go to: _[integer](#user-content-integer)_; + + +Unsigned literals are naturals followed by unsigned types. + + +```abnf +unsigned-literal = natural ( "u8" / "u16" / "u32" / "u64" / "u128" ) +``` + +Go to: _[natural](#user-content-natural)_; + + +Signed literals are integers followed by signed types. + + +```abnf +signed-literal = integer ( "i8" / "i16" / "i32" / "i64" / "i128" ) +``` + +Go to: _[integer](#user-content-integer)_; + + +Field literals are integers followed by the type of field elements. + + +```abnf +field-literal = integer "field" +``` + +Go to: _[integer](#user-content-integer)_; + + +There are two kinds of group literals. +One is a single integer followed by the type of group elements, +which denotes the scalar product of the generator point by the integer. +The other is a pair of integer coordinates, +which are reduced modulo the prime to identify a point, +which must be on the elliptic curve. +It is also allowed to omit one (not both) coordinates, +with an indication of how to infer the missing coordinate +(i.e. sign high, sign low, or inferred). + + +```abnf +product-group-literal = integer "group" +``` + +Go to: _[integer](#user-content-integer)_; + + + +```abnf +group-coordinate = integer / "+" / "-" / "_" +``` + +Go to: _[integer](#user-content-integer)_; + + + +```abnf +affine-group-literal = "(" group-coordinate "," group-coordinate ")" "group" +``` + +Go to: _[group-coordinate](#user-content-group-coordinate)_; + + + +```abnf +group-literal = product-group-literal / affine-group-literal +``` + +Go to: _[product-group-literal](#user-content-product-group-literal), [affine-group-literal](#user-content-affine-group-literal)_; + + +Note that the rule for group literals above +allows no whitespace between coordinates. +If we want to allow whitespace, +e.g. '(3, 4)group' as opposed to requiring '(3,4)group', +then we should define affine group literals +in the syntactic grammar instead of in the lexical grammar. +We should have a notion of atomic literal in the lexical grammar, +and (non-atomic) literal in the syntactic grammar. +The lexical grammar should define a token for ')group' +if we want no whitespace between the closing parenthesis and 'group'. +More precisely, the rule for 'literal' below in the lexical grammar +would be replaced with + + + +``` +atomic-literal = ... / product-group-literal +``` + + + +where the '...' stands for all the '...-literal' alternatives +in the current rule for 'literal' below, except 'group-literal'. +Furthermore, the rule for 'symbol' below in the lexical grammar +would be extended to + + + +``` +symbol = ... / ")group" +``` + + + +where '...' stands for the current definiens of the rule. +We would also have to adjust the rule for 'token' below in the lexical grammar +to reference 'atomic-literal' instead of 'literal' in the definiens. +We would then add to the syntactic grammar the following rules + + + +``` +affine-group-literal = "(" group-coordinate "," group-coordinate ")group" +``` + + + +``` +literal = atomic-literal / affine-group-literal +``` + + + +which would now define literals in the syntactic grammar. +Note that now an affine group literal would have the form + + + +``` +( , )group +``` + + + +where is optional whitespace; +however, no whitespace is allowed between the closing ')' and 'group'. + +Boolean literals are the usual two. + + +```abnf +boolean-literal = "true" / "false" +``` + +An address literal is an address wrapped into an indication of address, +to differentiate it from an identifier. + + +```abnf +address-literal = "address" "(" address ")" +``` + +Go to: _[address](#user-content-address)_; + + +The ones above are all the literals, as defined by the following rule. + + +```abnf +literal = untyped-literal + / unsigned-literal + / signed-literal + / field-literal + / group-literal + / boolean-literal + / address-literal +``` + +Go to: _[unsigned-literal](#user-content-unsigned-literal), [untyped-literal](#user-content-untyped-literal), [field-literal](#user-content-field-literal), [group-literal](#user-content-group-literal), [address-literal](#user-content-address-literal), [boolean-literal](#user-content-boolean-literal), [signed-literal](#user-content-signed-literal)_; + + +After defining the (mostly) alphanumeric tokens above, +it remains to define tokens for non-alphanumeric symbols such as "+" and "(". +Different programming languages used different terminologies for these, +e.g. operators, separators, punctuators, etc. +Here we use 'symbol', for all of them, but we can do something different. +We could give names to all of these symbols, +via rules such as + + + +``` +equality-operator = "==" +``` + + + +and defining 'symbol' in terms of those + + + +``` +symbol = ... / equality-operator / ... +``` + + + +This may or may not make the grammar more readable, +but it would help establish a terminology in the grammar, +namely the exact names of some of these token. +On the other hand, at least some of them are perhaps simple enough +that they could be just described in terms of their symbols, +e.g. 'double dot', 'question mark', etc. + + +```abnf +symbol = "!" / "&&" / "||" + / "==" / "!=" + / "<" / "<=" / ">" / ">=" + / "+" / "-" / "*" / "/" / "**" + / "=" / "+=" / "-=" / "*=" / "/=" / "**=" + / "(" / ")" + / "[" / "]" + / "{" / "}" + / "," / "." / ".." / "..." / ";" / ":" / "::" / "?" + / "->" / "_" +``` + +Everything defined above, other than comments and whitespace, +is a token, as defined by the following rule. + + +```abnf +token = keyword + / identifier + / literal + / package-name + / formatted-string + / annotation-name + / symbol +``` + +Go to: _[literal](#user-content-literal), [annotation-name](#user-content-annotation-name), [formatted-string](#user-content-formatted-string), [symbol](#user-content-symbol), [identifier](#user-content-identifier), [keyword](#user-content-keyword), [package-name](#user-content-package-name)_; + + + +-------- + + +Syntactic Grammar +----------------- + +The processing defined by the lexical grammar above +turns the initial sequence of characters +into a sequence of tokens, comments, and whitespaces. +The purpose of comments and whitespaces, from a syntactic point of view, +is just to separate tokens: +they are discarded, leaving a sequence of tokens. +The syntactic grammar describes how to turn +a sequence of tokens into concrete syntax trees. + +There are unsigned and signed integer types, for five sizes. + + +```abnf +unsigned-type = "u8" / "u16" / "u32" / "u64" / "u128" +``` + + +```abnf +signed-type = "i8" / "i16" / "i32" / "i64" / "i128" +``` + + +```abnf +integer-type = unsigned-type / signed-type +``` + +Go to: _[unsigned-type](#user-content-unsigned-type), [signed-type](#user-content-signed-type)_; + + +The integer types, along with the field and group types, +for the arithmetic types, i.e. the ones that support arithmetic operations. + + +```abnf +field-type = "field" +``` + + +```abnf +group-type = "group" +``` + + +```abnf +arithmetic-type = integer-type / field-type / group-type +``` + +Go to: _[field-type](#user-content-field-type), [group-type](#user-content-group-type), [integer-type](#user-content-integer-type)_; + + +The arithmetic types, along with the boolean and address types, +form the scalar types, i.e. the ones whose values do not contain (sub)values. + + +```abnf +boolean-type = "bool" +``` + + +```abnf +address-type = "address" +``` + + +```abnf +scalar-type = boolean-type / arithmetic-type / address-type +``` + +Go to: _[arithmetic-type](#user-content-arithmetic-type), [address-type](#user-content-address-type), [boolean-type](#user-content-boolean-type)_; + + +Circuit types are denoted by identifiers and the keyword 'Self'. +The latter is only allowed inside a circuit definition, +to denote the defined circuit. + + +```abnf +self-type = "Self" +``` + + +```abnf +circuit-type = identifier / self-type +``` + +Go to: _[self-type](#user-content-self-type), [identifier](#user-content-identifier)_; + + +A tuple type consists of zero, two, or more component types. + + +```abnf +tuple-type = "(" [ type 1*( "," type ) ] ")" +``` + +Go to: _[type](#user-content-type)_; + + +An array type consists of an element type +and an indication of dimensions. +There is either a single dimension (a number), +or a tuple of one or more dimensions +(we could restrict the latter to two or more dimensions). + + +```abnf +array-type = "[" type ";" array-dimensions "]" +``` + +Go to: _[type](#user-content-type), [array-dimensions](#user-content-array-dimensions)_; + + + +```abnf +array-dimensions = natural + / "(" natural *( "," natural ) ")" +``` + +Go to: _[natural](#user-content-natural)_; + + +Circuit, tuple, and array types form the aggregate types, +i.e. types whose values contain (sub)values +(with the corner-case exception of the empty tuple value). + + +```abnf +aggregate-type = tuple-type / array-type / circuit-type +``` + +Go to: _[array-type](#user-content-array-type), [tuple-type](#user-content-tuple-type), [circuit-type](#user-content-circuit-type)_; + + +Scalar and aggregate types form all the types. + + +```abnf +type = scalar-type / aggregate-type +``` + +Go to: _[scalar-type](#user-content-scalar-type), [aggregate-type](#user-content-aggregate-type)_; + + +As often done in grammatical language syntax specifications, +we define rules for different kinds of expressions, +which also defines the relative precedence +of operators and other expression constructs, +and the (left or right) associativity of binary operators. + +The primary expressions are self-contained in a way, +i.e. they have clear deliminations. +Some consist of single tokens: +identifiers, the keywords 'self' and 'input', and literals. +Primary expressions also include parenthesized expressions, +i.e. any expression may be turned into a primary one +by putting parentheses around it. +The other kinds are defined and explained below. + + +```abnf +primary-expression = identifier + / "self" + / "input" + / literal + / "(" expression ")" + / tuple-expression + / array-expression + / circuit-expression + / function-call +``` + +Go to: _[function-call](#user-content-function-call), [expression](#user-content-expression), [circuit-expression](#user-content-circuit-expression), [identifier](#user-content-identifier), [literal](#user-content-literal), [array-expression](#user-content-array-expression), [tuple-expression](#user-content-tuple-expression)_; + + +There are tuple expressions to construct and deconstruct tuples. +A construction consists of zero, two, or more component expressions. +A deconstruction uses a component index (zero-indexed). +Note that constructions are delimited by closing parentheses +and deconstructions are delimited by natural tokens. +The rule below, and similar rules for other aggregate types, +use the perhaps more familiar 'access', +but note that 'deconstruction' has a nice symmetry to 'construction'; +the term 'destructor' has a different meaning in other languages, +so we may want to avoid it, but not necessarily. + + +```abnf +tuple-construction = "(" [ expression 1*( "," expression ) ] ")" +``` + +Go to: _[expression](#user-content-expression)_; + + + +```abnf +tuple-expression = tuple-construction +``` + +Go to: _[tuple-construction](#user-content-tuple-construction)_; + + +The are array expressions to construct and deconstruct arrays. +There are two kinds of constructions: +one lists the element expressions (at least one), +including spreads (via '...') which are arrays being spliced in; +the other repeats (the value of) a single expression +across one or more dimensions. +There are two kinds of deconstructions: +one selects a single element by index (zero-indexed); +the other selects a range via two indices, +the first inclusive and the second exclusive -- +both are optional, +the first defaulting to 0 and the second to the array length. +Note that these expressions are all delimited +by closing square brackets. + + +```abnf +array-inline-construction = "[" + array-inline-element + *( "," array-inline-element ) + "]" +``` + +Go to: _[array-inline-element](#user-content-array-inline-element)_; + + + +```abnf +array-inline-element = expression / "..." expression +``` + +Go to: _[expression](#user-content-expression)_; + + + +```abnf +array-repeat-construction = "[" expression ";" array-dimensions "]" +``` + +Go to: _[expression](#user-content-expression), [array-dimensions](#user-content-array-dimensions)_; + + + +```abnf +array-construction = array-inline-construction / array-repeat-construction +``` + +Go to: _[array-inline-construction](#user-content-array-inline-construction), [array-repeat-construction](#user-content-array-repeat-construction)_; + + + +```abnf +array-expression = array-construction +``` + +Go to: _[array-construction](#user-content-array-construction)_; + + +There are circuit expressions to construct and deconstruct circuit values. +A construction lists values for all the member variables (in any order); +there must be at least one member variable currently. +A single identifier abbreviates +a pair consisting of the same identifier separated by dot; +note that, in the expansion, the left one denotes a member name, +while the right one denotes an expression (a variable), +so they are syntactically identical but semantically different. +A deconstruction selects a member variable by name. +Note that these expressions are delimited, +by closing curly braces or identifiers. + + +```abnf +circuit-construction = circuit-type "{" + circuit-inline-element *( "," circuit-inline-element ) + "}" +``` + +Go to: _[circuit-inline-element](#user-content-circuit-inline-element), [circuit-type](#user-content-circuit-type)_; + + + +```abnf +circuit-inline-element = identifier ":" expression / identifier +``` + +Go to: _[identifier](#user-content-identifier), [expression](#user-content-expression)_; + + + +```abnf +circuit-expression = circuit-construction +``` + +Go to: _[circuit-construction](#user-content-circuit-construction)_; + + +There are three kinds of function calls: +top-level function calls, +instance (i.e. non-static) member function calls, and +static member function calls. +What changes is the start, but they all end in an argument list, +delimited by a closing parenthesis. + + +```abnf +function-arguments = "(" [ expression *( "," expression ) ] ")" +``` + +Go to: _[expression](#user-content-expression)_; + + +Postfix expressions have highest precedence. +They apply to primary expressions. +Contains access expressions for arrays, tuples, and circuits. +Contains function call types. +postfix-expression = primary-expression + / postfix-expression "." natural + / postfix-expression "." identifier + / identifier function-arguments + / postfix-expression "." identifier function-arguments + / circuit-type "::" identifier function-arguments + / postfix-expression "[" expression "]" + / postfix-expression "[" [expression] ".." [expression] "]" + +Unary operators have the highest operator precedence. +They apply to postfix expressions +and recursively to unary expressions. + + +```abnf +unary-expression = postfix-expression + / "!" unary-expression + / "-" unary-expression +``` + +Go to: _[postfix-expression](#user-content-postfix-expression), [unary-expression](#user-content-unary-expression)_; + + +Next in the operator precedence is casting. +The current rule below makes exponentiation left-associative, +cast-expression = unary-expression + / cast-expression "as" type + +Next in the operator precedence is exponentiation, +following mathematical practice. +The current rule below makes exponentiation left-associative, +i.e. 'a ** b ** c' must be parsed as '(a ** b) ** c'. +This is easy to change if we want it to be right-associative instead. + + +```abnf +exponential-expression = cast-expression + / exponential-expression "**" cast-expression +``` + +Go to: _[cast-expression](#user-content-cast-expression), [exponential-expression](#user-content-exponential-expression)_; + + +Next in precedence come multiplication and division, both left-associative. + + +```abnf +multiplicative-expression = exponential-expression + / multiplicative-expression "*" exponential-expression + / multiplicative-expression "/" exponential-expression +``` + +Go to: _[exponential-expression](#user-content-exponential-expression), [multiplicative-expression](#user-content-multiplicative-expression)_; + + +Then there are addition and subtraction, both left-assocative. + + +```abnf +additive-expression = multiplicative-expression + / additive-expression "+" multiplicative-expression + / additive-expression "-" multiplicative-expression +``` + +Go to: _[additive-expression](#user-content-additive-expression), [multiplicative-expression](#user-content-multiplicative-expression)_; + + +Next in the precedence order are ordering relations. +These are not associative, because they return boolean values. + + +```abnf +ordering-expression = additive-expression + / additive-expression "<" additive-expression + / additive-expression ">" additive-expression + / additive-expression "<=" additive-expression + / additive-expression ">=" additive-expression +``` + +Go to: _[additive-expression](#user-content-additive-expression)_; + + +Equalities return booleans but may also operate on boolean, +so we make them left-associative. + + +```abnf +equality-expression = ordering-expression + / equality-expression "==" ordering-expression + / equality-expression "!=" ordering-expression +``` + +Go to: _[equality-expression](#user-content-equality-expression), [ordering-expression](#user-content-ordering-expression)_; + + +Next come conjunctive expressions, left-associative. + + +```abnf +conjunctive-expression = equality-expression + / conjunctive-expression "&&" equality-expression +``` + +Go to: _[equality-expression](#user-content-equality-expression), [conjunctive-expression](#user-content-conjunctive-expression)_; + + +Next come disjunctive expressions, left-associative. + + +```abnf +disjunctive-expression = conjunctive-expression + / disjunctive-expression "||" conjunctive-expression +``` + +Go to: _[disjunctive-expression](#user-content-disjunctive-expression), [conjunctive-expression](#user-content-conjunctive-expression)_; + + +Finally we have conditional expressions. + + +```abnf +conditional-expression = disjunctive-expression + / conditional-expression + "?" expression + ":" conditional-expression +``` + +Go to: _[conditional-expression](#user-content-conditional-expression), [expression](#user-content-expression), [disjunctive-expression](#user-content-disjunctive-expression)_; + + +These are all the expressions. +Recall that conditional expressions +may be disjunctive expressions, +which may be conjunctive expressions, +and so on all the way to primary expressions. + + +```abnf +expression = conditional-expression +``` + +Go to: _[conditional-expression](#user-content-conditional-expression)_; + + +There are various kinds of statements, +including blocks, which are +possibly empty sequences of statements surounded by curly braces. + + +```abnf +statement = expression-statement + / return-statement + / variable-definition-statement + / conditional-statement + / loop-statement + / assignment-statement + / console-statement + / block +``` + +Go to: _[variable-definition-statement](#user-content-variable-definition-statement), [loop-statement](#user-content-loop-statement), [block](#user-content-block), [assignment-statement](#user-content-assignment-statement), [console-statement](#user-content-console-statement), [conditional-statement](#user-content-conditional-statement), [return-statement](#user-content-return-statement), [expression-statement](#user-content-expression-statement)_; + + + +```abnf +block = "{" *statement "}" +``` + +An expression (that returns the empty tuple) +can be turned into a statement by appending a semicolon. + + +```abnf +expression-statement = expression ";" +``` + +Go to: _[expression](#user-content-expression)_; + + +A return statement always takes an expression, +and does not end with a semicolon (but we may want to change that). + + +```abnf +return-statement = "return" expression +``` + +Go to: _[expression](#user-content-expression)_; + + +There are two kinds of variable definition statements, +which only differ in the starting keyword. +The variables are either a single one or a tuple of two or more; +in all cases, there is just one optional type +and just one initializing expression. + + +```abnf +variable-definition-statement = ( "let" / "const" ) + identifier-or-identifiers + [ ":" type ] "=" expression ";" +``` + +Go to: _[identifier-or-identifiers](#user-content-identifier-or-identifiers), [type](#user-content-type), [expression](#user-content-expression)_; + + + +```abnf +identifier-or-identifiers = identifier + / "(" identifier 1*( "," identifier ) ")" +``` + +Go to: _[identifier](#user-content-identifier)_; + + +A conditional statement always starts with a condition and a block +(which together form a branch). +It may stop there, or it may continue with an alternative block, +or possibly with another conditional statement, forming a chain. +Note that we require blocks in all branches, not merely statements. + + +```abnf +branch = "if" expression block +``` + +Go to: _[block](#user-content-block), [expression](#user-content-expression)_; + + + +```abnf +conditional-statement = branch + / branch "else" block + / branch "else" conditional-statement +``` + +Go to: _[block](#user-content-block), [branch](#user-content-branch), [conditional-statement](#user-content-conditional-statement)_; + + +A loop statement implicitly defines a loop variable +that goes from a starting value (inclusive) to an ending value (exclusive). +The body is a block. + + +```abnf +loop-statement = "for" identifier "in" expression ".." expression block +``` + +Go to: _[expression](#user-content-expression), [block](#user-content-block), [identifier](#user-content-identifier)_; + + +An assignment statement is straightforward. +Based on the operator, the assignment may be simple (i.e. '=') +or compound (i.e. combining assignment with an arithmetic operation). + + +```abnf +assignment-operator = "=" / "+=" / "-=" / "*=" / "/=" / "**=" +``` + + +```abnf +assignment-statement = expression assignment-operator expression ";" +``` + +Go to: _[assignment-operator](#user-content-assignment-operator), [expression](#user-content-expression)_; + + +Console statements start with the 'console' keyword, +followed by a console function call. +The call may be an assertion or a print command. +The former takes an expression (which must be boolean) as argument. +The latter takes either no argument, +or a format string followed by expressions, +whose number must match the number of containers '{}' in the format string. +Note that the console function names are identifiers, not keywords. +There are three kind of printing. + + +```abnf +console-statement = "console" "." console-call +``` + +Go to: _[console-call](#user-content-console-call)_; + + + +```abnf +console-call = assert-call + / print-call +``` + +Go to: _[assert-call](#user-content-assert-call), [print-call](#user-content-print-call)_; + + + +```abnf +assert-call = "assert" "(" expression ")" +``` + +Go to: _[expression](#user-content-expression)_; + + + +```abnf +print-function = "debug" / "error" / "log" +``` + + +```abnf +print-arguments = "(" [ formatted-string *( "," expression ) ] ")" +``` + +Go to: _[formatted-string](#user-content-formatted-string)_; + + + +```abnf +print-call = print-function print-arguments +``` + +Go to: _[print-function](#user-content-print-function), [print-arguments](#user-content-print-arguments)_; + + +An annotation consists of an annotation name (which starts with '@') +with optional annotation arguments. +Note that no parentheses are used if there are no arguments. + + +```abnf +annotation = annotation-name + [ "(" identifier *( "," identifier ) ")" ] +``` + +Go to: _[identifier](#user-content-identifier), [annotation-name](#user-content-annotation-name)_; + + +A function declaration defines a function. +This could be called 'function-definition' instead, +but see the comments about the 'declaration' rule below. +The output type is optional (it defaults to the empty tuple type). +In general, a function input consists of an identifier and a type, +with an optional 'const' modifier. +However, functions inside circuits +may start with a 'mut self' or 'self' parameter, +which may be the only parameter in fact. +Furthermore, any function may end with an 'input' parameter, +which may be the only parameter in fact. + + +```abnf +function-declaration = *annotation "function" identifier + "(" [ function-parameters ] ")" [ "->" type ] + block +``` + +Go to: _[function-parameters](#user-content-function-parameters), [type](#user-content-type), [block](#user-content-block), [identifier](#user-content-identifier)_; + + + +```abnf +function-parameters = self-parameter [ "," input-parameter ] + / self-parameter "," function-inputs [ "," input-parameter ] + / function-inputs [ "," input-parameter ] + / input-parameter +``` + +Go to: _[input-parameter](#user-content-input-parameter), [function-inputs](#user-content-function-inputs), [self-parameter](#user-content-self-parameter)_; + + + +```abnf +self-parameter = ["mut"] "self" +``` + + +```abnf +function-inputs = function-input *( "," function-input ) +``` + +Go to: _[function-input](#user-content-function-input)_; + + + +```abnf +function-input = [ "const" ] identifier ":" type +``` + +Go to: _[identifier](#user-content-identifier), [type](#user-content-type)_; + + + +```abnf +input-parameter = "input" +``` + +A circuit member variable declaration consists of an identifier and a type. +A circuit member function declaration consists of a function declaration. +We could call these 'member-definition' etc., +but see the comments about the 'declaration' rule below. + + +```abnf +member-declaration = member-variable-declaration + / member-function-declaration +``` + +Go to: _[member-variable-declaration](#user-content-member-variable-declaration), [member-function-declaration](#user-content-member-function-declaration)_; + + + +```abnf +member-variable-declaration = identifier ":" type +``` + +Go to: _[identifier](#user-content-identifier), [type](#user-content-type)_; + + + +```abnf +member-function-declaration = function-declaration +``` + +Go to: _[function-declaration](#user-content-function-declaration)_; + + +A circuit declaration defines a circuit type. It is straightforward. +This could be called 'circuit-definition' instead, +but see the comments about the 'declaration' rule below. + + +```abnf +circuit-declaration = *annotation "circuit" identifier + "{" member-declaration *( "," member-declaration ) "}" +``` + +Go to: _[member-declaration](#user-content-member-declaration), [identifier](#user-content-identifier)_; + + +An import declaration consists of the 'import' keyword +followed by a package path, which may be one of the following: +a single wildcard; +an identifier, optionally followed by a local renamer; +a package name followed by a path, recursively; +or a parenthesized list of package paths, +which are "fan out" of the initial path. +Note that we allow the last element of the parenthesized list +to be followed by a comma, presumably for convenience. + + +```abnf +import-declaration = "import" package-path +``` + +Go to: _[package-path](#user-content-package-path)_; + + + +```abnf +package-path = "*" + / identifier [ "as" identifier ] + / package-name "." package-path + / "(" package-path *( "," package-path ) [","] ")" +``` + +Go to: _[identifier](#user-content-identifier), [package-name](#user-content-package-name), [package-path](#user-content-package-path)_; + + +Finally, we define a file as a sequence of zero or more declarations. +This is why we used 'function-declaration' and 'circuit-declaration' +instead of 'function-definition' and 'ciruit-definition': +this way, they are all declarations of some sort. +An import declaration cannot really called an import definition, +because it does not define anything. +But we may revisit this, and use 'definition' instead of 'declaration'. + + +```abnf +declaration = import-declaration + / function-declaration + / circuit-declaration +``` + +Go to: _[import-declaration](#user-content-import-declaration), [function-declaration](#user-content-function-declaration), [circuit-declaration](#user-content-circuit-declaration)_; + + + +```abnf +file = *declaration + diff --git a/grammar/src/main.rs b/grammar/src/main.rs index 6c70568ded..22465b66d2 100644 --- a/grammar/src/main.rs +++ b/grammar/src/main.rs @@ -76,7 +76,7 @@ impl<'a> Processor<'a> { /// Main function for this struct. /// Goes through each line and transforms it into proper markdown. - fn process(&mut self) -> Result<()> { + fn process(&mut self) { let lines = self.grammar.lines(); let mut prev = ""; @@ -85,27 +85,27 @@ impl<'a> Processor<'a> { // code block in comment (not highlighted as abnf) if let Some(code) = line.strip_prefix("; ") { - self.enter_scope(Scope::Code)?; + self.enter_scope(Scope::Code); self.append_str(code); // just comment. end of code block } else if let Some(code) = line.strip_prefix("; ") { - self.enter_scope(Scope::Free)?; + self.enter_scope(Scope::Free); self.append_str(code); // horizontal rule - section separator } else if line.starts_with(";;;;;;;;;;") { - self.enter_scope(Scope::Free)?; + self.enter_scope(Scope::Free); self.append_str("\n--------\n"); // empty line in comment. end of code block } else if line.starts_with(';') { - self.enter_scope(Scope::Free)?; + self.enter_scope(Scope::Free); self.append_str("\n\n"); // just empty line. end of doc, start of definition } else if line.is_empty() { - self.enter_scope(Scope::Free)?; + self.enter_scope(Scope::Free); self.append_str(""); // definition (may be multiline) @@ -118,7 +118,7 @@ impl<'a> Processor<'a> { // try to find rule matching definition or fail let rule = self.rules.get(&def.to_string()).cloned().unwrap(); - self.enter_scope(Scope::Definition(rule))?; + self.enter_scope(Scope::Definition(rule)); } self.append_str(line); @@ -126,8 +126,6 @@ impl<'a> Processor<'a> { prev = line; } - - Ok(()) } /// Append new line into output, add newline character. @@ -138,7 +136,7 @@ impl<'a> Processor<'a> { /// Enter new scope (definition or code block). Allows customizing /// pre and post lines for each scope entered or exited. - fn enter_scope(&mut self, new_scope: Scope) -> Result<()> { + fn enter_scope(&mut self, new_scope: Scope) { match (&self.scope, &new_scope) { // exchange scopes between Free and Code (Scope::Free, Scope::Code) => self.append_str("```"), @@ -174,8 +172,6 @@ impl<'a> Processor<'a> { }; self.scope = new_scope; - - Ok(()) } } @@ -215,7 +211,7 @@ fn main() -> Result<()> { // Init parser and run it. That's it. let mut parser = Processor::new(grammar, parsed); - parser.process()?; + parser.process(); // Print result of conversion to STDOUT. println!("{}", parser.out);