diff --git a/grammar/abnf-grammar.txt b/grammar/abnf-grammar.txt new file mode 100644 index 0000000000..ff93d19c09 --- /dev/null +++ b/grammar/abnf-grammar.txt @@ -0,0 +1,1089 @@ +; Leo Library +; +; Copyright (C) 2021 Aleo Systems Inc. +; +; Author: Alessandro Coglio (acoglio on GitHub) + +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +; Format Note +; ----------- + +; The ABNF standard requires grammars to consist of lines terminated by CR LF +; (i.e. carriage return followed by line feed, DOS/Windows-style), +; as explained in the background on ABNF later in this file. +; This file's lines are therefore terminated by CR LF. +; To avoid losing this requirement across systems, +; this file is marked as 'text eol=crlf' in .gitattributes: +; this means that the file is textual, enabling visual diffs, +; but its lines will always be terminated by CR LF on any system. + +; Note that this CR LF requirement only applies to the grammar files themselves. +; It does not apply to the lines of the languages described by the grammar. +; ABNF grammar may describe any kinds of languages, +; with any kind of line terminators, +; or even without line terminators at all (e.g. for "binary" languages). + +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +; Introduction +; ------------ + +; This file contains an initial draft of +; a complete ABNF (Augmented Backus-Naur Form) grammar of Leo. +; Background on ABNF is provided later in this file. + +; The initial motivation for creating an ABNF grammar of Leo was that +; we have a formally verified parser of ABNF grammars +; (at https://github.com/acl2/acl2/tree/master/books/kestrel/abnf; +; also see the paper at https://www.kestrel.edu/people/coglio/vstte18.pdf) +; which we have used to parse this ABNF grammar of Leo +; into a formal representation, in the ACL2 theorem prover, +; of the Leo concrete syntax. +; The parsing of this grammar file into an ACL2 representation +; happens every time the ACL2 formalization of Leo is built. + +; In addition to that initial motivation, +; this ABNF grammar has now the additional and primary purpose of +; providing an official definition of the syntax of Leo +; that is both human-readable and machine-readable. +; This grammar will be part of the (upcoming) Leo language reference, +; of the Leo Language Formal Specification +; (i.e. the LaTeX document in the leo-semantics repo), +; and of the ACL2 formalization of Leo (which was the initial motivation). +; It has also been suggested that it may be used to generate tests. + +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +; Background on ABNF +; ------------------ + +; ABNF is an Internet standard: +; see https://www.rfc-editor.org/info/rfc5234 +; and https://www.rfc-editor.org/info/rfc7405. +; It is used to specify the syntax of JSON, HTTP, and other standards. + +; The following is copied (and "un-LaTeX'd") from the aforementioned paper +; (at https://www.kestrel.edu/people/coglio/vstte18.pdf). + +; ABNF adds conveniences and makes slight modifications +; to Backus-Naur Form (BNF), +; without going beyond context-free grammars. + +; Instead of BNF's angle-bracket notation for nonterminals, +; ABNF uses case-insensitive names consisting of letters, digits, and dashes, +; e.g. HTTP-message and IPv6address. +; ABNF includes an angle-bracket notation for prose descriptions, +; e.g. , +; usable as last resort in the definiens of a nonterminal. + +; While BNF allows arbitrary terminals, +; ABNF uses only natural numbers as terminals, +; and denotes them via: +; (i) binary, decimal, or hexadecimal sequences, +; e.g. %b1.11.1010, %d1.3.10, and %x.1.3.A +; all denote the string '1 3 10'; +; (ii) binary, decimal, or hexadecimal ranges, +; e.g. %x30-39 denotes any string 'n' with 48 <= n <= 57 +; (an ASCII digit); +; (iii) case-sensitive ASCII strings, +; e.g. %s"Ab" denotes the string '65 98'; +; and (iv) case-insensitive ASCII strings, +; e.g. %i"ab", or just "ab", denotes +; any string among +; '65 66', +; '65 98', +; '97 66', and +; '97 98'. +; ABNF terminals in suitable sets represent ASCII or Unicode characters. + +; ABNF allows repetition prefixes n*m, +; where n and m are natural numbers in decimal notation; +; if absent, +; n defaults to 0, and +; m defaults to infinity. +; For example, +; 1*4HEXDIG denotes one to four HEXDIGs, +; *3DIGIT denotes up to three DIGITs, and +; 1*OCTET denotes one or more OCTETs. +; A single n prefix +; abbreviates n*n, +; e.g. 3DIGIT denotes three DIGITs. + +; Instead of BNF's |, ABNF uses / to separate alternatives. +; Repetition prefixes have precedence over juxtapositions, +; which have precedence over /. +; Round brackets group things and override the aforementioned precedence rules, +; e.g. *(WSP / CRLF WSP) denotes strings +; obtained by repeating, zero or more times, +; either (i) a WSP or (ii) a CRLF followed by a WSP. +; Square brackets also group things but make them optional, +; e.g. [":" port] is equivalent to 0*1(":" port). + +; Instead of BNF's ::=, ABNF uses = to define nonterminals, +; and =/ to incrementally add alternatives +; to previously defined nonterminals. +; For example, the rule BIT = "0" / "1" +; is equivalent to BIT = "0" followed by BIT =/ "1". + +; The syntax of ABNF itself is formally specified in ABNF +; (in Section 4 of the aforementioned RFC 5234, +; after the syntax and semantics of ABNF +; are informally specified in natural language +; (in Sections 1, 2, and 3 of the aforementioned RFC 5234). +; The syntax rules of ABNF prescribe the ASCII codes allowed for +; white space (spaces and horizontal tabs), +; line endings (carriage returns followed by line feeds), +; and comments (semicolons to line endings). + +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +; Structure +; --------- + +; Language specifications often define the syntax of their languages via +; (i) a lexical grammar that describes how +; sequence of characters are parsed into tokens, and +; (ii) a syntactic grammar that described how +; tokens are parsed into expressions, statements, etc. +; (The adjectives 'lexical' and 'syntactic' are +; the ones used in the Java language specification; +; other terms may be used by other languages, +; but the essence is similar.) +; The situation is sometimes more complex, +; with multiple passes (e.g. Unicode escape processing in Java), +; but the division between lexical and syntactic (in the sense above) stands. + +; In the aforementioned language specifications, +; both the lexical and syntactic grammars +; are normally written in a context-free grammar notation, +; augmented with natural language that may assert, for instance, +; that tokenization always takes the longest sequence that constitutes a token. + +; This dual structure appears to be motivated by the fact that +; concerns of white space, line endings, etc. +; can be handled by the lexical grammar, +; allowing the syntactic grammar to focus on the more important structure. +; Handling both aspects in a single context-free grammar may be unwieldy, +; so having two grammars provides more clarity and readability. + +; In contrast, PEG (Parsing Expression Grammar) formalisms like Pest +; naturally embody a procedural interpretation +; that can handle white space and tokenization in just one manageable grammar. +; However, this procedural interpretaion may be sometimes +; less clear and readable to humans than context-free rules. +; Indeed, context-free grammar are commonly used to documentat languages. + +; ABNF is a context-free grammar notation, +; with no procedural interpretation, +; and therefore it makes sense to define +; separate lexical and syntactic ABNF grammars for Leo. +; Conceptually, the two grammars define two subsequent processing phases, +; as detailed below. +; However, a parser implementation does not need to perform +; two strictly separate phases (in fact, it typically does not), +; so long as it produces the same final result. + +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +; Operator Precedence +; ------------------- + +; We formulate the grammar rules for expressions +; in a way that describes the relative precedence of operators, +; as often done in language syntax specifications. + +; For instance, consider the rules +; +; multiplicative-expression = +; exponential-expression +; / multiplicative-expression "*" exponential-expression +; / multiplicative-expression "/" exponential-expression +; +; additive-expression = +; multiplicative-expression +; / additive-expression "+" multiplicative-expression +; / additive-expression "-" multiplicative-expression +; +; this rule tells us that the additive operators '+' and '-' have +; lower precedence than the multiplicative operators '*' and '/', +; and that both the additive and multiplicative operators associate to the left. +; This may be best understood via the examples given below. + +; According to the rules, the expression +; +; x + y * z +; +; can only be parsed as +; +; + +; / \ +; x * +; / \ +; y z +; +; and not as +; +; * +; / \ +; + z +; / \ +; x y +; +; because a multiplicative expression cannot have an additive expression +; as first sub-expression, as it would in the second tree above. + +; Also according to the rules, the expression +; +; x + y + z +; +; can only be parsed as +; +; + +; / \ +; + z +; / \ +; x y +; +; and not as +; +; + +; / \ +; x + +; / \ +; y z +; +; because an additive expression cannot have an additive expression +; as second sub-expression, as it would in the second tree above. + +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +; Naming Convention +; ----------------- + +; For this ABNF grammar, we choose nonterminal names +; that consist of complete English words, separated by dashes, +; and that describe the construct the way it is in English. +; For instance, we use the name 'conditional-statement' +; to describe conditional statements. + +; At the same time, we attempt to establish +; a precise and "official" nomenclature for the Leo constructs, +; by way of the nonterminal names that define their syntax. +; For instance, the rule +; +; group-literal = product-group-literal +; / affine-group-literal +; +; tells us that there are two kinds of group literals, +; namely product group literals and affine group literals. +; This is more precise than describing them as +; integers (which are not really group elements per se), +; or points (they are all points, just differently specified), +; or being singletons vs. pairs (which is a bit generic). + +; The only exception to the nomenclature-establishing role of the grammar +; is the fact that, as discussed above, +; we write the grammar rules in a way that determines +; the relative precedence and the associativity of expression operators, +; and that therefore we have rules like +; +; unary-expression = primary-expression +; / "!" unary-expression +; / "-" unary-expression +; +; In order to allow the recursion of the rule to stop, +; we need to regard, in the grammar, a primary expression as a unary expression. +; However, this is just a grammatical artifact: +; ontologically, a primary expression is not really a unary expression, +; because a unary expression is one that consists of +; a unary operator and an operand sub-expression. +; These terminological "exceptions" should be easy to identify in the rules. + +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +; Lexical Grammar +; --------------- + +; A Leo file is a finite sequence of Unicode characters, +; represented as Unicode code points, +; which are numbers in the range form 0 to 10FFFFh. +; These are captured by the ABNF rule 'character' below. + +; The lexical grammar defines how, conceptually, +; the sequence of characters is turned into +; a sequence of tokens, comments, and whitespaces. + +; As stated, the lexical grammar is ambiguous. +; For example, the sequence of characters '**' (i.e. two stars) +; could be equally parsed as two '*' symbol tokens or one '**' symbol token +; (see rule for 'symbol' below). +; As another example, the sequence or characters '' +; (i.e. carriage return followed by line feed) +; could be equally parsed as two line terminators or one +; (see rule for 'newline'). + +; Thus, as often done in language syntax definitions, +; the lexical grammar is disambiguated by +; the extra-grammatical requirement that +; the longest possible sequence of characters is always parsed. +; This way, '**' must be parsed as one '**' symbol token, +; and '' must be parsed as one line terminator. +; (We should formalize this requirement, +; along with the other extra-grammatical requirements given below, +; and formally prove that they indeed make +; the lexical grammar of Leo unambiguous.) + +; As mentioned above, a character is any Unicode code point. +; This grammar does not say how those are encoded in files (e.g. UTF-8): +; it starts with a decoded sequence of Unicode code points. +; Note that we allow any value, +; even though some values may not be used according to the Unicode standard. + +character = %x0-10FFFF ; any Unicode code point + +; We give names to certain ASCII characters. + +horizontal-tab = %x9 + +line-feed = %xA + +carriage-return = %xD + +space = %x20 + +double-quote = %x22 + +; We give names to complements of certain ASCII characters. +; These consist of all the Unicode characters except for one or two. + +not-double-quote = %x0-22 / %x24-10FFFF ; anything but " + +not-star = %x0-29 / %x2B-10FFFF ; anything but * + +not-line-feed-or-carriage-return = %x0-9 / %xB-C / %xE-10FFFF + ; anything but LF or CR + +not-star-or-slash = %x0-29 / %x2B-2E / %x30-10FFFF ; anything but * or / + +; Lines in Leo may be terminated via +; a single carriage return, +; a line feed, +; or a carriage return immediately followed by a line feed. +; Note that the latter combination constitutes a single line terminator, +; according to the extra-grammatical rule of the longest sequence. + +newline = line-feed / carriage-return / carriage-return line-feed + +; Line terminators form whitespace, along with spaces and horizontal tabs. + +whitespace = space / horizontal-tab / newline + +; There are two kinds of comments in Leo, as in other languages. +; One is block comments of the form '/* ... */', +; and the other is end-of-line comments '// ...'. +; The first kind start at '/*' and end at the first '*/', +; possibly spanning multiple (partial) lines; +; they do no nest. +; The second kind start at '//' and extend till the end of the line. +; The rules about comments given below are similar to +; the ones used in the Java language specification. + +comment = block-comment / end-of-line-comment + +block-comment = "/*" rest-of-block-comment + +rest-of-block-comment = "*" rest-of-block-comment-after-star + / not-star rest-of-block-comment + +rest-of-block-comment-after-star = "/" + / "*" rest-of-block-comment-after-star + / not-star-or-slash rest-of-block-comment + +end-of-line-comment = "//" *not-line-feed-or-carriage-return newline + +; Below are the keywords in the Leo language. +; They cannot be used as identifiers. + +keyword = %s"address" + / %s"as" + / %s"bool" + / %s"circuit" + / %s"console" + / %s"const" + / %s"else" + / %s"false" + / %s"field" + / %s"for" + / %s"function" + / %s"group" + / %s"i8" + / %s"i16" + / %s"i32" + / %s"i64" + / %s"i128" + / %s"if" + / %s"import" + / %s"in" + / %s"input" + / %s"let" + / %s"mut" + / %s"return" + / %s"Self" + / %s"self" + / %s"static" + / %s"string" + / %s"true" + / %s"u8" + / %s"u16" + / %s"u32" + / %s"u64" + / %s"u128" + +; The following rules define (ASCII) digits +; and (uppercase and lowercase) letters. + +digit = %x30-39 ; 0-9 + +uppercase-letter = %x41-5A ; A-Z + +lowercase-letter = %x61-7A ; a-z + +letter = uppercase-letter / lowercase-letter + +; An identifier is a non-empty sequence of letters, digits, and underscores, +; starting with a letter. +; It must not be a keyword: this is an extra-grammatical constraint. + +identifier = letter *( letter / digit / "_" ) ; but not a keyword + +; A package name consists of one or more segments separated by single dashes, +; where each segment is a non-empty sequence of lowercase letters and digits. + +package-name = 1*( lowercase-letter / digit ) + *( "-" 1*( lowercase-letter / digit ) ) + +; An address starts with 'aleo1' +; and continues with exactly 58 lowercase letters and digits. +; Thus an address always consists of 63 characters. + +address = %s"aleo1" 58( lowercase-letter / digit ) + +; A format string is a sequence of characters, other than double quote, +; surrounded by double quotes. +; Within a format string, substrings '{}' are distinguished as containers +; (these are the ones that may be matched with values +; whose textual representation replaces the containers +; in the printed string). +; There is an implicit extra-grammatical requirements that +; the explicit 'format-container' instances include +; all the occurrences of '{}' in the parsed character sequence: +; that is, there may not be two contiguous 'not-double-quote' instances +; that are '{' and '}'. + +format-string-container = "{}" + +format-string = double-quote + *( not-double-quote / format-string-container ) + double-quote + +; Here is (part of this ABNF comment), +; an alternative way to specify format strings, +; which captures the extra-grammatical requirement above in the grammar +; but is more complicated: +; +; not-double-quote-or-open-brace = %x0-22 / %x24-7A / %x7C-10FFFF +; +; not-double-quote-or-close-brace = %x0-22 / %x24-7C / %x7E-10FFFF +; +; format-string-element = not-double-quote-or-open-brace +; / "{" not-double-quote-or-close-brace +; / format-container +; +; format-string = double-quote *format-string-element double-quote +; +; It is not immediately clear which approach is better; there are tradeoffs. +; Regardless, we should choose one eventually. + +; Annotations are built out of names and arguments, which are tokens. +; Two names are currently supported. +; An argument is a sequence of one or more letters, digits, and underscores. + +annotation-name = %s"@context" / %s"@test" + +annotation-argument = 1*( letter / digit / "_" ) + +; A natural (number) is a sequence of one or more digits. +; Note that we allow leading zeros, e.g. '007'. + +natural = 1*digit + +; An integer (number) is either a natural or its negation. +; Note that we also allow leading zeros in negative numbers, e.g. '-007'. + +integer = [ "-" ] natural + +; An untyped literal is just an integer. + +untyped-literal = integer + +; Unsigned literals are naturals followed by unsigned types. + +unsigned-literal = natural ( %s"u8" / %s"u16" / %s"u32" / %s"u64" / %s"u128" ) + +; Signed literals are integers followed by signed types. + +signed-literal = integer ( %s"i8" / %s"i16" / %s"i32" / %s"i64" / %s"i128" ) + +; Field literals are integers followed by the type of field elements. + +field-literal = integer %s"field" + +; There are two kinds of group literals. +; One is a single integer followed by the type of group elements, +; which denotes the scalar product of the generator point by the integer. +; The other is a pair of integer coordinates, +; which are reduced modulo the prime to identify a point, +; which must be on the elliptic curve. +; It is also allowed to omit one (not both) coordinates, +; with an indication of how to infer the missing coordinate +; (i.e. sign high, sign low, or inferred). + +product-group-literal = integer %s"group" + +group-coordinate = integer / "+" / "-" / "_" + +affine-group-literal = "(" group-coordinate "," group-coordinate ")" %s"group" + +group-literal = product-group-literal / affine-group-literal + +; Note that the rule for group literals above +; allows no whitespace between coordinates. +; If we want to allow whitespace, +; e.g. '(3, 4)group' as opposed to requiring '(3,4)group', +; then we should define affine group literals +; in the syntactic grammar instead of in the lexical grammar. +; We should have a notion of atomic literal in the lexical grammar, +; and (non-atomic) literal in the syntactic grammar. +; The lexical grammar should define a token for ')group' +; if we want no whitespace between the closing parenthesis and 'group'. +; More precisely, the rule for 'literal' below in the lexical grammar +; would be replaced with +; +; atomic-literal = ... / product-group-literal +; +; where the '...' stands for all the '...-literal' alternatives +; in the current rule for 'literal' below, except 'group-literal'. +; Furthermore, the rule for 'symbol' below in the lexical grammar +; would be extended to +; +; symbol = ... / %s")group" +; +; where '...' stands for the current definiens of the rule. +; We would also have to adjust the rule for 'token' below in the lexical grammar +; to reference 'atomic-literal' instead of 'literal' in the definiens. +; We would then add to the syntactic grammar the following rules +; +; affine-group-literal = "(" group-coordinate "," group-coordinate %s")group" +; +; literal = atomic-literal / affine-group-literal +; +; which would now define literals in the syntactic grammar. +; Note that now an affine group literal would have the form +; +; ( , )group +; +; where is optional whitespace; +; however, no whitespace is allowed between the closing ')' and 'group'. + +; Boolean literals are the usual two. + +boolean-literal = %s"true" / %s"false" + +; An address literal is an address wrapped into an indication of address, +; to differentiate it from an identifier. + +address-literal = %s"address" "(" address ")" + +; The ones above are all the literals, as defined by the following rule. + +literal = untyped-literal + / unsigned-literal + / signed-literal + / field-literal + / group-literal + / boolean-literal + / address-literal + +; After defining the (mostly) alphanumeric tokens above, +; it remains to define tokens for non-alphanumeric symbols such as "+" and "(". +; Different programming languages used different terminologies for these, +; e.g. operators, separators, punctuators, etc. +; Here we use 'symbol', for all of them, but we can do something different. +; We could give names to all of these symbols, +; via rules such as +; +; equality-operator = "==" +; +; and defining 'symbol' in terms of those +; +; symbol = ... / equality-operator / ... +; +; This may or may not make the grammar more readable, +; but it would help establish a terminology in the grammar, +; namely the exact names of some of these token. +; On the other hand, at least some of them are perhaps simple enough +; that they could be just described in terms of their symbols, +; e.g. 'double dot', 'question mark', etc. + +symbol = "!" / "&&" / "||" + / "==" / "!=" + / "<" / "<=" / ">" / ">=" + / "+" / "-" / "*" / "/" / "**" + / "=" / "+=" / "-=" / "*=" / "/=" / "**=" + / "(" / ")" + / "[" / "]" + / "{" / "}" + / "," / "." / ".." / "..." / ";" / ":" / "::" / "?" + / "->" + +; Everything defined above, other than comments and whitespace, +; is a token, as defined by the following rule. + +token = keyword + / identifier + / literal + / package-name + / format-string + / annotation-name + / annotation-argument + / symbol + +;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; + +; Syntactic Grammar +; ----------------- + +; The processing defined by the lexical grammar above +; turns the initial sequence of characters +; into a sequence of tokens, comments, and whitespaces. +; The purpose of comments and whitespaces, from a syntactic point of view, +; is just to separate tokens: +; they are discarded, leaving a sequence of tokens. +; The syntactic grammar describes how to turn +; a sequence of tokens into concrete syntax trees. + +; There are unsigned and signed integer types, for five sizes. + +unsigned-type = %s"u8" / %s"u16" / %s"u32" / %s"u64" / %s"u128" + +signed-type = %s"i8" / %s"i16" / %s"i32" / %s"i64" / %s"i128" + +integer-type = unsigned-type / signed-type + +; The integer types, along with the field and group types, +; for the arithmetic types, i.e. the ones that support arithmetic operations. + +field-type = %s"field" + +group-type = %s"group" + +arithmetic-type = integer-type / field-type / group-type + +; The arithmetic types, along with the boolean and address types, +; form the scalar types, i.e. the ones whose values do not contain (sub)values. + +boolean-type = %s"bool" + +address-type = %s"address" + +scalar-type = boolean-type / arithmetic-type / address-type + +; Circuit types are denoted by identifiers and the keyword 'Self'. +; The latter is only allowed inside a circuit definition, +; to denote the defined circuit. + +self-type = %s"Self" + +circuit-type = identifier / self-type + +; A tuple type consists of zero, two, or more component types. + +tuple-type = "(" [ type 1*( "," type ) ] ")" + +; An array type consists of an element type +; and an indication of dimensions. +; There is either a single dimension (a number), +; or a tuple of one or more dimensions +; (we could restrict the latter to two or more dimensions). + +array-type = "[" type ";" array-dimensions "]" + +array-dimensions = natural + / "(" natural *( "," natural ) ")" + +; Circuit, tuple, and array types form the aggregate types, +; i.e. types whose values contain (sub)values +; (with the corner-case exception of the empty tuple value). + +aggregate-type = tuple-type / array-type / circuit-type + +; Scalar and aggregate types form all the types. + +type = scalar-type / aggregate-type + +; As often done in grammatical language syntax specifications, +; we define rules for different kinds of expressions, +; which also defines the relative precedence +; of operators and other expression constructs, +; and the (left or right) associativity of binary operators. + +; The primary expressions are self-contained in a way, +; i.e. they have clear deliminations. +; Some consist of single tokens: +; identifiers, the keywords 'self' and 'input', and literals. +; Primary expressions also include parenthesized expressions, +; i.e. any expression may be turned into a primary one +; by putting parentheses around it. +; The other kinds are defined and explained below. + +primary-expression = identifier + / %s"self" + / %s"input" + / literal + / "(" expression ")" + / tuple-expression + / array-expression + / circuit-expression + / function-call + +; There are tuple expressions to construct and deconstruct tuples. +; A construction consists of zero, two, or more component expressions. +; A deconstruction uses a component index (zero-indexed). +; Note that constructions are delimited by closing parentheses +; and deconstructions are delimited by natural tokens. +; The rule below, and similar rules for other aggregate types, +; use the perhaps more familiar 'access', +; but note that 'deconstruction' has a nice symmetry to 'construction'; +; the term 'destructor' has a different meaning in other languages, +; so we may want to avoid it, but not necessarily. + +tuple-construction = "(" [ expression 1*( "," expression ) ] ")" + +tuple-access = primary-expression "." natural + +tuple-expression = tuple-construction / tuple-access + +; The are array expressions to construct and deconstruct arrays. +; There are two kinds of constructions: +; one lists the element expressions (at least one), +; including spreads (via '...') which are arrays being spliced in; +; the other repeats (the value of) a single expression +; across one or more dimensions. +; There are two kinds of deconstructions: +; one selects a single element by index (zero-indexed); +; the other selects a range via two indices, +; the first inclusive and the second exclusive -- +; both are optional, +; the first defaulting to 0 and the second to the array length. +; Note that these expressions are all delimited +; by closing square brackets. + +array-inline-construction = "[" + array-inline-element + *( "," array-inline-element ) + "]" + +array-inline-element = expression / "..." expression + +array-repeat-construction = "[" expression ";" array-dimensions "]" + +array-construction = array-inline-construction / array-repeat-construction + +array-element-access = primary-expression "[" expression "]" + +array-range-access = primary-expression "[" [expression] ".." [expression] "]" + +array-access = array-element-access / array-range-access + +array-expression = array-construction / array-access + +; There are circuit expressions to construct and deconstruct circuit values. +; A construction lists values for all the member variables (in any order); +; there must be at least one member variable currently. +; A deconstruction selects a member variable by name. +; Note that these expressions are delimited, +; by closing curly braces or identifiers. + +circuit-construction = circuit-type "{" + circuit-inline-element *( "," circuit-inline-element ) + "}" + +circuit-inline-element = identifier ":" expression + +circuit-access = primary-expression "." identifier + +circuit-expression = circuit-construction / circuit-access + +; There are three kinds of function calls: +; top-level function calls, +; instance (i.e. non-static) member function calls, and +; static member function calls. +; What changes is the start, but they all end in an argument list, +; delimited by a closing parenthesis. + +top-level-function-call = identifier function-arguments + +instance-member-function-call = primary-expression "." + identifier function-arguments + +static-member-function-call = circuit-type "::" identifier function-arguments + +function-call = top-level-function-call + / instance-member-function-call + / static-member-function-call + +function-arguments = "(" [ expression *( "," expression ) ] ")" + +; Unary operators have the highest precedence. +; They apply to primary expressions +; and recursively to unary expressions. + +unary-expression = primary-expression + / "!" unary-expression + / "-" unary-expression + +; Next in the operator precedence is exponentiation, +; following mathematical practice. +; The current rule below makes exponentiation left-associative, +; i.e. 'a ** b ** c' must be parsed as '(a ** b) ** c'. +; This is easy to change if we want it to be right-associative instead. + +exponential-expression = unary-expression + / exponential-expression "**" unary-expression + +; Next in precedence come multiplication and division, both left-associative. + +multiplicative-expression = exponential-expression + / multiplicative-expression "*" exponential-expression + / multiplicative-expression "/" exponential-expression + +; Then there are addition and subtraction, both left-assocative. + +additive-expression = multiplicative-expression + / additive-expression "+" multiplicative-expression + / additive-expression "-" multiplicative-expression + +; Next in the precedence order are ordering relations. +; These are not associative, because they return boolean values. + +ordering-expression = additive-expression + / additive-expression "<" additive-expression + / additive-expression ">" additive-expression + / additive-expression "<=" additive-expression + / additive-expression ">=" additive-expression + +; Equalities return booleans but may also operate on boolean, +; so we make them left-associative. + +equality-expression = ordering-expression + / equality-expression "==" ordering-expression + / equality-expression "!=" ordering-expression + +; Next come conjunctive expressions, left-associative. + +conjunctive-expression = equality-expression + / conjunctive-expression "&&" equality-expression + +; Next come disjunctive expressions, left-associative. + +disjunctive-expression = conjunctive-expression + / disjunctive-expression "||" conjunctive-expression + +; Finally we have conditional expressions. + +conditional-expression = disjunctive-expression + / conditional-expression + "?" expression + ":" conditional-expression + +; These are all the expressions. +; Recall that conditional expressions +; may be disjunctive expressions, +; which may be conjunctive expressions, +; and so on all the way to primary expressions. + +expression = conditional-expression + +; There are various kinds of statements, +; including blocks, which are +; possibly empty sequences of statements surounded by curly braces. + +statement = expression-statement + / return-statement + / variable-definition-statement + / conditional-statement + / loop-statement + / assignment-statement + / console-statement + / block + +block = "{" *statement "}" + +; An expression (that returns the empty tuple) +; can be turned into a statement by appending a semicolon. + +expression-statement = expression ";" + +; A return statement always takes an expression, +; and does not end with a semicolon (but we may want to change that). + +return-statement = %s"return" expression + +; There are two kinds of variable definition statements, +; which only differ in the starting keyword. +; The variables are either a single one or a tuple of two or more; +; in all cases, there is just one optional type +; and just one initializing expression. + +variable-definition-statement = ( %s"let" / %s"const" ) + identifier-or-identifiers + [ ":" type ] "=" expression ";" + +identifier-or-identifiers = identifier + / "(" identifier 1*( "," identifier ) ")" + +; A conditional statement always starts with a condition and a block +; (which together form a branch). +; It may stop there, or it may continue with an alternative block, +; or possibly with another conditional statement, forming a chain. +; Note that we require blocks in all branches, not merely statements. + +branch = %s"if" expression block + +conditional-statement = branch + / branch %s"else" block + / branch %s"else" conditional-statement + +; A loop statement implicitly defines a loop variable +; that goes from a starting value (inclusive) to an ending value (exclusive). +; The body is a block. + +loop-statement = %s"for" identifier %s"in" expression ".." expression block + +; An assignment statement is straightforward. +; Based on the operator, the assignment may be simple (i.e. '=') +; or compound (i.e. combining assignment with an arithmetic operation). + +assignment-operator = "=" / "+=" / "-=" / "*=" / "/=" / "**=" + +assignment-statement = expression assignment-operator expression ";" + +; Console statements start with the 'console' keyword, +; followed by a console function call. +; The call may be an assertion or a print command. +; The former takes an expression (which must be boolean) as argument. +; The latter takes either no argument, +; or a format string followed by expressions, +; whose number must match the number of containers '{}' in the format string. +; Note that the console function names are identifiers, not keywords. +; There are three kind of printing. + +console-statement = %s"console" "." console-call + +console-call = assert-call + / print-call + +assert-call = %s"assert" "(" expression ")" + +print-function = %s"debug" / %s"error" / %s"log" + +print-arguments = "(" [ format-string *( "," expression ) ] ")" + +print-call = print-function print-arguments + +; An annotation consists of an annotation name (which starts with '@') +; with optional annotation arguments. +; Note that no parentheses are used if there are no arguments. + +annotation = annotation-name + [ "(" annotation-argument *( "," annotation-argument ) ")" ] + +; A function declaration defines a function. +; This could be called 'function-definition' instead, +; but see the comments about the 'declaration' rule below. +; The output type is optional (it defaults to the empty tuple type). +; In general, a function input consists of an identifier and a type, +; with an optional 'const' modifier. +; However, functions inside circuits +; may start with a 'mut self' or 'self' parameter, +; which may be the only parameter in fact. +; Furthermore, any function may end with an 'input' parameter, +; which may be the only parameter in fact. + +function-declaration = *annotation %s"function" identifier + "(" [ function-parameters ] ")" [ "->" type ] + block + +function-parameters = self-parameter [ "," input-parameter ] + / self-parameter "," function-inputs [ "," input-parameter ] + / function-inputs [ "," input-parameter ] + / input-parameter + +self-parameter = [%s"mut"] %s"self" + +function-inputs = function-input *( "," function-input ) + +function-input = [ %s"const" ] identifier ":" type + +input-parameter = %s"input" + +; A circuit member variable declaration consists of an identifier and a type. +; A circuit member function declaration consists of a function declaration. +; We could call these 'member-definition' etc., +; but see the comments about the 'declaration' rule below. + +member-declaration = member-variable-declaration + / member-function-declaration + +member-variable-declaration = identifier ":" type + +member-function-declaration = function-declaration + +; A circuit declaration defines a circuit type. It is straightforward. +; This could be called 'circuit-definition' instead, +; but see the comments about the 'declaration' rule below. + +circuit-declaration = *annotation %s"circuit" identifier + "{" member-declaration *( "," member-declaration ) "}" + +; An import declaration consists of the 'import' keyword +; followed by a package path, which may be one of the following: +; a single wildcard; +; an identifier, optionally followed by a local renamer; +; a package name followed by a path, recursively; +; or a parenthesized list of package paths, +; which are "fan out" of the initial path. +; Note that we allow the last element of the parenthesized list +; to be followed by a comma, presumably for convenience. + +import-declaration = %s"import" package-path + +package-path = "*" + / identifier [ %s"as" identifier ] + / package-name "." package-path + / "(" package-path *( "," package-path ) [","] ")" + +; Finally, we define a file as a sequence of zero or more declarations. +; This is why we used 'function-declaration' and 'circuit-declaration' +; instead of 'function-definition' and 'ciruit-definition': +; this way, they are all declarations of some sort. +; An import declaration cannot really called an import definition, +; because it does not define anything. +; But we may revisit this, and use 'definition' instead of 'declaration'. + +declaration = import-declaration + / function-declaration + / circuit-declaration + +file = *declaration