leo/grammar/abnf-grammar.txt

1096 lines
39 KiB
Plaintext
Raw Normal View History

2021-03-23 19:11:42 +03:00
; Leo Library
;
; Copyright (C) 2021 Aleo Systems Inc.
;
; Author: Alessandro Coglio (acoglio on GitHub)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Format Note
; -----------
; The ABNF standard requires grammars to consist of lines terminated by CR LF
; (i.e. carriage return followed by line feed, DOS/Windows-style),
; as explained in the background on ABNF later in this file.
; This file's lines are therefore terminated by CR LF.
; To avoid losing this requirement across systems,
; this file is marked as 'text eol=crlf' in .gitattributes:
; this means that the file is textual, enabling visual diffs,
; but its lines will always be terminated by CR LF on any system.
; Note that this CR LF requirement only applies to the grammar files themselves.
; It does not apply to the lines of the languages described by the grammar.
; ABNF grammar may describe any kinds of languages,
; with any kind of line terminators,
; or even without line terminators at all (e.g. for "binary" languages).
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Introduction
; ------------
; This file contains an initial draft of
; a complete ABNF (Augmented Backus-Naur Form) grammar of Leo.
; Background on ABNF is provided later in this file.
; The initial motivation for creating an ABNF grammar of Leo was that
; we have a formally verified parser of ABNF grammars
; (at https://github.com/acl2/acl2/tree/master/books/kestrel/abnf;
; also see the paper at https://www.kestrel.edu/people/coglio/vstte18.pdf)
; which we have used to parse this ABNF grammar of Leo
; into a formal representation, in the ACL2 theorem prover,
; of the Leo concrete syntax.
; The parsing of this grammar file into an ACL2 representation
; happens every time the ACL2 formalization of Leo is built.
; In addition to that initial motivation,
; this ABNF grammar has now the additional and primary purpose of
; providing an official definition of the syntax of Leo
; that is both human-readable and machine-readable.
; This grammar will be part of the (upcoming) Leo language reference,
; of the Leo Language Formal Specification
; (i.e. the LaTeX document in the leo-semantics repo),
; and of the ACL2 formalization of Leo (which was the initial motivation).
; It has also been suggested that it may be used to generate tests.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Background on ABNF
; ------------------
; ABNF is an Internet standard:
; see https://www.rfc-editor.org/info/rfc5234
; and https://www.rfc-editor.org/info/rfc7405.
; It is used to specify the syntax of JSON, HTTP, and other standards.
; The following is copied (and "un-LaTeX'd") from the aforementioned paper
; (at https://www.kestrel.edu/people/coglio/vstte18.pdf).
; ABNF adds conveniences and makes slight modifications
; to Backus-Naur Form (BNF),
; without going beyond context-free grammars.
; Instead of BNF's angle-bracket notation for nonterminals,
; ABNF uses case-insensitive names consisting of letters, digits, and dashes,
; e.g. HTTP-message and IPv6address.
; ABNF includes an angle-bracket notation for prose descriptions,
; e.g. <host, see [RFC3986], Section 3.2.2>,
; usable as last resort in the definiens of a nonterminal.
; While BNF allows arbitrary terminals,
; ABNF uses only natural numbers as terminals,
; and denotes them via:
; (i) binary, decimal, or hexadecimal sequences,
; e.g. %b1.11.1010, %d1.3.10, and %x.1.3.A
; all denote the string '1 3 10';
; (ii) binary, decimal, or hexadecimal ranges,
; e.g. %x30-39 denotes any string 'n' with 48 <= n <= 57
; (an ASCII digit);
; (iii) case-sensitive ASCII strings,
; e.g. %s"Ab" denotes the string '65 98';
; and (iv) case-insensitive ASCII strings,
; e.g. %i"ab", or just "ab", denotes
; any string among
; '65 66',
; '65 98',
; '97 66', and
; '97 98'.
; ABNF terminals in suitable sets represent ASCII or Unicode characters.
; ABNF allows repetition prefixes n*m,
; where n and m are natural numbers in decimal notation;
; if absent,
; n defaults to 0, and
; m defaults to infinity.
; For example,
; 1*4HEXDIG denotes one to four HEXDIGs,
; *3DIGIT denotes up to three DIGITs, and
; 1*OCTET denotes one or more OCTETs.
; A single n prefix
; abbreviates n*n,
; e.g. 3DIGIT denotes three DIGITs.
; Instead of BNF's |, ABNF uses / to separate alternatives.
; Repetition prefixes have precedence over juxtapositions,
; which have precedence over /.
; Round brackets group things and override the aforementioned precedence rules,
; e.g. *(WSP / CRLF WSP) denotes strings
; obtained by repeating, zero or more times,
; either (i) a WSP or (ii) a CRLF followed by a WSP.
; Square brackets also group things but make them optional,
; e.g. [":" port] is equivalent to 0*1(":" port).
; Instead of BNF's ::=, ABNF uses = to define nonterminals,
; and =/ to incrementally add alternatives
; to previously defined nonterminals.
; For example, the rule BIT = "0" / "1"
; is equivalent to BIT = "0" followed by BIT =/ "1".
; The syntax of ABNF itself is formally specified in ABNF
; (in Section 4 of the aforementioned RFC 5234,
; after the syntax and semantics of ABNF
; are informally specified in natural language
; (in Sections 1, 2, and 3 of the aforementioned RFC 5234).
; The syntax rules of ABNF prescribe the ASCII codes allowed for
; white space (spaces and horizontal tabs),
; line endings (carriage returns followed by line feeds),
; and comments (semicolons to line endings).
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Structure
; ---------
; Language specifications often define the syntax of their languages via
; (i) a lexical grammar that describes how
; sequence of characters are parsed into tokens, and
; (ii) a syntactic grammar that described how
; tokens are parsed into expressions, statements, etc.
; (The adjectives 'lexical' and 'syntactic' are
; the ones used in the Java language specification;
; other terms may be used by other languages,
; but the essence is similar.)
; The situation is sometimes more complex,
; with multiple passes (e.g. Unicode escape processing in Java),
; but the division between lexical and syntactic (in the sense above) stands.
; In the aforementioned language specifications,
; both the lexical and syntactic grammars
; are normally written in a context-free grammar notation,
; augmented with natural language that may assert, for instance,
; that tokenization always takes the longest sequence that constitutes a token.
; This dual structure appears to be motivated by the fact that
; concerns of white space, line endings, etc.
; can be handled by the lexical grammar,
; allowing the syntactic grammar to focus on the more important structure.
; Handling both aspects in a single context-free grammar may be unwieldy,
; so having two grammars provides more clarity and readability.
; In contrast, PEG (Parsing Expression Grammar) formalisms like Pest
; naturally embody a procedural interpretation
; that can handle white space and tokenization in just one manageable grammar.
; However, this procedural interpretaion may be sometimes
; less clear and readable to humans than context-free rules.
; Indeed, context-free grammar are commonly used to documentat languages.
; ABNF is a context-free grammar notation,
; with no procedural interpretation,
; and therefore it makes sense to define
; separate lexical and syntactic ABNF grammars for Leo.
; Conceptually, the two grammars define two subsequent processing phases,
; as detailed below.
; However, a parser implementation does not need to perform
; two strictly separate phases (in fact, it typically does not),
; so long as it produces the same final result.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Operator Precedence
; -------------------
; We formulate the grammar rules for expressions
; in a way that describes the relative precedence of operators,
; as often done in language syntax specifications.
; For instance, consider the rules
;
; multiplicative-expression =
; exponential-expression
; / multiplicative-expression "*" exponential-expression
; / multiplicative-expression "/" exponential-expression
;
; additive-expression =
; multiplicative-expression
; / additive-expression "+" multiplicative-expression
; / additive-expression "-" multiplicative-expression
;
; this rule tells us that the additive operators '+' and '-' have
; lower precedence than the multiplicative operators '*' and '/',
; and that both the additive and multiplicative operators associate to the left.
; This may be best understood via the examples given below.
; According to the rules, the expression
;
; x + y * z
;
; can only be parsed as
;
; +
; / \
; x *
; / \
; y z
;
; and not as
;
; *
; / \
; + z
; / \
; x y
;
; because a multiplicative expression cannot have an additive expression
; as first sub-expression, as it would in the second tree above.
; Also according to the rules, the expression
;
; x + y + z
;
; can only be parsed as
;
; +
; / \
; + z
; / \
; x y
;
; and not as
;
; +
; / \
; x +
; / \
; y z
;
; because an additive expression cannot have an additive expression
; as second sub-expression, as it would in the second tree above.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Naming Convention
; -----------------
; For this ABNF grammar, we choose nonterminal names
; that consist of complete English words, separated by dashes,
; and that describe the construct the way it is in English.
; For instance, we use the name 'conditional-statement'
; to describe conditional statements.
; At the same time, we attempt to establish
; a precise and "official" nomenclature for the Leo constructs,
; by way of the nonterminal names that define their syntax.
; For instance, the rule
;
; group-literal = product-group-literal
; / affine-group-literal
;
; tells us that there are two kinds of group literals,
; namely product group literals and affine group literals.
; This is more precise than describing them as
; integers (which are not really group elements per se),
; or points (they are all points, just differently specified),
; or being singletons vs. pairs (which is a bit generic).
; The only exception to the nomenclature-establishing role of the grammar
; is the fact that, as discussed above,
; we write the grammar rules in a way that determines
; the relative precedence and the associativity of expression operators,
; and that therefore we have rules like
;
; unary-expression = primary-expression
; / "!" unary-expression
; / "-" unary-expression
;
; In order to allow the recursion of the rule to stop,
; we need to regard, in the grammar, a primary expression as a unary expression.
; However, this is just a grammatical artifact:
; ontologically, a primary expression is not really a unary expression,
; because a unary expression is one that consists of
; a unary operator and an operand sub-expression.
; These terminological "exceptions" should be easy to identify in the rules.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Lexical Grammar
; ---------------
; A Leo file is a finite sequence of Unicode characters,
; represented as Unicode code points,
; which are numbers in the range form 0 to 10FFFFh.
; These are captured by the ABNF rule 'character' below.
; The lexical grammar defines how, conceptually,
; the sequence of characters is turned into
; a sequence of tokens, comments, and whitespaces.
; As stated, the lexical grammar is ambiguous.
; For example, the sequence of characters '**' (i.e. two stars)
; could be equally parsed as two '*' symbol tokens or one '**' symbol token
; (see rule for 'symbol' below).
; As another example, the sequence or characters '<CR><LF>'
; (i.e. carriage return followed by line feed)
; could be equally parsed as two line terminators or one
; (see rule for 'newline').
; Thus, as often done in language syntax definitions,
; the lexical grammar is disambiguated by
; the extra-grammatical requirement that
; the longest possible sequence of characters is always parsed.
; This way, '**' must be parsed as one '**' symbol token,
; and '<CR><LF>' must be parsed as one line terminator.
; (We should formalize this requirement,
; along with the other extra-grammatical requirements given below,
; and formally prove that they indeed make
; the lexical grammar of Leo unambiguous.)
; As mentioned above, a character is any Unicode code point.
; This grammar does not say how those are encoded in files (e.g. UTF-8):
; it starts with a decoded sequence of Unicode code points.
; Note that we allow any value,
; even though some values may not be used according to the Unicode standard.
character = %x0-10FFFF ; any Unicode code point
; We give names to certain ASCII characters.
horizontal-tab = %x9
line-feed = %xA
carriage-return = %xD
space = %x20
double-quote = %x22
; We give names to complements of certain ASCII characters.
; These consist of all the Unicode characters except for one or two.
not-double-quote = %x0-22 / %x24-10FFFF ; anything but "
not-star = %x0-29 / %x2B-10FFFF ; anything but *
not-line-feed-or-carriage-return = %x0-9 / %xB-C / %xE-10FFFF
; anything but LF or CR
not-star-or-slash = %x0-29 / %x2B-2E / %x30-10FFFF ; anything but * or /
; Lines in Leo may be terminated via
; a single carriage return,
; a line feed,
; or a carriage return immediately followed by a line feed.
; Note that the latter combination constitutes a single line terminator,
; according to the extra-grammatical rule of the longest sequence.
newline = line-feed / carriage-return / carriage-return line-feed
; Line terminators form whitespace, along with spaces and horizontal tabs.
whitespace = space / horizontal-tab / newline
; There are two kinds of comments in Leo, as in other languages.
; One is block comments of the form '/* ... */',
; and the other is end-of-line comments '// ...'.
; The first kind start at '/*' and end at the first '*/',
; possibly spanning multiple (partial) lines;
; they do no nest.
; The second kind start at '//' and extend till the end of the line.
; The rules about comments given below are similar to
; the ones used in the Java language specification.
comment = block-comment / end-of-line-comment
block-comment = "/*" rest-of-block-comment
rest-of-block-comment = "*" rest-of-block-comment-after-star
/ not-star rest-of-block-comment
rest-of-block-comment-after-star = "/"
/ "*" rest-of-block-comment-after-star
/ not-star-or-slash rest-of-block-comment
end-of-line-comment = "//" *not-line-feed-or-carriage-return newline
; Below are the keywords in the Leo language.
; They cannot be used as identifiers.
keyword = %s"address"
/ %s"as"
/ %s"bool"
/ %s"circuit"
/ %s"console"
/ %s"const"
/ %s"else"
/ %s"false"
/ %s"field"
/ %s"for"
/ %s"function"
/ %s"group"
/ %s"i8"
/ %s"i16"
/ %s"i32"
/ %s"i64"
/ %s"i128"
/ %s"if"
/ %s"import"
/ %s"in"
/ %s"input"
/ %s"let"
/ %s"mut"
/ %s"return"
/ %s"Self"
/ %s"self"
/ %s"static"
/ %s"string"
/ %s"true"
/ %s"u8"
/ %s"u16"
/ %s"u32"
/ %s"u64"
/ %s"u128"
; The following rules define (ASCII) digits
; and (uppercase and lowercase) letters.
digit = %x30-39 ; 0-9
uppercase-letter = %x41-5A ; A-Z
lowercase-letter = %x61-7A ; a-z
letter = uppercase-letter / lowercase-letter
; An identifier is a non-empty sequence of letters, digits, and underscores,
; starting with a letter.
; It must not be a keyword: this is an extra-grammatical constraint.
identifier = letter *( letter / digit / "_" ) ; but not a keyword
; A package name consists of one or more segments separated by single dashes,
; where each segment is a non-empty sequence of lowercase letters and digits.
package-name = 1*( lowercase-letter / digit )
*( "-" 1*( lowercase-letter / digit ) )
; An address starts with 'aleo1'
; and continues with exactly 58 lowercase letters and digits.
; Thus an address always consists of 63 characters.
address = %s"aleo1" 58( lowercase-letter / digit )
; A format string is a sequence of characters, other than double quote,
; surrounded by double quotes.
; Within a format string, substrings '{}' are distinguished as containers
; (these are the ones that may be matched with values
; whose textual representation replaces the containers
; in the printed string).
; There is an implicit extra-grammatical requirements that
2021-03-23 20:45:57 +03:00
; the explicit 'format-string-container' instances include
2021-03-23 19:11:42 +03:00
; all the occurrences of '{}' in the parsed character sequence:
; that is, there may not be two contiguous 'not-double-quote' instances
; that are '{' and '}'.
format-string-container = "{}"
2021-03-23 20:45:57 +03:00
formatted-string = double-quote
2021-03-23 19:11:42 +03:00
*( not-double-quote / format-string-container )
double-quote
; Here is (part of this ABNF comment),
; an alternative way to specify format strings,
; which captures the extra-grammatical requirement above in the grammar
; but is more complicated:
;
; not-double-quote-or-open-brace = %x0-22 / %x24-7A / %x7C-10FFFF
;
; not-double-quote-or-close-brace = %x0-22 / %x24-7C / %x7E-10FFFF
;
; format-string-element = not-double-quote-or-open-brace
; / "{" not-double-quote-or-close-brace
2021-03-23 20:45:57 +03:00
; / format-string-container
2021-03-23 19:11:42 +03:00
;
2021-03-23 20:45:57 +03:00
; formatted-string = double-quote *format-string-element double-quote
2021-03-23 19:11:42 +03:00
;
; It is not immediately clear which approach is better; there are tradeoffs.
; Regardless, we should choose one eventually.
; Annotations are built out of names and arguments, which are tokens.
; Two names are currently supported.
; An argument is a sequence of one or more letters, digits, and underscores.
annotation-name = "@" identifier
2021-03-23 19:11:42 +03:00
; A natural (number) is a sequence of one or more digits.
; Note that we allow leading zeros, e.g. '007'.
natural = 1*digit
; An integer (number) is either a natural or its negation.
; Note that we also allow leading zeros in negative numbers, e.g. '-007'.
integer = [ "-" ] natural
; An untyped literal is just an integer.
untyped-literal = integer
; Unsigned literals are naturals followed by unsigned types.
unsigned-literal = natural ( %s"u8" / %s"u16" / %s"u32" / %s"u64" / %s"u128" )
; Signed literals are integers followed by signed types.
signed-literal = integer ( %s"i8" / %s"i16" / %s"i32" / %s"i64" / %s"i128" )
; Field literals are integers followed by the type of field elements.
field-literal = integer %s"field"
; There are two kinds of group literals.
; One is a single integer followed by the type of group elements,
; which denotes the scalar product of the generator point by the integer.
; The other is a pair of integer coordinates,
; which are reduced modulo the prime to identify a point,
; which must be on the elliptic curve.
; It is also allowed to omit one (not both) coordinates,
; with an indication of how to infer the missing coordinate
; (i.e. sign high, sign low, or inferred).
product-group-literal = integer %s"group"
group-coordinate = integer / "+" / "-" / "_"
affine-group-literal = "(" group-coordinate "," group-coordinate ")" %s"group"
group-literal = product-group-literal / affine-group-literal
; Note that the rule for group literals above
; allows no whitespace between coordinates.
; If we want to allow whitespace,
; e.g. '(3, 4)group' as opposed to requiring '(3,4)group',
; then we should define affine group literals
; in the syntactic grammar instead of in the lexical grammar.
; We should have a notion of atomic literal in the lexical grammar,
; and (non-atomic) literal in the syntactic grammar.
; The lexical grammar should define a token for ')group'
; if we want no whitespace between the closing parenthesis and 'group'.
; More precisely, the rule for 'literal' below in the lexical grammar
; would be replaced with
;
; atomic-literal = ... / product-group-literal
;
; where the '...' stands for all the '...-literal' alternatives
; in the current rule for 'literal' below, except 'group-literal'.
; Furthermore, the rule for 'symbol' below in the lexical grammar
; would be extended to
;
; symbol = ... / %s")group"
;
; where '...' stands for the current definiens of the rule.
; We would also have to adjust the rule for 'token' below in the lexical grammar
; to reference 'atomic-literal' instead of 'literal' in the definiens.
; We would then add to the syntactic grammar the following rules
;
; affine-group-literal = "(" group-coordinate "," group-coordinate %s")group"
;
; literal = atomic-literal / affine-group-literal
;
; which would now define literals in the syntactic grammar.
; Note that now an affine group literal would have the form
;
; ( <ow> <coordinate> <ow> , <ow> <coordinate> <ow> )group
;
; where <ow> is optional whitespace;
; however, no whitespace is allowed between the closing ')' and 'group'.
; Boolean literals are the usual two.
boolean-literal = %s"true" / %s"false"
; An address literal is an address wrapped into an indication of address,
; to differentiate it from an identifier.
address-literal = %s"address" "(" address ")"
; The ones above are all the literals, as defined by the following rule.
literal = untyped-literal
/ unsigned-literal
/ signed-literal
/ field-literal
/ group-literal
/ boolean-literal
/ address-literal
; After defining the (mostly) alphanumeric tokens above,
; it remains to define tokens for non-alphanumeric symbols such as "+" and "(".
; Different programming languages used different terminologies for these,
; e.g. operators, separators, punctuators, etc.
; Here we use 'symbol', for all of them, but we can do something different.
; We could give names to all of these symbols,
; via rules such as
;
; equality-operator = "=="
;
; and defining 'symbol' in terms of those
;
; symbol = ... / equality-operator / ...
;
; This may or may not make the grammar more readable,
; but it would help establish a terminology in the grammar,
; namely the exact names of some of these token.
; On the other hand, at least some of them are perhaps simple enough
; that they could be just described in terms of their symbols,
; e.g. 'double dot', 'question mark', etc.
symbol = "!" / "&&" / "||"
/ "==" / "!="
/ "<" / "<=" / ">" / ">="
/ "+" / "-" / "*" / "/" / "**"
/ "=" / "+=" / "-=" / "*=" / "/=" / "**="
/ "(" / ")"
/ "[" / "]"
/ "{" / "}"
/ "," / "." / ".." / "..." / ";" / ":" / "::" / "?"
2021-03-24 18:40:26 +03:00
/ "->" / "_"
2021-03-23 19:11:42 +03:00
; Everything defined above, other than comments and whitespace,
; is a token, as defined by the following rule.
token = keyword
/ identifier
/ literal
/ package-name
/ format-string
/ annotation-name
/ symbol
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Syntactic Grammar
; -----------------
; The processing defined by the lexical grammar above
; turns the initial sequence of characters
; into a sequence of tokens, comments, and whitespaces.
; The purpose of comments and whitespaces, from a syntactic point of view,
; is just to separate tokens:
; they are discarded, leaving a sequence of tokens.
; The syntactic grammar describes how to turn
; a sequence of tokens into concrete syntax trees.
; There are unsigned and signed integer types, for five sizes.
unsigned-type = %s"u8" / %s"u16" / %s"u32" / %s"u64" / %s"u128"
signed-type = %s"i8" / %s"i16" / %s"i32" / %s"i64" / %s"i128"
integer-type = unsigned-type / signed-type
; The integer types, along with the field and group types,
; for the arithmetic types, i.e. the ones that support arithmetic operations.
field-type = %s"field"
group-type = %s"group"
arithmetic-type = integer-type / field-type / group-type
; The arithmetic types, along with the boolean and address types,
; form the scalar types, i.e. the ones whose values do not contain (sub)values.
boolean-type = %s"bool"
address-type = %s"address"
scalar-type = boolean-type / arithmetic-type / address-type
; Circuit types are denoted by identifiers and the keyword 'Self'.
; The latter is only allowed inside a circuit definition,
; to denote the defined circuit.
self-type = %s"Self"
circuit-type = identifier / self-type
; A tuple type consists of zero, two, or more component types.
tuple-type = "(" [ type 1*( "," type ) ] ")"
; An array type consists of an element type
; and an indication of dimensions.
; There is either a single dimension (a number),
; or a tuple of one or more dimensions
; (we could restrict the latter to two or more dimensions).
array-type = "[" type ";" array-dimensions "]"
array-dimensions = natural
/ "(" natural *( "," natural ) ")"
; Circuit, tuple, and array types form the aggregate types,
; i.e. types whose values contain (sub)values
; (with the corner-case exception of the empty tuple value).
aggregate-type = tuple-type / array-type / circuit-type
; Scalar and aggregate types form all the types.
type = scalar-type / aggregate-type
; As often done in grammatical language syntax specifications,
; we define rules for different kinds of expressions,
; which also defines the relative precedence
; of operators and other expression constructs,
; and the (left or right) associativity of binary operators.
; The primary expressions are self-contained in a way,
; i.e. they have clear deliminations.
; Some consist of single tokens:
; identifiers, the keywords 'self' and 'input', and literals.
; Primary expressions also include parenthesized expressions,
; i.e. any expression may be turned into a primary one
; by putting parentheses around it.
; The other kinds are defined and explained below.
primary-expression = identifier
/ %s"self"
/ %s"input"
/ literal
/ "(" expression ")"
/ tuple-expression
/ array-expression
/ circuit-expression
/ function-call
; There are tuple expressions to construct and deconstruct tuples.
; A construction consists of zero, two, or more component expressions.
; A deconstruction uses a component index (zero-indexed).
; Note that constructions are delimited by closing parentheses
; and deconstructions are delimited by natural tokens.
; The rule below, and similar rules for other aggregate types,
; use the perhaps more familiar 'access',
; but note that 'deconstruction' has a nice symmetry to 'construction';
; the term 'destructor' has a different meaning in other languages,
; so we may want to avoid it, but not necessarily.
tuple-construction = "(" [ expression 1*( "," expression ) ] ")"
tuple-access = primary-expression "." natural
tuple-expression = tuple-construction / tuple-access
; The are array expressions to construct and deconstruct arrays.
; There are two kinds of constructions:
; one lists the element expressions (at least one),
; including spreads (via '...') which are arrays being spliced in;
; the other repeats (the value of) a single expression
; across one or more dimensions.
; There are two kinds of deconstructions:
; one selects a single element by index (zero-indexed);
; the other selects a range via two indices,
; the first inclusive and the second exclusive --
; both are optional,
; the first defaulting to 0 and the second to the array length.
; Note that these expressions are all delimited
; by closing square brackets.
array-inline-construction = "["
array-inline-element
*( "," array-inline-element )
"]"
array-inline-element = expression / "..." expression
array-repeat-construction = "[" expression ";" array-dimensions "]"
array-construction = array-inline-construction / array-repeat-construction
array-element-access = primary-expression "[" expression "]"
array-range-access = primary-expression "[" [expression] ".." [expression] "]"
array-access = array-element-access / array-range-access
array-expression = array-construction / array-access
; There are circuit expressions to construct and deconstruct circuit values.
; A construction lists values for all the member variables (in any order);
; there must be at least one member variable currently.
2021-03-23 20:45:57 +03:00
; A single identifier abbreviates
; a pair consisting of the same identifier separated by dot;
; note that, in the expansion, the left one denotes a member name,
; while the right one denotes an expression (a variable),
; so they are syntactically identical but semantically different.
2021-03-23 19:11:42 +03:00
; A deconstruction selects a member variable by name.
; Note that these expressions are delimited,
; by closing curly braces or identifiers.
circuit-construction = circuit-type "{"
circuit-inline-element *( "," circuit-inline-element )
"}"
2021-03-23 20:45:57 +03:00
circuit-inline-element = identifier ":" expression / identifier
2021-03-23 19:11:42 +03:00
circuit-access = primary-expression "." identifier
circuit-expression = circuit-construction / circuit-access
; There are three kinds of function calls:
; top-level function calls,
; instance (i.e. non-static) member function calls, and
; static member function calls.
; What changes is the start, but they all end in an argument list,
; delimited by a closing parenthesis.
top-level-function-call = identifier function-arguments
instance-member-function-call = primary-expression "."
identifier function-arguments
static-member-function-call = circuit-type "::" identifier function-arguments
function-call = top-level-function-call
/ instance-member-function-call
/ static-member-function-call
function-arguments = "(" [ expression *( "," expression ) ] ")"
; Unary operators have the highest precedence.
; They apply to primary expressions
; and recursively to unary expressions.
unary-expression = primary-expression
/ "!" unary-expression
/ "-" unary-expression
; Next in the operator precedence is exponentiation,
; following mathematical practice.
; The current rule below makes exponentiation left-associative,
; i.e. 'a ** b ** c' must be parsed as '(a ** b) ** c'.
; This is easy to change if we want it to be right-associative instead.
exponential-expression = unary-expression
/ exponential-expression "**" unary-expression
; Next in precedence come multiplication and division, both left-associative.
multiplicative-expression = exponential-expression
/ multiplicative-expression "*" exponential-expression
/ multiplicative-expression "/" exponential-expression
; Then there are addition and subtraction, both left-assocative.
additive-expression = multiplicative-expression
/ additive-expression "+" multiplicative-expression
/ additive-expression "-" multiplicative-expression
; Next in the precedence order are ordering relations.
; These are not associative, because they return boolean values.
ordering-expression = additive-expression
/ additive-expression "<" additive-expression
/ additive-expression ">" additive-expression
/ additive-expression "<=" additive-expression
/ additive-expression ">=" additive-expression
; Equalities return booleans but may also operate on boolean,
; so we make them left-associative.
equality-expression = ordering-expression
/ equality-expression "==" ordering-expression
/ equality-expression "!=" ordering-expression
; Next come conjunctive expressions, left-associative.
conjunctive-expression = equality-expression
/ conjunctive-expression "&&" equality-expression
; Next come disjunctive expressions, left-associative.
disjunctive-expression = conjunctive-expression
/ disjunctive-expression "||" conjunctive-expression
; Next come conditional expressions.
2021-03-23 19:11:42 +03:00
conditional-expression = disjunctive-expression
/ conditional-expression
2021-03-23 19:11:42 +03:00
"?" expression
":" conditional-expression
; Need to decide on operator precedence for as.
cast-expression = conditional-expression
/ cast-expression %s"as" type
2021-03-23 19:11:42 +03:00
; These are all the expressions.
; Recall that conditional expressions
; may be disjunctive expressions,
; which may be conjunctive expressions,
; and so on all the way to primary expressions.
expression = conditional-expression
; There are various kinds of statements,
; including blocks, which are
; possibly empty sequences of statements surounded by curly braces.
statement = expression-statement
/ return-statement
/ variable-definition-statement
/ conditional-statement
/ loop-statement
/ assignment-statement
/ console-statement
/ block
block = "{" *statement "}"
; An expression (that returns the empty tuple)
; can be turned into a statement by appending a semicolon.
expression-statement = expression ";"
; A return statement always takes an expression,
; and does not end with a semicolon (but we may want to change that).
return-statement = %s"return" expression
; There are two kinds of variable definition statements,
; which only differ in the starting keyword.
; The variables are either a single one or a tuple of two or more;
; in all cases, there is just one optional type
; and just one initializing expression.
variable-definition-statement = ( %s"let" / %s"const" )
identifier-or-identifiers
[ ":" type ] "=" expression ";"
identifier-or-identifiers = identifier
/ "(" identifier 1*( "," identifier ) ")"
; A conditional statement always starts with a condition and a block
; (which together form a branch).
; It may stop there, or it may continue with an alternative block,
; or possibly with another conditional statement, forming a chain.
; Note that we require blocks in all branches, not merely statements.
branch = %s"if" expression block
conditional-statement = branch
/ branch %s"else" block
/ branch %s"else" conditional-statement
; A loop statement implicitly defines a loop variable
; that goes from a starting value (inclusive) to an ending value (exclusive).
; The body is a block.
loop-statement = %s"for" identifier %s"in" expression ".." expression block
; An assignment statement is straightforward.
; Based on the operator, the assignment may be simple (i.e. '=')
; or compound (i.e. combining assignment with an arithmetic operation).
assignment-operator = "=" / "+=" / "-=" / "*=" / "/=" / "**="
assignment-statement = expression assignment-operator expression ";"
; Console statements start with the 'console' keyword,
; followed by a console function call.
; The call may be an assertion or a print command.
; The former takes an expression (which must be boolean) as argument.
; The latter takes either no argument,
; or a format string followed by expressions,
; whose number must match the number of containers '{}' in the format string.
; Note that the console function names are identifiers, not keywords.
; There are three kind of printing.
console-statement = %s"console" "." console-call
console-call = assert-call
/ print-call
assert-call = %s"assert" "(" expression ")"
print-function = %s"debug" / %s"error" / %s"log"
print-arguments = "(" [ format-string *( "," expression ) ] ")"
print-call = print-function print-arguments
; An annotation consists of an annotation name (which starts with '@')
; with optional annotation arguments.
; Note that no parentheses are used if there are no arguments.
annotation = annotation-name
[ "(" identifier *( "," identifier ) ")" ]
2021-03-23 19:11:42 +03:00
; A function declaration defines a function.
; This could be called 'function-definition' instead,
; but see the comments about the 'declaration' rule below.
; The output type is optional (it defaults to the empty tuple type).
; In general, a function input consists of an identifier and a type,
; with an optional 'const' modifier.
; However, functions inside circuits
; may start with a 'mut self' or 'self' parameter,
; which may be the only parameter in fact.
; Furthermore, any function may end with an 'input' parameter,
; which may be the only parameter in fact.
function-declaration = *annotation %s"function" identifier
"(" [ function-parameters ] ")" [ "->" type ]
block
function-parameters = self-parameter [ "," input-parameter ]
/ self-parameter "," function-inputs [ "," input-parameter ]
/ function-inputs [ "," input-parameter ]
/ input-parameter
self-parameter = [%s"mut"] %s"self"
function-inputs = function-input *( "," function-input )
function-input = [ %s"const" ] identifier ":" type
input-parameter = %s"input"
; A circuit member variable declaration consists of an identifier and a type.
; A circuit member function declaration consists of a function declaration.
; We could call these 'member-definition' etc.,
; but see the comments about the 'declaration' rule below.
member-declaration = member-variable-declaration
/ member-function-declaration
member-variable-declaration = identifier ":" type
member-function-declaration = function-declaration
; A circuit declaration defines a circuit type. It is straightforward.
; This could be called 'circuit-definition' instead,
; but see the comments about the 'declaration' rule below.
circuit-declaration = *annotation %s"circuit" identifier
"{" member-declaration *( "," member-declaration ) "}"
; An import declaration consists of the 'import' keyword
; followed by a package path, which may be one of the following:
; a single wildcard;
; an identifier, optionally followed by a local renamer;
; a package name followed by a path, recursively;
; or a parenthesized list of package paths,
; which are "fan out" of the initial path.
; Note that we allow the last element of the parenthesized list
; to be followed by a comma, presumably for convenience.
import-declaration = %s"import" package-path
package-path = "*"
/ identifier [ %s"as" identifier ]
/ package-name "." package-path
/ "(" package-path *( "," package-path ) [","] ")"
; Finally, we define a file as a sequence of zero or more declarations.
; This is why we used 'function-declaration' and 'circuit-declaration'
; instead of 'function-definition' and 'ciruit-definition':
; this way, they are all declarations of some sort.
; An import declaration cannot really called an import definition,
; because it does not define anything.
; But we may revisit this, and use 'definition' instead of 'declaration'.
declaration = import-declaration
/ function-declaration
/ circuit-declaration
file = *declaration