Syntax Trees
This section describes the syntax trees produced by JuliaSyntax, mainly in terms of their similarities and differences with the Expr tree data structures used since Julia 0.1.
JuliaSyntax trees vs Expr
The tree structure of GreenNode/SyntaxNode is similar to Julia's Expr data structure but there are various differences:
Source ordered children
The children of our trees are strictly in source order. This has many consequences in places where Expr reorders child expressions.
- Infix and postfix operator calls have the operator name in the second child position.
a + bis parsed as(call-i a + b)- where the infix-iflag indicates infix child position - rather thanExpr(:call, :+, :a, :b). - Generators are represented in source order as a single node rather than multiple nested flatten and generator expressions.
No LineNumberNodes
Our syntax nodes inherently stores source position, so there's no need for the LineNumberNodes used by Expr.
More consistent / less redundant blocks
Sometimes Expr needs redundant block constructs to store LineNumberNodes, but we don't need these. Also in cases which do use blocks we try to use them consistently.
- No block is used on the right hand side of short form function syntax
- No block is used for the conditional in
elseif - No block is used for the body of anonymous functions after the
-> letargument lists always use a block regardless of number or form of bindings
Faithful representation of the source text / avoid premature lowering
Some cases of "premature lowering" have been removed, preferring to represent the source text more closely.
K"macrocall"- allow users to easily distinguish macrocalls with parentheses from those without them (#218)- Grouping parentheses are represented with a node of kind
K"parens"(#222) - The right hand side of
x where {T}retains theK"braces"node around theTto distinguish it fromx where T. - Ternary syntax is not immediately lowered to an
ifnode:a ? b : cparses as(? a b c)rather thanExpr(:if, :a, :b, :c)(#85) global constandconst globalare not normalized by the parser. This is done inExprconversion (#130)dosyntax is nested as the last child of the call which thedolambda will be passed to (#98, #322)@.is not lowered to@__dot__inside the parser (#146)- Docstrings use the
K"doc"kind, and are not lowered toCore.@docuntil later (#217) - Juxtaposition uses the
K"juxtapose"kind rather than lowering immediately to*(#220) returnwithout a value has zero children, rather than lowering toreturn nothing(#220)- Command syntax $`foo`$ parses into a
cmdstringtree node wrapping the string, as(cmdstring "foo")(#438). These are lowered to a macro call later rather than by the parser.
Containers for string-like constructs
String-like constructs always come within a container node, not as a single token. These are useful for tooling which works with the tokens of the source text. Also separating the delimiters from the text they delimit removes a whole class of tokenization errors and lets the parser deal with them.
- string always use
K"string"to wrap strings, even when they only contain a single string chunk (#94) - char literals are wrapped in the
K"char"kind, containing the character literal string along with their delimiters (#121) - backticks use the
K"cmdstring"kind var""syntax usesK"var"as the head (#127)- The parser splits triple quoted strings into string chunks interspersed with whitespace trivia
Improvements for AST inconsistencies
- Field access syntax like
a.bis parsed as(. a b)rather than(. a (quote b))to avoid the inconsistency between this and actual quoted syntax literals like:(b)andquote b end(#342) - Dotted call syntax like
f.(a,b)anda .+ bhas been made consistent with theK"dotcall"head (#90) - Standalone dotted operators are always parsed as
(. op). For example.*(x,y)is parsed as(call (. *) x y)(#240) - The
K"="kind is used for keyword syntax rather thankw, to avoid various inconsistencies and ambiguities (#103) - Unadorned postfix adjoint is parsed as
callrather than as a syntactic operator for consistency with suffixed versions likex'ᵀ(#124) - The argument list in the left hand side of
->is always a tuple. For example,x->yparses as(-> (tuple x) y)rather than(-> x y)(#522)
Improvements to awkward AST forms
FrankenTuples with multiple parameter blocks like(a=1, b=2; c=3; d=4)are flattened into the parent tuple instead of using nestedK"parameters"nodes (#133)- Using
try catch else finally endis parsed withK"catch"K"else"andK"finally"children to avoid the awkwardness of the optional child nodes in theExprrepresentation (#234) - The dotted import path syntax as in
import A.b.cis parsed with aK"importpath"kind rather thanK".", because a bareA.b.chas a very different nested/quoted expression representation (#244) - We use flags rather than child nodes to represent the difference between
structandmutable struct,moduleandbaremodule(#220) - Iterations are represented with the
iterationandinheads rather than=within the header of afor. Thusfor i=is ; body endparses to(for (iteration (in i is)) (block body)). Cartesian iteration as infor a=as, b=bs body endare represented with a nested(iteration (in a as) (in b bs))rather than ablockcontaining=because these lists of iterators are neither semantically nor syntactically a sequence of statements, unlike other uses ofblock. Generators also use theiterationhead - see information on that below. - Short form functions like
f(x) = x + 1are represented with thefunctionhead rather than the=head. In this case theSHORT_FORM_FUNCTION_FLAGflag is set to allow the surface syntactic form to be easily distinguished from long form functions. - All kinds of updating assignment operators like
+=are represented with a singleK"op="head, with the operator itself in infix position. For example,x += 1is(op= x + 1), where the plus token is of kindK"Identifier". This greatly reduces the number of distinct forms here from a rather big list ($=%=&=*=+=-=//=/=<<=>>=>>>=\=^=|=÷=⊻=) and makes the operator itself appear in the AST as kindK"Identifier", as it should. It also makes it possible to add further unicode updating operators while keeping the AST stable.
More detail on tree differences
Generators
Flattened generators are uniquely problematic because the Julia AST doesn't respect a key rule we normally expect: that the children of an AST node are a contiguous range in the source text. For example, the fors in [xy for x in xs for y in ys] are parsed in the normal order of a for loop to mean
for x in xs
for y in ys
push!(xy, collection)
end
endso the xy prefix is in the body of the innermost for loop. Following this, the standard Julia AST is like so:
(flatten
(generator
(generator
xy
(= y ys))
(= x xs)))however, note that if this tree were flattened, the order would be (xy) (y in ys) (x in xs) and the x and y iterations are opposite of the source order.
However, our green tree is strictly source-ordered, so we must deviate from the Julia AST. We deal with this by grouping cartesian products of iterators (separated by commas) within iteration blocks as in for loops, and use the length of the iteration block rather than the flatten head to distinguish flattened iterators. The nested flattens and generators of Expr forms are reconstructed later. In this form the tree structure resembles the source much more closely. For example, (xy for x in xs for y in ys) is parsed as
(generator
xy
(iteration (in x xs))
(iteration (in y ys)))And the cartesian iteration (xy for x in xs, y in ys) is parsed as
(generator
xy
(iteration (in x xs) (in y ys)))Whitespace trivia inside strings
For triple quoted strings, the indentation isn't part of the string data so should also be excluded from the string content within the green tree. That is, it should be treated as separate whitespace trivia tokens. With this separation things like formatting should be much easier. The same reasoning goes for escaping newlines and following whitespace with backslashes in normal strings.
Detecting string trivia during parsing means that string content is split over several tokens. Here we wrap these in the K"string" kind (as is already used for interpolations). The individual chunks can then be reassembled during Expr construction. (A possible alternative might be to reuse the K"String" and K"CmdString" kinds for groups of string chunks (without interpolation).)
Take as an example the following Julia fragment.
x = """
$a
b"""Here this is parsed as (= x (string-s a "\n" "b")) (the -s flag in string-s means "triple quoted string")
Looking at the green tree, we see the indentation before the $a and b are marked as trivia:
julia> text = "x = \"\"\"\n \$a\n b\"\"\""
show(stdout, MIME"text/plain"(), parseall(GreenNode, text, rule=:statement), text)
1:23 │[=]
1:1 │ Identifier ✔ "x"
2:2 │ Whitespace " "
3:3 │ = "="
4:4 │ Whitespace " "
5:23 │ [string]
5:7 │ """ "\"\"\""
8:8 │ String "\n"
9:12 │ Whitespace " "
13:13 │ $ "\$"
14:14 │ Identifier ✔ "a"
15:15 │ String ✔ "\n"
16:19 │ Whitespace " "
20:20 │ String ✔ "b"
21:23 │ """ "\"\"\""String nodes always wrapped in K"string" or K"cmdstring"
All strings are surrounded by a node of kind K"string", even non-interpolated literals, so "x" parses as (string "x"). This makes string handling simpler and more systematic because interpolations and triple strings with embedded trivia don't need to be treated differently. It also gives a container in which to attach the delimiting quotes.
The same goes for command strings which are always wrapped in K"cmdstring" regardless of whether they have multiple pieces (due to triple-quoted dedenting) or otherwise.
Do blocks
do syntax is represented in the Expr AST with the do outside the call. This makes some sense syntactically (do appears as "an operator" after the function call).
However semantically this nesting is awkward because the lambda represented by the do block is passed to the call. This same problem occurs for the macro form @f(x) do \n body end where the macro expander needs a special rule to expand nestings of the form Expr(:do, Expr(:macrocall ...), ...), rearranging the expression which are passed to this macro call rather than passing the expressions up the tree.
The implied closure is also lowered to a nested Expr(:->) expression, though it this somewhat premature to do this during parsing.
To resolve these problems we parse
@f(x, y) do a, b\n body\n end
f(x, y) do a, b\n body\n endby tacking the do onto the end of the call argument list:
(macrocall @f x y (do (tuple a b) body))
(call f x y (do (tuple a b) body))This achieves the following desirable properties
- Content of
dois nested inside the call which improves the match between AST and semantics - Macro can be passed the syntax as-is rather than the macro expander rearranging syntax before passing it to the macro
- In the future, a macro can detect when it's being passed do syntax rather than lambda syntax
dohead is used uniformly for both call and macrocall- We preserve the source ordering properties we need for the green tree.
Tree structure reference
This section may eventually contain a full description of the Julia AST. For now, we describe a few of the more subtle features.
Concatenation syntax
Concatenation syntax comes in two syntax forms:
- The traditional
hcat/vcat/rowwhich deal with concatenation or matrix construction along dimensions one and two. - The new
ncat/nrowsyntax which deals with concatenation or array construction along arbitrary dimensions.
We write ncat-3 for concatenation along the third dimension. (The 3 is stored in the head flags for SyntaxNode trees, and in the first arg for Expr trees.) Semantically the new syntax can work like the old:
ncat-1is the same asvcatncat-2is the same ashcatrowis the same asnrow-2
Vertical concatenation (dimension 1)
Vertical concatenation along dimension 1 can be done with semicolons or newlines
julia> print_tree(:([a
b]))
Expr(:vcat)
├─ :a
└─ :b
julia> print_tree(:([a ; b]))
Expr(:vcat)
├─ :a
└─ :bHorizontal concatenation (dimension 2)
For horizontal concatenation along dimension 2, use spaces or double semicolons
julia> print_tree(:([a b]))
Expr(:hcat)
├─ :a
└─ :b
julia> print_tree(:([a ;; b]))
Expr(:ncat)
├─ 2
├─ :a
└─ :bMixed concatenation
Concatenation along dimensions 1 and 2 can be done with spaces and single semicolons or newlines, producing a mixture of vcat and row expressions:
julia> print_tree(:([a b
c d]))
# OR
julia> print_tree(:([a b ; c d]))
Expr(:vcat)
├─ Expr(:row)
│ ├─ :a
│ └─ :b
└─ Expr(:row)
├─ :c
└─ :dGeneral n-dimensional concatenation results in nested ncat and nrow, for example
julia> print_tree(:([a ; b ;; c ; d ;;; x]))
Expr(:ncat)
├─ 3
├─ Expr(:nrow)
│ ├─ 2
│ ├─ Expr(:nrow)
│ │ ├─ 1
│ │ ├─ :a
│ │ └─ :b
│ └─ Expr(:nrow)
│ ├─ 1
│ ├─ :c
│ └─ :d
└─ :x