What's difference between expr*, expr? and expr in Python3.6 AST Abstract Grammar? - python-3.x

In Python3.6 AST Abstract Grammar, there are expr*, expr? and expr inside. What's difference betweem them? Such as: expr* targets, expr? value and expr target.

Typically those suffixes mean the same as in regular expressions:
expr - exactly one instance of the expression
expr?- zero or one instances
expr* - zero or more instances
For example, for the following:
FunctionDef(identifier name, arguments args,
stmt* body, expr* decorator_list, expr? returns)
a function definition consists of exactly one name, at least zero statements and asn optional return value (for functions that don't return anything).

Related

Including comments in AST

I'm planning on writing a Parser for some language. I'm quite confident that I could cobble together a parser in Parsec without too much hassle, but I thought about including comments into the AST so that I could implement a code formatter in the end.
At first, adding an extra parameter to the AST types seemed like a suitable idea (this is basically what was suggested in this answer). For example, instead of having
data Expr = Add Expr Expr | ...
one would have
data Expr a = Add a Expr Expr
and use a for whatever annotation (e.g. for comments that come after the expression).
However, there are some not so exciting cases. The language features C-like comments (// ..., /* .. */) and a simple for loop like this:
for (i in 1:10)
{
... // list of statements
}
Now, excluding the body there are at least 10 places where one could put one (or more) comments:
/*A*/ for /*B*/ ( /*C*/ i /*E*/ in /*F*/ 1 /*G*/ : /*H*/ 10 /*I*/ ) /*J*/
{ /*K*/
...
In other words, while the for loop could previously be comfortably represented as an identifier (i), two expressions (1 & 10) and a list of statements (the body), we would now at least had to include 10 more parameters or records for annotations.
This get ugly and confusing quite quickly, so I wondered whether there is a clear better way to handle this. I'm certainly not the first person wanting to write a code formatter that preserves comments, so there must be a decent solution or is writing a formatter just that messy?
You can probably capture most of those positions with just two generic comment productions:
Expr -> Comment Expr
Stmt -> Comment Stmt
This seems like it ought to capture comments A, C, F, H, J, and K for sure; possibly also G depending on exactly what your grammar looks like. That only leaves three spots to handle in the for production (maybe four, with one hidden in Range here):
Stmt -> "for" Comment "(" Expr Comment "in" Range Comment ")" Stmt
In other words: one before each literal string but the first. Seems not too onerous, ultimately.

How do fixpoint, variable, let and tag schema constructors work in Winery?

I had previously asked where the Winery types are indexed. I noticed that in the serialization for the schema for Bool, which is [4,6], the 4 is the version number, and 6 is the index of SBool in SchemaP. I verified the hypothesis using other "primitive" types like Integer (serialization: 16), Double (18), Text (20). Also, [Bool] will be SVector SBool, serialized to [4,2,6], which makes sense: the 2 is for SVector, the 6 is for SBool.
But SchemaP also contains constructors that I don't intuitively see how are used: SFix, SVar, STag and SLet. What are they, and which type would I need to construct the schema for, to see them used? Why is SLet at the end, but SFix at the beginning?
SFix looks like a µ quantifier for a recursive type. The type µx. T is the type T where x refers to the whole type µx. T. For example, a list data List a = Nil | Cons a (List a) can be represented as L(a) = µr. 1 + a × r, where the recursive occurrence of the type is replaced with the variable r. You could probably see this with a recursive user-defined type like data BinTree a = Leaf | Branch a (BinTree a) (BinTree a).
This encoding doesn’t explicitly include a variable name, because the next constructor SVar specifies that “SVar n refers to the nth innermost fixpoint”, where n is an Int in the synonym type Schema = SchemaP Int. This is a De Bruijn index. If you had some nested recursive types like µx. µy. … = SFix (SFix …), then the inner variable would be referenced as SVar 0 and the outer one as SVar 1 within the body …. This “relative” notation means you can freely reorganise terms without worrying about having to rename variables for capture-avoiding substitution.
SLet is a let binding, and since it’s specified as SLet !(SchemaP a) !(SchemaP a), I presume that SLet e1 e2 is akin to let x = e1 in e2, where the variable is again implicit. So I suspect this may be a deficiency of the docs, and SVar can also refer to Let-bound variables. I don’t know how they use this constructor, but it could be used to make sharing explicit in the schema.
Finally, STag appears to be a way to attach extra “tag” metadata within the schema, in some way that’s specific to the library.
The ordering of these constructors might be maintained for compatibility with earlier versions, so adding new constructors at the end would avoid disturbing the encoding, and I figure the STag and SLet constructors at the end were simply added later.

How does Haskell know whether a data type declaration is a variable or a named type?

Take a data type declaration like
data myType = Null | Container TypeA v
As I understand it, Haskell would read this as myType coming in two different flavors. One of them is Null which Haskell interprets just as some name of a ... I guess you'd call it an instance of the type? Or a subtype? Factor? Level? Anyway, if we changed Null to Nubb it would behave in basically the same way--Haskell doesn't really know anything about null values.
The other flavor is Container and I would expect Haskell to read this as saying that the Container flavor takes two fields, TypeA and v. I expect this is because, when making this type definition, the first word is always read as the name of the flavor and everything that follows is another field.
My question (besides: did I get any of that wrong?) is, how does Haskell know that TypeA is a specific named type rather than an un-typed variable? Am I wrong to assume that it reads v as an un-typed variable, and if that's right, is it because of the lower-case initial letter?
By un-typed I mean how the types appear in the following type-declaration for a function:
func :: a -> a
func a = a
First of all, terminology: "flavors" are called "cases" or "constructors". Your type has two cases - Null and Container.
Second, what you call "untyped" is not really "untyped". That's not the right way to think about it. The a in declaration func :: a -> a does not mean "untyped" the same way variables are "untyped" in JavaScript or Python (though even that is not really true), but rather "whoever calls this function chooses the type". So if I call func "abc", then I have chosen a to be String, and now the compiler knows that the result of this call must also be String, since that's what the func's signature says - "I take any type you choose, and I return the same type". The proper term for this is "generic".
The difference between "untyped" and "generic" is that "untyped" is free-for-all, the type will only be known at runtime, no guarantees whatsoever; whereas generic types, even though not precisely known yet, still have some sort of relationship between them. For example, your func says that it returns the same type it takes, and not something random. Or for another example:
mkList :: a -> [a]
mkList a = [a]
This function says "I take some type that you choose, and I will return a list of that same type - never a list of something else".
Finally, your myType declaration is actually illegal. In Haskell, concrete types have to be Capitalized, while values and type variables are javaCase. So first, you have to change the name of the type to satisfy this:
data MyType = Null | Container TypeA v
If you try to compile this now, you'll still get an error saying that "Type variable v is unknown". See, Haskell has decided that v must be a type variable, and not a concrete type, because it's lower case. That simple.
If you want to use a type variable, you have to declare it somewhere. In function declaration, type variables can just sort of "appear" out of nowhere, and the compiler will consider them "declared". But in a type declaration you have to declare your type variables explicitly, e.g.:
data MyType v = Null | Container TypeA v
This requirement exist to avoid confusion and ambiguity in cases where you have several type variables, or when type variables come from another context, such as a type class instance.
Declared this way, you'll have to specify something in place of v every time you use MyType, for example:
n :: MyType Int
n = Null
mkStringContainer :: TypeA -> String -> MyType String
mkStringContainer ta s = Container ta s
-- Or make the function generic
mkContainer :: TypeA -> a -> MyType a
mkContainer ta a = Container ta a
Haskell uses a critically important distinction between variables and constructors. Variables begin with a lower-case letter; constructors begin with an upper-case letter1.
So data myType = Null | Container TypeA v is actually incorrect; the first symbol after the data keyword is the name of the new type constructor you're introducing, so it must start with a capital letter.
Assuming you've fixed that to data MyType = Null | Container TypeA v, then each of the alternatives separated by | is required to consist of a data constructor name (here you've chosen Null and Container) followed by a type expression for each of the fields of that constructor.
The Null constructor has no fields. The Container constructor has two fields:
TypeA, which starts with a capital letter so it must be a type constructor; therefore the field is of that concrete type.
v, which starts with a lowercase letter and is therefore a type variable. Normally this variable would be defined as a type parameter on the MyType type being defined, like data MyType v = Null | Container TypeA v. You cannot normally use free variables, so this was another error in your original example.2
Your data declaration showed how the distinction between constructors and variables matters at the type level. This distinction between variables and constructors is also present at the value level. It's how the compiler can tell (when you're writing pattern matches) which terms are patterns it should be checking the data against, and which terms are variables that should be bound to whatever the data contains. For example:
lookAtMaybe :: Show a => Maybe a -> String
lookAtMaybe Nothing = "Nothing to see here"
lookAtMaybe (Just x) = "I found: " ++ show x
If Haskell didn't have the first-letter rule, then there would be two possible interpretations of the first clause of the function:
Nothing could be a reference to the externally-defined Nothing constructor, saying I want this function rule to apply when the argument matches that constructor. This is the interpretation the first-letter rule mandates.
Nothing could be a definition of an (unused) variable, representing the function's argument. This would be the equivalent of lookAtMaybe x = "Nothing to see here"
Both of those interpretations are valid Haskell code producing different behaviour (try changing the capital N to a lower case n and see what the function does). So Haskell needs a rule to choose between them. The designers chose the first-letter rule as a way of simply disambiguating constructors from variables (that is simple to both the compiler and to human readers) without requiring any additional syntactic noise.
1 The rule about the case of the first letter applies to alphanumeric names, which can only consist of letters, numbers, and underscores. Haskell also has symbolic names, which consists only of symbol characters like +, *, :, etc. For these, the rule is that names beginning with the : character are constructors, while names beginning with another character are variables. This is how the list constructor : is distinguished from a function name like +.
2 With the ExistentialQuantification extension turned on it is possible to write data MyType = Null | forall v. Container TypeA v, so that the the constructor has a field with a variable type and the variable does not appear as a parameter to the overall type. I'm not going to explain how this works here; it's generally considered an advanced feature, and isn't part of standard Haskell code (which is why it requires an extension)

Bison/Flex, reduce/reduce, identifier in different production

I am doing a parser in bison/flex.
This is part of my code:
I want to implement the assignment production, so the identifier can be both boolean_expr or expr, its type will be checked by a symbol table.
So it allows something like:
int a = 1;
boolean b = true;
if(b) ...
However, it is reduce/reduce if I include identifier in both term and boolean_expr, any solution to solve this problem?
Essentially, what you are trying to do is to inject semantic rules (type information) into your syntax. That's possible, but it is not easy. More importantly, it's rarely a good idea. It's almost always best if syntax and semantics are well delineated.
All the same, as presented your grammar is unambiguous and LALR(1). However, the latter feature is fragile, and you will have difficulty maintaining it as you complete the grammar.
For example, you don't include your assignment syntax in your question, but it would
assignment: identifier '=' expr
| identifier '=' boolean_expr
;
Unlike the rest of the part of the grammar shown, that production is ambiguous, because:
x = y
without knowing anything about y, y could be reduced to either term or boolean_expr.
A possibly more interesting example is the addition of parentheses to the grammar. The obvious way of doing that would be to add two productions:
term: '(' expr ')'
boolean_expr: '(' boolean_expr ')'
The resulting grammar is not ambiguous, but it is no longer LALR(1). Consider the two following declarations:
boolean x = (y) < 7
boolean x = (y)
In the first one, y must be an int so that (y) can be reduced to a term; in the second one y must be boolean so that (y) can be reduced to a boolean_expr. There is no ambiguity; once the < is seen (or not), it is entirely clear which reduction to choose. But < is not the lookahead token, and in fact it could be arbitrarily distant from y:
boolean x = ((((((((((((((((((((((y...
So the resulting unambiguous grammar is not LALR(k) for any k.
One way you could solve the problem would be to inject the type information at the lexical level, by giving the scanner access to the symbol table. Then the scanner could look a scanned identifier token in the symbol table and use the information in the symbol table to decide between one of three token types (or more, if you have more datatypes): undefined_variable, integer_variable, and boolean_variable. Then you would have, for example:
declaration: "int" undefined_variable '=' expr
| "boolean" undefined_variable '=' boolean_expr
;
term: integer_variable
| ...
;
boolean_expr: boolean_variable
| ...
;
That will work but it should be obvious that this is not scalable: every time you add a type, you'll have to extend both the grammar and the lexical description, because the now the semantics is not only mixed up with the syntax, it has even gotten intermingled with the lexical analysis. Once you let semantics out of its box, it tends to contaminate everything.
There are languages for which this really is the most convenient solution: C parsing, for example, is much easier if typedef names and identifier names are distinguished so that you can tell whether (t)*x is a cast or a multiplication. (But it doesn't work so easily for C++, which has much more complicated name lookup rules, and also much more need for semantic analysis in order to find the correct parse.)
But, honestly, I'd suggest that you do not use C -- and much less C++ -- as a model of how to design a language. Languages which are hard for compilers to parse are also hard for human beings to parse. The "most vexing parse" continues to be a regular source of pain for C++ newcomers, and even sometimes trips up relatively experienced programmers:
class X {
public:
X(int n = 0) : data_is_available_(n) {}
operator bool() const { return data_is_available_; }
// ...
private:
bool data_is_available_;
// ...
};
X my_x_object();
// ...
if (!x) {
// This code is unreachable. Can you see why?
}
In short, you're best off with a language which can be parsed into an AST without any semantic information at all. Once the parser has produced the AST, you can do semantic analyses in separate passes, one of which will check type constraints. That's far and away the cleanest solution. Without explicit typing, the grammar is slightly simplified, because an expr now can be any expr:
expr: conjunction | expr "or" conjunction ;
conjunction: comparison | conjunction "and" comparison ;
comparison: product | product '<' product ;
product: factor | product '*' factor ;
factor: term | factor '+' term ;
term: identifier
| constant
| '(' expr ')'
;
Each action in the above would simply create a new AST node and set $$ to the new node. At the end of the parse, the AST is walked to verify that all exprs have the correct type.
If that seems like overkill for your project, you can do the semantic checks in the reduction actions, effectively intermingling the AST walk with the parse. That might seem convenient for immediate evaluation, but it also requires including explicit type information in the parser's semantic type, which adds unnecessary overhead (and, as mentioned, the inelegance of letting semantics interfere with the parser.) In that case, every action would look something like this:
expr : expr '+' expr { CheckArithmeticCompatibility($1, $3);
$$ = NewArithmeticNode('+', $1, $3);
}

Can I have a value constructor named "/""?

I've declared a recursive data type with the following structure:
data Path = GET | POST | Slash Path String
I'd really like to rename that last value constructor to / so that I can infix it in cute expressions like GET /"controller"/"action". However, if I try to do so:
import Prelude hiding ((/))
infixr 5 /
data Path = GET | POST | Path / String
...then I get this:
Path.hs:4:30: parse error on input `/'
Those same three lines compile just fine if I replace / with :/ or any other special character sequence beginning with :.
So, is there any way I can name my value constructor /? I know that I can just name it Slash and then declare a separate function:
(/) :: Path -> String -> Path
(/) = Slash
...but that won't let me pattern match, as in:
request :: Path -> String
request path = case path of GET /"hello" -> "Hello!"
GET /"goodbye" -> "Goodbye!"
Short answer: No.
Long answer: Type classes, type names, and data constructors must begin with either a capital letter or a colon (some of this requires using a language extension). Everything else must begin with a lowercase letter or any other allowed symbol.
Note that type variables, which are normally lowercase identifiers, follow the same rules and do not begin with a colon.
See also the GHC user's guide for enabling type operators. Data constructors are always allowed, I think.
Personally, in your case I'd just use (:/). It doesn't look that bad, and after a while you get used to ignoring the colons. Some people like a trailing colon as well, especially if the data is "symmetric" in some sense.
No, you can't do this. In pure Haskell 98, user-defined type names and constructors must be alphanumeric and begin with an uppercase letter; this is in section 4.1.2 of the Haskell 98 Report. In GHC, just as user-defined constructors with alphanumeric names must begin with an uppercase letter, user-defined constructors which are operators must begin with a :.1 (The same is true for user-defined type names.) This is documented in section 7.4.2 of the GHC manual. I'd probably use :/, myself, with or without / as a synonym.
1: The reason for the "user-defined" qualification is that there are a few built-in exceptions: ->, [], (), and the tuple types (,), (,,), etc. as type names; and () and the tuple type constructors (,), (,,), etc., as type constructors
I think all constructor operators need to start with a colon, (but I may be wrong).
So you could do:
data Path = GET | POST | Path :/ String

Resources