I have this grammar section in a Happy parser, taken from the official Happy site, but I need a deeper explanation of the meaning of the rules in braces. Here is the token definition:
%token
let { TokenLet }
in { TokenIn }
int { TokenInt $$ }
var { TokenVar $$ }
'=' { TokenEq }
'+' { TokenPlus }
'-' { TokenMinus }
'*' { TokenTimes }
'/' { TokenDiv }
'(' { TokenOB }
')' { TokenCB }
and here is the grammar section:
Exp : let var '=' Exp in Exp { Let $2 $4 $6 }
| Exp1 { Exp1 $1 }
Exp1 : Exp1 '+' Term { Plus $1 $3 }
| Exp1 '-' Term { Minus $1 $3 }
| Term { Term $1 }
Term : Term '*' Factor { Times $1 $3 }
| Term '/' Factor { Div $1 $3 }
| Factor { Factor $1 }
Factor
: int { Int $1 }
| var { Var $1 }
| '(' Exp ')' { Brack $2 }
My understanding is that the lexer, defined further down in the file, should produce only tokens of the types defined here, and that the parser then builds the parse tree using the grammar. But what exactly does "{ Let $2 $4 $6 }" mean? I know that $2 refers to the second rule argument and so on, but if someone could give me a "human-readable version" of the rules I would be really happy. I hope I've been clear.
Thanks in advance.
In the %token section the left column is the token names used elsewhere in the grammar, and the right is a pattern that can be used in a case statement. Where you see $$ Happy will substitute its own variable. So if the resulting parser is expecting an Integer at some point then Happy will have a case statement with a pattern including TokenInt v1234 where the v1234 bit is a variable name created by Happy.
The "Let" is the constructor for the grammar expression being recognised. If you look a little lower in the example page you will see
data Exp
= Let String Exp Exp
| Exp1 Exp1
deriving Show
So the Let constructor takes a string and two sub-expressions (of type 'Exp'). If you look at the grammar you can see that there are six elements in the let rule. The first is just the keyword token "let". The generated parser uses it to figure out that it's looking at a "let" clause, but the resulting parse tree doesn't need it, so $1 doesn't appear. Instead, the first argument to the Let constructor has to be the name of the variable being bound, which is the second item in the grammar rule. Hence this is $2. The other arguments are the two sub-expressions, which are $4 and $6 by the same logic. Both of these can be arbitrarily complex expressions: Happy figures out where they begin and end and parses them by following the other rules for what constitutes an expression.
This line is one rule for creating (parsing) the production Exp:
Exp : let var '=' Exp in Exp { Let $2 $4 $6 }
It corresponds to the rule:
if you see "let" ($1)
followed by a variable name ($2)
followed by "=" ($3)
followed by an Exp ($4)
followed by "in" ($5)
followed by another Exp ($6)
then return the value Let $2 $4 $6. The $n parameters will be replaced with the values of each sub-production. So if this rule is matched, the Let function (which is probably some data constructor) will be called with:
the value of the var token as the first parameter,
the first Exp parsed ($4) as the second parameter
and the second parsed Exp ($6) as the third parameter.
I believe here the value of the var token is the variable name.
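To make the $n numbering concrete, here is a minimal sketch in Java (not Haskell, and not what Happy actually generates) of what the action { Let $2 $4 $6 } amounts to. The class name, the records and reduceLet are hypothetical stand-ins, and the Exp1/Term/Factor layers of the real example are flattened into a single Exp type for brevity.

```java
import java.util.List;

public class LetActionSketch {
    // Hypothetical stand-ins for the Haskell constructors in the example.
    public interface Exp {}
    public record Var(String name) implements Exp {}
    public record Int(int value) implements Exp {}
    public record Let(String name, Exp bound, Exp body) implements Exp {}

    // Mimics the action { Let $2 $4 $6 }: 'values' holds the semantic value of
    // each of the six right-hand-side symbols, so $n is values.get(n - 1).
    public static Exp reduceLet(List<Object> values) {
        String name = (String) values.get(1); // $2: payload of the var token
        Exp bound = (Exp) values.get(3);      // $4: the first Exp
        Exp body = (Exp) values.get(5);       // $6: the second Exp
        return new Let(name, bound, body);    // $1, $3 and $5 are simply not used
    }

    public static void main(String[] args) {
        // Symbol values for the input: let x = 1 in x
        List<Object> values = List.of("let", "x", "=", new Int(1), "in", new Var("x"));
        System.out.println(reduceLet(values));
    }
}
```

The keyword and punctuation tokens still occupy positions 1, 3 and 5; the action just never mentions them.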
Related
My lexer (target language C++) contains a simple rule for parsing a string literal:
STRING: '"' ~'"'+ '"';
But based on the value returned by a function, I want my lexer to return either a STRING or an IDENT.
I've tried the following:
STRING_START: '"' -> mode(current_string_mode());
or
STRING_START: '"' -> mode(current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING) ;
In either case, I get an error when trying to generate the lexer (the error message says: '"' came as a complete surprise).
Alas, that is not possible.
If I look at the grammar of ANTLR itself, I see this:
lexerCommands
: RARROW lexerCommand (COMMA lexerCommand)*
;
lexerCommand
: lexerCommandName LPAREN lexerCommandExpr RPAREN
| lexerCommandName
;
lexerCommandName
: identifier
| MODE
;
lexerCommandExpr
: identifier
| INT
;
In short: the part between parentheses (in mode(...) or pushMode(...)) must be an identifier or an integer literal. It cannot be an expression, which is what you're trying to use.
In my grammar, I want to have both "variable identifiers" and "function identifiers". Essentially, I want to be less restrictive about the characters allowed in function identifiers. However, I am running into the issue that all variable identifiers are also valid function identifiers.
As an example, say I want to allow uppercase letters in a function identifier but not in a variable identifier. My current (presumably naive) grammar might look like:
prog : 'func' FunctionId
| 'var' VariableId
;
FunctionId : [a-zA-Z]+ ;
VariableId : [a-z]+ ;
With the above rules, var hello fails to parse. If I understand correctly, this is because FunctionId is defined first, so "hello" is treated as a FunctionId.
Can I make antlr choose the more specific valid rule?
An explanation why your grammar does not work as expected could be found here.
You can solve this with semantic predicates:
grammar Test;
prog : 'func' functionId
| 'var' variableId
;
functionId : Id;
variableId : {isVariableId(getCurrentToken().getText())}? Id ;
Id : [a-zA-Z]+;
On the lexer level there will be only ids. On the parser level you can restrict an id to lowercase characters. isVariableId(String) would look like:
public boolean isVariableId(String text) {
    return text.matches("[a-z]+");
}
Can I make antlr choose the more specific valid rule?
No (as already mentioned). The lexer simply matches as many characters as it can, and if two or more rules match the same number of characters, the one defined first "wins". There is no way around this.
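The tie-breaking rule can be illustrated with a small simulation. This is only a sketch using java.util.regex, not ANTLR's actual lexer; the class and method names are made up, and it assumes rules simple enough that a greedy regex match is also the longest match.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexerTieBreakSketch {
    // Returns the name of the rule that wins at the start of the input:
    // the longest match wins, and on a tie the rule declared earliest wins.
    public static String winningRule(Map<String, String> rules, String input) {
        String winner = null;
        int best = 0;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            Matcher m = Pattern.compile(rule.getValue()).matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The strict '>' keeps the earlier rule when lengths are equal.
            if (m.lookingAt() && m.end() > best) {
                best = m.end();
                winner = rule.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves declaration order, like rules in a grammar file.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("FunctionId", "[a-zA-Z]+");
        rules.put("VariableId", "[a-z]+");
        System.out.println(winningRule(rules, "hello")); // both match 5 chars
        System.out.println(winningRule(rules, "Hello")); // only FunctionId matches
    }
}
```

With FunctionId declared first, "hello" is always tokenized as FunctionId, which is why the parser never sees a VariableId for it.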
I'd go for something like this:
prog : 'func' functionId
| 'var' variableId
;
functionId : LowerCaseId | UpperCaseId ;
variableId : LowerCaseId ;
LowerCaseId : [a-z]+ ;
UpperCaseId : [A-Z] [a-zA-Z]* ;
I'm writing a DSL in Xtext in which people can declare some variables. The grammar is as follows:
Cosem:
cosem+=ID '=' 'COSEM' '(' class=INT ',' version=INT ',' obis=STRING ')' ;
Attributes :
attribute+=ID '=' 'ATTRIBUTE' '(' object=ID ',' attribute_name=STRING ')' ;
Action:
action+=ID '=' 'ACTION' '(' object=ID ',' action_name=STRING ')';
The DSL has some methods, such as the print method:
Print:
'PRINT' '(' var0=(STRING|ID) (','var1+=(STRING|ID) )* ')' |
'PRINT' '(' ')'
;
I put all my variables in maps so I can use them later in my code. The key that identifies them is their ID, which is a string.
However, in my interpreter I can't tell the difference between a string and an ID:
def dispatch void exec(Print p) {
    if (LocalMapAttribute.containsKey(p.var0)) {
        print(LocalMapAttribute.get(p.var0))
    } else if (LocalMapAction.containsKey(p.var0)) {
        print(LocalMapAction.get(p.var0))
    } else if (LocalMapCosem.containsKey(p.var0)) {
        print(LocalMapCosem.get(p.var0))
    } else {
        print("erreeeur Print")
    }
    p.var1.forEach [ v |
        if (LocalMapAttribute.containsKey(v)) {
            print(LocalMapAttribute.get(v))
        } else if (LocalMapAction.containsKey(v)) {
            print(LocalMapAction.get(v))
        } else if (LocalMapCosem.containsKey(v)) {
            print(LocalMapCosem.get(v))
        } else {
            print("erreur entre print")
        }
    ]
}
For example, when I write PRINT ("attribut2",attribut2) the result should be
attribut2 "the value of attribut2"
but I get
"the value of attribut2" "the value of attribut2"
Your current grammar structure makes this hard to do, since you throw away the information at the point where you fill the map.
You can use org.eclipse.xtext.nodemodel.util.NodeModelUtils.findNodesForFeature(EObject, EStructuralFeature) to obtain the actual text (which may still contain the original value, including the quotation marks),
or you can change your grammar to
var0=Value ...
Value: IDValue | StringValue;
IDValue: value=ID;
StringValue: value=STRING;
Then you can look at the type (IDValue or StringValue) to decide whether you need to put the text into quotation marks (org.eclipse.xtext.util.Strings.convertToJavaString(String, boolean) might be helpful).
Or you can try to use a special replacement for STRINGValueConverter that does not strip the quotation marks.
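As a rough illustration of the type-based option, here is a sketch in plain Java rather than Xtend. Value, IDValue and StringValue are hypothetical stand-ins for the classes Xtext would generate from the modified grammar (the real ones are EMF interfaces with a getValue() accessor), and the single variables map stands in for the three maps in the question.

```java
import java.util.Map;

public class PrintValueSketch {
    // Hypothetical stand-ins for the types generated from
    //   Value: IDValue | StringValue;
    public interface Value { String value(); }
    public record IDValue(String value) implements Value {}
    public record StringValue(String value) implements Value {}

    // An ID is resolved through the variable map; a string literal is
    // printed as written, even if a variable with the same name exists.
    public static String render(Value v, Map<String, String> variables) {
        if (v instanceof IDValue) {
            return variables.getOrDefault(v.value(), "<undefined " + v.value() + ">");
        }
        return v.value();
    }

    public static void main(String[] args) {
        Map<String, String> variables = Map.of("attribut2", "the value of attribut2");
        // Corresponds to PRINT ("attribut2", attribut2)
        System.out.println(render(new StringValue("attribut2"), variables));
        System.out.println(render(new IDValue("attribut2"), variables));
    }
}
```

The point is that the decision is made on the node's type, not on its text, so a string that happens to equal a variable name is no longer looked up.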
I read "The Definitive ANTLR 4 Reference" and it says
While ANTLR v4 can handle direct left recursion, it can’t handle indirect left
recursion.
on page 71.
But in the JSON grammar on page 90 I see the following:
grammar JSON;
json: object
| array
;
object
: '{' pair (',' pair)* '}'
| '{' '}' // empty object
;
pair: STRING ':' value ;
array
: '[' value (',' value)* ']'
| '[' ']' // empty array
;
value
: STRING
| NUMBER
| object // indirect recursion
| array // indirect recursion
| 'true'
| 'false'
| 'null'
;
Is it correct?
The JSON grammar you mentioned is not a problem because it doesn't actually contain any indirect left recursion.
The rule value can produce array, and array can in turn produce something that contains value, but not as its leftmost part (there is a [ preceding value).
The value rule would only be a problem if there were some way for it to derive, through other rules, something that starts with value itself (value followed by any terminals and non-terminals).
From the book
A left-recursive rule is one that
either directly or indirectly invokes itself on the left edge of an alternative.
Example:
expr : expr '*' expr // match subexpressions joined with '*'
| expr '+' expr // match subexpressions joined with '+' operator
| INT // matches simple integer atom
;
It is left recursion because at least one alternative starts immediately with expr. It is also direct left recursion.
Example of indirect left recursion:
expr : addition // indirectly invokes expr left recursively via addition
| ...
;
addition : expr '+' expr
;
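The "left edge" test can be written down as a small reachability check. The following is only a sketch, not anything ANTLR does internally: the class and method names are made up, a grammar is a map from rule name to alternatives, any symbol that is not a key of the map counts as a terminal, and it assumes no alternative starts with an optional or empty-matching element.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LeftRecursionSketch {
    // True if 'rule' can reach itself as the first symbol of an alternative,
    // either directly or through other rules.
    public static boolean isLeftRecursive(Map<String, List<List<String>>> grammar, String rule) {
        Set<String> seen = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(rule);
        while (!todo.isEmpty()) {
            String current = todo.pop();
            for (List<String> alt : grammar.getOrDefault(current, List.of())) {
                if (alt.isEmpty()) continue;
                String first = alt.get(0);            // only the left edge matters
                if (first.equals(rule)) return true;  // reached itself on the left edge
                // Follow rule references once each; terminals are ignored.
                if (grammar.containsKey(first) && seen.add(first)) todo.push(first);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Abridged JSON grammar: value reaches array, but array starts with '['.
        Map<String, List<List<String>>> json = Map.of(
            "value", List.of(List.of("STRING"), List.of("object"), List.of("array")),
            "object", List.of(List.of("'{'", "pair", "'}'"), List.of("'{'", "'}'")),
            "pair", List.of(List.of("STRING", "':'", "value")),
            "array", List.of(List.of("'['", "value", "']'"), List.of("'['", "']'")));
        // The indirect example: expr -> addition -> expr '+' expr.
        Map<String, List<List<String>>> indirect = Map.of(
            "expr", List.of(List.of("addition"), List.of("INT")),
            "addition", List.of(List.of("expr", "'+'", "expr")));
        System.out.println(isLeftRecursive(json, "value"));
        System.out.println(isLeftRecursive(indirect, "expr"));
    }
}
```

For the JSON grammar the search stops at '{' and '[', so value never comes back to itself on the left edge; for the expr/addition pair it does, after one hop.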
I'm trying to build an ANTLR v4 grammar for an existing DSL, and am a bit out of my depth. I've tried everything I could think of with no luck. We can have a function call like foo(param1, param2);, which I have working. There is an optional construct like foo(y, z) x 100; which means to call the function 100 times (the x is the literal token, great choice eh!). That's what I can't get to work.
My func_call now looks like this: func_call: Identifier '(' arg_list ')';
Adding a (('x'|'X') expr)? and variations thereof didn't work. It starts to get confused by variables named x.
If it helps, an old yacc grammar for this language had this: rep: func_call REP expr; (where REP is x) Any help would be appreciated. thanks!
Make Identifier a parser rule rather than a lexer rule. This way, the lexer always matches x as a Rep, even if it is contained in identifier. Here is one solution:
grammar Foo;
func_call : identifier '(' arg_list? ')' (Rep expr)? ;
arg_list : identifier (',' identifier)* ;
expr : //TODO implement
;
identifier : idFront (idFront | Digit)* ;
idFront : Rep | OtherThanRep | '_' ;
Digit : [0-9] ;
Rep : 'x' | 'X';
OtherThanRep : [a-wA-W] | 'y' | 'z' | 'Y' | 'Z' ;
WS : [ \t\f\r\n] ->skip;
The generated parser successfully parses x(x,x) x 100