I want to parse Smalltalk.
Normally, the expressions in a sequence need a PERIOD token (.) between them as a separator, like the ';' in Java.
An expression alone does not need the PERIOD.
Hence I match this PERIOD in the expressions rule:
expressions : expression (PERIOD expression)*;
And the different sub-rules for the specific expression do not match the PERIOD by themselves.
However, there is one special type of expression, that calls to native libraries:
<primitive: ABC>
And when this is followed by another expression, the PERIOD is surprisingly not needed.
How can such a situation be handled?
Perhaps by injecting a PERIOD: from within the "primitive" rule, tell the lexer to inject a PERIOD token next. But how?
Or is there a better solution for this situation?
Frank
Perhaps something like this:
expressions
: start_expression* expression '.'?
;
start_expression
: expression '.'
| pragma
;
expression
: assignment
| pragma
;
assignment
: ID ':=' NUMBER
;
pragma
: '<' ID ':' ID '>'
;
I have a grammar where I recently added syntax for a constant OSC address --- it looks like this
OSCAddressConstant: ('/' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_')+)+;
Typical examples might be
/a/b/c
/Handle/SetValue
/1/Volume/Page3
Unfortunately, I discovered rather quickly that simple expressions with division: e.g.
foo = 20/10
now fail with type errors because the parser thinks that the /10 is an OSC address and so we get "integer" "Divide" "OSCAddressConstant"
What is the recommended (and hopefully simplest) way to disambiguate these, other than changing the actual syntax of the OSC address, which would be a pity?
Thanks in advance
(NB - I saw a similar question about ambiguity between division and regular expression syntax but I did not understand the solution - there was a reference to the use of @members but it was unclear what to do with it - I've not seen that before and other questions about @members seem to have gone unanswered)
That OSCAddressConstant rule is rather a higher level rule, like a complex identifier, possibly qualified. Such higher level constructs should go into the parser, not the lexer.
Just like you would define a qualified identifier as:
ID: [a-zA-Z][a-zA-Z0-9]*;
DOT: '.';
qualified: ID (DOT ID)?;
you can define your OSC address as:
EID: [a-zA-Z0-9_]+;
DIV: '/';
oscAddressConstant: (DIV EID)+;
The only drawback with this approach is that, if you ignore whitespace as usual, this syntax will allow constructs like: / abc / 12. If that's something you do not want, handle the whitespace in the semantic phase and throw an error there.
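The whitespace check mentioned above can be sketched outside ANTLR. This is only an illustration in Python, not generated parser code: Tok, tokenize and is_contiguous are hypothetical names, and the regular expression stands in for the DIV and EID lexer rules. It relies on the fact that ANTLR tokens carry a start index and an inclusive stop index, so two tokens are adjacent when one starts right after the other stops.

```python
import re
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    start: int  # index of the first character
    stop: int   # index of the last character (inclusive)

def tokenize(text):
    # Stand-in for the DIV and EID lexer rules; whitespace is simply skipped.
    return [Tok(m.group(), m.start(), m.end() - 1)
            for m in re.finditer(r"/|[A-Za-z0-9_]+", text)]

def is_contiguous(tokens):
    # True when every token starts right after the previous one ends,
    # i.e. no whitespace was skipped inside the address.
    return all(b.start == a.stop + 1 for a, b in zip(tokens, tokens[1:]))

print(is_contiguous(tokenize("/Handle/SetValue")))  # no gaps
print(is_contiguous(tokenize("/ abc / 12")))        # gaps between tokens
```

In a real listener or visitor the same comparison would be made on the tokens of the oscAddressConstant context, and a failed check would be reported as a semantic error.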
Calculator math operator precedence is often remembered with the mnemonic PMDAS.
The grammar on the ANTLR home page (using the same abbreviations) has the order MDASP. This isn't PMDAS or reverse PMDAS, as I would expect. E.g. this Stack Overflow answer contains a grammar that looks like PMDAS.
But no matter what expressions I put into the command line, the parse tree looks correct!
grammar Expr;
prog: (expr NEWLINE)* ;
expr: expr ('*'|'/') expr
| expr ('+'|'-') expr
| INT
| '(' expr ')'
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
How does this work?
The question is a little tricky to answer, as I'm not entirely sure what you were trying to parse, but a pseudocode reading of what this grammar expects may help you understand it:
An INT is one or more digits from 0 to 9
A NEWLINE is one or more \r or \n characters
an (expr)ession is made up of any of these:
an expression, then '*' or '/', then another expression
an expression, then '+' or '-', then another expression
an int
an opening parenthesis, an expression, and a closing parenthesis
the (prog)ram is made up of zero or more expressions, each followed by a new line.
Also remember that the ANTLR lexer:
goes for the longest sequence first. If two or more rules match the
longest possible sequence, then it chooses the lexical rule specified
first
This link may be very useful to you. Anyway if you post the tree you are struggling to understand we could try and help you further. Good luck with your project.
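On the precedence question itself: in an ANTLR 4 left-recursive rule, alternatives listed earlier bind more tightly, so putting '*' and '/' before '+' and '-' is what gives the usual precedence, and the parenthesized alternative does not compete with the operators at all. The sketch below is not what ANTLR generates. It is a small precedence-climbing parser in Python, with hypothetical names (PRECEDENCE, tokenize, parse), that assigns binding strength by alternative order in the same way and returns nested tuples as a stand-in for the parse tree.

```python
import re

# Binding strength taken from alternative order in the expr rule:
# the ('*'|'/') alternative comes first, so it binds tighter.
PRECEDENCE = {"*": 2, "/": 2, "+": 1, "-": 1}

def tokenize(text):
    return re.findall(r"\d+|[-+*/()]", text)

def parse(tokens, min_prec=1):
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 1)   # '(' expr ')' restarts at the lowest level
        tokens.pop(0)             # consume ')'
    else:
        left = int(tok)           # INT
    # Keep extending the left operand while the next operator is strong enough.
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        right = parse(tokens, PRECEDENCE[op] + 1)  # +1 makes operators left-associative
        left = (op, left, right)
    return left

print(parse(tokenize("1+2*3")))
print(parse(tokenize("(1+2)*3")))
```

With this ordering, 1+2*3 groups the multiplication first, and parentheses override it, which matches the trees the question describes as looking correct.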
I'm learning ANTLR4 and I'm confused at one point. For a Java-like language, I'm trying to add rules for constructs like member chaining, something like this:
expr1.MethodCall(expr2).MethodCall(expr3);
I'm getting an error, saying that two of my rules are mutually left-recursive:
expression
: literal
| variableReference
| LPAREN expression RPAREN
| statementExpression
| memberAccess
;
memberAccess: expression DOT (methodCall | fieldReference);
I thought I understood why the above rule combination is considered left-recursive: because memberAccess is a candidate of expression and memberAccess starts with an expression.
However, my understanding broke down when I saw (by looking at the Java example) that if I just move the contents of memberAccess to expression, I got no errors from ANTLR4 (even though it still doesn't parse what I want, seems to fall into a loop):
expression
: literal
| variableReference
| LPAREN expression RPAREN
| statementExpression
| expression DOT (methodCall | fieldReference)
;
Why is the first example left-recursive but the second isn't?
And what do I have to do to actually parse the initial line?
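The second is directly left-recursive, but not mutually left-recursive. ANTLR4 can eliminate direct left recursion (a rule with an alternative that starts with a reference to the rule itself) with a built-in rewriting algorithm. It cannot eliminate mutual left recursion, where the recursion goes through another rule, as it does with expression and memberAccess. An algorithm for that probably exists, but it would hardly preserve actions and semantic predicates.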
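To see why the directly left-recursive form is manageable, it helps to note that expression : expression DOT member | primary describes the same inputs as a primary followed by zero or more DOT member suffixes. The Python sketch below is a hand-written illustration of that loop form, not ANTLR's actual rewrite. The names tokenize and parse_expression are hypothetical, a primary is reduced to a plain identifier, and a method call takes exactly one argument.

```python
import re

def tokenize(text):
    return re.findall(r"[A-Za-z_]\w*|[().]", text)

def parse_expression(tokens):
    node = tokens.pop(0)                 # primary: a variable reference
    # (DOT member)* : each iteration wraps the tree built so far,
    # which is what makes the chain left-associative.
    while tokens and tokens[0] == ".":
        tokens.pop(0)                    # consume '.'
        name = tokens.pop(0)
        if tokens and tokens[0] == "(":
            tokens.pop(0)                # consume '('
            arg = parse_expression(tokens)
            tokens.pop(0)                # consume ')'
            node = ("call", node, name, arg)
        else:
            node = ("field", node, name)
    return node

print(parse_expression(tokenize("expr1.MethodCall(expr2).MethodCall(expr3)")))
```

The outermost node is the last call in the chain, and its receiver is the call before it.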
For some reason, ANTLRWorks 2 was not responding when my grammar had left-recursion, causing me to (erroneously) believe that my grammar was wrong.
Compiling and testing from the command line revealed that the version with immediate left-recursion did, in fact, compile and parse correctly.
(I'm leaving this here in case anyone else is confused by the behavior of the IDE.)
I have this grammar:
grammar MkSh;
script
: (statement
| targetRule
)*
;
statement
: assignment
;
assignment
: ID '=' STRING
;
targetRule
: TARGET ':' TARGET*
;
ID
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS
: ( ' '
| '\t'
| '\r'
| '\n'
) -> channel(HIDDEN)
;
STRING
: '\"' CHR* '\"'
;
fragment
CHR
: ('a'..'z'|'A'..'Z'|' ')
;
TARGET
: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'/'|'.')+
;
and this input file:
hello="world"
target: CLASSES
When running my parser I'm getting this error:
line 3:6 mismatched input ':' expecting '='
line 3:15 mismatched input ';' expecting '='
This is because the parser is taking "target" as an ID instead of a TARGET. I want the parser to choose the rule based on the separator character (':' vs '=').
How can I get that to happen?
(This is my first Antlr project so I'm open to anything.)
First, you need to know that the word target is matched as an ID token and not as a TARGET token. Both rules match the same text, and since you have written the ID rule before TARGET, the lexer will always pick ID in that case. Notice that the word target fully conforms to both the ID and TARGET lexer rules, meaning (assuming you are writing a language) that target, which is a keyword, can also be used as an id. The book "The Definitive ANTLR Reference" has a section titled "Treating Keywords As Identifiers" that deals with exactly these kinds of issues. I suggest you take a look at that. Or, if you prefer the quick answer, the solution is to use lexer modes. It would also be better to split the grammar into a parser grammar and a lexer grammar.
As @cantSleepNow alludes to, you've defined a token (TARGET) that is a lexical superset of another token (ID), and then told the lexer to tokenize a string as TARGET only if ID cannot match the same text in full. All made more obscure by the fact that ANTLR lexing rules look like ANTLR parsing rules, though they are really quite different beasts.
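The lexer's choice between ID and TARGET can be imitated in a few lines. This is a minimal Python sketch of the tie-breaking rule only (longest match wins, and on a tie the rule listed first wins), not of ANTLR's real lexer. RULES and next_token are hypothetical names, and the two patterns are transcriptions of the ID and TARGET rules from the question.

```python
import re

# Same order as in the grammar: ID is listed before TARGET.
RULES = [
    ("ID", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("TARGET", re.compile(r"[A-Za-z0-9_\-/.]+")),
]

def next_token(text, pos=0):
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly longer matches replace the current best, so on a tie
        # the rule that was listed first is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("target"))       # both rules match all six characters
print(next_token("src/main.c"))   # only TARGET can match past the '/'
```

Any word made only of letters, digits and underscores therefore comes out as ID, which is why the parser never sees a TARGET before the ':' in the question's input.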
(Warning: writing off the top of my head without testing :-)
Your real project might be more complex, but in the possibly simplified example you posted, you could defer distinguishing the two to the parsing phase, instead of distinguishing them in the lexer:
id : TARGET
{ complain if not legal identifier (e.g., contains slashes, etc.) }
;
assignment
: id '=' STRING
;
Seems like that would solve the lexing issue, and allow you to give a more intelligent error message than "syntax error" when a user gets the syntax for ID wrong. The grammar remains ambiguous, but maybe ANTLR roulette will happen to make the choice you prefer in the ambiguous case. Of course, unambiguous grammars tend to make for languages that humans find more readable, and now you can see why the classic makefile syntax requires a newline after an assignment or target rule.
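The "complain if not legal identifier" placeholder in the id rule could do something along these lines. This is a hedged sketch in Python rather than an embedded grammar action. check_identifier is a hypothetical helper, the pattern is copied from the original ID rule, and raising ValueError stands in for whatever error reporting the real parser uses.

```python
import re

# Same shape as the original ID lexer rule.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def check_identifier(token_text, line):
    # Accept the TARGET token's text only if it is also a legal identifier.
    if not _IDENTIFIER.fullmatch(token_text):
        raise ValueError(f"line {line}: '{token_text}' is not a legal identifier")
    return token_text

print(check_identifier("hello", 1))
try:
    check_identifier("src/main", 2)
except ValueError as err:
    print(err)
```

The point of doing the check here rather than in the lexer is the error message: the user is told which name is wrong and why, instead of getting a generic mismatched-input error.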
Here is a related topic for a previous ANTLR version:
Java ANTLR how to ignore part of rule? ignore part after subrule
With a lexer rule like:
R1
: [a-zA-Z0-9]* ';'
;
For example, I have this input text:
test;rezrezr
zrezrzerz
It will match "test;", which is correct. I only need the "test" string.
Do I need to take care of the ';' character manually, in a custom listener for example? Or is there a way to specify in the grammar that I want to avoid it (only using lexer rules)?
UPDATE
test1;rezrezr
zrezrzerz
test2;rezrezr
zrezrzerz
If you want to avoid the ; character, simply remove it from the lexer rule. Note that I also changed the * to a + to ensure that R1 is never a zero-length token.
R1
: [a-zA-Z0-9]+
;