I'm trying to build a v4 grammar for an existing DSL, and am a bit out of my depth. I've tried everything I could think with no luck. We can have a function call like foo(param1, param2);, which I have working. There is an optional construct like foo(y, z) x 100; which means to call the fx 100 times (the x is the literal token, great choice eh!) That's what I can't get to work.
My func_call now looks like this: func_call: Identifier '(' arg_list ')';
Adding a (('x'|'X') expr)? and variations thereof didn't work. It starts to get confused by variables named x.
If it helps, an old yacc grammar for this language had this: rep: func_call REP expr; (where REP is x) Any help would be appreciated. thanks!
Make Identifier a parser rule rather than a lexer rule. This way, the lexer always matches x as a Rep, even if it is contained in identifier. Here is one solution:
grammar Foo;
func_call : identifier '(' arg_list? ')' (Rep expr)? ;
arg_list : identifier (',' identifier)* ;
expr : //TODO implement
;
identifier : idFront (idFront | Digit)* ;
idFront : Rep | OtherThanRep | '_' ;
Digit : [0-9] ;
Rep : 'x' | 'X';
OtherThanRep : [a-wA-W] | 'y' | 'z' | 'Y' | 'Z' ;
WS : [ \t\f\r\n] ->skip;
The generated parser successfully parses x(x,x) x 100
Related
I am a newbie to ANTLR4 and language compilers. I am working on building a language compiler using ANTLR4 Java. I have a small problem with parsing strings. The reserved words/ Tokens are getting matched instead of string. For eg: IF is a keyword token in my lexer but how to use "if" as a string?
Lexer file:
lexer grammar testgrammar;
IF : I F;
ENDIF : E N D I F;
ELSE : E L S E;
CASE : C A S E;
ENDCASE : E N D C A S E;
BREAK : B R E A K;
SWITCH : S W I T C H;
SUBSTRING : S U B S T R I N G;
COMMA : ',' ;
SEMI : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
DOT : '.' ;// ('.' {$setType(DOTDOT);})? ;
LCURLY : '{' ;
RCURLY : '}' ;
AND : '&&' ;
OR : '||' ;
DOUBLEQUOTES : '"' ;
COMPARATOR : '=='| '>=' | '>' | '<' | '<=' | '!=' ;
SYMBOLS : '§' | '$' | '%' | '/' | '=' | '?' | '#' | '_' | '#' | '€';
LETTER : [A-Za-z\u00e4\u00c4\u00d6\u00f6\u00dc\u00fc\u00df];
NUMERICVALUE : NUMBER ('.' NUMBER)?;
STRING_LITERAL : '\'' ('\'\'' | ~('\''))* '\'';
NOTCONDITION : NOT;
OPERATORS : OPERATOR;
COMMENT : (('/*' .*? '*/') | ('//' ~[\r\n]*)) -> skip;
WS : (' ' | '\t' | '\r' | '\n')+ -> skip;
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment NUMBER:[0-9]+;
fragment OPERATOR: ('+'|'-'|'&'|'*'|'~');
fragment NOT: ('!');
grammar:
parser grammar testParser;
symbolCharacters: (SYMBOLS | operators) ;
word:
( symbolCharacters | LETTER )+
;
wordList:
word+
;
I am not supposed share full grammar. But i have shared enough information i guess. I can understand that the words are formed from LETTERS and Symbol characters. One workaround i can do is making word rule like:
word:
( symbolCharacters | LETTER | IF | SWITCH | CASE | ELSE | BREAK )+
;
I have a lot of tokens. I dont want to add everything individually. Is there any other nice way to accomplish this?
Valid expression
Error expression
How to make the parser ignore the keywords inside the string?
Your same grammar does not have the problem you describe:
➜ antlr4 testgrammar.g4
➜ javac *.java
➜ echo "if 'if' endif" | grun testgrammar tokens -tokens
[#0,0:1='if',<IF>,1:0]
[#1,3:6=''if'',<STRING_LITERAL>,1:3]
[#2,8:12='endif',<ENDIF>,1:8]
[#3,14:13='<EOF>',<EOF>,2:0]
(perhaps you have inadvertently "corrected" the problem as you trimmed your grammar down, so I'll elaborate a bit.)
In short, during the lexing/tokenization phase of ANTLR parsing your input, ANTLR will, naturally, attempt to match you Lexer rules. If ANTLR finds a match of multiple rules for the current characters of your input stream, it follows two rules to determine a "winner".
If a rule matches a longer sequence of input characters, then that rule will be used.
If two rules match the same number of input characters, then the rule appearing first in your grammar will be used.
In your case, neither really comes into play as the grammar, when it reaches the ', will attempt to complete the STRING_LITERAL rule, and will find a match for the characters 'if'. It will never even attempt to match you IF lexer rule.
BTW, I did have to correct the symbolCharacters parser rule to be
symbolCharacters: (SYMBOLS | OPERATORS);
My lexer (target language C++) contains a simple rule for parsing a string literal:
STRING: '"' ~'"'+ '"';
But based on the value returned by a function, I want my lexer to return either a STRING or an IDENT.
I've tried the following:
STRING_START: '"' -> mode(current_string_mode());
or
STRING_START: '"' -> mode(current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING) ;
In either case, I get an error when trying to generate the lexer (error message says:'"' came as a complete surprise)
Alas, that is not possible.
If I look at the grammar of ANTLR itself, I see this:
lexerCommands
: RARROW lexerCommand (COMMA lexerCommand)*
;
lexerCommand
: lexerCommandName LPAREN lexerCommandExpr RPAREN
| lexerCommandName
;
lexerCommandName
: identifier
| MODE
;
lexerCommandExpr
: identifier
| INT
;
In short: the part between parenthesis (mode(...) or pushMode(...)) must be an identifier, or an integer literal. It cannot be an expression (what you're trying to do).
I am creating an interpreter in Java using ANTLR. I have a grammar which I have been using for a long time and I have built a lot of code around classes generated from this grammar.
In the grammar is 'false' defined as a literal, and there is also definition of variable name which allows to build variable names from digits, numbers, underscores and dots (see the definition bellow).
The problem is - when I use 'false' as a variable name.
varName.nestedVar.false. The rule which marks false as falseLiteral takes precedence.
I tried to play with the white spaces, using everything I found on the internet. Solution when I would remove WHITESPACE : [ \t\r\n] -> channel (HIDDEN); and use explicit WS* or WS+ in every rule would work for the parser, but I would have to adjust a lot of code in the AST visitors. I try to tell boolLiteral rule that it has to have some space before the actual literal like WHITESPACE* trueLiteral, but this doesn't work when the white spaces are sent to the HIDDEN channel. And again disable it altogether = lot of code rewriting. (Since I often rely on the order of tokens.) I also tried to reorder non-terminals in the literal rule but this had no effect whatsoever.
...
literal:
boolLiteral
| doubleLiteral
| longLiteral
| stringLiteral
| nullLiteral
| varExpression
;
boolLiteral:
trueLiteral | falseLiteral
;
trueLiteral:
TRUE
;
falseLiteral:
FALSE
;
varExpression:
name=qualifiedName ...
;
...
qualifiedName:
ID ('.' (ID | INT))*
...
TRUE : [Tt] [Rr] [Uu] [Ee];
FALSE : [Ff] [Aa] [Ll] [Ss] [Ee];
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
POINT : '.' ;
...
WHITESPACE : [ \t\r\n] -> channel (HIDDEN);
My best bet was to move qualifiedName definition to the lexer lure
qualifiedName:
QUAL_NAME
;
QUAL_NAME: ID ('.' (ID | INT))* ;
Then it works for
varName.false AND false
varName.whatever.ntimes AND false
Result is correct -> varExpression->quilafiedName on the left-hand side and boolLiteral -> falseLiteral on the right-hand side.
But with this definition this doesn't work, and I really don't know why
varName AND false
Qualified name without . returns
line 1:8 no viable alternative at input 'varName AND'
Expected solution would be ether enable/disable whitespace -> channel{hiddne} for specific rules only
Tell the boolLiteral rule that it canNOT start start with dot, someting like ~POINT falseLiteral, but I tried this as well and with no luck.
Or get qualifiedName working without dot when the rule is moved to the lexer rule.
Thanks.
You could do something like this:
qualifiedName
: ID ('.' (anyId | INT))*
;
anyId
: ID
| TRUE
| FALSE
;
in a little test-parser I just wrote, I encountered a weird problem, which I don't quite understand.
Stripping it down to the smallest example showing the problem, let's start with the following grammar:
Testing.g4:
grammar Testing;
cscript // This is the construct I shortened
: (statement_list)* ;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
block : '{' statement_list? '}' ;
expression
: left=expression op=('*'|'/') right=expression # arithmeticExpression
| left=expression op=('+'|'-') right=expression # arithmeticExpression
| left=expression op=Comparison_operator right=expression # comparisonExpression
| ID # variableValueExpression
| constant # ignore // will be executed with the rule name
;
assignment_statement
: ID op=Assignment_operator expression
;
constant
: INT
| REAL;
Assignment_operator : ('=' | '+=' | '-=') ;
Comparison_operator : ('<' | '>' | '==' | '!=') ;
Comment : '//' .*? '\n' -> skip;
fragment NUM : [0-9];
INT : NUM+;
REAL
: NUM* '.' NUM+
| '.' NUM+
| INT
;
ID : [a-zA-Z_] [a-zA-Z_0-9]*;
WS : [ \t\r\n]+ -> skip;
Using the input
z = x + y;
everything is fine, we get a parse tree which goes from cscript to statement_list, statement, assignment_statement, id and expression. Great!
Now, if I add the possibility to declare variables, all goes down the drain:
This is the change to the grammar:
cscript
: (statement_list | variable_declaration ';')* ;
variable_declaration
: type ID ('=' expression)?
;
type
: 'int'
| 'real'
;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
// (continue as before)
All of a sudden, the same test-input gets wrongly dissected into two statement_lists, each continued to a statement with a "missing ';'" warning, the first going to an incomplete assignment_statement of "z =" and the second to an incomplete assignment_statement "x +".
My attempt to show the parse tree in text-form:
cscript
statement_list
statement
assignment_statement
'z'
'=' [marked as error]
[warning: missing ';']
statement_list
statement
assignment_statement
'x'
'+' [marked as error]
'y' [marked as error]
';'
Can anyone tell me what the problem is? (And how to fix it? ;-))
Edit on 2016-12-26, after Mike's comment:
After replacing all implicit lexer rules with explicit declarations, all of a sudden, the input "z = x + y" worked. (thumbs up)
The next thing I did was restoring more of the original example I had in mind, and adding a new input line
int x = 22;
to the input (which worked previously, but did not make it into the minimal example). Now, that line fails. This is the -token output of the test rig:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
[#3,8:9='22',<20>,1:8]
[#4,10:10=';',<12>,1:10]
[#5,13:13='z',<22>,2:0]
[#6,15:15='=',<1>,2:2]
[#7,17:17='x',<22>,2:4]
[#8,19:19='+',<18>,2:6]
[#9,21:21='y',<22>,2:8]
[#10,22:22=';',<12>,2:9]
[#11,25:24='<EOF>',<-1>,3:0]
line 1:6 mismatched input '=' expecting '='
As the problem seemed to be in the variable_declaration part, I even tried to split this into two parsing rules like this:
cscript
: (statement_list | variable_declaration_and_assignment SEMICOLON | variable_declaration SEMICOLON)* ;
variable_declaration_and_assignment
: type ID EQUAL expression
;
variable_declaration
: type ID
;
With the result:
line 1:6 no viable alternative at input 'intx='
Still stuck :-(
BTW: Splitting the "int x = 22;" into "int x;" and "x = 22;" works. sigh
Edit on 2016-12-26, after Mike's next comment:
Double-checked, and everything is lexer rules. Still, the mismatch between '=' and '=' (which I unfortunately cannot reconstruct anymore) gave me the idea to check the token types. The current status is:
(Shortened grammar)
cscript
: (statement_list | variable_declaration)* ;
...
variable_declaration
: type ID (EQUAL expression)? SEMICOLON
;
...
Assignment_operator : (EQUAL | PLUS_EQ | MINUS_EQ) ;
// among others
PLUS_EQ : '+=';
MINUS_EQ : '-=';
EQUAL: '=';
...
Shortened output:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
...
line 1:6 mismatched input '=' expecting ';'
Here, if I understand this correctly, the '=' is parsed to token type 1, which - according to the lexer.tokens output - is Assignment_Operator, while the expected EQUAL would be 13.
Might this be the problem?
Ok, seems the main take away here is: think about your definitions and how you define them. Create explicit lexer rules for your literals instead of defining them implicitly in the parser rules. Check the token values you get from the lexer if the parser gives you weird errors, because they must be correct in the first place or your parse has no chance to do its job.
Goal
I want to reduce (or eliminate) the Java-specific actions and predicates in my parser. Perhaps it isn't possible, but I wanted to ask here just in case there's some ANTLR4 feature I've missed. (The language itself is third-party, so I don't have control over that.)
Simplified example
The predicates I want to use are mostly exact (or perhaps case-insensitive) string-matching. I could make big parallel sets of parser rules, but I'd rather not since the real-life example is considerably more convoluted.
Suppose I'm given something like:
isWidget(int) : "Whether it is a widget" : 4 ;
ownerFirstName(string) : "john" ;
ownerLastName(string) : "This is the last-name of the owner" : "doe" ;
I want the parser to look at the default-value (the last item on the line, like 4, "john" or "doe") and parse it based on the earlier type (int), (string), (string).
main
: stmt SEMIC (stmt SEMIC)* EOF
;
stmt
: propname=IDENTIFIER LPAREN datatype=IDENTIFIER RPAREN (COLON description=QUOTSTRING)? COLON df=defaultVal
;
defaultVal
: QUOTSTRING //TODO only this alt if datatype=string
| NUM //TODO only this alt if datatype=int
;
fragment Letter : 'a'..'z' | 'A'..'Z' ;
fragment Digit : '0'..'9' ;
fragment Underscore : '_' ;
SEMIC : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
IDENTIFIER : (Letter|Underscore) (Letter|Underscore|Digit)* ;
QUOTSTRING : '"' ~('"' |'\n' | '\r' | '\u2029' | '\u2028')* '"' ;
NUM : Digit+ ;
WS : [ \t\n\r]+ -> skip ;
I know I can do it with predicates and rule inputs, but then I'm crossing the line from a language-agnostic grammar to one with embedded Java code.
Your parser should handle things like the following without a problem:
isWidget(int) : "Whether it is a widget" : "foo" ;
In other words, do not add a predicate that would fail in this case, or you will lose the ability to report sane error messages. Instead, use a language-specific listener or visitor implementation after the parse is complete to report a semantic error if the type of the default value does not match the declared type.