How does ANTLR separate tokens? - antlr4

I have a grammar where CHAR is a token. A legal expression for my grammar is
CHAR(2).
The problem is that ANTLR seems to take the whole string CHAR(2) as a single token, and then reports an unknown-token error.
As a workaround I have to insert a space after the token, like CHAR (2).
So, how can I tell ANTLR to separate the tokens?
Thanks

With the following grammar:
expression
: STRING CHAR LPAREN NUMBER RPAREN
;
STRING: 'String';
CHAR: 'CHAR';
NUMBER: [0-9]+;
LPAREN: '(';
RPAREN: ')';
ID: [A-Za-z]+;
WS: [ \t\r\n]+ -> channel(HIDDEN);
You can parse your expression as expected.
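To see why no whitespace is needed, here is a rough Python sketch (not ANTLR itself) of how I understand the lexer to choose tokens: at each position the longest match wins, and on a tie the rule listed first wins. The rule table mirrors the grammar above; the tokenize helper is a made-up name for illustration only.

```python
import re

# Token rules in grammar order, mirroring the lexer rules above.
RULES = [
    ("STRING", r"String"),
    ("CHAR", r"CHAR"),
    ("NUMBER", r"[0-9]+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("ID", r"[A-Za-z]+"),
    ("WS", r"[ \t\r\n]+"),
]

def tokenize(text):
    """Longest match wins; on a tie, the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"token recognition error at {text[pos]!r}")
        if best_name != "WS":  # whitespace goes to the hidden channel
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("CHAR(2)"))
# [('CHAR', 'CHAR'), ('LPAREN', '('), ('NUMBER', '2'), ('RPAREN', ')')]
```

In this sketch CHAR(2) and CHAR (2) give the same tokens, while something like CHARS would come out as a single ID because ID matches more characters than CHAR.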

Related

Parsing key / value subparameters

I'm a bit clueless as to how I can parse (more or less) "free form" parameter lists. Suppose the syntax allows for
PARM=(VAL1, 'VAL2', VAL3, KEY4=VAL4, KEY5=VAL5(XYZ), PARM=ABC, SOMETHING=ELSE)
I have managed to basically parse combos of positional and key/value parameters, but as soon as I hit a lexer token like PARM= the parser bails out with a "mismatched input", and I can't specifically allow for or expect anything because these parameters passed to a function are completely arbitrary.
So I'd think I'll need to switch to a specific lexer mode but right now I can't see how I would properly switch back to "normal" mode. The delimiters are PARM=( on the left and the closing ) on the right, but as the "data" itself can contain (pairs of) brackets, how would I identify the correct closing paren so I don't prematurely end the lexer mode?
TIA - Alex
Edit 1:
Minimal grammar showing the issue with keywords being used where they shouldn't. As this is part of a complex grammar, I can't change the order of tokens to put ID in front of everything else, for example, as it would catch too much. So I don't see how this can work short of breaking out into a different lexer mode.
lexer grammar ParmLexer;
SPACE : [ \t\r\n]+ -> channel(HIDDEN) ;
COMMA : ',' ;
EQUALS : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
PARM : 'PARM=' ;
ID : ID_LITERAL ;
fragment ID_LITERAL : [A-Za-z]+ ;
.
parser grammar ParmParser;
options { tokenVocab=ParmLexer; }
parms : PARM LPAREN parm+ RPAREN ;
parm : (pkey=ID EQUALS)? pval=ID COMMA? ;
Input:
PARM=( TEST, KEY=VAL, PARM=X)
Results in
line 1:22 extraneous input 'PARM=' expecting {')', ID}
So I'd think I'll need to switch to a specific lexer mode but right now I can't see how I would properly switch back to "normal" mode
Instead of switching to modes (with -> mode(...)), you can push your "special" mode on a stack (with -> pushMode(...)) and then when encountering a ) you pop a mode from the stack. That way, you can have multiple nested lists (..(..(..).)..). A quick demo:
lexer grammar ParmLexer;
SPACE : [ \t\r\n]+ -> channel(HIDDEN);
EQUALS : '=' ;
LPAREN : '(' -> pushMode(InList);
PARM : 'PARM';
ID : [A-Za-z] [A-Za-z0-9]*;
mode InList;
LST_LPAREN : '(' -> type(LPAREN), pushMode(InList);
RPAREN : ')' -> popMode;
COMMA : ',';
LST_EQUALS : '=' -> type(EQUALS);
STRING : '\'' ~['\r\n]* '\'';
LST_ID : [A-Za-z] [A-Za-z0-9]* -> type(ID);
LST_SPACE : [ \t\r\n]+ -> channel(HIDDEN);
and:
parser grammar ParmParser;
options { tokenVocab=ParmLexer; }
parse
: PARM EQUALS list EOF
;
list
: LPAREN ( value ( COMMA value )* )? RPAREN
;
value
: ID
| STRING
| key_value
| ID list
;
key_value
: ID EQUALS value
;
which will parse your example input PARM=(VAL1, 'VAL2', VAL3, KEY4=VAL4, KEY5=VAL5(XYZ), PARM=ABC, SOMETHING=ELSE).
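The push/pop idea can be sketched in plain Python to show why nested parentheses are not a problem. This is only a simplified imitation of the two modes in the demo lexer, not generated ANTLR code; the MODES table and the tokenize helper are hypothetical names.

```python
import re

# Simplified mirror of the two modes in the demo lexer grammar.
MODES = {
    "DEFAULT": [("EQUALS", r"="), ("LPAREN", r"\("), ("PARM", r"PARM"),
                ("ID", r"[A-Za-z][A-Za-z0-9]*"), ("SPACE", r"[ \t\r\n]+")],
    "InList": [("LPAREN", r"\("), ("RPAREN", r"\)"), ("COMMA", r","),
               ("EQUALS", r"="), ("STRING", r"'[^'\r\n]*'"),
               ("ID", r"[A-Za-z][A-Za-z0-9]*"), ("SPACE", r"[ \t\r\n]+")],
}

def tokenize(text):
    """Return (tokens, final mode stack) for the given input."""
    stack, pos, tokens = ["DEFAULT"], 0, []
    while pos < len(text):
        best = None
        for name, pattern in MODES[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            # Longest match wins; earlier rules win ties.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"token recognition error at {text[pos]!r}")
        name, m = best
        if name == "LPAREN":
            stack.append("InList")  # every '(' does pushMode(InList)
        elif name == "RPAREN":
            stack.pop()             # every ')' does popMode
        if name != "SPACE":
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens, stack

tokens, stack = tokenize("PARM=(VAL1, KEY5=VAL5(XYZ), PARM=ABC)")
print(tokens)
print(stack)  # ['DEFAULT']
```

Because each ( pushes and each ) pops, the inner ) of VAL5(XYZ) only returns to the outer list, and the PARM inside the list comes out as a plain ID since the InList mode has no PARM rule.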
You don't have a rule (alternative) that recognizes a PARM token in your parm rule.
Bart has provided an answer using Lexer modes (and assuming that LPAREN and RPAREN always control those modes), but you can also just set up a parser rule that matches all of your keywords:
lexer grammar ParmLexer;
SPACE: [ \t\r\n]+ -> channel(HIDDEN);
COMMA: ',';
EQUALS: '=';
LPAREN: '(';
RPAREN: ')';
PARM: 'PARM';
KW1: 'KW1';
KW2: 'KW2';
ID: ID_LITERAL;
fragment ID_LITERAL: [A-Za-z]+;
parser grammar ParmParser;
options {
tokenVocab = ParmLexer;
}
parms: PARM EQUALS LPAREN parm (COMMA parm)* RPAREN;
parm: ((pkey = ID | kwid = kw) EQUALS)? pval = ID;
kw: PARM | KW1 | KW2;
input
"PARM=( TEST, KEY=VAL, KW2=v2, PARM=X)"
yields:
(parms PARM = ( (parm TEST) , (parm KEY = VAL) , (parm (kw KW2) = v) , (parm (kw PARM) = X) ))
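For readers who want to see the "keywords are also allowed as keys" idea outside ANTLR, here is a small hand-written Python sketch. It is not what ANTLR generates; lex and parse_parms are hypothetical helpers, and I assume words may contain digits after the first letter so that KW2 and v2 lex as whole words.

```python
import re

KEYWORDS = {"PARM", "KW1", "KW2"}
# Assumption: a word is a letter followed by letters or digits.
TOKEN_RE = re.compile(r"\s*(?:([A-Za-z][A-Za-z0-9]*)|([,=()]))")

def lex(text):
    """Return (kind, text) pairs; keywords get their own kind, other words are ID."""
    text, pos, tokens = text.strip(), 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"token recognition error at {text[pos]!r}")
        word, punct = m.groups()
        if word:
            tokens.append((word if word in KEYWORDS else "ID", word))
        else:
            tokens.append((punct, punct))
        pos = m.end()
    return tokens

def parse_parms(tokens):
    """parms: PARM '=' '(' parm (',' parm)* ')' ; returns (key, value) pairs."""
    pos = 0

    def expect(kind):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        pos += 1
        return tokens[pos - 1][1]

    def parm():
        nonlocal pos
        # The key may be an ID or any keyword (the 'kw' rule); the value must be an ID.
        if pos + 1 < len(tokens) and tokens[pos + 1][0] == "=":
            kind, key = tokens[pos]
            if kind != "ID" and kind not in KEYWORDS:
                raise SyntaxError(f"unexpected key {key!r}")
            pos += 2
            return (key, expect("ID"))
        return (None, expect("ID"))

    expect("PARM"); expect("="); expect("(")
    result = [parm()]
    while pos < len(tokens) and tokens[pos][0] == ",":
        pos += 1
        result.append(parm())
    expect(")")
    return result

print(parse_parms(lex("PARM=( TEST, KEY=VAL, KW2=v2, PARM=X)")))
# [(None, 'TEST'), ('KEY', 'VAL'), ('KW2', 'v2'), ('PARM', 'X')]
```

The lexer still emits keyword tokens everywhere; it is the parm rule that decides a keyword is acceptable in key position.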

Choosing lexer mode based on variable

My lexer (target language C++) contains a simple rule for parsing a string literal:
STRING: '"' ~'"'+ '"';
But based on the value returned by a function, I want my lexer to return either a STRING or an IDENT.
I've tried the following:
STRING_START: '"' -> mode(current_string_mode());
or
STRING_START: '"' -> mode(current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING) ;
In either case, I get an error when trying to generate the lexer (the error message says: '"' came as a complete surprise).
Alas, that is not possible.
If I look at the grammar of ANTLR itself, I see this:
lexerCommands
: RARROW lexerCommand (COMMA lexerCommand)*
;
lexerCommand
: lexerCommandName LPAREN lexerCommandExpr RPAREN
| lexerCommandName
;
lexerCommandName
: identifier
| MODE
;
lexerCommandExpr
: identifier
| INT
;
In short: the part between parentheses (mode(...) or pushMode(...)) must be an identifier or an integer literal. It cannot be an expression (which is what you're trying to do).
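As a quick illustration of that rule, the shape of a lexer command can be approximated with a regular expression in Python. This is only a sketch of the grammar fragment quoted above: the identifier pattern is my approximation, and is_valid_lexer_command is a hypothetical helper that checks the shape only, not whether the command name exists.

```python
import re

# Approximation of an ANTLR identifier (assumption, not taken from the ANTLR sources).
IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

# lexerCommand: name '(' (identifier | INT) ')' | name
LEXER_COMMAND = re.compile(rf"{IDENT}(?:\(\s*(?:{IDENT}|[0-9]+)\s*\))?")

def is_valid_lexer_command(text):
    """True if text has the shape name, name(identifier) or name(INT)."""
    return LEXER_COMMAND.fullmatch(text.strip()) is not None

print(is_valid_lexer_command("mode(MODE_IDENT)"))             # True
print(is_valid_lexer_command("pushMode(2)"))                  # True
print(is_valid_lexer_command("mode(current_string_mode())"))  # False
```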

antlr4 all words except the operators

grammar TestGrammar;
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WORD : [a-z0-9._#+=]+(' '[a-z0-9._#+=]+)* ;
WS : [ \t\r\n]+ -> skip ;
quotedword : DQUOTE WORD DQUOTE;
expression
: LPAREN expression+ RPAREN
| expression (AND expression)+
| expression (OR expression)+
| expression (NOT expression)+
| NOT expression+
| quotedword
| WORD;
I've managed to implement the above grammar for antlr4.
I've got a long way to go but for now my question is,
how can I make WORD generic? Basically I want this [a-z0-9._#+=] to be anything except the operators (AND, OR, NOT, LPAREN, RPAREN, DQUOTE, SPACE).
The lexer picks the rule that matches the longest stretch of input. If several rules match the same amount of input, the one defined first in the grammar wins.
Therefore you can make your WORD rule generic by using this grammar:
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WS : [ \t\r\n]+ -> skip ;
WORD: .+? ;
Make sure to use the non-greedy operator ? in this case, because otherwise the WORD rule, once invoked, will consume all following input.
As WORD is specified last, it only gets to consume input that none of the earlier lexer rules (those defined above it in the source code) can match.
EDIT: If you don't want your WORD rule to match any input then you just have to modify the rule I provided. But the essence of my answer is that in the lexer you don't have to worry about two rules potentially matching the same input as long as you got the order in the source code right.
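Here is a rough Python imitation of a catch-all rule placed last. It rests on one assumption worth stating loudly: as far as I understand, a trailing non-greedy .+? stops after a single character, so the sketch models WORD as a one-character rule. The RULES table and tokenize helper are hypothetical names.

```python
import re

# Rules in source order; WORD is the catch-all and comes last.
RULES = [
    ("AND", r"AND"), ("OR", r"OR|,"), ("NOT", r"NOT"),
    ("LPAREN", r"\("), ("RPAREN", r"\)"), ("DQUOTE", r'"'),
    ("WS", r"[ \t\r\n]+"),
    ("WORD", r"."),  # assumption: the non-greedy catch-all takes one character
]

def tokenize(text):
    """Longest match wins; on a tie, the rule defined first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern, re.S).match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name != "WS":  # WS is skipped
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("x AND (y OR z)"))
# [('WORD', 'x'), ('AND', 'AND'), ('LPAREN', '('), ('WORD', 'y'),
#  ('OR', 'OR'), ('WORD', 'z'), ('RPAREN', ')')]
```

Note that under this assumption a longer word such as ab comes out as two WORD tokens, so the parser (or a more specific WORD rule) has to glue characters together.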
Try something like this grammar:
grammar TestGrammar;
...
WORD : Letter+;
QUOTEDWORD : '"' (~["\\\r\n])* '"' ; // disallow quotes, backslashes and crlf in literals
WS : [ \t\r\n]+ -> skip ;
fragment Letter :
[a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
expression:
...
| QUOTEDWORD
| WORD+;
If you want to support escape sequences in QUOTEDWORD, look at this example to see how to do it.
This grammar allows you:
to have quoted words interpreted as string literals (preserving all spaces within)
to have multiple words separated by whitespace (which is ignored)

Incorrect Result When ANTLR4 Lexer Action Invokes getText()

It seems that getText() in a lexer action cannot correctly retrieve the token being matched. Is this normal behaviour? For example, part of my grammar has these rules for parsing a C++ style identifier that supports a \u sequence to embed Unicode characters as part of the identifier name:
grammar CPPDefine;
cppCompilationUnit: (id_token|ALL_OTHER_SYMBOL)+ EOF;
id_token:IDENTIFIER //{System.out.println($text);}
;
CRLF: '\r'? '\n' -> skip;
ALL_OTHER_SYMBOL: '\\';
IDENTIFIER: (NONDIGIT (NONDIGIT | DIGIT)*)
{System.out.println(getText());}
;
fragment DIGIT: [0-9];
fragment NONDIGIT: [_a-zA-Z] | UNIVERSAL_CHARACTER_NAME ;
fragment UNIVERSAL_CHARACTER_NAME: ('\\u' HEX_QUAD | '\\U' HEX_QUAD HEX_QUAD ) ;
fragment HEX_QUAD: [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f];
Tested with this one-line input containing an identifier with an incorrect Unicode escape sequence:
dkk\uzzzz
The $text of the id_token parser rule action produces this correct result:
dkk
uzzzz
i.e. input interpreted as 2 identifiers separated by a symbol '\' (symbol '\' not printed by any parser rule).
However, the getText() of IDENTIFIER lexer rule action produces this incorrect result:
dkk\u
uzzzz
Why is the lexer rule IDENTIFIER's getText() different from the parser rule id_token's $text? After all, the parser rule contains only this lexer rule.
EDIT:
Issue observed in ANTLR4.1 but not in ANTLR4.2 so it could have been fixed already.
It's hard to tell based on your example, but my instinct is you are using an old version of ANTLR. I am unable to reproduce this issue in ANTLR 4.2.

ANTLR 4 lexer tokens inside other tokens

I have the following grammar for ANTLR 4:
grammar Pattern;
//parser rules
parse : string LBRACK CHAR DASH CHAR RBRACK ;
string : (CHAR | DASH)+ ;
//lexer rules
DASH : '-' ;
LBRACK : '[' ;
RBRACK : ']' ;
CHAR : [A-Za-z0-9] ;
And I'm trying to parse the following string
ab-cd[0-9]
The code parses out the ab-cd on the left which will be treated as a literal string in my application. It then parses out [0-9] as a character set which in this case will translate to any digit. My grammar works for me except I don't like to have (CHAR | DASH)+ as a parser rule when it's simply being treated as a token. I would rather the lexer create a STRING token and give me the following tokens:
"ab-cd" "[" "0" "-" "9" "]"
instead of these
"ab" "-" "cd" "[" "0" "-" "9" "]"
I have looked at other examples, but haven't been able to figure it out. Usually other examples have quotes around such string literals or they have whitespace to help delimit the input. I'd like to avoid both. Can this be accomplished with lexer rules or do I need to continue to handle it in the parser rules like I'm doing?
In ANTLR 4, you can use lexer modes for this.
STRING : [a-z-]+;
LBRACK : '[' -> pushMode(CharSet);
mode CharSet;
DASH : '-';
NUMBER : [0-9]+;
RBRACK : ']' -> popMode;
After parsing a [ character, the lexer will operate in mode CharSet until a ] character is reached and the popMode command is executed.
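To make the effect of the mode switch concrete, here is a small Python imitation of the two modes (not ANTLR output). The MODES, PUSH and POP tables and the tokenize helper are hypothetical names that just mirror the rules above.

```python
import re

# One rule list per lexer mode, mirroring the grammar above.
MODES = {
    "DEFAULT": [("STRING", r"[a-z-]+"), ("LBRACK", r"\[")],
    "CharSet": [("DASH", r"-"), ("NUMBER", r"[0-9]+"), ("RBRACK", r"\]")],
}
PUSH = {"LBRACK": "CharSet"}  # '[' -> pushMode(CharSet)
POP = {"RBRACK"}              # ']' -> popMode

def tokenize(text):
    stack, pos, out = ["DEFAULT"], 0, []
    while pos < len(text):
        # The rules within each mode do not overlap, so the first match is enough here.
        for name, pattern in MODES[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"token recognition error at {text[pos]!r}")
        out.append((name, m.group()))
        pos = m.end()
        if name in PUSH:
            stack.append(PUSH[name])
        elif name in POP:
            stack.pop()
    return out

print(tokenize("ab-cd[0-9]"))
# [('STRING', 'ab-cd'), ('LBRACK', '['), ('NUMBER', '0'),
#  ('DASH', '-'), ('NUMBER', '9'), ('RBRACK', ']')]
```

The dash is part of STRING outside the brackets and a separate DASH token inside them, which is the token split asked for in the question.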

Resources