ANTLR4 pushMode, popMode, mode

I have a grammar that fails when I use 'pushMode' and 'popMode' but works when I use 'mode'.
This grammar construct works:
TAG: '{' -> pushMode( TAG_MODE ), skip;
TEXT: ~[{]+;
mode TAG_MODE;
TAG_COMMENT: '*' -> skip, mode( COMMENT_MODE );
mode COMMENT_MODE;
END_COMMENT: .*? '*}' -> skip, popMode;
COMMENT_TEXT: . -> type( SYNTAX_ERROR ), popMode;
Now if I change TAG_COMMENT to use 'pushMode' and 'popMode' instead, it fails and I lose everything from the comment tag onwards.
This grammar construct fails:
TAG: '{' -> pushMode( TAG_MODE ), skip;
TEXT: ~[{]+;
mode TAG_MODE;
TAG_COMMENT: '*' -> skip, pushMode( COMMENT_MODE ), popMode;
mode COMMENT_MODE;
END_COMMENT: .*? '*}' -> skip, popMode;
COMMENT_TEXT: . -> type( SYNTAX_ERROR ), popMode;
Can anybody explain the difference to me? In my mind they should be functionally equivalent, except that the second method uses one more level of lexer stack.
I would actually prefer to use the failing construct, as it's too easy to get the lexer stack screwed up using 'mode' (goto), whereas if everything uses 'pushMode' and 'popMode' it's much easier to keep the lexer stack in order.
Also is it ok to use an empty set to exit a mode?
Something like:
mode MODE1;
TAG_IDENT: IDENT;
TAG_EMPTY: -> popMode; // Hopefully exits and doesn't consume a token
Any thoughts?

In the TAG_COMMENT rule you use pushMode immediately followed by popMode. The result is the same as though both instructions were removed. The reverse order would have had a very different effect, as shown in item 1 below.
1) You could reverse the order of the pushMode and popMode instructions in TAG_COMMENT, but that would be semantically equivalent to a simpler single mode command:
TAG_COMMENT
  : '*'
    -> skip, mode(COMMENT_MODE)
  ;
2) Remove the popMode action from the COMMENT_TEXT rule (if the comment is unterminated, the entire rest of the input is inside the comment).
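The effect of the command order can be checked with a toy model of the lexer's mode bookkeeping. This is only a sketch: the ModeStack class and the after_tag_comment helper are hypothetical names made up for illustration, not part of the ANTLR runtime. It assumes pushMode saves the current mode before switching, popMode restores the most recently saved mode, and mode switches without touching the stack.

```python
class ModeStack:
    """Toy model of the lexer's mode bookkeeping (not the real ANTLR runtime)."""

    def __init__(self):
        self.mode = "DEFAULT_MODE"  # mode the lexer is currently in
        self.stack = []             # modes saved by pushMode

    def push_mode(self, new_mode):
        # pushMode: remember the current mode, then switch
        self.stack.append(self.mode)
        self.mode = new_mode

    def pop_mode(self):
        # popMode: return to the most recently saved mode
        self.mode = self.stack.pop()

    def set_mode(self, new_mode):
        # mode(...): switch without touching the stack
        self.mode = new_mode


def after_tag_comment(commands):
    """State after TAG ('{') followed by TAG_COMMENT ('*') with the given commands."""
    lexer = ModeStack()
    lexer.push_mode("TAG_MODE")  # TAG: '{' -> pushMode(TAG_MODE)
    for name, *args in commands:
        getattr(lexer, name)(*args)
    return lexer.mode, lexer.stack


failing = after_tag_comment([("push_mode", "COMMENT_MODE"), ("pop_mode",)])
reversed_order = after_tag_comment([("pop_mode",), ("push_mode", "COMMENT_MODE")])
plain_mode = after_tag_comment([("set_mode", "COMMENT_MODE")])

print(failing)         # ('TAG_MODE', ['DEFAULT_MODE'])
print(reversed_order)  # ('COMMENT_MODE', ['DEFAULT_MODE'])
print(plain_mode)      # ('COMMENT_MODE', ['DEFAULT_MODE'])
```

In this model the failing variant never leaves TAG_MODE, while the reversed order and the plain mode command end in the same state.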

@Terry151151 I am new to ANTLR, but I was able to achieve a similar result by adding a second 'popMode' in the secondary mode. This would probably only work for modes with a single rule that pushes the secondary mode and needs to pop once that mode is done.
Try changing the following lines:
TAG_COMMENT: '*' -> skip, pushMode( COMMENT_MODE );
and
END_COMMENT: .*? '*}' -> skip, popMode, popMode;
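Why the doubled popMode helps can be seen by tracing the mode stack by hand. The following is a minimal sketch, assuming the usual push/pop semantics; run is a hypothetical helper for illustration, not an ANTLR API.

```python
def run(commands):
    """Apply lexer commands to a toy (mode, stack) model; not the real ANTLR runtime."""
    mode, stack = "DEFAULT_MODE", []
    for command, *target in commands:
        if command == "pushMode":
            stack.append(mode)  # save the current mode
            mode = target[0]
        elif command == "popMode":
            mode = stack.pop()  # restore the most recently saved mode
    return mode, stack


enter_comment = [
    ("pushMode", "TAG_MODE"),      # TAG: '{'
    ("pushMode", "COMMENT_MODE"),  # TAG_COMMENT: '*'
]

# END_COMMENT with a single popMode lands back in TAG_MODE ...
single_pop = run(enter_comment + [("popMode",)])
# ... while the doubled popMode unwinds all the way to the default mode.
double_pop = run(enter_comment + [("popMode",), ("popMode",)])

print(single_pop)  # ('TAG_MODE', ['DEFAULT_MODE'])
print(double_pop)  # ('DEFAULT_MODE', [])
```

Note that in this model COMMENT_TEXT, which still has a single popMode, would return to TAG_MODE rather than to the default mode.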

Related

How to parse tokens of long lexer rule that cannot be converted into parser rule?

I am trying to parse this with ANTLR4:
> A Request [AR]
Commments might have many lines here
Line 2
- A Response [A]
- The other response [B]
Response can also have lines here.
> Request [A]
- Responce
The following code parses it very well:
grammar Response;
prog: (request | response)+ EOF;
request: REQUEST TEXT*;
response: RESPONSE TEXT*;
REQUEST: '>' TEXT '[' ID ']';
RESPONSE: '-' TEXT ('[' ID ']')?;
ID: [a-zA-Z] [a-zA-Z0-9._]*;
TEXT: ~[\r\n]+;
EMPTY: [ \t\r\n]+ -> skip;
This is a good result. However, I would like to capture ID and TEXT separately. Because they are parts of one long lexer rule, this does not seem to be supported.
As I understand it, in this case you would usually convert the lexer rules REQUEST and RESPONSE into parser rules like request_rule and response_rule.
But this does not work here, because the TEXT lexer rule then matches each and every line. For example, if I replace REQUEST and RESPONSE with ruleREQUEST and ruleRESPONSE:
I am trying to figure out how to proceed... It seems that the only way is to make the code far more complicated using a number of popMode and pushMode commands, as described here:
https://github.com/antlr/antlr4/issues/2229 (incorrect lexer rule precedence with "not" rules)
Is there any simple way, based on the original ANTLR4 code, to get the TEXT and ID values in C# Antlr4.Runtime.Standard? Other than that, the code works perfectly.
TEXT is greedy, and the lexer prefers the longest match, so TEXT wins over all other lexer rules. You will need to make it non-greedy by adding a '?' operator after the '+'.
Once you do that, however, the parser rules will need to be changed to allow different tokens.
Here is a grammar that may work instead. It works for your input, but you may need to make further changes.
grammar Response;
prog: (request | response)+ EOF;
request: request_rule text*;
response: response_rule text*;
request_rule: '>' text '[' ID ']';
response_rule: '-' text ('[' ID ']')?;
text: (ID | TEXT)+;
ID: [a-zA-Z] [a-zA-Z0-9._]*;
GT: '>';
LP: '[';
RP: ']';
DS: '-';
TEXT: ~[\r\n]+?;
EMPTY: [ \t\r\n]+ -> skip;
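The greedy-TEXT problem this answer starts from can be reproduced outside ANTLR with a small longest-match sketch. The RULES table and next_token function are hypothetical stand-ins written with Python regular expressions; they only imitate the "longest match wins, first rule wins ties" selection and are not how the ANTLR runtime is implemented.

```python
import re

# Stand-ins for three lexer rules, in grammar order (illustrative only).
RULES = [
    ("GT", re.compile(r">")),
    ("ID", re.compile(r"[a-zA-Z][a-zA-Z0-9._]*")),
    ("TEXT", re.compile(r"[^\r\n]+")),  # greedy TEXT: ~[\r\n]+
]


def next_token(text, pos=0):
    """Return (rule name, matched text) for the longest match at pos."""
    best = None
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        # Strictly longer matches replace the current best, so on a tie
        # the rule listed first is kept.
        if match and (best is None or len(match.group(0)) > len(best[1])):
            best = (name, match.group(0))
    return best


print(next_token("> A Request [AR]"))  # ('TEXT', '> A Request [AR]')
```

With the greedy TEXT in play, the leading '>' is never seen as its own token, which is why the parser rules have nothing to work with.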

ANTLR4: Lexer returning a single token when in a lexer mode

I am attempting to use a lexer mode with ANTLR4 with the following lexer grammar:
STRING: '"' -> pushMode(STRING_MODE);
mode STRING_MODE;
STRING_CONTENTS: ~('"'|'\n'|'\r')+ -> type(STRING);
END_STRING: '"' -> type(STRING), popMode;
STRING_UNMATCHED: . -> type(UNMATCHED);
Is there a way to return a single token of type STRING for all the characters captured within the mode and including the characters which caused an entrance to the mode?
When does the mode end?
I am aware that I can also write the string token like so:
STRING: '"' (~["\n\r]|'\\"')* '"';
1) The more command accumulates the matched text into the first token emitted by a rule that does not use more.
For:
STRING: '"' -> more, pushMode(STRING_MODE);
mode STRING_MODE;
STRING_CONTENTS: ~('"'|'\n'|'\r')+ -> more ;
END_STRING: '"' -> type(STRING), popMode;
the text matching the STRING and STRING_CONTENTS rules is prepended to that of the END_STRING rule, resulting in a STRING-typed token containing the full text of the string.
2) The 'end' of a mode statement is implied by the first subsequent encounter of: a parser rule, another mode statement, a fragment rule, or EOF.
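The accumulation described in point 1 can be imitated with a toy tokenizer. This is a sketch under stated assumptions: RULES and tokenize are hypothetical names, the three rules are re-created as Python regular expressions, and rule selection is simply "first rule in the current mode that matches", which is enough here because the rules never compete.

```python
import re

# (mode, rule name, pattern, uses more, mode change), in grammar order.
RULES = [
    ("DEFAULT_MODE", "STRING", re.compile(r'"'), True, "push"),
    ("STRING_MODE", "STRING_CONTENTS", re.compile(r'[^"\n\r]+'), True, None),
    ("STRING_MODE", "END_STRING", re.compile(r'"'), False, "pop"),
]


def tokenize(text):
    """Toy lexer: rules with more extend the pending token instead of emitting."""
    mode, stack = "DEFAULT_MODE", []
    pos = token_start = 0
    tokens = []
    while pos < len(text):
        for rule_mode, _name, pattern, uses_more, change in RULES:
            match = pattern.match(text, pos) if rule_mode == mode else None
            if match:
                break
        else:
            raise ValueError(f"no rule matches at offset {pos} in {mode}")
        pos = match.end()
        if change == "push":
            stack.append(mode)
            mode = "STRING_MODE"
        elif change == "pop":
            mode = stack.pop()
        if not uses_more:
            # END_STRING is typed as STRING; its text includes everything
            # accumulated by the preceding more rules.
            tokens.append(("STRING", text[token_start:pos]))
            token_start = pos
    return tokens


print(tokenize('"abc""de"'))  # [('STRING', '"abc"'), ('STRING', '"de"')]
```

Each quoted string comes out as one STRING token that includes both quote characters.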

Token collision (??) writing ANTLR4 grammar

I have what I thought was a very simple grammar to write:
I want it to allow tokens called facts. A fact starts with a letter, followed by any of these: letter, digit, % or _
I want to concatenate two facts with a . but the second fact does not have to start with a letter (a digit, % or _ is also valid from the second fact onwards)
Any "subfact" (even the initial one) in the whole fact can be "instantiated" like an array (you will get it by reading my examples)
For example:
Foo
Foo%
Foo_12%
Foo.Bar
Foo.%Bar
Foo.4_Bar
Foo[42]
Foo['instance'].Bar
etc
I tried to write such grammar but I can't get it working:
grammar Common;
/*
* Parser Rules
*/
fact: INITIALFACT instance? ('.' SUBFACT instance?)*;
instance: '[' (LITERAL | NUMERIC) (',' (LITERAL | NUMERIC))* ']';
/*
* Lexer Rules
*/
INITIALFACT: [a-zA-Z][a-zA-Z0-9%_]*;
SUBFACT: [a-zA-Z%_]+;
ASSIGN: ':=';
LITERAL: ('\'' .*? '\'') | ('"' .*? '"');
NUMERIC: ([1-9][0-9]*)?[0-9]('.'[0-9]+)?;
WS: [ \t\r\n]+ -> skip;
For example, if I try to parse Foo.Bar, I get: Syntax error line 1 position 4: mismatched input 'Bar' expecting SUBFACT.
I think this is because ANTLR first finds that Bar matches INITIALFACT and stops there. How can I fix this?
If it is relevant, I am using Antlr4cs.
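The suspicion about Bar can be checked with a small sketch of how a lexer chooses between two rules that match the same text. RULES and token_type are hypothetical stand-ins using Python regular expressions; the sketch assumes the usual selection rule (longest match wins, and on a tie the rule defined first wins) and says nothing about how to restructure the grammar.

```python
import re

# Stand-ins for the two lexer rules, in the order they appear in the grammar.
RULES = [
    ("INITIALFACT", re.compile(r"[a-zA-Z][a-zA-Z0-9%_]*")),
    ("SUBFACT", re.compile(r"[a-zA-Z%_]+")),
]


def token_type(text):
    """Pick the longest match; on a tie the rule listed first wins."""
    best_name, best_len = None, 0
    for name, pattern in RULES:
        match = pattern.match(text)
        if match and match.end() > best_len:
            best_name, best_len = name, match.end()
    return best_name


print(token_type("Bar"))   # INITIALFACT
print(token_type("%Bar"))  # SUBFACT
```

Both rules match all of Bar, so the earlier rule is chosen and the parser never receives a SUBFACT token for it, which is consistent with the reported error.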

ANTLR v4: How do I capture an arbitrary trimmed string to the end of line/file after a certain token?

For curiosity's sake I'm learning ANTLR, in particular version 4, and I'm trying to create a simple grammar. I chose NES (Nintendo Entertainment System) Game Genie files for my very first attempt. Here is a sample Game Genie file for Jurassic Park found somewhere on the Internet:
GZUXXKVS Infinite ammo on pick-up
PAVPAGZE More bullets picked up from small dinosaurs
PAVPAGZA Fewer bullets picked up from small dinosaurs
GZEULOVK Infinite lives--1st 2 Levels only
ATVGZOSA Immune to most attacks
VEXASASA + VEUAXASA 3-ball bolas picked up
NEXASASA + NEUAXASA Explosive multi-shots
And here is a grammar I'm working on.
grammar NesGameGenie;
all: lines EOF;
lines: (anyLine? EOL+)* anyLine?;
anyLine: codeLine;
codeLine: code;
code: CODE (PLUS? CODE)*;
CODE: SHORT_CODE | LONG_CODE;
fragment SHORT_CODE: FF FF FF FF FF FF;
fragment LONG_CODE: FF FF FF FF FF FF FF FF;
fragment FF: [APZLGITYEOXUKSVN];
COMMENT: COMMENT_START NOEOL -> skip;
COMMENT_START: [#;];
EOL: '\r'? '\n';
PLUS: '+';
WS: [ \t]+ -> skip;
fragment NOEOL: ~[\r\n]*;
Well, it's ridiculously short and easy, but it still has two issues I can see:
The cheat descriptions cause recognition errors like line 1:16 token recognition error at: 'In' because the grammar has no description rule.
Adding the # symbol to a description will probably cause the rest of the line to be ignored. At least, for AAAAAA Player #1 ammo only, just Player is reported and #1 ammo only is unfortunately parsed as a comment, but I think that could be fixed once the description rule is introduced.
My previous attempts to add the description rule caused a lot of various errors, and I've found a non-error but still not a good solution:
...
codeLine: code description?;
...
description: PRINTABLE+;
...
PRINTABLE: [\u0020-\uFFFE];
...
Unfortunately every character is parsed as a single PRINTABLE, and what I'm looking for is a description rule that matches arbitrary text until the end of line (or file), including whitespace, but trimmed on the left and right. If I add + to the end of PRINTABLE, the whole document is considered invalid. I guess that PRINTABLE might be safely inlined into the description rule somehow, but description: ('\u0020' .. '\uFFFE')+; captures way more.
How should the description rule be declared so that it captures all characters up to the end of line right after the codes, while trimming whitespace ([ \t]) on the left and right only? Simply put, I would like a grammar that parses into something like this (including the # character, without treating it as a comment):
code=..., description="Infinite ammo on pick-up"
code=..., description="More bullets picked up from small dinosaurs"
code=..., description="Fewer bullets picked up from small dinosaurs"
code=..., description="Infinite lives--1st 2 Levels only"
code=..., description="Immune to most attacks"
code=..., description="3-ball bolas picked up"
code=..., description="Explosive multi-shots"
One more note, I'm using:
IntelliJ IDEA 2016.1.1 CE
IJ plugin: ANTLR v4 grammar plugin 1.8.1
IJ plugin: ANTLRWorks 1.3.1
Quite easy actually, just use lexer modes. Once you hit certain tokens, change the mode.
Here is the lexer grammar, parser is easy based on that (filename is NesGameGenieLexer.g4):
lexer grammar NesGameGenieLexer;
CODE: [A-Z]+;
WS : [ ]+ -> skip, mode(COMMENT_MODE);
mode COMMENT_MODE;
PLUS: '+' (' ')* -> mode(DEFAULT_MODE);
fragment ANY_CHAR: [a-zA-Z_/0-9=.\-\\ ];
COMMENT: ANY_CHAR+;
NEWLINE: [\r\n] -> skip, mode(DEFAULT_MODE);
I've assumed that + can't be in comments. If you use ANTLRWorks lexer debugger you can see all the token types and token modes nicely highlighted.
And here is the parser grammar (filename is NesGameGenieParser.g4):
parser grammar NesGameGenieParser;
options {
tokenVocab=NesGameGenieLexer;
}
file: line+;
line : code comment
| code PLUS code comment;
code: CODE;
comment: COMMENT;
Here I've assumed that CODE is just a set of chars before PLUS, but obviously that's very easy to change :)
After spending a whole sleepless night on this, I seem to have managed to write the lexer and parser grammars. There is not much to explain, see the comments in the source code. In short:
The lexer:
lexer grammar NesGameGenieLexer;
COMMENT: [#;] ~[\r\n]+ [\r\n]+ -> skip;
CODE: (SHORT_CODE | LONG_CODE) -> mode(CODE_FOUND_MODE);
fragment SHORT_CODE: FF FF FF FF FF FF;
fragment LONG_CODE: FF FF FF FF FF FF FF FF;
fragment FF: [APZLGITYEOXUKSVN];
WS: [\t ]+ -> skip;
mode CODE_FOUND_MODE;
PLUS: [\t ]* '+' [\t ]* -> mode(DEFAULT_MODE);
// Skip inline whitespaces and switch to the description detection mode.
DESCRIPTION_LEFT_DELIMITER: [\t ]+ -> skip, mode(DESCRIPTION_FOUND_MODE);
NEW_LINE_IN_CODE_FOUND_MODE: [\r\n]+ -> skip, mode(DEFAULT_MODE);
mode DESCRIPTION_FOUND_MODE;
// Greedily grab all non-CRLF characters and ignore trailing spaces - this is a trimming operation equivalent, I guess.
DESCRIPTION: ~[\r\n]*~[\r\n\t ]+;
// But then terminate the line and switch to the code detection mode.
// This operation is probably required because the `DESCRIPTION: ... -> mode(CODE_FOUND_MODE)` seems not working
NEW_LINE_IN_DESCRIPTION_FOUND_MODE: [\r\n]+ -> skip, mode(DEFAULT_MODE);
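The right-trimming effect of the DESCRIPTION rule can be tried out with an equivalent Python regular expression. This is a sketch, not the lexer itself: match_description is a hypothetical helper, and Python's re engine is assumed to agree with the lexer's longest-match behaviour for this particular pattern. Left trimming is not shown because the grammar handles it earlier, in DESCRIPTION_LEFT_DELIMITER.

```python
import re

# Python rendering of DESCRIPTION: ~[\r\n]* ~[\r\n\t ]+
DESCRIPTION = re.compile(r"[^\r\n]*[^\r\n\t ]+")


def match_description(rest_of_line):
    """Return the text the pattern matches at the start of rest_of_line, or None."""
    match = DESCRIPTION.match(rest_of_line)
    return match.group(0) if match else None


print(match_description("Infinite ammo on pick-up   \r\n"))  # Infinite ammo on pick-up
print(match_description("Player #1 ammo only\t \n"))         # Player #1 ammo only
print(match_description("   \n"))                            # None
```

Because the pattern must end on a character that is neither a line break nor a blank, the match stops at the last visible character; the trailing blanks are not part of the match, and input consisting only of blanks does not match at all.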
The parser:
parser grammar NesGameGenieParser;
options {
tokenVocab = NesGameGenieLexer;
}
file
: line+
;
line
: code description?
| code (PLUS code)* description?
;
code
: CODE
;
description
: DESCRIPTION
;
It looks much more complicated than I thought it would be, but it seems to work exactly the way I want. Also, I'm not sure whether the grammars above are really well-written and idiomatic. Thanks to @cantSleepNow for the modes idea.

ANTLR - Handle whitespace in identifier

I am trying to build a simple search expression and couldn't get the grammar below to work.
Here is my sample search text:
LOB WHERE
Line of Business WHERE
Line of Business WHERE
As you can see in the searches above, the first few words are the search keyword, followed by a WHERE condition. I want to capture the search keyword, which can include whitespace. Here is my sample grammar, but it doesn't seem to parse properly:
sqlsyntax : identifierws 'WHERE';
identifierws : (WSID)+;
WSID: [a-zA-Z0-9 ] ; // match identifiers with space
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Any help in this regard is appreciated.
This is what is happening when I try to parse
Line of Business WHERE
I get following error
line 1:0 no viable alternative at input 'Line'
I get back the text LineofBusiness with the whitespace trimmed, but I want the exact text Line of Business. That is where I am struggling a bit.
The identifierws rule is consuming all text. Better to prioritize identification of keywords in the lexer:
sqlsyntax : identifierws WHERE identifierws EQ STRING EOF ;
identifierws : (WSID)+;
WHERE: 'WHERE';
EQ : '=' ;
STRING : '\'' .*? '\'' ;
WSID: [a-zA-Z0-9 ] ;
WS : [ \t\r\n]+ -> skip ;
For such a simple case I wouldn't use a parser. That's just overkill. All you need to do is take the current position in the input and search for WHERE (a simple Boyer-Moore search). Then take the text between the start position and the WHERE position as your input. Jump over WHERE, set the start position to where you are then, and do the same search again, and so on.
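The search-and-slice approach described above might look like the following minimal sketch. split_on_keyword is a hypothetical helper, and Python's built-in str.find stands in for the substring search; the sketch only handles the first WHERE and does not guard against WHERE appearing inside a longer word.

```python
def split_on_keyword(text, keyword="WHERE"):
    """Return (search_text, remainder) split at the first keyword, or None."""
    position = text.find(keyword)
    if position < 0:
        return None
    # Text before the keyword is the search phrase; inner spaces are kept,
    # only the surrounding whitespace is trimmed.
    before = text[:position].strip()
    # Jump over the keyword; the rest is the condition to process next.
    after = text[position + len(keyword):].strip()
    return before, after


print(split_on_keyword("Line of Business WHERE"))  # ('Line of Business', '')
print(split_on_keyword("LOB WHERE x = 'y'"))       # ('LOB', "x = 'y'")
```

The inner whitespace of Line of Business survives because only the ends of each slice are stripped.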
