ANTLR4 skip no difference with or without - antlr4

I do not see what skip does. This is my example:
FS : WHITESPACE* (',') WHITESPACE*;
WHITESPACE : [ \r\t\n]+ ->skip;
When I run this in the ANTLRWorks2 TestRig, I see no difference with or without ->skip. The visual tree contains (I will use a normal dot . for a space) .,.\r\n
What is the difference between using or not using ->skip?

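Whitespace characters on their own will be skipped, but not when they surround commas: your FS rule consumes them as part of the FS token, and the lexer always prefers the longest match. If you want them to be skipped, do not include them in your FS rule: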
FS : ',';
WHITESPACE : [ \r\t\n]+ -> skip;
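To see why the surrounding whitespace ends up inside the FS token, here is a rough model of the lexer's choice in plain Python (not ANTLR itself). The tokenize helper and the rule tuples are made up for this illustration, and the grammar rules are hand-translated to regular expressions; it only assumes the two rules the lexer follows: longest match wins, and ties go to the rule defined first.

```python
import re

def tokenize(rules, text):
    """Rough model of ANTLR's lexer choice: longest match wins, ties go to the earliest rule.

    rules is a list of (name, regex, skip) tuples; skipped matches are consumed
    but not emitted. The regexes here are simple enough that Python's greedy
    match is also the longest match.
    """
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern, skip in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best, so on a
            # tie the rule listed first is kept.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end(), skip)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, end, skip = best
        if not skip:
            tokens.append((name, text[pos:end]))
        pos = end
    return tokens

WS = r"[ \r\t\n]+"
# FS : WHITESPACE* (',') WHITESPACE*;  (the version from the question)
with_ws_in_fs = [("FS", r"[ \r\t\n]*,[ \r\t\n]*", False), ("WHITESPACE", WS, True)]
# FS : ',';  (the version suggested above)
comma_only = [("FS", r",", False), ("WHITESPACE", WS, True)]

print(tokenize(with_ws_in_fs, " , \r\n"))  # FS swallows the whitespace
print(tokenize(comma_only, " , \r\n"))     # whitespace is skipped, FS is just ','
```

With the first rule set, FS beats WHITESPACE at offset 0 because it matches more characters, so skip never gets a chance to apply to those spaces.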

Related

At least one space around parentheses in ANTLR4

I want spaces around the parentheses in an IF condition; at least one space is required. But when I use SPACE in the grammar, it throws an error when I use an ELSE block with it. Please help me accomplish this; I have seen many examples, but none is related to it.
I only need spaces around the parentheses of the IF condition.
prog: stat_block EOF;
stat_block: OBRACE block CBRACE;
block: (stat (stat)*)?;
stat: expr ';'
| IF condition_block (ELSE stat_block)?
;
expr
: expr SPACE ('*' | '/') SPACE expr
| ID
| INT
| STRING
;
exprList: expr (',' expr)*;
condition_block: SPACE OPAR SPACE expr SPACE CPAR SPACE stat_block;
IF: 'IF';
ELSE: 'ELSE';
OPAR: '(';
CPAR: ')';
OBRACE: '{';
CBRACE: '}';
SPACE: SINGLE_SPACE+;
SINGLE_SPACE: ' ';
ID: [a-zA-Z]+;
INT: [0-9]+;
NEWLINE: '\r'? '\n' -> skip;
WS: [ \t]+ -> skip;
Expected input to parse
IF ( 3 ) { } ELSE { }
Current Input
There's a reason that almost all languages ignore whitespace. If you don't ignore it, then you have to deal with its possible existence in the token stream anywhere it might, or might not, be in ALL of your parser rules.
You can try to include the spaces in the Lexer rules for tokens that you want wrapped in spaces, but may still find surprises.
Suggestion: Instead of -> skip; for your WS rule, use -> channel(HIDDEN); This keeps the tokens in the token stream so you can look for them in your code, but "hides" the whitespace tokens from the parser rules. This also allows ANTLR to get a proper interpretation of your input and build a parse tree that represents it correctly.
If you REALLY want to insist on the spaces before/after, you can write code in a listener that looks before/after the tokens in the input stream to see if you have whitespace, and generate your own error (that can be very specific about your requirement).
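To make the "look before/after the tokens" idea concrete, here is a minimal sketch in plain Python rather than against the ANTLR runtime. The Token class, the lex helper, the channel constants and missing_space_errors are all hypothetical stand-ins; in a real listener you would inspect the neighbouring tokens of the actual token stream instead.

```python
from dataclasses import dataclass

DEFAULT, HIDDEN = 0, 1  # stand-ins for the default and hidden channels

@dataclass
class Token:
    type: str
    text: str
    channel: int = DEFAULT

def lex(text):
    """Tiny hand-written lexer for the IF example; whitespace is kept on HIDDEN."""
    punct = {"(": "OPAR", ")": "CPAR", "{": "OBRACE", "}": "CBRACE"}
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch in " \t\r\n":
            j = i
            while j < len(text) and text[j] in " \t\r\n":
                j += 1
            tokens.append(Token("WS", text[i:j], HIDDEN))
        elif ch in punct:
            j = i + 1
            tokens.append(Token(punct[ch], ch))
        elif ch.isalnum():
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            word = text[i:j]
            kind = word if word in ("IF", "ELSE") else ("INT" if word.isdigit() else "ID")
            tokens.append(Token(kind, word))
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
        i = j
    return tokens

def missing_space_errors(tokens):
    """Report every parenthesis that lacks a hidden whitespace token on either side."""
    errors = []
    for k, tok in enumerate(tokens):
        if tok.type not in ("OPAR", "CPAR"):
            continue
        before = k > 0 and tokens[k - 1].channel == HIDDEN
        after = k + 1 < len(tokens) and tokens[k + 1].channel == HIDDEN
        if not (before and after):
            errors.append(f"token #{k}: expected whitespace on both sides of '{tok.text}'")
    return errors

print(missing_space_errors(lex("IF ( 3 ) { } ELSE { }")))
print(missing_space_errors(lex("IF (3) { }")))
```

The parser rules never see the WS tokens, so the grammar stays free of SPACE references, and the check can produce an error message that names the exact requirement.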
At least one space is required.
Then you either:
cannot -> skip the WS rule, which will cause all spaces and tabs to become tokens that your parser has to handle correctly (which is likely going to become a complete mess in your parser rules!), or
you leave WS -> skip as-is, but include a space in your PAREN rules: OPAR : ' ( '; CPAR: ' ) '; (or with tabs as well if that is possible)

Antlr4: Skip line when it start with * unless the second char is

In my input, a line that starts with * is a comment line unless it starts with *+ or *-. I can ignore the comments but need to get the others.
These are my lexer rules:
WhiteSpaces : [ \t]+;
Newlines : [\r\n]+;
Comment : '*' .*? Newlines -> skip ;
SkipTokens : (WhiteSpaces | Newlines) -> skip;
An example:
* this is a comment line
** another comment line
*+ type value
So the first two are comment lines, and I can skip them. But I don't know how to define a lexer/parser rule that can catch the last line.
Your SkipTokens lexer rule will never be matched because the rules WhiteSpaces and Newlines are placed before it and match the same input; when two rules match the same text, the one defined first wins. See this Q&A for an explanation of how the lexer matches tokens: ANTLR Lexer rule only seems to work as part of parser rule, and not part of another lexer rule
For it to work as you expect, do this:
SkipTokens : (WhiteSpaces | Newlines) -> skip;
fragment WhiteSpaces : [ \t]+;
fragment Newlines : [\r\n]+;
For an explanation of what a fragment is, see this Q&A: What does "fragment" mean in ANTLR?
Now, for your question. You defined a Comment rule to always end with a line break. This means that there can't be a comment at the end of your input. So you should let a comment either end with a line break or the EOF.
Something like this should do the trick:
COMMENT
: '*' ~[+\-\r\n] ~[\r\n]* // a '*' must be followed by something other than '+', '-' or a line break
| '*' ( [\r\n]+ | EOF ) // a '*' is a valid comment if directly followed by a line break, or the EOF
;
STAR_MINUS
: '*-'
;
STAR_PLUS
: '*+'
;
SPACES
: [ \t\r\n]+ -> skip
;
This, of course, does not mandate the * to be at the start of the line. If you want that, check out this Q&A: Handle strings starting with whitespaces
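If you want to sanity-check the COMMENT, STAR_MINUS and STAR_PLUS alternatives without running ANTLR, the same patterns can be approximated with Python regular expressions. This is only a sketch: the classify helper is made up for this illustration, the regexes are my hand translation of the rules above, and it looks at the first token of a single line rather than lexing a whole file.

```python
import re

# '*' ~[+\-\r\n] ~[\r\n]*   |   '*' ( [\r\n]+ | EOF )
COMMENT = re.compile(r"\*[^+\-\r\n][^\r\n]*|\*(?:[\r\n]+|\Z)")
STAR_MINUS = re.compile(r"\*-")
STAR_PLUS = re.compile(r"\*\+")

def classify(line):
    """Return the name of the first rule that matches at the start of the line."""
    for name, pattern in (("COMMENT", COMMENT),
                          ("STAR_MINUS", STAR_MINUS),
                          ("STAR_PLUS", STAR_PLUS)):
        if pattern.match(line):
            return name
    return "OTHER"

for line in ["* this is a comment line",
             "** another comment line",
             "*+ type value",
             "*- type value",
             "*"]:
    print(repr(line), "->", classify(line))
```

The last input, a lone * with nothing after it, is the end-of-input case that the second COMMENT alternative exists for.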

Lexer and Parser rules for a simple command processor

I am attempting to build a simple command processor for a legacy language.
I am attempting to work in C# with ANTLR4 ("ANTLR", version "4.6.6").
I am unable to make progress against one scenario, of several.
The following examples show various sample invocations of the command PKS.
PKS
PKS?
PKStext_that_is_a_filename
The scenario that I cannot solve is the PKS command followed by a filename.
Command:
PKS
(block (line (expr (command PKS)) (eol \r\n)) <EOF>)
Command:
PKS?
(block (line (expr (command PKS) (query ?)) (eol \r\n)) <EOF>)
Command:
PKSFILENAME
line 1:0 mismatched input 'PKSFILENAME' expecting COMMAND
(block PKSFILENAME \r\n)
Command:
What I believe to be the relevant snippet of the grammar:
block : line+ EOF;
line : (expr eol)+;
expr : command file
| command listOfDouble
| command query
| command
;
command : COMMAND
;
query : QUERY;
file : TEXT ;
eol : EOL;
listOfDouble: DOUBLE (COMMA DOUBLE)* ;
From the lexer:
COMMAND : PKS;
PKS :'PKS' ;
QUERY : '?'
;
fragment LETTER : [A-Z];
fragment DIGIT : [0-9];
fragment UNDER : [_];
TEXT : (LETTER) (LETTER|DIGIT|UNDER)* ;
The main problem here is that your TEXT rule also matches what PKS is supposed to match. Since PKStext_that_is_a_filename can be matched entirely by the TEXT rule, and the lexer prefers the longest match, TEXT wins over the PKS rule even though PKS appears first in the grammar (rule order only breaks ties: if two rules match the same input, the first one wins).
In order to fix that problem you have 2 options:
Require whitespace(s) between the keyword (PKS) and the rest of the expression.
Change the TEXT rule to explicitly exclude "PKS" as valid input.
Option 2 is certainly possible, but it will get very messy if you have more keywords (as they would all have to be excluded). With whitespace between the keywords and the text, the lexer does that for you automatically.
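Here is a small Python model of that behaviour, in case it helps to see it outside of ANTLR. It is a sketch under assumptions: the lex helper is made up for this illustration, the rules are hand-translated to regular expressions, and the WS rule is hypothetical (the grammar in the question does not define one). It only implements "longest match wins, first rule wins on a tie".

```python
import re

RULES = [
    ("COMMAND", r"PKS"),
    ("QUERY", r"\?"),
    ("TEXT", r"[A-Za-z][A-Za-z0-9_]*"),
    ("WS", r"[ \t]+"),  # hypothetical skipped whitespace rule, not in the original grammar
]

def lex(text):
    """Longest match wins; on a tie the rule listed first is kept."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > best_end:  # strictly longer, so earlier rules win ties
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

print(lex("PKS"))                          # tie between COMMAND and TEXT: COMMAND is first
print(lex("PKS?"))
print(lex("PKStext_that_is_a_filename"))   # TEXT is longer, so no COMMAND token at all
print(lex("PKS text_that_is_a_filename"))  # option 1: whitespace separates the keyword
```

The third call is the failing case from the question: the parser expects COMMAND but the lexer only ever produced one TEXT token.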
And let me give you a hint for approaching this kind of problem: always check the token list produced by the lexer to see if it generated the tokens you expected. I reworked your grammar a bit, added missing tokens and ran it through my ANTLR4 debugger, which gave me:
Parser error (5, 1): extraneous input 'PKStext_that_is_a_filename' expecting {<EOF>, COMMAND, EOL}
Tokens:
[#0,0:2='PKS',<1>,1:0]
[#1,3:3='\n',<8>,1:3]
[#2,4:4='\n',<8>,2:0]
[#3,5:7='PKS',<1>,3:0]
[#4,8:8='?',<3>,3:3]
[#5,9:9='\n',<8>,3:4]
[#6,10:10='\n',<8>,4:0]
[#7,11:36='PKStext_that_is_a_filename',<7>,5:0]
[#8,37:37='\n',<8>,5:26]
[#9,38:37='<EOF>',<-1>,6:0]
For this input:
PKS
PKS?
PKStext_that_is_a_filename
Here's the grammar I used:
grammar Example;
start: block;
block: line+ EOF;
line: expr? eol;
expr: command (file | listOfDouble | query)?;
command: COMMAND;
query: QUERY;
file: TEXT;
eol: EOL;
listOfDouble: DOUBLE (COMMA DOUBLE)*;
COMMAND: PKS;
PKS: 'PKS';
QUERY: '?';
fragment LETTER: [a-zA-Z];
fragment DIGIT: [0-9];
fragment UNDER: [_];
COMMA: ',';
DOUBLE: DIGIT+ (DOT DIGIT*)?;
DOT: '.';
TEXT: LETTER (LETTER | DIGIT | UNDER)*;
EOL: [\n\r];
and the generated visual parse tree:

ANTLR 4 how to parse comments

I am parsing a SQL like language and I am having trouble parsing comments.
The idea is to ignore them.
I have these rules:
NEWLINE: '\r'? '\n' -> skip;
WS : [ \t]+ -> skip;
How can I ignore:
Everything that is between '--' or '#' and the next '\n'
Everything between '/*' and '*/'
I tried something like this before the WS and NEWLINE rules:
COMMENT1 : ('--'|'#') ~'\n'* -> skip;
It didn't work; I got:
line 1:115 missing ';' at '<EOF>'
probably because it didn't fit with my main rule:
parse : (statments (';')+)* EOF;
Can anyone help me?
Regards idob
When in doubt, see what someone else did ;)
There are some ready-made grammars for different languages, more or less working.
So I look in Java's grammar and see:
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
So your general idea seems to be correct. I'm guessing that the problem lies somewhere else. Can you provide sample input you test on and your grammar (relevant parts)?
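For what it's worth, the three comment styles from the question can be prototyped with a regular expression before you commit them to the grammar. The sketch below is plain Python, not ANTLR; strip_comments is a name made up for this illustration, and it deliberately ignores string literals, so a '--' inside a quoted string would be stripped too.

```python
import re

COMMENT = re.compile(
    r"/\*.*?\*/"            # block comment, non-greedy like '/*' .*? '*/'
    r"|(?:--|#)[^\r\n]*",   # line comment up to, but not including, the line break
    re.DOTALL,              # let '.' cross line breaks inside block comments
)

def strip_comments(sql):
    """Remove the comments and leave everything else, including line breaks, in place."""
    return COMMENT.sub("", sql)

print(repr(strip_comments("SELECT 1; -- trailing comment")))
print(repr(strip_comments("SELECT /* multi\nline */ 2;")))
print(repr(strip_comments("# leading comment\nSELECT 3;")))
```

Note that the line-comment pattern does not require a line break at the end, so a comment on the last line of the input is still recognised.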

string recursion antlr lexer token

How do I build a token in lexer that can handle recursion inside as this string:
${*anything*${*anything*}*anything*}
?
Yes, you can use recursion inside lexer rules.
Take the following example:
${a ${b} ${c ${ddd} c} a}
which will be parsed correctly by the following grammar:
parse
: DollarVar
;
DollarVar
: '${' (DollarVar | EscapeSequence | ~Special)+ '}'
;
fragment
Special
: '\\' | '$' | '{' | '}'
;
fragment
EscapeSequence
: '\\' Special
;
as the interpreter inside ANTLRWorks shows (screenshot: http://img185.imageshack.us/img185/5471/recq.png).
ANTLR's lexers do support recursion, as @BartK adeptly points out in his post, but you will only see a single token within the parser. If you need to interpret the various pieces within that token, you'll probably want to handle it within the parser.
IMO, you'd be better off doing something in the parser:
variable: DOLLAR LBRACE (id | variable)+ RBRACE;
By doing something like the above, you'll see all the necessary pieces and can build an AST or otherwise handle accordingly.
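To show what "seeing all the pieces" buys you, here is a small recursive-descent sketch in plain Python that turns the nested ${...} string into a tree. It is not generated by ANTLR; parse_var is a name made up for this illustration, and as a simplification a lone $ that is not followed by { is treated as ordinary text.

```python
def parse_var(text, pos=0):
    """Parse one ${...} starting at pos and return (tree, next_pos).

    The tree is a list whose items are plain text chunks (str) or nested
    trees (list) for inner ${...} occurrences.
    """
    if not text.startswith("${", pos):
        raise ValueError(f"expected '${{' at offset {pos}")
    pos += 2
    parts, chunk = [], []
    while pos < len(text):
        if text.startswith("${", pos):
            if chunk:  # flush the text collected so far
                parts.append("".join(chunk))
                chunk = []
            child, pos = parse_var(text, pos)  # recurse into the nested variable
            parts.append(child)
        elif text[pos] == "}":
            if chunk:
                parts.append("".join(chunk))
            return parts, pos + 1
        elif text[pos] == "\\" and pos + 1 < len(text):
            chunk.append(text[pos + 1])  # escape sequence: keep the escaped character
            pos += 2
        else:
            chunk.append(text[pos])
            pos += 1
    raise ValueError("unterminated ${...}")

source = "${a ${b} ${c ${ddd} c} a}"
tree, end = parse_var(source)
print(tree)
```

Each nesting level becomes its own list, which is the structure you lose when the whole thing is matched as a single lexer token.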
