ANTLR 4 how to parse comments - antlr4

I am parsing a SQL like language and I am having trouble parsing comments.
The idea is to ignore them.
I have these rules:
NEWLINE: '\r'? '\n' -> skip
WS : [ \t]+ -> skip
How can I ignore:
Everything that is between '--'or '#' and the next '\n'
Everything between '/' and '/' (slash + asterisk untill asterix + slash - the asterisk somehow gone).
I tried something like this before the WS and the NEWLINW:
COMMENT1 : ('--'|'#') ~'\n'* -> skip;
didn't work - I got:
line 1:115 missing ';' at '<EOF>'
probably something because it didn't go with my main rule:
parse : (statments (';')+)* EOF;
Can anyone help me?
Regards idob

When in doubt, see what someone else did ;)
There are some ready-made grammars for different languages, more or less working.
So I look in Java's grammar and see:
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
So your general idea seems to be correct. I'm guessing that the problem lies somewhere else. Can you provide sample input you test on and your grammar (relevant parts)?

Related

Atleast ONE Space around Parenthesis in ANTLR4

I want spaces around Parenthesis in IF condition. ATleast one space is required. But when i use Space in grammar it throws me error, when i use Else block with it. Please help me, how to accomplish it, i have seen many examples but none is related to it.
i only need spaces around Parenthesis of If condition.
prog: stat_block EOF;
stat_block: OBRACE block CBRACE;
block: (stat (stat)*)?;
stat: expr ';'
| IF condition_block (ELSE stat_block)?
;
expr
: expr SPACE ('*' | '/') SPACE expr
| ID
| INT
| STRING
;
exprList: expr (',' expr)*;
condition_block: SPACE OPAR SPACE expr SPACE CPAR SPACE stat_block;
IF: 'IF';
ELSE: 'ELSE';
OPAR: '(';
CPAR: ')';
OBRACE: '{';
CBRACE: '}';
SPACE: SINGLE_SPACE+;
SINGLE_SPACE: ' ';
ID: [a-zA-Z]+;
INT: [0-9]+;
NEWLINE: '\r'? '\n' -> skip;
WS: [ \t]+ -> skip;
Expected input to parse
IF ( 3 ) { } ELSE { }
Current Input
There's a reason that almost all languages ignore whitespace. If you don't ignore it, then you have to deal with its possible existence in the token stream anywhere it might, or might not, be in ALL of your parser rules.
You can try to include the spaces in the Lexer rules for tokens that you want wrapped in spaces, but may still find surprises.
Suggestion: Instead if -> skip; for your WS rule, use -> channel(HIDDEN); This keeps the tokens in the token stream so you can look for them in your code, but "hides" the whitespace tokens from the parser rules. This also allows ANTLR to get a proper interpretation of your input and build a parse tree that represents it correctly.
If you REALLY want to insist on the spaces before/after, you can write code in a listener that looks before/after the tokens in the input stream to see if you have whitespace, and generate your own error (that can be very specific about your requirement).
At least one space is required.
Then you either:
cannot -> skip the WS rule, which will cause all spaces and tabs to become tokens and needing your parser to handle them correctly (which is likely going to become a complete mess in your parser rules!), or
you leave WS -> skip as-is, but include a space in your PAREN rules: OPAR : ' ( '; CPAR: ' ) '; (or with tabs as well if that is possible)

Antlr4: Skip line when it start with * unless the second char is

In my input, a line start with * is a comment line unless it starts with *+ or *-. I can ignore the comments but need to get the others.
This is my lexer rules:
WhiteSpaces : [ \t]+;
Newlines : [\r\n]+;
Commnent : '*' .*? Newlines -> skip ;
SkipTokens : (WhiteSpaces | Newlines) -> skip;
An example:
* this is a comment line
** another comment line
*+ type value
So, the first two are comment lines, and I can skip it. But I don't know to to define lexer/parser rule that can catch the last line.
Your SkipTokens lexer rule will never be matched because the rules WhiteSpaces and Newlines are placed before it. See this Q&A for an explanation how the lexer matches tokens: ANTLR Lexer rule only seems to work as part of parser rule, and not part of another lexer rule
For it to work as you expect, do this:
SkipTokens : (WhiteSpaces | Newlines) -> skip;
fragment WhiteSpaces : [ \t]+;
fragment Newlines : [\r\n]+;
What a fragment is, check this Q&A: What does "fragment" mean in ANTLR?
Now, for your question. You defined a Comment rule to always end with a line break. This means that there can't be a comment at the end of your input. So you should let a comment either end with a line break or the EOF.
Something like this should do the trick:
COMMENT
: '*' ~[+\-\r\n] ~[\r\n]* // a '*' must be followed by something other than '+', '-' or a line break
| '*' ( [\r\n]+ | EOF ) // a '*' is a valid comment if directly followed by a line break, or the EOF
;
STAR_MINUS
: '*-'
;
STAR_PLUS
: '*+'
;
SPACES
: [ \t\r\n]+ -> skip
;
This, of course, does not mandate the * to be at the start of the line. If you want that, checkout this Q&A: Handle strings starting with whitespaces

Keep whitespace and comment in ANTLR4

When I run InsertSerialID.java or ExtractInterfaceTool.java in ANTLR4 tour source pack( https://pragprog.com/titles/tpantlr2/source_code ), I found all the white-space and comments are not included in the output. So the output source code cannot be compiled or readable. How to keep them?
Well, I found redirect to an extra channel will keep them in the Token, instead of use skip
WS : [ \t\r\n\u000C]+ -> channel(2) // -> skip
;
COMMENT
: '/*' .*? '*/' -> channel(2) // -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2) // -> skip
;
They saved in ParserRuleContext.getSourceInterval() as Interval, although I don't know how to map Interval to thier gramma type.

how to handling nested comments in antlr lexer

How to handle nested comments in antlr4 lexer? ie I need to count the number of "/*" inside this token and close only after the same number of "*/" have been received. As an example, the D language has such nested comments as "/+ ... +/"
For example, the following lines should be treated as one block of comments:
/* comment 1
comment 2
/* comment 3
comment 4
*/
// comment 5
comment 6
*/
My current code is the following, and it does not work on the above nested comment:
COMMENT : '/*' .*? '*/' -> channel(HIDDEN)
;
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' -> channel(HIDDEN)
;
Terence Parr has these two lexer lines in his Swift Antlr4 grammar for lexing out nested comments:
COMMENT : '/*' (COMMENT|.)*? '*/' -> channel(HIDDEN) ;
LINE_COMMENT : '//' .*? '\n' -> channel(HIDDEN) ;
I'm using:
COMMENT: '/*' ('/'*? COMMENT | ('/'* | '*'*) ~[/*])*? '*'*? '*/' -> skip;
This forces any /* inside a comment to be the beginning of a nested comment and similarly with */. In other words, there's no way to recognize /* and */ other than at the beginning and end of the rule COMMENT.
This way, something like /* /* /* */ a */ would not be recognized entirely as a (bad) comment (mismatched /*s and */s), as it would if using COMMENT: '/*' (COMMENT|.)*? '*/' -> skip;, but as /, followed by *, followed by correct nested comments /* /* */ a */.
Works for Antlr3.
Allows nested comments and '*' within a comment.
fragment
F_MultiLineCommentTerm
:
( {LA(1) == '*' && LA(2) != '/'}? => '*'
| {LA(1) == '/' && LA(2) == '*'}? => F_MultiLineComment
| ~('*')
)*
;
fragment
F_MultiLineComment
:
'/*'
F_MultiLineCommentTerm
'*/'
;
H_MultiLineComment
: r= F_MultiLineComment
{ $channel=HIDDEN;
printf(stder,"F_MultiLineComment[\%s]",$r->getText($r)->chars);
}
;
I can give you an ANTLR3 solution, which you can adjust to work in ANTLR4:
I think you can use a recursive rule invocation. Make a non-greedy comment rule for /* ... */ which calls itself. That should allow for unlimited nesting without having to count opening + closing comment markers:
COMMENT option { greedy = false; }:
('/*' ({LA(1) == '/' && LA(2) == '*'} => COMMENT | .) .* '*/') -> channel(HIDDEN)
;
or maybe even:
COMMENT option { greedy = false; }:
('/*' .* COMMENT? .* '*/') -> channel(HIDDEN)
;
I'm not sure if ANTLR properly chooses the right path depending on any char or the comment introducer. Try it out.
This will handle : '/*/*/' and '/*.../*/'where the comment body is '/' and '.../' respectively.
Multiline comments will not nest inside single line comments, therefore you cannot start nor begin a multiline comment inside a single line comment.
This is not a valid comment: '/* // */'.
You need a newline to end the single line comment before the '*/' can be consumed to end the multiline comment.
This is a valid comment: '/* // */ \n /*/'.
The comment body is: ' // */ \n /'. As you can see the complete single line comment is included in the body of the multiline comment.
Although '/*/' can end a multiline comment if the preceding character is '*', the comment will end on the first '/' and remaining '*/' will need to end a nested comment otherwise there is a error. The shortest path wins, this is non-greedy!
This is not a valid comment /****/*/
This is a valid comment /*/****/*/, the comment body is /****/, which is itself a nested comment.
The prefix and suffix will never be matched in the multiline comment body.
If you want to implement this for the 'D' language, change the '*' to '+'.
COMMENT_NEST
: '/*'
( ('/'|'*'+)? ~[*/] | COMMENT_NEST | COMMENT_INL )*?
('/'|'*'+?)?
'*/'
;
COMMENT_INL
: '//' ( COMMENT_INL | ~[\n\r] )*
;

Optional token at end of file

I am trying to write a grammar to parse a file where blank lines indicate the end of a block. I have grammar similar to this which almost works.
file : block+ EOF;
block : line+ NL;
line : stuff NL;
NL : '\r'? '\n';
This works except that the last block sometimes does not have an extra newline. Is there a good way to make the NL at the end of block optional when I am at the end of the file?
In antlr3, I would have done
block : line+ (NL | (EOF) => /* empty */ )
However, antlr4 does not have syntactic predicates, so I can't do that.
block : line+ NL? ;
should work, but then a block in the middle of the file could avoid its final newline. I don't think it will since a block can only be followed by a block. That means a block without the trailing newline followed by a block looks like one single block, and the parser will greedily combine them. However, it makes it less clear what the structure actually is. I can certainly imagine more complicated source file formats where this causes a problem.
Is there a good way to solve this?
Try something like this:
file : NL* block (NL+ block)* NL* EOF;
block : line (NL line)*;
line : stuff;
NL : '\r'? '\n';
Or simply append a line break at the end of your input.

Resources