how to handling nested comments in antlr lexer - antlr4

How to handle nested comments in antlr4 lexer? ie I need to count the number of "/*" inside this token and close only after the same number of "*/" have been received. As an example, the D language has such nested comments as "/+ ... +/"
For example, the following lines should be treated as one block of comments:
/* comment 1
comment 2
/* comment 3
comment 4
*/
// comment 5
comment 6
*/
My current code is the following, and it does not work on the above nested comment:
COMMENT : '/*' .*? '*/' -> channel(HIDDEN)
;
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' -> channel(HIDDEN)
;

Terence Parr has these two lexer lines in his Swift Antlr4 grammar for lexing out nested comments:
COMMENT : '/*' (COMMENT|.)*? '*/' -> channel(HIDDEN) ;
LINE_COMMENT : '//' .*? '\n' -> channel(HIDDEN) ;

I'm using:
COMMENT: '/*' ('/'*? COMMENT | ('/'* | '*'*) ~[/*])*? '*'*? '*/' -> skip;
This forces any /* inside a comment to be the beginning of a nested comment and similarly with */. In other words, there's no way to recognize /* and */ other than at the beginning and end of the rule COMMENT.
This way, something like /* /* /* */ a */ would not be recognized entirely as a (bad) comment (mismatched /*s and */s), as it would if using COMMENT: '/*' (COMMENT|.)*? '*/' -> skip;, but as /, followed by *, followed by correct nested comments /* /* */ a */.

Works for Antlr3.
Allows nested comments and '*' within a comment.
fragment
F_MultiLineCommentTerm
:
( {LA(1) == '*' && LA(2) != '/'}? => '*'
| {LA(1) == '/' && LA(2) == '*'}? => F_MultiLineComment
| ~('*')
)*
;
fragment
F_MultiLineComment
:
'/*'
F_MultiLineCommentTerm
'*/'
;
H_MultiLineComment
: r= F_MultiLineComment
{ $channel=HIDDEN;
printf(stder,"F_MultiLineComment[\%s]",$r->getText($r)->chars);
}
;

I can give you an ANTLR3 solution, which you can adjust to work in ANTLR4:
I think you can use a recursive rule invocation. Make a non-greedy comment rule for /* ... */ which calls itself. That should allow for unlimited nesting without having to count opening + closing comment markers:
COMMENT option { greedy = false; }:
('/*' ({LA(1) == '/' && LA(2) == '*'} => COMMENT | .) .* '*/') -> channel(HIDDEN)
;
or maybe even:
COMMENT option { greedy = false; }:
('/*' .* COMMENT? .* '*/') -> channel(HIDDEN)
;
I'm not sure if ANTLR properly chooses the right path depending on any char or the comment introducer. Try it out.

This will handle : '/*/*/' and '/*.../*/'where the comment body is '/' and '.../' respectively.
Multiline comments will not nest inside single line comments, therefore you cannot start nor begin a multiline comment inside a single line comment.
This is not a valid comment: '/* // */'.
You need a newline to end the single line comment before the '*/' can be consumed to end the multiline comment.
This is a valid comment: '/* // */ \n /*/'.
The comment body is: ' // */ \n /'. As you can see the complete single line comment is included in the body of the multiline comment.
Although '/*/' can end a multiline comment if the preceding character is '*', the comment will end on the first '/' and remaining '*/' will need to end a nested comment otherwise there is a error. The shortest path wins, this is non-greedy!
This is not a valid comment /****/*/
This is a valid comment /*/****/*/, the comment body is /****/, which is itself a nested comment.
The prefix and suffix will never be matched in the multiline comment body.
If you want to implement this for the 'D' language, change the '*' to '+'.
COMMENT_NEST
: '/*'
( ('/'|'*'+)? ~[*/] | COMMENT_NEST | COMMENT_INL )*?
('/'|'*'+?)?
'*/'
;
COMMENT_INL
: '//' ( COMMENT_INL | ~[\n\r] )*
;

Related

ANTLR: how to debug a misidentified token

I am trying to implement a grammar in Antlr4 for a simple template engine. This engine consists of 3 different clauses:
IF ANSWERED ( variable )
END IF
Variable
Variable can be any upper or lowercase letter including white spaces. Both IF ANSWERED and END IF are always uppercase.
I have written the following grammar/lexer rules so far, but my problem is that IF ANSWERED keeps getting recognized as a Variable and not as 2 tokens IF and ANSWERED.
grammar program;
/**grammar */
command: (ifStart | ifEnd | VARIABLE ) EOF;
ifStart: IF ANSWERED '(' VARIABLE ')';
ifEnd: 'END IF';
/** lexer */
IF: 'IF';
ANSWERED: 'ANSWERED';
TEXT: (LOWERCASE | UPPERCASE | NUMBER) ;
VARIABLE: (TEXT | [ \t\r\n])+;
fragment LOWERCASE: [a-z];
fragment UPPERCASE: [A-Z];
fragment NUMBER: [0-9];
If I try to parse IF ANSWERED ( FirstName ) I get the following output:
[#0,0:10='IF ANSWERED',**<VARIABLE>**,1:0]
[#1,11:11='(',<'('>,1:11]
[#2,12:25='Execution date',<VARIABLE>,1:12]
[#3,26:26=')',<')'>,1:26]
[#4,27:26='<EOF>',<EOF>,1:27]
line 1:0 mismatched input 'IF ANSWERED' expecting 'IF'
I read that Antlr4 is greedy and tries to match the biggest possible token, but I fail to understand what is the correct approach, or how to think through the problem to find a solution.
Correct: ANTLR's lexer is greedy, and tries to consume as much as possible. That is why IF ANSWERED is tokenised as a TEXT token instead of 2 separate keywords. You'll need to change TEXT so that it does not match spaces.
Something like this could get you started:
parse
: command* EOF
;
command
: (ifStatement | variable)+
;
ifStatement
: IF ANSWERED '(' variable ')' command* END IF
;
variable
: TEXT
;
IF : 'IF';
END : 'END';
ANSWERED : 'ANSWERED';
TEXT : [a-zA-Z0-9]+;
SPACES : [ \t\r\n]+ -> skip;

Antlr4: Skip line when it start with * unless the second char is

In my input, a line start with * is a comment line unless it starts with *+ or *-. I can ignore the comments but need to get the others.
This is my lexer rules:
WhiteSpaces : [ \t]+;
Newlines : [\r\n]+;
Commnent : '*' .*? Newlines -> skip ;
SkipTokens : (WhiteSpaces | Newlines) -> skip;
An example:
* this is a comment line
** another comment line
*+ type value
So, the first two are comment lines, and I can skip it. But I don't know to to define lexer/parser rule that can catch the last line.
Your SkipTokens lexer rule will never be matched because the rules WhiteSpaces and Newlines are placed before it. See this Q&A for an explanation how the lexer matches tokens: ANTLR Lexer rule only seems to work as part of parser rule, and not part of another lexer rule
For it to work as you expect, do this:
SkipTokens : (WhiteSpaces | Newlines) -> skip;
fragment WhiteSpaces : [ \t]+;
fragment Newlines : [\r\n]+;
What a fragment is, check this Q&A: What does "fragment" mean in ANTLR?
Now, for your question. You defined a Comment rule to always end with a line break. This means that there can't be a comment at the end of your input. So you should let a comment either end with a line break or the EOF.
Something like this should do the trick:
COMMENT
: '*' ~[+\-\r\n] ~[\r\n]* // a '*' must be followed by something other than '+', '-' or a line break
| '*' ( [\r\n]+ | EOF ) // a '*' is a valid comment if directly followed by a line break, or the EOF
;
STAR_MINUS
: '*-'
;
STAR_PLUS
: '*+'
;
SPACES
: [ \t\r\n]+ -> skip
;
This, of course, does not mandate the * to be at the start of the line. If you want that, checkout this Q&A: Handle strings starting with whitespaces

Keep whitespace and comment in ANTLR4

When I run InsertSerialID.java or ExtractInterfaceTool.java in ANTLR4 tour source pack( https://pragprog.com/titles/tpantlr2/source_code ), I found all the white-space and comments are not included in the output. So the output source code cannot be compiled or readable. How to keep them?
Well, I found redirect to an extra channel will keep them in the Token, instead of use skip
WS : [ \t\r\n\u000C]+ -> channel(2) // -> skip
;
COMMENT
: '/*' .*? '*/' -> channel(2) // -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2) // -> skip
;
They saved in ParserRuleContext.getSourceInterval() as Interval, although I don't know how to map Interval to thier gramma type.

extra channels in antlr 4.5

I am using antlr 4.5 to build a parser for a language with several special comment formats, which I would like to stream to different channels.
It seems antlr 4.5 has been extended with a new construct for declaring extra lexer channels:
extract from doc https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules
As of 4.5, you can also define channel names like you enumerations
with the following construct above the lexer rules:
channels { WSCHANNEL, MYHIDDEN }
My lexing and parsing rules are in a single file, and my code looks like this:
channels {
ANNOT_CHANNEL,
FORMAL_SPEC_CHANNEL,
DOC_CHANNEL,
COMMENT_CHANNEL,
PRAGMAS_CHANNEL
}
... parsing rules ...
// expression annotation (sent to a special channel)
ANNOT: (EOL_ANNOT | LUS_ANNOT | C_ANNOT) -> channel(ANNOT_CHANNEL) ;
fragment LUS_ANNOT: '(*!' ( COMMENT | . )*? '*)' ;
fragment C_ANNOT: '/*!' ( COMMENT | . )*? '*/' ;
fragment EOL_ANNOT: ('--!' | '//!') .*? EOL ;
// formal specification annotations (sent to a special channel)
FORMAL_SPEC: (EOL_SPEC | LUS_SPEC | C_SPEC ) -> channel(FORMAL_SPEC_CHANNEL) ;
fragment LUS_SPEC: '(*#' ( COMMENT | . )*? '*)' ;
fragment C_SPEC: '/*#' ( COMMENT | . )*? '*/' ;
fragment EOL_SPEC: ('--#' | '//#' | '--%') .*? EOL;
// documentation annotation (sent to a special channel)
DOC: ( EOL_DOC |LUS_DOC | C_DOC ) -> channel(DOC_CHANNEL);
fragment LUS_DOC: '(**' ( COMMENT | . )*? '*)' ;
fragment C_DOC: '/**' ( COMMENT | . )*? '*/' ;
fragment EOL_DOC: ('--*' | '//*') .*? EOL;
// standard comment (sent to a special channel)
COMMENT: ( EOL_COMMENT | LUS_COMMENT | C_COMMENT ) -> channel(COMMENT_CHANNEL);
fragment LUS_COMMENT: '(*' ( COMMENT | . )*? '*)' ;
fragment C_COMMENT: '/*' ( COMMENT |. )*? '*/' ;
fragment EOL_COMMENT: ('--' | '//') .*? EOL;
// pragmas are sent to a special channel
PRAGMA: '#pragma' CHARACTER* '#end' -> channel(PRAGMAS_CHANNEL);
however I am still getting this 4.4-like error
warning(155): Scade6.g4:550:52: rule ANNOT contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Scade6.g4:556:56: rule FORMAL_SPEC contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Scade6.g4:562:45: rule DOC contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Scade6.g4:568:62: rule COMMENT contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Scade6.g4:574:47: rule PRAGMA contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
If I split lexer and parser in two distinct files and use an import statement to import the lexer in the parser I still get the same error as above,
Using integer constants instead of names with a combined grammar
-> channel(10000)
yields the following error
error(164): Scade6.g4:8:0: custom channels are not supported in combined grammars
If I split lexer and parser apart in two files and use integer constants no warning, however it is not really satisfactory for readability.
Is there anything I can do to have extra channels named properly? (with either combined or separate lexer/parser specs, no preference)
Regards,
Is there anything I can do to have extra channels named properly?
not sure about v4.5 (have not used it), but in v4.x you could always define channels like so (assuming using java):
grammar MyGrammar;
#lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
...the rest of your grammar goes here...
WS : [ \t\n\r]+ -> channel(WHITESPACE) ; // channel(1)
SL_COMMENT
: '//' .*? '\n' -> channel(COMMENTS) // channel(2)
;
If you do not already have "The Definitive ANTLR 4 Reference" book I recommend getting hold of it. Will save you a lot of time. Example above is from that book.

antlr tokenizer starts with the last token

I have the following grammar:
grammar Aligner;
line
: emptyLine
| codeLine
;
emptyLine
: ( KW_EMPTY KW_LINE )?
( EOL | EOF )
;
codeLine
: KW_LINE COLON
indent
CODE
( EOL | EOF )
;
indent
: absolute_indent
| relative_indent
;
absolute_indent
: NUMBER
;
relative_indent
: sign NUMBER
;
sign
: PLUS
| MINUS
;
COLON: ':';
MINUS: '-';
PLUS: '+';
KW_EMPTY: 'empty';
KW_LINE: 'line';
NUMBER
: DIGIT+
;
EOL
: ('\n' | '\r\n')
;
SPACING
: LINE_WS -> skip
;
CODE
: (~('\n' | '\r'))+
;
fragment
DIGIT
: '0'..'9'
;
fragment
LINE_WS
: ' '
| '\t'
| '\u000C'
;
when I try to parse - empty line I receive error: line 1:0 no viable alternative at input 'empty line'. When I debug what is going on, the very first token is from type CODE and includes the whole line.
What I am doing wrong?
ANTLR will try to match the longest possible token. When two lexer rules match the same string of a given length, the first rule that appears in the grammar wins.
You rule CODE is basically a catch-all: it will match whole lines of text. So here ANTLR has the choice of matching empty line as one single token of type CODE, and as no other rule can produce a token of length 10, the CODE rule will consume the whole line.
You should rewrite the CODE rule to make it match only what you mean by a code. Right now it's way too broad.

Resources