How to handle nested comments in an ANTLR lexer


How do I handle nested comments in an ANTLR4 lexer? That is, I need to count the number of "/*" inside the token and close it only after the same number of "*/" have been seen. As an example, the D language has nested comments of the form "/+ ... +/".
For example, the following lines should be treated as one block of comments:
/* comment 1
comment 2
/* comment 3
comment 4
*/
// comment 5
comment 6
*/
My current code is the following, and it does not work on the above nested comment:
COMMENT : '/*' .*? '*/' -> channel(HIDDEN)
;
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' -> channel(HIDDEN)
;
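For reference, the counting described above can be sketched outside ANTLR. This is only a minimal Python illustration of the depth counter, not lexer code; the function name nested_comment_end and its parameters are made up for the example. Like a plain counter, it does not treat // specially inside the block.

```python
def nested_comment_end(text, start=0, opener="/*", closer="*/"):
    """Return the index just past the closer that balances the opener at `start`.

    Returns -1 if no comment starts at `start` or the comment is never closed.
    """
    if not text.startswith(opener, start):
        return -1
    depth = 0
    i = start
    while i < len(text):
        if text.startswith(opener, i):
            depth += 1                # one more level to close
            i += len(opener)
        elif text.startswith(closer, i):
            depth -= 1
            i += len(closer)
            if depth == 0:            # the outermost comment is balanced
                return i
        else:
            i += 1                    # ordinary comment character
    return -1


sample = (
    "/* comment 1\ncomment 2\n/* comment 3\ncomment 4\n*/\n"
    "// comment 5\ncomment 6\n*/ int x;"
)
end = nested_comment_end(sample)
print(repr(sample[end:]))  # ' int x;'
```

For D-style comments the same function can be called with opener="/+" and closer="+/".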

Terence Parr has these two lexer lines in his Swift Antlr4 grammar for lexing out nested comments:
COMMENT : '/*' (COMMENT|.)*? '*/' -> channel(HIDDEN) ;
LINE_COMMENT : '//' .*? '\n' -> channel(HIDDEN) ;

I'm using:
COMMENT: '/*' ('/'*? COMMENT | ('/'* | '*'*) ~[/*])*? '*'*? '*/' -> skip;
This forces any /* inside a comment to be the beginning of a nested comment and similarly with */. In other words, there's no way to recognize /* and */ other than at the beginning and end of the rule COMMENT.
This way, something like /* /* /* */ a */ is not recognized in its entirety as a (bad) comment with mismatched /*s and */s, as it would be with COMMENT: '/*' (COMMENT|.)*? '*/' -> skip;. Instead it is recognized as /, followed by *, followed by the correctly nested comment /* /* */ a */.

Works for Antlr3.
Allows nested comments and '*' within a comment.
fragment
F_MultiLineCommentTerm
:
( {LA(1) == '*' && LA(2) != '/'}? => '*'
| {LA(1) == '/' && LA(2) == '*'}? => F_MultiLineComment
| ~('*')
)*
;
fragment
F_MultiLineComment
:
'/*'
F_MultiLineCommentTerm
'*/'
;
H_MultiLineComment
: r= F_MultiLineComment
{ $channel=HIDDEN;
fprintf(stderr,"F_MultiLineComment[\%s]",$r->getText($r)->chars);
}
;

I can give you an ANTLR3 solution, which you can adjust to work in ANTLR4:
I think you can use a recursive rule invocation. Make a non-greedy comment rule for /* ... */ which calls itself. That should allow for unlimited nesting without having to count opening + closing comment markers:
COMMENT option { greedy = false; }:
('/*' ({LA(1) == '/' && LA(2) == '*'} => COMMENT | .) .* '*/') -> channel(HIDDEN)
;
or maybe even:
COMMENT option { greedy = false; }:
('/*' .* COMMENT? .* '*/') -> channel(HIDDEN)
;
I'm not sure if ANTLR properly chooses the right path depending on any char or the comment introducer. Try it out.

This will handle '/*/*/' and '/*.../*/', where the comment bodies are '/' and '.../' respectively.
Multiline comments do not nest inside single-line comments, so you can neither start nor end a multiline comment inside a single-line comment.
This is not a valid comment: '/* // */'.
You need a newline to end the single-line comment before the '*/' can be consumed to end the multiline comment.
This is a valid comment: '/* // */ \n /*/'.
The comment body is: ' // */ \n /'. As you can see, the complete single-line comment is included in the body of the multiline comment.
Although '/*/' can end a multiline comment if the preceding character is '*', the comment will end on the first '/', and the remaining '*/' will need to end a nested comment; otherwise there is an error. The shortest path wins: this is non-greedy!
This is not a valid comment: /****/*/
This is a valid comment: /*/****/*/. The comment body is /****/, which is itself a nested comment.
The prefix and suffix will never be matched in the multiline comment body.
If you want to implement this for the 'D' language, change the '*' to '+'.
COMMENT_NEST
: '/*'
( ('/'|'*'+)? ~[*/] | COMMENT_NEST | COMMENT_INL )*?
('/'|'*'+?)?
'*/'
;
COMMENT_INL
: '//' ( COMMENT_INL | ~[\n\r] )*
;

Related

ANTLR: how to debug a misidentified token

I am trying to implement a grammar in Antlr4 for a simple template engine. This engine consists of 3 different clauses:
IF ANSWERED ( variable )
END IF
Variable
Variable can be any upper or lowercase letter including white spaces. Both IF ANSWERED and END IF are always uppercase.
I have written the following grammar/lexer rules so far, but my problem is that IF ANSWERED keeps getting recognized as a Variable and not as 2 tokens IF and ANSWERED.
grammar program;
/**grammar */
command: (ifStart | ifEnd | VARIABLE ) EOF;
ifStart: IF ANSWERED '(' VARIABLE ')';
ifEnd: 'END IF';
/** lexer */
IF: 'IF';
ANSWERED: 'ANSWERED';
TEXT: (LOWERCASE | UPPERCASE | NUMBER) ;
VARIABLE: (TEXT | [ \t\r\n])+;
fragment LOWERCASE: [a-z];
fragment UPPERCASE: [A-Z];
fragment NUMBER: [0-9];
If I try to parse IF ANSWERED ( FirstName ) I get the following output:
[#0,0:10='IF ANSWERED',<VARIABLE>,1:0]
[#1,11:11='(',<'('>,1:11]
[#2,12:25='Execution date',<VARIABLE>,1:12]
[#3,26:26=')',<')'>,1:26]
[#4,27:26='<EOF>',<EOF>,1:27]
line 1:0 mismatched input 'IF ANSWERED' expecting 'IF'
I read that Antlr4 is greedy and tries to match the biggest possible token, but I fail to understand what is the correct approach, or how to think through the problem to find a solution.

Correct: ANTLR's lexer is greedy, and tries to consume as much as possible. That is why IF ANSWERED is tokenised as a single VARIABLE token instead of 2 separate keywords. You'll need to change your text rule so that it does not match spaces.
Something like this could get you started:
parse
: command* EOF
;
command
: (ifStatement | variable)+
;
ifStatement
: IF ANSWERED '(' variable ')' command* END IF
;
variable
: TEXT
;
IF : 'IF';
END : 'END';
ANSWERED : 'ANSWERED';
TEXT : [a-zA-Z0-9]+;
SPACES : [ \t\r\n]+ -> skip;

Antlr4: Skip line when it start with * unless the second char is

In my input, a line that starts with * is a comment line unless it starts with *+ or *-. I can ignore the comments but need to get the others.
This is my lexer rules:
WhiteSpaces : [ \t]+;
Newlines : [\r\n]+;
Commnent : '*' .*? Newlines -> skip ;
SkipTokens : (WhiteSpaces | Newlines) -> skip;
An example:
* this is a comment line
** another comment line
*+ type value
So, the first two are comment lines, and I can skip them. But I don't know how to define a lexer/parser rule that can catch the last line.

Your SkipTokens lexer rule will never be matched because the rules WhiteSpaces and Newlines are placed before it. See this Q&A for an explanation how the lexer matches tokens: ANTLR Lexer rule only seems to work as part of parser rule, and not part of another lexer rule
For it to work as you expect, do this:
SkipTokens : (WhiteSpaces | Newlines) -> skip;
fragment WhiteSpaces : [ \t]+;
fragment Newlines : [\r\n]+;
For what a fragment is, see this Q&A: What does "fragment" mean in ANTLR?
Now, for your question. You defined a Comment rule to always end with a line break. This means that there can't be a comment at the end of your input. So you should let a comment either end with a line break or the EOF.
Something like this should do the trick:
COMMENT
: '*' ~[+\-\r\n] ~[\r\n]* // a '*' must be followed by something other than '+', '-' or a line break
| '*' ( [\r\n]+ | EOF ) // a '*' is a valid comment if directly followed by a line break, or the EOF
;
STAR_MINUS
: '*-'
;
STAR_PLUS
: '*+'
;
SPACES
: [ \t\r\n]+ -> skip
;
This, of course, does not mandate the * to be at the start of the line. If you want that, check out this Q&A: Handle strings starting with whitespaces
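To check the intent of the COMMENT rule against the sample lines, here is a rough Python sketch. It assumes the regular expression below is a fair transcription of the two COMMENT alternatives, with \Z standing in for EOF, and it tests whole lines rather than running a real lexer. The helper name is_comment_line is made up for the example.

```python
import re

# '*' followed by anything except '+', '-' or a line break, then the rest of the line;
# or a bare '*' followed by line breaks or the end of the input.
COMMENT = re.compile(r"\*(?:[^+\-\r\n][^\r\n]*|[\r\n]+|\Z)")


def is_comment_line(line):
    """True if the line would be consumed as a comment under the rule above."""
    return COMMENT.match(line) is not None


lines = [
    "* this is a comment line",
    "** another comment line",
    "*+ type value",
]
kept = [line for line in lines if not is_comment_line(line)]
print(kept)  # ['*+ type value']
```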

Keep whitespace and comment in ANTLR4

When I run InsertSerialID.java or ExtractInterfaceTool.java from the ANTLR4 tour source pack ( https://pragprog.com/titles/tpantlr2/source_code ), I find that none of the whitespace and comments are included in the output, so the output source code is neither compilable nor readable. How can I keep them?

Well, I found that redirecting them to an extra channel, instead of using skip, keeps them as tokens:
WS : [ \t\r\n\u000C]+ -> channel(2) // -> skip
;
COMMENT
: '/*' .*? '*/' -> channel(2) // -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2) // -> skip
;
They are covered by the Interval returned from ParserRuleContext.getSourceInterval(), although I don't know how to map an Interval to their grammar type.
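The difference between skip and a channel can be pictured with a toy model. This is plain Python, not the ANTLR runtime: the Token class, the hand-written token list and the helper names are all assumptions made for illustration. Only the channel number 2 is taken from the rules above.

```python
from dataclasses import dataclass

DEFAULT_CHANNEL = 0   # tokens the parser reads
EXTRA_CHANNEL = 2     # same number as channel(2) in the rules above


@dataclass
class Token:
    text: str
    channel: int = DEFAULT_CHANNEL


def lex_result(skip_trivia):
    """Hand-written tokens for 'int x; // counter' (not produced by a real lexer)."""
    raw = [("int", False), (" ", True), ("x", False),
           (";", False), (" ", True), ("// counter", True)]
    tokens = []
    for text, is_trivia in raw:
        if is_trivia and skip_trivia:
            continue  # models '-> skip': no token is created at all
        tokens.append(Token(text, EXTRA_CHANNEL if is_trivia else DEFAULT_CHANNEL))
    return tokens


def parser_view(tokens):
    """Texts of the tokens on the default channel only."""
    return [t.text for t in tokens if t.channel == DEFAULT_CHANNEL]


def rebuild(tokens):
    """Source text reassembled from every token that still exists."""
    return "".join(t.text for t in tokens)


skipped = lex_result(skip_trivia=True)
channelled = lex_result(skip_trivia=False)
print(parser_view(skipped) == parser_view(channelled))  # True
print(rebuild(skipped))      # intx;
print(rebuild(channelled))   # int x; // counter
```

In this model the parser sees the same tokens either way, but only the channelled list still has the whitespace and comment available when the text is written back out.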

extra channels in antlr 4.5

I am using antlr 4.5 to build a parser for a language with several special comment formats, which I would like to stream to different channels.
It seems antlr 4.5 has been extended with a new construct for declaring extra lexer channels:
extract from doc https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules
As of 4.5, you can also define channel names like enumerations
with the following construct above the lexer rules:
channels { WSCHANNEL, MYHIDDEN }
My lexing and parsing rules are in a single file, and my code looks like this:
channels {
ANNOT_CHANNEL,
FORMAL_SPEC_CHANNEL,
DOC_CHANNEL,
COMMENT_CHANNEL,
PRAGMAS_CHANNEL
}
... parsing rules ...
// expression annotation (sent to a special channel)
ANNOT: (EOL_ANNOT | LUS_ANNOT | C_ANNOT) -> channel(ANNOT_CHANNEL) ;
fragment LUS_ANNOT: '(*!' ( COMMENT | . )*? '*)' ;
fragment C_ANNOT: '/*!' ( COMMENT | . )*? '*/' ;
fragment EOL_ANNOT: ('--!' | '//!') .*? EOL ;
// formal specification annotations (sent to a special channel)
FORMAL_SPEC: (EOL_SPEC | LUS_SPEC | C_SPEC ) -> channel(FORMAL_SPEC_CHANNEL) ;
fragment LUS_SPEC: '(*#' ( COMMENT | . )*? '*)' ;
fragment C_SPEC: '/*#' ( COMMENT | . )*? '*/' ;
fragment EOL_SPEC: ('--#' | '//#' | '--%') .*? EOL;
// documentation annotation (sent to a special channel)
DOC: ( EOL_DOC |LUS_DOC | C_DOC ) -> channel(DOC_CHANNEL);
fragment LUS_DOC: '(**' ( COMMENT | . )*? '*)' ;
fragment C_DOC: '/**' ( COMMENT | . )*? '*/' ;
fragment EOL_DOC: ('--*' | '//*') .*? EOL;
// standard comment (sent to a special channel)
COMMENT: ( EOL_COMMENT | LUS_COMMENT | C_COMMENT ) -> channel(COMMENT_CHANNEL);
fragment LUS_COMMENT: '(*' ( COMMENT | . )*? '*)' ;
fragment C_COMMENT: '/*' ( COMMENT |. )*? '*/' ;
fragment EOL_COMMENT: ('--' | '//') .*? EOL;
// pragmas are sent to a special channel
PRAGMA: '#pragma' CHARACTER* '#end' -> channel(PRAGMAS_CHANNEL);
however I am still getting this 4.4-like error
warning(155): Scade6.g4:550:52: rule ANNOT contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Scade6.g4:556:56: rule FORMAL_SPEC contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Scade6.g4:562:45: rule DOC contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Scade6.g4:568:62: rule COMMENT contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Scade6.g4:574:47: rule PRAGMA contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
If I split lexer and parser into two distinct files and use an import statement to import the lexer in the parser, I still get the same warnings as above.
Using integer constants instead of names with a combined grammar
-> channel(10000)
yields the following error
error(164): Scade6.g4:8:0: custom channels are not supported in combined grammars
If I split lexer and parser apart into two files and use integer constants, I get no warning; however, that is not really satisfactory for readability.
Is there anything I can do to have extra channels named properly? (with either combined or separate lexer/parser specs, no preference)
Regards,

Is there anything I can do to have extra channels named properly?
Not sure about v4.5 (I have not used it), but in v4.x you could always define channels like so (assuming the Java target):
grammar MyGrammar;
@lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
...the rest of your grammar goes here...
WS : [ \t\n\r]+ -> channel(WHITESPACE) ; // channel(1)
SL_COMMENT
: '//' .*? '\n' -> channel(COMMENTS) // channel(2)
;
If you do not already have "The Definitive ANTLR 4 Reference" book I recommend getting hold of it. Will save you a lot of time. Example above is from that book.

antlr tokenizer starts with the last token

I have the following grammar:
grammar Aligner;
line
: emptyLine
| codeLine
;
emptyLine
: ( KW_EMPTY KW_LINE )?
( EOL | EOF )
;
codeLine
: KW_LINE COLON
indent
CODE
( EOL | EOF )
;
indent
: absolute_indent
| relative_indent
;
absolute_indent
: NUMBER
;
relative_indent
: sign NUMBER
;
sign
: PLUS
| MINUS
;
COLON: ':';
MINUS: '-';
PLUS: '+';
KW_EMPTY: 'empty';
KW_LINE: 'line';
NUMBER
: DIGIT+
;
EOL
: ('\n' | '\r\n')
;
SPACING
: LINE_WS -> skip
;
CODE
: (~('\n' | '\r'))+
;
fragment
DIGIT
: '0'..'9'
;
fragment
LINE_WS
: ' '
| '\t'
| '\u000C'
;
When I try to parse the input empty line, I receive the error: line 1:0 no viable alternative at input 'empty line'. When I debug what is going on, the very first token is of type CODE and includes the whole line.
What am I doing wrong?

ANTLR will try to match the longest possible token. When two lexer rules match the same string of a given length, the first rule that appears in the grammar wins.
Your rule CODE is basically a catch-all: it will match whole lines of text. So here ANTLR has the choice of matching empty line as one single token of type CODE, and as no other rule can produce a token of length 10, the CODE rule will consume the whole line.
You should rewrite the CODE rule to make it match only what you mean by a code. Right now it's way too broad.
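The two rules of thumb above (longest match wins, earliest rule wins on a tie) can be modelled in a few lines of Python. This is only a sketch of the selection logic, not ANTLR's actual implementation. The regular expressions are hand translations of a subset of the lexer rules in the question, and next_token is a made-up helper.

```python
import re

# Subset of the question's lexer rules, in grammar order.
RULES = [
    ("KW_EMPTY", r"empty"),
    ("KW_LINE", r"line"),
    ("NUMBER", r"[0-9]+"),
    ("SPACING", r"[ \t\f]"),
    ("CODE", r"[^\r\n]+"),
]


def next_token(rules, text, pos=0):
    """Pick the rule with the longest match at `pos`; on a tie the earliest rule wins."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        # Strict '>' means a later rule must match MORE text to replace an earlier one.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name, text[pos:pos + best_len]


print(next_token(RULES, "empty line"))  # ('CODE', 'empty line')
print(next_token(RULES, "empty"))       # ('KW_EMPTY', 'empty')
```

With the whole line available, CODE matches 10 characters and beats KW_EMPTY's 5. With only "empty" as input, both match 5 characters and KW_EMPTY wins because it is listed first.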

Develop Reference
