I created a language with ANTLR, but I have a problem with the comment rules. In my language, a single-line comment begins with "$$", and a multi-line comment begins with '$$' and ends with '$$'. I used
the rules below:
COMMENT : '$$'.*?'$$' -> skip;
LINE_COMMENT : '$$'.*?'\n' -> skip;
but they sometimes don't work properly.
I am trying to implement a grammar in Antlr4 for a simple template engine. This engine consists of 3 different clauses:
IF ANSWERED ( variable )
END IF
Variable
Variable can be any upper or lowercase letter including white spaces. Both IF ANSWERED and END IF are always uppercase.
I have written the following grammar/lexer rules so far, but my problem is that IF ANSWERED keeps getting recognized as a Variable and not as 2 tokens IF and ANSWERED.
grammar program;
/**grammar */
command: (ifStart | ifEnd | VARIABLE ) EOF;
ifStart: IF ANSWERED '(' VARIABLE ')';
ifEnd: 'END IF';
/** lexer */
IF: 'IF';
ANSWERED: 'ANSWERED';
TEXT: (LOWERCASE | UPPERCASE | NUMBER) ;
VARIABLE: (TEXT | [ \t\r\n])+;
fragment LOWERCASE: [a-z];
fragment UPPERCASE: [A-Z];
fragment NUMBER: [0-9];
If I try to parse IF ANSWERED ( FirstName ) I get the following output:
[#0,0:10='IF ANSWERED',<VARIABLE>,1:0]
[#1,11:11='(',<'('>,1:11]
[#2,12:25='Execution date',<VARIABLE>,1:12]
[#3,26:26=')',<')'>,1:26]
[#4,27:26='<EOF>',<EOF>,1:27]
line 1:0 mismatched input 'IF ANSWERED' expecting 'IF'
I read that Antlr4 is greedy and tries to match the biggest possible token, but I fail to understand what is the correct approach, or how to think through the problem to find a solution.
Correct: ANTLR's lexer is greedy and tries to consume as much input as possible. That is why IF ANSWERED is tokenised as a single VARIABLE token instead of 2 separate keywords. You'll need to change the rule that matches names so that it does not match spaces.
Something like this could get you started:
parse
: command* EOF
;
command
: (ifStatement | variable)+
;
ifStatement
: IF ANSWERED '(' variable ')' command* END IF
;
variable
: TEXT
;
IF : 'IF';
END : 'END';
ANSWERED : 'ANSWERED';
TEXT : [a-zA-Z0-9]+;
SPACES : [ \t\r\n]+ -> skip;
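To make the greedy behaviour concrete, here is a minimal Python sketch. It is not ANTLR itself, only a simulation that assumes the lexer picks the longest match at each position and breaks ties in favour of the rule listed first. The tokenize helper, the two rule tables and the LPAREN/RPAREN names are made up for illustration, and the regexes are hand translations of the lexer rules above.

```python
import re


def tokenize(rules, text):
    """Simulate longest-match lexing; on a tie the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater: an equally long later rule never replaces an earlier one.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "SPACES":  # mimic '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Hand translation of the question's lexer: VARIABLE may contain spaces.
ORIGINAL_RULES = [
    ("IF", r"IF"),
    ("ANSWERED", r"ANSWERED"),
    ("TEXT", r"[a-zA-Z0-9]"),
    ("VARIABLE", r"[a-zA-Z0-9 \t\r\n]+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
]

# Hand translation of the answer's lexer: spaces are their own skipped token.
FIXED_RULES = [
    ("IF", r"IF"),
    ("END", r"END"),
    ("ANSWERED", r"ANSWERED"),
    ("TEXT", r"[a-zA-Z0-9]+"),
    ("SPACES", r"[ \t\r\n]+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
]

if __name__ == "__main__":
    source = "IF ANSWERED(FirstName)"
    print(tokenize(ORIGINAL_RULES, source))
    print(tokenize(FIXED_RULES, source))
```

With the original rules the first token swallows both keywords; with the fixed rules IF and TEXT both match two characters, and IF wins only because it is listed first.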
In my input, a line starting with * is a comment line unless it starts with *+ or *-. I can ignore the comments, but I need to keep the other lines.
This is my lexer rules:
WhiteSpaces : [ \t]+;
Newlines : [\r\n]+;
Comment : '*' .*? Newlines -> skip ;
SkipTokens : (WhiteSpaces | Newlines) -> skip;
An example:
* this is a comment line
** another comment line
*+ type value
So, the first two are comment lines, and I can skip them. But I don't know how to define a lexer/parser rule that can catch the last line.
Your SkipTokens lexer rule will never be matched because the rules WhiteSpaces and Newlines are placed before it. See this Q&A for an explanation of how the lexer matches tokens: ANTLR Lexer rule only seems to work as part of parser rule, and not part of another lexer rule
For it to work as you expect, do this:
SkipTokens : (WhiteSpaces | Newlines) -> skip;
fragment WhiteSpaces : [ \t]+;
fragment Newlines : [\r\n]+;
For what a fragment is, see this Q&A: What does "fragment" mean in ANTLR?
Now, for your question. You defined a Comment rule to always end with a line break. This means that there can't be a comment at the end of your input. So you should let a comment either end with a line break or the EOF.
Something like this should do the trick:
COMMENT
: '*' ~[+\-\r\n] ~[\r\n]* // a '*' must be followed by something other than '+', '-' or a line break
| '*' ( [\r\n]+ | EOF ) // a '*' is a valid comment if directly followed by a line break, or the EOF
;
STAR_MINUS
: '*-'
;
STAR_PLUS
: '*+'
;
SPACES
: [ \t\r\n]+ -> skip
;
This, of course, does not require the * to be at the start of the line. If you want that, check out this Q&A: Handle strings starting with whitespaces
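As a rough way to check which lines the COMMENT rule would accept, here is a small Python sketch. It assumes a hand-translated regex equivalent of that rule (with \Z standing in for EOF); the is_comment helper is hypothetical and only classifies one line at a time, so it says nothing about how ANTLR tokenises the rest of the input.

```python
import re

# Hand translation of the COMMENT rule:
#   '*' ~[+\-\r\n] ~[\r\n]*      -> \*[^+\-\r\n][^\r\n]*
#   '*' ( [\r\n]+ | EOF )        -> \*(?:[\r\n]+|\Z)
COMMENT_RE = re.compile(r"\*[^+\-\r\n][^\r\n]*|\*(?:[\r\n]+|\Z)")


def is_comment(line):
    """Return True if the whole line would be consumed as one COMMENT token."""
    return COMMENT_RE.fullmatch(line) is not None


if __name__ == "__main__":
    for line in ["* this is a comment line",
                 "** another comment line",
                 "*+ type value",
                 "*- type value",
                 "*"]:
        print(repr(line), is_comment(line))
```

Lines starting with *+ or *- fall through to STAR_PLUS and STAR_MINUS because the first alternative refuses a '+' or '-' right after the star.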
When I run InsertSerialID.java or ExtractInterfaceTool.java from the ANTLR4 tour source pack (https://pragprog.com/titles/tpantlr2/source_code), I find that all the whitespace and comments are missing from the output, so the generated source code cannot be compiled and is not readable. How can I keep them?
Well, I found that redirecting them to an extra channel, instead of using skip, keeps them in the token stream:
WS : [ \t\r\n\u000C]+ -> channel(2) // -> skip
;
COMMENT
: '/*' .*? '*/' -> channel(2) // -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2) // -> skip
;
They are saved within the Interval returned by ParserRuleContext.getSourceInterval(), although I don't know how to map an Interval to their grammar type.
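The difference between skip and a channel can be illustrated with a plain Python simulation. This is only a sketch under the assumption that skipped tokens are never created while channelled tokens stay in the stream but are invisible to the parser; the lex, parser_view and original_text helpers and the WORD/PUNCT token kinds are invented for the example and are not part of the ANTLR API.

```python
import re

DEFAULT, HIDDEN = 0, 2  # channel 2 mirrors channel(2) in the rules above

# Hand translation of WS, COMMENT and LINE_COMMENT, plus two catch-all kinds.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>/\*.*?\*/)
  | (?P<LINE_COMMENT>//[^\r\n]*)
  | (?P<WS>[ \t\r\n\f]+)
  | (?P<WORD>\w+)
  | (?P<PUNCT>.)
""", re.S | re.X)


def lex(text, use_channel=True):
    """Return (kind, text, channel) tuples; with use_channel=False, mimic '-> skip'."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        hidden = kind in ("COMMENT", "LINE_COMMENT", "WS")
        if hidden and not use_channel:
            continue  # skipped: the token is never created
        tokens.append((kind, m.group(), HIDDEN if hidden else DEFAULT))
    return tokens


def parser_view(tokens):
    """Tokens a parser would see: only those on the default channel."""
    return [t for t in tokens if t[2] == DEFAULT]


def original_text(tokens):
    """Rebuild source text from every token, hidden ones included."""
    return "".join(text for _, text, _ in tokens)


if __name__ == "__main__":
    src = "int x; // counter\n/* block */ x = 1;\n"
    print(repr(original_text(lex(src))))
    print(repr(original_text(lex(src, use_channel=False))))
```

In this model the parser sees the same tokens either way, but only the channel variant still holds enough text to reproduce the input.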
I am parsing a SQL like language and I am having trouble parsing comments.
The idea is to ignore them.
I have these rules:
NEWLINE: '\r'? '\n' -> skip;
WS : [ \t]+ -> skip;
How can I ignore:
Everything between '--' or '#' and the next '\n'
Everything between '/*' and '*/'
I tried something like this before the WS and NEWLINE rules:
COMMENT1 : ('--'|'#') ~'\n'* -> skip;
It didn't work. I got:
line 1:115 missing ';' at '<EOF>'
probably because it doesn't fit with my main rule:
parse : (statments (';')+)* EOF;
Can anyone help me?
Regards idob
When in doubt, see what someone else did ;)
There are some ready-made grammars for different languages, more or less working.
So I look in Java's grammar and see:
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
So your general idea seems to be correct. I'm guessing that the problem lies somewhere else. Can you provide sample input you test on and your grammar (relevant parts)?
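For quick experiments outside ANTLR, the same two comment shapes can be tried with a Python regex. This is a naive sketch: it assumes '--' and '#' as line-comment starters, as in the question, it does not know about string literals, and the strip_comments helper is made up for illustration.

```python
import re

# Block comments (non-greedy, may span lines) or line comments up to the line break.
COMMENT_RE = re.compile(r"/\*.*?\*/|(?:--|#)[^\r\n]*", re.S)


def strip_comments(sql):
    """Remove /* ... */, -- ... and # ... comments, leaving line breaks in place."""
    return COMMENT_RE.sub("", sql)


if __name__ == "__main__":
    text = "SELECT 1; -- first\nSELECT 2; /* spans\nlines */ # last"
    print(repr(strip_comments(text)))
```

Note that the line-comment pattern stops before the line break rather than consuming it, just like ~[\r\n]* in the Java grammar, so a comment on the last line without a trailing newline is still matched.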
How can I handle nested comments in the ANTLR4 lexer? That is, I need to count the number of "/*" inside this token and close it only after the same number of "*/" have been seen. As an example, the D language has such nested comments, written "/+ ... +/".
For example, the following lines should be treated as one block of comments:
/* comment 1
comment 2
/* comment 3
comment 4
*/
// comment 5
comment 6
*/
My current code is the following, and it does not work on the above nested comment:
COMMENT : '/*' .*? '*/' -> channel(HIDDEN)
;
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' -> channel(HIDDEN)
;
Terence Parr has these two lexer lines in his Swift Antlr4 grammar for lexing out nested comments:
COMMENT : '/*' (COMMENT|.)*? '*/' -> channel(HIDDEN) ;
LINE_COMMENT : '//' .*? '\n' -> channel(HIDDEN) ;
I'm using:
COMMENT: '/*' ('/'*? COMMENT | ('/'* | '*'*) ~[/*])*? '*'*? '*/' -> skip;
This forces any /* inside a comment to be the beginning of a nested comment and similarly with */. In other words, there's no way to recognize /* and */ other than at the beginning and end of the rule COMMENT.
This way, something like /* /* /* */ a */ would not be recognized entirely as a (bad) comment (mismatched /*s and */s), as it would if using COMMENT: '/*' (COMMENT|.)*? '*/' -> skip;, but as /, followed by *, followed by correct nested comments /* /* */ a */.
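The counting idea from the question can be written out by hand as a small Python scanner. It is a sketch of the strict reading described here, where every /* opens a level and every */ closes one; it is not a translation of either grammar rule, and the nested_comment_end name is made up for illustration.

```python
def nested_comment_end(text, start=0):
    """Return the index just past the comment that opens at `start`.

    Every '/*' raises the nesting depth and every '*/' lowers it;
    the comment ends when the depth returns to zero.
    """
    if not text.startswith("/*", start):
        raise ValueError("no comment opener at start")
    depth = 0
    i = start
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "/*":
            depth += 1
            i += 2
        elif pair == "*/":
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated comment")


if __name__ == "__main__":
    source = "/* a /* b */ c */ rest"
    end = nested_comment_end(source)
    print(repr(source[:end]))
```

On /* /* /* */ a */ this scanner reports an unterminated comment when started at offset 0, while starting at offset 3 consumes /* /* */ a */ as one balanced comment.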
Works for Antlr3.
Allows nested comments and '*' within a comment.
fragment
F_MultiLineCommentTerm
:
( {LA(1) == '*' && LA(2) != '/'}? => '*'
| {LA(1) == '/' && LA(2) == '*'}? => F_MultiLineComment
| ~('*')
)*
;
fragment
F_MultiLineComment
:
'/*'
F_MultiLineCommentTerm
'*/'
;
H_MultiLineComment
: r= F_MultiLineComment
{ $channel=HIDDEN;
fprintf(stderr,"F_MultiLineComment[\%s]",$r->getText($r)->chars);
}
;
I can give you an ANTLR3 solution, which you can adjust to work in ANTLR4:
I think you can use a recursive rule invocation. Make a non-greedy comment rule for /* ... */ which calls itself. That should allow for unlimited nesting without having to count opening + closing comment markers:
COMMENT option { greedy = false; }:
('/*' ({LA(1) == '/' && LA(2) == '*'} => COMMENT | .) .* '*/') -> channel(HIDDEN)
;
or maybe even:
COMMENT option { greedy = false; }:
('/*' .* COMMENT? .* '*/') -> channel(HIDDEN)
;
I'm not sure if ANTLR properly chooses the right path depending on any char or the comment introducer. Try it out.
This will handle '/*/*/' and '/*.../*/', where the comment body is '/' and '.../' respectively.
Multiline comments do not nest inside single-line comments, so you cannot start or end a multiline comment inside a single-line comment.
This is not a valid comment: '/* // */'.
You need a newline to end the single line comment before the '*/' can be consumed to end the multiline comment.
This is a valid comment: '/* // */ \n /*/'.
The comment body is: ' // */ \n /'. As you can see the complete single line comment is included in the body of the multiline comment.
Although '/*/' can end a multiline comment if the preceding character is '*', the comment will end on the first '/', and the remaining '*/' will need to end a nested comment; otherwise there is an error. The shortest path wins: this is non-greedy!
This is not a valid comment /****/*/
This is a valid comment /*/****/*/, the comment body is /****/, which is itself a nested comment.
The prefix and suffix will never be matched in the multiline comment body.
If you want to implement this for the 'D' language, change the '*' to '+'.
COMMENT_NEST
: '/*'
( ('/'|'*'+)? ~[*/] | COMMENT_NEST | COMMENT_INL )*?
('/'|'*'+?)?
'*/'
;
COMMENT_INL
: '//' ( COMMENT_INL | ~[\n\r] )*
;