Keep whitespace and comments in ANTLR4

When I run InsertSerialID.java or ExtractInterfaceTool.java from the ANTLR4 tour source pack (https://pragprog.com/titles/tpantlr2/source_code), all the whitespace and comments are missing from the output, so the generated source code is neither compilable nor readable. How can I keep them?

Well, I found that redirecting them to an extra channel, instead of using skip, keeps them in the token stream:
WS : [ \t\r\n\u000C]+ -> channel(2) // -> skip
;
COMMENT
: '/*' .*? '*/' -> channel(2) // -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2) // -> skip
;
They are covered by ParserRuleContext.getSourceInterval(), which returns an Interval, although I don't know how to map an Interval back to their grammar type.

Related

At least one space around parentheses in ANTLR4

I want spaces around the parentheses in an IF condition; at least one space is required. But when I use SPACE in the grammar, I get an error as soon as I add an ELSE block. How can I accomplish this? I have seen many examples, but none covers this case.
I only need spaces around the parentheses of the IF condition.
prog: stat_block EOF;
stat_block: OBRACE block CBRACE;
block: (stat (stat)*)?;
stat: expr ';'
| IF condition_block (ELSE stat_block)?
;
expr
: expr SPACE ('*' | '/') SPACE expr
| ID
| INT
| STRING
;
exprList: expr (',' expr)*;
condition_block: SPACE OPAR SPACE expr SPACE CPAR SPACE stat_block;
IF: 'IF';
ELSE: 'ELSE';
OPAR: '(';
CPAR: ')';
OBRACE: '{';
CBRACE: '}';
SPACE: SINGLE_SPACE+;
SINGLE_SPACE: ' ';
ID: [a-zA-Z]+;
INT: [0-9]+;
NEWLINE: '\r'? '\n' -> skip;
WS: [ \t]+ -> skip;
Expected input to parse
IF ( 3 ) { } ELSE { }
Current Input
There's a reason that almost all languages ignore whitespace. If you don't ignore it, then you have to deal with its possible existence in the token stream anywhere it might, or might not, be in ALL of your parser rules.
You can try to include the spaces in the Lexer rules for tokens that you want wrapped in spaces, but may still find surprises.
Suggestion: Instead of -> skip; for your WS rule, use -> channel(HIDDEN); This keeps the tokens in the token stream so you can look for them in your code, but "hides" the whitespace tokens from the parser rules. This also allows ANTLR to get a proper interpretation of your input and build a parse tree that represents it correctly.
If you REALLY want to insist on the spaces before/after, you can write code in a listener that looks before/after the tokens in the input stream to see if you have whitespace, and generate your own error (that can be very specific about your requirement).
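As a rough illustration of the listener idea above, the check comes down to looking at the neighbours of each parenthesis token. The sketch below is plain Java and does not use the ANTLR runtime: Tok, ParenSpacingCheck and check are hypothetical names standing in for a token stream in which whitespace was sent to the hidden channel, and it assumes whitespace is the only thing on that channel. In a real listener the same neighbour test would be run against the actual token stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParenSpacingCheck {
    /** Hypothetical stand-in for a lexer token: its text and whether it is on the hidden channel. */
    static final class Tok {
        final String text;
        final boolean hidden;

        Tok(String text, boolean hidden) {
            this.text = text;
            this.hidden = hidden;
        }
    }

    /** Returns one message per parenthesis side that has no hidden (whitespace) neighbour. */
    static List<String> check(List<Tok> tokens) {
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            boolean isParen = !t.hidden && (t.text.equals("(") || t.text.equals(")"));
            if (!isParen) {
                continue;
            }
            boolean spaceBefore = i > 0 && tokens.get(i - 1).hidden;
            boolean spaceAfter = i + 1 < tokens.size() && tokens.get(i + 1).hidden;
            if (!spaceBefore) {
                errors.add("expected whitespace before '" + t.text + "' (token " + i + ")");
            }
            if (!spaceAfter) {
                errors.add("expected whitespace after '" + t.text + "' (token " + i + ")");
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // Tokens for "IF (3 )": the space after '(' is missing, and nothing follows ')'.
        List<Tok> tokens = Arrays.asList(
                new Tok("IF", false), new Tok(" ", true), new Tok("(", false),
                new Tok("3", false), new Tok(" ", true), new Tok(")", false));
        check(tokens).forEach(System.out::println);
    }
}
```

Because the parser never sees the hidden tokens, the grammar itself stays free of SPACE references, and the error message can name the exact requirement that was violated.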
At least one space is required.
Then you either:
cannot -> skip the WS rule, which turns all spaces and tabs into tokens that your parser must handle correctly (which is likely going to become a complete mess in your parser rules!), or
you leave WS -> skip as-is, but include a space in your PAREN rules: OPAR : ' ( '; CPAR: ' ) '; (or with tabs as well if that is possible)

Antlr4: Skip line when it start with * unless the second char is

In my input, a line starting with * is a comment line unless it starts with *+ or *-. I can ignore the comments but need to get the others.
This is my lexer rules:
WhiteSpaces : [ \t]+;
Newlines : [\r\n]+;
Commnent : '*' .*? Newlines -> skip ;
SkipTokens : (WhiteSpaces | Newlines) -> skip;
An example:
* this is a comment line
** another comment line
*+ type value
So, the first two are comment lines, and I can skip them. But I don't know how to define a lexer/parser rule that can catch the last line.
Your SkipTokens lexer rule will never be matched because the rules WhiteSpaces and Newlines are placed before it. See this Q&A for an explanation how the lexer matches tokens: ANTLR Lexer rule only seems to work as part of parser rule, and not part of another lexer rule
For it to work as you expect, do this:
SkipTokens : (WhiteSpaces | Newlines) -> skip;
fragment WhiteSpaces : [ \t]+;
fragment Newlines : [\r\n]+;
For what a fragment is, see this Q&A: What does "fragment" mean in ANTLR?
Now, for your question. You defined a Comment rule to always end with a line break. This means that there can't be a comment at the end of your input. So you should let a comment either end with a line break or the EOF.
Something like this should do the trick:
COMMENT
: '*' ~[+\-\r\n] ~[\r\n]* // a '*' must be followed by something other than '+', '-' or a line break
| '*' ( [\r\n]+ | EOF ) // a '*' is a valid comment if directly followed by a line break, or the EOF
;
STAR_MINUS
: '*-'
;
STAR_PLUS
: '*+'
;
SPACES
: [ \t\r\n]+ -> skip
;
This, of course, does not mandate the * to be at the start of the line. If you want that, check out this Q&A: Handle strings starting with whitespaces
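If the input really is line-oriented, the start-of-line requirement can also be checked outside the lexer. The following is only a sketch of that alternative, not what the linked Q&A does: StarLines, Kind and classify are made-up names, and it assumes the text has already been split into lines.

```java
final class StarLines {
    enum Kind { COMMENT, STAR_PLUS, STAR_MINUS, OTHER }

    /** Classifies one line by its first two characters, mirroring the rules in the question. */
    static Kind classify(String line) {
        if (!line.startsWith("*")) {
            return Kind.OTHER;       // not a star line at all
        }
        if (line.startsWith("*+")) {
            return Kind.STAR_PLUS;   // must be kept
        }
        if (line.startsWith("*-")) {
            return Kind.STAR_MINUS;  // must be kept
        }
        return Kind.COMMENT;         // "*", "* text", "** text", ...
    }

    public static void main(String[] args) {
        String[] lines = { "* this is a comment line", "** another comment line", "*+ type value" };
        for (String line : lines) {
            System.out.println(classify(line) + ": " + line);
        }
    }
}
```

A pre-pass like this could drop the COMMENT lines before the remaining text reaches the lexer, at the cost of losing the original line numbers unless they are tracked separately.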

ANTLR 4 how to parse comments

I am parsing a SQL like language and I am having trouble parsing comments.
The idea is to ignore them.
I have these rules:
NEWLINE: '\r'? '\n' -> skip
WS : [ \t]+ -> skip
How can I ignore:
Everything that is between '--' or '#' and the next '\n'
Everything between '/*' and '*/'
I tried something like this before the WS and the NEWLINE rules:
COMMENT1 : ('--'|'#') ~'\n'* -> skip;
didn't work - I got:
line 1:115 missing ';' at '<EOF>'
probably something because it didn't go with my main rule:
parse : (statments (';')+)* EOF;
Can anyone help me?
When in doubt, see what someone else did ;)
There are some ready-made grammars for different languages, more or less working.
So I look in Java's grammar and see:
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
So your general idea seems to be correct. I'm guessing that the problem lies somewhere else. Can you provide sample input you test on and your grammar (relevant parts)?

extra channels in antlr 4.5

I am using antlr 4.5 to build a parser for a language with several special comment formats, which I would like to stream to different channels.
It seems antlr 4.5 has been extended with a new construct for declaring extra lexer channels:
extract from doc https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules
As of 4.5, you can also define channel names like enumerations
with the following construct above the lexer rules:
channels { WSCHANNEL, MYHIDDEN }
My lexing and parsing rules are in a single file, and my code looks like this:
channels {
ANNOT_CHANNEL,
FORMAL_SPEC_CHANNEL,
DOC_CHANNEL,
COMMENT_CHANNEL,
PRAGMAS_CHANNEL
}
... parsing rules ...
// expression annotation (sent to a special channel)
ANNOT: (EOL_ANNOT | LUS_ANNOT | C_ANNOT) -> channel(ANNOT_CHANNEL) ;
fragment LUS_ANNOT: '(*!' ( COMMENT | . )*? '*)' ;
fragment C_ANNOT: '/*!' ( COMMENT | . )*? '*/' ;
fragment EOL_ANNOT: ('--!' | '//!') .*? EOL ;
// formal specification annotations (sent to a special channel)
FORMAL_SPEC: (EOL_SPEC | LUS_SPEC | C_SPEC ) -> channel(FORMAL_SPEC_CHANNEL) ;
fragment LUS_SPEC: '(*#' ( COMMENT | . )*? '*)' ;
fragment C_SPEC: '/*#' ( COMMENT | . )*? '*/' ;
fragment EOL_SPEC: ('--#' | '//#' | '--%') .*? EOL;
// documentation annotation (sent to a special channel)
DOC: ( EOL_DOC |LUS_DOC | C_DOC ) -> channel(DOC_CHANNEL);
fragment LUS_DOC: '(**' ( COMMENT | . )*? '*)' ;
fragment C_DOC: '/**' ( COMMENT | . )*? '*/' ;
fragment EOL_DOC: ('--*' | '//*') .*? EOL;
// standard comment (sent to a special channel)
COMMENT: ( EOL_COMMENT | LUS_COMMENT | C_COMMENT ) -> channel(COMMENT_CHANNEL);
fragment LUS_COMMENT: '(*' ( COMMENT | . )*? '*)' ;
fragment C_COMMENT: '/*' ( COMMENT |. )*? '*/' ;
fragment EOL_COMMENT: ('--' | '//') .*? EOL;
// pragmas are sent to a special channel
PRAGMA: '#pragma' CHARACTER* '#end' -> channel(PRAGMAS_CHANNEL);
however I am still getting this 4.4-like error
warning(155): Scade6.g4:550:52: rule ANNOT contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Scade6.g4:556:56: rule FORMAL_SPEC contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Scade6.g4:562:45: rule DOC contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Scade6.g4:568:62: rule COMMENT contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Scade6.g4:574:47: rule PRAGMA contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
If I split the lexer and parser into two distinct files and use an import statement to import the lexer into the parser, I still get the same warnings as above.
Using integer constants instead of names with a combined grammar
-> channel(10000)
yields the following error
error(164): Scade6.g4:8:0: custom channels are not supported in combined grammars
If I split the lexer and parser into two files and use integer constants, there is no warning, but this is not really satisfactory for readability.
Is there anything I can do to have extra channels named properly? (with either combined or separate lexer/parser specs, no preference)
Is there anything I can do to have extra channels named properly?
I'm not sure about v4.5 (I have not used it), but in v4.x you could always define channels like so (assuming the Java target):
grammar MyGrammar;
@lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
...the rest of your grammar goes here...
WS : [ \t\n\r]+ -> channel(WHITESPACE) ; // channel(1)
SL_COMMENT
: '//' .*? '\n' -> channel(COMMENTS) // channel(2)
;
If you do not already have "The Definitive ANTLR 4 Reference" book I recommend getting hold of it. Will save you a lot of time. Example above is from that book.

How to handle nested comments in an ANTLR lexer

How do I handle nested comments in an ANTLR4 lexer? That is, I need to count the number of "/*" inside this token and close it only after the same number of "*/" have been received. As an example, the D language has such nested comments as "/+ ... +/"
For example, the following lines should be treated as one block of comments:
/* comment 1
comment 2
/* comment 3
comment 4
*/
// comment 5
comment 6
*/
My current code is the following, and it does not work on the above nested comment:
COMMENT : '/*' .*? '*/' -> channel(HIDDEN)
;
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' -> channel(HIDDEN)
;
Terence Parr has these two lexer lines in his Swift Antlr4 grammar for lexing out nested comments:
COMMENT : '/*' (COMMENT|.)*? '*/' -> channel(HIDDEN) ;
LINE_COMMENT : '//' .*? '\n' -> channel(HIDDEN) ;
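The counting that the question describes can be sketched in plain Java, independent of ANTLR, to make clear what the recursive rule has to achieve. NestedCommentScanner and endOfComment are hypothetical names; the sketch only tracks "/*" and "*/" and deliberately ignores "//" line comments inside the block.

```java
final class NestedCommentScanner {
    /**
     * Returns the index just past the "*/" that closes the comment opening at
     * start, or -1 if no comment opens there or it is never closed.
     */
    static int endOfComment(String text, int start) {
        if (!text.startsWith("/*", start)) {
            return -1;
        }
        int depth = 0;
        int i = start;
        while (i < text.length()) {
            if (text.startsWith("/*", i)) {
                depth++;            // one more level to close
                i += 2;
            } else if (text.startsWith("*/", i)) {
                depth--;            // one level closed
                i += 2;
                if (depth == 0) {
                    return i;       // outermost comment is complete
                }
            } else {
                i++;                // ordinary comment character
            }
        }
        return -1;                  // ran out of input while still nested
    }

    public static void main(String[] args) {
        String src = "/* a /* b */ c */ x";
        System.out.println(src.substring(0, endOfComment(src, 0)));
    }
}
```

The recursive COMMENT rule gets the same effect without an explicit counter: each nested "/*" starts a new invocation of the rule, and the outer match can only finish once every inner invocation has consumed its own "*/".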
I'm using:
COMMENT: '/*' ('/'*? COMMENT | ('/'* | '*'*) ~[/*])*? '*'*? '*/' -> skip;
This forces any /* inside a comment to be the beginning of a nested comment and similarly with */. In other words, there's no way to recognize /* and */ other than at the beginning and end of the rule COMMENT.
This way, something like /* /* /* */ a */ would not be recognized entirely as a (bad) comment (mismatched /*s and */s), as it would if using COMMENT: '/*' (COMMENT|.)*? '*/' -> skip;, but as /, followed by *, followed by correct nested comments /* /* */ a */.
Works for Antlr3.
Allows nested comments and '*' within a comment.
fragment
F_MultiLineCommentTerm
:
( {LA(1) == '*' && LA(2) != '/'}? => '*'
| {LA(1) == '/' && LA(2) == '*'}? => F_MultiLineComment
| ~('*')
)*
;
fragment
F_MultiLineComment
:
'/*'
F_MultiLineCommentTerm
'*/'
;
H_MultiLineComment
: r= F_MultiLineComment
{ $channel=HIDDEN;
fprintf(stderr,"F_MultiLineComment[\%s]",$r->getText($r)->chars);
}
;
I can give you an ANTLR3 solution, which you can adjust to work in ANTLR4:
I think you can use a recursive rule invocation. Make a non-greedy comment rule for /* ... */ which calls itself. That should allow for unlimited nesting without having to count opening + closing comment markers:
COMMENT option { greedy = false; }:
('/*' ({LA(1) == '/' && LA(2) == '*'} => COMMENT | .) .* '*/') -> channel(HIDDEN)
;
or maybe even:
COMMENT option { greedy = false; }:
('/*' .* COMMENT? .* '*/') -> channel(HIDDEN)
;
I'm not sure if ANTLR properly chooses the right path depending on any char or the comment introducer. Try it out.
This will handle : '/*/*/' and '/*.../*/'where the comment body is '/' and '.../' respectively.
Multiline comments will not nest inside single line comments, therefore you cannot start or end a multiline comment inside a single line comment.
This is not a valid comment: '/* // */'.
You need a newline to end the single line comment before the '*/' can be consumed to end the multiline comment.
This is a valid comment: '/* // */ \n /*/'.
The comment body is: ' // */ \n /'. As you can see the complete single line comment is included in the body of the multiline comment.
Although '/*/' can end a multiline comment if the preceding character is '*', the comment will end on the first '/' and the remaining '*/' will need to end a nested comment, otherwise there is an error. The shortest path wins, this is non-greedy!
This is not a valid comment /****/*/
This is a valid comment /*/****/*/, the comment body is /****/, which is itself a nested comment.
The prefix and suffix will never be matched in the multiline comment body.
If you want to implement this for the 'D' language, change the '*' to '+'.
COMMENT_NEST
: '/*'
( ('/'|'*'+)? ~[*/] | COMMENT_NEST | COMMENT_INL )*?
('/'|'*'+?)?
'*/'
;
COMMENT_INL
: '//' ( COMMENT_INL | ~[\n\r] )*
;
