I want spaces around the parentheses in an IF condition; at least one space is required. But when I use SPACE in the grammar, it throws an error as soon as I add an ELSE block. Please help me figure out how to accomplish this; I have looked at many examples, but none of them covers this.
I only need spaces around the parentheses of the IF condition.
prog: stat_block EOF;
stat_block: OBRACE block CBRACE;
block: (stat (stat)*)?;
stat: expr ';'
| IF condition_block (ELSE stat_block)?
;
expr
: expr SPACE ('*' | '/') SPACE expr
| ID
| INT
| STRING
;
exprList: expr (',' expr)*;
condition_block: SPACE OPAR SPACE expr SPACE CPAR SPACE stat_block;
IF: 'IF';
ELSE: 'ELSE';
OPAR: '(';
CPAR: ')';
OBRACE: '{';
CBRACE: '}';
SPACE: SINGLE_SPACE+;
SINGLE_SPACE: ' ';
ID: [a-zA-Z]+;
INT: [0-9]+;
NEWLINE: '\r'? '\n' -> skip;
WS: [ \t]+ -> skip;
Expected input to parse:
IF ( 3 ) { } ELSE { }
Current input:
There's a reason that almost all languages ignore whitespace. If you don't ignore it, then you have to deal with its possible existence in the token stream anywhere it might, or might not, be, in ALL of your parser rules.
You can try to include the spaces in the lexer rules for the tokens that you want wrapped in spaces, but you may still find surprises.
Suggestion: instead of -> skip; for your WS rule, use -> channel(HIDDEN);. This keeps the tokens in the token stream so you can look for them in your code, but "hides" the whitespace tokens from the parser rules. It also allows ANTLR to get a proper interpretation of your input and build a parse tree that represents it correctly.
If you REALLY want to insist on the spaces before/after, you can write code in a listener that looks before/after the tokens in the token stream to see if you have whitespace, and generate your own error (which can be very specific about your requirement), as sketched below.
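For illustration, here is a rough sketch of that listener approach in Java. It assumes the WS rule is changed to -> channel(HIDDEN), the grammar is called Demo, and condition_block is simplified to OPAR expr CPAR stat_block; the generated class and method names follow ANTLR's usual naming and are assumptions here, not something taken from the question:

import java.util.List;

import org.antlr.v4.runtime.BufferedTokenStream;
import org.antlr.v4.runtime.Token;

public class SpaceCheckListener extends DemoBaseListener {

    private final BufferedTokenStream tokens;

    public SpaceCheckListener(BufferedTokenStream tokens) {
        this.tokens = tokens;
    }

    @Override
    public void enterCondition_block(DemoParser.Condition_blockContext ctx) {
        Token open = ctx.OPAR().getSymbol();
        // Whitespace sent to the HIDDEN channel is still in the token stream,
        // directly to the left of '(' if the input contained a space there.
        List<Token> hidden = tokens.getHiddenTokensToLeft(open.getTokenIndex());
        if (hidden == null || hidden.isEmpty()) {
            System.err.println("line " + open.getLine() + ": expected whitespace before '('");
        }
        // Repeat with getHiddenTokensToRight for the space after '(',
        // and do the same around the ')' token.
    }
}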
At least one space is required.
Then you either:
cannot -> skip the WS rule, which will cause all spaces and tabs to become tokens that your parser needs to handle explicitly (which is likely to become a complete mess in your parser rules!), or
leave WS -> skip as it is, but include a space in your paren rules: OPAR : ' ( '; CPAR : ' ) '; (or with tabs as well, if those can occur); see the sketch after this list.
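A sketch of that second option applied to the question's grammar (untested, and it accepts exactly one space on each side of the parentheses rather than "one or more"):

condition_block : OPAR expr CPAR stat_block;

OPAR : ' ( ';   // longest match wins, so ' ( ' is preferred over a skipped single space
CPAR : ' ) ';
WS   : [ \t]+ -> skip;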
Related
In my input, a line starting with * is a comment line unless it starts with *+ or *-. I can ignore the comment lines but need to catch the others.
These are my lexer rules:
WhiteSpaces : [ \t]+;
Newlines : [\r\n]+;
Comment : '*' .*? Newlines -> skip ;
SkipTokens : (WhiteSpaces | Newlines) -> skip;
An example:
* this is a comment line
** another comment line
*+ type value
So, the first two are comment lines, and I can skip them. But I don't know how to define a lexer/parser rule that can catch the last line.
Your SkipTokens lexer rule will never be matched because the rules WhiteSpaces and Newlines are placed before it. See this Q&A for an explanation of how the lexer matches tokens: ANTLR Lexer rule only seems to work as part of parser rule, and not part of another lexer rule
For it to work as you expect, do this:
SkipTokens : (WhiteSpaces | Newlines) -> skip;
fragment WhiteSpaces : [ \t]+;
fragment Newlines : [\r\n]+;
For an explanation of what a fragment is, check this Q&A: What does "fragment" mean in ANTLR?
Now, for your question. You defined a Comment rule that always ends with a line break. This means there can't be a comment at the end of your input. So you should let a comment end with either a line break or the EOF.
Something like this should do the trick:
COMMENT
: '*' ~[+\-\r\n] ~[\r\n]* // a '*' must be followed by something other than '+', '-' or a line break
| '*' ( [\r\n]+ | EOF ) // a '*' is a valid comment if directly followed by a line break, or the EOF
;
STAR_MINUS
: '*-'
;
STAR_PLUS
: '*+'
;
SPACES
: [ \t\r\n]+ -> skip
;
This, of course, does not mandate that the * be at the start of the line. If you want that, check out this Q&A: Handle strings starting with whitespaces
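For completeness, a rough parser-rule sketch for picking up such a line next to the lexer rules above (the rule name typeLine and the WORD token are made up here and would need to fit the rest of your grammar):

typeLine
 : (STAR_PLUS | STAR_MINUS) WORD+   // e.g. '*+ type value'
 ;

WORD
 : ~[ \t\r\n]+   // define this after the rules above so '*+' and '*-' win the tie
 ;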
I am creating an interpreter in Java using ANTLR. I have a grammar which I have been using for a long time, and I have built a lot of code around the classes generated from it.
In the grammar, 'false' is defined as a literal, and there is also a definition of variable names which allows them to be built from letters, digits, underscores and dots (see the definition below).
The problem arises when I use 'false' as a variable name, e.g.
varName.nestedVar.false. The rule which marks false as falseLiteral takes precedence.
I tried to play with the whitespace, using everything I found on the internet. A solution where I remove WHITESPACE : [ \t\r\n] -> channel (HIDDEN); and use explicit WS* or WS+ in every rule would work for the parser, but I would have to adjust a lot of code in the AST visitors. I tried to tell the boolLiteral rule that it has to have some whitespace before the actual literal, like WHITESPACE* trueLiteral, but this doesn't work when the whitespace is sent to the HIDDEN channel. And again, disabling it altogether means a lot of code rewriting (since I often rely on the order of tokens). I also tried to reorder the alternatives in the literal rule, but this had no effect whatsoever.
...
literal:
boolLiteral
| doubleLiteral
| longLiteral
| stringLiteral
| nullLiteral
| varExpression
;
boolLiteral:
trueLiteral | falseLiteral
;
trueLiteral:
TRUE
;
falseLiteral:
FALSE
;
varExpression:
name=qualifiedName ...
;
...
qualifiedName:
ID ('.' (ID | INT))*
...
TRUE : [Tt] [Rr] [Uu] [Ee];
FALSE : [Ff] [Aa] [Ll] [Ss] [Ee];
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
POINT : '.' ;
...
WHITESPACE : [ \t\r\n] -> channel (HIDDEN);
My best bet was to move the qualifiedName definition to a lexer rule:
qualifiedName:
QUAL_NAME
;
QUAL_NAME: ID ('.' (ID | INT))* ;
Then it works for
varName.false AND false
varName.whatever.ntimes AND false
The result is correct -> varExpression -> qualifiedName on the left-hand side and boolLiteral -> falseLiteral on the right-hand side.
But with this definition the following doesn't work, and I really don't know why:
varName AND false
A qualified name without a . returns:
line 1:8 no viable alternative at input 'varName AND'
The expected solution would be to either enable/disable the whitespace -> channel(HIDDEN) behaviour for specific rules only,
tell the boolLiteral rule that it cannot start with a dot, something like ~POINT falseLiteral (but I tried this as well, with no luck),
or get qualifiedName working without a dot when the rule is moved to the lexer.
Thanks.
You could do something like this:
qualifiedName
: ID ('.' (anyId | INT))*
;
anyId
: ID
| TRUE
| FALSE
;
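With that change (and the lexer rules left as they are, so TRUE and FALSE are still produced instead of ID), an input like varName.nestedVar.false should be consumed entirely by qualifiedName, because the FALSE token after the dot is now accepted through anyId, while a stand-alone false still ends up in boolLiteral -> falseLiteral.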
I want to write a grammar using ANTLR 4 that will parse a definition format, but I've been struggling to get ANTLR to cooperate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type lines correctly, but it either ignores the property lines or fails to identify PROPERTY_TYPE, depending on how I tweak the grammar.
Here is my grammar (attempt #583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(This returns the type but ignores intProp1 and stringProp2.)
What am I doing wrong?
Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, those rules will always match on the same input and always produce matches of the same length. So by the maximum munch rule, the first definition (TYPE_NAME) will always be used, and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).
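A minimal sketch of the grammar reworked along those lines (untested; it keeps TYPE_KEYWORD from the original so the keyword still wins the tie against IDENTIFIER, and drops the three FOO_NAME rules):

grammar TypeDefGrammar;

start     : statement+ EOF ;
statement : type NEWLINE
          | property NEWLINE
          | NEWLINE ;
type      : TYPE_KEYWORD IDENTIFIER ;     // e.g. 'type MyType1'
property  : IDENTIFIER ':' IDENTIFIER ;   // e.g. 'someProperty1: int'

TYPE_KEYWORD : 'type' ;
IDENTIFIER   : (LETTER | '_') (LETTER | DIGIT | '_')* ;
NEWLINE      : '\r'? '\n' ;
WS           : [ \t]+ -> skip ;

fragment LETTER : [a-zA-Z] ;
fragment DIGIT  : [0-9] ;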
Suppose a line has a maximum length of 5.
I want an Identifier to continue across a newline character when the newline sits at position 5.
examples:
abcd'\n'ef would result in a single Identifier "abcdef"
ab'\n'def would result in Identifier "ab" (and another one "def")
Somehow I cannot get it working...
Attempt 1 is something like:
NEWLINE1 : '\r'? '\n' { _tokenStartCharPositionInLine == 5 } -> skip;
NEWLINE2 : '\r'? '\n' { _tokenStartCharPositionInLine < 5 } -> channel(WHITESPACE);
Identifier : Letter (LetterOrDigit)*;
fragment
Letter : [a-zA-Z];
fragment
LetterOrDigit : [a-zA-Z0-9];
Attempt 2 is something like:
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> channel(WHITESPACE);
Identifier : Letter (LetterOrDigit NEWLINE?)*;
NEWLINE: '\r'? '\n' { _tokenStartCharPositionInLine == 5}? -> skip;
fragment
Letter : [a-zA-Z];
fragment
LetterOrDigit : [a-zA-Z0-9];
This seems to work; however, the '\n' is still part of the Identifier when processing it in the parser. Somehow I do not succeed in 'ignoring' the newline when it is at the last position of a line.
This seems to work, however the '\n' sign is still part of the Identifier when processing it in the parser.
That is because the NEWLINE is only skipped when it is matched "independently". Whenever it is part of another rule, like Identifier, it stays part of that rule.
IMO, you should just go with this solution and not add too many predicates to your lexer (or parser). Simply strip the line break from the Identifier after parsing.
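That post-processing can be as small as this (a minimal sketch; the rule providing ctx.Identifier() here stands for whatever parser rule carries the Identifier token in your grammar):

// inside a listener/visitor method that has access to the Identifier token
String raw = ctx.Identifier().getText();
String name = raw.replaceAll("\\r?\\n", "");  // drop the line break that was folded into the token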
How do I build a token in the lexer that can handle recursion inside, as in this string:
${*anything*${*anything*}*anything*}
?
Yes, you can use recursion inside lexer rules.
Take the following example:
${a ${b} ${c ${ddd} c} a}
which will be parsed correctly by the following grammar:
parse
: DollarVar
;
DollarVar
: '${' (DollarVar | EscapeSequence | ~Special)+ '}'
;
fragment
Special
: '\\' | '$' | '{' | '}'
;
fragment
EscapeSequence
: '\\' Special
;
as the interpreter inside ANTLRWorks shows:
(screenshot of the ANTLRWorks interpreter: http://img185.imageshack.us/img185/5471/recq.png)
ANTLR's lexers do support recursion, as @BartK adeptly points out in his post, but you will only see a single token in the parser. If you need to interpret the various pieces within that token, you'll probably want to handle it in the parser.
IMO, you'd be better off doing something in the parser:
variable: DOLLAR LBRACE id variable id RBRACE;
By doing something like the above, you'll see all the necessary pieces and can build an AST or otherwise handle them accordingly.
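A rough sketch of such a parser-side grammar (the rule and token names are illustrative and not taken from the answer above; a stray '$', '{' or '}' outside this pattern would still need an escape rule, as in the other answer):

variable
 : '${' part* '}'
 ;

part
 : variable   // nested ${...}
 | TEXT
 ;

TEXT
 : ~[${}]+    // any run of characters other than '$', '{' and '}'
 ;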