string recursion antlr lexer token - string

How do I build a token in lexer that can handle recursion inside as this string:
${*anything*${*anything*}*anything*}
?

Yes, you can use recursion inside lexer rules.
Take the following example:
${a ${b} ${c ${ddd} c} a}
which will be parsed correctly by the following grammar:
parse
: DollarVar
;
DollarVar
: '${' (DollarVar | EscapeSequence | ~Special)+ '}'
;
fragment
Special
: '\\' | '$' | '{' | '}'
;
fragment
EscapeSequence
: '\\' Special
;
as the interpreter inside ANTLRWorks shows:
alt text http://img185.imageshack.us/img185/5471/recq.png

ANTLR's lexers do support recursion, as #BartK adeptly points out in his post, but you will only see a single token within the parser. If you need to interpret the various pieces within that token, you'll probably want to handle it within the parser.
IMO, you'd be better off doing something in the parser:
variable: DOLLAR LBRACE id variable id RBRACE;
By doing something like the above, you'll see all the necessary pieces and can build an AST or otherwise handle accordingly.

Related

Choosing lexer mode based on variable

My lexer (target language C++) contains a simple rule for parsing a string literal:
STRING: '"' ~'"'+ '"';
But based on the value returned by a function, I want my lexer to return either a STRING or an IDENT.
I've tried the following:
STRING_START: '"' -> mode(current_string_mode());
or
STRING_START: '"' -> mode(current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING) ;
In either case, I get an error when trying to generate the lexer (error message says:'"' came as a complete surprise)
Alas, that is not possible.
If I look at the grammar of ANTLR itself, I see this:
lexerCommands
: RARROW lexerCommand (COMMA lexerCommand)*
;
lexerCommand
: lexerCommandName LPAREN lexerCommandExpr RPAREN
| lexerCommandName
;
lexerCommandName
: identifier
| MODE
;
lexerCommandExpr
: identifier
| INT
;
In short: the part between parenthesis (mode(...) or pushMode(...)) must be an identifier, or an integer literal. It cannot be an expression (what you're trying to do).

ANTLR grammar: Boolean literal which can occur as qualified variable name while ignoring whitespace

I am creating an interpreter in Java using ANTLR. I have a grammar which I have been using for a long time and I have built a lot of code around classes generated from this grammar.
In the grammar is 'false' defined as a literal, and there is also definition of variable name which allows to build variable names from digits, numbers, underscores and dots (see the definition bellow).
The problem is - when I use 'false' as a variable name.
varName.nestedVar.false. The rule which marks false as falseLiteral takes precedence.
I tried to play with the white spaces, using everything I found on the internet. Solution when I would remove WHITESPACE : [ \t\r\n] -> channel (HIDDEN); and use explicit WS* or WS+ in every rule would work for the parser, but I would have to adjust a lot of code in the AST visitors. I try to tell boolLiteral rule that it has to have some space before the actual literal like WHITESPACE* trueLiteral, but this doesn't work when the white spaces are sent to the HIDDEN channel. And again disable it altogether = lot of code rewriting. (Since I often rely on the order of tokens.) I also tried to reorder non-terminals in the literal rule but this had no effect whatsoever.
...
literal:
boolLiteral
| doubleLiteral
| longLiteral
| stringLiteral
| nullLiteral
| varExpression
;
boolLiteral:
trueLiteral | falseLiteral
;
trueLiteral:
TRUE
;
falseLiteral:
FALSE
;
varExpression:
name=qualifiedName ...
;
...
qualifiedName:
ID ('.' (ID | INT))*
...
TRUE : [Tt] [Rr] [Uu] [Ee];
FALSE : [Ff] [Aa] [Ll] [Ss] [Ee];
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
POINT : '.' ;
...
WHITESPACE : [ \t\r\n] -> channel (HIDDEN);
My best bet was to move qualifiedName definition to the lexer lure
qualifiedName:
QUAL_NAME
;
QUAL_NAME: ID ('.' (ID | INT))* ;
Then it works for
varName.false AND false
varName.whatever.ntimes AND false
Result is correct -> varExpression->quilafiedName on the left-hand side and boolLiteral -> falseLiteral on the right-hand side.
But with this definition this doesn't work, and I really don't know why
varName AND false
Qualified name without . returns
line 1:8 no viable alternative at input 'varName AND'
Expected solution would be ether enable/disable whitespace -> channel{hiddne} for specific rules only
Tell the boolLiteral rule that it canNOT start start with dot, someting like ~POINT falseLiteral, but I tried this as well and with no luck.
Or get qualifiedName working without dot when the rule is moved to the lexer rule.
Thanks.
You could do something like this:
qualifiedName
: ID ('.' (anyId | INT))*
;
anyId
: ID
| TRUE
| FALSE
;

antlr4 all words except the operators

grammar TestGrammar;
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WORD : [a-z0-9._#+=]+(' '[a-z0-9._#+=]+)* ;
WS : [ \t\r\n]+ -> skip ;
quotedword : DQUOTE WORD DQUOTE;
expression
: LPAREN expression+ RPAREN
| expression (AND expression)+
| expression (OR​ expression)+
| expression (NOT​ expression)+
| NOT expression+
| quotedword
| WORD;
I've managed to implement the above grammar for antlr4.
I've got a long way to go but for now my question is,
how can I make WORD generic? Basically I want this [a-z0-9._#+=] to be anything except the operators (AND, OR, NOT, LPAREN, RPAREN, DQUOTE, SPACE).
The lexer will use the first rule that can match the given input. Only if that rule can't match it, it will try the next one.
Therefore you can make your WORD rule generic by using this grammar:
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WS : [ \t\r\n]+ -> skip ;
WORD: .+? ;
Make sure to use the non-greedy operator ? in this case becaue otherwise once invoked the WORD rule will consume all following input.
As WORD is specified last, input will only be tried to be consumed by it if all previous lexer rules (all that have been defined above in the source code) have failed.
EDIT: If you don't want your WORD rule to match any input then you just have to modify the rule I provided. But the essence of my answer is that in the lexer you don't have to worry about two rules potentially matching the same input as long as you got the order in the source code right.
Try something like this grammar:
grammar TestGrammar;
...
WORD : Letter+;
QUOTEDWORD : '"' (~["\\\r\n])* '"' // disallow quotes, backslashes and crlf in literals
WS : [ \t\r\n]+ -> skip ;
fragment Letter :
[a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
expression:
...
| QUOTEDWORD
| WORD+;
Maybe you want to use escape sequences in QUOTEDWORD, then look in this example how to do this.
This grammar allows you:
to have quoted words interpreted as string literals (preserving all spaces within)
to have multiple words separated by whitespace (which is ignored)

Incorrect Result When ANTLR4 Lexer Action Invokes getText()

It seems that the getText() in a lexer action cannot retrieve the token being matched correctly. Is it a normal behaviour? For example, part of my grammar has these rules for
parsing a C++ style identifier that support a \u sequence to embed unicode characters as part of the identifier name:
grammar CPPDefine;
cppCompilationUnit: (id_token|ALL_OTHER_SYMBOL)+ EOF;
id_token:IDENTIFIER //{System.out.println($text);}
;
CRLF: '\r'? '\n' -> skip;
ALL_OTHER_SYMBOL: '\\';
IDENTIFIER: (NONDIGIT (NONDIGIT | DIGIT)*)
{System.out.println(getText());}
;
fragment DIGIT: [0-9];
fragment NONDIGIT: [_a-zA-Z] | UNIVERSAL_CHARACTER_NAME ;
fragment UNIVERSAL_CHARACTER_NAME: ('\\u' HEX_QUAD | '\\U' HEX_QUAD HEX_QUAD ) ;
fragment HEX_QUAD: [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f];
Tested with this 1 line input containing an identifier with incorrect unicode escape sequence:
dkk\uzzzz
The $text of the id_token parser rule action produces this correct result:
dkk
uzzzz
i.e. input interpreted as 2 identifiers separated by a symbol '\' (symbol '\' not printed by any parser rule).
However, the getText() of IDENTIFIER lexer rule action produces this incorrect result:
dkk\u
uzzzz
Why the lexer rule IDENTIFIER's getText() is different from the parser id_token rule's $text. Afterall, the parser rule contains only this lexer rule?
EDIT:
Issue observed in ANTLR4.1 but not in ANTLR4.2 so it could have been fixed already.
It's hard to tell based on your example, but my instinct is you are using an old version of ANTLR. I am unable to reproduce this issue in ANTLR 4.2.

antlr does not parse when token only mentions one other token

I am trying to learn EBNF grammars with ANTLR. So I thought I would convert the Wikipedia EBNF grammar to ANTLR 4 and play with it. However I have had a terrible time at it. I was able to reduce the grammar to the one step that generates the problem.
It seems if I have one token reference solely another token then ANTLR 4 can't parse the input.
Here is my grammar:
grammar Hello;
program : statement+ ;
statement : IDENTIFIER STATEMENTEND /*| LETTERS STATEMENTEND */ ;
LETTERS : [a-z]+ ;
IDENTIFIER : LETTERS ;
SEMICOLON : [;] ;
STATEMENTEND : SEMICOLON NEWLINE* | NEWLINE+ ;
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Notice IDENTIFIER refers only to LETTERS.
If I provide this input:
a;
Then I get this error:
line 1:0 mismatched input 'a' expecting IDENTIFIER
(program a ;\n)
However if I uncomment the code and provide the same input I get legit output:
(program (statement a ;\n))
I do not understand why one works and the other does not.
The token a will only be assigned one token type. Since this input text matches both the LETTERS and IDENTIFIER rules, ANTLR 4 will assign the type according to the first rule appearing in the lexer, which means the input a will be a token of type LETTERS.
If you only meant for LETTERS to be a sub-part of other lexer rules, and not form LETTERS tokens themselves, you can declare it as a fragment rule.
fragment LETTERS : [a-z]+;
IDENTIFIER : LETTERS;
In this case, a would be assigned the token type IDENTIFIER and the original parser rule would work.

Resources