ANTLR4 - how to interrupt - antlr4

Suppose a line has a maximum length of 5.
I want an Identifier to continue when a newline character is put on position 5.
examples:
abcd'\n'ef would result in a single Identifier "abdef"
ab'\n'def would result in Identifier "ab" (and another one "def")
Somehow I cannot get it working...
Attempt 1 is something like:
NEWLINE1 : '\r'? '\n' { _tokenStartCharPositionInLine == 5 } -> skip;
NEWLINE2 : '\r'? '\n' { _tokenStartCharPositionInLine < 5 } -> channel(WHITESPACE);
Identifier : Letter (LetterOrDigit)*;
fragment
Letter : [a-zA-Z];
fragment
LetterOrDigit : [a-zA-Z0-9];
Attempt 2 is something like:
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> channel(WHITESPACE);
Identifier : Letter (LetterOrDigit NEWLINE?)*;
NEWLINE: '\r'? '\n' { _tokenStartCharPositionInLine == 5}? -> skip;
fragment
Letter : [a-zA-Z];
fragment
LetterOrDigit : [a-zA-Z0-9];
This seems to work, however the '\n' sign is still part of the Identifier when processing it in the parser. Somehow I do not succeed into 'ignoring' the newline when it is on the last position of a line.

This seems to work, however the '\n' sign is still part of the Identifier when processing it in the parser.
That is because the NEWLINE is only skipped when matched "independently". Whenever it is part of another rule, like Identifier, it will stay part of said rule.
IMO, you should just go for this solution and not add too much predicates to your lexer (or parser). Simply strip the line break from the Identifier after parsing.

Related

How to use the reserved words inside the string in ANTLR4?

I am a newbie to ANTLR4 and language compilers. I am working on building a language compiler using ANTLR4 Java. I have a small problem with parsing strings. The reserved words/ Tokens are getting matched instead of string. For eg: IF is a keyword token in my lexer but how to use "if" as a string?
Lexer file:
lexer grammar testgrammar;
IF : I F;
ENDIF : E N D I F;
ELSE : E L S E;
CASE : C A S E;
ENDCASE : E N D C A S E;
BREAK : B R E A K;
SWITCH : S W I T C H;
SUBSTRING : S U B S T R I N G;
COMMA : ',' ;
SEMI : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
DOT : '.' ;// ('.' {$setType(DOTDOT);})? ;
LCURLY : '{' ;
RCURLY : '}' ;
AND : '&&' ;
OR : '||' ;
DOUBLEQUOTES : '"' ;
COMPARATOR : '=='| '>=' | '>' | '<' | '<=' | '!=' ;
SYMBOLS : '§' | '$' | '%' | '/' | '=' | '?' | '#' | '_' | '#' | '€';
LETTER : [A-Za-z\u00e4\u00c4\u00d6\u00f6\u00dc\u00fc\u00df];
NUMERICVALUE : NUMBER ('.' NUMBER)?;
STRING_LITERAL : '\'' ('\'\'' | ~('\''))* '\'';
NOTCONDITION : NOT;
OPERATORS : OPERATOR;
COMMENT : (('/*' .*? '*/') | ('//' ~[\r\n]*)) -> skip;
WS : (' ' | '\t' | '\r' | '\n')+ -> skip;
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment NUMBER:[0-9]+;
fragment OPERATOR: ('+'|'-'|'&'|'*'|'~');
fragment NOT: ('!');
grammar:
parser grammar testParser;
symbolCharacters: (SYMBOLS | operators) ;
word:
( symbolCharacters | LETTER )+
;
wordList:
word+
;
I am not supposed share full grammar. But i have shared enough information i guess. I can understand that the words are formed from LETTERS and Symbol characters. One workaround i can do is making word rule like:
word:
( symbolCharacters | LETTER | IF | SWITCH | CASE | ELSE | BREAK )+
;
I have a lot of tokens. I dont want to add everything individually. Is there any other nice way to accomplish this?
Valid expression
Error expression
How to make the parser ignore the keywords inside the string?
Your same grammar does not have the problem you describe:
➜ antlr4 testgrammar.g4
➜ javac *.java
➜ echo "if 'if' endif" | grun testgrammar tokens -tokens
[#0,0:1='if',<IF>,1:0]
[#1,3:6=''if'',<STRING_LITERAL>,1:3]
[#2,8:12='endif',<ENDIF>,1:8]
[#3,14:13='<EOF>',<EOF>,2:0]
(perhaps you have inadvertently "corrected" the problem as you trimmed your grammar down, so I'll elaborate a bit.)
In short, during the lexing/tokenization phase of ANTLR parsing your input, ANTLR will, naturally, attempt to match you Lexer rules. If ANTLR finds a match of multiple rules for the current characters of your input stream, it follows two rules to determine a "winner".
If a rule matches a longer sequence of input characters, then that rule will be used.
If two rules match the same number of input characters, then the rule appearing first in your grammar will be used.
In your case, neither really comes into play as the grammar, when it reaches the ', will attempt to complete the STRING_LITERAL rule, and will find a match for the characters 'if'. It will never even attempt to match you IF lexer rule.
BTW, I did have to correct the symbolCharacters parser rule to be
symbolCharacters: (SYMBOLS | OPERATORS);

ANTLR4 ambiguity - how to solve

I would like to solve the following ambiguity:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input* EOF;
input
: '%' statement
| inputText
;
inputText
: ~('%')+
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;
Sample input:
%a=5;
aa bbbb
As soon as I put a space after "aa" with values like "bbbb" an ambiguity is created.
In fact I want inputText to contain the full string "aa bbbb".
There is no ambiguity. The input aa bbbb will always be tokenised as 2 Identifier tokens. No matter what any parser rule is trying to match. The lexer operates independently from the parser.
Also, the rule:
inputText
: ~('%')+
;
does not match one or more characters other than '%'.
Inside parser rules, the ~ negates tokens, not characters. So ~'%' inside a parser rule will match any token, other than a '%' token. Inside the lexer, ~'%' matches any character other than '%'.
But creating a lexer rule like this:
InputText
: ~('%')+
;
will cause your example input to be tokenised as a single '%' token, followed by a large 2nd token that'd match this: a=5;\naa bbbb. This is how ANTLR's lexer works: match as much characters as possible (no matter what the parser rule is trying to match).
I found the solution:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input EOF;
input
: inputText ('%' statement inputText)*
;
inputText
: ~('%')*
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;

Antlr4 Match Force Priority

I have a query grammar I am working on and have found one case that is proving difficult to solve. The below provides a minimal version of the grammar to reproduce it.
grammar scratch;
query : command* ; // input rule
RANGE: '..';
NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+));
STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' | '.' )+ ;
WS: [ \t\r\n]+ -> skip ;
command
: 'foo:' number_range # FooCommand
| 'bar:' item_list # BarCommand
;
number_range: NUMBER RANGE NUMBER # NumberRange;
item_list: '(' (NUMBER | STRING)+ ((',' | '|') (NUMBER | STRING)+)* ')' # ItemList;
When using this you can match things like bar:(bob, blah, 57, 4.5) foo:2..4.3 no problem. But if you put in bar:(bob.smith, blah, 57, 4.5) foo:2..4 it will complain line 1:8 token recognition error at: '.s' and split it into 'bob' and 'mith'. Makes sense, . is ignored as part of string. Although not sure why it eats the 's'.
So, change string to STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' )+ ; instead without the dot in it. And now it will recognize 2..4.3 as a string instead of number_range.
I believe that this is because the string matches more character in one stretch than other options. But is there a way to force STRING to only match if it hasn't already matched elements higher in the grammar? Meaning it is only a STRING if it does not contain RANGE or NUMBER?
I know I can add TERM: '"' .*? '"'; and then add TERM into the item_list, but I was hoping to avoid having to quote things if possible. But seems to be the only route to keep the .. range in, that I have found.
You could allow only single dots inside strings like this:
STRING : ATOM+ ( '.' ATOM+ )*;
fragment ATOM : ~[ \t\r\n():|,.];
Oh, and NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+)); is rather verbose. This does the same: NUMBER : ( [0-9]* '.' )? [0-9]+;

antlr tokenizer starts with the last token

I have the following grammar:
grammar Aligner;
line
: emptyLine
| codeLine
;
emptyLine
: ( KW_EMPTY KW_LINE )?
( EOL | EOF )
;
codeLine
: KW_LINE COLON
indent
CODE
( EOL | EOF )
;
indent
: absolute_indent
| relative_indent
;
absolute_indent
: NUMBER
;
relative_indent
: sign NUMBER
;
sign
: PLUS
| MINUS
;
COLON: ':';
MINUS: '-';
PLUS: '+';
KW_EMPTY: 'empty';
KW_LINE: 'line';
NUMBER
: DIGIT+
;
EOL
: ('\n' | '\r\n')
;
SPACING
: LINE_WS -> skip
;
CODE
: (~('\n' | '\r'))+
;
fragment
DIGIT
: '0'..'9'
;
fragment
LINE_WS
: ' '
| '\t'
| '\u000C'
;
when I try to parse - empty line I receive error: line 1:0 no viable alternative at input 'empty line'. When I debug what is going on, the very first token is from type CODE and includes the whole line.
What I am doing wrong?
ANTLR will try to match the longest possible token. When two lexer rules match the same string of a given length, the first rule that appears in the grammar wins.
You rule CODE is basically a catch-all: it will match whole lines of text. So here ANTLR has the choice of matching empty line as one single token of type CODE, and as no other rule can produce a token of length 10, the CODE rule will consume the whole line.
You should rewrite the CODE rule to make it match only what you mean by a code. Right now it's way too broad.

antlr match any character except

I have the following deffinition of fragment:
fragment CHAR :'a'..'z'|'A'..'Z'|'\n'|'\t'|'\\'|EOF;
Now I have to define a lexer rule for string. I did the following :
STRING : '"'(CHAR)*'"'
However in string I want to match all of my characters except the new line '\n'. Any ideas how I can achieve that?
You'll also need to exclude " besides line breaks. Try this:
STRING : '"' ~('\r' | '\n' | '"')* '"' ;
The ~ negates char-sets.
ut I want to negate only the new line from my CHAR set
No other way than this AFAIK:
STRING : '"' CHAR_NO_NL* '"' ;
fragment CHAR_NO_NL : 'a'..'z'|'A'..'Z'|'\t'|'\\'|EOF;

Resources