I have a query grammar I am working on and have found one case that is proving difficult to solve. Below is a minimal version of the grammar that reproduces it.
grammar scratch;
query : command* ; // input rule
RANGE: '..';
NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+));
STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' | '.' )+ ;
WS: [ \t\r\n]+ -> skip ;
command
: 'foo:' number_range # FooCommand
| 'bar:' item_list # BarCommand
;
number_range: NUMBER RANGE NUMBER # NumberRange;
item_list: '(' (NUMBER | STRING)+ ((',' | '|') (NUMBER | STRING)+)* ')' # ItemList;
With this grammar, input like bar:(bob, blah, 57, 4.5) foo:2..4.3 matches without problems. But bar:(bob.smith, blah, 57, 4.5) foo:2..4 produces line 1:8 token recognition error at: '.s' and splits the name into 'bob' and 'mith'. That makes sense, since . is excluded from STRING, although I'm not sure why it eats the 's'.
So I changed STRING to STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' )+ ; without the dot in the exclusion list. Now it recognizes 2..4.3 as a STRING instead of a number_range.
I believe this is because STRING matches more characters in one stretch than the other options. But is there a way to force STRING to match only if the input hasn't already matched rules higher in the grammar? Meaning it is only a STRING if it does not contain RANGE or NUMBER?
I know I can add TERM: '"' .*? '"'; and then add TERM to item_list, but I was hoping to avoid having to quote things. So far that seems to be the only route I have found that keeps the .. range working.
You could allow only single dots inside strings like this:
STRING : ATOM+ ( '.' ATOM+ )*;
fragment ATOM : ~[ \t\r\n():|,.];
Oh, and NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+)); is rather verbose. This does the same: NUMBER : ( [0-9]* '.' )? [0-9]+;
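To see why the single-dot STRING rule helps, here is a rough Python sketch that imitates the lexer with regular expressions. It is only an approximation under stated assumptions: the regexes stand in for RANGE, the shortened NUMBER and the new STRING, the literal tokens such as '(' and ',' are left out, and the tokenize helper is a made-up name rather than anything ANTLR provides.

```python
import re

# Regex stand-ins for the lexer rules, listed in grammar order.
ATOM = r"[^ \t\r\n():|,.]"
RULES = [
    ("RANGE", re.compile(r"\.\.")),
    ("NUMBER", re.compile(r"(?:[0-9]*\.)?[0-9]+")),
    ("STRING", re.compile(rf"{ATOM}+(?:\.{ATOM}+)*")),
]

def tokenize(text):
    """Longest match wins; on a tie, the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] in " \t\r\n":  # WS -> skip
            pos += 1
            continue
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps the token on a tie.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

print(tokenize("2..4.3"))     # NUMBER, RANGE, NUMBER
print(tokenize("bob.smith"))  # a single STRING
```

Because STRING can no longer swallow two dots in a row, it stops after 2, ties with NUMBER there, and NUMBER wins by being listed first.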
I am a newbie to ANTLR4 and language compilers. I am building a language compiler using ANTLR4 with Java, and I have a small problem with parsing strings: reserved words (keyword tokens) are matched instead of strings. For example, IF is a keyword token in my lexer, but how can I use "if" as a string?
Lexer file:
lexer grammar testgrammar;
IF : I F;
ENDIF : E N D I F;
ELSE : E L S E;
CASE : C A S E;
ENDCASE : E N D C A S E;
BREAK : B R E A K;
SWITCH : S W I T C H;
SUBSTRING : S U B S T R I N G;
COMMA : ',' ;
SEMI : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
DOT : '.' ;// ('.' {$setType(DOTDOT);})? ;
LCURLY : '{' ;
RCURLY : '}' ;
AND : '&&' ;
OR : '||' ;
DOUBLEQUOTES : '"' ;
COMPARATOR : '=='| '>=' | '>' | '<' | '<=' | '!=' ;
SYMBOLS : '§' | '$' | '%' | '/' | '=' | '?' | '#' | '_' | '#' | '€';
LETTER : [A-Za-z\u00e4\u00c4\u00d6\u00f6\u00dc\u00fc\u00df];
NUMERICVALUE : NUMBER ('.' NUMBER)?;
STRING_LITERAL : '\'' ('\'\'' | ~('\''))* '\'';
NOTCONDITION : NOT;
OPERATORS : OPERATOR;
COMMENT : (('/*' .*? '*/') | ('//' ~[\r\n]*)) -> skip;
WS : (' ' | '\t' | '\r' | '\n')+ -> skip;
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment NUMBER:[0-9]+;
fragment OPERATOR: ('+'|'-'|'&'|'*'|'~');
fragment NOT: ('!');
grammar:
parser grammar testParser;
symbolCharacters: (SYMBOLS | operators) ;
word:
( symbolCharacters | LETTER )+
;
wordList:
word+
;
I am not supposed to share the full grammar, but I think I have shared enough information. I understand that words are formed from LETTER and symbol characters. One workaround is to write the word rule like this:
word:
( symbolCharacters | LETTER | IF | SWITCH | CASE | ELSE | BREAK )+
;
I have a lot of tokens and I don't want to add each one individually. Is there a nicer way to accomplish this?
How to make the parser ignore the keywords inside the string?
Your sample grammar does not have the problem you describe:
➜ antlr4 testgrammar.g4
➜ javac *.java
➜ echo "if 'if' endif" | grun testgrammar tokens -tokens
[#0,0:1='if',<IF>,1:0]
[#1,3:6=''if'',<STRING_LITERAL>,1:3]
[#2,8:12='endif',<ENDIF>,1:8]
[#3,14:13='<EOF>',<EOF>,2:0]
(Perhaps you inadvertently "corrected" the problem as you trimmed your grammar down, so I'll elaborate a bit.)
In short, during the lexing/tokenization phase, ANTLR attempts to match your lexer rules against the input. If more than one rule matches the current characters of the input stream, it follows two rules to determine a "winner":
If a rule matches a longer sequence of input characters, then that rule will be used.
If two rules match the same number of input characters, then the rule appearing first in your grammar will be used.
In your case, neither really comes into play: when the lexer reaches the ', it attempts to complete the STRING_LITERAL rule and finds a match for the characters 'if'. It never even attempts to match your IF lexer rule at that point.
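The two tie-breaking rules can be sketched in a few lines of Python. This is only an illustration, not how ANTLR is implemented: the regexes approximate the IF rule, WORD is a hypothetical catch-all rule that is not in your grammar, and pick_rule is a made-up helper name.

```python
import re

# Hypothetical rules in grammar order: IF is listed before WORD.
RULES = [
    ("IF", re.compile(r"[iI][fF]")),
    ("WORD", re.compile(r"[A-Za-z]+")),
]

def pick_rule(text):
    """Return (rule name, matched text) for the token at the start of text."""
    candidates = []
    for order, (name, pattern) in enumerate(RULES):
        m = pattern.match(text)
        if m:
            # Sort key: longer match first, then earlier rule first.
            candidates.append((-len(m.group()), order, name, m.group()))
    if not candidates:
        return None
    _, _, name, lexeme = min(candidates)
    return name, lexeme

print(pick_rule("if"))    # both rules match 2 characters, so IF wins by position
print(pick_rule("iffy"))  # WORD matches 4 characters, so the longer match wins
```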
BTW, I did have to correct the symbolCharacters parser rule to be
symbolCharacters: (SYMBOLS | OPERATORS);
I've been playing around with modes in an attempt to parse a message like this:
-MSGTXT (DO NOT TOKENIZE (THERE CAN BE PARENS HERE) THIS PART)
-END END OF MESSAGE
-TEST 123
The contents of MSGTXT can be any character so I set up my lexer grammar as follows:
lexer grammar ADEXPLexer;
// Fields
MSGTYP: 'MSGTYP';
ADEP: 'ADEP';
TITLE: 'TITLE';
FILTIM: 'FILTIM';
ORIGINDT: 'ORIGINDT';
IFPLID: 'IFPLID';
MSGTXT: 'MSGTXT' -> pushMode(MSG);
COMMENT: 'COMMENT';
// Message types.
ACK: 'ACK';
IFPL: 'IFPL';
// Lexical rules.
SEP: HYPHEN;
WS: [ \t\n\r] + -> skip;
KEYWORD: (ALPHA|DIGIT)+;
mode MSG;
TEXT: CLOSE_MSG | (ALPHA|DIGIT|SPECIAL|WS|HYPHEN)+;
CLOSE_MSG: ')' -> popMode;
fragment HYPHEN: '-';
fragment ALPHA: [A-Z];
fragment DIGIT: [0-9];
fragment SPECIAL
: '('
| '?'
| ':'
| '.'
| ','
| '\''
| '='
| '+'
| '/'
| ')'
;
The problem now however is that the last closing ')' is never used to break out back into the default mode so it continues on into other parts of the message. The parser rule itself looks like this:
msgtxt: SEP MSGTXT TEXT;
I'm looking for a way to get around this which doesn't involve TokenStreamRewriter as there's no such thing in the JavaScript runtime.
Any help appreciated!
I'm not sure exactly what you need, but if you don't need to check that the contents of TEXT consist only of (ALPHA|DIGIT|SPECIAL|WS|HYPHEN), just use this:
mode MSG;
TEXT: ~[)]+;
CLOSE_MSG: ')' -> popMode;
If you do, just exclude ')' from the SPECIAL fragment.
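One caveat worth checking against the sample message: it contains nested parentheses, and a rule like TEXT: ~[)]+ stops at the first ')'. The Python sketch below is only a simulation under that reading, with made-up helper names. It contrasts cutting at the first ')' with tracking nesting depth, which is roughly what pushing a mode on '(' and popping it on ')' would achieve.

```python
def split_first_close(text):
    """Mimic TEXT: ~[)]+ followed by CLOSE_MSG: cut at the first ')'."""
    end = text.index(")")
    return text[:end], text[end + 1:]

def split_balanced(text):
    """Track nesting depth, like pushing a mode on '(' and popping on ')'."""
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                # The outermost parenthesis just closed.
                return text[:i + 1], text[i + 1:]
    raise ValueError("unbalanced parentheses")

sample = "(DO NOT TOKENIZE (THERE CAN BE PARENS HERE) THIS PART)\n-END END OF MESSAGE"
print(split_first_close(sample))
print(split_balanced(sample))
```

With the first approach, " THIS PART)" is handed back to the default mode; with depth tracking, the whole parenthesized message stays together.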
My ANTLR lexer should be able to recognize strings, characters, hexadecimal numbers, etc.
However, in my code, when I test it like this:
grun A1_lexer tokens -tokens test.txt
With my test.txt file being a simple string, such as "pineapple", it is unable to recognize the different tokens.
In my lexer, I define the following helper tokens:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: ['a'-'z'] | ['A' - 'Z'] | ['0' - '9'] ;
fragment Digit: ['0'-'9'] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '\"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
And I define the following tokens:
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Id: Alpha Alpha_num* ;
I run it like this:
grun A1_lexer tokens -tokens test.txt
And it outputs this:
line 1:0 token recognition error at: '"'
line 1:1 token recognition error at: 'p'
line 1:2 token recognition error at: 'ine'
line 1:6 token recognition error at: 'p'
line 1:7 token recognition error at: 'p'
line 1:8 token recognition error at: 'l'
line 1:9 token recognition error at: 'e"'
[#0,5:5='a',<Id>,1:5]
[#1,12:11='<EOF>',<EOF>,2:0]
I am really wondering what the problem is and how I could fix it.
Thanks.
UPDATE 1:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '\"' ;
I have updated the code and removed the unnecessary single quotes from my Char definition. However, I get the same output as before.
UPDATE 2:
Even after making the suggested changes, I still get the same error. I thought the problem might be that I am not recompiling, but I am. These are the steps I take to recompile:
antlr4 A1_lexer.g4
javac A1_lexer*.java
chmod a+x build.sh
./build.sh
grun A1_lexer tokens -tokens test.txt
With my build.sh file looking like this:
#!/bin/bash
FILE="A1_lexer"
ANTLR=$(echo $CLASSPATH | tr ':' '\n' | grep -m 1 "antlr-4.7.1-complete.jar")
java -jar $ANTLR $FILE.g4
javac $FILE*.java
Even after recompiling, my ANTLR code is still unable to recognize the tokens.
My code is also now like this:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
fragment Eq_op: '==' | '!=' ;
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Decimal_literal : Digit+ ;
Id: Alpha Alpha_num* ;
UPDATE 3:
Grammar:
program
:'class Program {'field_decl* method_decl*'}'
field_decl
: type (id | id'['int_literal']') ( ',' id | id'['int_literal']')*';'
| type id '=' literal ';'
method_decl
: (type | 'void') id'('( (type id) ( ','type id)*)? ')'block
block
: '{'var_decl* statement*'}'
var_decl
: type id(','id)* ';'
type
: 'int'
| 'boolean'
statement
: location assign_op expr';'
| method_call';'
| 'if ('expr')' block ('else' block )?
| 'switch' expr '{'('case' literal ':' statement*)+'}'
| 'while (' expr ')' statement
| 'return' ( expr )? ';'
| 'break ;'
| 'continue ;'
| block
assign_op
: '='
| '+='
| '-='
method_call
: method_name '(' (expr ( ',' expr )*)? ')'
| 'callout (' string_literal ( ',' callout_arg )* ')'
method_name
: id
location
: id
| id '[' expr ']'
expr
: location
| method_call
| literal
| expr bin_op expr
| '-' expr
| '!' expr
| '(' expr ')'
callout_arg
: expr
| string_literal
bin_op
: arith_op
| rel_op
| eq_op
| cond_op
arith_op
: '+'
| '-'
| '*'
| '/'
| '%'
rel_op
: '<'
| '>'
| '<='
| '>='
eq_op
: '=='
| '!='
cond_op
: '&&'
| '||'
literal
: int_literal
| char_literal
| bool_literal
id
: alpha alpha_num*
alpha
: ['a'-'z''A'-'Z''_']
alpha_num
: alpha
| digit
digit
: ['0'-'9']
hex_digit
: digit
| ['a'-'f''A'-'F']
int_literal
: decimal_literal
| hex_literal
decimal_literal
: digit+
hex_literal
: '0x' hex_digit+
bool_literal
: 'true'
| 'false'
char_literal
: '‘'char'’'
string_literal
: '“'char*'”'
test.txt :
"pineapple"
A1_lexer:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
fragment Eq_op: '==' | '!=' ;
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Decimal_literal : Digit+ ;
Id: Alpha Alpha_num* ;
What I Write in Terminal:
grun A1_lexer tokens -tokens test.txt
Output in Terminal:
line 1:0 token recognition error at: '"'
line 1:1 token recognition error at: 'p'
line 1:2 token recognition error at: 'ine'
line 1:6 token recognition error at: 'p'
line 1:7 token recognition error at: 'p'
line 1:8 token recognition error at: 'l'
line 1:9 token recognition error at: 'e"'
[#0,5:5='a',<Id>,1:5]
[#1,12:11='<EOF>',<EOF>,2:0]
I am really not sure why this is happening.
fragment Char: ['a'-'z'] | ['A' - 'Z'] | ['0' - '9']
['a'-'z'] doesn't mean "a to z", it means "a single quote, or a, or a single quote to a single quote, or z, or a single quote", which simplifies to just "a single quote, a or z". What you want is just [a-z] without the quotes and the same applies to the other character classes as well - except that they also contain spaces, so it's "single quote, A, single quote, space to space, single quote, Z, or single quote" etc. Also you don't need to "or" character classes, you can just write everything in one character class like this: [a-zA-Z0-9] (like you already did for the Alpha rule).
The same applies to the Digit rule as well.
Note that it's a bit unusual to only allow these specific characters inside quotes. Usually you'd allow everything that isn't an unescaped quote or an invalid escape sequence. But of course that all depends on the language you're parsing.
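The character-class point can be checked with Python's re module, whose bracket syntax treats the quotes the same way in this case. This is an analogy only, not ANTLR itself, and accepted is a made-up helper name.

```python
import re

quoted = re.compile(r"['a'-'z']")  # the class as written in the question
plain = re.compile(r"[a-z]")       # the intended range

def accepted(pattern, chars):
    """Return the characters from chars that the class matches."""
    return [c for c in chars if pattern.fullmatch(c)]

print(accepted(quoted, "pineapple'z"))  # only a, the single quote and z
print(accepted(plain, "pineapple"))     # every letter
```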
In ANTLR4 I want to define a string that excludes the combination := while still permitting the individual characters : and =. What is the syntax to define this in the grammar?
EQUAL : '=';
NUMBER: DIGIT+;
DIGIT : ('0'..'9');
LITERALEQUAL: ((CHAR | NUMBER | EQUAL | OTHERS) ' '?)+;
fragment CHAR :[a-z]| [A-Z];
fragment OTHERS: '.' | '/' | ':' | '-' | '#' | '?' | '&' | '_' | '[' | ']' | '^' | ';' | '"' | '=';
As long as you don't make a lexer rule or implicit token like:
stmt : value ':=' something ; <-- implicit token
or
BADEquals : ':=' ; <-- explicit lexer definition
your eventual grammar won't allow it, assuming your goal is to allow : and = individually but exclude the combination :=.
What I am trying to achieve is to develop a grammar that would parse the following two lines in the same way:
1. "Bucket 1" = "1 item placed", "3 items removed"
2. Bucket 2 = 2 items placed, 6 items removed
So, a line starts with an ordinal number, followed by an element name ('Bucket 1' and 'Bucket 2'). Each bucket then has one or more values separated by commas.
The issue is that the data can come enclosed in double quotes (line #1 above) or without quotes (line #2). I can write a grammar for each line separately but cannot come up with one that parses both.
grammar Test;
doc : element+ EOF;
element: ordinal element_name EQUAL element_values '\n';
element_name : STRING ;
element_values: STRING (COMMA STRING)+;
ordinal : NUMBER ;
COMMA: ',' ;
EQUAL: '=' ;
NUMBER : ('0'..'9')+ ;
STRING : '"' (EscapeSequence | ~('\\'|'"') )* '"' ;
// STRING : ('"' (EscapeSequence | ~('\\'|'"') )* '"') | ~('"'|',')+ ;
fragment
EscapeSequence
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| OctalEscape
;
fragment
OctalEscape
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
WS : [ .\t]+ -> skip ;
I played with the STRING rule above, trying to make it handle both cases, but with no luck. If I enable the commented-out version of the STRING rule, I get a line 1:0 missing NUMBER at '4. ' parser error, which is confusing: I thought the NUMBER rule should match since it comes first.
Is that a wrong assumption? Can you please explain why it does not get caught?