antlr4 grammar with negative option - antlr4

In antlr4 I want to define a string but exclude from it the combination := permitting the respective single characters. What is syntax to define the grammar
EQUAL : '=';
NUMBER: DIGIT+;
DIGIT : ('0'..'9');
LITERALEQUAL: ((CHAR | NUMBER | EQUAL | OTHERS) ' '?)+;
fragment CHAR :[a-z]| [A-Z];
fragment OTHERS: '.' | '/' | ':' | '-' | '#' | '?' | '&' | '_' | '[' | ']' | '^' | ';' | '"' | '=';

As long as you don't make a lexer rule or implicit token like:
stmt : value ':=' something ; <-- implicit token
or
BADEquals : ':=' ; <-- explicit lexer definition
your eventual grammar won't allow it if your goal is to a allow : and = but exclude the combination := .

Related

Antlr4 Match Force Priority

I have a query grammar I am working on and have found one case that is proving difficult to solve. The below provides a minimal version of the grammar to reproduce it.
grammar scratch;
query : command* ; // input rule
RANGE: '..';
NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+));
STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' | '.' )+ ;
WS: [ \t\r\n]+ -> skip ;
command
: 'foo:' number_range # FooCommand
| 'bar:' item_list # BarCommand
;
number_range: NUMBER RANGE NUMBER # NumberRange;
item_list: '(' (NUMBER | STRING)+ ((',' | '|') (NUMBER | STRING)+)* ')' # ItemList;
When using this you can match things like bar:(bob, blah, 57, 4.5) foo:2..4.3 no problem. But if you put in bar:(bob.smith, blah, 57, 4.5) foo:2..4 it will complain line 1:8 token recognition error at: '.s' and split it into 'bob' and 'mith'. Makes sense, . is ignored as part of string. Although not sure why it eats the 's'.
So, change string to STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' )+ ; instead without the dot in it. And now it will recognize 2..4.3 as a string instead of number_range.
I believe that this is because the string matches more character in one stretch than other options. But is there a way to force STRING to only match if it hasn't already matched elements higher in the grammar? Meaning it is only a STRING if it does not contain RANGE or NUMBER?
I know I can add TERM: '"' .*? '"'; and then add TERM into the item_list, but I was hoping to avoid having to quote things if possible. But seems to be the only route to keep the .. range in, that I have found.
You could allow only single dots inside strings like this:
STRING : ATOM+ ( '.' ATOM+ )*;
fragment ATOM : ~[ \t\r\n():|,.];
Oh, and NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+)); is rather verbose. This does the same: NUMBER : ( [0-9]* '.' )? [0-9]+;

Not Able to Recognize Strings and Characters in ANTLr

In my ANTLr code, we should be able to recognize strings, characters, hexadecimal numbers etc.
However, in my code, when I test it like this:
grun A1_lexer tokens -tokens test.txt
With my test.txt file being a simple string, such as "pineapple", it is unable to recognize the different tokens.
In my lexer, I define the following helper tokens:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: ['a'-'z'] | ['A' - 'Z'] | ['0' - '9'] ;
fragment Digit: ['0'-'9'] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '\"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
And I define the following tokens:
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Id: Alpha Alpha_num* ;
I run it like this:
grun A1_lexer tokens -tokens test.txt
And it outputs this:
line 1:0 token recognition error at: '"'
line 1:1 token recognition error at: 'p'
line 1:2 token recognition error at: 'ine'
line 1:6 token recognition error at: 'p'
line 1:7 token recognition error at: 'p'
line 1:8 token recognition error at: 'l'
line 1:9 token recognition error at: 'e"'
[#0,5:5='a',<Id>,1:5]
[#1,12:11='<EOF>',<EOF>,2:0]
I am really wondering what the problem is and how I could fix it.
Thanks.
UPDATE 1:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '\"' ;
I have updated the code, I got rid of the un-necessary single quotes in my Char classification. However, I get the same output as before.
UPDATE 2:
Even when I make the changes suggested, I still get the same error. I believed the problem is that I am not recompiling, but I am. These are the steps that I take to recompile.
antlr4 A1_lexer.g4
javac A1_lexer*.java
chmod a+x build.sh
./build.sh
grun A1_lexer tokens -tokens test.txt
With my build.sh file looking like this:
#!/bin/bash
FILE="A1_lexer"
ANTLR=$(echo $CLASSPATH | tr ':' '\n' | grep -m 1 "antlr-4.7.1-
complete.jar")
java -jar $ANTLR $FILE.g4
javac $FILE*.java
Even when I recompile, my antlr code is still unable to recognize the tokens.
My code is also now like this:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
fragment Eq_op: '==' | '!=' ;
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Decimal_literal : Digit+ ;
Id: Alpha Alpha_num* ;
UPDATE 3:
Grammar:
program
:'class Program {'field_decl* method_decl*'}'
field_decl
: type (id | id'['int_literal']') ( ',' id | id'['int_literal']')*';'
| type id '=' literal ';'
method_decl
: (type | 'void') id'('( (type id) ( ','type id)*)? ')'block
block
: '{'var_decl* statement*'}'
var_decl
: type id(','id)* ';'
type
: 'int'
| 'boolean'
statement
: location assign_op expr';'
| method_call';'
| 'if ('expr')' block ('else' block )?
| 'switch' expr '{'('case' literal ':' statement*)+'}'
| 'while (' expr ')' statement
| 'return' ( expr )? ';'
| 'break ;'
| 'continue ;'
| block
assign_op
: '='
| '+='
| '-='
method_call
: method_name '(' (expr ( ',' expr )*)? ')'
| 'callout (' string_literal ( ',' callout_arg )* ')'
method_name
: id
location
: id
| id '[' expr ']'
expr
: location
| method_call
| literal
| expr bin_op expr
| '-' expr
| '!' expr
| '(' expr ')'
callout_arg
: expr
| string_literal
bin_op
: arith_op
| rel_op
| eq_op
| cond_op
arith_op
: '+'
| '-'
| '*'
| '/'
| '%'
rel_op
: '<'
| '>'
| '<='
| '>='
eq_op
: '=='
| '!='
cond_op
: '&&'
| '||'
literal
: int_literal
| char_literal
| bool_literal
id
: alpha alpha_num*
alpha
: ['a'-'z''A'-'Z''_']
alpha_num
: alpha
| digit
digit
: ['0'-'9']
hex_digit
: digit
| ['a'-'f''A'-'F']
int_literal
: decimal_literal
| hex_literal
decimal_literal
: digit+
hex_literal
: '0x' hex_digit+
bool_literal
: 'true'
| 'false'
char_literal
: '‘'char'’'
string_literal
: '“'char*'”'
test.txt :
"pineapple"
A1_lexer:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
fragment Eq_op: '==' | '!=' ;
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Decimal_literal : Digit+ ;
Id: Alpha Alpha_num* ;
What I Write in Terminal:
grun A1_lexer tokens -tokens test.txt
Output in Terminal:
line 1:0 token recognition error at: '"'
line 1:1 token recognition error at: 'p'
line 1:2 token recognition error at: 'ine'
line 1:6 token recognition error at: 'p'
line 1:7 token recognition error at: 'p'
line 1:8 token recognition error at: 'l'
line 1:9 token recognition error at: 'e"'
[#0,5:5='a',<Id>,1:5]
[#1,12:11='<EOF>',<EOF>,2:0]
I am really not sure why this is happening.
fragment Char: ['a'-'z'] | ['A' - 'Z'] | ['0' - '9']
['a'-'z'] doesn't mean "a to z", it means "a single quote, or a, or a single quote to a single quote, or z, or a single quote", which simplifies to just "a single quote, a or z". What you want is just [a-z] without the quotes and the same applies to the other character classes as well - except that they also contain spaces, so it's "single quote, A, single quote, space to space, single quote, Z, or single quote" etc. Also you don't need to "or" character classes, you can just write everything in one character class like this: [a-zA-Z0-9] (like you already did for the Alpha rule).
The same applies to the Digit rule as well.
Note that it's a bit unusual to only allow these specific characters inside quotes. Usually you'd allow everything that isn't an unescaped quote or an invalid escape sequence. But of course that all depends on the language you're parsing.

ANTLR 4 building parse tree incorrectly

I'm making a grammar for a subset of SQL, which I've pasted below:
grammar Sql;
sel_stmt : SEL QUANT? col_list FROM tab_list;
as_stmt : 'as' ID;
col_list : '*' | col_spec (',' col_spec | ',' col_group)*;
col_spec : (ID | ID '.' ID) as_stmt?;
col_group : ID '.' '*';
tab_list : (tab_spec | join_tab) (',' tab_spec | ',' join_tab)*;
tab_spec : (ID | sub_query) as_stmt?;
sub_query : '(' sel_stmt ')';
join_tab : tab_spec (join_type? 'join' tab_spec 'on' cond_list)+;
join_type : 'inner' | 'full' | 'left' | 'right';
cond_list : cond_term (BOOL_OP cond_term)*;
cond_term : col_spec (COMP_OP val | between | like);
val : INT | FLT | SQ_STR | DQ_STR;
between : ('not')? 'between' val 'and' val;
like : ('not')? 'like' (SQ_STR | DQ_STR);
WS : (' ' | '\t' | '\n')+ -> skip;
INT : ('0'..'9')+;
FLT : INT '.' INT;
SQ_STR : '\'' ~('\\'|'\'')* '\'';
DQ_STR : '"' ~('\\'|'"')* '"';
BOOL_OP : ',' | 'or' | 'and';
COMP_OP : '=' | '<>' | '<' | '>' | '<=' | '>=';
SEL : 'select' | 'SELECT';
QUANT: 'distinct' | 'DISTINCT' | 'all' | 'ALL';
FROM: 'from' | 'FROM';
ID : ('A'..'Z' | 'a'..'z')('A'..'Z' | 'a'..'z' | '0'..'9')*;
The input I'm testing is select distinct test.col1 as col, test2.* from test join test2 on col='something', test2.col1=1.4. The output parse tree matches the last appearance of test2 as a and thus doesn't know what to do with the rest of the input. The comma before the last 'test2' token is made a child of the node, when it should be a child of .
My question is what is going on behind the scenes to cause this?
Apparently, ANTLR does not like it when you use a literal symbol as a token and have it be part of another token. In my case, the comma token was both used as a literal token and as part of the BOOL_OP token.
A future question may be how ANTLR disambiguates when the same symbol is used as parts of different tokens that apply only under specific scopes. However, this question is answered for now.

Identifier Lexer rule does not match '*' like its supposed to

I am in the process of finalizing a grammar for a proprietary pattern language. It borrows a few regex syntax elements (like quantifiers) but it's also a lot more complex than regex, since it has to allow macros, different pattern styles etc.
My problem is that '*' does not match against the ID lexer rule like it's supposed to. There is no other rule that could swallow the * token as far as i see.
Here's the grammar i wrote:
grammar Pattern;
element:
ID
| macro;
macro:
MACRONAME macroarg? ('*'|'+'|'?'|FROMTIL)?;
macroarg: '['( (element | MACROFREE ) ';')* (element | MACROFREE) ']';
and_con :
element '&' element
| and_con '&' element
|'(' and_con ')';
head_con :
'H[' block '=>' block ']';
expression :
element
| and_con
| expression ' ' element
| '(' expression ')';
block :
element
| and_con
| or_con
| '(' block ')';
blocksequence :
(block ' '+)* block;
or_con :
((element | and_con) '|')+ (element | and_con)
| or_con '|' (element | and_con)
| '(' blocksequence (')|(' blocksequence)+ (')'|')*');
patternlist :
(blocksequence ' '* ',' ' '*)* blocksequence;
sentenceord :
'S=(' patternlist ')';
sentenceunord :
'S={' patternlist '}';
pattern :
sentenceord
| sentenceunord
| blocksequence;
multisentence :
MS pattern;
clause :
'CLS' ' '+ pattern;
complexpattern :
pattern
| multisentence
| clause
| SECTIONS ' ' complexpattern;
dictentry:
NUM ';' complexpattern
| NUM ';' NAME ';' complexpattern
| COMMENT;
dictionary:
(dictentry ('\r'|'\n'))* (dictentry)?;
ID : '*' ('*'|'+'|'?'|FROMTIL)?
| ( '^'? '!'? ('F'|'C'|'L'|'P'|'CA'|'N'|'PE'|'G'|'CD'|'T'|'M'|'D')'=' NAME ('*'|'+'|'?'|FROMTIL)? '$'? );
MS : 'MS' [0-9];
SECTIONS: 'SEC' '=' ([0-9]+','?)+;
FROMTIL: '{'NUM'-'NUM'}';
NUM: [0-9]+;
NAME: CHAR+ | ',' | '.' | '*';
CHAR: [a-zA-Z0-9_äöüßÄÖÜ\-];
MACRONAME: '#'[a-zA-Z_][a-zA-Z_0-9]*;
MACROFREE: [a-zA-Z!]+;
COMMENT: '//' ~('\r'|'\n')*;
The complexpattern/pattern/element/block parser rules should accept a simple '*', and i can't figure out why they don't.
In your macro rule, you defined the literal '*', causing the ID rule not to match a single "*" as input.

Struggling to parse array notation

I have a very simple grammar to parse statements.
Here are examples of the type of statements that need be parsed:
a.b.c
a.b.c == "88"
The issue I am having is that array notation is not matching. For example, things that are not working:
a.b[0].c
a[3][4]
I hope someone can point out what I am doing wrong here. (I am testing in ANTLRWorks)
Here is the grammar (generationUnit is my entry point):
grammar RatBinding;
generationUnit: testStatement | statement;
arrayAccesor : identifier arrayNotation+;
arrayNotation: '[' Number ']';
testStatement:
(statement | string | Number | Bool )
(greaterThanAndEqual
| lessThanOrEqual
| greaterThan
| lessThan | notEquals | equals)
(statement | string | Number | Bool )
;
part: identifier | arrayAccesor;
statement: part ('.' part )*;
string: ('"' identifier '"') | ('\'' identifier '\'');
greaterThanAndEqual: '>=';
lessThanOrEqual: '<=';
greaterThan: '>';
lessThan: '<';
notEquals : '!=';
equals: '==';
identifier: Letter (Letter|Digit)*;
Bool : 'true' | 'false';
ArrayLeft: '\u005B';
ArrayRight: '\u005D';
Letter
: '\u0024' |
'\u0041'..'\u005a' |
'\u005f '|
'\u0061'..'\u007a' |
'\u00c0'..'\u00d6' |
'\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' |
'\u0100'..'\u1fff' |
'\u3040'..'\u318f' |
'\u3300'..'\u337f' |
'\u3400'..'\u3d2d' |
'\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
Digit
: '\u0030'..'\u0039' |
'\u0660'..'\u0669' |
'\u06f0'..'\u06f9' |
'\u0966'..'\u096f' |
'\u09e6'..'\u09ef' |
'\u0a66'..'\u0a6f' |
'\u0ae6'..'\u0aef' |
'\u0b66'..'\u0b6f' |
'\u0be7'..'\u0bef' |
'\u0c66'..'\u0c6f' |
'\u0ce6'..'\u0cef' |
'\u0d66'..'\u0d6f' |
'\u0e50'..'\u0e59' |
'\u0ed0'..'\u0ed9' |
'\u1040'..'\u1049'
;
WS : [ \r\t\u000C\n]+ -> channel(HIDDEN)
;
You referenced the non-existent rule Number in the arrayNotation parser rule.
A Digit rule does exist in the lexer, but it will only match a single-digit number. For example, 1 is a Digit, but 10 is two separate Digit tokens so a[10] won't match the arrayAccesor rule. You probably want to resolve this in two parts:
Create a Number token consisting of one or more digits.
Number
: Digit+
;
Mark Digit as a fragment rule to indicate that it doesn't form tokens on its own, but is merely intended to be referenced from other lexer rules.
fragment // prevents a Digit token from being created on its own
Digit
: ...
You will not need to change arrayNotation because it already references the Number rule you created here.
Bah, waste of space. I Used Number instead of Digit in my array declaration.

Resources