ANTLR4 Grammar Issue with Decimal Numbers - antlr4

I'm new to ANTLR and am using ANTLR4 (the 4.7.2 jar file). I'm currently working on an Oracle parser.
I'm having issues with decimal numbers. I have kept only the relevant parts.
My grammar file is shown further below.
When I parse the statement below, it works fine. ".1" is a valid number in my case.
BEGIN a NUMBER:=.1; END;
I haven't shown the full grammar, but the following are all valid cases for me in Oracle.
a NUMBER:= .1; // with Space after operator
a NUMBER:=1.1; // without Space after operator
a NUMBER:=1; // without Space after operator
a NUMBER:= 3; // with Space after operator
Now I need to create a tablespace as below.
CREATE TABLESPACE tbs_01 DATAFILE +DATA/BR/CONTROLFILE/Current.260.750;
Here the digits 260 and 750 are tokenized together with the dot (as per the definition of NUMERIC_LITERAL). I want them to be two separate numbers separated by a dot, assigned to filenumber and incarnation_number respectively, as shown in the grammar.
How do I do this?
I have tried semantic predicates such as {_input.LA(-1)!='.'}?, but they did not work correctly for me.
I tried many other suggested approaches (most solutions were written for ANTLR3 and do not work in ANTLR4). Is there a simple way to do this in the lexer? I do not want to write a parser rule to split the decimal digits.
grammar Oracle;
parse
: ( sql_statements | error )* EOF
;
error
: UNEXPECTED_CHAR
{
throw new RuntimeException("UNEXPECTED_CHAR=" + $UNEXPECTED_CHAR.text);
}
;
sql_statements
: 'CREATE' 'TABLESPACE' tablespace_name 'DATAFILE' fully_qualified_file_name ';'
| 'BEGIN' var1 'NUMBER' ':=' num1 ';' 'END' ';'
;
tablespace_name : IDENTIFIER;
fully_qualified_file_name : K_PLUS_SIGN diskgroup_name K_SOLIDUS db_name K_SOLIDUS file_type K_SOLIDUS file_type_tag '.' filenumber '.' incarnation_number;
diskgroup_name : IDENTIFIER;
db_name : IDENTIFIER;
file_type : IDENTIFIER;
file_type_tag : IDENTIFIER;
filenumber : NUMERIC_LITERAL;
incarnation_number : NUMERIC_LITERAL;
var1 : IDENTIFIER;
num1 : NUMERIC_LITERAL;
IDENTIFIER : [a-zA-Z_] ([a-zA-Z] | '$' | '_' | '#' | DIGIT)* ;
K_PLUS_SIGN : '+';
K_SOLIDUS : '/';
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT+ )? ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
| '.' DIGIT+ ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
;
SPACES : [ \u000B\t\r\n] -> skip;
WS : [ \t\r\n]+ -> skip;
UNEXPECTED_CHAR : . ;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];

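Your DSL has a natural ambiguity: in some contexts numbers are integers, and in others they are decimals.
If the DSL provides sufficient guard conditions, ANTLR lexer modes can be used to isolate the two cases. For example, in the given DSL, decimal numbers appear to always occur between the := and ; guards. Note that modes are only allowed in a lexer grammar, so the combined grammar has to be split into a lexer grammar and a parser grammar.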
...
K_ASSIGN : ':=' -> pushMode(Decimals);
K_SEMI : ';' ;
NUMERIC_LITERAL : DIGIT+ ;
...
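mode Decimals;
D_SEMI : ';' -> type(K_SEMI), popMode ;
D_WS : [ \t\r\n]+ -> skip ; // whitespace may follow ':=' (e.g. ':= .1'), so skip it in this mode too
NUMERIC
: ( DIGIT+ ( '.' DIGIT+ )? ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
| '.' DIGIT+ ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
) -> type(NUMERIC_LITERAL) ; // parentheses make the command apply to both alternatives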
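To see how the mode switch separates the two kinds of numbers, here is a rough Python model of the idea. It is only a sketch and not how ANTLR is implemented: the rule tables, the lex function and the DOT token name are made up for illustration, keywords such as NUMBER are not modelled (they come out as IDENTIFIER), and the regular expressions are hand translations of the lexer rules above.

```python
import re

# Each rule: (token name, regex, action). Actions mimic pushMode/popMode/skip.
DEFAULT = [
    ("K_ASSIGN", r":=", "push"),
    ("K_SEMI", r";", None),
    ("DOT", r"\.", None),                      # hypothetical name for the '.' literal
    ("NUMERIC_LITERAL", r"[0-9]+", None),      # integers only in the default mode
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_$#]*", None),
    ("WS", r"[ \t\r\n]+", "skip"),
]
DECIMALS = [
    ("K_SEMI", r";", "pop"),
    ("NUMERIC_LITERAL",
     r"[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?[DF]?|\.[0-9]+([eE][+-]?[0-9]+)?[DF]?",
     None),
    ("WS", r"[ \t\r\n]+", "skip"),
]
MODES = {"DEFAULT": DEFAULT, "Decimals": DECIMALS}

def lex(text):
    """Tokenize text with a mode stack: longest match wins, ties go to the earlier rule."""
    stack, pos, out = ["DEFAULT"], 0, []
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[2]):
                best = (name, action, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, action, end = best
        if action != "skip":
            out.append((name, text[pos:end]))
        if action == "push":
            stack.append("Decimals")
        elif action == "pop":
            stack.pop()
        pos = end
    return out

print(lex("a NUMBER:= .1;"))
print(lex("Current.260.750;"))
```

In this model, .1 after := comes out as one NUMERIC_LITERAL, while 260 and 750 in the file name come out as two integer tokens separated by a dot token, which is what the filenumber and incarnation_number rules need.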

Related

Antlr4 float number

I am trying to use ANTLR4 to parse input from user but having a hard time.
I want to get a list of numbers. Here is part of my grammar:
number
: DEC
| FLOAT
| HEX
| BIN
;
FLOAT : DIGIT? '.' DIGIT*;
DEC : DIGIT+ ;
HEX : '0' [xX] ([A-Fa-f] | DIGIT)+ ;
BIN : '0' [bB] [01]+ ;
fragment ALPHA: [a-zA-Z_];
fragment DIGIT : [0-9];
WS : [ ,\t\r\n]+ -> skip;
When the input is 1 .2 3.2, I get 1 .2 3.2, as expected.
But if I use 1.2.3, it is recognized as 1.2 and .3, which is incorrect.
How can I change the grammar to fix this?
The FLOAT rule seems wrong. I have updated the number and FLOAT definitions. The code below works only for a single number.
number
: (FLOAT | DEC | HEX | BIN) EOF
;
FLOAT : DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
DEC : DIGIT+;
HEX : '0' [xX] ([A-Fa-f] | DIGIT)+ ;
BIN : '0' [bB] [01]+ ;
fragment ALPHA: [a-zA-Z_];
fragment DIGIT : [0-9];
WS : [ ,\t\r\n]+ -> skip;
In most complex grammars there are other tokens, such as signs and parentheses, so numbers can be told apart even when whitespace is skipped. Your grammar has only numbers, however, so once whitespace is skipped the parser cannot tell whether two numbers were separated in the input. I therefore removed the skip command from the WS rule. The code below handles multiple numbers and fails if the input is something like 1.2.3. When you process the result, work with the number tokens and ignore the WS tokens.
numbers
: WS? number (WS number)* WS? EOF;
number
: (FLOAT | DEC | HEX | BIN)
;
FLOAT : DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
DEC : DIGIT+;
HEX : '0' [xX] ([A-Fa-f] | DIGIT)+ ;
BIN : '0' [bB] [01]+ ;
fragment ALPHA: [a-zA-Z_];
fragment DIGIT : [0-9];
WS : [ ,\t\r\n]+;
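The effect of keeping WS as a real token can be modelled in a few lines of Python. This is a hedged sketch rather than ANTLR itself: tokenize and is_number_list are hypothetical helpers, the regular expressions are hand translations of the FLOAT, DEC and WS rules (HEX and BIN are left out), and is_number_list mirrors the parser rule WS? number (WS number)* WS?.

```python
import re

RULES = [
    ("FLOAT", r"[0-9]+\.[0-9]*|\.[0-9]+"),
    ("DEC", r"[0-9]+"),
    ("WS", r"[ ,\t\r\n]+"),
]

def tokenize(text):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

def is_number_list(tokens):
    """Check the shape: WS? number (WS number)* WS?"""
    names = [name for name, _ in tokens]
    if names and names[0] == "WS":
        names = names[1:]
    if names and names[-1] == "WS":
        names = names[:-1]
    if len(names) % 2 == 0:
        return False
    return all((n == "WS") == (i % 2 == 1) for i, n in enumerate(names))

print(tokenize("1.2.3"))                    # the lexer still splits it into 1.2 and .3
print(is_number_list(tokenize("1.2.3")))    # but the parser shape rejects it
print(is_number_list(tokenize("1 .2 3.2")))
```

The lexer alone cannot reject 1.2.3, because 1.2 followed by .3 is a valid token sequence. The rejection comes from the parser rule, which requires a WS token between two numbers.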

Why won't my ANTLR rules make a nice parse tree?

I'm trying to create a grammar that would help me parse a string like this:
[Hello:/c=0.3//a=hi/] [what:/c=0.4/] [are:/c=0.6//a=is/]
This is my grammar:
grammar MyGrammar;
WS: [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
sentence: WORD+;
WORD: '[' WORD_DESCRIPTOR ']';
WORD_DESCRIPTOR: WORD_IDENTIFIER ':' WORD_FEATURES_DESCRIPTORS;
WORD_IDENTIFIER: STRING;
WORD_FEATURES_DESCRIPTORS: WORD_FEATURE_DESCRIPTOR+;
WORD_FEATURE_DESCRIPTOR: '/' WORD_FEATURE_IDENTIFIER '=' WORD_FEATURE_VALUE '/';
WORD_FEATURE_IDENTIFIER:
C_FEATURE | A_FEATURE
;
C_FEATURE: 'c';
A_FEATURE: 'a';
WORD_FEATURE_VALUE: STRING | NUMBER;
fragment LETTER : LOWER | UPPER ;
fragment LOWER : 'a'..'z' ;
fragment UPPER : 'A'..'Z' ;
fragment DIGIT : '0'..'9' ;
fragment INTEGER: DIGIT+ ;
fragment NUMBER: INTEGER (DOT INTEGER)? ;
fragment STRING: LETTER+ ;
fragment DOT: '.' ;
The problem is that the parse tree has only one level.
What am I doing wrong?
Your parse tree shows up the way it does because all tokens are leaf nodes, and all parser rules are internal nodes. Since you only have a single parser rule (sentence) and the rest are all tokens, this is the parse tree:
sentence
/ | | \
/ | | \
WORD WORD WORD WORD ...
You should see tokens as the atoms that your language is built from. Once you start creating tokens like TOKEN : TOKEN_A | TOKEN_B;, then that is often better defined as a parser rule: token : TOKEN_A | TOKEN_B;.
Try something like this instead:
sentence : word+ EOF;
word : '[' word_descriptor ']';
word_descriptor : word_identifier ':' word_feature_descriptors;
word_identifier : STRING;
word_feature_descriptors : word_feature_descriptor+;
word_feature_descriptor : '/' word_feature_identifier '=' word_feature_value '/';
word_feature_value : STRING | NUMBER;
word_feature_identifier : C_FEATURE | A_FEATURE;
C_FEATURE : 'c';
A_FEATURE : 'a';
NUMBER : INTEGER (DOT INTEGER)?;
STRING : LETTER+ ;
WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
fragment LETTER : LOWER | UPPER;
fragment LOWER : [a-z];
fragment UPPER : [A-Z];
fragment DIGIT : [0-9];
fragment INTEGER : DIGIT+;
fragment DOT : '.';
which will create the following parse tree for your input:

ANTLR single grammar input mismatch

So far I've been testing with ANTLR4, using this single grammar:
grammar LivingDSLParser;
options{
language = Java;
//tokenVocab = LivingDSLLexer;
}
living
: query #QUERY
;
query
: K_QUERY entity K_WITH expr
;
entity
: STAR #ALL
| D_FUAS #FUAS
| D_RESOURCES #RESOURCES
;
field
: ((D_FIELD | D_PROPERTY | D_METAINFO) DOT)? IDENTIFIER
| STAR
;
expr
: field
| expr ( '*' | '/' | '%' ) expr
| expr ( '+' | '-' ) expr
| expr ( '<<' | '>>' | '&' | '|' ) expr
| expr ( '<' | '<=' | '>' | '>=' ) expr
| expr ( '=' | '==' | '!=' | '<>' ) expr
| expr K_AND expr
| expr K_OR expr
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]* // TODO check: needs more chars in set
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
K_QUERY : Q U E R Y;
K_WITH: W I T H;
K_OR: O R;
K_AND: A N D;
D_FUAS : F U A S;
D_RESOURCES : R E S O U R C E S;
D_METAINFO: M E T A I N F O;
D_PROPERTY: P R O P E R T Y;
D_FIELD: F I E L D;
STAR : '*';
PLUS : '+';
MINUS : '-';
PIPE2 : '||';
DIV : '/';
MOD : '%';
LT2 : '<<';
GT2 : '>>';
AMP : '&';
PIPE : '|';
LT : '<';
LT_EQ : '<=';
GT : '>';
GT_EQ : '>=';
EQ : '==';
NOT_EQ1 : '!=';
NOT_EQ2 : '<>';
OPEN_PAR : '(';
CLOSE_PAR : ')';
SCOL : ';';
DOT : '.';
SPACES
: [ \u000B\t\r\n] -> channel(HIDDEN)
;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
//so on...
As far as I have been able to figure out, input like this should match:
query fuas with field.xxx == property.yyy
However, I receive this message:
LivingDSLParser::living:1:0: mismatched input 'query' expecting K_QUERY
I have no idea where the problem is, or what this message means.
Whenever two or more lexer rules match the same input, ANTLR chooses the rule that is defined first. Since both IDENTIFIER and K_QUERY match the input "query", and IDENTIFIER is defined before K_QUERY, the lexer emits an IDENTIFIER token and the parser never sees K_QUERY.
Solution: move your IDENTIFIER rule below your keyword definitions.
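The tie-breaking can be illustrated with a small Python model. It is a simplified sketch, not ANTLR's implementation: tokenize is a hypothetical helper, the regular expressions are hand translations of the IDENTIFIER, K_QUERY and SPACES rules, and the only thing that changes between the two calls is the order of the rule list.

```python
import re

def tokenize(rules, text):
    """Pick the longest match at each position; on a tie the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, end = best
        if name != "SPACES":
            tokens.append((name, text[pos:end]))
        pos = end
    return tokens

IDENTIFIER = ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z_0-9]*")
K_QUERY = ("K_QUERY", r"[qQ][uU][eE][rR][yY]")
SPACES = ("SPACES", r"[ \t\r\n]")

identifier_first = tokenize([IDENTIFIER, K_QUERY, SPACES], "query fuas")
keyword_first = tokenize([K_QUERY, IDENTIFIER, SPACES], "query fuas")
print(identifier_first)  # 'query' comes out as IDENTIFIER
print(keyword_first)     # 'query' comes out as K_QUERY
```

A longer word such as queries still becomes an IDENTIFIER in both orderings, because the longest match is preferred before rule order is considered.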

ANTLR4 Grammar picks up 'and' and 'or' in variable names

Please help me with my ANTLR4 Grammar.
Sample "formel":
(Arbejde.ArbejderIKommuneNr=860) and (Arbejde.ErIArbejde = 'J') &
(Arbejde.ArbejdsTimerPrUge = 40)
(Ansogeren.BorIKommunen = 'J') and (BeregnDato(Ansogeren.Fodselsdato;
'+62Å') < DagsDato)
(Arb.BorI=860)
My problem is that Arb.BorI=860 is not handled correctly. I get this error:
Error: no viable alternative at input '(Arb.Bor' at linenr/position: 1/6 \r\nException: Der blev udløst en undtagelse af typen 'Antlr4.Runtime.NoViableAltException
Please note that Arb.BorI contains the word 'or'.
I think my problem is that 'booleanOps' in my grammar overrides 'datakildefelt'.
So my question is: how do I get my grammar right? I am stuck, so any help will be appreciated.
My Grammar:
grammar UnikFormel;
formel : boolExpression # BooleanExpr
| expression # Expr
| '(' formel ')' # Parentes;
boolExpression : ( '(' expression ')' ) ( booleanOps '(' expression ')' )+;
expression : element compareOps element # Compare;
element : datakildefelt # DatakildeId
| function # Funktion
| int # Integer
| decimal # Real
| string # Text;
datakildefelt : datakilde '.' felt;
datakilde : identifyer;
felt : identifyer;
function : funktionsnavn ('(' funcParameters? ')')?;
funktionsnavn : identifyer;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
identifyer : LETTER+;
int : DIGIT+;
decimal : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
string : QUOTE .*? QUOTE;
booleanOps : (AND | OR);
compareOps : (LT | GT | EQ | GTEQ | LTEQ);
QUOTE : '\'';
OPERATOR: '+';
DIGIT: [0-9];
LETTER: [a-åA-Å];
MUL : '*';
DIV : '/';
ADD : '+';
SUB : '-';
GT : '>';
LT : '<';
EQ : '=';
GTEQ : '>=';
LTEQ : '<=';
AND : '&' | 'and';
OR : '?' | 'or';
WS : ' '+ -> skip;
The lexer prefers the longest match and, when two rules match the same text, the rule that is defined first. So AND and OR must be defined before any rule that can match the same words, such as an identifier rule. The order of GT/GTEQ and LT/LTEQ is less critical, because '>=' is a longer match than '>', but listing the longer operators first does no harm.
EDIT
Additionally, you should make identifyer a lexer rule, i.e. start its name with a capital letter (IDENTIFIER or Identifier). The same goes for int, decimal and string. The input is initially a stream of characters and is first processed into a stream of tokens, using only lexer rules. At this point parser rules (those starting with a lowercase letter) do not come into play yet. So, to make "BorI" parse as a single entity (token), you need to create a lexer rule that matches identifiers. Currently it is split into 3 tokens: LETTER (B) OR (or) LETTER (I).
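The split of "BorI" into three tokens can be reproduced with a rough Python model of the lexer. This is a sketch under assumptions, not ANTLR's implementation: tokenize is a hypothetical helper, the letter class is simplified to ASCII letters plus æ, ø and å, and ID stands for the identifier lexer rule suggested above.

```python
import re

def tokenize(rules, text):
    """Longest match at each position; on a tie the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

LETTERS = "a-zA-ZæøåÆØÅ"  # simplified stand-in for the grammar's letter range

# Original layout: identifiers are built in the parser from single LETTER tokens.
single_letter_rules = [
    ("LETTER", f"[{LETTERS}]"),
    ("AND", r"&|and"),
    ("OR", r"\?|or"),
]
# Suggested layout: keywords first, then one lexer rule for whole identifiers.
identifier_rules = [
    ("AND", r"&|and"),
    ("OR", r"\?|or"),
    ("ID", f"[{LETTERS}][{LETTERS}0-9]*"),
]

print(tokenize(single_letter_rules, "BorI"))  # three tokens: B, or, I
print(tokenize(identifier_rules, "BorI"))     # one ID token
print(tokenize(identifier_rules, "or"))       # still the OR keyword
```

With single-letter tokens the two-character match 'or' always beats a one-character LETTER, whatever the rule order. With an ID rule, "BorI" is the longest match and the keyword only wins for the exact word "or".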
Thanks for your help. There were multiple problems. Reading the ANTLR4 book and using "TestRig -gui" got me on the right track. The working grammar is:
grammar UnikFormel;
formel : '(' formel ')' # Parentes
| expression # Expr
| boolExpression # BooleanExpr
;
boolExpression : '(' expression ')' ( booleanOps '(' expression ')' )+
| '(' formel ')' ( booleanOps '(' formel ')' )+;
expression : element compareOps element # Compare;
datakildefelt : ID '.' ID;
function : ID ('(' funcParameters? ')')?;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
element : datakildefelt # DatakildeId
| function # Funktion
| INT # Integer
| DECIMAL # Real
| STRING # Text;
booleanOps : (AND | OR);
compareOps : ( GTEQ | LTEQ | LT | GT | EQ );
AND : '&' | 'and';
OR : '?' | 'or';
GTEQ : '>=';
LTEQ : '<=';
GT : '>';
LT : '<';
EQ : '=';
ID : LETTER ( LETTER | DIGIT)*;
INT : DIGIT+;
DECIMAL : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
STRING : QUOTE .*? QUOTE;
fragment QUOTE : '\'';
fragment DIGIT: [0-9];
fragment LETTER: [a-åA-Å];
WS : [ \t\r\n]+ -> skip;

parsing SQL CREATE statement using ANTLR4: no viable alternative at input 'conflict'

I'm a new ANTLR user and am trying to parse the following SQL CREATE statement. (I dropped some unimportant parts of both the SQL and the grammar.)
CREATE TABLE Account (_id integer primary key, conflict integer default 1);
The grammar is like this (it compiles as is if you copy and paste it):
grammar CreateTable;
tableList : (createTableStmt)* ;
createTableStmt : CREATE TABLE tableName LP columnDefs (COMMA tableConstraints)? RP SEMICOLON ;
columnDefs : columnDef (COMMA columnDef)* ;
columnDef : columnName typeName? columnConstraint* ;
typeName : sqliteType (LP SIGNED_NUMBER (COMMA SIGNED_NUMBER)? RP)? ;
sqliteType : intType | textType | ID ;
intType : 'INTEGER'|'LONG';
textType : TEXT ;
columnConstraint
: (CONSTRAINT name)? PRIMARY KEY conflictClause?
| (CONSTRAINT name)? UNIQUE conflictClause?
| (CONSTRAINT name)? DEFAULT SIGNED_NUMBER
;
tableConstraints
: tableConstraint (COMMA tableConstraint)* ;
tableConstraint
: (CONSTRAINT name)? (PRIMARY KEY|UNIQUE) LP indexedColumns RP conflictClause? ;
conflictClause : ON CONFLICT REPLACE ;
indexedColumns : indexedColumn (COMMA indexedColumn)* ;
indexedColumn : columnName;
columnName : name ;
tableName : name ;
name : ID | '\"' ID '\"' | STRING_LITERAL ;
SIGNED_NUMBER : (PLUS|MINUS)? NUMERIC_LITERAL ;
NUMERIC_LITERAL : DIGIT+ ;
STRING_LITERAL : '\'' (~'\'')* '\'' ;
LP : '(' ;
RP : ')' ;
COMMA : ',' ;
SEMICOLON : ';' ;
PLUS : '+' ;
MINUS : '-' ;
CONFLICT : C O N F L I C T ;
CONSTRAINT : C O N S T R A I N T ;
CREATE : C R E A T E ;
DEFAULT : D E F A U L T;
KEY : K E Y ;
ON : O N;
PRIMARY : P R I M A R Y ;
REPLACE : R E P L A C E;
TABLE : T A B L E ;
TEXT : T E X T;
UNIQUE : U N I Q U E ;
WS : [ \t\r\n\f]+ -> channel(HIDDEN);
ID : LETTER (LETTER|DIGIT)*;
fragment LETTER : [a-zA-Z_];
fragment DIGIT : [0-9] ;
NL : '\r'? '\n' ;
fragment A:('a'|'A'); fragment B:('b'|'B'); fragment C:('c'|'C');
fragment D:('d'|'D'); fragment E:('e'|'E'); fragment F:('f'|'F');
fragment G:('g'|'G'); fragment I:('i'|'I'); fragment K:('k'|'K');
fragment L:('l'|'L'); fragment M:('m'|'M'); fragment N:('n'|'N');
fragment O:('o'|'O'); fragment P:('p'|'P'); fragment Q:('q'|'Q');
fragment R:('r'|'R'); fragment S:('s'|'S'); fragment T:('t'|'T');
fragment U:('u'|'U'); fragment X:('x'|'X');
By the way, the SQL statement above uses the reserved word 'conflict' as a column name. If I replace the column name 'conflict' with another name such as 'conflict1', everything is okay.
What should I change to parse the SQL statement above?
The parse trees look like this.
Thanks
You are defining the input "conflict" as a separate token CONFLICT, so the lexer never hands it to the parser as an ID. If it is also a valid table name and column name, this should work:
name : ID | '\"' ID '\"' | STRING_LITERAL | CONFLICT ;
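Why 'conflict' fails while 'conflict1' works can be shown with a short Python model. It is only a sketch, not ANTLR's implementation: token_type is a hypothetical helper that classifies a single word, the regular expressions are hand translations of the CONFLICT and ID rules, and the two sets stand for the token types accepted by the name rule before and after the change (the quoted-ID alternative is left out).

```python
import re

# Rule order follows the grammar: CONFLICT is defined before ID.
RULES = [
    ("CONFLICT", r"[cC][oO][nN][fF][lL][iI][cC][tT]"),
    ("ID", r"[a-zA-Z_][a-zA-Z_0-9]*"),
]

def token_type(word):
    """Longest match wins; on a tie the rule listed first wins."""
    best_name, best_len = None, 0
    for name, pattern in RULES:
        m = re.match(pattern, word)
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name

NAME_BEFORE = {"ID", "STRING_LITERAL"}   # token types the original name rule accepts
NAME_AFTER = NAME_BEFORE | {"CONFLICT"}  # with CONFLICT added as an alternative

for word in ("conflict", "conflict1"):
    kind = token_type(word)
    print(word, kind, kind in NAME_BEFORE, kind in NAME_AFTER)
```

The same approach extends to any other keyword that must also be usable as a name: each such token has to be listed as an alternative of name.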
