ANTLR4 mismatched input error in SQL parser

I am getting the following error when parsing, but I am not sure why it happens.
line 1:24 mismatched input '1' expecting NUM
line 1:24 mismatched input '1' expecting NUM
select a from abc limit 1 ;
--
grammar SQLCmd;
parse : sql
;
sql : ('select' ((columns (',' columns))|count) 'from')
tables
('where' condition ((and|or) condition))* (limit)? ';'
;
limit : 'limit' NUM
;
num : NUM
;
count : 'count(*)'
;
columns : VAL
;
tables : VAL
;
condition : ( left '=' right )+
;
and : 'and'
;
or : 'or'
;
left : VAL
;
right : VAL
;
VAL : [*a-z0-9A-Z~?]+
;
NUM : [0-9]+
;
WS : [ \t\n\r]+ -> skip
;

It looks like the lexer is producing a VAL token where the parser expects a NUM.
The input "1" matches both VAL and NUM with the same length. When two lexer rules match the same text, ANTLR picks the rule that is defined first, and since VAL comes first, every digit-only input becomes a VAL and the lexer never emits a NUM token.
Try putting the NUM rule before the VAL rule.
You could have found this out yourself by printing the token types produced by the lexer, which shows the actual type of each token.
@TheAntlrGuy: Maybe the actual token type could be added to the error message?
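The rule-ordering effect described above can be reproduced outside ANTLR. The following is a minimal Python sketch, not ANTLR itself: a hypothetical tokenize helper approximates the lexer's choice (longest match wins; on a tie the rule listed first wins) with regular expressions copied from the VAL, NUM and WS rules in the question. Keyword literals such as 'select' are left out for brevity.

```python
import re

def tokenize(text, rules):
    """Return (rule_name, lexeme) pairs: longest match wins, earliest rule wins ties."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on equal length the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # mimic '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Patterns copied from the grammar in the question.
VAL = ("VAL", r"[*a-z0-9A-Z~?]+")
NUM = ("NUM", r"[0-9]+")
WS = ("WS", r"[ \t\n\r]+")

val_first = tokenize("abc 1", [VAL, NUM, WS])
num_first = tokenize("abc 1", [NUM, VAL, WS])
print(val_first)  # [('VAL', 'abc'), ('VAL', '1')]
print(num_first)  # [('VAL', 'abc'), ('NUM', '1')]
```

With VAL listed first the "1" is typed as VAL, which is what the parser complains about; with NUM listed first the same text is typed as NUM.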

Related

ANTLR4 handling continuations for "any data"

The grammar I need to create is based on the following:
1. Command lines start with a slash
2. Command lines can be continued with a hyphen as the last character (excluding whitespace) on a line
3. For some commands I want to parse their parameters
4. For other commands I am not interested in their parameters
This works almost fine with the following (simplified) lexer:
lexer grammar T1Lexer;
NewLine
: [\r\n]+ -> skip
;
CommandStart
: '/' -> pushMode(CommandMode)
;
DataStart
: . -> more, pushMode(DataMode)
;
mode DataMode;
DataLine
: ~[\r\n]+ -> popMode
;
mode CommandMode;
CmNL
: [\r\n]+ -> skip, popMode
;
CONTINUEMINUS : ( '-' [ ]* ('\r/' | '\n/' | '\r\n/') ) -> channel(HIDDEN);
EOL: ( [ ]* ('\r' | '\n' | '\r\n') ) -> popMode;
SPACE : [ \t\r\n]+ -> channel(HIDDEN) ;
DOT : [.] ;
COMMA : ',' ;
CMD1 : 'CMD1';
CMD2 : 'CMD2';
CMDIGN : 'CMDIGN' -> pushMode(DataMode) ;
VAR1 : 'VAR1=' ;
ID : ID_LITERAL;
fragment ID_LITERAL: [A-Z_$0-9]*?[A-Z_$]+?[A-Z_$0-9]*;
and Parser:
parser grammar T1Parser;
options { tokenVocab=T1Lexer; }
root : line+ EOF ;
line: ( commandLine | dataLine)+ ;
dataLine : DataLine ;
commandLine : CommandStart command ;
command : cmd1 | cmd2 | cmdign ;
cmd1 : CMD1 (VAR1 ID)+ ;
cmd2 : CMD2 (VAR1 ID)+ ;
cmdign : CMDIGN DataLine ;
The problem arises where I need a combination of 2. and 4., i.e. a continuation for a command whose parameters I simply want as an unparsed string (lines 5+6 in the example).
When I push to DataMode for CMDIGN on line 5, the continuation character is not recognized because it is swallowed by the "any until EOL" rule. The lexer then pops back to default mode, and the continuation line is treated as a new command and fails to parse.
Is there a way of handling this combination properly?
TIA - Alex
For your example, you don't really need a CommandMode; it actually complicates things a bit.
T1Lexer.g4:
lexer grammar T1Lexer
;
CMD_START: '/';
CONTINUE_EOL_SLASH: '-' EOL_F '/' -> channel(HIDDEN);
EOL: EOL_F;
WS: [ \t]+ -> channel(HIDDEN);
DOT: [.];
COMMA: ',';
CMD1: 'CMD1';
CMD2: 'CMD2';
CMDIGN: 'CMDIGN' -> pushMode(DataMode);
VAR1: 'VAR1=';
ID: ID_LITERAL;
//=======================================
mode DataMode
;
DM_EOL: EOL_F -> type(EOL), popMode;
DATA_LINE: ( ~[\r\n]*? '-' EOL_F)* ~[\r\n]+;
//=======================================
fragment NL: '\r'? '\n';
fragment EOL_F: [ ]* NL;
fragment ID_LITERAL: [A-Z_$0-9]*? [A-Z_$]+? [A-Z_$0-9]*;
T1Parser.g4
parser grammar T1Parser
;
options {
tokenVocab = T1Lexer;
}
root: line (EOL line)* EOL? EOF;
line: commandLine | dataLine | emptyLine;
dataLine: DATA_LINE;
commandLine: CMD_START command;
emptyLine: CMD_START;
command: cmd1 | cmd2 | cmdign;
cmd1: CMD1 (VAR1 ID)+;
cmd2: CMD2 (VAR1 ID)+;
cmdign: CMDIGN DATA_LINE?;
Test Input:
/ CMD1 VAR1=VAL1 VAR1=VAL2
/ CMDIGN VAR1=BLAH VAR2=BLAH
/ CMD2 VAR1=VAL12 -
/ VAR1=VAL22
/ CMDIGN
/
/ CMDIGN VAR-1=0 -
/ VAR2=notignored
Token Stream:
[#0,0:0='/',<'/'>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:5='CMD1',<'CMD1'>,1:2]
[#3,6:6=' ',<WS>,channel=1,1:6]
[#4,7:11='VAR1=',<'VAR1='>,1:7]
[#5,12:15='VAL1',<ID>,1:12]
[#6,16:16=' ',<WS>,channel=1,1:16]
[#7,17:21='VAR1=',<'VAR1='>,1:17]
[#8,22:25='VAL2',<ID>,1:22]
[#9,26:26='\n',<EOL>,1:26]
[#10,27:27='/',<'/'>,2:0]
[#11,28:28=' ',<WS>,channel=1,2:1]
[#12,29:34='CMDIGN',<'CMDIGN'>,2:2]
[#13,35:54=' VAR1=BLAH VAR2=BLAH',<DATA_LINE>,2:8]
[#14,55:55='\n',<EOL>,2:28]
[#15,56:56='/',<'/'>,3:0]
[#16,57:57=' ',<WS>,channel=1,3:1]
[#17,58:61='CMD2',<'CMD2'>,3:2]
[#18,62:62=' ',<WS>,channel=1,3:6]
[#19,63:67='VAR1=',<'VAR1='>,3:7]
[#20,68:72='VAL12',<ID>,3:12]
[#21,73:73=' ',<WS>,channel=1,3:17]
[#22,74:76='-\n/',<CONTINUE_EOL_SLASH>,channel=1,3:18]
[#23,77:82=' ',<WS>,channel=1,4:1]
[#24,83:87='VAR1=',<'VAR1='>,4:7]
[#25,88:92='VAL22',<ID>,4:12]
[#26,93:93='\n',<EOL>,4:17]
[#27,94:94='/',<'/'>,5:0]
[#28,95:95=' ',<WS>,channel=1,5:1]
[#29,96:101='CMDIGN',<'CMDIGN'>,5:2]
[#30,102:102='\n',<EOL>,5:8]
[#31,103:103='/',<'/'>,6:0]
[#32,104:104='\n',<EOL>,6:1]
[#33,105:105='/',<'/'>,7:0]
[#34,106:106=' ',<WS>,channel=1,7:1]
[#35,107:112='CMDIGN',<'CMDIGN'>,7:2]
[#36,113:150=' VAR-1=0 - \n/
tree output:
(root
(line
(commandLine
/
(command
(cmd1 CMD1 VAR1= VAL1 VAR1= VAL2)
)
)
)
\n
(line
(commandLine
/
(command
(cmdign CMDIGN VAR1=BLAH VAR2=BLAH)
)
)
)
\n
(line
(commandLine
/
(command
(cmd2 CMD2 VAR1= VAL12 VAR1= VAL22)
)
)
)
\n
(line
(commandLine
/
(command
(cmdign CMDIGN)
)
)
)
\n
(line
(emptyLine /)
)
\n
(line
(commandLine
/
(command
(cmdign CMDIGN VAR-1=0 - \n/ VAR2=notignored)
)
)
)
<EOF>
)
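To see why the DATA_LINE rule keeps a continued CMDIGN line together, it can help to try the same pattern as an ordinary regular expression. The sketch below is a Python approximation only: it assumes that Python's backtracking regex engine and ANTLR's lexer agree on these particular inputs (they are different engines), and the helper name data_line is made up for illustration.

```python
import re

# DATA_LINE: ( ~[\r\n]*? '-' EOL_F )* ~[\r\n]+
# EOL_F:     [ ]* '\r'? '\n'
DATA_LINE = re.compile(r"(?:[^\r\n]*?-[ ]*\r?\n)*[^\r\n]+")

def data_line(text):
    """Return the text a DATA_LINE-like pattern would match at the start of text."""
    m = DATA_LINE.match(text)
    return m.group(0) if m else None

# No trailing hyphen: the match stops at the end of the line.
plain = data_line(" VAR1=BLAH VAR2=BLAH\n/ CMD2")

# Hyphen (plus optional blanks) before the newline: the next line is included.
# The hyphen inside VAR-1 is not followed by a newline, so it does not count.
continued = data_line(" VAR-1=0 - \n/ VAR2=notignored\n/")

print(repr(plain))      # ' VAR1=BLAH VAR2=BLAH'
print(repr(continued))  # ' VAR-1=0 - \n/ VAR2=notignored'
```

The second result matches the text of the last cmdign node in the tree output: the continuation marker and the following line end up inside one data token.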

ANTLR4 catch an entire line of arbitrary data

I have a grammar with command lines starting with a / and "data lines", which are all lines that do not start with a slash.
I just can't get it to parse correctly. The following rule
FM_DATA: ( ('\r' | '\n' | '\r\n') ~'/') -> mode(DATA_MODE);
does almost what I need but for a data line of
abcde
the following tokens are generated
[#23,170:171='\na',<4>,4:72]
[#24,172:175='bcde',<103>,5:1]
so the first character is swallowed by the rule.
I also tried
FM_DATA: ( {getCharPositionInLine() == 0}? ~'/') -> mode(DATA_MODE);
but this causes even weirder things.
What's the correct rule for getting this to work as expected ?
TIA - Alex
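The "-> more" lexer command can be used so that the first character (or the first part of a lexer rule) is matched without being emitted as a token of its own; its text is kept and becomes the start of the next token.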
A quick demo:
lexer grammar FmDataLexer;
NewLine
: [\r\n]+ -> skip
;
CommandStart
: '/' -> pushMode(CommandMode)
;
FmDataStart
: . -> more, pushMode(FmDataMode)
;
mode CommandMode;
CommandLine
: ~[\r\n]+ -> popMode
;
mode FmDataMode;
FmData
: ~[\r\n]+ -> popMode
;
If you run the following code:
FmDataLexer lexer = new FmDataLexer(CharStreams.fromString("abcde\n/mu"));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
System.out.printf("%-20s '%s'\n", FmDataLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
you'll get this output:
FmData 'abcde'
CommandStart '/'
CommandLine 'mu'
EOF '<EOF>'
See: https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md#mode-pushmode-popmode-and-more
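The effect of "more" in the demo can be pictured with a small hand-written scanner. This is a Python sketch of the idea only, not of ANTLR's implementation: the function name lex is hypothetical, modes are reduced to a single variable, and edge cases such as one-character data lines are handled more leniently than the grammar would handle them.

```python
def lex(text):
    """Split text into (type, text) tokens the way the demo grammar does."""
    tokens = []
    pos, n = 0, len(text)
    while pos < n:
        ch = text[pos]
        if ch in "\r\n":  # NewLine -> skip
            pos += 1
            continue
        if ch == "/":  # CommandStart, rest of the line is a CommandLine
            tokens.append(("CommandStart", "/"))
            pos += 1
            kind = "CommandLine"
        else:
            # FmDataStart with 'more': the character is recognised but pos is
            # NOT advanced past it, so it stays part of the FmData text.
            kind = "FmData"
        end = pos
        while end < n and text[end] not in "\r\n":
            end += 1
        if end > pos:
            tokens.append((kind, text[pos:end]))
        pos = end
    return tokens

result = lex("abcde\n/mu")
for kind, value in result:
    print(f"{kind:<20} '{value}'")
```

For the input from the demo this yields FmData 'abcde', CommandStart '/' and CommandLine 'mu', i.e. the first character of the data line is no longer lost.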

How to parse expression with parenthesis?

I would like to parse an expression with parenthesis in python using textx.
For example the following DSL :
CREATE boby = sacha - ( boby & tralaa) ; 
CREATE boby = sacha & boby - ( david & lucas )
This is the grammar I tried:
Model:
    'CREATE' name=Identifier '=' exp=SetExpr
;
JoinOperator: /-/&/;
SetExpr:SetParExpr | SetBaseExpr 
;
SetBaseExpr:
    first=ID op=JoinOperator second=ID
;
SetParExpr:
    '(' SetExpr ')'
I guess I should have a list somewhere to fill with expression.
Do you have any suggestion ?
I've changed your examples slightly: I added a semicolon to the end and put another pair of parentheses in your second example. I inferred these changes from what you provided in your grammar. Here are the examples:
CREATE boby = sacha - ( boby & tralaa);
CREATE boby = sacha & (boby - ( david & lucas ));
To parse examples like these, your grammar needs the following changes:
1. Take in multiple Models (I created a Script rule that takes semicolon-separated models).
2. Allow the second property of the SetBaseExpr rule to be an ID or a SetParExpr.
3. Change Identifier to ID in the Model rule (I assume this is what you meant).
I made these changes and ended up with the following grammar that parses the examples I gave:
Script:
models+=Model[';'] ';'
;
Model:
'CREATE' name=ID '=' exp=SetExpr
;
JoinOperator: '-' | '&';
SetExpr:
SetParExpr | SetBaseExpr
;
SetBaseExpr:
first=ID op=JoinOperator (second=ID | second=SetParExpr)
;
SetParExpr:
'(' SetExpr ')'
;
I hope that answers your question or gives you a hint as to how to handle parenthesized expressions.
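For readers who want to trace how the rules above nest, here is a hand-written recursive-descent sketch in plain Python. It does not use textX; the class and function names (Parser, parse) are invented for this illustration, keywords are not reserved, and characters outside the token pattern are silently ignored. Each method mirrors one rule of the grammar.

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|[-&()=;]")
IDENT = re.compile(r"[A-Za-z_]\w*")

class Parser:
    def __init__(self, text):
        self.toks = TOKEN.findall(text)
        self.i = 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def ident(self):
        tok = self.eat()
        if not IDENT.fullmatch(tok):
            raise SyntaxError(f"expected identifier, got {tok!r}")
        return tok

    def script(self):  # Script: models+=Model[';'] ';'
        models = [self.model()]
        self.eat(";")
        while self.peek() == "CREATE":
            models.append(self.model())
            self.eat(";")
        return models

    def model(self):  # Model: 'CREATE' name=ID '=' exp=SetExpr
        self.eat("CREATE")
        name = self.ident()
        self.eat("=")
        return (name, self.set_expr())

    def set_expr(self):  # SetExpr: SetParExpr | SetBaseExpr
        return self.par_expr() if self.peek() == "(" else self.base_expr()

    def par_expr(self):  # SetParExpr: '(' SetExpr ')'
        self.eat("(")
        inner = self.set_expr()
        self.eat(")")
        return inner

    def base_expr(self):  # SetBaseExpr: first=ID op (second=ID | second=SetParExpr)
        first = self.ident()
        op = self.eat()
        if op not in ("-", "&"):
            raise SyntaxError(f"expected '-' or '&', got {op!r}")
        second = self.par_expr() if self.peek() == "(" else self.ident()
        return (op, first, second)

def parse(text):
    parser = Parser(text)
    models = parser.script()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return models

parsed = parse("CREATE boby = sacha - ( boby & tralaa);\n"
               "CREATE boby = sacha & (boby - ( david & lucas ));")
print(parsed)
```

The unparenthesized form from the question, sacha & boby - ( david & lucas ), raises a SyntaxError in this sketch because SetBaseExpr accepts exactly one operator, which is why the extra pair of parentheses was added to the second example.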

Overlapping rules - mismatched input

My grammar (shown below, trimmed down from the original) requires somewhat overlapping rules:
grammar NOVIANum;
statement : (priorityStatement | integerStatement)* ;
priorityStatement : T_PRIO TwoDigits ;
integerStatement : T_INTEGER Integer ;
WS : [ \t\r\n]+ -> skip ;
T_PRIO : 'PRIO' ;
T_INTEGER : 'INTEGER' ;
Integer: OneToNine Digit* | ZERO ;
TwoDigits : Digit Digit ;
fragment OneToNine : ('1'..'9') ;
fragment Digit: ('0'..'9');
ZERO : [0] ;
so "Integer" and "TwoDigits" overlap to a certain extent.
The following input
INTEGER 10
PRIO 10
results in
line 2:5 mismatched input '10' expecting TwoDigits
when Integer precedes TwoDigits and in
line 1:8 mismatched input '10' expecting Integer
when TwoDigits precedes Integer in the grammar.
Is there a way around this ?
Thanks - Alex
Edit:
Thanks @GRosenberg, your suggestion of course worked for this small example, but when I integrated it into my full grammar it led, sure enough, to different mismatched input errors.
The reason is another lexer rule which requires a range of '[1-4]', so I thought I'd be clever and turn it into
grammar NOVIANum;
statement : (priorityT | integerT | levelT )* ;
priorityT : T_PRIO twoDigits ;
integerT : T_INTEGER integer ;
levelT : T_LEVEL levelNumber ;
levelNumber : ( ZERO DIGIT ) | ( OneToFour (ZERO | DIGIT) ) ;
integer: ZERO* ( DIGIT ( DIGIT | ZERO )* ) ;
twoDigits : (ZERO | DIGIT) ( ZERO | DIGIT ) ;
oneToFour : OneToFour (DIGIT | ZERO) ;
WS : [ \t\r\n]+ -> skip ;
T_INTEGER : 'INTEGER' ;
T_LEVEL : 'LEVEL' ;
T_PRIO : 'PRIO' ;
DIGIT: OneToFour | FiveToNine ;
ZERO : '0' ;
OneToFour : [1-4] ;
FiveToNine : [5-9] ;
This still works for the previous inputs but ...
INTEGER 350
PRIO 10
LEVEL 01
LEVEL 05
LEVEL 10
LEVEL 49
results in
[#0,0:6='INTEGER',<2>,1:0]
[#1,8:8='3',<5>,1:8]
[#2,9:9='5',<5>,1:9]
[#3,10:10='0',<6>,1:10]
[#4,12:15='PRIO',<4>,2:0]
[#5,17:17='1',<5>,2:5]
[#6,18:18='0',<6>,2:6]
[#7,20:24='LEVEL',<3>,3:0]
[#8,26:26='0',<6>,3:6]
[#9,27:27='1',<5>,3:7]
[#10,29:33='LEVEL',<3>,4:0]
[#11,35:35='0',<6>,4:6]
[#12,36:36='5',<5>,4:7]
[#13,38:42='LEVEL',<3>,5:0]
[#14,44:44='1',<5>,5:6]
[#15,45:45='0',<6>,5:7]
[#16,47:51='LEVEL',<3>,6:0]
[#17,53:53='4',<5>,6:6]
[#18,54:54='9',<5>,6:7]
[#19,55:54='<EOF>',<-1>,6:8]
line 5:6 no viable alternative at input '1'
line 6:6 no viable alternative at input '4'
(statement (integerT INTEGER (integer 3 5 0)) (priorityT PRIO (twoDigits 1 0)) (levelT LEVEL (levelNumber 0 1)) (levelT LEVEL (levelNumber 0 5)) (levelT LEVEL (levelNumber 1 0)) (levelT LEVEL (levelNumber 4 9)))
What am I missing here ?
Edit 2:
Ok, answering my own question here, of course
DIGIT: OneToFour | FiveToNine ;
kicks in where it shouldn't, even in this combined form,
so about the only way to get around this - I can think of - would be
grammar NOVIANum;
statement : (priorityT | integerT | levelT )* ;
priorityT : T_PRIO twoDigits ;
integerT : T_INTEGER integer ;
levelT : T_LEVEL levelNumber ;
levelNumber : ( ZERO (OneToFour | FiveToNine) | ( OneToFour (ZERO | (OneToFour | FiveToNine)) ) ) ;
integer: ZERO* ( (OneToFour | FiveToNine) ( (OneToFour | FiveToNine) | ZERO )* ) ;
twoDigits : (ZERO | (OneToFour | FiveToNine)) ( ZERO | (OneToFour | FiveToNine) ) ;
WS : [ \t\r\n]+ -> skip ;
T_INTEGER : 'INTEGER' ;
T_LEVEL : 'LEVEL' ;
T_PRIO : 'PRIO' ;
// DIGIT: OneToFour | FiveToNine;
ZERO : '0' ;
OneToFour : [1-4] ;
FiveToNine : [5-9] ;
because when I create a parser rule for it like
oneToNine : OneToFour | FiveToNine ;
it'll give me this
(integerT INTEGER (integer (oneToNine 3) (oneToNine 5) 0))
which is ugly and harder to handle than just
(integerT INTEGER (integer 3 5 0))
As a general matter of design, always try to handle distinguishing elements and their objects (T_PRIO -> TwoDigits) at the same level, either both in the parser or both in the lexer. Presuming the semantic nature of the Integer and TwoDigits rules is important, promote them to the parser and let the lexer produce only digits. That is, don't over-constrain the lexer.
In the parser, you can let the integer rule functionally hide the twoDigits rule except in the evaluation of the priorityStatement rule:
priorityStatement : T_PRIO twoDigits ;
integerStatement : T_INTEGER integer ;
integer: ZERO | ( DIGIT ( DIGIT | ZERO )* ) ;
twoDigits : (DIGIT | ZERO) (DIGIT | ZERO) ; // ZERO allowed so that "PRIO 10" matches
T_PRIO : 'PRIO' ;
T_INTEGER : 'INTEGER' ;
DIGIT : [1-9] ;
ZERO : '0' ;
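The idea of keeping the lexer simple and deciding in the parser can be sketched in a few lines of Python. This is an illustration with made-up helper names (lex, match_integer, match_two_digits, check_statement), not generated ANTLR code. In this sketch twoDigits accepts ZERO in either position, mirroring the original TwoDigits lexer rule (Digit Digit) so that PRIO 10 is accepted; that choice is an assumption of the sketch.

```python
def lex(text):
    """Keywords become T_* tokens; every digit becomes its own DIGIT or ZERO token."""
    tokens = []
    for word in text.split():
        if word in ("PRIO", "INTEGER"):
            tokens.append("T_" + word)
        elif all(c in "0123456789" for c in word):
            tokens.extend("ZERO" if c == "0" else "DIGIT" for c in word)
        else:
            raise ValueError(f"unexpected input {word!r}")
    return tokens

def match_integer(tokens):
    # integer : ZERO | DIGIT (DIGIT | ZERO)* ;
    if tokens == ["ZERO"]:
        return True
    return bool(tokens) and tokens[0] == "DIGIT"

def match_two_digits(tokens):
    # twoDigits : (DIGIT | ZERO) (DIGIT | ZERO) ;   (assumption, see lead-in)
    return len(tokens) == 2

def check_statement(text):
    """The keyword decides which parser-level rule the digit tokens must satisfy."""
    tokens = lex(text)
    if not tokens:
        return False
    keyword, digits = tokens[0], tokens[1:]
    if keyword == "T_PRIO":
        return match_two_digits(digits)
    if keyword == "T_INTEGER":
        return match_integer(digits)
    return False

print(lex("PRIO 10"))                 # ['T_PRIO', 'DIGIT', 'ZERO']
print(check_statement("INTEGER 10"))  # True
print(check_statement("PRIO 10"))     # True
print(check_statement("PRIO 100"))    # False
```

Both statements see the same DIGIT ZERO token sequence for "10"; only the surrounding rule decides whether it is read as an integer or as a two-digit priority, so the lexer never has to choose between overlapping token types.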

Why does AntlrWorks 2 display warning 125 (implicit definition of token in parser) in this case?

I have a separate lexer and parser grammar (derived from the sample ModeTagsLexer/ModeTagsParser) and get a warning in AntlrWorks 2 that I don't understand:
warning(125): implicit definition of token OPEN in parser
If I replace the OPEN rule with '<' the warning goes away. I wonder what the difference is between OPEN and CLOSE, which gets no warning.
I'm using antlr-4.1-complete.jar and 2013-01-22-antlrworks-2.0.
Lexer STLexer.g4:
lexer grammar STLexer;
// Default mode rules (the SEA)
OPEN : '<' -> pushMode(ISLAND) ; // switch to ISLAND mode
TEXT : ~'<'+ ; // clump all text together
mode ISLAND;
CLOSE : '>' -> popMode ; // back to SEA mode
SLASH : '/' ;
ID : [a-zA-Z0-9"=]+ ; // match/send ID in tag to parser
WS : [ \t]+ -> channel(HIDDEN);
Parser STParser.g4:
parser grammar STParser;
options { tokenVocab=STLexer; } // use tokens from STLexer.g4
unit: (tag | TEXT)* ;
tag : OPEN ID+ CLOSE
| OPEN SLASH ID+ CLOSE
;
It even persists if I rename the rule slightly and remove the additional mode:
Lexer (modified):
lexer grammar STLexer;
// Default mode rules (the SEA)
OPPEN : '<' ;// -> pushMode(ISLAND) ; // switch to ISLAND mode
TEXT : ~'<'+ ; // clump all text together
//mode ISLAND;
CLOSE : '>' ; // -> popMode ; // back to SEA mode
SLASH : '/' ;
ID : [a-zA-Z0-9"=]+ ; // match/send ID in tag to parser
WS : [ \t]+ -> channel(HIDDEN);
Parser (modified):
parser grammar STParser;
options { tokenVocab=STLexer; } // use tokens from STLexer.g4
unit: (tag | TEXT)* ;
tag : ID OPPEN ID+ CLOSE
| ID OPPEN SLASH ID+ CLOSE
;

Resources