Overlapping rules - mismatched input - antlr4

My grammar (shown below, trimmed down from the original) requires somewhat overlapping rules:
grammar NOVIANum;
statement : (priorityStatement | integerStatement)* ;
priorityStatement : T_PRIO TwoDigits ;
integerStatement : T_INTEGER Integer ;
WS : [ \t\r\n]+ -> skip ;
T_PRIO : 'PRIO' ;
T_INTEGER : 'INTEGER' ;
Integer: OneToNine Digit* | ZERO ;
TwoDigits : Digit Digit ;
fragment OneToNine : ('1'..'9') ;
fragment Digit: ('0'..'9');
ZERO : [0] ;
so "Integer" and "TwoDigits" overlap to a certain extent.
The following input
INTEGER 10
PRIO 10
results in
line 2:5 mismatched input '10' expecting TwoDigits
when Integer precedes TwoDigits and in
line 1:8 mismatched input '10' expecting Integer
when TwoDigits precedes Integer in the grammar.
Is there a way around this ?
Thanks - Alex
Edit:
Thanks @GRosenberg. Your suggestion worked for this small example, of course, but when I integrated it into my full grammar it led to different mismatched input errors.
The reason is another rule that requires the range '[1-4]', so I thought I'd be clever and turn the grammar into
grammar NOVIANum;
statement : (priorityT | integerT | levelT )* ;
priorityT : T_PRIO twoDigits ;
integerT : T_INTEGER integer ;
levelT : T_LEVEL levelNumber ;
levelNumber : ( ZERO DIGIT ) | ( OneToFour (ZERO | DIGIT) ) ;
integer: ZERO* ( DIGIT ( DIGIT | ZERO )* ) ;
twoDigits : (ZERO | DIGIT) ( ZERO | DIGIT ) ;
oneToFour : OneToFour (DIGIT | ZERO) ;
WS : [ \t\r\n]+ -> skip ;
T_INTEGER : 'INTEGER' ;
T_LEVEL : 'LEVEL' ;
T_PRIO : 'PRIO' ;
DIGIT: OneToFour | FiveToNine ;
ZERO : '0' ;
OneToFour : [1-4] ;
FiveToNine : [5-9] ;
This still works for the previous inputs but ...
INTEGER 350
PRIO 10
LEVEL 01
LEVEL 05
LEVEL 10
LEVEL 49
results in
[@0,0:6='INTEGER',<2>,1:0]
[@1,8:8='3',<5>,1:8]
[@2,9:9='5',<5>,1:9]
[@3,10:10='0',<6>,1:10]
[@4,12:15='PRIO',<4>,2:0]
[@5,17:17='1',<5>,2:5]
[@6,18:18='0',<6>,2:6]
[@7,20:24='LEVEL',<3>,3:0]
[@8,26:26='0',<6>,3:6]
[@9,27:27='1',<5>,3:7]
[@10,29:33='LEVEL',<3>,4:0]
[@11,35:35='0',<6>,4:6]
[@12,36:36='5',<5>,4:7]
[@13,38:42='LEVEL',<3>,5:0]
[@14,44:44='1',<5>,5:6]
[@15,45:45='0',<6>,5:7]
[@16,47:51='LEVEL',<3>,6:0]
[@17,53:53='4',<5>,6:6]
[@18,54:54='9',<5>,6:7]
[@19,55:54='<EOF>',<-1>,6:8]
line 5:6 no viable alternative at input '1'
line 6:6 no viable alternative at input '4'
(statement (integerT INTEGER (integer 3 5 0)) (priorityT PRIO (twoDigits 1 0)) (levelT LEVEL (levelNumber 0 1)) (levelT LEVEL (levelNumber 0 5)) (levelT LEVEL (levelNumber 1 0)) (levelT LEVEL (levelNumber 4 9)))
What am I missing here ?
Edit 2:
Ok, answering my own question here, of course
DIGIT: OneToFour | FiveToNine ;
kicks in where it shouldn't, even in this combined form,
so about the only way to get around this - I can think of - would be
grammar NOVIANum;
statement : (priorityT | integerT | levelT )* ;
priorityT : T_PRIO twoDigits ;
integerT : T_INTEGER integer ;
levelT : T_LEVEL levelNumber ;
levelNumber : ( ZERO (OneToFour | FiveToNine) | ( OneToFour (ZERO | (OneToFour | FiveToNine)) ) ) ;
integer: ZERO* ( (OneToFour | FiveToNine) ( (OneToFour | FiveToNine) | ZERO )* ) ;
twoDigits : (ZERO | (OneToFour | FiveToNine)) ( ZERO | (OneToFour | FiveToNine) ) ;
WS : [ \t\r\n]+ -> skip ;
T_INTEGER : 'INTEGER' ;
T_LEVEL : 'LEVEL' ;
T_PRIO : 'PRIO' ;
// DIGIT: OneToFour | FiveToNine;
ZERO : '0' ;
OneToFour : [1-4] ;
FiveToNine : [5-9] ;
because when I create a parser rule for it like
oneToNine : OneToFour | FiveToNine ;
it'll give me this
(integerT INTEGER (integer (oneToNine 3) (oneToNine 5) 0))
which is ugly and harder to handle than just
(integerT INTEGER (integer 3 5 0))

As a general matter of design, always try to handle distinguishing elements and their objects (T_PRIO -> TwoDigits) at the same level, either both in the parser or both in the lexer. Presuming the semantic difference between the Integer and TwoDigits rules is important, promote them to the parser and let the lexer produce only digits. That is, don't over-constrain the lexer.
In the parser, the integer rule can then functionally hide the twoDigits rule except when evaluating the priorityStatement rule:
priorityStatement : T_PRIO twoDigits ;
integerStatement : T_INTEGER integer ;
integer: ZERO | ( DIGIT ( DIGIT | ZERO )* ) ;
twoDigits : ( DIGIT | ZERO ) ( DIGIT | ZERO ) ;
T_PRIO : 'PRIO' ;
T_INTEGER : 'INTEGER' ;
DIGIT : [1-9] ;
ZERO : '0' ;
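The idea of keeping the lexer simple and deciding in the parser can be sketched outside ANTLR. The following is a minimal Python illustration, not ANTLR's implementation: the hypothetical `lex` function emits one DIGIT or ZERO token per character, and the hypothetical `is_integer` and `is_two_digits` helpers stand in for the parser rules. It assumes that twoDigits should accept a zero in either position, as the original `TwoDigits : Digit Digit` lexer rule did.

```python
def lex(text):
    """Emit one token per character: ZERO for '0', DIGIT for '1'..'9'."""
    tokens = []
    for ch in text:
        if ch == "0":
            tokens.append(("ZERO", ch))
        elif ch in "123456789":
            tokens.append(("DIGIT", ch))
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens


def is_integer(tokens):
    """Stand-in for: integer : ZERO | DIGIT (DIGIT | ZERO)* ;"""
    kinds = [kind for kind, _ in tokens]
    if kinds == ["ZERO"]:
        return True
    # Any non-empty sequence that starts with a non-zero digit.
    return len(kinds) >= 1 and kinds[0] == "DIGIT"


def is_two_digits(tokens):
    """Stand-in for: twoDigits : (DIGIT | ZERO) (DIGIT | ZERO) ;"""
    # lex() only ever produces DIGIT or ZERO, so the length is enough.
    return len(tokens) == 2


tokens = lex("10")
print(tokens)
print(is_integer(tokens), is_two_digits(tokens))
```

The same token sequence for "10" satisfies both helpers, so which one applies is decided by the caller's context (PRIO versus INTEGER) rather than by the lexer.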

Related

ANTLR4 handling continuations for "any data"

The grammar I need to create is based on the following:
1. Command lines start with a slash.
2. Command lines can be continued with a hyphen as the last character (excluding whitespace) on a line.
3. For some commands I want to parse their parameters.
4. For other commands I am not interested in their parameters.
This works almost fine with the following (simplified) Lexer
lexer grammar T1Lexer;
NewLine
: [\r\n]+ -> skip
;
CommandStart
: '/' -> pushMode(CommandMode)
;
DataStart
: . -> more, pushMode(DataMode)
;
mode DataMode;
DataLine
: ~[\r\n]+ -> popMode
;
mode CommandMode;
CmNL
: [\r\n]+ -> skip, popMode
;
CONTINUEMINUS : ( '-' [ ]* ('\r/' | '\n/' | '\r\n/') ) -> channel(HIDDEN);
EOL: ( [ ]* ('\r' | '\n' | '\r\n') ) -> popMode;
SPACE : [ \t\r\n]+ -> channel(HIDDEN) ;
DOT : [.] ;
COMMA : ',' ;
CMD1 : 'CMD1';
CMD2 : 'CMD2';
CMDIGN : 'CMDIGN' -> pushMode(DataMode) ;
VAR1 : 'VAR1=' ;
ID : ID_LITERAL;
fragment ID_LITERAL: [A-Z_$0-9]*?[A-Z_$]+?[A-Z_$0-9]*;
and Parser:
parser grammar T1Parser;
options { tokenVocab=T1Lexer; }
root : line+ EOF ;
line: ( commandLine | dataLine)+ ;
dataLine : DataLine ;
commandLine : CommandStart command ;
command : cmd1 | cmd2 | cmdign ;
cmd1 : CMD1 (VAR1 ID)+ ;
cmd2 : CMD2 (VAR1 ID)+ ;
cmdign : CMDIGN DataLine ;
The problem arises when I need a combination of 2 and 4, i.e. a continuation for a command whose parameters I simply want as an unparsed string (lines 5+6 in the example).
When I push DataMode for CMDIGN on line 5, the continuation character is not recognized because it is swallowed by the "any until EOL" rule. The lexer then pops back to the default mode, and the continuation line is treated as a new command and fails to parse.
Is there a way of handling this combo properly ?
TIA - Alex
(For your example) You don't really need a CommandMode; it actually complicates things a bit.
T1Lexer.g4:
lexer grammar T1Lexer
;
CMD_START: '/';
CONTINUE_EOL_SLASH: '-' EOL_F '/' -> channel(HIDDEN);
EOL: EOL_F;
WS: [ \t]+ -> channel(HIDDEN);
DOT: [.];
COMMA: ',';
CMD1: 'CMD1';
CMD2: 'CMD2';
CMDIGN: 'CMDIGN' -> pushMode(DataMode);
VAR1: 'VAR1=';
ID: ID_LITERAL;
//=======================================
mode DataMode
;
DM_EOL: EOL_F -> type(EOL), popMode;
DATA_LINE: ( ~[\r\n]*? '-' EOL_F)* ~[\r\n]+;
//=======================================
fragment NL: '\r'? '\n';
fragment EOL_F: [ ]* NL;
fragment ID_LITERAL: [A-Z_$0-9]*? [A-Z_$]+? [A-Z_$0-9]*;
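To see what the DATA_LINE rule is meant to capture, here is a rough Python regex counterpart. This is only a sketch: the `data_line` helper is a hypothetical name, and Python's backtracking regex engine is not the same as ANTLR's longest-match lexer, although both should agree on inputs like the ones shown.

```python
import re

# Rough regex counterpart of
#   DATA_LINE: ( ~[\r\n]*? '-' EOL_F )* ~[\r\n]+ ;
# where EOL_F is [ ]* '\r'? '\n'.
DATA_LINE = re.compile(r"(?:[^\r\n]*?-[ ]*\r?\n)*[^\r\n]+")


def data_line(text, pos=0):
    """Return the text DATA_LINE would cover starting at pos, or None."""
    m = DATA_LINE.match(text, pos)
    return m.group() if m else None


# Text that follows 'CMDIGN' on a line without a continuation.
single = " VAR1=BLAH VAR2=BLAH\n/ CMD2 VAR1=VAL12\n"
# Text that follows 'CMDIGN' on a line ending in a hyphen.
continued = " VAR-1=0 - \n/ VAR2=notignored\n/ CMD1 VAR1=X\n"

print(repr(data_line(single)))
print(repr(data_line(continued)))
```

A hyphen in the middle of a line (as in VAR-1) does not end the first segment, because the optional prefix only matches when the hyphen is followed by spaces and a line break.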
T1Parser.g4:
parser grammar T1Parser
;
options {
tokenVocab = T1Lexer;
}
root: line (EOL line)* EOL? EOF;
line: commandLine | dataLine | emptyLine;
dataLine: DATA_LINE;
commandLine: CMD_START command;
emptyLine: CMD_START;
command: cmd1 | cmd2 | cmdign;
cmd1: CMD1 (VAR1 ID)+;
cmd2: CMD2 (VAR1 ID)+;
cmdign: CMDIGN DATA_LINE?;
Test Input:
/ CMD1 VAR1=VAL1 VAR1=VAL2
/ CMDIGN VAR1=BLAH VAR2=BLAH
/ CMD2 VAR1=VAL12 -
/ VAR1=VAL22
/ CMDIGN
/
/ CMDIGN VAR-1=0 -
/ VAR2=notignored
Token Stream:
[@0,0:0='/',<'/'>,1:0]
[@1,1:1=' ',<WS>,channel=1,1:1]
[@2,2:5='CMD1',<'CMD1'>,1:2]
[@3,6:6=' ',<WS>,channel=1,1:6]
[@4,7:11='VAR1=',<'VAR1='>,1:7]
[@5,12:15='VAL1',<ID>,1:12]
[@6,16:16=' ',<WS>,channel=1,1:16]
[@7,17:21='VAR1=',<'VAR1='>,1:17]
[@8,22:25='VAL2',<ID>,1:22]
[@9,26:26='\n',<EOL>,1:26]
[@10,27:27='/',<'/'>,2:0]
[@11,28:28=' ',<WS>,channel=1,2:1]
[@12,29:34='CMDIGN',<'CMDIGN'>,2:2]
[@13,35:54=' VAR1=BLAH VAR2=BLAH',<DATA_LINE>,2:8]
[@14,55:55='\n',<EOL>,2:28]
[@15,56:56='/',<'/'>,3:0]
[@16,57:57=' ',<WS>,channel=1,3:1]
[@17,58:61='CMD2',<'CMD2'>,3:2]
[@18,62:62=' ',<WS>,channel=1,3:6]
[@19,63:67='VAR1=',<'VAR1='>,3:7]
[@20,68:72='VAL12',<ID>,3:12]
[@21,73:73=' ',<WS>,channel=1,3:17]
[@22,74:76='-\n/',<CONTINUE_EOL_SLASH>,channel=1,3:18]
[@23,77:82=' ',<WS>,channel=1,4:1]
[@24,83:87='VAR1=',<'VAR1='>,4:7]
[@25,88:92='VAL22',<ID>,4:12]
[@26,93:93='\n',<EOL>,4:17]
[@27,94:94='/',<'/'>,5:0]
[@28,95:95=' ',<WS>,channel=1,5:1]
[@29,96:101='CMDIGN',<'CMDIGN'>,5:2]
[@30,102:102='\n',<EOL>,5:8]
[@31,103:103='/',<'/'>,6:0]
[@32,104:104='\n',<EOL>,6:1]
[@33,105:105='/',<'/'>,7:0]
[@34,106:106=' ',<WS>,channel=1,7:1]
[@35,107:112='CMDIGN',<'CMDIGN'>,7:2]
[@36,113:150=' VAR-1=0 - \n/
tree output:
(root
(line
(commandLine
/
(command
(cmd1 CMD1 VAR1= VAL1 VAR1= VAL2)
)
)
)
\n
(line
(commandLine
/
(command
(cmdign CMDIGN VAR1=BLAH VAR2=BLAH)
)
)
)
\n
(line
(commandLine
/
(command
(cmd2 CMD2 VAR1= VAL12 VAR1= VAL22)
)
)
)
\n
(line
(commandLine
/
(command
(cmdign CMDIGN)
)
)
)
\n
(line
(emptyLine /)
)
\n
(line
(commandLine
/
(command
(cmdign CMDIGN VAR-1=0 - \n/ VAR2=notignored)
)
)
)
<EOF>
)

Hybris faceted search

I have a requirement to implement a faceted search on Hybris, where a user is taken through some questions and is then suggested a list of products. Is there an approach that would help me get started?
Solr supports faceted search natively, and Hybris exposes it through its Solr configuration, which you can manage via an ImpEx file. SolrIndexedProperty has many fields, but I think the ones that control faceting are facet (set to true), facetType, and facetSort.
INSERT_UPDATE SolrIndexedProperty ; ... facet[default = true] ; facetType(code) ; facetSort(code) ; ...
; ... ; MultiSelectOr ; Alpha ; ...
Here's the full impex statement, in case I missed something.
INSERT_UPDATE SolrIndexedProperty ; solrIndexedType(identifier)[unique = true] ; name[unique = true] ; type(code) ; isAlpha[default = false] ; isNumeric[default = false] ; isAlphaNumeric[default = false] ; sortableType(code) ; currency[default = false] ; localized[default = false] ; multiValue[default = false] ; facet[default = true] ; facetType(code) ; facetSort(code) ; priority ; visible ; useForSpellchecking[default = false] ; useForAutocomplete[default = false] ; fieldValueProvider ; valueProviderParameter ; facetDisplayNameProvider ; customFacetSortProvider ; topValuesProvider ; rangeSets(name) ; displayName ; includeInResponse [default=true]
; yourProductType ; colorFacet ; string ; true ; ; ; ; ; ; ; ; MultiSelectOr ; Alpha ; 7500 ; true ; ; ; colorValueProvider ; ; ; facetNameSortProviderAscending ; defaultTopValuesProvider ; ; "Color" ;false

What is the grammar to have parentheses take place of linebreaks?

For example, I'm attempting to write a grammar to parse DNS zone files. The resource records are normally separated by newlines. However, a record can be broken across multiple lines by using parentheses. For example:
record1 part1 part2 part3 part4
or
record1 part1 ( part 2
part3
part4
)
I can't come up with how to allow for the parentheses to exist at any place within a record.
How about this (not thoroughly tested).
grammar:
grammar dns;
file : (record|NL)+ EOF ;
record : recordName recordPart+ (NL|EOF)
;
recordName : Something;
recordPart
: '(' recordPartOrNewLine+ ')'
| Something
;
recordPartOrNewLine
: NL
| recordPart
;
Something : [a-zA-Z0-9:.]+; // adjust!
WS: [ \t]+ -> skip;
NL : ('\r'? '\n')|'\r';
Comment : ';' ~[\r\n]* -> skip;
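The underlying idea, that a newline only ends a record when no parenthesis is open, can also be shown without ANTLR. Below is a minimal Python sketch under that assumption. The `parse_records` function and the token names are hypothetical, and unlike the grammar it simply flattens the parts of each record into one list.

```python
import re

# One alternative per token kind, mirroring the lexer rules above.
TOKEN = re.compile(
    r"(?P<comment>;[^\r\n]*)"
    r"|(?P<nl>\r?\n|\r)"
    r"|(?P<ws>[ \t]+)"
    r"|(?P<lparen>\()"
    r"|(?P<rparen>\))"
    r"|(?P<something>[a-zA-Z0-9:.]+)"
)


def parse_records(text):
    """Split text into records; newlines inside parentheses are ignored."""
    records, current, depth, pos = [], [], 0, 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "something":
            current.append(m.group())
        elif kind == "lparen":
            depth += 1
        elif kind == "rparen":
            if depth == 0:
                raise ValueError("unbalanced ')'")
            depth -= 1
        elif kind == "nl" and depth == 0 and current:
            # A newline outside parentheses terminates the record.
            records.append(current)
            current = []
        # Comments, whitespace and newlines inside parentheses are skipped.
    if depth != 0:
        raise ValueError("unbalanced '('")
    if current:
        records.append(current)
    return records


zone = (
    "example.com. 1800 IN SOA ns1.example.com. mailbox.example.com. (\n"
    "  100 ; Seriennummer\n"
    "  300 ; Refresh Time\n"
    ")\n"
    "example.com. 1800 IN NS ns1.example.com.\n"
)
for record in parse_records(zone):
    print(record)
```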
test case (from wikipedia):
example.com. 1800 IN SOA ns1.example.com. mailbox.example.com. (
100 ; Seriennummer
300 ; Refresh Time
100 ; Retry Time
6000 ; Expire Time
600 ; negative Caching Zeit
)
example.com. 1800 IN NS ns1.example.com.
ns1.example.com. 1800 IN A 172.27.182.17
ns1.example.com. 1800 IN AAAA 2001:db8::f:a
www.example.com. 1800 IN A 192.168.1.2
www.example.com. 1800 IN AAAA 2001:db8::1:2

antlr4 mismatch input error on sql parser

I am getting the following error when parsing, but I'm not sure why it happens.
line 1:24 mismatched input '1' expecting NUM
line 1:24 mismatched input '1' expecting NUM
select a from abc limit 1 ;
--
grammar SQLCmd;
parse : sql
;
sql : ('select' ((columns (',' columns))|count) 'from')
tables
('where' condition ((and|or) condition))* (limit)? ';'
;
limit : 'limit' NUM
;
num : NUM
;
count : 'count(*)'
;
columns : VAL
;
tables : VAL
;
condition : ( left '=' right )+
;
and : 'and'
;
or : 'or'
;
left : VAL
;
right : VAL
;
VAL : [*a-z0-9A-Z~?]+
;
NUM : [0-9]+
;
WS : [ \t\n\r]+ -> skip
;
The lexer is producing a VAL token where the parser expects a NUM.
The input "1" matches both VAL and NUM, but since VAL is defined first, the lexer never emits a NUM token: everything NUM can match, VAL matches too, and on a tie the earlier rule wins.
Try putting the NUM rule before the VAL rule.
You could have found this out yourself by looking at the token types the lexer produces, which show the actual type of each token.
@TheAntlrGuy: Maybe the error message could include the actual token type?
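The effect of rule order can be reproduced with a small simulation. The sketch below is not ANTLR code; the `tokenize` function is a hypothetical helper that mimics the two rules an ANTLR lexer applies: the longest match wins, and when several rules match the same length, the rule listed first wins.

```python
import re


def tokenize(text, rules, skip=("WS",)):
    """rules is a list of (name, regex) pairs in grammar order."""
    compiled = [(name, re.compile(rx)) for name, rx in rules]
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in compiled:
            m = rx.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_len == 0:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name not in skip:
            out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out


VAL = ("VAL", r"[*a-z0-9A-Z~?]+")
NUM = ("NUM", r"[0-9]+")
WS = ("WS", r"[ \t\n\r]+")

print(tokenize("abc 1", [VAL, NUM, WS]))  # VAL listed first
print(tokenize("abc 1", [NUM, VAL, WS]))  # NUM listed first
```

With VAL first, the "1" is typed as VAL; with NUM first, it is typed as NUM, while "abc" stays a VAL in both cases.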

shift/reduce conflict with if ... else statement

I'm trying to write a parser for a Java-like language, but once I add the else statement a shift/reduce conflict appears.
I ran bison bison_file.y --report=state, and the report for the conflict is:
State 62
31 statement: if_statement .
65 if_else_statement: if_statement . ELSE statement
ELSE shift, and go to state 84
ELSE [reduce using rule 31 (statement)]
$default reduce using rule 31 (statement)
I cannot think of a way to avoid the conflict. Any good ideas?
Here is the full code:
%{
#include <stdio.h>
#include <math.h>
void yyerror(char *);
extern int yylval;
extern FILE *yyin;
extern FILE *yyout;
extern int yylineno;
extern int yyparse(void);
extern int yylex(void);
extern int yywrap() { return 1; }
extern char* yytext;
int errors;
%}
%debug
%start m_class
%token IF ELSE INT CHAR CLASS NEW GURISE VOID WHILE
%token PUBLIC PROTECTED PRIVATE STATIC FINAL ABSTRACT
%token PLUS MINUS MUL DIV MODULO
%token EQ NEQ GRT LT GREQ LEQ
%token OR AND NOT
%token AR_PAR DEK_PAR AR_AGK DEK_AGK AR_STRO DEK_STRO
%token SEMICOLON ANATHESI COMA
%token MY_INT SINT MY_CHAR ID
%right ANATHESI
%left OR AND
%nonassoc EQ NEQ GRT LT GREQ LEQ
%left PLUS MINUS MUL DIV MODULO
%right NOT
%%
m_class: m_class class_declaration
| class_declaration
;
class_declaration: CLASS ID class_body
;
class_body: AR_STRO variable_declaration constructor method_declaration DEK_STRO
;
variable_declaration:variable variable_declaration
|variable
|array_declaration
|array_declaration variable_declaration
;
variable: var_type ID SEMICOLON
;
var_type: INT
|CHAR
;
array_declaration: ID ANATHESI NEW var_type AR_AGK MY_INT DEK_AGK SEMICOLON
;
constructor: modifier ID AR_STRO variable_declaration DEK_STRO
;
modifier: PUBLIC
| PROTECTED
| PRIVATE
| STATIC
| FINAL
| ABSTRACT
;
method_declaration: modifier meth_type ID parameters meth_body
;
meth_type: VOID
| var_type
;
parameters: AR_PAR par_body DEK_PAR
;
par_body: var_type ID
| par_body COMA var_type ID
;
meth_body: AR_STRO bodybuilder DEK_STRO
;
bodybuilder: statement GURISE expression SEMICOLON
|statement bodybuilder
|statement
;
statement: anathesh
| if_statement
| if_else_statement
| while_statement
;
statementsss: statement
|
;
anathesh:atath SEMICOLON
| atath numeric_expression SEMICOLON
;
atath: ID ANATHESI orisma
|ID AR_AGK MY_INT DEK_AGK ANATHESI orisma
;
orisma: ID
|MY_INT
|SINT
|MY_CHAR
;
expression: testing_expression
| numeric_expression
| logical_expression
| ID
| MY_INT
| SINT
| MY_CHAR
;
numeric_expression: expression PLUS expression
| expression MINUS expression
| expression MUL expression
| expression DIV expression
| expression MODULO expression
;
testing_expression: expression EQ expression
| expression NEQ expression
| expression GRT expression
| expression LT expression
| expression GREQ expression
| expression LEQ expression
;
logical_expression: expression OR expression
| expression AND expression
| expression NOT expression
;
if_statement: IF abc
|
;
if_else_statement: if_statement ELSE statement
;
abc: sin8iki statement
;
sin8iki: AR_PAR testing_expression DEK_PAR
| AR_PAR logical_expression DEK_PAR
;
while_statement: WHILE sin8iki statement
;
%%
void yyerror(char *s) {
errors++;
printf("\n------- ERROR AT LINE #%d.\n\n", yylineno);
fprintf(stderr, "%d: error: '%s' at '%s', yylval=%u\n", yylineno, s, yytext, yylval);
}
int main (int argc, char **argv) {
++argv;
--argc;
errors=0;
if (argc > 0)
yyin = fopen (argv[0], "r");
else
yyin = stdin;
yyout = fopen ("output","w");
yyparse ();
if(errors==0)
printf("komple");
return 0;
}
Here the parser pushes tokens onto its stack. Once IF abc has been reduced to if_statement and the next token is ELSE, there is a conflict: the parser can either reduce if_statement to statement (rule 31) or shift ELSE onto the stack.
You have to set the precedences so that the ELSE token outranks the rule being reduced, using %nonassoc and %prec. Bison compares the lookahead token with the rule it would reduce, which in state 62 is rule 31, so the %prec belongs on that alternative of statement. Try this:
| if_statement %prec else_priority
and in the precedence declarations:
%nonassoc else_priority
%nonassoc ELSE
The declarations must be in this order (higher precedence is declared below lower precedence).
Hope this solves your problem.
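Choosing the shift means that an ELSE always attaches to the nearest unmatched IF. The Python sketch below illustrates that outcome with a tiny recursive-descent parser. It is not Bison and does not use the grammar above: the `parse_statement` function, the flat token list, and the simplified rule "stmt : 'if' COND stmt ('else' stmt)? | NAME" are all assumptions made for the illustration.

```python
def parse_statement(tokens, pos=0):
    """Parse one statement and return (tree, next_position)."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    if tokens[pos] == "if":
        if pos + 1 >= len(tokens):
            raise ValueError("missing condition after 'if'")
        cond = tokens[pos + 1]
        then_branch, pos = parse_statement(tokens, pos + 2)
        else_branch = None
        # The "shift" choice: an 'else' seen here is consumed right away,
        # so it belongs to this, innermost, 'if'.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_statement(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    # Anything else is treated as a simple statement name.
    return tokens[pos], pos + 1


tokens = "if c1 if c2 s1 else s2".split()
tree, end = parse_statement(tokens)
print(tree)
```

The else branch s2 ends up inside the inner if (the one testing c2), and the outer if has no else branch.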

Resources