grammar Hello;
prog: stat+ EOF;
stat: expr NEWLINE # printExpr
| ID '=' expr NEWLINE # assign
| NEWLINE # blank
| STRING NEWLINE # string
;
expr: expr (MUL|DIV) expr # opExpr
| expr (ADD|SUB) expr # opExpr
| expr AND expr # andExpr
| INT # int
| ID # id
| '(' expr ')' # parens
;
MUL: '*';
DIV: '/';
ADD: '+';
SUB: '-';
ID: [a-zA-Z]+[0-9a-zA-Z]*;
NEWLINE : [\r\n] ;
INT : [0-9]+ ;
AND : '&';
WS : [ \t\r\n]+ -> skip;
CM : '//' ~[\r\n]* -> skip;`
Can someone explain to me what is wrong with my code? This is my error :
Your help will be appreciated !
The problem is in these lexer rules:
NEWLINE : [\r\n] ;
WS : [ \t\r\n]+ -> skip;
When the lexer finds a \r\n in the input string, it will try to match the rules, and both will match. However, WS will match the entire \r\n, producing one WS token, and NEWLINE will match \r then \n, producing two NEWLINE tokens.
In this scenario, Antlr always chooses the longest match, in your case it will produce WS. If you look at your lexer output for a = 3\r\nx = 4\r\n, the generated tokens will be:
ID WS '=' WS INT WS ID WS '=' WS INT WS
a = 3 \r\n x = 4 \r\n
But what you're looking for is:
ID WS '=' WS INT NEWLINE ID WS '=' WS INT NEWLINE
a = 3 \r\n x = 4 \r\n
Your grammar seems to be written entirely expecting all the line breaks to generate NEWLINE tokens, so I suggest changing the WS rule to:
WS: [ \t]+ -> skip;
Related
I have this grammar:
grammar BajaPower;
// Gramaticas
programa:PROGRAM ID ';' vars* bloque ;
vars:VAR ((ID|ID',')+ ':' tipo ';')+;
tipo:(INT|FLOAT);
bloque:'{' estatuto+ '}';
estatuto: (asignacion|condicion|escritura);
asignacion: ID '=' expresion ';';
condicion: 'if' '(' expresion ')' bloque (';'|'else' bloque ';');
escritura: 'print' '(' (expresion|STRING ',')* (expresion|STRING) ')' ';';
expresion: exp ('>'|'<'|'<>') exp;
exp: (termino ('+'|'-')*|termino);
termino: (factor ('*'|'/')*|factor);
factor: ('(' expresion ')')|('+'|'-') varcte| varcte;
varcte: (ID|CteI|CteF);
// Tokens
WS: [\t\r\n]+ -> skip;
PROGRAM:'program';
ID:([a-zA-Z]['_'(a-zA-Z0-9)+]*);
VAR:'var';
INT:'int';
FLOAT:'float';
CteI: ([1-9][0-9]*|'0');
CteF: [+-]?([0-9]*[.])?[0-9]+;
STRING:'"' [a-zA-Z0-9]+ '"';
And I'm trying to test it with the following code:
program TestCorrect;
var
x,y:int;
z:float;
{
x = 1;
y = 2;
z = (x+y*3)/4;
if (z > x) {
print("hola mundo",(x+y));
}
}
When I run it it only detects program as an ID and not the PROGRAM token.
There are quite a few things going wrong. In future, I suggest you incrementally create your grammar instead of (trying) to write the entire thing in one go and then coming to the conclusion it doesn't do what you meant it to.
Let's start with the lexer:
WS: [\t\r\n]+ -> skip does not include spaces
ID: ['_'(a-zA-Z0-9)+]* should be ('_'[a-zA-Z0-9]+)*
ID: the first part, [a-zA-Z], should probably be [a-zA-Z]+
VAR, INT, FLOAT are placed after ID, so when ID is properly defined, it will match var, int and float before these tokens
CteF: don't include [+-]?, leave that for the parser to deal with
STRING: [a-zA-Z0-9]+ doe not include spaces, so "hola mundo" will not be matched
Now the parser:
vars: (ID|ID',')+ is wrong because it now always has to end with a comma if you want to match multiple ID's. Do ID (',' ID)* instead
condicion: (';'|'else' bloque ';') mandates a semi-colon should always be present after an if or else block, but in your input, you do not have a semi-colon. Do ('else' bloque)? instead
expresion: exp ('>'|'<'|'<>') exp means an expresion always contains one of the operators >, < or <>, which is not correct (an expression can also just be 1*2). Do exp (('>'|'<'|'<>') exp)? instead
exp: termino ('+'|'-')* is odd: that will match 1++++++++++++. Do termino (('+'|'-') termino)* instead
termino: factor ('*'|'/')* should be factor (('*'|'/') factor)* (same as exp)
varcte: should probably include STRING so that you do not have to do this on multiple places: (expresion|STRING) but can then just do expresion
All in all, this should do the trick:
grammar BajaPower;
programa
: PROGRAM ID ';' vars* bloque
;
vars
: VAR (ID (',' ID)* ':' tipo ';')+
;
tipo
: INT
| FLOAT
;
bloque
:'{' estatuto+ '}'
;
estatuto
: asignacion
| condicion
| escritura
;
asignacion
: ID '=' expresion ';'
;
condicion
: 'if' '(' expresion ')' bloque ('else' bloque)?
;
escritura
: 'print' '(' (expresion ',')* expresion ')' ';'
;
expresion
: exp (('>'|'<'|'<>') exp)?
;
exp
: termino (('+'|'-') termino)*
;
termino
: factor (('*'|'/') factor)*
;
factor
: '(' expresion ')'
| ('+'|'-')? varcte
| STRING
;
varcte
: ID
| CteI
| CteF
;
WS : [ \t\r\n]+ -> skip;
PROGRAM : 'program';
VAR : 'var';
INT : 'int';
FLOAT : 'float';
CteI : [1-9][0-9]* | '0';
CteF : [0-9]* '.' [0-9]+;
ID : [a-zA-Z]+ ('_' [a-zA-Z0-9]+)*;
STRING : '"' .*? '"';
I'm a bit clueless as to how I can parse (more or less) "free form" parameter lists, suppose the syntax allows for
PARM=(VAL1, 'VAL2', VAL3, KEY4=VAL4, KEY5=VAL5(XYZ), PARM=ABC, SOMETHING=ELSE)
I have managed to basically parse combos of positional and key/value parameters, but as soon as I hit a lexer token like PARM= the parser bails out with a "mismatched input", and I can't specifically allow for or expect anything because these parameters passed to a function are completely arbitrary.
So I'd think I'll need to switch to a specific lexer mode but right now I can't see how I would properly switch back to "normal" mode, the delimiters are PARM=( on the left and the closing ) on the right, but as the "data" itself can contain (pairs of) brackets how would I identify the correct closing paren so I don't prematurely end the lexer mode?
TIA - Alex
Edit 1:
Minimal grammar showing the issue with keywords being used where they shouldn't, as this is part of a complex grammar I can't change the order of tokens to put ID in front of everything else, for example, as it would catch too much. So I don't see how this can work short of breaking out into a different lexer mode.
lexer grammar ParmLexer;
SPACE : [ \t\r\n]+ -> channel(HIDDEN) ;
COMMA : ',' ;
EQUALS : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
PARM : 'PARM=' ;
ID : ID_LITERAL ;
fragment ID_LITERAL : [A-Za-z]+ ;
.
parser grammar ParmParser;
options { tokenVocab=ParmLexer; }
parms : PARM LPAREN parm+ RPAREN ;
parm : (pkey=ID EQUALS)? pval=ID COMMA? ;
Input:
PARM=( TEST, KEY=VAL, PARM=X)
Results in
line 1:22 extraneous input 'PARM=' expecting {')', ID}
So I'd think I'll need to switch to a specific lexer mode but right now I can't see how I would properly switch back to "normal" mode
Instead of switching to modes (with -> mode(...)), you can push your "special" mode on a stack (with -> pushMode(...)) and then when encountering a ) you pop a mode from the stack. That way, you can have multiple nested lists (..(..(..).)..). A quick demo:
lexer grammar ParmLexer;
SPACE : [ \t\r\n]+ -> channel(HIDDEN);
EQUALS : '=' ;
LPAREN : '(' -> pushMode(InList);
PARM : 'PARM';
ID : [A-Za-z] [A-Za-z0-9]*;
mode InList;
LST_LPAREN : '(' -> type(LPAREN), pushMode(InList);
RPAREN : ')' -> popMode;
COMMA : ',';
LST_EQUALS : '=' -> type(EQUALS);
STRING : '\'' ~['\r\n]* '\'';
LST_ID : [A-Za-z] [A-Za-z0-9]* -> type(ID);
LST_SPACE : [ \t\r\n]+ -> channel(HIDDEN);
and:
parser grammar ParmParser;
options { tokenVocab=ParmLexer; }
parse
: PARM EQUALS list EOF
;
list
: LPAREN ( value ( COMMA value )* )? RPAREN
;
value
: ID
| STRING
| key_value
| ID list
;
key_value
: ID EQUALS value
;
which will parse your example input PARM=(VAL1, 'VAL2', VAL3, KEY4=VAL4, KEY5=VAL5(XYZ), PARM=ABC, SOMETHING=ELSE) like this:
You don't have a rule (alternative) that recognizes a PARM token in your parm rule.
Bart has provided an answer using Lexer modes (and assuming that LPAREN and RPAREN always control those modes), but you can also just set up a parser rule that matches all of your keywords:
lexer grammar ParmLexer
;
SPACE: [ \t\r\n]+ -> channel(HIDDEN);
COMMA: ',';
EQUALS: '=';
LPAREN: '(';
RPAREN: ')';
PARM: 'PARM';
KW1: 'KW1';
KW2: 'KW2';
ID: ID_LITERAL;
fragment ID_LITERAL: [A-Za-z]+;
parser grammar ParmParser
;
options {
tokenVocab = ParmLexer;
}
parms: PARM EQUALS LPAREN parm (COMMA parm)* RPAREN;
parm: ((pkey = ID | kwid = kw) EQUALS)? pval = ID;
kw: PARM | KW1 | KW2;
input
"PARM=( TEST, KEY=VAL, KW2=v2, PARM=X)"
yields:
(parms PARM = ( (parm TEST) , (parm KEY = VAL) , (parm (kw KW2) = v) , (parm (kw PARM) = X) ))
I would like to solve the following ambiguity:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input* EOF;
input
: '%' statement
| inputText
;
inputText
: ~('%')+
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;
Sample input:
%a=5;
aa bbbb
As soon as I put a space after "aa" with values like "bbbb" an ambiguity is created.
In fact I want inputText to contain the full string "aa bbbb".
There is no ambiguity. The input aa bbbb will always be tokenised as 2 Identifier tokens. No matter what any parser rule is trying to match. The lexer operates independently from the parser.
Also, the rule:
inputText
: ~('%')+
;
does not match one or more characters other than '%'.
Inside parser rules, the ~ negates tokens, not characters. So ~'%' inside a parser rule will match any token, other than a '%' token. Inside the lexer, ~'%' matches any character other than '%'.
But creating a lexer rule like this:
InputText
: ~('%')+
;
will cause your example input to be tokenised as a single '%' token, followed by a large 2nd token that'd match this: a=5;\naa bbbb. This is how ANTLR's lexer works: match as much characters as possible (no matter what the parser rule is trying to match).
I found the solution:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input EOF;
input
: inputText ('%' statement inputText)*
;
inputText
: ~('%')*
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;
I have the following simple grammar:
grammar TestG;
p : pDecl+ ;
pDecl : endianDecl
| dTDecl
;
endianType : E_BIG
| E_LITTLE
;
endianDecl : 'endian' '=' endianType ';' ;
dTDecl : 'dT' '[' STRING ']' '=' ID ';' ;
STRING: '"'.*?'"' ; //Embedded quotes?
COMMENT: '#' .*? [\n\r] -> skip ; // Discard comments for now
ID : [a-zA-Z][a-zA-Z0-9_]* ;
WS : [ \t\n\r]+ -> skip ;
INT : ('0x')?[0-9]+ ; // How to handle 0xDD and ensure non zero?
E_BIG : 'big' ;
E_LITTLE : 'little' ;
When I run grun TestG p and input the following:
endian = little;
I get the following:
line 1:9 mismatched input 'little' expecting {'big', 'little'}
What have I done wrong?
Because your lexer rule for ID precedes that for E_LITTLE, your 'little' input is being lexed as an ID.
[#0,0:5='endian',<'endian'>,1:0]
[#1,7:7='=',<'='>,1:7]
[#2,9:14='little',<ID>,1:9] <== see here it's being lexed as an ID
[#3,15:15=';',<';'>,1:15]
[#4,18:17='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'little' expecting {'big', 'little'}
Moving the these lexer tokens above ID like so:
STRING: '"'.*?'"' ; //Embedded quotes?
COMMENT: '#' .*? [\n\r] -> skip ; // Discard comments for now
E_BIG : 'big' ;
E_LITTLE : 'little' ;
ID : [a-zA-Z][a-zA-Z0-9_]* ;
WS : [ \t\n\r]+ -> skip ;
INT : ('0x')?[0-9]+ ; // How to handle 0xDD and ensure non zero?
yields the correct output from your test input.
[#0,0:5='endian',<'endian'>,1:0]
[#1,7:7='=',<'='>,1:7]
[#2,9:14='little',<'little'>,1:9] <== see here being lexed correctly
[#3,15:15=';',<';'>,1:15]
[#4,18:17='<EOF>',<EOF>,2:0]
Remember, for lexer tokens, the longest match wins, but in the case of a tie, the one that appears FIRST wins. This is why you want your more specific lexer tokens at the top of the lexer token list, and the more general ones (like identifiers, strings, etc.) farther down.
Why won't the below grammar recognize boolean values?
I've compared this to the grammars for both Java and GraphQL, and cannot see why it doesn't work.
Given the below grammar, parses is as follows:
foo = null // foo = value:nullValue
foo = 123 // foo = value:numberValue
foo = "Hello" // foo = value:stringValue
foo = true // line 1:6 mismatched input 'true' expecting {'null', STRING, BOOLEAN, NUMBER}
What is wrong?
grammar issue;
elementValuePair
: Identifier '=' value
;
Identifier : [_A-Za-z] [_0-9A-Za-z]* ;
value
: STRING # stringValue | NUMBER # numberValue | BOOLEAN # booleanValue | 'null' #nullValue
;
STRING
: '"' ( ESC | ~ ["\\] )* '"'
;
BOOLEAN
: 'true' | 'false'
;
NUMBER
: '-'? INT '.' [0-9]+| '-'? INT | '-'? INT
;
fragment INT
: '0' | [1-9] [0-9]*
;
fragment ESC
: '\\' ( ["\\/bfnrt] )
;
fragment HEX
: [0-9a-fA-F]
;
WS
: [ \t\n\r]+ -> skip
;
It doesn't work because the token 'true' matches lexer rule Identifier:
[#0,0:2='foo',<Identifier>,1:0]
[#1,4:4='=',<'='>,1:4]
[#2,6:9='true',<Identifier>,1:6] <== lexed as an Identifier!
[#3,14:13='<EOF>',<EOF>,3:0]
line 1:6 mismatched input 'true' expecting {'null', STRING, BOOLEAN, NUMBER}
Move your definition of Identifier down farther to be among the lexer rules and it works:
NUMBER
: '-'? INT '.' [0-9]+| '-'? INT | '-'? INT
;
Identifier : [_A-Za-z] [_0-9A-Za-z]* ;
Remember stuff at the top trumps stuff at the bottom. In a combined grammar like yours, don't intersperse the lexer rules (which start with a capital letter) with parser rules (which start with a lowercase letter).