antlr4 literal string handling

antlr4 literal string handling - antlr4

I have the following antlr4 grammar:
grammar squirrel;
program: globalstatement+;
globalstatement: globalvardef | classdef | functiondef;
globalvardef: IDENT '=' constantexpr ';';
classdef: CLASS IDENT '{' classstatement+ '}';
functiondef: FUNCTION IDENT '(' parameterlist ')' functionbody;
constructordef: CONSTRUCTOR '(' parameterlist ')' functionbody;
parameterlist: IDENT (',' IDENT)* | ;
functionbody: '{' statement* '}';
classstatement: globalvardef | functiondef | constructordef;
statement: expression ';';
expression:
IDENT # ident |
IDENT '=' expression # assignment |
IDENT ('.' IDENT)+ # lookupchain |
constantexpr # constant |
IDENT '(' expressionlist ')' # functioncall |
expression '+' expression # addition;
constantexpr: INTEGER | STRING;
expressionlist: expression (',' expression)* | ;
CONSTRUCTOR: 'constructor';
CLASS: 'class';
FUNCTION: 'function';
COMMENT: '//'.*[\n];
STRING: '"' CHAR* '"';
CHAR: [ a-zA-Z0-9];
INTEGER: [0-9]+;
IDENT: [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;
Now if I parse this file:
z = "global variable";
class Base
{
z = 10;
}
everything is fine:
#0,0:0='z',<16>,1:0
#1,2:2='=',<1>,1:2
#2,4:20='"global variable"',<14>,1:4
#3,21:21=';',<2>,1:21
#4,26:30='class',<11>,3:0
#5,32:35='Base',<16>,3:6
#6,38:38='{',<3>,4:0
#7,42:42='z',<16>,5:1
#8,44:44='=',<1>,5:3
#9,46:47='10',<15>,5:5
#10,48:48=';',<2>,5:7
#11,51:51='}',<4>,6:0
#12,56:55='<EOF>',<-1>,8:0
But with this file:
z = "global variable";
class Base
{
z = "10";
}
I get this:
#0,0:0='z',<16>,1:0
#1,2:2='=',<1>,1:2
#2,4:49='"global variable";\r\n\r\nclass Base\r\n{\r\n\tz = "10"',<14>,1:4
#3,50:50=';',<2>,5:9
#4,53:53='}',<4>,6:0
#5,58:57='<EOF>',<-1>,8:0
So it seems like everything between the first " and last " in a file gets matched to one string literal.
How do I prevent this ?

Note the string is matching from the first quote to the last possible quote.
By default, a Kleene operator (*) in ANTLR is greedy. So, change
STRING: '"' CHAR* '"';
to
STRING: '"' CHAR*? '"';
to make it non-greedy.

Related

Why isn't the program token recognized? ANTLR4

I have this grammar:
grammar BajaPower;
// Gramaticas
programa:PROGRAM ID ';' vars* bloque ;
vars:VAR ((ID|ID',')+ ':' tipo ';')+;
tipo:(INT|FLOAT);
bloque:'{' estatuto+ '}';
estatuto: (asignacion|condicion|escritura);
asignacion: ID '=' expresion ';';
condicion: 'if' '(' expresion ')' bloque (';'|'else' bloque ';');
escritura: 'print' '(' (expresion|STRING ',')* (expresion|STRING) ')' ';';
expresion: exp ('>'|'<'|'<>') exp;
exp: (termino ('+'|'-')*|termino);
termino: (factor ('*'|'/')*|factor);
factor: ('(' expresion ')')|('+'|'-') varcte| varcte;
varcte: (ID|CteI|CteF);
// Tokens
WS: [\t\r\n]+ -> skip;
PROGRAM:'program';
ID:([a-zA-Z]['_'(a-zA-Z0-9)+]*);
VAR:'var';
INT:'int';
FLOAT:'float';
CteI: ([1-9][0-9]*|'0');
CteF: [+-]?([0-9]*[.])?[0-9]+;
STRING:'"' [a-zA-Z0-9]+ '"';
And I'm trying to test it with the following code:
program TestCorrect;
var
x,y:int;
z:float;
{
x = 1;
y = 2;
z = (x+y*3)/4;
if (z > x) {
print("hola mundo",(x+y));
}
}
When I run it it only detects program as an ID and not the PROGRAM token.

There are quite a few things going wrong. In future, I suggest you incrementally create your grammar instead of (trying) to write the entire thing in one go and then coming to the conclusion it doesn't do what you meant it to.
Let's start with the lexer:
WS: [\t\r\n]+ -> skip does not include spaces
ID: ['_'(a-zA-Z0-9)+]* should be ('_'[a-zA-Z0-9]+)*
ID: the first part, [a-zA-Z], should probably be [a-zA-Z]+
VAR, INT, FLOAT are placed after ID, so when ID is properly defined, it will match var, int and float before these tokens
CteF: don't include [+-]?, leave that for the parser to deal with
STRING: [a-zA-Z0-9]+ doe not include spaces, so "hola mundo" will not be matched
Now the parser:
vars: (ID|ID',')+ is wrong because it now always has to end with a comma if you want to match multiple ID's. Do ID (',' ID)* instead
condicion: (';'|'else' bloque ';') mandates a semi-colon should always be present after an if or else block, but in your input, you do not have a semi-colon. Do ('else' bloque)? instead
expresion: exp ('>'|'<'|'<>') exp means an expresion always contains one of the operators >, < or <>, which is not correct (an expression can also just be 1*2). Do exp (('>'|'<'|'<>') exp)? instead
exp: termino ('+'|'-')* is odd: that will match 1++++++++++++. Do termino (('+'|'-') termino)* instead
termino: factor ('*'|'/')* should be factor (('*'|'/') factor)* (same as exp)
varcte: should probably include STRING so that you do not have to do this on multiple places: (expresion|STRING) but can then just do expresion
All in all, this should do the trick:
grammar BajaPower;
programa
: PROGRAM ID ';' vars* bloque
;
vars
: VAR (ID (',' ID)* ':' tipo ';')+
;
tipo
: INT
| FLOAT
;
bloque
:'{' estatuto+ '}'
;
estatuto
: asignacion
| condicion
| escritura
;
asignacion
: ID '=' expresion ';'
;
condicion
: 'if' '(' expresion ')' bloque ('else' bloque)?
;
escritura
: 'print' '(' (expresion ',')* expresion ')' ';'
;
expresion
: exp (('>'|'<'|'<>') exp)?
;
exp
: termino (('+'|'-') termino)*
;
termino
: factor (('*'|'/') factor)*
;
factor
: '(' expresion ')'
| ('+'|'-')? varcte
| STRING
;
varcte
: ID
| CteI
| CteF
;
WS : [ \t\r\n]+ -> skip;
PROGRAM : 'program';
VAR : 'var';
INT : 'int';
FLOAT : 'float';
CteI : [1-9][0-9]* | '0';
CteF : [0-9]* '.' [0-9]+;
ID : [a-zA-Z]+ ('_' [a-zA-Z0-9]+)*;
STRING : '"' .*? '"';

Antlr4 left recursive rule contains a left recursive alternative which can be followed by the empty string

So I defined a grammar to parse an C style syntax language:
grammar mygrammar;
program
: (declaration)*
(statement)*
EOF
;
declaration
: INT ID '=' expression ';'
;
assignment
: ID '=' expression ';'
;
expression
: expression (op=('*'|'/') expression)*
| expression (op=('+'|'-') expression)*
| relation
| INT
| ID
| '(' expression ')'
;
relation
: expression (op=('<'|'>') expression)*
;
statement
: expression ';'
| ifstatement
| loopstatement
| printstatement
| assignment
;
ifstatement
: IF '(' expression ')' (statement)* FI ';'
;
loopstatement
: LOOP '(' expression ')' (statement)* POOL ';'
;
printstatement
: PRINT '(' expression ')' ';'
;
IF : 'if';
FI : 'fi';
LOOP : 'loop';
POOL : 'pool';
INT : 'int';
PRINT : 'print';
ID : [a-zA-Z][a-zA-Z0-9]*;
INTEGER : [0-9]+;
WS : [ \r\n\t] -> skip;
And I can parse a simple test as this:
int i = (2+3)*3/2*(3+36);
int j = i;
int k = 2*1+i*3;
if (k > 2)
k = k + 1;
i = i / 3;
j = j / 3;
fi;
loop (i < 10)
i = i + 1 * (i+k);
j = (j + 1) * (j-k);
k = i + j;
print(k);
pool;
However, when I want to generate ANTLR Recogonizers in intelliJ, I got this error:
sCalc.g4:19:0: left recursive rule expression contains a left recursive alternative which can be followed by the empty string
I wonder if this is caused by my ID could be an empty string?

There are a couple of issues with your grammar:
you have INT as an alternative inside expression while you probably want INTEGER instead
there is no need to do expression (op=('+'|'-') expression)*: this will do: expression op=('+'|'-') expression
ANTLR4 does not support indirect left recursive rules: you must include relation inside expression
Something like this ought to do it:
grammar mygrammar;
program
: (declaration)*
(statement)*
EOF
;
declaration
: INT ID '=' expression ';'
;
assignment
: ID '=' expression ';'
;
expression
: expression op=('*'|'/') expression
| expression op=('+'|'-') expression
| expression op=('<'|'>') expression
| INTEGER
| ID
| '(' expression ')'
;
statement
: expression ';'
| ifstatement
| loopstatement
| printstatement
| assignment
;
ifstatement
: IF '(' expression ')' (statement)* FI ';'
;
loopstatement
: LOOP '(' expression ')' (statement)* POOL ';'
;
printstatement
: PRINT '(' expression ')' ';'
;
IF : 'if';
FI : 'fi';
LOOP : 'loop';
POOL : 'pool';
INT : 'int';
PRINT : 'print';
ID : [a-zA-Z][a-zA-Z0-9]*;
INTEGER : [0-9]+;
WS : [ \r\n\t] -> skip;
Also not that this (statement)* can simply be written as statement*

It's about your expression and relation rules. The expression rule can match relation in one alt, which in turn recurses back to expression. Rule relation additionally can potentially match nothing because of (op=('<'|'>') expression)*
A better approach is probably to have relation call expression and remove the relation alt from expression. Then use relation everywhere you used expression now. That's a typical scenario in expressions, starting out with low precedence operations as top level rules and drilling down to higher precedence rules, ultimately ending at a simple expression rule (or similar).

ANTLR4 Grammar picks up 'and' and 'or' in variable names

Please help me with my ANTLR4 Grammar.
Sample "formel":
(Arbejde.ArbejderIKommuneNr=860) and (Arbejde.ErIArbejde = 'J') &
(Arbejde.ArbejdsTimerPrUge = 40)
(Ansogeren.BorIKommunen = 'J') and (BeregnDato(Ansogeren.Fodselsdato;
'+62Å') < DagsDato)
(Arb.BorI=860)
My problem is that Arb.BorI=860 is not handled correct. I get this error:
Error: no viable alternative at input '(Arb.Bor' at linenr/position: 1/6 \r\nException: Der blev udløst en undtagelse af typen 'Antlr4.Runtime.NoViableAltException
Please notis that Arb.BorI contains the word 'or'.
I think my problem is that my 'booleanOps' in the grammar override 'datakildefelt'
So... My problem is how do I get my grammar correct - I am stuck, so any help will be appreciated.
My Grammar:
grammar UnikFormel;
formel : boolExpression # BooleanExpr
| expression # Expr
| '(' formel ')' # Parentes;
boolExpression : ( '(' expression ')' ) ( booleanOps '(' expression ')' )+;
expression : element compareOps element # Compare;
element : datakildefelt # DatakildeId
| function # Funktion
| int # Integer
| decimal # Real
| string # Text;
datakildefelt : datakilde '.' felt;
datakilde : identifyer;
felt : identifyer;
function : funktionsnavn ('(' funcParameters? ')')?;
funktionsnavn : identifyer;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
identifyer : LETTER+;
int : DIGIT+;
decimal : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
string : QUOTE .*? QUOTE;
booleanOps : (AND | OR);
compareOps : (LT | GT | EQ | GTEQ | LTEQ);
QUOTE : '\'';
OPERATOR: '+';
DIGIT: [0-9];
LETTER: [a-åA-Å];
MUL : '*';
DIV : '/';
ADD : '+';
SUB : '-';
GT : '>';
LT : '<';
EQ : '=';
GTEQ : '>=';
LTEQ : '<=';
AND : '&' | 'and';
OR : '?' | 'or';
WS : ' '+ -> skip;

Rules that come first always have precedence. In your case you need to move AND and OR before LETTER. Also there is the same problem with GTEQ and LTEQ, maybe somewhere else too.
EDIT
Additionally, you should make identifyer a lexer rule, i.e. start with capital letter (IDENTIFIER or Identifier). The same goes for int, decimal and string. Input is initially a stream of characters and is first processed into a stream of tokens, using only lexer rules. At this point parser rules (those starting with lowercase letter) do not come to play yet. So, to make "BorI" parse as single entity (token), you need to create a lexer rule that matches identifiers. Currently it would be parsed as 3 tokens: LETTER (B) OR (or) LETTER (I).

Thanks for your help. There were multiple problems. Reading the ANTLR4 book and using "TestRig -gui" got me on the right track. The working grammar is:
grammar UnikFormel;
formel : '(' formel ')' # Parentes
| expression # Expr
| boolExpression # BooleanExpr
;
boolExpression : '(' expression ')' ( booleanOps '(' expression ')' )+
| '(' formel ')' ( booleanOps '(' formel ')' )+;
expression : element compareOps element # Compare;
datakildefelt : ID '.' ID;
function : ID ('(' funcParameters? ')')?;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
element : datakildefelt # DatakildeId
| function # Funktion
| INT # Integer
| DECIMAL # Real
| STRING # Text;
booleanOps : (AND | OR);
compareOps : ( GTEQ | LTEQ | LT | GT | EQ |);
AND : '&' | 'and';
OR : '?' | 'or';
GTEQ : '>=';
LTEQ : '<=';
GT : '>';
LT : '<';
EQ : '=';
ID : LETTER ( LETTER | DIGIT)*;
INT : DIGIT+;
DECIMAL : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
STRING : QUOTE .*? QUOTE;
fragment QUOTE : '\'';
fragment DIGIT: [0-9];
fragment LETTER: [a-åA-Å];
WS : [ \t\r\n]+ -> skip;

Identifier Lexer rule does not match '*' like its supposed to

I am in the process of finalizing a grammar for a proprietary pattern language. It borrows a few regex syntax elements (like quantifiers) but it's also a lot more complex than regex, since it has to allow macros, different pattern styles etc.
My problem is that '*' does not match against the ID lexer rule like it's supposed to. There is no other rule that could swallow the * token as far as i see.
Here's the grammar i wrote:
grammar Pattern;
element:
ID
| macro;
macro:
MACRONAME macroarg? ('*'|'+'|'?'|FROMTIL)?;
macroarg: '['( (element | MACROFREE ) ';')* (element | MACROFREE) ']';
and_con :
element '&' element
| and_con '&' element
|'(' and_con ')';
head_con :
'H[' block '=>' block ']';
expression :
element
| and_con
| expression ' ' element
| '(' expression ')';
block :
element
| and_con
| or_con
| '(' block ')';
blocksequence :
(block ' '+)* block;
or_con :
((element | and_con) '|')+ (element | and_con)
| or_con '|' (element | and_con)
| '(' blocksequence (')|(' blocksequence)+ (')'|')*');
patternlist :
(blocksequence ' '* ',' ' '*)* blocksequence;
sentenceord :
'S=(' patternlist ')';
sentenceunord :
'S={' patternlist '}';
pattern :
sentenceord
| sentenceunord
| blocksequence;
multisentence :
MS pattern;
clause :
'CLS' ' '+ pattern;
complexpattern :
pattern
| multisentence
| clause
| SECTIONS ' ' complexpattern;
dictentry:
NUM ';' complexpattern
| NUM ';' NAME ';' complexpattern
| COMMENT;
dictionary:
(dictentry ('\r'|'\n'))* (dictentry)?;
ID : '*' ('*'|'+'|'?'|FROMTIL)?
| ( '^'? '!'? ('F'|'C'|'L'|'P'|'CA'|'N'|'PE'|'G'|'CD'|'T'|'M'|'D')'=' NAME ('*'|'+'|'?'|FROMTIL)? '$'? );
MS : 'MS' [0-9];
SECTIONS: 'SEC' '=' ([0-9]+','?)+;
FROMTIL: '{'NUM'-'NUM'}';
NUM: [0-9]+;
NAME: CHAR+ | ',' | '.' | '*';
CHAR: [a-zA-Z0-9_äöüßÄÖÜ\-];
MACRONAME: '#'[a-zA-Z_][a-zA-Z_0-9]*;
MACROFREE: [a-zA-Z!]+;
COMMENT: '//' ~('\r'|'\n')*;
The complexpattern/pattern/element/block parser rules should accept a simple '*', and i can't figure out why they don't.

In your macro rule, you defined the literal '*', causing the ID rule not to match a single "*" as input.

Implement a jump function

I'm trying to create an interpreter for my scripting language using ANTLR4. I have yet implemented the standard operations (mul, div, sub, etc...) using visitors and now i've to implement a jump \ Salta function call. Jump(n) FunctionCall ignore n rows after his call. Example:
Fai var1 = 3,var2 = 4;
Fai Salta(1); //my jump() function call
Fai var1=4;
println(var1);
output: 3
This is my current grammar:
grammar TL;
#members{
int salta=0;
}
parse
: block+ EOF
;
block
: DO statement (','statement )* END // Fai a=2,B=e;
; //manca l'if
DO:'Fai';
END:';';
Salta:'Salta';
statement
:assign
|functionCall
|saltaCall
;
functionCall
: Identifier '(' exprList? ')' #identifierFunctionCall
| Println '(' expr? ')' #printlnFunctionCall
| Print '(' expr ')' #printFunctionCall
;
saltaCall
:Salta '(' Number ')' rows[$Number.int]
;
rows[int n]
locals[int i=0;]
:({$i<$n}?END? block {$i++;})*
;
exprList
: expr (',' expr)*
;
assign
:Identifier '=' expr
;
Identifier
: [a-zA-Z_] [a-zA-Z_0-9]*
;
expr
: '-'expr #unaryMinusExpr
| '!'expr #notExpr
| expr '^' expr #powerExpr
| expr '*' expr #multiplyExpr
| expr '/' expr #divideExpr
| expr '%' expr #modulusExpr
| expr '+' expr #addExpr
| expr '-' expr #subtractExpr
| expr '>=' expr #gtEqExpr
| expr '<=' expr #ltEqExpr
| expr '>' expr #gtExpr
| expr '<' expr #ltExpr
| expr '==' expr #eqExpr
| expr 'O' expr #orExpr
| expr 'E' expr #andExpr
| expr '=' expr #eqExpr
| Number #numberExpr
| Bool #boolExpr
| Null #nullExpr
| functionCall #functionCallExpr
| Identifier #identifierExpr
| String #stringExpr
| '(' expr ')' #exprExpr
;
Println:'println';
Print:'print';
Null:'null';
Or : 'O';
And : 'E';
Equals : '==';
NEquals : '!=';
GTEquals : '>=';
LTEquals : '<=';
Pow : '^';
Excl : '!';
GT : '>';
LT : '<';
Add : '+';
Subtract : '-';
Multiply : '*';
Divide : '/';
Modulus : '%';
OBrace : '{';
CBrace : '}';
OBracket : '[';
CBracket : ']';
OParen : '(';
CParen : ')';
Assign : '=';
QMark : '?';
Colon : ':';
Bool
: 'true'
| 'false'
;
Number
: Int ('.' Digit*)?
;
String
: ["] (~["\r\n] | '\\\\' | '\\"')* ["]
| ['] (~['\r\n] | '\\\\' | '\\\'')* [']
;
fragment Int
: [1-9] Digit*
| '0'
;
fragment Digit
: [0-9]
;
fragment NL
: '\n'
;
// ---------SKIP------------
Comment
: ('#' ~[\r\n]* ) -> skip
;
Space
: [ \t\r\n\u000C] -> skip
;
How can I implement that function?

It's easier to have a look at my Mu interpreter. Your jump call would look a lot like a log call:
jump
: JUMP expr SCOL
;
JUMP : 'jump';
Then override the visitJump method and keep track of the value inside the jump(...) call inside your EvalVisitor.
Now all you would need to do is override the visitBlock method and if the recorded value inside value inside the jump(...) is more then 0, don't visit the next expression. Some pseudo code:
public class EvalVisitor extends MuBaseVisitor<Value> {
...
private Double jumpAmount = 0.0;
#Override
public Value visitBlock(#NotNull MuParser.BlockContext ctx) {
for (MuParser.StatContext statContext : ctx.stat()) {
if jumpAmount is more than 0, decrease by 1
else visit (execute) statContext
}
return Value.VOID;
}
#Override
public Value visitJump(#NotNull MuParser.JumpContext ctx) {
Value amount = this.visit(ctx.expr());
jumpAmount = amount.asDouble();
return amount;
}
...
}

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

antlr4 literal string handling - antlr4

Note the string is matching from the first quote to the last possible quote. By default, a Kleene operator () in ANTLR is greedy. So, change STRING: '"' CHAR '"'; to STRING: '"' CHAR*? '"'; to make it non-greedy.

Related

Why isn't the program token recognized? ANTLR4

Antlr4 left recursive rule contains a left recursive alternative which can be followed by the empty string

ANTLR4 Grammar picks up 'and' and 'or' in variable names

Identifier Lexer rule does not match '*' like its supposed to

Implement a jump function

Categories

Resources

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

antlr4 literal string handling - antlr4

Note the string is matching from the first quote to the last possible quote. By default, a Kleene operator (*) in ANTLR is greedy. So, change STRING: '"' CHAR* '"'; to STRING: '"' CHAR*? '"'; to make it non-greedy.

Related

Why isn't the program token recognized? ANTLR4

Antlr4 left recursive rule contains a left recursive alternative which can be followed by the empty string

ANTLR4 Grammar picks up 'and' and 'or' in variable names

Identifier Lexer rule does not match '*' like its supposed to

Implement a jump function

Categories

Resources

Note the string is matching from the first quote to the last possible quote. By default, a Kleene operator () in ANTLR is greedy. So, change STRING: '"' CHAR '"'; to STRING: '"' CHAR*? '"'; to make it non-greedy.