Description
I'm trying to create a custom language that I want to separate lexer rules from parser rules. Besides, I aim to divide lexer and parser rules into specific files further (e.g., common lexer rules, and keyword rules).
But I don't seem to be able to get it to work.
Although I'm not getting any errors while generating the parser (.java files), grun fails with Exception in thread "main" java.lang.ClassCastException.
Note
I'm running ANTLR4.7.2 on Windows7 targeting Java.
Code
I created a set of files that closely mimic what I intend to achieve. The example below defines a language called MyLang and separates lexer and parser grammar. Also, I'm splitting lexer rules into four files:
// MyLang.g4
parser grammar MyLang;
options { tokenVocab = MyLangL; }
prog
: ( func )* END
;
func
: DIR ID L_BRKT (stat)* R_BRKT
;
stat
: expr SEMICOLON
| ID OP_ASSIGN expr SEMICOLON
| SEMICOLON
;
expr
: expr OPERATOR expr
| NUMBER
| ID
| L_PAREN expr R_PAREN
;
// MyLangL.g4
lexer grammar MyLangL;
import SkipWhitespaceL, CommonL, KeywordL;
#header {
package com.invensense.wiggler.lexer;
}
#lexer::members { // place this class member only in lexer
Map<String,Integer> keywords = new HashMap<String,Integer>() {{
put("for", MyLangL.KW_FOR);
/* add more keywords here */
}};
}
ID : [a-zA-Z]+
{
if ( keywords.containsKey(getText()) ) {
setType(keywords.get(getText())); // reset token type
}
}
;
DIR
: 'in'
| 'out'
;
END : 'end' ;
// KeywordL.g4
lexer grammar KeywordL;
#lexer::header { // place this header action only in lexer, not the parser
import java.util.*;
}
// explicitly define keyword token types to avoid implicit def warnings
tokens {
KW_FOR
/* add more keywords here */
}
// CommonL.g4
lexer grammar CommonL;
NUMBER
: FLOAT
| INT
| UINT
;
FLOAT
: NEG? DIGIT+ '.' DIGIT+ EXP?
| INT
;
INT
: NEG? UINT+
;
UINT
: DIGIT+ EXP?
;
OPERATOR
: OP_ASSIGN
| OP_ADD
| OP_SUB
;
OP_ASSIGN : ':=';
OP_ADD : POS;
OP_SUB : NEG;
L_BRKT : '[' ;
R_BRKT : ']' ;
L_PAREN : '(' ;
R_PAREN : ')' ;
SEMICOLON : ';' ;
fragment EXP
: [Ee] SIGN? DIGIT+
;
fragment SIGN
: POS
| NEG
;
fragment POS: '+' ;
fragment NEG : '-' ;
fragment DIGIT : [0-9];
// SkipWhitespaceL.g4
lexer grammar SkipWhitespaceL;
WS
: [ \t\r\n]+ -> channel(HIDDEN)
;
Output
Here is the exact output I receive from the code above:
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§ antlr4.bat .\MyLangL.g4
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§ antlr4.bat .\MyLang.g4
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§ javac *.java
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§ grun MyLang prog -tree
Exception in thread "main" java.lang.ClassCastException: class MyLang
at java.lang.Class.asSubclass(Unknown Source)
at org.antlr.v4.gui.TestRig.process(TestRig.java:135)
at org.antlr.v4.gui.TestRig.main(TestRig.java:119)
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§
Rename both of your file with MyLangParser and MyLangLexer, then run grun MyLang prog -tree
Related
I have this grammar:
grammar BajaPower;
// Gramaticas
programa:PROGRAM ID ';' vars* bloque ;
vars:VAR ((ID|ID',')+ ':' tipo ';')+;
tipo:(INT|FLOAT);
bloque:'{' estatuto+ '}';
estatuto: (asignacion|condicion|escritura);
asignacion: ID '=' expresion ';';
condicion: 'if' '(' expresion ')' bloque (';'|'else' bloque ';');
escritura: 'print' '(' (expresion|STRING ',')* (expresion|STRING) ')' ';';
expresion: exp ('>'|'<'|'<>') exp;
exp: (termino ('+'|'-')*|termino);
termino: (factor ('*'|'/')*|factor);
factor: ('(' expresion ')')|('+'|'-') varcte| varcte;
varcte: (ID|CteI|CteF);
// Tokens
WS: [\t\r\n]+ -> skip;
PROGRAM:'program';
ID:([a-zA-Z]['_'(a-zA-Z0-9)+]*);
VAR:'var';
INT:'int';
FLOAT:'float';
CteI: ([1-9][0-9]*|'0');
CteF: [+-]?([0-9]*[.])?[0-9]+;
STRING:'"' [a-zA-Z0-9]+ '"';
And I'm trying to test it with the following code:
program TestCorrect;
var
x,y:int;
z:float;
{
x = 1;
y = 2;
z = (x+y*3)/4;
if (z > x) {
print("hola mundo",(x+y));
}
}
When I run it it only detects program as an ID and not the PROGRAM token.
There are quite a few things going wrong. In future, I suggest you incrementally create your grammar instead of (trying) to write the entire thing in one go and then coming to the conclusion it doesn't do what you meant it to.
Let's start with the lexer:
WS: [\t\r\n]+ -> skip does not include spaces
ID: ['_'(a-zA-Z0-9)+]* should be ('_'[a-zA-Z0-9]+)*
ID: the first part, [a-zA-Z], should probably be [a-zA-Z]+
VAR, INT, FLOAT are placed after ID, so when ID is properly defined, it will match var, int and float before these tokens
CteF: don't include [+-]?, leave that for the parser to deal with
STRING: [a-zA-Z0-9]+ doe not include spaces, so "hola mundo" will not be matched
Now the parser:
vars: (ID|ID',')+ is wrong because it now always has to end with a comma if you want to match multiple ID's. Do ID (',' ID)* instead
condicion: (';'|'else' bloque ';') mandates a semi-colon should always be present after an if or else block, but in your input, you do not have a semi-colon. Do ('else' bloque)? instead
expresion: exp ('>'|'<'|'<>') exp means an expresion always contains one of the operators >, < or <>, which is not correct (an expression can also just be 1*2). Do exp (('>'|'<'|'<>') exp)? instead
exp: termino ('+'|'-')* is odd: that will match 1++++++++++++. Do termino (('+'|'-') termino)* instead
termino: factor ('*'|'/')* should be factor (('*'|'/') factor)* (same as exp)
varcte: should probably include STRING so that you do not have to do this on multiple places: (expresion|STRING) but can then just do expresion
All in all, this should do the trick:
grammar BajaPower;
programa
: PROGRAM ID ';' vars* bloque
;
vars
: VAR (ID (',' ID)* ':' tipo ';')+
;
tipo
: INT
| FLOAT
;
bloque
:'{' estatuto+ '}'
;
estatuto
: asignacion
| condicion
| escritura
;
asignacion
: ID '=' expresion ';'
;
condicion
: 'if' '(' expresion ')' bloque ('else' bloque)?
;
escritura
: 'print' '(' (expresion ',')* expresion ')' ';'
;
expresion
: exp (('>'|'<'|'<>') exp)?
;
exp
: termino (('+'|'-') termino)*
;
termino
: factor (('*'|'/') factor)*
;
factor
: '(' expresion ')'
| ('+'|'-')? varcte
| STRING
;
varcte
: ID
| CteI
| CteF
;
WS : [ \t\r\n]+ -> skip;
PROGRAM : 'program';
VAR : 'var';
INT : 'int';
FLOAT : 'float';
CteI : [1-9][0-9]* | '0';
CteF : [0-9]* '.' [0-9]+;
ID : [a-zA-Z]+ ('_' [a-zA-Z0-9]+)*;
STRING : '"' .*? '"';
I am a newbie to ANTLR4 and language compilers. I am working on building a language compiler using ANTLR4 Java. I have a small problem with parsing strings. The reserved words/ Tokens are getting matched instead of string. For eg: IF is a keyword token in my lexer but how to use "if" as a string?
Lexer file:
lexer grammar testgrammar;
IF : I F;
ENDIF : E N D I F;
ELSE : E L S E;
CASE : C A S E;
ENDCASE : E N D C A S E;
BREAK : B R E A K;
SWITCH : S W I T C H;
SUBSTRING : S U B S T R I N G;
COMMA : ',' ;
SEMI : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
DOT : '.' ;// ('.' {$setType(DOTDOT);})? ;
LCURLY : '{' ;
RCURLY : '}' ;
AND : '&&' ;
OR : '||' ;
DOUBLEQUOTES : '"' ;
COMPARATOR : '=='| '>=' | '>' | '<' | '<=' | '!=' ;
SYMBOLS : '§' | '$' | '%' | '/' | '=' | '?' | '#' | '_' | '#' | '€';
LETTER : [A-Za-z\u00e4\u00c4\u00d6\u00f6\u00dc\u00fc\u00df];
NUMERICVALUE : NUMBER ('.' NUMBER)?;
STRING_LITERAL : '\'' ('\'\'' | ~('\''))* '\'';
NOTCONDITION : NOT;
OPERATORS : OPERATOR;
COMMENT : (('/*' .*? '*/') | ('//' ~[\r\n]*)) -> skip;
WS : (' ' | '\t' | '\r' | '\n')+ -> skip;
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment NUMBER:[0-9]+;
fragment OPERATOR: ('+'|'-'|'&'|'*'|'~');
fragment NOT: ('!');
grammar:
parser grammar testParser;
symbolCharacters: (SYMBOLS | operators) ;
word:
( symbolCharacters | LETTER )+
;
wordList:
word+
;
I am not supposed share full grammar. But i have shared enough information i guess. I can understand that the words are formed from LETTERS and Symbol characters. One workaround i can do is making word rule like:
word:
( symbolCharacters | LETTER | IF | SWITCH | CASE | ELSE | BREAK )+
;
I have a lot of tokens. I dont want to add everything individually. Is there any other nice way to accomplish this?
Valid expression
Error expression
How to make the parser ignore the keywords inside the string?
Your same grammar does not have the problem you describe:
➜ antlr4 testgrammar.g4
➜ javac *.java
➜ echo "if 'if' endif" | grun testgrammar tokens -tokens
[#0,0:1='if',<IF>,1:0]
[#1,3:6=''if'',<STRING_LITERAL>,1:3]
[#2,8:12='endif',<ENDIF>,1:8]
[#3,14:13='<EOF>',<EOF>,2:0]
(perhaps you have inadvertently "corrected" the problem as you trimmed your grammar down, so I'll elaborate a bit.)
In short, during the lexing/tokenization phase of ANTLR parsing your input, ANTLR will, naturally, attempt to match you Lexer rules. If ANTLR finds a match of multiple rules for the current characters of your input stream, it follows two rules to determine a "winner".
If a rule matches a longer sequence of input characters, then that rule will be used.
If two rules match the same number of input characters, then the rule appearing first in your grammar will be used.
In your case, neither really comes into play as the grammar, when it reaches the ', will attempt to complete the STRING_LITERAL rule, and will find a match for the characters 'if'. It will never even attempt to match you IF lexer rule.
BTW, I did have to correct the symbolCharacters parser rule to be
symbolCharacters: (SYMBOLS | OPERATORS);
I'm working on parsing PDF streams. In section 7.3.4.2 on literal string objects, the PDF Reference says that a backslash within a literal string that isn't followed by an end-of-line character, one to three octal digits, or one of the characters "nrtbf()\" should be ignored. Is there a way to get the recover method in my lexer to ignore a backslash in this situation?
Here is my simplified parser:
parser grammar PdfStreamParser;
options { tokenVocab=PdfSteamLexer; }
array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
: NULL
| array
| dictionary
| BOOLEAN
| NUMBER
| string
| NAME
;
content : stat* ;
stat
: tj
;
tj: ((string Tj) | (array TJ)) ; // Show text
Here's the lexer. (Based on the advice in this answer I'm not using a separate string mode):
lexer grammar PdfStreamLexer;
Tj: 'Tj' ;
TJ: 'TJ' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
NAME: '/' ID ;
// A sequence of literal characters enclosed in parentheses.
LITERAL_STRING: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | LITERAL_STRING )* ')' ;
// Escape sequences that can occur within a LITERAL_STRING
fragment ESCAPE_SEQUENCE
: '\\' ( [\r\nnrtbf()\\] | [0-7] [0-7]? [0-7]? )
;
HEX_STRING: '<' [0-9A-Za-z]+ '>' ; // Hexadecimal data enclosed in angle brackets
fragment INT: DIGIT+ ; // match 1 or more digits
fragment FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment DIGIT: [0-9] ; // match single digit
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
I can override the recover method in the PdfStreamLexer class and get notified when the LexerNoViableAltException occurs, but I'm not sure how to (or if it's possible to) ignore the backslash and continue on with the LITERAL_STRING tokenization.
To be able to skip part of the string, you'll need to use lexical modes. Here's a quick demo:
lexer grammar DemoLexer;
STRING_OPEN
: '(' -> pushMode(STRING_MODE)
;
SPACES
: [ \t\r\n] -> skip
;
OTHER
: .
;
mode STRING_MODE;
STRING_CLOSE
: ')' -> popMode
;
ESCAPE
: '\\' ( [nrtbf()\\] | [0-7] [0-7] [0-7] )
;
STRING_PART
: ~[\\()]
;
NESTED_STRING_OPEN
: '(' -> type(STRING_OPEN), pushMode(STRING_MODE)
;
IGNORED_ESCAPE
: '\\' . -> skip
;
which can be used in the parser as follows:
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
parse
: ( string | OTHER )* EOF
;
string
: STRING_OPEN ( ESCAPE | STRING_PART | string )* STRING_CLOSE
;
If you now parse the string FU(abc(def)\#\))BAR, you will get the following parse tree:
As you can see, the \) is left in the tree, but \# is omitted.
Why won't the below grammar recognize boolean values?
I've compared this to the grammars for both Java and GraphQL, and cannot see why it doesn't work.
Given the below grammar, parses is as follows:
foo = null // foo = value:nullValue
foo = 123 // foo = value:numberValue
foo = "Hello" // foo = value:stringValue
foo = true // line 1:6 mismatched input 'true' expecting {'null', STRING, BOOLEAN, NUMBER}
What is wrong?
grammar issue;
elementValuePair
: Identifier '=' value
;
Identifier : [_A-Za-z] [_0-9A-Za-z]* ;
value
: STRING # stringValue | NUMBER # numberValue | BOOLEAN # booleanValue | 'null' #nullValue
;
STRING
: '"' ( ESC | ~ ["\\] )* '"'
;
BOOLEAN
: 'true' | 'false'
;
NUMBER
: '-'? INT '.' [0-9]+| '-'? INT | '-'? INT
;
fragment INT
: '0' | [1-9] [0-9]*
;
fragment ESC
: '\\' ( ["\\/bfnrt] )
;
fragment HEX
: [0-9a-fA-F]
;
WS
: [ \t\n\r]+ -> skip
;
It doesn't work because the token 'true' matches lexer rule Identifier:
[#0,0:2='foo',<Identifier>,1:0]
[#1,4:4='=',<'='>,1:4]
[#2,6:9='true',<Identifier>,1:6] <== lexed as an Identifier!
[#3,14:13='<EOF>',<EOF>,3:0]
line 1:6 mismatched input 'true' expecting {'null', STRING, BOOLEAN, NUMBER}
Move your definition of Identifier down farther to be among the lexer rules and it works:
NUMBER
: '-'? INT '.' [0-9]+| '-'? INT | '-'? INT
;
Identifier : [_A-Za-z] [_0-9A-Za-z]* ;
Remember stuff at the top trumps stuff at the bottom. In a combined grammar like yours, don't intersperse the lexer rules (which start with a capital letter) with parser rules (which start with a lowercase letter).
I am still a newbie to ANTLR, so sorry if I am posting an obvious question.
I have a relatively simple grammar. What I need is for the user to be able to enter something like the following:
if (condition)
{
return true
}
else if (condition)
{
return false
}
else
{
if (condition)
{
return true
}
return false
}
In my grammar below, is there a way to make sure that an error will be flagged if the input string does not contain a 'return' statement? If not, can I do it via the Listener, and if so, how?
grammar Evaluator;
parse
: block EOF
;
block
: statement
;
statement
: return_statement
| if_statement
;
return_statement
: RETURN (TRUE | FALSE)
;
if_statement
: IF condition_block (ELSE IF condition_block)* (ELSE statement_block)?
;
condition_block
: expression statement_block
;
statement_block
: OBRACE block CBRACE
;
expression
: MINUS expression #unaryMinusExpression
| NOT expression #notExpression
| expression op=(MULT | DIV) expression #multiplicationExpression
| expression op=(PLUS | MINUS) expression #additiveExpression
| expression op=(LTEQ | GTEQ | LT | GT) expression #relationalExpression
| expression op=(EQ | NEQ) expression #equalityExpression
| expression AND expression #andExpression
| expression OR expression #orExpression
| atom #atomExpression
;
atom
: function #functionAtom
| OPAR expression CPAR #parenExpression
| (INT | FLOAT) #numberAtom
| (TRUE | FALSE) #booleanAtom
| ID #idAtom
;
function
: ID OPAR (parameter (',' parameter)*)? CPAR
;
parameter
: expression #expressionParameter
;
OR : '||';
AND : '&&';
EQ : '==';
NEQ : '!=';
GT : '>';
LT : '<';
GTEQ : '>=';
LTEQ : '<=';
PLUS : '+';
MINUS : '-';
MULT : '*';
DIV : '/';
NOT : '!';
OPAR : '(';
CPAR : ')';
OBRACE : '{';
CBRACE : '}';
ASSIGN : '=';
RETURN : 'return';
TRUE : 'true';
FALSE : 'false';
IF : 'if';
ELSE : 'else';
// ID either starts with a letter then followed by any number of a-zA-Z_0-9
// or starts with one or more numbers, then followed by at least one a-zA-Z_ then followed
// by any number of a-zA-Z_0-9
ID
: [a-zA-Z] [a-zA-Z_0-9]*
| [0-9]+ [a-zA-Z_]+ [a-zA-Z_0-9]*
;
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
SPACE
: [ \t\r\n] -> skip
;
// Anything not recognized above will be an error
ErrChar
: .
;
Ross' answer is perfectly correct. You design your grammar to accept a certain input. If the input stream does not correspond, the parser will complain.
Allow me to rewrite your grammar like this :
grammar Question;
/* enforce each block to end with a return statement */
a_grammar
: if_statement EOF
;
if_statement
: 'if' expression statement+ ( 'else' statement+ )?
;
statement
: if_statement
// other statements
| statement_block
;
statement_block
: '{' statement* return_statement '}'
;
return_statement
: 'return' ( 'true' | 'false' )
;
expression // reduced to a strict minimum to answer the OP question
: atom
| atom '<=' atom
| '(' expression ')'
;
atom
: ID
| INT
;
ID
: [a-zA-Z] [a-zA-Z_0-9]*
| [0-9]+ [a-zA-Z_]+ [a-zA-Z_0-9]*
;
INT : [0-9]+ ;
WS : [ \t\r\n] -> skip ;
// Anything not recognized above will be an error
ErrChar
: .
;
With the following input
if (a <= 7)
{
return true
}
else
if (xyz <= 99)
{
return false
}
else incor##!$rect
{
if (b <= a)
{
return true
}
return false
}
you get these tokens
[#0,0:1='if',<'if'>,1:0]
[#1,3:3='(',<'('>,1:3]
[#2,4:4='a',<ID>,1:4]
[#3,6:7='<=',<'<='>,1:6]
...
[#21,82:85='else',<'else'>,10:1]
[#22,87:91='incor',<ID>,10:6]
[#23,92:92='#',<ErrChar>,10:11]
[#24,93:93='#',<ErrChar>,10:12]
[#25,94:94='!',<ErrChar>,10:13]
[#26,95:95='$',<ErrChar>,10:14]
[#27,96:99='rect',<ID>,10:15]
[#28,102:102='{',<'{'>,11:1]
...
line 10:6 mismatched input 'incor' expecting {'if', '{'}
If you run the test rig with the -gui option, it displays the parse tree with erroneous tokens nicely displayed in pink !
grun Question a_grammar -gui data.txt
I've never played with the Listener before.
Via the Visitor, in the VisitStatement(StatementContext context) method, check if the context.return_statement() (ReturnStatementContext) is null. If it is null, throw an exception.
I'm a newbie as well. I was thinking of forcing the lexer to barf by
requiring a return statement, so instead of:
statement
: return_statement
| if_statement
;
Which says a statement is EITHER a if_statement OR a return_statement I would try something like:
statement
: (if_statement)? return_statement
;
Which (I believe), says the if_statement is optional but the return_statement MUST always occur. But you might want to try something like:
block_data : statements+ return_statement;
Where statements could be if_statements etc, and one or more of those are allowed.
I would take everything above with a grain of salt, as I have only been working with ANTLR4 a week or so. I have 4 .g4 files working, and am happy with ANTLR, but you may actually have more ANTLR stick time than I.
-Regards