ANTLR4 strange behavior with a simple rule - antlr4

I a defining a simple rule in ANTLR4 for C# target below:
numberliteral: NUMBER;
NUMBER : '-'? INT '.' INT EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -4 12
;
fragment INT : [0] | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
The results are weird:
Anything that fulfil the first alternative in NUMBER is good, e.g. 1.2, 1.2e+1, -1.2
The other two alternative for NUMBER only work if there is '-' sign in front of the number, e.g. -1e+2 or -2. It does not recognize positive number like: 2 or 2e+3
Anyone has any idea what goes wrong here?
Thanks

Did work for me (in Antlrworks NetBeans plugin):
Grammar:
grammar simpleGrammar;
start: numberliteral*;
numberliteral: NUMBER;
NUMBER : '-'? INT '.' INT EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -4 12
;
fragment INT : [0] | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
WS : [ \t\n\r] -> skip;
Sample:
1.35
1.35E-9
0.3
-4.5
1e10
-3e4
-4
12
Result:

Related

Antlr4 float number

I am trying to use ANTLR4 to parse input from user but having a hard time.
I want to get a list of numbers. Here is part of my grammar:
number
: DEC
| FLOAT
| HEX
| BIN
;
FLOAT : DIGIT? '.' DIGIT*;
DEC : DIGIT+ ;
HEX : '0' [xX] ([A-Fa-f] | DIGIT)+ ;
BIN : '0' [bB] [01]+ ;
fragment ALPHA: [a-zA-Z_];
fragment DIGIT : [0-9];
WS : [ ,\t\r\n]+ -> skip;
When input is 1 .2 3.2 then I get 1 .2 3.2
But if I use 1.2.3 it incorrectly recognizes 1.2 .3
How can I change the grammar to fix this?
FLOAT rule seems wrong. I have updated number and FLOAT definitions. Below code works only single numbers.
number
: (FLOAT | DEC | HEX | BIN) EOF
;
FLOAT : DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
DEC : DIGIT+;
HEX : '0' [xX] ([A-Fa-f] | DIGIT)+ ;
BIN : '0' [bB] [01]+ ;
fragment ALPHA: [a-zA-Z_];
fragment DIGIT : [0-9];
WS : [ ,\t\r\n]+ -> skip;
In most complex grammars, there are other tokens like signs, parenthesis.. etc. So we can easily handle separate tokens with space skipping. However, your grammar has only numbers and I can not separate tokens with skipping spaces. So, I discarded the whitespace skip definition. Below code handle many numbers and fails if the word is like 1.2.3. You should test with numbers and don't process WS tokens.
numbers
: WS? number (WS number)* WS? EOF;
number
: (FLOAT | DEC | HEX | BIN)
;
FLOAT : DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
DEC : DIGIT+;
HEX : '0' [xX] ([A-Fa-f] | DIGIT)+ ;
BIN : '0' [bB] [01]+ ;
fragment ALPHA: [a-zA-Z_];
fragment DIGIT : [0-9];
WS : [ ,\t\r\n]+;

Can I make my ANTLR4 Lexer discard a character from the input stream?

I'm working on parsing PDF streams. In section 7.3.4.2 on literal string objects, the PDF Reference says that a backslash within a literal string that isn't followed by an end-of-line character, one to three octal digits, or one of the characters "nrtbf()\" should be ignored. Is there a way to get the recover method in my lexer to ignore a backslash in this situation?
Here is my simplified parser:
parser grammar PdfStreamParser;
options { tokenVocab=PdfSteamLexer; }
array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
: NULL
| array
| dictionary
| BOOLEAN
| NUMBER
| string
| NAME
;
content : stat* ;
stat
: tj
;
tj: ((string Tj) | (array TJ)) ; // Show text
Here's the lexer. (Based on the advice in this answer I'm not using a separate string mode):
lexer grammar PdfStreamLexer;
Tj: 'Tj' ;
TJ: 'TJ' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
NAME: '/' ID ;
// A sequence of literal characters enclosed in parentheses.
LITERAL_STRING: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | LITERAL_STRING )* ')' ;
// Escape sequences that can occur within a LITERAL_STRING
fragment ESCAPE_SEQUENCE
: '\\' ( [\r\nnrtbf()\\] | [0-7] [0-7]? [0-7]? )
;
HEX_STRING: '<' [0-9A-Za-z]+ '>' ; // Hexadecimal data enclosed in angle brackets
fragment INT: DIGIT+ ; // match 1 or more digits
fragment FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment DIGIT: [0-9] ; // match single digit
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
I can override the recover method in the PdfStreamLexer class and get notified when the LexerNoViableAltException occurs, but I'm not sure how to (or if it's possible to) ignore the backslash and continue on with the LITERAL_STRING tokenization.
To be able to skip part of the string, you'll need to use lexical modes. Here's a quick demo:
lexer grammar DemoLexer;
STRING_OPEN
: '(' -> pushMode(STRING_MODE)
;
SPACES
: [ \t\r\n] -> skip
;
OTHER
: .
;
mode STRING_MODE;
STRING_CLOSE
: ')' -> popMode
;
ESCAPE
: '\\' ( [nrtbf()\\] | [0-7] [0-7] [0-7] )
;
STRING_PART
: ~[\\()]
;
NESTED_STRING_OPEN
: '(' -> type(STRING_OPEN), pushMode(STRING_MODE)
;
IGNORED_ESCAPE
: '\\' . -> skip
;
which can be used in the parser as follows:
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
parse
: ( string | OTHER )* EOF
;
string
: STRING_OPEN ( ESCAPE | STRING_PART | string )* STRING_CLOSE
;
If you now parse the string FU(abc(def)\#\))BAR, you will get the following parse tree:
As you can see, the \) is left in the tree, but \# is omitted.

ANTLR4: grun errors while using separate grammar (ClassCastException)

Description
I'm trying to create a custom language that I want to separate lexer rules from parser rules. Besides, I aim to divide lexer and parser rules into specific files further (e.g., common lexer rules, and keyword rules).
But I don't seem to be able to get it to work.
Although I'm not getting any errors while generating the parser (.java files), grun fails with Exception in thread "main" java.lang.ClassCastException.
Note
I'm running ANTLR4.7.2 on Windows7 targeting Java.
Code
I created a set of files that closely mimic what I intend to achieve. The example below defines a language called MyLang and separates lexer and parser grammar. Also, I'm splitting lexer rules into four files:
// MyLang.g4
parser grammar MyLang;
options { tokenVocab = MyLangL; }
prog
: ( func )* END
;
func
: DIR ID L_BRKT (stat)* R_BRKT
;
stat
: expr SEMICOLON
| ID OP_ASSIGN expr SEMICOLON
| SEMICOLON
;
expr
: expr OPERATOR expr
| NUMBER
| ID
| L_PAREN expr R_PAREN
;
// MyLangL.g4
lexer grammar MyLangL;
import SkipWhitespaceL, CommonL, KeywordL;
#header {
package com.invensense.wiggler.lexer;
}
#lexer::members { // place this class member only in lexer
Map<String,Integer> keywords = new HashMap<String,Integer>() {{
put("for", MyLangL.KW_FOR);
/* add more keywords here */
}};
}
ID : [a-zA-Z]+
{
if ( keywords.containsKey(getText()) ) {
setType(keywords.get(getText())); // reset token type
}
}
;
DIR
: 'in'
| 'out'
;
END : 'end' ;
// KeywordL.g4
lexer grammar KeywordL;
#lexer::header { // place this header action only in lexer, not the parser
import java.util.*;
}
// explicitly define keyword token types to avoid implicit def warnings
tokens {
KW_FOR
/* add more keywords here */
}
// CommonL.g4
lexer grammar CommonL;
NUMBER
: FLOAT
| INT
| UINT
;
FLOAT
: NEG? DIGIT+ '.' DIGIT+ EXP?
| INT
;
INT
: NEG? UINT+
;
UINT
: DIGIT+ EXP?
;
OPERATOR
: OP_ASSIGN
| OP_ADD
| OP_SUB
;
OP_ASSIGN : ':=';
OP_ADD : POS;
OP_SUB : NEG;
L_BRKT : '[' ;
R_BRKT : ']' ;
L_PAREN : '(' ;
R_PAREN : ')' ;
SEMICOLON : ';' ;
fragment EXP
: [Ee] SIGN? DIGIT+
;
fragment SIGN
: POS
| NEG
;
fragment POS: '+' ;
fragment NEG : '-' ;
fragment DIGIT : [0-9];
// SkipWhitespaceL.g4
lexer grammar SkipWhitespaceL;
WS
: [ \t\r\n]+ -> channel(HIDDEN)
;
Output
Here is the exact output I receive from the code above:
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§ antlr4.bat .\MyLangL.g4
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§ antlr4.bat .\MyLang.g4
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§ javac *.java
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§ grun MyLang prog -tree
Exception in thread "main" java.lang.ClassCastException: class MyLang
at java.lang.Class.asSubclass(Unknown Source)
at org.antlr.v4.gui.TestRig.process(TestRig.java:135)
at org.antlr.v4.gui.TestRig.main(TestRig.java:119)
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§
Rename both of your file with MyLangParser and MyLangLexer, then run grun MyLang prog -tree

ANTLR4 Grammar doesn't recognize Boolean literals

Why won't the below grammar recognize boolean values?
I've compared this to the grammars for both Java and GraphQL, and cannot see why it doesn't work.
Given the below grammar, parses is as follows:
foo = null // foo = value:nullValue
foo = 123 // foo = value:numberValue
foo = "Hello" // foo = value:stringValue
foo = true // line 1:6 mismatched input 'true' expecting {'null', STRING, BOOLEAN, NUMBER}
What is wrong?
grammar issue;
elementValuePair
: Identifier '=' value
;
Identifier : [_A-Za-z] [_0-9A-Za-z]* ;
value
: STRING # stringValue | NUMBER # numberValue | BOOLEAN # booleanValue | 'null' #nullValue
;
STRING
: '"' ( ESC | ~ ["\\] )* '"'
;
BOOLEAN
: 'true' | 'false'
;
NUMBER
: '-'? INT '.' [0-9]+| '-'? INT | '-'? INT
;
fragment INT
: '0' | [1-9] [0-9]*
;
fragment ESC
: '\\' ( ["\\/bfnrt] )
;
fragment HEX
: [0-9a-fA-F]
;
WS
: [ \t\n\r]+ -> skip
;
It doesn't work because the token 'true' matches lexer rule Identifier:
[#0,0:2='foo',<Identifier>,1:0]
[#1,4:4='=',<'='>,1:4]
[#2,6:9='true',<Identifier>,1:6] <== lexed as an Identifier!
[#3,14:13='<EOF>',<EOF>,3:0]
line 1:6 mismatched input 'true' expecting {'null', STRING, BOOLEAN, NUMBER}
Move your definition of Identifier down farther to be among the lexer rules and it works:
NUMBER
: '-'? INT '.' [0-9]+| '-'? INT | '-'? INT
;
Identifier : [_A-Za-z] [_0-9A-Za-z]* ;
Remember stuff at the top trumps stuff at the bottom. In a combined grammar like yours, don't intersperse the lexer rules (which start with a capital letter) with parser rules (which start with a lowercase letter).

Advice on handling an ambiguous operator in an ANTLR 4 grammar

I am writing an antlr grammar file for a dialect of basic. Most of it is either working or I have a good idea of what I need to do next. However, I am not at all sure what I should do with the '=' character which is used for both equality tests as well as assignment.
For example, this is a valid statement
t = (x = 5) And (y = 3)
This evaluates if x is EQUAL to 5, if y is EQUAL to 3 then performs a logical AND on those results and ASSIGNS the result to t.
My grammar will parse this; albeit incorrectly, but I think that will resolve itself once the ambiguity is resolved .
How do I differentiate between the two uses of the '=' character?
1) Should I remove the assignment rule from expression and handle these cases (assignment vs equality test) in my visitor and\or listener implementation during code generation
2) Is there a better way to define the grammar such that it is already sorted out
Would someone be able to simply point me in the right direction as to how best implement this language "feature"?
Also, I have been reading through the Definitive guide to ANTLR4 as well as Language Implementation Patterns looking for a solution to this. It may be there but I have not yet found it.
Below is the full parser grammar. The ASSIGN token is currently set to '='. EQUAL is set to '=='.
parser grammar wlParser;
options { tokenVocab=wlLexer; }
program
: multistatement (NEWLINE multistatement)* NEWLINE?
;
multistatement
: statement (COLON statement)*
;
statement
: declarationStat
| defTypeStat
| assignment
| expression
;
assignment
: lvalue op=ASSIGN expression
;
expression
: <assoc=right> left=expression op=CARAT right=expression #exponentiationExprStat
| (PLUS|MINUS) expression #signExprStat
| IDENTIFIER DATATYPESUFFIX? LPAREN expression RPAREN #arrayIndexExprStat
| left=expression op=(ASTERISK|FSLASH) right=expression #multDivExprStat
| left=expression op=BSLASH right=expression #integerDivExprStat
| left=expression op=KW_MOD right=expression #modulusDivExprStat
| left=expression op=(PLUS|MINUS) right=expression #addSubExprStat
| left=string op=AMPERSAND right=string #stringConcatenation
| left=expression op=(RELATIONALOPERATORS | KW_IS | KW_ISA) right=expression #relationalComparisonExprStat
| left=expression (op=LOGICALOPERATORS right=expression)+ #logicalOrAndExprStat
| op=KW_LIKE patternString #likeExprStat
| LPAREN expression RPAREN #groupingExprStat
| NUMBER #atom
| string #atom
| IDENTIFIER DATATYPESUFFIX? #atom
;
lvalue
: (IDENTIFIER DATATYPESUFFIX?) | (IDENTIFIER DATATYPESUFFIX? LPAREN expression RPAREN)
;
string
: STRING
;
patternString
: DQUOT (QUESTIONMARK | POUND | ASTERISK | LBRACKET BANG? .*? RBRACKET)+ DQUOT
;
referenceType
: DATATYPE
;
declarationStat
: constDecl
| varDecl
;
constDecl
: CONSTDECL? KW_CONST IDENTIFIER EQUAL expression
;
varDecl
: VARDECL (varDeclPart (COMMA varDeclPart)*)? | listDeclPart
;
varDeclPart
: IDENTIFIER DATATYPESUFFIX? ((arrayBounds)? KW_AS DATATYPE (COMMA DATATYPE)*)?
;
listDeclPart
: IDENTIFIER DATATYPESUFFIX? KW_LIST KW_AS DATATYPE
;
arrayBounds
: LPAREN (arrayDimension (COMMA arrayDimension)*)? RPAREN
;
arrayDimension
: INTEGER (KW_TO INTEGER)?
;
defTypeStat
: DEFTYPES DEFTYPERANGE (COMMA DEFTYPERANGE)*
;
This is the lexer grammar.
lexer grammar wlLexer;
NUMBER
: INTEGER
| REAL
| BINARY
| OCTAL
| HEXIDECIMAL
;
RELATIONALOPERATORS
: EQUAL
| NEQUAL
| LT
| LTE
| GT
| GTE
;
LOGICALOPERATORS
: KW_OR
| KW_XOR
| KW_AND
| KW_NOT
| KW_IMP
| KW_EQV
;
INSTANCEOF
: KW_IS
| KW_ISA
;
CONSTDECL
: KW_PUBLIC
| KW_PRIVATE
;
DATATYPE
: KW_BOOLEAN
| KW_BYTE
| KW_INTEGER
| KW_LONG
| KW_SINGLE
| KW_DOUBLE
| KW_CURRENCY
| KW_STRING
;
VARDECL
: KW_DIM
| KW_STATIC
| KW_PUBLIC
| KW_PRIVATE
;
LABEL
: IDENTIFIER COLON
;
DEFTYPERANGE
: [a-zA-Z] MINUS [a-zA-Z]
;
DEFTYPES
: KW_DEFBOOL
| KW_DEFBYTE
| KW_DEFCUR
| KW_DEFDBL
| KW_DEFINT
| KW_DEFLNG
| KW_DEFSNG
| KW_DEFSTR
| KW_DEFVAR
;
DATATYPESUFFIX
: PERCENT
| AMPERSAND
| BANG
| POUND
| AT
| DOLLARSIGN
;
STRING
: (DQUOT (DQUOTESC|.)*? DQUOT)
| (LBRACE (RBRACEESC|.)*? RBRACE)
| (PIPE (PIPESC|.|NEWLINE)*? PIPE)
;
fragment DQUOTESC: '\"\"' ;
fragment RBRACEESC: '}}' ;
fragment PIPESC: '||' ;
INTEGER
: DIGIT+ (E (PLUS|MINUS)? DIGIT+)?
;
REAL
: DIGIT+ PERIOD DIGIT+ (E (PLUS|MINUS)? DIGIT+)?
;
BINARY
: AMPERSAND B BINARYDIGIT+
;
OCTAL
: AMPERSAND O OCTALDIGIT+
;
HEXIDECIMAL
: AMPERSAND H HEXDIGIT+
;
QUESTIONMARK: '?' ;
COLON: ':' ;
ASSIGN: '=';
SEMICOLON: ';' ;
AT: '#' ;
LPAREN: '(' ;
RPAREN: ')' ;
DQUOT: '"' ;
LBRACE: '{' ;
RBRACE: '}' ;
LBRACKET: '[' ;
RBRACKET: ']' ;
CARAT: '^' ;
PLUS: '+' ;
MINUS: '-' ;
ASTERISK: '*' ;
FSLASH: '/' ;
BSLASH: '\\' ;
AMPERSAND: '&' ;
BANG: '!' ;
POUND: '#' ;
DOLLARSIGN: '$' ;
PERCENT: '%' ;
COMMA: ',' ;
APOSTROPHE: '\'' ;
TWOPERIODS: '..' ;
PERIOD: '.' ;
UNDERSCORE: '_' ;
PIPE: '|' ;
NEWLINE: '\r\n' | '\r' | '\n';
EQUAL: '==' ;
NEQUAL: '<>' | '><' ;
LT: '<' ;
LTE: '<=' | '=<';
GT: '>' ;
GTE: '=<'|'<=' ;
KW_AND: A N D ;
KW_BINARY: B I N A R Y ;
KW_BOOLEAN: B O O L E A N ;
KW_BYTE: B Y T E ;
KW_DATATYPE: D A T A T Y P E ;
KW_DATE: D A T E ;
KW_INTEGER: I N T E G E R ;
KW_IS: I S ;
KW_ISA: I S A ;
KW_LIKE: L I K E ;
KW_LONG: L O N G ;
KW_MOD: M O D ;
KW_NOT: N O T ;
KW_TO: T O ;
KW_FALSE: F A L S E ;
KW_TRUE: T R U E ;
KW_SINGLE: S I N G L E ;
KW_DOUBLE: D O U B L E ;
KW_CURRENCY: C U R R E N C Y ;
KW_STRING: S T R I N G ;
fragment BINARYDIGIT: ('0'|'1') ;
fragment OCTALDIGIT: ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7') ;
fragment DIGIT: '0'..'9' ;
fragment HEXDIGIT: ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' | A | B | C | D | E | F) ;
fragment A: ('a'|'A');
fragment B: ('b'|'B');
fragment C: ('c'|'C');
fragment D: ('d'|'D');
fragment E: ('e'|'E');
fragment F: ('f'|'F');
fragment G: ('g'|'G');
fragment H: ('h'|'H');
fragment I: ('i'|'I');
fragment J: ('j'|'J');
fragment K: ('k'|'K');
fragment L: ('l'|'L');
fragment M: ('m'|'M');
fragment N: ('n'|'N');
fragment O: ('o'|'O');
fragment P: ('p'|'P');
fragment Q: ('q'|'Q');
fragment R: ('r'|'R');
fragment S: ('s'|'S');
fragment T: ('t'|'T');
fragment U: ('u'|'U');
fragment V: ('v'|'V');
fragment W: ('w'|'W');
fragment X: ('x'|'X');
fragment Y: ('y'|'Y');
fragment Z: ('z'|'Z');
IDENTIFIER
: [a-zA-Z_][a-zA-Z0-9_~]*
;
LINE_ESCAPE
: (' ' | '\t') UNDERSCORE ('\r'? | '\n')
;
WS
: [ \t] -> skip
;
Take a look at this grammar (Note that this grammar is not supposed to be a grammar for BASIC, it's just an example to show how to disambiguate using "=" for both assignment and equality):
grammar Foo;
program:
(statement | exprOtherThanEquality)*
;
statement:
assignment
;
expr:
equality | exprOtherThanEquality
;
exprOtherThanEquality:
boolAndOr
;
boolAndOr:
atom (BOOL_OP expr)*
;
equality:
atom EQUAL expr
;
assignment:
VAR EQUAL expr ENDL
;
atom:
BOOL |
VAR |
INT |
group
;
group:
LEFT_PARENTH expr RGHT_PARENTH
;
ENDL : ';' ;
LEFT_PARENTH : '(' ;
RGHT_PARENTH : ')' ;
EQUAL : '=' ;
BOOL:
'true' | 'false'
;
BOOL_OP:
'and' | 'or'
;
VAR:
[A-Za-z_]+ [A-Za-z_0-9]*
;
INT:
'-'? [0-9]+
;
WS:
[ \t\r\n] -> skip
;
Here is the parse tree for the input: t = (x = 5) and (y = 2);
In one of the comments above, I asked you if we can assume that the first equal sign on a line always corresponds to an assignment. I retract that assumption slightly... The first equal sign on a line always corresponds to an assignment unless it is contained within parentheses. With the above grammar, this is a valid line: (x = 2). Here is the parse tree:

Resources