ANTLR4 float number

I am trying to use ANTLR4 to parse user input, but I am having a hard time.
I want to get a list of numbers. Here is part of my grammar:
number
: DEC
| FLOAT
| HEX
| BIN
;
FLOAT : DIGIT? '.' DIGIT*;
DEC : DIGIT+ ;
HEX : '0' [xX] ([A-Fa-f] | DIGIT)+ ;
BIN : '0' [bB] [01]+ ;
fragment ALPHA: [a-zA-Z_];
fragment DIGIT : [0-9];
WS : [ ,\t\r\n]+ -> skip;
When the input is 1 .2 3.2, I get 1, .2 and 3.2, as expected.
But if I use 1.2.3, it incorrectly recognizes 1.2 and .3 as two numbers.
How can I change the grammar to fix this?

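The FLOAT rule is the problem: DIGIT? '.' DIGIT* allows at most one digit before the dot and also matches a lone dot. I have updated the number and FLOAT definitions. The grammar below accepts only a single number.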
number
: (FLOAT | DEC | HEX | BIN) EOF
;
FLOAT : DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
DEC : DIGIT+;
HEX : '0' [xX] ([A-Fa-f] | DIGIT)+ ;
BIN : '0' [bB] [01]+ ;
fragment ALPHA: [a-zA-Z_];
fragment DIGIT : [0-9];
WS : [ ,\t\r\n]+ -> skip;
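To see why the lexer on its own still accepts 1.2.3, here is a small Python sketch that imitates ANTLR's tokenization strategy (the longest match wins, ties go to the rule defined first) for the rules above. The regular expressions and the tokenize helper are an approximation written for illustration; they are not something ANTLR generates.

```python
import re

# Regex approximations of the lexer rules above, listed in grammar order.
LEXER_RULES = [
    ("FLOAT", re.compile(r"[0-9]+\.[0-9]*|\.[0-9]+")),
    ("DEC", re.compile(r"[0-9]+")),
    ("HEX", re.compile(r"0[xX][0-9A-Fa-f]+")),
    ("BIN", re.compile(r"0[bB][01]+")),
    ("WS", re.compile(r"[ ,\t\r\n]+")),
]


def tokenize(text):
    """Return (rule, text) pairs; longest match wins, ties go to the earlier rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in LEXER_RULES:
            match = pattern.match(text, pos)
            # Strictly longer only, so an earlier rule keeps a tie.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

In this sketch, 1.2.3 comes out as two perfectly valid FLOAT tokens, 1.2 and .3, so rejecting it has to happen in the parser rule rather than in the lexer.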
In more complex grammars there are other tokens, such as signs and parentheses, so adjacent tokens can still be told apart when whitespace is skipped. Your grammar, however, contains only numbers, so once whitespace is skipped the parser cannot tell whether two numbers were separated in the input. I therefore removed the skip action from the WS rule and made the separators explicit in the parser. The grammar below handles a list of numbers and fails on input such as 1.2.3. When you process the result, look only at the number tokens and ignore the WS tokens.
numbers
: WS? number (WS number)* WS? EOF;
number
: (FLOAT | DEC | HEX | BIN)
;
FLOAT : DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
DEC : DIGIT+;
HEX : '0' [xX] ([A-Fa-f] | DIGIT)+ ;
BIN : '0' [bB] [01]+ ;
fragment ALPHA: [a-zA-Z_];
fragment DIGIT : [0-9];
WS : [ ,\t\r\n]+;
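The effect of the numbers rule can be sketched in Python as follows: every chunk between separators has to be exactly one number token. This is only a regex-based approximation of the grammar above, and parse_numbers is a hypothetical helper name, not part of any ANTLR API.

```python
import re

# One alternative per token rule: FLOAT (two forms), DEC, HEX, BIN.
NUMBER = re.compile(r"[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+|0[xX][0-9A-Fa-f]+|0[bB][01]+")
# Same character set as the WS rule: space, comma, tab, CR, LF.
SEPARATOR = re.compile(r"[ ,\t\r\n]+")


def parse_numbers(text):
    """Return the list of number strings, or raise ValueError on bad input."""
    chunks = [chunk for chunk in SEPARATOR.split(text) if chunk]
    if not chunks:
        raise ValueError("expected at least one number")
    for chunk in chunks:
        # A chunk such as 1.2.3 would lex as two tokens with no WS between
        # them, which the numbers rule does not allow.
        if NUMBER.fullmatch(chunk) is None:
            raise ValueError(f"not a single number: {chunk!r}")
    return chunks


if __name__ == "__main__":
    print(parse_numbers("1 .2 3.2"))
```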

Related

ANTLR Grammar to distinguish words, alphanumeric and numbers

I am still working my way through ANTLR and would appreciate any support for an enhanced version of this grammar.
Here is an input string:
SYS [ErrorCode is not Available] : Transaction ID:
d9d1211e-d273-40e1-bdd0-e4c9a8036ef3 . This can be ignored safely to:
map To Not availble : works in progress
Expected Parser Output:
words -> SYS
specials -> [
words -> ErrorCode
words -> is
....
alphanumeric -> d9d1211e-d273-40e1-bdd0-e4c9a8036ef3
...
The ANTLR grammar I have come up with so far:
grammar Expressions;
expression
:
| numbers? specials? words (numbers? specials? words)*
| numbers words specials
| specials words numbers
| specials numbers words
| words specials numbers
| words numbers specials
| specials specials? (specials specials? )*
| words words? (words words?)*
| numbers numbers? (numbers numbers?)*
;
words
: CHARACTERS
;
numbers
: NUMBERS
;
specials
: AND
| OR
| EQUALS
| ASSIGN
| GT
| LT
| GTE
| LTE
| NOTEQUALS
| NOT
| PLUS
| MINUS
| IF
| COLON
| TLB
| TRB
| FLB
| FRB
| DOT
;
AND : '&&' ;
OR : '||' ;
EQUALS : '==' ;
ASSIGN : '=' ;
GT : '>' ;
LT : '<' ;
GTE : '>=' ;
LTE : '<=' ;
NOTEQUALS : '!=' ;
NOT : '!' ;
PLUS : '+' ;
MINUS : '-' ;
IF : 'if' ;
COLON : ':' ;
TLB : '[' ;
TRB : ']' ;
FLB : ')' ;
FRB : '(' ;
DOT : '.' ;
CHARACTERS
: [a-zA-Z] [a-zA-Z]*
;
NUMBERS
: [0-9]+
| ([0-9]+)? '.' ([0-9])+
;
WS : [ \t\r\n]+ -> skip
;
I also wrote this simple Go program. It replaces every word that contains a digit with a numbered placeholder ($0, $1, ...).
package main
import (
    "fmt"
    "strconv"
    "strings"
)
func main() {
    someString := "ID:8e038845-bd81-4218-9769-8406241fbb34 Operation is failed java.core.CoreRuntimeException: java.core.CoreRuntimeException: The JDBC connection information provided is incomplete"
    // Split on whitespace.
    words := strings.Fields(someString)
    var tokens []string
    var x int
    for _, j := range words {
        if HasDigit(j) {
            // Words containing a digit become $0, $1, ...
            dynamic := "$" + strconv.Itoa(x)
            tokens = append(tokens, dynamic)
            x++
        } else {
            tokens = append(tokens, j)
        }
    }
    tokenized := strings.Join(tokens, " ")
    fmt.Println(tokenized)
}
// HasDigit reports whether s contains at least one ASCII digit.
func HasDigit(s string) bool {
    for _, r := range s {
        if '0' <= r && r <= '9' {
            return true
        }
    }
    return false
}
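For comparison, the classification described under "Expected Parser Output" can be sketched with regular expressions alone. This is a rough Python illustration under my own assumptions: hyphens are allowed inside a token, and anything containing both letters and digits counts as alphanumeric. The names TOKEN, classify and tokenize are made up for this sketch.

```python
import re

# Order matters: decimals first, then letter/digit runs (hyphens allowed
# inside), then any other single non-space character.
TOKEN = re.compile(
    r"[0-9]+\.[0-9]+(?![A-Za-z0-9-])|[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|\S"
)


def classify(token):
    """Map a token to one of: numbers, words, alphanumeric, specials."""
    if re.fullmatch(r"[0-9]+(?:\.[0-9]+)?", token):
        return "numbers"
    if re.fullmatch(r"[A-Za-z]+(?:-[A-Za-z]+)*", token):
        return "words"
    if re.fullmatch(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", token):
        return "alphanumeric"
    return "specials"


def tokenize(text):
    """Return (category, text) pairs in input order; whitespace is dropped."""
    return [(classify(m.group()), m.group()) for m in TOKEN.finditer(text)]


if __name__ == "__main__":
    for category, value in tokenize("SYS [ErrorCode is not Available]"):
        print(category, "->", value)
```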

ANTLR4 Grammar Issue with Decimal Numbers

I'm new to ANTLR and am using ANTLR4 (the 4.7.2 jar file). I'm currently working on an Oracle parser.
I'm having issues with decimal numbers. I have kept only the relevant parts.
My grammar file is shown further below.
Parsing the statement below works fine; ".1" is a valid number in my case.
BEGIN a NUMBER:=.1; END;
I haven't shown the full grammar, but the following are all valid cases for me in Oracle.
a NUMBER:= .1; // with space after operator
a NUMBER:=1.1; // without space after operator
a NUMBER:=1; // without space after operator
a NUMBER:= 3; // with space after operator
Now I need to create a tablespace as below.
CREATE TABLESPACE tbs_01 DATAFILE +DATA/BR/CONTROLFILE/Current.260.750;
Here the digits 260 and 750 are each tokenized together with the preceding dot (as per the definition of NUMERIC_LITERAL). I want them to be two separate numbers separated by a dot (assigned to filenumber and incarnation_number respectively, as shown in the grammar).
How do I do this?
I have tried semantic predicates such as {_input.LA(-1)!='.'}? but they did not work correctly for me.
I tried many other suggested approaches (most solutions were for ANTLR3 and do not work in ANTLR4). Is there a simple way to do this in the lexer? I do not want to write a parser rule to split the decimal digits.
grammar Oracle;
parse
: ( sql_statements | error )* EOF
;
error
: UNEXPECTED_CHAR
{
throw new RuntimeException("UNEXPECTED_CHAR=" + $UNEXPECTED_CHAR.text);
}
;
sql_statements
: 'CREATE' 'TABLESPACE' tablespace_name 'DATAFILE' fully_qualified_file_name ';'
| 'BEGIN' var1 'NUMBER' ':=' num1 ';' 'END' ';'
;
tablespace_name : IDENTIFIER;
fully_qualified_file_name : K_PLUS_SIGN diskgroup_name K_SOLIDUS db_name K_SOLIDUS file_type K_SOLIDUS file_type_tag '.' filenumber '.' incarnation_number;
diskgroup_name : IDENTIFIER;
db_name : IDENTIFIER;
file_type : IDENTIFIER;
file_type_tag : IDENTIFIER;
filenumber : NUMERIC_LITERAL;
incarnation_number : NUMERIC_LITERAL;
var1 : IDENTIFIER;
num1 : NUMERIC_LITERAL;
IDENTIFIER : [a-zA-Z_] ([a-zA-Z] | '$' | '_' | '#' | DIGIT)* ;
K_PLUS_SIGN : '+';
K_SOLIDUS : '/';
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT+ )? ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
| '.' DIGIT+ ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
;
SPACES : [ \u000B\t\r\n] -> skip;
WS : [ \t\r\n]+ -> skip;
UNEXPECTED_CHAR : . ;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
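Your DSL has a natural ambiguity: in some places numbers are integers, and in others they are decimals.
If the DSL provides sufficient guard conditions, ANTLR lexer modes can be used to isolate those places. For example, in the given DSL, decimal numbers appear to always occur between the := and ; guards. Note that modes are only allowed in lexer grammars, so the combined grammar would have to be split into a separate lexer grammar and parser grammar.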
...
K_ASSIGN : ':=' -> pushMode(Decimals);
K_SEMI : ';' ;
NUMERIC_LITERAL : DIGIT+ ;
...
mode Decimals;
D_SEMI : ';' -> type(K_SEMI), popMode ;
NUMERIC
: ( DIGIT+ ( '.' DIGIT+ )? ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
| '.' DIGIT+ ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
) -> type(NUMERIC_LITERAL)
;
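The mode switch can be illustrated with a small Python sketch: outside := ... ; only integers are numbers, inside it decimals are allowed. The regular expressions are stand-ins for the lexer rules, the DOT name is made up for the implicit '.' token, and skipping whitespace inside the decimals mode is my own assumption (the snippet above does not show it).

```python
import re

WS = re.compile(r"[ \t\r\n]+")
MODES = {
    "DEFAULT": [
        ("K_ASSIGN", re.compile(r":=")),
        ("K_SEMI", re.compile(r";")),
        ("K_PLUS_SIGN", re.compile(r"\+")),
        ("K_SOLIDUS", re.compile(r"/")),
        ("DOT", re.compile(r"\.")),
        ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9$_#]*")),
        ("NUMERIC_LITERAL", re.compile(r"[0-9]+")),  # integers only
        ("WS", WS),
    ],
    "DECIMALS": [
        ("K_SEMI", re.compile(r";")),
        ("NUMERIC_LITERAL", re.compile(
            r"(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?[DF]?")),
        ("WS", WS),  # assumption: whitespace is skipped in this mode too
    ],
}


def tokenize(text):
    """Return (type, text) pairs, switching rule sets like pushMode/popMode."""
    tokens, pos, mode_stack = [], 0, ["DEFAULT"]
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in MODES[mode_stack[-1]]:
            match = pattern.match(text, pos)
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"unexpected character at position {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:best_end]))
        if best_name == "K_ASSIGN":
            mode_stack.append("DECIMALS")  # pushMode(Decimals)
        elif best_name == "K_SEMI" and mode_stack[-1] == "DECIMALS":
            mode_stack.pop()  # popMode
        pos = best_end
    return tokens
```

With this split, Current.260.750 yields the two integers 260 and 750 with dot tokens between them, while .1 after := is still a single NUMERIC_LITERAL.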

Why won't my ANTLR rules produce a nice parse tree?

I'm trying to create a grammar that would help me parse a string like this:
[Hello:/c=0.3//a=hi/] [what:/c=0.4/] [are:/c=0.6//a=is/]
This is my grammar:
grammar MyGrammar;
WS: [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
sentence: WORD+;
WORD: '[' WORD_DESCRIPTOR ']';
WORD_DESCRIPTOR: WORD_IDENTIFIER ':' WORD_FEATURES_DESCRIPTORS;
WORD_IDENTIFIER: STRING;
WORD_FEATURES_DESCRIPTORS: WORD_FEATURE_DESCRIPTOR+;
WORD_FEATURE_DESCRIPTOR: '/' WORD_FEATURE_IDENTIFIER '=' WORD_FEATURE_VALUE '/';
WORD_FEATURE_IDENTIFIER:
C_FEATURE | A_FEATURE
;
C_FEATURE: 'c';
A_FEATURE: 'a';
WORD_FEATURE_VALUE: STRING | NUMBER;
fragment LETTER : LOWER | UPPER ;
fragment LOWER : 'a'..'z' ;
fragment UPPER : 'A'..'Z' ;
fragment DIGIT : '0'..'9' ;
fragment INTEGER: DIGIT+ ;
fragment NUMBER: INTEGER (DOT INTEGER)? ;
fragment STRING: LETTER+ ;
fragment DOT: '.' ;
The problem is that the parse tree has only one level.
What am I doing wrong?
Your parse tree shows up the way it does because all tokens are leaf nodes, and all parser rules are internal nodes. Since you only have a single parser rule (sentence) and the rest are all tokens, this is the parse tree:
sentence
/ | | \
/ | | \
WORD WORD WORD WORD ...
You should see tokens as the atoms that your language is built from. Once you start creating tokens like TOKEN : TOKEN_A | TOKEN_B;, then that is often better defined as a parser rule: token : TOKEN_A | TOKEN_B;.
Try something like this instead:
sentence : word+ EOF;
word : '[' word_descriptor ']';
word_descriptor : word_identifier ':' word_feature_descriptors;
word_identifier : STRING;
word_feature_descriptors : word_feature_descriptor+;
word_feature_descriptor : '/' word_feature_identifier '=' word_feature_value '/';
word_feature_value : STRING | NUMBER;
word_feature_identifier : C_FEATURE | A_FEATURE;
C_FEATURE : 'c';
A_FEATURE : 'a';
NUMBER : INTEGER (DOT INTEGER)?;
STRING : LETTER+ ;
WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
fragment LETTER : LOWER | UPPER;
fragment LOWER : [a-z];
fragment UPPER : [A-Z];
fragment DIGIT : [0-9];
fragment INTEGER : DIGIT+;
fragment DOT : '.';
which will create a parse tree for your input in which sentence contains one word node per bracketed group, each with its own word_descriptor subtree.
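The nesting can also be illustrated outside ANTLR. The following is a hand-written Python sketch, not ANTLR output: it builds nested tuples named after the parser rules for the sample input. It flattens a few intermediate rules (word_descriptor, word_feature_descriptors) and only allows whitespace between words, both simplifications of mine.

```python
import re

# Regex stand-ins for the STRING and NUMBER tokens of the grammar above.
VALUE = r"(?:[A-Za-z]+|[0-9]+(?:\.[0-9]+)?)"
FEATURE = re.compile(rf"/([ca])=({VALUE})/")
WORD = re.compile(rf"\[([A-Za-z]+):((?:/[ca]={VALUE}/)+)\]")
SPACES = re.compile(r"\s*")


def parse_sentence(text):
    """Return ('sentence', [word, ...]) with one nested tuple per word."""
    words, pos = [], SPACES.match(text).end()
    while pos < len(text):
        match = WORD.match(text, pos)
        if match is None:
            raise ValueError(f"expected a word at position {pos}")
        # Each /name=value/ group becomes a child node of the word.
        features = [("word_feature_descriptor", name, value)
                    for name, value in FEATURE.findall(match.group(2))]
        words.append(("word", ("word_identifier", match.group(1)), features))
        pos = SPACES.match(text, match.end()).end()
    if not words:
        raise ValueError("a sentence needs at least one word")
    return ("sentence", words)


if __name__ == "__main__":
    print(parse_sentence("[Hello:/c=0.3//a=hi/] [what:/c=0.4/]"))
```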

Can I make my ANTLR4 Lexer discard a character from the input stream?

I'm working on parsing PDF streams. In section 7.3.4.2 on literal string objects, the PDF Reference says that a backslash within a literal string that isn't followed by an end-of-line character, one to three octal digits, or one of the characters "nrtbf()\" should be ignored. Is there a way to get the recover method in my lexer to ignore a backslash in this situation?
Here is my simplified parser:
parser grammar PdfStreamParser;
options { tokenVocab=PdfStreamLexer; }
array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
: NULL
| array
| dictionary
| BOOLEAN
| NUMBER
| string
| NAME
;
content : stat* ;
stat
: tj
;
tj: ((string Tj) | (array TJ)) ; // Show text
Here's the lexer (based on the advice in another answer, I'm not using a separate string mode):
lexer grammar PdfStreamLexer;
Tj: 'Tj' ;
TJ: 'TJ' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
NAME: '/' ID ;
// A sequence of literal characters enclosed in parentheses.
LITERAL_STRING: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | LITERAL_STRING )* ')' ;
// Escape sequences that can occur within a LITERAL_STRING
fragment ESCAPE_SEQUENCE
: '\\' ( [\r\nnrtbf()\\] | [0-7] [0-7]? [0-7]? )
;
HEX_STRING: '<' [0-9A-Za-z]+ '>' ; // Hexadecimal data enclosed in angle brackets
fragment INT: DIGIT+ ; // match 1 or more digits
fragment FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment DIGIT: [0-9] ; // match single digit
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
I can override the recover method in the PdfStreamLexer class and get notified when the LexerNoViableAltException occurs, but I'm not sure how to (or if it's possible to) ignore the backslash and continue on with the LITERAL_STRING tokenization.
To be able to skip part of the string, you'll need to use lexical modes. Here's a quick demo:
lexer grammar DemoLexer;
STRING_OPEN
: '(' -> pushMode(STRING_MODE)
;
SPACES
: [ \t\r\n] -> skip
;
OTHER
: .
;
mode STRING_MODE;
STRING_CLOSE
: ')' -> popMode
;
ESCAPE
: '\\' ( [\r\nnrtbf()\\] | [0-7] [0-7]? [0-7]? )
;
STRING_PART
: ~[\\()]
;
NESTED_STRING_OPEN
: '(' -> type(STRING_OPEN), pushMode(STRING_MODE)
;
IGNORED_ESCAPE
: '\\' . -> skip
;
which can be used in the parser as follows:
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
parse
: ( string | OTHER )* EOF
;
string
: STRING_OPEN ( ESCAPE | STRING_PART | string )* STRING_CLOSE
;
If you now parse the string FU(abc(def)\#\))BAR, you will get the following parse tree:
As you can see, the \) is left in the tree, but \# is omitted.
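Once the string tokens are available, the escape handling itself can be done in ordinary code. Below is a minimal Python sketch of the rule quoted in the question, as I read it: recognized escapes are translated, and for an unrecognized escape only the backslash is dropped. Note that this differs from IGNORED_ESCAPE above, which also skips the character after the backslash. The function name unescape_literal is hypothetical.

```python
import re

SIMPLE_ESCAPES = {
    "n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
    "(": "(", ")": ")", "\\": "\\",
}

# A backslash, optionally followed by an end-of-line, a one-character
# escape, or one to three octal digits.
BACKSLASH = re.compile(r"\\(?:(\r\n|\r|\n)|([nrtbf()\\])|([0-7]{1,3}))?")


def unescape_literal(body):
    """Decode the text between the outer parentheses of a literal string."""
    def replace(match):
        eol, simple, octal = match.groups()
        if eol is not None:
            return ""  # backslash + end-of-line continues the line
        if simple is not None:
            return SIMPLE_ESCAPES[simple]
        if octal is not None:
            return chr(int(octal, 8) & 0xFF)  # high-order overflow ignored
        return ""  # unrecognized escape: drop only the backslash
    return BACKSLASH.sub(replace, body)
```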

ANTLR4 strange behavior with a simple rule

I am defining a simple rule in ANTLR4 for the C# target, shown below:
numberliteral: NUMBER;
NUMBER : '-'? INT '.' INT EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -4 12
;
fragment INT : [0] | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
The results are weird:
Anything that matches the first alternative of NUMBER is recognized correctly, e.g. 1.2, 1.2e+1, -1.2.
The other two alternatives of NUMBER only work if there is a '-' sign in front of the number, e.g. -1e+2 or -2. Positive numbers such as 2 or 2e+3 are not recognized.
Does anyone have any idea what is going wrong here?
Thanks
It works for me (in the ANTLRWorks NetBeans plugin):
Grammar:
grammar simpleGrammar;
start: numberliteral*;
numberliteral: NUMBER;
NUMBER : '-'? INT '.' INT EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -4 12
;
fragment INT : [0] | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
WS : [ \t\n\r] -> skip;
Sample:
1.35
1.35E-9
0.3
-4.5
1e10
-3e4
-4
12
Result:
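As a rough cross-check outside ANTLR, the NUMBER rule can be transcribed into a Python regular expression and run against the samples. A regex is not an ANTLR lexer, so this only shows that the rule text itself accepts positive numbers; it says nothing about other rules in a larger grammar. The names INT, EXP, NUMBER and is_number are chosen for this sketch.

```python
import re

INT = r"(?:0|[1-9][0-9]*)"    # fragment INT: no leading zeros
EXP = rf"(?:[Ee][+-]?{INT})"  # fragment EXP
# The three alternatives of NUMBER, in the same order as in the grammar.
NUMBER = re.compile(rf"-?{INT}\.{INT}{EXP}?|-?{INT}{EXP}|-?{INT}")

SAMPLES = ["1.35", "1.35E-9", "0.3", "-4.5", "1e10", "-3e4", "-4", "12"]


def is_number(text):
    """True if the whole text matches the transcribed NUMBER rule."""
    return NUMBER.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in SAMPLES + ["2", "2e+3"]:
        print(sample, is_number(sample))
```

All samples, including the positive 2 and 2e+3 from the question, match in this sketch, which would be consistent with the cause lying outside the snippet shown.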
