Antlr String can not match - string

I'm working on a little antlr problem. In my small custom DSL I want to be able to do a compare action between fields. I've got three fieldtypes (String, Int, Identifier) the Identifier is a variable name. I made a big specification but i've reduced my problem to a smaller grammer.
The problem is that when I try to use the String grammar notation, which you can add to your grammer using antlrworks, my Strings are seen as an identifier. This is my grammar:
grammar test;
x
: 'FROM' field_value EOF
;
field_value
: STRING
| INT
| identifier
;
identifier
: ID (('.' '`' ID '`')|('.' ID))?
| '`' ID '`' (('.' '`' ID '`')|('.' ID))?
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INT : '0'..'9'+
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
When I try to parse the following string FROM "Hello!" it returns a parsetree like this
<grammar test>
|
x
|
----------------------------
| | |
FROM field_value !
|
identifier
|
"Hello
It parses what I think should be a string to an identifier thought my identifier doesn't say anything about double quoets so it shouldn't match.
Besides I think my definition for a string is wrong, even though antlrworks generated it for me. Does anybody know why this happens?
Cheers!

There's nothing wrong with your grammar. The things that is messing it up for you is most probably the fact that you're using ANTLRWorks' interpreter. Don't. The interpreter doesn't work well.
Use ANTLRWorks' debugger instead (in your grammar, press CTRL + D), which works like a charm. This is what the debugger shows after parsing FROM "Hello!":

Related

How to use the reserved words inside the string in ANTLR4?

I am a newbie to ANTLR4 and language compilers. I am working on building a language compiler using ANTLR4 Java. I have a small problem with parsing strings. The reserved words/ Tokens are getting matched instead of string. For eg: IF is a keyword token in my lexer but how to use "if" as a string?
Lexer file:
lexer grammar testgrammar;
IF : I F;
ENDIF : E N D I F;
ELSE : E L S E;
CASE : C A S E;
ENDCASE : E N D C A S E;
BREAK : B R E A K;
SWITCH : S W I T C H;
SUBSTRING : S U B S T R I N G;
COMMA : ',' ;
SEMI : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
DOT : '.' ;// ('.' {$setType(DOTDOT);})? ;
LCURLY : '{' ;
RCURLY : '}' ;
AND : '&&' ;
OR : '||' ;
DOUBLEQUOTES : '"' ;
COMPARATOR : '=='| '>=' | '>' | '<' | '<=' | '!=' ;
SYMBOLS : '§' | '$' | '%' | '/' | '=' | '?' | '#' | '_' | '#' | '€';
LETTER : [A-Za-z\u00e4\u00c4\u00d6\u00f6\u00dc\u00fc\u00df];
NUMERICVALUE : NUMBER ('.' NUMBER)?;
STRING_LITERAL : '\'' ('\'\'' | ~('\''))* '\'';
NOTCONDITION : NOT;
OPERATORS : OPERATOR;
COMMENT : (('/*' .*? '*/') | ('//' ~[\r\n]*)) -> skip;
WS : (' ' | '\t' | '\r' | '\n')+ -> skip;
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment NUMBER:[0-9]+;
fragment OPERATOR: ('+'|'-'|'&'|'*'|'~');
fragment NOT: ('!');
grammar:
parser grammar testParser;
symbolCharacters: (SYMBOLS | operators) ;
word:
( symbolCharacters | LETTER )+
;
wordList:
word+
;
I am not supposed share full grammar. But i have shared enough information i guess. I can understand that the words are formed from LETTERS and Symbol characters. One workaround i can do is making word rule like:
word:
( symbolCharacters | LETTER | IF | SWITCH | CASE | ELSE | BREAK )+
;
I have a lot of tokens. I dont want to add everything individually. Is there any other nice way to accomplish this?
Valid expression
Error expression
How to make the parser ignore the keywords inside the string?
Your same grammar does not have the problem you describe:
➜ antlr4 testgrammar.g4
➜ javac *.java
➜ echo "if 'if' endif" | grun testgrammar tokens -tokens
[#0,0:1='if',<IF>,1:0]
[#1,3:6=''if'',<STRING_LITERAL>,1:3]
[#2,8:12='endif',<ENDIF>,1:8]
[#3,14:13='<EOF>',<EOF>,2:0]
(perhaps you have inadvertently "corrected" the problem as you trimmed your grammar down, so I'll elaborate a bit.)
In short, during the lexing/tokenization phase of ANTLR parsing your input, ANTLR will, naturally, attempt to match you Lexer rules. If ANTLR finds a match of multiple rules for the current characters of your input stream, it follows two rules to determine a "winner".
If a rule matches a longer sequence of input characters, then that rule will be used.
If two rules match the same number of input characters, then the rule appearing first in your grammar will be used.
In your case, neither really comes into play as the grammar, when it reaches the ', will attempt to complete the STRING_LITERAL rule, and will find a match for the characters 'if'. It will never even attempt to match you IF lexer rule.
BTW, I did have to correct the symbolCharacters parser rule to be
symbolCharacters: (SYMBOLS | OPERATORS);

Not Able to Recognize Strings and Characters in ANTLr

In my ANTLr code, we should be able to recognize strings, characters, hexadecimal numbers etc.
However, in my code, when I test it like this:
grun A1_lexer tokens -tokens test.txt
With my test.txt file being a simple string, such as "pineapple", it is unable to recognize the different tokens.
In my lexer, I define the following helper tokens:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: ['a'-'z'] | ['A' - 'Z'] | ['0' - '9'] ;
fragment Digit: ['0'-'9'] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '\"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
And I define the following tokens:
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Id: Alpha Alpha_num* ;
I run it like this:
grun A1_lexer tokens -tokens test.txt
And it outputs this:
line 1:0 token recognition error at: '"'
line 1:1 token recognition error at: 'p'
line 1:2 token recognition error at: 'ine'
line 1:6 token recognition error at: 'p'
line 1:7 token recognition error at: 'p'
line 1:8 token recognition error at: 'l'
line 1:9 token recognition error at: 'e"'
[#0,5:5='a',<Id>,1:5]
[#1,12:11='<EOF>',<EOF>,2:0]
I am really wondering what the problem is and how I could fix it.
Thanks.
UPDATE 1:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '\"' ;
I have updated the code, I got rid of the un-necessary single quotes in my Char classification. However, I get the same output as before.
UPDATE 2:
Even when I make the changes suggested, I still get the same error. I believed the problem is that I am not recompiling, but I am. These are the steps that I take to recompile.
antlr4 A1_lexer.g4
javac A1_lexer*.java
chmod a+x build.sh
./build.sh
grun A1_lexer tokens -tokens test.txt
With my build.sh file looking like this:
#!/bin/bash
FILE="A1_lexer"
ANTLR=$(echo $CLASSPATH | tr ':' '\n' | grep -m 1 "antlr-4.7.1-
complete.jar")
java -jar $ANTLR $FILE.g4
javac $FILE*.java
Even when I recompile, my antlr code is still unable to recognize the tokens.
My code is also now like this:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
fragment Eq_op: '==' | '!=' ;
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Decimal_literal : Digit+ ;
Id: Alpha Alpha_num* ;
UPDATE 3:
Grammar:
program
:'class Program {'field_decl* method_decl*'}'
field_decl
: type (id | id'['int_literal']') ( ',' id | id'['int_literal']')*';'
| type id '=' literal ';'
method_decl
: (type | 'void') id'('( (type id) ( ','type id)*)? ')'block
block
: '{'var_decl* statement*'}'
var_decl
: type id(','id)* ';'
type
: 'int'
| 'boolean'
statement
: location assign_op expr';'
| method_call';'
| 'if ('expr')' block ('else' block )?
| 'switch' expr '{'('case' literal ':' statement*)+'}'
| 'while (' expr ')' statement
| 'return' ( expr )? ';'
| 'break ;'
| 'continue ;'
| block
assign_op
: '='
| '+='
| '-='
method_call
: method_name '(' (expr ( ',' expr )*)? ')'
| 'callout (' string_literal ( ',' callout_arg )* ')'
method_name
: id
location
: id
| id '[' expr ']'
expr
: location
| method_call
| literal
| expr bin_op expr
| '-' expr
| '!' expr
| '(' expr ')'
callout_arg
: expr
| string_literal
bin_op
: arith_op
| rel_op
| eq_op
| cond_op
arith_op
: '+'
| '-'
| '*'
| '/'
| '%'
rel_op
: '<'
| '>'
| '<='
| '>='
eq_op
: '=='
| '!='
cond_op
: '&&'
| '||'
literal
: int_literal
| char_literal
| bool_literal
id
: alpha alpha_num*
alpha
: ['a'-'z''A'-'Z''_']
alpha_num
: alpha
| digit
digit
: ['0'-'9']
hex_digit
: digit
| ['a'-'f''A'-'F']
int_literal
: decimal_literal
| hex_literal
decimal_literal
: digit+
hex_literal
: '0x' hex_digit+
bool_literal
: 'true'
| 'false'
char_literal
: '‘'char'’'
string_literal
: '“'char*'”'
test.txt :
"pineapple"
A1_lexer:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
fragment Eq_op: '==' | '!=' ;
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Decimal_literal : Digit+ ;
Id: Alpha Alpha_num* ;
What I Write in Terminal:
grun A1_lexer tokens -tokens test.txt
Output in Terminal:
line 1:0 token recognition error at: '"'
line 1:1 token recognition error at: 'p'
line 1:2 token recognition error at: 'ine'
line 1:6 token recognition error at: 'p'
line 1:7 token recognition error at: 'p'
line 1:8 token recognition error at: 'l'
line 1:9 token recognition error at: 'e"'
[#0,5:5='a',<Id>,1:5]
[#1,12:11='<EOF>',<EOF>,2:0]
I am really not sure why this is happening.
fragment Char: ['a'-'z'] | ['A' - 'Z'] | ['0' - '9']
['a'-'z'] doesn't mean "a to z", it means "a single quote, or a, or a single quote to a single quote, or z, or a single quote", which simplifies to just "a single quote, a or z". What you want is just [a-z] without the quotes and the same applies to the other character classes as well - except that they also contain spaces, so it's "single quote, A, single quote, space to space, single quote, Z, or single quote" etc. Also you don't need to "or" character classes, you can just write everything in one character class like this: [a-zA-Z0-9] (like you already did for the Alpha rule).
The same applies to the Digit rule as well.
Note that it's a bit unusual to only allow these specific characters inside quotes. Usually you'd allow everything that isn't an unescaped quote or an invalid escape sequence. But of course that all depends on the language you're parsing.

Antlr - not able to use Lexer token if it is assigned to another token in grammar

"In this example if i use 'MID('int = VALUE')' then it works fine. I want MID to be validated for INT value but when i use INT it gives error "mismatched input '9' expecting INT.
I am using antlr-4.2-complete version of antlr.
I am not able to understand the exact issue?
grammar DIExpression;
r: 'MID('int_val = INT')'
{
System.out.println("value equals: "+ $int_val.text);
};
VALUE : INT | STRING;
STRING : [0-9a-zA-Z_]+;
INT : [0-9]+;
WS : [ \t\r\n]+ -> skip ;
UPDATE:
I am giving input like MID(9)
The issue is that your rules are ambiguous. What should '9' be? It could be a STRING or and INT. I would highly recommend to use the predefined literals for STRING, WS, COMMENT and NEWLINE provided by the antlr community.
Be aware, this is antlr3 code! As you can see a String is sth in quotes (I guess that´s also what you want)
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;

ANTLR4 - Quoted and non-quoted strings to support at the same time

What I am trying to achieve is to develop a grammar that would parse the following two lines in the same way:
1. "Bucket 1" = "1 item placed", "3 items removed"
2. Bucket 2 = 2 items placed, 6 items removed
So, a line starts with an ordinal number, then an element name goes - 'Bucket 1' and 'Bucket 2'. Also, a bucket has one or more values separated by a comma.
The issue is that the data can come enclosed with double quotes (line #1 above) and without the quotes (as shown in line #2). I can figure grammars for each of the lines separately but can not develop a grammar that would parse them both.
grammar Test;
doc : element+ EOF;
element: ordinal element_name EQUAL element_values '\n';
element_name : STRING ;
element_values: STRING (COMMA STRING)+;
ordinal : NUMBER ;
COMMA: ',' ;
EQUAL: '=' ;
NUMBER : ('0'..'9')+ ;
STRING : '"' (EscapeSequence | ~('\\'|'"') )* '"' ;
// STRING : ('"' (EscapeSequence | ~('\\'|'"') )* '"') | ~('"'|',')+ ;
fragment
EscapeSequence
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| OctalEscape
;
fragment
OctalEscape
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
WS : [ .\t]+ -> skip ;
I played with STRING rule above trying to make it handle both cases but with no luck. If I enable the commented out version of STRING rule, then I get a line 1:0 missing NUMBER at '4. ' parser error which is confusing as I thought that NUMBER rule should be caught since it goes first.
Is that a wrong assumption? Can you please explain why it does not get caught?

Antlr4 Grammar/Rules - issue with solving BASIC print variable

The scenario is that I want to create a BASIC (high level) language using ANTRL4.
The test input below is the creation of a variable called C$ and assigning an integer value. The value assignment works. The print statement works except where concatenating the variable to it:-
************ TEST CASE ****************
$C=15;
print "dangerdanger!"; # print works
print "Number of GB left=" + $C;
Using a Parse Tree Inspector I can see assignments are working fine but when it gets to the identification of the variable in the string it seems there is a mismatched input '+' expecting STMTEND.
I wondered if anyone could help me out here and see what adjustment I need to make to my rules and grammar to solve this issue.
Many thanks in advance.
Kevin
PS. As a side issue I would rather have C$ than $C but early days...
********RULES************
VARNAME : '$'('A'..'Z')*
;
CONCAT : '+'
;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+
;
STRING : SQUOTED_STRING (CONCAT SQUOTED_STRING | CONCAT VARNAME)*
| DQUOTED_STRING (CONCAT DQUOTED_STRING | CONCAT VARNAME)*
;
fragment SQUOTED_STRING : '\'' (~['])* '\''
;
fragment DQUOTED_STRING
: '"' ( ESC_SEQ| ~('\\'|'"') )* '"'
;
fragment ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment HEX_DIGIT : '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+
;
fragment UNICODE_ESC : '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
SEMICOLON : ';'
;
NEWLINE : '\r'?'\n'
************GRAMMAR************
print_command
: PRINT STRING STMTEND #printCommandLabel
;
assignment
: VARNAME EQUALS INTEGER STMTEND #assignInteger
| VARNAME EQUALS STRING STMTEND #assignString
;
You shouldn't try to create concat-expressions inside your lexer: that is the responsibility of the parser. Something like this should do it:
print_command
: PRINT STRING STMTEND #printCommandLabel
;
assignment
: VARNAME EQUALS expression STMTEND
;
expression
: expression CONCAT expression
| INTEGER
| STRING
| VARNAME
;
CONCAT
: '+'
;
VARNAME
: '$'('A'..'Z')*
;
STMTEND
: SEMICOLON NEWLINE*
| NEWLINE+
;
STRING
: SQUOTED_STRING
| DQUOTED_STRING
;
fragment SQUOTED_STRING
: '\'' (~['])* '\''
;
fragment DQUOTED_STRING
: '"' ( ESC_SEQ| ~('\\'|'"') )* '"'
;
fragment ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment HEX_DIGIT : '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+;
fragment UNICODE_ESC : '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r'?'\n';

Resources