ANTLRv4 - How to identify unquoted quote within string

How would I recognise the string "Aren't you a string?" without getting a token recognition error at the apostrophe?
Here is the relevant part of my lexer grammar:
STRING_LITERAL : '"' STRING? '"';
fragment STRING : STRING_CHARACTER+;
fragment STRING_CHARACTER : ~["'\\] | ESCSEQ;
fragment ESCSEQ : '\\' [tnfr"'\\];

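The negated set ~["'\\] excludes the apostrophe, so an unescaped ' inside the string cannot be matched. Remove the single quote from the set: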
STRING_LITERAL : '"' STRING? '"';
fragment STRING : STRING_CHARACTER+;
fragment STRING_CHARACTER : ~["\\] | ESCSEQ;
fragment ESCSEQ : '\\' [tnfr"'\\];
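To see why the apostrophe matters, the two versions of the rule can be approximated with regular expressions. This is only a sketch in Python, not ANTLR itself: the names BEFORE, AFTER and accepts are made up for illustration, and the patterns are hand-translated from the fragments above.

```python
import re

# Hand-translated approximation of the original rule:
#   STRING_CHARACTER : ~["'\\] | ESCSEQ;
BEFORE = re.compile(r'"(?:[^"\'\\]|\\[tnfr"\'\\])*"')

# Same rule with the single quote removed from the negated set:
#   STRING_CHARACTER : ~["\\] | ESCSEQ;
AFTER = re.compile(r'"(?:[^"\\]|\\[tnfr"\'\\])*"')

# The input from the question, including its surrounding double quotes.
sample = '"Aren\'t you a string?"'


def accepts(pattern, text):
    """Return True when the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None


print(accepts(BEFORE, sample))  # the bare apostrophe is rejected
print(accepts(AFTER, sample))   # the bare apostrophe is now an ordinary character
```

The escaped form \' is accepted by both versions, because ESCSEQ is unchanged.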

Related

Why won't my ANTLR rules produce a nice parse tree?

I'm trying to create a grammar that would help me parse a string like this:
[Hello:/c=0.3//a=hi/] [what:/c=0.4/] [are:/c=0.6//a=is/]
This is my grammar:
grammar MyGrammar;
WS: [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
sentence: WORD+;
WORD: '[' WORD_DESCRIPTOR ']';
WORD_DESCRIPTOR: WORD_IDENTIFIER ':' WORD_FEATURES_DESCRIPTORS;
WORD_IDENTIFIER: STRING;
WORD_FEATURES_DESCRIPTORS: WORD_FEATURE_DESCRIPTOR+;
WORD_FEATURE_DESCRIPTOR: '/' WORD_FEATURE_IDENTIFIER '=' WORD_FEATURE_VALUE '/';
WORD_FEATURE_IDENTIFIER:
C_FEATURE | A_FEATURE
;
C_FEATURE: 'c';
A_FEATURE: 'a';
WORD_FEATURE_VALUE: STRING | NUMBER;
fragment LETTER : LOWER | UPPER ;
fragment LOWER : 'a'..'z' ;
fragment UPPER : 'A'..'Z' ;
fragment DIGIT : '0'..'9' ;
fragment INTEGER: DIGIT+ ;
fragment NUMBER: INTEGER (DOT INTEGER)? ;
fragment STRING: LETTER+ ;
fragment DOT: '.' ;
The problem is that the parse tree has only one level.
What am I doing wrong?

Your parse tree shows up the way it does because all tokens are leaf nodes, and all parser rules are internal nodes. Since you only have a single parser rule (sentence) and the rest are all tokens, this is the parse tree:
          sentence
        /  |    |  \
       /   |    |   \
   WORD  WORD  WORD  WORD ...
Think of tokens as the atoms that your language is built from. A token defined like TOKEN : TOKEN_A | TOKEN_B; is usually better expressed as a parser rule: token : TOKEN_A | TOKEN_B;.
Try something like this instead:
sentence : word+ EOF;
word : '[' word_descriptor ']';
word_descriptor : word_identifier ':' word_feature_descriptors;
word_identifier : STRING;
word_feature_descriptors : word_feature_descriptor+;
word_feature_descriptor : '/' word_feature_identifier '=' word_feature_value '/';
word_feature_value : STRING | NUMBER;
word_feature_identifier : C_FEATURE | A_FEATURE;
C_FEATURE : 'c';
A_FEATURE : 'a';
NUMBER : INTEGER (DOT INTEGER)?;
STRING : LETTER+ ;
WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
fragment LETTER : LOWER | UPPER;
fragment LOWER : [a-z];
fragment UPPER : [A-Z];
fragment DIGIT : [0-9];
fragment INTEGER : DIGIT+;
fragment DOT : '.';
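With these parser rules, the parse tree for your input gets an internal node for every rule (word, word_descriptor, word_feature_descriptor, and so on) instead of a single flat level.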

ANTLR4 Grammar picks up 'and' and 'or' in variable names

Please help me with my ANTLR4 Grammar.
Sample "formel":
(Arbejde.ArbejderIKommuneNr=860) and (Arbejde.ErIArbejde = 'J') &
(Arbejde.ArbejdsTimerPrUge = 40)
(Ansogeren.BorIKommunen = 'J') and (BeregnDato(Ansogeren.Fodselsdato;
'+62Å') < DagsDato)
(Arb.BorI=860)
My problem is that Arb.BorI=860 is not handled correctly. I get this error:
Error: no viable alternative at input '(Arb.Bor' at linenr/position: 1/6 \r\nException: Der blev udløst en undtagelse af typen 'Antlr4.Runtime.NoViableAltException
Please note that Arb.BorI contains the word 'or'.
I think my problem is that 'booleanOps' in the grammar overrides 'datakildefelt'.
How do I correct my grammar? I am stuck, so any help will be appreciated.
My Grammar:
grammar UnikFormel;
formel : boolExpression # BooleanExpr
| expression # Expr
| '(' formel ')' # Parentes;
boolExpression : ( '(' expression ')' ) ( booleanOps '(' expression ')' )+;
expression : element compareOps element # Compare;
element : datakildefelt # DatakildeId
| function # Funktion
| int # Integer
| decimal # Real
| string # Text;
datakildefelt : datakilde '.' felt;
datakilde : identifyer;
felt : identifyer;
function : funktionsnavn ('(' funcParameters? ')')?;
funktionsnavn : identifyer;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
identifyer : LETTER+;
int : DIGIT+;
decimal : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
string : QUOTE .*? QUOTE;
booleanOps : (AND | OR);
compareOps : (LT | GT | EQ | GTEQ | LTEQ);
QUOTE : '\'';
OPERATOR: '+';
DIGIT: [0-9];
LETTER: [a-åA-Å];
MUL : '*';
DIV : '/';
ADD : '+';
SUB : '-';
GT : '>';
LT : '<';
EQ : '=';
GTEQ : '>=';
LTEQ : '<=';
AND : '&' | 'and';
OR : '?' | 'or';
WS : ' '+ -> skip;

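The lexer always takes the longest match; rule order only matters when two rules match the same number of characters, in which case the rule defined first wins. Keyword rules such as AND and OR must therefore be defined before any rule that can match the same text, such as an identifier rule. GTEQ and LTEQ are not affected by ordering, because '>=' and '<=' are longer matches than '>' and '<'.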
EDIT
Additionally, you should make identifyer a lexer rule, i.e. start its name with a capital letter (IDENTIFIER or Identifier). The same goes for int, decimal and string. Input is initially a stream of characters and is first processed into a stream of tokens, using only lexer rules. At this point parser rules (those starting with a lowercase letter) do not come into play yet. So, to make "BorI" parse as a single entity (token), you need to create a lexer rule that matches identifiers. Currently it would be parsed as 3 tokens: LETTER (B) OR (or) LETTER (I).
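The tokenization described above can be imitated with a toy lexer. The following is a rough Python sketch, not the ANTLR runtime: tokenize, LETTER_RULES and ID_RULES are hypothetical names, the letter ranges are reduced to ASCII, and the only behaviour modelled is "longest match wins, ties go to the rule listed first".

```python
import re


def tokenize(text, rules):
    """Toy lexer: the longest match wins, ties go to the rule listed first."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_text = None, ""
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if match and len(match.group()) > len(best_text):
                best_name, best_text = name, match.group()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens


# Single-character LETTER tokens, as in the question's grammar (ASCII only here).
LETTER_RULES = [("LETTER", r"[a-zA-Z]"), ("AND", r"&|and"), ("OR", r"\?|or")]

# Keywords first, followed by a multi-character identifier rule.
ID_RULES = [("AND", r"&|and"), ("OR", r"\?|or"), ("ID", r"[a-zA-Z][a-zA-Z0-9]*")]

print(tokenize("BorI", LETTER_RULES))
print(tokenize("BorI", ID_RULES))
print(tokenize("or", ID_RULES))
```

With the single-character rule, "BorI" splits into LETTER, OR, LETTER. With the identifier rule it becomes one ID token, while a standalone "or" is still an OR token because OR is listed before ID and both match two characters.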

Thanks for your help. There were multiple problems. Reading the ANTLR4 book and using "TestRig -gui" got me on the right track. The working grammar is:
grammar UnikFormel;
formel : '(' formel ')' # Parentes
| expression # Expr
| boolExpression # BooleanExpr
;
boolExpression : '(' expression ')' ( booleanOps '(' expression ')' )+
| '(' formel ')' ( booleanOps '(' formel ')' )+;
expression : element compareOps element # Compare;
datakildefelt : ID '.' ID;
function : ID ('(' funcParameters? ')')?;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
element : datakildefelt # DatakildeId
| function # Funktion
| INT # Integer
| DECIMAL # Real
| STRING # Text;
booleanOps : (AND | OR);
compareOps : ( GTEQ | LTEQ | LT | GT | EQ );
AND : '&' | 'and';
OR : '?' | 'or';
GTEQ : '>=';
LTEQ : '<=';
GT : '>';
LT : '<';
EQ : '=';
ID : LETTER ( LETTER | DIGIT)*;
INT : DIGIT+;
DECIMAL : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
STRING : QUOTE .*? QUOTE;
fragment QUOTE : '\'';
fragment DIGIT: [0-9];
fragment LETTER: [a-åA-Å];
WS : [ \t\r\n]+ -> skip;

Escape quote in ANTLR4?

I have this grammar:
grammar Hello;
STRING : '"' ( ESC | ~[\r\n"])* '"' ;
fragment ESC : '\\"' ;
r : STRING;
When I input this string:
"my name is : \" StackOverflow \" "
I want the result to be:
"my name is : "StackOverflow" "
But when I test it, the result is not what I expect.
What should I do to fix it? Your help would be appreciated.

There is no way to handle it in your grammar without targeting a specific language. You either strip the slashes when walking your parse tree in a listener or visitor, or embed target-specific code in your grammar.
If Java is your target, you could do this:
STRING
: '"' ( ESC | ~[\r\n"] )* '"'
{
String text = getText();
text = text.substring(1, text.length() - 1);
text = text.replaceAll("\\\\(.)", "$1");
setText(text);
}
;
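The other option, stripping the slashes outside the grammar, comes down to the same two steps wherever it runs. Here is a minimal Python sketch of that post-processing; unescape_string_token is a hypothetical helper name, and it assumes it receives the raw text of a STRING token, quotes included, for example from a listener or visitor.

```python
import re


def unescape_string_token(token_text):
    """Drop the surrounding quotes, then replace each backslash escape
    with the character that follows the backslash."""
    inner = token_text[1:-1]
    return re.sub(r"\\(.)", r"\1", inner)


# Raw token text as the lexer would hand it over, backslashes included.
raw = r'"my name is : \" StackOverflow \" "'
print(unescape_string_token(raw))
```

This mirrors the Java action above: the substring call removes the quotes and the regular expression drops the backslashes. It only rewrites the text; the grammar is left untouched.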

Antlr4 Grammar/Rules - issue with solving BASIC print variable

The scenario is that I want to create a BASIC (high level) language using ANTLR4.
The test input below creates a variable called $C and assigns it an integer value. The value assignment works. The print statement works except when concatenating the variable to a string:
************ TEST CASE ****************
$C=15;
print "dangerdanger!"; # print works
print "Number of GB left=" + $C;
Using a Parse Tree Inspector I can see assignments are working fine but when it gets to the identification of the variable in the string it seems there is a mismatched input '+' expecting STMTEND.
I wondered if anyone could help me out here and see what adjustment I need to make to my rules and grammar to solve this issue.
Many thanks in advance.
Kevin
PS. As a side issue I would rather have C$ than $C but early days...
********RULES************
VARNAME : '$'('A'..'Z')*
;
CONCAT : '+'
;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+
;
STRING : SQUOTED_STRING (CONCAT SQUOTED_STRING | CONCAT VARNAME)*
| DQUOTED_STRING (CONCAT DQUOTED_STRING | CONCAT VARNAME)*
;
fragment SQUOTED_STRING : '\'' (~['])* '\''
;
fragment DQUOTED_STRING
: '"' ( ESC_SEQ| ~('\\'|'"') )* '"'
;
fragment ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment HEX_DIGIT : '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+
;
fragment UNICODE_ESC : '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
SEMICOLON : ';'
;
NEWLINE : '\r'?'\n'
;
************GRAMMAR************
print_command
: PRINT STRING STMTEND #printCommandLabel
;
assignment
: VARNAME EQUALS INTEGER STMTEND #assignInteger
| VARNAME EQUALS STRING STMTEND #assignString
;

You shouldn't try to create concat-expressions inside your lexer: that is the responsibility of the parser. Something like this should do it:
print_command
: PRINT expression STMTEND #printCommandLabel
;
assignment
: VARNAME EQUALS expression STMTEND
;
expression
: expression CONCAT expression
| INTEGER
| STRING
| VARNAME
;
CONCAT
: '+'
;
VARNAME
: '$'('A'..'Z')*
;
STMTEND
: SEMICOLON NEWLINE*
| NEWLINE+
;
STRING
: SQUOTED_STRING
| DQUOTED_STRING
;
fragment SQUOTED_STRING
: '\'' (~['])* '\''
;
fragment DQUOTED_STRING
: '"' ( ESC_SEQ| ~('\\'|'"') )* '"'
;
fragment ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment HEX_DIGIT : '0'..'9' | 'a'..'f' | 'A'..'F';
fragment UNICODE_ESC : '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r'?'\n';
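Once concatenation is a parser-level expression, evaluating it is a simple walk over operands separated by CONCAT. The sketch below imitates that in Python without ANTLR, under several assumptions: TOKEN and evaluate are made-up names, only double-quoted strings without escapes are handled, and variable values come from a plain dictionary.

```python
import re

# One alternative per token type; the names mirror the lexer rules above.
TOKEN = re.compile(
    r'\s*(?:(?P<STRING>"[^"\\]*")'
    r'|(?P<VARNAME>\$[A-Z]*)'
    r'|(?P<INTEGER>[0-9]+)'
    r'|(?P<CONCAT>\+))'
)


def evaluate(source, variables):
    """Evaluate operand (CONCAT operand)* and return the joined text."""
    source = source.strip()
    pieces = []
    expect_operand = True
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        kind = match.lastgroup
        text = match.group(kind)
        # Operands and '+' must alternate, starting with an operand.
        if (kind == "CONCAT") == expect_operand:
            raise ValueError(f"unexpected {kind} at offset {pos}")
        if kind == "STRING":
            pieces.append(text[1:-1])             # drop the quotes
        elif kind == "VARNAME":
            pieces.append(str(variables[text]))   # look the variable up
        elif kind == "INTEGER":
            pieces.append(text)
        expect_operand = kind == "CONCAT"
        pos = match.end()
    if expect_operand:
        raise ValueError("expression ends where an operand was expected")
    return "".join(pieces)


print(evaluate('"Number of GB left=" + $C', {"$C": 15}))
```

In a real ANTLR project the same logic would normally live in a visitor over the expression rule; the point here is only that '+' is handled after tokenization, not inside the STRING token.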

Antlr String can not match

I'm working on a little ANTLR problem. In my small custom DSL I want to be able to compare fields. I've got three field types (String, Int, Identifier); the Identifier is a variable name. I wrote a big specification, but I've reduced my problem to a smaller grammar.
The problem is that when I use the String grammar notation, which ANTLRWorks can add to your grammar, my strings are seen as identifiers. This is my grammar:
grammar test;
x
: 'FROM' field_value EOF
;
field_value
: STRING
| INT
| identifier
;
identifier
: ID (('.' '`' ID '`')|('.' ID))?
| '`' ID '`' (('.' '`' ID '`')|('.' ID))?
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INT : '0'..'9'+
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
When I try to parse the string FROM "Hello!", it returns a parse tree like this:
         <grammar test>
               |
               x
               |
  ----------------------------
  |            |             |
FROM      field_value        !
               |
           identifier
               |
            "Hello
It parses what I think should be a string as an identifier, even though my identifier rule doesn't say anything about double quotes, so it shouldn't match.
Besides, I suspect my definition of a string is wrong, even though ANTLRWorks generated it for me. Does anybody know why this happens?
Cheers!

There's nothing wrong with your grammar. What is messing things up is most probably the fact that you're using ANTLRWorks' interpreter. Don't: the interpreter doesn't work well.
Use ANTLRWorks' debugger instead (in your grammar, press CTRL + D), which works like a charm. After parsing FROM "Hello!", the debugger shows "Hello!" matched as a single STRING under field_value.
