Escape quote in ANTLR4?

I have this grammar :
grammar Hello;
STRING : '"' ( ESC | ~[\r\n"])* '"' ;
fragment ESC : '\\"' ;
r : STRING;
When I type a string like this:
"my name is : \" StackOverflow \" "
I want the result to be:
"my name is : "StackOverflow" "
But that is not what I get when I test it.
What should I do to fix it? Your help will be appreciated.

There is no way to handle it in your grammar without targeting a specific language. You either strip the backslashes when walking your parse tree in a listener or visitor, or embed target-specific code in your grammar.
If Java is your target, you could do this:
STRING
: '"' ( ESC | ~[\r\n"] )* '"'
{
String text = getText();
text = text.substring(1, text.length() - 1);
text = text.replaceAll("\\\\(.)", "$1");
setText(text);
}
;
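The other option mentioned above, stripping the backslashes while walking the parse tree, can be sketched in plain Java. The class and method names below are hypothetical, not part of ANTLR; the method applies the same two steps as the embedded action (drop the surrounding quotes, then drop the backslash of each escape) and assumes the token text is at least two characters long.

```java
final class StringLiteralUnescaper {

    // Hypothetical helper, not an ANTLR API: takes the raw STRING token text,
    // including its surrounding double quotes, and returns the unescaped content.
    static String unescape(String tokenText) {
        // Drop the opening and closing quote.
        String inner = tokenText.substring(1, tokenText.length() - 1);
        // Replace each backslash-plus-character pair with the character alone.
        return inner.replaceAll("\\\\(.)", "$1");
    }

    public static void main(String[] args) {
        // Raw token text: "my name is : \" StackOverflow \" "
        String raw = "\"my name is : \\\" StackOverflow \\\" \"";
        // Brackets make the trailing space visible in the output.
        System.out.println("[" + unescape(raw) + "]");
    }
}
```

In a listener or visitor for the rule r, this would presumably be called with something like ctx.STRING().getText(), leaving the grammar itself free of target-specific code.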

Related

ANTLRv4 - How to identify unquoted quote within string

How would I recognise the string "Aren't you a string?" without getting a token recognition error at the apostrophe?
Here is the relevant grammar from my lexer:
STRING_LITERAL : '"' STRING? '"';
fragment STRING : STRING_CHARACTER+;
fragment STRING_CHARACTER : ~["'\\] | ESCSEQ;
fragment ESCSEQ : '\\' [tnfr"'\\];
Remove the single quote from ~["'\\]:
STRING_LITERAL : '"' STRING? '"';
fragment STRING : STRING_CHARACTER+;
fragment STRING_CHARACTER : ~["\\] | ESCSEQ;
fragment ESCSEQ : '\\' [tnfr"'\\];
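As a rough way to check the effect of that change without regenerating the lexer, the two versions of STRING_LITERAL can be restated as java.util.regex patterns. This is only an analogy under the assumption that the regex mirrors the character sets faithfully; the class and field names are made up for illustration.

```java
import java.util.regex.Pattern;

final class ApostropheDemo {

    // Analog of the original rule: ~["'\\] | ESCSEQ (apostrophe excluded).
    static final Pattern ORIGINAL =
            Pattern.compile("\"(?:[^\"'\\\\]|\\\\[tnfr\"'\\\\])*\"");

    // Analog of the corrected rule: ~["\\] | ESCSEQ (apostrophe allowed).
    static final Pattern FIXED =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\[tnfr\"'\\\\])*\"");

    public static void main(String[] args) {
        String input = "\"Aren't you a string?\"";
        System.out.println("original: " + ORIGINAL.matcher(input).matches());
        System.out.println("fixed:    " + FIXED.matcher(input).matches());
    }
}
```

The original pattern rejects the input because the bare apostrophe is neither in the negated set nor part of an escape sequence, while the corrected pattern accepts it.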

Handling String Literals which End in an Escaped Quote in ANTLR4

How do I write a lexer rule to match a String literal which does not end in an escaped quote?
Here's my grammar:
lexer grammar StringLexer;
// from The Definitive ANTLR 4 Reference
STRING: '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\' ;
Here's my java block:
String s = "\"\\\""; // looks like "\"
StringLexer lexer = new StringLexer(new ANTLRInputStream(s));
Token t = lexer.nextToken();
if (t.getType() == StringLexer.STRING) {
System.out.println("Saw a String");
}
else {
System.out.println("Nope");
}
This outputs Saw a String. Should "\" really match STRING?
Edit: Both 280Z28 and Bart's solutions are great solutions, unfortunately I can only accept one.
For properly formed input, the lexer will match the text you expect. However, the use of the non-greedy operator will not prevent it from matching something with the following form:
'"' .*? '"'
To ensure strings are tokenized in the most "sane" way possible, I recommend using the following rules.
StringLiteral
: UnterminatedStringLiteral '"'
;
UnterminatedStringLiteral
: '"' (~["\\\r\n] | '\\' (. | EOF))*
;
If your language allows string literals to span across multiple lines, you would likely need to modify UnterminatedStringLiteral to allow matching end-of-line characters.
If you do not include the UnterminatedStringLiteral rule, the lexer will handle unterminated strings by simply ignoring the opening " character of the string and proceeding to tokenize the content of the string.
Yes, "\" is matched by the STRING rule:
STRING: '"' (ESC|.)*? '"';
//       ^       ^     ^
//       |       |     |
//       "       \     "   <- matched input characters
If you don't want the . to match the backslash (and quote), do something like this:
STRING: '"' ( ESC | ~[\\"] )* '"';
And if your string can't be spread over multiple lines, do:
STRING: '"' ( ESC | ~[\\"\r\n] )* '"';
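To see the difference between the two rules without generating a lexer, the sketch below restates them as java.util.regex patterns. It is only an analogy: regex backtracking is not ANTLR's lexer algorithm, although for this input both reach the same verdict as described above. The class and field names are hypothetical.

```java
import java.util.regex.Pattern;

final class EscapedQuoteDemo {

    // Analog of: '"' (ESC|.)*? '"'   with ESC : '\\"' | '\\\\'
    // The wildcard can consume a lone backslash, so "\" is accepted.
    static final Pattern LENIENT =
            Pattern.compile("\"(?:\\\\\"|\\\\\\\\|.)*?\"");

    // Analog of: '"' ( ESC | ~[\\"] )* '"'
    // A backslash is only allowed as the start of an escape sequence.
    static final Pattern STRICT =
            Pattern.compile("\"(?:\\\\\"|\\\\\\\\|[^\\\\\"])*\"");

    public static void main(String[] args) {
        String s = "\"\\\""; // the three characters: quote, backslash, quote
        System.out.println("lenient: " + LENIENT.matcher(s).matches());
        System.out.println("strict:  " + STRICT.matcher(s).matches());
    }
}
```

With the strict pattern, the backslash and the following quote are consumed together as an escape, so no closing quote remains and the input is rejected.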

Are character classes allowed in ANTLR4?

Are character classes supported in ANTLR 4 lexers? I saw some examples that looked like this is OK:
LITERAL: [a-zA-z]+;
but what I found is that it matches the string "OR[" with the opening bracket. Using ranges worked:
LITERAL: ('a'..'z' | 'A'..'Z')+;
and only identified "OR" as the LITERAL. Here is an example:
grammar Test;
@members {
private void log(String msg) {
System.out.println(msg);
}
}
parse
: expr EOF
;
expr
: atom {log("atom(" + $atom.text + ")");}
| l=expr OR r=expr {log("IOR:left(" + $l.text + ") right(" + $r.text + "}");}
| (OR '[' la=atom ra=atom ']') {log("POR:left(" + $la.text + ") right(" + $ra.text + "}");}
;
atom
: LITERAL
;
OR : O R ;
LITERAL: [a-zA-z]+;
//LITERAL: ('a'..'z' | 'A'..'Z')+;
SPACE
: [ \t\r\n] -> skip
;
fragment O: ('o'|'O');
fragment R: ('r'|'R');
When given the input "OR [ cat dog ]" it parses correctly, but "OR[ cat dog ]" does not.
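You can use character sets in ANTLR 4 lexers, but the ranges are case sensitive. You used [a-zA-z] where I believe you meant [a-zA-Z]. The range A-z covers every code point from 'A' (65) to 'z' (122), which includes the punctuation characters between 'Z' and 'a', among them '['. That is why "OR[" was matched as a single LITERAL.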
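A small Java check makes the problem with the A-z range concrete. The class and method names are made up for this illustration; the method lists every character in a range that is not a letter.

```java
final class CharRangeDemo {

    // Collects the characters between from and to (inclusive) that are not letters.
    static String nonLettersInRange(char from, char to) {
        StringBuilder extras = new StringBuilder();
        for (char c = from; c <= to; c++) {
            if (!Character.isLetter(c)) {
                extras.append(c);
            }
        }
        return extras.toString();
    }

    public static void main(String[] args) {
        // The range written by mistake: includes six punctuation characters.
        System.out.println(nonLettersInRange('A', 'z')); // prints [\]^_`
        // The intended upper-case range: contains letters only.
        System.out.println(nonLettersInRange('A', 'Z').isEmpty()); // prints true
    }
}
```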

antlr match any character except

I have the following definition of a fragment:
fragment CHAR :'a'..'z'|'A'..'Z'|'\n'|'\t'|'\\'|EOF;
Now I have to define a lexer rule for string. I did the following :
STRING : '"'(CHAR)*'"';
However, in the string I want to match all of my characters except the newline '\n'. Any ideas how I can achieve that?
You'll also need to exclude " besides line breaks. Try this:
STRING : '"' ~('\r' | '\n' | '"')* '"' ;
The ~ negates char-sets.
But I want to negate only the new line from my CHAR set
No other way than this AFAIK:
STRING : '"' CHAR_NO_NL* '"' ;
fragment CHAR_NO_NL : 'a'..'z'|'A'..'Z'|'\t'|'\\'|EOF;

Antlr Lexer Quoted String Predicate

I'm trying to build a lexer to tokenize lone words and quoted strings. I got the following:
STRING: QUOTE (options {greedy=false;} : . )* QUOTE ;
WS : SPACE+ { $channel = HIDDEN; } ;
WORD : ~(QUOTE|SPACE)+ ;
For the corner cases, it needs to parse:
"string" word1" word2
As three tokens: "string" as STRING, and word1" and word2 as WORD. Basically, if there is a last unmatched quote, it needs to be part of the WORD where it is. If the quote is surrounded by white space, it should be a WORD.
I tried this rule for WORD, without success:
WORD: ~(QUOTE|SPACE)+
| (~(QUOTE|SPACE)* QUOTE ~QUOTE*)=> ~(QUOTE|SPACE)* QUOTE ~(QUOTE|SPACE)* ;
I finally found something that could do the trick without resorting to writing Java code:
fragment QUOTE
: '"' ;
fragment SPACE
: (' '|'\r'|'\t'|'\u000C'|'\n') ;
WS : SPACE+ {$channel=HIDDEN;};
PHRASE : QUOTE (options {greedy=false;} : . )* QUOTE ;
WORD : (~(QUOTE|SPACE)* QUOTE ~QUOTE* EOF)=> ~(QUOTE|SPACE)* QUOTE ~(SPACE)*
| ~(QUOTE|SPACE)+ ;
That way, the predicate disambiguates between both:
PHRASE : QUOTE (options {greedy=false;} : . )* QUOTE ;
and
| ~(QUOTE|SPACE)+ ;

Resources