antlr match any character except - string

I have the following deffinition of fragment:
fragment CHAR :'a'..'z'|'A'..'Z'|'\n'|'\t'|'\\'|EOF;
Now I have to define a lexer rule for string. I did the following :
STRING : '"'(CHAR)*'"'
However in string I want to match all of my characters except the new line '\n'. Any ideas how I can achieve that?

You'll also need to exclude " besides line breaks. Try this:
STRING : '"' ~('\r' | '\n' | '"')* '"' ;
The ~ negates char-sets.
ut I want to negate only the new line from my CHAR set
No other way than this AFAIK:
STRING : '"' CHAR_NO_NL* '"' ;
fragment CHAR_NO_NL : 'a'..'z'|'A'..'Z'|'\t'|'\\'|EOF;

Related

ANTLR4 ambiguity - how to solve

I would like to solve the following ambiguity:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input* EOF;
input
: '%' statement
| inputText
;
inputText
: ~('%')+
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;
Sample input:
%a=5;
aa bbbb
As soon as I put a space after "aa" with values like "bbbb" an ambiguity is created.
In fact I want inputText to contain the full string "aa bbbb".
There is no ambiguity. The input aa bbbb will always be tokenised as 2 Identifier tokens. No matter what any parser rule is trying to match. The lexer operates independently from the parser.
Also, the rule:
inputText
: ~('%')+
;
does not match one or more characters other than '%'.
Inside parser rules, the ~ negates tokens, not characters. So ~'%' inside a parser rule will match any token, other than a '%' token. Inside the lexer, ~'%' matches any character other than '%'.
But creating a lexer rule like this:
InputText
: ~('%')+
;
will cause your example input to be tokenised as a single '%' token, followed by a large 2nd token that'd match this: a=5;\naa bbbb. This is how ANTLR's lexer works: match as much characters as possible (no matter what the parser rule is trying to match).
I found the solution:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input EOF;
input
: inputText ('%' statement inputText)*
;
inputText
: ~('%')*
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;

How to include quotes in string in ANTLR4

How can I include quotes for string and characters as part of the string. Example is "This is a \" string" which should result in one string instead of "This is a \" as one string and string" as an error in this case. The same goes for the characters. Example is '\'', but
in my case it's only '\'.
This is my current solution which works only without quotes.
CHARACTER
: '\'' ~('\'')+ '\''
;
STRING
: '"' ~('"')+ '"'
;
Your string/char rules don't handle escape sequences correctly. For the character it should be:
CHARACTER: '\'' '\\'? '.' '\'';
Here we make the escape char (backshlash) be part of the rule and require an additional char (whatever it is) follow it. Similar for the string:
STRING: '"' ('\\'? .)+? '"';
By using +? we are telling ANTLR4 to match in a non-greedy manner, stopping at the first non-escaped quote char after the initial one.

ANTLR4 - how to interrupt

Suppose a line has a maximum length of 5.
I want an Identifier to continue when a newline character is put on position 5.
examples:
abcd'\n'ef would result in a single Identifier "abdef"
ab'\n'def would result in Identifier "ab" (and another one "def")
Somehow I cannot get it working...
Attempt 1 is something like:
NEWLINE1 : '\r'? '\n' { _tokenStartCharPositionInLine == 5 } -> skip;
NEWLINE2 : '\r'? '\n' { _tokenStartCharPositionInLine < 5 } -> channel(WHITESPACE);
Identifier : Letter (LetterOrDigit)*;
fragment
Letter : [a-zA-Z];
fragment
LetterOrDigit : [a-zA-Z0-9];
Attempt 2 is something like:
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> channel(WHITESPACE);
Identifier : Letter (LetterOrDigit NEWLINE?)*;
NEWLINE: '\r'? '\n' { _tokenStartCharPositionInLine == 5}? -> skip;
fragment
Letter : [a-zA-Z];
fragment
LetterOrDigit : [a-zA-Z0-9];
This seems to work, however the '\n' sign is still part of the Identifier when processing it in the parser. Somehow I do not succeed into 'ignoring' the newline when it is on the last position of a line.
This seems to work, however the '\n' sign is still part of the Identifier when processing it in the parser.
That is because the NEWLINE is only skipped when matched "independently". Whenever it is part of another rule, like Identifier, it will stay part of said rule.
IMO, you should just go for this solution and not add too much predicates to your lexer (or parser). Simply strip the line break from the Identifier after parsing.

Handling String Literals which End in an Escaped Quote in ANTLR4

How do I write a lexer rule to match a String literal which does not end in an escaped quote?
Here's my grammar:
lexer grammar StringLexer;
// from The Definitive ANTLR 4 Reference
STRING: '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\' ;
Here's my java block:
String s = "\"\\\""; // looks like "\"
StringLexer lexer = new StringLexer(new ANTLRInputStream(s));
Token t = lexer.nextToken();
if (t.getType() == StringLexer.STRING) {
System.out.println("Saw a String");
}
else {
System.out.println("Nope");
}
This outputs Saw a String. Should "\" really match STRING?
Edit: Both 280Z28 and Bart's solutions are great solutions, unfortunately I can only accept one.
For properly formed input, the lexer will match the text you expect. However, the use of the non-greedy operator will not prevent it from matching something with the following form:
'"' .*? '"'
To ensure strings are tokens in the most "sane" way possible, I recommended using the following rules.
StringLiteral
: UnterminatedStringLiteral '"'
;
UnterminatedStringLiteral
: '"' (~["\\\r\n] | '\\' (. | EOF))*
;
If your language allows string literals to span across multiple lines, you would likely need to modify UnterminatedStringLiteral to allow matching end-of-line characters.
If you do not include the UnterminatedStringLiteral rule, the lexer will handle unterminated strings by simply ignoring the opening " character of the string and proceeding to tokenize the content of the string.
Yes, "\" is matched by the STRING rule:
STRING: '"' (ESC|.)*? '"';
^ ^ ^
| | |
// matches: " \ "
If you don't want the . to match the backslash (and quote), do something like this:
STRING: '"' ( ESC | ~[\\"] )* '"';
And if your string can't be spread over multiple lines, do:
STRING: '"' ( ESC | ~[\\"\r\n] )* '"';

ANTLR 4 lexer tokens inside other tokens

I have the following grammar for ANTLR 4:
grammar Pattern;
//parser rules
parse : string LBRACK CHAR DASH CHAR RBRACK ;
string : (CHAR | DASH)+ ;
//lexer rules
DASH : '-' ;
LBRACK : '[' ;
RBRACK : ']' ;
CHAR : [A-Za-z0-9] ;
And I'm trying to parse the following string
ab-cd[0-9]
The code parses out the ab-cd on the left which will be treated as a literal string in my application. It then parses out [0-9] as a character set which in this case will translate to any digit. My grammar works for me except I don't like to have (CHAR | DASH)+ as a parser rule when it's simply being treated as a token. I would rather the lexer create a STRING token and give me the following tokens:
"ab-cd" "[" "0" "-" "9" "]"
instead of these
"ab" "-" "cd" "[" "0" "-" "9" "]"
I have looked at other examples, but haven't been able to figure it out. Usually other examples have quotes around such string literals or they have whitespace to help delimit the input. I'd like to avoid both. Can this be accomplished with lexer rules or do I need to continue to handle it in the parser rules like I'm doing?
In ANTLR 4, you can use lexer modes for this.
STRING : [a-z-]+;
LBRACK : '[' -> pushMode(CharSet);
mode CharSet;
DASH : '-';
NUMBER : [0-9]+;
RBRACK : ']' -> popMode;
After parsing a [ character, the lexer will operate in mode CharSet until a ] character is reached and the popMode command is executed.

Resources