Parsing quoted string with escape chars - antlr4

I'm having a problem parsing a list of lines in the following format with ANTLR 4:
* this is a string
* "first" this is "quoted"
* this is "quoted with \" "
I want to build a parse tree like
(list
(line * (value (string this is a string)))
(line * (value (parameter first) (string this is) (parameter quoted)))
(line * (value (string this is) (parameter quoted with " )))
)
I have an antlr4 grammar of this format
grammar List;
list : line+;
line : '*' (WS)+ value* NEWLINE;
value : string
| parameter
;
string : ((WORD) (WS)*)+;
parameter : '"'((WORD) (WS)*)+ '"';
WORD : (~'\n')+;
WS : '\t' | ' ';
NEWLINE : '\n';
But this fails at the very first character, the '*' itself, which baffles me.
line 1:0 mismatched input '* this is a string' expecting '*'

The problem is that your lexer is too greedy. The rule
WORD : (~'\n')+;
matches almost everything. This causes the lexer to produce the following tokens for your input:
token 1: WORD (* this is a string)
token 2: NEWLINE
token 3: WORD (* "first" this is "quoted")
token 4: NEWLINE
token 5: WORD (* this is "quoted with \" ")
Yes, that is correct: only WORD and NEWLINE tokens. ANTLR's lexer tries to construct tokens with as many characters as possible; it does not "listen" to what the parser is trying to match.
The error message:
line 1:0 mismatched input '* this is a string' expecting '*'
is telling you this: on line 1, index 0, the token with text '* this is a string' (type WORD) was encountered, but the parser was trying to match the token '*'.
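The longest-match behaviour described above can be reproduced outside ANTLR. The following is only a rough Python sketch, not ANTLR's actual implementation: `RULES` and `tokenize` are hypothetical names, the regular expressions approximate the lexer rules from the question, and listing the literal tokens first is an assumption of the sketch.

```python
import re

# Hypothetical stand-ins for the lexer rules of the question's grammar.
# Assumption of this sketch: literal tokens ('*', '"') are listed first.
RULES = [
    ("'*'", re.compile(r"\*")),
    ("'\"'", re.compile(r'"')),
    ("WORD", re.compile(r"[^\n]+")),
    ("WS", re.compile(r"[\t ]")),
    ("NEWLINE", re.compile(r"\n")),
]

def tokenize(text, rules=RULES):
    """Pick the longest match at each position; ties go to the earlier rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Strictly longer matches replace the current best one.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at index {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

for token in tokenize('* this is a string\n* "first" this is "quoted"\n'):
    print(token)
```

Even though the '*' rule matches at index 0, WORD matches the whole line and therefore wins, so the parser never sees a '*' token.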
Try something like this instead:
grammar List;
parse
: NEWLINE* list* NEWLINE* EOF
;
list
: item (NEWLINE item)*
;
item
: '*' (STRING | WORD)*
;
BULLET : '*';
STRING : '"' (~[\\"] | '\\' [\\"])* '"';
WORD : ~[ \t\r\n"*]+;
NEWLINE : '\r'? '\n' | '\r';
SPACE : [ \t]+ -> skip;
which parses your example input as follows:
(parse
(list
(item
* this is a string) \n
(item
* "first" this is "quoted") \n
(item
* this is "quoted with \" "))
\n
<EOF>)
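
To see how the STRING rule copes with the escaped quote, here is a small Python approximation of the lexer rules above. It is a sketch under assumptions: `TOKEN_RE`, `tokenize` and `unquote` are hypothetical names, and a regex alternation tries alternatives in order rather than taking the longest match (harmless here, because each rule starts with a different set of characters).

```python
import re

# Regex approximations of BULLET, STRING, WORD, NEWLINE and SPACE.
TOKEN_RE = re.compile(r'''
    (?P<BULLET>\*)
  | (?P<STRING>"(?:[^\\"]|\\[\\"])*")
  | (?P<WORD>[^ \t\r\n"*]+)
  | (?P<NEWLINE>\r?\n|\r)
  | (?P<SPACE>[ \t]+)
''', re.VERBOSE)

def tokenize(text):
    """Return (type, text) pairs, dropping SPACE like '-> skip' does."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at index {pos}")
        if m.lastgroup != "SPACE":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def unquote(string_token):
    """Drop the surrounding quotes and resolve the two backslash escapes."""
    return re.sub(r'\\([\\"])', r'\1', string_token[1:-1])

for kind, text in tokenize('* this is "quoted with \\" "'):
    print(kind, repr(unquote(text) if kind == "STRING" else text))
```

The STRING token keeps its quotes and backslashes, so a step like `unquote` is still needed after parsing to obtain the value the question asks for.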

Related

ANTLR4 ambiguity - how to solve

I would like to solve the following ambiguity:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input* EOF;
input
: '%' statement
| inputText
;
inputText
: ~('%')+
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;
Sample input:
%a=5;
aa bbbb
As soon as I put a space after "aa" with values like "bbbb" an ambiguity is created.
In fact I want inputText to contain the full string "aa bbbb".

There is no ambiguity. The input aa bbbb will always be tokenised as 2 Identifier tokens, no matter what any parser rule is trying to match. The lexer operates independently from the parser.
Also, the rule:
inputText
: ~('%')+
;
does not match one or more characters other than '%'.
Inside parser rules, the ~ negates tokens, not characters. So ~'%' inside a parser rule will match any token, other than a '%' token. Inside the lexer, ~'%' matches any character other than '%'.
But creating a lexer rule like this:
InputText
: ~('%')+
;
will cause your example input to be tokenised as a single '%' token, followed by a large 2nd token that'd match this: a=5;\naa bbbb. This is how ANTLR's lexer works: match as many characters as possible (no matter what the parser rule is trying to match).
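
If the goal is to get the text "aa bbbb" back including its whitespace, one option is to keep the tokens as they are and slice the original input using the offsets of the first and last token. The Python sketch below only illustrates that idea; `TOKEN_RE`, `tokenize` and `span_text` are hypothetical names, and the Identifier pattern is simplified compared to the grammar in the question.

```python
import re

# Simplified approximation of the question's lexer rules (an assumption).
TOKEN_RE = re.compile(
    r"(?P<WS>[ \t\n\r\f]+)"
    r"|(?P<PERCENT>%)"
    r"|(?P<DecimalConstant>[0-9]+)"
    r"|(?P<Identifier>[a-zA-Z_][a-zA-Z0-9_]*)"
    r"|(?P<EQ>=)"
    r"|(?P<SEMI>;)"
)

def tokenize(text):
    """Return (type, text, start, end) tuples; WS is skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at index {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group(), m.start(), m.end()))
        pos = m.end()
    return tokens

def span_text(text, tokens):
    """Original text from the first token's start to the last token's end."""
    return text[tokens[0][2]:tokens[-1][3]] if tokens else ""

source = "%a=5;\naa bbbb"
tokens = tokenize(source)
print([kind for kind, *_ in tokens])
print(repr(span_text(source, tokens[5:])))
```

The lexer still produces two Identifier tokens for aa bbbb; the whitespace is recovered from the input, not from the tokens. ANTLR tokens carry start and stop indexes that can serve the same purpose.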

I found the solution:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input EOF;
input
: inputText ('%' statement inputText)*
;
inputText
: ~('%')*
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;

ANTLR4 - how to interrupt

Suppose a line has a maximum length of 5.
I want an Identifier to continue when a newline character is put on position 5.
examples:
abcd'\n'ef would result in a single Identifier "abcdef"
ab'\n'def would result in Identifier "ab" (and another one "def")
Somehow I cannot get it working...
Attempt 1 is something like:
NEWLINE1 : '\r'? '\n' { _tokenStartCharPositionInLine == 5 } -> skip;
NEWLINE2 : '\r'? '\n' { _tokenStartCharPositionInLine < 5 } -> channel(WHITESPACE);
Identifier : Letter (LetterOrDigit)*;
fragment
Letter : [a-zA-Z];
fragment
LetterOrDigit : [a-zA-Z0-9];
Attempt 2 is something like:
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> channel(WHITESPACE);
Identifier : Letter (LetterOrDigit NEWLINE?)*;
NEWLINE: '\r'? '\n' { _tokenStartCharPositionInLine == 5}? -> skip;
fragment
Letter : [a-zA-Z];
fragment
LetterOrDigit : [a-zA-Z0-9];
This seems to work, however the '\n' character is still part of the Identifier when processing it in the parser. Somehow I cannot manage to ignore the newline when it is on the last position of a line.

"This seems to work, however the '\n' sign is still part of the Identifier when processing it in the parser."
That is because the NEWLINE is only skipped when matched "independently". Whenever it is part of another rule, like Identifier, it will stay part of said rule.
IMO, you should just go for this solution and not add too many predicates to your lexer (or parser). Simply strip the line break from the Identifier after parsing.
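
Stripping the line break afterwards is a one-liner in most target languages. A minimal Python sketch, assuming the token text is available as a plain string: `IDENTIFIER_RE` and `strip_line_breaks` are hypothetical names, and the regex mimics the Identifier rule of attempt 2 without the column predicate.

```python
import re

# Approximation of: Identifier : Letter (LetterOrDigit NEWLINE?)*;
# The column check from the question is deliberately left out.
IDENTIFIER_RE = re.compile(r"[a-zA-Z](?:[a-zA-Z0-9](?:\r?\n)?)*")

def strip_line_breaks(identifier_text):
    """Remove any line breaks that ended up inside an Identifier token."""
    return identifier_text.replace("\r", "").replace("\n", "")

raw = IDENTIFIER_RE.match("abcd\nef").group()
print(repr(raw), "->", repr(strip_line_breaks(raw)))
```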

Are character classes allowed in ANTLR4?

Are character classes supported in ANTLR 4 lexers? I saw some examples that looked like this is OK:
LITERAL: [a-zA-z]+;
but what I found is that it matches the string "OR[" with the opening bracket. Using ranges worked:
LITERAL: ('a'..'z' | 'A'..'Z')+;
and only identified "OR" as the LITERAL. Here is an example:
grammar Test;
@members {
private void log(String msg) {
System.out.println(msg);
}
}
parse
: expr EOF
;
expr
: atom {log("atom(" + $atom.text + ")");}
| l=expr OR r=expr {log("IOR:left(" + $l.text + ") right(" + $r.text + "}");}
| (OR '[' la=atom ra=atom ']') {log("POR:left(" + $la.text + ") right(" + $ra.text + "}");}
;
atom
: LITERAL
;
OR : O R ;
LITERAL: [a-zA-z]+;
//LITERAL: ('a'..'z' | 'A'..'Z')+;
SPACE
: [ \t\r\n] -> skip
;
fragment O: ('o'|'O');
fragment R: ('r'|'R');
When given the input "OR [ cat dog ]" it parses correctly, but "OR[ cat dog ]" does not.

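You can use character sets in ANTLR 4 lexers, but the ranges are case sensitive. You used [a-zA-z] where I believe you meant [a-zA-Z]. The range A-z spans every character between 'A' and 'z', which includes '[', '\', ']', '^', '_' and '`', and that is why "OR[" was matched as a single LITERAL.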
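
The effect of the A-z range is easy to check with any regex engine that uses the same code-point ranges. A short Python illustration (the variable names are made up for this example):

```python
import re

# Characters that sit between 'Z' and 'a' in ASCII and are therefore
# included in the range A-z.
between = [chr(c) for c in range(ord("Z") + 1, ord("a"))]
print(between)

typo = re.compile(r"[a-zA-z]+")   # range from the question
fixed = re.compile(r"[a-zA-Z]+")  # intended range

print(typo.match("OR[").group())   # the bracket is swallowed
print(fixed.match("OR[").group())  # stops before the bracket
```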

ANTLR 4 lexer tokens inside other tokens

I have the following grammar for ANTLR 4:
grammar Pattern;
//parser rules
parse : string LBRACK CHAR DASH CHAR RBRACK ;
string : (CHAR | DASH)+ ;
//lexer rules
DASH : '-' ;
LBRACK : '[' ;
RBRACK : ']' ;
CHAR : [A-Za-z0-9] ;
And I'm trying to parse the following string
ab-cd[0-9]
The code parses out the ab-cd on the left which will be treated as a literal string in my application. It then parses out [0-9] as a character set which in this case will translate to any digit. My grammar works for me except I don't like to have (CHAR | DASH)+ as a parser rule when it's simply being treated as a token. I would rather the lexer create a STRING token and give me the following tokens:
"ab-cd" "[" "0" "-" "9" "]"
instead of these
"ab" "-" "cd" "[" "0" "-" "9" "]"
I have looked at other examples, but haven't been able to figure it out. Usually other examples have quotes around such string literals or they have whitespace to help delimit the input. I'd like to avoid both. Can this be accomplished with lexer rules or do I need to continue to handle it in the parser rules like I'm doing?

In ANTLR 4, you can use lexer modes for this (modes are only available in a separate lexer grammar, not in a combined grammar).
STRING : [a-z-]+;
LBRACK : '[' -> pushMode(CharSet);
mode CharSet;
DASH : '-';
NUMBER : [0-9]+;
RBRACK : ']' -> popMode;
After parsing a [ character, the lexer will operate in mode CharSet until a ] character is reached and the popMode command is executed.
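
The mode mechanism can be pictured as a stack of rule sets. The following Python sketch imitates pushMode and popMode for the rules above; it is an illustration only, with hypothetical names (`MODES`, `tokenize`), and it takes the first matching rule instead of the longest match, which makes no difference for these rules.

```python
import re

# Each mode maps to (token type, pattern, action) triples.
MODES = {
    "DEFAULT": [
        ("STRING", re.compile(r"[a-z-]+"), None),
        ("LBRACK", re.compile(r"\["), ("push", "CharSet")),
    ],
    "CharSet": [
        ("DASH", re.compile(r"-"), None),
        ("NUMBER", re.compile(r"[0-9]+"), None),
        ("RBRACK", re.compile(r"\]"), ("pop", None)),
    ],
}

def tokenize(text):
    """Tokenise using only the rules of the mode on top of the stack."""
    stack, pos, tokens = ["DEFAULT"], 0, []
    while pos < len(text):
        for name, pattern, action in MODES[stack[-1]]:
            m = pattern.match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"no rule in mode {stack[-1]} matches at index {pos}")
        tokens.append((name, m.group()))
        pos = m.end()
        if action is not None:
            if action[0] == "push":
                stack.append(action[1])
            else:
                stack.pop()
    return tokens

print(tokenize("ab-cd[0-9]"))
```

Because '-' is matched by STRING in the default mode and by DASH only inside CharSet, ab-cd comes out as a single token while the dash between the brackets stays separate.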

antlr match any character except

I have the following definition of a fragment:
fragment CHAR :'a'..'z'|'A'..'Z'|'\n'|'\t'|'\\'|EOF;
Now I have to define a lexer rule for string. I did the following :
STRING : '"'(CHAR)*'"'
However in string I want to match all of my characters except the new line '\n'. Any ideas how I can achieve that?

You'll also need to exclude " besides line breaks. Try this:
STRING : '"' ~('\r' | '\n' | '"')* '"' ;
The ~ negates char-sets.
"But I want to negate only the new line from my CHAR set"
No other way than this AFAIK:
STRING : '"' CHAR_NO_NL* '"' ;
fragment CHAR_NO_NL : 'a'..'z'|'A'..'Z'|'\t'|'\\'|EOF;
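
For comparison, the negated set from the first STRING rule behaves like a negated character class in a regular expression. A small Python check, offered only as an analogy (the name `STRING` mirrors the lexer rule, the test strings are made up):

```python
import re

# Analogue of: STRING : '"' ~('\r' | '\n' | '"')* '"' ;
STRING = re.compile(r'"[^\r\n"]*"')

print(bool(STRING.fullmatch('"hello world"')))   # no line break inside
print(bool(STRING.fullmatch('"line\nbreak"')))   # line break inside
print(STRING.match('"a"b"').group())             # stops at the first closing quote
```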
