In ANTLR4 this will cause the LINE_FOLD token to be skipped:
LINE_FOLD
: CRLF WSP -> skip
;
But if I do this:
ESCAPED_CHAR
: '\\' LINE_FOLD? '\\'
| '\\' LINE_FOLD? ';'
| '\\' LINE_FOLD? ','
| '\\' LINE_FOLD? N
;
will it return the ESCAPED_CHAR without the LINE_FOLD, and if not how can I do this?
No, inside ESCAPED_CHAR, CRLF WSP will not be skipped.
ANTLR(4) best-practice is to handle such target specific actions in the stage after parsing (in a listener or visitor).
However, you can add a target specific block at the end of your rule that discards the \\ CRLF WSP from the ESCAPED_CHAR rule:
ESCAPED_CHAR
: '\\' LINE_FOLD? [\\;,nN]
{
String s = getText();
setText(s.substring(s.length() - 1));
}
;
Assuming your lexer rule N matches either 'n' or 'N'.
Now the rule ESCAPED_CHAR will only produce tokens whose contents will be one of: \\, ;, ,, n, or N.
Needless to say, this will only work with the Java target.
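For the listener/visitor route mentioned above, a minimal sketch in plain Java could look like the following. It assumes CRLF is "\r\n" and WSP is a single space or tab (neither rule is shown in the question), and the class and method names are hypothetical. You would call it on the token text after parsing.

```java
public final class LineFoldStripper {
    private LineFoldStripper() {}

    // Removes every line fold (CRLF followed by one space or tab) from token text.
    // Assumption: CRLF = "\r\n" and WSP = ' ' or '\t'; adjust to the real rules.
    public static String stripLineFolds(String tokenText) {
        return tokenText.replaceAll("\r\n[ \t]", "");
    }

    public static void main(String[] args) {
        String folded = "\\\r\n ;"; // backslash, CR, LF, space, semicolon
        System.out.println(stripLineFolds(folded)); // prints \;
    }
}
```

Unlike the embedded action, this keeps the grammar free of target-specific code, so it can still be used with other ANTLR targets.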
I have a query grammar I am working on and have found one case that is proving difficult to solve. The below provides a minimal version of the grammar to reproduce it.
grammar scratch;
query : command* ; // input rule
RANGE: '..';
NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+));
STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' | '.' )+ ;
WS: [ \t\r\n]+ -> skip ;
command
: 'foo:' number_range # FooCommand
| 'bar:' item_list # BarCommand
;
number_range: NUMBER RANGE NUMBER # NumberRange;
item_list: '(' (NUMBER | STRING)+ ((',' | '|') (NUMBER | STRING)+)* ')' # ItemList;
When using this you can match things like bar:(bob, blah, 57, 4.5) foo:2..4.3 no problem. But if you put in bar:(bob.smith, blah, 57, 4.5) foo:2..4 it will complain line 1:8 token recognition error at: '.s' and split it into 'bob' and 'mith'. Makes sense, . is ignored as part of string. Although not sure why it eats the 's'.
So, change string to STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' )+ ; instead without the dot in it. And now it will recognize 2..4.3 as a string instead of number_range.
I believe that this is because the string matches more characters in one stretch than the other options. But is there a way to force STRING to only match if it hasn't already matched elements higher in the grammar? Meaning it is only a STRING if it does not contain RANGE or NUMBER?
I know I can add TERM: '"' .*? '"'; and then add TERM into the item_list, but I was hoping to avoid having to quote things if possible. But seems to be the only route to keep the .. range in, that I have found.
You could allow only single dots inside strings like this:
STRING : ATOM+ ( '.' ATOM+ )*;
fragment ATOM : ~[ \t\r\n():|,.];
Oh, and NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+)); is rather verbose. This does the same: NUMBER : ( [0-9]* '.' )? [0-9]+;
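To sanity-check that the two NUMBER rules accept the same input, here is a small sketch that transliterates both into java.util.regex patterns. These are regex analogs, not ANTLR syntax, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public final class NumberRuleCheck {
    // Regex transliteration of: [0-9]+ | (([0-9]+)? '.' [0-9]+)
    static final Pattern VERBOSE = Pattern.compile("[0-9]+|([0-9]+)?\\.[0-9]+");
    // Regex transliteration of: ( [0-9]* '.' )? [0-9]+
    static final Pattern COMPACT = Pattern.compile("([0-9]*\\.)?[0-9]+");

    // True when both patterns give the same verdict on the whole string.
    static boolean agree(String s) {
        return VERBOSE.matcher(s).matches() == COMPACT.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"57", "4.5", ".5", "4.", "2..4", ""}) {
            System.out.println("'" + s + "' agree=" + agree(s)
                    + " number=" + COMPACT.matcher(s).matches());
        }
    }
}
```

A handful of samples is not a proof, but it covers the interesting edges: no integer part, no fraction, a trailing dot, and a double dot.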
Suppose a line has a maximum length of 5.
I want an Identifier to continue when a newline character is put on position 5.
examples:
abcd'\n'ef would result in a single Identifier "abcdef"
ab'\n'def would result in Identifier "ab" (and another one "def")
Somehow I cannot get it working...
Attempt 1 is something like:
NEWLINE1 : '\r'? '\n' { _tokenStartCharPositionInLine == 5 } -> skip;
NEWLINE2 : '\r'? '\n' { _tokenStartCharPositionInLine < 5 } -> channel(WHITESPACE);
Identifier : Letter (LetterOrDigit)*;
fragment
Letter : [a-zA-Z];
fragment
LetterOrDigit : [a-zA-Z0-9];
Attempt 2 is something like:
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> channel(WHITESPACE);
Identifier : Letter (LetterOrDigit NEWLINE?)*;
NEWLINE: '\r'? '\n' { _tokenStartCharPositionInLine == 5}? -> skip;
fragment
Letter : [a-zA-Z];
fragment
LetterOrDigit : [a-zA-Z0-9];
This seems to work, however the '\n' sign is still part of the Identifier when processing it in the parser. Somehow I do not succeed in ignoring the newline when it is on the last position of a line.
This seems to work, however the '\n' sign is still part of the Identifier when processing it in the parser.
That is because the NEWLINE is only skipped when matched "independently". Whenever it is part of another rule, like Identifier, it will stay part of said rule.
IMO, you should just go for this solution and not add too many predicates to your lexer (or parser). Simply strip the line break from the Identifier after parsing.
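Stripping the line break afterwards can be as small as the sketch below. The class and method names are hypothetical; in a listener or visitor you would apply it to the Identifier token's text.

```java
public final class IdentifierCleanup {
    private IdentifierCleanup() {}

    // Drops CR and LF characters that the lexer left inside an Identifier token.
    public static String stripLineBreaks(String identifierText) {
        return identifierText.replace("\r", "").replace("\n", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLineBreaks("abcd\nef")); // prints abcdef
    }
}
```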
What I am trying to achieve is to develop a grammar that would parse the following two lines in the same way:
1. "Bucket 1" = "1 item placed", "3 items removed"
2. Bucket 2 = 2 items placed, 6 items removed
So, a line starts with an ordinal number, then an element name goes - 'Bucket 1' and 'Bucket 2'. Also, a bucket has one or more values separated by a comma.
The issue is that the data can come enclosed with double quotes (line #1 above) and without the quotes (as shown in line #2). I can figure grammars for each of the lines separately but can not develop a grammar that would parse them both.
grammar Test;
doc : element+ EOF;
element: ordinal element_name EQUAL element_values '\n';
element_name : STRING ;
element_values: STRING (COMMA STRING)+;
ordinal : NUMBER ;
COMMA: ',' ;
EQUAL: '=' ;
NUMBER : ('0'..'9')+ ;
STRING : '"' (EscapeSequence | ~('\\'|'"') )* '"' ;
// STRING : ('"' (EscapeSequence | ~('\\'|'"') )* '"') | ~('"'|',')+ ;
fragment
EscapeSequence
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| OctalEscape
;
fragment
OctalEscape
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
WS : [ .\t]+ -> skip ;
I played with STRING rule above trying to make it handle both cases but with no luck. If I enable the commented out version of STRING rule, then I get a line 1:0 missing NUMBER at '4. ' parser error which is confusing as I thought that NUMBER rule should be caught since it goes first.
Is that a wrong assumption? Can you please explain why it does not get caught?
How do I write a lexer rule to match a String literal which does not end in an escaped quote?
Here's my grammar:
lexer grammar StringLexer;
// from The Definitive ANTLR 4 Reference
STRING: '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\' ;
Here's my java block:
String s = "\"\\\""; // looks like "\"
StringLexer lexer = new StringLexer(new ANTLRInputStream(s));
Token t = lexer.nextToken();
if (t.getType() == StringLexer.STRING) {
System.out.println("Saw a String");
}
else {
System.out.println("Nope");
}
This outputs Saw a String. Should "\" really match STRING?
Edit: Both 280Z28 and Bart's solutions are great solutions, unfortunately I can only accept one.
For properly formed input, the lexer will match the text you expect. However, the use of the non-greedy operator will not prevent it from matching something with the following form:
'"' .*? '"'
To ensure strings are tokenized in the most "sane" way possible, I recommend using the following rules.
StringLiteral
: UnterminatedStringLiteral '"'
;
UnterminatedStringLiteral
: '"' (~["\\\r\n] | '\\' (. | EOF))*
;
If your language allows string literals to span across multiple lines, you would likely need to modify UnterminatedStringLiteral to allow matching end-of-line characters.
If you do not include the UnterminatedStringLiteral rule, the lexer will handle unterminated strings by simply ignoring the opening " character of the string and proceeding to tokenize the content of the string.
Yes, "\" is matched by the STRING rule:
STRING: '"' (ESC|.)*? '"';
^ ^ ^
| | |
// matches: " \ "
If you don't want the . to match the backslash (and quote), do something like this:
STRING: '"' ( ESC | ~[\\"] )* '"';
And if your string can't be spread over multiple lines, do:
STRING: '"' ( ESC | ~[\\"\r\n] )* '"';
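To see the effect of excluding the backslash and quote from the "any other character" branch, here is a rough java.util.regex analog of the first corrected rule. It is a regex approximation rather than the ANTLR lexer itself, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public final class StringLiteralCheck {
    // Analog of: fragment ESC : '\\"' | '\\\\' ;  (backslash, then quote or backslash)
    private static final String ESC = "\\\\[\"\\\\]";
    // Analog of: ~[\\"]  (anything except backslash and quote)
    private static final String OTHER = "[^\\\\\"]";
    // Analog of: STRING: '"' ( ESC | ~[\\"] )* '"';
    private static final Pattern STRING =
            Pattern.compile("\"(?:" + ESC + "|" + OTHER + ")*\"");

    // True when the whole input is one well-formed string literal.
    public static boolean isStringLiteral(String input) {
        return STRING.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isStringLiteral("\"\\\""));     // the input "\" : prints false
        System.out.println(isStringLiteral("\"a\\\"b\"")); // the input "a\"b" : prints true
    }
}
```

The input "\" is rejected here because the backslash can only be consumed as part of an escape, which then uses up the quote that would otherwise have closed the string.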
I have the following definition of a fragment:
fragment CHAR :'a'..'z'|'A'..'Z'|'\n'|'\t'|'\\'|EOF;
Now I have to define a lexer rule for string. I did the following :
STRING : '"'(CHAR)*'"'
However in string I want to match all of my characters except the new line '\n'. Any ideas how I can achieve that?
You'll also need to exclude " besides line breaks. Try this:
STRING : '"' ~('\r' | '\n' | '"')* '"' ;
The ~ negates char-sets.
But I want to negate only the new line from my CHAR set
No other way than this AFAIK:
STRING : '"' CHAR_NO_NL* '"' ;
fragment CHAR_NO_NL : 'a'..'z'|'A'..'Z'|'\t'|'\\'|EOF;