Manipulate the text of a lexical rule in antlr4 - lexer

So I have this lexical rule for string:
STRINGLIT: '"' ( ('\'[\"bftrn]) | ~[\n\"] )* '"' ;
For example, with the input "abc", I expect abc,<EOF> discarding the "
I read here http://www.antlr2.org/doc/lexer.html that you can use the ! operator. Then I would have:
STRINGLIT: '"'! ( ('\'[\"bftrn]) | ~[\n\"] )* '"'! ;
But I can't get this to work in my grammar.

The v2 functionality of the ! operator is no longer supported since v3 (you're using v4).
There is no equivalent operator in v3 or v4. The only way to strip the quotes is to do so in a listener or visitor after parsing, or embed target specific code in your lexer:
STRINGLIT
: '"' ( ( '\\' [\\bftrn"] ) | ~[\\\r\n"] )* '"'
{
// Get all the text that this rule matched
String matched = getText();
// Strip the first and the last characters (the quotes)
String matchedWithoutQuotes = matched.substring(1, matched.length() - 1);
// possibly do some more replacements here like replace `\\n` with `\n` etc.
// Set the new string to this token
setText(matchedWithoutQuotes);
}
;
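The "more replacements" step mentioned in the action's comment can be sketched as a plain Java helper. This is a minimal sketch, not part of the ANTLR runtime: the class name StringLiteralText and the method unquote are made up for illustration, and it assumes only the escapes the rule above allows (\\, \", \b, \f, \t, \r, \n).

```java
final class StringLiteralText {
    /**
     * Strips the surrounding quotes and resolves the escapes
     * \\ \" \b \f \t \r \n (hypothetical helper, not an ANTLR API).
     */
    static String unquote(String matched) {
        // Drop the first and last characters (the quotes).
        String body = matched.substring(1, matched.length() - 1);
        StringBuilder out = new StringBuilder(body.length());
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            // Copy ordinary characters (and a stray trailing backslash) as-is.
            if (c != '\\' || i + 1 == body.length()) {
                out.append(c);
                continue;
            }
            char escaped = body.charAt(++i);
            switch (escaped) {
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'n': out.append('\n'); break;
                default:  out.append(escaped); break; // covers \\ and \"
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"abc\"")); // prints abc
    }
}
```

Inside the lexer action, the call would then presumably look like setText(StringLiteralText.unquote(getText())); assuming the helper class is on the classpath of the generated lexer.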

Related

Choosing lexer mode based on variable

My lexer (target language C++) contains a simple rule for parsing a string literal:
STRING: '"' ~'"'+ '"';
But based on the value returned by a function, I want my lexer to return either a STRING or an IDENT.
I've tried the following:
STRING_START: '"' -> mode(current_string_mode());
or
STRING_START: '"' -> mode(current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING) ;
In either case, I get an error when trying to generate the lexer (the error message says: '"' came as a complete surprise).
Alas, that is not possible.
If I look at the grammar of ANTLR itself, I see this:
lexerCommands
: RARROW lexerCommand (COMMA lexerCommand)*
;
lexerCommand
: lexerCommandName LPAREN lexerCommandExpr RPAREN
| lexerCommandName
;
lexerCommandName
: identifier
| MODE
;
lexerCommandExpr
: identifier
| INT
;
In short: the part between parentheses (mode(...) or pushMode(...)) must be an identifier or an integer literal. It cannot be an expression (which is what you're trying to do).

Antlr4 Match Force Priority

I have a query grammar I am working on and have found one case that is proving difficult to solve. The below provides a minimal version of the grammar to reproduce it.
grammar scratch;
query : command* ; // input rule
RANGE: '..';
NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+));
STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' | '.' )+ ;
WS: [ \t\r\n]+ -> skip ;
command
: 'foo:' number_range # FooCommand
| 'bar:' item_list # BarCommand
;
number_range: NUMBER RANGE NUMBER # NumberRange;
item_list: '(' (NUMBER | STRING)+ ((',' | '|') (NUMBER | STRING)+)* ')' # ItemList;
When using this you can match things like bar:(bob, blah, 57, 4.5) foo:2..4.3 no problem. But if you put in bar:(bob.smith, blah, 57, 4.5) foo:2..4 it will complain line 1:8 token recognition error at: '.s' and split it into 'bob' and 'mith'. That makes sense, since . is excluded from STRING, although I'm not sure why it eats the 's'.
So, change string to STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' )+ ; instead without the dot in it. And now it will recognize 2..4.3 as a string instead of number_range.
I believe that this is because the string matches more character in one stretch than other options. But is there a way to force STRING to only match if it hasn't already matched elements higher in the grammar? Meaning it is only a STRING if it does not contain RANGE or NUMBER?
I know I can add TERM: '"' .*? '"'; and then add TERM into the item_list, but I was hoping to avoid having to quote things if possible. But seems to be the only route to keep the .. range in, that I have found.
You could allow only single dots inside strings like this:
STRING : ATOM+ ( '.' ATOM+ )*;
fragment ATOM : ~[ \t\r\n():|,.];
Oh, and NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+)); is rather verbose. This does the same: NUMBER : ( [0-9]* '.' )? [0-9]+;
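Both claims can be checked outside ANTLR with java.util.regex. The sketch below uses hand-written regex transcriptions of the lexer rules (an assumption: they are approximations for whole-token matching, not ANTLR syntax), and the class name ScratchTokenPatterns is made up for illustration.

```java
import java.util.regex.Pattern;

final class ScratchTokenPatterns {
    // NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+));
    static final Pattern NUMBER_VERBOSE = Pattern.compile("[0-9]+|(?:[0-9]+)?\\.[0-9]+");
    // NUMBER : ( [0-9]* '.' )? [0-9]+;
    static final Pattern NUMBER_SHORT = Pattern.compile("(?:[0-9]*\\.)?[0-9]+");
    // STRING : ATOM+ ( '.' ATOM+ )*;  with ATOM : ~[ \t\r\n():|,.];
    static final Pattern STRING =
            Pattern.compile("[^ \\t\\r\\n():|,.]+(?:\\.[^ \\t\\r\\n():|,.]+)*");

    /** True when the pattern matches the entire text, like a single token would. */
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"2", "4.3", ".5", "2.", "..", "bob.smith", "2..4.3"};
        for (String s : samples) {
            System.out.printf("%-10s verbose=%b short=%b string=%b%n", s,
                    matchesWhole(NUMBER_VERBOSE, s),
                    matchesWhole(NUMBER_SHORT, s),
                    matchesWhole(STRING, s));
        }
    }
}
```

The two NUMBER columns agree on every sample, and STRING accepts bob.smith as a whole but not 2..4.3, because a dot must be followed by at least one ATOM. Inputs such as 2 or 4.3 fit both NUMBER and STRING; in the grammar NUMBER wins those ties because it is defined first.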

antlr4 all words except the operators

grammar TestGrammar;
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WORD : [a-z0-9._#+=]+(' '[a-z0-9._#+=]+)* ;
WS : [ \t\r\n]+ -> skip ;
quotedword : DQUOTE WORD DQUOTE;
expression
: LPAREN expression+ RPAREN
| expression (AND expression)+
| expression (OR expression)+
| expression (NOT expression)+
| NOT expression+
| quotedword
| WORD;
I've managed to implement the above grammar for antlr4.
I've got a long way to go but for now my question is,
how can I make WORD generic? Basically I want this [a-z0-9._#+=] to be anything except the operators (AND, OR, NOT, LPAREN, RPAREN, DQUOTE, SPACE).
The lexer picks the rule that matches the longest stretch of input. If several rules match the same number of characters, the rule defined first in the grammar wins.
Therefore you can make your WORD rule generic by using this grammar:
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WS : [ \t\r\n]+ -> skip ;
WORD: .+? ;
Make sure to use the non-greedy operator ? in this case because otherwise, once invoked, the WORD rule will consume all following input.
Because WORD is defined last, it only produces a token when none of the earlier lexer rules (all those defined above it in the grammar) matches at least as much input.
EDIT: If you don't want your WORD rule to match arbitrary input, you just have to modify the rule I provided. The essence of my answer is that in the lexer you don't have to worry about two rules potentially matching the same input, as long as you get their order in the grammar right.
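How rule order interacts with match length can be illustrated with a toy tokenizer. ANTLR 4 lexers take the longest match and break ties by definition order; the sketch below imitates that with java.util.regex. The regexes are stand-ins for the lexer rules, the WORD pattern is a hypothetical generic one chosen so that it overlaps with the operators, and the class name RuleOrderDemo is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RuleOrderDemo {
    // Rules in definition order (LinkedHashMap keeps insertion order).
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("AND", Pattern.compile("AND"));
        RULES.put("OR", Pattern.compile("OR|,"));
        RULES.put("NOT", Pattern.compile("NOT"));
        RULES.put("LPAREN", Pattern.compile("\\("));
        RULES.put("RPAREN", Pattern.compile("\\)"));
        RULES.put("DQUOTE", Pattern.compile("\""));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n]+"));
        // Hypothetical generic WORD that also matches the text of the operators.
        RULES.put("WORD", Pattern.compile("[A-Za-z0-9._#+=]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!bestRule.equals("WS")) { // WS is skipped, as in the grammar
                tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [WORD:cat, AND:AND, WORD:ANDROID]
        System.out.println(tokenize("cat AND ANDROID"));
    }
}
```

AND and WORD both match the three characters of AND, so the earlier rule wins; for ANDROID the WORD rule matches more characters and wins regardless of order.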
Try something like this grammar:
grammar TestGrammar;
...
WORD : Letter+;
QUOTEDWORD : '"' (~["\\\r\n])* '"' ; // disallow quotes, backslashes and CR/LF in literals
WS : [ \t\r\n]+ -> skip ;
fragment Letter :
[a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
expression:
...
| QUOTEDWORD
| WORD+;
If you want to support escape sequences in QUOTEDWORD, look at this example to see how to do it.
This grammar allows you:
to have quoted words interpreted as string literals (preserving all spaces within)
to have multiple words separated by whitespace (which is ignored)

Handling String Literals which End in an Escaped Quote in ANTLR4

How do I write a lexer rule to match a String literal which does not end in an escaped quote?
Here's my grammar:
lexer grammar StringLexer;
// from The Definitive ANTLR 4 Reference
STRING: '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\' ;
Here's my java block:
String s = "\"\\\""; // looks like "\"
StringLexer lexer = new StringLexer(new ANTLRInputStream(s));
Token t = lexer.nextToken();
if (t.getType() == StringLexer.STRING) {
System.out.println("Saw a String");
}
else {
System.out.println("Nope");
}
This outputs Saw a String. Should "\" really match STRING?
Edit: Both 280Z28 and Bart's solutions are great solutions, unfortunately I can only accept one.
For properly formed input, the lexer will match the text you expect. However, the use of the non-greedy operator will not prevent it from matching something with the following form:
'"' .*? '"'
To ensure strings are tokenized in the most "sane" way possible, I recommend using the following rules:
StringLiteral
: UnterminatedStringLiteral '"'
;
UnterminatedStringLiteral
: '"' (~["\\\r\n] | '\\' (. | EOF))*
;
If your language allows string literals to span across multiple lines, you would likely need to modify UnterminatedStringLiteral to allow matching end-of-line characters.
If you do not include the UnterminatedStringLiteral rule, the lexer will handle unterminated strings by simply ignoring the opening " character of the string and proceeding to tokenize the content of the string.
Yes, "\" is matched by the STRING rule:
STRING: '"' (ESC|.)*? '"';
//       ^       ^     ^
//       |       |     |
//       "       \     "    <- matches
If you don't want the . to match the backslash (and quote), do something like this:
STRING: '"' ( ESC | ~[\\"] )* '"';
And if your string can't be spread over multiple lines, do:
STRING: '"' ( ESC | ~[\\"\r\n] )* '"';
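The difference between the original rule and the stricter one can be reproduced with java.util.regex. This is a rough approximation under the assumption that whole-string regex matching stands in for the lexer rule; the class name StringRuleCheck and the pattern names are made up.

```java
import java.util.regex.Pattern;

final class StringRuleCheck {
    // STRING: '"' (ESC|.)*? '"';   with ESC : '\\"' | '\\\\'
    static final Pattern LOOSE = Pattern.compile("\"(?:\\\\\"|\\\\\\\\|.)*?\"");
    // STRING: '"' ( ESC | ~[\\"] )* '"';
    static final Pattern STRICT = Pattern.compile("\"(?:\\\\\"|\\\\\\\\|[^\\\\\"])*\"");

    /** True when the pattern accepts the entire text as one string literal. */
    static boolean accepts(Pattern p, String text) {
        return p.matcher(text).matches();
    }

    public static void main(String[] args) {
        String lone = "\"\\\"";        // the three characters "\"
        String escaped = "\"a\\\"b\""; // the six characters "a\"b"
        System.out.println("loose  accepts lone backslash: " + accepts(LOOSE, lone));
        System.out.println("strict accepts lone backslash: " + accepts(STRICT, lone));
        System.out.println("strict accepts escaped quote:  " + accepts(STRICT, escaped));
    }
}
```

In the loose pattern the . alternative is free to consume the backslash on its own, which leaves the final quote to close the string. Excluding the backslash from the catch-all alternative forces every backslash to be part of an ESC.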

Nested brackets/chars '(' and ')' in grammar/ANTLRWorks warning: Decision can match input such as ... using multiple alternatives

The grammar below parses ( left part = right part # comment ), # comment is optional.
Two questions:
1. Sometimes warning (ANTLRWorks 1.4.2):
Decision can match input such as "{Int, Word}" using multiple alternatives: 1, 2 (referencing id2)
But only sometimes!
2. The next extension should be that the comment (id2) can contain chars '(' and ')'.
The grammar:
grammar NestedBrackets1a1;
//==========================================================
// Lexer Rules
//==========================================================
Int
: Digit+
;
fragment Digit
: '0'..'9'
;
Special
: ( TCUnderscore | TCQuote )
;
TCListStart : '(' ;
TCListEnd : ')' ;
fragment TCUnderscore : '_' ;
fragment TCQuote : '"' ;
// A word must start with a letter
Word
: ( 'a'..'z' | 'A'..'Z' | Special ) ('a'..'z' | 'A'..'Z' | Special | Digit )*
;
Space
: ( ' ' | '\t' | '\r' | '\n' ) { $channel = HIDDEN; }
;
//==========================================================
// Parser Rules
//==========================================================
assignment
: TCListStart id1 '=' id1 ( comment )? TCListEnd
;
id1
: Word+
;
comment
: '#' ( id2 )*
;
id2
: ( Word | Int )+
;
ANTLRStarter wrote:
Sometimes warning (ANTLRWorks 1.4.2): Decision can match input such as "{Int, Word}" using multiple alternatives: 1, 2 (referencing id2)
But only sometimes!
No, the grammar you posted will always produce this warning. Perhaps you don't always notice it (your IDE-plugin or ANTLRWorks might show it in a tab you don't have opened), but the warning is there. Convince yourself by creating a lexer/parser from the command line:
java -cp antlr-3.4-complete.jar org.antlr.Tool NestedBrackets1a1.g
will produce:
warning(200): NestedBrackets1a1.g:49:19:
Decision can match input such as "{Int, Word}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
This is because you have a * after ( id2 ) inside your comment rule, and id2 is itself a repetition of tokens: ( Word | Int )+. Let's say your input is "# foo bar" (a # followed by two Word tokens). ANTLR can now parse the input in more than one way: the two tokens "foo" and "bar" could be matched by ( id2 )*, where id2 matches a single Word token at a time, but "foo" and "bar" could also be matched in one go by the id2 rule.
Look at the merged rules:
comment
: '#' ( ( Word | Int )+ )*
;
See how you're repeating a repetition: ( ( ... )+ )*? This is usually a problem, as it is in your case.
Resolve this problem by either replacing the * with a ?:
comment
: '#' ( id2 )?
;
id2
: ( Word | Int )+
;
or by removing the +:
comment
: '#' ( id2 )*
;
id2
: ( Word | Int )
;
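To see how quickly the ambiguity grows, one can count the ways n tokens after the # can be distributed over id2 invocations in the original ( ( Word | Int )+ )* form. The sketch below is only a counting argument in Java, not something ANTLR computes; the class name NestedRepetitionCount is made up.

```java
final class NestedRepetitionCount {
    /**
     * Number of ways to split n tokens into consecutive non-empty groups,
     * i.e. the number of distinct parses of ( ( Word | Int )+ )* for n tokens.
     */
    static long waysNested(int n) {
        long[] ways = new long[n + 1];
        ways[0] = 1; // zero tokens: the outer loop runs zero times
        for (int total = 1; total <= n; total++) {
            // The last id2 invocation consumes the final `len` tokens.
            for (int len = 1; len <= total; len++) {
                ways[total] += ways[total - len];
            }
        }
        return ways[n];
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 4; n++) {
            System.out.println(n + " token(s): " + waysNested(n) + " parse(s)");
        }
    }
}
```

For the two tokens of "# foo bar" this gives 2 parses, matching the two readings described above, and the count doubles with every extra token. With either fix, each input has exactly one parse.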
ANTLRStarter wrote:
The next extension should be that the comment (id2) can contain chars '(' and ')'.
That is asking for trouble since a comment is followed by a TCListEnd, which is a ). I don't recommend letting a comment match ).
EDIT
Note that comments are usually stripped from the source file while tokenizing the input source. That way you don't need to account for them in your parser rules. You can do that by "skipping" these tokens in a lexer rule:
Comment
: '#' ~('\r' | '\n')* {skip();}
;
