Suppose that I have a very simple grammar defined in ANTLR 4:
input : String Separator String ;
String : 'a'..'z' ;
Separator : ',' ;
For this grammar, the separator is fixed; it will always be a comma. Is there a way to make the separator variable? That is, I'd like to define the separator using an input parameter, which is set by the code that calls the lexer-parser. I can define a getter and setter like so:
@lexer::members
{
String sep = ",";
public void setSep(String sep)
{
this.sep = sep;
}
private String getSep()
{
return sep;
}
}
But how do I change the value of the separator in the lexer rule? This is close, but wrong:
Separator : ',' { setText(getSep()); } ;
After looking at some other questions, I decided to try and solve this with semantic predicates. Here's my complete solution:
grammar InputCombinedGrammar;
@parser::members
{
String sep = ",";
public void setSep(String sep)
{
this.sep = sep;
}
private String getSep()
{
return sep;
}
}
input : String { getSep().equals(_input.LT(1).getText()) }? Separator String EOF ;
String : 'a'..'z' ;
Separator : . ;
Two items to note:
The separator will match on any character, not just commas.
The semantic predicate uses lookahead to compare the next token to the separator. If it matches, then the rule proceeds. If not, then it will throw an error.
This solution is trusting the semantic predicate to only use the correct separator. I'm pretty happy with this solution, but I'd like to see others.
I would handle it inside the lexer:
@lexer::members {
...
}
input : String Separator String EOF ;
Separator : { sep.equals(String.valueOf((char) _input.LA(1))) }? . ;
String : 'a'..'z' ;
If you do it inside the parser, any single-character lexer rule defined after Separator : . ; can never match, because the Separator rule would catch that character first.
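To make the idea behind the lexer-side predicate concrete, here is a minimal plain-Java model of it. This is not ANTLR-generated code: SeparatorModel, isSeparator and tokenize are hypothetical names, and the sketch assumes a single-character separator, since the rule body . consumes exactly one character.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the predicated Separator rule; hypothetical, not generated by ANTLR. */
final class SeparatorModel {
    private String sep = ",";

    void setSep(String sep) {
        this.sep = sep;
    }

    /** Mirrors the predicate: true only if the next character is the configured separator. */
    boolean isSeparator(int nextChar) {
        return sep.length() == 1 && sep.charAt(0) == nextChar;
    }

    /** Classifies each character as Separator or String (a single letter a-z). */
    List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isSeparator(c)) {
                // Separator is tried first, as it is defined first in the grammar
                tokens.add("Separator");
            } else if (c >= 'a' && c <= 'z') {
                tokens.add("String");
            } else {
                throw new IllegalArgumentException("token recognition error at: '" + c + "'");
            }
        }
        return tokens;
    }
}
```

With the default separator, "a,b" is classified as String, Separator, String. After setSep(";"), the same input is rejected because ',' no longer satisfies the predicate and no other rule matches it.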
My lexer (target language C++) contains a simple rule for parsing a string literal:
STRING: '"' ~'"'+ '"';
But based on the value returned by a function, I want my lexer to return either a STRING or an IDENT.
I've tried the following:
STRING_START: '"' -> mode(current_string_mode());
or
STRING_START: '"' -> mode(current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING) ;
In either case, I get an error when trying to generate the lexer (the error message says: '"' came as a complete surprise).
Alas, that is not possible.
If I look at the grammar of ANTLR itself, I see this:
lexerCommands
: RARROW lexerCommand (COMMA lexerCommand)*
;
lexerCommand
: lexerCommandName LPAREN lexerCommandExpr RPAREN
| lexerCommandName
;
lexerCommandName
: identifier
| MODE
;
lexerCommandExpr
: identifier
| INT
;
In short: the part between parentheses (mode(...) or pushMode(...)) must be an identifier or an integer literal. It cannot be an expression (which is what you're trying to do).
So I have this lexical rule for string:
STRINGLIT: '"' ( ('\'[\"bftrn]) | ~[\n\"] )* '"' ;
For example, with the input "abc", I expect abc,<EOF> discarding the "
I read here http://www.antlr2.org/doc/lexer.html that you can use the ! operator. Then I would have:
STRINGLIT: '"'! ( ('\'[\"bftrn]) | ~[\n\"] )* '"'! ;
But then I can't make it work on the code.
The v2 functionality of the ! operator is no longer supported since v3 (you're using v4).
There is no equivalent operator in v3 or v4. The only way to strip the quotes is to do so in a listener or visitor after parsing, or embed target specific code in your lexer:
STRINGLIT
: '"' ( ( '\\' [\\bftrn"] ) | ~[\\\r\n"] )* '"'
{
// Get all the text that this rule matched
String matched = getText();
// Strip the first and the last characters (the quotes)
String matchedWithoutQuotes = matched.substring(1, matched.length() - 1);
// possibly do some more replacements here like replace `\\n` with `\n` etc.
// Set the new string to this token
setText(matchedWithoutQuotes);
}
;
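The "more replacements" step mentioned in the action's comment could look roughly like the helper below. This is only a sketch: StringLiterals and unquote are hypothetical names (not part of the ANTLR API), and it handles just the escapes the rule above accepts (\\, \b, \f, \t, \r, \n and \").

```java
/** Hypothetical helper for post-processing a matched string literal. */
final class StringLiterals {
    private StringLiterals() {}

    /** Strips the surrounding quotes and resolves the escape sequences the lexer rule allows. */
    static String unquote(String matched) {
        // Strip the first and the last characters (the quotes)
        String body = matched.substring(1, matched.length() - 1);
        StringBuilder out = new StringBuilder(body.length());
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c != '\\' || i + 1 == body.length()) {
                // Ordinary character (or a stray trailing backslash): copy as-is
                out.append(c);
                continue;
            }
            char next = body.charAt(++i);
            switch (next) {
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'n': out.append('\n'); break;
                default:  out.append(next); break; // \\ and \" map to the character itself
            }
        }
        return out.toString();
    }
}
```

Assuming the class is on the classpath of the generated lexer, the action body could then shrink to setText(StringLiterals.unquote(getText()));.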
I'm building a grammar for parsing Quake III shaders with ANTLR4.
Here's an example of a shader:
textures/liquids/lava-example
{
deformVertexes wave sin 0 3 0 0.1
q3map_tessSize 64
surfaceparm lava
qer_editorimage textures/common/lava.tga
{
map textures/common/lava.tga
}
}
As you can see, the structure is:
shadername
{
directive
directive
//....
{
directive
//...
}
}
My question
As you can see, a directive is composed of a key and some parameters. The keys and their parameters are known (more than 100 keys are possible). I wonder how to define one general rule for a directive (key + parameters) and specify each concrete directive alongside it. Moreover, it would be even better if I could split them across separate files to keep the grammars clean.
What I have for now
parser grammar ShaderParser;
options {
tokenVocab=ShaderLexer;
}
shaderlist
: shader+
;
shader
: shadername LBracket directive* stage* RBracket
;
shadername
: Path CompileTime?
;
stage
: LBracket directive* RBracket
;
directive
: key parameter?
| key DQuote parameter DQuote
| key SQuote parameter SQuote
;
key
: Word
;
parameter
: Word
| Integer
| Double
| Path
;
The directive, key and parameter (a single one for testing) rules were just for getting the main grammar.
Thanks for your help!
The grammar seems to need newlines as statement separators, so you should add them at the end of each directive. Also, your directive rule matches at most one parameter per key; the version below allows any number of them.
shader
: shadername NEWLINE
LBracket NEWLINE
directive*
stage*
RBracket NEWLINE?
EOF
;
shadername
: Path CompileTime?
;
stage
: LBracket NEWLINE
directive*
RBracket NEWLINE
;
directive
: key mayBeQuotedParameter* NEWLINE
;
mayBeQuotedParameter
: parameter
| DQuote parameter DQuote
| SQuote parameter SQuote
;
key
: Word
;
parameter
: Word
| Integer
| Double
| Path
;
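For the 100+ known keys, one option (my assumption, not something the grammar above requires) is to keep the generic directive rule and validate keys and parameter counts after parsing, for example from a listener. A minimal sketch follows. DirectiveTable and check are hypothetical names, and the parameter counts are copied from the example shader in the question only, so treat them as illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lookup table of known directive keys and their parameter counts. */
final class DirectiveTable {
    // Counts taken from the example shader only; a real table would list every key.
    private static final Map<String, Integer> PARAM_COUNTS = new HashMap<>();
    static {
        PARAM_COUNTS.put("deformVertexes", 6);
        PARAM_COUNTS.put("q3map_tessSize", 1);
        PARAM_COUNTS.put("surfaceparm", 1);
        PARAM_COUNTS.put("qer_editorimage", 1);
        PARAM_COUNTS.put("map", 1);
    }

    private DirectiveTable() {}

    /** Returns null if the directive is valid, otherwise an error message. */
    static String check(String key, List<String> params) {
        Integer expected = PARAM_COUNTS.get(key);
        if (expected == null) {
            return "unknown directive: " + key;
        }
        if (expected != params.size()) {
            return key + " expects " + expected + " parameter(s), got " + params.size();
        }
        return null;
    }
}
```

The table could just as well be loaded from a separate file, which keeps the grammar small while the list of directives grows.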
I have the following definition of a fragment:
fragment CHAR :'a'..'z'|'A'..'Z'|'\n'|'\t'|'\\'|EOF;
Now I have to define a lexer rule for string. I did the following :
STRING : '"' (CHAR)* '"' ;
However in string I want to match all of my characters except the new line '\n'. Any ideas how I can achieve that?
You'll also need to exclude " besides line breaks. Try this:
STRING : '"' ~('\r' | '\n' | '"')* '"' ;
The ~ negates char-sets.
But I want to negate only the new line from my CHAR set
No other way than this AFAIK:
STRING : '"' CHAR_NO_NL* '"' ;
fragment CHAR_NO_NL : 'a'..'z'|'A'..'Z'|'\t'|'\\'|EOF;
I'm trying to build a lexer to tokenize lone words and quoted strings. I got the following:
STRING: QUOTE (options {greedy=false;} : . )* QUOTE ;
WS : SPACE+ { $channel = HIDDEN; } ;
WORD : ~(QUOTE|SPACE)+ ;
For the corner cases, it needs to parse:
"string" word1" word2
As three tokens: "string" as STRING and word1" and word2 as WORD. Basically, if there is a last quote, it needs to be part of the WORD where it is. If the quote is surrounded by white space, it should be a WORD.
I tried this rule for WORD, without success:
WORD: ~(QUOTE|SPACE)+
| (~(QUOTE|SPACE)* QUOTE ~QUOTE*)=> ~(QUOTE|SPACE)* QUOTE ~(QUOTE|SPACE)* ;
I finally found something that could do the trick without resorting to writing Java code:
fragment QUOTE
: '"' ;
fragment SPACE
: (' '|'\r'|'\t'|'\u000C'|'\n') ;
WS : SPACE+ {$channel=HIDDEN;};
PHRASE : QUOTE (options {greedy=false;} : . )* QUOTE ;
WORD : (~(QUOTE|SPACE)* QUOTE ~QUOTE* EOF)=> ~(QUOTE|SPACE)* QUOTE ~(SPACE)*
| ~(QUOTE|SPACE)+ ;
That way, the predicate distinguishes between both:
PHRASE : QUOTE (options {greedy=false;} : . )* QUOTE ;
and
| ~(QUOTE|SPACE)+ ;