I'm building a grammar for parsing Quake III shaders with ANTLR4.
Here's an example of a shader:
textures/liquids/lava-example
{
deformVertexes wave sin 0 3 0 0.1
q3map_tessSize 64
surfaceparm lava
qer_editorimage textures/common/lava.tga
{
map textures/common/lava.tga
}
}
As you can see, the structure is:
shadername
{
directive
directive
//....
{
directive
//...
}
}
My question
As you can see, a directive is composed of a key and some parameters. The keys and their parameters are known in advance (more than 100 possible keys). I wonder how to define a general rule for a directive (key + parameters) and specify each concrete directive separately. It would be even better if I could split them across different files to keep the grammars clean.
What I have for now
parser grammar ShaderParser;
options {
tokenVocab=ShaderLexer;
}
shaderlist
: shader+
;
shader
: shadername LBracket directive* stage* RBracket
;
shadername
: Path CompileTime?
;
stage
: LBracket directive* RBracket
;
directive
: key parameter?
| key DQuote parameter DQuote
| key SQuote parameter SQuote
;
key
: Word
;
parameter
: Word
| Integer
| Double
| Path
;
The directive, key and parameter rules (with a single parameter, for testing) were only there to get the main grammar working.
Thanks for your help!
The grammar seems to need newlines as statement separators, so you should add them at the end of each directive. Also, your grammar matches at most one parameter per directive.
shader
: shadername NEWLINE
LBracket NEWLINE
directive*
stage*
RBracket NEWLINE?
EOF
;
shadername
: Path CompileTime?
;
stage
: LBracket NEWLINE
directive*
RBracket NEWLINE
;
directive
: key mayBeQuotedParameter* NEWLINE
;
mayBeQuotedParameter
: parameter // not optional: the * in directive already allows zero parameters
| DQuote parameter DQuote
| SQuote parameter SQuote
;
key
: Word
;
parameter
: Word
| Integer
| Double
| Path
;
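To illustrate the shape the revised directive rule accepts (one key, then zero or more parameters, terminated by a newline), here is a minimal sketch in plain Java. It does not use ANTLR; DirectiveLine and parse are hypothetical names, and quoted parameters are left out.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration: one directive per line, key first, parameters after. */
final class DirectiveLine {
    final String key;
    final List<String> parameters;

    DirectiveLine(String key, List<String> parameters) {
        this.key = key;
        this.parameters = parameters;
    }

    /** Splits a single line on runs of whitespace; the line end plays the role of NEWLINE. */
    static DirectiveLine parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts[0].isEmpty()) {
            throw new IllegalArgumentException("empty directive line");
        }
        // Everything after the key is a parameter; there may be none at all.
        return new DirectiveLine(parts[0], Arrays.asList(parts).subList(1, parts.length));
    }

    public static void main(String[] args) {
        DirectiveLine d = parse("deformVertexes wave sin 0 3 0 0.1");
        System.out.println(d.key + " -> " + d.parameters);
    }
}
```

The first directive of the example shader needs six parameters, which is why the single optional parameter in the original rule was not enough.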
Related
Description
I'm trying to create a custom language in which lexer rules are separated from parser rules. I also want to split the lexer and parser rules further into specific files (e.g., common lexer rules and keyword rules).
But I don't seem to be able to get it to work.
Although I'm not getting any errors while generating the parser (.java files), grun fails with Exception in thread "main" java.lang.ClassCastException.
Note
I'm running ANTLR 4.7.2 on Windows 7, targeting Java.
Code
I created a set of files that closely mimic what I intend to achieve. The example below defines a language called MyLang and separates lexer and parser grammar. Also, I'm splitting lexer rules into four files:
// MyLang.g4
parser grammar MyLang;
options { tokenVocab = MyLangL; }
prog
: ( func )* END
;
func
: DIR ID L_BRKT (stat)* R_BRKT
;
stat
: expr SEMICOLON
| ID OP_ASSIGN expr SEMICOLON
| SEMICOLON
;
expr
: expr OPERATOR expr
| NUMBER
| ID
| L_PAREN expr R_PAREN
;
// MyLangL.g4
lexer grammar MyLangL;
import SkipWhitespaceL, CommonL, KeywordL;
@header {
package com.invensense.wiggler.lexer;
}
@lexer::members { // place this class member only in lexer
Map<String,Integer> keywords = new HashMap<String,Integer>() {{
put("for", MyLangL.KW_FOR);
/* add more keywords here */
}};
}
ID : [a-zA-Z]+
{
if ( keywords.containsKey(getText()) ) {
setType(keywords.get(getText())); // reset token type
}
}
;
DIR
: 'in'
| 'out'
;
END : 'end' ;
// KeywordL.g4
lexer grammar KeywordL;
@lexer::header { // place this header action only in lexer, not the parser
import java.util.*;
}
// explicitly define keyword token types to avoid implicit def warnings
tokens {
KW_FOR
/* add more keywords here */
}
// CommonL.g4
lexer grammar CommonL;
NUMBER
: FLOAT
| INT
| UINT
;
FLOAT
: NEG? DIGIT+ '.' DIGIT+ EXP?
| INT
;
INT
: NEG? UINT+
;
UINT
: DIGIT+ EXP?
;
OPERATOR
: OP_ASSIGN
| OP_ADD
| OP_SUB
;
OP_ASSIGN : ':=';
OP_ADD : POS;
OP_SUB : NEG;
L_BRKT : '[' ;
R_BRKT : ']' ;
L_PAREN : '(' ;
R_PAREN : ')' ;
SEMICOLON : ';' ;
fragment EXP
: [Ee] SIGN? DIGIT+
;
fragment SIGN
: POS
| NEG
;
fragment POS: '+' ;
fragment NEG : '-' ;
fragment DIGIT : [0-9];
// SkipWhitespaceL.g4
lexer grammar SkipWhitespaceL;
WS
: [ \t\r\n]+ -> channel(HIDDEN)
;
Output
Here is the exact output I receive from the code above:
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§ antlr4.bat .\MyLangL.g4
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§ antlr4.bat .\MyLang.g4
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§ javac *.java
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§ grun MyLang prog -tree
Exception in thread "main" java.lang.ClassCastException: class MyLang
at java.lang.Class.asSubclass(Unknown Source)
at org.antlr.v4.gui.TestRig.process(TestRig.java:135)
at org.antlr.v4.gui.TestRig.main(TestRig.java:119)
ussjc-dd9vkc2 | C:\M\w\s\a\l\example
§
Rename your two grammars (and their files) to MyLangParser and MyLangLexer, update tokenVocab to match, then run grun MyLang prog -tree.
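The rename matters because grun (the org.antlr.v4.gui.TestRig class in the stack trace) derives the class names to load from the grammar name you pass it. The sketch below mimics that lookup order in plain Java, as I understand it; GrunNames and its methods are hypothetical names used for illustration and are not part of ANTLR.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper mirroring how grun turns a grammar name into class names. */
final class GrunNames {
    /** Lexer classes tried in order: "<name>Lexer", then "<name>" (pure lexer grammar). */
    static List<String> lexerCandidates(String grammarName) {
        return Arrays.asList(grammarName + "Lexer", grammarName);
    }

    /** Parser class looked up when a start rule other than "tokens" is given. */
    static String parserClass(String grammarName) {
        return grammarName + "Parser";
    }

    public static void main(String[] args) {
        System.out.println(lexerCandidates("MyLang"));
        System.out.println(parserClass("MyLang"));
    }
}
```

With the original names there is no MyLangLexer class, so the fallback loads MyLang (the parser) where a lexer is expected, which is consistent with the ClassCastException shown in the output.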
grammar TestGrammar;
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WORD : [a-z0-9._#+=]+(' '[a-z0-9._#+=]+)* ;
WS : [ \t\r\n]+ -> skip ;
quotedword : DQUOTE WORD DQUOTE;
expression
: LPAREN expression+ RPAREN
| expression (AND expression)+
| expression (OR expression)+
| expression (NOT expression)+
| NOT expression+
| quotedword
| WORD;
I've managed to implement the above grammar for antlr4.
I've got a long way to go but for now my question is,
how can I make WORD generic? Basically I want this [a-z0-9._#+=] to be anything except the operators (AND, OR, NOT, LPAREN, RPAREN, DQUOTE, SPACE).
The lexer picks the rule that matches the longest stretch of input. When several rules match input of the same length, the rule defined first in the grammar wins.
Therefore you can make your WORD rule generic by using this grammar:
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WS : [ \t\r\n]+ -> skip ;
WORD: .+? ;
Make sure to use the non-greedy operator ? in this case, because otherwise the WORD rule, once invoked, will consume all following input.
As WORD is specified last, it only produces a token when none of the lexer rules defined above it matches input of at least the same length.
EDIT: If you don't want your WORD rule to match any input then you just have to modify the rule I provided. But the essence of my answer is that in the lexer you don't have to worry about two rules potentially matching the same input as long as you got the order in the source code right.
Try something like this grammar:
grammar TestGrammar;
...
WORD : Letter+;
QUOTEDWORD : '"' (~["\\\r\n])* '"' ; // disallow quotes, backslashes and CR/LF in literals
WS : [ \t\r\n]+ -> skip ;
fragment Letter :
[a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
expression:
...
| QUOTEDWORD
| WORD+;
If you want to allow escape sequences in QUOTEDWORD, look at this example to see how to do it.
This grammar allows you:
-to have quoted words interpreted as string literals (preserving all spaces within)
-to have multiple words separated by whitespace (which is ignored)
I have the following grammar:
grammar Aligner;
line
: emptyLine
| codeLine
;
emptyLine
: ( KW_EMPTY KW_LINE )?
( EOL | EOF )
;
codeLine
: KW_LINE COLON
indent
CODE
( EOL | EOF )
;
indent
: absolute_indent
| relative_indent
;
absolute_indent
: NUMBER
;
relative_indent
: sign NUMBER
;
sign
: PLUS
| MINUS
;
COLON: ':';
MINUS: '-';
PLUS: '+';
KW_EMPTY: 'empty';
KW_LINE: 'line';
NUMBER
: DIGIT+
;
EOL
: ('\n' | '\r\n')
;
SPACING
: LINE_WS -> skip
;
CODE
: (~('\n' | '\r'))+
;
fragment
DIGIT
: '0'..'9'
;
fragment
LINE_WS
: ' '
| '\t'
| '\u000C'
;
When I try to parse empty line, I receive the error: line 1:0 no viable alternative at input 'empty line'. When I debug what is going on, the very first token is of type CODE and includes the whole line.
What am I doing wrong?
ANTLR will try to match the longest possible token. When two lexer rules match the same string of a given length, the first rule that appears in the grammar wins.
Your rule CODE is basically a catch-all: it will match whole lines of text. So here ANTLR has the choice of matching empty line as one single token of type CODE, and as no other rule can produce a token of length 10, the CODE rule will consume the whole line.
You should rewrite the CODE rule to make it match only what you mean by a code. Right now it's way too broad.
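The selection rule described above can be modelled in a few lines of plain Java. This is only a sketch built on java.util.regex, not ANTLR's actual implementation; LongestMatchDemo and nextToken are hypothetical names, and the regexes are hand translations of the lexer rules in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical model of the lexer's choice: longest match wins, ties go to the earlier rule. */
final class LongestMatchDemo {
    // Rules in the same order as the Aligner grammar, written as Java regexes.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("COLON", Pattern.compile(":"));
        RULES.put("MINUS", Pattern.compile("-"));
        RULES.put("PLUS", Pattern.compile("\\+"));
        RULES.put("KW_EMPTY", Pattern.compile("empty"));
        RULES.put("KW_LINE", Pattern.compile("line"));
        RULES.put("NUMBER", Pattern.compile("[0-9]+"));
        RULES.put("EOL", Pattern.compile("\\r?\\n"));
        RULES.put("SPACING", Pattern.compile("[ \\t\\f]"));
        RULES.put("CODE", Pattern.compile("[^\\r\\n]+"));
    }

    /** Returns "RULE:text" for the token starting at pos, or null if nothing matches. */
    static String nextToken(String input, int pos) {
        String bestRule = null;
        int bestEnd = pos;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input).region(pos, input.length());
            // Strictly greater: on a tie the earlier rule keeps the token.
            if (m.lookingAt() && m.end() > bestEnd) {
                bestRule = rule.getKey();
                bestEnd = m.end();
            }
        }
        return bestRule == null ? null : bestRule + ":" + input.substring(pos, bestEnd);
    }

    public static void main(String[] args) {
        System.out.println(nextToken("empty line", 0)); // CODE swallows the whole line
        System.out.println(nextToken("empty", 0));      // same length, so KW_EMPTY wins
    }
}
```

In this model KW_EMPTY only wins when CODE cannot match anything longer, which is why narrowing CODE (or restricting where it can apply) is the fix.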
Goal
I want to reduce (or eliminate) the Java-specific actions and predicates in my parser. Perhaps it isn't possible, but I wanted to ask here just in case there's some ANTLR4 feature I've missed. (The language itself is third-party, so I don't have control over that.)
Simplified example
The predicates I want to use are mostly exact (or perhaps case-insensitive) string-matching. I could make big parallel sets of parser rules, but I'd rather not since the real-life example is considerably more convoluted.
Suppose I'm given something like:
isWidget(int) : "Whether it is a widget" : 4 ;
ownerFirstName(string) : "john" ;
ownerLastName(string) : "This is the last-name of the owner" : "doe" ;
I want the parser to look at the default-value (the last item on the line, like 4, "john" or "doe") and parse it based on the earlier type (int), (string), (string).
main
: stmt SEMIC (stmt SEMIC)* EOF
;
stmt
: propname=IDENTIFIER LPAREN datatype=IDENTIFIER RPAREN (COLON description=QUOTSTRING)? COLON df=defaultVal
;
defaultVal
: QUOTSTRING //TODO only this alt if datatype=string
| NUM //TODO only this alt if datatype=int
;
fragment Letter : 'a'..'z' | 'A'..'Z' ;
fragment Digit : '0'..'9' ;
fragment Underscore : '_' ;
SEMIC : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
IDENTIFIER : (Letter|Underscore) (Letter|Underscore|Digit)* ;
QUOTSTRING : '"' ~('"' |'\n' | '\r' | '\u2029' | '\u2028')* '"' ;
NUM : Digit+ ;
WS : [ \t\n\r]+ -> skip ;
I know I can do it with predicates and rule inputs, but then I'm crossing the line from a language-agnostic grammar to one with embedded Java code.
Your parser should handle things like the following without a problem:
isWidget(int) : "Whether it is a widget" : "foo" ;
In other words, do not add a predicate that would fail in this case, or you will lose the ability to report sane error messages. Instead, use a language-specific listener or visitor implementation after the parse is complete to report a semantic error if the type of the default value does not match the declared type.
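A minimal sketch of such a post-parse check, in plain Java. It assumes the listener or visitor callback for the stmt rule hands over the text of propname, datatype and the default value; DefaultValueCheck and check are hypothetical names, and only the int and string types from the example are covered.

```java
/** Hypothetical post-parse check: the grammar accepts any default value, this reports mismatches. */
final class DefaultValueCheck {
    /** Returns an error message, or null when the default value fits the declared type. */
    static String check(String propName, String dataType, String defaultText) {
        boolean isQuoted = defaultText.length() >= 2
                && defaultText.startsWith("\"") && defaultText.endsWith("\"");
        boolean isNumber = defaultText.matches("[0-9]+");
        if (dataType.equals("string") && !isQuoted) {
            return propName + ": expected a quoted string default, got " + defaultText;
        }
        if (dataType.equals("int") && !isNumber) {
            return propName + ": expected an integer default, got " + defaultText;
        }
        return null; // matching type, or a type this sketch does not know about
    }

    public static void main(String[] args) {
        System.out.println(check("isWidget", "int", "4"));
        System.out.println(check("isWidget", "int", "\"foo\""));
    }
}
```

Because the grammar itself stays free of predicates, the mismatch is reported as a targeted semantic error instead of a generic syntax error.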
I wanted to try this tool, ANTLR, so that I could eventually parse some code and refactor it. I tried some small grammars, everything was OK, so I took the next step and started parsing a sort of simple C#.
The good news: it takes like 10 minutes to understand the basics.
The extremely bad news: it takes hours to understand how to parse two spaces instead of just one. Really. This thing hates whitespace, and has no shame in telling you that. Honestly I started to think it was unable to parse it, but then something went the right way... Or at least I thought so.
Now the problem of spaces comes after the fact that ANTLRWorks tries to allocate half a GB of ram and cannot really parse anything.
The grammar is not very complicated, since I'm a beginner:
grammar newEmptyCombinedGrammar;
TokenEndCmd : ';' ;
TokenGlobImport : 'import' ;
TokenGlobNamespace : 'namespace' ;
TokenClass : 'class' ;
TokenSepFloat : ',' ;
TokenSepNamespace : '.' ;
fragment TokenEmptyString : '' ;
TokenUnderscore : '_' ;
TokenArgsSep : ',' ;
TokenArgsOpen : '(' ;
TokenArgsClose : ')' ;
TokenBlockOpen : '{' ;
TokenBlockClose : '}' ;
// --------------------
Digit : [0-9] ;
numberInt : Digit+ ;
numberFloat : numberInt TokenSepFloat numberInt ;
WordCI : [a-zA-Z]+ ;
WordUP : [A-Z]+ ;
WordLW : [a-z]+ ;
// -----------------
keyword : (WordCI | TokenUnderscore+) (numberInt | WordCI | TokenUnderscore)* ;
// ---------------------
spaces : (' ' | '\t')+ ;
spaceLNs : (' ' | '\t' | '\r' | '\n')+ ;
spacesOpt : spaces* ;
spaceLNsOpt : spaceLNs* ;
// ---------------------
// e.g. "System" or "System.Net.Socket"
namepaceNameComposited : keyword (TokenSepNamespace keyword)* ;
// import System; import System.IO;
globImport : TokenGlobImport spaces namepaceNameComposited spacesOpt TokenEndCmd ;
// class class1 {}
namespaceClass : TokenClass spaces keyword spaceLNsOpt TokenBlockOpen spaceLNsOpt TokenBlockClose ;
// "namespace ns1 {}", "namespace ns1.sns2{}"
globNamespace : TokenGlobNamespace spaces namepaceNameComposited spaceLNsOpt TokenBlockOpen spaceLNsOpt namespaceClass spaceLNsOpt TokenBlockClose ;
globFile : (globImport | spaceLNsOpt)* (globNamespace | spaceLNsOpt)* ;
but still, when globFile or globNamespace is added, the IDE starts to allocate memory like there's no tomorrow, and that's obviously a problem.
So
-is this the right way to capture whitespace? (I don't want to skip it, that's the point)
-is the memory leaking because of a recursion I don't see?
The code that this thing is able to parse is something like:
import System;
namespace aNamespace{
class aClass{
}
}
globFile is the main rule, by the way.
You should define a lexer token to treat whitespace the way you need. If you want a group of consecutive space or tab characters to form a single token, use a definition like the following. In this case, you would reference whitespace in the parser rules as Whitespace (required) or Whitespace? (optional).
// ANTLR 3:
Whitespace : (' ' | '\t')+;
// ANTLR 4:
Whitespace : [ \t]+;
If you want every individual whitespace character to be its own token, use something like the following. In this case, you would reference whitespace in the parser rules as Whitespace+ (required) or Whitespace* (optional).
// ANTLR 3:
Whitespace : ' ' | '\t';
// ANTLR 4:
Whitespace : [ \t];
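The difference between the two definitions shows up in how many whitespace tokens the parser sees. Here is a rough illustration in plain Java using regexes instead of ANTLR; WhitespaceTokens and split are hypothetical names, and every run of non-whitespace is lumped into one token for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of grouped versus per-character whitespace tokens. */
final class WhitespaceTokens {
    /** groupRuns=true models [ \t]+ (one token per run); false models [ \t] (one per character). */
    static List<String> split(String input, boolean groupRuns) {
        Pattern p = Pattern.compile(groupRuns ? "[ \\t]+|[^ \\t]+" : "[ \\t]|[^ \\t]+");
        Matcher m = p.matcher(input);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "import  System;"; // two spaces between the words
        System.out.println(split(input, true).size());  // whitespace run is a single token
        System.out.println(split(input, false).size()); // each space is its own token
    }
}
```

With grouped runs the parser rule needs exactly one Whitespace between the words; with per-character tokens it needs Whitespace+.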
The question about memory leaks probably belongs on the ANTLRWorks issue tracker.
ANTLRWorks 1 issue tracker: https://github.com/antlr/antlrworks/issues
ANTLRWorks 2 issue tracker: https://bitbucket.org/sharwell/antlrworks2/issues
The problem is indeed the last rule:
globFile : (globImport | spaceLNsOpt)* (globNamespace | spaceLNsOpt)* ;
I changed it like this:
globFile : (globImport spaceLNsOpt)* (globNamespace spaceLNsOpt)* ;
and it seems that adding an EOF helps:
globFile : (globImport spaceLNsOpt)* (globNamespace spaceLNsOpt)* EOF ;
but this is not sufficient; the rule still does not work.