I have the little grammar below. node is the start production. When my input is (a:b) I get an error: line 1:1 extraneous input 'a' expecting {':', INAME}
Why is this?
EDIT - I forgot that the lexer and parser run as separate phases. By the time the parser runs, the lexer has completed. When the lexer runs it has no knowledge of the parser rules. It has already made the TYPE/INAME decision, choosing TYPE per #bart's reasoning below.
grammar g1;
TYPE: [A-Za-z_];
INAME: [A-Za-z_];
node: '(' namesAndTypes ')';
namesAndTypes:
INAME ':' TYPE
| ':' TYPE
| INAME
;
That is because the lexer will never produce an INAME token. The lexer works in the following way:
try to match as many characters as possible
when 2 or more lexer rules match the same characters, let the one defined first "win"
Because the input "a" and "b" both match the TYPE and INAME rules, the TYPE rule wins because it is defined first. It doesn't matter that the parser is trying to match an INAME token; the lexer will not produce it. The lexer does not "listen" to the parser.
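The two lexer rules above can be modelled in a few lines of Python. This is only a sketch of the behaviour, not ANTLR's implementation: the RULES table and the tokenize function are made up for illustration, and Python regexes stand in for the lexer rules.

```python
import re

# Lexer rules in definition order, mirroring the grammar in the question.
# Python regexes stand in for ANTLR lexer rules; this is only a model.
RULES = [
    ("TYPE", r"[A-Za-z_]"),
    ("INAME", r"[A-Za-z_]"),
    ("'('", r"\("),
    ("')'", r"\)"),
    ("':'", r":"),
]

def tokenize(text):
    """Longest match wins; on a tie, the rule defined first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

tokens = tokenize("(a:b)")
print(tokens)
```

In this model both a and b come out as TYPE, and INAME never appears in the token list, no matter what a parser would like to see at that position.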
You could create some sort of ID rule, and then define type and iname parser rules instead:
ID: [A-Za-z_];
node
: '(' namesAndTypes ')'
;
namesAndTypes
: iname ':' type
| ':' type
| iname
;
type
: ID
;
iname
: ID
;
I want to write a grammar using Antlr4 that will parse some definitions, but I've been struggling to get Antlr to co-operate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?
Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
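As a rough illustration of the difference, here is a toy matcher in Python. It is not how ANTLR works internally (ANTLR also attempts error recovery); the function name match_statements and the token names STMT and JUNK are invented for this sketch.

```python
def match_statements(tokens, require_eof=False):
    """Match one or more 'STMT' tokens; return how many tokens were consumed."""
    pos = 0
    while pos < len(tokens) and tokens[pos] == "STMT":
        pos += 1
    if pos == 0:
        raise SyntaxError("expected at least one statement")
    # Without this check, leftover input is silently ignored.
    if require_eof and pos != len(tokens):
        raise SyntaxError(f"extraneous input {tokens[pos]!r} at token {pos}")
    return pos

tokens = ["STMT", "STMT", "JUNK", "STMT"]
consumed = match_statements(tokens)  # stops quietly in front of 'JUNK'
print(consumed)
```

Calling match_statements(tokens, require_eof=True) on the same list raises a SyntaxError instead of returning early, which is the effect of appending EOF to a rule.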
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, that means those rules will always match on the same input and they will always produce matches of the same length. So by the maximum munch rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).
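To make the division of labour concrete, here is a small Python sketch of the same idea: the lexer knows only one IDENTIFIER token, and the role of each identifier (type name, property name, property type) is decided by its position in the statement. The names TOKEN_RE, lex and parse are invented for this sketch, and the hand-written parser only approximates what an ANTLR-generated one would do.

```python
import re

# One IDENTIFIER token; the keyword is listed first so it wins over IDENTIFIER.
TOKEN_RE = re.compile(r"""
    (?P<TYPE_KEYWORD>type\b)
  | (?P<IDENTIFIER>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<COLON>:)
  | (?P<NEWLINE>\r?\n)
  | (?P<WS>[ \t]+)
""", re.VERBOSE)

def lex(text):
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":  # whitespace is skipped, as in the grammar
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse(text):
    """Classify identifiers by position in the statement, not in the lexer."""
    statements, line = [], []
    # A trailing NEWLINE sentinel flushes the last line.
    for kind, value in lex(text) + [("NEWLINE", "\n")]:
        if kind != "NEWLINE":
            line.append((kind, value))
            continue
        kinds = [k for k, _ in line]
        if kinds == ["TYPE_KEYWORD", "IDENTIFIER"]:
            statements.append(("type", line[1][1]))
        elif kinds == ["IDENTIFIER", "COLON", "IDENTIFIER"]:
            statements.append(("property", line[0][1], line[2][1]))
        elif kinds:
            raise SyntaxError(f"cannot parse line: {line}")
        line = []
    return statements

sample = "type SimpleType\n\nintProp1: int\nstringProp2 : String\n"
result = parse(sample)
print(result)
```

On the sample input from the question, this yields one type statement and two property statements, because nothing is asked of the lexer beyond "this is an identifier".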
I'm writing a DSL in Xtext in which people can declare some variables. The grammar is as follows:
Cosem:
cosem+=ID '=' 'COSEM' '(' class=INT ',' version=INT ',' obis=STRING ')' ;
Attributes :
attribute+=ID '=' 'ATTRIBUTE' '(' object=ID ',' attribute_name=STRING ')' ;
Action:
action+=ID '=' 'ACTION' '(' object=ID ',' action_name=STRING ')';
The DSL has some methods, like the print method:
Print:
'PRINT' '(' var0=(STRING|ID) (','var1+=(STRING|ID) )* ')' |
'PRINT' '(' ')'
;
I put all my variables in maps so I can use them later in my code. The key identifying each of them is its ID, which is a string.
However, in my interpreter I can't tell the difference between a string and an ID:
def dispatch void exec(Print p) {
if (LocalMapAttribute.containsKey(p.var0) )
{print(LocalMapAttribute.get(p.var0))}
else if (LocalMapAction.containsKey(p.var0)){print(LocalMapAction.get(p.var0))}
else if (LocalMapCosem.containsKey(p.var0)){print(LocalMapCosem.get(p.var0))}
else
{print("erreeeur Print")}
p.var1.forEach[v
| if (LocalMapAttribute.containsKey(v)){print(LocalMapAttribute.get(v))}
else if (LocalMapAction.containsKey(v)){print(LocalMapAction.get(v))}
else if (LocalMapCosem.containsKey(v)){print(LocalMapCosem.get(v))}
else{print("erreur entre print")} ]
}
For example, when I write PRINT ("attribut2",attribut2) the result should be
attribut2 "the value of attribut2"
but I get
"the value of attribut2" "the value of attribut2"
Your current grammar structure makes this hard to do, since you throw away the information at the point where you fill the map.
You can use org.eclipse.xtext.nodemodel.util.NodeModelUtils.findNodesForFeature(EObject, EStructuralFeature) to obtain the actual text (which may still contain the original value, including the ""),
or you can change your grammar to
var0=Value ...
Value: IDValue | StringValue;
IDValue: value=ID;
StringValue: value=STRING;
Then you can look at the type (IDValue or StringValue) to decide whether you need to put the text into "" (org.eclipse.xtext.util.Strings.convertToJavaString(String, boolean) might be helpful).
Or you can try to use a special replacement for STRINGValueConverter that does not strip the quotation marks.
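The Value / IDValue / StringValue split can be sketched outside Xtext as well. The Python below is only an analogy for the interpreter logic, under the assumption that a single map holds the variables; the class names mirror the grammar rules, while render and variables are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class IDValue:
    value: str  # name of a declared variable

@dataclass
class StringValue:
    value: str  # literal text, quotes already stripped

def render(arg, variables):
    """Print literals as they are; look up only real identifiers."""
    if isinstance(arg, StringValue):
        return arg.value
    if isinstance(arg, IDValue):
        if arg.value not in variables:
            raise KeyError(f"unknown variable {arg.value!r}")
        return variables[arg.value]
    raise TypeError(f"unsupported argument {arg!r}")

# Models PRINT ("attribut2", attribut2) from the question.
variables = {"attribut2": "the value of attribut2"}
args = [StringValue("attribut2"), IDValue("attribut2")]
output = " ".join(render(a, variables) for a in args)
print(output)
```

Because the node type records whether the user wrote a string or an ID, the literal "attribut2" is no longer mistaken for the variable of the same name.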
With the following grammar
grammar Gram;
exprEof
: expr EOF
;
expr
: Uident
| expr '(' Uident ')'
;
Uident
: [A-Z][a-z]*
;
WS
: [ \n\t]+ -> skip
;
if I try to parse the input Foo(A B) from exprEof I get the expected error
line 1:6 extraneous input 'B' expecting ')'
But if I add an additional rule
expr2
: expr '(' Uident ')'
;
then the error is
line 1:3 mismatched input '(' expecting <EOF>
This is surprising because expr2 is not actually called from exprEof. In my full grammar it leads to very unhelpful error messages, where a syntax error deep in an expression is reported as mismatched input '(' expecting <EOF> near the beginning of the expression.
I'm using ANTLR 4.5.3.
I am writing an ANTLR 4 grammar for a language that will have switch statements that do not allow fallthrough (similar to C#). All case statements must be terminated by a break statement. Multiple case statements can follow each other without any code in between (again, just like in C#). Here is a snippet of the grammar that captures this:
grammar MyGrammar;
switchStmt : 'switch' '(' expression ')' '{' caseStmt+ '}' ;
caseStmt : (caseOpener)+ statementList breakStmt ;
caseOpener : 'case' literal ':'
| 'default' ':'
;
statementList : statement (statement)* ;
breakStmt : 'break' ';' ;
I left out the definitions of expression and statement for brevity. However, it's important to note that the definition for statement includes breakStmt. This is because break statements can also be used to break out of loops.
In general the grammar is fine - it parses input as expected. However, I get warnings during the parse like "line 18:0 reportAttemptingFullContext d=10 (statementList), input='break;'" and "line 18:0 reportContextSensitivity d=10 (statementList), input='break;'". This makes sense because the parser is not sure whether to match a break statement as statement or as breakStmt and needs to fall back on ALL(*) parsing. My question is, how can I change my grammar in order to eliminate the need for this during the parse and avoid the performance hit? Is it even possible to do without changing the language syntax?
You should remove the breakStmt reference from the end of caseStmt, and instead perform this validation in a listener or visitor after the parse is complete. This offers you the following advantages:
Improved error handling when a user omits the required break statement.
Improved parser performance by removing the ambiguity between the breakStmt at the end of caseStmt and the statementList that precedes it.
I would use the following rules:
switchStmt
: 'switch' '(' expression ')' '{' caseStmt* '}'
;
caseStmt
: caseOpener statementList?
;
statementList
: statement+
;