Config parser with ANTLR 4 - antlr4

i have some problems creating a config parser with ANTLR 4. The config files have the following syntax:
section1{
key=value;
key=value;
}
section2{
key=value;
}
I have also written a lexer/parser:
grammar Config;
fragment IdentifierText: [A-Za-z]+[A-Za-z0-9]*;
fragment IdentifierNumber:[0-9]+'.'?[0-9]*;
fragment IdentifierBool: 'false'|'true';
Section: IdentifierText;
Key: [A-Za-z]+[A-Za-z0-9]*;
Value: IdentifierText|IdentifierNumber|IdentifierBool;
Whitespace: [\t\b \f\r\n]+ -> skip;
start: configs;
configs: config+;
config: section statement ;
statement: '{' assignment+ '}';
section: Section;
assignment: Key '=' Value ';';
But if I use this as a sample:
Test{
debug=false;
}
I get the following errors:
line 2:0 mismatched input 'debug' expecting Key
line 2:5 mismatched input '=' expecting '{'
line 2:11 mismatched input ';' expecting '{'
Any ideas how to solve the problem?
thanks in advance

Your Section lexer rule is indistinguishable from Key. When an input matching one of these rules is found, the ambiguity is resolved by choosing the first one that appears in your grammar; in this case Section. In other words, it is impossible for the lexer as you have defined it to return a Key token.
You should resolve this by using a single Identifier rule for both sections and keys, and using parser rules to distinguish between them.
section : Identifier;
key : Identifier;
Identifier : IdentifierText;

Related

Antlr4 token recognition error and extraneous input

I'm trying to create SQL interpreter for my project. I ran into these errors when I run my program.
line 2:28 token recognition error at: ''a'
line 2:33 token recognition error at: '','
line 2:30 extraneous input 'nna' expecting Value
This is my test sql query:
INSERT INTO teacher VALUES ('Anna', 21);
And the partial of my grammar is:
insert: INSERT INTO ValidName VALUES '(' Value (',' Value)* ')' ';' ;
Value: Number | String;
ValidName: [a-z][a-z0-9_]*;
Number: [0-9]+;
String: '\''[^']+'\'';
I try to print out ctx.children and got this:
[INSERT, INTO, teacher, VALUES, (, nna, 21, ), ;]
Would anyone please help me where did I do wrong?
A couple of thing should help:
1 - Value should probably be a parser rule rather than a Lexer rule:
value: Number | String;
(and change the Valuess in your rules to values
2 - For your STRING rule, it's a bit simpler to use the non-greedy operator to pick up everything until you match the next character:
STRING: '\'' .*? '\'';

Matching the finite closure pattern ({x,y}) of regular expressions

I am trying to write a grammar that will match the finite closure pattern for regular expressions ( i.e foo{1,3} matches 1 to 3 'o' appearances after the 'fo' prefix )
To identify the string {x,y} as finite closure it must not include spaces for example { 1, 3} is recognized as a sequence of seven characters.
I have written the following lexer and parser file but i am not sure if this is the best solution. I am using a lexical mode for the closure pattern which is activated when a regular expression matches a valid closure expression.
lexer grammar closure_lexer;
#header { using System;
using System.IO; }
#lexer::members{
public static bool guard = true;
public static int LBindex = 0;
}
OTHER : .;
NL : '\r'? '\n' ;
CLOSURE_FLAG : {guard}? {LBindex =InputStream.Index; }
'{' INTEGER ( ',' INTEGER? )? '}'
{ closure_lexer.guard = false;
// Go back to the opening brace
InputStream.Seek(LBindex);
Console.WriteLine("Enter Closure Mode");
Mode(CLOSURE);
} -> skip
;
mode CLOSURE;
LB : '{';
RB : '}' { closure_lexer.guard = true;
Mode(0); Console.WriteLine("Enter Default Mode"); };
COMMA : ',' ;
NUMBER : INTEGER ;
fragment INTEGER : [1-9][0-9]*;
and the parser grammar
parser grammar closure_parser;
#header { using System;
using System.IO; }
options { tokenVocab = closure_lexer; }
compileUnit
: ( other {Console.WriteLine("OTHER: {0}",$other.text);} |
closure {Console.WriteLine("CLOSURE: {0}",$closure.text);} )+
;
other : ( OTHER | NL )+;
closure : LB NUMBER (COMMA NUMBER?)? RB;
Is there a better way to handle this situation?
Thanks in advance
This looks quite complex for such a simple task. You can easily let your lexer match one construct (preferably that without whitespaces, if you usually skip them) and the parser matches the other form. You don't even need lexer modes for that.
Define your closure rule:
CLOSURE
: OPEN_CURLY INTEGER (COMMA INTEGER?)? CLOSE_CURLY
;
This rule will not match any form that contains e.g. whitespaces. So, if your lexer does not match CLOSURE you will get all the individual tokens like the curly braces and integers ending up in your parser for matching (where you then can treat them as something different).
NB: doesn't the closure definition also allow {,n} (same as {n})? That requires an additional alt in the CLOSURE rule.
And finally a hint: your OTHER rule will probably give you trouble as it matches any char and is even located before other rules. If you have a whildcard rule then it should be the last in your grammar, matching everything not matched by any other rule.

ANTLR v4 grammar fails to parse due to mismatched EOF

Follows a simple grammar with ANTLR v4. This grammar when walked produces a error message
**line 1:14 mismatched input '' expecting DimensionName*
for trivial input such as "sdarsfd integer" (without quotation marks).
SO has mention f similar errors and a bug perhaps were filed in 4.3 timeframe.
I have been using ANTLR 4.5.
Any help/pointer/solution?
/**
A simple parser for a dimension declaration
*/
grammar Simple;
definition : dim;
dim : DimensionName DataType;
DimensionName : LETTER (LETTER)*; // greedy
DataType: 'integer' | 'decimal';
LETTER : [a-zA-Z];
DIGIT : [0-9];
WS: [ \t\n\r]+ -> skip;
You just have to switch the two lexer rules DataType and DimensionName
...
DataType: 'integer' | 'decimal';
DimensionName : LETTER (LETTER)*; // greedy
...
As DimensionName matches every chars, 'integer' is typed as a DimensionName instead of a DataType. For "sdarsfd integer", the lexer produces two DimensionName token, so the dim rule cannot be matched. By switching the two lexer rules, the lexer produces a DimensionName token and a DataType token which match the dim rule.
Also, you can define LETTER and DIGIT as fragment:
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
Unless you want them to be matched as independent token (in your grammar, "a" will be typed as a LETTER).

No viable alternative at input ' '

I know this question has been asked before, but I haven't found any solution to my specific problem. I am using Antlr4 with the C# target and I have the following lexer rules:
INT : [0-9]+
;
LETTER : [a-zA-Z_]+
;
WS : [ \t\r\n\u000C]+ -> skip
;
LineComment
: '#' ~[\r\n]* -> skip
;
That are all lexer rules, but there are many parser rules which I will not post here since I don't think it is relevant.
The problem I have is that whitespaces do not get skipped. When I inspect the token stream after the lexer ran my input, the whitespaces are still in there and therefore cause an error. The input I use is relatively basic:
"fd 100"
it parses complete until it reaches this parser rule:
noSignFactor
: ':' ident #NoSignFactorArg
| integer #NoSignFactorInt
| float #NoSignFactorFloat
| BOOLEAN #NoSignFactorBool
| '(' expr ')' #NoSignFactorExpr
| 'not' factor #NoSignFactorNot
;
integer : INT #IntegerInt
;
Start by separating your grammar into a separate lexer grammar and parser grammar. For example, if you have a grammar Foo;, create the following:
Create a file FooLexer.g4, and move all of the lexer rules from Foo.g4 into FooLexer.g4.
Create a file FooParser.g4, and move all of the parser rules from Foo.g4 into FooParser.g4.
Include the following option in FooParser.g4:
options {
tokenVocab=FooLexer;
}
This separation will ensure that your parser isn't silently creating lexer rules for you. In a combined grammar, using a literal such as 'not' in a parser rule will create a lexer rule for you if one does not already exist. When this happens, it's easy to lose track of what kinds of tokens your lexer is able to produce. When you use a separate lexer grammar, you will need to explicitly declare a rule like the following in order to use 'not' in a parser rule.
NOT : 'not';
This should solve the problems with whitespace should you have included the literal ' ' somewhere in a parser rule.

ANTLR4 lexer not resolving ambiguity in grammar order

Using ANTLR 4.2, I'm trying a very simple parse of this test data:
RRV0#ABC
Using a minimal grammar:
grammar Tiny;
thing : RRV N HASH ID ;
RRV : 'RRV' ;
N : [0-9]+ ;
HASH : '#' ;
ID : [a-zA-Z0-9]+ ;
WS : [\t\r\n]+ -> skip ; // match 1-or-more whitespace but discard
I expect the lexer RRV to match before ID, based on the excerpt below from Terence Parr's Definitive ANTLR 4 reference:
BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter
Running the ANTLR4 test rig with the test data above, the output is
[#0,0:3='RRV0',<4>,1:0]
[#1,4:4='#',<3>,1:4]
[#2,5:7='ABC',<4>,1:5]
[#3,10:9='<EOF>',<-1>,2:0]
line 1:0 mismatched input 'RRV0' expecting 'RRV'
I can see the first token is <4> for ID, with the value 'RRV0'
I have tried rearranging the lexer item order. I have also tried using implicit lexer items by explicitly matching in the grammar rule (rather than through an explicit lexer item). I tried making matches non greedy too. Those were not successful for me.
If I change the lexed ID item to not match upper case then the RRV item does match and the parse will get further.
I started in ANTLR 4.1 with the same issue.
I checked in ANTLRWorks and from the command line, with the same result both ways.
How can I change the grammar to match lexer item RRV in preference to ID ?
The grammar order resolution policy only applies when two different lexer rules match the same length of token. When the length differs, the longest one always wins. In your case, the ID rule matches a token with length 4, which is longer than the RRV token that only matches 3 characters.
This strategy is especially important in languages like Java. Consider the following input:
String className = "";
Along with the following two grammar rules (slightly simplified):
CLASS : 'class';
ID : [a-zA-Z_] [a-zA-Z0-9_]*;
If we only considered grammar order, then the input className would produce a keyword followed by the identifier Name. Rearranging the rules wouldn't solve the problem because then there would be no way to ever create a CLASS token, even for the input class.

Resources