ANTLR4: Parser adding duplicate entries

I have the below input to be parsed:
([LANGUAGE] IN ("Arabic", "Dutch") AND [Content Series] IN ("The Walking Dead") AND [PUBLISHER_NAME] IN ("Yahoo Search", "Yahoo! NAR") )
OR
([LANGUAGE] IN ("English") AND [PUBLISHER_NAME] IN ("Aol News", "Microsoft-Bing!") )
Basically the input has two groups separated by OR. Both groups have several base expressions (targetEntities) separated by AND. So each group has a list of target entities.
Grammar file:
grammar Exp;
options {
language = Java;
}
start
: def EOF
;
def : (AND? base)+
| (OR? '(' def ')')*
;
base : key operator values ;
key : LSQR ID RSQR ;
values : '(' VALUE (',' VALUE)* ')' ;
operator : IN
| NIN
;
VALUE: '"' .*? '"' ;
AND : 'AND' ;
OR : 'OR' ;
NOT : 'not' ;
EQ : '=' ;
COMMA : ',' ;
SEMI : ';' ;
IN : 'IN' ;
NIN : 'NOT_IN' ;
LSQR : '[' ;
RSQR : ']' ;
INT : [0-9]+ ;
ID: [a-zA-Z_][a-zA-Z_0-9-!]* ;
WS: [\t\n\r\f ]+ -> skip ;
Below are the listener and parser:
@Component
@NoArgsConstructor
public class ANTLRTargetingExpressionParser {
static List<Group> groupList = new ArrayList<>();
public String entityOperator;
public static class ExpMapper extends ExpBaseListener {
TargetEntity targetEntity;
Group group;
List<TargetEntity> targetEntities;
private static int inc = 1;
@Override
public void exitDef(ExpParser.DefContext ctx) {
group.setTargets(targetEntities);
groupList.add(group);
super.exitDef(ctx);
}
@Override
public void exitValues(ExpParser.ValuesContext ctx) {
targetEntity.setValues(
Arrays.asList(
Arrays.toString(ctx.VALUE().stream().collect(Collectors.toSet()).toArray())));
super.exitValues(ctx);
targetEntities.add(targetEntity);
}
@Override
public void exitOperator(ExpParser.OperatorContext ctx) {
targetEntity.setOperator(ctx.getText());
super.exitOperator(ctx);
}
@Override
public void exitKey(ExpParser.KeyContext ctx) {
targetEntity = new TargetEntity();
ctx.getParent();
targetEntity.setEntity(ctx.ID().getText());
super.exitKey(ctx);
}
@Override
public void enterDef(ExpParser.DefContext ctx) {
group = new Group();
targetEntities = new ArrayList<>();
super.enterDef(ctx);
}
}
public List<Group> parse(String expression) {
ANTLRInputStream in = new ANTLRInputStream(expression);
ExpLexer lexer = new ExpLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
ExpParser parser = new ExpParser(tokens);
parser.setBuildParseTree(true); // tell ANTLR to build a parse tree
ParseTree tree = parser.def();
/** Create standard walker. */
ParseTreeWalker walker = new ParseTreeWalker();
System.out.println(tree.toStringTree(parser));
ExpMapper mapper = new ExpMapper();
walker.walk(mapper, tree);
return groupList;
}
}
Output:-
[Group(targets=[{LANGUAGE, IN, [["Dutch", "Arabic"]]}, {Content_Series, IN, [["The Walking Dead"]]}, {PUBLISHER_NAME, IN, [["Yahoo Search", "Yahoo! NAR"]]}]),
Group(targets=[{LANGUAGE, IN, [["English"]]}, {PUBLISHER_NAME, IN, [["Aol News", "Microsoft-Bing!"]]}]),
Group(targets=[{LANGUAGE, IN, [["English"]]}, {PUBLISHER_NAME, IN, [["Aol News", "Microsoft-Bing!"]]}])]
Q1:- I am getting duplicate values in the groupList at the end. I tried checking the value in ctx to stop the walker, but that didn't help.
Q2:- Also, how can we catch, in Java, the syntax errors reported by the grammar when it is given invalid input?

(NOTE: It's MUCH easier to sort questions out if you ensure that the examples you provide are valid and are compilable. I had to change a few things just to get a clean parse, and there's too much missing to attempt to compile and run your code.)
That said....
def : (AND? base)+
| (OR? '(' def ')')*
;
Would normally be represented as something akin to
def: '(' def ')'
| def AND def
| def OR def
| base
;
(Note: these are not exactly equivalent. Your rule requires parentheses around defs used in an OR, but disallows them when used with AND. Those would be "odd" constraints, so I'm not sure if you intended that.)
You'll notice here that it's clear that a def can contain other defs. This is also true in your rule (but only as the second half of the OR alternative).
It can be really useful to use a plugin or the -gui option of the antlr tool to see a visual representation of your tree. (Both IntelliJ and VS Code have good plugins available for this.) With that visualization it would have been clear that there was a def in a subtree of a def. (The information was also in the output of the System.out.println(tree.toStringTree(parser));, but a bit harder to notice.)
This is your clue. You're getting a duplicate of the second half of your OR and this is because you'll have a nested def and, as a result, you'll exitDef twice (and add it twice in the process).
Your listener does not handle nested structures like this properly (having only a targetEntity and a group). You'll need to do something like maintaining a stack of Group instances and pushing/popping as you enter/exit (and only dealing with the top of the stack).
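A minimal, self-contained sketch of that stack approach (the Group class here is a hypothetical stand-in for the one in the question, and the enter/exit calls simulate what the walker would invoke while descending into a nested def):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class NestedGroupDemo {
    // Hypothetical stand-in for the Group class from the question.
    static class Group {
        final List<String> targets = new ArrayList<>();
    }

    static final Deque<Group> stack = new ArrayDeque<>();
    static final List<Group> result = new ArrayList<>();

    // Mirrors enterDef: push a fresh group instead of overwriting a field.
    static void enterDef() { stack.push(new Group()); }

    // Mirrors exitDef: only the completed top-of-stack group is collected,
    // so a nested def never clobbers (or duplicates) its parent's group.
    static void exitDef() { result.add(stack.pop()); }

    static void addTarget(String t) { stack.peek().targets.add(t); }

    public static void main(String[] args) {
        // Simulate walking "( base1 AND base2 ) OR ( base3 )":
        // the outer def contains two nested defs.
        enterDef();                 // outer def
        enterDef();                 //   first parenthesised group
        addTarget("base1");
        addTarget("base2");
        exitDef();
        enterDef();                 //   second parenthesised group
        addTarget("base3");
        exitDef();
        exitDef();                  // outer def (no direct targets here)

        for (Group g : result) System.out.println(g.targets);
    }
}
```

Each exitDef now adds exactly one group per def, and the nested defs no longer cause the duplicate entry seen in the question's output.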
A few other observations:
super.enterDef(ctx);
There's no need to call the super method on your listener overrides; the default methods are empty. (Of course, it does no harm, and it can be a "safe" practice to generally call the super method when overriding.)
ctx.getParent();
You didn't do anything with this parent, as a result, this doesn't do anything.

Related

Xtext refering to element from different file does not work

Hello, I have two files in my Xtext editor: the first one contains all the definitions and the second one contains the executed recipe. The grammar looks like this:
ServiceAutomationProgram:
('package' name=QualifiedName ';')?
imports+=ServiceAutomationImport*
definitions+=Definition*;
ServiceAutomationImport:
'import' importedNamespace=QualifiedNameWithWildcard ';';
Definition:
'define' ( TypeDefinition | ServiceDefinition |
SubRecipeDefinition | RecipeDefinition) ';';
TypeDefinition:
'quantity' name=ID ;
SubRecipeDefinition:
'subrecipe' name=ID '('( subRecipeParameters+=ServiceParameterDefinition (','
subRecipeParameters+=ServiceParameterDefinition)*)? ')' '{'
recipeSteps+=RecipeStep*
'}';
RecipeDefinition:
'recipe' name=ID '{' recipeSteps+=RecipeStep* '}';
RecipeStep:
(ServiceInvocation | SubRecipeInvocation) ';';
SubRecipeInvocation:
name=ID 'subrecipe' calledSubrecipe=[SubRecipeDefinition] '('( parameters+=ServiceInvocationParameter (',' parameters+=ServiceInvocationParameter)* )?')'
;
ServiceInvocation:
name=ID 'service' service=[ServiceDefinition]
'(' (parameters+=ServiceInvocationParameter (',' parameters+=ServiceInvocationParameter)*)? ')'
;
ServiceInvocationParameter:
ServiceEngineeringQuantityParameter | SubRecipeParameter
;
ServiceEngineeringQuantityParameter:
parameterName=[ServiceParameterDefinition] value=Amount;
ServiceDefinition:
'service' name=ID ('inputs' serviceInputs+=ServiceParameterDefinition (','
serviceInputs+=ServiceParameterDefinition)*)?;
ServiceParameterDefinition:
name=ID ':' (parameterType=[TypeDefinition]);
SubRecipeParameter:
parameterName=[ServiceParameterDefinition]
;
QualifiedNameWithWildcard:
QualifiedName '.*'?;
QualifiedName:
ID ('.' ID)*;
Amount:
INT ;
....
definitionfile file.mydsl:
define quantity Temperature;
define service Heater inputs SetTemperature:Temperature;
define subrecipe sub_recursive() {
Heating1 service Heater(SetTemperature 10);
};
....
recipefile secondsfile.mydsl:
define recipe Main {
sub1 subrecipe sub_recursive();
};
.....
In my generator file which looks like this:
override void doGenerate(Resource resource, IFileSystemAccess2 fsa, IGeneratorContext context) {
for (e : resource.allContents. toIterable.filter (RecipeDefinition)){
e.class;//just for demonstration add breakpoint here and //traverse down the tree
}
}
I need, as an example, the information RecipeDefinition.recipesteps.subrecipeinvocation.calledsubrecipe.recipesteps.serviceinvocation.service.name, which is not accessible (null). So some of the deeply buried information gets lost (maybe due to lazy linking?).
To make the project executable also add to the scopeprovider:
public IScope getScope(EObject context, EReference reference) {
if (context instanceof ServiceInvocationParameter
&& reference == MyDslPackage.Literals.SERVICE_INVOCATION_PARAMETER__PARAMETER_NAME) {
ServiceInvocationParameter invocationParameter = (ServiceInvocationParameter) context;
List<ServiceParameterDefinition> candidates = new ArrayList<>();
if(invocationParameter.eContainer() instanceof ServiceInvocation) {
ServiceInvocation serviceCall = (ServiceInvocation) invocationParameter.eContainer();
ServiceDefinition calledService = serviceCall.getService();
candidates.addAll(calledService.getServiceInputs());
if(serviceCall.eContainer() instanceof SubRecipeDefinition) {
SubRecipeDefinition subRecipeCall=(SubRecipeDefinition) serviceCall.eContainer();
candidates.addAll(subRecipeCall.getSubRecipeParameters());
}
return Scopes.scopeFor(candidates);
}
else if(invocationParameter.eContainer() instanceof SubRecipeInvocation) {
SubRecipeInvocation serviceCall = (SubRecipeInvocation) invocationParameter.eContainer();
SubRecipeDefinition calledSub = serviceCall.getCalledSubrecipe();
candidates.addAll(calledSub.getSubRecipeParameters());
return Scopes.scopeFor(candidates);
}
}return super.getScope(context, reference);
}
When I put it all in the same file it works, as it does the first time it is executed after launching the runtime, but afterwards (when doGenerate is triggered via editor saving) some information is missing. Any idea how to get to the missing information? Thanks a lot!

antlr4: token is not recognised as intended

I am trying to build a grammar using antlr4 that should be able to store intermediate parsing results as variables which can be accessed for later use. I thought about using a key word, like as (or the German als), which will trigger this storing functionality. Besides this I have a general-purpose token ID that will match any possible identifier.
The storing ability should be an option for the user. Therefore, I am using the ? in my grammar definition.
My grammar looks as follows:
grammar TokenTest;
@header {
package some.package.declaration;
}
AS : 'als' ;
VALUE_ASSIGNMENT : AS ID ;
ID : [a-zA-Z_][a-zA-Z0-9_]+ ;
WS : [ \t\n\r]+ -> skip ;
ANY : . ;
formula : identifier=ID (variable=VALUE_ASSIGNMENT)? #ExpressionIdentifier
;
There are no failures when compiling this grammar. But when I try to apply the following TestNG tests, I cannot explain their behaviour:
package some.package.declaration;
import java.util.List;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;
import org.testng.Assert;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
import some.package.declaration.TokenTestLexer;
public class TokenTest {
private static List<Token> getTokens(final String input) {
final TokenTestLexer lexer = new TokenTestLexer(CharStreams.fromString(input));
final CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
return tokens.getTokens();
}
@DataProvider (name = "tokenData")
public Object[][] tokenData() {
return new Object [][] {
{"result", new String[] {"result"}, new int[] {TokenTestLexer.ID}},
{"als", new String[] {"als"}, new int[] {TokenTestLexer.AS}},
{"result als x", new String[] {"result", "als", "x"}, new int[] {TokenTestLexer.ID, TokenTestLexer.AS, TokenTestLexer.ID}},
};
}
@Test (dataProvider = "tokenData")
public void testTokenGeneration(final String input, final String[] expectedTokens, final int[] expectedTypes) {
// System.out.println("test token generation for <" + input + ">");
Assert.assertEquals(expectedTokens.length, expectedTypes.length);
final List<Token> parsedTokens = getTokens(input);
Assert.assertEquals(parsedTokens.size()-1/*EOF is a token*/, expectedTokens.length);
for (int index = 0; index < expectedTokens.length; index++) {
final Token currentToken = parsedTokens.get(index);
Assert.assertEquals(currentToken.getText(), expectedTokens[index]);
Assert.assertEquals(currentToken.getType(), expectedTypes[index]);
}
}
}
The second test tells me that the word als is parsed as an AS-token. But, the third test does not work as intended. I assume it to be an ID-token, followed by an AS-token, and finally followed by an ID-token. But instead, the last token will be recognized as an ANY-token.
If I change the definition of the AS-token as follows:
fragment AS : 'als' ;
there is another strange behaviour. Of course, the second test case does not work any longer, since there is no AS-token any more. That's no surprise. Instead, the x in the third test case will be recognized as an ANY-token. But I assume the whole "als x" sequence to be a VALUE_ASSIGNMENT-token. What am I doing wrong? Any help would be really nice.
Kind regards!
But, the third test does not work as intended. I assume it to be an ID-token, followed by an AS-token, and finally followed by an ID-token. But instead, the last token will be recognized as an ANY-token
That is because you defined:
ID : [a-zA-Z_][a-zA-Z0-9_]+ ;
where the + means "one or more". What you probably want is "zero or more":
ID : [a-zA-Z_][a-zA-Z0-9_]* ;
But, I assume the whole "als x"-sequence to be a VALUE_ASSIGNMENT-token. What am I doing wrong?
Note that spaces are skipped in parser rules, not lexer rules. This means that VALUE_ASSIGNMENT will only match alsFOO, and not als FOO. This rule should probably be a parser rule instead:
value_assignment : AS ID ;
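Both points can be sanity-checked outside ANTLR with plain-Java regex approximations of the lexer rules (a sketch; java.util.regex stands in for the lexer's matching here, and the "als..." pattern approximates the single-token VALUE_ASSIGNMENT rule):

```java
import java.util.regex.Pattern;

public class LexerRuleDemo {
    public static void main(String[] args) {
        // ID with '+' needs at least TWO characters, so the single
        // character "x" cannot match it and falls through to ANY.
        Pattern idPlus = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]+");
        Pattern idStar = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");
        System.out.println(idPlus.matcher("x").matches());   // false
        System.out.println(idStar.matcher("x").matches());   // true

        // A single lexer token matches contiguous characters only:
        // whitespace skipping happens BETWEEN tokens, i.e. at the parser
        // level, so "als x" can never be one VALUE_ASSIGNMENT token.
        Pattern assignment = Pattern.compile("als[a-zA-Z_][a-zA-Z0-9_]*");
        System.out.println(assignment.matcher("alsFOO").matches()); // true
        System.out.println(assignment.matcher("als x").matches());  // false
    }
}
```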

Semantically disambiguating an ambiguous syntax

Using Antlr 4 I have a situation I am not sure how to resolve. I originally asked the question at https://groups.google.com/forum/#!topic/antlr-discussion/1yxxxAvU678 on the Antlr discussion forum. But that forum does not seem to get a lot of traffic, so I am asking again here.
I have the following grammar:
expression
: ...
| path
;
path
: ...
| dotIdentifierSequence
;
dotIdentifierSequence
: identifier (DOT identifier)*
;
The concern here is that dotIdentifierSequence can mean a number of things semantically, and not all of them are "paths". But at the moment they are all recognized as paths in the parse tree and then I need to handle them specially in my visitor.
But what I'd really like is a way to express the dotIdentifierSequence usages that are not paths into the expression rule rather than in the path rule, and still have dotIdentifierSequence in path to handle path usages.
To be clear, a dotIdentifierSequence might be any of the following:
A path - this is a SQL-like grammar and a path expression would be like a table or column reference in SQL, e.g. a.b.c
A Java class name - e.g. com.acme.SomeJavaType
A static Java field reference - e.g. com.acme.SomeJavaType.SOME_FIELD
A Java enum value reference - e.g. com.acme.Gender.MALE
The idea is that during visitation "dotIdentifierSequence as a path" resolves as a very different type from the other usages.
Any idea how I can do this?
The issue here is that you're trying to make a distinction between kinds of "paths" while they are being created in the parser. Constructing paths inside the lexer would be easier (pseudo code follows):
grammar T;
tokens {
JAVA_TYPE_PATH,
JAVA_FIELD_PATH
}
// lexer rules
PATH
: IDENTIFIER ('.' IDENTIFIER)*
{
String s = getText();
if (s is a Java class) {
setType(JAVA_TYPE_PATH);
} else if (s is a Java field) {
setType(JAVA_FIELD_PATH);
}
}
;
fragment IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
and then in the parser you would do:
expression
: JAVA_TYPE_PATH #javaTypeExpression
| JAVA_FIELD_PATH #javaFieldExpression
| PATH #pathExpression
;
But then, of course, input like this java./*comment*/lang.String would be tokenized wrongly.
Handling it all in the parser would mean manually looking ahead in the token stream and checking if either a Java type, or field exists.
A quick demo:
grammar T;
@parser::members {
String getPathAhead() {
Token token = _input.LT(1);
if (token.getType() != IDENTIFIER) {
return null;
}
StringBuilder builder = new StringBuilder(token.getText());
// Try to collect ('.' IDENTIFIER)*
for (int stepsAhead = 2; ; stepsAhead += 2) {
Token expectedDot = _input.LT(stepsAhead);
Token expectedIdentifier = _input.LT(stepsAhead + 1);
if (expectedDot.getType() != DOT || expectedIdentifier.getType() != IDENTIFIER) {
break;
}
builder.append('.').append(expectedIdentifier.getText());
}
return builder.toString();
}
boolean javaTypeAhead() {
String path = getPathAhead();
if (path == null) {
return false;
}
try {
return Class.forName(path) != null;
} catch (Exception e) {
return false;
}
}
boolean javaFieldAhead() {
String path = getPathAhead();
if (path == null || !path.contains(".")) {
return false;
}
int lastDot = path.lastIndexOf('.');
String typeName = path.substring(0, lastDot);
String fieldName = path.substring(lastDot + 1);
try {
Class<?> clazz = Class.forName(typeName);
return clazz.getField(fieldName) != null;
} catch (Exception e) {
return false;
}
}
}
expression
: {javaTypeAhead()}? path #javaTypeExpression
| {javaFieldAhead()}? path #javaFieldExpression
| path #pathExpression
;
path
: dotIdentifierSequence
;
dotIdentifierSequence
: IDENTIFIER (DOT IDENTIFIER)*
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]*
;
DOT
: '.'
;
which can be tested with the following class:
package tl.antlr4;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.misc.NotNull;
import org.antlr.v4.runtime.tree.ParseTreeWalker;
public class Main {
public static void main(String[] args) {
String[] tests = {
"mu",
"tl.antlr4.The",
"java.lang.String",
"foo.bar.Baz",
"tl.antlr4.The.answer",
"tl.antlr4.The.ANSWER"
};
for (String test : tests) {
TLexer lexer = new TLexer(new ANTLRInputStream(test));
TParser parser = new TParser(new CommonTokenStream(lexer));
ParseTreeWalker.DEFAULT.walk(new TestListener(), parser.expression());
}
}
}
class TestListener extends TBaseListener {
@Override
public void enterJavaTypeExpression(@NotNull TParser.JavaTypeExpressionContext ctx) {
System.out.println("JavaTypeExpression -> " + ctx.getText());
}
@Override
public void enterJavaFieldExpression(@NotNull TParser.JavaFieldExpressionContext ctx) {
System.out.println("JavaFieldExpression -> " + ctx.getText());
}
@Override
public void enterPathExpression(@NotNull TParser.PathExpressionContext ctx) {
System.out.println("PathExpression -> " + ctx.getText());
}
}
class The {
public static final int ANSWER = 42;
}
which would print the following to the console:
PathExpression -> mu
JavaTypeExpression -> tl.antlr4.The
JavaTypeExpression -> java.lang.String
PathExpression -> foo.bar.Baz
PathExpression -> tl.antlr4.The.answer
JavaFieldExpression -> tl.antlr4.The.ANSWER

Newbie: 2.4# is accepted as a float. Is '#' a special character?

Wondering why the expression "setvalue(2#)" is happily accepted by the lexer (and parser) given my grammar/visitor. I am sure I am doing something wrong.
Below is a small sample that should illustrate the problem.
Any help is much appreciated.
grammar ExpressionEvaluator;
parse
: block EOF
;
block
: stat*
;
stat
: assignment
;
assignment
: SETVALUE OPAR expr CPAR
;
expr
: atom #atomExpr
;
atom
: OPAR expr CPAR #parExpr
| (INT | FLOAT) #numberAtom
| ID #idAtom
;
OPAR : '(';
CPAR : ')';
SETVALUE : 'setvalue';
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
STRING
: '"' (~["\r\n] | '""')* '"'
;
SPACE
: [ \t\r\n] -> skip
;
Code snippet:
public override object VisitParse(ExpressionEvaluatorParser.ParseContext context)
{
return this.Visit(context.block());
}
public override object VisitAssignment(ExpressionEvaluatorParser.AssignmentContext context)
{
// TODO - Set ID Value
return Convert.ToDouble(this.Visit(context.expr()));
}
public override object VisitIdAtom(ExpressionEvaluatorParser.IdAtomContext context)
{
string id = context.GetText();
// TODO - Lookup ID value
return id;
}
public override object VisitNumberAtom(ExpressionEvaluatorParser.NumberAtomContext context)
{
return Convert.ToDouble(context.GetText());
}
public override object VisitParExpr(ExpressionEvaluatorParser.ParExprContext context)
{
return this.Visit(context.expr());
}
The # character actually isn't matching anything at all. When the lexer reaches that character, the following happens, in order:
The lexer determines that no lexer rule can match the # character.
The lexer reports an error regarding the failure.
The lexer calls _input.consume() to skip past the bad character.
To ensure errors are reported as easily as possible, always add the following rule as the last rule in your lexer.
ErrChar
: .
;
The parser will report an error when it reaches an ErrChar, so you won't need to add an error listener to the lexer.
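A sketch of why "2#" slips through, using regex approximations of the grammar's FLOAT and INT rules (maximal munch: the lexer emits the longest prefix it can match, so the visitor's Convert.ToDouble only ever sees "2", while "#" produces a lexer error on the console rather than a parse failure):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexDemo {
    public static void main(String[] args) {
        // Regex approximations of the grammar's number rules.
        Pattern FLOAT = Pattern.compile("[0-9]+\\.[0-9]*|\\.[0-9]+");
        Pattern INT = Pattern.compile("[0-9]+");

        String input = "2#";
        // '2#' is not a single FLOAT token...
        System.out.println(FLOAT.matcher(input).matches()); // false

        // ...instead the lexer matches the longest prefix it can (INT "2"),
        // leaving '#' with no matching rule: a lexer error, not a parse error.
        Matcher m = INT.matcher(input);
        m.lookingAt();
        System.out.println(m.group());                 // 2
        System.out.println(input.substring(m.end()));  // #
    }
}
```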

ANTLR4 Specific search

Basically I want to find, in a file, using ANTLR, every expression defined as:
WORD.WORD
for example : "end.beginning" matches
For the time being the file can have hundreds and hundreds of lines and a complex structure.
Is there a way to skip everything (every character?) that does not match the pattern described above, without making a grammar that fully represents the file?
So far this is my grammar, but I don't know what to do next.
grammar Dep;
program
:
dependencies
;
dependencies
:
(
dependency
)*
;
dependency
:
identifier
DOT
identifier
;
identifier
:
INDENTIFIER
;
DOT : '.' ;
INDENTIFIER
:
[a-zA-Z_] [a-zA-Z0-9_]*
;
OTHER
:
. -> skip
;
The way you're doing it now, the dependency rule would also match the tokens 'end', '.', 'beginning' from the input:
end
#####
.
#####
beginning
because the line breaks and '#'s are being skipped from the token stream.
If that is not what you want, i.e. you'd like to match "end.beginning" without any char in between, you should make a single lexer rule of it, and match that rule in your parser:
grammar Dep;
program
: DEPENDENCY* EOF
;
DEPENDENCY
: [a-zA-Z_] [a-zA-Z0-9_]* '.' [a-zA-Z_] [a-zA-Z0-9_]*
;
OTHER
: . -> skip
;
Then you could use a tree listener to do something useful with your DEPENDENCY's:
public class Main {
public static void main(String[] args) throws Exception {
String input = "### end.beginning ### end ### foo.bar mu foo.x";
DepLexer lexer = new DepLexer(new ANTLRInputStream(input));
DepParser parser = new DepParser(new CommonTokenStream(lexer));
ParseTreeWalker.DEFAULT.walk(new DepBaseListener(){
@Override
public void enterProgram(@NotNull DepParser.ProgramContext ctx) {
for (TerminalNode node : ctx.DEPENDENCY()) {
System.out.println("node=" + node.getText());
}
}
}, parser.program());
}
}
which would print:
node=end.beginning
node=foo.bar
node=foo.x
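For a plain WORD.WORD scan like this, the same extraction can also be sanity-checked without ANTLR using an equivalent regular expression (a sketch; the pattern simply mirrors the DEPENDENCY lexer rule above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DependencyScan {
    public static void main(String[] args) {
        // Regex equivalent of the DEPENDENCY lexer rule:
        // [a-zA-Z_] [a-zA-Z0-9_]* '.' [a-zA-Z_] [a-zA-Z0-9_]*
        Pattern dep = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*\\.[a-zA-Z_][a-zA-Z0-9_]*");
        String input = "### end.beginning ### end ### foo.bar mu foo.x";

        List<String> deps = new ArrayList<>();
        Matcher m = dep.matcher(input);
        while (m.find()) deps.add(m.group());

        System.out.println(deps); // [end.beginning, foo.bar, foo.x]
    }
}
```

The grammar-based approach is still preferable if the scan later needs real structure (e.g. nesting or context sensitivity), but the regex confirms which substrings the DEPENDENCY rule should pick up.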
