antlr4: token is not recognised as intended

I am trying to build a grammar using antlr4 that should be able to store intermediate parsing results as variables which can be accessed later. I thought about using a keyword, like as (or the German als), which will trigger this storing functionality. Besides this I have a general-purpose token ID that will match any possible identifier.
The storing ability should be an option for the user. Therefore, I am using the ? in my grammar definition.
My grammar looks as follows:
grammar TokenTest;
@header {
package some.package.declaration;
}
AS : 'als' ;
VALUE_ASSIGNMENT : AS ID ;
ID : [a-zA-Z_][a-zA-Z0-9_]+ ;
WS : [ \t\n\r]+ -> skip ;
ANY : . ;
formula : identifier=ID (variable=VALUE_ASSIGNMENT)? #ExpressionIdentifier
;
There are no failures when compiling this grammar. But, when I try to apply the following TestNG-tests I cannot explain its behaviour:
package some.package.declaration;
import java.util.List;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;
import org.testng.Assert;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
import some.package.declaration.TokenTestLexer;
public class TokenTest {
private static List<Token> getTokens(final String input) {
final TokenTestLexer lexer = new TokenTestLexer(CharStreams.fromString(input));
final CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
return tokens.getTokens();
}
@DataProvider(name = "tokenData")
public Object[][] tokenData() {
return new Object [][] {
{"result", new String[] {"result"}, new int[] {TokenTestLexer.ID}},
{"als", new String[] {"als"}, new int[] {TokenTestLexer.AS}},
{"result als x", new String[] {"result", "als", "x"}, new int[] {TokenTestLexer.ID, TokenTestLexer.AS, TokenTestLexer.ID}},
};
}
@Test(dataProvider = "tokenData")
public void testTokenGeneration(final String input, final String[] expectedTokens, final int[] expectedTypes) {
// System.out.println("test token generation for <" + input + ">");
Assert.assertEquals(expectedTokens.length, expectedTypes.length);
final List<Token> parsedTokens = getTokens(input);
Assert.assertEquals(parsedTokens.size()-1/*EOF is a token*/, expectedTokens.length);
for (int index = 0; index < expectedTokens.length; index++) {
final Token currentToken = parsedTokens.get(index);
Assert.assertEquals(currentToken.getText(), expectedTokens[index]);
Assert.assertEquals(currentToken.getType(), expectedTypes[index]);
}
}
}
The second test tells me that the word als is parsed as an AS-token. But, the third test does not work as intended. I assume it to be an ID-token, followed by an AS-token, and finally followed by an ID-token. But instead, the last token will be recognized as an ANY-token.
If I change the definition of the AS-token as follows:
fragment AS : 'als' ;
there is another strange behaviour. Of course, the second test case does not work any longer, since there is no AS-token any more. That's no surprise. Instead, the x in the third test case will be recognized as an ANY-token. But, I assume the whole "als x"-sequence to be a VALUE_ASSIGNMENT-token. What am I doing wrong? Any help would be really nice.
Kind regards!

But, the third test does not work as intended. I assume it to be an ID-token, followed by an AS-token, and finally followed by an ID-token. But instead, the last token will be recognized as an ANY-token
That is because you defined:
ID : [a-zA-Z_][a-zA-Z0-9_]+ ;
where the + means "one or more", so an ID must be at least two characters long. That is why the single character x in your third test cannot match ID and falls through to ANY. What you probably want is "zero or more" after the first character:
ID : [a-zA-Z_][a-zA-Z0-9_]* ;
But, I assume the whole "als x"-sequence to be a VALUE_ASSIGNMENT-token. What am I doing wrong?
Note that skipped tokens like your WS are only discarded between tokens, before the parser sees them; inside a lexer rule the space would have to be matched literally. This means that VALUE_ASSIGNMENT will only match alsFOO, and not als FOO. This rule should probably be a parser rule instead:
value_assignment : AS ID ;

Related

ANTLR4: Parser adding duplicate entries

I have the below input to be parsed:
([LANGUAGE] IN ("Arabic", "Dutch") AND [Content Series] IN ("The Walking Dead") AND [PUBLISHER_NAME] IN ("Yahoo Search", "Yahoo! NAR") )
OR
([LANGUAGE] IN ("English") AND [PUBLISHER_NAME] IN ("Aol News", "Microsoft-Bing!") )
Basically the input has 2 groups separated by 'OR'. Both groups have several base expressions (targetEntities) separated by AND. So each group has a list of target entities.
Grammar file:
grammar Exp;
options {
language = Java;
}
start
: def EOF
;
def : (AND? base)+
| (OR? '(' def ')')*
;
base : key operator values ;
key : LSQR ID RSQR ;
values : '(' VALUE (',' VALUE)* ')' ;
operator : IN
| NIN
;
VALUE: '"' .*? '"' ;
AND : 'AND' ;
OR : 'OR' ;
NOT : 'not' ;
EQ : '=' ;
COMMA : ',' ;
SEMI : ';' ;
IN : 'IN' ;
NIN : 'NOT_IN' ;
LSQR : '[' ;
RSQR : ']' ;
INT : [0-9]+ ;
ID: [a-zA-Z_][a-zA-Z_0-9-!]* ;
WS: [\t\n\r\f ]+ -> skip ;
Below are the listener and the parsing code:
@Component
@NoArgsConstructor
public class ANTLRTargetingExpressionParser {
static List<Group> groupList = new ArrayList<>();
public String entityOperator;
public static class ExpMapper extends ExpBaseListener {
TargetEntity targetEntity;
Group group;
List<TargetEntity> targetEntities;
private static int inc = 1;
@Override
public void exitDef(ExpParser.DefContext ctx) {
group.setTargets(targetEntities);
groupList.add(group);
super.exitDef(ctx);
}
@Override
public void exitValues(ExpParser.ValuesContext ctx) {
targetEntity.setValues(
Arrays.asList(
Arrays.toString(ctx.VALUE().stream().collect(Collectors.toSet()).toArray())));
super.exitValues(ctx);
targetEntities.add(targetEntity);
}
@Override
public void exitOperator(ExpParser.OperatorContext ctx) {
targetEntity.setOperator(ctx.getText());
super.exitOperator(ctx);
}
@Override
public void exitKey(ExpParser.KeyContext ctx) {
targetEntity = new TargetEntity();
ctx.getParent();
targetEntity.setEntity(ctx.ID().getText());
super.exitKey(ctx);
}
@Override
public void enterDef(ExpParser.DefContext ctx) {
group = new Group();
targetEntities = new ArrayList<>();
super.enterDef(ctx);
}
}
public List<Group> parse(String expression) {
ANTLRInputStream in = new ANTLRInputStream(expression);
ExpLexer lexer = new ExpLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
ExpParser parser = new ExpParser(tokens);
parser.setBuildParseTree(true); // tell ANTLR to build a parse tree
ParseTree tree = parser.def();
/** Create standard walker. */
ParseTreeWalker walker = new ParseTreeWalker();
System.out.println(tree.toStringTree(parser));
ExpMapper mapper = new ExpMapper();
walker.walk(mapper, tree);
return groupList;
}
}
Output:
[Group(targets=[{LANGUAGE, IN, [["Dutch", "Arabic"]]}, {Content_Series, IN, [["The Walking Dead"]]}, {PUBLISHER_NAME, IN, [["Yahoo Search", "Yahoo! NAR"]]}]),
Group(targets=[{LANGUAGE, IN, [["English"]]}, {PUBLISHER_NAME, IN, [["Aol News", "Microsoft-Bing!"]]}]),
Group(targets=[{LANGUAGE, IN, [["English"]]}, {PUBLISHER_NAME, IN, [["Aol News", "Microsoft-Bing!"]]}])]
Q1: I am getting a duplicate value in the groupList at the end. I tried checking the value in ctx to stop the walker, but it didn't help.
Q2: Also, how can we catch, in Java, the soft exception reported by the grammar when invalid input is given?
(NOTE: It's MUCH easier to sort questions out if you ensure that the examples you provide are valid and are compilable. I had to change a few things just to get a clean parse, and there's too much missing to attempt to compile and run your code.)
That said....
def : (AND? base)+
| (OR? '(' def ')')*
;
Would normally be represented as something akin to
def: '(' def ')'
| def AND def
| def OR def
| base
;
(Note: these are not exactly equivalent. Your rule requires parentheses around defs used in an OR, but disallows them when used with AND. Those would be "odd" constraints, so I'm not sure if you intended that.)
You'll notice here that it's clear that a def can contain other defs. This is also true in your rule (but only in the second alternative, the OR form).
It can be really useful to use a plugin or the -gui option of the ANTLR tool to see a visual representation of your tree. (Both IntelliJ and VS Code have good plugins available for this.) With that visualization it would have been clear that there was a def in a subtree of a def. (The information is also in the output of the System.out.println(tree.toStringTree(parser));, but a bit harder to notice.)
This is your clue. You're getting a duplicate of the second half of your OR and this is because you'll have a nested def and, as a result, you'll exitDef twice (and add it twice in the process).
Your listener does not handle nested structures like this properly (having only a targetEntity and a group). You'll need to do something like maintaining a stack of Group instances and pushing/popping as you enter/exit (and only dealing with the top of the stack).
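A rough sketch of that idea (illustrative only, not your exact model; it assumes Group exposes its target list via a getTargets() accessor and uses java.util.ArrayDeque/Deque):

private final Deque<Group> groupStack = new ArrayDeque<>();

@Override
public void enterDef(ExpParser.DefContext ctx) {
    // push a fresh group for every def we enter, including nested ones
    Group group = new Group();
    group.setTargets(new ArrayList<>());
    groupStack.push(group);
}

@Override
public void exitValues(ExpParser.ValuesContext ctx) {
    // attach the finished entity to whatever group is currently on top of the stack
    // (value extraction omitted, as in your existing exitValues)
    groupStack.peek().getTargets().add(targetEntity);
}

@Override
public void exitDef(ExpParser.DefContext ctx) {
    Group finished = groupStack.pop();
    // only keep defs that actually collected entities; the outer wrapper def stays empty
    if (!finished.getTargets().isEmpty()) {
        groupList.add(finished);
    }
}

With this, the outer def (the one that only wraps the two parenthesized defs) is popped but never added, so groupList ends up with one Group per parenthesized OR branch.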
A few other observations:
super.enterDef(ctx);
There's no need to call the super method in your listener overrides; the default methods are empty. (Of course, it does no harm, and it can be a "safe" practice to generally call the super method when overriding.)
ctx.getParent();
You didn't do anything with this parent; as a result, this line doesn't do anything.

Get the most possible token types according to line and column number in ANTLR4

I would like to get the list of possible tokens for a given location in the text (line and column number) to determine what has to be populated for auto code completion. Can this be easily achieved using the ANTLR 4 API?
I want the list of possible tokens for a given location because the user might be writing/editing somewhere in the middle of the text, where a set of possible tokens can still be determined.
Please give me some guidelines because I was unable to find an online resource on this topic.
One way to get tokens by line number is to create a ParseTreeListener for your grammar, use it to walk a given ParseTree and index TerminalNodes by line number. I don't know C#, but here is how I've done it in Java. Logic should be similar.
public class MyLineIndexer extends MyGrammarParserBaseListener {
protected MultiMap<Integer, TerminalNode> filelineTokenIndex = new MultiMap<>();
@Override
public void visitTerminal(@NotNull TerminalNode node) {
// map every token to its file line for searching later...
if ( node.getSymbol() != null ) {
List<TerminalNode> tokens;
Integer line = node.getSymbol().getLine();
if (!filelineTokenIndex.containsKey(line)) {
tokens = new ArrayList<>();
filelineTokenIndex.put(line, tokens);
} else {
tokens = filelineTokenIndex.get(line);
}
tokens.add(node);
}
super.visitTerminal(node);
}
}
then walk the parse tree the usual way...
ParseTree parseTree = ... ; // parse it how you want to
MyLineIndexer indexer = new MyLineIndexer();
ParseTreeWalker walker = new ParseTreeWalker();
walker.walk(indexer, parseTree);
Getting the token at a line and range is now reasonably straightforward and efficient, assuming you have a reasonable number of tokens on a line. For example, you can add another method to the listener like this:
public TerminalNode findTerminalNodeAtCaret(int caretPos, int caretLine) {
if (caretPos <= 0) return null;
if (this.filelineTokenIndex.containsKey(caretLine)) {
List<TerminalNode> nodes = filelineTokenIndex.get(caretLine);
if (nodes.size() == 0) return null;
int tokenEndPos, tokenStartPos;
for (TerminalNode n : nodes) {
if (n.getSymbol() != null) {
tokenEndPos = n.getSymbol().getCharPositionInLine() + n.getText().length();
tokenStartPos = n.getSymbol().getCharPositionInLine();
// If the caret is within this token, return this token
if (caretPos >= tokenStartPos && caretPos <= tokenEndPos) {
return n;
}
}
}
}
return null;
}
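For example, once the walker above has populated the index, looking up the token under the caret could look like this (caretLine is the 1-based line number, caretPos the character position within that line; names are illustrative):

TerminalNode nodeAtCaret = indexer.findTerminalNodeAtCaret(caretPos, caretLine);
if (nodeAtCaret != null) {
    int tokenType = nodeAtCaret.getSymbol().getType(); // a starting point for completion candidates
}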
You will also need to ensure your parser allows for 'loose' parsing. While a language construct is being typed, it is likely not to be valid. Your Parser rules should allow for this.

LL_EXACT_AMBIG_DETECTION - Interpretation

When using PredictionMode::LL_EXACT_AMBIG_DETECTION I get the following error messages:
line 186:7 reportAttemptingFullContext d=30, input='ON REPORT HEAD
How am I to interpret the d attribute? Does it reference a rule in my grammar, and how can I find out which?
According to the code:
@Override
public void reportAttemptingFullContext(@NotNull Parser recognizer,
@NotNull DFA dfa,
int startIndex, int stopIndex,
@NotNull ATNConfigSet configs)
{
recognizer.notifyErrorListeners("reportAttemptingFullContext d=" +
dfa.decision + ", input='" +
recognizer.getTokenStream().getText(Interval.of(startIndex, stopIndex)) + "'");
}
the attribute d is a decision in the DFA. But I have not found out how to trace the information back to the rule in the grammar.
Thanks for your help.
Kind regards,
Wolfgang Hämmer
The following helper methods can convert a decision number to a rule name. You can create your own error listener implementation based on DiagnosticErrorListener and use these methods to include the name of the rule in each message.
If a rule has more than one decision, then you can pass the -atn flag to ANTLR when you generate code for your grammar. Once you have the name of the rule, look at the graph produced by ruleName.dot (where ruleName is the rule), and you'll see a node in the graph labeled d=decisionNumber (where decisionNumber is the number you're currently seeing). That will point you to the exact location where the problem is occurring.
Keep in mind that rule and decision numbers change when you change your grammar, so when you open ruleName.dot you'll want to verify the actual decision number each time you regenerate the code for your grammar.
public static int getDecisionRule(Recognizer<?, ?> recognizer, int decision) {
if (recognizer == null || decision < 0) {
return -1;
}
if (decision >= recognizer.getATN().decisionToState.size()) {
return -1;
}
return recognizer.getATN().decisionToState.get(decision).ruleIndex;
}
public static String getRuleDisplayName(Recognizer<?, ?> recognizer, int ruleIndex) {
if (recognizer == null || ruleIndex < 0) {
return Integer.toString(ruleIndex);
}
String[] ruleNames = recognizer.getRuleNames();
if (ruleIndex < 0 || ruleIndex >= ruleNames.length) {
return Integer.toString(ruleIndex);
}
return ruleNames[ruleIndex];
}
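For example, a small helper (hypothetical, built only on the two methods above) can turn the bare decision number into a readable message; inside your own listener based on DiagnosticErrorListener you would call it with dfa.decision:

public static String describeDecision(Recognizer<?, ?> recognizer, int decision) {
    // e.g. turns decision 30 into "decision 30 (rule 'someRule')"
    int ruleIndex = getDecisionRule(recognizer, decision);
    return "decision " + decision + " (rule '" + getRuleDisplayName(recognizer, ruleIndex) + "')";
}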

How to fix "Path Manipulation Vulnerability" in some Java Code?

The simple Java code below is getting a Fortify Path Manipulation error. Please help me resolve this. I have been struggling with it for a long time.
public class Test {
public static void main(String[] args) {
File file=new File(args[0]);
}
}
Try to normalize the URL before using it
https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()
Path path = Paths.get("/foo/../bar/../baz").normalize();
or use normalize from org.apache.commons.io.FilenameUtils
https://commons.apache.org/proper/commons-io/javadocs/api-1.4/org/apache/commons/io/FilenameUtils.html#normalize(java.lang.String)
String path = FilenameUtils.normalize("/foo/../bar/../baz");
For both, the result will be /baz (or \baz with Windows-style separators).
Looking at the OWASP page for Path Manipulation, it says
An attacker can specify a path used in an operation on the filesystem
You are opening a file as defined by a user-given input. Your code is almost a perfect example of the vulnerability! Either
Don't use the above code (don't let the user specify the input file as an argument)
Let the user choose from a list of files that you supply (an array of files with an integer choice)
Don't let the user supply the filename at all, remove the configurability
Accept the vulnerability but protect against it by checking the filename (although this is the worst thing to do - someone may get round it anyway).
Or re-think your application's design.
Fortify will flag the code even if the path/file doesn't come from user input but from something like a property file. The best way to handle these is to canonicalize the path first, then validate it against a whitelist of allowed paths.
Bad:
public class Test {
public static void main(String[] args) {
File file=new File(args[0]);
}
}
Good:
public class Test {
public static void main(String[] args) throws IOException {
File file=new File(args[0]);
if (!isInSecureDir(file)) {
throw new IllegalArgumentException();
}
String canonicalPath = file.getCanonicalPath();
if (!canonicalPath.equals("/img/java/file1.txt") &&
!canonicalPath.equals("/img/java/file2.txt")) {
// Invalid file; handle error
}
FileInputStream fis = new FileInputStream(file);
}
}
Source: https://www.securecoding.cert.org/confluence/display/java/FIO16-J.+Canonicalize+path+names+before+validating+them
Only allow alphanumerics and a period in the input. That means you filter out the control chars, "..", "/" and "\", which would make your files vulnerable. For example, one should not be able to enter /path/password.txt.
Once done, rescan and then run Fortify AWB.
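A minimal sketch of that kind of check (the pattern is illustrative; adjust it to the file names you actually expect):

String fileName = args[0];
// letters and digits plus a single optional ".ext"; rejects "..", "/" and "\"
if (!fileName.matches("[A-Za-z0-9]+(\\.[A-Za-z0-9]+)?")) {
    throw new IllegalArgumentException("Invalid file name: " + fileName);
}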
I have a solution to the Fortify Path Manipulation issues.
What it is complaining about is that if you take data from an external source, then an attacker can use that source to manipulate your path, enabling the attacker to delete files or otherwise compromise your system.
The suggested remedy to this problem is to use a whitelist of trusted directories as valid inputs; and, reject everything else.
This solution is not always viable in a production environment. So, I suggest an alternative solution: parse the input against a whitelist of acceptable characters and reject from the input any character you don't want in the path. It can be either removed or replaced.
Below is an example. This does pass the Fortify review. It is important to remember here to return the literal and not the char being checked. Fortify keeps track of the parts that came from the original input. If you use any of the original input, you may still get the error.
public class CleanPath {
public static String cleanString(String aString) {
if (aString == null) return null;
String cleanString = "";
for (int i = 0; i < aString.length(); ++i) {
cleanString += cleanChar(aString.charAt(i));
}
return cleanString;
}
private static char cleanChar(char aChar) {
// 0 - 9
for (int i = 48; i < 58; ++i) {
if (aChar == i) return (char) i;
}
// 'A' - 'Z'
for (int i = 65; i < 91; ++i) {
if (aChar == i) return (char) i;
}
// 'a' - 'z'
for (int i = 97; i < 123; ++i) {
if (aChar == i) return (char) i;
}
// other valid characters
switch (aChar) {
case '/':
return '/';
case '.':
return '.';
case '-':
return '-';
case '_':
return '_';
case ' ':
return ' ';
}
return '%';
}
}
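Usage would then be along these lines (hypothetical wiring):

String safePath = CleanPath.cleanString(args[0]);
File file = new File(safePath);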
Assuming you're running Fortify against a web application, during your triage of Fortify vulnerabilities that would likely get marked as "Not an issue". The reasoning being A) this is obviously test code and B) unless you have multiple personality disorder you're not going to be running a path manipulation exploit against yourself when you run that test app.
It's very common to see little test utilities committed to a repository which produce this style of false positive.
As for your compilation errors, that generally comes down to classpath issues.
We have code like the below which was raising a high-category Path Manipulation issue in Fortify.
String.join(delimeter,string1,string2,string2,string4);
Our program deals with an AWS S3 bucket, so we changed it as below and it worked.
com.amazonaws.util.StringUtils.join(delimeter,string1,string2,string2,string4);
Using the Tika library's FilenameUtils.normalize solves the Fortify issue.
import org.apache.tika.io.FilenameUtils;
public class Test {
public static void main(String[] args) {
String filePath = FilenameUtils.normalize(args[0]); // This line resolves the issue.
File file=new File(filePath);
}
}
Try this for replacing FileInputStream. You will need to close your project and open it again to accurately see whether the changes worked.
File to byte[] in Java
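As a sketch of that approach (assuming java.nio.file; the path should still be normalized and validated against your whitelist as discussed above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadBytes {
    public static void main(String[] args) throws IOException {
        // read the whole file into a byte[] without FileInputStream
        byte[] content = Files.readAllBytes(Paths.get(args[0]).normalize());
        System.out.println(content.length + " bytes read");
    }
}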
Use the Normalize() function in C# and it resolved the Fortify vulnerability in the next scan.
string s = @"c:\temp\scan.log".Normalize();
Use regex to validate the file path and file name
String fileName = args[0];
final String regularExpression = "([\\w\\:\\\\w ./-]+\\w+(\\.)?\\w+)";
Pattern pattern = Pattern.compile(regularExpression);
boolean isMatched = pattern.matcher(fileName).matches();

Is there an additional runtime cost for using named parameters?

Consider the following struct:
public struct vip
{
string email;
string name;
int category;
public vip(string email, int category, string name = "")
{
this.email = email;
this.name = name;
this.category = category;
}
}
Is there a performance difference between the following two calls?
var e = new vip(email: "foo", name: "bar", category: 32);
var e = new vip("foo", 32, "bar");
Is there a difference if there are no optional parameters defined?
I believe none. It's only a language/compiler feature, call it syntactic sugar if you like. The generated CLR code should be the same.
There's a compile-time cost, but not a runtime one...and the compile time is very, very minute.
Like extension methods or auto-implemented properties, this is just magic the compiler does, but in reality generates the same IL we're all familiar with and have been using for years.
Think about it this way: if you're using all the parameters, the compiler would call the method using all of them; if not, it would generate something like this behind the scenes:
var e = new vip(email: "foo", category: 32); //calling
//generated, this is what it's actually saving you from writing
public vip(string email, int category) : this(email, category, "bar") { }
No it is a compile-time feature only. If you inspect the generated IL you'll see no sign of the named parameters. Likewise, optional parameters is also a compile-time feature.
One thing to keep in mind regarding named parameters is that the names are now part of the signature for calling a method (if used obviously) at compile time. I.e. if names change the calling code must be changed as well if you recompile. A deployed assembly, on the other hand, will not be affected until recompiled, as the names are not present in the IL.
There shouldn't be any. Basically, named parameters and optional parameters are syntactic sugar; the compiler writes the actual values or the default values directly into the call site.
EDIT: Note that because they are a compiler feature, this means that changes to the parameters only get updated if you recompile the "clients". So if you change the default value of an optional parameter, for example, you will need to recompile all "clients", or else they will use the old default value.
Actually, there is a cost on the x64 CLR.
Look here: http://www.dotnetperls.com/named-parameters
I am able to reproduce the result: named call takes 4.43 ns, and normal call takes 3.48 ns
(program runs in x64)
However, in x86, both take around 0.32 ns
The code is attached below, compile and run it yourself to see the difference.
Note that in VS2012 the default target is AnyCPU (x86 preferred); you have to switch to x64 to see the difference.
using System;
using System.Diagnostics;
class Program
{
const int _max = 100000000;
static void Main()
{
Method1();
Method2();
var s1 = Stopwatch.StartNew();
for (int i = 0; i < _max; i++)
{
Method1();
}
s1.Stop();
var s2 = Stopwatch.StartNew();
for (int i = 0; i < _max; i++)
{
Method2();
}
s2.Stop();
Console.WriteLine(((double)(s1.Elapsed.TotalMilliseconds * 1000 * 1000) /
_max).ToString("0.00 ns"));
Console.WriteLine(((double)(s2.Elapsed.TotalMilliseconds * 1000 * 1000) /
_max).ToString("0.00 ns"));
Console.Read();
}
static void Method1()
{
Method3(flag: true, size: 1, name: "Perl");
}
static void Method2()
{
Method3(1, "Perl", true);
}
static void Method3(int size, string name, bool flag)
{
if (!flag && size != -1 && name != null)
{
throw new Exception();
}
}
}
