Is there any way to use the ANTLR4 Java 7 grammar to parse Java 8 files without making many changes to the java.g4 grammar file? As far as I understand, the syntactic changes from Java 7 to Java 8 are the lambda expression syntax and the double colon operator. I was able to incorporate lambda expressions, but adding the double colon operator to my current grammar file seems a bit complex.
We are working on a tool to validate user configurations. Invalid configurations will be described in a text file or JSON file in the following form:
case1: if something > 5 and something.else != 10
case2: (if a <= 3 or a >= 5) and b == 10
If the if statement evaluates to true, the configuration is invalid. We used the SLY module to create a lexer and parser that parse such a sentence and check whether it is valid. After thinking about it a bit more, we realized that instead of writing our own grammar, it would be interesting to use a subset of the Python grammar: expressions, boolean operators and a few others, but not the complete set, as we neither want nor need support for functions, classes and much more. The reason for this approach is that we are writing our tool in Python, so the two would cooperate nicely.
I've checked the ast module, but I have a feeling that the grammar is tightly coupled with it. If I understand correctly, the Python parser is not generated automatically from a grammar by some existing parser generator, right? The parser is "hard coded". Or am I wrong?
Is there a "simple" way of doing this?
In general, we are looking for a parser generator that generates a parser for a subset of the Python grammar. I'm afraid that to cover only part of the Python grammar, we would need to write the grammar ourselves and generate a parser from it. Is my assumption right?
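One possible way to get "a subset of Python" without writing a grammar is to let the standard `ast` module parse the full expression syntax and then reject every node type that is not on a whitelist. The following is only a minimal sketch under assumptions: the names `ALLOWED_NODES`, `parse_condition` and `is_allowed` are made up for illustration, the whitelist is not a vetted list, and the leading `if` from the examples is assumed to be stripped beforehand. Note also that `something.else` from case1 is not a valid Python expression, because `else` is a keyword, so attribute names would have to avoid keywords.

```python
import ast

# Node types we choose to allow; anything else is rejected. This whitelist is
# an assumption for illustration, not a complete or vetted list.
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.Compare, ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
    ast.Name, ast.Load, ast.Attribute, ast.Constant,
)


def parse_condition(text):
    """Parse `text` as a Python expression and reject disallowed constructs."""
    tree = ast.parse(text, mode="eval")  # raises SyntaxError on invalid input
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"unsupported construct: {type(node).__name__}")
    return tree


def is_allowed(text):
    """Return True if `text` parses and uses only whitelisted constructs."""
    try:
        parse_condition(text)
        return True
    except (SyntaxError, ValueError):
        return False
```

With this sketch, `is_allowed("(a <= 3 or a >= 5) and b == 10")` is accepted, while a function call or a lambda is rejected because `ast.Call` and `ast.Lambda` are not on the whitelist.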
I'm trying to write a lexer rule that would match the following strings:
a
aa
aaa
bbbb
The requirement here is that all characters in a string must be the same.
I tried to use this rule:
REPEAT_CHARS: ([a-z])(\1)*
But \1 is not valid in ANTLR4. Is it possible to come up with a pattern for this?
You can’t do that in an ANTLR lexer. At least, not without target-specific code inside your grammar. And placing code in your grammar is something you should avoid: it makes the grammar hard to read and ties it to that target language. It is better to do those kinds of checks/validations inside a listener or visitor.
Things like back-references and look-arounds are features that crept into the regex engines of programming languages. The regular expression syntax available in ANTLR (and in all parser generators I know of) does not support those features; it describes true regular languages only.
Many features found in virtually all modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory.
-- https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
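The "match loosely, validate afterwards" idea from this answer can be sketched in plain Python. This is not ANTLR code: Python's `re` module stands in for a lexer rule such as `REPEAT_CHARS: [a-z]+;`, and the function `is_repeat_chars` (a name invented here) plays the role that a listener or visitor check would play after parsing.

```python
import re

# Loose token pattern: one or more lowercase letters (a regular language).
LOOSE_TOKEN = re.compile(r"[a-z]+")


def is_repeat_chars(text):
    """True if `text` is a non-empty run of one repeated lowercase letter."""
    if LOOSE_TOKEN.fullmatch(text) is None:
        return False
    # The "all characters are the same" check happens after matching,
    # which is the role a listener or visitor would play.
    return len(set(text)) == 1
```

The pattern itself needs no back-reference; the constraint that goes beyond what the lexer rule expresses is enforced in ordinary code.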
Section 2.1.3 of the Python Language Reference says:
Comments are ignored by the syntax.
While I'm not entirely sure about this, I believe this means the Python interpreter will ignore comments.
In contrast, section 2.1.4 says:
If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration.
This also seems to be a statement of fact about the Python interpreter: that it does not ignore a comment in the first or second line of the script, as long as it matches the expression coding[=:]\s*([-\w.]+)
Source
Don't these two statements about the interpreter contradict each other? What the hell is going on?
You have valid points about the clarity of the documentation.
However, as with many other languages (HTML, XML, JSON pre-2017 standard*), the character encoding of a source file/document is determined prior to any language lexical or syntactical processing. So, it is correct to say, "Comments are ignored by the syntax." Because once the character encoding is determined, processing restarts and the syntactical processing ignores all comments.
In a sense, there are two languages: 1) for expressing the character encoding; 2) for expressing a Python script. The first one is designed so it is accepted by but has no meaning to the second.
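The two stages can be observed from Python itself. The sketch below is illustrative only: the sample `source` bytes are made up, the regular expression is the one quoted from section 2.1.4, and `tokenize.detect_encoding` and `ast.parse` are standard-library functions used here to stand in for what the interpreter does internally.

```python
import ast
import io
import re
import tokenize

# The regular expression quoted from section 2.1.4 of the language reference.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

# Made-up sample script: an encoding declaration plus an ordinary comment.
source = b"# -*- coding: latin-1 -*-\nx = 1  # an ordinary comment\n"

# Stage 1: find the encoding from the raw bytes, before any parsing.
first_line = source.splitlines()[0].decode("ascii")
declared = CODING_RE.search(first_line).group(1)
encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)

# Stage 2: decode and parse; comments leave no trace in the syntax tree.
tree = ast.parse(source.decode(encoding))
```

Here `declared` is the name as written in the comment, `encoding` is the normalized codec name that `tokenize` reports for it, and `tree.body` contains a single assignment with neither comment represented.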
* Subsequent standards for JSON reduce the set of allowable character encodings from UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE to simply UTF-8.
So I have written my grammar in ANTLR4 syntax. Then I set up code generation, and now I can parse source files in my own language. This works great!
The next step I took is to create an object model from the expression tree. This is also working well.
However, now I want to generate an expression from my object model.
Can I generate code using the API of the generated parser objects? Obviously, I can write methods that build the strings by hand. But I want to use a generated API based on the grammar to achieve some level of type safety and to detect errors when I make a grammar change.
I'm using the latest ANTLR4: ANTLR 4.7.1.
There's no generated solution. You have to wire this all up manually.
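As a rough idea of what "wiring it up manually" could look like, here is a minimal hand-written renderer for a tiny object model. Everything in it is hypothetical: the node classes `Name`, `Number` and `BinaryOp` and the function `to_source` are invented for illustration and do not come from ANTLR or from the asker's grammar.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Name:
    identifier: str


@dataclass
class Number:
    value: int


@dataclass
class BinaryOp:
    left: "Expr"
    operator: str
    right: "Expr"


Expr = Union[Name, Number, BinaryOp]


def to_source(node: Expr) -> str:
    """Render an object-model node back to text in the (hypothetical) language."""
    if isinstance(node, Name):
        return node.identifier
    if isinstance(node, Number):
        return str(node.value)
    if isinstance(node, BinaryOp):
        # Always parenthesize so the output never depends on precedence rules.
        return f"({to_source(node.left)} {node.operator} {to_source(node.right)})"
    raise TypeError(f"no rendering rule for {type(node).__name__}")
```

This does not give the grammar-derived type safety the question asks for. One way to catch drift after a grammar change might be a round-trip test that renders a model, parses the result with the generated parser, and compares the rebuilt model with the original.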
Say I'd like to find instances of the expression while using the Java7 grammar:
FoobarClass.getInstanceOfType("Bazz");
Using a ParseTreeWalker and listening to exitExpression() calls sounded like a good first place to start. What surprised me was the level of manual traversal of the Java7Parser.ExpressionContext required to find expressions of this type.
What's the appropriate method to find matches to the above expression? At this point using a Regex in place of ANTLR4 yields simpler code, but this won't scale.
ANTLR 4 does not currently include a feature that lets you write concrete or abstract syntax queries. We hope to add something in the future to help with this type of application.
I've needed to write a few pattern recognition features for ANTLR 4 parse trees. I implemented the predicate itself with relative success by extending BaseMyParserVisitor<Boolean> (the parser in this example is called MyParser).
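The shape of such a boolean visitor predicate can be sketched without ANTLR. In the sketch below, `Node` is a hypothetical stand-in for a parse-tree context, and `CallMatcher` and `find_matches` are invented names; a real implementation would instead extend the visitor base class that ANTLR generates and inspect the actual context objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parse-tree context: rule name, text, children."""
    rule: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


class CallMatcher:
    """Boolean predicate: does a node look like Target.method("literal")?"""

    def __init__(self, target, method, argument):
        self.target, self.method, self.argument = target, method, argument

    def visit(self, node):
        # Dispatch on rule name; unknown rules default to "no match".
        handler = getattr(self, "visit_" + node.rule, None)
        return handler(node) if handler else False

    def visit_call(self, node):
        if len(node.children) != 3:
            return False
        receiver, name, arg = node.children
        return (receiver.text == self.target
                and name.text == self.method
                and arg.text == self.argument)


def find_matches(root, matcher):
    """Walk the whole tree and collect every node the predicate accepts."""
    found = [root] if matcher.visit(root) else []
    for child in root.children:
        found.extend(find_matches(child, matcher))
    return found
```

Keeping the predicate separate from the tree walk means the same walker can be reused with different matchers.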