Parse subset of Python grammar - python-3.x

We are working on a tool to validate user configurations. Invalid configurations will be described in a text file or JSON file in the following form:
case1: if something > 5 and something.else != 10
case2: (if a <= 3 or a >= 5) and b == 10
If the if statement evaluates to true, the configuration is invalid. We used the SLY module to create a lexer and parser to parse these sentences and check whether they are valid. After thinking a bit more, we realized that instead of writing our own grammar, it would be interesting to use a subset of the Python grammar (say expressions, boolean operators and a few others), but not the complete set, as we neither want nor need to support functions, classes and much more. The reason for this approach is that we are writing our tool in Python, so the two would cooperate nicely.
I've checked the ast module; however, I have a feeling that the grammar is tightly coupled with it. If I understand it correctly, the Python parser is not generated automatically by some existing parser generator from a grammar, right? The parser is "hard coded". Or am I wrong?
Is there a "simple" way of doing this?
In general, we are looking for a parser generator that generates a parser for a subset of the Python grammar, but I'm afraid that to cover part of the Python grammar we would need to write the grammar ourselves and generate a parser from it. Is my assumption right?
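One possible direction, sketched here under assumptions rather than offered as a definitive answer: let the standard ast module parse the full Python expression grammar, then reject every node type outside a whitelist. The names ALLOWED_NODES, parse_condition and is_invalid are hypothetical, and the whitelist is only a guess at the subset wanted. Note that a leading "if" and attribute names that are Python keywords (such as else) would not parse as Python expressions, so the rule text would need to avoid them.

```python
import ast

# Hypothetical whitelist: the only node types this sketch accepts.
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.Compare, ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
    ast.Name, ast.Load, ast.Attribute, ast.Constant,
)


def parse_condition(text):
    """Parse text as a Python expression; reject syntax outside the whitelist."""
    tree = ast.parse(text, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"unsupported syntax: {type(node).__name__}")
    return tree


def is_invalid(text, config):
    """Return True when the condition holds for the given configuration dict."""
    code = compile(parse_condition(text), "<condition>", "eval")
    # Builtins are emptied so that only names from config are visible.
    return bool(eval(code, {"__builtins__": {}}, config))
```

With this sketch, is_invalid("(a <= 3 or a >= 5) and b == 10", {"a": 7, "b": 10}) reports the configuration as invalid, while a function call such as f(1) is refused before anything is evaluated.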

Related

How to get the grammar production when there is an error with Ply(Yacc)?

In the yacc.py file I defined the output of a grammar and also an error, like this:
def p_error(p):
    if p:
        print("Error when trying to read the symbol '%s' (Token type: %s)" % (p.value, p.type))
    else:
        print("Syntax error at EOF")
    exit()
In addition to this error message, I also want to print what was the production read at the time of the error, something like:
print("Error in production: ifstat -> IF LPAREN expression RPAREN statement elsestat")
How can I do this?
Really, you can't. You particularly can't with a bottom-up parser like the one generated by Ply, but even with top-down parsers "the production read at the time" is not a very well-defined concept.
For example, consider the erroneous code:
if (x < y return 42;
in which the error is a missing parenthesis. At least, that's how I'd describe the error. But a closing parenthesis is not the only thing which could follow the y. For example, a correct program might include any of the following:
if (x < y) return 42;
if (x < y + 10) return 42;
if (x < y && give_up_early) return 42;
and many more.
So which production is the parser trying to complete when it sees the token return? Evidently, it's still trying to complete expression (which might actually have a hierarchy of different expression types, or which might be relying on precedence declarations to be a single non-terminal, or some combination of the two.) But that doesn't really help identify the error as a missing close parenthesis.
In a top-down parser, it would be possible to walk up the parser stack to get a list of partially-completed productions in inclusion order. (At least, that would be possible if the parser maintained its own stack. If it were a recursive-descent parser, examining the stack would be more complicated.)
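To make the top-down case concrete, here is a minimal sketch of a hand-written recursive-descent parser that keeps its own list of productions in progress and reports it on error. Everything in it (the toy grammar, Parser, ParseError, diagnose) is hypothetical and has nothing to do with Ply's internals.

```python
import re

TOKEN_RE = re.compile(r"\w+|&&|[()<+;]")
KEYWORDS = {"if", "return"}
OPERATORS = {"<", "+", "&&"}


class ParseError(Exception):
    def __init__(self, token, active):
        super().__init__(f"unexpected {token!r} inside {' > '.join(active)}")
        self.token = token
        self.active = active


class Parser:
    def __init__(self, text):
        self.tokens = TOKEN_RE.findall(text) + ["<EOF>"]
        self.pos = 0
        self.active = []  # productions in progress, outermost first

    def peek(self):
        return self.tokens[self.pos]

    def fail(self):
        raise ParseError(self.peek(), list(self.active))

    def expect(self, token):
        if self.peek() != token:
            self.fail()
        self.pos += 1

    def ifstat(self):
        # ifstat -> 'if' '(' expression ')' 'return' operand ';'
        self.active.append("ifstat")
        self.expect("if")
        self.expect("(")
        self.expression()
        self.expect(")")
        self.expect("return")
        self.operand()
        self.expect(";")
        self.active.pop()

    def expression(self):
        # expression -> operand (operator operand)*
        self.active.append("expression")
        self.operand()
        while self.peek() in OPERATORS:
            self.pos += 1
            self.operand()
        self.active.pop()

    def operand(self):
        # operand -> identifier or number (anything word-like, except keywords)
        self.active.append("operand")
        token = self.peek()
        if not re.fullmatch(r"\w+", token) or token in KEYWORDS:
            self.fail()
        self.pos += 1
        self.active.pop()


def diagnose(text):
    """Return (offending token, active productions), or None if the text parses."""
    try:
        Parser(text).ifstat()
    except ParseError as error:
        return error.token, error.active
    return None
```

In this sketch, diagnose("if (x < y return 42;") gives ('return', ['ifstat']), because expression has already returned by the time the missing parenthesis is noticed, while diagnose("if (x < ) return 42;") gives (')', ['ifstat', 'expression', 'operand']). Either way the list names productions, not the actual mistake.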
But in a bottom-up parser, the parser state is more complicated. Bottom-up parsers are more flexible than top-down parsers precisely because they can, in effect, consider multiple productions at the same time. So there often really isn't one single partial production; the parser will decide which production it is looking at by gradually eliminating all the possibilities which don't work.
That description makes it sound like the bottom-up parser is doing a lot of work, which is misleading. The work was already done by the parser generator, which compiles a simple state transition table to guide the parse. What that means in practice is that the parser knows how to handle every possibly-correct token at each moment in the parse. So, for example, when it sees a ) following if (x < y, it immediately knows that it must finish up the expression production and proceed with the rest of the if statement.
Bison -- a C implementation of yacc -- has an optional feature which allows it to list the possible correct tokens when an error is encountered. That's not as simple as it sounds, and implementing it correctly creates a noticeable overhead in parsing time, but it is sometimes useful. (It's often not useful, though, because the list of possible tokens can be very long. In the case of the error I'm using as an example, the list would include every single arithmetic operator, as well as those tokens which could start a postfix operator. The bison extended error handler stops trying when it reaches the sixth possible token, which means that it will rarely generate an extended error message if the parse is in the middle of an expression.) In any case, Ply does not have such a feature.
Ply, like bison, does implement error recovery through the error pseudo-token. The error-recovery algorithm works best with languages which have an unambiguous resynchronisation point, as in languages with a definite statement terminator (unlike C-like languages, in which many statements do not end with ;). But you can use error productions to force the parser to pop its stack back to some containing production in order to produce a better error message. In my experience, a lot of experimentation is needed to get this strategy right.
In short, producing meaningful error messages is hard. My recommendation is to first focus on getting your parser working on correct inputs.

Given an antlr4 grammar, can I build up an expression tree?

So I have written my grammar in antlr4 syntax. Then I set up code generation, and now I can parse source files in my own defined language. This works great!
The next step I took is to create an object model from the expression tree. This is also working well.
However, now I want to generate an expression from my object model.
Can I generate code using the generated language parser objects API? Obviously, I can write methods that hand-generate strings. But I want to use a generated API based on the grammar to achieve some level of type safety and to detect errors when I make a grammar change.
I'm using the latest antlr4: antlr 4.7.1.
There's no generated solution. You have to wire this all up manually.
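As a rough idea of what wiring it up manually can look like, here is a minimal sketch of a hand-written object model whose nodes know how to print themselves back to source. The classes Number and BinaryOp and the to_source method are hypothetical; nothing here is produced by ANTLR, and the output syntax is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Number:
    value: int

    def to_source(self):
        return str(self.value)


@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

    def to_source(self):
        # Always parenthesise, so the printer never has to reason about precedence.
        return f"({self.left.to_source()} {self.op} {self.right.to_source()})"


tree = BinaryOp("+", Number(1), BinaryOp("*", Number(2), Number(3)))
source = tree.to_source()
```

A grammar change still has to be mirrored by hand in classes like these; one way to catch drift is a test that parses the printed text again and compares the resulting object model with the original.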

System Verilog 2012 dependency analysis

I am in the process of adapting the SystemVerilog LRM grammar into ANTLR4. This is huge overkill for what I really need, however. Basically I need dependency analysis similar to the -M switch in gcc. This problem has been surprisingly difficult to solve, and my current regex-based solution is incomplete, buggy and constantly breaks when exposed to new code, even though it has been patched many times. I have tried to use various freely available parsers, but none of them seem to handle code that conforms to the latest SystemVerilog (2012) standard.
I think I need a parser based approach, and I think I am stuck building my own parser. But I am very interested to hear any other suggestions about this. I can't be the only one who has this problem.
Here is my ANTLR question: I am attempting to use the "island in the stream" approach, where the ANTLR grammar will ignore most of the details and complexity of the SystemVerilog language and only parse code where modules are being instanced or headers are being referenced. Obviously the difficulty here is determining how to distinguish between code I care about and code I don't. Has anyone used ANTLR this way (not necessarily for SystemVerilog)? I am hoping to get a strategy for writing the "catch all" rule that matches everything that is not related to module instances.
Thanks.
The idiomatic strategy is to match what is wanted and let everything else be consumed by an 'other' rule. So the basic structure of the parser will be:
verilog : statement+ EOF ;
statement : header
          | module
          | <<etc>>
          | other
          ;
header : INCLUDE filePathspec SEMI ;
filePathspec: <<whatever>> ;
module : MODULE <<whatever>> SEMI ;
other : . ; // consume a single, uninteresting token at a time
The only requirement is to make the statement rules sufficiently detailed to uniquely match their statements. The Verilog syntax gives you that explicitly.
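The same "match the islands, skip the rest" control flow can be sketched at token level in plain Python. This is an illustration of the idea only, not ANTLR and not a SystemVerilog parser: the token pattern is an assumption, the function name dependencies is hypothetical, only `include and package imports are recognised, and comments are not handled.

```python
import re

# Assumed token shapes: `directive, "string", word, ::, or any other single character.
TOKEN_RE = re.compile(r'`\w+|"[^"\n]*"|\w+|::|\S')


def dependencies(source):
    """Collect (kind, name) pairs for the constructs we care about; skip everything else."""
    tokens = TOKEN_RE.findall(source)
    found = []
    i = 0
    while i < len(tokens):
        # Island: `include "file"
        if tokens[i] == "`include" and i + 1 < len(tokens) and tokens[i + 1].startswith('"'):
            found.append(("include", tokens[i + 1].strip('"')))
            i += 2
        # Island: import pkg :: ...
        elif tokens[i] == "import" and i + 2 < len(tokens) and tokens[i + 2] == "::":
            found.append(("import", tokens[i + 1]))
            i += 3
        else:
            i += 1  # the 'other' rule: consume one uninteresting token
    return found


EXAMPLE = '''
`include "defs.svh"
module top;
  import my_pkg::*;
  wire a = b + c;
endmodule
'''
```

Here dependencies(EXAMPLE) yields [('include', 'defs.svh'), ('import', 'my_pkg')]. The fallback branch plays the role of the other rule: whenever no island matches, exactly one token is dropped and matching is retried at the next position.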
UPDATE
Take a look at the example Verilog grammar in the grammar archive.

How to find matching expressions while using a ParseTreeWalker

Say I'd like to find instances of the expression while using the Java7 grammar:
FoobarClass.getInstanceOfType("Bazz");
Using a ParseTreeWalker and listening to exitExpression() calls sounded like a good first place to start. What surprised me was the level of manual traversal of the Java7Parser.ExpressionContext required to find expressions of this type.
What's the appropriate method to find matches to the above expression? At this point using a Regex in place of ANTLR4 yields simpler code, but this won't scale.
ANTLR 4 does not currently include a feature allowing you to write concrete or abstract syntax queries. We hope to add something in the future to help with this type of application.
I've needed to write a few pattern recognition features for ANTLR 4 parse trees. I implemented the predicate itself with relative success by extending BaseMyParserVisitor<Boolean> (the parser in this example is called MyParser).
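The shape of such a tree-walking predicate can be illustrated with Python's own ast module instead of an ANTLR-generated visitor; this is a swapped-in technique for illustration, not the ANTLR API. CallFinder and find_calls are hypothetical names, and the sketch only matches calls of the form Owner.method(...).

```python
import ast


class CallFinder(ast.NodeVisitor):
    """Record the line numbers of calls that look like owner.method(...)."""

    def __init__(self, owner, method):
        self.owner = owner
        self.method = method
        self.matches = []

    def visit_Call(self, node):
        func = node.func
        if (isinstance(func, ast.Attribute) and func.attr == self.method
                and isinstance(func.value, ast.Name) and func.value.id == self.owner):
            self.matches.append(node.lineno)
        self.generic_visit(node)  # keep descending, calls can be nested


def find_calls(source, owner, method):
    finder = CallFinder(owner, method)
    finder.visit(ast.parse(source))
    return finder.matches


SAMPLE = 'x = FoobarClass.getInstanceOfType("Bazz")\ny = other.getInstanceOfType("Bazz")\n'
```

The pattern test lives in one visit method and the traversal is left to the visitor base class, which is roughly the division of labour described above for a visitor returning a Boolean.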

Can a language have Lisp's powerful macros without the parentheses?

Can a language have Lisp's powerful macros without the parentheses?
Sure, the question is whether the macros are convenient to use and how powerful they are.
Let's first look how Lisp is slightly different.
Lisp syntax is based on data, not text
Lisp has a two-stage syntax.
A) first there is the data syntax for s-expressions
examples:
(mary called tim to tell him the price of the book)
(sin ( x ) + cos ( x ))
s-expressions are atoms, lists of atoms or lists.
B) second there is the Lisp language syntax on top of s-expressions.
Not every s-expression is a valid Lisp program.
(3 + 4)
is not a valid Lisp program, because Lisp uses prefix notation.
(+ 3 4)
is a valid Lisp program. The first element is a function - here the function +.
S-expressions are data
The interesting part is now that s-expressions can be read and then Lisp uses the normal data structures (numbers, symbols, lists, strings) to represent them.
Most other programming languages don't have a primitive representation for internalized source - other than strings.
Note that s-expressions here are not representing an AST (Abstract Syntax Tree). It's more like a hierarchical token tree coming out of a lexer phase. A lexer identifies the lexical elements.
The internalized source code now makes it easy to calculate with code, because the usual functions to manipulate lists can be applied.
Simple code manipulation with list functions
Let's look at the invalid Lisp code:
(3 + 4)
The program
(defun convert (code)
  (list (second code) (first code) (third code)))
(convert '(3 + 4)) -> (+ 3 4)
has converted an infix expression into the valid Lisp prefix expression. We can evaluate it then.
(eval (convert '(3 + 4))) -> 7
EVAL evaluates the converted source code. eval takes as input an s-expression, here a list (+ 3 4).
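For readers more at home in Python, the same two steps can be imitated with nested lists standing in for s-expressions. This is only an analogy: the functions convert and evaluate and the OPS table are hypothetical, and Python itself does not represent its source this way.

```python
import operator

# Hypothetical operator table for the toy evaluator.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def convert(code):
    """Turn an infix triple [a, op, b] into the prefix form [op, a, b]."""
    first, second, third = code
    return [second, first, third]


def evaluate(expr):
    """Evaluate a prefix list such as ['+', 3, 4]; anything else is a literal."""
    if not isinstance(expr, list):
        return expr
    op, *args = expr
    return OPS[op](*[evaluate(arg) for arg in args])
```

convert([3, "+", 4]) returns ["+", 3, 4], and evaluate applied to that result returns 7. The transformation is ordinary list manipulation, which is the point being made about Lisp.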
How to calculate with code?
Programming languages now have at least three choices to make source calculations possible:
1) base the source code transformations on string transformations
2) use a primitive data structure similar to Lisp's. A more complex variant of this is a syntax based on XML. One could then transform XML expressions. There are other possible external formats combined with internalized data.
3) use a real syntax description format and represent the source code internalized as a syntax tree using data structures that represent syntactic categories. -> use an AST.
For all these approaches you will find programming languages. Lisp is more or less in camp 2. The consequence: it is theoretically not really satisfying and makes it impossible to statically parse source code (if the code transformations are based on arbitrary Lisp functions). The Lisp community has struggled with this for decades (see for example the myriad of approaches that the Scheme community has tried). Fortunately it is relatively easy to use compared to some of the alternatives, and quite powerful. Variant 1 is less elegant. Variant 3 leads to a lot of complexity in simple AND complex transformations. It usually also means that the expression was already parsed with respect to a specific language grammar.
Another problem is HOW to transform the code. One approach would be based on transformation rules (like in some Scheme macro variants). Another approach would be a special transformation language (like a template language which can do arbitrary computations). The Lisp approach is to use Lisp itself. That makes it possible to write arbitrary transformations using the full Lisp language. In Lisp there is not a separate parsing stage, but at any time expressions can be read, transformed and evaluated - because these functions are available to the user.
Lisp is kind of a local maximum of simplicity for code transformations.
Other frontend syntax
Also note that the function read reads s-expressions to internal data. In Lisp one could either use a different reader for a different external syntax or reuse the Lisp built-in reader and reprogram it using the read macro mechanism - this mechanism makes it possible to extend or change the s-expression syntax. There are examples for both approaches to provide a different external syntax in Lisp.
For example there are Lisp variants which have a more conventional syntax, where code gets parsed into s-expressions.
Why is the s-expression-based syntax popular among Lisp programmers?
The current Lisp syntax is popular among Lisp programmers for two reasons:
1) the "data is code is data" idea makes it easy to write all kinds of code transformations based on the internalized data. There is also a relatively direct way from reading code, over manipulating code, to printing code. The usual development tools can be used.
2) the text editor can be programmed in a straightforward way to manipulate s-expressions. That makes basic code and data transformations in the editor relatively easy.
Originally Lisp was intended to have a different, more conventional syntax. There were several later attempts to switch to other syntax variants, but for various reasons they either failed or spawned different languages.
Absolutely. It's just a couple orders of magnitude more complex, if you have to deal with a complex grammar. As Peter Norvig noted:
Python does have access to the abstract syntax tree of programs, but this is not for the faint of heart. On the plus side, the modules are easy to understand, and with five minutes and five lines of code I was able to get this:
>>> parse("2 + 2")
['eval_input', ['testlist', ['test', ['and_test', ['not_test', ['comparison',
['expr', ['xor_expr', ['and_expr', ['shift_expr', ['arith_expr', ['term',
['factor', ['power', ['atom', [2, '2']]]]], [14, '+'], ['term', ['factor',
['power', ['atom', [2, '2']]]]]]]]]]]]]]], [4, ''], [0, '']]
This was rather a disappointment to me. The Lisp parse of the equivalent expression is (+ 2 2). It seems that only a real expert would want to manipulate Python parse trees, whereas Lisp parse trees are simple for anyone to use. It is still possible to create something similar to macros in Python by concatenating strings, but it is not integrated with the rest of the language, and so in practice is not done.
Since I'm not a super-genius (or even a Peter Norvig), I'll stick with (+ 2 2).
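For comparison, current Python also offers the higher-level ast module, whose trees are much shallower than the concrete parse tree quoted above. A minimal sketch of a macro-like rewrite with it follows; AddToMul and rewrite_and_eval are hypothetical names, and turning + into * is just an arbitrary transformation chosen for illustration.

```python
import ast


class AddToMul(ast.NodeTransformer):
    """Rewrite every a + b into a * b."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite nested operations first
        if isinstance(node.op, ast.Add):
            node.op = ast.Mult()
        return node


def rewrite_and_eval(source):
    tree = ast.parse(source, mode="eval")
    tree = ast.fix_missing_locations(AddToMul().visit(tree))
    return eval(compile(tree, "<rewritten>", "eval"))
```

rewrite_and_eval("2 + 3") evaluates 2 * 3 and returns 6. It works, but it still takes a class, a visitor method and a compile step where Lisp needs one list operation, which supports the point of the quote.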
Here's a shorter version of Rainer's answer:
In order to have lisp-style macros, you need a way of representing source-code in data structures. In most languages, the only "source code data structure" is a string, which doesn't have nearly enough structure to allow you to do real macros on. Some languages offer a real data structure, but it's too complex, like Python, so that writing real macros is stupidly complicated and not really worth it.
Lisp's lists and parentheses hit the sweet spot in the middle. Just enough structure to make it easy to handle, but not too much so you drown in complexity. As a bonus, when you nest lists you get a tree, which happens to be precisely the structure that programming languages naturally adopt (nearly all programming languages are first parsed into an "abstract syntax tree", or AST, before being actually interpreted/compiled).
Basically, programming Lisp is writing an AST directly, rather than writing some other language that then gets turned into an AST by the computer. You could possibly forgo the parens, but you'd just need some other way to group things into a list/tree. You probably wouldn't gain much from doing so.
Parentheses are irrelevant to macros. It's just Lisp's way of doing things.
For example, Prolog has a very powerful macro mechanism called "term expansion". Basically, whenever Prolog reads a term T, it tries a special rule term_expansion(T, R). If it is successful, the content of R is interpreted instead of T.
Not to mention the Dylan language, which has a pretty powerful syntactic macro system, which features (among other things) referential transparency, while being an infix (Algol-style) language.
Yes. Parentheses in Lisp are used in the classic way, as a grouping mechanism. Indentation is an alternative way to express groups. E.g. the following structures are equivalent:
A ((B C) D)
and
A
B
C
D
Have a look at Sweet-expressions. Wheeler makes a very good case that the reason things like infix notation have not worked before is that typical notation also tries to add precedence, which then adds complexity, which causes difficulties in writing macros.
For this reason, he proposes infix syntax like {1 + 2 + 3} and {1 + {2 * 3}} (note the spaces between symbols), which are translated to (+ 1 2 3) and (+ 1 (* 2 3)) respectively. He adds that if someone writes {1 + 2 * 3}, it should become (nfx 1 + 2 * 3), which could be captured if you really want to provide precedence, but would, by default, be an error.
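The translation rule is small enough to sketch in Python, with lists standing in for the braces. This is a simplified reading of the proposal, not the reference implementation: curly_infix is a hypothetical name, and only the two cases described here are handled (one operator repeated throughout, or a fallback to nfx).

```python
def curly_infix(items):
    """Translate an infix list into prefix form, or tag it with 'nfx' when operators are mixed."""
    # Translate nested braces first.
    items = [curly_infix(item) if isinstance(item, list) else item for item in items]
    operands = items[0::2]
    operators = items[1::2]
    # Simple case: odd length, at least three items, and one operator used throughout.
    if len(items) >= 3 and len(items) % 2 == 1 and len(set(operators)) == 1:
        return [operators[0]] + operands
    # Mixed operators: no precedence is guessed, the decision is handed to 'nfx'.
    return ["nfx"] + items
```

So curly_infix([1, "+", 2, "+", 3]) gives ["+", 1, 2, 3], curly_infix([1, "+", [2, "*", 3]]) gives ["+", 1, ["*", 2, 3]], and curly_infix([1, "+", 2, "*", 3]) gives ["nfx", 1, "+", 2, "*", 3]. Because no precedence table is involved, the result is always predictable from the written form, which is what keeps macros manageable.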
He also suggests that indentation should be significant, proposes that functions could be called as fn(A B C) as well as (fn A B C), would like data[A] to translate to (bracketaccess data A), and that the entire system should be compatible with s-expressions.
Overall, it's an interesting set of proposals I'd like to experiment with extensively. (But don't tell anyone at comp.lang.lisp: they'll burn you at the stake for your curiosity :-).
Code rewriting in Tcl in a manner recognizably similar to Lisp macros is a common technique. For example, this is (trivial) code that makes it easier to write procedures that always import a certain set of global variables:
proc gproc {name arguments body} {
    set realbody "global foo bar boo;$body"
    uplevel 1 [list proc $name $arguments $realbody]
}
With that, all procedures declared with gproc xyz rather than proc xyz will have access to the foo, bar and boo globals. The whole key is that uplevel takes a command and evaluates it in the caller's context, and list is (among other things) an ideal constructor for substitution-safe code fragments.
Erlang's parse transforms are similar in power to Lisp macros, though they are much trickier to write and use (they are applied to the entire source file, rather than being invoked on demand).
Lisp itself had a brief dalliance with non-parenthesised syntax in the form of M-expressions. It didn't take with the community, though variants of the idea found their way into modern Lisps, so you get Lisp's powerful macros without the parentheses ... in Lisp!
Yes, you can definitely have Lisp macros without all the parentheses.
Take a look at "sweet-expressions", which provides a set of additional abbreviations for traditional s-expressions. They add indentation, a way to do infix, and traditional function calls like f(x), but in a way that is backwards-compatible (you can freely mix well-formatted s-expressions and sweet-expressions), generic, and homoiconic.
Sweet-expressions were developed on http://readable.sourceforge.net and there is a sample implementation.
For Scheme there is a SRFI for sweet-expressions, SRFI-110: http://srfi.schemers.org/srfi-110/
No, it's not necessary. Anything that gives you some sort of access to a parse tree would be enough to allow you to manipulate the macro body in the same way as is done in Common Lisp. However, as the manipulation of the AST in Lisp is identical to the manipulation of lists (something that is bordering on easy in the Lisp family), it's possibly not nearly as natural without having the "parse tree" and "written form" be essentially the same.
I think this was not mentioned.
C++ templates are Turing-complete and perform processing at compile time.
There is the well-known expression templates mechanism that allows transformations, not of arbitrary code, but at least of the subset of C++ operators.
So imagine you have 3 vectors of 1000 elements and you must perform:
(A + B + C)[0]
You can capture this tree in an expression template and arbitrarily manipulate it at compile time.
With this tree, at compile time, you can transform the expression.
For example, if that expression means A[0] + B[0] + C[0] for your domain, you could avoid the normal C++ processing, which would be:
Add A and B, adding 1000 elements.
Create a temporary for the result, and add with the 1000 elements of C.
Index the result to get the first element.
and replace it with another transformed expression template tree that does:
Capture A[0]
Capture B[0]
Capture C[0]
Add all 3 results together in the result to return with += avoiding temporaries.
It is not better than lisp, I think, but it is still very powerful.
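The tree-capturing idea can be imitated in Python with operator overloading. Unlike C++ expression templates this happens at run time, not at compile time, so treat it purely as an analogy; the classes Expr, Vec and Sum are hypothetical.

```python
class Expr:
    def __add__(self, other):
        # Build a tree node instead of adding the vectors element by element.
        return Sum(self, other)


class Vec(Expr):
    def __init__(self, data):
        self.data = list(data)

    def __getitem__(self, index):
        return self.data[index]


class Sum(Expr):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __getitem__(self, index):
        # Only the requested element is computed; no temporary vector is created.
        return self.left[index] + self.right[index]


A = Vec(range(1000))
B = Vec(range(1000, 2000))
C = Vec(range(2000, 3000))
first = (A + B + C)[0]
```

A + B + C builds Sum(Sum(A, B), C) without touching any element, and indexing that tree with [0] performs just the three lookups and two additions needed for A[0] + B[0] + C[0].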
Yes, it is certainly possible. Especially if it is still a Lisp under the bonnet:
http://www.meta-alternative.net/pfront.pdf
http://www.meta-alternative.net/pfdoc.pdf
Boo has a nice "quoted" macro syntax that uses [| |] as delimiters, and has certain substitutions which are actually verified syntactically by the compiler pipeline using $variables. While simple and relatively painless to use, it's much more complicated to implement on the compiler side than s-expressions. Boo's solution may have a few limitations that haven't affected my own code. There's also an alternate syntax that reads more like ordinary OO code, but that falls into the "not for the faint of heart" category like dealing with Ruby or Python parse trees.
JavaScript's template strings offer yet another approach to this sort of thing. For instance, Mark S. Miller's quasiParserGenerator implements a grammar syntax for parsers.
Go ahead and enter the Elixir programming language.
Elixir is a functional programming language that feels like Lisp with respect to macros, but it wears Ruby's clothes and runs on top of the Erlang VM.
For those who do not like the parentheses but wish their language had powerful macros, Elixir is a great choice.
You can write macros in R (which has a more Algol-like syntax); it has a notion of delayed expressions like in Lisp macros. You can call substitute() or quote() to get the actual expression instead of evaluating it, and traverse its source code like in Lisp. Even the structure of the expression source code is like in Lisp: operators are the first item in the list. For example, input$foo, which gets property foo from the list input, is represented as the expression ['$', 'input', 'foo'], just like in Lisp.
You can check the ebook Metaprogramming in R, which also shows how to create macros in R (not something you would normally do, but it's possible). It's based on the 2001 article Programmer’s Niche: Macros in R, which explains how to write Lisp macros in R.

Resources