Get grammar from ANTLR4 post conversion? - antlr4

I understand that the way ANTLR4 works it does some conversions on your grammar for you (getting rid of ambiguity, left-factoring, etc) so that you can just focus on writing more human-readable grammars rather than doing conversions by hand so the machine will accept it. Is there any way to export my grammar after ANTLR makes these changes? I'd like to see what changes were made to my grammar.

Related

Create a C and C++ preprocessor using ANTLR

I want to create a tool that can analyze C and C++ code and detect unwanted behaviors, based on a config file. I thought about using ANTLR for this task, as I already created a simple compiler with it from scratch a few years ago (variables, condition, loops, and functions).
I grabbed C.g4 and CPP14.g4 from ANTLR grammars repository. However, I came to notice that they don't support the pre-processing parsing, as that's a different step in the compilation.
I tried to find a grammar that does the pre-processing part (updated to ANTLR4) with no luck. Moreover, I also understood that if I'll go with two-steps parsing I won't be able to retain the original locations of each character, as I'd already modified the input stream.
I wonder if there's a good ANTLR grammar or program (preferably Python, but can deal with other languages as well) that can help me to pre-process the C code. I also thought about using gcc -E, but then I won't be able to inspect the macro definitions (for example, I want to warn if a user used a #pragma GCC (some students at my university, for which I write this program to, used this to bypass some of the course coding style restrictions). Moreover, gcc -E will include library header contents, which I don't want to process.
My question is, therefore, if you can recommend me a grammar/program that I can use to pre-process C and C++ code. Alternatively, if you can guide me on how to create a grammar myself that'd be perfect. I was able to write the basic #define, #pragma etc. processings, but I'm unable to deal with conditions and with macro functions, as I'm unsure how to deal with them.
Thanks in advance!
This question is almost off-topic as it asks for an external resource. However, it also bears a part that deserves some attention.
The term "preprocessor" already indicates what the handling of macros etc. is about. The parser never sees the disabled parts of the input, which also means it can be anything, which might not be part of the actual language to parse. Hence a good approach for parsing C-like languages is to send the input through a preprocessor (which can be a specialized input stream) to strip out all preprocessing constructs, to resolve macros and remove disabled text. The parse position is not a problem, because you can push the current token position before you open a new input stream and restore that when you are done with it. Store reported errors together with your input stream stack. This way you keep the correct token positions. I have used exactly this approach in my Windows resource file parser.

What is the best way to test the output of a Grammar(is written in javascript)? How about the Mocha test?

I wrote a Grammar about traffic signs information (this information is in a text format) and used Antlr4 to generate a parser in JavaScript.
Over time, my Grammar is getting bigger and more complicated and I don't want to ruin my previous codes. How I can make sure that my Grammar is still well-defined and covered the cases which I worked on them.
I was thinking of using Mocha test to test the output of the Grammar which is an JSON object. But I am wondering if there is another better approach!
Thanks in advance.

Verify that a grammar is LL(1)

I want to verify that my ANTLR 4 grammar is LL(1). There is an option to do just that in older versions of ANTLR. Is there something similar in ANTLR 4?
I looked through through the documentation, but didn't find anything. Though especially the page on options seems to be lacking, I didn't even find a list of all possible options.
One of the design goals of ANTLR 4 is allowing language designers to focus on writing accurate grammars rather than worrying about characteristics like "LL(1)" which have little to no impact on users of the language.
However, it is likely that you can identify an LL(1) grammar by examining the generated parser. If there are no calls to adaptivePredict in the generated code, then the grammar is LL(1). The intent is for the inverse to also be true, but considering a call to adaptivePredict produces the same result as the inline version of an LL(1) decision, we have not rigorously evaluated this.

What do I need to learn to build an interpreter?

For my AQA A2-level Computing project, I've decided to create a basic interpreted programming language, outputting to Console. I don't know how to build an interpreter. I have a copy of the purple dragon book, which is all about compiler design, as user166390 said on an answer to this question that the initial steps to building a compiler are the same to build an interpreter. My question is: is this true?
Can I use the techniques described in the dragon book to write an interpreter? And if so, which steps do I need to use and learn how to use?
Do I need to write a lexical analyser, a syntax analyser, a semantic analyser and an intermediate code generator, for example?
Could I get away with writing a basic parser that reads each line of the source code, parses it, and executes the instruction straight away, or is that a notoriously bad idea?
Yes, you can use the techniques described in the dragon book to write an interpreter.
You need a lexical analyzer and a parser regardless.
As others have pointed out, you do need to write the code to do actual execution -- but for a simple interpreter, this can be essentially the same as the syntax-directed translation described in the dragon book.
Everything else is optional.
If you want to skip straight from the parser to execution, you can. That will leave you with a very simple language, which can be both good and bad -- look at Tcl for an example of such a language.
If you want to interpret each line as you parse it, you can do that, too; this is what most command-line interpreters (Unix shell scripts, Microsoft's cmd.com and PowerShell) do, as well as interactive "REPL's" (Read-Eval-Print-Loops) for languages like Python and Ruby.
"Semantic analyzer" seems vague to me, but sounds like it should include most kinds of load-time consistency checks. This is also optional, but there are advantages in an interpreter that won't take any old garbage and try to execute it as a program...
"Intermediate code" is also kind of vague, but it is arguably optional. If you aren't executing directly from the program string (as in Tcl), you need some kind of internal representation to store your code once you've read it in. One popular option is to execute from an internal tree structure, based more or less closely on your parse tree, which is arguably distinct from producing "intermediate code". On the other hand, if your "intermediate code" could be written out more or less directly from your internal tree structure, you might as well count the internal structure as your "intermediate code".
There are important issues that you haven't addressed; one that stands out is: how do you want to handle names? Presumably you will want the programmer to be able to define and use his own names (e.g., for variables, functions, and so forth), so you will need to implement some kind of mechanism for that.
Exactly how names are handled is a big design decision, with major implications for the usability and implementability of your language. The simplest option for implementation is to use a single, global hash map to implement a single, global namespace -- but note that this choice has well-known usability problems...
Could I get away with writing a basic parser that reads source code and executes the steps straight away?
You could but you'd be doing it the hard way.
Do I need to write a lexical analyser, a syntax analyser, a semantic analyser and an intermediate code generator, for example?
You can skip intermediate code generation except if you want to write a VM-based interpreter. Perl for example, used to execute its parse graph directly; this is in contrast with Java or Python, which produces intermediate byte code.
The interpreter part of a VM-based language is generally simpler than the interpreter that have to understand a parse graph (so each component in the system is simpler), however the complexity of the whole interpreter stack is generally simpler when you don't need to define an intermediate bytecode language. So pick your poison.

How can I build an AST using ANTLR4? [duplicate]

This question already has answers here:
How to create AST with ANTLR4?
(4 answers)
Closed 5 years ago.
I have an ANTLR3 grammar that builds an abstract syntax tree. I'm looking into upgrading to ANTLR4. However, it appears that ANTLR4 only builds parse trees and not abstract syntax trees. For example, the output=AST option is no longer recognized. Furthermore neither "AST" nor "abstract syntax" appears in the text of "The Definitive ANTLR4 reference".
I'm wondering if I'm missing something.
My application currently knows how to crawl over the AST produced by ANTLR3. Changing it to process a parse tree isn't impossible but it will be a bit of work. I want to be sure it's necessary before I start down that road.
ANTLR 4 produces parse trees based on the grammar instead of ASTs based on arbitrary AST operators and/or rewrite rules. This allows ANTLR 4 to automatically produce listener and visitor interfaces that you can implement in the code using your grammar.
The change can be dramatic for users upgrading existing applications from version 3, but as a whole the new system is much easier to use and (especially) maintain.

Resources