antlr4: Ignore superfluous tokens when creating ParseTrees


I am developing a compiler for the real-time language PEARL with ANTLR4.
With ANTLR4, my ParseTree is populated with superfluous tokens, e.g. semicolons that end a
grammatical unit.
Is there a way to tell ANTLR to ignore these kinds of tokens?

"Is there a way to tell ANTLR to ignore these kinds of tokens?"
No, but if you use ANTLR4's built-in listener or visitor, there is no need to remove these tokens.
See: "skip" changes parser behavior
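As a rough illustration of why the extra tokens are harmless: a visitor only acts on the nodes it chooses to handle, so punctuation leaves can simply be passed over. The sketch below does not use the ANTLR runtime; `Rule`, `Token`, and the `PUNCTUATION` set are hypothetical stand-ins invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Token:
    """Leaf node, standing in for a terminal in a parse tree (hypothetical)."""
    text: str


@dataclass
class Rule:
    """Inner node, standing in for a parser rule context (hypothetical)."""
    name: str
    children: List[Union["Rule", Token]] = field(default_factory=list)


# Assumed set of tokens the application considers superfluous.
PUNCTUATION = {";", ",", "(", ")"}


def collect_meaningful(node):
    """Return leaf texts in source order, skipping punctuation leaves."""
    if isinstance(node, Token):
        return [] if node.text in PUNCTUATION else [node.text]
    result = []
    for child in node.children:
        result.extend(collect_meaningful(child))
    return result


# The tree still contains the ';' leaf; the walk just never reports it.
tree = Rule("statement", [
    Rule("assignment", [Token("x"), Token(":="), Token("1")]),
    Token(";"),
])
print(collect_meaningful(tree))
```

The tree itself is left untouched; only the code that consumes it decides which leaves matter.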

Related

How to implement a multi character auto-complete trigger in language server

I'm building a language server for a custom DSL using the Pygls library.
Member expressions use . in the language, as is common in other languages, e.g. datum.deadline. I've made auto-completions work with this trigger character.
However, another operator, used to access static methods in the language, is the double colon ::, e.g. Date::new(). From what I read in the ls-protocol specification, it seems multi-character triggers are not supported (yet). Is there perhaps another way to make auto-completions possible with multi-character triggers? Perhaps through a textDocument/didChange event?
Thanks.

Best way to implement lexer with full Unicode support

I was wondering what the best way to implement a lexer with full Unicode support is. The traditional (f)lex approach is to use a 2D-array-based transition table, but it would consume way too much memory with full Unicode support. What is the best way to implement this? Solutions for any language are fine, but I would prefer Java.

Create a C and C++ preprocessor using ANTLR

I want to create a tool that can analyze C and C++ code and detect unwanted behaviors, based on a config file. I thought about using ANTLR for this task, as I already created a simple compiler with it from scratch a few years ago (variables, condition, loops, and functions).
I grabbed C.g4 and CPP14.g4 from the ANTLR grammars repository. However, I came to notice that they don't support parsing preprocessor directives, as that's a different step in the compilation.
I tried to find a grammar that does the pre-processing part (updated to ANTLR4) with no luck. Moreover, I also understood that if I go with two-step parsing I won't be able to retain the original locations of each character, as I'd have already modified the input stream.
I wonder if there's a good ANTLR grammar or program (preferably Python, but I can deal with other languages as well) that can help me to pre-process the C code. I also thought about using gcc -E, but then I won't be able to inspect the macro definitions. For example, I want to warn if a user used a #pragma GCC (some students at my university, for whom I am writing this program, used this to bypass some of the course coding style restrictions). Moreover, gcc -E will include library header contents, which I don't want to process.
My question is, therefore, whether you can recommend a grammar/program that I can use to pre-process C and C++ code. Alternatively, if you can guide me on how to create a grammar myself, that'd be perfect. I was able to write the basic handling of #define, #pragma etc., but I'm unsure how to deal with conditions and with function-like macros.
Thanks in advance!

This question is almost off-topic as it asks for an external resource. However, it also bears a part that deserves some attention.
The term "preprocessor" already indicates what the handling of macros etc. is about. The parser never sees the disabled parts of the input, which also means that text can be anything, including content that is not part of the language being parsed. Hence a good approach for parsing C-like languages is to send the input through a preprocessor (which can be a specialized input stream) to strip out all preprocessing constructs, resolve macros, and remove disabled text. The parse position is not a problem, because you can push the current token position before you open a new input stream and restore it when you are done with that stream. Store reported errors together with your input stream stack. This way you keep the correct token positions. I have used exactly this approach in my Windows resource file parser.
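A minimal sketch of the position-stack idea, not the answerer's actual implementation: it handles only quoted `#include "name"` directives from an in-memory dictionary, with no macros, conditionals, or include-cycle detection. The function name `preprocess` and the file names are made up for the example.

```python
def preprocess(name, sources):
    """Expand #include "file" directives from an in-memory dict of sources.

    Every emitted line is tagged with the file name and 1-based line number
    it originally came from, so later error messages can point at the
    real location.
    """
    output = []
    # Each frame is (file name, index of the next line to read in that file).
    stack = [(name, 0)]
    while stack:
        current, index = stack.pop()
        lines = sources[current].splitlines()
        while index < len(lines):
            line = lines[index]
            index += 1
            stripped = line.strip()
            if stripped.startswith("#include"):
                target = stripped.split('"')[1]
                # Push the resume position first, then the new input stream,
                # so the included file is processed before we continue here.
                stack.append((current, index))
                stack.append((target, 0))
                break
            output.append((current, index, line))
    return output


sources = {
    "main.c": '#include "defs.h"\nint y = x;\n',
    "defs.h": "int x;\n",
}
for origin, line_no, text in preprocess("main.c", sources):
    print(f"{origin}:{line_no}: {text}")
```

Because each line carries its origin, a parser fed with the expanded text can still report `main.c:2` for the second line of the main file.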

Verify that a grammar is LL(1)

I want to verify that my ANTLR 4 grammar is LL(1). There is an option to do just that in older versions of ANTLR. Is there something similar in ANTLR 4?
I looked through the documentation, but didn't find anything. The page on options in particular seems to be lacking; I didn't even find a list of all possible options.

One of the design goals of ANTLR 4 is allowing language designers to focus on writing accurate grammars rather than worrying about characteristics like "LL(1)" which have little to no impact on users of the language.
However, it is likely that you can identify an LL(1) grammar by examining the generated parser. If there are no calls to adaptivePredict in the generated code, then the grammar is LL(1). The intent is for the inverse to also be true, but considering a call to adaptivePredict produces the same result as the inline version of an LL(1) decision, we have not rigorously evaluated this.
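The check described above can be automated with a plain text search over the generated parser source. This is a rough sketch: the helper name `adaptive_predict_lines` and the sample strings are invented for illustration, and a match inside a comment or string literal would be counted as well.

```python
import re

# Matches a call such as "adaptivePredict(" anywhere on a line.
ADAPTIVE_CALL = re.compile(r"\badaptivePredict\s*\(")


def adaptive_predict_lines(generated_source):
    """Return the 1-based line numbers that contain an adaptivePredict call."""
    return [
        number
        for number, line in enumerate(generated_source.splitlines(), start=1)
        if ADAPTIVE_CALL.search(line)
    ]


# Made-up snippets standing in for generated parser code.
with_prediction = (
    "enterOuterAlt(_localctx, 1);\n"
    "switch ( getInterpreter().adaptivePredict(_input,3,_ctx) ) {\n"
    "}\n"
)
without_prediction = "switch (_input.LA(1)) {\ncase ID:\n}\n"

print(adaptive_predict_lines(with_prediction))
print(adaptive_predict_lines(without_prediction))
```

In practice the text would come from the generated parser file, for example via `pathlib.Path(...).read_text()`. An empty result suggests that every decision was inlined as LL(1), subject to the caveat in the answer that the inverse has not been rigorously evaluated.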

Get grammar from ANTLR4 post conversion?

I understand that the way ANTLR4 works it does some conversions on your grammar for you (getting rid of ambiguity, left-factoring, etc) so that you can just focus on writing more human-readable grammars rather than doing conversions by hand so the machine will accept it. Is there any way to export my grammar after ANTLR makes these changes? I'd like to see what changes were made to my grammar.
